Remove Duplicates from a FASTA File

The UCSC Table Browser can export 3' UTR sequences, which are needed when searching for microRNA target genes. However, the output files have the following format:

                            >hg19_refGene_NM_001184906 range=chr17:37408897-37417712 5'pad=0 3'pad=0 strand=- repeatMasking=none
                            CAATGGAGGTGGTCAACCTTGGCGAACTGAGTATTTAATGACACTTCTAG
                            AGCTACCGTGGAGTCTCTCCAGTGGAAGCAACCCCAGTGTTCTGAGCAAG
                        

The sequence name 'hg19_refGene_NM_001184906' would not be recognized by downstream analysis programs (for example, functional enrichment analysis). This parser lets you remove or substitute parts of the sequence name in order to recover the transcript name (NM_001184906 in this case).

Entering 'hg19_refGene_' in the text box makes the parser remove this string from the sequence name, leaving only the transcript name. The parser also removes duplicated IDs and can optionally remove duplicated sequences.
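
The behaviour described above can be approximated with a short Python sketch. This is not the tool's own code: the function name clean_fasta and its parameters strip_text and unique_sequences are hypothetical, and the sketch assumes that the ID is the first whitespace-delimited token of the header line, that the first occurrence of a duplicate is the one kept, and that sequences are compared case-insensitively.

```python
def clean_fasta(lines, strip_text="hg19_refGene_", unique_sequences=False):
    """Return (id, sequence) pairs with strip_text removed from each ID.

    Hypothetical helper: duplicated IDs are dropped (first one wins) and,
    if unique_sequences is True, duplicated sequences are dropped as well.
    """
    records = []              # parsed (id, sequence) pairs in input order
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: the ID is the first token after '>'
            tokens = line[1:].split()
            name = tokens[0].replace(strip_text, "") if tokens else ""
            chunks = []
        elif name is not None:
            chunks.append(line)   # sequence may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))

    seen_ids, seen_seqs, kept = set(), set(), []
    for name, seq in records:
        if name in seen_ids:
            continue              # duplicated ID
        if unique_sequences and seq.upper() in seen_seqs:
            continue              # duplicated sequence (case-insensitive)
        seen_ids.add(name)
        seen_seqs.add(seq.upper())
        kept.append((name, seq))
    return kept


if __name__ == "__main__":
    example = [
        ">hg19_refGene_NM_001184906 range=chr17:37408897-37417712 strand=-",
        "CAATGGAGGTGGTCAACCTTGGCGAACTGAGTATTTAATGACACTTCTAG",
        "AGCTACCGTGGAGTCTCTCCAGTGGAAGCAACCCCAGTGTTCTGAGCAAG",
    ]
    for seq_id, seq in clean_fasta(example):
        print(">" + seq_id)
        print(seq)
```

Keeping the first occurrence of a duplicate is only one possible policy; the actual parser may choose differently, for instance by keeping the longest sequence.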