Working with FASTA files

Advanced FASTA Sequence Manipulation

GenBeans offers several tools for advanced manipulation of FASTA files.

Always save a copy of your original FASTA file before making advanced modifications.

Accessing Advanced FASTA tools

1. Open the FASTA file in Explorer.

2. Select the main FASTA sequence node in the Explorer.

3. Advanced FASTA functions are found in the main menu under the Tools item.

Order FASTA sequences functions

These two tools order sequences alphabetically by either the sequence name or the sequence accession number. If the identifier contains multiple delimiters and you do not know which field will be used as the name or the accession, use the Customize FASTA Header function to isolate them first. If a sequence with no name or no accession is encountered, the operation is aborted and the file remains unchanged.
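The ordering rule above can be sketched in Python. This is only an illustration under stated assumptions, not GenBeans' implementation: records are assumed to be (header, sequence) tuples, the header is assumed to be split on "|", and order_records and field_index are hypothetical names.

```python
def order_records(records, field_index, divider="|"):
    """Sort (header, sequence) records alphabetically on one header field.

    field_index is 0-based and selects the field holding the name or the
    accession (an assumption; the real tool picks the field itself).
    If any record lacks that field, the input is returned unchanged,
    mirroring the all-or-nothing behaviour described above.
    """
    keys = []
    for header, _sequence in records:
        fields = header.split(divider)
        if field_index >= len(fields) or not fields[field_index].strip():
            return list(records)  # abort: keep the original order
        keys.append(fields[field_index].strip())
    # sorted() is stable, so records with equal keys keep their relative order
    ordered = sorted(zip(keys, records), key=lambda pair: pair[0])
    return [record for _key, record in ordered]
```

With headers of the form sp|accession|name, field_index=1 would order by accession and field_index=2 by name.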

Customize FASTA Header function

The description line in FASTA format is often used to pack information about the database the sequences originate from. The following table shows a few examples of database identifiers found in FASTA definition lines (source: The NCBI Handbook):

	Database name	Identifier syntax
	GenBank	gi|gi-number|gb|accession|locus
	EMBL Data Library	gi|gi-number|emb|accession|locus
	DDBJ, DNA Database of Japan	gi|gi-number|dbj|accession|locus
	NBRF PIR	pir||entry
	Protein Research Foundation	prf||name
	SWISS-PROT	sp|accession|name
	Brookhaven Protein Data Bank	pdb|entry|chain
	Patents	pat|country|number
	General database identifier	gnl|database|identifier
	NCBI Reference Sequence	ref|accession|locus
	Local Sequence identifier	lcl|identifier

The FASTA Header Customizer remaps fields in the definition line. After you click the OK button, the definition line is split into fields at the input header divider, the fields are numbered according to their order of occurrence, and they are rearranged according to the new field mapping. Up to 10 fields can be remapped; they are joined using the output field divider.
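The remapping step can be illustrated with a short Python sketch. It is not the GenBeans code: remap_header is a hypothetical helper, field numbers are assumed to start at 1, and empty fields (as in pir||entry) are assumed to count as fields.

```python
MAX_FIELDS = 10  # the customizer remaps at most 10 fields


def remap_header(header, mapping, in_divider="|", out_divider="|"):
    """Rebuild a definition line from selected, reordered fields.

    header      -- definition line without the leading ">"
    mapping     -- 1-based field numbers in the desired output order
    in_divider  -- character that separates fields in the input
    out_divider -- string placed between fields in the output
    """
    if len(mapping) > MAX_FIELDS:
        raise ValueError("at most 10 fields can be remapped")
    fields = header.split(in_divider)
    for number in mapping:
        if not 1 <= number <= len(fields):
            raise ValueError(f"no field number {number} in {header!r}")
    return out_divider.join(fields[number - 1] for number in mapping)


# Made-up GenBank-style header: keep accession and locus, space-separated
print(remap_header("gi|12345|gb|AB000001|LOCUS1", [4, 5], "|", " "))
```

Here the header gi|12345|gb|AB000001|LOCUS1 is an invented example; mapping [4, 5] keeps only the accession and locus fields.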

Truncate FASTA sequences

This function removes nucleotides from each sequence, beginning at the start position and extending over the cut length. When it is operated from the 3' end, positions are numbered on the reverse complement sequence. The tool works on protein sequences as well.
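A minimal Python sketch of the truncation arithmetic follows. Several points are assumptions rather than documented GenBeans behaviour: the start position is taken as 1-based, the hypothetical from_three_prime flag counts positions from the last residue, and the result is returned in the original orientation.

```python
def truncate(sequence, start, cut_length, from_three_prime=False):
    """Remove cut_length residues beginning at the 1-based position start.

    With from_three_prime=True, position 1 is the last residue of the
    sequence; the retained residues are returned in their original order.
    """
    if start < 1 or cut_length < 0:
        raise ValueError("start must be >= 1 and cut_length >= 0")
    if from_three_prime:
        # Counting from the end is the same as cutting the reversed string
        return truncate(sequence[::-1], start, cut_length)[::-1]
    begin = start - 1
    return sequence[:begin] + sequence[begin + cut_length:]


print(truncate("ACGTACGT", 1, 3))        # cut from the 5' end
print(truncate("ACGTACGT", 1, 3, True))  # cut from the 3' end
```

Because the function only slices strings, it applies equally to protein sequences.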

Extract FASTA DNA sequences

This tool extracts DNA sequences using IUPAC DNA motifs.
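One common way to search with IUPAC motifs is to translate each ambiguity code into a regular-expression character class. The Python sketch below assumes that "extracting" means keeping the records whose sequence contains the motif; the function names are hypothetical and the code is not taken from GenBeans.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex fragments
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile an IUPAC DNA motif such as 'GAY' into a regex pattern."""
    try:
        return re.compile("".join(IUPAC_DNA[code] for code in motif.upper()))
    except KeyError as error:
        raise ValueError(f"not an IUPAC DNA code: {error.args[0]!r}") from None


def extract_matching(records, motif):
    """Keep the (header, sequence) records whose sequence contains the motif."""
    pattern = iupac_to_regex(motif)
    return [(h, s) for h, s in records if pattern.search(s.upper())]
```

For example, the motif GAY matches both GAC and GAT, because Y stands for C or T.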

Remove duplicate FASTA sequences

This tool deletes sequences with duplicate names.
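Duplicate removal can be sketched in Python as follows. Two assumptions are made that the text above does not confirm: the name is taken to be the first whitespace-delimited token of the definition line, and the first occurrence of each name is the one that is kept. The helper names are hypothetical.

```python
def first_token(header):
    """Return the first whitespace-delimited token of a definition line."""
    parts = header.split(None, 1)
    return parts[0] if parts else ""


def remove_duplicates(records, name_of=first_token):
    """Drop (header, sequence) records whose name was already seen."""
    seen = set()
    unique = []
    for header, sequence in records:
        name = name_of(header)
        if name in seen:
            continue  # later record with the same name: skip it
        seen.add(name)
        unique.append((header, sequence))
    return unique
```

Passing a different name_of function lets the same loop compare another header field instead.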

Remove selected FASTA sequences

This tool removes sequences whose definition line matches a given regular expression.
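As a rough illustration, the filter can be written in a few lines of Python. The sketch uses Python's regular-expression syntax, which may differ in detail from the syntax GenBeans accepts, and remove_selected is a hypothetical name.

```python
import re


def remove_selected(records, pattern):
    """Drop (header, sequence) records whose definition line matches pattern."""
    regex = re.compile(pattern)
    # search() matches anywhere in the line; anchor with ^ or $ if needed
    return [(h, s) for h, s in records if not regex.search(h)]
```

For instance, the pattern _MOUSE$ would remove every record whose definition line ends in _MOUSE.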
