All of this data comes at you in several formats, so becoming familiar with various format types helps you know how to interpret and store the data.
Where to find bioinformatics data
Bioinformatics combines information technology and molecular biology, so it makes sense that the internet is the main arena for pursuing bioinformatics information.
The following list offers links to helpful websites around the world and the areas that they specialize in.
- Ensembl: Human genome
- GenBank/DDBJ/EMBL: Nucleotide sequence
- PubMed: Literature references
- Swiss Institute of Bioinformatics: Annotated protein sequences
- InterProScan: Protein domains
- OMIM: Genetic diseases
- GenomeNet: Metabolic pathways
Websites for analyzing DNA/RNA sequences
The bioinformatics websites in the following list offer help in analyzing DNA and RNA sequences. And, in the marriage of information technology and molecular biology that is bioinformatics, this type of analysis is what it’s all about.
- Webcutter: Restriction map
- GenomeScan: Gene discovery
- blastn, tblastn, blastx: Database search
- The Genome Browser: Browse the ultimate data!
- Mfold: RNA structure prediction
Websites for analyzing protein sequences
With bioinformatics you can explore molecular biology using information technology. The links to the websites in the following list focus on protein sequences. Some offer searchable databases and others help you investigate a single protein; all are helpful.
- BLAST: Database homology search
- Entrez: Database search
- InterProScan: Find protein domains
- ExPASy: Analyze a protein
- ClustalW: Multiple sequence alignment
- T-Coffee: Evaluate multiple alignment
- Jalview: Multiple alignment editor
- PSIPRED: Secondary structure prediction
- Cn3D: Display and spin 3-D structures
Bioinformatics data formats
When you’re using the internet to help with your bioinformatics project, you come across data in all sorts of different formats. The following table can help you understand common bioinformatics formats and what you can and cannot do with them.
Format Name | Description |
---|---|
RAW | Sequence format that doesn’t contain any header. Spaces and numbers are usually tolerated. |
FASTA | This is the default format. Sequence format that contains a header line beginning with `>`, followed by the sequence on the next line or lines, for example `>name` and then `AGCTGTGTGGGTTGGTGGGTT`. |
PIR | Sequence format that’s similar to FASTA but less common. |
MSF | Multiple sequence alignment format |
CLUSTAL | Multiple sequence alignment format (works with T-Coffee) |
TXT | Text format |
GIF, JPEG, PNG, PDF | Graphic formats. Do not use them to store important information. |
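
To make the RAW and FASTA rows of the table concrete, here is a minimal Python sketch of how you might read them. The function names (`clean_raw`, `parse_fasta`, `to_fasta`) are hypothetical helpers invented for this illustration, not part of any bioinformatics library, and the 60-character line width is an assumption rather than a rule of the format. The sketch also assumes plain letter-only sequences, so it drops gap characters and other symbols along with spaces and numbers.

```python
def clean_raw(text):
    """Turn RAW sequence text into one uppercase string.

    Assumption: only letters are meaningful, so spaces, digits,
    line breaks, and any other symbols are dropped.
    """
    return "".join(ch for ch in text if ch.isalpha()).upper()


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records = []
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence lines are joined; stray spaces and digits are removed.
            chunks.append(clean_raw(line))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_fasta(name, sequence, width=60):
    """Write one record as FASTA text, wrapping the sequence at `width`."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines)


if __name__ == "__main__":
    raw = "1 agctgtgtgg gttggtgggt t 21"
    sequence = clean_raw(raw)
    fasta_text = to_fasta("name", sequence)
    print(fasta_text)
    print(parse_fasta(fasta_text))
```

Running the script converts a numbered, space-separated RAW sequence into a FASTA record and then reads it back. For real projects, an established parser is usually a safer choice than a hand-written one.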