BHLog

bioinformatics and haskell

14  08 2008

BioHaskell tutorial, part 1: Sequence data

The Bioinformatics library is slowly but steadily growing, acquiring new features and improvements. Here’s a small walk-through, showing off some of the most central functionality. I believe this is simple enough to be read without any prior knowledge in biology, and with only rudimentary knowledge of Haskell.

Acquire and Install

The BioLib page should contain the necessary instructions. If you have any problems, be sure to email me about them!

Reading biological sequences

Biological sequence data are usually distributed in the FASTA-format. This is a really simple format, consisting of a header line starting with ‘>’, and subsequent lines of actual sequence data. Usually, multiple sequences are concatenated in one file, this is sometimes referred to as multi-FASTA. The sequence data can either be nucleotide (DNA or RNA) or amino acids (proteins), using different alphabets. Here’s an example nucleotide sequence:


>gi|171313982|gb|FE940675.1|FE940675 AHAI-aaa38f01.b2 Pristionchus_pacificus_AHAI1_EST_L1 Pristionchus pacificus cDNA similar to gb|AAH50439.1| C13orf9 protein [Homo sapiens], mRNA sequence
AAAAGTTGGCTGGATCGGATACTCCCAGTGAGAGAAGATACGATTTGAAAGATATAGTCTGAACAACTTT
TATTAAACAAATTCTTTGGTGAAGAAATATTGTCTAAATTATAATGAACTAGCCTCGTCCTGAGTGATAT
CCCCTTGTCTTGCTTTCATTTTTTCAGCGATACTCTTGGATAGAGTGACCATCTCCTTTGCAAATTCCAT
AAAACGAGACATATCCTCGAATGCCTGTGAAATGTCCATTGGAACTTACTTCACGAAACAACTAGATTTC
AAACTGACTTTGTCCGTAAGAGGGTCAAAATTTATATAAAGTTACCTGACTGATGCTTTCATGAGTTCTG
ATATGATTTTCTGCTAATTTCCGTTCAATTCCAGC

Note that the header line has an internal structure containing various information: accession numbers (database indices), species, and identificaiton of the corresponding protein. Unfortunately, the format will vary from organisation to organisation, and for generated sequences, from program to program. Additionally, it is frequently wrong. However, the first word of the sequence is supposed to be a unique identifier, I’ll call it the sequence label. Let’s store this sequence in a file, “test.fasta”, say, and see what we can do with it:


% ghci
GHCi, version 6.8.2: http://www.haskell.org/ghc/ :? for help
Loading package base ... linking ... done.
Prelude> :m + Bio.Sequence
Prelude Bio.Sequence> s <- readFasta "test.fasta"
Loading package array-0.1.0.0 ... linking ... done.
:
Loading package bio-0.3.4 ... linking ... done.

This loads the library and it’s dependencies, and you should now have the file contents stored in the variable ’s’. (If you also get a lot of contents dumped to screen, you can turn it off with the command “:set -fno-print-bind-result“)


Prelude Bio.Sequence> :t s
s :: [Sequence]

Reading a FASTA file gives us a list of sequences. That makes sense, I hope? We can check the number of sequences using the standard list operations:

Prelude Bio.Sequence> length s
1

Unsurprisingly, we’ve read one sequence from a file containing one sequence. See, I told you this was simple.


Prelude Bio.Sequence> :i Sequence
data Sequence = Seq !SeqData !SeqData !(Maybe QualData)
-- Defined in Bio.Sequence.SeqData
instance Eq Sequence -- Defined in Bio.Sequence.SeqData
instance Show Sequence -- Defined in Bio.Sequence.SeqData

This reveals a few things: First, we use the term Sequence, which unfortunately means quite different things to the functional programmer and the biologist. We stick with the latter. Second, there’s a sub-module called Bio.Sequence.SeqData that contains these definitions. Bio.Sequence that we imported just re-exports functionality from sub-modules in a convenient package. Also, the SeqData is just a synonym for Data.ByteString.Lazy.Char8, and QualData for Data.ByteString.Lazy. The derived Show instance for a Sequence makes it spill its guts, and it’s not pretty:

Prelude Bio.Sequence> head s
Seq (Chunk "gi|171313982|gb|FE940675.1|FE940675 AHAI-aaa38f01.b2 Pristionchus_pacificus_AHAI1_EST_L1 Pristionchus pacificus cDNA similar to gb|AAH50439.1| C13orf9 protein [Homo sapiens], mRNA sequence" Empty) (Chunk "AAAAGTTGGCTGGATCGGATACTCCCAGTGAGAGAAGATACGATTTGAAAGATATAGTCTGAACAACTTT" (Chunk "TATTAAACAAATTCTTTGGTGAAGAAATATTGTCTAAATTATAATGAACTAGCCTCGTCCTGAGTGATAT" (Chunk "CCCCTTGTCTTGCTTTCATTTTTTCAGCGATACTCTTGGATAGAGTGACCATCTCCTTTGCAAATTCCAT" (Chunk "AAAACGAGACATATCCTCGAATGCCTGTGAAATGTCCATTGGAACTTACTTCACGAAACAACTAGATTTC" (Chunk "AAACTGACTTTGTCCGTAAGAGGGTCAAAATTTATATAAAGTTACCTGACTGATGCTTTCATGAGTTCTG" (Chunk "ATATGATTTTCTGCTAATTTCCGTTCAATTCCAGC" Empty)))))) Nothing

This shows the gory details, and hopefully, it should be clear to you that the “Show” instance is for debugging and internal use only. There’s no alternative convert-to-string function, the easiest way to fake it is to:

Prelude Bio.Sequence> :m + IO
Prelude Bio.Sequence IO> hWriteFasta stdout s
>gi|171313982|gb|FE940675.1|FE940675 AHAI-aaa38f01.b2 Pristionchus_pacificus_AHAI1_EST_L1 Pristionchus pacificus cDNA similar to gb|AAH50439.1| C13orf9 protein [Homo sapiens], mRNA sequence
AAAAGTTGGCTGGATCGGATACTCCCAGTGAGAGAAGATACGATTTGAAAGATATAGTCT
GAACAACTTTTATTAAACAAATTCTTTGGTGAAGAAATATTGTCTAAATTATAATGAACT
AGCCTCGTCCTGAGTGATATCCCCTTGTCTTGCTTTCATTTTTTCAGCGATACTCTTGGA
TAGAGTGACCATCTCCTTTGCAAATTCCATAAAACGAGACATATCCTCGAATGCCTGTGA
AATGTCCATTGGAACTTACTTCACGAAACAACTAGATTTCAAACTGACTTTGTCCGTAAG
AGGGTCAAAATTTATATAAAGTTACCTGACTGATGCTTTCATGAGTTCTGATATGATTTT
CTGCTAATTTCCGTTCAATTCCAGC

The sequence we’ve used so far, is a nucleotide sequence. Nucleotides, in addition to the tendency to string together, also have the ability to pair up. For DNA, an A will bind to a T, and a C will bind to a G. If two sequences match each other appropriately, they can match up along all or part of their length, forming double stranded DNA (or RNA). For this to happen, one sequence will have to be the reverse complement to the other. In many cases, we don’t really know which strand a fragment is from, and thus whether we want the sequence or its reverse complement. There is of course a function for transforming one into the other:
Prelude Bio.Sequence IO> let pr = hWriteFasta stdout
Prelude Bio.Sequence IO> Prelude Bio.Sequence IO> pr (map revcompl s)
>gi|171313982|gb|FE940675.1|FE940675 AHAI-aaa38f01.b2 Pristionchus_pacificus_AHAI1_EST_L1 Pristionchus pacificus cDNA similar to gb|AAH50439.1| C13orf9 protein [Homo sapiens], mRNA sequence
GCTGGAATTGAACGGAAATTAGCAGAAAATCATATCAGAACTCATGAAAGCATCAGTCAG
GTAACTTTATATAAATTTTGACCCTCTTACGGACAAAGTCAGTTTGAAATCTAGTTGTTT
CGTGAAGTAAGTTCCAATGGACATTTCACAGGCATTCGAGGATATGTCTCGTTTTATGGA
ATTTGCAAAGGAGATGGTCACTCTATCCAAGAGTATCGCTGAAAAAATGAAAGCAAGACA
AGGGGATATCACTCAGGACGAGGCTAGTTCATTATAATTTAGACAATATTTCTTCACCAA
AGAATTTGTTTAATAAAAGTTGTTCAGACTATATCTTTCAAATCGTATCTTCTCTCACTG
GGAGTATCCGATCCAGCCAACTTTT

Here, we see that the AAAA in the beginning of the sequence has been turned into TTTT (the complement) at the end (it’s been reversed). Simple, yes?

Excercise:

  1. Add a more sensible Show instance. In fact, I’d like FASTA output, but indented somewhat, and prefixed with coordinates. For extra credit, use darcs and submit a patch (use ‘darcs send’).

Protein sequences

The point (well, not the whole point, but let’s keep things simple) of nucleotide sequences is to function as blueprints for sequences of amino acids, commonly referred to as proteins. Cells contain ribosomes which are basically machines that read nucleotide sequences, three nucleotides at a time, and builds the corresponding protein by chaining up amino acids. This process is referred to as translation. Each triplet of nucleotides that is translated into an amino acid is referred to as a codon. For example, the codon GAT is translated to…but I don’t have to tell you, you can find out for yourself: Notice that our sequence has GAT at position 12-14, which we can confirm using the ‘(!)’ accessor function:


Prelude Bio.Sequence> map (head s!) [12..14]
"GAT"

we can new use the ‘translate’ function, which translates a sequence starting at a given offset:


Prelude Bio.Sequence> head $ translate (head s) 12
Asp

Et voilà: it is asparagine. The ‘translate’ function is a bit clunky for this use, and the reason is that generally you want to translate the whole sequence in one go. Since it is translated in triples, there are three possible translations in each direction, corresponding to offsets of 0, 1, and 2.  Each such translation is called a frame.  Protein sequences are stored in FASTA-formatted text files, just like nucleotide sequences, but the alphabet is different.

Excercises:

  1.  Write a function that given a nucleotide sequence, outputs all six translations.
  2. Some codons are so-called stop codons.  Rather than translate to an amino acid, these work as punctuation, terminating the translation.The Amino Acid data type calls them STP.  A segment of a reading frame containing no such stop codons is called an open reading frame, or ORF.  Write a function that given a nucleotide sequence, outputs the longest open reading frame, considering all six possible frames.
  3. The translations from codons to amino acids are stored in a table called ‘trans_tbl’.  This implements the most common translation scheme, used by eukaryotes (things with a nucleus, e.g. us), but there are others. Look up variations of this table specific to certain prokaryotes, mitochondria, and chloroplasts, add them to the export list, and provide a translation function taking a translation table as a parameter.  Make sure it works, then submit the patch.

(to be continued)

Leave a Reply

You must be logged in to post a comment.