Biohaskell

bioinformatics and haskell

14  05 2009

Using a phantom type to label different kinds of sequences

Until now (version 0.3.5) of the bioinformatics library, the Sequnce data type has essentially been a wrapper around a couple of strings, with only the most rudimentary and generic structure.  This has the advantage that you can easily work with different kinds of sequences without caring about the particulars, but of course, nothing stops you from comparing a nucleotide sequence to a protein letter by letter.  We’d like some more safety without sacrificing flexibility, and by using phantom types we can get this. Below is my attempt at implementing this.

The safety that we seek can be exemplified by sequence alignment or similarity scores, as calculated by e.g. the BLAST suite of programs.  Here we have e.g. blastn for comparing nucleotide sequences, blastp for comparing amino acid sequences, and blastx for comparing nucleotides to proteins by first translating the nucleotides to the possible corresponding amino acid sequences. We’ll aim for an align function that does the right thing for the various combination of sequence types.

The old approach

Previously, there was only one Sequence data type, defined as:

data Sequence = Seq SeqLabel SeqData (Maybe QualData)

(where SeqLabel, SeqData and QualData are represented as various ByteStrings, but you can think of them as synonyms for String)

As mentioned, this isn’t really leveraging the type system much, and at the very least, it’d be a good thing to be able to differentiate between nucleotide and  amino acid sequence (a.k.a. peptides or proteins).

The algebraic data type approach

The default functional programming approach would be to use an algebraic (sum) type, what the imperative programmers woudl call a union.

data Sequence = Nucleotide ... | Protein ...

This is a good solution when the two branches have different structure, but here you’d essentially just repeat the structure.  Moreover, all functions will need to do run-time checks for each case, imposing a cost both in function complexity, running time, and type safety.  Note that performing the alignment usually requires a score matrix describing the penalty for replacing any given character with any other.

align :: Matrix -> Sequence -> Sequence -> Alignment

align mx (Nucleotide ...) (Nucleotide ...) =
align mx (Nucleotide ...) (Protein ...) =
align mx (Protein ...) (Nucleotide ...) =
align mx (Protein ...) (Protein ...) =

This is not even complete, since a similar restriction applies to the score matrices – sometimes it will contain replacement penalties for the nucleotide alphabet, and sometimes it will contain penalties for the amino alphabet, and the appropriate matrix must be used in each brach of the align function.

Another problem with this is that this requires you to over-specify the type.  Sometimes you don’t know or care what kind of sequence you have.  Say you are selecting sequences by name from a Fasta file.  Since the file format is the same you don’t really care what kind of sequence it is, and forcing it to be one or the other is….immoral.

The third way: phantom types

The chosen approach is instead to tag the Sequence type with a phantom type parameter, phantom meaning it does not affect the actual representation of the data.  It looks like this:

data Sequence t = Seq ....  -- but no data member of type 't'!

Now, we can write our alignment functions, and make them safer to use:

align :: Matrix t -> Sequence t -> Sequence t -> Alignment
alignX :: Matrix Amino -> Sequence Nuc -> Sequence Amino -> Alignment

Note that we also phantom-typed the Matrix type.  With this approach,  incorrect usage like comparing sequences of different type with the generic align will be flagged by the compiler, so run time checks are no longer necessary.

On the other hand, readFasta can now be given the type:

readFasta :: Filepath -> IO (Sequence a)

or:

readFasta :: FilePath -> IO (Sequence Unknown)

depening on how much you trust the programmer.  Since I’m writing most of the programs using this library, I know how much you can trust application programmers, so the second and safer method is chosen.

 In practice

The old biolib repository has now been replaced by two new ones: biolib-stable, currently containing version 0.3.5, and biolib-unstable at 0.4.0. Perhaps unsurprisingly, the latter version contains the phantomly typed Sequence definition.

Currently, three types are used for tags:  Amino, which is the type for the amino acid alphabet,and Nuc and Unknown, which are dummy types without any data constructors, and used solely for this purpose.

So what are the experiences so far?  Well, it does complicate things.   For some file types the sequence type is known, for instance the output of nucleotide sequencing machines in the form of ABI, SCF, or SFF files.  But often, file formats are agnostic with respect to sequence types, the most ubiquitous offender being the Fasta format. Currently, this is handled by the Unknown type tag, but I’m not entirely convinced this is the optimal solution.

Code needs to be updated, but mostly this is relatively easy.  The type system will spot the difficulties, and often you can get by by just replacing Sequence in the type signatures with Sequence t.  Just make sure that t isn’t already used in the signature – I stumbled into this one.

Of course, this is one step towards increased leveraging of the type system.  One could go further, and  tag different type of nucleotide sequences based on sequencing technology: Sanger sequencing has quite different error characteristics from 454 sequencing, and Solexa has a completely different interpretation of the quality values than either of those.  Also, one might want to include the presence or absence of quality data in the type as well.  There are also different amino acid alphabets – should they have different types?  This is a difficult design space – how much information should be encoded in the type?

While I think this is an improvement, I’m very curious how this works out in practice, or if there are other options I should consider.  Please comment!


Leave a Reply

You must be logged in to post a comment.

« »