The type system, safety, sequence alignments, and you
by Ketil; April 3, 2012
Haskell programmers will often make claims that type systems make our programs safer, and especially that careful use of the type system can improve safety further. The first claim tend to be met with counter-claims that the safety isn’t all that important, after all, if you inadvertently try to add a string to a number, tests will quickly catch this. However, static types go a bit beyond this, but it can be hard to communicate how to leverage the type system to somebody who doesn’t already have some hands-on experience with the issues involved.
So I’ll try to illustrate with an example I ran across.
In bioinformatics, sequence alignment (or edit distance) is part of our staple diet. For instance, an unknown protein can be aligned to known proteins to determine its function and structure. Sets of proteins from different species can be aligned against each other to determine phylogenetic lineage. And so on. So let’s look at alignments.
Definitions: an alignment of a pair of sequences is a mapping (f
, say) from positions in one sequence to positions in the other. In addition, we require two invariants, first that the mapping is one-to-one:
f x == f y <=> x == y
This makes the alignment invertible, so there exists a mapping f'
back from the second sequence to the first one. Makes sense, no? Second, the alignment is monotonic, satisfying:1
f x > f y <=> x > y
In other words, the positions make up subsequences of equal lengths. So you can skip bits in either sequence, but you’re not allowed to flip things around.
In order to model this in Haskell, we can define a position to simply be an Int
, and use a list of pairs of positions. So we can define a type alias for this:
type Alignment = [(Int,Int)]
Although this is very specific, it does feel a bit unsatisfactory from a typing perspective: the type doesn’t enforce either of the two invariants. E.g. it allows definitions like:
wrong1 = [(1,1),(2,1)] -- maps both a[1] and a[2] to b[1]
wrong2 = [(1,2),(2,1)] -- maps the wrong way around
It’s hard to come up with a good data structure that enforces the invariants (suggestions welcome!), but at least, it’s very concrete, and the compiler willl now protect you from accidentally substitute an alignments with a String
, or any other type. Right?
Again, if you’re a dynamic typing enthusiast, you might argue that this enforcement is not such a big deal, you’re not very likely to pass strings by accident here anyway – and I agree. In fact, I think it is much worse that all the Int
s are interchangeable; it is all too easy to, say, compare positions from different sequences, or performing arithmetic on them (what do you get when you multiply two positions?).
For example, we can align a sequence against another via a third sequence. Basically, given sequences A
, B
, and C
, we have alignments A
to B
and B
to C
, and would like to derive the alignment of A
to C
. Now this isn’t too complicated:
trans_align :: Alignment -> Alignment -> Alignment -- type signature
trans_align ((p1,p2):ps) ((q1,q2):qs)
| p2 == q1 = (p1,q2) : trans_align ps qs -- return match and continue
| p2 < q1 = trans_align ps ((q1,q2):qs) -- skip first 'p'
| q1 < p2 = trans_align ((p1,p2):ps) qs -- skip first 'q'
trans_align _ _ = [] -- one input is empty
The first line declares the type of the function; it takes two alignments and returns a third. Then, the input alignments are pattern matched to extract the first pair of each alignment, and their position in sequence B
are compared. If they are the same, we have found a position in A
that maps to a position in C
, if not, we skip the one which is smallest, and in any case we continue. The final line is for when the pattern match fails, i.e. one of the alignments is empty. In that case, no more mappings can be constructed, and we return the empty list.
But what if we make our Alignment
type less specific:
type Alignment a b = [(a,b)]
Here, a
and b
are type variables, they represent any type. So e.g.
[("foo",42),("bar",1337)]
is a valid alignment of type Alignment String Int
, for instance. In other words, we no longer require indices to different sequences to have the same type. We can of course give trans_align
the type:
trans_align :: Alignemnt Int Int -> Alignemnt Int Int -> Alignemnt Int Int
-- index into: A B B C A C
But we’re not acutally using any Int
operations here, and clearly, if the returned alignment is from A
to C
, we can enforce this by instead giving trans_align
a type of:
trans_align :: Alignment x y -> Alignment y z -> Alignment x z
(In reality, we also need an Ord
constraint on y
, without this, we couldn’t compare positions in B
in our implementation.2)
See what have we done? In the end we’ll probably just use Int
for all the type variables (x
, y
, and z)
, but the implementation of trans_align
doesn’t know this. Thus the type ensures that indices in the different sequences are kept separately, and things like comparing indices from different sequences, or returning indices from the wrong sequence are no longer not just bad ideas, they’re illegal, and enforced by the compiler. How cool is that? This function is even prohibited from comparing indices in A
or C
, all it can do with them is return them. And: unlike passing strings off as ints, mixing up indexes is a mistake that is easy to make.
In retrospect, this is perhaps obvious: The more general a type is - that is, the less you know aobut its specifics - the less you are allowed to do with it. A general type constrains its implementation. And the less you can do with it, the fewer wrong things are possible. Perfection is achieved not when there is nothing left to add, but when there is nothing left to take away. And perfect use of the type system is when all the wrong programs are eliminated, and only correct implementations are legal.
For some alignments, typically protein to protein, the alignment usually has to be monotonically increasing, for other cases, it can be either way.↩
In fact, if we wrote
trans_align
without specifying the type, this is, modulo theAlignment
alias, is the type Haskell will assign to it:*Main> :t trans_align trans_align :: Ord a => [(t, a)] -> [(a, t1)] -> [(t, t1)]
So we can either design types up front, or we can let the compiler work out the most general type - where “general type” is another way of saying the least constraint that must be satisfied. Isn’t it neat?↩