Giving users what they want: Haskell scripts on the web with CGI
by Ketil Malde; May 27, 2008
As a consequence of IWC policies, the Institute of Marine Research is required to store genetic identification of each minke whale that is hunted. This of course means that people will come to me for help in bridging the gap between test tubes and the databases by providing some analysis tools that can extract the information from the data.
Well, I’m not complaining, work security and all that, but it does leave me with a slight problem. I’m a Linux user, and my experience with non-Unix OSes after I passed on my Amiga 500 to my little brother is rather limited. (It’s not that I won’t fix your PC, rather, it’s quite likely that I can’t.) But the users I support in this particular case are largely confined by the walls of Redmond, and not terribly enthusiastic about broadening their horizons in that regard. Here’s one way to solve it.
Genotyping for dummies: a 2-minute introduction
There are several approaches to genotyping, and they have different properties, including evolutionary scale (i.e. do we want to identify family members, or different sub-species?), cost, and reproducibility. The minke whale registry will use several different technologies; here we will focus on single-nucleotide polymorphisms, or SNPs, in the mitochondrial genome.
SNPs are positions in the genome that vary between individuals. Usually, this means that some people will have an adenine where others have a cytosine or similar, but sometimes there are insertions or deletions as well, which we generously include in the definition. Mitochondria are sub-cellular organs, or organelles, that are responsible for producing energy from sugar and oxygen. They have their own circular genome and use a slightly different genetic code. They probably originated as a separate organism that somehow merged with another, conferring the ability to breathe oxygen, and it looks like that was a really good idea since it was picked up by everything larger than a bacterium. But we digress. Anyway, this is a stable technology, and good for relatively long evolutionary distance comparison, but since the mitochondria are inherited from the mother only and there’s no recombination, maternally related family members are not likely to vary much or at all.
Basically, the lab will sequence a part of the mitochondrial DNA twice, once in each direction, and align it to a reference sequence. My job will be to take this alignment, verify that the different directions match, and output the nucleotides in positions known to vary. In addition, we will check for and report any previously unknown SNPs, but the main product will be the table of known SNPs.
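The two checks described above can be sketched in a few lines of Haskell. This is only an illustration under stated assumptions, not the code used for the registry: the names `Column`, `mismatches` and `genotype` are hypothetical, columns are counted from 0, and both reads are assumed to be given as gapped strings already aligned to the reference and in the same orientation.

```haskell
import Data.Char (toUpper)

-- Hypothetical type: a 0-based column index into the alignment.
type Column = Int

-- Columns where the forward and the reverse read disagree (ignoring case).
-- zip3 stops at the shorter read, so trailing overhang is not reported.
mismatches :: String -> String -> [Column]
mismatches fwd rev =
    [ i | (i, a, b) <- zip3 [0 ..] fwd rev, toUpper a /= toUpper b ]

-- The character found at each known SNP column of an aligned read,
-- or Nothing if the column lies beyond the end of the read.
genotype :: [Column] -> String -> [(Column, Maybe Char)]
genotype known s = [ (i, lookup i indexed) | i <- known ]
  where
    indexed = zip [0 ..] s
```

For instance, `mismatches "ACGT-A" "ACGTTA"` gives `[4]`, and `genotype [1,4] "ACGT-A"` gives `[(1,Just 'C'),(4,Just '-')]`, where the gap character stands for a deletion.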
Options
Since I need to provide this application to Windows users, I need to choose a suitable approach:
- write a cross-platform command line tool and compile it for Windows
- write a cross-platform GUI tool
- write a web-based interface
I have never written a cross-platform Haskell application before. I mean, in theory it should all Just Work®, but in practice, theory and practice tend to diverge a bit. If I may say so without being disrespectful, I am also dubious about the convenience afforded my users by a command-line tool. A GUI tool would probably be nice, but it would require me to actually learn how to do it. Yes, it is – has been for a while, really – on my list of things to do, but if I can invent a plausible reason to procrastinate, well, it is the Way of Laziness, after all. So a CGI script looks like the path of least resistance.
Network.CGI
This library appears to come included with my GHC installations, so this is what we’ll use. For performance, there’s also the FastCGI library, tested in more detail in this benchmark test. Since we can reasonably estimate maybe tens of hits per year – limited by the current hunting quota – we will postpone the performance issues for now. Usage of Network.CGI is described, with examples, at the Haskell wiki.
Implementation: testing the waters
I can think of two ways to enter input (the alignment): either cut and paste from Vector NTI into a text field, or save the alignment to a file and upload it to the web page. While it turns out that Vector NTI can save to the ACE alignment format – which the bioinformatics library can read – it wasn’t quite straightforward. One thing is the gratuitous “\r”s that revealed a small bug or two, which were easily fixed. Another is that the ACE file produced by Vector NTI contains some irregularities, like sequences with no content. Until I figure out what that is supposed to signify, I guess it’s text field cut-and-paste for my users. Which is probably less complicated to use, anyway.
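Stray carriage returns are not limited to files saved on Windows: browsers also submit the contents of a text field with "\r\n" line endings. A minimal sketch of a defensive clean-up step might look like the following, where `cleanLines` is a hypothetical helper and not part of Network.CGI or the bioinformatics library:

```haskell
import Data.Char (isSpace)

-- Split the input into lines after removing every carriage return, and
-- drop lines that are empty or contain only whitespace.
cleanLines :: String -> [String]
cleanLines = filter (not . all isSpace) . lines . filter (/= '\r')
```

Using something like this in place of a bare `lines` means that a trailing blank line in the pasted text never reaches the parsing code.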
I already had working code for parts of the analysis, so the first step is to write a CGI wrapper. My first attempt looks like this:
import Network.CGI
import Columns
main :: IO ()
main = runCGI $ handleErrors cgiMain
-- Always show the form; if an alignment was submitted, append the analysis.
cgiMain = do
    m <- getInput "alignment"
    case m of
        Just n  -> output $ html (form ++ genResult n)
        Nothing -> output $ html form
-- Skip the first line, split the remaining lines, and format the SNP columns.
genResult s =
    let (hs,ss) = unzip $ map splitLine $ drop 1 $ lines s
        cols    = transpose ss
    in "<h1>Polymorphic columns</h1>\n\n" ++
       (formatCols hs $ filterCols snp_columns $ enumerate (last ss) cols) ++ "\n"
-- The input form, with a single text area named "alignment".
form = "<h1>Paste your alignment below:</h1>\n" ++
       "<form method='post' enctype='multipart/form-data'>\n" ++
       "<textarea rows='30' cols='120' name='alignment'></textarea><br />\n" ++
       "<input type='submit' /></form>\n"
-- Wrap a body in a minimal HTML page.
html c = "<html><head><title>Alignment analyzer</title></head><body>\n"
         ++ c ++ "</body></html>\n"
This is all fairly straightforward: the longish expressions in ‘genResult’ are lifted from the pre-existing program. Since the usage of HTML is rather limited, I’ve opted to scratch down my own versions of HTML-generating functions; otherwise, Text.XHtml can supply this.
Network.CGI’s API is clear, concise, and to the point. It provides a monad where we build our CGI action, and ‘runCGI’ turns it into an IO action, suitable to use as our ‘main’. We also include ‘handleErrors’, which deals with program errors, and returns a page with the error code 500 which contains the error message. To some extent. Sort of. In theory.
First deployment: thar she blows (up)!
First, the program needs to be built:
ghc --make SNPCGI.hs -o /var/www/cgi-bin/snp.cgi
Pointing the browser at the correct URL gives the expected form. Paste in the alignment to be analysed, hit ‘Submit’, and… nothing. My blank stare at the screen can only be matched by the blank page staring back at me. I check again, using ‘view code’, but there’s nothing.
It turns out there’s a bug in my code, apparently triggered by some change to the input data that happens when it is fed through an HTML form. Obviously, ‘handleErrors’ didn’t, but in the error logs for the Apache installation, we can read:
[Wed May 21 23:05:47 2008] [error] [client 10.1.4.167] snp.cgi: Columns.hs:10:10-37: Non-exhaustive patterns in function unlike, referer: http://gadidae/snp.cgi
It turns out, there is a strictness problem with error handling: any error must be raised before ‘output’ starts to evaluate. In other words, we need to fully evaluate the string that represents the resulting HTML document. A quick, but less than pretty, way to achieve this is by:
cgiMain = do
    m <- getInput "alignment"
    case m of
        -- r /= r is never True, but the comparison forces every character of r
        Just n  -> let r = form ++ genResult n
                   in if r /= r then error "Aiieee"
                                else output $ html r
        Nothing -> output $ html form
So, that takes care of the problem, no? Unfortunately, it still doesn’t work. This time, the browser just hangs there, for about ten minutes, before finally giving me a rather generic 500 page. Checking the logs again, I find:
[Thu May 22 22:00:15 2008] [error] [client 10.1.4.222] snp.cgi: Prelude.last: empty list, referer: http://gadidae/snp.cgi
[Thu May 22 22:22:44 2008] [error] [client 10.1.4.222] Premature end of script headers: snp-strict.cgi, referer: http://gadidae/snp-strict.cgi
[Thu May 22 22:26:32 2008] [error] [client 10.1.4.222] Premature end of script headers: snp-strict.cgi, referer: http://gadidae/snp-strict.cgi
[Thu May 22 22:27:44 2008] [error] [client 10.1.4.222] (500,"Internal Server Error",["Prelude.last: empty list"]), referer: http://gadidae/snp-strict.cgi
[Thu May 22 22:31:32 2008] [error] [client 10.1.4.222] (500,"Internal Server Error",["Prelude.last: empty list"]), referer: http://gadidae/snp-strict.cgi
This is just weird. Luckily, I tried the CGI on a different web server – this one, in fact – and guess what? It worked just as advertised. If you want, you can paste this file into the form here and see for yourself.
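For reference, the forcing trick can also be written without the odd-looking self-comparison. The sketch below is an assumption-laden alternative, not what the deployed script does: `forcePage` is a hypothetical helper that uses `evaluate` and `try` from Control.Exception to walk the whole string, and it catches every exception via `SomeException`, which is cruder than one might want in a larger program.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Force every character of the page before it is handed on, so that an
-- error hidden in the lazy string surfaces here and not while the response
-- is being written. Returns Left with the error text, or Right with the page.
forcePage :: String -> IO (Either String String)
forcePage page = do
    r <- try (evaluate (foldr seq () page)) :: IO (Either SomeException ())
    case r of
        Left e  -> return (Left (show e))
        Right _ -> return (Right page)
```

Inside ‘cgiMain’ this would presumably be run through `liftIO`, with the `Left` case producing an error page of one's own choosing. Note that `length page` would not be enough here, since it only walks the list structure and leaves the individual characters unevaluated.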
Moving on
The program above is rather crude: it delivers just the textual output wrapped in <pre> tags. Not something that will impress your mother-in-law, so the next step is to generate a proper HTML table. After that, why not highlight the polymorphic columns that were previously unknown, and so on, and so on. If you’re really interested, feel free to browse the darcs archive.
Acknowledgments
As it is always helpful to have somebody who listens without interrupting, I would like to thank the Haskell Café mailing list for being there for me. Also, a good article needs to be written, rewritten, and rewritten again. This time, Wordpress was responsible for this, as a misplaced </form> tag emphasized the omission of any versioning or backup facility for posts in development. Finally (and seriously), thanks to Björn Bringert for pointing out the laziness issue on IRC; that actually was a great help.