Updates to the biosff library

by Ketil Malde; September 28, 2013

I recently received an email from Martin Mokrejš, who wanted to package some of the Biohaskell software for Gentoo. This is awesome: I’ve been wanting to provide some of the more useful tools for various Linux distributions for some time, but the processes seem a bit complex, so I never got around to it.

In the process, Martin asked for some clarifications and enhancements, and this is just a quick note describing those, as implemented in biosff version 0.3.7.1.

(I forgot to include one module in 0.3.7, which is a-okay with cabal as long as it is still present on disk, but of course it doesn’t get included in the tarball. Sigh.)

Licensing

The first clarification needed was one of licensing. I generally slap GPLv2 on software tools, and LGPL on libraries. I’m not fanatical about it, but so far, I haven’t heard any compelling argument for any other choice - in fact, I haven’t heard many arguments at all. I also didn’t bother to include the actual license text, since I expect people who are interested to be able to find it. However, this was not sufficient, so I have now included a COPYING file and specified license versions. So far, so good - I hope.

Flower updates

Flower is a tool that works with 454 SFF files (these days, you can get those from IonTorrent, too) and extracts information in various ways.
One thing it can do is apply various trimming information to the output, e.g., removing low quality sequence bits. SFF files can also contain information on adapter sequence - that is, synthetic sequences used in the sequencing process, but not actually part of the data you are interested in. Up till now I have ignored this, simply because I haven’t had files where the adapter trim points were nonzero. But this is now supported, and the -a option trims for adapter.
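To make the idea concrete, here is a minimal sketch of how quality and adapter trim points can be combined. It is not the code in Flower or biosff: the Clips record, its field names and trimBases are made up for illustration. It assumes the usual SFF convention that trim points are 1-based and that zero means "not set", and it assumes adapter trimming is applied on top of quality trimming.

```haskell
-- Hypothetical record for the four clip values of one read
-- (not the biosff types). Values are 1-based; 0 means "not set".
data Clips = Clips { qualLeft, qualRight, adapterLeft, adapterRight :: Int }

-- | Cut a read down to the region kept by its clip values.
-- The Bool says whether adapter clips are honoured as well
-- (roughly what an option like -a would switch on).
trimBases :: Bool -> Clips -> String -> String
trimBases useAdapter c bases = take (right - left + 1) (drop (left - 1) bases)
  where
    n = length bases
    -- an unset right clip means "keep up to the end of the read"
    unset r = if r == 0 then n else r
    -- the innermost left clip wins; an unset (0) value never does
    left
      | useAdapter = max 1 (max (qualLeft c) (adapterLeft c))
      | otherwise  = max 1 (qualLeft c)
    -- the innermost right clip wins, and never exceeds the read length
    right
      | useAdapter = min n (min (unset (qualRight c)) (unset (adapterRight c)))
      | otherwise  = min n (unset (qualRight c))
```

If the clips cross each other, take gets a count of zero or less and returns the empty string, so the result is an empty read rather than an error.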

FClip for modifying SFF files

Now, Flower can output a variety of formats, but not SFF. Instead of adding this functionality to Flower, I made a new, and much simpler, executable FClip for this. It only implements the trimming functionality from Flower (now moved into the biosff library), and spits out a new SFF with trimming applied.

Ideally, the trim points come in pairs, one left trimpoint and one right. When the sequence is trimmed, coordinates change, and the remaining trimpoints are updated accordingly. Now, it isn’t always this simple, since the trimmed regions sometimes overlap. So care must be taken to avoid things like negative trimpoints, or a negative remaining sequence length. And a left trimpoint of 1, or a right trimpoint equal to the sequence length, should be set to zero. There were quite a few bugs in the previous code, but hopefully it is all fixed now. In addition, both Flower and FClip optionally suppress empty reads (i.e. where everything is trimmed) with the -E option.
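The bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the biosff implementation: Clip, keptRegion, trimmedLength and rebase are hypothetical names, and the sketch assumes 1-based inclusive trim points where zero means "not set".

```haskell
-- Hypothetical pair of trim points (not the biosff types).
-- 1-based and inclusive; 0 means "not set".
data Clip = Clip { clipLeft :: Int, clipRight :: Int } deriving (Eq, Show)

-- | Region of a read of length n that a clip keeps, as (first, last).
-- If last < first, nothing is kept.
keptRegion :: Int -> Clip -> (Int, Int)
keptRegion n (Clip l r) = (max 1 l, if r <= 0 then n else min n r)

-- | Length remaining after trimming; never negative.
trimmedLength :: Int -> Clip -> Int
trimmedLength n c = max 0 (b - a + 1)
  where (a, b) = keptRegion n c

-- | Express the clip 'other' in the coordinates of the read left over
-- after 'applied' has been cut out of a read of length n.
-- Nothing means no bases survive both clips (an empty read).
rebase :: Int -> Clip -> Clip -> Maybe Clip
rebase n applied other
  | n' == 0 || hi < lo = Nothing
  | otherwise          = Just (Clip left right)
  where
    (a, _)   = keptRegion n applied
    n'       = trimmedLength n applied
    (oa, ob) = keptRegion n other
    -- shift into the new coordinates and clamp to the new read
    lo = max 1  (oa - a + 1)
    hi = min n' (ob - a + 1)
    -- a left trimpoint of 1, or a right one equal to the length, trims nothing
    left  = if lo == 1  then 0 else lo
    right = if hi == n' then 0 else hi
```

For example, with a read of length 100, applying Clip 5 90 leaves 86 bases, and a remaining Clip 10 95 becomes Clip 6 0 in the new coordinates. A caller could drop reads for which rebase returns Nothing, which is roughly what an option like -E asks for.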

FRecover for corrupted SFF files

This is an old program to try to salvage corrupted SFF files. It was included in the old Flower distribution (before it was merged with biosff), but I never saw corrupted SFF files again, and assumed it was a one-time thing. But Martin thought it would be useful, so it is now included again.

Enjoy!

Post scriptum

One thing - when trying to install this on older Linux distributions (which basically means any “enterprise” distribution), the system tends to want to upgrade a bunch of stuff. Usually, this is not a problem, but e.g. the latest array package requires the Trustworthy extension, which is only supported by the latest GHC, and so breaks compilation on anything less than the cutting edge. It is possible (but cumbersome) to work around this; if this is a problem for too many users, I’ll work out the details and do a writeup.

Feedback? Please email ketil@malde.org.