Biohaskell

bioinformatics and haskell

11  02 2009

Notes from PADL

Time flees.  It’s already been a while since PADL in Savannah, where I had the opportunity to enjoy talks in topics I mostly managed to follow and meet interesting and interested people.  Thanks to the organizers and committees making it all possible.  I presented a paper on Bloom filters that Bryan O’Sullivan and I wrote, and thought I’d make the paper available along with the slides (expanded somewhat), and a couple of ideas for extending Bloom filters that I think are original (or “novel”, as they say in science).

First things first, here are the files:

Bloom filters for bioinformatics – paper

Bloom filters for bioinformatics – slides

The new addition to the slides consist of the counting bloom filters, and a brief overview on how to locate matches in a Bloom filter.  The standard counting Bloom filter consists of replacing the array of bits with an array of bit buckets (of size b, say).  This lets the filter count up to 2^b occurrences of each element, typically using saturating counts.  If you want to retain the false positive rate, this means you’ll expand the size of the Bloom filter by a factor of b, which can be quite significant. In the example below, b is set to 3, and the filter saturates at a count of 7.

counting bloom filter

Therefore, I propose a different kind of counting bloom filter, using a distributed count.  Keep in mind that we have an infinite sequence of hash functions.  When inserting a new value, we can check the k hash functions to see if it’s already there.  If so, we calculate some additional hash functions to represent the count of two for this value.  And so on, until we find a count (and corresponding set of hashes) that have at least one zero bit, set those bits to one, and move on to the next value.  It may make sense here to decrease the number of hash values as the count goes up.  In the example below, we see that x is already inserted, and two new hash values are calculated to increase the count.

distributed counting bloom filter

There’s a trade off, of course.  Insertion and lookup is no longer (as) constant, but  proportional to value count. However, the Bloom filter is likely to be more compact – given n values in total, where u are unique, the standard counting filter needs a size proportional to bu, while the distributed counting filter needs size proportional to n.  If you want to count accurately, b needs to be set to the logarithm of the max count of any value.

Finally, it’s also possible to use Bloom filters for searching – for instance, locating unique words in a genome. A lookup in a Bloom filter basically gives one bit of information – present or not present.  We use this with a series of m Bloom filters, each indexing the regions of the genome corresponding to one bit of location information.  Looking up a unique word in the set of filters reveals the location with m bits of precision.

locating bloom filters


Leave a Reply

You must be logged in to post a comment.

« 454 sequencing and parsing the SFF binary format Current developments… »