A plan for Bloom filters

by Ketil Malde; July 31, 2008

Bloom filters is apparently a relatively old technology, dating from the 1970s or so, but it has somehow escaped my radar until Bryan O’Sullivan posted a message to the haskell mailing list announcing a high-performance implementation in Haskell, perhaps to support a chapter in the upcoming book. You can read all about Bloom filters on Wikipedia, but the executive summary of it is that it is a structure similar to Data.Set. Except that it is probabilistic, and may occasionally claim a value is a member when it’s not. On the positive side, the Bloom filter is very fast, and speed is independent on the size — in other words, lookup and insert is O(1) where Data.Set is O(log n).

Comparing sequences to find similarity is a common occurrence in bioinformatics. For instance, one might want to know where a certain gene is located in the chromosome, or which sequence fragments are similar enough to originate from the same gene. To speed up searches, it is common to index sequences in questions as overlapping, substrings (k-tuples, q-grams). This index seems like an obvious target for Bloom filters – large data, time critical, some false positives anyway – but for some reason, there are almost no such applications that use them. Until now.

Sequence clustering

Sequence clustering is a commonly used technique which is usually based on sequence similarity. I’ve written one sequence clusterer, xsact, which is based on blocks of exact matches. There are many others, and another example is d2cluster, which is based on occurence of fixed length words – which is right up Bloom alley, right?

So a straightforward way to build a Bloom filter based sequence clusterer is to represent each cluster as a set of words – stored as a Bloom filter. Now, adding a new sequence to the clusters is a simple matter of extracting the words from the sequence, identifying the cluster(s) containing a sufficient number of these words, and adding the remaining words to that cluster (or the union of the clusters, in the case multiple clusters match).

The interesting thing about this approach is that the whole thing becomes O(kn), for k clusters and data size n. I think all other clustering algorithms are based on sequence pairs, which makes them O(n²) – in the worst case, you need to check all pairs. (However, a straightforward similarity-based clustering will have worst-case behavior when no sequence math each other, while suffix-based methods like xsact will have worst case when all sequences match – so perhaps there is room for a better middle ground?)

Anyway – while the above looks promising, there is one snag: EST sequences can occur as a copy of the gene, or due various properties of the manufacturing process, the gene’s reverse complement. Thus, we need to be able to compare sequences in both directions simultaneously. This could be achieved with a slightly creative hashing function, but to keep things simple, we’ll stick with the ossified mental sweat in the provided implementation.

Index and search

A somewhat related area is indexing and search. Let’s say you have a bunch of DNA sequences (of for each chromosome, perhaps), and a set of ESTs, which you’ll remember are gene fragments in an unpredictable mix of forward and reverse-complement directions. Here’s the plan:

index by building a Bloom filter containing the q-tuples for each chromosome
for each EST, look up each q-tuple (first forward, then rev.comp.) against the filters, and assign it to the chromosome containing the most q-tuples
align each EST against the designated chromosome using traditional methods

As far as I can tell at this point, for word size q, c chromosomes of length m and e ESTs of lenght n, this should run in something like O(qn) + O(qenc) + O(emn), compared to just aligning directly at O(ecmn). Note that mn is the big factor here, so on a large scale, we’re reducing the total work by a factor of c. (Of course, in real life, nm would be too large, and you’d use a heuristic, subquadratic alignment step, but this is for illustration purposes. Try to be a bit generous, will you?)

Further plans

That concludes the current plan, but there are certainly improvements that can be made, two obvious ones are

hash function that hashes a word and its reverse complement to the same value
break long sequences into partially (1/3) overlapping regions to speed things up even more, as well as give more accurate placements
Another thing that struck me is that in my annotation tool, I could probably use this as a faster way to store the set of matching proteins before extracting GO terms. Currently, the performance is limited by XML parsing, so it’s probably not worth the bother at the moment.

More is likely to come up, but now it’s time to implement something!

comments powered by Disqus

Feedback? Please email ketil@malde.org.