k-mer counting in a high-level language?

by Ketil Malde; November 22, 2013

One important ingredient in the bioinformatician’s toolchest is the k-mer indexer or counter. (A k-mer, sometimes called a k-word or q-gram, is simply a contiguous substring of length k, and the k-mers of a data set are all overlapping such substrings.) This forms the basis for many kinds of sequence statistics and analysis, including alignment and assembly.

There exists a variety of tools to count k-mers; the important attributes are speed and memory usage - the latter is often more important, since data sets are typically large. Stumbling over this link, one gets the impression that Python and Perl, while convenient, suffer performance-wise. Consequently, k-mer counting programs tend to be written in C, C++, or some other bare-metal language. So I thought it would be interesting to compare a Haskell implementation. Titus Brown and his group (who author the khmer probabilistic k-mer counter) have done some work comparing different tools; I won’t go into details except to note that jellyfish currently seems to be the one to beat.

I believe most k-mer indexing tools pack nucleotide sequences (consisting of A, C, G, T) into two bits per nucleotide, and ensure somehow that a k-mer hashes to the same value as its reverse complement.
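Just to illustrate the idea - this is a toy sketch of my own, not the code any of these tools actually uses, and the function names and the A=0, C=1, G=2, T=3 encoding are arbitrary choices for the example - two-bit packing and reverse-complement-invariant hashing could look something like this in Haskell:

import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- Two-bit encoding of a nucleotide (A=0, C=1, G=2, T=3).
enc :: Char -> Word64
enc 'A' = 0
enc 'C' = 1
enc 'G' = 2
enc 'T' = 3
enc c   = error ("unexpected nucleotide: " ++ [c])

-- Pack a k-mer (k <= 32) into a single Word64, two bits per base.
pack :: String -> Word64
pack = foldl (\w c -> (w `shiftL` 2) .|. enc c) 0

-- Reverse complement, computed directly on the packed word:
-- the complement of a two-bit code is simply 3 minus the code.
revcomp :: Int -> Word64 -> Word64
revcomp k w0 = go k w0 0
  where go 0 _ acc = acc
        go i w acc = go (i-1) (w `shiftR` 2)
                        ((acc `shiftL` 2) .|. (3 - (w .&. 3)))

-- Canonical value: a k-mer and its reverse complement map to the
-- same number by always picking the smaller of the two encodings.
canonical :: Int -> String -> Word64
canonical k s = min w (revcomp k w)
  where w = pack s

Picking the smaller of the two encodings is just one convention; any rule that sends a k-mer and its reverse complement to the same value would do.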

Then, there are several alternatives for storing the counts. The most common is probably some form of hash table, but others use suffix arrays (which are much more general), probably including compressed, sparse, and enhanced variants. And some use probabilistic methods based on Bloom filters or similar structures, which are fast and compact at the expense of being slightly incorrect and not being able to give you back the k-mers.
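To make the simplest alternative concrete - again only a toy sketch of my own, not what any of the real tools do - counting the k-mers of a sequence in an ordinary in-memory map could look like this (reverse-complement canonicalization left out for brevity):

import qualified Data.Map.Strict as M
import Data.List (tails)

-- Count every k-mer of a sequence in a plain Data.Map: the
-- simplest associative structure, but neither compact nor fast.
countKmers :: Int -> String -> M.Map String Int
countKmers k s = M.fromListWith (+)
    [ (kmer, 1) | t <- tails s, let kmer = take k t, length kmer == k ]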

Anyway, since I am toying with Judy arrays (which, like hash tables, are associative data structures, but are implemented as a trie), I thought I’d generate k-mers and store them in a Judy array. I wrote the following program:

-- getArgs comes from the standard library; the sequence and k-mer
-- helpers (readSangerQ, kmers_rc, mk_judy, add_count, counts,
-- writeHist) are not standard, they come from my own libraries.
import System.Environment (getArgs)

main = do
  [k',f] <- getArgs                    -- k-mer size and FastQ file name
  s <- readSangerQ f                   -- read the sequences
  let k = read k'
      kmer_seq = kmers_rc k . unSD . seqdata   -- k-mers hashed with their reverse complements
  freqs <- mk_judy (2*fromIntegral k)  -- one Judy array for the k-mer counts
  mapM_ (add_count freqs) $ concatMap kmer_seq s
  cts <- mk_judy 20                    -- and one for the histogram of counts
  mapM_ (add_count cts . fromIntegral) =<< counts freqs
  writeHist cts (f++".hist")

This generates all k-mers using kmer_seq, a small wrapper around kmers_rc, which hashes each k-mer together with its reverse complement. It then counts the k-mers in a Judy array called freqs, builds a histogram cts from the counts in freqs, and finally writes the histogram to a file.
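The program needs to be compiled before the timing runs below; assuming the source is in a file called mkhist.hs (the file name is a guess, the binary name matches the runs), the usual GHC invocation with optimization will do:

% ghc -O2 -o mkhist mkhist.hs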

How does this perform? I ran this on my laptop (Core i5 at 2.5 GHz) on a file containing 250K 100bp reads, a 60MB file with about 25MB of actual sequence data. I also ran jellyfish on the same file. It looks like this; first my program:

% /usr/bin/time ./mkhist 31 250K-reads.fastq
Init judy 62
Init judy 20
13.59user 0.19system 0:13.78elapsed 100%CPU (0avgtext+0avgdata 293544maxresident)k
0inputs+64outputs (0major+73470minor)pagefaults 0swaps

Not too shabby, perhaps: this is close to 2 MB/s of sequence data. Then jellyfish:

% /usr/bin/time jellyfish count -m 31 -C -s 10000000 250K-reads.fastq -o 250K
8.27user 0.15system 0:08.12elapsed 103%CPU (0avgtext+0avgdata 187392maxresident)k
0inputs+328864outputs (0major+47007minor)pagefaults 0swaps

% /usr/bin/time jellyfish merge 250K_0 250K_1 -o 250K.jf
1.29user 0.10system 0:01.35elapsed 103%CPU (0avgtext+0avgdata 60060maxresident)k
0inputs+325048outputs (0major+15097minor)pagefaults 0swaps

% /usr/bin/time jellyfish histo 250K.jf > 250K.hist
0.34user 0.01system 0:00.35elapsed 99%CPU (0avgtext+0avgdata 163984maxresident)k
0inputs+8outputs (0major+41061minor)pagefaults 0swaps

Jellyfish is a bit less convenient: it builds the counts in several files (two here), which must then be merged afterwards. But it is much faster - only eight seconds to do the count, and about one and a half for the merge and the histogram. A total of less than ten seconds is a good bit better.

Now, these examples are pretty small, but you will also notice that my program uses a bit more memory, and if you suspect this will get worse as you scale up, you are right: jellyfish just generates more files to keep memory use down. My program barely manages to count 7M reads on my 8G laptop (in about ten minutes, if you have to know), so as it stands, it’s not very useful for real data. And to be fair, jellyfish also has good multiprocessor support - although I think that wouldn’t be too hard to add. And to be completely fair, I guess I might point out that the main reason we’re doing well is that we rely on Judy arrays, written in C, lovingly handcrafted for compactness and efficiency. On the bright side, the k-mer hashing is pure Haskell, and fast enough not to be a noticeable bottleneck.

Oh, yeah, and I diffed the output files and got the same result. I did get slightly different answers for the 7M-read file, though; I’ll have to look into that, I guess.
