# Frequency counting in Haskell

by *Ketil Malde*; **November 11, 2013**

One of the things I find myself using associative data structures (a.k.a. finite maps, hash tables, …) for again and again, is frequency counts. This is such a common task, and thus it should be easy and efficient, but due to the non-strictness of Haskell, it’s in fact a bit tricky to get right. So if you use a `Data.Map`

(specifically, `Data.Map v Int`

), instead of a simple expression like:

`freqtable = foldl' (M.insertWith (+1) 1) M.empty`

I often found myself writing something like:

```
freqtable = go M.empty
where
go m [] = m
go m (i:is) = let v = M.findWithDefault 0 i m
v' = v+1
m' = M.insert i v' m
in v' `seq` m' `seq` go m' is
```

The good news is that the contortions above are not really necessary anymore. We now have strict versions of these data structures which always evaluate the value elements, as well as `insertWith'`

(with a tick at the end) which evaluates values (to WHNF) on insertion.

The bad news is that for various reasons, `Data.Map`

isn’t always the optimal data structure for this, and so I decided to investigate a bit.

## Candidate data structures

The default data structure for associative maps in Haskell is perhaps still `Data.Map`

, but in many cases, `Data.IntMap.Strict`

is a better choice. This is strict, and by using Int as the key (which isn’t really a severe limitation for most purposes), is more efficient. This is still implemented as a tree structure, which enables sharing and immutability and purity - the usual goodies. (But note that this isn’t so important here - we’ll often just build the table once at the outset and then use it, we’re not updating things much later on.)

Another alternative is a direct table of counts. `Data.Vector`

is Haskell’s new all-purpose arrays, providing a 0-indexed table with various options for content. The older `Data.Array`

provided more flexible arrays with multiple dimensions etc, but these can fairly easily be built on top of vectors. Like `Data.Array`

, vectors come in multiple flavors:

**regular**- unboxed, can hold complex (and unevaluated, non-strict) values**unboxed**- stores primitive data structures directly, and is strict**mutable**- can do destructive updates, but requires operations in a monad

Then finally, there is `Data.Judy`

- in fact, stumbling over its Haskell library by Don Stewart is the reason I started looking into this. The C version is advertised as an extremely performant, mutable, cache-friendly tree structure.

## Implementation

To benchmark these data structures, I’ve written a small driver program that basically inserts a given number of values from a given range into one of these data structures, and measured CPU time and memory usage with /usr/bin/time.

So, to make it clear, given a range `r`

, and a number of values `n`

, we insert `n`

random values between 0 and `r`

. Since `Vector`

has a predetermined size, we allocate an `r`

-sized array, for the others, the size grows dynamically.

Also, all counts are `Int`

s (or the equivalent `Word`

for Judy), especially `Vector`

could probably save by using a smaller data type for storing counts.

Oh, and Don’s Judy library is a bit sparse, among other things, I added an `insertWith`

. This makes frequency counting a bit faster (see below), as well as thread safe.

## Results

Anyway, this is how it looks so far:

Here we use a fixed range (up to 1M), and vary the number of inserts. We see that all implementations scale linearly - which is what we would expect. `IntMap`

and (boxed) `Vector`

are about seven times slower than the others, though, while unboxed and mutable vectors are a bit faster than `Judy`

(j is Judy with lookup and insert, jw is judy using `insertWith`

)

Memory consumption is lowest and constant for mutable and unboxed vectors. Judy has a comparable low memory footprint, but grows slightly with number of inserts. `IntMap`

is way more expensive, something like 20x bigger. Interestingly, boxed vector seems to use a large amount of memory, I’m not exactly sure why this is.

Of course, vectors are fast when you can allocate a relatively small array, and update it directly. But the price you pay is limited range, here we look at how a constant number of insertions (again 1M) look when we increase the range. Again, `IntMap`

and boxed vectors are slow, `Judy`

and the other vectors are substantially faster.

Finally, we see that for memory use, vectors scale linearly with the range. Interestingly, the mutable vectors have half the memory footprint of immutable, but unboxed vectors. I would have thought an `unsafeFreeze`

or equivalent would make them the same - but apparently not? And Judy really shines here, with a tiny footprint compared to the others.

## Conclusions

For small ranges and dense data (relatively few counts of zero), use `Data.Vector`

, the mutable kind. For large ranges and/or sparse data, use `Data.Judy`

.

A version of Judy that compiles on recent compilers, and adds some extra functions (e.g., `insertWith`

and `adjust`

) can be found here. Scripts etc used in benchmarking are here. I haven’t spent a lot of effort exploring implementation options, so it is likely that my implementations are suboptimal. As always, comments welcome.