Constructing An Assembly Evaluation Pipeline

by Ketil Malde; April 23, 2012

Warning: Although this software works well for me, it is still under development and hasn’t seen a lot of testing. Thus, I’d be surprised if it installs smoothly and works without a hitch. If you try anyway, please let me know how it goes.

Part of my day job has recently been working on a de novo sequencing project. As we are exploring many methods for genome assembly, it is important to be able to evaluate the resulting assemblies in order to eliminate poor methods, improve on the more promising ones, and of course in the end select the best assembly. I have implemented a pipeline that performs this evaluation, and this is an attempt to briefly describe its goals and implementation.

Measures and input data

In order to assess quality, we rely on both internal and external measures. For instance, we can calculate the N50 contig size from the genome assembly directly, making it an internal measure. Alternatively, we can count the number of ESTs we are able to align to the assembly, making it an external measure.

For external measures, we will make use of many different types of data: genomic reads (Illumina and 454), transcripts assembled from ESTs and RNASeq data, the mitochondrial genome, and fosmid end sequences.

Defining qualities

The pipeline addresses four aspects of assembly quality that ideally are independent of each other:

  1. Completeness: does the genome assembly contain every bit of the genome, or is a part of it missing?

  2. Fragmentation: does the genome come as a few long sequences, or as a zillion tiny fragments? And if the former, are the sequences correctly put together?

  3. Accuracy: does the assembled genome sequence contain small-scale errors, from e.g. base calling errors?

  4. Redundancy: is there a one-to-one correspondence between locations in the assembly and the actual genome, or does the assembly contain multiple copies of parts of the genome?

Note that there need not be a fixed, optimal value for each measure that indicates a perfect assembly; it is only important that we can compare the scores for different assemblies in order to determine which one is superior for that particular measure.

Completeness

Genomic (DNA) reads are aligned to the genome, and the proportion having matches is reported. We use BWA to align Illumina and 454 reads, and rely on ‘samtools flagstat’ to report the matches.
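
For illustration, rules along these lines could produce the mappings and the statistics. This is a simplified sketch, not the pipeline’s actual rules: $(GENOME) stands for the assembly under evaluation (which must first have been indexed with bwa index), and the read files follow the something.1.txt/something.2.txt naming discussed further below.

  # align one read file against the assembly
  %.sai: %.txt
      bwa aln $(GENOME) $< > $@

  # combine the two alignments of a read pair into a BAM file
  %.bam: %.1.sai %.2.sai
      bwa sampe $(GENOME) $*.1.sai $*.2.sai $*.1.txt $*.2.txt \
      | samtools view -bS - > $@

  # summarize mapping statistics
  %.stats: %.bam
      samtools flagstat $< > $@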

In addition, various other sequences are mapped using BLAT; below, we have a set of transcripts assembled from ESTs and RNASeq data, the mitochondrial genome, and fosmid ends. Here we report the average size of each match (as a percentage of the length of the query sequence), as well as the total number of matching sequences.

COMPLETENESS:
  Illumina mapped:      77.0411%
  PSL hits:
    all_transcripts_165.rna.psl.uniq:   92.0361%, 28398 matching sequences
    l.salmonis-mtDNA.dna.psl.uniq:      90.1952%, 3 matching sequences
    fosmids_fwd.dna.psl.uniq:   66.002%, 9411 matching sequences
    fosmids_rev.dna.psl.uniq:   63.7143%, 5886 matching sequences

Fragmentation

Fragmentation is measured by the standard N50 contig size, as well as N25 and N75. To make these numbers comparable between assemblies, we use a fixed estimate of the genome size, not the size of the assembly itself.
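
Computing these from the contig lengths is straightforward; here is a sketch, assuming a contig-lengths file with one length per line, and a $(GENOME_SIZE) variable holding the estimate (both hypothetical names):

  # Nxx is the length of the contig that brings the cumulative
  # length past xx% of the estimated genome size
  n-stats: contig-lengths
      sort -rn $< | \
      awk -v G=$(GENOME_SIZE) '{ sum += $$1 } \
          !n25 && sum >= G*0.25 { n25 = $$1 } \
          !n50 && sum >= G*0.50 { n50 = $$1 } \
          !n75 && sum >= G*0.75 { n75 = $$1 } \
          END { print "N25/50/75:", n25, n50, n75 }' > $@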

Also, we count the proportion of Illumina reads that have the ‘proper pair’ flag set by BWA. Note that in order to make fragmentation independent of completeness, this is measured as a fraction of the matching reads, not all reads.

FRAGMENTATION:
  N25/50/75:    76308   46116   24010   using 600000000 as total size.
  Illumina pairs proportion:    67.7091%

Accuracy

We measure accuracy by looking at the scores of individual alignments: in this case, genomic reads are aligned with BWA, and the alignment scores are averaged.

ACCURACY:
  SFF average match:    148.565
  Illum average match:  46.7718

Redundancy

Redundancy looks at the total size of the assembly, comparing it to the estimated size. Then, we count the number of alignments of 454 reads and compare it to the number of query sequences.
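
The size comparison itself is a small exercise in AWK; a sketch, with $(GENOME) again being the assembly, and $(GENOME_SIZE) a hypothetical name for the estimate:

  # sum the sequence lengths in the assembly FASTA,
  # and report the total as a percentage of the estimate
  asm-size: $(GENOME)
      awk -v G=$(GENOME_SIZE) '!/^>/ { s += length($$0) } \
          END { printf "Total size: %d, %.3f%% of estimated size.\n", \
                s, 100*s/G }' $< > $@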

Finally, we make use of the BLAT alignments, summing up the alignment scores as a proportion of the best scores for each sequence.

REDUNDANCY:
  Total size: 675758373, 112.626% of estimated size.
  SFF reads mapped:     117.235%
  PSL score, total vs unique:
    all_transcripts_165.rna.psl:            233.564%    67439222/28873929
    l.salmonis-mtDNA.dna.psl:       179.406%    76115/42426
    fosmids_fwd.dna.psl:            1046.08%    67459110/6448777
    fosmids_rev.dna.psl:            963.377%    37459791/3888385

Implementation

The pipeline is built as a makefile, which uses a mix of shell, AWK, and the occasional Haskell executable to do most of the heavy lifting.

Make is the build system traditionally used in the Unix world. Essentially, the build process is described in a declarative language, and the make tool works out the steps needed to construct what’s desired.

Some features that make it especially suitable:

  1. The build is described declaratively, as a set of rules, and make works out which steps need to be run.

  2. Results that already exist are not recomputed, so a failed or interrupted analysis can be resumed from where it stopped.

  3. Independent steps can be run in parallel (using the -j option).

Make

The basic syntax of make is pretty simple (but never fear, it quickly gets complicated enough!). Here’s an example rule:

  foo: bar
      process bar > foo

This tells make that, in order to build the file foo, it first needs the file bar; and when that is in order – either because it exists already, or because make can use some other rule to construct it – foo is built by running the command process bar > foo.

Make supports pattern-based rules, the archetypal example is probably something like:

  %.o: %.c
      $(CC) -c -o $@ $<

This means that we can build any file ending in .o from a prerequisite having the same name, except ending with .c, by running the compiler from the variable CC (presumably the C compiler) on the prerequisite (found in the built-in variable intuitively named $<), putting the result in the target ($@).

While this makes it easy to set up rules for building stuff from other stuff, the caveat here is that it’s all pattern based, typically using a wildcard (%) and a suffix. This means that, for instance, paired sequence data in FASTQ must be named something.1.txt and something.2.txt.[1]
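
If your files are named differently, the workaround from the footnote is a matter of two symbolic links (the file names here are made up):

  ln -s reads_R1.fastq sample.1.txt
  ln -s reads_R2.fastq sample.2.txt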

In addition, make has a set of built-in string manipulation functions; they typically look somewhat like this:

  STATS    := $(patsubst %.bam,%.stats,$(BAMS))

This populates the variable $(STATS) with the contents of the variable $(BAMS), except that it substitutes .stats for .bam in each element.
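
For instance, a (hypothetical) fragment that collects every BAM file in the directory and asks for a corresponding .stats file for each of them:

  BAMS  := $(wildcard *.bam)
  STATS := $(patsubst %.bam,%.stats,$(BAMS))

  all-stats: $(STATS)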

Shell

The commands in make files are executed by the shell, so it’s not too difficult to do things like loops. For instance, you can do:

  all-fstats: $(FSTATS)
     for a in $(FSTATS); do echo "$$a	`grep % $$a | tr '\n' '\t'`"; done > $@

I.e., to build the file all-fstats, we first make sure we have everything in the $(FSTATS) variable; then we loop over the files, doing some grep and tr processing, and directing the output into $@ – yet another intuitively named variable, this time referring to the target of the current rule, in this case all-fstats. Note that shell variables must be prefixed with an extra $; a single dollar refers to makefile variables.

AWK

AWK is useful to collect data and perform arithmetic. For instance, we construct a table summarizing the interesting stats from all-fstats like so:

all-fstats.tab: all-fstats
    sed -e 's/\t[^\t]*(\([0-9][0-9]*\.[0-9]*%\))/\t\1/g' < $< | \
    awk ' BEGIN { OFS="\t" } { print; mapped += $$2; paired += $$3; single += $$4 } \
          END { print "totals:    ",mapped/NR "%",paired/NR "%",single/NR "%" }' > $@

AWK is used here to accumulate sums of the various variables, and to calculate and print the averages in the end.

Haskell

There are also some tools written in Haskell that perform various processing steps. These include psluniq, which extracts the best hit from PSL files, and atcounts, which calculates the GC/AT ratio for each contig.

The whole thing is distributed as a Cabal library, so the required Haskell tools will be compiled or pulled in as dependencies on installation. Note that you’ll probably have to add $HOME/.cabal/bin to your $PATH.

Some thoughts

Make is Unix’s answer to democracy: it’s clearly the worst build tool out there, except for the others. It is often criticised for not being a full programming language, but I think this is a Good Thing: a more restrictive language is easier to get right. In fact, most of the problems of make stem from it having all kinds of extra functionality bolted on top.

One difficulty is the need to juggle different paradigms: the various string processing functions in make go a long way, in spite of their messy syntax, but you still can’t escape some munging with AWK, Perl, or similar tools. Wouldn’t it be nice if this were unified? Similarly, you have different layers with different sets of variables. Perhaps the ideal tool is make rules with Perl implementations?

An interesting alternative is Neil Mitchell’s shake, which does all of this in Haskell. In addition, it traces dependencies indirectly, so no more problems with forgotten dependencies. If or when I get around to trying out the alternatives, that’s clearly at the top of my list.

In practice

Getting it

You can browse the sources at http://malde.org/~ketil/asmeval. This is also a darcs repository, and you can get a copy by doing

darcs get http://malde.org/~ketil/asmeval

Of course, you’ll also need a bunch of tools, including the GHC compiler, the biopsl library, the BWA and BLAT aligners, samtools, etc. Hopefully, you already have standard Unix tools like sed and awk.

On debian-derived Linux distributions, you should be able to get at least some of the way by doing:

sudo apt-get install cabal-install darcs bwa
cabal update
darcs get http://malde.org/~ketil/asmeval
cd asmeval
cabal install

It looks like you need to install BLAT separately, and I may be missing some stuff - do let me know and I’ll update this post.

Configuring it

The configuration essentially consists of setting a bunch of make variables. This can be done on the command line, e.g.:

asmeval GENOME=mygenome.fasta ESTS=transcripts.fasta

However, asmeval will also look for a file called CONFIG in its current directory, which can also set these variables. It needs to be in makefile format, and is simply included by the main makefile. The command line above translates to a CONFIG file containing:

GENOME := mygenome.fasta
ESTS   := transcripts.fasta

Note that this lets you use make’s string processing facilities in all their byzantine glory. There is an example (CONFIG.example) provided in the repository that can be used as a template.
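
For instance, a hypothetical CONFIG might gather its inputs with wildcard, rather than listing them by hand:

  GENOME := mygenome.fasta
  ESTS   := transcripts.fasta
  READS  := $(wildcard reads/*.txt)

(READS is an invented variable name here; see CONFIG.example for the variables the pipeline actually recognizes.)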

Running it

After configuration, it should be possible to run everything by executing asmeval, which invokes make with some default options. These include -r, to avoid the built-in rules (which only slow things down), and warnings for uninitialized variables, in case you forgot something in the CONFIG file. You can add further options; for instance, you may feel inclined to add -j to get parallel processing.

In order to run in the background and collect various information, I often type:

nice nohup time asmeval -j8 &

Note that -j allows make to spawn multiple processes in parallel, but does nothing to take advantage of multi-threaded executables like BWA. There is a configuration variable, THREADS, that you can set for this, so an alternative to the above could be

nice nohup time asmeval THREADS=4 -j2 &

This will speed up some of the sub-processes, but run fewer of them in parallel.
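
The idea is that THREADS gets passed on to the multi-threaded tools; for instance, a BWA rule might use it like this (a sketch; the actual rules may differ):

  %.sai: %.txt
      bwa aln -t $(THREADS) $(GENOME) $< > $@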

The report

The default summary is output in a file called report, which contains the main statistics collected about the assembly. Hopefully it is self-explanatory enough to be useful.


  [1] If they aren’t, one solution is to add symbolic links to fix this.

Feedback? Please email ketil@malde.org.