All posts
-
Spring cleaning
- May, 2013
meta
Some updates to this blog, and posting of various material that has been sitting around for far too long.
-
Illumina RNAseq and bias
- May, 2013
Illumina, RNAseq, unexplained
Turns out RNAseq is inherently biased. This came as a surprise to me, and although this is in the literature, I don't think our tools take it into account. More interesting is that I'm not convinced the usual explanation - differences in random hexamer binding affinity - is the right one.
-
Footprints of a cow
- May, 2013
environment, carbon footprint, conscience
Some of my neighbors are enthusiastic about sustainability, the environment, and related issues. This rubs off, and I've borrowed a couple of books on carbon footprints. I'm still busy reading, but one obvious conclusion so far is: the cows gotta go.
-
My Bayesian Mind
- May, 2013
Bayes' theorem, frequentist, science
Some ideas about how the brain is hardwired to work more like the Bayesian than the positivist (Popperian) mindset.
-
The de novo genome project pathway
- January, 2013
de novo genome, illustration
A de novo genome project can make use of a diversity of data types and analysis software. This is an attempt to draw a map, inspired by metabolic pathway diagrams, to illustrate the various options.
-
Biocore 0.3.1 is released
- January, 2013
bioinformatics, biohaskell, library
After some prevaricating around the bush, a new release of the biocore library, version 0.3.1, is out on Hackage, along with a necessary, minor update of biosff.
-
The effect of using different substitution matrices
- November, 2012
sequence alignment, blast
Different substitution matrices can make BLAST searches more specific or more sensitive, but the effect is not very substantial, and is very far from the sensitivity of transalign.
-
Presentation on the anatomy of de novo genome projects
- October, 2012
sequence analysis, de novo, genome sequencing
Slides from a presentation I just gave, illustrating de novo genome assembly and annotation. Hopefully useful to y'all.
-
Transitive alignments (and why they matter)
- October, 2012
sequence analysis, alignments, annotation
Transitive alignments are calculated using an intermediate sequence database to help identify relationships that are highly diverged in the query and target sequences. This makes it possible to construct pairwise alignments well into the "twilight zone".
-
Calculating insert stats from BAM files
- September, 2012
sequence analysis, BAM
A small tool that reads a BAM file containing aligned reads (typically Illumina paired-end reads, but any paired type will do) and outputs various statistics on them.
-
Low cost ARM computers
- June, 2012
hardware
ARM computers are emerging from the telephone market, and are starting to look like a real option for many settings previously dominated by "real" computers - that is, PCs with an x86 CPU. Unfortunately, there is a plethora of CPU and GPU options. These are the notes I jotted down when attempting to make some sense of the available offerings.
-
Constructing An Assembly Evaluation Pipeline
- April, 2012
de novo assembly, make, assembly comparison
Often, genome assemblies report metrics like N50 or number of contigs, but this says very little about the assembly accuracy or usefulness. The asmeval pipeline is an attempt at a more comprehensive approach, and makes it easier to compare assemblies on a broader set of parameters. It is still in development, so please contact me if you want to test it.
-
The type system, safety, sequence alignments, and you
- April, 2012
alignment, static types, programming
A simple example that illustrates how to leverage the type system to reduce the possibility of errors. By making our types more polymorphic (or less specific), we restrict the possible implementations.
-
My blog software
- March, 2012
meta
By popular demand: A handful of updates and tweaks to Hakyll.
-
From alignments to sequence clustering
- March, 2012
alignment, clustering
The biopsl library is updated with some filtering and selection tools, and there's a simple clustering program using PSL files as its input.
-
Recent developments: Flower and Bamstats
- January, 2012
alignment, sequencing
A new utility for getting some information from BAM alignment files, and a small update to `flower`.
-
The return of the blog
- January, 2012
meta
-
Biocore 0.1 is released!
- September, 2011
biocore, biohaskell
The first incarnation of the _biocore_ library has been released.
-
Compressing biological sequences
- June, 2011
sequence analysis, SFF, 454, parsing
As sequence data continue to grow exponentially in volume, compression is becoming more interesting. Here's a quick test of a couple of popular algorithms, and how to get improved compression rates by simple rearrangements of the data.
-
Presentation on the (lack of) data management practices
- June, 2011
de novo assembly, make, assembly comparison
During a presentation on the data management practices I employ - or rather, would like to employ - I mentioned some commonly read and cited works that I thought were relevant. Since there was some interest, the list is repeated here.
-
Searching for poly(A) tails
- December, 2009
sequence analysis, fasta, EST
Poly-A tails are an inherent feature of mRNA sequences, and although they are useful for identifying the end of transcripts, we often want to remove them before further analysis. Here is an algorithm to identify poly-A tails in FASTA sequences.
-
454 sequencing and parsing the SFF binary format
- November, 2008
sequence analysis, SFF, 454, parsing
Roche's 454 sequencing technology can produce biological sequence data on a scale that exceeds traditional Sanger sequencing by orders of magnitude. Due to the fundamentally different method used to generate the sequences, we would like to investigate the raw data and see if we can quantify -- and maybe also reduce the number or severity of -- errors. This means reading the binary SFF format. Below, we'll dissect the SFF format, and describe a Haskell implementation.
-
The FASTQ file format for sequences
- September, 2008
software, shell, EST
The FASTQ format is the format of choice for most new sequencing technologies. Here, we implement a Haskell parser, and look at some ..uh, interesting properties of the format and its implementations.
-
Some thoughts on programmer productivity
- August, 2008
programming, productivity, biohaskell
Measuring programmer productivity can be hard. One common way is to count lines of code; another is to count VCS commits. This is not so much an article as a collection of annotated links.
-
A plan for Bloom filters
- July, 2008
sequence analysis, bloom filters
The Bloom filter is an interesting data structure, providing a probabilistic set. Here are some ideas for using them for bioinformatics.
-
Updates and other trivialities
- July, 2008
software, hpc
A brief overview of software updates and some other notes. Imported from my old WordPress blog in order to preserve it for posterity, and to make Google Webmaster tools happy. Links have been changed to protect the guilty...uh, I mean, improve the reader experience.
-
Functional bash bracketing
- July, 2008
software, shell, EST
My current development project is an EST pipeline. For various reasons, it is implemented in shell --- bash, to be exact. In other words, the pipeline is a script, or rather a set of scripts, that will tie together the various stages: masking, clustering, assembly, and annotation.
-
Giving users what they want: Haskell scripts on the web with CGI
- May, 2008
SNP analysis, whales
Advanced genetic analysis (a.k.a. some simple text munging) implemented in Haskell, and deployed as a FastCGI module. Using the web as a deployment platform has its downsides, but does serve to lower the threshold for the users.
-
Optimization week: making Haskell go faster
- May, 2008
sequence analysis, annotation, optimization, profiling
Optimization being all the rage, we investigate how we can make a program that annotates sequences with GO terms go faster (no pun intended). We use GHC profiling to identify performance culprits, and try to improve them.
-
Can you spare five minutes?
- May, 2008
sequence analysis, clustering
..to write a simple, but useful and efficient bioinformatics program? How to implement a simple tool to process ACE files, and extract clustering information and the contained sequences.
-
Cleaning up sequences
- May, 2008
sequence analysis, vector masking
Some notes about cleaning up sequences. Sanger sequencing has perhaps become a bit old-skool in the years after this was written, but it still offers long read lengths and good quality. The complicated protocol introduces a bit of junk, though, and removing this isn't as easy as it should be.

