tags
454 BAM Bayes' theorem EST Illumina RNAseq SFF SNP SNP analysis alignment alignments annotation assembly comparison bayesian big data biocore biohaskell bioinformatics blast bloom filters carbon footprint climate change clustering conscience data storage de novo de novo assembly de novo genome economy environment evolution fasta frequentist genome sequencing hardware hpc illustration judy arrays k-mer library linux make meta metadata nuclear nuclear power opinion optimization parsing phylogeny population genetics productivity profiling programming science sequence alignment sequence analysis sequencing shell short reads software solar solar power static types statistics transalign ubuntu unexplained varan vector masking war whales xml
Some newer writings can be found here.
Recent posts
-
The cost of fixing climate change - December 24, 2016
climate change, nuclear, solar, war, economy
An old draft I had lying around, but which I somehow never got around to finishing. Well, here goes. TL;DR: Nuclear is probably our best option to reverse climate change. This explains why.
-
HP EliteBook Folio G1 - September 12, 2016
linux, hardware, ubuntu
A review of the HP EliteBook Folio G1, and the story of how to install Linux on it.
-
CAS-based generic data store - August 3, 2016
science, statistics, whales
Content addressable storage makes a system for generic data storage slightly more opaque, but eliminates the need for any central naming authority, and comes with many desirable properties (immutability of data, verifiability, deduplication) for free.
-
Why we should stop talking, and start to prepare for climate change - May 23, 2016
climate change, nuclear power, solar power
For all the focus on climate change, politicians, environmentalists, and basically everybody involved don't seem particularly interested in actually solving the problem.
-
Probabilities for heterozygote genetic markers in hybrids - March 10, 2016
science, statistics, whales
With a suitable set of genetic markers, it is fairly straightforward to identify organisms as belonging to one or the other population. But how useful are such markers for identifying hybrids when migrants from one population have mixed into the other?
-
Can you trust science? - March 25, 2015
science, statistics
We are bombarded by reports of scientific results that promise to revolutionize aspects of our lives, but somehow actual progress seems to go at a much lower pace. In reality, many scientific results fail to be reproduced. And there's a good reason for that.
-
Thoughts on phylogenetic trees - February 15, 2015
phylogeny, evolution
Phylogenetic trees represent the evolutionary relationship between species. They are often constructed based on the sequences of genes, and different genes can give conflicting results. This is my attempt to sort things out.
-
Information content and allele frequency difference - July 17, 2014
sequence analysis, SNP, varan
A quick look at the relationship between the information value and the frequency difference of alleles.
-
Expected site information from SNPs - July 2, 2014
bioinformatics, bayesian
SNPs are commonly identified by calculating measures like $F_{ST}$ or $p$-values from high-throughput sequencing data. But these are proxies for what we really want to know, viz. the information to be gained from observing a site. Here's how to do that.
-
Big data revisited - May 5, 2014
bioinformatics, opinion, big data
Wherein we examine the functional programming communities active in Munich, define the term "big data", and look at what it means in relation to bioinformatics, and science in general.
-
Some not-so-grand challenges in bioinformatics - April 8, 2014
bioinformatics, opinion
After working some years in bioinformatics, one realizes that there are some unmet challenges out there. Our methods and tools are often not quite as good as we would wish. Here are some of the issues I've run across.
-
Parallel SNP identification - March 26, 2014
sequence analysis, SNP, population genetics
I've recently been experimenting with various metrics for SNP discovery. One challenge is the time it takes to process the large amounts of data, and many existing tools are quite slow to run. The obvious answer is of course parallelism, and one would think that for a program that essentially processes a stream of records one by one, something simple like Haskell's `parMap` would work. But it turns out `parMap` builds a strict list of all the work to be done, and my feeble attempts at getting around this (by chunking and so on) didn't quite work out. Here is a rather quick and dirty hack that did, and which gets a nice speedup.
-
k-mer counting in a high-level language? - November 22, 2013
sequence analysis, k-mer, judy arrays
I often argue that Haskell is a high-level language that, unlike many other HLLs, offers good tools that also result in good performance. Currently, I am toying around with k-mer indexing; here are the results so far.
-
Frequency counting in Haskell - November 11, 2013
sequence analysis, SFF, 454, parsing
Frequency counting is an important task, which can be implemented with a variety of underlying data structures. Here I explore a few of them in Haskell.
-
Generic storage for heterogeneous data - October 1, 2013
data storage, metadata, xml, bioinformatics
Modern biology (and science in general) tends to produce data. Lots and lots of data. Often way more than can be sensibly analyzed. To exploit the value of these data in the future, it is necessary to store them so that they can be easily cataloged, searched, retrieved, and interpreted. This is an attempt to design a system that addresses these needs, yet is as simple as possible.