Career path: a bit more detail.

Last month, Rolf Apweiler and I were kindly asked by Nature Jobs to comment on our career paths to becoming Associate Directors at the EBI (which will happen in April). The final piece focuses on just one of our career paths (mine). You can read the article here: http://www.nature.com/naturejobs/science/articles/10.1038%2Fnj7383-123a

It is very positive about me, about the EBI and about the other institutions I went through. It’s a good thing, mainly because it raises the EBI’s profile. However, some important things were missed that I would like to mention here – in particular, I think it’s important to acknowledge the people I’ve been lucky enough to work with over the years, and who have made a positive impact on my career.
Continue reading “Career path: a bit more detail.”

My Emotions in Science

First off, apologies that I have not blogged recently. This is because I have been pushing hard on a large consortium paper, and it felt bad to have public evidence of not working on it when I was asking many other people for pretty continuous effort over a protracted period of months. This is not entirely logical – we all actually need some diversity in what we do scientifically, even when we’re on a big push – but it somehow makes sense sociologically.

Continue reading “My Emotions in Science”

Customising Ensembl Displays

Ensembl’s display customisation has become both far more detailed and far easier to access. I’m going to talk you through how to customise it and what my favourite view is.

The default Ensembl display is something like this: nice, pretty, but very transcript/gene-centric, with a strong emphasis on splitting up the strands. I am now more of a regulation person, and genes really form the context of what I want to see, but most of the time I don’t want to see the details of the transcripts.

Continue reading “Customising Ensembl Displays”

Biology is a 1st class quantitative science

Last week I was at a “Town Hall” meeting for the UK High Performance Computing community (or at least part of it), which had a lot of computer people rubbing shoulders with what used to be called “hard” scientists: high-energy physicists, astronomers, modellers (particularly climate change/oceanography) and the odd molecular modeller/chemist. High beard/non-beard ratio, and an almost 1:1 Y-to-X chromosome ratio. From their perspective I was a straight-up biologist (though from the biologists’ perspective, I am a computer scientist. The curse of being interdisciplinary is that you are often seen as being on the outside of all the disciplines you span. Hey ho.).

Continue reading “Biology is a 1st class quantitative science”

Five statistical things I wished I had been taught 20 years ago

I came through the English educational system, which meant that although I was mathematically minded, because I had chosen biochemistry for my undergraduate degree my maths teaching rapidly stopped. At university I took the more challenging “Maths for Chemists” option in my first year, though in retrospect that was probably a mistake because it was all about partial differentiation and not enough about stats. The maths for biologists was probably the better course, but even that, I think, spent too much time on things like the t-test and ANOVA, and not enough on what you actually need. To my subsequent regret, no one took me aside and said “listen mate, you’re going to be doing a lot of statistics, so just get the major statistical tools under your belt now”.

Biology is really about stats. Indeed, the founders of much of frequentist statistics – RA Fisher and colleagues – were totally motivated by biological problems. We just lost the link in the heyday of molecular biology, when you could get away with n=2 (or n=1!) experiments due to the “large effect size” – i.e., the band is either there or not – style of experiment. But now we’re back to working out far more intertwined and subtle things. So every biologist – molecular or not – is going to have to become a “reasonable” statistician.

These are the pieces of hard won statistical knowledge I wish someone had taught me 20 years ago rather than my meandering, random walk approach.

1. Non-parametric statistics. These are statistical tests which make a bare minimum of assumptions about underlying distributions; in biology we are rarely confident that we know the underlying distribution, and hand-waving about the central limit theorem can only get you so far. Wherever possible you should use a non-parametric test. This is Mann-Whitney (or Wilcoxon if you prefer) for testing “medians” (medians is in quotes because this is not quite true – they test something closely related to the median) of two distributions, Spearman’s rho (rather than Pearson’s r2) for correlation, and the Kruskal-Wallis test rather than ANOVA (though, if I have this right, you can’t do the more sophisticated nested models in Kruskal-Wallis that you can with ANOVA). Finally, don’t forget the rather wonderful Kolmogorov-Smirnov (I always think it sounds like really good vodka) test of whether two sets of observations come from the same distribution. All of these methods have a basic theme of doing things on the rank of items in a distribution, not the actual level. So – if in doubt, do things on the rank of a metric, rather than the metric itself.
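For what it’s worth, all of these are one-liners in base R. A minimal sketch on simulated data (the numbers here are invented purely for illustration, not from any real experiment):

```r
set.seed(42)
control   <- rlnorm(30, meanlog = 0)     # skewed, deliberately non-Gaussian data
treatment <- rlnorm(30, meanlog = 0.5)

# Rank-based test of location for two groups (Mann-Whitney / Wilcoxon)
wilcox.test(control, treatment)

# Rank-based correlation (Spearman's rho) rather than Pearson's r
x <- rnorm(50); y <- x^3 + rnorm(50)
cor.test(x, y, method = "spearman")

# Kruskal-Wallis in place of a one-way ANOVA for more than two groups
groups <- factor(rep(c("a", "b", "c"), each = 20))
values <- rlnorm(60, meanlog = as.numeric(groups) / 4)
kruskal.test(values ~ groups)

# Kolmogorov-Smirnov: do two samples come from the same distribution?
ks.test(control, treatment)
```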

2. R (or I guess S). R is a cranky, odd statistical language/system with a great scientific plotting package. It’s a package written mainly by statisticians for statisticians, and is rather unforgiving the first time you use it. It is definitely worth persevering. It’s basically a combination of Excel spreadsheets on steroids (with no data entry; an R data frame is really the same logical entity as an Excel worksheet, but able to handle millions of points rather than thousands), a statistical methods compendium (statistical methods are usually written first in R, and you can almost guarantee that there are no bugs in the major functions – unlike many other scenarios) and a graphical data exploration tool (in particular the lattice and ggplot packages). The syntax is inconsistent, the documentation is sometimes wonderful and often awful, and the learning curve is like the face of the Eiger. But once you’ve met p.adjust(), xyplot() and apply(), you can never turn back.
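To give a flavour of those three workhorses, here is a minimal sketch on an invented data frame (the gene names and expression values are made up for illustration):

```r
library(lattice)                      # for xyplot()

d <- data.frame(gene   = paste0("g", 1:1000),
                expr_a = rnorm(1000, mean = 5),
                expr_b = rnorm(1000, mean = 5.2))

# apply(): run a function across rows (1) or columns (2) of a matrix
row_means <- apply(d[, c("expr_a", "expr_b")], 1, mean)

# p.adjust(): correct a vector of raw P values for multiple testing
raw_p <- sapply(1:1000, function(i) t.test(rnorm(5, 5), rnorm(5, 5.2))$p.value)
adj_p <- p.adjust(raw_p, method = "BH")   # Benjamini-Hochberg FDR

# xyplot() from lattice: a quick scatter plot of one column against another
xyplot(expr_b ~ expr_a, data = d)
```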

3. The problem of multiple testing, and how to handle it: either with the expected value, or the FDR, or the backstop of many a piece of bioinformatics – large-scale permutation. Large-scale permutation is sometimes frowned upon by maths/distribution purists, but it is often the only way to get a sensible sense of whether something is likely “by chance” (whatever that phrase means – it’s a very open question) given the complex, heterogeneous data we have. Ten years ago the lack of large-scale compute resources perhaps meant this option was less open to people, but these days basically everyone should be working out how to appropriately permute their data to get a good estimate of the “surprisingness” of an observation.
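A minimal permutation sketch, again on simulated data: is the observed difference in means between two groups surprising relative to what you see when the group labels are shuffled?

```r
set.seed(1)
a <- rnorm(40, mean = 0.3)
b <- rnorm(40, mean = 0.0)
observed <- mean(a) - mean(b)

pooled <- c(a, b)
n_perm <- 10000
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(pooled)                        # permute the group labels
  mean(shuffled[1:40]) - mean(shuffled[41:80])
})

# Empirical P value: how often does a permuted difference match or beat the real one?
p_perm <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
p_perm
```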

4. The relationship between P value, effect size and sample size. This needs to be drilled into everyone – we’re far too trigger-happy quoting P values, when we should often be quoting P values and effect sizes. Once a P value is significant, its degree of significance is sort of meaningless (or rather, it compounds effect-size things with sample-size things, the latter often being about relative frequency). So – if something is significantly correlated/different, then you want to know how big the effect is. This is not just about GWAS-like statistics – in genomic biology we’re all too happy to quote some small P value, not realising that with a million or so points, even very small deviations will often be significant. Quote your r2, your rho or your proportion of variance explained…
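A minimal sketch of why this matters at genomic sample sizes (the data are simulated, with a deliberately tiny true effect):

```r
set.seed(7)
n <- 1e6
x <- rnorm(n)
y <- 0.01 * x + rnorm(n)         # true effect is deliberately tiny

fit <- cor.test(x, y)
fit$p.value                       # vanishingly small P value, looks impressive
fit$estimate^2                    # r^2: proportion of variance explained, ~1e-4
```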

5. Linear models and PCA. There is often a tendency to jump to quite complex models – networks, or biologically inspired combinations – when our first instinct should be to crack out the well-established lm() (linear model) for prediction and princomp() (PCA) for dimensionality reduction. These are old-school techniques – and to talk about statistical fits one often needs to make Gaussian assumptions about distributions – but most of the things we do could be done well in a linear model, and most of the correlations we look at could have been found with a PCA biplot. The fact that these are 1970s bits of statistics doesn’t mean they don’t work well.
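Again, both are a few lines of base R. A minimal sketch on simulated predictors and response:

```r
set.seed(3)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)            # linear model for prediction
summary(fit)                       # coefficients, r^2 and P values in one place

m  <- cbind(x1, x2, x3 = x1 + rnorm(n, sd = 0.1), y)
pc <- princomp(m, cor = TRUE)      # PCA on the correlation matrix
summary(pc)                        # variance explained per component
biplot(pc)                         # the PCA biplot mentioned above
```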

Kudos to Leonid Kruglyak for spotting some minor errors

Compressing DNA: The future plan

I talked about our work earlier this year on compressing next-generation sequencing data in the first part of this series of three blogs; my second post outlined the engineering task ahead of us (mainly Guy, Rasko and Vadim). This third post gets into the fundamental question of whether this can scale “for the foreseeable future”. By “scale” here, I mean a very simple question – can we (the ENA) store the world’s public research DNA output within the fixed disk budget we currently spend each year? (This applies to the ENA as an archive, and to each sequencing centre or lab-with-a-machine individually for their own output.)

Continue reading “Compressing DNA: The future plan”

Engineering around reference based compression

(this is my second blog post about DNA Compression, focusing on the practicalities of compression).

The reference DNA compression which I talked about in my previous blog post was a mixture of theoretical results and real-life examples which we had taken from submitted datasets. But it was definitely a research paper, and the code that Markus wrote was Python code with some awful optimisations in the encode/decode streams (the codec, as the pros call it). So it… didn’t have a great runtime. This was one of the points we had to negotiate with the reviewers, and there’s a whole section in the discussion about how this is going to work in the future with nice, clean engineering.

Continue reading “Engineering around reference based compression”

Compressing DNA: Part 1.

(First off, I am back to blogging, and this is the first of three blogs I’m planning about DNA compression.)

Markus Hsi-yang Fritz – a student in my group – came up with a reference sequence compression scheme for next-generation sequencing data. This has just come out in “proper” print format this month (it was previously fully available online; full open access). Markus handled not only the bases, but also the other key components of a sequencing experiment – read pairing, quality values and unaligned sequence. For bases and read pair information, the compression is lossless. For the quality information and unaligned sequence one has to be lossy – if not, we can’t see a way of making an efficient compression scheme. But one can quite carefully control where one keeps and loses data (“controlled loss of precision” in compression terminology), meaning one can preserve information around important regions (e.g. SNPs, CNVs, etc).
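To give a feel for the core idea – though this is a toy sketch, not Markus’s actual implementation – reference-based compression stores, for each aligned read, only its position and its differences from the reference rather than the full sequence (the reference string, read and helper function names below are all invented for illustration):

```r
# Toy reference-based encoding: position plus mismatches, nothing else.
reference <- strsplit("ACGTACGTTAGCCGATACGT", "")[[1]]

encode_read <- function(read, pos) {
  read_bases <- strsplit(read, "")[[1]]
  ref_slice  <- reference[pos:(pos + length(read_bases) - 1)]
  diffs      <- which(read_bases != ref_slice)
  list(pos = pos, len = length(read_bases),
       diff_pos = diffs, diff_base = read_bases[diffs])   # usually a tiny record
}

decode_read <- function(enc) {
  bases <- reference[enc$pos:(enc$pos + enc$len - 1)]
  bases[enc$diff_pos] <- enc$diff_base
  paste(bases, collapse = "")
}

enc <- encode_read("ACGTTCGT", pos = 1)   # one mismatch at offset 5
decode_read(enc)                           # lossless round trip: "ACGTTCGT"
```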

Continue reading “Compressing DNA: Part 1.”