In my semi-ongoing series of “things I wished someone had told me…” I wanted to share my sense of common “gotchas” in genomics. Here’s my top 10.
- No large scale dataset is perfect – and many are at some distance from perfection. This is true of everything from assemblies through to gene sets to transcription factor binding sites to phenotypes generated in cohorts. For people who come from a more focused, single area of biology, where in a single experiment you can have mastery of every component if desired, this ends up being a bit weird – whenever you dig into something you’ll find some miscalls, errors or general weirdness. Welcome to large scale biology.
- When you do predictions, favour specificity over sensitivity. Most problems in genomics are bounded by the genome/gene set/protein products, so the totally wide “capture everything” experiment (usually called “a genome-wide approach”) has become routine. It is rarely the case (though not never) that one wants a high sensitivity set which is not genome-wide. This means that for prediction methods you want to focus on specificity (i.e., driving down your error rate) as long as you are generating a reasonable number of predictions (>1000 say) and, of course, cross-validating your method.
- When you compare experiments you are comparing the combination of the experiment and the processing. If the processing was done in separate groups, in particular with complex scripting or filtering, expect many differences to be solely due to the processing.
- Interesting biology is confounded with Artefacts (1). Interesting biology is inevitably about things which are outliers or form a separate cluster in some analysis. So are artefacts in the experimental process or bioinformatics process – everything from biases towards the reference genome, to correlations of signals actually being driven by a small set of sites.
- Interesting biology is confounded with Artefacts (2). There is a subset of the above which is so common as to be worth noting separately. When you have an error rate – and everything has an error rate due to point 1 – the errors are either correlated with biology classification (see point 2) or uniform. Even when they are uniform, you still get misled because often you want to look at things which are rare – for example, homozygous stop codons in a whole genome sequencing run, or lack of orthologs between species. The fact that the biological phenomenon you are looking for is rare means that you enrich for errors.
- Interesting biology is usually hard to model as a formal data structure and one has to make some compromises just to make things systematic. Should one classify the Ig locus as one gene, or many genes, or something else? Should one try to handle the creation of a new selenocysteine amber stop codon by a SNP as a non-synonymous variant? To what extent should you model the difference between two epitopes for a ChIP-seq pull down of the same factor when done in different laboratories? Trying to handle all of this “correctly” becomes such a struggle to be both systematic and precise that one has to compromise at some point, and just write down a discussion, or reference papers, in plain old English. Much of the work on bioinformatics databases is trying to push the boundary between systematic knowledge and written down knowledge further; but you will always have to compromise. Biology is too diverse.
- The corollary of 1, 2 and 4 is that when most of your problems in making a large scale dataset are about modelling biological exceptions, your error rate is low enough. Until you are agonising over biological weirdness, you’ve still got to work on error rate.
- Evolution has a requirement that things work, not that it’s an elegant engineering solution. Expect jury-rigged systems which can be bewildering in their complexity. My current favourite is the Platypus X chromosomal system which is just clearly a bonkers solution to heterogametic sex. Have fun reading about it (here’s one paper to get you started)!
- Everyone should learn the basics of evolution (trees, orthologs vs paralogs. And please, could everyone use these terms correctly!) and population genetics (Hardy-Weinberg equilibrium, the impact of selection on allele frequency, and coalescence, in particular with respect to human populations). For the latter case people often need to be reminded that the fact a site is polymorphic does not mean it is not under negative selection.
- Chromosomes have been arbitrarily orientated p to q. This means that the forward and reverse strand have no special status on a reference genome. If any method gives a difference between forward and reverse strands on a genome-wide scale – it’s probably a bug somewhere :)
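A quick sanity check for that last point can be sketched in a few lines of Python. This is only an illustration: the function names and the per-strand counts are made up, and the test is a plain normal approximation to a 50:50 binomial rather than any particular tool's method.

```python
import math

def strand_bias_z(forward, reverse):
    """Z-score of the forward/reverse split against an expected 50:50.

    Uses the normal approximation to Binomial(n, 0.5), which is fine
    for the large counts typical of genome-wide data.
    """
    n = forward + reverse
    if n == 0:
        raise ValueError("no observations")
    # Under the null, forward ~ Binomial(n, 0.5): mean n/2, sd sqrt(n)/2.
    return (forward - n / 2) / (math.sqrt(n) / 2)

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical genome-wide counts of some called feature on each strand.
balanced_z = strand_bias_z(50_120, 49_880)
skewed_z = strand_bias_z(53_000, 47_000)

print(f"balanced: z={balanced_z:.2f}, p={two_sided_p(balanced_z):.3g}")
print(f"skewed:   z={skewed_z:.2f}, p={two_sided_p(skewed_z):.3g}")
```

With genome-wide counts even a small asymmetry will eventually reach a tiny p-value, so it is worth looking at the size of the imbalance as well as its significance before going bug hunting.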
I am sure other people have other comments or suggestions :)
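To put some numbers on the second "confounded with Artefacts" point, here is a minimal sketch of why rare events enrich for errors. All the rates below are invented for illustration, and the model is deliberately crude: errors are uniform, every non-event site can be miscalled as an event, and no real events are missed.

```python
def expected_calls(n_sites, true_rate, error_rate):
    """Expected numbers of real and erroneous calls of some event.

    Assumes each site is a real event with probability `true_rate`, and
    each non-event site is miscalled as an event with probability
    `error_rate` (uniform errors, no missed real events).
    """
    true_calls = n_sites * true_rate
    false_calls = n_sites * (1 - true_rate) * error_rate
    return true_calls, false_calls

def fraction_real(n_sites, true_rate, error_rate):
    """Fraction of all observed calls that are real events."""
    true_calls, false_calls = expected_calls(n_sites, true_rate, error_rate)
    return true_calls / (true_calls + false_calls)

# Same hypothetical error rate, very different event frequencies.
N_SITES = 1_000_000
ERROR_RATE = 0.001
common = fraction_real(N_SITES, true_rate=0.1, error_rate=ERROR_RATE)
rare = fraction_real(N_SITES, true_rate=1e-5, error_rate=ERROR_RATE)

print(f"common event: {common:.1%} of calls are real")
print(f"rare event:   {rare:.1%} of calls are real")
```

With these made-up numbers about 99% of the calls for the common event are real, but only about 1% of the calls for the rare event are, even though the error rate is identical in both cases.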
Hi Ewan,
Great writeup. Regarding point 9, I would appreciate your suggestions on good books or articles, assuming that the reader is familiar with these terms but would like a deeper understanding.
All of the above is all too true, and unfortunately not easy to escape. A couple more…
a) Don't make a meal out of small p-values if they don't explain much of your phenomenon. With genome-scale data it's easy to achieve statistical "significance" but is it biologically significant?
b) Never forget that a gene model is not a gene, and a genome sequence is not a genome. Both are approximate computational representations of real biological entities in the cell.
c) Don't confuse genome-wide approaches (e.g. WGS assemblies/microarrays) with whole-genome approaches (e.g. genetic screens). There is typically a lot of DNA missing from assemblies/arrays that has important biological effects (e.g. in heterochromatin).
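Point (a) is easy to see with a little arithmetic. The sketch below is an illustration only: the correlation and sample size are invented, the function name is hypothetical, and the p-value uses the standard Fisher z-transformation approximation for a Pearson correlation.

```python
import math

def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation r from n points.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal when the true correlation is zero.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical genome-scale result: a very weak correlation, a million sites.
r = 0.02
n = 1_000_000
p_value = correlation_p_value(r, n)
variance_explained = r ** 2

print(f"p-value: {p_value:.3g}")
print(f"variance explained: {variance_explained:.2%}")
```

The p-value here is astronomically small, yet the correlation accounts for a small fraction of one percent of the variance, which is the gap between statistical and biological significance that point (a) warns about.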
@Vince: Good places to start for population genetics and (molecular) evolution are:
– A Primer of Population Genetics, Daniel L. Hartl
– Population Genetics: A Concise Guide, John H. Gillespie
– Principles of Population Genetics, Daniel L. Hartl & Andrew G. Clark
– Molecular Evolution, Wen-Hsiung Li
– The Origins of Genome Architecture, Michael Lynch
@caseybergman: Thank you. I'll check these out ASAP.
Another point: species and taxonomy are a mess but there are conceptual underpinnings for each species concept. These have consequences. There are also consequences for collaborators who don't have the same species concept in mind when they approach data analysis and presentation – negative consequences.
Once again, I agree with all your rules. Rule #1 definitely deserves to be #1. In fact, I'll go further: Most large-scale datasets are very imperfect. As a result, much of your time as a bioinformatician will be spent weeding out the junk.
Interesting post – just found by a link. Have to say I agree with most of the comments, but strongly disagree that non-parametric tests should be preferred.
Preferring parametric tests over non-parametric tests was a big change in my thinking as I moved from getting my biology degree to getting my statistics degree. Biologists think this trade-off has to do with making assumptions and losing power, and have an intuitive inclination to prefer the method that “makes less assumptions” (not quite true though).
The real trade-off is whether one is able to say they understand their data and have a generative model for it, or (in the typically non-parametric case) have no idea why it looks the way it does. Without a good model for the distribution of some data, it is very hard to interpret any test result and a lot may be missed. For example, suppose I perform a clinical trial of a new cancer drug and find that it instantly kills one fourth of the patients but saves the remaining three fourths. A patient considering the drug would probably prefer to know what distinguished the dead patients from the live ones rather than see the results of a non-parametric test that showed the median effect of the drug was pretty good. I find biologists tend to think of statistics as a way of testing, whereas statisticians tend to think of statistics as a way to do quantitative modeling of outcomes, and testing predictions within that model.
Granted, a lot of times things aren’t normal or are confusing enough that we can’t generate a quantitative model for future replicates of the data, but data transforms, mixture models and additional covariates can do a lot before one has to resort to non-parametric tests. Such tests should not be a ready excuse for not being able to explain why the data looks the way it does.
Completely agree with the power of linear models though.
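The cancer drug example above can be simulated in a few lines. This is a toy sketch under loud assumptions: the effect sizes, the one-in-four split and the function name are all invented, and splitting patients by the sign of their outcome is only a crude stand-in for fitting a proper two-component mixture model.

```python
import random
import statistics

random.seed(0)  # fixed seed so the toy example is repeatable

def simulate_trial(n_patients=400):
    """Simulated change in survival (months) for each treated patient.

    Hypothetical drug: about a quarter of patients are badly harmed,
    the rest benefit. Both effect sizes are made up.
    """
    outcomes = []
    for _ in range(n_patients):
        if random.random() < 0.25:
            outcomes.append(random.gauss(-12.0, 1.0))  # harmed subgroup
        else:
            outcomes.append(random.gauss(6.0, 1.0))    # helped subgroup
    return outcomes

outcomes = simulate_trial()

# The single summary a median-based test would focus on.
median_effect = statistics.median(outcomes)

# Crude two-group split by sign, standing in for a mixture model.
harmed = [x for x in outcomes if x < 0]
helped = [x for x in outcomes if x >= 0]

print(f"median effect: {median_effect:+.1f} months")
print(f"harmed: {len(harmed)} patients, mean {statistics.mean(harmed):+.1f}")
print(f"helped: {len(helped)} patients, mean {statistics.mean(helped):+.1f}")
```

The median looks comfortably positive, while the two-group view shows the subgroup a patient would actually want to know about.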