5 September 2012 – Today sees the embargo lift on the second phase of the ENCODE project and the simultaneous publication of 30 coordinated, open-access papers in Nature, Genome Research and Genome Biology as well as publications in Science, Cell, JBC and others. The Nature publication has a number of firsts: cross-publication topic threads, a dedicated iPad/eBook App and web site and a virtual machine.
This ENCODE event represents five years of dedicated work from over 400 scientists, one of whom is myself, Ewan Birney. I was the lead analysis coordinator for ENCODE for the past five years (and before that had effectively the same role in the pilot project) and for the past 11 months have spent a lot of time working up to this moment. There were countless details to see to for the scientific publications and, later, to explain it all in editorials, commentary, general press features and other exotic things.
But in telling the story over and over, only parts of it get picked up here and there – the shiny bits that make a neat story for one audience or another. Here I’d like to add my own voice, and to tell at least one person’s perspective of the ENCODE story uncut, from beginning to end.
This blog post is primarily for scientists, but I hope it is of interest to other people as well. Inspired by some of my more sceptical friends (you know who you are!), I’ve arranged this as a kind of Q&A.
Q. Isn’t this a lot of noise about publications when it should be about the data?
A. You are absolutely right it’s about the data – ENCODE is all about the data being used widely. This is what we say in the conclusions of the main paper: “The unprecedented number of functional elements identified in this study provides a valuable resource to the scientific community…” We focused on providing not only raw data but many ways to get to it and make sense of it using a variety of intermediate products: a virtual machine (see below), browse-able resources that can be accessed from www.encodeproject.org and the UCSC and Ensembl browsers (and soon NCBI browsers), and a new transcription-factor-centric resource, Factorbook. As I say in a Nature commentary, “The overall importance of consortia science can not be assessed until years after the data are assembled. But reference data sets are repeatedly used by numerous scientists worldwide, often long after the consortium disbands. We already know of more than 100 publications that make use of ENCODE data, and I expect many more in the forthcoming years.”
Q. Whatever – you love having this high-profile publication.
A. Of course I like the publications! Publications are the best way for scientists to communicate with each other, to explain key aspects of the data and draw some conclusions from them. But the impact of the project goes well beyond the publications themselves. While it is nice to see so much focus on the project, publishing is simply part of disseminating information and making the data more accessible.
Q. And 442 authors! Did they all really contribute to this?
A. Yes. I know a large proportion of them personally, and for the ones I don’t know, I know and trust the lead principal investigators who have indicated who was involved in this. To achieve systematic data generation on this scale – in particular to achieve the consistency – is a large, detailed task. Many of the other 30 papers – and many others to be published – go into specific areas in increasing levels of detail.
One group that I believe gets less credit than it deserves is the lead data production scientists: usually an individual with a PhD who heads up, motivates and troubleshoots the work of a dedicated group of technicians. There is a simple sentence in the paper: “For consistency, data were generated and processed using standardized guidelines, and for some assays, new quality-control measures were designed”. This hides a world of detailed, dedicated work.
There is no way to truly weigh the contribution of one group of scientists compared to another in a paper such as this; many individuals would satisfy the deletion test of “if this person’s work was excluded, would the paper have substantially changed”. However, two individuals stood out for their overall coordination and analysis, as did 21 individuals in this data production area, including those in the key role of the Data Coordination Center.
Q. Hmmm. Let’s move onto the science. I don’t buy that 80% of the genome is functional.
A. It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word “functional” so let’s try to tackle this first. Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, I have found that no single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, “broad” histone modifications, “narrow” histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.
Q. So remind me which one do you think is “functional”?
A. Back to that word “functional”: There is no easy answer to this. In ENCODE we present this hierarchy of assays with cumulative coverage percentages, ending up with 80%. As I’ve pointed out in presentations, you shouldn’t be surprised by the 80% figure. After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.
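To make "cumulative coverage" concrete, here is a minimal sketch of how a union-of-intervals percentage can be computed as each class of assay is added. The coordinates, the class ordering (most specific first) and the function names are all hypothetical; this is not the consortium's pipeline, only an illustration of why the running total can only grow as broader classes are included.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single toy chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def cumulative_coverage(
    ordered_classes: List[Tuple[str, List[Interval]]], genome_length: int
) -> List[Tuple[str, float]]:
    """Fraction of the genome covered by the union of all classes seen so far."""
    seen: List[Interval] = []
    result: List[Tuple[str, float]] = []
    for name, intervals in ordered_classes:
        seen = merge(seen + intervals)
        covered = sum(end - start for start, end in seen)
        result.append((name, covered / genome_length))
    return result


# Toy data (hypothetical coordinates), most specific class first.
toy = [
    ("exons", [(0, 5), (50, 55)]),
    ("bound motifs", [(3, 8), (20, 22)]),
    ("RNA", [(0, 40), (50, 80)]),
]
coverage = cumulative_coverage(toy, genome_length=100)
for name, fraction in coverage:
    print(f"{name}: {fraction:.0%}")
```

Because bases already covered by an earlier class are not counted twice, the final figure depends only on the union of all classes, not on the order in which they are added.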
However, on the other end of the scale – using very strict, classical definitions of “functional” like bound motifs and DNaseI footprints; places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases – we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. Given what most people thought earlier this decade, that the regulatory elements might account for perhaps a similar amount of bases as exons, this is surprisingly high for many people – certainly it was to me!
In addition, in this phase of ENCODE we did sample broadly but nowhere near completely in terms of cell types or transcription factors. We estimated how well we have sampled, and our most generous view of our sampling is that we’ve seen around 50% of the elements. There are lots of reasons to think we have sampled less than this (e.g., the inability to sample developmental cell types; classes of transcription factors which we have not seen). A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our sampling) to 20%.
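The step from 9% to 18% can be written out as simple arithmetic. The sketch below assumes a plain linear scaling (observed coverage divided by the fraction of elements sampled); that scaling rule and the function name are assumptions made for illustration, since the post does not spell out the exact extrapolation used.

```python
def extrapolate_coverage(observed_fraction: float, fraction_sampled: float) -> float:
    """Scale observed coverage up by the estimated fraction of elements sampled.

    Assumes (for illustration only) that unsampled elements cover new bases
    at the same rate as the sampled ones.
    """
    if not 0 < fraction_sampled <= 1:
        raise ValueError("fraction_sampled must be in (0, 1]")
    return min(1.0, observed_fraction / fraction_sampled)


observed = 0.09  # exons plus specific DNA:protein contacts, from the text
estimate = extrapolate_coverage(observed, fraction_sampled=0.5)
print(f"{estimate:.0%}")
```

If the true sampling fraction is below 50%, the same rule gives a larger figure, which is the sense in which 18% is conservative and 20% is easy to justify.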
Q. [For the more statistically minded readers]: What about the whole headache of thresholding your enrichments? Surely this is a statistical nightmare across multiple assays and even worse with sampling estimates.
A. It is a bit of a nightmare, but thankfully we had a really first-class non-parametric statistical group (the Bickel group) who developed a robust, non-parametric (so it makes minimal assumptions about the distribution), conservative statistic based on reproducibility: the irreproducible discovery rate (IDR). This is not perfect. If one replicate has far better signal-to-noise than the other, it stops calling at the onset of noise in the noisier replicate, but this is generally a conservative bias. And for the sampling issues, we explored different thresholds and looked at saturation when we were relaxed on thresholds and then shifted to being conservative. Read the supplementary information and have a ball.
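The idea of thresholding on reproducibility rather than on raw signal can be illustrated with a deliberately simplified rank-consistency rule. This is NOT the IDR statistic (which fits a statistical model to the paired ranks); the function name, the overlap threshold and the toy scores are all hypothetical. It only shows the behaviour described above: calling stops as soon as the two replicates stop agreeing on their top-ranked peaks.

```python
from typing import Dict, List


def reproducible_peaks(
    rep1: Dict[str, float], rep2: Dict[str, float], min_overlap: float = 0.8
) -> List[str]:
    """Walk down the ranked peak lists and stop when the replicates disagree.

    Simplified rank-consistency rule for illustration, not the IDR method.
    """
    shared = sorted(set(rep1) & set(rep2))
    rank1 = sorted(shared, key=lambda p: rep1[p], reverse=True)
    rank2 = sorted(shared, key=lambda p: rep2[p], reverse=True)
    called = 0
    for n in range(1, len(shared) + 1):
        overlap = len(set(rank1[:n]) & set(rank2[:n])) / n
        if overlap < min_overlap:
            break  # onset of noise: the top-n lists no longer agree
        called = n
    top2 = set(rank2[:called])
    return [p for p in rank1[:called] if p in top2]


# Toy scores (hypothetical): three strong peaks, three noise-level peaks.
rep1 = {"a": 100.0, "b": 90.0, "c": 80.0, "d": 10.0, "e": 9.0, "f": 8.0}
rep2 = {"a": 99.0, "b": 95.0, "c": 70.0, "d": 1.0, "e": 3.0, "f": 2.0}
peaks = reproducible_peaks(rep1, rep2)
print(peaks)
```

A rule this crude is fragile at the very top of the list (a single swap at rank 1 stops everything), which is one reason a proper statistic such as IDR is preferable in practice.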
Q. [For 50% of the readers]: Ok, I buy the 20% of the genome is really doing something specific. In fact, haven’t a lot of other people suggested this?
A. Yes. There have been famous discussions about how regulatory changes – not protein changes – must be responsible for recent evolution, and about other locus assays (including about 10 years of RNA surveys). But ENCODE has delivered the most comprehensive view of this to date.
Q. [For the other 50% of readers]: I still don’t buy this. I think the majority of this is “biological noise”, for instance binding that doesn’t do anything.
A. I really hate the phrase “biological noise” in this context. I would argue that “biologically neutral” is the better term, expressing that there are totally reproducible, cell-type-specific biochemical events that natural selection does not care about. This is similar to the neutral theory of amino acid evolution, which suggests that most amino acid changes are not selected either for or against. I think the phrase “biological noise” is best used in the context of stochastic variation inside a cell or system, which is sometimes exploited by the organism in aspects of biology, e.g. signal processing.
It’s useful to keep these ideas separate. Both are due to stochastic processes (and at some level everything is stochastic), but these biologically neutral elements are as reproducible as the world’s most standard, developmentally regulated gene. Whichever term you use, we can agree that some of these events are “neutral” and are not relevant for evolution. This is consistent with what we’ve seen in the ENCODE pilot and what colleagues such as Paul Flicek and Duncan Odom have seen in elegant experiments directly tracking transcription factor binding across species.
Q. Ok, so why don’t we use evolutionary arguments to define “functional”, regardless of what evolution ‘cares about’? Isn’t this 5% of the human genome?
A. Anything under negative selection in the human population (i.e. recent human evolution) is definitely functional. However, even with this stated criterion, it is very hard to work out how many bases this is. The often-quoted “5%”, which comes from the mouse genome paper, is actually derived from fitting two Gaussians to the distribution of conservation between human and mouse in 50bp windows. So the figure really refers to 5% of those 50bp windows.
When you consider the number of bases being conserved, the figure must be lower than 5%, as we don’t expect 100% of the bases in these 50bp windows to be conserved. However, this is only about pan-mammalian constraint, and we are interested in all constraint in the human genome, including the lineage specific elements, so this estimate just provides a floor to the numbers. The end result is that we don’t actually have a tremendously good handle on the number of bases under selection in humans.
Some have tried other estimates of negative selection, trying to get a handle on the more recent evolution. I particularly like Gerton Lunter’s and Chris Ponting’s estimates (published in Genome Research), which give a window of between 10% and 15% of the bases in the human genome being under selection – though I note some people dispute their methodology.
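For readers unfamiliar with the "two Gaussians" approach, the following is a minimal sketch of fitting a two-component Gaussian mixture by expectation-maximisation and reading off the weight of the higher-scoring component. The simulated scores, the initialisation and the function names are assumptions made for illustration; the original mouse genome analysis is not reproduced here.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(
    values: List[float], iterations: int = 100
) -> Tuple[float, Tuple[float, float], Tuple[float, float]]:
    """Fit a two-component Gaussian mixture by EM.

    Returns (weight of higher-mean component, (mean, sd) low, (mean, sd) high).
    """
    n = len(values)
    lo_mean, hi_mean = min(values), max(values)
    lo_sd = hi_sd = max((hi_mean - lo_mean) / 4.0, 1e-6)
    weight_hi = 0.5
    for _ in range(iterations):
        # E-step: probability that each value belongs to the "high" component.
        resp = []
        for x in values:
            p_hi = weight_hi * normal_pdf(x, hi_mean, hi_sd)
            p_lo = (1.0 - weight_hi) * normal_pdf(x, lo_mean, lo_sd)
            total = p_hi + p_lo
            resp.append(p_hi / total if total > 0.0 else 0.5)
        # M-step: re-estimate the weight, means and standard deviations.
        n_hi = max(sum(resp), 1e-9)
        n_lo = max(n - n_hi, 1e-9)
        hi_mean = sum(r * x for r, x in zip(resp, values)) / n_hi
        lo_mean = sum((1.0 - r) * x for r, x in zip(resp, values)) / n_lo
        hi_var = sum(r * (x - hi_mean) ** 2 for r, x in zip(resp, values)) / n_hi
        lo_var = sum((1.0 - r) * (x - lo_mean) ** 2 for r, x in zip(resp, values)) / n_lo
        hi_sd = max(math.sqrt(hi_var), 1e-6)
        lo_sd = max(math.sqrt(lo_var), 1e-6)
        weight_hi = n_hi / n
    if hi_mean < lo_mean:  # keep the labels consistent with the docstring
        lo_mean, hi_mean, lo_sd, hi_sd = hi_mean, lo_mean, hi_sd, lo_sd
        weight_hi = 1.0 - weight_hi
    return weight_hi, (lo_mean, lo_sd), (hi_mean, hi_sd)


# Simulated per-window conservation scores (hypothetical): 95% background, 5% conserved.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(950)] + [rng.gauss(6.0, 1.0) for _ in range(50)]
conserved_weight, background, conserved = fit_two_gaussians(scores)
print(f"estimated fraction of windows in the conserved component: {conserved_weight:.3f}")
```

Note that the fitted weight is a fraction of windows, not of bases, which is exactly the distinction drawn above.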
By identifying those regions likely to be under selection (because they have specific biochemical activity) in an orthogonal, experimental manner, ENCODE substantially adds to this debate. By identifying isolated, primate-specific insertions (where we can say with confidence that the sequence is unique to primates), we could contrast the bases inside ENCODE-identified regions with those outside. As ENCODE data covers the genome, we now have enough statistical power to look at the derived allele frequency (DAF) spectrum of SNPs in the human population. The SNPs inside ENCODE regions show more very low frequency alleles than the SNPs outside (accurate genome-wide frequencies are available thanks to the 1000 Genomes Project), which is a characteristic sign of negative selection and is not influenced by confounders such as mutation rate of the sequence (see Figure 1 of the main ENCODE paper).
We can do that across all of ENCODE, or break it down by broad sub-classification. Across all sub-classifications we see evidence of negative selection. Sadly, it is not trivial to estimate the proportion of bases from derived allele frequency spectra that are under selection, and the numbers are far more slippery than one might think. Over the next decade there will, I think, be much important reconciliation work, looking at both experimental and evolutionary/population aspects (bring on the million-person sequencing dataset!).
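The derived allele frequency comparison can be sketched in a few lines. The frequencies below are invented for illustration and the 5% cut-off for "very low frequency" is an assumption, not the threshold used in the paper; the point is only that an excess of rare derived alleles inside a set of regions is the signature being looked for.

```python
from typing import List


def low_frequency_fraction(daf: List[float], threshold: float = 0.05) -> float:
    """Fraction of SNPs whose derived allele frequency is below the threshold."""
    if not daf:
        raise ValueError("need at least one SNP")
    return sum(1 for f in daf if f < threshold) / len(daf)


# Hypothetical derived allele frequencies for SNPs inside and outside annotated regions.
inside = [0.01, 0.02, 0.01, 0.04, 0.30, 0.03, 0.02, 0.60]
outside = [0.01, 0.20, 0.35, 0.04, 0.50, 0.10, 0.02, 0.70]

inside_low = low_frequency_fraction(inside)
outside_low = low_frequency_fraction(outside)
print(f"inside: {inside_low:.3f}, outside: {outside_low:.3f}")
```

Turning such a shift in the spectrum into a proportion of bases under selection is the hard part, and this sketch makes no attempt at it.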
Q. So – we’re really talking about things under negative selection in human – is that our final definition of “functional”?
A. If it is under negative selection in the human population, for me it is definitely functional.
I, and other people, do think we need to be open to the possibility of bases that definitely affect phenotypes but are not under negative selection – both disease-related phenotypes and other normal phenotypes. My colleague Paul Flicek uses the shape of the nose as an example; quite possibly the different nose shapes are not under selection – does that mean we’re not interested in this phenotype?
Regardless of all that, we really do need a full, cast-iron set of bases under selection in humans – this is a baseline set.
Q. Do you really need ENCODE for this?
A. Yes. Imagine that THE set of bases under selection in the human genome were dropped in your lap by some passing deity. Wonderful! But you would still want to know the how and why. ENCODE is the starting place to answer the biochemical “how”. And given that passing deities are somewhat thin on the ground, we should probably go ahead and figure out models of how things work so that we can establish this set of bases. I am particularly excited about the effectiveness of using position–weight matrices in the ENCODE analyses (my postdoc Mikhail Spivakov did a nice piece of work here).
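As background on position-weight matrices, here is a minimal sketch of building a log-odds matrix from base counts and scanning a sequence for its best-scoring window. The count matrix, pseudocount, uniform background and function names are all hypothetical, and this is a generic textbook construction rather than the specific analysis referred to above.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical base counts at each position of a 4 bp motif.
COUNTS: List[Dict[str, int]] = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def build_pwm(
    counts: List[Dict[str, int]], pseudocount: float = 1.0, background: float = 0.25
) -> List[Dict[str, float]]:
    """Convert counts to log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append(
            {b: math.log2(((column[b] + pseudocount) / total) / background) for b in "ACGT"}
        )
    return pwm


def best_hit(sequence: str, pwm: List[Dict[str, float]]) -> Tuple[int, float]:
    """Return (start, score) of the highest-scoring window on the given strand."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    score, start = max(scored)
    return start, score


pwm = build_pwm(COUNTS)
hit_start, hit_score = best_hit("TTACGTAA", pwm)
print(hit_start, round(hit_score, 2))
```

The sketch scans one strand only and expects a sequence of A, C, G and T at least as long as the motif.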
Q. Ok, fair enough. But are you most comfortable with the 10% to 20% figure for the hard-core functional bases? Why emphasize the 80% figure in the abstract and press release?
A. (Sigh.) Indeed. Originally I pushed for using an “80% overall” figure and a “20% conservative floor” figure, since the 20% was extrapolated from the sampling. But putting two percentage-based numbers in the same breath/paragraph is asking a lot of your listener/reader – they need to understand why there is such a big difference between the two numbers, and that takes perhaps more explaining than most people have the patience for. We had to decide on a percentage, because that is easier to visualize, and we chose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best conveys the difference between a genome made mostly of dead wood and one that is alive with activity. We refer also to “4 million switches”, and that represents the bound motifs and footprints.
We use the bigger number because it brings home the impact of this work to a much wider audience. But we are in fact using an accurate, well-defined figure when we say that 80% of the genome has specific biological activity.
Q. I get really annoyed with papers like ENCODE because it is all correlative. Why don’t people own up to that?
A. It is mainly correlative, and we do own up to it. (We did do a number of specific experiments in a more “testing” manner – in particular I like our mouse and fish transgenics, but not for everything.) For example, from the main paper: “This is an inherently observational study of correlation patterns, and is consistent with a variety of mechanistic models with different causal links between the chromatin, transcription factor and RNA assays. However, it does indicate that there is enough information present at the promoter regions of genes to explain most of the variation in RNA expression.”
Interestingly enough, we had quite long debates about language/vocabulary. For example, when we built quantitative models, to what extent were we allowed to use the word “predict”? Both the model framework and the precise language used to describe the model imply a sort of causality. Similarly, we describe our segmentation-based results as finding states “enriched in enhancers”, rather than saying that we are providing a definition of an enhancer. Words are powerful things.
Q. I am still skeptical. What new insights does ENCODE offer, and are they really novel? Most of the time I think someone has already seen something similar before.
A. I think that the scale of ENCODE – in particular the diversity of factors and assays – is impressive, and although correlative, this scale places some serious constraints on models. For example, the high, quantitative correlation between CAGE tags and histone marks at promoters limits the extent to which RNA processing changes RNA levels. (This is measured by 5’ ends – n.b. if there is a considerable amount of aborted transcription generating 5’ ends, this need not mean full transcripts, though this correlation is high both for nuclear isolated 5’ ends and cytoplasmic isolated 5’ ends.)
As for “someone has discovered it already,” I agree that the vast majority of our insights and models are consistent with at least one published study – often on a specific locus, sometimes not in human. Indeed, given the 30 years of study into transcription, I am very wary of putting forward concepts that don’t have support from at least some individual loci studies.
ENCODE has been selecting/confirming hypotheses that are broadly genome-wide, or multi-cell line true. ENCODE is a different beast from focused, mechanistic studies, which often (and rightly) involve precise perturbation experiments. Both the broader studies and the more focused studies help define phenomena such as transcription and chromatin dynamics.
This is all in the main paper, but then the network paper (led by Mike Snyder and Mark Gerstein) on transcription factor co-binding, the open chromatin distribution paper (led by Greg Crawford, Jason Lieb, John Stamatoyannopoulos), the DNaseI distribution paper (led by John Stamatoyannopoulos), the RNA distribution and processing paper (led by Roderic Guigo and Tom Gingeras) and chromatin conformation paper (led by Job Dekker) all provide non-obvious insights into how different components interact. And that’s just the Nature papers – there are another 30-odd papers to read. (We hope our new publishing innovation – “threads” – will help you navigate easily to the parts of all these papers you are most interested in reading.)
Q. You talk about how this will help medicine, but I don’t see this being directly relevant?
A. ENCODE is a foundational data set – a layer on top of the human genome – and its impact will be to make basic and applied research work faster and more cheaply. Because of our systematic, genome-wide approach, we’ve been able to deliver essential, high-quality reference material for smaller groups working on all manner of diseases. And in particular the overlap to genomewide association studies (GWAS) has been a very informative analysis.
Q. Moving to the disease genetics, were you surprised at this correlation with GWAS? The current GWAS catalog is about lead association study SNPs, and we don’t expect these to overlap with functional data.
A. This was definitely a surprise to us. When I first saw this result I thought there was something wrong with some aspect of the analysis! The raw enrichment of GWAS-lead SNPs compared to baseline SNPs (e.g. those from the 1000 Genomes Project) is very striking, and yet if the GWAS-lead SNPs are expected to be tagging (but not coincident with) a functional variant, you would expect little or no enrichment.
We ended up with four groups implementing different approaches here, and all of them found the same two results. First, that the early SNP genotyping chips are quite biased towards functional regions. By talking to some of the people involved in those early designs (ca. 2003), I learned some of this is deliberate, for instance favouring SNPs near promoters. But even if you model this in, the enrichment of GWAS SNPs over a null set of matched SNPs is still there. This is similar to that card in Monopoly: “Bank Error in your favour; please collect 10 Euros/Dollars/Pounds”. In this case, it is: “Design bias in your favour; you will have more functional variants identified in the first screen than you think”.
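The logic of testing enrichment against a matched null set can be sketched as follows. For each GWAS SNP we assume a pool of null SNPs matched on whatever properties drive the design bias (for example, presence on the genotyping chip or distance to a promoter); the SNP identifiers, the pools, the number of draws and the function name are all hypothetical, and none of the four groups' actual methods is reproduced here.

```python
import random
from typing import List, Set, Tuple


def enrichment_vs_matched_null(
    gwas_snps: List[str],
    matched_pools: List[List[str]],
    functional: Set[str],
    n_draws: int = 1000,
    seed: int = 0,
) -> Tuple[int, float, float]:
    """Compare functional overlap of GWAS SNPs with draws of matched null SNPs.

    matched_pools[i] holds candidate null SNPs matched to gwas_snps[i].
    Returns (observed overlap, mean null overlap, empirical one-sided p-value).
    """
    rng = random.Random(seed)
    observed = sum(snp in functional for snp in gwas_snps)
    null_counts = []
    for _ in range(n_draws):
        draw = [rng.choice(pool) for pool in matched_pools]
        null_counts.append(sum(snp in functional for snp in draw))
    mean_null = sum(null_counts) / n_draws
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (1 + sum(c >= observed for c in null_counts)) / (1 + n_draws)
    return observed, mean_null, p_value


# Toy data (hypothetical): 4 of 5 GWAS SNPs are functional; 1 in 10 matched SNPs is.
gwas = ["g1", "g2", "g3", "g4", "g5"]
pools = [[f"n{i}_{j}" for j in range(10)] for i in range(5)]
functional_snps = {"g1", "g2", "g3", "g4"} | {f"n{i}_0" for i in range(5)}

observed, mean_null, p_value = enrichment_vs_matched_null(gwas, pools, functional_snps)
print(observed, round(mean_null, 2), round(p_value, 4))
```

Because the null SNPs share the design bias of the chip, any enrichment that survives this comparison cannot be explained by that bias alone.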
We think that around 10% to 15% of GWAS “lead” loci are either the actual functional SNP in the condition studied or within 200bp of the functional variant. This is all great, but we can now do something really brilliant: break down this overall enrichment by phenotypes (from GWAS) and by functional type, in particular cell type (DNaseI) or transcription factor (TF). This matrix has a number of significant enrichments of particular phenotypes compared with factors or cell types. Some of these we understand well (e.g., Crohn’s disease and T-Helper cells); some of these enrichments are perfectly credible (e.g., Crohn’s disease and GATA-factor transcription factors); and some are a bit of a head-scratcher.
But the great thing about our data is that we didn’t have to choose a specific cell type to test or a particular disease. By virtue of being able to map both diseases and cell-specific (or transcription-factor-specific) elements to the genome, we can look across all possibilities. This will improve as we get more transcription factors and as we get better “fine mapping” of variants. This result for me alone is totally exciting: it’s very disease-relevant, and it leverages the unbiased, open, genome-wide nature of both ENCODE and GWAS studies to point to new insights for disease.
Q. You make a fuss about these new publishing aspects, such as “threads”. Should I be excited?
A. I hope so! The idea of threads is a novel attempt by us to help readers get the most out of this body of coordinated scientific work. Say you are only interested in a particular topic – say, enhancers – but you know that different groups in ENCODE are likely to have mentioned this (in particular the technical papers in Genome Research and Genome Biology). Previously you would have had to skim the abstract or text of all 30 papers to try and work out which ones were really relevant. Threads offer an alternative, lighting up a path through the assembled papers, pointing out the figures and paragraphs most relevant to any of 13 topics and taking you all the way through to the original data. The threads are there to help you discover more about the science we’ve done, and about the ENCODE data. Interestingly, this is something that’s only achievable in the digital form, and for the first time I found myself being far more interested in how the digital components work than in the print components.
The idea of threads came from the consortium, but the journal editors, in particular Magdalena Skipper from Nature, made it a reality – remember that in these threads we are crossing publishing house boundaries. I think the resulting web site and iPad App work very well. I am going to be interested to see how other scientists react to this.
Q. And what about this Virtual Machine. Why is this interesting?
A. An innovation in computing over the last decade has been the use of virtualization, where the whole state of a computer can be saved to a file, and transported to another “host” computer and then restarted. This has given us a new opportunity to increase transparency for data intensive science.
Many people have noted that complex computational methods are very hard to track in all their detail. We currently place a lot of trust in each other as scientists that phrases such as “we then removed outliers” or “we normalised using standard methods” are executed appropriately. The ENCODE virtual machine provides all these complex details explicitly in a package that will last at least as long as the open virtualization format we use (OVF, VirtualBox). So if you are a computational biologist in three years’ time, and you want to see the precise details of how we achieved something, you can run the analysis codes yourself. The only caveat to this is that for the large, compute-scale pipelines we have an exemplar processing step, and then have the results of this parallelised (i.e. we do not have a virtualised pipeline). Think of this as a bit like the ultimate materials and methods section of the paper. I believe this virtual machine substantially increases the transparency of this data-intensive science, and that we should produce virtual machines in the future for all data-intensive papers.
Q. I’ve read your Nature commentary about large projects, and admit that I’m uneasy about how these large projects throw their weight around. Isn’t there more friction and angst than you admit to?
A. There is indeed friction and angst, in particular with the smaller groups (“hypothesis testing groups, or R01 groups”) close to the scientific areas of ENCODE. I regret every instance of this and have tried my best to make things work out. After a lot of experience, I’ve realised a couple of things: Like any large beast, projects like ENCODE can inadvertently cause headaches for smaller groups. Part of this is actually due to third parties, for example reviewers of papers or grants who mistakenly think that the large datasets in ENCODE somehow replace or make redundant more focused studies. This is rarely the case – what the large project provides is a baseline dataset that is useful mainly for people who don’t have the time or inclination to do such a study and, importantly, who would not find it practical to do this work systematically (i.e. cutting to established, promising focus areas). ENCODE’s target audience is someone who needs this systematic approach, for example clinical researchers who might scan their (putatively) causative alleles or somatic variants against such a catalog. ENCODE does not replace the targeted perturbation experiment, which illuminates some aspect of chromatin or transcriptional mechanism (sometimes in a particular disease context). However, people less involved in this work can make the mistake of lumping together the mechanistic study and the catalog building as “doing ChIP-seq”, and assume they are redundant. As scientists in this area, both large and small groups need to regularly point out their explicit and non-overlapping complementarity.
Also, compared to some other scientific fields, genomics has a remarkably positive track record in data sharing and communication. We can do far better (more below), but everyone should be mindful that for all our faults, we do share datasets completely and openly, we nearly always share resources and techniques and we do communicate. Non-genomicists would be surprised sometimes at the depths of distrust in other fields. That said, there is always room for improvement. Although we did use pre-publication raw data sharing in ENCODE, we should have spent more time and effort sharing intermediate datasets (in addition to raw datasets). The 1000 Genomes Project provides an excellent example to follow.
Finally, I believe that the etiquette-based system of how to handle pre-publication data release (and I was a prominent participant in this discussion) is clumsy and out-moded: designed for a world where data generation – not analysis – is the bottleneck. I believe we need to have a new scheme. I’m not rushing to state my own opinion here – we need to have a deliberative process that balances getting broad buy-in and ideas with a timely and practical result.
Q. So ENCODE is all done now, right?
A. Nope! ENCODE “only” did 147 cell types and 119 transcription factors, and we need to have a baseline understanding of every cell type and transcription factor. Thankfully, NHGRI has approved the idea of pushing for this – not an unambitious task – over the next 5 years. I see there being three phases of ENCODE: the ENCODE Pilot (1% of the genome); the ENCODE scale-up (or production), where we showed that we can work at this scale and analyse the data sensibly; and next the ENCODE phase “build-out” to all cell types and factors.
Q. So you get to do this for another five years?
A. Someone does. I have hung up my ENCODE “cat-herder-in-chief” hat, and moved onto new things, like the equally challenging world of delivering a pan-European bioinformatics infrastructure (ELIXIR). But that’s for another blog post!
Q. Be honest. Will you miss it?
A. Looking back on my ten years with ENCODE, you know, I really am going to miss this. (Okay, maybe I won’t miss three-hour teleconferences running to 2am…). It has been hard work and excellent science – I’ve met and interacted with so many great scientists and have honestly had a lot of fun.