Patients who contribute their data to research are primarily motivated by a desire to help others facing the same plight, through the development of better treatments or even a cure. Out of respect for these individuals, and to uphold the fundamental tenets of the scientific process, I’d like the clinical trials community to shift its default position on data sharing and reuse to align with data availability on publication, as in the life-science community. This will enable more robust, rigorous research, create new opportunities for discovery and build trust between patients and scientists.
I have tweeted prolifically about the UK Referendum on membership in the European Union, strongly supporting the REMAIN (staying in the EU) campaign. In response to requests for a more substantial explanation of my position, I present here a short version and a long version of my views.
This is the third and final post in a series in which I share some lessons learned about how to plan, manage, analyse and deliver a ‘big biodata’ project successfully.
Now that you have the results of your carefully planned, meticulously managed and diligently analysed experiment, it’s time to decide on what to publish, and where.
This is the second of three blog posts about planning, managing and delivering a ‘big biodata’ project. Here, I share some of my experience and lessons learned in management and analysis – because you can’t have one without the other.
Biology has changed a lot over the past decade, driven by ever-cheaper data-gathering technologies: genomics, transcriptomics, proteomics, metabolomics and imaging of all sorts. After a few years of gleeful abandon in the data generation department, analysis has come to the fore, demanding a whole new outlook and ongoing collaboration between scientists, statisticians, engineers and others who bring to the table a very broad range of skills and experience.
I have written about the rise of human as a first-class model organism, and am an enthusiastic user of this outbred, large vertebrate, which can walk right into pre-funded phenotyping centres (hospitals). However, some scientists are (somewhat flippantly) predicting the complete ‘demise of all non-human model organisms’, only conceding the necessity of using mouse in impossible-in-human verification experiments. Although such positions tend to be put forward in jest, their underlying argument resonates: given our obsession with human health, and how much we can do in humans – with broad outbred genetics, iPSC cell lines and organoids – why should we bother with other systems?
In 1799 George Shaw, the head of the Natural History Museum in London, received a bizarre pelt from a Captain in Australia: a duck bill attached to what felt like mole skin. Shaw examined the specimen and wrote up a description of it in a scientific journal, but he couldn’t help confessing that it was “impossible not to entertain some doubts as to the genuine nature of the animal, and to surmise that there might have been practised some arts of deception in its structure.” Hoaxes were rife at the time, with Chinese traders stitching together parts of different animals – part bird, part mammal – to make artful concoctions that would trick European visitors. Georgian London was becoming rather skeptical of these increasingly fantastical pieces of taxidermy.
The first step was developing a routine way to determine the order of the chemicals in the DNA polymer: sequencing. Fred Sanger, a gifted scientist and the only person with two Nobel prizes in the same field under his belt, developed dideoxy-sequencing (a.k.a. “Sanger sequencing”) at the LMB in the 1970s. His laboratory, along with neighbouring LMB labs including Sydney Brenner’s, produced a new generation of scientists: John Sulston, Bart Barrell, Roger Staden and Alan Coulson, who forged ahead towards the seemingly unattainable goal of sequencing whole organisms – with human in their sights. First, they did the different bacteriophages (see my First Genome of Christmas). Then, in the 1980s John Sulston and colleagues started on mapping then sequencing the worm (see the Second Genome of Christmas).
Of course this was not just a UK effort; many US scientists were involved in genomics. A scientist and technology developer, Lee Hood, looked at how to remove the radioactivity that came with Sanger sequencing, and created fluorophore-based terminators. These were far safer and, importantly, amenable to automation. This led to the ABI company’s production of automated sequencers, which featured a scanning laser-based readout. Back in the UK, Alec Jeffreys made a serendipitous discovery: minisatellites – highly variable regions in the human genome that provided easy-to-determine genetic markers. This led to the rise of forensic DNA typing (first used in a criminal case near Alec’s native Leicester, to provide evidence in a double murder). A group of enterprising geneticists in France, led by Jean Weissenbach, used related repeat markers – microsatellites – to generate the first genome-wide genetic map, based around Mormon families in Utah who had kept impeccable family records. Clinician scientists were starting to use genetics actively: the first genetic diseases to be characterised molecularly were a set of haemoglobinopathies (blood disorders such as sickle cell anaemia). In these cases, the clinicians were lucky that it was easy to track the protein itself as a genetic marker. A landmark breakthrough, by Francis Collins and colleagues, was the cloning of the gene for cystic fibrosis using only DNA-based “positional” techniques, without knowing the actual defective protein. This was, at last, a clear, practical application of genomics.
From 1985 through the first part of the 1990s, all of these technologies and uses of DNA were improving, and it became increasingly clear that it was at least possible to consider sequencing the entire genome. However, this was still more of a sheer cliff than a gentle slope to climb. The human genome has three billion letters, a million-fold larger than bacteriophages and 30 times larger than the worm. If the human genome was going to be tackled, it was going to take a substantial, coordinated effort. Debates raged about the best technologies and approaches, the right time to invest in production vs developing better technology, and who, worldwide, would do what.
By the mid-1990s things had settled down. The step-by-step approach used in the worm was clearly going to succeed, and there was no reason not to see the same approach working in human. The approach of mapping first, then sequencing was also compatible with international coordination, whereby each chromosome could be worked on separately without people treading on each other’s toes. There was some jostling about which groups should do which chromosomes (the small ones were claimed first, unsurprisingly), and some grumbling about people reaching beyond their actual capacity, but it was all on track to deliver around 2010.
- The Sanger Centre (now the Sanger Institute), led by John Sulston with Jane Rogers and David Bentley as key scientists, funded by the Wellcome Trust, a UK charity;
- US Department of Energy (DOE)-funded groups around the Bay Area in California (now the Joint Genome Institute, JGI), with Rick Myers in the early stages and Eddy Rubin pulling the configuration together;
- Three US National Institutes of Health (NIH) centres, with oversight from Francis Collins, director of the NIH’s National Human Genome Research Institute:
- The Washington University genome center in St Louis, led by Bob Waterston with Richard Wilson and Elaine Mardis as key scientists (this was the Sanger’s sister group on the worm as well);
- Mathematician-turned-geneticist (and part time entrepreneur), Eric Lander, who formed the Whitehead Genome centre as part of MIT (now the Broad Institute);
- An Australian transplanted into Texas, Richard Gibbs, at the Baylor genome centre.
For a few years, the Human Genome Project followed a steady rhythm: large-scale physical mapping followed by sequencing. Chromosome 22 was the first to be sequenced, by the Dunham team at the Sanger Centre. I remember poring over the sequence and gene models of this tiny human chromosome and thinking just how big the task ahead of us was. Chromosome 21 was heading to completion, and many other larger chromosomes were slowly being wrangled into shape.
Then, the sequencing world was turned upside down.
Craig Venter, a scientist/businessman, had been around the academic genomics world for some time, and realised perhaps better than anyone else the potential impact of automation. He had already published the first whole-genome shotgun bacterial genome and, inspired by a paper from Gene Myers (a computer scientist who had worked on text analysis and was moving into biology), realised that a similar approach could work on human. Craig assembled an excellent set of scientists – Gene Myers, Granger Sutton and Mark Adams among others – and persuaded leading technology company ABI to set up a new venture to sequence the human genome – privately. This was at the end of the 1990s, at the start of the dotcom boom, when it was anyone’s guess what a viable business model would be. Certainly, holding a key piece of information for biomedical research 10 years before the public domain effort looked a pretty good bet. Celera was born, raised a substantial amount of money on the US stock market and purchased a massive fleet of sequencers and computers.
Naturally, this was quite a shock to the academic project. I remember John Sulston gathering all of the Sanger Centre employees in the auditorium (I was a PhD student at the time) and telling us that this was a good thing – but complex. Behind the scenes there were all manner of discussions, best read about in one of the numerous books that came out. By my own recollection, there was a sneaking respect for Craig’s sheer chutzpah, coupled with a massive sense that one simply couldn’t have one organisation – and certainly not a company – own this key information.
The academic project also responded to the new, higher-pressure timeline. Rather than keeping with the map-first, sequence-second approach, people switched to sequence-and-map as one scheme, but still with mid-size pieces (BACs – around 100,000-letter regions) rather than reads (only 500 letters at a time). This was a half-way point towards whole-genome shotgun and, critically, allowed the five major centres to accelerate their production rate. The nice map with flags across the genome basically disappeared (though each chromosome would then be mapped and finished) and the five centres ploughed onwards, leaving footprints all over the nice, tidy, well-laid plan.
But this acceleration of rate caused another problem: bottlenecks in the downstream informatics. Celera started to crow a bit about their depth of human talent in computer science and the size of their computer farm. This became a real issue. The public project was facing a very real headache of having thousands of fragments of the genome without any real way to put them together. My supervisor, Richard Durbin, was the lead computational person at Sanger and stepped up along with other academic groups, notably the creative, enthusiastic computer scientist David Haussler in Santa Cruz. David and Richard had worked on and off on all sorts of things, bringing in parts of computer science methods into biology, and they – with us, their groups – began to try and crack this problem.
The first problem was assembly. Previously, we were guided by a “physical map” and assembly was effectively done by hand on a computer-based workbench. This needed to change. David was joined by ex-computer-gaming programmer Jim Kent, who felt he could do this. I remember discussing the details of assembly methods and concepts on a phone call, with Jim enthusiastically claiming it was doable and everyone agreeing that Jim should come to the Sanger Centre for a while to absorb the details of overlaps, dispersed repeats and other Sanger genome lore. He packed his bags and left that day, appearing 12 hours later in Hinxton: a jovial, very definitely west-coast American, ready to get to work. Jim worked solidly for about six months (back in Santa Cruz) to create the “golden path” assembler, which provided the sequence for the public project. Jim also created the UCSC Browser, which remains one of the premier ways to access the human genome (though of course I am partial to a different, leading browser…).
And it didn’t stop there. The public project and the private Celera project were now really swapping insults in public, and Celera said that even if the public project could assemble their genome, they wouldn’t be able to find the genes in this sequence. Thankfully, three of us – Michele Clamp, Tim Hubbard and myself – had already started a sort of ‘skunk-works’ project at Sanger to automatically annotate the genome. The algorithmic core was a program I had written, GeneWise, which was accurate and error-tolerant but insanely computationally expensive. Tim had an (in retrospect, bonkers) cascading file system to try to match the raw computation with the arrival of data in real time. Michele was the key integrator. She was able to take Tim’s raw computes, craft the right approximation (described as “Mini-seq”) and pass it into GeneWise. This started to work, and we made a website around it: the Ensembl project, which provided another way to look at the genome. (Mini-seqs and GeneWise still hum away in the middle of Ensembl gene builds, and are responsible for the majority of vertebrate and many other gene sets.)
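The Mini-seq idea can be sketched in miniature: use cheap similarity hits to carve out small windows, then run the expensive dynamic-programming step only inside those windows. This is a toy illustration of the general approach, not the real Ensembl code – all function names, positions and sizes here are invented.

```python
# Toy sketch of the "Mini-seq" approximation: rather than running an
# expensive alignment method (like GeneWise) across a whole chromosome,
# cheap homology hits first narrow the search to small windows, and the
# costly step only ever sees a few kilobases at a time.

def mini_seqs(genome, hit_positions, flank=1000):
    """Extract a small window of sequence around each cheap homology hit."""
    windows = []
    for pos in hit_positions:
        start = max(0, pos - flank)
        end = min(len(genome), pos + flank)
        windows.append((start, end, genome[start:end]))
    return windows

# A billion-letter genome is far too big for quadratic dynamic
# programming; a handful of ~2 kb windows is entirely tractable.
genome = "A" * 1_000_000           # stand-in for a chromosome
hits = [12_500, 480_000, 900_000]  # stand-in for cheap similarity hits

windows = mini_seqs(genome, hits)
total = sum(end - start for start, end, _ in windows)
print(total)  # the expensive aligner now sees 6,000 letters, not 1,000,000
```

The design point is simply that the cheap step bounds the work of the expensive one: the costly alignment scales with the windows, not with the genome.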
Even more surreally for me, the corresponding Celera annotation project was also using GeneWise (I had released it open source, as I did with everything I wrote), so I would have a list of bugs and issues from Michele and Ensembl during the day, and then a list of bugs and issues from Mark Yandell and colleagues at Celera overnight. The friendliness and openness of the Celera scientists – Gene, Mark Adams and Mark Yandell – was at complete odds with the increasingly bitter public stance between the two groups.
It was an intense but fun time. Michele and I worked around the clock to provide a sensible model of the genome and features (using – radically at the time – an SQL backend), and there were constant improvements to how we computed, stored and displayed information. We’d often work all day, flat out, and then head back to Cambridge, often to Michele’s house, where we’d snatch a quick bite and watch the latest set of compute jobs fan out across the new, shiny compute farm bought to beef up Ensembl’s computational muscle. Michele’s partner (now husband) James ran the high-end computers, so if anything went wrong – from system through algorithm to integration – one of us was on hand to fix it. As the first jobs came back successfully, we would slowly relax, and eventually reward ourselves with a gin and tonic as we continued to keep one eye on the compute farm.
Eventually it became clear that both projects were going to get there – pretty much – in a dead heat. Given that the public project’s data could be integrated into the private version, Celera switched data production efforts to mouse, much to Gene Myers’ annoyance as he wanted to show that he could make a clean, good assembly from a pure whole-genome shotgun. There was a brokering of a joint statement between Celera and the public project, and this led to a live announcement from the White House by Bill Clinton, flanked by Craig Venter (private) and Francis Collins (public), with a TV link to Tony Blair and John Sulston in the UK.
One figure in this announcement came from our work: the number of human genes in the genome. This is a fun story in itself – I can’t do justice to it now – involving wild over-estimation for over two decades followed by extensive soul-searching as the first human chromosomes came out. I ended up running a sweepstake for the number whereby, in effect, we showed that in the absence of good data, even 200 scientists can be completely wrong. For the press release, it was our job to come up with an estimate of the number of human genes, so Michele launched our best-recipe-at-the-time compute. Bugs were found and squashed, and I remember hanging around, providing coffee and chocolate to Michele as needed (there is no point really in trying to debug someone else’s code in a pressurised environment). Eventually an estimate popped out: around 26,000 protein-coding genes.
We looked at each other and shook our heads – clearly too low, we thought, and went into the global phone conference where the good and the great of genomics said “too low” as well. So we went back and calculated all sorts of other ways there could be more protein coding genes (after all, a biotech called Incyte had been selling access to 100,000 human genes for over five years). We ended up with the rather clumsy phrase, “We have strong evidence for around 25,000 protein-coding genes, and there may be up to 35,000.”
In retrospect, Michele and I would have done better to stick to our guns and go with the data. In fact, we now know there are around 20,000 protein-coding genes (though there are enough complex edge cases that we still don’t have a final number, even today).
The human genome was done in a rush, with enthusiasm, twice, in both cases in such a complex way that no other genome would be done like this again. In fact, Gene Myers was right. Whole-genome shotgun was “pretty good” (though purists would always point out that if you wanted the whole thing, it wouldn’t be adequate). The public project, John Sulston above all, was right that this information was for all of humanity, and should not be controlled by any one organisation.
I was very lucky to be at the right place at the right time to be a part of this game-changing time for human biology. Crazy days.
Mice got their start in the genetics laboratory in a rather eccentric collaboration between a Harvard geneticist (W. E. Castle) and a fancy-mouse breeder (Abbie Lathrop), who provided a series of mice with specific traits, such as Japanese waltzing mice. Abbie arguably ran the world’s first-ever mouse house on her farm in Massachusetts. A student of Castle, C. C. Little, got involved in studying mice and transformed a small hamlet on the coast of Maine, Bar Harbor, into a research laboratory, later named the “Jackson Laboratory” after a generous donor. The Jackson lab (shortened to “Jax”) is still one of the world’s premier mouse research sites.
Mice are excellent mammalian models: they really do have all the cell types, tissues and organs that human has, and so many features (though not all) of human biology, from cellular to physiological, can be replicated and studied in this animal. But it is the detailed control we have over the mouse genome that makes it an exceptional species for helping us understand biology. This control is thanks to two key developments. First, mouse embryonic stem cells can be produced easily, giving cells (which you can keep in a petri dish) that can be coaxed into making viable embryos. These embryos can be implanted in pseudopregnant mice and grow into full adult individuals. Second, one can swap pieces of DNA in and out of these stem cell lines at will – almost as easily as in yeast (and certainly more easily than in fly or worm).
Mouse is also likely to lead us in future to a more graph-based view of reference genomes. Because there are inbred lines of mice, one can really talk about “individual” genomes in a solid way, knowing that others can ‘order up’ the same strain and work on it. Thomas Keane and colleagues have been building out the set of mouse strains beyond Black6, producing increasingly independent assemblies, strain by strain. The resulting set of individual sequences clearly shows the complex origin of laboratory mice: at any given point in the genome, some pairs of strains are as divergent as two species, while others look more like two individuals from a single population. This complex web is best represented as a graph of sequences, rather than as a set of edits from one reference, which is the current mode.
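The core of a sequence graph can be sketched in a few lines: shared stretches of sequence become nodes, and each strain is simply a path through them, so divergent regions live side by side rather than as edits against one privileged reference. This is a toy illustration under invented sequences and strain names, not any real graph-genome toolkit.

```python
# Toy sequence graph: segments shared between mouse strains are stored
# once, and each strain's genome is an ordered path of segment ids.
# All sequences and strain names are invented for illustration.

segments = {
    "s1": "ACGT",  # segment shared by all strains
    "s2": "TTGA",  # allele carried by one strain
    "s3": "TTCA",  # divergent allele carried by another strain
    "s4": "GGCC",  # shared tail segment
}

# A strain is just a path through the graph.
paths = {
    "Black6": ["s1", "s2", "s4"],
    "CAST":   ["s1", "s3", "s4"],
}

def strain_sequence(strain):
    """Reconstruct a strain's linear sequence by walking its path."""
    return "".join(segments[seg] for seg in paths[strain])

print(strain_sequence("Black6"))  # ACGTTTGAGGCC
print(strain_sequence("CAST"))    # ACGTTTCAGGCC
```

The appeal of this representation is that no strain is privileged: adding a new strain means adding new segments and a new path, whereas the edits-from-one-reference model forces every strain to be described relative to Black6.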
In 1787 Chobei Zenya (from Kyoto) wrote a book, “The Breeding of Curious Varieties of the Mouse”, which apparently contained breeding “recipes” for producing particular coat colours. There are far earlier documents from China on mouse strains, including the “waltzing” mouse (which we now know reflects a neurological condition). In some sense this is both the rootstock of this laboratory species and part of the motivation for the discovery of evolution and genetics (though Darwin spent more time looking at pigeons than mice).
Given the laboratory mouse’s flexible genetic manipulation, we will be studying this species for at least another 200 years.
When Mendel’s laws were rediscovered in the 1900s, many scientists turned to local species they could keep easily to explore this brave new world of genetics. In America, Thomas Hunt Morgan chose the fruit fly. Scientists in Germany explored the guppy and guinea pigs. In England, crop plants were the focus of early genetics. In Japan, researchers turned to the tiny Medaka fish, a common addition to many of the ornamental ponds maintained in Japanese gardens.
Medaka also has the honour of being the first organism to show us that cross-over on the sex chromosomes does occur. We now know this to be commonplace, but at the time of its discovery this was a novel observation.
As genetics developed, Japanese researchers continued to inbreed Medaka fish, creating one of the most diverse sets of inbred individuals from a single vertebrate species anywhere in the world. Being fish, they have all the cell types and nearly all the organs that a mammal has: tiny, two-chambered hearts, livers, kidneys, muscles, brains, bones and eyes. Conveniently, one can keep lots and lots of them, far more cheaply than mice, and they reproduce regularly, with a generation time of around three months.
But then a different fish rose to prominence in molecular biology in the 1980s. Zebrafish, a native of the Ganges, was chosen by the influential Christiane Nüsslein-Volhard as the basis for redoing her Nobel-Prize-winning forward genetic screens in Drosophila, this time in a vertebrate.
So why am I so interested in Medaka? Well, I was having a beer with my colleague Jochen Wittbrodt, who is one of the rare Medaka specialists outside of Japan, and we were discussing the next stage of experiments. Medaka has a neat trick by which one can introduce foreign DNA (e.g. human) coupled to a reporter (green fluorescent protein from jellyfish is a favourite – easy to pick up using a microscope). Even on the first injection, the foreign DNA will often go into every cell. For most other species, you have to get lucky for the foreign DNA to go into the germline, and then hope it will breed true. Jochen had done a number of successful reporter experiments based on designs from my group, and we were discussing whether we could draw on the long history of Medaka research, with its rich tapestry of inbred lines, to explore the impact of natural variation on these reporter experiments. So, I asked him how many inbred Medaka lines there were, and Jochen nonchalantly replied that he had no idea – after all, his colleague, Kiyoshi Naruse, made one or two new lines from the wild every year or so.
My jaw hit the floor. From the wild? I checked. Jochen confirmed. And then I explored some more, and discovered that there was a whole protocol for creating inbred individual Medaka from the wild.
This might sound trivial, but it is not. Keeping vertebrates in a laboratory is hard. Keeping them in a laboratory when they are inbred, such that their diploid genome is identical everywhere, is extremely difficult. Doing this routinely from the wild is basically unheard of (although this “self’ing” happens all the time in plant genetics).
But in Medaka, it could be doable. Impressive.
Jochen introduced me to Felix Loosli, the best Medaka breeder outside of Japan, and Kiyoshi Naruse, one of the leading breeders in Japan. The four of us have undertaken to generate and characterise a Medaka inbred panel from a single wild population (unsurprisingly, very close to Kiyoshi’s lab, in Nagoya).
Watch this space.