EBI as a data refinery

In describing what the EBI does, it is sometimes hard to provide a feel for the complexity and detail of our processes. Recently I have been using an analogy of the EBI as a “data refinery”: it takes in raw data streams (“feedstocks”), combines and refines them, and transforms them into multi-use outputs (“products”). I see it as a pristine, state-of-the-art refinery with large, imposing central tanks (perhaps with steam venting here and there for effect) from which massive pipes emerge, covered in reflective cladding and connected in refreshingly sensible ways. In the foreground are arrayed a series of smaller tanks and systems, interconnected in more complex ways. Surrounding the whole are workshops, and if you look closely enough you can see one group of workers decommissioning one part of the system whilst another builds a new one.

Image: Oil refinery on the north end of March Point; Mount Baker

I find this analogy useful for a number of reasons. First, a “product” is often itself a feedstock, which is why the EBI has so many complex cycles of information. For example, InterPro member database models and patterns are feedstocks for the InterPro entries; during refinement they become associated with one another, with documentation and with Gene Ontology (GO) assignments. InterPro takes in UniProt (UniParc) protein sequences and combines them with models to provide boundaries on proteins; these in turn allow the ‘InterPro2GO’ GO assignment process to occur. This automatic GO annotation is then applied to the UniProtKB entries along with experimentally defined GO annotations, which come from GO curators worldwide and include many entries about model organisms. The InterPro entries additionally provide raw information (feedstock) for the UniRule automatic annotation, where InterPro matches are the mainstay condition of a particular rule. The UniProt curator combines these matches with other conditions, such as taxonomic restrictions and sequence properties, to ensure the most accurate application of the annotation extracted from the experimentally proven UniProtKB entries to the proteins of unknown function.

This is a complex network of inputs and outputs. (Just writing it down and trying to keep it all straight is exhausting unless you are part of it; I went through a couple of rounds with Claire O’Donovan and Sarah Hunter to get the above flow absolutely straight.) But the main input, bare protein sequences (coming from internal feedstocks including ENA and Ensembl), is being converted into the main output: annotated protein entries, with human-readable annotation and a careful audit trail of their ‘refinement’. This is what the user sees as the output of the refinery, and the user understandably does not want to spend too much time worrying about the details of pipe connectivity inside the refinery.

Another reason I find the refinery analogy useful is that volume can be deceptive. The biggest, most impressive tanks in this refinery are filled with DNA sequence data, but for the refinery to work as a whole it needs many “specialist” chemicals, in lower volumes, to serve as critical catalytic components. It might be necessary for the refinery to make and store some components in order to streamline a more complex flow of information. The EBI works with key “catalyst” streams of information that have a disproportionate impact relative to their volume (e.g. the assignment of experimentally defined annotation described above).

A deceptive view of this refinery would focus exclusively on the final outputs and the most recent refinement process, without taking in the intricate web of components behind them. People might use Reactome or IntAct to understand a particular functional dataset, but the protein information in these resources depends on UniProt to track and annotate these sequences. The protein information in UniProtKB is dependent on the ENA database smoothly accepting submissions with annotated CDS proteins present. In this way, asking to visualise, say, phosphoprotein results on a pathway diagram is not as simple as it might seem. It implicitly draws on many of the tanks in the EBI refinery. This larger network actually goes beyond the EBI’s borders to its worldwide collaborators (e.g. wwPDB, or the INSDC’s GenBank/ENA/DDBJ).
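To make this concrete, the dependencies mentioned in this post can be written down as a small graph and walked upstream. The Python below is only a sketch: the `FEEDSTOCKS` map and the `upstream` function are hypothetical names chosen for illustration, and the edges are a simplified reading of the text above, not an actual EBI data model.

```python
from collections import deque

# Hypothetical, simplified map: resource -> the feedstocks it draws on directly.
# The edges paraphrase this post; they are not an official EBI data model.
FEEDSTOCKS = {
    "Reactome": ["UniProtKB"],
    "IntAct": ["UniProtKB"],
    "UniProtKB": ["ENA", "InterPro2GO", "UniRule"],
    "InterPro2GO": ["InterPro"],
    "UniRule": ["InterPro"],
    "InterPro": ["InterPro member databases", "UniParc"],
}


def upstream(resource, graph=FEEDSTOCKS):
    """Return every tank that `resource` depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(resource, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; this also keeps cycles from looping forever
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


if __name__ == "__main__":
    # List everything a pathway view in Reactome implicitly relies on.
    for name in sorted(upstream("Reactome")):
        print(name)
```

Even in this toy version, a question asked of Reactome reaches back through UniProtKB to ENA and the InterPro member databases, which is the point of the paragraph above.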

The final “product” that the user sees often has a local manufacturer (i.e., a bioinformatician or computational biologist) who pulls in information from the large tanks and combines it with local data to provide an overall picture and give context. Often, the research group querying EBI data does not worry too much about the details of how the refinery works, or about its complex inter-dependencies; they just want easy access to a product they can rely on. It is the job of the EBI, and in future will be the job of ELIXIR, to satisfy this desire.

A refinery does not stay still. In each process, engineers (in our case bioinformaticians and software engineers) work to improve minor, everyday things and to carry out major retooling. New types of experimental information might require a new tank and pipelines, or become cheap enough to replace older feedstocks, in both cases opening up potential for new, useful products. New discoveries might change the way processes or transformations are handled, perhaps by adding a certain catalyst at a particular stage to improve the products.

Clearly the EBI is not the only refinery. Our European partners, such as SIB and Sanger, collaborate so closely with us on key projects that it’s hard to work out where one refinery stops and the other begins. We exchange data and expertise regularly with large refineries in the US and Japan, such as NCBI, UCSC, NIG and RCSB. It is exciting to see all of the proto-refineries in Europe, which offer different core competencies, coalescing into a single, robust refinery: ELIXIR.

Like all analogies, this is not perfect. The concept of free data sharing, which is at the heart of molecular biology, does not fit well with this analogy. Although the complex process of providing the necessary CPU, disk and network has some resonance with the internal “plant” infrastructure, the fact that it is so generic and tradable does not. The EBI’s products are also directly used via the web, often without much intermediation (no need for a network of gas stations, etc.). Nevertheless, the picture of a complex interplay of inputs being progressively refined is helpful when trying to disentangle some of our trickier problems.

I welcome feedback on this analogy, and on the extent to which it helps one understand the EBI.