Compressing DNA: Part 1.

(First off, I am back to blogging, and this is the first of three posts
I’m planning about DNA compression.)

Markus Hsi-yang Fritz – a student in my group – came up with a reference-based sequence compression scheme for next-generation sequencing data. This has just come out in “proper” print format this month (it was previously fully available online, and it is full open access). Markus not only handled the bases, but also the other key components of a sequencing experiment – read pairing, quality values and unaligned sequence. For bases and read pair information, the compression is lossless. For the quality information and unaligned sequence one has to be lossy – if not, we can’t see a way of making an efficient compression scheme. But one can quite carefully control where one keeps and where one loses data (“controlled loss of precision” in the terminology of compression), meaning that one can save information around important regions (e.g., SNPs, CNVs etc.).
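To make the idea concrete, here is a minimal sketch of reference-based encoding in Python. It is not Markus's actual format: it assumes ungapped alignments, and the function names (`encode_read`, `decode_read`, `keep_qualities`) and the half-open region tuples are hypothetical, chosen only for illustration. Bases are stored losslessly as a position plus the differences from the reference, while quality values are kept only inside regions of interest.

```python
from typing import List, Optional, Tuple

# Hypothetical encoded form: (start on reference, read length, [(offset, base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Encoded:
    """Store a read as its start position plus its mismatches to the reference.

    Assumes an ungapped alignment (no insertions or deletions).
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    mismatches = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the exact read from the reference and the stored differences."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


def keep_qualities(qualities: List[int], pos: int,
                   regions: List[Tuple[int, int]]) -> List[Optional[int]]:
    """Lossy step: keep a quality value only if its reference coordinate lies
    in one of the half-open (start, end) regions of interest; otherwise drop it."""
    return [
        q if any(start <= pos + i < end for start, end in regions) else None
        for i, q in enumerate(qualities)
    ]


reference = "ACGTACGTACGT"
read = "GTAGGT"  # aligns at position 2 with one mismatch
encoded = encode_read(reference, read, 2)
restored = decode_read(reference, encoded)
kept = keep_qualities([30, 31, 32, 33, 34, 35], 2, [(5, 6)])
```

A read that matches the reference perfectly costs only a position and a length, which is where the saving comes from. The region list stands in for whatever one decides is important (a SNP site, say).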


Orthologs and Paralogs

I am sitting in a talk (Interactome meeting) and the speaker is using InParanoid orthologs. At Ensembl we’ve adopted the TreeFam scheme for ortholog definition, and after a lot of sweat to create statistics that assess the difference between ortholog sets, there is not a huge difference between InParanoid and TreeFam/Ensembl ortholog calls. (TreeFam/Ensembl is a little better, of course, but it always amazes me how good “simple” approaches can be.) But the real benefit of the TreeFam scheme is the use of genuine phylogenetic trees rather than just ortholog lists. The tree is the best way to represent the evolution of the gene family.


Sequence align view

A recent addition to Ensembl has been sequence alignview, to handle resequencing information. An example link is:
http://www.ensembl.org/Homo_sapiens/sequencealignview?gene=ENSG00000139618;individuals=HuAA;individuals=HuBB;individuals=HuCC
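For anyone who wants to build such links programmatically, here is a small Python sketch. It only reproduces the pattern visible in the example link above (semicolon-separated parameters, with `individuals` repeated once per individual); the function name `sequence_align_url` is hypothetical, and I am assuming nothing about which other parameters the view accepts.

```python
from typing import List
from urllib.parse import quote


def sequence_align_url(gene: str, individuals: List[str],
                       species: str = "Homo_sapiens",
                       base: str = "http://www.ensembl.org") -> str:
    """Build a sequencealignview link in the same shape as the example link.

    Parameters are joined with ';' and 'individuals' is repeated per individual,
    mirroring the example; this is an assumption drawn from that one URL.
    """
    params = [("gene", gene)] + [("individuals", name) for name in individuals]
    query = ";".join(f"{quote(key)}={quote(value)}" for key, value in params)
    return f"{base}/{species}/sequencealignview?{query}"


url = sequence_align_url("ENSG00000139618", ["HuAA", "HuBB", "HuCC"])
```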

The framework for this data has been in place for a while. Now we have probably the most obvious display of it – a multiple alignment of individuals or strains. For human individuals, as well as the 4 “Celera” humans, we will have Craig Venter’s genome and Jim Watson’s genome in soon. (There has been a persistent rumour that one of the 4 Celera individuals was Craig, so that probably gives us 5 individuals overall, and only two, Craig and Jim, with high enough coverage to call heterozygous positions.)


(no title)

This is a little bit of a distraction for me – I should be doing other things, but I am between two quite complex documents (reviewing a long article and writing yet another strategy document) so I decided to dip my foot into blogging again. There are two real motivations for this. Firstly, there are quite a few “general” things that I muse about which I would like other people to have easy access to – currently the only people who get to give me feedback on some of these ideas are those I happen to have coffee with. Secondly, blogs are clearly a way to keep the brands one knows and loves high in the Google rankings and possibly also high in people’s online surfing habits, and that’s something I want to do – especially for Ensembl, but also for my other projects: Reactome and the projects I’ve grandfathered – Pfam, Bioperl and the rest. So – expect numerous musings for both reasons.

Right. Now back to “real” work.