#coolscience digest

my take on the cool science papers of the day...

Discovery and characterization of chromatin states for systematic annotation of the human genome.

Ernst & Kellis.  Nature Biotechnology 28 (8), 817-825, 2010.

 

This paper has wide implications for our basic understanding of the function of epigenetic marks and the genome itself; super cool science! The study of epigenetics has been around for a while but only recently has the technology advanced sufficiently to allow us to really grasp its importance.  Epigenetics is (from Wikipedia):

the study of changes produced in gene expression caused by mechanisms other than changes in the underlying DNA sequence –hence the name epi- (Greek: επί- over, above) -genetics.

The most common (well studied) epigenetic modifications include DNA methylation and histone modifications.  DNA methylation can directly affect the transcriptional activity of a given genomic region.  Histone modifications, such as acetylation and methylation, cause the DNA to unwind or contract on the histones, increasing or decreasing its accessibility to transcription factors etc.  These marks can happen at many locations on the N-terminal tail of the histone and can be compounded so that, for example, a histone residue can be mono-, di- or tri-methylated.  There are several hypotheses as to how these marks might interplay (or not) with each other.  Could specific combinations lead to unique biological functions, or do they all do the same thing, with the presence of additional marks occurring together acting to enhance stability and robustness of the signal?

The authors of this paper gathered existing data to compile a ChIP-seq data set representing 41 marks in human CD4 T cells.  The data included various acetylation and methylation marks as well as RNA pol 2, the H2AZ histone variant and CTCF (a zinc finger protein known to bind DNA insulator regions).  The data is genome-wide at a resolution of 200bp.  The authors then used a multivariate hidden Markov model (HMM) to ask which combinations of marks significantly occur together.  They then correlated these with known biological features.   A key observation is that their model is unsupervised – it had no knowledge of the genome annotations a priori, so it would discover them without any prodding.
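To make the HMM idea a bit more concrete, here is a minimal sketch of how combinations of marks can be scored along the genome. It assumes each 200bp bin is reduced to a present/absent call per mark and that each state emits marks as independent Bernoulli variables; neither detail is stated above, so treat both as my assumptions. The two states, three marks and all probabilities are invented toy values, not the paper's. The sketch only decodes the most likely state path (Viterbi) with fixed parameters; the unsupervised learning of those parameters is not shown.

```python
import math

# Toy parameters (hypothetical, for illustration only; not taken from the paper).
MARKS = ["H3K4me3", "H3K36me3", "H3K27me3"]
STATES = ["promoter_like", "repressed_like"]

# P(mark present | state): one independent Bernoulli per mark (an assumption).
EMIT = {
    "promoter_like": [0.9, 0.2, 0.05],
    "repressed_like": [0.05, 0.1, 0.8],
}
START = {"promoter_like": 0.5, "repressed_like": 0.5}
TRANS = {
    "promoter_like": {"promoter_like": 0.8, "repressed_like": 0.2},
    "repressed_like": {"promoter_like": 0.2, "repressed_like": 0.8},
}


def log_emission(state, obs):
    """Log-probability of one bin's binary mark vector under one state."""
    total = 0.0
    for p, present in zip(EMIT[state], obs):
        total += math.log(p if present else 1.0 - p)
    return total


def viterbi(bins):
    """Most likely state path for a list of binary mark vectors."""
    scores = {s: math.log(START[s]) + log_emission(s, bins[0]) for s in STATES}
    back = []  # one dict of back-pointers per bin after the first
    for obs in bins[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state to have come from when landing in state s.
            prev = max(STATES, key=lambda q: scores[q] + math.log(TRANS[q][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + log_emission(s, obs))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Four consecutive 200bp bins, each a (H3K4me3, H3K36me3, H3K27me3) call.
bins = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]
print(viterbi(bins))
```

The point of the toy is that the state label comes from the whole combination of marks in a bin plus its neighbours, not from any single mark on its own.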

After a fair bit of complicated mathematics they settled on 51 “chromatin states” (distinct combinations of epigenetic marks), to which they were able to assign a function by correlating the genomic location of the state with known biological annotations such as introns, exons, transcription start sites (TSS), microarray data etc.  They categorised their states into five groups: promoter associated, transcribed, active intergenic, repressive and repetitive.  Each of these has numerous subcategories corresponding to a particular scenario; for example, within the promoter associated states are repressed promoter, high expression TSS, low expression TSS, etc.  We can now use the states to read/interpret what the genome is doing.  They find some interesting things, for example ~64% of the genome is in an epigenetically repressed state.  Further to this they investigate the biological functions of the genes associated with each promoter state and find that they are enriched for different functions.  For example state 3 (promoter upstream, low expression) is enriched for the cell cycle components whereas state 8 (transcribed promoter, highest expression, TSS for active genes) is enriched for T cell activation associated genes.
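One simple way to think about "correlating a state with an annotation" is a fold enrichment: how much more often does a state overlap an annotation than you would expect from the annotation's share of the genome? The sketch below is my own illustration of that idea, not the statistic used in the paper; the function name and the toy bin numbers are hypothetical.

```python
def fold_enrichment(state_bins, annotated_bins, total_bins):
    """Fold enrichment of an annotation within one chromatin state.

    state_bins:     indices of 200bp bins assigned to the state
    annotated_bins: indices of bins overlapping the annotation (e.g. TSS)
    total_bins:     number of bins in the genome (or region) considered
    """
    state_bins = set(state_bins)
    annotated_bins = set(annotated_bins)
    if not state_bins or not annotated_bins or total_bins <= 0:
        raise ValueError("need non-empty bin sets and a positive total")
    # Fraction of the state's bins that carry the annotation.
    in_state = len(state_bins & annotated_bins) / len(state_bins)
    # Fraction of all bins that carry the annotation (the background rate).
    background = len(annotated_bins) / total_bins
    return in_state / background


# Toy numbers: 1000 bins in total, 50 in the state, 100 annotated, 10 shared.
state = range(0, 50)
annotated = range(40, 140)
print(fold_enrichment(state, annotated, total_bins=1000))
```

A value well above 1 says the state sits on that annotation more often than chance would suggest; a value near 1 says the state tells you nothing about it.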

A particularly exciting application of this data is in the interpretation of intergenic SNPs that otherwise are hard to assign a biological relevance yet are strongly associated with a disease or phenotype. They give as an example an intergenic SNP associated with plasma eosinophil levels in inflammatory diseases which sits 40kb from the nearest gene where the surrounding regions have no biological annotations.  They discover that the SNP sits in the middle of a region in state 33 (distal enhancer).  This region is surrounded by repressed states and so we can immediately begin to interpret the function of the SNP in the regulation of gene expression.
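Looking up which state a SNP falls in is just an interval lookup once you have the segmentation. Here is a rough sketch, assuming the states for one chromosome are stored as sorted, non-overlapping, half-open (start, end, state) segments; the coordinates and state labels are made up for illustration and are not the real locus.

```python
from bisect import bisect_right

# Hypothetical segmentation of one chromosome: (start, end, state),
# half-open intervals, sorted by start and non-overlapping.
SEGMENTS = [
    (0, 4000, "repressed"),
    (4000, 5200, "distal_enhancer"),
    (5200, 9000, "repressed"),
]
STARTS = [seg[0] for seg in SEGMENTS]


def segment_index(position):
    """Index of the segment containing position, or None if uncovered."""
    i = bisect_right(STARTS, position) - 1
    if i < 0:
        return None
    start, end, _ = SEGMENTS[i]
    return i if start <= position < end else None


def state_at(position):
    """Chromatin state at a genomic position, or None if uncovered."""
    i = segment_index(position)
    return None if i is None else SEGMENTS[i][2]


def context(position):
    """(state before, state at, state after) for a position."""
    i = segment_index(position)
    if i is None:
        return (None, None, None)
    before = SEGMENTS[i - 1][2] if i > 0 else None
    after = SEGMENTS[i + 1][2] if i + 1 < len(SEGMENTS) else None
    return (before, SEGMENTS[i][2], after)


snp_position = 4600  # made-up coordinate
print(state_at(snp_position), context(snp_position))
```

Returning the flanking states as well mirrors the reasoning above: an enhancer-like state sitting inside repressed surroundings is what makes the SNP interesting.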

This study proves then that epigenetic marks work in concert with each other, not just alone.  It provides a mechanism by which we can interpret the genome and investigate regions outside of the coding sequences, the erstwhile junk DNA.  What I, as a biologist interested in disease, now want to know is whether these states are ever significantly altered during disease even when genetic mutations are not present.  It will be pretty costly to find this out…

 

 

 

Posted June 20, 2011

Global quantification of mammalian gene expression control

Right, a new blog so I had best introduce myself.  I'm interested in all things cool in molecular biology, bioinformatics and where they cross over. This site is a place for me to store my digested thoughts on papers etc that I've read. It's primarily aimed at giving my own memory a break, but if you enjoy reading it too then that's just great.  I've chosen as my first topic this paper:

"Global quantification of mammalian gene expression control"  Nature 473, 337-342, 2011

I've chosen this paper for several reasons. One, it's cool.  Two, it was the last thing I read and, three, it tackles a question I was concerned with during my PhD studies albeit on a much larger scale and in mammals rather than bacteria.  In prokaryotes the fundamental biological processes of transcription and translation are coupled.  There is no nuclear membrane to divide the two and so as soon as an mRNA transcript is produced (even as it's being produced!) ribosomes bind and initiate translation.  Therefore there is a strong correlation between the amount of mRNA and the quantity of its resulting protein in prokaryotes.  Obviously the rate of decay for both the transcript and the protein leads to cases where this is not true, but as a general rule it seems to hold up ok.

In eukaryotes it's a whole different ball game.  For one, the nucleus separates the physical process of transcription from translation.  Higher organisms also have much more complex processes to prepare mRNA for translation - the removal of introns for a start - which are not present in prokaryotes.  Because of this it has always been hard to categorically state that a gene's transcript levels are truly reflective of its protein level.  This causes a problem.  For a variety of reasons, most of which are technological and financial, modern molecular biology is largely based on the measurement of mRNA levels from which the state of a cell is inferred. However, proteins are the real functional unit of a cell and if we can't be sure that the mRNA levels actually reflect their concentration then we can't be sure of the cellular state as a whole.

We've been waiting for a systematic comparison of mRNA and protein levels on a global scale to tease apart this relationship.  The technological limitations that held us back are now being overcome and this paper is the first (to my knowledge) to provide such a comprehensive comparison.  The authors have not only quantified the levels of both mRNA (with high throughput sequencing) and protein (liquid chromatography coupled with tandem mass spec) but have also been able to generate half lives (the rate of decay/turnover) for both as well.  To do this they grew their cells (murine fibroblasts and then later a human breast cancer cell line) in media that contained labeled amino acids and a nucleoside analogue that allowed the team to differentiate newly synthesised mRNA and protein from the pre-existing.  A ratio of the new and pre-existing concentrations compared to total RNA/protein allowed the group to calculate half lives.  In total they have data for 5,028 mRNA-protein pairs.  It is worth noting that they were able to collect mRNA and protein data in the same cells (literally the same cells, not just the same cell type) which means the data is entirely compatible. Further, the method does not use any destructive chemicals to inhibit transcription or translation in order to calculate half lives meaning the cell remains intact and functioning normally throughout the experiment.
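For my own benefit, here is how a new/pre-existing ratio can turn into a half life. This is a simplified sketch, not the paper's actual calculation: it assumes first-order decay, a steady-state pool (synthesis balances decay), and it ignores dilution by cell growth, which a real analysis of dividing cells would need to account for. Under those assumptions the pre-existing pool decays as e^(-kt), so the ratio of new to pre-existing after time t is e^(kt) - 1, which gives k = ln(1 + ratio) / t and a half life of ln(2) / k. The function name and example numbers are hypothetical.

```python
import math


def half_life_from_ratio(new_over_old, hours):
    """Half life (hours) from a new/pre-existing ratio after a labelling time.

    Assumes first-order decay at steady state and no growth dilution:
        old(t) = total * exp(-k t),  new(t) = total * (1 - exp(-k t))
        ratio  = new / old = exp(k t) - 1
    """
    if new_over_old <= 0 or hours <= 0:
        raise ValueError("ratio and labelling time must be positive")
    k = math.log(1.0 + new_over_old) / hours  # decay rate constant, per hour
    return math.log(2.0) / k


# If new and pre-existing pools are equal after 10 hours of labelling,
# exactly half of the original pool has turned over.
print(half_life_from_ratio(1.0, 10.0))
```

The sanity check is that a ratio of 1 means half the old material is gone, so the half life should equal the labelling time.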

I guess the headline result is that they show that the correlation between mRNA and protein levels is approx. 0.4.  Although this is not massive, it's greater than anyone had predicted in the past and is good news for all of us mRNA observers out there.  Next the group were able to construct a mathematical model that allowed them to explore the contribution of the four main processes involved (the synthesis and degradation of mRNA and protein).  They discover that the rate of translational initiation by the ribosome is the most fundamental check on protein abundance and not the rate of mRNA transcription, another reminder not to focus solely on the rate of transcription.
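A model of the four processes can be sketched with two simple rate equations: mRNA is made at a constant rate and decays in proportion to its amount, and protein is made in proportion to mRNA and decays in proportion to its own amount. I am assuming this general form for illustration; the parameter names (v_sr, k_dr, k_sp, k_dp) and all numbers below are mine, not values from the paper. At steady state the equations can be solved directly, and if you have measured mRNA, protein and the protein decay rate you can back out the translation rate.

```python
import math


def rate_from_half_life(t_half):
    """First-order rate constant from a half life (same time units)."""
    return math.log(2.0) / t_half


def steady_state(v_sr, k_dr, k_sp, k_dp):
    """Steady-state mRNA and protein levels.

    dR/dt = v_sr - k_dr * R        (transcription, mRNA decay)
    dP/dt = k_sp * R - k_dp * P    (translation, protein decay)
    """
    mrna = v_sr / k_dr
    protein = k_sp * mrna / k_dp
    return mrna, protein


def translation_rate(mrna, protein, k_dp):
    """Translation rate constant implied by measured steady-state levels."""
    return protein * k_dp / mrna


# Made-up numbers: 2 mRNAs/hour transcribed, 50 proteins per mRNA per hour.
mrna, protein = steady_state(v_sr=2.0, k_dr=0.1, k_sp=50.0, k_dp=0.02)
print(mrna, protein, translation_rate(mrna, protein, k_dp=0.02))
```

Written this way it is easy to see why the same mRNA level can give very different protein levels: the protein amount scales with the translation rate and inversely with the protein decay rate, neither of which shows up in an mRNA measurement.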

They classify proteins based on their mRNA and protein stabilities and find that those which are stable as both mRNA and protein are enriched for some fundamental cellular processes such as translation and metabolism.  Those which are unstable both as mRNA and protein are involved in signalling and regulatory systems (including epigenetic mechanisms).  Those with unstable proteins but stable mRNAs are concerned with functions such as cellular defence where the protein needs to be produced rapidly, hence the pre-existing pool of mRNA.  This is largely as expected and indicates that the regulation of protein production evolved in a resource-constrained environment and has adapted to fit the needs of the cell at an energy-efficient optimum.  The authors explore numerous other aspects which I won't go into here.

It is important to note that these experiments were conducted on a large, non-synchronised population of cells and as such the results reflect the average over the cell cycle.  At the level of the individual cell a particular protein may well have quite different synthesis/degradation characteristics.  Nevertheless such a resource will now be invaluable to scientists looking to create systems biology models of cellular pathways where quantities and synthesis/turnover rates are required for accurate computation.  The data shown here, derived from mouse fibroblasts, will not be applicable to many models, but at least they will allow us to move on from our current uninformed guesstimates.

 

 

Posted June 13, 2011