Discovery and characterization of chromatin states for systematic annotation of the human genome.
Ernst & Kellis. Nature Biotechnology 28 (8), 817-825, 2010.
This paper has wide implications for our basic understanding of the function of epigenetic marks and the genome itself; super cool science! The study of epigenetics has been around for a while but only recently has the technology advanced sufficiently to allow us to really grasp its importance. Epigenetics is (from Wikipedia):
the study of changes produced in gene expression caused by mechanisms other than changes in the underlying DNA sequence – hence the name epi- (Greek: επί- over, above) -genetics.
The most common (well-studied) epigenetic modifications include DNA methylation and histone modifications. DNA methylation can directly affect the transcriptional activity of a given genomic region. Histone modifications, such as acetylation and methylation, cause the DNA to unwind or contract on the histones, increasing or decreasing its accessibility to transcription factors etc. These marks can occur at many locations on the N-terminal tail of the histone and can be compounded so that, for example, a histone residue can be mono-, di- or tri-methylated. There are several hypotheses as to how these marks might interplay (or not) with each other. Could specific combinations lead to unique biological functions, or do they all do the same thing, with the co-occurrence of additional marks simply enhancing the stability and robustness of the signal?
The authors of this paper gathered existing data to compile a ChIP-seq data set representing 41 marks in human CD4 T cells. The data included various acetylation and methylation marks as well as RNA polymerase II, the H2A.Z histone variant and CTCF (a zinc finger protein known to bind DNA insulator regions). The data are genome-wide at a resolution of 200 bp. The authors then used a multivariate hidden Markov model (HMM) to ask which combinations of marks significantly occur together, and correlated these combinations with known biological features. A key point is that their model is unsupervised: it had no knowledge of the genome annotations a priori, so it discovered them without any prodding.
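To make the HMM idea a bit more concrete, here is a minimal toy sketch in Python. It assumes each 200 bp bin is reduced to a present/absent call per mark and that, within a state, the marks are treated as independent Bernoulli variables. The two states, three marks, all probabilities, and the use of Viterbi decoding are illustrative assumptions of mine, not the authors' parameters or code (a real model would also have to learn these numbers from the data rather than have them typed in).

```python
import math

def emission_log_prob(observed, mark_probs):
    """Log-probability of a 0/1 mark vector under one state,
    treating each mark as an independent Bernoulli variable (assumption)."""
    total = 0.0
    for present, p in zip(observed, mark_probs):
        total += math.log(p if present else 1.0 - p)
    return total

def viterbi(bins, emit, trans, start):
    """Most likely state path for a sequence of binarized bins."""
    n_states = len(emit)
    scores = [math.log(start[s]) + emission_log_prob(bins[0], emit[s])
              for s in range(n_states)]
    back = []  # back[t][s] = best previous state for state s at bin t+1
    for obs in bins[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: scores[r] + math.log(trans[r][s]))
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + math.log(trans[best_prev][s])
                              + emission_log_prob(obs, emit[s]))
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: 2 states, 3 marks.
emit = [[0.9, 0.8, 0.1],    # state 0: marks 1 and 2 usually present
        [0.05, 0.1, 0.7]]   # state 1: mark 3 usually present
trans = [[0.9, 0.1], [0.1, 0.9]]
start = [0.5, 0.5]
bins = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]

path = viterbi(bins, emit, trans, start)
print(path)
```

The point of the toy is that a "state" is nothing more than a characteristic pattern of mark frequencies, and the model labels each bin with whichever state best explains both its own marks and its neighbours.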
After a fair bit of complicated mathematics they settled on 51 “chromatin states” (distinct combinations of epigenetic marks), to which they were able to assign functions by correlating the genomic location of each state with known biological annotations such as introns, exons, transcription start sites (TSS), microarray data etc. They categorised their states into five groups: promoter-associated, transcribed, active intergenic, repressive and repetitive. Each of these has numerous subcategories corresponding to a particular scenario; for example, the promoter-associated states include repressed promoter, high expression TSS, low expression TSS, etc. We can now use the states to read/interpret what the genome is doing. They find some interesting things, for example ~64% of the genome is in an epigenetically repressed state. Further to this, they investigate the biological functions of the genes associated with each promoter state and find that they are enriched for different functions. For example, state 3 (promoter upstream, low expression) is enriched for cell cycle components, whereas state 8 (transcribed promoter, highest expression, TSS for active genes) is enriched for T cell activation associated genes.
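The "correlating states with annotations" step can be pictured as a simple fold-enrichment calculation: how often does a state overlap an annotation, compared with what you would expect if the two were scattered independently across the genome? The sketch below is my own illustration of that idea, not the authors' exact statistic; the function name and the toy bin numbers are hypothetical.

```python
def fold_enrichment(state_bins, annot_bins, total_bins):
    """Observed overlap between a state and an annotation, divided by the
    overlap expected if the two were independent across the genome."""
    overlap = len(state_bins & annot_bins)
    expected = len(state_bins) * len(annot_bins) / total_bins
    return overlap / expected

# Hypothetical toy genome of 1000 bins.
state_bins = set(range(0, 50))     # bins assigned to some state
annot_bins = set(range(40, 140))   # bins within some annotation, e.g. near a TSS
enrichment = fold_enrichment(state_bins, annot_bins, total_bins=1000)
print(enrichment)
```

A value well above 1 says the state turns up in that annotation more than chance would predict, which is the kind of evidence used to give a state a name like "high expression TSS".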
A particularly exciting application of this data is in the interpretation of intergenic SNPs that are strongly associated with a disease or phenotype yet are otherwise hard to assign any biological relevance. They give as an example an intergenic SNP associated with plasma eosinophil levels in inflammatory diseases, which sits 40 kb from the nearest gene in a region with no biological annotations. They discover that the SNP sits in the middle of a region in state 33 (distal enhancer). This region is surrounded by repressed states, and so we can immediately begin to interpret the function of the SNP in the regulation of gene expression.
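Once you have a genome segmented into states, looking up a SNP is just an interval query. Here is a small sketch, assuming the segmentation for one chromosome is stored as sorted, non-overlapping half-open intervals. The coordinates, labels and the `state_at` helper are made up for illustration and are not taken from the paper.

```python
from bisect import bisect_right

# Hypothetical segmentation: (start, end, state label), sorted by start.
segments = [
    (0, 12000, "repressed"),
    (12000, 13400, "distal_enhancer"),
    (13400, 30000, "repressed"),
]

def state_at(segments, pos):
    """Return the state label of the segment containing pos, or None."""
    starts = [start for start, _, _ in segments]
    i = bisect_right(starts, pos) - 1  # last segment starting at or before pos
    if i >= 0 and segments[i][0] <= pos < segments[i][1]:
        return segments[i][2]
    return None

snp_position = 12600  # hypothetical SNP coordinate
print(state_at(segments, snp_position))
```

In this toy layout the SNP falls in a small enhancer-like island flanked by repressed chromatin, which mirrors the situation described for the eosinophil-associated SNP.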
This study shows, then, that epigenetic marks work in concert with each other, not just alone. It provides a means by which we can interpret the genome and investigate regions outside of the coding sequences, the erstwhile "junk DNA". What I, as a biologist interested in disease, now want to know is whether these states are ever significantly altered during disease even when genetic mutations are not present. It will be pretty costly to find this out…
