Methyl-Seq Data Analysis

Detecting and identifying methylated residues requires sequencing of bisulfite-treated DNA (or RNA). Bisulfite treatment converts unmethylated cytosine residues to uracil, which is read as thymine in the sequencing results, while methylated cytosine residues are unaffected. Sequence reads are therefore aligned in silico to essentially two genomes: a converted and an unconverted version of the reference.
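
The in silico conversion described above can be sketched in a few lines of Python. This is only an illustration, assuming the common approach (used by Bismark, among others) of building a C-to-T converted copy of the reference for one strand and a G-to-A converted copy for the other; the function name bisulfite_convert is hypothetical and not part of any tool.

```python
def bisulfite_convert(seq: str, strand: str = "forward") -> str:
    """Return a fully bisulfite-converted copy of a sequence (hypothetical helper).

    forward: every C becomes T (reads from the original top strand)
    reverse: every G becomes A (the same conversion seen from the bottom strand)
    """
    seq = seq.upper()
    if strand == "forward":
        return seq.replace("C", "T")
    if strand == "reverse":
        return seq.replace("G", "A")
    raise ValueError("strand must be 'forward' or 'reverse'")


reference = "ACGTCCGGA"
print(bisulfite_convert(reference, "forward"))  # ATGTTTGGA
print(bisulfite_convert(reference, "reverse"))  # ACATCCAAA
```

Because the converted reference no longer distinguishes C from T, methylation status is recovered afterwards by comparing each aligned read with the original, unconverted sequence.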

Methylation occurs primarily in CpG contexts and less frequently in non-CpG contexts. Results are typically reported as percent methylation at a given position, but a call should only be made once a minimum number of reads (usually 8 to 10) covers that position. Whole-genome bisulfite sequencing therefore requires at least 10X coverage (preferably much higher), at significant cost. Commercial capture panels targeting known sites of methylation (CpG islands, known differentially methylated regions) can often yield much better coverage at these sites at lower cost, so more sites reach the read threshold needed for a call. One important control is the inclusion of completely unmethylated DNA (typically lambda phage DNA), which can be used to measure the efficiency of bisulfite conversion: > 97% is considered good, while < 90% is very poor.
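
As a rough illustration of the two calculations above, the following Python sketch computes percent methylation with a minimum-read threshold and estimates conversion efficiency from an unmethylated spike-in. The function names and the default threshold of 10 reads are assumptions made for this example, not part of any particular pipeline.

```python
from typing import Optional

MIN_READS = 10  # assumed threshold; the text suggests 8 to 10 reads


def percent_methylation(methylated: int, unmethylated: int,
                        min_reads: int = MIN_READS) -> Optional[float]:
    """Percent methylation at one position, or None if coverage is too low."""
    total = methylated + unmethylated
    if total < min_reads:
        return None  # not enough reads to make a call
    return 100.0 * methylated / total


def conversion_efficiency(converted: int, unconverted: int) -> float:
    """Percent of spike-in cytosines read as T.

    The spike-in (e.g. lambda phage DNA) is unmethylated, so every
    cytosine should have been converted.
    """
    total = converted + unconverted
    if total == 0:
        raise ValueError("no spike-in cytosines observed")
    return 100.0 * converted / total


print(percent_methylation(9, 3))         # 75.0
print(percent_methylation(3, 1))         # None (only 4 reads)
print(conversion_efficiency(9850, 150))  # 98.5
```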

Raw reads are processed with TrimGalore, which trims adaptor sequences and low-quality bases and removes short reads (35 bp cutoff). Processed reads are aligned to a modified human reference sequence using Bismark. The Bismark methylation extractor then estimates percentage methylation at each position from the proportion of CpG cytosines that remain unconverted (read as CG rather than TG). Cytosines that are not converted by bisulfite treatment can be either methylated or hydroxymethylated (mC + hmC).
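
Per-position counts produced at this stage can be read back for downstream analysis. The sketch below assumes the tab-separated Bismark coverage layout (chromosome, start, end, percent methylation, methylated count, unmethylated count); the SiteCall type, the parse_coverage function and the example lines are hypothetical and should be checked against the files a given Bismark version actually writes.

```python
from typing import Iterable, Iterator, NamedTuple


class SiteCall(NamedTuple):
    """One cytosine position with its read counts (hypothetical record type)."""
    chrom: str
    pos: int
    methylated: int
    unmethylated: int


def parse_coverage(lines: Iterable[str]) -> Iterator[SiteCall]:
    """Parse lines assumed to follow the Bismark coverage column layout."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, start, _end, _pct, n_meth, n_unmeth = line.split("\t")
        yield SiteCall(chrom, int(start), int(n_meth), int(n_unmeth))


# Made-up example lines, for illustration only
example = [
    "chr1\t10469\t10469\t80.0\t8\t2",
    "chr1\t10471\t10471\t25.0\t3\t9",
]
for site in parse_coverage(example):
    total = site.methylated + site.unmethylated
    print(site.chrom, site.pos, 100.0 * site.methylated / total)
```

Recomputing the percentage from the two counts, rather than trusting the fourth column, keeps the read depth available for the coverage threshold.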

methylKit is used to discover differentially methylated regions (DMRs). Briefly, 200 bp windows are chosen that contain at least five differentially methylated CpGs with a read depth of at least 10. For each window, the numbers of methylated and unmethylated observations are determined and a p-value is assigned using a logistic regression test. Statistically significant DMRs that show >50% methylation difference between groups are then selected using a q-value threshold.
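
The windowing and counting steps can be sketched as follows. This is not methylKit (an R package) and it does not perform the logistic regression test or the q-value correction; it only illustrates tiling into 200 bp windows, the depth filter, and the methylation difference between two groups. All names are hypothetical, and the filter is simplified to "at least five covered CpGs per window" rather than five differentially methylated CpGs.

```python
from collections import defaultdict

WINDOW = 200    # window size in bp, from the text
MIN_DEPTH = 10  # minimum reads per CpG, from the text
MIN_CPGS = 5    # simplified: minimum covered CpGs per window


def tile_counts(sites, window=WINDOW, min_depth=MIN_DEPTH):
    """Pool counts per window.

    sites: iterable of (chrom, pos, methylated, unmethylated), 1-based pos.
    Returns {(chrom, window_start): [n_cpgs, methylated, unmethylated]}.
    """
    windows = defaultdict(lambda: [0, 0, 0])
    for chrom, pos, meth, unmeth in sites:
        if meth + unmeth < min_depth:
            continue  # drop poorly covered CpGs
        start = (pos - 1) // window * window + 1
        entry = windows[(chrom, start)]
        entry[0] += 1
        entry[1] += meth
        entry[2] += unmeth
    return dict(windows)


def methylation_difference(group_a, group_b, min_cpgs=MIN_CPGS):
    """Percent methylation difference (b minus a) for windows passing the filter."""
    diffs = {}
    for key in group_a.keys() & group_b.keys():
        n_a, m_a, u_a = group_a[key]
        n_b, m_b, u_b = group_b[key]
        if min(n_a, n_b) < min_cpgs:
            continue
        diffs[key] = 100.0 * m_b / (m_b + u_b) - 100.0 * m_a / (m_a + u_a)
    return diffs


# Made-up counts for five CpGs in one window, plus one low-depth site
control = [("chr1", p, 2, 18) for p in (1, 21, 41, 61, 81)] + [("chr1", 101, 1, 2)]
treated = [("chr1", p, 16, 4) for p in (1, 21, 41, 61, 81)]
diffs = methylation_difference(tile_counts(control), tile_counts(treated))
print(diffs)  # {('chr1', 1): 70.0}
```

In a real analysis the per-window counts would go into the statistical test, and only windows passing both the q-value threshold and the >50% difference cut-off would be reported as DMRs.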


Bisulfite treatment is very damaging to DNA, so a lower alignment rate is normally expected than with traditional DNA sequencing. Furthermore, because bisulfite-converted libraries have low base diversity, a higher percentage of Illumina PhiX control DNA is usually spiked in to maintain a balanced base composition, which lowers the yield of reads from the target genome.


Methyl-Seq analysis is typically done at two levels: the base level (an individual C nucleotide) and the regional level (a cluster of several C nucleotides within a defined window length, e.g. 5 Cs in 250 bp). While methylation at a specific residue may affect a specific binding site, changes in methylation across a region are thought to be more biologically significant.

There are three main groups of experimental techniques used for genome-wide detection of DNA methylation and hydroxymethylation and for producing methylation data:

  • Restriction enzyme-based
  • Affinity enrichment-based
  • Bisulfite conversion-based

These three groups are defined by the mechanism by which methylated cytosine is recognized in order to differentiate methylated from unmethylated DNA (or hydroxymethylated from non-hydroxymethylated DNA). Bisulfite conversion-based methods are arguably the most commonly chosen approach today.

