In many eukaryotic organisms, including humans, the genome is tightly packed and organized into chromatin with the help of nucleosomes. A nucleosome is a complex of eight histone proteins around which ~147 bp of DNA is wrapped. When DNA is actively transcribed into RNA, it is opened and loosened from the nucleosome complex. Many factors, such as chromatin structure, nucleosome positioning, and histone modifications, play an important role in the organization and accessibility of the DNA. Consequently, these factors are also important for the activation and inactivation of genes.
Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is a method to investigate chromatin accessibility and thereby study the regulatory mechanisms of gene expression. The method can help identify promoter regions and potential enhancers and silencers. A promoter is the DNA region close to the transcription start site (TSS). It contains binding sites for transcription factors that recruit RNA polymerase. An enhancer is a DNA region that can be located up to 1 Mb downstream or upstream of the promoter. When transcription factors bind an enhancer and contact a promoter region, transcription of the gene increases. In contrast, a silencer decreases or inhibits the gene’s expression.
ATAC-seq has become popular for identifying accessible regions of the genome because it is easier, faster, and requires fewer cells than alternative techniques such as FAIRE-seq and DNase-seq.
The first major step of ATAC-seq data analysis comprises pre-alignment QC, read alignment to a reference genome, and post-alignment QC and processing.
Pre-alignment quality control
The pre-alignment QC and read alignment steps are standard for most high-throughput sequencing technologies. For example, FastQC can be used to visualize base quality scores, GC content, sequence length distribution, sequence duplication levels, k-mer overrepresentation and contamination of primers and adapters in the sequencing data. An overall high base quality score with a slight drop towards the 3′ end of sequencing reads is acceptable.
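To make the base-quality check concrete, here is a minimal sketch of how a per-position quality profile can be computed from FASTQ quality strings. It is not FastQC's implementation. It assumes the common Phred+33 encoding, and the function names and toy quality strings are hypothetical.

```python
from statistics import mean

def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (an assumption; older data may use Phred+64).
    """
    return [ord(ch) - offset for ch in quality_line]

def mean_quality_by_position(quality_lines, offset=33):
    """Mean Phred score at each read position; reads may differ in length."""
    decoded = [phred_scores(q, offset) for q in quality_lines]
    longest = max(len(scores) for scores in decoded)
    profile = []
    for i in range(longest):
        # Only reads long enough to cover position i contribute to its mean.
        profile.append(mean(s[i] for s in decoded if len(s) > i))
    return profile

# Toy quality strings (hypothetical reads): 'I' = Q40, '5' = Q20, '#' = Q2
toy_qualities = ["IIII5", "IIII#", "III55"]
profile = mean_quality_by_position(toy_qualities)
print(profile)
```

In this toy input the mean quality stays high over the first positions and falls at the last one, which is the kind of 3′ drop described above.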
Adapter sequences and low-quality bases are then trimmed from the reads. After trimming, FastQC can be run again to confirm that adapters and low-quality bases were removed. Trimmed reads are then mapped to a reference genome. The BWA-MEM and Bowtie2 aligners are memory-efficient and fast for short paired-end reads. Both support soft-clipping, which leaves overhanging bases at either end of a read unaligned and can further increase unique mapping rates.
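As an illustration of the alignment step, the sketch below assembles a Bowtie2 command line for paired-end reads without running it. The index prefix, file names, thread count, and the 2000 bp maximum fragment length are assumptions chosen for the example, not values from this text. Note that Bowtie2 soft-clips read ends only in `--local` mode.

```python
import shlex

def bowtie2_command(index_prefix, reads_1, reads_2, out_sam,
                    threads=4, max_fragment=2000, local=True):
    """Build (but do not run) a Bowtie2 paired-end alignment command.

    All default values are illustrative assumptions.
    """
    cmd = ["bowtie2"]
    if local:
        # Local alignment mode permits soft-clipping at read ends.
        cmd.append("--local")
    cmd += [
        "-p", str(threads),        # number of alignment threads
        "-X", str(max_fragment),   # maximum fragment length for valid pairs
        "-x", index_prefix,        # prefix of the Bowtie2 genome index
        "-1", reads_1,             # mate 1 FASTQ
        "-2", reads_2,             # mate 2 FASTQ
        "-S", out_sam,             # SAM output file
    ]
    return cmd

# Hypothetical file names, for illustration only.
cmd = bowtie2_command("genome_index",
                      "sample_R1.trimmed.fastq.gz",
                      "sample_R2.trimmed.fastq.gz",
                      "sample.sam")
print(shlex.join(cmd))
```

Returning the arguments as a list keeps the command safe to pass to `subprocess.run` without shell quoting issues.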
Post-alignment processing and quality control
After sequence alignment, as with most DNA sequencing data, basic metrics of the aligned BAM file, such as uniquely mapped reads and mapping rates, duplicated read percentages, and fragment size distribution, can be collected using Picard and SAMtools. Additionally, reads should be removed if they are improperly paired or of low mapping quality. The mitochondrial genome, which is more accessible due to the lack of chromatin packaging, and the ENCODE blacklisted regions often have extremely high read coverage, and reads mapping to them should also be discarded. Duplicated reads, which are likely to have arisen as PCR artifacts, should also be removed to significantly improve biological reproducibility. Together, these steps improve the power of open chromatin detection and produce fewer false positives.
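The filtering rules above can be sketched in plain Python over simplified read records. This is a toy illustration, not a replacement for Picard or SAMtools. The `Read` record, the MAPQ threshold of 30, the `chrM` contig name, and the example blacklist interval are all assumptions. The flag bits are the standard SAM FLAG values.

```python
from collections import namedtuple

# Simplified stand-in for an aligned read (hypothetical record type).
Read = namedtuple("Read", "name chrom pos flag mapq")

# Standard SAM FLAG bits
PROPER_PAIR = 0x2
UNMAPPED = 0x4
DUPLICATE = 0x400

def in_blacklist(chrom, pos, blacklist):
    """True if pos falls in any (chrom, start, end) interval (half-open, assumed)."""
    return any(c == chrom and start <= pos < end for c, start, end in blacklist)

def keep_read(read, blacklist, min_mapq=30, mito_name="chrM"):
    """Apply the post-alignment filters described in the text to one read."""
    if read.flag & UNMAPPED:
        return False
    if not read.flag & PROPER_PAIR:   # improperly paired
        return False
    if read.flag & DUPLICATE:         # marked as a PCR/optical duplicate
        return False
    if read.mapq < min_mapq:          # low mapping quality (threshold assumed)
        return False
    if read.chrom == mito_name:       # mitochondrial read
        return False
    if in_blacklist(read.chrom, read.pos, blacklist):
        return False
    return True

blacklist = [("chr1", 5000, 6000)]    # hypothetical blacklisted interval
reads = [
    Read("r1", "chr1", 1000, 99, 42),    # passes all filters
    Read("r2", "chr1", 2000, 65, 42),    # not properly paired
    Read("r3", "chr1", 3000, 99, 5),     # low MAPQ
    Read("r4", "chrM", 500, 99, 42),     # mitochondrial
    Read("r5", "chr1", 5050, 99, 42),    # inside blacklist
    Read("r6", "chr2", 700, 1123, 42),   # duplicate flag set
    Read("r7", "chr2", 900, 163, 42),    # passes all filters
]
kept = [r.name for r in reads if keep_read(r, blacklist)]
print(kept)
```

In practice the same decisions are usually made on BAM files with dedicated tools. The sketch only shows the order and logic of the checks.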
The second major step of ATAC-seq data analysis is peak calling, which identifies accessible regions and forms the basis for advanced analysis. Currently, MACS2 is the default peak caller of the ENCODE ATAC-seq pipeline.
Here are a few additional things to consider when planning an ATAC-seq experiment:
Like most high-throughput sequencing applications, ATAC-seq requires that biological replicates be run. This ensures that any signals observed are due to biological effects and not idiosyncrasies of one particular sample or its processing. To begin with, two replicates per experimental group are sufficient.
With ATAC-seq, control groups are not typically run, presumably due to the expense and the limited value obtained. A control for a given sample would be genomic DNA from the sample that, instead of transposase treatment, is fragmented (e.g. by sonication), has adapters ligated, and is sequenced along with the ATAC sample.
In preparing libraries for sequencing, the samples should be amplified using as few PCR cycles as possible. This helps to reduce PCR duplicates, which are exact copies of DNA fragments that can interfere with the biological signal of interest.
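One simple way to think about PCR duplicates is as fragments with identical genomic coordinates. The sketch below is a simplification, not the algorithm of any particular tool. It counts duplicates in a toy set of fragments using either both fragment ends (available with paired-end reads) or only the start coordinate (roughly what a single read provides). The fragment tuples and function name are hypothetical.

```python
def mark_duplicates(fragments, key):
    """Keep the first fragment seen for each key; count the rest as duplicates."""
    seen = set()
    unique = []
    duplicates = 0
    for frag in fragments:
        k = key(frag)
        if k in seen:
            duplicates += 1
        else:
            seen.add(k)
            unique.append(frag)
    return unique, duplicates

# Toy fragments as (chrom, start, end); values are made up for illustration.
fragments = [
    ("chr1", 100, 250),
    ("chr1", 100, 250),   # exact copy of the first fragment
    ("chr1", 100, 310),   # same start, different end: a distinct fragment
    ("chr2", 40, 190),
]

# Paired-end view: both ends of the fragment are known.
paired_unique, paired_dups = mark_duplicates(fragments, key=lambda f: f)

# Single-end view: only chromosome and start are known.
single_unique, single_dups = mark_duplicates(fragments, key=lambda f: f[:2])

print(paired_dups, single_dups)
```

With both ends known, only the exact copy is flagged. With the start alone, the distinct fragment sharing that start is also discarded, which loses real signal.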
The optimal sequencing depth varies based on the size of the reference genome and the degree of open chromatin expected.
For ATAC-seq, paired-end sequencing is recommended, for several reasons.
- More sequence data leads to better alignment results. Many genomes contain numerous repetitive elements, and failing to align reads to certain genomic regions unambiguously renders those regions less accessible to the assay.
- With ATAC-seq, we are interested in knowing both ends of the DNA fragments generated by the assay, since the ends mark where the transposase cut and inserted adapters. This can be done only with paired-end reads.
- PCR duplicates are identified more accurately.
Mitochondria
It is a well-known problem that ATAC-seq datasets usually have a large percentage of reads that are derived from mitochondrial DNA. Since there are no ATAC-seq peaks of interest in the mitochondrial genome, these reads are discarded in the computational analysis and thus represent a waste of sequencing resources.
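A quick way to gauge how much sequencing was spent on mitochondrial DNA is to compute the fraction of aligned reads assigned to the mitochondrial contig. The following is a minimal sketch. The contig names `chrM` and `MT` are common conventions but are assumptions here, as are the toy read counts.

```python
from collections import Counter

def mito_fraction(read_chroms, mito_names=("chrM", "MT")):
    """Fraction of reads whose contig name is one of the mitochondrial names."""
    counts = Counter(read_chroms)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(counts[name] for name in mito_names)
    return mito / total

# Toy input: one contig name per aligned read (made-up numbers).
toy_chroms = ["chr1"] * 6 + ["chr2"] * 2 + ["chrM"] * 2
fraction = mito_fraction(toy_chroms)
print(f"{fraction:.0%} of reads are mitochondrial")
```

On real data the per-contig counts would typically come from an alignment index summary rather than a Python list.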
Common applications of ATAC-seq include:
- Nucleosome mapping
- Transcription factor binding analysis
- Novel enhancer identification
- Exploration of disease-relevant regulatory mechanisms
- Cell type–specific regulation analysis
- Evolutionary studies
- Comparative epigenomics
- Biomarker discovery