MicroRNAs (miRNAs) are short, single-stranded RNAs that play important roles in regulating gene expression, acting post-transcriptionally via mRNA decay and/or translational repression. MiRNAs are transcribed by RNA polymerases II and III, generating precursors that undergo a series of cleavage events to form mature miRNAs. An estimated 30 to 60% of all human protein-coding genes are regulated by miRNAs, which are involved in almost all biological processes, from development to metabolism to cancer. The rapidly developing field of microRNA sequencing (miRNA-seq; small RNA-seq) needs comprehensive, robust, user-friendly and standardized bioinformatics tools to analyze these large datasets.
As key post-transcriptional regulators, miRNAs affect protein translation by targeting mRNAs, and their roles in disease etiology and toxicity are well recognized. With the rapid advancement of next-generation sequencing techniques, miRNA profiling is increasingly conducted with RNA-seq, an approach known as miRNA-seq. Analysis of miRNA-seq data requires several steps:
- mapping the reads to miRBase
- considering mismatches during the hairpin alignment (windowing)
- counting the reads (quantification)
The parameter settings chosen in each step can affect miRNA quantification, the detection of differentially expressed miRNAs (DEMs) and the identification of novel miRNAs. Furthermore, these parameters do not act in isolation, and their joint effects influence miRNA-seq results and interpretation. In toxicogenomics, the variation associated with parameter settings should not overpower the treatment effect (such as the dose- or time-dependent effect).
Changing the number of nucleotides allowed outside the mature sequence in the hairpin alignment (the window option) produced the largest variation in miRNA quantification and DEM detection. However, this variation was relatively small compared to the treatment effect when the study focused on DEMs, which are more critical for interpreting the toxicological effect. Although the normalization methods introduced a large variation in DEMs, the toxic behavior of thioacetamide showed a consistent trend in its time- and dose-responses. High-throughput sequencing provides high sensitivity and specificity for analyzing the abundance of microRNA sequences in a sample as well as for discovering novel microRNA species.
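To make the window option concrete, the following is a minimal sketch of how a window could influence which reads are counted for a mature miRNA. It is an illustration under stated assumptions, not the implementation of any particular tool: the `Mature` class, the function names, the 0-based half-open coordinates and the example miRNA names are all hypothetical, and reads are assumed to be already aligned to a hairpin.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Mature:
    """Hypothetical annotation of a mature miRNA on its hairpin."""
    name: str
    start: int  # 0-based, inclusive position on the hairpin
    end: int    # 0-based, exclusive position on the hairpin


def assign_read(read_start: int, read_end: int,
                matures: List[Mature], window: int) -> Optional[str]:
    """Return the first mature miRNA whose interval, extended by `window`
    nucleotides on both sides, fully contains the read (or None)."""
    for m in matures:
        inside = read_start >= m.start - window and read_end <= m.end + window
        overlaps = read_end > m.start and read_start < m.end
        if inside and overlaps:
            return m.name
    return None


def quantify(alignments: Iterable[Tuple[int, int, int]],
             matures: List[Mature], window: int) -> Dict[str, int]:
    """Sum copy numbers per mature miRNA.
    Each alignment is (start, end, copy_number) on the hairpin."""
    counts = {m.name: 0 for m in matures}
    for start, end, copies in alignments:
        name = assign_read(start, end, matures, window)
        if name is not None:
            counts[name] += copies
    return counts


# Hypothetical hairpin with two mature products and four unique read tags.
matures = [Mature("hyp-miR-5p", 10, 32), Mature("hyp-miR-3p", 50, 72)]
alignments = [(10, 32, 100), (8, 30, 20), (12, 35, 5), (50, 72, 40)]

for w in (0, 2, 3):
    print(w, quantify(alignments, matures, w))
```

In this toy data, widening the window admits reads that start or end a few nucleotides outside the annotated mature sequence, so the count for the first miRNA grows as the window grows while the second is unaffected. This is the kind of parameter-driven variation in quantification described above.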
Raw Data Processing
The Genome Analyzer produces all raw sequencing data in TIFF format. To extract the sequence information from the original image files, image analysis and base calling are performed using Solexa pipeline v1.6. Short reads that pass Solexa CHASTITY filtering are retained for further processing.
3′ Adapter Trimming
In the first step, reads that pass Solexa CHASTITY filtering are passed through an adapter filter that searches for reads whose 3′ ends align to the 3′ adapter. The adapter sequences are trimmed, and the adapter-trimmed reads that have passed both filters are formatted into a non-redundant FASTA file in which the copy number and sequence are recorded for each unique tag. Sequences that fail the quality cutoff, are shorter than 16 nt, or have a copy number below 2 are discarded as unusable reads.
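The trimming and collapsing step can be sketched as follows. This is a simplified illustration, not the Solexa pipeline itself: it uses exact string matching only (real trimmers tolerate mismatches), the function names are hypothetical, and the `ADAPTER` value is a placeholder that must be replaced with the 3′ adapter of the library kit actually used. The length and copy-number thresholds follow the cutoffs stated above.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Placeholder adapter for illustration only; not taken from the protocol above.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_3p_adapter(read: str, adapter: str, min_overlap: int = 6) -> Optional[str]:
    """Return the insert with the 3' adapter removed, or None if no adapter is found.
    Exact matching only: a full adapter anywhere in the read, or an adapter
    prefix of at least `min_overlap` bases at the very end of the read."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return None


def collapse_reads(reads: Iterable[str], adapter: str,
                   min_len: int = 16, min_copies: int = 2) -> Dict[str, int]:
    """Trim reads, drop inserts shorter than `min_len`, collapse identical
    inserts into unique tags, and drop tags seen fewer than `min_copies` times."""
    counts: Counter = Counter()
    for read in reads:
        insert = trim_3p_adapter(read, adapter)
        if insert is not None and len(insert) >= min_len:
            counts[insert] += 1
    return {seq: n for seq, n in counts.items() if n >= min_copies}


def to_fasta(tags: Dict[str, int]) -> str:
    """Write unique tags as FASTA, recording the copy number in each header."""
    ordered = sorted(tags.items(), key=lambda kv: (-kv[1], kv[0]))
    lines = []
    for i, (seq, n) in enumerate(ordered, start=1):
        lines.append(f">tag{i}_x{n}")
        lines.append(seq)
    return "\n".join(lines)


insert = "TAGCTTATCAGACTGATGTTGA"        # 22 nt example insert
reads = [
    insert + ADAPTER,                     # full adapter present
    insert + ADAPTER[:10],                # adapter truncated at the read end
    "ACGTACGT" + ADAPTER,                 # insert too short (< 16 nt)
    "GGGGCCCCAAAATTTTGGGGCCCC",           # no adapter found
    "TTTTCCCCAAAAGGGGTTTT" + ADAPTER,     # valid length but seen only once
]
tags = collapse_reads(reads, ADAPTER)
print(to_fasta(tags))
```

Storing the copy number in the FASTA header keeps the file non-redundant while preserving the abundance information needed for quantification later on.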
For each microRNA sequence-based profile, the copy number of small RNA sequence reads can be used to estimate the expression level of each microRNA. Generally, the total number of valid sequence reads in each profile (whether from microRNAs or other RNA species) is used as the scaling factor for normalization, as this number indicates the total amount of RNA in the given sample. One profile is used as the reference, and the values of each of the other profiles are divided by that profile's scaling factor.
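The normalization described above can be sketched as follows. This is one reading of the procedure, stated as an assumption: the scaling factor of a profile is taken to be its total valid read count divided by the total of the reference profile, so the reference itself is left unchanged. The function name, sample names and counts are hypothetical.

```python
from typing import Dict

Profile = Dict[str, int]  # tag or miRNA name -> raw copy number


def normalize_profiles(profiles: Dict[str, Profile],
                       reference: str) -> Dict[str, Dict[str, float]]:
    """Scale every profile to the sequencing depth of the reference profile.
    Assumed scaling factor: total reads of the profile / total reads of the reference."""
    totals = {name: sum(counts.values()) for name, counts in profiles.items()}
    ref_total = totals[reference]
    if ref_total == 0 or any(t == 0 for t in totals.values()):
        raise ValueError("every profile needs at least one valid read")
    normalized = {}
    for name, counts in profiles.items():
        factor = totals[name] / ref_total
        normalized[name] = {mirna: c / factor for mirna, c in counts.items()}
    return normalized


# Made-up counts; "other" stands for valid reads from non-miRNA species.
profiles = {
    "control": {"miR-21": 100, "miR-122": 300, "other": 600},
    "treated": {"miR-21": 400, "miR-122": 600, "other": 1000},
}
norm = normalize_profiles(profiles, reference="control")
print(norm["treated"])
```

Here the treated profile has twice as many valid reads as the control, so its counts are halved before the two profiles are compared.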
After normalization, the basic bioinformatics analysis of the microRNA sequencing data is performed. This generally comprises two parts:
- Identification of differentially expressed known miRNAs
- Discovery of novel miRNAs