SpliceSeq is a visual tool that supports exploration of:
- A single sample’s transcriptome from RNA-Seq data
- A comparative analysis of transcriptomes from pairs of samples
- An average transcriptome of a group of samples
- A comparative analysis of transcriptomes between pairs of groups
SpliceSeq works by aligning sample reads to a database of known splicing patterns represented as gene transcript splice graphs. These splice graphs are constructed by the SpliceSeqDBBuild program and stored in SpliceSeq DB, a relational MySQL database that is distributed with the tool.
A custom sequence database is generated from the splice graphs, and the SpliceSeq Analyzer program aligns the RNA-Seq data against it.
Bowtie is used to align reads to the splice graph sequences, and the resulting summary statistics for each sample are stored in SpliceSeq DB. The summary statistics include normalized read count values for genes, exons, splices, attributes, etc.
SpliceSeq Analyzer then traverses the splice graphs to detect alternative splicing events and to evaluate the impact of splicing changes on protein products for each sample.
After the samples are loaded, SpliceSeq Analyzer is also called to compare samples. The comparison process identifies changes in splicing patterns between samples, classifies them, and evaluates the impact of those changes on known protein features.
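As an illustration of how traversing a splice graph can reveal an alternative splicing event, the sketch below looks for one event type, the exon skip. It is a minimal example under stated assumptions, not SpliceSeq Analyzer's actual algorithm: exons are identified by integers in 5' to 3' order, splices are (source, target) pairs, and the function name `find_exon_skips` is hypothetical.

```python
from collections import defaultdict


def find_exon_skips(splices):
    """Return (upstream, skipped, downstream) exon triples.

    `splices` is an iterable of (source_exon, target_exon) edges.
    Exon B counts as skippable when the edges A->B and B->C exist
    together with the bypass edge A->C.
    """
    successors = defaultdict(set)
    for src, dst in splices:
        successors[src].add(dst)

    events = []
    for a in sorted(successors):
        for b in sorted(successors[a]):
            # .get avoids adding new keys for exons with no outgoing splice
            for c in sorted(successors.get(b, ())):
                if c in successors[a]:
                    events.append((a, b, c))
    return events


# Hypothetical gene with four exons; the 1->3 splice bypasses exon 2.
example_splices = [(1, 2), (2, 3), (1, 3), (3, 4)]
skip_events = find_exon_skips(example_splices)
print(skip_events)  # [(1, 2, 3)]
```

Other event types (alternate donor or acceptor sites, retained introns, and so on) would need their own graph patterns; only the simplest case is shown here.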
Formation of splice graph
The first step is to summarize known transcript variations and knowledge about gene structure into a directed acyclic graph known as a splice graph, which represents exons as rectangular nodes and splice junctions as edges. The benefit of this representation is two-fold:
- It provides a succinct representation of all transcript splice paths represented in the source models.
- It allows for additional permutations of observed patterns by traversing the graph in new ways.
In a splice graph diagram, the thin exon sections represent untranslated regions (UTR) and the thick exon sections represent coding regions. Exons are drawn to scale and the connecting arcs represent splice paths. Each piece of transcript sequence is represented by exactly one node, so each read can be aligned unambiguously within the splice graph as long as there are no redundant sequences within the exons themselves. When an exon has both long and short forms, it is split into two sub-exons so that the common portion is represented only once in the graph.
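The two benefits described above can be sketched in a few lines. The example merges two hypothetical transcripts into one graph and then enumerates every path through it, which yields the source transcripts plus combinations that neither model contains. The exon labels and the function names `build_splice_graph` and `enumerate_paths` are assumptions made for illustration, not part of SpliceSeq.

```python
def build_splice_graph(transcripts):
    """Merge transcripts (lists of exon ids in 5'->3' order) into an edge map."""
    graph = {}
    for exons in transcripts:
        for exon in exons:
            graph.setdefault(exon, set())
        for src, dst in zip(exons, exons[1:]):
            graph[src].add(dst)
    return graph


def enumerate_paths(graph, start, end):
    """Yield every path from start to end by depth-first traversal.

    Terminates because a splice graph is a directed acyclic graph.
    """
    if start == end:
        yield [start]
        return
    for nxt in sorted(graph[start]):
        for rest in enumerate_paths(graph, nxt, end):
            yield [start] + rest


# Two hypothetical source transcripts for one gene.
known = [["E1", "E2", "E3", "E5"], ["E1", "E3", "E4", "E5"]]
graph = build_splice_graph(known)
paths = list(enumerate_paths(graph, "E1", "E5"))
novel = [p for p in paths if p not in known]
print(len(paths))  # 4
print(novel)       # [['E1', 'E2', 'E3', 'E4', 'E5'], ['E1', 'E3', 'E5']]
```

The graph stores each exon once, yet it encodes four possible transcripts, two of which are new permutations of the observed splices.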
The SpliceSeqDBBuild process builds this reference set of splice graphs, one for each gene, using gene models downloaded from the UCSC Genome Browser database. Many types of gene models are available, including RefSeq, UCSC Gene, Sanger Vega, Ensembl, and NCBI AceView, and any of them may be used. Currently, SpliceSeq is distributed with splice graphs derived from Ensembl models because they tend to produce interpretable graphs with strong annotation and a good balance between variation and complexity.
The steps performed by SpliceSeqDBBuild are:
- Use FTP to download gene model data from UCSC Genome Browser Tables. Currently, Ensembl (ensGene, ensPep, and ensemblToGeneName) is used.
- Download chromosome sequence from NCBI.
- Construct the SpliceSeq data schema.
- For each gene, merge gene model transcripts into a single exon landscape map and splice list using genomic coordinates of each model exon.
- Generate a consolidated splice graph for each gene. Split exons into sub-exons wherever a contiguous region contains multiple splice-in or splice-out points.
- Annotate models with attributes for UTR, Coding, and transcript start/stop locations.
- Retrieve and store nucleic sequence for each exon.
- Download and store UniProt protein sequence and position-specific feature information for each splice graph, associated via gene symbol.
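The merge and split steps in the list above can be illustrated with a small sketch. It assumes half-open genomic intervals `(start, end)` on the plus strand and splits every covered region at each exon boundary; the function name `split_into_subexons` and the coordinates are hypothetical, and this is not the SpliceSeqDBBuild implementation.

```python
def split_into_subexons(transcripts):
    """Merge transcript exons into sub-exons and a splice list.

    `transcripts` is a list of transcripts, each a list of (start, end)
    half-open exon intervals ordered by genomic position.
    Returns (subexons, splices), where each splice is a
    (donor_position, acceptor_position) pair.
    """
    exons = sorted({exon for t in transcripts for exon in t})
    boundaries = sorted({pos for start, end in exons for pos in (start, end)})

    subexons = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some model exon covers it; otherwise it is intron.
        if any(start <= left and right <= end for start, end in exons):
            subexons.append((left, right))

    splices = sorted({(a[1], b[0]) for t in transcripts for a, b in zip(t, t[1:])})
    return subexons, splices


# Two hypothetical models: the first exon has a short and a long form.
models = [
    [(100, 200), (300, 400)],
    [(100, 250), (300, 400)],
]
subexons, splices = split_into_subexons(models)
print(subexons)  # [(100, 200), (200, 250), (300, 400)]
print(splices)   # [(200, 300), (250, 300)]
```

The shared portion (100, 200) appears once, and the extension (200, 250) becomes its own sub-exon, matching the long and short exon case described earlier.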