Both linear and nonlinear classifiers for true splicing junction detection

Deep sequencing of the transcriptome has quickly become the most powerful technique for interrogating the whole transcriptional landscape, including both quantification of known transcripts and discovery of novel ones. In theory, all splicing events as well as chimeric transcripts can be detected directly. However, downstream analysis of RNA-seq data remains a major challenge. Several major alternative splicing forms, such as exon skipping, mutually exclusive exons, alternative first/last exons and intron retention, can be detected by simply mapping RNA-seq reads to hypothetical splicing junctions. The reliability of a splicing junction is determined by: 1) the number of reads mapping to the junction; 2) the number of mismatches on each mapped read; 3) the read mapping position on the junction, i.e. how close the center of the read is to the junction itself (the shorter the distance, the less likely that the mapping arose simply by chance); 4) the mismatch positions on the junction read, e.g. mismatches occurring at either end of a read are more likely due to sequencing error, while those occurring in the middle of a read are more likely to be polymorphisms. However, most previous studies considered only the first of these, the number of junction reads, i.e. an exon junction is considered real if it has more than R junction reads. This read-counting method, as demonstrated in the results, has both high false positive and high false negative rates. On the other hand, in one of the two earliest pioneering human transcriptome studies, Pan et al. used features similar to those described above to train both linear and nonlinear classifiers for true splicing junction detection, and achieved superior results. In this paper, we introduced a new statistical metric, namely Minimal Match on Either Side of Exon junction, as a means to measure the "quality" of junction reads by integrating all the features listed above.
Then, we presented a simple yet effective empirical statistical model using this metric to detect splicing junctions with real RNA-seq data.
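
To make the contrast between plain read counting and a quality-aware metric concrete, the following is a minimal sketch in Python. The text above does not give the formula for the metric, so the scoring used here is an assumption based only on its name: for each junction read, count the matched bases on each side of the junction and take the smaller of the two. The names JunctionRead, minimal_match_score, count_quality_reads and passes_read_count are hypothetical and introduced only for illustration; this is not the model fitted in the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JunctionRead:
    """A read aligned across a hypothetical exon junction (illustrative type)."""
    read_length: int
    # Number of read bases that fall on the upstream side of the junction.
    upstream_bases: int
    # 0-based positions of mismatches within the read.
    mismatch_positions: List[int] = field(default_factory=list)


def minimal_match_score(read: JunctionRead) -> int:
    """Assumed scoring: matched bases on each side, then the minimum of the two."""
    if not 0 < read.upstream_bases < read.read_length:
        raise ValueError("read must span the junction")
    if any(p < 0 or p >= read.read_length for p in read.mismatch_positions):
        raise ValueError("mismatch position outside the read")
    downstream_bases = read.read_length - read.upstream_bases
    upstream_mismatches = sum(
        1 for p in read.mismatch_positions if p < read.upstream_bases
    )
    downstream_mismatches = len(read.mismatch_positions) - upstream_mismatches
    return min(
        read.upstream_bases - upstream_mismatches,
        downstream_bases - downstream_mismatches,
    )


def count_quality_reads(reads: List[JunctionRead], min_score: int) -> int:
    """Count junction reads whose assumed score reaches min_score."""
    return sum(1 for r in reads if minimal_match_score(r) >= min_score)


def passes_read_count(reads: List[JunctionRead], r: int) -> bool:
    """Baseline from the text: a junction is called real if it has more than R reads."""
    return len(reads) > r


# Three 36-base reads on one junction: one centered, two barely overlapping it.
reads = [
    JunctionRead(read_length=36, upstream_bases=18),
    JunctionRead(read_length=36, upstream_bases=4, mismatch_positions=[1]),
    JunctionRead(read_length=36, upstream_bases=30, mismatch_positions=[10, 33]),
]
scores = [minimal_match_score(r) for r in reads]
```

Under this assumed scoring, a read centered on the junction with no mismatches scores highest, while a read that overlaps the junction by only a few bases, or whose short side carries a mismatch, scores low. The read-count baseline treats all three reads in the example as equal evidence, whereas a threshold on the score would keep only the well-anchored read.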