FusionAI: a deep learning classifier to predict human FGBPs.

FusionAI: a classifier to predict human FGBPs using deep learning methodology
To understand how human fusion gene breakpoints (FGBPs) form and act in disease, we used a convolutional neural network to classify FGBP-positive and FGBP-negative sequence contexts. With FusionAI, users can study the features of fusion gene breakpoints and infer the potential for breakage in genomic regions of interest.

Datasets

We examined the FGBP information of 48K fusion genes (FGs) from FusionGDB to map breakpoint (BP) locations across the human genome (Figure 6). Here ‘e’ and ‘i’ denote FGBPs located in the middle of an exon and an intron, respectively, and ‘j’ denotes an FGBP at an exon junction point. Because FGBPs are usually identified from RNA-seq data, the majority of BPs (~ 26K) were located at exon junction points (the j-j BP combination). We integrated 517 WGS-based FGBPs from the TCGA structural variant analysis; these are all i-i BP combinations. For the 4.3K e-e combination BPs identified from RNA-seq data, we hypothesize that these are the real genomic BPs. We combined these BPs into one file to represent the FG-positive BP data.

To build the fusion-negative BP data, we excluded the 17,110 genes involved in the 48K known human FGs from the 43K GENCODE genes, and chose gene pairs randomly from the remaining genes. We then used RepeatMasker, the Duplicated Genes Database, and the HUGO database's pseudogenes to filter out genes and BPs belonging to repeat regions, paralogs, or pseudogenes. We also excluded gene pairs with a neighboring-gene relationship, and set a minimum distance of 100 kb between the two randomly selected BPs across gene bodies. A 20 kb DNA sequence was constructed by concatenating the +/- 5 kb sequence around the BP of each of the two partner genes. Through these steps and filters, we created ~ 20K non-FGBP data samples.
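The 20 kb construction described above can be sketched as follows. This is a minimal illustration, not the pipeline used for FusionAI: the function names, the 0-based coordinates, the padding with 'N' at chromosome ends, and the omission of strand handling are all assumptions made for the example.

```python
FLANK = 5_000  # +/- 5 kb around each breakpoint, as stated in the text


def flank_window(chrom_seq: str, bp: int, half: int = FLANK) -> str:
    """Return the 2*half bases centred on a 0-based breakpoint position.

    Positions that fall outside the chromosome are padded with 'N'
    (an assumption; the source does not say how edges are handled).
    """
    start, end = bp - half, bp + half
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)]
    return left_pad + core + right_pad


def fusion_input(seq5: str, bp5: int, seq3: str, bp3: int,
                 half: int = FLANK) -> str:
    """Concatenate the 5' partner window and the 3' partner window.

    With half = 5000 the result is 20 kb long, and the two breakpoints
    sit at offsets 5 kb and 15 kb of the combined sequence.
    """
    return flank_window(seq5, bp5, half) + flank_window(seq3, bp3, half)
```

With the default flank, each partner contributes 10 kb, so the breakpoints of the 5' and 3' partners land at offsets 5 kb and 15 kb of the combined sequence.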

Fusion gene BPs from FusionGDB and ChiTaRS 3.0

Creating non-fusion BPs
1. Randomly choose two different genes that have never been reported as fusion genes in FusionGDB or ChiTaRS 3.0.
2. Randomly choose a genomic point within each of the two genes.
3. For intra-chromosomal pairs, require a distance > 100 kb between the two points.
4. Exclude read-through cases (co-transcription intergenic splicing cases, Co-TIS; fusions between neighboring genes).
5. Filter out repeat regions, paralogs, and pseudogenes. We discarded cases located within repeat regions obtained from the RepeatMasker (Smit et al., 1996) track in the UCSC Genome Browser. We also removed gene pairs involving paralogous genes obtained from the Duplicated Genes Database (Ouedraogo et al., 2012) or pseudogenes obtained from the HUGO database (Gray et al., 2013).
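The sampling steps above can be sketched roughly as below. The gene table layout, the function name `sample_negative_pair`, and the `discard_pair` predicate (a hypothetical stand-in for the read-through, repeat, paralog, and pseudogene filters of steps 4 and 5) are assumptions for illustration; the actual filters rely on the external databases named in the list.

```python
import random

MIN_DISTANCE = 100_000  # step 3: > 100 kb for intra-chromosomal pairs


def sample_negative_pair(genes, known_fusion_genes, discard_pair, rng):
    """Draw one candidate non-fusion breakpoint pair, or None if rejected.

    genes: dict mapping gene name -> (chrom, start, end)
    known_fusion_genes: set of gene names reported in known fusion genes
    discard_pair: hypothetical predicate taking two (gene, chrom, pos)
        tuples and returning True when the pair must be dropped
    rng: a random.Random instance, for reproducibility
    """
    # Step 1: two different genes never reported as fusion genes.
    candidates = sorted(g for g in genes if g not in known_fusion_genes)
    picks = []
    for gene in rng.sample(candidates, 2):
        chrom, start, end = genes[gene]
        # Step 2: a random genomic point inside the gene body.
        picks.append((gene, chrom, rng.randint(start, end)))
    first, second = picks
    # Step 3: distance check applies only to same-chromosome pairs.
    if first[1] == second[1] and abs(first[2] - second[2]) <= MIN_DISTANCE:
        return None
    # Steps 4-5: delegate the database-backed filters to the caller.
    if discard_pair(first, second):
        return None
    return first, second
```

Calling this function in a loop and keeping the non-`None` results until the target count is reached gives a simple rejection-sampling scheme.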

Datasets used for training and testing.

FusionAI model construction

Based on the primary sequences, we trained a multi-layer deep neural network to (1) predict the likelihood that a designated gene pair forms a fusion gene and (2) identify sequence patterns that facilitate fusion gene formation. The input covers the +/- 5 kb flanking the FGBP of each FG partner gene, and the output is the probability of fusion versus non-fusion. Specifically, the input is a 20 kb sequence of one-hot encoded nucleotides, where A, C, G, and T are encoded as [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1], respectively. The output is two probabilities, corresponding to FGBP and non-FGBP, that sum to one. Our deep neural network consists of two convolutional layers with filter size 10*2, one max pooling layer, one flatten layer, and two dense layers, followed by the output layer. The model has 1,672,290 parameters, including the weight matrices and biases of the relevant layers (Figure 7). We evaluated the model by its prediction accuracy on the test data. After training the optimal model, we explored feature importance by perturbing a 20 bp segment of the sequence and calculating the absolute change in the prediction caused by the perturbation. We slid this 20 bp window along the entire 20 kb input sequence and calculated a feature importance score for each window.
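The encoding and the window-perturbation score described above can be illustrated with a small sketch. Several details are assumptions, since the text does not specify them: the perturbation is modelled as zeroing the one-hot rows of a window, windows are non-overlapping by default (`step` equals `window`), non-ACGT bases map to all zeros, and `predict_fgbp` is a hypothetical callable standing in for the trained model's FGBP probability.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T order).

    Any other symbol (for example 'N') becomes [0, 0, 0, 0]; this is an
    assumption, not something stated in the text.
    """
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in seq.upper()]


def window_importance(encoded, predict_fgbp, window=20, step=None):
    """Score each window by the absolute change in the predicted probability.

    predict_fgbp: hypothetical callable mapping an encoded sequence to the
        FGBP probability (a float); it stands in for the trained network.
    """
    step = window if step is None else step
    baseline = predict_fgbp(encoded)
    scores = []
    for start in range(0, len(encoded), step):
        # Shallow copy: rows are replaced, never mutated, so the
        # original encoding is left untouched.
        perturbed = list(encoded)
        for i in range(start, min(start + window, len(encoded))):
            perturbed[i] = [0, 0, 0, 0]
        scores.append(abs(predict_fgbp(perturbed) - baseline))
    return scores
```

For a 20 kb input with 20 bp non-overlapping windows this yields 1,000 scores, one per window. Other perturbations, such as random base substitution, would fit the same loop.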

Classification results and the identified important features for FG breakpoints

We curated 20K j-j combination BPs and 20K non-FGBPs to train and test our model. Of the combined 40K BPs, 32K were used as the training and validation sets (80% for training and 20% for validation), and the remaining 8K were held out as an independent test set. The performance (accuracy and loss) during training is shown in Figure 8A. We then tested the trained model on both the 32K original training samples and the 8K test samples. The accuracies on the training and test data sets were 96.6% and 92.6%, with error rates of 0.10 and 0.18, respectively. This is substantially better than a traditional machine learning method, SVM, which yielded accuracies of 79% and 72%, respectively. Figure 8B shows the feature importance score distribution across the 20 kb sequence of six well-known FGs: BCR-ABL1, EML4-ALK, TMPRSS2-ERG, PML-RARA, RUNX1-RUNX1T1, and FGFR3-TACC3. Notably, a very narrow region near each BP (positions 5 kb and 15 kb correspond to the BPs of the 5’ and 3’ partner genes) yielded high BP feature importance (BPfi) scores, indicating that only a short segment of sequence surrounding the BP site may represent the BP characteristics. We further examined the sequences with high BPfi scores and found repeated motif sequences. For example, the fusion gene from EML4 (chr2:42491871) and ALK (chr2:29446394) yielded consistently high BPfi scores at regions with the motifs TTAAAAAT, AACCAAGGT, and GACCGACTA. Sub-motifs of TTAAAAAT, such as TTAAAA and AAAAAT, are microsatellites in the human genome. This is consistent with previous studies reporting that cancer cells with large numbers of microsatellites are regarded as defective in their ability to repair DNA damage or errors. We will continue to investigate the mechanisms related to these sequence features.
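The data split described above (40K samples, 8K held out for testing, and the remaining 32K divided 80%/20% into training and validation) can be reproduced with a short sketch. The function name, the random shuffling, and the fixed seed are assumptions; the text does not state how samples were assigned to each set.

```python
import random


def split_dataset(samples, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Shuffle and split samples into (train, validation, test) lists.

    test_fraction is taken from the whole data set; val_fraction is taken
    from what remains after the test set is removed.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round((len(shuffled) - n_test) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

Applied to 40,000 samples, this gives 25,600 training, 6,400 validation, and 8,000 test samples, matching the 32K/8K division in the text.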

FusionGDB 2.0 fusion gene annotation update aided by deep learning

FGBPmap is a reference map of FGBP scores for searching the tendency of breakage across the human genome. Users can search the tendency of fragility across the 3 billion bases of the human genome sequence. We will implement two search approaches: one based on BP information and one based on sequence. You can visit FGBPmap by clicking this.

About us

Pora Kim, MS, PhD, Hua Tan, PhD, and Xiaobo Zhou, PhD

Email: [email protected], [email protected], [email protected]

Mailing address:

  Center for Computational Systems Medicine
  School of Biomedical Informatics
  The University of Texas Health Science Center at Houston
  7000 Fannin Street, Houston, TX 77030