Assessment of contribution of genomic data sources to predicting protein functions

Seokha Ko (sukdo@gist.ac.kr), Hyunju Lee (hyunjulee@gist.ac.kr)
Department of Information and Communications, Gwangju Institute of Science and Technology
Gwangju, Republic of Korea


Outline

Predicting protein function with a group of highly contributing genomic data sources, rather than with all available data sets, could increase classification accuracy. In this study, we investigated the relationship between genomic data sets and protein functions. To do this, we used Gene Ontology (GO) terms as function labels and 10 genomic data sets for M. musculus, including protein domains, protein-protein interactions, gene expression, phenotype ontology, phylogenetic profiles, and diseases.


First, we measured the contribution of each data set by its prediction accuracy using kernel logistic regression (KLR). We predicted 1,726 GO-BP (Gene Ontology Biological Process) terms ten times, each time using a different one of the 10 data sets. Then, for each GO term, we selected the data sets with high prediction accuracy.
"Exhaustive Search result" search tool
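The per-data-set selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes prediction scores for one GO term are already available for each data set, measures accuracy by AUC only, and borrows the 0.75 AUC cut-off quoted in the additional-file descriptions on this page. The names `auc`, `select_sources`, `source_A`, and `source_B` are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_sources(scores_by_source, labels, min_auc=0.75):
    """Return {source: AUC} for the sources whose AUC reaches min_auc."""
    selected = {}
    for name, scores in scores_by_source.items():
        value = auc(scores, labels)
        if value is not None and value >= min_auc:
            selected[name] = value
    return selected


# Toy data for one GO term: 1 = protein annotated with the term.
labels = [1, 1, 0, 0, 0]
scores_by_source = {
    "source_A": [0.9, 0.8, 0.3, 0.2, 0.1],  # ranks both positives first
    "source_B": [0.2, 0.6, 0.5, 0.7, 0.1],  # no better than chance
}
selected = select_sources(scores_by_source, labels)
print(selected)
```

Running one such selection per GO term and per data set is what makes this approach exhaustive, and is the source of its cost.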

Second, to reduce the time complexity of the previous approach, we also applied an L1-regularized kernel logistic regression to select the contributing data sets for each GO term.
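To illustrate how an L1 penalty selects data sets, here is a minimal sketch. It is a simplified stand-in rather than the method used in the paper: it fits a plain (non-kernel) logistic regression by proximal gradient descent with soft-thresholding, treats each column as one score per data source, and uses an absolute penalty `lam` that is not the "relative lambda" of the additional files. The function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x, t):
    """Proximal operator of the L1 penalty: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def fit_l1_logreg(X, y, lam=0.1, step=0.5, iters=2000):
    """L1-penalized logistic regression (intercept unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        # Gradient step followed by shrinkage; small weights become exactly 0.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Columns: [source_A, source_B]. source_A separates most proteins;
# source_B fires for a single positive protein only.
X = [[1, 1], [1, 0], [1, 0], [1, 0], [-1, 0], [-1, 0], [-1, 0], [-1, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 1]
w, b = fit_l1_logreg(X, y, lam=0.1)
contributing = [j for j, wj in enumerate(w) if wj != 0.0]
print(w, contributing)
```

On this toy data the penalty drives the weight of the second column to exactly zero, so only the first column is reported as contributing. A single fit per GO term replaces the separate per-data-set fits of the exhaustive approach.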

To gain insight into the relationship between GO terms and genomic data types, we examined the GO terms that have high prediction accuracy for each data set. Then, we performed a hypergeometric test to measure how well the contributing data sets of each GO term agree with those of its descendant GO terms in the directed acyclic graph structure of the Gene Ontology.
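A hypergeometric enrichment test of this kind can be sketched as below. The mapping of counts to the test is an assumption made for illustration (the paper gives the exact setup): the population is all evaluated GO terms, the successes are the terms well predicted by one data source, the draws are the descendants of a given GO term, and the observed value is the number of those descendants that are well predicted. The function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, successes, draws, observed):
    """Upper-tail probability P(X >= observed) for sampling without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(observed, upper + 1))
    return tail / total


# Toy numbers: 20 GO terms evaluated, 5 well predicted by the data source,
# a GO term with 4 descendants, 3 of which are well predicted.
p_value = hypergeom_pvalue(population=20, successes=5, draws=4, observed=3)
print(round(p_value, 4))  # 155 / 4845, about 0.032
```

A small p-value indicates that the descendants of a GO term share its contributing data source more often than expected by chance.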

For more details, please refer to the paper.

Additional files

  1. Prediction result:
    AUC values and regression model coefficients related to Table 1. Each data source was treated as in Equation 7, so there are two predictors for each data source.
    This file can be viewed with Microsoft Excel Viewer
    • Information about gene count of GO terms in each data source
    • KLR with integrated and unstandardized data
    • KLR with integrated and standardized data
    • Exhaustive search
    • L1 log reg with unstandardized data and relative lambda 0.01
    • L1 log reg with standardized data and relative lambda 0.01
    • L1 log reg with unstandardized data and relative lambda 0.1
    • L1 log reg with standardized data and relative lambda 0.1
  2. Precision at various recall values
    Precision at various recall values for exhaustive search and L1-norm regularized logistic regression (regularization parameter = 0.01, standardized data).
    This file can be viewed with Microsoft Excel Viewer
  3. The lists of GO terms predicted well with a given data source in two different approaches:
    To form these groups of GO terms, the criteria for a highly contributing data source were high prediction accuracy (>=0.75 AUC and >=0.2 P20R) in exhaustive search and a large coefficient (outside 1 SD, together with >=0.2 P20R) in L1-norm regularized logistic regression.
    This file can be viewed with Microsoft Excel Viewer
  4. GO terms giving high prediction accuracy with only one data source:
    Table listing the complete data for Table 3.
    This file can be viewed with Microsoft Excel Viewer
  5. Data underlying Table 3 (Additional File 4)
    Table listing data underlying Table 3 (Additional File 4).
  6. Enrichment test result (Exhaustive Search & L1 log reg)
    Results of the enrichment tests for exhaustive search and L1-norm regularized logistic regression (standardized data, relative regularization parameter 0.01). Bold and italic type marks GO terms that are significant in the enrichment tests of both approaches. Among these, underlining marks GO terms that satisfy the cut-offs of both approaches (i.e., exhaustive search: >=0.75 AUC and >=0.2 P20R; L1-norm regularized logistic regression: coefficient of the given data source outside 1 SD and >=0.2 P20R).
    This file can be viewed with Microsoft Excel Viewer
  7. The hierarchy of 'Reproduction' (GO:0000003):
    The hierarchy of 'Reproduction', which is highly significant for MGI in the enrichment test, is depicted. Colored GO terms are well predicted with MGI; among them, red boxes mark GO terms that have a high AUC with only that data source. Parts of the hierarchy are missing because our proteome was restricted, so a dotted line is used for an ancestor that is not a direct parent. The parentheses give the number of gene products (together with the number of gene products in the MGI phenotype data source) and the prediction accuracy obtained with the MGI phenotype data source in exhaustive search. An AUC value of 'Na' means that prediction was not possible because the MGI phenotype data source contains too few gene products. For example, viral infectious cycle (GO:0019058) has four gene products in total, but the MGI data source has data for only one of them.
  8. Data underlying Figure 3 (Interpro, Zhang expression, PPI, OMIM, MGI)
  9. Data underlying Figure 4 (Interpro, Zhang expression, PPI, OMIM, MGI)
  10. Data underlying Figure 5 (Interpro, Zhang expression, PPI, OMIM, MGI)
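
The cut-offs quoted in the file descriptions above can be sketched in code. This is an illustration under stated assumptions: P20R is taken to mean precision at 20% recall, tied scores are not handled specially, and "outside 1 SD" is read as lying more than one standard deviation from the mean of the coefficients passed in, which is one possible reading of the criterion. All function names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def precision_at_recall(scores, labels, recall=0.2):
    """Precision at the first rank where the requested recall is reached."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None  # recall is undefined without positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        hits += label
        if hits / total_pos >= recall:
            return hits / rank
    return None


def passes_exhaustive(auc_value, p20r):
    """Exhaustive-search cut-off: >=0.75 AUC and >=0.2 P20R."""
    return auc_value >= 0.75 and p20r >= 0.2


def outside_one_sd(coef, all_coefs):
    """Assumed reading of 'outside 1 SD': farther than 1 SD from the mean."""
    return abs(coef - mean(all_coefs)) > pstdev(all_coefs)


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 0, 1, 0, 0]
p20r = precision_at_recall(scores, labels, recall=0.2)
print(p20r, passes_exhaustive(0.8, p20r))
print(outside_one_sd(0.9, [0.0, 0.1, 0.0, 0.9]))
```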
Additional Files (click the links below)
Prediction Result (Table 1, xls)
Precision at various recall value (xls)
The lists of GO terms predicted well with a given data source in two different approaches (Table 2, xls)
GO terms giving high prediction accuracy with only one data source (Table 3, xls)
Data underlying Table 3 and Additional File 4 (Text)
Enrichment test result (Table 4, xls)
GO:0000003 Hierarchy (png)
GO:0006812 Data (Figure 3, Text)
GO:0000003 Data (Figure 4, Text)
GO:0045935 Data (Figure 5, Text)