Outline
Predicting protein function from a group of highly
contributing genomic data sources, rather than from all
available data sets, could increase classification
accuracy. In this study, we investigated the
relationship between genomic data sets and protein
functions. To do this, we used Gene Ontology (GO)
terms as function labels and 10 genomic data sets for M.
musculus, including protein domains, protein-protein
interactions, gene expression, phenotype ontology,
phylogenetic profiles, and diseases.
- Prediction model
- Kernel based logistic regression (KLR)
- L1-norm regularized logistic regression (L1 log reg)
First, we measured the contribution of each data set
based on its prediction accuracy using KLR.
We predicted each of the 1,726 GO-BP (Gene Ontology
Biological Process) terms ten times, each time
using a different one of the 10 data sets. Then, we
selected the data sets with high prediction accuracy for
each GO term.
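The per-data-set comparison described above can be illustrated with a minimal sketch. It assumes that a classifier (KLR in the study, not implemented here) has already produced one score per gene for each data set. The function names and the toy scores are hypothetical. AUC is computed with the rank-sum identity, and a GO term with no positive or no negative genes in a data set yields no AUC.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Returns None when there are no positives or no negatives, since
    no AUC can be computed in that case.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_data_sets(scores_by_source, labels):
    """Return (data set, AUC) pairs for one GO term, best AUC first."""
    results = {name: auc(s, labels) for name, s in scores_by_source.items()}
    scored = [(name, a) for name, a in results.items() if a is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy example: six genes, 1 = annotated with the GO term (made-up values).
labels = [1, 1, 0, 0, 1, 0]
scores_by_source = {
    "Interpro": [0.9, 0.8, 0.3, 0.2, 0.7, 0.4],
    "PPI": [0.6, 0.4, 0.5, 0.3, 0.2, 0.7],
}
ranked = rank_data_sets(scores_by_source, labels)
```

Repeating this for every GO term gives one AUC per (GO term, data set) pair, from which the data sets with high accuracy can be selected.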
"Exhaustive Search result" search tool
Second, to reduce the time complexity of the previous
approach, we also applied an L1-norm regularized logistic
regression with kernels to select the contributing data sets
for each GO term.
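The idea of letting an L1 penalty select data sets can be sketched as follows. This is a simplified illustration, not the model from the paper: it assumes each data source contributes a single numeric feature per gene, and it fits the model with plain proximal gradient descent (soft-thresholding). The function names, the toy data, and the step size are assumptions.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logreg(X, y, lam, lr=0.5, n_iter=2000):
    """L1-penalized logistic regression with an unpenalized intercept."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        # Residuals p_i - y_i of the current model.
        resid = [
            sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad_b = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(d)]
        b -= lr * grad_b
        # Gradient step on the loss, then shrinkage from the L1 penalty.
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data: feature 0 tracks the label, feature 1 is unrelated to it.
y = [1, 1, 1, 1, 0, 0, 0, 0]
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
w, b = fit_l1_logreg(X, y, lam=0.1)
w_big, _ = fit_l1_logreg(X, y, lam=10.0)
```

With a moderate penalty the informative feature keeps a clearly non-zero coefficient while the uninformative one is shrunk toward zero; with a very large penalty every coefficient is zero. A single fit therefore ranks all data sources at once, which is what saves time relative to fitting one model per data set.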
To gain insight into the relationship between GO terms
and genomic data types, we examined the GO terms that
have high prediction accuracy for each data set. Then, we
performed a hypergeometric test to measure how well the
contributing data sets of each GO term agree with those
of its descendant GO terms in the directed acyclic graph
structure of the Gene Ontology.
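A hypergeometric enrichment test of this kind can be sketched as below. The setup is an assumption about how the counts are defined: N is the number of GO terms considered, K the number of GO terms predicted well with a given data source, n the number of descendants of the GO term of interest, and k the number of those descendants predicted well with that data source. The function name and the example counts (other than 1,726) are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are 'successes'."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical counts: 1,726 GO-BP terms, 200 predicted well with one
# data source, 30 descendants of the term of interest, 12 of them among
# the 200.
p_value = hypergeom_upper_tail(1726, 200, 30, 12)
```

A small p-value indicates that the descendants of the GO term are predicted well with that data source more often than expected by chance.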
For more details, please refer to the paper.
Additional files
- Prediction result:
AUC values and regression model coefficients related to Table 1 are presented. Each data source was treated as in Equation 7, so there are two predictors for each data source.
This file can be viewed with Microsoft Excel Viewer
- Information about gene count of GO terms in each data source
- KLR with integrated and unstandardized data
- KLR with integrated and standardized data
- Exhaustive search
- L1 log reg with unstandardized data and relative lambda 0.01
- L1 log reg with standardized data and relative lambda 0.01
- L1 log reg with unstandardized data and relative lambda 0.1
- L1 log reg with standardized data and relative lambda 0.1
- Precision at various recall values
Precision at various recall values for exhaustive search and L1-norm regularized logistic regression (regularization parameter = 0.01, standardized data).
This file can be viewed with Microsoft Excel Viewer
- The lists of GO terms predicted well with a given data source in two different approaches:
To form these groups of GO terms, the criteria for a highly contributing data source were high prediction accuracy (>=0.75 AUC and >=0.2 P20R value) in exhaustive search and a large coefficient (outside of 1SD and >=0.2 P20R value) in L1-norm regularized logistic regression.
This file can be viewed with Microsoft Excel Viewer
- GO terms giving high prediction accuracy with only one data source:
Table listing the complete data for Table 3.
This file can be viewed with Microsoft Excel Viewer
- Data underlying Table 3 (Additional File 4)
Table listing data underlying Table 3 (Additional File 4).
- Enrichment test result (Exhaustive Search & L1 log reg)
The results of the enrichment tests for exhaustive search and L1-norm regularized logistic regression (with standardized data and a relative regularization parameter of 0.01) are presented. Bold and italic type indicates GO terms that are significant in the enrichment tests of both approaches. Among them, underlining marks GO terms that satisfy the cut-off in both approaches (i.e., in exhaustive search: >=0.75 AUC and >=0.2 P20R value; in L1-norm regularized logistic regression: the coefficient of the given data source is outside of 1SD and >=0.2 P20R value).
This file can be viewed with Microsoft Excel Viewer
- The hierarchy of 'Reproduction' (GO:0000003):
The hierarchy of 'Reproduction', which is highly significant with MGI in the enrichment test, is depicted. Colored GO terms are well predicted with MGI and, among them, red boxes represent GO terms that have a high AUC with only that data source. Some parts of the hierarchy are missing because our proteome was restricted, so a dotted line is used to indicate an ancestor that is not a direct parent.
The parentheses give the number of gene products (together with the number of gene products in the MGI phenotype data source) and the prediction accuracy with the MGI phenotype data source in exhaustive search. An AUC value of 'Na' means that no prediction could be made because too few gene products are present in the MGI phenotype data source. For example, in the hierarchy, viral infectious cycle (GO:0019058) has four gene products in total, but the MGI data source has data for only one of them.
- Data underlying Figure 3 (Interpro, Zhang expression, PPI, OMIM, MGI)
- Data underlying Figure 4 (Interpro, Zhang expression, PPI, OMIM, MGI)
- Data underlying Figure 5 (Interpro, Zhang expression, PPI, OMIM, MGI)
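
The P20R value used in the cut-offs above can be illustrated with a short sketch. It assumes that P20R means precision at 20% recall, and it takes the precision at the first rank in the score-ordered gene list where recall reaches the target; tied scores are not given special treatment. The function names and the toy data are hypothetical, and the thresholds are the exhaustive search cut-offs listed above.

```python
def precision_at_recall(scores, labels, recall=0.2):
    """Precision at the first rank where recall reaches the target level."""
    n_pos = sum(labels)
    if n_pos == 0:
        return None  # undefined without any positive genes
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    for rank, i in enumerate(order, start=1):
        true_pos += labels[i]
        if true_pos / n_pos >= recall - 1e-12:
            return true_pos / rank
    return None


def passes_exhaustive_cutoff(auc_value, p20r, min_auc=0.75, min_p20r=0.2):
    """Exhaustive search criterion: >=0.75 AUC and >=0.2 P20R."""
    if auc_value is None or p20r is None:
        return False
    return auc_value >= min_auc and p20r >= min_p20r


# Toy example: eight genes ordered by score, five of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [0, 1, 1, 0, 1, 0, 1, 1]
p20r = precision_at_recall(scores, labels, recall=0.2)
```

In this toy list the first positive gene appears at rank 2, so 20% recall (one of five positives) is reached there and the precision at that point is 1/2.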