****************************************************************************************************
*                                           ReadMe File                                            *
*                                           -----------                                            *
* "Voting based cancer module identification by combining topological and data-driven properties"  *
*                                                                                                  *
*                               Authors: A. K. M. Azad & Hyunju Lee                                *
*                             Data Mining & Computational Biology Lab                              *
*                            Gwangju Institute of Science & Technology                             *
*                                       Gwangju, South Korea                                       *
*                                    http://combio.gist.ac.kr/                                     *
*                                                                                                  *
****************************************************************************************************

I. Independent Dataset:
=======================

   1. "GeneSymbolLocation_hg18.txt" [provided]: Gene Symbol to chromosomal location mapping.

   2. "std_inv_paired_GE_data_fixed_new.csv" [GBM data; will be provided upon request]: Standardized Gene Expression (GE) data.

      File Format:
      -----------
      - The GE data is formatted as a matrix whose cells contain standardized gene expression values separated by commas (,).
      - Rows are labeled with Gene Symbols and columns are labeled with sample IDs.

   3. "Paired_CNV_data_L3_imputed_new.csv" [GBM data; will be provided upon request]: Level 3 segmented CNA data (after missing
      value imputation), in which Gene Symbols were mapped to chromosomal locations with the aid of the
      "GeneSymbolLocation_hg18.txt" file.

      File Format:
      -----------
      - The CNA data is formatted as a matrix whose cells contain segmented CNA values separated by commas (,).
      - Rows are labeled with Gene Symbols and columns are labeled with sample IDs.

   4. "diff_exp_genes.csv" [GBM data; provided]: List of differentially expressed genes found by t-test with Bonferroni
      correction in the GE data ("std_inv_paired_GE_data_fixed_new.csv").

   5. "altered_GeneSymbols[GISTIC + RAE].csv" [GBM data; provided]: List of significantly altered genes in the CNA data
      ("Paired_CNV_data_L3_imputed_new.csv"), as reported by TCGA (2008).

   6. "Seed_genes.csv" [GBM data; provided]: List of Seed genes used by our algorithm, composed of the distinct genes among
      the differentially expressed genes ("diff_exp_genes.csv") and the significantly altered genes
      ("altered_GeneSymbols[GISTIC + RAE].csv").

   7. "All Corrected PPI genes + Seed genes.csv" [GBM data; provided]: Genes used as nodes in the PPI graph.

   8. "All Corrected PPI interactions.csv" [provided]: PPI among the genes (nodes) in the PPI graph.

   9. "CNV_segemented_region_2_gs_mapping.csv" [GBM data; provided]: Mapping between Gene Symbols and chromosome locations
      in our collected CNA data set. The mapping was aided by the "GeneSymbolLocation_hg18.txt" file.

II. List of Source Codes: (Ordered by Running Sequence)
========================

   1. "Construct_all_Typesof_PCC"
   2. "Define_Direct_Relations"
   3. "Define_Indirect_Relationships"
   4. "Vote_Calculation"
   5. "Wrap_up_MD_algo"

III. Dependencies: (Third-Party Open-Source Libraries)
=================

   1. "alglib.net": Used for calculating the Pearson Correlation Coefficient and Student's t-test.
      (http://www.alglib.net/translator/re/alglib-3.5.0.csharp.zip)
   2. "An Extensive Examination of Data Structures Using C# 2.0": Used for the Graph implementation.
      (http://msdn.microsoft.com/en-us/library/ms379574(v=vs.80).aspx)

IV. Running Source Codes: (Sequentially)
==========================

   Note: The running time reported for each source code (for 4,821 GBM seed genes) was measured on a machine with the
         following specification:
         a) Windows 32-bit operating system
         b) Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.4GHz
         c) 3.25GB RAM

1. "Construct_all_Typesof_PCC":
===============================
----------------------------------------------------------------------------------------------------------------------------------
| Description:
| ============
| - Enumerates all possible types of pairwise Pearson Correlation Coefficients (PCC) among the Seed genes.
| - Executes within 7 hours for 4,821 GBM seed genes.
|
| Input: (Input file names are for GBM only; for OVC, use the corresponding files included in InputFiles.zip)
| =====
| a) "diff_exp_genes.csv"
| b) "altered_GeneSymbols[GISTIC + RAE].csv"
| c) "std_inv_paired_GE_data_fixed_new.csv"
| d) "Paired_CNV_data_L3_imputed_new.csv"
|
| Output:
| =======
| a) "Pairwise_PCC_full.csv" : Enumeration of all possible types of PCC among the Seed genes. The file is formatted as below:
|        ,,
|
----------------------------------------------------------------------------------------------------------------------------------

2. "Define_Direct_Relations":
=============================
----------------------------------------------------------------------------------------------------------------------------------
| Description:
| ============
| - Defines Direct relationships in the "Gene-Gene Relationship Network".
| - The corresponding threshold (GE-GE, CNAs-CNAs, or CNAs-GE) is applied to each type of PCC.
| - Executes within 7 minutes for 4,821 GBM seed genes.
|
| Input: (Input file names are for GBM only; for OVC, use the corresponding files included in InputFiles.zip)
| =====
| a) "Seed_genes.csv"
| b) "Pairwise_PCC_full.csv"
|
| Output:
| =======
| a) "Gene_Gene_Relatinship_Network_[Direct].csv" : Gene-Gene Relationship Network with Direct relationships only.
|                                                   A non-zero entry means the PCC (between the gene pair) is >= threshold;
|                                                   an entry of "0" means the PCC is < threshold.
|
----------------------------------------------------------------------------------------------------------------------------------

3. "Define_Indirect_Relationships":
==================================
----------------------------------------------------------------------------------------------------------------------------------
| Description:
| ============
| - Defines Indirect relationships in the "Gene-Gene Relationship Network" by combining PPI and pairwise PCC values.
| - Executes within 20 to 22 hours for 4,821 GBM seed genes.
|
| Input: (Input file names are for GBM only; for OVC, use the corresponding files included in InputFiles.zip)
| =====
| a) "Seed_genes.csv"
| b) "All Corrected PPI genes + Seed genes.csv"
| c) "All Corrected PPI interactions.csv"
| d) "CNV_segemented_region_2_gs_mapping.csv"
| e) "std_inv_paired_GE_data_fixed_new.csv"
| f) "Paired_CNV_data_L3_imputed_new.csv"
| g) "Gene_Gene_Relatinship_Network_[Direct].csv"
|
| Output:
| =======
| a) "Gene_Gene_Relatinship_Network_[Direct + Indirect].csv" : Some entries of the Gene-Gene Relationship Network with "0"
|                                                              values are updated by indirect relationships.
|
----------------------------------------------------------------------------------------------------------------------------------

4. "Vote_Calculation":
======================
----------------------------------------------------------------------------------------------------------------------------------
| Description:
| ============
| - Calculates pairwise votes among gene pairs by combining Data-driven and Topological (PPI) properties.
| - Executes within 42 hours for 4,821 GBM seed genes.
|
| Input: (Input file names are for GBM only; for OVC, use the corresponding files included in InputFiles.zip)
| =====
| a) "Seed_genes.csv"
| b) "All Corrected PPI genes + Seed genes.csv"
| c) "All Corrected PPI interactions.csv"
| d) "altered_GeneSymbols[GISTIC + RAE].csv"
| e) "std_inv_paired_GE_data_fixed_new.csv"
| f) "Paired_CNV_data_L3_imputed_new.csv"
| g) "Gene_Gene_Relatinship_Network_[Direct + Indirect].csv"
|
| Output:
| =======
| a) One file per seed gene containing the corresponding Vote Table (files are named after the gene).
|    Each Vote Table has the following format:
|        ,,,,<'r_value(g,m)' from
|        "Gene_Gene_Relatinship_Network_[Direct + Indirect].csv">
|
----------------------------------------------------------------------------------------------------------------------------------

5. "Wrap_up_MD_algo":
======================
----------------------------------------------------------------------------------------------------------------------------------
| Description:
| ============
| - Forms Pre-Modules and merges Modules.
| - Executes within 10 minutes for 4,821 GBM seed genes.
|
| Input: (Input file names are for GBM only; for OVC, use the corresponding files included in InputFiles.zip)
| =====
| a) "Seed_genes.csv"
| b) "All Corrected PPI genes + Seed genes.csv"
| c) "All Corrected PPI interactions.csv"
| d) "CNV_segemented_region_2_gs_mapping.csv"
| e) "List of all Vote Tables"
| f) "Gene_Gene_Relatinship_Network_[Direct + Indirect].csv"
|
| Output:
| =======
| a) List of Modules. Modules are in the following format:
|        ,
|
----------------------------------------------------------------------------------------------------------------------------------
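
Illustrative sketch (not part of the distributed source codes):
The first two steps above (pairwise PCC, then thresholding into Direct relationships) can be pictured with the short
Python sketch below. It is only a reading aid under stated assumptions: the actual programs rely on the C# libraries
listed in Section III, and the function names (read_matrix, pearson, direct_relations), the header cell "gene", the toy
values, and the threshold 0.5 are all hypothetical. The sketch handles a single data type and applies the rule exactly
as worded above (PCC >= threshold is kept, otherwise "0"); whether the real code treats negative correlations
differently is not stated in this ReadMe.

```python
import csv
import io
import math


def read_matrix(text):
    """Parse a comma-separated matrix: first row holds sample IDs,
    each following row is a Gene Symbol followed by its values.
    (The exact header layout is an assumption.)"""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    samples = rows[0][1:]
    data = {r[0]: [float(v) for v in r[1:]] for r in rows[1:]}
    return samples, data


def pearson(x, y):
    """Pearson Correlation Coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def direct_relations(data, threshold):
    """Keep the PCC of a gene pair if it is >= threshold, else store 0."""
    genes = sorted(data)
    network = {}
    for i, g in enumerate(genes):
        for m in genes[i + 1:]:
            r = pearson(data[g], data[m])
            network[(g, m)] = r if r >= threshold else 0.0
    return network


# Toy matrix with made-up genes A, B, C and samples S1..S4.
EXAMPLE = """gene,S1,S2,S3,S4
A,1.0,2.0,3.0,4.0
B,2.0,4.0,6.0,8.0
C,4.0,1.0,3.0,2.0
"""

samples, data = read_matrix(EXAMPLE)
network = direct_relations(data, threshold=0.5)
for (g, m), r in sorted(network.items()):
    print(g, m, round(r, 3))
```

In this toy matrix only the pair (A, B) stays non-zero, because B is an exact multiple of A; the other two pairs fall
below the hypothetical threshold and are stored as 0.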