tutorial6 This exercise demonstrates how to reconstruct a combined draft GEM by from KEGG and MetaCyc pathway databases. A combined model with the comprehensive coverage of metabolic pathways is generated from different de novo reconstruction approaches. The input is a FASTA format file with whole-proteome sequences. The combined model is subsequently used for refinement of existing high-quality model and generation of a new version of GEM, by utilizing the manual curation results. This tutorial is a showcase of the new features released in RAVEN 2.0 through demonstrating the utilization of the newly developed functions on GEM reconstruction and curation for Streptomyces coelicolor strain A3(2). Users may apply this script as the template in their own work for other organisms. This refers to Tutorial 6 from "RAVEN tutorials.docx"
0001 % tutorial6 0002 % This exercise demonstrates how to reconstruct a combined draft GEM by 0003 % from KEGG and MetaCyc pathway databases. A combined model with the 0004 % comprehensive coverage of metabolic pathways is generated from 0005 % different de novo reconstruction approaches. The input is a FASTA 0006 % format file with whole-proteome sequences. The combined model is 0007 % subsequently used for refinement of existing high-quality model and 0008 % generation of a new version of GEM, by utilizing the manual curation 0009 % results. This tutorial is a showcase of the new features released in 0010 % RAVEN 2.0 through demonstrating the utilization of the newly developed 0011 % functions on GEM reconstruction and curation for Streptomyces 0012 % coelicolor strain A3(2). Users may apply this script as the template in 0013 % their own work for other organisms. 0014 % This refers to Tutorial 6 from "RAVEN tutorials.docx" 0015 0016 %Before reconstruction, a FASTA file with protein sequences of the target 0017 %organism needs to be prepared. In this tutorial, all protein sequences of 0018 %S. coelicolor A3(2) were downloaded from the NCBI genome database and 0019 %provided in current folder (Sco_all_protein.faa). The description lines 0020 %(starting with ">" character) of this FASTA file were modified by keeping 0021 %only the locus tag (e.g. SCO6005), which is commonly used as the gene 0022 %names in GEMs. 0023 0024 %Generate a draft GEM from MetaCyc database. The first parameter is 0025 %organismID and needs to be specified by user. The other parameters are set 0026 %to exclude unbalanced and undetermined reactions, but keep transport 0027 %reactions. The two parameters for homology search are set to the default 0028 %values that have been optimized to capture the protein hits with both 0029 %comprehensive coverage and the least false positives. 0030 ScoMetaCycDraftModel=getMetaCycModelForOrganism('ScoMetaCyc','Sco_all_protein.faa',1); 0031 0032 %Generate the two draft models from KEGG database. The first one is 0033 %reconstructed from the genome information annotated by KEGG. Since KEGG 0034 %provides genome annotations for over 5000 organisms, their GEMs can thus 0035 %be generated without providing protein sequences while only using the 0036 %organismID. The organisms with KEGG annotation and their associated ids 0037 %(e.g. 'sco' for Streptomyces coelicolor) can be found from here: 0038 %https://www.kegg.jp/kegg/catalog/org_list.html. 0039 %Generate a draft model using KEGG annotation and exclude incomplete 0040 %reactions and reactions with undefined stoichiometry 0041 ScoKEGGAnnotation=getKEGGModelForOrganism('sco','','','',1,0,0); 0042 0043 %Generate the second KEGG-based draft model based on sequence homology to 0044 %KEGG Ortholog sequence clusters while excluding incomplete reactions and 0045 %the ones with undefined stoichiometry. Type "help getKEGGModelForOrganism" 0046 %to see the detailed instructions for the choice of different parameters. 0047 %The default values for homology search are used because they have been 0048 %optimized for the best performance. 0049 ScoKEGGHomology=getKEGGModelForOrganism('ScoKEGGHMMs','Sco_all_protein.faa','prok90_kegg105','',1,0,0); 0050 0051 %De novo reconstruction from MetaCyc should take about 10 minutes, while 0052 %both reconstructions from KEGG may take up to 50-60 minutes in overall 0053 %depending on the computer hardware and the system used. Given that KEGG 0054 %and MetaCyc databases are formulated in different ways; KEGG relies on 0055 %sequence-based annotation, while MetaCyc collects only experimentally 0056 %verified pathways. Therefore, integration of MetaCyc- and KEGG-based draft 0057 %models could have a better coverage of the metbolism for the target 0058 %organism. 0059 0060 %At first, the two KEGG-based models can be directly merged 0061 ScoKEGGDraftModel=mergeModels({ScoKEGGAnnotation ScoKEGGHomology}); 0062 0063 disp(numel(ScoKEGGHomology.rxns)+numel(ScoKEGGAnnotation.rxns)); 0064 disp(numel(ScoKEGGDraftModel.rxns)); 0065 %By checking the reaction number, it can be seen that reaction number in 0066 %the merged model equals adding up the reaction numbers in homology and 0067 %annotation KEGG draft models. And there are duplicated reactions in this 0068 %merged model. 0069 0070 %Merge the duplicated reactions into one and combine multiple iso-enzymes 0071 %into a single grRules. The expandModel and contractModel functions are 0072 %utilised, see their instructions for details. 0073 ScoKEGGDraftModel=expandModel(ScoKEGGDraftModel); 0074 ScoKEGGDraftModel=contractModel(ScoKEGGDraftModel); 0075 0076 %In the end, KEGG- and MetaCyc-based draft models can be further combined 0077 %into an integrated GEM. This step is achieved by the function 0078 %combineMetaCycKEGGModels, which firstly converts metabolite and reaction 0079 %identifiers in the KEGG model into corresponding MetaCyc ids, and then 0080 %detects duplications and keeps only unique reactions and metabolites that 0081 %are mostly in MetaCyc identifiers. 0082 ScoCombinedDraftModel=combineMetaCycKEGGModels(ScoMetaCycDraftModel, ScoKEGGDraftModel); 0083 0084 %With this combined model, the existing iMK1208 model is refined by 0085 %incorporating the new pathways/reactions that are found in the combined 0086 %model but absent from the previous iMK1208 model. At the time of 0087 %publication, a total of 398 reactions in the combined draft model were 0088 %determined as new pathways based on manual curation results, which have 0089 %been organized into TableS3. Load these manually selected reactions and 0090 %their subSystems as the pre-processed array structure selectedNewRxns. 0091 load('iMK1208+suppInfo.mat','selectedNewRxns'); 0092 disp(selectedNewRxns); 0093 0094 %Generate a sub-model that includes only these new reactions. This is 0095 %implemented by subtracting the other reactions from the combined model 0096 %using function removeReactions 0097 rxnsToRemove=setdiff(ScoCombinedDraftModel.rxns,selectedNewRxns.rxns); 0098 newRxnSubModel=removeReactions(ScoCombinedDraftModel,rxnsToRemove,1,1); 0099 0100 %Now this newRxnSubModel contains only these new reactions. It needs to be 0101 %modified before merging with the iMK1208. Since these reactions are 0102 %metabolic ones and can all be assigned to the cytoplasm compartment 0103 newRxnSubModel.comps{1}='c'; 0104 newRxnSubModel.compNames{1}='Cytoplasm'; 0105 0106 %Amend to the genes and rxnGeneMat fields 0107 newRxnSubModel.genes={}; 0108 for k=1:numel(newRxnSubModel.rxns) 0109 newRxnSubModel.genes=[newRxnSubModel.genes;transpose(strsplit(newRxnSubModel.grRules{k},' or '))]; 0110 end 0111 newRxnSubModel.genes=unique(newRxnSubModel.genes); 0112 [~, newRxnSubModel.rxnGeneMat, ~]=standardizeGrRules(newRxnSubModel); 0113 0114 %Since the newRxnSubModel is ready for incorportation, the iMK1208 model 0115 %can be loaded for integration 0116 load('iMK1208+suppInfo.mat','iMK1208'); 0117 0118 %It should be noted that the incompatible nomenclatures (especially the 0119 %metabolite identifiers) used in different GEMs and databases led to a 0120 %serious problem in model comparison, curation and integration. In order to 0121 %properly integrate the iMK1208 with this sub-model, both models should be 0122 %unified into the same name space, since the metabolites in the sub-model 0123 %use MetaCyc identifiers that are different from the ones used in iMK1208 0124 %model. TableS2 contains the results for database mining and intensive 0125 %manual curation while associating these metabolite identifiers. Load 0126 %these manually selected reactions and their subSystems as the 0127 %pre-processed array structure metaCycMetsIniMK. 0128 load('iMK1208+suppInfo.mat','metaCycMetsIniMK'); 0129 0130 %Replace the metabolites in the sub-model with the identifiers and names 0131 %used in iMK1208 according to the mapping information in TableS2 0132 %(metaCycMetsIniMK). 0133 [a, b]=ismember(newRxnSubModel.mets,metaCycMetsIniMK); 0134 I=find(a); 0135 newRxnSubModel.mets=strcat(newRxnSubModel.mets,'_c'); 0136 newRxnSubModel.metNames(I)=iMK1208.metNames(b(I)); 0137 newRxnSubModel.mets(I)=iMK1208.mets(b(I)); 0138 0139 %The necessary issues for merging iMK1208 with the new reactions determined 0140 %from de novo generated draft GEMs have been resolved. Those can be 0141 %directly merged into the Sco4 model, which refers to the 4th major update 0142 %of the GEM for Streptomyces coelicolor. 0143 Sco4=mergeModels({iMK1208 newRxnSubModel}); 0144 0145 %Check out if the newly generated Sco4 could grow 0146 sol=solveLP(Sco4); 0147 disp(sol.f); 0148 %The results indicate that this new model is functional