getKEGGModelForOrganism Reconstructs a genome-scale metabolic model based on protein homology to the orthologies in KEGG. If the target species is not available in KEGG, the user must select a closely related species. It is also possible to circumvent protein homology search (see fastaFile parameter for more details) Input: organismID three or four letter abbreviation of the organism (as used in KEGG). If not available, use a closely related species. This is used for determing the phylogenetic distance. Use 'eukaryotes' or 'prokaryotes' to get a model for the whole domain. Only applicable if fastaFile is empty, i.e. no homology search should be performed fastaFile a FASTA file that contains the protein sequences of the organism for which to reconstruct a model (opt, if no FASTA file is supplied then a model is reconstructed based only on the organism abbreviation. This option ignores all settings except for keepSpontaneous, keepUndefinedStoich, keepIncomplete and keepGeneral) dataDir directory for which to retrieve the input data. Should contain a combination of these sub-folders: -dataDir\keggdb The KEGG database files used in 1a (see below) -dataDir\fasta The multi-FASTA files generated in 1b (see below) -dataDir\aligned The aligned FASTA files as generated in 2a (see below) -dataDir\hmms The hidden Markov models as generated in 2b or downloaded from BioMet Toolbox (see below) The final directory in dataDir should be styled as proXXX_keggYY or eukXXX_keggYY, indicating whether the HMMs were trained on pro- or eukaryotic sequences, using a sequence similarity threshold of XXX %, fitting the KEGG version YY. E.g. euk90_kegg105. (opt, see note about fastaFile. Note that in order to rebuild the KEGG model from a database dump, as opposed to using the version supplied with RAVEN, you would still need to supply this) outDir directory to save the results from the quering of the hidden Markov models. The output is specific for the input sequences and the settings used. It is stored in this manner so that the function can continue if interrupted or if it should run in parallel. Be careful not to leave output files from different organisms or runs with different settings in the same folder. They will not be overwritten (opt, default is a temporary dir where all *.out files are deleted before and after doing the reconstruction) keepSpontaneous include reactions labeled as "spontaneous". (opt, default true) keepUndefinedStoich include reactions in the form n A <=> n+1 A. These will be dealt with as two separate metabolites (opt, default true) keepIncomplete include reactions which have been labelled as "incomplete", "erroneous" or "unclear" (opt, default true) keepGeneral include reactions which have been labelled as "general reaction". These are reactions on the form "an aldehyde <=> an alcohol", and are therefore unsuited for modelling purposes. Note that not all reactions have this type of annotation, and the script will therefore not be able to remove all such reactions (opt, default false) cutOff significance score from HMMer needed to assign genes to a KO (opt, default 10^-50) minScoreRatioG a gene is only assigned to KOs for which the score is >=log(score)/log(best score) for that gene. This is to prevent that a gene which clearly belongs to one KO is assigned also to KOs with much lower scores (opt, default 0.8 (lower is less strict)) minScoreRatioKO ignore genes in a KO if their score is <log(score)/log(best score in KO). This is to "prune" KOs which have many genes and where some are clearly a better fit (opt, default 0.3 (lower is less strict)) maxPhylDist -1: only use sequences from the same domain (Prokaryota, Eukaryota) other (positive) value: only use sequences for organisms where the phylogenetic distance is at the most this large (as calculated in getPhylDist) (opt, default Inf, which means that all sequences will be used) nSequences for each KO, use up to this many sequences from the most closely related species. This is mainly to speed up the alignment process for KOs with very many genes. This subsampling is performed before running CD-HIT (opt, default inf) seqIdentity sequence identity threshold in CD-HIT, referred as "global sequence identity" in CD-HIT User's Guide. If -1 is provided, CD-HIT is skipped (opt, default 0.9) globalModel structure containing both model and KOModel structures as generated by getModelFromKEGG. These will otherwise be loaded by via getModelFromKEGG. Providing globalKEGGmodel can speed up model generation if getKEGGModelForOrganism is run multiple times for different strains. Example: [globalModel.model,globalModel.KOModel] = getModelFromKEGG; (opt, default empty, global model is loaded by getModelFromKEGG) Output: model the reconstructed model PLEASE READ THIS: The input to this function can be confusing, because it is intended to be run in parallel on a cluster or in multiple sessions. It therefore saves a lot of intermediate results to storage. This also serves the purpose of not having to do redundant calculations. This, however, comes with the disadvantage of somewhat trickier handling. This is what this function does: 1a. Loads files from a local KEGG FTP dump and constructs a general RAVEN model representing the metabolic network. The functions getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files, which are later merged into 'keggModel.mat' by getModelFromKEGG function. The function getPhylDist generates 'keggPhylDist.mat' file. KEGG FTP access requires a <a href="matlab: web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>. 1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One multi-FASTA file for each KO in KEGG is generated. The Step 1 has to be re-done every time KEGG updates their database (or rather when the updates are large enough to warrant re-running this part). Many users would probably never use this feature. 2a. Filters KO-specific protein sets. This is done by using the settings "maxPhylDist" and "nSequences" to control which sequences should be used for constructing Hidden Markov models (HMMs), and later for matching your sequences to. The most common alternatives here would be to use sequences from only eukaryotes, only prokaryotes or all sequences in KEGG, but you could also play around with the parameters to use e.g. only fungal sequences. 2b. KO-specific protein FASTA files are re-organised into non-redundant protein sets with CD-HIT. The user can only set seqIdentity parameter, which corresponds to '-c' parameter in CD-HIT, described as "sequence identity threshold". CD-HIT suggsted sequence identity specific word_length (-n) parameters are used. 2c. Does a multi sequence alignment for multi-FASTA files obtained in Step 2b for future use. MAFFT software with automatic selection of alignment algorithm is used in this step ('--auto'). 2d. Trains hidden Markov models using HMMer for each of the aligned KO-specific FASTA files obtained in Step 2c. This is performed with 'hmmbuild' using the default settings. Step 2 may be reasonable to be re-done if the user wants to tweak the settings in proteins filtering, clustering, multi sequence alignment or HMMs training steps. However, it requires to have KO-specific protein FASTA files obtained in Step 1a. As such files are not provided in RAVEN and BioMet ToolBox, the user can only generate these files from KEGG FTP dump files, so KEGG FTP license is needed. 3a. Queries the HMMs with sequences for the organism you are making a model for. This step uses both the output from step 1a and from 2d. This is done with 'hmmsearch' function under default settings. The significance threshold value set in 'cutOff' parameter is used later when parsing '*.out' files to filter out KO hits with higher value than 'cutOff' value. The results with passable E values are summarised into KO-gene occurence matrix with E values in intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and 'minScoreRatioKO' are then applied to 'prune' KO-gene associations (see the function descriptions above for more details). The intersection values for these 'prunable' associations are converted to zeroes. 3b. Constructs a model based on the pre-processed KO-gene association matrix (koGeneMat). As the full KEGG model already has reaction-KO relationships, KOs are converted into the query genes. The final draft model contains only these reactions, which are associated with KOs from koGeneMat. The reactions without the genes may also be included, if the user set keepSpontaneous as 'true'. The Step 3 is specific to the organism for which the model is reconstructed. In principle the function looks at which output that is already available and runs only the parts that are required for step 3. This means that (see the definition of the parameters for details): -1a is only performed if there are no KEGG model files in the RAVEN\external\kegg directory -1b is only performed if not all required HMMs OR aligned FASTA files OR multi-FASTA files exist in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). If not all files exist it will try to find the KEGG database files in dataDir. -2a is only performed if not all required HMMs OR aligned FASTA files files exist in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). -2b is only performed if not all required HMMs exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from BioMet Toolbox instead (see above). -3a is performed for the required HMMs for which no corresponding .out file exists in outDir. This is just a way to enable the function to be run in parallel or to resume if interrupted. -3b is always performed. These steps are specific to the organism for which you are reconstructing the model. Regarding the whole pipeline, the function checks the output that is already available and runs only the parts that are required for step 3. This means that (see the definition of the parameters for details): -1a is only performed if there are no KEGG model files in the RAVEN\external\kegg directory. -1b is only performed if any of required KOs do not have HMMs, aligned FASTA files, clustered FASTA files and raw FASTA files in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). If not all files exist it will try to find the KEGG database files in dataDir. -2ab are only performed if any of required KOs do not have HMMs, aligned FASTA files and clustered FASTA files in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). -2c is only performed if any of required KOs do not have HMMs and aligned FASTA files in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). -2d is only performed if any of required KOs do not have HMMs exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from BioMet Toolbox instead (see above). -3a is performed for the required HMMs for which no corresponding .out file exists in outDir. This is just a way to enable the function to be run in parallel or to resume if interrupted. -3b is always performed. NOTE: it is also possible to obtain draft model from KEGG without providing protein FASTA file for the target organism. In such case the organism three-four letter abbreviation set as 'organismID' must exist in the local KEGG database. In such case, the program just fetches all the reactions, which are associated with given 'organismID'. Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... nSequences,seqIdentity)
0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... 0002 outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... 0003 keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... 0004 nSequences,seqIdentity,globalModel) 0005 % getKEGGModelForOrganism 0006 % Reconstructs a genome-scale metabolic model based on protein homology 0007 % to the orthologies in KEGG. If the target species is not available in 0008 % KEGG, the user must select a closely related species. It is also 0009 % possible to circumvent protein homology search (see fastaFile parameter 0010 % for more details) 0011 % 0012 % Input: 0013 % organismID three or four letter abbreviation of the organism 0014 % (as used in KEGG). If not available, use a closely 0015 % related species. This is used for determing the 0016 % phylogenetic distance. Use 'eukaryotes' or 0017 % 'prokaryotes' to get a model for the whole domain. 0018 % Only applicable if fastaFile is empty, i.e. no 0019 % homology search should be performed 0020 % fastaFile a FASTA file that contains the protein sequences of 0021 % the organism for which to reconstruct a model (opt, 0022 % if no FASTA file is supplied then a model is 0023 % reconstructed based only on the organism 0024 % abbreviation. This option ignores all settings 0025 % except for keepSpontaneous, keepUndefinedStoich, 0026 % keepIncomplete and keepGeneral) 0027 % dataDir directory for which to retrieve the input data. 0028 % Should contain a combination of these sub-folders: 0029 % -dataDir\keggdb 0030 % The KEGG database files used in 1a (see below) 0031 % -dataDir\fasta 0032 % The multi-FASTA files generated in 1b (see 0033 % below) 0034 % -dataDir\aligned 0035 % The aligned FASTA files as generated in 2a (see 0036 % below) 0037 % -dataDir\hmms 0038 % The hidden Markov models as generated in 2b or 0039 % downloaded from BioMet Toolbox (see below) 0040 % The final directory in dataDir should be styled as 0041 % proXXX_keggYY or eukXXX_keggYY, indicating whether 0042 % the HMMs were trained on pro- or eukaryotic 0043 % sequences, using a sequence similarity threshold of 0044 % XXX %, fitting the KEGG version YY. E.g. 0045 % euk90_kegg105. (opt, see note about fastaFile. Note 0046 % that in order to rebuild the KEGG model from a 0047 % database dump, as opposed to using the version 0048 % supplied with RAVEN, you would still need to supply 0049 % this) 0050 % outDir directory to save the results from the quering of 0051 % the hidden Markov models. The output is specific 0052 % for the input sequences and the settings used. It 0053 % is stored in this manner so that the function can 0054 % continue if interrupted or if it should run in 0055 % parallel. Be careful not to leave output files from 0056 % different organisms or runs with different settings 0057 % in the same folder. They will not be overwritten 0058 % (opt, default is a temporary dir where all *.out 0059 % files are deleted before and after doing the 0060 % reconstruction) 0061 % keepSpontaneous include reactions labeled as "spontaneous". (opt, 0062 % default true) 0063 % keepUndefinedStoich include reactions in the form n A <=> n+1 A. These 0064 % will be dealt with as two separate metabolites 0065 % (opt, default true) 0066 % keepIncomplete include reactions which have been labelled as 0067 % "incomplete", "erroneous" or "unclear" (opt, 0068 % default true) 0069 % keepGeneral include reactions which have been labelled as 0070 % "general reaction". These are reactions on the form 0071 % "an aldehyde <=> an alcohol", and are therefore 0072 % unsuited for modelling purposes. Note that not all 0073 % reactions have this type of annotation, and the 0074 % script will therefore not be able to remove all 0075 % such reactions (opt, default false) 0076 % cutOff significance score from HMMer needed to assign 0077 % genes to a KO (opt, default 10^-50) 0078 % minScoreRatioG a gene is only assigned to KOs for which the score 0079 % is >=log(score)/log(best score) for that gene. This 0080 % is to prevent that a gene which clearly belongs to 0081 % one KO is assigned also to KOs with much lower 0082 % scores (opt, default 0.8 (lower is less strict)) 0083 % minScoreRatioKO ignore genes in a KO if their score is 0084 % <log(score)/log(best score in KO). This is to 0085 % "prune" KOs which have many genes and where some are 0086 % clearly a better fit (opt, default 0.3 (lower is 0087 % less strict)) 0088 % maxPhylDist -1: only use sequences from the same domain 0089 % (Prokaryota, Eukaryota) 0090 % other (positive) value: only use sequences for 0091 % organisms where the phylogenetic distance is at the 0092 % most this large (as calculated in getPhylDist) 0093 % (opt, default Inf, which means that all sequences 0094 % will be used) 0095 % nSequences for each KO, use up to this many sequences from the 0096 % most closely related species. This is mainly to 0097 % speed up the alignment process for KOs with very 0098 % many genes. This subsampling is performed before 0099 % running CD-HIT (opt, default inf) 0100 % seqIdentity sequence identity threshold in CD-HIT, referred as 0101 % "global sequence identity" in CD-HIT User's Guide. 0102 % If -1 is provided, CD-HIT is skipped (opt, default 0.9) 0103 % globalModel structure containing both model and KOModel 0104 % structures as generated by getModelFromKEGG. These 0105 % will otherwise be loaded by via getModelFromKEGG. 0106 % Providing globalKEGGmodel can speed up model 0107 % generation if getKEGGModelForOrganism is run 0108 % multiple times for different strains. Example: 0109 % [globalModel.model,globalModel.KOModel] = getModelFromKEGG; 0110 % (opt, default empty, global model is loaded by 0111 % getModelFromKEGG) 0112 % 0113 % Output: 0114 % model the reconstructed model 0115 % 0116 % PLEASE READ THIS: The input to this function can be confusing, because 0117 % it is intended to be run in parallel on a cluster or in multiple 0118 % sessions. It therefore saves a lot of intermediate results to storage. 0119 % This also serves the purpose of not having to do redundant 0120 % calculations. This, however, comes with the disadvantage of somewhat 0121 % trickier handling. This is what this function does: 0122 % 0123 % 1a. Loads files from a local KEGG FTP dump and constructs a general 0124 % RAVEN model representing the metabolic network. The functions 0125 % getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the 0126 % data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files, 0127 % which are later merged into 'keggModel.mat' by getModelFromKEGG 0128 % function. The function getPhylDist generates 'keggPhylDist.mat' 0129 % file. KEGG FTP access requires a <a href="matlab: 0130 % web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>. 0131 % 1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One 0132 % multi-FASTA file for each KO in KEGG is generated. 0133 % 0134 % The Step 1 has to be re-done every time KEGG updates their database (or 0135 % rather when the updates are large enough to warrant re-running this 0136 % part). Many users would probably never use this feature. 0137 % 0138 % 2a. Filters KO-specific protein sets. This is done by using the 0139 % settings "maxPhylDist" and "nSequences" to control which sequences 0140 % should be used for constructing Hidden Markov models (HMMs), and 0141 % later for matching your sequences to. 0142 % The most common alternatives here would be to use sequences from 0143 % only eukaryotes, only prokaryotes or all sequences in KEGG, but you 0144 % could also play around with the parameters to use e.g. only fungal 0145 % sequences. 0146 % 2b. KO-specific protein FASTA files are re-organised into 0147 % non-redundant protein sets with CD-HIT. The user can only set 0148 % seqIdentity parameter, which corresponds to '-c' parameter in 0149 % CD-HIT, described as "sequence identity threshold". CD-HIT suggsted 0150 % sequence identity specific word_length (-n) parameters are used. 0151 % 2c. Does a multi sequence alignment for multi-FASTA files obtained in 0152 % Step 2b for future use. MAFFT software with automatic selection of 0153 % alignment algorithm is used in this step ('--auto'). 0154 % 2d. Trains hidden Markov models using HMMer for each of the aligned 0155 % KO-specific FASTA files obtained in Step 2c. This is performed with 0156 % 'hmmbuild' using the default settings. 0157 % 0158 % Step 2 may be reasonable to be re-done if the user wants to tweak the 0159 % settings in proteins filtering, clustering, multi sequence alignment or 0160 % HMMs training steps. However, it requires to have KO-specific protein 0161 % FASTA files obtained in Step 1a. As such files are not provided in 0162 % RAVEN and BioMet ToolBox, the user can only generate these files from 0163 % KEGG FTP dump files, so KEGG FTP license is needed. 0164 % 0165 % 3a. Queries the HMMs with sequences for the organism you are making a 0166 % model for. This step uses both the output from step 1a and from 2d. 0167 % This is done with 'hmmsearch' function under default settings. The 0168 % significance threshold value set in 'cutOff' parameter is used 0169 % later when parsing '*.out' files to filter out KO hits with higher 0170 % value than 'cutOff' value. The results with passable E values are 0171 % summarised into KO-gene occurence matrix with E values in 0172 % intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and 0173 % 'minScoreRatioKO' are then applied to 'prune' KO-gene associations 0174 % (see the function descriptions above for more details). The 0175 % intersection values for these 'prunable' associations are converted 0176 % to zeroes. 0177 % 3b. Constructs a model based on the pre-processed KO-gene association 0178 % matrix (koGeneMat). As the full KEGG model already has reaction-KO 0179 % relationships, KOs are converted into the query genes. The final 0180 % draft model contains only these reactions, which are associated 0181 % with KOs from koGeneMat. The reactions without the genes may also 0182 % be included, if the user set keepSpontaneous as 'true'. 0183 % 0184 % The Step 3 is specific to the organism for which the model is 0185 % reconstructed. 0186 % 0187 % In principle the function looks at which output that is already available 0188 % and runs only the parts that are required for step 3. This means 0189 % that (see the definition of the parameters for details): 0190 % -1a is only performed if there are no KEGG model files in the 0191 % RAVEN\external\kegg directory 0192 % -1b is only performed if not all required HMMs OR aligned FASTA files 0193 % OR multi-FASTA files exist in the defined dataDir. This means that this 0194 % step is skipped if the HMMs are downloaded from BioMet Toolbox instead 0195 % (see above). If not all files exist it will try to find 0196 % the KEGG database files in dataDir. 0197 % -2a is only performed if not all required HMMs OR aligned FASTA files 0198 % files exist in the defined dataDir. This means that this step is skipped 0199 % if the HMMs are downloaded from BioMet Toolbox instead (see above). 0200 % -2b is only performed if not all required HMMs exist in the defined 0201 % dataDir. This means that this step is skipped if the FASTA files or 0202 % HMMs are downloaded from BioMet Toolbox instead (see above). 0203 % -3a is performed for the required HMMs for which no corresponding .out 0204 % file exists in outDir. This is just a way to enable the function to be 0205 % run in parallel or to resume if interrupted. 0206 % -3b is always performed. 0207 % 0208 % These steps are specific to the organism for which you are 0209 % reconstructing the model. 0210 % 0211 % Regarding the whole pipeline, the function checks the output that is 0212 % already available and runs only the parts that are required for step 3. 0213 % This means that (see the definition of the parameters for details): 0214 % -1a is only performed if there are no KEGG model files in the 0215 % RAVEN\external\kegg directory. 0216 % -1b is only performed if any of required KOs do not have HMMs, aligned 0217 % FASTA files, clustered FASTA files and raw FASTA files in the defined 0218 % dataDir. This means that this step is skipped if the HMMs are 0219 % downloaded from BioMet Toolbox instead (see above). If not all files 0220 % exist it will try to find the KEGG database files in dataDir. 0221 % -2ab are only performed if any of required KOs do not have HMMs, 0222 % aligned FASTA files and clustered FASTA files in the defined dataDir. 0223 % This means that this step is skipped if the HMMs are downloaded from 0224 % BioMet Toolbox instead (see above). 0225 % -2c is only performed if any of required KOs do not have HMMs and 0226 % aligned FASTA files in the defined dataDir. This means that this step 0227 % is skipped if the HMMs are downloaded from BioMet Toolbox instead (see 0228 % above). 0229 % -2d is only performed if any of required KOs do not have HMMs exist in 0230 % the defined dataDir. This means that this step is skipped if the FASTA 0231 % files or HMMs are downloaded from BioMet Toolbox instead (see above). 0232 % -3a is performed for the required HMMs for which no corresponding .out 0233 % file exists in outDir. This is just a way to enable the function to be 0234 % run in parallel or to resume if interrupted. 0235 % -3b is always performed. 0236 % 0237 % NOTE: it is also possible to obtain draft model from KEGG without 0238 % providing protein FASTA file for the target organism. In such case the 0239 % organism three-four letter abbreviation set as 'organismID' must exist 0240 % in the local KEGG database. In such case, the program just fetches all 0241 % the reactions, which are associated with given 'organismID'. 0242 % 0243 % Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... 0244 % outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... 0245 % keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... 0246 % nSequences,seqIdentity) 0247 0248 if nargin<2 || isempty(fastaFile) 0249 fastaFile=[]; 0250 else 0251 fastaFile=char(fastaFile); 0252 end 0253 if nargin<3 0254 dataDir=[]; 0255 else 0256 dataDir=char(dataDir); 0257 end 0258 if nargin<4 || isempty(outDir) 0259 outDir=tempdir; 0260 %Delete all *.out files if any exist 0261 delete(fullfile(outDir,'*.out')); 0262 else 0263 outDir=char(outDir); 0264 end 0265 if nargin<5 0266 keepSpontaneous=true; 0267 end 0268 if nargin<6 0269 keepUndefinedStoich=true; 0270 end 0271 if nargin<7 0272 keepIncomplete=true; 0273 end 0274 if nargin<8 0275 keepGeneral=false; 0276 end 0277 if nargin<9 0278 cutOff=10^-50; 0279 end 0280 if nargin<10 0281 minScoreRatioKO=0.3; 0282 end 0283 if nargin<11 0284 minScoreRatioG=0.8; 0285 end 0286 if nargin<12 0287 maxPhylDist=inf; 0288 %Include all sequences for each reaction 0289 end 0290 if nargin<13 0291 nSequences=inf; 0292 %Include all sequences for each reaction 0293 end 0294 if nargin<14 0295 seqIdentity=0.9; 0296 end 0297 0298 if isempty(fastaFile) 0299 fprintf(['\n*** The model reconstruction from KEGG based on the annotation available for KEGG Species <strong>' organismID '</strong> ***\n\n']); 0300 else 0301 fprintf('\n*** The model reconstruction from KEGG based on the protein homology search against KEGG Orthology specific HMMs ***\n\n'); 0302 %Check if query fasta exists 0303 fastaFile=checkFileExistence(fastaFile,2); %Copy file to temp dir 0304 end 0305 0306 %Run the external binaries multi-threaded to use all logical cores assigned 0307 %to MATLAB 0308 cores = evalc('feature(''numcores'')'); 0309 cores = strsplit(cores, 'MATLAB was assigned: '); 0310 cores = regexp(cores{2},'^\d*','match'); 0311 cores = cores{1}; 0312 0313 %Get the directory for RAVEN Toolbox. 0314 ravenPath=findRAVENroot(); 0315 0316 %Checking if dataDir is consistent. It must point to pre-trained HMMs set, 0317 %compatible with the the current RAVEN version. The user may have the 0318 %required zip file already in working directory or have it extracted. If 0319 %the zip file and directory is not here, it is downloaded from the cloud 0320 if ~isempty(dataDir) 0321 hmmOptions={'euk90_kegg105','prok90_kegg105'}; 0322 if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions. 0323 %If not, then check whether the required folders exist anyway. 0324 if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) && ... 0325 ~isfolder(fullfile(dataDir,'fasta')) && ... 0326 ~isfolder(fullfile(dataDir,'aligned')) && ... 0327 ~isfolder(fullfile(dataDir,'hmms')) 0328 error(['Pre-trained HMMs set is not recognised. If you want download RAVEN provided sets, it should match any of the following: ' strjoin(hmmOptions,' or ')]) 0329 end 0330 else 0331 if isfolder(dataDir) && isfile(fullfile(dataDir,'hmms','K00844.hmm')) 0332 fprintf(['NOTE: Found <strong>' dataDir '</strong> directory with pre-trained HMMs, it will therefore be used during reconstruction\n']); 0333 elseif ~isfolder(dataDir) && isfile([dataDir,'.zip']) 0334 fprintf('Extracting the HMMs archive file... '); 0335 unzip([dataDir,'.zip']); 0336 fprintf('COMPLETE\n'); 0337 else 0338 hmmIndex=strcmp(dataDir,hmmOptions); 0339 if ~any(hmmIndex) 0340 error(['Pre-trained HMMs are only provided with proteins clustered at 90% sequence identity (i.e. prok90_kegg105 and euk90_kegg105). ' ... 0341 'Use either of these datasets, or otherwise download the relevant sequence data from KEGG to train HMMs with your desired sequence identity']) 0342 else 0343 fprintf('Downloading the HMMs archive file... '); 0344 try 0345 websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.8.0/',hmmOptions{hmmIndex},'.zip']); 0346 catch ME 0347 if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError') 0348 error('Failed to download the HMMs archive file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues') 0349 end 0350 end 0351 end 0352 0353 fprintf('COMPLETE\n'); 0354 fprintf('Extracting the HMMs archive file... '); 0355 unzip([dataDir,'.zip']); 0356 fprintf('COMPLETE\n'); 0357 end 0358 %Check if HMMs are extracted 0359 if ~isfile(fullfile(dataDir,'hmms','K00844.hmm')) 0360 error(['The HMM files seem improperly extracted and not found in ',dataDir,'/hmms. Please remove ',dataDir,' folder and rerun getKEGGModelForOrganism']); 0361 end 0362 end 0363 end 0364 0365 %Check if the fasta-file contains '/' or'\'. If not then it's probably just 0366 %a file name. Expand to full path. 0367 if any(fastaFile) 0368 if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/')) 0369 fastaFile=which(fastaFile); 0370 end 0371 %Create the required sub-folders in dataDir if they dont exist 0372 if ~isfolder(fullfile(dataDir,'keggdb')) 0373 mkdir(dataDir,'keggdb'); 0374 end 0375 if ~isfolder(fullfile(dataDir,'fasta')) 0376 mkdir(dataDir,'fasta'); 0377 end 0378 if ~isfolder(fullfile(dataDir,'aligned')) 0379 mkdir(dataDir,'aligned'); 0380 end 0381 if ~isfolder(fullfile(dataDir,'hmms')) 0382 mkdir(dataDir,'hmms'); 0383 end 0384 if ~isfolder(outDir) 0385 mkdir(outDir); 0386 end 0387 end 0388 0389 %First generate the full global KEGG model. Can be provided as input. 0390 %Otherwise, getModelFromKEGG is run. The dataDir must not be supplied as 0391 %there is also an internal RAVEN version available 0392 if nargin==15 0393 model=globalModel.model; 0394 KOModel=globalModel.KOModel; 0395 elseif any(dataDir) 0396 [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral); 0397 else 0398 [model, KOModel]=getModelFromKEGG([],keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral); 0399 end 0400 model.id=organismID; 0401 model.c=zeros(numel(model.rxns),1); 0402 0403 %If no FASTA file is supplied, then just remove all genes which are not for 0404 %the given organism ID 0405 if isempty(fastaFile) 0406 %Check if organismID can be found in KEGG species list or is 0407 %set to "eukaryotes" or "prokaryotes" 0408 phylDistsFull=getPhylDist(fullfile(dataDir,'keggdb'),true); 0409 if ~ismember(organismID,[phylDistsFull.ids 'eukaryotes' 'prokaryotes']) 0410 error('Provided organismID is incorrect. Only species abbreviations from KEGG Species List or "eukaryotes"/"prokaryotes" are allowed.'); 0411 end 0412 0413 fprintf(['Pruning the model from <strong>non-' organismID '</strong> genes... ']); 0414 if ismember(organismID,{'eukaryotes','prokaryotes'}) 0415 phylDists=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0416 if strcmp(organismID,'eukaryotes') 0417 proxyid='hsa'; 0418 %Use H. sapiens here 0419 else 0420 proxyid='eco'; 0421 %Use E. coli here 0422 end 0423 [~, phylDistId]=ismember(proxyid,phylDists.ids); 0424 idsToKeep=phylDists.ids(~isinf(phylDists.distMat(phylDistId,:))); 0425 taxIDs=cellfun(@(x) x{1},cellfun(@(x) strsplit(x,':'),model.genes,'UniformOutput',false),'UniformOutput',false); 0426 I=ismember(upper(taxIDs),upper(idsToKeep)); 0427 else 0428 %KEGG organism IDs may have three or four letters 0429 organismID=strcat(organismID,':'); 0430 %Add colon for accurate matching 0431 if length(organismID)==4 0432 I=cellfun(@(x) strcmpi(x(1:4),organismID),model.genes); 0433 elseif length(organismID)==5 0434 I=cellfun(@(x) strcmpi(x(1:5),organismID),model.genes); 0435 end 0436 end 0437 %Remove those genes 0438 model.genes=model.genes(I); 0439 model.rxnGeneMat=model.rxnGeneMat(:,I); 0440 fprintf('COMPLETE\n'); 0441 end 0442 0443 %First remove all reactions without genes 0444 if keepSpontaneous==true 0445 fprintf('Removing non-spontaneous reactions without GPR rules... '); 0446 load(fullfile(ravenPath,'external','kegg','keggRxns.mat'),'isSpontaneous'); 0447 I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous); 0448 spontRxnsWithGenes=model.rxns(any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous)); 0449 else 0450 fprintf('Removing reactions without GPR rules... '); 0451 I=~any(model.rxnGeneMat,2); 0452 end 0453 model=removeReactions(model,I,true); 0454 fprintf('COMPLETE\n'); 0455 0456 %Clean gene names 0457 fprintf('Fixing gene names in the model... '); 0458 %Get rid of the prefix organism id 0459 model.genes=regexprep(model.genes,'^\w+?:',''); 0460 fprintf('COMPLETE\n'); 0461 0462 %If no FASTA file is supplied, then we are done here 0463 if isempty(fastaFile) 0464 %Create grRules 0465 fprintf('Constructing GPR associations and annotations for the model... '); 0466 model.grRules=cell(numel(model.rxns),1); 0467 model.grRules(:)={''}; 0468 %Add the gene associations as 'or' 0469 for i=1:numel(model.rxns) 0470 %Find the involved genes 0471 I=find(model.rxnGeneMat(i,:)); 0472 if any(I) 0473 model.grRules{i}=['(' model.genes{I(1)}]; 0474 for j=2:numel(I) 0475 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 0476 end 0477 model.grRules{i}=[model.grRules{i} ')']; 0478 end 0479 end 0480 %Fix grRules and reconstruct rxnGeneMat 0481 [grRules,rxnGeneMat] = standardizeGrRules(model); %Give detailed output 0482 model.grRules = grRules; 0483 model.rxnGeneMat = rxnGeneMat; 0484 %Add geneMiriams, assuming that it follows the syntax 0485 %kegg.genes/organismID:geneName 0486 model.geneMiriams=''; 0487 for i=1:numel(model.genes) 0488 model.geneMiriams{i,1}.name{1,1}='kegg.genes'; 0489 model.geneMiriams{i,1}.value{1,1}=strcat(lower(organismID),model.genes{i,1}); 0490 end 0491 %Add the description to the reactions 0492 for i=1:numel(model.rxns) 0493 if ~isempty(model.rxnNotes{i}) 0494 model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (without HMMs).',model.rxnNotes(i)); 0495 model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. '); 0496 else 0497 model.rxnNotes(i)={'Included by getKEGGModelForOrganism (without HMMs)'}; 0498 end 0499 end 0500 fprintf('COMPLETE\n\n'); 0501 fprintf('*** Model reconstruction complete ***\n'); 0502 return; 0503 end 0504 0505 %Create a phylogenetic distance structure 0506 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0507 [~, phylDistId]=ismember(model.id,phylDistStruct.ids); 0508 0509 %Calculate the real maximal distance now. An abitary large number of 1000 0510 %is used for the "all in kingdom" or "all sequences" options. This is a bit 0511 %inconvenient way to do it, but it is to make it fit with some older code 0512 if isinf(maxPhylDist) || maxPhylDist==-1 0513 maxPhylDist=1000; 0514 end 0515 0516 %Get the KO ids for which files have been generated. Maybe not the neatest 0517 %way.. 0518 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa')); 0519 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa')); 0520 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw')); 0521 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm')); 0522 outFiles=listFiles(fullfile(outDir,'*.out')); 0523 0524 %Check if multi-FASTA files should be generated. This should only be 0525 %performed if there are IDs in the KOModel structure that haven't been 0526 %parsed yet 0527 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]); 0528 0529 if ~isempty(missingFASTA) 0530 if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) 0531 EM=['The file ''genes.pep'' cannot be located at ' strrep(dataDir,'\','/') '/ and should be downloaded from the KEGG FTP.\n']; 0532 dispEM(EM); 0533 end 0534 %Only construct models for KOs which don't have files already 0535 fastaModel=removeReactions(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true); 0536 %Permute the order of the KOs in the model so that constructMultiFasta 0537 %can be run on several processors at once 0538 fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns'); 0539 constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta')); 0540 else 0541 fprintf('Generating the KEGG Orthology specific multi-FASTA files... COMPLETE\n'); 0542 end 0543 0544 if isunix 0545 if ismac 0546 binEnd='.mac'; 0547 else 0548 binEnd=''; 0549 end 0550 elseif ispc 0551 binEnd=''; 0552 else 0553 EM='Unknown OS, exiting.'; 0554 disp(EM); 0555 return 0556 end 0557 0558 %Check if alignment of FASTA files should be performed 0559 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]); 0560 if ~isempty(missingAligned) 0561 if seqIdentity==-1 0562 fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... 0%% complete'); 0563 else 0564 fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... 0%% complete'); 0565 end 0566 missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned))); 0567 tmpFile=tempname; 0568 %On Windows, paths need to be translated to Unix before parsing it to WSL 0569 if ispc 0570 wslPath.tmpFile=getWSLpath(tmpFile); 0571 %mafft has problems writing to terminal (/dev/stderr) when running 0572 %on WSL via MATLAB, instead write and read progress file 0573 mafftOutput = tempname; 0574 wslPath.mafftOutput=getWSLpath(mafftOutput); 0575 wslPath.mafft=getWSLpath(fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat')); 0576 wslPath.cdhit=getWSLpath(fullfile(ravenPath,'software','cd-hit','cd-hit')); 0577 end 0578 0579 for i=1:numel(missingAligned) 0580 %This is checked here because it could be that it is created by a 0581 %parallel process. The faw-files are saved as temporary files to 0582 %kept track of which files are being worked on 0583 if ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])) &&... 0584 ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.fa'])) 0585 %Check that the multi-FASTA file exists. It should do so since 0586 %we are saving empty files as well. Print a warning and 0587 %continue if not 0588 if ~isfile(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])) 0589 EM=['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist']; 0590 dispEM(EM,false); 0591 continue; 0592 end 0593 0594 %If the multi-FASTA file is empty then save an empty aligned 0595 %file and continue 0596 s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0597 if s.bytes<=0 0598 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w'); 0599 fclose(fid); 0600 continue; 0601 end 0602 0603 %Create an empty file to prevent other threads to start to work 0604 %on the same alignment 0605 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w'); 0606 fclose(fid); 0607 0608 %First load the FASTA file, then select up to nSequences 0609 %sequences of the most closely related species, apply any 0610 %constraints from maxPhylDist, and save it as a temporary file, 0611 %and create the model from that 0612 0613 fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0614 phylDist=inf(numel(fastaStruct),1); 0615 for j=1:numel(fastaStruct) 0616 %Get the organism abbreviation 0617 index=strfind(fastaStruct(j).Header,':'); 0618 if any(index) 0619 abbrev=fastaStruct(j).Header(1:index(1)-1); 0620 [~, index]=ismember(abbrev,phylDistStruct.ids); 0621 if any(index) 0622 phylDist(j)=phylDistStruct.distMat(index(1),phylDistId); 0623 end 0624 end 0625 end 0626 0627 %Inf means that it should not be included 0628 phylDist(phylDist>maxPhylDist)=[]; 0629 0630 %Sort based on phylDist 0631 [~, order]=sort(phylDist); 0632 0633 %Save the first nSequences hits to a temporary FASTA file 0634 if nSequences<=numel(fastaStruct) 0635 fastaStruct=fastaStruct(order(1:nSequences)); 0636 else 0637 fastaStruct=fastaStruct(order); 0638 end 0639 0640 %Do the clustering and alignment if there are more than one 0641 %sequences, otherwise just save the sequence (or an empty file) 0642 if numel(fastaStruct)>1 0643 if seqIdentity~=-1 0644 cdhitInpCustom=tempname; 0645 fastawrite(cdhitInpCustom,fastaStruct); 0646 if seqIdentity<=1 && seqIdentity>0.7 0647 nparam='5'; 0648 elseif seqIdentity>0.6 0649 nparam='4'; 0650 elseif seqIdentity>0.5 0651 nparam='3'; 0652 elseif seqIdentity>0.4 0653 nparam='2'; 0654 else 0655 EM='The provided seqIdentity must be between 0 and 1\n'; 0656 dispEM(EM); 0657 end 0658 if ispc 0659 wslPath.cdhitInpCustom=getWSLpath(cdhitInpCustom); 0660 [status, output]=system(['wsl "' wslPath.cdhit '" -T "' num2str(cores) '" -i "' wslPath.cdhitInpCustom '" -o "' wslPath.tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']); 0661 elseif ismac || isunix 0662 [status, output]=system(['"' fullfile(ravenPath,'software','cd-hit',['cd-hit' binEnd]) '" -T "' num2str(cores) '" -i "' cdhitInpCustom '" -o "' tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']); 0663 end 0664 if status~=0 0665 EM=['Error when performing clustering of ' missingAligned{i} ':\n' output]; 0666 dispEM(EM); 0667 end 0668 %Remove the old tempfile 0669 if exist(cdhitInpCustom, 'file') 0670 delete([cdhitInpCustom '*']); 0671 end 0672 else 0673 %This means that CD-HIT should be skipped since 0674 %seqIdentity is equal to -1 0675 fastawrite(tmpFile,fastaStruct); 0676 end 0677 %Do the alignment for this file 0678 if ismac 0679 [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-mac','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']); 0680 elseif isunix 0681 [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']); 0682 elseif ispc 0683 wslPath.fawFile=getWSLpath(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])); 0684 [status, ~]=system(['wsl "' wslPath.mafft '" --auto --anysymbol --progress "' wslPath.mafftOutput '" --thread "' num2str(cores) '" --out "' wslPath.fawFile '" "' wslPath.tmpFile '"']); 0685 output=fileread(mafftOutput); 0686 delete(mafftOutput); 0687 end 0688 if status~=0 0689 %It could be that alignment failed because only one 0690 %sequence was left after clustering. If that is the 0691 %case, then the clustered file is just copied as 'faw' 0692 %file 0693 if any(regexp(output,'Only 1 sequence found')) 0694 movefile(tmpFile,fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'f'); 0695 else 0696 EM=['Error when performing alignment of ' missingAligned{i} ':\n' output]; 0697 dispEM(EM); 0698 end 0699 end 0700 %Remove the old tempfile 0701 if exist(tmpFile, 'file') 0702 delete([tmpFile '*']); 0703 end 0704 else 0705 %If there is only one sequence then it's not possible to do 0706 %a multiple alignment. Just print the sequence instead. An 0707 %empty file was written previously so that doesn't have to 0708 %be dealt with 0709 if numel(fastaStruct)==1 0710 warnState = warning; %Save the current warning state 0711 warning('off','Bioinfo:fastawrite:AppendToFile'); 0712 fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct); 0713 warning(warnState) %Reset warning state to previous settings 0714 end 0715 end 0716 %Move the temporary file to the real one 0717 movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f'); 0718 0719 %Print the progress every 25 files 0720 if rem(i-1,25) == 0 0721 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'aligned','*.fa')))/numel(KOModel.rxns))); 0722 progress=pad(progress,3,'left'); 0723 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0724 end 0725 end 0726 end 0727 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0728 else 0729 if seqIdentity==-1 0730 fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n'); 0731 else 0732 fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n'); 0733 end 0734 end 0735 0736 %Check if training of Hidden Markov models should be performed 0737 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]); 0738 if ~isempty(missingHMMs) 0739 fprintf('Training the KEGG Orthology specific HMMs... 0%% complete'); 0740 missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs))); 0741 %Train models for all missing KOs 0742 for i=1:numel(missingHMMs) 0743 %This is checked here because it could be that it is created by a 0744 %parallel process 0745 if ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm'])) && ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])) 0746 %Check that the aligned FASTA file exists. It could be that it 0747 %is still being worked on by some other instance of the program 0748 %(the .faw file should then exist). This should not happen on a 0749 %single computer. It doesn't throw an error, because it should 0750 %finalize the ones it can 0751 if ~isfile(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])) 0752 EM=['The aligned FASTA file for ' missingHMMs{i} ' does not exist']; 0753 dispEM(EM,false); 0754 continue; 0755 end 0756 0757 %If the multi-FASTA file is empty then save an empty aligned 0758 %file and continue 0759 s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])); 0760 if s.bytes<=0 0761 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w'); 0762 fclose(fid); 0763 continue; 0764 end 0765 %Create a temporary file to indicate that it is working on the 0766 %KO. This is because hmmbuild cannot overwrite existing files 0767 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w'); 0768 fclose(fid); 0769 0770 %Create HMM 0771 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmbuild' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']); 0772 if status~=0 0773 EM=['Error when training HMM for ' missingHMMs{i} ':\n' output]; 0774 dispEM(EM); 0775 end 0776 0777 %Delete the temporary file 0778 delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])); 0779 0780 %Print the progress every 25 files 0781 if rem(i-1,25) == 0 0782 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'hmms','*.hmm')))/numel(KOModel.rxns))); 0783 progress=pad(progress,3,'left'); 0784 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0785 end 0786 end 0787 end 0788 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0789 else 0790 fprintf('Training the KEGG Orthology specific HMMs... COMPLETE\n'); 0791 end 0792 0793 %Check which new .out files that should be generated. Check if training of 0794 %Hidden Markov models should be performed 0795 missingOUT=setdiff(KOModel.rxns,outFiles); 0796 if ~isempty(missingOUT) 0797 fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... 0%% complete'); 0798 missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT))); 0799 for i=1:numel(missingOUT) 0800 %This is checked here because it could be that it is created by a 0801 %parallel process 0802 if ~isfile(fullfile(outDir,[missingOUT{i} '.out'])) 0803 %Check that the HMM file exists. It should do so since %we are 0804 %saving empty files as well. Print a warning and continue if 0805 %not 0806 if ~isfile(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])) 0807 EM=['The HMM file for ' missingOUT{i} ' does not exist']; 0808 dispEM(EM,false); 0809 continue; 0810 end 0811 0812 %Save an empty file to prevent several threads working on the 0813 %same file 0814 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0815 fclose(fid); 0816 0817 %If the HMM file is empty then save an out file and continue 0818 s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])); 0819 if s.bytes<=0 0820 continue; 0821 end 0822 0823 %Check each gene in the input file against this model 0824 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmsearch' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']); 0825 if status~=0 0826 EM=['Error when querying HMM for ' missingOUT{i} ':\n' output]; 0827 dispEM(EM); 0828 end 0829 0830 %Save the output to a file 0831 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0832 fwrite(fid,output); 0833 fclose(fid); 0834 0835 %Print the progress every 25 files 0836 if rem(i-1,25) == 0 0837 progress=num2str(floor(100*numel(listFiles(fullfile(outDir,'*.out')))/numel(KOModel.rxns))); 0838 progress=pad(progress,3,'left'); 0839 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0840 end 0841 end 0842 end 0843 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0844 else 0845 fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... COMPLETE\n'); 0846 end 0847 0848 0849 %***Begin retrieving the output and putting together the resulting model 0850 0851 fprintf('Parsing the HMM search results... '); 0852 %Retrieve matched genes from the HMMs 0853 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes 0854 genes=cell(3000,1); 0855 %Store the best score for a gene in a hash list (since it will be searching 0856 %many times) 0857 hTable = java.util.Hashtable; 0858 0859 geneCounter=0; 0860 for i=1:numel(KOModel.rxns) 0861 if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file') 0862 fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r'); 0863 beginMatches=false; 0864 while 1 0865 %Get the next line 0866 tline = fgetl(fid); 0867 0868 %Abort at end of file 0869 if ~ischar(tline) 0870 break; 0871 end 0872 0873 if and(beginMatches,strcmp(tline,' ------ inclusion threshold ------')) 0874 break; 0875 end 0876 0877 if beginMatches==false 0878 %This is how the listing of matches begins 0879 if any(strfind(tline,'E-value ')) 0880 %Read one more line that is only padding 0881 tline = fgetl(fid); 0882 beginMatches=true; 0883 end 0884 else 0885 %If matches should be read 0886 if ~strcmp(tline,' [No hits detected that satisfy reporting thresholds]') && ~isempty(tline) 0887 elements=regexp(tline,' ','split'); 0888 elements=elements(cellfun(@any,elements)); 0889 0890 %Check if the match is below the treshhold 0891 score=str2double(elements{1}); 0892 gene=elements{9}; 0893 if score<=cutOff 0894 %If the score is exactly 0, change it to a very 0895 %small value to avoid NaN 0896 if score==0 0897 score=10^-250; 0898 end 0899 %Check if the gene is added already and, is so, get 0900 %the best score for it 0901 I=hTable.get(gene); 0902 if any(I) 0903 koGeneMat(i,I)=score; 0904 else 0905 geneCounter=geneCounter+1; 0906 %The gene was not present yet so add it 0907 hTable.put(gene,geneCounter); 0908 genes{geneCounter}=gene; 0909 koGeneMat(i,geneCounter)=score; 0910 end 0911 end 0912 else 0913 break; 0914 end 0915 end 0916 end 0917 fclose(fid); 0918 end 0919 end 0920 fprintf('COMPLETE\n'); 0921 0922 fprintf('Removing gene, KEGG Orthology associations below minScoreRatioKO, minScoreRatioG... '); 0923 koGeneMat=koGeneMat(:,1:geneCounter); 0924 0925 %Remove the genes for each KO that are below minScoreRatioKO. 0926 for i=1:size(koGeneMat,1) 0927 J=find(koGeneMat(i,:)); 0928 if any(J) 0929 koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0; 0930 end 0931 end 0932 0933 %Remove the KOs for each gene that are below minScoreRatioG 0934 for i=1:size(koGeneMat,2) 0935 J=find(koGeneMat(:,i)); 0936 if any(J) 0937 koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0; 0938 end 0939 end 0940 fprintf('COMPLETE\n'); 0941 0942 fprintf('Adding gene annotations to the model... '); 0943 %Create the new model 0944 model.genes=genes(1:geneCounter); 0945 model.grRules=cell(numel(model.rxns),1); 0946 model.grRules(:)={''}; 0947 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes)); 0948 0949 %Loop through the reactions and add the corresponding genes 0950 for i=1:numel(model.rxns) 0951 if isstruct(model.rxnMiriams{i}) 0952 %Get all KOs 0953 I=find(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology')); 0954 KOs=model.rxnMiriams{i}.value(I); 0955 %Find the KOs and the corresponding genes 0956 J=ismember(KOModel.rxns,KOs); 0957 [~, K]=find(koGeneMat(J,:)); 0958 0959 if any(K) 0960 model.rxnGeneMat(i,K)=1; 0961 %Also delete KOs for which no genes were found. If no genes at 0962 %all were matched to the reaction it will be deleted later 0963 L=sum(koGeneMat(J,:),2)==0; 0964 model.rxnMiriams{i}.value(I(L))=[]; 0965 model.rxnMiriams{i}.name(I(L))=[]; 0966 end 0967 end 0968 end 0969 fprintf('COMPLETE\n'); 0970 0971 %Find and delete all reactions without genes. This also removes genes that 0972 %are not used (which could happen because minScoreRatioG and 0973 %minScoreRatioKO). If keepSpontaneous==true, the spontaneous reactions 0974 %without genes are kept in the model. Spontaneous reactions with original 0975 %gene associations are treated in the same way, like the rest of the 0976 %reactions - if gene associations were removed during HMM search, such 0977 %reactions are deleted from the model 0978 if keepSpontaneous==true 0979 %Not the most comprise way to delete reactions without genes, but this 0980 %makes the code easier to understand. Firstly the non-spontaneous 0981 %reactions without genes are removed. After that, the second deletion 0982 %step removes spontaneous reactions, which had gene associations before 0983 %HMM search, but no longer have after it 0984 fprintf('Removing non-spontaneous reactions which after HMM search no longer have GPR rules... '); 0985 I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous); 0986 model=removeReactions(model,I,true,true); 0987 I=~any(model.rxnGeneMat,2)&ismember(model.rxns,spontRxnsWithGenes); 0988 model=removeReactions(model,I,true,true); 0989 else 0990 %Just simply check for any new reactions without genes and remove 0991 %it 0992 fprintf('Removing reactions which after HMM search no longer have GPR rules... '); 0993 I=~any(model.rxnGeneMat,2); 0994 model=removeReactions(model,I,true,true); 0995 end 0996 fprintf('COMPLETE\n'); 0997 0998 fprintf('Constructing GPR rules and finalizing the model... '); 0999 %Add the gene associations as 'or' 1000 for i=1:numel(model.rxns) 1001 %Find the involved genes 1002 I=find(model.rxnGeneMat(i,:)); 1003 if any(I) 1004 model.grRules{i}=['(' model.genes{I(1)}]; 1005 for j=2:numel(I) 1006 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 1007 end 1008 model.grRules{i}=[model.grRules{i} ')']; 1009 end 1010 end 1011 1012 %Fix grRules and reconstruct rxnGeneMat 1013 [grRules,rxnGeneMat] = standardizeGrRules(model,false); %Give detailed output 1014 model.grRules = grRules; 1015 model.rxnGeneMat = rxnGeneMat; 1016 1017 %Add the description to the reactions 1018 for i=1:numel(model.rxns) 1019 if ~isempty(model.rxnNotes{i}) 1020 model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (using HMMs).',model.rxnNotes(i)); 1021 model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. '); 1022 else 1023 model.rxnNotes(i)={'Included by getKEGGModelForOrganism (using HMMs)'}; 1024 end 1025 end 1026 %Remove the temp fasta file 1027 delete(fastaFile) 1028 fprintf('COMPLETE\n\n*** Model reconstruction complete ***\n'); 1029 end 1030 1031 function files=listFiles(directory) 1032 %Supporter function to list the files in a directory and return them as a 1033 %cell array 1034 temp=dir(directory); 1035 files=cell(numel(temp),1); 1036 for i=1:numel(temp) 1037 files{i}=temp(i,1).name; 1038 end 1039 files=strrep(files,'.fa',''); 1040 files=strrep(files,'.hmm',''); 1041 files=strrep(files,'.out',''); 1042 files=strrep(files,'.faw',''); 1043 end