getKEGGModelForOrganism Reconstructs a genome-scale metabolic model based on protein homology to the orthologies in KEGG. If the target species is not available in KEGG, the user must select a closely related species. It is also possible to circumvent protein homology search (see fastaFile parameter for more details) Input: organismID three or four letter abbreviation of the organism (as used in KEGG). If not available, use a closely related species. This is used for determing the phylogenetic distance. Use 'eukaryotes' or 'prokaryotes' to get a model for the whole domain. Only applicable if fastaFile is empty, i.e. no homology search should be performed fastaFile a FASTA file that contains the protein sequences of the organism for which to reconstruct a model (optional, if no FASTA file is supplied then a model is reconstructed based only on the organism abbreviation. This option ignores all settings except for keepSpontaneous, keepUndefinedStoich, keepIncomplete and keepGeneral) dataDir directory for which to retrieve the input data. Should contain a combination of these sub-folders: -dataDir\keggdb The KEGG database files used in 1a (see below) -dataDir\fasta The multi-FASTA files generated in 1b (see below) -dataDir\aligned The aligned FASTA files as generated in 2a (see below) -dataDir\hmms The hidden Markov models as generated in 2b or downloaded from BioMet Toolbox (see below) The final directory in dataDir should be styled as prok90_kegg105 or euk90_kegg105, indicating whether the HMMs were trained on pro- or eukaryotic sequences; using which sequence similarity treshold (first set of digits); using which KEGG version (second set of digits). (this parameter should ALWAYS be provided) outDir directory to save the results from the quering of the hidden Markov models. The output is specific for the input sequences and the settings used. It is stored in this manner so that the function can continue if interrupted or if it should run in parallel. Be careful not to leave output files from different organisms or runs with different settings in the same folder. They will not be overwritten (optional, default is a temporary dir where all *.out files are deleted before and after doing the reconstruction) keepSpontaneous include reactions labeled as "spontaneous". (optional, default true) keepUndefinedStoich include reactions in the form n A <=> n+1 A. These will be dealt with as two separate metabolites (optional, default true) keepIncomplete include reactions which have been labelled as "incomplete", "erroneous" or "unclear" (optional, default true) keepGeneral include reactions which have been labelled as "general reaction". These are reactions on the form "an aldehyde <=> an alcohol", and are therefore unsuited for modelling purposes. Note that not all reactions have this type of annotation, and the script will therefore not be able to remove all such reactions (optional, default false) cutOff significance score from HMMer needed to assign genes to a KO (optional, default 10^-50) minScoreRatioG a gene is only assigned to KOs for which the score is >=log(score)/log(best score) for that gene. This is to prevent that a gene which clearly belongs to one KO is assigned also to KOs with much lower scores (optional, default 0.8 (lower is less strict)) minScoreRatioKO ignore genes in a KO if their score is <log(score)/log(best score in KO). This is to "prune" KOs which have many genes and where some are clearly a better fit (optional, default 0.3 (lower is less strict)) maxPhylDist -1: only use sequences from the same domain (Prokaryota, Eukaryota) other (positive) value: only use sequences for organisms where the phylogenetic distance is at the most this large (as calculated in getPhylDist) (optional, default Inf, which means that all sequences will be used) nSequences for each KO, use up to this many sequences from the most closely related species. This is mainly to speed up the alignment process for KOs with very many genes. This subsampling is performed before running CD-HIT (optional, default inf) seqIdentity sequence identity threshold in CD-HIT, referred as "global sequence identity" in CD-HIT User's Guide. If -1 is provided, CD-HIT is skipped (optional, default 0.9) globalModel structure containing both model and KOModel structures as generated by getModelFromKEGG. These will otherwise be loaded by via getModelFromKEGG. Providing globalKEGGmodel can speed up model generation if getKEGGModelForOrganism is run multiple times for different strains. Example: [globalModel.model,globalModel.KOModel] = getModelFromKEGG; (optional, default empty, global model is loaded by getModelFromKEGG) Output: model the reconstructed model PLEASE READ THIS: The input to this function can be confusing, because it is intended to be run in parallel on a cluster or in multiple sessions. It therefore saves a lot of intermediate results to storage. This also serves the purpose of not having to do redundant calculations. This, however, comes with the disadvantage of somewhat trickier handling. This is what this function does: 1a. Loads files from a local KEGG FTP dump and constructs a general RAVEN model representing the metabolic network. The functions getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files, which are later merged into 'keggModel.mat' by getModelFromKEGG function. The function getPhylDist generates 'keggPhylDist.mat' file. KEGG FTP access requires a <a href="matlab: web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>. 1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One multi-FASTA file for each KO in KEGG is generated. The Step 1 has to be re-done every time KEGG updates their database (or rather when the updates are large enough to warrant re-running this part). Many users would probably never use this feature. 2a. Filters KO-specific protein sets. This is done by using the settings "maxPhylDist" and "nSequences" to control which sequences should be used for constructing Hidden Markov models (HMMs), and later for matching your sequences to. The most common alternatives here would be to use sequences from only eukaryotes, only prokaryotes or all sequences in KEGG, but you could also play around with the parameters to use e.g. only fungal sequences. 2b. KO-specific protein FASTA files are re-organised into non-redundant protein sets with CD-HIT. The user can only set seqIdentity parameter, which corresponds to '-c' parameter in CD-HIT, described as "sequence identity threshold". CD-HIT suggsted sequence identity specific word_length (-n) parameters are used. 2c. Does a multi sequence alignment for multi-FASTA files obtained in Step 2b for future use. MAFFT software with automatic selection of alignment algorithm is used in this step ('--auto'). 2d. Trains hidden Markov models using HMMer for each of the aligned KO-specific FASTA files obtained in Step 2c. This is performed with 'hmmbuild' using the default settings. Step 2 may be reasonable to be re-done if the user wants to tweak the settings in proteins filtering, clustering, multi sequence alignment or HMMs training steps. However, it requires to have KO-specific protein FASTA files obtained in Step 1a. As such files are not provided in RAVEN and BioMet ToolBox, the user can only generate these files from KEGG FTP dump files, so KEGG FTP license is needed. 3a. Queries the HMMs with sequences for the organism you are making a model for. This step uses both the output from step 1a and from 2d. This is done with 'hmmsearch' function under default settings. The significance threshold value set in 'cutOff' parameter is used later when parsing '*.out' files to filter out KO hits with higher value than 'cutOff' value. The results with passable E values are summarised into KO-gene occurence matrix with E values in intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and 'minScoreRatioKO' are then applied to 'prune' KO-gene associations (see the function descriptions above for more details). The intersection values for these 'prunable' associations are converted to zeroes. 3b. Constructs a model based on the pre-processed KO-gene association matrix (koGeneMat). As the full KEGG model already has reaction-KO relationships, KOs are converted into the query genes. The final draft model contains only these reactions, which are associated with KOs from koGeneMat. The reactions without the genes may also be included, if the user set keepSpontaneous as 'true'. The Step 3 is specific to the organism for which the model is reconstructed. In principle the function looks at which output that is already available and runs only the parts that are required for step 3. This means that (see the definition of the parameters for details): -1a is only performed if there are no KEGG model files in the RAVEN\external\kegg directory -1b is only performed if not all required HMMs OR aligned FASTA files OR multi-FASTA files exist in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). If not all files exist it will try to find the KEGG database files in dataDir. -2a is only performed if not all required HMMs OR aligned FASTA files files exist in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). -2b is only performed if not all required HMMs exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from BioMet Toolbox instead (see above). -3a is performed for the required HMMs for which no corresponding .out file exists in outDir. This is just a way to enable the function to be run in parallel or to resume if interrupted. -3b is always performed. These steps are specific to the organism for which you are reconstructing the model. Regarding the whole pipeline, the function checks the output that is already available and runs only the parts that are required for step 3. This means that (see the definition of the parameters for details): -1a is only performed if there are no KEGG model files in the RAVEN\external\kegg directory. -1b is only performed if any of required KOs do not have HMMs, aligned FASTA files, clustered FASTA files and raw FASTA files in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). If not all files exist it will try to find the KEGG database files in dataDir. -2ab are only performed if any of required KOs do not have HMMs, aligned FASTA files and clustered FASTA files in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). -2c is only performed if any of required KOs do not have HMMs and aligned FASTA files in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from BioMet Toolbox instead (see above). -2d is only performed if any of required KOs do not have HMMs exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from BioMet Toolbox instead (see above). -3a is performed for the required HMMs for which no corresponding .out file exists in outDir. This is just a way to enable the function to be run in parallel or to resume if interrupted. -3b is always performed. NOTE: it is also possible to obtain draft model from KEGG without providing protein FASTA file for the target organism. In such case the organism three-four letter abbreviation set as 'organismID' must exist in the local KEGG database. In such case, the program just fetches all the reactions, which are associated with given 'organismID'. Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... nSequences,seqIdentity)
0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... 0002 outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... 0003 keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... 0004 nSequences,seqIdentity,globalModel) 0005 % getKEGGModelForOrganism 0006 % Reconstructs a genome-scale metabolic model based on protein homology 0007 % to the orthologies in KEGG. If the target species is not available in 0008 % KEGG, the user must select a closely related species. It is also 0009 % possible to circumvent protein homology search (see fastaFile parameter 0010 % for more details) 0011 % 0012 % Input: 0013 % organismID three or four letter abbreviation of the organism 0014 % (as used in KEGG). If not available, use a closely 0015 % related species. This is used for determing the 0016 % phylogenetic distance. Use 'eukaryotes' or 0017 % 'prokaryotes' to get a model for the whole domain. 0018 % Only applicable if fastaFile is empty, i.e. no 0019 % homology search should be performed 0020 % fastaFile a FASTA file that contains the protein sequences of 0021 % the organism for which to reconstruct a model (optional, 0022 % if no FASTA file is supplied then a model is 0023 % reconstructed based only on the organism 0024 % abbreviation. This option ignores all settings 0025 % except for keepSpontaneous, keepUndefinedStoich, 0026 % keepIncomplete and keepGeneral) 0027 % dataDir directory for which to retrieve the input data. 0028 % Should contain a combination of these sub-folders: 0029 % -dataDir\keggdb 0030 % The KEGG database files used in 1a (see below) 0031 % -dataDir\fasta 0032 % The multi-FASTA files generated in 1b (see 0033 % below) 0034 % -dataDir\aligned 0035 % The aligned FASTA files as generated in 2a (see 0036 % below) 0037 % -dataDir\hmms 0038 % The hidden Markov models as generated in 2b or 0039 % downloaded from BioMet Toolbox (see below) 0040 % The final directory in dataDir should be styled as 0041 % prok90_kegg105 or euk90_kegg105, indicating whether 0042 % the HMMs were trained on pro- or eukaryotic 0043 % sequences; using which sequence similarity treshold 0044 % (first set of digits); using which KEGG version 0045 % (second set of digits). (this parameter should 0046 % ALWAYS be provided) 0047 % outDir directory to save the results from the quering of 0048 % the hidden Markov models. The output is specific 0049 % for the input sequences and the settings used. It 0050 % is stored in this manner so that the function can 0051 % continue if interrupted or if it should run in 0052 % parallel. Be careful not to leave output files from 0053 % different organisms or runs with different settings 0054 % in the same folder. They will not be overwritten 0055 % (optional, default is a temporary dir where all *.out 0056 % files are deleted before and after doing the 0057 % reconstruction) 0058 % keepSpontaneous include reactions labeled as "spontaneous". (optional, 0059 % default true) 0060 % keepUndefinedStoich include reactions in the form n A <=> n+1 A. These 0061 % will be dealt with as two separate metabolites 0062 % (optional, default true) 0063 % keepIncomplete include reactions which have been labelled as 0064 % "incomplete", "erroneous" or "unclear" (optional, 0065 % default true) 0066 % keepGeneral include reactions which have been labelled as 0067 % "general reaction". These are reactions on the form 0068 % "an aldehyde <=> an alcohol", and are therefore 0069 % unsuited for modelling purposes. Note that not all 0070 % reactions have this type of annotation, and the 0071 % script will therefore not be able to remove all 0072 % such reactions (optional, default false) 0073 % cutOff significance score from HMMer needed to assign 0074 % genes to a KO (optional, default 10^-50) 0075 % minScoreRatioG a gene is only assigned to KOs for which the score 0076 % is >=log(score)/log(best score) for that gene. This 0077 % is to prevent that a gene which clearly belongs to 0078 % one KO is assigned also to KOs with much lower 0079 % scores (optional, default 0.8 (lower is less strict)) 0080 % minScoreRatioKO ignore genes in a KO if their score is 0081 % <log(score)/log(best score in KO). This is to 0082 % "prune" KOs which have many genes and where some are 0083 % clearly a better fit (optional, default 0.3 (lower is 0084 % less strict)) 0085 % maxPhylDist -1: only use sequences from the same domain 0086 % (Prokaryota, Eukaryota) 0087 % other (positive) value: only use sequences for 0088 % organisms where the phylogenetic distance is at the 0089 % most this large (as calculated in getPhylDist) 0090 % (optional, default Inf, which means that all sequences 0091 % will be used) 0092 % nSequences for each KO, use up to this many sequences from the 0093 % most closely related species. This is mainly to 0094 % speed up the alignment process for KOs with very 0095 % many genes. This subsampling is performed before 0096 % running CD-HIT (optional, default inf) 0097 % seqIdentity sequence identity threshold in CD-HIT, referred as 0098 % "global sequence identity" in CD-HIT User's Guide. 0099 % If -1 is provided, CD-HIT is skipped (optional, default 0.9) 0100 % globalModel structure containing both model and KOModel 0101 % structures as generated by getModelFromKEGG. These 0102 % will otherwise be loaded by via getModelFromKEGG. 0103 % Providing globalKEGGmodel can speed up model 0104 % generation if getKEGGModelForOrganism is run 0105 % multiple times for different strains. Example: 0106 % [globalModel.model,globalModel.KOModel] = getModelFromKEGG; 0107 % (optional, default empty, global model is loaded by 0108 % getModelFromKEGG) 0109 % 0110 % Output: 0111 % model the reconstructed model 0112 % 0113 % PLEASE READ THIS: The input to this function can be confusing, because 0114 % it is intended to be run in parallel on a cluster or in multiple 0115 % sessions. It therefore saves a lot of intermediate results to storage. 0116 % This also serves the purpose of not having to do redundant 0117 % calculations. This, however, comes with the disadvantage of somewhat 0118 % trickier handling. This is what this function does: 0119 % 0120 % 1a. Loads files from a local KEGG FTP dump and constructs a general 0121 % RAVEN model representing the metabolic network. The functions 0122 % getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the 0123 % data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files, 0124 % which are later merged into 'keggModel.mat' by getModelFromKEGG 0125 % function. The function getPhylDist generates 'keggPhylDist.mat' 0126 % file. KEGG FTP access requires a <a href="matlab: 0127 % web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>. 0128 % 1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One 0129 % multi-FASTA file for each KO in KEGG is generated. 0130 % 0131 % The Step 1 has to be re-done every time KEGG updates their database (or 0132 % rather when the updates are large enough to warrant re-running this 0133 % part). Many users would probably never use this feature. 0134 % 0135 % 2a. Filters KO-specific protein sets. This is done by using the 0136 % settings "maxPhylDist" and "nSequences" to control which sequences 0137 % should be used for constructing Hidden Markov models (HMMs), and 0138 % later for matching your sequences to. 0139 % The most common alternatives here would be to use sequences from 0140 % only eukaryotes, only prokaryotes or all sequences in KEGG, but you 0141 % could also play around with the parameters to use e.g. only fungal 0142 % sequences. 0143 % 2b. KO-specific protein FASTA files are re-organised into 0144 % non-redundant protein sets with CD-HIT. The user can only set 0145 % seqIdentity parameter, which corresponds to '-c' parameter in 0146 % CD-HIT, described as "sequence identity threshold". CD-HIT suggsted 0147 % sequence identity specific word_length (-n) parameters are used. 0148 % 2c. Does a multi sequence alignment for multi-FASTA files obtained in 0149 % Step 2b for future use. MAFFT software with automatic selection of 0150 % alignment algorithm is used in this step ('--auto'). 0151 % 2d. Trains hidden Markov models using HMMer for each of the aligned 0152 % KO-specific FASTA files obtained in Step 2c. This is performed with 0153 % 'hmmbuild' using the default settings. 0154 % 0155 % Step 2 may be reasonable to be re-done if the user wants to tweak the 0156 % settings in proteins filtering, clustering, multi sequence alignment or 0157 % HMMs training steps. However, it requires to have KO-specific protein 0158 % FASTA files obtained in Step 1a. As such files are not provided in 0159 % RAVEN and BioMet ToolBox, the user can only generate these files from 0160 % KEGG FTP dump files, so KEGG FTP license is needed. 0161 % 0162 % 3a. Queries the HMMs with sequences for the organism you are making a 0163 % model for. This step uses both the output from step 1a and from 2d. 0164 % This is done with 'hmmsearch' function under default settings. The 0165 % significance threshold value set in 'cutOff' parameter is used 0166 % later when parsing '*.out' files to filter out KO hits with higher 0167 % value than 'cutOff' value. The results with passable E values are 0168 % summarised into KO-gene occurence matrix with E values in 0169 % intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and 0170 % 'minScoreRatioKO' are then applied to 'prune' KO-gene associations 0171 % (see the function descriptions above for more details). The 0172 % intersection values for these 'prunable' associations are converted 0173 % to zeroes. 0174 % 3b. Constructs a model based on the pre-processed KO-gene association 0175 % matrix (koGeneMat). As the full KEGG model already has reaction-KO 0176 % relationships, KOs are converted into the query genes. The final 0177 % draft model contains only these reactions, which are associated 0178 % with KOs from koGeneMat. The reactions without the genes may also 0179 % be included, if the user set keepSpontaneous as 'true'. 0180 % 0181 % The Step 3 is specific to the organism for which the model is 0182 % reconstructed. 0183 % 0184 % In principle the function looks at which output that is already available 0185 % and runs only the parts that are required for step 3. This means 0186 % that (see the definition of the parameters for details): 0187 % -1a is only performed if there are no KEGG model files in the 0188 % RAVEN\external\kegg directory 0189 % -1b is only performed if not all required HMMs OR aligned FASTA files 0190 % OR multi-FASTA files exist in the defined dataDir. This means that this 0191 % step is skipped if the HMMs are downloaded from BioMet Toolbox instead 0192 % (see above). If not all files exist it will try to find 0193 % the KEGG database files in dataDir. 0194 % -2a is only performed if not all required HMMs OR aligned FASTA files 0195 % files exist in the defined dataDir. This means that this step is skipped 0196 % if the HMMs are downloaded from BioMet Toolbox instead (see above). 0197 % -2b is only performed if not all required HMMs exist in the defined 0198 % dataDir. This means that this step is skipped if the FASTA files or 0199 % HMMs are downloaded from BioMet Toolbox instead (see above). 0200 % -3a is performed for the required HMMs for which no corresponding .out 0201 % file exists in outDir. This is just a way to enable the function to be 0202 % run in parallel or to resume if interrupted. 0203 % -3b is always performed. 0204 % 0205 % These steps are specific to the organism for which you are 0206 % reconstructing the model. 0207 % 0208 % Regarding the whole pipeline, the function checks the output that is 0209 % already available and runs only the parts that are required for step 3. 0210 % This means that (see the definition of the parameters for details): 0211 % -1a is only performed if there are no KEGG model files in the 0212 % RAVEN\external\kegg directory. 0213 % -1b is only performed if any of required KOs do not have HMMs, aligned 0214 % FASTA files, clustered FASTA files and raw FASTA files in the defined 0215 % dataDir. This means that this step is skipped if the HMMs are 0216 % downloaded from BioMet Toolbox instead (see above). If not all files 0217 % exist it will try to find the KEGG database files in dataDir. 0218 % -2ab are only performed if any of required KOs do not have HMMs, 0219 % aligned FASTA files and clustered FASTA files in the defined dataDir. 0220 % This means that this step is skipped if the HMMs are downloaded from 0221 % BioMet Toolbox instead (see above). 0222 % -2c is only performed if any of required KOs do not have HMMs and 0223 % aligned FASTA files in the defined dataDir. This means that this step 0224 % is skipped if the HMMs are downloaded from BioMet Toolbox instead (see 0225 % above). 0226 % -2d is only performed if any of required KOs do not have HMMs exist in 0227 % the defined dataDir. This means that this step is skipped if the FASTA 0228 % files or HMMs are downloaded from BioMet Toolbox instead (see above). 0229 % -3a is performed for the required HMMs for which no corresponding .out 0230 % file exists in outDir. This is just a way to enable the function to be 0231 % run in parallel or to resume if interrupted. 0232 % -3b is always performed. 0233 % 0234 % NOTE: it is also possible to obtain draft model from KEGG without 0235 % providing protein FASTA file for the target organism. In such case the 0236 % organism three-four letter abbreviation set as 'organismID' must exist 0237 % in the local KEGG database. In such case, the program just fetches all 0238 % the reactions, which are associated with given 'organismID'. 0239 % 0240 % Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... 0241 % outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... 0242 % keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... 0243 % nSequences,seqIdentity) 0244 0245 if nargin<2 || isempty(fastaFile) 0246 fastaFile=[]; 0247 else 0248 fastaFile=char(fastaFile); 0249 end 0250 if nargin<3 0251 dataDir=[]; 0252 else 0253 dataDir=char(dataDir); 0254 end 0255 if nargin<4 || isempty(outDir) 0256 outDir=tempdir; 0257 %Delete all *.out files if any exist 0258 delete(fullfile(outDir,'*.out')); 0259 else 0260 outDir=char(outDir); 0261 end 0262 if nargin<5 0263 keepSpontaneous=true; 0264 end 0265 if nargin<6 0266 keepUndefinedStoich=true; 0267 end 0268 if nargin<7 0269 keepIncomplete=true; 0270 end 0271 if nargin<8 0272 keepGeneral=false; 0273 end 0274 if nargin<9 0275 cutOff=10^-50; 0276 end 0277 if nargin<10 0278 minScoreRatioKO=0.3; 0279 end 0280 if nargin<11 0281 minScoreRatioG=0.8; 0282 end 0283 if nargin<12 0284 maxPhylDist=inf; 0285 %Include all sequences for each reaction 0286 end 0287 if nargin<13 0288 nSequences=inf; 0289 %Include all sequences for each reaction 0290 end 0291 if nargin<14 0292 seqIdentity=0.9; 0293 end 0294 0295 if isempty(fastaFile) 0296 fprintf(['\n*** The model reconstruction from KEGG based on the annotation available for KEGG Species <strong>' organismID '</strong> ***\n\n']); 0297 else 0298 fprintf('\n*** The model reconstruction from KEGG based on the protein homology search against KEGG Orthology specific HMMs ***\n\n'); 0299 %Check if query fasta exists 0300 fastaFile=checkFileExistence(fastaFile,2); %Copy file to temp dir 0301 end 0302 0303 %Run the external binaries multi-threaded to use all logical cores assigned 0304 %to MATLAB 0305 cores = evalc('feature(''numcores'')'); 0306 cores = strsplit(cores, 'MATLAB was assigned: '); 0307 cores = regexp(cores{2},'^\d*','match'); 0308 cores = cores{1}; 0309 0310 %Get the directory for RAVEN Toolbox. 0311 ravenPath=findRAVENroot(); 0312 0313 %Checking if dataDir is consistent. It must point to pre-trained HMMs set, 0314 %compatible with the the current RAVEN version. The user may have the 0315 %required zip file already in working directory or have it extracted. If 0316 %the zip file and directory is not here, it is downloaded from the cloud 0317 if ~isempty(dataDir) 0318 hmmOptions={'euk90_kegg105','prok90_kegg105'}; 0319 if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions. 0320 %If not, then check whether the required folders exist anyway. 0321 if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) && ... 0322 ~isfolder(fullfile(dataDir,'fasta')) && ... 0323 ~isfolder(fullfile(dataDir,'aligned')) && ... 0324 ~isfolder(fullfile(dataDir,'hmms')) 0325 error(['Pre-trained HMMs set is not recognised. If you want download RAVEN provided sets, it should match any of the following: ' strjoin(hmmOptions,' or ')]) 0326 end 0327 else 0328 if isfolder(dataDir) && isfile(fullfile(dataDir,'hmms','K00844.hmm')) 0329 fprintf(['NOTE: Found <strong>' dataDir '</strong> directory with pre-trained HMMs, it will therefore be used during reconstruction\n']); 0330 elseif ~isfolder(dataDir) && isfile([dataDir,'.zip']) 0331 fprintf('Extracting the HMMs archive file... '); 0332 unzip([dataDir,'.zip']); 0333 fprintf('COMPLETE\n'); 0334 else 0335 hmmIndex=strcmp(dataDir,hmmOptions); 0336 if ~any(hmmIndex) 0337 error(['Pre-trained HMMs are only provided with proteins clustered at 90% sequence identity (i.e. prok90_kegg105 and euk90_kegg105). ' ... 0338 'Use either of these datasets, or otherwise download the relevant sequence data from KEGG to train HMMs with your desired sequence identity']) 0339 else 0340 fprintf('Downloading the HMMs archive file... '); 0341 try 0342 websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.8.0/',hmmOptions{hmmIndex},'.zip']); 0343 catch ME 0344 if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError') 0345 error('Failed to download the HMMs archive file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues') 0346 end 0347 end 0348 end 0349 0350 fprintf('COMPLETE\n'); 0351 fprintf('Extracting the HMMs archive file... '); 0352 unzip([dataDir,'.zip']); 0353 fprintf('COMPLETE\n'); 0354 end 0355 %Check if HMMs are extracted 0356 if ~isfile(fullfile(dataDir,'hmms','K00844.hmm')) 0357 error(['The HMM files seem improperly extracted and not found in ',dataDir,'/hmms. Please remove ',dataDir,' folder and rerun getKEGGModelForOrganism']); 0358 end 0359 end 0360 end 0361 0362 %Check if the fasta-file contains '/' or'\'. If not then it's probably just 0363 %a file name. Expand to full path. 0364 if any(fastaFile) 0365 if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/')) 0366 fastaFile=which(fastaFile); 0367 end 0368 %Create the required sub-folders in dataDir if they dont exist 0369 if ~isfolder(fullfile(dataDir,'keggdb')) 0370 mkdir(dataDir,'keggdb'); 0371 end 0372 if ~isfolder(fullfile(dataDir,'fasta')) 0373 mkdir(dataDir,'fasta'); 0374 end 0375 if ~isfolder(fullfile(dataDir,'aligned')) 0376 mkdir(dataDir,'aligned'); 0377 end 0378 if ~isfolder(fullfile(dataDir,'hmms')) 0379 mkdir(dataDir,'hmms'); 0380 end 0381 if ~isfolder(outDir) 0382 mkdir(outDir); 0383 end 0384 end 0385 0386 %First generate the full global KEGG model. Can be provided as input. 0387 %Otherwise, getModelFromKEGG is run. The dataDir must not be supplied as 0388 %there is also an internal RAVEN version available 0389 if nargin==15 0390 model=globalModel.model; 0391 KOModel=globalModel.KOModel; 0392 elseif any(dataDir) 0393 [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral); 0394 else 0395 [model, KOModel]=getModelFromKEGG([],keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral); 0396 end 0397 model.id=organismID; 0398 model.c=zeros(numel(model.rxns),1); 0399 0400 %If no FASTA file is supplied, then just remove all genes which are not for 0401 %the given organism ID 0402 if isempty(fastaFile) 0403 %Check if organismID can be found in KEGG species list or is 0404 %set to "eukaryotes" or "prokaryotes" 0405 phylDistsFull=getPhylDist(fullfile(dataDir,'keggdb'),true); 0406 if ~ismember(organismID,[phylDistsFull.ids 'eukaryotes' 'prokaryotes']) 0407 error('Provided organismID is incorrect. Only species abbreviations from KEGG Species List or "eukaryotes"/"prokaryotes" are allowed.'); 0408 end 0409 0410 fprintf(['Pruning the model from <strong>non-' organismID '</strong> genes... ']); 0411 if ismember(organismID,{'eukaryotes','prokaryotes'}) 0412 phylDists=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0413 if strcmp(organismID,'eukaryotes') 0414 proxyid='hsa'; 0415 %Use H. sapiens here 0416 else 0417 proxyid='eco'; 0418 %Use E. coli here 0419 end 0420 [~, phylDistId]=ismember(proxyid,phylDists.ids); 0421 idsToKeep=phylDists.ids(~isinf(phylDists.distMat(phylDistId,:))); 0422 taxIDs=cellfun(@(x) x{1},cellfun(@(x) strsplit(x,':'),model.genes,'UniformOutput',false),'UniformOutput',false); 0423 I=ismember(upper(taxIDs),upper(idsToKeep)); 0424 else 0425 %KEGG organism IDs may have three or four letters 0426 organismID=strcat(organismID,':'); 0427 %Add colon for accurate matching 0428 if length(organismID)==4 0429 I=cellfun(@(x) strcmpi(x(1:4),organismID),model.genes); 0430 elseif length(organismID)==5 0431 I=cellfun(@(x) strcmpi(x(1:5),organismID),model.genes); 0432 end 0433 end 0434 %Remove those genes 0435 model.genes=model.genes(I); 0436 model.rxnGeneMat=model.rxnGeneMat(:,I); 0437 fprintf('COMPLETE\n'); 0438 end 0439 0440 %First remove all reactions without genes 0441 if keepSpontaneous==true 0442 fprintf('Removing non-spontaneous reactions without GPR rules... '); 0443 load(fullfile(ravenPath,'external','kegg','keggRxns.mat'),'isSpontaneous'); 0444 I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous); 0445 spontRxnsWithGenes=model.rxns(any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous)); 0446 else 0447 fprintf('Removing reactions without GPR rules... '); 0448 I=~any(model.rxnGeneMat,2); 0449 end 0450 model=removeReactions(model,I,true); 0451 fprintf('COMPLETE\n'); 0452 0453 %Clean gene names 0454 fprintf('Fixing gene names in the model... '); 0455 %Get rid of the prefix organism id 0456 model.genes=regexprep(model.genes,'^\w+?:',''); 0457 fprintf('COMPLETE\n'); 0458 0459 %If no FASTA file is supplied, then we are done here 0460 if isempty(fastaFile) 0461 %Create grRules 0462 fprintf('Constructing GPR associations and annotations for the model... '); 0463 model.grRules=cell(numel(model.rxns),1); 0464 model.grRules(:)={''}; 0465 %Add the gene associations as 'or' 0466 for i=1:numel(model.rxns) 0467 %Find the involved genes 0468 I=find(model.rxnGeneMat(i,:)); 0469 if any(I) 0470 model.grRules{i}=['(' model.genes{I(1)}]; 0471 for j=2:numel(I) 0472 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 0473 end 0474 model.grRules{i}=[model.grRules{i} ')']; 0475 end 0476 end 0477 %Fix grRules and reconstruct rxnGeneMat 0478 [grRules,rxnGeneMat] = standardizeGrRules(model); %Give detailed output 0479 model.grRules = grRules; 0480 model.rxnGeneMat = rxnGeneMat; 0481 %Add geneMiriams, assuming that it follows the syntax 0482 %kegg.genes/organismID:geneName 0483 model.geneMiriams=''; 0484 for i=1:numel(model.genes) 0485 model.geneMiriams{i,1}.name{1,1}='kegg.genes'; 0486 model.geneMiriams{i,1}.value{1,1}=strcat(lower(organismID),model.genes{i,1}); 0487 end 0488 %Add the description to the reactions 0489 for i=1:numel(model.rxns) 0490 if ~isempty(model.rxnNotes{i}) 0491 model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (without HMMs).',model.rxnNotes(i)); 0492 model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. '); 0493 else 0494 model.rxnNotes(i)={'Included by getKEGGModelForOrganism (without HMMs)'}; 0495 end 0496 end 0497 fprintf('COMPLETE\n\n'); 0498 fprintf('*** Model reconstruction complete ***\n'); 0499 return; 0500 end 0501 0502 %Create a phylogenetic distance structure 0503 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0504 [~, phylDistId]=ismember(model.id,phylDistStruct.ids); 0505 0506 %Calculate the real maximal distance now. An abitary large number of 1000 0507 %is used for the "all in kingdom" or "all sequences" options. This is a bit 0508 %inconvenient way to do it, but it is to make it fit with some older code 0509 if isinf(maxPhylDist) || maxPhylDist==-1 0510 maxPhylDist=1000; 0511 end 0512 0513 %Get the KO ids for which files have been generated. Maybe not the neatest 0514 %way.. 0515 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa')); 0516 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa')); 0517 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw')); 0518 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm')); 0519 outFiles=listFiles(fullfile(outDir,'*.out')); 0520 0521 %Check if multi-FASTA files should be generated. This should only be 0522 %performed if there are IDs in the KOModel structure that haven't been 0523 %parsed yet 0524 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]); 0525 0526 if ~isempty(missingFASTA) 0527 if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) 0528 EM=['The file ''genes.pep'' cannot be located at ' strrep(dataDir,'\','/') '/ and should be downloaded from the KEGG FTP.\n']; 0529 dispEM(EM); 0530 end 0531 %Only construct models for KOs which don't have files already 0532 fastaModel=removeReactions(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true); 0533 %Permute the order of the KOs in the model so that constructMultiFasta 0534 %can be run on several processors at once 0535 fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns'); 0536 constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta')); 0537 else 0538 fprintf('Generating the KEGG Orthology specific multi-FASTA files... COMPLETE\n'); 0539 end 0540 0541 if isunix 0542 if ismac 0543 binEnd='.mac'; 0544 else 0545 binEnd=''; 0546 end 0547 elseif ispc 0548 binEnd=''; 0549 else 0550 EM='Unknown OS, exiting.'; 0551 disp(EM); 0552 return 0553 end 0554 0555 %Check if alignment of FASTA files should be performed 0556 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]); 0557 if ~isempty(missingAligned) 0558 if seqIdentity==-1 0559 fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... 0%% complete'); 0560 else 0561 fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... 0%% complete'); 0562 end 0563 missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned))); 0564 tmpFile=tempname; 0565 %On Windows, paths need to be translated to Unix before parsing it to WSL 0566 if ispc 0567 wslPath.tmpFile=getWSLpath(tmpFile); 0568 %mafft has problems writing to terminal (/dev/stderr) when running 0569 %on WSL via MATLAB, instead write and read progress file 0570 mafftOutput = tempname; 0571 wslPath.mafftOutput=getWSLpath(mafftOutput); 0572 wslPath.mafft=getWSLpath(fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat')); 0573 wslPath.cdhit=getWSLpath(fullfile(ravenPath,'software','cd-hit','cd-hit')); 0574 end 0575 0576 for i=1:numel(missingAligned) 0577 %This is checked here because it could be that it is created by a 0578 %parallel process. The faw-files are saved as temporary files to 0579 %kept track of which files are being worked on 0580 if ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])) &&... 0581 ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.fa'])) 0582 %Check that the multi-FASTA file exists. It should do so since 0583 %we are saving empty files as well. Print a warning and 0584 %continue if not 0585 if ~isfile(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])) 0586 EM=['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist']; 0587 dispEM(EM,false); 0588 continue; 0589 end 0590 0591 %If the multi-FASTA file is empty then save an empty aligned 0592 %file and continue 0593 s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0594 if s.bytes<=0 0595 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w'); 0596 fclose(fid); 0597 continue; 0598 end 0599 0600 %Create an empty file to prevent other threads to start to work 0601 %on the same alignment 0602 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w'); 0603 fclose(fid); 0604 0605 %First load the FASTA file, then select up to nSequences 0606 %sequences of the most closely related species, apply any 0607 %constraints from maxPhylDist, and save it as a temporary file, 0608 %and create the model from that 0609 0610 fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0611 phylDist=inf(numel(fastaStruct),1); 0612 for j=1:numel(fastaStruct) 0613 %Get the organism abbreviation 0614 index=strfind(fastaStruct(j).Header,':'); 0615 if any(index) 0616 abbrev=fastaStruct(j).Header(1:index(1)-1); 0617 [~, index]=ismember(abbrev,phylDistStruct.ids); 0618 if any(index) 0619 phylDist(j)=phylDistStruct.distMat(index(1),phylDistId); 0620 end 0621 end 0622 end 0623 0624 %Inf means that it should not be included 0625 phylDist(phylDist>maxPhylDist)=[]; 0626 0627 %Sort based on phylDist 0628 [~, order]=sort(phylDist); 0629 0630 %Save the first nSequences hits to a temporary FASTA file 0631 if nSequences<=numel(fastaStruct) 0632 fastaStruct=fastaStruct(order(1:nSequences)); 0633 else 0634 fastaStruct=fastaStruct(order); 0635 end 0636 0637 %Do the clustering and alignment if there are more than one 0638 %sequences, otherwise just save the sequence (or an empty file) 0639 if numel(fastaStruct)>1 0640 if seqIdentity~=-1 0641 cdhitInpCustom=tempname; 0642 fastawrite(cdhitInpCustom,fastaStruct); 0643 if seqIdentity<=1 && seqIdentity>0.7 0644 nparam='5'; 0645 elseif seqIdentity>0.6 0646 nparam='4'; 0647 elseif seqIdentity>0.5 0648 nparam='3'; 0649 elseif seqIdentity>0.4 0650 nparam='2'; 0651 else 0652 EM='The provided seqIdentity must be between 0 and 1\n'; 0653 dispEM(EM); 0654 end 0655 if ispc 0656 wslPath.cdhitInpCustom=getWSLpath(cdhitInpCustom); 0657 [status, output]=system(['wsl "' wslPath.cdhit '" -T "' num2str(cores) '" -i "' wslPath.cdhitInpCustom '" -o "' wslPath.tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']); 0658 elseif ismac || isunix 0659 [status, output]=system(['"' fullfile(ravenPath,'software','cd-hit',['cd-hit' binEnd]) '" -T "' num2str(cores) '" -i "' cdhitInpCustom '" -o "' tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']); 0660 end 0661 if status~=0 0662 EM=['Error when performing clustering of ' missingAligned{i} ':\n' output]; 0663 dispEM(EM); 0664 end 0665 %Remove the old tempfile 0666 if exist(cdhitInpCustom, 'file') 0667 delete([cdhitInpCustom '*']); 0668 end 0669 else 0670 %This means that CD-HIT should be skipped since 0671 %seqIdentity is equal to -1 0672 fastawrite(tmpFile,fastaStruct); 0673 end 0674 %Do the alignment for this file 0675 if ismac 0676 [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-mac','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']); 0677 elseif isunix 0678 [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']); 0679 elseif ispc 0680 wslPath.fawFile=getWSLpath(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])); 0681 [status, ~]=system(['wsl "' wslPath.mafft '" --auto --anysymbol --progress "' wslPath.mafftOutput '" --thread "' num2str(cores) '" --out "' wslPath.fawFile '" "' wslPath.tmpFile '"']); 0682 output=fileread(mafftOutput); 0683 delete(mafftOutput); 0684 end 0685 if status~=0 0686 %It could be that alignment failed because only one 0687 %sequence was left after clustering. If that is the 0688 %case, then the clustered file is just copied as 'faw' 0689 %file 0690 if any(regexp(output,'Only 1 sequence found')) 0691 movefile(tmpFile,fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'f'); 0692 else 0693 EM=['Error when performing alignment of ' missingAligned{i} ':\n' output]; 0694 dispEM(EM); 0695 end 0696 end 0697 %Remove the old tempfile 0698 if exist(tmpFile, 'file') 0699 delete([tmpFile '*']); 0700 end 0701 else 0702 %If there is only one sequence then it's not possible to do 0703 %a multiple alignment. Just print the sequence instead. An 0704 %empty file was written previously so that doesn't have to 0705 %be dealt with 0706 if numel(fastaStruct)==1 0707 warnState = warning; %Save the current warning state 0708 warning('off','Bioinfo:fastawrite:AppendToFile'); 0709 fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct); 0710 warning(warnState) %Reset warning state to previous settings 0711 end 0712 end 0713 %Move the temporary file to the real one 0714 movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f'); 0715 0716 %Print the progress every 25 files 0717 if rem(i-1,25) == 0 0718 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'aligned','*.fa')))/numel(KOModel.rxns))); 0719 progress=pad(progress,3,'left'); 0720 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0721 end 0722 end 0723 end 0724 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0725 else 0726 if seqIdentity==-1 0727 fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n'); 0728 else 0729 fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n'); 0730 end 0731 end 0732 0733 %Check if training of Hidden Markov models should be performed 0734 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]); 0735 if ~isempty(missingHMMs) 0736 fprintf('Training the KEGG Orthology specific HMMs... 0%% complete'); 0737 missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs))); 0738 %Train models for all missing KOs 0739 for i=1:numel(missingHMMs) 0740 %This is checked here because it could be that it is created by a 0741 %parallel process 0742 if ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm'])) && ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])) 0743 %Check that the aligned FASTA file exists. It could be that it 0744 %is still being worked on by some other instance of the program 0745 %(the .faw file should then exist). This should not happen on a 0746 %single computer. It doesn't throw an error, because it should 0747 %finalize the ones it can 0748 if ~isfile(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])) 0749 EM=['The aligned FASTA file for ' missingHMMs{i} ' does not exist']; 0750 dispEM(EM,false); 0751 continue; 0752 end 0753 0754 %If the multi-FASTA file is empty then save an empty aligned 0755 %file and continue 0756 s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])); 0757 if s.bytes<=0 0758 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w'); 0759 fclose(fid); 0760 continue; 0761 end 0762 %Create a temporary file to indicate that it is working on the 0763 %KO. This is because hmmbuild cannot overwrite existing files 0764 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w'); 0765 fclose(fid); 0766 0767 %Create HMM 0768 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmbuild' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']); 0769 if status~=0 0770 EM=['Error when training HMM for ' missingHMMs{i} ':\n' output]; 0771 dispEM(EM); 0772 end 0773 0774 %Delete the temporary file 0775 delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])); 0776 0777 %Print the progress every 25 files 0778 if rem(i-1,25) == 0 0779 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'hmms','*.hmm')))/numel(KOModel.rxns))); 0780 progress=pad(progress,3,'left'); 0781 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0782 end 0783 end 0784 end 0785 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0786 else 0787 fprintf('Training the KEGG Orthology specific HMMs... COMPLETE\n'); 0788 end 0789 0790 %Check which new .out files that should be generated. Check if training of 0791 %Hidden Markov models should be performed 0792 missingOUT=setdiff(KOModel.rxns,outFiles); 0793 if ~isempty(missingOUT) 0794 fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... 0%% complete'); 0795 missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT))); 0796 for i=1:numel(missingOUT) 0797 %This is checked here because it could be that it is created by a 0798 %parallel process 0799 if ~isfile(fullfile(outDir,[missingOUT{i} '.out'])) 0800 %Check that the HMM file exists. It should do so since %we are 0801 %saving empty files as well. Print a warning and continue if 0802 %not 0803 if ~isfile(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])) 0804 EM=['The HMM file for ' missingOUT{i} ' does not exist']; 0805 dispEM(EM,false); 0806 continue; 0807 end 0808 0809 %Save an empty file to prevent several threads working on the 0810 %same file 0811 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0812 fclose(fid); 0813 0814 %If the HMM file is empty then save an out file and continue 0815 s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])); 0816 if s.bytes<=0 0817 continue; 0818 end 0819 0820 %Check each gene in the input file against this model 0821 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmsearch' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']); 0822 if status~=0 0823 EM=['Error when querying HMM for ' missingOUT{i} ':\n' output]; 0824 dispEM(EM); 0825 end 0826 0827 %Save the output to a file 0828 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0829 fwrite(fid,output); 0830 fclose(fid); 0831 0832 %Print the progress every 25 files 0833 if rem(i-1,25) == 0 0834 progress=num2str(floor(100*numel(listFiles(fullfile(outDir,'*.out')))/numel(KOModel.rxns))); 0835 progress=pad(progress,3,'left'); 0836 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0837 end 0838 end 0839 end 0840 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0841 else 0842 fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... COMPLETE\n'); 0843 end 0844 0845 0846 %***Begin retrieving the output and putting together the resulting model 0847 0848 fprintf('Parsing the HMM search results... '); 0849 %Retrieve matched genes from the HMMs 0850 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes 0851 genes=cell(3000,1); 0852 %Store the best score for a gene in a hash list (since it will be searching 0853 %many times) 0854 hTable = java.util.Hashtable; 0855 0856 geneCounter=0; 0857 for i=1:numel(KOModel.rxns) 0858 if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file') 0859 fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r'); 0860 beginMatches=false; 0861 while 1 0862 %Get the next line 0863 tline = fgetl(fid); 0864 0865 %Abort at end of file 0866 if ~ischar(tline) 0867 break; 0868 end 0869 0870 if and(beginMatches,strcmp(tline,' ------ inclusion threshold ------')) 0871 break; 0872 end 0873 0874 if beginMatches==false 0875 %This is how the listing of matches begins 0876 if any(strfind(tline,'E-value ')) 0877 %Read one more line that is only padding 0878 tline = fgetl(fid); 0879 beginMatches=true; 0880 end 0881 else 0882 %If matches should be read 0883 if ~strcmp(tline,' [No hits detected that satisfy reporting thresholds]') && ~isempty(tline) 0884 elements=regexp(tline,' ','split'); 0885 elements=elements(cellfun(@any,elements)); 0886 0887 %Check if the match is below the treshhold 0888 score=str2double(elements{1}); 0889 gene=elements{9}; 0890 if score<=cutOff 0891 %If the score is exactly 0, change it to a very 0892 %small value to avoid NaN 0893 if score==0 0894 score=10^-250; 0895 end 0896 %Check if the gene is added already and, is so, get 0897 %the best score for it 0898 I=hTable.get(gene); 0899 if any(I) 0900 koGeneMat(i,I)=score; 0901 else 0902 geneCounter=geneCounter+1; 0903 %The gene was not present yet so add it 0904 hTable.put(gene,geneCounter); 0905 genes{geneCounter}=gene; 0906 koGeneMat(i,geneCounter)=score; 0907 end 0908 end 0909 else 0910 break; 0911 end 0912 end 0913 end 0914 fclose(fid); 0915 end 0916 end 0917 fprintf('COMPLETE\n'); 0918 0919 fprintf('Removing gene, KEGG Orthology associations below minScoreRatioKO, minScoreRatioG... '); 0920 koGeneMat=koGeneMat(:,1:geneCounter); 0921 0922 %Remove the genes for each KO that are below minScoreRatioKO. 0923 for i=1:size(koGeneMat,1) 0924 J=find(koGeneMat(i,:)); 0925 if any(J) 0926 koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0; 0927 end 0928 end 0929 0930 %Remove the KOs for each gene that are below minScoreRatioG 0931 for i=1:size(koGeneMat,2) 0932 J=find(koGeneMat(:,i)); 0933 if any(J) 0934 koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0; 0935 end 0936 end 0937 fprintf('COMPLETE\n'); 0938 0939 fprintf('Adding gene annotations to the model... '); 0940 %Create the new model 0941 model.genes=genes(1:geneCounter); 0942 model.grRules=cell(numel(model.rxns),1); 0943 model.grRules(:)={''}; 0944 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes)); 0945 0946 %Loop through the reactions and add the corresponding genes 0947 for i=1:numel(model.rxns) 0948 if isstruct(model.rxnMiriams{i}) 0949 %Get all KOs 0950 I=find(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology')); 0951 KOs=model.rxnMiriams{i}.value(I); 0952 %Find the KOs and the corresponding genes 0953 J=ismember(KOModel.rxns,KOs); 0954 [~, K]=find(koGeneMat(J,:)); 0955 0956 if any(K) 0957 model.rxnGeneMat(i,K)=1; 0958 %Also delete KOs for which no genes were found. If no genes at 0959 %all were matched to the reaction it will be deleted later 0960 L=sum(koGeneMat(J,:),2)==0; 0961 model.rxnMiriams{i}.value(I(L))=[]; 0962 model.rxnMiriams{i}.name(I(L))=[]; 0963 end 0964 end 0965 end 0966 fprintf('COMPLETE\n'); 0967 0968 %Find and delete all reactions without genes. This also removes genes that 0969 %are not used (which could happen because minScoreRatioG and 0970 %minScoreRatioKO). If keepSpontaneous==true, the spontaneous reactions 0971 %without genes are kept in the model. Spontaneous reactions with original 0972 %gene associations are treated in the same way, like the rest of the 0973 %reactions - if gene associations were removed during HMM search, such 0974 %reactions are deleted from the model 0975 if keepSpontaneous==true 0976 %Not the most comprise way to delete reactions without genes, but this 0977 %makes the code easier to understand. Firstly the non-spontaneous 0978 %reactions without genes are removed. After that, the second deletion 0979 %step removes spontaneous reactions, which had gene associations before 0980 %HMM search, but no longer have after it 0981 fprintf('Removing non-spontaneous reactions which after HMM search no longer have GPR rules... '); 0982 I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous); 0983 model=removeReactions(model,I,true,true); 0984 I=~any(model.rxnGeneMat,2)&ismember(model.rxns,spontRxnsWithGenes); 0985 model=removeReactions(model,I,true,true); 0986 else 0987 %Just simply check for any new reactions without genes and remove 0988 %it 0989 fprintf('Removing reactions which after HMM search no longer have GPR rules... '); 0990 I=~any(model.rxnGeneMat,2); 0991 model=removeReactions(model,I,true,true); 0992 end 0993 fprintf('COMPLETE\n'); 0994 0995 fprintf('Constructing GPR rules and finalizing the model... '); 0996 %Add the gene associations as 'or' 0997 for i=1:numel(model.rxns) 0998 %Find the involved genes 0999 I=find(model.rxnGeneMat(i,:)); 1000 if any(I) 1001 model.grRules{i}=['(' model.genes{I(1)}]; 1002 for j=2:numel(I) 1003 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 1004 end 1005 model.grRules{i}=[model.grRules{i} ')']; 1006 end 1007 end 1008 1009 %Fix grRules and reconstruct rxnGeneMat 1010 [grRules,rxnGeneMat] = standardizeGrRules(model,false); %Give detailed output 1011 model.grRules = grRules; 1012 model.rxnGeneMat = rxnGeneMat; 1013 1014 %Fix subsystems 1015 emptySubSystems=cellfun(@isempty, model.subSystems); 1016 model.subSystems(emptySubSystems)={{''}}; 1017 1018 %Add the description to the reactions 1019 for i=1:numel(model.rxns) 1020 if ~isempty(model.rxnNotes{i}) 1021 model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (using HMMs).',model.rxnNotes(i)); 1022 model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. '); 1023 else 1024 model.rxnNotes(i)={'Included by getKEGGModelForOrganism (using HMMs)'}; 1025 end 1026 end 1027 %Remove the temp fasta file 1028 delete(fastaFile) 1029 fprintf('COMPLETE\n\n*** Model reconstruction complete ***\n'); 1030 end 1031 1032 function files=listFiles(directory) 1033 %Supporter function to list the files in a directory and return them as a 1034 %cell array 1035 temp=dir(directory); 1036 files=cell(numel(temp),1); 1037 for i=1:numel(temp) 1038 files{i}=temp(i,1).name; 1039 end 1040 files=strrep(files,'.fa',''); 1041 files=strrep(files,'.hmm',''); 1042 files=strrep(files,'.out',''); 1043 files=strrep(files,'.faw',''); 1044 end