


getKEGGModelForOrganism
Reconstructs a genome-scale metabolic model based on protein homology
to the orthologies in KEGG. If the target species is not available in
KEGG, the user must select a closely related species. It is also
possible to circumvent protein homology search (see fastaFile parameter
for more details)
Input:
organismID three or four letter abbreviation of the organism
(as used in KEGG). If not available, use a closely
related species. This is used for determing the
phylogenetic distance. Use 'eukaryotes' or
'prokaryotes' to get a model for the whole domain.
Only applicable if fastaFile is empty, i.e. no
homology search should be performed
fastaFile a FASTA file that contains the protein sequences of
the organism for which to reconstruct a model (optional,
if no FASTA file is supplied then a model is
reconstructed based only on the organism
abbreviation. This option ignores all settings
except for keepSpontaneous, keepUndefinedStoich,
keepIncomplete and keepGeneral)
dataDir directory for which to retrieve the input data.
Should contain a combination of these sub-folders:
-dataDir\keggdb
The KEGG database files used in 1a (see below)
-dataDir\fasta
The multi-FASTA files generated in 1b (see
below)
-dataDir\aligned
The aligned FASTA files as generated in 2a (see
below)
-dataDir\hmms
The hidden Markov models as generated in 2b or
downloaded from BioMet Toolbox (see below)
The final directory in dataDir should be styled as
prok90_kegg116 or euk90_kegg116, indicating whether
the HMMs were trained on pro- or eukaryotic
sequences; using which sequence similarity treshold
(first set of digits); using which KEGG version
(second set of digits). (this parameter should
ALWAYS be provided)
outDir directory to save the results from the quering of
the hidden Markov models. The output is specific
for the input sequences and the settings used. It
is stored in this manner so that the function can
continue if interrupted or if it should run in
parallel. Be careful not to leave output files from
different organisms or runs with different settings
in the same folder. They will not be overwritten
(optional, default is a temporary dir where all *.out
files are deleted before and after doing the
reconstruction)
keepSpontaneous include reactions labeled as "spontaneous". (optional,
default true)
keepUndefinedStoich include reactions in the form n A <=> n+1 A. These
will be dealt with as two separate metabolites
(optional, default true)
keepIncomplete include reactions which have been labelled as
"incomplete", "erroneous" or "unclear" (optional,
default true)
keepGeneral include reactions which have been labelled as
"general reaction". These are reactions on the form
"an aldehyde <=> an alcohol", and are therefore
unsuited for modelling purposes. Note that not all
reactions have this type of annotation, and the
script will therefore not be able to remove all
such reactions (optional, default false)
cutOff significance score from HMMer needed to assign
genes to a KO (optional, default 10^-50)
minScoreRatioG a gene is only assigned to KOs for which the score
is >=log(score)/log(best score) for that gene. This
is to prevent that a gene which clearly belongs to
one KO is assigned also to KOs with much lower
scores (optional, default 0.8 (lower is less strict))
minScoreRatioKO ignore genes in a KO if their score is
<log(score)/log(best score in KO). This is to
"prune" KOs which have many genes and where some are
clearly a better fit (optional, default 0.3 (lower is
less strict))
maxPhylDist -1: only use sequences from the same domain
(Prokaryota, Eukaryota)
other (positive) value: only use sequences for
organisms where the phylogenetic distance is at the
most this large (as calculated in getPhylDist)
(optional, default Inf, which means that all sequences
will be used)
nSequences for each KO, use up to this many sequences from the
most closely related species. This is mainly to
speed up the alignment process for KOs with very
many genes. This subsampling is performed before
running CD-HIT (optional, default inf)
seqIdentity sequence identity threshold in CD-HIT, referred as
"global sequence identity" in CD-HIT User's Guide.
If -1 is provided, CD-HIT is skipped (optional, default 0.9)
globalModel structure containing both model and KOModel
structures as generated by getModelFromKEGG. These
will otherwise be loaded by via getModelFromKEGG.
Providing globalKEGGmodel can speed up model
generation if getKEGGModelForOrganism is run
multiple times for different strains. Example:
[globalModel.model,globalModel.KOModel] = getModelFromKEGG;
(optional, default empty, global model is loaded by
getModelFromKEGG)
Output:
model the reconstructed model
PLEASE READ THIS: The input to this function can be confusing, because
it is intended to be run in parallel on a cluster or in multiple
sessions. It therefore saves a lot of intermediate results to storage.
This also serves the purpose of not having to do redundant
calculations. This, however, comes with the disadvantage of somewhat
trickier handling. This is what this function does:
1a. Loads files from a local KEGG FTP dump and constructs a general
RAVEN model representing the metabolic network. The functions
getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the
data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files,
which are later merged into 'keggModel.mat' by getModelFromKEGG
function. The function getPhylDist generates 'keggPhylDist.mat'
file. KEGG FTP access requires a <a href="matlab:
web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>.
1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One
multi-FASTA file for each KO in KEGG is generated.
The Step 1 has to be re-done every time KEGG updates their database (or
rather when the updates are large enough to warrant re-running this
part). Many users would probably never use this feature.
2a. Filters KO-specific protein sets. This is done by using the
settings "maxPhylDist" and "nSequences" to control which sequences
should be used for constructing Hidden Markov models (HMMs), and
later for matching your sequences to.
The most common alternatives here would be to use sequences from
only eukaryotes, only prokaryotes or all sequences in KEGG, but you
could also play around with the parameters to use e.g. only fungal
sequences.
2b. KO-specific protein FASTA files are re-organised into
non-redundant protein sets with CD-HIT. The user can only set
seqIdentity parameter, which corresponds to '-c' parameter in
CD-HIT, described as "sequence identity threshold". CD-HIT suggsted
sequence identity specific word_length (-n) parameters are used.
2c. Does a multi sequence alignment for multi-FASTA files obtained in
Step 2b for future use. MAFFT software with automatic selection of
alignment algorithm is used in this step ('--auto').
2d. Trains hidden Markov models using HMMer for each of the aligned
KO-specific FASTA files obtained in Step 2c. This is performed with
'hmmbuild' using the default settings.
Step 2 may be reasonable to be re-done if the user wants to tweak the
settings in proteins filtering, clustering, multi sequence alignment or
HMMs training steps. However, it requires to have KO-specific protein
FASTA files obtained in Step 1a. As such files are not provided in
RAVEN and BioMet ToolBox, the user can only generate these files from
KEGG FTP dump files, so KEGG FTP license is needed.
3a. Queries the HMMs with sequences for the organism you are making a
model for. This step uses both the output from step 1a and from 2d.
This is done with 'hmmsearch' function under default settings. The
significance threshold value set in 'cutOff' parameter is used
later when parsing '*.out' files to filter out KO hits with higher
value than 'cutOff' value. The results with passable E values are
summarised into KO-gene occurence matrix with E values in
intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and
'minScoreRatioKO' are then applied to 'prune' KO-gene associations
(see the function descriptions above for more details). The
intersection values for these 'prunable' associations are converted
to zeroes.
3b. Constructs a model based on the pre-processed KO-gene association
matrix (koGeneMat). As the full KEGG model already has reaction-KO
relationships, KOs are converted into the query genes. The final
draft model contains only these reactions, which are associated
with KOs from koGeneMat. The reactions without the genes may also
be included, if the user set keepSpontaneous as 'true'.
The Step 3 is specific to the organism for which the model is
reconstructed.
In principle the function looks at which output that is already available
and runs only the parts that are required for step 3. This means
that (see the definition of the parameters for details):
-1a is only performed if there are no KEGG model files in the
RAVEN\external\kegg directory
-1b is only performed if not all required HMMs OR aligned FASTA files
OR multi-FASTA files exist in the defined dataDir. This means that this
step is skipped if the HMMs are downloaded from BioMet Toolbox instead
(see above). If not all files exist it will try to find
the KEGG database files in dataDir.
-2a is only performed if not all required HMMs OR aligned FASTA files
files exist in the defined dataDir. This means that this step is skipped
if the HMMs are downloaded from BioMet Toolbox instead (see above).
-2b is only performed if not all required HMMs exist in the defined
dataDir. This means that this step is skipped if the FASTA files or
HMMs are downloaded from BioMet Toolbox instead (see above).
-3a is performed for the required HMMs for which no corresponding .out
file exists in outDir. This is just a way to enable the function to be
run in parallel or to resume if interrupted.
-3b is always performed.
These steps are specific to the organism for which you are
reconstructing the model.
Regarding the whole pipeline, the function checks the output that is
already available and runs only the parts that are required for step 3.
This means that (see the definition of the parameters for details):
-1a is only performed if there are no KEGG model files in the
RAVEN\external\kegg directory.
-1b is only performed if any of required KOs do not have HMMs, aligned
FASTA files, clustered FASTA files and raw FASTA files in the defined
dataDir. This means that this step is skipped if the HMMs are
downloaded from BioMet Toolbox instead (see above). If not all files
exist it will try to find the KEGG database files in dataDir.
-2ab are only performed if any of required KOs do not have HMMs,
aligned FASTA files and clustered FASTA files in the defined dataDir.
This means that this step is skipped if the HMMs are downloaded from
BioMet Toolbox instead (see above).
-2c is only performed if any of required KOs do not have HMMs and
aligned FASTA files in the defined dataDir. This means that this step
is skipped if the HMMs are downloaded from BioMet Toolbox instead (see
above).
-2d is only performed if any of required KOs do not have HMMs exist in
the defined dataDir. This means that this step is skipped if the FASTA
files or HMMs are downloaded from BioMet Toolbox instead (see above).
-3a is performed for the required HMMs for which no corresponding .out
file exists in outDir. This is just a way to enable the function to be
run in parallel or to resume if interrupted.
-3b is always performed.
NOTE: it is also possible to obtain draft model from KEGG without
providing protein FASTA file for the target organism. In such case the
organism three-four letter abbreviation set as 'organismID' must exist
in the local KEGG database. In such case, the program just fetches all
the reactions, which are associated with given 'organismID'.
Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
nSequences,seqIdentity)


0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... 0002 outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... 0003 keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... 0004 nSequences,seqIdentity,globalModel) 0005 % getKEGGModelForOrganism 0006 % Reconstructs a genome-scale metabolic model based on protein homology 0007 % to the orthologies in KEGG. If the target species is not available in 0008 % KEGG, the user must select a closely related species. It is also 0009 % possible to circumvent protein homology search (see fastaFile parameter 0010 % for more details) 0011 % 0012 % Input: 0013 % organismID three or four letter abbreviation of the organism 0014 % (as used in KEGG). If not available, use a closely 0015 % related species. This is used for determing the 0016 % phylogenetic distance. Use 'eukaryotes' or 0017 % 'prokaryotes' to get a model for the whole domain. 0018 % Only applicable if fastaFile is empty, i.e. no 0019 % homology search should be performed 0020 % fastaFile a FASTA file that contains the protein sequences of 0021 % the organism for which to reconstruct a model (optional, 0022 % if no FASTA file is supplied then a model is 0023 % reconstructed based only on the organism 0024 % abbreviation. This option ignores all settings 0025 % except for keepSpontaneous, keepUndefinedStoich, 0026 % keepIncomplete and keepGeneral) 0027 % dataDir directory for which to retrieve the input data. 0028 % Should contain a combination of these sub-folders: 0029 % -dataDir\keggdb 0030 % The KEGG database files used in 1a (see below) 0031 % -dataDir\fasta 0032 % The multi-FASTA files generated in 1b (see 0033 % below) 0034 % -dataDir\aligned 0035 % The aligned FASTA files as generated in 2a (see 0036 % below) 0037 % -dataDir\hmms 0038 % The hidden Markov models as generated in 2b or 0039 % downloaded from BioMet Toolbox (see below) 0040 % The final directory in dataDir should be styled as 0041 % prok90_kegg116 or euk90_kegg116, indicating whether 0042 % the HMMs were trained on pro- or eukaryotic 0043 % sequences; using which sequence similarity treshold 0044 % (first set of digits); using which KEGG version 0045 % (second set of digits). (this parameter should 0046 % ALWAYS be provided) 0047 % outDir directory to save the results from the quering of 0048 % the hidden Markov models. The output is specific 0049 % for the input sequences and the settings used. It 0050 % is stored in this manner so that the function can 0051 % continue if interrupted or if it should run in 0052 % parallel. Be careful not to leave output files from 0053 % different organisms or runs with different settings 0054 % in the same folder. They will not be overwritten 0055 % (optional, default is a temporary dir where all *.out 0056 % files are deleted before and after doing the 0057 % reconstruction) 0058 % keepSpontaneous include reactions labeled as "spontaneous". (optional, 0059 % default true) 0060 % keepUndefinedStoich include reactions in the form n A <=> n+1 A. These 0061 % will be dealt with as two separate metabolites 0062 % (optional, default true) 0063 % keepIncomplete include reactions which have been labelled as 0064 % "incomplete", "erroneous" or "unclear" (optional, 0065 % default true) 0066 % keepGeneral include reactions which have been labelled as 0067 % "general reaction". These are reactions on the form 0068 % "an aldehyde <=> an alcohol", and are therefore 0069 % unsuited for modelling purposes. Note that not all 0070 % reactions have this type of annotation, and the 0071 % script will therefore not be able to remove all 0072 % such reactions (optional, default false) 0073 % cutOff significance score from HMMer needed to assign 0074 % genes to a KO (optional, default 10^-50) 0075 % minScoreRatioG a gene is only assigned to KOs for which the score 0076 % is >=log(score)/log(best score) for that gene. This 0077 % is to prevent that a gene which clearly belongs to 0078 % one KO is assigned also to KOs with much lower 0079 % scores (optional, default 0.8 (lower is less strict)) 0080 % minScoreRatioKO ignore genes in a KO if their score is 0081 % <log(score)/log(best score in KO). This is to 0082 % "prune" KOs which have many genes and where some are 0083 % clearly a better fit (optional, default 0.3 (lower is 0084 % less strict)) 0085 % maxPhylDist -1: only use sequences from the same domain 0086 % (Prokaryota, Eukaryota) 0087 % other (positive) value: only use sequences for 0088 % organisms where the phylogenetic distance is at the 0089 % most this large (as calculated in getPhylDist) 0090 % (optional, default Inf, which means that all sequences 0091 % will be used) 0092 % nSequences for each KO, use up to this many sequences from the 0093 % most closely related species. This is mainly to 0094 % speed up the alignment process for KOs with very 0095 % many genes. This subsampling is performed before 0096 % running CD-HIT (optional, default inf) 0097 % seqIdentity sequence identity threshold in CD-HIT, referred as 0098 % "global sequence identity" in CD-HIT User's Guide. 0099 % If -1 is provided, CD-HIT is skipped (optional, default 0.9) 0100 % globalModel structure containing both model and KOModel 0101 % structures as generated by getModelFromKEGG. These 0102 % will otherwise be loaded by via getModelFromKEGG. 0103 % Providing globalKEGGmodel can speed up model 0104 % generation if getKEGGModelForOrganism is run 0105 % multiple times for different strains. Example: 0106 % [globalModel.model,globalModel.KOModel] = getModelFromKEGG; 0107 % (optional, default empty, global model is loaded by 0108 % getModelFromKEGG) 0109 % 0110 % Output: 0111 % model the reconstructed model 0112 % 0113 % PLEASE READ THIS: The input to this function can be confusing, because 0114 % it is intended to be run in parallel on a cluster or in multiple 0115 % sessions. It therefore saves a lot of intermediate results to storage. 0116 % This also serves the purpose of not having to do redundant 0117 % calculations. This, however, comes with the disadvantage of somewhat 0118 % trickier handling. This is what this function does: 0119 % 0120 % 1a. Loads files from a local KEGG FTP dump and constructs a general 0121 % RAVEN model representing the metabolic network. The functions 0122 % getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the 0123 % data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files, 0124 % which are later merged into 'keggModel.mat' by getModelFromKEGG 0125 % function. The function getPhylDist generates 'keggPhylDist.mat' 0126 % file. KEGG FTP access requires a <a href="matlab: 0127 % web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>. 0128 % 1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One 0129 % multi-FASTA file for each KO in KEGG is generated. 0130 % 0131 % The Step 1 has to be re-done every time KEGG updates their database (or 0132 % rather when the updates are large enough to warrant re-running this 0133 % part). Many users would probably never use this feature. 0134 % 0135 % 2a. Filters KO-specific protein sets. This is done by using the 0136 % settings "maxPhylDist" and "nSequences" to control which sequences 0137 % should be used for constructing Hidden Markov models (HMMs), and 0138 % later for matching your sequences to. 0139 % The most common alternatives here would be to use sequences from 0140 % only eukaryotes, only prokaryotes or all sequences in KEGG, but you 0141 % could also play around with the parameters to use e.g. only fungal 0142 % sequences. 0143 % 2b. KO-specific protein FASTA files are re-organised into 0144 % non-redundant protein sets with CD-HIT. The user can only set 0145 % seqIdentity parameter, which corresponds to '-c' parameter in 0146 % CD-HIT, described as "sequence identity threshold". CD-HIT suggsted 0147 % sequence identity specific word_length (-n) parameters are used. 0148 % 2c. Does a multi sequence alignment for multi-FASTA files obtained in 0149 % Step 2b for future use. MAFFT software with automatic selection of 0150 % alignment algorithm is used in this step ('--auto'). 0151 % 2d. Trains hidden Markov models using HMMer for each of the aligned 0152 % KO-specific FASTA files obtained in Step 2c. This is performed with 0153 % 'hmmbuild' using the default settings. 0154 % 0155 % Step 2 may be reasonable to be re-done if the user wants to tweak the 0156 % settings in proteins filtering, clustering, multi sequence alignment or 0157 % HMMs training steps. However, it requires to have KO-specific protein 0158 % FASTA files obtained in Step 1a. As such files are not provided in 0159 % RAVEN and BioMet ToolBox, the user can only generate these files from 0160 % KEGG FTP dump files, so KEGG FTP license is needed. 0161 % 0162 % 3a. Queries the HMMs with sequences for the organism you are making a 0163 % model for. This step uses both the output from step 1a and from 2d. 0164 % This is done with 'hmmsearch' function under default settings. The 0165 % significance threshold value set in 'cutOff' parameter is used 0166 % later when parsing '*.out' files to filter out KO hits with higher 0167 % value than 'cutOff' value. The results with passable E values are 0168 % summarised into KO-gene occurence matrix with E values in 0169 % intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and 0170 % 'minScoreRatioKO' are then applied to 'prune' KO-gene associations 0171 % (see the function descriptions above for more details). The 0172 % intersection values for these 'prunable' associations are converted 0173 % to zeroes. 0174 % 3b. Constructs a model based on the pre-processed KO-gene association 0175 % matrix (koGeneMat). As the full KEGG model already has reaction-KO 0176 % relationships, KOs are converted into the query genes. The final 0177 % draft model contains only these reactions, which are associated 0178 % with KOs from koGeneMat. The reactions without the genes may also 0179 % be included, if the user set keepSpontaneous as 'true'. 0180 % 0181 % The Step 3 is specific to the organism for which the model is 0182 % reconstructed. 0183 % 0184 % In principle the function looks at which output that is already available 0185 % and runs only the parts that are required for step 3. This means 0186 % that (see the definition of the parameters for details): 0187 % -1a is only performed if there are no KEGG model files in the 0188 % RAVEN\external\kegg directory 0189 % -1b is only performed if not all required HMMs OR aligned FASTA files 0190 % OR multi-FASTA files exist in the defined dataDir. This means that this 0191 % step is skipped if the HMMs are downloaded from BioMet Toolbox instead 0192 % (see above). If not all files exist it will try to find 0193 % the KEGG database files in dataDir. 0194 % -2a is only performed if not all required HMMs OR aligned FASTA files 0195 % files exist in the defined dataDir. This means that this step is skipped 0196 % if the HMMs are downloaded from BioMet Toolbox instead (see above). 0197 % -2b is only performed if not all required HMMs exist in the defined 0198 % dataDir. This means that this step is skipped if the FASTA files or 0199 % HMMs are downloaded from BioMet Toolbox instead (see above). 0200 % -3a is performed for the required HMMs for which no corresponding .out 0201 % file exists in outDir. This is just a way to enable the function to be 0202 % run in parallel or to resume if interrupted. 0203 % -3b is always performed. 0204 % 0205 % These steps are specific to the organism for which you are 0206 % reconstructing the model. 0207 % 0208 % Regarding the whole pipeline, the function checks the output that is 0209 % already available and runs only the parts that are required for step 3. 0210 % This means that (see the definition of the parameters for details): 0211 % -1a is only performed if there are no KEGG model files in the 0212 % RAVEN\external\kegg directory. 0213 % -1b is only performed if any of required KOs do not have HMMs, aligned 0214 % FASTA files, clustered FASTA files and raw FASTA files in the defined 0215 % dataDir. This means that this step is skipped if the HMMs are 0216 % downloaded from BioMet Toolbox instead (see above). If not all files 0217 % exist it will try to find the KEGG database files in dataDir. 0218 % -2ab are only performed if any of required KOs do not have HMMs, 0219 % aligned FASTA files and clustered FASTA files in the defined dataDir. 0220 % This means that this step is skipped if the HMMs are downloaded from 0221 % BioMet Toolbox instead (see above). 0222 % -2c is only performed if any of required KOs do not have HMMs and 0223 % aligned FASTA files in the defined dataDir. This means that this step 0224 % is skipped if the HMMs are downloaded from BioMet Toolbox instead (see 0225 % above). 0226 % -2d is only performed if any of required KOs do not have HMMs exist in 0227 % the defined dataDir. This means that this step is skipped if the FASTA 0228 % files or HMMs are downloaded from BioMet Toolbox instead (see above). 0229 % -3a is performed for the required HMMs for which no corresponding .out 0230 % file exists in outDir. This is just a way to enable the function to be 0231 % run in parallel or to resume if interrupted. 0232 % -3b is always performed. 0233 % 0234 % NOTE: it is also possible to obtain draft model from KEGG without 0235 % providing protein FASTA file for the target organism. In such case the 0236 % organism three-four letter abbreviation set as 'organismID' must exist 0237 % in the local KEGG database. In such case, the program just fetches all 0238 % the reactions, which are associated with given 'organismID'. 0239 % 0240 % Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,... 0241 % outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,... 0242 % keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,... 0243 % nSequences,seqIdentity) 0244 0245 if nargin<2 || isempty(fastaFile) 0246 fastaFile=[]; 0247 else 0248 fastaFile=char(fastaFile); 0249 end 0250 if nargin<3 0251 dataDir=[]; 0252 else 0253 dataDir=char(dataDir); 0254 end 0255 if nargin<4 || isempty(outDir) 0256 outDir=tempdir; 0257 %Delete all *.out files if any exist 0258 delete(fullfile(outDir,'*.out')); 0259 else 0260 outDir=char(outDir); 0261 end 0262 if nargin<5 || isempty(keepSpontaneous) 0263 keepSpontaneous=true; 0264 end 0265 if nargin<6 || isempty(keepUndefinedStoich) 0266 keepUndefinedStoich=true; 0267 end 0268 if nargin<7 || isempty(keepIncomplete) 0269 keepIncomplete=true; 0270 end 0271 if nargin<8 || isempty(keepGeneral) 0272 keepGeneral=false; 0273 end 0274 if nargin<9 || isempty(cutOff) 0275 cutOff=10^-50; 0276 end 0277 if nargin<10 || isempty(minScoreRatioKO) 0278 minScoreRatioKO=0.3; 0279 end 0280 if nargin<11 || isempty(minScoreRatioG) 0281 minScoreRatioG=0.8; 0282 end 0283 if nargin<12 || isempty(maxPhylDist) 0284 maxPhylDist=inf; 0285 %Include all sequences for each reaction 0286 end 0287 if nargin<13 || isempty(nSequences) 0288 nSequences=inf; 0289 %Include all sequences for each reaction 0290 end 0291 if nargin<14 || isempty(seqIdentity) 0292 seqIdentity=0.9; 0293 end 0294 0295 if isempty(fastaFile) 0296 fprintf(['\n*** The model reconstruction from KEGG based on the annotation available for KEGG Species <strong>' organismID '</strong> ***\n\n']); 0297 else 0298 fprintf('\n*** The model reconstruction from KEGG based on the protein homology search against KEGG Orthology specific HMMs ***\n\n'); 0299 %Check if query fasta exists 0300 fastaFile=checkFileExistence(fastaFile,2); %Copy file to temp dir 0301 end 0302 0303 %Run the external binaries multi-threaded to use all logical cores assigned 0304 %to MATLAB 0305 cores = evalc('feature(''numcores'')'); 0306 cores = strsplit(cores, 'MATLAB was assigned: '); 0307 cores = regexp(cores{2},'^\d*','match'); 0308 cores = cores{1}; 0309 0310 %Get the directory for RAVEN Toolbox. 0311 ravenPath=findRAVENroot(); 0312 0313 %Checking if dataDir is consistent. It must point to pre-trained HMMs set, 0314 %compatible with the the current RAVEN version. The user may have the 0315 %required zip file already in working directory or have it extracted. If 0316 %the zip file and directory is not here, it is downloaded from the cloud 0317 if ~isempty(dataDir) 0318 hmmOptions={'euk90_kegg116','prok90_kegg116'}; 0319 if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions. 0320 %If not, then check whether the required folders exist anyway. 0321 if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) && ... 0322 ~isfolder(fullfile(dataDir,'fasta')) && ... 0323 ~isfolder(fullfile(dataDir,'aligned')) && ... 0324 ~isfolder(fullfile(dataDir,'hmms')) 0325 error(['Pre-trained HMMs set is not recognised. If you want download RAVEN provided sets, it should match any of the following: ' strjoin(hmmOptions,' or ')]) 0326 end 0327 else 0328 if isfolder(dataDir) && isfile(fullfile(dataDir,'hmms','K00844.hmm')) 0329 fprintf(['NOTE: Found <strong>' dataDir '</strong> directory with pre-trained HMMs, it will therefore be used during reconstruction\n']); 0330 elseif ~isfolder(dataDir) && isfile([dataDir,'.zip']) 0331 fprintf('Extracting the HMMs archive file... '); 0332 unzip([dataDir,'.zip']); 0333 fprintf('COMPLETE\n'); 0334 else 0335 hmmIndex=strcmp(dataDir,hmmOptions); 0336 if ~any(hmmIndex) 0337 error(['Pre-trained HMMs are only provided with proteins clustered at 90% sequence identity (i.e. prok90_kegg116 and euk90_kegg116). ' ... 0338 'Use either of these datasets, or otherwise download the relevant sequence data from KEGG to train HMMs with your desired sequence identity']) 0339 else 0340 fprintf('Downloading the HMMs archive file... '); 0341 try 0342 websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.11.0/',hmmOptions{hmmIndex},'.zip']); 0343 catch ME 0344 if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError') 0345 error('Failed to download the HMMs archive file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues') 0346 end 0347 end 0348 end 0349 0350 fprintf('COMPLETE\n'); 0351 fprintf('Extracting the HMMs archive file... '); 0352 unzip([dataDir,'.zip']); 0353 fprintf('COMPLETE\n'); 0354 end 0355 %Check if HMMs are extracted 0356 if ~isfile(fullfile(dataDir,'hmms','K00844.hmm')) 0357 error(['The HMM files seem improperly extracted and not found in ',dataDir,'/hmms. Please remove ',dataDir,' folder and rerun getKEGGModelForOrganism']); 0358 end 0359 end 0360 end 0361 0362 %Check if the fasta-file contains '/' or'\'. If not then it's probably just 0363 %a file name. Expand to full path. 0364 if any(fastaFile) 0365 if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/')) 0366 fastaFile=which(fastaFile); 0367 end 0368 %Create the required sub-folders in dataDir if they dont exist 0369 if ~isfolder(fullfile(dataDir,'keggdb')) 0370 mkdir(dataDir,'keggdb'); 0371 end 0372 if ~isfolder(fullfile(dataDir,'fasta')) 0373 mkdir(dataDir,'fasta'); 0374 end 0375 if ~isfolder(fullfile(dataDir,'aligned')) 0376 mkdir(dataDir,'aligned'); 0377 end 0378 if ~isfolder(fullfile(dataDir,'hmms')) 0379 mkdir(dataDir,'hmms'); 0380 end 0381 if ~isfolder(outDir) 0382 mkdir(outDir); 0383 end 0384 end 0385 0386 %First generate the full global KEGG model. Can be provided as input. 0387 %Otherwise, getModelFromKEGG is run. The dataDir must not be supplied as 0388 %there is also an internal RAVEN version available 0389 if nargin==15 0390 model=globalModel.model; 0391 KOModel=globalModel.KOModel; 0392 elseif any(dataDir) 0393 [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral); 0394 else 0395 [model, KOModel]=getModelFromKEGG([],keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral); 0396 end 0397 model.id=organismID; 0398 model.c=zeros(numel(model.rxns),1); 0399 0400 %If no FASTA file is supplied, then just remove all genes which are not for 0401 %the given organism ID 0402 if isempty(fastaFile) 0403 %Check if organismID can be found in KEGG species list or is 0404 %set to "eukaryotes" or "prokaryotes" 0405 phylDistsFull=getPhylDist(fullfile(dataDir,'keggdb'),true); 0406 if ~ismember(organismID,[phylDistsFull.ids 'eukaryotes' 'prokaryotes']) 0407 error('Provided organismID is incorrect. Only species abbreviations from KEGG Species List or "eukaryotes"/"prokaryotes" are allowed.'); 0408 end 0409 0410 fprintf(['Pruning the model from <strong>non-' organismID '</strong> genes... ']); 0411 if ismember(organismID,{'eukaryotes','prokaryotes'}) 0412 phylDists=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0413 if strcmp(organismID,'eukaryotes') 0414 proxyid='hsa'; 0415 %Use H. sapiens here 0416 else 0417 proxyid='eco'; 0418 %Use E. coli here 0419 end 0420 [~, phylDistId]=ismember(proxyid,phylDists.ids); 0421 idsToKeep=phylDists.ids(~isinf(phylDists.distMat(phylDistId,:))); 0422 taxIDs=cellfun(@(x) x{1},cellfun(@(x) strsplit(x,':'),model.genes,'UniformOutput',false),'UniformOutput',false); 0423 I=ismember(upper(taxIDs),upper(idsToKeep)); 0424 else 0425 %KEGG organism IDs may have three or four letters 0426 organismID=strcat(organismID,':'); 0427 %Add colon for accurate matching 0428 if length(organismID)==4 0429 I=cellfun(@(x) strcmpi(x(1:4),organismID),model.genes); 0430 elseif length(organismID)==5 0431 I=cellfun(@(x) strcmpi(x(1:5),organismID),model.genes); 0432 end 0433 end 0434 %Remove those genes 0435 model.genes=model.genes(I); 0436 model.rxnGeneMat=model.rxnGeneMat(:,I); 0437 fprintf('COMPLETE\n'); 0438 end 0439 0440 %First remove all reactions without genes 0441 if keepSpontaneous==true 0442 fprintf('Removing non-spontaneous reactions without GPR rules... '); 0443 load(fullfile(ravenPath,'external','kegg','keggRxns.mat'),'isSpontaneous'); 0444 I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous); 0445 spontRxnsWithGenes=model.rxns(any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous)); 0446 else 0447 fprintf('Removing reactions without GPR rules... '); 0448 I=~any(model.rxnGeneMat,2); 0449 end 0450 model=removeReactions(model,I,true); 0451 fprintf('COMPLETE\n'); 0452 0453 %Clean gene names 0454 fprintf('Fixing gene names in the model... '); 0455 %Get rid of the prefix organism id 0456 model.genes=regexprep(model.genes,'^\w+?:',''); 0457 fprintf('COMPLETE\n'); 0458 0459 %If no FASTA file is supplied, then we are done here 0460 if isempty(fastaFile) 0461 %Create grRules 0462 fprintf('Constructing GPR associations and annotations for the model... '); 0463 model.grRules=cell(numel(model.rxns),1); 0464 model.grRules(:)={''}; 0465 %Add the gene associations as 'or' 0466 for i=1:numel(model.rxns) 0467 %Find the involved genes 0468 I=find(model.rxnGeneMat(i,:)); 0469 if any(I) 0470 model.grRules{i}=['(' model.genes{I(1)}]; 0471 for j=2:numel(I) 0472 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 0473 end 0474 model.grRules{i}=[model.grRules{i} ')']; 0475 end 0476 end 0477 %Fix grRules and reconstruct rxnGeneMat 0478 [grRules,rxnGeneMat] = standardizeGrRules(model); %Give detailed output 0479 model.grRules = grRules; 0480 model.rxnGeneMat = rxnGeneMat; 0481 %Add geneMiriams, assuming that it follows the syntax 0482 %kegg.genes/organismID:geneName 0483 model.geneMiriams=''; 0484 for i=1:numel(model.genes) 0485 model.geneMiriams{i,1}.name{1,1}='kegg.genes'; 0486 model.geneMiriams{i,1}.value{1,1}=strcat(lower(organismID),model.genes{i,1}); 0487 end 0488 %Add the description to the reactions 0489 for i=1:numel(model.rxns) 0490 if ~isempty(model.rxnNotes{i}) 0491 model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (without HMMs).',model.rxnNotes(i)); 0492 model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. '); 0493 else 0494 model.rxnNotes(i)={'Included by getKEGGModelForOrganism (without HMMs)'}; 0495 end 0496 end 0497 fprintf('COMPLETE\n\n'); 0498 fprintf('*** Model reconstruction complete ***\n'); 0499 return; 0500 end 0501 0502 %Create a phylogenetic distance structure 0503 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0504 [~, phylDistId]=ismember(model.id,phylDistStruct.ids); 0505 0506 %Calculate the real maximal distance now. An abitary large number of 1000 0507 %is used for the "all in kingdom" or "all sequences" options. This is a bit 0508 %inconvenient way to do it, but it is to make it fit with some older code 0509 if isinf(maxPhylDist) || maxPhylDist==-1 0510 maxPhylDist=1000; 0511 end 0512 0513 %Get the KO ids for which files have been generated. Maybe not the neatest 0514 %way.. 0515 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa')); 0516 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa')); 0517 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw')); 0518 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm')); 0519 outFiles=listFiles(fullfile(outDir,'*.out')); 0520 0521 %Check if multi-FASTA files should be generated. This should only be 0522 %performed if there are IDs in the KOModel structure that haven't been 0523 %parsed yet 0524 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]); 0525 0526 if ~isempty(missingFASTA) 0527 if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) 0528 EM=['The file ''genes.pep'' cannot be located at ' strrep(dataDir,'\','/') '/ and should be downloaded from the KEGG FTP.\n']; 0529 dispEM(EM); 0530 end 0531 %Only construct models for KOs which don't have files already 0532 fastaModel=removeReactions(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true); 0533 %Permute the order of the KOs in the model so that constructMultiFasta 0534 %can be run on several processors at once 0535 fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns'); 0536 constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta')); 0537 else 0538 fprintf('Generating the KEGG Orthology specific multi-FASTA files... COMPLETE\n'); 0539 end 0540 0541 if isunix 0542 if ismac 0543 binEnd='.mac'; 0544 else 0545 binEnd=''; 0546 end 0547 elseif ispc 0548 binEnd=''; 0549 else 0550 EM='Unknown OS, exiting.'; 0551 disp(EM); 0552 return 0553 end 0554 0555 tmpFile=tempname; 0556 %On Windows, paths need to be translated to Unix before parsing it to WSL 0557 if ispc 0558 wslPath.tmpFile=getWSLpath(tmpFile); 0559 %mafft has problems writing to terminal (/dev/stderr) when running 0560 %on WSL via MATLAB, instead write and read progress file 0561 mafftOutput = tempname; 0562 wslPath.mafftOutput=getWSLpath(mafftOutput); 0563 wslPath.mafft=getWSLpath(fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat')); 0564 wslPath.hmmbuild=getWSLpath(fullfile(ravenPath,'software','hmmer','hmmbuild')); 0565 wslPath.hmmsearch=getWSLpath(fullfile(ravenPath,'software','hmmer','hmmsearch')); 0566 wslPath.cdhit=getWSLpath(fullfile(ravenPath,'software','cd-hit','cd-hit')); 0567 end 0568 0569 %Check if alignment of FASTA files should be performed 0570 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]); 0571 if ~isempty(missingAligned) 0572 if seqIdentity==-1 0573 fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... 0%% complete'); 0574 else 0575 fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... 0%% complete'); 0576 end 0577 missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned))); 0578 0579 for i=1:numel(missingAligned) 0580 %This is checked here because it could be that it is created by a 0581 %parallel process. The faw-files are saved as temporary files to 0582 %kept track of which files are being worked on 0583 if ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])) &&... 0584 ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.fa'])) 0585 %Check that the multi-FASTA file exists. It should do so since 0586 %we are saving empty files as well. Print a warning and 0587 %continue if not 0588 if ~isfile(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])) 0589 EM=['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist']; 0590 dispEM(EM,false); 0591 continue; 0592 end 0593 0594 %If the multi-FASTA file is empty then save an empty aligned 0595 %file and continue 0596 s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0597 if s.bytes<=0 0598 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w'); 0599 fclose(fid); 0600 continue; 0601 end 0602 0603 %Create an empty file to prevent other threads to start to work 0604 %on the same alignment 0605 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w'); 0606 fclose(fid); 0607 0608 %First load the FASTA file, then select up to nSequences 0609 %sequences of the most closely related species, apply any 0610 %constraints from maxPhylDist, and save it as a temporary file, 0611 %and create the model from that 0612 0613 fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0614 phylDist=inf(numel(fastaStruct),1); 0615 for j=1:numel(fastaStruct) 0616 %Get the organism abbreviation 0617 index=strfind(fastaStruct(j).Header,':'); 0618 if any(index) 0619 abbrev=fastaStruct(j).Header(1:index(1)-1); 0620 [~, index]=ismember(abbrev,phylDistStruct.ids); 0621 if any(index) 0622 phylDist(j)=phylDistStruct.distMat(index(1),phylDistId); 0623 end 0624 end 0625 end 0626 0627 %Inf means that it should not be included 0628 phylDist(phylDist>maxPhylDist)=[]; 0629 0630 %Sort based on phylDist 0631 [~, order]=sort(phylDist); 0632 0633 %Save the first nSequences hits to a temporary FASTA file 0634 if nSequences<=numel(fastaStruct) 0635 fastaStruct=fastaStruct(order(1:nSequences)); 0636 else 0637 fastaStruct=fastaStruct(order); 0638 end 0639 0640 %Do the clustering and alignment if there are more than one 0641 %sequences, otherwise just save the sequence (or an empty file) 0642 if numel(fastaStruct)>1 0643 if seqIdentity~=-1 0644 cdhitInpCustom=tempname; 0645 fastawrite(cdhitInpCustom,fastaStruct); 0646 if seqIdentity<=1 && seqIdentity>0.7 0647 nparam='5'; 0648 elseif seqIdentity>0.6 0649 nparam='4'; 0650 elseif seqIdentity>0.5 0651 nparam='3'; 0652 elseif seqIdentity>0.4 0653 nparam='2'; 0654 else 0655 EM='The provided seqIdentity must be between 0 and 1\n'; 0656 dispEM(EM); 0657 end 0658 if ispc 0659 wslPath.cdhitInpCustom=getWSLpath(cdhitInpCustom); 0660 [status, output]=system(['wsl "' wslPath.cdhit '" -T "' num2str(cores) '" -i "' wslPath.cdhitInpCustom '" -o "' wslPath.tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']); 0661 elseif ismac || isunix 0662 [status, output]=system(['"' fullfile(ravenPath,'software','cd-hit',['cd-hit' binEnd]) '" -T "' num2str(cores) '" -i "' cdhitInpCustom '" -o "' tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']); 0663 end 0664 if status~=0 0665 EM=['Error when performing clustering of ' missingAligned{i} ':\n' output]; 0666 dispEM(EM); 0667 end 0668 %Remove the old tempfile 0669 if exist(cdhitInpCustom, 'file') 0670 delete([cdhitInpCustom '*']); 0671 end 0672 else 0673 %This means that CD-HIT should be skipped since 0674 %seqIdentity is equal to -1 0675 fastawrite(tmpFile,fastaStruct); 0676 end 0677 %Do the alignment for this file 0678 if ismac 0679 [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-mac','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']); 0680 elseif isunix 0681 [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']); 0682 elseif ispc 0683 wslPath.fawFile=getWSLpath(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])); 0684 [status, ~]=system(['wsl "' wslPath.mafft '" --auto --anysymbol --progress "' wslPath.mafftOutput '" --thread "' num2str(cores) '" --out "' wslPath.fawFile '" "' wslPath.tmpFile '"']); 0685 output=fileread(mafftOutput); 0686 delete(mafftOutput); 0687 end 0688 if status~=0 0689 %It could be that alignment failed because only one 0690 %sequence was left after clustering. If that is the 0691 %case, then the clustered file is just copied as 'faw' 0692 %file 0693 if any(regexp(output,'Only 1 sequence found')) 0694 movefile(tmpFile,fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'f'); 0695 else 0696 EM=['Error when performing alignment of ' missingAligned{i} ':\n' output]; 0697 dispEM(EM); 0698 end 0699 end 0700 %Remove the old tempfile 0701 if exist(tmpFile, 'file') 0702 delete([tmpFile '*']); 0703 end 0704 else 0705 %If there is only one sequence then it's not possible to do 0706 %a multiple alignment. Just print the sequence instead. An 0707 %empty file was written previously so that doesn't have to 0708 %be dealt with 0709 if numel(fastaStruct)==1 0710 warnState = warning; %Save the current warning state 0711 warning('off','Bioinfo:fastawrite:AppendToFile'); 0712 fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct); 0713 warning(warnState) %Reset warning state to previous settings 0714 end 0715 end 0716 %Move the temporary file to the real one 0717 movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f'); 0718 0719 %Print the progress every 25 files 0720 if rem(i-1,25) == 0 0721 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'aligned','*.fa')))/numel(KOModel.rxns))); 0722 progress=pad(progress,3,'left'); 0723 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0724 end 0725 end 0726 end 0727 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0728 else 0729 if seqIdentity==-1 0730 fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n'); 0731 else 0732 fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n'); 0733 end 0734 end 0735 0736 %Check if training of Hidden Markov models should be performed 0737 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]); 0738 if ~isempty(missingHMMs) 0739 fprintf('Training the KEGG Orthology specific HMMs... 0%% complete'); 0740 missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs))); 0741 %Train models for all missing KOs 0742 for i=1:numel(missingHMMs) 0743 %This is checked here because it could be that it is created by a 0744 %parallel process 0745 if ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm'])) && ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])) 0746 %Check that the aligned FASTA file exists. It could be that it 0747 %is still being worked on by some other instance of the program 0748 %(the .faw file should then exist). This should not happen on a 0749 %single computer. It doesn't throw an error, because it should 0750 %finalize the ones it can 0751 if ~isfile(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])) 0752 EM=['The aligned FASTA file for ' missingHMMs{i} ' does not exist']; 0753 dispEM(EM,false); 0754 continue; 0755 end 0756 0757 %If the multi-FASTA file is empty then save an empty aligned 0758 %file and continue 0759 s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])); 0760 if s.bytes<=0 0761 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w'); 0762 fclose(fid); 0763 continue; 0764 end 0765 %Create a temporary file to indicate that it is working on the 0766 %KO. This is because hmmbuild cannot overwrite existing files 0767 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w'); 0768 fclose(fid); 0769 0770 %Create HMM 0771 if ismac || isunix 0772 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmbuild' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']); 0773 else 0774 wslPath.hmmFile = getWSLpath(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm'])); 0775 wslPath.alignFile = getWSLpath(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])); 0776 [status, output] = system(['wsl "' wslPath.hmmbuild '" --cpu "' num2str(cores) '" "' wslPath.hmmFile '" "' wslPath.alignFile '"']); 0777 end 0778 if status~=0 0779 EM=['Error when training HMM for ' missingHMMs{i} ':\n' output]; 0780 dispEM(EM); 0781 end 0782 0783 %Delete the temporary file 0784 delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])); 0785 0786 %Print the progress every 25 files 0787 if rem(i-1,25) == 0 0788 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'hmms','*.hmm')))/numel(KOModel.rxns))); 0789 progress=pad(progress,3,'left'); 0790 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0791 end 0792 end 0793 end 0794 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0795 else 0796 fprintf('Training the KEGG Orthology specific HMMs... COMPLETE\n'); 0797 end 0798 0799 %Check which new .out files that should be generated. Check if training of 0800 %Hidden Markov models should be performed 0801 missingOUT=setdiff(KOModel.rxns,outFiles); 0802 if ~isempty(missingOUT) 0803 fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... 0%% complete'); 0804 missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT))); 0805 for i=1:numel(missingOUT) 0806 %This is checked here because it could be that it is created by a 0807 %parallel process 0808 if ~isfile(fullfile(outDir,[missingOUT{i} '.out'])) 0809 %Check that the HMM file exists. It should do so since %we are 0810 %saving empty files as well. Print a warning and continue if 0811 %not 0812 if ~isfile(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])) 0813 EM=['The HMM file for ' missingOUT{i} ' does not exist']; 0814 dispEM(EM,false); 0815 continue; 0816 end 0817 0818 %Save an empty file to prevent several threads working on the 0819 %same file 0820 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0821 fclose(fid); 0822 0823 %If the HMM file is empty then save an out file and continue 0824 s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])); 0825 if s.bytes<=0 0826 continue; 0827 end 0828 0829 %Check each gene in the input file against this model 0830 if ismac || isunix 0831 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmsearch' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']); 0832 else 0833 wslPath.hmmFile = getWSLpath(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])); 0834 wslPath.fastaFile = getWSLpath(fastaFile); 0835 [status, output]=system(['wsl "' wslPath.hmmsearch '" --cpu "' num2str(cores) '" "' wslPath.hmmFile '" "' wslPath.fastaFile '"']); 0836 end 0837 if status~=0 0838 EM=['Error when querying HMM for ' missingOUT{i} ':\n' output]; 0839 dispEM(EM); 0840 end 0841 0842 %Save the output to a file 0843 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0844 fwrite(fid,output); 0845 fclose(fid); 0846 0847 %Print the progress every 25 files 0848 if rem(i-1,25) == 0 0849 progress=num2str(floor(100*numel(listFiles(fullfile(outDir,'*.out')))/numel(KOModel.rxns))); 0850 progress=pad(progress,3,'left'); 0851 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress); 0852 end 0853 end 0854 end 0855 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n'); 0856 else 0857 fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... COMPLETE\n'); 0858 end 0859 0860 0861 %***Begin retrieving the output and putting together the resulting model 0862 0863 fprintf('Parsing the HMM search results... '); 0864 %Retrieve matched genes from the HMMs 0865 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes 0866 genes=cell(3000,1); 0867 %Store the best score for a gene in a hash list (since it will be searching 0868 %many times) 0869 hTable = java.util.Hashtable; 0870 0871 geneCounter=0; 0872 for i=1:numel(KOModel.rxns) 0873 if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file') 0874 fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r'); 0875 beginMatches=false; 0876 while 1 0877 %Get the next line 0878 tline = fgetl(fid); 0879 0880 %Abort at end of file 0881 if ~ischar(tline) 0882 break; 0883 end 0884 0885 if and(beginMatches,strcmp(tline,' ------ inclusion threshold ------')) 0886 break; 0887 end 0888 0889 if beginMatches==false 0890 %This is how the listing of matches begins 0891 if any(strfind(tline,'E-value ')) 0892 %Read one more line that is only padding 0893 tline = fgetl(fid); 0894 beginMatches=true; 0895 end 0896 else 0897 %If matches should be read 0898 if ~strcmp(tline,' [No hits detected that satisfy reporting thresholds]') && ~isempty(tline) 0899 elements=regexp(tline,' ','split'); 0900 elements=elements(cellfun(@any,elements)); 0901 0902 %Check if the match is below the treshhold 0903 score=str2double(elements{1}); 0904 gene=elements{9}; 0905 if score<=cutOff 0906 %If the score is exactly 0, change it to a very 0907 %small value to avoid NaN 0908 if score==0 0909 score=10^-250; 0910 end 0911 %Check if the gene is added already and, is so, get 0912 %the best score for it 0913 I=hTable.get(gene); 0914 if any(I) 0915 koGeneMat(i,I)=score; 0916 else 0917 geneCounter=geneCounter+1; 0918 %The gene was not present yet so add it 0919 hTable.put(gene,geneCounter); 0920 genes{geneCounter}=gene; 0921 koGeneMat(i,geneCounter)=score; 0922 end 0923 end 0924 else 0925 break; 0926 end 0927 end 0928 end 0929 fclose(fid); 0930 end 0931 end 0932 fprintf('COMPLETE\n'); 0933 0934 fprintf('Removing gene, KEGG Orthology associations below minScoreRatioKO, minScoreRatioG... '); 0935 koGeneMat=koGeneMat(:,1:geneCounter); 0936 0937 %Remove the genes for each KO that are below minScoreRatioKO. 0938 for i=1:size(koGeneMat,1) 0939 J=find(koGeneMat(i,:)); 0940 if any(J) 0941 koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0; 0942 end 0943 end 0944 0945 %Remove the KOs for each gene that are below minScoreRatioG 0946 for i=1:size(koGeneMat,2) 0947 J=find(koGeneMat(:,i)); 0948 if any(J) 0949 koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0; 0950 end 0951 end 0952 fprintf('COMPLETE\n'); 0953 0954 fprintf('Adding gene annotations to the model... '); 0955 %Create the new model 0956 model.genes=genes(1:geneCounter); 0957 model.grRules=cell(numel(model.rxns),1); 0958 model.grRules(:)={''}; 0959 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes)); 0960 0961 %Loop through the reactions and add the corresponding genes 0962 for i=1:numel(model.rxns) 0963 if isstruct(model.rxnMiriams{i}) 0964 %Get all KOs 0965 I=find(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology')); 0966 KOs=model.rxnMiriams{i}.value(I); 0967 %Find the KOs and the corresponding genes 0968 J=ismember(KOModel.rxns,KOs); 0969 [~, K]=find(koGeneMat(J,:)); 0970 0971 if any(K) 0972 model.rxnGeneMat(i,K)=1; 0973 %Also delete KOs for which no genes were found. If no genes at 0974 %all were matched to the reaction it will be deleted later 0975 L=sum(koGeneMat(J,:),2)==0; 0976 model.rxnMiriams{i}.value(I(L))=[]; 0977 model.rxnMiriams{i}.name(I(L))=[]; 0978 end 0979 end 0980 end 0981 fprintf('COMPLETE\n'); 0982 0983 %Find and delete all reactions without genes. This also removes genes that 0984 %are not used (which could happen because minScoreRatioG and 0985 %minScoreRatioKO). If keepSpontaneous==true, the spontaneous reactions 0986 %without genes are kept in the model. Spontaneous reactions with original 0987 %gene associations are treated in the same way, like the rest of the 0988 %reactions - if gene associations were removed during HMM search, such 0989 %reactions are deleted from the model 0990 if keepSpontaneous==true 0991 %Not the most comprise way to delete reactions without genes, but this 0992 %makes the code easier to understand. Firstly the non-spontaneous 0993 %reactions without genes are removed. After that, the second deletion 0994 %step removes spontaneous reactions, which had gene associations before 0995 %HMM search, but no longer have after it 0996 fprintf('Removing non-spontaneous reactions which after HMM search no longer have GPR rules... '); 0997 I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous); 0998 model=removeReactions(model,I,true,true); 0999 I=~any(model.rxnGeneMat,2)&ismember(model.rxns,spontRxnsWithGenes); 1000 model=removeReactions(model,I,true,true); 1001 else 1002 %Just simply check for any new reactions without genes and remove 1003 %it 1004 fprintf('Removing reactions which after HMM search no longer have GPR rules... '); 1005 I=~any(model.rxnGeneMat,2); 1006 model=removeReactions(model,I,true,true); 1007 end 1008 fprintf('COMPLETE\n'); 1009 1010 fprintf('Constructing GPR rules and finalizing the model... '); 1011 %Add the gene associations as 'or' 1012 for i=1:numel(model.rxns) 1013 %Find the involved genes 1014 I=find(model.rxnGeneMat(i,:)); 1015 if any(I) 1016 model.grRules{i}=['(' model.genes{I(1)}]; 1017 for j=2:numel(I) 1018 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 1019 end 1020 model.grRules{i}=[model.grRules{i} ')']; 1021 end 1022 end 1023 1024 %Fix grRules and reconstruct rxnGeneMat 1025 [grRules,rxnGeneMat] = standardizeGrRules(model,false); %Give detailed output 1026 model.grRules = grRules; 1027 model.rxnGeneMat = rxnGeneMat; 1028 1029 %Add the description to the reactions 1030 for i=1:numel(model.rxns) 1031 if ~isempty(model.rxnNotes{i}) 1032 model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (using HMMs).',model.rxnNotes(i)); 1033 model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. '); 1034 else 1035 model.rxnNotes(i)={'Included by getKEGGModelForOrganism (using HMMs)'}; 1036 end 1037 end 1038 %Remove the temp fasta file 1039 delete(fastaFile) 1040 fprintf('COMPLETE\n\n*** Model reconstruction complete ***\n'); 1041 end 1042 1043 function files=listFiles(directory) 1044 %Supporter function to list the files in a directory and return them as a 1045 %cell array 1046 temp=dir(directory); 1047 files=cell(numel(temp),1); 1048 for i=1:numel(temp) 1049 files{i}=temp(i,1).name; 1050 end 1051 files=strrep(files,'.fa',''); 1052 files=strrep(files,'.hmm',''); 1053 files=strrep(files,'.out',''); 1054 files=strrep(files,'.faw',''); 1055 end