Home > external > kegg > getKEGGModelForOrganism.m

getKEGGModelForOrganism

PURPOSE ^

getKEGGModelForOrganism

SYNOPSIS ^

function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,nSequences,seqIdentity,globalModel)

DESCRIPTION ^

 getKEGGModelForOrganism
   Reconstructs a genome-scale metabolic model based on protein homology
   to the orthologies in KEGG. If the target species is not available in
   KEGG, the user must select a closely related species. It is also
   possible to circumvent protein homology search (see fastaFile parameter
   for more details)

   Input:
   organismID          three or four letter abbreviation of the organism
                       (as used in KEGG). If not available, use a closely
                       related species. This is used for determing the
                       phylogenetic distance. Use 'eukaryotes' or
                       'prokaryotes' to get a model for the whole domain.
                       Only applicable if fastaFile is empty, i.e. no
                       homology search should be performed
   fastaFile           a FASTA file that contains the protein sequences of
                       the organism for which to reconstruct a model (optional,
                       if no FASTA file is supplied then a model is
                       reconstructed based only on the organism
                       abbreviation. This option ignores all settings
                       except for keepSpontaneous, keepUndefinedStoich,
                       keepIncomplete and keepGeneral)
   dataDir             directory for which to retrieve the input data.
                       Should contain a combination of these sub-folders:
                       -dataDir\keggdb
                           The KEGG database files used in 1a (see below)
                       -dataDir\fasta
                           The multi-FASTA files generated in 1b (see
                           below)
                       -dataDir\aligned
                           The aligned FASTA files as generated in 2a (see
                           below)
                       -dataDir\hmms
                           The hidden Markov models as generated in 2b or
                           downloaded from BioMet Toolbox (see below)
                       The final directory in dataDir should be styled as
                       prok90_kegg116 or euk90_kegg116, indicating whether
                       the HMMs were trained on pro- or eukaryotic
                       sequences; using which sequence similarity treshold
                       (first set of digits); using which KEGG version
                       (second set of digits). (this parameter should
                       ALWAYS be provided)
   outDir              directory to save the results from the quering of
                       the hidden Markov models. The output is specific
                       for the input sequences and the settings used. It
                       is stored in this manner so that the function can
                       continue if interrupted or if it should run in
                       parallel. Be careful not to leave output files from
                       different organisms or runs with different settings
                       in the same folder. They will not be overwritten
                       (optional, default is a temporary dir where all *.out
                       files are deleted before and after doing the
                       reconstruction)
   keepSpontaneous     include reactions labeled as "spontaneous". (optional,
                       default true)
   keepUndefinedStoich    include reactions in the form n A <=> n+1 A. These
                       will be dealt with as two separate metabolites
                       (optional, default true)
   keepIncomplete      include reactions which have been labelled as
                       "incomplete", "erroneous" or "unclear" (optional,
                       default true)
   keepGeneral         include reactions which have been labelled as
                       "general reaction". These are reactions on the form
                       "an aldehyde <=> an alcohol", and are therefore
                       unsuited for modelling purposes. Note that not all
                       reactions have this type of annotation, and the
                       script will therefore not be able to remove all
                       such reactions (optional, default false)
   cutOff              significance score from HMMer needed to assign
                       genes to a KO (optional, default 10^-50)
   minScoreRatioG      a gene is only assigned to KOs for which the score
                       is >=log(score)/log(best score) for that gene. This
                       is to prevent that a gene which clearly belongs to
                       one KO is assigned also to KOs with much lower
                       scores (optional, default 0.8 (lower is less strict))
   minScoreRatioKO     ignore genes in a KO if their score is
                       <log(score)/log(best score in KO). This is to
                       "prune" KOs which have many genes and where some are
                       clearly a better fit (optional, default 0.3 (lower is
                       less strict))
   maxPhylDist         -1: only use sequences from the same domain
                       (Prokaryota, Eukaryota)
                       other (positive) value: only use sequences for
                       organisms where the phylogenetic distance is at the
                       most this large (as calculated in getPhylDist)
                       (optional, default Inf, which means that all sequences
                       will be used)
   nSequences          for each KO, use up to this many sequences from the
                       most closely related species. This is mainly to
                       speed up the alignment process for KOs with very
                       many genes. This subsampling is performed before
                       running CD-HIT (optional, default inf)
   seqIdentity         sequence identity threshold in CD-HIT, referred as
                       "global sequence identity" in CD-HIT User's Guide.
                       If -1 is provided, CD-HIT is skipped (optional, default 0.9)
   globalModel         structure containing both model and KOModel
                       structures as generated by getModelFromKEGG. These
                       will otherwise be loaded by via getModelFromKEGG.
                       Providing globalKEGGmodel can speed up model
                       generation if getKEGGModelForOrganism is run
                       multiple times for different strains. Example:
                       [globalModel.model,globalModel.KOModel] = getModelFromKEGG;
                       (optional, default empty, global model is loaded by
                       getModelFromKEGG)

   Output:
   model               the reconstructed model

   PLEASE READ THIS: The input to this function can be confusing, because
   it is intended to be run in parallel on a cluster or in multiple
   sessions. It therefore saves a lot of intermediate results to storage.
   This also serves the purpose of not having to do redundant
   calculations. This, however, comes with the disadvantage of somewhat
   trickier handling. This is what this function does:

   1a. Loads files from a local KEGG FTP dump and constructs a general
       RAVEN model representing the metabolic network. The functions
       getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the
       data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files,
       which are later merged into 'keggModel.mat' by getModelFromKEGG
       function. The function getPhylDist generates 'keggPhylDist.mat'
       file. KEGG FTP access requires a <a href="matlab:
       web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>.
   1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One
       multi-FASTA file for each KO in KEGG is generated.

   The Step 1 has to be re-done every time KEGG updates their database (or
   rather when the updates are large enough to warrant re-running this
   part). Many users would probably never use this feature.

   2a. Filters KO-specific protein sets. This is done by using the
       settings "maxPhylDist" and "nSequences" to control which sequences
       should be used for constructing Hidden Markov models (HMMs), and
       later for matching your sequences to.
       The most common alternatives here would be to use sequences from
       only eukaryotes, only prokaryotes or all sequences in KEGG, but you
       could also play around with the parameters to use e.g. only fungal
       sequences.
   2b. KO-specific protein FASTA files are re-organised into
       non-redundant protein sets with CD-HIT. The user can only set
       seqIdentity parameter, which corresponds to '-c' parameter in
       CD-HIT, described as "sequence identity threshold". CD-HIT suggsted
       sequence identity specific word_length (-n) parameters are used.
   2c. Does a multi sequence alignment for multi-FASTA files obtained in
       Step 2b for future use. MAFFT software with automatic selection of
       alignment algorithm is used in this step ('--auto').
   2d. Trains hidden Markov models using HMMer for each of the aligned
       KO-specific FASTA files obtained in Step 2c. This is performed with
       'hmmbuild' using the default settings.

   Step 2 may be reasonable to be re-done if the user wants to tweak the
   settings in proteins filtering, clustering, multi sequence alignment or
   HMMs training steps. However, it requires to have KO-specific protein
   FASTA files obtained in Step 1a. As such files are not provided in
   RAVEN and BioMet ToolBox, the user can only generate these files from
   KEGG FTP dump files, so KEGG FTP license is needed.

   3a. Queries the HMMs with sequences for the organism you are making a
       model for. This step uses both the output from step 1a and from 2d.
       This is done with 'hmmsearch' function under default settings. The
       significance threshold value set in 'cutOff' parameter is used
       later when parsing '*.out' files to filter out KO hits with higher
       value than 'cutOff' value. The results with passable E values are
       summarised into KO-gene occurence matrix with E values in
       intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and
       'minScoreRatioKO' are then applied to 'prune' KO-gene associations
       (see the function descriptions above for more details). The
       intersection values for these 'prunable' associations are converted
       to zeroes.
   3b. Constructs a model based on the pre-processed KO-gene association
       matrix (koGeneMat). As the full KEGG model already has reaction-KO
       relationships, KOs are converted into the query genes. The final
       draft model contains only these reactions, which are associated
       with KOs from koGeneMat. The reactions without the genes may also
       be included, if the user set keepSpontaneous as 'true'.

   The Step 3 is specific to the organism for which the model is
   reconstructed.

   In principle the function looks at which output that is already available
   and runs only the parts that are required for step 3. This means
   that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the
   RAVEN\external\kegg directory
   -1b is only performed if not all required HMMs OR aligned FASTA files
   OR multi-FASTA files exist in the defined dataDir. This means that this
   step is skipped if the HMMs are downloaded from BioMet Toolbox instead
   (see above). If not all files exist it will try to find
   the KEGG database files in dataDir.
   -2a is only performed if not all required HMMs OR aligned FASTA files
   files exist in the defined dataDir. This means that this step is skipped
   if the HMMs are downloaded from BioMet Toolbox instead (see above).
   -2b is only performed if not all required HMMs exist in the defined
   dataDir. This means that this step is skipped if the FASTA files or
   HMMs are downloaded from BioMet Toolbox instead (see above).
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallel or to resume if interrupted.
   -3b is always performed.

   These steps are specific to the organism for which you are
   reconstructing the model.

   Regarding the whole pipeline, the function checks the output that is
   already available and runs only the parts that are required for step 3.
   This means that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the
   RAVEN\external\kegg directory.
   -1b is only performed if any of required KOs do not have HMMs, aligned
   FASTA files, clustered FASTA files and raw FASTA files in the defined
   dataDir. This means that this step is skipped if the HMMs are
   downloaded from BioMet Toolbox instead (see above). If not all files
   exist it will try to find the KEGG database files in dataDir.
   -2ab are only performed if any of required KOs do not have HMMs,
   aligned FASTA files and clustered FASTA files in the defined dataDir.
   This means that this step is skipped if the HMMs are downloaded from
   BioMet Toolbox instead (see above).
   -2c is only performed if any of required KOs do not have HMMs and
   aligned FASTA files in the defined dataDir. This means that this step
   is skipped if the HMMs are downloaded from BioMet Toolbox instead (see
   above).
   -2d is only performed if any of required KOs do not have HMMs exist in
   the defined dataDir. This means that this step is skipped if the FASTA
   files or HMMs are downloaded from BioMet Toolbox instead (see above).
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallel or to resume if interrupted.
   -3b is always performed.

   NOTE: it is also possible to obtain draft model from KEGG without
   providing protein FASTA file for the target organism. In such case the
   organism three-four letter abbreviation set as 'organismID' must exist
   in the local KEGG database. In such case, the program just fetches all
   the reactions, which are associated with given 'organismID'.

 Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
    outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
    keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
    nSequences,seqIdentity)

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
0002     outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
0003     keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
0004     nSequences,seqIdentity,globalModel)
0005 % getKEGGModelForOrganism
0006 %   Reconstructs a genome-scale metabolic model based on protein homology
0007 %   to the orthologies in KEGG. If the target species is not available in
0008 %   KEGG, the user must select a closely related species. It is also
0009 %   possible to circumvent protein homology search (see fastaFile parameter
0010 %   for more details)
0011 %
0012 %   Input:
0013 %   organismID          three or four letter abbreviation of the organism
0014 %                       (as used in KEGG). If not available, use a closely
0015 %                       related species. This is used for determing the
0016 %                       phylogenetic distance. Use 'eukaryotes' or
0017 %                       'prokaryotes' to get a model for the whole domain.
0018 %                       Only applicable if fastaFile is empty, i.e. no
0019 %                       homology search should be performed
0020 %   fastaFile           a FASTA file that contains the protein sequences of
0021 %                       the organism for which to reconstruct a model (optional,
0022 %                       if no FASTA file is supplied then a model is
0023 %                       reconstructed based only on the organism
0024 %                       abbreviation. This option ignores all settings
0025 %                       except for keepSpontaneous, keepUndefinedStoich,
0026 %                       keepIncomplete and keepGeneral)
0027 %   dataDir             directory for which to retrieve the input data.
0028 %                       Should contain a combination of these sub-folders:
0029 %                       -dataDir\keggdb
0030 %                           The KEGG database files used in 1a (see below)
0031 %                       -dataDir\fasta
0032 %                           The multi-FASTA files generated in 1b (see
0033 %                           below)
0034 %                       -dataDir\aligned
0035 %                           The aligned FASTA files as generated in 2a (see
0036 %                           below)
0037 %                       -dataDir\hmms
0038 %                           The hidden Markov models as generated in 2b or
0039 %                           downloaded from BioMet Toolbox (see below)
0040 %                       The final directory in dataDir should be styled as
0041 %                       prok90_kegg116 or euk90_kegg116, indicating whether
0042 %                       the HMMs were trained on pro- or eukaryotic
0043 %                       sequences; using which sequence similarity treshold
0044 %                       (first set of digits); using which KEGG version
0045 %                       (second set of digits). (this parameter should
0046 %                       ALWAYS be provided)
0047 %   outDir              directory to save the results from the quering of
0048 %                       the hidden Markov models. The output is specific
0049 %                       for the input sequences and the settings used. It
0050 %                       is stored in this manner so that the function can
0051 %                       continue if interrupted or if it should run in
0052 %                       parallel. Be careful not to leave output files from
0053 %                       different organisms or runs with different settings
0054 %                       in the same folder. They will not be overwritten
0055 %                       (optional, default is a temporary dir where all *.out
0056 %                       files are deleted before and after doing the
0057 %                       reconstruction)
0058 %   keepSpontaneous     include reactions labeled as "spontaneous". (optional,
0059 %                       default true)
0060 %   keepUndefinedStoich    include reactions in the form n A <=> n+1 A. These
0061 %                       will be dealt with as two separate metabolites
0062 %                       (optional, default true)
0063 %   keepIncomplete      include reactions which have been labelled as
0064 %                       "incomplete", "erroneous" or "unclear" (optional,
0065 %                       default true)
0066 %   keepGeneral         include reactions which have been labelled as
0067 %                       "general reaction". These are reactions on the form
0068 %                       "an aldehyde <=> an alcohol", and are therefore
0069 %                       unsuited for modelling purposes. Note that not all
0070 %                       reactions have this type of annotation, and the
0071 %                       script will therefore not be able to remove all
0072 %                       such reactions (optional, default false)
0073 %   cutOff              significance score from HMMer needed to assign
0074 %                       genes to a KO (optional, default 10^-50)
0075 %   minScoreRatioG      a gene is only assigned to KOs for which the score
0076 %                       is >=log(score)/log(best score) for that gene. This
0077 %                       is to prevent that a gene which clearly belongs to
0078 %                       one KO is assigned also to KOs with much lower
0079 %                       scores (optional, default 0.8 (lower is less strict))
0080 %   minScoreRatioKO     ignore genes in a KO if their score is
0081 %                       <log(score)/log(best score in KO). This is to
0082 %                       "prune" KOs which have many genes and where some are
0083 %                       clearly a better fit (optional, default 0.3 (lower is
0084 %                       less strict))
0085 %   maxPhylDist         -1: only use sequences from the same domain
0086 %                       (Prokaryota, Eukaryota)
0087 %                       other (positive) value: only use sequences for
0088 %                       organisms where the phylogenetic distance is at the
0089 %                       most this large (as calculated in getPhylDist)
0090 %                       (optional, default Inf, which means that all sequences
0091 %                       will be used)
0092 %   nSequences          for each KO, use up to this many sequences from the
0093 %                       most closely related species. This is mainly to
0094 %                       speed up the alignment process for KOs with very
0095 %                       many genes. This subsampling is performed before
0096 %                       running CD-HIT (optional, default inf)
0097 %   seqIdentity         sequence identity threshold in CD-HIT, referred as
0098 %                       "global sequence identity" in CD-HIT User's Guide.
0099 %                       If -1 is provided, CD-HIT is skipped (optional, default 0.9)
0100 %   globalModel         structure containing both model and KOModel
0101 %                       structures as generated by getModelFromKEGG. These
0102 %                       will otherwise be loaded by via getModelFromKEGG.
0103 %                       Providing globalKEGGmodel can speed up model
0104 %                       generation if getKEGGModelForOrganism is run
0105 %                       multiple times for different strains. Example:
0106 %                       [globalModel.model,globalModel.KOModel] = getModelFromKEGG;
0107 %                       (optional, default empty, global model is loaded by
0108 %                       getModelFromKEGG)
0109 %
0110 %   Output:
0111 %   model               the reconstructed model
0112 %
0113 %   PLEASE READ THIS: The input to this function can be confusing, because
0114 %   it is intended to be run in parallel on a cluster or in multiple
0115 %   sessions. It therefore saves a lot of intermediate results to storage.
0116 %   This also serves the purpose of not having to do redundant
0117 %   calculations. This, however, comes with the disadvantage of somewhat
0118 %   trickier handling. This is what this function does:
0119 %
0120 %   1a. Loads files from a local KEGG FTP dump and constructs a general
0121 %       RAVEN model representing the metabolic network. The functions
0122 %       getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the
0123 %       data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files,
0124 %       which are later merged into 'keggModel.mat' by getModelFromKEGG
0125 %       function. The function getPhylDist generates 'keggPhylDist.mat'
0126 %       file. KEGG FTP access requires a <a href="matlab:
0127 %       web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>.
0128 %   1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One
0129 %       multi-FASTA file for each KO in KEGG is generated.
0130 %
0131 %   The Step 1 has to be re-done every time KEGG updates their database (or
0132 %   rather when the updates are large enough to warrant re-running this
0133 %   part). Many users would probably never use this feature.
0134 %
0135 %   2a. Filters KO-specific protein sets. This is done by using the
0136 %       settings "maxPhylDist" and "nSequences" to control which sequences
0137 %       should be used for constructing Hidden Markov models (HMMs), and
0138 %       later for matching your sequences to.
0139 %       The most common alternatives here would be to use sequences from
0140 %       only eukaryotes, only prokaryotes or all sequences in KEGG, but you
0141 %       could also play around with the parameters to use e.g. only fungal
0142 %       sequences.
0143 %   2b. KO-specific protein FASTA files are re-organised into
0144 %       non-redundant protein sets with CD-HIT. The user can only set
0145 %       seqIdentity parameter, which corresponds to '-c' parameter in
0146 %       CD-HIT, described as "sequence identity threshold". CD-HIT suggsted
0147 %       sequence identity specific word_length (-n) parameters are used.
0148 %   2c. Does a multi sequence alignment for multi-FASTA files obtained in
0149 %       Step 2b for future use. MAFFT software with automatic selection of
0150 %       alignment algorithm is used in this step ('--auto').
0151 %   2d. Trains hidden Markov models using HMMer for each of the aligned
0152 %       KO-specific FASTA files obtained in Step 2c. This is performed with
0153 %       'hmmbuild' using the default settings.
0154 %
0155 %   Step 2 may be reasonable to be re-done if the user wants to tweak the
0156 %   settings in proteins filtering, clustering, multi sequence alignment or
0157 %   HMMs training steps. However, it requires to have KO-specific protein
0158 %   FASTA files obtained in Step 1a. As such files are not provided in
0159 %   RAVEN and BioMet ToolBox, the user can only generate these files from
0160 %   KEGG FTP dump files, so KEGG FTP license is needed.
0161 %
0162 %   3a. Queries the HMMs with sequences for the organism you are making a
0163 %       model for. This step uses both the output from step 1a and from 2d.
0164 %       This is done with 'hmmsearch' function under default settings. The
0165 %       significance threshold value set in 'cutOff' parameter is used
0166 %       later when parsing '*.out' files to filter out KO hits with higher
0167 %       value than 'cutOff' value. The results with passable E values are
0168 %       summarised into KO-gene occurence matrix with E values in
0169 %       intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and
0170 %       'minScoreRatioKO' are then applied to 'prune' KO-gene associations
0171 %       (see the function descriptions above for more details). The
0172 %       intersection values for these 'prunable' associations are converted
0173 %       to zeroes.
0174 %   3b. Constructs a model based on the pre-processed KO-gene association
0175 %       matrix (koGeneMat). As the full KEGG model already has reaction-KO
0176 %       relationships, KOs are converted into the query genes. The final
0177 %       draft model contains only these reactions, which are associated
0178 %       with KOs from koGeneMat. The reactions without the genes may also
0179 %       be included, if the user set keepSpontaneous as 'true'.
0180 %
0181 %   The Step 3 is specific to the organism for which the model is
0182 %   reconstructed.
0183 %
0184 %   In principle the function looks at which output that is already available
0185 %   and runs only the parts that are required for step 3. This means
0186 %   that (see the definition of the parameters for details):
0187 %   -1a is only performed if there are no KEGG model files in the
0188 %   RAVEN\external\kegg directory
0189 %   -1b is only performed if not all required HMMs OR aligned FASTA files
0190 %   OR multi-FASTA files exist in the defined dataDir. This means that this
0191 %   step is skipped if the HMMs are downloaded from BioMet Toolbox instead
0192 %   (see above). If not all files exist it will try to find
0193 %   the KEGG database files in dataDir.
0194 %   -2a is only performed if not all required HMMs OR aligned FASTA files
0195 %   files exist in the defined dataDir. This means that this step is skipped
0196 %   if the HMMs are downloaded from BioMet Toolbox instead (see above).
0197 %   -2b is only performed if not all required HMMs exist in the defined
0198 %   dataDir. This means that this step is skipped if the FASTA files or
0199 %   HMMs are downloaded from BioMet Toolbox instead (see above).
0200 %   -3a is performed for the required HMMs for which no corresponding .out
0201 %   file exists in outDir. This is just a way to enable the function to be
0202 %   run in parallel or to resume if interrupted.
0203 %   -3b is always performed.
0204 %
0205 %   These steps are specific to the organism for which you are
0206 %   reconstructing the model.
0207 %
0208 %   Regarding the whole pipeline, the function checks the output that is
0209 %   already available and runs only the parts that are required for step 3.
0210 %   This means that (see the definition of the parameters for details):
0211 %   -1a is only performed if there are no KEGG model files in the
0212 %   RAVEN\external\kegg directory.
0213 %   -1b is only performed if any of required KOs do not have HMMs, aligned
0214 %   FASTA files, clustered FASTA files and raw FASTA files in the defined
0215 %   dataDir. This means that this step is skipped if the HMMs are
0216 %   downloaded from BioMet Toolbox instead (see above). If not all files
0217 %   exist it will try to find the KEGG database files in dataDir.
0218 %   -2ab are only performed if any of required KOs do not have HMMs,
0219 %   aligned FASTA files and clustered FASTA files in the defined dataDir.
0220 %   This means that this step is skipped if the HMMs are downloaded from
0221 %   BioMet Toolbox instead (see above).
0222 %   -2c is only performed if any of required KOs do not have HMMs and
0223 %   aligned FASTA files in the defined dataDir. This means that this step
0224 %   is skipped if the HMMs are downloaded from BioMet Toolbox instead (see
0225 %   above).
0226 %   -2d is only performed if any of required KOs do not have HMMs exist in
0227 %   the defined dataDir. This means that this step is skipped if the FASTA
0228 %   files or HMMs are downloaded from BioMet Toolbox instead (see above).
0229 %   -3a is performed for the required HMMs for which no corresponding .out
0230 %   file exists in outDir. This is just a way to enable the function to be
0231 %   run in parallel or to resume if interrupted.
0232 %   -3b is always performed.
0233 %
0234 %   NOTE: it is also possible to obtain draft model from KEGG without
0235 %   providing protein FASTA file for the target organism. In such case the
0236 %   organism three-four letter abbreviation set as 'organismID' must exist
0237 %   in the local KEGG database. In such case, the program just fetches all
0238 %   the reactions, which are associated with given 'organismID'.
0239 %
0240 % Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
0241 %    outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
0242 %    keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
0243 %    nSequences,seqIdentity)
0244 
0245 if nargin<2 || isempty(fastaFile)
0246     fastaFile=[];
0247 else
0248     fastaFile=char(fastaFile);
0249 end
0250 if nargin<3
0251     dataDir=[];
0252 else
0253     dataDir=char(dataDir);
0254 end
0255 if nargin<4 || isempty(outDir)
0256     outDir=tempdir;
0257     %Delete all *.out files if any exist
0258     delete(fullfile(outDir,'*.out'));
0259 else
0260     outDir=char(outDir);
0261 end
0262 if nargin<5 || isempty(keepSpontaneous)
0263     keepSpontaneous=true;
0264 end
0265 if nargin<6 || isempty(keepUndefinedStoich)
0266     keepUndefinedStoich=true;
0267 end
0268 if nargin<7 || isempty(keepIncomplete)
0269     keepIncomplete=true;
0270 end
0271 if nargin<8 || isempty(keepGeneral)
0272     keepGeneral=false;
0273 end
0274 if nargin<9 || isempty(cutOff)
0275     cutOff=10^-50;
0276 end
0277 if nargin<10 || isempty(minScoreRatioKO)
0278     minScoreRatioKO=0.3;
0279 end
0280 if nargin<11 || isempty(minScoreRatioG)
0281     minScoreRatioG=0.8;
0282 end
0283 if nargin<12 || isempty(maxPhylDist)
0284     maxPhylDist=inf;
0285     %Include all sequences for each reaction
0286 end
0287 if nargin<13 || isempty(nSequences)
0288     nSequences=inf;
0289     %Include all sequences for each reaction
0290 end
0291 if nargin<14 || isempty(seqIdentity)
0292     seqIdentity=0.9;
0293 end
0294 
0295 if isempty(fastaFile)
0296     fprintf(['\n*** The model reconstruction from KEGG based on the annotation available for KEGG Species <strong>' organismID '</strong> ***\n\n']);
0297 else
0298     fprintf('\n*** The model reconstruction from KEGG based on the protein homology search against KEGG Orthology specific HMMs ***\n\n');
0299     %Check if query fasta exists
0300     fastaFile=checkFileExistence(fastaFile,2); %Copy file to temp dir
0301 end
0302 
0303 %Run the external binaries multi-threaded to use all logical cores assigned
0304 %to MATLAB
0305 cores = evalc('feature(''numcores'')');
0306 cores = strsplit(cores, 'MATLAB was assigned: ');
0307 cores = regexp(cores{2},'^\d*','match');
0308 cores = cores{1};
0309 
0310 %Get the directory for RAVEN Toolbox.
0311 ravenPath=findRAVENroot();
0312 
0313 %Checking if dataDir is consistent. It must point to pre-trained HMMs set,
0314 %compatible with the the current RAVEN version. The user may have the
0315 %required zip file already in working directory or have it extracted. If
0316 %the zip file and directory is not here, it is downloaded from the cloud
0317 if ~isempty(dataDir)
0318     hmmOptions={'euk90_kegg116','prok90_kegg116'};
0319     if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions.
0320         %If not, then check whether the required folders exist anyway.
0321         if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) && ...
0322                 ~isfolder(fullfile(dataDir,'fasta')) && ...
0323                 ~isfolder(fullfile(dataDir,'aligned')) && ...
0324                 ~isfolder(fullfile(dataDir,'hmms'))
0325             error(['Pre-trained HMMs set is not recognised. If you want download RAVEN provided sets, it should match any of the following: ' strjoin(hmmOptions,' or ')])
0326         end
0327     else
0328         if isfolder(dataDir) && isfile(fullfile(dataDir,'hmms','K00844.hmm'))
0329             fprintf(['NOTE: Found <strong>' dataDir '</strong> directory with pre-trained HMMs, it will therefore be used during reconstruction\n']);
0330         elseif ~isfolder(dataDir) && isfile([dataDir,'.zip'])
0331             fprintf('Extracting the HMMs archive file... ');
0332             unzip([dataDir,'.zip']);
0333             fprintf('COMPLETE\n');
0334         else
0335             hmmIndex=strcmp(dataDir,hmmOptions);
0336             if ~any(hmmIndex)
0337                 error(['Pre-trained HMMs are only provided with proteins clustered at 90% sequence identity (i.e. prok90_kegg116 and euk90_kegg116). ' ...
0338                     'Use either of these datasets, or otherwise download the relevant sequence data from KEGG to train HMMs with your desired sequence identity'])
0339             else
0340                 fprintf('Downloading the HMMs archive file... ');
0341                 try
0342                     websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.11.0/',hmmOptions{hmmIndex},'.zip']);
0343                 catch ME
0344                     if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError')
0345                         error('Failed to download the HMMs archive file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues')
0346                     end
0347                 end
0348             end
0349 
0350             fprintf('COMPLETE\n');
0351             fprintf('Extracting the HMMs archive file... ');
0352             unzip([dataDir,'.zip']);
0353             fprintf('COMPLETE\n');
0354         end
0355         %Check if HMMs are extracted
0356         if ~isfile(fullfile(dataDir,'hmms','K00844.hmm'))
0357             error(['The HMM files seem improperly extracted and not found in ',dataDir,'/hmms. Please remove ',dataDir,' folder and rerun getKEGGModelForOrganism']);
0358         end
0359     end
0360 end
0361 
0362 %Check if the fasta-file contains '/' or'\'. If not then it's probably just
0363 %a file name. Expand to full path.
0364 if any(fastaFile)
0365     if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/'))
0366         fastaFile=which(fastaFile);
0367     end
0368     %Create the required sub-folders in dataDir if they dont exist
0369     if ~isfolder(fullfile(dataDir,'keggdb'))
0370         mkdir(dataDir,'keggdb');
0371     end
0372     if ~isfolder(fullfile(dataDir,'fasta'))
0373         mkdir(dataDir,'fasta');
0374     end
0375     if ~isfolder(fullfile(dataDir,'aligned'))
0376         mkdir(dataDir,'aligned');
0377     end
0378     if ~isfolder(fullfile(dataDir,'hmms'))
0379         mkdir(dataDir,'hmms');
0380     end
0381     if ~isfolder(outDir)
0382         mkdir(outDir);
0383     end
0384 end
0385 
0386 %First generate the full global KEGG model. Can be provided as input.
0387 %Otherwise, getModelFromKEGG is run. The dataDir must not be supplied as
0388 %there is also an internal RAVEN version available
0389 if nargin==15
0390     model=globalModel.model;
0391     KOModel=globalModel.KOModel;
0392 elseif any(dataDir)
0393     [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral);
0394 else
0395     [model, KOModel]=getModelFromKEGG([],keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral);
0396 end
0397 model.id=organismID;
0398 model.c=zeros(numel(model.rxns),1);
0399 
0400 %If no FASTA file is supplied, then just remove all genes which are not for
0401 %the given organism ID
0402 if isempty(fastaFile)
0403     %Check if organismID can be found in KEGG species list or is
0404     %set to "eukaryotes" or "prokaryotes"
0405     phylDistsFull=getPhylDist(fullfile(dataDir,'keggdb'),true);
0406     if ~ismember(organismID,[phylDistsFull.ids 'eukaryotes' 'prokaryotes'])
0407         error('Provided organismID is incorrect. Only species abbreviations from KEGG Species List or "eukaryotes"/"prokaryotes" are allowed.');
0408     end
0409 
0410     fprintf(['Pruning the model from <strong>non-' organismID '</strong> genes... ']);
0411     if ismember(organismID,{'eukaryotes','prokaryotes'})
0412         phylDists=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0413         if strcmp(organismID,'eukaryotes')
0414             proxyid='hsa';
0415             %Use H. sapiens here
0416         else
0417             proxyid='eco';
0418             %Use E. coli here
0419         end
0420         [~, phylDistId]=ismember(proxyid,phylDists.ids);
0421         idsToKeep=phylDists.ids(~isinf(phylDists.distMat(phylDistId,:)));
0422         taxIDs=cellfun(@(x) x{1},cellfun(@(x) strsplit(x,':'),model.genes,'UniformOutput',false),'UniformOutput',false);
0423         I=ismember(upper(taxIDs),upper(idsToKeep));
0424     else
0425         %KEGG organism IDs may have three or four letters
0426         organismID=strcat(organismID,':');
0427         %Add colon for accurate matching
0428         if length(organismID)==4
0429             I=cellfun(@(x) strcmpi(x(1:4),organismID),model.genes);
0430         elseif length(organismID)==5
0431             I=cellfun(@(x) strcmpi(x(1:5),organismID),model.genes);
0432         end
0433     end
0434     %Remove those genes
0435     model.genes=model.genes(I);
0436     model.rxnGeneMat=model.rxnGeneMat(:,I);
0437     fprintf('COMPLETE\n');
0438 end
0439 
0440 %First remove all reactions without genes
0441 if keepSpontaneous==true
0442     fprintf('Removing non-spontaneous reactions without GPR rules... ');
0443     load(fullfile(ravenPath,'external','kegg','keggRxns.mat'),'isSpontaneous');
0444     I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous);
0445     spontRxnsWithGenes=model.rxns(any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous));
0446 else
0447     fprintf('Removing reactions without GPR rules... ');
0448     I=~any(model.rxnGeneMat,2);
0449 end
0450 model=removeReactions(model,I,true);
0451 fprintf('COMPLETE\n');
0452 
0453 %Clean gene names
0454 fprintf('Fixing gene names in the model... ');
0455 %Get rid of the prefix organism id
0456 model.genes=regexprep(model.genes,'^\w+?:','');
0457 fprintf('COMPLETE\n');
0458 
0459 %If no FASTA file is supplied, then we are done here
0460 if isempty(fastaFile)
0461     %Create grRules
0462     fprintf('Constructing GPR associations and annotations for the model... ');
0463     model.grRules=cell(numel(model.rxns),1);
0464     model.grRules(:)={''};
0465     %Add the gene associations as 'or'
0466     for i=1:numel(model.rxns)
0467         %Find the involved genes
0468         I=find(model.rxnGeneMat(i,:));
0469         if any(I)
0470             model.grRules{i}=['(' model.genes{I(1)}];
0471             for j=2:numel(I)
0472                 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
0473             end
0474             model.grRules{i}=[model.grRules{i} ')'];
0475         end
0476     end
0477     %Fix grRules and reconstruct rxnGeneMat
0478     [grRules,rxnGeneMat] = standardizeGrRules(model); %Give detailed output
0479     model.grRules = grRules;
0480     model.rxnGeneMat = rxnGeneMat;
0481     %Add geneMiriams, assuming that it follows the syntax
0482     %kegg.genes/organismID:geneName
0483     model.geneMiriams='';
0484     for i=1:numel(model.genes)
0485         model.geneMiriams{i,1}.name{1,1}='kegg.genes';
0486         model.geneMiriams{i,1}.value{1,1}=strcat(lower(organismID),model.genes{i,1});
0487     end
0488     %Add the description to the reactions
0489     for i=1:numel(model.rxns)
0490         if ~isempty(model.rxnNotes{i})
0491             model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (without HMMs).',model.rxnNotes(i));
0492             model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. ');
0493         else
0494             model.rxnNotes(i)={'Included by getKEGGModelForOrganism (without HMMs)'};
0495         end
0496     end
0497     fprintf('COMPLETE\n\n');
0498     fprintf('*** Model reconstruction complete ***\n');
0499     return;
0500 end
0501 
0502 %Create a phylogenetic distance structure
0503 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0504 [~, phylDistId]=ismember(model.id,phylDistStruct.ids);
0505 
0506 %Calculate the real maximal distance now. An abitary large number of 1000
0507 %is used for the "all in kingdom" or "all sequences" options. This is a bit
0508 %inconvenient way to do it, but it is to make it fit with some older code
0509 if isinf(maxPhylDist) || maxPhylDist==-1
0510     maxPhylDist=1000;
0511 end
0512 
0513 %Get the KO ids for which files have been generated. Maybe not the neatest
0514 %way..
0515 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa'));
0516 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa'));
0517 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw'));
0518 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm'));
0519 outFiles=listFiles(fullfile(outDir,'*.out'));
0520 
0521 %Check if multi-FASTA files should be generated. This should only be
0522 %performed if there are IDs in the KOModel structure that haven't been
0523 %parsed yet
0524 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]);
0525 
0526 if ~isempty(missingFASTA)
0527     if ~isfile(fullfile(dataDir,'keggdb','genes.pep'))
0528         EM=['The file ''genes.pep'' cannot be located at ' strrep(dataDir,'\','/') '/ and should be downloaded from the KEGG FTP.\n'];
0529         dispEM(EM);
0530     end
0531     %Only construct models for KOs which don't have files already
0532     fastaModel=removeReactions(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true);
0533     %Permute the order of the KOs in the model so that constructMultiFasta
0534     %can be run on several processors at once
0535     fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns');
0536     constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta'));
0537 else
0538     fprintf('Generating the KEGG Orthology specific multi-FASTA files... COMPLETE\n');
0539 end
0540 
0541 if isunix
0542     if ismac
0543         binEnd='.mac';
0544     else
0545         binEnd='';
0546     end
0547 elseif ispc
0548     binEnd='';
0549 else
0550     EM='Unknown OS, exiting.';
0551     disp(EM);
0552     return
0553 end
0554 
0555 tmpFile=tempname;
0556 %On Windows, paths need to be translated to Unix before parsing it to WSL
0557 if ispc
0558     wslPath.tmpFile=getWSLpath(tmpFile);
0559     %mafft has problems writing to terminal (/dev/stderr) when running
0560     %on WSL via MATLAB, instead write and read progress file
0561     mafftOutput = tempname;
0562     wslPath.mafftOutput=getWSLpath(mafftOutput);
0563     wslPath.mafft=getWSLpath(fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat'));
0564     wslPath.hmmbuild=getWSLpath(fullfile(ravenPath,'software','hmmer','hmmbuild'));
0565     wslPath.hmmsearch=getWSLpath(fullfile(ravenPath,'software','hmmer','hmmsearch'));
0566     wslPath.cdhit=getWSLpath(fullfile(ravenPath,'software','cd-hit','cd-hit'));
0567 end
0568 
0569 %Check if alignment of FASTA files should be performed
0570 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]);
0571 if ~isempty(missingAligned)
0572     if seqIdentity==-1
0573         fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets...   0%% complete');
0574     else
0575         fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets...   0%% complete');
0576     end
0577     missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned)));
0578 
0579     for i=1:numel(missingAligned)
0580         %This is checked here because it could be that it is created by a
0581         %parallel process. The faw-files are saved as temporary files to
0582         %kept track of which files are being worked on
0583         if ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])) &&...
0584                 ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']))
0585             %Check that the multi-FASTA file exists. It should do so since
0586             %we are saving empty files as well. Print a warning and
0587             %continue if not
0588             if ~isfile(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']))
0589                 EM=['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist'];
0590                 dispEM(EM,false);
0591                 continue;
0592             end
0593 
0594             %If the multi-FASTA file is empty then save an empty aligned
0595             %file and continue
0596             s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0597             if s.bytes<=0
0598                 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w');
0599                 fclose(fid);
0600                 continue;
0601             end
0602 
0603             %Create an empty file to prevent other threads to start to work
0604             %on the same alignment
0605             fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w');
0606             fclose(fid);
0607 
0608             %First load the FASTA file, then select up to nSequences
0609             %sequences of the most closely related species, apply any
0610             %constraints from maxPhylDist, and save it as a temporary file,
0611             %and create the model from that
0612 
0613             fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0614             phylDist=inf(numel(fastaStruct),1);
0615             for j=1:numel(fastaStruct)
0616                 %Get the organism abbreviation
0617                 index=strfind(fastaStruct(j).Header,':');
0618                 if any(index)
0619                     abbrev=fastaStruct(j).Header(1:index(1)-1);
0620                     [~, index]=ismember(abbrev,phylDistStruct.ids);
0621                     if any(index)
0622                         phylDist(j)=phylDistStruct.distMat(index(1),phylDistId);
0623                     end
0624                 end
0625             end
0626 
0627             %Inf means that it should not be included
0628             phylDist(phylDist>maxPhylDist)=[];
0629 
0630             %Sort based on phylDist
0631             [~, order]=sort(phylDist);
0632 
0633             %Save the first nSequences hits to a temporary FASTA file
0634             if nSequences<=numel(fastaStruct)
0635                 fastaStruct=fastaStruct(order(1:nSequences));
0636             else
0637                 fastaStruct=fastaStruct(order);
0638             end
0639 
0640             %Do the clustering and alignment if there are more than one
0641             %sequences, otherwise just save the sequence (or an empty file)
0642             if numel(fastaStruct)>1
0643                 if seqIdentity~=-1
0644                     cdhitInpCustom=tempname;
0645                     fastawrite(cdhitInpCustom,fastaStruct);
0646                     if seqIdentity<=1 && seqIdentity>0.7
0647                         nparam='5';
0648                     elseif seqIdentity>0.6
0649                         nparam='4';
0650                     elseif seqIdentity>0.5
0651                         nparam='3';
0652                     elseif seqIdentity>0.4
0653                         nparam='2';
0654                     else
0655                         EM='The provided seqIdentity must be between 0 and 1\n';
0656                         dispEM(EM);
0657                     end
0658                     if ispc
0659                         wslPath.cdhitInpCustom=getWSLpath(cdhitInpCustom);
0660                         [status, output]=system(['wsl "' wslPath.cdhit '" -T "' num2str(cores) '" -i "' wslPath.cdhitInpCustom '" -o "' wslPath.tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']);
0661                     elseif ismac || isunix
0662                         [status, output]=system(['"' fullfile(ravenPath,'software','cd-hit',['cd-hit' binEnd]) '" -T "' num2str(cores) '" -i "' cdhitInpCustom '" -o "' tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']);
0663                     end
0664                     if status~=0
0665                         EM=['Error when performing clustering of ' missingAligned{i} ':\n' output];
0666                         dispEM(EM);
0667                     end
0668                     %Remove the old tempfile
0669                     if exist(cdhitInpCustom, 'file')
0670                         delete([cdhitInpCustom '*']);
0671                     end
0672                 else
0673                     %This means that CD-HIT should be skipped since
0674                     %seqIdentity is equal to -1
0675                     fastawrite(tmpFile,fastaStruct);
0676                 end
0677                 %Do the alignment for this file
0678                 if ismac
0679                     [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-mac','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']);
0680                 elseif isunix
0681                     [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']);
0682                 elseif ispc
0683                     wslPath.fawFile=getWSLpath(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']));
0684                     [status, ~]=system(['wsl "' wslPath.mafft '" --auto --anysymbol --progress "' wslPath.mafftOutput '" --thread "' num2str(cores) '" --out "' wslPath.fawFile '" "' wslPath.tmpFile '"']);
0685                     output=fileread(mafftOutput);
0686                     delete(mafftOutput);
0687                 end
0688                 if status~=0
0689                     %It could be that alignment failed because only one
0690                     %sequence was left after clustering. If that is the
0691                     %case, then the clustered file is just copied as 'faw'
0692                     %file
0693                     if any(regexp(output,'Only 1 sequence found'))
0694                         movefile(tmpFile,fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'f');
0695                     else
0696                         EM=['Error when performing alignment of ' missingAligned{i} ':\n' output];
0697                         dispEM(EM);
0698                     end
0699                 end
0700                 %Remove the old tempfile
0701                 if exist(tmpFile, 'file')
0702                     delete([tmpFile '*']);
0703                 end
0704             else
0705                 %If there is only one sequence then it's not possible to do
0706                 %a multiple alignment. Just print the sequence instead. An
0707                 %empty file was written previously so that doesn't have to
0708                 %be dealt with
0709                 if numel(fastaStruct)==1
0710                     warnState = warning; %Save the current warning state
0711                     warning('off','Bioinfo:fastawrite:AppendToFile');
0712                     fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct);
0713                     warning(warnState) %Reset warning state to previous settings
0714                 end
0715             end
0716             %Move the temporary file to the real one
0717             movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f');
0718 
0719             %Print the progress every 25 files
0720             if rem(i-1,25) == 0
0721                 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'aligned','*.fa')))/numel(KOModel.rxns)));
0722                 progress=pad(progress,3,'left');
0723                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0724             end
0725         end
0726     end
0727     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0728 else
0729     if seqIdentity==-1
0730         fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n');
0731     else
0732         fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n');
0733     end
0734 end
0735 
0736 %Check if training of Hidden Markov models should be performed
0737 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]);
0738 if ~isempty(missingHMMs)
0739     fprintf('Training the KEGG Orthology specific HMMs...   0%% complete');
0740     missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs)));
0741     %Train models for all missing KOs
0742     for i=1:numel(missingHMMs)
0743         %This is checked here because it could be that it is created by a
0744         %parallel process
0745         if ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm'])) && ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']))
0746             %Check that the aligned FASTA file exists. It could be that it
0747             %is still being worked on by some other instance of the program
0748             %(the .faw file should then exist). This should not happen on a
0749             %single computer. It doesn't throw an error, because it should
0750             %finalize the ones it can
0751             if ~isfile(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']))
0752                 EM=['The aligned FASTA file for ' missingHMMs{i} ' does not exist'];
0753                 dispEM(EM,false);
0754                 continue;
0755             end
0756 
0757             %If the multi-FASTA file is empty then save an empty aligned
0758             %file and continue
0759             s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']));
0760             if s.bytes<=0
0761                 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w');
0762                 fclose(fid);
0763                 continue;
0764             end
0765             %Create a temporary file to indicate that it is working on the
0766             %KO. This is because hmmbuild cannot overwrite existing files
0767             fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w');
0768             fclose(fid);
0769 
0770             %Create HMM
0771             if ismac || isunix
0772                 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmbuild' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']);
0773             else
0774                 wslPath.hmmFile   = getWSLpath(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']));
0775                 wslPath.alignFile = getWSLpath(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']));
0776                 [status, output]  = system(['wsl "' wslPath.hmmbuild '" --cpu "' num2str(cores) '" "' wslPath.hmmFile '" "' wslPath.alignFile '"']);
0777             end
0778             if status~=0
0779                 EM=['Error when training HMM for ' missingHMMs{i} ':\n' output];
0780                 dispEM(EM);
0781             end
0782 
0783             %Delete the temporary file
0784             delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']));
0785 
0786             %Print the progress every 25 files
0787             if rem(i-1,25) == 0
0788                 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'hmms','*.hmm')))/numel(KOModel.rxns)));
0789                 progress=pad(progress,3,'left');
0790                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0791             end
0792         end
0793     end
0794     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0795 else
0796     fprintf('Training the KEGG Orthology specific HMMs... COMPLETE\n');
0797 end
0798 
0799 %Check which new .out files that should be generated. Check if training of
0800 %Hidden Markov models should be performed
0801 missingOUT=setdiff(KOModel.rxns,outFiles);
0802 if ~isempty(missingOUT)
0803     fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs...   0%% complete');
0804     missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT)));
0805     for i=1:numel(missingOUT)
0806         %This is checked here because it could be that it is created by a
0807         %parallel process
0808         if ~isfile(fullfile(outDir,[missingOUT{i} '.out']))
0809             %Check that the HMM file exists. It should do so since %we are
0810             %saving empty files as well. Print a warning and continue if
0811             %not
0812             if ~isfile(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']))
0813                 EM=['The HMM file for ' missingOUT{i} ' does not exist'];
0814                 dispEM(EM,false);
0815                 continue;
0816             end
0817 
0818             %Save an empty file to prevent several threads working on the
0819             %same file
0820             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0821             fclose(fid);
0822 
0823             %If the HMM file is empty then save an out file and continue
0824             s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']));
0825             if s.bytes<=0
0826                 continue;
0827             end
0828 
0829             %Check each gene in the input file against this model
0830             if ismac || isunix
0831                 [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmsearch' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']);
0832             else
0833                 wslPath.hmmFile   = getWSLpath(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']));
0834                 wslPath.fastaFile = getWSLpath(fastaFile);
0835                 [status, output]=system(['wsl "' wslPath.hmmsearch '" --cpu "' num2str(cores) '" "' wslPath.hmmFile '" "' wslPath.fastaFile '"']);
0836             end
0837             if status~=0
0838                 EM=['Error when querying HMM for ' missingOUT{i} ':\n' output];
0839                 dispEM(EM);
0840             end
0841 
0842             %Save the output to a file
0843             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0844             fwrite(fid,output);
0845             fclose(fid);
0846 
0847             %Print the progress every 25 files
0848             if rem(i-1,25) == 0
0849                 progress=num2str(floor(100*numel(listFiles(fullfile(outDir,'*.out')))/numel(KOModel.rxns)));
0850                 progress=pad(progress,3,'left');
0851                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0852             end
0853         end
0854     end
0855     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0856 else
0857     fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... COMPLETE\n');
0858 end
0859 
0860 
0861 %***Begin retrieving the output and putting together the resulting model
0862 
0863 fprintf('Parsing the HMM search results... ');
0864 %Retrieve matched genes from the HMMs
0865 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes
0866 genes=cell(3000,1);
0867 %Store the best score for a gene in a hash list (since it will be searching
0868 %many times)
0869 hTable = java.util.Hashtable;
0870 
0871 geneCounter=0;
0872 for i=1:numel(KOModel.rxns)
0873     if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file')
0874         fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r');
0875         beginMatches=false;
0876         while 1
0877             %Get the next line
0878             tline = fgetl(fid);
0879 
0880             %Abort at end of file
0881             if ~ischar(tline)
0882                 break;
0883             end
0884 
0885             if and(beginMatches,strcmp(tline,'  ------ inclusion threshold ------'))
0886                 break;
0887             end
0888 
0889             if beginMatches==false
0890                 %This is how the listing of matches begins
0891                 if any(strfind(tline,'E-value '))
0892                     %Read one more line that is only padding
0893                     tline = fgetl(fid);
0894                     beginMatches=true;
0895                 end
0896             else
0897                 %If matches should be read
0898                 if ~strcmp(tline,'   [No hits detected that satisfy reporting thresholds]') && ~isempty(tline)
0899                     elements=regexp(tline,' ','split');
0900                     elements=elements(cellfun(@any,elements));
0901 
0902                     %Check if the match is below the treshhold
0903                     score=str2double(elements{1});
0904                     gene=elements{9};
0905                     if score<=cutOff
0906                         %If the score is exactly 0, change it to a very
0907                         %small value to avoid NaN
0908                         if score==0
0909                             score=10^-250;
0910                         end
0911                         %Check if the gene is added already and, is so, get
0912                         %the best score for it
0913                         I=hTable.get(gene);
0914                         if any(I)
0915                             koGeneMat(i,I)=score;
0916                         else
0917                             geneCounter=geneCounter+1;
0918                             %The gene was not present yet so add it
0919                             hTable.put(gene,geneCounter);
0920                             genes{geneCounter}=gene;
0921                             koGeneMat(i,geneCounter)=score;
0922                         end
0923                     end
0924                 else
0925                     break;
0926                 end
0927             end
0928         end
0929         fclose(fid);
0930     end
0931 end
0932 fprintf('COMPLETE\n');
0933 
0934 fprintf('Removing gene, KEGG Orthology associations below minScoreRatioKO, minScoreRatioG... ');
0935 koGeneMat=koGeneMat(:,1:geneCounter);
0936 
0937 %Remove the genes for each KO that are below minScoreRatioKO.
0938 for i=1:size(koGeneMat,1)
0939     J=find(koGeneMat(i,:));
0940     if any(J)
0941         koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0;
0942     end
0943 end
0944 
0945 %Remove the KOs for each gene that are below minScoreRatioG
0946 for i=1:size(koGeneMat,2)
0947     J=find(koGeneMat(:,i));
0948     if any(J)
0949         koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0;
0950     end
0951 end
0952 fprintf('COMPLETE\n');
0953 
0954 fprintf('Adding gene annotations to the model... ');
0955 %Create the new model
0956 model.genes=genes(1:geneCounter);
0957 model.grRules=cell(numel(model.rxns),1);
0958 model.grRules(:)={''};
0959 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes));
0960 
0961 %Loop through the reactions and add the corresponding genes
0962 for i=1:numel(model.rxns)
0963     if isstruct(model.rxnMiriams{i})
0964         %Get all KOs
0965         I=find(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'));
0966         KOs=model.rxnMiriams{i}.value(I);
0967         %Find the KOs and the corresponding genes
0968         J=ismember(KOModel.rxns,KOs);
0969         [~, K]=find(koGeneMat(J,:));
0970 
0971         if any(K)
0972             model.rxnGeneMat(i,K)=1;
0973             %Also delete KOs for which no genes were found. If no genes at
0974             %all were matched to the reaction it will be deleted later
0975             L=sum(koGeneMat(J,:),2)==0;
0976             model.rxnMiriams{i}.value(I(L))=[];
0977             model.rxnMiriams{i}.name(I(L))=[];
0978         end
0979     end
0980 end
0981 fprintf('COMPLETE\n');
0982 
0983 %Find and delete all reactions without genes. This also removes genes that
0984 %are not used (which could happen because minScoreRatioG and
0985 %minScoreRatioKO). If keepSpontaneous==true, the spontaneous reactions
0986 %without genes are kept in the model. Spontaneous reactions with original
0987 %gene associations are treated in the same way, like the rest of the
0988 %reactions - if gene associations were removed during HMM search, such
0989 %reactions are deleted from the model
0990 if keepSpontaneous==true
0991     %Not the most comprise way to delete reactions without genes, but this
0992     %makes the code easier to understand. Firstly the non-spontaneous
0993     %reactions without genes are removed. After that, the second deletion
0994     %step removes spontaneous reactions, which had gene associations before
0995     %HMM search, but no longer have after it
0996     fprintf('Removing non-spontaneous reactions which after HMM search no longer have GPR rules... ');
0997     I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous);
0998     model=removeReactions(model,I,true,true);
0999     I=~any(model.rxnGeneMat,2)&ismember(model.rxns,spontRxnsWithGenes);
1000     model=removeReactions(model,I,true,true);
1001 else
1002     %Just simply check for any new reactions without genes and remove
1003     %it
1004     fprintf('Removing reactions which after HMM search no longer have GPR rules... ');
1005     I=~any(model.rxnGeneMat,2);
1006     model=removeReactions(model,I,true,true);
1007 end
1008 fprintf('COMPLETE\n');
1009 
1010 fprintf('Constructing GPR rules and finalizing the model... ');
1011 %Add the gene associations as 'or'
1012 for i=1:numel(model.rxns)
1013     %Find the involved genes
1014     I=find(model.rxnGeneMat(i,:));
1015     if any(I)
1016         model.grRules{i}=['(' model.genes{I(1)}];
1017         for j=2:numel(I)
1018             model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
1019         end
1020         model.grRules{i}=[model.grRules{i} ')'];
1021     end
1022 end
1023 
1024 %Fix grRules and reconstruct rxnGeneMat
1025 [grRules,rxnGeneMat] = standardizeGrRules(model,false); %Give detailed output
1026 model.grRules = grRules;
1027 model.rxnGeneMat = rxnGeneMat;
1028 
1029 %Add the description to the reactions
1030 for i=1:numel(model.rxns)
1031     if ~isempty(model.rxnNotes{i})
1032         model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (using HMMs).',model.rxnNotes(i));
1033         model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. ');
1034     else
1035         model.rxnNotes(i)={'Included by getKEGGModelForOrganism (using HMMs)'};
1036     end
1037 end
1038 %Remove the temp fasta file
1039 delete(fastaFile)
1040 fprintf('COMPLETE\n\n*** Model reconstruction complete ***\n');
1041 end
1042 
1043 function files=listFiles(directory)
1044 %Supporter function to list the files in a directory and return them as a
1045 %cell array
1046 temp=dir(directory);
1047 files=cell(numel(temp),1);
1048 for i=1:numel(temp)
1049     files{i}=temp(i,1).name;
1050 end
1051 files=strrep(files,'.fa','');
1052 files=strrep(files,'.hmm','');
1053 files=strrep(files,'.out','');
1054 files=strrep(files,'.faw','');
1055 end

Generated by m2html © 2005