Description of getKEGGModelForOrganism

 getKEGGModelForOrganism
   Reconstructs a genome-scale metabolic model based on protein homology
   to the orthologies in KEGG. If the target species is not available in
   KEGG, the user must select a closely related species. It is also
   possible to circumvent protein homology search (see fastaFile parameter
   for more details)

   Input:
   organismID          three or four letter abbreviation of the organism
                       (as used in KEGG). If not available, use a closely
                       related species. This is used for determing the
                       phylogenetic distance. Use 'eukaryotes' or
                       'prokaryotes' to get a model for the whole domain.
                       Only applicable if fastaFile is empty, i.e. no
                       homology search should be performed
   fastaFile           a FASTA file that contains the protein sequences of
                       the organism for which to reconstruct a model (optional,
                       if no FASTA file is supplied then a model is
                       reconstructed based only on the organism
                       abbreviation. This option ignores all settings
                       except for keepSpontaneous, keepUndefinedStoich,
                       keepIncomplete and keepGeneral)
   dataDir             directory for which to retrieve the input data.
                       Should contain a combination of these sub-folders:
                       -dataDir\keggdb
                           The KEGG database files used in 1a (see below)
                       -dataDir\fasta
                           The multi-FASTA files generated in 1b (see
                           below)
                       -dataDir\aligned
                           The aligned FASTA files as generated in 2a (see
                           below)
                       -dataDir\hmms
                           The hidden Markov models as generated in 2b or
                           downloaded from BioMet Toolbox (see below)
                       The final directory in dataDir should be styled as
                       prok90_kegg105 or euk90_kegg105, indicating whether
                       the HMMs were trained on pro- or eukaryotic
                       sequences; using which sequence similarity treshold
                       (first set of digits); using which KEGG version
                       (second set of digits). (this parameter should
                       ALWAYS be provided)
   outDir              directory to save the results from the quering of
                       the hidden Markov models. The output is specific
                       for the input sequences and the settings used. It
                       is stored in this manner so that the function can
                       continue if interrupted or if it should run in
                       parallel. Be careful not to leave output files from
                       different organisms or runs with different settings
                       in the same folder. They will not be overwritten
                       (optional, default is a temporary dir where all *.out
                       files are deleted before and after doing the
                       reconstruction)
   keepSpontaneous     include reactions labeled as "spontaneous". (optional,
                       default true)
   keepUndefinedStoich    include reactions in the form n A <=> n+1 A. These
                       will be dealt with as two separate metabolites
                       (optional, default true)
   keepIncomplete      include reactions which have been labelled as
                       "incomplete", "erroneous" or "unclear" (optional,
                       default true)
   keepGeneral         include reactions which have been labelled as
                       "general reaction". These are reactions on the form
                       "an aldehyde <=> an alcohol", and are therefore
                       unsuited for modelling purposes. Note that not all
                       reactions have this type of annotation, and the
                       script will therefore not be able to remove all
                       such reactions (optional, default false)
   cutOff              significance score from HMMer needed to assign
                       genes to a KO (optional, default 10^-50)
   minScoreRatioG      a gene is only assigned to KOs for which the score
                       is >=log(score)/log(best score) for that gene. This
                       is to prevent that a gene which clearly belongs to
                       one KO is assigned also to KOs with much lower
                       scores (optional, default 0.8 (lower is less strict))
   minScoreRatioKO     ignore genes in a KO if their score is
                       <log(score)/log(best score in KO). This is to
                       "prune" KOs which have many genes and where some are
                       clearly a better fit (optional, default 0.3 (lower is
                       less strict))
   maxPhylDist         -1: only use sequences from the same domain
                       (Prokaryota, Eukaryota)
                       other (positive) value: only use sequences for
                       organisms where the phylogenetic distance is at the
                       most this large (as calculated in getPhylDist)
                       (optional, default Inf, which means that all sequences
                       will be used)
   nSequences          for each KO, use up to this many sequences from the
                       most closely related species. This is mainly to
                       speed up the alignment process for KOs with very
                       many genes. This subsampling is performed before
                       running CD-HIT (optional, default inf)
   seqIdentity         sequence identity threshold in CD-HIT, referred as
                       "global sequence identity" in CD-HIT User's Guide.
                       If -1 is provided, CD-HIT is skipped (optional, default 0.9)
   globalModel         structure containing both model and KOModel
                       structures as generated by getModelFromKEGG. These
                       will otherwise be loaded by via getModelFromKEGG. 
                       Providing globalKEGGmodel can speed up model
                       generation if getKEGGModelForOrganism is run
                       multiple times for different strains. Example:
                       [globalModel.model,globalModel.KOModel] = getModelFromKEGG;
                       (optional, default empty, global model is loaded by 
                       getModelFromKEGG)

   Output:
   model               the reconstructed model

   PLEASE READ THIS: The input to this function can be confusing, because
   it is intended to be run in parallel on a cluster or in multiple
   sessions. It therefore saves a lot of intermediate results to storage.
   This also serves the purpose of not having to do redundant
   calculations. This, however, comes with the disadvantage of somewhat
   trickier handling. This is what this function does:

   1a. Loads files from a local KEGG FTP dump and constructs a general
       RAVEN model representing the metabolic network. The functions
       getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the
       data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files,
       which are later merged into 'keggModel.mat' by getModelFromKEGG
       function. The function getPhylDist generates 'keggPhylDist.mat'
       file. KEGG FTP access requires a <a href="matlab:
       web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>.
   1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One
       multi-FASTA file for each KO in KEGG is generated.

   The Step 1 has to be re-done every time KEGG updates their database (or
   rather when the updates are large enough to warrant re-running this
   part). Many users would probably never use this feature.

   2a. Filters KO-specific protein sets. This is done by using the
       settings "maxPhylDist" and "nSequences" to control which sequences
       should be used for constructing Hidden Markov models (HMMs), and
       later for matching your sequences to.
       The most common alternatives here would be to use sequences from
       only eukaryotes, only prokaryotes or all sequences in KEGG, but you
       could also play around with the parameters to use e.g. only fungal
       sequences.
   2b. KO-specific protein FASTA files are re-organised into
       non-redundant protein sets with CD-HIT. The user can only set
       seqIdentity parameter, which corresponds to '-c' parameter in
       CD-HIT, described as "sequence identity threshold". CD-HIT suggsted
       sequence identity specific word_length (-n) parameters are used.
   2c. Does a multi sequence alignment for multi-FASTA files obtained in
       Step 2b for future use. MAFFT software with automatic selection of
       alignment algorithm is used in this step ('--auto').
   2d. Trains hidden Markov models using HMMer for each of the aligned
       KO-specific FASTA files obtained in Step 2c. This is performed with
       'hmmbuild' using the default settings.

   Step 2 may be reasonable to be re-done if the user wants to tweak the
   settings in proteins filtering, clustering, multi sequence alignment or
   HMMs training steps. However, it requires to have KO-specific protein
   FASTA files obtained in Step 1a. As such files are not provided in
   RAVEN and BioMet ToolBox, the user can only generate these files from
   KEGG FTP dump files, so KEGG FTP license is needed.

   3a. Queries the HMMs with sequences for the organism you are making a
       model for. This step uses both the output from step 1a and from 2d.
       This is done with 'hmmsearch' function under default settings. The
       significance threshold value set in 'cutOff' parameter is used
       later when parsing '*.out' files to filter out KO hits with higher
       value than 'cutOff' value. The results with passable E values are
       summarised into KO-gene occurence matrix with E values in
       intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and
       'minScoreRatioKO' are then applied to 'prune' KO-gene associations
       (see the function descriptions above for more details). The
       intersection values for these 'prunable' associations are converted
       to zeroes.
   3b. Constructs a model based on the pre-processed KO-gene association
       matrix (koGeneMat). As the full KEGG model already has reaction-KO
       relationships, KOs are converted into the query genes. The final
       draft model contains only these reactions, which are associated
       with KOs from koGeneMat. The reactions without the genes may also
       be included, if the user set keepSpontaneous as 'true'.

   The Step 3 is specific to the organism for which the model is
   reconstructed.

   In principle the function looks at which output that is already available
   and runs only the parts that are required for step 3. This means
   that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the
   RAVEN\external\kegg directory
   -1b is only performed if not all required HMMs OR aligned FASTA files
   OR multi-FASTA files exist in the defined dataDir. This means that this
   step is skipped if the HMMs are downloaded from BioMet Toolbox instead
   (see above). If not all files exist it will try to find
   the KEGG database files in dataDir.
   -2a is only performed if not all required HMMs OR aligned FASTA files
   files exist in the defined dataDir. This means that this step is skipped
   if the HMMs are downloaded from BioMet Toolbox instead (see above).
   -2b is only performed if not all required HMMs exist in the defined
   dataDir. This means that this step is skipped if the FASTA files or
   HMMs are downloaded from BioMet Toolbox instead (see above).
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallel or to resume if interrupted.
   -3b is always performed.

   These steps are specific to the organism for which you are
   reconstructing the model.

   Regarding the whole pipeline, the function checks the output that is
   already available and runs only the parts that are required for step 3.
   This means that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the
   RAVEN\external\kegg directory.
   -1b is only performed if any of required KOs do not have HMMs, aligned
   FASTA files, clustered FASTA files and raw FASTA files in the defined
   dataDir. This means that this step is skipped if the HMMs are
   downloaded from BioMet Toolbox instead (see above). If not all files
   exist it will try to find the KEGG database files in dataDir.
   -2ab are only performed if any of required KOs do not have HMMs,
   aligned FASTA files and clustered FASTA files in the defined dataDir.
   This means that this step is skipped if the HMMs are downloaded from
   BioMet Toolbox instead (see above).
   -2c is only performed if any of required KOs do not have HMMs and
   aligned FASTA files in the defined dataDir. This means that this step
   is skipped if the HMMs are downloaded from BioMet Toolbox instead (see
   above).
   -2d is only performed if any of required KOs do not have HMMs exist in
   the defined dataDir. This means that this step is skipped if the FASTA
   files or HMMs are downloaded from BioMet Toolbox instead (see above).
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallel or to resume if interrupted.
   -3b is always performed.

   NOTE: it is also possible to obtain draft model from KEGG without
   providing protein FASTA file for the target organism. In such case the
   organism three-four letter abbreviation set as 'organismID' must exist
   in the local KEGG database. In such case, the program just fetches all
   the reactions, which are associated with given 'organismID'.

 Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
    outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
    keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
    nSequences,seqIdentity)

0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
0002     outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
0003     keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
0004     nSequences,seqIdentity,globalModel)
0005 % getKEGGModelForOrganism
0006 %   Reconstructs a genome-scale metabolic model based on protein homology
0007 %   to the orthologies in KEGG. If the target species is not available in
0008 %   KEGG, the user must select a closely related species. It is also
0009 %   possible to circumvent protein homology search (see fastaFile parameter
0010 %   for more details)
0011 %
0012 %   Input:
0013 %   organismID          three or four letter abbreviation of the organism
0014 %                       (as used in KEGG). If not available, use a closely
0015 %                       related species. This is used for determing the
0016 %                       phylogenetic distance. Use 'eukaryotes' or
0017 %                       'prokaryotes' to get a model for the whole domain.
0018 %                       Only applicable if fastaFile is empty, i.e. no
0019 %                       homology search should be performed
0020 %   fastaFile           a FASTA file that contains the protein sequences of
0021 %                       the organism for which to reconstruct a model (optional,
0022 %                       if no FASTA file is supplied then a model is
0023 %                       reconstructed based only on the organism
0024 %                       abbreviation. This option ignores all settings
0025 %                       except for keepSpontaneous, keepUndefinedStoich,
0026 %                       keepIncomplete and keepGeneral)
0027 %   dataDir             directory for which to retrieve the input data.
0028 %                       Should contain a combination of these sub-folders:
0029 %                       -dataDir\keggdb
0030 %                           The KEGG database files used in 1a (see below)
0031 %                       -dataDir\fasta
0032 %                           The multi-FASTA files generated in 1b (see
0033 %                           below)
0034 %                       -dataDir\aligned
0035 %                           The aligned FASTA files as generated in 2a (see
0036 %                           below)
0037 %                       -dataDir\hmms
0038 %                           The hidden Markov models as generated in 2b or
0039 %                           downloaded from BioMet Toolbox (see below)
0040 %                       The final directory in dataDir should be styled as
0041 %                       prok90_kegg105 or euk90_kegg105, indicating whether
0042 %                       the HMMs were trained on pro- or eukaryotic
0043 %                       sequences; using which sequence similarity treshold
0044 %                       (first set of digits); using which KEGG version
0045 %                       (second set of digits). (this parameter should
0046 %                       ALWAYS be provided)
0047 %   outDir              directory to save the results from the quering of
0048 %                       the hidden Markov models. The output is specific
0049 %                       for the input sequences and the settings used. It
0050 %                       is stored in this manner so that the function can
0051 %                       continue if interrupted or if it should run in
0052 %                       parallel. Be careful not to leave output files from
0053 %                       different organisms or runs with different settings
0054 %                       in the same folder. They will not be overwritten
0055 %                       (optional, default is a temporary dir where all *.out
0056 %                       files are deleted before and after doing the
0057 %                       reconstruction)
0058 %   keepSpontaneous     include reactions labeled as "spontaneous". (optional,
0059 %                       default true)
0060 %   keepUndefinedStoich    include reactions in the form n A <=> n+1 A. These
0061 %                       will be dealt with as two separate metabolites
0062 %                       (optional, default true)
0063 %   keepIncomplete      include reactions which have been labelled as
0064 %                       "incomplete", "erroneous" or "unclear" (optional,
0065 %                       default true)
0066 %   keepGeneral         include reactions which have been labelled as
0067 %                       "general reaction". These are reactions on the form
0068 %                       "an aldehyde <=> an alcohol", and are therefore
0069 %                       unsuited for modelling purposes. Note that not all
0070 %                       reactions have this type of annotation, and the
0071 %                       script will therefore not be able to remove all
0072 %                       such reactions (optional, default false)
0073 %   cutOff              significance score from HMMer needed to assign
0074 %                       genes to a KO (optional, default 10^-50)
0075 %   minScoreRatioG      a gene is only assigned to KOs for which the score
0076 %                       is >=log(score)/log(best score) for that gene. This
0077 %                       is to prevent that a gene which clearly belongs to
0078 %                       one KO is assigned also to KOs with much lower
0079 %                       scores (optional, default 0.8 (lower is less strict))
0080 %   minScoreRatioKO     ignore genes in a KO if their score is
0081 %                       <log(score)/log(best score in KO). This is to
0082 %                       "prune" KOs which have many genes and where some are
0083 %                       clearly a better fit (optional, default 0.3 (lower is
0084 %                       less strict))
0085 %   maxPhylDist         -1: only use sequences from the same domain
0086 %                       (Prokaryota, Eukaryota)
0087 %                       other (positive) value: only use sequences for
0088 %                       organisms where the phylogenetic distance is at the
0089 %                       most this large (as calculated in getPhylDist)
0090 %                       (optional, default Inf, which means that all sequences
0091 %                       will be used)
0092 %   nSequences          for each KO, use up to this many sequences from the
0093 %                       most closely related species. This is mainly to
0094 %                       speed up the alignment process for KOs with very
0095 %                       many genes. This subsampling is performed before
0096 %                       running CD-HIT (optional, default inf)
0097 %   seqIdentity         sequence identity threshold in CD-HIT, referred as
0098 %                       "global sequence identity" in CD-HIT User's Guide.
0099 %                       If -1 is provided, CD-HIT is skipped (optional, default 0.9)
0100 %   globalModel         structure containing both model and KOModel
0101 %                       structures as generated by getModelFromKEGG. These
0102 %                       will otherwise be loaded by via getModelFromKEGG.
0103 %                       Providing globalKEGGmodel can speed up model
0104 %                       generation if getKEGGModelForOrganism is run
0105 %                       multiple times for different strains. Example:
0106 %                       [globalModel.model,globalModel.KOModel] = getModelFromKEGG;
0107 %                       (optional, default empty, global model is loaded by
0108 %                       getModelFromKEGG)
0109 %
0110 %   Output:
0111 %   model               the reconstructed model
0112 %
0113 %   PLEASE READ THIS: The input to this function can be confusing, because
0114 %   it is intended to be run in parallel on a cluster or in multiple
0115 %   sessions. It therefore saves a lot of intermediate results to storage.
0116 %   This also serves the purpose of not having to do redundant
0117 %   calculations. This, however, comes with the disadvantage of somewhat
0118 %   trickier handling. This is what this function does:
0119 %
0120 %   1a. Loads files from a local KEGG FTP dump and constructs a general
0121 %       RAVEN model representing the metabolic network. The functions
0122 %       getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the
0123 %       data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files,
0124 %       which are later merged into 'keggModel.mat' by getModelFromKEGG
0125 %       function. The function getPhylDist generates 'keggPhylDist.mat'
0126 %       file. KEGG FTP access requires a <a href="matlab:
0127 %       web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>.
0128 %   1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One
0129 %       multi-FASTA file for each KO in KEGG is generated.
0130 %
0131 %   The Step 1 has to be re-done every time KEGG updates their database (or
0132 %   rather when the updates are large enough to warrant re-running this
0133 %   part). Many users would probably never use this feature.
0134 %
0135 %   2a. Filters KO-specific protein sets. This is done by using the
0136 %       settings "maxPhylDist" and "nSequences" to control which sequences
0137 %       should be used for constructing Hidden Markov models (HMMs), and
0138 %       later for matching your sequences to.
0139 %       The most common alternatives here would be to use sequences from
0140 %       only eukaryotes, only prokaryotes or all sequences in KEGG, but you
0141 %       could also play around with the parameters to use e.g. only fungal
0142 %       sequences.
0143 %   2b. KO-specific protein FASTA files are re-organised into
0144 %       non-redundant protein sets with CD-HIT. The user can only set
0145 %       seqIdentity parameter, which corresponds to '-c' parameter in
0146 %       CD-HIT, described as "sequence identity threshold". CD-HIT suggsted
0147 %       sequence identity specific word_length (-n) parameters are used.
0148 %   2c. Does a multi sequence alignment for multi-FASTA files obtained in
0149 %       Step 2b for future use. MAFFT software with automatic selection of
0150 %       alignment algorithm is used in this step ('--auto').
0151 %   2d. Trains hidden Markov models using HMMer for each of the aligned
0152 %       KO-specific FASTA files obtained in Step 2c. This is performed with
0153 %       'hmmbuild' using the default settings.
0154 %
0155 %   Step 2 may be reasonable to be re-done if the user wants to tweak the
0156 %   settings in proteins filtering, clustering, multi sequence alignment or
0157 %   HMMs training steps. However, it requires to have KO-specific protein
0158 %   FASTA files obtained in Step 1a. As such files are not provided in
0159 %   RAVEN and BioMet ToolBox, the user can only generate these files from
0160 %   KEGG FTP dump files, so KEGG FTP license is needed.
0161 %
0162 %   3a. Queries the HMMs with sequences for the organism you are making a
0163 %       model for. This step uses both the output from step 1a and from 2d.
0164 %       This is done with 'hmmsearch' function under default settings. The
0165 %       significance threshold value set in 'cutOff' parameter is used
0166 %       later when parsing '*.out' files to filter out KO hits with higher
0167 %       value than 'cutOff' value. The results with passable E values are
0168 %       summarised into KO-gene occurence matrix with E values in
0169 %       intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and
0170 %       'minScoreRatioKO' are then applied to 'prune' KO-gene associations
0171 %       (see the function descriptions above for more details). The
0172 %       intersection values for these 'prunable' associations are converted
0173 %       to zeroes.
0174 %   3b. Constructs a model based on the pre-processed KO-gene association
0175 %       matrix (koGeneMat). As the full KEGG model already has reaction-KO
0176 %       relationships, KOs are converted into the query genes. The final
0177 %       draft model contains only these reactions, which are associated
0178 %       with KOs from koGeneMat. The reactions without the genes may also
0179 %       be included, if the user set keepSpontaneous as 'true'.
0180 %
0181 %   The Step 3 is specific to the organism for which the model is
0182 %   reconstructed.
0183 %
0184 %   In principle the function looks at which output that is already available
0185 %   and runs only the parts that are required for step 3. This means
0186 %   that (see the definition of the parameters for details):
0187 %   -1a is only performed if there are no KEGG model files in the
0188 %   RAVEN\external\kegg directory
0189 %   -1b is only performed if not all required HMMs OR aligned FASTA files
0190 %   OR multi-FASTA files exist in the defined dataDir. This means that this
0191 %   step is skipped if the HMMs are downloaded from BioMet Toolbox instead
0192 %   (see above). If not all files exist it will try to find
0193 %   the KEGG database files in dataDir.
0194 %   -2a is only performed if not all required HMMs OR aligned FASTA files
0195 %   files exist in the defined dataDir. This means that this step is skipped
0196 %   if the HMMs are downloaded from BioMet Toolbox instead (see above).
0197 %   -2b is only performed if not all required HMMs exist in the defined
0198 %   dataDir. This means that this step is skipped if the FASTA files or
0199 %   HMMs are downloaded from BioMet Toolbox instead (see above).
0200 %   -3a is performed for the required HMMs for which no corresponding .out
0201 %   file exists in outDir. This is just a way to enable the function to be
0202 %   run in parallel or to resume if interrupted.
0203 %   -3b is always performed.
0204 %
0205 %   These steps are specific to the organism for which you are
0206 %   reconstructing the model.
0207 %
0208 %   Regarding the whole pipeline, the function checks the output that is
0209 %   already available and runs only the parts that are required for step 3.
0210 %   This means that (see the definition of the parameters for details):
0211 %   -1a is only performed if there are no KEGG model files in the
0212 %   RAVEN\external\kegg directory.
0213 %   -1b is only performed if any of required KOs do not have HMMs, aligned
0214 %   FASTA files, clustered FASTA files and raw FASTA files in the defined
0215 %   dataDir. This means that this step is skipped if the HMMs are
0216 %   downloaded from BioMet Toolbox instead (see above). If not all files
0217 %   exist it will try to find the KEGG database files in dataDir.
0218 %   -2ab are only performed if any of required KOs do not have HMMs,
0219 %   aligned FASTA files and clustered FASTA files in the defined dataDir.
0220 %   This means that this step is skipped if the HMMs are downloaded from
0221 %   BioMet Toolbox instead (see above).
0222 %   -2c is only performed if any of required KOs do not have HMMs and
0223 %   aligned FASTA files in the defined dataDir. This means that this step
0224 %   is skipped if the HMMs are downloaded from BioMet Toolbox instead (see
0225 %   above).
0226 %   -2d is only performed if any of required KOs do not have HMMs exist in
0227 %   the defined dataDir. This means that this step is skipped if the FASTA
0228 %   files or HMMs are downloaded from BioMet Toolbox instead (see above).
0229 %   -3a is performed for the required HMMs for which no corresponding .out
0230 %   file exists in outDir. This is just a way to enable the function to be
0231 %   run in parallel or to resume if interrupted.
0232 %   -3b is always performed.
0233 %
0234 %   NOTE: it is also possible to obtain draft model from KEGG without
0235 %   providing protein FASTA file for the target organism. In such case the
0236 %   organism three-four letter abbreviation set as 'organismID' must exist
0237 %   in the local KEGG database. In such case, the program just fetches all
0238 %   the reactions, which are associated with given 'organismID'.
0239 %
0240 % Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
0241 %    outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
0242 %    keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
0243 %    nSequences,seqIdentity)
0244 
0245 if nargin<2 || isempty(fastaFile)
0246     fastaFile=[];
0247 else
0248     fastaFile=char(fastaFile);
0249 end
0250 if nargin<3
0251     dataDir=[];
0252 else
0253     dataDir=char(dataDir);
0254 end
0255 if nargin<4 || isempty(outDir)
0256     outDir=tempdir;
0257     %Delete all *.out files if any exist
0258     delete(fullfile(outDir,'*.out'));
0259 else
0260     outDir=char(outDir);
0261 end
0262 if nargin<5
0263     keepSpontaneous=true;
0264 end
0265 if nargin<6
0266     keepUndefinedStoich=true;
0267 end
0268 if nargin<7
0269     keepIncomplete=true;
0270 end
0271 if nargin<8
0272     keepGeneral=false;
0273 end
0274 if nargin<9
0275     cutOff=10^-50;
0276 end
0277 if nargin<10
0278     minScoreRatioKO=0.3;
0279 end
0280 if nargin<11
0281     minScoreRatioG=0.8;
0282 end
0283 if nargin<12
0284     maxPhylDist=inf;
0285     %Include all sequences for each reaction
0286 end
0287 if nargin<13
0288     nSequences=inf;
0289     %Include all sequences for each reaction
0290 end
0291 if nargin<14
0292     seqIdentity=0.9;
0293 end
0294 
0295 if isempty(fastaFile)
0296     fprintf(['\n*** The model reconstruction from KEGG based on the annotation available for KEGG Species <strong>' organismID '</strong> ***\n\n']);
0297 else
0298     fprintf('\n*** The model reconstruction from KEGG based on the protein homology search against KEGG Orthology specific HMMs ***\n\n');
0299     %Check if query fasta exists
0300     fastaFile=checkFileExistence(fastaFile,2); %Copy file to temp dir
0301 end
0302 
0303 %Run the external binaries multi-threaded to use all logical cores assigned
0304 %to MATLAB
0305 cores = evalc('feature(''numcores'')');
0306 cores = strsplit(cores, 'MATLAB was assigned: ');
0307 cores = regexp(cores{2},'^\d*','match');
0308 cores = cores{1};
0309 
0310 %Get the directory for RAVEN Toolbox.
0311 ravenPath=findRAVENroot();
0312 
0313 %Checking if dataDir is consistent. It must point to pre-trained HMMs set,
0314 %compatible with the the current RAVEN version. The user may have the
0315 %required zip file already in working directory or have it extracted. If
0316 %the zip file and directory is not here, it is downloaded from the cloud
0317 if ~isempty(dataDir)
0318     hmmOptions={'euk90_kegg105','prok90_kegg105'};
0319     if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions.
0320                                      %If not, then check whether the required folders exist anyway.
0321         if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) && ...
0322                 ~isfolder(fullfile(dataDir,'fasta')) && ...
0323                 ~isfolder(fullfile(dataDir,'aligned')) && ...
0324                 ~isfolder(fullfile(dataDir,'hmms'))
0325             error(['Pre-trained HMMs set is not recognised. If you want download RAVEN provided sets, it should match any of the following: ' strjoin(hmmOptions,' or ')])
0326         end
0327     else
0328         if isfolder(dataDir) && isfile(fullfile(dataDir,'hmms','K00844.hmm'))
0329             fprintf(['NOTE: Found <strong>' dataDir '</strong> directory with pre-trained HMMs, it will therefore be used during reconstruction\n']);
0330         elseif ~isfolder(dataDir) && isfile([dataDir,'.zip'])
0331             fprintf('Extracting the HMMs archive file... ');
0332             unzip([dataDir,'.zip']);
0333             fprintf('COMPLETE\n');
0334         else
0335             hmmIndex=strcmp(dataDir,hmmOptions);
0336             if ~any(hmmIndex)
0337                 error(['Pre-trained HMMs are only provided with proteins clustered at 90% sequence identity (i.e. prok90_kegg105 and euk90_kegg105). ' ...
0338                     'Use either of these datasets, or otherwise download the relevant sequence data from KEGG to train HMMs with your desired sequence identity'])
0339             else
0340                 fprintf('Downloading the HMMs archive file... ');
0341                 try
0342                     websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.8.0/',hmmOptions{hmmIndex},'.zip']);
0343                 catch ME
0344                     if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError')
0345                         error('Failed to download the HMMs archive file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues')
0346                     end
0347                 end
0348             end
0349             
0350             fprintf('COMPLETE\n');
0351             fprintf('Extracting the HMMs archive file... ');
0352             unzip([dataDir,'.zip']);
0353             fprintf('COMPLETE\n');
0354         end
0355         %Check if HMMs are extracted
0356         if ~isfile(fullfile(dataDir,'hmms','K00844.hmm'))
0357             error(['The HMM files seem improperly extracted and not found in ',dataDir,'/hmms. Please remove ',dataDir,' folder and rerun getKEGGModelForOrganism']);
0358         end
0359     end
0360 end
0361 
0362 %Check if the fasta-file contains '/' or'\'. If not then it's probably just
0363 %a file name. Expand to full path.
0364 if any(fastaFile)
0365     if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/'))
0366         fastaFile=which(fastaFile);
0367     end
0368     %Create the required sub-folders in dataDir if they dont exist
0369     if ~isfolder(fullfile(dataDir,'keggdb'))
0370         mkdir(dataDir,'keggdb');
0371     end
0372     if ~isfolder(fullfile(dataDir,'fasta'))
0373         mkdir(dataDir,'fasta');
0374     end
0375     if ~isfolder(fullfile(dataDir,'aligned'))
0376         mkdir(dataDir,'aligned');
0377     end
0378     if ~isfolder(fullfile(dataDir,'hmms'))
0379         mkdir(dataDir,'hmms');
0380     end
0381     if ~isfolder(outDir)
0382         mkdir(outDir);
0383     end
0384 end
0385 
0386 %First generate the full global KEGG model. Can be provided as input.
0387 %Otherwise, getModelFromKEGG is run. The dataDir must not be supplied as
0388 %there is also an internal RAVEN version available
0389 if nargin==15
0390     model=globalModel.model;
0391     KOModel=globalModel.KOModel;
0392 elseif any(dataDir)
0393     [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral);
0394 else
0395     [model, KOModel]=getModelFromKEGG([],keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral);
0396 end
0397 model.id=organismID;
0398 model.c=zeros(numel(model.rxns),1);
0399 
0400 %If no FASTA file is supplied, then just remove all genes which are not for
0401 %the given organism ID
0402 if isempty(fastaFile)
0403     %Check if organismID can be found in KEGG species list or is
0404     %set to "eukaryotes" or "prokaryotes"
0405     phylDistsFull=getPhylDist(fullfile(dataDir,'keggdb'),true);
0406     if ~ismember(organismID,[phylDistsFull.ids 'eukaryotes' 'prokaryotes'])
0407         error('Provided organismID is incorrect. Only species abbreviations from KEGG Species List or "eukaryotes"/"prokaryotes" are allowed.');
0408     end
0409     
0410     fprintf(['Pruning the model from <strong>non-' organismID '</strong> genes... ']);
0411     if ismember(organismID,{'eukaryotes','prokaryotes'})
0412         phylDists=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0413         if strcmp(organismID,'eukaryotes')
0414             proxyid='hsa';
0415             %Use H. sapiens here
0416         else
0417             proxyid='eco';
0418             %Use E. coli here
0419         end
0420         [~, phylDistId]=ismember(proxyid,phylDists.ids);
0421         idsToKeep=phylDists.ids(~isinf(phylDists.distMat(phylDistId,:)));
0422         taxIDs=cellfun(@(x) x{1},cellfun(@(x) strsplit(x,':'),model.genes,'UniformOutput',false),'UniformOutput',false);
0423         I=ismember(upper(taxIDs),upper(idsToKeep));
0424     else
0425         %KEGG organism IDs may have three or four letters
0426         organismID=strcat(organismID,':');
0427         %Add colon for accurate matching
0428         if length(organismID)==4
0429             I=cellfun(@(x) strcmpi(x(1:4),organismID),model.genes);
0430         elseif length(organismID)==5
0431             I=cellfun(@(x) strcmpi(x(1:5),organismID),model.genes);
0432         end
0433     end
0434     %Remove those genes
0435     model.genes=model.genes(I);
0436     model.rxnGeneMat=model.rxnGeneMat(:,I);
0437     fprintf('COMPLETE\n');
0438 end
0439 
0440 %First remove all reactions without genes
0441 if keepSpontaneous==true
0442     fprintf('Removing non-spontaneous reactions without GPR rules... ');
0443     load(fullfile(ravenPath,'external','kegg','keggRxns.mat'),'isSpontaneous');
0444     I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous);
0445     spontRxnsWithGenes=model.rxns(any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous));
0446 else
0447     fprintf('Removing reactions without GPR rules... ');
0448     I=~any(model.rxnGeneMat,2);
0449 end
0450 model=removeReactions(model,I,true);
0451 fprintf('COMPLETE\n');
0452 
0453 %Clean gene names
0454 fprintf('Fixing gene names in the model... ');
0455 %Get rid of the prefix organism id
0456 model.genes=regexprep(model.genes,'^\w+?:','');
0457 fprintf('COMPLETE\n');
0458 
0459 %If no FASTA file is supplied, then we are done here
0460 if isempty(fastaFile)
0461     %Create grRules
0462     fprintf('Constructing GPR associations and annotations for the model... ');
0463     model.grRules=cell(numel(model.rxns),1);
0464     model.grRules(:)={''};
0465     %Add the gene associations as 'or'
0466     for i=1:numel(model.rxns)
0467         %Find the involved genes
0468         I=find(model.rxnGeneMat(i,:));
0469         if any(I)
0470             model.grRules{i}=['(' model.genes{I(1)}];
0471             for j=2:numel(I)
0472                 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
0473             end
0474             model.grRules{i}=[model.grRules{i} ')'];
0475         end
0476     end
0477     %Fix grRules and reconstruct rxnGeneMat
0478     [grRules,rxnGeneMat] = standardizeGrRules(model); %Give detailed output
0479     model.grRules = grRules;
0480     model.rxnGeneMat = rxnGeneMat;
0481     %Add geneMiriams, assuming that it follows the syntax
0482     %kegg.genes/organismID:geneName
0483     model.geneMiriams='';
0484     for i=1:numel(model.genes)
0485         model.geneMiriams{i,1}.name{1,1}='kegg.genes';
0486         model.geneMiriams{i,1}.value{1,1}=strcat(lower(organismID),model.genes{i,1});
0487     end
0488     %Add the description to the reactions
0489     for i=1:numel(model.rxns)
0490         if ~isempty(model.rxnNotes{i})
0491             model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (without HMMs).',model.rxnNotes(i));
0492             model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. ');
0493         else
0494             model.rxnNotes(i)={'Included by getKEGGModelForOrganism (without HMMs)'};
0495         end
0496     end
0497     fprintf('COMPLETE\n\n');
0498     fprintf('*** Model reconstruction complete ***\n');
0499     return;
0500 end
0501 
0502 %Create a phylogenetic distance structure
0503 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0504 [~, phylDistId]=ismember(model.id,phylDistStruct.ids);
0505 
0506 %Calculate the real maximal distance now. An abitary large number of 1000
0507 %is used for the "all in kingdom" or "all sequences" options. This is a bit
0508 %inconvenient way to do it, but it is to make it fit with some older code
0509 if isinf(maxPhylDist) || maxPhylDist==-1
0510     maxPhylDist=1000;
0511 end
0512 
0513 %Get the KO ids for which files have been generated. Maybe not the neatest
0514 %way..
0515 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa'));
0516 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa'));
0517 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw'));
0518 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm'));
0519 outFiles=listFiles(fullfile(outDir,'*.out'));
0520 
0521 %Check if multi-FASTA files should be generated. This should only be
0522 %performed if there are IDs in the KOModel structure that haven't been
0523 %parsed yet
0524 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]);
0525 
0526 if ~isempty(missingFASTA)
0527     if ~isfile(fullfile(dataDir,'keggdb','genes.pep'))
0528         EM=['The file ''genes.pep'' cannot be located at ' strrep(dataDir,'\','/') '/ and should be downloaded from the KEGG FTP.\n'];
0529         dispEM(EM);
0530     end
0531     %Only construct models for KOs which don't have files already
0532     fastaModel=removeReactions(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true);
0533     %Permute the order of the KOs in the model so that constructMultiFasta
0534     %can be run on several processors at once
0535     fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns');
0536     constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta'));
0537 else
0538     fprintf('Generating the KEGG Orthology specific multi-FASTA files... COMPLETE\n');
0539 end
0540 
0541 if isunix
0542     if ismac
0543         binEnd='.mac';
0544     else
0545         binEnd='';
0546     end
0547 elseif ispc
0548     binEnd='';
0549 else
0550     EM='Unknown OS, exiting.';
0551     disp(EM);
0552     return
0553 end
0554 
0555 %Check if alignment of FASTA files should be performed
0556 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]);
0557 if ~isempty(missingAligned)
0558     if seqIdentity==-1
0559         fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets...   0%% complete');
0560     else
0561         fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets...   0%% complete');
0562     end
0563     missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned)));
0564     tmpFile=tempname;
0565     %On Windows, paths need to be translated to Unix before parsing it to WSL
0566     if ispc
0567         wslPath.tmpFile=getWSLpath(tmpFile);
0568         %mafft has problems writing to terminal (/dev/stderr) when running
0569         %on WSL via MATLAB, instead write and read progress file
0570         mafftOutput = tempname;
0571         wslPath.mafftOutput=getWSLpath(mafftOutput);
0572         wslPath.mafft=getWSLpath(fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat'));
0573         wslPath.cdhit=getWSLpath(fullfile(ravenPath,'software','cd-hit','cd-hit'));
0574     end
0575     
0576     for i=1:numel(missingAligned)
0577         %This is checked here because it could be that it is created by a
0578         %parallel process. The faw-files are saved as temporary files to
0579         %kept track of which files are being worked on
0580         if ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])) &&...
0581                 ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']))
0582             %Check that the multi-FASTA file exists. It should do so since
0583             %we are saving empty files as well. Print a warning and
0584             %continue if not
0585             if ~isfile(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']))
0586                 EM=['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist'];
0587                 dispEM(EM,false);
0588                 continue;
0589             end
0590             
0591             %If the multi-FASTA file is empty then save an empty aligned
0592             %file and continue
0593             s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0594             if s.bytes<=0
0595                 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w');
0596                 fclose(fid);
0597                 continue;
0598             end
0599             
0600             %Create an empty file to prevent other threads to start to work
0601             %on the same alignment
0602             fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w');
0603             fclose(fid);
0604             
0605             %First load the FASTA file, then select up to nSequences
0606             %sequences of the most closely related species, apply any
0607             %constraints from maxPhylDist, and save it as a temporary file,
0608             %and create the model from that
0609             
0610             fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0611             phylDist=inf(numel(fastaStruct),1);
0612             for j=1:numel(fastaStruct)
0613                 %Get the organism abbreviation
0614                 index=strfind(fastaStruct(j).Header,':');
0615                 if any(index)
0616                     abbrev=fastaStruct(j).Header(1:index(1)-1);
0617                     [~, index]=ismember(abbrev,phylDistStruct.ids);
0618                     if any(index)
0619                         phylDist(j)=phylDistStruct.distMat(index(1),phylDistId);
0620                     end
0621                 end
0622             end
0623             
0624             %Inf means that it should not be included
0625             phylDist(phylDist>maxPhylDist)=[];
0626             
0627             %Sort based on phylDist
0628             [~, order]=sort(phylDist);
0629             
0630             %Save the first nSequences hits to a temporary FASTA file
0631             if nSequences<=numel(fastaStruct)
0632                 fastaStruct=fastaStruct(order(1:nSequences));
0633             else
0634                 fastaStruct=fastaStruct(order);
0635             end
0636             
0637             %Do the clustering and alignment if there are more than one
0638             %sequences, otherwise just save the sequence (or an empty file)
0639             if numel(fastaStruct)>1
0640                  if seqIdentity~=-1
0641                     cdhitInpCustom=tempname;
0642                     fastawrite(cdhitInpCustom,fastaStruct);
0643                     if seqIdentity<=1 && seqIdentity>0.7
0644                         nparam='5';
0645                     elseif seqIdentity>0.6
0646                         nparam='4';
0647                     elseif seqIdentity>0.5
0648                         nparam='3';
0649                     elseif seqIdentity>0.4
0650                         nparam='2';
0651                     else
0652                         EM='The provided seqIdentity must be between 0 and 1\n';
0653                         dispEM(EM);
0654                     end
0655                     if ispc
0656                         wslPath.cdhitInpCustom=getWSLpath(cdhitInpCustom);
0657                         [status, output]=system(['wsl "' wslPath.cdhit '" -T "' num2str(cores) '" -i "' wslPath.cdhitInpCustom '" -o "' wslPath.tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']);
0658                     elseif ismac || isunix
0659                         [status, output]=system(['"' fullfile(ravenPath,'software','cd-hit',['cd-hit' binEnd]) '" -T "' num2str(cores) '" -i "' cdhitInpCustom '" -o "' tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']);
0660                     end
0661                     if status~=0
0662                         EM=['Error when performing clustering of ' missingAligned{i} ':\n' output];
0663                         dispEM(EM);
0664                     end
0665                     %Remove the old tempfile
0666                     if exist(cdhitInpCustom, 'file')
0667                         delete([cdhitInpCustom '*']);
0668                     end
0669                 else
0670                     %This means that CD-HIT should be skipped since
0671                     %seqIdentity is equal to -1
0672                     fastawrite(tmpFile,fastaStruct);
0673                 end
0674                 %Do the alignment for this file
0675                 if ismac
0676                     [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-mac','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']);
0677                 elseif isunix
0678                     [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']);
0679                 elseif ispc
0680                     wslPath.fawFile=getWSLpath(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']));
0681                     [status, ~]=system(['wsl "' wslPath.mafft '" --auto --anysymbol --progress "' wslPath.mafftOutput '" --thread "' num2str(cores) '" --out "' wslPath.fawFile '" "' wslPath.tmpFile '"']);
0682                     output=fileread(mafftOutput);
0683                     delete(mafftOutput);
0684                 end
0685                 if status~=0
0686                     %It could be that alignment failed because only one
0687                     %sequence was left after clustering. If that is the
0688                     %case, then the clustered file is just copied as 'faw'
0689                     %file
0690                     if any(regexp(output,'Only 1 sequence found'))
0691                         movefile(tmpFile,fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'f');
0692                     else
0693                         EM=['Error when performing alignment of ' missingAligned{i} ':\n' output];
0694                         dispEM(EM);
0695                     end
0696                 end
0697                 %Remove the old tempfile
0698                 if exist(tmpFile, 'file')
0699                     delete([tmpFile '*']);
0700                 end
0701             else
0702                 %If there is only one sequence then it's not possible to do
0703                 %a multiple alignment. Just print the sequence instead. An
0704                 %empty file was written previously so that doesn't have to
0705                 %be dealt with
0706                 if numel(fastaStruct)==1
0707                     warnState = warning; %Save the current warning state
0708                     warning('off','Bioinfo:fastawrite:AppendToFile');
0709                     fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct);
0710                     warning(warnState) %Reset warning state to previous settings
0711                 end
0712             end
0713             %Move the temporary file to the real one
0714             movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f');
0715             
0716             %Print the progress every 25 files
0717             if rem(i-1,25) == 0
0718                 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'aligned','*.fa')))/numel(KOModel.rxns)));
0719                 progress=pad(progress,3,'left');
0720                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0721             end
0722         end
0723     end
0724     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0725 else
0726     if seqIdentity==-1
0727         fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n');
0728     else
0729         fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n');
0730     end
0731 end
0732 
0733 %Check if training of Hidden Markov models should be performed
0734 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]);
0735 if ~isempty(missingHMMs)
0736     fprintf('Training the KEGG Orthology specific HMMs...   0%% complete');
0737     missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs)));
0738     %Train models for all missing KOs
0739     for i=1:numel(missingHMMs)
0740         %This is checked here because it could be that it is created by a
0741         %parallel process
0742         if ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm'])) && ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']))
0743             %Check that the aligned FASTA file exists. It could be that it
0744             %is still being worked on by some other instance of the program
0745             %(the .faw file should then exist). This should not happen on a
0746             %single computer. It doesn't throw an error, because it should
0747             %finalize the ones it can
0748             if ~isfile(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']))
0749                 EM=['The aligned FASTA file for ' missingHMMs{i} ' does not exist'];
0750                 dispEM(EM,false);
0751                 continue;
0752             end
0753             
0754             %If the multi-FASTA file is empty then save an empty aligned
0755             %file and continue
0756             s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']));
0757             if s.bytes<=0
0758                 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w');
0759                 fclose(fid);
0760                 continue;
0761             end
0762             %Create a temporary file to indicate that it is working on the
0763             %KO. This is because hmmbuild cannot overwrite existing files
0764             fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w');
0765             fclose(fid);
0766             
0767             %Create HMM
0768             [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmbuild' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']);
0769             if status~=0
0770                 EM=['Error when training HMM for ' missingHMMs{i} ':\n' output];
0771                 dispEM(EM);
0772             end
0773             
0774             %Delete the temporary file
0775             delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']));
0776 
0777             %Print the progress every 25 files
0778             if rem(i-1,25) == 0
0779                 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'hmms','*.hmm')))/numel(KOModel.rxns)));
0780                 progress=pad(progress,3,'left');
0781                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0782             end
0783         end
0784     end
0785     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0786 else
0787     fprintf('Training the KEGG Orthology specific HMMs... COMPLETE\n');
0788 end
0789 
0790 %Check which new .out files that should be generated. Check if training of
0791 %Hidden Markov models should be performed
0792 missingOUT=setdiff(KOModel.rxns,outFiles);
0793 if ~isempty(missingOUT)
0794     fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs...   0%% complete');
0795     missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT)));
0796     for i=1:numel(missingOUT)
0797         %This is checked here because it could be that it is created by a
0798         %parallel process
0799         if ~isfile(fullfile(outDir,[missingOUT{i} '.out']))
0800             %Check that the HMM file exists. It should do so since %we are
0801             %saving empty files as well. Print a warning and continue if
0802             %not
0803             if ~isfile(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']))
0804                 EM=['The HMM file for ' missingOUT{i} ' does not exist'];
0805                 dispEM(EM,false);
0806                 continue;
0807             end
0808             
0809             %Save an empty file to prevent several threads working on the
0810             %same file
0811             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0812             fclose(fid);
0813             
0814             %If the HMM file is empty then save an out file and continue
0815             s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']));
0816             if s.bytes<=0
0817                 continue;
0818             end
0819             
0820             %Check each gene in the input file against this model
0821             [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmsearch' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']);
0822             if status~=0
0823                 EM=['Error when querying HMM for ' missingOUT{i} ':\n' output];
0824                 dispEM(EM);
0825             end
0826             
0827             %Save the output to a file
0828             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0829             fwrite(fid,output);
0830             fclose(fid);
0831             
0832             %Print the progress every 25 files
0833             if rem(i-1,25) == 0
0834                 progress=num2str(floor(100*numel(listFiles(fullfile(outDir,'*.out')))/numel(KOModel.rxns)));
0835                 progress=pad(progress,3,'left');
0836                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0837             end
0838         end
0839     end
0840     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0841 else
0842     fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... COMPLETE\n');
0843 end
0844 
0845 
0846 %***Begin retrieving the output and putting together the resulting model
0847 
0848 fprintf('Parsing the HMM search results... ');
0849 %Retrieve matched genes from the HMMs
0850 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes
0851 genes=cell(3000,1);
0852 %Store the best score for a gene in a hash list (since it will be searching
0853 %many times)
0854 hTable = java.util.Hashtable;
0855 
0856 geneCounter=0;
0857 for i=1:numel(KOModel.rxns)
0858     if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file')
0859         fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r');
0860         beginMatches=false;
0861         while 1
0862             %Get the next line
0863             tline = fgetl(fid);
0864             
0865             %Abort at end of file
0866             if ~ischar(tline)
0867                 break;
0868             end
0869             
0870             if and(beginMatches,strcmp(tline,'  ------ inclusion threshold ------'))
0871                 break;
0872             end
0873             
0874             if beginMatches==false
0875                 %This is how the listing of matches begins
0876                 if any(strfind(tline,'E-value '))
0877                     %Read one more line that is only padding
0878                     tline = fgetl(fid);
0879                     beginMatches=true;
0880                 end
0881             else
0882                 %If matches should be read
0883                 if ~strcmp(tline,'   [No hits detected that satisfy reporting thresholds]') && ~isempty(tline)
0884                     elements=regexp(tline,' ','split');
0885                     elements=elements(cellfun(@any,elements));
0886                     
0887                     %Check if the match is below the treshhold
0888                     score=str2double(elements{1});
0889                     gene=elements{9};
0890                     if score<=cutOff
0891                         %If the score is exactly 0, change it to a very
0892                         %small value to avoid NaN
0893                         if score==0
0894                             score=10^-250;
0895                         end
0896                         %Check if the gene is added already and, is so, get
0897                         %the best score for it
0898                         I=hTable.get(gene);
0899                         if any(I)
0900                             koGeneMat(i,I)=score;
0901                         else
0902                             geneCounter=geneCounter+1;
0903                             %The gene was not present yet so add it
0904                             hTable.put(gene,geneCounter);
0905                             genes{geneCounter}=gene;
0906                             koGeneMat(i,geneCounter)=score;
0907                         end
0908                     end
0909                 else
0910                     break;
0911                 end
0912             end
0913         end
0914         fclose(fid);
0915     end
0916 end
0917 fprintf('COMPLETE\n');
0918 
0919 fprintf('Removing gene, KEGG Orthology associations below minScoreRatioKO, minScoreRatioG... ');
0920 koGeneMat=koGeneMat(:,1:geneCounter);
0921 
0922 %Remove the genes for each KO that are below minScoreRatioKO.
0923 for i=1:size(koGeneMat,1)
0924     J=find(koGeneMat(i,:));
0925     if any(J)
0926         koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0;
0927     end
0928 end
0929 
0930 %Remove the KOs for each gene that are below minScoreRatioG
0931 for i=1:size(koGeneMat,2)
0932     J=find(koGeneMat(:,i));
0933     if any(J)
0934         koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0;
0935     end
0936 end
0937 fprintf('COMPLETE\n');
0938 
0939 fprintf('Adding gene annotations to the model... ');
0940 %Create the new model
0941 model.genes=genes(1:geneCounter);
0942 model.grRules=cell(numel(model.rxns),1);
0943 model.grRules(:)={''};
0944 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes));
0945 
0946 %Loop through the reactions and add the corresponding genes
0947 for i=1:numel(model.rxns)
0948     if isstruct(model.rxnMiriams{i})
0949         %Get all KOs
0950         I=find(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'));
0951         KOs=model.rxnMiriams{i}.value(I);
0952         %Find the KOs and the corresponding genes
0953         J=ismember(KOModel.rxns,KOs);
0954         [~, K]=find(koGeneMat(J,:));
0955         
0956         if any(K)
0957             model.rxnGeneMat(i,K)=1;
0958             %Also delete KOs for which no genes were found. If no genes at
0959             %all were matched to the reaction it will be deleted later
0960             L=sum(koGeneMat(J,:),2)==0;
0961             model.rxnMiriams{i}.value(I(L))=[];
0962             model.rxnMiriams{i}.name(I(L))=[];
0963         end
0964     end
0965 end
0966 fprintf('COMPLETE\n');
0967 
0968 %Find and delete all reactions without genes. This also removes genes that
0969 %are not used (which could happen because minScoreRatioG and
0970 %minScoreRatioKO). If keepSpontaneous==true, the spontaneous reactions
0971 %without genes are kept in the model. Spontaneous reactions with original
0972 %gene associations are treated in the same way, like the rest of the
0973 %reactions - if gene associations were removed during HMM search, such
0974 %reactions are deleted from the model
0975 if keepSpontaneous==true
0976     %Not the most comprise way to delete reactions without genes, but this
0977     %makes the code easier to understand. Firstly the non-spontaneous
0978     %reactions without genes are removed. After that, the second deletion
0979     %step removes spontaneous reactions, which had gene associations before
0980     %HMM search, but no longer have after it
0981     fprintf('Removing non-spontaneous reactions which after HMM search no longer have GPR rules... ');
0982     I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous);
0983     model=removeReactions(model,I,true,true);
0984     I=~any(model.rxnGeneMat,2)&ismember(model.rxns,spontRxnsWithGenes);
0985     model=removeReactions(model,I,true,true);
0986 else
0987     %Just simply check for any new reactions without genes and remove
0988     %it
0989     fprintf('Removing reactions which after HMM search no longer have GPR rules... ');
0990     I=~any(model.rxnGeneMat,2);
0991     model=removeReactions(model,I,true,true);
0992 end
0993 fprintf('COMPLETE\n');
0994 
0995 fprintf('Constructing GPR rules and finalizing the model... ');
0996 %Add the gene associations as 'or'
0997 for i=1:numel(model.rxns)
0998     %Find the involved genes
0999     I=find(model.rxnGeneMat(i,:));
1000     if any(I)
1001         model.grRules{i}=['(' model.genes{I(1)}];
1002         for j=2:numel(I)
1003             model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
1004         end
1005         model.grRules{i}=[model.grRules{i} ')'];
1006     end
1007 end
1008 
1009 %Fix grRules and reconstruct rxnGeneMat
1010 [grRules,rxnGeneMat] = standardizeGrRules(model,false); %Give detailed output
1011 model.grRules = grRules;
1012 model.rxnGeneMat = rxnGeneMat;
1013 
1014 %Fix subsystems
1015 emptySubSystems=cellfun(@isempty, model.subSystems);
1016 model.subSystems(emptySubSystems)={{''}};
1017 
1018 %Add the description to the reactions
1019 for i=1:numel(model.rxns)
1020     if ~isempty(model.rxnNotes{i})
1021         model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (using HMMs).',model.rxnNotes(i));
1022         model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. ');
1023     else
1024         model.rxnNotes(i)={'Included by getKEGGModelForOrganism (using HMMs)'};
1025     end
1026 end
1027 %Remove the temp fasta file
1028 delete(fastaFile)
1029 fprintf('COMPLETE\n\n*** Model reconstruction complete ***\n');
1030 end
1031 
1032 function files=listFiles(directory)
1033 %Supporter function to list the files in a directory and return them as a
1034 %cell array
1035 temp=dir(directory);
1036 files=cell(numel(temp),1);
1037 for i=1:numel(temp)
1038     files{i}=temp(i,1).name;
1039 end
1040 files=strrep(files,'.fa','');
1041 files=strrep(files,'.hmm','');
1042 files=strrep(files,'.out','');
1043 files=strrep(files,'.faw','');
1044 end

getKEGGModelForOrganism

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SUBFUNCTIONS

SOURCE CODE