Home > external > kegg > getKEGGModelForOrganism.m

getKEGGModelForOrganism

PURPOSE ^

getKEGGModelForOrganism

SYNOPSIS ^

function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,nSequences,seqIdentity,globalModel)

DESCRIPTION ^

 getKEGGModelForOrganism
   Reconstructs a genome-scale metabolic model based on protein homology
   to the orthologies in KEGG. If the target species is not available in
   KEGG, the user must select a closely related species. It is also
   possible to circumvent protein homology search (see fastaFile parameter
   for more details)

   Input:
   organismID          three or four letter abbreviation of the organism
                       (as used in KEGG). If not available, use a closely
                       related species. This is used for determing the
                       phylogenetic distance. Use 'eukaryotes' or
                       'prokaryotes' to get a model for the whole domain.
                       Only applicable if fastaFile is empty, i.e. no
                       homology search should be performed
   fastaFile           a FASTA file that contains the protein sequences of
                       the organism for which to reconstruct a model (opt,
                       if no FASTA file is supplied then a model is
                       reconstructed based only on the organism
                       abbreviation. This option ignores all settings
                       except for keepSpontaneous, keepUndefinedStoich,
                       keepIncomplete and keepGeneral)
   dataDir             directory for which to retrieve the input data.
                       Should contain a combination of these sub-folders:
                       -dataDir\keggdb
                           The KEGG database files used in 1a (see below)
                       -dataDir\fasta
                           The multi-FASTA files generated in 1b (see
                           below)
                       -dataDir\aligned
                           The aligned FASTA files as generated in 2a (see
                           below)
                       -dataDir\hmms
                           The hidden Markov models as generated in 2b or
                           downloaded from BioMet Toolbox (see below)
                       The final directory in dataDir should be styled as
                       proXXX_keggYY or eukXXX_keggYY, indicating whether
                       the HMMs were trained on pro- or eukaryotic
                       sequences, using a sequence similarity threshold of
                       XXX %, fitting the KEGG version YY. E.g.
                       euk90_kegg105. (opt, see note about fastaFile. Note
                       that in order to rebuild the KEGG model from a
                       database dump, as opposed to using the version
                       supplied with RAVEN, you would still need to supply
                       this)
   outDir              directory to save the results from the quering of
                       the hidden Markov models. The output is specific
                       for the input sequences and the settings used. It
                       is stored in this manner so that the function can
                       continue if interrupted or if it should run in
                       parallel. Be careful not to leave output files from
                       different organisms or runs with different settings
                       in the same folder. They will not be overwritten
                       (opt, default is a temporary dir where all *.out
                       files are deleted before and after doing the
                       reconstruction)
   keepSpontaneous     include reactions labeled as "spontaneous". (opt,
                       default true)
   keepUndefinedStoich    include reactions in the form n A <=> n+1 A. These
                       will be dealt with as two separate metabolites
                       (opt, default true)
   keepIncomplete      include reactions which have been labelled as
                       "incomplete", "erroneous" or "unclear" (opt,
                       default true)
   keepGeneral         include reactions which have been labelled as
                       "general reaction". These are reactions on the form
                       "an aldehyde <=> an alcohol", and are therefore
                       unsuited for modelling purposes. Note that not all
                       reactions have this type of annotation, and the
                       script will therefore not be able to remove all
                       such reactions (opt, default false)
   cutOff              significance score from HMMer needed to assign
                       genes to a KO (opt, default 10^-50)
   minScoreRatioG      a gene is only assigned to KOs for which the score
                       is >=log(score)/log(best score) for that gene. This
                       is to prevent that a gene which clearly belongs to
                       one KO is assigned also to KOs with much lower
                       scores (opt, default 0.8 (lower is less strict))
   minScoreRatioKO     ignore genes in a KO if their score is
                       <log(score)/log(best score in KO). This is to
                       "prune" KOs which have many genes and where some are
                       clearly a better fit (opt, default 0.3 (lower is
                       less strict))
   maxPhylDist         -1: only use sequences from the same domain
                       (Prokaryota, Eukaryota)
                       other (positive) value: only use sequences for
                       organisms where the phylogenetic distance is at the
                       most this large (as calculated in getPhylDist)
                       (opt, default Inf, which means that all sequences
                       will be used)
   nSequences          for each KO, use up to this many sequences from the
                       most closely related species. This is mainly to
                       speed up the alignment process for KOs with very
                       many genes. This subsampling is performed before
                       running CD-HIT (opt, default inf)
   seqIdentity         sequence identity threshold in CD-HIT, referred as
                       "global sequence identity" in CD-HIT User's Guide.
                       If -1 is provided, CD-HIT is skipped (opt, default 0.9)
   globalModel         structure containing both model and KOModel
                       structures as generated by getModelFromKEGG. These
                       will otherwise be loaded by via getModelFromKEGG. 
                       Providing globalKEGGmodel can speed up model
                       generation if getKEGGModelForOrganism is run
                       multiple times for different strains. Example:
                       [globalModel.model,globalModel.KOModel] = getModelFromKEGG;
                       (opt, default empty, global model is loaded by 
                       getModelFromKEGG)

   Output:
   model               the reconstructed model

   PLEASE READ THIS: The input to this function can be confusing, because
   it is intended to be run in parallel on a cluster or in multiple
   sessions. It therefore saves a lot of intermediate results to storage.
   This also serves the purpose of not having to do redundant
   calculations. This, however, comes with the disadvantage of somewhat
   trickier handling. This is what this function does:

   1a. Loads files from a local KEGG FTP dump and constructs a general
       RAVEN model representing the metabolic network. The functions
       getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the
       data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files,
       which are later merged into 'keggModel.mat' by getModelFromKEGG
       function. The function getPhylDist generates 'keggPhylDist.mat'
       file. KEGG FTP access requires a <a href="matlab:
       web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>.
   1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One
       multi-FASTA file for each KO in KEGG is generated.

   The Step 1 has to be re-done every time KEGG updates their database (or
   rather when the updates are large enough to warrant re-running this
   part). Many users would probably never use this feature.

   2a. Filters KO-specific protein sets. This is done by using the
       settings "maxPhylDist" and "nSequences" to control which sequences
       should be used for constructing Hidden Markov models (HMMs), and
       later for matching your sequences to.
       The most common alternatives here would be to use sequences from
       only eukaryotes, only prokaryotes or all sequences in KEGG, but you
       could also play around with the parameters to use e.g. only fungal
       sequences.
   2b. KO-specific protein FASTA files are re-organised into
       non-redundant protein sets with CD-HIT. The user can only set
       seqIdentity parameter, which corresponds to '-c' parameter in
       CD-HIT, described as "sequence identity threshold". CD-HIT suggsted
       sequence identity specific word_length (-n) parameters are used.
   2c. Does a multi sequence alignment for multi-FASTA files obtained in
       Step 2b for future use. MAFFT software with automatic selection of
       alignment algorithm is used in this step ('--auto').
   2d. Trains hidden Markov models using HMMer for each of the aligned
       KO-specific FASTA files obtained in Step 2c. This is performed with
       'hmmbuild' using the default settings.

   Step 2 may be reasonable to be re-done if the user wants to tweak the
   settings in proteins filtering, clustering, multi sequence alignment or
   HMMs training steps. However, it requires to have KO-specific protein
   FASTA files obtained in Step 1a. As such files are not provided in
   RAVEN and BioMet ToolBox, the user can only generate these files from
   KEGG FTP dump files, so KEGG FTP license is needed.

   3a. Queries the HMMs with sequences for the organism you are making a
       model for. This step uses both the output from step 1a and from 2d.
       This is done with 'hmmsearch' function under default settings. The
       significance threshold value set in 'cutOff' parameter is used
       later when parsing '*.out' files to filter out KO hits with higher
       value than 'cutOff' value. The results with passable E values are
       summarised into KO-gene occurence matrix with E values in
       intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and
       'minScoreRatioKO' are then applied to 'prune' KO-gene associations
       (see the function descriptions above for more details). The
       intersection values for these 'prunable' associations are converted
       to zeroes.
   3b. Constructs a model based on the pre-processed KO-gene association
       matrix (koGeneMat). As the full KEGG model already has reaction-KO
       relationships, KOs are converted into the query genes. The final
       draft model contains only these reactions, which are associated
       with KOs from koGeneMat. The reactions without the genes may also
       be included, if the user set keepSpontaneous as 'true'.

   The Step 3 is specific to the organism for which the model is
   reconstructed.

   In principle the function looks at which output that is already available
   and runs only the parts that are required for step 3. This means
   that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the
   RAVEN\external\kegg directory
   -1b is only performed if not all required HMMs OR aligned FASTA files
   OR multi-FASTA files exist in the defined dataDir. This means that this
   step is skipped if the HMMs are downloaded from BioMet Toolbox instead
   (see above). If not all files exist it will try to find
   the KEGG database files in dataDir.
   -2a is only performed if not all required HMMs OR aligned FASTA files
   files exist in the defined dataDir. This means that this step is skipped
   if the HMMs are downloaded from BioMet Toolbox instead (see above).
   -2b is only performed if not all required HMMs exist in the defined
   dataDir. This means that this step is skipped if the FASTA files or
   HMMs are downloaded from BioMet Toolbox instead (see above).
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallel or to resume if interrupted.
   -3b is always performed.

   These steps are specific to the organism for which you are
   reconstructing the model.

   Regarding the whole pipeline, the function checks the output that is
   already available and runs only the parts that are required for step 3.
   This means that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the
   RAVEN\external\kegg directory.
   -1b is only performed if any of required KOs do not have HMMs, aligned
   FASTA files, clustered FASTA files and raw FASTA files in the defined
   dataDir. This means that this step is skipped if the HMMs are
   downloaded from BioMet Toolbox instead (see above). If not all files
   exist it will try to find the KEGG database files in dataDir.
   -2ab are only performed if any of required KOs do not have HMMs,
   aligned FASTA files and clustered FASTA files in the defined dataDir.
   This means that this step is skipped if the HMMs are downloaded from
   BioMet Toolbox instead (see above).
   -2c is only performed if any of required KOs do not have HMMs and
   aligned FASTA files in the defined dataDir. This means that this step
   is skipped if the HMMs are downloaded from BioMet Toolbox instead (see
   above).
   -2d is only performed if any of required KOs do not have HMMs exist in
   the defined dataDir. This means that this step is skipped if the FASTA
   files or HMMs are downloaded from BioMet Toolbox instead (see above).
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallel or to resume if interrupted.
   -3b is always performed.

   NOTE: it is also possible to obtain draft model from KEGG without
   providing protein FASTA file for the target organism. In such case the
   organism three-four letter abbreviation set as 'organismID' must exist
   in the local KEGG database. In such case, the program just fetches all
   the reactions, which are associated with given 'organismID'.

   Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
    outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
    keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
    nSequences,seqIdentity)

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
0002     outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
0003     keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
0004     nSequences,seqIdentity,globalModel)
0005 % getKEGGModelForOrganism
0006 %   Reconstructs a genome-scale metabolic model based on protein homology
0007 %   to the orthologies in KEGG. If the target species is not available in
0008 %   KEGG, the user must select a closely related species. It is also
0009 %   possible to circumvent protein homology search (see fastaFile parameter
0010 %   for more details)
0011 %
0012 %   Input:
0013 %   organismID          three or four letter abbreviation of the organism
0014 %                       (as used in KEGG). If not available, use a closely
0015 %                       related species. This is used for determing the
0016 %                       phylogenetic distance. Use 'eukaryotes' or
0017 %                       'prokaryotes' to get a model for the whole domain.
0018 %                       Only applicable if fastaFile is empty, i.e. no
0019 %                       homology search should be performed
0020 %   fastaFile           a FASTA file that contains the protein sequences of
0021 %                       the organism for which to reconstruct a model (opt,
0022 %                       if no FASTA file is supplied then a model is
0023 %                       reconstructed based only on the organism
0024 %                       abbreviation. This option ignores all settings
0025 %                       except for keepSpontaneous, keepUndefinedStoich,
0026 %                       keepIncomplete and keepGeneral)
0027 %   dataDir             directory for which to retrieve the input data.
0028 %                       Should contain a combination of these sub-folders:
0029 %                       -dataDir\keggdb
0030 %                           The KEGG database files used in 1a (see below)
0031 %                       -dataDir\fasta
0032 %                           The multi-FASTA files generated in 1b (see
0033 %                           below)
0034 %                       -dataDir\aligned
0035 %                           The aligned FASTA files as generated in 2a (see
0036 %                           below)
0037 %                       -dataDir\hmms
0038 %                           The hidden Markov models as generated in 2b or
0039 %                           downloaded from BioMet Toolbox (see below)
0040 %                       The final directory in dataDir should be styled as
0041 %                       proXXX_keggYY or eukXXX_keggYY, indicating whether
0042 %                       the HMMs were trained on pro- or eukaryotic
0043 %                       sequences, using a sequence similarity threshold of
0044 %                       XXX %, fitting the KEGG version YY. E.g.
0045 %                       euk90_kegg105. (opt, see note about fastaFile. Note
0046 %                       that in order to rebuild the KEGG model from a
0047 %                       database dump, as opposed to using the version
0048 %                       supplied with RAVEN, you would still need to supply
0049 %                       this)
0050 %   outDir              directory to save the results from the quering of
0051 %                       the hidden Markov models. The output is specific
0052 %                       for the input sequences and the settings used. It
0053 %                       is stored in this manner so that the function can
0054 %                       continue if interrupted or if it should run in
0055 %                       parallel. Be careful not to leave output files from
0056 %                       different organisms or runs with different settings
0057 %                       in the same folder. They will not be overwritten
0058 %                       (opt, default is a temporary dir where all *.out
0059 %                       files are deleted before and after doing the
0060 %                       reconstruction)
0061 %   keepSpontaneous     include reactions labeled as "spontaneous". (opt,
0062 %                       default true)
0063 %   keepUndefinedStoich    include reactions in the form n A <=> n+1 A. These
0064 %                       will be dealt with as two separate metabolites
0065 %                       (opt, default true)
0066 %   keepIncomplete      include reactions which have been labelled as
0067 %                       "incomplete", "erroneous" or "unclear" (opt,
0068 %                       default true)
0069 %   keepGeneral         include reactions which have been labelled as
0070 %                       "general reaction". These are reactions on the form
0071 %                       "an aldehyde <=> an alcohol", and are therefore
0072 %                       unsuited for modelling purposes. Note that not all
0073 %                       reactions have this type of annotation, and the
0074 %                       script will therefore not be able to remove all
0075 %                       such reactions (opt, default false)
0076 %   cutOff              significance score from HMMer needed to assign
0077 %                       genes to a KO (opt, default 10^-50)
0078 %   minScoreRatioG      a gene is only assigned to KOs for which the score
0079 %                       is >=log(score)/log(best score) for that gene. This
0080 %                       is to prevent that a gene which clearly belongs to
0081 %                       one KO is assigned also to KOs with much lower
0082 %                       scores (opt, default 0.8 (lower is less strict))
0083 %   minScoreRatioKO     ignore genes in a KO if their score is
0084 %                       <log(score)/log(best score in KO). This is to
0085 %                       "prune" KOs which have many genes and where some are
0086 %                       clearly a better fit (opt, default 0.3 (lower is
0087 %                       less strict))
0088 %   maxPhylDist         -1: only use sequences from the same domain
0089 %                       (Prokaryota, Eukaryota)
0090 %                       other (positive) value: only use sequences for
0091 %                       organisms where the phylogenetic distance is at the
0092 %                       most this large (as calculated in getPhylDist)
0093 %                       (opt, default Inf, which means that all sequences
0094 %                       will be used)
0095 %   nSequences          for each KO, use up to this many sequences from the
0096 %                       most closely related species. This is mainly to
0097 %                       speed up the alignment process for KOs with very
0098 %                       many genes. This subsampling is performed before
0099 %                       running CD-HIT (opt, default inf)
0100 %   seqIdentity         sequence identity threshold in CD-HIT, referred as
0101 %                       "global sequence identity" in CD-HIT User's Guide.
0102 %                       If -1 is provided, CD-HIT is skipped (opt, default 0.9)
0103 %   globalModel         structure containing both model and KOModel
0104 %                       structures as generated by getModelFromKEGG. These
0105 %                       will otherwise be loaded by via getModelFromKEGG.
0106 %                       Providing globalKEGGmodel can speed up model
0107 %                       generation if getKEGGModelForOrganism is run
0108 %                       multiple times for different strains. Example:
0109 %                       [globalModel.model,globalModel.KOModel] = getModelFromKEGG;
0110 %                       (opt, default empty, global model is loaded by
0111 %                       getModelFromKEGG)
0112 %
0113 %   Output:
0114 %   model               the reconstructed model
0115 %
0116 %   PLEASE READ THIS: The input to this function can be confusing, because
0117 %   it is intended to be run in parallel on a cluster or in multiple
0118 %   sessions. It therefore saves a lot of intermediate results to storage.
0119 %   This also serves the purpose of not having to do redundant
0120 %   calculations. This, however, comes with the disadvantage of somewhat
0121 %   trickier handling. This is what this function does:
0122 %
0123 %   1a. Loads files from a local KEGG FTP dump and constructs a general
0124 %       RAVEN model representing the metabolic network. The functions
0125 %       getRxnsFromKEGG, getGenesFromKEGG, getMetsFromKEGG summarise the
0126 %       data into 'keggRxns.mat', 'keggGenes.mat' and 'keggMets.mat' files,
0127 %       which are later merged into 'keggModel.mat' by getModelFromKEGG
0128 %       function. The function getPhylDist generates 'keggPhylDist.mat'
0129 %       file. KEGG FTP access requires a <a href="matlab:
0130 %       web('http://www.bioinformatics.jp/en/keggftp.html')">license</a>.
0131 %   1b. Generates protein FASTA files from the KEGG FTP dump (see 1a). One
0132 %       multi-FASTA file for each KO in KEGG is generated.
0133 %
0134 %   The Step 1 has to be re-done every time KEGG updates their database (or
0135 %   rather when the updates are large enough to warrant re-running this
0136 %   part). Many users would probably never use this feature.
0137 %
0138 %   2a. Filters KO-specific protein sets. This is done by using the
0139 %       settings "maxPhylDist" and "nSequences" to control which sequences
0140 %       should be used for constructing Hidden Markov models (HMMs), and
0141 %       later for matching your sequences to.
0142 %       The most common alternatives here would be to use sequences from
0143 %       only eukaryotes, only prokaryotes or all sequences in KEGG, but you
0144 %       could also play around with the parameters to use e.g. only fungal
0145 %       sequences.
0146 %   2b. KO-specific protein FASTA files are re-organised into
0147 %       non-redundant protein sets with CD-HIT. The user can only set
0148 %       seqIdentity parameter, which corresponds to '-c' parameter in
0149 %       CD-HIT, described as "sequence identity threshold". CD-HIT suggsted
0150 %       sequence identity specific word_length (-n) parameters are used.
0151 %   2c. Does a multi sequence alignment for multi-FASTA files obtained in
0152 %       Step 2b for future use. MAFFT software with automatic selection of
0153 %       alignment algorithm is used in this step ('--auto').
0154 %   2d. Trains hidden Markov models using HMMer for each of the aligned
0155 %       KO-specific FASTA files obtained in Step 2c. This is performed with
0156 %       'hmmbuild' using the default settings.
0157 %
0158 %   Step 2 may be reasonable to be re-done if the user wants to tweak the
0159 %   settings in proteins filtering, clustering, multi sequence alignment or
0160 %   HMMs training steps. However, it requires to have KO-specific protein
0161 %   FASTA files obtained in Step 1a. As such files are not provided in
0162 %   RAVEN and BioMet ToolBox, the user can only generate these files from
0163 %   KEGG FTP dump files, so KEGG FTP license is needed.
0164 %
0165 %   3a. Queries the HMMs with sequences for the organism you are making a
0166 %       model for. This step uses both the output from step 1a and from 2d.
0167 %       This is done with 'hmmsearch' function under default settings. The
0168 %       significance threshold value set in 'cutOff' parameter is used
0169 %       later when parsing '*.out' files to filter out KO hits with higher
0170 %       value than 'cutOff' value. The results with passable E values are
0171 %       summarised into KO-gene occurence matrix with E values in
0172 %       intersections as 'koGeneMat'. The parameters 'minScoreRatioG' and
0173 %       'minScoreRatioKO' are then applied to 'prune' KO-gene associations
0174 %       (see the function descriptions above for more details). The
0175 %       intersection values for these 'prunable' associations are converted
0176 %       to zeroes.
0177 %   3b. Constructs a model based on the pre-processed KO-gene association
0178 %       matrix (koGeneMat). As the full KEGG model already has reaction-KO
0179 %       relationships, KOs are converted into the query genes. The final
0180 %       draft model contains only these reactions, which are associated
0181 %       with KOs from koGeneMat. The reactions without the genes may also
0182 %       be included, if the user set keepSpontaneous as 'true'.
0183 %
0184 %   The Step 3 is specific to the organism for which the model is
0185 %   reconstructed.
0186 %
0187 %   In principle the function looks at which output that is already available
0188 %   and runs only the parts that are required for step 3. This means
0189 %   that (see the definition of the parameters for details):
0190 %   -1a is only performed if there are no KEGG model files in the
0191 %   RAVEN\external\kegg directory
0192 %   -1b is only performed if not all required HMMs OR aligned FASTA files
0193 %   OR multi-FASTA files exist in the defined dataDir. This means that this
0194 %   step is skipped if the HMMs are downloaded from BioMet Toolbox instead
0195 %   (see above). If not all files exist it will try to find
0196 %   the KEGG database files in dataDir.
0197 %   -2a is only performed if not all required HMMs OR aligned FASTA files
0198 %   files exist in the defined dataDir. This means that this step is skipped
0199 %   if the HMMs are downloaded from BioMet Toolbox instead (see above).
0200 %   -2b is only performed if not all required HMMs exist in the defined
0201 %   dataDir. This means that this step is skipped if the FASTA files or
0202 %   HMMs are downloaded from BioMet Toolbox instead (see above).
0203 %   -3a is performed for the required HMMs for which no corresponding .out
0204 %   file exists in outDir. This is just a way to enable the function to be
0205 %   run in parallel or to resume if interrupted.
0206 %   -3b is always performed.
0207 %
0208 %   These steps are specific to the organism for which you are
0209 %   reconstructing the model.
0210 %
0211 %   Regarding the whole pipeline, the function checks the output that is
0212 %   already available and runs only the parts that are required for step 3.
0213 %   This means that (see the definition of the parameters for details):
0214 %   -1a is only performed if there are no KEGG model files in the
0215 %   RAVEN\external\kegg directory.
0216 %   -1b is only performed if any of required KOs do not have HMMs, aligned
0217 %   FASTA files, clustered FASTA files and raw FASTA files in the defined
0218 %   dataDir. This means that this step is skipped if the HMMs are
0219 %   downloaded from BioMet Toolbox instead (see above). If not all files
0220 %   exist it will try to find the KEGG database files in dataDir.
0221 %   -2ab are only performed if any of required KOs do not have HMMs,
0222 %   aligned FASTA files and clustered FASTA files in the defined dataDir.
0223 %   This means that this step is skipped if the HMMs are downloaded from
0224 %   BioMet Toolbox instead (see above).
0225 %   -2c is only performed if any of required KOs do not have HMMs and
0226 %   aligned FASTA files in the defined dataDir. This means that this step
0227 %   is skipped if the HMMs are downloaded from BioMet Toolbox instead (see
0228 %   above).
0229 %   -2d is only performed if any of required KOs do not have HMMs exist in
0230 %   the defined dataDir. This means that this step is skipped if the FASTA
0231 %   files or HMMs are downloaded from BioMet Toolbox instead (see above).
0232 %   -3a is performed for the required HMMs for which no corresponding .out
0233 %   file exists in outDir. This is just a way to enable the function to be
0234 %   run in parallel or to resume if interrupted.
0235 %   -3b is always performed.
0236 %
0237 %   NOTE: it is also possible to obtain draft model from KEGG without
0238 %   providing protein FASTA file for the target organism. In such case the
0239 %   organism three-four letter abbreviation set as 'organismID' must exist
0240 %   in the local KEGG database. In such case, the program just fetches all
0241 %   the reactions, which are associated with given 'organismID'.
0242 %
0243 %   Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
0244 %    outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
0245 %    keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist,...
0246 %    nSequences,seqIdentity)
0247 
0248 if nargin<2 || isempty(fastaFile)
0249     fastaFile=[];
0250 else
0251     fastaFile=char(fastaFile);
0252 end
0253 if nargin<3
0254     dataDir=[];
0255 else
0256     dataDir=char(dataDir);
0257 end
0258 if nargin<4 || isempty(outDir)
0259     outDir=tempdir;
0260     %Delete all *.out files if any exist
0261     delete(fullfile(outDir,'*.out'));
0262 else
0263     outDir=char(outDir);
0264 end
0265 if nargin<5
0266     keepSpontaneous=true;
0267 end
0268 if nargin<6
0269     keepUndefinedStoich=true;
0270 end
0271 if nargin<7
0272     keepIncomplete=true;
0273 end
0274 if nargin<8
0275     keepGeneral=false;
0276 end
0277 if nargin<9
0278     cutOff=10^-50;
0279 end
0280 if nargin<10
0281     minScoreRatioKO=0.3;
0282 end
0283 if nargin<11
0284     minScoreRatioG=0.8;
0285 end
0286 if nargin<12
0287     maxPhylDist=inf;
0288     %Include all sequences for each reaction
0289 end
0290 if nargin<13
0291     nSequences=inf;
0292     %Include all sequences for each reaction
0293 end
0294 if nargin<14
0295     seqIdentity=0.9;
0296 end
0297 
0298 if isempty(fastaFile)
0299     fprintf(['\n*** The model reconstruction from KEGG based on the annotation available for KEGG Species <strong>' organismID '</strong> ***\n\n']);
0300 else
0301     fprintf('\n*** The model reconstruction from KEGG based on the protein homology search against KEGG Orthology specific HMMs ***\n\n');
0302     %Check if query fasta exists
0303     fastaFile=checkFileExistence(fastaFile,2); %Copy file to temp dir
0304 end
0305 
0306 %Run the external binaries multi-threaded to use all logical cores assigned
0307 %to MATLAB
0308 cores = evalc('feature(''numcores'')');
0309 cores = strsplit(cores, 'MATLAB was assigned: ');
0310 cores = regexp(cores{2},'^\d*','match');
0311 cores = cores{1};
0312 
0313 %Get the directory for RAVEN Toolbox.
0314 ravenPath=findRAVENroot();
0315 
0316 %Checking if dataDir is consistent. It must point to pre-trained HMMs set,
0317 %compatible with the the current RAVEN version. The user may have the
0318 %required zip file already in working directory or have it extracted. If
0319 %the zip file and directory is not here, it is downloaded from the cloud
0320 if ~isempty(dataDir)
0321     hmmOptions={'euk90_kegg105','prok90_kegg105'};
0322     if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions.
0323                                      %If not, then check whether the required folders exist anyway.
0324         if ~isfile(fullfile(dataDir,'keggdb','genes.pep')) && ...
0325                 ~isfolder(fullfile(dataDir,'fasta')) && ...
0326                 ~isfolder(fullfile(dataDir,'aligned')) && ...
0327                 ~isfolder(fullfile(dataDir,'hmms'))
0328             error(['Pre-trained HMMs set is not recognised. If you want download RAVEN provided sets, it should match any of the following: ' strjoin(hmmOptions,' or ')])
0329         end
0330     else
0331         if isfolder(dataDir) && isfile(fullfile(dataDir,'hmms','K00844.hmm'))
0332             fprintf(['NOTE: Found <strong>' dataDir '</strong> directory with pre-trained HMMs, it will therefore be used during reconstruction\n']);
0333         elseif ~isfolder(dataDir) && isfile([dataDir,'.zip'])
0334             fprintf('Extracting the HMMs archive file... ');
0335             unzip([dataDir,'.zip']);
0336             fprintf('COMPLETE\n');
0337         else
0338             hmmIndex=strcmp(dataDir,hmmOptions);
0339             if ~any(hmmIndex)
0340                 error(['Pre-trained HMMs are only provided with proteins clustered at 90% sequence identity (i.e. prok90_kegg105 and euk90_kegg105). ' ...
0341                     'Use either of these datasets, or otherwise download the relevant sequence data from KEGG to train HMMs with your desired sequence identity'])
0342             else
0343                 fprintf('Downloading the HMMs archive file... ');
0344                 try
0345                     websave([dataDir,'.zip'],['https://github.com/SysBioChalmers/RAVEN/releases/download/v2.8.0/',hmmOptions{hmmIndex},'.zip']);
0346                 catch ME
0347                     if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError')
0348                         error('Failed to download the HMMs archive file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues')
0349                     end
0350                 end
0351             end
0352             
0353             fprintf('COMPLETE\n');
0354             fprintf('Extracting the HMMs archive file... ');
0355             unzip([dataDir,'.zip']);
0356             fprintf('COMPLETE\n');
0357         end
0358         %Check if HMMs are extracted
0359         if ~isfile(fullfile(dataDir,'hmms','K00844.hmm'))
0360             error(['The HMM files seem improperly extracted and not found in ',dataDir,'/hmms. Please remove ',dataDir,' folder and rerun getKEGGModelForOrganism']);
0361         end
0362     end
0363 end
0364 
0365 %Check if the fasta-file contains '/' or'\'. If not then it's probably just
0366 %a file name. Expand to full path.
0367 if any(fastaFile)
0368     if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/'))
0369         fastaFile=which(fastaFile);
0370     end
0371     %Create the required sub-folders in dataDir if they dont exist
0372     if ~isfolder(fullfile(dataDir,'keggdb'))
0373         mkdir(dataDir,'keggdb');
0374     end
0375     if ~isfolder(fullfile(dataDir,'fasta'))
0376         mkdir(dataDir,'fasta');
0377     end
0378     if ~isfolder(fullfile(dataDir,'aligned'))
0379         mkdir(dataDir,'aligned');
0380     end
0381     if ~isfolder(fullfile(dataDir,'hmms'))
0382         mkdir(dataDir,'hmms');
0383     end
0384     if ~isfolder(outDir)
0385         mkdir(outDir);
0386     end
0387 end
0388 
0389 %First generate the full global KEGG model. Can be provided as input.
0390 %Otherwise, getModelFromKEGG is run. The dataDir must not be supplied as
0391 %there is also an internal RAVEN version available
0392 if nargin==15
0393     model=globalModel.model;
0394     KOModel=globalModel.KOModel;
0395 elseif any(dataDir)
0396     [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral);
0397 else
0398     [model, KOModel]=getModelFromKEGG([],keepSpontaneous,keepUndefinedStoich,keepIncomplete,keepGeneral);
0399 end
0400 model.id=organismID;
0401 model.c=zeros(numel(model.rxns),1);
0402 
0403 %If no FASTA file is supplied, then just remove all genes which are not for
0404 %the given organism ID
0405 if isempty(fastaFile)
0406     %Check if organismID can be found in KEGG species list or is
0407     %set to "eukaryotes" or "prokaryotes"
0408     phylDistsFull=getPhylDist(fullfile(dataDir,'keggdb'),true);
0409     if ~ismember(organismID,[phylDistsFull.ids 'eukaryotes' 'prokaryotes'])
0410         error('Provided organismID is incorrect. Only species abbreviations from KEGG Species List or "eukaryotes"/"prokaryotes" are allowed.');
0411     end
0412     
0413     fprintf(['Pruning the model from <strong>non-' organismID '</strong> genes... ']);
0414     if ismember(organismID,{'eukaryotes','prokaryotes'})
0415         phylDists=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0416         if strcmp(organismID,'eukaryotes')
0417             proxyid='hsa';
0418             %Use H. sapiens here
0419         else
0420             proxyid='eco';
0421             %Use E. coli here
0422         end
0423         [~, phylDistId]=ismember(proxyid,phylDists.ids);
0424         idsToKeep=phylDists.ids(~isinf(phylDists.distMat(phylDistId,:)));
0425         taxIDs=cellfun(@(x) x{1},cellfun(@(x) strsplit(x,':'),model.genes,'UniformOutput',false),'UniformOutput',false);
0426         I=ismember(upper(taxIDs),upper(idsToKeep));
0427     else
0428         %KEGG organism IDs may have three or four letters
0429         organismID=strcat(organismID,':');
0430         %Add colon for accurate matching
0431         if length(organismID)==4
0432             I=cellfun(@(x) strcmpi(x(1:4),organismID),model.genes);
0433         elseif length(organismID)==5
0434             I=cellfun(@(x) strcmpi(x(1:5),organismID),model.genes);
0435         end
0436     end
0437     %Remove those genes
0438     model.genes=model.genes(I);
0439     model.rxnGeneMat=model.rxnGeneMat(:,I);
0440     fprintf('COMPLETE\n');
0441 end
0442 
0443 %First remove all reactions without genes
0444 if keepSpontaneous==true
0445     fprintf('Removing non-spontaneous reactions without GPR rules... ');
0446     load(fullfile(ravenPath,'external','kegg','keggRxns.mat'),'isSpontaneous');
0447     I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous);
0448     spontRxnsWithGenes=model.rxns(any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous));
0449 else
0450     fprintf('Removing reactions without GPR rules... ');
0451     I=~any(model.rxnGeneMat,2);
0452 end
0453 model=removeReactions(model,I,true);
0454 fprintf('COMPLETE\n');
0455 
0456 %Clean gene names
0457 fprintf('Fixing gene names in the model... ');
0458 %Get rid of the prefix organism id
0459 model.genes=regexprep(model.genes,'^\w+?:','');
0460 fprintf('COMPLETE\n');
0461 
0462 %If no FASTA file is supplied, then we are done here
0463 if isempty(fastaFile)
0464     %Create grRules
0465     fprintf('Constructing GPR associations and annotations for the model... ');
0466     model.grRules=cell(numel(model.rxns),1);
0467     model.grRules(:)={''};
0468     %Add the gene associations as 'or'
0469     for i=1:numel(model.rxns)
0470         %Find the involved genes
0471         I=find(model.rxnGeneMat(i,:));
0472         if any(I)
0473             model.grRules{i}=['(' model.genes{I(1)}];
0474             for j=2:numel(I)
0475                 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
0476             end
0477             model.grRules{i}=[model.grRules{i} ')'];
0478         end
0479     end
0480     %Fix grRules and reconstruct rxnGeneMat
0481     [grRules,rxnGeneMat] = standardizeGrRules(model); %Give detailed output
0482     model.grRules = grRules;
0483     model.rxnGeneMat = rxnGeneMat;
0484     %Add geneMiriams, assuming that it follows the syntax
0485     %kegg.genes/organismID:geneName
0486     model.geneMiriams='';
0487     for i=1:numel(model.genes)
0488         model.geneMiriams{i,1}.name{1,1}='kegg.genes';
0489         model.geneMiriams{i,1}.value{1,1}=strcat(lower(organismID),model.genes{i,1});
0490     end
0491     %Add the description to the reactions
0492     for i=1:numel(model.rxns)
0493         if ~isempty(model.rxnNotes{i})
0494             model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (without HMMs).',model.rxnNotes(i));
0495             model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. ');
0496         else
0497             model.rxnNotes(i)={'Included by getKEGGModelForOrganism (without HMMs)'};
0498         end
0499     end
0500     fprintf('COMPLETE\n\n');
0501     fprintf('*** Model reconstruction complete ***\n');
0502     return;
0503 end
0504 
0505 %Create a phylogenetic distance structure
0506 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0507 [~, phylDistId]=ismember(model.id,phylDistStruct.ids);
0508 
0509 %Calculate the real maximal distance now. An abitary large number of 1000
0510 %is used for the "all in kingdom" or "all sequences" options. This is a bit
0511 %inconvenient way to do it, but it is to make it fit with some older code
0512 if isinf(maxPhylDist) || maxPhylDist==-1
0513     maxPhylDist=1000;
0514 end
0515 
0516 %Get the KO ids for which files have been generated. Maybe not the neatest
0517 %way..
0518 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa'));
0519 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa'));
0520 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw'));
0521 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm'));
0522 outFiles=listFiles(fullfile(outDir,'*.out'));
0523 
0524 %Check if multi-FASTA files should be generated. This should only be
0525 %performed if there are IDs in the KOModel structure that haven't been
0526 %parsed yet
0527 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]);
0528 
0529 if ~isempty(missingFASTA)
0530     if ~isfile(fullfile(dataDir,'keggdb','genes.pep'))
0531         EM=['The file ''genes.pep'' cannot be located at ' strrep(dataDir,'\','/') '/ and should be downloaded from the KEGG FTP.\n'];
0532         dispEM(EM);
0533     end
0534     %Only construct models for KOs which don't have files already
0535     fastaModel=removeReactions(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true);
0536     %Permute the order of the KOs in the model so that constructMultiFasta
0537     %can be run on several processors at once
0538     fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns');
0539     constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta'));
0540 else
0541     fprintf('Generating the KEGG Orthology specific multi-FASTA files... COMPLETE\n');
0542 end
0543 
0544 if isunix
0545     if ismac
0546         binEnd='.mac';
0547     else
0548         binEnd='';
0549     end
0550 elseif ispc
0551     binEnd='';
0552 else
0553     EM='Unknown OS, exiting.';
0554     disp(EM);
0555     return
0556 end
0557 
0558 %Check if alignment of FASTA files should be performed
0559 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]);
0560 if ~isempty(missingAligned)
0561     if seqIdentity==-1
0562         fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets...   0%% complete');
0563     else
0564         fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets...   0%% complete');
0565     end
0566     missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned)));
0567     tmpFile=tempname;
0568     %On Windows, paths need to be translated to Unix before parsing it to WSL
0569     if ispc
0570         wslPath.tmpFile=getWSLpath(tmpFile);
0571         %mafft has problems writing to terminal (/dev/stderr) when running
0572         %on WSL via MATLAB, instead write and read progress file
0573         mafftOutput = tempname;
0574         wslPath.mafftOutput=getWSLpath(mafftOutput);
0575         wslPath.mafft=getWSLpath(fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat'));
0576         wslPath.cdhit=getWSLpath(fullfile(ravenPath,'software','cd-hit','cd-hit'));
0577     end
0578     
0579     for i=1:numel(missingAligned)
0580         %This is checked here because it could be that it is created by a
0581         %parallel process. The faw-files are saved as temporary files to
0582         %kept track of which files are being worked on
0583         if ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw'])) &&...
0584                 ~isfile(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']))
0585             %Check that the multi-FASTA file exists. It should do so since
0586             %we are saving empty files as well. Print a warning and
0587             %continue if not
0588             if ~isfile(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']))
0589                 EM=['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist'];
0590                 dispEM(EM,false);
0591                 continue;
0592             end
0593             
0594             %If the multi-FASTA file is empty then save an empty aligned
0595             %file and continue
0596             s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0597             if s.bytes<=0
0598                 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w');
0599                 fclose(fid);
0600                 continue;
0601             end
0602             
0603             %Create an empty file to prevent other threads to start to work
0604             %on the same alignment
0605             fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w');
0606             fclose(fid);
0607             
0608             %First load the FASTA file, then select up to nSequences
0609             %sequences of the most closely related species, apply any
0610             %constraints from maxPhylDist, and save it as a temporary file,
0611             %and create the model from that
0612             
0613             fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0614             phylDist=inf(numel(fastaStruct),1);
0615             for j=1:numel(fastaStruct)
0616                 %Get the organism abbreviation
0617                 index=strfind(fastaStruct(j).Header,':');
0618                 if any(index)
0619                     abbrev=fastaStruct(j).Header(1:index(1)-1);
0620                     [~, index]=ismember(abbrev,phylDistStruct.ids);
0621                     if any(index)
0622                         phylDist(j)=phylDistStruct.distMat(index(1),phylDistId);
0623                     end
0624                 end
0625             end
0626             
0627             %Inf means that it should not be included
0628             phylDist(phylDist>maxPhylDist)=[];
0629             
0630             %Sort based on phylDist
0631             [~, order]=sort(phylDist);
0632             
0633             %Save the first nSequences hits to a temporary FASTA file
0634             if nSequences<=numel(fastaStruct)
0635                 fastaStruct=fastaStruct(order(1:nSequences));
0636             else
0637                 fastaStruct=fastaStruct(order);
0638             end
0639             
0640             %Do the clustering and alignment if there are more than one
0641             %sequences, otherwise just save the sequence (or an empty file)
0642             if numel(fastaStruct)>1
0643                  if seqIdentity~=-1
0644                     cdhitInpCustom=tempname;
0645                     fastawrite(cdhitInpCustom,fastaStruct);
0646                     if seqIdentity<=1 && seqIdentity>0.7
0647                         nparam='5';
0648                     elseif seqIdentity>0.6
0649                         nparam='4';
0650                     elseif seqIdentity>0.5
0651                         nparam='3';
0652                     elseif seqIdentity>0.4
0653                         nparam='2';
0654                     else
0655                         EM='The provided seqIdentity must be between 0 and 1\n';
0656                         dispEM(EM);
0657                     end
0658                     if ispc
0659                         wslPath.cdhitInpCustom=getWSLpath(cdhitInpCustom);
0660                         [status, output]=system(['wsl "' wslPath.cdhit '" -T "' num2str(cores) '" -i "' wslPath.cdhitInpCustom '" -o "' wslPath.tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']);
0661                     elseif ismac || isunix
0662                         [status, output]=system(['"' fullfile(ravenPath,'software','cd-hit',['cd-hit' binEnd]) '" -T "' num2str(cores) '" -i "' cdhitInpCustom '" -o "' tmpFile '" -c "' num2str(seqIdentity) '" -n ' nparam ' -M 2000']);
0663                     end
0664                     if status~=0
0665                         EM=['Error when performing clustering of ' missingAligned{i} ':\n' output];
0666                         dispEM(EM);
0667                     end
0668                     %Remove the old tempfile
0669                     if exist(cdhitInpCustom, 'file')
0670                         delete([cdhitInpCustom '*']);
0671                     end
0672                 else
0673                     %This means that CD-HIT should be skipped since
0674                     %seqIdentity is equal to -1
0675                     fastawrite(tmpFile,fastaStruct);
0676                 end
0677                 %Do the alignment for this file
0678                 if ismac
0679                     [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-mac','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']);
0680                 elseif isunix
0681                     [status, output]=system(['"' fullfile(ravenPath,'software','mafft','mafft-linux64','mafft.bat') '" --auto --anysymbol --thread "' num2str(cores) '" "' tmpFile '" > "' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '"']);
0682                 elseif ispc
0683                     wslPath.fawFile=getWSLpath(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']));
0684                     [status, ~]=system(['wsl "' wslPath.mafft '" --auto --anysymbol --progress "' wslPath.mafftOutput '" --thread "' num2str(cores) '" --out "' wslPath.fawFile '" "' wslPath.tmpFile '"']);
0685                     output=fileread(mafftOutput);
0686                     delete(mafftOutput);
0687                 end
0688                 if status~=0
0689                     %It could be that alignment failed because only one
0690                     %sequence was left after clustering. If that is the
0691                     %case, then the clustered file is just copied as 'faw'
0692                     %file
0693                     if any(regexp(output,'Only 1 sequence found'))
0694                         movefile(tmpFile,fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'f');
0695                     else
0696                         EM=['Error when performing alignment of ' missingAligned{i} ':\n' output];
0697                         dispEM(EM);
0698                     end
0699                 end
0700                 %Remove the old tempfile
0701                 if exist(tmpFile, 'file')
0702                     delete([tmpFile '*']);
0703                 end
0704             else
0705                 %If there is only one sequence then it's not possible to do
0706                 %a multiple alignment. Just print the sequence instead. An
0707                 %empty file was written previously so that doesn't have to
0708                 %be dealt with
0709                 if numel(fastaStruct)==1
0710                     warnState = warning; %Save the current warning state
0711                     warning('off','Bioinfo:fastawrite:AppendToFile');
0712                     fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct);
0713                     warning(warnState) %Reset warning state to previous settings
0714                 end
0715             end
0716             %Move the temporary file to the real one
0717             movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f');
0718             
0719             %Print the progress every 25 files
0720             if rem(i-1,25) == 0
0721                 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'aligned','*.fa')))/numel(KOModel.rxns)));
0722                 progress=pad(progress,3,'left');
0723                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0724             end
0725         end
0726     end
0727     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0728 else
0729     if seqIdentity==-1
0730         fprintf('Performing the multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n');
0731     else
0732         fprintf('Performing clustering and multiple alignment for KEGG Orthology specific protein sets... COMPLETE\n');
0733     end
0734 end
0735 
0736 %Check if training of Hidden Markov models should be performed
0737 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]);
0738 if ~isempty(missingHMMs)
0739     fprintf('Training the KEGG Orthology specific HMMs...   0%% complete');
0740     missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs)));
0741     %Train models for all missing KOs
0742     for i=1:numel(missingHMMs)
0743         %This is checked here because it could be that it is created by a
0744         %parallel process
0745         if ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm'])) && ~isfile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']))
0746             %Check that the aligned FASTA file exists. It could be that it
0747             %is still being worked on by some other instance of the program
0748             %(the .faw file should then exist). This should not happen on a
0749             %single computer. It doesn't throw an error, because it should
0750             %finalize the ones it can
0751             if ~isfile(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']))
0752                 EM=['The aligned FASTA file for ' missingHMMs{i} ' does not exist'];
0753                 dispEM(EM,false);
0754                 continue;
0755             end
0756             
0757             %If the multi-FASTA file is empty then save an empty aligned
0758             %file and continue
0759             s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']));
0760             if s.bytes<=0
0761                 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w');
0762                 fclose(fid);
0763                 continue;
0764             end
0765             %Create a temporary file to indicate that it is working on the
0766             %KO. This is because hmmbuild cannot overwrite existing files
0767             fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w');
0768             fclose(fid);
0769             
0770             %Create HMM
0771             [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmbuild' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']);
0772             if status~=0
0773                 EM=['Error when training HMM for ' missingHMMs{i} ':\n' output];
0774                 dispEM(EM);
0775             end
0776             
0777             %Delete the temporary file
0778             delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']));
0779 
0780             %Print the progress every 25 files
0781             if rem(i-1,25) == 0
0782                 progress=num2str(floor(100*numel(listFiles(fullfile(dataDir,'hmms','*.hmm')))/numel(KOModel.rxns)));
0783                 progress=pad(progress,3,'left');
0784                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0785             end
0786         end
0787     end
0788     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0789 else
0790     fprintf('Training the KEGG Orthology specific HMMs... COMPLETE\n');
0791 end
0792 
0793 %Check which new .out files that should be generated. Check if training of
0794 %Hidden Markov models should be performed
0795 missingOUT=setdiff(KOModel.rxns,outFiles);
0796 if ~isempty(missingOUT)
0797     fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs...   0%% complete');
0798     missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT)));
0799     for i=1:numel(missingOUT)
0800         %This is checked here because it could be that it is created by a
0801         %parallel process
0802         if ~isfile(fullfile(outDir,[missingOUT{i} '.out']))
0803             %Check that the HMM file exists. It should do so since %we are
0804             %saving empty files as well. Print a warning and continue if
0805             %not
0806             if ~isfile(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']))
0807                 EM=['The HMM file for ' missingOUT{i} ' does not exist'];
0808                 dispEM(EM,false);
0809                 continue;
0810             end
0811             
0812             %Save an empty file to prevent several threads working on the
0813             %same file
0814             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0815             fclose(fid);
0816             
0817             %If the HMM file is empty then save an out file and continue
0818             s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']));
0819             if s.bytes<=0
0820                 continue;
0821             end
0822             
0823             %Check each gene in the input file against this model
0824             [status, output]=system(['"' fullfile(ravenPath,'software','hmmer',['hmmsearch' binEnd]) '" --cpu "' num2str(cores) '" "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']);
0825             if status~=0
0826                 EM=['Error when querying HMM for ' missingOUT{i} ':\n' output];
0827                 dispEM(EM);
0828             end
0829             
0830             %Save the output to a file
0831             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0832             fwrite(fid,output);
0833             fclose(fid);
0834             
0835             %Print the progress every 25 files
0836             if rem(i-1,25) == 0
0837                 progress=num2str(floor(100*numel(listFiles(fullfile(outDir,'*.out')))/numel(KOModel.rxns)));
0838                 progress=pad(progress,3,'left');
0839                 fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\b%s%% complete',progress);
0840             end
0841         end
0842     end
0843     fprintf('\b\b\b\b\b\b\b\b\b\b\b\b\bCOMPLETE\n');
0844 else
0845     fprintf('Querying the user-specified FASTA file against the KEGG Orthology specific HMMs... COMPLETE\n');
0846 end
0847 
0848 
0849 %***Begin retrieving the output and putting together the resulting model
0850 
0851 fprintf('Parsing the HMM search results... ');
0852 %Retrieve matched genes from the HMMs
0853 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes
0854 genes=cell(3000,1);
0855 %Store the best score for a gene in a hash list (since it will be searching
0856 %many times)
0857 hTable = java.util.Hashtable;
0858 
0859 geneCounter=0;
0860 for i=1:numel(KOModel.rxns)
0861     if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file')
0862         fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r');
0863         beginMatches=false;
0864         while 1
0865             %Get the next line
0866             tline = fgetl(fid);
0867             
0868             %Abort at end of file
0869             if ~ischar(tline)
0870                 break;
0871             end
0872             
0873             if and(beginMatches,strcmp(tline,'  ------ inclusion threshold ------'))
0874                 break;
0875             end
0876             
0877             if beginMatches==false
0878                 %This is how the listing of matches begins
0879                 if any(strfind(tline,'E-value '))
0880                     %Read one more line that is only padding
0881                     tline = fgetl(fid);
0882                     beginMatches=true;
0883                 end
0884             else
0885                 %If matches should be read
0886                 if ~strcmp(tline,'   [No hits detected that satisfy reporting thresholds]') && ~isempty(tline)
0887                     elements=regexp(tline,' ','split');
0888                     elements=elements(cellfun(@any,elements));
0889                     
0890                     %Check if the match is below the treshhold
0891                     score=str2double(elements{1});
0892                     gene=elements{9};
0893                     if score<=cutOff
0894                         %If the score is exactly 0, change it to a very
0895                         %small value to avoid NaN
0896                         if score==0
0897                             score=10^-250;
0898                         end
0899                         %Check if the gene is added already and, is so, get
0900                         %the best score for it
0901                         I=hTable.get(gene);
0902                         if any(I)
0903                             koGeneMat(i,I)=score;
0904                         else
0905                             geneCounter=geneCounter+1;
0906                             %The gene was not present yet so add it
0907                             hTable.put(gene,geneCounter);
0908                             genes{geneCounter}=gene;
0909                             koGeneMat(i,geneCounter)=score;
0910                         end
0911                     end
0912                 else
0913                     break;
0914                 end
0915             end
0916         end
0917         fclose(fid);
0918     end
0919 end
0920 fprintf('COMPLETE\n');
0921 
0922 fprintf('Removing gene, KEGG Orthology associations below minScoreRatioKO, minScoreRatioG... ');
0923 koGeneMat=koGeneMat(:,1:geneCounter);
0924 
0925 %Remove the genes for each KO that are below minScoreRatioKO.
0926 for i=1:size(koGeneMat,1)
0927     J=find(koGeneMat(i,:));
0928     if any(J)
0929         koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0;
0930     end
0931 end
0932 
0933 %Remove the KOs for each gene that are below minScoreRatioG
0934 for i=1:size(koGeneMat,2)
0935     J=find(koGeneMat(:,i));
0936     if any(J)
0937         koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0;
0938     end
0939 end
0940 fprintf('COMPLETE\n');
0941 
0942 fprintf('Adding gene annotations to the model... ');
0943 %Create the new model
0944 model.genes=genes(1:geneCounter);
0945 model.grRules=cell(numel(model.rxns),1);
0946 model.grRules(:)={''};
0947 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes));
0948 
0949 %Loop through the reactions and add the corresponding genes
0950 for i=1:numel(model.rxns)
0951     if isstruct(model.rxnMiriams{i})
0952         %Get all KOs
0953         I=find(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'));
0954         KOs=model.rxnMiriams{i}.value(I);
0955         %Find the KOs and the corresponding genes
0956         J=ismember(KOModel.rxns,KOs);
0957         [~, K]=find(koGeneMat(J,:));
0958         
0959         if any(K)
0960             model.rxnGeneMat(i,K)=1;
0961             %Also delete KOs for which no genes were found. If no genes at
0962             %all were matched to the reaction it will be deleted later
0963             L=sum(koGeneMat(J,:),2)==0;
0964             model.rxnMiriams{i}.value(I(L))=[];
0965             model.rxnMiriams{i}.name(I(L))=[];
0966         end
0967     end
0968 end
0969 fprintf('COMPLETE\n');
0970 
0971 %Find and delete all reactions without genes. This also removes genes that
0972 %are not used (which could happen because minScoreRatioG and
0973 %minScoreRatioKO). If keepSpontaneous==true, the spontaneous reactions
0974 %without genes are kept in the model. Spontaneous reactions with original
0975 %gene associations are treated in the same way, like the rest of the
0976 %reactions - if gene associations were removed during HMM search, such
0977 %reactions are deleted from the model
0978 if keepSpontaneous==true
0979     %Not the most comprise way to delete reactions without genes, but this
0980     %makes the code easier to understand. Firstly the non-spontaneous
0981     %reactions without genes are removed. After that, the second deletion
0982     %step removes spontaneous reactions, which had gene associations before
0983     %HMM search, but no longer have after it
0984     fprintf('Removing non-spontaneous reactions which after HMM search no longer have GPR rules... ');
0985     I=~any(model.rxnGeneMat,2)&~ismember(model.rxns,isSpontaneous);
0986     model=removeReactions(model,I,true,true);
0987     I=~any(model.rxnGeneMat,2)&ismember(model.rxns,spontRxnsWithGenes);
0988     model=removeReactions(model,I,true,true);
0989 else
0990     %Just simply check for any new reactions without genes and remove
0991     %it
0992     fprintf('Removing reactions which after HMM search no longer have GPR rules... ');
0993     I=~any(model.rxnGeneMat,2);
0994     model=removeReactions(model,I,true,true);
0995 end
0996 fprintf('COMPLETE\n');
0997 
0998 fprintf('Constructing GPR rules and finalizing the model... ');
0999 %Add the gene associations as 'or'
1000 for i=1:numel(model.rxns)
1001     %Find the involved genes
1002     I=find(model.rxnGeneMat(i,:));
1003     if any(I)
1004         model.grRules{i}=['(' model.genes{I(1)}];
1005         for j=2:numel(I)
1006             model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
1007         end
1008         model.grRules{i}=[model.grRules{i} ')'];
1009     end
1010 end
1011 
1012 %Fix grRules and reconstruct rxnGeneMat
1013 [grRules,rxnGeneMat] = standardizeGrRules(model,false); %Give detailed output
1014 model.grRules = grRules;
1015 model.rxnGeneMat = rxnGeneMat;
1016 
1017 %Add the description to the reactions
1018 for i=1:numel(model.rxns)
1019     if ~isempty(model.rxnNotes{i})
1020         model.rxnNotes(i)=strcat('Included by getKEGGModelForOrganism (using HMMs).',model.rxnNotes(i));
1021         model.rxnNotes(i)=strrep(model.rxnNotes(i),'.','. ');
1022     else
1023         model.rxnNotes(i)={'Included by getKEGGModelForOrganism (using HMMs)'};
1024     end
1025 end
1026 %Remove the temp fasta file
1027 delete(fastaFile)
1028 fprintf('COMPLETE\n\n*** Model reconstruction complete ***\n');
1029 end
1030 
1031 function files=listFiles(directory)
1032 %Supporter function to list the files in a directory and return them as a
1033 %cell array
1034 temp=dir(directory);
1035 files=cell(numel(temp),1);
1036 for i=1:numel(temp)
1037     files{i}=temp(i,1).name;
1038 end
1039 files=strrep(files,'.fa','');
1040 files=strrep(files,'.hmm','');
1041 files=strrep(files,'.out','');
1042 files=strrep(files,'.faw','');
1043 end

Generated by m2html © 2005