Description of removeLowScoreGenes

removeLowScoreGenes  Remove low-scoring genes from model.

   This function removes genes from a model based on their scores, a step
   used by the tINIT package. The function recognizes and differentiates
   between isozymes and subunits of an enzyme complex. Genes are removed
   from each grRule, subject to the following conditions:
       1) At least one gene must remain associated with the reaction
       2) Genes involved in a complex (joined by ANDs) are not removed

 Usage:

   [newModel,remGenes] = removeLowScoreGenes(model,geneScores,complexScoring);

 Inputs:

   model           Model structure from which genes are to be removed.

   geneScores      A vector of scores associated with the model genes.
                   Genes with a positive score will remain in the model,
                   whereas genes with a negative score will try to be
                   removed.
                   
                   If all genes associated with a reaction have a negative
                   score, then the least-negative gene will remain; if 
                   there is a tie, one will be selected at random.

                   If a negative-scoring gene is a subunit in a complex, 
                   it will not be removed; however, the entire complex may
                   be removed. See the following example cases:

                    Original: G1 or (G2 and G3 and G4)
                    Negative: G1, G2
                         New: G2 and G3 and G4

                    Original: G1 or (G2 and G3) or (G4 and G5)
                    Negative: G1, G2
                         New: G4 and G5     [using default complexScoring]

                    Original: (G1 and (G2 or G3) and G4)
                    Negative: G2
                         New: (G1 and G3 and G4)

   isozymeScoring  Method used to combine the scores of multiple isozymes;
                   'min', 'max', 'median', or 'average'. 
                   (Opt, default 'max')

   complexScoring  Method used to combine the scores of enzyme complex
                   subunits: 'min', 'max', 'median', or 'average'. 
                   (Opt, default 'min')

 Outputs:

   newModel        Model with updated gene, grRules, and rxnGeneMat fields
                   after attemping to remove negative-score genes.

   remGenes        A list of negative-score genes that were fully removed
                   from the model. Negative-score genes that were removed
                   from some grRules but not all will not be included in
                   this list.

0001 function [newModel,remGenes] = removeLowScoreGenes(model,geneScores,isozymeScoring,complexScoring)
0002 %removeLowScoreGenes  Remove low-scoring genes from model.
0003 %
0004 %   This function removes genes from a model based on their scores, a step
0005 %   used by the tINIT package. The function recognizes and differentiates
0006 %   between isozymes and subunits of an enzyme complex. Genes are removed
0007 %   from each grRule, subject to the following conditions:
0008 %       1) At least one gene must remain associated with the reaction
0009 %       2) Genes involved in a complex (joined by ANDs) are not removed
0010 %
0011 % Usage:
0012 %
0013 %   [newModel,remGenes] = removeLowScoreGenes(model,geneScores,complexScoring);
0014 %
0015 % Inputs:
0016 %
0017 %   model           Model structure from which genes are to be removed.
0018 %
0019 %   geneScores      A vector of scores associated with the model genes.
0020 %                   Genes with a positive score will remain in the model,
0021 %                   whereas genes with a negative score will try to be
0022 %                   removed.
0023 %
0024 %                   If all genes associated with a reaction have a negative
0025 %                   score, then the least-negative gene will remain; if
0026 %                   there is a tie, one will be selected at random.
0027 %
0028 %                   If a negative-scoring gene is a subunit in a complex,
0029 %                   it will not be removed; however, the entire complex may
0030 %                   be removed. See the following example cases:
0031 %
0032 %                    Original: G1 or (G2 and G3 and G4)
0033 %                    Negative: G1, G2
0034 %                         New: G2 and G3 and G4
0035 %
0036 %                    Original: G1 or (G2 and G3) or (G4 and G5)
0037 %                    Negative: G1, G2
0038 %                         New: G4 and G5     [using default complexScoring]
0039 %
0040 %                    Original: (G1 and (G2 or G3) and G4)
0041 %                    Negative: G2
0042 %                         New: (G1 and G3 and G4)
0043 %
0044 %   isozymeScoring  Method used to combine the scores of multiple isozymes;
0045 %                   'min', 'max', 'median', or 'average'.
0046 %                   (Opt, default 'max')
0047 %
0048 %   complexScoring  Method used to combine the scores of enzyme complex
0049 %                   subunits: 'min', 'max', 'median', or 'average'.
0050 %                   (Opt, default 'min')
0051 %
0052 % Outputs:
0053 %
0054 %   newModel        Model with updated gene, grRules, and rxnGeneMat fields
0055 %                   after attemping to remove negative-score genes.
0056 %
0057 %   remGenes        A list of negative-score genes that were fully removed
0058 %                   from the model. Negative-score genes that were removed
0059 %                   from some grRules but not all will not be included in
0060 %                   this list.
0061 %
0062 
0063 
0064 if nargin < 3 || isempty(isozymeScoring)
0065     isozymeScoring = 'max';
0066 end
0067 if nargin < 4
0068     complexScoring = 'min';
0069 end
0070 
0071 if ~isequal(size(model.genes),size(geneScores))
0072     error('The dimensions of model genes and geneScores do not match.');
0073 end
0074 
0075 % convert logical operators to symbols
0076 grRules = model.grRules;
0077 if any(contains(grRules,{'&','|'}))
0078     symbolicRules = true;
0079 else
0080     symbolicRules = false;
0081 end
0082 grRules = regexprep(grRules,' and ',' & ');
0083 grRules = regexprep(grRules,' or ',' | ');
0084 
0085 % get unique list of grRules
0086 [uRules,~,rule_ind] = unique(grRules);
0087 
0088 % iterate through each rule
0089 newRules = uRules;  %initialize newRules
0090 for i = 1:numel(uRules)
0091     if isempty(uRules{i}) || ~contains(uRules{i},'|')
0092         continue
0093     elseif contains(uRules{i},'&')
0094         newRules{i} = processComplexRule(uRules{i},model.genes,geneScores,isozymeScoring,complexScoring);
0095     else
0096         newRules{i} = processSimpleRule(uRules{i},model.genes,geneScores,isozymeScoring,complexScoring);
0097     end
0098 end
0099 
0100 % re-map unique rules to model
0101 newModel = model;
0102 newModel.grRules = newRules(rule_ind);
0103 
0104 % restore original logical operator formatting if it was changed
0105 if ~symbolicRules
0106     newModel.grRules = regexprep(newModel.grRules, ' \| ', ' or ');
0107     newModel.grRules = regexprep(newModel.grRules, ' & ', ' and ');
0108 end
0109 
0110 % regenerate "genes" and "rxnGeneMat" model fields
0111 [genes,rxnGeneMat] = getGenesFromGrRules(newModel.grRules);
0112 newModel.genes = genes;
0113 newModel.rxnGeneMat = rxnGeneMat;
0114 
0115 % update other gene-related fields
0116 remInd = ~ismember(model.genes,newModel.genes);
0117 remGenes = model.genes(remInd);
0118 
0119 if isfield(newModel,'geneShortNames')
0120     newModel.geneShortNames(remInd) = [];
0121 end
0122 if isfield(newModel,'geneMiriams')
0123     newModel.geneMiriams(remInd) = [];
0124 end
0125 if isfield(newModel,'geneFrom')
0126     newModel.geneFrom(remInd) = [];
0127 end
0128 if isfield(newModel,'geneComps')
0129     newModel.geneComps(remInd) = [];
0130 end
0131 
0132 
0133 end
0134 
0135 
0136 
0137 function [updatedRule,rScore] = processSimpleRule(rule,genes,gScores,isozymeScoring,complexScoring)
0138 % Either score or modify a reaction gene rule containig only ANDs or ORs.
0139 %
0140 % If the rule contains an enzyme complex (all ANDs), the complex will be
0141 % scored based on the score of its subunits. Subunits without a score (NaN)
0142 % will be excluded from the score calculation.
0143 %
0144 % If the rule contains only isozymes (all ORs), the negative-score genes
0145 % will be removed from the rule. Isozymes without a score (NaN) will not be
0146 % removed from the rule. The resuling rule will then be scored.
0147 
0148 
0149 % get IDs and indices of genes involved in rule
0150 ruleGenes = unique(regexp(rule,'[^&|\(\) ]+','match'));
0151 [~,geneInd] = ismember(ruleGenes,genes);
0152 
0153 % rules with one or no genes remain unchanged
0154 if numel(ruleGenes) < 2
0155     rScore = gScores(geneInd);
0156     updatedRule = rule;
0157     return
0158 end
0159 
0160 if ~contains(rule,'&')  % rule contains isozymes
0161     
0162     scoreMethod = isozymeScoring;
0163     negInd = gScores(geneInd) < 0;  % NaNs will return false here
0164     if all(negInd)
0165         % get the least negative gene, adding a small random value to avoid a tie
0166         [~,maxInd] = max(gScores(geneInd) + rand(size(geneInd))*(1e-8));
0167         updatedRule = ruleGenes{maxInd};
0168     elseif sum(~negInd) == 1
0169         updatedRule = ruleGenes{~negInd};
0170     else
0171         updatedRule = strjoin(ruleGenes(~negInd),' | ');
0172         if startsWith(rule,'(')
0173             updatedRule = ['(',updatedRule,')'];
0174         end
0175     end
0176     
0177     % update ruleGenes and their indices
0178     ruleGenes = unique(regexp(updatedRule,'[^&|\(\) ]+','match'));
0179     [~,geneInd] = ismember(ruleGenes,genes);
0180     
0181 elseif ~contains(rule,'|')  % rule contains enzyme complex
0182     scoreMethod = complexScoring;
0183     updatedRule = rule;
0184 else
0185     error('This function cannot handle rules with both "OR" and "AND" expressions.');
0186 end
0187 
0188 % score rule
0189 switch lower(scoreMethod)
0190     case 'min'
0191         rScore = min(gScores(geneInd),[],'omitnan');
0192     case 'max'
0193         rScore = max(gScores(geneInd),[],'omitnan');
0194     case 'median'
0195         rScore = median(gScores(geneInd),'omitnan');
0196     case 'average'
0197         rScore = mean(gScores(geneInd),'omitnan');
0198 end
0199 
0200 end
0201 
0202 
0203 
0204 function updatedRule = processComplexRule(rule,genes,gScores,isozymeScoring,complexScoring)
0205 % Update reactions containing both AND and OR expressions.
0206 %
0207 % Negative-score genes will be removed if they are isozymic, whereas they
0208 % will not be removed if they are part of an enzyme complex. However, if
0209 % the enzyme complex has a negative score, the entire complex will be
0210 % removed, as long as it is not the only remaining element in the rule.
0211 
0212 
0213 % Specify phrases to search for in the grRule. These phrases will find
0214 % genes grouped by all ANDs (first phrase) or all ORs (second phrase).
0215 search_phrases = {'\([^&|\(\) ]+( & [^&|\(\) ]+)+\)', '\([^&|\(\) ]+( \| [^&|\(\) ]+)+\)'};
0216 
0217 % initialize some variables
0218 subsets = {};  % subsets are groups of genes grouped by all ANDs or all ORs
0219 c = 1;  % counter to keep track of the group (subset) number
0220 r_orig = rule;  % record original rule to determine when it stops changing
0221 for k = 1:100  % iterate some arbitrarily high number of times
0222     for j = 1:length(search_phrases)
0223         new_subset = regexp(rule,search_phrases{j},'match')';  % extract subsets
0224         if ~isempty(new_subset)
0225             subsets = [subsets; new_subset];  % append to list of subsets
0226             subset_nums = arrayfun(@num2str,(c:length(subsets))','UniformOutput',false);  % get group numbers to be assigned to the new subsets, and convert to strings
0227             rule = regexprep(rule,search_phrases{j},strcat('#',subset_nums,'#'),'once');  % replace the subsets in the expression with their group numbers (enclosed by "#"s)
0228             c = c + length(new_subset);
0229         end
0230     end
0231     if isequal(rule,r_orig)
0232         break;  % stop iterating when rule stops changing
0233     else
0234         r_orig = rule;
0235     end
0236 end
0237 subsets{end+1} = rule;  % add final state of rule as the last subset
0238 
0239 % score and update each subset, and append to gene list and gene scores
0240 for i = 1:numel(subsets)
0241     [subsets{i},subset_score] = processSimpleRule(subsets{i},genes,gScores,isozymeScoring,complexScoring);
0242     gScores = [gScores; subset_score];
0243     genes = [genes; {strcat('#',num2str(i),'#')}];
0244 end
0245 
0246 % reconstruct the rule from its updated subsets
0247 updatedRule = subsets{end};
0248 for i = c-1:-1:1
0249     updatedRule = regexprep(updatedRule,strcat('#',num2str(i),'#'),subsets{i});
0250 end
0251 
0252 end
0253 
0254 
0255

removeLowScoreGenes

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SUBFUNCTIONS

SOURCE CODE