Description of removeLowScoreGenes

removeLowScoreGenes  Remove low-scoring genes from model.

   This function removes genes from a model based on their scores, a step
   used by the tINIT package. The function recognizes and differentiates
   between isozymes and subunits of an enzyme complex. Genes are removed
   from each grRule, subject to the following conditions:
       1) At least one gene must remain associated with the reaction
       2) Genes involved in a complex (joined by ANDs) are not removed

 Usage:

   [newModel,remGenes] = removeLowScoreGenes(model,geneScores,complexScoring);

 Inputs:

   model           Model structure from which genes are to be removed.

   geneScores      A vector of scores associated with the model genes.
                   Genes with a positive score will remain in the model,
                   whereas genes with a negative score will try to be
                   removed.
                   
                   If all genes associated with a reaction have a negative
                   score, then the least-negative gene will remain; if 
                   there is a tie, one will be selected at random.

                   If a negative-scoring gene is a subunit in a complex, 
                   it will not be removed; however, the entire complex may
                   be removed. See the following example cases:

                    Original: G1 or (G2 and G3 and G4)
                    Negative: G1, G2
                         New: G2 and G3 and G4

                    Original: G1 or (G2 and G3) or (G4 and G5)
                    Negative: G1, G2
                         New: G4 and G5     [using default complexScoring]

                    Original: (G1 and (G2 or G3) and G4)
                    Negative: G2
                         New: (G1 and G3 and G4)

   isozymeScoring  Method used to combine the scores of multiple isozymes;
                   'min', 'max', 'median', or 'average'. 
                   (optional, default 'max')

   complexScoring  Method used to combine the scores of enzyme complex
                   subunits: 'min', 'max', 'median', or 'average'. 
                   (optional, default 'min')

 Outputs:

   newModel        Model with updated gene, grRules, and rxnGeneMat fields
                   after attemping to remove negative-score genes.

   remGenes        A list of negative-score genes that were fully removed
                   from the model. Negative-score genes that were removed
                   from some grRules but not all will not be included in
                   this list.

0001 function [newModel,remGenes] = removeLowScoreGenes(model,geneScores,isozymeScoring,complexScoring)
0002 %removeLowScoreGenes  Remove low-scoring genes from model.
0003 %
0004 %   This function removes genes from a model based on their scores, a step
0005 %   used by the tINIT package. The function recognizes and differentiates
0006 %   between isozymes and subunits of an enzyme complex. Genes are removed
0007 %   from each grRule, subject to the following conditions:
0008 %       1) At least one gene must remain associated with the reaction
0009 %       2) Genes involved in a complex (joined by ANDs) are not removed
0010 %
0011 % Usage:
0012 %
0013 %   [newModel,remGenes] = removeLowScoreGenes(model,geneScores,complexScoring);
0014 %
0015 % Inputs:
0016 %
0017 %   model           Model structure from which genes are to be removed.
0018 %
0019 %   geneScores      A vector of scores associated with the model genes.
0020 %                   Genes with a positive score will remain in the model,
0021 %                   whereas genes with a negative score will try to be
0022 %                   removed.
0023 %
0024 %                   If all genes associated with a reaction have a negative
0025 %                   score, then the least-negative gene will remain; if
0026 %                   there is a tie, one will be selected at random.
0027 %
0028 %                   If a negative-scoring gene is a subunit in a complex,
0029 %                   it will not be removed; however, the entire complex may
0030 %                   be removed. See the following example cases:
0031 %
0032 %                    Original: G1 or (G2 and G3 and G4)
0033 %                    Negative: G1, G2
0034 %                         New: G2 and G3 and G4
0035 %
0036 %                    Original: G1 or (G2 and G3) or (G4 and G5)
0037 %                    Negative: G1, G2
0038 %                         New: G4 and G5     [using default complexScoring]
0039 %
0040 %                    Original: (G1 and (G2 or G3) and G4)
0041 %                    Negative: G2
0042 %                         New: (G1 and G3 and G4)
0043 %
0044 %   isozymeScoring  Method used to combine the scores of multiple isozymes;
0045 %                   'min', 'max', 'median', or 'average'.
0046 %                   (optional, default 'max')
0047 %
0048 %   complexScoring  Method used to combine the scores of enzyme complex
0049 %                   subunits: 'min', 'max', 'median', or 'average'.
0050 %                   (optional, default 'min')
0051 %
0052 % Outputs:
0053 %
0054 %   newModel        Model with updated gene, grRules, and rxnGeneMat fields
0055 %                   after attemping to remove negative-score genes.
0056 %
0057 %   remGenes        A list of negative-score genes that were fully removed
0058 %                   from the model. Negative-score genes that were removed
0059 %                   from some grRules but not all will not be included in
0060 %                   this list.
0061 %
0062 
0063 
0064 if nargin < 3 || isempty(isozymeScoring)
0065     isozymeScoring = 'max';
0066 end
0067 if nargin < 4
0068     complexScoring = 'min';
0069 end
0070 
0071 if ~isequal(size(model.genes),size(geneScores))
0072     error('The dimensions of model genes and geneScores do not match.');
0073 end
0074 
0075 % convert logical operators to symbols
0076 grRules = model.grRules;
0077 if any(contains(grRules,{'&','|'}))
0078     symbolicRules = true;
0079 else
0080     symbolicRules = false;
0081 end
0082 grRules = regexprep(grRules,' and ',' & ');
0083 grRules = regexprep(grRules,' or ',' | ');
0084 
0085 % get unique list of grRules
0086 [uRules,~,rule_ind] = unique(grRules);
0087 
0088 % iterate through each rule
0089 newRules = uRules;  %initialize newRules
0090 for i = 1:numel(uRules)
0091     if isempty(uRules{i}) || ~contains(uRules{i},'|')
0092         continue
0093     elseif contains(uRules{i},'&')
0094         newRules{i} = processComplexRule(uRules{i},model.genes,geneScores,isozymeScoring,complexScoring);
0095     else
0096         newRules{i} = processSimpleRule(uRules{i},model.genes,geneScores,isozymeScoring,complexScoring);
0097     end
0098 end
0099 
0100 % re-map unique rules to model
0101 newModel = model;
0102 newModel.grRules = newRules(rule_ind);
0103 
0104 % restore original logical operator formatting if it was changed
0105 if ~symbolicRules
0106     newModel.grRules = regexprep(newModel.grRules, ' \| ', ' or ');
0107     newModel.grRules = regexprep(newModel.grRules, ' & ', ' and ');
0108 end
0109 
0110 % regenerate "genes" and "rxnGeneMat" model fields
0111 [genes,rxnGeneMat] = getGenesFromGrRules(newModel.grRules);
0112 newModel.genes = genes;
0113 newModel.rxnGeneMat = rxnGeneMat;
0114 
0115 % update other gene-related fields
0116 remInd = ~ismember(model.genes,newModel.genes);
0117 remGenes = model.genes(remInd);
0118 
0119 if isfield(newModel,'geneShortNames')
0120     newModel.geneShortNames(remInd) = [];
0121 end
0122 if isfield(newModel,'proteins')
0123     newModel.proteins(remInd) = [];
0124 end
0125 if isfield(newModel,'geneMiriams')
0126     newModel.geneMiriams(remInd) = [];
0127 end
0128 if isfield(newModel,'geneFrom')
0129     newModel.geneFrom(remInd) = [];
0130 end
0131 if isfield(newModel,'geneComps')
0132     newModel.geneComps(remInd) = [];
0133 end
0134 
0135 
0136 end
0137 
0138 
0139 
0140 function [updatedRule,rScore] = processSimpleRule(rule,genes,gScores,isozymeScoring,complexScoring)
0141 % Either score or modify a reaction gene rule containig only ANDs or ORs.
0142 %
0143 % If the rule contains an enzyme complex (all ANDs), the complex will be
0144 % scored based on the score of its subunits. Subunits without a score (NaN)
0145 % will be excluded from the score calculation.
0146 %
0147 % If the rule contains only isozymes (all ORs), the negative-score genes
0148 % will be removed from the rule. Isozymes without a score (NaN) will not be
0149 % removed from the rule. The resuling rule will then be scored.
0150 
0151 
0152 % get IDs and indices of genes involved in rule
0153 ruleGenes = unique(regexp(rule,'[^&|\(\) ]+','match'));
0154 [~,geneInd] = ismember(ruleGenes,genes);
0155 
0156 % rules with one or no genes remain unchanged
0157 if numel(ruleGenes) < 2
0158     rScore = gScores(geneInd);
0159     updatedRule = rule;
0160     return
0161 end
0162 
0163 if ~contains(rule,'&')  % rule contains isozymes
0164     
0165     scoreMethod = isozymeScoring;
0166     negInd = gScores(geneInd) < 0;  % NaNs will return false here
0167     if all(negInd)
0168         % get the least negative gene, adding a small random value to avoid a tie
0169         [~,maxInd] = max(gScores(geneInd) + rand(size(geneInd))*(1e-8));
0170         updatedRule = ruleGenes{maxInd};
0171     elseif sum(~negInd) == 1
0172         updatedRule = ruleGenes{~negInd};
0173     else
0174         updatedRule = strjoin(ruleGenes(~negInd),' | ');
0175         if startsWith(rule,'(')
0176             updatedRule = ['(',updatedRule,')'];
0177         end
0178     end
0179     
0180     % update ruleGenes and their indices
0181     ruleGenes = unique(regexp(updatedRule,'[^&|\(\) ]+','match'));
0182     [~,geneInd] = ismember(ruleGenes,genes);
0183     
0184 elseif ~contains(rule,'|')  % rule contains enzyme complex
0185     scoreMethod = complexScoring;
0186     updatedRule = rule;
0187 else
0188     error('This function cannot handle rules with both "OR" and "AND" expressions.');
0189 end
0190 
0191 % score rule
0192 switch lower(scoreMethod)
0193     case 'min'
0194         rScore = min(gScores(geneInd),[],'omitnan');
0195     case 'max'
0196         rScore = max(gScores(geneInd),[],'omitnan');
0197     case 'median'
0198         rScore = median(gScores(geneInd),'omitnan');
0199     case 'average'
0200         rScore = mean(gScores(geneInd),'omitnan');
0201 end
0202 
0203 end
0204 
0205 
0206 
0207 function updatedRule = processComplexRule(rule,genes,gScores,isozymeScoring,complexScoring)
0208 % Update reactions containing both AND and OR expressions.
0209 %
0210 % Negative-score genes will be removed if they are isozymic, whereas they
0211 % will not be removed if they are part of an enzyme complex. However, if
0212 % the enzyme complex has a negative score, the entire complex will be
0213 % removed, as long as it is not the only remaining element in the rule.
0214 
0215 
0216 % Specify phrases to search for in the grRule. These phrases will find
0217 % genes grouped by all ANDs (first phrase) or all ORs (second phrase).
0218 search_phrases = {'\([^&|\(\) ]+( & [^&|\(\) ]+)+\)', '\([^&|\(\) ]+( \| [^&|\(\) ]+)+\)'};
0219 
0220 % initialize some variables
0221 subsets = {};  % subsets are groups of genes grouped by all ANDs or all ORs
0222 c = 1;  % counter to keep track of the group (subset) number
0223 r_orig = rule;  % record original rule to determine when it stops changing
0224 for k = 1:100  % iterate some arbitrarily high number of times
0225     for j = 1:length(search_phrases)
0226         new_subset = regexp(rule,search_phrases{j},'match')';  % extract subsets
0227         if ~isempty(new_subset)
0228             subsets = [subsets; new_subset];  % append to list of subsets
0229             subset_nums = arrayfun(@num2str,(c:length(subsets))','UniformOutput',false);  % get group numbers to be assigned to the new subsets, and convert to strings
0230             rule = regexprep(rule,search_phrases{j},strcat('#',subset_nums,'#'),'once');  % replace the subsets in the expression with their group numbers (enclosed by "#"s)
0231             c = c + length(new_subset);
0232         end
0233     end
0234     if isequal(rule,r_orig)
0235         break;  % stop iterating when rule stops changing
0236     else
0237         r_orig = rule;
0238     end
0239 end
0240 subsets{end+1} = rule;  % add final state of rule as the last subset
0241 
0242 % score and update each subset, and append to gene list and gene scores
0243 for i = 1:numel(subsets)
0244     [subsets{i},subset_score] = processSimpleRule(subsets{i},genes,gScores,isozymeScoring,complexScoring);
0245     gScores = [gScores; subset_score];
0246     genes = [genes; {strcat('#',num2str(i),'#')}];
0247 end
0248 
0249 % reconstruct the rule from its updated subsets
0250 updatedRule = subsets{end};
0251 for i = c-1:-1:1
0252     updatedRule = regexprep(updatedRule,strcat('#',num2str(i),'#'),subsets{i});
0253 end
0254 
0255 end
0256 
0257 
0258

removeLowScoreGenes

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SUBFUNCTIONS

SOURCE CODE