removeLowScoreGenes Remove low-scoring genes from model. This function removes genes from a model based on their scores, a step used by the tINIT package. The function recognizes and differentiates between isozymes and subunits of an enzyme complex. Genes are removed from each grRule, subject to the following conditions: 1) At least one gene must remain associated with the reaction 2) Genes involved in a complex (joined by ANDs) are not removed Usage: [newModel,remGenes] = removeLowScoreGenes(model,geneScores,complexScoring); Inputs: model Model structure from which genes are to be removed. geneScores A vector of scores associated with the model genes. Genes with a positive score will remain in the model, whereas genes with a negative score will try to be removed. If all genes associated with a reaction have a negative score, then the least-negative gene will remain; if there is a tie, one will be selected at random. If a negative-scoring gene is a subunit in a complex, it will not be removed; however, the entire complex may be removed. See the following example cases: Original: G1 or (G2 and G3 and G4) Negative: G1, G2 New: G2 and G3 and G4 Original: G1 or (G2 and G3) or (G4 and G5) Negative: G1, G2 New: G4 and G5 [using default complexScoring] Original: (G1 and (G2 or G3) and G4) Negative: G2 New: (G1 and G3 and G4) isozymeScoring Method used to combine the scores of multiple isozymes; 'min', 'max', 'median', or 'average'. (optional, default 'max') complexScoring Method used to combine the scores of enzyme complex subunits: 'min', 'max', 'median', or 'average'. (optional, default 'min') Outputs: newModel Model with updated gene, grRules, and rxnGeneMat fields after attemping to remove negative-score genes. remGenes A list of negative-score genes that were fully removed from the model. Negative-score genes that were removed from some grRules but not all will not be included in this list.
0001 function [newModel,remGenes] = removeLowScoreGenes(model,geneScores,isozymeScoring,complexScoring) 0002 %removeLowScoreGenes Remove low-scoring genes from model. 0003 % 0004 % This function removes genes from a model based on their scores, a step 0005 % used by the tINIT package. The function recognizes and differentiates 0006 % between isozymes and subunits of an enzyme complex. Genes are removed 0007 % from each grRule, subject to the following conditions: 0008 % 1) At least one gene must remain associated with the reaction 0009 % 2) Genes involved in a complex (joined by ANDs) are not removed 0010 % 0011 % Usage: 0012 % 0013 % [newModel,remGenes] = removeLowScoreGenes(model,geneScores,complexScoring); 0014 % 0015 % Inputs: 0016 % 0017 % model Model structure from which genes are to be removed. 0018 % 0019 % geneScores A vector of scores associated with the model genes. 0020 % Genes with a positive score will remain in the model, 0021 % whereas genes with a negative score will try to be 0022 % removed. 0023 % 0024 % If all genes associated with a reaction have a negative 0025 % score, then the least-negative gene will remain; if 0026 % there is a tie, one will be selected at random. 0027 % 0028 % If a negative-scoring gene is a subunit in a complex, 0029 % it will not be removed; however, the entire complex may 0030 % be removed. See the following example cases: 0031 % 0032 % Original: G1 or (G2 and G3 and G4) 0033 % Negative: G1, G2 0034 % New: G2 and G3 and G4 0035 % 0036 % Original: G1 or (G2 and G3) or (G4 and G5) 0037 % Negative: G1, G2 0038 % New: G4 and G5 [using default complexScoring] 0039 % 0040 % Original: (G1 and (G2 or G3) and G4) 0041 % Negative: G2 0042 % New: (G1 and G3 and G4) 0043 % 0044 % isozymeScoring Method used to combine the scores of multiple isozymes; 0045 % 'min', 'max', 'median', or 'average'. 0046 % (optional, default 'max') 0047 % 0048 % complexScoring Method used to combine the scores of enzyme complex 0049 % subunits: 'min', 'max', 'median', or 'average'. 0050 % (optional, default 'min') 0051 % 0052 % Outputs: 0053 % 0054 % newModel Model with updated gene, grRules, and rxnGeneMat fields 0055 % after attemping to remove negative-score genes. 0056 % 0057 % remGenes A list of negative-score genes that were fully removed 0058 % from the model. Negative-score genes that were removed 0059 % from some grRules but not all will not be included in 0060 % this list. 0061 % 0062 0063 0064 if nargin < 3 || isempty(isozymeScoring) 0065 isozymeScoring = 'max'; 0066 end 0067 if nargin < 4 0068 complexScoring = 'min'; 0069 end 0070 0071 if ~isequal(size(model.genes),size(geneScores)) 0072 error('The dimensions of model genes and geneScores do not match.'); 0073 end 0074 0075 % convert logical operators to symbols 0076 grRules = model.grRules; 0077 if any(contains(grRules,{'&','|'})) 0078 symbolicRules = true; 0079 else 0080 symbolicRules = false; 0081 end 0082 grRules = regexprep(grRules,' and ',' & '); 0083 grRules = regexprep(grRules,' or ',' | '); 0084 0085 % get unique list of grRules 0086 [uRules,~,rule_ind] = unique(grRules); 0087 0088 % iterate through each rule 0089 newRules = uRules; %initialize newRules 0090 for i = 1:numel(uRules) 0091 if isempty(uRules{i}) || ~contains(uRules{i},'|') 0092 continue 0093 elseif contains(uRules{i},'&') 0094 newRules{i} = processComplexRule(uRules{i},model.genes,geneScores,isozymeScoring,complexScoring); 0095 else 0096 newRules{i} = processSimpleRule(uRules{i},model.genes,geneScores,isozymeScoring,complexScoring); 0097 end 0098 end 0099 0100 % re-map unique rules to model 0101 newModel = model; 0102 newModel.grRules = newRules(rule_ind); 0103 0104 % restore original logical operator formatting if it was changed 0105 if ~symbolicRules 0106 newModel.grRules = regexprep(newModel.grRules, ' \| ', ' or '); 0107 newModel.grRules = regexprep(newModel.grRules, ' & ', ' and '); 0108 end 0109 0110 % regenerate "genes" and "rxnGeneMat" model fields 0111 [genes,rxnGeneMat] = getGenesFromGrRules(newModel.grRules); 0112 newModel.genes = genes; 0113 newModel.rxnGeneMat = rxnGeneMat; 0114 0115 % update other gene-related fields 0116 remInd = ~ismember(model.genes,newModel.genes); 0117 remGenes = model.genes(remInd); 0118 0119 if isfield(newModel,'geneShortNames') 0120 newModel.geneShortNames(remInd) = []; 0121 end 0122 if isfield(newModel,'proteins') 0123 newModel.proteins(remInd) = []; 0124 end 0125 if isfield(newModel,'geneMiriams') 0126 newModel.geneMiriams(remInd) = []; 0127 end 0128 if isfield(newModel,'geneFrom') 0129 newModel.geneFrom(remInd) = []; 0130 end 0131 if isfield(newModel,'geneComps') 0132 newModel.geneComps(remInd) = []; 0133 end 0134 0135 0136 end 0137 0138 0139 0140 function [updatedRule,rScore] = processSimpleRule(rule,genes,gScores,isozymeScoring,complexScoring) 0141 % Either score or modify a reaction gene rule containig only ANDs or ORs. 0142 % 0143 % If the rule contains an enzyme complex (all ANDs), the complex will be 0144 % scored based on the score of its subunits. Subunits without a score (NaN) 0145 % will be excluded from the score calculation. 0146 % 0147 % If the rule contains only isozymes (all ORs), the negative-score genes 0148 % will be removed from the rule. Isozymes without a score (NaN) will not be 0149 % removed from the rule. The resuling rule will then be scored. 0150 0151 0152 % get IDs and indices of genes involved in rule 0153 ruleGenes = unique(regexp(rule,'[^&|\(\) ]+','match')); 0154 [~,geneInd] = ismember(ruleGenes,genes); 0155 0156 % rules with one or no genes remain unchanged 0157 if numel(ruleGenes) < 2 0158 rScore = gScores(geneInd); 0159 updatedRule = rule; 0160 return 0161 end 0162 0163 if ~contains(rule,'&') % rule contains isozymes 0164 0165 scoreMethod = isozymeScoring; 0166 negInd = gScores(geneInd) < 0; % NaNs will return false here 0167 if all(negInd) 0168 % get the least negative gene, adding a small random value to avoid a tie 0169 [~,maxInd] = max(gScores(geneInd) + rand(size(geneInd))*(1e-8)); 0170 updatedRule = ruleGenes{maxInd}; 0171 elseif sum(~negInd) == 1 0172 updatedRule = ruleGenes{~negInd}; 0173 else 0174 updatedRule = strjoin(ruleGenes(~negInd),' | '); 0175 if startsWith(rule,'(') 0176 updatedRule = ['(',updatedRule,')']; 0177 end 0178 end 0179 0180 % update ruleGenes and their indices 0181 ruleGenes = unique(regexp(updatedRule,'[^&|\(\) ]+','match')); 0182 [~,geneInd] = ismember(ruleGenes,genes); 0183 0184 elseif ~contains(rule,'|') % rule contains enzyme complex 0185 scoreMethod = complexScoring; 0186 updatedRule = rule; 0187 else 0188 error('This function cannot handle rules with both "OR" and "AND" expressions.'); 0189 end 0190 0191 % score rule 0192 switch lower(scoreMethod) 0193 case 'min' 0194 rScore = min(gScores(geneInd),[],'omitnan'); 0195 case 'max' 0196 rScore = max(gScores(geneInd),[],'omitnan'); 0197 case 'median' 0198 rScore = median(gScores(geneInd),'omitnan'); 0199 case 'average' 0200 rScore = mean(gScores(geneInd),'omitnan'); 0201 end 0202 0203 end 0204 0205 0206 0207 function updatedRule = processComplexRule(rule,genes,gScores,isozymeScoring,complexScoring) 0208 % Update reactions containing both AND and OR expressions. 0209 % 0210 % Negative-score genes will be removed if they are isozymic, whereas they 0211 % will not be removed if they are part of an enzyme complex. However, if 0212 % the enzyme complex has a negative score, the entire complex will be 0213 % removed, as long as it is not the only remaining element in the rule. 0214 0215 0216 % Specify phrases to search for in the grRule. These phrases will find 0217 % genes grouped by all ANDs (first phrase) or all ORs (second phrase). 0218 search_phrases = {'\([^&|\(\) ]+( & [^&|\(\) ]+)+\)', '\([^&|\(\) ]+( \| [^&|\(\) ]+)+\)'}; 0219 0220 % initialize some variables 0221 subsets = {}; % subsets are groups of genes grouped by all ANDs or all ORs 0222 c = 1; % counter to keep track of the group (subset) number 0223 r_orig = rule; % record original rule to determine when it stops changing 0224 for k = 1:100 % iterate some arbitrarily high number of times 0225 for j = 1:length(search_phrases) 0226 new_subset = regexp(rule,search_phrases{j},'match')'; % extract subsets 0227 if ~isempty(new_subset) 0228 subsets = [subsets; new_subset]; % append to list of subsets 0229 subset_nums = arrayfun(@num2str,(c:length(subsets))','UniformOutput',false); % get group numbers to be assigned to the new subsets, and convert to strings 0230 rule = regexprep(rule,search_phrases{j},strcat('#',subset_nums,'#'),'once'); % replace the subsets in the expression with their group numbers (enclosed by "#"s) 0231 c = c + length(new_subset); 0232 end 0233 end 0234 if isequal(rule,r_orig) 0235 break; % stop iterating when rule stops changing 0236 else 0237 r_orig = rule; 0238 end 0239 end 0240 subsets{end+1} = rule; % add final state of rule as the last subset 0241 0242 % score and update each subset, and append to gene list and gene scores 0243 for i = 1:numel(subsets) 0244 [subsets{i},subset_score] = processSimpleRule(subsets{i},genes,gScores,isozymeScoring,complexScoring); 0245 gScores = [gScores; subset_score]; 0246 genes = [genes; {strcat('#',num2str(i),'#')}]; 0247 end 0248 0249 % reconstruct the rule from its updated subsets 0250 updatedRule = subsets{end}; 0251 for i = c-1:-1:1 0252 updatedRule = regexprep(updatedRule,strcat('#',num2str(i),'#'),subsets{i}); 0253 end 0254 0255 end 0256 0257 0258