makeEcModel Expands a conventional genome-scale model (in RAVEN format) with enzyme information and prepares the reactions for integration of enzyme usage coefficients. This function contains all the steps that need to be done to get a basic ecModel, without incorporating any kcat values or constraints yet. This function should only have to be run once for a model. Input: model a model in RAVEN format geckoLight true if a simplified GECKO light model should be generated. (Optional, default is false). modelAdapter a loaded model adapter (Optional, will otherwise use the default model adapter). Ouput: model an ecModel in GECKO 3 format, with a model.ec structure where enzyme and kcat information are stored. Protein pseudo- metabolites and their draw reactions are added to the model, but their usage is not yet implemented (due to absent kcat values at this stage). noUniprot genes for which no information could be found in the Uniprot database The function goes through the following steps: 1. Remove gene associations from pseudoreactions. 2. Invert irreversible backwards reactions. 3. Correct 'rev' vector to match lb and ub vectors. 4. Convert to irreversible model (splits reversible reactions). 5. [Skipped with geckoLight:] Expand model to split reactions with 'OR' in grRules (each reaction is then catalyzed by one enzyme (complex). 6. Make empty model.ec structure, that will contain enzyme and kcat information. One entry per reaction, where isozymes have multiple entries. This model.ec structure will later be populated with kcat values. For geckoLight the structure is different, where each reaction can have multiple isozymes. 7. Add enzyme information fields to model.ec structure: MW, sequence. 8. Populate model.ec structure with information from each reaction. 9. [Skipped with geckoLight:] Add proteins as pseudometabolites. 10. Add prot_pool pseudometabolite. 11. [Skipped with geckoLight:] Add usage reactions for the protein pseudometabolites, replenishing from the protein pool (default, can be changed to consider proteomics data at later stage) 12. Add protein pool reaction, without upper bound. Note that while protein pseudometabolites, draw & pool reactions might be added to the model, the enzyme usage is not yet incorporated in each metabolic reaction, so enzymes will not be used. applyKcatConstraints incorporates protein pseudometabolites in reactions as enzyme usages by applying the specified kcats as constraints. The EC structure looks as follows Attributes: geckoLight: 0 if full model, 1 if light rxns: reaction identifiers that correspond to model.rxns kcat: kcat values - not set here source: specifies where the kcats come from - not set here notes: notes that can be set by downstream functions - not set here eccodes: enzyme codes for each enzyme - not set here genes: the genes involved in the kcats - not necessarily the same as model.genes, since some genes may not be found in databases etc. enzymes: Uniprot protein identifiers for the genes mw: molecular weights of the enzymes sequence: sequence of the genes/enzymes concs: concentrations of the enzymes - not set here rxnEnzMat: matrix of enzymes and rxns The full model is split on all ORs in the GPRs, meaning that the reactions will be duplicated for each isozyme. Only the rxns with genes are added. The fields rxns, eccodes, kcat, source and notes will therefore have one entry per reaction. The fields genes, enzymes, mw, sequence and concs will have one entry per gene. The rxnEnzMat is a matrix with reactions and genes, mapping which genes are connected to which reaction (where isozymes have different reactions). The light model works a bit differently. The model has the same number of rxns as the original model, but expanded since it is reversible + one the extra prot maintenance rxn and one extra prot_pool rxn. However, the ec fields rxns, eccodes, kcat, source and notes are duplicated for each isozyme, sorted the same way as model.rxns. So, in model.ec.rxns, the same reaction will appear several times after one another, one entry per izozyme, with corresponding values for that isozyme. These fields therefore have the same length as for the full model. The fields genes, enzymes, mw, sequence and concs are the same here as in the full model. The rxnEnzMat maps the model.ec.rxns entries to genes and is therefore of the same size as for the full model. Usage: [model, noUniprot] = makeEcModel(model, geckoLight, modelAdapter)
0001 function [model, noUniprot] = makeEcModel(model, geckoLight, modelAdapter) 0002 % makeEcModel 0003 % Expands a conventional genome-scale model (in RAVEN format) with enzyme 0004 % information and prepares the reactions for integration of enzyme usage 0005 % coefficients. This function contains all the steps that need to be done 0006 % to get a basic ecModel, without incorporating any kcat values or 0007 % constraints yet. This function should only have to be run once for a 0008 % model. 0009 % 0010 % Input: 0011 % model a model in RAVEN format 0012 % geckoLight true if a simplified GECKO light model should be generated. 0013 % (Optional, default is false). 0014 % modelAdapter a loaded model adapter (Optional, will otherwise use the 0015 % default model adapter). 0016 % 0017 % Ouput: 0018 % model an ecModel in GECKO 3 format, with a model.ec structure where 0019 % enzyme and kcat information are stored. Protein pseudo- 0020 % metabolites and their draw reactions are added to the model, 0021 % but their usage is not yet implemented (due to absent kcat 0022 % values at this stage). 0023 % noUniprot genes for which no information could be found in the 0024 % Uniprot database 0025 % 0026 % The function goes through the following steps: 0027 % 1. Remove gene associations from pseudoreactions. 0028 % 2. Invert irreversible backwards reactions. 0029 % 3. Correct 'rev' vector to match lb and ub vectors. 0030 % 4. Convert to irreversible model (splits reversible reactions). 0031 % 5. [Skipped with geckoLight:] Expand model to split reactions with 0032 % 'OR' in grRules (each reaction is then catalyzed by one enzyme 0033 % (complex). 0034 % 6. Make empty model.ec structure, that will contain enzyme and kcat 0035 % information. One entry per reaction, where isozymes have multiple 0036 % entries. This model.ec structure will later be populated with kcat 0037 % values. For geckoLight the structure is different, where each 0038 % reaction can have multiple isozymes. 0039 % 7. Add enzyme information fields to model.ec structure: MW, sequence. 0040 % 8. Populate model.ec structure with information from each reaction. 0041 % 9. [Skipped with geckoLight:] Add proteins as pseudometabolites. 0042 % 10. Add prot_pool pseudometabolite. 0043 % 11. [Skipped with geckoLight:] Add usage reactions for the protein 0044 % pseudometabolites, replenishing from the protein pool (default, can 0045 % be changed to consider proteomics data at later stage) 0046 % 12. Add protein pool reaction, without upper bound. 0047 % 0048 % Note that while protein pseudometabolites, draw & pool reactions might 0049 % be added to the model, the enzyme usage is not yet incorporated in each 0050 % metabolic reaction, so enzymes will not be used. applyKcatConstraints 0051 % incorporates protein pseudometabolites in reactions as enzyme usages by 0052 % applying the specified kcats as constraints. 0053 % 0054 %The EC structure looks as follows 0055 % Attributes: 0056 % geckoLight: 0 if full model, 1 if light 0057 % rxns: reaction identifiers that correspond to model.rxns 0058 % kcat: kcat values - not set here 0059 % source: specifies where the kcats come from - not set here 0060 % notes: notes that can be set by downstream functions - not set 0061 % here 0062 % eccodes: enzyme codes for each enzyme - not set here 0063 % genes: the genes involved in the kcats - not necessarily the 0064 % same as model.genes, since some genes may not be found in 0065 % databases etc. 0066 % enzymes: Uniprot protein identifiers for the genes 0067 % mw: molecular weights of the enzymes 0068 % sequence: sequence of the genes/enzymes 0069 % concs: concentrations of the enzymes - not set here 0070 % rxnEnzMat: matrix of enzymes and rxns 0071 % 0072 % The full model is split on all ORs in the GPRs, meaning that the 0073 % reactions will be duplicated for each isozyme. Only the rxns with genes 0074 % are added. The fields rxns, eccodes, kcat, source and notes will 0075 % therefore have one entry per reaction. The fields genes, enzymes, mw, 0076 % sequence and concs will have one entry per gene. The rxnEnzMat is a 0077 % matrix with reactions and genes, mapping which genes are connected to 0078 % which reaction (where isozymes have different reactions). 0079 % 0080 % The light model works a bit differently. The model has the same number of 0081 % rxns as the original model, but expanded since it is reversible + one the 0082 % extra prot maintenance rxn and one extra prot_pool rxn. However, the ec 0083 % fields rxns, eccodes, kcat, source and notes are duplicated for each 0084 % isozyme, sorted the same way as model.rxns. So, in model.ec.rxns, the 0085 % same reaction will appear several times after one another, one entry per 0086 % izozyme, with corresponding values for that isozyme. These fields 0087 % therefore have the same length as for the full model. The fields genes, 0088 % enzymes, mw, sequence and concs are the same here as in the full model. 0089 % The rxnEnzMat maps the model.ec.rxns entries to genes and is therefore of 0090 % the same size as for the full model. 0091 % 0092 % Usage: 0093 % [model, noUniprot] = makeEcModel(model, geckoLight, modelAdapter) 0094 0095 if nargin<2 0096 geckoLight=false; 0097 elseif ~islogical(geckoLight) && ~(geckoLight == 0) && ~(geckoLight == 1) 0098 error('geckoLight should be either true or false') 0099 end 0100 0101 if nargin < 3 || isempty(modelAdapter) 0102 modelAdapter = ModelAdapterManager.getDefault(); 0103 if isempty(modelAdapter) 0104 error('Either send in a modelAdapter or set the default model adapter in the ModelAdapterManager.') 0105 end 0106 end 0107 params = modelAdapter.getParameters(); 0108 compartmentID = strcmp(model.compNames,params.enzyme_comp); 0109 if ~any(compartmentID) 0110 error(['Compartment ' params.enzyme_comp ' (specified in params.enzyme_comp) '... 0111 'cannot be found in model.compNames']) 0112 end 0113 compartmentID = model.comps(compartmentID); 0114 0115 if geckoLight 0116 ec.geckoLight=true; 0117 else 0118 ec.geckoLight=false; 0119 end 0120 0121 %Check if model is in RAVEN format 0122 if any(isfield(model,{'rules','modelID'})) 0123 error(['The model is likely loaded using COBRA Toolbox readCbModel(). Instead, use ' ... 0124 'RAVEN Toolbox importModel(). Alternatively, you can also convert the ', ... 0125 'model in MATLAB using ravenCobraWrapper().']) 0126 end 0127 0128 %Check for conflicting reaction and metabolite identifiers 0129 conflictId = startsWith(model.mets,'prot_'); 0130 if any(conflictId) 0131 error('The identifiers in model.mets are not allowed to start with ''prot_''.') 0132 end 0133 conflictId = startsWith(model.rxns,{'usage_prot_','prot_pool'}) | ... 0134 endsWith(model.rxns,'_REV') | ... 0135 ~cellfun(@isempty,regexp(model.rxns,'_EXP_\d+$')); 0136 if any(conflictId) 0137 error(['The identifiers in model.rxns are not allowed to start with ''usage_prot'' ' ... 0138 'or ''prot_pool'', or end with ''_REV'' or ''_EXP_[digit]''.']) 0139 end 0140 0141 uniprotDB = loadDatabases('both', modelAdapter); 0142 uniprotDB = uniprotDB.uniprot; 0143 0144 %1: Remove gene rules from pseudoreactions (if any): 0145 if exist(fullfile(params.path,'data','pseudoRxns.tsv'),'file') 0146 fID = fopen(fullfile(params.path,'data','pseudoRxns.tsv')); 0147 fileData = textscan(fID,'%s %s','delimiter','\t'); 0148 fclose(fID); 0149 pseudoRxns = fileData{1}; 0150 pseudoRxns = ismember(model.rxns,pseudoRxns); 0151 else 0152 pseudoRxns = false(numel(model.rxns),1); 0153 end 0154 pseudoRxns = find(pseudoRxns | contains(model.rxnNames,'pseudoreaction')); 0155 model.grRules(pseudoRxns) = {''}; 0156 model.rxnGeneMat(pseudoRxns,:) = zeros(numel(pseudoRxns), numel(model.genes)); 0157 0158 %2: Swap direction of reactions that are defined to only carry negative flux 0159 to_swap=model.lb < 0 & model.ub == 0; 0160 model.S(:,to_swap)=-model.S(:,to_swap); 0161 model.ub(to_swap)=-model.lb(to_swap); 0162 model.lb(to_swap)=0; 0163 0164 %3: Correct rev vector: true if LB < 0 & UB > 0, or it is an exchange reaction: 0165 model.rev = false(size(model.rxns)); 0166 for i = 1:length(model.rxns) 0167 if (model.lb(i) < 0 && model.ub(i) > 0) || sum(model.S(:,i) ~= 0) == 1 0168 model.rev(i) = true; 0169 end 0170 end 0171 0172 %4: Make irreversible model (appends _REV to reaction IDs to indicate reverse 0173 %reactions) 0174 [~,exchRxns] = getExchangeRxns(model); 0175 nonExchRxns = model.rxns; 0176 nonExchRxns(exchRxns) = []; 0177 model=convertToIrrev(model, nonExchRxns); 0178 0179 %5: Expand model, to separate isozymes (appends _EXP_* to reaction IDs to 0180 %indicate duplication) 0181 if ~geckoLight 0182 model=expandModel(model); 0183 end 0184 % Sort reactions, so that reversible and isozymic reactions are kept near 0185 if ~geckoLight 0186 model=sortIdentifiers(model); 0187 end 0188 0189 %6: Make ec-extension structure, one for gene-associated reaction. 0190 % The structure is different for light and full models 0191 rxnWithGene = find(sum(model.rxnGeneMat,2)); 0192 if ~geckoLight 0193 ec.rxns = model.rxns(rxnWithGene); 0194 emptyCell = cell(numel(rxnWithGene),1); 0195 emptyCell(:) = {''}; 0196 emptyVect = zeros(numel(rxnWithGene),1); 0197 0198 ec.kcat = emptyVect; 0199 ec.source = emptyCell; % Strings, like 'dlkcat', 'manual', 'brenda', etc. 0200 ec.notes = emptyCell; % Additional comments 0201 ec.eccodes = emptyCell; 0202 ec.concs = emptyVect; 0203 else 0204 %Different strategy for GECKO light: Each reaction can exist multiple times in 0205 %ec.rxns and similar fields - one time per isozyme. The number of copies is 0206 %the number of ORs in the GPR + 1 0207 numOrs = count(model.grRules(rxnWithGene), ' or '); 0208 cpys = numOrs + 1; 0209 prevNumRxns = length(numOrs); 0210 cpyIndices = repelem(rxnWithGene, cpys).'; %.' only matter when number of rxns is 1 0211 %loop through and add a prefix with an isozyme index to the rxns 0212 %we just give a fixed-length number as prefix, and assume that 999 is enough 0213 tmpRxns = model.rxns(cpyIndices); %now they have no prefix 0214 newRxns = tmpRxns; 0215 0216 %add the prefix 0217 nextIndex = 1; 0218 for i = 1:numel(model.rxns) 0219 localRxnIndex = 1; 0220 if nextIndex <= length(tmpRxns) && strcmp(model.rxns(i), tmpRxns(nextIndex)) 0221 while true 0222 tmp = compose('%03d_',localRxnIndex); 0223 newRxns{nextIndex} = [tmp{1} tmpRxns{nextIndex}]; 0224 localRxnIndex = localRxnIndex + 1; 0225 if (localRxnIndex >= 1000) 0226 error('Increase index size to 10000 - error in the code.'); %this should never happen, we don't have > 999 isozymes 0227 end 0228 nextIndex = nextIndex + 1; 0229 if nextIndex > length(tmpRxns) || ~strcmp(model.rxns(i), tmpRxns(nextIndex)) 0230 break; 0231 end 0232 end 0233 end 0234 end 0235 0236 ec.rxns = newRxns; 0237 0238 emptyCell = cell(numel(ec.rxns),1); 0239 emptyCell(:) = {''}; 0240 emptyVect = zeros(numel(ec.rxns),1); 0241 0242 ec.kcat = emptyVect; 0243 ec.source = emptyCell; % Strings, like 'dlkcat', 'manual', 'brenda', etc. 0244 ec.notes = emptyCell; % Additional comments 0245 ec.eccodes = emptyCell; 0246 ec.concs = emptyVect; 0247 end 0248 0249 %7: Gather enzyme information via UniprotDB 0250 uniprotCompatibleGenes = modelAdapter.getUniprotCompatibleGenes(model.genes); 0251 [Lia,Locb] = ismember(uniprotCompatibleGenes,uniprotDB.genes); 0252 0253 uniprot = modelAdapter.getUniprotIDsFromTable(uniprotCompatibleGenes); 0254 if ~isequal(uniprot,uniprotCompatibleGenes) 0255 uniprot(cellfun(@isempty,uniprot)) = {''}; 0256 [Lia,Locb] = ismember(uniprot,uniprotDB.ID); 0257 end 0258 noUniprot = uniprotCompatibleGenes(~Lia); 0259 if all(~Lia) 0260 error('None of the proteins in uniprot.tsv match the genes in the model. Changes to the obj.params.uniprot parameters, or a data/uniprotConversion.tsv file are likely required.') 0261 elseif ~isempty(noUniprot) 0262 printOrange(['WARNING: The ' num2str(numel(noUniprot)) ' gene(s) reported '... 0263 'in "noUniprot" cannot be found in data/uniprot.tsv, these will ' ... 0264 'not be enzyme-constrained. If you intend to use different Uniprot '... 0265 'data (e.g. from a different proteome, make sure you first delete '... 0266 'the existing data/uniprot.tsv file.\n']); 0267 end 0268 ec.genes = model.genes(Lia); %Will often be duplicate of model.genes, but is done here to prevent issues when it is not. 0269 ec.enzymes = uniprotDB.ID(Locb(Lia)); 0270 ec.mw = uniprotDB.MW(Locb(Lia)); 0271 ec.sequence = uniprotDB.seq(Locb(Lia)); 0272 %Additional info 0273 ec.concs = nan(numel(ec.genes),1); % To be filled with proteomics data when available 0274 0275 %8: Only parse rxns associated to genes 0276 if ~geckoLight 0277 ec.rxnEnzMat = zeros(numel(rxnWithGene),numel(ec.genes)); % Non-zeros will indicate the number of subunits 0278 for r=1:numel(rxnWithGene) 0279 rxnGenes = model.genes(find(model.rxnGeneMat(rxnWithGene(r),:))); 0280 [~,locEnz] = ismember(rxnGenes,ec.genes); % Could also parse directly from rxnGeneMat, but some genes might be missing from Uniprot DB 0281 if locEnz ~= 0 0282 ec.rxnEnzMat(r,locEnz) = 1; %Assume 1 copy per subunit or enzyme, can be modified later 0283 end 0284 end 0285 else 0286 %For light models, we need to split up all GPRs 0287 ec.rxnEnzMat = zeros(numel(ec.rxns),numel(ec.genes)); % Non-zeros will indicate the number of subunits 0288 nextIndex = 1; 0289 %For full model generation, the GPRs are controlled in expandModel, but 0290 %here we need to make an explicit format check 0291 indexes2check = findPotentialErrors(model.grRules,model); 0292 if ~isempty(indexes2check) 0293 printOrange('Run standardizeGrRules(model) for a more detailed warning.\n'); 0294 printOrange('For Human-GEM, these reactions can be corrected using simplifyGrRules.\n'); 0295 end 0296 0297 for i=1:prevNumRxns 0298 %ind is the index in the model, not to confuse with the index in the ec struct (i), 0299 %which only contains reactions with GPRs. 0300 ind = rxnWithGene(i); 0301 %Get rid of all '(' and ')' since I'm not looking at complex stuff 0302 %anyways 0303 geneString=model.grRules{ind}; 0304 geneString=strrep(geneString,'(',''); 0305 geneString=strrep(geneString,')',''); 0306 geneString=strrep(geneString,' or ',';'); 0307 0308 if (numOrs(i) == 0) 0309 geneNames = {geneString}; 0310 else 0311 %Split the string into gene names 0312 geneNames=regexp(geneString,';','split'); 0313 end 0314 0315 %Now loop through the isozymes and set the rxnGeneMat 0316 for j = 1:length(geneNames) 0317 %Find the gene in the gene list If ' and ' relationship, first 0318 %split the genes 0319 fnd = strfind(geneNames{j},' and '); 0320 if ~isempty(fnd) 0321 andGenes=regexp(geneNames{j},' and ','split'); 0322 ec.rxnEnzMat(nextIndex,ismember(ec.genes,andGenes)) = 1; %should be subunit stoichoimetry 0323 else 0324 ec.rxnEnzMat(nextIndex,ismember(ec.genes,geneNames(j)))=1;%should be subunit stoichoimetry 0325 end 0326 nextIndex = nextIndex + 1; 0327 end 0328 end 0329 end 0330 0331 %9: Add proteins as pseudometabolites 0332 if ~geckoLight 0333 [proteinMets.mets, uniprotSortId] = unique(ec.enzymes); 0334 proteinMets.mets = strcat('prot_',proteinMets.mets); 0335 proteinMets.metNames = proteinMets.mets; 0336 proteinMets.compartments = compartmentID; 0337 if isfield(model,'metMiriams') 0338 proteinMets.metMiriams = repmat({struct('name',{{'sbo'}},'value',{{'SBO:0000252'}})},numel(proteinMets.mets),1); 0339 end 0340 if isfield(model,'metCharges') 0341 proteinMets.metCharges = zeros(numel(proteinMets.mets),1); 0342 end 0343 proteinMets.metNotes = repmat({'Enzyme-usage pseudometabolite'},numel(proteinMets.mets),1); 0344 model = addMets(model,proteinMets); 0345 end 0346 0347 %10: Add protein pool pseudometabolite 0348 pool.mets = 'prot_pool'; 0349 pool.metNames = pool.mets; 0350 pool.compartments = compartmentID; 0351 pool.metNotes = 'Enzyme-usage protein pool'; 0352 model = addMets(model,pool); 0353 0354 %11: Add protein usage reactions. 0355 if ~geckoLight 0356 usageRxns.rxns = strcat('usage_',proteinMets.mets); 0357 usageRxns.rxnNames = usageRxns.rxns; 0358 usageRxns.mets = cell(numel(usageRxns.rxns),1); 0359 usageRxns.stoichCoeffs = cell(numel(usageRxns.rxns),1); 0360 for i=1:numel(usageRxns.mets) 0361 usageRxns.mets{i} = {proteinMets.mets{i}, 'prot_pool'}; 0362 usageRxns.stoichCoeffs{i} = [-1,1]; 0363 end 0364 usageRxns.lb = zeros(numel(usageRxns.rxns),1) - 1000; 0365 usageRxns.ub = zeros(numel(usageRxns.rxns),1); 0366 usageRxns.rev = ones(numel(usageRxns.rxns),1); 0367 usageRxns.grRules = ec.genes(uniprotSortId); 0368 model = addRxns(model,usageRxns); 0369 end 0370 0371 %12: Add protein pool reaction (with open UB) 0372 poolRxn.rxns = 'prot_pool_exchange'; 0373 poolRxn.rxnNames = poolRxn.rxns; 0374 poolRxn.mets = {'prot_pool'}; 0375 poolRxn.stoichCoeffs = {-1}; 0376 poolRxn.lb = -1000; 0377 poolRxn.ub = 0; 0378 poolRxn.rev = 1; 0379 model = addRxns(model,poolRxn); 0380 0381 model.ec=ec; 0382 end 0383 0384 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0385 %Function that gets the model field grRules and returns the indexes of the 0386 %rules in which the pattern ") and (" is present. 0387 %Copied from standardizeGrRules 0388 % TODO: Make this an accessible function in a separate file in RAVEN and remove this 0389 %implementation. 0390 function indexes2check = findPotentialErrors(grRules,model) 0391 indxs_l = find(~cellfun(@isempty,strfind(grRules,') and ('))); 0392 indxs_l_L = find(~cellfun(@isempty,strfind(grRules,') and'))); 0393 indxs_l_R = find(~cellfun(@isempty,strfind(grRules,'and ('))); 0394 indexes2check = vertcat(indxs_l,indxs_l_L,indxs_l_R); 0395 indexes2check = unique(indexes2check); 0396 0397 if ~isempty(indexes2check) 0398 textToPrint = 'WARNING: Potentially problematic ") AND (" in the grRules for reaction(s):\n'; 0399 for i=1:numel(indexes2check) 0400 textToPrint=[textToPrint '\t' model.rxns{indexes2check(i)} '\n']; 0401 end 0402 printOrange(textToPrint); 0403 end 0404 end