getMetsFromKEGG
  Retrieves information on all metabolites stored in the KEGG database

  Input:
  keggPath        if keggMets.mat is not in the RAVEN\external\kegg
                  directory, this function will attempt to read data from a
                  local FTP dump of the KEGG database. keggPath is the path
                  to the root of this database

  Output:
  model           a model structure generated from the database. The
                  following fields are filled
      id              'KEGG'
      name            'Automatically generated from KEGG database'
      mets            KEGG compound ids
      metNames        Compound name. Only the first name will be saved if
                      there are several synonyms
      metMiriams      If there is a CHEBI id available, then that will be
                      saved here
      inchis          InChI string for the metabolite
      metFormulas     The chemical composition of the metabolite. This
                      will only be loaded if there is no InChI string

  NOTE: If the file keggMets.mat is in the RAVEN\external\kegg directory,
  it will be loaded instead of parsing the KEGG files. If it does not
  exist, it will be saved after the KEGG files have been parsed. In general,
  you should remove the keggMets.mat file if you want to rebuild the model
  structure from a newer version of KEGG.

  Usage: model=getMetsFromKEGG(keggPath)

  NOTE: This is how one entry looks in the file

  ENTRY       C00001                      Compound
  NAME        H2O;
              Water
  FORMULA     H2O
  EXACT_MASS  18.0106
  MOL_WEIGHT  18.0153
  REMARK      Same as: D00001
  REACTION    R00001 R00002 R00004 R00005 R00009 R00010 R00011 R00017
              R00022 R00024 R00025 R00026 R00028 R00036 R00041 R00044
              (list truncated)
  ENZYME      1.1.1.1         1.1.1.22        1.1.1.23        1.1.1.115
              1.1.1.132       1.1.1.136       1.1.1.170       1.1.1.186
              (list truncated)
  BRITE       Therapeutic category of drugs in Japan [BR:br08301]
              (list truncated)
  DBLINKS     CAS: 7732-18-5
              PubChem: 3303
              ChEBI: 15377
              (list truncated)

  This is followed by a lot of information about the positions of the atoms
  and so on. It is not certain that each metabolite follows this structure
  exactly.

  The file is not tab-delimited. Instead, each label is 12 characters wide
  (except for '///').

  Check if the metabolites have been parsed before and saved. If so, load
  the model.
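The fixed-width layout described above can be illustrated with a minimal sketch. It is not the function's own code: the variable names (lines, entry, label, value) are made up for this example, and it only handles the ENTRY, NAME and FORMULA labels of the sample entry.

```matlab
% A few lines in the fixed-width layout: the label occupies columns 1-12
% and the value starts at column 13. '///' terminates a record.
lines = {
    'ENTRY       C00001                      Compound'
    'NAME        H2O;'
    '            Water'
    'FORMULA     H2O'
    '///'};

% Hypothetical container for the parsed fields
entry = struct('id','','name','','formula','');

for k = 1:numel(lines)
    tline = lines{k};
    if numel(tline) < 12
        continue;                     % too short to hold a label, e.g. '///'
    end
    label = strtrim(tline(1:12));     % empty for continuation lines
    value = tline(13:end);
    switch label
        case 'ENTRY'
            entry.id = value(1:6);    % compound ids are 6 characters
        case 'NAME'
            % Keep only the first name; drop ';' and anything after it
            entry.name = regexprep(value, ';.*', '');
        case 'FORMULA'
            entry.formula = strtrim(value);
    end
end

fprintf('%s\t%s\t%s\n', entry.id, entry.name, entry.formula);
```

Continuation lines such as the second synonym start with 12 blanks, so their label is empty and they are ignored, which matches the "only the first name" behaviour described for metNames.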
function model=getMetsFromKEGG(keggPath)
% getMetsFromKEGG
%   Retrieves information on all metabolites stored in KEGG database
%
%   Input:
%   keggPath        if keggMets.mat is not in the RAVEN\external\kegg
%                   directory, this function will attempt to read data from a
%                   local FTP dump of the KEGG database. keggPath is the path
%                   to the root of this database
%
%   Output:
%   model           a model structure generated from the database. The
%                   following fields are filled
%       id              'KEGG'
%       name            'Automatically generated from KEGG database'
%       mets            KEGG compound ids
%       metNames        Compound name. Only the first name will be saved if
%                       there are several synonyms
%       metMiriams      If there is a CHEBI id available, then that will be
%                       saved here
%       inchis          InChI string for the metabolite
%       metFormulas     The chemical composition of the metabolite. This
%                       will only be loaded if there is no InChI string
%
%   NOTE: If the file keggMets.mat is in the RAVEN\external\kegg directory
%   it will be loaded instead of parsing of the KEGG files. If it does not
%   exist it will be saved after parsing of the KEGG files. In general, you
%   should remove the keggMets.mat file if you want to rebuild the model
%   structure from a newer version of KEGG.
%
%   Usage: model=getMetsFromKEGG(keggPath)
%
% NOTE: This is how one entry looks in the file
%
% ENTRY       C00001                      Compound
% NAME        H2O;
%             Water
% FORMULA     H2O
% EXACT_MASS  18.0106
% MOL_WEIGHT  18.0153
% REMARK      Same as: D00001
% REACTION    R00001 R00002 R00004 R00005 R00009 R00010 R00011 R00017
%             R00022 R00024 R00025 R00026 R00028 R00036 R00041 R00044
%             (list truncated)
% ENZYME      1.1.1.1         1.1.1.22        1.1.1.23        1.1.1.115
%             1.1.1.132       1.1.1.136       1.1.1.170       1.1.1.186
%             (list truncated)
% BRITE       Therapeutic category of drugs in Japan [BR:br08301]
%             (list truncated)
% DBLINKS     CAS: 7732-18-5
%             PubChem: 3303
%             ChEBI: 15377
%             (list truncated)
%
% Then a lot of info about the positions of the atoms and so on. It is not
% certain that each metabolite follows this structure exactly.
%
% The file is not tab-delimited. Instead each label is 12 characters
% (except for '///').
%
% Check if the metabolites have been parsed before and saved. If so, load
% the model.
%

if nargin<1
    keggPath='RAVEN/external/kegg';
else
    keggPath=char(keggPath);
end

ravenPath=findRAVENroot();
metsFile=fullfile(ravenPath,'external','kegg','keggMets.mat');
if exist(metsFile, 'file')
    fprintf(['Importing KEGG metabolites from ' strrep(metsFile,'\','/') '... ']);
    load(metsFile);
else
    fprintf(['NOTE: Cannot locate ' strrep(metsFile,'\','/') ', it will therefore be generated from the local KEGG database\n']);
    if ~isfile(fullfile(keggPath,'compound')) || ~isfile(fullfile(keggPath,'compound.inchi'))
        %Build the message as text. Assigning the output of fprintf would
        %store the number of bytes written instead of the message
        EM=['The files ''compound'' and ''compound.inchi'' cannot be located at ' strrep(keggPath,'\','/') '/ and should be downloaded from the KEGG FTP.'];
        dispEM(EM);
    else
        fprintf('Generating keggMets.mat file... ');
        %Add new functionality in the order specified in models
        model.id='KEGG';
        model.name='Automatically generated from KEGG database';

        %Preallocate memory for 50000 metabolites
        model.mets=cell(50000,1);
        model.metNames=cell(50000,1);
        model.metFormulas=cell(50000,1);
        model.metMiriams=cell(50000,1);

        %First load information on metabolite ID, metabolite name,
        %composition, and ChEBI

        fid = fopen(fullfile(keggPath,'compound'), 'r');

        %Keeps track of how many metabolites have been added
        metCounter=0;

        %Loop through the file
        while 1
            %Get the next line
            tline = fgetl(fid);

            %Abort at end of file
            if ~ischar(tline)
                break;
            end

            %Skip '///' (and any other line too short to hold a label)
            if numel(tline)<12
                continue;
            end

            %Check if it's a new metabolite entry
            if strcmp(tline(1:12),'ENTRY       ')
                metCounter=metCounter+1;

                %Add empty strings where there should be such
                model.metNames{metCounter}='';
                model.metFormulas{metCounter}='';

                %Add compound ID (always 6 characters)
                model.mets{metCounter}=tline(13:18);

                %Add the KEGG id as metMiriams
                if length(model.mets{metCounter})==6
                    miriamStruct=model.metMiriams{metCounter};
                    if strcmp('G',model.mets{metCounter}(1))
                        miriamStruct.name{1,1}='kegg.glycan';
                    else
                        miriamStruct.name{1,1}='kegg.compound';
                    end
                    miriamStruct.value{1,1}=tline(13:18);
                    model.metMiriams{metCounter}=miriamStruct;
                end
            end

            %Add name
            if strcmp(tline(1:12),'NAME        ')
                %If there are synonyms, then the last character is ';'
                if strcmp(tline(end),';')
                    model.metNames{metCounter}=tline(13:end-1);
                    %Semicolon can also occur in the middle, separating
                    %several synonyms in the same line
                    model.metNames{metCounter} = regexprep(model.metNames{metCounter},';.+','');
                elseif regexp(tline,';')
                    model.metNames{metCounter}=tline(13:end);
                    model.metNames{metCounter} = regexprep(model.metNames{metCounter},';.+','');
                else
                    model.metNames{metCounter}=tline(13:end);
                end
            end

            %Add composition
            if strcmp(tline(1:12),'FORMULA     ')
                model.metFormulas{metCounter}=tline(13:end);
            end

            %Add PubChem id
            if numel(tline)>21
                if strcmp(tline(13:21),'PubChem: ')
                    if isstruct(model.metMiriams{metCounter})
                        addToIndex=numel(model.metMiriams{metCounter}.name)+1;
                    else
                        addToIndex=1;
                    end
                    miriamStruct=model.metMiriams{metCounter};
                    miriamStruct.name{addToIndex,1}='pubchem.substance';
                    miriamStruct.value{addToIndex,1}=tline(22:end);
                    model.metMiriams{metCounter}=miriamStruct;
                end
            end

            %Add CHEBI id
            if numel(tline)>19
                if strcmp(tline(13:19),'ChEBI: ')
                    if isstruct(model.metMiriams{metCounter})
                        addToIndex=numel(model.metMiriams{metCounter}.name)+1;
                    else
                        addToIndex=1;
                    end
                    chebiIDs=strsplit(tline(20:end),' ');
                    miriamStruct=model.metMiriams{metCounter};
                    for i=1:numel(chebiIDs)
                        miriamStruct.name{addToIndex,1}='chebi';
                        miriamStruct.value{addToIndex,1}=strcat('CHEBI:',chebiIDs{i});
                        addToIndex=addToIndex+1;
                    end
                    model.metMiriams{metCounter}=miriamStruct;
                end
            end
        end

        %Close the file
        fclose(fid);

        %If too much space was allocated, shrink the model
        model.mets=model.mets(1:metCounter);
        model.metNames=model.metNames(1:metCounter);
        model.metFormulas=model.metFormulas(1:metCounter);
        model.metMiriams=model.metMiriams(1:metCounter);

        %Then load the InChI strings from another file. Not all metabolites
        %will be present in the list

        inchIDs=cell(numel(model.mets),1);
        inchis=cell(numel(model.mets),1);

        %The format is metID*tab*string

        fid = fopen(fullfile(keggPath,'compound.inchi'), 'r');

        %Loop through the file
        counter=1;
        while 1
            %Get the next line
            tline = fgetl(fid);

            %Abort at end of file
            if ~ischar(tline)
                break;
            end

            %Get the ID (first 6 characters) and the InChI (from character
            %14 onwards)
            inchIDs{counter}=tline(1:6);
            inchis{counter}=tline(14:end);
            counter=counter+1;
        end

        %Close the file
        fclose(fid);

        inchIDs=inchIDs(1:counter-1);
        inchis=inchis(1:counter-1);

        %Find the metabolites that had InChI strings and add them to the
        %model
        [a, b]=ismember(inchIDs,model.mets);

        %If there were mets with InChIs but that were not in the list
        if ~all(a)
            EM='Not all metabolites with InChI strings were found in the original list';
            disp(EM);
        end

        %Only use the matched entries. b is 0 for IDs that are missing from
        %model.mets, and 0 is not a valid index
        model.inchis=cell(numel(model.mets),1);
        model.inchis(:)={''};
        model.inchis(b(a))=inchis(a);

        %Remove composition if InChI was found
        model.metFormulas(b(a))={''};

        %Ensuring that all model.metMiriams.value consist only of strings,
        %no double
        for i=1:numel(model.mets)
            if isstruct(model.metMiriams{i})
                for j=1:numel(model.metMiriams{i}.value)
                    if isa(model.metMiriams{i}.value{j},'double')
                        model.metMiriams{i}.value{j}=num2str(model.metMiriams{i}.value{j});
                    end
                end
            end
        end

        %Removing leading and trailing whitespace from metNames
        model.metNames = strtrim(model.metNames);

        %Fixing redundant metNames. The first occurrence of a particular
        %metabolite name is not changed, but starting from the second
        %occurrence, the original metabolite name is concatenated with the
        %KEGG COMPOUND id in brackets
        for i=1:numel(model.metNames)
            if ~isempty(model.metNames{i})
                if sum(ismember(model.metNames(1:i-1),model.metNames(i)))>=1
                    model.metNames(i) = strcat(model.metNames(i), ' (', model.mets(i),')');
                end
            end
        end
        %Saves the model
        save(metsFile,'model');
    end
end
fprintf('COMPLETE\n');
end
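
The InChI step in the listing maps IDs read from compound.inchi onto model.mets with ismember. The following is a minimal sketch of that mapping on made-up data: the InChI values are placeholder strings, C99999 is a hypothetical ID that is absent from the metabolite list, and the variable names found, pos and modelInchis are assumptions for this example. It shows one way to skip unmatched IDs, for which ismember returns position 0.

```matlab
% Metabolite ids as they might appear in model.mets
mets    = {'C00001'; 'C00002'; 'C00003'};

% Ids and strings as read from the InChI file (placeholder values).
% 'C99999' is deliberately missing from mets.
inchIDs = {'C00003'; 'C99999'; 'C00001'};
inchis  = {'inchi-C00003'; 'inchi-C99999'; 'inchi-C00001'};

% found(k) is true if inchIDs{k} occurs in mets; pos(k) is its position
% in mets, or 0 when there is no match
[found, pos] = ismember(inchIDs, mets);

% Start with empty strings and fill in only the matched entries.
% Indexing with pos directly would fail because of the 0 entry.
modelInchis = repmat({''}, numel(mets), 1);
modelInchis(pos(found)) = inchis(found);

disp(modelInchis);
```

Filtering both sides with the same logical mask keeps the IDs and strings aligned, so metabolites without an InChI string simply keep the empty default.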