getMetsFromKEGG

PURPOSE

getMetsFromKEGG

SYNOPSIS

function model=getMetsFromKEGG(keggPath)

DESCRIPTION

 getMetsFromKEGG
   Retrieves information on all metabolites stored in the KEGG database

   Input:
   keggPath    path to the root of a local FTP dump of the KEGG database
               (optional, default 'RAVEN/external/kegg'). It is only used
               if keggMets.mat is not in the RAVEN\external\kegg
               directory, in which case the files 'compound' and
               'compound.inchi' are read from this location

   Output:
   model       a model structure generated from the database. The
               following fields are filled
       id              'KEGG'
       name            'Automatically generated from KEGG database'
       mets            KEGG compound ids
       metNames        Compound name. Only the first name will be saved if
                       there are several synonyms
       metMiriams      The KEGG compound (or glycan) id, plus the PubChem
                       substance and ChEBI ids if they are available
       inchis          InChI string for the metabolite
       metFormulas     The chemical composition of the metabolite. This
                       will only be loaded if there is no InChI string
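
To illustrate the shape of one metMiriams element, here is a minimal sketch that fills a name/value structure from the DBLINKS lines of the sample entry for C00001. The variable names dblinks and miriam are illustrative only, and the regular-expression approach is a simplification; the function itself tests fixed character positions in each line.

```matlab
% DBLINKS lines as in the sample entry; the first 12 characters are the
% blank-padded label column.
dblinks = { ...
    'DBLINKS     CAS: 7732-18-5', ...
    '            PubChem: 3303', ...
    '            ChEBI: 15377'};

% Every entry starts with its own KEGG id.
miriam.name  = {'kegg.compound'};
miriam.value = {'C00001'};

for k = 1:numel(dblinks)
    entry = dblinks{k}(13:end);          % drop the 12-character label column
    tok = regexp(entry, '^(\w+): (.+)$', 'tokens', 'once');
    if isempty(tok)
        continue;                        % not a "Database: id" line
    end
    switch tok{1}
        case 'PubChem'
            miriam.name{end+1, 1}  = 'pubchem.substance';
            miriam.value{end+1, 1} = tok{2};
        case 'ChEBI'
            ids = strsplit(tok{2}, ' '); % one line may list several ids
            for j = 1:numel(ids)
                miriam.name{end+1, 1}  = 'chebi';
                miriam.value{end+1, 1} = ['CHEBI:' ids{j}];
            end
    end
end

disp([miriam.name miriam.value]);
```

Databases that are not handled (CAS in this sample) are simply skipped, so name and value always stay the same length.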

   NOTE: If the file keggMets.mat is in the RAVEN\external\kegg directory
   it will be loaded instead of parsing the KEGG files. If it does not
   exist it will be saved after the KEGG files have been parsed. In
   general, you should remove the keggMets.mat file if you want to rebuild
   the model structure from a newer version of KEGG.
               
   Usage: model=getMetsFromKEGG(keggPath)
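
The load-or-rebuild behaviour described in the note can be sketched in isolation as follows. This is an assumption-laden toy: the cache file lives in tempdir under the made-up name keggMets_demo.mat, and a one-metabolite structure stands in for the real parsing step.

```matlab
% Hypothetical cache location; the real function uses
% <RAVEN root>/external/kegg/keggMets.mat.
cacheFile = fullfile(tempdir, 'keggMets_demo.mat');

% Deleting the cached file forces a rebuild on the next call.
if exist(cacheFile, 'file')
    delete(cacheFile);
end

for pass = 1:2
    if exist(cacheFile, 'file')
        s = load(cacheFile, 'model');   % load only the expected variable
        model = s.model;
        source = 'cache';
    else
        % Stand-in for parsing the KEGG flat files.
        model = struct('id', 'KEGG', 'mets', {{'C00001'}});
        save(cacheFile, 'model');
        source = 'parsed';
    end
    fprintf('Pass %d: model obtained from %s\n', pass, source);
end

delete(cacheFile);                      % clean up the demo file
```

The first pass builds and saves the structure, and the second pass reads it back, which is why a stale keggMets.mat has to be removed before a newer KEGG release is picked up.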

 NOTE: This is how one entry looks in the file

 ENTRY       C00001                      Compound
 NAME        H2O;
             Water
 FORMULA     H2O
 EXACT_MASS  18.0106
 MOL_WEIGHT  18.0153
 REMARK      Same as: D00001
 REACTION    R00001 R00002 R00004 R00005 R00009 R00010 R00011 R00017
             R00022 R00024 R00025 R00026 R00028 R00036 R00041 R00044
             (list truncated)
 ENZYME      1.1.1.1         1.1.1.22        1.1.1.23        1.1.1.115
             1.1.1.132       1.1.1.136       1.1.1.170       1.1.1.186
             (list truncated)
 BRITE       Therapeutic category of drugs in Japan [BR:br08301]
             (list truncated)
 DBLINKS     CAS: 7732-18-5
             PubChem: 3303
             ChEBI: 15377
             (list truncated)

 The entry continues with further records, such as the positions of the
 atoms. Not every metabolite is guaranteed to follow this structure exactly.

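 The file is not tab-delimited. Instead, each label occupies the first 12
 characters of its line, padded with spaces. The only exception is the
 entry terminator '///'. Continuation lines have a blank label.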
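
As a small illustration of the fixed-width layout, the sketch below splits a few lines of the sample entry into label and value. The variable names entryLines, labels, and values are illustrative and do not appear in the function.

```matlab
% A shortened copy of the sample entry, one line per cell.
entryLines = { ...
    'ENTRY       C00001                      Compound', ...
    'NAME        H2O;', ...
    '            Water', ...
    'FORMULA     H2O', ...
    '///'};

labels = {};
values = {};
for k = 1:numel(entryLines)
    tline = entryLines{k};
    if numel(tline) < 12
        continue;                        % '///' has no label/value pair
    end
    labels{end+1} = strtrim(tline(1:12)); %#ok<SAGROW> label column
    values{end+1} = tline(13:end);        %#ok<SAGROW> everything after it
end

for k = 1:numel(labels)
    if isempty(labels{k})
        fprintf('(continuation) -> %s\n', values{k});
    else
        fprintf('%-14s -> %s\n', labels{k}, values{k});
    end
end
```

A blank label marks a continuation of the previous record, which is how the synonym 'Water' is attached to NAME.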

 Check if the metabolites have been parsed before and saved. If so, load the
 model.

SOURCE CODE

function model=getMetsFromKEGG(keggPath)
% getMetsFromKEGG
%   Retrieves information on all metabolites stored in the KEGG database
%
%   Input:
%   keggPath    path to the root of a local FTP dump of the KEGG database
%               (optional, default 'RAVEN/external/kegg'). It is only used
%               if keggMets.mat is not in the RAVEN\external\kegg
%               directory, in which case the files 'compound' and
%               'compound.inchi' are read from this location
%
%   Output:
%   model       a model structure generated from the database. The
%               following fields are filled
%       id              'KEGG'
%       name            'Automatically generated from KEGG database'
%       mets            KEGG compound ids
%       metNames        Compound name. Only the first name will be saved if
%                       there are several synonyms
%       metMiriams      The KEGG compound (or glycan) id, plus the PubChem
%                       substance and ChEBI ids if they are available
%       inchis          InChI string for the metabolite
%       metFormulas     The chemical composition of the metabolite. This
%                       will only be loaded if there is no InChI string
%
%   NOTE: If the file keggMets.mat is in the RAVEN\external\kegg directory
%   it will be loaded instead of parsing the KEGG files. If it does not
%   exist it will be saved after the KEGG files have been parsed. In
%   general, you should remove the keggMets.mat file if you want to rebuild
%   the model structure from a newer version of KEGG.
%
%   Usage: model=getMetsFromKEGG(keggPath)
%
% NOTE: This is how one entry looks in the file
%
% ENTRY       C00001                      Compound
% NAME        H2O;
%             Water
% FORMULA     H2O
% EXACT_MASS  18.0106
% MOL_WEIGHT  18.0153
% REMARK      Same as: D00001
% REACTION    R00001 R00002 R00004 R00005 R00009 R00010 R00011 R00017
%             R00022 R00024 R00025 R00026 R00028 R00036 R00041 R00044
%             (list truncated)
% ENZYME      1.1.1.1         1.1.1.22        1.1.1.23        1.1.1.115
%             1.1.1.132       1.1.1.136       1.1.1.170       1.1.1.186
%             (list truncated)
% BRITE       Therapeutic category of drugs in Japan [BR:br08301]
%             (list truncated)
% DBLINKS     CAS: 7732-18-5
%             PubChem: 3303
%             ChEBI: 15377
%             (list truncated)
%
% The entry continues with further records, such as the positions of the
% atoms. Not every metabolite is guaranteed to follow this structure exactly.
%
% The file is not tab-delimited. Instead, each label occupies the first 12
% characters of its line, padded with spaces. The only exception is the
% entry terminator '///'. Continuation lines have a blank label.
%

if nargin<1
    keggPath='RAVEN/external/kegg';
else
    keggPath=char(keggPath);
end

%Check if the metabolites have been parsed before and saved. If so, load
%the model
ravenPath=findRAVENroot();
metsFile=fullfile(ravenPath,'external','kegg','keggMets.mat');
if exist(metsFile, 'file')
    %Pass the path as an argument so that it is never interpreted as part
    %of the format string
    fprintf('Importing KEGG metabolites from %s... ',strrep(metsFile,'\','/'));
    load(metsFile,'model');
else
    fprintf('NOTE: Cannot locate %s, it will therefore be generated from the local KEGG database\n',strrep(metsFile,'\','/'));
    if ~isfile(fullfile(keggPath,'compound')) || ~isfile(fullfile(keggPath,'compound.inchi'))
        %Build the message as text. fprintf would return a byte count,
        %which is not what dispEM expects
        EM=['The files ''compound'' and ''compound.inchi'' cannot be located at ' strrep(keggPath,'\','/') '/ and should be downloaded from the KEGG FTP.'];
        dispEM(EM);
    else
        fprintf('Generating keggMets.mat file... ');
        %Add new functionality in the order specified in models
        model.id='KEGG';
        model.name='Automatically generated from KEGG database';
        
        %Preallocate memory for 50000 metabolites
        model.mets=cell(50000,1);
        model.metNames=cell(50000,1);
        model.metFormulas=cell(50000,1);
        model.metMiriams=cell(50000,1);
        
        %First load information on metabolite ID, metabolite name,
        %composition, and database cross-references (PubChem and ChEBI)
        
        fid = fopen(fullfile(keggPath,'compound'), 'r');
        
        %Keeps track of how many metabolites have been added
        metCounter=0;
        
        %Loop through the file
        while 1
            %Get the next line
            tline = fgetl(fid);
            
            %Abort at end of file
            if ~ischar(tline)
                break;
            end
            
            %Skip '///' and any other line too short to hold a label
            if numel(tline)<12
                continue;
            end
            
            %Check if it's a new metabolite
            if strcmp(tline(1:12),'ENTRY       ')
                metCounter=metCounter+1;
                
                %Add empty strings where there should be such
                model.metNames{metCounter}='';
                model.metFormulas{metCounter}='';
                
                %Add compound ID (always 6 characters)
                model.mets{metCounter}=tline(13:18);
                
                %Add the KEGG id as metMiriams
                if length(model.mets{metCounter})==6
                    miriamStruct=model.metMiriams{metCounter};
                    if strcmp('G',model.mets{metCounter}(1))
                        miriamStruct.name{1,1}='kegg.glycan';
                    else
                        miriamStruct.name{1,1}='kegg.compound';
                    end
                    miriamStruct.value{1,1}=tline(13:18);
                    model.metMiriams{metCounter}=miriamStruct;
                end
            end
            
            %Add name
            if strcmp(tline(1:12),'NAME        ')
                %If there are synonyms, then the last character is ';'
                if strcmp(tline(end),';')
                    model.metNames{metCounter}=tline(13:end-1);
                    %Semicolon can also occur in the middle, separating
                    %several synonyms in the same line
                    model.metNames{metCounter} = regexprep(model.metNames{metCounter},';.+','');
                elseif regexp(tline,';')
                    model.metNames{metCounter}=tline(13:end);
                    model.metNames{metCounter} = regexprep(model.metNames{metCounter},';.+','');
                else
                    model.metNames{metCounter}=tline(13:end);
                end
            end
            
            %Add composition
            if strcmp(tline(1:12),'FORMULA     ')
                model.metFormulas{metCounter}=tline(13:end);
            end
            
            %Add PubChem id
            if numel(tline)>21
                if strcmp(tline(13:21),'PubChem: ')
                    if isstruct(model.metMiriams{metCounter})
                        addToIndex=numel(model.metMiriams{metCounter}.name)+1;
                    else
                        addToIndex=1;
                    end
                    miriamStruct=model.metMiriams{metCounter};
                    miriamStruct.name{addToIndex,1}='pubchem.substance';
                    miriamStruct.value{addToIndex,1}=tline(22:end);
                    model.metMiriams{metCounter}=miriamStruct;
                end
            end
            
            %Add ChEBI id. One line can hold several space-separated ids
            if numel(tline)>19
                if strcmp(tline(13:19),'ChEBI: ')
                    if isstruct(model.metMiriams{metCounter})
                        addToIndex=numel(model.metMiriams{metCounter}.name)+1;
                    else
                        addToIndex=1;
                    end
                    chebiIDs=strsplit(tline(20:end),' ');
                    miriamStruct=model.metMiriams{metCounter};
                    for i=1:numel(chebiIDs)
                        miriamStruct.name{addToIndex,1}='chebi';
                        miriamStruct.value{addToIndex,1}=strcat('CHEBI:',chebiIDs{i});
                        addToIndex=addToIndex+1;
                    end
                    model.metMiriams{metCounter}=miriamStruct;
                end
            end
        end
        
        %Close the file
        fclose(fid);
        
        %If too much space was allocated, shrink the model
        model.mets=model.mets(1:metCounter);
        model.metNames=model.metNames(1:metCounter);
        model.metFormulas=model.metFormulas(1:metCounter);
        model.metMiriams=model.metMiriams(1:metCounter);
        
        %Then load the InChI strings from another file. Not all metabolites
        %will be present in the list
        
        inchIDs=cell(numel(model.mets),1);
        inchis=cell(numel(model.mets),1);
        
        %The format is metID*tab*string
        
        fid = fopen(fullfile(keggPath,'compound.inchi'), 'r');
        
        %Loop through the file
        counter=1;
        while 1
            %Get the next line
            tline = fgetl(fid);
            
            %Abort at end of file
            if ~ischar(tline)
                break;
            end
            
            %Skip lines that are too short to hold an ID and a string
            if numel(tline)<14
                continue;
            end
            
            %Get the ID and the InChI
            inchIDs{counter}=tline(1:6);
            inchis{counter}=tline(14:end);
            counter=counter+1;
        end
        
        %Close the file
        fclose(fid);
        
        inchIDs=inchIDs(1:counter-1);
        inchis=inchis(1:counter-1);
        
        %Find the metabolites that had InChI strings and add them to the
        %model
        [a, b]=ismember(inchIDs,model.mets);
        
        %If there were mets with InChIs but that were not in the list
        if ~all(a)
            EM='Not all metabolites with InChI strings were found in the original list';
            disp(EM);
        end
        
        %Only use the matched IDs. b is 0 for unmatched IDs and cannot be
        %used as an index
        model.inchis=cell(numel(model.mets),1);
        model.inchis(:)={''};
        model.inchis(b(a))=inchis(a);
        
        %Remove composition if InChI was found
        model.metFormulas(b(a))={''};
        
        %Ensuring that all model.metMiriams.value consist only of strings,
        %no double
        for i=1:numel(model.mets)
            if isstruct(model.metMiriams{i})
                for j=1:numel(model.metMiriams{i}.value)
                    if isa(model.metMiriams{i}.value{j},'double')
                        model.metMiriams{i}.value{j}=num2str(model.metMiriams{i}.value{j});
                    end
                end
            end
        end
        
        %Removing leading and trailing whitespace from metNames
        model.metNames = strtrim(model.metNames);
        
        %Fixing redundant metNames. The first occurrence of a particular
        %metabolite name is not changed, but starting from the second
        %occurrence, the original metabolite name is concatenated with the
        %KEGG COMPOUND id in brackets
        for i=1:numel(model.metNames)
            if ~isempty(model.metNames{i})
                if any(ismember(model.metNames(1:i-1),model.metNames(i)))
                    model.metNames(i) = strcat(model.metNames(i), ' (', model.mets(i),')');
                end
            end
        end
        %Saves the model
        save(metsFile,'model');
    end
end
fprintf('COMPLETE\n');
end
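
The step that attaches InChI strings to the metabolite list relies on the two outputs of ismember. A minimal sketch with made-up data follows: the id 'C99999' and the placeholder strings are hypothetical, and the variable names found, pos, and modelInchis are illustrative. It shows why the position vector should be filtered by the logical mask before it is used as an index, since positions are 0 for ids that are not in the list.

```matlab
mets        = {'C00001'; 'C00002'; 'C00003'};
metFormulas = {'H2O'; 'formulaB'; 'formulaC'};            % placeholders
inchIDs     = {'C00003'; 'C99999'; 'C00001'};             % 'C99999' is not in mets
inchis      = {'InChI=demo3'; 'InChI=demoX'; 'InChI=demo1'};

% found(k) is true if inchIDs{k} occurs in mets; pos(k) is its row, or 0.
[found, pos] = ismember(inchIDs, mets);

modelInchis = repmat({''}, numel(mets), 1);
modelInchis(pos(found)) = inchis(found);   % indexing with pos alone would fail on the 0

% The formula is kept only where no InChI string was found.
metFormulas(pos(found)) = {''};

disp([mets modelInchis metFormulas]);
```

Here the second metabolite keeps its formula because it has no InChI string, while the unmatched id is ignored.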
