Home > external > kegg > getGenesFromKEGG.m

getGenesFromKEGG

PURPOSE ^

getGenesFromKEGG

SYNOPSIS ^

function model=getGenesFromKEGG(keggPath,koList)

DESCRIPTION ^

 getGenesFromKEGG
   Retrieves information on all genes stored in KEGG database

   Input:
   keggPath    if keggGenes.mat is not in the RAVEN\external\kegg
               directory, this function will attempt to read data from a
               local FTP dump of the KEGG database. keggPath is the path
               to the root of this database
   koList      the number of genes in KEGG is very large. koList can be a
               cell array with KO identifiers, in which case only genes
               belonging to one of those KEGG orthologies are retrieved
               (opt, default all KOs with associated reactions)

   Output:
   model       a model structure generated from the database. The
               following fields are filled
       id              'KEGG'
       name     'Automatically generated from KEGG database'
       rxns            KO ids
       rxnNames        Name for each entry
       genes           IDs for all the genes. Genes are saved as organism
                       abbreviation:id (same as in KEGG). 'HSA:124' for
                       example is alcohol dehydrogenase in Homo sapiens
       rxnGeneMat      A binary matrix that indicates whether a specific
                       gene is present in a KO id

   NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory
   it will be loaded instead of parsing of the KEGG files. If it does not
   exist it will be saved after parsing of the KEGG files. In general, you
   should remove the keggGenes.mat file if you want to rebuild the model
   structure from a newer version of KEGG.

   Usage: model=getGenesFromKEGG(keggPath,koList)

 NOTE: This is how one entry looks in the file

 ENTRY       K11440                      KO
 NAME        gbsB
 DEFINITION  choline dehydrogenase [EC:1.1.1.1]
 PATHWAY     ko00260  Glycine, serine and threonine metabolism
             ko01100  Metabolic pathways
 MODULE      M00555  Betaine biosynthesis, choline => betaine
 BRITE       KEGG Orthology (KO) [BR:ko00001]
             (list truncated)
 DBLINKS     RN: R08557 R08558
             COG: COG1454
 GENES       BSU: BSU31050(gbsB)
             BSR: I33_3189
             (list truncated)
 REFERENCE   PMID:8752328
   AUTHORS   Boch J, Kempf B, Schmid R, Bremer E
   TITLE     Synthesis of the osmoprotectant glycine... (truncated)
   JOURNAL   J Bacteriol 178:5121-9 (1996)
             DOI:10.1128/JB.178.17.5121-5129.1996
   SEQUENCE  [bsu:BSU31050]
 ///

 The file is not tab-delimited. Instead each label is 12 characters
 (except for '///').

 Check if the genes have been parsed before and saved. If so, load the
 model.

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function model=getGenesFromKEGG(keggPath,koList)
0002 % getGenesFromKEGG
0003 %   Retrieves information on all genes stored in KEGG database
0004 %
0005 %   Input:
0006 %   keggPath    if keggGenes.mat is not in the RAVEN\external\kegg
0007 %               directory, this function will attempt to read data from a
0008 %               local FTP dump of the KEGG database. keggPath is the path
0009 %               to the root of this database
0010 %   koList      the number of genes in KEGG is very large. koList can be a
0011 %               cell array with KO identifiers, in which case only genes
0012 %               belonging to one of those KEGG orthologies are retrieved
0013 %               (opt, default all KOs with associated reactions)
0014 %
0015 %   Output:
0016 %   model       a model structure generated from the database. The
0017 %               following fields are filled
0018 %       id              'KEGG'
0019 %       name     'Automatically generated from KEGG database'
0020 %       rxns            KO ids
0021 %       rxnNames        Name for each entry
0022 %       genes           IDs for all the genes. Genes are saved as organism
0023 %                       abbreviation:id (same as in KEGG). 'HSA:124' for
0024 %                       example is alcohol dehydrogenase in Homo sapiens
0025 %       rxnGeneMat      A binary matrix that indicates whether a specific
0026 %                       gene is present in a KO id
0027 %
0028 %   NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory
0029 %   it will be loaded instead of parsing of the KEGG files. If it does not
0030 %   exist it will be saved after parsing of the KEGG files. In general, you
0031 %   should remove the keggGenes.mat file if you want to rebuild the model
0032 %   structure from a newer version of KEGG.
0033 %
0034 %   Usage: model=getGenesFromKEGG(keggPath,koList)
0035 %
0036 % NOTE: This is how one entry looks in the file
0037 %
0038 % ENTRY       K11440                      KO
0039 % NAME        gbsB
0040 % DEFINITION  choline dehydrogenase [EC:1.1.1.1]
0041 % PATHWAY     ko00260  Glycine, serine and threonine metabolism
0042 %             ko01100  Metabolic pathways
0043 % MODULE      M00555  Betaine biosynthesis, choline => betaine
0044 % BRITE       KEGG Orthology (KO) [BR:ko00001]
0045 %             (list truncated)
0046 % DBLINKS     RN: R08557 R08558
0047 %             COG: COG1454
0048 % GENES       BSU: BSU31050(gbsB)
0049 %             BSR: I33_3189
0050 %             (list truncated)
0051 % REFERENCE   PMID:8752328
0052 %   AUTHORS   Boch J, Kempf B, Schmid R, Bremer E
0053 %   TITLE     Synthesis of the osmoprotectant glycine... (truncated)
0054 %   JOURNAL   J Bacteriol 178:5121-9 (1996)
0055 %             DOI:10.1128/JB.178.17.5121-5129.1996
0056 %   SEQUENCE  [bsu:BSU31050]
0057 % ///
0058 %
0059 % The file is not tab-delimited. Instead each label is 12 characters
0060 % (except for '///').
0061 %
0062 % Check if the genes have been parsed before and saved. If so, load the
0063 % model.
0064 %
0065 
0066 if nargin<1
0067     keggPath='RAVEN/external/kegg';
0068 else
0069     keggPath=char(keggPath);
0070 end
0071 
0072 ravenPath=findRAVENroot();
0073 genesFile=fullfile(ravenPath,'external','kegg','keggGenes.mat');
0074 if exist(genesFile, 'file')
0075     fprintf(['Importing KEGG genes from ' strrep(genesFile,'\','/') '... ']);
0076     load(genesFile);
0077 else
0078     fprintf(['NOTE: Cannot locate ' strrep(genesFile,'\','/') ', it will therefore be generated from the local KEGG database\n']);
0079     if ~isfile(fullfile(keggPath,'ko')) || ~isfile(fullfile(keggPath,'reaction'))
0080         EM=fprintf(['The files ''ko'' and ''reaction'' cannot be located at ' strrep(keggPath,'\','/') '/ and should be downloaded from the KEGG FTP\n']);
0081         dispEM(EM);
0082     else
0083         fprintf('Generating keggGenes.mat file...\n');
0084         %Get all KOs that are associated to reactions
0085         allKOs=getAllKOs(keggPath);
0086         
0087         %Since the list of genes will be accessed many times the hash table
0088         %is established
0089         geneMap=containers.Map('KeyType','char','ValueType','int32');
0090         
0091         %Add new functionality in the order specified in models
0092         model.id='KEGG';
0093         model.name='Automatically generated from KEGG database';
0094         
0095         %Preallocate memory
0096         model.rxns=cell(numel(allKOs),1);
0097         model.rxnNames=cell(numel(allKOs),1);
0098         
0099         %Reserve space for 500000 genes
0100         model.genes=cell(500000,1);
0101         
0102         %Load information on KO ID, name, and associated genes
0103         fid = fopen(fullfile(keggPath,'ko'), 'r');
0104         
0105         %Keeps track of how many KOs that have been added
0106         koCounter=0;
0107         nGenes=0;
0108         addingGenes=false;
0109         
0110         %These contain the mapping between KOs and genes and are used to
0111         %generate the model.rxnGeneMat in the end
0112         koIndex=zeros(5000000,1);
0113         geneIndex=koIndex;
0114         nMappings=0;
0115         
0116         skipRead=false;
0117         
0118         %Loop through the file
0119         while 1
0120             %Get the next line
0121             if skipRead==false
0122                 tline = fgetl(fid);
0123             else
0124                 skipRead=false;
0125             end
0126             
0127             %Abort at end of file
0128             if ~ischar(tline)
0129                 break;
0130             end
0131             
0132             %Skip '///'
0133             if numel(tline)<12
0134                 addingGenes=false;
0135                 continue;
0136             end
0137             
0138             %Check if it's a new reaction
0139             if strcmp(tline(1:12),'ENTRY       ')
0140                 %Check if it should be added
0141                 koID=tline(13:18);
0142                 
0143                 if ismember(koID,allKOs)
0144                     addMe=true;
0145                     koCounter=koCounter+1;
0146                 else
0147                     addMe=false;
0148                     continue;
0149                 end
0150                 
0151                 %Add reaction ID (always 6 characters)
0152                 model.rxns{koCounter}=koID;
0153                 
0154                 model.rxnNames{koCounter}='';
0155                 %Will be overwritten if it exists
0156             end
0157             
0158             %To get here we must be in a KO that should be added
0159             if addMe==true
0160                 %Add name
0161                 if strcmp(tline(1:12),'DEFINITION  ')
0162                     model.rxnNames{koCounter}=tline(13:end);
0163                 end
0164                 
0165                 %Check if it's time to start adding genes
0166                 if strcmp(tline(1:12),'GENES       ')
0167                     addingGenes=true;
0168                 end
0169                 
0170                 %Add genes
0171                 if addingGenes==true
0172                     %We are now adding genes for the current KO. All gene
0173                     %rows start with 12 spaces. If this is not true it
0174                     %means that there is additional annotation such as
0175                     %AUTHORS. This is not common but it exists
0176                     if strcmp(tline(1:12),'            ') || strcmp(tline(1:12),'GENES       ')
0177                         geneLine=tline(13:end);
0178                         
0179                         %Check if the line is from a new organism of from
0180                         %the same as before
0181                         if strcmp(geneLine(4),':')
0182                             %If organism id contains 3 letters;
0183                             currentOrganism=geneLine(1:3);
0184                             %Parse the string
0185                             genes=regexp(geneLine(6:end),' ','split');
0186                             genes=strcat(currentOrganism,':',genes(:));
0187                         elseif strcmp(geneLine(5),':')
0188                             %If organism id contains 4 letters;
0189                             currentOrganism=geneLine(1:4);
0190                             %Parse the string
0191                             genes=regexp(geneLine(7:end),' ','split');
0192                             genes=strcat(currentOrganism,':',genes(:));
0193                         end
0194                         
0195                         %Add the genes to the gene list
0196                         for i=1:numel(genes)
0197                             geneString=genes{i};
0198                             if geneMap.isKey(geneString)
0199                                 index=geneMap(geneString);
0200                             else
0201                                 nGenes=nGenes+1;
0202                                 model.genes{nGenes}=geneString;
0203                                 index=nGenes;
0204                                 geneMap(geneString)=index;
0205                             end
0206                             
0207                             %Add the connection to the KO
0208                             nMappings=nMappings+1;
0209                             koIndex(nMappings)=koCounter;
0210                             geneIndex(nMappings)=index;
0211                         end
0212                     else
0213                         %Now we want to throw away everything until the
0214                         %next entry
0215                         while 1
0216                             tline = fgetl(fid);
0217                             if strcmp(tline,'///')
0218                                 %When the new entry is found, skip reading
0219                                 %one line to fit with the rest of the code
0220                                 skipRead=true;
0221                                 break;
0222                             end
0223                         end
0224                     end
0225                 end
0226             end
0227         end
0228         %Close the file
0229         fclose(fid);
0230         
0231         %If too much space was allocated, shrink the model
0232         model.rxns=model.rxns(1:koCounter);
0233         model.rxnNames=model.rxnNames(1:koCounter);
0234         model.genes=model.genes(1:nGenes);
0235         
0236         %Retrieve and add the gene associations
0237         model.rxnGeneMat=sparse(koIndex(1:nMappings),geneIndex(1:nMappings),ones(nMappings,1));
0238         
0239         %To make sure the size is correct if the last KOs don't have genes
0240         if size(model.rxnGeneMat,1)~=koCounter
0241             model.rxnGeneMat(koCounter,1)=0;
0242         end
0243         
0244         %Trim the genes so that they only contain information that can be
0245         %matched to the KEGG file of protein sequences (remove all
0246         %information after first parenthesis)
0247         %NOTE: For some reason the organism abbreviation should be in lower
0248         %case in this database. Fix this here
0249         for i=1:numel(model.genes)
0250             parIndex=strfind(model.genes{i},'(');
0251             if any(parIndex)
0252                 model.genes{i}=model.genes{i}(1:parIndex-1);
0253             end
0254             colIndex=strfind(model.genes{i},':');
0255             model.genes{i}=[lower(model.genes{i}(1:colIndex-1)) model.genes{i}(colIndex:end)];
0256         end
0257         
0258         %Save the model structure
0259         save(genesFile,'model');
0260     end
0261 end
0262 
0263 %Only get the KOs in koList
0264 if nargin>1
0265     I=~ismember(model.rxns,koList);
0266     model=removeReactions(model,I,true,true);
0267 end
0268 fprintf('COMPLETE\n');
0269 end
0270 
0271 function allKOs=getAllKOs(keggPath)
0272 %Retrieves all KOs that are associated to reactions. This is because the
0273 %number of genes in KEGG is very large so without this parsing it would
0274 %take many hours
0275 
0276 allKOs={};
0277 
0278 %First check if the reactions have already been parsed
0279 ravenPath=findRAVENroot;
0280 rxnsFile=fullfile(ravenPath,'external','kegg','keggRxns.mat');
0281 if exist(rxnsFile, 'file')
0282     fprintf(['NOTE: Importing KEGG ORTHOLOGY list from ' strrep(rxnsFile,'\','/') '.\n']);
0283     load(rxnsFile,'model');
0284     
0285     %Loop through the reactions and add the corresponding genes
0286     for i=1:numel(model.rxns)
0287         if isstruct(model.rxnMiriams{i})
0288             %Get all KOs
0289             allKOs=[allKOs;model.rxnMiriams{i}.value(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'))];
0290         end
0291     end
0292 else
0293     %Parse the reaction file instead First load information on reaction ID,
0294     %reaction name, KO, pathway and ec-number
0295     fid = fopen(fullfile(keggPath,'reaction'), 'r');
0296     orthology=false;
0297     while 1
0298         %Get the next line
0299         tline = fgetl(fid);
0300         
0301         %Abort at end of file
0302         if ~ischar(tline)
0303             break;
0304         end
0305         
0306         %Skip '///'
0307         if numel(tline)<12
0308             continue;
0309         end
0310         
0311         %Check if it's a new reaction
0312         if strcmp(tline(1:12),'ENTRY       ')
0313             orthology=false;
0314         end
0315         
0316         if strcmp(tline(1:9),'REFERENCE')
0317             orthology=false;
0318         end
0319         
0320         %Add KO-ids
0321         if numel(tline)>16
0322             if strcmp(tline(1:16),'ORTHOLOGY   KO: ') || strcmp(tline(1:16),'            KO: ') || strcmp(tline(1:12),'ORTHOLOGY   ') || orthology==true
0323                 if strcmp(tline(13:16),'KO:')
0324                     %This is in the old version
0325                     allKOs=[allKOs;tline(17:22)];
0326                 elseif strcmp(tline(1:12),'ORTHOLOGY   ')
0327                     %This means that it found one KO in the new format and
0328                     %that subsequent lines might be other KOs
0329                     orthology=true;
0330                     allKOs=[allKOs;tline(13:18)];
0331                 end
0332             end
0333         end
0334     end
0335     
0336     %Close the file
0337     fclose(fid);
0338 end
0339 allKOs=unique(allKOs);
0340 end

Generated by m2html © 2005