Home > external > kegg > getGenesFromKEGG.m

getGenesFromKEGG

PURPOSE ^

getGenesFromKEGG

SYNOPSIS ^

function model=getGenesFromKEGG(keggPath,koList)

DESCRIPTION ^

 getGenesFromKEGG
   Retrieves information on all genes stored in KEGG database

   Input:
   keggPath    if keggGenes.mat is not in the RAVEN\external\kegg
               directory, this function will attempt to read data from a
               local FTP dump of the KEGG database. keggPath is the path
               to the root of this database
   koList      the number of genes in KEGG is very large. koList can be a
               cell array with KO identifiers, in which case only genes
               belonging to one of those KEGG orthologies are retrieved
               (optional, default all KOs with associated reactions)

   Output:
   model       a model structure generated from the database. The
               following fields are filled
       id              'KEGG'
       name     'Automatically generated from KEGG database'
       rxns            KO ids
       rxnNames        Name for each entry
       genes           IDs for all the genes. Genes are saved as organism
                       abbreviation:id (same as in KEGG). 'HSA:124' for
                       example is alcohol dehydrogenase in Homo sapiens
       rxnGeneMat      A binary matrix that indicates whether a specific
                       gene is present in a KO id

   NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory
   it will be loaded instead of parsing of the KEGG files. If it does not
   exist it will be saved after parsing of the KEGG files. In general, you
   should remove the keggGenes.mat file if you want to rebuild the model
   structure from a newer version of KEGG.

 Usage: model=getGenesFromKEGG(keggPath,koList)

 NOTE: This is how one entry looks in the file

 ENTRY       K11440                      KO
 NAME        gbsB
 DEFINITION  choline dehydrogenase [EC:1.1.1.1]
 PATHWAY     ko00260  Glycine, serine and threonine metabolism
             ko01100  Metabolic pathways
 MODULE      M00555  Betaine biosynthesis, choline => betaine
 BRITE       KEGG Orthology (KO) [BR:ko00001]
             (list truncated)
 DBLINKS     RN: R08557 R08558
             COG: COG1454
 GENES       BSU: BSU31050(gbsB)
             BSR: I33_3189
             (list truncated)
 REFERENCE   PMID:8752328
   AUTHORS   Boch J, Kempf B, Schmid R, Bremer E
   TITLE     Synthesis of the osmoprotectant glycine... (truncated)
   JOURNAL   J Bacteriol 178:5121-9 (1996)
             DOI:10.1128/JB.178.17.5121-5129.1996
   SEQUENCE  [bsu:BSU31050]
 ///

 The file is not tab-delimited. Instead each label is 12 characters
 (except for '///').

 Check if the genes have been parsed before and saved. If so, load the
 model.

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function model=getGenesFromKEGG(keggPath,koList)
0002 % getGenesFromKEGG
0003 %   Retrieves information on all genes stored in KEGG database
0004 %
0005 %   Input:
0006 %   keggPath    if keggGenes.mat is not in the RAVEN\external\kegg
0007 %               directory, this function will attempt to read data from a
0008 %               local FTP dump of the KEGG database. keggPath is the path
0009 %               to the root of this database
0010 %   koList      the number of genes in KEGG is very large. koList can be a
0011 %               cell array with KO identifiers, in which case only genes
0012 %               belonging to one of those KEGG orthologies are retrieved
0013 %               (optional, default all KOs with associated reactions)
0014 %
0015 %   Output:
0016 %   model       a model structure generated from the database. The
0017 %               following fields are filled
0018 %       id              'KEGG'
0019 %       name     'Automatically generated from KEGG database'
0020 %       rxns            KO ids
0021 %       rxnNames        Name for each entry
0022 %       genes           IDs for all the genes. Genes are saved as organism
0023 %                       abbreviation:id (same as in KEGG). 'HSA:124' for
0024 %                       example is alcohol dehydrogenase in Homo sapiens
0025 %       rxnGeneMat      A binary matrix that indicates whether a specific
0026 %                       gene is present in a KO id
0027 %
0028 %   NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory
0029 %   it will be loaded instead of parsing of the KEGG files. If it does not
0030 %   exist it will be saved after parsing of the KEGG files. In general, you
0031 %   should remove the keggGenes.mat file if you want to rebuild the model
0032 %   structure from a newer version of KEGG.
0033 %
0034 % Usage: model=getGenesFromKEGG(keggPath,koList)
0035 %
0036 % NOTE: This is how one entry looks in the file
0037 %
0038 % ENTRY       K11440                      KO
0039 % NAME        gbsB
0040 % DEFINITION  choline dehydrogenase [EC:1.1.1.1]
0041 % PATHWAY     ko00260  Glycine, serine and threonine metabolism
0042 %             ko01100  Metabolic pathways
0043 % MODULE      M00555  Betaine biosynthesis, choline => betaine
0044 % BRITE       KEGG Orthology (KO) [BR:ko00001]
0045 %             (list truncated)
0046 % DBLINKS     RN: R08557 R08558
0047 %             COG: COG1454
0048 % GENES       BSU: BSU31050(gbsB)
0049 %             BSR: I33_3189
0050 %             (list truncated)
0051 % REFERENCE   PMID:8752328
0052 %   AUTHORS   Boch J, Kempf B, Schmid R, Bremer E
0053 %   TITLE     Synthesis of the osmoprotectant glycine... (truncated)
0054 %   JOURNAL   J Bacteriol 178:5121-9 (1996)
0055 %             DOI:10.1128/JB.178.17.5121-5129.1996
0056 %   SEQUENCE  [bsu:BSU31050]
0057 % ///
0058 %
0059 % The file is not tab-delimited. Instead each label is 12 characters
0060 % (except for '///').
0061 %
0062 % Check if the genes have been parsed before and saved. If so, load the
0063 % model.
0064 %
0065 
0066 if nargin<1
0067     keggPath='RAVEN/external/kegg';
0068 else
0069     keggPath=char(keggPath);
0070 end
0071 
0072 ravenPath=findRAVENroot();
0073 genesFile=fullfile(ravenPath,'external','kegg','keggGenes.mat');
0074 if ~exist(genesFile, 'file')
0075     try
0076         downloadKEGGgenes();
0077     catch
0078         %Download failed; fall through to local regeneration below
0079     end
0080 end
0081 if exist(genesFile, 'file')
0082     fprintf(['Importing KEGG genes from ' strrep(genesFile,'\','/') '... ']);
0083     load(genesFile);
0084 else
0085     fprintf(['NOTE: Cannot locate ' strrep(genesFile,'\','/') ', it will therefore be generated from the local KEGG database\n']);
0086     if ~isfile(fullfile(keggPath,'ko')) || ~isfile(fullfile(keggPath,'reaction'))
0087         EM=fprintf(['The files ''ko'' and ''reaction'' cannot be located at ' strrep(keggPath,'\','/') '/ and should be downloaded from the KEGG FTP\n']);
0088         dispEM(EM);
0089     else
0090         fprintf('Generating keggGenes.mat file...\n');
0091         %Get all KOs that are associated to reactions
0092         allKOs=getAllKOs(keggPath);
0093         
0094         %Since the list of genes will be accessed many times the hash table
0095         %is established
0096         geneMap=containers.Map('KeyType','char','ValueType','int32');
0097         
0098         %Add new functionality in the order specified in models
0099         model.id='KEGG';
0100         model.name='Automatically generated from KEGG database';
0101         
0102         %Preallocate memory
0103         model.rxns=cell(numel(allKOs),1);
0104         model.rxnNames=cell(numel(allKOs),1);
0105         
0106         %Reserve space for 500000 genes
0107         model.genes=cell(500000,1);
0108         
0109         %Load information on KO ID, name, and associated genes
0110         fid = fopen(fullfile(keggPath,'ko'), 'r');
0111         
0112         %Keeps track of how many KOs that have been added
0113         koCounter=0;
0114         nGenes=0;
0115         addingGenes=false;
0116         
0117         %These contain the mapping between KOs and genes and are used to
0118         %generate the model.rxnGeneMat in the end
0119         koIndex=zeros(5000000,1);
0120         geneIndex=koIndex;
0121         nMappings=0;
0122         
0123         skipRead=false;
0124         
0125         %Loop through the file
0126         while 1
0127             %Get the next line
0128             if skipRead==false
0129                 tline = fgetl(fid);
0130             else
0131                 skipRead=false;
0132             end
0133             
0134             %Abort at end of file
0135             if ~ischar(tline)
0136                 break;
0137             end
0138             
0139             %Skip '///'
0140             if numel(tline)<12
0141                 addingGenes=false;
0142                 continue;
0143             end
0144             
0145             %Check if it's a new reaction
0146             if strcmp(tline(1:12),'ENTRY       ')
0147                 %Check if it should be added
0148                 koID=tline(13:18);
0149                 
0150                 if ismember(koID,allKOs)
0151                     addMe=true;
0152                     koCounter=koCounter+1;
0153                 else
0154                     addMe=false;
0155                     continue;
0156                 end
0157                 
0158                 %Add reaction ID (always 6 characters)
0159                 model.rxns{koCounter}=koID;
0160                 
0161                 model.rxnNames{koCounter}='';
0162                 %Will be overwritten if it exists
0163             end
0164             
0165             %To get here we must be in a KO that should be added
0166             if addMe==true
0167                 %Add name
0168                 if strcmp(tline(1:12),'DEFINITION  ')
0169                     model.rxnNames{koCounter}=tline(13:end);
0170                 end
0171                 
0172                 %Check if it's time to start adding genes
0173                 if strcmp(tline(1:12),'GENES       ')
0174                     addingGenes=true;
0175                 end
0176                 
0177                 %Add genes
0178                 if addingGenes==true
0179                     %We are now adding genes for the current KO. All gene
0180                     %rows start with 12 spaces. If this is not true it
0181                     %means that there is additional annotation such as
0182                     %AUTHORS. This is not common but it exists
0183                     if strcmp(tline(1:12),'            ') || strcmp(tline(1:12),'GENES       ')
0184                         geneLine=tline(13:end);
0185                         
0186                         %Check if the line is from a new organism of from
0187                         %the same as before
0188                         if strcmp(geneLine(4),':')
0189                             %If organism id contains 3 letters;
0190                             currentOrganism=geneLine(1:3);
0191                             %Parse the string
0192                             genes=regexp(geneLine(6:end),' ','split');
0193                             genes=strcat(currentOrganism,':',genes(:));
0194                         elseif strcmp(geneLine(5),':')
0195                             %If organism id contains 4 letters;
0196                             currentOrganism=geneLine(1:4);
0197                             %Parse the string
0198                             genes=regexp(geneLine(7:end),' ','split');
0199                             genes=strcat(currentOrganism,':',genes(:));
0200                         end
0201                         
0202                         %Add the genes to the gene list
0203                         for i=1:numel(genes)
0204                             geneString=genes{i};
0205                             if geneMap.isKey(geneString)
0206                                 index=geneMap(geneString);
0207                             else
0208                                 nGenes=nGenes+1;
0209                                 model.genes{nGenes}=geneString;
0210                                 index=nGenes;
0211                                 geneMap(geneString)=index;
0212                             end
0213                             
0214                             %Add the connection to the KO
0215                             nMappings=nMappings+1;
0216                             koIndex(nMappings)=koCounter;
0217                             geneIndex(nMappings)=index;
0218                         end
0219                     else
0220                         %Now we want to throw away everything until the
0221                         %next entry
0222                         while 1
0223                             tline = fgetl(fid);
0224                             if strcmp(tline,'///')
0225                                 %When the new entry is found, skip reading
0226                                 %one line to fit with the rest of the code
0227                                 skipRead=true;
0228                                 break;
0229                             end
0230                         end
0231                     end
0232                 end
0233             end
0234         end
0235         %Close the file
0236         fclose(fid);
0237         
0238         %If too much space was allocated, shrink the model
0239         model.rxns=model.rxns(1:koCounter);
0240         model.rxnNames=model.rxnNames(1:koCounter);
0241         model.genes=model.genes(1:nGenes);
0242         
0243         %Retrieve and add the gene associations
0244         model.rxnGeneMat=sparse(koIndex(1:nMappings),geneIndex(1:nMappings),ones(nMappings,1));
0245         
0246         %To make sure the size is correct if the last KOs don't have genes
0247         if size(model.rxnGeneMat,1)~=koCounter
0248             model.rxnGeneMat(koCounter,1)=0;
0249         end
0250         
0251         %Trim the genes so that they only contain information that can be
0252         %matched to the KEGG file of protein sequences (remove all
0253         %information after first parenthesis)
0254         %NOTE: For some reason the organism abbreviation should be in lower
0255         %case in this database. Fix this here
0256         for i=1:numel(model.genes)
0257             parIndex=strfind(model.genes{i},'(');
0258             if any(parIndex)
0259                 model.genes{i}=model.genes{i}(1:parIndex-1);
0260             end
0261             colIndex=strfind(model.genes{i},':');
0262             model.genes{i}=[lower(model.genes{i}(1:colIndex-1)) model.genes{i}(colIndex:end)];
0263         end
0264         
0265         %Save the model structure
0266         save(genesFile,'model');
0267     end
0268 end
0269 
0270 %Only get the KOs in koList
0271 if nargin>1
0272     I=~ismember(model.rxns,koList);
0273     model=removeReactions(model,I,true,true);
0274 end
0275 fprintf('COMPLETE\n');
0276 end
0277 
0278 function allKOs=getAllKOs(keggPath)
0279 %Retrieves all KOs that are associated to reactions. This is because the
0280 %number of genes in KEGG is very large so without this parsing it would
0281 %take many hours
0282 
0283 allKOs={};
0284 
0285 %First check if the reactions have already been parsed
0286 ravenPath=findRAVENroot;
0287 rxnsFile=fullfile(ravenPath,'external','kegg','keggRxns.mat');
0288 if exist(rxnsFile, 'file')
0289     fprintf(['NOTE: Importing KEGG ORTHOLOGY list from ' strrep(rxnsFile,'\','/') '.\n']);
0290     load(rxnsFile,'model');
0291     
0292     %Loop through the reactions and add the corresponding genes
0293     for i=1:numel(model.rxns)
0294         if isstruct(model.rxnMiriams{i})
0295             %Get all KOs
0296             allKOs=[allKOs;model.rxnMiriams{i}.value(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'))];
0297         end
0298     end
0299 else
0300     %Parse the reaction file instead First load information on reaction ID,
0301     %reaction name, KO, pathway and ec-number
0302     fid = fopen(fullfile(keggPath,'reaction'), 'r');
0303     orthology=false;
0304     while 1
0305         %Get the next line
0306         tline = fgetl(fid);
0307         
0308         %Abort at end of file
0309         if ~ischar(tline)
0310             break;
0311         end
0312         
0313         %Skip '///'
0314         if numel(tline)<12
0315             continue;
0316         end
0317         
0318         %Check if it's a new reaction
0319         if strcmp(tline(1:12),'ENTRY       ')
0320             orthology=false;
0321         end
0322         
0323         if strcmp(tline(1:9),'REFERENCE')
0324             orthology=false;
0325         end
0326         
0327         %Add KO-ids
0328         if numel(tline)>16
0329             if strcmp(tline(1:16),'ORTHOLOGY   KO: ') || strcmp(tline(1:16),'            KO: ') || strcmp(tline(1:12),'ORTHOLOGY   ') || orthology==true
0330                 if strcmp(tline(13:16),'KO:')
0331                     %This is in the old version
0332                     allKOs=[allKOs;tline(17:22)];
0333                 elseif strcmp(tline(1:12),'ORTHOLOGY   ')
0334                     %This means that it found one KO in the new format and
0335                     %that subsequent lines might be other KOs
0336                     orthology=true;
0337                     allKOs=[allKOs;tline(13:18)];
0338                 end
0339             end
0340         end
0341     end
0342     
0343     %Close the file
0344     fclose(fid);
0345 end
0346 allKOs=unique(allKOs);
0347 end
0348 
0349 function downloadKEGGgenes()
0350 % downloadKEGGgenes
0351 %   Downloads keggGenes.mat from the RAVEN GitHub release into
0352 %   external/kegg/. This file is distributed as a release asset rather
0353 %   than shipped in the repository because it exceeds GitHub's 100 MB
0354 %   file-size limit.
0355 %
0356 %   The following asset is expected on the release tag defined below:
0357 %       keggGenes.zip   containing keggGenes.mat
0358 %
0359 %   To force a refresh (for example, after a new KEGG snapshot has been
0360 %   published with a new RAVEN release), delete
0361 %   external/kegg/keggGenes.mat and call getGenesFromKEGG again.
0362 %
0363 %   Usage: downloadKEGGgenes()
0364 
0365 % Release tag that hosts the data archive. Bump this for any RAVEN
0366 % release that ships an updated keggGenes.mat.
0367 releaseTag = 'v2.11.1';
0368 
0369 archiveName = 'keggGenes.zip';
0370 targetDir   = fullfile(findRAVENroot(),'external','kegg');
0371 targetFile  = fullfile(targetDir,'keggGenes.mat');
0372 url         = ['https://github.com/SysBioChalmers/RAVEN/releases/download/',...
0373                releaseTag,'/',archiveName];
0374 zipPath     = fullfile(targetDir,archiveName);
0375 
0376 fprintf(['Downloading ' archiveName ' from RAVEN ' releaseTag ' release... ']);
0377 try
0378     websave(zipPath,url);
0379 catch ME
0380     if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError')
0381         error(['Failed to download ' archiveName ': the server returned ' ...
0382                'a 404 error. If this persists, please report it at ' ...
0383                'https://github.com/SysBioChalmers/RAVEN/issues']);
0384     else
0385         rethrow(ME);
0386     end
0387 end
0388 fprintf('COMPLETE\n');
0389 
0390 fprintf(['Extracting ' archiveName ' to ' strrep(targetDir,'\','/') '... ']);
0391 try
0392     unzip(zipPath,targetDir);
0393 catch ME
0394     if isfile(zipPath); delete(zipPath); end
0395     error(['Failed to extract ' archiveName '. The downloaded archive ' ...
0396            'may be corrupted.\nOriginal error: %s'],ME.message);
0397 end
0398 fprintf('COMPLETE\n');
0399 
0400 if isfile(zipPath); delete(zipPath); end
0401 
0402 if ~isfile(targetFile)
0403     error(['keggGenes.mat was not found in ' archiveName ...
0404            '. Please verify the release asset contents.']);
0405 end
0406 end

Generated by m2html © 2005