Description of getGenesFromKEGG

0001 function model=getGenesFromKEGG(keggPath,koList)
0002 % getGenesFromKEGG
0003 %   Retrieves information on all genes stored in KEGG database
0004 %
0005 %   Input:
0006 %   keggPath    if keggGenes.mat is not in the RAVEN\external\kegg
0007 %               directory, this function will attempt to read data from a
0008 %               local FTP dump of the KEGG database. keggPath is the path
0009 %               to the root of this database
0010 %   koList      the number of genes in KEGG is very large. koList can be a
0011 %               cell array with KO identifiers, in which case only genes
0012 %               belonging to one of those KEGG orthologies are retrieved
0013 %               (optional, default all KOs with associated reactions)
0014 %
0015 %   Output:
0016 %   model       a model structure generated from the database. The
0017 %               following fields are filled
0018 %       id              'KEGG'
0019 %       name     'Automatically generated from KEGG database'
0020 %       rxns            KO ids
0021 %       rxnNames        Name for each entry
0022 %       genes           IDs for all the genes. Genes are saved as organism
0023 %                       abbreviation:id (same as in KEGG). 'HSA:124' for
0024 %                       example is alcohol dehydrogenase in Homo sapiens
0025 %       rxnGeneMat      A binary matrix that indicates whether a specific
0026 %                       gene is present in a KO id
0027 %
0028 %   NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory
0029 %   it will be loaded instead of parsing of the KEGG files. If it does not
0030 %   exist it will be saved after parsing of the KEGG files. In general, you
0031 %   should remove the keggGenes.mat file if you want to rebuild the model
0032 %   structure from a newer version of KEGG.
0033 %
0034 % Usage: model=getGenesFromKEGG(keggPath,koList)
0035 %
0036 % NOTE: This is how one entry looks in the file
0037 %
0038 % ENTRY       K11440                      KO
0039 % NAME        gbsB
0040 % DEFINITION  choline dehydrogenase [EC:1.1.1.1]
0041 % PATHWAY     ko00260  Glycine, serine and threonine metabolism
0042 %             ko01100  Metabolic pathways
0043 % MODULE      M00555  Betaine biosynthesis, choline => betaine
0044 % BRITE       KEGG Orthology (KO) [BR:ko00001]
0045 %             (list truncated)
0046 % DBLINKS     RN: R08557 R08558
0047 %             COG: COG1454
0048 % GENES       BSU: BSU31050(gbsB)
0049 %             BSR: I33_3189
0050 %             (list truncated)
0051 % REFERENCE   PMID:8752328
0052 %   AUTHORS   Boch J, Kempf B, Schmid R, Bremer E
0053 %   TITLE     Synthesis of the osmoprotectant glycine... (truncated)
0054 %   JOURNAL   J Bacteriol 178:5121-9 (1996)
0055 %             DOI:10.1128/JB.178.17.5121-5129.1996
0056 %   SEQUENCE  [bsu:BSU31050]
0057 % ///
0058 %
0059 % The file is not tab-delimited. Instead each label is 12 characters
0060 % (except for '///').
0061 %
0062 % Check if the genes have been parsed before and saved. If so, load the
0063 % model.
0064 %
0065 
0066 if nargin<1
0067     keggPath='RAVEN/external/kegg';
0068 else
0069     keggPath=char(keggPath);
0070 end
0071 
0072 ravenPath=findRAVENroot();
0073 genesFile=fullfile(ravenPath,'external','kegg','keggGenes.mat');
0074 if ~exist(genesFile, 'file')
0075     try
0076         downloadKEGGgenes();
0077     catch
0078         %Download failed; fall through to local regeneration below
0079     end
0080 end
0081 if exist(genesFile, 'file')
0082     fprintf(['Importing KEGG genes from ' strrep(genesFile,'\','/') '... ']);
0083     load(genesFile);
0084 else
0085     fprintf(['NOTE: Cannot locate ' strrep(genesFile,'\','/') ', it will therefore be generated from the local KEGG database\n']);
0086     if ~isfile(fullfile(keggPath,'ko')) || ~isfile(fullfile(keggPath,'reaction'))
0087         EM=fprintf(['The files ''ko'' and ''reaction'' cannot be located at ' strrep(keggPath,'\','/') '/ and should be downloaded from the KEGG FTP\n']);
0088         dispEM(EM);
0089     else
0090         fprintf('Generating keggGenes.mat file...\n');
0091         %Get all KOs that are associated to reactions
0092         allKOs=getAllKOs(keggPath);
0093         
0094         %Since the list of genes will be accessed many times the hash table
0095         %is established
0096         geneMap=containers.Map('KeyType','char','ValueType','int32');
0097         
0098         %Add new functionality in the order specified in models
0099         model.id='KEGG';
0100         model.name='Automatically generated from KEGG database';
0101         
0102         %Preallocate memory
0103         model.rxns=cell(numel(allKOs),1);
0104         model.rxnNames=cell(numel(allKOs),1);
0105         
0106         %Reserve space for 500000 genes
0107         model.genes=cell(500000,1);
0108         
0109         %Load information on KO ID, name, and associated genes
0110         fid = fopen(fullfile(keggPath,'ko'), 'r');
0111         
0112         %Keeps track of how many KOs that have been added
0113         koCounter=0;
0114         nGenes=0;
0115         addingGenes=false;
0116         
0117         %These contain the mapping between KOs and genes and are used to
0118         %generate the model.rxnGeneMat in the end
0119         koIndex=zeros(5000000,1);
0120         geneIndex=koIndex;
0121         nMappings=0;
0122         
0123         skipRead=false;
0124         
0125         %Loop through the file
0126         while 1
0127             %Get the next line
0128             if skipRead==false
0129                 tline = fgetl(fid);
0130             else
0131                 skipRead=false;
0132             end
0133             
0134             %Abort at end of file
0135             if ~ischar(tline)
0136                 break;
0137             end
0138             
0139             %Skip '///'
0140             if numel(tline)<12
0141                 addingGenes=false;
0142                 continue;
0143             end
0144             
0145             %Check if it's a new reaction
0146             if strcmp(tline(1:12),'ENTRY       ')
0147                 %Check if it should be added
0148                 koID=tline(13:18);
0149                 
0150                 if ismember(koID,allKOs)
0151                     addMe=true;
0152                     koCounter=koCounter+1;
0153                 else
0154                     addMe=false;
0155                     continue;
0156                 end
0157                 
0158                 %Add reaction ID (always 6 characters)
0159                 model.rxns{koCounter}=koID;
0160                 
0161                 model.rxnNames{koCounter}='';
0162                 %Will be overwritten if it exists
0163             end
0164             
0165             %To get here we must be in a KO that should be added
0166             if addMe==true
0167                 %Add name
0168                 if strcmp(tline(1:12),'DEFINITION  ')
0169                     model.rxnNames{koCounter}=tline(13:end);
0170                 end
0171                 
0172                 %Check if it's time to start adding genes
0173                 if strcmp(tline(1:12),'GENES       ')
0174                     addingGenes=true;
0175                 end
0176                 
0177                 %Add genes
0178                 if addingGenes==true
0179                     %We are now adding genes for the current KO. All gene
0180                     %rows start with 12 spaces. If this is not true it
0181                     %means that there is additional annotation such as
0182                     %AUTHORS. This is not common but it exists
0183                     if strcmp(tline(1:12),'            ') || strcmp(tline(1:12),'GENES       ')
0184                         geneLine=tline(13:end);
0185                         
0186                         %Check if the line is from a new organism of from
0187                         %the same as before
0188                         if strcmp(geneLine(4),':')
0189                             %If organism id contains 3 letters;
0190                             currentOrganism=geneLine(1:3);
0191                             %Parse the string
0192                             genes=regexp(geneLine(6:end),' ','split');
0193                             genes=strcat(currentOrganism,':',genes(:));
0194                         elseif strcmp(geneLine(5),':')
0195                             %If organism id contains 4 letters;
0196                             currentOrganism=geneLine(1:4);
0197                             %Parse the string
0198                             genes=regexp(geneLine(7:end),' ','split');
0199                             genes=strcat(currentOrganism,':',genes(:));
0200                         end
0201                         
0202                         %Add the genes to the gene list
0203                         for i=1:numel(genes)
0204                             geneString=genes{i};
0205                             if geneMap.isKey(geneString)
0206                                 index=geneMap(geneString);
0207                             else
0208                                 nGenes=nGenes+1;
0209                                 model.genes{nGenes}=geneString;
0210                                 index=nGenes;
0211                                 geneMap(geneString)=index;
0212                             end
0213                             
0214                             %Add the connection to the KO
0215                             nMappings=nMappings+1;
0216                             koIndex(nMappings)=koCounter;
0217                             geneIndex(nMappings)=index;
0218                         end
0219                     else
0220                         %Now we want to throw away everything until the
0221                         %next entry
0222                         while 1
0223                             tline = fgetl(fid);
0224                             if strcmp(tline,'///')
0225                                 %When the new entry is found, skip reading
0226                                 %one line to fit with the rest of the code
0227                                 skipRead=true;
0228                                 break;
0229                             end
0230                         end
0231                     end
0232                 end
0233             end
0234         end
0235         %Close the file
0236         fclose(fid);
0237         
0238         %If too much space was allocated, shrink the model
0239         model.rxns=model.rxns(1:koCounter);
0240         model.rxnNames=model.rxnNames(1:koCounter);
0241         model.genes=model.genes(1:nGenes);
0242         
0243         %Retrieve and add the gene associations
0244         model.rxnGeneMat=sparse(koIndex(1:nMappings),geneIndex(1:nMappings),ones(nMappings,1));
0245         
0246         %To make sure the size is correct if the last KOs don't have genes
0247         if size(model.rxnGeneMat,1)~=koCounter
0248             model.rxnGeneMat(koCounter,1)=0;
0249         end
0250         
0251         %Trim the genes so that they only contain information that can be
0252         %matched to the KEGG file of protein sequences (remove all
0253         %information after first parenthesis)
0254         %NOTE: For some reason the organism abbreviation should be in lower
0255         %case in this database. Fix this here
0256         for i=1:numel(model.genes)
0257             parIndex=strfind(model.genes{i},'(');
0258             if any(parIndex)
0259                 model.genes{i}=model.genes{i}(1:parIndex-1);
0260             end
0261             colIndex=strfind(model.genes{i},':');
0262             model.genes{i}=[lower(model.genes{i}(1:colIndex-1)) model.genes{i}(colIndex:end)];
0263         end
0264         
0265         %Save the model structure
0266         save(genesFile,'model');
0267     end
0268 end
0269 
0270 %Only get the KOs in koList
0271 if nargin>1
0272     I=~ismember(model.rxns,koList);
0273     model=removeReactions(model,I,true,true);
0274 end
0275 fprintf('COMPLETE\n');
0276 end
0277 
0278 function allKOs=getAllKOs(keggPath)
0279 %Retrieves all KOs that are associated to reactions. This is because the
0280 %number of genes in KEGG is very large so without this parsing it would
0281 %take many hours
0282 
0283 allKOs={};
0284 
0285 %First check if the reactions have already been parsed
0286 ravenPath=findRAVENroot;
0287 rxnsFile=fullfile(ravenPath,'external','kegg','keggRxns.mat');
0288 if exist(rxnsFile, 'file')
0289     fprintf(['NOTE: Importing KEGG ORTHOLOGY list from ' strrep(rxnsFile,'\','/') '.\n']);
0290     load(rxnsFile,'model');
0291     
0292     %Loop through the reactions and add the corresponding genes
0293     for i=1:numel(model.rxns)
0294         if isstruct(model.rxnMiriams{i})
0295             %Get all KOs
0296             allKOs=[allKOs;model.rxnMiriams{i}.value(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'))];
0297         end
0298     end
0299 else
0300     %Parse the reaction file instead First load information on reaction ID,
0301     %reaction name, KO, pathway and ec-number
0302     fid = fopen(fullfile(keggPath,'reaction'), 'r');
0303     orthology=false;
0304     while 1
0305         %Get the next line
0306         tline = fgetl(fid);
0307         
0308         %Abort at end of file
0309         if ~ischar(tline)
0310             break;
0311         end
0312         
0313         %Skip '///'
0314         if numel(tline)<12
0315             continue;
0316         end
0317         
0318         %Check if it's a new reaction
0319         if strcmp(tline(1:12),'ENTRY       ')
0320             orthology=false;
0321         end
0322         
0323         if strcmp(tline(1:9),'REFERENCE')
0324             orthology=false;
0325         end
0326         
0327         %Add KO-ids
0328         if numel(tline)>16
0329             if strcmp(tline(1:16),'ORTHOLOGY   KO: ') || strcmp(tline(1:16),'            KO: ') || strcmp(tline(1:12),'ORTHOLOGY   ') || orthology==true
0330                 if strcmp(tline(13:16),'KO:')
0331                     %This is in the old version
0332                     allKOs=[allKOs;tline(17:22)];
0333                 elseif strcmp(tline(1:12),'ORTHOLOGY   ')
0334                     %This means that it found one KO in the new format and
0335                     %that subsequent lines might be other KOs
0336                     orthology=true;
0337                     allKOs=[allKOs;tline(13:18)];
0338                 end
0339             end
0340         end
0341     end
0342     
0343     %Close the file
0344     fclose(fid);
0345 end
0346 allKOs=unique(allKOs);
0347 end
0348 
0349 function downloadKEGGgenes()
0350 % downloadKEGGgenes
0351 %   Downloads keggGenes.mat from the RAVEN GitHub release into
0352 %   external/kegg/. This file is distributed as a release asset rather
0353 %   than shipped in the repository because it exceeds GitHub's 100 MB
0354 %   file-size limit.
0355 %
0356 %   The following asset is expected on the release tag defined below:
0357 %       keggGenes.zip   containing keggGenes.mat
0358 %
0359 %   To force a refresh (for example, after a new KEGG snapshot has been
0360 %   published with a new RAVEN release), delete
0361 %   external/kegg/keggGenes.mat and call getGenesFromKEGG again.
0362 %
0363 %   Usage: downloadKEGGgenes()
0364 
0365 % Release tag that hosts the data archive. Bump this for any RAVEN
0366 % release that ships an updated keggGenes.mat.
0367 releaseTag = 'v2.11.1';
0368 
0369 archiveName = 'keggGenes.zip';
0370 targetDir   = fullfile(findRAVENroot(),'external','kegg');
0371 targetFile  = fullfile(targetDir,'keggGenes.mat');
0372 url         = ['https://github.com/SysBioChalmers/RAVEN/releases/download/',...
0373                releaseTag,'/',archiveName];
0374 zipPath     = fullfile(targetDir,archiveName);
0375 
0376 fprintf(['Downloading ' archiveName ' from RAVEN ' releaseTag ' release... ']);
0377 try
0378     websave(zipPath,url);
0379 catch ME
0380     if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError')
0381         error(['Failed to download ' archiveName ': the server returned ' ...
0382                'a 404 error. If this persists, please report it at ' ...
0383                'https://github.com/SysBioChalmers/RAVEN/issues']);
0384     else
0385         rethrow(ME);
0386     end
0387 end
0388 fprintf('COMPLETE\n');
0389 
0390 fprintf(['Extracting ' archiveName ' to ' strrep(targetDir,'\','/') '... ']);
0391 try
0392     unzip(zipPath,targetDir);
0393 catch ME
0394     if isfile(zipPath); delete(zipPath); end
0395     error(['Failed to extract ' archiveName '. The downloaded archive ' ...
0396            'may be corrupted.\nOriginal error: %s'],ME.message);
0397 end
0398 fprintf('COMPLETE\n');
0399 
0400 if isfile(zipPath); delete(zipPath); end
0401 
0402 if ~isfile(targetFile)
0403     error(['keggGenes.mat was not found in ' archiveName ...
0404            '. Please verify the release asset contents.']);
0405 end
0406 end
getGenesFromKEGG

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SUBFUNCTIONS

SOURCE CODE