


getGenesFromKEGG
Retrieves information on all genes stored in KEGG database
Input:
keggPath if keggGenes.mat is not in the RAVEN\external\kegg
directory, this function will attempt to read data from a
local FTP dump of the KEGG database. keggPath is the path
to the root of this database
koList the number of genes in KEGG is very large. koList can be a
cell array with KO identifiers, in which case only genes
belonging to one of those KEGG orthologies are retrieved
(optional, default all KOs with associated reactions)
Output:
model a model structure generated from the database. The
following fields are filled
id 'KEGG'
name 'Automatically generated from KEGG database'
rxns KO ids
rxnNames Name for each entry
genes IDs for all the genes. Genes are saved as organism
abbreviation:id (same as in KEGG). 'HSA:124' for
example is alcohol dehydrogenase in Homo sapiens
rxnGeneMat A binary matrix that indicates whether a specific
gene is present in a KO id
NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory
it will be loaded instead of parsing of the KEGG files. If it does not
exist it will be saved after parsing of the KEGG files. In general, you
should remove the keggGenes.mat file if you want to rebuild the model
structure from a newer version of KEGG.
Usage: model=getGenesFromKEGG(keggPath,koList)
NOTE: This is how one entry looks in the file
ENTRY K11440 KO
NAME gbsB
DEFINITION choline dehydrogenase [EC:1.1.1.1]
PATHWAY ko00260 Glycine, serine and threonine metabolism
ko01100 Metabolic pathways
MODULE M00555 Betaine biosynthesis, choline => betaine
BRITE KEGG Orthology (KO) [BR:ko00001]
(list truncated)
DBLINKS RN: R08557 R08558
COG: COG1454
GENES BSU: BSU31050(gbsB)
BSR: I33_3189
(list truncated)
REFERENCE PMID:8752328
AUTHORS Boch J, Kempf B, Schmid R, Bremer E
TITLE Synthesis of the osmoprotectant glycine... (truncated)
JOURNAL J Bacteriol 178:5121-9 (1996)
DOI:10.1128/JB.178.17.5121-5129.1996
SEQUENCE [bsu:BSU31050]
///
The file is not tab-delimited. Instead each label is 12 characters
(except for '///').
Check if the genes have been parsed before and saved. If so, load the
model.


0001 function model=getGenesFromKEGG(keggPath,koList) 0002 % getGenesFromKEGG 0003 % Retrieves information on all genes stored in KEGG database 0004 % 0005 % Input: 0006 % keggPath if keggGenes.mat is not in the RAVEN\external\kegg 0007 % directory, this function will attempt to read data from a 0008 % local FTP dump of the KEGG database. keggPath is the path 0009 % to the root of this database 0010 % koList the number of genes in KEGG is very large. koList can be a 0011 % cell array with KO identifiers, in which case only genes 0012 % belonging to one of those KEGG orthologies are retrieved 0013 % (optional, default all KOs with associated reactions) 0014 % 0015 % Output: 0016 % model a model structure generated from the database. The 0017 % following fields are filled 0018 % id 'KEGG' 0019 % name 'Automatically generated from KEGG database' 0020 % rxns KO ids 0021 % rxnNames Name for each entry 0022 % genes IDs for all the genes. Genes are saved as organism 0023 % abbreviation:id (same as in KEGG). 'HSA:124' for 0024 % example is alcohol dehydrogenase in Homo sapiens 0025 % rxnGeneMat A binary matrix that indicates whether a specific 0026 % gene is present in a KO id 0027 % 0028 % NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory 0029 % it will be loaded instead of parsing of the KEGG files. If it does not 0030 % exist it will be saved after parsing of the KEGG files. In general, you 0031 % should remove the keggGenes.mat file if you want to rebuild the model 0032 % structure from a newer version of KEGG. 0033 % 0034 % Usage: model=getGenesFromKEGG(keggPath,koList) 0035 % 0036 % NOTE: This is how one entry looks in the file 0037 % 0038 % ENTRY K11440 KO 0039 % NAME gbsB 0040 % DEFINITION choline dehydrogenase [EC:1.1.1.1] 0041 % PATHWAY ko00260 Glycine, serine and threonine metabolism 0042 % ko01100 Metabolic pathways 0043 % MODULE M00555 Betaine biosynthesis, choline => betaine 0044 % BRITE KEGG Orthology (KO) [BR:ko00001] 0045 % (list truncated) 0046 % DBLINKS RN: R08557 R08558 0047 % COG: COG1454 0048 % GENES BSU: BSU31050(gbsB) 0049 % BSR: I33_3189 0050 % (list truncated) 0051 % REFERENCE PMID:8752328 0052 % AUTHORS Boch J, Kempf B, Schmid R, Bremer E 0053 % TITLE Synthesis of the osmoprotectant glycine... (truncated) 0054 % JOURNAL J Bacteriol 178:5121-9 (1996) 0055 % DOI:10.1128/JB.178.17.5121-5129.1996 0056 % SEQUENCE [bsu:BSU31050] 0057 % /// 0058 % 0059 % The file is not tab-delimited. Instead each label is 12 characters 0060 % (except for '///'). 0061 % 0062 % Check if the genes have been parsed before and saved. If so, load the 0063 % model. 0064 % 0065 0066 if nargin<1 0067 keggPath='RAVEN/external/kegg'; 0068 else 0069 keggPath=char(keggPath); 0070 end 0071 0072 ravenPath=findRAVENroot(); 0073 genesFile=fullfile(ravenPath,'external','kegg','keggGenes.mat'); 0074 if ~exist(genesFile, 'file') 0075 try 0076 downloadKEGGgenes(); 0077 catch 0078 %Download failed; fall through to local regeneration below 0079 end 0080 end 0081 if exist(genesFile, 'file') 0082 fprintf(['Importing KEGG genes from ' strrep(genesFile,'\','/') '... ']); 0083 load(genesFile); 0084 else 0085 fprintf(['NOTE: Cannot locate ' strrep(genesFile,'\','/') ', it will therefore be generated from the local KEGG database\n']); 0086 if ~isfile(fullfile(keggPath,'ko')) || ~isfile(fullfile(keggPath,'reaction')) 0087 EM=fprintf(['The files ''ko'' and ''reaction'' cannot be located at ' strrep(keggPath,'\','/') '/ and should be downloaded from the KEGG FTP\n']); 0088 dispEM(EM); 0089 else 0090 fprintf('Generating keggGenes.mat file...\n'); 0091 %Get all KOs that are associated to reactions 0092 allKOs=getAllKOs(keggPath); 0093 0094 %Since the list of genes will be accessed many times the hash table 0095 %is established 0096 geneMap=containers.Map('KeyType','char','ValueType','int32'); 0097 0098 %Add new functionality in the order specified in models 0099 model.id='KEGG'; 0100 model.name='Automatically generated from KEGG database'; 0101 0102 %Preallocate memory 0103 model.rxns=cell(numel(allKOs),1); 0104 model.rxnNames=cell(numel(allKOs),1); 0105 0106 %Reserve space for 500000 genes 0107 model.genes=cell(500000,1); 0108 0109 %Load information on KO ID, name, and associated genes 0110 fid = fopen(fullfile(keggPath,'ko'), 'r'); 0111 0112 %Keeps track of how many KOs that have been added 0113 koCounter=0; 0114 nGenes=0; 0115 addingGenes=false; 0116 0117 %These contain the mapping between KOs and genes and are used to 0118 %generate the model.rxnGeneMat in the end 0119 koIndex=zeros(5000000,1); 0120 geneIndex=koIndex; 0121 nMappings=0; 0122 0123 skipRead=false; 0124 0125 %Loop through the file 0126 while 1 0127 %Get the next line 0128 if skipRead==false 0129 tline = fgetl(fid); 0130 else 0131 skipRead=false; 0132 end 0133 0134 %Abort at end of file 0135 if ~ischar(tline) 0136 break; 0137 end 0138 0139 %Skip '///' 0140 if numel(tline)<12 0141 addingGenes=false; 0142 continue; 0143 end 0144 0145 %Check if it's a new reaction 0146 if strcmp(tline(1:12),'ENTRY ') 0147 %Check if it should be added 0148 koID=tline(13:18); 0149 0150 if ismember(koID,allKOs) 0151 addMe=true; 0152 koCounter=koCounter+1; 0153 else 0154 addMe=false; 0155 continue; 0156 end 0157 0158 %Add reaction ID (always 6 characters) 0159 model.rxns{koCounter}=koID; 0160 0161 model.rxnNames{koCounter}=''; 0162 %Will be overwritten if it exists 0163 end 0164 0165 %To get here we must be in a KO that should be added 0166 if addMe==true 0167 %Add name 0168 if strcmp(tline(1:12),'DEFINITION ') 0169 model.rxnNames{koCounter}=tline(13:end); 0170 end 0171 0172 %Check if it's time to start adding genes 0173 if strcmp(tline(1:12),'GENES ') 0174 addingGenes=true; 0175 end 0176 0177 %Add genes 0178 if addingGenes==true 0179 %We are now adding genes for the current KO. All gene 0180 %rows start with 12 spaces. If this is not true it 0181 %means that there is additional annotation such as 0182 %AUTHORS. This is not common but it exists 0183 if strcmp(tline(1:12),' ') || strcmp(tline(1:12),'GENES ') 0184 geneLine=tline(13:end); 0185 0186 %Check if the line is from a new organism of from 0187 %the same as before 0188 if strcmp(geneLine(4),':') 0189 %If organism id contains 3 letters; 0190 currentOrganism=geneLine(1:3); 0191 %Parse the string 0192 genes=regexp(geneLine(6:end),' ','split'); 0193 genes=strcat(currentOrganism,':',genes(:)); 0194 elseif strcmp(geneLine(5),':') 0195 %If organism id contains 4 letters; 0196 currentOrganism=geneLine(1:4); 0197 %Parse the string 0198 genes=regexp(geneLine(7:end),' ','split'); 0199 genes=strcat(currentOrganism,':',genes(:)); 0200 end 0201 0202 %Add the genes to the gene list 0203 for i=1:numel(genes) 0204 geneString=genes{i}; 0205 if geneMap.isKey(geneString) 0206 index=geneMap(geneString); 0207 else 0208 nGenes=nGenes+1; 0209 model.genes{nGenes}=geneString; 0210 index=nGenes; 0211 geneMap(geneString)=index; 0212 end 0213 0214 %Add the connection to the KO 0215 nMappings=nMappings+1; 0216 koIndex(nMappings)=koCounter; 0217 geneIndex(nMappings)=index; 0218 end 0219 else 0220 %Now we want to throw away everything until the 0221 %next entry 0222 while 1 0223 tline = fgetl(fid); 0224 if strcmp(tline,'///') 0225 %When the new entry is found, skip reading 0226 %one line to fit with the rest of the code 0227 skipRead=true; 0228 break; 0229 end 0230 end 0231 end 0232 end 0233 end 0234 end 0235 %Close the file 0236 fclose(fid); 0237 0238 %If too much space was allocated, shrink the model 0239 model.rxns=model.rxns(1:koCounter); 0240 model.rxnNames=model.rxnNames(1:koCounter); 0241 model.genes=model.genes(1:nGenes); 0242 0243 %Retrieve and add the gene associations 0244 model.rxnGeneMat=sparse(koIndex(1:nMappings),geneIndex(1:nMappings),ones(nMappings,1)); 0245 0246 %To make sure the size is correct if the last KOs don't have genes 0247 if size(model.rxnGeneMat,1)~=koCounter 0248 model.rxnGeneMat(koCounter,1)=0; 0249 end 0250 0251 %Trim the genes so that they only contain information that can be 0252 %matched to the KEGG file of protein sequences (remove all 0253 %information after first parenthesis) 0254 %NOTE: For some reason the organism abbreviation should be in lower 0255 %case in this database. Fix this here 0256 for i=1:numel(model.genes) 0257 parIndex=strfind(model.genes{i},'('); 0258 if any(parIndex) 0259 model.genes{i}=model.genes{i}(1:parIndex-1); 0260 end 0261 colIndex=strfind(model.genes{i},':'); 0262 model.genes{i}=[lower(model.genes{i}(1:colIndex-1)) model.genes{i}(colIndex:end)]; 0263 end 0264 0265 %Save the model structure 0266 save(genesFile,'model'); 0267 end 0268 end 0269 0270 %Only get the KOs in koList 0271 if nargin>1 0272 I=~ismember(model.rxns,koList); 0273 model=removeReactions(model,I,true,true); 0274 end 0275 fprintf('COMPLETE\n'); 0276 end 0277 0278 function allKOs=getAllKOs(keggPath) 0279 %Retrieves all KOs that are associated to reactions. This is because the 0280 %number of genes in KEGG is very large so without this parsing it would 0281 %take many hours 0282 0283 allKOs={}; 0284 0285 %First check if the reactions have already been parsed 0286 ravenPath=findRAVENroot; 0287 rxnsFile=fullfile(ravenPath,'external','kegg','keggRxns.mat'); 0288 if exist(rxnsFile, 'file') 0289 fprintf(['NOTE: Importing KEGG ORTHOLOGY list from ' strrep(rxnsFile,'\','/') '.\n']); 0290 load(rxnsFile,'model'); 0291 0292 %Loop through the reactions and add the corresponding genes 0293 for i=1:numel(model.rxns) 0294 if isstruct(model.rxnMiriams{i}) 0295 %Get all KOs 0296 allKOs=[allKOs;model.rxnMiriams{i}.value(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'))]; 0297 end 0298 end 0299 else 0300 %Parse the reaction file instead First load information on reaction ID, 0301 %reaction name, KO, pathway and ec-number 0302 fid = fopen(fullfile(keggPath,'reaction'), 'r'); 0303 orthology=false; 0304 while 1 0305 %Get the next line 0306 tline = fgetl(fid); 0307 0308 %Abort at end of file 0309 if ~ischar(tline) 0310 break; 0311 end 0312 0313 %Skip '///' 0314 if numel(tline)<12 0315 continue; 0316 end 0317 0318 %Check if it's a new reaction 0319 if strcmp(tline(1:12),'ENTRY ') 0320 orthology=false; 0321 end 0322 0323 if strcmp(tline(1:9),'REFERENCE') 0324 orthology=false; 0325 end 0326 0327 %Add KO-ids 0328 if numel(tline)>16 0329 if strcmp(tline(1:16),'ORTHOLOGY KO: ') || strcmp(tline(1:16),' KO: ') || strcmp(tline(1:12),'ORTHOLOGY ') || orthology==true 0330 if strcmp(tline(13:16),'KO:') 0331 %This is in the old version 0332 allKOs=[allKOs;tline(17:22)]; 0333 elseif strcmp(tline(1:12),'ORTHOLOGY ') 0334 %This means that it found one KO in the new format and 0335 %that subsequent lines might be other KOs 0336 orthology=true; 0337 allKOs=[allKOs;tline(13:18)]; 0338 end 0339 end 0340 end 0341 end 0342 0343 %Close the file 0344 fclose(fid); 0345 end 0346 allKOs=unique(allKOs); 0347 end 0348 0349 function downloadKEGGgenes() 0350 % downloadKEGGgenes 0351 % Downloads keggGenes.mat from the RAVEN GitHub release into 0352 % external/kegg/. This file is distributed as a release asset rather 0353 % than shipped in the repository because it exceeds GitHub's 100 MB 0354 % file-size limit. 0355 % 0356 % The following asset is expected on the release tag defined below: 0357 % keggGenes.zip containing keggGenes.mat 0358 % 0359 % To force a refresh (for example, after a new KEGG snapshot has been 0360 % published with a new RAVEN release), delete 0361 % external/kegg/keggGenes.mat and call getGenesFromKEGG again. 0362 % 0363 % Usage: downloadKEGGgenes() 0364 0365 % Release tag that hosts the data archive. Bump this for any RAVEN 0366 % release that ships an updated keggGenes.mat. 0367 releaseTag = 'v2.11.1'; 0368 0369 archiveName = 'keggGenes.zip'; 0370 targetDir = fullfile(findRAVENroot(),'external','kegg'); 0371 targetFile = fullfile(targetDir,'keggGenes.mat'); 0372 url = ['https://github.com/SysBioChalmers/RAVEN/releases/download/',... 0373 releaseTag,'/',archiveName]; 0374 zipPath = fullfile(targetDir,archiveName); 0375 0376 fprintf(['Downloading ' archiveName ' from RAVEN ' releaseTag ' release... ']); 0377 try 0378 websave(zipPath,url); 0379 catch ME 0380 if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError') 0381 error(['Failed to download ' archiveName ': the server returned ' ... 0382 'a 404 error. If this persists, please report it at ' ... 0383 'https://github.com/SysBioChalmers/RAVEN/issues']); 0384 else 0385 rethrow(ME); 0386 end 0387 end 0388 fprintf('COMPLETE\n'); 0389 0390 fprintf(['Extracting ' archiveName ' to ' strrep(targetDir,'\','/') '... ']); 0391 try 0392 unzip(zipPath,targetDir); 0393 catch ME 0394 if isfile(zipPath); delete(zipPath); end 0395 error(['Failed to extract ' archiveName '. The downloaded archive ' ... 0396 'may be corrupted.\nOriginal error: %s'],ME.message); 0397 end 0398 fprintf('COMPLETE\n'); 0399 0400 if isfile(zipPath); delete(zipPath); end 0401 0402 if ~isfile(targetFile) 0403 error(['keggGenes.mat was not found in ' archiveName ... 0404 '. Please verify the release asset contents.']); 0405 end 0406 end