getGenesFromKEGG Retrieves information on all genes stored in KEGG database Input: keggPath if keggGenes.mat is not in the RAVEN\external\kegg directory, this function will attempt to read data from a local FTP dump of the KEGG database. keggPath is the path to the root of this database koList the number of genes in KEGG is very large. koList can be a cell array with KO identifiers, in which case only genes belonging to one of those KEGG orthologies are retrieved (optional, default all KOs with associated reactions) Output: model a model structure generated from the database. The following fields are filled id 'KEGG' name 'Automatically generated from KEGG database' rxns KO ids rxnNames Name for each entry genes IDs for all the genes. Genes are saved as organism abbreviation:id (same as in KEGG). 'HSA:124' for example is alcohol dehydrogenase in Homo sapiens rxnGeneMat A binary matrix that indicates whether a specific gene is present in a KO id NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory it will be loaded instead of parsing of the KEGG files. If it does not exist it will be saved after parsing of the KEGG files. In general, you should remove the keggGenes.mat file if you want to rebuild the model structure from a newer version of KEGG. Usage: model=getGenesFromKEGG(keggPath,koList) NOTE: This is how one entry looks in the file ENTRY K11440 KO NAME gbsB DEFINITION choline dehydrogenase [EC:1.1.1.1] PATHWAY ko00260 Glycine, serine and threonine metabolism ko01100 Metabolic pathways MODULE M00555 Betaine biosynthesis, choline => betaine BRITE KEGG Orthology (KO) [BR:ko00001] (list truncated) DBLINKS RN: R08557 R08558 COG: COG1454 GENES BSU: BSU31050(gbsB) BSR: I33_3189 (list truncated) REFERENCE PMID:8752328 AUTHORS Boch J, Kempf B, Schmid R, Bremer E TITLE Synthesis of the osmoprotectant glycine... (truncated) JOURNAL J Bacteriol 178:5121-9 (1996) DOI:10.1128/JB.178.17.5121-5129.1996 SEQUENCE [bsu:BSU31050] /// The file is not tab-delimited. Instead each label is 12 characters (except for '///'). Check if the genes have been parsed before and saved. If so, load the model.
0001 function model=getGenesFromKEGG(keggPath,koList) 0002 % getGenesFromKEGG 0003 % Retrieves information on all genes stored in KEGG database 0004 % 0005 % Input: 0006 % keggPath if keggGenes.mat is not in the RAVEN\external\kegg 0007 % directory, this function will attempt to read data from a 0008 % local FTP dump of the KEGG database. keggPath is the path 0009 % to the root of this database 0010 % koList the number of genes in KEGG is very large. koList can be a 0011 % cell array with KO identifiers, in which case only genes 0012 % belonging to one of those KEGG orthologies are retrieved 0013 % (optional, default all KOs with associated reactions) 0014 % 0015 % Output: 0016 % model a model structure generated from the database. The 0017 % following fields are filled 0018 % id 'KEGG' 0019 % name 'Automatically generated from KEGG database' 0020 % rxns KO ids 0021 % rxnNames Name for each entry 0022 % genes IDs for all the genes. Genes are saved as organism 0023 % abbreviation:id (same as in KEGG). 'HSA:124' for 0024 % example is alcohol dehydrogenase in Homo sapiens 0025 % rxnGeneMat A binary matrix that indicates whether a specific 0026 % gene is present in a KO id 0027 % 0028 % NOTE: If the file keggGenes.mat is in the RAVEN\external\kegg directory 0029 % it will be loaded instead of parsing of the KEGG files. If it does not 0030 % exist it will be saved after parsing of the KEGG files. In general, you 0031 % should remove the keggGenes.mat file if you want to rebuild the model 0032 % structure from a newer version of KEGG. 0033 % 0034 % Usage: model=getGenesFromKEGG(keggPath,koList) 0035 % 0036 % NOTE: This is how one entry looks in the file 0037 % 0038 % ENTRY K11440 KO 0039 % NAME gbsB 0040 % DEFINITION choline dehydrogenase [EC:1.1.1.1] 0041 % PATHWAY ko00260 Glycine, serine and threonine metabolism 0042 % ko01100 Metabolic pathways 0043 % MODULE M00555 Betaine biosynthesis, choline => betaine 0044 % BRITE KEGG Orthology (KO) [BR:ko00001] 0045 % (list truncated) 0046 % DBLINKS RN: R08557 R08558 0047 % COG: COG1454 0048 % GENES BSU: BSU31050(gbsB) 0049 % BSR: I33_3189 0050 % (list truncated) 0051 % REFERENCE PMID:8752328 0052 % AUTHORS Boch J, Kempf B, Schmid R, Bremer E 0053 % TITLE Synthesis of the osmoprotectant glycine... (truncated) 0054 % JOURNAL J Bacteriol 178:5121-9 (1996) 0055 % DOI:10.1128/JB.178.17.5121-5129.1996 0056 % SEQUENCE [bsu:BSU31050] 0057 % /// 0058 % 0059 % The file is not tab-delimited. Instead each label is 12 characters 0060 % (except for '///'). 0061 % 0062 % Check if the genes have been parsed before and saved. If so, load the 0063 % model. 0064 % 0065 0066 if nargin<1 0067 keggPath='RAVEN/external/kegg'; 0068 else 0069 keggPath=char(keggPath); 0070 end 0071 0072 ravenPath=findRAVENroot(); 0073 genesFile=fullfile(ravenPath,'external','kegg','keggGenes.mat'); 0074 if exist(genesFile, 'file') 0075 fprintf(['Importing KEGG genes from ' strrep(genesFile,'\','/') '... ']); 0076 load(genesFile); 0077 else 0078 fprintf(['NOTE: Cannot locate ' strrep(genesFile,'\','/') ', it will therefore be generated from the local KEGG database\n']); 0079 if ~isfile(fullfile(keggPath,'ko')) || ~isfile(fullfile(keggPath,'reaction')) 0080 EM=fprintf(['The files ''ko'' and ''reaction'' cannot be located at ' strrep(keggPath,'\','/') '/ and should be downloaded from the KEGG FTP\n']); 0081 dispEM(EM); 0082 else 0083 fprintf('Generating keggGenes.mat file...\n'); 0084 %Get all KOs that are associated to reactions 0085 allKOs=getAllKOs(keggPath); 0086 0087 %Since the list of genes will be accessed many times the hash table 0088 %is established 0089 geneMap=containers.Map('KeyType','char','ValueType','int32'); 0090 0091 %Add new functionality in the order specified in models 0092 model.id='KEGG'; 0093 model.name='Automatically generated from KEGG database'; 0094 0095 %Preallocate memory 0096 model.rxns=cell(numel(allKOs),1); 0097 model.rxnNames=cell(numel(allKOs),1); 0098 0099 %Reserve space for 500000 genes 0100 model.genes=cell(500000,1); 0101 0102 %Load information on KO ID, name, and associated genes 0103 fid = fopen(fullfile(keggPath,'ko'), 'r'); 0104 0105 %Keeps track of how many KOs that have been added 0106 koCounter=0; 0107 nGenes=0; 0108 addingGenes=false; 0109 0110 %These contain the mapping between KOs and genes and are used to 0111 %generate the model.rxnGeneMat in the end 0112 koIndex=zeros(5000000,1); 0113 geneIndex=koIndex; 0114 nMappings=0; 0115 0116 skipRead=false; 0117 0118 %Loop through the file 0119 while 1 0120 %Get the next line 0121 if skipRead==false 0122 tline = fgetl(fid); 0123 else 0124 skipRead=false; 0125 end 0126 0127 %Abort at end of file 0128 if ~ischar(tline) 0129 break; 0130 end 0131 0132 %Skip '///' 0133 if numel(tline)<12 0134 addingGenes=false; 0135 continue; 0136 end 0137 0138 %Check if it's a new reaction 0139 if strcmp(tline(1:12),'ENTRY ') 0140 %Check if it should be added 0141 koID=tline(13:18); 0142 0143 if ismember(koID,allKOs) 0144 addMe=true; 0145 koCounter=koCounter+1; 0146 else 0147 addMe=false; 0148 continue; 0149 end 0150 0151 %Add reaction ID (always 6 characters) 0152 model.rxns{koCounter}=koID; 0153 0154 model.rxnNames{koCounter}=''; 0155 %Will be overwritten if it exists 0156 end 0157 0158 %To get here we must be in a KO that should be added 0159 if addMe==true 0160 %Add name 0161 if strcmp(tline(1:12),'DEFINITION ') 0162 model.rxnNames{koCounter}=tline(13:end); 0163 end 0164 0165 %Check if it's time to start adding genes 0166 if strcmp(tline(1:12),'GENES ') 0167 addingGenes=true; 0168 end 0169 0170 %Add genes 0171 if addingGenes==true 0172 %We are now adding genes for the current KO. All gene 0173 %rows start with 12 spaces. If this is not true it 0174 %means that there is additional annotation such as 0175 %AUTHORS. This is not common but it exists 0176 if strcmp(tline(1:12),' ') || strcmp(tline(1:12),'GENES ') 0177 geneLine=tline(13:end); 0178 0179 %Check if the line is from a new organism of from 0180 %the same as before 0181 if strcmp(geneLine(4),':') 0182 %If organism id contains 3 letters; 0183 currentOrganism=geneLine(1:3); 0184 %Parse the string 0185 genes=regexp(geneLine(6:end),' ','split'); 0186 genes=strcat(currentOrganism,':',genes(:)); 0187 elseif strcmp(geneLine(5),':') 0188 %If organism id contains 4 letters; 0189 currentOrganism=geneLine(1:4); 0190 %Parse the string 0191 genes=regexp(geneLine(7:end),' ','split'); 0192 genes=strcat(currentOrganism,':',genes(:)); 0193 end 0194 0195 %Add the genes to the gene list 0196 for i=1:numel(genes) 0197 geneString=genes{i}; 0198 if geneMap.isKey(geneString) 0199 index=geneMap(geneString); 0200 else 0201 nGenes=nGenes+1; 0202 model.genes{nGenes}=geneString; 0203 index=nGenes; 0204 geneMap(geneString)=index; 0205 end 0206 0207 %Add the connection to the KO 0208 nMappings=nMappings+1; 0209 koIndex(nMappings)=koCounter; 0210 geneIndex(nMappings)=index; 0211 end 0212 else 0213 %Now we want to throw away everything until the 0214 %next entry 0215 while 1 0216 tline = fgetl(fid); 0217 if strcmp(tline,'///') 0218 %When the new entry is found, skip reading 0219 %one line to fit with the rest of the code 0220 skipRead=true; 0221 break; 0222 end 0223 end 0224 end 0225 end 0226 end 0227 end 0228 %Close the file 0229 fclose(fid); 0230 0231 %If too much space was allocated, shrink the model 0232 model.rxns=model.rxns(1:koCounter); 0233 model.rxnNames=model.rxnNames(1:koCounter); 0234 model.genes=model.genes(1:nGenes); 0235 0236 %Retrieve and add the gene associations 0237 model.rxnGeneMat=sparse(koIndex(1:nMappings),geneIndex(1:nMappings),ones(nMappings,1)); 0238 0239 %To make sure the size is correct if the last KOs don't have genes 0240 if size(model.rxnGeneMat,1)~=koCounter 0241 model.rxnGeneMat(koCounter,1)=0; 0242 end 0243 0244 %Trim the genes so that they only contain information that can be 0245 %matched to the KEGG file of protein sequences (remove all 0246 %information after first parenthesis) 0247 %NOTE: For some reason the organism abbreviation should be in lower 0248 %case in this database. Fix this here 0249 for i=1:numel(model.genes) 0250 parIndex=strfind(model.genes{i},'('); 0251 if any(parIndex) 0252 model.genes{i}=model.genes{i}(1:parIndex-1); 0253 end 0254 colIndex=strfind(model.genes{i},':'); 0255 model.genes{i}=[lower(model.genes{i}(1:colIndex-1)) model.genes{i}(colIndex:end)]; 0256 end 0257 0258 %Save the model structure 0259 save(genesFile,'model'); 0260 end 0261 end 0262 0263 %Only get the KOs in koList 0264 if nargin>1 0265 I=~ismember(model.rxns,koList); 0266 model=removeReactions(model,I,true,true); 0267 end 0268 fprintf('COMPLETE\n'); 0269 end 0270 0271 function allKOs=getAllKOs(keggPath) 0272 %Retrieves all KOs that are associated to reactions. This is because the 0273 %number of genes in KEGG is very large so without this parsing it would 0274 %take many hours 0275 0276 allKOs={}; 0277 0278 %First check if the reactions have already been parsed 0279 ravenPath=findRAVENroot; 0280 rxnsFile=fullfile(ravenPath,'external','kegg','keggRxns.mat'); 0281 if exist(rxnsFile, 'file') 0282 fprintf(['NOTE: Importing KEGG ORTHOLOGY list from ' strrep(rxnsFile,'\','/') '.\n']); 0283 load(rxnsFile,'model'); 0284 0285 %Loop through the reactions and add the corresponding genes 0286 for i=1:numel(model.rxns) 0287 if isstruct(model.rxnMiriams{i}) 0288 %Get all KOs 0289 allKOs=[allKOs;model.rxnMiriams{i}.value(strcmpi(model.rxnMiriams{i}.name,'kegg.orthology'))]; 0290 end 0291 end 0292 else 0293 %Parse the reaction file instead First load information on reaction ID, 0294 %reaction name, KO, pathway and ec-number 0295 fid = fopen(fullfile(keggPath,'reaction'), 'r'); 0296 orthology=false; 0297 while 1 0298 %Get the next line 0299 tline = fgetl(fid); 0300 0301 %Abort at end of file 0302 if ~ischar(tline) 0303 break; 0304 end 0305 0306 %Skip '///' 0307 if numel(tline)<12 0308 continue; 0309 end 0310 0311 %Check if it's a new reaction 0312 if strcmp(tline(1:12),'ENTRY ') 0313 orthology=false; 0314 end 0315 0316 if strcmp(tline(1:9),'REFERENCE') 0317 orthology=false; 0318 end 0319 0320 %Add KO-ids 0321 if numel(tline)>16 0322 if strcmp(tline(1:16),'ORTHOLOGY KO: ') || strcmp(tline(1:16),' KO: ') || strcmp(tline(1:12),'ORTHOLOGY ') || orthology==true 0323 if strcmp(tline(13:16),'KO:') 0324 %This is in the old version 0325 allKOs=[allKOs;tline(17:22)]; 0326 elseif strcmp(tline(1:12),'ORTHOLOGY ') 0327 %This means that it found one KO in the new format and 0328 %that subsequent lines might be other KOs 0329 orthology=true; 0330 allKOs=[allKOs;tline(13:18)]; 0331 end 0332 end 0333 end 0334 end 0335 0336 %Close the file 0337 fclose(fid); 0338 end 0339 allKOs=unique(allKOs); 0340 end