scoreRxns Scores the reactions and genes in a model based on expression data from HPA and/or gene arrays Input: model a model structure hpaData HPA data structure from parseHPA (optional if arrayData is supplied, default []) arrayData gene expression data structure (optional if hpaData is supplied, default []) genes cell array with the unique gene names tissues cell array with the tissue names. The list may not be unique, as there can be multiple cell types per tissue celltypes cell array with the cell type names for each tissue levels GENESxTISSUES array with the expression level for each gene in each tissue/celltype. NaN should be used when no measurement was performed threshold a single value or a vector of gene expression thresholds, above which genes are considered to be "expressed". (optional, by default, the mean expression levels of each gene across all tissues in arrayData will be used as the threshold values) tissue tissue to score for. Should exist in either hpaData.tissues or arrayData.tissues celltype cell type to score for. Should exist in either hpaData.celltypes or arrayData.celltypes for this tissue (optional, default is to use the best values among all the cell types for the tissue. Use [] if you want to supply more arguments) noGeneScore score for reactions without genes (optional, default -2) multipleGeneScoring determines how scores are calculated for reactions with several genes ('best' or 'average') (optional, default 'best') multipleCellScoring determines how scores are calculated when several cell types are used ('best' or 'average') (optional, default 'best') hpaLevelScores structure with numerical scores for the expression level categories from HPA. The structure should have a "names" and a "scores" field (optional, see code for default scores) Output: rxnScores scores for each of the reactions in model geneScores scores for each of the genes in model. Genes which are not in the dataset(s) have -Inf as scores hpaScores scores for each of the genes in model if only taking hpaData into account. Genes which are not in the dataset(s) have -Inf as scores arrayScores scores for each of the genes in model if only taking arrayData into account. Genes which are not in the dataset(s) have -Inf as scores Usage: [rxnScores, geneScores, hpaScores, arrayScores]=scoreModel(model,... hpaData,arrayData,tissue,celltype,noGeneScore,multipleGeneScoring,... multipleCellScoring,hpaLevelScores)
0001 function [rxnScores, geneScores, hpaScores, arrayScores]=scoreModel(model,hpaData,arrayData,tissue,celltype,noGeneScore,multipleGeneScoring,multipleCellScoring,hpaLevelScores) 0002 % scoreRxns 0003 % Scores the reactions and genes in a model based on expression data 0004 % from HPA and/or gene arrays 0005 % 0006 % Input: 0007 % model a model structure 0008 % hpaData HPA data structure from parseHPA (optional if arrayData is 0009 % supplied, default []) 0010 % arrayData gene expression data structure (optional if hpaData is 0011 % supplied, default []) 0012 % genes cell array with the unique gene names 0013 % tissues cell array with the tissue names. The list may not be 0014 % unique, as there can be multiple cell types per tissue 0015 % celltypes cell array with the cell type names for each tissue 0016 % levels GENESxTISSUES array with the expression level for 0017 % each gene in each tissue/celltype. NaN should be 0018 % used when no measurement was performed 0019 % threshold a single value or a vector of gene expression 0020 % thresholds, above which genes are considered to be 0021 % "expressed". (optional, by default, the mean expression 0022 % levels of each gene across all tissues in arrayData 0023 % will be used as the threshold values) 0024 % tissue tissue to score for. Should exist in either 0025 % hpaData.tissues or arrayData.tissues 0026 % celltype cell type to score for. Should exist in either 0027 % hpaData.celltypes or arrayData.celltypes for this 0028 % tissue (optional, default is to use the best values 0029 % among all the cell types for the tissue. Use [] if 0030 % you want to supply more arguments) 0031 % noGeneScore score for reactions without genes (optional, default -2) 0032 % multipleGeneScoring determines how scores are calculated for reactions 0033 % with several genes ('best' or 'average') 0034 % (optional, default 'best') 0035 % multipleCellScoring determines how scores are calculated when several 0036 % cell types are used ('best' or 'average') 0037 % (optional, default 'best') 0038 % hpaLevelScores structure with numerical scores for the expression 0039 % level categories from HPA. The structure should have a 0040 % "names" and a "scores" field (optional, see code for 0041 % default scores) 0042 % 0043 % 0044 % Output: 0045 % rxnScores scores for each of the reactions in model 0046 % geneScores scores for each of the genes in model. Genes which are 0047 % not in the dataset(s) have -Inf as scores 0048 % hpaScores scores for each of the genes in model if only taking hpaData 0049 % into account. Genes which are not in the dataset(s) 0050 % have -Inf as scores 0051 % arrayScores scores for each of the genes in model if only taking arrayData 0052 % into account. Genes which are not in the dataset(s) 0053 % have -Inf as scores 0054 % 0055 % Usage: [rxnScores, geneScores, hpaScores, arrayScores]=scoreModel(model,... 0056 % hpaData,arrayData,tissue,celltype,noGeneScore,multipleGeneScoring,... 0057 % multipleCellScoring,hpaLevelScores) 0058 0059 if nargin<3 0060 arrayData=[]; 0061 end 0062 if nargin<5 0063 celltype=[]; 0064 else 0065 celltype=char(celltype); 0066 end 0067 if nargin<6 0068 noGeneScore=-2; 0069 end 0070 if nargin<7 0071 multipleGeneScoring='best'; 0072 else 0073 multipleGeneScoring=char(multipleGeneScoring); 0074 end 0075 if nargin<8 0076 multipleCellScoring='best'; 0077 else 0078 multipleCellScoring=char(multipleCellScoring); 0079 end 0080 if nargin<9 0081 %The first four are for APE, the other ones for staining 0082 hpaLevelScores.names={'High' 'Medium' 'Low' 'None' 'Strong' 'Moderate' 'Weak' 'Negative' 'Not detected'}; 0083 hpaLevelScores.scores=[20 15 10 -8 20 15 10 -8 -8]; 0084 end 0085 0086 if isempty(hpaData) && isempty(arrayData) 0087 EM='Must supply hpaData, arrayData or both'; 0088 dispEM(EM); 0089 end 0090 if ~strcmpi(multipleGeneScoring,'best') && ~strcmpi(multipleGeneScoring,'average') 0091 EM='Valid options for multipleGeneScoring are "best" or "average"'; 0092 dispEM(EM); 0093 end 0094 if ~strcmpi(multipleCellScoring,'best') && ~strcmpi(multipleCellScoring,'average') 0095 EM='Valid options for multipleCellScoring are "best" or "average"'; 0096 dispEM(EM); 0097 end 0098 0099 0100 %Throw an error if array data for only one tissue is supplied without 0101 %specifying threshold values 0102 if ~isempty(arrayData) 0103 if numel(unique(arrayData.tissues))<2 0104 if ~isfield(arrayData,'threshold') || isempty(arrayData.threshold) 0105 EM='arrayData must contain measurements for at least two celltypes/tissues since the score is calculated based on the expression level compared to the overall average'; 0106 dispEM(EM); 0107 end 0108 end 0109 end 0110 0111 %Process arrayData.threshold if necessary 0112 if isfield(arrayData,'threshold') && (numel(arrayData.threshold) == 1) 0113 % if only a single gene threshold value is provided, then just 0114 % duplicate this value for all genes. 0115 arrayData.threshold = arrayData.threshold*ones(size(arrayData.genes)); 0116 end 0117 0118 %This is so that the code can ignore which combination of input data that is 0119 %used 0120 if isempty(arrayData) 0121 arrayData.genes={}; 0122 arrayData.tissues={}; 0123 arrayData.celltypes={}; 0124 arrayData.levels=[]; 0125 arrayData.threshold=[]; 0126 end 0127 if isempty(hpaData) 0128 hpaData.genes={}; 0129 hpaData.tissues={}; 0130 hpaData.celltypes={}; 0131 hpaData.levels={}; 0132 hpaData.types={}; 0133 hpaData.reliabilities={}; 0134 hpaData.gene2Level=[]; 0135 hpaData.gene2Type=[]; 0136 hpaData.gene2Reliability=[]; 0137 end 0138 0139 %Check that the tissue exists 0140 if ~ismember(upper(tissue),upper(hpaData.tissues)) && ~ismember(upper(tissue),upper(arrayData.tissues)) 0141 EM='The tissue name does not match'; 0142 dispEM(EM); 0143 end 0144 if any(celltype) 0145 %Check that both data types has cell type defined if that is to be used 0146 if ~isfield(hpaData,'celltypes') || ~isfield(arrayData,'celltypes') 0147 EM='Both hpaData and arrayData must contain cell type information if cell type is to be used'; 0148 dispEM(EM); 0149 end 0150 if ~ismember(upper(celltype),upper(hpaData.celltypes)) && ~ismember(upper(celltype),upper(arrayData.celltypes)) 0151 EM='The cell type name does not match'; 0152 dispEM(EM); 0153 end 0154 end 0155 0156 %Some preprocessing of the structures to speed up a little Remove all 0157 %tissues that are not the correct one 0158 J=~strcmpi(hpaData.tissues,tissue); 0159 0160 %If cell type is supplied, then only keep that cell type 0161 if any(celltype) 0162 J=J | ~strcmpi(hpaData.celltypes,celltype); 0163 end 0164 0165 hpaData.tissues(J)=[]; 0166 if isfield(hpaData,'celltypes') 0167 hpaData.celltypes(J)=[]; 0168 end 0169 if isfield(hpaData,'gene2Level') 0170 hpaData.gene2Level(:,J)=[]; 0171 end 0172 if isfield(hpaData,'gene2Type') 0173 hpaData.gene2Type(:,J)=[]; 0174 end 0175 if isfield(hpaData,'gene2Reliability') 0176 hpaData.gene2Reliability(:,J)=[]; 0177 end 0178 0179 %Remove all genes from the structures that are not in model or that aren't 0180 %measured in the tissue 0181 if ~isempty(hpaData.genes) %This should not be necessary, but the summation is a 0x1 matrix and the other is [] 0182 I=~ismember(hpaData.genes,model.genes) | sum(hpaData.gene2Level,2)==0; 0183 else 0184 I=[]; 0185 end 0186 hpaData.genes(I)=[]; 0187 if isfield(hpaData,'gene2Level') 0188 hpaData.gene2Level(I,:)=[]; 0189 end 0190 if isfield(hpaData,'gene2Type') 0191 hpaData.gene2Type(I,:)=[]; 0192 end 0193 if isfield(hpaData,'gene2Reliability') 0194 hpaData.gene2Reliability(I,:)=[]; 0195 end 0196 0197 I=strcmpi(arrayData.tissues,tissue); 0198 %If cell type is supplied, then only keep that cell type 0199 if any(celltype) 0200 I=I & strcmpi(arrayData.celltypes,celltype); 0201 end 0202 0203 %Remove all genes from the structures that are not in model or that aren't 0204 %measured in the tissue 0205 J=~ismember(arrayData.genes,model.genes) | myAll(isnan(arrayData.levels(:,I)),2); 0206 arrayData.genes(J)=[]; 0207 arrayData.levels(J,:)=[]; 0208 if isfield(arrayData,'threshold') 0209 arrayData.threshold(J) = []; 0210 end 0211 0212 %Calculate the scores for the arrayData. These scores are calculated for 0213 %each genes from its fold change between the tissue/celltype(s) in question 0214 %and all other celltypes. This is a lower quality data than protein 0215 %abundance, since a gene that is equally highly expressed in all cell types 0216 %will have a score of 0.0. These scores are therefore only used for genes 0217 %for which there is no HPA data available. The fold changes are transformed 0218 %as min(log(x),10) for x>1 and max(log(x),-5) for x<1 in order to have 0219 %negative scores for lower expressed genes and to scale the scrores to have 0220 %somewhat lower weights than the HPA scores 0221 tempArrayLevels=arrayData.levels; 0222 tempArrayLevels(isnan(tempArrayLevels))=0; 0223 if isfield(arrayData,'threshold') && ~isempty(arrayData.threshold) 0224 % if provided, the user-supplied expression threshold value(s) will be 0225 % used as the "average" expression level to which each gene is 0226 % compared. 0227 average=arrayData.threshold; 0228 else 0229 average=sum(tempArrayLevels,2)./sum(~isnan(arrayData.levels),2); 0230 end 0231 if strcmpi(multipleCellScoring,'best') 0232 current=max(tempArrayLevels(:,I),[],2); 0233 else 0234 current=sum(tempArrayLevels(:,I),2)./sum(~isnan(arrayData.levels(:,I)),2); 0235 end 0236 if ~isempty(current) 0237 aScores=5*log(current./average); 0238 else 0239 aScores=[]; 0240 end 0241 aScores(aScores>0)=min(aScores(aScores>0),10); 0242 aScores(aScores<0)=max(aScores(aScores<0),-5); 0243 aScores(isnan(aScores)) = -5; % NaNs occur when gene expression is zero across all tissues 0244 0245 %Map the HPA levels to scores 0246 [I, J]=ismember(upper(hpaData.levels),upper(hpaLevelScores.names)); 0247 if ~all(I) 0248 EM='There are expression level categories that do not match to hpaLevelScores'; 0249 dispEM(EM); 0250 end 0251 [K, L, M]=find(hpaData.gene2Level); 0252 scores=hpaLevelScores.scores(J); 0253 if strcmpi(multipleCellScoring,'best') 0254 hScores=max(sparse(K,L,scores(M),numel(hpaData.genes),numel(hpaData.tissues)),[],2); 0255 else 0256 hScores=mean(sparse(K,L,scores(M),numel(hpaData.genes),numel(hpaData.tissues)),2); 0257 end 0258 0259 %Get the scores for the genes, only use HPA if available 0260 geneScores=inf(numel(model.genes),1)*-1; 0261 hpaScores=geneScores; 0262 arrayScores=geneScores; 0263 0264 [I, J]=ismember(model.genes,hpaData.genes); 0265 hpaScores(I)=hScores(J(I)); 0266 geneScores(I)=hScores(J(I)); 0267 [I, J]=ismember(model.genes,arrayData.genes); 0268 arrayScores(I)=aScores(J(I)); 0269 geneScores(I & myIsInf(geneScores))=aScores(J(I & myIsInf(geneScores))); 0270 0271 %Remove the genes that have no data from the model 0272 I=ismember(model.genes,hpaData.genes) | ismember(model.genes,arrayData.genes); 0273 model.genes(~I)=[]; 0274 model.rxnGeneMat(:,~I)=[]; 0275 0276 %Map the genes to the HPA/array genes 0277 [hpaExist, hpaMap]=ismember(model.genes,hpaData.genes); 0278 [arrayExist, arrayMap]=ismember(model.genes,arrayData.genes); 0279 0280 %Set the default scores for reactions without genes 0281 rxnScores=ones(numel(model.rxns),1)*noGeneScore; 0282 0283 %Loop through the reactions and calculate the scores 0284 for i=1:numel(model.rxns) 0285 %Check if it has genes 0286 I=find(model.rxnGeneMat(i,:)); 0287 if any(I) 0288 %If any of the genes exist in hpaData, then don't use arrayData 0289 if any(hpaExist(I)) 0290 %At least one gene was found in HPA 0291 if strcmpi(multipleGeneScoring,'best') 0292 rxnScores(i)=max(hScores(hpaMap(I(hpaExist(I))))); 0293 else 0294 rxnScores(i)=mean(hScores(hpaMap(I(hpaExist(I))))); 0295 end 0296 else 0297 %Use array data 0298 if any(arrayExist(I)) 0299 %At least one gene was found in the array data 0300 if strcmpi(multipleGeneScoring,'best') 0301 rxnScores(i)=max(aScores(arrayMap(I(arrayExist(I))))); 0302 else 0303 rxnScores(i)=mean(aScores(arrayMap(I(arrayExist(I))))); 0304 end 0305 end 0306 end 0307 end 0308 end 0309 end 0310 0311 %This is because isinf and all returns 0x1 for empty set, which gives a 0312 %concatenation error. Do like this instead of having many if statements 0313 function y=myIsInf(x) 0314 y=isinf(x); 0315 if isempty(y) 0316 y=[]; 0317 end 0318 end 0319 function y=myAll(x,dim) 0320 y=all(x,dim); 0321 if isempty(y) 0322 y=[]; 0323 end 0324 end