Home > hpa > scoreModel.m

scoreModel

PURPOSE ^

scoreRxns

SYNOPSIS ^

function [rxnScores, geneScores, hpaScores, arrayScores]=scoreModel(model,hpaData,arrayData,tissue,celltype,noGeneScore,multipleGeneScoring,multipleCellScoring,hpaLevelScores)

DESCRIPTION ^

 scoreRxns
   Scores the reactions and genes in a model based on expression data
   from HPA and/or gene arrays

   Input:
   model               a model structure
   hpaData             HPA data structure from parseHPA (optional if arrayData is
                       supplied, default [])
   arrayData           gene expression data structure (optional if hpaData is
                       supplied, default [])
       genes           cell array with the unique gene names
       tissues         cell array with the tissue names. The list may not be
                       unique, as there can be multiple cell types per tissue
       celltypes       cell array with the cell type names for each tissue
       levels          GENESxTISSUES array with the expression level for
                       each gene in each tissue/celltype. NaN should be
                       used when no measurement was performed
       threshold       a single value or a vector of gene expression 
                       thresholds, above which genes are considered to be
                       "expressed". (optional, by default, the mean expression
                       levels of each gene across all tissues in arrayData
                       will be used as the threshold values)
   tissue              tissue to score for. Should exist in either
                       hpaData.tissues or arrayData.tissues
   celltype            cell type to score for. Should exist in either
                       hpaData.celltypes or arrayData.celltypes for this
                       tissue (optional, default is to use the best values
                       among all the cell types for the tissue. Use [] if
                       you want to supply more arguments)
   noGeneScore         score for reactions without genes (optional, default -2)
   multipleGeneScoring determines how scores are calculated for reactions
                       with several genes ('best' or 'average')
                       (optional, default 'best')
   multipleCellScoring determines how scores are calculated when several
                       cell types are used ('best' or 'average')
                       (optional, default 'best')
   hpaLevelScores      structure with numerical scores for the expression
                       level categories from HPA. The structure should have a
                       "names" and a "scores" field (optional, see code for
                       default scores)


   Output:
   rxnScores       scores for each of the reactions in model
   geneScores      scores for each of the genes in model. Genes which are
                   not in the dataset(s) have -Inf as scores
   hpaScores       scores for each of the genes in model if only taking hpaData
                   into account. Genes which are not in the dataset(s)
                   have -Inf as scores
   arrayScores     scores for each of the genes in model if only taking arrayData
                   into account. Genes which are not in the dataset(s)
                   have -Inf as scores

 Usage: [rxnScores, geneScores, hpaScores, arrayScores]=scoreModel(model,...
               hpaData,arrayData,tissue,celltype,noGeneScore,multipleGeneScoring,...
               multipleCellScoring,hpaLevelScores)

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function [rxnScores, geneScores, hpaScores, arrayScores]=scoreModel(model,hpaData,arrayData,tissue,celltype,noGeneScore,multipleGeneScoring,multipleCellScoring,hpaLevelScores)
0002 % scoreRxns
0003 %   Scores the reactions and genes in a model based on expression data
0004 %   from HPA and/or gene arrays
0005 %
0006 %   Input:
0007 %   model               a model structure
0008 %   hpaData             HPA data structure from parseHPA (optional if arrayData is
0009 %                       supplied, default [])
0010 %   arrayData           gene expression data structure (optional if hpaData is
0011 %                       supplied, default [])
0012 %       genes           cell array with the unique gene names
0013 %       tissues         cell array with the tissue names. The list may not be
0014 %                       unique, as there can be multiple cell types per tissue
0015 %       celltypes       cell array with the cell type names for each tissue
0016 %       levels          GENESxTISSUES array with the expression level for
0017 %                       each gene in each tissue/celltype. NaN should be
0018 %                       used when no measurement was performed
0019 %       threshold       a single value or a vector of gene expression
0020 %                       thresholds, above which genes are considered to be
0021 %                       "expressed". (optional, by default, the mean expression
0022 %                       levels of each gene across all tissues in arrayData
0023 %                       will be used as the threshold values)
0024 %   tissue              tissue to score for. Should exist in either
0025 %                       hpaData.tissues or arrayData.tissues
0026 %   celltype            cell type to score for. Should exist in either
0027 %                       hpaData.celltypes or arrayData.celltypes for this
0028 %                       tissue (optional, default is to use the best values
0029 %                       among all the cell types for the tissue. Use [] if
0030 %                       you want to supply more arguments)
0031 %   noGeneScore         score for reactions without genes (optional, default -2)
0032 %   multipleGeneScoring determines how scores are calculated for reactions
0033 %                       with several genes ('best' or 'average')
0034 %                       (optional, default 'best')
0035 %   multipleCellScoring determines how scores are calculated when several
0036 %                       cell types are used ('best' or 'average')
0037 %                       (optional, default 'best')
0038 %   hpaLevelScores      structure with numerical scores for the expression
0039 %                       level categories from HPA. The structure should have a
0040 %                       "names" and a "scores" field (optional, see code for
0041 %                       default scores)
0042 %
0043 %
0044 %   Output:
0045 %   rxnScores       scores for each of the reactions in model
0046 %   geneScores      scores for each of the genes in model. Genes which are
0047 %                   not in the dataset(s) have -Inf as scores
0048 %   hpaScores       scores for each of the genes in model if only taking hpaData
0049 %                   into account. Genes which are not in the dataset(s)
0050 %                   have -Inf as scores
0051 %   arrayScores     scores for each of the genes in model if only taking arrayData
0052 %                   into account. Genes which are not in the dataset(s)
0053 %                   have -Inf as scores
0054 %
0055 % Usage: [rxnScores, geneScores, hpaScores, arrayScores]=scoreModel(model,...
0056 %               hpaData,arrayData,tissue,celltype,noGeneScore,multipleGeneScoring,...
0057 %               multipleCellScoring,hpaLevelScores)
0058 
0059 if nargin<3
0060     arrayData=[];
0061 end
0062 if nargin<5
0063     celltype=[];
0064 else
0065     celltype=char(celltype);
0066 end
0067 if nargin<6
0068     noGeneScore=-2;
0069 end
0070 if nargin<7
0071     multipleGeneScoring='best';
0072 else
0073     multipleGeneScoring=char(multipleGeneScoring);
0074 end
0075 if nargin<8
0076     multipleCellScoring='best';
0077 else
0078     multipleCellScoring=char(multipleCellScoring);    
0079 end
0080 if nargin<9
0081     %The first four are for APE, the other ones for staining
0082     hpaLevelScores.names={'High' 'Medium' 'Low' 'None' 'Strong' 'Moderate' 'Weak' 'Negative' 'Not detected'};
0083     hpaLevelScores.scores=[20 15 10 -8 20 15 10 -8 -8];
0084 end
0085 
0086 if isempty(hpaData) && isempty(arrayData)
0087     EM='Must supply hpaData, arrayData or both';
0088     dispEM(EM);
0089 end
0090 if ~strcmpi(multipleGeneScoring,'best') && ~strcmpi(multipleGeneScoring,'average')
0091     EM='Valid options for multipleGeneScoring are "best" or "average"';
0092     dispEM(EM);
0093 end
0094 if ~strcmpi(multipleCellScoring,'best') && ~strcmpi(multipleCellScoring,'average')
0095     EM='Valid options for multipleCellScoring are "best" or "average"';
0096     dispEM(EM);
0097 end
0098 
0099 
0100 %Throw an error if array data for only one tissue is supplied without
0101 %specifying threshold values
0102 if ~isempty(arrayData)
0103     if numel(unique(arrayData.tissues))<2
0104         if ~isfield(arrayData,'threshold') || isempty(arrayData.threshold)
0105             EM='arrayData must contain measurements for at least two celltypes/tissues since the score is calculated based on the expression level compared to the overall average';
0106             dispEM(EM);
0107         end
0108     end
0109 end
0110 
0111 %Process arrayData.threshold if necessary
0112 if isfield(arrayData,'threshold') && (numel(arrayData.threshold) == 1)
0113     % if only a single gene threshold value is provided, then just
0114     % duplicate this value for all genes.
0115     arrayData.threshold = arrayData.threshold*ones(size(arrayData.genes));
0116 end
0117 
0118 %This is so that the code can ignore which combination of input data that is
0119 %used
0120 if isempty(arrayData)
0121     arrayData.genes={};
0122     arrayData.tissues={};
0123     arrayData.celltypes={};
0124     arrayData.levels=[];
0125     arrayData.threshold=[];
0126 end
0127 if isempty(hpaData)
0128     hpaData.genes={};
0129     hpaData.tissues={};
0130     hpaData.celltypes={};
0131     hpaData.levels={};
0132     hpaData.types={};
0133     hpaData.reliabilities={};
0134     hpaData.gene2Level=[];
0135     hpaData.gene2Type=[];
0136     hpaData.gene2Reliability=[];
0137 end
0138 
0139 %Check that the tissue exists
0140 if ~ismember(upper(tissue),upper(hpaData.tissues)) && ~ismember(upper(tissue),upper(arrayData.tissues))
0141     EM='The tissue name does not match';
0142     dispEM(EM);
0143 end
0144 if any(celltype)
0145     %Check that both data types has cell type defined if that is to be used
0146     if ~isfield(hpaData,'celltypes') || ~isfield(arrayData,'celltypes')
0147         EM='Both hpaData and arrayData must contain cell type information if cell type is to be used';
0148         dispEM(EM);
0149     end
0150     if ~ismember(upper(celltype),upper(hpaData.celltypes)) && ~ismember(upper(celltype),upper(arrayData.celltypes))
0151         EM='The cell type name does not match';
0152         dispEM(EM);
0153     end
0154 end
0155 
0156 %Some preprocessing of the structures to speed up a little Remove all
0157 %tissues that are not the correct one
0158 J=~strcmpi(hpaData.tissues,tissue);
0159 
0160 %If cell type is supplied, then only keep that cell type
0161 if any(celltype)
0162     J=J | ~strcmpi(hpaData.celltypes,celltype);
0163 end
0164 
0165 hpaData.tissues(J)=[];
0166 if isfield(hpaData,'celltypes')
0167     hpaData.celltypes(J)=[];
0168 end
0169 if isfield(hpaData,'gene2Level')
0170     hpaData.gene2Level(:,J)=[];
0171 end
0172 if isfield(hpaData,'gene2Type')
0173     hpaData.gene2Type(:,J)=[];
0174 end
0175 if isfield(hpaData,'gene2Reliability')
0176     hpaData.gene2Reliability(:,J)=[];
0177 end
0178 
0179 %Remove all genes from the structures that are not in model or that aren't
0180 %measured in the tissue
0181 if ~isempty(hpaData.genes) %This should not be necessary, but the summation is a 0x1 matrix and the other is []
0182     I=~ismember(hpaData.genes,model.genes) | sum(hpaData.gene2Level,2)==0;
0183 else
0184     I=[];
0185 end
0186 hpaData.genes(I)=[];
0187 if isfield(hpaData,'gene2Level')
0188     hpaData.gene2Level(I,:)=[];
0189 end
0190 if isfield(hpaData,'gene2Type')
0191     hpaData.gene2Type(I,:)=[];
0192 end
0193 if isfield(hpaData,'gene2Reliability')
0194     hpaData.gene2Reliability(I,:)=[];
0195 end
0196 
0197 I=strcmpi(arrayData.tissues,tissue);
0198 %If cell type is supplied, then only keep that cell type
0199 if any(celltype)
0200     I=I & strcmpi(arrayData.celltypes,celltype);
0201 end
0202 
0203 %Remove all genes from the structures that are not in model or that aren't
0204 %measured in the tissue
0205 J=~ismember(arrayData.genes,model.genes) | myAll(isnan(arrayData.levels(:,I)),2);
0206 arrayData.genes(J)=[];
0207 arrayData.levels(J,:)=[];
0208 if isfield(arrayData,'threshold')
0209     arrayData.threshold(J) = [];
0210 end
0211 
0212 %Calculate the scores for the arrayData. These scores are calculated for
0213 %each genes from its fold change between the tissue/celltype(s) in question
0214 %and all other celltypes. This is a lower quality data than protein
0215 %abundance, since a gene that is equally highly expressed in all cell types
0216 %will have a score of 0.0. These scores are therefore only used for genes
0217 %for which there is no HPA data available. The fold changes are transformed
0218 %as min(log(x),10) for x>1 and max(log(x),-5) for x<1 in order to have
0219 %negative scores for lower expressed genes and to scale the scrores to have
0220 %somewhat lower weights than the HPA scores
0221 tempArrayLevels=arrayData.levels;
0222 tempArrayLevels(isnan(tempArrayLevels))=0;
0223 if isfield(arrayData,'threshold') && ~isempty(arrayData.threshold)
0224     % if provided, the user-supplied expression threshold value(s) will be
0225     % used as the "average" expression level to which each gene is
0226     % compared.
0227     average=arrayData.threshold;
0228 else
0229     average=sum(tempArrayLevels,2)./sum(~isnan(arrayData.levels),2);
0230 end
0231 if strcmpi(multipleCellScoring,'best')
0232     current=max(tempArrayLevels(:,I),[],2);
0233 else
0234     current=sum(tempArrayLevels(:,I),2)./sum(~isnan(arrayData.levels(:,I)),2);
0235 end
0236 if ~isempty(current)
0237     aScores=5*log(current./average);
0238 else
0239     aScores=[];
0240 end
0241 aScores(aScores>0)=min(aScores(aScores>0),10);
0242 aScores(aScores<0)=max(aScores(aScores<0),-5);
0243 aScores(isnan(aScores)) = -5;  % NaNs occur when gene expression is zero across all tissues
0244 
0245 %Map the HPA levels to scores
0246 [I, J]=ismember(upper(hpaData.levels),upper(hpaLevelScores.names));
0247 if ~all(I)
0248     EM='There are expression level categories that do not match to hpaLevelScores';
0249     dispEM(EM);
0250 end
0251 [K, L, M]=find(hpaData.gene2Level);
0252 scores=hpaLevelScores.scores(J);
0253 if strcmpi(multipleCellScoring,'best')
0254     hScores=max(sparse(K,L,scores(M),numel(hpaData.genes),numel(hpaData.tissues)),[],2);
0255 else
0256     hScores=mean(sparse(K,L,scores(M),numel(hpaData.genes),numel(hpaData.tissues)),2);
0257 end
0258 
0259 %Get the scores for the genes, only use HPA if available
0260 geneScores=inf(numel(model.genes),1)*-1;
0261 hpaScores=geneScores;
0262 arrayScores=geneScores;
0263 
0264 [I, J]=ismember(model.genes,hpaData.genes);
0265 hpaScores(I)=hScores(J(I));
0266 geneScores(I)=hScores(J(I));
0267 [I, J]=ismember(model.genes,arrayData.genes);
0268 arrayScores(I)=aScores(J(I));
0269 geneScores(I & myIsInf(geneScores))=aScores(J(I & myIsInf(geneScores)));
0270 
0271 %Remove the genes that have no data from the model
0272 I=ismember(model.genes,hpaData.genes) | ismember(model.genes,arrayData.genes);
0273 model.genes(~I)=[];
0274 model.rxnGeneMat(:,~I)=[];
0275 
0276 %Map the genes to the HPA/array genes
0277 [hpaExist, hpaMap]=ismember(model.genes,hpaData.genes);
0278 [arrayExist, arrayMap]=ismember(model.genes,arrayData.genes);
0279 
0280 %Set the default scores for reactions without genes
0281 rxnScores=ones(numel(model.rxns),1)*noGeneScore;
0282 
0283 %Loop through the reactions and calculate the scores
0284 for i=1:numel(model.rxns)
0285     %Check if it has genes
0286     I=find(model.rxnGeneMat(i,:));
0287     if any(I)
0288         %If any of the genes exist in hpaData, then don't use arrayData
0289         if any(hpaExist(I))
0290             %At least one gene was found in HPA
0291             if strcmpi(multipleGeneScoring,'best')
0292                 rxnScores(i)=max(hScores(hpaMap(I(hpaExist(I)))));
0293             else
0294                 rxnScores(i)=mean(hScores(hpaMap(I(hpaExist(I)))));
0295             end
0296         else
0297             %Use array data
0298             if any(arrayExist(I))
0299                 %At least one gene was found in the array data
0300                 if strcmpi(multipleGeneScoring,'best')
0301                     rxnScores(i)=max(aScores(arrayMap(I(arrayExist(I)))));
0302                 else
0303                     rxnScores(i)=mean(aScores(arrayMap(I(arrayExist(I)))));
0304                 end
0305             end
0306         end
0307     end
0308 end
0309 end
0310 
0311 %This is because isinf and all returns 0x1 for empty set, which gives a
0312 %concatenation error. Do like this instead of having many if statements
0313 function y=myIsInf(x)
0314 y=isinf(x);
0315 if isempty(y)
0316     y=[];
0317 end
0318 end
0319 function y=myAll(x,dim)
0320 y=all(x,dim);
0321 if isempty(y)
0322     y=[];
0323 end
0324 end

Generated by m2html © 2005