Home > hpa > parseHPArna.m

parseHPArna

PURPOSE ^

parseHPA

SYNOPSIS ^

function arrayData=parseHPArna(fileName, version)

DESCRIPTION ^

 parseHPA
   Parses a database dump of the Human Protein Atlas (HPA) RNA-Seq data.

   Input:
   fileName            tab-separated database dump of HPA RNA data. For
                       details regarding the format, see
                       http://www.proteinatlas.org/about/download.
   version             version of HPA [optional, default=19]


   Output:
   arrayData
       genes               cell array with the unique ensemble gene IDs
       geneNames           cell array with the gene names (gene abbrevs)
       tissues             cell array with the tissue names
       levels              matrix of gene expression levels (TPM), where
                           rows correspond to genes, and columns
                           correspond to tissues

   Usage: arrayData=parseHPArna(fileName,version)

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function arrayData=parseHPArna(fileName, version)
0002 % parseHPA
0003 %   Parses a database dump of the Human Protein Atlas (HPA) RNA-Seq data.
0004 %
0005 %   Input:
0006 %   fileName            tab-separated database dump of HPA RNA data. For
0007 %                       details regarding the format, see
0008 %                       http://www.proteinatlas.org/about/download.
0009 %   version             version of HPA [optional, default=19]
0010 %
0011 %
0012 %   Output:
0013 %   arrayData
0014 %       genes               cell array with the unique ensemble gene IDs
0015 %       geneNames           cell array with the gene names (gene abbrevs)
0016 %       tissues             cell array with the tissue names
0017 %       levels              matrix of gene expression levels (TPM), where
0018 %                           rows correspond to genes, and columns
0019 %                           correspond to tissues
0020 %
0021 %   Usage: arrayData=parseHPArna(fileName,version)
0022 
0023 if nargin<2
0024     %Change this and add code for more versions when the current HPA
0025     %version is increased and the format is changed
0026     version=19;
0027 end
0028 
0029 fileName=char(fileName);
0030 if ~isfile(fileName)
0031     error('HPA file %s cannot be found', string(fileName));
0032 end
0033 
0034 %NOTE! This function assumes that the first 4 columns contain (in order):
0035 % (1) gene ensembl ID, (2) gene abbrev, (3) tissue name,  (4) TPM value
0036 if (version == 19)
0037     headers={'Gene' 'Gene name' 'Tissue' 'TPM' 'pTPM' 'NX'};
0038 elseif (version == 18)
0039     headers={'Gene' 'Gene name' 'Sample' 'Value' 'Unit'};
0040 else
0041     error('Only HPA versions 18 and 19 are currently supported.');
0042 end
0043 
0044 %extract data from file
0045 formatSpec = strip(repmat('%q ', 1, numel(headers)));
0046 fid=fopen(fileName, 'r');
0047 hpa=textscan(fid, formatSpec, 'Delimiter', '\t');
0048 fclose(fid);
0049 
0050 %Go through and see if the headers match what was expected
0051 for i=1:numel(headers)
0052     if ~strcmpi(headers(i),hpa{i}(1))
0053         EM=['Could not find the header "' headers{i} '". Make sure that the input file matches the format specified at http://www.proteinatlas.org/about/download'];
0054         dispEM(EM);
0055     end
0056     %Remove the header line here
0057     hpa{i}(1)=[];
0058 end
0059 
0060 %Get unique gene IDs and tissue names
0061 [arrayData.genes, P, I] = unique(hpa{1});
0062 arrayData.geneNames = hpa{2}(P);  % retrieve corresponding gene names
0063 [arrayData.tissues, ~, J] = unique(hpa{3});
0064 
0065 %Now extract the gene levels and organize into matrix
0066 arrayData.levels = NaN(max(I),max(J));
0067 linearInd = sub2ind(size(arrayData.levels),I,J);
0068 arrayData.levels(linearInd) = str2double(hpa{4});
0069     
0070

Generated by m2html © 2005