parseHPA Parses a database dump of the Human Protein Atlas (HPA) RNA-Seq data. Input: fileName tab-separated database dump of HPA RNA data. For details regarding the format, see http://www.proteinatlas.org/about/download. version version of HPA [optional, default=19] Output: arrayData genes cell array with the unique ensemble gene IDs geneNames cell array with the gene names (gene abbrevs) tissues cell array with the tissue names levels matrix of gene expression levels (TPM), where rows correspond to genes, and columns correspond to tissues Usage: arrayData=parseHPArna(fileName,version)
0001 function arrayData=parseHPArna(fileName, version) 0002 % parseHPA 0003 % Parses a database dump of the Human Protein Atlas (HPA) RNA-Seq data. 0004 % 0005 % Input: 0006 % fileName tab-separated database dump of HPA RNA data. For 0007 % details regarding the format, see 0008 % http://www.proteinatlas.org/about/download. 0009 % version version of HPA [optional, default=19] 0010 % 0011 % 0012 % Output: 0013 % arrayData 0014 % genes cell array with the unique ensemble gene IDs 0015 % geneNames cell array with the gene names (gene abbrevs) 0016 % tissues cell array with the tissue names 0017 % levels matrix of gene expression levels (TPM), where 0018 % rows correspond to genes, and columns 0019 % correspond to tissues 0020 % 0021 % Usage: arrayData=parseHPArna(fileName,version) 0022 0023 if nargin<2 0024 %Change this and add code for more versions when the current HPA 0025 %version is increased and the format is changed 0026 version=19; 0027 end 0028 0029 fileName=char(fileName); 0030 if ~isfile(fileName) 0031 error('HPA file %s cannot be found', string(fileName)); 0032 end 0033 0034 %NOTE! This function assumes that the first 4 columns contain (in order): 0035 % (1) gene ensembl ID, (2) gene abbrev, (3) tissue name, (4) TPM value 0036 if (version == 19) 0037 headers={'Gene' 'Gene name' 'Tissue' 'TPM' 'pTPM' 'NX'}; 0038 elseif (version == 18) 0039 headers={'Gene' 'Gene name' 'Sample' 'Value' 'Unit'}; 0040 else 0041 error('Only HPA versions 18 and 19 are currently supported.'); 0042 end 0043 0044 %extract data from file 0045 formatSpec = strip(repmat('%q ', 1, numel(headers))); 0046 fid=fopen(fileName, 'r'); 0047 hpa=textscan(fid, formatSpec, 'Delimiter', '\t'); 0048 fclose(fid); 0049 0050 %Go through and see if the headers match what was expected 0051 for i=1:numel(headers) 0052 if ~strcmpi(headers(i),hpa{i}(1)) 0053 EM=['Could not find the header "' headers{i} '". Make sure that the input file matches the format specified at http://www.proteinatlas.org/about/download']; 0054 dispEM(EM); 0055 end 0056 %Remove the header line here 0057 hpa{i}(1)=[]; 0058 end 0059 0060 %Get unique gene IDs and tissue names 0061 [arrayData.genes, P, I] = unique(hpa{1}); 0062 arrayData.geneNames = hpa{2}(P); % retrieve corresponding gene names 0063 [arrayData.tissues, ~, J] = unique(hpa{3}); 0064 0065 %Now extract the gene levels and organize into matrix 0066 arrayData.levels = NaN(max(I),max(J)); 0067 linearInd = sub2ind(size(arrayData.levels),I,J); 0068 arrayData.levels(linearInd) = str2double(hpa{4}); 0069 0070