loadProtData Function that loads absolute proteomics data (in mg/gDCW) and returns mean values across replicates for each condition in the data file. By default it also filters the data by various criteria, to remove uncertain data (see input parameters). Input: replPerCond vector with number of replicates for each condition in the dataset. Example: [3, 2] if first conditions has triplicates and second condition has duplicates. protDataFile path to file with proteomics data, where protein levels are in mg/gDCW (Optional, default reads data/proteomics.tsv as specified in modelAdapter) Alternatively, protDataFile can be a protData structure that was previously made by loadProtdata. filterData logical whether abundances should be filtered. If false, minVal, maxRSD, maxMissing and addStdevs are not considered. (Optional, default true) modelAdapter a loaded model adapter (Optional, will otherwise use the default model adapter) minVal threshold of mean protein measurement per condition. (Optional, default = 0) maxRSD maximum relative standard per condition. (Optional, default = 1) maxMissing ratio of replicates for which a protein level might be missing. (Optional, default = 1/3 (or 1/2 if number of replicates = 2)) If conditions have different number of replicates (as indicated in replPerCond), maxMissing can also be a vector of the same length as replPerCond, with individual maxMissing parameters for each replicate. cutLowest percentage of lowest mean values per condition to be discared (not considering NaN values). (Optional, default 5) addStdevs how many standard deviations should be added to the mean value of each protein measurement across replicates, broadening the confidence interval. (Optional, default = 1) Output: protData structure with (filtered) proteome data uniprotIDs cell arrray with Uniprot IDs matching protData.abundances abundances matrix of proteomics data, where each column contains mean abundances per condition Usage: protData = loadProtData(replPerCond, protDataFile, filterData, modelAdapter, minVal, maxRSD, maxMissing, cutLowest, addStdevs)
0001 function protData = loadProtData(replPerCond, protDataFile, filterData, modelAdapter, minVal, maxRSD, maxMissing, cutLowest, addStdevs) 0002 % loadProtData 0003 % Function that loads absolute proteomics data (in mg/gDCW) and returns 0004 % mean values across replicates for each condition in the data file. By 0005 % default it also filters the data by various criteria, to remove 0006 % uncertain data (see input parameters). 0007 % 0008 % Input: 0009 % replPerCond vector with number of replicates for each condition in 0010 % the dataset. Example: [3, 2] if first conditions has 0011 % triplicates and second condition has duplicates. 0012 % protDataFile path to file with proteomics data, where protein levels 0013 % are in mg/gDCW (Optional, default reads 0014 % data/proteomics.tsv as specified in modelAdapter) 0015 % Alternatively, protDataFile can be a protData structure 0016 % that was previously made by loadProtdata. 0017 % filterData logical whether abundances should be filtered. If 0018 % false, minVal, maxRSD, maxMissing and addStdevs are not 0019 % considered. (Optional, default true) 0020 % modelAdapter a loaded model adapter (Optional, will otherwise use 0021 % the default model adapter) 0022 % minVal threshold of mean protein measurement per condition. 0023 % (Optional, default = 0) 0024 % maxRSD maximum relative standard per condition. (Optional, 0025 % default = 1) 0026 % maxMissing ratio of replicates for which a protein level might be 0027 % missing. (Optional, default = 1/3 (or 1/2 if number of 0028 % replicates = 2)) 0029 % If conditions have different number of replicates (as 0030 % indicated in replPerCond), maxMissing can also be a 0031 % vector of the same length as replPerCond, with 0032 % individual maxMissing parameters for each replicate. 0033 % cutLowest percentage of lowest mean values per condition to be 0034 % discared (not considering NaN values). (Optional, default 5) 0035 % addStdevs how many standard deviations should be added to the mean 0036 % value of each protein measurement across replicates, 0037 % broadening the confidence interval. (Optional, 0038 % default = 1) 0039 % 0040 % Output: 0041 % protData structure with (filtered) proteome data 0042 % uniprotIDs cell arrray with Uniprot IDs matching 0043 % protData.abundances 0044 % abundances matrix of proteomics data, where each 0045 % column contains mean abundances per 0046 % condition 0047 % 0048 % Usage: 0049 % protData = loadProtData(replPerCond, protDataFile, filterData, modelAdapter, minVal, maxRSD, maxMissing, cutLowest, addStdevs) 0050 0051 if nargin < 8 || isempty(addStdevs) 0052 addStdevs = 1; 0053 end 0054 0055 if nargin < 7 || isempty(maxMissing) 0056 maxMissing = 2/3; 0057 end 0058 0059 if nargin < 6 || isempty(maxRSD) 0060 maxRSD = 1; 0061 end 0062 0063 if nargin < 5 || isempty(minVal) 0064 minVal = 0; 0065 end 0066 0067 if nargin < 4 || isempty(modelAdapter) 0068 modelAdapter = ModelAdapterManager.getDefault(); 0069 if isempty(modelAdapter) 0070 error('Either send in a modelAdapter or set the default model adapter in the ModelAdapterManager.') 0071 end 0072 end 0073 params = modelAdapter.getParameters(); 0074 0075 if nargin < 3 || isempty(filterData) 0076 filterData = true; 0077 end 0078 0079 if nargin < 2 || isempty(protDataFile) 0080 protDataFile = fullfile(params.path,'data','proteomics.tsv'); 0081 end 0082 0083 format = '%s'; 0084 for i=1:sum(replPerCond) 0085 format = [format ' %f']; 0086 end 0087 if ~isstruct(protDataFile) 0088 fID = fopen(protDataFile); 0089 protDataRaw = textscan(fID,format,'Delimiter','\t','HeaderLines',1,'TreatAsEmpty',{'NA','na','NaN','#VALUE!'}); 0090 uniprotIDs = protDataRaw{1}; 0091 abundances = cell2mat(protDataRaw(2:end)); 0092 fclose(fID); 0093 else 0094 uniprotIDs = protDataFile.uniprotIDs; 0095 abundances = protDataFile.abundances; 0096 end 0097 0098 %Remove entriew without ID 0099 remData = cellfun(@isempty,uniprotIDs); 0100 uniprotIDs(remData,:) = []; 0101 abundances(remData,:) = []; 0102 m = size(abundances,1); 0103 filtAbund = nan(m,numel(replPerCond)); 0104 0105 if filterData 0106 for i=1:numel(replPerCond) 0107 condAbund = abundances(:,1:replPerCond(i)); 0108 if i<numel(replPerCond) 0109 abundances = abundances(:,replPerCond(i)+1:end); 0110 end 0111 % First filter maxMissing 0112 if size(condAbund,2) > 1 0113 if numel(maxMissing)>1 0114 maxMisRepl = maxMissing(i); 0115 else 0116 maxMisRepl = maxMissing; 0117 end 0118 remData = sum(condAbund>0,2)<maxMisRepl*size(condAbund,2); 0119 condAbund(remData,:) = nan; 0120 end 0121 % Filter by RSD 0122 remData = (std(condAbund,0,2,'omitnan')./mean(condAbund,2,'omitnan'))>maxRSD; 0123 condAbund(remData) = nan; 0124 % Add stdevs 0125 condAbund = mean(condAbund,2,'omitnan')+(addStdevs*std(condAbund,0,2,'omitnan')); 0126 % Filter by minVal 0127 remData = mean(condAbund,2,'omitnan')<minVal; 0128 condAbund(remData) = nan; 0129 % Remove bottom 5% 0130 numData = find(~isnan(condAbund)); 0131 [~,sortData] = sort(condAbund); 0132 lowCutoff = floor(numel(numData)*0.05); 0133 condAbund(sortData(1:lowCutoff)) = nan; 0134 % Combine conditions 0135 filtAbund(:,i) = condAbund; 0136 end 0137 else 0138 for i=1:numel(replPerCond) 0139 condAbund = abundances(:,1:replPerCond(i)); 0140 if i<numel(replPerCond) 0141 abundances = abundances(:,replPerCond(i)+1:end); 0142 end 0143 filtAbund(:,i) = mean(condAbund,2,'omitnan'); 0144 end 0145 end 0146 notAllNan = logical(sum(~isnan(filtAbund),2)); 0147 protData.abundances = filtAbund(notAllNan,:); 0148 protData.uniprotIDs = uniprotIDs(notAllNan); 0149 end 0150