Description of loadProtData

 loadProtData
   Function that loads absolute proteomics data (in mg/gDCW) and returns
   mean values across replicates for each condition in the data file. By
   default it also filters the data by various criteria, to remove
   uncertain data (see input parameters).

 Input:
   replPerCond     vector with number of replicates for each condition in
                   the dataset. Example: [3, 2] if first conditions has
                   triplicates and second condition has duplicates.
   protDataFile    path to file with proteomics data, where protein levels
                   are in mg/gDCW (Optional, default reads 
                   data/proteomics.tsv as specified in modelAdapter)
                   Alternatively, protDataFile can be a protData structure
                   that was previously made by loadProtdata.
   filterData      logical whether abundances should be filtered. If
                   false, minVal, maxRSD, maxMissing and addStdevs are not
                   considered. (Optional, default true)
   modelAdapter    a loaded model adapter (Optional, will otherwise use
                   the default model adapter)
   minVal          threshold of mean protein measurement per condition.
                   (Optional, default = 0)
   maxRSD          maximum relative standard per condition. (Optional,
                   default = 1)
   maxMissing      ratio of replicates for which a protein level might be
                   missing. (Optional, default = 1/3 (or 1/2 if number of
                   replicates = 2))
                   If conditions have different number of replicates (as
                   indicated in replPerCond), maxMissing can also be a
                   vector of the same length as replPerCond, with
                   individual maxMissing parameters for each replicate.
   cutLowest       percentage of lowest mean values per condition to be
                   discared (not considering NaN values). (Optional, default 5)
   addStdevs       how many standard deviations should be added to the mean
                   value of each protein measurement across replicates,
                   broadening the confidence interval. (Optional,
                   default = 1)

 Output:
   protData        structure with (filtered) proteome data
                   uniprotIDs  cell arrray with Uniprot IDs matching
                               protData.abundances
                   abundances  matrix of proteomics data, where each
                               column contains mean abundances per
                               condition

 Usage:
   protData = loadProtData(replPerCond, protDataFile, filterData, modelAdapter, minVal, maxRSD, maxMissing, cutLowest, addStdevs)

0001 function protData = loadProtData(replPerCond, protDataFile, filterData, modelAdapter, minVal, maxRSD, maxMissing, cutLowest, addStdevs)
0002 % loadProtData
0003 %   Function that loads absolute proteomics data (in mg/gDCW) and returns
0004 %   mean values across replicates for each condition in the data file. By
0005 %   default it also filters the data by various criteria, to remove
0006 %   uncertain data (see input parameters).
0007 %
0008 % Input:
0009 %   replPerCond     vector with number of replicates for each condition in
0010 %                   the dataset. Example: [3, 2] if first conditions has
0011 %                   triplicates and second condition has duplicates.
0012 %   protDataFile    path to file with proteomics data, where protein levels
0013 %                   are in mg/gDCW (Optional, default reads
0014 %                   data/proteomics.tsv as specified in modelAdapter)
0015 %                   Alternatively, protDataFile can be a protData structure
0016 %                   that was previously made by loadProtdata.
0017 %   filterData      logical whether abundances should be filtered. If
0018 %                   false, minVal, maxRSD, maxMissing and addStdevs are not
0019 %                   considered. (Optional, default true)
0020 %   modelAdapter    a loaded model adapter (Optional, will otherwise use
0021 %                   the default model adapter)
0022 %   minVal          threshold of mean protein measurement per condition.
0023 %                   (Optional, default = 0)
0024 %   maxRSD          maximum relative standard per condition. (Optional,
0025 %                   default = 1)
0026 %   maxMissing      ratio of replicates for which a protein level might be
0027 %                   missing. (Optional, default = 1/3 (or 1/2 if number of
0028 %                   replicates = 2))
0029 %                   If conditions have different number of replicates (as
0030 %                   indicated in replPerCond), maxMissing can also be a
0031 %                   vector of the same length as replPerCond, with
0032 %                   individual maxMissing parameters for each replicate.
0033 %   cutLowest       percentage of lowest mean values per condition to be
0034 %                   discared (not considering NaN values). (Optional, default 5)
0035 %   addStdevs       how many standard deviations should be added to the mean
0036 %                   value of each protein measurement across replicates,
0037 %                   broadening the confidence interval. (Optional,
0038 %                   default = 1)
0039 %
0040 % Output:
0041 %   protData        structure with (filtered) proteome data
0042 %                   uniprotIDs  cell arrray with Uniprot IDs matching
0043 %                               protData.abundances
0044 %                   abundances  matrix of proteomics data, where each
0045 %                               column contains mean abundances per
0046 %                               condition
0047 %
0048 % Usage:
0049 %   protData = loadProtData(replPerCond, protDataFile, filterData, modelAdapter, minVal, maxRSD, maxMissing, cutLowest, addStdevs)
0050 
0051 if nargin < 8 || isempty(addStdevs)
0052     addStdevs = 1;
0053 end
0054 
0055 if nargin < 7 || isempty(maxMissing)
0056     maxMissing = 2/3;
0057 end
0058 
0059 if nargin < 6 || isempty(maxRSD)
0060     maxRSD = 1;
0061 end
0062 
0063 if nargin < 5 || isempty(minVal)
0064     minVal = 0;
0065 end
0066 
0067 if nargin < 4 || isempty(modelAdapter)
0068     modelAdapter = ModelAdapterManager.getDefault();
0069     if isempty(modelAdapter)
0070         error('Either send in a modelAdapter or set the default model adapter in the ModelAdapterManager.')
0071     end
0072 end
0073 params = modelAdapter.getParameters();
0074 
0075 if nargin < 3 || isempty(filterData)
0076     filterData = true;
0077 end
0078 
0079 if nargin < 2 || isempty(protDataFile)
0080     protDataFile = fullfile(params.path,'data','proteomics.tsv');
0081 end
0082 
0083 format = '%s';
0084 for i=1:sum(replPerCond)
0085     format = [format ' %f'];
0086 end
0087 if ~isstruct(protDataFile)
0088     fID         = fopen(protDataFile);
0089     protDataRaw = textscan(fID,format,'Delimiter','\t','HeaderLines',1,'TreatAsEmpty',{'NA','na','NaN','#VALUE!'});
0090     uniprotIDs  = protDataRaw{1};
0091     abundances  = cell2mat(protDataRaw(2:end));
0092     fclose(fID);
0093 else
0094     uniprotIDs  = protDataFile.uniprotIDs;
0095     abundances  = protDataFile.abundances;
0096 end
0097 
0098 %Remove entriew without ID
0099 remData = cellfun(@isempty,uniprotIDs);
0100 uniprotIDs(remData,:) = [];
0101 abundances(remData,:) = [];
0102 m                     = size(abundances,1);
0103 filtAbund             = nan(m,numel(replPerCond));
0104 
0105 if filterData
0106     for i=1:numel(replPerCond)
0107         condAbund    = abundances(:,1:replPerCond(i));
0108         if i<numel(replPerCond)
0109             abundances   = abundances(:,replPerCond(i)+1:end);
0110         end
0111         % First filter maxMissing
0112         if size(condAbund,2) > 1
0113             if numel(maxMissing)>1
0114                 maxMisRepl = maxMissing(i);
0115             else
0116                 maxMisRepl = maxMissing;
0117             end
0118             remData = sum(condAbund>0,2)<maxMisRepl*size(condAbund,2);
0119             condAbund(remData,:) = nan;
0120         end
0121         % Filter by RSD
0122         remData = (std(condAbund,0,2,'omitnan')./mean(condAbund,2,'omitnan'))>maxRSD;
0123         condAbund(remData) = nan;
0124         % Add stdevs
0125         condAbund = mean(condAbund,2,'omitnan')+(addStdevs*std(condAbund,0,2,'omitnan'));
0126         % Filter by minVal
0127         remData = mean(condAbund,2,'omitnan')<minVal;
0128         condAbund(remData) = nan;
0129         % Remove bottom 5%
0130         numData  = find(~isnan(condAbund));
0131         [~,sortData] = sort(condAbund);
0132         lowCutoff = floor(numel(numData)*0.05);
0133         condAbund(sortData(1:lowCutoff)) = nan;
0134         % Combine conditions
0135         filtAbund(:,i) = condAbund;
0136     end
0137 else
0138     for i=1:numel(replPerCond)
0139         condAbund    = abundances(:,1:replPerCond(i));
0140         if i<numel(replPerCond)
0141             abundances = abundances(:,replPerCond(i)+1:end);
0142         end
0143         filtAbund(:,i) = mean(condAbund,2,'omitnan');
0144     end
0145 end
0146 notAllNan = logical(sum(~isnan(filtAbund),2));
0147 protData.abundances = filtAbund(notAllNan,:);
0148 protData.uniprotIDs = uniprotIDs(notAllNan);
0149 end
0150

loadProtData

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SOURCE CODE