Question

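I often have a table in which a combination of columns acts as a grouping key / common identifier, so the key can repeat across rows. A simple example: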

sampleId = [1 1 1 3 3 3]';
entity = [1 2 3 1 4 5]';
dataTable = table(sampleId, entity)

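Here, the entity observations can be thought of as attached to samples 1 and 3.

I find it very useful to compress this data so that the key is unique across rows. For example, I would like a final table that looks like this: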

----------------------------
|  sampleId  |  entity     |
----------------------------
|      1     |  3x1 table  |
|      3     |  3x1 table  |
----------------------------

The only way I know to do this is with a for loop, as follows:

tempCell = cell(length(unique(dataTable.sampleId)), 1);
counter = 1;
nonGroupVariables = dataTable.Properties.VariableNames(...
                    ~ismember(dataTable.Properties.VariableNames,'sampleId'));


for sampleId = unique(dataTable.sampleId)'

    tempCell(counter) = {dataTable(dataTable.sampleId == sampleId, nonGroupVariables)};

    counter = counter + 1;

end

newDataTable = table(unique(dataTable.sampleId), tempCell, 'VariableNames', ['sampleId', nonGroupVariables]);

Is there a better (more efficient / faster) way to achieve this, perhaps using accumarray or grouping?

Answer 1

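You can indeed use accumarray. I will distinguish two cases:

The table has n+1 columns. The first n are grouping variables and the last column is the data variable.
The table has n+m columns. The first n are grouping variables and the last m are data variables.

The second case includes the first, of course, but it is easier to consider the first case and then move on to the second.

n grouping variables, 1 data variable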

sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]'; %'
dataTable = table(sampleId, sampleId2, entity); %// example data
n = 2; %// number of grouping variables

[u, ~, v] = unique(dataTable{:,1:n}, 'rows');
c = accumarray(v, dataTable{:,n+1}, [], @(x) {x}); %// cell array of vectors,
    %// where each vector refers to one value of the grouping variable
ut = mat2cell(u, size(u,1), ones(1,n)); %// convert to cell array
compressedTable = [table(ut{:}, 'VariableNames', dataTable.Properties.VariableNames(1:n)) ...
    cell2table(c, 'VariableNames', dataTable.Properties.VariableNames(n+1))];
    %// create output table with correct variable names

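This produces a table in which:

The first n columns contain the unique combinations of the grouping variables, i.e. of the first n columns of the original table.
The last column contains, in each row, a cell holding a numeric vector. That vector contains all the values corresponding to the combination of grouping variables given by that row.

Note that curly-brace indexing into the table is used so that the code does not depend on the table's variable names. For the example above, the result is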
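To make the two building blocks concrete, here is a minimal sketch of what unique(..., 'rows') and accumarray return for the example data. The names keys, vals and cOrdered are only illustrative and do not appear in the answer. The last step is an optional variation, not part of the answer's code: accumarray is not guaranteed to keep the original row order inside each group when the subscripts are unsorted, so the sketch sorts them first.

```matlab
% Grouping columns and data column from the example (illustrative names)
keys = [1 1; 1 1; 1 2; 3 3; 3 2; 3 2];
vals = [1 2 3 1 4 5]';

% u: unique key combinations, sorted by row
% v: for each input row, the index of its key combination in u
[u, ~, v] = unique(keys, 'rows');
disp(u)    % rows [1 1], [1 2], [3 2], [3 3]
disp(v')   % 1 1 2 4 3 3

% Collect the values of each group into one cell per group
c = accumarray(v, vals, [], @(x) {x});
disp(cellfun(@numel, c)')   % 2 1 2 1

% Optional variation: sort the subscripts first so that each cell keeps
% the original row order of its values
[vSorted, order] = sort(v);
cOrdered = accumarray(vSorted, vals(order), [], @(x) {x});
```

With unsorted subscripts the values inside a cell may come out in a different order than in the original table, which is why the first group can appear as 2, 1 rather than 1, 2.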

>> compressedTable
compressedTable = 
    sampleId    sampleId2       entity   
    ________    _________    ____________
    1           1            [2x1 double]
    1           2            [         3]
    3           2            [2x1 double]
    3           3            [         1]

>> compressedTable.entity{1}
ans =
     2
     1
>> compressedTable.entity{2}
ans =
     3
>> compressedTable.entity{3}
ans =
     4
     5
>> compressedTable.entity{4}
ans =
     1

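n grouping variables, m data variables

In this case you may need to loop over the data columns, i.e. all columns other than the first n. In the following I use arrayfun to do the loop.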

sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]'; %'
entity2   = entity*2;
dataTable = table(sampleId, sampleId2, entity, entity2); %// example data
n = 2; %// number of grouping variables

[u, ~, v] = unique(dataTable{:,1:n}, 'rows');
c = arrayfun(@(k) accumarray(v, dataTable{:,k}, [], @(x) {x}), n+1:size(dataTable,2), ...
    'uniformoutput', 0); %// cell array of cell arrays of vectors; k is the data column index
ut = mat2cell(u, size(u,1), ones(1,n)); %// convert to cell array
compressedTable = [table(ut{:}, 'VariableNames', dataTable.Properties.VariableNames(1:n)) ...
    cell2table([c{:}], 'VariableNames', dataTable.Properties.VariableNames(n+1:end))];
    %// create output table with correct variable names

The result is

compressedTable = 
    sampleId    sampleId2       entity         entity2   
    ________    _________    ____________    ____________
    1           1            [2x1 double]    [2x1 double]
    1           2            [         3]    [         6]
    3           2            [2x1 double]    [2x1 double]
    3           3            [         1]    [         2]

>> compressedTable.entity{1}
ans =
     2
     1
>> compressedTable.entity2{1}
ans =
     4
     2
>> compressedTable.entity{2}
ans =
     3
>> compressedTable.entity2{2}
ans =
     6
>> compressedTable.entity{3}
ans =
     4
     5
>> compressedTable.entity2{3}
ans =
     8
    10
>> compressedTable.entity{4}
ans =
     1
>> compressedTable.entity2{4}
ans =
     2
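The arrayfun call above is a loop in disguise. As a rough sketch, the same cell array could be built with an explicit for loop, which some readers may find easier to follow. The names numData and k are introduced here for illustration only.

```matlab
% Example data: two grouping columns followed by two data columns
sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]';
entity2   = entity*2;
dataTable = table(sampleId, sampleId2, entity, entity2);
n = 2;   % number of grouping variables

% v maps every row to the index of its unique key combination
[u, ~, v] = unique(dataTable{:,1:n}, 'rows');

% One column of cells per data variable, one row per key combination
numData = size(dataTable,2) - n;
c = cell(size(u,1), numData);
for k = 1:numData
    c(:,k) = accumarray(v, dataTable{:,n+k}, [], @(x) {x});
end
```

Here c is already a single cell array, so it can be passed to cell2table directly instead of [c{:}].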

Answer 2

I found another approach, using varfun:

compressedTable = varfun(@(x){x}, dataTable, 'GroupingVariables', 'sampleId');
compressedTable.GroupCount = [];
compressedTable.Properties.VariableNames = dataTable.Properties.VariableNames;
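The snippet above renames the columns by reusing the original variable names, which relies on sampleId being the first column. A possible extension to several grouping variables, as a sketch under the assumption that varfun places the grouping variables first in its output, could look like this. The names groupVars and dataVars are illustrative.

```matlab
% Example data with two grouping columns and two data columns
sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]';
entity2   = entity*2;
dataTable = table(sampleId, sampleId2, entity, entity2);

groupVars = {'sampleId', 'sampleId2'};   % assumed choice of key columns

% Wrap each group's values in a cell; varfun applies the function
% to every variable that is not a grouping variable
compressedTable = varfun(@(x) {x}, dataTable, 'GroupingVariables', groupVars);
compressedTable.GroupCount = [];         % drop the count column varfun adds

% Restore the original names: grouping variables first, then data variables
dataVars = setdiff(dataTable.Properties.VariableNames, groupVars, 'stable');
compressedTable.Properties.VariableNames = [groupVars, dataVars];
```

Building the name list from groupVars and dataVars keeps the renaming correct even if the grouping columns are not the leading columns of dataTable.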
