Question

I have a large data frame that looks like this:

> my_table
   track_fid start_gid end_gid
1          1       100      82
2          2        82     100
3          3       100      82
4          4       100      32
5          5        82     100
6          6        82     100
7          7        82     100
8          8       100      82
9          9        34     100
10        10        31     100

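My goal is to add a column to_from at the end and fill it with the character y or n.

Take the first row as an example: start_gid = 100 and end_gid = 82. If another row exists anywhere in the table with the inverse values, i.e. start_gid = 82 and end_gid = 100, I want to fill the to_from column with y for both rows. If no inverse exists, the first row should be filled with n. The key is to walk through the rows in track_fid order and search the table for each row's inverse. If an inverse with a larger track_fid is found, y should be inserted. Once an inverse has received a y, it cannot be used again.

For example, this would be a sample output: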

> output
   track_fid start_gid end_gid to_from
1          1       100      82       y
2          2        82     100       y
3          3       100      82       y
4          4       100      32       n
5          5        82     100       y
6          6        82     100       y
7          7        82     100       n
8          8       100      82       y
9          9        34     100       n
10        10        31     100       n

Is there a way to create such an output in R?

Something like:

for (i in 2:nrow(my_table)) {
  # only compares each row with the row directly above it
  if (my_table[i - 1, "start_gid"] == my_table[i, "end_gid"]) {
    my_table[i, "to_from"] <- "y" } else { my_table[i, "to_from"] <- "n" }
}


> str(output)
'data.frame':   10 obs. of  4 variables:
 $ track_fid: int  1 2 3 4 5 6 7 8 9 10
 $ start_gid: int  100 82 100 100 82 82 82 100 34 31
 $ end_gid  : int  82 100 82 32 100 100 100 82 100 100
 $ to_from  : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 1 2 1 1

Answer 1

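I don't see a way to do this in R without a loop. You could do it with a for loop together with next and break statements. In this case, though, if the problem size is large, I would switch to Rcpp.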

library(Rcpp)
sourceCpp(code = "
          #include <Rcpp.h>
          // [[Rcpp::export]]
          Rcpp::LogicalVector myfun(const Rcpp::IntegerVector x, const Rcpp::IntegerVector y) {
            Rcpp::LogicalVector res(x.length());
            for (int i=0; i<(x.length()-1); i++) {
              if(res(i)) continue;
              for (int j=i+1; j<x.length(); j++) {
                if (res(j)) continue;
                if (x(i) == y(j) && x(j) == y(i)) {
                   res(i) = true;
                   res(j) = true;
                   break;
                }
              }
            }
            return res;
          }
          ")

DF$from_to <- myfun(DF$start_gid, DF$end_gid)
#   track_fid start_gid end_gid from_to
#1          1       100      82    TRUE
#2          2        82     100    TRUE
#3          3       100      82    TRUE
#4          4       100      32   FALSE
#5          5        82     100    TRUE
#6          6        82     100    TRUE
#7          7        82     100   FALSE
#8          8       100      82    TRUE
#9          9        34     100   FALSE
#10        10        31     100   FALSE
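
For comparison, the plain R for loop with next and break mentioned above could look roughly like the following. This is only a sketch that mirrors the logic of the Rcpp function; the helper name flag_reverse_pairs is made up for illustration, and the final step that converts the logical flags into the y/n factor requested in the question is an addition, not part of the Rcpp code.

```r
# Sample data from the question
my_table <- data.frame(
  track_fid = 1:10,
  start_gid = c(100L, 82L, 100L, 100L, 82L, 82L, 82L, 100L, 34L, 31L),
  end_gid   = c(82L, 100L, 82L, 32L, 100L, 100L, 100L, 82L, 100L, 100L)
)

# Hypothetical helper: flags every row that can be paired with a reversed row
flag_reverse_pairs <- function(start, end) {
  n <- length(start)
  used <- logical(n)                       # TRUE once a row has been paired
  for (i in seq_len(max(n - 1L, 0L))) {
    if (used[i]) next                      # already paired with an earlier row
    for (j in (i + 1L):n) {
      if (used[j]) next                    # this partner is already taken
      if (start[i] == end[j] && start[j] == end[i]) {
        used[i] <- TRUE
        used[j] <- TRUE
        break                              # stop at the first free partner
      }
    }
  }
  used
}

flags <- flag_reverse_pairs(my_table$start_gid, my_table$end_gid)
# Convert TRUE/FALSE into the y/n factor asked for in the question
my_table$to_from <- factor(ifelse(flags, "y", "n"), levels = c("n", "y"))
my_table
```

The nested loop is quadratic in the worst case, which is why the compiled version is preferable for large tables.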

Answer 2

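Thanks to the detailed description of the algorithm, we can also build a different solution without loops, using data.table.

First, we count the unique combinations of start_gid and end_gid: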

library(data.table)
dt <- as.data.table(my_table)  # assumption: dt holds the question's data
pairs <- dt[, .N, by = .(start_gid, end_gid)]
pairs

#   start_gid end_gid N
#1:       100      82 3
#2:        82     100 4
#3:       100      32 1
#4:        34     100 1
#5:        31     100 1

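It is clear that the first three occurrences of (100, 82) and of (82, 100) will each have a reverse partner, while the fourth occurrence of (82, 100) will not. In addition, the occurrences of (100, 32), (34, 100) and (31, 100) have no reverse partner.

We now determine nmatch, the maximum number of possible pairings within each group. All occurrences of (100, 82) and (82, 100) belong to the same group 82_100. If a group contains only one member, there is no pairing partner, so nmatch is 0.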

pairs <- pairs[, .(start_gid, end_gid, nmatch = if (.N <= 1L) 0L else min(N)), 
      by = .(grp = paste(pmin(start_gid, end_gid), pmax(start_gid, end_gid), sep = "_"))]
pairs

#      grp start_gid end_gid nmatch
#1: 82_100       100      82      3
#2: 82_100        82     100      3
#3: 32_100       100      32      0
#4: 34_100        34     100      0
#5: 31_100        31     100      0
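
The group key relies on pmin() and pmax() to put the smaller id first regardless of direction, so a pair and its reverse get the same label. A small base R illustration on a few made-up rows (the vectors below are a subset chosen for the example):

```r
# A few (start, end) combinations, including one pair and its reverse
start_gid <- c(100L, 82L, 100L, 34L)
end_gid   <- c(82L, 100L, 32L, 100L)

# Smaller id first, larger id second: direction no longer matters
grp <- paste(pmin(start_gid, end_gid), pmax(start_gid, end_gid), sep = "_")
grp
# [1] "82_100" "82_100" "32_100" "34_100"
```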

We now join the two tables. It is a right join, so that all rows of dt appear in the output:

out <- pairs[dt, on = .(start_gid, end_gid)]
out
#       grp start_gid end_gid nmatch track_fid
# 1: 82_100       100      82      3         1
# 2: 82_100        82     100      3         2
# 3: 82_100       100      82      3         3
# 4: 32_100       100      32      0         4
# 5: 82_100        82     100      3         5
# 6: 82_100        82     100      3         6
# 7: 82_100        82     100      3         7
# 8: 82_100       100      82      3         8
# 9: 34_100        34     100      0         9
#10: 31_100        31     100      0        10

In the last step, the first nmatch occurrences in each group are flagged and the result is ordered by track_fid:

out <- out[, .(track_fid, to_from = seq_len(.N) <= nmatch), by = .(start_gid, end_gid)]
out[order(track_fid)]
#    start_gid end_gid track_fid to_from
# 1:       100      82         1    TRUE
# 2:        82     100         2    TRUE
# 3:       100      82         3    TRUE
# 4:       100      32         4   FALSE
# 5:        82     100         5    TRUE
# 6:        82     100         6    TRUE
# 7:        82     100         7   FALSE
# 8:       100      82         8    TRUE
# 9:        34     100         9   FALSE
#10:        31     100        10   FALSE
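
The same counting idea can also be tried without data.table. The following base R sketch is not part of the original answer; the variable names are chosen for illustration, and it assumes that start_gid and end_gid differ in every row (a row that is its own reverse would need separate handling).

```r
# Sample data from the question
my_table <- data.frame(
  track_fid = 1:10,
  start_gid = c(100L, 82L, 100L, 100L, 82L, 82L, 82L, 100L, 34L, 31L),
  end_gid   = c(82L, 100L, 82L, 32L, 100L, 100L, 100L, 82L, 100L, 100L)
)

# Directed key of each row and of its reverse
fwd <- paste(my_table$start_gid, my_table$end_gid, sep = "_")
rev_key <- paste(my_table$end_gid, my_table$start_gid, sep = "_")

# How often each directed combination occurs
counts <- table(fwd)
n_same <- as.vector(counts)[match(fwd, names(counts))]
n_rev  <- as.vector(counts)[match(rev_key, names(counts))]
n_rev[is.na(n_rev)] <- 0L                  # no reversed row exists at all

# At most min(n_same, n_rev) rows per direction can find a partner
nmatch <- pmin(n_same, n_rev)

# Running occurrence number of each directed combination, in row order
occ <- ave(seq_along(fwd), fwd, FUN = seq_along)

my_table$to_from <- occ <= nmatch
my_table
```

Because the rows are already ordered by track_fid, flagging the first nmatch occurrences per direction reproduces the TRUE/FALSE column shown above for the sample data.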

Benchmark 1: original data (10 rows)

Comparing the data.table solution with Roland's Rcpp solution, using the original dataset of only 10 rows:

Code

library(microbenchmark)
microbenchmark(
  dt = {
    dt[, .N, by = .(start_gid, end_gid)][
      , .(start_gid, end_gid, nmatch = if (.N <= 1L) 0L else min(N)), 
      by = .(grp = paste(pmin(start_gid, end_gid), pmax(start_gid, end_gid), sep = "_"))][
        dt, on = .(start_gid, end_gid)][
          , .(track_fid, to_from = seq_len(.N) <= nmatch), 
          by = .(start_gid, end_gid)][
            order(track_fid)]
  },
  rcpp_source = {
    sourceCpp(code = "
          #include <Rcpp.h>
          // [[Rcpp::export]]
          Rcpp::LogicalVector myfun(const Rcpp::IntegerVector x, const Rcpp::IntegerVector y) {
            Rcpp::LogicalVector res(x.length());
            for (int i=0; i<(x.length()-1); i++) {
              if(res(i)) continue;
              for (int j=i+1; j<x.length(); j++) {
                if (res(j)) continue;
                if (x(i) == y(j) && x(j) == y(i)) {
                   res(i) = true;
                   res(j) = true;
                   break;
                }
              }
            }
            return res;
          }
          ")
    dt$from_to <- myfun(dt$start_gid, dt$end_gid)
    dt
  },
  rcpp_func = {
    dt$from_to <- myfun(dt$start_gid, dt$end_gid)
    dt
  }
)

Results

Unit: microseconds
        expr      min       lq      mean    median       uq       max neval
          dt 2873.017 3233.418 3466.5484 3408.0495 3558.705  6345.633   100
 rcpp_source 8112.335 8537.114 8932.8953 8811.2385 9173.150 12093.931   100
   rcpp_func  101.192  121.582  142.0769  137.4405  154.620   255.246   100

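As expected, the Rcpp function on its own is more than 20 times faster than the data.table solution (for the given toy-sized sample data). However, if the call to sourceCpp is included, it takes more than twice as long as the data.table solution.

Note that the data.table code has been condensed by chaining the data.table queries.

Benchmark 2: larger datasets

As suggested by @Roland, I compared the data.table solution and Rcpp on larger datasets:

For data frames with fewer than 1000 rows, Rcpp is faster than the data.table solution. For larger data frames, the data.table solution scales better than the Rcpp solution. Note that this is a characteristic of the implemented algorithm and cannot necessarily be attributed to Rcpp in general.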
