Merging two data frames on one exact match and one partial URL match

Time: 2013-08-01 16:11:48

Tags: r

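I have been trying to do a somewhat tedious merge on one exact match and one partial match (on very large data). I have tried several approaches (using pmatch, str_detect, grep and sapply) and got close, but I am looking for an elegant solution. Any help or insight would be appreciated.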

Another, longer route I found is to do a regular merge on the common field (SessionId) and then write a for loop like this:

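# After a regular merge on SessionId, check each row for a partial URL match
for (i in seq_len(nrow(my.test.daa))) {
  # pmatch() gives 1 when Link_URL[i] is a leading substring of Referer[i], otherwise NA
  my.test.daa$Part_match[i] <- pmatch(my.test.daa$Link_URL[i], my.test.daa$Referer[i])
  # ... use index i to also get the other columns from the dataset frame
}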

NEW data - contains duplicates

pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
                                  "5b8cc8794a02ba868db21faef2",
                                  "5b8cc8794a02ba868db21faef3",
                                  "5b8cc8794a02ba868db21faef4",
                                  "5b8cc8794a02ba868db21faef5",
                                  "5b8cc8794a02ba868db21faef1")), 
                  URL = I(c("somewebsite.com/abc/detail/110302288511/",
                            "somewebsite.com/abc/detail/110302288512/",
                            "somewebsite.com/abc/detail/110302288513/",
                            "somewebsite.com/abc/detail/110302288514/",
                            "somewebsite.com/abc/detail/110302288511/",
                            "somewebsite.com/abc/detail/110302288512/"
                  )))


dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
                                  "5b8cc8794a02ba868db21faef3",
                                  "5b8cc8794a02ba868db21faef5",
                                  "5b8cc8794a02ba868db21faef7",
                                  "5b8cc8794a02ba868db21faef1"
                     )), 
                  Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
                                "somewebsite.com/abc/detail/110302288513/1103022815/",
                                "somewebsite.com/abc/detail/110302288513/11030228/",
                                "somewebsite.com/abc/detail/110302288465464/",
                                "somewebsite.com/abc/detail/110302288512/46545465/"
                  )))

OLD - here is sample code for the data.frames:

pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
                                  "5b8cc8794a02ba868db21faef2",
                                  "5b8cc8794a02ba868db21faef3",
                                  "5b8cc8794a02ba868db21faef4",
                                  "5b8cc8794a02ba868db21faef5",
                                  "5b8cc8794a02ba868db21faef6")), 
                  URL = I(c("somewebsite.com/abc/detail/110302288511/",
                          "somewebsite.com/abc/detail/110302288512/",
                          "somewebsite.com/abc/detail/110302288513/",
                          "somewebsite.com/abc/detail/110302288514/",
                          "somewebsite.com/abc/detail/110302288511/",
                          "somewebsite.com/abc/detail/110302288512/"
                  )))


dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
                                "5b8cc8794a02ba868db21faef3",
                                "5b8cc8794a02ba868db21faef5",
                                "5b8cc8794a02ba868db21faef7",
                                "5b8cc8794a02ba868db21faef2"
                              )), 
              Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
                          "somewebsite.com/abc/detail/110302288513/1103022815/",
                          "somewebsite.com/abc/detail/110302288513/11030228/",
                          "somewebsite.com/abc/detail/110302288465464/",
                          "somewebsite.com/abc/detail/1103022846546/"
                  )))

NEW output - with duplicates

SessionId                   URL                                       Referer
5b8cc8794a02ba868db21faef1  somewebsite.com/abc/detail/110302288511/  somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3  somewebsite.com/abc/detail/110302288513/  somewebsite.com/abc/detail/110302288513/1103022815/
5b8cc8794a02ba868db21faef1  somewebsite.com/abc/detail/110302288512/  somewebsite.com/abc/detail/110302288512/46545465/

So the OLD output needs to look like this:

SessionId                   URL                                       Referer
5b8cc8794a02ba868db21faef1  somewebsite.com/abc/detail/110302288511/  somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3  somewebsite.com/abc/detail/110302288513/  somewebsite.com/abc/detail/110302288513/1103022815/

2 Answers:

Answer 0: (score: 1)

You can put the data in long format and then process it by ID within a data.table.

library(reshape2)
dat <- do.call(rbind,lapply(list(pattern,dataset),function(x)
                             melt(x,id.vars='SessionId')))
library(data.table)
DT <- data.table(dat,key='SessionId')

DT[,if(.N ==2)
       if(length(grep(value[1],value[2]))>0) as.list(value)
   ,by='SessionId']

                    SessionId                                       V1                                                    V2
1: 5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
2: 5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/   somewebsite.com/abc/detail/110302288513/1103022815/
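
One caveat about the grep() call above: by default grep() treats its first argument as a regular expression, so the dots in a URL match any character. The sketch below is only an illustration; the second referer value is made up for the demonstration and is not part of the OP's data. Passing fixed = TRUE makes the comparison literal.

```r
url <- "somewebsite.com/abc/detail/110302288511/"

# Hypothetical referer (not from the question): an "X" sits where the dot was
referer <- "somewebsiteXcom/abc/detail/110302288511/110302288512/"

# Default: url is a regular expression, so "." matches the "X"
regex_hit <- length(grep(url, referer)) > 0

# fixed = TRUE: url is compared as a literal string, so the "X" no longer matches
fixed_hit <- length(grep(url, referer, fixed = TRUE)) > 0

print(c(regex = regex_hit, fixed = fixed_hit))
```

For URLs that only differ in their numeric IDs the two variants agree, so this probably matters only if the real data contains near-identical hosts or paths.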

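EDIT: I benchmarked the 2 solutions using the OP's data (too lazy to create a large sample dataset). eddi's solution is 3 times faster. This is expected: my solution is slower because of the extra step of reshaping the data with reshape2, which is a bit slow.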

microbenchmark(eddi(),agstudy(),times=100)
Unit: milliseconds
      expr      min        lq    median        uq        max neval
    eddi() 3.232808  3.427557  3.553092  3.768891   8.665698   100
 agstudy() 9.998795 10.615281 11.208633 12.438759 129.517833   100

Here is the code used for the benchmark:

library(inline)
library(Rcpp)
library(reshape2)

eddi <- function(){
  library(data.table)
  pattern = data.table(pattern, key = 'SessionId')
  dataset = data.table(dataset, key = 'SessionId')
  dataset[pattern, nomatch = 0][string_compare(URL, Referer) == 1]
}

agstudy <- function(){
  dat <- do.call(rbind,lapply(list(pattern,dataset),function(x)
    melt(x,id.vars='SessionId')))
  library(data.table)
  DT <- data.table(dat,key='SessionId')

  DT[,if(.N ==2)
    if(length(grep(value[1],value[2]))>0) as.list(value)
     ,by='SessionId']

}

library('microbenchmark')
microbenchmark(eddi(),agstudy(),times=100)

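EDIT2: To manage the case with duplicates, it is better to use the wide format. Inspired by @eddi's function, here is my version, which does not need an Rcpp function.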

  pattern = data.table(pattern, key = 'SessionId')
  dataset = data.table(dataset, key = 'SessionId')
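  # grepl() gives one TRUE/FALSE per pair; grep() returns integer(0) on a miss, which breaks a "== 1" test
  dataset[pattern, nomatch = 0][mapply(grepl, URL, Referer, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)]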

PS: I benchmarked this against eddi's function, and the latter is still slightly faster.

microbenchmark(eddi(),agstudy(),times=100)
Unit: milliseconds
      expr      min       lq   median       uq      max neval
    eddi() 3.684126 3.819901 4.007634 4.395048 8.490101   100
 agstudy() 4.057697 4.250171 4.595298 4.835747 8.581503   100
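
To see why the wide format copes with duplicates, here is a minimal base R sketch. It assumes small made-up data frames (pattern_df and dataset_df, with shortened hypothetical IDs and URLs) and uses merge() instead of a data.table join. Every URL/Referer pair that shares a SessionId gets its own row, so each pair can be tested independently.

```r
# Hypothetical stand-in data: session "s1" appears twice on both sides
pattern_df <- data.frame(
  SessionId = c("s1", "s2", "s1"),
  URL       = c("site.com/d/11/", "site.com/d/12/", "site.com/d/13/"),
  stringsAsFactors = FALSE
)
dataset_df <- data.frame(
  SessionId = c("s1", "s1", "s2"),
  Referer   = c("site.com/d/11/99/", "site.com/d/13/42/", "site.com/d/77/"),
  stringsAsFactors = FALSE
)

# Exact match on SessionId: all URL/Referer combinations per session (wide format)
wide <- merge(pattern_df, dataset_df, by = "SessionId")

# Partial match: keep rows whose Referer contains the URL as a literal substring
keep <- mapply(grepl, wide$URL, wide$Referer,
               MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
matched <- wide[keep, ]
print(matched)
```

The price is the intermediate table: a session with m URLs and n referers produces m * n rows before filtering, which may matter on very large data.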

Answer 1: (score: 1)

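I don't think the required string-vector comparison function exists in R, but you can write your own. Note that various checks should be added to the code below, which I don't do (e.g. checking that the two vectors have the same length), especially if you want to use the string_compare function outside of this problem: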

library(inline)
library(Rcpp)

string_compare = cxxfunction(signature(x = 'character', y = 'character'), '
  CharacterVector a(x), b(y);
  NumericVector res(a.size(), 1.0);

  for (int i = 0, size = a.size(); i < size; ++i) {
    int alen = a[i].size();
    int blen = b[i].size();
    if (alen > blen) {
      res[i] = 0;
      continue;
    }
    for (int j = 0; j < alen; ++j) {
      if (a[i][j] != b[i][j]) {
        res[i] = 0;
        break;
      }
    }    
  }

  return res;
', plugin = 'Rcpp')

library(data.table)
pattern = data.table(pattern, key = 'SessionId')
dataset = data.table(dataset, key = 'SessionId')

dataset[pattern, nomatch = 0][string_compare(URL, Referer) == 1]
#                    SessionId                                               Referer                                      URL
#1: 5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/110302288512/ somewebsite.com/abc/detail/110302288511/
#2: 5b8cc8794a02ba868db21faef3   somewebsite.com/abc/detail/110302288513/1103022815/ somewebsite.com/abc/detail/110302288513/
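
If compiling C++ is not an option, a similar element-wise prefix test can be sketched in plain R. This assumes R 3.3.0 or later, where base R provides startsWith(); the wrapper name string_compare_r is made up here, and it adds the length check mentioned above. It returns a logical vector rather than 0/1, so it can be used directly as a row filter without "== 1".

```r
# Hypothetical pure-R counterpart of string_compare (assumes R >= 3.3.0)
string_compare_r <- function(prefix, full) {
  # Basic input checks: both character, same length
  stopifnot(is.character(prefix), is.character(full),
            length(prefix) == length(full))
  # TRUE where full[i] begins with prefix[i]
  startsWith(full, prefix)
}

urls     <- c("somewebsite.com/abc/detail/110302288511/",
              "somewebsite.com/abc/detail/110302288511/")
referers <- c("somewebsite.com/abc/detail/110302288511/110302288512/",
              "somewebsite.com/abc/detail/110302288513/11030228/")

print(string_compare_r(urls, referers))
```

I have not benchmarked this against the Rcpp version, so treat it as a convenience rather than a faster replacement.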