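I have been trying to do a somewhat tedious merge on one exact match and one partial match (on very large data). I have tried several approaches (using pmatch, str_detect, grep and sapply) and got some close results, but I am looking for an elegant solution. Any help or insight would be much appreciated.
The other, longer route I found is to do a regular merge on the common field (SessionId) and then write a for loop like the following: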
for (i in seq_len(nrow(my.test.daa))) {
  # 1 if Link_URL partially matches Referer in the same row, NA otherwise
  my.test.daa$Part_match[i] <- pmatch(my.test.daa$Link_URL[i], my.test.daa$Referer[i])
  # ... use index i to also pull the other columns from the dataset data frame
}
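For illustration, the merge-then-loop idea above can be sketched without explicit indexing by applying pmatch pairwise with mapply. This is only a minimal sketch: the `links` and `refs` data frames and their rows are hypothetical, and only the column names `Link_URL`, `Referer` and `Part_match` come from the loop.

```r
# Hypothetical sample data; column names follow the loop above.
links <- data.frame(SessionId = c("s1", "s2"),
                    Link_URL  = c("site.com/a/1/", "site.com/a/2/"),
                    stringsAsFactors = FALSE)
refs  <- data.frame(SessionId = c("s1", "s2"),
                    Referer   = c("site.com/a/1/9/", "site.com/a/3/9/"),
                    stringsAsFactors = FALSE)

# Step 1: exact match on the shared key.
merged <- merge(links, refs, by = "SessionId")

# Step 2: pairwise partial match. pmatch() gives 1 when Link_URL is a
# leading part of the Referer in the same row, and NA otherwise.
merged$Part_match <- mapply(pmatch, merged$Link_URL, merged$Referer,
                            USE.NAMES = FALSE)

# Keep only the rows where the partial match succeeded.
result <- merged[!is.na(merged$Part_match), ]
```

mapply still calls pmatch once per row, so this mainly tidies the code; it is not expected to be much faster than the loop on very large data.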
NEW data - contains duplicates
pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef2",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef4",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef1")),
URL = I(c("somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/",
"somewebsite.com/abc/detail/110302288513/",
"somewebsite.com/abc/detail/110302288514/",
"somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/"
)))
dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef7",
"5b8cc8794a02ba868db21faef1"
)),
Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
"somewebsite.com/abc/detail/110302288513/1103022815/",
"somewebsite.com/abc/detail/110302288513/11030228/",
"somewebsite.com/abc/detail/110302288465464/",
"somewebsite.com/abc/detail/110302288512/46545465/"
)))
OLD - here is sample code for the data.frames:
pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef2",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef4",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef6")),
URL = I(c("somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/",
"somewebsite.com/abc/detail/110302288513/",
"somewebsite.com/abc/detail/110302288514/",
"somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/"
)))
dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef7",
"5b8cc8794a02ba868db21faef2"
)),
Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
"somewebsite.com/abc/detail/110302288513/1103022815/",
"somewebsite.com/abc/detail/110302288513/11030228/",
"somewebsite.com/abc/detail/110302288465464/",
"somewebsite.com/abc/detail/1103022846546/"
)))
NEW output - with duplicates
SessionId URL Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288512/ somewebsite.com/abc/detail/110302288512/46545465/
So the OLD output needs to look like this:
SessionId URL Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/
Answer 0 (score: 1)
You can put the data in long format and then process it by ID within data.table.
library(reshape2)
dat <- do.call(rbind,lapply(list(pattern,dataset),function(x)
melt(x,id.vars='SessionId')))
library(data.table)
DT <- data.table(dat,key='SessionId')
DT[,if(.N ==2)
if(length(grep(value[1],value[2]))>0) as.list(value)
,by='SessionId']
SessionId V1 V2
1: 5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
2: 5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/
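EDIT: benchmarking the 2 solutions using the OP's data (I was too lazy to create a large sample data set). eddi's solution is 3 times faster. The result is expected: my solution is slower because of the extra step of reshaping the data with reshape2, which is somewhat slow.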
microbenchmark(eddi(),agstudy(),times=100)
Unit: milliseconds
expr min lq median uq max neval
eddi() 3.232808 3.427557 3.553092 3.768891 8.665698 100
agstudy() 9.998795 10.615281 11.208633 12.438759 129.517833 100
Here is the code used for the benchmark:
library(inline)
library(Rcpp)
library(reshape2)
eddi <- function(){
library(data.table)
pattern = data.table(pattern, key = 'SessionId')
dataset = data.table(dataset, key = 'SessionId')
dataset[pattern, nomatch = 0][string_compare(URL, Referer) == 1]
}
agstudy <- function(){
dat <- do.call(rbind,lapply(list(pattern,dataset),function(x)
melt(x,id.vars='SessionId')))
library(data.table)
DT <- data.table(dat,key='SessionId')
DT[,if(.N ==2)
if(length(grep(value[1],value[2]))>0) as.list(value)
,by='SessionId']
}
library('microbenchmark')
microbenchmark(eddi(),agstudy(),times=100)
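EDIT2: to manage the case with duplicates, it is better to use the wide format. Inspired by @eddi's function, here is my version, which does not require creating an Rcpp function.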
pattern = data.table(pattern, key = 'SessionId')
dataset = data.table(dataset, key = 'SessionId')
dataset[pattern, nomatch = 0][mapply(grep,URL,Referer)==1]
PS: I benchmarked this against eddi's function, and the latter is still slightly faster.
microbenchmark(eddi(),agstudy(),times=100)
Unit: milliseconds
expr min lq median uq max neval
eddi() 3.684126 3.819901 4.007634 4.395048 8.490101 100
agstudy() 4.057697 4.250171 4.595298 4.835747 8.581503 100
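One caveat with comparing the result of grep to 1: grep returns integer(0) when nothing matches, so mapply cannot simplify its result to a vector as soon as one row fails to match. A hedged variant, assuming literal (non-regex) matching is what is wanted, is to use grepl with fixed = TRUE, which always yields one logical value per row. The `url` and `referer` vectors below are stand-ins built from the question's sample values.

```r
# Stand-in vectors for the URL and Referer columns after the join.
url     <- c("somewebsite.com/abc/detail/110302288511/",
             "somewebsite.com/abc/detail/110302288511/")
referer <- c("somewebsite.com/abc/detail/110302288511/110302288512/",
             "somewebsite.com/abc/detail/110302288512/46545465/")

# grep() returns integer(0) for the second pair, so the result stays a list.
hits_grep <- mapply(grep, url, referer)

# grepl() returns exactly one TRUE/FALSE per pair; fixed = TRUE treats the
# URL as a literal string, so "." and "/" are not read as regex syntax.
keep <- mapply(grepl, url, referer,
               MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
```

Here `keep` can be used directly as a row filter. Note that grepl tests whether the URL occurs anywhere in the Referer, not only at its start.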
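Answer 1 (score: 1)
I don't think the required vectorized string comparison function exists in R, but you can write your own. Note that various checks should be added to the code below, which I am not doing here (e.g. checking that the two vectors have the same length), especially if you want to use the string_compare function outside of this problem: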
library(inline)
library(Rcpp)
string_compare = cxxfunction(signature(x = 'character', y = 'character'), '
CharacterVector a(x), b(y);
NumericVector res(a.size(), 1.0);
for (int i = 0, size = a.size(); i < size; ++i) {
int alen = a[i].size();
int blen = b[i].size();
if (alen > blen) {
res[i] = 0;
continue;
}
for (int j = 0; j < alen; ++j) {
if (a[i][j] != b[i][j]) {
res[i] = 0;
break;
}
}
}
return res;
', plugin = 'Rcpp')
library(data.table)
pattern = data.table(pattern, key = 'SessionId')
dataset = data.table(dataset, key = 'SessionId')
dataset[pattern, nomatch = 0][string_compare(URL, Referer) == 1]
# SessionId Referer URL
#1: 5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/110302288512/ somewebsite.com/abc/detail/110302288511/
#2: 5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/1103022815/ somewebsite.com/abc/detail/110302288513/
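As a possible alternative to compiling C++, base R has offered a vectorized prefix test, startsWith, since R 3.3.0. The sketch below wraps it with the length and type checks mentioned above; `is_prefix` is a hypothetical helper name, and the sample vectors are taken from the question's data rather than from the actual join.

```r
# Hypothetical helper: TRUE where url[i] is a literal prefix of referer[i].
# Assumes R >= 3.3.0, where startsWith() is available in base R.
is_prefix <- function(url, referer) {
  stopifnot(is.character(url), is.character(referer),
            length(url) == length(referer))
  startsWith(referer, url)
}

url     <- c("somewebsite.com/abc/detail/110302288511/",
             "somewebsite.com/abc/detail/110302288513/",
             "somewebsite.com/abc/detail/110302288511/")
referer <- c("somewebsite.com/abc/detail/110302288511/110302288512/",
             "somewebsite.com/abc/detail/110302288513/1103022815/",
             "somewebsite.com/abc/detail/110302288513/11030228/")

keep <- is_prefix(url, referer)
```

With the keyed join from above, the same test could presumably be used as the row filter in place of `string_compare(URL, Referer) == 1`; its speed relative to the Rcpp version has not been measured here.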