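My task is to search text and replace people's names and nicknames with a generic string.
Here is the structure of my data frame of names and their corresponding nicknames: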
names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)
Here is the structure of the data frame containing the text:
text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)
This gives the following data frames:
df_name_nick:
    names nicknames
1  Thomas       Tom
2  Thomas     Tommy
3 Abigail       Abi
4 Abigail      Abby
5 Abigail     Abbey
df_name_comment:
  text_names text_comment
1 Abigail    Tommy sits next to Abbey
2 Thomas     As a footballer Tommy is very good
3 Abigail    Abby is a mature young lady
4 Thomas     Tom is a handsome man
5 Colin      Tom is friends with Colin and Abi
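I am looking for a routine that will go through each row of df_name_comment, use df_name_comment$text_names to look up the corresponding nicknames in df_name_nick, and replace them with XXX. Note that each name can have several nicknames. Also note that in each text comment only the nicknames belonging to that row's name are replaced, so that we get this as output: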
Abigail "Tommy sits next to XXX"
Thomas "As a footballer XXX is very good"
Abigail "XXX is a mature young lady"
Thomas "XXX is a handsome man"
Colin "Tom is friends with Colin and Abi"
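I think this calls for a clever combination of gsub, match and the apply family of functions (mapply, sapply, etc.).
I have searched Stack Overflow for something similar to this request and could only find very specific regex solutions based on data frames with unique row elements, rather than the generic text lookup and gsub over multiple nicknames that I have in mind.
Can anyone help me out of my predicament? Thanks
Nevil (novice R programmer since January 2017)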
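Answer 0: (score: 2)
Here is an idea via base R. We basically paste together the nicknames for each name, collapsed by |, so that the result can be passed as a regex to gsub, which replaces the matched words of each comment with XXX. We use mapply to do that, after merging our aggregated nicknames with df_name_comment.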
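# One row per name, with its nicknames collapsed into a single regex
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
# Attach each comment's pattern; names without nicknames get NA
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
# Replace every match of the row's pattern in the row's comment
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2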
which gives,
  text_names text_comment
1 Abigail    Tommy sits next to XXX
2 Abigail    XXX is a mature young lady
3 Colin      Tom is friends with Colin and Abi
4 Thomas     As a footballer XXX is very good
5 Thomas     XXX is a handsome man
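To make the intermediate step concrete, here is a small sketch that rebuilds the nickname table from the question and inspects the collapsed patterns that aggregate produces. The stringsAsFactors = FALSE argument is an assumption added here so the columns are plain character vectors regardless of R version.

```r
# Nickname table from the question (stringsAsFactors = FALSE is an assumption
# added here, not part of the original data frame call)
names <- c("Thomas", "Thomas", "Abigail", "Abigail", "Abigail")
nicknames <- c("Tom", "Tommy", "Abi", "Abby", "Abbey")
df_name_nick <- data.frame(names, nicknames, stringsAsFactors = FALSE)

# One row per name; all nicknames joined with the regex alternation operator
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = "|")

d1$names      # "Abigail" "Thomas"
d1$nicknames  # "Abi|Abby|Abbey" "Tom|Tommy"
```

Each element of d1$nicknames is a single pattern that matches any of that person's nicknames, which is what lets one gsub call per row handle several nicknames.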
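Note 1: The NAs in nicknames are replaced with 0 because an NA pattern (NA is the default fill from merge for unmatched elements) would turn the comment string into NA when passed to gsub. Be aware that the 0 is then used as a literal pattern, so any 0 character in an unmatched row's comment would also be replaced.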
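If comments may contain digits, one possible alternative to the 0 placeholder is to skip the substitution whenever a row has no pattern. This is only a sketch: the helper name replace_nicks and the sample strings are hypothetical and not part of the answer.

```r
# Hypothetical helper: leave the text untouched when there is no pattern (NA),
# otherwise replace every match with XXX
replace_nicks <- function(pattern, text) {
  if (is.na(pattern)) return(text)
  gsub(pattern, "XXX", text)
}

patterns <- c("Tom|Tommy", NA)                       # NA = name has no nicknames
comments <- c("Tom is here", "Room 101 is booked")   # made-up sample comments

res <- unname(mapply(replace_nicks, patterns, comments))
res  # "XXX is here" "Room 101 is booked"

# For comparison, the 0 placeholder would alter the second comment
with_zero <- gsub(0, "XXX", "Room 101 is booked")
with_zero  # "Room 1XXX1 is booked"
```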
Note 2: Because of merge, the row order also changes, but you can sort it back as usual.
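One way to get the original order back is sketched below, assuming a temporary helper column is acceptable. The column name row_id is hypothetical, and d1 is written out by hand here only to keep the example self-contained.

```r
text_names <- c("Abigail", "Thomas", "Abigail", "Thomas", "Colin")
df_name_comment <- data.frame(text_names, stringsAsFactors = FALSE)

# Aggregated nickname patterns, typed in directly for this example
d1 <- data.frame(names = c("Abigail", "Thomas"),
                 nicknames = c("Abi|Abby|Abbey", "Tom|Tommy"),
                 stringsAsFactors = FALSE)

# Remember the original position of every row before merging
df_name_comment$row_id <- seq_len(nrow(df_name_comment))
d2 <- merge(df_name_comment, d1, by.x = "text_names", by.y = "names", all = TRUE)

# Restore the original order and drop the helper column
d2 <- d2[order(d2$row_id), ]
d2$row_id <- NULL
rownames(d2) <- NULL

d2$text_names  # "Abigail" "Thomas" "Abigail" "Thomas" "Colin"
```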
Note 3: It is better to have the variables as characters rather than factors. So you either read the data frames with stringsAsFactors = FALSE, or convert them via
df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
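EDIT
As per your comment, we can simply match the names of the comments against our aggregated data set, save the result in a vector, and use mapply directly on the original data frame, without having to merge and then drop variables, i.e.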
#d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0
df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
v1, df_name_comment$text_comment)
Answer 1: (score: 1)
Hope this helps!
# For each row, look up that row's nicknames and replace them with XXX
l <- apply(df_name_comment, 1, function(x) {
  nicks <- df_name_nick[df_name_nick$names == x["text_names"], "nicknames"]
  if (length(nicks) > 0) {
    gsub(paste(nicks, collapse = "|"), "XXX", x["text_comment"])
  } else {
    x["text_comment"]
  }
})
df_name_comment$text_comment <- as.character(l)
If it solves your problem, don't forget to let us know :)
Answer 2: (score: 0)
Data
df_name_nick <- data.frame(names, nicknames, stringsAsFactors = FALSE)
df_name_comment <- data.frame(text_names, text_comment, stringsAsFactors = FALSE)
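Solution 2
EDIT: In my initial solution I used grepl to check manually whether a nickname was present, and then substituted using one of the matching indices. I knew the '|' operator worked with grepl, but not that it also worked with gsub, so credit to Sotos for that idea.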
df <- df_name_comment
for (i in seq_len(nrow(df))) {
  matching_nicknames <- df_name_nick$nicknames[df_name_nick$names == df$text_names[i]]
  if (length(matching_nicknames) > 0) {
    # Wrap each nickname in \\b so only whole words match, then join with |
    pattern <- paste(paste0("\\b", matching_nicknames, "\\b"), collapse = "|")
    # gsub replaces every occurrence in the comment, not just the first
    df$text_comment[i] <- gsub(pattern, "XXX", df$text_comment[i])
  }
}
Output
  text_names text_comment
1 Abigail    Tommy sits next to XXX
2 Thomas     As a footballer XXX is very good
3 Abigail    XXX is a mature young lady
4 Thomas     XXX is a handsome man
5 Colin      Tom is friends with Colin and Abi
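The \\b word boundaries in the pattern are what keep a short nickname from matching inside a longer word. A minimal illustration, using a made-up sentence that is not part of the question's data:

```r
text <- "Tomorrow Tom leaves"   # made-up sentence for illustration

# Without word boundaries, "Tom" also matches inside "Tomorrow"
loose <- gsub("Tom", "XXX", text)

# With \\b on both sides, only the standalone word is replaced
strict <- gsub("\\bTom\\b", "XXX", text)

loose   # "XXXorrow XXX leaves"
strict  # "Tomorrow XXX leaves"
```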
Hope this helps!