Grouping strings by similarity in a data.frame

Time: 2016-04-07 15:50:05

Tags: r group-by dplyr

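I am trying to aggregate entries in a huge database of newspaper articles based on the similarity of the articles themselves.

My data looks like this: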

ID  Source File Newspaper   Date        Length  Article
1   aaa     The Guardian    07.30.2002  561     US scientist questions 
2   aaa     The Guardian    07.30.2002  426     Cash fine to clear elderly...
3   aaa     The Guardian    07.30.2002  206     Token victory for HIV mother
4   aab     Financial Times 07.29.2002  964     A tough question at the heart..
5   aab     The Guardian    07.29.2002  500     Media: 'We want van Hoogstr…
6   aab     The Mirror      07.29.2002  43      IN BRIEF…
7   aab     The Sun         07.29.2002  196     US scientist questions
8   aab     The Sun         07.29.2002  140     ADDED VALUE
9   aab     The Times       07.29.2002  794     US-scientist questions
10  …       …               …           …       …

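After looking around here for a while, I managed to group the exact duplicates with dplyr:

library(dplyr)

# Group rows whose Article text is identical and list their IDs
Dup_info <- meta_articles.m %>%
  group_by(Article) %>%
  summarise(IDs = toString(ID))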

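This correctly identifies #1 and #7 as duplicates, and I can keep their information after removing the duplicates. Unfortunately, it does not catch #9 as a duplicate because a single character differs, and I don't know dplyr well enough to see how to implement a 99% or 95% similarity threshold. Does anyone know whether this is possible?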

dput(meta_articles.m)
structure(list(ID = 1:9, Source.File = structure(c(1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("aaa", "aab"), class = "factor"), 
    Newspaper = structure(c(2L, 2L, 2L, 1L, 2L, 3L, 4L, 4L, 5L
    ), .Label = c("Financial Times", "The Guardian", "The Mirror", 
    "The Sun", "The Times"), class = "factor"), Date = structure(c(2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("07.29.2002", 
    "07.30.2002"), class = "factor"), Length = c(561L, 426L, 
    206L, 964L, 500L, 43L, 196L, 140L, 794L), Article = structure(c(8L, 
    3L, 6L, 1L, 5L, 4L, 8L, 2L, 7L), .Label = c("A tough question at the heart..", 
    "ADDED VALUE", "Cash fine to clear elderly...", "IN BRIEF…", 
    "Media: 'We want van Hoogstr…", "Token victory for HIV mother", 
    "US-scientist questions", "US scientist questions"), class = "factor")), .Names = c("ID", 
"Source.File", "Newspaper", "Date", "Length", "Article"), class = "data.frame", row.names = c(NA, 
-9L))

1 answer:

Answer 0 (score: 1)

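I would suggest using the Levenshtein distance metric or something similar. It is essentially the edit distance between two strings: the number of single-character insertions, deletions, and substitutions needed to turn one into the other. It won't be perfect, but it will get you started.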

Read more about it here: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/adist.html
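As a quick illustration of what edit distance measures, here is a minimal sketch using base R's adist on two of the headlines from the question. Dividing by the length of the longer string to get a similarity score is an assumption on my part, one common way to turn a raw distance into a percentage, not something adist does for you.

```r
# Two headlines from the question that differ by one character
a <- "US scientist questions"
b <- "US-scientist questions"

# Raw Levenshtein distance: number of single-character edits
dist_ab <- adist(a, b)[1, 1]

# Assumed normalisation: scale by the longer string to get a 0..1 similarity
sim_ab <- 1 - dist_ab / max(nchar(a), nchar(b))

dist_ab
sim_ab
```

The distance here is 1 (the space replaced by a hyphen), and with 22 characters the similarity comes out at roughly 0.955, so a 95% threshold would treat these two as the same headline.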

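More advanced functionality can be found in the stringdist package, including a soundex method that can efficiently group similar-sounding words. The RecordLinkage package is also worth a look.

Without a reasonably sized sample (dput) I can't provide an example implementation.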

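Edit: adist(meta_articles.m$Article) will produce a matrix of pairwise edit distances. Ignoring the diagonal, you can parse that matrix to find the values within whatever similarity threshold you are after: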

d <- adist(meta_articles.m$Article)
d2 <- d
d2[d2 > 2] <- NA  # keep only distances of 2 or less
d2

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    0   NA   NA   NA   NA   NA    0   NA    1
 [2,]   NA    0   NA   NA   NA   NA   NA   NA   NA
 [3,]   NA   NA    0   NA   NA   NA   NA   NA   NA
 [4,]   NA   NA   NA    0   NA   NA   NA   NA   NA
 [5,]   NA   NA   NA   NA    0   NA   NA   NA   NA
 [6,]   NA   NA   NA   NA   NA    0   NA   NA   NA
 [7,]    0   NA   NA   NA   NA   NA    0   NA    1
 [8,]   NA   NA   NA   NA   NA   NA   NA    0   NA
 [9,]    1   NA   NA   NA   NA   NA    1   NA    0
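Reading matches off the full matrix gets unwieldy beyond a handful of rows, so the pairs under the threshold can also be listed directly. This is only a sketch: `ids` and `articles` are a hypothetical five-row subset of the question's data (IDs 1, 3, 7, 8, 9), standing in for meta_articles.m$ID and meta_articles.m$Article.

```r
# Hypothetical subset of the question's data (IDs 1, 3, 7, 8, 9)
ids <- c(1, 3, 7, 8, 9)
articles <- c("US scientist questions",
              "Token victory for HIV mother",
              "US scientist questions",
              "ADDED VALUE",
              "US-scientist questions")

d <- adist(articles)

# Look only above the diagonal so each pair is reported once
# and self-matches (distance 0 on the diagonal) are skipped
hits <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)

pairs <- data.frame(id_a     = ids[hits[, "row"]],
                    id_b     = ids[hits[, "col"]],
                    distance = d[hits])
pairs
```

Each row of `pairs` names two article IDs and the edit distance between them; on this subset it should list 1 and 7, 1 and 9, and 7 and 9.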

So row [1] is identical to itself and to [7], has an edit distance of 1 to [9], and so on. You can then go on to cluster by distance, i.e.:

d <- adist(meta_articles.m$Article)
rownames(d) <- meta_articles.m$Article
hc <- hclust(as.dist(d))
plot(hc)

[Figure: dendrogram of the hierarchical clustering]

Finally, group together all values that are within an edit distance of 2 of each other:

df <- data.frame(meta_articles.m$Article,cutree(hc,h=2))
df

    meta_articles.m.Article cutree.hc..h...2.
1          US scientist questions                 1
2   Cash fine to clear elderly...                 2
3    Token victory for HIV mother                 3
4 A tough question at the heart..                 4
5    Media: 'We want van Hoogstr…                 5
6                       IN BRIEF…                 6
7          US scientist questions                 1
8                     ADDED VALUE                 7
9          US-scientist questions                 1
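The question asked for a percentage threshold (95% or 99%) rather than a fixed number of edits, and for the IDs of each group. One way to get there, sketched below, is to divide each distance by the length of the longer string before clustering. The normalisation, the variable names, and the five-row subset (`ids`, `articles`) are assumptions for illustration, not part of the original data or of any package's API.

```r
# Hypothetical subset of the question's data (IDs 1, 3, 7, 8, 9)
ids <- c(1, 3, 7, 8, 9)
articles <- c("US scientist questions",
              "Token victory for HIV mother",
              "US scientist questions",
              "ADDED VALUE",
              "US-scientist questions")

d <- adist(articles)

# Assumed normalisation: divide by the longer string of each pair,
# giving a dissimilarity between 0 (identical) and 1
len <- nchar(articles)
d_norm <- d / outer(len, len, pmax)

# Complete linkage: all members of a cluster are within the cut height of each other
threshold <- 0.95
hc <- hclust(as.dist(d_norm), method = "complete")
group <- cutree(hc, h = 1 - threshold)

# Collapse the IDs per group, mirroring summarise(IDs = toString(ID))
dup_info <- aggregate(ids, by = list(group = group), FUN = toString)
names(dup_info)[2] <- "IDs"
dup_info
```

On this subset, IDs 1, 7 and 9 should land in one group and the other two headlines in groups of their own. With dplyr, the same last step would be to add `group` as a column and then use group_by(group) with the summarise call from the question. Keep in mind that the pairwise matrix grows quadratically with the number of articles, so a very large database may need to be split up first (for example by date).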