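I am trying to aggregate entries in a large database of newspaper articles based on the similarity of the articles themselves.
My data looks like this: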
ID Source File Newspaper Date Length Article
1 aaa The Guardian 07.30.2002 561 US scientist questions
2 aaa The Guardian 07.30.2002 426 Cash fine to clear elderly...
3 aaa The Guardian 07.30.2002 206 Token victory for HIV mother
4 aab Financial Times 07.29.2002 964 A tough question at the heart..
5 aab The Guardian 07.29.2002 500 Media: 'We want van Hoogstr…
6 aab The Mirror 07.29.2002 43 IN BRIEF…
7 aab The Sun 07.29.2002 196 US scientist questions
8 aab The Sun 07.29.2002 140 ADDED VALUE
9 aab The Times 07.29.2002 794 US-scientist questions
10 … … … … …
After looking around here for a while, I managed to collapse the exact duplicates using dplyr:
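library(dplyr)
# Collapse rows that share exactly the same title and keep their IDs
Dup_info <- meta_articles.m %>%
  group_by(Article) %>%  # the column is named Article, not Articles
  summarise(IDs = toString(ID))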
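This correctly identifies #1 and #7 as duplicates, and I can keep their information after removing the duplicates. Unfortunately, it does not catch #9 as a duplicate because a single character differs, and I don't know dplyr well enough to see how to implement a similarity threshold of 99% or 95%. Does anyone know whether this is possible?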
dput(meta_articles.m)
structure(list(ID = 1:9, Source.File = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("aaa", "aab"), class = "factor"),
Newspaper = structure(c(2L, 2L, 2L, 1L, 2L, 3L, 4L, 4L, 5L
), .Label = c("Financial Times", "The Guardian", "The Mirror",
"The Sun", "The Times"), class = "factor"), Date = structure(c(2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("07.29.2002",
"07.30.2002"), class = "factor"), Length = c(561L, 426L,
206L, 964L, 500L, 43L, 196L, 140L, 794L), Article = structure(c(8L,
3L, 6L, 1L, 5L, 4L, 8L, 2L, 7L), .Label = c("A tough question at the heart..",
"ADDED VALUE", "Cash fine to clear elderly...", "IN BRIEF…",
"Media: 'We want van Hoogstr…", "Token victory for HIV mother",
"US-scientist questions", "US scientist questions"), class = "factor")), .Names = c("ID",
"Source.File", "Newspaper", "Date", "Length", "Article"), class = "data.frame", row.names = c(NA,
-9L))
Answer 0 (score: 1)
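I would suggest using the Levenshtein distance or a similar metric. It is essentially the edit distance between two strings: the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. It won't be perfect, but it will get you started.
Read more here: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/adist.html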
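As a minimal sketch of what that distance looks like in practice, the snippet below compares two of the titles from the question with base R's adist(). The variable names a, b and d_ab are only illustrative.

```r
# Two titles from the question that differ in a single character
a <- "US scientist questions"
b <- "US-scientist questions"

# adist() (utils package, loaded by default) returns a matrix of
# generalized Levenshtein distances; for two single strings it is 1 x 1
d_ab <- adist(a, b)
d_ab[1, 1]  # one substitution: space -> hyphen

# Case differences count as edits unless ignore.case = TRUE is set
d_case <- adist("ADDED VALUE", "Added value", ignore.case = TRUE)
d_case[1, 1]
```

A distance of 0 means the strings are identical, so exact duplicates are just the special case that the dplyr approach already handles.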
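More advanced functionality can be found in the stringdist package, including the soundex method, which can effectively group similar-sounding words. The RecordLinkage package is also worth a look.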
Without a reasonably sized sample (dput) I can't provide an example implementation.
Edit:
adist(meta_articles.m$Article)
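will generate a matrix of pairwise edit distances. Ignoring the diagonal, you can parse that matrix to find the values for whatever similarity threshold you are after: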
d <- adist(meta_articles.m$Article)
d2 <- d
d2[d2 > 2] <- NA  # keep only edit distances of 2 or less
d2
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 NA NA NA NA NA 0 NA 1
[2,] NA 0 NA NA NA NA NA NA NA
[3,] NA NA 0 NA NA NA NA NA NA
[4,] NA NA NA 0 NA NA NA NA NA
[5,] NA NA NA NA 0 NA NA NA NA
[6,] NA NA NA NA NA 0 NA NA NA
[7,] 0 NA NA NA NA NA 0 NA 1
[8,] NA NA NA NA NA NA NA 0 NA
[9,] 1 NA NA NA NA NA 1 NA 0
So row [1] is identical to itself and to [7], and is an edit distance of 1 away from [9], and so on. You can then go on to cluster by distance, e.g.:
d <- adist(meta_articles.m$Article)
rownames(d) <- meta_articles.m$Article
hc <- hclust(as.dist(d))
plot(hc)
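The question asks for a relative threshold such as 95% rather than an absolute number of edits. One possible way to get there, shown here only as a sketch, is to divide each edit distance by the length of the longer string and subtract from 1. This normalization is an assumption of mine, not something adist() does for you, and the names titles, d_small, sim and dup_pairs are illustrative.

```r
# A few titles taken from the question
titles <- c("US scientist questions",
            "Cash fine to clear elderly...",
            "US-scientist questions",
            "ADDED VALUE")

d_small <- adist(titles)                 # pairwise edit distances
len     <- nchar(titles)                 # title lengths in characters
max_len <- outer(len, len, pmax)         # longer length for each pair

# Relative similarity in [0, 1]; assumes no empty strings (max_len > 0)
sim <- 1 - d_small / max_len

# Pairs at or above the threshold; upper.tri() skips the diagonal
# and avoids reporting each pair twice
dup_pairs <- which(sim >= 0.95 & upper.tri(sim), arr.ind = TRUE)
dup_pairs
```

With these four titles only the pair (1, 3) passes the 95% cut, because one edit in 22 characters gives a similarity of about 0.955. Short titles are penalized heavily by this measure, so an absolute distance cutoff may still suit them better.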
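Finally, cut the tree at height 2 so that titles within an edit distance of 2 or less of each other end up in the same group: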
df <- data.frame(meta_articles.m$Article,cutree(hc,h=2))
df
meta_articles.m.Article cutree.hc..h...2.
1 US scientist questions 1
2 Cash fine to clear elderly... 2
3 Token victory for HIV mother 3
4 A tough question at the heart.. 4
5 Media: 'We want van Hoogstr… 5
6 IN BRIEF… 6
7 US scientist questions 1
8 ADDED VALUE 7
9 US-scientist questions 1
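To get back to the kind of summary the question started with (one row per group, with the IDs of its members), the cluster labels can be attached to the data and collapsed. The following is a sketch in base R on a small hand-built subset of the question's data; the data frame articles and the column name cluster are illustrative.

```r
# Small subset of the question's data (IDs 1, 2, 7 and 9)
articles <- data.frame(
  ID = c(1L, 2L, 7L, 9L),
  Article = c("US scientist questions",
              "Cash fine to clear elderly...",
              "US scientist questions",
              "US-scientist questions"),
  stringsAsFactors = FALSE
)

# Cluster on edit distance and cut the tree at height 2,
# using hclust()'s default complete linkage
d_art <- adist(articles$Article)
hc_art <- hclust(as.dist(d_art))
articles$cluster <- cutree(hc_art, h = 2)

# One row per cluster with its member IDs as a comma-separated string
dup_info <- aggregate(ID ~ cluster, data = articles, FUN = toString)
dup_info
```

The same collapse can be written in dplyr by grouping on cluster instead of Article and keeping summarise(IDs = toString(ID)) as in the question.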