Question

Does anyone know of an R package that solves the longest common substring problem? I am looking for something fast that works on vectors.

Answer 1

Check out the "Rlibstree" package on omegahat: http://www.omegahat.org/Rlibstree/.

It uses http://www.icir.org/christian/libstree/.

Answer 2

You should look at the LCS function in the qualV package. It is implemented in C, so it is quite efficient.

Answer 3

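The question is not entirely clear about the intended application of a solution to the longest common substring problem. A common application that I encounter is matching names across different datasets. The stringdist package has a useful function, amatch(), which I find suitable for this task.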

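In brief, amatch() takes two vectors as input. The first, x, is the vector of strings you want to find matches for (it can also be a single string). The second, table, is the vector of strings you want to compare against, from which the match with the longest common substring is chosen. amatch() then returns a vector whose length equals the length of x: each element of the result is the index in table of the best match.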

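Details: amatch() takes a method argument, which you specify as lcs if you want to match on the longest common substring. There are many other options for different string matching techniques (e.g. Levenshtein distance). There is also a maxDist argument, which you will usually want to set explicitly. If every string in table is farther than maxDist from a given string in x, amatch() returns NA for that element of its output. How "distance" is defined depends on the string matching algorithm you choose. For lcs, it (more or less) just means how many differing (non-matching) characters there are. See the documentation for details.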
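To make the "number of non-matching characters" idea concrete, here is a minimal base R sketch. It assumes the lcs distance is the count of characters left unpaired once as many characters as possible are paired in order, which is how I read the stringdist documentation. The helper names lcs_length() and lcs_distance() are hypothetical and are not part of stringdist.

```r
# Hypothetical helper: length of the longest in-order pairing of characters
# between two strings, computed with a two-row dynamic programme.
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  prev <- integer(length(y) + 1)
  for (xi in x) {
    curr <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      curr[j + 1] <- if (xi == y[j]) prev[j] + 1L else max(prev[j + 1], curr[j])
    }
    prev <- curr
  }
  prev[length(y) + 1]
}

# Assumed definition: every character that could not be paired counts as 1.
lcs_distance <- function(a, b) nchar(a) + nchar(b) - 2 * lcs_length(a, b)

lcs_distance("hello", "hell")          # 1: only the "o" is unpaired
lcs_distance("hello", "hel123l5678o")  # 7: the seven digits are unpaired
```

With maxDist = 5, for example, the second candidate would be rejected under this definition while the first would be accepted.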

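Parallelization: another nice feature of amatch() is that it automatically parallelizes the operation for you, making a reasonable guess about the system resources to use. If you want more control over this, you can set the nthread argument.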

Example application:

library(stringdist)

Names1 = c(
"SILVER EAGLE REFINING, INC. (SW)",
"ANTELOPE REFINING",
"ANTELOPE REFINING (DOUGLAS FACILITY)"
)

Names2 = c(
"Mobile Concrete, Inc.",
"Antelope Refining, LLC. ",
"Silver Eagle Refining Inc."
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)
Match_Idx
# [1] 3 2 2

Matches = data.frame(Names1 = tolower(Names1), Match = tolower(Names2[Match_Idx]))
Matches

#                                 Names1                      Match
# 1     silver eagle refining, inc. (sw) silver eagle refining inc.
# 2                    antelope refining   antelope refining, llc. 
# 3 antelope refining (douglas facility)   antelope refining, llc. 

### Compare Matches:

Matches$Distance = stringdist(Matches$Names1, Matches$Match, method = 'lcs')

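Also, unlike functions such as LCS from qualV, this will not prefer a "subsequence" match that requires ignoring intervening characters in order to form a match, since the unmatched characters count toward the distance. For example: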

Names1 = c(
"hello"
)

Names2 = c(
"hel123l5678o",
"hell"
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)

Matches = data.frame(Names1, Match = Names2[Match_Idx])
Matches

#   Names1 Match
# 1  hello  hell
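If what you actually need is the longest contiguous common substring itself rather than a distance, a small dynamic programme in base R is enough for moderate input sizes. The following is only a sketch: longest_common_substring() is a hypothetical helper, not a function from stringdist or qualV, and on ties it returns the first longest run found in the first string.

```r
# Hypothetical helper: longest contiguous run of characters shared by a and b.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  best_len <- 0L
  best_end <- 0L
  # prev[j + 1] holds the length of the common run ending at x[i - 1] and y[j]
  prev <- integer(length(y) + 1)
  for (i in seq_along(x)) {
    curr <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        curr[j + 1] <- prev[j] + 1L
        if (curr[j + 1] > best_len) {
          best_len <- curr[j + 1]
          best_end <- i
        }
      }
    }
    prev <- curr
  }
  if (best_len == 0L) return("")
  paste(x[(best_end - best_len + 1):best_end], collapse = "")
}

longest_common_substring("hello", "hel123l5678o")  # "hel"
longest_common_substring("hello", "hell")          # "hell"

# Applied element-wise to two character vectors of equal length
mapply(longest_common_substring,
       c("hello", "antelope"), c("hell", "elope"),
       USE.NAMES = FALSE)
```

The nested R loops cost time proportional to nchar(a) * nchar(b) per pair, so for large vectors a compiled implementation is likely to be much faster.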

Answer 4

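I don't know R, but I once implemented Hirschberg's algorithm, which is fast and does not consume much space.

As I recall, it amounts to only 2 or 3 short functions that call each other recursively.

Here is a link: http://wordaligned.org/articles/longest-common-subsequence

So don't hesitate to implement it in R. It is worth the effort, since it is a very interesting algorithm.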
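For what it is worth, a rough base R sketch of the divide-and-conquer idea might look like the following. Note that it computes a longest common subsequence (the topic of the linked article), not a contiguous substring. The function names lcs_row(), hirschberg() and lcs_string() are made up for this illustration, and the code has not been tuned for speed.

```r
# LCS lengths of x against every prefix of y, keeping only two rows in memory.
lcs_row <- function(x, y) {
  prev <- integer(length(y) + 1)
  for (xi in x) {
    curr <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      curr[j + 1] <- if (xi == y[j]) prev[j] + 1L else max(prev[j + 1], curr[j])
    }
    prev <- curr
  }
  prev
}

# Recover one longest common subsequence of two character vectors:
# split x in half, find the best split point of y, and recurse on both halves.
hirschberg <- function(x, y) {
  n <- length(x)
  m <- length(y)
  if (n == 0 || m == 0) return(character(0))
  if (n == 1) return(if (x %in% y) x else character(0))
  mid <- n %/% 2
  left  <- lcs_row(x[1:mid], y)                   # scores for prefixes of y
  right <- lcs_row(rev(x[(mid + 1):n]), rev(y))   # scores for suffixes of y
  k <- which.max(left + rev(right)) - 1           # best number of y items on the left
  c(hirschberg(x[1:mid], y[seq_len(k)]),
    hirschberg(x[(mid + 1):n], y[seq_len(m - k) + k]))
}

# Convenience wrapper for single strings
lcs_string <- function(a, b) {
  paste(hirschberg(strsplit(a, "")[[1]], strsplit(b, "")[[1]]), collapse = "")
}

lcs_string("hello", "hel123l5678o")  # "hello"
lcs_string("AGGTAB", "GXTXAYB")      # "GTAB"
```

Only two score rows are alive at any time, so working memory grows linearly with the input length rather than with the full nchar(a) * nchar(b) table.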
