Question

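I have several paragraphs from which I am trying to extract names together with their initials.

For example, a paragraph might contain a lot of text that includes a name such as "A. J. Balfour" or "J. Balfour".

Here is what I have written so far, and it does not work. I would appreciate your feedback!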

z = "This is a bunch of text. I would like to extract A J Balfour"

sub("^(([A]\\S+\\s){1}\\S+).*", "\\1", z, perl = TRUE)

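I think the best option is to use sub, but I am having trouble getting the regular expression to work. I could not find good information on regular expressions for extracting characters.

Thanks.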

Answer 1

Thanks! I ended up using str_extract_all, which looks like this:

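library(stringr)

z = "This is a bunch of text. I would like to extract A. J. Balfour and some other words or another A. F. Balfour or even G. G. Balfour or even A. G. Balfour"

# Dots are escaped so they match literal periods; simplify is an argument of str_extract_all(), not regex()
str_extract_all(z, regex("[A-Z]\\. [A-Z]\\. Balfour"), simplify = TRUE)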
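A note on the periods in a pattern like this: an unescaped . matches any single character, so whether the dots are escaped changes what gets matched. The sketch below is only an illustration in base R (so it runs without extra packages); the sample string x is made up for the demonstration and is not from the original post.

```r
# Made-up sample: one real name with periods, one look-alike without them
x <- "A. J. Balfour and Ab Jc Balfour"

# Unescaped dots: "." matches any single character
loose <- regmatches(x, gregexpr("[A-Z]. [A-Z]. Balfour", x, perl = TRUE))[[1]]

# Escaped dots: "\\." matches only a literal period
strict <- regmatches(x, gregexpr("[A-Z]\\. [A-Z]\\. Balfour", x, perl = TRUE))[[1]]

loose
#[1] "A. J. Balfour" "Ab Jc Balfour"
strict
#[1] "A. J. Balfour"
```

If the text only ever contains initials followed by periods, both forms give the same result; the escaped form is simply the stricter of the two.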

Thanks for all the ideas!

Answer 2

The stringr library has a str_extract function whose syntax is simpler than using sub alone.

library(stringr)
str_extract(z, "[A]\\S{0,1}\\s(\\S\\S{0,1}\\s){0,1}.*")
#[1] "A J Balfour"

Edit: Here is another attempt, but since you are looking for a more general solution, it is hard to find a pattern that matches exactly.

z<-c( "This is a bunch of text. I would like to extract A J Balfour",
      "J Balfour",
      'This is a bunch of text.  G. Balfour'
)

str_extract_all(z, "([A-Z]+[\\. ]{1,2}){1,2}.*")

# (       - start of grouping
# [A-Z]   - any capital letter
# +       - one or more times
# [\\. ]  - a period or a space
# {1,2}   - one or two times
# ){1,2}  - the whole group, one or two times
# .*      - any character, zero or more times

Actually, this attempt fails on the first test case. Narrowing the character class, for example to [A-J], may help. Good luck.
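One possible way to tighten the pattern, offered as a sketch rather than a definitive fix: require a capitalized surname after the initials instead of a trailing .*. This assumes each initial is a single capital letter, optionally followed by a period and then exactly one space, and that the surname is a capital letter followed by lowercase letters. It is written with base R's regmatches and gregexpr so it runs without extra packages; the same pattern should also work with str_extract_all.

```r
z <- c("This is a bunch of text. I would like to extract A J Balfour",
       "J Balfour",
       "This is a bunch of text.  G. Balfour")

# One or two initials (capital letter, optional period, one space),
# followed by a capitalized surname. Assumption: surnames look like "Balfour".
pattern <- "(?:[A-Z]\\.? ){1,2}[A-Z][a-z]+"

matches <- regmatches(z, gregexpr(pattern, z, perl = TRUE))
unlist(matches)
#[1] "A J Balfour" "J Balfour"   "G. Balfour"
```

The word "I" in the first string no longer matches, because it is followed by a lowercase word rather than a capitalized surname. Surnames with hyphens, apostrophes, or internal capitals would need a wider surname class.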

Answer 3

Consider using regmatches in base R.

z = "This is a bunch of text. I would like to extract A J Balfour"

regmatches(z,regexpr("[A]\\s{1}\\S+.*", z))
#[1] "A J Balfour"
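Since the question asked about sub specifically, here is a rough sketch of the same extraction with sub and a capture group. The helper name extract_name is hypothetical, and the pattern assumes one or two single-letter initials (with optional periods) followed by a capitalized surname. Note that sub returns its input unchanged when nothing matches, so the sketch marks those cases as NA.

```r
# Hypothetical helper (not from the original answers): return the first
# name with one or two leading initials, or NA when there is none.
extract_name <- function(x) {
  # Lazy ".*?" skips text before the name; the capture group holds the name
  pattern <- ".*?((?:[A-Z]\\.? ){1,2}[A-Z][a-z]+).*"
  out <- sub(pattern, "\\1", x, perl = TRUE)
  # sub() leaves non-matching strings untouched, so flag them explicitly
  out[!grepl(pattern, x, perl = TRUE)] <- NA_character_
  out
}

extract_name(c("This is a bunch of text. I would like to extract A J Balfour",
               "Contact A. J. Balfour today",
               "no names here"))
#[1] "A J Balfour"   "A. J. Balfour" NA
```

This only returns the first name in each string; for several names per string, regmatches with gregexpr is the more natural fit.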

How can I use sub to extract names with initials in R?
