Question

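I have some text containing phrases made up of digits followed by symbols, and I want to extract them, for example digits followed by a percent sign. The kwic function from the quanteda package seems to accept a regular expression for the digits (for instance "\\d{1,}"). However, I have not found how to use quanteda to extract the digits together with the percent sign that follows them. The following text can serve as an example: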

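Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%) developed diarrhoea attributable only to C. difficile and/ or toxin, and the remaining 17 (68%) were asymptomat- ic: none had pseudomembranous colitis.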

Answer 1

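The quanteda package handles regular expressions in a somewhat unexpected way here. I am not sure why this solution works, but I believe it has to do with how kwic() handles the specified pattern. Wrapping the pattern in the phrase() function and adding a space between the digits and the percent sign returns the correct results: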

library(quanteda)
s <- c("Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%) developed diarrhoea attributable only to C. difficile and/ or toxin, and the remaining 17 (68%) were asymptomat- ic: none had pseudomembranous colitis.")

kwic(s, phrase("\\d+ %"), valuetype = "regex")

I would suggest contacting the package maintainers and pointing this out. It seems counterintuitive.

Answer 2

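The reason is that when you call kwic() directly on a corpus or character object, it passes some arguments on to tokens() before the keywords-in-context analysis, and those arguments affect how the text is tokenized. (This is documented under the ... argument in ?kwic.)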

The default tokenization in quanteda uses the stringi word boundary definitions, so that:

tokens("Thirteen (7%) of 187")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "("        "7"        "%"        ")"        "of"       "187"

If you want to use a simpler whitespace tokenizer, you can do so with:

tokens("Thirteen (7%) of 187", what = "fasterword")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "(7%)"     "of"       "187"
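
To see why the tokenizer choice matters, it can help to apply the regular expression to each token on its own, since a single-token pattern can only match within one token. The sketch below uses only base R and hard-codes the two token vectors printed above; the names default_tokens and whitespace_tokens are illustrative and not part of quanteda.

```r
# Token vectors copied from the two tokens() outputs above (hard-coded for illustration).
default_tokens    <- c("Thirteen", "(", "7", "%", ")", "of", "187")
whitespace_tokens <- c("Thirteen", "(7%)", "of", "187")

# Apply the pattern to each token separately.
pattern <- "\\d+%"
default_hits    <- grepl(pattern, default_tokens)
whitespace_hits <- grepl(pattern, whitespace_tokens)

# No default token holds both the digits and the percent sign.
any(default_hits)

# With whitespace tokens, "(7%)" contains the whole match.
whitespace_tokens[whitespace_hits]
```

With the default tokens, the digits and the percent sign sit in separate tokens, so a pattern that needs both in one token finds nothing.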

So the way to use this in kwic() would be:

kwic(s, "\\d+%", valuetype = "regex", what = "fasterword")

#  [text1, 2]                    Thirteen |  (7%)  | of 187 patients acquired C.             
# [text1, 12]    C. difficile in ICU-1, 9 | (36%)  | of 25 on ICU-2 and                      
# [text1, 19]           25 on ICU-2 and 3 | (5.9%) | of 51 patients in BU.                   
# [text1, 26]    51 patients in BU. Eight | (32%)  | developed diarrhoea attributable only to
# [text1, 41] toxin, and the remaining 17 | (68%)  | were asymptomat- ic: none had
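
An equivalent route is to tokenize first and pass the resulting tokens object to kwic(). This may be the safer form on recent quanteda releases, where (an assumption worth checking against ?kwic for your installed version) calling kwic() directly on a character vector is deprecated or no longer supported. The object names toks and hits are illustrative.

```r
library(quanteda)

s <- "Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%) developed diarrhoea attributable only to C. difficile and/ or toxin, and the remaining 17 (68%) were asymptomat- ic: none had pseudomembranous colitis."

# Tokenize on whitespace so that "(7%)" stays a single token.
toks <- tokens(s, what = "fasterword")

# Search the tokens for any token containing digits followed by a percent sign.
hits <- kwic(toks, pattern = "\\d+%", valuetype = "regex")
hits
```

Tokenizing explicitly also makes it visible which tokenizer the search depends on, rather than leaving it to arguments passed through kwic().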

Otherwise, you need to wrap the regular expression in the phrase() function and separate the elements with whitespace:

kwic(s, phrase("\\d+ %"), valuetype = "regex")

#   [text1, 3:4]             Thirteen( |  7 %  | ) of 187 patients acquired             
# [text1, 18:19]          in ICU-1, 9( | 36 %  | ) of 25 on ICU-2                       
# [text1, 28:29]       on ICU-2 and 3( | 5.9 % | ) of 51 patients in                    
# [text1, 39:40]         in BU. Eight( | 32 %  | ) developed diarrhoea attributable only
# [text1, 60:61] and the remaining 17( | 68 %  | ) were asymptomat- ic
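
The phrase() form treats the pattern as a sequence of two single-token patterns, one per whitespace-separated element. A rough base R sketch of that idea, using the default tokens of the short example "Thirteen (7%) of 187" hard-coded as a vector, checks each pair of neighbouring tokens. It only mimics the logic and makes no claim about how quanteda implements the search; the variable names are illustrative.

```r
# Default (word-boundary) tokens of "Thirteen (7%) of 187", hard-coded for illustration.
tokens_default <- c("Thirteen", "(", "7", "%", ")", "of", "187")
n <- length(tokens_default)

# First element of the phrase: a token containing digits (positions 1 to n - 1).
first_matches  <- grepl("\\d+", tokens_default[-n])
# Second element of the phrase: the following token contains a percent sign.
second_matches <- grepl("%", tokens_default[-1])

# Start positions where both elements match consecutive tokens.
start <- which(first_matches & second_matches)
start                                                    # 3
paste(tokens_default[start], tokens_default[start + 1])  # "7 %"
```

The start position 3 agrees with the 3:4 span reported in the first row of the kwic() output above.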

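This behaviour may take a little getting used to, but it is the best way to give the user full control over searching for multi-token sequences, rather than implementing a single way of deciding what the elements of a multi-token sequence should be when the input has not yet been tokenized.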

quanteda kwic: extracting digits followed by a percent sign
