unnest_tokens in R with sentence start and end positions

Date: 2018-02-23 15:29:02

Tags: r text-mining tidytext

I am new to R. I am using tidytext::unnest_tokens to split a long text into individual sentences, like this:

tidy_drugs <- drugstext.raw %>% unnest_tokens(sentence, Section, token="sentences")

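So I get a data.frame in which every sentence has become a row.

I would like to get the start and end position of each sentence within the long text.

Below is an example of the long text. It comes from a drug label.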

6.1 Clinical Trial Experience
  Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
 The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
 In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting.

The desired result is a data frame with three columns:

[Image: Dataframe]

1 Answer:

Answer 0 (score: 1)

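You can do this with str_locate from stringr. It tends to be fiddly, because newlines and special characters can break the regular expression you search with. Here we first use str_replace_all to remove the newlines from the input text, then unnest the tokens, making sure to keep the original text and to prevent the case from being changed. Next we build a new regex column in which the special characters (here (, ) and .) are replaced with properly escaped versions, and use str_locate to add the start and end of each string.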
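To see why the escaping step matters, here is a minimal sketch on a made-up one-sentence text (the `text` value and the single-call escaping pattern are illustrative assumptions, not part of the original answer). Used unescaped as a regex, the sentence's parentheses are read as a capture group, so the literal "(" in the text is never matched and str_locate returns NA.

```r
library(stringr)

# Hypothetical example text; the whole text is a single sentence.
text <- "Two trials (Studies 1 and 2) were run."

# Unescaped: "(" and ")" delimit a regex group instead of matching literally,
# so the pattern cannot match the text and the result is NA.
unescaped <- str_locate(text, text)

# Escape (, ) and . by prefixing each with a backslash, then search again.
pattern <- str_replace_all(text, "([().])", "\\\\\\1")
escaped <- str_locate(text, pattern)

print(unescaped)
print(escaped)
```

With the escaped pattern the match spans the whole text, from position 1 to nchar(text).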

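I do not get the same numbers as you, but I copied the text from your post, which does not always preserve every character, and your final end number is smaller than its start in any case.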

library(tidyverse)
library(tidytext)

raw_text <- tibble(section = "6.1 Clinical Trial Experience
  Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
                   The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
                   In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting."
)

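tidy_text <- raw_text %>%
  # Remove newlines so they cannot interfere with matching
  mutate(section = str_replace_all(section, "\\n", "")) %>%
  unnest_tokens(
    output = sentence,
    input = section,
    token = "sentences",
    drop = FALSE,      # keep the original text column
    to_lower = FALSE   # preserve case so sentences still match the original
    ) %>%
  # Build a regex version of each sentence with (, ) and . escaped
  mutate(
    regex = str_replace_all(sentence, "\\(", "\\\\("),
    regex = str_replace_all(regex, "\\)", "\\\\)"),
    regex = str_replace_all(regex, "\\.", "\\\\.")
  ) %>%
  # str_locate() returns a matrix: column 1 is the start, column 2 the end
  mutate(
    start = str_locate(section, regex)[, 1],
    end = str_locate(section, regex)[, 2]
  ) %>%
  select(sentence, start, end) %>%
  print()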
#> # A tibble: 3 x 3
#>   sentence                                                     start   end
#>   <chr>                                                        <int> <int>
#> 1 6.1 Clinical Trial Experience  Because clinical trials are ~     1   290
#> 2 The data below reflect exposure to ARDECRETRIS as monothera~   310   626
#> 3 In Studies 1 and 2, the most common adverse reactions were ~   646   762

Created on 2018-02-23 by the reprex package (v0.2.0).
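As a possible alternative to escaping by hand, stringr's fixed() modifier makes str_locate compare the pattern literally, so no regex column is needed. The sketch below is not from the original answer; the `text` and `sentences` values are made up for illustration. Note that str_locate reports only the first match, so a sentence that occurs twice in the text would get the same position both times.

```r
library(stringr)

# Hypothetical input: a short text containing regex metacharacters,
# already split into its sentences.
text <- "Dose was 5 mg (oral). Results (n = 3) were mixed. No deaths occurred."
sentences <- c(
  "Dose was 5 mg (oral).",
  "Results (n = 3) were mixed.",
  "No deaths occurred."
)

# fixed() turns off regex interpretation, so parentheses and periods
# are matched as ordinary characters.
pos <- str_locate(text, fixed(sentences))

result <- data.frame(
  sentence = sentences,
  start = pos[, "start"],
  end = pos[, "end"]
)
print(result)
```

Extracting str_sub(text, result$start, result$end) should give back the original sentences, which is a quick way to check the positions.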