Question

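I have a lot of text data in a data.table. There are several text patterns I'm interested in. I have managed to subset the table so that it shows text matching at least two of the patterns (related question here).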

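I would now like to have one row per match, with an additional column identifying the match, so rows with multiple matches would be repeated and differ only in that column.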

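It feels like this shouldn't be too hard, but I'm struggling! My vague idea is to count the number of pattern matches and then repeat each row that many times... but I'm not entirely sure how to get the label for each different pattern... (and I'm also not sure that would be very efficient anyway).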

Thanks for your help!

Example data

library(data.table)
library(stringr)
text_table <- data.table(ID = (1:5), 
                         text = c("lucy, sarah and paul live on the same street",
                                  "lucy has only moved here recently",
                                  "lucy and sarah are cousins",
                                  "john is also new to the area",
                                  "paul and john have known each other a long time"))


text_patterns <- as.character(c("lucy", "sarah", "paul|john"))

# Filtering the table to just the IDs with at least two pattern matches
# (the condition goes in `i` so that rows are subset; in `j` it would only return a logical vector)
text_table_multiples <- text_table[Reduce(`+`, lapply(text_patterns, 
                                    function(x) str_detect(text, x))) > 1]

Desired output

required_table <- data.table(ID = c(1, 1, 1, 2, 3, 3, 4, 5),
                             text = c("lucy, sarah and paul live on the same street",
                                      "lucy, sarah and paul live on the same street",
                                      "lucy, sarah and paul live on the same street",
                                      "lucy has only moved here recently",
                                      "lucy and sarah are cousins",
                                      "lucy and sarah are cousins",
                                      "john is also new to the area",
                                      "paul and john have known each other a long time"), 
                             person = c("lucy", "sarah", "paul or john", "lucy", "lucy", "sarah", "paul or john", "paul or john"))

Answer 1

One approach is to create an indicator variable for each pattern and then melt:

library(stringi)
text_table[, lucy := stri_detect_regex(text, 'lucy')][ ,
  sarah := stri_detect_regex(text, 'sarah')
][ ,`paul or john` := stri_detect_regex(text, 'paul|john')
]

melt(text_table, id.vars = c("ID", "text"))[value == TRUE][, -"value"]
##    ID                                            text     variable
## 1:  1    lucy, sarah and paul live on the same street         lucy
## 2:  2               lucy has only moved here recently         lucy
## 3:  3                      lucy and sarah are cousins         lucy
## 4:  1    lucy, sarah and paul live on the same street        sarah
## 5:  3                      lucy and sarah are cousins        sarah
## 6:  1    lucy, sarah and paul live on the same street paul or john
## 7:  4                    john is also new to the area paul or john
## 8:  5 paul and john have known each other a long time paul or john

The tidyverse way of performing the same steps is:

library(tidyverse)
text_table %>%
  mutate(lucy = stri_detect_regex(text, 'lucy')) %>%
  mutate(sarah = stri_detect_regex(text, 'sarah')) %>%
  mutate(`paul or john` = stri_detect_regex(text, 'paul|john')) %>%
  gather(value = value, key = person,  - c(ID, text)) %>%
  filter(value) %>%
  select(-value)
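
As a side note, `gather()` has since been superseded in tidyr by `pivot_longer()`. A minimal sketch of the same pipeline with the newer function might look like the following; the names `text_df`, `long` and `matched` are my own placeholders, and I have used `stringr::str_detect()` instead of stringi purely for illustration:

```r
library(dplyr)
library(tidyr)
library(stringr)

text_df <- tibble(ID = 1:5,
                  text = c("lucy, sarah and paul live on the same street",
                           "lucy has only moved here recently",
                           "lucy and sarah are cousins",
                           "john is also new to the area",
                           "paul and john have known each other a long time"))

long <- text_df %>%
  # One logical indicator column per pattern
  mutate(lucy = str_detect(text, "lucy"),
         sarah = str_detect(text, "sarah"),
         `paul or john` = str_detect(text, "paul|john")) %>%
  # Stack the indicator columns: one row per (ID, pattern) pair
  pivot_longer(cols = -c(ID, text), names_to = "person", values_to = "matched") %>%
  # Keep only the pairs that actually matched, then drop the indicator
  filter(matched) %>%
  select(-matched)

long
```

Because `pivot_longer()` emits the new rows grouped by original row, the result comes out ordered by ID without an extra sort.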

Answer 2

Disclaimer: this is not an idiomatic data.table solution

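I would build a helper function like the following, which takes a single row and the patterns as input and returns a new dt with N rows (one per matching pattern):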

library(data.table)
library(tidyverse)

new_rows <- function(dtRow, patterns = text_patterns){

    res <- map(patterns, function(word) {

        textField <- grep(x = dtRow[1, text], pattern = word, value = TRUE) %>% 
            ifelse(is.character(.), ., NA)

        personField   <- str_extract(string = dtRow[1, text], pattern = word) %>% 
            ifelse(  . == "paul" | . == "john", "paul or john", .)

        idField <- ifelse(is.na(textField), NA, dtRow[1, ID])

        data.table(ID = idField, text = textField, person = personField) 

        }) %>% 
        rbindlist()

    res[!is.na(text), ]
}

I would run it like this:

split(text_table, f = text_table[['ID']]) %>% 
    map_df(function(r) new_rows(dtRow = r))

The result is:

   ID                                            text       person
1:  1    lucy, sarah and paul live on the same street         lucy
2:  1    lucy, sarah and paul live on the same street        sarah
3:  1    lucy, sarah and paul live on the same street paul or john
4:  2               lucy has only moved here recently         lucy
5:  3                      lucy and sarah are cousins         lucy
6:  3                      lucy and sarah are cousins        sarah
7:  4                    john is also new to the area paul or john
8:  5 paul and john have known each other a long time paul or john

which looks like your required_table (including the repeated IDs):

   ID                                            text       person
1:  1    lucy, sarah and paul live on the same street         lucy
2:  1    lucy, sarah and paul live on the same street        sarah
3:  1    lucy, sarah and paul live on the same street paul or john
4:  2               lucy has only moved here recently         lucy
5:  3                      lucy and sarah are cousins         lucy
6:  3                      lucy and sarah are cousins        sarah
7:  4                    john is also new to the area paul or john
8:  5 paul and john have known each other a long time paul or john

Expand a data.table so that there is one row per pattern match for each ID
