Text mining in R - removing lines from a text file that start with keywords

Asked: 2017-01-10 18:03:13

Tags: r pdf text-mining

I am reading a text file into R as follows:

test<-readLines("D:/AAPL MSFT Earnings Calls/Test/Test.txt")

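This file was converted from a PDF and still contains some header data that I want to remove. Those lines start with words such as "Page", "Market Cap", and so on.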

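How can I remove all lines in the TXT file that start with these keywords? This is different from removing every line that merely contains the word.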

Using one of the answers below, I modified it slightly so that it reads in my own file:

setwd("C:/Users/George/Google Drive/PhD/Strategic agility/Source Data/Peripherals Earnings Calls 2016")
text1<-readLines("test.txt")
text

library(purrr)
library(stringr)
text1 <- "foo
Page, bar
baz
Market Cap, qux"
text1 <- readLines(con = textConnection(file))
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")
text1 %>% discard(~ any(str_detect(.x, ignore_patterns)))

text1

Here is the output I get:

> text1
[1] "foo"             "Page, bar"       "baz"             "Market Cap, qux"

What are the foo / baz / qux characters? Thanks.

2 Answers:

Answer 0 (score: 1)

# Once the lines have been read and stored in a data.frame,
# subset it as follows:
x = grepl("^(Page|Market Cap)", df$id) # df is your data.frame and 'id' is the
                                       # column that holds the unwanted keywords
df <- df[!x, , drop = FALSE]  # drop = FALSE keeps a one-column df a data.frame

^ anchors the match to the start of the string. So if a line starts with Page or (|) Market Cap, grepl returns TRUE.
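
The same pattern can also be applied directly to the character vector that readLines returns, without going through a data.frame. The following is only a minimal sketch in base R: the sample lines and the names lines, is_header and cleaned are invented for illustration and are not taken from the real file.

```r
# Hypothetical sample standing in for the text converted from the PDF.
lines <- c("foo", "Page 3 of 12", "baz", "Market Cap: 600B",
           "See Page 4 for details")

# TRUE for every line that begins with one of the keywords;
# ^ anchors the match to the start of the line.
is_header <- grepl("^(Page|Market Cap)", lines)

# Keep only the other lines. A line that merely contains "Page"
# somewhere in the middle is kept.
cleaned <- lines[!is_header]
print(cleaned)
```

With the real file, lines would presumably come from readLines on the path given in the question, and the result could be written back out with writeLines.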

Answer 1 (score: 0)

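library(purrr)    # discard() and the %>% pipe
library(stringr)  # str_detect()
# Dummy text standing in for the contents of the real file
file <- "foo
Page, bar
baz
Market Cap, qux"
test <- readLines(con = textConnection(file))
# One regular expression per unwanted line start
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")
# Drop every line that matches at least one pattern; returns "foo" "baz"
test %>% discard(~ any(str_detect(.x, ignore_patterns)))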
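
On the follow-up in the question: foo, baz and qux are just placeholder strings in the sample text, not something read from the real file. The printed text1 is probably unchanged because discard() returns a new vector rather than modifying its input, so the result has to be assigned. A minimal sketch, assuming the same dummy lines as above:

```r
library(purrr)
library(stringr)

# Placeholder lines; foo, baz and qux are dummy values, not real data.
text1 <- c("foo", "Page, bar", "baz", "Market Cap, qux")
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")

# discard() does not change text1 in place, so store what it returns.
text1 <- discard(text1, ~ any(str_detect(.x, ignore_patterns)))
print(text1)
```

These patterns require a comma right after the keyword because the dummy text has one. For a real transcript the patterns would likely need adjusting, for example to "^Page" if the header lines have no comma.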