Question

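I am trying to split the following data into name, timestamp, and text. Currently all of the data sits in a single column of a data frame; that column is called Text1. This is what it looks like: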

text

First Name:          00:03       Welcome Back text text text
First Name 2:        00:54       Text Text Text
First Name 3:        01:24       Text Text Text

This is what I have done so far:

library(stringr)
text$specificname = str_split_fixed(text$text, ":", 2)

It produces the following:

text                                                            specificname

First Name:          00:03       Welcome Back text text text    First Name
First Name 2:        00:54       Text Text Text                 First Name 2
First Name 3:        01:24       Text Text Text                 First Name 3

How should I handle the timestamp and the text? Is this the best approach?

Edit 1: This is how I imported the data


library(rvest)

# Specify the URL of the page to be scraped
url = 'https://www.rev.com/blog/transcript-of-july-democratic-debate-night-1-full-transcript-july-30-2019'

# Read the HTML code from the website
wp = read_html(url)

# Select all <p> (paragraph) nodes
alltext = html_nodes(wp, 'p')

# Convert the nodes to text, then to a data frame
alltext = html_text(alltext)
text = data.frame(alltext)

Answer 1

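Assuming text has the form shown in the Note at the end, i.e. a character vector with one component per line, we can use read.table. This replaces each run of two or more spaces with a comma and reads the result as comma-separated fields, so it relies on the text containing no commas of its own: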

read.table(text = gsub("  +", ",", text), sep = ",", as.is = TRUE)

giving this data.frame:

             V1    V2                          V3
1   First Name: 00:03 Welcome Back text text text
2 First Name 2: 00:54              Text Text Text
3 First Name 3: 01:24              Text Text Text
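
The result above still carries a trailing colon on the speaker name and the default column names V1 to V3. A minimal sketch of tidying that up, assuming the same sample input as in the Note; the column names name, time and text are illustrative choices, not part of the original answer:

```r
# Sample input in the same shape as the Note: one element per line
text <- c(
  "First Name:          00:03       Welcome Back text text text",
  "First Name 2:        00:54       Text Text Text",
  "First Name 3:        01:24       Text Text Text"
)

# Split on runs of two or more spaces, as in the answer
# (assumes the text itself contains no commas)
DF <- read.table(text = gsub("  +", ",", text), sep = ",", as.is = TRUE,
                 col.names = c("name", "time", "text"))

# Drop the trailing colon left on the speaker name
DF$name <- sub(":$", "", DF$name)
DF
```

With as.is = TRUE the timestamp column stays a character vector such as "00:03", so it can be parsed later in whatever way suits the analysis.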

Note

Lines <- "First Name:          00:03       Welcome Back text text text
First Name 2:        00:54       Text Text Text
First Name 3:        01:24       Text Text Text"

text <- readLines(textConnection(Lines))

Update

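Regarding the EDIT added to the question, define a regular expression pat that matches optional whitespace, 2 digits, a colon, 2 digits, and optionally more whitespace. Then use grep to keep only the lines that match it, giving tt. In each of those lines, replace the match with @, the timestamp (the matched pattern without its surrounding whitespace) and @, giving g. Finally, read g using @ as the field separator, giving DF.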

pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)
g <- sub(pat, "@\\1@", tt)
DF <- read.table(text = g, sep = "@", quote = "", as.is = TRUE)
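
To see the steps above on something small, here is a minimal sketch that runs the same pipeline on a hypothetical stand-in for the scraped data frame. The sample rows, the column names, the comment.char = "" argument and the final colon clean-up are assumptions added for illustration; they are not part of the original answer.

```r
# Hypothetical stand-in for the scraped data frame `text` with column `alltext`
text <- data.frame(
  alltext = c(
    "Some paragraph without a timestamp",
    "First Name:          00:03       Welcome Back text text text",
    "First Name 2:        00:54       Text Text Text"
  ),
  stringsAsFactors = FALSE
)

pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)  # keep only lines with a timestamp
g <- sub(pat, "@\\1@", tt)                   # e.g. "First Name:@00:03@Welcome ..."

# comment.char = "" is an added precaution so that a "#" in the transcript
# is not treated as the start of a comment
DF <- read.table(text = g, sep = "@", quote = "", as.is = TRUE,
                 comment.char = "", col.names = c("name", "time", "text"))

# Drop the trailing colon left on the speaker name
DF$name <- sub(":\\s*$", "", DF$name)
DF
```

The line without a timestamp is dropped by grep, so DF has one row per spoken turn. This approach assumes @ never appears in the transcript itself; if it does, another separator character that is absent from the text would be needed.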

Delimiting text from the Democratic debate
