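I am trying to separate the following data into name, timestamp and text. Currently the entire data is held in a single column of a data frame; this column is called Text1. This is what it looks like: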
text
First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text
This is what I have done so far:
text$specificname = str_split_fixed(text$text, ":", 2)
It creates the following:
text                                            specific name
First Name: 00:03 Welcome Back text text text   First Name
First Name 2: 00:54 Text Text Text              First Name2
First Name 3: 01:24 Text Text Text              First Name 3
How should I handle the timestamp and the text? Is this the best way to do it?
Edit 1: This is how I imported the data
library(rvest)  # provides read_html(), html_nodes() and html_text()
# Specify the URL of the page to be scraped
url = 'https://www.rev.com/blog/transcript-of-july-democratic-debate-night-1-full-transcript-july-30-2019'
# Read the HTML code from the website
wp = read_html(url)
# Select all <p> nodes
alltext = html_nodes(wp, 'p')
# Convert the nodes to text, then to a data frame
alltext = html_text(alltext)
text = data.frame(alltext)
Answer 0 (score: 0)
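Assuming that text is in the form shown in the Note at the end, i.e. a character vector with one component per line in which the fields are separated by two or more spaces, we can replace each run of two or more spaces with a comma and read the result with read.table:
read.table(text = gsub("  +", ",", text), sep = ",", as.is = TRUE)
giving this data.frame:
             V1    V2                          V3
1   First Name: 00:03 Welcome Back text text text
2 First Name 2: 00:54              Text Text Text
3 First Name 3: 01:24              Text Text Text
Note
Lines <- "First Name:  00:03  Welcome Back text text text
First Name 2:  00:54  Text Text Text
First Name 3:  01:24  Text Text Text"
text <- readLines(textConnection(Lines))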
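As an alternative that does not depend on the spacing between fields, the three parts can be captured with a single regular expression. This is only a sketch: it assumes every line has the shape name, colon, mm:ss timestamp, text. The vector `lines`, the pattern `rx` and the column names are illustrative and do not come from the question. It uses `strcapture` from the utils package, available in R 3.4.0 and later.

```r
# Hypothetical sample input, shaped like the lines in the question
lines <- c("First Name: 00:03 Welcome Back text text text",
           "First Name 2: 00:54 Text Text Text",
           "First Name 3: 01:24 Text Text Text")

# Group 1: name (everything before the colon that precedes the timestamp)
# Group 2: mm:ss timestamp
# Group 3: the remaining text
rx <- "^(.*?):\\s*(\\d\\d:\\d\\d)\\s*(.*)$"

# proto supplies the (assumed) column names and types of the result
parts <- strcapture(rx, lines,
                    proto = data.frame(name = character(),
                                       time = character(),
                                       text = character(),
                                       stringsAsFactors = FALSE),
                    perl = TRUE)
parts
```

Lines that do not match the pattern come back as rows of NA, so they can be removed afterwards with something like `parts[!is.na(parts$time), ]`.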
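For the EDIT that was added to the question, define a regular expression pat that matches optional whitespace, 2 digits, a colon, 2 digits and possibly more whitespace. Then use grep to keep only the lines that match it, giving tt. In each of those lines, replace the match with @, the matched timestamp (without the surrounding whitespace) and @, giving g. Finally, read it in using @ as the field separator, giving DF.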
pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)
g <- sub(pat, "@\\1@", tt)
DF <- read.table(text = g, sep = "@", quote = "", as.is = TRUE)
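To see the steps on a small input, here is a sketch that runs the same pipeline on a made-up vector. The vector `alltext`, the column names and the final cleanup of the trailing colon are assumptions added for illustration; `comment.char = ""` is an extra guard so that a `#` in the transcript is not treated as a comment.

```r
# Hypothetical stand-in for text$alltext: two transcript lines plus one
# paragraph without a timestamp, which grep should drop
alltext <- c("First Name: 00:03 Welcome Back text text text",
             "Some paragraph without a timestamp",
             "First Name 2: 00:54 Text Text Text")

pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, alltext, value = TRUE)  # keep only lines with a timestamp
g <- sub(pat, "@\\1@", tt)              # e.g. "First Name:@00:03@Welcome ..."

DF <- read.table(text = g, sep = "@", quote = "", as.is = TRUE,
                 comment.char = "",
                 col.names = c("name", "time", "text"))  # assumed names

DF$name <- sub(":$", "", DF$name)       # drop the colon left after the name
DF
```

This relies on @ not occurring anywhere in the transcript; if it might, a different separator character that is absent from the data would be needed.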