Replacing tabs and line breaks in R

Asked: 2018-06-07 14:45:25

Tags: r string text line-breaks stringr

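I am cleaning a large text file to read into R. Almost every line is tab-delimited, but some of the long quotes also contain line breaks. I am using the tabs to split the document into a data frame with a speaker column and a comment column, and those line breaks break my formatting: R treats each line as a new speaker, then reports the speaker as *NA* when it cannot find a tab. Here is a sample of what I have: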

Interviewer: How are you?

Subject: I'm just incredibly frustrated.
*NA* Really, R is frustrating me.
*NA* But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

This is what I want:

Interviewer: How are you?

Subject: I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

I am reading the file in like this:

atas <- stri_read_lines("ATAS2.txt") %>% str_replace_all("\t", "TABS_TO_BE_DELETED")

(FYI, I put that placeholder string in because R kept deleting the tabs when I turned the text document into a data frame.)

Now, to remove the line breaks, I have tried:

atas2 <- gsub("\r?\n|\r", " ", atas) 

atas2 <- str_replace_all(atas, "\n" , " ")

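I also can't fix this by simply deleting all special characters or formatting. If I have to delete all non-alphanumeric characters, I need to keep the tabs (at least long enough to put in some obscure string that we can split on later), as well as [] and ().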

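I would like it to ignore those line breaks or somehow merge the lines together. The one caveat to simply merging every line that does not match is that some lines stand on their own, with no speaker, and need to have no attribution in the speaker column, for example (but not limited to):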

(Laughter)

Interview 41

[Inaudible cross-talk]

Thanks for any help you can offer!

2 Answers:

Answer 0 (score: 0)

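You could take a slightly different approach and do something like this. Note that you usually have to double-escape special characters in R regular expressions (the first backslash escapes the second).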

library(stringr)

#read in text as a single string
text <- "Interviewer: How are you?
Subject: I'm just incredibly frustrated. 
    Really, R is frustrating me. 
    But maybe someone has a solution for me?
Interviewer: Fortunately, I have an answer for you."

#add `#` markers to separate text before and after speaker followed by colon 
text2 <- str_replace_all(text, "(\\w+?\\:)", "#\\1#")

#split at markers, remove first blank element, and cast as a 2-column data frame
text3 <- as.data.frame(matrix(str_split(text2, "#")[[1]][-1], ncol=2, byrow=TRUE))

#remove line breaks, tabs etc
text3$V2 <- str_replace_all(text3$V2, "[\\r\\n\\t]+", " ")

#remove excessive white space
text3$V2 <- str_trim(str_replace_all(text3$V2, "\\s+", " "))

text3
            V1                                                                                                    V2
1 Interviewer:                                                                                          How are you?
2     Subject: I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?
3 Interviewer:                                                                Fortunately, I have an answer for you.
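
The same idea can probably be adapted to the tab-delimited file from the question, where the text arrives as one element per line rather than as a single string. What follows is only a sketch under assumptions the question does not confirm: the `lines` vector is made-up sample data, speaker and comment are separated by the first tab on a line, and stand-alone lines are recognised by a hypothetical pattern (they start with `(` or `[`, or look like `Interview <number>`). Any other line without a tab is treated as a continuation of the line before it.

```r
# Made-up sample standing in for the lines of the real file
lines <- c(
  "Interview 41",
  "Interviewer\tHow are you?",
  "Subject\tI'm just incredibly frustrated.",
  "Really, R is frustrating me.",
  "But maybe someone has a solution for me?",
  "(Laughter)",
  "Interviewer\tFortunately, I have an answer for you."
)

has_tab <- grepl("\t", lines, fixed = TRUE)

# Assumption: stand-alone lines look like "(...)", "[...]" or "Interview 41"
standalone <- grepl("^\\s*(\\(|\\[)|^Interview \\d+$", lines, perl = TRUE)

# A continuation is a line with no tab that is not a stand-alone line
continuation <- !has_tab & !standalone

# Every non-continuation line starts a new group; continuations join the group before them
group <- cumsum(!continuation)
merged <- as.vector(tapply(lines, group, paste, collapse = " "))

# Split each merged line at its first tab; lines without a tab get an NA speaker
has_speaker <- grepl("\t", merged, fixed = TRUE)
speaker <- ifelse(has_speaker, sub("\t.*$", "", merged), NA_character_)
comment <- sub("^[^\t]*\t", "", merged)

result <- data.frame(speaker = speaker, comment = comment, stringsAsFactors = FALSE)
result
```

The stand-alone pattern is the fragile part: it has to be extended by hand for every kind of unattributed line in the real file, otherwise such a line gets glued onto the previous comment.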

Answer 1 (score: 0)

If the output should look like the one Andrew Gustar shows, you can do:

read.csv(text = gsub("\\n(?!\\w+:)", "", text, perl = TRUE), sep = ":", header = FALSE)
           V1                                                                                                     V2
1 Interviewer                                                                                           How are you?
2     Subject  I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?
3 Interviewer                                                                 Fortunately, I have an answer for you.
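
To spell out what the pattern does: `\\n(?!\\w+:)` matches a line break only when it is not immediately followed by a word and a colon, so breaks inside a comment are removed while the break before the next speaker is kept. One thing to watch for is that `sep = ":"` separates fields at colons in general, so a comment that itself contains a colon would be split at that colon too. A possible variation, sketched here with made-up sample text and base R only, replaces the continuation breaks with a space and then splits each line at its first colon:

```r
# Made-up sample; the second comment wraps onto a new line and contains a colon
text <- paste(
  c("Interviewer: How are you?",
    "Subject: I'm frustrated.",
    "Here is why: R is frustrating me.",
    "Interviewer: I have an answer for you."),
  collapse = "\n"
)

# Replace every line break that is NOT followed by "word:" with a space
joined <- gsub("\\n(?!\\w+:)", " ", text, perl = TRUE)

# The remaining line breaks each mark the start of a new speaker
rows <- strsplit(joined, "\n", fixed = TRUE)[[1]]

# Split at the first colon only, so later colons stay inside the comment
speaker <- sub(":.*$", "", rows)
comment <- trimws(sub("^[^:]*:", "", rows))

result <- data.frame(speaker = speaker, comment = comment, stringsAsFactors = FALSE)
result
```

Like the original one-liner, this still mistakes a continuation line for a new speaker if that line happens to begin with a single word followed by a colon.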