Remove only some line breaks in R

Date: 2018-06-01 16:59:29

Tags: r text delimiter data-cleaning

I'm reading a text file into R, one row per line (the file is delimited by line breaks). I then separate each row into a speaker column and a comment column:

    text2 <- text %>%
      separate(X1, into = c("speaker", "comment"), sep = ":")

The result is a data frame with one column of speakers and another column of comments.

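The issue is that some of the long comments have line breaks embedded in them. This messes up the data structure: the text after an embedded line break ends up in the speaker column, and the comment column for that row gets an NA.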

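How can I tell R to ignore these embedded line breaks? If it helps, the columns are separated by a colon (i.e. Interviewer: How are you?), so there should be only one colon before the "true" line break.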

Thanks!

1 Answer:

Answer 0: (score: 0)

I'm going to work under the assumption that the input file looks like this:

textfile.txt

Interviewer: How are you?
Respondant: I'm fine.
Interviewer: The issue is that some of the long comments have line breaks
embedded in them. This messes up the data structure putting the comment after
the line break in the speaker column and then an NA in the comments section.
Respondant: How can I tell R to ignore these embedded line breaks? If it helps,
the columns are separated by a colon (i.e. Interviewer: How are you?), so there
should be only one colon before the "true" line break.

If so, this process should work:

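  1. Read the lines into a vector.
  2. Find the lines that start with a speaker's name.
  3. Assign every line to a group according to which pair of "start" lines it falls between.
  4. Combine each group of lines into a single comment block.
  5. Pull the speaker's name out of each comment block.
  6. Put it all in a data frame.

    library(stringi)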
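    library(dplyr)  # provides %>% and tibble()

    # Read the file, one element per line.
    text <- readLines("textfile.txt")
    # A speaker name is a run of word characters followed by a colon.
    speaker_pattern <- "^\\w+(?=:)"
    comment_starts <- which(stri_detect_regex(text, speaker_pattern))
    # Label each line with the number of "start" lines at or before it.
    comment_groups <- findInterval(seq_along(text), comment_starts)
    # Join each group's lines, keeping the embedded breaks as "\n".
    comments <- text %>%
      split(comment_groups) %>%
      vapply(FUN = paste0, FUN.VALUE = character(1), collapse = "\n")
    # Extract the speaker, then strip the "Name: " prefix from the comment.
    speakers <- stri_extract_first_regex(comments, speaker_pattern)
    comments <- stri_replace_first_regex(comments, "^\\w+: ", "")
    # data_frame() is deprecated; tibble() is its replacement.
    text2 <- tibble(speaker = speakers, comment = comments)

    text2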
    # # A tibble: 4 x 2
    #   speaker     comment                                            
    #   <chr>       <chr>                                              
    # 1 Interviewer How are you?                                       
    # 2 Respondant  I'm fine.                                          
    # 3 Interviewer "The issue is that some of the long comments have ~
    # 4 Respondant  "How can I tell R to ignore these embedded line br~
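
The grouping idea can also be sketched in base R, without stringi or dplyr. This is not the code above, only a minimal illustration of the same technique: a running count of "start" lines gives each continuation line the id of the turn it belongs to. The sample lines and the names `lines`, `is_start`, `turn_id`, `turns` and `result` are made up for the example. It assumes the first line opens a turn and that no continuation line itself begins with a word followed by a colon.

```r
# In-memory stand-in for readLines("textfile.txt"); the content is made up.
lines <- c(
  "Interviewer: How are you?",
  "Respondant: I'm fine, although my answer",
  "runs over two lines.",
  "Interviewer: Thanks."
)

# TRUE where a line opens a new turn, i.e. starts with "Name:".
is_start <- grepl("^\\w+:", lines)

# Running count of starts: a continuation line inherits the previous turn's id.
turn_id <- cumsum(is_start)

# Glue each turn's lines back together, keeping the embedded break as "\n".
turns <- unname(
  vapply(split(lines, turn_id), paste, character(1), collapse = "\n")
)

result <- data.frame(
  # Speaker names come from the start lines, which contain no line breaks.
  speaker = sub(":.*$", "", lines[is_start]),
  # Drop the leading "Name: " from each reassembled turn.
  comment = sub("^\\w+:\\s*", "", turns),
  stringsAsFactors = FALSE
)
```

If a comment can legitimately start a continuation line with something like "Note:", the start pattern would need to be narrowed, for example to a fixed set of known speaker names.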