Pipe-delimited file where a paragraph-text field contains line breaks spanning multiple lines

Time: 2020-05-30 14:58:18

Tags: awk sed

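I have a large TXT dataset delimited by |, but one field allows paragraph text that contains line breaks and blank lines. All lines that are not part of the paragraph text begin with AA|. When I try to import it into R with readr, these values become NA because the file doesn't follow the expected structure.

Is it possible to use sed or awk to take each line that does not begin with AA| and append it to the previous line, separated by a space?

Input: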

AA|5904060|9001084471200270|9000263372600200|Result Comment:
No (1, 3) Beta-D-Glucan detected.  

This assay does not detect certain fungi, including 
Cryptococcus species, which produce very low levels of (1, 
3) Beta-D-Glucan (BDG) and the Mucorales (e.g., Lichthemia, 
Mucor and Rhizopus), which are not known to produce BDG. 
Additionally, the yeast phase of Blastomyces dermatitidis 
produces little BDG and may not be detected by this assay.
|North Building|0|0

Target output:

AA|5904060|9001084471200270|9000263372600200|Result Comment: No (1, 3) Beta-D-Glucan detected.  This assay does not detect certain fungi, including Cryptococcus species, which produce very low levels of (1, 3) Beta-D-Glucan (BDG) and the Mucorales (e.g., Lichthemia, Mucor and Rhizopus), which are not known to produce BDG. Additionally, the yeast phase of Blastomyces dermatitidis produces little BDG and may not be detected by this assay.|North Building|0|0

3 answers:

Answer 0: (score: 1)

With gawk, I would do something like this:

awk 'BEGIN {RS="(\n|^)AA\\|"} NR>1 {print "AA|" gensub("\n"," ","g")}' myfile.txt

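Explanation: the record separator is the literal string AA|, but only when it is found at the beginning of a line (or at the very start of the file). Assuming the first line starts with AA|, this causes an empty record to be found first, which we discard; we process records from 2 to the end (NR>1). In each record (as delimited by this odd separator), replace every newline with a space and print AA| in front of the record (recall that the AA| present in the input file is the record separator, so it is no longer in the record itself).

The newline at the end of each record (immediately preceding the AA| of the next line) is swallowed by the record separator, so there is no stray space at the end of each output line, except for the last record, which is not terminated by a "newline AA|" separator. The final newline in the file survives and is converted to a space in the output; if the extra space at the end of the last record messes up your data, you'll have to fix that. (Not shown above.)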
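
The trailing space on the last record, which the answer leaves unfixed, could be trimmed along these lines. This is only a minimal sketch and not the answerer's code: the function name join_aa_records is made up for illustration, and it assumes gawk is installed (gensub and this use of a regex RS are relied on just as in the answer).

```shell
# Hypothetical helper (not from the original answer): join continuation
# lines onto the preceding AA| record, then trim trailing blanks.
join_aa_records() {
  gawk 'BEGIN { RS = "(\n|^)AA\\|" }
        NR > 1 {
          rec = gensub(/\n/, " ", "g")   # newlines inside the record become spaces
          sub(/[ \t]+$/, "", rec)        # drop the space left by the final newline
          print "AA|" rec
        }' "$@"
}
```

It would be called as join_aa_records myfile.txt, or with the data on standard input. Trimming every record rather than only the last one also covers records that are followed by blank lines.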

Answer 1: (score: 0)

Try:

#!/bin/bash
awk '
  /^AA\|/ { if (r) print r; r = $0; next }
  { r = r " " $0 }
  END { print r }
' input

If you want to avoid the extra spaces, you can add gsub (/  /, " ", r) (two spaces replaced by one) to the code above, like this:

awk '
  /^AA\|/ { if (r) print r; r = $0; next }
  { r = r " " $0; gsub (/  /, " ", r) }
  END { print r }
' input
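
A variation on the same idea collapses every run of blanks in a single pass when the record is printed, instead of calling gsub after each appended line. This is a sketch under the same assumption that every record starts with AA|; the names join_and_squeeze and emit are illustrative and do not come from the answer.

```shell
# Hypothetical variant (names are illustrative): build each record as before,
# but squeeze runs of blanks once, just before printing.
join_and_squeeze() {
  awk '
    function emit() {
      if (r == "") return
      gsub(/[ \t]+/, " ", r)   # any run of blanks becomes one space
      sub(/ $/, "", r)         # no trailing space at the end of the record
      print r
    }
    /^AA\|/ { emit(); r = $0; next }
    { r = r " " $0 }
    END { emit() }
  ' "$@"
}
```

Called as join_and_squeeze input, it prints one line per AA| record. A space that ends up directly before a | (when a continuation line starts with |) is still kept, so the result may differ from the target output in that detail.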

Answer 2: (score: 0)

With GNU awk for multi-char RS and RT, and assuming you know how many fields should be in each record (8):

[code not preserved]

Otherwise, if you don't have GNU awk or only know that all records begin with AA|, then using any awk:

$ awk '
    /^AA\|/ { if (NR>1) prt(); rec="" }
    { rec = rec OFS $0 }
    END { prt() }
    function prt(o) { o=$0; $0=rec; $1=$1; gsub(/[[:space:]]*[|][[:space:]]*/,"|"); print; $0=o }
' file
AA|5904060|9001084471200270|9000263372600200|Result Comment: No (1, 3) Beta-D-Glucan detected. This assay does not detect certain fungi, including Cryptococcus species, which produce very low levels of (1, 3) Beta-D-Glucan (BDG) and the Mucorales (e.g., Lichthemia, Mucor and Rhizopus), which are not known to produce BDG. Additionally, the yeast phase of Blastomyces dermatitidis produces little BDG and may not be detected by this assay.|North Building|0|0
