Search and replace a string in a very large file

Asked: 2016-01-25 12:03:55

Tags: json perl awk large-files data-manipulation

My preference is a shell command to get the job done. I have a very, very large file, about 2.8 GB, and the content is JSON. Everything is on a single line, and I was told there are at least 1.5 million records in there.

I must prepare the file for consumption. Each record must be on its own line. Sample:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}

Or, with content like the following...

{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale@hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva 
Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}

The end result should be:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Commands attempted:

  • sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
  • awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat

The attempted commands work perfectly on small files. But they do not work on the 2.8 GB file I have to manipulate. For no apparent reason, sed quit midway through after 10 minutes without having done anything. awk errored out with a segmentation fault (core dump) after many hours. I tried a Perl search and replace and got an "Out of memory" error.

Any help/ideas would be great!

Additional information about my machine:

  • More than 105 GB of available disk space.
  • 8 GB of RAM
  • 4-core CPU
  • Running Ubuntu 14.04

5 Answers:

Answer 0 (score: 4)

Since you have tagged your question with sed, awk, and perl, I gather that what you really need is a recommendation for a tool. While that is somewhat off-topic, I believe jq is something you could use for this. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.

Assuming content like the following (based on your sample):

{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}

You can easily reformat this to "prettify" it:

$ jq '.' < data.json
{
  "RomanCharacters": {
    "Alphabet": [
      {
        "RecordId": "1",
        "data": "data"
      },
      {
        "RecordId": "2",
        "data": "data"
      },
      {
        "RecordId": "3",
        "data": "data"
      },
      {
        "RecordId": "4",
        "data": "data"
      },
      {
        "RecordId": "5",
        "data": "data"
      }
    ]
  }
}

And we can dig into the data to retrieve only the records you are interested in (regardless of what they are wrapped in):

$ jq '.[][][]' < data.json
{
  "RecordId": "1",
  "data": "data"
}
{
  "RecordId": "2",
  "data": "data"
}
{
  "RecordId": "3",
  "data": "data"
}
{
  "RecordId": "4",
  "data": "data"
}
{
  "RecordId": "5",
  "data": "data"
}

This content is much more readable, both to humans and to tools like awk that process content line by line. If, per your question, you want to join your lines for processing, the awk becomes much simpler:

$ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
{ "RecordId": "1", "data": "data" }
{ "RecordId": "2", "data": "data" }
{ "RecordId": "3", "data": "data" }
{ "RecordId": "4", "data": "data" }
{ "RecordId": "5", "data": "data" }

Or, as @peak suggested in the comments, eliminate the awk portion entirely by using jq's -c (compact output) option:
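A sketch of that compact form, assuming the same hypothetical data.json as above:

$ jq -c '.[][][]' < data.json
{"RecordId":"1","data":"data"}
{"RecordId":"2","data":"data"}
{"RecordId":"3","data":"data"}
{"RecordId":"4","data":"data"}
{"RecordId":"5","data":"data"}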

Answer 1 (score: 3)

Regarding perl: try setting the input record separator $/ to "}," like this:

#!/usr/bin/perl
$/ = "},";
while (<>) {
   print "$_\n";
}

Or, as a one-liner:

$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat 
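A toy run to show the effect (hypothetical input, not the OP's actual data): each "}," becomes a line break, with the separator kept at the end of the record it terminates:

$ echo '{"Alphabet":[{"RecordId":"1"},{"RecordId":"2"},{"RecordId":"3"}]}' | perl -e '$/="},";while(<>){print "$_\n"}'
{"Alphabet":[{"RecordId":"1"},
{"RecordId":"2"},
{"RecordId":"3"}]}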

Answer 2 (score: 2)

Try using } as the record separator, e.g. in Perl:

perl -l -0175 -ne 'print $_, $/' < input

You may need to glue together lines containing only }.
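To unpack the flags: -0175 sets the input record separator $/ to octal 175, which is the } character, and -l (coming before -0175 on the command line) chomps that separator from each record and makes print append a newline, so print $_, $/ re-attaches the } and breaks the line after it. A toy run (hypothetical input, not the OP's data):

$ printf '%s' '{"Alphabet":[{"RecordId":"1"},{"RecordId":"2"}]}' | perl -l -0175 -ne 'print $_, $/'
{"Alphabet":[{"RecordId":"1"}
,{"RecordId":"2"}
]}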

Answer 3 (score: 2)

This avoids the memory problem by not treating the data as a single record, but it may go too far in the other direction performance-wise (it processes a single character at a time). The idea is to track brace-nesting depth and emit a newline at each comma that sits at record depth. Also note that the built-in RT variable (the value of the current record separator) requires gawk:

$ cat j.awk
BEGIN { RS="[[:print:]]" }          # every printable character is its own record
RT == "{" { bal++ }                 # opening brace: one level deeper
RT == "}" { bal-- }                 # closing brace: one level back out
{ printf "%s", RT }                 # echo the character just read
RT == "," && bal == 2 { print "" }  # a comma at brace depth 2 ends a record: newline
END { print "" }                    # final newline

$ gawk -f j.awk j.txt
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Answer 4 (score: 0)

Using the sample data provided here (the one that starts with {"Accounts":{"Customer"...), the solution to this problem is one that reads the file and, as it reads, counts the number of delimiters defined in $/. For every 10,000 delimiters counted, it writes out to a new file, and for every delimiter found it inserts a newline. Here is what the script looks like:

#!/usr/bin/perl

$base = "/home/dat789/incoming";
#$_ = "sample.dat";

$/ = "}]},";          # delimiter to find and insert a new line after
$n = 0;
$match = "";
$filecount = 0;
$recsPerFile = 10000; # set number of records in a file

print "Processing " . ($ARGV[0] // "STDIN") . "\n";

while (<>) {
   $match = $match . $_ . "\n";   # accumulate this record, newline after the delimiter
   $n++;
   print ".";                     # so that we'd know it has done something
   if ($n >= $recsPerFile) {
      my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
      open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
      print OUTPUT $match;
      close(OUTPUT);
      $match = "";
      $filecount++;
      $n = 0;
      print "Wrote file " . $newfile . "\n";
   }
}

# write out whatever is left after the last full batch of records
if ($match ne "") {
   my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
   open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
   print OUTPUT $match;
   close(OUTPUT);
   print "Wrote file " . $newfile . "\n";
}

print "Finished\n\n";

I have used this script against the big 2.8 GB file, whose content was unformatted one-line JSON. The resulting output files lack the proper JSON headers and footers, but that is easily fixed.
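As a rough sketch of that fix-up (the header and footer strings here are assumptions based on the {"Accounts":{"Customer":[... sample above; the exact strings and any trailing "}]}," delimiters would need checking against the real data):

$ for f in partfile10000-*.dat; do
    { echo '{"Accounts":{"Customer":['; cat "$f"; echo ']}}'; } > "wrapped-$f"
  done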

Thank you so much for your contributions!