Question

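I have several large CSV files that I need to search using one or more parameters; whenever I find a hit, I need to save that line to another file. Below is a sample of Perl code that runs successfully but is very slow on a 5 GB file. Any suggestions for speeding this up would be greatly appreciated.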

#!/usr/bin/env perl
use Text::CSV_XS;

$numArgs = $#ARGV;

#First Parameter is the input file name
$Finput = $ARGV[0];
chomp($Finput);

#Second Parameter is the output file name
$Foutput = $ARGV[1];
chomp($Foutput);

# Open the Control file but quit if it doesn't exist
open(INPUT1, $Finput) or die "The Input File $Finput could not be found.\n";
open(OUTPUT1, ">$Foutput") or die "Cannot open output $Foutout file.\n";


my $csv = Text::CSV_XS->new();
open my $FH, "<", $Finput;

while (<$FH>) {
    $csv->parse($_);
    my @fields = $csv->fields;

    if ($fields[0] == 10000) {
        if ($fields[34] eq 'abcdef') {
            if ($fields[103] == 9999) {
                print OUTPUT1 "$_\n";
            }
        }
    }
}

Answer 1

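I don't know your data or your criteria.

But if we can go by the example given above, I would try trivial tests against each line before putting it through CSV processing.

For example (note that my Perl is terrible; this is illustrative, not exact):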

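# Cheap pre-filter: only lines containing all three values, in order, get parsed.
if (/10000.*abcdef.*9999/) {
    $csv->parse($_);
    my @fields = $csv->fields;
    if ($fields[0] == 10000) {
        ...
    }
}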

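Basically, you run some simpler, faster checks to disqualify rows quickly, before doing the extra processing needed to qualify them.

Clearly, if more of your rows match than don't, or if a simple qualification check isn't practical, this technique won't help.

Done right, CSV parsing is somewhat expensive. (In fact, your code wrongly assumes that one line of CSV is one record. That may be true for your data, but CSV allows embedded newlines, so it is not an assumption that holds for all CSV.)

So if a line is not going to match "at a glance" anyway, it is nice not to pay the price of parsing it.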
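To make the embedded-newline caveat concrete, here is a minimal sketch that counts physical lines and CSV records in the same data. The sample data and variable names are invented for illustration, and it assumes Text::CSV is installed; it is not taken from the asker's files.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical sample data: two records, the first with a newline inside a quoted field.
my $data = qq{10000,"first line\nsecond line",9999\n} . qq{20000,plain,1\n};

# binary => 1 allows embedded newlines inside quoted fields.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

# Count physical lines, which is what a plain `while (<$FH>)` loop iterates over.
open my $line_fh, '<', \$data or die "Cannot open in-memory data: $!";
my $physical_lines = () = <$line_fh>;
close $line_fh;

# Count logical CSV records, which is what getline returns.
open my $rec_fh, '<', \$data or die "Cannot open in-memory data: $!";
my $records = 0;
$records++ while $csv->getline($rec_fh);
close $rec_fh;

print "physical lines: $physical_lines, CSV records: $records\n";
```

When the two counts differ, a line-based pre-filter sees only fragments of some records, so it is only safe on data known to have no embedded newlines.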

Answer 2

This is the code that runs "successfully"? I find that hard to believe.

if ($fields[0] = 10000) {
    if ($fields[34] = 'abcdef') {
        if ($fields[103] = 9999) {

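These are not equality checks but assignments. All of these if clauses will always be true. What you probably wanted is == and eq, not =.

You are also opening two filehandles on the input file and using the CSV module the wrong way. I'm not convinced these minor errors make the script too slow, but it will print every record in that 5 GB file.

Here is a revised version of your script.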
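A small sketch of the difference, using a made-up two-element row rather than real data. With warnings enabled, perl should also flag the assignment with "Found = in conditional, should be ==".

```perl
use strict;
use warnings;

# Hypothetical sample row, invented for this illustration; it matches nothing.
my @fields = (42, 'xyz');

# Assignment: stores 10000 in $fields[0] and evaluates to 10000, which is true.
my $assigned = ($fields[0] = 10000) ? 1 : 0;

# Reset the row, then compare without modifying anything.
@fields = (42, 'xyz');
my $numeric_match = ($fields[0] == 10000)    ? 1 : 0;    # numeric comparison
my $string_match  = ($fields[1] eq 'abcdef') ? 1 : 0;    # string comparison

print "assignment is true: $assigned\n";
print "== matches: $numeric_match, eq matches: $string_match\n";
```

The assignment succeeds even though the row does not contain 10000, which is why every record would be written to the output file.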

use strict;
use warnings;
use Text::CSV;
use autodie;

my $Finput = $ARGV[0];
my $Foutput = $ARGV[1];

open my $FH, "<", $Finput;
open my $out, ">", $Foutput;

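# binary => 1 lets quoted fields contain embedded newlines; auto_diag reports parse errors.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });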

while (my $row = $csv->getline($FH)) {
    my @fields = @$row;
    if ($fields[0] == 10000) {
        if ($fields[34] eq 'abcdef') {
            if ($fields[103] == 9999) {
                $csv->print($out, $row);
                print {$out} "\n";    # print() adds no line ending unless eol is set
            }
        }
    }
}

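The autodie pragma takes care of checking the return value from open (among other things). use strict; use warnings; will make our brains hurt less. Oh, and I am using Text::CSV, not the _XS version.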
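Since the question mentions one to many search parameters, the nested if statements could instead be driven by a table of criteria. The following is only a sketch under that assumption: the %criteria hash and the row_matches helper are hypothetical names, and the column indexes and values simply mirror the hard-coded example.

```perl
use strict;
use warnings;

# Hypothetical criteria table: column index => test applied to that field.
my %criteria = (
    0   => sub { $_[0] == 10000 },
    34  => sub { $_[0] eq 'abcdef' },
    103 => sub { $_[0] == 9999 },
);

# Return true only if every criterion holds for the given row (an array ref).
sub row_matches {
    my ($row, $criteria) = @_;
    for my $index (keys %$criteria) {
        my $value = $row->[$index];
        return 0 unless defined $value && $criteria->{$index}->($value);
    }
    return 1;
}

# Tiny self-check with a made-up 104-column row.
my @row = ('') x 104;
@row[0, 34, 103] = (10000, 'abcdef', 9999);
print row_matches(\@row, \%criteria) ? "match\n" : "no match\n";
```

Inside the getline loop, the three nested conditions would then collapse to a single call such as row_matches($row, \%criteria), and adding or removing a search parameter means editing only the table.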

Answer 3

You want to run grep "{searchstring}" filename1.csv filename2.csv > savefile.txt on each file. Or perhaps you want to read filename.csv line by line:

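#!/bin/bash
# Read filename.csv line by line on file descriptor 3 and append matching lines to result.txt.
exec 3<filename.csv
: > result.txt    # start with an empty result file
while IFS= read -r haystack <&3
do
  # Feed the line itself to grep; passing it as an argument would treat it as a file name.
  printf '%s\n' "$haystack" | grep "{needle}" >> result.txt
done
exec 3<&-    # close the descriptor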

Searching large CSV files with multiple search criteria on Unix
