Question

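I'm very new to Perl and was hoping someone could help me with this problem. I need to extract columns from a CSV file whose fields contain embedded commas. The format looks like this: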

"ID","URL","DATE","XXID","DATE-LONGFORMAT"

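I need to extract the DATE column, the XXID column, and the column immediately after XXID. Note that the rows do not necessarily all have the same number of columns.

The XXID column contains a 2-letter prefix that does not always start with the same letters. It can be almost any letter of the alphabet. The length is always the same.

Finally, after extracting these three columns, I need to sort on the XXID column and count the duplicates.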

Answer 1

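Here is a sample script that uses the Text::CSV module to parse your CSV data. Consult the module's documentation to find the settings that suit your data.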

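#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# Read rows from the __DATA__ section below; swap in a file handle for real input.
while (my $row = $csv->getline(*DATA)) {
    print "Date: $row->[2]\n";
    print "Col#1: $row->[3]\n";
    print "Col#2: $row->[4]\n";
}

# Placeholder row taken from the question; replace it with your own data.
__DATA__
"ID","URL","DATE","XXID","DATE-LONGFORMAT"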
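
The script above prints the three columns but stops short of the sorting and duplicate counting the question asks for. The following is a minimal sketch of that last step, assuming XXID is always the fourth column (index 3). The sample rows, the `count_xxids` helper, and the in-memory file handle are illustrative assumptions, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical sample rows; real input would come from your CSV file.
my $sample = <<'END';
"1","http://example.com/a,b","2011-01-01","AB12345","Jan 1, 2011"
"2","http://example.com/c","2011-01-02","CD67890","Jan 2, 2011"
"3","http://example.com/d","2011-01-03","AB12345","Jan 3, 2011"
END

# Count how often each XXID (assumed to be column index 3) occurs.
sub count_xxids {
    my ($text) = @_;
    my $csv = Text::CSV->new({ binary => 1 })
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

    # Read the string through an in-memory file handle.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

    my %count;
    while (my $row = $csv->getline($fh)) {
        next unless defined $row->[3];    # skip rows that are too short
        $count{ $row->[3] }++;
    }
    close $fh;
    return \%count;
}

# Print the IDs in sorted order with their repetition counts.
my $count = count_xxids($sample);
for my $id (sort keys %$count) {
    print "$id: $count->{$id}\n";
}
```

Any ID whose count is greater than 1 is a duplicate. If XXID is not always at index 3 in your data, the fixed index needs to be replaced by a per-row lookup.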

Answer 2

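I have published a module called Tie::Array::CSV, which lets Perl interact with your CSV as if it were a native Perl nested array. If you use it, you can take your search logic and apply it as though your data were already in an array of array references. Take a look!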

#!/usr/bin/env perl

use strict;
use warnings;

use File::Temp;
use Tie::Array::CSV;
use List::MoreUtils qw/first_index/;
use Data::Dumper;

# This builds a temporary file from the script's __DATA__ section (not shown here).
# Normally you would just set $file to the CSV filename.
my $file = File::Temp->new;
print $file <DATA>;
#########

tie my @csv, 'Tie::Array::CSV', $file;

#find column from data in first row
my $colnum = first_index { /^\w.{6}$/ } @{$csv[0]};
print "Using column: $colnum\n";

#extract that column
my @column = map { $csv[$_][$colnum] } (0..$#csv);

#build a hash of repetitions
my %reps;
$reps{$_}++ for @column;

print Dumper \%reps;
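
The script above picks the column once, from the first row, which may not hold when rows have different column counts, as the question warns. Below is a rough sketch that looks for the XXID field in every row instead. It uses only core modules and works on rows that have already been parsed. The XXID pattern (two letters followed by five word characters, seven in total, loosely based on the regex above), the `xxid_index` helper, and the sample rows are all assumptions made for illustration.

```perl
use strict;
use warnings;
use List::Util qw/first/;

# Assumed XXID shape: two letters plus five word characters (7 in total).
my $xxid_re = qr/^[A-Za-z]{2}\w{5}$/;

# Return the index of the first field that looks like an XXID, or undef.
sub xxid_index {
    my ($row) = @_;
    return first { defined $row->[$_] && $row->[$_] =~ $xxid_re } 0 .. $#$row;
}

# Hypothetical, already-parsed rows with differing column counts.
my @rows = (
    [ '1', 'http://example.com/a', '2011-01-01', 'AB12345', 'Jan 1, 2011' ],
    [ '2', '2011-01-02', 'CD67890', 'Jan 2, 2011' ],
    [ '3', 'http://example.com/c', 'extra', '2011-01-03', 'AB12345', 'Jan 3, 2011' ],
);

# Build the hash of repetitions, locating the XXID separately in each row.
my %reps;
for my $row (@rows) {
    my $i = xxid_index($row);
    next unless defined $i;    # no XXID-like field in this row
    $reps{ $row->[$i] }++;
}

print "$_: $reps{$_}\n" for sort keys %reps;
```

The pattern has to be strict enough that no other column matches it by accident, so adjust it to the real shape of your IDs.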

Answer 3

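You will definitely want to use a CPAN library for parsing CSV, because you will never account for all the quirks of the format yourself.

See: How can I parse quoted CSV in Perl with a regex?

See: How do I efficiently parse a CSV file in Perl?

However, for the specific string you provided, here is a very naive and non-idiomatic solution: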

use strict;
use warnings;

my $string = '"ID","URL","DATE","XXID","DATE-LONGFORMAT"';

my @words = ();
my $word = "";
my $quotec = '"';
my $quoted = 0;

foreach my $c (split //, $string)
{
  if ($quoted)
  {
    if ($c eq $quotec)
    {
      $quoted = 0;
      push @words, $word;
      $word = "";
    }
    else
    {
      $word .= $c;
    }
  }
  elsif ($c eq $quotec)
  {
    $quoted = 1;
  }
}

for (my $i = 0; $i < scalar @words; ++$i)
{
  print "column " . ($i + 1) . " = $words[$i]\n";
}
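
The loop above only collects text between quotes, so any unquoted field would be silently dropped. As a shorter alternative for the same string, here is a sketch that uses `parse_line` from the core module Text::ParseWords. This is a different technique from the hand-written loop. It treats backslashes and single quotes specially and does not follow the CSV convention of doubled quotes, so a dedicated CSV module remains the safer choice for real data.

```perl
use strict;
use warnings;
use Text::ParseWords qw/parse_line/;

my $string = '"ID","URL","DATE","XXID","DATE-LONGFORMAT"';

# Split on commas outside quotes; the 0 tells parse_line to strip the quotes.
my @words = parse_line(',', 0, $string);

for my $i (0 .. $#words) {
    print "column " . ($i + 1) . " = $words[$i]\n";
}
```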

How to extract multiple columns from a CSV file using Perl

3 answers: