Perl: Processing large files and storing them in a database

Time: 2015-08-25 11:37:22

Tags: regex perl perl-data-structures

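I have been tasked with modifying a Perl script that reads 12 files (each about 1 GB, roughly 4 million entries per file). The problem is that I don't know any Perl. Still, I managed to modify the script for my use case. The issue is that the script takes a very long time to process the files and to insert the entries into the database. Any pointers or suggestions for reducing the time would be appreciated.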

A sample input line from one of the files (modified to protect identities) looks like this:

10.0.0.25 [06/Aug/2015:06:00:02 +0000] "0.002" "200" "0.002" "172.16.2.57:7777" "-" "GET /txt/AKXBPPYICZIBGM/n1/19757_705326?dc=us2&ext_user_id=1400587512&si=149042592 HTTP/1.1" "http://xyz.xyz/site_view.xhtml?cmid=22211051&get-title=Test%20-%20Test%20Test%2010&Test=Test%20Test"

The script (mixed with some pseudocode) is as follows:

for ( $i = 0 ; $i < 12 ; $i++ ) {
  $img_log_path = "Set appropriate path according to the iteration";

  open(IMG_HANDLE, $img_log_path) or "Could not open the file $0\n";

    while ( $line = <IMG_HANDLE> ) {

        my $impression_date = `date --date="yesterday" +%Y-%m-%d`;
        chomp $line;

        if ( $line =~ m/&si=(\d+)/ig ) { 
            $affid       = $1; # Extract the affiliate ID from $line
            @fields      = split(/"/, $line); 
            $img_ref_url = $fields[13]; # Extract the impression URL from $line

            $impression_urls{$affid}{$img_ref_url}{IMPRESSION_COUNT}++; # Store the beacon impression URLs along with a count of number of impressions
            $img_ref_url =~ m/http:\/\/(.*?)\//i;
            $impression_urls{$affid}{$img_ref_url}{IMPRESSION_URL_HOST} = $1; # Store the hostname of the $img_ref_url

            # Store the time at which first impression occurs for this URL. 
            $line =~ m/(\d\d:\d\d:\d\d)\s/ig;
            my $impression_url_time = $1;
            my $impression_time = split( / /,$impression_urls{$affid}{$img_ref_url}{IMPRESSION_TIME} ); # If IMPRESSION_TIME exists for this URL from a previous impression then extract the time
            # Store the time for the first impression
            if($impression_url_time lt $impression_time || !defined($impression_urls{$affid}{$img_ref_url}{IMPRESSION_TIME})) {
                $impression_urls{$affid}{$ref}{IMPRESSION_TIME} = "$impression_date "."$impression_url_time"; # Store the time at which the first impression happened for this URL. 
            }

            if ( (defined($affid) && $affid ne "") && (defined($img_ref_url) && $img_ref_url ne "") ) {
                $affiliates{$affid}{TOTAL_IMP}++; # Increment the total number of impressions for this URL.

                if ( &CheckURL($img_ref_url) ) {
                    $affiliates{$affid}{ADULT_IMP}++;
                }
            }
        }
    }
    close IMG_HANDLE;
}

for $aff_id (@aff_ids) { #@aff_ids contains all the affiliate ids from a database
    my $impression_table = "tns_impressions";

    foreach $impression_url (keys %{ $impression_urls{$aff_id} }) {
        # If the pageurl doesn't exist then don't add it into the database
        if ($impression_url =~ /-/) {
            next;
        }

        # Insert the URL into the database
        my $query = "INSERT INTO $impression_table (affiliate_id, impression_url, created_at, no_of_impressions, hostname) VALUES (?,?,?,?,?)";
        my $statement = $dbh->prepare($query)
                            or print STDERR "$dbh->errstr";
        $statement->execute(
            $aff_id,
            $impression_url,
            $impression_urls{$aff_id}{$impression_url}{IMPRESSION_TIME},
            $impression_urls{$aff_id}{$impression_url}{IMPRESSION_COUNT},
            $impression_urls{$aff_id}{$impression_url}{IMPRESSION_URL_HOST}
        ) or print STDERR "$statement->errstr";
    }
}

1 Answer:

Answer 0 (score: 0)

Just for inspiration, here is how it could look (untested):

for ( $i = 0; $i < 12; $i++ ) {
    $img_log_path = "Set appropriate path according to the iteration";

    open( my $fh, '<', $img_log_path )
        or die "$0: Could not open the file $img_log_path: $!";

    while ( $line = <$fh> ) {

        if (my ( $impression_time, $affid, $ref_url, $host )
            = $line =~ m,
                    \A \S+ \s+                                      # IP
                    \[ [^:]+ : ( \d\d:\d\d:\d\d ) \s [^\]]+ \] \s+  # date and time
                    (?: " [^"]* ){10}                               # use same skip as original split
                    " \S+ \s+ \S+? \& si= (\d+) [^"]+ " \s+         # affid from HTTP header line
                    " ( http:// ( [^/]+ ) / [^"]+ ) "               # referential url and host
                    ,x
            )
        {
            my $rec = $impression_urls{$affid}{$ref_url} //= {
                IMPRESSION_TIME     => $impression_time,
                IMPRESSION_URL_HOST => $host
            };

            $rec->{IMPRESSION_COUNT}++;
            $rec->{IMPRESSION_TIME} = $impression_time
                unless $rec->{IMPRESSION_TIME} le $impression_time;

            $affiliates{$affid}{TOTAL_IMP}++;
            $affiliates{$affid}{ADULT_IMP}++ if &CheckURL($ref_url);
        }
    }
}

chomp( my $date = `date --date="yesterday" +%Y-%m-%d` );    # run once; strip the trailing newline

my $statement = do {
    my $impression_table = "tns_impressions";
    my $query
        = "INSERT INTO `$impression_table` (affiliate_id, impression_url, created_at, no_of_impressions, hostname) VALUES (?,?,?,?,?)";
    $dbh->prepare($query)
        or die $dbh->errstr;    # without a statement handle the loop below cannot run
};

for $aff_id (@aff_ids)
{    #@aff_ids contains all the affiliate ids from a database

    foreach my $impression_url ( keys %{ $impression_urls{$aff_id} } ) {

        my $rec = $impression_urls{$aff_id}{$impression_url};

        # Insert the URL into the database
        $statement->execute(
            $aff_id, $impression_url,
            "$date $rec->{IMPRESSION_TIME}",
            $rec->{IMPRESSION_COUNT},
            $rec->{IMPRESSION_URL_HOST}
        ) or print STDERR $statement->errstr;
    }
}
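
Since the code above is untested, a quick way to check the pattern is to run it against the sample line from the question. The following is a minimal sketch using only core modules. The names `$LINE_RE`, `parse_line` and `yesterday` are hypothetical helpers introduced here for illustration, not part of the original script. It also shows one way to get yesterday's date without spawning an external `date` process, assuming local time is what is wanted.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Compile the pattern once; /x permits whitespace and inline comments.
my $LINE_RE = qr{
    \A \S+ \s+                                        # client IP
    \[ [^:]+ : ( \d\d:\d\d:\d\d ) \s [^\]]+ \] \s+    # time of day from the timestamp
    (?: " [^"]* ){10}                                 # skip the five quoted fields before the request
    " \S+ \s+ \S+? [&] si= ( \d+ ) [^"]* " \s+        # affiliate ID from the request line
    " ( http:// ( [^/"]+ ) / [^"]* ) "                # referrer URL and its host
}x;

# Hypothetical helper: returns a hash ref of the captured fields,
# or nothing when the line does not have the expected shape.
sub parse_line {
    my ($line) = @_;
    my ( $time, $affid, $ref_url, $host ) = $line =~ $LINE_RE
        or return;
    return { time => $time, affid => $affid, ref_url => $ref_url, host => $host };
}

# Hypothetical helper: yesterday's date in local time, without forking `date`.
# Subtracting 24 hours is an approximation that can be off by a day
# for a short window around a daylight-saving change.
sub yesterday {
    my ($now) = @_;
    $now //= time;
    return strftime( '%Y-%m-%d', localtime( $now - 24 * 60 * 60 ) );
}

# The sample line from the question.
my $sample = '10.0.0.25 [06/Aug/2015:06:00:02 +0000] "0.002" "200" "0.002" "172.16.2.57:7777" "-" "GET /txt/AKXBPPYICZIBGM/n1/19757_705326?dc=us2&ext_user_id=1400587512&si=149042592 HTTP/1.1" "http://xyz.xyz/site_view.xhtml?cmid=22211051&get-title=Test%20-%20Test%20Test%2010&Test=Test%20Test"';

if ( my $rec = parse_line($sample) ) {
    print join( "\t", yesterday(), @{$rec}{qw(time affid host)} ), "\n";
}
else {
    print "sample line did not match\n";
}
```

Lines whose referrer is "-" or does not start with http:// simply fail to match, so they are skipped rather than counted. Whether that is acceptable depends on how TOTAL_IMP is meant to be computed.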