Question

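OK, so I'm working on an exploit finder to run against changed roots, and my problem is that searching for a lot of strings in a lot of files (i.e. htdocs) takes longer than I'd like. I'm positive some advanced Perl writers out there can help me speed things up. Here is the part of my program I'd like to improve.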

sub sStringFind {
  if (-B $_ ) {
  }else{
   open FH, '<', $_ ;
   my @lines = <FH>;
   foreach $fstring(@lines) {
    if ($fstring =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
     push(@huhFiles, "$_");
   }
  }
 }
}
#End suspicious string find.
find(\&sStringFind, "$cDir/www/htdocs");
for(@huhFiles) {
 print "$_\n";
}

Maybe some hashing? Not sure, I'm not that great at Perl at the moment. Any help is appreciated, thanks guys.

Answer 1

You're not doing anything that would cause an obvious performance problem, so you'll have to look outside Perl. Use grep. It should be much faster.

open my $grep, "-|", "grep", "-l", "-P", "-I", "-r", $regex, $dir;
my @files = <$grep>;
chomp @files;

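-l returns only the names of files that match. -P uses Perl-compatible regular expressions. -r recurses through directories. -I ignores binary files. Make sure your system's grep has all of these options.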

Answer 2

Contrary to the other answers, I'd suggest running the regex once per file rather than once per line.

use File::Slurp 'read_file';
        ...
    if (-B $_ ) {
    }else{
        if ( read_file("$_") =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
            push(@huhFiles, $_);
        }
    }

Make sure you are using at least perl 5.10.1.
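
If File::Slurp is not installed, the same once-per-file match can be sketched with core Perl only. The helper name file_matches and the shortened pattern are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Illustrative, shortened pattern; substitute the full keyword list.
my $suspicious = qr/sendraw|portscan|milw0rm|tcpflooder/;

# Hypothetical helper: slurp one file and run the regex once over its contents.
sub file_matches {
    my ($path, $regex) = @_;
    open(my $fh, '<', $path) or return 0;   # unreadable files count as no match
    local $/;                               # undef record separator: read the whole file
    my $contents = <$fh>;
    close $fh;
    return 0 unless defined $contents;
    return $contents =~ $regex ? 1 : 0;
}
```

Inside the File::Find callback this would be called as file_matches($_, $suspicious) after the -B test.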

Answer 3

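So by "hashing" I take it you mean checksumming at the file or line level so you don't have to check it again?

The basic problem is that, checksum or not, you still have to read every line of every file, either to scan it or to hash it. So this doesn't fundamentally change your algorithm, it just pushes the constants around.

If you have a lot of duplicate files, checksumming at the file level could save you a lot of time. If you don't, it will waste a lot of time.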

cost = (checksum_cost * num_files) + (regex_cost * lines_per(unique_files))

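Checksumming at the line level is a toss-up between the cost of the regex and the cost of the checksum. If there aren't many duplicate lines, you lose. If your checksum is too expensive, you lose. You can write it out like so: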

cost = (checksum_cost * total_lines) + (regex_cost * (total_lines - duplicate_lines))

I'd start by figuring out what percentage of the files and lines are duplicates. That's as simple as:

$line_frequency{ checksum($line) }++

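Then look at the percentage of entries whose frequency is >= 2. That percentage is the maximum performance gain you will see from checksumming. If it's 50%, you will see a gain of at most 50%. That assumes the checksum cost is 0, which it isn't, so you'll see less. If the checksum costs half as much as the regex, you'll only see 25%.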
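
A minimal sketch of that measurement, assuming MD5 (via the core Digest::MD5 module) as the checksum and a hypothetical helper named line_duplication. It counts the lines that repeat an earlier line, which is one way to read the duplicate_lines term in the cost formula above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: count total lines and repeated lines across files.
sub line_duplication {
    my @paths = @_;
    my %line_frequency;
    my $total = 0;
    for my $path (@paths) {
        open(my $fh, '<', $path) or next;   # skip unreadable files
        while (my $line = <$fh>) {
            $line_frequency{ md5_hex($line) }++;
            $total++;
        }
        close $fh;
    }
    my $distinct = keys %line_frequency;
    # Lines that repeat an earlier line are the ones a checksum could skip.
    return ($total, $total - $distinct);
}

my ($total, $duplicates) = line_duplication(@ARGV);
printf "%d of %d lines are repeats (%.1f%%)\n",
    $duplicates, $total, $total ? 100 * $duplicates / $total : 0;
```

Run it with the files to be scanned as arguments. If the reported share is small, line-level checksumming is unlikely to pay for itself.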

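That's why I recommended grep. It will blast through files and lines faster than Perl can, which attacks the fundamental problem: you have to read every file and every line.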

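What you can do is not look at every file every time. A simple thing to do is remember when you last scanned and look at the modification time of each file. If the file hasn't changed and your regex hasn't changed, don't check it again. A more robust version would be to store a checksum for each file, in case a file was altered along with its modification time. If your files don't change very often, that will be a big win.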

# Write a timestamp file at the top of the directory you're scanning
sub set_last_scan_time {
    my $dir = shift;

    my $file = "$dir/.last_scan";
    open my $fh, ">", $file or die "Can't open $file for writing: $!";
    print $fh time;

    return
}

# Read the timestamp file
sub get_last_scan_time {
    my $dir = shift;

    my $file = "$dir/.last_scan";

    return 0 unless -e $file;

    open my $fh, "<", $file or die "Can't open $file: $!";
    my $time = <$fh>;
    chomp $time;

    return $time;
}

use File::Find;
use File::Slurp 'read_file';
use File::stat;

my $last_scan_time = get_last_scan_time($dir);

# Place the regex outside the routine just to make things tidier.
my $regex = qr{this|that|blah|...};
my @huhFiles;
sub scan_file {
    # Only scan text files
    return unless -T $_;

    # Don't bother scanning if it hasn't changed
    return if stat($_)->mtime < $last_scan_time;

    push(@huhFiles, $_) if read_file($_) =~ $regex;
}

# Set the scan time to before you start so if anything is edited
# while you're scanning you'll catch it next time.
set_last_scan_time($dir);

find(\&scan_file, $dir);
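
The "more robust version" mentioned above, storing a checksum per file, could look roughly like this. It is a sketch under assumptions: MD5 as the checksum, the core Storable module for persistence, and the helper names file_checksum, load_cache and needs_scan are all hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5;
use Storable qw(retrieve nstore);

# Hypothetical helper: MD5 of a file's contents, or undef if unreadable.
sub file_checksum {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;
    binmode $fh;
    my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $sum;
}

# Load the checksum cache from a previous run (empty on the first run).
sub load_cache {
    my ($cache_file) = @_;
    return -e $cache_file ? retrieve($cache_file) : +{};
}

# True if the file is new or its contents differ from the cached checksum.
# Updates the cache as a side effect.
sub needs_scan {
    my ($cache, $path) = @_;
    my $sum = file_checksum($path);
    return 0 unless defined $sum;
    return 0 if defined $cache->{$path} && $cache->{$path} eq $sum;
    $cache->{$path} = $sum;
    return 1;
}
```

In scan_file, the mtime test would be replaced with return unless needs_scan($cache, $File::Find::name), and nstore($cache, $cache_file) would be called once find() returns. The cache has to be thrown away whenever the regex changes. Checksumming still reads every file, so this only pays off when the checksum is cheaper than the regex.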

Answer 4

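I'd do a number of things to improve performance.

First, you should precompile your regex. In general, I do it like this:

my @items = qw(foo bar baz);    # usually I pull this from a config file
my $regex = '^(?:' . join('|', @items) . ')$';    # as an example; I do a lot of capturing, too
$regex = qr($regex)i;

Second, as mentioned above, you should read files one line at a time. Most of the performance problems I've seen come from running out of RAM, not CPU.

Third, if you're maxing out one CPU and have a lot of files to work through, use fork() to split your app into a caller and receivers so that you can process several files at once on more than one CPU. You could write to a common file and parse it when everything is done.

Finally, watch your memory usage. A lot of the time, appending to a file lets you keep what's in memory much smaller.

I have to process large data dumps using 5.8 and 5.10, and this works for me.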
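
The fork() suggestion above can be sketched as follows, assuming a Unix-like system. The helper name parallel_scan and the shortened pattern are hypothetical, and each child writes to its own result file rather than a shared one so that no locking is needed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Illustrative, shortened pattern; substitute the full keyword list.
my $suspicious = qr/sendraw|portscan|milw0rm|tcpflooder/;

# Hypothetical helper: scan @$files with $workers child processes.
# Each child records matching file names in its own result file,
# which the parent reads back once every child has exited.
sub parallel_scan {
    my ($regex, $workers, $files) = @_;
    my $dir = tempdir(CLEANUP => 1);
    my @pids;

    for my $n (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid) { push @pids, $pid; next; }   # parent: remember the child

        # Child: take every $workers-th file, starting at index $n.
        open(my $out, '>', "$dir/$n") or die "Can't write $dir/$n: $!";
        for (my $i = $n; $i < @$files; $i += $workers) {
            open(my $fh, '<', $files->[$i]) or next;
            while (my $line = <$fh>) {
                if ($line =~ $regex) {
                    print {$out} "$files->[$i]\n";
                    last;                        # one hit is enough
                }
            }
            close $fh;
        }
        close $out;
        POSIX::_exit(0);   # leave without running the parent's cleanup handlers
    }

    waitpid($_, 0) for @pids;

    my @hits;
    for my $n (0 .. $workers - 1) {
        open(my $in, '<', "$dir/$n") or next;
        chomp(my @names = <$in>);
        push @hits, @names;
        close $in;
    }
    return @hits;
}
```

The file list would first be collected with File::Find, then passed in as parallel_scan($suspicious, 4, \@files). Whether this helps depends on whether the scan is limited by CPU or by disk.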

Answer 5

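I'm not sure if this will help, but when you read <FH> into @lines you immediately read the entire file into a Perl array. You may get better performance by opening the file and reading it line by line, rather than loading the whole file into memory before processing it. However, if your files are small, your current method might actually be faster...

See this page for examples: http://www.perlfect.com/articles/perlfile.shtml

It might look something like this (note the scalar $line variable, not an array):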

open(FH, '<', $_) or die "Could not open $_: $!";

while (my $line = <FH>)
{
    # do something with line
}

close FH;

Answer 6

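As has been said, your script reads the entire contents of each file into @lines, then scans every line. That suggests two improvements: read one line at a time, and stop as soon as a line matches.

Some other improvements: the if (-B $_) {} else { ... } is odd. If you only want to process text files, use the -T test. You should always check the return value of open(). The quotes in your push() are useless. All together: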

sub sStringFind {
    if (-T $_) {
        # Always - yes, ALWAYS check for failure on open()
        open(my $fh, '<', $_) or die "Could not open $_: $!";

        while (my $fstring = <$fh>) {
            if ($fstring =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
                push(@huhFiles, $_);
                last; # No need to keep checking once this file's been flagged
            }
        }
    }
}

Answer 7

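Just to add something else.

If you're assembling your regexp from a list of search terms, Regexp::Assemble::Compressed can be used to fold the terms into a shorter regular expression: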

use Regexp::Assemble::Compressed;

my @terms = qw(sendraw portscan stunshell Bruteforce fakeproc sub google sub alltheweb sub uol sub bing sub altavista sub ask sub yahoo virgillio filestealth IO::Socket::INET /usr/sbin/bjork /usr/local/apache/bin/httpd /sbin/syslogd /sbin/klogd /usr/sbin/acpid /usr/sbin/cron /usr/sbin/httpd irc.byroe.net milw0rm tcpflooder);

my $ra = Regexp::Assemble::Compressed->new;
$ra->add("\Q${_}\E") for @terms;
my $re = $ra->re;
print $re."\n";

print "matched" if 'blah blah yahoo' =~ m{$re};

This produces:

(?-xism:(?:\/(?:usr\/(?:sbin\/(?:(?:acpi|http)d|bjork|cron)|local\/apache\/bin\/httpd)|sbin\/(?:sys|k)logd)|a(?:l(?:ltheweb|tavista)|sk)|f(?:ilestealth|akeproc)|s(?:tunshell|endraw|ub)|(?:Bruteforc|googl)e|(?:virgilli|yaho)o|IO::Socket::INET|irc\.byroe\.net|tcpflooder|portscan|milw0rm|bing|uol))
matched

This may be of benefit for very long lists of search terms, particularly with Perl before 5.10.
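
Note that qw() splits on whitespace, so a phrase such as "sub google" enters the list above as two separate terms, which is why sub and google appear separately in the generated pattern. If the phrases should stay intact, an ordinary quoted list keeps them whole. The sketch below uses only core Perl (quotemeta and join) instead of Regexp::Assemble::Compressed, and its shortened term list is illustrative.

```perl
use strict;
use warnings;

# Illustrative subset of the search terms; multi-word phrases stay whole
# because this is an ordinary list, not qw().
my @terms = ('sendraw', 'sub google', 'sub yahoo', 'IO::Socket::INET',
             '/usr/sbin/httpd', 'irc.byroe.net');

# Escape regex metacharacters in each term, then join into one alternation.
my $alternation = join '|', map { quotemeta } @terms;
my $re = qr/$alternation/;

print "matched\n" if 'blah blah sub yahoo' =~ $re;
```

The same list could be fed to the assembler instead of join if the compressed form is wanted.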

Answer 8

Just working from your code:

#!/usr/bin/perl

# it looks awesome to use strict
use strict;
# using warnings is beyond awesome
use warnings;
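use File::Find;

# Assumption: the root directory comes from the command line; the original
# script sets $cDir elsewhere, and "use strict" requires it to be declared.
my $cDir = shift @ARGV or die "Usage: $0 <root-dir>\n";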

my $keywords = qr[sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder];

my @huhfiles;

find sub {
        return unless -f;
        my $file = $File::Find::name;

        # File::Find chdirs into each directory by default, so open the basename in $_
        open my $fh, '<', $_ or die "Could not open $file: $!\n";
        local $/ = undef;
        my $contents = <$fh>;
        # modern Perl handles this but it's a good practice
        # to close the file handle after usage
        close $fh;

        if ($contents =~ $keywords) {
                push @huhfiles, $file;
        }
}, "$cDir/www/htdocs";

if (@huhfiles) {
        print join "\n", @huhfiles;
} else {
        print "No vulnerable files found\n";
}

Answer 9

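Don't read all of the lines at once. Read one line at a time, and when you find a match in a file, break out of the loop and stop reading from that file.

Also, don't interpolate when you don't need to. Instead of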

push(@huhFiles, "$_");

do

push(@huhFiles, $_);

It's not a speed issue, but it's better coding style.

Need help speeding up my Perl program

9 answers: