Trying to optimize my script

Asked: 2016-07-09 16:39:57

Tags: php arrays optimization

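Hey, so I built a script that filters a combo list and outputs only the combos whose domain does not repeat more than 2 times, but it is super slow. This is my script: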

<?php
ini_set('max_execution_time', '-1');
ini_set('memory_limit', '-1');
$fileCombo = file("ww.txt", FILE_IGNORE_NEW_LINES);
$output = fopen("workpless.txt", "a") or die("Unable to open file!");

//all domains of the entire list
$domains = array();
//only domains that occur at most 2 times
$less = array();

//takes the combo list explode it to domain names
foreach ($fileCombo as $combo) {
    $pieces = explode(":", $combo);
    $email = explode("@", $pieces[0]);
    //import domains to array
    $domains[] = strtolower($email[1]);
}
//count each string in the array
$ac = array_count_values($domains);
//this foreach keeps only the domains that do not occur more than 2 times
foreach ($ac as $email => $item) {
    if($item <= 2) {
        $less[] = $email;
    }
}

/* this foreach is the one that causes all the trouble:
it takes every domain that the previous foreach kept
and runs it, one by one, against the entire combo list
to get the actual combos */

foreach ($less as $find) {
    $matches = array_filter($fileCombo, function($var) use ($find) { return preg_match("/\b$find\b/i", $var); });
    foreach ($matches as $match) {
        $data = $match . PHP_EOL;
        fwrite($output, $data);
    }
}

fclose($output);
?>

Pseudocode (the best I can do):

file1:
exaple@example.com:password
exaple@example.com:password
exaple@example.com:password
exaple@example1.com:password
exaple@example2.com:password

load file1 into the array "fileCombo"
split each line by ":" to get [0] example@example.com, [1] password
split value [0] by "@" to get [0] example, [1] example.com
put value [1] into a new array called "domains"
count how many times each domain occurs
put every domain that occurs at most 2 times into a new array called "less"
run each domain in the "less" array, one by one, against the "fileCombo" array
if a "less" value is found inside a "fileCombo" value, then
write the entire line from "fileCombo" to a text file

This script is meant for large files of 2~5M lines, which is why I need to optimize it (it is fast when you run it on 20k lines).

2 Answers:

Answer 0 (score: 2)

The following should be the most "concise" solution for your case.
But you should test it on large files (2~5M lines):

Let's say file1.txt contains the following lines:

exaple@example.com:password
exaple@example.com:password
exaple@example.com:password
exaple@example1.com:password
exaple@example2.com:password

$combos = file_get_contents("file1.txt");
preg_match_all("/\b\S+?@(\S+?):\S+?\b/m", $combos, $matches);
$less = array_filter(array_count_values($matches[1]), function ($v){
    return $v <= 2;
});
// $matches[0] - is an array of lines
// $matches[1] - is an array of domains in respective positions as lines
$matched_lines = "";
foreach (array_keys($less) as $domain) {
    $matched_lines .= $matches[0][array_search($domain, $matches[1])] . PHP_EOL;
}
if ($matched_lines) {
    file_put_contents("workpless.txt", $matched_lines, FILE_APPEND);
}
// Now "workpless.txt" contains the following lines:
// exaple@example1.com:password
// exaple@example2.com:password
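
One caveat about the loop above: `array_search` returns only the first matching key, so a domain that occurs twice contributes a single line, and the regex does not lowercase domains. A minimal sketch of an alternative, assuming you want every line of each qualifying domain and case-insensitive domains, is to group the lines by domain in one pass. The helper name `filterRareDomains` and the `$maxCount` parameter are hypothetical, and the type hints assume PHP 7.0 or later:

```php
<?php
// Hypothetical helper: returns every line whose domain occurs at most $maxCount times.
function filterRareDomains(array $lines, int $maxCount = 2): array
{
    $byDomain = [];
    foreach ($lines as $line) {
        // "user@domain:password" -> take the e-mail part before the first ":".
        $email = explode(":", $line, 2)[0];
        $at = strrpos($email, "@");
        if ($at === false) {
            continue; // assumption: lines without an "@" are skipped
        }
        // Group the original line under its lowercased domain.
        $domain = strtolower(substr($email, $at + 1));
        $byDomain[$domain][] = $line;
    }

    $result = [];
    foreach ($byDomain as $group) {
        // Keep the whole group only if the domain is rare enough.
        if (count($group) <= $maxCount) {
            foreach ($group as $line) {
                $result[] = $line;
            }
        }
    }
    return $result;
}
```

Usage would be something like `filterRareDomains(file("file1.txt", FILE_IGNORE_NEW_LINES))`. The result is ordered by the first occurrence of each domain rather than by original line order, and all lines are held in memory, so this has not been checked against files of 2~5M lines.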

Answer 1 (score: 1)

Update: the script now shows all related lines for each domain, at the cost of about 5 more seconds on a 1M-line file.

Tested with 80,000 lines (40,000 unique lines) - 2.5 MB

Memory Usage

69,994,816 bytes
70,246,808 bytes (process)
71,827,456 bytes (process peak)

Execution Time
0.54409 seconds

Tested with 1,000,000 lines (500,000 unique lines) - 33 MB

Memory Usage

864,805,152 bytes
865,057,144 bytes (process)
866,648,064 bytes (process peak)

Execution Time
8.9173 seconds

My test machine is an i7-3612QM (CPU Mark 6833) with 4GB RAM and an SSD.

Sample from the 80,000-line file:

exaple@example.com:password
exaple@example1.com:password
exaple@example1.com:password
exaple@example1.com:password
exaple@example2.com:password
exaple@example2.com:password
exaple@example3.com:password
exaple@example3.com:password

Here is your new version :))

<?php
// System Start Time
define('START_TIME', microtime(true));

// System Start Memory
define('START_MEMORY_USAGE', memory_get_usage());

function show_current_stats() {
?>
    <b>Memory Usage</b>
    <pre>
    <?php print number_format(memory_get_usage() - START_MEMORY_USAGE); ?> bytes
    <?php print number_format(memory_get_usage()); ?> bytes (process)
    <?php print number_format(memory_get_peak_usage(TRUE)); ?> bytes (process peak)
    </pre>

    <b>Execution Time</b>
    <pre><?php print round((microtime(true) - START_TIME), 5); ?> seconds</pre>
<?php
}

// Script start here

$fileCombo = file("ww.txt", FILE_IGNORE_NEW_LINES);
$output = fopen("workpless.txt", "a") or die("Unable to open file!");

//all domains of the entire list
$domains = array();
//only domains that occur at most 2 times
$less = array();
//map each domain to the positions (keys) of its lines in fileCombo
$domains_keys = array();

//takes the combo list explode it to domain names
foreach ($fileCombo as $key => $combo) {
    $pieces = explode(":", $combo);
    $email = explode("@", $pieces[0]);
    //import domains to array
    $domains[] = strtolower($email[1]);

    // check if domain exists or create new domain in $domains_keys array
    if (isset($domains_keys[strtolower($email[1])] )) {
        $domains_keys[strtolower($email[1])][] = $key;
    } else {
        $domains_keys[strtolower($email[1])] = array($key);
    }
}
//count each string in the array
$ac = array_count_values($domains);
//this foreach keeps only the domains that do not occur more than 2 times
foreach ($ac as $email => $item) {
    if($item <= 2) {
        $less[] = $email;
    }
}

foreach ($less as $find) {
    array_map(function($domain_key) use ($fileCombo, $output) {
        $data = $fileCombo[$domain_key] . PHP_EOL;
        fwrite($output, $data);
    }, $domains_keys[$find]);
}

fclose($output);

// uncomment to show stats : Credit go to micromvc
/* show_current_stats(); */

Output

exaple@example.com:password
exaple@example2.com:password
exaple@example2.com:password
exaple@example3.com:password
exaple@example3.com:password
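
If the memory figures above are a concern on 2~5M-line files, one option is to avoid loading the file with `file()` and read it twice with `fgets` instead: the first pass counts domains, the second pass copies the qualifying lines. This is only a sketch under assumptions, not a measured result: the function names `comboDomain` and `filterComboFile` are hypothetical, malformed lines are skipped, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: lowercased domain of "user@domain:password", or null if there is no "@".
function comboDomain(string $line): ?string
{
    $email = explode(":", $line, 2)[0];
    $at = strrpos($email, "@");
    return $at === false ? null : strtolower(substr($email, $at + 1));
}

// Two passes over the input: count domains first, then append qualifying lines to $outPath.
// Returns the number of lines written.
function filterComboFile(string $inPath, string $outPath, int $maxCount = 2): int
{
    $in = fopen($inPath, "r");
    if ($in === false) {
        throw new RuntimeException("Unable to open $inPath");
    }

    // Pass 1: only the per-domain counters are kept in memory.
    $counts = [];
    while (($line = fgets($in)) !== false) {
        $domain = comboDomain(rtrim($line, "\r\n"));
        if ($domain !== null) {
            $counts[$domain] = ($counts[$domain] ?? 0) + 1;
        }
    }

    $out = fopen($outPath, "a");
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Unable to open $outPath");
    }

    // Pass 2: re-read from the start and copy lines whose domain is rare enough.
    rewind($in);
    $written = 0;
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        $domain = comboDomain($line);
        if ($domain !== null && $counts[$domain] <= $maxCount) {
            fwrite($out, $line . PHP_EOL);
            $written++;
        }
    }

    fclose($in);
    fclose($out);
    return $written;
}
```

A call such as `filterComboFile("ww.txt", "workpless.txt")` keeps the output in the original line order. The trade-off is reading the file twice in exchange for memory that grows with the number of distinct domains rather than the number of lines.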