Fastest way to read and parse 100,000+ XML files totalling 6 GB with PHP

Date: 2014-11-14 05:57:51

Tags: php xml parsing

I am working on a project in which I need to read and parse 100,000+ XML files with a total size of 6 GB.

My questions:

1. How can I read and parse a single XML file (5 KB to 500 KB in size) within a few seconds, so that the complete set of XML files (100,000+ files, 6 GB total) can be read and parsed in 3 to 5 hours?
2. What is the fastest way to do this?

Currently, a single XML file (5 KB to 500 KB) takes about a minute to read and parse.

Regards, Mian


P.S. Please also take a look at the code:

<HTML>
<HEAD>
<META HTTP-EQUIV="CACHE-CONTROL" CONTENT="NO-CACHE">
<META HTTP-EQUIV="EXPIRES" CONTENT="0">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><style    type="text/css">
<!--
body,td,th {
color: #CCCCCC;
}
body {
background-color: #000066;
}
-->
</style>
<script>
<!--

/*
Auto Refresh Page with Time script By JavaScript Kit (javascriptkit.com) Over 200+ free scripts here!
*/

//enter refresh time in "minutes:seconds". Minutes should range from 0 to infinity; seconds should range from 0 to 59

var limit="00:10"

if (document.images){
    var parselimit=limit.split(":")
    parselimit=parselimit[0]*60+parselimit[1]*1
}

function beginrefresh(){
    if (!document.images)
        return
    if (parselimit==1)
        window.location.reload()
    else{
        parselimit-=1
        curmin=Math.floor(parselimit/60)
        cursec=parselimit%60
        if (curmin!=0)
            curtime=curmin+" minutes and "+cursec+" seconds left until page refresh!"
        else
            curtime=cursec+" seconds left until page refresh!"
        window.status=curtime
        setTimeout("beginrefresh()",1000)
    }
}

window.onload=beginrefresh
//-->
</script>
</HEAD>
<BODY>


<?php

// MagicParser (magicparser.com) is a third-party PHP parsing library.
require("MagicParser.php");

//header("Content-Type: text/plain");

$dbServer = "127.0.0.1";
$dbUser = "root";
$dbPass = "";
$dbName = "GDatabase";

$text = '';

$c = mysql_connect($dbServer, $dbUser, $dbPass) or die("Couldn't connect to database");
$d = mysql_select_db($dbName) or die("Couldn't select database");

//mysql_query("SET NAMES utf8;");
//mysql_query("SET CHARACTER_SET utf8;");

// Fetch one unprocessed file from the queue table.
$sql = "select id, file_name
        from tableP_files
        where status = '' limit 1";

$result = mysql_query($sql, $c);

while($row = mysql_fetch_array($result))
{
    $id = $row['id'];
    $file_name = $row['file_name'];

    $url = 'http://localhost/GDatabase/XML/' . $file_name;
}



// Detect the record format of the file and store it for reference.
$formatString = MagicParser_getFormat($url);

$update_query = "update tableP_files set format_string = '$formatString' where id = $id";
if(!mysql_query($update_query, $c))
{
    echo 'ERROR';
}
print "Format String: ".$formatString."\n\n";

// MagicParser_parse($url,"myRecordHandler",$formatString);
// MagicParser_parse($url,"myRecordHandler","xml|ARTICLE/FLOATS-WRAP/TABLE-WRAP/TABLE/TBODY/TR/TD/");
MagicParser_parse($url,"myRecordHandler","xml|ARTICLE/");

function myRecordHandler($record)
{
    // Note: this handler opens a new database connection and repeats the
    // same lookup query for every single record that is parsed.
    $dbServer = "127.0.0.1";
    $dbUser = "root";
    $dbPass = "";
    $dbName = "GDatabase";

    $c = mysql_connect($dbServer, $dbUser, $dbPass) or die("Couldn't connect to database");
    $d = mysql_select_db($dbName) or die("Couldn't select database");

    mysql_query("SET NAMES utf8;");
    mysql_query("SET CHARACTER_SET utf8;");

    $sql = "select id, file_name
            from tableP_files
            where status = '' limit 1";

    $result = mysql_query($sql, $c);

    while($row = mysql_fetch_array($result))
    {
        $id = $row['id'];
        $file_name = $row['file_name'];

        $file_name = 'http://localhost/GDatabase/test/' . $file_name;
    }

    // Insert one row per tag/value pair of the current record.
    foreach($record as $key => $value)
    {
        $tag = addslashes($key);
        $value = addslashes($value);

        $insert_query = "insert into tableP_xml set file_id = '$id', file_name = '$file_name', tag = '$tag', value = '$value', status = ''";
        if(!mysql_query($insert_query, $c))
        {
            echo 'ERROR';
        }
    }

    // Mark the file as processed.
    $update_query = "update tableP_files set status = 'done' where id = $id";
    if(!mysql_query($update_query, $c))
    {
        echo 'ERROR';
    }

    echo "Done: " . $id . " - " . $file_name;
    return TRUE;
}

?> 
</BODY>
</HTML>

1 Answer:

Answer 0: (score: 1)

I just created 100,000 XML files of 60 KB each and tried reading them with file_get_contents in PHP; it took 87.5 seconds. Mind you, that was on an SSD with plenty of RAM and a powerful 4th-generation i5 processor. Just loading them into memory takes roughly 90 seconds.
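
A minimal sketch of that kind of measurement (the directory path is an assumption) could look like this:

<?php
// Hypothetical benchmark: sequentially read every XML file into memory
// and report the elapsed time. The directory path is an assumption.
$start = microtime(true);
foreach (glob('/path/to/xml/*.xml') as $file) {
    $xml = file_get_contents($file); // load the whole file into memory
}
printf("Loaded all files in %.1f seconds\n", microtime(true) - $start);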

So, how do you do it faster? Concurrency.

I split the task into 4 chunks of 25,000 XML files each, which brought the time to load the files into memory (sequentially within each chunk) down to about 30 seconds. Again, that is only the time needed to load the XML into memory; if you are going to do more processing on the XML, you will need more processing power or more time.
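
On a single machine, one way to get that kind of split is to fork child processes. Here is a sketch assuming the pcntl extension (CLI only) and the same assumed directory:

<?php
// Sketch: divide the file list into 4 chunks and let each forked child
// read its own chunk in parallel. Requires the pcntl extension.
$files  = glob('/path/to/xml/*.xml');                    // assumed path
$chunks = array_chunk($files, (int)ceil(count($files) / 4));

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child: load (and later parse) its 25,000 files, then exit.
        foreach ($chunk as $file) {
            $xml = file_get_contents($file);
            // ... parse and store $xml here ...
        }
        exit(0);
    }
}

// Parent: wait until every child has finished.
while (pcntl_waitpid(-1, $status) > 0);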

Now, how do you scale this? Enter Gearman. Gearman lets you distribute work to workers through a central job server, handling tasks in parallel. You can even register a pool of workers on different servers to carry out your jobs. I don't think you need a supercomputer at all: you just define all the jobs once and let the workers do the work (asynchronously?).
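
For illustration, here is a minimal sketch using PHP's Gearman extension, with a hypothetical parse_xml job that receives a file name as its payload:

<?php
// worker.php: registers a hypothetical "parse_xml" job with the job server.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1');     // central Gearman job server
$worker->addFunction('parse_xml', function (GearmanJob $job) {
    $file = $job->workload();        // file name sent by the client
    $xml  = file_get_contents($file);
    // ... parse $xml and store the results here ...
    return 'done';
});
while ($worker->work());

<?php
// client.php: queues one background (asynchronous) job per XML file.
$client = new GearmanClient();
$client->addServer('127.0.0.1');
foreach (glob('/path/to/xml/*.xml') as $file) {
    $client->doBackground('parse_xml', $file);
}

Any number of copies of worker.php, on any number of machines, can register against the same job server, which is how this scales out.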