Question

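I wrote this beginner-level script:

It runs through a master directory (more than 2 million files and many subfolders), filters for .tmx files (about 12k files in the master directory), extracts specific strings, and saves them to .log files. Once that is done, a few further routines clean up the log files and merge everything into a single file. The problem is that I left the script running overnight and it got stuck.

Because of the directory size, I would like to rework it. I have already listed all the files in the directory in a .txt file, so the script could read that .txt and process one file at a time; maybe that way it would not eat 99% of the RAM after a while.

Maybe you have other insights on speeding up the operation or merging these routines into one.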

Get-ChildItem "MasterDirpath\*.tmx" -Recurse  | 
Foreach-Object {
$content = Get-Content $_.FullName

#filter and save content to the original file
#$content | Where-Object {$_ -match '<tu '} | Set-Content $_.FullName

#filter and save content to a new file 
$content | Where-Object {!($_ -match '(?:creationid|changeid)="([^"]+)"' -or 
$_ -match '(<tuv.+?lang="[A-Za-z\-]+">)')} | %{$matches[1]} |Get-Unique  | 
Set-Content ($_.BaseName + '_out.log') 

}

Get-ChildItem "dir\tologs" -Filter *.log |
Foreach-Object {
$content = Get-Content -Raw $_.FullName
#make one line from extracted matches
$content -Replace "`r`n<" ,"`t<"  |Set-Content $_.FullName

}



Get-ChildItem "dir\tologs"  -Filter *.log | 
Foreach-Object {
$content = Get-Content $_.FullName

#filter and save content to the original file log file
$content | Where-Object {$_ -match '^.+$ '} | Sort | Get-Unique | Set-Content $_.FullName
}


$path = "dir\tologs"
$out  = "dir\tologs\output.txt"

Get-ChildItem $path -Filter *.log | % {
$file = $_.Name
Get-Content $_.FullName | % {
    "${file}: $_" | Out-File -Append $out
}
}

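Update: sample input

The .tmx files range in size from 1 MB to 2 GB, and the directory is about 1 TB in total. Every file in it is anywhere from a few MB to a few GB. The script runs fine on a small directory of 50 .tmx files of 1-100 MB each. The values to extract are the creationid and changeid attributes and the opening <tuv xml:lang="..."> tags.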

<?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="MemoQ" creationtoolversion="7.0.68" segtype="sentence" 
adminlang="en-us" creationid="lsmall" srclang="en-us" o-tmf="MemoQTM" 
datatype="unknown">
<prop type="defclient"> </prop>
<prop type="defproject"> </prop>
<prop type="defdomain"> </prop>
<prop type="defsubject"> </prop>
<prop type="description"> </prop>
<prop type="targetlang">hu</prop>
<prop type="name">28807_Project_HU</prop>
</header>
<body>
<tu changedate="20151104T174128Z" creationdate="20150929T180844Z" creationid="pmccrory" changeid="lsmall">
<prop type="client"> </prop>
<prop type="project"> </prop>
<prop type="domain"> </prop>
<prop type="subject"> </prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<prop type="x-document">node_data_en_final.xml</prop>
<tuv xml:lang="en-us">
  <prop type="x-context-pre">&lt;seg&gt;Biomarkers and Integrated Solutions&lt;/seg&gt;</prop>
  <prop type="x-context-post">&lt;seg&gt;Novel therapeutic agents that have fast onset of action, good safety and tolerability profiles and that address common co-morbidities (for example, anxiety and substance abuse) &lt;ph type='fmt'&gt;{}&lt;/ph&gt;&lt;it pos='begin'&gt;&amp;lt;ul&amp;gt;&lt;/it&gt;&lt;/seg&gt;</prop>
  <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
</tuv>
<tuv xml:lang="hu">
  <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
</tuv>
</tu>
</body>
</tmx>

Output after the routines have run:

AABB-COR-09_Master_DE_out.log: 6293 SYB <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: AD    <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: AGENTILE  <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ALIGN!    <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ANGELIKA  <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ASEDR <tuv xml:lang="en-us">    <tuv xml:lang="de-de">

Answer 1

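As a first debugging step, I would implement a progress indicator. Write-Progress gives nice output, but a simple version that just prints dots is usually good enough. By watching the dots you can tell whether the script has stalled or is still running (albeit slowly).

First save the files in a variable instead of passing them straight into the pipeline. Then you can easily log the number of files. When actually processing the files, print a dot for every so many files. The actual divisor is up to you. 10000 is a good starting guess, since 2M / 10k = 200, which should not be too much output to read.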

$tmxFiles = Get-ChildItem "MasterDirpath\*.tmx" -Recurse
write-host "Processing" $tmxFiles.Count ".tmx files"
$i=0;

$tmxFiles | % {
    if (++$i % 10000 -eq 0) {
        write-host -nonewline "."
    }
    # actual processing happens next
    ...
}

When moving on to the next step, use the same logic:

$toLogs = Get-ChildItem "dir\tologs" -Filter *.log
write-host "Processing" $toLogs.Count ".log files"
$i=0;

$toLogs | % {
    if (++$i % 10000 -eq 0) {
        write-host -nonewline "."
    }
    # actual processing
    ...
}
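
If you prefer Write-Progress over dots, the same counter can drive it. This is only a minimal sketch: $items is a hypothetical stand-in for the real file list, and the update interval of 20 is an arbitrary choice to keep the progress bar from being redrawn on every iteration.

```powershell
# Hypothetical stand-in for the real file list (e.g. the saved Get-ChildItem result)
$items = 1..200
$total = $items.Count
$i = 0

foreach ($item in $items) {
    $i++
    # Update the progress bar only every 20th item to keep the overhead low
    if ($i % 20 -eq 0) {
        $percent = [int](100 * $i / $total)
        Write-Progress -Activity 'Processing files' -Status "$i of $total" -PercentComplete $percent
    }
    # actual processing of $item would go here
}

# Remove the progress bar once the loop has finished
Write-Progress -Activity 'Processing files' -Completed
```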

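There is also Measure-Command, which measures how long a script block takes to run. Use it to figure out which parts of the process are the most expensive.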
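
As a rough illustration, the sketch below times two ways of reading the same file. The temporary sample file and its contents are made up for the demonstration; on real .tmx files the numbers will look very different, so treat it only as a pattern for your own measurements.

```powershell
# Create a small throwaway sample file (made-up content, only for the demonstration)
$sample = [System.IO.Path]::GetTempFileName()
1..5000 | ForEach-Object { "line $_" } | Set-Content -LiteralPath $sample

# Measure-Command returns a TimeSpan describing how long the script block ran
$lineByLine = Measure-Command { Get-Content -LiteralPath $sample | Out-Null }
$raw        = Measure-Command { Get-Content -LiteralPath $sample -Raw | Out-Null }

'Line by line: {0:N0} ms' -f $lineByLine.TotalMilliseconds
'Raw:          {0:N0} ms' -f $raw.TotalMilliseconds

Remove-Item -LiteralPath $sample
```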

Answer 2

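Use one pipeline for the entire processing instead of four separate passes.
Use string operators such as -join and -split instead of writing to and re-reading the same file.
Use the [regex] class and its Matches method to extract all the tokens you need.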

$RX_EXTRACT = [regex](
    '(?<=(creationid|changeid)=")[^"]+(?=")|' +
    '<tuv.+?lang="[A-Za-z\-]+">'
) # the unwanted parts are suppressed from the output via look-behind and look-ahead

# $TMX_DIR is assumed to hold the path of the master directory
Get-ChildItem (Join-Path $TMX_DIR *.tmx) -Recurse | ForEach-Object {
    $_.FullName + ': ' + (
        ($RX_EXTRACT.Matches((Get-Content -LiteralPath $_.FullName -Raw)).Value | Get-Unique
        ) -join "`n" -replace '\n<', "`t<" -split "`n" -ne '' | Sort-Object -Unique
    ) -join "`t"
} | Out-File "dir\tologs\output.txt"

Not extensively tested. Treat it as an example.
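
One caveat on memory: Get-Content -Raw loads the whole file as a single string, which could matter for the 2 GB files. A possible alternative, sketched below and not the method used above, is to stream each file with [System.IO.File]::ReadLines and keep only the unique matches in a HashSet. Get-TmxTokens, the slightly tightened pattern, and the demo file are all hypothetical names and data introduced for the illustration. The sketch assumes each token sits on a single line and that the files carry a byte order mark so the UTF-16 encoding is detected.

```powershell
# Hypothetical helper: streams one file and returns its unique tokens
function Get-TmxTokens {
    param([Parameter(Mandatory)][string]$Path)   # full path; .NET does not follow PowerShell's current location

    # Variant of the pattern above: attribute values via look-behind, plus opening <tuv> tags
    $pattern = [regex]'(?<=(?:creationid|changeid)=")[^"]+|<tuv[^>]*?lang="[A-Za-z\-]+">'
    $seen = New-Object 'System.Collections.Generic.HashSet[string]'

    # ReadLines yields one line at a time instead of loading the whole file
    foreach ($line in [System.IO.File]::ReadLines($Path)) {
        foreach ($m in $pattern.Matches($line)) {
            [void]$seen.Add($m.Value)
        }
    }
    $seen | Sort-Object
}

# Demo on a made-up four-line sample (the last line repeats the first)
$demo = Join-Path ([System.IO.Path]::GetTempPath()) 'demo_sample.tmx'
@(
    '<tu creationid="pmccrory" changeid="lsmall">'
    '<tuv xml:lang="en-us">'
    '<tuv xml:lang="hu">'
    '<tu creationid="pmccrory" changeid="lsmall">'
) | Set-Content -LiteralPath $demo

$tokens = Get-TmxTokens -Path $demo
$tokens -join "`t"
Remove-Item -LiteralPath $demo
```

Memory use then depends on the number of distinct tokens per file rather than on the file size. Whether this is actually faster is something to check with Measure-Command.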

How to speed up a script and keep it from clogging the RAM (directory with 2+ million files)
