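Find and replace using PowerShell Get-Content

Asked: 2014-06-23 11:43:13

Tags: powershell

I am trying to mask the SSNs in a large text file with random SSNs. The file is 400 MB, or 0.4 GB.

I want to find and replace 17,000 SSN instances.

Here is an example of the PowerShell script I am using.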

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content C:\TrainingFile\TrainingFile.txt

My problem is that I have 17,000 lines of this code in a .ps1 file. The .ps1 file looks like this:

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content C:\TrainingFile\TrainingFile.txt

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "122-45-6789", "666-66-6668"} | set-content C:\TrainingFile\TrainingFile.txt

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "223-45-6789", "666-66-6667"} | set-content C:\TrainingFile\TrainingFile.txt

(get-content C:\TrainingFile\TrainingFile.txt) | foreach-object {$_ -replace "123-44-6789", "666-66-6669"} | set-content C:\TrainingFile\TrainingFile.txt

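and so on, for 17,000 PowerShell commands in the .ps1 file, one command per line.

I tested a single command, and it took about 15 seconds to execute. Doing the math, 17,000 × 15 seconds = 255,000 seconds, so the .ps1 script with its 17,000 commands would take about 3 days to run.

Is there a faster way?

4 answers:

Answer 0 (score: 2)

The performance is poor because a lot of extra work is being done. Let's look at the process as a pseudo-algorithm like this: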

select SSN (X) and masked SSN (X') from a list
read all rows from file
search each file row for string X
if found, replace with X'
save all rows to file
loop until all SSNs are processed

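So what's the problem? For each SSN replacement you process all the rows, not only the ones that need masking. That is a lot of extra work. If you have 100 rows and 10 replacements, you use 1000 steps where only 100 are needed. In addition, reading and saving the file creates disk IO. For a single operation that is usually not a problem, but multiply the IO cost by the loop count and you will see that quite a lot of time is wasted waiting on the disk.

For great performance, adjust the algorithm: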

read all rows from file
loop through rows
for current row, change X -> X'
save the result

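Why is this faster? 1) You read and save the file only once. Disk IO is slow. 2) You process each row only once, so no extra work is done. As for how to actually perform the X -> X' transformation, you have to define more carefully what the masking rules are.

Edit

Here is a more practical solution:

Since you already know the f(X) -> X' results, you should save the precalculated list to disk, like so: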

ssn,mask
"123-45-6789","666-66-6666"
...
"223-45-6789","666-66-6667"

Import the file into a hashtable and carry on by stealing all the juicy bits from Ansgar's answer:

$ssnMask = @{}
$ssn = import-csv "c:\temp\SSNMasks.csv" -delimiter ","

# Add X -> X' to hashtable
$ssn | % {
  if(-not $ssnMask.ContainsKey($_.ssn)) {
    # It's an error to add existing key, so check first 
    $ssnMask.Add($_.ssn, $_.mask)
  }
}

$dataToMask = get-content "c:\temp\training.txt"
$dataToMask | % {
   if ( $_ -match '(\d{3}-\d{2}-\d{4})' ) {
     # Replace SSN look-a-like with value from hashtable
     # NB: This simply removes SSNs that don't have a match in hashtable
     $_ -replace  $matches[1], $ssnMask[$matches[1]]
   } else {
     # No SSN on this line: pass it through unchanged so it is not dropped
     $_
   }
} | set-content "c:\temp\training2.txt"
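Note that the snippet above only handles the SSN that -match finds first on each line. If a line can contain several different SSNs, one possible variation is to let the regex engine visit every match and look each one up in the hashtable. The following is a minimal sketch, not part of the original answer: the function name Convert-SsnLine and the inline sample mappings are hypothetical, and unlike the snippet above it leaves unknown SSNs untouched instead of removing them.

```powershell
# Hypothetical sample mappings; in practice load them from the CSV as shown above.
$ssnMask = @{
    '123-45-6789' = '666-66-6666'
    '223-45-6789' = '666-66-6667'
}
$ssnPattern = [regex]'\d{3}-\d{2}-\d{4}'

# Convert-SsnLine is an illustrative name, not an existing cmdlet.
function Convert-SsnLine {
    param([string]$Line)
    # The script block is used as a MatchEvaluator: it runs once per match.
    $ssnPattern.Replace($Line, {
        param($match)
        if ($ssnMask.ContainsKey($match.Value)) {
            $ssnMask[$match.Value]   # known SSN: emit its mask
        } else {
            $match.Value             # unknown SSN: assumption, keep it as-is
        }
    })
}

Convert-SsnLine 'A 123-45-6789 B 223-45-6789 C 999-99-9999'
# A 666-66-6666 B 666-66-6667 C 999-99-9999
```

To apply it to the whole file, pipe the lines through it, for example: get-content "c:\temp\training.txt" | % { Convert-SsnLine $_ } | set-content "c:\temp\training2.txt"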

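Answer 1 (score: 0)

Avoid reading and writing the file multiple times. I/O is expensive, and that is what slows your script down. Try something like this: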

$filename = 'C:\TrainingFile\TrainingFile.txt'

$ssnMap = @{}
(Get-Content $filename) | % {
  if ( $_ -match '(\d{3}-\d{2}-\d{4})' ) {
    # If SSN is found, check if a mapping of that SSN to a random SSN exists.
    # Otherwise create a new mapping.
    if ( -not $ssnMap.ContainsKey($matches[1]) ) {
      do {
        $rnd = Get-Random -Min 100000 -Max 999999
        $newSSN = "666-$($rnd -replace '(..)(....)','$1-$2')"
      } while ( $ssnMap.ContainsValue($newSSN) )  # loop to avoid collisions
      $ssnMap[$matches[1]] = $newSSN
    }

    # Replace the SSN with the corresponding randomly generated SSN.
    $_ -replace $matches[1], $ssnMap[$matches[1]]
  } else {
    # If no SSN is found, simply print the line.
    $_
  }
} | Set-Content $filename

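If you already have a list of random SSNs that are also mapped to specific "real" SSNs, you can read those mappings from a CSV (example column headers: realSSN, randomSSN) into the $ssnMap hashtable: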

$ssnMap = @{}
Import-Csv 'C:\mappings.csv' | % { $ssnMap[$_.realSSN] = $_.randomSSN }

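Answer 2 (score: 0)

If you have already generated a list of random SSNs to use as replacements, and each SSN in the file just needs to be replaced with one of them (not necessarily mapped to a specific replacement string), I think this will be much faster: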

$inputfile = 'C:\TrainingFile\TrainingFile.txt'
$outputfile = 'C:\TrainingFile\NewTrainingFile.txt'

$replacements = Get-Content 'C:\TrainingFile\SSN_Replacements.txt'

$i=0

Filter Replace-SSN { $_ -replace '\d{3}-\d{2}-\d{4}',$replacements[$i++] }

Get-Content $inputfile |
Replace-SSN |
Set-Content $outputfile

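This walks through your list of replacement SSNs, picking the next SSN in the list for each new replacement.

Edit:

Here is a solution that maps specific SSNs to specific replacement strings. It assumes you have a CSV file of the original SSNs and their intended replacement strings, in columns named 'OldSSN' and 'NewSSN':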

$inputfile = 'C:\TrainingFile\TrainingFile.txt'
$outputfile = 'C:\TrainingFile\NewTrainingFile.txt'
$replacementfile = 'C:\TrainingFile\SSN_Replacements.csv' 

$SSNmatch = [regex]'\d{3}-\d{2}-\d{4}'

$replacements = @{}

Import-Csv $replacementfile |
 ForEach-Object { $replacements[$_.OldSSN] = $_.NewSSN }

Get-Content $inputfile -ReadCount 1000 |
 ForEach-Object {
  foreach ($Line in $_) {
    if ( $Line -match $SSNmatch ) { # Found SSN in line
      if ( $replacements.ContainsKey($matches[0]) ) { # Found replacement string for this SSN
        $Line -replace $SSNmatch,$replacements[$matches[0]] # Replace SSN and output line
      }
      else {
        # No replacement known: warn; the line is not written to the output
        Write-Warning "Warning - no replacement string found for $($matches[0])"
      }
    }
    else { $Line } # No SSN in this line - output line as-is
  }
 } | Set-Content $outputfile

Answer 3 (score: -1)

# Fairly fast PowerShell code for masking up to 1000 SSN number per line in a large text file (with unlimited # of lines in the file) where the SSN matches the pattern of " ###-##-#### ", " ##-####### ", or " ######### ".
# This code can handle a 14 MB text file that has SSN numbers in nearly every row within about 4 minutes.


# $inputFilename = 'C:/InputFile.txt'

$inputFileName = "
1                                                                                                                                    
           0550       125665    338066                                                                                               
-                   02 CR05635                                  07/06/16                                                             
0     SAMPLE CUSTOMER NAME                                                                                                   
      PO BOX 12345                                                                                                                  
      ROSEVILLE CA 12345-9109                                                                                                        




 EMPLOYEE DEFERRALS                                                                                        
 FREDDIE MAC RO 16 9385456   164-44-9120     XXX                                                                               
 SALLY MAE RO 95 9385356   07-4719130     XXX                                                                               
 FRED FLINTSTONE RO 95 1185456   061741130     XXX  
 WILMA FLINTSTONE RO 91 9235456   364-74-9130  123456789 123456389 987354321    XXX                                                          
 PEBBLES RUBBLE RO 10 9235456 06-3749130  064-74-9150  034-74-9130  XXX                                                                               
 BARNEY RUBBLE RO 11 9235456 06-3449130 06-3749140 063-74-9130     XXX                                                                               
 BETTY RUBBLE RO 16 9235456   9-74-9140  123456789 123456789 987654321    XXX                                                                               

 PLEASE ENTER BELOW ANY ADDITIONAL PARTICIPANTS FOR WHOM YOU ARE                                                                     
 REMITTING.  FOR GENERAL INFORMATION AND SERVICE CALL                                                                              
"

$outputFilename = 'D:/OutFile.txt'

#(Get-Content $inputFilename ) | % {

($inputFilename ) | % {

       $NewLine=$_
       # Write-Host "0 new line value is ($NewLine)."
       $ChangeFound='Y'

       $WhileCounter=0


       While (($ChangeFound -eq 'Y') -and ($WhileCounter -lt 1000))
       {
       $WhileCounter=$WhileCounter+1
       $ChangeFound='N'

       $matches = $NewLine | Select-String -pattern "[ ][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9][ |\t|\r|\n]" -AllMatches
       If ($matches.length -gt 0)
       {
          $ChangeFound='Y'
          $NewLine=''
          for($i = 0; $i -lt 1; $i++){
              for($k = 0; $k -lt 1; $k++){
                  # Write-Host "AmHere 1a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
                  $t = $matches[$i] -replace $matches[$i].matches[$k].value, (" ###-##-" + $matches[$i].matches[$k].value.substring(8) )
                  $NewLine=$NewLine + $t
                  # Write-Host "AmHere 1b `$i ($i), `$k ($k), `$NewLine ($NewLine)."

              }
          }
          # Write-Host "1 new line value is ($NewLine)."
       }
       $matches = $NewLine | Select-String -pattern "[ ][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][ |\t|\r|\n]" -AllMatches
       If ($matches.length -gt 0)
       {
          $ChangeFound='Y'
          $NewLine=''
          for($i = 0; $i -lt 1; $i++){
              for($k = 0; $k -lt 1; $k++){
                  # Write-Host "AmHere 2a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
                  $t = $matches[$i] -replace $matches[$i].matches[$k].value, (" ##-###" + $matches[$i].matches[$k].value.substring(7) )
                  $NewLine=$NewLine + $t
                  # Write-Host "AmHere 2b `$i ($i), `$k ($k), `$NewLine ($NewLine)."
              }
          }
          # Write-Host "2 new line value is ($NewLine)."
       }
       $matches = $NewLine | Select-String -pattern "[ ][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][ |\t|\r|\n]" -AllMatches
       If ($matches.length -gt 0)
       {
          $ChangeFound='Y'
          $NewLine=''
          for($i = 0; $i -lt 1; $i++){
              for($k = 0; $k -lt 1; $k++){
                  # Write-Host "AmHere 3a `$i ($i), `$k ($k), `$NewLine ($NewLine)."
                  $t = $matches[$i] -replace $matches[$i].matches[$k].value, (" #####" + $matches[$i].matches[$k].value.substring(6) )
                  $NewLine=$NewLine + $t
                  # Write-Host "AmHere 3b `$i ($i), `$k ($k), `$NewLine ($NewLine)."
              }
          }
          #print the line
          # Write-Host "3 new line value is ($NewLine)."
       }
       # Write-Host "4 new line value is ($NewLine)."

       } # end of DoWhile
       Write-Host "5 new line value is ($NewLine)."

       $NewLine

    # Replace the SSN with the corresponding randomly generated SSN.
    # $_ -replace $matches[1], $ssnMap[$matches[1]]
 } | Set-Content $outputFilename
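
For comparison, the same kind of masking (hide everything except the last four digits for the three patterns listed at the top of this answer) can probably be expressed with three -replace calls. This is a minimal sketch under stated assumptions, not the code above: the function name Hide-SsnDigits is hypothetical, and it uses lookarounds so the surrounding whitespace is not consumed, which means adjacent SSNs on one line are handled in a single pass. Unlike the patterns above, it also accepts an SSN at the very end of a line.

```powershell
# Hide-SsnDigits is an illustrative name, not an existing cmdlet.
function Hide-SsnDigits {
    param([string]$Line)
    # ###-##-#### : keep the last four digits
    $Line = $Line -replace '(?<=\s)\d{3}-\d{2}-(\d{4})(?=\s|$)', '###-##-$1'
    # ##-####### : keep the last four digits
    $Line = $Line -replace '(?<=\s)\d{2}-\d{3}(\d{4})(?=\s|$)', '##-###$1'
    # ######### : keep the last four digits
    $Line = $Line -replace '(?<=\s)\d{5}(\d{4})(?=\s|$)', '#####$1'
    $Line
}

Hide-SsnDigits ' FREDDIE MAC RO 16 9385456   164-44-9120     XXX'
#  FREDDIE MAC RO 16 9385456   ###-##-9120     XXX
```

Reading a real file line by line, the call would look like: Get-Content $inputFilename | % { Hide-SsnDigits $_ } | Set-Content $outputFilename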