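A more efficient way to modify the contents of a CSV file

Time: 2015-05-22 14:33:05

Tags: powershell csv powershell-v3.0 ssms-2012

I'm trying to remove some of the cruft that SSMS 2012 generates when query results are exported as CSV.

For example, it writes the word 'NULL' for null values and adds milliseconds to datetime values: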

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00.000,LOREM IPSUM,10.3456
NULL,NULL,NULL,0

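Unfortunately, Excel doesn't automatically format datetime values with fractional seconds correctly, which causes confusion among customers ("What happened to the date field that I requested?") and creates more work (the CSV has to be converted to XLSX and the columns formatted correctly before distribution).

The goal is to strip the NULL and .000 values from the CSV file: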

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0

Excel will open this file and format it correctly without further technical assistance.

To this end, I wrote:

Function Invoke-CsvCleanser {

  [CmdletBinding()]
  Param(
    [parameter(Mandatory=$true)]
    [String]
    $Path,
    [switch]
    $Nulls,
    [switch]
    $Milliseconds
  )

  PROCESS {

    # open the file
    $data = Import-Csv $path

    # process each row
    $data | Foreach-Object { 

        # process each column
        Foreach ($property in $_.PSObject.Properties) {

            # if column contains 'NULL', replace it with ''
            if ($Nulls -and ($property.Value -eq 'NULL')) {
                $property.Value = $property.Value -replace 'NULL', ''
            }

            # if column contains a date/time value, remove milliseconds
            elseif ( $Milliseconds -and (isDate($property.Value)) ) {
                $property.Value = $property.Value -replace '\.000$', ''
            }
        } 

    } 

    # save file
    $data | Export-Csv -Path $Path -NoTypeInformation

  }

}

function IsDate($object) {
    [Boolean]($object -as [DateTime])
}

PS> Invoke-CsvCleanser 'C:\Users\Foobar\Desktop\0000.csv' -Nulls -Milliseconds

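This approach works fine when the file is small, but it is inefficient for large files. Ideally, Invoke-CsvCleanser would make use of the pipeline.

Is there a better way to do this?

1 Answer:

Answer 0 (score: 1)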

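Import-Csv is slow here because every row becomes an object and, once its output is assigned to a variable, the entire file sits in memory. Below is a modified version of the script from my answer to this question: CSV formatting - strip qualifier from specific fields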

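It processes the files as raw text, so it should be significantly faster. NULLs and milliseconds are matched with regular expressions. The script can convert CSVs in bulk.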
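
As a rough illustration of what those two patterns do to individual field values, here is a small sketch. The patterns are the ones used in the script further down; the sample values are made up for this example.

```powershell
# Patterns copied from the script below; the sample values are illustrative only.
$NullRegex = '^NULL$'
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

$samples = 'NULL', 'NULLABLE', '2015-05-01 23:00:00.000', '10.3456'

$cleaned = $samples | ForEach-Object {
    # '^NULL$' only matches a field that is exactly NULL;
    # '$1' keeps the hh:mm:ss part and drops the .fff part.
    ($_ -replace $NullRegex, '') -replace $MillisecondsRegex, '$1'
}

$cleaned
```

Because the NULL pattern is anchored, a value such as NULLABLE is left alone, and the milliseconds pattern does not touch plain decimals such as 10.3456.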

The regex used to split the CSV comes from this question: How to split a string by comma ignoring comma in double quotes
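
To see the split in isolation, the sketch below builds the same pattern the script uses and applies it to a single made-up line that contains a comma inside a quoted field.

```powershell
# Same construction as in the script; the sample line is illustrative only.
$DQuotes = '"'
$Separator = ','
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

$line = '1,"Doe, John",NULL'

# A comma is treated as a separator only when the rest of the line
# contains balanced double quotes, so the comma inside "Doe, John" is skipped.
$fields = $line -split $SplitRegex

$fields
```

The quoted field comes back with its surrounding quotes still attached, which is why the script has a separate switch for stripping them.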

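Save this script as Invoke-CsvCleanser.ps1. It accepts the following arguments:

  • InPath: folder to read CSVs from. If not specified, the current directory is used.
  • OutPath: folder to save the processed CSVs to. It will be created if it doesn't exist.
  • Encoding: if not specified, the script uses the system's current ANSI code page to read the files. You can list the other valid encodings for your system in the PowerShell console like this: [System.Text.Encoding]::GetEncodings()
  • DoubleQuotes: switch; if specified, surrounding double quotes are stripped from values
  • Nulls: switch; if specified, NULL strings are stripped from values
  • Milliseconds: switch; if specified, .000 strings are stripped from values
  • Verbose: the script tells you what is going on via Write-Verbose messages.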
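
A quick way to check an Encoding value before running the script is to resolve it the same way the script does. This is only a sketch; 'utf-8' and 'no-such-encoding' are example names chosen for illustration.

```powershell
# Show a few of the encodings .NET reports on this system.
[System.Text.Encoding]::GetEncodings() |
    Select-Object -Property Name, CodePage -First 5

# Resolve a name the way the script does; an unknown name throws.
$FileEncoding = [System.Text.Encoding]::GetEncoding('utf-8')

$isValid = $true
try
{
    [System.Text.Encoding]::GetEncoding('no-such-encoding') | Out-Null
}
catch
{
    # The script turns this failure into "Not valid encoding: ..."
    $isValid = $false
}
```

Any value from the Name column of the first command should be accepted by the Encoding parameter.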

Example:

Process all CSVs in the folder C:\CSVs_are_here, strip NULLs and milliseconds, save the processed CSVs to the folder C:\Processed_CSVs, and be verbose:

.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Nulls -Milliseconds -Verbose

Invoke-CsvCleanser.ps1 script:

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [switch]$Nulls,

    [switch]$Milliseconds,

    [switch]$DoubleQuotes
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Regex to match NULL
$NullRegex = '^NULL$'
# Regex to match milliseconds: 23:00:00.000
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                        ForEach-Object {

                            # Strip surrounding quotes
                            if($DoubleQuotes)
                            {
                                $_ = $_.Trim($DQuotes)
                            }

                            # Strip NULL strings
                            if($Nulls)
                            {
                                $_ = $_ -replace $NullRegex, ''
                            }

                            # Strip milliseconds
                            if($Milliseconds)
                            {
                                $_ = $_ -replace $MillisecondsRegex, '$1'
                            }

                            # Output current object to pipeline
                            $_
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }

Result:

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
  

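It would be interesting to see whether lambdas could be passed as a way to make the file processing more flexible. Each lambda would perform a specific activity (removing NULLs, upper-casing, normalizing text, etc.)

This version gives full control over the CSV processing. Just pass scriptblocks to the Action parameter in the order you want them executed.

Example: strip NULLs, strip milliseconds, and then strip double quotes.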

.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Action {$_ = $_ -replace '^NULL$', '' }, {$_ = $_ -replace '(\d{2}:\d{2}:\d{2})(\.\d{3})', '$1'}, {$_ = $_.Trim('"')}
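
The order of the scriptblocks matters. The sketch below shows this with a variation on the idea, not the script's own mechanism: each scriptblock takes the value as a parameter and returns the new value, instead of assigning to $_. Invoke-FieldAction and the two scriptblock variables are hypothetical names used only for this illustration.

```powershell
# Hypothetical helper: applies scriptblocks to one value, in the order given.
function Invoke-FieldAction {
    param(
        [string]$Value,
        [scriptblock[]]$Action
    )
    foreach ($scriptblock in $Action) {
        # Each scriptblock receives the current value and returns the new one
        $Value = & $scriptblock $Value
    }
    $Value
}

$stripQuotes = { param($v) $v.Trim('"') }
$stripNull   = { param($v) $v -replace '^NULL$', '' }

# Quotes removed first, so the anchored NULL pattern then matches.
$quotesFirst = Invoke-FieldAction -Value '"NULL"' -Action $stripQuotes, $stripNull

# NULL pattern runs first and does not match "NULL" with its quotes on.
$nullFirst = Invoke-FieldAction -Value '"NULL"' -Action $stripNull, $stripQuotes
```

With the quotes stripped first the field ends up empty; with the order reversed it ends up as the bare word NULL. The same applies to the command above: it strips NULLs before quotes, so a quoted "NULL" would survive as NULL.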

Invoke-CsvCleanser.ps1 with "lambdas":

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]    
    [scriptblock[]]$Action
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp =  $line -split $SplitRegex |
                        ForEach-Object {
                            # Process each item
                            foreach($scriptblock in $Action) {
                                . $scriptblock
                            }
                            # Output current object to pipeline
                            $_
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }