How to convert a comma-delimited file to pipe-delimited in VB.NET

Time: 2014-12-12 18:46:42

Tags: regex vb.net csv replace substring

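There are plenty of search results online (and on SO) for things similar to what I need to do, but I have not come across a solution for my specific situation.

I have a comma-delimited file in which only the columns that contain commas are wrapped in double quotes. The other fields, which contain no commas, are simply separated by commas.

For example: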

123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"

The output for that line must be:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man|africa|yellow|jump,rope

I currently have this code:

Using sr As New StreamReader(csvFilePath)
    Dim line As String = ""
    Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
    Dim strReplacerQuoteComma As String = Chr(34) & ","
    Dim strReplacerCommaQuote As String = "," & Chr(34)

    Do While sr.Peek <> -1
        line = sr.ReadLine
        line = Replace(line, strReplacerQuoteCommaQuote, "|")
        line = Replace(line, strReplacerQuoteComma, "|")
        line = Replace(line, strReplacerCommaQuote, "|")
        line = Replace(line, Chr(34), "")

        Console.WriteLine("line: " & line)
    Loop
End Using

The problem with this approach is that by the time the fourth Replace call has run, the string looks like this:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man,africa,yellow|jump,rope

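So man and africa each need a pipe after them, but obviously I can't just replace every comma.

How can I do this? Is there a regex that can handle it?

Update with working code

The link in Avinash's comment had the answer. I imported System.Text.RegularExpressions and used the following: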

Using sr As New StreamReader(csvFilePath)
    ' Matches a comma only when the rest of the line holds balanced quotes,
    ' i.e. the comma sits outside any quoted field.
    Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
    Dim replacement As String = "|"
    ' Build the Regex once instead of once per line.
    Dim regEx As New Regex(pattern)

    Do While sr.Peek <> -1
        Dim line As String = sr.ReadLine
        Dim newLine As String = regEx.Replace(line, replacement)
        ' Strip the double quotes that wrapped the comma-containing fields.
        newLine = newLine.Replace(Chr(34), "")

        Console.WriteLine("newLine: " & newLine)
    Loop
End Using

3 Answers:

Answer 0 (score: 3)

This seems to work for your example:

Dim result = Regex.Replace(input, ",(?=([^""]*""[^""]*"")*[^""]*$)", Function(m) m.Value.Replace(",", "|"))
result = result.Replace(Chr(34), "")


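See the linked accepted answer for an explanation of the regex, and be sure to give its author credit while you are there, because I basically just stole his regex.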
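To make the lookahead a little more concrete, here is a minimal sketch that runs the same pattern against the sample line from the question. The module name `LookaheadDemo` and the helper `ToPipeDelimited` are made up for illustration. The idea is that a comma is replaced only when an even number of double quotes remains between it and the end of the line, which means the comma sits outside a quoted field. The callback from the snippet above is swapped for a plain replacement string here, since every match is a single comma.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaheadDemo
    ' Same pattern as in the snippet above: match a comma only when the
    ' remainder of the line contains an even number of double quotes.
    Private ReadOnly OutsideQuotes As New Regex(",(?=([^""]*""[^""]*"")*[^""]*$)")

    ' Hypothetical helper: converts one CSV line to a pipe-delimited line.
    Function ToPipeDelimited(line As String) As String
        Dim piped As String = OutsideQuotes.Replace(line, "|")
        ' Drop the quotes that wrapped the comma-containing fields.
        Return piped.Replace(Chr(34), "")
    End Function

    Sub Main()
        Dim input As String = "123,""box,toy"",phone,""red,car,cat,dog"",""bike,pencil"",man,africa,yellow,""jump,rope"""
        ' Should print the pipe-delimited line the question asks for.
        Console.WriteLine(ToPipeDelimited(input))
    End Sub
End Module
```

Note that this sketch, like the original snippet, assumes quotes are balanced on every line and that fields never contain escaped quotes.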

Edit: Regarding your performance concerns, I created a file containing 90k lines of:

abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,yellow,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

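That comes to a file size of roughly 35MB, and my laptop (nothing special) parses it in about 6.5 seconds.

Yes, regex is slow, and the TextFieldParser class is also widely reported as not being the fastest, but if your processing still takes more than 5 minutes, your code clearly has some other bottleneck. Note that I am not actually doing anything with the parsed result.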
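The answer does not show how the timing was taken. One way to measure it yourself, as a rough sketch, is to wrap the read-and-replace loop in a `System.Diagnostics.Stopwatch`. The module name `TimingSketch` and the helper `TimeConversion` are hypothetical, and the input path is whatever you pass on the command line.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module TimingSketch
    ' Hypothetical helper: reads every line of a file, applies the regex
    ' replacement, discards the result, and returns the elapsed time.
    Function TimeConversion(path As String) As TimeSpan
        Dim rgx As New Regex(",(?=([^""]*""[^""]*"")*[^""]*$)")
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Using sr As New StreamReader(path)
            While Not sr.EndOfStream
                ' The converted line is thrown away; only the time matters here.
                rgx.Replace(sr.ReadLine(), "|")
            End While
        End Using
        watch.Stop()
        Return watch.Elapsed
    End Function

    Sub Main(args As String())
        If args.Length = 0 Then
            Console.WriteLine("usage: TimingSketch <input file>")
            Return
        End If
        Console.WriteLine("Elapsed: " & TimeConversion(args(0)).TotalSeconds.ToString("F2") & " s")
    End Sub
End Module
```

If a loop like this is fast on your file but the full program is not, the bottleneck is probably in whatever is done with each converted line rather than in the regex itself.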

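Edit 2: OK, I thought I would give this one last go (I was quite bored this morning), but I still cannot reproduce your long conversion times.

Time to get brutal: I created an input file containing 150k lines of: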

abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

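That is 1140 characters per line, for a total file size of about 167MB.

Reading, converting and writing back to a new file with the following code took 29 seconds.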

Dim line, result As String
Dim replace As String = ",(?=([^""]*""[^""]*"")*[^""]*$)"
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = Regex.Replace(line, replace, Function(m) m.Value.Replace(",", "|"))
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

Edit 3: Using @sln's regex, this code cut the processing time for the same file to 4 seconds.

Dim line, result As String
Dim pattern As String = ",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)"
Dim replacement As String = "|$1"
Dim rgx As New Regex(pattern)
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = rgx.Replace(line, replacement)
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

So there you go, I think you have a winner. As sln states, this is a relative test, so machine speed is irrelevant.

,(?=([^"]*"[^"]*")*[^"]*$)          took 29 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)  took 4 seconds

Finally (and just for completeness), the solution proposed by @jawood2005 is quite workable:

Dim line As String
Dim fields As String()
Using sw As New StreamWriter("d:\output.txt")
    Using tfp As New FileIO.TextFieldParser("d:\input.txt")
        tfp.TextFieldType = FileIO.FieldType.Delimited
        tfp.Delimiters = New String() {","}
        tfp.HasFieldsEnclosedInQuotes = True
        While Not tfp.EndOfData
            fields = tfp.ReadFields
            line = String.Join("|", fields)
            sw.WriteLine(line.Replace(Chr(34), ""))
        End While
    End Using
End Using

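Using the same 150k-line input file as the regex solutions, this completed in 18 seconds, so it is better than mine, but sln wins the prize for the fastest solution to your problem.

Answer 1 (score: 3)

The bulletproof way.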

 # Validate even quotes (one time match):  ^[^"]*(?:"[^"]*"[^"]*)*$   
 # Then ->
 # ----------------------------------------------
 # Find:  /,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/
 # Replace:  '|$1'

 ,
 (                             # (1 start)
      [^,"]*  
      (?: " [^"]* " )?
      [^,"]*  
 )                             # (1 end)
 (?= , | $ )
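Since the question is about VB.NET, the two steps above (validate once, then find and replace) could be sketched roughly as follows. The module name `ValidateThenReplace` and the helper `ConvertLine` are invented for this example, returning `Nothing` for an unbalanced line is just one possible way to report the failure, and the final quote stripping comes from the question's required output rather than from the pattern itself.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ValidateThenReplace
    ' One-time check that the line contains an even number of double quotes.
    Private ReadOnly EvenQuotes As New Regex("^[^""]*(?:""[^""]*""[^""]*)*$")
    ' Consumes each field together with the comma in front of it, so no
    ' lookahead to the end of the string is needed per comma.
    Private ReadOnly FieldAfterComma As New Regex(",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)")

    ' Hypothetical helper: returns Nothing when the quotes are unbalanced.
    Function ConvertLine(line As String) As String
        If Not EvenQuotes.IsMatch(line) Then Return Nothing
        ' "|$1" puts a pipe in front of the captured field; the quotes are
        ' removed afterwards, as the question requires.
        Return FieldAfterComma.Replace(line, "|$1").Replace(Chr(34), "")
    End Function

    Sub Main()
        Console.WriteLine(ConvertLine("123,""box,toy"",phone,man,africa"))
        ' An odd number of quotes fails validation, so this prints True.
        Console.WriteLine(ConvertLine("123,""box,toy,phone") Is Nothing)
    End Sub
End Module
```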

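Benchmark

Since @TheBlueDog posted a benchmark ('Edit 2'), I thought I would post one as well.

It is based on his input, and its intent is to show the evils of using a 'to the end of the string' lookahead as a validation technique (i.e. this -> ^[^"]*(?:"[^"]*"[^"]*)*$ ).

Blue Dog's regex replacement approach was handicapped a bit by an unnecessary callback, so I imagine that accounts for some of his bad numbers.

I don't know VB.NET, so this was done in Perl. Machine speed and language are both factored out, because it is a relative test.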

Summary:

,(?=([^"]*"[^"]*")*[^"]*$)          took 10 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)  took 2 seconds  

That is a 5x difference.

Benchmark in Perl, 150K lines (167MB file):

use strict;
use warnings;

use Benchmark ':hireswallclock';
my ($t0,$t1);
my ($infile, $outfile);

my $tstr = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
';

# =================================================
print "\nMaking 150K line (167MB file), csv_data_in.txt ...";

open( $infile, ">", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for writing $!";
for (1 .. 150_000)
{
   print $infile $tstr;
}
close( $infile );

print "\nDone !\n\n";

# =================================================
print "Converting delimiters, writing to csv_data_out.txt ...";

open( $infile, "<", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for reading $!";
open( $outfile, ">", 'csv_data_out.txt' ) or die "can't open 'csv_data_out.txt' for writing $!";

my $line = '';

$t0 = new Benchmark;
while( $line = <$infile> )
{
    # Validation - Uncomment to check line for even quotes, otherwise don't
    # if ( $line =~ /^[^"]*(?:"[^"]*"[^"]*)*$/ )
    # {
        $line =~ s/,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/|$1/g;
    # }
    print $outfile $line;
}
$t1 = new Benchmark;

close( $infile );
close( $outfile );

print "\nDone !\n";
print "Conversion took: ", timestr(timediff($t1, $t0)), "\n\n";

Output:

Making 150K line (167MB file), csv_data_in.txt ...
Done !

Converting delimiters, writing to csv_data_out.txt ...
Done !
Conversion took: 2.1216 wallclock secs ( 1.87 usr +  0.17 sys =  2.04 CPU)

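Answer 2 (score: 1)

This may not be the best solution, but it should work...

I am 99% sure you are using a StreamReader ("sr") to read the file. Try reading it with FileIO.TextFieldParser instead, which lets you split each line into a string array.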

Dim aFile As FileIO.TextFieldParser = New FileIO.TextFieldParser(filePath)
Dim temp() As String ' this array will hold each line of data
Dim order As doOrder = Nothing
Dim orderID As Integer
Dim myDate As DateTime = Now.ToString

aFile.TextFieldType = FileIO.FieldType.Delimited
aFile.Delimiters = New String() {","}
aFile.HasFieldsEnclosedInQuotes = True

temp = aFile.ReadFields

' parse the actual file
Do While Not aFile.EndOfData...

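Inside the loop, keep calling "aFile.ReadFields" to read the next line. Once you have the String array, you can join the fields with pipes between them. It is a bit messy and it is not a regex (I don't know whether that was an actual requirement or just an idea), but it will get the job done.

Also, note "aFile.HasFieldsEnclosedInQuotes = True", since that is one of the conditions you listed.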
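As a small illustration of the join step, here is a sketch that feeds a single in-memory line to TextFieldParser through a StringReader and joins the resulting fields with pipes. The module name `ParserSketch` and the helper `ConvertLine` are hypothetical. For a real file you would construct the parser on the file path and loop until EndOfData, as described above.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module ParserSketch
    ' Hypothetical helper: parses a single CSV line held in memory.
    Function ConvertLine(line As String) As String
        Using parser As New TextFieldParser(New StringReader(line))
            parser.TextFieldType = FieldType.Delimited
            parser.Delimiters = New String() {","}
            parser.HasFieldsEnclosedInQuotes = True
            ' ReadFields returns the fields with their enclosing quotes
            ' removed, or Nothing when there is no data left to read.
            Dim fields As String() = parser.ReadFields()
            If fields Is Nothing Then Return String.Empty
            Return String.Join("|", fields)
        End Using
    End Function

    Sub Main()
        Console.WriteLine(ConvertLine("123,""box,toy"",phone,man"))
    End Sub
End Module
```

Because the parser already removes the enclosing quotes, no separate quote-stripping pass is needed in this sketch.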

Edit: I see The Blue Dog posted a regex answer while I was typing... You may still want to use TextFieldParser, since you are reading a delimited file. I'll go away now.