使用带引号和未带引号的字符串分隔逗号分隔的字符串

时间:2010-09-23 08:13:11

标签: c# regex

我有以下逗号分隔的字符串,我需要拆分。问题是某些内容在引号内,并且包含不应在分割中使用的逗号...

字符串:

111,222,"33,44,55",666,"77,88","99"

我想要输出:

111  
222  
33,44,55  
666  
77,88  
99  

我试过这个:

(?:,?)((?<=")[^"]+(?=")|[^",]+)   

但它以“77,88”,“99”之间的逗号作为命中,我得到以下输出:

111  
222  
33,44,55  
666  
77,88  
,  
99  

任何人都可以帮助我吗?我用完了几个小时...... :) /彼得

16 个答案:

答案 0 :(得分:81)

根据您的需要,您可能无法使用csv解析器,实际上可能想要重新发明轮子!

您可以使用一些简单的正则表达式

(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)

这将执行以下操作:

(?:^|,) =匹配表达式“行或字符串的开头,

(\"(?:[^\"]+|\"\")*\"|[^,]*) =一个带编号的捕获组,它将在两个选项之间进行选择:

  1. 引用的东西
  2. 逗号之间的东西
  3. 这应该可以为您提供所需的输出。

    C#中的示例代码

     static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
    
    public static string[] SplitCSV(string input)
    {
    
      List<string> list = new List<string>();
      string curr = null;
      foreach (Match match in csvSplit.Matches(input))
      {        
        curr = match.Value;
        if (0 == curr.Length)
        {
          list.Add("");
        }
    
        list.Add(curr.TrimStart(','));
      }
    
      return list.ToArray();
    }
    
    private void button1_Click(object sender, RoutedEventArgs e)
    {
        Console.WriteLine(SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\""));
    }
    

    警告根据@M者的评论 - 如果一个形成错误的csv文件中出现一个流氓新行字符而你最终会出现不均匀(“字符串”),你将会遇到灾难性的回溯({{ 3}})你的正则表达式和你的系统可能会崩溃(就像我们的生产系统一样)。可以很容易地在Visual Studio中复制,因为我发现它会崩溃。一个简单的try / catch也不会陷入这个问题。

    您应该使用:

    (?:^|,)(\"(?:[^\"])*\"|[^,]*)
    

    代替

答案 1 :(得分:14)

我真的很喜欢jimplode的答案,但我认为带有回报率的版本更有用,所以这里是:

public IEnumerable<string> SplitCSV(string input)
{
    Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

    foreach (Match match in csvSplit.Matches(input))
    {
        yield return match.Value.TrimStart(',');
    }
}

让它像扩展方法一样更有用:

public static class StringHelper
{
    public static IEnumerable<string> SplitCSV(this string input)
    {
        Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

        foreach (Match match in csvSplit.Matches(input))
        {
            yield return match.Value.TrimStart(',');
        }
    }
}

答案 2 :(得分:4)

这个正则表达式无需循环遍历值和TrimStart(','),就像在接受的答案中一样:

((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))

以下是C#中的实现:

string values = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";

MatchCollection matches = new Regex("((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))").Matches(values);

foreach (var match in matches)
{
    Console.WriteLine(match);
}

输出

111  
222  
33,44,55  
666  
77,88  
99  

答案 3 :(得分:3)

当字符串在引号内有逗号时,这些答案都不起作用,如"value, 1"或转义双引号,如"value ""1"""valid CSV应解析分别为value, 1value "1"

如果您传入制表符而不是逗号作为分隔符,这也可以使用制表符分隔格式。

public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
    var currentString = new StringBuilder();
    var inQuotes = false;
    var quoteIsEscaped = false; //Store when a quote has been escaped.
    row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
    foreach (var character in row.Select((val, index) => new {val, index}))
    {
        if (character.val == delimiter) //We hit a delimiter character...
        {
            if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
            {
                Console.WriteLine(currentString);
                yield return currentString.ToString();
                currentString.Clear();
            }
            else
            {
                currentString.Append(character.val);
            }
        } else {
            if (character.val != ' ')
            {
                if(character.val == '"') //If we've hit a quote character...
                {
                    if(character.val == '\"' && inQuotes) //Does it appear to be a closing quote?
                    {
                        if (row[character.index + 1] == character.val) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
                        {
                            quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
                        }
                        else if (quoteIsEscaped)
                        {
                            quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
                            currentString.Append(character.val);
                        }
                        else
                        {
                            inQuotes = false;
                        }
                    }
                    else
                    {
                        if (!inQuotes)
                        {
                            inQuotes = true;
                        }
                        else
                        {
                            currentString.Append(character.val); //...It's a quote inside a quote.
                        }
                    }
                }
                else
                {
                    currentString.Append(character.val);
                }
            }
            else
            {
                if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
                {
                    currentString.Append(character.val);
                }
            }
        }
    }
}

答案 4 :(得分:3)

对“Chad Hedgcock”提供的功能进行细微更新。

更新已开启:

第26行:character.val =='\“' - 由于在第24行进行检查,因此永远不会成立。即character.val =='”'

第28行:if(row [character.index + 1] == character.val)添加!quoteIsEscaped以逃脱3个连续引号。

public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
var inQuotes = false;
var quoteIsEscaped = false; //Store when a quote has been escaped.
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new {val, index}))
{
    if (character.val == delimiter) //We hit a delimiter character...
    {
        if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
        {
            //Console.WriteLine(currentString);
            yield return currentString.ToString();
            currentString.Clear();
        }
        else
        {
            currentString.Append(character.val);
        }
    } else {
        if (character.val != ' ')
        {
            if(character.val == '"') //If we've hit a quote character...
            {
                if(character.val == '"' && inQuotes) //Does it appear to be a closing quote?
                {
                    if (row[character.index + 1] == character.val && !quoteIsEscaped) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
                    {
                        quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
                    }
                    else if (quoteIsEscaped)
                    {
                        quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
                        currentString.Append(character.val);
                    }
                    else
                    {
                        inQuotes = false;
                    }
                }
                else
                {
                    if (!inQuotes)
                    {
                        inQuotes = true;
                    }
                    else
                    {
                        currentString.Append(character.val); //...It's a quote inside a quote.
                    }
                }
            }
            else
            {
                currentString.Append(character.val);
            }
        }
        else
        {
            if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
            {
                currentString.Append(character.val);
            }
        }
    }
}

}

答案 5 :(得分:3)

快速简便:

    public static string[] SplitCsv(string line)
    {
        List<string> result = new List<string>();
        StringBuilder currentStr = new StringBuilder("");
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++) // For each character
        {
            if (line[i] == '\"') // Quotes are closing or opening
                inQuotes = !inQuotes;
            else if (line[i] == ',') // Comma
            {
                if (!inQuotes) // If not in quotes, end of current string, add it to result
                {
                    result.Add(currentStr.ToString());
                    currentStr.Clear();
                }
                else
                    currentStr.Append(line[i]); // If in quotes, just add it 
            }
            else // Add any other character to current string
                currentStr.Append(line[i]); 
        }
        result.Add(currentStr.ToString());
        return result.ToArray(); // Return array of all strings
    }

将此字符串作为输入:

 111,222,"33,44,55",666,"77,88","99"

它将返回:

111  
222  
33,44,55  
666  
77,88  
99  

答案 6 :(得分:2)

Don't reinvent CSV解析器,请尝试FileHelpers

答案 7 :(得分:2)

试试这个:

       string s = @"111,222,""33,44,55"",666,""77,88"",""99""";

       List<string> result = new List<string>();

       var splitted = s.Split('"').ToList<string>();
       splitted.RemoveAll(x => x == ",");
       foreach (var it in splitted)
       {
           if (it.StartsWith(",") || it.EndsWith(","))
           {
               var tmp = it.TrimEnd(',').TrimStart(',');
               result.AddRange(tmp.Split(','));
           }
           else
           {
               if(!string.IsNullOrEmpty(it)) result.Add(it);
           }
      }
       //Results:

       foreach (var it in result)
       {
           Console.WriteLine(it);
       }

答案 8 :(得分:2)

对于Jay的回答,如果你使用第二个布尔值,那么你可以在单引号内嵌套双引号,反之亦然。

    private string[] splitString(string stringToSplit)
{
    char[] characters = stringToSplit.ToCharArray();
    List<string> returnValueList = new List<string>();
    string tempString = "";
    bool blockUntilEndQuote = false;
    bool blockUntilEndQuote2 = false;
    int characterCount = 0;
    foreach (char character in characters)
    {
        characterCount = characterCount + 1;

        if (character == '"' && !blockUntilEndQuote2)
        {
            if (blockUntilEndQuote == false)
            {
                blockUntilEndQuote = true;
            }
            else if (blockUntilEndQuote == true)
            {
                blockUntilEndQuote = false;
            }
        }
        if (character == '\'' && !blockUntilEndQuote)
        {
            if (blockUntilEndQuote2 == false)
            {
                blockUntilEndQuote2 = true;
            }
            else if (blockUntilEndQuote2 == true)
            {
                blockUntilEndQuote2 = false;
            }
        }

        if (character != ',')
        {
            tempString = tempString + character;
        }
        else if (character == ',' && (blockUntilEndQuote == true || blockUntilEndQuote2 == true))
        {
            tempString = tempString + character;
        }
        else
        {
            returnValueList.Add(tempString);
            tempString = "";
        }

        if (characterCount == characters.Length)
        {
            returnValueList.Add(tempString);
            tempString = "";
        }
    }

    string[] returnValue = returnValueList.ToArray();
    return returnValue;
}

答案 9 :(得分:1)

我知道我有点迟到了,但是对于搜索来说,这就是我在C sharp中所做的事情

private string[] splitString(string stringToSplit)
    {
        char[] characters = stringToSplit.ToCharArray();
        List<string> returnValueList = new List<string>();
        string tempString = "";
        bool blockUntilEndQuote = false;
        int characterCount = 0;
        foreach (char character in characters)
        {
            characterCount = characterCount + 1;

            if (character == '"')
            {
                if (blockUntilEndQuote == false)
                {
                    blockUntilEndQuote = true;
                }
                else if (blockUntilEndQuote == true)
                {
                    blockUntilEndQuote = false;
                }
            }

            if (character != ',')
            {
                tempString = tempString + character;
            }
            else if (character == ',' && blockUntilEndQuote == true)
            {
                tempString = tempString + character;
            }
            else
            {
                returnValueList.Add(tempString);
                tempString = "";
            }

            if (characterCount == characters.Length)
            {
                returnValueList.Add(tempString);
                tempString = "";
            }
        }

        string[] returnValue = returnValueList.ToArray();
        return returnValue;
    }

答案 10 :(得分:1)

目前我使用以下正则表达式:

    public static Regex regexCSVSplit = new Regex(@"(?x:(
        (?<FULL>
        (^|[,;\t\r\n])\s*
        ( (?<CODAT> (?<CO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<CO>\s*)[,;\t\r\n])*)\k<CO>) |
          (?<CODAT> (?<DAT> [^""',;\s\r\n]* )) )
        (?=\s*([,;\t\r\n]|$))
        ) |
        (?<FULL>
        (^|[\s\t\r\n])
        ( (?<CODAT> (?<CO>[""'])(?<DAT> [^""',;\s\t\r\n]* )\k<CO>) |
        (?<CODAT> (?<DAT> [^""',;\s\t\r\n]* )) )
        (?=[,;\s\t\r\n]|$))
        ))", RegexOptions.Compiled);

此解决方案可以处理非常混乱的情况,如下所示: enter image description here

这是将结果输入数组的方法:

    var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().Select(x => x.Groups["DAT"].Value).ToArray();

在行动HERE

中查看此示例

答案 11 :(得分:1)

我需要一些更强大的东西,所以我从这里开始创建了这个......这个解决方案有点不那么优雅,而且更加冗长,但在我的测试中(有1,000,000行样本),我发现了这个要快2到3倍。此外,它还处理非转义嵌入式报价。由于我的解决方案的要求,我使用字符串分隔符和限定符而不是字符。我发现找到一个好的,通用的CSV解析器比我预期的更难,所以我希望这个解析算法可以帮助某人。

    public static string[] SplitRow(string record, string delimiter, string qualifier, bool trimData)
    {
        // In-Line for example, but I implemented as string extender in production code
        Func <string, int, int> IndexOfNextNonWhiteSpaceChar = delegate (string source, int startIndex)
        {
            if (startIndex >= 0)
            {
                if (source != null)
                {
                    for (int i = startIndex; i < source.Length; i++)
                    {
                        if (!char.IsWhiteSpace(source[i]))
                        {
                            return i;
                        }
                    }
                }
            }

            return -1;
        };

        var results = new List<string>();
        var result = new StringBuilder();
        var inQualifier = false;
        var inField = false;

        // We add new columns at the delimiter, so append one for the parser.
        var row = $"{record}{delimiter}";

        for (var idx = 0; idx < row.Length; idx++)
        {
            // A delimiter character...
            if (row[idx]== delimiter[0])
            {
                // Are we inside qualifier? If not, we've hit the end of a column value.
                if (!inQualifier)
                {
                    results.Add(trimData ? result.ToString().Trim() : result.ToString());
                    result.Clear();
                    inField = false;
                }
                else
                {
                    result.Append(row[idx]);
                }
            }

            // NOT a delimiter character...
            else
            {
                // ...Not a space character
                if (row[idx] != ' ')
                {
                    // A qualifier character...
                    if (row[idx] == qualifier[0])
                    {
                        // Qualifier is closing qualifier...
                        if (inQualifier && row[IndexOfNextNonWhiteSpaceChar(row, idx + 1)] == delimiter[0])
                        {
                            inQualifier = false;
                            continue;
                        }

                        else
                        {
                            // ...Qualifier is opening qualifier
                            if (!inQualifier)
                            {
                                inQualifier = true;
                            }

                            // ...It's a qualifier inside a qualifier.
                            else
                            {
                                inField = true;
                                result.Append(row[idx]);
                            }
                        }
                    }

                    // Not a qualifier character...
                    else
                    {
                        result.Append(row[idx]);
                        inField = true;
                    }
                }

                // ...A space character
                else
                {
                    if (inQualifier || inField)
                    {
                        result.Append(row[idx]);
                    }
                }
            }
        }

        return results.ToArray<string>();
    }

一些测试代码:

        //var input = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";

        var input =
            "111, 222, \"99\",\"33,44,55\" ,      \"666 \"mark of a man\"\", \" spaces \"77,88\"   \"";

        Console.WriteLine("Split with trim");
        Console.WriteLine("---------------");
        var result = SplitRow(input, ",", "\"", true);
        foreach (var r in result)
        {
            Console.WriteLine(r);
        }
        Console.WriteLine("");

        // Split 2
        Console.WriteLine("Split with no trim");
        Console.WriteLine("------------------");
        var result2 = SplitRow(input, ",", "\"", false);
        foreach (var r in result2)
        {
            Console.WriteLine(r);
        }
        Console.WriteLine("");

        // Time Trial 1
        Console.WriteLine("Experimental Process (1,000,000) iterations");
        Console.WriteLine("-------------------------------------------");
        watch = Stopwatch.StartNew();
        for (var i = 0; i < 1000000; i++)
        {
            var x1 = SplitRow(input, ",", "\"", false);
        }
        watch.Stop();
        elapsedMs = watch.ElapsedMilliseconds;
        Console.WriteLine($"Total Process Time: {string.Format("{0:0.###}", elapsedMs / 1000.0)} Seconds");
        Console.WriteLine("");

结果

Split with trim
---------------
111
222
99
33,44,55
666 "mark of a man"
spaces "77,88"

Split with no trim
------------------
111
222
99
33,44,55
666 "mark of a man"
 spaces "77,88"

Original Process (1,000,000) iterations
-------------------------------
Total Process Time: 7.538 Seconds

Experimental Process (1,000,000) iterations
--------------------------------------------
Total Process Time: 3.363 Seconds

答案 12 :(得分:0)

我曾经不得不做类似的事情,最后我遇到了正则表达式。正则表达式无法拥有状态使得它非常棘手 - 我最后写了一个简单的小解析器

如果你正在进行CSV解析,你应该坚持使用CSV解析器 - 不要重新发明轮子。

答案 13 :(得分:0)

这是我基于字符串原始指针操作的最快实现:

string[] FastSplit(string sText, char? cSeparator = null, char? cQuotes = null)
    {            
        string[] oTokens;

        if (null == cSeparator)
        {
            cSeparator = DEFAULT_PARSEFIELDS_SEPARATOR;
        }

        if (null == cQuotes)
        {
            cQuotes = DEFAULT_PARSEFIELDS_QUOTE;
        }

        unsafe
        {
            fixed (char* lpText = sText)
            {
                #region Fast array estimatation

                char* lpCurrent      = lpText;                    
                int   nEstimatedSize = 0;

                while (0 != *lpCurrent)
                {
                    if (cSeparator == *lpCurrent)
                    {
                        nEstimatedSize++;
                    }

                    lpCurrent++;
                }

                nEstimatedSize++; // Add EOL char(s)
                string[] oEstimatedTokens = new string[nEstimatedSize];

                #endregion

                #region Parsing

                char[] oBuffer = new char[sText.Length];
                int    nIndex  = 0;
                int    nTokens = 0;

                lpCurrent      = lpText;

                while (0 != *lpCurrent)
                {
                    if (cQuotes == *lpCurrent)
                    {
                        // Quotes parsing

                        lpCurrent++; // Skip quote
                        nIndex = 0;  // Reset buffer

                        while (
                               (0       != *lpCurrent)
                            && (cQuotes != *lpCurrent)
                        )
                        {
                            oBuffer[nIndex] = *lpCurrent; // Store char

                            lpCurrent++; // Move source cursor
                            nIndex++;    // Move target cursor
                        }

                    } 
                    else if (cSeparator == *lpCurrent)
                    {
                        // Separator char parsing

                        oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex); // Store token
                        nIndex                      = 0;                              // Skip separator and Reset buffer
                    }
                    else
                    {
                        // Content parsing

                        oBuffer[nIndex] = *lpCurrent; // Store char
                        nIndex++;                     // Move target cursor
                    }

                    lpCurrent++; // Move source cursor
                }

                // Recover pending buffer

                if (nIndex > 0)
                {
                    // Store token

                    oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex);
                }

                // Build final tokens list

                if (nTokens == nEstimatedSize)
                {
                    oTokens = oEstimatedTokens;
                }
                else
                {
                    oTokens = new string[nTokens];
                    Array.Copy(oEstimatedTokens, 0, oTokens, 0, nTokens);
                }

                #endregion
            }
        }

        // Epilogue            

        return oTokens;
    }

答案 14 :(得分:0)

试试这个

private string[] GetCommaSeperatedWords(string sep, string line)
    {
        List<string> list = new List<string>();
        StringBuilder word = new StringBuilder();
        int doubleQuoteCount = 0;
        for (int i = 0; i < line.Length; i++)
        {
            string chr = line[i].ToString();
            if (chr == "\"")
            {
                if (doubleQuoteCount == 0)
                    doubleQuoteCount++;
                else
                    doubleQuoteCount--;

                continue;
            }
            if (chr == sep && doubleQuoteCount == 0)
            {
                list.Add(word.ToString());
                word = new StringBuilder();
                continue;
            }
            word.Append(chr);
        }

        list.Add(word.ToString());

        return list.ToArray();
    }

答案 15 :(得分:0)

这是用基于状态的逻辑重写的乍得答案。当遇到"""BRAD"""作为字段时,他的回答对我来说失败了。那应该返回"BRAD",但是它只吃掉了所有剩余的字段。当我尝试调试它时,我最终将其重写为基于状态的逻辑:

enum SplitState { s_begin, s_infield, s_inquotefield, s_foundquoteinfield };
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
    var currentString = new StringBuilder();
    SplitState state = SplitState.s_begin;
    row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
    foreach (var character in row.Select((val, index) => new { val, index }))
    {
        //Console.WriteLine("character = " + character.val + " state = " + state);
        switch (state)
        {
            case SplitState.s_begin:
                if (character.val == delimiter)
                {
                    /* empty field */
                    yield return currentString.ToString();
                    currentString.Clear();
                } else if (character.val == '"')
                {
                    state = SplitState.s_inquotefield;
                } else
                {
                    currentString.Append(character.val);
                    state = SplitState.s_infield;
                }
                break;
            case SplitState.s_infield:
                if (character.val == delimiter)
                {
                    /* field with data */
                    yield return currentString.ToString();
                    state = SplitState.s_begin;
                    currentString.Clear();
                } else
                {
                    currentString.Append(character.val);
                }
                break;
            case SplitState.s_inquotefield:
                if (character.val == '"')
                {
                    // could be end of field, or escaped quote.
                    state = SplitState.s_foundquoteinfield;
                } else
                {
                    currentString.Append(character.val);
                }
                break;
            case SplitState.s_foundquoteinfield:
                if (character.val == '"')
                {
                    // found escaped quote.
                    currentString.Append(character.val);
                    state = SplitState.s_inquotefield;
                }
                else if (character.val == delimiter)
                {
                    // must have been last quote so we must find delimiter
                    yield return currentString.ToString();
                    state = SplitState.s_begin;
                    currentString.Clear();
                }
                else
                {
                    throw new Exception("Quoted field not terminated.");
                }
                break;
            default:
                throw new Exception("unknown state:" + state);
        }
    }
    //Console.WriteLine("currentstring = " + currentString.ToString());
}

这比其他解决方案多了很多代码行,但是很容易修改以添加边缘情况。