Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Requires universal CSV parser by specification RFC 4180. There is the csv file, with all the problems of the specification:

Excel opens the file as it is written in the specification:

Anyone does work regex for parse it?

CSV File

"a
b
c","x
y
z",357
test;test,xxx;xxx,152
"test2,test2","xxx2,xxx2",123
"test3""test3","xxx3""xxx3",987
,qwe,13
asd,123,
,,
,123,
,,123
123,,
123,123

Expected Results

Table by EXCEL

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
563 views
Welcome To Ask or Share your Answers For Others

1 Answer

NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is will require that your regex engine treats multiple named capture groups using the same name as one single capture group. (.NET does this by default)


###About the pattern When one or more lines/records of a CSV file/stream (matching RFC standard 4180) are passed to the regular expression below it will return a match for each non-empty line/record. Each match will contain a capture group named Value that contains the captured values in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).

Here's the commented pattern (test it on Regexstorm.net):

(?<=
|
|^)(?!
|
|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,
]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=
|
|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=
|
|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:
|
|
|$)                              // Records end at EOL

Here's the raw pattern without all the comments or whitespace.
(?<=
|
|^)(?!
|
|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,
]+)|"(?<OpenValue>(?:[^"]|"")*)(?=
|
|$)|(?<Value>))(?:,|(?=
|
|$)))+?(?:(?<=,)(?<Value>))?(?:
|
|
|$)

[Here is a visualization from Debuggex.com][3] (capture groups named for clarity): ![Debuggex.com visualization][4]

###Usage examples:

Simple example for reading an entire CSV file/stream at once (test it on C# Pad):
(For better performance and less impact on system resources you should use the second example)

using System.Text.RegularExpressions;

Regex CSVParser = new Regex(
    @"(?<=
|
|^)(?!
|
|$)" +
    @"(?:" +
        @"(?:" +
            @"""(?<Value>(?:[^""]|"""")*)""|" +
            @"(?<Value>(?!"")[^,
]+)|" +
            @"""(?<OpenValue>(?:[^""]|"""")*)(?=
|
|$)|" +
            @"(?<Value>)" +
        @")" +
        @"(?:,|(?=
|
|$))" +
    @")+?" +
    @"(?:(?<=,)(?<Value>))?" +
    @"(?:
|
|
|$)",
    RegexOptions.Compiled);

String CSVSample =
    ",record1 value2,val3,"value 4","testing ""embedded double quotes"""," +
    ""testing quoted "","" character", value 7,,value 9," +
    ""testing empty """" embedded quotes"," +
    ""testing a quoted value" + Environment.NewLine +
    Environment.NewLine +
    "that includes CR/LF patterns" + Environment.NewLine +
    Environment.NewLine +
    "(which we wish would never happen - but it does)", after CR/LF" + Environment.NewLine +
    Environment.NewLine +
    ""testing an open ended quoted value" + Environment.NewLine +
    Environment.NewLine +
    ",value 2 ,value 3," + Environment.NewLine +
    ""test"";

MatchCollection CSVRecords = CSVParser.Matches(CSVSample);

for (Int32 recordIndex = 0; recordIndex < CSVRecords.Count; recordIndex++)
{
    Match Record = CSVRecords[recordIndex];

    for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
    {
        Capture c = Record.Groups["Value"].Captures[valueIndex];
        Console.Write("R" + (recordIndex + 1) + ":V" + (valueIndex + 1) + " = ");

        if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '"')
        {
            // No need to unescape/undouble quotes if the value is empty, the value starts
            // at the beginning of the record, or the character before the value is not a
            // quote (not a quoted value)
            Console.WriteLine(c.Value);
        }
        else
        {
            // The character preceding this value is a quote
            // so we need to unescape/undouble any embedded quotes
            Console.WriteLine(c.Value.Replace("""", """));
        }
    }
    
    foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
        Console.WriteLine("ERROR - Open ended quoted value: " + OpenValue.Value);
}

Better example for reading a large CSV file/stream without reading the entire file/stream into a string (test it [on C# Pad][6]).
using System.IO;
using System.Text.RegularExpressions;

// Same regex from before shortened to one line for brevity
Regex CSVParser = new Regex(
    @"(?<=
|
|^)(?!
|
|$)(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,
]+)|""(?<OpenValue>(?:[^""]|"""")*)(?=
|
|$)|(?<Value>))(?:,|(?=
|
|$)))+?(?:(?<=,)(?<Value>))?(?:
|
|
|$)",
    RegexOptions.Compiled);

String CSVSample = ",record1 value2,val3,"value 4","testing ""embedded double quotes""","testing quoted "","" character", value 7,,value 9,"testing empty """" embedded quotes","testing a quoted value," + 
    Environment.NewLine + Environment.NewLine + "that includes CR/LF patterns" + Environment.NewLine + Environment.NewLine + "(which we wish would never happen - but it does)", after CR/LF," + Environment.NewLine + Environment
    .NewLine + ""testing an open ended quoted value" + Environment.NewLine + Environment.NewLine + ",value 2 ,value 3," + Environment.NewLine + ""test"";

using (StringReader CSVReader = new StringReader(CSVSample))
{
    String CSVLine = CSVReader.ReadLine();
    StringBuilder RecordText = new StringBuilder();
    Int32 RecordNum = 0;

    while (CSVLine != null)
    {
        RecordText.AppendLine(CSVLine);
        MatchCollection RecordsRead = CSVParser.Matches(RecordText.ToString());
        Match Record = null;
        
        for (Int32 recordIndex = 0; recordIndex < RecordsRead.Count; recordIndex++)
        {
            Record = RecordsRead[recordIndex];
        
            if (Record.Groups["OpenValue"].Success && recordIndex == RecordsRead.Count - 1)
            {
                // We're still trying to find the end of a muti-line value in this record
                // and it's the last of the records from this segment of the CSV.
                // If we're not still working with the initial record we started with then
                // prep the record text for the next read and break out to the read loop.
                if (recordIndex != 0)
                    RecordText.AppendLine(Record.Value);
                
                break;
            }
            
            // Valid record found or new record started before the end could be found
            RecordText.Clear();            
            RecordNum++;
            
            for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
            {
                Capture c = Record.Groups["Value"].Captures[valueIndex];
                Console.Write("R" + RecordNum + ":V" + (valueIndex + 1) + " = ");
                if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '"')
                    Console.WriteLine(c.Value);
                else
                    Console.WriteLine(c.Value.Replace("""", """));
            }
            
            foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
                Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
        }
        
        CSVLine = CSVReader.ReadLine();
        
        if (CSVLine == null && Record != null)
        {
            RecordNum++;
            
            //End of file - still working on an open value?
            foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
                Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
        }
    }
}

Both examples return the same result of:

R1:V1 =
R1:V2 = record1 value2
R1:V3 = val3
R1:V4 = value 4
R1:V5 = testing "embedded double quotes"
R1:V6 = testing quoted "," character
R1:V7 = value 7
R1:V8 =
R1:V9 = value 9
R1:V10 = testing empty "" embedded quotes
R1:V11 = testing a quoted value
<b


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...