Here's my take on "fast enough". It zips through 116 MiB of CSV (2.5 million lines[1]) in ~1 second.
The result is then randomly accessible with zero copies, so there's no overhead (unless pages are swapped out).
For comparison:

- that's ~3x faster than a naive `wc csv.txt` on the same file
- it's about as fast as this perl one-liner (which lists the distinct field counts across all lines):

      perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields }' csv.txt

- it's only ~1.5x slower than `LANG=C wc csv.txt`, which avoids locale functionality
Here's the parser in all its glory:
    using CsvField = boost::string_ref;
    using CsvLine  = std::vector<CsvField>;
    using CsvFile  = std::vector<CsvLine>; // keep it simple :)

    struct CsvParser : qi::grammar<char const*, CsvFile()> {
        CsvParser() : CsvParser::base_type(lines)
        {
            using namespace qi;

            field = raw [*~char_(",\n")]
                    [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
            line  = field % ',';
            lines = line  % eol;
        }
        // rule declarations
        qi::rule<char const*, CsvFile()>  lines;
        qi::rule<char const*, CsvLine()>  line;
        qi::rule<char const*, CsvField()> field;
    };
The only tricky thing (and the only optimization there) is the semantic action that constructs a CsvField from the source iterator and the matched number of characters.
Here's the main:
    int main()
    {
        boost::iostreams::mapped_file_source csv("csv.txt");

        CsvFile parsed;
        if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
        {
            std::cout << (csv.size() >> 20) << " MiB parsed into "
                      << parsed.size() << " lines of CSV field values\n";
        }
    }
Printing
    116 MiB parsed into 2578421 lines of CSV field values
You can use the values just like you would a std::string:
    for (int i = 0; i < 10; ++i)
    {
        auto l = rand() % parsed.size();
        auto& line = parsed[l];
        auto c = rand() % line.size();

        std::cout << "Random field at L:" << l << " C:" << c << "\t" << line[c] << "\n";
    }
Which prints e.g.:

    Random field at L:1979500 C:2   sateen's
    Random field at L:928192 C:1    sackcloth's
    Random field at L:1570275 C:4   accompanist's
    Random field at L:479916 C:2    apparel's
    Random field at L:767709 C:0    pinks
    Random field at L:1174430 C:4   axioms
    Random field at L:1209371 C:4   wants
    Random field at L:2183367 C:1   Klondikes
    Random field at L:2142220 C:1   Anthony
    Random field at L:1680066 C:2   pines
The fully working sample is here: Live On Coliru
[1] I created the file by repeatedly appending the output of

    while read a && read b && read c && read d && read e
    do echo "$a,$b,$c,$d,$e"
    done < /etc/dictionaries-common/words

to csv.txt until it counted 2.5 million lines.