Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I want to efficiently parse large CSV-like files whose column order I only learn at runtime. With Spirit Qi, I would parse each field with a lazy auxiliary parser that selects at runtime which column-specific parser to apply to each column. But X3 doesn't seem to have lazy (even though it's listed in the documentation). After reading recommendations here on SO, I've decided to write a custom parser.

It ended up being pretty nice, but now I've noticed I don't really need the pos variable to be exposed anywhere outside the custom parser. I've tried moving it into the custom parser and started getting compiler errors stating that the column_value_parser object is read-only. Can I somehow put pos into the parser structure?

Simplified code that gets the compile-time error, with commented out parts of my working version:

#include <iostream>
#include <string>
#include <variant>
#include <vector>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;
    // size_t& pos;
    size_t pos;

    // column_value_parser(std::vector<column_variant>& columns, size_t& pos)
    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    //    , pos(pos)
        , pos(0)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx& ctx, Other const& other, Attr& attr) const {
        auto const saved_f = f;
        bool successful = false;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text& c) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer& c) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real& c) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }
};


int main(int argc, char *argv[])
{
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    // Comes from external source.
    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};
    size_t pos = 0;

    boost::spirit::x3::parse(
        input.begin(), input.end(),
//         (column_value_parser(columns, pos) % ',') % boost::spirit::x3::eol);
        (column_value_parser(columns) % ',') % boost::spirit::x3::eol);
}

XY: My goal is to parse ~500 GB of pseudo-CSV files in a reasonable time on a machine with little RAM, convert them into a list of (roughly) [row-number, column-name, value], then put that into storage. The format is actually a little more complex than CSV: database dumps formatted in a… human-friendly way, with column values actually being several small sublanguages (e.g. dates or, uh, something similar to whole Apache log lines stuffed into a single field), and I'm often extracting only one specific part of each column. Different files may have different columns in a different order, which I can only learn by parsing yet another set of files containing the original queries. Thankfully, Spirit makes it a breeze…

1 Answer

Three answers:

  1. The easiest fix is to make pos a mutable member
  2. The X3 hardcore answer is x3::with<>
  3. Functional composition

1. Making pos mutable

Live On Wandbox

#include <iostream>
#include <string>
#include <variant>
#include <vector>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;
    // mutable lets the const parse() below advance the column index
    size_t mutable pos = 0;

    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx& /*ctx*/, Other const& /*other*/, Attr& /*attr*/) const {
        auto const saved_f = f;
        bool successful = false;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text&) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer&) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real&) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }
};


int main() {
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};

    boost::spirit::x3::parse(
        input.begin(), input.end(),
        (column_value_parser(columns) % ',') % boost::spirit::x3::eol);
}
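
The original error comes from a plain C++ rule rather than anything X3-specific: parse is a const member function, so it may only modify members that are declared mutable. A minimal sketch of just that rule, without Spirit (cycling_counter and next are hypothetical names used for illustration, not part of X3):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a parser whose parse() is const.
struct cycling_counter {
    std::size_t size;             // number of columns
    mutable std::size_t pos = 0;  // mutable: may change inside const members

    explicit cycling_counter(std::size_t n) : size(n) { assert(n > 0); }

    // const, like parse(): returns the current column index, then advances
    // it cyclically. Without `mutable` on pos this would not compile.
    std::size_t next() const {
        std::size_t const current = pos;
        pos = (pos + 1) % size;
        return current;
    }
};
```

Called on a const cycling_counter with size 3, next() yields 0, 1, 2 and then wraps back to 0.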

2. x3::with<>

This is similar but with better (re)entrancy and encapsulation:

Live On Wandbox

#include <iostream>
#include <string>
#include <variant>
#include <vector>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;

    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx const& ctx, Other const& /*other*/, Attr& /*attr*/) const {
        auto const saved_f = f;
        bool successful = false;

        // per-parse state injected by x3::with<pos_tag> in invoke() below
        size_t& pos = boost::spirit::x3::get<pos_tag>(ctx).value;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text&) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer&) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real&) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }

    template <typename T>
    struct Mutable { T mutable value; };
    struct pos_tag;

    auto invoke() const {
        return boost::spirit::x3::with<pos_tag>(Mutable<size_t>{}) [ *this ];
    }
};


int main() {
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};
    column_value_parser p(columns);

    boost::spirit::x3::parse(
        input.begin(), input.end(),
        (p.invoke() % ',') % boost::spirit::x3::eol);
}
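
To illustrate what "better (re)entrancy" buys here: when the position lives in a per-parse context instead of in the parser object, two parses that share one parser cannot disturb each other. The sketch below is only an analogy in plain C++ and does not use X3's context machinery; parse_context and column_cursor are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-parse state, playing the role of the value that
// x3::with<pos_tag> injects into the context.
struct parse_context {
    std::size_t pos = 0;
};

// Stateless "parser": const and without mutable members, because all
// changing state is owned by the context passed in by the caller.
struct column_cursor {
    std::size_t column_count;

    // Returns the current column index and advances it cyclically.
    std::size_t next(parse_context& ctx) const {
        assert(column_count > 0);
        std::size_t const current = ctx.pos;
        ctx.pos = (ctx.pos + 1) % column_count;
        return current;
    }
};
```

Two parse_context objects used with the same column_cursor each keep their own position, which is the property the mutable-member version lacks.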

3. Functional Composition

Because it's so much easier in X3, my favourite is to just generate the parser on demand.

Without further requirements, this is the simplest I'd propose:

Live On Wandbox

#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;

namespace CSV {
    struct text    { };
    struct integer { };
    struct real    { };
    struct skip    { };

    auto const unquoted_text_field = *~x3::char_(",\n");
    static inline auto as_parser(skip)    { return x3::omit[unquoted_text_field]; }
    static inline auto as_parser(text)    { return unquoted_text_field;           }
    static inline auto as_parser(integer) { return x3::int_;                      }
    static inline auto as_parser(real)    { return x3::double_;                   }

    template <typename... Spec>
    static inline auto line_parser(Spec... spec) {
        auto delim = ',' | &(x3::eoi | x3::eol);
        return ((as_parser(spec) >> delim) >> ... >> x3::eps);
    }

    template <typename... Spec> static inline auto csv_parser(Spec... spec) {
        return line_parser(spec...) % x3::eol;
    }
}

#include <iostream>
#include <iomanip>
using namespace CSV;

int main() {
    std::string const input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";
    auto f = begin(input), l = end(input);

    auto p = csv_parser(text{}, integer{}, real{}, skip{});

    if (parse(f, l, p)) {
        std::cout << "Parsed\n";
    } else {
        std::cout << "Failed\n";
    }

    if (f!=l) {
        std::cout << "Remaining: " << std::quoted(std::string(f,l)) << "\n";
    }
}

A version with debug information enabled:

Live On Wandbox

<line>
  <try>Hello,1,13.7,XXX\nWor</try>
  <CSV::text>
    <try>Hello,1,13.7,XXX\nWor</try>
    <success>,1,13.7,XXX\nWorld,2,</success>
  </CSV::text>
  <CSV::integer>
    <try>1,13.7,XXX\nWorld,2,1</try>
    <success>,13.7,XXX\nWorld,2,1e</success>
  </CSV::integer>
  <CSV::real>
    <try>13.7,XXX\nWorld,2,1e3</try>
    <success>,XXX\nWorld,2,1e3,YYY</success>
  </CSV::real>
  <CSV::skip>
    <try>XXX\nWorld,2,1e3,YYY</try>
    <success>\nWorld,2,1e3,YYY</success>
  </CSV::skip>
  <success>\nWorld,2,1e3,YYY</success>
</line>
<line>
  <try>World,2,1e3,YYY</try>
  <CSV::text>
    <try>World,2,1e3,YYY</try>
    <success>,2,1e3,YYY</success>
  </CSV::text>
  <CSV::integer>
    <try>2,1e3,YYY</try>
    <success>,1e3,YYY</success>
  </CSV::integer>
  <CSV::real>
    <try>1e3,YYY</try>
    <success>,YYY</success>
  </CSV::real>
  <CSV::skip>
    <try>YYY</try>
    <success></success>
  </CSV::skip>
  <success></success>
</line>
Parsed
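
The csv_parser above receives the column types as compile-time arguments. When the column list only arrives at runtime, as in the question, one option is to translate each column_variant into a type-erased field checker once, up front, and reuse that list for every line. The following is a rough sketch of that idea using only the standard library, not Spirit; field_parser, as_field_parser, make_parsers and parse_line are hypothetical names, and the checkers only validate fields instead of producing values:

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <string>
#include <string_view>
#include <system_error>
#include <variant>
#include <vector>

struct text {}; struct integer {}; struct real {}; struct skip {};
using column_variant = std::variant<text, integer, real, skip>;

template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Type-erased checker for a single field (assumption: validation only).
using field_parser = std::function<bool(std::string_view)>;

// Picks the checker for one column; runs once per column, not per field.
inline field_parser as_field_parser(column_variant const& column) {
    return std::visit(overloaded{
        [](text) -> field_parser { return [](std::string_view) { return true; }; },
        [](skip) -> field_parser { return [](std::string_view) { return true; }; },
        [](integer) -> field_parser {
            return [](std::string_view s) {
                int value = 0;
                auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
                return ec == std::errc{} && ptr == s.data() + s.size();
            };
        },
        [](real) -> field_parser {
            return [](std::string_view s) {
                std::string copy(s);  // strtod needs a null-terminated string
                char* end = nullptr;
                std::strtod(copy.c_str(), &end);
                return !copy.empty() && end == copy.c_str() + copy.size();
            };
        }
    }, column);
}

inline std::vector<field_parser> make_parsers(std::vector<column_variant> const& columns) {
    std::vector<field_parser> parsers;
    parsers.reserve(columns.size());
    for (auto const& column : columns) parsers.push_back(as_field_parser(column));
    return parsers;
}

// Checks one line (without its line terminator) against the column list.
inline bool parse_line(std::string_view line, std::vector<field_parser> const& parsers) {
    assert(!parsers.empty());
    for (std::size_t i = 0; i < parsers.size(); ++i) {
        std::size_t const comma = line.find(',');
        bool const last = (i + 1 == parsers.size());
        // The last field must have no comma after it; all others must.
        if (last != (comma == std::string_view::npos)) return false;
        if (!parsers[i](line.substr(0, comma))) return false;
        if (!last) line.remove_prefix(comma + 1);
    }
    return true;
}
```

With columns = {text{}, integer{}, real{}, skip{}}, parse_line accepts both sample lines from the question and rejects a line with a non-numeric second field or a missing column. std::function adds an indirect call per field, so whether this is fast enough for very large inputs would need measuring.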

Notes, Caveats:

  • With anything mutable, beware of side-effects. E.g. if you have a | b and a includes column_value_parser, the side-effect of incrementing pos will not be rolled back when a fails and b is matched instead.

    In short, this makes your parse function impure.
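
    The effect can be reproduced without Spirit. In the sketch below (all names are hypothetical and nothing here is X3 API), a toy alternative restores the input position when its first branch fails, but the counter that branch incremented stays incremented:

```cpp
#include <cstddef>
#include <string_view>

// Toy stateful matcher: consumes one digit and counts every success.
struct counting_digit {
    mutable std::size_t pos = 0;

    bool parse(std::string_view& in) const {
        if (in.empty() || in.front() < '0' || in.front() > '9') return false;
        in.remove_prefix(1);
        ++pos;  // side effect that nobody undoes later
        return true;
    }
};

// Consumes exactly the character ch.
inline bool lit(std::string_view& in, char ch) {
    if (in.empty() || in.front() != ch) return false;
    in.remove_prefix(1);
    return true;
}

// Sequence: digit followed by terminator; rolls the input back on failure.
inline bool seq(counting_digit const& d, std::string_view& in, char terminator) {
    std::string_view const saved = in;
    if (d.parse(in) && lit(in, terminator)) return true;
    in = saved;  // the input is restored, d.pos is not
    return false;
}

// Alternative: (digit ';') | (digit ','), both branches sharing one counter.
inline bool alternative(counting_digit const& d, std::string_view& in) {
    return seq(d, in, ';') || seq(d, in, ',');
}
```

    For the input "7," the alternative succeeds and consumes both characters, yet d.pos ends up as 2, because the failed first branch had already incremented it. Saving pos next to the input position and restoring both on failure would be one way to avoid this.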

