I am trying to parse TPC-H files with Boost Spirit Qi. My implementation is inspired by the employee example of Spirit Qi ( http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp ). The data is in CSV format and the tokens are delimited with a '|' character.
It works, but it is very slow (20 seconds for 1 GB).
Here is my Qi grammar for the lineitem file:
struct lineitem {
    int l_orderkey;
    int l_partkey;
    int l_suppkey;
    int l_linenumber;
    std::string l_quantity;
    std::string l_extendedprice;
    std::string l_discount;
    std::string l_tax;
    std::string l_returnflag;
    std::string l_linestatus;
    std::string l_shipdate;
    std::string l_commitdate;
    std::string l_recepitdate;
    std::string l_shipinstruct;
    std::string l_shipmode;
    std::string l_comment;
};

BOOST_FUSION_ADAPT_STRUCT( lineitem,
    (int, l_orderkey)
    (int, l_partkey)
    (int, l_suppkey)
    (int, l_linenumber)
    (std::string, l_quantity)
    (std::string, l_extendedprice)
    (std::string, l_discount)
    (std::string, l_tax)
    (std::string, l_returnflag)
    (std::string, l_linestatus)
    (std::string, l_shipdate)
    (std::string, l_commitdate)
    (std::string, l_recepitdate)
    (std::string, l_shipinstruct)
    (std::string, l_shipmode)
    (std::string, l_comment))

vector<lineitem>* lineitems=new vector<lineitem>();

phrase_parse(state->dataPointer,
    state->dataEndPointer,
    (*(int_ >> "|" >>
       int_ >> "|" >>
       int_ >> "|" >>
       int_ >> "|" >>
       +(char_ - '|') >> "|" >>
       +(char_ - '|') >> "|" >>
       +(char_ - '|') >> "|" >>
       +(char_ - '|') >> "|" >>
       +(char_ - '|') >> '|' >>
       +(char_ - '|') >> '|' >>
       +(char_ - '|') >> '|' >>
       +(char_ - '|') >> '|' >>
       +(char_ - '|') >> '|' >>
       +(char_ - '|') >> '|' >>
       +(char_ - '|') >> '|' >>
       +(char_ - '|') >> '|'
    ) ), space, *lineitems
);

The problem seems to be the character parsing. It is much slower than other conversions. Is there a better way to parse variable length tokens into strings?
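
For comparison, here is a rough sketch of a hand-written baseline that could be timed against the grammar above. It is not Spirit code and only uses the standard library: it looks for each '|' with std::memchr and builds every token with a single range construction rather than one character at a time. The helper name split_fields is hypothetical, and the sketch assumes a single record is passed in as a pointer range with no newline handling.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost or of the code above):
// splits the range [first, last) into fields delimited by '|'.
// A trailing '|' terminates the last field and does not create an
// extra empty one.
std::vector<std::string> split_fields(const char* first, const char* last) {
    std::vector<std::string> fields;
    while (first < last) {
        // Find the next delimiter in the remaining input.
        const void* hit =
            std::memchr(first, '|', static_cast<std::size_t>(last - first));
        const char* bar = hit ? static_cast<const char*>(hit) : last;

        // Build the token in one step from the range [first, bar).
        fields.emplace_back(first, bar);

        // Skip the delimiter, or stop if no delimiter was found.
        first = (bar == last) ? last : bar + 1;
    }
    return fields;
}
```

If a baseline like this turns out to be much faster on the same data, that would suggest the cost lies in how the string attributes are filled rather than in reading the file.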