I have a very large text file (45 GB). Each line of the file contains two space-separated 64-bit unsigned integers, as shown below.
4624996948753406865 10214715013130414417
4305027007407867230 4569406367070518418
10817905656952544704 3697712211731468838
...
I want to read the file and perform some operations on the numbers.
My code in C++:
    void process_data(string str)
    {
        vector<string> arr;
        boost::split(arr, str, boost::is_any_of(" \n"));
        do_some_operation(arr);
    }
    int main()
    {
        unsigned long long int read_bytes = 45 * 1024 * 1024;
        const char* fname = "input.txt";
        ifstream fin(fname, ios::in);
        char* memblock;
        while(!fin.eof())
        {
            memblock = new char[read_bytes];
            fin.read(memblock, read_bytes);
            string str(memblock);
            process_data(str);
            delete [] memblock;
        }
        return 0;
    }
I am relatively new to C++. When I run this code, I run into the following problems:
Because the file is read in fixed-size blocks of bytes, the last line of a block is sometimes an incomplete line of the original file (for example "4624996948753406865 10214" instead of the actual line "4624996948753406865 10214715013130414417").
The code runs very slowly. Processing one block takes around 6 seconds on a 64-bit Intel Core i7 920 system with 6 GB of RAM. Are there any optimization techniques I can use to improve the runtime?
Is it necessary to include "\n" along with the blank character in the boost split function?
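To make the first two problems concrete, here is a minimal sketch of one possible approach, not a definitive fix: keep the unfinished tail of each block and prepend it to the next block, and parse the numbers with std::strtoull instead of splitting into strings. The names Pair, parse_complete_lines, for_each_block and count_pairs are hypothetical, and the sketch assumes every line has the form "a b" followed by "\n".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical helper: parses every complete line in `buf` (assumed "a b\n")
// and returns how many bytes were consumed, i.e. the offset just past the
// last '\n'. Whatever follows that offset is an unfinished line.
std::size_t parse_complete_lines(const std::string& buf, std::vector<Pair>& out)
{
    std::size_t pos = 0;
    for (;;) {
        const std::size_t nl = buf.find('\n', pos);
        if (nl == std::string::npos) break;            // rest is a partial line
        const char* p = buf.c_str() + pos;
        const char* line_end = buf.c_str() + nl;
        char* mid = nullptr;
        char* end = nullptr;
        const unsigned long long a = std::strtoull(p, &mid, 10);
        const unsigned long long b = std::strtoull(mid, &end, 10);
        // Keep the pair only if both numbers were found inside this line;
        // malformed or blank lines are skipped.
        if (mid != p && end != mid && end <= line_end) out.emplace_back(a, b);
        pos = nl + 1;
    }
    return pos;
}

// Reads `in` in blocks of `block_size` bytes and hands the pairs of each
// block to `on_pairs`. The unfinished tail of a block is carried over.
template <typename Callback>
void for_each_block(std::istream& in, std::size_t block_size, Callback on_pairs)
{
    assert(block_size > 0);
    std::vector<char> block(block_size);               // allocated once, reused
    std::string buf;                                   // carried tail + new block
    std::vector<Pair> pairs;
    while (in.read(block.data(), static_cast<std::streamsize>(block.size()))
           || in.gcount() > 0) {
        buf.append(block.data(), static_cast<std::size_t>(in.gcount()));
        pairs.clear();
        const std::size_t used = parse_complete_lines(buf, pairs);
        buf.erase(0, used);                            // keep only the tail
        on_pairs(pairs);
    }
    if (!buf.empty()) {                                // last line without '\n'
        buf.push_back('\n');
        pairs.clear();
        parse_complete_lines(buf, pairs);
        on_pairs(pairs);
    }
}

// Usage example on an in-memory stream; a real run would pass an
// std::ifstream opened with std::ios::binary instead.
std::size_t count_pairs(const std::string& text, std::size_t block_size)
{
    std::istringstream in(text);
    std::size_t n = 0;
    for_each_block(in, block_size,
                   [&](const std::vector<Pair>& v) { n += v.size(); });
    return n;
}
```

Compared with my code above, this uses in.gcount() to know how many bytes were actually read, so it never relies on the buffer being null-terminated, and it allocates the block buffer once rather than on every iteration.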
I have read about memory-mapped files (mmap) in C++, but I am not sure whether that is the right approach here. If it is, please point me to some links.
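For reference, this is roughly what I understand an mmap-based reader would look like. It is only a sketch under several assumptions: a POSIX system, a 64-bit process (so a 45 GB mapping fits in the address space), and well-formed "a b" lines. The names parse_u64 and for_each_pair_mmap are hypothetical, and overflow checking is omitted. Because the mapped region is not null-terminated, the digits are parsed by hand with an explicit end pointer rather than with std::strtoull.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: reads decimal digits from [p, end) into `value` and
// returns the first position that is not a digit. No overflow checking.
static const char* parse_u64(const char* p, const char* end, std::uint64_t& value)
{
    value = 0;
    while (p < end && *p >= '0' && *p <= '9') {
        value = value * 10 + static_cast<std::uint64_t>(*p - '0');
        ++p;
    }
    return p;
}

// Maps the whole file read-only and calls on_pair(a, b) for every line of
// the form "a b". Returns false if the file cannot be opened or mapped.
template <typename Callback>
bool for_each_pair_mmap(const char* path, Callback on_pair)
{
    const int fd = ::open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return false; }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) { ::close(fd); return true; }       // nothing to map
    void* map = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);                                       // the mapping stays valid
    if (map == MAP_FAILED) return false;
    ::posix_madvise(map, size, POSIX_MADV_SEQUENTIAL); // access-pattern hint only

    const char* p = static_cast<const char*>(map);
    const char* const end = p + size;
    while (p < end) {
        std::uint64_t a = 0, b = 0;
        const char* q = parse_u64(p, end, a);
        bool ok = q > p && q < end && *q == ' ';
        if (ok) {
            const char* r = parse_u64(q + 1, end, b);
            ok = r > q + 1;                            // second number present
            q = r;
        }
        if (ok) on_pair(a, b);
        // Skip to the start of the next line (or stop at the end of the file).
        const void* nl = std::memchr(q, '\n', static_cast<std::size_t>(end - q));
        p = nl ? static_cast<const char*>(nl) + 1 : end;
    }
    ::munmap(map, size);
    return true;
}
```

With this shape there is no block boundary to worry about, since the operating system pages the file in on demand and the parser sees one contiguous range. Whether it is actually faster than block reads for a file much larger than RAM is something I would still have to measure.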