I have a large csv file with more than 1 million rows. Each row has two features, callsite (the location of an API invocation) and a sequence of tokens to the callsite. They are written as:
callsite 1, token 1, token 2, token 3, ...
callsite 1, token 3, token 4, token 4, token 6, ...
callsite 2, token 3, token 1, token 6, token 7, ...
I want to shuffle the rows and split them into two files (for training and testing). The problem is that I want to split than according to the callsites instead of the rows. There may be more than one row belonging to one callsite. So I first read all the callsites, shuffle and split them as follows:
import csv
import random
with open(file,'r') as csv_file:
reader = csv.reader(csv_file)
callsites = [row[0] for row in reader]
random.shuffle(callsites)
test_callsites = callsites[0:n_test] //n_test is the number of test cases
Then, I read each row from the csv file and compare the callsite to put it in the train.csv or test.csv as follows:
with open(file,'r') as csv_file, open('train.csv','w') as train_file, open('test.csv','w') as test_file:
reader = csv.reader(csv_file)
train_writer = csv.writer(train_file)
test_writer = csv.writer(test_file)
for row in reader:
if row[0] in test_callsites:
test_writer.writerow(row)
else:
train_writer.writerow(row)
The problem is that the code works too slow, more than one day to finish. The comparison for each row causes the complexity O(n^2). And the read and write row by row may also be not efficient. But I am afraid that loading all data in the memory would cause memory error. Is there any better way to deal with large files like that?
Would it be quicker if I use dataframe to read and write it? But the sequence length is varied each row. I tried to write the data as (put all tokens as a list in one column):
callsite, sequence
callsite 1, [token1||token2||token 3]
However, it seems not convenient to restore the [token 1||token 2||token 3] as sequences. Is there any better practice to store and restore that kind of data with variable length?