Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have a large csv file with more than 1 million rows. Each row has two features, callsite (the location of an API invocation) and a sequence of tokens to the callsite. They are written as:

callsite 1, token 1, token 2, token 3, ...
callsite 1, token 3, token 4, token 4, token 6, ...
callsite 2, token 3, token 1, token 6, token 7, ... 

I want to shuffle the rows and split them into two files (for training and testing). The problem is that I want to split than according to the callsites instead of the rows. There may be more than one row belonging to one callsite. So I first read all the callsites, shuffle and split them as follows:

import csv
import random
with open(file,'r') as csv_file:
    reader = csv.reader(csv_file)
    callsites = [row[0] for row in reader]
random.shuffle(callsites)
test_callsites = callsites[0:n_test] //n_test is the number of test cases

Then, I read each row from the csv file and compare the callsite to put it in the train.csv or test.csv as follows:

with open(file,'r') as csv_file, open('train.csv','w') as train_file, open('test.csv','w') as test_file:
    reader = csv.reader(csv_file)
    train_writer = csv.writer(train_file)
    test_writer = csv.writer(test_file)
    for row in reader:
        if row[0] in test_callsites:
            test_writer.writerow(row)
        else:
            train_writer.writerow(row)

The problem is that the code works too slow, more than one day to finish. The comparison for each row causes the complexity O(n^2). And the read and write row by row may also be not efficient. But I am afraid that loading all data in the memory would cause memory error. Is there any better way to deal with large files like that?

Would it be quicker if I use dataframe to read and write it? But the sequence length is varied each row. I tried to write the data as (put all tokens as a list in one column):

callsite,     sequence
callsite 1, [token1||token2||token 3]

However, it seems not convenient to restore the [token 1||token 2||token 3] as sequences. Is there any better practice to store and restore that kind of data with variable length?


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
4.1k views
Welcome To Ask or Share your Answers For Others

1 Answer

The simplest fix is to change:

test_callsites = callsites[0:n_test]

to

test_callsites = frozenset(callsites[:n_test])  # set also works; frozenset just reduces chance of mistakenly modifying it

This would reduce the work for each test of if row[0] in test_callsites: from O(n_test) to O(1), which would likely make a huge improvement if n_test is on the order of four digits or more (likely, when we're talking about millions of rows).

You could also slightly reduce the work (mostly in terms of improving memory locality by having a smaller bin of things being selected) in creating it in the first place by changing:

random.shuffle(callsites)
test_callsites = callsites[0:n_test]

to:

test_callsites = frozenset(random.sample(callsites, n_test))

which avoids reshuffling the whole of callsites in favor of selecting n_test values from it (which you then convert to a frozenset, or just set, for cheap lookup). Bonus, it's a one-liner. :-)


Side-note: Your code is potentially wrong as written. You must pass newline='' to your various calls to open to ensure that the chosen CSV dialect's newline preferences are honored.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...