Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I tried to merge two data.frames, and they are like below:

   GVKEY YEAR coperol     delta     vega firm_related_wealth
1 001045 1992       1  38.88885 17.86943            2998.816
2 001045 1993       1  33.57905 19.19287            2286.418
3 001045 1994       1  48.54719 16.85830            3924.053
4 001045 1995       1 111.46762 38.71565            8550.903
5 001045 1996       1 218.89279 45.59413           17834.921
6 001045 1997       1 415.61461 51.45863           34279.515

AND

   GVKEY YEAR fracdirafter fracdirafterindep twfracdirafter
1 001004 1996         1.00              0.70    1.000000000
2 001004 1997         0.00              0.00    0.000000000
3 001004 1998         0.00              0.00    0.000000000
4 001004 1999         0.00              0.00    0.000000000
5 001004 2000         0.00              0.00    0.000000000
6 001004 2001         0.25              0.25    0.009645437

They both have 1,048,575 rows. My code is merge(a,b,by=c("GVKEY","YEAR")), I kept receiving error message "negative length vectors are not allowed". I also tried the data.table way, but got error message saying that my results would exceed 2^31 rows. Apparently, the merged data will not be so big, so I am not sure how to solve this issue.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
1.4k views
Welcome To Ask or Share your Answers For Others

1 Answer

You are getting this error because the data.frame / data.table created by the join has more than 2^31 - 1 rows (2,147,483,647).

Due to the way vectors are constructed internally by R, the maximum length of any vector is 2^31 - 1 elements (see: https://stackoverflow.com/a/5234293/2341679). Since a data.frame / data.table is really a list() of vectors, this limit also applies to the number of rows.

As other people have commented and answered, unfortunately you won't be able to construct this data.table, and its likely there are that many rows because of duplicate matches between your two data.tables (these may or may not be intentional on your part).

The good news is, if the duplicate matches are not errors, and you still want to perform the join, there is a way around it: you just need to do whatever computation you wanted to do on the resulting data.table in the same call as the join using the data.table[] operator, e.g.:

dt_left[dt_right, on = .(GVKEY, YEAR), 
        j = .(sum(firm_related_wealth), mean(fracdirafterindep),
        by = .EACHI]

If you're not familiar with the data.table syntax, you can perform calculations on columns within a data.table as shown above using the j argument. When performing a join using this syntax, computation in j is performed on the data.table created by the join.

The key here is the by = .EACHI argument. This breaks the join (and subsequent computation in j) down into smaller components: one data.table for each row in dt_right and its matches in dt_left, avoiding the problem of creating a data.table with > 2^31 - 1 rows.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...