I ran a group-by on a large dataset (>20 GB), and it doesn't appear to be working quite right.
This is my code:
mydf[, .(value = n_distinct(list_of_id, na.rm = T)),
by = .(week),
keep = c("list_of_id", "week")
]
It returned these warning messages:
Warning messages:
1: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
(warnings 2 through 8 are identical)
I had initially loaded the library, but then I ran remove.packages("MLmetrics") before running this code. Additionally, I checked conflicted::conflict_scout(), and no conflicts involving the package MLmetrics show up.
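For comparison, here is what I understand the pure data.table version of the same count to be (a sketch using made-up sample data; uniqueN is data.table's counterpart to dplyr's n_distinct):

```r
library(data.table)

# made-up sample data, assuming both columns are character as in my real table
mydf <- data.table(
  list_of_id = c("a", "b", "a", "c"),
  week       = c("2012-01-02", "2012-01-02", "2012-01-02", "2012-01-09")
)

# uniqueN counts distinct values; na.rm = TRUE drops NA ids from the count
result <- mydf[, .(value = uniqueN(list_of_id, na.rm = TRUE)), by = week]
result
#          week value
# 1: 2012-01-02     2
# 2: 2012-01-09     1
```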
When I run this code:
> mydf %>%
+ filter(week == "2012-01-02")
It gives me this output
week value
1: 2012-01-02 483
2: 2012-01-02 61233
I'm concerned that something went wrong during the grouping, since it didn't create distinct groups for each value of week. Both columns (list_of_id and week) are stored as character.
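One thing I wondered about is whether hidden characters could explain the duplicate groups. A small sketch with invented data shows how two visually identical character weeks can still form separate groups, and how normalizing with trimws() collapses them:

```r
library(data.table)

# invented data: note the trailing space in the second week value
mydf <- data.table(
  list_of_id = c("a", "b", "c"),
  week       = c("2012-01-02", "2012-01-02 ", "2012-01-02")
)

# the stray space makes grouping produce two rows for the "same" week
res <- mydf[, .(value = uniqueN(list_of_id)), by = week]
nrow(res)  # 2

# trimming whitespace in the by-column collapses the groups back to one
res2 <- mydf[, .(value = uniqueN(list_of_id)), by = .(week = trimws(week))]
nrow(res2)  # 1
```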