Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I ran a group by on a large dataset (>20GB) and it doesn't appear to be working quite right

This is my code

mydf[, .(value = n_distinct(list_of_id, na.rm = T)),
                      by = .(week),
                      keep = c("list_of_id", "week")
                      ] 

It returned this error

Warning messages: 1: In serialize(data, node$con) :
'package:MLmetrics' may not be available when loading 2: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading 3: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading 4: In serialize(data, node$con) :
'package:MLmetrics' may not be available when loading 5: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading 6: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading 7: In serialize(data, node$con) :
'package:MLmetrics' may not be available when loading 8: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading

I had initially loaded the library but then I ran remove.packages(MLmetrics) before running this code. Additionally, I checked conflicted::conflict_scout and there aren't any conflicts that show up with the package MLmetrics.

When I run this code

> mydf %>% 
+   filter(week == "2012-01-02")

It gives me this output

         week    value 
1: 2012-01-02      483     
2: 2012-01-02     61233  

I'm concerned that something went wrong when it was grouping the data since it didn't create distinct groups of the value week. Both columns are stored as data types character.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
623 views
Welcome To Ask or Share your Answers For Others

1 Answer

disk.frame looks interesting to fill a gap between RAM processing and Big Data.

To test it, I created a collection of 200 * 200 Mb CSV files for a total of 40Gb, above the 32Gb RAM installed on my computer:

library(furrr)
library(magrittr)
library(data.table)
library(dplyr)
library(disk.frame)
plan(multisession,workers = 11)
nbrOfWorkers()
#[1] 11

filelength <- 1e7

# Create 200 files * 200Mb
sizelist <- 1:200 %>% future_map(~{
  mydf <- data.table(week = sample(1:52,filelength,replace=T),
                     list_of_id=sample(1:filelength,filelength,replace=T))
  filename <- paste0('data/test',.x,'.csv')
  data.table::fwrite(mydf, filename)
  ##write.csv(mydf,file=filename)
  file.size(filename)
})

sum(unlist(sizelist))
# [1] 43209467799

As distinct_n is a dplyr verb, I first stayed in dplyr syntax:

setup_disk.frame()
#The number of workers available for disk.frame is 6
options(future.globals.maxSize = Inf)

mydf = csv_to_disk.frame(file.path('data',list.files('data')))
"
csv_to_disk.frame: Reading multiple input files.
Please use `colClasses = `  to set column types to minimize the chance of a failed read
=================================================

 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 1 of 2:

Converting 200 CSVs to 60 disk.frames each consisting of 60 chunks

 Progress: ──────────────────────────────────────────────────────────────── 100%

-- Converting CSVs to disk.frame -- Stage 1 or 2 took: 00:01:44 elapsed (0.130s cpu)
 ----------------------------------------------------- 
 
 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 2 of 2:

Row-binding the 60 disk.frames together to form one large disk.frame:
Creating the disk.frame at c:TempWinRtmpkNkY9Hfile398469c42f1b.df

Appending disk.frames: 
 Progress: ──────────────────────────────────────────────────────────────── 100%

Stage 2 of 2 took: 59.9s elapsed (0.370s cpu)
 ----------------------------------------------------- 
Stage 1 & 2 in total took: 00:02:44 elapsed (0.500s cpu)"


result <- mydf %>% 
  group_by(week) %>% 
  summarize(value = n_distinct(list_of_id)) %>% 
  collect  

result
# A tibble: 52 x 2
    week   value
   <int>   <int>
 1     1 9786175
 2     2 9786479
 3     3 9786222
 4     4 9785997
 5     5 9785833
 6     6 9786013
 7     7 9786586
 8     8 9786029
 9     9 9785674
10    10 9786314
# ... with 42 more rows

So it works! Total RAM memory used for this specific task fluctuated between 1 and 5Gb, took a bit less than 10 minutes for 2 billion rows on 6 processors, the limiting factor being seemingly disk access speed and not processor performance.

I also tested with data.table syntax, as disk.frame accepts both, but I got back way too fast 60 times more rows (as if the 60 disk.frames created out of the 200 CSVs weren't merged and/or fully processed), and a lot of Warning messages: 1: In serialize(data, node$con).

I submitted an issue on GitHub.
Until this is clarified, I suggest to stay with dplyr syntax which works.

This example convinced me that disk.frame allows to process data bigger than RAM for supported verbs


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...