I have a large number of XML files that I need to process. To do that, I want to read the files and save the resulting list of objects to disk. I tried saving the list with readr::write_rds, but after reading it back in, the object is modified and no longer valid. Is there anything I can do to work around this problem?
library(readr)
library(xml2)
x <- read_xml("<foo>
<bar>text <baz id = 'a' /></bar>
<bar>2</bar>
<baz id = 'b' />
</foo>")
# function to save and read object
roundtrip <- function(obj) {
  tf <- tempfile()
  on.exit(unlink(tf))
  write_rds(obj, tf)
  read_rds(tf)
}
list(x)
#> [[1]]
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
roundtrip(list(x))
#> [[1]]
#> {xml_document}
identical(x, roundtrip(x))
#> [1] FALSE
all.equal(x, roundtrip(x))
#> [1] TRUE
xml_children(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid
as_list(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid
Some context
I have around 500,000 XML files. To process them, I planned to turn each one into a list with xml2::as_list, and I wrote code to extract what I need. Afterwards I realized that as_list is very expensive to run. I could either:
1. rewrite the already carefully debugged code to parse the data directly (xml_child, xml_text, ...), or
2. keep using as_list.
To speed up option 2, I could run it on another machine with more cores, but I would like to pass a single file to that machine, because collecting and copying all the files is time-consuming.
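
A minimal sketch of one possible workaround, assuming it is acceptable to store each document as its serialized text rather than as a live xml_document. The error above suggests that the external pointer inside the object does not survive serialization, so the idea is to save plain character strings and re-parse them after loading. The helper names xml_to_text and text_to_xml are hypothetical and not part of xml2, and base saveRDS/readRDS are used here in place of readr.

```r
library(xml2)

# Hypothetical helper: convert a list of xml_document objects into a
# character vector, which can be saved to disk like any other R object.
xml_to_text <- function(docs) {
  vapply(docs, function(d) as.character(d), character(1))
}

# Hypothetical helper: re-parse the stored strings into fresh documents,
# which creates new, valid external pointers in the current session.
text_to_xml <- function(txt) {
  lapply(txt, function(s) read_xml(s))
}

docs <- list(
  read_xml("<foo><bar>text <baz id='a'/></bar><bar>2</bar><baz id='b'/></foo>")
)

# Save all documents into a single file that can be copied to another machine.
tf <- tempfile(fileext = ".rds")
saveRDS(xml_to_text(docs), tf)

# Load and re-parse; the restored documents can be queried again.
restored <- text_to_xml(readRDS(tf))
unlink(tf)

xml_children(restored[[1]])
```

This keeps everything in one file, but every document has to be parsed again after loading, so it does not remove the cost of as_list itself.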