I'm trying to run some scraping where the action I take on a node is conditional on the contents of the node.
This should be a minimal example:
library(rvest)

XML = '<td class="id-tag">
<span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>'
page = read_html(XML)
Basically, I want to extract html_attr(x, "title") if <span> exists, otherwise just get html_text(x).
Code to do the first is:
page %>% html_nodes(xpath = '//td[@class="id-tag"]/span') %>% html_attr("title")
# [1] "Really Long Text"
Code to do the second is:
page %>% html_nodes(xpath = '//td[@class="id-tag"]') %>% html_text
# [1] "\nReally L...\n" "Short"
The real problem is that the html_attr approach doesn't give me an NA or something similar for the nodes that don't match (even if I let the xpath just be '//td[@class="id-tag"]' first, to be sure I've narrowed down to only the relevant nodes). This destroys the order -- I can't tell automatically whether the original structure had "Really Long Text" at the first or the second node.
(I thought of doing a join, but the mapping between the abbreviated text and the full text is not one-to-one/invertible).
This seems to be on the right path -- an if/else construction within the xpath -- but doesn't work.
Ideally I'd get the output:
# [1] "Really Long Text" "Short"
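
For reference, here is a rough sketch of one way this output might be obtained without an if/else inside the xpath. It assumes that html_node (singular), when applied to a set of nodes, returns exactly one result per input node, with a missing placeholder where nothing matches, and that html_attr then yields NA for those placeholders. The names cells, titles, texts and result are only illustrative, and the trimws call is my own addition to strip the surrounding newlines.

```r
library(rvest)

XML <- '<td class="id-tag">
<span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>'
page <- read_html(XML)

# Select every relevant cell first, so the original order is preserved.
cells <- html_nodes(page, xpath = '//td[@class="id-tag"]')

# html_node() keeps one entry per cell; cells without a <span> child
# produce a missing node, for which html_attr() returns NA.
titles <- html_attr(html_node(cells, xpath = "./span"), "title")

# Fallback text for every cell, with surrounding whitespace removed.
texts <- trimws(html_text(cells))

# Use the title where one exists, otherwise the cell text.
result <- ifelse(is.na(titles), texts, titles)
print(result)
```

Because both vectors have one element per cell, the position of each value still corresponds to the position of its node in the page.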