r - Unescape HTML &#nn; sequences


asked Oct 24, 2021 in Technique by 深蓝

My text has some HTML escaped characters, for instance, instead of ' there is &#39;. Now I would like to unescape these sequences. Since I do not know which characters are escaped, I do not want to use a simple mapping such as c("&#39;"="'", ...).

I understand that the number after the &# is the decimal Unicode code point. So &#39; is \u27, since 27 is the hexadecimal representation of 39. So I thought of a solution that involves

sprintf("\u%x", s)

where s is the extracted number between &# and ;. However, this results in an error: "\u used without hex numbers."

What would be a better approach to convert HTML escaped sequences back to characters?
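
For context on the error: R resolves "\u" escapes when it parses a string literal, so the escape cannot be assembled at run time with sprintf. A minimal sketch of the conversion step itself, assuming the digits have already been extracted into a variable (s and ch are illustrative names), uses base R's intToUtf8:

```r
# "\u" escapes are handled by the parser, so build the character
# from its code point at run time instead.
s <- "39"                        # digits extracted from "&#39;" (assumed input)
ch <- intToUtf8(as.integer(s))   # decimal code point 39 -> "'"
print(ch)
```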


1 Answer

深蓝 · Answer 1 · 2021-10-23T20:07:01+0000

Just for reference, here is the solution I came up with. It makes use of the great package gsubfn:

library(gsubfn)

I use a vector htmlchars for named html entities I scraped from Wikipedia. For brevity, I do not copy the vector in this answer here, but source it from pastebin:

source("http://pastebin.com/raw.php?i=XtzN1NMs") # creates variable htmlchars

Now the decoding function I was looking for is simply:

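strdehtml <- function(s) {
    # Numeric references such as &#39;: intToUtf8() also handles code points
    # above 255, where rawToChar(as.raw(...)) would fail.
    ret <- gsubfn("&#([0-9]+);", function(x) intToUtf8(as.integer(x)), s)
    # Named entities such as &amp;: look up the captured name in htmlchars.
    ret <- gsubfn("&([^;]+);", function(x) htmlchars[x], ret)
    return(ret)
}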

I am not sure this covers every possible HTML entity, but it works for my purposes. For instance, it can be used like this:

test <- "My this &amp; last year&#39;s resolutions"
strdehtml(test)
[1] "My this & last year's resolutions"
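
If installing gsubfn is not an option, the numeric part of the decoding can also be sketched in base R. This is an illustration rather than part of the solution above: decode_numeric_entities is a hypothetical helper name, it handles decimal (&#39;) and hexadecimal (&#x27;) references only, and it leaves named entities such as &amp; untouched.

```r
# Hypothetical helper (not from the answer above): decode numeric
# character references, decimal (&#39;) and hexadecimal (&#x27;).
decode_numeric_entities <- function(s) {
  pattern <- "&#([0-9]+|[xX][0-9A-Fa-f]+);"
  m <- gregexpr(pattern, s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(hits) {
    vapply(hits, function(h) {
      body <- substr(h, 3, nchar(h) - 1)   # strip leading "&#" and trailing ";"
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2), base = 16L)   # hexadecimal form
      } else {
        strtoi(body, base = 10L)                 # decimal form
      }
      intToUtf8(code)                      # code point -> character
    }, character(1), USE.NAMES = FALSE)
  })
  s
}

decode_numeric_entities("last year&#39;s &#x27;resolutions&#x27;")
```

Using regmatches<- keeps the function vectorized over s, and strings without any numeric reference pass through unchanged.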
