I want to grab some data from the Pro Football Reference website using the rvest
package. First, let's grab the results for all games played in 2015 from this URL: http://www.pro-football-reference.com/years/2015/games.htm
library("rvest")
library("dplyr")
#grab table info
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html()
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_tibble()  # as_data_frame() is deprecated
Is that how you would have done it? :)
dat could be cleaned up a bit. Two of the variables seem to have blanks for names. Plus, the header row is repeated between each week.
colnames(dat) <- c("week", "day", "date", "winner", "at", "loser",
"box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")
dat2 <- dat %>% filter(!(box == ""))
head(dat2)
Looks good!
Now let's look at an individual game. At the webpage above, click on "Boxscore" in the very first row of the table: the Sept 10th game played between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm
I want to grab the individual snap counts for each player (about halfway down the page). I'm pretty sure these will be our first two lines of code:
gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
But now I can't figure out how to grab the specific table I want. I use the Selector Gadget to highlight the table of Patriots snap counts. I do this by clicking on the table in several places, then 'unclicking' the other tables that were highlighted. I end up with a path of:
#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left
Each of these attempts returns {xml_nodeset (0)}
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")
Maybe let's try using xpath. All of these attempts also return {xml_nodeset (0)}
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')
How can I grab that table? I'll also point out, when I do "View Page Source" in Google Chrome, the tables I want almost seem to be commented out? That is, they're typed in green, instead of the usual red/black/blue color scheme. That is not the case for the table of game results we pulled first. "View Page Source" for that table is the usual red/black/blue color scheme. Is the greenness indicative of what's preventing me from being able to grab this snap count table?
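To make the comment theory concrete, here is a minimal sketch of what I suspect might be going on. It does not hit the real site: the inline HTML, the player names, and the numbers are all hypothetical stand-ins, and I am only assuming the real page wraps the table in an HTML comment in roughly this way. The idea is to pull the comment nodes with xpath, re-parse their text as HTML, and then select the table by id.

```r
library(rvest)

# Hypothetical stand-in for the box score page (NOT the real markup):
# the table we want sits inside an HTML comment.
page <- read_html('<html><body>
<!--
<table id="home_snap_counts">
<tr><th>Player</th><th>Pos</th><th>Num</th></tr>
<tr><td>Player A</td><td>QB</td><td>66</td></tr>
<tr><td>Player B</td><td>WR</td><td>60</td></tr>
</table>
-->
</body></html>')

# Selecting by id directly finds nothing, because the parser treats
# the commented-out table as comment text rather than as elements.
direct <- page %>% html_nodes("#home_snap_counts")

# Grab every comment node, take its text, and re-parse that text as HTML.
commented <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html()

# Now the table is an ordinary element and can be selected by id.
snaps <- commented %>%
  html_node("#home_snap_counts") %>%
  html_table()
```

If the real page behaves like this stand-in, swapping page for gameHtml would be the obvious next thing to try, though I have not confirmed that the comments on the live page parse cleanly when pasted together.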
Thanks!