I have downloaded the PDF file from this site (from the Table tab) and want to clean the dataset in R and convert it to a CSV or Excel file.
I am using the pdftools package and have installed the other required packages. I want to focus on the data for the age groups. So far I have narrowed the dataset down using the following code.
# Load the required packages
library(pdftools)
library(magrittr) # provides %>%
# Load the dataset: read the PDF from the project folder and split it into lines
PDF1 <- pdf_text("agegr_1-4-21.pdf") %>%
  readr::read_lines()
PDF1
# Remove the lines that are not needed
PDF1.grass <- PDF1[-c(1:10, 17:19)]
PDF1.grass
write.table(PDF1.grass, file = "docd_pdf.csv", sep = ",", row.names = FALSE)
# Transpose the lines and write them out again (this overwrites the file above)
all_stat_lines <- PDF1.grass
pdf_transpose <- t(all_stat_lines)
write.table(pdf_transpose, file = "docd_pdf.csv", sep = ",", row.names = FALSE)
# Create a data frame
df <- plyr::ldply(pdf_transpose)
head(df)
However, the data frame I get puts everything into a single variable. Is there a way to efficiently split the data so that the age groups end up in separate columns? I downloaded the PDF file from the site and named it agegr_1-4-21.pdf.
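To illustrate the kind of result I am after, here is a minimal sketch in base R. It does not use the real PDF: `sample_lines`, its column names, its values, and the output file name `age_groups.csv` are all made-up placeholders, and it assumes the columns in each line are separated by runs of two or more spaces, which may not hold for the actual file.

```r
# Placeholder lines standing in for the cleaned pdf_text() output.
# The real layout, column names, and values may differ.
sample_lines <- c(
  "Age group      Cases    Deaths",
  "0-17           10       1",
  "18-49          20       2",
  "50+            30       3"
)

# Split each line on runs of two or more spaces (assumed column separator),
# so a single space inside "Age group" does not break the header apart.
parts <- strsplit(trimws(sample_lines), "\\s{2,}")

# Stack the pieces into a character matrix, one row per line.
# This only works if every line splits into the same number of pieces.
mat <- do.call(rbind, parts)

# Use the first row as the header and the remaining rows as data.
age_df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(age_df) <- mat[1, ]

# Convert columns that contain only numbers; others stay as character.
age_df[] <- lapply(age_df, function(x) type.convert(x, as.is = TRUE))

# Write the result to a CSV file (hypothetical file name).
write.csv(age_df, "age_groups.csv", row.names = FALSE)
```

On the real file, `sample_lines` would be replaced by `PDF1.grass`, provided its columns are separated the same way.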
The output I am getting is