Importing and extracting tables from PDF into R using “pdftools”
Introduction
Extracting tables from PDFs is often necessary when clients share data in PDF format. However, uploading PDFs to free online conversion tools can expose confidential information. A safer alternative is to extract the data locally with the “pdftools” package in R.
Other libraries exist, such as tabulizer, but tabulizer requires Java and is not currently available on CRAN.
Let us begin by installing and loading the required package:
install.packages("pdftools")
library(pdftools)
Step 1) Define a reusable path variable for the PDF file
mypath <- "Path_to_pdf/my_pdf.pdf"
Step 2) Import the pdf file using pdf_text()
Here, pdf_text() reads the text of the PDF and returns it as a character vector with one element per page.
pdfText <- pdf_text(mypath) # to fetch the text of the PDF file
typeof(pdfText) # to check the data type
[1] "character"
pdfText[2] # to show content of the second page
[1] " Example 4: Automobile Land Speed Records (GR 5-10)\n In the first recorded automobile race in 1898, Count Gaston de Chasseloup-Laubat of\n Paris, France, drove 1 kilometer in 57 seconds for an average speed of 39.2 miles per hour\n (mph) or 63.1 kilometers per hour (kph). In 1904, Henry Ford drove his Ford Arrow across\n frozen Lake St. Clair, MI, at an average speed of 91.4 mph. Now, the North American\n Eagle is trying to break a land speed record of 800 mph. The Federation International de\n L’Automobile (FIA), the world’s governing body for motor sport and land speed records,\n recorded the following land speed records. (Retrieved on February 5, 2006, from\n http://www.landspeed.com/lsrinfo.asp.)\n\n Speed (mph) Driver Car Engine Date\n\n 407.447 Craig Breedlove Spirit of America GE J47 8/5/63\n\n 413.199 Tom Green Wingfoot Express WE J46 10/2/64\n\n 434.22 Art Arfons Green Monster GE J79 10/5/64\n\n 468.719 Craig Breedlove Spirit of America GE J79 10/13/64\n\n 526.277 Craig Breedlove Spirit of America GE J79 10/15/65\n\n 536.712 Art Arfons Green Monster GE J79 10/27/65\n\n 555.127 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/2/65\n\n 576.553 Art Arfons Green Monster GE J79 11/7/65\n\n 600.601 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/15/65\n\n 622.407 Gary Gabelich Blue Flame Rocket 10/23/70\n\n 633.468 Richard Noble Thrust 2 RR RG 146 10/4/83\n\n 763.035 Andy Green Thrust SSC RR Spey 10/15/97\n\n Example 5: Distance and Time (GR 8-10)\n The following data were collected using a car with a water clock set to release a drop in\n a unit of time and a meter stick. The car rolled down an inclined plane. Three trials were\n run. Create a data table with an average distance column and an average velocity column,\n create an average distance-time graph, and draw the best-fit line or curve. Estimate the\n car’s distance traveled and velocity at six drops of water. Describe the motion of the car. 
Is\n it going at a constant speed, accelerating, or decelerating? How do you know?\n\n Time (drops of water) Distance (cm)\n 1 10,11,9\n 2 29, 31, 30\n 3 59, 58, 61\n 4 102, 100, 98\n 5 122, 125, 127\n\n\n\n© 2006 WGBH Educational Foundation. All rights reserved.\n\n 2\n"
Note: pdfText is a character vector, where each element represents the text of one PDF page
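Because each element corresponds to one page, ordinary vector tools can be used to find the page that holds a table. The sketch below is only an illustration: `sample_pages` is a small made-up vector standing in for the output of pdf_text(), so that the code runs without a PDF, and the keyword is an assumption about what the table header contains.

```r
# Stand-in for the output of pdf_text(): one string per page (made-up text)
sample_pages <- c(
  "Cover page\nSome introduction\n",
  "Example 4\n\nSpeed (mph)    Driver    Car\n407.447    Craig Breedlove    Spirit of America\n"
)

# Number of pages
length(sample_pages)

# Indices of the pages whose text contains the literal header keyword
table_pages <- which(grepl("Speed (mph)", sample_pages, fixed = TRUE))
table_pages
```

With a real file, the same call on pdfText would return the page numbers to pass on to the next step.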
Step 3) Process each page into lines
But why split PDF text into lines?
Processing the text line by line makes it easier to:
- Identify headers or keywords
- Locate patterns (e.g., names, dates, amounts)
- Extract tabular data
- Read the extracted data
- Clean and transform the data
- Loop through each line if necessary
Note: This step can be skipped if only simple text analysis is needed.
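As an illustration of the first three points, the sketch below locates a table header and its data rows by pattern instead of by fixed line numbers. The vector `sample_lines` is made up and stands in for one page already split into lines; the two patterns are assumptions that fit this particular table and would need adjusting for other documents.

```r
# Made-up lines standing in for one page split on "\n"
sample_lines <- c(
  "  Example 4: Automobile Land Speed Records",
  "",
  "  Speed (mph)    Driver             Car",
  "",
  "  407.447        Craig Breedlove    Spirit of America",
  "",
  "  413.199        Tom Green          Wingfoot Express",
  "",
  "  Example 5: Distance and Time"
)

# Header line: the first line containing the literal text "Speed (mph)"
header_idx <- grep("Speed (mph)", sample_lines, fixed = TRUE)[1]

# Data rows: lines that start (after optional spaces) with a decimal number
row_idx <- grep("^\\s*[0-9]+\\.[0-9]+\\s", sample_lines)

# Keep only the header and the data rows
table_lines <- sample_lines[c(header_idx, row_idx)]
table_lines
```

Pattern-based selection tends to survive small layout changes between files better than hard-coded line ranges, at the cost of having to design a pattern per table.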
cleaned_pages <- lapply(pdfText, function(page) { # Applies the function to each element (i.e., each page) of pdfText
  text_lines <- strsplit(page, "\n")[[1]] # Splits the text of the page into lines on "\n"
  return(text_lines)
})
typeof(cleaned_pages)
[1] "list"
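As a side note, strsplit() is vectorised over its first argument, so the same list can be produced without an explicit lapply(). The sketch below compares the two approaches on a made-up vector (`sample_pages` is hypothetical and stands in for the output of pdf_text()).

```r
# Made-up stand-in for the output of pdf_text()
sample_pages <- c("line 1\nline 2\n", "line A\n\nline B\n")

# Page-by-page version, mirroring the lapply() call above
by_lapply <- lapply(sample_pages, function(page) strsplit(page, "\n")[[1]])

# Vectorised version: strsplit() accepts the whole vector at once
by_strsplit <- strsplit(sample_pages, "\n")

# Check that both approaches give the same list
identical(by_lapply, by_strsplit)
```

The lapply() form remains useful when each page needs extra per-page processing inside the function.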
# cleaned_text <- unlist(cleaned_pages) # Flatten into a single vector
cleaned_text <- cleaned_pages[[2]] # Extract second page for demo
cleaned_text
[1] " Example 4: Automobile Land Speed Records (GR 5-10)"
[2] " In the first recorded automobile race in 1898, Count Gaston de Chasseloup-Laubat of"
[3] " Paris, France, drove 1 kilometer in 57 seconds for an average speed of 39.2 miles per hour"
[4] " (mph) or 63.1 kilometers per hour (kph). In 1904, Henry Ford drove his Ford Arrow across"
[5] " frozen Lake St. Clair, MI, at an average speed of 91.4 mph. Now, the North American"
[6] " Eagle is trying to break a land speed record of 800 mph. The Federation International de"
[7] " L’Automobile (FIA), the world’s governing body for motor sport and land speed records,"
[8] " recorded the following land speed records. (Retrieved on February 5, 2006, from"
[9] " http://www.landspeed.com/lsrinfo.asp.)"
[10] ""
[11] " Speed (mph) Driver Car Engine Date"
[12] ""
[13] " 407.447 Craig Breedlove Spirit of America GE J47 8/5/63"
[14] ""
[15] " 413.199 Tom Green Wingfoot Express WE J46 10/2/64"
[16] ""
[17] " 434.22 Art Arfons Green Monster GE J79 10/5/64"
[18] ""
[19] " 468.719 Craig Breedlove Spirit of America GE J79 10/13/64"
[20] ""
[21] " 526.277 Craig Breedlove Spirit of America GE J79 10/15/65"
[22] ""
[23] " 536.712 Art Arfons Green Monster GE J79 10/27/65"
[24] ""
[25] " 555.127 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/2/65"
[26] ""
[27] " 576.553 Art Arfons Green Monster GE J79 11/7/65"
[28] ""
[29] " 600.601 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/15/65"
[30] ""
[31] " 622.407 Gary Gabelich Blue Flame Rocket 10/23/70"
[32] ""
[33] " 633.468 Richard Noble Thrust 2 RR RG 146 10/4/83"
[34] ""
[35] " 763.035 Andy Green Thrust SSC RR Spey 10/15/97"
[36] ""
[37] " Example 5: Distance and Time (GR 8-10)"
[38] " The following data were collected using a car with a water clock set to release a drop in"
[39] " a unit of time and a meter stick. The car rolled down an inclined plane. Three trials were"
[40] " run. Create a data table with an average distance column and an average velocity column,"
[41] " create an average distance-time graph, and draw the best-fit line or curve. Estimate the"
[42] " car’s distance traveled and velocity at six drops of water. Describe the motion of the car. Is"
[43] " it going at a constant speed, accelerating, or decelerating? How do you know?"
[44] ""
[45] " Time (drops of water) Distance (cm)"
[46] " 1 10,11,9"
[47] " 2 29, 31, 30"
[48] " 3 59, 58, 61"
[49] " 4 102, 100, 98"
[50] " 5 122, 125, 127"
[51] ""
[52] ""
[53] ""
[54] "© 2006 WGBH Educational Foundation. All rights reserved."
[55] ""
[56] " 2"
cleaned_text <- cleaned_text[c(11:35)] # Select lines for the table (11 to 35)
cleaned_text <- trimws(cleaned_text) # Remove leading/trailing spaces
cleaned_text <- subset(cleaned_text, cleaned_text != "") # Remove blank lines
data_split <- strsplit(cleaned_text, "\\s{2,}") # Split rows into columns on runs of two or more spaces
Note: The processing steps may vary depending on your specific requirements
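Splitting on runs of two or more spaces only works when every pair of adjacent columns is separated by at least two spaces. A quick check of the number of fields per line can catch rows where that assumption fails, before rbind() recycles the shorter rows (with a warning). The sketch below uses made-up lines (`sample_rows` is hypothetical), and the third line is deliberately altered so that two columns are separated by a single space.

```r
# Made-up, already trimmed table lines (header plus two rows)
sample_rows <- c(
  "Speed (mph)  Driver  Car  Engine  Date",
  "407.447  Craig Breedlove  Spirit of America  GE J47  8/5/63",
  "622.407  Gary Gabelich  Blue Flame Rocket  10/23/70"  # single space merges two columns
)

parts <- strsplit(sample_rows, "\\s{2,}")

# Number of fields found in each line
n_fields <- lengths(parts)
n_fields

# Lines whose field count differs from the header's need manual attention
bad_rows <- which(n_fields != n_fields[1])
bad_rows
```

Rows flagged this way can be fixed by hand or with a more specific pattern before building the data frame.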
Step 4) Convert lines to data frame
df <- as.data.frame(do.call(rbind, data_split), stringsAsFactors = FALSE) # Convert to data frame
# Set column names and remove the header row
colnames(df) <- unlist(df[1, ], use.names = FALSE)
df <- df[-1, ]
DT::datatable(df) # Interactive table; datatable() comes from the DT package
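All columns produced this way are character strings, so a type conversion step usually follows. The sketch below shows one possible approach on a made-up data frame (`speed_df` and its column names are hypothetical). Note that R's "%y" format maps two-digit years 00 to 68 to the 2000s, so the century is added explicitly here, under the assumption that every record in this table dates from the 1900s.

```r
# Made-up data frame with the same kind of content as the extracted table
speed_df <- data.frame(
  Speed = c("407.447", "763.035"),
  Date  = c("8/5/63", "10/15/97"),
  stringsAsFactors = FALSE
)

# Convert the speed column from character to numeric
speed_df$Speed <- as.numeric(speed_df$Speed)

# Two-digit years: "%y" would read 63 as 2063, so prefix the year with "19".
# Assumption: all dates in this table fall in the 1900s.
full_dates <- sub("([0-9]{2})$", "19\\1", speed_df$Date)
speed_df$Date <- as.Date(full_dates, format = "%m/%d/%Y")

str(speed_df)
```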
Additional functions to understand PDF Structure
# To display dimension of each page
pdf_pagesize(mypath)
# To get the metadata of the PDF file
pdf_info(mypath)
# Displays font names
pdf_fonts(mypath)
Conclusion:
The “pdftools” package in R is a versatile solution for extracting text and tables from PDFs, offering benefits like ease of use and fast text extraction. However, it has limitations when handling complex table structures or poorly formatted PDFs, often requiring additional cleaning or complementary tools. Despite these challenges, pdftools remains a valuable tool for turning static documents into structured data, enhancing data workflows in various domains.