Extracting table from pdf to excel using pdfplumber

Introduction

Extracting tables from PDFs is essential when valuable data is locked in an uneditable format, preventing easy analysis and reporting. Using pdfplumber simplifies this process by offering precise control over complex layouts, ensuring accurate extraction of tabular data. Its seamless integration with pandas allows the data to be converted into Excel, making it easier to clean, manipulate, and analyze.

Libraries to import

import pdfplumber
import pandas as pd

Importing pdf file

pdf_path = 'sample_data.pdf'

Data Description

This is a sample data contains recorded automobile land speed records, including notable achievements such as speeds from early races in 1898 and 1904, up to more recent attempts to break the 800 mph barrier, all governed by the Federation International de L’Automobile (FIA).

Below is a screenshot of page 2, where the table from the PDF is displayed.

Extracting pdf table in python DataFrame

Initialization and Opening the PDF File:

  • The Data=[] initializes an empty list called Data to store the extracted table data

  • This line opens the PDF file using pdfplumber. The with statement ensures that the file is properly closed after processing.

Data=[]
with pdfplumber.open(pdf_path) as pdf:

Accessing a Specific Page:

  • This retrieves the second page of the PDF (page indexing starts at 0) where the table is located.
page = pdf.pages[1]  

Extracting the Table:

  • This line uses extract_table() to extract the table data from the page. The table variable now contains the table’s data, where the first element is the header and the rest are rows.
table = page.extract_table()

Separating Header and Rows:

header=table[0]
rows=table[1:] 
    

Here, header stores the first row of the table, which contains column names, while rows stores the remaining rows of data.

Appending Rows to Data List:

for row in rows:
        Data.append(row)

This loop iterates through each row in rows and appends it to the Data list.

Creating a python DataFrame:

Data=pd.DataFrame(Data,columns=header)

Finally, this converts the Data list into a pandas DataFrame, using header for column names. The resulting DataFrame is now ready for further analysis or manipulation.

Speed (mph) Driver Car Engine Date Date
0 407.447 Craig Breedlove Spirit of America GE J47 8/5/63
1 413.199 Tom Green Wingfoot Express WE J46 10/2/64
2 434.22 Art Arfons Green Monster GE J79 10/5/64
3 468.719 Craig Breedlove Spirit of America GE J79 10/13/64
4 526.277 Craig Breedlove Spirit of America GE J79 10/15/65
5 536.712 Art Arfons Green Monster GE J79 10/27/65
6 555.127 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/2/65
7 576.553 Art Arfons Green Monster GE J79 11/7/65
8 600.601 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/15/65
9 622.407 Gary Gabelich Blue Flame Rocket 10/23/70
10 633.468 Richard Noble Thrust 2 RR RG 146 10/4/83
11 763.035 Andy Green Thrust SSC RR Spey 10/15/97

Conclusion

Extracting tables from PDFs using pdfplumber is an efficient and powerful approach for unlocking valuable data trapped in static documents. By converting the extracted data into pandas-friendly formats, users can seamlessly transition from raw data to meaningful insights. This process not only saves time but also enhances data accuracy, ensuring smoother workflows for reporting and analysis. With the added flexibility of exporting the data to Excel, pdfplumber proves to be an indispensable tool for handling complex tabular data from PDFs.