Extracting table from pdf to excel using pdfplumber

Introduction

Extracting tables from PDFs is essential when valuable data is locked in an uneditable format, preventing easy analysis and reporting. Using pdfplumber simplifies this process by offering precise control over complex layouts, ensuring accurate extraction of tabular data. Its seamless integration with pandas allows the data to be converted into Excel, making it easier to clean, manipulate, and analyze.

Libraries to import

import pdfplumber
import pandas as pd

Importing pdf file

pdf_path = 'sample_data.pdf'

Data Description

This is a sample data contains recorded automobile land speed records, including notable achievements such as speeds from early races in 1898 and 1904, up to more recent attempts to break the 800 mph barrier, all governed by the Federation International de L’Automobile (FIA).

Below is a screenshot of page 2, where the table from the PDF is displayed.

Extracting pdf table in python DataFrame

Initialization and Opening the PDF File:

The Data=[] initializes an empty list called Data to store the extracted table data
This line opens the PDF file using pdfplumber. The with statement ensures that the file is properly closed after processing.

Data=[]
with pdfplumber.open(pdf_path) as pdf:

Accessing a Specific Page:

This retrieves the second page of the PDF (page indexing starts at 0) where the table is located.

page = pdf.pages[1]

Extracting the Table:

This line uses extract_table() to extract the table data from the page. The table variable now contains the table’s data, where the first element is the header and the rest are rows.

table = page.extract_table()

Separating Header and Rows:

header=table[0]
rows=table[1:]

Here, header stores the first row of the table, which contains column names, while rows stores the remaining rows of data.

Appending Rows to Data List:

for row in rows:
        Data.append(row)

This loop iterates through each row in rows and appends it to the Data list.

Creating a python DataFrame:

Data=pd.DataFrame(Data,columns=header)

Finally, this converts the Data list into a pandas DataFrame, using header for column names. The resulting DataFrame is now ready for further analysis or manipulation.

	Speed (mph)	Driver	Car	Engine Date	Date
0	407.447	Craig Breedlove	Spirit of America	GE J47	8/5/63
1	413.199	Tom Green	Wingfoot Express	WE J46	10/2/64
2	434.22	Art Arfons	Green Monster	GE J79	10/5/64
3	468.719	Craig Breedlove	Spirit of America	GE J79	10/13/64
4	526.277	Craig Breedlove	Spirit of America	GE J79	10/15/65
5	536.712	Art Arfons	Green Monster	GE J79	10/27/65
6	555.127	Craig Breedlove	Spirit of America, Sonic 1	GE J79	11/2/65
7	576.553	Art Arfons	Green Monster	GE J79	11/7/65
8	600.601	Craig Breedlove	Spirit of America, Sonic 1	GE J79	11/15/65
9	622.407	Gary Gabelich	Blue Flame	Rocket	10/23/70
10	633.468	Richard Noble	Thrust 2	RR RG 146	10/4/83
11	763.035	Andy Green	Thrust SSC	RR Spey	10/15/97

Conclusion

Extracting tables from PDFs using pdfplumber is an efficient and powerful approach for unlocking valuable data trapped in static documents. By converting the extracted data into pandas-friendly formats, users can seamlessly transition from raw data to meaningful insights. This process not only saves time but also enhances data accuracy, ensuring smoother workflows for reporting and analysis. With the added flexibility of exporting the data to Excel, pdfplumber proves to be an indispensable tool for handling complex tabular data from PDFs.