import pdfplumber
import pandas as pd
Extracting table from pdf to excel using pdfplumber
Introduction
Extracting tables from PDFs is essential when valuable data is locked in an uneditable format, preventing easy analysis and reporting. Using pdfplumber
simplifies this process by offering precise control over complex layouts, ensuring accurate extraction of tabular data. Its seamless integration with pandas allows the data to be converted into Excel, making it easier to clean, manipulate, and analyze.
Libraries to import
Importing pdf file
= 'sample_data.pdf' pdf_path
Data Description
This is a sample data contains recorded automobile land speed records, including notable achievements such as speeds from early races in 1898 and 1904, up to more recent attempts to break the 800 mph barrier, all governed by the Federation International de L’Automobile (FIA).
Below is a screenshot of page 2, where the table from the PDF is displayed.
Extracting pdf table in python DataFrame
Initialization and Opening the PDF File:
The
Data=[]
initializes an empty list called Data to store the extracted table dataThis line opens the PDF file using pdfplumber. The with statement ensures that the file is properly closed after processing.
=[]
Datawith pdfplumber.open(pdf_path) as pdf:
Accessing a Specific Page:
- This retrieves the second page of the PDF (page indexing starts at 0) where the table is located.
= pdf.pages[1] page
Extracting the Table:
- This line uses extract_table() to extract the table data from the page. The table variable now contains the table’s data, where the first element is the header and the rest are rows.
= page.extract_table() table
Separating Header and Rows:
=table[0]
header=table[1:]
rows
Here, header stores the first row of the table, which contains column names, while rows stores the remaining rows of data.
Appending Rows to Data List:
for row in rows:
Data.append(row)
This loop iterates through each row in rows and appends it to the Data list.
Creating a python DataFrame:
=pd.DataFrame(Data,columns=header) Data
Finally, this converts the Data list into a pandas DataFrame, using header for column names. The resulting DataFrame is now ready for further analysis or manipulation.
Speed (mph) | Driver | Car | Engine Date | Date | |
---|---|---|---|---|---|
0 | 407.447 | Craig Breedlove | Spirit of America | GE J47 | 8/5/63 |
1 | 413.199 | Tom Green | Wingfoot Express | WE J46 | 10/2/64 |
2 | 434.22 | Art Arfons | Green Monster | GE J79 | 10/5/64 |
3 | 468.719 | Craig Breedlove | Spirit of America | GE J79 | 10/13/64 |
4 | 526.277 | Craig Breedlove | Spirit of America | GE J79 | 10/15/65 |
5 | 536.712 | Art Arfons | Green Monster | GE J79 | 10/27/65 |
6 | 555.127 | Craig Breedlove | Spirit of America, Sonic 1 | GE J79 | 11/2/65 |
7 | 576.553 | Art Arfons | Green Monster | GE J79 | 11/7/65 |
8 | 600.601 | Craig Breedlove | Spirit of America, Sonic 1 | GE J79 | 11/15/65 |
9 | 622.407 | Gary Gabelich | Blue Flame | Rocket | 10/23/70 |
10 | 633.468 | Richard Noble | Thrust 2 | RR RG 146 | 10/4/83 |
11 | 763.035 | Andy Green | Thrust SSC | RR Spey | 10/15/97 |
Conclusion
Extracting tables from PDFs using pdfplumber is an efficient and powerful approach for unlocking valuable data trapped in static documents. By converting the extracted data into pandas-friendly formats, users can seamlessly transition from raw data to meaningful insights. This process not only saves time but also enhances data accuracy, ensuring smoother workflows for reporting and analysis. With the added flexibility of exporting the data to Excel, pdfplumber proves to be an indispensable tool for handling complex tabular data from PDFs.