OCR — Extracting Data from Images using Python¶
Data is often locked in formats that cannot be processed directly, such as photographed reports or image exports of tables. To Python, these
are just pixels: you cannot run pd.read_csv() on an image or extract text
from a screenshot without extra tooling.
OCR (Optical Character Recognition) solves this by converting image content into machine-readable text and structured data.
Workflow¶
Image Input → OCR Engine (Tesseract) → Parse Output
├── Table Image → img2table → pandas DataFrame
└── Text Image → pytesseract → Python string
Import Libraries¶
from img2table.document import Image as Img
from img2table.ocr import TesseractOCR
import pytesseract
from PIL import Image
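Both pytesseract and the TesseractOCR wrapper rely on a Tesseract installation being present on the machine. Before running the cells that follow, it may be worth checking that the executable can be found. The sketch below uses only the standard library; the helper name find_tesseract is hypothetical and not part of either library.

```python
import shutil

def find_tesseract(candidates=("tesseract",)):
    """Return the path of the first matching executable on PATH, or None.

    Hypothetical helper for illustration; not part of pytesseract or img2table.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found on PATH; install it or set its location explicitly.")
else:
    print(f"Using Tesseract at: {tesseract_path}")
```

If the lookup returns None, the OCR calls later in the notebook are likely to fail until Tesseract is installed or its location is configured.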
Image 1: Table Image¶
Contains a structured product sales table with rows and columns.
from IPython.display import display, Image as IPImage
display(IPImage(filename="ProductSalesTable.png"))
Image 2: Text Image¶
Contains a short article with headings and paragraphs.
from IPython.display import display, Image as IPImage
display(IPImage(filename="text.png"))
Table Extraction → DataFrame¶
Problem: Reading a table from an image manually is tedious and error-prone.
Solution: img2table automatically detects table borders, rows, and columns — and returns a clean pandas DataFrame in one call.
It solves two things that plain pytesseract struggles with:
- Correctly grouping words into rows and columns
- Handling merged cells and irregular column spacing
# Set Tesseract binary path (Windows only)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# ---------- TABLE EXTRACTION ----------
ocr = TesseractOCR()
doc = Img("ProductSalesTable.png")
tables = doc.extract_tables(ocr=ocr)

if tables:
    df = tables[0].df  # first detected table as a pandas DataFrame
    print(df)
else:
    print("No tables found")
                   0            1      2           3
0            Product     Category  Units  Revenue($)
1             Laptop  Electronics   1273   19,09,500
2     Wireless Mouse  Peripherlas   4652    1,39,560
3          USB-C Hub  Peripherlas   3121    1,87,260
4          HD WebCam  Accessories   1638    1,31,040
5  Monitor 27 inches  Electronics    870    6,09,000
6     Mechanical KBD  Peripherlas   2176    5,44,000
7      Desk USB Lamp  Accessories   2959      73,975
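In the printed result, the header sits in row 0 and the columns are labelled 0 to 3, so the frame usually needs a small cleanup step before analysis. The sketch below promotes the first row to column names and converts the numeric columns. It assumes every cell arrives as a string, and it rebuilds a few rows by hand as a stand-in for tables[0].df; the helpers promote_header and to_int are hypothetical names, not part of img2table.

```python
import pandas as pd

# Hand-built stand-in for tables[0].df (assumption: all cells are strings).
raw = pd.DataFrame([
    ["Product", "Category", "Units", "Revenue($)"],
    ["Laptop", "Electronics", "1273", "19,09,500"],
    ["Wireless Mouse", "Peripherlas", "4652", "1,39,560"],
    ["Desk USB Lamp", "Accessories", "2959", "73,975"],
])

def promote_header(df):
    """Use the first row as column names and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out

def to_int(series):
    """Remove comma separators (whatever the grouping style) and cast to int."""
    return series.str.replace(",", "", regex=False).astype(int)

clean = promote_header(raw)
clean["Units"] = to_int(clean["Units"])
clean["Revenue($)"] = to_int(clean["Revenue($)"])

print(clean.dtypes)
print(clean)
```

Removing every comma, rather than parsing a fixed thousands pattern, keeps the conversion independent of the digit grouping used in the image (here 19,09,500 rather than 1,909,500). OCR misreadings in text cells, such as misspelled category names, would still need a separate correction step.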
Text Extraction → Python String¶
Problem: Text inside an image is not machine-readable — it cannot be copied, searched, or processed.
Solution: pytesseract.image_to_string() scans the image and returns all detected text as a plain Python str.
This works best on clean, printed text with consistent font and good contrast — exactly what our sample image provides.
# ---------- TEXT EXTRACTION ----------
img = Image.open("text.png")
text = pytesseract.image_to_string(img)
print(text)
The Future of Artificial Intelligence

Artificial Intelligence (AI) is rapidly transforming the way we live and work. From virtual assistants to self-driving cars, AI systems are becoming more capable and widely adopted.
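The returned string generally keeps the line breaks of the image, so a paragraph can arrive split across several lines. A minimal sketch of tidying it into a list of paragraphs follows. The sample string is a hand-written stand-in for the OCR result (its line breaks and trailing form feed are assumptions), and tidy_ocr_text is a hypothetical helper.

```python
import re

def tidy_ocr_text(text):
    """Split OCR output on blank lines and collapse whitespace in each block."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]

# Stand-in for the string returned by pytesseract.image_to_string(img).
sample = (
    "The Future of Artificial Intelligence\n\n"
    "Artificial Intelligence (AI) is rapidly transforming the way we live\n"
    "and work. From virtual assistants to self-driving cars, AI systems are\n"
    "becoming more capable and widely adopted.\n\x0c"
)

paragraphs = tidy_ocr_text(sample)
title, body = paragraphs[0], paragraphs[1:]
print(title)
print(body)
```

Treating the first block as the title is a convention chosen for this sample; other layouts may need a different rule.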
Conclusion¶
| Task | Tool | Output |
|---|---|---|
| Extract table from image | img2table + TesseractOCR | pandas DataFrame |
| Extract text from image | pytesseract | Python string |
Key takeaways:
- Image quality directly impacts OCR accuracy — clean fonts, high contrast, and consistent color give the best results
- img2table is preferred over raw pytesseract for tables, as it handles column/row structure automatically
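For images that fall short of the quality described in the first takeaway, a preprocessing pass before OCR may help. The sketch below uses Pillow to convert to grayscale, stretch the contrast, upscale, and binarize. The threshold and scale values are illustrative assumptions that would need tuning per image, and preprocess_for_ocr is a hypothetical helper.

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img, threshold=160, scale=2):
    """Return a black-and-white, upscaled copy of img (mode "L").

    threshold and scale are illustrative defaults, not recommended values.
    """
    gray = ImageOps.autocontrast(ImageOps.grayscale(img))
    if scale > 1:
        gray = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)

# Synthetic low-contrast image: a dark grey bar on a light grey background.
demo = Image.new("RGB", (40, 20), (210, 210, 210))
for x in range(10, 30):
    for y in range(8, 12):
        demo.putpixel((x, y), (60, 60, 60))

cleaned = preprocess_for_ocr(demo)
print(cleaned.mode, cleaned.size)
```

With a real file, the cleaned image can be passed to pytesseract.image_to_string() in place of the original. Whether this improves accuracy depends on the image, so it is worth comparing the output with and without the step.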