OCR — Extracting Data from Images using Python¶

Data is often available in formats that cannot be processed directly — photographed reports or image exports of tables. To Python, these are just pixels. You cannot run pd.read_csv() on an image or extract text from a screenshot without extra tooling.

OCR (Optical Character Recognition) solves this by converting image content into machine-readable text and structured data.

Workflow¶

Image Input → OCR Engine (Tesseract) → Parse Output
                                              ├── Table Image → img2table → pandas DataFrame
                                              └── Text Image  → pytesseract → Python string

Import Libraries¶

In [1]:
from img2table.document import Image as Img
from img2table.ocr import TesseractOCR
import pytesseract
from PIL import Image
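
Both pytesseract and img2table's TesseractOCR are wrappers: they need the Tesseract program itself installed on the machine. The following is a minimal sketch for checking that before running the OCR cells. It assumes the default Windows install location used later in this notebook as a fallback, and find_tesseract is a hypothetical helper, not part of either library.

```python
import os
import shutil


def find_tesseract(extra_candidates=()):
    """Return a path to the Tesseract binary, or None if it cannot be found."""
    # First choice: whatever "tesseract" resolves to on the PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Fallback: explicit locations supplied by the caller (assumed, not guaranteed).
    for candidate in extra_candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


path = find_tesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
if path is None:
    print("Tesseract binary not found; install it before running the OCR cells.")
else:
    print(f"Using Tesseract at: {path}")
```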

Image 1: Table Image¶

Contains a structured product sales table with rows and columns.

In [2]:
from IPython.display import display, Image as IPImage
display(IPImage(filename="ProductSalesTable.png"))

Image 2: Text Image¶

Contains a short article with headings and paragraphs.

In [3]:
from IPython.display import display, Image as IPImage
display(IPImage(filename="text.png"))

Table Extraction → DataFrame¶

Problem: Reading a table from an image manually is tedious and error-prone.
Solution: img2table detects table borders, rows, and columns automatically. Its extract_tables() method returns a list of detected tables, and each table's .df attribute is a pandas DataFrame.

It handles two things that plain pytesseract struggles with:

  • Correctly grouping words into rows and columns
  • Handling merged cells and irregular column spacing
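
To see why row grouping is not trivial, consider what word-level OCR output looks like: a flat list of words with pixel coordinates (pytesseract exposes this through image_to_data()). The sketch below groups such words into rows by vertical position. The Word type, the coordinates, and the tolerance value are made up for illustration, and this is not how img2table works internally; it only shows the kind of bookkeeping the library spares you.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One OCR word with the top-left corner and height of its bounding box."""
    text: str
    left: int
    top: int
    height: int


def group_into_rows(words, tolerance=0.5):
    """Put words whose vertical centres are close together into the same row."""
    rows = []
    for word in sorted(words, key=lambda w: w.top + w.height / 2):
        centre = word.top + word.height / 2
        # Same row if the centre is within a fraction of the word height
        # of the centre that started the current row.
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance * word.height:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    # Within each row, order words left to right.
    return [[w.text for w in sorted(r["words"], key=lambda w: w.left)] for r in rows]


# Hypothetical word boxes, deliberately out of reading order.
words = [
    Word("Units", 300, 12, 20),
    Word("Product", 10, 10, 20),
    Word("1273", 300, 51, 20),
    Word("Laptop", 10, 50, 20),
]
rows = group_into_rows(words)
print(rows)
```

Even this toy version does not deal with multi-word cells such as "Wireless Mouse" or with merged cells, which is where a dedicated table extractor earns its keep.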

In [4]:
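# Point pytesseract at the Tesseract binary (needed on Windows when it is not on PATH)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# ---------- TABLE EXTRACTION ----------
ocr = TesseractOCR()                # OCR backend that img2table uses to read cell text
doc = Img("ProductSalesTable.png")  # img2table wrapper around the image file

# Returns a list with one entry per table detected in the image
tables = doc.extract_tables(ocr=ocr)

if tables:
    df = tables[0].df  # first detected table as a pandas DataFrame
    print(df)
else:
    print("No tables found")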
                   0            1      2           3
0            Product     Category  Units  Revenue($)
1             Laptop  Electronics   1273   19,09,500
2     Wireless Mouse  Peripherlas   4652    1,39,560
3          USB-C Hub  Peripherlas   3121    1,87,260
4          HD WebCam  Accessories   1638    1,31,040
5  Monitor 27 inches  Electronics    870    6,09,000
6     Mechanical KBD  Peripherlas   2176    5,44,000
7      Desk USB Lamp  Accessories   2959      73,975
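
In the output above, the header row ended up as data row 0 and the numbers are comma-grouped text. The following is a minimal cleanup sketch, assuming the cell values arrive as strings. The rows are copied from the printed output rather than read from the image, and promote_header and to_int are hypothetical helpers.

```python
import pandas as pd

# First rows of the printed output; in the notebook this would be tables[0].df.
raw = pd.DataFrame([
    ["Product", "Category", "Units", "Revenue($)"],
    ["Laptop", "Electronics", "1273", "19,09,500"],
    ["Wireless Mouse", "Peripherlas", "4652", "1,39,560"],
])


def promote_header(df):
    """Use the first row as column names and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out


def to_int(series):
    """Remove grouping commas and convert digit strings to integers."""
    return series.str.replace(",", "", regex=False).astype(int)


clean = promote_header(raw)
clean["Units"] = to_int(clean["Units"])
clean["Revenue($)"] = to_int(clean["Revenue($)"])
print(clean)
```

Removing every comma works regardless of the grouping style, which matters here because the revenue figures use lakh-style grouping (19,09,500) rather than groups of three.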

Text Extraction → Python String¶

Problem: Text inside an image is not machine-readable — it cannot be copied, searched, or processed.
Solution: pytesseract.image_to_string() scans the image and returns all detected text as a plain Python str.

This works best on clean, printed text with consistent font and good contrast — exactly what our sample image provides.

In [5]:
# ---------- TEXT EXTRACTION ----------
img = Image.open("text.png")
text = pytesseract.image_to_string(img)

print(text)
The Future of Artificial Intelligence

Artificial Intelligence (AI) is rapidly transforming the way we live and
work. From virtual assistants to self-driving cars, AI systems are
becoming more capable and widely adopted.
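
The returned string keeps the line breaks of the image, so paragraphs arrive hard-wrapped. One possible way to turn it into a title plus unwrapped paragraphs is sketched below, assuming blank lines separate blocks as in the output above. unwrap_paragraphs is a hypothetical helper, and the sample string is copied from that output.

```python
def unwrap_paragraphs(ocr_text):
    """Split OCR text on blank lines and rejoin the wrapped lines of each block."""
    blocks = []
    current = []
    for line in ocr_text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the block collected so far.
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks


sample = (
    "The Future of Artificial Intelligence\n"
    "\n"
    "Artificial Intelligence (AI) is rapidly transforming the way we live and\n"
    "work. From virtual assistants to self-driving cars, AI systems are\n"
    "becoming more capable and widely adopted.\n"
)

# Treat the first block as the heading and the rest as body paragraphs.
title, *paragraphs = unwrap_paragraphs(sample)
print(title)
print(paragraphs)
```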

Conclusion¶

Task                       Tool                       Output
Extract table from image   img2table + TesseractOCR   pandas DataFrame
Extract text from image    pytesseract                Python string

Key takeaways:

  • Image quality directly affects OCR accuracy: clean fonts, high contrast, and consistent colors give the best results
  • img2table is preferred over raw pytesseract for tables because it reconstructs the row and column structure automatically
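
When an image falls short of those conditions, light preprocessing with Pillow before OCR may help. The following is a rough sketch, assuming that grayscale conversion, contrast stretching, 2x upscaling, and a fixed threshold of 160 are reasonable starting points; these are illustrative values, not tuned ones. preprocess_for_ocr is a hypothetical helper, and the demo image is synthetic.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(img, scale=2, threshold=160):
    """Grayscale, stretch contrast, upscale, then binarize to pure black and white."""
    gray = ImageOps.grayscale(img)
    gray = ImageOps.autocontrast(gray)  # map darkest pixel to 0, brightest to 255
    gray = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)


# Synthetic low-contrast image: a dark grey block on a light grey background.
demo = Image.new("RGB", (40, 20), (200, 200, 200))
demo.paste((60, 60, 60), (5, 5, 35, 15))

result = preprocess_for_ocr(demo)
print(result.mode, result.size)
```

The result can be passed to pytesseract.image_to_string() in place of the original image. Whether it actually improves accuracy depends on the image, so it is worth comparing the output with and without it.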