🦆 DuckDB — SQL on Pandas DataFrames¶


What is DuckDB? ¶

DuckDB is an in-process analytical database that runs entirely inside your Python environment — no server, no setup, no external dependencies. If you've used SQLAlchemy, you know the drill — create an engine, manage a connection, configure a dialect. It works, but it's ceremony.

DuckDB skips all of that.

No server. No credentials. No connection string. Just a live SQL engine running inside your Python process, ready to query DataFrames directly from memory.


Why DuckDB Over Pandas?¶

If you already know Python and Pandas, here's the honest truth: Pandas struggles at scale, and you've probably felt it.

Pain Point Pandas DuckDB
Large files (1GB+) Loads everything into RAM Streams & processes strategically
Complex aggregations Slow, verbose chaining Native SQL — fast & readable
Multi-table joins Painful merges SQL JOIN in one line
Memory usage High (copies on transforms) Minimal — operates in-place

Bottom line: DuckDB doesn't replace Python — it supercharges it. You keep your DataFrames, your .df() output, your Jupyter workflow. You just stop wrestling with Pandas for analytical tasks it was never optimized for.

If you know SQL, DuckDB is the upgrade you didn't know you needed.


When to use what: DuckDB vs Pandas?¶

Task Recommended Tool
Data cleaning, nulls, string ops Pandas
Multi-metric GROUP BY DuckDB
Window functions / ranking DuckDB
Multi-step logic (CTEs) DuckDB
Statistical distributions DuckDB
In-place column mutation Pandas
Plotting / visualisation Pandas → matplotlib / seaborn

How DuckDB Works? ¶

When you write:

duckdb.query("SELECT * FROM df WHERE Total_Price > 500").df()

DuckDB scans your Python local scope, finds the variable df, and reads it directly from memory as a columnar data source — no serialization, no disk I/O, no loading step.


Setup & Imports ¶

In [1]:
# Install dependencies
import pandas as pd
import duckdb

Load & Explore Data ¶

In [2]:
# Load dataset
df = pd.read_csv('Online Sales Data.csv')
df.columns = df.columns.str.strip().str.replace(' ', '_')

Query 1 — Revenue by Region & Category GROUP BY ¶

Question: Which region and product category combination drives the most revenue and volume?

In Pandas, answering this requires multiple steps — .groupby(), .agg() with a dictionary mapping each column to its function, followed by .reset_index() to flatten the result, and finally .sort_values() to order it. That's 4 chained operations just to answer one business question.

DuckDB expresses the exact same logic in one clean SQL block — SELECT, GROUP BY, ORDER BY — the way analytical queries were meant to be written. No chaining, no .reset_index() surprises, and ROUND() is built right in.

In [3]:
q1 = duckdb.query("""
    SELECT
        Region,
        Product_Category,
        COUNT(*)                        AS total_orders,
        SUM(Units_Sold)                   AS total_units_sold,
        ROUND(SUM(Unit_Price), 2)      AS total_revenue,
        ROUND(AVG(Unit_Price), 2)      AS avg_order_value
    FROM df
    GROUP BY Region,Product_Category
    ORDER BY total_revenue DESC
""").df()
q1
Out[3]:
Region Product_Category total_orders total_units_sold total_revenue avg_order_value
0 North America Electronics 40 66.0 27663.66 691.59
1 Europe Home Appliances 40 59.0 12807.42 320.19
2 Asia Sports 40 88.0 10451.36 261.28
3 Asia Clothing 40 145.0 2701.46 67.54
4 Europe Beauty Products 40 46.0 2464.92 61.62
5 North America Books 40 114.0 646.12 16.15

Query 2 — Top 3 Products per Region WINDOW + QUALIFY ¶

Question: What are the top 3 best-selling products by revenue in each region?

In [4]:
q2 = duckdb.query("""
    SELECT
        Region,
        Product_Name,
        Product_Category,
        ROUND(SUM(Unit_Price), 2)      AS total_revenue,
        SUM(Units_Sold)                   AS total_units,
        RANK() OVER (
            PARTITION BY Region
            ORDER BY SUM(Unit_Price) DESC
        ) AS revenue_rank
    FROM df
    GROUP BY Region, Product_Name, Product_Category
    QUALIFY revenue_rank <= 3
    ORDER BY Region, revenue_rank
""").df()
q2
Out[4]:
Region Product_Name Product_Category total_revenue total_units revenue_rank
0 Asia Peloton Bike Sports 1895.00 1.0 1
1 Asia Garmin Fenix 6X Pro Sports 999.99 1.0 2
2 Asia Bowflex SelectTech 1090 Adjustable Dumbbells Sports 699.99 1.0 3
3 Europe LG OLED TV Home Appliances 1299.99 2.0 1
4 Europe De'Longhi Magnifica Espresso Machine Home Appliances 899.99 1.0 2
5 Europe Roomba i7+ Home Appliances 799.99 2.0 3
6 North America Canon EOS R5 Camera Electronics 3899.99 1.0 1
7 North America MacBook Pro 16-inch Electronics 2499.99 1.0 2
8 North America Apple MacBook Pro 16-inch Electronics 2399.00 1.0 3

Query 3 — Payment Method & Spend Segmentation CASE WHEN ¶

Question: How does revenue and order volume break down by payment method, and what value tier does each payment bucket fall into?

In [5]:
q4 = duckdb.query("""
    SELECT
        Payment_Method,
        COUNT(Transaction_ID)                     AS total_orders,
        ROUND(SUM(Unit_Price), 2)          AS total_revenue,
        ROUND(AVG(Unit_Price), 2)          AS avg_order_value,
        ROUND(SUM(Unit_Price) * 100.0 /
              SUM(SUM(Unit_Price)) OVER (), 1) AS revenue_share_pct,
        CASE
            WHEN AVG(Unit_Price) >= 500  THEN 'High AOV'
            WHEN AVG(Unit_Price) >= 200  THEN 'Mid AOV'
            ELSE                               'Low AOV'
        END AS aov_tier
    FROM df
    GROUP BY Payment_Method
    ORDER BY total_revenue DESC
""").df()
q4
Out[5]:
Payment_Method total_orders total_revenue avg_order_value revenue_share_pct aov_tier
0 Credit Card 120 38761.14 323.01 68.3 Mid AOV
1 PayPal 80 15272.34 190.90 26.9 Low AOV
2 Debit Card 40 2701.46 67.54 4.8 Low AOV

Conclusion ¶

Across 3 focused queries on a real marketplace dataset, we showed that DuckDB handles the full range of analytical SQL — multi-dimensional aggregations, window-based rankings, running totals via CTEs, inline classification, and statistical distributions — all directly on a Pandas DataFrame with zero infrastructure.

# Query SQL Feature Business Question Answered
1 Revenue by Region & Category GROUP BY Which market + category drives most revenue?
2 Top 3 Products per Region WINDOW + QUALIFY Best-sellers by geography
3 Payment Method Segmentation CASE WHEN + window share AOV tiers and payment channel contribution