🦆 DuckDB — SQL on Pandas DataFrames¶

What is DuckDB? ¶

DuckDB is an in-process analytical database that runs entirely inside your Python environment — no server, no setup, no external dependencies. If you've used SQLAlchemy, you know the drill — create an engine, manage a connection, configure a dialect. It works, but it's ceremony.

DuckDB skips all of that.

No server. No credentials. No connection string. Just a live SQL engine running inside your Python process, ready to query DataFrames directly from memory.

Why DuckDB Over Pandas?¶

If you already know Python and Pandas, here's the honest truth: Pandas struggles at scale, and you've probably felt it.

Pain Point	Pandas	DuckDB
Large files (1GB+)	Loads everything into RAM	Streams & processes strategically
Complex aggregations	Slow, verbose chaining	Native SQL — fast & readable
Multi-table joins	Painful merges	SQL `JOIN` in one line
Memory usage	High (copies on transforms)	Minimal — operates in-place

Bottom line: DuckDB doesn't replace Python — it supercharges it. You keep your DataFrames, your .df() output, your Jupyter workflow. You just stop wrestling with Pandas for analytical tasks it was never optimized for.

If you know SQL, DuckDB is the upgrade you didn't know you needed.

When to use what: DuckDB vs Pandas?¶

Task	Recommended Tool
Data cleaning, nulls, string ops	Pandas
Multi-metric GROUP BY	DuckDB
Window functions / ranking	DuckDB
Multi-step logic (CTEs)	DuckDB
Statistical distributions	DuckDB
In-place column mutation	Pandas
Plotting / visualisation	Pandas → matplotlib / seaborn

How DuckDB Works? ¶

When you write:

duckdb.query("SELECT * FROM df WHERE Total_Price > 500").df()

DuckDB scans your Python local scope, finds the variable df, and reads it directly from memory as a columnar data source — no serialization, no disk I/O, no loading step.

Setup & Imports ¶

In [1]:

# Install dependencies
import pandas as pd
import duckdb

Load & Explore Data ¶

In [2]:

# Load dataset
df = pd.read_csv('Online Sales Data.csv')
df.columns = df.columns.str.strip().str.replace(' ', '_')

Query 1 — Revenue by Region & Category `GROUP BY` ¶

Question: Which region and product category combination drives the most revenue and volume?

In Pandas, answering this requires multiple steps — .groupby(), .agg() with a dictionary mapping each column to its function, followed by .reset_index() to flatten the result, and finally .sort_values() to order it. That's 4 chained operations just to answer one business question.

DuckDB expresses the exact same logic in one clean SQL block — SELECT, GROUP BY, ORDER BY — the way analytical queries were meant to be written. No chaining, no .reset_index() surprises, and ROUND() is built right in.

In [3]:

q1 = duckdb.query("""
    SELECT
        Region,
        Product_Category,
        COUNT(*)                        AS total_orders,
        SUM(Units_Sold)                   AS total_units_sold,
        ROUND(SUM(Unit_Price), 2)      AS total_revenue,
        ROUND(AVG(Unit_Price), 2)      AS avg_order_value
    FROM df
    GROUP BY Region,Product_Category
    ORDER BY total_revenue DESC
""").df()
q1

Out[3]:

	Region	Product_Category	total_orders	total_units_sold	total_revenue	avg_order_value
0	North America	Electronics	40	66.0	27663.66	691.59
1	Europe	Home Appliances	40	59.0	12807.42	320.19
2	Asia	Sports	40	88.0	10451.36	261.28
3	Asia	Clothing	40	145.0	2701.46	67.54
4	Europe	Beauty Products	40	46.0	2464.92	61.62
5	North America	Books	40	114.0	646.12	16.15

Query 2 — Top 3 Products per Region `WINDOW + QUALIFY` ¶

Question: What are the top 3 best-selling products by revenue in each region?

In [4]:

q2 = duckdb.query("""
    SELECT
        Region,
        Product_Name,
        Product_Category,
        ROUND(SUM(Unit_Price), 2)      AS total_revenue,
        SUM(Units_Sold)                   AS total_units,
        RANK() OVER (
            PARTITION BY Region
            ORDER BY SUM(Unit_Price) DESC
        ) AS revenue_rank
    FROM df
    GROUP BY Region, Product_Name, Product_Category
    QUALIFY revenue_rank <= 3
    ORDER BY Region, revenue_rank
""").df()
q2

Out[4]:

	Region	Product_Name	Product_Category	total_revenue	total_units	revenue_rank
0	Asia	Peloton Bike	Sports	1895.00	1.0	1
1	Asia	Garmin Fenix 6X Pro	Sports	999.99	1.0	2
2	Asia	Bowflex SelectTech 1090 Adjustable Dumbbells	Sports	699.99	1.0	3
3	Europe	LG OLED TV	Home Appliances	1299.99	2.0	1
4	Europe	De'Longhi Magnifica Espresso Machine	Home Appliances	899.99	1.0	2
5	Europe	Roomba i7+	Home Appliances	799.99	2.0	3
6	North America	Canon EOS R5 Camera	Electronics	3899.99	1.0	1
7	North America	MacBook Pro 16-inch	Electronics	2499.99	1.0	2
8	North America	Apple MacBook Pro 16-inch	Electronics	2399.00	1.0	3

Query 3 — Payment Method & Spend Segmentation `CASE WHEN` ¶

Question: How does revenue and order volume break down by payment method, and what value tier does each payment bucket fall into?

In [5]:

q4 = duckdb.query("""
    SELECT
        Payment_Method,
        COUNT(Transaction_ID)                     AS total_orders,
        ROUND(SUM(Unit_Price), 2)          AS total_revenue,
        ROUND(AVG(Unit_Price), 2)          AS avg_order_value,
        ROUND(SUM(Unit_Price) * 100.0 /
              SUM(SUM(Unit_Price)) OVER (), 1) AS revenue_share_pct,
        CASE
            WHEN AVG(Unit_Price) >= 500  THEN 'High AOV'
            WHEN AVG(Unit_Price) >= 200  THEN 'Mid AOV'
            ELSE                               'Low AOV'
        END AS aov_tier
    FROM df
    GROUP BY Payment_Method
    ORDER BY total_revenue DESC
""").df()
q4

Out[5]:

	Payment_Method	total_orders	total_revenue	avg_order_value	revenue_share_pct	aov_tier
0	Credit Card	120	38761.14	323.01	68.3	Mid AOV
1	PayPal	80	15272.34	190.90	26.9	Low AOV
2	Debit Card	40	2701.46	67.54	4.8	Low AOV

Conclusion ¶

Across 3 focused queries on a real marketplace dataset, we showed that DuckDB handles the full range of analytical SQL — multi-dimensional aggregations, window-based rankings, running totals via CTEs, inline classification, and statistical distributions — all directly on a Pandas DataFrame with zero infrastructure.

#	Query	SQL Feature	Business Question Answered
1	Revenue by Region & Category	`GROUP BY`	Which market + category drives most revenue?
2	Top 3 Products per Region	`WINDOW + QUALIFY`	Best-sellers by geography
3	Payment Method Segmentation	`CASE WHEN` + window share	AOV tiers and payment channel contribution