🦆 DuckDB — SQL on Pandas DataFrames¶
What is DuckDB? ¶
DuckDB is an in-process analytical database that runs entirely inside your Python environment — no server, no setup, no external dependencies. If you've used SQLAlchemy, you know the drill — create an engine, manage a connection, configure a dialect. It works, but it's ceremony.
DuckDB skips all of that.
No server. No credentials. No connection string. Just a live SQL engine running inside your Python process, ready to query DataFrames directly from memory.
Why DuckDB Over Pandas?¶
If you already know Python and Pandas, here's the honest truth: Pandas struggles at scale, and you've probably felt it.
| Pain Point | Pandas | DuckDB |
|---|---|---|
| Large files (1GB+) | Loads everything into RAM | Streams & processes strategically |
| Complex aggregations | Slow, verbose chaining | Native SQL — fast & readable |
| Multi-table joins | Painful merges | SQL JOIN in one line |
| Memory usage | High (copies on transforms) | Minimal — operates in-place |
Bottom line: DuckDB doesn't replace Python — it supercharges it. You keep your DataFrames, your .df() output, your Jupyter workflow. You just stop wrestling with Pandas for analytical tasks it was never optimized for.
If you know SQL, DuckDB is the upgrade you didn't know you needed.
When to use what: DuckDB vs Pandas?¶
| Task | Recommended Tool |
|---|---|
| Data cleaning, nulls, string ops | Pandas |
| Multi-metric GROUP BY | DuckDB |
| Window functions / ranking | DuckDB |
| Multi-step logic (CTEs) | DuckDB |
| Statistical distributions | DuckDB |
| In-place column mutation | Pandas |
| Plotting / visualisation | Pandas → matplotlib / seaborn |
How DuckDB Works? ¶
When you write:
duckdb.query("SELECT * FROM df WHERE Total_Price > 500").df()
DuckDB scans your Python local scope, finds the variable df, and reads it directly from memory as a columnar data source — no serialization, no disk I/O, no loading step.
Setup & Imports ¶
# Install dependencies
import pandas as pd
import duckdb
Load & Explore Data ¶
# Load dataset
df = pd.read_csv('Online Sales Data.csv')
df.columns = df.columns.str.strip().str.replace(' ', '_')
Query 1 — Revenue by Region & Category GROUP BY ¶
Question: Which region and product category combination drives the most revenue and volume?
In Pandas, answering this requires multiple steps — .groupby(), .agg() with a dictionary mapping each column to its function, followed by .reset_index() to flatten the result, and finally .sort_values() to order it. That's 4 chained operations just to answer one business question.
DuckDB expresses the exact same logic in one clean SQL block — SELECT, GROUP BY, ORDER BY — the way analytical queries were meant to be written. No chaining, no .reset_index() surprises, and ROUND() is built right in.
q1 = duckdb.query("""
SELECT
Region,
Product_Category,
COUNT(*) AS total_orders,
SUM(Units_Sold) AS total_units_sold,
ROUND(SUM(Unit_Price), 2) AS total_revenue,
ROUND(AVG(Unit_Price), 2) AS avg_order_value
FROM df
GROUP BY Region,Product_Category
ORDER BY total_revenue DESC
""").df()
q1
| Region | Product_Category | total_orders | total_units_sold | total_revenue | avg_order_value | |
|---|---|---|---|---|---|---|
| 0 | North America | Electronics | 40 | 66.0 | 27663.66 | 691.59 |
| 1 | Europe | Home Appliances | 40 | 59.0 | 12807.42 | 320.19 |
| 2 | Asia | Sports | 40 | 88.0 | 10451.36 | 261.28 |
| 3 | Asia | Clothing | 40 | 145.0 | 2701.46 | 67.54 |
| 4 | Europe | Beauty Products | 40 | 46.0 | 2464.92 | 61.62 |
| 5 | North America | Books | 40 | 114.0 | 646.12 | 16.15 |
Query 2 — Top 3 Products per Region WINDOW + QUALIFY ¶
Question: What are the top 3 best-selling products by revenue in each region?
q2 = duckdb.query("""
SELECT
Region,
Product_Name,
Product_Category,
ROUND(SUM(Unit_Price), 2) AS total_revenue,
SUM(Units_Sold) AS total_units,
RANK() OVER (
PARTITION BY Region
ORDER BY SUM(Unit_Price) DESC
) AS revenue_rank
FROM df
GROUP BY Region, Product_Name, Product_Category
QUALIFY revenue_rank <= 3
ORDER BY Region, revenue_rank
""").df()
q2
| Region | Product_Name | Product_Category | total_revenue | total_units | revenue_rank | |
|---|---|---|---|---|---|---|
| 0 | Asia | Peloton Bike | Sports | 1895.00 | 1.0 | 1 |
| 1 | Asia | Garmin Fenix 6X Pro | Sports | 999.99 | 1.0 | 2 |
| 2 | Asia | Bowflex SelectTech 1090 Adjustable Dumbbells | Sports | 699.99 | 1.0 | 3 |
| 3 | Europe | LG OLED TV | Home Appliances | 1299.99 | 2.0 | 1 |
| 4 | Europe | De'Longhi Magnifica Espresso Machine | Home Appliances | 899.99 | 1.0 | 2 |
| 5 | Europe | Roomba i7+ | Home Appliances | 799.99 | 2.0 | 3 |
| 6 | North America | Canon EOS R5 Camera | Electronics | 3899.99 | 1.0 | 1 |
| 7 | North America | MacBook Pro 16-inch | Electronics | 2499.99 | 1.0 | 2 |
| 8 | North America | Apple MacBook Pro 16-inch | Electronics | 2399.00 | 1.0 | 3 |
Query 3 — Payment Method & Spend Segmentation CASE WHEN ¶
Question: How does revenue and order volume break down by payment method, and what value tier does each payment bucket fall into?
q4 = duckdb.query("""
SELECT
Payment_Method,
COUNT(Transaction_ID) AS total_orders,
ROUND(SUM(Unit_Price), 2) AS total_revenue,
ROUND(AVG(Unit_Price), 2) AS avg_order_value,
ROUND(SUM(Unit_Price) * 100.0 /
SUM(SUM(Unit_Price)) OVER (), 1) AS revenue_share_pct,
CASE
WHEN AVG(Unit_Price) >= 500 THEN 'High AOV'
WHEN AVG(Unit_Price) >= 200 THEN 'Mid AOV'
ELSE 'Low AOV'
END AS aov_tier
FROM df
GROUP BY Payment_Method
ORDER BY total_revenue DESC
""").df()
q4
| Payment_Method | total_orders | total_revenue | avg_order_value | revenue_share_pct | aov_tier | |
|---|---|---|---|---|---|---|
| 0 | Credit Card | 120 | 38761.14 | 323.01 | 68.3 | Mid AOV |
| 1 | PayPal | 80 | 15272.34 | 190.90 | 26.9 | Low AOV |
| 2 | Debit Card | 40 | 2701.46 | 67.54 | 4.8 | Low AOV |
Conclusion ¶
Across 3 focused queries on a real marketplace dataset, we showed that DuckDB handles the full range of analytical SQL — multi-dimensional aggregations, window-based rankings, running totals via CTEs, inline classification, and statistical distributions — all directly on a Pandas DataFrame with zero infrastructure.
| # | Query | SQL Feature | Business Question Answered |
|---|---|---|---|
| 1 | Revenue by Region & Category | GROUP BY |
Which market + category drives most revenue? |
| 2 | Top 3 Products per Region | WINDOW + QUALIFY |
Best-sellers by geography |
| 3 | Payment Method Segmentation | CASE WHEN + window share |
AOV tiers and payment channel contribution |