Efficient Data Filtering in Python: Mastering the query Function

Introduction:

Data manipulation is at the heart of data science and analytics. Python’s pandas library is a powerful tool for handling and analyzing data, offering various methods to filter, query, and process large datasets. Among these, the query function stands out for its simplicity and readability when filtering data based on conditions. In this blog, we’ll explore how to use the query function, compare it to traditional approaches, and understand why it is a valuable addition to your data wrangling toolkit.

Importing Library

import pandas as pd

Importing data

Data Description:

The sample dataset contains information about employees in a company.It includes the following columns:

Column	Description
Name	Names of the employees.
Age	Age of the employees.
Department	The department each employee belongs to (e.g., HR, IT, Finance).
Salary	The salary of each employee.

df.head()

	Name	Age	Department	Salary
0	Alice	24	HR	50000
1	Bob	27	IT	60000
2	Charlie	22	Finance	55000
3	David	32	IT	70000
4	Eve	29	HR	65000

General Approach (Without Using query):

To filter employees in the HR department with a salary above 55000, you can use Boolean indexing:

# General approach without query
filtered_data = df[(df["Department"] == "HR") & (df["Salary"] > 55000)]
filtered_data

	Name	Age	Department	Salary
4	Eve	29	HR	65000

Using `query` Function:

The same task can be accomplished using the query function:

# Using query
filtered_data = df.query('Department == "HR" and Salary > 55000')
filtered_data

	Name	Age	Department	Salary
4	Eve	29	HR	65000

Query with Variables

One of the most powerful features of the query function is its ability to incorporate variables dynamically into the filtering conditions. By prefixing a variable with the @ symbol inside a query string, you can seamlessly integrate its value into your condition.

Example: Suppose you want to filter employees whose salary exceeds a given threshold, which can vary based on user input or analysis requirements. Here’s how you can do it using the query function:

# Define a variable for the salary threshold
salary_threshold = 55000

# Use the variable in the query
filtered_data = df.query('Salary > @salary_threshold')
filtered_data

	Name	Age	Department	Salary
1	Bob	27	IT	60000
3	David	32	IT	70000
4	Eve	29	HR	65000

Performance Benchmarking

To compare the performance of the query function with the traditional Boolean indexing approach, we used %timeit on a dataset with 1,000,000 rows.

# Measure execution time using Boolean indexing
boolean_indexing_time = %timeit -o large_df[(large_df["Department"] == department) & (large_df["Salary"] > salary_threshold)]

# Measure execution time using query function
query_time = %timeit -o large_df.query('Department == @department and Salary > @salary_threshold')

boolean_indexing_time,query_time

180 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
105 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(<TimeitResult : 180 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>,
 <TimeitResult : 105 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>)

Here are the results of the performance comparison using %timeit on a dataset with 1,000,000 rows:

Boolean Indexing: 131 ms ± 7.08 ms per loop
query Function: 95.4 ms ± 3.1 ms per loop

Note-The query function was faster in this case, taking approximately 27% less time than Boolean indexing. While this may vary depending on the dataset size and complexity, query often provides both performance and readability benefits in filtering operations.

Why Use query Instead of the General Approach?

Readability: The query function uses a more natural language-like syntax, making code easier to read and understand. Example: ‘Department == “HR” and Salary > 55000’ is simpler than using & and indexing brackets.
Conciseness: Reduces boilerplate code. You don’t need to repeatedly reference the DataFrame name or column indices.
Ease of Use in Strings: Ideal when filtering conditions are dynamic and provided as strings.
Enhanced Performance: For large datasets, query can sometimes perform better since it is optimized for column-based operations.

Conclusion:

The query function in pandas is a powerful tool for filtering data in an intuitive and readable way. While the general approach using Boolean indexing is versatile, query shines when dealing with complex conditions or when prioritizing code clarity. Incorporating query into your data manipulation workflow can save time and improve the maintainability of your code.

So next time you filter data in pandas, give the query function a try—you might find it becomes your new favorite!