import pandas as pd
Efficient Data Filtering in Python: Mastering the query Function
Introduction:
Data manipulation is at the heart of data science and analytics. Python’s pandas library is a powerful tool for handling and analyzing data, offering various methods to filter, query, and process large datasets. Among these, the query function stands out for its simplicity and readability when filtering data based on conditions. In this blog, we’ll explore how to use the query function, compare it to traditional approaches, and understand why it is a valuable addition to your data wrangling toolkit.
Importing Library
Importing data
Data Description:
The sample dataset contains information about employees in a company.It includes the following columns:
Column | Description |
---|---|
Name | Names of the employees. |
Age | Age of the employees. |
Department | The department each employee belongs to (e.g., HR, IT, Finance). |
Salary | The salary of each employee. |
df.head()
Name | Age | Department | Salary | |
---|---|---|---|---|
0 | Alice | 24 | HR | 50000 |
1 | Bob | 27 | IT | 60000 |
2 | Charlie | 22 | Finance | 55000 |
3 | David | 32 | IT | 70000 |
4 | Eve | 29 | HR | 65000 |
General Approach (Without Using query):
To filter employees in the HR department with a salary above 55000, you can use Boolean indexing:
# General approach without query
= df[(df["Department"] == "HR") & (df["Salary"] > 55000)]
filtered_data filtered_data
Name | Age | Department | Salary | |
---|---|---|---|---|
4 | Eve | 29 | HR | 65000 |
Using query
Function:
The same task can be accomplished using the query function:
# Using query
= df.query('Department == "HR" and Salary > 55000')
filtered_data filtered_data
Name | Age | Department | Salary | |
---|---|---|---|---|
4 | Eve | 29 | HR | 65000 |
Query with Variables
One of the most powerful features of the query function is its ability to incorporate variables dynamically into the filtering conditions. By prefixing a variable with the @ symbol inside a query string, you can seamlessly integrate its value into your condition.
Example
: Suppose you want to filter employees whose salary exceeds a given threshold, which can vary based on user input or analysis requirements. Here’s how you can do it using the query function:
# Define a variable for the salary threshold
= 55000
salary_threshold
# Use the variable in the query
= df.query('Salary > @salary_threshold')
filtered_data filtered_data
Name | Age | Department | Salary | |
---|---|---|---|---|
1 | Bob | 27 | IT | 60000 |
3 | David | 32 | IT | 70000 |
4 | Eve | 29 | HR | 65000 |
Performance Benchmarking
To compare the performance of the query function with the traditional Boolean indexing approach, we used %timeit on a dataset with 1,000,000 rows.
# Measure execution time using Boolean indexing
= %timeit -o large_df[(large_df["Department"] == department) & (large_df["Salary"] > salary_threshold)]
boolean_indexing_time
# Measure execution time using query function
= %timeit -o large_df.query('Department == @department and Salary > @salary_threshold')
query_time
boolean_indexing_time,query_time
180 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
105 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(<TimeitResult : 180 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>,
<TimeitResult : 105 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>)
Here are the results of the performance comparison using %timeit on a dataset with 1,000,000 rows:
- Boolean Indexing: 131 ms ± 7.08 ms per loop
- query Function: 95.4 ms ± 3.1 ms per loop
Note
-The query function was faster in this case, taking approximately 27% less time than Boolean indexing. While this may vary depending on the dataset size and complexity, query often provides both performance and readability benefits in filtering operations.
Why Use query Instead of the General Approach?
Readability: The query function uses a more natural language-like syntax, making code easier to read and understand. Example: ‘Department == “HR” and Salary > 55000’ is simpler than using & and indexing brackets.
Conciseness: Reduces boilerplate code. You don’t need to repeatedly reference the DataFrame name or column indices.
Ease of Use in Strings: Ideal when filtering conditions are dynamic and provided as strings.
Enhanced Performance: For large datasets, query can sometimes perform better since it is optimized for column-based operations.
Conclusion:
The query function in pandas is a powerful tool for filtering data in an intuitive and readable way. While the general approach using Boolean indexing is versatile, query shines when dealing with complex conditions or when prioritizing code clarity. Incorporating query into your data manipulation workflow can save time and improve the maintainability of your code.
So next time you filter data in pandas, give the query function a try—you might find it becomes your new favorite!