Exploring the set() Function in Python

The set() function creates a collection of unique, unordered elements, automatically removing duplicates. It is useful for comparing data, grouping categories, and finding overlaps within a dataset. Once a DataFrame column is converted to a set, you can call methods such as union, intersection, and difference to analyze and compare its values. Sets can be created with the set() function or by enclosing elements in curly braces, for example {'a', 'b'}. Note that empty braces {} create a dictionary, so an empty set must be written as set().
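As a minimal sketch of the points above, using made-up region names rather than the dataset, the following shows duplicate removal, the curly-brace literal, and the empty-braces pitfall:

```python
# Hypothetical values for illustration; not taken from the dataset.
regions = ["East", "West", "East", "South", "West"]

unique_regions = set(regions)                 # duplicates are removed
literal_regions = {"East", "West", "South"}   # set literal with curly braces

empty_set = set()    # the only way to create an empty set
empty_braces = {}    # this is an empty dict, not a set

print(unique_regions == literal_regions)                      # True
print(type(empty_set).__name__, type(empty_braces).__name__)  # set dict
```

Two sets compare equal when they hold the same elements, regardless of the order in which they were added.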

Importing Library & Dataset

import pandas as pd
df = pd.read_csv('C:/Users/SANKHYA/Documents/dataset.csv')
df.head()
         Order_ID  Order_Date  Customer_ID         Category  Sub_Category   Sales   Region
0  CA-2018-106103    10-06-18     SC-20305       Technology   Accessories  132.52  Central
1  CA-2018-102407    09-12-18     AT-10435  Office Supplies           Art   11.16     West
2  CA-2018-117947    18-08-18     NG-18355        Furniture   Furnishings   40.48     East
3  CA-2018-152485    04-09-18     JD-15790  Office Supplies           Art   13.12  Central
4  CA-2018-153339    03-11-18     DJ-13510        Furniture   Furnishings   15.99    South

Let’s try a few set methods:

Using Union: Combines all unique elements from two sets.

# Define two sets
products_tech = set(df[df['Category']=='Technology']['Sub_Category'])
products_furniture = set(df[df['Category']=='Furniture']['Sub_Category'])

# Union of both sets
union_prod = products_tech.union(products_furniture)
print("Union of Sub-Categories (Technology and Furniture):", union_prod)
Union of Sub-Categories (Technology and Furniture): {'Furnishings', 'Machines', 'Chairs', 'Phones', 'Accessories', 'Copiers', 'Tables', 'Bookcases'}
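The same union can also be written with the | operator. The sketch below is standalone and hard-codes the eight sub-categories from the output above. How they are split between the two sets is an assumption made for illustration, not something read from the CSV.

```python
# Assumed split of the sub-categories shown in the output above.
products_tech = {"Accessories", "Copiers", "Machines", "Phones"}
products_furniture = {"Bookcases", "Chairs", "Furnishings", "Tables"}

union_method = products_tech.union(products_furniture)
union_operator = products_tech | products_furniture  # same result

# union() also accepts any iterable, such as a list;
# the | operator requires sets on both sides.
union_with_list = products_tech.union(["Chairs", "Tables"])

# Sets have no defined order, so sort before printing for reproducible output.
print(sorted(union_method))
```

Because sets are unordered, the printed order of a raw set can differ between runs, which is why the example sorts the result before printing.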

Using Intersection: Finds elements common to both sets.

# Define two sets
cust_region1 = set(df[df['Region']=='East']['Customer_ID'])
cust_region2 = set(df[df['Region']=='South']['Customer_ID'])

# Intersection of customers from both regions
overlap_cust = cust_region1.intersection(cust_region2)
print("Customers in Both East and South Regions:", overlap_cust)
Customers in Both East and South Regions: {'MD-17860', 'AH-10075', 'CA-12265', 'RD-19810', 'BT-11680', 'AS-10090'}
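To see why a customer can land in both sets, here is a small standalone sketch. The (region, customer ID) pairs are invented for illustration, and the & operator is shown as an equivalent to intersection().

```python
# Hypothetical (region, customer_id) pairs; the IDs are made up.
orders = [
    ("East", "AA-10001"),
    ("East", "BB-10002"),
    ("South", "BB-10002"),
    ("South", "CC-10003"),
    ("East", "AA-10001"),  # repeat order, collapsed by the set
]

cust_east = {cust for region, cust in orders if region == "East"}
cust_south = {cust for region, cust in orders if region == "South"}

overlap_method = cust_east.intersection(cust_south)
overlap_operator = cust_east & cust_south  # same result

print(overlap_method)  # {'BB-10002'}
```

Only the customer with orders in both regions survives the intersection. If no customer were shared, the result would be an empty set, printed as set().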

Using Difference: Identifies elements present in one set but not in the other.

# Define two sets
products_office = set(df[df['Category']=='Office Supplies']['Sub_Category'])
products_tech = set(df[df['Category']=='Technology']['Sub_Category'])

# Difference of sub-categories: sub-categories in Office Supplies but not in Technology
difference_prod = products_office.difference(products_tech)
print("Sub-Categories in Office Supplies but not in Technology:", difference_prod)
Sub-Categories in Office Supplies but not in Technology: {'Appliances', 'Labels', 'Art', 'Storage', 'Fasteners', 'Envelopes', 'Supplies', 'Paper', 'Binders'}
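Unlike union and intersection, difference depends on which set comes first. The sketch below uses hypothetical sub-category sets with one invented shared entry ("Bundles") to make the asymmetry visible. It also shows the - operator as an equivalent to difference().

```python
# Hypothetical sub-category sets; "Bundles" is a made-up shared entry.
products_office = {"Art", "Binders", "Paper", "Bundles"}
products_tech = {"Phones", "Copiers", "Bundles"}

only_office = products_office.difference(products_tech)
only_tech = products_tech - products_office  # operator form, opposite direction

print(sorted(only_office))  # ['Art', 'Binders', 'Paper']
print(sorted(only_tech))    # ['Copiers', 'Phones']
```

In the dataset example above, the difference returns the full list of Office Supplies sub-categories, which suggests the two categories share no sub-category names there.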

Conclusion:

Python sets are a versatile tool for managing collections of unique elements, which makes them well suited to data comparison and filtering. The union, intersection, and difference methods let you merge groups, find overlaps, and isolate what is specific to one group in a single line of code. Adding these operations to your data analysis toolkit makes it easier to answer questions about shared and distinct values across columns and categories.