Exploring the set() Function in Python

The set() function creates a collection of unique, unordered elements, automatically removing duplicates. This makes it useful for comparing data, listing the distinct categories in a column, and checking membership within a dataset. Once a DataFrame column is converted to a set, you can call methods such as union, intersection, and difference to compare it with other sets. Sets can be created with the set() function or by enclosing elements in curly braces, for example {'East', 'West'}. Empty braces {} create a dictionary, so an empty set must be written as set().
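To make the two ways of creating a set concrete, here is a minimal sketch. The region names are typed in by hand for illustration and are not read from the dataset.

```python
# Illustrative values only; not read from the dataset.
regions = ["Central", "West", "East", "Central", "South", "West"]

# set() accepts any iterable and drops duplicates.
unique_regions = set(regions)

# Curly braces build a set literal directly.
literal_regions = {"Central", "West", "East", "South"}

# Both routes give the same set; element order is not part of a set.
same = unique_regions == literal_regions

# Empty braces create a dict, so use set() for an empty set.
empty_braces = {}
empty_set = set()

print(len(unique_regions), same)                             # 4 True
print(type(empty_braces).__name__, type(empty_set).__name__)  # dict set
```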

Importing the Library and Dataset

import pandas as pd

df = pd.read_csv('C:/Users/SANKHYA/Documents/dataset.csv')
df.head()

	Order_ID	Order_Date	Customer_ID	Category	Sub_Category	Sales	Region
0	CA-2018-106103	10-06-18	SC-20305	Technology	Accessories	132.52	Central
1	CA-2018-102407	09-12-18	AT-10435	Office Supplies	Art	11.16	West
2	CA-2018-117947	18-08-18	NG-18355	Furniture	Furnishings	40.48	East
3	CA-2018-152485	04-09-18	JD-15790	Office Supplies	Art	13.12	Central
4	CA-2018-153339	03-11-18	DJ-13510	Furniture	Furnishings	15.99	South
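If you do not have the CSV file at hand, the idea of turning a column into a set can still be tried on the five rows shown above. The sketch below restates those rows as inline text and reads them with the standard csv module instead of pandas; that substitution is an assumption made only so the snippet runs on its own. With pandas, the equivalent would be set(df['Category']).

```python
import csv
import io

# The five rows shown by df.head(), restated as CSV text so the sketch
# runs without the original file.
sample_csv = """Order_ID,Order_Date,Customer_ID,Category,Sub_Category,Sales,Region
CA-2018-106103,10-06-18,SC-20305,Technology,Accessories,132.52,Central
CA-2018-102407,09-12-18,AT-10435,Office Supplies,Art,11.16,West
CA-2018-117947,18-08-18,NG-18355,Furniture,Furnishings,40.48,East
CA-2018-152485,04-09-18,JD-15790,Office Supplies,Art,13.12,Central
CA-2018-153339,03-11-18,DJ-13510,Furniture,Furnishings,15.99,South
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Collect one column into a set to see its distinct values.
categories = {row["Category"] for row in rows}
sub_categories = {row["Sub_Category"] for row in rows}

# sorted() is used only to get a stable display order.
print(sorted(categories))      # ['Furniture', 'Office Supplies', 'Technology']
print(sorted(sub_categories))  # ['Accessories', 'Art', 'Furnishings']
```

Five rows collapse to three categories and three sub-categories because the repeated values are stored only once.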

Let’s try a few set methods:

Using Union: Combines all unique elements from two sets.

# Define two sets
products_tech = set(df[df['Category']=='Technology']['Sub_Category'])
products_furniture = set(df[df['Category']=='Furniture']['Sub_Category'])

# Union of both sets
union_prod = products_tech.union(products_furniture)
print("Union of Sub-Categories (Technology and Furniture):", union_prod)

Union of Sub-Categories (Technology and Furniture): {'Furnishings', 'Machines', 'Chairs', 'Phones', 'Accessories', 'Copiers', 'Tables', 'Bookcases'}
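The same union can also be written with the | operator. The sketch below uses small hand-written sets as stand-ins for the ones built from the DataFrame, so the exact contents are illustrative rather than taken from the file.

```python
# Hand-written stand-ins for the two sets built from the DataFrame
# (contents are illustrative).
products_tech = {"Accessories", "Phones", "Machines", "Copiers"}
products_furniture = {"Furnishings", "Chairs", "Tables", "Bookcases"}

# The | operator is equivalent to .union() when both operands are sets.
union_operator = products_tech | products_furniture
union_method = products_tech.union(products_furniture)

# .union() also accepts any iterable, such as a list; | requires sets.
union_with_list = products_tech.union(["Phones", "Tables"])

print(union_operator == union_method)  # True
print(len(union_operator))             # 8
print(sorted(union_with_list))
```

Because sets are unordered, the order in which elements are printed can differ between runs. Wrapping the set in sorted() gives a stable order for display.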

Using Intersection: Finds elements common to both sets.

# Define two sets
cust_region1 = set(df[df['Region']=='East']['Customer_ID'])
cust_region2 = set(df[df['Region']=='South']['Customer_ID'])

# Intersection of customers from both regions
overlap_cust = cust_region1.intersection(cust_region2)
print("Customers in Both East and South Regions:", overlap_cust)

Customers in Both East and South Regions: {'MD-17860', 'AH-10075', 'CA-12265', 'RD-19810', 'BT-11680', 'AS-10090'}
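Intersection has an operator form as well, &, and the isdisjoint() method answers the related question of whether two sets share anything at all. The customer sets below are hand-written for illustration and are much smaller than the ones derived from the dataset.

```python
# Hand-written stand-ins for per-region customer sets (illustrative only).
cust_east = {"MD-17860", "AH-10075", "NG-18355"}
cust_south = {"MD-17860", "AH-10075", "DJ-13510"}
cust_west = {"AT-10435"}

# The & operator is equivalent to .intersection() for two sets.
overlap = cust_east & cust_south
print(sorted(overlap))                  # ['AH-10075', 'MD-17860']

# isdisjoint() returns True when the sets have no element in common.
print(cust_east.isdisjoint(cust_west))  # True

# Intersecting disjoint sets gives an empty set.
no_overlap = cust_east & cust_west
print(no_overlap)                       # set()
```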

Using Difference: Identifies elements present in one set but not in the other.

# Define two sets
products_office = set(df[df['Category']=='Office Supplies']['Sub_Category'])
products_tech = set(df[df['Category']=='Technology']['Sub_Category'])

# Difference: sub-categories in Office Supplies but not in Technology
difference_prod = products_office.difference(products_tech)
print("Sub-Categories in Office Supplies but not in Technology:", difference_prod)

Sub-Categories in Office Supplies but not in Technology: {'Appliances', 'Labels', 'Art', 'Storage', 'Fasteners', 'Envelopes', 'Supplies', 'Paper', 'Binders'}
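Unlike union and intersection, difference depends on the order of the operands. The sketch below shows this with the - operator and adds symmetric_difference for elements found in exactly one of the two sets. The sets sold_east and sold_south are hypothetical names and values chosen so that the two sets overlap.

```python
# Hypothetical sets of sub-categories sold in two regions (illustrative only).
sold_east = {"Art", "Paper", "Phones"}
sold_south = {"Paper", "Chairs"}

# The - operator is equivalent to .difference(); operand order matters.
only_east = sold_east - sold_south
only_south = sold_south - sold_east

# symmetric_difference (or ^) keeps elements found in exactly one set.
in_one_only = sold_east ^ sold_south

print(sorted(only_east))    # ['Art', 'Phones']
print(sorted(only_south))   # ['Chairs']
print(sorted(in_one_only))  # ['Art', 'Chairs', 'Phones']
```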

Conclusion:

Python sets are a compact way to manage collections of unique elements, which makes them well suited to comparing and filtering data. The union, intersection, and difference methods each answer a common question directly: what appears in either group, what appears in both, and what appears in one but not the other. Applied to DataFrame columns, these methods let you compare categories, customers, or regions in a single line of code.