Exploring the set() Function in Python
The set() function creates a collection of unique, unordered elements, automatically removing duplicates. This makes it useful for comparing data and for listing the distinct categories in a dataset. Once a column has been converted to a set, methods such as union, intersection, and difference can be used to compare it with other columns or subsets. Sets can be created with the set() function or by enclosing elements in curly braces {}.
Importing Library & Dataset
import pandas as pd
df = pd.read_csv('C:/Users/SANKHYA/Documents/dataset.csv')
df.head()
| | Order_ID | Order_Date | Customer_ID | Category | Sub_Category | Sales | Region |
|---|---|---|---|---|---|---|---|
| 0 | CA-2018-106103 | 10-06-18 | SC-20305 | Technology | Accessories | 132.52 | Central |
| 1 | CA-2018-102407 | 09-12-18 | AT-10435 | Office Supplies | Art | 11.16 | West |
| 2 | CA-2018-117947 | 18-08-18 | NG-18355 | Furniture | Furnishings | 40.48 | East |
| 3 | CA-2018-152485 | 04-09-18 | JD-15790 | Office Supplies | Art | 13.12 | Central |
| 4 | CA-2018-153339 | 03-11-18 | DJ-13510 | Furniture | Furnishings | 15.99 | South |
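As a quick aside before working with the dataset, the two ways of creating a set mentioned in the introduction can be sketched on a small hand-made list. The values below are hypothetical and are not read from dataset.csv. One caveat worth knowing: empty braces {} create a dict, so an empty set has to be written as set().

```python
# Hypothetical sample values, not taken from the dataset used in this article
regions = ["East", "West", "East", "South", "West"]

unique_regions = set(regions)                # duplicates are dropped
literal_regions = {"East", "West", "South"}  # the same set written as a literal

print(unique_regions == literal_regions)  # True: element order does not matter
print(len(unique_regions))                # 3

empty_set = set()   # an empty set
empty_dict = {}     # empty braces create a dict, not a set
print(type(empty_set).__name__, type(empty_dict).__name__)  # set dict
```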
Let’s try a few set methods:
Using Union: Combines all unique elements from two sets.
# Define two sets
products_tech = set(df[df['Category']=='Technology']['Sub_Category'])
products_furniture = set(df[df['Category']=='Furniture']['Sub_Category'])
# Union of both sets
union_prod = products_tech.union(products_furniture)
print("Union of Sub-Categories (Technology and Furniture):", union_prod)
Union of Sub-Categories (Technology and Furniture): {'Furnishings', 'Machines', 'Chairs', 'Phones', 'Accessories', 'Copiers', 'Tables', 'Bookcases'}
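The same union can also be written with the | operator, and union() accepts more than one argument as well as non-set iterables such as lists. The sketch below uses small hypothetical sets rather than the dataset, so the names and values are assumptions made for illustration only.

```python
# Hypothetical sub-category names, chosen only for illustration
tech = {"Phones", "Accessories"}
furniture = {"Chairs", "Tables"}
office = ["Paper", "Binders", "Paper"]  # a list containing a duplicate

# The | operator is equivalent to union() when both operands are sets
print((tech | furniture) == tech.union(furniture))  # True

# union() also accepts several arguments and any iterable, not only sets
all_subcats = tech.union(furniture, office)
print(sorted(all_subcats))
# ['Accessories', 'Binders', 'Chairs', 'Paper', 'Phones', 'Tables']
```

Sorting the result before printing gives a stable display order, since a set itself has no defined order.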
Using Intersection: Finds elements common to both sets.
# Define two sets
cust_region1 = set(df[df['Region']=='East']['Customer_ID'])
cust_region2 = set(df[df['Region']=='South']['Customer_ID'])
# Intersection of customers from both regions
overlap_cust = cust_region1.intersection(cust_region2)
print("Customers in Both East and South Regions:", overlap_cust)
Customers in Both East and South Regions: {'MD-17860', 'AH-10075', 'CA-12265', 'RD-19810', 'BT-11680', 'AS-10090'}
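For two sets, the & operator gives the same result as intersection(), and isdisjoint() is a quick way to check whether there is any overlap at all. The following is a minimal sketch with made-up customer IDs, not the ones in the dataset.

```python
# Hypothetical customer IDs, not taken from the dataset
east = {"C-001", "C-002", "C-003"}
south = {"C-003", "C-004"}
west = {"C-005"}

# The & operator is equivalent to intersection() for two sets
both = east & south
print(both)       # {'C-003'}
print(len(both))  # 1

# isdisjoint() reports whether two sets share no elements at all
print(east.isdisjoint(west))   # True
print(east.isdisjoint(south))  # False
```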
Using Difference: Identifies elements present in one set but not in the other.
# Define two sets
products_office = set(df[df['Category']=='Office Supplies']['Sub_Category'])
products_tech = set(df[df['Category']=='Technology']['Sub_Category'])
# Difference of sub-categories: sub-categories in Office Supplies but not in Technology
difference_prod = products_office.difference(products_tech)
print("Sub-Categories in Office Supplies but not in Technology:", difference_prod)
Sub-Categories in Office Supplies but not in Technology: {'Appliances', 'Labels', 'Art', 'Storage', 'Fasteners', 'Envelopes', 'Supplies', 'Paper', 'Binders'}
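Unlike union and intersection, difference depends on the order of the operands: a.difference(b) and b.difference(a) generally return different sets. The sketch below illustrates this with two small hypothetical sets, and also shows the - operator and symmetric_difference(), which keeps the elements found in exactly one of the two sets.

```python
# Hypothetical sub-category names, chosen so that the two sets overlap
a = {"Art", "Paper", "Accessories"}
b = {"Accessories", "Phones"}

# The - operator is equivalent to difference() for two sets
print(sorted(a - b))  # ['Art', 'Paper']
print(sorted(b - a))  # ['Phones']

# symmetric_difference() keeps elements that appear in exactly one set
either_only = a.symmetric_difference(b)
print(sorted(either_only))  # ['Art', 'Paper', 'Phones']
```

When the two sets have nothing in common, the difference is simply the first set unchanged.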
Conclusion:
Python's set type is a convenient tool for managing collections of unique elements, which makes it well suited to comparing and filtering data. The union, intersection, and difference methods make it straightforward to merge and compare columns or subsets of a dataset, as the sub-category and customer examples above show. Adding set operations to your data analysis toolkit gives you a concise way to answer this kind of question.