Data Science Asked by Zigrivers on February 16, 2021
I am working with a dataset that has been coded and categorized, so that each datapoint has a set of coded characteristics. An example data point would be something like the following:
Example Data Point:
Each data point can have multiple codes associated with it.
What I’m looking to do is identify the “intersections” between the data points so that I can answer questions like the following:
I’ve been struggling with the right way to ask this question in my Google searching and realized I should just come to the experts on topics like this and get your help and guidance.
To do this type of work, what algorithms should I be investigating?
Thank you for your help!
You could achieve this by creating and analyzing a confusion matrix of characteristics that appear together. Here is an example:
From this image, you can see (from the price-price
section) that price
appears 4 times. Then, you can also see that (from the price-quality
section) that price
and quality
appear 3 times together. So, you can conclude that 75% of the time that price is a characteristic, quality is also a characteristic
.
Other information you can extract:
price
appears the mostprice
and quality
is the most common pairservice_and_support
is only 50% of the time paired with quality
Below is the code to generate this plot:
import seaborn as sn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
characteristics = {'quality':0, 'service_and_support':1, 'price':2, 'customer_relations':3}
data_points = [
['quality', 'service_and_support', 'price'],
['quality', 'price'],
['quality', 'customer_relations', 'price'],
['service_and_support', 'customer_relations', 'price'],
]
counts_matrix = np.zeros((len(characteristics), len(characteristics)))
for data_point in data_points:
for characteristic1 in data_point:
for characteristic2 in data_point:
counts_matrix[characteristics[characteristic1]][characteristics[characteristic2]] += 1
keys = list(sorted(characteristics.keys(), key=lambda x: characteristics[x]))
df_cm = pd.DataFrame(counts_matrix, index = keys, columns = keys)
plt.figure(figsize = (10,7))
sn.heatmap(df_cm, annot=True)
plt.show()
Answered by Bruno Lubascher on February 16, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP