Which algorithms should I use for identifying similar characteristics between data points (the intersections)?

Question

I am working with a dataset that has been coded and categorized, so that each datapoint has a set of coded characteristics. An example data point would be something like the following:

Example Data Point:

Quality
Service & Support
Price

Each data point can have multiple codes associated with it.

What I'm looking to do is identify the "intersections" between the data points so that I can answer questions like the following:

When a data point has "Quality" as a characteristic, 25% of the time it also has "Price" as a characteristic

I've been struggling with the right way to ask this question in my Google searching and realized I should just come to the experts on topics like this and get your help and guidance.

To do this type of work, what algorithms should I be investigating?

Thank you for your help!

Bruno Lubascher · Answer

You could achieve this by creating and analyzing a confusion matrix of characteristics that appear together. Here is an example:

From this image, you can see (from the price-price section) that price appears 4 times. Then, you can also see that (from the price-quality section) that price and quality appear 3 times together. So, you can conclude that 75% of the time that price is a characteristic, quality is also a characteristic.

Other information you can extract:

price appears the most
price and quality is the most common pair
service_and_support is only 50% of the time paired with quality

Below is the code to generate this plot:

import seaborn as sn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

characteristics = {'quality':0, 'service_and_support':1, 'price':2, 'customer_relations':3}

data_points = [
    ['quality', 'service_and_support', 'price'],
    ['quality', 'price'],
    ['quality', 'customer_relations', 'price'],
    ['service_and_support', 'customer_relations', 'price'],
]

counts_matrix = np.zeros((len(characteristics), len(characteristics)))

for data_point in data_points:
    for characteristic1 in data_point:
        for characteristic2 in data_point:
            counts_matrix[characteristics[characteristic1]][characteristics[characteristic2]] += 1

keys = list(sorted(characteristics.keys(), key=lambda x: characteristics[x]))

df_cm = pd.DataFrame(counts_matrix, index = keys, columns = keys)
plt.figure(figsize = (10,7))
sn.heatmap(df_cm, annot=True)

plt.show()

Which algorithms should I use for identifying similar characteristics between data points (the intersections)?

One Answer

Add your own answers!

Ask a Question