TransWikia.com

Which algorithms should I use for identifying similar characteristics between data points (the intersections)?

Data Science Asked by Zigrivers on February 16, 2021

I am working with a dataset that has been coded and categorized, so that each datapoint has a set of coded characteristics. An example data point would be something like the following:

Example Data Point:

  • Quality
  • Service & Support
  • Price

Each data point can have multiple codes associated with it.

What I’m looking to do is identify the “intersections” between the data points so that I can answer questions like the following:

  • When a data point has “Quality” as a characteristic, 25% of the time it also has “Price” as a characteristic

I’ve been struggling with the right way to ask this question in my Google searching and realized I should just come to the experts on topics like this and get your help and guidance.

To do this type of work, what algorithms should I be investigating?

Thank you for your help!

One Answer

You could achieve this by creating and analyzing a confusion matrix of characteristics that appear together. Here is an example:

enter image description here

From this image, you can see (from the price-price section) that price appears 4 times. Then, you can also see that (from the price-quality section) that price and quality appear 3 times together. So, you can conclude that 75% of the time that price is a characteristic, quality is also a characteristic.

Other information you can extract:

  • price appears the most
  • price and quality is the most common pair
  • service_and_support is only 50% of the time paired with quality

Below is the code to generate this plot:

import seaborn as sn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

characteristics = {'quality':0, 'service_and_support':1, 'price':2, 'customer_relations':3}

data_points = [
    ['quality', 'service_and_support', 'price'],
    ['quality', 'price'],
    ['quality', 'customer_relations', 'price'],
    ['service_and_support', 'customer_relations', 'price'],
]

counts_matrix = np.zeros((len(characteristics), len(characteristics)))

for data_point in data_points:
    for characteristic1 in data_point:
        for characteristic2 in data_point:
            counts_matrix[characteristics[characteristic1]][characteristics[characteristic2]] += 1

keys = list(sorted(characteristics.keys(), key=lambda x: characteristics[x]))

df_cm = pd.DataFrame(counts_matrix, index = keys, columns = keys)
plt.figure(figsize = (10,7))
sn.heatmap(df_cm, annot=True)

plt.show()

Answered by Bruno Lubascher on February 16, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP