Data Science Asked by myradio on December 20, 2020
I have a data set with 5 variables,
a b c d e
1 0 0 1 0
0 1 0 1 1
0 1 1 0 0
0 0 0 1 0
1 1 1 0 0
0 1 1 0 1
1 0 1 0 0
1 0 0 1 1
0 1 0 1 1
0 0 1 1 0
I am only interested in the percentages of occurrence,
a  | b  | c  | d  | e
.4 | .5 | .5 | .6 | .4
But I would like to visualize the data in such a way that I can see the overlap, or lack of overlap, among all the different groups.
Any idea?
Since the possible combinations are known, we can treat each row as a binary number and use that to come up with a frequency plot.
Basically: convert each row's binary string to an integer and plot the frequency of each integer value.
import numpy as np
import pandas as pd
from itertools import product
import matplotlib.pyplot as plt
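# test data: one row for each of the 32 possible combinations
# (product() is wrapped in list() because np.array cannot consume a lazy iterator in Python 3)
combs = np.array(list(product([0, 1], repeat=5)))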
# store in dataframe
df = pd.DataFrame(data={'a': combs[:, 0], 'b': combs[:, 1], 'c': combs[:, 2], 'd': combs[:, 3], 'e': combs[:, 4]})
# concatenate the binary sequences to strings
df['concatenate'] = df[list('abcde')].astype(str).apply(''.join, axis=1)
# convert a binary string to an integer
def int2(x):
    return int(x, 2)
# every combination has a unique value
df['unique_values'] = df['concatenate'].apply(int2)
# prepare labels for the frequency plot
variables = list('abcde')
labels = []
for combination in df.concatenate:
    tmp = ''.join([variables[i] for i, x in enumerate(combination) if x != '0'])
    labels.append(tmp)
fig, ax = plt.subplots()
counts, bins, patches = ax.hist(df.unique_values, bins=32, rwidth=0.8)
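# turn off the default x-axis tick labels; custom labels are drawn below
plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # numeric labels along the bottom edge are off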
# calculate the bin centers
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
ax.set_xticks(bin_centers)
for label, x in zip(labels, bin_centers):
    # replace integer mapping with the labels
    ax.annotate(str(label), xy=(x, 0), xycoords=('data', 'axes fraction'),
                xytext=(0, -5), textcoords='offset points', va='top', ha='center', rotation=30)
plt.show()
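The code above uses synthetic test data. As a minimal sketch of the same idea applied to the ten rows from the question, the snippet below encodes each row as an integer and counts the occurrences. The names `data`, `weights`, `code_to_label` and `observed` are illustrative choices and not part of the original answer.

```python
import numpy as np

# The ten observations from the question; columns are a, b, c, d, e.
data = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
])
variables = list('abcde')

# Read each row as a 5-bit binary number, with 'a' as the most significant bit.
weights = 2 ** np.arange(len(variables) - 1, -1, -1)  # [16, 8, 4, 2, 1]
codes = data @ weights

# Count how often each of the 32 possible codes occurs.
counts = np.bincount(codes, minlength=2 ** len(variables))


def code_to_label(code):
    """Turn an integer code back into the names of the variables that are set."""
    bits = format(code, f'0{len(variables)}b')
    return ''.join(v for v, bit in zip(variables, bits) if bit == '1') or 'none'


# Keep only the combinations that actually occur in the data.
observed = {code_to_label(code): int(n) for code, n in enumerate(counts) if n > 0}
print(observed)
```

The `counts` array can be passed to a bar chart in place of the histogram, which avoids relying on the bin edges lining up with the integer codes.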
Answered by sai on December 20, 2020
If you have richer data (i.e. more than 10 rows), you will want an UpSet plot. UpSet plots show set overlaps in an intuitive way, much like a Venn diagram, but are more useful for 4+ categories.
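As a rough sketch of the numbers an UpSet plot displays, the snippet below computes two things from the question's data using only the Python standard library: the size of each exact combination (the intersection bars) and the size of each individual set. It does not draw the plot, and the names `rows`, `memberships`, `intersection_sizes` and `set_sizes` are illustrative assumptions.

```python
from collections import Counter

variables = 'abcde'

# The ten observations from the question; columns are a, b, c, d, e.
rows = [
    (1, 0, 0, 1, 0), (0, 1, 0, 1, 1),
    (0, 1, 1, 0, 0), (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0), (0, 1, 1, 0, 1),
    (1, 0, 1, 0, 0), (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1), (0, 0, 1, 1, 0),
]

# Exclusive intersections: the exact set of variables present in each row.
memberships = [
    ''.join(v for v, x in zip(variables, row) if x) or 'none'
    for row in rows
]
intersection_sizes = Counter(memberships)

# Set sizes: the number of rows in which each variable is present.
set_sizes = {v: sum(row[i] for row in rows) for i, v in enumerate(variables)}

# List the intersections from most to least frequent (one common bar ordering).
for combo, n in intersection_sizes.most_common():
    print(f'{combo:>5}  {n}')
print(set_sizes)
```

Dividing the set sizes by the number of rows gives the marginal percentages from the question, while the intersection sizes show which variables actually occur together.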
Some references which may give you some ideas and implementation in R:
Answered by Timothy Chan on December 20, 2020
With Wolfram Language you may use AbsoluteCorrelation.
With
t = {
{1, 0, 0, 1, 0}, {0, 1, 0, 1, 1},
{0, 1, 1, 0, 0}, {0, 0, 0, 1, 0},
{1, 1, 1, 0, 0}, {0, 1, 1, 0, 1},
{1, 0, 1, 0, 0}, {1, 0, 0, 1, 1},
{0, 1, 0, 1, 1}, {0, 0, 1, 1, 0}
}
Then
MatrixForm[ac = AbsoluteCorrelation[t]]
Where the diagonals are the marginal column frequencies and the off-diagonals the joint frequencies. That is, for ac[[1,1]] variable a occurs with frequency 0.4, and for ac[[1,2]] (row 1, column 2) variable a occurs jointly with variable b with frequency 0.1.
This can be visualised with MatrixPlot or ArrayPlot.
MatrixPlot[
ac
, FrameTicks -> {Transpose@{Range@5, CharacterRange["a", "e"]}}
, PlotLegends -> Automatic]
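For readers not using Wolfram Language, a matrix with the same interpretation can be sketched in Python with NumPy. This assumes the matrix described above is the mean of the element-wise products of each pair of columns, which for 0/1 data equals X^T X / n. The names `data` and `joint` are illustrative.

```python
import numpy as np

# The ten observations from the question; columns are a, b, c, d, e.
data = np.array([
    [1, 0, 0, 1, 0], [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0], [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0], [0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0], [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1], [0, 0, 1, 1, 0],
])

n_rows = data.shape[0]

# Entry (i, j) is the fraction of rows in which variables i and j are both 1.
# The diagonal therefore holds the marginal frequency of each variable.
joint = data.T @ data / n_rows

print(np.round(joint, 2))
```

The resulting 5x5 array can then be shown as a heatmap, for example with matplotlib's `imshow`, to get a picture comparable to the MatrixPlot output.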
Hope this helps.
Answered by Edmund on December 20, 2020