TransWikia.com

Visualize frequency of 5 Boolean variables together

Data Science Asked by myradio on December 20, 2020

I have a data set with 5 variables,

a b c d e
1 0 0 1 0
0 1 0 1 1
0 1 1 0 0
0 0 0 1 0
1 1 1 0 0
0 1 1 0 1
1 0 1 0 0
1 0 0 1 1
0 1 0 1 1
0 0 1 1 0

I am only interested in the percentages of occurrence,

| a  | b  | c  | d  | e  |
| .4 | .5 | .5 | .6 | .4 |

But I would like to visualize it in such a way that I can see the overlap, or lack of it, among all the different groups.

Any idea?

3 Answers

Since the set of possible combinations is known, we can treat each row as a 5-bit binary number and use that to come up with a frequency plot.

Basically: convert each row's binary string to an integer and build a frequency plot from the integer values.

import numpy as np
import pandas as pd
from itertools import product
import matplotlib.pyplot as plt

# test data, 1 of every 32 combinations
combs = np.array(list(product([0, 1], repeat=5)))
# store in dataframe
df = pd.DataFrame(data={'a': combs[:, 0], 'b': combs[:, 1], 'c': combs[:, 2], 'd': combs[:, 3], 'e': combs[:, 4]})
# concatenate the binary sequences to strings
df['concatenate'] = df[list('abcde')].astype(str).apply(''.join, axis=1)

# to convert binary strings to integers
def int2(x):
    return int(x, 2)

# every combination has a unique value
df['unique_values'] = df['concatenate'].apply(int2)

# prepare labels for the frequency plot
variables = list('abcde')
labels = []
for combination in df.concatenate:
    tmp = ''.join([variables[i] for i, x in enumerate(combination) if x != '0'])
    labels.append(tmp)

fig, ax = plt.subplots()
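# one bin per integer 0..31, centred on the integer, so labels line up even if some combinations are missing
counts, bins, patches = ax.hist(df.unique_values, bins=np.arange(33) - 0.5, rwidth=0.8)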

# turn off the default x-axis tick labels (they are replaced by annotations below)
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    top=False,         # ticks along the top edge are off
    labelbottom=False)

# calculate the bin centers
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
ax.set_xticks(bin_centers)
for label, x in zip(labels, bin_centers):
    # replace integer mapping with the labels
    ax.annotate(str(label), xy=(x, 0), xycoords=('data', 'axes fraction'),
        xytext=(0, -5), textcoords='offset points', va='top', ha='center', rotation=30)

plt.show()
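The code above runs on test data containing every combination once. As a rough sketch of the same binary-to-integer idea applied to the ten rows from the question, the following counts how often each combination actually occurs. The names `weights`, `codes`, `code_to_label` and `observed` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# The ten rows from the question
rows = [
    [1, 0, 0, 1, 0], [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0], [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0], [0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0], [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1], [0, 0, 1, 1, 0],
]
variables = list("abcde")
df = pd.DataFrame(rows, columns=variables)

# Bit weights with column a as the most significant bit: [16, 8, 4, 2, 1]
weights = 2 ** np.arange(len(variables) - 1, -1, -1)

# One integer code per row, e.g. 0 1 0 1 1 -> 11
codes = df.to_numpy() @ weights

# Occurrences of each of the 32 possible codes (0 for unseen combinations)
counts = np.bincount(codes, minlength=2 ** len(variables))


def code_to_label(code):
    """Hypothetical helper: name the variables that are 1 in a code."""
    bits = format(code, "05b")
    return "".join(v for v, bit in zip(variables, bits) if bit == "1")


# Keep only the combinations that actually occur in the data
observed = {code_to_label(c): int(n) for c, n in enumerate(counts) if n > 0}
print(observed)
```

The `counts` array has one entry per possible combination, so it can be passed to a bar chart directly in place of the histogram call.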


Answered by sai on December 20, 2020

If you have richer data (i.e. more than 10 rows), you will want an UpSet plot. UpSet plots show set overlaps as intuitively as a Venn diagram, but stay readable with 4 or more categories.
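An UpSet plot is essentially built from two tables: the size of each individual set, and the size of each observed combination of sets. As a minimal sketch in pandas (drawing nothing, and using the question's data rather than anything from this answer), those two tables could be computed like this; `set_sizes` and `intersections` are names chosen here for illustration.

```python
import pandas as pd

# The ten rows from the question
rows = [
    [1, 0, 0, 1, 0], [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0], [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0], [0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0], [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1], [0, 0, 1, 1, 0],
]
df = pd.DataFrame(rows, columns=list("abcde"))

# Size of each individual set (the side bars of an UpSet plot)
set_sizes = df.sum()

# Size of each observed combination (the main bars), largest first;
# the index holds the 0/1 membership pattern for a, b, c, d, e
intersections = (
    df.groupby(list("abcde")).size().sort_values(ascending=False)
)

print(set_sizes)
print(intersections)
```

Dividing either table by `len(df)` gives proportions instead of counts, which matches the percentages asked for in the question.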

Some references which may give you some ideas and implementation in R:

Answered by Timothy Chan on December 20, 2020

With Wolfram Language you may use AbsoluteCorrelation.

With

t = {
     {1, 0, 0, 1, 0}, {0, 1, 0, 1, 1}, 
     {0, 1, 1, 0, 0}, {0, 0, 0, 1, 0}, 
     {1, 1, 1, 0, 0}, {0, 1, 1, 0, 1}, 
     {1, 0, 1, 0, 0}, {1, 0, 0, 1, 1}, 
     {0, 1, 0, 1, 1}, {0, 0, 1, 1, 0}
    }

Then

MatrixForm[ac = AbsoluteCorrelation[t]] 


Here the diagonal entries are the marginal column frequencies and the off-diagonal entries are the joint frequencies. For example, ac[[1,1]] shows that variable a occurs with frequency 0.4, and ac[[1,2]] (row 1, column 2) shows that variable a occurs jointly with variable b with frequency 0.1.
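For readers without Wolfram Language, a matrix with the same interpretation (marginal frequencies on the diagonal, pairwise joint frequencies elsewhere) can be sketched in NumPy. This assumes the quantity of interest is simply the mean of the element-wise products of each pair of columns; it is an illustration of that idea, not a reimplementation of AbsoluteCorrelation.

```python
import numpy as np

# The ten rows from the question, columns a, b, c, d, e
X = np.array([
    [1, 0, 0, 1, 0], [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0], [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0], [0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0], [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1], [0, 0, 1, 1, 0],
])
n = X.shape[0]

# Entry (i, j) is the fraction of rows where columns i and j are both 1;
# on the diagonal this reduces to the fraction of rows where column i is 1
ac = X.T @ X / n

print(np.round(ac, 2))
```

Note that NumPy indexes from 0, so `ac[0, 0]` and `ac[0, 1]` correspond to ac[[1,1]] and ac[[1,2]] above. The resulting array can be shown as a heatmap, for instance with matplotlib's `imshow`.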

This can be visualised with MatrixPlot or ArrayPlot.

MatrixPlot[
 ac 
 , FrameTicks -> {Transpose@{Range@5, CharacterRange["a", "e"]}}
 , PlotLegends -> Automatic]


Hope this helps.

Answered by Edmund on December 20, 2020
