TransWikia.com

How can I efficiently represent statistics and visualizations for a large number of features?

Data Science Asked by HS-nebula on March 1, 2021

Suppose I have a dataset with more than 50 features (it could be 100, 10,000, or more). I’d like to create a summary table with some statistics about each feature and plot the distributions. Normally, I’d use df.describe() and add various stats to the resulting dataframe, but that dataframe can get very large, which makes it harder to parse and to understand the features, and it is tough to fit on a single page of a Word document.
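For concreteness, the df.describe() approach mentioned above could look roughly like the sketch below. The function name summarize_features, the choice of extra statistics (missing count, unique count, skew), and the synthetic demo data are all assumptions made for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per numeric feature (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    # describe() puts features in columns; transpose so each feature is a row.
    summary = numeric.describe().T
    # Extra per-feature statistics, chosen here only as examples.
    summary["missing"] = numeric.isna().sum()
    summary["n_unique"] = numeric.nunique()
    summary["skew"] = numeric.skew()
    return summary


# Synthetic stand-in for a wide dataset: 200 rows, 60 features.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(200, 60)),
    columns=[f"f{i}" for i in range(60)],
)
demo.iloc[0, 0] = np.nan  # one missing value in feature f0

table = summarize_features(demo)
print(table.head())
```

With one row per feature, the table grows downward rather than sideways, but it still runs to one row for every feature, which is the size problem described here.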

My typical approach to plotting the features would be a Seaborn clustermap or pairplot, but with a larger number of features the plot gets too large to show relationships clearly. Using a for loop to plot histograms and bar plots would work, but it is messy and inefficient for datasets with more than about 100 features (my rough estimate and preference).
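As a point of reference, the for-loop approach described above might be sketched as follows, with histograms grouped into fixed-size pages of subplots. The function name plot_histogram_pages, the page size, the grid width, and the synthetic data are assumptions for illustration only; bar plots for categorical features are left out.

```python
import math

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histogram_pages(df, per_page=20, ncols=5, bins=30):
    """Draw one histogram per numeric feature, per_page features per figure."""
    numeric = df.select_dtypes(include="number")
    cols = list(numeric.columns)
    figures = []
    for start in range(0, len(cols), per_page):
        chunk = cols[start:start + per_page]
        nrows = math.ceil(len(chunk) / ncols)
        fig, axes = plt.subplots(
            nrows, ncols, figsize=(3 * ncols, 2.2 * nrows), squeeze=False
        )
        flat = axes.ravel()
        for ax, name in zip(flat, chunk):
            ax.hist(numeric[name].dropna(), bins=bins)
            ax.set_title(name, fontsize=8)
        # Hide unused cells on a partially filled last page.
        for ax in flat[len(chunk):]:
            ax.set_visible(False)
        fig.tight_layout()
        figures.append(fig)
    return figures


# Synthetic stand-in: 47 features gives pages of 20, 20, and 7 plots.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(300, 47)),
    columns=[f"f{i}" for i in range(47)],
)
pages = plot_histogram_pages(demo)
```

Each figure can then be saved with fig.savefig(...) and placed on its own page, though with thousands of features this still produces hundreds of pages.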

How can I efficiently represent these stats and plots for a large number of features?

If there is a better method in R, that would be helpful too. I know the tableone package is really good (and there’s a Python version), but again, its output is hard to fit on a single page of a Word document. If my question is too subjective, I can try to rephrase it.
