Data Science Asked by pnv on November 14, 2020
I am new to data science. I have a dataset of around 200,000 records with 5 columns. There is a field called class. For each class, there are one or more divisions. I have to do this:
1. Filter the dataset so that only those classes with at least 5 divisions remain.
2. For each division, calculate the attendance from another column.
3. There is a minimum attendance value for each class. Find the percentage of divisions in each class that meet the minimum attendance.
I started by importing the data into Python using pandas and began writing loops to process it, but I am sure this is not the right way to do it. Can you please give me some ideas? Can I do this in an Excel pivot table?
It is a bit hard to solve your problem without the data, but I gave it a go based on how I think the data would be encoded. I used R with data.table. You can read the data into a data.table with fread().
library(data.table)

# Assume sample_data is a data.table with the following columns:
#   class:      the class
#   division:   the division
#   num_people: the attendance for a match
#
# I assume the table is in long format, i.e. each class has multiple rows
# covering one or more divisions.

# Make the list of classes with at least 5 distinct divisions.
classes_of_interest <-
  sample_data[,
              .(num_divisions = uniqueN(division)),
              by = class][num_divisions >= 5, class]

# Only consider the classes that have at least 5 divisions, and
# total the attendance per division.
attendance_by_division <-
  sample_data[class %in% classes_of_interest,
              .(attendance = sum(num_people)),
              by = .(division, class)]
setkey(attendance_by_division, class)

# Merge the data set with a data set that contains the required
# attendance per class (attendance_requirements, assumed to be a data.table).
# The format is as follows:
#   class:              the class
#   minimum_attendance: the minimum attendance
attendance_data <-
  merge(attendance_requirements,
        attendance_by_division, by = "class")

# Here I exploit the fact that a TRUE/FALSE condition is converted
# to 1 and 0, so I can sum it and divide by the number of rows (.N)
# in each 'class' group. Multiplying by 100 gives a percentage.
# Note this must run on the merged attendance_data, not on sample_data.
pct_of_division <-
  attendance_data[,
                  .(pct_with_min_attendance =
                      100 * sum(attendance >= minimum_attendance) / .N),
                  by = class]
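Since the question mentions pandas, the same three steps can also be sketched there without explicit loops. This is only a rough sketch under assumptions: the column names (class, division, num_people, minimum_attendance), the toy data, and the helper name pct_divisions_meeting_minimum are all made up for illustration and will need adapting to the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions that mirror
# the R code above, not the asker's real schema.
sample_data = pd.DataFrame({
    "class":      ["A", "A", "A", "A", "A", "A", "B", "B"],
    "division":   ["d1", "d1", "d2", "d3", "d4", "d5", "d1", "d2"],
    "num_people": [6, 5, 12, 4, 10, 9, 20, 30],
})
attendance_requirements = pd.DataFrame({
    "class": ["A", "B"],
    "minimum_attendance": [10, 15],
})


def pct_divisions_meeting_minimum(records, requirements, min_divisions=5):
    """Percentage of divisions per class whose total attendance meets
    the class minimum, restricted to classes with enough divisions."""
    # Step 1: keep only classes with at least `min_divisions` distinct divisions.
    n_divisions = records.groupby("class")["division"].transform("nunique")
    kept = records[n_divisions >= min_divisions]

    # Step 2: total attendance per (class, division).
    per_division = (
        kept.groupby(["class", "division"], as_index=False)["num_people"]
        .sum()
        .rename(columns={"num_people": "attendance"})
    )

    # Step 3: attach each class's minimum, flag the divisions that meet it,
    # and average the boolean flag per class (True counts as 1, False as 0).
    merged = per_division.merge(requirements, on="class", how="left")
    merged["meets_minimum"] = merged["attendance"] >= merged["minimum_attendance"]
    return merged.groupby("class")["meets_minimum"].mean() * 100


result = pct_divisions_meeting_minimum(sample_data, attendance_requirements)
print(result)
```

In the toy data, three of class A's five divisions reach the minimum of 10, so its share is 60 percent, while class B is dropped because it has only two divisions. Whether "meets the minimum" should be >= or a strict > depends on how the requirement is defined in the real data.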
Hope this helps
Answered by Stereo on November 14, 2020