Finding aggregated information of data

Question

I am new to data science. I have a dataset of around 200,000 records with 5 columns. One of the fields is called class. Each class has one or more divisions. I have to do the following:
1. Filter the dataset so that only classes with at least 5 divisions remain.
2. For each division, calculate the attendance from another column.
3. There is a minimum attendance value for each class. Find the percentage of divisions in each class that reach the minimum attendance.

I started by importing the data into Python using pandas and began writing loops to process it, but I am sure this is not the right way to do it. Can you please give me some ideas? Can I do this with an Excel pivot table?

Stereo · Answer

It is a bit hard to solve your problem without the data, but I gave it a go based on how I think the data is encoded. I used R with data.table. You can read your file into a data.table with fread().

Step 1

 library(data.table)

 # Assume sample_data has the following format:
 #   class: the class
 #   division: the division
 #   num_people: the attendance for a match
 #
 # I assume the table is in long format, i.e. there are multiple rows per
 # class, covering one or more divisions.

 # Make the list of classes with at least 5 distinct divisions.
 classes_of_interest <- 
   sample_data[, 
               .(num_divisions = uniqueN(division)),
               by = class][num_divisions >= 5, class]

Step 2

 # Only consider the classes that have at least 5 divisions.
 attendance_by_division <- 
   sample_data[class %in% classes_of_interest, 
               .(attendance = sum(num_people)),
               by = .(division, class)]
 setkey(attendance_by_division, class)

Step 3

 # Merge the data set with a data set that contains the required
 # attendance per class.
 # The format is as follows:
 #   class: the class
 #   minimum_attendance: the minimum attendance
 attendance_data <- 
   merge(attendance_requirements, 
         attendance_by_division, by = "class")

 # Here I exploit the fact that TRUE/FALSE is converted to 1/0, so the mean
 # of the condition within each class is the fraction of divisions that
 # reach the minimum attendance. Multiply by 100 for a percentage.
 pct_of_division <- 
   attendance_data[, 
                   .(pct_with_min_attendance = mean(attendance >= minimum_attendance)),
                   by = class]
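
Since you started in pandas, the same three steps could be sketched there roughly as follows. This is only a sketch under my assumptions about the layout: the column names (class, division, num_people, minimum_attendance) and the toy data are made up for illustration, so adapt them to your file.

```python
import pandas as pd

# Toy data, made up for illustration: one row per attendance record.
sample_data = pd.DataFrame({
    "class": ["A"] * 6 + ["B"] * 2,
    "division": ["d1", "d2", "d3", "d4", "d5", "d5", "d1", "d2"],
    "num_people": [10, 20, 30, 40, 25, 25, 5, 5],
})
# Hypothetical table with the minimum attendance per class.
attendance_requirements = pd.DataFrame({
    "class": ["A", "B"],
    "minimum_attendance": [30, 10],
})

# Step 1: keep only classes with at least 5 distinct divisions.
n_divisions = sample_data.groupby("class")["division"].transform("nunique")
filtered = sample_data[n_divisions >= 5]

# Step 2: total attendance per (class, division).
attendance_by_division = (
    filtered.groupby(["class", "division"], as_index=False)["num_people"]
    .sum()
    .rename(columns={"num_people": "attendance"})
)

# Step 3: attach each class's minimum, then take the share of divisions
# that reach it. The mean of a boolean column is the fraction of True values.
attendance_data = attendance_by_division.merge(attendance_requirements, on="class")
attendance_data["meets_minimum"] = (
    attendance_data["attendance"] >= attendance_data["minimum_attendance"]
)
pct_of_division = attendance_data.groupby("class")["meets_minimum"].mean().mul(100)
print(pct_of_division)
```

On this toy data, class B is dropped in step 1 because it has only two divisions, and three of the five divisions in class A reach the minimum of 30, so class A comes out at 60%. No explicit loops are needed; groupby does the per-class work.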

Hope this helps
