TransWikia.com

List value in Pandas DataFrame column makes analysis harder

Data Science Asked by Leonardo Ramos on May 4, 2021

Should I move to a database?

I have a list of course data in JSON format that looks like this:

courses = [
  {
    "course_id": "c_01",
    "teachers": ["t_01", "t_02"]
  },
  {
    "course_id": "c_02",
    "teachers": ["t_02", "t_03"]
  }
]

And a list of teachers that looks like this:

teachers = [
  {
    "teacher_id": "t_01",
    "teacher_fullname": "teacher_01"
  },
  {
    "teacher_id": "t_02",
    "teacher_fullname": "teacher_02"
  },
  {
    "teacher_id": "t_03",
    "teacher_fullname": "teacher_03"
  }
]

I also have other data like courseworks and submissions that I want to cross with these to create summary analytics like:

  • Which are the courses of a teacher?
  • Who is the teacher with the most courses?
  • What is the average number of courseworks per course per teacher?
  • etc.

I loaded each list (courses, teachers, courseworks, submissions) into DataFrames, but I’m having a hard time selecting the courses of a teacher using vectorized methods.

Programming attempts

  1. Tried DataFrame.query, but could not use array methods inside the query string.
  2. Tried Series.isin, but it is useless here because it can’t hash the lists inside the teachers Series.
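For reference, a row-wise membership test does work on the list column, although it is not vectorized. This is only a minimal sketch using the sample data from the question; the variable names `has_t02` and `courses_of_t02` are made up for illustration.

```python
import pandas as pd

courses = pd.DataFrame(
    [
        {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
        {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
    ]
)

# Row-wise check: runs a Python-level `in` test on each list (not vectorized).
has_t02 = courses["teachers"].apply(lambda ids: "t_02" in ids)

# Courses taught by teacher t_02.
courses_of_t02 = courses.loc[has_t02, "course_id"].tolist()
```

This is fine for small data, but it runs Python code once per row, which is presumably what the vectorized attempts were meant to avoid.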

Data manipulation attempts

Then I tried to flatten the teachers Series as follows:

courses = [
  {
    "course_id": "c_01",
    "teacher_id": "t_01"
  },
  {
    "course_id": "c_01",
    "teacher_id": "t_02"
  },
  {
    "course_id": "c_02",
    "teacher_id": "t_02"
  },
  {
    "course_id": "c_02",
    "teacher_id": "t_03"
  }
]

This works because it allows merges and all kinds of queries, but because of combinatorial explosion I ended up with thousands of extra rows, and I have no certainty that the aggregated numbers are correct.
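A rough sketch of this flattening with `DataFrame.explode`, plus one way to keep the row count under control when crossing with other tables: aggregate the other table down to one row per course before merging. The `courseworks` table and its column names are assumptions made up for illustration, since the question does not show that data.

```python
import pandas as pd

courses = pd.DataFrame(
    [
        {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
        {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
    ]
)

# Hypothetical courseworks table; column names are assumptions.
courseworks = pd.DataFrame(
    {
        "coursework_id": ["w_01", "w_02", "w_03"],
        "course_id": ["c_01", "c_01", "c_02"],
    }
)

# One row per (course, teacher) pair.
course_teacher = (
    courses.explode("teachers")
    .rename(columns={"teachers": "teacher_id"})
    .reset_index(drop=True)
)

# Sanity check: the long table has exactly one row per list element.
assert len(course_teacher) == courses["teachers"].str.len().sum()

# Aggregate courseworks to one row per course BEFORE merging,
# so the merge cannot multiply rows.
coursework_counts = (
    courseworks.groupby("course_id").size().rename("n_courseworks").reset_index()
)
per_pair = course_teacher.merge(
    coursework_counts, on="course_id", how="left", validate="many_to_one"
)
assert len(per_pair) == len(course_teacher)

# Average number of courseworks per course, for each teacher.
avg_courseworks_per_teacher = per_pair.groupby("teacher_id")["n_courseworks"].mean()
```

The `validate="many_to_one"` argument makes pandas raise an error if the right-hand table has duplicate keys, which is one way to gain some certainty that a merge did not silently inflate the aggregates.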

My last approach was to add one column per teacher in each course like this:

courses = [
  {
    "course_id": "c_01",
    "teacher_01": 1,
    "teacher_02": 1,
    "teacher_03": 0
  },
  {
    "course_id": "c_02",
    "teacher_01": 0,
    "teacher_02": 1,
    "teacher_03": 1
  }
]

It actually works perfectly, and the aggregations become very easy to do. One concern is that I may end up with thousands of columns (I’m not sure if that’s a problem). The other is that each time I run an analysis on a batch of data I’ll end up with different columns, so it will be harder to do cross-analysis between different datasets.
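One possible way around the changing-columns concern is to keep the long (course, teacher) table as the stored form and derive the indicator layout only when it is needed. The sketch below uses `pd.crosstab` for that and keys the columns by teacher id rather than by name; it assumes the long table has already been built, and the variable names are illustrative.

```python
import pandas as pd

# Long table: one row per (course, teacher) pair.
course_teacher = pd.DataFrame(
    {
        "course_id": ["c_01", "c_01", "c_02", "c_02"],
        "teacher_id": ["t_01", "t_02", "t_02", "t_03"],
    }
)

# Wide indicator table: rows are courses, columns are teachers,
# each cell counts how often the pair occurs (0 or 1 here).
indicators = pd.crosstab(course_teacher["course_id"], course_teacher["teacher_id"])

# Column sums give the number of courses per teacher.
courses_per_teacher = indicators.sum(axis=0)
```

Because the wide table is rebuilt from the long one on demand, different batches can still be concatenated in long form and compared.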

Anyway, my final thought is that I should store everything in a database and query the information I need already "joined", for an easier and cleaner analysis process.

So… should I move to a database? Am I missing something?

One Answer

Pandas unfortunately does not allow for the type of conditional join you wish to do, without copying a lot of unnecessary data before processing.

Your best option in pandas is to explode the teachers column, as you did, and broadcast-join the teachers. If you run into memory issues and still want to stay with pandas, you will have to split the courses DataFrame and run the join in multiple processes.
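As a minimal sketch of the explode-and-join approach on the sample data from the question: the "broadcast join" is read here as an ordinary `DataFrame.merge` against the small teachers table, which is an assumption about what the answer intends. Variable names such as `joined` and `top_teacher` are illustrative.

```python
import pandas as pd

courses = pd.DataFrame(
    [
        {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
        {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
    ]
)
teachers = pd.DataFrame(
    [
        {"teacher_id": "t_01", "teacher_fullname": "teacher_01"},
        {"teacher_id": "t_02", "teacher_fullname": "teacher_02"},
        {"teacher_id": "t_03", "teacher_fullname": "teacher_03"},
    ]
)

# Explode the list column into one row per (course, teacher) pair.
course_teacher = courses.explode("teachers").rename(columns={"teachers": "teacher_id"})

# Join the teacher attributes; validate guards against duplicate teacher ids.
joined = course_teacher.merge(
    teachers, on="teacher_id", how="left", validate="many_to_one"
)

# Which are the courses of a teacher?
courses_of_t02 = joined.loc[joined["teacher_id"] == "t_02", "course_id"].tolist()

# Who is the teacher with the most courses?
course_counts = joined.groupby("teacher_fullname")["course_id"].nunique()
top_teacher = course_counts.idxmax()
```

With this sample data, `courses_of_t02` is `["c_01", "c_02"]` and `top_teacher` is `"teacher_02"`.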

Correct answer by Benoit Descamps on May 4, 2021
