List value in Pandas DataFrame column makes analysis harder

Question

Should I move to a database?
I have a list of course data in JSON format that looks like this:
courses = [
  {
    "course_id": "c_01",
    "teachers": ["t_01", "t_02"]
  },
  {
    "course_id": "c_02",
    "teachers": ["t_02", "t_03"]
  }
]

And a list of teachers that looks like this:
teachers = [
  {
    "teacher_id": "t_01",
    "teacher_fullname": "teacher_01"
  },
  {
    "teacher_id": "t_02",
    "teacher_fullname": "teacher_02"
  },
  {
    "teacher_id": "t_03",
    "teacher_fullname": "teacher_03"
  }
]

I also have other data, like courseworks and submissions, that I want to cross-reference with these to create summary analytics such as:

Which courses does a teacher have?
Who is the teacher with the most courses?
What is the average number of courseworks per course per teacher?
etc...

I loaded each list (courses, teachers, courseworks, submissions) into DataFrames, but I'm having a hard time selecting the courses of a teacher using vectorized methods.

Programming attempts

I tried DataFrame.query, but could not use array methods inside the query.
I tried Series.isin, but it is useless here because it cannot hash the lists inside the teachers Series.
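
For reference, a minimal sketch of a membership check that does work on a list column. It assumes the courses list was loaded into a DataFrame with a teachers column holding Python lists; the names teaches_t01 and courses_of_t01 are made up for illustration. It uses Series.apply, so it is a per-row check rather than a truly vectorized one.

```python
import pandas as pd

courses = pd.DataFrame(
    [
        {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
        {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
    ]
)

# Each cell of "teachers" is a list, so test membership row by row.
# This builds a boolean mask: True where the course is taught by t_01.
teaches_t01 = courses["teachers"].apply(lambda ids: "t_01" in ids)

# Use the mask to select the matching course ids.
courses_of_t01 = courses.loc[teaches_t01, "course_id"].tolist()
print(courses_of_t01)  # ['c_01']
```

This is fine for occasional lookups, but every query rescans all the lists, which is one reason a long (one row per course and teacher) layout tends to be easier to work with.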

Data manipulation attempts

Then I tried to flatten the teachers Series as follows:
courses = [
  {
    "course_id": "c_01",
    "teacher_id": "t_01"
  },
  {
    "course_id": "c_01",
    "teacher_id": "t_02"
  },
  {
    "course_id": "c_02",
    "teacher_id": "t_02"
  },
  {
    "course_id": "c_02",
    "teacher_id": "t_03"
  }
]

This works because it allows merges and all kinds of queries, but because of combinatorial explosion I ended up with thousands of extra rows and no certainty that the aggregated numbers are correct.
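
A minimal sketch of this flattening with DataFrame.explode, assuming the same two sample courses; the names course_teacher and courses_per_teacher are illustrative. Flattening alone produces one row per teacher assignment, so the extra rows most likely come from merging further one-to-many tables (courseworks, submissions) on top of it. Counting distinct ids with nunique, rather than counting rows, is one way to keep the aggregates trustworthy in that situation.

```python
import pandas as pd

courses = pd.DataFrame(
    [
        {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
        {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
    ]
)

# One row per (course, teacher) pair.
course_teacher = (
    courses.explode("teachers")
    .rename(columns={"teachers": "teacher_id"})
    .reset_index(drop=True)
)

# Count distinct courses per teacher, so that duplicated rows
# (for example after a later merge) cannot inflate the result.
courses_per_teacher = course_teacher.groupby("teacher_id")["course_id"].nunique()

# Teacher with the most courses.
print(courses_per_teacher.idxmax())  # t_02
```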
My last approach was to add one column per teacher in each course like this:
courses = [
  {
    "course_id": "c_01",
    "teacher_01": 1,
    "teacher_02": 1,
    "teacher_03": 0
  },
  {
    "course_id": "c_02",
    "teacher_01": 0,
    "teacher_02": 1,
    "teacher_03": 1
  }
]

It actually works perfectly and the aggregations become very easy to do. One concern is that I may end up with thousands of columns (I'm not sure if that's a problem), and the other is that each time I run an analysis on a batch of data I'll end up with different columns, so it will be harder to do cross analysis between different datasets.
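
A minimal sketch of building such an indicator table on demand with pd.crosstab instead of storing it, assuming the flattened course and teacher pairs are already in a DataFrame (called course_teacher here for illustration). Because the long table stays the stored format, the set of columns changing from batch to batch matters less.

```python
import pandas as pd

# Long table: one row per (course, teacher) pair.
course_teacher = pd.DataFrame(
    {
        "course_id": ["c_01", "c_01", "c_02", "c_02"],
        "teacher_id": ["t_01", "t_02", "t_02", "t_03"],
    }
)

# Rows are courses, columns are teachers, cells count the pairs (0 or 1 here).
indicator = pd.crosstab(course_teacher["course_id"], course_teacher["teacher_id"])
print(indicator)

# Column sums give the number of courses per teacher.
print(indicator.sum())
```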
Anyway, my final thought is that I should store everything in a database and query the information I need already "joined", which would make the analysis easier and cleaner.
So, should I move to a database? Am I missing something?

Benoit Descamps · Accepted Answer

Unfortunately, pandas does not allow the type of conditional join you want without copying a lot of unnecessary data before processing.
Your best solution in pandas is to explode the teachers column, as you did, and broadcast-join the teachers. If you run into memory issues and still want to use pandas, you will have to multiprocess the join by splitting the courses DataFrame.
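
A minimal sketch of that approach, under a few assumptions: the broadcast join is approximated with a plain DataFrame.merge against the small teachers table, the helper join_teachers and the chunk_size value are hypothetical, and the chunks are processed sequentially here rather than in separate processes.

```python
import pandas as pd

courses = pd.DataFrame(
    [
        {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
        {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
    ]
)
teachers = pd.DataFrame(
    [
        {"teacher_id": "t_01", "teacher_fullname": "teacher_01"},
        {"teacher_id": "t_02", "teacher_fullname": "teacher_02"},
        {"teacher_id": "t_03", "teacher_fullname": "teacher_03"},
    ]
)


def join_teachers(chunk: pd.DataFrame) -> pd.DataFrame:
    """Explode the list column of one chunk and attach the teacher details."""
    long = chunk.explode("teachers").rename(columns={"teachers": "teacher_id"})
    return long.merge(teachers, on="teacher_id", how="left")


# Hypothetical chunk size; tune it to the available memory.
chunk_size = 1
chunks = [
    courses.iloc[start : start + chunk_size]
    for start in range(0, len(courses), chunk_size)
]

# The chunks are independent, so each call could instead be sent to a worker process.
joined = pd.concat([join_teachers(chunk) for chunk in chunks], ignore_index=True)
print(joined)
```

Since only the small teachers table is shared between chunks, peak memory is bounded by the largest exploded chunk plus the final concatenated result.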
