Data Science Asked by Leonardo Ramos on May 4, 2021
Should I move to a database?
I have a list of courses data in JSON format that looks like this:
courses = [
    {
        "course_id": "c_01",
        "teachers": ["t_01", "t_02"]
    },
    {
        "course_id": "c_02",
        "teachers": ["t_02", "t_03"]
    }
]
And a list of teachers that look like this:
teachers = [
    {
        "teacher_id": "t_01",
        "teacher_fullname": "teacher_01"
    },
    {
        "teacher_id": "t_02",
        "teacher_fullname": "teacher_02"
    },
    {
        "teacher_id": "t_03",
        "teacher_fullname": "teacher_03"
    }
]
I also have other data like courseworks and submissions that I want to cross with these to create summary analytics like:
I loaded each list (courses, teachers, courseworks, submissions) into DataFrames, but I’m having a hard time selecting the courses of a teacher using vectorized methods.
Programming attempts
Data manipulation attempts
Then I tried to flatten the teachers Series as follows:
courses = [
    {
        "course_id": "c_01",
        "teacher_id": "t_01"
    },
    {
        "course_id": "c_01",
        "teacher_id": "t_02"
    },
    {
        "course_id": "c_02",
        "teacher_id": "t_02"
    },
    {
        "course_id": "c_02",
        "teacher_id": "t_03"
    }
]
This works because it allows merges and all kinds of queries, but because of the combinatorial explosion I ended up with thousands of extra rows, and I have no certainty that the aggregated numbers are correct.
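The flattening described above can be sketched in pandas with `DataFrame.explode`. This is only an illustration: the `submissions` table and its column names are assumed here, since the real ones are not shown. It demonstrates why row counts inflate after the merge and how counting distinct ids keeps the aggregates honest.

```python
import pandas as pd

courses = pd.DataFrame([
    {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
    {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
])

# Hypothetical submissions table; its columns are assumed for illustration.
submissions = pd.DataFrame({
    "submission_id": ["s_01", "s_02", "s_03"],
    "course_id": ["c_01", "c_01", "c_02"],
})

# One row per (course, teacher) pair.
course_teacher = (
    courses.explode("teachers")
    .rename(columns={"teachers": "teacher_id"})
    .reset_index(drop=True)
)

# Each submission row is repeated once per teacher of its course.
joined = submissions.merge(course_teacher, on="course_id")

# Per-teacher figures: a submission counts once for each teacher of the course.
per_teacher = joined.groupby("teacher_id")["submission_id"].nunique()

# Overall figures: count distinct ids rather than rows to avoid double counting.
total = joined["submission_id"].nunique()
```

The extra rows are expected for a many-to-many relationship. Totals go wrong only when rows are counted or summed at a coarser level than (submission, teacher), which is why the overall figure uses `nunique` on the id instead of `len(joined)`.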
My last approach was to add one column per teacher in each course like this:
courses = [
    {
        "course_id": "c_01",
        "teacher_01": 1,
        "teacher_02": 1,
        "teacher_03": 0
    },
    {
        "course_id": "c_02",
        "teacher_01": 0,
        "teacher_02": 1,
        "teacher_03": 1
    }
]
It actually works perfectly and the aggregations become very easy to do. One concern is that I may end up with thousands of columns (I’m not sure if that’s a problem). The other is that each time I run an analysis on a batch of data I’ll end up with different columns, so it will be harder to do cross-analysis between different datasets.
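As a rough sketch of the one-column-per-teacher layout, `pd.crosstab` can build the 0/1 matrix from the exploded data. The columns here are teacher ids rather than full names, and `all_teachers` is a hypothetical master list (including a made-up `t_04`) used to show one possible way of keeping the columns identical across batches.

```python
import pandas as pd

courses = pd.DataFrame([
    {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
    {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
])

# Long form first: one row per (course, teacher) pair.
long = courses.explode("teachers")

# Rows are courses, columns are teachers, cells are pair counts (0 or 1 here).
indicator = pd.crosstab(long["course_id"], long["teachers"])

# Hypothetical master list of teacher ids; reindexing against it gives every
# batch the same columns, with zeros for teachers absent from this batch.
all_teachers = ["t_01", "t_02", "t_03", "t_04"]
indicator = indicator.reindex(columns=all_teachers, fill_value=0)

# Courses taught by a given teacher become a boolean mask on one column.
courses_of_t02 = indicator.index[indicator["t_02"] == 1].tolist()
```

With thousands of teachers the matrix is mostly zeros, so the long (exploded) form is usually the more compact of the two representations.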
Anyway, my final thought is that I should store everything in a database and query the information I need already "joined", to make the analysis process easier and cleaner.
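For illustration, here is a minimal sketch of what the database route might look like, using Python's built-in `sqlite3` with an in-memory database. The table layout, including the `course_teachers` junction table, is an assumption and not a schema taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the many-to-many link lives in its own junction table.
conn.executescript("""
CREATE TABLE teachers (teacher_id TEXT PRIMARY KEY, teacher_fullname TEXT);
CREATE TABLE courses (course_id TEXT PRIMARY KEY);
CREATE TABLE course_teachers (
    course_id TEXT REFERENCES courses(course_id),
    teacher_id TEXT REFERENCES teachers(teacher_id),
    PRIMARY KEY (course_id, teacher_id)
);
""")

conn.executemany(
    "INSERT INTO teachers VALUES (?, ?)",
    [("t_01", "teacher_01"), ("t_02", "teacher_02"), ("t_03", "teacher_03")],
)
conn.executemany("INSERT INTO courses VALUES (?)", [("c_01",), ("c_02",)])
conn.executemany(
    "INSERT INTO course_teachers VALUES (?, ?)",
    [("c_01", "t_01"), ("c_01", "t_02"), ("c_02", "t_02"), ("c_02", "t_03")],
)

# Courses of one teacher, looked up by full name.
rows = conn.execute(
    """
    SELECT ct.course_id
    FROM course_teachers AS ct
    JOIN teachers AS t ON t.teacher_id = ct.teacher_id
    WHERE t.teacher_fullname = ?
    ORDER BY ct.course_id
    """,
    ("teacher_02",),
).fetchall()
```

Note that the junction table holds the same (course, teacher) rows as the flattened list, so a database moves the join elsewhere but does not remove the many-to-many structure.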
So… should I move to a database? Am I missing something?
Pandas unfortunately does not allow for the type of conditional join you wish to do, without copying a lot of unnecessary data before processing.
Your best solution in pandas is to explode the teachers column like you did and broadcast-join the teachers. If you are running into memory issues and still want to use pandas, then you will have to multiprocess the join by splitting the courses DataFrame.
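A minimal sketch of this suggestion, under some assumptions: the "broadcast join" is read here as a plain `merge` of each exploded chunk against the small teachers table, and the helper name `join_chunk` and the chunk count are made up for illustration. The chunks are processed sequentially to keep the example short.

```python
import numpy as np
import pandas as pd

courses = pd.DataFrame([
    {"course_id": "c_01", "teachers": ["t_01", "t_02"]},
    {"course_id": "c_02", "teachers": ["t_02", "t_03"]},
])
teachers = pd.DataFrame([
    {"teacher_id": "t_01", "teacher_fullname": "teacher_01"},
    {"teacher_id": "t_02", "teacher_fullname": "teacher_02"},
    {"teacher_id": "t_03", "teacher_fullname": "teacher_03"},
])


def join_chunk(chunk: pd.DataFrame, teachers: pd.DataFrame) -> pd.DataFrame:
    """Explode one slice of courses and attach the teacher details."""
    long = chunk.explode("teachers").rename(columns={"teachers": "teacher_id"})
    return long.merge(teachers, on="teacher_id", how="left")


# Split the courses into row ranges; only one exploded chunk is built at a time.
n_chunks = 2  # hypothetical value, tune to the available memory
bounds = np.linspace(0, len(courses), n_chunks + 1, dtype=int)
parts = [
    join_chunk(courses.iloc[start:stop], teachers)
    for start, stop in zip(bounds[:-1], bounds[1:])
]
result = pd.concat(parts, ignore_index=True)

# Selecting the courses of one teacher is then an ordinary boolean mask.
courses_of_t02 = result.loc[result["teacher_id"] == "t_02", "course_id"].tolist()
```

Because each `join_chunk` call depends only on its own slice and the teachers table, the calls could be dispatched to worker processes (for example with `concurrent.futures.ProcessPoolExecutor`) instead of run in a loop. If memory is the constraint, aggregating each part before concatenating would keep the combined result small.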
Correct answer by Benoit Descamps on May 4, 2021