Data Science Asked by Somphon Rueangari on May 22, 2021
I have a dataset consisting of 8 columns and 15,600 rows, with the following columns:
1. Entry_academic_year, which has 5 discrete values (2558, 2559, 2560, 2561, 2562)
2. Faculty (the faculty the student enrolled in, e.g. engineering)
3. branch (the branch the student enrolled in, e.g. software engineering)
4. Admission type (how the student entered the college)
5. Graduated_high_school (the high school the student graduated from)
6. province_of_school
7. GPA_high_school (the student's GPA in high school)
8. GPA_college (the student's GPA during college)
I am trying to predict each student's college GPA as a class, by dividing GPA_college into 4 quartiles at the 25th, 50th and 75th percentiles. The problem I face is that the Graduated_high_school column has around 1,732 unique values, and some schools appear in only one row, which I believe is why the prediction accuracy is only around 30-35%.
Any idea on how to fix it?
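For reference, the quartile target described in the question can be sketched with pandas as below. This is only an illustration on made-up GPA values. The labels Q1 to Q4 and the variable names are assumptions, not taken from the original dataset.

```python
import pandas as pd

# Toy GPA values (made up for illustration only).
gpa = pd.Series([1.8, 2.1, 2.4, 2.7, 3.0, 3.2, 3.5, 3.9], name="GPA_college")

# Cut at the 25th, 50th and 75th percentiles, giving four classes.
gpa_quartile = pd.qcut(gpa, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Share of rows per class; quartile bins are roughly balanced by construction.
class_share = gpa_quartile.value_counts(normalize=True).sort_index()
print(class_share)
```

Because the four classes are roughly equal in size, always guessing one class already gives about 25% accuracy, so 30-35% is only slightly better than that baseline.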
Perhaps you can check whether Graduated_high_school is correlated in any way with GPA_college? If there is no correlation, you can try fitting a model after dropping the Graduated_high_school column.
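Since Graduated_high_school is categorical, an ordinary Pearson correlation does not apply directly. One possible check is the correlation ratio (eta squared), the share of the variance in GPA_college that is explained by school membership. The sketch below assumes a pandas DataFrame with the column names from the question. The function name correlation_ratio and the min_count parameter are hypothetical helpers introduced here.

```python
import pandas as pd

def correlation_ratio(df, cat_col, num_col, min_count=1):
    """Eta squared: fraction of num_col variance explained by cat_col groups.

    Rows whose category appears fewer than min_count times are ignored,
    because a group with a single row trivially "explains" that row.
    """
    counts = df[cat_col].map(df[cat_col].value_counts())
    sub = df[counts >= min_count]
    grand_mean = sub[num_col].mean()
    groups = sub.groupby(cat_col)[num_col]
    # Between-group sum of squares, weighted by group size.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((sub[num_col] - grand_mean) ** 2).sum()
    return ss_between / ss_total if ss_total > 0 else 0.0

# Toy data (made up): the school fully determines the GPA here.
df = pd.DataFrame({
    "Graduated_high_school": ["A", "A", "B", "B", "C", "C"],
    "GPA_college": [2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
})
eta_sq = correlation_ratio(df, "Graduated_high_school", "GPA_college")
print(eta_sq)
```

A value near 0 suggests the school carries little information about GPA_college, and a value near 1 suggests it carries a lot. With many single-row schools the unfiltered value is inflated, so it is probably worth setting min_count to something like 5 or 10 before reading too much into it.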
Otherwise, you can try dropping the rows that belong to under-represented high schools. However, one problem I foresee is that future predictions might involve Graduated_high_school values that are unseen in the training dataset, leading to problems (e.g. schools that weren't in the dataset, or someone using your model on a dataset from another country). So, if Graduated_high_school is not important, I would consider dropping it altogether.
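A minimal sketch of the row-dropping idea is shown below, assuming pandas and a made-up threshold of 2 rows per school. It also shows a common alternative that is not part of the suggestion above: mapping rare and unseen schools to a single "OTHER" bucket, so that new data with unknown schools can still be scored. The helper names frequent_categories and collapse_rare are hypothetical.

```python
import pandas as pd

def frequent_categories(train_col, min_count):
    """Categories seen at least min_count times in the training column."""
    counts = train_col.value_counts()
    return set(counts[counts >= min_count].index)

def collapse_rare(col, keep, other="OTHER"):
    """Keep values in `keep`; replace everything else (rare or unseen)."""
    return col.where(col.isin(keep), other)

# Toy data (made up): school "C" has one row, school "Z" is unseen in training.
train = pd.DataFrame({"Graduated_high_school": ["A", "A", "A", "B", "B", "C"]})
new = pd.DataFrame({"Graduated_high_school": ["A", "C", "Z"]})

keep = frequent_categories(train["Graduated_high_school"], min_count=2)

# Option 1: drop rows from under-represented schools.
train_dropped = train[train["Graduated_high_school"].isin(keep)]

# Option 2 (alternative): bucket rare/unseen schools, using the same
# `keep` set learned from the training data for both frames.
train["school_grouped"] = collapse_rare(train["Graduated_high_school"], keep)
new["school_grouped"] = collapse_rare(new["Graduated_high_school"], keep)
print(new)
```

The important detail in either option is that the set of kept schools is learned from the training data only and then reused at prediction time.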
Or, maybe you can replace Graduated_high_school with something else that is related, such as the number of teachers in the high school, the teacher-student ratio, etc.
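As a sketch of that replacement, the school identifier could be joined against a school-level table and then dropped. The table school_info, its columns n_teachers and student_teacher_ratio, and all values below are hypothetical; such data would have to come from an external source. Filling unknown schools with the median is just one assumed fallback.

```python
import pandas as pd

# Toy student rows (made up); school "Z" has no entry in the lookup table.
students = pd.DataFrame({
    "Graduated_high_school": ["A", "B", "A", "Z"],
    "GPA_high_school": [3.2, 3.8, 2.9, 3.5],
})

# Hypothetical school-level table, one row per school.
school_info = pd.DataFrame({
    "Graduated_high_school": ["A", "B"],
    "n_teachers": [40, 120],
    "student_teacher_ratio": [25.0, 15.0],
})

# Left join keeps every student; validate guards against duplicate schools.
features = students.merge(
    school_info, on="Graduated_high_school", how="left", validate="many_to_one"
)

# Schools missing from the lookup get the median of the known schools.
num_cols = ["n_teachers", "student_teacher_ratio"]
features[num_cols] = features[num_cols].fillna(school_info[num_cols].median())

# The high-cardinality identifier is no longer needed as a model input.
features = features.drop(columns="Graduated_high_school")
print(features)
```

This turns roughly 1,732 categories into a few numeric columns, and a school that was never seen in training can still be represented as long as its school-level figures are available.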
Answered by Daren on May 22, 2021