Data Science Asked on December 15, 2020
I currently have a dataset where each observation is a person’s traffic ticket history over districts.
For each column, which represents a district:
GOAL: to rank districts to see which should have more police presence due to increased traffic violations, and also to use the features to predict whether or not a person had 1+ traffic accidents in 2019.
PROBLEM: not all people have been to every district. I currently just encode the value as 0 if the person has never been to that district, but this should really be a valid NA value.
For example, it seems illogical to rank a district if only one person (in the dataset) has been to that district.
QUESTION(S): How exactly should I handle this? I don’t think imputing as 0 is the right call here.
Original Data:
PersonId  DistA  DistB  DistC  DistD  DistE  Accident19
1         0      1      1      0      NA     1
2         NA     0      0      0      1      0
3         0      1      1      0      NA     1
4         1      0      0      0      NA     0
Imputed Data:
PersonId  DistA  DistB  DistC  DistD  DistE  Accident19
1         0      1      1      0      0      1
2         0      0      0      0      1      0
3         0      1      1      0      0      1
4         1      0      0      0      0      0
Many thanks in advance!
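Thank you for clarifying the question @Eisen. So the question looks at two main things: ranking the districts by traffic violations, and predicting whether a person had 1+ traffic accidents in 2019.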
For the first point, I think it would be a good idea to show the breakdown of people who visited each district and committed 1+ traffic violations. On top of that, adding (95%) confidence intervals would be particularly helpful for seeing how reliable the estimate is for a given district: a district that only one person has visited will get a very wide interval.
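As a rough illustration of the confidence-interval idea, here is a minimal sketch using the toy data from the question. It assumes each district value is a 0/1 indicator of "1+ violations" and that `None` stands for NA (never visited), so non-visitors are left out of the denominator. The Wilson score interval is my choice here because it behaves sensibly for tiny samples; the helper names `wilson_interval` and `district_summary` are made up for this example.

```python
import math

# Toy data from the question; None marks a district the person never visited.
rows = [
    {"DistA": 0, "DistB": 1, "DistC": 1, "DistD": 0, "DistE": None},
    {"DistA": None, "DistB": 0, "DistC": 0, "DistD": 0, "DistE": 1},
    {"DistA": 0, "DistB": 1, "DistC": 1, "DistD": 0, "DistE": None},
    {"DistA": 1, "DistB": 0, "DistC": 0, "DistD": 0, "DistE": None},
]


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no visitors: the rate is completely unknown
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def district_summary(rows):
    """Per district: visitor count, violator count and interval, ignoring NA."""
    summary = {}
    for district in rows[0]:
        visited = [r[district] for r in rows if r[district] is not None]
        n, k = len(visited), sum(visited)
        summary[district] = {
            "visitors": n,
            "violators": k,
            "ci": wilson_interval(k, n),
        }
    return summary


summary = district_summary(rows)
for district, stats in summary.items():
    low, high = stats["ci"]
    print(f"{district}: {stats['violators']}/{stats['visitors']} "
          f"violators, 95% CI [{low:.2f}, {high:.2f}]")
```

With this, DistE (one visitor) shows a 100% violation rate but an interval spanning most of the 0 to 1 range, which is a signal not to rank it too confidently. Ranking by the lower bound of the interval instead of the raw rate is one possible way to penalise sparsely visited districts.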
In terms of the second point, maybe you can use a feedforward neural network that takes as input the traffic violation categories for each district and outputs whether the person had an accident in 2019 (Accident19).
The architecture is pretty much up to you, but the final layer would need to be a 2-node softmax layer, which produces a probability distribution over the two classes (has had an accident in 2019 or not).
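To make the shape of such a network concrete, here is a minimal forward-pass sketch in plain Python: one hidden ReLU layer followed by a 2-node softmax. The layer sizes, the random untrained weights, and the names `init_layer` and `forward` are all assumptions for illustration; in practice you would build and train this with a deep learning library.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, hidden, output):
    """Input -> hidden ReLU layer -> 2-node softmax output."""
    h = relu(dense(x, *hidden))
    return softmax(dense(h, *output))


rng = random.Random(0)
n_features = 15  # hypothetical input size
hidden = init_layer(n_features, 8, rng)
output = init_layer(8, 2, rng)  # 2 nodes: [no accident, accident]

x = [rng.choice([0.0, 1.0]) for _ in range(n_features)]
probs = forward(x, hidden, output)
print(f"P(no accident) = {probs[0]:.3f}, P(accident) = {probs[1]:.3f}")
```

Because the weights are untrained, the printed probabilities are meaningless; the point is only that the two outputs always form a valid probability distribution over the two classes.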
To represent this categorical data, I would suggest making NA its own category. Furthermore, it is best to represent the district traffic violation categories as one-hot encoded vectors. Then, for a particular person, you concatenate these one-hot encoded vectors from all districts in your dataset.
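A minimal sketch of that encoding, assuming three categories per district (0, 1 and NA, with NA written as `None`) and the five district columns from the question. The names `one_hot` and `encode_person` are hypothetical helpers for this example.

```python
# Categories per district; None is the "NA / never visited" category.
CATEGORIES = (0, 1, None)
DISTRICTS = ("DistA", "DistB", "DistC", "DistD", "DistE")


def one_hot(value, categories=CATEGORIES):
    """One-hot vector with a dedicated slot for NA (None)."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_person(row, districts=DISTRICTS):
    """Concatenate the per-district one-hot vectors into one feature vector."""
    features = []
    for district in districts:
        features.extend(one_hot(row[district]))
    return features


# Person 1 from the original (non-imputed) data.
person_1 = {"DistA": 0, "DistB": 1, "DistC": 1, "DistD": 0, "DistE": None}
x = encode_person(person_1)
print(len(x))  # 5 districts x 3 categories
print(x)
```

This way "never visited" and "visited with no violations" end up in different input slots, so the model can treat them differently instead of both being 0.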
Answered by shepan6 on December 15, 2020