Data Science Asked on December 9, 2021
I want to use Natural Language Processing to analyze traffic accident reports and determine two things from the text: each vehicle's travel direction and a description of its movement.
A fragment of an accident report narrative looks something like this:
TU2 was parked and attended on the west side of Main St, facing west,
engine turned off. TU1 was parked on the east side of Main St, facing
east and exited the parking spot, driving backwards. TU1’s rear
collided with TU2’s rear. TU2 was still parked with the engine off
when rear ended by TU1.
The accident reports are spread across thousands of files, with all accident reports for one year in one city in a single file. The “answers” (labels) for vehicle travel direction and movement description are provided along with the narrative of each accident report, so I have a decent training dataset.
I was thinking about starting with an approach like an n-gram bag of words and a simple classifier for vehicle direction (north, southwest, etc.). Would that be a good start?
Firstly, I hope that the label is either a short summary or a phrase of varying length, not just a one-word direction, because a moving car involved in an accident may have multiple directions, or a car may simply be parked, as in the example.
Secondly, given that you are planning to predict a variable-length label, and given the example text, I am pretty sure that bag of words will not work well. You need the context, and you don't have a million-sized training set with which to train a transformer from scratch. So try to utilize pre-trained embeddings like USE, GloVe etc. Using an embedding gives you excellent feature engineering essentially for free; just train some low-complexity model like random forest / xgboost on (label, embedding) pairs. You can also explore pretrained summary generators like bertseq2seq, however the code for this didn't work for me yet.
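To make the embeddings-plus-shallow-model idea concrete, here is a minimal Python sketch. It is only an illustration: the narratives and directions lists are hypothetical toy data, and it assumes the sentence-transformers and scikit-learn packages (USE or GloVe vectors would slot in the same way).

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one narrative and one direction label per report
narratives = [
    "TU1 was traveling north on Main St",
    "TU2 was parked facing west, engine off",
    "TU1 exited the parking spot driving backwards, heading east",
    "TU2 was traveling south when it was struck",
]
directions = ["north", "parked", "east", "south"]

# Encode each narrative into a fixed-length sentence embedding
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(narratives)

# Train a low-complexity classifier on (embedding, label) pairs
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, directions)

print(clf.predict(encoder.encode(["TU1 was heading north on 5th Ave"])))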
Answered by dipanjan sanyal on December 9, 2021
If I understand you correctly, your question is how to get your data into a model? Here is a brief example using R. When read into R, the data has one column containing the label (this will be your y) and one column containing the event (this will be the text from which you make a bag of words). Make sure the text is lower case and contains no special characters. You might also remove stopwords, do stemming, or prune your vocabulary. My text here is well formatted for my task.
In R you can declare the column type as a factor and plug the factor into a regression model. Alternatively, you can simply "recode" type to numbers, say 0=accident and 1=crash (etc.). Any model should be able to digest these numbers, which indicate your "classes" to be predicted. Don't forget to split your data into a train and a test set.
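For readers who prefer Python, a rough sketch of this same preparation step might look like the following; it assumes pandas and scikit-learn and a hypothetical reports.csv with a type (label) column and an event (text) column.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input file with a label column "type" and a text column "event"
df = pd.read_csv("reports.csv")

# Lower-case the text and strip special characters
df["event"] = df["event"].str.lower().str.replace(r"[^a-z\s]", "", regex=True)

# Recode the string labels to integer codes (e.g. 0=accident, 1=crash, ...)
df["type"] = df["type"].astype("category").cat.codes

# Split into a train and a test set
train, test = train_test_split(df, test_size=0.2, random_state=0)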
The next step is to generate a bag of words or n-grams from event (I think this should be doable for you based on online examples).
Once you have your labels (y) and your bag of words (x), you can start with some model(s). In another answer, a Keras model was proposed. I think this is an option, but probably a somewhat over-engineered solution. An alternative would be to use a "normal" logit with regularization (lasso or ridge). The reason for lasso/ridge is that features (aka columns in your bag of words) are "shrunken" automatically if they do not contribute much to a good prediction. This usually improves fit.
Estimation is simple using glmnet:
library(glmnet)

# dtm_train is the document-term matrix built from the training events
# (e.g. with the text2vec package)

# Fit model to training set
cv_fit <- cv.glmnet(x = dtm_train, y = train[['type']],
                    nfolds = 5,
                    type.measure = "class",
                    alpha = 1,              # 1 = Lasso / 0 = Ridge
                    grouped = FALSE,
                    family = "multinomial") # I have 4 classes
# Plot CV results for parameter lambda
plot(cv_fit)
# Get best lambda
bestlam = cv_fit$lambda.min
# Predict classes
classes = predict(cv_fit, newx=dtm_test, s=bestlam, type="class")
# Look at results
table(classes, test[['type']])
You first "tune" the parameter lambda by cross-validation (CV). What you get is a nice figure and an optimal lambda.
Next you can predict classes and look at the outcome:
classes    Accident  Crash  Incident  Report
Accident        142      5        23       5
Crash             2      9         0       0
Incident         64      5      1697      29
Report            8      1        11       1
Well, this is just a bonkers example of mine (not fine-tuned). However, if you check different alpha values in [0,1], you may get decent results for your task.
Here is a good guide to glmnet and some documentation for R. You can do the same thing in Python by the way.
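As a rough sketch of that Python route (not the author's code: the events and types lists below are hypothetical toy data, and it assumes scikit-learn), CountVectorizer builds the bag of words and a cross-validated L1-penalized logistic regression plays the role of cv.glmnet with alpha=1:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Hypothetical toy data: parallel lists of event texts and class labels
events = ["car a hit car b", "minor scrape reported",
          "pileup on highway", "routine report filed"] * 10
types = ["crash", "accident", "crash", "report"] * 10

X_train, X_test, y_train, y_test = train_test_split(events, types, random_state=0)

# Bag of words with unigrams and bigrams
vec = CountVectorizer(ngram_range=(1, 2))
dtm_train = vec.fit_transform(X_train)
dtm_test = vec.transform(X_test)

# Multinomial logit with cross-validated L1 (lasso) regularization,
# roughly analogous to cv.glmnet(..., alpha = 1, family = "multinomial")
clf = LogisticRegressionCV(cv=5, penalty="l1", solver="saga", max_iter=5000)
clf.fit(dtm_train, y_train)

# Confusion matrix, analogous to table(classes, test[['type']])
print(confusion_matrix(y_test, clf.predict(dtm_test)))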
Answered by Peter on December 9, 2021
If you're looking to generate n-grams, you can use a straightforward Python function:
def ngram(tokens, n):
    # Collect every contiguous slice of n tokens
    grams = []
    for i in range(len(tokens) - n + 1):
        grams.append(tokens[i:i + n])
    return grams

sent = 'hi i am fred fred burger :)'
bigrams = ngram(sent.split(), 2)  # [['hi', 'i'], ['i', 'am'], ...]
And for data processing you can just embed each gram separately (just hold a dict {gram : index}) if you want to move forward with some form of logistic regressor or the like. You can then write the model in any framework you fancy. As a beginner I recommend Keras, because the logistic model will be two lines and extending to further models will be a lot easier at that level of abstraction. They also have a lot of preprocessing tools that may help you (for padding etc.).
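As a sketch of that pipeline (hedged: gram_index and the toy sentence are hypothetical, and it reuses the ngram function defined above), the multi-hot gram encoding plus the "two line" Keras logistic model could look like:

import numpy as np
from tensorflow import keras

# Hypothetical vocabulary mapping each gram to a feature index
gram_index = {("hi", "i"): 0, ("i", "am"): 1, ("am", "fred"): 2}

def encode(grams, index):
    # Multi-hot vector: 1.0 wherever a known gram occurs in the sentence
    vec = np.zeros(len(index), dtype="float32")
    for g in grams:
        j = index.get(tuple(g))
        if j is not None:
            vec[j] = 1.0
    return vec

X = np.stack([encode(ngram("hi i am".split(), 2), gram_index)])
y = np.array([0])  # toy class label

# The "two line" logistic model: a single Dense softmax layer
model = keras.Sequential([keras.layers.Dense(2, activation="softmax",
                                             input_shape=(len(gram_index),))])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=1, verbose=0)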
Answered by mshlis on December 9, 2021