Data Science Asked by celiebrockett on April 18, 2021
I want to use Natural Language Processing to analyze traffic accident reports and, from the text, determine two things: the direction each vehicle was traveling (north, southwest, etc.) and a description of its movement.
A fragment of an accident report narrative looks something like this:
TU2 was parked and attended on the west side of Main St, facing west,
engine turned off. TU1 was parked on the east side of Main St, facing
east and exited the parking spot, driving backwards. TU1’s rear
collided with TU2’s rear. TU2 was still parked with the engine off
when rear ended by TU1.
The accident reports are in thousands of files, with all accident reports for one year in one city in a single file. The “answers” (labels) to vehicle travel direction and movement descriptions are provided, along with the narrative of the accident report. So I have a decent training dataset.
I was thinking about starting with an approach like an n-gram bag of words and a simple classifier for vehicle direction (north, southwest, etc.). Would that be a good start?
If you're looking to generate n-grams, you can use a straightforward Python function:
def ngram(tokens, n):
    """Return all n-grams (as token sublists) from a list of tokens."""
    grams = []
    for i in range(len(tokens) - n + 1):
        grams.append(tokens[i:i + n])
    return grams

sent = 'hi i am fred fred burger :)'
bigrams = ngram(sent.split(), 2)  # [['hi', 'i'], ['i', 'am'], ['am', 'fred'], ...]
And for data processing you can embed each gram separately (just keep a dict {gram : index}) if you want to move forward with some form of logistic regressor or similar. You can then write the model in any framework you fancy. As a beginner I recommend Keras, because the logistic model will be two lines, and extending to further models will be a lot easier at that level of abstraction. It also has a lot of preprocessing tools that may help you (for padding etc.).
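A minimal sketch of that idea, reusing the ngram function above (the toy documents, the {gram : index} vocabulary, and the tiny Keras model below are my own illustrative choices, not part of the original answer):

import numpy as np
from tensorflow import keras

# Toy data: a couple of tokenized narratives with a binary direction label.
docs = [['tu1', 'was', 'facing', 'west'], ['tu2', 'was', 'facing', 'east']]
labels = np.array([0, 1])  # e.g. 0 = west, 1 = east

# Build a {gram: index} vocabulary and turn each document into a sequence of indices.
vocab = {}
sequences = []
for tokens in docs:
    seq = []
    for gram in ngram(tokens, 2):
        key = ' '.join(gram)
        seq.append(vocab.setdefault(key, len(vocab) + 1))  # index 0 is reserved for padding
    sequences.append(seq)

# Pad every sequence to the same fixed length.
max_len = 10
x = np.zeros((len(sequences), max_len), dtype='int32')
for i, seq in enumerate(sequences):
    x[i, :min(len(seq), max_len)] = seq[:max_len]

# A logistic-regression-style model: embedding, average pooling, sigmoid output.
model = keras.Sequential([
    keras.layers.Embedding(input_dim=len(vocab) + 1, output_dim=8),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x, labels, epochs=5, verbose=0)

Swapping the final Dense layer for Dense(n_classes, activation='softmax') with a sparse_categorical_crossentropy loss would cover multi-class direction labels (north, southwest, and so on).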
Answered by mshlis on April 18, 2021
If I understand you correctly, your question is how to get your data into a model? Here is a brief example using R. The data, when read into R, has one column containing the label type (this will be your y) and one column containing the event text (this will be the text from which you make a bag of words). Make sure the text is lower case and contains no special characters. Maybe remove stopwords, do stemming, or prune your vocabulary. My text here is well formatted for my task.
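For example, in Python (the answer itself works in R, and NLTK here is just one possible choice of tooling) that preprocessing could look roughly like this:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download('stopwords')  # needed once before the stopword list is available
stemmer = PorterStemmer()
stops = set(stopwords.words('english'))

def clean(text):
    """Lower-case, strip special characters, drop stopwords and stem the remaining tokens."""
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    return ' '.join(stemmer.stem(w) for w in text.split() if w not in stops)

print(clean("TU1's rear collided with TU2's rear."))  # roughly: 'tu1 rear collid tu2 rear'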
In R you can declare the column type to be a factor and plug the factor into a regression model. Alternatively, you can simply "recode" type to numbers, say 0 = accident and 1 = crash (etc.). Any model should be able to digest these numbers, which indicate your "classes" to be predicted. Don't forget to split your data into a train and a test set.
The next step is to generate a bag of words or n-grams from event (I think this should be doable for you based on online examples).
Once you have your labels (y) and your bag of words (x) you can start with some model(s). In another answer, a Keras model was proposed. I think this is an option, but probably a somewhat over-engineered solution. An alternative would be to use "normal" logit with regularization (lasso or ridge). The reason for lasso/ridge is that features (aka columns in your bag of words) are "shrunken" automatically if they do not contribute much to a good prediction. This usually improves fit.
Estimation is simple using glmnet:
library(glmnet)

# Fit model to training set
cv_fit <- cv.glmnet(x = dtm_train, y = train[['type']],
                    nfolds = 5,
                    type.measure = "class",
                    alpha = 1,               # 1 = Lasso / 0 = Ridge
                    grouped = FALSE,
                    family = "multinomial")  # I have 4 classes
# Plot CV results for parameter lambda
plot(cv_fit)
# Get best lambda
bestlam = cv_fit$lambda.min
# Predict classes
classes = predict(cv_fit, newx=dtm_test, s=bestlam, type="class")
# Look at results
table(classes, test[['type']])
You first "tune" the parameter lamda by CV
. What you get is a nice figure and an optimal lambda.
Next you can predict classes and look at the outcome:
classes    Accident Crash Incident Report
  Accident      142     5       23      5
  Crash           2     9        0      0
  Incident       64     5     1697     29
  Report          8     1       11      1
Well, this is just a rough example of mine (not fine-tuned). However, if you check different alpha values in [0,1] you may get decent results for your task.
Here is a good guide to glmnet and some documentation for R. You can do the same thing in Python, by the way.
Answered by Peter on April 18, 2021
Firstly, I hope that the label is either a short summary or a sequence of words of varying length, not just a single direction word, because moving cars involved in an accident may have multiple directions, or one car could simply be parked, as in the example.
Secondly, given that you are planning to predict a variable-length label, and given the example text, I am pretty sure that bag of words will not work well. You need the context, and you don't have a million-sized training set to train a transformer deep neural net. So try to utilize pre-trained embeddings like USE, GloVe etc.; the embedding gives you much of the feature engineering for free. Just train some low-complexity model like random forest / XGBoost on (label, embeddings), as in the sketch below. You can also explore pretrained summary generators like bertseq2seq, although the code for that didn't work for me yet.
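A minimal sketch of that pipeline, assuming averaged GloVe word vectors as the sentence embedding and scikit-learn's random forest (the GloVe file path, narratives, and labels below are placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load pre-trained GloVe vectors from a local text file (the path is a placeholder).
glove = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.array(parts[1:], dtype='float32')

def embed(text, dim=100):
    """Average the word vectors of the tokens; zeros if none are in the vocabulary."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Placeholder training data: narratives and their direction labels.
narratives = ["tu1 was parked on the east side of main st facing east",
              "tu2 was travelling north on main st"]
directions = ["east", "north"]

X = np.vstack([embed(t) for t in narratives])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, directions)
print(clf.predict(X))

A sentence-level encoder such as USE would replace the averaging in embed() with a single call that returns one vector per narrative.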
Answered by dipanjan sanyal on April 18, 2021