Retrieve user features in real time from UserId for prediction

Question

Let's say I'm building an app like Uber and I want to predict the user's most likely destination based on the user's past history, current latitude/longitude, and time/date.

Here is the proposed architecture -

Let's say I have a pre-trained model hosted as a service. The part I'm struggling with is, how do I get the user features from the database in realtime from the RiderID to be used by the prediction service (XGBoost Model)? I'm guessing a lookup in a SQL database will take too long, considering I have 1M+ users and rides.

Thanks in advance!

The Lyrist · Answer

I think most likely the return of your model wouldn’t be worth it given the amount of effort to generalize it enough.

You may have millions of users, but each users need will probably too unique for generalization. I.e., everyone’s commute is so different that what you have learned from other users are probably not applicable to other users. (Unless during rush hours where most people are heading to the core business areas. You don’t need a model for that. )
The trained model will probably be marginally better than logging the users usage history. It probably isn’t worth the effort to train and process these data for what you can gain. 
Recent locations is probably good enough for most users and super easy to implement. Your model will probably have a hard time to predict the odd unusual trips anyway.

You are probably better off storing the users current location and query the most likely destination. Or simply look at all the popular trip destinations from that location. A data base with proper index should be able to handle that

BKersbergen · Answer

It sounds like you are looking for a fast and horizontally scalable database. I would advise you to use a column family database instead of a relational database for storing this kind of data. We are using Google BigTable (BT) for this in a similar use-case. On a 3 node BT cluster with SSD disks we have over 300M records that are fetched by key in 6ms @99 percentile with a load of 1000 requests per second. If the load increases you can just simply add nodes while running to your cluster or remove them. An opensource alternative like Cassandra is even faster in our experience. That database key would be RiderID in your case.

Answered by BKersbergen on March 24, 2021

Retrieve user features in real time from UserId for prediction

2 Answers

Add your own answers!

Ask a Question