Use Python, NimbusML and ML.NET to predict New York taxi fares
--
There are many popular machine learning libraries for Python. There’s TensorFlow, scikit-learn, Theano, Caffe, and many others.
And in the .NET domain, we have Microsoft’s new ML.NET machine learning library, which can be used in C# and F# applications.
But now Microsoft has created NimbusML, a new library that will let you access the ML.NET machine learning library directly in your Python code!
NimbusML acts as a bridge between the Python process that’s running your app code and the .NET runtime that’s hosting the ML.NET library. All calls are transparently routed between Python and .NET.
Naturally I had to try it out. I decided to port my New York taxi price prediction model to NimbusML to see what happens.
I’m always big on writing extremely compact apps and I was happy to get the C# version of the taxi price predictor down to 122 lines of code. And by porting the app to F#, I managed to reduce its size even further to only 69 lines of code.
But how compact will the Python app be?
Let’s find out.
The first thing I’ll need is a data file with transcripts of New York taxi rides. The NYC Taxi & Limousine Commission provides yearly TLC Trip Record Data files which have exactly what I need.
I will download the Yellow Taxi Trip Records from December 2018 and save them as yellow_tripdata_2018-12.csv.
This is a CSV file with 8,173,233 records.
There are a lot of columns with interesting information in this data file, but I will only train on the following:
- Column 0: The data provider vendor ID
- Column 3: Number of passengers
- Column 4: Trip distance
- Column 5: The rate code (standard, JFK, Newark, …)
- Column 9: Payment type (credit card, cash, …)
- Column 10: Fare amount
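To make the column selection concrete, here is a minimal sketch that pulls just those six columns out of a CSV stream using only the Python standard library. The column positions come from the list above. The short names in `COLUMNS`, the `select_columns` helper, and the one-row `SAMPLE` are my own illustrative assumptions and are not taken from the real data file or from NimbusML.

```python
import csv
import io

# Column positions from the list above, mapped to hypothetical short names.
COLUMNS = {
    0: "vendor_id",
    3: "passenger_count",
    4: "trip_distance",
    5: "rate_code",
    9: "payment_type",
    10: "fare_amount",
}


def select_columns(lines, has_header=True):
    """Yield one dict per data row, keeping only the columns in COLUMNS."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # discard the header row
    for row in reader:
        if len(row) <= max(COLUMNS):
            continue  # skip blank or truncated lines
        yield {name: row[index] for index, name in COLUMNS.items()}


# Made-up sample (placeholder header plus one invented ride), for illustration only.
SAMPLE = (
    "c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10\n"
    "1,2018-12-01 00:00:00,2018-12-01 00:10:00,2,1.5,1,N,100,200,1,8.5\n"
)

rows = list(select_columns(io.StringIO(SAMPLE)))
print(rows[0])
```

To try this on the downloaded file, pass `open("yellow_tripdata_2018-12.csv", newline="")` to `select_columns` instead of the in-memory sample. All values come back as strings, so the numeric columns still need converting before training.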