What is the Bike-Share Predictive Model?

This project grew out of the complex relationship between time, weather, and human behavioral trends. In simplest terms, the regression model predicts the hourly "Net Flow" bike count of every bike station, i.e., the number of bikes arriving minus the number of bikes departing in a given hour. These predictions can then be used by bike-share companies to reconsider station stocking schedules, and by bike-share customers to plan more optimal routes.
Why Bike-Share?
Bike-sharing is a practical and sustainable form of transportation that I frequently use for commuting and errands. It is clean and healthy, and it has great potential for improvement. I believe that by understanding how people use bike-sharing systems, we can improve the infrastructure and build the future of sustainable urban transportation.
Summary
Data Scraping & Transformation:
- Scraped, cleaned, and merged (9 years' worth):
  - All user bike transaction data (~540+ million rows)
  - Hourly weather data (2013 - 2022)
- Transformed all bike transaction records from the past 9 years into every station's hourly in/out flow count time series (2013 - 2022)
Predictive Model: Predicts the hourly net bike flow count of 1600+ stations using only the current weather information retrieved through a Weather API, with an accuracy of ~80%. Trained on a large data set of 5 million rows using the ranger random forest library. (A rough R sketch of these steps follows below.)
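To make the transformation and training steps concrete, here is a minimal sketch in R. The data frames `trips` (with columns such as `start_station_id`, `start_time`, `end_station_id`, `end_time`) and `weather_hourly` (keyed by `hour`, with columns such as `temperature`, `precipitation`, `wind_speed`) are hypothetical stand-ins for the project's actual data, and the feature set is illustrative rather than the real one.

```r
library(dplyr)
library(lubridate)
library(ranger)

# Hourly out-flow per station: trips that start at each station in each hour.
outflow <- trips %>%
  mutate(hour = floor_date(start_time, "hour")) %>%
  count(station_id = start_station_id, hour, name = "out_count")

# Hourly in-flow per station: trips that end at each station in each hour.
inflow <- trips %>%
  mutate(hour = floor_date(end_time, "hour")) %>%
  count(station_id = end_station_id, hour, name = "in_count")

# One hourly time series per station, joined with the hourly weather;
# net flow for a station-hour is bikes in minus bikes out.
flows <- full_join(inflow, outflow, by = c("station_id", "hour")) %>%
  mutate(across(c(in_count, out_count), ~ coalesce(.x, 0L)),
         net_flow    = in_count - out_count,
         hour_of_day = hour(hour),
         station_id  = as.factor(station_id)) %>%
  left_join(weather_hourly, by = "hour")

# Separate random forest regressors for in-flow and out-flow;
# the predicted net flow is simply their difference.
in_model  <- ranger(in_count ~ temperature + precipitation + wind_speed +
                      station_id + hour_of_day,
                    data = flows, num.trees = 500)
out_model <- ranger(out_count ~ temperature + precipitation + wind_speed +
                      station_id + hour_of_day,
                    data = flows, num.trees = 500)
```

The real project may use different features, splits, and tuning; the sketch is only meant to show the shape of the pipeline.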

Application Description
- The algorithm would begin by retrieving the current weather information for the desired hour range from the Weather API.
- It would then retrieve the current station stock information through the Bike-Share system's API.
- Next, the algorithm would compute the net flow for each station using the in-flow and out-flow machine learning models.
- Finally, the algorithm would report whether there will be an overstock or understock among all of the stations within the selected hour range.
- The algorithm can also be run on a smaller scale to predict the net flow for a specific station (a rough sketch of this pipeline follows below).
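As a rough illustration of how these steps could fit together in code, the sketch below reuses the `in_model` / `out_model` objects from the earlier sketch and introduces two hypothetical wrappers, `get_weather_forecast()` and `get_station_stock()`, as stand-ins for the Weather API and the Bike-Share system's API (neither name comes from the project).

```r
library(dplyr)
library(ranger)

# Hypothetical driver for the steps above. get_weather_forecast() is assumed
# to return one row per forecast hour, carrying the same predictor columns
# the models were trained on; get_station_stock() is assumed to return
# station_id, bikes (current stock), and capacity.
report_stock_risk <- function(hours_ahead, in_model, out_model) {
  weather <- get_weather_forecast(hours_ahead)
  stock   <- get_station_stock()

  # One prediction row per station per forecast hour.
  grid <- cross_join(distinct(stock, station_id), weather)
  grid$net_flow <- predict(in_model,  data = grid)$predictions -
                   predict(out_model, data = grid)$predictions

  # Project each station's stock forward hour by hour and flag hours where
  # it would exceed capacity (overstock) or fall below zero (understock).
  grid %>%
    left_join(stock, by = "station_id") %>%
    group_by(station_id) %>%
    arrange(hour, .by_group = TRUE) %>%
    mutate(projected_bikes = bikes + cumsum(net_flow),
           status = case_when(projected_bikes > capacity ~ "overstock",
                              projected_bikes < 0        ~ "understock",
                              TRUE                       ~ "ok")) %>%
    ungroup()
}
```

Filtering the result to a single `station_id` gives the smaller-scale, per-station prediction mentioned in the last point.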
Technology Used
This project was completed using the R language for both the Data Scraping & Transformation and the Machine Learning training aspects. I chose R mainly to familiarize myself with its syntax and to gain proficiency in the language.
Ranger's Random Forest was found to produce the highest accuracy, F1 score, and precision. It also significantly outperformed, in runtime, all of the other popular regression methods tried, such as Support Vector Machine, Logistic Regression, Decision Tree, XGBoost, and KNN.
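The benchmark itself is not reproduced here, but a runtime comparison of this kind can be set up along the following lines in R; the formula, tree count, and xgboost settings are illustrative assumptions, not the project's actual configuration.

```r
library(ranger)
library(xgboost)

# Time a ranger fit on the training frame `flows` from the earlier sketch.
system.time(
  rf_fit <- ranger(net_flow ~ temperature + precipitation + wind_speed +
                     hour_of_day,
                   data = flows, num.trees = 500)
)

# xgboost expects a numeric matrix, so build one with model.matrix().
x <- model.matrix(~ temperature + precipitation + wind_speed + hour_of_day,
                  data = flows)
system.time(
  xgb_fit <- xgboost(data = x, label = flows$net_flow, nrounds = 200,
                     objective = "reg:squarederror", verbose = 0)
)
```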

Project GitHub: