Predicting Airbnb Nightly Prices: A Regression Problem using Machine Learning

Alin Cristian Preda

Student thesis: Master thesis


Airbnb is an online platform that enables homeowners to rent out their unused space to travellers in need of accommodation. The “hosts” are free to set their own prices, so it becomes essential for these small-time entrepreneurs to gather clues as to how much they should charge. By making use of analytics and machine learning, is it possible to harness the power of data to discover actionable insights that enable data-driven decisions? Is the use of open data sources, such as the Inside Airbnb project, sufficient for this task? And is it better to employ cross-market data or simply to focus on one city? The main objective of this research project was to develop a model that can reliably predict nightly Airbnb prices of rooms and homes. A secondary goal was to experiment with new features (the sentiment of review comment text and quality scores for listing photographs) and with a new approach to training data: using sets of multiple cities rather than a single market’s data. The models employed for this task were Random Forests and XGBoost, both of which are well suited to supervised regression problems. Pre-trained neural networks and sentiment analysis, a branch of natural language processing, were used to engineer new features that could add predictive power. Geospatial data was studied through visualization to uncover similarities and differences between markets. Existing literature on the subject helped in showcasing good practices, confirming universal findings, and providing inspiration for new approaches and perspectives. The scope was narrowed down to ten major European cities. XGBoost proved to be the superior regression method, scoring highest across multiple approaches. Its best result is an R² score of 0.64, obtained when using all ten cities’ data together with the engineered features.
As is consistent across research, features such as a listing’s capacity, its proximity to the centre, and whether a place is rented out in full are some of the most important indicators of price levels. I demonstrated the potential of engineering features from photographs and review texts. Open data sources do not account for all the variability in prices, and new features should be sought out. I believe there are clues to be discovered in studies that focus on the social intricacies of Airbnb. It is probably worth taking a closer look at how hosts present themselves and how their image and interactions influence their potential to attract and secure “guests” at more competitive prices. Overall, I would argue that we are not quite there yet in terms of automating decision-making in this particular industry, but neither have we reached the end of the possibilities.
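
For readers unfamiliar with the metric, the sketch below shows how the coefficient of determination (the R² score quoted in the abstract) is commonly computed. It is an illustration only and not taken from the thesis: the function name `r2_score` and the price lists are hypothetical, and the numbers are made up.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean price.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


# Made-up nightly prices, for illustration only (not data from the thesis).
actual_prices = [50.0, 80.0, 120.0, 200.0]
predicted_prices = [60.0, 75.0, 110.0, 170.0]
score = r2_score(actual_prices, predicted_prices)
```

Under this definition, a score of 1 corresponds to perfect predictions and a score of 0 to a model that does no better than always predicting the mean price, so 0.64 can be read as explaining roughly 64% of the variance in prices.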

Educations: MSc in Business Administration and Information Systems, (Graduate Programme) Final Thesis
Publication date: 2020
Number of pages: 112
Supervisors: Weifang Wu