PySpark big data project

This project designs a recommendation algorithm that combines machine learning with Spark and compares its performance across different data sizes. It uses distributed computing algorithms and technologies to process big data and demonstrates the performance gains that distributed computing can achieve.

For the full write-up, see the file PySpark_ML_Experiment_Report.pdf. To reproduce the experiment, run all the files except the deprecated one.

Abstract

In Airbnb, a digital platform connecting accommodations and experiences, guests frequently invest weeks in exploring and comparing various options before finalizing a reservation. The extensive and investigative process of searching, along with the requirement to consider both guest and host preferences, poses distinct challenges for Airbnb’s search ranking mechanism. In this research we aim to develop a recommendation system for Airbnb listings based on historical user reviews in major cities, focusing on location and distance parameters. By incorporating user feedback and geographical data, our goal is to create a user-friendly system where users specify a location on a map and a radius, and the system recommends the top 10 Airbnb listings within that specified distance. By using this strategy, we aim to improve the user experience by offering accommodations that are relevant to their needs and preferences for closeness.
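The radius-based lookup described in the abstract could be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's Spark code: the `Listing` fields, the `score` value (standing in for a review-based score), and the `top_listings` helper are hypothetical names, and the haversine formula is assumed as the distance measure.

```python
import math
from typing import NamedTuple


class Listing(NamedTuple):
    listing_id: str
    lat: float
    lon: float
    score: float  # hypothetical review-based score, higher is better


EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def top_listings(listings, lat, lon, radius_km, n=10):
    """Return up to n listings within radius_km of (lat, lon), best score first."""
    nearby = [item for item in listings
              if haversine_km(lat, lon, item.lat, item.lon) <= radius_km]
    return sorted(nearby, key=lambda item: item.score, reverse=True)[:n]


# Made-up sample data: two points in Paris and one in London.
sample = [
    Listing("a", 48.8606, 2.3376, 0.70),
    Listing("b", 48.8584, 2.2945, 0.90),
    Listing("c", 51.5074, -0.1278, 0.99),
]
picked = top_listings(sample, 48.8566, 2.3522, radius_km=5)
print([item.listing_id for item in picked])
```

A linear scan like this is fine for small samples; for millions of rows the same filter would need a spatial index or a distributed filter step.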

Group work

The advent of online platforms has revolutionized the tourism and hospitality industry, offering travelers a wide array of accommodation options. Among these platforms, Airbnb has emerged as a prominent player, providing users with unique lodging experiences in various cities across the globe. At Airbnb, search is the main way for a guest to discover the right inventory, such as stays, (in-real-life) experiences (i.e. tourism places), online (i.e. virtual) experiences, etc. In this research paper, we delve into the intricate workings of Airbnb data analysis and recommendation systems, focusing on twelve major cities: Paris, London, Toronto, Rome, Amsterdam, Barcelona, Bangkok, Berlin, Bristol, Brussels, Cambridge, and Edinburgh.

Deep learning techniques have demonstrated effectiveness across various domains. Chauhan and Palivela utilized LSTM and GloVe word embeddings for fake news detection on social media, effectively filtering content. Ensafi et al. proposed LSTM for forecasting seasonal item sales, comparing its efficiency with other methods. Ji et al. introduced the STARec model for user preference modeling, highlighting advantages such as the activity frequency feature but noting drawbacks like abnormal training time. Other work explored future visitor recommendations alongside POI recommendations using KNN and RF. Chen et al. suggested a POI recommendation approach based on user location interest and contextual data, with limitations in handling numerous features. Çakmak et al. analyzed visitation profiles to recommend POIs using link prediction methods. In a reinforcement learning attempt, Massimo and Ricci proposed an IRL-based model that successfully addressed clustering overheads but struggled to capture behavioral patterns in large datasets.

Our approach to enhancing the search experience for Airbnb guests integrates several components to optimize recommendation and support informed decision-making. Central to our methodology is the automated crawling and pre-processing of Airbnb data from the InsideAirbnb source, using distributed computing techniques such as Apache Spark DataFrames to handle large-scale datasets efficiently. This curated dataset, comprising millions of observations from diverse cities, serves as the foundation for analysis and recommendation.

Alongside automated data processing, sentiment analysis plays a pivotal role in our methodology, enabling us to understand the sentiment of user reviews towards Airbnb listings. By leveraging tools like NLTK SentimentIntensityAnalyzer, we gain insight into user preferences and satisfaction levels, which informs our recommendation strategies.

Our methodology also incorporates interactive visualization to present the analyzed data in a user-friendly manner. Through interfaces built with ipyleaflet and ipywidgets, users can explore Airbnb listings based on their preferences and geographic locations.

Optimization strategies are deployed to streamline the recommendation process and improve search efficiency. Techniques such as geohashing and binary search speed up the search algorithms, with the aim of a smoother user experience and, ultimately, higher guest conversion rates on Airbnb.

Furthermore, we noticed that Airbnb users often struggle to plan their trips effectively because they lack information about nearby tourist attractions and how far away they are when choosing a stay. To address this issue, we propose integrating a "Travel Score" into Airbnb’s recommendation system. This feature uses clustering algorithms to group listings geographically around tourist sites and to assess how far each listing is from them. By providing a quantifiable Travel Score, our solution aims to transform Airbnb into a more comprehensive travel planning tool, ensuring accommodations align with travelers’ sightseeing preferences. Overall, our recommendation system extends Airbnb’s functionality with a two-dimensional evaluation approach, which includes both a sentiment score and a Travel Score for each stay.
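To illustrate how geohashing and binary search can work together, here is a minimal sketch in plain Python. It is an assumption about the general technique, not the project's actual implementation: `geohash_encode` follows the standard public geohash scheme, while `build_index` and `prefix_search` are hypothetical helper names. Listings are keyed by geohash, kept in a sorted list, and all listings whose hash starts with a given cell prefix are found with two binary searches.

```python
import bisect

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_rng = [-90.0, 90.0]
    lon_rng = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def build_index(points, precision=6):
    """points: iterable of (listing_id, lat, lon) -> sorted [(geohash, listing_id)]."""
    return sorted((geohash_encode(lat, lon, precision), lid)
                  for lid, lat, lon in points)


def prefix_search(index, prefix):
    """Return the ids of all listings whose geohash starts with prefix."""
    lo = bisect.bisect_left(index, (prefix,))
    # "~" sorts after every character of the geohash alphabet
    hi = bisect.bisect_left(index, (prefix + "~",))
    return [lid for _, lid in index[lo:hi]]


# Made-up sample points: two close together, one far away.
index = build_index([("A", 42.6, -5.6), ("B", 42.61, -5.59), ("C", 51.5074, -0.1278)])
print(sorted(prefix_search(index, "ezs42")))
```

A prefix lookup only covers a single geohash cell, so points just across a cell boundary are missed; a full radius query would typically also search the neighbouring cells and then filter candidates by exact distance.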

Folders and files

(Deprecated)initial_local_testing_osm_data.ipynb
Data-Preproccesing.ipynb
Data_Visualisation-1-2.ipynb
Geospatial_Heatmaps.ipynb
Project-marking.pdf
Proposal.md
PySpark_ML_Experiment_Report.pdf
README.md
data.json
download.py
download.sh
geodata_KMeans_DBSCAN.ipynb
geodata_Preprocessing.ipynb
geodata_download.ipynb
geodata_travel_score_recommendation.ipynb
res.csv
travel_scored_airbnb.csv
