Built during my capstone project, this tool helps
transaction analysts make data-driven co-living and rental property investment decisions
in Singapore. It combines market data aggregation, geospatial analysis, and machine
learning to estimate rental prices and explain the drivers behind each prediction.
The primary users are investment analysts who need to quickly assess the rental potential
of a given location without manually cross-referencing multiple data sources.
The project followed an end-to-end analytics pipeline: data scraping and geocoding,
feature engineering, model selection, and a Flask dashboard. An OLS regression was
evaluated as a methodological baseline. Its poor explanatory power prompted
the switch to tree-based models (Random Forest and XGBoost) for the Singapore properties,
which handled non-linearity and feature interactions far better.
The Problem
Rental price estimation in Singapore requires synthesising data across many dimensions.
Investment analysts start with comparables analysis, which involves finding similar properties
in the same area and adjusting for differences in size, age, and floor level.
They also need to factor in location-specific signals such as proximity to MRT stations, parks, and shopping malls.
Finally, they must understand broader market trends from
historical transaction records. Doing this manually is slow, inconsistent, and relies
heavily on analyst experience.
There was no single tool that pulled these signals together into a queryable interface
with a principled model underneath.
The Solution
By integrating data from multiple sources (see Data Sources below), I trained a machine learning model
to predict rental prices based on property features and location-based amenity proximity scores.
The model is served through a Flask app where users click on a map to select a location in Singapore's Core Central Region (CCR)
and get an instant price estimate, with SHAP explainability charts showing each feature's contribution.
Flask app demo: map-based rental price prediction
Data Sources
| Source | Data | Collection Method |
| --- | --- | --- |
| REALIS | Rental & sales transactions 2017–2025 | Web scraping |
| Booking.com | Hotel listings, pricing, location | Playwright browser automation |
| Cove (co-living) | Co-living room rates & locations | Web scraping |
| OpenStreetMap / Nominatim | Coordinates for all properties & amenities | Geocoding API |
| Data.gov.sg | MRT stations, schools, parks, hospitals | Public datasets |
Features
Interactive Map Location Picker
Users click anywhere on a map of Singapore to select a location. The app
captures the coordinates client-side and passes them to the backend, eliminating
the need to know an exact address or postal code.
Amenity Proximity Scoring
For any selected location, the app computes Haversine distances to the nearest
MRT stations, hospitals, parks, shopping malls, schools, and hotels. These
distances become model features and are also surfaced to the user directly.
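As a minimal sketch of the proximity computation described above, the Haversine formula can be written in a few lines of standard-library Python. The function names, the `(name, lat, lon)` tuple layout, and the coordinates in the usage note are assumptions for illustration, not the project's actual code.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_amenity_km(lat, lon, amenities):
    """Return (name, distance_km) for the closest entry in a list of (name, lat, lon)."""
    return min(
        ((name, haversine_km(lat, lon, a_lat, a_lon)) for name, a_lat, a_lon in amenities),
        key=lambda pair: pair[1],
    )
```

Calling `nearest_amenity_km(1.3000, 103.8390, [("Orchard", 1.3040, 103.8320), ("Somerset", 1.3005, 103.8390)])` with these approximate, illustrative station coordinates would return the Somerset entry. Running the same call once per amenity category would give one distance feature per category.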
ML Price Prediction (Random Forest & XGBoost)
Both Random Forest and XGBoost regressors were trained and evaluated on REALIS
transaction history. Input features include floor area, floor level, property age,
property subtype, and Haversine distances to amenity categories. The best-performing
model is served for live predictions.
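One way to pick the "best-performing" model is to compare candidates on a held-out validation set. The sketch below assumes RMSE as the selection metric, which the write-up does not specify, and uses two toy stand-in regressors (`MeanBaseline`, `RatePerArea`) so that it runs without third-party packages. In practice the candidates would be objects such as scikit-learn's `RandomForestRegressor` and XGBoost's `XGBRegressor`, which expose the same `fit`/`predict` interface.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def select_best_model(candidates, X_train, y_train, X_val, y_val):
    """Fit every candidate and return (best_name, best_model, scores) by lowest validation RMSE."""
    scores, fitted = {}, {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = rmse(y_val, model.predict(X_val))
        fitted[name] = model
    best_name = min(scores, key=scores.get)
    return best_name, fitted[best_name], scores


class MeanBaseline:
    """Hypothetical stand-in: always predicts the mean training rent."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class RatePerArea:
    """Hypothetical stand-in: predicts rent as a constant rate times floor area (column 0)."""

    def fit(self, X, y):
        self.rate_ = sum(y) / sum(row[0] for row in X)
        return self

    def predict(self, X):
        return [self.rate_ * row[0] for row in X]
```

The winning model can then be persisted and loaded by the serving layer, so that only one model is evaluated per request.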
SHAP Explainability
Every prediction is accompanied by a SHAP waterfall chart showing exactly which
features pushed the price up or down. This gives analysts a principled basis for
their investment thesis rather than a black-box number.
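The waterfall chart rests on SHAP's additivity property: for a single prediction, the base value plus the per-feature contributions sums to the model output. In the notation below, $M$ is the number of input features, $\phi_i(x)$ is the SHAP value of feature $i$ for input $x$, and $\phi_0$ is the expected model output over the background data.

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x), \qquad \phi_0 = \mathbb{E}\left[ f(X) \right]
```

Each bar in the waterfall corresponds to one $\phi_i(x)$, so the bars start at $\phi_0$ and end at the predicted rent $f(x)$.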
Historical Market Data Browser
Analysts can query the underlying REALIS SQLite database with filters for price
range, area, property type, and date range to see comparable transactions and
understand market context around any location.
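A filterable query of this kind can be sketched with Python's built-in `sqlite3` module and parameterised SQL. The table name `transactions` and the column names (`project_name`, `property_type`, `monthly_rent`, `lease_date`) are assumptions for illustration, since the actual schema is not shown here. Dates are assumed to be stored as ISO `YYYY-MM-DD` strings so that string comparison matches chronological order.

```python
import sqlite3


def query_transactions(conn, min_rent=None, max_rent=None, property_type=None,
                       date_from=None, date_to=None, limit=50):
    """Return transactions matching whichever filters are supplied, newest first."""
    clauses, params = [], []
    if min_rent is not None:
        clauses.append("monthly_rent >= ?")
        params.append(min_rent)
    if max_rent is not None:
        clauses.append("monthly_rent <= ?")
        params.append(max_rent)
    if property_type is not None:
        clauses.append("property_type = ?")
        params.append(property_type)
    if date_from is not None:
        clauses.append("lease_date >= ?")
        params.append(date_from)
    if date_to is not None:
        clauses.append("lease_date <= ?")
        params.append(date_to)
    # Only fixed column names are interpolated; user values go through placeholders.
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = (
        "SELECT project_name, property_type, monthly_rent, lease_date "
        f"FROM transactions {where} ORDER BY lease_date DESC LIMIT ?"
    )
    return conn.execute(sql, params + [limit]).fetchall()
```

An area filter would follow the same pattern: append one more clause and one more parameter.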
Architecture
The application is structured as a single Flask server, backed by offline data collection pipelines, with five logical layers:
| Layer | Technology | Responsibility |
| --- | --- | --- |
| Frontend | Jinja2 templates, Leaflet.js | Map interaction, form inputs, result rendering |
| Backend | Flask (Python) | Request routing, feature engineering, model inference |
| ML Layer | scikit-learn, SHAP, joblib | Prediction, explainability, model persistence |
| Data Layer | SQLite, pickle files | Transaction history, amenity cache, trained model artefacts |
| Data Collection | Playwright, BeautifulSoup, Requests | Offline scraping pipelines that populate the DB |
Prediction pipeline (per request):
1. User selects location on map → coordinates sent to Flask
2. Backend computes Haversine distances to all amenity categories
3. Features assembled: area, floor level, property age, property subtype, distances to amenities
4. Best model (Random Forest or XGBoost) loaded from pickle → price predicted
5. SHAP explainer loaded → waterfall values computed
6. Results rendered back to user with chart and comparable transactions
Technical Challenges
OLS Baseline & Model Selection
An OLS regression on Korean apartment data
(MOLIT) served as the initial baseline. Only floor area was statistically significant;
the remaining coefficients could not be distinguished from noise. This failure motivated
the switch to Random Forest and XGBoost, which handle non-linear relationships and
feature interactions that OLS cannot capture.
Anti-scraping on Booking.com
Booking.com's dynamic rendering required Playwright
browser automation rather than simple HTTP requests. Multiple script versions were
developed to handle pagination, lazy-loading, and session management reliably.
Geocoding at Scale
Thousands of property addresses needed coordinates.
Nominatim rate limits required batching with delays and a local cache to avoid
re-geocoding addresses across runs.
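The batching-with-cache approach can be sketched as follows. The geocoder itself is passed in as a function (`geocode_fn`, a hypothetical name) so the caching and throttling logic stays independent of the HTTP client. The JSON cache file and the one-second default delay are likewise assumptions, the latter chosen to stay within the roughly one-request-per-second limit of the public Nominatim service.

```python
import json
import time
from pathlib import Path


def geocode_with_cache(addresses, geocode_fn, cache_path, delay_s=1.0):
    """Geocode addresses, skipping any already stored in the JSON cache at cache_path.

    geocode_fn(address) is assumed to return [lat, lon] or None when nothing is found.
    """
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    for address in addresses:
        if address in cache:
            continue  # already resolved in an earlier run
        cache[address] = geocode_fn(address)
        # Persist after every lookup so an interrupted run loses no work.
        cache_file.write_text(json.dumps(cache))
        time.sleep(delay_s)  # throttle to respect the rate limit
    return {address: cache[address] for address in addresses}
```

Because failed lookups are cached as `None`, they are not retried automatically. Whether that is desirable depends on how often the failures are transient.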
REALIS Data Cleaning
Eight years of REALIS exports arrived as inconsistently
formatted Excel files. A cleaning pipeline standardised column names, property
subtypes, and date formats before loading into SQLite.
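A minimal sketch of the three standardisation steps is shown below, operating on one row represented as a dictionary. The date formats, the subtype alias map, and the column names `property_type` and `lease_date` are all assumptions for illustration. The real exports may use different headers and spellings.

```python
import re
from datetime import datetime

# Assumed input date formats; extend as new variants are found in the exports.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b-%y", "%b %Y")

# Hypothetical alias map from inconsistent spellings to one canonical subtype.
SUBTYPE_ALIASES = {"condo": "Condominium", "apt": "Apartment"}


def normalise_column(name):
    """Convert a header such as 'Monthly Rent ($)' to snake_case: 'monthly_rent'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def parse_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Standardise column names, property subtype, and lease date for one row."""
    cleaned = {normalise_column(key): value for key, value in row.items()}
    if "property_type" in cleaned:
        raw = str(cleaned["property_type"]).strip()
        cleaned["property_type"] = SUBTYPE_ALIASES.get(raw.lower(), raw)
    if "lease_date" in cleaned:
        cleaned["lease_date"] = parse_date(str(cleaned["lease_date"]))
    return cleaned
```

Month-only values such as `Mar-21` resolve to the first day of that month, and unparseable dates become `None` so they can be inspected before loading into SQLite.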