Property Price Analyser

Overview

Built during my capstone project, this tool helps transaction analysts make data-driven co-living and rental property investment decisions in Singapore. It combines market data aggregation, geospatial analysis, and machine learning to estimate rental prices and explain the drivers behind each prediction.

The primary users are investment analysts who need to quickly assess the rental potential of a given location without manually cross-referencing multiple data sources.

The project followed an end-to-end analytics pipeline: data scraping and geocoding, feature engineering, model selection, and a Flask dashboard. An OLS regression was evaluated as a methodological baseline. Its poor explanatory power prompted the switch to tree-based models (Random Forest and XGBoost) for the Singapore properties, which handled non-linearity and feature interactions far better.

The Problem

Rental price estimation in Singapore requires synthesising data across many dimensions. Investment analysts start with comparables analysis: finding similar properties in the same area and adjusting for differences in size, age, and floor level. They also need to factor in location-specific signals such as proximity to MRT stations, parks, and shopping malls. Finally, they must read broader market trends from historical transaction records. Done manually, this is slow, inconsistent, and heavily dependent on analyst experience.

There was no single tool that pulled these signals together into a queryable interface with a principled model underneath.

The Solution

By integrating data from multiple sources (see Data Sources below), I trained a machine learning model to predict rental prices from property features and location-based amenity proximity scores. The model is served through a Flask app where users click on a map to select a location in Singapore's Core Central Region (CCR) and get an instant price estimate, with SHAP explainability charts showing each feature's contribution.

Flask app demo: map-based rental price prediction

Data Sources

Source	Data	Collection Method
REALIS	Rental & sales transactions 2017–2025	Web scraping
Booking.com	Hotel listings, pricing, location	Playwright browser automation
Cove (co-living)	Co-living room rates & locations	Web scraping
OpenStreetMap / Nominatim	Coordinates for all properties & amenities	Geocoding API
Data.gov.sg	MRT stations, schools, parks, hospitals	Public datasets

Features

Interactive Map Location Picker

Users click anywhere on a map of Singapore to select a location. The app captures the coordinates client-side and passes them to the backend, eliminating the need to know an exact address or postal code.

Amenity Proximity Scoring

For any selected location, the app computes Haversine distances to the nearest MRT stations, hospitals, parks, shopping malls, schools, and hotels. These distances become model features and are also surfaced to the user directly.
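
As a minimal sketch of the distance step, the Haversine formula can be written with the standard library alone. The function names and the `(name, lat, lon)` amenity tuple layout are illustrative assumptions, not the app's actual code.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_amenity(lat, lon, amenities):
    """Return (distance_km, name) for the closest amenity.

    `amenities` is an assumed layout: an iterable of (name, lat, lon) tuples.
    """
    return min(
        (haversine_km(lat, lon, a_lat, a_lon), name)
        for name, a_lat, a_lon in amenities
    )
```

Calling `nearest_amenity` once per amenity category would yield one distance feature per category, which matches how the distances are described as model inputs.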

ML Price Prediction (Random Forest & XGBoost)

Both Random Forest and XGBoost regressors were trained and evaluated on REALIS transaction history. Input features include floor area, floor level, property age, property subtype, and Haversine distances to amenity categories. The best-performing model is served for live predictions.
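
One way to pick the "best-performing" model is to compare the fitted candidates on a held-out validation set. The sketch below assumes RMSE as the metric, which is not stated in this write-up, and works with any fitted object exposing a `predict` method (as scikit-learn and XGBoost regressors do). The function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def select_best_model(candidates, X_val, y_val):
    """Return (best_name, best_model, scores) using lowest validation RMSE.

    `candidates` maps a label to an already-fitted model with .predict(X).
    RMSE is an assumed metric; swap in MAE or R^2 as needed.
    """
    scores = {
        name: rmse(y_val, list(model.predict(X_val)))
        for name, model in candidates.items()
    }
    best_name = min(scores, key=scores.get)
    return best_name, candidates[best_name], scores
```

In use, `candidates` might look like `{"random_forest": rf, "xgboost": xgb}`, with the winner then persisted for the Flask app to load.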

SHAP Explainability

Every prediction is accompanied by a SHAP waterfall chart showing exactly which features pushed the price up or down. This gives analysts a principled basis for their investment thesis rather than a black-box number.
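
The waterfall chart rests on the additivity of SHAP values: the explainer's base value plus the per-feature contributions equals the model's prediction for that row. The sketch below only illustrates that arithmetic; the SHAP library computes and plots these values itself, and the function name and ordering by magnitude are assumptions made for illustration.

```python
def waterfall_steps(base_value, contributions):
    """Accumulate SHAP contributions from the base value, largest magnitude first.

    `contributions` maps feature name -> SHAP value, in the same units as the
    prediction. Returns (steps, total), where each step is
    (feature, contribution, running_total) and total is the final prediction.
    """
    steps = []
    running = base_value
    ordered = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    for feature, value in ordered:
        running += value
        steps.append((feature, value, running))
    return steps, running
```

Positive contributions push the estimate above the base value and negative ones pull it down, which is what each bar in the chart represents.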

Historical Market Data Browser

Analysts can query the underlying REALIS SQLite database with filters for price range, area, property type, and date range to see comparable transactions and understand market context around any location.
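
A filtered query of this kind can be sketched with the standard `sqlite3` module. The table name, column names, and schema below are hypothetical, since the real REALIS schema is not shown here, and "area" is assumed to mean floor area. Filters left as `None` are simply skipped, and values are passed as bound parameters rather than interpolated into the SQL string.

```python
import sqlite3

# Hypothetical schema; the actual table and column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS transactions (
    project_name   TEXT,
    property_type  TEXT,
    floor_area_sqm REAL,
    monthly_rent   REAL,
    lease_date     TEXT  -- ISO 8601 (YYYY-MM-DD), so text comparison orders by date
)
"""


def query_transactions(conn, *, min_rent=None, max_rent=None, min_area=None,
                       max_area=None, property_type=None,
                       date_from=None, date_to=None):
    """Return rows matching whichever filters were supplied, newest first."""
    filters = [
        ("monthly_rent >= ?", min_rent),
        ("monthly_rent <= ?", max_rent),
        ("floor_area_sqm >= ?", min_area),
        ("floor_area_sqm <= ?", max_area),
        ("property_type = ?", property_type),
        ("lease_date >= ?", date_from),
        ("lease_date <= ?", date_to),
    ]
    active = [(clause, value) for clause, value in filters if value is not None]
    sql = ("SELECT project_name, property_type, floor_area_sqm, "
           "monthly_rent, lease_date FROM transactions")
    if active:
        sql += " WHERE " + " AND ".join(clause for clause, _ in active)
    sql += " ORDER BY lease_date DESC"
    return conn.execute(sql, [value for _, value in active]).fetchall()
```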

Architecture

The application is structured as a single Flask server with four logical layers, supported by an offline data collection pipeline:

Layer	Technology	Responsibility
Frontend	Jinja2 templates, Leaflet.js	Map interaction, form inputs, result rendering
Backend	Flask (Python)	Request routing, feature engineering, model inference
ML Layer	scikit-learn, XGBoost, SHAP, joblib	Prediction, explainability, model persistence
Data Layer	SQLite, pickle files	Transaction history, amenity cache, trained model artefacts
Data Collection	Playwright, BeautifulSoup, Requests	Offline scraping pipelines that populate the DB

Prediction pipeline (per request):

User selects location on map → coordinates sent to Flask
Backend computes Haversine distances to all amenity categories
Features assembled: area, floor level, property age, property subtype, distances to amenities
Best model (Random Forest or XGBoost) loaded from pickle → price predicted
SHAP explainer loaded → waterfall values computed
Results rendered back to user with chart and comparable transactions
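
The feature-assembly and prediction steps above can be sketched as follows. The feature order, the one-hot encoding of property subtype, and the subtype labels are all assumptions for illustration; the row layout at inference time has to match whatever layout the model was trained on.

```python
AMENITY_CATEGORIES = ("mrt", "hospital", "park", "mall", "school", "hotel")
SUBTYPES = ("apartment", "condominium")  # hypothetical labels


def assemble_features(floor_area, floor_level, property_age, subtype, distances_km):
    """Build one model input row in a fixed column order.

    `distances_km` maps each amenity category to its nearest Haversine distance.
    """
    missing = [c for c in AMENITY_CATEGORIES if c not in distances_km]
    if missing:
        raise ValueError(f"missing amenity distances: {missing}")
    if subtype not in SUBTYPES:
        raise ValueError(f"unknown property subtype: {subtype!r}")
    row = [float(floor_area), float(floor_level), float(property_age)]
    row += [1.0 if subtype == s else 0.0 for s in SUBTYPES]  # assumed one-hot encoding
    row += [float(distances_km[c]) for c in AMENITY_CATEGORIES]
    return row


def predict_price(model, row):
    """Run a single-row prediction on any fitted model exposing .predict(X)."""
    return float(model.predict([row])[0])
```

Keeping the column order in one place makes it harder for training and serving code to drift apart.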

Technical Challenges

OLS Baseline & Model Selection

An OLS regression on Korean apartment data (MOLIT) served as the initial baseline. Only floor area was statistically significant; no other coefficient was distinguishable from noise. This failure motivated the switch to Random Forest and XGBoost, which handle non-linear relationships and feature interactions that OLS cannot capture.

Anti-scraping on Booking.com

Booking.com's dynamic rendering required Playwright browser automation rather than simple HTTP requests. Multiple script versions were developed to handle pagination, lazy-loading, and session management reliably.

Geocoding at Scale

Thousands of property addresses needed coordinates. Nominatim rate limits required batching with delays and a local cache to avoid re-geocoding addresses across runs.
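
A minimal sketch of the cache-plus-delay pattern, assuming a JSON file as the cache and a caller-supplied `geocode` function that wraps the actual Nominatim request and returns `(lat, lon)` or `None`. Both are hypothetical stand-ins for the project's own code. The one-second default reflects Nominatim's public usage policy of at most one request per second.

```python
import json
import time
from pathlib import Path


def geocode_all(addresses, geocode, cache_path, delay_s=1.0, sleep=time.sleep):
    """Geocode addresses, skipping any already present in the on-disk cache.

    `geocode` is a caller-supplied function: address -> (lat, lon) or None.
    The cache is rewritten after every lookup so an interrupted run keeps
    its progress. Returns the full cache as {address: [lat, lon] or None}.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for address in addresses:
        if address in cache:
            continue  # already geocoded in this or an earlier run
        result = geocode(address)
        cache[address] = list(result) if result is not None else None
        path.write_text(json.dumps(cache), encoding="utf-8")
        sleep(delay_s)  # stay under the rate limit between live requests
    return cache
```

Rewriting the whole file on each lookup is simple but slow for very large batches; a SQLite table would be a natural replacement at that scale.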

REALIS Data Cleaning

Eight years of REALIS exports arrived as inconsistently formatted Excel files. A cleaning pipeline standardised column names, property subtypes, and date formats before loading into SQLite.
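
The three standardisation steps can be sketched row by row with the standard library. The date formats, subtype aliases, and column names below are assumptions chosen for illustration; the real exports may use different ones.

```python
import re
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b-%y")  # assumed input formats
SUBTYPE_ALIASES = {"condo": "condominium", "apt": "apartment"}  # hypothetical


def normalise_column(name):
    """Lower-case a header and collapse non-alphanumeric runs to underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower()).strip("_")


def normalise_subtype(value):
    """Trim, lower-case, collapse whitespace, then map known aliases."""
    key = " ".join(value.strip().lower().split())
    return SUBTYPE_ALIASES.get(key, key)


def parse_date(text):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def clean_row(row, date_column="lease_date", subtype_column="property_type"):
    """Standardise one record (a dict of raw header -> raw value)."""
    cleaned = {normalise_column(k): v for k, v in row.items()}
    cleaned[date_column] = parse_date(cleaned[date_column])
    cleaned[subtype_column] = normalise_subtype(cleaned[subtype_column])
    return cleaned
```

Storing dates as ISO 8601 text means they sort and compare correctly as plain strings once loaded into SQLite.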