Taipei House Value Prediction
Comparative analysis of MARS, Generalized Additive Models, and Ridge Regression on Taipei housing records — where confirmed non-linearities in house age and MRT distance demand flexible non-parametric approaches.
414
Housing Records · Taipei & New Taipei City
3
Regression Methods Compared
58.80
Best MSE — GAMs (Backfitting)
The Problem
Taipei house prices are shaped by non-linear relationships with MRT proximity and building age — structural patterns that linear regression cannot capture and that require flexible non-parametric modeling
House prices in Taipei City and New Taipei City are determined by a set of predictors whose relationships with price are demonstrably non-linear. Price falls with distance to the nearest MRT station, but not at a constant rate: prices are very high close to a station, drop rapidly over the first stretch of distance, and then flatten out. House age follows a parabolic pattern: new houses command high prices, prices fall as houses age, but older historic properties recover value, giving a U-shaped relationship that no single linear term can represent. Transaction date adds a temporal market dimension. These non-linearities are not merely assumed; they are confirmed in the data and represent the core modeling challenge. Ordinary linear regression and Ridge regression, which assume linear predictor-response relationships, will produce systematically biased predictions wherever these curves deviate from straight lines. The question is which non-parametric approach better captures these structural patterns: MARS with piecewise linear basis functions, or GAMs with smoothing splines fitted by backfitting.
The Solution
A three-method comparison of MARS, GAMs, and Ridge — with cross-validated degree and nk for MARS and per-predictor df tuning for GAMs — evaluated on MSE to determine which approach best models Taipei's non-linear housing market
MARS is fitted using the earth package with cross-validation over degree (1, 2, 3) and nk, the maximum number of terms created in the forward pass (5 to 50). The selected combination is degree=1 and nk=10, indicating that variable interactions do not meaningfully improve predictions on this dataset. The resulting MARS model builds piecewise linear basis functions (hinge functions) around key knot values for dist_mrt, house_age, lat, and tr_date. GAMs are fitted using the gam package with a backfitting procedure, applying a smoothing spline to each predictor with per-predictor degrees of freedom tuned to capture each variable's curve shape: tr_date (df=3), house_age (df=8), dist_mrt (df=6), n_stores (df=3), lat (df=6), and long (df=7). Ridge regression provides a regularized linear baseline via glmnet. All three are compared on MSE to identify which approach best handles the confirmed non-linearities in this real estate dataset.
Key Outcome
GAMs achieve the best MSE (58.80) — outperforming MARS (61.07) and Ridge (80.06) — with GAMs' smooth backfitting procedure better capturing the continuous non-linear curves in this dataset than MARS's piecewise linear approximation, and both non-parametric methods substantially outperforming the linear Ridge baseline.
Technical Deep Dive
Methodology & Analysis
Analytical Workflow
Stage 1 — Data Characterization & Non-Linearity Diagnosis
Dataset
414 Records · 6 Predictors
Taipei & New Taipei City · Jun 2012 – May 2013 · Target: price per unit area (10K NTD/Ping)
Non-Linearity Confirmed
Parabolic Age · Inverse MRT Distance
house_age: U-shaped (new and historic homes both high-priced) · dist_mrt: rapid non-linear price decay with distance
Baseline Ruled Out
Linear Methods Inadequate
Ridge included as linear benchmark · Non-parametric methods required for confirmed non-linear relationships
Stage 2 — Three-Method Fitting
MARS · earth
Degree=1 · nk=10
CV over degree (1–3) & nk (5–50) · No interactions optimal · Hinge functions on dist_mrt, house_age, lat, tr_date · MSE 61.07
GAMs · gam
Backfitting · Per-Predictor df
tr_date df=3 · house_age df=8 · dist_mrt df=6 · n_stores df=3 · lat df=6 · long df=7 · MSE 58.80 (best)
Ridge · glmnet
Linear Regularized Baseline
L2 penalty · All 6 predictors retained · Cannot model non-linearities · MSE 80.06
Stage 3 — Comparative Evaluation
Conclusion
GAMs Best (MSE 58.80) · MARS Close (MSE 61.07) · Ridge Substantially Worse (MSE 80.06)
Variable interactions not strongly present · Smooth non-linearities favor GAMs backfitting over MARS piecewise approximation · Linear methods inadequate for this dataset
Stage 1
Data Characterization & Non-Linearity Diagnosis
The dataset covers 414 housing transactions in Taipei City and New Taipei City between June 2012 and May 2013, with 6 predictors: transaction date, house age, distance to the nearest MRT station, number of nearby convenience stores, latitude, and longitude. Exploratory analysis confirms two key non-linear relationships. House age follows a parabolic U-shaped curve: new properties are most expensive, prices fall as houses age, then recover for older buildings with historic value, a pattern no single linear term can capture. Price decreases non-linearly with distance to the nearest MRT station: it drops rapidly at short distances and flattens at long distances. Both patterns, along with hypothesized interactions between dist_mrt, n_stores, and house_age, motivate the use of non-parametric regression methods.
Stage 2
MARS Fitting & Interaction Testing
MARS is fitted using the earth package with cross-validation over degree (1, 2, 3) and nk (5 to 50). Degree controls the maximum order of variable interactions allowed: degree=1 means no interactions, degree=2 allows pairwise interactions, and degree=3 allows three-way interactions. Cross-validation selects degree=1 and nk=10 as optimal, indicating that variable interactions do not meaningfully improve predictions on this dataset despite being theoretically plausible. The final MARS model builds 7 hinge functions from basis function pairs of the form max(0, x - t) and max(0, t - x), where t is a knot value, for dist_mrt, house_age, lat, and tr_date. Together they produce a piecewise linear approximation of the non-linear curves present in the data.
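The hinge pair described above can be sketched in a few lines. This is an illustration in Python rather than the R earth model itself; the function names, the knot, and both coefficients are hypothetical values chosen only to show how two opposing hinges form a V-shaped, piecewise linear approximation of a U-shaped curve.

```python
def hinge_pair(x, knot):
    """Return the MARS basis pair (max(0, x - t), max(0, t - x)) for one knot t."""
    return max(0.0, x - knot), max(0.0, knot - x)


def piecewise_linear(x, intercept, knot, coef_right, coef_left):
    """Degree-1 MARS-style prediction built from a single hinge pair."""
    right, left = hinge_pair(x, knot)
    return intercept + coef_right * right + coef_left * left


# Hypothetical values, not the fitted model: both coefficients are positive,
# so the prediction is lowest at the knot and rises linearly on either side.
for age in (0.0, 10.0, 20.0, 30.0, 40.0):
    print(age, piecewise_linear(age, intercept=30.0, knot=20.0,
                                coef_right=0.5, coef_left=0.8))
```

Only one hinge of each pair is non-zero for any given x, which is why a degree=1 model stays additive and easy to read term by term.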
Stage 3
GAMs Fitting via Backfitting
GAMs are fitted using the gam package with a backfitting procedure: the smoothed fit for each predictor is updated in turn against the partial residuals, holding all other components fixed, until convergence. Each predictor receives its own tuned degrees of freedom. house_age receives the most flexibility (df=8), reflecting its complex parabolic shape; long (df=7), dist_mrt (df=6), and lat (df=6) follow; tr_date (df=3) and n_stores (df=3) require less complexity. This per-predictor tuning allows GAMs to allocate modeling capacity where the data's non-linearities are most pronounced, which is a key advantage over MARS, where every predictor is modeled within the same piecewise linear framework.
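The backfitting loop can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the gam package's implementation: a centred moving average stands in for the smoothing splines, a fixed number of sweeps replaces a convergence test, the data are synthetic rather than the Taipei records, and all function names are hypothetical.

```python
import math
import random


def running_mean_smoother(x, r, window=11):
    """Smooth r against x with a centred moving average over the sorted x values.

    A crude stand-in for a smoothing spline; a wider window gives a smoother fit.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = window // 2
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(x), pos + half + 1)
        neighbours = [r[order[j]] for j in range(lo, hi)]
        fitted[i] = sum(neighbours) / len(neighbours)
    return fitted


def backfit(columns, y, n_iter=20, window=11):
    """Fit y ~ alpha + sum_j f_j(x_j) by cycling over the predictors."""
    n = len(y)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in columns]
    for _ in range(n_iter):  # fixed number of sweeps (assumption, no convergence test)
        for j, xj in enumerate(columns):
            # Partial residuals: remove the intercept and every other component.
            partial = [
                y[i] - alpha - sum(f[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            fj = running_mean_smoother(xj, partial, window)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # centre each component
    return alpha, f


# Synthetic data: a U-shape in x1 plus a decay in x2, with Gaussian noise.
random.seed(0)
n = 200
x1 = [random.uniform(0, 40) for _ in range(n)]
x2 = [random.uniform(0, 5) for _ in range(n)]
y = [0.02 * (a - 20) ** 2 + 10 * math.exp(-b) + random.gauss(0, 0.5)
     for a, b in zip(x1, x2)]

alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(n)]
mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n
print(round(mse, 3))
```

In this sketch the single `window` argument plays the role that per-predictor df plays in the gam fit; passing a different window for each column would mimic the per-predictor tuning described above.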
Key Methodological Choices
Non-parametric methods required — non-linearities are confirmed, not assumed
The choice to use MARS and GAMs rather than linear methods is grounded in data evidence, not modeling preference. Scatter plots of house_age vs price and dist_mrt vs price confirm structural non-linearities before any model is fitted. Ridge regression is included as a regularized linear benchmark to quantify the cost of the linearity assumption. Its MSE of 80.06 versus GAMs' 58.80 makes that cost concrete and measurable, providing an empirical rather than theoretical argument for the non-parametric approach.
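A toy calculation illustrates why a linear benchmark struggles with a U-shaped predictor regardless of the penalty. The sketch below uses the closed-form ridge slope for a single predictor with an unpenalized intercept; the data are a synthetic, noise-free parabola, not the Taipei records, and the function names are hypothetical.

```python
def ridge_fit(x, y, lam):
    """One-predictor ridge fit: slope = Sxy / (Sxx + lam), intercept unpenalized."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return x_bar, y_bar, sxy / (sxx + lam)


def mse_linear(x, y, lam):
    """In-sample MSE of the ridge line for penalty lam."""
    x_bar, y_bar, slope = ridge_fit(x, y, lam)
    return sum((b - (y_bar + slope * (a - x_bar))) ** 2 for a, b in zip(x, y)) / len(x)


# Synthetic U-shape centred at age 20 (illustrative values only).
ages = [float(a) for a in range(0, 41)]
prices = [0.05 * (a - 20.0) ** 2 + 30.0 for a in ages]

mean_price = sum(prices) / len(prices)
variance = sum((p - mean_price) ** 2 for p in prices) / len(prices)

for lam in (0.0, 10.0, 1000.0):
    print(lam, round(mse_linear(ages, prices, lam), 3), round(variance, 3))
```

Because the parabola is symmetric about its centre, the best-fitting line is flat and its MSE equals the variance of the response for every penalty value, including zero. The error comes from the linear form, not from the regularization.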
MARS degree=1 — interactions theoretically plausible but empirically unsupported
Prior to fitting, several interactions were hypothesized as plausible — particularly between dist_mrt, n_stores, and house_age, and between tr_date and dist_mrt. Cross-validation over degree=1, 2, and 3 tests whether these interactions actually improve predictive performance. The result — degree=1 optimal — means the data does not support including interactions: the marginal improvement from modeling joint effects does not compensate for the additional variance they introduce. Reporting this null interaction finding is as informative as confirming interactions would have been, since it simplifies the final model considerably.
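The selection procedure amounts to scoring every hyperparameter combination on the same cross-validation folds and keeping the one with the lowest held-out MSE. The sketch below shows that loop in Python under clear assumptions: a toy k-nearest-neighbour regressor with a single hyperparameter `k` stands in for earth with its degree and nk grid, the data are synthetic, and every function name is hypothetical.

```python
import itertools
import math
import random


def kfold_indices(n, n_folds, seed=0):
    """Shuffle 0..n-1 once and deal the indices into n_folds folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def cv_mse(fit_predict, params, x, y, folds):
    """Average held-out MSE of fit_predict(train_x, train_y, test_x, **params)."""
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        preds = fit_predict([x[i] for i in train_idx], [y[i] for i in train_idx],
                            [x[i] for i in test_idx], **params)
        fold_errors.append(
            sum((y[i] - p) ** 2 for i, p in zip(test_idx, preds)) / len(test_idx))
    return sum(fold_errors) / len(fold_errors)


def grid_search(fit_predict, grid, x, y, n_folds=5):
    """Score every combination in grid on the same folds; return (best, scores)."""
    folds = kfold_indices(len(x), n_folds)
    names = list(grid)
    scores = {}
    for values in itertools.product(*(grid[name] for name in names)):
        scores[values] = cv_mse(fit_predict, dict(zip(names, values)), x, y, folds)
    best = min(scores, key=scores.get)
    return dict(zip(names, best)), scores


def knn_fit_predict(train_x, train_y, test_x, k):
    """Toy stand-in model: mean response of the k nearest training points."""
    preds = []
    for x0 in test_x:
        nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
        preds.append(sum(train_y[i] for i in nearest) / len(nearest))
    return preds


# Synthetic data, not the Taipei records.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(150)]
y = [math.sin(v) + rng.gauss(0, 0.3) for v in x]

best, scores = grid_search(knn_fit_predict, {"k": [1, 5, 15, 120]}, x, y)
print(best, {key: round(val, 3) for key, val in scores.items()})
```

Because `grid_search` iterates over the Cartesian product of the grid's values, a two-parameter grid such as degree by nk would be handled by the same loop. A more flexible setting is kept only if it lowers held-out error, which is the criterion behind the degree=1 result.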
GAMs preferred over MARS — smooth non-linearities favor continuous splines over piecewise approximation
MARS approximates non-linear curves using piecewise linear hinge functions — effective for sharp breakpoints but less precise for smooth, continuously varying relationships. The non-linearities in this dataset — the gradual MRT distance decay and the parabolic age curve — are smooth rather than abrupt, making spline-based GAMs a more natural fit. GAMs' backfitting procedure also allows each predictor's smoothness to be independently calibrated, allocating the most degrees of freedom to the most complex predictors (house_age at df=8) and fewer to simpler ones — a degree of precision that MARS's uniform piecewise framework cannot replicate.
Tech Stack
| Technology | Purpose |
|---|---|
| R | Statistical modeling environment and primary implementation language |
| earth (R package) | MARS implementation — forward selection of hinge basis functions with backward deletion to prevent overfitting |
| gam (R package) | GAMs implementation — backfitting with smoothing splines and per-predictor degrees of freedom tuning |
| glmnet (R package) | Ridge regression: regularized linear baseline for quantifying the cost of the linearity assumption |
Results & Metrics
What the analysis reveals
58.80
Best MSE — GAMs
Backfitting with per-predictor smoothing splines — best captures smooth non-linearities
36%
Ridge MSE Penalty
Ridge MSE (80.06) is 36% higher than GAMs' (58.80): the measurable cost of the linear assumption on this dataset
dist_mrt
Strongest Price Driver
Distance to nearest MRT station — the most influential predictor of house value in both MARS and GAMs
GAMs outperform MARS — smooth backfitting better captures continuous non-linear curves
GAMs achieve an MSE of 58.80 versus MARS's 61.07. The gap of 2.27 is modest in absolute terms, but its direction matches what the shape of the data would suggest. The non-linearities in this dataset are smooth and continuously varying rather than abrupt: the MRT distance decay is a gradual curve, and the house age parabola is rounded rather than kinked. GAMs' backfitting with smoothing splines is better suited to smooth curves than MARS's piecewise linear hinge functions, which approximate curves as sequences of straight-line segments. The slight GAMs advantage is consistent with the underlying geometry of the data.
MRT distance is the dominant predictor — proximity to transit is the primary price driver
Distance to the nearest MRT station emerges as the strongest predictor of house price in both MARS and GAMs. In the MARS model, dist_mrt is assigned the largest hinge function coefficient: at distances below 1,144 meters, each meter closer to the station adds 0.019 units of price per unit area. In GAMs, dist_mrt receives df=6, reflecting the complexity of its fitted smooth curve. The finding is economically interpretable: in Taiwan, where MRT networks serve as the primary daily transportation infrastructure, proximity to transit is directly capitalized into residential property values, in a way that a single linear term cannot adequately represent.
House age captures the U-shaped premium for both new construction and historic properties
In the MARS model, house age enters through two opposing hinge functions — one capturing the price premium for houses younger than 27.1 years, and another for houses older than 27.1 years — together reproducing the parabolic relationship. In GAMs, house_age receives the highest degrees of freedom (df=8), reflecting the complexity of its fitted smooth curve. The dual-direction effect is interpretable: new construction commands high prices for modern amenities; older historic properties recover value through scarcity, aesthetic character, and location in established neighborhoods that have appreciated over time.
Variable interactions are theoretically plausible but empirically unsupported
Prior to fitting, interactions between dist_mrt, n_stores, and house_age — and between tr_date and dist_mrt — were identified as plausible based on domain reasoning: newly built homes near MRT stations and convenience stores should command a multiplicative premium. Cross-validation across MARS degree=1, 2, and 3 tests this hypothesis directly. The optimal degree=1 (no interactions) contradicts the prior expectation — on this dataset, the joint effects of predictor pairs do not improve predictions beyond their individual contributions. The MARS model with degree=2 does recover a tr_date × dist_mrt interaction, but it does not reduce MSE enough to justify the added complexity.
Ridge's MSE of 80.06 quantifies the cost of assuming linearity on a non-linear dataset
Ridge regression achieves an MSE of 80.06, which is 36% higher than GAMs and 31% higher than MARS. This gap is not primarily attributable to underfitting from the regularization penalty; it reflects the fundamental inability of any linear model to represent the parabolic house age effect and the rapid non-linear MRT distance decay. Ridge regression is included not as a competitive alternative but as a principled baseline that makes the non-parametric advantage measurable. Its poor relative performance confirms that the non-linearities in this dataset are large enough to meaningfully harm prediction accuracy when ignored.
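The relative gaps follow directly from the three reported MSE values. A short check, with a hypothetical helper name, makes the arithmetic explicit. Note that the direction of the comparison matters: Ridge's MSE is about 36% above GAMs', while GAMs' MSE is about 27% below Ridge's.

```python
mse = {"GAMs": 58.80, "MARS": 61.07, "Ridge": 80.06}  # values reported above


def pct_higher(a, b):
    """Percentage by which a exceeds b."""
    return 100.0 * (a - b) / b


print(round(pct_higher(mse["Ridge"], mse["GAMs"]), 1))   # 36.2
print(round(pct_higher(mse["Ridge"], mse["MARS"]), 1))   # 31.1
print(round(pct_higher(mse["MARS"], mse["GAMs"]), 1))    # 3.9
print(round(-pct_higher(mse["GAMs"], mse["Ridge"]), 1))  # 26.6
```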