Econometric analysis of the determinants of high rent burden among Spanish households using EPF 2023 microdata. Includes data preparation, logistic regression modeling, evaluation metrics, and interpretation of key factors linked to housing vulnerability.
This repository contains an econometric analysis of the determinants of high rent burden among Spanish households. Using microdata from the Household Budget Survey (EPF) 2023, the study examines which socioeconomic, demographic, geographic and housing-related factors drive a household to allocate an unusually high share of total expenditure to rent.
The core of the project is a logistic regression (Logit) model that identifies households with a high rent burden, defined as those whose share of total expenditure devoted to rent on the principal dwelling is more than one standard deviation above the mean.
This project is part of my academic portfolio for Economics.
- Quantify the factors influencing the likelihood of experiencing a high rent-to-expenditure ratio.
- Identify significant predictors of housing vulnerability.
- Evaluate the predictive performance of a logistic regression model using microdata.
- Provide insights relevant for public policies aimed at improving rental affordability.
- Source: Encuesta de Presupuestos Familiares (EPF) 2023 – INE
- Dependent variable: High rent burden (binary)
- Key explanatory variables:
  - Household income
  - Dwelling size (usable floor area)
  - Age of the household reference person
  - Number of household members
  - Education level
  - Labour status
  - Municipal size and population density
  - Sex of the household reference person
The original EPF microdata cannot be redistributed due to INE licensing restrictions.
Only simulated/sample data or scripts for preprocessing are included in this repository.
- Construction of the rent burden variable (rent expenditure / total household expenditure).
- Categorization into low, medium and high rent burden using ±1 standard deviation around the mean.
- Binary dependent variable: 1 = high rent burden, 0 = otherwise.
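
The three preparation steps above can be sketched in a few lines. This is a minimal illustration and not the repository's preprocessing script: the `households` values are invented (they are not EPF data), the names `burden`, `categorize` and `high_burden` are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev

# Hypothetical (rent, total expenditure) pairs in euros per year; NOT EPF data.
households = [
    (4800, 24000),
    (6000, 20000),
    (7200, 18000),
    (3000, 30000),
    (9000, 15000),
    (5400, 27000),
]

# Rent burden = rent expenditure / total household expenditure.
burden = [rent / total for rent, total in households]

# Assumption: sample standard deviation (n - 1 denominator).
mu, sd = mean(burden), stdev(burden)


def categorize(b):
    """Classify a rent burden relative to the mean +/- 1 standard deviation."""
    if b > mu + sd:
        return "high"
    if b < mu - sd:
        return "low"
    return "medium"


category = [categorize(b) for b in burden]

# Binary dependent variable: 1 = high rent burden, 0 = otherwise.
high_burden = [1 if c == "high" else 0 for c in category]
```

In this toy sample only the household spending 60% of its budget on rent falls above the upper cut-off, so it is the only one coded as 1.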
A Logit model is estimated to identify determinants of high rent burden.
Key steps include:
- Variable selection based on economic and sociodemographic literature.
- Omnibus test for overall model significance.
- Parameter estimation and hypothesis testing.
- Influence analysis using Cook’s distance.
- Re-estimation after filtering high-influence observations.
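
As a rough sketch of this workflow, the block below fits a Logit by Newton-Raphson, computes the omnibus likelihood-ratio statistic against an intercept-only model, derives Cook's distances, and re-estimates after dropping high-influence observations. It is written with NumPy only so that each step is visible; in practice a statistics package would normally do this. The data are simulated (not EPF), all function and variable names are hypothetical, and the `4 / n` cut-off is a common rule of thumb assumed here, not the threshold used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Simulated stand-ins for two covariates (NOT EPF data), standardized for stability.
size = rng.normal(80, 20, n)   # dwelling size in square metres
age = rng.normal(50, 15, n)    # age of the reference person
X = np.column_stack([
    np.ones(n),
    (size - size.mean()) / size.std(),
    (age - age.mean()) / age.std(),
])
true_beta = np.array([-1.5, -0.8, -0.5])  # arbitrary values used to simulate y
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)


def fit_logit(X, y, n_iter=25):
    """Maximum-likelihood Logit coefficients via Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta


def log_lik(X, y, beta):
    """Bernoulli log-likelihood at the given coefficients."""
    p = 1 / (1 + np.exp(-X @ beta))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def cooks_distance(X, y, beta):
    """Cook's distance for a Logit: r_i^2 * h_i / (k * (1 - h_i)^2)."""
    p = 1 / (1 + np.exp(-X @ beta))
    w = p * (1 - p)
    xtwx_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    h = w * np.einsum("ij,jk,ik->i", X, xtwx_inv, X)   # leverages
    pearson = (y - p) / np.sqrt(w)                     # Pearson residuals
    return pearson**2 * h / (X.shape[1] * (1 - h) ** 2)


beta = fit_logit(X, y)

# Omnibus test: likelihood-ratio statistic versus the intercept-only model.
ll_null = log_lik(X[:, :1], y, fit_logit(X[:, :1], y))
lr_stat = 2 * (log_lik(X, y, beta) - ll_null)
lr_df = X.shape[1] - 1  # compare lr_stat with a chi-squared on lr_df degrees of freedom

# Influence analysis and re-estimation without high-influence observations.
d = cooks_distance(X, y, beta)
keep = d <= 4 / n  # assumption: rule-of-thumb cut-off
beta_refit = fit_logit(X[keep], y[keep])
```

Comparing `beta` with `beta_refit` shows how much the estimates depend on the filtered observations; large shifts would suggest the original fit was driven by a few households.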
- Confusion matrix and classification metrics.
- Sensitivity, specificity, precision and error rates.
- ROC curve and AUC calculation.
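
The evaluation metrics listed above can be computed directly from predicted probabilities. The sketch below uses a handful of made-up labels and scores (not model output from this project), hypothetical function names, and an assumed classification threshold of 0.5. The AUC is obtained from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
precision = tp / (tp + fp)            # positive predictive value
accuracy = (tp + tn) / len(y_true)
error_rate = 1 - accuracy
auc = roc_auc(y_true, scores)
```

Unlike the other metrics, the AUC does not depend on the chosen threshold, which is why it is usually reported alongside the confusion matrix.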
Significant predictors:
- Dwelling size (negative effect): Larger dwellings reduce the probability of high rent burden.
- Age of household head (negative effect): Younger households show higher vulnerability.
- Household income: Effect per unit is nearly zero, but large income differences matter in practical terms.
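
The income result is easier to read through odds ratios. In a Logit model, a change of Δ units in a covariate multiplies the odds by exp(β·Δ), so a coefficient that looks negligible per euro can still imply a sizeable effect across realistic income gaps. The coefficient below is a hypothetical value chosen for illustration, including its sign; it is not the estimate from this analysis.

```python
import math

# Hypothetical log-odds change per additional euro of annual income (NOT the estimate).
beta_income = -0.00005

# Odds ratio implied by income differences of 1, 1,000 and 10,000 euros.
odds_ratios = {delta: math.exp(beta_income * delta) for delta in (1, 1_000, 10_000)}

for delta, ratio in odds_ratios.items():
    print(f"income difference of {delta:>6} euros -> odds ratio {ratio:.4f}")
```

With this illustrative coefficient, one extra euro leaves the odds essentially unchanged, while a 10,000-euro difference multiplies them by exp(-0.5), roughly 0.61.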
Non-significant variables:
Education, household size, labour status, sex, population density, and municipal size do not show significant effects after controlling for other variables.
Model performance:
- Accuracy: 85.1%
- Sensitivity: 98.6%
- Specificity: 8.5%
- ROC AUC: 0.756
The model detects nearly all vulnerable households (sensitivity of 98.6%, so few false negatives), but its specificity is very low: it classifies most non-vulnerable households as vulnerable (many false positives).
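
Accuracy, sensitivity and specificity are tied together by an identity: accuracy = π · sensitivity + (1 − π) · specificity, where π is the share of observations in the positive class. The short check below uses only the three figures reported above to recover the π they imply; the variable names are hypothetical. Since a cut-off of one standard deviation above the mean would usually leave high-burden households in the minority, the result may be worth comparing with the actual class counts to confirm which class the confusion matrix treats as positive.

```python
# Figures reported in the model performance list.
accuracy = 0.851
sensitivity = 0.986
specificity = 0.085

# accuracy = share * sensitivity + (1 - share) * specificity, solved for share.
implied_positive_share = (accuracy - specificity) / (sensitivity - specificity)

print(f"Implied share of the positive class: {implied_positive_share:.3f}")
```

The reported metrics are mutually consistent only if about 85% of the evaluated households belong to the positive class.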
The analysis provides evidence that housing vulnerability is concentrated among younger households and those living in smaller dwellings. Although the per-unit effect of income is close to zero, substantial income differences matter in practice.
- Alejandro Corchón
- Fiorella Raguseo