Asymmetric impacts of artificial intelligence on housing price valuation across education levels

Data

We obtained transaction records, including residential housing prices, from the Ministry of Land, Infrastructure and Transport (MLIT). Although there are various housing types, in this study, we targeted apartments because they include full address information, which enabled us to discern neighborhoods and aggregate amenity variables. In addition, apartments are the predominant residential type of housing in South Korea (Ahn et al., 2020). Thus, apartment datasets can properly represent the spatial dynamics of housing prices in the four metropolitan cities.

The datasets used in this study include various control variables regarding residential housing characteristics, e.g., unit areas, seasonal transaction periods, proximities to the local environmental amenities, and populations. These variables were retrieved from public databases (Statistics Korea, Statistical Geographic Information Service, the Korea Transport Database, and MLIT) and private real estate companies (Kookmin Bank real estate, Naver real estate, and Daum real estate). Data from the years 2018 and 2019 were aggregated, yielding 53,458, 56,606, 24,350, and 44,305 observations for Busan, Daegu, Daejeon, and Gwangju, respectively.^{Footnote 4}

The aggregated datasets were categorized into four groups, i.e., housing characteristics, local amenities, local demographics, and seasonal dummies, as summarized in Table 1. Before fitting the models, we cleaned the datasets and assessed multicollinearity using the variance inflation factor (VIF); some variables were dropped as a result. In addition, four variables, i.e., transacted prices and three proximity variables, were transformed to a logarithmic scale because they depart significantly from a normal distribution (Ahn et al., 2020). The descriptive statistics of the variables used in this study are given in Appendix A in the Supplementary Information.

Table 1 Categorized hedonic variables.
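The multicollinearity screening and log transformation described above can be illustrated with a minimal sketch. It assumes NumPy and synthetic data; the variable names (`area`, `dist_subway`, `near_dup`) and the common rule-of-thumb threshold of 10 are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + remaining regressors
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)                   # VIF_j = 1 / (1 - R_j^2)
    return out

# Synthetic, hypothetical variables (not the paper's data).
rng = np.random.default_rng(0)
n = 500
area = rng.normal(85, 20, n)                         # unit area
dist_subway = rng.lognormal(6, 1, n)                 # right-skewed proximity variable
near_dup = dist_subway * 1.01 + rng.normal(0, 1, n)  # nearly collinear duplicate

# Skewed proximity variables enter in logs, as in the text.
X = np.column_stack([area, np.log(dist_subway), np.log(near_dup)])
vifs = vif(X)
keep = vifs < 10.0   # assumed threshold; the paper does not state one
print(np.round(vifs, 2), keep)
```

In this toy setting the two collinear proximity columns receive very large VIFs while `area` stays close to 1, which is the pattern that would motivate dropping one of the redundant variables.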

A total of 17 variables were confirmed as hedonic variables, including two education variables, to appraise housing prices. The details of the variables are shown in Table 2, where the first column indicates the appellation of a variable denoted in our results. The second column describes the variable’s characteristics. Generally, the constructed variable set aligns with the existing literature on the assessment of housing prices (Ahn et al., 2020; An et al., 2023; Dai et al., 2023).^{Footnote 5}

Table 2 Control variables.

As proxies for education levels in neighborhoods, two education variables, i.e., Univ. grad. and Top school, were considered in this study.^{Footnote 6} The Univ. grad. variable reflects the ratio of university graduates among adults residing in a neighborhood, which has conventionally been utilized as a proxy for neighborhood education levels in relation to housing prices (Ahn et al., 2020; Lin et al., 2022; An et al., 2023). A higher level of education has served as a requisite credential for quality job opportunities in South Korea,^{Footnote 7} as discussed in “Study contexts”. As such, it is commonly believed that people with a higher level of education can afford greater housing costs based on their higher wages. Lin et al. (2022) found that university graduates are more likely to settle for higher housing costs to reside in urban cities with better public services and urban infrastructures. Accordingly, the Univ. grad. variable can capture the degree of educational levels in a neighborhood in relation to housing prices.

The Top school variable represents how many students are admitted to the nation’s most prestigious university, i.e., Seoul National University. Hundreds of universities and community colleges exist in South Korea; however, entrance to Seoul National University is highly competitive because admission there is associated with better prospects for high social standing than admission to lower-ranked universities.^{Footnote 8} This implies that the number of entrants to Seoul National University can reflect the education fever and/or the quality of educational services in a given neighborhood. In this context, Chung (2015) used this variable, i.e., the admittance rate to Seoul National University, as a proxy for school quality in neighborhoods. Thus, our education variable, Top school, can represent the extent of education fever and/or the quality of educational services in a neighborhood.

We conduct a descriptive statistical analysis of the key variables (Univ. grad. and Top school) to gain preliminary insights for the comparative analysis. We begin by examining the skewness and excess kurtosis of both education variables within each city. Furthermore, we test whether the two statistics conform to those expected under a normal distribution. Table 3 presents each education variable’s skewness, excess kurtosis, and corresponding test results in each city. Table 3 shows that all surveyed areas exhibit asymmetric distributions for both education variables. Educational contexts in neighborhoods are key determinants of property prices and housing purchases (Guo and Qian, 2021; Wang and Li, 2022). In this sense, the test results indicate that estimation disparities between HPMs and ML models can arise from asymmetric education levels within a city, as HPMs primarily capture linear trends (Schulz et al., 2020), in contrast to the nonlinearity of ML models (Fuster et al., 2022).

Table 3 Skewness and excess kurtosis of education variables.
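As a rough illustration of the statistics reported in Table 3, the sketch below computes moment-based skewness and excess kurtosis and compares each with zero, the value expected under normality. It uses the large-sample standard errors $\sqrt{6/n}$ and $\sqrt{24/n}$ as an assumption; the paper does not state which normality tests were applied, and the sample values are made up.

```python
import math

def skew_kurt(values):
    """Return (skewness, excess kurtosis, z_skew, z_kurt) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    ex_kurt = m4 / m2 ** 2 - 3.0          # 0 for a normal distribution
    # Large-sample z-scores under the null of normality (assumed test form).
    z_skew = skew / math.sqrt(6.0 / n)
    z_kurt = ex_kurt / math.sqrt(24.0 / n)
    return skew, ex_kurt, z_skew, z_kurt

symmetric = [1, 2, 3, 4, 5]                    # toy symmetric sample
right_skewed = [1, 1, 1, 2, 2, 3, 5, 9, 20]    # toy sample with a long right tail

sym_stats = skew_kurt(symmetric)
skewed_stats = skew_kurt(right_skewed)
print(sym_stats)
print(skewed_stats)
```

A positive skewness, as in the second sample, corresponds to a long right tail, for example a city in which a few neighborhoods have far higher education levels than the rest.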

Next, we employ the $t$-test to compare mean values and the $F$-test to compare variances of the two education variables in order to determine whether statistically significant differences exist between cities (Yao et al., 2022). Table 4 summarizes the pairwise test statistics for each city, highlighting the heterogeneous characteristics of the education variables across all surveyed areas. Based on these preliminary insights, we treat each city as an individual experimental group (Rainio et al., 2024).

Table 4 $t$-test and $F$-test statistics for comparisons between the study areas.
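The pairwise city comparisons can be sketched as follows. This example assumes an unequal-variance (Welch) form of the $t$-statistic and a simple ratio of sample variances for the $F$-statistic; the paper does not specify the exact variants, and the two samples are hypothetical values rather than actual city data.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t-statistic without assuming equal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def f_ratio(a, b):
    """Ratio of sample variances (numerator: a, denominator: b)."""
    return variance(a) / variance(b)

# Hypothetical neighborhood-level shares of university graduates in two cities.
city_a = [0.30, 0.35, 0.40, 0.45, 0.50]
city_b = [0.20, 0.22, 0.24, 0.26, 0.28]

t_stat = welch_t(city_a, city_b)
f_stat = f_ratio(city_a, city_b)
print(round(t_stat, 3), round(f_stat, 3))
```

Converting these statistics into p-values requires the $t$ and $F$ distributions, which a statistics library would supply; that step is omitted here to keep the sketch dependency-free.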

Hedonic price models

HPMs have served as the conventional approach to estimate residential housing prices (Chau and Chin, 2003; An et al., 2023). Most studies have utilized this linear method to identify hedonic variables’ marginal effects on housing prices (Hong et al., 2020; Ahn et al., 2020; Qiu et al., 2023) due to its simple implementation and intuitive interpretability. The ordinary least squares (OLS)-based regression form is the representative model for HPMs. This study assessed housing prices based on HPMs and identified the financial premiums of hedonic variables in relation to housing prices. The mathematical formula of the OLS-based HPM in the log-linear form^{Footnote 9} (Gibbons et al., 2014; An et al., 2024) is expressed as follows:

$${\mathrm{ln}}\,{p}_{i}=c+\mathop{\sum }\limits_{h=1}^{H}{\beta }_{h}{x}_{{ih}}+\mathop{\sum }\limits_{s=1}^{3}{\gamma }_{s}{d}_{{is}}+{e}_{i},$$

(1)

where ${p}_{i}$ denotes the transacted price of an apartment unit $i$ per square meter, $c$ is the constant, and $H$ is the number of hedonic variables, including housing characteristics, local amenities, and local demographics (${x}_{{ih}}$). In addition, ${d}_{{is}}$ denotes the seasonal dummy variables (spring, fall, and winter), and ${\beta }_{h}$ and ${\gamma }_{s}$ denote regression coefficients corresponding to hedonic variables (${x}_{{ih}}$) and seasonal dummies (${d}_{{is}}$), respectively, which are estimated using the OLS method. The residual term is denoted ${e}_{i}$.
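To make Eq. (1) concrete, the following sketch fits a log-linear hedonic model by OLS on simulated data. It assumes NumPy, two hypothetical hedonic variables, and summer as the omitted baseline season (implied by the three dummies for spring, fall, and winter); the coefficient values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Two hypothetical hedonic variables x_ih and a season code per transaction.
x = rng.normal(size=(n, 2))
season = rng.integers(0, 4, size=n)   # 0 = summer (baseline), 1..3 = spring, fall, winter
d = np.column_stack([(season == s).astype(float) for s in (1, 2, 3)])

# Simulated log prices following Eq. (1) with invented coefficients.
true_c = 8.0
true_beta = np.array([0.30, -0.10])
true_gamma = np.array([0.02, 0.01, -0.03])
ln_p = true_c + x @ true_beta + d @ true_gamma + rng.normal(scale=0.05, size=n)

# OLS: regress ln p on a constant, the hedonic variables, and the seasonal dummies.
A = np.column_stack([np.ones(n), x, d])
coef, *_ = np.linalg.lstsq(A, ln_p, rcond=None)
c_hat, beta_hat, gamma_hat = coef[0], coef[1:3], coef[3:]
print(np.round(coef, 3))
```

Because the dependent variable is in logs, each estimated $\beta_h$ is read as the approximate proportional change in price per unit change in $x_{ih}$, which is how hedonic premiums are typically interpreted in this specification.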

Note that the price of a housing unit can affect the prices of nearby housing units (Qiu et al., 2023); thus, housing prices are spatially correlated (Huang et al., 2017). To alleviate this spatial dependency issue (Ahn et al., 2020) and ensure the robustness of our results, we introduced the spatial lag regression (SLR) model in the assessment of housing prices. Here, a spatial lag term (${Wy}$) is added to the hedonic model. The model specification of the SLR is expressed as follows (Anselin, 2013):

$$y=\rho {Wy}+X\beta +\varepsilon ,$$

(2)

where $y$ represents the $N\times 1$ vector of the log-transformed housing prices, $\rho$ denotes the spatial lag parameter, which is estimated by minimizing the root mean squared error (RMSE) in the range between $-1$ and $1$, $X$ represents the $N\times \left(H+4\right)$ matrix of hedonic variables including three seasonal dummies, and $\beta$ represents the $\left(H+4\right)\times 1$ vector of regression coefficients. Finally, $\varepsilon$ denotes the $N\times 1$ vector of residuals, which is assumed to be homoscedastic, independent across observations, and distributed normally.

The spatial weight matrix $W$ is an $N\times N$ matrix that takes the following form and is row-standardized:

$${W}_{\tau v}=\left\{\begin{array}{l}1/{D}_{\tau v},\quad{D}_{\tau v} < {D}_{{band}}\\ 0,\qquad\quad\,{\rm{otherwise}}\end{array}\right.,$$

(3)

where $\tau$ and $v$ denote the locations of two housing units, which are specified by their longitudes and latitudes; thus, ${D}_{\tau v}$ is the distance between the two housing units located at $\tau$ and $v$. In addition, ${D}_{{band}}$ is the distance band, which is set to unity. However, the potential bias resulting from the simultaneity issue must be addressed because the target variable $y$ appears on both sides of Eq. (2) and is thus jointly determined (Brueckner, 1998). Therefore, we implement the SLR using the following formula, which is obtained by rearranging Eq. (2), to remedy biased estimation (Brueckner, 1998; Ahn et al., 2020):

$$\left(I-\rho W\right)y=X\beta +\varepsilon .$$

(4)
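Eqs. (2) to (4) can be sketched in a few lines. The example below builds inverse-distance weights within a band as in Eq. (3), row-standardizes them, and then searches a grid of $\rho$ values in $(-1, 1)$ for the one that minimizes the RMSE of the filtered regression in Eq. (4). It assumes NumPy, simulated planar coordinates, and a grid search as the estimation routine; the paper does not describe its optimizer, and the function names are hypothetical. A full maximum-likelihood estimator would additionally include the log-determinant of $I-\rho W$, which this sketch omits.

```python
import numpy as np

def spatial_weights(coords, d_band=1.0):
    """Inverse-distance weights within d_band (Eq. 3), then row-standardized."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    W = np.zeros_like(dist)
    mask = (dist > 0) & (dist < d_band)   # skip self-pairs and units beyond the band
    W[mask] = 1.0 / dist[mask]
    row_sums = W.sum(axis=1, keepdims=True)
    np.divide(W, row_sums, out=W, where=row_sums > 0)   # rows with no neighbor stay 0
    return W

def fit_slr(y, X, W, grid=None):
    """Grid search over rho; OLS of (I - rho W) y on X at each candidate (Eq. 4)."""
    if grid is None:
        grid = np.linspace(-0.99, 0.99, 199)
    Wy = W @ y
    best = None
    for rho in grid:
        y_star = y - rho * Wy
        beta, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        rmse = np.sqrt(np.mean((y_star - X @ beta) ** 2))
        if best is None or rmse < best[2]:
            best = (float(rho), beta, float(rmse))
    return best

# Simulated example (coordinates and coefficients are invented).
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(0, 3, size=(n, 2))
W = spatial_weights(coords, d_band=1.0)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(scale=0.1, size=n)
y = np.linalg.solve(np.eye(n) - 0.4 * W, X @ np.array([1.0, 0.5]) + eps)

rho_hat, beta_hat, rmse_hat = fit_slr(y, X, W)
print(rho_hat, np.round(beta_hat, 3), round(rmse_hat, 4))
```

With real transaction data, several units in the same complex may share coordinates; this sketch simply gives such zero-distance pairs a weight of zero, which is a simplification rather than the authors' stated treatment.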

Machine learning algorithms

Regarding our core research questions, we employ the random forest (RF) and extreme gradient boosting (XGBoost) models to assess housing prices. Owing to their flexibility, these models are appropriate alternatives for handling the nonlinearity inherent in our datasets, which poses a challenge for linear models (Dou et al., 2023; Swietek and Zumwald, 2023).

The RF model is an ensemble of decision trees that utilizes a number of predictors, i.e., trees, which can be described as a set of ${h}_{t}\left({X}^{d}\right)$, where ${h}_{t}$ represents a tree predictor corresponding to a tree $t$, and ${X}^{d}$ is the matrix of the hedonic variables (including seasonal dummies). Here, each tree predictor estimates the log-transformed housing prices in this study. The RF model adopts the bootstrap method^{Footnote 10} and out-of-bag estimation^{Footnote 11} in the training process, thereby providing robustness against outliers and unbiased estimations (Breiman, 2001). The final output of the RF model can be obtained by aggregating each estimated value of the tree predictors, i.e., ${h}_{t}\left({X}^{d}\right)$, as follows:

$$\hat{y}=\frac{1}{T}\mathop{\sum }\limits_{t=1}^{T}{h}_{t}\left({X}^{d}\right),$$

(5)

where $T$ denotes the number of tree predictors, and ${X}^{d}$ is the $N\times \left(H+3\right)$ matrix of the hedonic variables used to appraise housing prices.
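The aggregation in Eq. (5), together with bootstrap sampling and out-of-bag (OOB) evaluation, can be illustrated with a deliberately small sketch. It assumes NumPy and uses single-split regression trees (stumps) as stand-ins for full trees; it also omits the random feature subsampling of a genuine random forest, so it is closer to plain bagging than to the model used in the paper. All data are simulated.

```python
import numpy as np

def fit_stump(X, y):
    """Best single-split regression tree (depth-1 stand-in for a full tree)."""
    best = (np.inf, 0, 0.0, y.mean(), y.mean())
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:          # keep the right side non-empty
            left = X[:, j] <= thr
            yl, yr = y[left], y[~left]
            sse = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, thr, yl.mean(), yr.mean())
    return best[1:]

def predict_stump(stump, X):
    j, thr, left_val, right_val = stump
    return np.where(X[:, j] <= thr, left_val, right_val)

rng = np.random.default_rng(1)
n, T = 300, 25
X = rng.uniform(0, 1, size=(n, 2))
y = np.where(X[:, 0] > 0.5, 1.0, 0.0) + rng.normal(scale=0.1, size=n)

preds = np.zeros((T, n))
in_bag = np.zeros((T, n), dtype=bool)
for t in range(T):
    idx = rng.integers(0, n, size=n)                 # bootstrap sample with replacement
    in_bag[t, idx] = True
    preds[t] = predict_stump(fit_stump(X[idx], y[idx]), X)

y_hat = preds.mean(axis=0)                           # Eq. (5): average over T trees

# OOB estimate: for each observation, average only trees that did not train on it.
oob = ~in_bag
oob_counts = oob.sum(axis=0)
oob_pred = (preds * oob).sum(axis=0) / np.maximum(oob_counts, 1)
has_oob = oob_counts > 0
oob_rmse = np.sqrt(np.mean((y[has_oob] - oob_pred[has_oob]) ** 2))
print(round(float(oob_rmse), 4))
```

The OOB error serves as a built-in validation measure because each observation is scored only by trees whose bootstrap sample excluded it.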

The XGBoost model grows trees through feature splits and combines them additively. Each iteration adds a single tree that corrects the estimation error of the previous predictor. By iteratively correcting residuals, the XGBoost model enhances its predictive power incrementally (Chen and Guestrin, 2016). Based on decision rules, the XGBoost model utilizes tree structures $f$, which map a score considering the variables’ characteristics to each leaf within each tree structure ${f}_{t}$. Given $T$ trees, the XGBoost model obtains the estimated value $\hat{y}$ by summing the scores assigned to the corresponding leaves across all trees. The XGBoost model attempts to minimize the following objective function, which consists of a loss function and a regularization term (Dou et al., 2023):

$${Obj}\left(\phi \right)=\mathop{\sum }\limits_{i=1}^{N}l\left({\hat{y}}_{i},{y}_{i}\right)+\mathop{\sum }\limits_{t=1}^{T}{\rm{\Omega }}\left({f}_{t}\right),$$

(6)

where $\phi$ is a set of parameters, and $l$ is the loss function that calculates the estimation errors. In addition, $N$ is the number of observations, and ${\hat{y}}_{i}$ and ${y}_{i}$ are the estimated and real target values for single observation $i$, respectively. ${\rm{\Omega }}$ is the regularization term that controls the complexity of the regression predictors (Chen and Guestrin, 2016), which improves generalization, and each ${f}_{t}$ corresponds to the structure of each tree.
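A small worked example may help in reading Eq. (6). The sketch below evaluates the objective for a toy two-tree ensemble, assuming a squared-error loss and the regularizer $\Omega(f)=\gamma L+\tfrac{1}{2}\lambda\sum_{j} w_{j}^{2}$ from Chen and Guestrin (2016), where $L$ is the number of leaves and $w_{j}$ are the leaf scores. The leaf scores, leaf assignments, and penalty values are invented, and the paper does not report which loss or penalty settings were used.

```python
def omega(leaf_weights, gamma, lam):
    """Complexity penalty of one tree: gamma * (#leaves) + 0.5 * lam * sum(w^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def predict(trees, assignments):
    """Sum, over trees, the score of the leaf each observation falls into."""
    n = len(assignments[0])
    return [sum(tree[leaf_ids[i]] for tree, leaf_ids in zip(trees, assignments))
            for i in range(n)]

def objective(y_true, y_pred, trees, gamma, lam):
    loss = sum((yh - y) ** 2 for yh, y in zip(y_pred, y_true))   # assumed squared error
    penalty = sum(omega(tree, gamma, lam) for tree in trees)
    return loss + penalty

# Toy ensemble: each tree is represented only by its leaf scores.
trees = [[0.5, -0.5], [0.1, 0.2, -0.3]]
# Leaf index of each of the three observations in each tree (hypothetical).
assignments = [[0, 1, 0], [0, 2, 1]]
y_true = [1.0, -1.0, 0.5]

y_pred = predict(trees, assignments)       # [0.6, -0.8, 0.7]
obj = objective(y_true, y_pred, trees, gamma=0.1, lam=1.0)
print(y_pred, round(obj, 4))
```

Here the loss contributes 0.24 and the two trees contribute penalties of 0.45 and 0.37, giving an objective of 1.06; larger $\gamma$ or $\lambda$ would penalize additional leaves or large leaf scores more heavily.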

Experimental design

Two scenarios were considered in this study: (1) the emergence of unequal effects of the AI-based valuation and (2) its aggravation in housing price appraisals. The unequal effect of the AI-based valuation model is confirmed when the difference between the housing prices assessed by the AI model and the HPM is positive in the well-educated housing group and negative in the less-educated housing group. In addition, the unequal effect is exacerbated if the difference between the values estimated by the AI model and the HPM is statistically greater for the well-educated housing groups than for the less-educated housing groups.

Before identifying the two scenarios, we first analyzed whether local education levels are significantly associated with housing prices across all surveyed areas. To this end, we incorporated the two education variables in the HPM specifications and examined their regression coefficients. In addition, we checked the valuation models’ predictive power in the assessment of housing prices. Here, we divided the entire dataset into a training set (70%) and a test set (30%) in each of the four cities. To discern the unequal effects of flexible models according to educational levels, we considered three cases: (1) training the model and assessing housing prices while controlling for Univ. grad., (2) training the model and assessing housing prices while controlling for Top school, and (3) training the model and assessing housing prices after dropping both education variables. After applying these cases to the four models, i.e., a series of HPMs and two ML models, we calculated the difference between the values estimated by the AI model and the HPM, i.e., ${\hat{y}}_{{AI}}-{\hat{y}}_{{HPM}}$.

Then, we sorted the differences in ascending order of the education variable and formed lower and upper groups comprising the bottom and top 5, 10, and 20 percent of observations. Finally, we conducted $t$-tests within each group and between the lower and upper groups. The $t$-test within each group validates the homogeneity between the housing prices estimated by the AI models and the HPMs, whereas the $t$-test between the lower and upper groups validates whether the differences in the housing prices estimated by the AI models and the HPMs exhibit different patterns across percentile groups, i.e., education levels. The latter test was conducted to determine whether the discrepancy derived from the AI-based valuation model was exacerbated on average according to the neighborhood’s education level.
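The grouping and testing procedure can be sketched as follows. The example assumes that the within-group test is a one-sample $t$-test of the differences ${\hat{y}}_{{AI}}-{\hat{y}}_{{HPM}}$ against zero (equivalent to a paired test on the two estimates) and that the between-group test takes the Welch form; neither variant is confirmed by the text. The education values and differences are simulated, and the helper names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def one_sample_t(d):
    """t-statistic for H0: mean(d) == 0, i.e., AI and HPM estimates agree on average."""
    return mean(d) / math.sqrt(variance(d) / len(d))

def welch_t(a, b):
    """Two-sample t-statistic without assuming equal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def percentile_groups(diff, edu, pct):
    """Bottom and top pct percent of observations, ranked by the education variable."""
    order = sorted(range(len(edu)), key=lambda i: edu[i])
    k = max(2, len(edu) * pct // 100)
    return [diff[i] for i in order[:k]], [diff[i] for i in order[-k:]]

# Simulated test-set values: the difference rises with the education level.
rnd = random.Random(0)
n = 1000
edu = [rnd.random() for _ in range(n)]
diff = [0.1 * (e - 0.5) + rnd.gauss(0, 0.05) for e in edu]

results = {}
for pct in (5, 10, 20):
    lower, upper = percentile_groups(diff, edu, pct)
    results[pct] = {
        "n": len(lower),
        "t_lower": one_sample_t(lower),        # within the less-educated group
        "t_upper": one_sample_t(upper),        # within the well-educated group
        "t_between": welch_t(upper, lower),    # upper versus lower group
    }
    print(pct, {k: round(v, 2) for k, v in results[pct].items()})
```

In this simulated setting the lower groups show negative differences, the upper groups show positive differences, and the between-group statistic is large, which is the pattern the two scenarios in the experimental design are meant to detect.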
