

4.6 Housing Submarkets

Housing submarkets are essentially clusters within which the properties exhibit a considerably higher level of homogeneity. Many recent studies have shown that housing submarkets, i.e. clusters within which the properties share similar model coefficients, can be constructed by clustering the coefficients of the GWR model (Kopczewska & Ćwiakowski 2021). Consequently, once those submarkets are determined, an evaluation over the time dimension can also be performed, assuming that the data are collected over multiple time periods. For the coefficient clustering itself, the first step requires some form of dimensionality reduction; the most commonly used method is principal component analysis (PCA), which is described in the following section. Once the PCA representation is obtained, the k-means clustering algorithm is applied.
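As a purely illustrative sketch of this pipeline (not the exact implementation used in this thesis), the three steps can be chained together e.g. with scikit-learn; the coefficient matrix gwr_coefs below is a hypothetical placeholder for the local coefficients of a fitted GWR model:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
gwr_coefs = rng.normal(size=(500, 8))        # placeholder: 500 locations, 8 local GWR coefficients

# standardize the coefficients, reduce their dimensionality, and cluster the scores
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(gwr_coefs))
submarkets = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(submarkets))               # number of locations assigned to each submarket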

4.6.1 Principal Component Analysis

Principal component analysis (PCA) is a widely used dimensionality reduction technique that has been around for a very long time, since the original paper by Pearson. Despite the fact that many other data reduction techniques have since been proposed, PCA is still considered a state-of-the-art technique.

When working with high-dimensional data, which commonly contain many measured variables, PCA allows us to represent and analyze the relationships within the data using a considerably smaller number of features (columns of the design matrix X). This is because the principal components essentially provide a low-dimensional representation of the underlying data while capturing most of the variability within it. Interestingly enough, there are multiple mathematical routes to obtaining the principal components.

While the principal components can be obtained by solving the underlying optimization problem or via the eigendecomposition of the covariance matrix (e.g. James et al. 2013), the most commonly utilized approach to obtaining the principal components is via the singular value decomposition (SVD).

We first standardize the design matrix X, which is of size n × m, where n denotes the number of observations and m denotes the number of regressors, e.g. the number of rooms, square meters, etc. The sample average (further referred to simply as the mean) of each of the m features can then be stored in a single row vector $\bar{X}_j$ as:

$$\bar{X}_j = \frac{\sum_{i=1}^{n} x_{ij}}{n}.$$

Most of the literature, such as Jauregui (2012), does not discuss variance-standardizing techniques. However, the m features of the X matrix are frequently measured on different scales, and therefore the variance should also be transformed to unity. The sample standard deviation (further referred to simply as the standard deviation) of the m features can be written in a row vector $\sigma_j$ as:

$$\sigma_j = \sqrt{\frac{\sum_{i=1}^{n}\left(x_{ij} - \bar{X}_j\right)^2}{n-1}}.$$

After the two vectors are obtained, the standardized data matrix B derived from X can thus be constructed as:

$$B_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j},$$

where B is an n × m matrix. This step is known as z-standardization. Once the B matrix is obtained, the variance-covariance matrix can be estimated using

$$C = \frac{1}{n-1} B^{T} B.$$
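The z-standardization and the covariance estimate above can be reproduced numerically, for instance with NumPy; the design matrix X in the following sketch is synthetic and serves only to illustrate the formulas:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # synthetic design matrix, n = 100, m = 4
X_bar = X.mean(axis=0)                              # row vector of feature means
sigma = X.std(axis=0, ddof=1)                       # sample standard deviations (n - 1 denominator)
B = (X - X_bar) / sigma                             # z-standardized data matrix
C = (B.T @ B) / (B.shape[0] - 1)                    # variance-covariance matrix of the standardized data
print(np.allclose(np.diag(C), 1.0))                 # the diagonal entries are the unit variances -> True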

The variance-covariance matrix C is then an m × m symmetric matrix. The i-th entry on the diagonal of C, namely $C_{ii}$, is the (unit) variance of the i-th variable. In order to obtain the principal components, the eigendecomposition of the matrix C is performed:

$$\operatorname{eig}(C) \Rightarrow \lambda, W,$$

where λ denotes the eigenvalues and W the eigenvectors. After this step, the eigenvectors are ordered by the magnitude of their corresponding eigenvalue $\lambda_i$. Hence, the eigenvector affiliated with the largest eigenvalue is the first principal component PC1, the eigenvector corresponding to the second largest eigenvalue is the second principal component PC2, and so on. By definition, every principal component is orthogonal to all other principal components. This is particularly useful when estimating models based on the principal component projection, since all of the components are wholly uncorrelated.
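A minimal numerical sketch of this procedure, assuming a synthetic standardized matrix B rather than the actual data, might look as follows: the eigendecomposition of C is computed, the eigenvectors are reordered by decreasing eigenvalue, and the orthogonality of the resulting components is verified.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
B = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # z-standardized data
C = (B.T @ B) / (B.shape[0] - 1)                    # variance-covariance matrix
eigvals, W = np.linalg.eigh(C)                      # eigendecomposition of the symmetric matrix C
order = np.argsort(eigvals)[::-1]                   # sort by decreasing eigenvalue
eigvals, W = eigvals[order], W[:, order]            # W[:, 0] is PC1, W[:, 1] is PC2, ...
print(np.allclose(W.T @ W, np.eye(W.shape[1])))     # the principal components are mutually orthogonal -> True
print(eigvals / eigvals.sum())                      # share of variance explained by each component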

Singular Value Decomposition for PCA

In the section above, we presented the PCA solution as the eigenvectors and eigenvalues of the variance-covariance matrix C. It turns out that calculating the principal components via the eigendecomposition of the C matrix is not necessarily the most computationally efficient way of computing W. As described e.g. by Murphy (2012), the singular value decomposition (SVD) can be used to compute the W matrix. Using modified notation from Murphy (2012), the centered data matrix B can be taken and represented as the product:

$$B = U \Sigma V^{T}, \qquad (4.17)$$

where the factor matrices have the following structure. U is an n × n matrix whose columns (called the left singular vectors) are hierarchically ordered in terms of their capability of describing the variability in the columns of B. The matrix Σ is a non-negative diagonal matrix whose diagonal elements (referred to as the singular values) are also hierarchically ordered with decreasing magnitudes, i.e. $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_m \geq 0$. The elements of Σ capture the amount of variability explained by each associated column of U. The V matrix is, similarly to U, a unitary matrix whose columns contain the right singular vectors. The right singular vectors define the feature-space directions along which the data vary the most; since $C = \frac{1}{n-1}B^{T}B = V \frac{\Sigma^{T}\Sigma}{n-1} V^{T}$, they coincide with the eigenvectors of the variance-covariance matrix C. Once the data points are projected onto these directions (say 2 or 3 of them), we are left with the principal component scores themselves, which provide a low-dimensional representation of the original dataset (in our case, the centered matrix of the GWR coefficients). A broader description, as well as the associated theorems and proofs, can be found e.g. in Murphy (2012).
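The following sketch, again using synthetic data rather than the actual GWR coefficients, illustrates this equivalence numerically: the right singular vectors of B match (up to sign) the eigenvectors of C, and the squared singular values divided by n − 1 recover the eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
B = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # centered and scaled data matrix
U, s, Vt = np.linalg.svd(B, full_matrices=False)    # economy-size SVD: B = U @ diag(s) @ Vt
scores = B @ Vt.T                                   # principal component scores (equivalently U * s)
C = (B.T @ B) / (B.shape[0] - 1)
eigvals, W = np.linalg.eigh(C)
W = W[:, np.argsort(eigvals)[::-1]]                 # eigenvectors ordered by decreasing eigenvalue
print(np.allclose(np.abs(Vt), np.abs(W.T)))                          # right singular vectors = eigenvectors of C
print(np.allclose(s**2 / (B.shape[0] - 1), np.sort(eigvals)[::-1]))  # singular values recover the eigenvalues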

Once the principal component analysis is performed and the low-dimensional representation of the GWR coefficients is obtained, clustering algorithms can be used to find the housing submarkets. The most commonly used clustering algorithm, also used in the study of Kopczewska & Ćwiakowski (2021), is the well-known k-means clustering method.

The central question when utilizing the k-means approach is the selection of the number of clusters. This number can be defined either a priori, using e.g. region borders within which a certain form of homogeneity is assumed, or using a data-driven approach. We utilize the data-driven approach, where the "optimal" number of clusters is selected by fitting the model with various numbers of clusters, i.e. from 2 to 15, and then, for each k, computing the total within-cluster sum of squares loss. Using this loss function combined with a common-sense approach, the parameter k is then determined for each region separately.
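For illustration only, the total within-cluster sum of squares curve can be computed e.g. with scikit-learn as sketched below; the PCA scores used here are random placeholders rather than the actual regional data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
scores = rng.normal(size=(500, 3))                  # placeholder PCA scores of the GWR coefficients
twss = {}                                           # total within-cluster sum of squares for k = 2, ..., 15
for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores)
    twss[k] = km.inertia_
for k, loss in twss.items():
    print(f"k = {k:2d}  TWSS = {loss:,.1f}")        # the final k per region is picked from the 'elbow' of this curve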

Moreover, since the PCA approach allows for a certain degree of interpretability, we are able to identify the main sources of variability between and within the individual submarket clusters. This can undoubtedly be a very important characteristic of a submarket, not only for policy makers but also for real estate agents as well as for any individual seeking a housing opportunity.
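One possible way to surface these sources of variability is to inspect the PCA loadings, e.g. as in the sketch below; the coefficient names are hypothetical and serve only to illustrate the idea:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
coef_names = ["intercept", "floor_area", "rooms", "age", "dist_center"]  # hypothetical GWR regressors
gwr_coefs = rng.normal(size=(500, len(coef_names)))                      # placeholder local coefficients
pca = PCA(n_components=2).fit(gwr_coefs)
for i, component in enumerate(pca.components_, start=1):
    top = np.argsort(np.abs(component))[::-1][:3]                        # strongest loadings of PC i
    print(f"PC{i}:", ", ".join(f"{coef_names[j]} ({component[j]:+.2f})" for j in top))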
