Property price prediction
==========================

.. highlight:: Python

In this tutorial, you will learn:

* The basics in macroeconomic analysis
* The ways of analyzing macroeconomic indicators
* The ways of analyzing real estate market data
* How to build a property price prediction model

Intro to macroeconomic analysis
-------------------------------

| As we have discussed in the first tutorial, macroeconomic analysis is a way of
  investigating the macroeconomic indicators that influence the stock market.
  
| In this module, we will first analyze the macroeconomic indicators and explore how 
  the indicators affect the stock prices in Hong Kong. 
  
| Then, we will specifically analyze the Hong Kong real estate market, as we believe
  that it is one of the most important macroeconomic indicator that can reflect the
  Hong Kong's economy.

| In addition, we will build a property price prediction model to predict the house price
  of Hong Kong. 

Macroeconomic indicators in Hong Kong
-------------------------------------

| Before proceeding, first import some necessary libraries needed for this module.
 
::

    import random
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder

| After importing the libraries, let's have a look at the data. The data contains 8 
  different macroeconomic indicators collected from 2016 to 2020. 

| Use :code:`df.info()` to print information of all columns.  

.. figure:: images/macroeconomic_data.png
    :width: 400px
    :align: center
    :alt: "Macroeconomic data."

    The column information of macroeconomic data.

Univariate analysis
^^^^^^^^^^^^^^^^^^^
| In univariate analysis, use :code:`pandas.Dataframe.describe()` to examine the 
  distribution of the numerical features. It returns the statistical summary such as mean, 
  standard deviation, min, and max of a data frame. 

| For a better understanding of the statistics summary, use :code:`seaborn.distplot()` 
  to visualise the results with histograms.

::

    # Statistical summary
    print(df[feature_name].describe())

    # Histogram
    plt.figure(figsize=(8,4))
    sns.distplot(df[feature_name], axlabel=feature_name);

Bivariate analysis
^^^^^^^^^^^^^^^^^^
| In bivariate analysis, we are going to study the correlations between 
  a macroeconomic indicator and the Hang Seng Index. 
  Use :code:`matplotlib.pyplot.scatter()` and :code:`seaborn.regplot()` to 
  visualize the relationship between two features. 

::

    x = df[feature_name]
    y = df['hsi']

    plt.scatter(x, y)
    plt.xticks(rotation=45)
    fig = sns.regplot(x=feature_name, y="hsi", data=df)

.. figure:: images/scatter_graph_regline.png
    :width: 400px
    :align: center
    :alt: "Scatter graph with regline."

    An example of a scatter plot with a regression line.

| Then, use :code:`pandas.Dataframe.corr()` and :code:`seaborn.heatmap()` to compute 
  a pairwise correlation of features and visualize the correlation matrix.

::

    fig, ax = plt.subplots(figsize=(10,10)) 
    cols = df.corr().sort_values('hsi', ascending=False).index
    cm = np.corrcoef(df[cols].values.T)
    hm = sns.heatmap(cm, annot=True, square=True, annot_kws={'size':11}, yticklabels=cols.values, xticklabels=cols.values)
    plt.show()

.. figure:: images/heatmap_macro_indicator.png
    :width: 620px
    :align: center
    :alt: "Heatmap - macroeconomic indicator."

    Heatmap - macroeconomic indicators of Hong Kong.

| According to the above figure, we can see that house price, GDP, population, imports, 
  composite consumer price index , total exports, and year are positively correlated to the 
  Hang Seng index, while both seasonally adjusted unemployment rate and not seasonally 
  adjusted unemployment rate are negatively correlated to the Hang Seng index.


The Hong Kong real estate market
--------------------------------
| As shown above, the house price in Hong Kong has a strong positive correlation with the
  Hang Seng Index. In fact, the properties and construction sector accounts for over 10% 
  of weighting in the Hang Seng Index (Hang Seng Indexes Company Limited, 2020), and thus 
  the real estate market data is a source of volatility in the Hong Kong stock market. 
  
| While Hong Kong's real estate market is a constant topic of discussion, it will be worth 
  analyzing the Hong Kong real estate market data. Using the same data analysis 
  technique used for the above analysis, we will now analyze Hong Kong residential 
  market transaction records.

Data pre-processing
^^^^^^^^^^^^^^^^^^^
| Before analyzing the transaction records:

1. Derive some useful features from existing features.
::

    # Add new features
    df['month'] = pd.to_datetime(df['RegDate']).dt.month
    df['year'] = pd.to_datetime(df['RegDate']).dt.year

2. Drop unmeaningful features and features with too many missing values
::

    # Drop unnecessary columns
    df = df.drop([feature_name], axis=1)

3. Handle missing values by replacing NAN with a mean value of a feature
::

    # Handling missinig values
    # Fill with mean
    feature_name_mean = df[feature_name].mean()
    df[feature_name] = df[feature_name].fillna(feature_name_mean)

4. Label encode categorical features

::

    le = LabelEncoder() 
    le.fit(list(processed_df[feature_name].values)) 
    processed_df[feature_name] = le.transform(list(processed_df[feature_name].values))

Economic indicator analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^
| In economic indicator analysis, we will explore how the macroeconomic indicators affect the 
  monthly average house price per saleable area in Hong Kong. 

| The transaction records from Centaline Property will be used for this analysis.

.. figure:: images/transaction_record_centaline.png
    :width: 700px
    :align: center
    :alt: "Transaction records - Centaline Property"

    The data structure of transaction record (Centaline Property).

| Before analyzing the data, calculate the monthly average house price per saleable area. 
  Then, join the data with economic indicators by year and month.

::

    # calculate the monthly average house price
    df = df.groupby(['year','month'],as_index=False).mean()
    df = df.rename(columns={'UnitPricePerSaleableArea': 'AveragePricePerSaleableArea'})

| Using the bivariate analysis method we learned, a pairwise correlation of features 
  is computed and visualized. The result shows that population, year, composite consumer price index, 
  GDP, imports, and total exports are positively correlated to the monthly average 
  house price per saleable area in Hong Kong, while both unemployment
  rates are negatively correlated to the monthly average house price per saleable area in Hong Kong.

.. figure:: images/corr_economic_indicator_analysis.png
    :width: 620px
    :align: center
    :alt: "Heatmap - economic indicator analysis"

    Heatmap - economic indicators analysis.

Transaction record analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^
| In transaction record analysis, we will examine the relationship between features describing 
  the house and the individual housing prices of Hong Kong.

| The transaction records from Midland Realty will be used for this analysis.

.. figure:: images/transaction_record_midland1.png
    :width: 650px
    :align: center
    :alt: "Transaction records - Midland Realty - Part 1"

    The data structure of transaction record (Midland Realty) - Part 1.

.. figure:: images/transaction_record_midland2.png
    :width: 650px
    :align: center
    :alt: "Transaction records - Midland Realty - Part 2"

    The data structure of transaction record (Midland Realty) - Part 2.

| In univariate analysis, the distribution of Hong Kong’s house price is examined. 
  The housing price of Hong Kong has a mean of 9 million HKD and a standard deviation 
  of 13 million HKD. The skewness and kurtosis were 26.9 and 1526.4 respectively, 
  showing that the housing price of Hong Kong is skewed positively to a very high 
  degree. 

::

    # Distribution
    print(df['price'].describe())

    # Skewness and kurtosis
    print("Skewness: ", df['price'].skew())
    print("Kurtosis: ", df['price'].kurt())

::

    #output: 

    count    1.664090e+05
    mean     9.133268e+06
    std      1.310856e+07
    min      5.500000e+05
    25%      5.200000e+06
    50%      6.830000e+06
    75%      9.500000e+06
    max      1.399000e+09
    Name: price, dtype: float64

    Skewness:  26.927207752922435
    Kurtosis:  1526.4066673335874
  
| In order to get a better result for the bivariate analysis, outliers are 
  removed by using standard deviation.

::

  # Calculate mean and standard deviation
  data_mean, data_std = np.mean(df[feature_name]), np.std(df[feature_name])

  # Calculate upper boundary
  upper = data_mean + data_std * 3

  # Remove outliers
  df = df[df[feature_name] < upper]

| In bivariate analysis, the correlation coefficient between the features describing 
  the house and the house price is computed. 7 features with the highest correlation 
  is selected and shown below.

.. figure:: images/corr_transaction_data.png
  :width: 500px
  :align: center
  :alt: "Heatmap - transaction data analysis"

  Heatmap - transaction data analysis.

| According to the above figure, the housing price in Hong Kong has (1) a strong positive 
  correlation with saleable area; (2) a moderate positive correlation with last 
  transaction price; (3) a moderate positive correlation with gross area; (4) a moderate 
  positive correlation with number of bedrooms; (5) a weak positive correlation with 
  floor; (6) a weak negative correlation with region; and (7) a weak negative correlation 
  with building age.


| The full implementation of the economic indicator analysis and transaction data analysis
  could be found in :code:`code/macroeconomic-analysis/` in the repository.

Property price prediction with machine learning
-----------------------------------------------

| Based on the transaction data analysis, let's build property price prediction models.

Train-test split
^^^^^^^^^^^^^^^^
| Use :code:`sklearn.model_selection.train_test_split()` to split the data with the ratio 
  of 8:2. The input variables are the top 7 features selected from the analysis, and 
  the output feature is the house price.

::

    feat_col = [ c for c in df.columns if c not in ['price'] ]
    x_df, y_df = df[feat_col], df['price']

    x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=RAND_SEED)

Log transformation
^^^^^^^^^^^^^^^^^^
| Before training the model, transform :code:`y_train` using log function to normalise the highly 
  skewed price data. In this way, the dynamic range of Hong Kong’s property price can be reduced.

::

    log_y_train= np.log1p(y_train)

Training the model
^^^^^^^^^^^^^^^^^^
| In total, 4 different types of predictive models will be built: 

1. XGBoost
2. Lasso
3. Random Forest
4. Linear Regression

| Train the models with :code:`x_train` and :code:`y_train`, and use the
  models to make the predictions. 

::

    import xgboost as xgb

    # XGBoost
    model_xgb = xgb.XGBRegressor(objective ='reg:squarederror', 
                          learning_rate = 0.1, max_depth = 5, alpha = 10, 
                          random_state=RAND_SEED, n_estimators = 1000)
    model_xgb.fit(x_train, log_y_train)
    xgb_train_pred = np.expm1(model_xgb.predict(x_train))
    xgb_test_pred = np.expm1(model_xgb.predict(x_test))


Evaluate accuracy
^^^^^^^^^^^^^^^^^
| Then, evaluate the performance of each model by root mean square log error (RMSLE). 
  The reason why RMSLE is used is because the price values are too big, and RMSLE prevents 
  penalising large differences between actual and predicted prices.

::

    from sklearn.metrics import mean_squared_log_error
    
    def rmsle(y, y_pred):
      return np.sqrt(mean_squared_log_error(y, y_pred))

::

    #output:

        XGBoost RMSLE(train):  0.1626671056150446
        XGBoost RMSLE(test):  0.16849945199484243

| The train model RMSLE and the test model RMSLE are 0.1627 and 0.1685 respectively.
  XGBoost uses a more accurate implementation of gradient boosting algorithm and 
  optimised regularisation, and hence, it gives a better result than other models. 
  
| However, in this case, the result shows that the model is slightly overfitting the train data. 
  The below figure shows the graph of actual and predicted property price for XGBoost.


:: 

    plt.figure(figsize=(5,5))
    plt.scatter(y_test,xgb_test_pred)
    plt.xlabel('Actual Y')
    plt.ylabel('Predicted Y')
    plt.show()

.. figure:: images/prediction_graph_xgb.png
    :width: 300px
    :align: center
    :alt: "The prediction graph for XGBoost."

    The graph of actual and predicted house price for XGBoost.

.. attention::
   | All investments entail inherent risk. This repository seeks to solely educate 
     people on methodologies to build and evaluate algorithmic trading strategies. 
     All final investment decisions are yours and as a result you could make or lose money.