MultiFactor-Backtesting-Framework/portfolio_optimizer.py at main · polo2444172276/MultiFactor-Backtesting-Framework · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
from datetime import datetime, timedelta
import src.dataloader as dl
import pandas as pd
import rqdatac as rq
import scipy
import statsmodels as sm
import numpy as np
import seaborn as sns
import pathos
from tqdm import tqdm
import multiprocessing
import pickle
import matplotlib.pyplot as plt
from src import preprocess
from src.utils import *
from collections import Iterable
import src.factor_combinator as comb
import scipy.sparse as sp
import cvxpy as cp
from src.constants import *

class PortfolioOptimizer:
    """
    A class created specifically for the portfolio optimization process
    For a complete math derivation process, see Huatai MultiFactor Report1 华泰多因子系列1
    """
    def __init__(self, df_backtest: pd.DataFrame, style_factor_dict: dict, hist_periods=12, gamma=1.):
        """
        Args:
            df_backtest (pd.DataFrame): a pandas dataframe used for backtesting. It has multi-index (date, stock)
            style_factors_dict (dict): a dictionary mapping factor types to factor lists
                e.g. {'value': ['pe_ratio_ttm', 'pb_ratio_ttm', ],
                        }
                in order for factor data to be correctly read in,
                pe_ratio_ttm.h5 and pb_ratio_ttm.h5 should exist under ./Data/factor/value/
            hist_periods (int, optional): number of months of historica data used for forecasting. Defaults to 12.
        """
        self.df_backtest = df_backtest
        self.hist_periods = hist_periods
        self.style_factor_dict = style_factor_dict
        # 'style_factors' is a flatten version of 'style_factor_dict' containing no factor types, only factor names
        self.style_factors = sum(style_factor_dict.values(), [])
        self.country_factor = 'country'
        self.gamma = gamma

    def run(self, ):
        """
        The main function in this class
        1. select factors and set up the backtesting framework
        2. calculate the historical factor returns
        3. forecast next period's factor return and covariance matrix, forecast the return and risk of the overall portfolio
        4. set up objective and constraints to solve for the optimal stock portfolio
        5. see the backtesting results
        """
        self.preprocess() #step 1
        self.get_regression_results() #step 2
        self.predict() #step 3
        self.solve_opt_weights() #step 4
        self.plot_return() #step 5

    @timer
    def preprocess(self, ):
        """
            Step 1:
            Preprocess the dataframe before starting any other steps
        Returns:
            The updated backtesting dataframe
        """
        self.df_backtest = preprocess.add_factors(self.df_backtest, self.style_factor_dict)
        self.df_backtest = preprocess.standardize_factors(self.df_backtest, self.style_factors)
        self.df_backtest[self.country_factor] = 1

        # Turn the industry column into one-hot vectors
        df_industry_dummy = pd.get_dummies(self.df_backtest[PRIMARY_INDUSTRY_COL])
        # Set all the industry factors
        self.industry_factors = list(df_industry_dummy.columns)
        self.df_backtest.loc[:, self.industry_factors] = df_industry_dummy.values
        # Set all the factors
        self.all_factors = [self.country_factor] + self.industry_factors + self.style_factors
        return self.df_backtest

    @timer
    def get_regression_results(self, ):
        """
            Step 2:
            Obtain the regression results on each rebalancing date, which include historical factor returns and historical idiosyncratic returns
            Store them in pandas dataframes
        Returns:
            A list of statsmodels.wls result objects
        """

        # Fit a weighted least square regression model on each rebalancing date
        # Regress next period's return with the factor exposures on the current rebalancing date
        # The coefficients are the factor returns in the next period
        def get_wls_result(df_sub):
            import statsmodels.formula.api as smf
            wls_result = smf.wls(formula = f"next_period_return ~ 0 + {self.country_factor} + {' + '.join(self.industry_factors)} + {' + '.join(self.style_factors)}",
                            data=df_sub, weights = df_sub['market_value'] ** 0.5, missing='drop').fit()
            return wls_result
        # multiprocessing or multithreading is not any faster than for loop here
        self.wls_results = [get_wls_result(df_sub) for _, df_sub in self.df_backtest.groupby(level=0)]
        date_wls_pairs = zip(self.df_backtest.index.get_level_values(0).unique(), self.wls_results)
        # obtain the historical factor returns
        self.df_hist_factor_return = pd.concat({ date: wls_result.params for date, wls_result in date_wls_pairs}).unstack(level=1)
        self.df_hist_factor_return.index.name = 'date'
        # obtain the idiosyncratic returns
        self.df_hist_idio_return = pd.concat([wls_result.resid for wls_result in self.wls_results])
        return self.wls_results

    @timer
    def predict(self, ):
        """
            Step 3:
            Forecast next period's factor return and covariance matrix
        Returns:
            None
        """
        self.df_pred_factor_return = self.predict_factor_return()
        self.df_pred_factor_cov = self.predict_factor_cov()
        #All of the first 12 periods, and no other periods, should have nan values
        return_nan_count_by_date = (self.df_pred_factor_return.notnull().sum(axis=1) == 0).sum()
        assert( return_nan_count_by_date == self.hist_periods)
        cov_nan_count_by_date = (self.df_pred_factor_cov.notnull().groupby(level=0).sum().sum(axis=1) == 0).sum()
        assert( cov_nan_count_by_date == self.hist_periods)

        self.df_pred_idio_return = self.predict_idio_return()
        self.df_pred_stock_returns = self.predict_stock_return()


    @timer
    def solve_opt_weights(self, ):
        """
            Step 4: Maximize risk-adjusted return to solve for the optimal stock portfolio
        Returns:
            A list of lists: The optimal weights on each rebalancing date
        """
        gamma = cp.Parameter(nonneg=True)
        Lmax = cp.Parameter()
        def solve_optimal_weight(data, solver=cp.ECOS, abs_tol=1e-8):
            """
            Solve for the optimal weights on a SINGLE rebalancing date.
            """
            X_t, F_t, Delta, r = data
            """
            Let V be the N x N predicted stock return covariance matrix over the next period
            V is predicted as follows: V = X * F * X.transpose() + Delta, where
            X is the N x K factor exposure matrix on the current rebalancing date
            F is the K x K predicted factor covariance matrix over the next period
            Delta is the N x N predicted idiosyncratic return matrix over the next period

            Objective:
            Maximize R - gamma * var, where
            R is the 1 x 1 predicted portfolio return over the next period
            gamma is the 1 x 1 risk penalty coefficient. larger gamma will make the model more inclined to return and less inclined to risk
            var is the 1 x 1 predicted portfolio variance over the next period

            Mathematically, we have:
            R = w.transpose * r
            var = w.transpose * V * w
            V is the N x N predicted stock covariance matrix
            w is the N x 1 portfolio weight vector we wish to optimize

            Constraints:
            1) No short-selling: all weights should be non-negative
            2) Stock diversification: Investment into any stock should be less than 1%
            TODO: Add more constraints as outlined in Huatai's report
            """
            gamma.value = self.gamma
            N = X_t.shape[0]
            w = cp.Variable(N)
            ret = r.T @ w
            # Multiplying w and X first gives O(nk^2) time complexity, as opposed to O(n^3) if we calculate V first
            # This saves tons of time!
            variance = cp.quad_form(w.T @ X_t, F_t) + cp.sum_squares(np.sqrt(Delta) @ w) #check this out, see if it generalizes to non-diagonal matrices

            problem = cp.Problem(cp.Maximize(ret - gamma * variance),
                        [cp.sum(w) == 1,
                        0 <= w,
                        w <= 0.01,
                        ]
                            )
            """
            cvxpy will check that the optimization problem is convex before solving it
            If the optimization problem is not convex, one source of error can be that V is not symmetric semi-positive definite.
            You can verify this by checking if F and Delta are symmetric semi-positive definite or not.
            See footnote below this class about speeding up the convex optimization problem.
            """
            problem.solve(verbose=False, solver=solver)

            # check that the solver converges to a valid solution i.e. the obtained weight vector satisfies the above imposed constraints
            # Use abs_tol to avoid numerical rounding issues
            solved_weight = w.value
            assert(abs(solved_weight.sum() - 1) < abs_tol)
            assert(np.all(solved_weight >= 0 - abs_tol) )
            assert(np.all(solved_weight <= 0.01 + abs_tol))
            return w.value

        def get_data_by_date(date):
            """Given the rebalacing date, return X, F, Delta and r on that SINGLE rebalancing date.
            """
            X_t = self.df_backtest.loc[self.df_backtest.index.get_level_values(0) == date, self.all_factors].values
            F_t = self.df_pred_factor_cov.loc[self.df_pred_factor_cov.index.get_level_values(0) == date, self.all_factors].values
            u = self.df_pred_idio_return[self.df_pred_idio_return.index.get_level_values(0) == date].values
            Delta = scipy.sparse.diags( u ** 2 )
            r = self.df_pred_stock_returns[self.df_pred_stock_returns.index.get_level_values(0) == date].values
            return [X_t, F_t, Delta, r]

        # Store all input data by date in a list, then loop through that list and convex-optimize the weight vectors
        self.input_data = [ get_data_by_date(date) for date in REBALANCING_DATES[self.hist_periods: -1] ]
        valid_date_mask = self.df_backtest.index.get_level_values(0).isin(REBALANCING_DATES[self.hist_periods:-1])
        self.df_backtest.loc[valid_date_mask, 'opt_weight'] = sum( [list(solve_optimal_weight(input)) for input in tqdm(self.input_data)], [])  # this line of code takes 15s ~ 20s on my laptop
        self.opt_weights = self.df_backtest['opt_weight']
        return self.opt_weights

    @timer
    def plot_return(self, ):
        """
        Step 5: Calculate the cumulative return series and visualize the backtesting results
        TODO: Add more functionalities, such as sharpe ratio, win rate, etc. The best way to achieve this would be to create a 'BacktestingResult' class that takes a cumulative return series as argument, and can plot the relevant figures and print out a table.
        """
        # calculate portfolio return in each period
        self.df_backtest['weighted_return'] = self.df_backtest['next_period_return'].values * self.df_backtest['opt_weight'].values

        # calculate the cumulative portfolio return series
        self.df_portfolio_returns = self.df_backtest.groupby(level=0)['weighted_return'].apply(lambda df: df.sum(skipna=False))
        self.df_portfolio_cum_returns = (1 + self.df_portfolio_returns).cumprod()
        # plot it out
        # sns.distplot( df_portfolio_returns)
        self.df_portfolio_cum_returns.plot()

    def predict_factor_return(self, method=None):
        #helper function for self.predict
        if method is not None:
            #TODO: Implement other advanced methods for forecasting factor returns
            pass
        else:
            #Baseline rolling average prediction
            df_pred_factor_return = self.df_hist_factor_return.rolling(self.hist_periods).mean().shift(1)
        return df_pred_factor_return

    def predict_factor_cov(self, method=None):
        #helper function for self.predict
        if method is not None:
            #TODO: Implement other advanced methods for forecasting factor covariance
            pass
        else:
            #Naive covariance matrix estimation
            df_pred_factor_cov = self.df_hist_factor_return.rolling(self.hist_periods).cov().groupby(level=1).shift(1)
        return df_pred_factor_cov

    def predict_idio_return(self, method=None):
        #helper function for self.predict
        if method is not None:
            #TODO: Implement other advanced methods for forecasting idiosyncratic return
            pass
        else:
            #No prediction at all, based on the assumption that idiosyncratic return represents the unpredictable portion of stock return
            return pd.Series(0, index=self.df_backtest.index)

    def predict_stock_return(self, ) -> pd.Series:
        """
        Helper function for self.predict
        Predict stock return under the factor modelling framework. We do this once for ALL rebalancing dates.
            N is the # of stocks on the current rebalancing date
            K is the # of factors on the current rebalancing date
            r is the predicted N x 1 stock return vector over the next period
            We predict r as follows:
            r = X x f + u
            X is a N x K factor exposure matrix on the current rebalancing date
            f is the predicted K x 1 factor return vector over the next period
            u is the predicted N x 1 idiosyncratic return vector over the next period
        Returns:
            pd.Series: the predicted N x 1 stock return vector over the next period
        """
        X = self.df_backtest[self.all_factors]
        f = pd.merge(self.df_pred_factor_return, X[[]], how='right', left_on='date', right_on='date', )
        u = self.df_pred_idio_return
        # u = pd.merge(X[[]], self.df_pred_idio_return.rename('idio_return'), how='left', left_index=True, right_index=True)['idio_return']
        self.df_pred_stock_returns = pd.Series((X.values * f.values).sum(axis=1) + u.values, index=X.index, name='predicted_stock_return')
        return self.df_pred_stock_returns


"""
Detailed notes on portfolio optimization:

In both factor combination and portfolio optimization problems, we have to implement constrained optimization algorithms to solve for an array of weights. The former problem is easy and can be solved with scipy.optimize within 0.2 seconds for all rebalancing dates. This is because the number of factors to be combined are at most 10 ~ 20, and we are finding solution in a very low dimension. However, based on my empirical testing results, the time it takes to complete an optimization problem usually scales quadratically with the input dimension. This is why it takes more than 10 minutes for scipy to solve a portfolio optimization problem on a single rebalancing date using the naive approach! Since we have 100+ rebalancing dates, the time is definitely intractable if we stay with the naive approach. As a result, below are some trials by me to speed up the optimization problem:

1. Formulate as convex optimization problem and use cvxpy instead of scipy.optimize(Success)
Not sure by how much time this speeds up the problem. But it's said on stackoverflow that cvxpy are usually much faster because the package deals specifically with convex problems.

2. Multiply the matrices in different orders (Very successful).
Following https://colab.research.google.com/github/cvxgrp/cvx_short_course/blob/master/applications/portfolio_optimization.ipynb#scrollTo=zDfbDngvkJAV. This is the most effective way to speed up the problem! This boosts the solver's time from O(n^3) to O(nk^2), where n is # of stocks and k is # of factors. A single optimization problem used to take around 10 minutes but now takes only 3 seconds!

3. Set up the problem at the beginning and change values in each iteration (Failed)
Because the dimension of matrices are changing in each iteration, it seems like we cannot save the setup time by setting up variables at the beginning and changing the variable values during each iteration.

4. Use a more efficient version of BLAS (Ongoing)
As introduced in https://markus-beuckelmann.de/blog/boosting-numpy-blas.html , there are four versions of BLAS & LAPACK,
default, ATLAS, IntelMKL, OpenBLAS. It seems that the other three versions are at least around x2 times faster than the default version.
According to https://mruss.dev/numpy-mkl/ and https://conda-forge.org/docs/maintainer/knowledge_base.html#blas, I tried to download the IntelMKL version using conda. The download is sucessful everytime, and
command line installs the correct package each time. But `numpy.show_config()` gives 'NO ATLAS' for BLAS opt and MKL opt everytime, meaning
I am still using the default BLAS version for numpy. This is so annoying!

5. Choose the best solver - ECOS (Success)
Setting: for loop + no multiprocessing
default(OSQP): 1m42s
SCS: 2m26s
ECOS: 1m10s
ECOS is the fastest

6. Use multiprocessing (Can be implemented but not faster).
Not sure why it takes the same amount of time when I use for loop v.s. multiprocessing. Plus, the timing functions(either timer.timer, timer.process_time or timer.perf_counter) do not record the true time -- the true time is always longer than what is recorded. Considering the memory issues caused by multiprocessing, we'll use for loop onwwards.

7. Set near-zero values to 0 in the matrix(Failed).
Originally I thought this may be a good idea because 94%+ values in the matrix have absolute values smaller than 0.0001, so we can approximate those values as 0's, thereby taking the computational advantage of sparse matrices. But it turned out that after setting them as 0, the matrix is no longer semi-positive definitie.
"""