cp11_15_TradingStrategies(dataframe.plot.scatter)

最新推荐文章于 2023-08-25 14:48:32 发布

原创最新推荐文章于 2023-08-25 14:48:32 发布 · 4.1k 阅读

1 ·

CC 4.0 BY-SA版权

pythonForFinance 专栏收录该内容

25 篇文章

订阅专栏

本文探讨了基于简单移动平均线的量化交易策略，并利用机器学习技术如线性回归、支持向量机等预测市场方向，通过不同特征组合优化策略表现。

Simple Moving Averages(SMAs)

This section focuses on an algorithmic trading strategy based on simple moving averages and how to backtest such a strategy.

简单移动平均线沿用最简单的统计学方式，将过去某特定时间内的价格取其平均值。简单移动平均线计算方法如同其名——简单。它只是将每日得到的平均值连成一线并随时间移动，每一支烛因而得到相同的数值。

Data Import

import numpy as np
import pandas as pd
import datetime as dt
from pylab import mpl, plt

plt.style.use('seaborn')
mpl.rcParams['font.family'] = 'serif'
%matplotlib inline

Second, the reading of the raw data and the selection of the financial time series for a single symbol, the stock of Apple, Inc. (AAPL.O). The analysis in this section is based on end-of-day data; intraday data is used in subsequent sections:

raw = pd.read_csv('../source/tr_eikon_eod_data.csv')
raw.head()

raw = pd.read_csv('../source/tr_eikon_eod_data.csv', index_col=0)
raw.head()

raw.info()

raw = pd.read_csv('../source/tr_eikon_eod_data.csv', index_col=0, parse_dates=True)
raw.head()

raw.info()

type(raw.index)

symbol='AAPL.O'

data = (
pd.DataFrame(raw[symbol]).dropna()
)
data.head()

Trading Strategy

Third, the calculation of the SMA values for two different rolling window sizes. Figure 15-1 shows the three time series visually:

#####################rolling window ########################

###################rolling window ######################

SMA1 = 42
SMA2 = 252

data['SMA1'] = data[symbol].rolling(SMA1).mean() #Calculates the values for the shorter SMA.
data['SMA2'] = data[symbol].rolling(SMA2).mean() #Calculates the values for the longer SMA.

data.plot(figsize=(10,6), title='Apple stock price and two simple moving averages')

Fourth, the derivation of the positions(仓位). The trading rules are:

Buy signal: Go long (= +1)做多 when the shorter SMA is above the longer SMA.

Sell signal: Go short (= -1)做空 when the shorter SMA is below the longer SMA.

Wait (park in cash): the 42d trend is within a range of +/– SD points around the 252d trend.

Technical Analysis: https://blog.youkuaiyun.com/Linli522362242/article/details/90110433

data.dropna(inplace=True)
#np.where(cond, a, b) evaluates the condition cond element-wise and places a when True and b otherwise.
data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)
data.tail()

#right side y-axis
ax = data.plot(secondary_y='Position', figsize=(10,6), title='Apple stock price, two SMAs, and resulting positions')
ax.get_legend().set_bbox_to_anchor((0.25, 0.85))

This replicates the results derived in Chapter 8. What is not addressed there is if following the trading rules — i.e., implementing the algorithmic trading strategy — is superior compared to the benchmark case of simply going long on the Apple stock over the whole period. Given that the strategy leads to two periods only during which the Apple stock should be shorted, differences in the performance can only result from these two periods.

Vectorized Backtesting

The vectorized backtesting can now be implemented as follows.

First, the log returns are calculated.

Then the positionings, represented as +1 or -1, are multiplied by the relevant log return. This simple calculation is possible since a long position 多头头寸earns the return of the Apple stock and a short position 空头头寸earns the negative return of the Apple stock.

Finally, the log returns for the Apple stock and the algorithmic trading strategy based on SMAs need to be added up and the exponential function applied to arrive at the performance values:

data[symbol].head()

data[symbol].shift(1).head()

#Calculates the log returns of the Apple stock (i.e., the benchmark investment).
#symbol='AAPL.O'
data['Returns'] = np.log(data[symbol]/data[symbol].shift(1)) #ln == log_e
data.head()

#Multiplies the position values, shifted by one day, by the log returns of the Apple stock; the shift
#is required to avoid a foresight bias
data['Strategy'] = data['Position'].shift(1) * data['Returns']
data.round(4).head()

data.dropna(inplace=True)

#Sums up the log returns for the strategy
#and the benchmark investment
#and calculates the exponential value to arrive at the absolute performance.
np.exp(data[['Returns','Strategy']].sum())

#Calculates the annualized volatility for the strategy and the benchmark investment.
data[['Returns', 'Strategy']].std() * 252**0.5

The numbers show that the algorithmic trading strategy indeed outperforms the benchmark investment of passively(被动地) holding the Apple stock(Strategy:5.811299 > Returns:4.017148). Due to the type and characteristics of the strategy, the annualized volatility is the same(0.250), such that it also outperforms the benchmark investment on a riskadjusted basis.

To gain a better picture of the overall performance, Figure 15-3 shows the performance of the Apple stock and the algorithmic trading strategy over time:

ax = data[['Returns', 'Strategy']].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of Apple stock and SMA-based trading strategy over time')
#data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)
data['Position'].plot(ax=ax, secondary_y='Position', style='--')
ax.get_legend().set_bbox_to_anchor((0.25, 0.85))

SIMPLIFICATIONS

The vectorized backtesting approach as introduced in this subsection is based on a number of simplifying assumptions. Among others, transactions costs (fixed fees, bid-ask spreads, lending costs, etc.) are not included. This might be justifiable for a trading strategy that leads to a few trades only over multiple years. It is also assumed that all trades take place at the end-of-day closing prices for the Apple stock. A more realistic backtesting approach would take these and other (market microstructure) elements into account.

Optimization

A natural question that arises is if the chosen parameters SMA1=42 and SMA2=252 are the “right” ones. In general, investors prefer higher returns to lower returns ceteris paribus. Therefore, one might be inclined to search for those parameters that maximize the return over the relevant period. To this end, a brute force approach can be used that simply repeats the whole vectorized backtesting procedure for different parameter combinations, records the results, and does a ranking afterward. This is what the following code does:

from itertools import product#笛卡尔积

sma1 = range(20, 61, 4) #Specifies the parameter values for SMA1.
sma2 = range(180, 281, 10) #Specifies the parameter values for SMA2.

results = pd.DataFrame()
for SMA1, SMA2 in product(sma1, sma2): #Combines all values for SMA1 with those for SMA2.
data = pd.DataFrame(raw[symbol])
data.dropna(inplace=True)

#data['Returns'] = np.log(data[symbol] / data[symbol].shift(1))

data['SMA1'] = data[symbol].rolling(SMA1).mean()
data['SMA2'] = data[symbol].rolling(SMA2).mean()
data.dropna(inplace=True)

data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)

data['Returns'] = np.log(data[symbol] / data[symbol].shift(1))
data['Strategy'] = data['Position'].shift(1) * data['Returns']
data.dropna(inplace=True)

perf = np.exp( data[['Returns', 'Strategy']].sum() )#dataframe

results = results.append(pd.DataFrame(
{'SMA1':SMA1,
'SMA2':SMA2,
'MARKET': perf['Returns'],
'STRATEGY': perf['Strategy'],
'OUT': perf['Strategy'] - perf['Returns']
},
index = [0]### with all indices are 0
),
ignore_index = True #will assign a new index to current dataframe
)#Records the vectorized backtesting results in a DataFrame object.
results=results[['SMA1', 'SMA2', 'MARKET','STRATEGY','OUT']] #adjusts the columnname's order
results.head()

The following code gives an overview of the results and shows the seven best-performing parameter combinations of all those backtested. The ranking is implemented according to the outperformance of the algorithmic trading strategy compared to the benchmark investment. The performance of the benchmark investment varies since the choice of the SMA2 parameter influences the length of the time interval and data set on which the vectorized backtest is implemented:

results.info()

results.sort_values('OUT', ascending=False).head(7)

According to the brute force–based optimization, SMA1=40 and SMA2=190 are the optimal parameters, leading to an outperformance of some 230(40+190) percentage points. However, this result is heavily dependent on the data set used and is prone to overfitting. A more rigorous approach would be to implement the optimization on one data set, the in-sample or training data set, and test it on another one, the out-of-sample or testing data set.

OVERFITTING

In general, any type of optimization, fitting, or training in the context of algorithmic trading strategies is prone to what is called overfitting. This means that parameters might be chosen that perform (exceptionally) well for the used data set but might perform (exceptionally) badly on other data sets or in practice.

Random Walk Hypothesis

The previous section introduces vectorized backtesting as an efficient tool to backtest algorithmic trading strategies. The single strategy backtested based on a single financial time series, namely historical end-of-day prices for the Apple stock, outperforms the benchmark investment of simply going long on the Apple stock over the same period.

Although rather specific in nature, these results are in contrast to what the random walk hypothesis (RWH) predicts, namely that such predictive approaches should not yield any outperformance at all. The RWH postulates假设 that prices in financial markets follow a random walk, or, in continuous time, an arithmetic Brownian motion without drift. The expected value of an arithmetic Brownian motion without drift at any point in the future equals its value today. As a consequence, the best predictor for tomorrow’s price, in a least-squares sense, is today’s price if the RWH applies.

For many years, economists, statisticians, and teachers of finance have been interested in developing and testing models of stock price behavior. One important model that has evolved from this research is the theory of random walks. This theory casts serious doubt on many other methods for describing and predicting stock price behavior — methods that have considerable popularity outside the academic world. For example, we shall see later that, if the random-walk theory is an accurate description of reality, then the various “technical” or “chartist” procedures for predicting stock prices are completely without value.

The RWH is consistent with the efficient markets hypothesis (EMH), which, non-technically speaking, states that market prices reflect “all available information.” Different degrees of efficiency are generally distinguished, such as weak, semi-strong, and strong, defining more specifically what “all available information” entails. Formally, such a definition can be based on the concept of an information set in theory and on a data set for programming purposes, as the following quote illustrates:

A market is efficient with respect to an information set S if it is impossible to make economic profits by trading on the basis of information set S.

Using Python, the RWH can be tested for a specific case as follows. A financial time series of historical market prices is used for which a number of lagged滞后 versions are created — say, five. OLS最小二乘 regression is then used to predict the market prices based on the lagged market prices created before. The basic idea is that the market prices from yesterday and four more days back can be used to predict today’s market price

The following Python code implements this idea and creates five lagged versions of the historical end-of-day closing levels of the S&P 500 stock index:

raw.head()

symbol = '.SPX'
data = pd.DataFrame(raw[symbol])

lags = 5
cols = []
for lag in range(1, lags+1):
#Defines a column name for the current lag value.
col = 'lag_{}'.format(lag)
#Creates the lagged version of the market prices for the current lag value.
data[col] = data[symbol].shift(lag)
#Collects the column names for later reference.
cols.append(col)
data.head(7)

data.dropna(inplace=True)
data.head()

Using NumPy, the OLS regression is straightforward to implement. As the optimal regression parameters show, lag_1 indeed is the most important one in predicting the market price based on OLS regression. Its value is close to 1. The other four values are rather close to 0. Figure 15-4 visualizes the optimal regression parameter values.

When using the optimal results to visualize the prediction values as compared to the original index
values for the S&P 500, it becomes obvious from Figure 15-5 that indeed lag_1 is basically what is
used to come up with the prediction value. Graphically speaking, the prediction line in Figure 15-5 is
the original time series shifted by one day to the right (with some minor adjustments).

All in all, the brief analysis in this section reveals some support for both the RWH and the EMH. For
sure, the analysis is done for a single stock index only and uses a rather specific parameterization —
but this can easily be widened to incorporate multiple financial instruments across multiple asset
classes, different values for the number of lags, etc. In general, one will find out that the results are
qualitatively more or less the same. After all, the RWH and EMH are among the financial theories
that have broad empirical经验 support. In that sense, any algorithmic trading strategy must prove its worth
by proving that the RWH does not apply in general. This for sure is a tough hurdle.

Linear OLS Regression

This section applies linear OLS regression to predict the direction of market movements based on historical log returns. To keep things simple, only two features are used. The first feature (lag_1) represents the log returns of the financial time series lagged by one day. The second feature (lag_2) lags the log returns by two days. Log returns — in contrast to prices — are stationary in general, which often is a necessary condition for the application of statistical and ML algorithms.

The basic idea behind the usage of lagged log returns as features is that they might be informative in predicting future returns. For example, one might hypothesize that after two downward movements an upward movement is more likely (“mean reversion”), or, to the contrary, that another downward movement is more likely (“momentum” or “trend”). The application of regression techniques allows the formalization of such informal reasonings.

The Data

First, the importing and preparation of the data set. Figure 15-6 shows the frequency distribution of the daily historical log returns for the EUR/USD exchange rate. They are the basis for the features as well as the labels to be used in what follows:

raw = pd.read_csv('../source/tr_eikon_eod_data.csv', index_col=0, parse_dates=True).dropna()
raw.head()

raw.columns

symbol = 'EUR='
data = pd.DataFrame(raw[symbol]) #raw[symbol] is a series
data.head()

#why log?
#Try to multiply many small numbers in Python. Eventually it rounds off to 0.
data['returns'] = np.log(data/data.shift(1))
data.dropna(inplace=True) #since data.shift(1)
data.head()

data['direction'] = np.sign(data['returns']).astype(int)
data.head()

data['returns'].hist(bins=35, figsize=(10,6))
plt.title('Histogram of log returns for EUR/USD exchange rate')

#a check whether thedata['returns']values are indeed log-normally distributed.

#如果峰度大于三，峰的形状比较尖，比正态分布峰要陡峭。反之亦然

#在相同的标准差下，峰度系数越大，分布就有更多的极端值，那么其余值必然要更加集中在众数周围，其分布必然就更加陡峭

#https://blog.youkuaiyun.com/Linli522362242/article/details/99728616

Second, the code that creates the features data by lagging the log returns and visualizes it in combination with the returns data (see Figure 15-7):

lags =2
def create_lags(data):
global cols
cols = []
for lag in range(1, lags+1):
col = 'lag_{}'.format(lag)

#data=data.drop(['lag_()'], axis=1)
data[col] = data['returns'].shift(lag)
cols.append(col)

create_lags(data)
data.head()

data.dropna(inplace=True) #since shift()

data.plot.scatter(x='lag_1', y='lag_2', c='returns', cmap='coolwarm', figsize=(10,6), colorbar=True)
plt.axvline(0,c='r', ls='--')
plt.axhline(0,c='r', ls='--')
plt.title('Scatter plot based on features and labels data')

Regression

With the data set completed, linear OLS regression can be applied to learn about any potential (linear) relationships, to predict market movement based on the features, and to backtest a trading strategy based on the predictions. Two basic approaches are available: using the log returns or only the direction data as the dependent variable during the regression. In any case, predictions are realvalued and therefore transformed to either +1 or -1 to only work with the direction of the prediction:

#The linear OLS regression implementation from scikit-learn is used.
from sklearn.linear_model import LinearRegression

#model
model=LinearRegression()

cols

#fit and then predict
#The regression is implemented on the log returns directly …
data['pos_ols_1'] = model.fit(data[cols], data['returns']).predict(data[cols])
#… and on the direction data which is of primary interest.
#data['direction'] = np.sign(data['returns']).astype(int)
data['pos_ols_2'] = model.fit(data[cols], data['direction']).predict(data[cols])

data[['pos_ols_1', 'pos_ols_2']].head()

data[['pos_ols_1', 'pos_ols_2']] = np.where( data[['pos_ols_1', 'pos_ols_2']]>0, 1,-1 )
data[['pos_ols_1', 'pos_ols_2']].head()

#The real-valued predictions are transformed to directional values (+1, -1).
data['pos_ols_1'].value_counts()

#The real-valued predictions are transformed to directional values (+1, -1).
data['pos_ols_2'].value_counts()

#However, both lead to a relatively large number of trades over time.
(data['pos_ols_1'].diff() != 0).sum()

#However, both lead to a relatively large number of trades over time.
(data['pos_ols_2'].diff() !=0).sum()

Equipped with the directional prediction, vectorized backtesting can be applied to judge the performance of the resulting trading strategies. At this stage, the analysis is based on a number of simplifying assumptions, such as “zero transaction costs” and the usage of the same data set for both training and testing. Under these assumptions, however, both regression-based strategies outperform the benchmark passive investment, while only the strategy trained on the direction of the market shows a positive overall performance

data['strat_ols_1'] = data['pos_ols_1'] * data['returns']
data['strat_ols_2'] = data['pos_ols_2'] * data['returns']

data.head()

data[ ['returns', 'strat_ols_1', 'strat_ols_2'] ].sum().apply(np.exp)

the strategy trained on the direction of the market shows a better performance(1.339286 > 0.942422)

#Shows the number of correct and false predictions by the strategies
(data['direction'] == data['pos_ols_1']).value_counts()

#Shows the number of correct and false predictions by the strategies
(data['direction'] == data['pos_ols_2']).value_counts()

data[['returns', 'strat_ols_1', 'strat_ols_2']].cumsum().apply(np.exp).plot(figsize=(10,6),title='Performance of EUR/USD and regression-based strategies over time')

only the strategy trained on the direction of the market shows a positive overall performance

Clustering

This section applies k-means clustering, as introduced in “Machine Learning”, to financial time series data to automatically come up with clusters that are used to formulate a trading strategy. The idea is that the algorithm identifies two clusters of feature values that predict either an upward movement or a downward movement.

The following code applies the k-means algorithm to the two features as used before. Figure 15-9 visualizes the two clusters:

from sklearn.cluster import KMeans

#Two clusters are chosen for the algorithm.
model = KMeans(n_clusters=2, random_state=0)

model.fit( data[cols] )

data.head()

data['pos_clus'] = model.predict(data[cols])
data.head()

#Given the cluster values, the position is chosen.
data['pos_clus'] = np.where(data['pos_clus'] == 1, -1, 1)

plt.figure(figsize=(10,6))
plt.scatter(data[cols].iloc[:,0], #lag_1
data[cols].iloc[:,1], #lag_2
c=data['pos_clus'],
cmap='coolwarm'
)
plt.title('Two clusters as identified by the k-means algorithm')

Admittedly, this approach is quite arbitrary in this context — after all, how should the algorithm know what one is looking for? However, the resulting trading strategy shows a slight outperformance at the end compared to the benchmark passive investment (see Figure 15-10). It is noteworthy that no guidance (supervision) is given and that the hit ratio — i.e., the number of correct predictions in relationship to all predictions made — is less than 50%:

data['strat_clus'] = data['pos_clus'] * data['returns']

data[['returns', 'strat_clus']].sum().apply(np.exp)

(data['direction'] == data['pos_clus']).value_counts()

data[['returns', 'strat_clus']].cumsum().apply(np.exp).plot(figsize=(10,6), title='Performance of EUR/USD and k-means-based strategy over time')

Frequency Approach

Beyond more sophisticated algorithms and techniques, one might come up with the idea of just implementing a frequency approach to predict directional movements in financial markets. To this end, one might transform the two real-valued features to binary ones and assess the probability of an upward and a downward movement, respectively, from the historical observations of such movements, given the four possible combinations for the two binary features ((0, 0), (0, 1), (1, 0), (1, 1)).

data[cols[0]].head()

np.digitize(data[cols[0]], bins=[0])[:5]

def create_bins(data, bins=[0]):
global cols_bin
cols_bin =[]
for col in cols:
col_bin = col + '_bin'
#Digitizes the feature values given the bins parameter.
data[col_bin] = np.digitize(data[col], bins=bins)
cols_bin.append(col_bin) #list

create_bins(data)

data.head()

data[cols_bin+['direction']].head()

grouped = data.groupby(cols_bin + ['direction'])
#Shows the frequency of the possible movements conditional on the feature value combinations.
#direction: current return # lag_1 : yesterday's return # lag_2: The day before yesterday's return
grouped.size()

grouped['direction'].size()

#Transforms the DataFrame object to have the frequencies in columns.
res = grouped['direction'].size().unstack(fill_value=0)

res

def highlight_max(s):
is_max=( s==s.max() )
print(s)
print(is_max)
print('#'*10)
return ['background-color:yellow' if v else '' for v in is_max]

#Highlights the highest-frequency value per feature value combination.
#res = grouped['direction'].size().unstack(fill_value=0)
res.style.apply(highlight_max, axis=1)

Given the frequency data, three feature value combinations hint at a downward movement while one lets an upward movement seem more likely. This translates into a trading strategy the performance of which is shown in

#data[col] = data['returns'].shift(lag)
#data[col_bin] = np.digitize(data[col], bins=bins) #0: positive ; 1: negative
data[cols_bin].head()

data[cols_bin].sum(axis=1).head()

#Translates the findings given the frequencies to a trading strategy.
data['pos_freq'] = np.where(data[cols_bin].sum(axis=1) == 2, -1, 1)
#since data['direction']== 1 or -1
(data['direction'] == data['pos_freq']).value_counts()

data['strat_freq'] = data['pos_freq']*data['returns']

data[['returns', 'strat_freq']].sum().apply(np.exp)

##green curve # not alway perform better than the passive investment(return curve)

data[['returns', 'strat_freq']].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of EUR/USD and frequency-based trading strategy over time'
)

Classification

This section applies the classification algorithms from ML (as introduced in “Machine Learning”) to the problem of predicting the direction of price movements in financial markets. With that background and the examples from previous sections, the application of the logistic regression, Gaussian Naive Bayes, and support vector machine approaches is as straightforward as applying them to smaller sample data sets.

Two Binary Features

First, a fitting of the models based on the binary feature values and the derivation of the resulting position values:

from sklearn import linear_model
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

C=1#Penalty parameter C of the error term.

models = {
'log_reg': linear_model.LogisticRegression(C=C),
'gauss_nb': GaussianNB(),
'svm': SVC(C=C)
}

def fit_models(data): #A function that fits all models.
mfit = {model: models[model].fit(data[cols_bin], # lag_1,lag_2
data['direction'] #data['direction'] = np.sign(data['returns']).astype(int)
)
for model in models.keys()
}

fit_models(data)

def derive_positions(data):#A function that derives all position values from the fitted models.
for model in models.keys():
data['pos_' + model] = models[model].predict(data[cols_bin])

derive_positions(data)

Second, the vectorized backtesting of the resulting trading strategies. Figure 15-12 visualizes the performance over time:

def evaluate_strats(data):
global sel
sel=[]
for model in models.keys():
col = 'strat_' + model
data[col] = data['pos_'+model] * data['returns']
sel.append(col)
sel.insert(0, 'returns')

evaluate_strats(data)
sel.insert(1, 'strat_freq')

data.head()

#Some strategies might show the exact same performance.
#data[['returns', 'strat_freq','strat_log_reg','trat_gauss_nb','strat_svm']]
data[sel].sum().apply(np.exp)

colormap={
'returns':'m', #puple-red
'strat_freq':'y', #yellow
'strat_log_reg':'b', #blue
'strat_gauss_nb':'k', #black
'strat_svm':'r' #red
}

data[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of EUR/USD and classification-based trading strategies (two binary lags) over time',
style=colormap
)

data[['strat_freq', 'strat_log_reg']].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of EUR/USD and classification-based trading strategies (two binary lags) over time',
style=colormap
)

Five Binary Features

In an attempt to improve the strategies’ performance, the following code works with five binary lags instead of two. In particular, the performance of the SVM-based strategy is significantly improved(see Figure). On the other hand, the performance of the LR- and GNB-based strategies is worse:

data = pd.DataFrame(raw[symbol])
data.head()

data['returns'] = np.log(data/data.shift(1))
# d:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in sign
# """Entry point for launching an IPython kernel.
data.dropna(inplace=True)
#or data['direction'] = np.sign(data['returns']).astype(int)
data['direction'] = np.sign(data['returns'])

data.head()

lags =5 #Five lags of the log returns series are now used.
def create_lags(data):
global cols
cols = []
for lag in range(1, lags+1):
col = 'lag_{}'.format(lag)
data[col] = data['returns'].shift(lag)
cols.append(col)
create_lags(data)
data.dropna(inplace=True)
data.head()

def create_bins(data, bins=[0]):
global cols_bin
cols_bin =[]
for col in cols:
col_bin = col + '_bin'
#Digitizes the feature values given the bins parameter.
data[col_bin] = np.digitize(data[col], bins=bins)
cols_bin.append(col_bin) #list
create_bins(data)#The real-valued features data is transformed to binary data
cols_bin

data[cols_bin].head()

fit_models(data)
derive_positions(data) #prediction
def evaluate_strats(data):
global sel
sel=[]
for model in models.keys():
col = 'strat_' + model
data[col] = data['pos_'+model] * data['returns']
sel.append(col)
sel.insert(0, 'returns')
evaluate_strats(data) #result

data[sel].sum().apply(np.exp)

data[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of EUR/USD and classification-based trading strategies (five binary lags) over time'
)

Five Digitized Features

Finally, the following code uses the first and second moment of the historical log returns to digitize the features data, allowing for more possible feature value combinations. This improves the performance of all classification algorithms used, but for SVM the improvement is again most pronounced (see Figure

mu = data['returns'].mean() #The mean log return and
std = data['returns'].std() #the standard deviation are used

#to digitize the features data.

bins = [mu-std, mu, mu+std]
bins

def create_bins(data, bins=[0]):
global cols_bin
cols_bin =[]
for col in cols:
col_bin = col + '_bin'
#Digitizes the feature values given the bins parameter.
data[col_bin] = np.digitize(data[col], bins=bins) #return the index
cols_bin.append(col_bin) #list
create_bins(data, bins)
data[cols_bin].head()

def derive_positions(data):#A function that derives all position values from the fitted models.
for model in models.keys():
data['pos_' + model] = models[model].predict(data[cols_bin])
derive_positions(data)

def evaluate_strats(data):
global sel
sel=[]
for model in models.keys():
col = 'strat_' + model
data[col] = data['pos_'+model] * data['returns']
sel.append(col)
sel.insert(0, 'returns')
evaluate_strats(data)

data[sel].sum().apply(np.exp)

colormap={
'returns':'m', #puple-red
'strat_freq':'y', #yellow
'strat_log_reg':'g', #green
'strat_gauss_nb':'r', #red
'strat_svm':'b' #blue
}
data[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of EUR/USD and classification-based trading strategies (five digitized lags) over time'
,style=colormap
)

#####################################################

TYPES OF FEATURES

This chapter exclusively works with lagged return data as features data, mostly in binarized or digitized form. This is mainly
done for convenience, since such features data can be derived from the financial time series itself. However, in practical
applications the features data can be gained from a wealth of different data sources and might include other financial time
series and statistics derived thereof 衍生的统计数据, macroeconomic data, company financial indicators, or news articles. Refer to López de Prado (2018) for an in-depth discussion of this topic. There are also Python packages for automated time series feature extraction available, such as tsfresh.

#####################################################

Sequential Train-Test Split

To better judge the performance of the classification algorithms, the code that follows implements a sequential train-test split. The idea here is to simulate the situation where only data up to a certain point in time is available on which to train an ML algorithm. During live trading, the algorithm is thenfaced with data it has never seen before. This is where the algorithm must prove its worth. In this particular case, all classification algorithms outperform — under the simplified assumptions from before — the passive benchmark investment, but only the GNB and LR algorithms achieve a positive absolute performance

split = int(len(data) * 0.5)
split

#copy reference: https://blog.youkuaiyun.com/u010712012/article/details/79754132
train = data.iloc[:split].copy()

#Trains all classification algorithms on the training data
fit_models(train)

test = data.iloc[split:].copy()

derive_positions(test) #prediction

def evaluate_strats(data):
global sel
sel=[]
for model in models.keys():
col = 'strat_' + model
data[col] = data['pos_'+model] * data['returns']
sel.append(col)
sel.insert(0, 'returns')
evaluate_strats(test) #result

test[sel].sum().apply(np.exp)

colormap={
'returns':'m', #puple-red
'strat_freq':'y', #yellow
'strat_log_reg':'b', #blue
'strat_gauss_nb':'k', #black
'strat_svm':'r' #red
}
test[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of EUR/USD and classification-based trading strategies (sequential train-test split)'
,style=colormap
)

only the GNB and LR algorithms achieve a positive absolute performance(always)

Randomized Train-Test Split

The classification algorithms are trained and tested on binary or digitized features data. The idea is that the feature value patterns allow a prediction of future market movements with a better hit ratio than 50%. Implicitly, it is assumed that the patterns’ predictive power persists over time. In that sense, it shouldn’t make (too much of) a difference on which part of the data an algorithm is trained and on which part of the data it is tested — implying that one can break up the temporal sequence时间顺序 of the data for training and testing.

A typical way to do this is a randomized train-test split to test the performance of the classification algorithms out-of-sample — again trying to emulate reality, where an algorithm during trading is faced with new data on a continuous basis. The approach used is the same as that applied to the sample data in “Train-test splits: Support vector machines”. Based on this approach, the SVM algorithm shows again the best performance out-of-sample

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.5, shuffle=True, random_state=100)

train.head()

#Train and test data sets are copied and brought back in temporal order.
train = train.copy().sort_index()
train.head()

train[cols_bin].head()

test = test.copy().sort_index()

test[sel].sum().apply(np.exp)

colormap={
'returns':'m', #puple-red
'strat_freq':'w', #white
'strat_log_reg':'y', #yellow
'strat_gauss_nb':'r', #red
'strat_svm':'b' #blue
}
test[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
title='Performance of EUR/USD and classification-based trading strategies (randomized train-test split)'
,style=colormap
)