cp11_15_TradingStrategies(dataframe.plot.scatter)

This article explores a quantitative trading strategy based on simple moving averages and then uses machine learning techniques such as linear regression and support vector machines to predict the direction of market movements, optimizing strategy performance through different feature combinations.

Simple Moving Averages (SMAs)

This section focuses on an algorithmic trading strategy based on simple moving averages and how to backtest such a strategy.

A simple moving average relies on the simplest of statistics: it takes the average of the prices over a specified past period. The calculation is as simple as the name suggests: the daily averages are connected into a line that moves forward with time, and every observation in the window receives the same weight.
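For instance (a minimal illustration with made-up prices, not part of the original notebook), a 3-day SMA is just the arithmetic mean of the last three prices, which pandas computes with a rolling window:

import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])   # hypothetical closing prices
sma3 = prices.rolling(3).mean()                       # last value: (12 + 13 + 14) / 3 = 13.0
print(sma3)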

Data Import

import numpy as np
import pandas as pd
import datetime as dt
from pylab import mpl, plt

plt.style.use('seaborn')
mpl.rcParams['font.family'] = 'serif'
%matplotlib inline

Second, the reading of the raw data and the selection of the financial time series for a single symbol, the stock of Apple, Inc. (AAPL.O). The analysis in this section is based on end-of-day data; intraday data is used in subsequent sections:

raw = pd.read_csv('../source/tr_eikon_eod_data.csv')
raw.head()

raw = pd.read_csv('../source/tr_eikon_eod_data.csv', index_col=0)
raw.head()

raw.info()

raw = pd.read_csv('../source/tr_eikon_eod_data.csv', index_col=0, parse_dates=True)
raw.head()

raw.info()

type(raw.index)

symbol='AAPL.O'

data = (
    pd.DataFrame(raw[symbol]).dropna()
)
data.head()

Trading Strategy

Third, the calculation of the SMA values for two different rolling window sizes. Figure 15-1 shows the three time series visually:


SMA1 = 42
SMA2 = 252

data['SMA1'] = data[symbol].rolling(SMA1).mean() #Calculates the values for the shorter SMA.
data['SMA2'] = data[symbol].rolling(SMA2).mean() #Calculates the values for the longer SMA.

data.plot(figsize=(10,6), title='Apple stock price and two simple moving averages')

Fourth, the derivation of the positions. The trading rules are:

Buy signal: go long (= +1) when the shorter SMA is above the longer SMA.

Sell signal: go short (= -1) when the shorter SMA is below the longer SMA.

Wait (park in cash): when the 42-day trend is within a range of +/- SD points around the 252-day trend. (The code below implements only the first two rules; a sketch of how this third rule could be added follows the link below.)

Technical Analysis: https://blog.youkuaiyun.com/Linli522362242/article/details/90110433
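As a hedged sketch only (not part of the original code), the third (wait) rule could be added by mapping three regimes to +1, -1, and 0 with np.select; SD and the column name Position3 are assumptions for illustration:

#Sketch: three-regime positioning with a neutral band of +/- SD points around the longer SMA.
SD = 5  # assumed band width in index points
conditions = [data['SMA1'] > data['SMA2'] + SD,   # shorter SMA clearly above the longer one -> long
              data['SMA1'] < data['SMA2'] - SD]   # shorter SMA clearly below the longer one -> short
data['Position3'] = np.select(conditions, [1, -1], default=0)  # otherwise park in cash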

data.dropna(inplace=True)
#np.where(cond, a, b) evaluates the condition cond element-wise and places a when True and b otherwise.
data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)
data.tail()

                #right side y-axis
ax = data.plot(secondary_y='Position', figsize=(10,6), title='Apple stock price, two SMAs, and resulting positions')
ax.get_legend().set_bbox_to_anchor((0.25, 0.85))

This replicates the results derived in Chapter 8. What is not addressed there is whether following the trading rules — i.e., implementing the algorithmic trading strategy — is superior to the benchmark case of simply going long on the Apple stock over the whole period. Given that the strategy leads to only two periods during which the Apple stock should be shorted, differences in performance can only result from these two periods.

Vectorized Backtesting

The vectorized backtesting can now be implemented as follows.

First, the log returns are calculated.

Then the positionings, represented as +1 or -1, are multiplied by the relevant log return. This simple calculation is possible since a long position earns the return of the Apple stock and a short position earns the negative return of the Apple stock.

Finally, the log returns for the Apple stock and the algorithmic trading strategy based on SMAs need to be added up and the exponential function applied to arrive at the performance values:

data[symbol].head()

data[symbol].shift(1).head()

#Calculates the log returns of the Apple stock (i.e., the benchmark investment).
#symbol='AAPL.O'
data['Returns'] = np.log(data[symbol]/data[symbol].shift(1))  #ln == log_e
data.head()

#Multiplies the position values, shifted by one day, by the log returns of the Apple stock; the shift
#is required to avoid a foresight bias
data['Strategy'] = data['Position'].shift(1) * data['Returns']
data.round(4).head()

data.dropna(inplace=True)

#Sums up the log returns for the strategy 
#and the benchmark investment 
#and calculates the exponential value to arrive at the absolute performance.
np.exp(data[['Returns','Strategy']].sum())

#Calculates the annualized volatility for the strategy and the benchmark investment.
data[['Returns', 'Strategy']].std() * 252**0.5

The numbers show that the algorithmic trading strategy indeed outperforms the benchmark investment of passively holding the Apple stock (Strategy: 5.811299 > Returns: 4.017148). Due to the type and characteristics of the strategy, the annualized volatility is the same (0.250), so that it also outperforms the benchmark investment on a risk-adjusted basis.
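As a small additional sketch (not in the original notebook), the risk-adjusted comparison can be made explicit by relating the annualized mean log return to the annualized volatility (a Sharpe-like ratio, ignoring the risk-free rate):

#Assumed helper calculation, not part of the original notebook.
ann_mean = data[['Returns', 'Strategy']].mean() * 252         # annualized mean log return
ann_vol = data[['Returns', 'Strategy']].std() * 252 ** 0.5    # annualized volatility
ann_mean / ann_vol                                            # risk-adjusted comparison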

To gain a better picture of the overall performance, Figure 15-3 shows the performance of the Apple stock and the algorithmic trading strategy over time:

ax = data[['Returns', 'Strategy']].cumsum().apply(np.exp).plot(figsize=(10,6), 
                                                                              title='Performance of Apple stock and SMA-based trading strategy over time')
#data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)
data['Position'].plot(ax=ax, secondary_y='Position', style='--')
ax.get_legend().set_bbox_to_anchor((0.25, 0.85))

SIMPLIFICATIONS

The vectorized backtesting approach as introduced in this subsection is based on a number of simplifying assumptions. Among others, transaction costs (fixed fees, bid-ask spreads, lending costs, etc.) are not included. This might be justifiable for a trading strategy that leads to only a few trades over multiple years. It is also assumed that all trades take place at the end-of-day closing prices for the Apple stock. A more realistic backtesting approach would take these and other (market microstructure) elements into account.
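As an illustrative sketch only (not part of the original backtest), proportional transaction costs could be approximated by charging a fee whenever the position changes; the cost level tc and the column name Strategy_tc are assumed for illustration:

#Sketch: subtract an (assumed) proportional cost in log-return terms on days the position changes.
tc = 0.001
trades = data['Position'].diff().fillna(0) != 0   # days on which the position changes
data['Strategy_tc'] = data['Strategy'] - trades * tc
np.exp(data[['Returns', 'Strategy', 'Strategy_tc']].sum())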

Optimization

A natural question that arises is if the chosen parameters SMA1=42 and SMA2=252 are the “right” ones. In general, investors prefer higher returns to lower returns ceteris paribus. Therefore, one might be inclined to search for those parameters that maximize the return over the relevant period. To this end, a brute force approach can be used that simply repeats the whole vectorized backtesting procedure for different parameter combinations, records the results, and does a ranking afterward. This is what the following code does:

from itertools import product  # Cartesian product

sma1 = range(20, 61, 4)     #Specifies the parameter values for SMA1.
sma2 = range(180, 281, 10)  #Specifies the parameter values for SMA2.

results_list = []
for SMA1, SMA2 in product(sma1, sma2):  #Combines all values for SMA1 with those for SMA2.
    data = pd.DataFrame(raw[symbol])
    data.dropna(inplace=True)

    data['SMA1'] = data[symbol].rolling(SMA1).mean()
    data['SMA2'] = data[symbol].rolling(SMA2).mean()
    data.dropna(inplace=True)

    data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)

    data['Returns'] = np.log(data[symbol] / data[symbol].shift(1))
    data['Strategy'] = data['Position'].shift(1) * data['Returns']
    data.dropna(inplace=True)

    perf = np.exp(data[['Returns', 'Strategy']].sum())  #Series with the gross performance values

    #Records the vectorized backtesting results for the current parameter combination.
    results_list.append({'SMA1': SMA1,
                         'SMA2': SMA2,
                         'MARKET': perf['Returns'],
                         'STRATEGY': perf['Strategy'],
                         'OUT': perf['Strategy'] - perf['Returns']})

#Collects all results in a single DataFrame object (DataFrame.append is deprecated in recent pandas).
results = pd.DataFrame(results_list,
                       columns=['SMA1', 'SMA2', 'MARKET', 'STRATEGY', 'OUT'])
results.head()

The following code gives an overview of the results and shows the seven best-performing parameter combinations of all those backtested. The ranking is implemented according to the outperformance of the algorithmic trading strategy compared to the benchmark investment. The performance of the benchmark investment varies since the choice of the SMA2 parameter influences the length of the time interval and data set on which the vectorized backtest is implemented:

results.info()

results.sort_values('OUT', ascending=False).head(7)

According to the brute force-based optimization, SMA1=40 and SMA2=190 are the optimal parameters, leading to an outperformance of some 230 percentage points. However, this result is heavily dependent on the data set used and is prone to overfitting. A more rigorous approach would be to implement the optimization on one data set, the in-sample or training data set, and to test it on another one, the out-of-sample or testing data set.

OVERFITTING

In general, any type of optimization, fitting, or training in the context of algorithmic trading strategies is prone to what is called overfitting. This means that parameters might be chosen that perform (exceptionally) well for the used data set but might perform (exceptionally) badly on other data sets or in practice.
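A hedged sketch of such an in-sample/out-of-sample check (the function name backtest_sma, the 50/50 split, and the 40/190 choice are illustrative assumptions, not book code): optimize the SMA parameters on the first half of the data and re-run the vectorized backtest of the chosen combination on the unseen second half.

def backtest_sma(prices, SMA1, SMA2):
    df = pd.DataFrame({'price': prices}).dropna()
    df['SMA1'] = df['price'].rolling(SMA1).mean()
    df['SMA2'] = df['price'].rolling(SMA2).mean()
    df.dropna(inplace=True)
    df['Position'] = np.where(df['SMA1'] > df['SMA2'], 1, -1)
    df['Returns'] = np.log(df['price'] / df['price'].shift(1))
    df['Strategy'] = df['Position'].shift(1) * df['Returns']
    df.dropna(inplace=True)
    return np.exp(df[['Returns', 'Strategy']].sum())  # gross performance of benchmark and strategy

prices = raw[symbol].dropna()   # symbol is still 'AAPL.O' here
split = len(prices) // 2
in_sample, out_of_sample = prices.iloc[:split], prices.iloc[split:]
#The brute-force loop above would be re-run on in_sample only; the best in-sample
#parameters (e.g., 40/190) are then evaluated once on the second half:
backtest_sma(out_of_sample, 40, 190)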

Random Walk Hypothesis

The previous section introduces vectorized backtesting as an efficient tool to backtest algorithmic trading strategies. The single strategy backtested based on a single financial time series, namely historical end-of-day prices for the Apple stock, outperforms the benchmark investment of simply going long on the Apple stock over the same period.

Although rather specific in nature, these results are in contrast to what the random walk hypothesis (RWH) predicts, namely that such predictive approaches should not yield any outperformance at all. The RWH postulates that prices in financial markets follow a random walk, or, in continuous time, an arithmetic Brownian motion without drift. The expected value of an arithmetic Brownian motion without drift at any point in the future equals its value today. As a consequence, the best predictor for tomorrow’s price, in a least-squares sense, is today’s price if the RWH applies.
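A quick simulation sketch (not from the book) illustrates the point: for a driftless random walk, regressing the level on its first lag yields a slope close to 1 and an intercept close to 0, so today's value is, in a least-squares sense, the best predictor of tomorrow's value.

#Simulate a driftless arithmetic random walk and regress S_t on S_{t-1} (illustration only).
rng = np.random.default_rng(42)                 # assumed seed for reproducibility
S = 100 + np.cumsum(rng.normal(0, 1, 10000))    # hypothetical path starting at 100
slope, intercept = np.polyfit(S[:-1], S[1:], 1)
slope, intercept                                # slope is close to 1, intercept close to 0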

For many years, economists, statisticians, and teachers of finance have been interested in developing and testing models of stock price behavior. One important model that has evolved from this research is the theory of random walks. This theory casts serious doubt on many other methods for describing and predicting stock price behavior — methods that have considerable popularity outside the academic world. For example, we shall see later that, if the random-walk theory is an accurate description of reality, then the various “technical” or “chartist” procedures for predicting stock prices are completely without value.

The RWH is consistent with the efficient markets hypothesis (EMH), which, non-technically speaking, states that market prices reflect “all available information.” Different degrees of efficiency are generally distinguished, such as weak, semi-strong, and strong, defining more specifically what “all available information” entails. Formally, such a definition can be based on the concept of an information set in theory and on a data set for programming purposes, as the following quote illustrates:

A market is efficient with respect to an information set S if it is impossible to make economic profits by trading on the basis of information set S.

Using Python, the RWH can be tested for a specific case as follows. A financial time series of historical market prices is used for which a number of lagged versions are created — say, five. OLS regression is then used to predict the market prices based on the lagged market prices created before. The basic idea is that the market prices from yesterday and four more days back can be used to predict today’s market price.

The following Python code implements this idea and creates five lagged versions of the historical end-of-day closing levels of the S&P 500 stock index:

raw.head()

symbol = '.SPX'
data = pd.DataFrame(raw[symbol])

lags = 5
cols = []
for lag in range(1, lags+1):
    #Defines a column name for the current lag value.
    col = 'lag_{}'.format(lag)
    #Creates the lagged version of the market prices for the current lag value.
    data[col] = data[symbol].shift(lag)
    #Collects the column names for later reference.
    cols.append(col)
data.head(7)

data.dropna(inplace=True)
data.head()

Using NumPy, the OLS regression is straightforward to implement. As the optimal regression parameters show, lag_1 indeed is the most important one in predicting the market price based on OLS regression. Its value is close to 1. The other four values are rather close to 0. Figure 15-4 visualizes the optimal regression parameter values.
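A minimal sketch of such a regression via np.linalg.lstsq (the names reg and data['prediction'] are illustrative additions; cols and data are the objects created above):

#Least-squares solution of data[cols] @ reg ~ data[symbol]; reg holds the five lag coefficients.
reg = np.linalg.lstsq(data[cols], data[symbol], rcond=None)[0]
reg   # the lag_1 coefficient should be close to 1, the others close to 0

#Prediction values implied by the optimal parameters (for a comparison as in Figure 15-5).
data['prediction'] = np.dot(data[cols], reg)
data[[symbol, 'prediction']].plot(figsize=(10, 6))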

When using the optimal results to visualize the prediction values as compared to the original index values for the S&P 500, it becomes obvious from Figure 15-5 that lag_1 is basically what is used to come up with the prediction value. Graphically speaking, the prediction line in Figure 15-5 is the original time series shifted by one day to the right (with some minor adjustments).

All in all, the brief analysis in this section reveals some support for both the RWH and the EMH. For sure, the analysis is done for a single stock index only and uses a rather specific parameterization — but this can easily be widened to incorporate multiple financial instruments across multiple asset classes, different values for the number of lags, etc. In general, one will find that the results are qualitatively more or less the same. After all, the RWH and EMH are among the financial theories that have broad empirical support. In that sense, any algorithmic trading strategy must prove its worth by showing that the RWH does not apply in general. This for sure is a tough hurdle.

Linear OLS Regression

This section applies linear OLS regression to predict the direction of market movements based on historical log returns. To keep things simple, only two features are used. The first feature (lag_1) represents the log returns of the financial time series lagged by one day. The second feature (lag_2) lags the log returns by two days. Log returns — in contrast to prices — are stationary in general, which often is a necessary condition for the application of statistical and ML algorithms.
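As a hedged aside (not part of the original code; statsmodels is an assumed extra dependency), this stationarity claim can be checked with an augmented Dickey-Fuller test once the EUR/USD data below has been prepared: price levels typically fail to reject a unit root, while log returns reject it clearly.

#Sketch: run after 'data' and data['returns'] have been created in the next subsection.
from statsmodels.tsa.stattools import adfuller

print('ADF p-value, prices :', adfuller(data[symbol])[1])      # large p-value: cannot reject a unit root
print('ADF p-value, returns:', adfuller(data['returns'])[1])   # small p-value: rejects a unit root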

The basic idea behind the usage of lagged log returns as features is that they might be informative in predicting future returns. For example, one might hypothesize that after two downward movements an upward movement is more likely (“mean reversion”), or, to the contrary, that another downward movement is more likely (“momentum” or “trend”). The application of regression techniques allows the formalization of such informal reasonings.

The Data

First, the importing and preparation of the data set. Figure 15-6 shows the frequency distribution of the daily historical log returns for the EUR/USD exchange rate. They are the basis for the features as well as the labels to be used in what follows:

raw = pd.read_csv('../source/tr_eikon_eod_data.csv', index_col=0, parse_dates=True).dropna()
raw.head()

raw.columns

symbol = 'EUR='
data = pd.DataFrame(raw[symbol])  #raw[symbol] is a series

data.head()

#why log?
#Try to multiply many small numbers in Python. Eventually it rounds off to 0.
data['returns'] = np.log(data/data.shift(1))
data.dropna(inplace=True)  #since data.shift(1)

data.head()

data['direction'] = np.sign(data['returns']).astype(int)
data.head()

data['returns'].hist(bins=35, figsize=(10,6))
plt.title('Histogram of log returns for EUR/USD exchange rate')

#a check whether the data['returns'] values are (approximately) normally distributed
#(which would mean the price levels themselves are log-normally distributed)

#If the kurtosis is greater than three, the peak is sharper than that of a normal distribution, and vice versa.

#For the same standard deviation, a larger kurtosis implies more extreme values, so the remaining values
#cluster more tightly around the mode and the distribution becomes more peaked.

#https://blog.youkuaiyun.com/Linli522362242/article/details/99728616
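As a hedged sketch (not part of the original code), skewness, kurtosis, and a formal normality test for the log returns can be computed with scipy.stats:

import scipy.stats as scs   # assumed extra import for this check

print('skew:            {:.4f}'.format(scs.skew(data['returns'])))
print('excess kurtosis: {:.4f}'.format(scs.kurtosis(data['returns'])))   # 0 for a normal distribution
print('normaltest p:    {:.4f}'.format(scs.normaltest(data['returns'])[1]))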

Second, the code that creates the features data by lagging the log returns and visualizes it in combination with the returns data (see Figure 15-7):

lags = 2
def create_lags(data):
    global cols
    cols = []
    for lag in range(1, lags + 1):
        col = 'lag_{}'.format(lag)
        data[col] = data['returns'].shift(lag)  #lagged log returns as features
        cols.append(col)

create_lags(data)
data.head()

data.dropna(inplace=True) #since shift()

data.plot.scatter(x='lag_1', y='lag_2', c='returns', cmap='coolwarm', figsize=(10,6), colorbar=True)
plt.axvline(0,c='r', ls='--')
plt.axhline(0,c='r', ls='--')
plt.title('Scatter plot based on features and labels data')

Regression

With the data set completed, linear OLS regression can be applied to learn about any potential (linear) relationships, to predict market movement based on the features, and to backtest a trading strategy based on the predictions. Two basic approaches are available: using the log returns or only the direction data as the dependent variable during the regression. In any case, predictions are real-valued and therefore transformed to either +1 or -1 to only work with the direction of the prediction:

#The linear OLS regression implementation from scikit-learn is used.
from sklearn.linear_model import LinearRegression

#model
model=LinearRegression()

cols

#fit and then predict
#The regression is implemented on the log returns directly …
data['pos_ols_1'] = model.fit(data[cols], data['returns']).predict(data[cols])
#… and on the direction data which is of primary interest.
#data['direction'] = np.sign(data['returns']).astype(int)
data['pos_ols_2'] = model.fit(data[cols], data['direction']).predict(data[cols])

data[['pos_ols_1', 'pos_ols_2']].head()

data[['pos_ols_1', 'pos_ols_2']] = np.where( data[['pos_ols_1', 'pos_ols_2']]>0, 1,-1 )
data[['pos_ols_1', 'pos_ols_2']].head()

#The real-valued predictions are transformed to directional values (+1, -1).
data['pos_ols_1'].value_counts()

#The real-valued predictions are transformed to directional values (+1, -1).
data['pos_ols_2'].value_counts()

#However, both lead to a relatively large number of trades over time.
(data['pos_ols_1'].diff() != 0).sum()

#However, both lead to a relatively large number of trades over time.
(data['pos_ols_2'].diff() !=0).sum()

Equipped with the directional predictions, vectorized backtesting can be applied to judge the performance of the resulting trading strategies. At this stage, the analysis is based on a number of simplifying assumptions, such as “zero transaction costs” and the usage of the same data set for both training and testing. Under these assumptions, both regression-based strategies outperform the benchmark passive investment, while only the strategy trained on the direction of the market shows a positive overall performance:

data['strat_ols_1'] = data['pos_ols_1'] * data['returns']
data['strat_ols_2'] = data['pos_ols_2'] * data['returns']

data.head()

data[ ['returns', 'strat_ols_1', 'strat_ols_2'] ].sum().apply(np.exp)

The strategy trained on the direction of the market shows a better performance (1.339286 > 0.942422).

#Shows the number of correct and false predictions by the strategies
(data['direction'] == data['pos_ols_1']).value_counts()

#Shows the number of correct and false predictions by the strategies
(data['direction'] == data['pos_ols_2']).value_counts()

data[['returns', 'strat_ols_1', 'strat_ols_2']].cumsum().apply(np.exp).plot(figsize=(10,6),title='Performance of EUR/USD and regression-based strategies over time')

Only the strategy trained on the direction of the market shows a positive overall performance.

Clustering

This section applies k-means clustering, as introduced in “Machine Learning”, to financial time series data to automatically come up with clusters that are used to formulate a trading strategy. The idea is that the algorithm identifies two clusters of feature values that predict either an upward movement or a downward movement.

The following code applies the k-means algorithm to the two features as used before. Figure 15-9 visualizes the two clusters:

from sklearn.cluster import KMeans

#Two clusters are chosen for the algorithm.
model = KMeans(n_clusters=2, random_state=0)

model.fit( data[cols] )

data.head()

data['pos_clus'] = model.predict(data[cols])
data.head()

#Given the cluster values, the position is chosen.
data['pos_clus'] = np.where(data['pos_clus'] == 1, -1, 1)

plt.figure(figsize=(10,6))
plt.scatter(data[cols].iloc[:,0],  #lag_1
                  data[cols].iloc[:,1],       #lag_2
                 c=data['pos_clus'],
                 cmap='coolwarm'
           )
plt.title('Two clusters as identified by the k-means algorithm')

Admittedly, this approach is quite arbitrary in this context — after all, how should the algorithm know what one is looking for? However, the resulting trading strategy shows a slight outperformance at the end compared to the benchmark passive investment (see Figure 15-10). It is noteworthy that no guidance (supervision) is given and that the hit ratio — i.e., the number of correct predictions in relationship to all predictions made — is less than 50%:

data['strat_clus'] = data['pos_clus'] * data['returns']

data[['returns', 'strat_clus']].sum().apply(np.exp)

(data['direction'] == data['pos_clus']).value_counts()
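As a small illustrative addition (not part of the original code), the hit ratio itself can be computed as the mean of the Boolean comparison:

#Fraction of correct directional predictions (hit ratio); hit_ratio is an assumed name.
hit_ratio = (data['direction'] == data['pos_clus']).mean()
hit_ratio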

data[['returns', 'strat_clus']].cumsum().apply(np.exp).plot(figsize=(10,6), title='Performance of EUR/USD and k-means-based strategy over time')

Frequency Approach

Beyond more sophisticated algorithms and techniques, one might come up with the idea of just implementing a frequency approach to predict directional movements in financial markets. To this end, one might transform the two real-valued features to binary ones and assess the probability of an upward and a downward movement, respectively, from the historical observations of such movements, given the four possible combinations for the two binary features ((0, 0), (0, 1), (1, 0), (1, 1)).

data[cols[0]].head()

np.digitize(data[cols[0]], bins=[0])[:5]

def create_bins(data, bins=[0]):
    global cols_bin
    cols_bin =[]
    for col in cols:
        col_bin = col + '_bin'
        #Digitizes the feature values given the bins parameter.
        data[col_bin] = np.digitize(data[col], bins=bins)
        cols_bin.append(col_bin)  #list

create_bins(data)

data.head()

data[cols_bin+['direction']].head()

grouped = data.groupby(cols_bin + ['direction'])
#Shows the frequency of the possible movements conditional on the feature value combinations.
#direction: current return  # lag_1 : yesterday's return  # lag_2: The day before yesterday's return
grouped.size()

grouped['direction'].size()

#Transforms the DataFrame object to have the frequencies in columns.
res = grouped['direction'].size().unstack(fill_value=0)

res

def highlight_max(s):
    is_max=( s==s.max() )
    print(s)
    print(is_max)
    print('#'*10)
    return ['background-color:yellow' if v else '' for v in is_max]

#Highlights the highest-frequency value per feature value combination.
#res = grouped['direction'].size().unstack(fill_value=0)
res.style.apply(highlight_max, axis=1)

Given the frequency data, three feature value combinations hint at a downward movement while one makes an upward movement seem more likely. This translates into a trading strategy whose performance is shown below:

#data[col] = data['returns'].shift(lag
#data[col_bin] = np.digitize(data[col], bins=bins)   #0: positive ; 1: negative
data[cols_bin].head()

data[cols_bin].sum(axis=1).head() 

#Translates the findings given the frequencies to a trading strategy.
data['pos_freq'] = np.where(data[cols_bin].sum(axis=1) == 2, -1, 1)
#since  data['direction']== 1 or -1
(data['direction'] == data['pos_freq']).value_counts()

data['strat_freq'] = data['pos_freq']*data['returns']

data[['returns', 'strat_freq']].sum().apply(np.exp)

##green curve: the strategy does not always perform better than the passive investment (returns curve)

data[['returns', 'strat_freq']].cumsum().apply(np.exp).plot(figsize=(10,6), 
                                                                           title='Performance of EUR/USD and frequency-based trading strategy over time'
                                                           )

Classification

This section applies the classification algorithms from ML (as introduced in “Machine Learning”) to the problem of predicting the direction of price movements in financial markets. With that background and the examples from previous sections, the application of the logistic regression, Gaussian Naive Bayes, and support vector machine approaches is as straightforward as applying them to smaller sample data sets.

Two Binary Features

First, a fitting of the models based on the binary feature values and the derivation of the resulting position values:

from sklearn import linear_model
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

C = 1  #Penalty parameter C of the error term.

models = {
    'log_reg': linear_model.LogisticRegression(C=C),
    'gauss_nb': GaussianNB(),
    'svm': SVC(C=C)
}

def fit_models(data): #A function that fits all models.
    mfit = {model: models[model].fit(data[cols_bin], # lag_1,lag_2
                                     data['direction'] #data['direction'] = np.sign(data['returns']).astype(int)
                                    )
            for model in models.keys()
           }

fit_models(data)

def derive_positions(data):#A function that derives all position values from the fitted models.
    for model in models.keys():
        data['pos_' + model] = models[model].predict(data[cols_bin])

derive_positions(data)

Second, the vectorized backtesting of the resulting trading strategies. Figure 15-12 visualizes the performance over time:

def evaluate_strats(data):
    global sel
    sel=[]
    for model in models.keys():
        col = 'strat_' + model
        data[col] = data['pos_'+model] * data['returns']
        sel.append(col)
    sel.insert(0, 'returns')  
 

evaluate_strats(data)
sel.insert(1, 'strat_freq')

data.head()

#Some strategies might show the exact same performance.
#data[['returns', 'strat_freq', 'strat_log_reg', 'strat_gauss_nb', 'strat_svm']]
data[sel].sum().apply(np.exp)

colormap={
    'returns':'m',  #purple-red (magenta)
    'strat_freq':'y', #yellow
    'strat_log_reg':'b', #blue
    'strat_gauss_nb':'k', #black
    'strat_svm':'r'  #red
}

data[sel].cumsum().apply(np.exp).plot(figsize=(10,6), 
                                      title='Performance of EUR/USD and classification-based trading strategies (two binary lags) over time',
                                      style=colormap
                                     )

data[['strat_freq', 'strat_log_reg']].cumsum().apply(np.exp).plot(figsize=(10,6), 
                                      title='Performance of EUR/USD and classification-based trading strategies (two binary lags) over time',
                                      style=colormap
                                     )

Five Binary Features

In an attempt to improve the strategies’ performance, the following code works with five binary lags instead of two. In particular, the performance of the SVM-based strategy is significantly improved (see the figure below). On the other hand, the performance of the LR- and GNB-based strategies is worse:

data = pd.DataFrame(raw[symbol])
data.head()

data['returns'] = np.log(data/data.shift(1))
# d:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in sign
#  """Entry point for launching an IPython kernel.
data.dropna(inplace=True)
#or data['direction'] = np.sign(data['returns']).astype(int)
data['direction'] = np.sign(data['returns'])

data.head()

lags =5 #Five lags of the log returns series are now used.
def create_lags(data):
    global cols
    cols = []
    for lag in range(1, lags+1):
        col = 'lag_{}'.format(lag)
        data[col] = data['returns'].shift(lag)
        cols.append(col)
create_lags(data)
data.dropna(inplace=True)

data.head()

def create_bins(data, bins=[0]):
    global cols_bin
    cols_bin =[]
    for col in cols:
        col_bin = col + '_bin'
        #Digitizes the feature values given the bins parameter.
        data[col_bin] = np.digitize(data[col], bins=bins)
        cols_bin.append(col_bin)  #list
create_bins(data)#The real-valued features data is transformed to binary data
cols_bin

data[cols_bin].head()

fit_models(data)
derive_positions(data) #prediction

def evaluate_strats(data):
    global sel
    sel=[]
    for model in models.keys():
        col = 'strat_' + model
        data[col] = data['pos_'+model] * data['returns']
        sel.append(col)
    sel.insert(0, 'returns')
evaluate_strats(data) #result

data[sel].sum().apply(np.exp)

data[sel].cumsum().apply(np.exp).plot(figsize=(10,6), 
                                      title='Performance of EUR/USD and classification-based trading strategies (five binary lags) over time'
                                     )

As the results show, working with five binary lags instead of two significantly improves the performance of the SVM-based strategy in particular, while the performance of the LR- and GNB-based strategies is worse.

Five Digitized Features

Finally, the following code uses the first and second moment of the historical log returns to digitize the features data, allowing for more possible feature value combinations. This improves the performance of all the classification algorithms used, but for SVM the improvement is again most pronounced (see the figure below):

mu = data['returns'].mean()   #The mean log return and ...
std = data['returns'].std()   #... the standard deviation are used to digitize the features data.

bins = [mu - std, mu, mu + std]
bins

def create_bins(data, bins=[0]):
    global cols_bin
    cols_bin =[]
    for col in cols:
        col_bin = col + '_bin'
        #Digitizes the feature values given the bins parameter.
        data[col_bin] = np.digitize(data[col], bins=bins) #return the index
        cols_bin.append(col_bin)  #list
create_bins(data, bins)
data[cols_bin].head()

def fit_models(data): #A function that fits all models.
    mfit = {model: models[model].fit(data[cols_bin], # lag_1,lag_2
                                     data['direction'] #data['direction'] = np.sign(data['returns']).astype(int)
                                    )
            for model in models.keys()
           }
fit_models(data)

def derive_positions(data):#A function that derives all position values from the fitted models.
    for model in models.keys():
        data['pos_' + model] = models[model].predict(data[cols_bin])
derive_positions(data)

def evaluate_strats(data):
    global sel
    sel=[]
    for model in models.keys():
        col = 'strat_' + model
        data[col] = data['pos_'+model] * data['returns']
        sel.append(col)
    sel.insert(0, 'returns')
evaluate_strats(data)

data[sel].sum().apply(np.exp)

colormap={
    'returns':'m',  #purple-red (magenta)
    'strat_freq':'y', #yellow
    'strat_log_reg':'g', #green
    'strat_gauss_nb':'r', #red
    'strat_svm':'b'  #blue
}
data[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
                                      title='Performance of EUR/USD and classification-based trading strategies (five digitized lags) over time'
                                      ,style=colormap
                                     )

#####################################################

TYPES OF FEATURES


This chapter exclusively works with lagged return data as features data, mostly in binarized or digitized form. This is mainly done for convenience, since such features data can be derived from the financial time series itself. However, in practical applications the features data can be gained from a wealth of different data sources and might include other financial time series and statistics derived thereof, macroeconomic data, company financial indicators, or news articles. Refer to López de Prado (2018) for an in-depth discussion of this topic. There are also Python packages for automated time series feature extraction available, such as tsfresh.

#####################################################

Sequential Train-Test Split

To better judge the performance of the classification algorithms, the code that follows implements a sequential train-test split. The idea here is to simulate the situation where only data up to a certain point in time is available on which to train an ML algorithm. During live trading, the algorithm is then faced with data it has never seen before. This is where the algorithm must prove its worth. In this particular case, all classification algorithms outperform — under the simplified assumptions from before — the passive benchmark investment, but only the GNB and LR algorithms achieve a positive absolute performance:

split = int(len(data) * 0.5)
split

#copy reference: https://blog.youkuaiyun.com/u010712012/article/details/79754132
train = data.iloc[:split].copy()

#Trains all classification algorithms on the training data
fit_models(train)

test = data.iloc[split:].copy()

derive_positions(test) #prediction

def evaluate_strats(data):
    global sel
    sel=[]
    for model in models.keys():
        col = 'strat_' + model
        data[col] = data['pos_'+model] * data['returns']
        sel.append(col)
    sel.insert(0, 'returns')
evaluate_strats(test) #result

test[sel].sum().apply(np.exp)

colormap={
    'returns':'m',  #purple-red (magenta)
    'strat_freq':'y', #yellow
    'strat_log_reg':'b', #blue
    'strat_gauss_nb':'k', #black
    'strat_svm':'r'  #red
}
test[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
                                      title='Performance of EUR/USD and classification-based trading strategies (sequential train-test split)'
                                      ,style=colormap
                                     )

Only the GNB and LR algorithms achieve a positive absolute performance over the test period.

 

Randomized Train-Test Split

The classification algorithms are trained and tested on binary or digitized features data. The idea is that the feature value patterns allow a prediction of future market movements with a better hit ratio than 50%. Implicitly, it is assumed that the patterns’ predictive power persists over time. In that sense, it shouldn’t make (too much of) a difference on which part of the data an algorithm is trained and on which part of the data it is tested — implying that one can break up the temporal sequence of the data for training and testing.

A typical way to do this is a randomized train-test split to test the performance of the classification algorithms out-of-sample — again trying to emulate reality, where an algorithm during trading is faced with new data on a continuous basis. The approach used is the same as that applied to the sample data in “Train-test splits: Support vector machines”. Based on this approach, the SVM algorithm shows again the best performance out-of-sample

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.5, shuffle=True, random_state=100)

train.head()

#Train and test data sets are copied and brought back in temporal order.
train = train.copy().sort_index()
train.head()

train[cols_bin].head()

test = test.copy().sort_index()

def fit_models(data): #A function that fits all models.
    mfit = {model: models[model].fit(data[cols_bin], # lag_1,lag_2
                                     data['direction'] #data['direction'] = np.sign(data['returns']).astype(int)
                                    )
            for model in models.keys()
           }
fit_models(train) #training

def derive_positions(data):#A function that derives all position values from the fitted models.
    for model in models.keys():
        data['pos_' + model] = models[model].predict(data[cols_bin])
derive_positions(test) #prediction

def evaluate_strats(data):
    global sel
    sel=[]
    for model in models.keys():
        col = 'strat_' + model
        data[col] = data['pos_'+model] * data['returns']
        sel.append(col)
    sel.insert(0, 'returns')
evaluate_strats(test) #result

test[sel].sum().apply(np.exp)

colormap={
    'returns':'m',  #purple-red (magenta)
    'strat_freq':'w', #white
    'strat_log_reg':'y', #yellow
    'strat_gauss_nb':'r', #red
    'strat_svm':'b'  #blue
}
test[sel].cumsum().apply(np.exp).plot(figsize=(10,6),
                                      title='Performance of EUR/USD and classification-based trading strategies (randomized train-test split)'
                                      ,style=colormap
                                     )

Based on this approach, the SVM algorithm is supposed to again show the best performance out-of-sample. I don't think so: in this run, the SVM-based strategy actually shows the worst performance.

 

 

selected_smoothed_data = self.smoothed_data[selected_smoothed_indices] # 确保筛选后的数据不为空 if selected_data.empty: self.statusBar.showMessage("在该时间范围内没有数据", 5000) return # 计算总位移变化和总时间 total_displacement = selected_smoothed_data[-1] - selected_smoothed_data[0] total_time_hours = (selected_data['time'].iloc[-1] - selected_data['time'].iloc[0]).total_seconds() / 3600.0 if total_time_hours == 0: self.statusBar.showMessage("时间范围太短,无法计算平均速率", 5000) return # 计算平均速率 = 总位移变化 / 总时间 avg_rate = total_displacement / total_time_hours # 保存平均速率供其他函数使用 self.avg_rate = avg_rate # 在状态栏中显示平均速率 self.statusBar.showMessage(f"平均速率为: {avg_rate:.6f} mm/h (基于平滑后数据)", 5000) else: self.statusBar.showMessage("请先加载数据并进行平滑处理", 5000) except Exception as e: self.statusBar.showMessage(f"计算平均速率时出错: {e}", 5000) # 新增功能1: 计算所有数据点的平均速率 def calculateAllPointsAverageRate(self): try: if self.data is None: self.statusBar.showMessage("请先加载数据", 5000) return # 从输入框获取开始和结束时间 start_time_str = self.startTimeInput.text() end_time_str = self.endTimeInput.text() if not start_time_str or not end_time_str: self.statusBar.showMessage("请先输入起始时间和终止时间", 5000) return # 将输入的时间转换为 datetime 格式 start_time = pd.to_datetime(start_time_str) end_time = pd.to_datetime(end_time_str) # 获取所有数据点列名(排除时间列) data_points = [col for col in self.data.columns if col != '扫描时间'] # 清空之前的计算结果 self.all_points_avg_rates = {} # 遍历所有数据点 for point in data_points: try: # 获取当前数据点的数据 point_data = self.data[['扫描时间', point]].dropna() # 将 '扫描时间' 列转换为 datetime 格式 point_data['扫描时间'] = pd.to_datetime(point_data['扫描时间'], errors='coerce') point_data = point_data[~point_data['扫描时间'].isnull()] displacement = pd.to_numeric(point_data[point], errors='coerce') # 删除无效数据行 valid_mask = ~point_data['扫描时间'].isnull() & ~displacement.isnull() valid_data = pd.DataFrame({ 'time': point_data['扫描时间'][valid_mask], 'displacement': displacement[valid_mask] }).reset_index(drop=True) if len(valid_data) < 10: continue # 动态设置窗口长度并进行平滑 window_length = min(51, len(valid_data) // 10 * 2 + 1) if window_length < 5: window_length = 5 if window_length > len(valid_data): window_length = len(valid_data) - 1 if len(valid_data) % 2 == 0 else len(valid_data) if window_length % 2 == 0: window_length -= 1 if window_length < 3: window_length = 3 displacement_smooth = savgol_filter( valid_data['displacement'], window_length=window_length, polyorder=min(3, window_length - 1) ) # 确保位移曲线是递增的 displacement_increasing = np.maximum.accumulate(displacement_smooth) # 筛选时间段内的数据 mask = (valid_data['time'] >= start_time) & (valid_data['time'] <= end_time) selected_data = valid_data.loc[mask] selected_smoothed_data = displacement_increasing[mask.values] # 确保筛选后的数据不为空 if selected_data.empty: continue # 计算总位移变化和总时间 total_displacement = selected_smoothed_data[-1] - selected_smoothed_data[0] total_time_hours = (selected_data['time'].iloc[-1] - selected_data['time'].iloc[0]).total_seconds() / 3600.0 if total_time_hours == 0: continue # 计算平均速率 = 总位移变化 / 总时间 avg_rate = total_displacement / total_time_hours # 存储该点的平均速率 self.all_points_avg_rates[point] = avg_rate except Exception as e: print(f"计算数据点 {point} 的平均速率时出错: {e}") continue # 显示计算结果 if self.all_points_avg_rates: self.statusBar.showMessage(f"成功计算 {len(self.all_points_avg_rates)} 个数据点的平均速率", 5000) # 在状态栏显示前几个点的平均速率作为示例 sample_points = list(self.all_points_avg_rates.keys())[:3] sample_rates = [self.all_points_avg_rates[p] for p in sample_points] sample_text = ", ".join([f"{p}: {r:.4f}" for p, r in zip(sample_points, sample_rates)]) if len(self.all_points_avg_rates) > 3: sample_text += " ..." 
self.statusBar.showMessage(f"成功计算 {len(self.all_points_avg_rates)} 个数据点的平均速率。示例: {sample_text}", 5000) else: self.statusBar.showMessage("未能计算任何数据点的平均速率", 5000) except Exception as e: self.statusBar.showMessage(f"计算所有点平均速率时出错: {e}", 5000) # 新增功能: 导出所有点的平均速率 def exportAllPointsAverageRate(self): try: if not self.all_points_avg_rates: self.statusBar.showMessage("没有可导出的平均速率数据", 5000) return # 创建DataFrame rates_df = pd.DataFrame({ '数据点': list(self.all_points_avg_rates.keys()), '平均速率(mm/h)': list(self.all_points_avg_rates.values()) }) # 按数据点名称排序 rates_df = rates_df.sort_values('数据点') # 导出为CSV文件 fileName, _ = QFileDialog.getSaveFileName(self, "保存所有点平均速率数据", "", "CSV Files (*.csv)") if fileName: rates_df.to_csv(fileName, index=False) self.statusBar.showMessage(f"成功导出 {len(rates_df)} 个数据点的平均速率", 5000) except Exception as e: self.statusBar.showMessage(f"导出所有点平均速率时出错: {e}", 5000) def st2tt(self): try: if self.data is not None and hasattr(self, 'smoothed_data') and hasattr(self, 'valid_data'): selected_point = self.comboBox.currentText() valid_data = self.valid_data # 使用平滑处理后的数据 (self.smoothed_data) cumulative_deformation = self.smoothed_data # 检查是否已经计算了平均速率 if not hasattr(self, 'avg_rate'): self.statusBar.showMessage("请先计算平均速率", 5000) return # 使用已计算的平均速率 avg_rate = self.avg_rate # T = 累积变形量 / 匀速阶段的平均速率 T = cumulative_deformation / avg_rate # 清除之前的绘图 self.canvas.figure.clf() # 绘制T-t曲线 ax = self.canvas.figure.subplots() ax.plot(valid_data['time'], T, label='T-t曲线', color='green') # 设置标题和标签 ax.set_title(f"T-t曲线: {selected_point}") ax.set_xlabel('时间') ax.set_ylabel('T (累积变形量 / 匀速速率)') # 设置图例和网格 ax.legend() ax.grid(True) ax.tick_params(axis='x', rotation=45) # 调整布局 ax.figure.tight_layout() # 刷新画布 self.canvas.draw() self.statusBar.showMessage("T-t曲线绘制完成", 5000) else: self.statusBar.showMessage("请先加载数据或进行平滑处理", 5000) except Exception as e: self.statusBar.showMessage(f"S-t转T-t曲线时出错: {e}", 5000) def calculateTangentAngle(self): try: if self.data is not None and hasattr(self, 'smoothed_data') and hasattr(self, 'valid_data'): selected_point = self.comboBox.currentText() valid_data = self.valid_data # 使用平滑处理后的数据 (self.smoothed_data) cumulative_deformation = self.smoothed_data # 检查时间列和累计变形列长度是否一致 if len(cumulative_deformation) != len(valid_data['time']): self.statusBar.showMessage("时间数据和平滑后的数据长度不一致", 5000) return # 检查是否已经计算了平均速率 if not hasattr(self, 'avg_rate'): self.statusBar.showMessage("请先计算平均速率", 5000) return # 使用已计算的平均速率 avg_rate = self.avg_rate # T = 累积变形量 / 匀速阶段的平均速率 T = cumulative_deformation / avg_rate # 计算T的变化率 (ΔT / Δt) time_diff = (valid_data['time'] - valid_data['time'].shift(1)).dt.total_seconds() / 3600 # 使用T的差分而不是T本身 T_diff = np.diff(T) # 确保 time_diff 和 T_diff 长度一致 if len(T_diff) != len(time_diff) - 1: # time_diff第一个是NaN self.statusBar.showMessage("时间差与T差分长度不一致", 5000) return # 去掉time_diff的第一个NaN值 time_diff_valid = time_diff.iloc[1:] # 计算切线角 rate_of_change = np.degrees(np.arctan(T_diff / time_diff_valid)) # 找到最大切线角 max_angle_idx = np.argmax(rate_of_change) max_angle_time = valid_data['time'].iloc[max_angle_idx + 1] # +1 因为 T_diff 长度比 T 少1 max_angle_value = rate_of_change[max_angle_idx] # 找到45度切线角的时间点 idx_45 = np.where(np.isclose(rate_of_change, 45, atol=1))[0] if len(idx_45) > 0: time_45 = valid_data['time'].iloc[idx_45[0] + 1] T_45 = T[idx_45[0] + 1] # 找到80度切线角的时间点 idx_80 = np.where(np.isclose(rate_of_change, 80, atol=1))[0] if len(idx_80) > 0: time_80 = valid_data['time'].iloc[idx_80[0] + 1] T_80 = T[idx_80[0] + 1] # 清除之前的绘图 self.canvas.figure.clf() # 绘制T-t曲线 ax = self.canvas.figure.subplots() 
ax.plot(valid_data['time'], T, label='T-t曲线', color='green') # 标注最大切线角 ax.scatter([max_angle_time], [T[max_angle_idx + 1]], color='red', label=f"最大切线角({max_angle_value:.2f}°)") ax.text(max_angle_time, T[max_angle_idx + 1], f"{max_angle_value:.2f}°", color='red', fontsize=15) # 标注45度切线角的位置 if len(idx_45) > 0: ax.scatter([time_45], [T_45], color='blue', label="45°切线角") ax.text(time_45, T_45, "45°", color='blue', fontsize=15) # 标注80度切线角的位置 if len(idx_80) > 0: ax.scatter([time_80], [T_80], color='orange', label="80°切线角") ax.text(time_80, T_80, "80°", color='orange', fontsize=15) # 设置标题和标签 ax.set_title(f"T-t曲线: {selected_point}") ax.set_xlabel('时间') ax.set_ylabel('T (累积变形量 / 匀速速率)') # 设置图例和网格 ax.legend() ax.grid(True) ax.tick_params(axis='x', rotation=45) # 调整布局 ax.figure.tight_layout() # 刷新画布 self.canvas.draw() self.statusBar.showMessage(f"最大切线角时间: {max_angle_time}, 值: {max_angle_value:.2f}°", 5000) else: self.statusBar.showMessage("请先加载数据或进行平滑处理", 5000) except Exception as e: self.statusBar.showMessage(f"计算切线角时出错: {e}", 5000) # 新增功能2: 计算所有数据点的切线角 def calculateAllPointsTangentAngle(self): try: if not self.all_points_avg_rates: self.statusBar.showMessage("请先计算所有点的平均速率", 5000) return # 清空之前的计算结果 self.all_points_tangent_angles = {} # 遍历所有有平均速率的数据点 for point, avg_rate in self.all_points_avg_rates.items(): try: # 获取当前数据点的数据 point_data = self.data[['扫描时间', point]].dropna() # 将 '扫描时间' 列转换为 datetime 格式 point_data['扫描时间'] = pd.to_datetime(point_data['扫描时间'], errors='coerce') point_data = point_data[~point_data['扫描时间'].isnull()] displacement = pd.to_numeric(point_data[point], errors='coerce') # 删除无效数据行 valid_mask = ~point_data['扫描时间'].isnull() & ~displacement.isnull() valid_data = pd.DataFrame({ 'time': point_data['扫描时间'][valid_mask], 'displacement': displacement[valid_mask] }).reset_index(drop=True) if len(valid_data) < 10: continue # 动态设置窗口长度并进行平滑 window_length = min(51, len(valid_data) // 10 * 2 + 1) if window_length < 5: window_length = 5 if window_length > len(valid_data): window_length = len(valid_data) - 1 if len(valid_data) % 2 == 0 else len(valid_data) if window_length % 2 == 0: window_length -= 1 if window_length < 3: window_length = 3 displacement_smooth = savgol_filter( valid_data['displacement'], window_length=window_length, polyorder=min(3, window_length - 1) ) # 确保位移曲线是递增的 displacement_increasing = np.maximum.accumulate(displacement_smooth) # T = 累积变形量 / 匀速阶段的平均速率 T = displacement_increasing / avg_rate # 计算T的变化率 (ΔT / Δt) time_diff = (valid_data['time'] - valid_data['time'].shift(1)).dt.total_seconds() / 3600 time_diff = time_diff.iloc[1:] # 去掉第一个NaN值 # 使用T的差分 T_diff = np.diff(T) # 确保 time_diff 和 T_diff 长度一致 if len(T_diff) != len(time_diff): continue # 计算切线角 rate_of_change = np.degrees(np.arctan(T_diff / time_diff)) # 存储该点的切线角数据 self.all_points_tangent_angles[point] = { 'time': valid_data['time'].iloc[1:].reset_index(drop=True), 'tangent_angle': rate_of_change } except Exception as e: print(f"计算数据点 {point} 的切线角时出错: {e}") continue # 显示计算结果 if self.all_points_tangent_angles: self.statusBar.showMessage(f"成功计算 {len(self.all_points_tangent_angles)} 个数据点的切线角", 5000) else: self.statusBar.showMessage("未能计算任何数据点的切线角", 5000) except Exception as e: self.statusBar.showMessage(f"计算所有点切线角时出错: {e}", 5000) # 修改后的功能: 导出所有点的切线角数据 - 每个点位占用一列 def exportAllPointsTangentAngle(self): try: if not self.all_points_tangent_angles: self.statusBar.showMessage("没有可导出的切线角数据", 5000) return # 导出为CSV文件 fileName, _ = QFileDialog.getSaveFileName(self, "保存所有点切线角数据", "", "CSV Files (*.csv)") if fileName: # 
创建一个新的DataFrame,时间列为第一列 time_column = None # 找到最长的时间序列作为基准 for point, data_dict in self.all_points_tangent_angles.items(): if time_column is None or len(data_dict['time']) > len(time_column): time_column = data_dict['time'] # 创建新的DataFrame,时间列为第一列 export_df = pd.DataFrame({'时间': time_column}) # 将每个数据点的切线角数据作为新列添加到DataFrame中 for point, data_dict in self.all_points_tangent_angles.items(): # 创建与基准时间列长度相同的数组,用NaN填充缺失值 tangent_angle_column = np.full(len(time_column), np.nan) # 找到当前数据点时间与基准时间的对应关系 for i, t in enumerate(time_column): # 在数据点的时间序列中查找匹配的时间 match_idx = data_dict['time'][data_dict['time'] == t].index if len(match_idx) > 0: tangent_angle_column[i] = data_dict['tangent_angle'].iloc[match_idx[0]] # 添加该数据点的切线角列 export_df[f'{point}切线角()'] = tangent_angle_column # 导出数据 export_df.to_csv(fileName, index=False) self.statusBar.showMessage(f"成功导出 {len(self.all_points_tangent_angles)} 个数据点的切线角数据", 5000) except Exception as e: self.statusBar.showMessage(f"导出所有点切线角数据时出错: {e}", 5000) def exportTangentAngle(self): try: if self.data is not None and hasattr(self, 'smoothed_data') and hasattr(self, 'valid_data'): selected_point = self.comboBox.currentText() valid_data = self.valid_data # 使用平滑处理后的数据 (self.smoothed_data) cumulative_deformation = self.smoothed_data # 检查时间列和累计变形列长度是否一致 if len(cumulative_deformation) != len(valid_data['time']): self.statusBar.showMessage("时间数据和平滑后的数据长度不一致", 5000) return # 检查是否已经计算了平均速率 if not hasattr(self, 'avg_rate'): self.statusBar.showMessage("请先计算平均速率", 5000) return # 使用已计算的平均速率 avg_rate = self.avg_rate # T = 累积变形量 / 匀速阶段的平均速率 T = cumulative_deformation / avg_rate # 计算T的变化率 (ΔT / Δt) time_diff = (valid_data['time'] - valid_data['time'].shift(1)).dt.total_seconds() / 3600 time_diff = time_diff.iloc[1:] # 去掉第一个NaN值 # 使用T的差分 T_diff = np.diff(T) # 确保 time_diff 和 T_diff 长度一致 if len(T_diff) != len(time_diff): self.statusBar.showMessage("时间差与T差分长度不一致", 5000) return rate_of_change = np.degrees(np.arctan(T_diff / time_diff)) # 找到最大切线角 max_angle_idx = np.argmax(rate_of_change) max_angle_time = valid_data['time'].iloc[max_angle_idx + 1] # +1 因为 T_diff 长度比 T 少1 max_angle_value = rate_of_change[max_angle_idx] # 找到45度切线角的时间点 idx_45 = np.where(np.isclose(rate_of_change, 45, atol=1))[0] if len(idx_45) > 0: time_45 = valid_data['time'].iloc[idx_45[0] + 1] # 找到80度切线角的时间点 idx_80 = np.where(np.isclose(rate_of_change, 80, atol=1))[0] if len(idx_80) > 0: time_80 = valid_data['time'].iloc[idx_80[0] + 1] # 创建切线角数据表 angle_data = pd.DataFrame({ '扫描时间': valid_data['time'].iloc[1:].reset_index(drop=True), # 时间少1行 'Tangent Angle (degrees)': rate_of_change }) # 添加45°和80°切线角位置的信息 if len(idx_45) > 0: angle_data['45° 切线角时间'] = np.where(angle_data['扫描时间'] == time_45, '是', '') if len(idx_80) > 0: angle_data['80° 切线角时间'] = np.where(angle_data['扫描时间'] == time_80, '是', '') angle_data['最大切线角时间'] = np.where(angle_data['扫描时间'] == max_angle_time, '是', '') # 导出为CSV文件 fileName, _ = QFileDialog.getSaveFileName(self, "保存切线角数据", "", "CSV Files (*.csv)") if fileName: angle_data.to_csv(fileName, index=False) self.statusBar.showMessage("切线角数据导出成功", 5000) else: self.statusBar.showMessage("请先加载数据或进行平滑处理", 5000) except Exception as e: self.statusBar.showMessage(f"导出切线角数据时出错: {e}", 5000) if __name__ == "__main__": app = QtWidgets.QApplication(sys.argv) window = MyApp() window.show() sys.exit(app.exec_())
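One robustness note on the PyQt5 tool above: because __init__ pre-assigns self.valid_data, self.smoothed_data and self.avg_rate to None, every guard of the form hasattr(self, 'smoothed_data') is always true, so the "请先进行平滑处理" / "请先计算平均速率" messages never fire and the later methods fail with a generic exception instead. A minimal sketch of the corrected guard, assuming the same attribute names as in the class above (test for None rather than attribute existence):

def validate_smoothed_data(self):
    """Return False until smoothData() has actually stored results."""
    # hasattr() is always True here because __init__ already created the
    # attributes; the reliable test is whether they still hold their
    # initial None value.
    if self.smoothed_data is None or self.valid_data is None:
        self.statusBar.showMessage("请先进行平滑处理", 5000)
        return False
    return True

The same "is None" test can replace the hasattr(self, 'avg_rate') and hasattr(self, 'smoothed_data') checks in calculateRate, exportRate, st2tt, calculateTangentAngle and exportTangentAngle.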
11-10
# 导入必要的库 import pandas as pd import numpy as np import seaborn as sns from sklearn.model_selection import train_test_split, cross_val_score, learning_curve from sklearn.preprocessing import StandardScaler, LabelEncoder from sklearn.decomposition import PCA from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, \ ExtraTreesClassifier from sklearn.svm import SVC from sklearn.neighbors import KNeighborsClassifier from sklearn.naive_bayes import GaussianNB from sklearn.neural_network import MLPClassifier import matplotlib.pyplot as plt import matplotlib.gridspec as gridspec from matplotlib.patches import Patch import warnings warnings.filterwarnings('ignore') # 设置中文字体 plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签 plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号 # 尝试导入可选库 try: from xgboost import XGBClassifier XGB_AVAILABLE = True except ImportError: print("XGBoost 不可用,跳过 XGBoost 模型") XGB_AVAILABLE = False try: from lightgbm import LGBMClassifier LGBM_AVAILABLE = True except ImportError: print("LightGBM 不可用,跳过 LightGBM 模型") LGBM_AVAILABLE = False try: import shap SHAP_AVAILABLE = True except ImportError: print("SHAP 不可用,跳过 SHAP 解释") SHAP_AVAILABLE = False # 定义张继权教授的四因子理论函数 def apply_four_factor_theory(df): # 检查所需的列是否存在 required_cols = ['PFBA', 'PFPeA', 'PFHxA', 'PFHpA', 'PFOA', 'PFNA', 'PFDA', 'PFBS', 'PFHxS', 'PFOS'] available_cols = [col for col in required_cols if col in df.columns] if len(available_cols) < 4: print(f"警告: 只有 {len(available_cols)} 个PFAS列可用,可能需要调整四因子计算") # 短期酸类因子 (PFBA, PFPeA, PFHxA) short_term_cols = [col for col in ['PFBA', 'PFPeA', 'PFHxA'] if col in df.columns] if short_term_cols: df['Short_term_acid_factor'] = df[short_term_cols].mean(axis=1, skipna=True) else: df['Short_term_acid_factor'] = 0 # 长期酸类因子 (PFHpA, PFOA, PFNA, PFDA) long_term_cols = [col for col in ['PFHpA', 'PFOA', 'PFNA', 'PFDA'] if col in df.columns] if long_term_cols: df['Long_term_acid_factor'] = df[long_term_cols].mean(axis=1, skipna=True) else: df['Long_term_acid_factor'] = 0 # 磺酸类因子 (PFBS, PFHxS, PFOS) sulfonate_cols = [col for col in ['PFBS', 'PFHxS', 'PFOS'] if col in df.columns] if sulfonate_cols: df['Sulfonate_factor'] = df[sulfonate_cols].mean(axis=1, skipna=True) else: df['Sulfonate_factor'] = 0 # 暴露因子 (总PFAS浓度) all_pfas_cols = [col for col in required_cols if col in df.columns] if all_pfas_cols: df['Exposure_factor'] = df[all_pfas_cols].sum(axis=1, skipna=True) else: df['Exposure_factor'] = 0 return df # 自定义表格打印函数 def print_table(data, headers=None, title=None): if title: print(f"\n{title}") print("=" * 60) if headers: # 打印表头 header_line = " | ".join(f"{h:>15}" for h in headers) print(header_line) print("-" * len(header_line)) # 打印数据行 for row in data: if isinstance(row, (list, tuple)): row_line = " | ".join(f"{str(item):>15}" for item in row) else: # 处理DataFrame行 row_line = " | ".join(f"{str(row[col]):>15}" for col in headers) print(row_line) # 尝试多种编码方式读取文件 def read_csv_with_encodings(file_path): # 尝试的编码列表(中文环境中常见的编码) encodings = ['utf-8', 'gbk', 'gb2312', 'gb18030', 'latin1', 'cp936'] for encoding in encodings: try: print(f"尝试使用 {encoding} 编码读取文件...") df = pd.read_csv(file_path, encoding=encoding) # 删除完全为空的行和列 df = df.dropna(how='all', axis=0) df = df.dropna(how='all', axis=1) print(f"成功使用 {encoding} 编码读取文件") return df except 
UnicodeDecodeError: continue except Exception as e: print(f"使用 {encoding} 编码时出错: {e}") continue # 如果所有编码都失败,尝试使用错误处理 try: print("尝试使用错误处理方式读取文件...") df = pd.read_csv(file_path, encoding='utf-8', errors='ignore') df = df.dropna(how='all', axis=0) df = df.dropna(how='all', axis=1) print("使用错误处理方式成功读取文件") return df except Exception as e: print(f"所有读取尝试都失败: {e}") return None # 加载CSV文件并跳过完全为空的行列 file_path = r'E:\pycharm\meta\整合数据.csv' # 使用原始字符串表示法 try: # 使用多种编码方式尝试读取文件 df = read_csv_with_encodings(file_path) if df is None: print("无法读取文件,请检查文件路径和格式") exit() print(f"数据形状: {df.shape}") print(f"列名: {df.columns.tolist()}") except Exception as e: print(f"读取文件时出错: {e}") exit() # 数据预处理 # 检查PFAS相关列是否存在,如果不存在则尝试找到类似的列 expected_features = ['PFBA', 'PFPeA', 'PFHxA', 'PFHpA', 'PFOA', 'PFNA', 'PFDA', 'PFBS', 'PFHxS', 'PFOS'] available_features = [] for feature in expected_features: if feature in df.columns: available_features.append(feature) else: # 尝试找到包含该字符串的列 matching_cols = [col for col in df.columns if feature.lower() in col.lower()] if matching_cols: available_features.extend(matching_cols) print(f"使用 '{matching_cols[0]}' 替代 '{feature}'") # 如果没有找到任何PFAS特征,使用所有数值列 if not available_features: numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist() available_features = numeric_cols print(f"未找到PFAS特征,使用所有数值列: {available_features}") print(f"可用的PFAS特征: {available_features}") # 检查目标列 target_column = '城市' if target_column not in df.columns: # 尝试找到可能的目标列 possible_targets = [col for col in df.columns if any(word in col for word in ['城市', '地区', '区域', '地点', 'city', 'region'])] if possible_targets: target_column = possible_targets[0] print(f"使用 '{target_column}' 作为目标变量") else: # 如果没有找到,使用第一列非数值列作为目标 non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns if len(non_numeric_cols) > 0: target_column = non_numeric_cols[0] print(f"使用 '{target_column}' 作为目标变量") else: # 如果没有非数值列,使用最后一列作为目标 target_column = df.columns[-1] print(f"使用最后一列 '{target_column}' 作为目标变量") # 处理缺失值(用中位数填充) for feature in available_features: if feature in df.columns: df[feature] = pd.to_numeric(df[feature], errors='coerce') df[feature] = df[feature].fillna(df[feature].median()) # 应用张继权教授的四因子理论 df = apply_four_factor_theory(df) # 添加四因子到特征中 features = available_features + ['Short_term_acid_factor', 'Long_term_acid_factor', 'Sulfonate_factor', 'Exposure_factor'] print(f"最终使用的特征: {features}") # 编码目标变量(多分类) le = LabelEncoder() df[target_column] = le.fit_transform(df[target_column].fillna('Unknown')) class_names = le.classes_ # 分割数据集 X = df[features] y = df[target_column] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) print(f"训练集大小: {X_train.shape}") print(f"测试集大小: {X_test.shape}") print(f"目标变量类别数: {len(np.unique(y))}") # 标准化特征 scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # PCA降维(保留95%方差) pca = PCA(n_components=0.95) X_train_pca = pca.fit_transform(X_train_scaled) X_test_pca = pca.transform(X_test_scaled) print(f"PCA降维后特征数: {X_train_pca.shape[1]}") # ========== 新增:相关性分析 ========== print("\n进行相关性分析...") # 计算特征之间的相关性矩阵 correlation_matrix = X.corr() # 绘制相关性热图 plt.figure(figsize=(12, 10)) sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f') plt.title('特征相关性热图') plt.tight_layout() plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight') plt.show() # 找出高度相关的特征对 high_corr_pairs = [] for i in range(len(correlation_matrix.columns)): for j in range(i + 1, len(correlation_matrix.columns)): if 
abs(correlation_matrix.iloc[i, j]) > 0.8: # 阈值设为0.8 high_corr_pairs.append(( correlation_matrix.columns[i], correlation_matrix.columns[j], correlation_matrix.iloc[i, j] )) if high_corr_pairs: print("高度相关的特征对:") for pair in high_corr_pairs: print(f" {pair[0]} 和 {pair[1]}: {pair[2]:.3f}") else: print("没有发现高度相关的特征对(相关系数>0.8)") # ========== 新增:Meta分析 - 数据分布可视化 ========== print("\n进行Meta分析...") # 1. 目标变量分布 plt.figure(figsize=(10, 6)) df[target_column].value_counts().plot(kind='bar') plt.title('目标变量分布') plt.xlabel('类别') plt.ylabel('样本数量') plt.xticks(rotation=45) plt.tight_layout() plt.savefig('target_distribution.png', dpi=300, bbox_inches='tight') plt.show() # 2. 四因子分布 four_factor_cols = ['Short_term_acid_factor', 'Long_term_acid_factor', 'Sulfonate_factor', 'Exposure_factor'] if all(col in df.columns for col in four_factor_cols): fig, axes = plt.subplots(2, 2, figsize=(12, 10)) axes = axes.ravel() for i, col in enumerate(four_factor_cols): axes[i].hist(df[col], bins=30, alpha=0.7, color='skyblue', edgecolor='black') axes[i].set_title(f'{col} 分布') axes[i].set_xlabel(col) axes[i].set_ylabel('频率') plt.tight_layout() plt.savefig('four_factor_distribution.png', dpi=300, bbox_inches='tight') plt.show() # 3. 特征箱线图 plt.figure(figsize=(15, 10)) X.boxplot() plt.title('特征箱线图') plt.xticks(rotation=45) plt.tight_layout() plt.savefig('feature_boxplot.png', dpi=300, bbox_inches='tight') plt.show() # 4. PCA可视化 plt.figure(figsize=(10, 8)) if X_train_pca.shape[1] >= 2: scatter = plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, cmap='viridis', alpha=0.7) plt.colorbar(scatter) plt.xlabel('第一主成分') plt.ylabel('第二主成分') plt.title('PCA降维可视化') plt.tight_layout() plt.savefig('pca_visualization.png', dpi=300, bbox_inches='tight') plt.show() # 5. 累计方差解释率 plt.figure(figsize=(10, 6)) pca_full = PCA().fit(X_train_scaled) plt.plot(np.cumsum(pca_full.explained_variance_ratio_)) plt.xlabel('主成分数量') plt.ylabel('累计方差解释率') plt.title('PCA累计方差解释率') plt.grid(True) plt.tight_layout() plt.savefig('pca_explained_variance.png', dpi=300, bbox_inches='tight') plt.show() # ========== 模型训练和评估 ========== # 定义机器学习模型 models = { 'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42), 'DecisionTree': DecisionTreeClassifier(random_state=42), 'RandomForest': RandomForestClassifier(random_state=42), 'SVM': SVC(random_state=42, probability=True), 'KNN': KNeighborsClassifier(), 'NaiveBayes': GaussianNB(), 'GradientBoosting': GradientBoostingClassifier(random_state=42), 'MLP': MLPClassifier(max_iter=1000, random_state=42), 'AdaBoost': AdaBoostClassifier(random_state=42), 'ExtraTrees': ExtraTreesClassifier(random_state=42) } # 添加可选模型 if XGB_AVAILABLE: models['XGBoost'] = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42) if LGBM_AVAILABLE: models['LightGBM'] = LGBMClassifier(random_state=42) print(f"将训练 {len(models)} 个模型") # 训练和评估模型 results = [] cv_results = [] # 用于存储交叉验证结果 model_objects = {} # 存储训练好的模型对象 print("开始训练模型...") for name, model in models.items(): print(f"训练 {name}...") try: # 训练模型 model.fit(X_train_pca, y_train) model_objects[name] = model # 预测和评估 y_pred = model.predict(X_test_pca) y_pred_proba = model.predict_proba(X_test_pca) if hasattr(model, "predict_proba") else None acc = accuracy_score(y_test, y_pred) report = classification_report(y_test, y_pred, output_dict=True, zero_division=0) # 交叉验证 cv_scores = cross_val_score(model, X_train_pca, y_train, cv=5, scoring='accuracy') cv_mean = cv_scores.mean() cv_std = cv_scores.std() results.append([ name, acc, report['weighted 
avg']['precision'], report['weighted avg']['recall'], report['weighted avg']['f1-score'], cv_mean, cv_std ]) cv_results.append({ 'model': name, 'scores': cv_scores }) except Exception as e: print(f"训练模型 {name} 时出错: {e}") results.append([name, 0, 0, 0, 0, 0, 0]) # 生成结果表格 headers = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'CV Mean', 'CV Std'] results_df = pd.DataFrame(results, columns=headers) results_df = results_df.sort_values('Accuracy', ascending=False) print("\n模型性能排名:") print_table(results_df.values.tolist(), headers=headers, title="模型性能比较") # ========== 新增:模型性能可视化 ========== print("\n生成模型性能可视化...") # 1. 模型准确率比较 plt.figure(figsize=(12, 8)) models_names = results_df['Model'] accuracies = results_df['Accuracy'] cv_means = results_df['CV Mean'] cv_stds = results_df['CV Std'] x = np.arange(len(models_names)) width = 0.35 plt.bar(x - width / 2, accuracies, width, label='测试集准确率', alpha=0.7) plt.bar(x + width / 2, cv_means, width, yerr=cv_stds, label='交叉验证准确率', alpha=0.7, capsize=5) plt.xlabel('模型') plt.ylabel('准确率') plt.title('模型性能比较') plt.xticks(x, models_names, rotation=45) plt.legend() plt.tight_layout() plt.savefig('model_performance_comparison.png', dpi=300, bbox_inches='tight') plt.show() # 2. 交叉验证箱线图 cv_df = pd.DataFrame({item['model']: item['scores'] for item in cv_results}) plt.figure(figsize=(12, 8)) cv_df.boxplot() plt.title('模型交叉验证准确率分布') plt.xticks(rotation=45) plt.ylabel('准确率') plt.tight_layout() plt.savefig('cv_boxplot.png', dpi=300, bbox_inches='tight') plt.show() # 3. 最佳模型详细分析 best_model_name = results_df.iloc[0]['Model'] best_model = model_objects[best_model_name] print(f"\n对最佳模型 {best_model_name} 进行详细分析...") # 混淆矩阵 y_pred_best = best_model.predict(X_test_pca) cm = confusion_matrix(y_test, y_pred_best) plt.figure(figsize=(10, 8)) sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names) plt.title(f'{best_model_name} 混淆矩阵') plt.xlabel('预测标签') plt.ylabel('真实标签') plt.tight_layout() plt.savefig(f'confusion_matrix_{best_model_name}.png', dpi=300, bbox_inches='tight') plt.show() # 学习曲线 def plot_learning_curve(estimator, title, X, y, cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)): plt.figure(figsize=(10, 6)) train_sizes, train_scores, test_scores = learning_curve( estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes) train_scores_mean = np.mean(train_scores, axis=1) train_scores_std = np.std(train_scores, axis=1) test_scores_mean = np.mean(test_scores, axis=1) test_scores_std = np.std(test_scores, axis=1) plt.grid() plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r") plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g") plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="训练得分") plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="交叉验证得分") plt.xlabel("训练样本数") plt.ylabel("得分") plt.title(title) plt.legend(loc="best") plt.tight_layout() plt.savefig(f'learning_curve_{best_model_name}.png', dpi=300, bbox_inches='tight') plt.show() plot_learning_curve(best_model, f'{best_model_name} 学习曲线', X_train_pca, y_train, cv=5) # 特征重要性(如果模型支持) if hasattr(best_model, 'feature_importances_'): importances = best_model.feature_importances_ indices = np.argsort(importances)[::-1] plt.figure(figsize=(10, 8)) plt.title(f"{best_model_name} 特征重要性") plt.bar(range(min(20, len(importances))), importances[indices[:20]]) plt.xticks(range(min(20, len(importances))), 
[f'PC{i + 1}' for i in indices[:20]], rotation=45) plt.tight_layout() plt.savefig(f'feature_importance_{best_model_name}.png', dpi=300, bbox_inches='tight') plt.show() # ========== 新增:SHAP特征分析组合图 ========== if SHAP_AVAILABLE: print("\n生成SHAP特征分析组合图...") # 只为最佳模型生成SHAP组合图 if best_model_name in ['RandomForest', 'DecisionTree', 'GradientBoosting', 'XGBoost', 'LightGBM']: try: # 创建SHAP解释器 explainer = shap.TreeExplainer(best_model) shap_values = explainer.shap_values(X_test_pca) # 对于多分类问题,选择第一个类别的SHAP值 if isinstance(shap_values, list): shap_values_used = shap_values[0] # 使用第一个类别的SHAP值 else: shap_values_used = shap_values # 创建组合图:全局重要性(左)与个体影响(右) fig = plt.figure(figsize=(20, 10)) gs = gridspec.GridSpec(1, 2, width_ratios=[1, 1]) # 左图:全局特征重要性 ax1 = plt.subplot(gs[0]) shap.summary_plot(shap_values_used, X_test_pca, plot_type="bar", feature_names=[f'PC{i + 1}' for i in range(X_test_pca.shape[1])], show=False, max_display=15) ax1.set_title(f'{best_model_name} - SHAP全局特征重要性', fontsize=16, fontweight='bold') # 右图:个体样本SHAP值 ax2 = plt.subplot(gs[1]) # 选择一个有代表性的样本(SHAP值绝对值最大的样本) sample_idx = np.argmax(np.sum(np.abs(shap_values_used), axis=1)) # 绘制瀑布图显示个体样本的SHAP值 shap.waterfall_plot( explainer.expected_value[0] if isinstance(explainer.expected_value, list) else explainer.expected_value, shap_values_used[sample_idx], feature_names=[f'PC{i + 1}' for i in range(X_test_pca.shape[1])], show=False, max_display=15) ax2.set_title(f'{best_model_name} - 个体样本SHAP值分析\n(样本索引: {sample_idx})', fontsize=16, fontweight='bold') plt.tight_layout() plt.savefig(f'shap_combined_{best_model_name}.png', dpi=300, bbox_inches='tight') plt.show() # 创建另一个组合图:SHAP摘要图与依赖图 fig2 = plt.figure(figsize=(20, 10)) gs2 = gridspec.GridSpec(1, 2, width_ratios=[1, 1]) # 左图:SHAP摘要图(显示特征值与SHAP值的关系) ax3 = plt.subplot(gs2[0]) shap.summary_plot(shap_values_used, X_test_pca, feature_names=[f'PC{i + 1}' for i in range(X_test_pca.shape[1])], show=False, max_display=15) ax3.set_title(f'{best_model_name} - SHAP摘要图', fontsize=16, fontweight='bold') # 右图:SHAP依赖图(最重要的特征) ax4 = plt.subplot(gs2[1]) # 找到最重要的特征 if hasattr(best_model, 'feature_importances_'): feature_importances = best_model.feature_importances_ most_important_feature = np.argmax(feature_importances) else: # 如果没有feature_importances_属性,使用SHAP值计算重要性 feature_importances = np.mean(np.abs(shap_values_used), axis=0) most_important_feature = np.argmax(feature_importances) shap.dependence_plot(most_important_feature, shap_values_used, X_test_pca, feature_names=[f'PC{i + 1}' for i in range(X_test_pca.shape[1])], show=False, ax=ax4) ax4.set_title(f'{best_model_name} - SHAP依赖图\n(最重要的特征: PC{most_important_feature + 1})', fontsize=16, fontweight='bold') plt.tight_layout() plt.savefig(f'shap_detailed_{best_model_name}.png', dpi=300, bbox_inches='tight') plt.show() print(f"已保存 SHAP组合图 for {best_model_name}") except Exception as e: print(f"生成SHAP组合图失败 for {best_model_name}: {e}") else: print(f"最佳模型 {best_model_name} 不支持TreeExplainer,跳过SHAP组合图") else: print("\nSHAP不可用,跳过SHAP组合图") # ========== 生成四因子表格 ========== four_factor_cols = ['Short_term_acid_factor', 'Long_term_acid_factor', 'Sulfonate_factor', 'Exposure_factor'] if all(col in df.columns for col in four_factor_cols): four_factor_table = df[[target_column] + four_factor_cols].head(10) # 解码目标变量以显示原始标签 four_factor_table[target_column] = le.inverse_transform(four_factor_table[target_column]) print_table(four_factor_table.values.tolist(), headers=[target_column] + four_factor_cols, title="四因子理论表格 (前10行)") # ========== 生成PCA组件表格 ========== pca_components = 
pd.DataFrame(pca.components_, columns=features) print(f"\nPCA主成分表格 (前5个主成分):") print_table(pca_components.head().values.tolist(), headers=['PC'] + features, title="PCA主成分权重") # ========== 保存重要结果 ========== results_df.to_csv('model_results.csv', index=False) print("\n模型结果已保存到 'model_results.csv'") # 显示最佳模型 best_model = results_df.iloc[0] print(f"\n最佳模型: {best_model['Model']} (准确率: {best_model['Accuracy']:.4f})") print("\n所有图表已生成完成!")完善改进代码
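Regarding the closing request to refine the script: one concrete improvement is how cross-validation is run. StandardScaler and PCA are fitted once on the full training set, and cross_val_score is then applied to the already-transformed X_train_pca, so each fold's validation part has leaked into the preprocessing fit and the CV scores are slightly optimistic. A minimal sketch, assuming the X_train, y_train and models objects defined above, that refits the preprocessing inside every fold via an sklearn Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    # scaling + PCA are now part of the estimator, so cross_val_score
    # refits them on each training fold instead of reusing the global fit
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=0.95)),
        ('clf', model),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name}: CV accuracy {scores.mean():.4f} +/- {scores.std():.4f}")

The test-set evaluation can use the same pipeline (pipe.fit(X_train, y_train) followed by pipe.score(X_test, y_test)), which also removes the need to keep X_train_pca and X_test_pca in sync by hand.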
11-12
import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D from matplotlib.collections import PolyCollection import os import matplotlib.tri as mtri from datetime import datetime from geometry_processor import * plt.rcParams["font.family"] = ["SimHei", "WenQuanYi Micro Hei", "Heiti TC"] plt.rcParams["axes.unicode_minus"] = False plt.rcParams["font.family"] = ["SimHei", "WenQuanYi Micro Hei", "Heiti TC", "Arial Unicode MS"] class FlowVisualizer: """流场可视化器""" def __init__(self): self.geometry = None self.results = None self.flow_conditions = None self.config = None self.view_type = 'top' # 默认顶视图 self.analysis_log = [] # 存储日志 def setup(self, geometry, results, flow_conditions, config): """初始化可视化器参数""" self.geometry = geometry self.results = results self.flow_conditions = flow_conditions self.config = config return self def logger(self, message): """日志记录方法,修复缺失的logger属性""" timestamp = datetime.now().strftime("%H:%M:%S") log_entry = f"[{timestamp}] {message}" print(log_entry) # 打印到控制台 self.analysis_log.append(log_entry) # 保存到日志列表 def plot_geometry_overview(self): """绘制几何体总览""" fig = plt.figure(figsize=(15, 10)) # 3D视图 ax1 = fig.add_subplot(221, projection='3d') self._plot_3d_geometry(ax1) # 顶视图 ax2 = fig.add_subplot(222) self._plot_2d_projection(ax2, 'top') # 侧视图 ax3 = fig.add_subplot(223) self._plot_2d_projection(ax3, 'side') # 前视图 ax4 = fig.add_subplot(224) self._plot_2d_projection(ax4, 'front') plt.tight_layout() if self.config.get('save_figures', False): plt.savefig('geometry_overview.png', dpi=self.config.get('dpi', 300), bbox_inches='tight') if self.config.get('auto_show', True): plt.show() def _plot_3d_geometry(self, ax): """绘制3D几何体(修正3D顶点转2D的问题)""" # 采样显示面元(避免过于密集) max_faces = 5000 step = max(1, len(self.geometry.elements) // max_faces) sample_elements = self.geometry.elements[::step] # 存储2D面片和对应的z坐标 faces_2d = [] # 2D顶点(X-Y投影) z_coords = [] # 每个面片的平均z坐标(用于3D定位) for elem in sample_elements: # 获取3D顶点(形状为 (N, 3),N为面元顶点数,如三角形为3,四边形为4) vertices_3d = self.geometry.nodes[elem] # 将3D顶点投影到X-Y平面(提取x和y作为2D坐标) vertices_2d = vertices_3d[:, :2] # 取前两列(x, y) faces_2d.append(vertices_2d) # 计算该面元的平均z坐标(用于在3D空间中定位面片) avg_z = np.mean(vertices_3d[:, 2]) # 取z坐标的平均值 z_coords.append(avg_z) # 创建2D面片集合(此时输入为2D顶点,符合要求) face_collection = PolyCollection( faces_2d, alpha=0.3, facecolor='lightblue', edgecolor='darkblue', linewidth=0.1 ) # 将2D面片添加到3D轴,并通过zs指定每个面片的z坐标 ax.add_collection3d(face_collection, zs=z_coords, zdir='z') # 设置坐标轴范围 bounds = self.geometry.get_bounds() ax.set_xlim(bounds[0]) ax.set_ylim(bounds[1]) ax.set_zlim(bounds[2]) ax.set_xlabel('X') ax.set_ylabel('Y') ax.set_zlabel('Z') ax.set_title('3D几何体') # 设置视角 ax.view_init(elev=20, azim=45) def _plot_2d_projection(self, ax, view_type): """绘制2D投影""" nodes = self.geometry.nodes if view_type == 'top': # X-Y平面 x, y = nodes[:, 0], nodes[:, 1] xlabel, ylabel = 'X (长度)', 'Y (宽度)' title = '顶视图 (X-Y)' elif view_type == 'side': # X-Z平面 x, y = nodes[:, 0], nodes[:, 2] xlabel, ylabel = 'X (长度)', 'Z (高度)' title = '侧视图 (X-Z)' else: # 'front' Y-Z平面 x, y = nodes[:, 1], nodes[:, 2] xlabel, ylabel = 'Y (宽度)', 'Z (高度)' title = '前视图 (Y-Z)' ax.scatter(x, y, s=0.5, alpha=0.6, c='blue') ax.set_xlabel(xlabel) ax.set_ylabel(ylabel) ax.set_title(title) ax.set_aspect('equal') ax.grid(True, alpha=0.3) def plot_pressure_distribution(self): """绘制压力分布,增强版坐标与数据的匹配逻辑""" try: # 获取压力系数数据 if 'shock_expansion' in self.results: pressure_data = self.results['shock_expansion']['element_data']['pressure_coefficients'] data_source = "激波修正后" else: pressure_data = 
self.results['flow_field']['pressure_coefficients'] data_source = "基本流场" # 严格匹配逻辑 if len(pressure_data) == len(self.geometry.face_centers): coords = self.geometry.face_centers # 使用面元中心坐标 self.logger("使用面元中心坐标") elif len(pressure_data) == len(self.geometry.nodes): coords = self.geometry.nodes # 使用节点坐标 self.logger("使用节点坐标") else: # 自动选择最接近的坐标集 if abs(len(pressure_data) - len(self.geometry.face_centers)) < abs( len(pressure_data) - len(self.geometry.nodes)): coords = self.geometry.face_centers else: coords = self.geometry.nodes self.logger( f"警告: 数据长度不匹配,使用替代坐标 (数据长度={len(pressure_data)}, 坐标长度={len(coords)})") # 投影到2D视图 if self.view_type == 'top': x, y = coords[:, 0], coords[:, 1] xlabel, ylabel = 'X坐标', 'Y坐标' elif self.view_type == 'side': x, y = coords[:, 0], coords[:, 2] xlabel, ylabel = 'X坐标', 'Z坐标' else: x, y = coords[:, 1], coords[:, 2] xlabel, ylabel = 'Y坐标', 'Z坐标' # 确保数据长度一致 min_length = min(len(x), len(pressure_data)) x = x[:min_length] y = y[:min_length] pressure_data = pressure_data[:min_length] # 绘制压力云图 fig, ax = plt.subplots(figsize=(10, 8)) scatter = ax.scatter(x, y, c=pressure_data, cmap='jet', s=10, alpha=0.8) plt.colorbar(scatter, ax=ax, label=f'压力系数 ({data_source})') ax.set_xlabel(xlabel) ax.set_ylabel(ylabel) ax.set_title(f'表面压力系数分布 (Ma={self.flow_conditions["mach_number"]})') ax.set_aspect('equal') if self.config["save_figures"]: self._save_figure(fig, "pressure_distribution") if self.config["auto_show"]: plt.show() else: plt.close(fig) except Exception as e: self.logger(f"❌ 压力分布可视化失败: {str(e)}") # 输出详细调试信息 if hasattr(self, 'geometry'): self.logger(f"调试信息: 节点数={len(self.geometry.nodes)}, 面元数={len(self.geometry.elements)}") if hasattr(self.geometry, 'face_centers'): self.logger(f"面元中心数={len(self.geometry.face_centers)}") if 'flow_field' in self.results: self.logger(f"压力数据长度={len(self.results['flow_field']['pressure_coefficients'])}") def _interpolate_data(self, data, target_length): """插值并强制校验长度""" interpolated = np.interp( np.linspace(0, len(data) - 1, target_length), # 目标索引 np.arange(len(data)), # 原始索引 data ) # 插值后必须与目标长度一致 assert len(interpolated) == target_length, \ f"插值失败!目标长度{target_length},实际{len(interpolated)}" return interpolated def plot_velocity_field(self): """绘制速度场""" if 'flow_field' not in self.results: print("❌ 没有流场数据") return surface_velocities = self.results['flow_field']['surface_velocities'] velocity_magnitudes = np.linalg.norm(surface_velocities, axis=1) fig = plt.figure(figsize=(18, 12)) # 1. 速度大小云图 - 顶视图 ax1 = fig.add_subplot(231) self._plot_contour_2d(ax1, velocity_magnitudes, 'top', 'velocity', '速度大小分布 - 顶视图') # 2. 速度大小云图 - 侧视图 ax2 = fig.add_subplot(232) self._plot_contour_2d(ax2, velocity_magnitudes, 'side', 'velocity', '速度大小分布 - 侧视图') # 3. 速度向量场 - 顶视图 ax3 = fig.add_subplot(233) self._plot_vector_field_2d(ax3, surface_velocities, 'top', '速度向量场 - 顶视图') # 4. 速度向量场 - 侧视图 ax4 = fig.add_subplot(234) self._plot_vector_field_2d(ax4, surface_velocities, 'side', '速度向量场 - 侧视图') # 5. 撞击角分布 ax5 = fig.add_subplot(235) impact_angles = self.results['flow_field']['impact_angles'] self._plot_histogram(ax5, np.degrees(impact_angles), 'angle', '撞击角分布直方图') # 6. 
3D速度分布 ax6 = fig.add_subplot(236, projection='3d') self._plot_3d_scalar_field(ax6, velocity_magnitudes, 'velocity', '3D速度大小分布') plt.tight_layout() if self.config.get('save_figures', False): plt.savefig('velocity_field.png', dpi=self.config.get('dpi', 300), bbox_inches='tight') if self.config.get('auto_show', True): plt.show() def plot_streamlines(self): """绘制流线""" if 'streamlines' not in self.results: print("❌ 没有流线数据") return streamlines = self.results['streamlines']['streamlines'] fig = plt.figure(figsize=(15, 10)) # 3D流线 ax1 = fig.add_subplot(121, projection='3d') self._plot_3d_streamlines(ax1, streamlines) # 2D流线投影 ax2 = fig.add_subplot(122) self._plot_2d_streamlines(ax2, streamlines, 'side') plt.tight_layout() if self.config.get('save_figures', False): plt.savefig('streamlines.png', dpi=self.config.get('dpi', 300), bbox_inches='tight') if self.config.get('auto_show', True): plt.show() def plot_shock_patterns(self): """绘制激波模式""" if 'shock_expansion' not in self.results: print("❌ 没有激波数据") return shock_data = self.results['shock_expansion'] fig = plt.figure(figsize=(18, 12)) # 1. 马赫数分布 ax1 = fig.add_subplot(231) mach_data = shock_data['node_data']['mach_numbers'] self._plot_contour_2d(ax1, mach_data, 'side', 'mach', '马赫数分布') # 2. 压力比分布 ax2 = fig.add_subplot(232) pressure_data = shock_data['node_data']['pressures'] pressure_ratio = pressure_data / self.flow_conditions['pressure'] self._plot_contour_2d(ax2, pressure_ratio, 'side', 'pressure_ratio', '压力比分布') # 3. 温度分布 ax3 = fig.add_subplot(233) temp_data = shock_data['node_data']['temperatures'] self._plot_contour_2d(ax3, temp_data, 'side', 'temperature', '温度分布') # 4. 马赫数变化直方图 ax4 = fig.add_subplot(234) self._plot_histogram(ax4, mach_data, 'mach', '马赫数分布直方图') # 5. 激波强度分析 ax5 = fig.add_subplot(235) self._plot_shock_strength_analysis(ax5, shock_data) # 6. 3D激波模式 ax6 = fig.add_subplot(236, projection='3d') self._plot_3d_scalar_field(ax6, pressure_ratio, 'pressure_ratio', '3D压力比分布') plt.tight_layout() if self.config.get('save_figures', False): plt.savefig('shock_patterns.png', dpi=self.config.get('dpi', 300), bbox_inches='tight') if self.config.get('auto_show', True): plt.show() def plot_force_distribution(self): """绘制力分布""" if 'aerodynamics' not in self.results: print("❌ 没有气动力数据") return aero_data = self.results['aerodynamics'] fig = plt.figure(figsize=(15, 8)) # 1. 气动力系数 ax1 = fig.add_subplot(131) self._plot_force_coefficients(ax1, aero_data) # 2. 力分量对比 ax2 = fig.add_subplot(132) self._plot_force_components(ax2, aero_data) # 3. 
力矩系数 ax3 = fig.add_subplot(133) self._plot_moment_coefficients(ax3, aero_data) plt.tight_layout() if self.config.get('save_figures', False): plt.savefig('force_distribution.png', dpi=self.config.get('dpi', 300), bbox_inches='tight') if self.config.get('auto_show', True): plt.show() def _validate_data_consistency(self): """全面校验所有数据与几何信息的一致性""" # 节点数量基准值 node_count = len(self.geometry.nodes) # 面元数量基准值 face_count = len(self.geometry.elements) print(f"数据校验: 节点数={node_count}, 面元数={face_count}") # 检查所有数据数组 if hasattr(self, 'data'): for name, data in self.data.items(): data_len = len(data) if data_len != node_count and data_len != face_count: print(f"⚠️ 数据不一致: {name} 长度={data_len} (需要={node_count}或{face_count})") # 计算差异比例 node_ratio = abs(data_len - node_count) / node_count face_ratio = abs(data_len - face_count) / face_count # 提示可能的错误来源 if node_ratio < 0.1: # 差异小于10% print(f" 提示: 与节点数差异较小({node_ratio:.1%}),可能是索引错误") elif face_ratio < 0.1: # 差异小于10% print(f" 提示: 与面元数差异较小({face_ratio:.1%}),可能是计算错误") def _build_face_center_elements(self): """基于哈希表优化的面元中心三角剖分索引构建(解决性能问题)""" try: # 1. 使用哈希表存储边与面元的映射关系(边由两个顶点索引组成的元组表示) edge_map = {} for face_idx, elem in enumerate(self.geometry.elements): # 处理三角形面元(3条边) if len(elem) == 3: edges = [ tuple(sorted((elem[0], elem[1]))), tuple(sorted((elem[1], elem[2]))), tuple(sorted((elem[2], elem[0]))) ] # 处理四边形面元(4条边) elif len(elem) == 4: edges = [ tuple(sorted((elem[0], elem[1]))), tuple(sorted((elem[1], elem[2]))), tuple(sorted((elem[2], elem[3]))), tuple(sorted((elem[3], elem[0]))) ] else: self.logger(f"警告:不支持的面元类型(顶点数:{len(elem)})") continue # 将边添加到哈希表 for edge in edges: if edge not in edge_map: edge_map[edge] = [] edge_map[edge].append(face_idx) # 2. 构建面元邻接关系 face_adjacency = [[] for _ in range(len(self.geometry.elements))] for edge, faces in edge_map.items(): # 每条边最多属于两个面元(共享边) if len(faces) == 2: face1, face2 = faces if face2 not in face_adjacency[face1]: face_adjacency[face1].append(face2) if face1 not in face_adjacency[face2]: face_adjacency[face2].append(face1) # 3. 