23. Applications and Implementation of Machine Learning in Finance

omega

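License: CC 4.0 BY-SA

Column: Python Machine Learning Applications in Healthcare, Retail, and Finance. Tags: machine learning, financial applications, algorithmic trading

Article link: https://blog.youkuaiyun.com/omega/article/details/149381628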

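1. Promising Areas for Machine Learning in Finance

Three areas of the financial industry show particularly strong potential for machine learning:
- Algorithmic stock trading: Current algorithmic trading automates manual stock trading, but it lacks the predictive intelligence that machine learning can provide. For example, a retail stock investor typically studies company news and performs fundamental and technical analysis before deciding to buy or sell. With the Commodity Channel Index (CCI) technical indicator, the investor buys when CCI shows an oversold signal and sells when it shows an overbought signal. Experienced traders also use further indicators, such as the stochastic oscillator and Bollinger Bands. Today's algorithmic trading merely automates the use of these indicators and cannot predict where a stock price is heading. Combining a predictive model with algorithmic trading could take fuller advantage of price movements.
- Financial and investment advisors: When choosing among investment options, investors often rely on intuition, advice from friends, or human advisors, which is not a reliable approach. A predictive model built on existing capital-market data could power a robo-advisor that not only generates virtual portfolios but also gives investment advice based on historical data (stock prices, art prices, commodity prices, oil prices, and so on). Such an advisor could alert investors ahead of time to rebalance their portfolios as market prices change.
- Fraud detection and prevention: Machine learning and artificial intelligence are already used in this area, but not yet well enough to prevent fraud on a global scale. Money laundering and other fraudulent transactions are hard to detect because systems often classify offending users as legitimate. Combining techniques such as outlier detection with other fraud-detection algorithms makes it possible to build robust systems that block fraudulent transactions in real time.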
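
The CCI rule from the first bullet can be sketched in a few lines. This is only an illustration, assuming the commonly used CCI definition (typical price, a rolling window, and the 0.015 scaling constant) and the conventional thresholds of -100 and +100. The function names `cci` and `cci_signal`, the window length, and the thresholds are hypothetical choices, not taken from the article.

```python
import numpy as np
import pandas as pd

def cci(high, low, close, window=20):
    """Commodity Channel Index over a rolling window (assumed standard definition)."""
    typical_price = (high + low + close) / 3.0
    sma = typical_price.rolling(window).mean()
    # Mean absolute deviation of the typical price inside each window
    mad = typical_price.rolling(window).apply(
        lambda x: np.mean(np.abs(x - x.mean())), raw=True
    )
    return (typical_price - sma) / (0.015 * mad)

def cci_signal(cci_values, oversold=-100.0, overbought=100.0):
    """Map CCI values to 'buy' (oversold), 'sell' (overbought) or 'hold'."""
    signal = pd.Series("hold", index=cci_values.index)
    signal[cci_values < oversold] = "buy"
    signal[cci_values > overbought] = "sell"
    return signal

# Hypothetical steadily rising prices, used only to exercise the functions
prices = pd.Series(np.arange(1.0, 31.0))
values = cci(prices, prices, prices, window=5)
print(cci_signal(values).tail(3))
```

A rule like this only automates the indicator; it makes no forecast, which is the gap a predictive model would fill.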

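2. Implementing the Machine Learning Life Cycle in Finance

This section uses supervised machine learning on a fictitious accounting dataset, modeled on a typical electronic payments ledger, to show how machine learning can be implemented in finance.

2.1 The Dataset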

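The dataset contains the following columns:
| Column | Description |
| ---- | ---- |
| Record No | Record number of the ledger transaction in the digital payments register |
| Transaction No | Unique transaction number issued by the payment gateway after a payment completes successfully |
| Fiscal Year | Fiscal year in which the transaction took place |
| Month | Month in which the transaction took place |
| Department | Company department, such as Finance, Human Resources, Learning and Development, Legal, Marketing, Procurement, Research, Strategic Planning, or Transportation |
| Account | Account number in the company ledger |
| Expense Category | Type of expense incurred when the company pays a vendor |
| Vendor ID | ID assigned to the vendor by the payment or procurement system |
| Payment Method | Method of payment, such as an Automated Clearing House check or a wire transfer |
| Payment Status | Confirmation status of the payment transaction, which has to be reconciled with the bank and the internal procurement committee |
| Payment Date | Date on which the payment was actually made |
| Invoice ID | ID of the invoice |
| Invoice Date | Date on which the invoice was generated |
| Amount | Amount actually paid |
| Red Flag | 1: the amount is below the 25th percentile of the Amount column and the payment status is paid but not reconciled. 2: the amount is above the 75th percentile and the status is reconciled, so the transaction needs further investigation. 0: the transaction lies within 1.5 times the 25th percentile and is treated as legitimate |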

The dataset is provided as the flat file PaymentsLedger.csv, available from http://www.PuneetMathur.me/Book009/.
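
To make the Red Flag rule from the table concrete, here is a rough sketch of how such a label could be derived from the amount and the payment status. The status strings `"Paid-Unreconciled"` and `"Paid-Reconciled"`, the default column names, and the function name `red_flag` are assumptions made for illustration; the actual values in PaymentsLedger.csv may differ.

```python
import numpy as np
import pandas as pd

def red_flag(df, amount_col="Amount", status_col="Payment Status",
             unreconciled="Paid-Unreconciled", reconciled="Paid-Reconciled"):
    """Label rows 1 (low and unreconciled), 2 (high and reconciled) or 0 (otherwise)."""
    q25 = df[amount_col].quantile(0.25)
    q75 = df[amount_col].quantile(0.75)
    low = (df[amount_col] < q25) & (df[status_col] == unreconciled)
    high = (df[amount_col] > q75) & (df[status_col] == reconciled)
    flags = np.select([low, high], [1, 2], default=0)
    return pd.Series(flags, index=df.index, name="RedFlag")

# Hypothetical five-row ledger, used only to exercise the rule
ledger = pd.DataFrame({
    "Amount": [10, 20, 30, 40, 50],
    "Payment Status": ["Paid-Unreconciled", "Paid-Unreconciled",
                       "Paid-Reconciled", "Paid-Reconciled", "Paid-Reconciled"],
})
print(red_flag(ledger).tolist())
```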

2.2 Code Implementation

The code for this step is as follows:

# Import Python libraries
import os
import numpy as np
import pandas as pd

# Show the current working directory (useful when checking file paths)
print(os.getcwd())

# Read the data set from the flat file
fname = "C:/PaymentsLedger.csv"
openledger = pd.read_csv(fname, low_memory=False, index_col=False)
# Check that the data loaded into memory
print(openledger.head(1))

# Wrap the loaded data in a working DataFrame
dfworking = pd.DataFrame(openledger)
# Look at the first record
print(dfworking.head(1))
# Check the shape, columns and data types of the dataset
print(dfworking.shape)
print(dfworking.columns)
print(dfworking.dtypes)

# Flag the columns that contain at least one missing value
print(dfworking.isnull().any())
# Count the number of missing values in each column of the DataFrame
print(dfworking.isnull().sum())

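The code above first imports the required Python libraries, then reads the dataset from the flat file and checks that it loaded into memory correctly. It then wraps the data in a standard Pandas DataFrame and inspects its shape, column names, and data types. Finally, it checks the DataFrame for missing values.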
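
The missing-value check can be condensed into a small report that lists only the affected columns. This is a minimal sketch; the helper name `missing_report` and the sample rows are hypothetical and stand in for the real ledger.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": counts / len(df) * 100.0,
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical sample rows with a few gaps
sample = pd.DataFrame({
    "Amount": [100.0, np.nan, 250.0, np.nan],
    "Month": [1, 2, 3, 4],
    "Vendor ID": ["V1", None, "V3", "V4"],
})
print(missing_report(sample))
```

On the real data this would be called as `missing_report(dfworking)`.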

2.3 Visualization Functions

Distribution function: plots the skewed distributions of features and their log-transformed distributions.

import matplotlib.pyplot as pl

def distribution(data, transformed=False):
    """
    Visualization code for displaying skewed distributions of features.
    Set transformed=True when `data` has already been log-transformed;
    the flag only changes the figure title.
    """
    # Create figure
    fig = pl.figure(figsize=(11, 5))
    # Plot one histogram per feature
    for i, feature in enumerate(['Amount', 'Month', 'Fiscal Year']):
        ax = fig.add_subplot(1, 3, i + 1)
        ax.hist(data[feature], bins=25, color='#00A0A0')
        ax.set_title("'%s' Feature Distribution" % feature, fontsize=14)
        ax.set_xlabel("Value")
        ax.set_ylabel("Number of Records")
        # Cap the y-axis at 2000 records; taller bars are cut off at the top
        ax.set_ylim((0, 2000))
        ax.set_yticks([0, 500, 1000, 1500, 2000])
        ax.set_yticklabels([0, 500, 1000, 1500, ">2000"])
    # Plot aesthetics
    if transformed:
        fig.suptitle("Log-transformed Distributions of Payment Ledger Features",
                     fontsize=16, y=1.03)
    else:
        fig.suptitle("Skewed Distributions of Payment Ledger Features",
                     fontsize=16, y=1.03)
    fig.tight_layout()
    pl.show()
# End of distribution visualization function

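The function takes the dataset and a Boolean parameter, transformed, which indicates whether the data passed in has already been log-transformed. The function does not transform the data itself; the flag only selects the figure title. It creates a figure with three subplots showing the distributions of the Amount, Month, and Fiscal Year columns.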
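
Because the plotting function expects the caller to supply already transformed data, a log transform has to be applied beforehand. A minimal sketch, assuming the amounts are non-negative and using `np.log1p` so that zero amounts stay defined; the helper name `log_transform` is hypothetical.

```python
import numpy as np
import pandas as pd

def log_transform(df, columns=("Amount",)):
    """Return a copy of df with log(1 + x) applied to the given columns."""
    out = df.copy()
    for col in columns:
        # Assumes non-negative values; log1p(0) is 0
        out[col] = np.log1p(out[col])
    return out

# Hypothetical skewed amounts
payments = pd.DataFrame({"Amount": [0, 9, 99, 9999]})
print(log_transform(payments))
```

The transformed frame could then be plotted with `distribution(log_transform(dfworking), transformed=True)`.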
Feature importance function: displays the five most important features.

# Plot feature importances with this function
def feature_plot(importances, X_train, y_train):
    # Select the five most important features (y_train is accepted but not used)
    indices = np.argsort(importances)[::-1]
    columns = X_train.columns.values[indices[:5]]
    values = importances[indices][:5]
    # Create the plot
    fig = pl.figure(figsize=(9, 5))
    pl.title("Normalized Weights for First Five Most Predictive Features", fontsize=16)
    # Five bars are drawn, so the x positions must also have length five
    pl.bar(np.arange(5), values, width=0.6, align="center", color='#00A000',
           label="Feature Weight")
    pl.bar(np.arange(5) - 0.3, np.cumsum(values), width=0.2, align="center", color='#00A0A0',
           label="Cumulative Feature Weight")
    pl.xticks(np.arange(5), columns)
    pl.xlim((-0.5, 4.5))
    pl.ylabel("Weight", fontsize=12)
    pl.xlabel("Feature", fontsize=12)
    pl.legend(loc='upper center')
    pl.tight_layout()
    pl.show()
# End of feature importances function

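This function takes the feature importances, the training features, and the training labels as arguments, and uses a bar chart to show the weights and cumulative weights of the five most important features.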
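
The article does not show where the importances come from. One common source is the `feature_importances_` attribute of a fitted tree ensemble in scikit-learn. The sketch below uses a random forest on synthetic data; the model choice, the synthetic columns, and the labeling rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the ledger features (hypothetical values)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Amount": rng.uniform(0, 1000, 500),
    "Month": rng.integers(1, 13, 500),
    "Fiscal Year": rng.integers(2015, 2019, 500),
})
# Assumed rule: only large amounts are flagged
y_demo = (X_demo["Amount"] > 750).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Importances are normalized to sum to 1, one value per feature
importances = forest.feature_importances_
ranking = X_demo.columns[np.argsort(importances)[::-1]]
print(list(ranking))
```

An array like `importances` is what `feature_plot` expects as its first argument; note that the plotting function as written assumes at least five features.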

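3. Outlier Analysis

Outliers are defined as values in the Amount column that fall below 1.5 times the 25th percentile or above 1.5 times the 75th percentile. The following code counts them: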

# `data` is the ledger DataFrame loaded in section 2.2
data = dfworking
# Total number of records
n_records = len(data.index)
# Number of high-value payments above the upper outlier limit (RedFlag == 2)
n_greater_quantile = len(data[data['RedFlag'] == 2].index)
# Number of low-value payments below the lower outlier limit (RedFlag == 1)
n_lower_quantile = len(data[data['RedFlag'] == 1].index)
# Percentage of payments above the upper outlier limit
greater_percent = float(n_greater_quantile) / n_records * 100.0
# Percentage of payments below the lower outlier limit
lower_percent = float(n_lower_quantile) / n_records * 100.0
# Print the results
print("Total number of records: {}".format(n_records))
print("High value Payments above 1.5 times of 75th Percentile: {}".format(n_greater_quantile))
print("Low value Payments below 1.5 times of 25th Percentile: {}".format(n_lower_quantile))
print("Percentage of high value Payments: {:.2f}%".format(greater_percent))
print("Percentage of low value Payments: {:.2f}%".format(lower_percent))

The results show 7,293 records in total: 366 high-value payments above the upper outlier limit (5.02%) and 748 low-value payments below the lower outlier limit (10.26%).
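
For comparison, the conventional interquartile-range (Tukey) rule computes the limits as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. This is a different rule from the percentile-multiple definition used in this article, and the sketch below is offered only as an alternative way to derive outlier limits directly from the Amount column. The helper name `iqr_outlier_limits` and the sample amounts are hypothetical.

```python
import pandas as pd

def iqr_outlier_limits(amounts, k=1.5):
    """Return (lower, upper) limits using the interquartile-range rule."""
    q1 = amounts.quantile(0.25)
    q3 = amounts.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical amounts with one extreme payment
amounts = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 26, 1000])
lower, upper = iqr_outlier_limits(amounts)
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(lower, upper, outliers.tolist())
```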

4. Data Preparation

# PREPARING DATA
# Split the data into features and target label
payment_raw = pd.DataFrame(data['RedFlag'])
features_raw = data.drop('RedFlag', axis=1)
# Inspect the feature types before removing columns
print(features_raw.dtypes)
# Remove redundant columns from the features_raw dataset
features_raw = features_raw.drop(
    ['TransactionNo', 'Department', 'Account',
     'Expense Category', 'Vendor ID', 'Payment Method'],
    axis=1)

The code above splits the dataset into features and a target label, and removes redundant columns from the feature set.
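
After the drop, some remaining columns (for example the payment status, dates, and invoice IDs) may still be non-numeric, and most scikit-learn estimators require numeric input. One possible way to handle this, sketched under the assumption that only Payment Status is worth keeping as a categorical feature, is to keep the numeric columns and one-hot encode the chosen categorical ones. The helper name `encode_features` and the sample values are hypothetical.

```python
import pandas as pd

def encode_features(features, categorical=("Payment Status",)):
    """Keep numeric columns and one-hot encode the listed categorical columns."""
    numeric = features.select_dtypes(include="number")
    dummies = pd.get_dummies(features[list(categorical)], dtype=int)
    # Other non-numeric columns (IDs, raw date strings) are left out
    return pd.concat([numeric, dummies], axis=1)

# Hypothetical feature rows
raw = pd.DataFrame({
    "Amount": [120.0, 80.0, 45.5],
    "Month": [1, 2, 2],
    "Payment Status": ["Paid", "Open", "Paid"],
    "Invoice ID": ["INV-1", "INV-2", "INV-3"],
})
print(encode_features(raw).columns.tolist())
```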

5. Summary of the Data Preprocessing Workflow

To show the preprocessing steps more clearly, they can be presented as a mermaid flowchart:

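graph TD;
    A[Read dataset] --> B[Check that the data loaded];
    B --> C[Convert to Pandas DataFrame];
    C --> D[Inspect DataFrame information];
    D --> E[Check for missing values];
    E --> F[Visualize feature distributions];
    F --> G[Analyze feature importance];
    G --> H[Detect outliers];
    H --> I[Split data and select features];
    I --> J[Remove redundant columns];

The flowchart covers the whole process from reading the dataset to finishing preprocessing. Each step feeds the next and prepares the data for building a machine learning model.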

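6. Approach to Building the Machine Learning Model

Once preprocessing is complete, model building can begin. The following steps outline one possible approach.

6.1 Choosing a Suitable Model

Depending on the nature of the problem (here, classification), a suitable supervised learning model can be chosen, such as logistic regression, a decision tree, or a random forest. Each model has its own characteristics and use cases, so the choice depends on the situation at hand.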
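
One way to ground the choice among these models is to compare them with cross-validation. The sketch below does so on synthetic three-class data generated with `make_classification`, standing in for the prepared ledger features; the data, the macro-averaged F1 scoring, and the hyperparameters are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (stand-in for features and RedFlag labels)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean macro-F1 over five folds for each candidate
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()
    for name, model in candidates.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```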

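6.2 Model Training and Evaluation

Split the preprocessed data into a training set and a test set, train the model on the training set, and evaluate it on the test set. Evaluation metrics can include accuracy, precision, recall, and the F1 score, which together give a fuller picture of model performance.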
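
The metrics named above can be computed together with scikit-learn. Because the RedFlag label has three classes, precision, recall, and F1 need an averaging strategy; macro averaging is used here as an assumption, and the helper name `evaluate` and the sample labels are hypothetical.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred, average="macro"):
    """Return accuracy plus averaged precision, recall and F1 for multi-class labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }

# Hypothetical true and predicted RedFlag labels
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
for name, value in evaluate(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes such as these, accuracy alone can look good while a minority class is poorly recognized, which is why the per-class averages are worth reporting.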

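6.3 Model Tuning

Model performance can be improved by adjusting hyperparameters, such as the depth of a decision tree or the number of trees in a random forest. Methods such as grid search and random search can be used to look for the best parameter combination.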
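
A grid search over the two parameters mentioned above might look like the following. This is a sketch on synthetic data using scikit-learn's `GridSearchCV`; the parameter values, the macro-F1 scoring, and the three-fold split are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data (stand-in for the prepared ledger features)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

# Candidate values for the number of trees and the tree depth
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="f1_macro", cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Grid search tries every combination, so its cost grows quickly with the grid size; `RandomizedSearchCV` samples a fixed number of combinations instead.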

7. Extended Example: Using a Logistic Regression Model

The following example uses a logistic regression model for classification:

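from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Logistic regression needs numeric input, so keep only the numeric feature columns
X = features_raw.select_dtypes(include='number')
y = payment_raw.values.ravel()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the logistic regression model (a higher max_iter helps the solver converge)
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate the model; RedFlag has three classes (0, 1, 2), so F1 needs an averaging strategy
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Model accuracy: {accuracy}")
print(f"Model F1 score: {f1}")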

This example shows how to train a logistic regression classifier and evaluate its performance.