Code Collection for Modeling Competitions

网路末端遗传因子

Last modified: 2024-03-30 23:02:26

Tags: python

First published: 2021-10-10 20:13:12

Article link: https://blog.youkuaiyun.com/qq_54394719/article/details/120689805

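sklearn general reference

Introduction · sklearn Chinese documentation: https://sklearn.apachecn.org/

The linked articles below overlap in content; I picked the better-written ones as representatives. Thanks to the authors for their contributions.

Some algorithms are left out because better alternatives exist.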

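0 Unifying the environment

How to create a new environment in Anaconda (优快云 blog)

A and B are competing as a team, and A needs to hand their environment to B. First, A activates the environment myenv:
conda activate myenv

Then A exports the package list of myenv to a file:

conda list --explicit > spec-list.txt

Finally, B creates an environment from spec-list.txt on their own machine (both machines need the same operating system, because the explicit list pins platform-specific packages):
conda create --name python-course --file spec-list.txt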

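1 Data preprocessing

1.1 Data cleaning

Data mining with Python (data preprocessing) (TcD's blog, 优快云) [python]

Loading the data and taking a quick look
Handling missing data
Handling outliers
Descriptive statistics
Merging and joining features
Data transformation, standardization, normalization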

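1.2 Encoding conversion

Week 3 check-in: data preprocessing and feature engineering (onepiece0603's blog, 优快云)

1.3 Feature engineering

Feature selection covers filter, wrapper, and embedded methods [sklearn, python]

Feature engineering with sklearn (三石, 优快云)

数据科学猫: Data binning in data preprocessing (Orange_Spotty_Cat's blog, 优快云)

Must-know competition techniques: automated feature engineering and building features faster (Zhihu)

Python implementation of the Pauta criterion (3-sigma rule) (乐此不疲的架构师's blog, 优快云)

SPSS principal component analysis | Computing indicator weights is really not hard! (Part 1) (weixin_39534208's blog, 优快云)

Python implementation of the entropy method (好吃的鱿鱼's blog, 优快云)

Determining weights with the entropy weight method (梁山伯与翠花's blog, 优快云)

MIC: maximal information coefficient (风云诀4's blog, 优快云)

Feature selection for tree models: Boruta (Zhihu, zhihu.com)

Extract hundreds of time-series features with a few lines of Python

Deep feature synthesis vs. genetic feature generation: comparing two automated feature generation strategies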

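1.4 Basic data processing (Python code)

1.4.1 Data processing

A table exported from SQL raises a UTF-8 decoding error when read:

import pandas as pd

df = pd.read_csv(filename, encoding='gb18030')
# or try: encoding='utf-8-sig'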

Saving to Excel:

# Method 1: append a sheet (the Excel file must already exist)
file_name = 'NAME.xlsx'
with pd.ExcelWriter(file_name, mode='a', engine='openpyxl') as writer:
    df_r.to_excel(writer, index=False, sheet_name=name)

# Method 2: no existing workbook needed; write one sheet per group
# The with-block saves and closes the file on exit
# (writer.save() was removed in pandas 2.0).
with pd.ExcelWriter('tables_by_org.xlsx') as writer:
    for i in class_list:
        his1 = his[his['orgname'] == i]
        his1.to_excel(writer, sheet_name=i, index=False)

Merging two tables on a column

import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       'A': ['A0', 'A1', 'A2', 'A3'],
                       'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']})
# the column passed to `on` is used as the join key
result = pd.merge(left, right, on='key')

result
Out[4]: 
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
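
The merge above is an inner join where every key matches. When the keys only partly overlap, the `how` and `indicator` parameters of `pd.merge` control which rows survive and show where each row came from. A minimal sketch with made-up tables (the shortened `left` and `right` frames here are for illustration only):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K2', 'K3'], 'C': ['C0', 'C2', 'C3']})

# Inner join (the default): only keys present in both tables survive.
inner = pd.merge(left, right, on='key')

# Left join: keep every row of `left`; unmatched rows get NaN in 'C'.
# indicator=True adds a '_merge' column saying where each row came from.
left_join = pd.merge(left, right, on='key', how='left', indicator=True)

print(inner)
print(left_join)
```

Checking the `_merge` column after a left join is a quick way to spot keys that failed to match before they turn into silent NaN values downstream.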

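Fuzzy filtering on a column

# keep the rows whose id column contains test_a or test_b
able_word = 'test_a|test_b'    # alternatives separated by | (regex OR)

# convert the id column to strings
df['id'] = df['id'].astype(str)

# filter
def id_select(df, able_word):
    '''
    df: DataFrame with an 'id' column
    able_word: str, regex pattern of the substrings to keep
    return: DataFrame containing only the matching rows
    '''
    return df[df['id'].str.contains(able_word)]

df_select = id_select(df, able_word)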

Computing elapsed time since the first sample, in seconds (or days)

df['SampleTime']
------------------------
1    2023-03-21 15:47:07
2    2023-03-21 16:01:35
3    2023-03-22 08:09:44
4    2023-03-22 09:27:27
5    2023-03-22 17:31:26

...
54   2023-04-02 17:34:40
55   2023-04-03 09:25:29
56   2023-04-03 09:48:31
57   2023-04-03 10:05:18
Name: SampleTime, dtype: datetime64[ns]
----------------------------------------

import numpy as np

# time elapsed since the first sample
t = df['SampleTime'].values
dt = t - t[0]                          # timedelta64[ns] array
t_sec = dt / np.timedelta64(1, 's')    # elapsed time in seconds
t_day = t_sec / 60 / 60 / 24           # elapsed time in days
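
The same calculation can stay inside pandas through the `.dt.total_seconds()` accessor, which also makes it easy to get the gap between consecutive samples. A small sketch, assuming a `SampleTime` column like the one listed above (only the first three timestamps are reused here):

```python
import pandas as pd

# First three timestamps from the listing above
df = pd.DataFrame({'SampleTime': pd.to_datetime([
    '2023-03-21 15:47:07',
    '2023-03-21 16:01:35',
    '2023-03-22 08:09:44',
])})

# Seconds elapsed since the first sample
elapsed_s = (df['SampleTime'] - df['SampleTime'].iloc[0]).dt.total_seconds()

# Seconds between consecutive samples (the first row has no predecessor -> NaN)
gap_s = df['SampleTime'].diff().dt.total_seconds()

print(elapsed_s.tolist())
print(gap_s.tolist())
```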

Removing digits from the strings in a column

import pandas as pd

# create the dataframe
df = pd.DataFrame.from_dict({'Name': ['rohan21', 'Jelly',
                                      'Alok22', 'Hey65',
                                      'boy92'],
                             'Age': [24, 25, 10, 73, 92],
                             'Income': [28421, 14611, 28200,
                                        45454, 66565]})

# remove digits from the strings of the specified
# column, here 'Name' (regex=True is required in pandas 2.0+)
df['Name'] = df['Name'].str.replace(r'\d+', '', regex=True)

# display the output with digits removed from
# the required strings
print(df)

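Removing characters at fixed positions in a string

# drop the last character
data['result'] = data['result'].map(lambda x: str(x)[:-1])
# drop the first two characters
data['result'] = data['result'].map(lambda x: str(x)[2:])

Extracting numbers from strings

df["Language"].str.findall(r'\d+')  # extract the numeric parts of each string

[Python data analysis] pandas string operations (OLIVER_QIN, cnblogs)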

1.4.2 Parallel processing

NOTE: do not nest the two parallel approaches below inside one another within a function, or things get scrambled.

from joblib import Parallel, delayed
from concurrent.futures import ThreadPoolExecutor

Parallel usage example:

    def optimize_process(self):
        # pick the worker function once; it is the same in both branches
        func = self.execute_optimizer_timesequence if self.ists else self.execute_optimizer

        if self.isParallel:
            # n_jobs=-1 uses all CPU cores; loky is joblib's process-based backend
            best_param = Parallel(n_jobs=-1, backend='loky')(
                delayed(func)({key: value})
                for key, value in self.data.items())
        else:
            best_param = [func({key: value}) for key, value in self.data.items()]

ThreadPoolExecutor usage example:

'''
To run tasks on multiple threads without scrambling the output order, use ThreadPoolExecutor from Python's concurrent.futures library. Note that Python's global interpreter lock (GIL) keeps threads from fully using multiple CPU cores, so CPU-bound tasks see little speedup and may even get slower because of thread switching. For I/O-bound or network-waiting tasks, however, multithreading can improve throughput substantially.

The code below creates a thread pool with ThreadPoolExecutor and submits the tasks to it. Because the order in which threads finish is unpredictable, the for loop reads the Future results in submission order so the output order is preserved.

Multithreading also adds overhead of its own, such as communication between threads and thread switching, so not every task benefits from it. Before reaching for threads, check the nature of the task: is it I/O-bound, and does it spend a lot of time waiting?
'''
from typing import Dict
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def execute_optimizer_timesequence(self, batch_data: Dict[str, pd.DataFrame]) -> dict:
    def update_best_param(res, err_list, best_param, subset_ind):
        # res is the (_, params, error) tuple returned by opt_ls_param_process
        _, result_discrete_param, err = res
        err_list.append(err)
        for c, v in enumerate(result_discrete_param):
            best_param[subset_ind, c] = v
        return best_param

    err_list = []
    keys = list(batch_data.keys())[0]
    sub_data = {keys: {key: value[batch_data[keys]['control'][:, 1] == self.flag_ts]
                       for key, value in batch_data[keys].items()}}
    best_param = self.set_param_matrix(sub_data[keys]['state_time'])
    n_subsets = len(sub_data[keys]['state_time']) - 1

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(self.opt_ls_param_process, sub_data, subset_ind, keys)
                   for subset_ind in range(n_subsets)]
        # read the futures in submission order so results line up with subset_ind
        for i, future in enumerate(futures):
            best_param = update_best_param(future.result(), err_list, best_param, i)

    if self.flag_ts:
        # the last subset is processed outside the pool
        res = self.opt_ls_param_process(sub_data, n_subsets, keys)
        best_param = update_best_param(res, err_list, best_param, n_subsets)
    else:
        best_param[-1, :] = best_param[-2, :]

    return {keys: {'param': best_param, 'error': err_list}}
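
The order-preserving pattern described above is easier to see in isolation. The following is a self-contained sketch using only the standard library; `slow_square` is a hypothetical stand-in for an I/O-bound task, written so that later inputs finish first:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Hypothetical I/O-like task: sleep briefly, then return x squared."""
    time.sleep(0.01 * (5 - x))  # later inputs finish first
    return x * x


inputs = [0, 1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns immediately with a Future for each task
    futures = [executor.submit(slow_square, x) for x in inputs]
    # Reading the futures in submission order keeps results aligned with
    # the inputs, no matter which task finished first
    results = [f.result() for f in futures]

print(results)
```

If the tasks need no per-result bookkeeping, `executor.map(slow_square, inputs)` gives the same ordering guarantee in one call.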

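2 Discrete/continuous (classification/prediction) algorithms

2.1 Discrete / classification algorithms

Bayesian belief networks

Bayesian networks in practice with Python (Titanic dataset, pgmpy library) (leida_wt's blog, 优快云)

SVM classification

SVM principles and Python code for SVM classification and regression (如何利用html码转载别人的博客, 优快云)

Random forest classification

Classification with the random forest algorithm (少年吉's blog, 优快云)

DBSCAN

DBSCAN clustering with scikit-learn (大数据训练营, 优快云)

KNN

Implementing KNN (k-nearest neighbors) with scikit-learn: a complete example (weixin_30648587's blog, 优快云)

KDTREE [speeds up nearest-neighbor lookups]

Scikit-learn: nearest-neighbor search with sklearn.neighbors (皮皮blog, 优快云)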

2.2 Continuous / prediction algorithms

SVM regression

SVM principles and Python code for SVM classification and regression (如何利用html码转载别人的博客, 优快云)

Random forest regression

Random forest regression in Python (wokaowokaowokao12345's column, 优快云)

XGB

Using XGBoost in Python (宋建国's blog, 优快云)

ROBYN-MMM

https://github.com/facebookexperimental/Robyn

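3 Supervised vs. unsupervised learning

3.1 Supervised learning

SVM

SVM principles and Python code for SVM classification and regression (如何利用html码转载别人的博客, 优快云)

XGB

Using XGBoost in Python (宋建国's blog, 优快云)

Classification with the random forest algorithm (少年吉's blog, 优快云)

ROBYN-MMM

https://github.com/facebookexperimental/Robyn

Bayesian belief networks

Bayesian networks in practice with Python (Titanic dataset, pgmpy library) (leida_wt's blog, 优快云)

3.2 Unsupervised learning

Analytic hierarchy process (AHP)

Mathematical modeling: analytic hierarchy process (Python implementation) (ddjhpxs's blog, 优快云)

KPCA

KPCA dimensionality reduction in Python (WANG_DDD's blog, 优快云)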

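4 Neural networks

CNN, RNN, GAN: omitted

BiLSTM

Text classification in practice (4): Bi-LSTM model (微笑sun, cnblogs)

DIN

Understanding the CTR deep learning model DIN (Deep Interest Network), with examples (VariableX's blog, 优快云)

5 Reinforcement learning

Building your own gym environment and calling it (lxs3213196's blog, 优快云)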

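6 Time series

SARIMA

prophet

Panel data

7 Model evaluation

Confusion matrix

Analytic hierarchy process (AHP)

Mathematical modeling: analytic hierarchy process (Python implementation) (ddjhpxs's blog, 优快云)

KPCA

KPCA dimensionality reduction in Python (WANG_DDD's blog, 优快云)

ROBYN-MMM

https://github.com/facebookexperimental/Robyn

KDE

Kernel Density Estimation (KDE) (数据常青藤)

Grey prediction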

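8 Parameter search

Hyperparameter tuning with Nevergrad (Blossom Flight's blog, 优快云)

9 Data generation

Random walk

Generating synthetic time-series data with random walks (数据派THU, 优快云)

MCMC

Markov chain Monte Carlo (MCMC): a practical case study in Python (-派神-'s blog, 优快云)

10 AutoML

AutoGluon code (Blossom Flight's blog, 优快云)

Auto-sklearn [Linux only]

AutoKeras [could not get it configured]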

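11 Plotting

Plotting with plt

The 25 most useful Matplotlib plots (with Python code templates), Tencent Cloud Developer Community (tencent.com)

import matplotlib.pyplot as plt

# avoid garbled Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# set the canvas size
plt.figure(figsize=(7, 5))

# set the title
plt.title('hello world!')

# show grid lines
plt.grid(True)

# set the axis labels
plt.xlabel('x')
plt.ylabel('y')

# fit the axis limits tightly to the data
plt.axis('tight')

# color: blue, line width 1.5, dashed (x and y are your data arrays)
line1, = plt.plot(y, 'b', lw=1.5, linestyle='--')
# mark the data points with red circles
line2, = plt.plot(y, 'ro')

# set the legend (after the lines exist)
plt.legend((line1, line2), ['1st', '2nd'])

# reverse the y axis
plt.gca().invert_yaxis()

# scatter plot
plt.scatter(x, y, marker="*")

# set the axis ranges
plt.xlim((-5, 5))
plt.ylim((-2, 2))

# draw
plt.show()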

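Drawing several plots on one canvas with matplotlib: plt.subplot(), plt.subplots()

Nice-looking subplots: My Matplotlib plotting template · Zodiac Wang

# the current plot is the first cell of a 1-row, 2-column grid
plt.subplot(1,2,1)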
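
`plt.subplot` selects one cell at a time, whereas `plt.subplots` creates the whole grid up front and returns the Axes objects. A minimal sketch of the second style; the sine and cosine data and the output file name are placeholders:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One row, two columns; `axes` is an array holding the two Axes objects
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(x, np.sin(x), 'b--', lw=1.5)
axes[0].set_title('sin(x)')

axes[1].plot(x, np.cos(x), 'r')
axes[1].set_title('cos(x)')

fig.tight_layout()  # keep titles and tick labels from overlapping
fig.savefig('subplots_demo.png')  # hypothetical output file name
```

In an interactive session, drop the `matplotlib.use('Agg')` line and call `plt.show()` instead of saving.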

'''
Draw a 3D plot in which two of the axes are on a log scale
'''

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# input data
data = {
    'tol': [1.00E-09, 1.00E-08, 2.50E-05, 5.00E-05, 7.50E-05, 0.0001, 0.001, 0.01, 5],
    'mean_loss': [94.96326704, 94.96506666, 95.00390966, 95.00590338, 95.01798116, 95.01963727, 94.99324471, 95.39852246, 95.58903045],
    'time': [11710.02623, 3669.577245, 210.9115348, 134.5718384, 81.52694511, 83.95328236, 40.44198251, 13.13918161, 20.30696917]
}

df = pd.DataFrame(data)
# take log10 of the two columns that should appear on a log scale
df['tol'] = np.log10(df['tol'])
df['time'] = np.log10(df['time'])

# create a new figure object
fig = plt.figure()

# create the 3D axes
ax = fig.add_subplot(111, projection='3d')

# draw a scatter plot on the 3D axes
ax.scatter(df['time'], df['mean_loss'], df['tol'])

# the time and tolerance axes show log10 values
ax.set_xlabel('log10(Time)')
ax.set_ylabel('Mean Loss')
ax.set_zlabel('log10(Tolerance)')

plt.show()

matplotlib plotting: log-scale axes and decimal tick display

Colors: color=

Matplotlib color reference table

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Compute the ROC curve and ROC area (y_test: true labels, y_score: predicted scores)
fpr, tpr, threshold = roc_curve(y_test, y_score)  # false positive rate and true positive rate
roc_auc = auc(fpr, tpr)  # area under the curve

lw = 2
plt.figure(figsize=(10,10))
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x axis, TPR on the y axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

A detailed guide to visualizing the intermediate layers of a neural network (python__reported's blog, 优快云)

# save as a vector graphic
plt.savefig('SVR_all.eps',dpi=1200,format='eps')

Plotting with plotly

plotly.figure_factory.create_distplot is typically used to combine a histogram with a kernel density estimate curve, showing how the data is distributed.

import numpy as np

import plotly.graph_objs as go
import plotly.figure_factory as ff
m = np.random.normal(loc=0.08, scale=0.0008, size=5000)
hist_data = [m, m+0.001]

group_labels = ['m1', 'm2']
colors = ['#333F44', '#37AA9C']

# Create the distplot (show_hist=False draws only the density curves)
fig = go.FigureWidget(ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors))
fig.layout.update(title='Density curve')
fig

12 Web scraping

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from datetime import datetime
import time
import random
# choose the device for Chrome to emulate
mobileEmulation = {"deviceName": "iPad"}
# attach the device to the browser:
# instantiate the Chrome options object
options = webdriver.ChromeOptions()
options.add_experimental_option("mobileEmulation", mobileEmulation)
driver = webdriver.Chrome(options=options)
actions = ActionChains(driver)
driver.get("https://cn.investing.com/currencies/us-dollar-index")  # US Dollar Index futures
count = 0  # to avoid anti-scraping measures, refresh the page after about 2000 reads
print("Start time:")
print(datetime.now())
while True:
    if count <= 2000:
        key = driver.find_element(By.XPATH, '//*[@id="last_last"]')
        time.sleep(random.randint(25, 50) / 1000)  # random 25-50 ms pause
        print(key.text)
        count = count + 1
        print(count)
        print(datetime.now())
    else:
        driver.refresh()
        count = 0

FOMC

from selenium import webdriver
from selenium.webdriver.common.by import By
from datetime import datetime
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
def temp_create(j):
    # zero-pad a month or day to two digits
    return str(j).zfill(2)

T = []
# try every date from 2009 through 2012 and keep the URLs that exist
for i in range(2009, 2013):
    for j in range(1, 13):
        for k in range(1, 32):
            html = str(i) + temp_create(j) + temp_create(k) + 'a'  # switch the suffix to 'a' or 'b'
            html_text = 'https://www.federalreserve.gov/newsevents/pressreleases/monetary' + html + '.htm'
            print(html_text)
            driver.get(html_text)
            test = driver.find_element(By.XPATH, '//*[@id="page-title"]/h2')
            if test.text == 'Page not found':
                print("%s: nothing published that day" % html)
                continue
            T.append(html_text)
            print("added")
print(T)
print(len(T))
for i in T:
    time1 = datetime.now()
    driver.get(i)
    c = driver.find_element(By.XPATH, '//*[@id="article"]/div[1]/h3')
    if c.text != "FOMC statement" and c.text != 'Federal Reserve issues FOMC statement':
        print("not an FOMC statement, skipping this page")
        continue
    else:
        print("this link is an FOMC statement: %s" % i)
    main_text = driver.find_element(By.XPATH, '//*[@id="article"]/div[3]')
    title = driver.find_element(By.XPATH, '//*[@id="article"]/div[1]')
    full_name = i[56:73] + '.txt'  # 'monetary' + date + suffix, taken from the URL
    with open(full_name, 'w+', encoding='utf-8') as f:
        # timestamp before writing
        f.write(str(time1)[0:19] + '.' + str(time1)[20:23] + '.' + str(time1)[23:26])
        f.write("\n")
        f.write(title.text)
        f.write(main_text.text)
        time2 = datetime.now()
        f.write("\n")
        # timestamp after writing
        f.write(str(time2)[0:19] + '.' + str(time2)[20:23] + '.' + str(time2)[23:26])

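13 Tips

Choose a threshold suited to the application domain so that it maximizes the metric that matters (usually precision or recall)

probs = model.predict_proba(X_test)
# with the probabilities in hand, we can test how different thresholds perform
def probs_to_prediction(probs, threshold):
    pred = []
    for x in probs[:, 1]:  # column 1 holds the probability of the positive class
        if x > threshold:
            pred.append(1)
        else:
            pred.append(0)
    return pred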
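
To compare thresholds in practice, the conversion can be vectorized and paired with a precision/recall calculation. The sketch below uses made-up labels and probabilities in place of a real model's `predict_proba` output; `precision_recall` is a helper written for this illustration:

```python
import numpy as np


def probs_to_prediction(probs, threshold):
    """Vectorized version: 1 where P(class 1) > threshold, else 0."""
    return (probs[:, 1] > threshold).astype(int)


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (helper for this sketch)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and probabilities standing in for model.predict_proba(X_test)
y_true = np.array([0, 0, 1, 1, 1])
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])

for threshold in (0.2, 0.5, 0.85):
    y_pred = probs_to_prediction(probs, threshold)
    p, r = precision_recall(y_true, y_pred)
    print(f'threshold={threshold}: precision={p:.2f}, recall={r:.2f}')
```

Raising the threshold trades recall for precision, so the right value depends on which error is more costly in the application.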

One-click Chinese data augmentation tool

Parallel processing with joblib

# parallel
import numpy as np
from joblib import Parallel, delayed
from tqdm import tqdm

# warping_paths and best_path, as well as df_test and class_dataset,
# are assumed to be defined elsewhere

def warping_path(from_s, to_s, **kwargs):
    """Compute the warping-path loss between two sequences."""
    dist, paths = warping_paths(from_s, to_s, **kwargs)
    path, var = best_path(from_s, to_s, paths)
    loss = dist * var
    return loss

class_min_loss = {}
result = []
for i in tqdm(range(len(df_test))):
    s1 = np.array(df_test.iloc[i, 1:]).astype('float32')
    y = df_test.iloc[i, 0]
    for j in class_dataset:  # class name
        # compare s1 against every sequence of class j in parallel (loky: process-based)
        loss_temp = Parallel(n_jobs=4, backend='loky')(delayed(warping_path)(s1, np.array(class_dataset[j].iloc[k, :]).astype('float32'))
                               for k in range(len(class_dataset[j])))
        class_min_loss[j] = min(loss_temp)
    # pick the class with the smallest loss
    y_hat = min(class_min_loss, key=lambda x: class_min_loss[x])
    result.append(str(y) == y_hat)

14 Datasets

The 50 best public machine learning datasets (with links)

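15 Text and variables

Saving and loading variables:

import pickle

# save the variable est (a model) locally as gp_model.pkl and load it back
# save
with open('gp_model.pkl', 'wb') as f:
    pickle.dump(est, f)
# load
with open('gp_model.pkl', 'rb') as f:
    est = pickle.load(f)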
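
A quick way to convince yourself that the round trip is lossless is to pickle a small object and compare it with what comes back. This sketch uses a plain dict as a stand-in for a fitted model and a hypothetical file name in the temp directory:

```python
import os
import pickle
import tempfile

# Any picklable object works; this dict stands in for a fitted model
model = {'name': 'gp_model', 'coef': [0.5, 1.5], 'fitted': True}

# Hypothetical file name, placed in the system temp directory
path = os.path.join(tempfile.gettempdir(), 'gp_model_demo.pkl')

# Save: open the file in binary write mode
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load: open the file in binary read mode
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)  # equal content, but a separate object
os.remove(path)
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.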

The differences between PyTorch .pt, .pth, and .pkl model files, and how to save models (优快云 blog)

16 Differential equations

MATLAB library dedicated to biological metabolism modeling:

SimBiology

An ode45 implementation in Python

https://codereview.stackexchange.com/questions/163499/ode45-solver-implementation-in-python
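
For intuition about what such a solver does, here is a much simpler relative: the classical fixed-step fourth-order Runge-Kutta method (RK4). This is not ode45, which is an adaptive-step Dormand-Prince method; the function name `rk4` and the test problem are chosen for illustration only:

```python
import math


def rk4(f, t0, y0, t_end, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t_end with n_steps fixed RK4 steps."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        # weighted average of the four slope estimates
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y


# Test problem: dy/dt = -2y, y(0) = 1, exact solution y(t) = exp(-2t)
y_num = rk4(lambda t, y: -2 * y, 0.0, 1.0, 1.0, 100)
print(y_num, math.exp(-2.0))
```

For real work, `scipy.integrate.solve_ivp` is the closer match to ode45: its default method, `'RK45'`, is also an adaptive Dormand-Prince scheme.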

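Introductory hands-on examples for differential equations

Several classic virus dynamics models [study notes on MATLAB-based dynamics models, part 3] (歪卜巴比's blog, 优快云)

Using scipy.integrate and python-control in Python

Solving control problems with Python, part 1: solving differential equations and plotting state-equation solutions with solve_ivp (WeiqingAi's blog, 优快云)

Python examples with scipy.integrate

https://pythonnumericalmethods.berkeley.edu/notebooks/chapter22.06-Python-ODE-Solvers.html

Basic differential-equation problems can also be solved with reinforcement learning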

17 Basic math methods: just call a library

import math  # import the math library
a, b = 12, 18  # example values
print(math.gcd(a, b))  # greatest common divisor
print(a * b // math.gcd(a, b))  # least common multiple, derived from the gcd

# least common multiple of a list of numbers
import math
s = list(map(int, input().split()))
def gbs(s):
    a = s[0]
    for b in s[1:]:
        # lcm(a, b) = a / gcd(a, b) * b; divide first to keep the numbers small
        a = a // math.gcd(a, b) * b
    return a

print(gbs(s))



# greatest common divisor of a list of numbers
import math
def gcd_many(s):
    g = 0
    for i in range(len(s)):
        if i == 0:
            g = s[i]
        else:
            g = math.gcd(g, s[i])

    return g

s = list(map(int, input().split()))
print(gcd_many(s))
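
Both loops above are folds over the list, so `functools.reduce` expresses them in one line each. A short sketch; the helper names `lcm` and `lcm_many` are chosen here for illustration:

```python
import math
from functools import reduce


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a // math.gcd(a, b) * b


def gcd_many(numbers):
    """Fold math.gcd over the whole list."""
    return reduce(math.gcd, numbers)


def lcm_many(numbers):
    """Fold the pairwise lcm over the whole list."""
    return reduce(lcm, numbers)


nums = [12, 18, 30]
print(gcd_many(nums))
print(lcm_many(nums))
```

On Python 3.9 and later, `math.gcd(*nums)` and `math.lcm(*nums)` accept any number of arguments directly.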

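18 VS Code quirks

After editing a .py file, the module has to be reloaded before the latest version is used; a plain import still returns the old one.

import sys
from importlib import reload

# dotted path from the project root to the file, e.g. model/optimize_algorithm/nls.py becomes
reload(sys.modules['model.optimize_algorithm.nls'])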
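
The behavior is easy to reproduce with a throwaway module. This sketch writes a hypothetical module `demo_mod` to a temp directory, edits it on disk, and shows that the change only becomes visible after `importlib.reload`:

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module (hypothetical name demo_mod) into a temp directory
tmp_dir = tempfile.mkdtemp()
mod_path = os.path.join(tmp_dir, 'demo_mod.py')
with open(mod_path, 'w') as f:
    f.write('VALUE = 1\n')

sys.path.insert(0, tmp_dir)
import demo_mod
before = demo_mod.VALUE

# Edit the file on disk; the already-imported module object is unchanged
with open(mod_path, 'w') as f:
    f.write('VALUE = 2222\n')
still_old = demo_mod.VALUE

# reload() re-executes the source and updates the same module object in place
importlib.reload(demo_mod)
after = demo_mod.VALUE

print(before, still_old, after)
```

Names bound earlier with `from module import name` keep pointing at the old objects after a reload, so re-run those import statements as well.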

19 Shorthand examples from my own code

'''
I want to split the arrays in a dict of dicts into two parts.

For example, given

a = {'a':{'q':np.array([1,2,3]),'e':np.array([1,2,3])},'b':{'z':np.array([5,6,7]),'c':np.array([5,6,9])}}，

how can I get

a1 = {'a':{'q':np.array([1,2]),'e':np.array([1,2])},'b':{'z':np.array([5,6]),'c':np.array([5,6])}} and

a2 = {'a':{'q':np.array([3]),'e':np.array([3])},'b':{'z':np.array([7]),'c':np.array([9])}}

'''

import numpy as np

a = {'a':{'q':np.array([1,2,3]),'e':np.array([1,2,3])},'b':{'z':np.array([5,6,7]),'c':np.array([5,6,9])}}

# slice each inner array: everything but the last element, then the last element alone
a1 = {k: {i: v[:-1] for i, v in d.items()} for k, d in a.items()}
a2 = {k: {i: v[-1:] for i, v in d.items()} for k, d in a.items()}

print('a1: ', a1)
print('a2: ', a2)

'''
param_limit = ([1,2],[10,22])

a = tuple(limit_l for limit_l,limit_u in param_limit)

b = tuple(limit_u for limit_l,limit_u in param_limit)

How can this be written more concisely?
'''
param_limit = ([1,2],[10,22])
# zip(*...) transposes the pairs: a holds the lower limits, b the upper limits
a, b = zip(*param_limit)