【XGBoost】MAP - Charting Student Math Misunderstandings

XGBoost: use XGBoost in place of SVM; it usually performs well on imbalanced datasets.

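  • Class weight adjustment: sweep scale_pos_weight over the values 5, 6, 7, 8, 9, 10 to adjust the class weights and raise recall on the minority class (explanations that contain a misconception).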

  • Hyperparameter tuning: use GridSearchCV to tune the model's hyperparameters and improve performance.

import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import xgboost as xgb
import matplotlib.pyplot as plt

# Step 1: Load the dataset
file_path = '/path/to/your/train.csv'  # Replace with the actual file path
data = pd.read_csv(file_path)

# Step 2: Clean the student explanation text (remove punctuation and lower case)
def clean_text(text):
    text = text.lower()  # Convert to lower case
    text = ''.join([char for char in text if char not in string.punctuation])  # Remove punctuation
    return text

# Apply the cleaning function to the 'StudentExplanation' column
data['cleaned_explanation'] = data['StudentExplanation'].apply(clean_text)

# Step 3: Feature extraction using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(data['cleaned_explanation'])

# Step 4: Prepare labels (Misconception column)
# We will predict if the explanation contains a misconception or not
data['Misconception'] = data['Misconception'].fillna('No_Misconception')

# Convert labels to binary: 'No_Misconception' -> 0, any other label -> 1
y = data['Misconception'].apply(lambda x: 0 if x == 'No_Misconception' else 1)

# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Use XGBoost with scale_pos_weight to handle the class imbalance
# (XGBClassifier has no class_weight option; scale_pos_weight up-weights the positive class)
xgb_model = xgb.XGBClassifier(scale_pos_weight=3, random_state=42)  # Adjust scale_pos_weight for imbalanced classes

# Step 7: Tune the model with GridSearchCV to find the best parameters
param_grid = {
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}
grid_search = GridSearchCV(xgb_model, param_grid, scoring='f1', cv=3, verbose=1)
grid_search.fit(X_train, y_train)

# Step 8: Get the best model from the grid search
best_model = grid_search.best_estimator_

# Step 9: Make predictions
y_pred_xgb = best_model.predict(X_test)

# Step 10: Evaluate the model
print(classification_report(y_test, y_pred_xgb))

# Step 11: Plot confusion matrix
cm_xgb = confusion_matrix(y_test, y_pred_xgb)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_xgb, display_labels=['No Misconception', 'Misconception'])
disp.plot(cmap=plt.cm.Blues)
plt.title('XGBoost Model Confusion Matrix')
plt.show()

Output:

Fitting 3 folds for each of 27 candidates, totalling 81 fits
              precision    recall  f1-score   support

           0       0.95      0.70      0.80      5277
           1       0.54      0.90      0.67      2063

    accuracy                           0.75      7340
   macro avg       0.74      0.80      0.74      7340
weighted avg       0.83      0.75      0.77      7340
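
The numbers in this output can be cross-checked with a little arithmetic. The sketch below only uses values printed above (the grid sizes, the class supports, and the rounded precision and recall for class 1). The variable names are illustrative, and the class ratio is taken from the test split, on the assumption that it roughly mirrors the training split.

```python
from itertools import product

# Same grid as in the script: 3 values for each of 3 hyperparameters
param_grid = {
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
}
cv_folds = 3

# GridSearchCV tries every combination, each fitted once per fold
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 27 81, matching the "Fitting 3 folds..." line

# Class supports from the report: 5277 negatives, 2063 positives
support_neg, support_pos = 5277, 2063
imbalance_ratio = support_neg / support_pos
print(f"negative/positive ratio: {imbalance_ratio:.2f}")


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 1 from the report: precision 0.54, recall 0.90 (both already rounded)
f1_pos = f1(0.54, 0.90)
print(f"F1 for class 1 from rounded inputs: {f1_pos:.3f}")
```

The negative-to-positive ratio comes out at about 2.56, which is the starting value the XGBoost documentation suggests for scale_pos_weight, so the script's setting of 3 sits slightly above it. The F1 recomputed from the rounded precision and recall is about 0.675, in line with the reported 0.67.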
