Size mismatch: label size and predict size do not match

This post discusses a common error encountered when using XGBoost for multi-class classification, namely a mismatch between the prediction size and the label size. It analyzes the cause of the error and presents two solutions: changing the evaluation metric, and using the Scikit-Learn wrapper.

XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'

 

I am training an XGBClassifier on my training set.

My training features are a numpy array of shape (45001, 10338), and my training labels are a numpy array of shape (45001,). There are 1161 unique labels, which I have label-encoded.

The documentation clearly says that I can create a DMatrix from a numpy array, so I am passing the training features and labels as numpy arrays directly. However, I get the following error:

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-30-3de36245534e> in <module>()
     13  scale_pos_weight=1,
     14  seed=27)
---> 15 modelfit(xgb1, train_x, train_y)

<ipython-input-27-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
      6         xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
      7         cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8             metrics='auc',early_stopping_rounds=early_stopping_rounds)
      9         alg.set_params(n_estimators=cvresult.shape[0])
     10 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
    399         for fold in cvfolds:
    400             fold.update(i, obj)
--> 401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 
    403         for key, mean, std in res:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in <listcomp>(.0)
    399         for fold in cvfolds:
    400             fold.update(i, obj)
--> 401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 
    403         for key, mean, std in res:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in eval(self, iteration, feval)
    221     def eval(self, iteration, feval):
    222         """"Evaluate the CVPack for one iteration."""
--> 223         return self.bst.eval_set(self.watchlist, iteration, feval)
    224 
    225 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in eval_set(self, evals, iteration, feval)
    865             _check_call(_LIB.XGBoosterEvalOneIter(self.handle, iteration,
    866                                                   dmats, evnames, len(evals),
--> 867                                                   ctypes.byref(msg)))
    868             return msg.value
    869         else:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'

Please find my model code below:

def modelfit(alg, train_data_features, train_labels,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgb_param['num_class'] = 1161   
        xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc',early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    #Fit the algorithm on the data
    alg.fit(train_data_features, train_labels, eval_metric='auc')

    #Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:,1]

    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(train_labels, dtrain_predictions))

Where am I going wrong here?

My classifier is as follows:

xgb1 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=50,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective='multi:softmax',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

EDIT 2: After changing the evaluation metric, I get the following error:

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-9-30c62a886c2e> in <module>()
     13  scale_pos_weight=1,
     14  seed=27)
---> 15 modelfit(xgb1, train_x_trail, train_y_trail)

<ipython-input-8-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
      6         xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
      7         cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8             metrics='auc',early_stopping_rounds=early_stopping_rounds)
      9         alg.set_params(n_estimators=cvresult.shape[0])
     10 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
    398                            evaluation_result_list=None))
    399         for fold in cvfolds:
--> 400             fold.update(i, obj)
    401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in update(self, iteration, fobj)
    217     def update(self, iteration, fobj):
    218         """"Update the boosters for one iteration"""
--> 219         self.bst.update(self.dtrain, iteration, fobj)
    220 
    221     def eval(self, iteration, feval):

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
    804 
    805         if fobj is None:
--> 806             _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
    807         else:
    808             pred = self.predict(dtrain)

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: b'[03:43:03] src/objective/multiclass_obj.cc:42: Check failed: (info.labels.size()) != (0) label set cannot be empty'

Tags: python, numpy, xgboost

asked Jul 23 '17 at 4:56 by Kathiravan Natarajan (edited Aug 4 '17 at 3:45)

========================================

2 Answers

Accepted answer (score 5, +50 bounty)

The original error you get occurs because this metric (AUC) was not designed for multi-class classification (see here).

You could use the scikit-learn wrapper of xgboost to overcome this issue. I modified your code to use this wrapper and produce a similar function. I am not sure why you are doing a grid search, though, as you are not enumerating over parameters; you are simply using the parameters you specified in xgb1. Here is the modified code:

import xgboost as xgb
import sklearn
import numpy as np
from sklearn.model_selection import GridSearchCV

def modelfit(alg, train_data_features, train_labels,useTrainCV=True, cv_folds=5):

    if useTrainCV:
        params = alg.get_xgb_params()
        # wrap each value in a list so the dict can serve as a GridSearchCV parameter grid
        xgb_param = dict([(key, [params[key]]) for key in params])

        boost = xgb.sklearn.XGBClassifier()
        cvresult = GridSearchCV(boost, xgb_param, cv=cv_folds)
        # fit on the function arguments rather than the global X, y
        cvresult.fit(train_data_features, train_labels)
        alg = cvresult.best_estimator_


    #Fit the algorithm on the data
    alg.fit(train_data_features, train_labels)

    #Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:,1]

    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % sklearn.metrics.accuracy_score(train_labels, dtrain_predictions))

xgb1 = xgb.sklearn.XGBClassifier(
 learning_rate =0.1,
 n_estimators=50,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective='multi:softmax',
 nthread=4,
 scale_pos_weight=1,
 seed=27)    


X=np.random.normal(size=(200,30))
y=np.random.randint(0,5,200)

modelfit(xgb1, X, y)

The output that I get is

Model Report
Accuracy : 1

Note that I used a much smaller dataset here. With the size that you mentioned, the algorithm may be very slow.

answered Aug 4 '17 at 15:41 by Miriam Farber

  • In TensorFlow we create batches and run them. Can I run this algorithm batch-wise, say 100 records at a time? How can I save this model and train it again? I will accept your answer – Kathiravan Natarajan Aug 5 '17 at 0:21

  • When you train a neural network in TensorFlow you use batch gradient descent, so you can process the data in chunks. xgboost operates differently, so you cannot simply split it into chunks. However, I looked at the xgboost FAQ page, xgboost.readthedocs.io/en/latest/faq.html, and in the section about large data sets they write: "XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fit into your memory (this usually means millions of instances). If you are running out of memory, check out the external memory version or distributed version of xgboost" (see the sketch after these comments). – Miriam Farber Aug 5 '17 at 8:45

  • Thus, based on the above quote, it seems that you can try to run the code on your computer as it is. You can also set verbose=2 in GridSearchCV so that it prints more details while running. If that does not work, you could try the distributed version; they link to it from the FAQ page (the one I linked in the previous comment). You could also set useTrainCV=False: since you have only one set of parameters, you don't really need the grid search, so you can skip that part of your code (which is currently the heaviest part). – Miriam Farber Aug 5 '17 at 9:00
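A minimal sketch of the external-memory option mentioned in the comment above, for xgboost versions of that era. The file name train.libsvm is purely illustrative; the '#dtrain.cache' suffix is what switches DMatrix to an on-disk cache instead of loading everything into memory:

import xgboost as xgb

# 'train.libsvm' is a placeholder libsvm-format file on disk; appending
# '#dtrain.cache' enables xgboost's external-memory (cache file) mode.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')

params = {'objective': 'multi:softmax', 'num_class': 1161, 'eval_metric': 'mlogloss'}
bst = xgb.train(params, dtrain, num_boost_round=50)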


========================================

Score: 2

The error occurs because you are trying to use the AUC evaluation metric for multi-class classification, but AUC is only applicable to two-class problems. In the xgboost implementation, "auc" expects the prediction size to be the same as the label size, while your multi-class prediction size would be 45001*1161. Use either the "mlogloss" or "merror" multi-class metric.
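For example, a minimal sketch of that change applied to the cv call from the question (only the metric arguments change; 'merror' would work the same way):

xgb_param = alg.get_xgb_params()
xgb_param['num_class'] = 1161
xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
cvresult = xgb.cv(xgb_param, xgtrain,
                  num_boost_round=alg.get_params()['n_estimators'],
                  nfold=cv_folds,
                  metrics='mlogloss',   # multi-class log loss instead of 'auc'
                  early_stopping_rounds=early_stopping_rounds)

# and later, when fitting:
alg.fit(train_data_features, train_labels, eval_metric='mlogloss')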

P.S.: currently, xgboost would be rather slow with so many classes, as there is some inefficiency with predictions caching during training.

answered Aug 3 '17 at 2:59 by Vadim Khotilovich

  • Please check the new error above after changing the evaluation metric – Kathiravan Natarajan Aug 4 '17 at 3:44
