Debugging tips
Verifying pop_auxiliary_feature_groups
Correct. At -/data_manager.py:128:
self.X_train.data[self.X_train.columns[self.X_train.feature_groups.isin(["cat", "highC_cat"])]].dtypes
On dataset 232 all of these dtypes are category.
Analysis tips
On dataset 232, in the experiment with experiment_id=39, gbt_lr achieved fairly good results. Analyze the trials with SQL:
select config, 1-loss as accuracy, cost_time,
estimator,
config_info -> 'origin' as origin,
config -> 'preprocessing:normed->final:__choice__' as feature_engineering,
config -> 'preprocessing:impute:missing_rate' as missing_rate,
-- additional_info -> 'confusion_matrices' as confusion_matrices,
additional_info -> 'best_iterations' as best_iterations,
additional_info -> 'cost_times' -> 0 as cost_times,
additional_info -> 'component_infos' -> 0 as component_infos
from trial where experiment_id=39 order by loss;
1 Concatenating two DataFrames row-wise
The following works under pandas==1.0.1 but not under pandas==0.25.3:
df = pd.concat(Xs, axis=0)
df.sort_index(inplace=True)
Passing sort=False explicitly makes it work on both versions:
df = pd.concat(Xs, axis=0, sort=False)
df.sort_index(inplace=True)
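The notes do not record the exact failure on 0.25.3, but one relevant difference is that pd.concat changed its sort default (a deprecated, warning-emitting True in the 0.2x line, False from 1.0 on), so passing sort=False pins the behavior down on both versions. A minimal sketch of what the flag controls:

import pandas as pd

# two frames whose columns are not aligned, which is what triggers
# the version-dependent column-sorting behavior of pd.concat
a = pd.DataFrame({"b": [1], "a": [2]})
b = pd.DataFrame({"a": [3], "c": [4]})

df = pd.concat([a, b], axis=0, sort=False)  # explicit: same result everywhere
df.sort_index(inplace=True)
print(df.columns.tolist())  # column order preserved: ['b', 'a', 'c']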
2 When calling a parent-class method, not all arguments were forwarded, so the parent silently fell back to its defaults, a hidden bug. The fix is to pass every parameter explicitly:
X = super(SimpleImputer, self).fit(X, y, categorical_feature=categorical_feature, numerical_feature=numerical_feature)
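A minimal sketch of the bug pattern (the class names here are illustrative, not the project's): when the subclass delegates without the keyword arguments, nothing fails loudly, the parent simply sees None.

class BaseImputer:
    def fit(self, X, y=None, categorical_feature=None, numerical_feature=None):
        # silently falls back to None if the caller forgets these kwargs
        self.categorical_feature = categorical_feature
        self.numerical_feature = numerical_feature
        return X

class GBTImputer(BaseImputer):
    def fit(self, X, y=None, categorical_feature=None, numerical_feature=None):
        # buggy: super().fit(X, y) would drop both kwargs without any error
        # correct: forward everything explicitly
        return super().fit(X, y, categorical_feature=categorical_feature,
                           numerical_feature=numerical_feature)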
3.1 The predictive_imputer errors surfaced by dataset 232
missing_rate: 1.0 causes some columns to end up with only a single unique value.
{'estimating:__choice__': 'random_forest',
'preprocessing:cat->normed:__choice__': 'encode.one_hot',
'preprocessing:combined->normed:__choice__': 'encode.cat_boost',
'preprocessing:highC_cat->combined:__choice__': 'encode.combine_rare',
'preprocessing:impute:__choice__': 'impute.gbt',
'preprocessing:impute:missing_rate': 1.0,
'preprocessing:normed->final:__choice__': 'select.boruta',
'preprocessing:num->normed:__choice__': 'operate.keep_going',
'process_sequence': 'impute;cat->normed;highC_cat->combined;combined->normed;num->normed;normed->final',
'strategies:balance:__choice__': 'None',
'estimating:random_forest:bootstrap': 'False:bool',
'estimating:random_forest:criterion': 'gini',
'estimating:random_forest:early_stopping_rounds': '8:int',
'estimating:random_forest:early_stopping_tol': '0.0:float',
'estimating:random_forest:iter_inc': '16:int',
'estimating:random_forest:max_depth': 'None:NoneType',
'estimating:random_forest:max_features': 'sqrt',
'estimating:random_forest:max_leaf_nodes': 'None:NoneType',
'estimating:random_forest:min_impurity_decrease': '0.0:float',
'estimating:random_forest:min_samples_leaf': 14,
'estimating:random_forest:min_samples_split': 3,
'estimating:random_forest:min_weight_fraction_leaf': '0.0:float',
'estimating:random_forest:n_estimators': '1024:int',
'estimating:random_forest:n_jobs': '12:int',
'estimating:random_forest:random_state': '42:int',
'preprocessing:cat->normed:encode.one_hot:placeholder': 'placeholder',
'preprocessing:combined->normed:encode.cat_boost:placeholder': 'placeholder',
'preprocessing:highC_cat->combined:encode.combine_rare:copy': 'False:bool',
'preprocessing:highC_cat->combined:encode.combine_rare:minimum_fraction': '0.01:float',
'preprocessing:impute:impute.gbt:copy': 'False:bool',
'preprocessing:impute:impute.gbt:n_jobs': '12:int',
'preprocessing:impute:impute.gbt:random_state': '42:int',
'preprocessing:normed->final:select.boruta:max_depth': 5.0,
'preprocessing:normed->final:select.boruta:n_jobs': '12:int',
'preprocessing:normed->final:select.boruta:random_state': '42:int',
'preprocessing:normed->final:select.boruta:weak': 'True:bool',
'preprocessing:num->normed:operate.keep_going:placeholder': 'placeholder',
'strategies:balance:None:placeholder': 'placeholder'}
These are the per-column missing rates after dropping the all-missing columns:
array([0.8596882 , 0. , 0.09576837, 0. , 0. ,
0.84743875, 0.33741648, 0.35412027, 0. , 0.8830735 ,
0.98997773, 0.27171492, 0.98218263, 0.85634744, 0.91759465,
0.76503341, 0.83407572, 0.97104677, 0.9922049 , 0.91759465,
0.99777283, 0.96659243, 0.99331849, 0.94320713, 0. ,
0. , 0. , 0. , 0.92873051, 0. ,
0.98997773])
A column with a missing rate like 0.99777283 carries essentially no information. The missing_rate parameter is currently set over 5 values: [0.2, 0.4, 0.6, 0.8, 1.0].
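A hypothetical reading of what the missing_rate threshold does (the project's actual implementation may differ): drop every column whose fraction of NaNs exceeds the threshold, so missing_rate=1.0 keeps even the near-empty columns shown above.

import numpy as np
import pandas as pd

def filter_by_missing_rate(X: pd.DataFrame, missing_rate: float) -> pd.DataFrame:
    rates = X.isna().mean(axis=0)           # per-column missing fraction
    return X.loc[:, rates <= missing_rate]  # missing_rate=1.0 keeps everything

X = pd.DataFrame({"a": [1, np.nan, np.nan, np.nan],  # 75% missing
                  "b": [1, 2, 3, 4]})                # 0% missing
print(filter_by_missing_rate(X, 0.8).columns.tolist())  # ['a', 'b']
print(filter_by_missing_rate(X, 0.6).columns.tolist())  # ['b']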
3.2 Another error appeared. Investigation showed that using flatten in place of squeeze fixes it:
converted = encoder.transform([[origin]]).flatten()[0]
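The difference matters for the (1, 1) arrays that transform([[origin]]) returns: squeeze collapses them to a 0-d array that cannot be indexed with [0], while flatten always yields a 1-d array.

import numpy as np

a = np.array([[7]])       # shape (1, 1), like the encoder output above
print(a.flatten()[0])     # 7: flatten gives shape (1,), indexing works
print(a.squeeze().shape)  # (): 0-d array, so a.squeeze()[0] raises IndexError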
4 On 2020-8-20 I submitted the meta-learning tasks; for two of the tasks, all 6 workers died. Investigate with a SQL join:
select t2.task_id, count, task_metadata
from (select task_id, count(*) as count from trial group by task_id order by task_id) as t2, task
where t2.task_id = task.task_id;
3054 and 75192 are problematic.
75192 died after only 7 runs, so it is the bigger problem.
3054 is probably a caching problem caused by every feature being cat.
75192 failed because boruta filtered out all of the features.
The 75192 issue is solved now; I modified boruta's code.
As for 3054, the fix is actually simple as well: modify feature_engineer_base.py to return an empty frame when no transformed features remain:
if return_stack_trans:
    X_trans = pd.DataFrame(np.zeros([X_stack.shape[0], 0]))
    return X_stack, X_trans
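For context, a zero-column frame with the right number of rows keeps downstream concatenation working; a quick check (X_stack here is a stand-in, not the project's variable):

import numpy as np
import pandas as pd

X_stack = pd.DataFrame(np.ones([3, 2]))
X_trans = pd.DataFrame(np.zeros([X_stack.shape[0], 0]))  # shape (3, 0)
print(pd.concat([X_stack, X_trans], axis=1).shape)       # (3, 2): concat still works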
5 75133 and 75114 fail on every other submission, two very strange datasets. It was finally traced to -.resource_manager.base.ResourceManager#upload_df_to_fs, where this code raises the error:
# fixme: restore
df.to_hdf(tmp_path, "dataset", format="table")
return self.file_system.upload(dataset_path, tmp_path)
6 Tried using pynisher to enforce time and memory limits; it flopped.
75133.bz2
initial_points = [
{'estimating:__choice__': 'tabular_nn', 'preprocessing:normed->final:__choice__': 'select.boruta',
'preprocessing:num->normed:__choice__': 'operate.keep_going', 'process_sequence': 'num->normed;normed->final',
'strategies:balance:__choice__': 'weight', 'estimating:tabular_nn:af_hidden': 'relu',
'estimating:tabular_nn:af_output': 'linear', 'estimating:tabular_nn:batch_size': '1024:int',
'estimating:tabular_nn:class_weight': 'None:NoneType', 'estimating:tabular_nn:dropout_hidden': 0.35000000000000003,
'estimating:tabular_nn:dropout_output': 0.1, 'estimating:tabular_nn:early_stopping_rounds': '16:int',
'estimating:tabular_nn:early_stopping_tol': '0:int', 'estimating:tabular_nn:layer1': 320,
'estimating:tabular_nn:layer2': 160, 'estimating:tabular_nn:lr': '0.01:float',
'estimating:tabular_nn:max_epoch': '128:int', 'estimating:tabular_nn:max_layer_width': '2056:int',
'estimating:tabular_nn:min_layer_width': '32:int', 'estimating:tabular_nn:n_jobs': '8:int',
'estimating:tabular_nn:optimizer': 'adam', 'estimating:tabular_nn:random_state': '42:int',
'estimating:tabular_nn:use_bn': 'True:bool', 'estimating:tabular_nn:verbose': '-1:int',
'preprocessing:normed->final:select.boruta:max_depth': 7.0,
'preprocessing:normed->final:select.boruta:n_jobs': '8:int',
'preprocessing:normed->final:select.boruta:random_state': '42:int',
'preprocessing:normed->final:select.boruta:weak': 'False:bool',
'preprocessing:num->normed:operate.keep_going:placeholder': 'placeholder',
'strategies:balance:weight:placeholder': 'placeholder'}
]
if self.debug:
    procedure_result, cloned_model = cloned_model.procedure(
        self.ml_task, X_train, y_train, X_valid, y_valid,
        X_test, y_test, max_iter, budget, (budget == self.max_budget)
    )
else:
    arguments = dict(
        logger=logging.getLogger("pynisher"),
        wall_time_in_s=self.time_limit,
        mem_in_mb=self.memory_limit,
    )
    obj = pynisher.enforce_limits(**arguments)(cloned_model.procedure)
    procedure_result, cloned_model = obj(
        self.ml_task, X_train, y_train, X_valid, y_valid,
        X_test, y_test, max_iter, budget, (budget == self.max_budget)
    )
    if obj.exit_status == 0:  # normal
        pass
    elif obj.exit_status == pynisher.TimeoutException:
        self.logger.error(str(obj.exit_status))
        self.logger.error(str(config))
        failed_info = get_trance_back_msg()
        status = "TIMEOUT"
        break
    elif obj.exit_status == pynisher.MemorylimitException:
        self.logger.error(str(obj.exit_status))
        self.logger.error(str(config))
        failed_info = get_trance_back_msg()
        status = "MEMORY_LIMIT"
        break
    else:
        # exceptions raised inside the subprocess
        self.logger.error(str(obj.exit_status))
        self.logger.error(str(config))
        failed_info = get_trance_back_msg()
        status = "FAILED"
        break
def run(self, background=False, concurrent_type="process"):
    """
    Method to start the worker.

    Parameters
    ----------
    background: bool
        If set to False (default), the worker is executed in the current thread.
        If True, a new background thread or process is created that runs the
        worker. This is useful in a single-worker scenario or when the compute
        function only simulates work.
    concurrent_type: str
        "process" runs the worker in a multiprocessing.Process,
        "thread" runs it in a daemon thread.
    """
    if background:
        if concurrent_type == "process":
            self.process = Process(target=self._run, name='worker %s process' % self.worker_id)
            # self.process.daemon = True  # had to stay commented out, see the note below
            self.process.start()
        elif concurrent_type == "thread":
            # several workers may share one process, so make the id unique per thread
            self.worker_id += f"_{threading.get_ident()}"
            self.thread = threading.Thread(target=self._run, name='worker %s thread' % self.worker_id)
            self.thread.daemon = True
            self.thread.start()
    else:
        self._run()
Only after commenting out the daemon flag on the process would it run at all.
tabularNN's callback became None. I don't know why; I gave up on this.
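A plausible explanation for the daemon problem (my inference, not from the original notes): multiprocessing forbids daemonic processes from spawning children, so a daemonized worker dies as soon as it forks, e.g. for joblib. A minimal reproduction:

from multiprocessing import Process

def child():
    pass

def worker():
    # inside a daemonic process this raises:
    # AssertionError: daemonic processes are not allowed to have children
    Process(target=child).start()

if __name__ == "__main__":
    p = Process(target=worker)
    p.daemon = True  # with daemon=False the child starts fine
    p.start()
    p.join()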
7 xbcp exited abnormally
75114 and 190157 exited abnormally.
2122 exited abnormally after 217 iterations.
75134 exited abnormally after 25 iterations.
271 died after only 8 runs.
75134 shows a cache-lock error and memory-allocation problems.
167202 and 167090 ran all 500 iterations but never terminated.
271 failed when inserting into the trial table:
[ERROR] [2020-08-21 23:50:15,537:autoflow.resource_manager.base.ResourceManager] invalid input syntax for type json
LINE 1: ...d2e33521edc9da6ea4d6bb61c4e-4.0_y-info.bz2', CAST('{"confusi...
^
DETAIL: Token "NaN" is invalid.
CONTEXT: JSON data, line 1: ...op_columns": [], "iter": 2, "gamma_history": [NaN...
[ERROR] [2020-08-21 23:50:15,537:autoflow.resource_manager.base.ResourceManager] Insert 'trial' table failed, 1 try.
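The root cause: Python's json.dumps emits the non-standard token NaN by default, and PostgreSQL's json type rejects it. A sketch of a sanitization step that would avoid this (the helper name is mine, not the codebase's):

import json
import math

info = {"drop_columns": [], "iter": 2, "gamma_history": [float("nan"), 0.5]}
print(json.dumps(info))  # ...[NaN, 0.5]... -> invalid JSON for Postgres

def nan_to_none(obj):
    # recursively replace NaN with None, which serializes to null
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, list):
        return [nan_to_none(v) for v in obj]
    if isinstance(obj, dict):
        return {k: nan_to_none(v) for k, v in obj.items()}
    return obj

print(json.dumps(nan_to_none(info)))  # ...[null, 0.5]... -> accepted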
2122 could not create new threads:
Traceback (most recent call last):
  File "/job/code/autoflow/evaluation/train_evaluator.py", line 220, in evaluate
    X_test, y_test, max_iter, budget, (budget == self.max_budget)
  File "/job/code/autoflow/workflow/ml_workflow.py", line 244, in procedure
    self.fit(X_train, y_train, X_valid, y_valid, X_test, y_test)
  File "/job/code/autoflow/workflow/ml_workflow.py", line 196, in fit
    self._final_estimator.fit(X_train, y_train, X_valid, y_valid, X_test, y_test)
  File "/job/code/autoflow/workflow/components/base.py", line 159, in fit
    y_valid_, X_test_, y_test_, feature_groups)
  File "/job/code/autoflow/workflow/components/base.py", line 180, in _fit
    **kwargs)
  File "/job/code/autoflow/workflow/components/iter_algo.py", line 103, in core_fit
    self.iterative_fit(X, y, X_valid, y_valid, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/_testing.py", line 327, in wrapper
    return fn(*args, **kwargs)
  File "/job/code/autoflow/workflow/components/iter_algo.py", line 54, in iterative_fit
    self.component.fit(X, y, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py", line 1601, in fit
    for class_, warm_start_coef_ in zip(classes_, warm_start_coef))
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 921, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 226, in apply_async
    return self._get_pool().apply_async(
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 367, in _get_pool
    self._pool = ThreadPool(self._n_jobs)
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 802, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/opt/conda/lib/python3.7/multiprocessing/dummy/__init__.py", line 51, in start
    threading.Thread.start(self)
  File "/opt/conda/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
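The usual causes of "can't start new thread" are the per-process thread limit (ulimit -u or a cgroup pids limit) or exhausted memory for thread stacks. One mitigation sketch, my suggestion rather than the project's fix: bound the threads joblib may spawn around the one-vs-rest fitting loop.

from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_classes=3, n_informative=4)

# cap the worker threads joblib creates, so repeated fits cannot
# exhaust the per-process thread limit seen in the traceback above
with parallel_backend("threading", n_jobs=2):
    LogisticRegression(multi_class="ovr", n_jobs=2, max_iter=200).fit(X, y)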