Debugging tips
Verifying pop_auxiliary_feature_groups
Correct. At -/data_manager.py:128:
self.X_train.data[self.X_train.columns[self.X_train.feature_groups.isin(["cat", "highC_cat"])]].dtypes
On dataset 232 all of these dtypes are category.
Analysis tips
On dataset 232, in the experiment with experiment_id=39, gbt_lr achieved fairly good results. Analyze the trials with SQL:
select config, 1-loss as accuracy, cost_time,
estimator,
config_info -> 'origin' as origin,
config -> 'preprocessing:normed->final:__choice__' as feature_engineering,
config -> 'preprocessing:impute:missing_rate' as missing_rate,
-- additional_info -> 'confusion_matrices' as confusion_matrices,
additional_info -> 'best_iterations' as best_iterations,
additional_info -> 'cost_times' -> 0 as cost_times,
additional_info -> 'component_infos' -> 0 as component_infos
from trial where experiment_id=39 order by loss;
1 Concatenating two DataFrames row-wise
The following works under pandas==1.0.1 but not under pandas==0.25.3:
df = pd.concat(Xs, axis=0)
df.sort_index(inplace=True)
Passing sort=False explicitly makes it work on both versions:
df = pd.concat(Xs, axis=0, sort=False)
df.sort_index(inplace=True)
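The notes do not record the exact failure on 0.25.3, but one relevant difference is that pd.concat changed its sort default (a deprecated, warning-emitting True in the 0.2x line, False from 1.0 on), so passing sort=False pins the behavior down on both versions. A minimal sketch of what the flag controls:

import pandas as pd

# two frames whose columns are not aligned, which is what triggers
# the version-dependent column-sorting behavior of pd.concat
a = pd.DataFrame({"b": [1], "a": [2]})
b = pd.DataFrame({"a": [3], "c": [4]})

df = pd.concat([a, b], axis=0, sort=False)  # explicit: same result everywhere
df.sort_index(inplace=True)
print(df.columns.tolist())  # column order preserved: ['b', 'a', 'c']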
2 When calling a parent-class method, not all arguments were forwarded, so the parent silently fell back to its defaults, a hidden bug. The fix is to pass every parameter explicitly:
X = super(SimpleImputer, self).fit(X, y, categorical_feature=categorical_feature, numerical_feature=numerical_feature)
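A minimal sketch of the bug pattern (the class names here are illustrative, not the project's): when the subclass delegates without the keyword arguments, nothing fails loudly, the parent simply sees None.

class BaseImputer:
    def fit(self, X, y=None, categorical_feature=None, numerical_feature=None):
        # silently falls back to None if the caller forgets these kwargs
        self.categorical_feature = categorical_feature
        self.numerical_feature = numerical_feature
        return X

class GBTImputer(BaseImputer):
    def fit(self, X, y=None, categorical_feature=None, numerical_feature=None):
        # buggy: super().fit(X, y) would drop both kwargs without any error
        # correct: forward everything explicitly
        return super().fit(X, y, categorical_feature=categorical_feature,
                           numerical_feature=numerical_feature)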
3.1 The predictive_imputer errors surfaced by dataset 232
missing_rate: 1.0 causes some columns to end up with only a single unique value.
{'estimating:__choice__': 'random_forest',
'preprocessing:cat->normed:__choice__': 'encode.one_hot',
'preprocessing:combined->normed:__choice__': 'encode.cat_boost',
'preprocessing:highC_cat->combined:__choice__': 'encode.combine_rare',
'preprocessing:impute:__choice__': 'impute.gbt',
'preprocessing:impute:missing_rate': 1.0,
'preprocessing:normed->final:__choice__': 'select.boruta',
'preprocessing:num->normed:__choice__': 'operate.keep_going',
'process_sequence': 'impute;cat->normed;highC_cat->combined;combined->normed;num->normed;normed->final',
'strategies:balance:__choice__': 'None',
'estimating:random_forest:bootstrap': 'False:bool',
'estimating:random_forest:criterion': 'gini',
'estimating:random_forest:early_stopping_rounds': '8:int',
'estimating:random_forest:early_stopping_tol': '0.0:float',
'estimating:random_forest:iter_inc': '16:int',
'estimating:random_forest:max_depth': 'None:NoneType',
'estimating:random_forest:max_features': 'sqrt',
'estimating:random_forest:max_leaf_nodes': 'None:NoneType',
'estimating:random_forest:min_impurity_decrease': '0.0:float',
'estimating:random_forest:min_samples_leaf': 14,
'estimating:random_forest:min_samples_split': 3,
'estimating:random_forest:min_weight_fraction_leaf': '0.0:float',
'estimating:random_forest:n_estimators': '1024:int',
'estimating:random_forest:n_jobs': '12:int',
'estimating:random_forest:random_state': '42:int',
'preprocessing:cat->normed:encode.one_hot:placeholder': 'placeholder',
'preprocessing:combined->normed:encode.cat_boost:placeholder': 'placeholder',
'preprocessing:highC_cat->combined:encode.combine_rare:copy': 'False:bool',
'preprocessing:highC_cat->combined:encode.combine_rare:minimum_fraction': '0.01:float',
'preprocessing:impute:impute.gbt:copy': 'False:bool',
'preprocessing:impute:impute.gbt:n_jobs': '12:int',
'preprocessing:impute:impute.gbt:random_state': '42:int',
'preprocessing:normed->final:select.boruta:max_depth': 5.0,
'preprocessing:normed->final:select.boruta:n_jobs': '12:int',
'preprocessing:normed->final:select.boruta:random_state': '42:int',
'preprocessing:normed->final:select.boruta:weak': 'True:bool',
'preprocessing:num->normed:operate.keep_going:placeholder': 'placeholder',
'strategies:balance:None:placeholder': 'placeholder'}
These are the per-column missing rates after dropping the all-missing columns:
array([0.8596882 , 0. , 0.09576837, 0. , 0. ,
0.84743875, 0.33741648, 0.35412027, 0. , 0.8830735 ,
0.98997773, 0.27171492, 0.98218263, 0.85634744, 0.91759465,
0.76503341, 0.83407572, 0.97104677, 0.9922049 , 0.91759465,
0.99777283, 0.96659243, 0.99331849, 0.94320713, 0. ,
0. , 0. , 0. , 0.92873051, 0. ,
0.98997773])
A column with a missing rate like 0.99777283 carries essentially no information. The missing_rate parameter is currently set over 5 values: [0.2, 0.4, 0.6, 0.8, 1.0].
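A hypothetical reading of what the missing_rate threshold does (the project's actual implementation may differ): drop every column whose fraction of NaNs exceeds the threshold, so missing_rate=1.0 keeps even the near-empty columns shown above.

import numpy as np
import pandas as pd

def filter_by_missing_rate(X: pd.DataFrame, missing_rate: float) -> pd.DataFrame:
    rates = X.isna().mean(axis=0)           # per-column missing fraction
    return X.loc[:, rates <= missing_rate]  # missing_rate=1.0 keeps everything

X = pd.DataFrame({"a": [1, np.nan, np.nan, np.nan],  # 75% missing
                  "b": [1, 2, 3, 4]})                # 0% missing
print(filter_by_missing_rate(X, 0.8).columns.tolist())  # ['a', 'b']
print(filter_by_missing_rate(X, 0.6).columns.tolist())  # ['b']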
3.2 Another error appeared. Investigation showed that using flatten in place of squeeze fixes it:
converted = encoder.transform([[origin]]).flatten()[0]
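The difference matters for the (1, 1) arrays that transform([[origin]]) returns: squeeze collapses them to a 0-d array that cannot be indexed with [0], while flatten always yields a 1-d array.

import numpy as np

a = np.array([[7]])       # shape (1, 1), like the encoder output above
print(a.flatten()[0])     # 7: flatten gives shape (1,), indexing works
print(a.squeeze().shape)  # (): 0-d array, so a.squeeze()[0] raises IndexError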
4 On 2020-8-20 I submitted the meta-learning tasks; for two of the tasks, all 6 workers died. Investigate with a SQL join:
select t2.task_id, count, task_metadata
from (select task_id, count(*) as count from trial group by task_id order by task_id) as t2, task
where t2.task_id = task.task_id;
3054 and 75192 are problematic.
75192 died after only 7 runs, so it is the bigger problem.
3054 is probably a caching problem caused by every feature being cat.
75192 failed because boruta filtered out all of the features.
The 75192 issue is solved now; I modified boruta's code.
As for 3054, the fix is actually simple as well: modify feature_engineer_base.py to return an empty frame when no transformed features remain:
if return_stack_trans:
    X_trans = pd.DataFrame(np.zeros([X_stack.shape[0], 0]))
    return X_stack, X_trans
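For context, a zero-column frame with the right number of rows keeps downstream concatenation working; a quick check (X_stack here is a stand-in, not the project's variable):

import numpy as np
import pandas as pd

X_stack = pd.DataFrame(np.ones([3, 2]))
X_trans = pd.DataFrame(np.zeros([X_stack.shape[0], 0]))  # shape (3, 0)
print(pd.concat([X_stack, X_trans], axis=1).shape)       # (3, 2): concat still works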
5 75133 and 75114 fail on every other submission, two very strange datasets. It was finally traced to -.resource_manager.base.ResourceManager#upload_df_to_fs, where this code raises the error:
# fixme: restore
df.to_hdf(tmp_path, "dataset", format="table")
return self.file_system.upload(dataset_path, tmp_path)
6 Tried using pynisher to enforce time and memory limits; it flopped.
75133.bz2
initial_points = [
{'estimating:__choice__': 'tabular_nn', 'preprocessing:normed->final:__choice__': 'select.boruta',
'preprocessing:num->normed:__choice__': 'operate.keep_going', 'process_sequence': 'num->normed;normed->final',
'strategies:balance:__choice__': 'weight', 'estimating:tabular_nn:af_hidden': 'relu',
'estimating:tabular_nn:af_output': 'linear', 'estimating:tabular_nn:batch_size': '1024:int',
'estimating:tabular_nn:class_weight': 'None:NoneType', 'estimating:tabular_nn:dropout_hidden': 0.35000000000000003,
'estimating:tabular_nn:dropout_output': 0.1, 'estimating:tabular_nn:early_stopping_rounds': '16:int',
'estimating:tabular_nn:early_stopping_tol': '0:int', 'estimating:tabular_nn:layer1': 320,
'estimating:tabular_nn:layer2': 160, 'estimating:tabular_nn:lr': '0.01:float',
'estimating:tabular_nn:max_epoch': '128:int', 'estimating:tabular_nn:max_layer_width': '2056:int',
'estimating:tabular_nn:min_layer_width': '32:int', 'estimating:tabular_nn:n_jobs': '8:int',
'estimating:tabular_nn:optimizer': 'adam', 'estimating:tabular_nn:random_state': '42:int',
'estimating:tabular_nn:use_bn': 'True:bool', 'estimating:tabular_nn:verbose': '-1:int',
'preprocessing:normed->final:select.boruta:max_depth': 7.0,
'preprocessing:normed->final:select.boruta:n_jobs': '8:int',
'preprocessing:normed->final:select.boruta:random_state': '42:int',
'preprocessing:normed->final:select.boruta:weak': 'False:bool',
'preprocessing:num->normed:operate.keep_going:placeholder': 'placeholder',
'strategies:balance:weight:placeholder': 'placeholder'}
]
if self.debug:
    procedure_result, cloned_model = cloned_model.procedure(
        self.ml_task, X_train, y_train, X_valid, y_valid,
        X_test, y_test, max_iter, budget, (budget == self.max_budget)
    )
else:
    arguments = dict(
        logger=logging.getLogger("pynisher"),
        wall_time_in_s=self.time_limit,
        mem_in_mb=self.memory_limit,
    )
    obj = pynisher.enforce_limits(**arguments)(cloned_model.procedure)
    procedure_result, cloned_model = obj(
        self.ml_task, X_train, y_train, X_valid, y_valid,
        X_test, y_test, max_iter, budget, (budget == self.max_budget)
    )
    if obj.exit_status == 0:  # normal
        pass
    elif obj.exit_status == pynisher.TimeoutException:
        self.logger.error(str(obj.exit_status))
        self.logger.error(str(config))
        failed_info = get_trance_back_msg()
        status = "TIMEOUT"
        break
    elif obj.exit_status == pynisher.MemorylimitException:
        self.logger.error(str(obj.exit_status))
        self.logger.error(str(config))
        failed_info = get_trance_back_msg()
        status = "MEMORY_LIMIT"
        break
    else:
        # exceptions raised inside the subprocess
        self.logger.error(str(obj.exit_status))
        self.logger.error(str(config))
        failed_info = get_trance_back_msg()
        status = "FAILED"
        break
def run(self, background=False, concurrent_type="process"):
    """
    Method to start the worker.

    Parameters
    ----------
    background: bool
        If set to False (default), the worker is executed in the current thread.
        If True, a new background thread or process is created that runs the
        worker. This is useful in a single-worker scenario or when the compute
        function only simulates work.
    concurrent_type: str
        "process" runs the worker in a multiprocessing.Process,
        "thread" runs it in a daemon thread.
    """
    if background:
        if concurrent_type == "process":
            self.process = Process(target=self._run, name='worker %s process' % self.worker_id)
            # self.process.daemon = True  # had to stay commented out, see the note below
            self.process.start()
        elif concurrent_type == "thread":
            # several workers may share one process, so make the id unique per thread
            self.worker_id += f"_{threading.get_ident()}"
            self.thread = threading.Thread(target=self._run, name='worker %s thread' % self.worker_id)
            self.thread.daemon = True
            self.thread.start()
    else:
        self._run()
Only after commenting out the daemon flag on the process would it run at all.
tabularNN's callback became None. I don't know why; I gave up on this.
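A plausible explanation for the daemon problem (my inference, not from the original notes): multiprocessing forbids daemonic processes from spawning children, so a daemonized worker dies as soon as it forks, e.g. for joblib. A minimal reproduction:

from multiprocessing import Process

def child():
    pass

def worker():
    # inside a daemonic process this raises:
    # AssertionError: daemonic processes are not allowed to have children
    Process(target=child).start()

if __name__ == "__main__":
    p = Process(target=worker)
    p.daemon = True  # with daemon=False the child starts fine
    p.start()
    p.join()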
7 xbcp exited abnormally
75114 and 190157 exited abnormally.
2122 exited abnormally after 217 iterations.
75134 exited abnormally after 25 iterations.
271 died after only 8 runs.
75134 shows a cache-lock error and memory-allocation problems.
167202 and 167090 ran all 500 iterations but never terminated.
271 failed when inserting into the trial table:
[ERROR] [2020-08-21 23:50:15,537:autoflow.resource_manager.base.ResourceManager] invalid input syntax for type json
LINE 1: ...d2e33521edc9da6ea4d6bb61c4e-4.0_y-info.bz2', CAST('{"confusi...
^
DETAIL: Token "NaN" is invalid.
CONTEXT: JSON data, line 1: ...op_columns": [], "iter": 2, "gamma_history": [NaN...
[ERROR] [2020-08-21 23:50:15,537:autoflow.resource_manager.base.ResourceManager] Insert 'trial' table failed, 1 try.
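The root cause: Python's json.dumps emits the non-standard token NaN by default, and PostgreSQL's json type rejects it. A sketch of a sanitization step that would avoid this (the helper name is mine, not the codebase's):

import json
import math

info = {"drop_columns": [], "iter": 2, "gamma_history": [float("nan"), 0.5]}
print(json.dumps(info))  # ...[NaN, 0.5]... -> invalid JSON for Postgres

def nan_to_none(obj):
    # recursively replace NaN with None, which serializes to null
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, list):
        return [nan_to_none(v) for v in obj]
    if isinstance(obj, dict):
        return {k: nan_to_none(v) for k, v in obj.items()}
    return obj

print(json.dumps(nan_to_none(info)))  # ...[null, 0.5]... -> accepted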
2122 could not create new threads:
Traceback (most recent call last):
  File "/job/code/autoflow/evaluation/train_evaluator.py", line 220, in evaluate
    X_test, y_test, max_iter, budget, (budget == self.max_budget)
  File "/job/code/autoflow/workflow/ml_workflow.py", line 244, in procedure
    self.fit(X_train, y_train, X_valid, y_valid, X_test, y_test)
  File "/job/code/autoflow/workflow/ml_workflow.py", line 196, in fit
    self._final_estimator.fit(X_train, y_train, X_valid, y_valid, X_test, y_test)
  File "/job/code/autoflow/workflow/components/base.py", line 159, in fit
    y_valid_, X_test_, y_test_, feature_groups)
  File "/job/code/autoflow/workflow/components/base.py", line 180, in _fit
    **kwargs)
  File "/job/code/autoflow/workflow/components/iter_algo.py", line 103, in core_fit
    self.iterative_fit(X, y, X_valid, y_valid, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/_testing.py", line 327, in wrapper
    return fn(*args, **kwargs)
  File "/job/code/autoflow/workflow/components/iter_algo.py", line 54, in iterative_fit
    self.component.fit(X, y, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py", line 1601, in fit
    for class_, warm_start_coef_ in zip(classes_, warm_start_coef))
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 921, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 226, in apply_async
    return self._get_pool().apply_async(
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 367, in _get_pool
    self._pool = ThreadPool(self._n_jobs)
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 802, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/opt/conda/lib/python3.7/multiprocessing/dummy/__init__.py", line 51, in start
    threading.Thread.start(self)
  File "/opt/conda/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
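The usual causes of "can't start new thread" are the per-process thread limit (ulimit -u or a cgroup pids limit) or exhausted memory for thread stacks. One mitigation sketch, my suggestion rather than the project's fix: bound the threads joblib may spawn around the one-vs-rest fitting loop.

from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_classes=3, n_informative=4)

# cap the worker threads joblib creates, so repeated fits cannot
# exhaust the per-process thread limit seen in the traceback above
with parallel_backend("threading", n_jobs=2):
    LogisticRegression(multi_class="ovr", n_jobs=2, max_iter=200).fit(X, y)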