Error when training logistic regression (LR): Input contains NaN, infinity or a value too large for dtype('float64')

This post looks at an error encountered when training a logistic regression model: the input contains NaN, infinity, or a value too large for dtype('float64'). It covers two common causes, unstandardized data and an overly large learning rate, and gives the corresponding fixes.


The data is not standardized
This happens because, when you apply the sigmoid/logit function to your hypothesis, the output probabilities end up almost all exactly 0 or 1, so in the cost function log(1 - 1) or log(0) evaluates to -Inf.
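The mechanism is easy to reproduce in NumPy (a throwaway illustration, not code from the original post):

import numpy as np

h = np.array([1e-20, 0.0, 1.0])     # near-saturated and fully saturated sigmoid outputs
print(np.log(h))                    # roughly [-46.05  -inf  0.]: log(0) is already -inf
print(0.0 * np.log(0.0))            # nan: a 0 * -inf term is how NaN enters the averaged cost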
One way to fix this is to standardize the data before training with gradient descent.
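A minimal sketch of that fix, assuming the features live in a NumPy array X_raw and you are writing your own gradient-descent loop (the names are illustrative, not from the original post):

import numpy as np

def standardize(X):
    # Column-wise z-score: subtract each feature's mean and divide by its standard deviation.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant columns
    return (X - mu) / sigma, mu, sigma

X_std, mu, sigma = standardize(X_raw.astype(np.float64))  # X_raw is a placeholder name
# Reuse the same mu and sigma to transform any validation or test data.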
The learning rate is too high
Once the data has been standardized, the cost looks finite at first but suddenly turns into NaN after a few iterations. If your learning rate (alpha) is too large, every iteration overshoots the minimum, so the cost oscillates from one iteration to the next or even diverges. The cost keeps growing each iteration until it becomes too large to represent with floating-point precision.
The fix is to lower your learning rate α.
PS: A common way to find a good learning rate is to run gradient descent over a range of logarithmically spaced α values, look at the final cost each run reaches, and pick the learning rate that yields the smallest cost.
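A minimal sketch of that search, assuming a helper run_gradient_descent(X, y, alpha, n_iters) that trains and returns the final cost (the helper, X_std and y are placeholders, not code from the original post):

import numpy as np

alphas = np.logspace(-5, 1, num=7)   # 1e-5, 1e-4, ..., 10: logarithmically spaced candidates
final_costs = {}
for alpha in alphas:
    cost = run_gradient_descent(X_std, y, alpha=alpha, n_iters=500)  # placeholder training loop
    # Treat NaN/inf as "diverged" so such runs can never win the comparison.
    final_costs[alpha] = cost if np.isfinite(cost) else np.inf

best_alpha = min(final_costs, key=final_costs.get)
print(f"best alpha: {best_alpha}, final cost: {final_costs[best_alpha]}")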

For more detail, see: https://stackoverflow.com/questions/35419882/cost-function-in-logistic-regression-gives-nan-as-a-result

C:\Users\zxm\.conda\envs\pytorch\lib\site-packages\sklearn\model_selection\_validation.py:842: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan.
Details: Traceback (most recent call last): ... ValueError: Input X contains NaN. LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
C:\Users\zxm\.conda\envs\pytorch\lib\site-packages\sklearn\model_selection\_validation.py:425: FitFailedWarning: 4 fits failed out of a total of 5. The score on these train-test partitions for these parameters will be set to nan. If these failures are not expected, you can try to debug them by setting error_score='raise'.
LR: nan(nan)

The same UserWarning/FitFailedWarning pair then repeats for every other model in the comparison; only the estimator named in the ValueError changes:

ValueError: Input X contains NaN. Lasso does not accept missing values encoded as NaN natively. ...
ValueError: Input X contains NaN. ElasticNet does not accept missing values encoded as NaN natively. ...
ValueError: Input X contains NaN. KNeighborsRegressor does not accept missing values encoded as NaN natively. ...
ValueError: Input y contains NaN. (raised while fitting the CART model)
ValueError: Input contains NaN. (raised by mean_squared_error during scoring)
ValueError: Input X contains NaN. SVR does not accept missing values encoded as NaN natively. ...

LASSO: nan(nan)  EN: nan(nan)  KNN: nan(nan)  CART: nan(nan)  SVM: nan(nan)

Why does it report these errors?
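All of these failures have the same root cause that the warnings spell out: X (and, for the tree model, y) still contains NaN when it reaches the estimator. One route the error message itself suggests is to put an imputer in front of the model inside a pipeline; a minimal sketch, where X, y and the LinearRegression choice stand in for whatever the original script uses:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Impute missing feature values with the column median inside the CV pipeline.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())

# The target must also be NaN-free, so drop rows whose y is missing (X and y are placeholders).
mask = ~np.isnan(y)
scores = cross_val_score(model, X[mask], y[mask], cv=5, scoring="neg_mean_squared_error")
print(scores.mean(), scores.std())

Alternatively, estimators such as sklearn.ensemble.HistGradientBoostingRegressor accept NaN in X natively, as the warning also notes.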
10-12
import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder, RobustScaler from sklearn.metrics import classification_report, accuracy_score import xgboost as xgb import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import TensorDataset, DataLoader from scipy.stats.mstats import winsorize # ---------------------- 1. 加载数据 ---------------------- df = pd.read_csv("../data/900_gather_feature_results.csv") # 替换为您的文件路径 print(f"数据形状: {df.shape}") print(f"标签类别分布:\n{df['second_id'].value_counts()}") # ---------------------- 2. 特征与标签分离 ---------------------- # 排除非数值特征(app_name、timestamp)和标签 X = df.drop(["second_id", "app_name", "timestamp"], axis=1) y = df["second_id"] # ---------------------- 3. 处理分类特征(protocol) ---------------------- categorical_cols = X.select_dtypes(include=["object"]).columns.tolist() print(f"分类特征: {categorical_cols}") for col in categorical_cols: le = LabelEncoder() X[col] = le.fit_transform(X[col]) # 字符串→整数编码 # ---------------------- 4. 关键特征修复:处理导致inf的根源问题 ---------------------- # 4.1 处理duration(所有比率特征的分母) if "duration" in X.columns: # 替换0或inf为中位数(避免除法计算错误) duration_median = X["duration"].replace([0, np.inf], np.nan).median() X["duration"] = X["duration"].replace([0, np.inf], duration_median) X["duration"] = X["duration"].fillna(duration_median) # 处理剩余缺失值 # 4.2 重新计算所有比率特征(确保分母有效) ratio_features = [ "bytes_rate", "fwd_bytes_rate", "bwd_bytes_rate", "packets_rate", "fwd_packets_rate", "bwd_packets_rate" ] for feature in ratio_features: if feature in X.columns: numerator = feature.split("_")[0] + "_total_payload_bytes" if numerator in X.columns and "duration" in X.columns: # 计算比率再次检查分母(避免重复出现0或inf) X[feature] = X[numerator] / X["duration"] X[feature] = X[feature].replace(np.inf, X[feature].median()) # 处理剩余inf # ---------------------- 5. 通用异常值与缺失值处理 ---------------------- print("\n===== 数据异常值检查(修复前) =====") print("NaN总数:", X.isnull().sum().sum()) print("inf总数:", (X == np.inf).sum().sum()) # 5.1 填充所有缺失值(用中位数,鲁棒性更强) for col in X.columns: X[col] = X[col].fillna(X[col].median()) # 5.2 处理剩余极端值(使用2.5%截断,避免过度截断) for col in X.columns: X[col] = winsorize(X[col], limits=[0.025, 0.025]) # 保留95%数据 # 5.3 验证数据干净度 print("\n===== 数据异常值检查(修复后) =====") print("NaN总数:", X.isnull().sum().sum()) print("inf总数:", (X == np.inf).sum().sum()) print("特征数量:", X.shape) # ---------------------- 6. 划分训练集与测试集 ---------------------- X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y # 分层抽样保证类别分布 ) # ---------------------- 7. 特征标准化(鲁棒标准化) ---------------------- scaler = RobustScaler() # 对异常值更鲁棒的标准化 X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # ---------------------- 8. XGBoost训练与评估 ---------------------- dtrain = xgb.DMatrix(X_train_scaled, label=y_train) dtest = xgb.DMatrix(X_test_scaled, label=y_test) num_classes = len(y.unique()) params = { "objective": "multi:softmax", "num_class": num_classes, "max_depth": 5, # 减少过拟合风险 "eta": 0.1, "eval_metric": "mlogloss", "verbosity": 0, "random_state": 42 } num_round = 100 model_xgb = xgb.train(params, dtrain, num_round) y_pred_xgb = model_xgb.predict(dtest) print(f"\nXGBoost 测试集准确率: {accuracy_score(y_test, y_pred_xgb):.4f}") print("分类报告:\n", classification_report(y_test, y_pred_xgb)) # ---------------------- 9. 
神经网络训练(可选) ---------------------- X_train_tensor = torch.FloatTensor(X_train_scaled) y_train_tensor = torch.LongTensor(y_train.values) X_test_tensor = torch.FloatTensor(X_test_scaled) y_test_tensor = torch.LongTensor(y_test.values) batch_size = 64 train_dataset = TensorDataset(X_train_tensor, y_train_tensor) test_dataset = TensorDataset(X_test_tensor, y_test_tensor) train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True) class MLP(nn.Module): def __init__(self, input_size, hidden_size, num_classes): super(MLP, self).__init__() self.fc1 = nn.Linear(input_size, hidden_size) self.relu = nn.ReLU() self.fc2 = nn.Linear(hidden_size, hidden_size // 2) self.dropout = nn.Dropout(0.3) # 防止过拟合 self.fc3 = nn.Linear(hidden_size // 2, num_classes) def forward(self, x): x = self.fc1(x) x = self.relu(x) x = self.fc2(x) x = self.relu(x) x = self.dropout(x) x = self.fc3(x) return x input_size = X_train.shape[1] hidden_size = 256 num_classes = len(y.unique()) model_nn = MLP(input_size, hidden_size, num_classes) criterion = nn.CrossEntropyLoss() optimizer = optim.Adam(model_nn.parameters(), lr=0.001) num_epochs = 50 for epoch in range(num_epochs): model_nn.train() running_loss = 0.0 for batch_X, batch_y in train_loader: optimizer.zero_grad() outputs = model_nn(batch_X) loss = criterion(outputs, batch_y) loss.backward() optimizer.step() running_loss += loss.item() * batch_X.size(0) epoch_loss = running_loss / len(train_loader.dataset) if (epoch + 1) % 10 == 0: print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}") model_nn.eval() y_pred_nn = [] with torch.no_grad(): for batch_X, _ in DataLoader(test_dataset, batch_size=batch_size): outputs = model_nn(batch_X) _, predicted = torch.max(outputs.data, 1) y_pred_nn.extend(predicted.numpy()) print(f"\n神经网络 测试集准确率: {accuracy_score(y_test, y_pred_nn):.4f}") 上面的代码在运行报错:===== 数据异常值检查(修复后) ===== NaN总数: 0 inf总数: 6698 特征数量: (9931, 76) Traceback (most recent call last): File "/mnt/hgfs/Sharefile/DFI/1.特征分析/code/train.py", line 78, in <module> X_train_scaled = scaler.fit_transform(X_train) File "/home/wxy/anaconda3/envs/website_classifier/lib/python3.9/site-packages/sklearn/base.py", line 844, in fit_transform return self.fit(X, **fit_params).transform(X) File "/home/wxy/anaconda3/envs/website_classifier/lib/python3.9/site-packages/sklearn/preprocessing/_data.py", line 1488, in fit X = self._validate_data( File "/home/wxy/anaconda3/envs/website_classifier/lib/python3.9/site-packages/sklearn/base.py", line 557, in _validate_data X = check_array(X, **check_params) File "/home/wxy/anaconda3/envs/website_classifier/lib/python3.9/site-packages/sklearn/utils/validation.py", line 792, in check_array _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan") File "/home/wxy/anaconda3/envs/website_classifier/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite raise ValueError( ValueError: Input contains infinity or a value too large for dtype('float64').
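Whatever reintroduces them, the printout above shows that infinities survive the cleaning steps (note that (X == np.inf).sum() also misses -inf), and RobustScaler then rejects the data. A minimal, hedged fix, assuming X is the pandas DataFrame built in the snippet above, is to map every ±inf to NaN and re-impute immediately before the train/test split:

import numpy as np

# Map +inf and -inf to NaN so they can be imputed like ordinary missing values.
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(X.median(numeric_only=True))

# Sanity check before RobustScaler: both counts should now be 0.
print("NaN:", X.isnull().sum().sum(), "inf:", np.isinf(X.to_numpy(dtype=float)).sum())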
11-07
import gc import time from pathlib import Path import matplotlib.pyplot as plt import numpy as np import pandas as pd import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import Dataset, DataLoader, TensorDataset from sklearn.ensemble import RandomForestRegressor from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error from sklearn.preprocessing import StandardScaler from skopt import gp_minimize from skopt.space import Real, Categorical, Integer import warnings import seaborn as sns from sklearn.preprocessing import RobustScaler from sklearn.model_selection import TimeSeriesSplit from scipy.stats import boxcox # 设置中文显示 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False plt.switch_backend('TkAgg') # 设置路径 Path(r"D:\result2").mkdir(parents=True, exist_ok=True) Path("model_results/").mkdir(parents=True, exist_ok=True) # 检查GPU可用性 device = torch.device("cuda" if torch.cuda.is_available() else "cpu") print(f"使用 {device} 进行训练") # 设置随机种子保证可重复性 torch.manual_seed(42) np.random.seed(42) # 1. 数据预处理模块 def load_and_preprocess_data(): """加载并预处理数据(内存安全版)""" chunksize = 10000 # 每次处理1万行 dfs = [] datetimes_list = [] location_codes_list = [] # 指定列数据类型以减少内存使用 dtype_dict = { 'damage_count': 'float32', 'damage_depth': 'float32', } for chunk in pd.read_csv( r"D:\my_data\clean\locationTransfer.csv", chunksize=chunksize, dtype=dtype_dict ): # 保存非数值列 datetimes_list.append(chunk['datetime'].copy()) location_codes_list.append(chunk['locationCode'].copy()) # 只处理数值列 numeric_cols = chunk.select_dtypes(include=[np.number]).columns chunk = chunk[numeric_cols] chunk = chunk.dropna(subset=['damage_count']) chunk = chunk[pd.to_numeric(chunk['damage_count'], errors='coerce').notna()] chunk = chunk.fillna(method='ffill').fillna(method='bfill') dfs.append(chunk) if len(dfs) > 10: # 测试限制块数 break # 合并数据块 df = pd.concat(dfs, ignore_index=True) def create_lag_features(df, lags=3): for lag in range(1, lags + 1): df[f'damage_count_lag_{lag}'] = df['damage_count'].shift(lag) return df.dropna() df = create_lag_features(df) # 在合并df之后,填充na之前 df = df.dropna(subset=['damage_count']) datetimes = pd.concat(datetimes_list, ignore_index=True) location_codes = pd.concat(location_codes_list, ignore_index=True) # 确保长度一致 min_length = min(len(df), len(datetimes), len(location_codes)) df = df.iloc[:min_length] datetimes = datetimes.iloc[:min_length] location_codes = location_codes.iloc[:min_length] # 检查是否存在 NaNnan_check = df.isnull().sum().sum() inf_check = df.isin([np.Inf, -np.Inf]).sum().sum() if nan_check > 0 or inf_check > 0: # 处理 NaN 值或者无穷大值 # 填充缺失值为均值 df = df.fillna(df.mean()) # 删除包含 NaN 值或者无穷大值的行 df = df.dropna() # 结构化特征 X_structured = df.drop(columns=['damage_count', 'damage_depth', 'damage_db', 'asset_code_mapping', 'pile_longitude', 'pile_latitude', 'locationCode', 'datetime', 'locationCode_encoded','damage_count_lag_1','damage_count_lag_2','damage_count_lag_3'], errors='ignore') # 填充缺失值 numeric_cols = X_structured.select_dtypes(include=[np.number]).columns for col in numeric_cols: X_structured[col] = X_structured[col].fillna(X_structured[col].mean()) # 标准化数据 scaler = RobustScaler() # 替换StandardScaler,更抗异常值 X_structured = pd.DataFrame(scaler.fit_transform(X_structured), columns=X_structured.columns) # 确保X_structured是DataFrame if not isinstance(X_structured, pd.DataFrame): X_structured = pd.DataFrame(X_structured, columns=[f"feature_{i}" for i in range(X_structured.shape[1])]) # X_structured = X_structured.values # 将DataFrame转换为NumPy数组 # 修改后的目标变量处理部分 y 
= df[['damage_count']].values.astype(np.float32) # 添加数据缩放 y_scaler = RobustScaler() y = y_scaler.fit_transform(y) # 使用标准化代替log变换 y = np.clip(y, -1e6, 1e6) # 设置合理的上下界 # 添加数据检查 assert not np.any(np.isinf(y)), "y中包含无限值" assert not np.any(np.isnan(y)), "y中包含NaN值" # 数据检查 print("原始数据统计:") print(f"最小值: {y.min()}, 最大值: {y.max()}, NaN数量: {np.isnan(y).sum()}") print("处理后y值范围:", np.min(y), np.max(y)) print("无限值数量:", np.isinf(y).sum()) # 清理内存 del df, chunk, dfs gc.collect() torch.cuda.empty_cache() return datetimes, X_structured, y, location_codes, scaler, y_scaler # 2. 间序列数据集类 class TimeSeriesDataset(Dataset): """自定义间序列数据集类""" def __init__(self, X, y, timesteps): # 确保输入是NumPy数组 if isinstance(X, pd.DataFrame): X = X.values if isinstance(y, pd.DataFrame) or isinstance(y, pd.Series): y = y.values assert X.ndim == 2, f"X应为2维,实际为{X.ndim}维" assert y.ndim == 2, f"y应为2维,实际为{y.ndim}维" # 添加维度调试信息 print(f"数据形状 - X: {X.shape}, y: {y.shape}") self.X = torch.FloatTensor(X).unsqueeze(-1) # [samples, timesteps, 1] self.y = torch.FloatTensor(y) self.timesteps = timesteps # 验证形状 if len(self.X) != len(self.y): raise ValueError("X和y的长度不匹配") def __len__(self): return len(self.X) - self.timesteps def __getitem__(self, idx): # [seq_len, num_features] x_window = self.X[idx:idx + self.timesteps] y_target = self.y[idx + self.timesteps - 1] return x_window.permute(1, 0), y_target # 调整维度顺序 def select_features_by_importance(X, y, n_features, feature_names=None): """使用随机森林选择特征(支持NumPy数组和DataFrame)""" # 确保X是二维数组 if isinstance(X, pd.DataFrame): feature_names = X.columns.tolist() X = X.values elif feature_names is None: feature_names = [f"feature_{i}" for i in range(X.shape[1])] # 处理y的维度 y = np.ravel(np.asarray(y)) y = np.nan_to_num(y, nan=np.nanmean(y)) # 检查特征数 if X.shape[1] < n_features: n_features = X.shape[1] print(f"警告: 特征数少于请求数,使用所有 {n_features} 个特征") # 训练随机森林 rf = RandomForestRegressor(n_estimators=250, random_state=42, n_jobs=-1) rf.fit(X, y) # 获取特征重要性 feature_importances = rf.feature_importances_ indices = np.argsort(feature_importances)[::-1][:n_features] # 返回选中的特征数据和重要性 return X[:, indices], feature_importances[indices], [feature_names[i] for i in indices] # 3. 
LSTM模型定义 class LSTMModel(nn.Module): """LSTM回归模型""" def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.4): super().__init__() # 确保hidden_size*2能被num_heads整除 if (hidden_size * 2) % 4 != 0: hidden_size = ((hidden_size * 2) // 4) * 4 // 2 # 调整到最近的合规值 print(f"调整hidden_size为{hidden_size}以满足整除条件") self.lstm = nn.LSTM( input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True, dropout=dropout if num_layers > 1 else 0, bidirectional=True # 添加双向结构 ) # 添加维度检查的初始化 def weights_init(m): if isinstance(m, nn.Linear): if m.weight.dim() < 2: m.weight.data = m.weight.data.unsqueeze(0) # 确保至少2维 nn.init.xavier_uniform_(m.weight) if m.bias is not None: nn.init.constant_(m.bias, 0.0) self.bn = nn.BatchNorm1d(hidden_size * 2) # # 注意力机制层 self.attention = nn.MultiheadAttention(embed_dim=hidden_size * 2, num_heads=4) # 改用多头注意力 # 更深D输出层 self.fc = nn.Sequential( nn.Linear(hidden_size*2, hidden_size), nn.ReLU(), nn.Dropout(dropout), nn.Linear(hidden_size, 1) ) # 应用初始化 self.apply(weights_init) def forward(self, x): lstm_out, _ = self.lstm(x) # [batch, seq_len, hidden*2] lstm_out = lstm_out.permute(1, 0, 2) # [seq_len, batch, features] attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out) attn_out = attn_out.permute(1, 0, 2) # 恢复为[batch, seq_len, features] return self.fc(attn_out[:, -1, :]).squeeze() def plot_feature_importance(feature_names, importance_values, save_path): """绘制特征重要性图""" # 验证输入 if len(feature_names) == 0 or len(importance_values) == 0: print("警告: 无特征重要性数据可绘制") return if len(feature_names) != len(importance_values): print(f"警告: 特征名数量({len(feature_names)})与重要性值数量({len(importance_values)})不匹配") # 取较小值 min_len = min(len(feature_names), len(importance_values)) feature_names = feature_names[:min_len] importance_values = importance_values[:min_len] # 按重要性排序 indices = np.argsort(importance_values)[::-1] sorted_features = [feature_names[i] for i in indices] sorted_importance = importance_values[indices] plt.figure(figsize=(12, 8)) plt.bar(range(len(sorted_features)), sorted_importance, align="center") plt.xticks(range(len(sorted_features)), sorted_features, rotation=90) plt.xlabel("特征") plt.ylabel("重要性得分") plt.title("特征重要性排序") plt.tight_layout() # 确保保存路径存在 save_path.parent.mkdir(parents=True, exist_ok=True) plt.savefig(save_path, dpi=300) plt.close() def evaluate(model, val_loader, criterion): model.eval() val_loss = 0.0 with torch.no_grad(): for inputs, targets in val_loader: inputs, targets = inputs.to(device), targets.to(device) outputs = model(inputs) loss = criterion(outputs, targets.squeeze()) val_loss += loss.item() return val_loss / len(val_loader) # 4. 
模型训练函数 def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler=None, epochs=100, patience=30): """训练模型并返回最佳模型和训练历史""" best_loss = float('inf') history = {'train_loss': [], 'val_loss': []} # 添加梯度累积 accumulation_steps = 5 # 每4个batch更新一次参数 for epoch in range(epochs): # 训练阶段 model.train() train_loss = 0.0 optimizer.zero_grad() for inputs, targets in train_loader: inputs, targets = inputs.to(device), targets.to(device) scaler = torch.cuda.amp.GradScaler() # 在训练循环中添加 with torch.cuda.amp.autocast(): outputs = model(inputs) loss = criterion(outputs, targets.squeeze()) scaler.scale(loss).backward() for batch_idx, (inputs, targets) in enumerate(train_loader): inputs, targets = inputs.to(device), targets.to(device) # 前向传播 outputs = model(inputs) loss = criterion(outputs, targets.squeeze()) # 梯度累积 loss = loss / accumulation_steps scaler.scale(loss).backward() if (batch_idx + 1) % accumulation_steps == 0: scaler.unscale_(optimizer) torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) scaler.step(optimizer) scaler.update() optimizer.zero_grad() train_loss += loss.item() * accumulation_steps # 验证 val_loss = evaluate(model, val_loader, criterion) if scheduler: scheduler.step(val_loss) # 根据验证损失调整学习率 # 记录历史 avg_train_loss = train_loss / len(train_loader) history['train_loss'].append(avg_train_loss) history['val_loss'].append(val_loss) # 早停逻辑 if val_loss < best_loss * 0.99:# 相对改进阈值 best_loss = val_loss best_epoch = epoch torch.save(model.state_dict(), 'best_model.pth') print(f"Epoch {epoch + 1}/{epochs} - 训练损失: {avg_train_loss :.4f} - 验证损失: {val_loss:.4f}") # 早停判断 if epoch - best_epoch >= patience: print(f"早停触发,最佳epoch: {best_epoch+1}") break # 加载最佳模型 model.load_state_dict(torch.load('best_model.pth')) return model, history # 5. 贝叶斯优化函数 def optimize_hyperparameters(X_train, y_train, input_size): """使用贝叶斯优化寻找最佳超参数""" # 自定义评分函数 def score_fn(params): """内部评分函数""" try: params = adjust_hidden_size(params) # 调整参数 hidden_size, num_layers, dropout, lr, batch_size, timesteps = params # 确保参数有效 batch_size = max(32, min(256, int(batch_size))) timesteps = max(3, min(10, int(timesteps))) dropout = min(0.5, max(0.1, float(dropout))) lr = min(0.01, max(1e-5, float(lr))) # 检查数据是否足够 if len(X_train) < 2 * timesteps+1: # 至少需要2倍间步长的数据 return float('inf') # 创建模型 model = LSTMModel( input_size=input_size, hidden_size=int(hidden_size), num_layers=min(3, int(num_layers)), dropout=min(0.5, float(dropout)) ).to(device) # 初始化权重 # for name, param in model.named_parameters(): # if 'weight' in name: # nn.init.xavier_normal_(param) # elif 'bias' in name: # nn.init.constant_(param, 0.1) # 损失函数和优化器 criterion = nn.HuberLoss(delta=1.0) optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-3) # 创建数据加载器 dataset = TimeSeriesDataset(X_train, y_train, timesteps=int(timesteps)) # 简化验证流程 train_size = int(0.8 * len(dataset)) train_dataset = torch.utils.data.Subset(dataset, range(train_size)) val_dataset = torch.utils.data.Subset(dataset, range(train_size, len(dataset))) train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True) val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False) # 简单训练和验证 model.train() for epoch in range(15): # 减少epoch数以加快评估 for inputs, targets in train_loader: inputs, targets = inputs.to(device), targets.to(device) optimizer.zero_grad() outputs = model(inputs) loss = criterion(outputs, targets.squeeze()) loss.backward() torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) optimizer.step() # 验证 model.eval() val_loss = 0.0 with torch.no_grad(): for 
inputs, targets in val_loader: outputs = model(inputs.to(device)) loss = criterion(outputs, targets.squeeze().to(device)) if torch.isnan(loss) or torch.isinf(loss): return float('inf') val_loss += loss.item() return val_loss / len(val_loader) except Exception as e: print(f"参数评估失败: {str(e)}") return float('inf') # 定义搜索空间 search_spaces = [ Integer(32, 128, name='hidden_size'), Integer(1, 3, name='num_layers'), Real(0.2, 0.5, name='dropout'), Real(5e-4, 1e-3, prior='log-uniform', name='lr'), Categorical([64, 128, 256], name='batch_size'), Integer(3, 10, name='timesteps') # 优化间步长 ] def adjust_hidden_size(params): """确保hidden_size*2能被4整除""" hs = params[0] params[0] = ((hs * 2) // 4) * 4 // 2 return params result = gp_minimize( score_fn, search_spaces, n_calls=50, random_state=42, verbose=True, n_jobs=1 # 并行执行 ) # 提取最佳参数 best_params = { 'hidden_size': result.x[0], 'num_layers': result.x[1], 'dropout': result.x[2], 'lr': result.x[3], 'batch_size': result.x[4], 'timesteps': result.x[5] } print("优化完成,最佳参数:", best_params) return best_params def collate_fn(batch): """增强型数据批处理函数""" # 解包批次数据 inputs, targets = zip(*batch) # 维度转换 (已包含在数据集中) # inputs = [batch_size, features, seq_len] # 数据增强(可选) # 添加高斯噪声 noise = torch.randn_like(inputs) * 0.05 inputs = inputs + noise # 归一化处理(可选) # mean = inputs.mean(dim=(1,2), keepdim=True) # std = inputs.std(dim=(1,2), keepdim=True) # inputs = (inputs - mean) / (std + 1e-8) return inputs.permute(0, 2, 1), torch.stack(targets) # [batch, seq_len, features] # 6. 评估函数 def evaluate_model(model, test_loader, criterion, test_indices, y_scaler=None): """评估模型性能""" model.eval() test_loss = 0.0 y_true = [] y_pred = [] all_indices = [] with torch.no_grad(): for batch_idx, (inputs, targets) in enumerate(test_loader): inputs, targets = inputs.to(device), targets.to(device) outputs = model(inputs) if outputs.dim() == 1: outputs = outputs.unsqueeze(1) loss = criterion(outputs, targets) test_loss += loss.item() * inputs.size(0) # 收集预测结果 y_true.extend(targets.cpu().numpy()) y_pred.extend(outputs.cpu().numpy()) # 获取原始数据集中的索引 current_indices = test_indices[batch_idx * test_loader.batch_size: (batch_idx + 1) * test_loader.batch_size] all_indices.extend(current_indices) y_true = np.array(y_true).reshape(-1) y_pred = np.array(y_pred).reshape(-1) if y_scaler is not None: y_true = y_scaler.inverse_transform(y_true.reshape(-1, 1)).flatten() y_pred = y_scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten() # 基础指标 metrics = { 'MSE': mean_squared_error(y_true, y_pred), 'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)), 'MAE': mean_absolute_error(y_true, y_pred), 'R2': r2_score(y_true, y_pred), 'MAPE': np.mean(np.abs((y_true - y_pred) / (y_true + 1e-8))) * 100, # 避免除0 'indices': all_indices, # 添加原始索引 'y_true_original': y_true, 'y_pred_original': y_pred, 'test_loss': test_loss } # 可视化误差分布 errors = y_true - y_pred plt.figure(figsize=(12, 6)) sns.histplot(errors, kde=True, bins=50) plt.title('Error Distribution') plt.savefig('error_distribution.png') plt.close() return metrics, y_true, y_pred def collate_fn(batch): """增强型数据批处理函数""" # 解包批次数据 inputs, targets = zip(*batch) # 维度转换 (已包含在数据集中) # inputs = [batch_size, features, seq_len] # 数据增强(可选) # 添加高斯噪声 noise = torch.randn_like(inputs) * 0.05 inputs = inputs + noise # 归一化处理(可选) # mean = inputs.mean(dim=(1,2), keepdim=True) # std = inputs.std(dim=(1,2), keepdim=True) # inputs = (inputs - mean) / (std + 1e-8) return inputs.permute(0, 2, 1), torch.stack(targets) # [batch, seq_len, features] # 7. 主函数 def main(): # 1. 
加载和预处理数据 print("正在加载和预处理数据...") datetimes, X_structured, y, location_codes, scaler , y_scaler= load_and_preprocess_data() # 2. 特征选择 print('正在进行特征选择') # 修改为选择前15%特征 n_features = int(X_structured.shape[1] * 0.15) X_selected, feature_importances, top_features = select_features_by_importance( X_structured, y, n_features ) X_selected = X_structured[top_features] print(f"选择后的特征及其重要性:") for feature, importance in zip(top_features, feature_importances): print(f"{feature}: {importance:.4f}") print(X_selected) # 绘制特征重要性图 plot_feature_importance(top_features, feature_importances, Path("feature_importance.png")) # 3. 创建间序列数据集 print("正在创建间序列数据集...") timesteps = 5 dataset = TimeSeriesDataset(X_selected, y, timesteps) # 4. 数据划分 train_size = int(0.8 * len(dataset)) train_indices = list(range(train_size)) test_indices = list(range(train_size, len(dataset))) train_dataset = torch.utils.data.Subset(dataset, train_indices) test_dataset = torch.utils.data.Subset(dataset, test_indices) # 5. 贝叶斯优化超参数 print("正在进行贝叶斯优化...") try: best_params = optimize_hyperparameters( X_selected.iloc[:train_size], y[:train_size].copy(), input_size=X_selected.shape[1] ) print("最佳参数:", best_params) except Exception as e: print(f"贝叶斯优化失败: {str(e)}") # 6. 使用最佳参数训练最终模型 torch.cuda.empty_cache() # 清理 GPU 缓存 print("\n使用最佳参数训练模型...") # 获取并验证batch_size batch_size = int(best_params.get('batch_size')) print(f"实际使用的batch_size类型: {type(batch_size)}, 值: {batch_size}") # 调试输出 model = LSTMModel( input_size=X_selected.shape[1], hidden_size=int(best_params['hidden_size']), num_layers=int(best_params['num_layers']), dropout=float(best_params['dropout']) ).to(device) # 数据加载器 train_loader = DataLoader( train_dataset, batch_size=int(batch_size), shuffle=True, # 训练集需要打乱 collate_fn=collate_fn, num_workers=4, # 多进程加载 pin_memory=True # 加速GPU传输 ) val_loader = DataLoader( test_dataset, batch_size=int(batch_size)*2, # 更大的批次提升验证效率 shuffle=False, # 验证集不需要打乱 collate_fn=lambda batch: ( torch.stack([x for x, y in batch]).permute(0, 2, 1), torch.stack([y for x, y in batch]) ), num_workers=2, pin_memory=True ) # 损失函数和优化器 criterion = nn.HuberLoss(delta=1.0) optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5) scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50) # 训练模型 model, history = train_model( model, train_loader, val_loader, criterion, optimizer, scheduler=scheduler, epochs=200, patience=15) torch.cuda.empty_cache() # 清理 GPU 缓存 # 7. 评估模型 print("\n评估模型性能...") metrics, y_true, y_pred = evaluate_model(model, val_loader, criterion, test_indices, y_scaler) print(f"测试集 MSE: {metrics['MSE']:.4f}, MAE: {metrics['MAE']:.4f}, R2: {metrics['R2']:.4f}") # 8. 
保存所有结果 print("\n保存所有结果...") output_dir = Path(r"D:\result2") output_dir.mkdir(parents=True, exist_ok=True) # 保存评估指标 metrics_df = pd.DataFrame({ 'Metric': ['MSE', 'MAE', 'R2', 'MAPE', 'Test Loss'], 'Value': [metrics['MSE'], metrics['MAE'], metrics['R2'], metrics['MAPE'], metrics['test_loss']] }) metrics_df.to_csv(output_dir / 'evaluation_metrics.csv', index=False) # 保存训练历史 history_df = pd.DataFrame(history) history_df.to_csv(output_dir / 'training_history.csv', index=False) # 保存预测结果与原始数据 pred_indices = [i + timesteps - 1 for i in metrics['indices']] # 调整索引以匹配原始数据 # 确保我们有足够的datetime和locationCode数据 if len(datetimes) > max(pred_indices) and len(location_codes) > max(pred_indices): y_true = y_true.flatten() # 确保是一维 y_pred = y_pred.flatten() # 确保是一维 result_df = pd.DataFrame({ 'datetime': datetimes.iloc[pred_indices].values, 'locationCode': location_codes.iloc[pred_indices].values, 'true_value': y_true, 'predicted_value': y_pred }) # 有条件地添加分位数 if y_pred.shape[1] > 2: result_df['predicted_lower'] = y_pred[:, 0] # 10%分位数 result_df['predicted_upper'] = y_pred[:, 2] # 90%分位数 # 添加其他特征 for i, feature in enumerate(X_selected.columns): result_df[feature] = X_selected.iloc[pred_indices, i].values result_df.to_csv(output_dir / 'predictions_with_metadata.csv', index=False) else: print("警告: datetime或locationCode数据不足,无法完全匹配预测结果") # 保存基础预测结果 pd.DataFrame({ 'true_value': y_true.flatten(), 'predicted_value': y_pred.flatten() }).to_csv(output_dir / 'predictions.csv', index=False) # 9. 可视化结果 plt.figure(figsize=(12, 6)) plt.plot(history['train_loss'], label='训练损失') plt.plot(history['val_loss'], label='验证损失') plt.xlabel('Epoch') plt.ylabel('Loss') plt.title('训练过程') plt.legend() plt.savefig(output_dir / 'training_process.png', dpi=300) plt.close() # 添加预测结果可视化 plt.figure(figsize=(15, 6)) plt.plot(y_true[:200], label='真实值') plt.plot(y_pred[:200], label='预测值') # 只使用中位数预测 plt.title('预测结果对比') plt.legend() plt.savefig(output_dir / 'prediction_comparison.png', dpi=300) plt.show() # 误差分布图 errors = y_true - y_pred[:, 1] plt.hist(errors, bins=50) plt.title('预测误差分布') plt.savefig(output_dir / 'error_distribution.png', dpi=300) # 保存图像 plt.close() # # 添加分位数预测可视化 # plt.figure(figsize=(15, 6)) # plt.plot(y_true[:100], label='真实值') # plt.plot(y_pred[:100, 0], label='10%分位数') # plt.plot(y_pred[:100, 1], label='中位数') # plt.plot(y_pred[:100, 2], label='90%分位数') # plt.legend() # plt.savefig(output_dir / 'quantile_predictions.png', dpi=300) # 保存图像 # plt.close() # 9. 
保存模型 if metrics['R2'] > 0.8: model_path = output_dir / 'best_model.pth' torch.save(model.state_dict(), model_path) print(f"模型保存成功: {model_path}") print(f"所有结果已保存到 {output_dir}") if __name__ == "__main__": warnings.filterwarnings('ignore') start_time = time.time() main() print(f"总运行: {(time.time() - start_time) / 60:.2f}分钟")
        scaler.fit_transform(X_structured), columns=X_structured.columns
    )
    # Make sure X_structured is a DataFrame
    if not isinstance(X_structured, pd.DataFrame):
        X_structured = pd.DataFrame(
            X_structured, columns=[f"feature_{i}" for i in range(X_structured.shape[1])])
    # X_structured = X_structured.values  # convert the DataFrame to a NumPy array

    # Revised target-variable handling
    y = df[['damage_count']].values.astype(np.float32)
    # Scale the target
    y_scaler = RobustScaler()
    y = y_scaler.fit_transform(y)  # scaling instead of a log transform
    y = np.clip(y, -1e6, 1e6)  # clip to a reasonable range
    # Sanity checks
    assert not np.any(np.isinf(y)), "y中包含无限值"
    assert not np.any(np.isnan(y)), "y中包含NaN值"
    # Data checks
    print("原始数据统计:")
    print(f"最小值: {y.min()}, 最大值: {y.max()}, NaN数量: {np.isnan(y).sum()}")
    print("处理后y值范围:", np.min(y), np.max(y))
    print("无限值数量:", np.isinf(y).sum())
    # Free memory
    del df, chunk, dfs
    gc.collect()
    torch.cuda.empty_cache()
    return datetimes, X_structured, y, location_codes, scaler, y_scaler


# 2. Time-series dataset class
class TimeSeriesDataset(Dataset):
    """Custom time-series dataset."""

    def __init__(self, X, y, timesteps):
        # Make sure the inputs are NumPy arrays
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.DataFrame) or isinstance(y, pd.Series):
            y = y.values
        assert X.ndim == 2, f"X应为2维,实际为{X.ndim}维"
        assert y.ndim == 2, f"y应为2维,实际为{y.ndim}维"
        # Shape debug information
        print(f"数据形状 - X: {X.shape}, y: {y.shape}")
        self.X = torch.FloatTensor(X).unsqueeze(-1)  # [samples, timesteps, 1]
        self.y = torch.FloatTensor(y)
        self.timesteps = timesteps
        # Validate shapes
        if len(self.X) != len(self.y):
            raise ValueError("X和y的长度不匹配")

    def __len__(self):
        return len(self.X) - self.timesteps

    def __getitem__(self, idx):
        # [seq_len, num_features]
        x_window = self.X[idx:idx + self.timesteps]
        y_target = self.y[idx + self.timesteps - 1]
        return x_window.permute(1, 0), y_target  # reorder the dimensions


def select_features_by_importance(X, y, n_features, feature_names=None):
    """Select features with a random forest (works with NumPy arrays and DataFrames)."""
    # Make sure X is a 2-D array
    if isinstance(X, pd.DataFrame):
        feature_names = X.columns.tolist()
        X = X.values
    elif feature_names is None:
        feature_names = [f"feature_{i}" for i in range(X.shape[1])]
    # Handle the dimensionality of y
    y = np.ravel(np.asarray(y))
    y = np.nan_to_num(y, nan=np.nanmean(y))
    # Check the number of features
    if X.shape[1] < n_features:
        n_features = X.shape[1]
        print(f"警告: 特征数少于请求数,使用所有 {n_features} 个特征")
    # Fit the random forest
    rf = RandomForestRegressor(n_estimators=250, random_state=42, n_jobs=-1)
    rf.fit(X, y)
    # Feature importances
    feature_importances = rf.feature_importances_
    indices = np.argsort(feature_importances)[::-1][:n_features]
    # Return the selected feature data, their importances and their names
    return X[:, indices], feature_importances[indices], [feature_names[i] for i in indices]


# 3. LSTM model definition
class LSTMModel(nn.Module):
    """LSTM regression model."""

    def __init__(self, input_size, hidden_size=64, num_layers=2, dropout=0.4):
        super().__init__()
        # Make sure hidden_size * 2 is divisible by num_heads
        if (hidden_size * 2) % 4 != 0:
            hidden_size = ((hidden_size * 2) // 4) * 4 // 2  # round to the nearest valid value
            print(f"调整hidden_size为{hidden_size}以满足整除条件")
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=True  # bidirectional LSTM
        )

        # Weight initialisation with a dimensionality check
        def weights_init(m):
            if isinstance(m, nn.Linear):
                if m.weight.dim() < 2:
                    m.weight.data = m.weight.data.unsqueeze(0)  # ensure at least 2 dimensions
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0.0)

        self.bn = nn.BatchNorm1d(hidden_size * 2)
        # Attention layer
        self.attention = nn.MultiheadAttention(embed_dim=hidden_size * 2, num_heads=4)  # multi-head attention
        # Deeper output head
        self.fc = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1)
        )
        # Apply the initialisation
        self.apply(weights_init)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)  # [batch, seq_len, hidden*2]
        lstm_out = lstm_out.permute(1, 0, 2)  # [seq_len, batch, features]
        attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
        attn_out = attn_out.permute(1, 0, 2)  # back to [batch, seq_len, features]
        return self.fc(attn_out[:, -1, :]).squeeze()


def plot_feature_importance(feature_names, importance_values, save_path):
    """Plot the feature importances."""
    # Validate the inputs
    if len(feature_names) == 0 or len(importance_values) == 0:
        print("警告: 无特征重要性数据可绘制")
        return
    if len(feature_names) != len(importance_values):
        print(f"警告: 特征名数量({len(feature_names)})与重要性值数量({len(importance_values)})不匹配")
        # Keep the shorter length
        min_len = min(len(feature_names), len(importance_values))
        feature_names = feature_names[:min_len]
        importance_values = importance_values[:min_len]
    # Sort by importance
    indices = np.argsort(importance_values)[::-1]
    sorted_features = [feature_names[i] for i in indices]
    sorted_importance = importance_values[indices]
    plt.figure(figsize=(12, 8))
    plt.bar(range(len(sorted_features)), sorted_importance, align="center")
    plt.xticks(range(len(sorted_features)), sorted_features, rotation=90)
    plt.xlabel("特征")
    plt.ylabel("重要性得分")
    plt.title("特征重要性排序")
    plt.tight_layout()
    # Make sure the output directory exists
    save_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(save_path, dpi=300)
    plt.close()


def evaluate(model, val_loader, criterion):
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets.squeeze())
            val_loss += loss.item()
    return val_loss / len(val_loader)


# 4. Model training function
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler=None, epochs=100, patience=30):
    """Train the model and return the best model plus the training history."""
    best_loss = float('inf')
    history = {'train_loss': [], 'val_loss': []}
    # Gradient accumulation
    accumulation_steps = 5  # update the parameters once every accumulation_steps batches
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        optimizer.zero_grad()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            scaler = torch.cuda.amp.GradScaler()  # created inside the training loop
            with torch.cuda.amp.autocast():
                outputs = model(inputs)
                loss = criterion(outputs, targets.squeeze())
            scaler.scale(loss).backward()
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets.squeeze())
            # Gradient accumulation
            loss = loss / accumulation_steps
            scaler.scale(loss).backward()
            if (batch_idx + 1) % accumulation_steps == 0:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
            train_loss += loss.item() * accumulation_steps
        # Validation
        val_loss = evaluate(model, val_loader, criterion)
        if scheduler:
            scheduler.step(val_loss)  # adjust the learning rate based on the validation loss
        # Record the history
        avg_train_loss = train_loss / len(train_loader)
        history['train_loss'].append(avg_train_loss)
        history['val_loss'].append(val_loss)
        # Early-stopping bookkeeping
        if val_loss < best_loss * 0.99:  # relative-improvement threshold
            best_loss = val_loss
            best_epoch = epoch
            torch.save(model.state_dict(), 'best_model.pth')
        print(f"Epoch {epoch + 1}/{epochs} - 训练损失: {avg_train_loss:.4f} - 验证损失: {val_loss:.4f}")
        # Early-stopping check
        if epoch - best_epoch >= patience:
            print(f"早停触发,最佳epoch: {best_epoch + 1}")
            break
    # Load the best model
    model.load_state_dict(torch.load('best_model.pth'))
    return model, history


# 5. Bayesian optimisation
def optimize_hyperparameters(X_train, y_train, input_size):
    """Search for the best hyperparameters with Bayesian optimisation."""

    # Custom scoring function
    def score_fn(params):
        """Internal scoring function."""
        try:
            params = adjust_hidden_size(params)  # adjust the parameters
            hidden_size, num_layers, dropout, lr, batch_size, timesteps = params
            # Keep the parameters in valid ranges
            batch_size = max(32, min(256, int(batch_size)))
            timesteps = max(3, min(10, int(timesteps)))
            dropout = min(0.5, max(0.1, float(dropout)))
            lr = min(0.01, max(1e-5, float(lr)))
            # Make sure there is enough data
            if len(X_train) < 2 * timesteps + 1:  # need at least twice the window length
                return float('inf')
            # Build the model
            model = LSTMModel(
                input_size=input_size,
                hidden_size=int(hidden_size),
                num_layers=min(3, int(num_layers)),
                dropout=min(0.5, float(dropout))
            ).to(device)
            # Weight initialisation
            # for name, param in model.named_parameters():
            #     if 'weight' in name:
            #         nn.init.xavier_normal_(param)
            #     elif 'bias' in name:
            #         nn.init.constant_(param, 0.1)
            # Loss function and optimiser
            criterion = nn.HuberLoss(delta=1.0)
            optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-3)
            # Data loaders
            dataset = TimeSeriesDataset(X_train, y_train, timesteps=int(timesteps))
            # Simplified validation split
            train_size = int(0.8 * len(dataset))
            train_dataset = torch.utils.data.Subset(dataset, range(train_size))
            val_dataset = torch.utils.data.Subset(dataset, range(train_size, len(dataset)))
            train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
            val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
            # Short training run
            model.train()
            for epoch in range(15):  # fewer epochs to speed up each evaluation
                for inputs, targets in train_loader:
                    inputs, targets = inputs.to(device), targets.to(device)
                    optimizer.zero_grad()
                    outputs = model(inputs)
                    loss = criterion(outputs, targets.squeeze())
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                    optimizer.step()
            # Validation
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for inputs, targets in val_loader:
                    outputs = model(inputs.to(device))
                    loss = criterion(outputs, targets.squeeze().to(device))
                    if torch.isnan(loss) or torch.isinf(loss):
                        return float('inf')
                    val_loss += loss.item()
            return val_loss / len(val_loader)
        except Exception as e:
            print(f"参数评估失败: {str(e)}")
            return float('inf')

    # Search space
    search_spaces = [
        Integer(32, 128, name='hidden_size'),
        Integer(1, 3, name='num_layers'),
        Real(0.2, 0.5, name='dropout'),
        Real(5e-4, 1e-3, prior='log-uniform', name='lr'),
        Categorical([64, 128, 256], name='batch_size'),
        Integer(3, 10, name='timesteps')  # tune the window length as well
    ]

    def adjust_hidden_size(params):
        """Make sure hidden_size * 2 is divisible by 4."""
        hs = params[0]
        params[0] = ((hs * 2) // 4) * 4 // 2
        return params

    result = gp_minimize(
        score_fn,
        search_spaces,
        n_calls=50,
        random_state=42,
        verbose=True,
        n_jobs=1  # number of parallel jobs
    )
    # Extract the best parameters
    best_params = {
        'hidden_size': result.x[0],
        'num_layers': result.x[1],
        'dropout': result.x[2],
        'lr': result.x[3],
        'batch_size': result.x[4],
        'timesteps': result.x[5]
    }
    print("优化完成,最佳参数:", best_params)
    return best_params


def collate_fn(batch):
    """Augmented batch-collation function."""
    # Unpack the batch
    inputs, targets = zip(*batch)
    # Dimension handling (already done in the dataset)
    # inputs = [batch_size, features, seq_len]
    # Optional data augmentation: add Gaussian noise
    noise = torch.randn_like(inputs) * 0.05
    inputs = inputs + noise
    # Optional normalisation
    # mean = inputs.mean(dim=(1,2), keepdim=True)
    # std = inputs.std(dim=(1,2), keepdim=True)
    # inputs = (inputs - mean) / (std + 1e-8)
    return inputs.permute(0, 2, 1), torch.stack(targets)  # [batch, seq_len, features]


# 6. Evaluation function
def evaluate_model(model, test_loader, criterion, test_indices, y_scaler=None):
    """Evaluate model performance."""
    model.eval()
    test_loss = 0.0
    y_true = []
    y_pred = []
    all_indices = []
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(test_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            if outputs.dim() == 1:
                outputs = outputs.unsqueeze(1)
            loss = criterion(outputs, targets)
            test_loss += loss.item() * inputs.size(0)
            # Collect the predictions
            y_true.extend(targets.cpu().numpy())
            y_pred.extend(outputs.cpu().numpy())
            # Indices back into the original dataset
            current_indices = test_indices[batch_idx * test_loader.batch_size:
                                           (batch_idx + 1) * test_loader.batch_size]
            all_indices.extend(current_indices)
    y_true = np.array(y_true).reshape(-1)
    y_pred = np.array(y_pred).reshape(-1)
    if y_scaler is not None:
        y_true = y_scaler.inverse_transform(y_true.reshape(-1, 1)).flatten()
        y_pred = y_scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten()
    # Basic metrics
    metrics = {
        'MSE': mean_squared_error(y_true, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
        'MAPE': np.mean(np.abs((y_true - y_pred) / (y_true + 1e-8))) * 100,  # avoid division by zero
        'indices': all_indices,  # original indices
        'y_true_original': y_true,
        'y_pred_original': y_pred,
        'test_loss': test_loss
    }
    # Visualise the error distribution
    errors = y_true - y_pred
    plt.figure(figsize=(12, 6))
    sns.histplot(errors, kde=True, bins=50)
    plt.title('Error Distribution')
    plt.savefig('error_distribution.png')
    plt.close()
    return metrics, y_true, y_pred


def collate_fn(batch):
    """Augmented batch-collation function."""
    # Unpack the batch
    inputs, targets = zip(*batch)
    # Dimension handling (already done in the dataset)
    # inputs = [batch_size, features, seq_len]
    # Optional data augmentation: add Gaussian noise
    noise = torch.randn_like(inputs) * 0.05
    inputs = inputs + noise
    # Optional normalisation
    # mean = inputs.mean(dim=(1,2), keepdim=True)
    # std = inputs.std(dim=(1,2), keepdim=True)
    # inputs = (inputs - mean) / (std + 1e-8)
    return inputs.permute(0, 2, 1), torch.stack(targets)  # [batch, seq_len, features]


# 7. Main function
def main():
    # 1. Load and preprocess the data
    print("正在加载和预处理数据...")
    datetimes, X_structured, y, location_codes, scaler, y_scaler = load_and_preprocess_data()
    # 2. Feature selection
    print('正在进行特征选择')
    # Keep the top 15% of features
    n_features = int(X_structured.shape[1] * 0.15)
    X_selected, feature_importances, top_features = select_features_by_importance(
        X_structured, y, n_features
    )
    X_selected = X_structured[top_features]
    print(f"选择后的特征及其重要性:")
    for feature, importance in zip(top_features, feature_importances):
        print(f"{feature}: {importance:.4f}")
    print(X_selected)
    # Plot the feature importances
    plot_feature_importance(top_features, feature_importances, Path("feature_importance.png"))
    # 3. Build the time-series dataset
    print("正在创建间序列数据集...")
    timesteps = 5
    dataset = TimeSeriesDataset(X_selected, y, timesteps)
    # 4. Train/test split
    train_size = int(0.8 * len(dataset))
    train_indices = list(range(train_size))
    test_indices = list(range(train_size, len(dataset)))
    train_dataset = torch.utils.data.Subset(dataset, train_indices)
    test_dataset = torch.utils.data.Subset(dataset, test_indices)
    # 5. Bayesian hyperparameter optimisation
    print("正在进行贝叶斯优化...")
    try:
        best_params = optimize_hyperparameters(
            X_selected.iloc[:train_size],
            y[:train_size].copy(),
            input_size=X_selected.shape[1]
        )
        print("最佳参数:", best_params)
    except Exception as e:
        print(f"贝叶斯优化失败: {str(e)}")
    # 6. Train the final model with the best parameters
    torch.cuda.empty_cache()  # free GPU cache
    print("\n使用最佳参数训练模型...")
    # Fetch and validate batch_size
    batch_size = int(best_params.get('batch_size'))
    print(f"实际使用的batch_size类型: {type(batch_size)}, 值: {batch_size}")  # debug output
    model = LSTMModel(
        input_size=X_selected.shape[1],
        hidden_size=int(best_params['hidden_size']),
        num_layers=int(best_params['num_layers']),
        dropout=float(best_params['dropout'])
    ).to(device)
    # Data loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=int(batch_size),
        shuffle=True,  # shuffle the training set
        collate_fn=collate_fn,
        num_workers=4,  # multi-process loading
        pin_memory=True  # faster host-to-GPU transfer
    )
    val_loader = DataLoader(
        test_dataset,
        batch_size=int(batch_size) * 2,  # larger batches for faster validation
        shuffle=False,  # no shuffling for validation
        collate_fn=lambda batch: (
            torch.stack([x for x, y in batch]).permute(0, 2, 1),
            torch.stack([y for x, y in batch])
        ),
        num_workers=2,
        pin_memory=True
    )
    # Loss function and optimiser
    criterion = nn.HuberLoss(delta=1.0)
    optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
    # Train
    model, history = train_model(
        model, train_loader, val_loader, criterion, optimizer,
        scheduler=scheduler, epochs=200, patience=15)
    torch.cuda.empty_cache()  # free GPU cache
    # 7. Evaluate the model
    print("\n评估模型性能...")
    metrics, y_true, y_pred = evaluate_model(model, val_loader, criterion, test_indices, y_scaler)
    print(f"测试集 MSE: {metrics['MSE']:.4f}, MAE: {metrics['MAE']:.4f}, R2: {metrics['R2']:.4f}")
    # 8. Save all results
    print("\n保存所有结果...")
    output_dir = Path(r"D:\result2")
    output_dir.mkdir(parents=True, exist_ok=True)
    # Save the evaluation metrics
    metrics_df = pd.DataFrame({
        'Metric': ['MSE', 'MAE', 'R2', 'MAPE', 'Test Loss'],
        'Value': [metrics['MSE'], metrics['MAE'], metrics['R2'], metrics['MAPE'], metrics['test_loss']]
    })
    metrics_df.to_csv(output_dir / 'evaluation_metrics.csv', index=False)
    # Save the training history
    history_df = pd.DataFrame(history)
    history_df.to_csv(output_dir / 'training_history.csv', index=False)
    # Save the predictions together with the original data
    pred_indices = [i + timesteps - 1 for i in metrics['indices']]  # shift the indices to match the original data
    # Make sure there is enough datetime and locationCode data
    if len(datetimes) > max(pred_indices) and len(location_codes) > max(pred_indices):
        y_true = y_true.flatten()  # ensure 1-D
        y_pred = y_pred.flatten()  # ensure 1-D
        result_df = pd.DataFrame({
            'datetime': datetimes.iloc[pred_indices].values,
            'locationCode': location_codes.iloc[pred_indices].values,
            'true_value': y_true,
            'predicted_value': y_pred
        })
        # Optionally add the quantiles
        if y_pred.shape[1] > 2:
            result_df['predicted_lower'] = y_pred[:, 0]  # 10% quantile
            result_df['predicted_upper'] = y_pred[:, 2]  # 90% quantile
        # Add the remaining features
        for i, feature in enumerate(X_selected.columns):
            result_df[feature] = X_selected.iloc[pred_indices, i].values
        result_df.to_csv(output_dir / 'predictions_with_metadata.csv', index=False)
    else:
        print("警告: datetime或locationCode数据不足,无法完全匹配预测结果")
        # Save the basic predictions only
        pd.DataFrame({
            'true_value': y_true.flatten(),
            'predicted_value': y_pred.flatten()
        }).to_csv(output_dir / 'predictions.csv', index=False)
    # 9. Visualise the results
    plt.figure(figsize=(12, 6))
    plt.plot(history['train_loss'], label='训练损失')
    plt.plot(history['val_loss'], label='验证损失')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('训练过程')
    plt.legend()
    plt.savefig(output_dir / 'training_process.png', dpi=300)
    plt.close()
    # Plot predictions vs. ground truth
    plt.figure(figsize=(15, 6))
    plt.plot(y_true[:200], label='真实值')
    plt.plot(y_pred[:200], label='预测值')  # use only the median prediction
    plt.title('预测结果对比')
    plt.legend()
    plt.savefig(output_dir / 'prediction_comparison.png', dpi=300)
    plt.show()
    # Error distribution plot
    errors = y_true - y_pred[:, 1]
    plt.hist(errors, bins=50)
    plt.title('预测误差分布')
    plt.savefig(output_dir / 'error_distribution.png', dpi=300)  # save the figure
    plt.close()
    # # Quantile prediction plot
    # plt.figure(figsize=(15, 6))
    # plt.plot(y_true[:100], label='真实值')
    # plt.plot(y_pred[:100, 0], label='10%分位数')
    # plt.plot(y_pred[:100, 1], label='中位数')
    # plt.plot(y_pred[:100, 2], label='90%分位数')
    # plt.legend()
    # plt.savefig(output_dir / 'quantile_predictions.png', dpi=300)  # save the figure
    # plt.close()
    # 9. Save the model
    if metrics['r2'] > 0.8:
        model_path = output_dir / 'best_model.pth'
        torch.save(model.state_dict(), model_path)
        print(f"模型保存成功: {model_path}")
    print(f"所有结果已保存到 {output_dir}")


if __name__ == "__main__":
    warnings.filterwarnings('ignore')
    start_time = time.time()
    main()
    print(f"总运行: {(time.time() - start_time) / 60:.2f}分钟")
```

Running the script produces the following output during the Bayesian optimisation:

```
参数评估失败: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 2
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 0.0120
Function value obtained: inf
Current minimum: inf
Iteration No: 6 started. Evaluating function at random point.
数据形状 - X: (21168, 3), y: (21168, 1)
参数评估失败: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 2
Iteration No: 6 ended. Evaluation done at random point.
Time taken: 0.0170
Function value obtained: inf
Current minimum: inf
Iteration No: 7 started. Evaluating function at random point.
数据形状 - X: (21168, 3), y: (21168, 1)
参数评估失败: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 2
Iteration No: 7 ended. Evaluation done at random point.
Time taken: 0.0126
Function value obtained: inf
Current minimum: inf
Iteration No: 8 started. Evaluating function at random point.
数据形状 - X: (21168, 3), y: (21168, 1)
参数评估失败: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 2
Iteration No: 8 ended. Evaluation done at random point.
Time taken: 0.0126
Function value obtained: inf
Current minimum: inf
Iteration No: 9 started. Evaluating function at random point.
数据形状 - X: (21168, 3), y: (21168, 1)
参数评估失败: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 2
Iteration No: 9 ended. Evaluation done at random point.
Time taken: 0.0085
Function value obtained: inf
Current minimum: inf
Iteration No: 10 started. Evaluating function at random point.
数据形状 - X: (21168, 3), y: (21168, 1)
参数评估失败: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 2
贝叶斯优化失败: Input y contains infinity or a value too large for dtype('float64').
```

Based on the code above, analyse why the optimisation fails, and rework it.
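Reading the traceback together with the code, the failure chain appears to be: `TimeSeriesDataset.__init__` calls `unsqueeze(-1)`, so each window returned by `__getitem__` is a 3-D tensor of shape `[timesteps, features, 1]`, and `x_window.permute(1, 0)` then raises exactly the `input.dim() = 3 is not equal to len(dims) = 2` error shown in the log. Because `score_fn` wraps everything in `try/except` and returns `float('inf')` on failure, every random evaluation scores `inf`, and `gp_minimize` eventually cannot fit its surrogate model to a vector of infinities, which surfaces as `Input y contains infinity or a value too large for dtype('float64')`. A minimal sketch of one possible repair is shown below; it assumes the model should receive batches shaped `[batch, seq_len, features]` (the LSTM is built with `batch_first=True`), and the class name and fields mirror the original code rather than prescribe it.

```python
import torch
import pandas as pd
from torch.utils.data import Dataset


class TimeSeriesDataset(Dataset):
    """Sketch of a fixed dataset: keep X two-dimensional and return [seq_len, features] windows."""

    def __init__(self, X, y, timesteps):
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, (pd.DataFrame, pd.Series)):
            y = y.values
        self.X = torch.FloatTensor(X)   # [samples, features] -- no unsqueeze(-1)
        self.y = torch.FloatTensor(y)
        self.timesteps = timesteps

    def __len__(self):
        return len(self.X) - self.timesteps

    def __getitem__(self, idx):
        # A window of consecutive rows: [timesteps, features]
        x_window = self.X[idx:idx + self.timesteps]
        y_target = self.y[idx + self.timesteps - 1]
        return x_window, y_target  # default collate then yields [batch, seq_len, features]
```

With windows already shaped `[seq_len, features]`, the extra `permute` calls in `collate_fn` and in the validation loader's lambda are no longer needed. It may also help to return a large finite penalty (for example `1e6`) instead of `float('inf')` from `score_fn`, so that a stray failed evaluation cannot poison the surrogate model again.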
For the `Input y contains infinity or a value too large for dtype('float64')` error that appears after replacing 0 with NaN via `replace(0, np.nan)`, here is a systematic way to resolve it:

---

### Root-cause analysis
The error means the target variable `y` contains one of three kinds of invalid values:
1. **NaN values**: introduced by replacing 0 with NaN
2. **Infinity values**: produced during computation (for example, division by zero)
3. **Numeric overflow**: values beyond what float64 can represent (for example, 1e309)

---

### Step-by-step fix
#### 1. Detect all invalid values at once
Use a combined mask to locate the problematic rows:
```python
import numpy as np
import pandas as pd

# Detect invalid values in the target variable y
invalid_mask = (
    pd.isna(y) |            # NaN
    np.isinf(y) |           # Infinity
    (np.abs(y) > 1e300)     # numeric overflow
)
print("Indices of invalid records:", y[invalid_mask].index.tolist())
```

#### 2. Targeted cleaning strategies
Pick one depending on what the detection step shows:

**Option 1: safe deletion (recommended)**
```python
# Drop the offending rows from the target (drop the matching rows from X as well)
y_clean = y[~invalid_mask]
print(f"Dropped {int(invalid_mask.sum())} invalid records")
```

**Option 2: robust imputation**
```python
from sklearn.impute import SimpleImputer

# Convert inf to NaN first so the imputer can handle it, then impute with the median
# (the median is insensitive to outliers)
y = y.replace([np.inf, -np.inf], np.nan)
imputer = SimpleImputer(strategy='median')
y_clean = imputer.fit_transform(y.values.reshape(-1, 1))
```

**Option 3: clipping**
```python
# Handle Infinity and extreme values
y_clean = np.clip(y, -1e300, 1e300)   # impose sensible numeric bounds
y_clean = np.nan_to_num(y_clean)      # convert NaN to 0
```

#### 3. Validate the data (key step)
Use scikit-learn's validation helper to confirm the cleaned data is acceptable:
```python
from sklearn.utils import check_array

try:
    check_array(np.asarray(y_clean).reshape(-1, 1), force_all_finite=True, dtype=np.float64)
except ValueError as e:
    print("Validation failed:", str(e))
```

---

### Further debugging tips
1. **Trace the source**:
```python
# Find the columns where Infinity values first appear
for col in df.columns:
    if df[col].map(lambda x: np.isinf(x) if isinstance(x, (int, float)) else False).any():
        print(f"Found Infinity values in column {col}")
```
2. **Visualise the distribution**:
```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.boxplot(np.asarray(y_clean)[np.isfinite(y_clean)], vert=False)  # keep only finite values for the plot
plt.title("Box plot of the target variable")
plt.show()
```

---

### Engineering suggestions
1. **Build a cleaning pipeline**:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

clean_pipe = Pipeline([
    ('replace_zero', FunctionTransformer(lambda x: x.replace(0, np.nan))),
    ('remove_invalid', FunctionTransformer(lambda x: x[np.isfinite(x)]))
])
y_clean = clean_pipe.fit_transform(y)
```
2. **Turn on floating-point monitoring**:
```python
np.seterr(all='warn')  # warn on floating-point anomalies
```

---

### Related background
1. **The numeric range of float64**:
   - Smallest positive normal value: `2.2250738585072014e-308`
   - Largest positive value: `1.7976931348623157e+308`
   - Anything outside this range becomes `inf` or `-inf`
2. **Memory-saving tip**:
```python
# Downcast to save memory
y = y.astype(np.float32)  # the representable range shrinks to roughly ±3.4e38
```
---
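One caveat about that downcast, shown here with a small illustration using made-up numbers: values that fit comfortably in float64 can silently overflow to `inf` in float32, which would immediately re-trigger the very "Input contains infinity" validation error discussed in this post. A quick range check before casting avoids the surprise:

```python
import numpy as np

y64 = np.array([1.0, 3.5e38, -2.0], dtype=np.float64)  # hypothetical values; one exceeds float32's maximum
y32 = y64.astype(np.float32)                            # 3.5e38 > ~3.4e38, so it overflows

print(np.isinf(y32).any())         # True -- the cast produced an inf
print(np.finfo(np.float32).max)    # 3.4028235e+38
print(np.finfo(np.float64).max)    # 1.7976931348623157e+308

# Safer: verify the range (or clip) before downcasting
if np.abs(y64).max() > np.finfo(np.float32).max:
    print("values would overflow float32 -- keep float64 or clip first")
```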