Python problems roundup 2: InternalError: Blas GEMM launch failed

Article link: https://blog.youkuaiyun.com/csdn_1_10086/article/details/124491356

Error 1

Full error message: failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(50, 538), b.shape=(538, 1000), m=50, n=1000, k=538
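According to a blog post I consulted, the cause is this: when training a model with the GPU build of TensorFlow, you need to allocate a fixed amount of GPU memory to the Session when it is initialized; otherwise it may fail and exit as soon as training starts.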

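Outside the function that raises the error (its location was shown in a screenshot), add the following two lines so that other sessions can use the same configuration: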

import tensorflow as tf  # TensorFlow 1.x API
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)  # cap at about one third of GPU memory
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
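The value 0.333 is a fraction of the card's total memory. As a minimal sketch, the fraction can be derived from an absolute budget. The helper memory_fraction and the numbers below are hypothetical and not part of the original code; the card's total memory is assumed to be known, for example from nvidia-smi.

```python
def memory_fraction(budget_mb, total_mb):
    """Return the share of GPU memory to request, capped at 1.0."""
    if budget_mb <= 0 or total_mb <= 0:
        raise ValueError("budget_mb and total_mb must be positive")
    return min(budget_mb / total_mb, 1.0)


# Hypothetical numbers: a 4 GB budget on a 12 GB card.
fraction = memory_fraction(4096, 12288)
print(round(fraction, 3))
```

The result can be passed as per_process_gpu_memory_fraction to tf.GPUOptions. TensorFlow 1.x also accepts allow_growth=True in tf.GPUOptions, which allocates memory on demand instead of reserving a fixed share.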

Error 2

Full error message: tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 'gs' not implemented (file: 'gs://anomaly_detection/mtad_tf/data/train/machine-1-1.tfrecords')

Code that raises the error:

# The gs:// scheme here is not recognized
parser.add_argument('--tfrecords_file', type=str, default='gs://anomaly_detection/mtad_tf/data/train/{}.tfrecords',
            help='tfrecords output file. It will be used as a prefix if split.') 

# Write my own data to a TFRecord file
with tf.io.TFRecordWriter(tfrecords_file.format(os.path.splitext(machine_file)[0])) as writer:

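The error is raised by TensorFlow, possibly because of a version problem or incorrect usage.
An article I consulted attributes it to the versions and suggests changing them as follows: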

Change tensorflow to version 2.2
Change tensorflow-datasets to version 3.1.0
Change python to version 3.6

That did not help, so I created the folders data and train (data/train) directly under the project root and changed the value of tfrecords_file from gs://anomaly_detection/mtad_tf/data/train/{}.tfrecords to data/train/{}.tfrecords:

parser.add_argument('--tfrecords_file', type=str, default='data/train/{}.tfrecords',
            help='tfrecords output file. It will be used as a prefix if split.')

Success!
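The switch to a local path can be sketched as a small helper that fills in the template and creates the output folder before anything is written. The helper local_tfrecords_path, the temporary root folder, and the input file name machine-1-1.txt are assumptions for illustration only.

```python
import os
import tempfile


def local_tfrecords_path(template, machine_file):
    """Fill the template with the file stem and create the parent folder."""
    if "://" in template:
        # Remote schemes such as gs:// may not be supported by the installed build.
        raise ValueError("expected a local path template, got: " + template)
    path = template.format(os.path.splitext(machine_file)[0])
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path


# Use a temporary folder as the project root so the demo leaves no files behind.
root = tempfile.mkdtemp()
template = os.path.join(root, "data", "train", "{}.tfrecords")
path = local_tfrecords_path(template, "machine-1-1.txt")
print(path)
```

The returned path can then be handed to tf.io.TFRecordWriter as in the snippet above.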

Error 3

Details: AttributeError: module 'tensorflow' has no attribute 'contrib'

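According to a blog post I consulted and the official TensorFlow documentation, the tf.contrib module is not included in TensorFlow 2.0. Many of its submodules have been integrated into TensorFlow core or split out into tensorflow_io or tensorflow_addons.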

The code at my error location:

tf.contrib.data.map_and_batch(  # error raised here
        parser, batch_size=params["batch_size"],
        num_parallel_batches=8,
        drop_remainder=drop_reminder))

Searching for data.map_and_batch turned up its new location, tf.data.experimental.map_and_batch, so I changed the code to:

tf.data.experimental.map_and_batch(  # changed here
        parser, batch_size=params["batch_size"],
        num_parallel_batches=8,
        drop_remainder=drop_reminder))

Following the same approach, some other errors can be fixed as follows:

Replace tf with tf.compat.v1 for each of these errors:

module 'tensorflow' has no attribute 'logging'
module 'tensorflow' has no attribute 'ConfigProto'
module 'tensorflow' has no attribute 'variable_scope'
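The replacement can be sketched as a small text rewrite. This is only an illustration covering the three names from the errors above; the helper to_compat_v1 is hypothetical and works on source text, not on a running program.

```python
import re

# Names taken from the three errors above; extend the tuple as needed.
MOVED_NAMES = ("logging", "ConfigProto", "variable_scope")

_PATTERN = re.compile(r"\btf\.(" + "|".join(MOVED_NAMES) + r")\b")


def to_compat_v1(source):
    """Rewrite tf.<name> to tf.compat.v1.<name> for the names in MOVED_NAMES."""
    return _PATTERN.sub(r"tf.compat.v1.\1", source)


print(to_compat_v1("config = tf.ConfigProto()"))
```

Applying the function twice leaves the text unchanged, because the rewritten names no longer directly follow "tf.". For a whole project, TensorFlow ships the tf_upgrade_v2 command-line script, which performs this kind of conversion far more thoroughly.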

However, the code raised many such errors, so I decided to downgrade TensorFlow to 1.15.5, after which the errors disappeared.

TensorFlow versions available for installation: 0.12.1, 1.0.0, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0, 1.12.2, 1.12.3, 1.13.1, 1.13.2, 1.14.0, 1.15.0, 1.15.2, 1.15.3, 1.15.4, 1.15.5, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.5.0, 2.5.1, 2.5.2, 2.6.0rc0, 2.6.0rc1, 2.6.0rc2, 2.6.0, 2.6.1, 2.6.2

Error 4

Detailed error message: ValueError: too many values to unpack (expected 3) (unresolved)

Error location 1:

state_model = ClusterStateRecognition()
state_model.set_configuration(params)
state_model.build_model()
state_model.fit(rawx)
# No error here, even though the code is almost identical
# Check carefully whether state_model.build_model() has a problem
train_patterns, train_prob = state_model.predict(trainx)
print(train_patterns.shape, train_prob.shape)

state_model = SAXStateRecognition()
state_model.set_configuration(params)
state_model.build_model(his_len=15, segment_dim=2)
state_model.fit(rawx)  # error raised here
# predict requires trainx to be 3-D, but it is 4-D
train_patterns, train_prob = state_model.predict(trainx)
print(train_patterns.shape, train_prob.shape)

Error location 2:

def predict(self, x):
    self.restore(self.param.model_save_path)
    # error raised on the next line
    sax_dataset_inv = self.model_obj.inverse_transform(self.model_obj.fit_transform(x))
    uniques = sorted(np.unique(sax_dataset_inv))
    print('sax numbers:', len(uniques))

Looking at the function self.model_obj.fit_transform(x):

def fit_transform(self, X, y=None, **fit_params):
    """Fit a SAX representation and transform the data accordingly.

    Parameters
    ----------
    X : array-like of shape (n_ts, sz, d)  # the input x must be (n_ts, sz, d)
        Time series dataset

    Returns
    -------
    numpy.ndarray of integers with shape (n_ts, n_segments, d)
        SAX-Transformed dataset  # the returned data is (n_ts, n_segments, d)
    """
    X = check_array(X, allow_nd=True, force_all_finite=False)
    X = check_dims(X)  # does not change the number of dimensions of X
    return self._fit(X)._transform(X)

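The x passed in above is 4-D, while this function requires 3-D input, which leads to the error ValueError: too many values to unpack (expected 3). Looking closely at the traceback, the error is raised inside a function of the third-party open-source library tslearn, so it may be a version mismatch (if no existing fix can be found, search for earlier versions and try them one by one).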
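The message itself is what Python prints when a 4-element shape is unpacked into three names. A minimal sketch that reproduces it, assuming the failing library code unpacks the array shape this way; the helpers unpack_3d and describe and the shapes are made up for illustration.

```python
def unpack_3d(shape):
    """Unpack a shape the way code written for 3-D input does."""
    n_ts, sz, d = shape  # raises ValueError when shape has four entries
    return n_ts, sz, d


def describe(shape):
    """Return 'ok' or the error message produced by unpacking the shape."""
    try:
        unpack_3d(shape)
        return "ok"
    except ValueError as err:
        return str(err)


print(describe((100, 15, 2)))     # 3-D shape
print(describe((100, 15, 2, 1)))  # 4-D shape
```

Printing x.shape just before the failing call is a quick way to confirm whether the input really has four dimensions.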
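Check the tslearn version installed in my environment and the versions available to install:

	conda list tslearn  # show the tslearn version installed in the current environment
	conda search tslearn  # list the tslearn versions available to install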

I picked 0.5.1.0, a version close to 0.5.2 that supports Python 3.6, and installed it.

Commands to uninstall and reinstall tslearn:

pip uninstall tslearn
pip install tslearn==0.5.1.0

I checked tslearn on GitHub and found that in every version of piecewise.py the function transform(self, X, y=None) requires 3-D input.

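Because of the version shown earlier by conda list, I reinstalled 0.5.2. Running again raised ModuleNotFoundError: No module named 'tensorflow.keras.models'; see the fix below. After fixing that error I was back where I started, so it does not appear to be a version problem.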

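Assuming the underlying logic is correct, the cause may be a mismatch between some of the network configuration parameters and the parameters of the data.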

Error 5

Detailed error message: tensorflow.python.framework.errors_impl.InvalidArgumentError: Key: input. Can't parse serialized Example.

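I had changed the input data, and this problem can appear when the input dimension no longer matches the number of outputs. When there are several loading modules, carefully verify for each one that the parameter types and counts match the original ones.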
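One way to catch such a mismatch before TensorFlow does is to compare each record's length with the dimension the parser is configured for. A minimal sketch, assuming the records are plain Python sequences; the helper bad_rows is hypothetical, and 38 and 28 are the two input dimensions that come up in this section.

```python
def bad_rows(rows, expected_dim):
    """Return the indices of rows whose length differs from expected_dim."""
    return [i for i, row in enumerate(rows) if len(row) != expected_dim]


# One record with 38 features and one with 28, checked against a parser set to 38.
rows = [[0.0] * 38, [0.0] * 28]
print(bad_rows(rows, 38))
```

An empty list means every record has the expected dimension.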

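Original data: SMD (Server Machine Dataset) is a new five-week dataset containing 3 groups of entities, each named after its machine. SMD consists of data from 28 different machines, and the 28 subsets should be trained and tested separately. Each subset is divided into two parts of equal length for training and testing. Labels are provided for whether a point is an anomaly, along with the dimensions that contribute to each anomaly.
SMD therefore consists of the following parts:
Train: the first half of the dataset.
Test: the second half of the dataset.
Test labels: the labels of the test set, indicating whether a point is an anomaly.
Interpretation labels: the list of dimensions that contribute to each anomaly.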

1. The original data has an input dimension of 38; my data has an input dimension of 28.
After changing it, I got this error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error: Assign requires shapes of both tensors to match. lhs shape= [18,28] rhs shape= [18,38] (the two tensors have different shapes)
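The network model may have other parameters that depend on the dimension, so I debugged to find where the model is defined, and found one place that still used the value 38.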
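Later I found that the README requires running with the run_mode parameter set to FORECASTING first, and only then running the command for the next step. I had set this parameter incorrectly, so the earlier step used the model parameters trained on the original data.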
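A mismatch like [18,28] versus [18,38] can be located by comparing the shapes the current graph expects with the shapes stored in the checkpoint. A minimal sketch over plain dictionaries; the helper shape_mismatches and the variable names are hypothetical, while the shapes are the ones from the error above.

```python
def shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return {name: (graph_shape, checkpoint_shape)} for variables that differ."""
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        saved = checkpoint_shapes.get(name)
        if saved is not None and tuple(saved) != tuple(graph_shape):
            mismatches[name] = (tuple(graph_shape), tuple(saved))
    return mismatches


# Hypothetical variable names; the shapes are taken from the error message.
graph = {"dense/kernel": [18, 28], "dense/bias": [18]}
checkpoint = {"dense/kernel": [18, 38], "dense/bias": [18]}
print(shape_mismatches(graph, checkpoint))
```

In TensorFlow, tf.train.list_variables(checkpoint_path) returns (name, shape) pairs that could be used to fill checkpoint_shapes.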
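Then I ran into what looked like an out-of-memory problem: Shuffle buffer filled. (TensorFlow normally prints this line as an informational log message rather than an error.)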
I adjusted batch_size to 100.

It still failed; this error remains unresolved.