from deepreplay.callbacks import ReplayData
from deepreplay.replay import Replay
from deepreplay.plot import compose_plots
from keras.initializers import normal
from matplotlib import pyplot as plt
filename = 'part2_weight_initializers.h5'
group_name = 'sigmoid_stdev_0.01'
# Uses a Normal initializer with a tiny standard deviation
initializer = normal(mean=0, stddev=0.01, seed=13)
# Builds BLOCK model (build_model is the helper function defined earlier in the post)
model = build_model(n_layers=5, input_dim=10, units=100,
                    activation='sigmoid', initializer=initializer)
# Since we only need the initial weights, we don't even need to train the model!
# We still use the ReplayData callback, but we can pass the model as an argument instead
# (X and y are the dataset loaded earlier)
replaydata = ReplayData(X, y, filename=filename, group_name=group_name, model=model)
# Now we feed the data to the actual Replay object
# so we can build the visualizations
replay = Replay(replay_filename=filename, group_name=group_name)
# Using subplot2grid to assemble a complex figure...
fig = plt.figure(figsize=(12, 6))
ax_zvalues = plt.subplot2grid((2, 2), (0, 0))
ax_weights = plt.subplot2grid((2, 2), (0, 1))
ax_activations = plt.subplot2grid((2, 2), (1, 0))
ax_gradients = plt.subplot2grid((2, 2), (1, 1))
wv = replay.build_weights(ax_weights)
gv = replay.build_gradients(ax_gradients)
# Z-values
zv = replay.build_outputs(ax_zvalues, before_activation=True,
                          exclude_outputs=True, include_inputs=False)
# Activations
av = replay.build_outputs(ax_activations, exclude_outputs=True, include_inputs=False)
# Finally, we use compose_plots to update all
# visualizations at once
fig = compose_plots([zv, wv, av, gv],
                    epoch=0,
                    title=r'Activation: sigmoid - Initializer: Normal $\sigma = 0.01$')
Trying a different Activation Function
Xavier / Glorot Initialization Scheme
Rectified Linear Unit (ReLU) Activation Function
He Initialization Scheme
So, we need not only a similar variance across all the layers, but also a proper scale for the gradients. The scale matters because, together with the learning rate, it determines how fast the weights are updated. If the gradients are way too small, learning (that is, the updating of the weights) will be extremely slow.
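To make the scale problem concrete, here is a small NumPy sketch (not DeepReplay code; the layer sizes are arbitrary, and tanh is used instead of sigmoid only because its zero-centered output makes the effect easier to see). It pushes random inputs through a stack of layers and reports the spread of the final z-values: with a tiny standard deviation the signal collapses toward zero layer after layer, while a Glorot-style scale of sqrt(1 / fan_in) keeps it roughly stable.

```python
import numpy as np

rng = np.random.default_rng(13)

def forward_std(init_std, n_layers=5, fan=100, n_samples=1000):
    """Push random inputs through a stack of tanh layers and
    return the standard deviation of the last layer's z-values."""
    z = rng.standard_normal((n_samples, fan))
    for _ in range(n_layers):
        w = rng.normal(0.0, init_std, size=(fan, fan))
        z = np.tanh(z) @ w
    return z.std()

# Tiny std: the z-values shrink by roughly a factor of 10 per layer
print(forward_std(0.01))
# Glorot-style std, sqrt(1 / fan), for equal fan-in and fan-out:
# the scale stays roughly constant across layers
print(forward_std(np.sqrt(1.0 / 100)))
```

Small z-values also mean small local gradients for the sigmoid/tanh, which is exactly why the scale of the initialization feeds straight into how fast the weights can move.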
Showdown — Normal vs Uniform and Glorot vs He!
To be honest, Glorot vs He actually means Tanh vs ReLU and we all know the answer to this match (spoiler alert!): ReLU wins!
And what about Normal vs Uniform? Uniform wins! Let’s check the plot below:
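For reference, a NumPy sketch of what the four combinations actually sample (fan-in and fan-out of 100 are hypothetical choices here): Normal and Uniform are just two ways of drawing weights with the same target variance — Glorot scales by fan-in plus fan-out, He by fan-in alone — and the Uniform limit carries a sqrt(3) factor because the variance of U(-L, L) is L²/3.

```python
import numpy as np

rng = np.random.default_rng(13)
fan_in = fan_out = 100

# Each scheme targets a weight variance; Normal and Uniform are just
# two different distributions that hit the same target
targets = {
    'glorot': 2.0 / (fan_in + fan_out),  # tanh-friendly
    'he': 2.0 / fan_in,                  # ReLU-friendly
}
for name, var in targets.items():
    w_n = rng.normal(0.0, np.sqrt(var), size=(fan_in, 1000))
    w_u = rng.uniform(-np.sqrt(3 * var), np.sqrt(3 * var),
                      size=(fan_in, 1000))
    print(name, w_n.var(), w_u.var())  # both close to the target variance
```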
In summary
For a ReLU activated network, the He initialization scheme using a Uniform distribution is a pretty good choice 😉
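As a sanity check on that recommendation, here is a NumPy sketch (layer sizes are arbitrary) of a ReLU stack with He-Uniform weights: the activations keep a roughly constant scale from layer to layer. In Keras, `he_uniform` from `keras.initializers` is the corresponding built-in initializer.

```python
import numpy as np

rng = np.random.default_rng(13)
fan = 100

def relu(z):
    return np.maximum(z, 0.0)

# He Uniform limit: Var(w) = limit**2 / 3 = 2 / fan
limit = np.sqrt(6.0 / fan)

a = rng.standard_normal((1000, fan))
for layer in range(5):
    w = rng.uniform(-limit, limit, size=(fan, fan))
    a = relu(a @ w)
    print(layer, a.std())  # stays roughly constant across layers
```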
https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404
