Reinforcement Learning -- Policy Network -- TensorFlow

This post walks through implementing a policy network with TensorFlow, focusing on how TensorFlow is used and how the policy network for reinforcement learning is built step by step.

Implementing the Policy Network in TensorFlow

# Baseline: run CartPole-v0 with a purely random policy to see how long
# an episode lasts before the pole falls over.
import tensorflow as tf
import numpy as np
import gym

env = gym.make('CartPole-v0')
env.reset()
random_episodes = 0
reward_sum = 0
while random_episodes < 10:
    # env.render()
    # Take a random action: 0 pushes the cart left, 1 pushes it right
    observation, reward, done, _ = env.step(np.random.randint(0, 2))
    reward_sum += reward
    if done:
        random_episodes += 1
        print('Reward for the episode was :', reward_sum)
        reward_sum = 0
        env.reset()

A sample run gives episode rewards like the following; the random policy rarely keeps the pole up for long:
Reward for the episode was : 11.0
Reward for the episode was : 31.0
Reward for the episode was : 46.0
Reward for the episode was : 18.0
Reward for the episode was : 10.0
Reward for the episode was : 25.0
Reward for the episode was : 13.0
Reward for the episode was : 25.0
Reward for the episode was : 16.0
Reward for the episode was : 14.0
# Build the reinforcement-learning policy network
# Common network hyperparameters
H = 50               # number of hidden units
batch_size = 25      # episodes per parameter update
learning_rate = 0.1
D = 4                # dimensionality of a CartPole observation
gamma = 0.99         # discount factor for rewards

# Placeholders -- build an MLP
observations = tf.placeholder(tf.float32, [None, D], name='input_x')
w1 = tf.get_variable('w1', shape=[D, H],
                     initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations, w1))   # hidden layer
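
The original post breaks off here, so the rest of the graph below is a minimal sketch of one common way to finish it, not the author's code: a single sigmoid output for the probability of action 1, a log-likelihood loss weighted by per-step advantages, and a plain Adam update. The names w2, score, probability, input_y, advantages, loglik, loss and train_op are assumptions introduced for this sketch.

# Output layer: one sigmoid unit giving the probability of choosing action 1
w2 = tf.get_variable('w2', shape=[H, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1, w2)
probability = tf.nn.sigmoid(score)

# Placeholders for the actions actually taken (0 or 1) and their advantages
# (discounted, normalized rewards)
input_y = tf.placeholder(tf.float32, [None, 1], name='input_y')
advantages = tf.placeholder(tf.float32, [None, 1], name='reward_signal')

# Policy-gradient loss: negative log-likelihood of the taken actions,
# weighted by each step's advantage (the 1e-8 terms avoid log(0))
loglik = input_y * tf.log(probability + 1e-8) + \
         (1 - input_y) * tf.log(1 - probability + 1e-8)
loss = -tf.reduce_mean(loglik * advantages)
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)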
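To make the example end-to-end, here is a hedged sketch of the training side under the same assumptions; the discount_rewards helper, the episode/batch bookkeeping and the total_episodes limit are illustrative and not from the original post. Each step samples an action from probability; when batch_size episodes have finished, the collected states, actions and normalized discounted returns are fed through train_op.

def discount_rewards(r):
    """Turn per-step rewards into discounted returns, decayed by gamma."""
    discounted = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

xs, ys, drs = [], [], []                  # current episode's states, actions, rewards
batch_x, batch_y, batch_adv = [], [], []  # data accumulated over a batch of episodes
reward_sum = 0
episode_number = 0
total_episodes = 1000

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    observation = env.reset()
    while episode_number < total_episodes:
        x = np.reshape(observation, [1, D])
        # Sample action 1 with the probability the policy assigns to it
        prob = sess.run(probability, feed_dict={observations: x})
        action = 1 if np.random.uniform() < float(prob) else 0

        xs.append(x)
        ys.append([float(action)])
        observation, reward, done, _ = env.step(action)
        reward_sum += reward
        drs.append(reward)

        if done:
            episode_number += 1
            # Convert this episode's rewards into normalized advantages
            epr = discount_rewards(np.vstack(drs).astype(np.float32))
            epr -= np.mean(epr)
            epr /= (np.std(epr) + 1e-8)
            batch_x.append(np.vstack(xs))
            batch_y.append(np.vstack(ys))
            batch_adv.append(epr)
            xs, ys, drs = [], [], []
            observation = env.reset()

            # One policy-gradient update every batch_size episodes
            if episode_number % batch_size == 0:
                sess.run(train_op, feed_dict={
                    observations: np.vstack(batch_x),
                    input_y: np.vstack(batch_y),
                    advantages: np.vstack(batch_adv)})
                print('Average reward over the last batch:', reward_sum / batch_size)
                batch_x, batch_y, batch_adv = [], [], []
                reward_sum = 0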