Prologue
- Those in ancient times who accomplished great things combined grand vision with meticulous attention to detail; neither can be missing.
- Good weather or bad, keep marching roughly 30 kilometers every day.
- At first, business in the shop was slow, so they had plenty of time to write programs.
Background and goal:
Training on the 8x8 map fails easily; I've covered this problem before.
The fix, in one sentence: reshape the reward based on the distance to the goal.
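Condensed into a sketch, the rule looks like this (the helper name shape_reward is just for illustration; the full script below inlines this logic, and 0.1 is the bonus value the script uses):

def shape_reward(env_reward, old_distance, new_distance):
    """Keep the goal reward; add a small bonus whenever a step gets closer to the goal."""
    if env_reward == 1:              # reached the goal: keep the original reward
        return 1
    if new_distance < old_distance:  # this step moved closer to the goal
        return 0.1                   # small shaping bonus
    return env_reward                # otherwise pass the env reward through unchanged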
Process:
1. First, ask ChatGPT to recommend a few approaches
- ChatGPT recommended 5 approaches. I tried them all; none worked. The failure rate stayed high.
- I asked for more approaches; still no luck. DQN, for example: I tried it and it failed. The agent kept shuffling left and right in one spot and never made progress.
2. So I suggested using a distance formula.
I had seen a similar idea somewhere before, though I can't remember exactly where; probably in an algorithm problem.
Then I had GPT write the code based on this idea.
Ran it: it works!
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import pickle


def get_distance(state, goal_state, grid_size=8):
    """Manhattan distance from the current position to the goal."""
    x1, y1 = divmod(state, grid_size)
    x2, y2 = divmod(goal_state, grid_size)
    return abs(x1 - x2) + abs(y1 - y2)


def run(episodes, is_training=True, render=False):
    env = gym.make('FrozenLake-v1', map_name="8x8", is_slippery=False)

    q = np.zeros((env.observation_space.n, env.action_space.n))  # init a 64 x 4 array

    goal_state = 63  # on the 8x8 map the goal is the bottom-right corner (state 63)

    lr = 0.9  # alpha or learning rate
    discount_factor_g = 0.9  # gamma or discount rate. Near 0: more weight/reward placed on immediate state. Near 1: more on future state.
    epsilon = 1  # 1 = 100% random actions
    epsilon_decay_rate = 0.0001  # epsilon decay rate. 1/0.0001 = 10,000
    rng = np.random.default_rng()  # random number generator

    rewards_per_episode = np.zeros(episodes)

    for i in range(episodes):
        state = env.reset()[0]  # states: 0 to 63, 0=top left corner, 63=bottom right corner
        terminated = False  # True when fallen in a hole or reached the goal
        truncated = False  # True when actions > 200

        while not terminated and not truncated:
            if is_training and rng.random() < epsilon:
                action = env.action_space.sample()  # actions: 0=left, 1=down, 2=right, 3=up
            else:
                action = np.argmax(q[state, :])

            new_state, reward, terminated, truncated, _ = env.step(action)

            # distance to the goal before and after this step
            old_distance = get_distance(state, goal_state)
            new_distance = get_distance(new_state, goal_state)

            # reshaped reward logic:
            if reward == 1:
                # reached the goal: keep the original reward
                new_reward = 1
            elif reward == 0 and new_distance < old_distance:
                # moved closer to the goal: small bonus
                new_reward = 0.1
            elif reward == 0 and new_distance >= old_distance:
                # moved away from (or no closer to) the goal: no bonus
                new_reward = 0
            else:
                # any other reward value (only hit if the env returns a penalty): pass it through
                new_reward = reward

            # standard Q-learning update, using the shaped reward
            q[state, action] = q[state, action] + lr * (
                new_reward + discount_factor_g * np.max(q[new_state, :]) - q[state, action]
            )

            state = new_state

        epsilon = max(epsilon - epsilon_decay_rate, 0)

        if epsilon == 0:
            lr = 0.0001  # once exploration stops, shrink the learning rate

        if reward == 1:
            rewards_per_episode[i] = 1

    env.close()

    plt.figure(figsize=(10, 8))
    sum_rewards = np.zeros(episodes)
    for t in range(episodes):
        sum_rewards[t] = np.sum(rewards_per_episode[max(0, t - 100):(t + 1)])
    plt.plot(sum_rewards)
    plt.savefig('frozen_lake8x8-distance-reward--1.png')

    return np.mean(sum_rewards[-100:])


if __name__ == '__main__':
    # run(15000)
    for i in range(5):
        ret = run(15000)
        print(f"Run {i + 1}: reward: {ret}")

# Output:
# Run 1: reward: 101.0
# Run 2: reward: 101.0
# Run 3: reward: 101.0
# Run 4: reward: 101.0
# Run 5: reward: 101.0

(A note on the numbers: run() returns the rolling success count over a window of 101 episodes, indices t-100 through t, so 101.0 means every episode near the end of training reached the goal.)
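A quick way to see what get_distance is doing: FrozenLake numbers the 8×8 cells row by row (0 is the top-left corner, 63 the bottom-right), so divmod(state, 8) recovers (row, col). Assuming get_distance from the script above is in scope:

print(get_distance(0, 63))    # top-left corner:            |0-7| + |0-7| = 14
print(get_distance(10, 63))   # state 10 is (row 1, col 2): |1-7| + |2-7| = 11
print(get_distance(62, 63))   # one cell left of the goal:  |7-7| + |6-7| = 1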
3. Finally, a word on why this works
GPT's own summary is pretty good:
📌 Advantages of this approach
✅ More stable learning: the agent doesn't wander blindly; it keeps moving toward the goal.
✅ Fewer falls into holes: it is nudged onto the correct path earlier in training.
✅ Scales to the harder 8×8 map: it converges faster than plain Q-learning on the sparse default reward.
In other words, exploration is still encouraged, and the exploration itself now earns rewards along the way.
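To see concretely why that early reward signal matters, compare one Q-learning update at the very start of training, when the Q-table is still all zeros (a hand-worked example that just plugs numbers into the update rule from the script; not output from a real run):

lr, gamma = 0.9, 0.9             # same alpha / gamma as in run()
q_sa, max_q_next = 0.0, 0.0      # early in training the whole Q-table is still zero

# plain FrozenLake reward: a non-goal step returns 0, so nothing is learned yet
plain = q_sa + lr * (0.0 + gamma * max_q_next - q_sa)   # -> 0.0

# shaped reward: a step that moves closer to the goal returns 0.1,
# so the Q-value for that (state, action) pair grows immediately
shaped = q_sa + lr * (0.1 + gamma * max_q_next - q_sa)  # -> 0.09

print(plain, shaped)  # 0.0 0.09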
Conclusion + todo
At the very end, one more line to share, from sentex: