CS234 value iteration/policy iteration

This post describes how to solve the FrozenLake problem with value iteration and policy iteration as part of a CS234 course assignment. FrozenLake is a game that simulates crossing a frozen winter lake dotted with holes in the ice; the player has to move carefully to avoid falling into the water. Using value iteration and policy iteration, the post shows how to compute the optimal policy and the corresponding state-value matrix.


CS234 Assignment #1: value iteration/policy iteration

This part of the assignment implements value iteration and policy iteration, using the FrozenLake environment as the test case.
FrozenLake:
Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted.
If you step into one of those holes, you’ll fall into the freezing water.
At this time, there’s an international frisbee shortage, so it’s absolutely imperative that
you navigate across the lake and retrieve the disc.
However, the ice is slippery, so you won’t always move in the direction you intend.
The surface is described using a grid like the following

    SFFF
    FHFH
    FFFH
    HFFG

S : starting point, safe
F : frozen surface, safe
H : hole, fall to your doom
G : goal, where the frisbee is located

The episode ends when you reach the goal or fall in a hole.
You receive a reward of 1 if you reach the goal, and zero otherwise.

CS234 vi_and_pi.py
Notes:
In this example the actions are encoded as
0: ← // 1: ↓ // 2: → // 3: ↑
In the parameter P, terminal is either True or False.
In this example every probability in P is 1.0, but in the general case, when the transitions
are not deterministic, P[state][action] is a list of transition tuples (see the snippet after
these notes):
[(probability1, nextstate1, reward1, terminal1),
 (probability2, nextstate2, reward2, terminal2),
 .....]
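To see this structure concretely you can simply print the environment's transition dictionary. The snippet below is only an illustration and assumes the plain gym FrozenLake-v0 environment (the assignment registers its own deterministic/stochastic variants), whose unwrapped object exposes P, nS and nA:

    import gym

    # Illustrative only: the standard slippery FrozenLake-v0, not the assignment's custom env.
    env = gym.make("FrozenLake-v0").unwrapped
    print(env.nS, env.nA)    # 16 states, 4 actions
    print(env.P[0][1])       # transitions for action 1 (down) taken from the start state
    # -> a list of (probability, nextstate, reward, terminal) tuples; with slippery ice
    #    there are several entries, each with probability < 1.0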
value iteration
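
Value iteration repeatedly applies the Bellman optimality backup to every state until the value function stops changing (this is the standard formulation, written in the usual notation rather than copied from the handout):

    V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a) \, [\, r(s, a, s') + \gamma V_k(s') \,]

Once the largest change in V across all states drops below tol, the greedy policy with respect to V is (approximately) optimal; this is what the function below computes.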

import numpy as np

def value_iteration(P, nS, nA, gamma=0.9, max_iteration=20, tol=1e-3):
    """
    Learn value function and policy by using value iteration method for a given
    gamma and environment.

    Parameters:
    ----------
    P: dictionary
        It is from gym.core.Environment
        P[state][action] is a list of tuples (probability, nextstate, reward, terminal)
    nS: int
        number of states
    nA: int
        number of actions
    gamma: float
        Discount factor. Number in range [0, 1)
    max_iteration: int
        maximum number of sweeps over the state space
    tol: float
        stop once the largest change in V between sweeps is below tol
    Returns:
    ----------
    V: np.ndarray of shape [nS], the value function
    policy: np.ndarray of shape [nS], the greedy policy w.r.t. V
    """
    V = np.zeros(nS)
    policy = np.zeros(nS, dtype=int)
    for _ in range(max_iteration):
        # Bellman optimality backup: Q(s, a) = sum over s' of p * (r + gamma * V(s'))
        Q = np.array([[sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[s][a])
                       for a in range(nA)] for s in range(nS)])
        V_new, policy = Q.max(axis=1), Q.argmax(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    return V, policy

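A quick usage sketch for the 4x4 map above. The env id and the helper import are assumptions based on the assignment's starter code, which registers a deterministic 4x4 FrozenLake variant; substitute whatever names your starter files actually use:

    import gym
    from lake_envs import *   # assumed starter-code module that registers the env below

    env = gym.make("Deterministic-4x4-FrozenLake-v0")
    V, policy = value_iteration(env.P, env.nS, env.nA, gamma=0.9, max_iteration=20, tol=1e-3)
    print(V.reshape(4, 4))        # state-value matrix over the 4x4 grid
    print(policy.reshape(4, 4))   # greedy policy: 0 left, 1 down, 2 right, 3 up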