off-policy: needs sampling
policy evaluation with a model:
1. Markov decision processes (MDPs)
2. Markov reward processes (MRPs)
3. DP (dynamic programming)
DP needs the Markov assumption, and it only looks one step ahead from the current state: it bootstraps (it does not sample), taking an expectation over all possible next states using the model (see the sketch below).
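A minimal sketch of iterative policy evaluation with a known model. P, R and policy are placeholder inputs (transition probabilities, expected rewards, and a deterministic policy), not the format of any particular library:

import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation with a known model.

    P[s][a][s2] : probability of transitioning to s2, R[s][a] : expected reward,
    policy[s]   : the action the (deterministic) policy takes in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            a = policy[s]
            # one-step bootstrap: immediate reward + discounted expected value
            # of the successor states under the known model
            V_new[s] = R[s][a] + gamma * sum(
                P[s][a][s2] * V[s2] for s2 in range(n_states))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new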
policy evaluation without a model:
1. Monte Carlo (MC): estimate the value by averaging the returns of sampled trajectories.
It needs a lot of data because each estimate comes from sampling whole future trajectories,
so it has high variance, and the every-visit variant is also biased when there are only a few samples.
If the MDP (Markov decision process) changes over time (a non-stationary problem, e.g. a recommender system), we want to drop or down-weight old data rather than retrain from scratch; incremental Monte Carlo with a constant step size handles this in policy evaluation (see the sketch after this item).
Bootstrapped estimates can be highly biased; in the worst case an estimator can be both highly biased and high variance.
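A minimal sketch of that incremental every-visit MC update, assuming V is a plain dict and each episode is logged as a list of (state, reward) pairs; the constant step size alpha is what lets old episodes be forgotten in the non-stationary case:

def incremental_mc_update(V, episode, gamma=1.0, alpha=0.1):
    """Every-visit incremental Monte Carlo update from one completed episode."""
    G = 0.0
    # walk the episode backwards so G accumulates the discounted return from each state
    for state, reward in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V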
2. TD (temporal difference)
A combination of the DP and Monte Carlo methods: it computes and applies an update immediately after each step, using a sampled reward like Monte Carlo does (rather than waiting to sum up the whole predicted return), while only looking one step ahead and bootstrapping like DP does.
TD both bootstraps (the estimated value of the next state stands in for the sum of future discounted rewards) and samples (a single observed transition approximates the expectation over next states).
The TD update does not need to wait for the end of the episode to compute a return: it uses the current reward plus gamma times the value estimate of the next state (the bootstrap). In effect it plugs the Bellman operator into incremental every-visit MC, which otherwise uses the single return observed in the current episode i.
Bellman operator:
immediate reward plus our discounted expected future value (written out below)
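For a fixed policy π, with reward model R and transition model P (assuming the usual notation, which these notes do not spell out):

(B^π V)(s) = R(s, π(s)) + gamma * Σ_{s'} P(s' | s, π(s)) * V(s')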
Where TD differs from DP: both bootstrap one step ahead, but DP takes an expectation over all possible next states (which requires the model), whereas TD only uses the single next state that was actually sampled (a tabular sketch follows).
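A minimal sketch of tabular TD(0) policy evaluation; env with reset()/step() and policy are stand-ins for whatever environment interface is actually used, not a specific library's API:

from collections import defaultdict

def td0_policy_evaluation(env, policy, n_episodes=1000, gamma=0.9, alpha=0.1):
    """Tabular TD(0): update after every step, no need to wait for the episode to end."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # bootstrap: current reward + gamma * estimated value of the next state
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return dict(V)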
Sarsa (state-action-reward-state-action)
TD learning and Q-learning are closely related: Sarsa and Q-learning learn action values Q(s, a), which is what the agent needs in order to choose actions, while TD learning for policy evaluation is essentially the same update with the policy held fixed (update rules sketched below).
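The two update rules side by side, assuming Q is a defaultdict(float) keyed by (state, action) tuples; this is an illustrative sketch, not something taken from the notes:

def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.1):
    # on-policy: bootstrap on the action a_next the behaviour policy actually took
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    # off-policy: bootstrap on the greedy (max) action in the next state
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

In practice both are typically driven by an epsilon-greedy behaviour policy; the only difference is which action value appears in the bootstrap target.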
###############
usable when there is no model of the domain: DP is not usable, MC is usable, TD is usable (MC and TD are model-free algorithms)
handles continuing (non-episodic) domains, where episodes never terminate: DP is usable, MC is not usable, TD is usable
handles non-Markovian domains: DP is not usable, MC is usable, TD is not usable
Because DP and TD both bootstrap (DP's backup is exact given the model, TD's sampled backup is biased), they use the value of the next state to stand in for everything that comes after. That only works if the state is a sufficient statistic of the future reward history, i.e. the Markov assumption holds: the immediate reward plus whatever state the agent transitions to captures all the information needed about the future, so we can plug in the bootstrap estimator.
converges to the true value in the limit: true for DP, for MC, and for TD
unbiased estimate of the value: for DP the question is not really fair, because it computes the exact value of V^(k-1) for that policy, which is not the same as the infinite-horizon value function; MC is unbiased for first-visit and biased for every-visit; TD is biased, but it still converges to the true value in the limit
What is the difference between consistent and unbiased? An estimator that converges to the true value given an infinite amount of data is formally called a consistent estimator; an estimator can be biased and still be consistent (e.g. TD and every-visit MC).
###############
trajectory data:
A: reward 0 --> B (B: reward 1 six times; reward 0 two times)
MC:
V(A) = 0
V(B) = 6/8
TD:
V(A) = 0 + gamma * 6/8
V(B) = 6/8 + gamma * (value of the next step)
MC: on a batch of data it converges to the minimum mean-squared-error estimate of the observed returns.
TD: converges to the dynamic-programming solution computed on the maximum-likelihood estimates of the dynamics and reward model.
In other words, batch TD is converging to this maximum-likelihood (certainty-equivalence) MDP value estimate; a sketch of both batch estimators follows.
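A sketch contrasting the two batch estimates, assuming episodes are logged as lists of (state, reward, next_state) triples with None marking termination; this logging format is an assumption for illustration, not something fixed by the notes:

import numpy as np
from collections import defaultdict

def batch_mc(episodes, gamma=1.0):
    """Batch (every-visit) MC: average the observed returns from each state."""
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        for state, reward, _ in reversed(ep):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: float(np.mean(g)) for s, g in returns.items()}

def certainty_equivalence(episodes, gamma=1.0, n_iters=1000):
    """Value of the maximum-likelihood MDP fit to the batch (what batch TD(0) finds)."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[s][s2]: observed transitions
    rewards = defaultdict(list)                      # rewards observed when leaving s
    for ep in episodes:
        for state, reward, next_state in ep:
            counts[state][next_state] += 1
            rewards[state].append(reward)
    V = defaultdict(float)                           # terminal / unseen states stay at 0
    for _ in range(n_iters):
        for s in counts:
            total = sum(counts[s].values())
            V[s] = float(np.mean(rewards[s])) + gamma * sum(
                (c / total) * V[s2] for s2, c in counts[s].items())
    return {s: V[s] for s in counts}

On a batch like the A/B example above, the second estimator gives V(A) = 0 + gamma * V(B), matching the TD numbers, while batch MC keeps V(A) at the average of the returns actually observed from A.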
3. Policy gradient
(compared with the one-step TD method, more than one time step can be considered)