off-policy: needs sampling
policy evaluation with a model:
1. Markov decision processes (MDPs)
2. Markov reward processes (MRPs)
3. DP (dynamic programming)
DP needs the Markov assumption, and it only looks one step ahead from the current state: it bootstraps (it does not sample), taking an expectation over all possible next states using the model (see the sketch below).
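A minimal sketch of iterative policy evaluation with a known model. P, R and policy are placeholder inputs (transition probabilities, expected rewards, and a deterministic policy), not the format of any particular library:

import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation with a known model.

    P[s][a][s2] : probability of transitioning to s2, R[s][a] : expected reward,
    policy[s]   : the action the (deterministic) policy takes in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            a = policy[s]
            # one-step bootstrap: immediate reward + discounted expected value
            # of the successor states under the known model
            V_new[s] = R[s][a] + gamma * sum(
                P[s][a][s2] * V[s2] for s2 in range(n_states))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new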
policy evaluation without a model:
1. Monte Carlo (MC): estimate the value by averaging the returns of sampled trajectories.
It needs a lot of data because each estimate comes from sampling whole future trajectories,
so it has high variance, and the every-visit variant is also biased when there are only a few samples.
If the MDP (Markov decision process) changes over time (a non-stationary problem, e.g. a recommender system), we want to drop or down-weight old data rather than retrain from scratch; incremental Monte Carlo with a constant step size handles this in policy evaluation (see the sketch after this item).
Bootstrapped estimates can be highly biased; in the worst case an estimator can be both highly biased and high variance.
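A minimal sketch of that incremental every-visit MC update, assuming V is a plain dict and each episode is logged as a list of (state, reward) pairs; the constant step size alpha is what lets old episodes be forgotten in the non-stationary case:

def incremental_mc_update(V, episode, gamma=1.0, alpha=0.1):
    """Every-visit incremental Monte Carlo update from one completed episode."""
    G = 0.0
    # walk the episode backwards so G accumulates the discounted return from each state
    for state, reward in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V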
2. TD (temporal difference)
A combination of the DP and Monte Carlo methods: it computes and applies an update immediately after each step, using a sampled reward like Monte Carlo does (rather than waiting to sum up the whole predicted return), while only looking one step ahead and bootstrapping like DP does.
TD both bootstraps (the estimated value of the next state stands in for the sum of future discounted rewards) and samples (a single observed transition approximates the expectation over next states).
The TD update does not need to wait for the end of the episode to compute a return: it uses the current reward plus gamma times the value estimate of the next state (the bootstrap). In effect it plugs the Bellman operator into incremental every-visit MC, which otherwise uses the single return observed in the current episode i.
Bellman operator:
immediate reward plus our discounted expected future value (written out below)
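For a fixed policy π, with reward model R and transition model P (assuming the usual notation, which these notes do not spell out):

(B^π V)(s) = R(s, π(s)) + gamma * Σ_{s'} P(s' | s, π(s)) * V(s')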
Where TD differs from DP: both bootstrap one step ahead, but DP takes an expectation over all possible next states (which requires the model), whereas TD only uses the single next state that was actually sampled (a tabular sketch follows).
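A minimal sketch of tabular TD(0) policy evaluation; env with reset()/step() and policy are stand-ins for whatever environment interface is actually used, not a specific library's API:

from collections import defaultdict

def td0_policy_evaluation(env, policy, n_episodes=1000, gamma=0.9, alpha=0.1):
    """Tabular TD(0): update after every step, no need to wait for the episode to end."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # bootstrap: current reward + gamma * estimated value of the next state
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return dict(V)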
Sarsa (state-action-reward-state-action)
TD learning and Q-learning are closely related: Sarsa and Q-learning learn action values Q(s, a), which is what the agent needs in order to choose actions, while TD learning for policy evaluation is essentially the same update with the policy held fixed (update rules sketched below).
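The two update rules side by side, assuming Q is a defaultdict(float) keyed by (state, action) tuples; this is an illustrative sketch, not something taken from the notes:

def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.1):
    # on-policy: bootstrap on the action a_next the behaviour policy actually took
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    # off-policy: bootstrap on the greedy (max) action in the next state
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

In practice both are typically driven by an epsilon-greedy behaviour policy; the only difference is which action value appears in the bootstrap target.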
###############
usable when there is no model of the domain: DP is not usable, MC is usable, TD is usable (MC and TD are model-free algorithms)
handles continuing (non-episodic) domains, where episodes never terminate: DP is usable, MC is not usable, TD is usable
handles non-Markovian domains: DP is not usable, MC is usable, TD is not usable
Because DP and TD both bootstrap (DP's backup is exact given the model, TD's sampled backup is biased), they use the value of the next state to stand in for everything that comes after. That only works if the state is a sufficient statistic of the future reward history, i.e. the Markov assumption holds: the immediate reward plus whatever state the agent transitions to captures all the information needed about the future, so we can plug in the bootstrap estimator.
converges to the true value in the limit: true for DP, for MC, and for TD
unbiased estimate of the value: for DP the question is not really fair, because it computes the exact value of V^(k-1) for that policy, which is not the same as the infinite-horizon value function; MC is unbiased for first-visit and biased for every-visit; TD is biased, but it still converges to the true value in the limit
What is the difference between consistent and unbiased? An estimator that converges to the true value given an infinite amount of data is formally called a consistent estimator; an estimator can be biased and still be consistent (e.g. TD and every-visit MC).
###############
trajectory data:
A: reward 0 --> B (B: reward 1 six times; reward 0 two times)
MC:
V(A) = 0
V(B) = 6/8
TD:
V(A) = 0 + gamma * 6/8
V(B) = 6/8 + gamma * (value of the next step)
MC: on a batch of data it converges to the minimum mean-squared-error estimate of the observed returns.
TD: converges to the dynamic-programming solution computed on the maximum-likelihood estimates of the dynamics and reward model.
In other words, batch TD is converging to this maximum-likelihood (certainty-equivalence) MDP value estimate; a sketch of both batch estimators follows.
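A sketch contrasting the two batch estimates, assuming episodes are logged as lists of (state, reward, next_state) triples with None marking termination; this logging format is an assumption for illustration, not something fixed by the notes:

import numpy as np
from collections import defaultdict

def batch_mc(episodes, gamma=1.0):
    """Batch (every-visit) MC: average the observed returns from each state."""
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        for state, reward, _ in reversed(ep):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: float(np.mean(g)) for s, g in returns.items()}

def certainty_equivalence(episodes, gamma=1.0, n_iters=1000):
    """Value of the maximum-likelihood MDP fit to the batch (what batch TD(0) finds)."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[s][s2]: observed transitions
    rewards = defaultdict(list)                      # rewards observed when leaving s
    for ep in episodes:
        for state, reward, next_state in ep:
            counts[state][next_state] += 1
            rewards[state].append(reward)
    V = defaultdict(float)                           # terminal / unseen states stay at 0
    for _ in range(n_iters):
        for s in counts:
            total = sum(counts[s].values())
            V[s] = float(np.mean(rewards[s])) + gamma * sum(
                (c / total) * V[s2] for s2, c in counts[s].items())
    return {s: V[s] for s in counts}

On a batch like the A/B example above, the second estimator gives V(A) = 0 + gamma * V(B), matching the TD numbers, while batch MC keeps V(A) at the average of the returns actually observed from A.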
3. Policy gradient
(compared with the one-step TD method, more than one time step can be considered)