Model-free algorithms can be combined with model learning.
The idea is that after sending S, A to the environment and receiving the feedback S_, R, we store that relationship, written as
S_, R = Model(S, A). Repeating this many times learns part of the model, which acts like an internal imagination of the external environment, so the agent can then keep learning without touching the real environment.
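
As a rough sketch (my own illustration, assuming a deterministic environment and hashable states and actions), the learned model can be as simple as a dictionary keyed by (S, A):

```python
import random

# Minimal sketch of a learned tabular model (assumption: deterministic
# environment, so each (S, A) maps to exactly one (R, S_)).
class Model:
    def __init__(self):
        self.transitions = {}              # (S, A) -> (R, S_)

    def store(self, s, a, r, s_):
        # Remember what the real environment returned for (s, a).
        self.transitions[(s, a)] = (r, s_)

    def sample(self):
        # Replay a previously observed (S, A) and its remembered outcome,
        # for planning updates that never touch the real environment.
        s, a = random.choice(list(self.transitions.keys()))
        r, s_ = self.transitions[(s, a)]
        return s, a, r, s_
```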
Combining Q-learning with model learning in this way gives the Tabular Dyna-Q method:
Initialize Q(s,a) and Model(s,a) for all s ∈ S and a ∈ A(s)
Do forever (for each episode):
  (a) S ← current (nonterminal) state
  (b) A ← ε-greedy(S, Q)
  (c) Execute action A; observe resultant reward, R, and state, S′
  (d) Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S′,a) − Q(S,A)]
  (e) Model(S,A) ← R, S′ (assuming deterministic environment)
  (f) Repeat n times:
      S ← random previously observed state
      A ← random action previously taken in S
      R, S′ ← Model(S,A)
      Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S′,a) − Q(S,A)]
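
Below is a minimal Python sketch of the loop above. It assumes a gym-style environment whose reset() returns a state and whose step(a) returns (next_state, reward, done), plus a finite action list; names such as n_planning and epsilon_greedy are my own choices, not a specific library API.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=500, alpha=0.1, gamma=0.9,
           epsilon=0.1, n_planning=10):
    """Tabular Dyna-Q sketch: one real Q-learning update per step (a)-(d),
    model learning (e), then n_planning updates replayed from the model (f)."""
    Q = defaultdict(float)        # Q[(s, a)], defaults to 0
    model = {}                    # model[(s, a)] = (r, s_, done)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)                          # (b)
            s_, r, done = env.step(a)                      # (c)
            # (d) direct RL update from real experience
            target = r if done else r + gamma * max(Q[(s_, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (e) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s_, done)
            # (f) planning: replay n transitions from the learned model
            for _ in range(n_planning):
                ps, pa = random.choice(list(model.keys()))
                pr, ps_, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * max(Q[(ps_, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_
    return Q
```

The planning loop in (f) only replays transitions already stored in the model, so larger n squeezes more value out of each real interaction at the cost of extra computation.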