Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics

This post reviews a new method that uses Q-learning to solve the linear quadratic tracking (LQT) problem online for discrete-time systems with unknown dynamics. The approach is built on an algebraic Riccati equation for an augmented system and, unlike conventional LQT methods, requires no prior knowledge of the system dynamics or the command generator. Q-learning is used to optimize the tracking error, and the stability and error bounds of the algorithm are shown. The post also covers the offline and online algorithms, as well as the open question of solving the LQT using only measured input data and reference-trajectory data.


Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics, Automatica, 2014. Bahare Kiumarsi, Frank L. Lewis, Hamidreza Modares, Ali Karimpour, Mohammad-Bagher Naghibi-Sistani.

For discrete-time systems with unknown dynamics, a new Q-learning-based algorithm is proposed to solve the infinite-horizon linear quadratic tracking (LQT) problem. A linear command generator produces the reference trajectory, and the command generator and the original system are combined into an augmented system. The value function is a quadratic function of the augmented state (system state and reference trajectory). The Bellman equation and the augmented algebraic Riccati equation (ARE) for the LQT are derived, and only the augmented ARE needs to be solved. The Q-learning algorithm then solves the augmented ARE online, without knowledge of the system dynamics, the command generator, or the augmented system dynamics.
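To make the augmented-system construction concrete, here is a minimal numerical sketch. The plant, command generator, and weights below are hypothetical placeholders, and the names T, B1, Q1 are assumed notation rather than quotations from the paper: the reference obeys r_{k+1} = F r_k, the plant is x_{k+1} = A x_k + B u_k with output y_k = C x_k, and the augmented state is X_k = [x_k; r_k].

```python
import numpy as np

# Hypothetical plant and command generator (placeholders, not the paper's example)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
F = np.array([[1.0]])            # e.g. a constant reference signal

n, m, p = A.shape[0], B.shape[1], F.shape[0]

# Augmented system X_{k+1} = T X_k + B1 u_k with X_k = [x_k; r_k]
T  = np.block([[A, np.zeros((n, p))],
               [np.zeros((p, n)), F]])
B1 = np.vstack([B, np.zeros((p, m))])

# Tracking error e_k = C x_k - r_k = C1 X_k, weighted by Q; control weighted by R
Q  = np.eye(p)
R  = np.eye(m)
C1 = np.hstack([C, -np.eye(p)])
Q1 = C1.T @ Q @ C1               # weight on the augmented state

gamma = 0.6                      # discount factor, 0 < gamma <= 1
```

The discounted value function of the augmented system is then $V(X_k)=\sum_{i\ge k}\gamma^{\,i-k}\bigl(X_i^{\top}Q_1 X_i+u_i^{\top}R\,u_i\bigr)$, which is the quantity the Bellman equation and the augmented ARE below refer to.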
The traditional LQT solution consists of a feedforward input, obtained by solving a noncausal difference equation, and a feedback input, obtained by solving an ARE.
In earlier work, the dynamic-inversion concept was used to obtain the feedforward control input, and RL was used to obtain the optimal feedback control input. However, dynamic inversion requires the system to be invertible with respect to the control input and needs full knowledge of the system dynamics.
The standard LQT solution requires solving an ARE and a noncausal difference equation simultaneously, and the noncausal auxiliary variable must be computed backward in time. It can only be used when the reference trajectory is generated by an asymptotically stable command generator.
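For context, one common textbook form of the standard (non-augmented) LQT solution is sketched below; this is a generic statement in assumed notation, not a quotation from the paper. The feedback part uses the solution $S$ of the standard ARE, while the feedforward variable $v_k$ obeys a recursion that runs backward in time, which is what makes it noncausal:

$$
u_k = -\bigl(R + B^{\top} S B\bigr)^{-1} B^{\top}\bigl(S A\,x_k - v_{k+1}\bigr),
\qquad
v_k = (A - BK)^{\top} v_{k+1} + C^{\top} Q\, r_k .
$$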
Assumption 1 states that the LQT reference trajectory is produced by a command generator; the augmented system and its performance index (value function) are then defined.
Lemma 1 gives the quadratic form of the value function: substituting a stabilizing control policy into the value function yields a quadratic form whose kernel matrix is P.
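In the assumed notation above, Lemma 1's quadratic form and the resulting LQT Bellman equation read approximately as follows (up to constant factors and notational differences with the paper):

$$
V(X_k) = X_k^{\top} P X_k,
\qquad
X_k^{\top} P X_k = X_k^{\top} Q_1 X_k + u_k^{\top} R\, u_k + \gamma\, X_{k+1}^{\top} P X_{k+1}.
$$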
Theorem 1 gives the ARE whose solution provides a causal solution of the LQT. Choosing a smaller discount factor and a larger weight Q makes the tracking error smaller. The augmented LQT ARE differs from the standard LQT ARE only by the discount factor.
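Spelled out in the same assumed notation, the augmented (discounted) LQT ARE and the causal optimal control it yields take roughly the following form:

$$
P = Q_1 + \gamma\, T^{\top} P T
  - \gamma^{2}\, T^{\top} P B_1 \bigl(R + \gamma B_1^{\top} P B_1\bigr)^{-1} B_1^{\top} P T,
$$

$$
u_k^{*} = -K X_k,
\qquad
K = \gamma \bigl(R + \gamma B_1^{\top} P B_1\bigr)^{-1} B_1^{\top} P T .
$$

Setting $\gamma = 1$ recovers the structure of an undiscounted LQR-type ARE, which is the sense in which the augmented LQT ARE differs from the standard one only by the discount factor.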

Theorem 2 establishes the optimality and stability of the ARE solution of the LQT. An error that depends on the discount factor is defined, and the optimal control solving the LQT ARE is obtained under asymptotic stability of this error. The theorem also shows that the system's tracking error is bounded, and that choosing a smaller discount factor or a larger Q makes the tracking error smaller.

RL methods update the value function and the control policy recursively online, using data measured along the system trajectories.
Algorithm 1 is an offline PI algorithm for solving the LQT; it requires an initial stabilizing control policy.
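The algorithm figure did not survive extraction, so here is a minimal model-based policy-iteration sketch for the discounted augmented LQT as a stand-in (my own illustration, reusing the matrices T, B1, Q1, R, gamma defined earlier; it follows the same evaluate-then-improve pattern but is not the paper's exact pseudocode):

```python
import numpy as np

def offline_policy_iteration(T, B1, Q1, R, gamma, K0, n_iter=50):
    """Model-based PI: evaluate the current gain with a discounted Lyapunov
    equation, then improve the gain from the resulting kernel matrix P."""
    K  = K0
    nX = T.shape[0]
    for _ in range(n_iter):
        # Policy evaluation: P = Q1 + K^T R K + gamma (T - B1 K)^T P (T - B1 K)
        Acl = np.sqrt(gamma) * (T - B1 @ K)
        M   = Q1 + K.T @ R @ K
        # Solve the discrete Lyapunov equation by vectorization (small problems only)
        P = np.linalg.solve(np.eye(nX**2) - np.kron(Acl.T, Acl.T),
                            M.reshape(-1)).reshape(nX, nX)
        # Policy improvement
        K = gamma * np.linalg.solve(R + gamma * B1.T @ P @ B1, B1.T @ P @ T)
    return P, K
```

With the toy matrices above, K0 = np.zeros((1, 3)) is an admissible start because sqrt(gamma) * T is already Schur stable for gamma = 0.6; in general, as noted above, the algorithm needs an initial stabilizing gain.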
This algorithm is offline and requires full knowledge of the dynamics. The Bellman equation (rather than the Lyapunov equation) is therefore used to evaluate the control policy online, which gives Algorithm 2; it still requires an initial stabilizing control policy.
Although solving the Bellman equation requires no knowledge of the system dynamics, updating the control policy (Eq. (53) in the paper) still does.
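Roughly, the two steps of that online PI look as follows in the assumed notation (see the paper for the exact equations). The evaluation step uses only measured data (X_k, u_k, X_{k+1}), whereas the improvement step still contains T and B_1, which is exactly the dependence on the dynamics mentioned above:

$$
X_k^{\top} P^{\,j+1} X_k
  = X_k^{\top} Q_1 X_k + (u_k^{\,j})^{\top} R\, u_k^{\,j}
    + \gamma\, X_{k+1}^{\top} P^{\,j+1} X_{k+1}
\quad \text{(policy evaluation)},
$$

$$
K^{\,j+1} = \gamma \bigl(R + \gamma B_1^{\top} P^{\,j+1} B_1\bigr)^{-1} B_1^{\top} P^{\,j+1} T
\quad \text{(policy improvement)}.
$$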
Building on the Q-function, Algorithm 3 uses the Q-function Bellman equation: a PI algorithm is formulated in terms of the LQT Q-function.
The kernel matrix H is learned, so no system dynamics are needed at all. Policy evaluation is implemented by least squares (LS).
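Below is a minimal sketch of how such a Q-function policy iteration can be implemented with least squares (my own illustration in the assumed notation, not the paper's code): collect transition tuples (X_k, u_k, X_{k+1}), regress the symmetric kernel H of Q(X, u) = [X; u]^T H [X; u] from the Q-function Bellman equation, then read the improved gain off the blocks of H, so that neither T nor B1 is needed:

```python
import numpy as np

def quad_features(z):
    """Quadratic basis phi(z) with phi(z) @ vech(H) = z^T H z for symmetric H."""
    q, feats = len(z), []
    for i in range(q):
        for j in range(i, q):
            feats.append(z[i] * z[j] * (1.0 if i == j else 2.0))
    return np.array(feats)

def vech_to_sym(theta, q):
    """Rebuild the symmetric matrix from the parameter vector used above."""
    H, idx = np.zeros((q, q)), 0
    for i in range(q):
        for j in range(i, q):
            H[i, j] = H[j, i] = theta[idx]
            idx += 1
    return H

def q_learning_policy_iteration(data, Q1, R, gamma, K, n_iter=20):
    """Data-driven PI on the Q-function kernel H; `data` holds (X, u, X_next) tuples."""
    nX, m = Q1.shape[0], R.shape[0]
    q = nX + m
    for _ in range(n_iter):
        Phi, targets = [], []
        for X, u, X_next in data:
            z      = np.concatenate([X, u])
            u_next = -K @ X_next                  # on-policy action at the next state
            z_next = np.concatenate([X_next, u_next])
            # Q-function Bellman equation: phi(z)^T theta - gamma phi(z')^T theta = one-step cost
            Phi.append(quad_features(z) - gamma * quad_features(z_next))
            targets.append(X @ Q1 @ X + u @ R @ u)
        theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)
        H = vech_to_sym(theta, q)
        # Policy improvement from the blocks of H: u = -Huu^{-1} Hux X
        Hux, Huu = H[nX:, :nX], H[nX:, nX:]
        K = np.linalg.solve(Huu, Hux)
    return H, K
```

Since H has (nX + m)(nX + m + 1)/2 independent entries, at least that many sufficiently exciting data tuples are needed for the least-squares problem to be well posed, which is where the PE condition mentioned next comes in.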

If the system state is not on the desired trajectory or position, a persistence of excitation (PE) condition is required.
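A common way to obtain that excitation is to add a small probing noise to the control input while collecting data. The sketch below (illustrative noise magnitude, same assumed notation) produces the (X, u, X_next) tuples consumed by the least-squares step above:

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_data(T, B1, K, X0, n_steps=200, noise_std=0.1):
    """Roll out the augmented system under u = -K X plus probing noise."""
    data, X = [], X0
    for _ in range(n_steps):
        u = -K @ X + noise_std * rng.standard_normal(K.shape[0])  # exploration noise
        X_next = T @ X + B1 @ u
        data.append((X, u, X_next))
        X = X_next
    return data
```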
Outlook: solve the LQT problem with this method using only measured input data and reference-trajectory data; and consider a value-iteration (VI) implementation, which would not require an initial admissible control policy.

MATLAB simulation (Frank L. Lewis)

Code available via private message. Since many readers have asked for the simulations of this paper, here is Frank L. Lewis's website, which hosts many simulation codes from past years that you can download yourself: https://lewisgroup.uta.edu/code/Software%20from%20Research.htm

Feel free to get in touch to exchange ideas and learn together.
