Reinforcement Learning Notes 1: Introduction

Originally published 2021-02-14 22:44:15

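Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, include a link to the original source and this notice.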

1 Intelligence -> actions

Machine learning and deep learning have produced breakthroughs.
In some respects they surpass humans; in others they fall short (they are fragile).

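2 Different kinds of mistakes lead to misclassification

A component depends on its environment and its neighbors -> explainable AI is needed.
Data-driven AI has succeeded thanks to large amounts of available data, computing power, and complex computational models.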

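3 But strong AI has not been reached yet; the bottleneck is

Two kinds of recognition

Knowing how (does not generalize; easy to learn and use)
Knowing why (no learning is needed if we know the components; easier to communicate and to extend, cf. zero-shot learning)
How do we find these components?
Take the n-dimensional cube (-1,1)^n and draw the largest sphere inside it.

Searching for the optimal solution is hard, so the dimension has to be reduced.
What types of factors: summable (the parts add up); restrictive (controllable factors such as color and shape); classical/rule-based AI (e.g. two negatives make a positive)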
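The cube-and-sphere remark can be checked numerically. The sketch below is my own illustration, not part of the lecture: it uses the standard volume formula for the unit n-ball to compute what fraction of the cube [-1,1]^n the largest inscribed sphere fills. The function name `inscribed_ball_fraction` is made up for this example.

```python
import math

def inscribed_ball_fraction(n: int) -> float:
    """Fraction of the cube [-1, 1]^n filled by the inscribed unit ball."""
    # Volume of the unit n-ball: pi^(n/2) / Gamma(n/2 + 1)
    ball_volume = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    # Volume of the cube with side length 2
    cube_volume = 2.0 ** n
    return ball_volume / cube_volume

for n in (1, 2, 3, 5, 10, 20):
    print(n, inscribed_ball_fraction(n))
```

The fraction shrinks quickly as n grows, so almost all of the volume sits in the corners of the cube. That is one way to see why search in high dimensions is hard and why reducing the dimension helps.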

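4 Intelligence -> actions -> controlling -> an intelligent agent knows its own actions

Delay:

Control should arrive just in time.
The agent should observe its own action exactly in time.
In the human brain, the delays of seeing and deciding also add up to a lot.
Model-based prediction: people mistakenly believe that seeing the hand, the hand moving, and the brain issuing the command all happen at the same moment.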
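A toy sketch of why model-based prediction helps with delay. Everything here is an assumption made for illustration and is not from the lecture: the constant-velocity motion, the 0.2 s delay, and the function names. The agent only sees where the hand was `delay` seconds ago, and a simple forward model extrapolates to where it is now.

```python
def true_position(t, velocity=2.0):
    """Hypothetical hand position at time t (constant-velocity motion)."""
    return velocity * t

def delayed_observation(t, delay):
    """What the agent sees at time t: the position from `delay` seconds ago."""
    return true_position(t - delay)

def predicted_position(t, delay, dt=0.01):
    """Forward model: estimate velocity from two delayed samples, then extrapolate."""
    newest = delayed_observation(t, delay)
    older = delayed_observation(t - dt, delay)
    velocity_estimate = (newest - older) / dt
    return newest + velocity_estimate * delay

t, delay = 1.0, 0.2
raw_error = abs(true_position(t) - delayed_observation(t, delay))
pred_error = abs(true_position(t) - predicted_position(t, delay))
print(raw_error, pred_error)
```

With this toy motion the raw delayed observation lags behind the true position, while the extrapolated estimate lands on it up to floating-point error. Real motion is not constant-velocity, so a real forward model would need to be learned or much richer.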

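An intelligent agent can find components

It separates itself from the environment
It needs to use itself to represent others
It also needs to predict itself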

Types of action

reflexes
conditional reflexes
reflex-like learned actions
actions launched after thinking/planning

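Reinforcement learning: a branch of AI that learns from interaction

No prior information; based on trial and error
Goal-oriented: maximize reward over the long term
Example: a board game has a huge number of possible positions
Loop: agent (strategy) -> action -> environment -> (state, reward) -> agent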
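"Maximize reward over the long term" is usually made precise as the discounted return from Sutton and Barto, G = r_1 + gamma*r_2 + gamma^2*r_3 + ... The discount factor gamma and the value 0.9 are assumptions added here; the notes do not mention them. The reward sequence mirrors the sparse 0 0 0 0 0 1 pattern noted in the components list.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma per step of delay."""
    g = 0.0
    # Work backwards so each step is one multiply and one add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

sparse_rewards = [0, 0, 0, 0, 0, 1]   # reward arrives only at the very end
print(discounted_return(sparse_rewards, gamma=0.9))
```

A smaller gamma makes the agent more short-sighted; gamma = 1 weighs all rewards equally.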
Components of RL:

Environment -> a black box
Learning agent
State: the agent's observation within the environment
Reward (e.g. the sequence 0 0 0 0 0 1)
Strategy: a state -> action mapping
Action: affects the environment
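The components listed above can be wired together in a few lines. This is a minimal sketch under my own assumptions, not code from the lecture: `ChainEnvironment`, `random_strategy`, and `run_episode` are hypothetical names, and the environment is a six-position chain that pays reward 1 only at the last position.

```python
import random

class ChainEnvironment:
    """Hypothetical black-box environment: positions 0..5, reward 1 only at position 5."""

    def __init__(self, length=6):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the chain
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1 if done else 0
        return self.state, reward, done

def random_strategy(state, rng):
    """Strategy = state -> action mapping; this one ignores the state entirely."""
    return rng.choice((-1, 1))

def run_episode(env, strategy, rng, max_steps=1000):
    """Agent-environment loop: observe state, act, receive the next state and reward."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = strategy(state, rng)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards

rewards = run_episode(ChainEnvironment(), random_strategy, random.Random(0))
print(len(rewards), sum(rewards))
```

The agent only ever calls `reset` and `step`, so the environment stays a black box from its point of view. Learning would mean replacing `random_strategy` with a mapping that improves from the observed rewards.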

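RL is hard

Because the agent does not know what is good: there is only a critic, no teacher, and the optimal solution has to be found by trial and error!

The reward for a good decision may be delayed (short-term reward is not the same as long-term reward; which decision should be rewarded?)
The environment is uncertain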
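One standard way to handle trial-and-error learning with a delayed reward is tabular Q-learning, described in Sutton and Barto. The notes do not name a specific algorithm, so treat this as an illustrative sketch. The chain environment and the values of `alpha`, `gamma`, and `epsilon` are assumptions.

```python
import random

N_STATES = 6        # positions 0..5; reward 1 only on reaching position 5
ACTIONS = (-1, 1)   # index 0 = left, index 1 = right

def step(state, action):
    """Deterministic toy environment with a single delayed reward at the end."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Trial and error: explore with probability epsilon, or when values tie
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # The critic signal: reward plus the discounted best value of the next state
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = q_learning()
greedy = ["left" if left > right else "right" for left, right in q[:-1]]
print(greedy)
```

No teacher ever tells the agent that moving right is correct. The single reward at the end is propagated backwards through the value estimates, which is how the question of which decision should be rewarded gets resolved in this toy case.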

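Autonomous systems

An intelligent distributed agent should preferably be conscious

Aware of the episodes currently taking place
Able to learn components and separate itself from the environment (self-conscious)
Able to model the mental states of the agents
Able to make decisions about the future of other agents
At the level of legal rights and obligations
Should remain within what the legal system can control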

Reference: Sutton R, Barto A. Reinforcement Learning: An Introduction[M]. MIT Press, 1998.
