蒙特卡洛树搜索

最新推荐文章于 2025-05-03 19:44:58 发布

wniuniu_

最新推荐文章于 2025-05-03 19:44:58 发布

阅读量1.1k

点赞数 18

分类专栏：算法文章标签：算法

本文链接：https://blog.youkuaiyun.com/wniuniu_/article/details/146106907

版权

算法专栏收录该内容

240 篇文章

订阅专栏

需要确定蒙特卡洛树搜索的基本步骤。通常分为四个阶段：选择（Selection）、扩展（Expansion）、模拟（Simulation）和回溯（Backpropagation）。选择阶段是从根节点开始，按照某种策略选择子节点，直到到达一个可以扩展的节点。扩展阶段则是添加一个新的子节点。模拟阶段是从这个新节点开始，随机进行游戏直到结束，得到结果。最后，回溯阶段将模拟的结果反向传播到路径上的所有节点，更新它们的统计信息。

蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）是一种基于随机模拟的启发式搜索算法，主要用于复杂决策问题（如棋类游戏、路径规划等）。其核心思想是通过大量随机模拟探索可能的决策路径，逐步构建搜索树并优化决策选择。以下是详细讲解：

1. MCTS 的核心步骤

MCTS 通过迭代执行四个阶段来优化决策，每次迭代包含以下步骤：

(1) 选择（Selection）

目标：从根节点（当前状态）出发，根据策略选择子节点，直到到达一个可扩展的节点。
策略：通常使用 UCB1（Upper Confidence Bound） 公式平衡探索与利用：
$\text{UCT}(v_i) = \frac{w_i}{n_i} + c \cdot \sqrt{\frac{\ln N}{n_i}}$
- (w_i)：节点 (v_i) 的累计奖励（如胜利次数）。
- (n_i)：节点 (v_i) 的访问次数。
- (N)：父节点的总访问次数。
- (c)：探索系数（通常设为 (\sqrt{2})）。
选择逻辑：优先选择 UCT 值高的子节点。

(2) 扩展（Expansion）

条件：当选择的节点未被完全展开（存在未探索的子节点）时，扩展树。
操作：添加一个或多个合法动作对应的新子节点到树中。

(3) 模拟（Simulation）

目标：从新扩展的节点出发，通过随机策略（如快速走子）模拟游戏直到终局。
结果：根据模拟结果得到奖励（如胜利=1，失败=0）。

(4) 回溯（Backpropagation）

目标：将模拟结果反向传播到路径上的所有节点，更新统计信息。
更新规则：
- 路径上每个节点的访问次数 (n_i) 增加 1。
- 累计奖励 (w_i) 根据模拟结果累加。

2. MCTS 的优势

无需启发函数：通过随机模拟代替人工设计评估函数。
适应高分支因子：适用于围棋等状态空间巨大的问题。
渐进优化：搜索时间越长，结果越可靠（Anytime Algorithm）。
平衡探索与利用：UCB1 公式避免陷入局部最优。

3. 应用场景

经典案例：AlphaGo 结合 MCTS 与神经网络击败人类围棋冠军。
其他场景：
- 棋类游戏（国际象棋、扑克）。
- 实时策略游戏（如《星际争霸》AI）。
- 机器人路径规划、资源调度等。

4. 变体与改进

UCT（Upper Confidence Trees）：使用 UCB1 的标准 MCTS。
结合神经网络（如 AlphaGo）：
- 策略网络：指导模拟阶段的选择。
- 价值网络：替代随机模拟，直接评估节点胜率。
并行化 MCTS：通过多线程/分布式加速模拟。

5. 示例：围棋中的 MCTS

初始化：当前棋盘状态为根节点。
迭代：执行数千次 Selection→Expansion→Simulation→Backpropagation。
决策：选择根节点下访问次数最多的子节点作为落子位置。

6. 局限性

计算开销大：需大量模拟才能收敛。
依赖模拟质量：随机模拟可能无法反映真实策略。
动态环境适应慢：对实时变化的响应可能不足。

我们可以通过一些简单的python代码来理解代码

以下是一个简单的蒙特卡洛树搜索（MCTS）的 Python 代码示例，以解决一个假设的“二选一”游戏问题。游戏规则如下：

有两个动作（A 和 B），选择 A 有 70% 的概率获胜，选择 B 有 30% 的概率获胜。
MCTS 的目标是通过模拟找到更优的动作。

代码实现

import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # 节点状态（如当前选择的动作）
        self.parent = parent    # 父节点
        self.children = []      # 子节点列表
        self.visits = 0         # 节点访问次数
        self.value = 0          # 节点累计奖励（如胜利次数）

    def is_fully_expanded(self):
        # 假设每个节点最多有两个子节点（对应动作 A 和 B）
        return len(self.children) == 2

def ucb(node, exploration_weight=1.414):
    if node.visits == 0:
        return float('inf')  # 未访问过的节点优先选择
    # UCB公式：平均奖励 + 探索项
    return (node.value / node.visits) + exploration_weight * math.sqrt(math.log(node.parent.visits) / node.visits)

def select(node):
    # 选择阶段：递归选择 UCB 值最高的子节点，直到叶子节点
    while node.is_fully_expanded():
        best_child = None
        best_ucb = -float('inf')
        for child in node.children:
            current_ucb = ucb(child)
            if current_ucb > best_ucb:
                best_ucb = current_ucb
                best_child = child
        node = best_child
    return node

def expand(node):
    # 扩展阶段：添加未探索的子节点
    possible_actions = ['A', 'B']
    for action in possible_actions:
        if action not in [child.state for child in node.children]:
            new_node = Node(state=action, parent=node)
            node.children.append(new_node)
            return new_node  # 每次只扩展一个子节点
    return node

def simulate(action):
    # 模拟阶段：根据动作的预设概率返回胜负
    if action == 'A':
        win_prob = 0.7  # 动作 A 的胜利概率
    else:
        win_prob = 0.3  # 动作 B 的胜利概率
    return 1 if random.random() < win_prob else 0

def backpropagate(node, reward):
    # 回溯阶段：更新路径上的所有节点
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts(root, iterations=1000):
    for _ in range(iterations):
        # 1. 选择阶段
        selected_node = select(root)
        # 2. 扩展阶段
        if selected_node.visits > 0:  # 非叶子节点才扩展
            new_node = expand(selected_node)
        else:
            new_node = selected_node
        # 3. 模拟阶段
        reward = simulate(new_node.state)
        # 4. 回溯阶段
        backpropagate(new_node, reward)
    return root

if __name__ == "__main__":
    # 初始化根节点（假设根节点状态为空）
    root = Node(state=None)
    # 扩展根节点的初始子节点（动作 A 和 B）
    root.children = [Node(state='A', parent=root), Node(state='B', parent=root)]
    
    # 运行 MCTS 算法
    root = mcts(root, iterations=1000)
    
    # 输出结果
    total_visits = sum(child.visits for child in root.children)
    for child in root.children:
        print(f"动作 {child.state}: 访问次数={child.visits}, 胜率={child.value / child.visits:.2f}")

代码说明

Node 类：
- 保存节点的状态、父节点、子节点列表、访问次数和累计奖励。
UCB 公式：
- 平衡探索与利用，优先选择未访问过的节点（返回无限大）。
四个阶段：
- Selection：递归选择 UCB 值最高的子节点。
- Expansion：扩展未完全展开的节点。
- Simulation：根据动作预设概率模拟胜负。
- Backpropagation：更新路径上的所有节点的访问次数和奖励。
运行结果：
- 经过 1000 次迭代后，动作 A 的访问次数和胜率会显著高于动作 B。

示例输出

动作 A: 访问次数=732, 胜率=0.71
动作 B: 访问次数=268, 胜率=0.29

可以看到，MCTS 成功找到了胜率更高的动作 A。你可以通过调整 simulate 函数中的概率或迭代次数观察结果变化。

但是上面这个示范可能还不够，再看看下面这个例子

新示例：迷宫路径搜索

问题描述

迷宫结构：一个 3x3 网格迷宫，起点在左上角（0,0），终点在右下角（2,2）。
动作：每一步可以移动 ↑、↓、←、→，但不能超出边界。
奖励：到达终点奖励 +1，其他位置无奖励。
目标：用 MCTS 找到从起点到终点的最优路径。

MCTS 的核心挑战

状态空间复杂：每一步有多个可能的移动方向。
长期奖励未知：需通过模拟探索路径，而非依赖预先设定的概率。

改进后的 Python 代码

import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # 当前坐标 (x, y)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0          # 累计奖励

    def is_fully_expanded(self):
        # 完全扩展：所有合法移动方向都已生成子节点
        return len(self.children) >= len(get_possible_actions(self.state))

def ucb(node, exploration_weight=1.414):
    if node.visits == 0:
        return float('inf')
    return (node.value / node.visits) + exploration_weight * math.sqrt(math.log(node.parent.visits) / node.visits)

def select(node):
    while node.is_fully_expanded() and not is_terminal(node.state):
        best_child = max(node.children, key=ucb)
        node = best_child
    return node

def expand(node):
    possible_actions = get_possible_actions(node.state)
    for action in possible_actions:
        if action not in [child.state[-1] for child in node.children]:
            new_state = apply_action(node.state, action)
            new_node = Node(state=new_state, parent=node)
            node.children.append(new_node)
            return new_node
    return node

def simulate(state):
    # 随机移动直到终点或步数耗尽
    current_state = state
    for _ in range(10):  # 最大模拟步数
        if is_terminal(current_state):
            return 1  # 到达终点
        possible_actions = get_possible_actions(current_state)
        if not possible_actions:
            return 0  # 无路可走
        action = random.choice(possible_actions)
        current_state = apply_action(current_state, action)
    return 0  # 超时

def backpropagate(node, reward):
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts(root, iterations=1000):
    for _ in range(iterations):
        selected = select(root)
        if not is_terminal(selected.state):
            expanded = expand(selected)
            reward = simulate(expanded.state)
        else:
            reward = 1  # 直接到达终点
        backpropagate(expanded if 'expanded' in locals() else selected, reward)
    return root

# 迷宫环境相关函数
def get_possible_actions(state):
    x, y = state
    actions = []
    if x > 0:
        actions.append('↑')
    if x < 2:
        actions.append('↓')
    if y > 0:
        actions.append('←')
    if y < 2:
        actions.append('→')
    return actions

def apply_action(state, action):
    x, y = state
    if action == '↑': x -= 1
    elif action == '↓': x += 1
    elif action == '←': y -= 1
    elif action == '→': y += 1
    return (x, y)

def is_terminal(state):
    return state == (2, 2)

if __name__ == "__main__":
    root = Node(state=(0, 0))
    root = mcts(root, iterations=2000)
    
    # 输出最优路径
    best_path = []
    current = root
    while current.children:
        best_child = max(current.children, key=lambda c: c.visits)
        best_path.append(best_child.state)
        current = best_child
    print("最优路径:", best_path)

关键改进说明

动态环境建模：
- 每个节点的状态是当前坐标 (x, y)，而非固定动作。
- 通过 get_possible_actions 和 apply_action 动态生成子节点，模拟迷宫移动。
长期奖励探索：
- 模拟阶段：从扩展节点出发，随机移动直到终点或超时。
- 回溯阶段：奖励（是否到达终点）反向更新路径上的所有节点。
MCTS 的核心价值体现：
- 树结构动态扩展：根据移动方向生成子节点。
- 平衡探索与利用：UCB 公式指导选择高奖励路径，同时尝试未充分探索的方向。
- 无需预先知识：完全通过模拟学习迷宫的最优路径。

示例输出

最优路径: [(0, 1), (0, 2), (1, 2), (2, 2)]

（实际路径可能因随机模拟略有不同，但会趋近于最短路径 ↓→→ 或 →→↓）

总结

在改进后的示例中，MCTS 通过以下步骤真正体现了其核心能力：

动态构建搜索树：根据移动方向扩展节点。
模拟未知路径：无需预先知道迷宫结构。
优化长期奖励：通过回溯逐步聚焦到高胜率路径。