Implementing Tic-Tac-Toe with Simple Reinforcement Learning

I. A Brief Introduction to Reinforcement Learning
Reinforcement learning can be understood as a process in which an agent interacts with an environment, learns from it, and improves. In tic-tac-toe, one player can simply be regarded as the agent and the other as the environment. The interaction involves the following four elements:

State: a possible situation or position. In tic-tac-toe, it is the arrangement of pieces on the board together with who moves first.
Action: the transition from one state to another. In tic-tac-toe, it is the next move.
Value: a measure of how good a state is. In tic-tac-toe, it is the probability of winning from the current position. The mapping from states to values is called the value function.
Reward: the change in value brought about by an action. In tic-tac-toe, it indicates how good a particular move is.
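These four elements map naturally onto simple data structures. A minimal sketch under the representation used later in this post (the names `value_table` and `reward` here are illustrative, not taken from the code below):

```python
import numpy as np

state = np.zeros((3, 3))       # state: the board; 0 = empty, 1 / -1 = the two players
action = (1, 1)                # action: place a piece at row 1, column 1
state[action] = 1

value_table = {}               # value function: maps a (hashed) state to its value
value_table[state.tobytes()] = 0.5   # unknown states start at a neutral value

reward = 0.0                   # reward: the change in value produced by a move
```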
Reinforcement learning essentially mimics how humans learn. Imagine a child learning tic-tac-toe. At first he cannot play at all; he only knows that winning positions have the highest value and losing positions the lowest, so he can only place pieces at random (explore). After a few random games, he learns that the states a few moves before a win have relatively high value, while the states a few moves before a loss have relatively low value; exploration thus refines his estimate of each state's value. He also exploits the knowledge he already has (exploit): before each move, he considers every board position (state) reachable by his next move, then picks the one most favorable to himself (max value). Through exploitation, he uses his current knowledge to maximize his probability of winning.
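The explore/exploit trade-off described above is commonly implemented as an ε-greedy rule, which is also what the training code below uses. A minimal standalone sketch (the names `epsilon_greedy`, `candidates`, and `value_of` are illustrative):

```python
import random

def epsilon_greedy(candidates, value_of, epsilon=0.1):
    """With probability epsilon pick a random candidate (explore);
    otherwise pick the candidate with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(candidates)   # explore: ignore the value estimates
    return max(candidates, key=value_of)   # exploit: use current knowledge

# Example: three possible moves with known value estimates
values = {'a': 0.2, 'b': 0.9, 'c': 0.5}
move = epsilon_greedy(list(values), values.get, epsilon=0.0)  # pure exploitation -> 'b'
```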

II. The Tic-Tac-Toe Algorithm
1. Training
The reinforcement-learning training procedure for tic-tac-toe is as follows:

repeat epochs times
    while true:
        if the game has ended in a win, loss, or tie
            record the result, break
        randomly choose between explore and exploit
        if explore was chosen
            place a piece on a randomly chosen empty position
        else (exploit was chosen)
            look up value_table and place a piece on the position leading to the state with the highest value
            update the value of the previous state in value_table based on the value of the new state

The step "update the value of the previous state in value_table based on the value of the new state" is the crucial part: it defines the learning rule, i.e., whether the agent can actually learn anything. Since the state logic of tic-tac-toe is very simple, the following simple expression suffices:

V(S) = V(S) + α(V(S′) − V(S))

Here V denotes the value function, S the current state, S′ the new state, V(S) the value of state S, and α the learning rate, a tunable hyperparameter.
The other parameters to control are the number of training games (epochs) and the exploration probability ε.
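As a small worked example of this update rule (the numbers are illustrative): suppose the current state has V(S) = 0.5, the chosen move leads to a winning state with V(S′) = 1.0, and α = 0.1. The estimate of S then moves one step toward the better state's value:

```python
alpha = 0.1          # learning rate
v_s = 0.5            # value of the current state S
v_s_next = 1.0       # value of the new state S' (a winning state)

v_s = v_s + alpha * (v_s_next - v_s)   # V(S) = V(S) + a * (V(S') - V(S))
print(v_s)  # prints 0.55
```

Repeated over many games, these small steps propagate the value of terminal states backward to the states that precede them.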
 

# -*- coding: utf-8 -*-
"""
Created on Tue Jun 11 13:51:32 2019

@author: judy.yuan
"""

import numpy as np
import pickle

BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS


class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))
        self.winner = None
        self.hash_val = None
        self.end = None

    # compute the hash value for one state, it's unique
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):
                self.hash_val = self.hash_val * 3 + i + 1
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check row
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))

        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')


def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)


def get_all_states():
    current_symbol = 1
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states


# all possible board configurations
all_states = get_all_states()


class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()
        self.reset()
        current_state = State()
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)
            i, j, symbol = player.act()
            next_state_hash = current_state.next_state(i, j, symbol).hash()
            current_state, is_end = all_states[next_state_hash]
            self.p1.set_state(current_state)
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner


# AI player
class Player:
    # @step_size: the step size to update estimations
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()
        self.step_size = step_size
        self.epsilon = epsilon
        self.states = []
        self.greedy = []
        self.symbol = 0

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a lose
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]

        for i in reversed(range(len(states) - 1)):
            state = states[i]
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state]
            )
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]
        next_states = []
        next_positions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(
                        i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False
            return action

        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # to select one of the actions of equal value at random due to Python's sort is stable
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)


# human interface
# input a number to put a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol


def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()


def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))


# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")


if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()
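For reference, `State.hash()` above encodes the board as a base-3 integer: each cell contributes a digit `cell + 1` ∈ {0, 1, 2}, read in row-major order, so every board maps to a unique hash. A standalone sketch of the same encoding (the function name `board_hash` is made up for illustration):

```python
import numpy as np

def board_hash(data):
    # Same scheme as State.hash(): base-3 digits are cell values shifted by +1
    h = 0
    for cell in np.nditer(data):
        h = h * 3 + int(cell) + 1
    return h

empty = np.zeros((3, 3))
assert board_hash(empty) == (3**9 - 1) // 2   # nine digits of 1 -> 9841

board = np.zeros((3, 3))
board[0, 0] = 1                      # first player takes the top-left corner
# the leading base-3 digit becomes 2 instead of 1, raising the hash by 3**8
assert board_hash(board) == board_hash(empty) + 3**8
```

Note that the hash encodes only piece placement, not whose turn it is; in tic-tac-toe the turn is implied by the counts of the two players' pieces, so this is still unambiguous.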

 
