Solving the Bin Packing Problem with Deep Reinforcement Learning (Policy Gradient)

This is an experiment that uses deep reinforcement learning with policy gradient to solve the bin packing problem. Q-learning is not used for now; AC (Actor-Critic) or PPG could be tried later, and the same idea could also be applied to the VRP.

The network is not actually deep: only four features and the simplest DNN, just enough to get a skeleton in place. Due to time constraints there is still plenty left to optimize, so treat this as a reference for the idea rather than a finished solution. A quick search turns up few articles with code that apply reinforcement learning to the VRP or bin packing, and some of the ones that do use RL only to tune operators, still relying on a heuristic algorithm underneath. Here the whole pipeline is driven by reinforcement learning directly. Allow me a quiet little pat on my own back.

There are only two actions: take the cargo or skip it. The state currently uses four features: the trailer's volume and weight ratios before and after loading the candidate cargo. Yes, the state depends on the cargo. During training, each step draws a new (randomly selected) cargo and produces the state corresponding to it.
At each step: skip means compute the skip reward, return the old cargo to the pool, and generate a new cargo and state; take means compute the take reward and, if the episode is not yet done, continue generating a new cargo and state.
The state depends on the cargo, and it also depends on the action. Cargoes are drawn at random for now (this could later be improved to sample with probability weighted by volume; a sketch of that idea follows below). The action should, in principle, come from the network's forward pass. At the very start of training, however, if the first cargo is skipped, all four state features are zero and the network has nothing to work with, so during training the action for the first cargo is fixed to take. In the final forward pass, always taking the first cargo is no longer appropriate, because that cargo's volume may be very small. Compute the action directly from the network? Computing the action needs a state... and the state differs depending on the action... which turns into a chicken-and-egg problem. (As an aside, in the eyes of paleontologists the egg came first. The egg in question is the amniotic egg, and the earliest amniotic eggs were laid by Hylonomus about 300 million years ago. Hylonomus was a step ahead of the other early amphibians, whose reproduction still depended on water; it laid amniotic eggs that did not need to hatch in water, and the earliest chicken-like animals only appeared on Earth about 100 million years later.) So what to do? Train a separate network to learn which cargo is best to take in the initial state? Not for now. Instead, I use the ratio of the current cargo's volume to the batch average volume (actually the median, to avoid the influence of outliers); if the cargo is above that value, it is selected with 3:1 odds. The 3:1 figure is configurable.
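For the volume-weighted sampling mentioned above, a hypothetical variant of BinningEnv.random_pop_cargo might look like this (a sketch only; the current code samples uniformly):

def random_pop_cargo_weighted(self):
    # draw larger cargoes more often: weight each remaining cargo by its volume
    keys = list(self.remain_cargoes.keys())
    weights = [self.remain_cargoes[k]["volume"] for k in keys]
    tmp_key = random.choices(keys, weights=weights)[0]
    tmp_val = self.remain_cargoes.pop(tmp_key)
    return tmp_key, tmp_val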

The current code does not consider the effect of cargo length, width, and height (the volume may fit while a dimension does not). To add this, data such as the summed length/width/height of all items in the trailer relative to the trailer's own dimensions could be fed into the network, i.e., a few more features, and the negative feedback for exceeding volume or weight would be extended with negative feedback for exceeding a dimension. A rough sketch of that idea follows.
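The helpers below are hypothetical (not part of the current code); they assume the cargoes keep the depths/widths/heights fields from gen_trailer_cargo.py and that the trailer gains comparable dimension fields:

def dimension_features(loaded_cargoes, trailer_dims):
    # ratios of summed cargo dimensions to the trailer dimensions, to append as extra features
    sums = [sum(c["depths"] for c in loaded_cargoes),
            sum(c["widths"] for c in loaded_cargoes),
            sum(c["heights"] for c in loaded_cargoes)]
    return [s / float(t) for s, t in zip(sums, trailer_dims)]

def dimension_penalty(dim_ratios, penalty=-6):
    # same spirit as the volume/weight check at the top of action_1_reward
    return penalty if any(r > 1 for r in dim_ratios) else 0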

Why is there a final forward pass at all? During training, near the end of an episode a cargo may be taken even though the trailer is already over its volume or weight limit, so the earlier loop can only serve as training. A test pass over the training set is still needed at the end. And if the same trained network is used to handle other batches of cargo, that amounts to a validation pass (a save/restore sketch follows).
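A minimal sketch of reusing the trained network on another batch. The checkpoint path and the gen_cargo_df call are assumptions of mine; the Saver would be created inside binning_run after the agent has built the graph, and none of these save/restore calls exist in the current code:

# inside binning_run, once PolicyGradient(...) has built the graph
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop as above ...
    saver.save(sess, "./pg_binning.ckpt")      # keep the trained weights

# validation: a different cargo batch, the same trained network
val_cargoes = gen_cargo_df(quantity=50)        # assuming gen_cargo_df is imported here
with tf.Session() as sess:
    saver.restore(sess, "./pg_binning.ckpt")
    env = BinningEnv(val_cargoes, trailers)
    # ... run the same greedy forward loop as at the end of binning_run ...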

An episode is done because one trailer is full (volume ratio 0.95 or above). If the batch still has remaining cargo, env_done (my own term) does not become True: one batch of cargo is treated as one env. Of course, following a simulated pipeline, different batches would arrive over time, and separate batch-merging handling could be added on top.

During the forward pass the action is often skip many times in a row; too many skips make the process very long and the trailer never fills up. There are two reasons: 1) the network is not trained well enough, and 2) the labels are not good, i.e., the reward settings used in training are not good enough.

Setting the rewards well is very important; they play the role that labels play in supervised learning. Exceeding the weight or volume limit of course gets the largest negative feedback. Successfully taking a cargo when the volume ratio is already high earns a larger reward, while taking a small-volume item when the trailer is still mostly empty earns a smaller one. The increment thresholds 0.1, 0.3, 0.5 also have to be set carefully. All of these should eventually become configuration items (a config-driven sketch follows the code below).

def action_1_reward(self, cur_vol_ratio, cur_weight_ratio, max_vol, before_vol):
    if cur_vol_ratio > 1 or cur_weight_ratio > 1:
        return -10

    before_vol_ratio = before_vol * 1.0 / max_vol

    if before_vol_ratio < 0.5 and (cur_vol_ratio - before_vol_ratio) < 0.1:
        return 1
    if before_vol_ratio < 0.5 and (cur_vol_ratio - before_vol_ratio) < 0.3:
        return 2
    if before_vol_ratio < 0.5 and (cur_vol_ratio - before_vol_ratio) < 0.5:
        return 3

    if before_vol_ratio > 0.9:
        return 5
    if before_vol_ratio > 0.8:
        return 3

    return 1

def action_0_reward(self):
    return 0
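One possible config-driven shape for the function above (hypothetical, with the same logic and scores as the excerpt):

REWARD_CFG = {
    "overflow_penalty": -10,
    "low_fill_threshold": 0.5,
    "increment_rewards": [(0.1, 1), (0.3, 2), (0.5, 3)],   # (volume-increment upper bound, reward)
    "high_fill_rewards": [(0.9, 5), (0.8, 3)],             # (before-ratio lower bound, reward)
    "default_reward": 1,
}

def action_1_reward_cfg(cur_vol_ratio, cur_weight_ratio, max_vol, before_vol, cfg=REWARD_CFG):
    if cur_vol_ratio > 1 or cur_weight_ratio > 1:
        return cfg["overflow_penalty"]
    before_vol_ratio = before_vol / max_vol
    if before_vol_ratio < cfg["low_fill_threshold"]:
        for bound, reward in cfg["increment_rewards"]:
            if cur_vol_ratio - before_vol_ratio < bound:
                return reward
    for bound, reward in cfg["high_fill_rewards"]:
        if before_vol_ratio > bound:
            return reward
    return cfg["default_reward"]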

The TensorFlow version used is 1.14. Running the code on TensorFlow 2.x requires a few small modifications; a rough sketch of those follows.
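Roughly what those modifications might look like (my assumption, not something tested in this project): route the 1.x graph/session code through tf.compat.v1 and replace tf.contrib, which no longer exists in 2.x.

# at the top of pg.py, instead of `import tensorflow as tf`
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# replacement for tf.contrib.layers.fully_connected; callers would pass
# tf.glorot_uniform_initializer() in place of tf.contrib.layers.xavier_initializer()
def fully_connected(inputs, num_outputs, activation_fn=None, weights_initializer=None):
    return tf.layers.dense(inputs, num_outputs, activation=activation_fn,
                           kernel_initializer=weights_initializer or tf.glorot_uniform_initializer())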

The code is split across several files; the main logic is listed below for reference.

Key files: pg.py (the network) and binning.py (the binning environment).

Cargo generation: gen_trailer_cargo.py

from collections import namedtuple
import random
import pandas as pd


cargo_weight_range = tuple(range(20, 501, 10))
depth_range = tuple(range(10, 151, 10))
width_range = tuple(range(10, 151, 10))
height_range = tuple(range(20, 81, 10))
# vol_unit = "cm"
trailer_load_bearing = (20000, 40000, 60000)
trailer_vol = (0.001, 0.003, 0.006)
# trailer_vol = (0.05, 0.08, 0.12)


Cargo = namedtuple("Cargo", ("volume", "weight"))


def gen_cargo_df(quantity=50):
    weights = random.choices(cargo_weight_range, k=quantity)
    depths = random.choices(depth_range, k=quantity)
    widths = random.choices(width_range, k=quantity)
    heights = random.choices(height_range, k=quantity)
    # d * w * h is in cubic cm; scaled by 1e-9 to match the trailer_vol units
    vols = [d * w * h * 1e-9 for d, w, h in zip(depths, widths, heights)]

    df = pd.DataFrame({"id": list(range(1, quantity + 1)),
                       "weight": weights,
                       "volume": vols,
                       "depths": depths,
                       "widths": widths,
                       "heights": heights},
                      index=list(range(1, quantity + 1)))

    return df.to_dict(orient="index")


def gen_trailer_df():
    # there is no limit on the number of trailers
    df = pd.DataFrame({"id": list(range(1, len(trailer_vol) + 1)),
                       "load_bearing": trailer_load_bearing,
                       "trailer_vol": trailer_vol},
                      index=list(range(1, len(trailer_vol) + 1)))
    return df.to_dict(orient="index")

The network code, pg.py:

import random
import tensorflow as tf
# from keras.models import Sequential
# from keras.layers import Dense
import numpy as np
from service.sol.reinforce.env.binning import BinningEnv


max_epoch = 60


class PolicyGradient:
    def __init__(self, n_features, n_actions, learning_rate=0.01, reward_decay=0.95, action_0_max_times=2000):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.action_0_max_times = action_0_max_times
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        self.clear_action_0_times()
        self.make_network()


    def make_network(self):
        with tf.name_scope("inputs"):
            self.input_ph = tf.placeholder(tf.float32, [None, self.n_features], name="input_")
            self.actions_ph = tf.placeholder(tf.int32, [None, self.n_actions], name="actions")
            self.discounted_episode_rewards_ph = tf.placeholder(tf.float32, [None, ], name="discounted_episode_rewards")

            with tf.name_scope("fc1"):
                fc1 = tf.contrib.layers.fully_connected(inputs=self.input_ph,
                                                        num_outputs=8,
                                                        activation_fn=tf.nn.relu,
                                                        weights_initializer=tf.contrib.layers.xavier_initializer())

            # with tf.name_scope("fc2"):
            #     fc2 = tf.contrib.layers.fully_connected(inputs=fc1,
            #                                             num_outputs=4,
            #                                             activation_fn=tf.nn.relu,
            #                                             weights_initializer=tf.contrib.layers.xavier_initializer())

            with tf.name_scope("fc3"):
                fc3 = tf.contrib.layers.fully_connected(inputs=fc1,
                                                        num_outputs=self.n_actions,
                                                        activation_fn=None,
                                                        weights_initializer=tf.contrib.layers.xavier_initializer())

            with tf.name_scope("softmax"):
                self.action_distribution = tf.nn.softmax(fc3)

            with tf.name_scope("loss"):
                neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(logits=fc3, labels=self.actions_ph)
                self.loss = tf.reduce_mean(neg_log_prob * self.discounted_episode_rewards_ph)

            with tf.name_scope("train"):
                self.train_opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

    def choose_action(self, state, sess, greedy=False, len_remain=0):
        action_prob = sess.run(self.action_distribution, feed_dict={self.input_ph: state.reshape([1, self.n_features])})[0]
        if greedy:
            action = np.argmax(action_prob)
        else:
            action = random.choices(range(len(action_prob)), weights=action_prob)[0]
        if not action:
            self.action_0_times += 1
            if self.action_0_times >= self.action_0_max_times:
                print(" reach action 0 max_times !! will action 1", len_remain)
                return 1
        return action

    def store_transition(self, s, a, r):
        self.ep_obs.append(np.array([s], np.float32))
        self.ep_as.append(a)
        self.ep_rs.append(r)

    def clear_action_0_times(self):
        self.action_0_times = 0

    def learn(self, sess, print_flag=False):
        discounted_ep_rs_norm = self._discount_and_norm_rewards()
        loss_, _ = sess.run([self.loss, self.train_opt],
                            feed_dict={self.input_ph: np.vstack(np.array(self.ep_obs)),
                                       self.actions_ph: np.vstack(np.eye(2)[self.ep_as]),
                                       self.discounted_episode_rewards_ph: discounted_ep_rs_norm
                                       })
        # self.model.fit(np.vstack(self.ep_obs), self.ep_as, epochs=1000, verbose=0)
        if print_flag:
            print(" ----------------------------- train! %s length reward totally %s  loss is %s"
                  % (len(self.ep_rs), np.sum(self.ep_rs), round(loss_, 3)))
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []  # empty episode data
        self.clear_action_0_times()
        return None

    def _discount_and_norm_rewards(self):
        # build the array as float from the start: np.zeros_like on the integer reward
        # list would give an int array and silently truncate the discounted returns
        discounted_ep_rs = np.zeros(len(self.ep_rs), dtype=np.float32)
        running_add = 0.0

        for t in reversed(range(0, len(self.ep_rs))):
            running_add = running_add * self.gamma + self.ep_rs[t]
            discounted_ep_rs[t] = running_add

        # normalize the discounted returns to zero mean and unit variance
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        std = np.std(discounted_ep_rs)
        discounted_ep_rs /= std if std else 1
        return discounted_ep_rs


def binning_run(cargoes, trailers, max_epoch=max_epoch):
    env = BinningEnv(cargoes, trailers)
    epoch_max_rewards = [(0, -float('inf'))]

    agent = PolicyGradient(
        n_actions=len(env.actions),
        n_features=len(env.features),
        learning_rate=0.00001,
        reward_decay=0.99
    )

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(max_epoch):
            best_info = max(epoch_max_rewards, key=lambda x: x[1])
            print("epoch: %s   best reward %s at epoch %s" % (epoch, best_info[1], best_info[0]))
            env = BinningEnv(cargoes, trailers)
            while not env.env_done():
                state, tmp_cargo = env.start_new_trailer()
                epi_max_reward = 0
                while True:
                    action = agent.choose_action(state.digital(), sess, len_remain=len(env.remain_cargoes))
                    # when the episode is done, the new state_ and tmp_cargo are None
                    state_, reward, done, tmp_cargo = env.step(action, tmp_cargo)
                    agent.store_transition(state, action, reward)

                    # PG uses Monte Carlo returns: only learn once the final state is reached
                    if done:
                        epi_max_reward = max(epi_max_reward, sum(agent.ep_rs))
                        epoch_max_rewards.append((epoch, epi_max_reward))
                        agent.learn(sess, print_flag=epoch % 10 == 0)
                        break
                    state = state_

        print(env.print_assign_info())
        print()

        env = BinningEnv(cargoes, trailers)
        while not env.env_done():
            state, tmp_cargo = env.start_new_trailer()
            trailer_bak = env.cur_trailer
            action = env.get_first_action(tmp_cargo, in_train=False)
            trailer_rewards = 0
            while True:
                state_, reward, done, tmp_cargo = env.step(action, tmp_cargo, in_train=False)
                trailer_rewards += reward
                if done:
                    print("trailer type %s get rewards %s " % (trailer_bak, trailer_rewards))
                    break
                state = state_
                action = agent.choose_action(state.digital(), sess, greedy=True, len_remain=len(env.remain_cargoes))
        print()
        print(env.print_assign_info())
        print("finish")

binning_data.py:

from collections import namedtuple, defaultdict


# TrailerCargosSol = namedtuple("TrailerCargosSol", ("trailer_type", "cargo_list"))
class TrailerCargosSol(namedtuple("TrailerCargosSol", ("trailer_type", "cargo_list"))):
    def __repr__(self):
        return "Trailer_type %s with cargo_ids: %s\n" % (self.trailer_type, self.cargo_list)

The binning environment, binning.py:


from collections import namedtuple, defaultdict
import random
import numpy as np
from service.data_related.binning_data import TrailerCargosSol


class BinningState(namedtuple("BinningState", ("cur_vol_r", "cur_weight_r", "after_vol_r", "after_weight_r"))):
    def __repr__(self):
        return "BinningState: cur_vol %s cur_weight: %s and after_vol_r %s after_weight_r %s\n" \
               % (self.cur_vol_r, self.cur_weight_r, self.after_vol_r, self.after_weight_r)

    def digital(self):
        return np.array([self.cur_vol_r, self.cur_weight_r, self.after_vol_r, self.after_weight_r])


class BinningEnv:
    def __init__(self, cargoes, trailers):
        self.actions = (0, 1)   # 0: skip this cargo, 1: take this cargo
        self.features = BinningState._fields
        self.cargoes = cargoes
        self.trailers = trailers
        self.cargoes_vol_avg = np.percentile([val_dict["volume"] for val_dict in self.cargoes.values()], q=50)
        self.remain_cargoes = cargoes.copy()
        self.assign_info = []
        # self.assign_info.append(TrailerCargosSol(trailers[len(assign_info)], a_trailer_cargos))
        self.clear_cur()

    def clear_cur(self):
        self.cur_trailer = -1
        self.cur_cargoes = []
        self.cur_ratio_state = []
        self.lasted_cargo_state = []

    def episode_done(self, cur_vol_ratio, vol_max_ratio=0.95):
        if cur_vol_ratio >= vol_max_ratio or (not self.remain_cargoes):
            self.assign_info.append(TrailerCargosSol(self.cur_trailer, self.cur_cargoes))
            # print(" eeeeeepisode done ", cur_vol_ratio)
            self.clear_cur()
            return True
        return False

    def print_assign_info(self):
        strs = []
        for trailer, cargoes in self.assign_info:
            tmp1 = sum(self.cargoes[cargo_ix]["volume"] for cargo_ix in cargoes) / self.trailers[trailer]['trailer_vol']
            tmp2 = sum(self.cargoes[cargo_ix]["weight"] for cargo_ix in cargoes) / self.trailers[trailer]['load_bearing']
            strs.append("Trailer_type %s with cargo_ids: %s and vol pct %s, weight pct %s. "
                        % (trailer, cargoes, round(tmp1, 3), round(tmp2, 3)))
        return "\n".join(strs)

    def env_done(self):
        return not bool(self.remain_cargoes)

    def action_1_reward(self, cur_vol_ratio, cur_weight_ratio, max_vol, before_vol):
        if cur_vol_ratio > 1 or cur_weight_ratio > 1:
            return -6

        before_vol_ratio = before_vol * 1.0 / max_vol

        if before_vol_ratio < 0.5 and (cur_vol_ratio - before_vol_ratio) < 0.1:
            return 1
        if before_vol_ratio < 0.5 and (cur_vol_ratio - before_vol_ratio) < 0.3:
            return 2
        if before_vol_ratio < 0.5 and (cur_vol_ratio - before_vol_ratio) < 0.5:
            return 3

        if before_vol_ratio > 0.9:
            return 4
        if before_vol_ratio > 0.8:
            return 3

        return 1

    def action_0_reward(self):
        return 0

    def random_pop_cargo(self):
        tmp_key = random.choice(list(self.remain_cargoes.keys()))
        tmp_val = self.remain_cargoes[tmp_key]
        del self.remain_cargoes[tmp_key]
        return tmp_key, tmp_val

    def get_state_by_cargo(self, tmp_cargo, action, new_trailer=False):
        max_vol = self.trailers[self.cur_trailer]['trailer_vol']
        max_weight = self.trailers[self.cur_trailer]['load_bearing']
        if new_trailer:
            self.lasted_cargo_state = [[0, 0], [tmp_cargo[1]['volume'], tmp_cargo[1]['weight']]]
            state = BinningState(0, 0, tmp_cargo[1]["volume"] / max_vol, tmp_cargo[1]["weight"] / max_weight)
            self.cur_ratio_state.append(state)
        else:
            if not action:  # skip: overwrite the previous state in place
                state = BinningState((self.lasted_cargo_state[0][0] + tmp_cargo[1]["volume"]) / max_vol,
                                     (self.lasted_cargo_state[0][1] + tmp_cargo[1]["weight"]) / max_weight,
                                     tmp_cargo[1]["volume"] / max_vol, tmp_cargo[1]["weight"] / max_weight)
                self.cur_ratio_state[-1] = state
            else:  # take: shift the previous totals back one slot and append a new state
                state = BinningState((self.lasted_cargo_state[1][0] + tmp_cargo[1]["volume"]) / max_vol,
                                     (self.lasted_cargo_state[1][1] + tmp_cargo[1]["weight"]) / max_weight,
                                     tmp_cargo[1]["volume"] / max_vol, tmp_cargo[1]["weight"] / max_weight)
                self.cur_ratio_state.append(state)
                self.cur_cargoes.append(tmp_cargo[0])
                # copy rather than alias: otherwise updating index 1 below would also change index 0
                self.lasted_cargo_state[0] = self.lasted_cargo_state[1][:]
                self.lasted_cargo_state[1][0] += tmp_cargo[1]['volume']
                self.lasted_cargo_state[1][1] += tmp_cargo[1]['weight']
        return state

    def choose_action(self):
        if self.cur_cargoes:
            pass
        else:
            pass
        return

    def start_new_trailer(self, in_train=True):
        # start packing a new trailer
        # TODO: the trailer type is chosen at random for now
        self.cur_trailer = random.choice(list(self.trailers.keys()))
        tmp_cargo = self.random_pop_cargo()
        action = self.get_first_action(tmp_cargo, in_train=in_train)
        state = self.get_state_by_cargo(tmp_cargo, action, new_trailer=True)
        return state, tmp_cargo

    def get_first_action(self, tmp_cargo, in_train=True, prob_above_avg=0.75):
        if in_train:
            # during training, always take the first cargo; otherwise the state would be all zeros
            return self.actions[1]
        vol_ratio = tmp_cargo[1]['volume'] / self.cargoes_vol_avg
        prob = [prob_above_avg, 1 - prob_above_avg] if vol_ratio >= 1 else [1 - prob_above_avg, prob_above_avg]
        return random.choices(self.actions, weights=prob)[0]

    def step(self, action, tmp_cargo, in_train=True):
        # when in_train is False the reward is not strictly needed, but it is computed anyway for logging
        if action == 0:    # skip: compute the skip reward, return the old cargo to the pool, draw a new cargo and state
            self.remain_cargoes.update({tmp_cargo[0]: tmp_cargo[1]})
            tmp_cargo = self.random_pop_cargo()
            state = self.get_state_by_cargo(tmp_cargo, action)
            return state, self.action_0_reward(), False, tmp_cargo
        else:              # take: compute the take reward; if not done, draw a new cargo and state
            reward = self.action_1_reward(self.cur_ratio_state[-1].cur_vol_r, self.cur_ratio_state[-1].cur_weight_r,
                                          self.trailers[self.cur_trailer]['trailer_vol'],
                                          self.lasted_cargo_state[1][0])
            done_flag = self.episode_done(self.cur_ratio_state[-1].cur_vol_r)
            if done_flag:
                return None, reward, done_flag, None

            tmp_cargo = self.random_pop_cargo()
            state = self.get_state_by_cargo(tmp_cargo, action)
            return state, reward, done_flag, tmp_cargo

main.py

from service.data_related.gen_trailer_cargo import gen_cargo_df, gen_trailer_df
from service.sol.reinforce.pg import binning_run


def run():
    cargoes = gen_cargo_df()
    trailers = gen_trailer_df()
    binning_run(cargoes, trailers)


if __name__ == "__main__":
    run()

Part of the printed log is shown below. After some epochs of learning, the maximum reward within an epoch improves.

In the final forward pass, a small number of trailers still end up with a volume ratio above 1. The way to improve this is first to tune the reward scores more carefully (they carry the most weight) and only then adjust the network. The very low weight ratios are not a concern: the trailers' maximum load is currently set very high, so weight is almost unconstrained.

 ----------------------------- train! 4 length reward totally -2  loss is -0.06
 ----------------------------- train! 30 length reward totally 12  loss is -0.014
 ----------------------------- train! 5 length reward totally -4  loss is 0.114
 ----------------------------- train! 9 length reward totally -3  loss is 0.054
 ----------------------------- train! 15 length reward totally 8  loss is 0.085
 ----------------------------- train! 17 length reward totally 17  loss is 0.1
 ----------------------------- train! 37 length reward totally 18  loss is -0.054
 ----------------------------- train! 32 length reward totally 27  loss is 0.092
 ----------------------------- train! 22 length reward totally 14  loss is 0.023
 ----------------------------- train! 28 length reward totally 21  loss is -0.088
 ----------------------------- train! 3 length reward totally 2  loss is 0.0

epoch: 21   best reward 41 at epoch 11
epoch: 22   best reward 41 at epoch 11
epoch: 23   best reward 49 at epoch 22
epoch: 24   best reward 49 at epoch 22
epoch: 25   best reward 49 at epoch 22
epoch: 26   best reward 49 at epoch 22
epoch: 27   best reward 49 at epoch 22
epoch: 28   best reward 49 at epoch 22
epoch: 29   best reward 49 at epoch 22
epoch: 30   best reward 49 at epoch 22

Trailer_type 2 with cargo_ids: [430, 477, 272, 126] and vol pct 0.773, weight pct 0.022.
Trailer_type 1 with cargo_ids: [342, 351] and vol pct 0.812, weight pct 0.043.
Trailer_type 1 with cargo_ids: [372, 419, 271] and vol pct 0.993, weight pct 0.021.
Trailer_type 1 with cargo_ids: [460, 442, 32, 2] and vol pct 0.646, weight pct 0.044.
Trailer_type 3 with cargo_ids: [56, 30, 48, 359, 330, 240, 358, 203, 98, 125] and vol pct 0.966, weight pct 0.045.
Trailer_type 1 with cargo_ids: [160] and vol pct 0.009, weight pct 0.015.
Trailer_type 3 with cargo_ids: [127, 392, 452, 461, 207, 239, 50, 335, 365, 195, 418, 265, 294, 466, 350, 65, 453] and vol pct 0.888, weight pct 0.074.
Trailer_type 2 with cargo_ids: [14, 455, 9, 13, 363, 205, 68, 472, 417, 206, 355, 92, 76, 28, 111, 215, 5] and vol pct 0.872, weight pct 0.117.
Trailer_type 3 with cargo_ids: [35, 166, 177, 180, 266, 375, 149, 396, 196, 130, 332, 389] and vol pct 0.948, weight pct 0.055.
Trailer_type 3 with cargo_ids: [250, 156, 273, 270, 119, 356, 97, 243] and vol pct 0.974, weight pct 0.024.
Trailer_type 2 with cargo_ids: [25, 211, 103, 115, 58, 260, 93, 190] and vol pct 0.879, weight pct 0.059.
Trailer_type 1 with cargo_ids: [393, 481] and vol pct 0.976, weight pct 0.033.
Trailer_type 2 with cargo_ids: [403, 67, 106, 293, 315, 408, 255, 299, 435, 491, 280, 395, 497, 159, 228, 334, 202] and vol pct 0.879, weight pct 0.12.
Trailer_type 3 with cargo_ids: [287, 41, 333, 234, 148, 54, 114, 394, 437, 347, 23, 319, 436, 251] and vol pct 0.975, weight pct 0.058.
Trailer_type 2 with cargo_ids: [314, 52, 229, 264, 317, 55, 371, 496, 281] and vol pct 0.951, weight pct 0.05.
Trailer_type 2 with cargo_ids: [282, 277, 312, 450, 276] and vol pct 0.799, weight pct 0.022.
Trailer_type 2 with cargo_ids: [409, 204, 153, 259, 249, 470, 426] and vol pct 0.804, weight pct 0.046.
Trailer_type 1 with cargo_ids: [33, 405, 451, 429, 258, 78] and vol pct 0.673, weight pct 0.082.
Trailer_type 1 with cargo_ids: [40, 173] and vol pct 0.612, weight pct 0.019.
Trailer_type 1 with cargo_ids: [291, 278, 322] and vol pct 1.294, weight pct 0.026.
Trailer_type 3 with cargo_ids: [478, 189, 43, 8, 224, 492, 402, 367, 425, 63, 401, 290, 337, 187, 336, 165] and vol pct 0.928, weight pct 0.065.
Trailer_type 1 with cargo_ids: [36, 112, 231, 331] and vol pct 1.012, weight pct 0.064.
Trailer_type 2 with cargo_ids: [75, 306, 89, 422, 79, 167] and vol pct 0.968, weight pct 0.051.
Trailer_type 1 with cargo_ids: [410, 24, 411, 4] and vol pct 0.767, weight pct 0.087.
Trailer_type 3 with cargo_ids: [369, 323, 500, 295, 74, 275, 124, 158, 340, 463, 15, 94, 46] and vol pct 0.928, weight pct 0.065.
Trailer_type 2 with cargo_ids: [131, 3, 346, 122, 232, 199, 18] and vol pct 0.987, weight pct 0.045.
Trailer_type 1 with cargo_ids: [209] and vol pct 0.96, weight pct 0.025.
Trailer_type 1 with cargo_ids: [233, 404, 113, 424] and vol pct 1.135, weight pct 0.056.
Trailer_type 2 with cargo_ids: [71, 263, 120, 214, 123, 134, 154, 116, 247] and vol pct 0.998, weight pct 0.042.
Trailer_type 3 with cargo_ids: [420, 383, 480, 267, 494, 449, 91, 192, 274, 284, 385, 456] and vol pct 0.937, weight pct 0.044.
Trailer_type 1 with cargo_ids: [198] and vol pct 0.3, weight pct 0.014.
Trailer_type 2 with cargo_ids: [296, 397, 354, 200, 349, 80, 288] and vol pct 0.863, weight pct 0.035.
Trailer_type 1 with cargo_ids: [415] and vol pct 0.42, weight pct 0.013.
Trailer_type 2 with cargo_ids: [464, 443, 237, 454, 109, 326, 176, 465, 219, 489] and vol pct 0.961, weight pct 0.05.
Trailer_type 2 with cargo_ids: [484, 22, 226, 447, 493, 289] and vol pct 1.044, weight pct 0.036.
Trailer_type 1 with cargo_ids: [321, 329, 210, 432, 57] and vol pct 1.126, weight pct 0.079.
Trailer_type 1 with cargo_ids: [285, 445, 490, 414] and vol pct 0.884, weight pct 0.043.
Trailer_type 1 with cargo_ids: [201, 188, 172, 495] and vol pct 0.902, weight pct 0.05.
Trailer_type 1 with cargo_ids: [59, 301, 374] and vol pct 0.811, weight pct 0.022.
Trailer_type 1 with cargo_ids: [141, 286, 399, 137] and vol pct 1.628, weight pct 0.055.
Trailer_type 1 with cargo_ids: [230, 446, 10, 227] and vol pct 0.878, weight pct 0.067.
Trailer_type 3 with cargo_ids: [242, 348, 162, 73, 26, 357, 344, 444, 431, 433, 483, 440, 379] and vol pct 0.919, weight pct 0.051.
Trailer_type 2 with cargo_ids: [19, 185, 64, 441, 297, 53, 208] and vol pct 0.908, weight pct 0.04.
Trailer_type 3 with cargo_ids: [316, 467, 51, 151, 313, 171, 486, 87, 343, 235, 341, 427, 108, 476, 193, 257, 388, 368, 62] and vol pct 0.884, weight pct 0.065.
Trailer_type 3 with cargo_ids: [376, 105, 150, 133, 49, 325, 107, 279, 362, 212, 42, 269, 225] and vol pct 0.845, weight pct 0.049.
Trailer_type 1 with cargo_ids: [352] and vol pct 0.72, weight pct 0.024.
Trailer_type 1 with cargo_ids: [283, 186, 364, 338, 81] and vol pct 0.986, weight pct 0.026.
Trailer_type 1 with cargo_ids: [194, 88] and vol pct 1.68, weight pct 0.024.
Trailer_type 2 with cargo_ids: [434, 479, 473, 20, 482, 216, 155, 253, 262, 353, 7, 244, 61, 220] and vol pct 1.018, weight pct 0.114.
Trailer_type 1 with cargo_ids: [307] and vol pct 0.936, weight pct 0.015.
Trailer_type 3 with cargo_ids: [104, 423, 39, 142, 77, 384, 179, 471, 268, 170, 378, 178, 416, 406, 161, 318, 38, 398, 6, 31, 85, 182] and vol pct 0.883, weight pct 0.082.
Trailer_type 1 with cargo_ids: [128, 248, 44, 462, 16] and vol pct 0.857, weight pct 0.101.
Trailer_type 3 with cargo_ids: [238, 11, 135, 66, 382, 99, 391, 181, 254, 400, 458, 327] and vol pct 0.891, weight pct 0.06.
Trailer_type 1 with cargo_ids: [145, 339, 345] and vol pct 1.176, weight pct 0.037.
Trailer_type 2 with cargo_ids: [21, 96, 12, 488, 17, 175, 95, 1] and vol pct 0.947, weight pct 0.051.
Trailer_type 3 with cargo_ids: [117, 221, 360, 474, 381, 298, 386, 309, 373, 90, 370, 236, 102, 168, 84, 82, 184, 121, 252, 27, 366, 448, 498, 421] and vol pct 0.928, weight pct 0.106.
Trailer_type 3 with cargo_ids: [110, 140, 118, 183, 86, 83, 412, 138, 261, 37, 292, 439, 380, 213, 147, 136] and vol pct 0.873, weight pct 0.069.
Trailer_type 2 with cargo_ids: [222] and vol pct 0.09, weight pct 0.007.
finish

The above is only a reference for the idea. If you spot mistakes, corrections are welcome. Thanks!
