A detailed walkthrough of the VPG code
I am a beginner; these are my own notes and interpretations, and they may contain mistakes.
import collections: as far as I can tell this provides OrderedDict (used later when defining the push/grasp network branches in model.py); there is no tabular Q-table in this code.
if __name__ == '__main__':
# Parse arguments
parser = argparse.ArgumentParser(description='Train robotic agents to learn how to plan complementary pushing and grasping actions for manipulation with deep reinforcement learning in PyTorch.')
# --------------- Setup options ---------------
parser.add_argument('--is_sim', dest='is_sim', action='store_true', default=True, help='run in simulation?')
parser.add_argument('--obj_mesh_dir', dest='obj_mesh_dir', action='store', default='objects/blocks', help='directory containing 3D mesh files (.obj) of objects to be added to simulation')
parser.add_argument('--num_obj', dest='num_obj', type=int, action='store', default=10, help='number of objects to add to simulation') # number of objects to place in the simulation
parser.add_argument('--tcp_host_ip', dest='tcp_host_ip', action='store', default='100.127.7.223', help='IP address to robot arm as TCP client (UR5)') # IP address of the robot arm as a TCP client (UR5)
parser.add_argument('--tcp_port', dest='tcp_port', type=int, action='store', default=30002, help='port to robot arm as TCP client (UR5)') # port of the robot arm as a TCP client (UR5)
parser.add_argument('--rtc_host_ip', dest='rtc_host_ip', action='store', default='100.127.7.223', help='IP address to robot arm as real-time client (UR5)')
parser.add_argument('--rtc_port', dest='rtc_port', type=int, action='store', default=30003, help='port to robot arm as real-time client (UR5)')
parser.add_argument('--heightmap_resolution', dest='heightmap_resolution', type=float, action='store', default=0.002, help='meters per pixel of heightmap') # meters per pixel of the heightmap
parser.add_argument('--random_seed', dest='random_seed', type=int, action='store', default=1234, help='random seed for simulation and neural net initialization')
parser.add_argument('--cpu', dest='force_cpu', action='store_true', default=False, help='force code to run in CPU mode') # force CPU execution
# ------------- Algorithm options -------------
parser.add_argument('--method', dest='method', action='store', default='reinforcement', help='set to \'reactive\' (supervised learning) or \'reinforcement\' (reinforcement learning ie Q-learning)')
parser.add_argument('--push_rewards', dest='push_rewards', action='store_true', default=True, help='use immediate rewards (from change detection) for pushing?')
parser.add_argument('--future_reward_discount', dest='future_reward_discount', type=float, action='store', default=0.5)
parser.add_argument('--experience_replay', dest='experience_replay', action='store_true', default=True, help='use prioritized experience replay?')
parser.add_argument('--heuristic_bootstrap', dest='heuristic_bootstrap', action='store_true', default=False, help='use handcrafted grasping algorithm when grasping fails too many times in a row during training?')
parser.add_argument('--explore_rate_decay', dest='explore_rate_decay', action='store_true', default=False)
parser.add_argument('--grasp_only', dest='grasp_only', action='store_true', default=False)
# -------------- Testing options --------------
parser.add_argument('--is_testing', dest='is_testing', action='store_true', default=True)
parser.add_argument('--max_test_trials', dest='max_test_trials', type=int, action='store', default=30, help='maximum number of test runs per case/scenario')
parser.add_argument('--test_preset_cases', dest='test_preset_cases', action='store_true', default=True)
parser.add_argument('--test_preset_file', dest='test_preset_file', action='store', default='simulation/test-cases/test-10-obj-07.txt')
# ------ Pre-loading and logging options ------
parser.add_argument('--load_snapshot', dest='load_snapshot', action='store_true', default=True, help='load pre-trained snapshot of model?')
parser.add_argument('--snapshot_file', dest='snapshot_file', action='store',default='downloads/vpg-original-sim-pretrained-10-obj.pth')
parser.add_argument('--continue_logging', dest='continue_logging', action='store_true', default=True, help='continue logging from previous session?')
parser.add_argument('--logging_directory', dest='logging_directory', action='store',default='logs/YOUR-SESSION-DIRECTORY-NAME-HERE')
parser.add_argument('--save_visualizations', dest='save_visualizations', action='store_true', default=False, help='save visualizations of FCN predictions?')
# Run main program with specified arguments
args = parser.parse_args()
main(args)
Argument parser
Setup options
--is_sim: whether to run in the simulation environment
--obj_mesh_dir: directory containing the 3D mesh files (.obj) of the objects to be added to the simulation
--num_obj: number of objects to add to the simulation
--tcp_host_ip: IP address of the robot arm as a TCP client (UR5)
--heightmap_resolution: meters per pixel of the heightmap
Algorithm options
--push_rewards: whether to use immediate rewards (from change detection) for pushing
--experience_replay: use prioritized experience replay
--grasp_only: grasp only, never push
Testing options
--max_test_trials: maximum number of test runs per case/scenario
Pre-loading and logging options
--load_snapshot and --continue_logging: resume a previous training session
--snapshot_file and --logging_directory: load the latest model snapshot specified by --snapshot_file, and the transitions from the session directory specified by --logging_directory
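A side note (my own observation, not from the repo docs): most of the boolean options above combine action='store_true' with default=True, so they are effectively always on unless the default in the source file is edited. A minimal sketch:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--is_sim', dest='is_sim', action='store_true', default=True)
args = parser.parse_args([])   # parse an empty command line
print(args.is_sim)             # True, even though --is_sim was never passed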
main
Initialization
# --------------- Setup options ---------------
is_sim = args.is_sim # Run in simulation?
obj_mesh_dir = os.path.abspath(args.obj_mesh_dir) if is_sim else None # Directory containing 3D mesh files (.obj) of objects to be added to simulation
num_obj = args.num_obj if is_sim else None # Number of objects to add to simulation
tcp_host_ip = args.tcp_host_ip if not is_sim else None # IP and port to robot arm as TCP client (UR5)
tcp_port = args.tcp_port if not is_sim else None
rtc_host_ip = args.rtc_host_ip if not is_sim else None # IP and port to robot arm as real-time client (UR5)
rtc_port = args.rtc_port if not is_sim else None
if is_sim:
    workspace_limits = np.asarray([[-0.724, -0.276], [-0.224, 0.224], [-0.0001, 0.4]]) # Cols: min max, Rows: x y z (define workspace limits in robot coordinates)
else:
    workspace_limits = np.asarray([[0.3, 0.748], [-0.224, 0.224], [-0.255, -0.1]]) # Cols: min max, Rows: x y z (define workspace limits in robot coordinates)
heightmap_resolution = args.heightmap_resolution # Meters per pixel of heightmap
random_seed = args.random_seed
force_cpu = args.force_cpu
# ------------- Algorithm options -------------
method = args.method # 'reactive' (supervised learning) or 'reinforcement' (reinforcement learning ie Q-learning)
push_rewards = args.push_rewards if method == 'reinforcement' else None # Use immediate rewards (from change detection) for pushing?
future_reward_discount = args.future_reward_discount
experience_replay = args.experience_replay # Use prioritized experience replay?
heuristic_bootstrap = args.heuristic_bootstrap # Use handcrafted grasping algorithm when grasping fails too many times in a row?
explore_rate_decay = args.explore_rate_decay
grasp_only = args.grasp_only
# -------------- Testing options --------------
is_testing = args.is_testing
max_test_trials = args.max_test_trials # Maximum number of test runs per case/scenario
test_preset_cases = args.test_preset_cases
test_preset_file = os.path.abspath(args.test_preset_file) if test_preset_cases else None
# ------ Pre-loading and logging options ------
load_snapshot = args.load_snapshot # Load pre-trained snapshot of model?
snapshot_file = os.path.abspath(args.snapshot_file) if load_snapshot else None
continue_logging = args.continue_logging # Continue logging from previous session
logging_directory = os.path.abspath(args.logging_directory) if continue_logging else os.path.abspath('logs')
save_visualizations = args.save_visualizations # Save visualizations of FCN predictions? Takes 0.6s per training step if set to True
# Set random seed
np.random.seed(random_seed)
Initialize the parameters and set the random seed.
Initialize the robot, the trainer and the logger
# Initialize pick-and-place system (camera and robot)
robot = Robot(is_sim, obj_mesh_dir, num_obj, workspace_limits,
tcp_host_ip, tcp_port, rtc_host_ip, rtc_port,
is_testing, test_preset_cases, test_preset_file)
# Initialize trainer
trainer = Trainer(method, push_rewards, future_reward_discount,
is_testing, load_snapshot, snapshot_file, force_cpu)
# Initialize data logger
logger = Logger(continue_logging, logging_directory)
logger.save_camera_info(robot.cam_intrinsics, robot.cam_pose, robot.cam_depth_scale) # Save camera intrinsics and pose
logger.save_heightmap_info(workspace_limits, heightmap_resolution) # Save heightmap parameters
# Find last executed iteration of pre-loaded log, and load execution info and RL variables
if continue_logging:
    trainer.preload(logger.transitions_directory)
# Initialize variables for heuristic bootstrapping and exploration probability
no_change_count = [2, 2] if not is_testing else [0, 0]
explore_prob = 0.5 if not is_testing else 0.0
# Quick hack for nonlocal memory between threads in Python 2
nonlocal_variables = {'executing_action' : False,
                      'primitive_action' : None,
                      'best_pix_ind' : None,
                      'push_success' : False,
                      'grasp_success' : False}
The Robot class is defined in robot.py:
class Robot(object):
Parameter initialization (looking only at the simulation case for now)
Definition of the object colors and shapes
def __init__(self, is_sim, obj_mesh_dir, num_obj, workspace_limits,
tcp_host_ip, tcp_port, rtc_host_ip, rtc_port,
is_testing, test_preset_cases, test_preset_file):
self.is_sim = is_sim
self.workspace_limits = workspace_limits
# If in simulation...
if self.is_sim:
# Define colors for object meshes (Tableau palette)
self.color_space = np.asarray([[78.0, 121.0, 167.0], # blue
[89.0, 161.0, 79.0], # green
[156, 117, 95], # brown
[242, 142, 43], # orange
[237.0, 201.0, 72.0], # yellow
[186, 176, 172], # gray
[255.0, 87.0, 89.0], # red
[176, 122, 161], # purple
[118, 183, 178], # cyan
[255, 157, 167]])/255.0 #pink
# Read files in object mesh directory
self.obj_mesh_dir = obj_mesh_dir
self.num_obj = num_obj
self.mesh_list = os.listdir(self.obj_mesh_dir)
# Randomly choose objects to add to scene
self.obj_mesh_ind = np.random.randint(0, len(self.mesh_list), size=self.num_obj)
self.obj_mesh_color = self.color_space[np.asarray(range(self.num_obj)) % 10, :]
Connect to the simulation environment
vrep.simxFinish(-1) # Just in case, close all opened connections
self.sim_client = vrep.simxStart('127.0.0.1', 19997, True, True, 5000, 5) # Connect to V-REP on port 19997
if self.sim_client == -1:
    print('Failed to connect to simulation (V-REP remote API server). Exiting.')
    exit()
else:
    print('Connected to simulation.')
    self.restart_sim()
self.is_testing = is_testing
self.test_preset_cases = test_preset_cases
self.test_preset_file = test_preset_file
Set up the virtual camera in the simulation
self.setup_sim_camera()
If testing, read the object meshes and poses from the test-case file (to reproduce specific arrangements)
if self.is_testing and self.test_preset_cases:
file = open(self.test_preset_file, 'r')
file_content = file.readlines()
self.test_obj_mesh_files = []
self.test_obj_mesh_colors = []
self.test_obj_positions = []
self.test_obj_orientations = []
for object_idx in range(self.num_obj):
file_content_curr_object = file_content[object_idx].split()
self.test_obj_mesh_files.append(os.path.join(self.obj_mesh_dir,file_content_curr_object[0]))
self.test_obj_mesh_colors.append([float(file_content_curr_object[1]),float(file_content_curr_object[2]),float(file_content_curr_object[3])])
self.test_obj_positions.append([float(file_content_curr_object[4]),float(file_content_curr_object[5]),float(file_content_curr_object[6])])
self.test_obj_orientations.append([float(file_content_curr_object[7]),float(file_content_curr_object[8]),float(file_content_curr_object[9])])
file.close()
self.obj_mesh_color = np.asarray(self.test_obj_mesh_colors)
self.add_objects()
self.add_objects()
Adds each object to the robot workspace at an x, y location and orientation (random or preloaded).
self.obj_mesh_ind is the array recording which mesh shape each object uses
self.object_handles = []
sim_obj_handles = []
for object_idx in range(len(self.obj_mesh_ind)):
curr_mesh_file = os.path.join(self.obj_mesh_dir, self.mesh_list[self.obj_mesh_ind[object_idx]])
if self.is_testing and self.test_preset_cases:
curr_mesh_file = self.test_obj_mesh_files[object_idx]
curr_shape_name = 'shape_%02d' % object_idx
drop_x = (self.workspace_limits[0][1] - self.workspace_limits[0][0] - 0.2) * np.random.random_sample() + self.workspace_limits[0][0] + 0.1
drop_y = (self.workspace_limits[1][1] - self.workspace_limits[1][0] - 0.2) * np.random.random_sample() + self.workspace_limits[1][0] + 0.1
object_position = [drop_x, drop_y, 0.15]
object_orientation = [2*np.pi*np.random.random_sample(), 2*np.pi*np.random.random_sample(), 2*np.pi*np.random.random_sample()]
if self.is_testing and self.test_preset_cases:
object_position = [self.test_obj_positions[object_idx][0], self.test_obj_positions[object_idx][1], self.test_obj_positions[object_idx][2]]
object_orientation = [self.test_obj_orientations[object_idx][0], self.test_obj_orientations[object_idx][1], self.test_obj_orientations[object_idx][2]]
object_color = [self.obj_mesh_color[object_idx][0], self.obj_mesh_color[object_idx][1], self.obj_mesh_color[object_idx][2]]
ret_resp,ret_ints,ret_floats,ret_strings,ret_buffer = vrep.simxCallScriptFunction(self.sim_client, 'remoteApiCommandServer',vrep.sim_scripttype_childscript,'importShape',[0,0,255,0], object_position + object_orientation + object_color, [curr_mesh_file, curr_shape_name], bytearray(), vrep.simx_opmode_blocking)
if ret_resp == 8:
print('Failed to add new objects to simulation. Please restart.')
exit()
curr_shape_handle = ret_ints[0]
self.object_handles.append(curr_shape_handle)
if not (self.is_testing and self.test_preset_cases):
time.sleep(2)
self.prev_obj_positions = []
self.obj_positions = []
if self.is_testing and self.test_preset_cases: the mesh files are replaced by the ones specified in the test-case file.
Otherwise, X, Y, orientation and color are generated randomly within the workspace limits.
simxCallScriptFunction: calls the function 'importShape' defined in a Lua child script inside V-REP
class Trainer(object):
Read the training method
Runs on CPU or GPU (GPU is recommended); the error from last week can be fixed here by changing method
self.method = method
# Check if CUDA can be used
if torch.cuda.is_available() and not force_cpu:
    print("CUDA detected. Running with GPU acceleration.")
    self.use_cuda = True
elif force_cpu:
    print("CUDA detected, but overriding with option '--cpu'. Running with only CPU.")
    self.use_cuda = False
else:
    print("CUDA is *NOT* detected. Running with only CPU.")
    self.use_cuda = False
Initialize the classification losses: one for push and one for grasp
# Initialize classification loss
push_num_classes = 3 # 0 - push, 1 - no change push, 2 - no loss
push_class_weights = torch.ones(push_num_classes)
push_class_weights[push_num_classes - 1] = 0
if self.use_cuda:
self.push_criterion = CrossEntropyLoss2d(push_class_weights.cuda()).cuda()
else:
self.push_criterion = CrossEntropyLoss2d(push_class_weights)
grasp_num_classes = 3 # 0 - grasp, 1 - failed grasp, 2 - no loss
grasp_class_weights = torch.ones(grasp_num_classes)
grasp_class_weights[grasp_num_classes - 1] = 0
if self.use_cuda:
self.grasp_criterion = CrossEntropyLoss2d(grasp_class_weights.cuda()).cuda()
else:
self.grasp_criterion = CrossEntropyLoss2d(grasp_class_weights)
A fully convolutional Q network for deep reinforcement learning
elif self.method == 'reinforcement':
    self.model = reinforcement_net(self.use_cuda)
    self.push_rewards = push_rewards
    self.future_reward_discount = future_reward_discount
    # Initialize Huber loss
    self.criterion = torch.nn.SmoothL1Loss(reduce=False) # Huber loss
    if self.use_cuda:
        self.criterion = self.criterion.cuda()
About the Huber loss (smooth L1, a smoothed mean absolute error):
Compared with the squared-error loss, the Huber loss is less sensitive to outliers in the data, and it is also differentiable at 0. It is essentially an absolute-value loss that becomes quadratic when the error is small. How small the error has to be for this to happen is controlled by a hyperparameter δ, which can be tuned. As δ → 0 the Huber loss behaves like the MAE (mean absolute error, i.e. the L1 loss); as δ → ∞ (a very large number) it behaves like the MSE (mean squared error).
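A rough sketch of the loss (my own illustration, not repo code); torch.nn.SmoothL1Loss corresponds to the case δ = 1:

import numpy as np

def huber(error, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

print(huber(np.array([0.1, 2.0])))   # [0.005 1.5]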
class reinforcement_net(nn.Module):(model.py)
# Initialize network trunks with DenseNet pre-trained on ImageNet
self.push_color_trunk = torchvision.models.densenet.densenet121(pretrained=True)
self.push_depth_trunk = torchvision.models.densenet.densenet121(pretrained=True)
self.grasp_color_trunk = torchvision.models.densenet.densenet121(pretrained=True)
self.grasp_depth_trunk = torchvision.models.densenet.densenet121(pretrained=True)
Initialize the network trunks; self.num_rotations = 16 defines 16 rotation angles.
# Construct network branches for pushing and grasping
self.pushnet = nn.Sequential(OrderedDict([
('push-norm0', nn.BatchNorm2d(2048)),
('push-relu0', nn.ReLU(inplace=True)),
('push-conv0', nn.Conv2d(2048, 64, kernel_size=1, stride=1, bias=False)),
('push-norm1', nn.BatchNorm2d(64)),
('push-relu1', nn.ReLU(inplace=True)),
('push-conv1', nn.Conv2d(64, 3, kernel_size=1, stride=1, bias=False))
# ('push-upsample2', nn.Upsample(scale_factor=4, mode='bilinear'))
]))
self.graspnet = nn.Sequential(OrderedDict([
('grasp-norm0', nn.BatchNorm2d(2048)),
('grasp-relu0', nn.ReLU(inplace=True)),
('grasp-conv0', nn.Conv2d(2048, 64, kernel_size=1, stride=1, bias=False)),
('grasp-norm1', nn.BatchNorm2d(64)),
('grasp-relu1', nn.ReLU(inplace=True)),
('grasp-conv1', nn.Conv2d(64, 3, kernel_size=1, stride=1, bias=False))
# ('grasp-upsample2', nn.Upsample(scale_factor=4, mode='bilinear'))
]))
Build the push and grasp branches. The push and grasp networks have the same structure: BatchNorm2d, a ReLU activation, a 1x1 convolution that reduces the channel count to 64, then another BatchNorm2d, ReLU, and a final 1x1 convolution that outputs 3 channels.
# Initialize network weights
for m in self.named_modules():
    if 'push-' in m[0] or 'grasp-' in m[0]:
        if isinstance(m[1], nn.Conv2d):
            nn.init.kaiming_normal(m[1].weight.data)
        elif isinstance(m[1], nn.BatchNorm2d):
            m[1].weight.data.fill_(1)
            m[1].bias.data.zero_()
Initialize the network weights:
named_modules():
returns an iterator over all modules in the network, yielding both the name of each module and the module itself.
Example:
>>> l = nn.Linear(2, 2)
>>> net = nn.Sequential(l, l)
>>> for idx, m in enumerate(net.named_modules()):
print(idx, '->', m)
0 -> ('', Sequential(
(0): Linear(in_features=2, out_features=2, bias=True)
(1): Linear(in_features=2, out_features=2, bias=True)
))
1 -> ('0', Linear(in_features=2, out_features=2, bias=True))
self.model stores self.output_prob and self.interm_feat.
Here:
nn.init.kaiming_normal(m[1].weight.data)
initializes the weights of m[1] with Kaiming initialization, the scheme proposed by Kaiming He for ReLU activations:
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He, K. et al. (2015)
Kaiming uniform distribution
torch.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
This samples from a uniform distribution U(-bound, bound), with bound = sqrt(6 / ((1 + a^2) * fan_in)),
where a is the slope of the activation function on its negative half-axis (0 for ReLU);
mode: either fan_in or fan_out; fan_in keeps the variance consistent in the forward pass, fan_out keeps it consistent in the backward pass;
nonlinearity: either relu or leaky_relu; the default is leaky_relu.
nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
Kaiming normal distribution
torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
This samples from a zero-mean normal distribution N(0, std), with std = sqrt(2 / ((1 + a^2) * fan_in)),
where a is the slope of the activation function on its negative half-axis (0 for ReLU);
mode: either fan_in or fan_out; fan_in keeps the variance consistent in the forward pass, fan_out keeps it consistent in the backward pass;
nonlinearity: either relu or leaky_relu; the default is leaky_relu.
nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu')
def forward(self, input_color_data, input_depth_data, is_volatile=False, specific_rotation=-1):
if is_volatile:
output_prob = []
interm_feat = []
# Apply rotations to images
for rotate_idx in range(self.num_rotations):
rotate_theta = np.radians(rotate_idx*(360/self.num_rotations))
# Compute sample grid for rotation BEFORE neural network
affine_mat_before = np.asarray([[np.cos(-rotate_theta), np.sin(-rotate_theta), 0],[-np.sin(-rotate_theta), np.cos(-rotate_theta), 0]])
affine_mat_before.shape = (2,3,1)
affine_mat_before = torch.from_numpy(affine_mat_before).permute(2,0,1).float()
if self.use_cuda:
flow_grid_before = F.affine_grid(Variable(affine_mat_before, requires_grad=False).cuda(), input_color_data.size())
else:
flow_grid_before = F.affine_grid(Variable(affine_mat_before, requires_grad=False), input_color_data.size())
# Rotate images clockwise
if self.use_cuda:
rotate_color = F.grid_sample(Variable(input_color_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
rotate_depth = F.grid_sample(Variable(input_depth_data, volatile=True).cuda(), flow_grid_before, mode='nearest')
else:
rotate_color = F.grid_sample(Variable(input_color_data, volatile=True), flow_grid_before, mode='nearest')
rotate_depth = F.grid_sample(Variable(input_depth_data, volatile=True), flow_grid_before, mode='nearest')
# Compute intermediate features
interm_push_color_feat = self.push_color_trunk.features(rotate_color)
interm_push_depth_feat = self.push_depth_trunk.features(rotate_depth)
interm_push_feat = torch.cat((interm_push_color_feat, interm_push_depth_feat), dim=1)
interm_grasp_color_feat = self.grasp_color_trunk.features(rotate_color)
interm_grasp_depth_feat = self.grasp_depth_trunk.features(rotate_depth)
interm_grasp_feat = torch.cat((interm_grasp_color_feat, interm_grasp_depth_feat), dim=1)
interm_feat.append([interm_push_feat, interm_grasp_feat])
# The part above is data pre-processing
# Compute sample grid for rotation AFTER branches
affine_mat_after = np.asarray([[np.cos(rotate_theta), np.sin(rotate_theta), 0],[-np.sin(rotate_theta), np.cos(rotate_theta), 0]])
affine_mat_after.shape = (2,3,1)
affine_mat_after = torch.from_numpy(affine_mat_after).permute(2,0,1).float()
if self.use_cuda:
flow_grid_after = F.affine_grid(Variable(affine_mat_after, requires_grad=False).cuda(), interm_push_feat.data.size())
else:
flow_grid_after = F.affine_grid(Variable(affine_mat_after, requires_grad=False), interm_push_feat.data.size())
# Forward pass through branches, undo rotation on output predictions, upsample results
output_prob.append([nn.Upsample(scale_factor=16, mode='bilinear').forward(F.grid_sample(self.pushnet(interm_push_feat), flow_grid_after, mode='nearest')),
nn.Upsample(scale_factor=16, mode='bilinear').forward(F.grid_sample(self.graspnet(interm_grasp_feat), flow_grid_after, mode='nearest'))])
return output_prob, interm_feat
# Forward pass through the branches, undo the rotation on the output predictions, and upsample the results
Here permute(2,0,1) permutes the dimensions of the tensor.
grid_sample and affine_grid
In short: given an input tensor and a corresponding flow-field grid (e.g. optical flow or voxel flow), for every output location the grid provides a coordinate (a pixel position in the input); the pixel value of the input at that coordinate is copied to the position specified by the grid, producing the final output.
torch.nn.functional.grid_sample(input, grid, mode='bilinear', padding_mode='zeros', align_corners=None)
The input and output here are images, or feature maps inside the network. The key part is the grid: its last dimension has size 2 and holds the position (x, y) of a pixel in the input.
The x and y values are normalized to [-1, 1]: (-1, -1) is the coordinate of the top-left pixel of the input and (1, 1) is the bottom-right pixel. Coordinates (x, y) outside this range are handled according to the padding_mode argument.
padding_mode='zeros': out-of-range positions are filled with pixel value 0. padding_mode='border': out-of-range positions are filled with the border pixel value. padding_mode='reflection': out-of-range positions are filled with values reflected about the border.
mode can be 'nearest' or 'bilinear'. 'nearest' fills the grid with the value of the single nearest pixel, while 'bilinear' fills it with a weighted average of the four surrounding pixel values (bilinear interpolation).
Bilinear interpolation weights the four neighbouring pixels by the areas of the opposite sub-rectangles (the usual quadrilateral construction).
Summary:
Using PyTorch's warping operations takes just two steps:
1. Create the grid: grid = torch.nn.functional.affine_grid(theta, size); by adjusting size you can also set the size of the resulting image (equivalent to a resize);
2. Resample with grid_sample: outputs = torch.nn.functional.grid_sample(inputs, grid, mode='bilinear')
Note that the theta passed to affine_grid maps output coordinates back into the input, so the transform often feels like the inverse of what you would intuitively write.
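A small self-contained sketch of this two-step rotate pattern (my own example with an assumed 1x1x4x4 input; the align_corners argument only exists in PyTorch versions newer than the one this repo targets):

import torch
import torch.nn.functional as F

img = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)   # (b, c, h, w)
theta = torch.tensor([[[0.0, 1.0, 0.0],
                       [-1.0, 0.0, 0.0]]])                        # 2x3 rotation matrix, no translation
grid = F.affine_grid(theta, img.size(), align_corners=False)      # step 1: build the sampling grid
rotated = F.grid_sample(img, grid, mode='nearest', align_corners=False)  # step 2: resample
print(rotated[0, 0])                                               # the 4x4 image rotated by 90 degrees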
Variable()
wraps data into a Variable, the old PyTorch autograd wrapper (in current PyTorch versions plain tensors play this role).
Loading the pre-trained model
# Load pre-trained model
if load_snapshot:
    self.model.load_state_dict(torch.load(snapshot_file))
    print('Pre-trained model snapshot loaded from: %s' % (snapshot_file))
Compute a forward pass through the model to obtain the affordances / Q values
# Compute forward pass through model to compute affordances/Q
def forward(self, color_heightmap, depth_heightmap, is_volatile=False, specific_rotation=-1):
# Apply 2x scale to input heightmaps
color_heightmap_2x = ndimage.zoom(color_heightmap, zoom=[2,2,1], order=0)
depth_heightmap_2x = ndimage.zoom(depth_heightmap, zoom=[2,2], order=0)
assert(color_heightmap_2x.shape[0:2] == depth_heightmap_2x.shape[0:2])
# Add extra padding (to handle rotations inside network)
diag_length = float(color_heightmap_2x.shape[0]) * np.sqrt(2)
diag_length = np.ceil(diag_length/32)*32
padding_width = int((diag_length - color_heightmap_2x.shape[0])/2)
color_heightmap_2x_r = np.pad(color_heightmap_2x[:,:,0], padding_width, 'constant', constant_values=0)
color_heightmap_2x_r.shape = (color_heightmap_2x_r.shape[0], color_heightmap_2x_r.shape[1], 1)
color_heightmap_2x_g = np.pad(color_heightmap_2x[:,:,1], padding_width, 'constant', constant_values=0)
color_heightmap_2x_g.shape = (color_heightmap_2x_g.shape[0], color_heightmap_2x_g.shape[1], 1)
color_heightmap_2x_b = np.pad(color_heightmap_2x[:,:,2], padding_width, 'constant', constant_values=0)
color_heightmap_2x_b.shape = (color_heightmap_2x_b.shape[0], color_heightmap_2x_b.shape[1], 1)
color_heightmap_2x = np.concatenate((color_heightmap_2x_r, color_heightmap_2x_g, color_heightmap_2x_b), axis=2)
depth_heightmap_2x = np.pad(depth_heightmap_2x, padding_width, 'constant', constant_values=0)
# Pre-process color image (scale and normalize)
image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]
input_color_image = color_heightmap_2x.astype(float)/255
for c in range(3):
input_color_image[:,:,c] = (input_color_image[:,:,c] - image_mean[c])/image_std[c]
# Pre-process depth image (normalize)
image_mean = [0.01, 0.01, 0.01]
image_std = [0.03, 0.03, 0.03]
depth_heightmap_2x.shape = (depth_heightmap_2x.shape[0], depth_heightmap_2x.shape[1], 1)
input_depth_image = np.concatenate((depth_heightmap_2x, depth_heightmap_2x, depth_heightmap_2x), axis=2)
for c in range(3):
input_depth_image[:,:,c] = (input_depth_image[:,:,c] - image_mean[c])/image_std[c]
# Construct minibatch of size 1 (b,c,h,w)
input_color_image.shape = (input_color_image.shape[0], input_color_image.shape[1], input_color_image.shape[2], 1)
input_depth_image.shape = (input_depth_image.shape[0], input_depth_image.shape[1], input_depth_image.shape[2], 1)
input_color_data = torch.from_numpy(input_color_image.astype(np.float32)).permute(3,2,0,1)
input_depth_data = torch.from_numpy(input_depth_image.astype(np.float32)).permute(3,2,0,1)
# Pass input data through model
output_prob, state_feat = self.model.forward(input_color_data, input_depth_data, is_volatile, specific_rotation)
if self.method == 'reactive':
# Return affordances (and remove extra padding)
for rotate_idx in range(len(output_prob)):
if rotate_idx == 0:
push_predictions = F.softmax(output_prob[rotate_idx][0], dim=1).cpu().data.numpy()[:,0,(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2),(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2)]
grasp_predictions = F.softmax(output_prob[rotate_idx][1], dim=1).cpu().data.numpy()[:,0,(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2),(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2)]
else:
push_predictions = np.concatenate((push_predictions, F.softmax(output_prob[rotate_idx][0], dim=1).cpu().data.numpy()[:,0,(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2),(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2)]), axis=0)
grasp_predictions = np.concatenate((grasp_predictions, F.softmax(output_prob[rotate_idx][1], dim=1).cpu().data.numpy()[:,0,(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2),(padding_width/2):(color_heightmap_2x.shape[0]/2 - padding_width/2)]), axis=0)
elif self.method == 'reinforcement':
# Return Q values (and remove extra padding)
for rotate_idx in range(len(output_prob)):
if rotate_idx == 0:
push_predictions = output_prob[rotate_idx][0].cpu().data.numpy()[:,0,int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2),int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2)]
grasp_predictions = output_prob[rotate_idx][1].cpu().data.numpy()[:,0,int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2),int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2)]
else:
push_predictions = np.concatenate((push_predictions, output_prob[rotate_idx][0].cpu().data.numpy()[:,0,int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2),int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2)]), axis=0)
grasp_predictions = np.concatenate((grasp_predictions, output_prob[rotate_idx][1].cpu().data.numpy()[:,0,int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2),int(padding_width/2):int(color_heightmap_2x.shape[0]/2 - padding_width/2)]), axis=0)
return push_predictions, grasp_predictions, state_feat
Apply a 2x scale to the input heightmaps
Normalize the image data
ndimage.zoom
is one way of up/down-sampling; the zoom argument:
# float/sequence: the zoom factor along each axis; a float applies the same factor to every axis, while a sequence gives one factor per axis
scipy.ndimage.zoom(x, 2, order=0)
order=0 means nearest-neighbour sampling, order=1 bilinear, order=3 cubic.
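A tiny sketch (not repo code) of what order=0 zooming does:

import numpy as np
from scipy import ndimage

a = np.array([[1, 2],
              [3, 4]])
print(ndimage.zoom(a, 2, order=0))   # each value is repeated into a 2x2 block (nearest neighbour)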
Pre-process the color and depth images (normalization)
Each pixel is standardized per channel: input_color_image[:,:,c] = (input_color_image[:,:,c] - image_mean[c])/image_std[c]
Construct a minibatch of size 1 with layout (b, c, h, w):
1. color_heightmap_2x of shape (h, w, 3) is standardized into input_color_image of shape (h, w, 3)
2. input_color_image.shape = (input_color_image.shape[0], input_color_image.shape[1], input_color_image.shape[2], 1) appends a trailing axis, giving (h, w, 3, 1)
3. permute(3,2,0,1) then rearranges it to (1, 3, h, w), i.e. (batch, channels, height, width)
The depth image is reshaped in the same way, so
input_depth_data also ends up as (b, c, h, w) with b = 1 (see the small shape check below).
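A quick shape check (illustrative only, assuming a 224x224 heightmap):

import numpy as np
import torch

x = np.zeros((224, 224, 3, 1), dtype=np.float32)   # (h, w, c, 1) after step 2 above
t = torch.from_numpy(x).permute(3, 2, 0, 1)
print(t.shape)                                       # torch.Size([1, 3, 224, 224]) -> (b, c, h, w)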
output_prob, state_feat = self.model.forward(input_color_data, input_depth_data, is_volatile, specific_rotation) reads back self.output_prob and self.interm_feat:
self.output_prob: the prediction maps; [rotate_idx][0] is the push output and [rotate_idx][1] is the grasp output
self.interm_feat: the intermediate features computed along the way
return push_predictions, grasp_predictions, state_feat: the push predictions, the grasp predictions, and the intermediate feature information, respectively
A parallel thread processes the network output and executes the actions
def process_actions():
while True:
if nonlocal_variables['executing_action']:
# Determine whether grasping or pushing should be executed based on network predictions
best_push_conf = np.max(push_predictions)
best_grasp_conf = np.max(grasp_predictions)
print('Primitive confidence scores: %f (push), %f (grasp)' % (best_push_conf, best_grasp_conf))
nonlocal_variables['primitive_action'] = 'grasp'
explore_actions = False
if not grasp_only:
if is_testing and method == 'reactive':
if best_push_conf > 2*best_grasp_conf:
nonlocal_variables['primitive_action'] = 'push'
else:
if best_push_conf > best_grasp_conf:
nonlocal_variables['primitive_action'] = 'push'
explore_actions = np.random.uniform() < explore_prob
if explore_actions: # Exploitation (do best action) vs exploration (do other action)
print('Strategy: explore (exploration probability: %f)' % (explore_prob))
nonlocal_variables['primitive_action'] = 'push' if np.random.randint(0,2) == 0 else 'grasp'
else:
print('Strategy: exploit (exploration probability: %f)' % (explore_prob))
trainer.is_exploit_log.append([0 if explore_actions else 1])
logger.write_to_log('is-exploit', trainer.is_exploit_log)
# If heuristic bootstrapping is enabled: if change has not been detected more than 2 times, execute heuristic algorithm to detect grasps/pushes
# NOTE: typically not necessary and can reduce final performance.
if heuristic_bootstrap and nonlocal_variables['primitive_action'] == 'push' and no_change_count[0] >= 2:
print('Change not detected for more than two pushes. Running heuristic pushing.')
nonlocal_variables['best_pix_ind'] = trainer.push_heuristic(valid_depth_heightmap)
no_change_count[0] = 0
predicted_value = push_predictions[nonlocal_variables['best_pix_ind']]
use_heuristic = True
elif heuristic_bootstrap and nonlocal_variables['primitive_action'] == 'grasp' and no_change_count[1] >= 2:
print('Change not detected for more than two grasps. Running heuristic grasping.')
nonlocal_variables['best_pix_ind'] = trainer.grasp_heuristic(valid_depth_heightmap)
no_change_count[1] = 0
predicted_value = grasp_predictions[nonlocal_variables['best_pix_ind']]
use_heuristic = True
else:
use_heuristic = False
# Get pixel location and rotation with highest affordance prediction from heuristic algorithms (rotation, y, x)
if nonlocal_variables['primitive_action'] == 'push':
nonlocal_variables['best_pix_ind'] = np.unravel_index(np.argmax(push_predictions), push_predictions.shape)
predicted_value = np.max(push_predictions)
elif nonlocal_variables['primitive_action'] == 'grasp':
nonlocal_variables['best_pix_ind'] = np.unravel_index(np.argmax(grasp_predictions), grasp_predictions.shape)
predicted_value = np.max(grasp_predictions)
trainer.use_heuristic_log.append([1 if use_heuristic else 0])
logger.write_to_log('use-heuristic', trainer.use_heuristic_log)
# Save predicted confidence value
trainer.predicted_value_log.append([predicted_value])
logger.write_to_log('predicted-value', trainer.predicted_value_log)
# Compute 3D position of pixel
print('Action: %s at (%d, %d, %d)' % (nonlocal_variables['primitive_action'], nonlocal_variables['best_pix_ind'][0], nonlocal_variables['best_pix_ind'][1], nonlocal_variables['best_pix_ind'][2]))
best_rotation_angle = np.deg2rad(nonlocal_variables['best_pix_ind'][0]*(360.0/trainer.model.num_rotations))
best_pix_x = nonlocal_variables['best_pix_ind'][2]
best_pix_y = nonlocal_variables['best_pix_ind'][1]
primitive_position = [best_pix_x * heightmap_resolution + workspace_limits[0][0], best_pix_y * heightmap_resolution + workspace_limits[1][0], valid_depth_heightmap[best_pix_y][best_pix_x] + workspace_limits[2][0]]
# If pushing, adjust start position, and make sure z value is safe and not too low
if nonlocal_variables['primitive_action'] == 'push': # or nonlocal_variables['primitive_action'] == 'place':
finger_width = 0.02
safe_kernel_width = int(np.round((finger_width/2)/heightmap_resolution))
local_region = valid_depth_heightmap[max(best_pix_y - safe_kernel_width, 0):min(best_pix_y + safe_kernel_width + 1, valid_depth_heightmap.shape[0]), max(best_pix_x - safe_kernel_width, 0):min(best_pix_x + safe_kernel_width + 1, valid_depth_heightmap.shape[1])]
if local_region.size == 0:
safe_z_position = workspace_limits[2][0]
else:
safe_z_position = np.max(local_region) + workspace_limits[2][0]
primitive_position[2] = safe_z_position
# Save executed primitive
if nonlocal_variables['primitive_action'] == 'push':
trainer.executed_action_log.append([0, nonlocal_variables['best_pix_ind'][0], nonlocal_variables['best_pix_ind'][1], nonlocal_variables['best_pix_ind'][2]]) # 0 - push
elif nonlocal_variables['primitive_action'] == 'grasp':
trainer.executed_action_log.append([1, nonlocal_variables['best_pix_ind'][0], nonlocal_variables['best_pix_ind'][1], nonlocal_variables['best_pix_ind'][2]]) # 1 - grasp
logger.write_to_log('executed-action', trainer.executed_action_log)
# Visualize executed primitive, and affordances
if save_visualizations:
push_pred_vis = trainer.get_prediction_vis(push_predictions, color_heightmap, nonlocal_variables['best_pix_ind'])
logger.save_visualizations(trainer.iteration, push_pred_vis, 'push')
cv2.imwrite('visualization.push.png', push_pred_vis)
grasp_pred_vis = trainer.get_prediction_vis(grasp_predictions, color_heightmap, nonlocal_variables['best_pix_ind'])
logger.save_visualizations(trainer.iteration, grasp_pred_vis, 'grasp')
cv2.imwrite('visualization.grasp.png', grasp_pred_vis)
# Initialize variables that influence reward
nonlocal_variables['push_success'] = False
nonlocal_variables['grasp_success'] = False
change_detected = False
# Execute primitive
if nonlocal_variables['primitive_action'] == 'push':
nonlocal_variables['push_success'] = robot.push(primitive_position, best_rotation_angle, workspace_limits)
print('Push successful: %r' % (nonlocal_variables['push_success']))
elif nonlocal_variables['primitive_action'] == 'grasp':
nonlocal_variables['grasp_success'] = robot.grasp(primitive_position, best_rotation_angle, workspace_limits)
print('Grasp successful: %r' % (nonlocal_variables['grasp_success']))
nonlocal_variables['executing_action'] = False
time.sleep(0.01)
action_thread = threading.Thread(target=process_actions)
action_thread.daemon = True
action_thread.start()
exit_called = False
Decide whether to grasp or push based on the network predictions.
nonlocal_variables['primitive_action'] = 'grasp' initializes the action to grasp.
If not in grasp-only mode, compare best_push_conf and best_grasp_conf and choose the action with the larger confidence.
np.unravel_index converts the flat argmax index into the multi-dimensional index of the maximum element of the array.
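For example (illustrative numbers only; the shape mirrors the 16-rotation prediction maps):

import numpy as np

q = np.zeros((16, 224, 224))          # (rotation, y, x), like push_predictions
q[3, 101, 87] = 0.9                   # pretend this is the highest Q value
best_pix_ind = np.unravel_index(np.argmax(q), q.shape)
print(best_pix_ind)                   # (3, 101, 87) -> rotation index, pixel y, pixel x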
Compute the 3D position of the pixel:
[best_pix_x * heightmap_resolution + workspace_limits[0][0], best_pix_y * heightmap_resolution + workspace_limits[1][0], valid_depth_heightmap[best_pix_y][best_pix_x] + workspace_limits[2][0]]
where best_pix_x comes from nonlocal_variables['best_pix_ind']:
nonlocal_variables = {'executing_action' : False,
                      'primitive_action' : None,
                      'best_pix_ind' : None,
                      'push_success' : False,
                      'grasp_success' : False}
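A worked version of the pixel-to-robot-frame conversion above, with assumed numbers (heightmap_resolution = 0.002 m/px, the simulation workspace limits, and a made-up pixel and depth value):

heightmap_resolution = 0.002                    # meters per pixel
workspace_limits = [[-0.724, -0.276], [-0.224, 0.224], [-0.0001, 0.4]]
best_pix_x, best_pix_y = 100, 50
depth_at_pixel = 0.03                           # assumed heightmap value at (y, x), in meters

x = best_pix_x * heightmap_resolution + workspace_limits[0][0]   # -0.524
y = best_pix_y * heightmap_resolution + workspace_limits[1][0]   # about -0.124
z = depth_at_pixel + workspace_limits[2][0]                      # about  0.0299
print([x, y, z])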
When heuristic bootstrapping kicks in, best_pix_ind is instead obtained via
trainer.grasp_heuristic(valid_depth_heightmap) or
trainer.push_heuristic(valid_depth_heightmap):
def grasp_heuristic(self, depth_heightmap):
num_rotations = 16
for rotate_idx in range(num_rotations):
rotated_heightmap = ndimage.rotate(depth_heightmap, rotate_idx*(360.0/num_rotations), reshape=False, order=0)
valid_areas = np.zeros(rotated_heightmap.shape)
valid_areas[np.logical_and(rotated_heightmap - ndimage.interpolation.shift(rotated_heightmap, [0,-25], order=0) > 0.02, rotated_heightmap - ndimage.interpolation.shift(rotated_heightmap, [0,25], order=0) > 0.02)] = 1
# valid_areas = np.multiply(valid_areas, rotated_heightmap)
blur_kernel = np.ones((25,25),np.float32)/9
valid_areas = cv2.filter2D(valid_areas, -1, blur_kernel)
tmp_grasp_predictions = ndimage.rotate(valid_areas, -rotate_idx*(360.0/num_rotations), reshape=False, order=0)
tmp_grasp_predictions.shape = (1, rotated_heightmap.shape[0], rotated_heightmap.shape[1])
if rotate_idx == 0:
grasp_predictions = tmp_grasp_predictions
else:
grasp_predictions = np.concatenate((grasp_predictions, tmp_grasp_predictions), axis=0)
best_pix_ind = np.unravel_index(np.argmax(grasp_predictions), grasp_predictions.shape)
return best_pix_ind
First the input depth_heightmap is rotated into the 16 directions.
scipy.ndimage.interpolation.shift() takes three (main) parameters:
the first parameter is the input, an array;
the second is the shift ([rows, columns]);
the third (cval) is the fill value.
shift(a, [2,2], cval=100) # [2,2] shifts the original array down by 2 rows and right by 2 columns
The valid-area mask requires depth_heightmap minus a copy of itself shifted 25 pixels along the column axis in one direction (zero-filled) to exceed 0.02,
logically ANDed with the same test against a copy shifted 25 pixels in the opposite direction: a pixel is marked graspable only if it sits at least 2 cm above the regions 25 pixels to either side of it (room for the gripper fingers).
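A small sketch (not repo code) of shift with order=0 and zero filling:

import numpy as np
from scipy import ndimage

a = np.arange(9).reshape(3, 3)
# Shift one column to the right along axis 1; the vacated column is filled with cval=0
print(ndimage.shift(a, [0, 1], order=0, cval=0))
# [[0 0 1]
#  [0 3 4]
#  [0 6 7]]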
Start training
print('\n%s iteration: %d' % ('Testing' if is_testing else 'Training', trainer.iteration))
iteration_time_0 = time.time()
# Make sure simulation is still stable (if not, reset simulation)
if is_sim: robot.check_sim()
# Get latest RGB-D image
color_img, depth_img = robot.get_camera_data()
depth_img = depth_img * robot.cam_depth_scale # Apply depth scale from calibration
# Get heightmap from RGB-D image (by re-projecting 3D point cloud)
color_heightmap, depth_heightmap = utils.get_heightmap(color_img, depth_img, robot.cam_intrinsics, robot.cam_pose, workspace_limits, heightmap_resolution)
valid_depth_heightmap = depth_heightmap.copy()
valid_depth_heightmap[np.isnan(valid_depth_heightmap)] = 0
Print which iteration this is and whether we are training or testing.
Check that the simulation is still stable.
Get the color and depth images from the camera with robot.get_camera_data().
The raw depth values are relative; multiplying by robot.cam_depth_scale (the depth scale from camera calibration) turns them into metric depth for this scene.
Get the heightmap from the RGB-D image (by re-projecting the 3D point cloud):
utils.get_heightmap produces color_heightmap and depth_heightmap.
numpy.isnan(array[, out]):
tests element-wise for NaN and returns the result as a boolean array.
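For example (made-up readings):

import numpy as np

d = np.array([0.10, np.nan, 0.35])
d[np.isnan(d)] = 0                 # replace NaN readings with 0, as done for valid_depth_heightmap
print(d)                           # [0.1  0.   0.35]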
If the table is empty, reset the simulation (or pause real-world training)
stuff_count = np.zeros(valid_depth_heightmap.shape)
stuff_count[valid_depth_heightmap > 0.02] = 1
empty_threshold = 300
if is_sim and is_testing:
    empty_threshold = 10
if np.sum(stuff_count) < empty_threshold or (is_sim and no_change_count[0] + no_change_count[1] > 10):
    no_change_count = [0, 0]
    if is_sim:
        print('Not enough objects in view (value: %d)! Repositioning objects.' % (np.sum(stuff_count)))
        robot.restart_sim()
        robot.add_objects()
        if is_testing: # If at end of test run, re-load original weights (before test run)
            trainer.model.load_state_dict(torch.load(snapshot_file))
    else:
        # print('Not enough stuff on the table (value: %d)! Pausing for 30 seconds.' % (np.sum(stuff_count)))
        # time.sleep(30)
        print('Not enough stuff on the table (value: %d)! Flipping over bin of objects...' % (np.sum(stuff_count)))
        robot.restart_real()
    trainer.clearance_log.append([trainer.iteration])
    logger.write_to_log('clearance', trainer.clearance_log)
    if is_testing and len(trainer.clearance_log) >= max_test_trials:
        exit_called = True # Exit after training thread (backprop and saving labels)
    continue
Run the training iteration in the current thread (a.k.a. the training thread)
Detect changes:
depth_diff = abs(depth_heightmap - prev_depth_heightmap)
depth_diff[np.isnan(depth_diff)] = 0
depth_diff[depth_diff > 0.3] = 0
depth_diff[depth_diff < 0.01] = 0
depth_diff[depth_diff > 0] = 1
change_threshold = 300
change_value = np.sum(depth_diff)
change_detected = change_value > change_threshold or prev_grasp_success
print('Change detected: %r (value: %d)' % (change_detected, change_value))
Only the parts with 0.01 < depth_diff < 0.3 are kept and set to 1.
If the number of 1s in depth_diff exceeds change_threshold (300), or the previous grasp succeeded, change_detected becomes True and 'Change detected: True' is printed.
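A tiny example (made-up values) of how the four masking lines behave:

import numpy as np

depth_diff = np.array([[0.005, 0.05],
                       [0.40, np.nan]])
depth_diff[np.isnan(depth_diff)] = 0
depth_diff[depth_diff > 0.3] = 0      # discard implausibly large jumps
depth_diff[depth_diff < 0.01] = 0     # discard sensor noise
depth_diff[depth_diff > 0] = 1        # whatever is left counts as changed
print(depth_diff)                     # [[0. 1.] [0. 0.]] -> change_value = 1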
if change_detected:
    if prev_primitive_action == 'push':
        no_change_count[0] = 0
    elif prev_primitive_action == 'grasp':
        no_change_count[1] = 0
else:
    if prev_primitive_action == 'push':
        no_change_count[0] += 1
    elif prev_primitive_action == 'grasp':
        no_change_count[1] += 1
Count, per action type, how many consecutive actions produced no detected change.
label_value, prev_reward_value = trainer.get_label_value(prev_primitive_action, prev_push_success, prev_grasp_success, change_detected, prev_push_predictions, prev_grasp_predictions, color_heightmap, valid_depth_heightmap)
trainer.label_value_log.append([label_value])
logger.write_to_log('label-value', trainer.label_value_log)
trainer.reward_value_log.append([prev_reward_value])
logger.write_to_log('reward-value', trainer.reward_value_log)
The relevant values are obtained through the get_label_value function.
Its inputs are prev_primitive_action, prev_push_success, prev_grasp_success, change_detected, prev_push_predictions, prev_grasp_predictions, color_heightmap, valid_depth_heightmap.
Only the 'reinforcement' branch is considered here:
def get_label_value(self, primitive_action, push_success, grasp_success, change_detected, prev_push_predictions, prev_grasp_predictions, next_color_heightmap, next_depth_heightmap):
    elif self.method == 'reinforcement':
        # Compute current reward
        current_reward = 0
        if primitive_action == 'push':
            if change_detected:
                current_reward = 0.5
        elif primitive_action == 'grasp':
            if grasp_success:
                current_reward = 1.0
        # Compute future reward
        if not change_detected and not grasp_success:
            future_reward = 0
        else:
            next_push_predictions, next_grasp_predictions, next_state_feat = self.forward(next_color_heightmap, next_depth_heightmap, is_volatile=True)
            future_reward = max(np.max(next_push_predictions), np.max(next_grasp_predictions))
        # # Experiment: use Q differences
        # push_predictions_difference = next_push_predictions - prev_push_predictions
        # grasp_predictions_difference = next_grasp_predictions - prev_grasp_predictions
        # future_reward = max(np.max(push_predictions_difference), np.max(grasp_predictions_difference))
        print('Current reward: %f' % (current_reward))
        print('Future reward: %f' % (future_reward))
        if primitive_action == 'push' and not self.push_rewards:
            expected_reward = self.future_reward_discount * future_reward
            print('Expected reward: %f + %f x %f = %f' % (0.0, self.future_reward_discount, future_reward, expected_reward))
        else:
            expected_reward = current_reward + self.future_reward_discount * future_reward
            print('Expected reward: %f + %f x %f = %f' % (current_reward, self.future_reward_discount, future_reward, expected_reward))
        return expected_reward, current_reward
Compute the current reward current_reward: if the primitive action was push and change_detected is true (the diff count exceeded 300), the push is given a reward of 0.5.
If the action was grasp and it succeeded, the reward is 1.0.
Compute the future reward:
if change_detected is false and the grasp did not succeed, the future reward is set to 0.
Otherwise, the color and depth heightmaps of the next scene are passed through the forward model to compute Q.
The returned push predictions, grasp predictions and intermediate features are stored in
next_push_predictions, next_grasp_predictions, next_state_feat.
The future reward is the maximum value over next_push_predictions and next_grasp_predictions.
Print the future reward and the current reward.
The future reward carries less weight than the current reward, so the expected reward of this action is the current reward plus the discounted future reward, i.e.:
expected_reward = current_reward + self.future_reward_discount * future_reward
return expected_reward, current_reward returns to main.py, where the two values are stored in label_value and prev_reward_value.
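For instance, with the default future_reward_discount of 0.5 (the future Q value here is an assumed number):

current_reward = 1.0        # a successful grasp
future_reward = 0.8         # assumed max over next_push_predictions / next_grasp_predictions
future_reward_discount = 0.5

expected_reward = current_reward + future_reward_discount * future_reward
print(expected_reward)      # 1.4 -> this becomes label_value in main.py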
trainer.backprop(prev_color_heightmap, prev_valid_depth_heightmap, prev_primitive_action, prev_best_pix_ind, label_value)
Backpropagate and print the training loss.
if experience_replay and not is_testing:
    sample_primitive_action = prev_primitive_action
    if sample_primitive_action == 'push':
        sample_primitive_action_id = 0
        if method == 'reactive':
            sample_reward_value = 0 if prev_reward_value == 1 else 1 # random.randint(1, 2) # 2
        elif method == 'reinforcement':
            sample_reward_value = 0 if prev_reward_value == 0.5 else 0.5
    elif sample_primitive_action == 'grasp':
        sample_primitive_action_id = 1
        if method == 'reactive':
            sample_reward_value = 0 if prev_reward_value == 1 else 1
        elif method == 'reinforcement':
            sample_reward_value = 0 if prev_reward_value == 1 else 1
Sampling for experience replay:
sample_primitive_action stores which action was taken;
sample_primitive_action_id is the action id (push is 0, grasp is 1);
sample_reward_value is chosen from prev_reward_value (the current_reward): when the previous reward was the non-zero success value, experiences with reward 0 are sampled, and vice versa.
Everything up to line 375 is experience replay; I will skip it for now.
while nonlocal_variables['executing_action']:
    time.sleep(0.01)
Synchronize the action thread with the training thread.
Save the information and move on to the next iteration (while True).