PyTorch PPO implementation not learning

Time: 2018-12-16 13:09:48

Tags: python pytorch reinforcement-learning

There is a bug somewhere in this PPO implementation and I cannot figure out what is wrong. The network returns a Normal distribution and a value estimate for the critic. The last layer of the actor provides four F.tanh-ed action values, which are used as the mean of the distribution. nn.Parameter(torch.zeros(action_dim)) is the standard deviation.
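For reference, here is a minimal sketch of an actor-critic module matching this description. The hidden sizes, the single hidden layer, and treating the nn.Parameter as a log standard deviation (so a zero initialization gives std 1) are assumptions, not the exact network from the question:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, action_dim))
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))
        # state-independent standard deviation parameter, as in the question;
        # interpreted here as a log-std (assumption of this sketch)
        self.std_param = nn.Parameter(torch.zeros(action_dim))

    def forward(self, states):
        mean = F.tanh(self.actor(states))                       # tanh-squashed action means
        dist = torch.distributions.Normal(mean, self.std_param.exp())
        values = self.critic(states)                            # critic's value estimate
        return dist, values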

The trajectories of 20 parallel agents are added to the same memory. The episode length is 1000, and memory.sample() returns a np.random.permutation of the 20k memory entries as tensors in batches of 64. Before the batch tensors are stacked, the values are stored as (1, -1) tensors in collections.deques. The returned tensors are detach()ed.
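A rough sketch of the kind of memory described above. The class name, the method signatures, and dropping the last partial batch are guesses of this sketch; only the deque storage of (1, -1) tensors, the np.random.permutation, the batch size of 64, and the detach() come from the description:

from collections import deque
import numpy as np
import torch

class Memory:
    def __init__(self):
        # one deque per quantity; each entry is stored as a (1, -1) tensor
        self.advantages, self.states = deque(), deque()
        self.log_probs, self.returns, self.actions = deque(), deque(), deque()

    def add(self, advantage, state, log_prob, ret, action):
        for dq, value in zip((self.advantages, self.states, self.log_probs,
                              self.returns, self.actions),
                             (advantage, state, log_prob, ret, action)):
            dq.append(value.view(1, -1))

    def sample(self, batch_size=64):
        # shuffle all stored entries, then split them into batches of 64
        indices = np.random.permutation(len(self.advantages))
        batches = [indices[i:i + batch_size]
                   for i in range(0, len(indices), batch_size)]

        def stacked(dq):
            data = torch.cat(list(dq), dim=0).detach()
            # keep only full batches so they stack into one tensor (assumption of this sketch)
            full = [data[torch.as_tensor(idx)] for idx in batches if len(idx) == batch_size]
            return torch.stack(full)        # shape: (num_batches, batch_size, dim)

        return (stacked(self.advantages), stacked(self.states),
                stacked(self.log_probs), stacked(self.returns),
                stacked(self.actions))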

Environment

brain_name = envs.brain_names[0]
env_info = envs.reset(train_mode=True)[brain_name]                  # reset the 20 parallel agents
env_info = envs.step(actions.cpu().detach().numpy())[brain_name]    # step all agents with the sampled actions
next_states = env_info.vector_observations
rewards = env_info.rewards
dones = env_info.local_done
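In context, these calls sit inside a rollout loop that fills the memory before each update. A hedged sketch of such a loop (the trajectory list, the 1000-step range, and the variable names are assumptions; device and policy are the same objects used in the update code below):

trajectory = []
states = env_info.vector_observations
for t in range(1000):                                               # episode length
    dist, values = policy(torch.from_numpy(states).float().to(device))
    actions = dist.sample()
    log_probs = dist.log_prob(actions).sum(-1).unsqueeze(-1)

    env_info = envs.step(actions.cpu().detach().numpy())[brain_name]
    next_states = env_info.vector_observations
    rewards = env_info.rewards
    dones = env_info.local_done

    # store one step for all 20 agents; GAE and memory.add() would run afterwards
    trajectory.append((states, actions, log_probs, values, rewards, dones))
    states = next_states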

Data sampling and update step

def clipped_surrogate_update(policy, memory, num_epochs=10, clip_param=0.2, gradient_clip=5, beta=0.001, value_loss_coeff=0.5):

    advantages_batch, states_batch, log_probs_old_batch, returns_batch, actions_batch = memory.sample()

    # normalize advantages over the whole rollout
    advantages_batch = (advantages_batch - advantages_batch.mean()) / advantages_batch.std()

    for _ in range(num_epochs):
        for i in range(len(advantages_batch)):

            advantages_sample = advantages_batch[i]
            states_sample = states_batch[i]
            log_probs_old_sample = log_probs_old_batch[i]
            returns_sample = returns_batch[i]
            actions_sample = actions_batch[i]

            dist, values = policy(states_sample)

            log_probs_new = dist.log_prob(actions_sample.to(device)).sum(-1).unsqueeze(-1)
            entropy = dist.entropy().sum(-1).unsqueeze(-1).mean()

            # probability ratio between the new and the old policy
            ratio = (log_probs_new - log_probs_old_sample).exp()

            clipped_ratio = torch.clamp(ratio, 1-clip_param, 1+clip_param)
            clipped_surrogate_loss = -torch.min(ratio*advantages_sample, clipped_ratio*advantages_sample).mean()
            value_function_loss = (returns_sample - values).pow(2).mean()

            # combined loss: clipped surrogate, entropy bonus, and critic MSE
            Loss = clipped_surrogate_loss - beta * entropy + value_loss_coeff * value_function_loss

            optimizer_policy.zero_grad()
            Loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), gradient_clip)
            optimizer_policy.step()
            del Loss

1 Answer:

Answer 0: (score: 0)

In the Generalized Advantage Estimation loop, advantages and returns were added in reverse order. The fix is to insert them at the front of their lists:

advantage_list.insert(0, advantages.detach())
return_list.insert(0, returns.detach())
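For context, a minimal sketch of a Generalized Advantage Estimation loop using that ordering fix. The dummy trajectory data, the discount factors, and the variable names are assumptions, not the original code; the point is that iterating backwards over the trajectory and prepending with insert(0, ...) keeps the resulting lists aligned with the forward-ordered states:

import torch

gamma, gae_lambda = 0.99, 0.95
# dummy trajectory data just to make the sketch runnable:
# T rewards/dones and T+1 value estimates (the extra entry is the bootstrap value)
T, num_agents = 5, 20
rewards = [torch.rand(num_agents, 1) for _ in range(T)]
dones = [torch.zeros(num_agents, 1) for _ in range(T)]
values = [torch.rand(num_agents, 1) for _ in range(T + 1)]

advantage_list, return_list = [], []
advantages = torch.zeros(num_agents, 1)
returns = values[-1].detach()                       # bootstrap from the last value estimate

for t in reversed(range(T)):
    not_done = 1.0 - dones[t]
    td_error = rewards[t] + gamma * not_done * values[t + 1] - values[t]
    advantages = td_error + gamma * gae_lambda * not_done * advantages
    returns = rewards[t] + gamma * not_done * returns

    # insert at the front so that index 0 corresponds to time step 0
    advantage_list.insert(0, advantages.detach())
    return_list.insert(0, returns.detach())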