I am building an actor-critic reinforcement learning algorithm to solve an environment, and I want to use an encoder to learn a representation of the environment.
When I share the encoder between the actor and the critic, my network fails to learn anything:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Encoder(nn.Module):
    def __init__(self, state_dim):
        super(Encoder, self).__init__()
        self.l1 = nn.Linear(state_dim, 512)

    def forward(self, state):
        a = F.relu(self.l1(state))
        return a
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 128)
        self.l3 = nn.Linear(128, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        # a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state, action):
        state_action = torch.cat([state, action], 1)
        q = F.relu(self.l1(state_action))
        # q = F.relu(self.l2(q))
        q = self.l3(q)
        return q
However, when I use one encoder for the actor and a different encoder for the critic, it learns properly:
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        state_action = torch.cat([state, action], 1)
        q = F.relu(self.l1(state_action))
        q = F.relu(self.l2(q))
        q = self.l3(q)
        return q
I am fairly certain this is due to the optimizers. In the shared-encoder code, I define them as follows:
self.actor_optimizer = optim.Adam(list(self.actor.parameters()) +
                                  list(self.encoder.parameters()))
self.critic_optimizer = optim.Adam(list(self.critic.parameters()) +
                                   list(self.encoder.parameters()))
With separate encoders, it is simply:
self.actor_optimizer = optim.Adam(self.actor.parameters())
self.critic_optimizer = optim.Adam(self.critic.parameters())
There have to be two optimizers because this is an actor-critic algorithm: the actor and the critic are trained on different losses.
How can I combine the two optimizers so that the encoder is optimized correctly?
Answer 0 (score: 0):
I am not sure exactly how you are sharing the encoder.
However, I would suggest creating a single instance of the encoder and passing it to both the actor and the critic:
encoder_net = Encoder(state_dim)
actor = Actor(encoder_net, state_dim, action_dim, max_action)
critic = Critic(encoder_net, state_dim)
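One consequence worth noting (my addition, assuming the class definitions shown below): because assigning the module via self.encoder = encoder registers it as a submodule, both networks expose the shared encoder's weights through their .parameters() iterators. A quick sanity check:

# Both parameter iterators yield the very same encoder tensors, since
# nn.Module registers any module assigned to an attribute as a submodule.
shared = set(encoder_net.parameters())
assert shared <= set(actor.parameters())
assert shared <= set(critic.parameters())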
During the forward pass, first run the state batch through the encoder and then through the rest of the network, for example:
class Encoder(nn.Module):
    def __init__(self, state_dim):
        super(Encoder, self).__init__()
        self.l1 = nn.Linear(state_dim, 512)

    def forward(self, state):
        a = F.relu(self.l1(state))
        return a

class Actor(nn.Module):
    def __init__(self, encoder, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.encoder = encoder
        self.l1 = nn.Linear(512, 128)
        self.l3 = nn.Linear(128, action_dim)
        self.max_action = max_action

    def forward(self, state):
        state = self.encoder(state)
        a = F.relu(self.l1(state))
        # a = F.relu(self.l2(a))
        a = torch.tanh(self.l3(a)) * self.max_action
        return a

class Critic(nn.Module):
    def __init__(self, encoder, state_dim):
        super(Critic, self).__init__()
        self.encoder = encoder
        self.l1 = nn.Linear(512, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state):
        state = self.encoder(state)
        q = F.relu(self.l1(state))
        # q = F.relu(self.l2(q))
        q = self.l3(q)
        return q
Note: the critic network is now a function approximator for the state-value function V(s), not for the state-action value function Q(s, a).
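If you would rather keep a state-action critic Q(s, a) while still sharing the encoder, a minimal sketch (my variant, not part of the original answer; QCritic is a hypothetical name) is to encode the state first and concatenate the action afterwards:

class QCritic(nn.Module):
    """Hypothetical Q(s, a) critic sharing the encoder: the 512-dim
    state encoding is concatenated with the raw action before the head."""
    def __init__(self, encoder, state_dim, action_dim):
        super(QCritic, self).__init__()
        self.encoder = encoder
        self.l1 = nn.Linear(512 + action_dim, 128)
        self.l3 = nn.Linear(128, 1)

    def forward(self, state, action):
        state = self.encoder(state)
        q = F.relu(self.l1(torch.cat([state, action], 1)))
        return self.l3(q)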
With this shared-instance implementation, you can run the optimization without passing the encoder parameters to the optimizers explicitly, like so:
self.actor_optimizer = optim.Adam(self.actor.parameters())
self.critic_optimizer = optim.Adam(self.critic.parameters())
because the encoder parameters are now shared between the two networks.
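For completeness, a single update step might then look like the sketch below (my addition; the losses are placeholders standing in for whatever your algorithm computes, and the shapes are hypothetical). Both backward() calls propagate gradients into the shared encoder, and each optimizer updates it along with its own head:

import torch
import torch.nn.functional as F
import torch.optim as optim

state_dim, action_dim, max_action, batch = 8, 2, 1.0, 32  # hypothetical

encoder_net = Encoder(state_dim)
actor = Actor(encoder_net, state_dim, action_dim, max_action)
critic = Critic(encoder_net, state_dim)
actor_optimizer = optim.Adam(actor.parameters())
critic_optimizer = optim.Adam(critic.parameters())

state = torch.randn(batch, state_dim)
target = torch.randn(batch, 1)  # placeholder value target

# Critic update: gradients reach the critic head and the shared encoder.
critic_optimizer.zero_grad()
critic_loss = F.mse_loss(critic(state), target)
critic_loss.backward()
critic_optimizer.step()

# Actor update: placeholder objective; gradients again reach the encoder.
actor_optimizer.zero_grad()
actor_loss = (actor(state) ** 2).mean()
actor_loss.backward()
actor_optimizer.step()

One design consequence to be aware of: since the encoder's parameters appear in both optimizers, each Adam instance maintains its own moment estimates for them, so the encoder receives two independent adaptive updates per iteration; whether that is acceptable, or whether a single dedicated encoder optimizer works better, is worth verifying empirically.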
Hope this helps! :)