Building a reward function for an OpenAI RL environment for raw-material purchasing

Asked: 2021-01-12 22:11:46

Tags: python openai-gym

I am experimenting with deep reinforcement learning and have created the environment below, in which I simulate purchasing raw material. The starting quantity is the amount of material I have on hand going into the 12-week simulation (sim_weeks). I must purchase in multiples of 195000 lbs, and I expect to use 45000 lbs of material per week.

import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

start_qty = 100000      # lbs on hand at the start of the simulation
sim_weeks = 12          # purchase horizon in weeks
purchase_mult = 195000  # material must be bought in multiples of this
forecast_qty = 45000    # expected weekly usage in lbs


class ResinEnv(Env):
    def __init__(self):
        # Actions we can take: buy 0, buy 1x,
        self.action_space = Discrete(2)
        # purchase array space...
        self.observation_space = Box(low=np.array([-1000000]), high=np.array([1000000]))
        # Set start qty
        self.state = start_qty
        # Set purchase length
        self.purchase_length = sim_weeks
        #self.current_step = 1
        
    def step(self, action):
        # Apply action
        #this gives us qty_available at the end of the week
        self.state-=forecast_qty
        
        #see if we need to buy
        self.state += (action*purchase_mult)
       
        
        # Calculate days on hand: weeks of supply (state / weekly usage) * 7
        days = self.state / forecast_qty * 7
        
        
        # Reduce weeks left to purchase by 1 week
        self.purchase_length -= 1 
        #self.current_step+=1
        
        # Calculate reward: reward is the negative of days_on_hand
        if self.state<0:
            reward = -10000
        else:
            reward = -days
        
        # Check if the episode is done
        if self.purchase_length <= 0: 
            done = True
        else:
            done = False
        
        # Set placeholder for info
        info = {}
        
        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Implement viz
        pass
    
    def reset(self):
        # Reset qty
        self.state = start_qty
        self.purchase_length = sim_weeks
        
        return self.state
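To see how an episode plays out, the step logic above can be rolled forward by hand. This is a minimal sketch with no gym dependency: it reimplements the same inventory update and reward as plain Python, choosing actions at random for one 12-week episode (the real environment would be driven by an agent instead of `random.randint`).

```python
import random

start_qty, sim_weeks = 100000, 12
purchase_mult, forecast_qty = 195000, 45000

state, total_reward = start_qty, 0.0
random.seed(0)
for week in range(sim_weeks):
    action = random.randint(0, 1)        # 0 = don't buy, 1 = buy one lot
    state -= forecast_qty                # consume the week's material
    state += action * purchase_mult      # optionally receive a purchase
    days = state / forecast_qty * 7      # days of supply on hand
    total_reward += -10000 if state < 0 else -days
print(week + 1, state, round(total_reward, 1))
```

The episode return is the (negative) sum of days on hand across the 12 weeks, plus any stockout penalties, which is exactly the quantity the agent is asked to maximize.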

I am debating whether this reward function is sufficient. What I am trying to do is minimize the sum of days on hand over all steps, where days on hand for a given step is the `days` value in the code. Since the goal is to maximize the reward, I figured I could negate days on hand and use that negative value as the reward (so maximizing reward minimizes days on hand). I then added a heavy penalty for letting the quantity on hand go negative in any given week.
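The reward logic described above can be checked in isolation. This is a minimal sketch (the helper `reward_for` is hypothetical, not part of the environment) that applies one week's dynamics and returns the reward alongside the new state:

```python
forecast_qty = 45000
purchase_mult = 195000

def reward_for(state, action):
    """One week of dynamics: consume, optionally buy, then score."""
    state -= forecast_qty              # consume a week's material
    state += action * purchase_mult    # receive purchase if action == 1
    days = state / forecast_qty * 7    # weeks of supply -> days on hand
    reward = -10000 if state < 0 else -days
    return reward, state

r, s = reward_for(100000, 0)   # r ≈ -8.56, s == 55000
r2, _ = reward_for(40000, 0)   # stockout: r2 == -10000
```

Note how abrupt the jump is from roughly -8 to -10000 at the stockout boundary; that discontinuity is part of what the answer below is reacting to.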

Is there a better way to do this? I am new to this subject and new to Python. Any input is greatly appreciated!

1 Answer:

Answer (score: 1)

I think you should consider reducing the scale of the rewards. Check here and here for stabilizing training in neural networks. If minimizing days on hand is the RL agent's only task, then the reward scheme makes sense. It just needs some normalization!
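One way to act on this advice is to rescale both reward terms into a similar, small range. This is a minimal sketch under stated assumptions: `MAX_DAYS` and `STOCKOUT_PENALTY` are hypothetical constants chosen here for illustration, not values from the question.

```python
MAX_DAYS = 60.0           # assumed rough upper bound on days on hand
STOCKOUT_PENALTY = -10.0  # scaled-down stand-in for the -10000 penalty

def scaled_reward(state, forecast_qty=45000):
    """Negative days on hand, scaled to roughly [-1, 0]; fixed penalty on stockout."""
    if state < 0:
        return STOCKOUT_PENALTY
    days = state / forecast_qty * 7
    return -days / MAX_DAYS
```

With this scaling, a week with 45000 lbs on hand (7 days of supply) scores -7/60 ≈ -0.12, and a stockout scores -10, so the penalty still dominates without being five orders of magnitude larger than a typical step reward.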
