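I'm experimenting with deep reinforcement learning and have created the following environment, which simulates purchasing a raw material. The start quantity is the amount of material I have on hand going into the next 12 weeks (sim_weeks). I have to purchase in multiples of 195,000 pounds, and I expect to use 45,000 pounds of material per week.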
# Assumed imports (classic gym API); the original snippet omitted them
import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

start_qty = 100000
sim_weeks = 12
purchase_mult = 195000
#days on hand cost =
forecast_qty = 45000

class ResinEnv(Env):
    def __init__(self):
        # Actions we can take: buy 0, or buy 1x purchase_mult
        self.action_space = Discrete(2)
        # purchase array space...
        self.observation_space = Box(low=np.array([-1000000]), high=np.array([1000000]))
        # Set start qty
        self.state = start_qty
        # Set purchase length (number of weeks left to simulate)
        self.purchase_length = sim_weeks
        #self.current_step = 1

    def step(self, action):
        # Apply action
        # Subtract this week's forecast usage
        self.state -= forecast_qty
        # Add the purchase, if any; this gives qty available at the end of the week
        self.state += action * purchase_mult
        # Days on hand: weeks of supply (qty / weekly usage) times 7.
        # The original divided by 7 here, which gives weeks / 7 rather than days.
        days = self.state / forecast_qty * 7
        # Reduce weeks left to purchase by 1 week
        self.purchase_length -= 1
        #self.current_step+=1
        # Calculate reward: the negative of days on hand,
        # or a large penalty if the available quantity went negative
        if self.state < 0:
            reward = -10000
        else:
            reward = -days
        # Check if the purchasing horizon is over
        done = self.purchase_length <= 0
        # Set placeholder for info
        info = {}
        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Implement viz
        pass

    def reset(self):
        # Reset qty and weeks left
        self.state = start_qty
        self.purchase_length = sim_weeks
        return self.state
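To make the dynamics concrete, here is a minimal sketch that rolls the same weekly update forward without gym. The helpers `simulate` and `buy_when_short` are hypothetical names introduced only for illustration, and the reorder rule (buy one lot only when the week would otherwise end below zero) is an assumed baseline, not something the agent is known to learn.

```python
# Standalone sketch of the weekly inventory update (no gym required).
# Constants mirror the values used in the environment.
START_QTY = 100000
SIM_WEEKS = 12
PURCHASE_MULT = 195000
FORECAST_QTY = 45000


def simulate(actions, start_qty=START_QTY):
    """Return end-of-week quantities for a sequence of 0/1 buy actions."""
    qty = start_qty
    history = []
    for action in actions:
        qty -= FORECAST_QTY            # weekly usage
        qty += action * PURCHASE_MULT  # optional purchase of one lot
        history.append(qty)
    return history


def buy_when_short(weeks=SIM_WEEKS, start_qty=START_QTY):
    """Hypothetical rule: buy one lot only if the week would otherwise end below zero."""
    qty = start_qty
    actions = []
    for _ in range(weeks):
        action = 1 if qty - FORECAST_QTY < 0 else 0
        qty += action * PURCHASE_MULT - FORECAST_QTY
        actions.append(action)
    return actions


actions = buy_when_short()
history = simulate(actions)
print(actions)   # which weeks trigger a purchase
print(history)   # end-of-week quantity for each of the 12 weeks
```

Under this rule the quantity never goes negative, which gives a hand-written baseline to compare a trained agent's episode reward against.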
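I'm debating whether the reward function is sufficient. What I'm trying to do is minimize the sum of days on hand over all steps, where days on hand for a given step is defined by `days` in the code. Since the agent's goal is to maximize reward, I decided to negate days on hand and use that negative number as the reward, so that maximizing the reward minimizes days on hand. I then added a strong penalty for letting the available quantity go negative in any given week.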
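The reward described here can also be written as a standalone function, which makes it easier to test and to compare episode totals. This is only a sketch: it assumes days on hand means quantity divided by daily usage (weekly forecast spread over 7 days), and `days_on_hand`, `weekly_reward`, `episode_return`, and `STOCKOUT_PENALTY` are names made up for illustration.

```python
FORECAST_QTY = 45000
STOCKOUT_PENALTY = -10000  # same fixed penalty as in the environment


def days_on_hand(qty, forecast_qty=FORECAST_QTY):
    """Days of supply, assuming usage is spread evenly over a 7-day week."""
    return 7 * qty / forecast_qty


def weekly_reward(qty):
    """Negative days on hand, or a fixed penalty if the quantity went negative."""
    if qty < 0:
        return STOCKOUT_PENALTY
    return -days_on_hand(qty)


def episode_return(quantities):
    """Sum of weekly rewards over a list of end-of-week quantities."""
    return sum(weekly_reward(q) for q in quantities)


print(weekly_reward(45000))            # one week of stock: -7.0
print(weekly_reward(-1))               # any shortfall: -10000
print(episode_return([45000, 90000]))  # -7.0 + -14.0 = -21.0
```

One thing this makes visible: the penalty is the same whether the shortfall is 1 pound or 400,000 pounds, so the reward does not distinguish a small stockout from a large one.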
Is there a better way to do this? I'm new to this subject and also new to Python. Any input is greatly appreciated!