Question

I am studying GridWorld from a Q-learning perspective, and I have a question about the following:

1) In the grid-world example, rewards are positive for goals, negative
   for running into the edge of the world, and zero the rest of the time.
   Are the signs of these rewards important, or only the intervals
   between them?

Answer 1

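Remember that Q-values are expected values. The policy is extracted by choosing, in each state, the action that maximizes the Q-function.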

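Note that you can add a constant to all Q-values without affecting the policy. Shifting every Q-value by the same amount does not matter, because the ordering of the Q-values relative to the maximum stays the same. In fact, you can apply any affine transformation with a positive scale (Q' = a * Q + b, with a > 0) and your decisions will not change. A negative a would reverse the ordering, so the greedy action would become the one that was previously the worst.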
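As a quick check of the invariance described above, here is a minimal sketch in Python. The Q-table values and the helper names `greedy_policy` and `affine` are made up for illustration and do not come from any particular GridWorld implementation. It compares the greedy policy before and after shifting, scaling by a positive factor, and negating the Q-values.

```python
def greedy_policy(q_table):
    """For each state (row), return the index of the action with the highest Q-value."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_table]


def affine(q_table, scale, shift):
    """Apply Q' = scale * Q + shift to every entry of the table."""
    return [[scale * q + shift for q in row] for row in q_table]


# Hypothetical Q-table: 3 states x 4 actions, values chosen arbitrarily.
q = [
    [0.0, 1.0, -1.0, 0.5],
    [2.0, -3.0, 0.25, 1.5],
    [-0.5, -0.25, -2.0, -1.0],
]

base = greedy_policy(q)
shifted = greedy_policy(affine(q, 1.0, 10.0))   # add a constant only
scaled = greedy_policy(affine(q, 3.0, -2.0))    # positive scale plus shift
flipped = greedy_policy(affine(q, -1.0, 0.0))   # negative scale reverses the ordering

print(base == shifted)  # True
print(base == scaled)   # True
print(base == flipped)  # False
```

The third state has only negative Q-values, yet its greedy action is unaffected by the shift, which illustrates that it is the ordering of the values, not their sign, that determines the choice.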