Fix sign error in Q-learning gradient-descent update equations #2701

Open

Chessing234 wants to merge 1 commit into d2l-ai:master from Chessing234:fix/qlearning-gradient-sign

Conversation

@Chessing234

Closes #2649.

Bug

In chapter_reinforcement-learning/qlearning.md (section 17.3), the expanded form of the gradient-descent Q-learning update has a flipped sign:

Q <- Q - alpha * grad_Q l(Q)
    = (1 - alpha) Q - alpha ( r + gamma max_a' Q' )    <-- wrong

The same flipped sign also appears in the next equation (terminal-state variant).
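For reference, here is the corrected pair in LaTeX (paraphrasing the book's state-action notation; I'm assuming the terminal-state variant simply drops the bootstrap term, which is the standard convention):

```latex
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a)
                 + \alpha \left( r(s, a) + \gamma \max_{a'} Q(s', a') \right)
% terminal-state variant (no next state to bootstrap from):
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\, r(s, a)
```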

Root cause

The loss is l(Q) = (Q - r - gamma max_a' Q')^2, so grad_Q l = 2 (Q - r - gamma max_a' Q'). Substituting into the gradient step and absorbing the factor of 2 into alpha:

Q - alpha * (Q - r - gamma max_a' Q')
  = Q - alpha*Q + alpha*(r + gamma max_a' Q')
  = (1 - alpha) Q + alpha (r + gamma max_a' Q')

The book's expansion kept the minus sign from the gradient-descent step instead of distributing it through the parentheses, producing (1 - alpha) Q - alpha (...).
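A quick numeric sanity check (all values made up purely for illustration) confirms that the two expansions genuinely differ:

```python
import numpy as np

# Made-up scalars, chosen only to illustrate the sign error.
alpha, gamma = 0.5, 0.9
Q, r, Q_next_max = 1.0, 2.0, 3.0

target = r + gamma * Q_next_max                 # TD target: r + gamma max_a' Q'

grad_step = Q - alpha * (Q - target)            # Q - alpha * grad_Q l(Q), factor 2 absorbed
fixed     = (1 - alpha) * Q + alpha * target    # expansion after this fix
book      = (1 - alpha) * Q - alpha * target    # book's current expansion

print(grad_step, fixed, book)                   # roughly 2.85, 2.85, -1.85
assert np.isclose(grad_step, fixed)             # gradient step matches the fixed form
assert not np.isclose(grad_step, book)          # ... and not the book's form
```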

Why the fix is correct

The in-chapter Python implementation at line 137 of the same file already uses the correct form:

y = reward + gamma * np.max(Q[next_state,:])
Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])

Expanding the update gives (1 - alpha) Q + alpha * y, matching the fixed equation. The text and the code were inconsistent; this commit makes the text match the code (and the standard Q-learning update rule).
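To double-check the code/equation correspondence, here is a minimal sketch (toy Q-table, arbitrary values, all made up) showing that the incremental form used in the chapter's code equals the convex-combination form in the fixed equation:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.random((4, 2))                  # toy Q-table: 4 states, 2 actions (arbitrary)
state, action, next_state = 1, 0, 3
reward, alpha, gamma = 1.0, 0.1, 0.95

y = reward + gamma * np.max(Q[next_state, :])

incremental = Q[state, action] + alpha * (y - Q[state, action])  # form in the chapter's code
convex      = (1 - alpha) * Q[state, action] + alpha * y         # form in the fixed equation

assert np.isclose(incremental, convex)  # algebraically identical
```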

Change

Two single-character edits (`-` → `+`) on the two equation lines. No other changes.

Closes d2l-ai#2649.

In section 17.3, the expanded form of the gradient-descent update on the
Q-learning loss is written as:

  Q <- Q - alpha * grad_Q l(Q)
      = (1 - alpha) Q - alpha (r + gamma max_a' Q')   (wrong sign)

The loss is l(Q) = (Q - r - gamma max_a' Q')^2, so grad_Q l = 2 (Q - r -
gamma max_a' Q'). Substituting into the gradient step and absorbing the
factor of 2 into alpha gives:

  Q - alpha (Q - r - gamma max_a' Q')
    = (1 - alpha) Q + alpha (r + gamma max_a' Q')

The textbook expansion flipped the second sign. The next equation
(terminal-state variant) inherits the same flipped sign. The in-chapter
code at line 137 already uses the correct form

  Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])

with y = r + gamma max_a' Q', so the equations were inconsistent with
the code they are supposed to describe. This commit flips the two
offending minus signs to plus.


Development

Successfully merging this pull request may close these issues.

Error in chapter 17, section 3 "Q-learning", equation 17.3.3 and 17.3.4.
