Fix sign error in Q-learning gradient-descent update equations #2701

Open

Chessing234 wants to merge 1 commit into d2l-ai:master from Chessing234:fix/qlearning-gradient-sign

Conversation

@Chessing234

Closes #2649.

Bug

In chapter_reinforcement-learning/qlearning.md (section 17.3), the expanded form of the gradient-descent Q-learning update has a flipped sign:

Q <- Q - alpha * grad_Q l(Q)
    = (1 - alpha) Q - alpha ( r + gamma max_a' Q' )    <-- wrong

The same flipped sign also appears in the next equation (terminal-state variant).
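For reference, here is the corrected pair in LaTeX (paraphrasing the book's state-action notation; I'm assuming the terminal-state variant simply drops the bootstrap term, which is the standard convention):

```latex
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a)
                 + \alpha \left( r(s, a) + \gamma \max_{a'} Q(s', a') \right)
% terminal-state variant (no next state to bootstrap from):
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\, r(s, a)
```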

Root cause

The loss is l(Q) = (Q - r - gamma max_a' Q')^2, so grad_Q l = 2 (Q - r - gamma max_a' Q'). Substituting into the gradient step and absorbing the factor of 2 into alpha:

Q - alpha * (Q - r - gamma max_a' Q')
  = Q - alpha*Q + alpha*(r + gamma max_a' Q')
  = (1 - alpha) Q + alpha (r + gamma max_a' Q')

The book's expansion kept the minus sign from the gradient-descent step instead of distributing it through the parentheses, producing (1 - alpha) Q - alpha (...).
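A quick numeric sanity check (all values made up purely for illustration) confirms that the two expansions genuinely differ:

```python
import numpy as np

# Made-up scalars, chosen only to illustrate the sign error.
alpha, gamma = 0.5, 0.9
Q, r, Q_next_max = 1.0, 2.0, 3.0

target = r + gamma * Q_next_max                 # TD target: r + gamma max_a' Q'

grad_step = Q - alpha * (Q - target)            # Q - alpha * grad_Q l(Q), factor 2 absorbed
fixed     = (1 - alpha) * Q + alpha * target    # expansion after this fix
book      = (1 - alpha) * Q - alpha * target    # book's current expansion

print(grad_step, fixed, book)                   # roughly 2.85, 2.85, -1.85
assert np.isclose(grad_step, fixed)             # gradient step matches the fixed form
assert not np.isclose(grad_step, book)          # ... and not the book's form
```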

Why the fix is correct

The in-chapter Python implementation at line 137 of the same file already uses the correct form:

y = reward + gamma * np.max(Q[next_state,:])
Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])

Expanding the update gives (1 - alpha) Q + alpha * y, matching the fixed equation. The text and the code were inconsistent; this commit makes the text match the code (and the standard Q-learning update rule).
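To double-check the code/equation correspondence, here is a minimal sketch (toy Q-table, arbitrary values, all made up) showing that the incremental form used in the chapter's code equals the convex-combination form in the fixed equation:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.random((4, 2))                  # toy Q-table: 4 states, 2 actions (arbitrary)
state, action, next_state = 1, 0, 3
reward, alpha, gamma = 1.0, 0.1, 0.95

y = reward + gamma * np.max(Q[next_state, :])

incremental = Q[state, action] + alpha * (y - Q[state, action])  # form in the chapter's code
convex      = (1 - alpha) * Q[state, action] + alpha * y         # form in the fixed equation

assert np.isclose(incremental, convex)  # algebraically identical
```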

Change

Two single-character edits (`-` → `+`) on the two equation lines. No other changes.

Closes d2l-ai#2649.

In section 17.3, the expanded form of the gradient-descent update on the
Q-learning loss is written as:

  Q <- Q - alpha * grad_Q l(Q)
      = (1 - alpha) Q - alpha (r + gamma max_a' Q')   (wrong sign)

The loss is l(Q) = (Q - r - gamma max_a' Q')^2, so grad_Q l = 2 (Q - r -
gamma max_a' Q'). Substituting into the gradient step and absorbing the
factor of 2 into alpha gives:

  Q - alpha (Q - r - gamma max_a' Q')
    = (1 - alpha) Q + alpha (r + gamma max_a' Q')

The textbook expansion flipped the second sign. The next equation
(terminal-state variant) inherits the same flipped sign. The in-chapter
code at line 137 already uses the correct form

  Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])

with y = r + gamma max_a' Q', so the equations were inconsistent with
the code they are supposed to describe. This commit flips the two
offending minus signs to plus.


Development

Successfully merging this pull request may close these issues.

Error in chapter 17, section 3 "Q-learning", equation 17.3.3 and 17.3.4.
