Merged · 42 commits
18b643b  add draft of SAC discrete implementation (timoklein, Aug 29, 2022)
c3c98bd  run pre-commit (timoklein, Aug 29, 2022)
ec31dc4  Use log softmax instead of author's log-pi code (timoklein, Aug 31, 2022)
deb37e8  Revert to cleanrl SAC delay implementation (it's more stable) (timoklein, Aug 31, 2022)
a1fdd2b  Remove docstrings and duplicate code (timoklein, Aug 31, 2022)
977a83a  Use correct clipreward wrapper (timoklein, Aug 31, 2022)
f2ea3e6  fix bug in log softmax calculation (timoklein, Sep 6, 2022)
48af04c  adhere to cleanrl log_prob naming (timoklein, Sep 6, 2022)
b2a09a0  fix bug in entropy target calculation (timoklein, Sep 6, 2022)
89680c7  change layer initialization to match existing cleanrl codebase (timoklein, Sep 6, 2022)
b1d7d44  working minimal diff version (timoklein, Sep 19, 2022)
61e1c74  implement original learning update frequency (timoklein, Sep 20, 2022)
7cd1e3a  parameterize the entropy scale for autotuning (timoklein, Oct 3, 2022)
61c46fc  add benchmarking script (timoklein, Oct 3, 2022)
4915e4c  rename target entropy factor and set new default value (timoklein, Oct 6, 2022)
6f7251f  add docs draft (timoklein, Nov 5, 2022)
23b60ff  fix SAC-discrete links to work pre merge (timoklein, Nov 10, 2022)
10ee9f0  add preliminary result table for SAC-discrete (timoklein, Nov 10, 2022)
8430fd8  clean up todos and add header (timoklein, Nov 10, 2022)
a17768c  minimize diff between sac_atari and sac_continuous (timoklein, Nov 11, 2022)
d6a507c  add sac-discrete end2end test (timoklein, Nov 11, 2022)
a7ea6f4  SAC-discrete docs rework (timoklein, Nov 11, 2022)
9f6493c  Update SAC-discrete @100k results (timoklein, Nov 12, 2022)
59a6d00  Fix doc links and unify naming in code (timoklein, Nov 12, 2022)
1304b7a  update docs (vwxyzjn, Nov 13, 2022)
3a3f41b  fix target update frequency (see PR #323) (timoklein, Nov 24, 2022)
80187ad  clarify comment regarding CNN encoder sharing (timoklein, Nov 24, 2022)
e9cb494  Merge remote-tracking branch 'upstream/master' into sac-discrete (timoklein, Nov 25, 2022)
e199e39  fix benchmark installation (timoklein, Nov 25, 2022)
bb27fa1  fix eps in minimal diff version and improve code readability (timoklein, Dec 3, 2022)
6a46632  add docs for eps and finalize code (timoklein, Dec 5, 2022)
cad5fff  use no_grad for actor Q-vals and re-use action-probs & log-probs in alpha loss (timoklein, Dec 7, 2022)
0cf47f1  update docs for new code and settings (timoklein, Dec 14, 2022)
61988c4  fix links to point to main branch (timoklein, Dec 14, 2022)
6e17005  update sac-discrete training plots (timoklein, Dec 19, 2022)
33b00f3  new sac-d training plots (timoklein, Dec 19, 2022)
5dabafb  update results table and fix link (timoklein, Dec 19, 2022)
90b2fd5  fix pong chart title (timoklein, Dec 19, 2022)
a763994  add Jimmy Ba name as exception to code spell check (timoklein, Jan 13, 2023)
071cdbb  change target_entropy_scale default value to same value as experiments (timoklein, Jan 13, 2023)
dcc2633  Merge remote-tracking branch 'upstream/master' into sac-discrete (timoklein, Jan 13, 2023)
c671a92  remove blank line at end of pre-commit (timoklein, Jan 13, 2023)
use no_grad for actor Q-vals and re-use action-probs & log-probs in alpha loss
timoklein committed Dec 7, 2022
commit cad5fff474bf43e9463ccea1e5e42e14f1395db9
15 changes: 7 additions & 8 deletions cleanrl/sac_atari.py
```diff
@@ -70,7 +70,7 @@ def parse_args():
         help="Entropy regularization coefficient.")
     parser.add_argument("--autotune", type=lambda x:bool(strtobool(x)), default=True, nargs="?", const=True,
         help="automatic tuning of the entropy coefficient")
-    parser.add_argument("--target-entropy-scale", type=float, default=0.88,
+    parser.add_argument("--target-entropy-scale", type=float, default=0.9,
         help="coefficient for scaling the autotune entropy target")
     args = parser.parse_args()
     # fmt: on
@@ -298,9 +298,10 @@ def get_action(self, x):

         # ACTOR training
         _, log_pi, action_probs = actor.get_action(data.observations)
-        qf1_values = qf1(data.observations)
-        qf2_values = qf2(data.observations)
-        min_qf_values = torch.min(qf1_values, qf2_values)
+        with torch.no_grad():
+            qf1_values = qf1(data.observations)
+            qf2_values = qf2(data.observations)
+            min_qf_values = torch.min(qf1_values, qf2_values)
         # no need for reparameterization, the expectation can be calculated for discrete actions
         actor_loss = (action_probs * ((alpha * log_pi) - min_qf_values)).mean()

@@ -309,10 +310,8 @@ def get_action(self, x):
         actor_optimizer.step()

         if args.autotune:
-            # use action probabilities for temperature loss
-            with torch.no_grad():
-                _, log_pi, action_probs = actor.get_action(data.observations)
-            alpha_loss = (action_probs * (-log_alpha * (log_pi + target_entropy))).mean()
+            # re-use action probabilities for temperature loss
+            alpha_loss = (action_probs.detach() * (-log_alpha * (log_pi + target_entropy).detach())).mean()

             a_optimizer.zero_grad()
             alpha_loss.backward()
```
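A minimal sketch of the changed update illustrates why the `no_grad`/`detach` calls are safe. The `nn.Linear` stand-ins below are toy placeholders for cleanrl's real `Actor` and `SoftQNetwork` architectures, and the random batch replaces the replay-buffer sample; only the loss expressions mirror the diff.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_obs, n_actions, batch = 4, 6, 8

actor_net = nn.Linear(n_obs, n_actions)  # toy stand-in for the Actor
qf1 = nn.Linear(n_obs, n_actions)        # toy stand-ins for the two Q-networks
qf2 = nn.Linear(n_obs, n_actions)

def get_action(x):
    logits = actor_net(x)
    action_probs = F.softmax(logits, dim=1)
    log_pi = F.log_softmax(logits, dim=1)  # log-probs of ALL actions
    action = torch.multinomial(action_probs, num_samples=1)
    return action, log_pi, action_probs

obs = torch.randn(batch, n_obs)
log_alpha = torch.zeros(1, requires_grad=True)
alpha = log_alpha.exp().item()
target_entropy = -0.9 * torch.log(1 / torch.tensor(float(n_actions)))

# ACTOR loss: the Q-values only act as fixed weights on the policy terms,
# so wrapping the critics in no_grad skips building their graph without
# changing the actor's gradient.
_, log_pi, action_probs = get_action(obs)
with torch.no_grad():
    min_qf_values = torch.min(qf1(obs), qf2(obs))
actor_loss = (action_probs * ((alpha * log_pi) - min_qf_values)).mean()

# ALPHA loss: re-uses the probs/log-probs from the same actor pass;
# detach() cuts the graph back into the actor, so only log_alpha
# receives a gradient from this loss.
alpha_loss = (action_probs.detach() * (-log_alpha * (log_pi + target_entropy).detach())).mean()
```

Since the minimum Q-values enter the actor loss only as weights, skipping their graph saves memory and compute, and detaching the re-used probabilities keeps the temperature update from back-propagating into the actor a second time.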
2 changes: 1 addition & 1 deletion docs/rl-algorithms/sac.md
````diff
@@ -354,7 +354,7 @@ Surpassing Human-Level Performance on ImageNet Classification"](https://arxiv.or
    alpha = log_alpha.exp().item()
    ```

-4. [`sac_atari.py`](https://github.com/timoklein/cleanrl/blob/sac-discrete/cleanrl/sac_atari.py) uses `--target-entropy-scale=0.88` while the [SAC-discrete paper](https://arxiv.org/abs/1910.07207) uses `--target-entropy-scale=0.98` due to improved stability when training for more than 100k steps. Tuning this parameter to the environment at hand is advised and can lead to significant performance gains.
+4. [`sac_atari.py`](https://github.com/timoklein/cleanrl/blob/sac-discrete/cleanrl/sac_atari.py) uses `--target-entropy-scale=0.9` while the [SAC-discrete paper](https://arxiv.org/abs/1910.07207) uses `--target-entropy-scale=0.98` due to improved stability when training for more than 100k steps. Tuning this parameter to the environment at hand is advised and can lead to significant performance gains.

 5. [`sac_atari.py`](https://github.com/timoklein/cleanrl/blob/sac-discrete/cleanrl/sac_atari.py) performs learning updates only on every $n^{\text{th}}$ step. This leads to improved stability and prevents the agent's performance from degenerating during longer training runs.
    Note the difference to [`sac_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/sac_continuous_action.py): [`sac_atari.py`](https://github.com/timoklein/cleanrl/blob/sac-discrete/cleanrl/sac_atari.py) updates every $n^{\text{th}}$ environment step and does a single update of actor and critic on every update step. [`sac_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/sac_continuous_action.py) updates the critic every step and the actor every $n^{\text{th}}$ step. It then compensates for the delayed actor updates by performing $n$ actor update steps.
````