
Can Curiosity Driven Agents Play Doom?

My effort to get a curiosity-driven agent to complete the Doom E1M1 level "The Hangar".

1. Install & Run

1.1 Clone & assets

git clone https://github.com/pythonlearner1025/BoredDoomGuy.git
cd BoredDoomGuy

Download/locate Doom1.WAD (Ultimate Doom). Place it at the repo root.

  • Steam/GOG purchasers: copy the original WAD and rename to Doom1.WAD
  • Shareware fallback:
    wget https://distro.ibiblio.org/slitaz/sources/packages/d/doom1.wad
    mv doom1.wad Doom1.WAD

1.2 Python virtualenv

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

PyTorch + ViZDoom wheel selections are pinned inside requirements.txt. If wheels fail for your platform, rebuild ViZDoom from source (see scripts/).

1.3 Baseline run (ICM)

source env/bin/activate
python icm.py

1.4 Learning Progress Monitor (LPM) run

source env/bin/activate
python icm_lpm.py  

2. Environment snapshot

  • Level: Doom 1, map E1M1
  • API: ViZDoom player mode, 60×80 grayscale stacks
  • Action space: defined in MACRO_ACTIONS inside each script

In both icm.py and icm_lpm.py, the environment supports infinite-horizon mode. In this mode, death does not terminate the observe/act loop in run_episode, and only hitting the timeout or max_frames stops the rollout.

Following Burda et al., the death-to-respawn transition is treated as just another state change. This mirrors how people play Doom and discourages reward hacking via intentional suicide, because the spawn room becomes boring once the agent has fully explored it.
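
For illustration, here is a minimal sketch of such an infinite-horizon rollout loop. It is not the actual run_episode from the scripts; MAX_FRAMES, the button setup, and policy are placeholders.

import vizdoom as vzd

game = vzd.DoomGame()
game.set_doom_game_path("Doom1.WAD")
game.set_doom_map("E1M1")
game.set_screen_format(vzd.ScreenFormat.GRAY8)
# button/available-action setup omitted; see MACRO_ACTIONS in the scripts
game.init()
game.new_episode()

MAX_FRAMES = 10_000  # illustrative cap
frames = 0
while frames < MAX_FRAMES:
    if game.is_episode_finished():
        # death (or level exit) just triggers a respawn; the rollout keeps going
        game.new_episode()
        continue
    obs = game.get_state().screen_buffer  # grayscale frame, later resized/stacked to 60x80
    action = policy(obs)                  # placeholder for the agent's policy
    game.make_action(action)
    frames += 1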

3. Papers & origin

Implementation   Reference                                                                      Link
icm.py           Large-Scale Study of Curiosity-Driven Learning                                 https://arxiv.org/pdf/1808.04355
icm_lpm.py       Beyond Noisy TVs: Noise-Robust Exploration via Learning Progress Monitoring    https://arxiv.org/pdf/2509.25438v1
icm_rnd.py       Exploration by Random Network Distillation                                     https://arxiv.org/pdf/1810.12894

4. icm.py – Intrinsic Curiosity Module

Pathak's Intrinsic Curiosity Module trains a forward dynamics predictor alongside PPO with GAE-normalised advantages. Curiosity is the prediction error between the encoded next-state features and the forward dynamics model's output, computed over the frame stacks and actions collected during the policy rollout.

During each PPO minibatch we encode the stacked frames, predict the next embedding, and measure the curiosity reward as the mean-squared error between the prediction and the actual next embedding. The encoder also feeds an inverse dynamics model that learns to recover the action, keeping features focused on agent-controllable factors.

4.1 Core training code

import torch
import torch.nn.functional as F

# fwd_model trains on its own; the encoder and inverse dynamics model share an optimizer
fwd_optimizer = torch.optim.Adam(fwd_model.parameters())
feat_idm_optimizer = torch.optim.Adam(list(phi_enc.parameters()) + list(idm_model.parameters()))

phi_s = phi_enc(mb_obs_stack)
# mb_obs_next_stack is collected during rollout
phi_s_next = phi_enc(mb_obs_next_stack)
# act_onehot is the one-hot encoding of the rollout actions
phi_pred = fwd_model(phi_s.detach(), act_onehot)

# per-transition forward-model error, also used as the curiosity reward
per_sample_fwd = F.mse_loss(phi_pred, phi_s_next.detach(), reduction='none').mean(dim=1)

# train just the fwd dynamics model (phi_enc detached)
fwd_optimizer.zero_grad()
fwd_loss = per_sample_fwd.mean()
fwd_loss.backward()
fwd_optimizer.step()

phi_idm_t = phi_enc(mb_obs_stack)
phi_idm_t_next = phi_enc(mb_obs_next_stack)

# train both phi_enc (feature encoder) and idm_model (inverse dynamics model)
action_logits = idm_model(phi_idm_t, phi_idm_t_next)
# mb_acts is collected during rollout
idm_loss = F.cross_entropy(action_logits, mb_acts, reduction='none').mean()

phi_pred_enc = fwd_model(phi_idm_t, act_onehot)
fwd_loss_enc = F.mse_loss(phi_pred_enc, phi_idm_t_next.detach(), reduction='none').mean(dim=1).mean()
feat_idm_optimizer.zero_grad()
joint_loss = (1 - beta) * idm_loss + beta * fwd_loss_enc
joint_loss.backward()
feat_idm_optimizer.step()

We optimise the joint loss to update the encoder and IDM while training the forward model separately so gradients stay stable, mirroring the original ICM recipe.
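
For context, here is a minimal sketch of how a per-step curiosity reward (such as per_sample_fwd above) could be turned into GAE-normalised advantages for PPO. The discount, lambda, and value inputs are assumptions, not icm.py's exact settings.

import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # rewards: per-step intrinsic rewards, shape [T]
    # values:  value estimates for steps 0..T (one bootstrap value at the end), shape [T+1]
    adv = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    # advantage normalisation, as mentioned above
    return (adv - adv.mean()) / (adv.std() + 1e-8)

No done mask is applied here, which matches the infinite-horizon rollouts described in section 2.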

5. icm_lpm.py – Learning Progress Monitoring

The key idea of Learning Progress Monitoring (LPM) is to define the reward signal as the improvement in prediction error across training iterations, where $\tau$ is the current iteration:

$$\text{rwd}_i = \epsilon^{\tau-1} - \epsilon^{\tau}$$

$o_{t}$ is the observed game state at episode step $t$, and

$$\hat{o}_{t} = f_{\text{dynamics}}^{\tau}(o_{t-1}, a_{t-1})$$

$$\epsilon^{\tau} = \log(\text{MSE}(\hat{o}_{t}, o_{t}))$$

$\epsilon^{\tau-1}$ can be naively calculated by using $f_{\text{dynamics}}^{\tau-1}$, but we instead train a neural network to predict it directly given the observation and action.

The formulation rewards the agent for improved prediction of observation-transition dynamics. Let's consider two cases: transitions that are learnable and transitions that are not.

In the first case, the agent is uncertain about a transition at first, but over iterations the forward dynamics model learns it and $\epsilon^{\tau}$ approaches zero, so the agent receives a reward proportional to its past uncertainty, $\epsilon^{\tau-1}$. But since the error prediction model is trained too, $\epsilon^{\tau-1}$ also approaches zero, so over many iterations $\text{rwd}_i$ for this particular transition approaches zero.

In the second case, the transition is purely random, so the forward dynamics model's error stops improving and plateaus at an irreducible value, and the error prediction model converges to predicting that same value; the two terms are equivalent. Therefore, $\text{rwd}_i$ is zero.

The two cases above illustrate how, in theory, an LPM agent would act out of curiosity (seeking learnable, novel dynamics of the world) while avoiding the Noisy TV problem (deriving no reward from incompressible/unlearnable dynamics, i.e. random events).
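
A toy numerical illustration of the two cases (the values are made up, not measurements from training):

# Learnable transition: the forward model improves while the error predictor lags one iteration.
eps_prev, eps_now = 2.0, 0.5            # epsilon^(tau-1) > epsilon^(tau)
rwd_learnable = eps_prev - eps_now      # = 1.5 -> positive reward while learning progresses

# Unlearnable (noisy-TV) transition: the error plateaus at an irreducible value,
# which the error predictor also learns to output, so the two terms cancel.
eps_prev, eps_now = 3.0, 3.0
rwd_random = eps_prev - eps_now         # = 0.0 -> no reward for pure noise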

5.1 Core training code

import math
import torch
import torch.nn.functional as F

# --- during rollout: compute the intrinsic reward for one transition ---
with torch.no_grad():
    phi_t = phi_enc(obs_t)
    pred_phi_t_next = fwd_model(phi_t, act_onehot)
    # error_model predicts log(MSE); exp() converts it back to MSE space
    expected_error = math.exp(error_model(phi_t, act_onehot).item())

phi_t_next = phi_enc(obs_next_t)
epsilon = F.mse_loss(pred_phi_t_next, phi_t_next, reduction='none').view(1, -1).mean().item()
# no intrinsic reward until the error model is ready (error_ready is set elsewhere in the script)
intrinsic_reward = (expected_error - epsilon) if error_ready else 0.0

# --- during the update: train the forward dynamics model and the error model ---
pred_phi_t_next = fwd_model(phi_obs, act_onehot)
dyn_loss = F.mse_loss(pred_phi_t_next, phi_next)
eps_pred = error_model(phi_obs, act_onehot)

# eps_b: batch of forward-model errors measured during the rollout, clamped away from zero
err_loss = F.mse_loss(eps_pred, torch.log(eps_b + EPSILON_CLAMP))
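
The optimizer steps are not shown above. Here is a minimal sketch of one way to wire them, with the optimizer grouping and learning rate as assumptions rather than icm_lpm.py's exact choices:

import torch

# assumed grouping: encoder + forward model share one optimizer, the error model has its own
dyn_optimizer = torch.optim.Adam(list(phi_enc.parameters()) + list(fwd_model.parameters()), lr=1e-4)
err_optimizer = torch.optim.Adam(error_model.parameters(), lr=1e-4)

# a single backward over the summed losses keeps things simple when both losses
# share upstream tensors; eps_b comes from the rollout, which is what gives the
# error model its one-iteration lag behind the dynamics model
loss = dyn_loss + err_loss
dyn_optimizer.zero_grad()
err_optimizer.zero_grad()
loss.backward()
dyn_optimizer.step()
err_optimizer.step()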

6. icm_rnd.py - Exploration by Random Network Distillation

This paper is surprisingly similar to Learning Progress Monitoring, but with several key differences:

  • replaces the forward dynamics model with a frozen, randomly initialised network
  • the intrinsic reward is the difference between the random network's embedding of the observation and the predictor network's embedding of the observation. Roughly, the random network and the predictor network stand in a teacher-student relationship, hence the "distillation" in the title. Unlike in LPM, training samples for the student network are not delayed by one iteration.
  • uses two separate value heads, one for intrinsic and one for extrinsic rewards
  • keeps a running mean and variance of observations and uses them to normalize the inputs to the random and predictor networks (see the sketch below)
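
A minimal sketch of the intrinsic-reward computation under those choices; target_net, predictor_net, and the running-statistics variables are stand-ins, not the actual names in icm_rnd.py.

import torch
import torch.nn.functional as F

# observations normalised with running statistics and clipped, as is common in RND implementations
obs_norm = torch.clamp((obs - obs_mean) / (obs_std + 1e-8), -5.0, 5.0)

with torch.no_grad():
    target_feat = target_net(obs_norm)    # frozen, randomly initialised network ("teacher")
pred_feat = predictor_net(obs_norm)       # trained network ("student")

# per-observation distillation error: detached copy is the reward, mean is the predictor's loss
pred_err = F.mse_loss(pred_feat, target_feat, reduction='none').mean(dim=1)
intrinsic_reward = pred_err.detach()
rnd_loss = pred_err.mean()

The two value heads mentioned above would then estimate intrinsic and extrinsic returns separately, with INTRINSIC_COEF and EXTRINSIC_COEF weighting their contributions to the policy update.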

6.1 Empirical observations

  • e0ee62b:
    • Total mean reward dips from 1 and plateaus around 1e-3 (by iteration 30). The intrinsic reward curve is identical to the total mean reward curve; the per-iteration mean extrinsic reward is noisier but oscillates around 1e-3 with roughly 0.018 std, and the running average tells the same story.
    • Forward dynamics error is extremely small (starts at 4e-4 and approaches 1e-5 by iteration 30), suggesting the RandomEncoder destroys visual differences between observations during encoding.
  • ad61d92:
    • This is an implementation of OpenAI's Random Network Distillation work.
    • Empirically, with INTRINSIC_COEF=EXTRINSIC_COEF=0.5, it was the only one of the three implementations, compared against ICM (Pathak et al., 2017) and LPM (Hou et al., 2025), whose total reward didn't asymptotically converge to zero (plot: Total Reward Across Iterations).
    • However, the intrinsic reward collapses to zero early on and the policy is updated only by extrinsic rewards, i.e. it behaves no differently than a PPO agent with the reward scaled by half (plot: Intrinsic Reward Collapse).
    • I am setting EXTRINSIC_COEF=1.0 to investigate whether a purely curiosity-driven agent can gain external rewards and exhibit complex behavior (killing enemies, finding the secret room, completing the level).

7. TODOs for future revision

  • Implement Learning Progress Monitoring, a 2025 paper's solution to the Noisy TV problem. Does it work? No, it converges to taking no action.
  • Implement Random Network Distillation, OpenAI's solution. Does it work? No, the extrinsic reward dominates the policy.
  • Document best-known hyperparameter tweaks for Doom E1M1.
