fix: use safe deserialization and tarfile extraction#70
fix: use safe deserialization and tarfile extraction#70gn00295120 wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
Add weights_only=True to torch.load() calls to prevent arbitrary code execution via pickle (CVE-2025-32434). Add filter='data' to tarfile.extractall() to prevent path traversal (CVE-2007-4559). Addresses CWE-502, CWE-22.
There was a problem hiding this comment.
Pull request overview
This PR hardens model loading and archive extraction paths to reduce the risk of unsafe deserialization (PyTorch checkpoints) and tar path traversal during asset extraction.
Changes:
- Add
weights_only=Truetotorch.load()calls across model-loading codepaths. - Add
filter='data'totarfile.extractall()calls when extractingvoices.tgzanddist.tgz.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
moshi/moshi/server.py |
Uses tarfile.extractall(..., filter='data') when extracting downloaded archives. |
moshi/moshi/offline.py |
Uses tarfile.extractall(..., filter='data') when extracting voice prompt archives. |
moshi/moshi/models/loaders.py |
Loads PyTorch checkpoints with weights_only=True for safer deserialization. |
moshi/moshi/models/lm.py |
Loads saved voice prompt embeddings with weights_only=True. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
moshi/moshi/server.py
Outdated
| logger.info(f"extracting {voices_tgz} to {voices_dir}") | ||
| with tarfile.open(voices_tgz, "r:gz") as tar: | ||
| tar.extractall(path=voices_tgz.parent) | ||
| tar.extractall(path=voices_tgz.parent, filter='data') |
There was a problem hiding this comment.
tarfile.TarFile.extractall() only supports the filter kwarg on Python 3.12+; this repo declares requires-python = ">= 3.10", so on 3.10/3.11 this will raise TypeError: extractall() got an unexpected keyword argument 'filter' at runtime. Consider either (a) raising the minimum supported Python to 3.12 (and documenting it), or (b) feature-detecting support for filter (e.g., via inspect.signature) and falling back to a manual “safe extract” implementation that rejects absolute paths, .. traversal, and symlinks/hardlinks.
moshi/moshi/server.py
Outdated
| if not dist.exists(): | ||
| with tarfile.open(dist_tgz, "r:gz") as tar: | ||
| tar.extractall(path=dist_tgz.parent) | ||
| tar.extractall(path=dist_tgz.parent, filter='data') |
There was a problem hiding this comment.
Same compatibility issue as above: passing filter='data' to extractall() will crash on Python < 3.12, but moshi/pyproject.toml currently allows Python 3.10+. Please add a version/feature-detected fallback safe extraction path (or bump the minimum Python version accordingly).
moshi/moshi/offline.py
Outdated
| log("info", f"extracting {voices_tgz} to {voices_dir}") | ||
| with tarfile.open(voices_tgz, "r:gz") as tar: | ||
| tar.extractall(path=voices_tgz.parent) | ||
| tar.extractall(path=voices_tgz.parent, filter='data') |
There was a problem hiding this comment.
tarfile.extractall(..., filter='data') is not supported on Python 3.10/3.11 and will raise TypeError at runtime. Since the project currently supports Python >=3.10, please add a compatibility fallback (manual safe extraction) or bump the minimum supported Python version.
Summary
weights_only=Trueto alltorch.load()calls to prevent arbitrary code execution via pickle deserialization (CVE-2025-32434, CWE-502)filter='data'to alltarfile.extractall()calls to prevent path traversal attacks (CVE-2007-4559, CWE-22)Files Changed
moshi/moshi/models/lm.pyline ~979torch.load(..., weights_only=True)moshi/moshi/models/loaders.pylines ~160, ~217, ~295torch.load(..., weights_only=True)moshi/moshi/server.pylines ~333, ~349tarfile.extractall(..., filter='data')moshi/moshi/offline.pyline ~145tarfile.extractall(..., filter='data')Security Impact
torch.load()withoutweights_only=Truedeserializes arbitrary Python objects via pickle, allowing remote code execution if a malicious checkpoint file is loaded. Theweights_only=Trueflag restricts deserialization to tensor data only.tarfile.extractall()withoutfilter='data'can be exploited via crafted archives containing../path components (tar slip / directory traversal). Thefilter='data'argument (available since Python 3.12, backported to 3.11.4+) strips dangerous metadata and blocks absolute/traversal paths.Test Plan
weights_only=Trueon standard.ptcheckpointstarfileextraction ofvoices.tgzanddist.tgzsucceeds withfilter='data'