RFC-0039-cuda-support.md
# [CUDA version for PyTorch CI/CD]
**Authors:**
* @atalman, @malfet, @tinglvv, @nWEIdia, @ptrblck
- It brings significant performance improvements
- It fixes significant correctness issues
- It significantly reduces binary/memory footprint
- It adds desired functionality or features
### We would deprecate version of CUDA when
As soon as we introduce a new Experimental Version, we should consider moving the previous Experimental Version to Stable and decommissioning the previous Stable version. Typically we want to support at least two versions of CUDA, with an optional exception for a Legacy Version (see below).
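The promotion flow above can be sketched as a small, hypothetical helper; the version numbers are illustrative only, not normative:

```python
# Hypothetical sketch of the lifecycle described above: introducing a new
# Experimental version promotes the previous Experimental to Stable and
# decommissions the previous Stable (or parks it as an optional Legacy).
# Version numbers are illustrative only.

def introduce_experimental(matrix: dict, new_version: str) -> dict:
    """Return the updated support matrix after adding a new Experimental."""
    return {
        "stable": matrix["experimental"],  # previous Experimental -> Stable
        "experimental": new_version,       # newly introduced Experimental
        # the previous Stable version is decommissioned here
    }

print(introduce_experimental({"stable": "12.6", "experimental": "12.8"}, "12.9"))
# -> {'stable': '12.8', 'experimental': '12.9'}
```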
- Optional Legacy Version: kept when we need one version for backward compatibility or to work around a current limitation, for example when an older CUDA driver is incompatible with a newer CUDA version. We should keep this version as static as possible (i.e. no cuDNN, NCCL, or other library updates) to avoid mixing the legacy stack with the latest libraries, which can lead to unexpected behavior.
- Stable Version: This is a stable CUDA version that is used most of the time. This is the version we want to upload to PyPI.
- Latest Experimental Version: This is the latest version of CUDA that we want to support. The minimal requirement for it to qualify for inclusion in a PyTorch OSS release is availability in nightly releases (CD).
### Detailed Process of Introducing a new CUDA version
1. Evaluate CUDA update necessity. Please see the section above: [We should introduce new version of CUDA when](https://github.com/pytorch/rfcs/blob/cuda_update/RFC-0039-cuda-support.md#we-should-introduce-new-version-of-cuda-when)

2. Evaluate if we have all packages for the update

When: As soon as the update is determined to be necessary. Start by creating an RFC (see [example](https://github.com/pytorch/pytorch/issues/145544)) with the proposed CUDA matrix to support for the next release.

Goal: Make sure everything is available to perform a complete upgrade of CUDA and its dependencies.
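As a rough illustration of this evaluation step, a hypothetical checklist over NVIDIA's PyPI dependency wheels might look like the following; the exact dependency set for a real release may differ:

```python
# Hypothetical checklist sketch for the "all packages available" evaluation.
# These names follow NVIDIA's PyPI wheel naming convention for CUDA 12;
# the exact dependency set for a given PyTorch release may differ.
CUDA_DEPENDENCIES = [
    "nvidia-cuda-runtime-cu12",
    "nvidia-cudnn-cu12",
    "nvidia-nccl-cu12",
    "nvidia-cublas-cu12",
]

def missing(available: set) -> list:
    """Return the dependencies not yet published for the new CUDA version."""
    return [pkg for pkg in CUDA_DEPENDENCIES if pkg not in available]

print(missing({"nvidia-cudnn-cu12", "nvidia-nccl-cu12"}))
# -> ['nvidia-cuda-runtime-cu12', 'nvidia-cublas-cu12']
```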
3. Update CUDA in CD (this is a necessary condition for a CUDA version to qualify for release as Experimental)
When: Evaluation of legacy CUDA version deprecation from CI/CD is complete.

Goal: Support for legacy CUDA versions is dropped, starting from PyTorch Domain Libraries and then in PyTorch core. First we drop CD support and then CI support.
# CUDA/cuDNN Upgrade Runbook
So you wanna upgrade PyTorch to support a new CUDA? Follow these steps in order! They are adapted from previous CUDA upgrade processes.
2) Validate new driver availability: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html. Check the following table: "Table 3. CUDA Toolkit and Corresponding Driver Versions".
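A minimal sketch of this driver check, assuming an illustrative minimum-driver table (always take the real values from Table 3 in the release notes):

```python
# Hypothetical driver-version check against the CUDA release notes'
# "Table 3. CUDA Toolkit and Corresponding Driver Versions".
# The minimum versions below are illustrative; consult the official table.
MIN_DRIVER = {
    "12.8": (570, 26),
    "12.6": (560, 28),
}

def driver_ok(toolkit: str, driver: str) -> bool:
    """driver is a version string as reported by nvidia-smi, e.g. '570.86.15'."""
    major, minor, *_ = (int(part) for part in driver.split("."))
    return (major, minor) >= MIN_DRIVER[toolkit]

print(driver_ok("12.8", "570.86.15"))  # True: 570.86 >= 570.26
```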