```
nvidia/Llama-3.1-405B-Instruct-FP4
```

#### Llama 3.3 70B FP4

| | GPU | B200 | | | |
|:------------------------|:--------|:----------|:----------|:----------|:----------|
| | TP Size | 1 | 2 | 4 | 8 |
| ISL, OSL | | | | | |
| | | | | | |
| 128, 128 | | 10,994.48 | 17,542.11 | 24,667.31 | 27,272.27 |
| 128, 2048 | | 9,580.46 | 15,432.35 | 23,568.12 | 31,174.31 |
| 128, 4096 | | 6,418.39 | 9,841.53 | 17,808.76 | 25,229.25 |
| 500, 2000 | | 7,343.32 | 11,850.57 | 20,709.67 | 28,038.78 |
| 1000, 1000 | | 6,752.53 | 10,815.88 | 16,413.04 | 20,060.66 |
| 1000, 2000 | | 6,670.07 | 9,830.73 | 15,597.49 | 20,672.37 |
| 1024, 2048 | | 6,636.75 | 9,807.13 | 15,519.23 | 20,617.28 |
| 2048, 128 | | 1,342.17 | 1,989.41 | 3,033.14 | 4,035.64 |
| 5000, 500 | | 1,429.67 | 2,419.67 | 3,686.84 | 5,182.96 |
| 20000, 2000 | | 629.77 | 1,177.01 | 2,120.66 | 3,429.03 |
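
One hypothetical way to read the TP columns above is per-GPU scaling efficiency: aggregate throughput divided by GPU count, relative to the TP=1 baseline. The helper below is an illustrative sketch (not part of trtllm-bench), with numbers taken from the 128, 128 row of the Llama 3.3 70B FP4 table:

```python
# Illustrative helper only: per-GPU scaling efficiency of tensor
# parallelism, using the 128, 128 ISL/OSL row of the table above.
def scaling_efficiency(tp1_tps: float, tpn_tps: float, tp_size: int) -> float:
    """Per-GPU throughput at TP=tp_size relative to the TP=1 baseline."""
    return (tpn_tps / tp_size) / tp1_tps

tp1 = 10_994.48  # tokens/s at TP=1 (from the table)
tp8 = 27_272.27  # tokens/s at TP=8 (from the table)
print(f"TP=8 per-GPU efficiency: {scaling_efficiency(tp1, tp8, 8):.0%}")
# → TP=8 per-GPU efficiency: 31%
```

This pattern holds across the tables: larger TP raises aggregate throughput (and typically lowers latency) at the cost of per-GPU efficiency.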

#### Llama 3.1 405B FP4

| | GPU | B200 | |
|:------------------------|:------- |:---------|:----------|
| | TP Size | 4 | 8 |
| ISL, OSL | | | |
| | | | |
| 128, 128 | | 6,163.81 | 9,002.90 |
| 128, 2048 | | 7,081.21 | 10,288.28 |
| 128, 4096 | | 6,028.37 | 8,713.77 |
| 500, 2000 | | 5,858.75 | 9,125.86 |
| 1000, 1000 | | 4,848.00 | 7,582.97 |
| 1000, 2000 | | 5,375.25 | 7,626.28 |
| 1024, 2048 | | 5,345.70 | 7,464.03 |
| 2048, 128 | | 693.55 | 1,086.56 |
| 5000, 500 | | 947.49 | 1,532.45 |
| 20000, 2000 | | 641.11 | 1,097.84 |

### FP8 Models:
```
nvidia/Llama-3.1-8B-Instruct-FP8
nvidia/Llama-3.1-70B-Instruct-FP8
nvidia/Llama-3.3-70B-Instruct-FP8
nvidia/Llama-3.1-405B-Instruct-FP8
nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
```

#### Llama 3.1 8B FP8
| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
|:-----------------------------|:---|:------------------|:-----------------|
| | TP Size | 1 | 1 |
| ISL, OSL | | | |
| | | | |
| 128, 128 | | 27,970.14 | 27,688.36 |
| 128, 2048 | | 23,326.38 | 21,841.15 |
| 128, 4096 | | 17,508.51 | 13,730.89 |
| 500, 2000 | | 21,390.41 | 17,833.34 |
| 1000, 1000 | | 17,366.89 | 15,270.62 |
| 1000, 2000 | | 16,831.31 | 13,798.08 |
| 1024, 2048 | | 16,737.03 | 13,385.50 |
| 2048, 128 | | 3,488.03 | 3,414.67 |
| 5000, 500 | | 3,813.69 | 3,394.54 |
| 20000, 2000 | | 1,696.66 | 1,345.42 |

#### Llama 3.3 70B FP8

| | GPU | H200 141GB HBM3 | | | | H100 80GB HBM3 | | | |
|:-----------------------------|:---|:------------------|:---------|:----------|:----------|:-----------------|:---------|:----------|:----------|
| | TP Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 |
| ISL, OSL | | | | | | | | | |
| | | | | | | | | | |
| 128, 128 | | 3,605.47 | 6,427.69 | 10,407.42 | 15,434.37 | 3,128.33 | 6,216.91 | | |
| 128, 2048 | | 4,315.80 | 8,464.03 | 13,508.59 | 20,759.72 | 756.42 | 5,782.57 | 11,464.94 | 17,424.32 |
| 128, 4096 | | 2,701.17 | 5,573.55 | 11,458.56 | 16,668.75 | | 3,868.37 | 8,206.39 | 12,624.61 |
| 500, 2000 | | 3,478.76 | 6,740.06 | 12,200.18 | | | 4,684.06 | 9,903.53 | 14,553.93 |
| 1000, 1000 | | 2,744.32 | 5,119.72 | 8,685.44 | 12,744.51 | 742.14 | 4,247.19 | 7,435.65 | 11,018.81 |
| 1000, 2000 | | 2,896.44 | 5,847.26 | 9,031.21 | 13,141.17 | 533.74 | 3,866.53 | 7,611.12 | 11,139.22 |
| 1024, 2048 | | 2,874.18 | 5,568.61 | 8,946.71 | 13,082.62 | 530.16 | 3,796.68 | 7,575.24 | 11,004.31 |
| 2048, 128 | | 435.90 | 772.67 | 1,264.76 | | | 736.89 | 1,213.33 | 1,839.22 |
| 2048, 2048 | | | | | 10,412.85 | | | | |
| 5000, 500 | | 545.96 | 997.15 | 1,698.22 | 2,655.28 | 204.94 | 862.91 | 1,552.68 | 2,369.84 |
| 20000, 2000 | | 276.66 | 620.33 | 1,161.29 | 1,985.85 | | 416.13 | 903.66 | 1,554.10 |

#### Llama 3.1 405B FP8
| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
|:-----------------------------|:---|:------------------|:-----------------|
| | TP Size | 8 | 8 |
| ISL, OSL | | | |
| | | | |
| 128, 128 | | 3,800.11 | 3,732.40 |
| 128, 2048 | | 5,567.87 | |
| 128, 4096 | | 5,136.85 | |
| 500, 2000 | | 4,787.61 | 3,673.91 |
| 1000, 1000 | | 3,286.30 | 3,012.22 |
| 1000, 2000 | | 3,636.76 | 3,262.20 |
| 1024, 2048 | | 3,618.66 | 3,109.70 |
| 2048, 128 | | 443.10 | 449.02 |
| 5000, 500 | | 645.46 | |
| 20000, 2000 | | | 372.12 |

#### Llama 4 Maverick FP8

| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
|:-----------------------------|:---|:------------------|:-----------------|
| | TP Size | 8 | 8 |
| ISL, OSL | | | |
| | | | |
| 128, 2048 | | 27,543.87 | |
| 128, 4096 | | 18,541.01 | 11,163.12 |
| 500, 2000 | | 21,117.34 | |
| 1000, 2000 | | | 10,556.00 |
| 1024, 2048 | | 16,859.45 | 11,584.33 |
| 2048, 128 | | 4,364.06 | 3,832.38 |
| 2048, 2048 | | 12,800.89 | |
| 5000, 500 | | 5,128.60 | |
| 20000, 2000 | | 1,764.27 | 1,400.79 |

## Reproducing Benchmarked Results

a model name (HuggingFace reference or path to a local model), a [generated data

```
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```

The data collected for the v0.20 benchmarks was run with the following file:

`llm_options.yml`
```yaml
- 8192
```

In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out of memory issue.
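
For reference, a sketch of the full invocation with the raised fraction is shown below. The model, dataset, and options filenames here are illustrative placeholders, not the exact values used for the published runs:

```shell
# Sketch only: compose the benchmark command with a 95% KV-cache fraction.
# All three variables are hypothetical placeholders.
model_name=nvidia/Llama-3.3-70B-Instruct-FP8
dataset_file=dataset.txt
llm_options=llm_options.yml

cmd="trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options --kv_cache_free_gpu_mem_fraction 0.95"
echo "$cmd"  # drop the fraction back to 0.90 here if the run hits an out-of-memory error
```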

The results will be printed to the terminal upon benchmark completion. For example,

Expand Down