add proof

yzh119 · yzh119 · commit a73a3465b648 · 2025-05-06T18:23:45.000-07:00
diff --git a/_posts/2025-03-10-sampling.md b/_posts/2025-03-10-sampling.md
@@ -152,6 +152,125 @@ To address this issue, in FlashInfer [v0.2.3](https://github.com/flashinfer-ai/f
 
 Figure 4 shows the transition from round(i) to round(i+1) in Dual Pivot Rejection Sampling. In each round, if the sampled token is accepted, we return the token; otherwise, the new range's extent is $\frac{\text{high}-\text{pivot}_1}{2} < \frac{\text{high}-\text{low}}{2}$, which is less than half of the previous range. Thus it's guaranteed that the number of rounds is $O(\log(1/\epsilon))$ where $\epsilon$ is the minimal possible value in floating point representation.
 
+## Theoretical Proof of the Correctness of Rejection Sampler
+
+In this section, we prove the correctness of the rejection sampler. We use top-k sampling as the example;
+the other samplers can be proved in a similar way.
+
+### Nomenclature
+
+| Symbol | Meaning |
+|--------|---------|
+| $p_i > 0$ | Un‑normalised score (unnormalised probability mass) of item $i$ |
+| $T = \operatorname{Top}k = \{i_1,\dots,i_k\}$ | Indices of the **k** largest scores |
+| $Z = \sum_{j \in T} p_j$ | Total mass of the top‑k items |
+| $\tau$ | Current **pivot** (threshold) value |
+
+### Theorem
+
+The algorithm outputs each top‑k index $j \in T$ with probability
+
+$$
+\Pr[\text{output}=j] \;=\; \frac{p_j}{Z},
+$$
+
+i.e. **exactly the distribution obtained by first discarding all non‑top‑k items and then sampling categorically inside the top‑k set**.
+
+---
+
+### Proof
+
+Fix any pivot $\tau < \min_{j \in T} p_j$. This holds at every step because $\tau$ is either the initial value $0$ or the score of a rejected non‑top‑k item.
+Define
+
+$$
+Q_j(\tau) \;=\; \Pr[\text{algorithm eventually returns } j \mid \text{current pivot } \tau], 
+\quad j \in T .
+$$
+
+With
+
+$$
+S(\tau) \;=\; \sum_{m : p_m > \tau} p_m 
+\;=\; Z \;+\; W(\tau),\qquad
+W(\tau) \;=\!\!\!\! \sum_{r \notin T,\, p_r > \tau}\!\!\!\! p_r ,
+$$
+
+where $S(\tau)$ is the total score of all tokens whose score exceeds $\tau$, and $W(\tau)$ is the remaining mass of "bad" (non‑top‑k) items still above the threshold.
+
+The next draw selects item $i$ (among those with $p_i > \tau$) with probability
+
+$$
+\Pr[i \mid \tau] \;=\; \frac{p_i}{S(\tau)}.
+$$
+
+Hence
+
+$$
+Q_j(\tau) 
+\;=\;
+\underbrace{\frac{p_j}{S(\tau)}}_{\text{accept immediately}}
+\;+\;
+\sum_{\substack{r \notin T \\ p_r > \tau}}
+      \underbrace{\frac{p_r}{S(\tau)}}_{\text{draw } r}\;
+      \underbrace{Q_j\!\bigl(p_r\bigr)}_{\text{pivot becomes } p_r}
+\tag{★}
+$$
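
As a concrete instance of (★), take the hypothetical scores $(5, 4, 3, 2, 1)$ with $k = 2$ (illustrative numbers, not taken from the post). Then $T$ holds the items with scores $5$ and $4$, and $Z = 9$. With pivot $\tau = 2$, only the non-top-k item with score $3$ remains above the threshold, so $W(2) = 3$ and $S(2) = 12$:

$$
Q_j(2) \;=\; \frac{p_j}{12} \;+\; \frac{3}{12}\,Q_j(3),
\qquad
Q_j(3) \;=\; \frac{p_j}{9},
$$

where the second equality holds because no non-top-k item exceeds the pivot $3$, so $S(3) = Z = 9$. Substituting gives $Q_j(2) = \frac{p_j}{12} + \frac{p_j}{36} = \frac{p_j}{9}$.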
+
+We show that the following formula is a valid solution to (★):
+
+$$
+\boxed{\,Q_j(\tau) \;=\; \dfrac{p_j}{Z}\,}
+\qquad\text{for every }\tau < \min_{j \in T} p_j .
+$$
+
+We can verify the solution by substituting it into (★):
+
+$$
+\begin{aligned}
+\text{RHS} 
+&= \frac{p_j}{S(\tau)}
+   \;+\; \frac{p_j}{Z} \frac{W(\tau)}{S(\tau)} \\
+&= \frac{p_j}{S(\tau)}\!\Bigl(1+\frac{W(\tau)}{Z}\Bigr) \\
+&= \frac{p_j}{Z} \frac{Z+W(\tau)}{S(\tau)} \\
+&= \frac{p_j}{Z},
+\end{aligned}
+$$
+
+because $S(\tau) = Z + W(\tau)$.
+Thus the claimed form satisfies the recurrence (★).
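
The substitution can also be checked numerically. The sketch below is a minimal illustration, not part of FlashInfer: it solves (★) exactly with rational arithmetic for every pivot reachable from $\tau = 0$. The function name `exact_output_distribution` and the example scores are hypothetical, and the code assumes all scores are positive and distinct.

```python
from fractions import Fraction


def exact_output_distribution(scores, k):
    """Solve the recurrence (★) exactly and return Q_j(0) for each top-k index j.

    Assumes positive, pairwise distinct scores (an assumption of this sketch).
    """
    p = [Fraction(s) for s in scores]
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top = order[:k]    # indices of the k largest scores
    rest = order[k:]   # non-top-k indices, largest score first

    # Pivots that can occur: every non-top-k score, plus the initial pivot 0.
    # Processing them from largest to smallest ensures Q at higher pivots
    # is already known when a lower pivot needs it.
    pivots = [p[r] for r in rest] + [Fraction(0)]

    Q = {}  # Q[(tau, j)] = probability of eventually returning j from pivot tau
    for tau in pivots:
        S = sum(x for x in p if x > tau)
        for j in top:
            total = p[j] / S  # accept j immediately
            for r in rest:
                if p[r] > tau:  # draw r, reject it, pivot becomes p[r]
                    total += p[r] / S * Q[(p[r], j)]
            Q[(tau, j)] = total
    return {j: Q[(Fraction(0), j)] for j in top}


scores = [5, 1, 3, 2, 4]
dist = exact_output_distribution(scores, k=2)
Z = 5 + 4  # total mass of the two largest scores
for j, q in dist.items():
    print(j, q, q == Fraction(scores[j], Z))
```

Each returned value should equal $p_j / Z$ exactly, since `Fraction` avoids floating-point rounding.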
+
+Next, we show that the solution is unique.
+Suppose another function $Q_j'(\tau)$ also satisfies (★), and define
+$\Delta_j(\tau) = Q_j(\tau) - Q_j'(\tau)$. Subtracting the two instances of (★) cancels the $p_j / S(\tau)$ term, leaving:
+
+$$
+\Delta_j(\tau) = \sum_{\substack{r \notin T \\ p_r > \tau}}
+        \frac{p_r}{S(\tau)} \Delta_j(p_r)
+$$
+
+The coefficients sum to $\sum_{\substack{r \notin T \\ p_r > \tau}} \frac{p_r}{S(\tau)} = \frac{W(\tau)}{S(\tau)}$, which satisfies:
+
+$$0 \leq \frac{W(\tau)}{S(\tau)} < 1$$
+
+Only finitely many pivots can occur ($0$ and the non‑top‑k scores), so $|\Delta_j(\tau)|$ attains its maximum at some pivot $\tau^*$. If this maximum is positive, we have:
+
+$$
+|\Delta_j(\tau^*)| \leq \sum_{\substack{r \notin T \\ p_r > \tau^*}}
+        \frac{p_r}{S(\tau^*)} |\Delta_j(p_r)| \leq \sum_{\substack{r \notin T \\ p_r > \tau^*}}
+        \frac{p_r}{S(\tau^*)} |\Delta_j(\tau^*)| = \frac{W(\tau^*)}{S(\tau^*)} |\Delta_j(\tau^*)| < |\Delta_j(\tau^*)|
+$$
+
+which is a contradiction. Hence the maximum is $0$, i.e. $\Delta_j \equiv 0$, and the solution is unique.
+
+The algorithm starts with $\tau = 0$; therefore
+
+$$
+\Pr[\text{output}=j] = Q_j(0) = \frac{p_j}{Z},
+$$
+
+exactly the desired top‑k categorical distribution.
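
The theorem can be sanity-checked empirically. The following is a minimal Monte Carlo sketch of the single-pivot procedure as modelled in the proof, not FlashInfer's CUDA kernel. The function name `topk_rejection_sample` and the example scores are hypothetical. For simplicity, top-k membership is looked up in a precomputed set rather than computed the way a real kernel would, and the scores are assumed positive and distinct.

```python
import random
from collections import Counter


def topk_rejection_sample(scores, k, rng):
    """Draw one index following the pivot-based rejection scheme from the proof."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])  # simplification: membership test via a precomputed set
    tau = 0.0             # the algorithm starts with pivot 0
    while True:
        # Draw i with probability p_i / S(tau) among items with p_i > tau.
        candidates = [i for i, s in enumerate(scores) if s > tau]
        weights = [scores[i] for i in candidates]
        i = rng.choices(candidates, weights=weights)[0]
        if i in top:
            return i      # accepted: i is a top-k item
        tau = scores[i]   # rejected: the pivot becomes p_i


rng = random.Random(0)
scores = [5.0, 1.0, 3.0, 2.0, 4.0]
k = 2
n = 50_000
counts = Counter(topk_rejection_sample(scores, k, rng) for _ in range(n))

Z = sum(sorted(scores, reverse=True)[:k])
for j in sorted(counts):
    # Empirical frequency next to the predicted p_j / Z.
    print(j, counts[j] / n, scores[j] / Z)
```

Every rejection strictly raises the pivot and removes at least one non-top-k item from the candidates, so the loop terminates. The empirical frequencies should approach $p_j / Z$ as `n` grows.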
+
 ## Evaluation
 
 Our evaluation demonstrates that FlashInfer's sampling kernel delivers substantial improvements in both kernel-level latency and end-to-end throughput compared to traditional sorting-based implementations.