- LLM Inference Serving: Survey of Recent Advances and Opportunities
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
- BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching: https://arxiv.org/pdf/2411.16102
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion: https://arxiv.org/abs/2406.06858
- BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching: https://arxiv.org/abs/2412.03594
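To make the "global prefix sharing" idea named in the BatchLLM title concrete, here is a minimal, hypothetical sketch (not BatchLLM's actual implementation): requests that share a leading prompt prefix, such as a common system prompt, are grouped so the prefix could be prefilled once and its KV cache reused across the group. The function name, the whitespace "tokenizer", and the fixed `prefix_len` are illustrative assumptions.

```python
# Toy sketch of prefix-sharing batching: group requests whose prompts
# share the same leading tokens, so a serving system could prefill the
# shared prefix once per group instead of once per request.
from collections import defaultdict

def group_by_prefix(prompts, prefix_len=16):
    """Group prompts whose first `prefix_len` whitespace tokens match."""
    groups = defaultdict(list)
    for p in prompts:
        key = tuple(p.split()[:prefix_len])  # toy stand-in for real tokenization
        groups[key].append(p)
    return list(groups.values())

prompts = [
    "You are a helpful assistant. Summarize: doc A",
    "You are a helpful assistant. Summarize: doc B",
    "Translate to French: hello",
]
# The two summarization requests share a 5-token prefix and land in one group.
batches = group_by_prefix(prompts, prefix_len=5)
```

A real system would match prefixes at the token level inside the KV-cache manager (as in paged or radix-tree caches) rather than on raw strings, but the grouping step shown here is the core of the batching idea.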
- llm-optimizer