Fix Server v2 production issues (#767, #783, #787) #788
base: main
Conversation
@anivar please explain the consequences of the .args being loaded too late. What failure/issue does it cause?
Fixes four critical production issues in llamafile Server v2:

1. **Fix .args loading timing** (llama.cpp main/main.cpp)
   - Move cosmo_args() call before determine_program()
   - Ensures --server --v2 flags in .args are seen when determining program mode
   - Fixes mozilla-ai#783
2. **Add URL prefix normalization** (llamafile/flags.cpp)
   - Consolidate consecutive slashes (//api/v1 → /api/v1); see the sketch below
   - Ensure leading slash, remove trailing slash
   - Validate AFTER normalization
   - Use static std::string for proper lifetime management (no memory leak)
   - Fixes mozilla-ai#767
3. **Robust partial write handling** (llamafile/server/client.cpp)
   - Implement full write loop to handle partial writes correctly
   - Handle EINTR (signal interruption) gracefully
   - Properly detect connection closure
   - Increase file transfer buffer from 512B to 16KB for better performance
4. **Remove aggressive client dropping** (llamafile/server/worker.cpp)
   - Remove code that kills oldest active connection when all workers busy
   - Let TCP listen backlog naturally queue incoming connections
   - Provides better UX (graceful queuing vs abrupt disconnection)
   - Fixes mozilla-ai#787

All fixes improve upon original PR mozilla-ai#788 with better error handling and no memory leaks.
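A minimal sketch of the normalization rules listed above, assuming a standalone helper (the name normalize_url_prefix and its placement are illustrative, not the actual llamafile/flags.cpp change):

```cpp
#include <string>

// Collapse repeated slashes, force a leading slash, and drop a trailing one,
// e.g. "//api/v1/" -> "/api/v1". Per the fix description, validation of the
// prefix happens only after this normalization.
std::string normalize_url_prefix(const std::string &raw) {
    std::string out = "/";                      // ensure a leading slash
    for (char c : raw) {
        if (c == '/' && out.back() == '/')
            continue;                           // consolidate "//" into "/"
        out.push_back(c);
    }
    if (out.size() > 1 && out.back() == '/')
        out.pop_back();                         // drop trailing slash, keep bare "/"
    return out;
}
```

In the fix as described, the normalized value is kept in a static std::string so the flag parser can hold a stable pointer to it without leaking memory.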
4746d30 to 78a2261
@vlasky Great question! The timing issue causes a real production problem for anyone distributing llamafiles with embedded configuration. Here's what happens: when a user embeds --server --v2 in the .args file, those flags haven't been loaded yet when determine_program() runs, so the binary falls through to chatbot mode instead of launching the v2 server. This completely breaks the "distribute a self-contained llamafile" use case - you can't ship a llamafile that's pre-configured to run as a server via .args. The fix is straightforward: load .args (via cosmo_args()) before determine_program(). While I was in there, I also improved the other fixes - the URL normalization now avoids a memory leak by using static storage, the partial write handler does a proper retry loop instead of just one attempt, and the file transfer buffer got bumped to 16KB for better performance.
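The "proper retry loop" mentioned above would look roughly like the following sketch (illustrative only, not the actual llamafile/server/client.cpp code; write_all is a made-up name):

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Keep calling write() until every byte is out: advance past partial writes,
// retry when a signal interrupts the call (EINTR), and report any other
// error (e.g. the client closed the connection) as failure.
bool write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t rc = write(fd, buf, len);
        if (rc > 0) {
            buf += rc;                          // partial write: advance and continue
            len -= static_cast<size_t>(rc);
        } else if (rc == -1 && errno == EINTR) {
            continue;                           // interrupted by a signal: retry
        } else {
            return false;                       // connection closed or hard error
        }
    }
    return true;
}
```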
Move cosmo_args() before determine_program() so that embedded .args flags like --server --v2 are visible when determining program mode. Without this fix, a .llamafile with embedded server flags would fall through to chatbot mode instead of launching llamafiler.

To avoid double-loading .args (once in main.cpp, once in prog.cpp), rename lf::server::main() to lf::server::run() and add an args_already_loaded parameter. The dispatcher passes true since it already loaded .args, while the standalone llamafiler binary passes false to load its own args.

Enhances the .args timing fix from PR mozilla-ai#788.
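A hedged sketch of that split; lf::server::run() and args_already_loaded follow the wording of the commit message, but the real signature and the cosmo_args() call on the embedded .args file may differ:

```cpp
#include <cstdio>

namespace lf::server {
// Renamed from lf::server::main() per the description above (illustrative stub).
int run(int argc, char **argv, bool args_already_loaded) {
    (void)argc; (void)argv;
    if (!args_already_loaded) {
        // Standalone llamafiler binary: nothing has merged .args into argv yet,
        // so the server loads its own embedded args here (via cosmo_args()).
        std::printf("run(): loading .args now\n");
    } else {
        // Dispatcher path: main.cpp already ran cosmo_args() before
        // determine_program(), so argv is complete; don't load .args twice.
        std::printf("run(): .args already loaded by dispatcher\n");
    }
    // ... start Server v2 ...
    return 0;
}
}  // namespace lf::server

int main(int argc, char **argv) {
    // The llamafile dispatcher loads .args before determine_program(),
    // so it passes true to avoid double-loading.
    return lf::server::run(argc, argv, /*args_already_loaded=*/true);
}
```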
@anivar thanks for that helpful detail! I've created an enhanced version of the .args fix (see the commit above).
This PR fixes three critical issues preventing Server v2 from being used in production:
- --url-prefix //path is not handled correctly like the old server (#767)
- .args being loaded too late (#783)
- active clients being dropped when all workers are busy (#787)

The changes are minimal (~60 lines) and follow patterns from the existing server implementation.
Fixes #767, #783, #787