Fix Server v2 production issues (#767, #783, #787) #788
base: main
Conversation
@anivar please explain the consequences of the .args being loaded too late. What failure/issue does it cause?
Fixes four critical production issues in llamafile Server v2:

1. **Fix .args loading timing** (llama.cpp main/main.cpp)
   - Move cosmo_args() call before determine_program()
   - Ensures --server --v2 flags in .args are seen when determining program mode
   - Fixes mozilla-ai#783
2. **Add URL prefix normalization** (llamafile/flags.cpp)
   - Consolidate consecutive slashes (//api/v1 → /api/v1); see the sketch below
   - Ensure leading slash, remove trailing slash
   - Validate AFTER normalization
   - Use static std::string for proper lifetime management (no memory leak)
   - Fixes mozilla-ai#767
3. **Robust partial write handling** (llamafile/server/client.cpp)
   - Implement full write loop to handle partial writes correctly
   - Handle EINTR (signal interruption) gracefully
   - Properly detect connection closure
   - Increase file transfer buffer from 512B to 16KB for better performance
4. **Remove aggressive client dropping** (llamafile/server/worker.cpp)
   - Remove code that kills oldest active connection when all workers busy
   - Let TCP listen backlog naturally queue incoming connections
   - Provides better UX (graceful queuing vs abrupt disconnection)
   - Fixes mozilla-ai#787

All fixes improve upon original PR mozilla-ai#788 with better error handling and no memory leaks.
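A minimal sketch of the normalization rules listed above, assuming a standalone helper (the name normalize_url_prefix and its placement are illustrative, not the actual llamafile/flags.cpp change):

```cpp
#include <string>

// Collapse repeated slashes, force a leading slash, and drop a trailing one,
// e.g. "//api/v1/" -> "/api/v1". Per the fix description, validation of the
// prefix happens only after this normalization.
std::string normalize_url_prefix(const std::string &raw) {
    std::string out = "/";                      // ensure a leading slash
    for (char c : raw) {
        if (c == '/' && out.back() == '/')
            continue;                           // consolidate "//" into "/"
        out.push_back(c);
    }
    if (out.size() > 1 && out.back() == '/')
        out.pop_back();                         // drop trailing slash, keep bare "/"
    return out;
}
```

In the fix as described, the normalized value is kept in a static std::string so the flag parser can hold a stable pointer to it without leaking memory.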
4746d30 to 78a2261
@vlasky Great question! The timing issue causes a real production problem for anyone distributing llamafiles with embedded configuration. Here's what happens: when a user embeds --server --v2 in the .args file, those flags haven't been loaded yet when determine_program() runs, so the binary falls through to chatbot mode instead of launching the v2 server. This completely breaks the "distribute a self-contained llamafile" use case - you can't ship a llamafile that's pre-configured to run as a server via .args. The fix is straightforward: load .args (via cosmo_args()) before determine_program(). While I was in there, I also improved the other fixes - the URL normalization now avoids a memory leak by using static storage, the partial write handler does a proper retry loop instead of just one attempt, and the file transfer buffer got bumped to 16KB for better performance.
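The "proper retry loop" mentioned above would look roughly like the following sketch (illustrative only, not the actual llamafile/server/client.cpp code; write_all is a made-up name):

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Keep calling write() until every byte is out: advance past partial writes,
// retry when a signal interrupts the call (EINTR), and report any other
// error (e.g. the client closed the connection) as failure.
bool write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t rc = write(fd, buf, len);
        if (rc > 0) {
            buf += rc;                          // partial write: advance and continue
            len -= static_cast<size_t>(rc);
        } else if (rc == -1 && errno == EINTR) {
            continue;                           // interrupted by a signal: retry
        } else {
            return false;                       // connection closed or hard error
        }
    }
    return true;
}
```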
Move cosmo_args() before determine_program() so that embedded .args flags like --server --v2 are visible when determining program mode. Without this fix, a .llamafile with embedded server flags would fall through to chatbot mode instead of launching llamafiler.

To avoid double-loading .args (once in main.cpp, once in prog.cpp), rename lf::server::main() to lf::server::run() and add an args_already_loaded parameter. The dispatcher passes true since it already loaded .args, while the standalone llamafiler binary passes false to load its own args.

Enhances the .args timing fix from PR mozilla-ai#788.
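A hedged sketch of that split; lf::server::run() and args_already_loaded follow the wording of the commit message, but the real signature and the cosmo_args() call on the embedded .args file may differ:

```cpp
#include <cstdio>

namespace lf::server {
// Renamed from lf::server::main() per the description above (illustrative stub).
int run(int argc, char **argv, bool args_already_loaded) {
    (void)argc; (void)argv;
    if (!args_already_loaded) {
        // Standalone llamafiler binary: nothing has merged .args into argv yet,
        // so the server loads its own embedded args here (via cosmo_args()).
        std::printf("run(): loading .args now\n");
    } else {
        // Dispatcher path: main.cpp already ran cosmo_args() before
        // determine_program(), so argv is complete; don't load .args twice.
        std::printf("run(): .args already loaded by dispatcher\n");
    }
    // ... start Server v2 ...
    return 0;
}
}  // namespace lf::server

int main(int argc, char **argv) {
    // The llamafile dispatcher loads .args before determine_program(),
    // so it passes true to avoid double-loading.
    return lf::server::run(argc, argv, /*args_already_loaded=*/true);
}
```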
@anivar thanks for that helpful detail! I've created an enhanced version of the .args fix (see the commit above).
This PR fixes three critical issues preventing Server v2 from being used in production:
- --url-prefix //path is not handled correctly like the old server (#767)
- .args being loaded too late (#783)
- active clients being dropped when all workers are busy (#787)

The changes are minimal (~60 lines) and follow patterns from the existing server implementation.
Fixes #767, #783, #787