Merged
Changes from 1 commit
161 commits
1c3fdf8
Add all generation parameters to server.cpp and allow resetting context
digiwombat May 23, 2023
2071d73
Forgot to remove some testing code.
digiwombat May 23, 2023
421e66b
Update examples/server/server.cpp
digiwombat May 23, 2023
add5f1b
Update examples/server/server.cpp
digiwombat May 23, 2023
3537ad1
Merge branch 'ggerganov:master' into master
digiwombat May 23, 2023
8d7b28c
Fixed some types in the params.
digiwombat May 23, 2023
c2b55cc
Added LoRA Loading
digiwombat May 25, 2023
48cb16a
Merge branch 'ggerganov:master' into master
digiwombat May 27, 2023
66ed19d
Corrected dashes in the help lines.
digiwombat May 27, 2023
36c86d7
Automate Context resetting and minor fixes
digiwombat May 27, 2023
d20f36b
Removed unnecessary last_prompt_token set
digiwombat May 27, 2023
fdce895
Merge branch 'ggerganov:master' into master
digiwombat May 27, 2023
e84b802
Change top_k type.
digiwombat May 28, 2023
1f40a78
Didn't see the already defined top_k var.
digiwombat May 28, 2023
51e0994
server rewrite
SlyEcho May 27, 2023
f93fe36
Add all generation parameters to server.cpp and allow resetting context
digiwombat May 23, 2023
df0e0d0
Forgot to remove some testing code.
digiwombat May 23, 2023
549291f
keep processed from the beginning
SlyEcho May 28, 2023
177868e
Changed to params/args
digiwombat May 28, 2023
e8efd75
Initial timeout code and expanded json return on completion.
digiwombat May 28, 2023
23928f2
Added generation_settings to final json object.
digiwombat May 28, 2023
2e5c5ee
Changed JSON names to match the parameter name rather than the variab…
digiwombat May 28, 2023
dda915c
Added capturing the stopping word and sending it along with the final…
digiwombat May 28, 2023
7740301
Set unspecified generation settings back to default. (Notes below)
digiwombat May 28, 2023
7186d65
seed and gen params
SlyEcho May 28, 2023
15ddc49
Merge remote-tracking branch 'slyecho/server_refactor'
digiwombat May 28, 2023
74c6f36
Editorconfig suggested fixes
SlyEcho May 28, 2023
2c9ee7a
Apply suggestions from code review
digiwombat May 28, 2023
655899d
Add ignore_eos option to generation settings.
digiwombat May 28, 2023
b38d41e
--memory_f32 flag to --memory-f32 to match common.cpp
digiwombat May 28, 2023
6c58f64
--ctx_size flag to --ctx-size to match common.cpp
digiwombat May 28, 2023
33b6957
Fixed failing to return result on stopping token.
digiwombat May 28, 2023
42cf4d8
Merge branch 'master' into master
SlyEcho May 28, 2023
03ea8f0
Fix for the regen issue.
digiwombat May 30, 2023
d6fff56
add streaming via server-sent events
May 30, 2023
3292f05
Changed to single API endpoint for streaming and non.
digiwombat May 30, 2023
38eaf2b
Removed testing fprintf calls.
digiwombat May 30, 2023
a25f830
Default streaming to false if it's not set in the request body.
digiwombat May 31, 2023
2533878
Merge branch 'master' into sse
digiwombat May 31, 2023
e6de69a
Merge pull request #3 from anon998/sse
digiwombat May 31, 2023
7a853dc
prevent the server from swallowing exceptions in debug mode
May 31, 2023
aa0788b
add --verbose flag and request logging
May 31, 2023
9197674
Merge pull request #4 from anon998/logging
digiwombat May 31, 2023
b6f536d
Cull to end of generated_text when encountering a stopping string in …
digiwombat May 31, 2023
7a8104f
add missing quote when printing stopping strings
May 31, 2023
3a079d5
stop generating when the stream is closed
May 31, 2023
9f2424a
Merge pull request #5 from anon998/stop-stream
digiwombat May 31, 2023
c1cbde8
print error when server can't bind to the interface
May 31, 2023
2c08f29
make api server use only a single thread
May 31, 2023
284bc29
reserve memory for generated_text
May 31, 2023
f1710b9
add infinite generation when n_predict is -1
May 31, 2023
aa2bbb2
fix parameter type
May 31, 2023
27911d6
fix default model alias
May 31, 2023
dd30219
buffer incomplete multi-byte characters
May 31, 2023
40e1380
print timings + build info
May 31, 2023
d58e486
default penalize_nl to false + format
May 31, 2023
3edaf6b
print timings by default
May 31, 2023
96fa480
Merge pull request #6 from anon998/fix-multibyte
digiwombat May 31, 2023
7332b41
Simple single-line server log for requests
digiwombat May 31, 2023
dda4c10
Switch to the CPPHTTPLIB logger. Verbose adds body dump as well as re…
digiwombat May 31, 2023
86337e3
Server console logs now come in one flavor: Verbose.
digiwombat May 31, 2023
1b96df2
Spacing fix. Nothing to see here.
digiwombat May 31, 2023
276fa99
Misunderstood the instructions, I think. Back to the raw JSON output …
digiwombat May 31, 2023
43d295f
filter empty stopping strings
May 31, 2023
1bd7cc6
reuse format_generation_settings for logging
May 31, 2023
497160a
remove old log function
May 31, 2023
f2e1130
Merge pull request #7 from anon998/logging-reuse
digiwombat May 31, 2023
9104fe5
Change how the token buffers work.
SlyEcho May 31, 2023
8478e59
Merge pull request #8 from SlyEcho/server_refactor
digiwombat May 31, 2023
bed308c
Apply suggestions from code review
SlyEcho May 31, 2023
342604b
Added a super simple CORS header as default for all endpoints.
digiwombat May 31, 2023
e9b1f0b
fix stopping strings
May 31, 2023
5f6e16d
Merge pull request #9 from anon998/stopping-strings
digiwombat Jun 1, 2023
f7882e2
Fixed a crash caused by erasing from empty last_n_tokens
digiwombat Jun 1, 2023
5bbc030
Add Options enpoints and Access-Control-Allow-Headers to satisfy CORS…
cirk2 Jun 1, 2023
8c6a5fc
last tokens fixes
SlyEcho Jun 1, 2023
9531ae6
Add logit bias support
SlyEcho Jun 1, 2023
797155a
Merge pull request #10 from cirk2/master
digiwombat Jun 1, 2023
af71126
Merge pull request #11 from SlyEcho/server_refactor
digiwombat Jun 1, 2023
49a18bd
remove unused parameter warning
Jun 1, 2023
6025476
default penalize_nl back to true
Jun 1, 2023
8cbc4be
clear logit_bias between requests + print
Jun 1, 2023
d29b6d5
Merge pull request #12 from anon998/clear-logit-bias
digiwombat Jun 1, 2023
0bc0477
Apply suggestions from code review
SlyEcho Jun 2, 2023
731ecc0
fix typo
Jun 2, 2023
ebfead6
remove unused variables
Jun 2, 2023
1488a0f
make functions that never return false void
Jun 2, 2023
49dce94
make types match gpt_params exactly
Jun 2, 2023
a8a9f19
small fixes
Jun 2, 2023
2932db1
avoid creating element in logit_bias accidentally
Jun 2, 2023
47efbb5
use std::isinf to check if ignore_eos is active
Jun 2, 2023
88cc7bb
Stuff with logits
SlyEcho Jun 2, 2023
abb7782
Merge branch 'master' into small-fixes
Jun 2, 2023
bebea65
Merge pull request #13 from anon998/small-fixes
digiwombat Jun 2, 2023
8f9e546
trim partial stopping strings when not streaming
Jun 2, 2023
f820740
move multibyte check to doCompletion
Jun 2, 2023
f5d5e70
Merge pull request #14 from anon998/do-completion-update
digiwombat Jun 2, 2023
1bd52c8
Merge branch 'ggerganov:master' into master
digiwombat Jun 2, 2023
3df0192
improve long input truncation
SlyEcho Jun 2, 2023
28cc0cd
Merge pull request #15 from SlyEcho/server_refactor
digiwombat Jun 2, 2023
3ff27d3
Fixed up a few things in embedding mode.
digiwombat Jun 2, 2023
41bb71b
replace invalid characters instead of crashing
Jun 2, 2023
4dd72fc
Merge pull request #16 from anon998/fix-log-json
digiwombat Jun 2, 2023
16e1c98
Removed the embedding api endpoint and associated code.
digiwombat Jun 2, 2023
7cebe2e
Merge branch 'master' of https://github.com/digiwombat/llama.cpp
digiwombat Jun 2, 2023
bcd6167
improve docs and example
SlyEcho Jun 2, 2023
de6df48
Removed embedding from README
digiwombat Jun 2, 2023
310bf61
Merge pull request #17 from SlyEcho/server_refactor
digiwombat Jun 2, 2023
5758e9f
Removed embedding from flags.
digiwombat Jun 2, 2023
e1e2be2
remove --keep from help text
Jun 2, 2023
a6ed390
update readme
Jun 2, 2023
05a5a48
make help text load faster
Jun 2, 2023
98ae2de
parse --mlock and --no-mmap + format
Jun 2, 2023
df2ecc9
Merge pull request #18 from anon998/update-readme
digiwombat Jun 2, 2023
64a0653
Merge remote-tracking branch 'upstream/master'
digiwombat Jun 7, 2023
61befcb
Apply suggestions from code review
SlyEcho Jun 8, 2023
ccd85e0
Apply suggestions from code review
SlyEcho Jun 8, 2023
a9c3477
Spaces to 4 and other code style cleanup. Notes in README.
digiwombat Jun 9, 2023
cc2b336
Missed a pair of catch statements for formatting.
digiwombat Jun 9, 2023
23a1b18
Merge branch 'ggerganov:master' into master
digiwombat Jun 9, 2023
7580427
Resolving some review comments
digiwombat Jun 9, 2023
889d904
Merge branch 'master' of https://github.com/digiwombat/llama.cpp
digiwombat Jun 9, 2023
7cdeb08
More formatting cleanup
digiwombat Jun 9, 2023
1a9141b
Remove model assign in main(). Clarified stop in README.
digiwombat Jun 9, 2023
917540c
Clarify build instructions in README.
lesaun Jun 10, 2023
d6d263f
Merge pull request #19 from lesaun/master
digiwombat Jun 10, 2023
bac0ddb
Merge branch 'ggerganov:master' into master
digiwombat Jun 10, 2023
2c00bf8
more formatting changes
SlyEcho Jun 11, 2023
9612d12
big logging update
SlyEcho Jun 11, 2023
6518f9c
build settings
SlyEcho Jun 11, 2023
eee8b28
Merge pull request #20 from SlyEcho/server_refactor
digiwombat Jun 11, 2023
4148b9b
remove void
SlyEcho Jun 12, 2023
dff11a1
json parsing improvements
SlyEcho Jun 12, 2023
13cf692
more json changes and stop info
SlyEcho Jun 12, 2023
b91200a
javascript chat update.
SlyEcho Jun 12, 2023
1510337
fix make flags propagation
SlyEcho Jun 12, 2023
fc4264d
api url
SlyEcho Jun 12, 2023
28694f7
add a simple bash script too
SlyEcho Jun 12, 2023
429ed95
move CPPHTTPLIB settings inside server
SlyEcho Jun 12, 2023
f344d09
streaming shell script
SlyEcho Jun 12, 2023
50e7c54
Merge pull request #21 from SlyEcho/server_refactor
digiwombat Jun 12, 2023
fc78910
Merge branch 'ggerganov:master' into master
digiwombat Jun 12, 2023
6d72f0f
Make chat shell script work by piping the content out of the subshell.
digiwombat Jun 12, 2023
9d564db
trim response and trim trailing space in prompt
Jun 13, 2023
9099709
Merge pull request #22 from anon998/bash-trim
digiwombat Jun 13, 2023
b8b8a6e
Add log flush
SlyEcho Jun 13, 2023
6627a02
Allow overriding the server address
SlyEcho Jun 13, 2023
1f39452
remove old verbose variable
Jun 13, 2023
99ef967
add static prefix to the other functions too
Jun 13, 2023
575cf23
remove json_indent variable
Jun 13, 2023
7df316b
fix linter warnings + make variables const
Jun 13, 2023
7a48ade
fix comment indentation
Jun 13, 2023
6075d78
Merge pull request #23 from anon998/fix-linter-warnings
digiwombat Jun 13, 2023
546f850
Update examples/server/server.cpp
SlyEcho Jun 14, 2023
bd81096
fix typo in readme + don't ignore integers
Jun 14, 2023
5e107c2
Merge pull request #24 from anon998/logit-bias
digiwombat Jun 14, 2023
f858cd6
Merge remote-tracking branch 'upstream/master'
digiwombat Jun 14, 2023
aee8595
Update README.md
digiwombat Jun 15, 2023
488c62a
Merge remote-tracking branch 'upstream/master'
digiwombat Jun 15, 2023
fb49c05
Merge branch 'ggerganov:master' into master
digiwombat Jun 16, 2023
1b4b93a
Merge branch 'ggerganov:master' into master
digiwombat Jun 17, 2023
improve docs and example
SlyEcho committed Jun 2, 2023
commit bcd616700e561424db77bfabc334f13b811f9968
275 changes: 48 additions & 227 deletions examples/server/README.md
@@ -2,32 +2,45 @@

This example allows you to run a llama.cpp HTTP server that you can interact with from a web page, or consume as an API.

## Table of Contents
Command line options:

- `--threads N`, `-t N`: use N threads.
- `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
- `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
- `-ngl N`, `--n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
- `--embedding`: Enable the embedding mode. **Completion function doesn't work in this mode**.
- `--host`: Set the hostname or IP address to listen on. Default: `127.0.0.1`.
- `--port`: Set the port to listen. Default: `8080`.
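
As a sketch, several of the options above can be combined in a single invocation. The model path is a placeholder, and `-ngl` only has an effect when the server was built with CLBlast or cuBLAS support:

```sh
# Sketch: combine several of the documented flags in one server start.
# Adjust the model path and layer count to your setup.
./server -t 8 -m models/7B/ggml-model.bin -c 2048 -ngl 32 --host 127.0.0.1 --port 8080
```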

1. [Quick Start](#quick-start)
2. [Node JS Test](#node-js-test)
3. [API Endpoints](#api-endpoints)
4. [More examples](#more-examples)
5. [Common Options](#common-options)
6. [Performance Tuning and Memory Options](#performance-tuning-and-memory-options)

## Quick Start

To get started right away, run the following command, making sure to use the correct path for the model you have:

#### Unix-based systems (Linux, macOS, etc.):
### Unix-based systems (Linux, macOS, etc.):

```bash
./server -m models/7B/ggml-model.bin --ctx_size 2048
./server -m models/7B/ggml-model.bin -c 2048
```

#### Windows:
### Windows:

```powershell
server.exe -m models\7B\ggml-model.bin --ctx_size 2048
server.exe -m models\7B\ggml-model.bin -c 2048
```

That will start a server that by default listens on `127.0.0.1:8080`.
You can consume the endpoints with Postman, or from NodeJS with the axios library.

## Testing with CURL

Using [curl](https://curl.se/). On Windows `curl.exe` should be available in the base OS.

```sh
curl --request POST \
--url http://localhost:8080/completion \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```

## Node JS Test

@@ -50,7 +63,6 @@ const prompt = `Building a website can be done in 10 simple steps:`;
async function Test() {
let result = await axios.post("http://127.0.0.1:8080/completion", {
prompt,
batch_size: 128,
n_predict: 512,
});

@@ -69,244 +81,53 @@ node .

## API Endpoints

You can interact with these API endpoints.
This implementation only supports chat-style interaction; a combined request sketch follows the endpoint list below.

- **POST** `hostname:port/completion`: Set up the llama context and start the completion task.

*Options:*

`batch_size`: Set the batch size for prompt processing (default: 512).

`temperature`: Adjust the randomness of the generated text (default: 0.8).

`top_k`: Limit the next token selection to the K most probable tokens (default: 40).
*Options:*

`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
`temperature`: Adjust the randomness of the generated text (default: 0.8).

`n_predict`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).
`top_k`: Limit the next token selection to the K most probable tokens (default: 40).

`threads`: Set the number of threads to use during computation.
`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).

`n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
`n_predict`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).

`as_loop`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
`n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.
By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.

`interactive`: It allows interacting with the completion, and the completion stops as soon as it encounters a `stop word`. To enable this, set to `true`.
`stream`: Receive each predicted token in real time instead of waiting for the completion to finish. To enable this, set it to `true`.

`prompt`: Provide a prompt. Internally, the prompt is compared to the previous one, the part that has already been evaluated is detected, and only the remaining part is evaluated.

`stop`: Specify the words or characters that indicate a stop. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.

`exclude`: Specify the words or characters you do not want to appear in the completion. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
`stop`: Specify the strings that indicate a stop.
These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
Default: `[]`

- **POST** `hostname:port/embedding`: Generate embedding of a given text

*Options:*

`content`: Set the text to generate the embedding from.
*Options:*

`threads`: Set the number of threads to use during computation.
`content`: Set the text to generate the embedding from.

To use this endpoint, you need to start the server with the `--embedding` option added.

- **POST** `hostname:port/tokenize`: Tokenize a given text

*Options:*

`content`: Set the text to tokenize.
*Options:*

- **GET** `hostname:port/next-token`: Receive the next token predicted, execute this request in a loop. Make sure set `as_loop` as `true` in the completion request.

*Options:*

`stop`: Set `hostname:port/next-token?stop=true` to stop the token generation.
`content`: Set the text to tokenize.
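
As a sketch of how the completion options above combine in a single request (assuming a server already running on the default `127.0.0.1:8080`, as in the Quick Start), the following sends a chat-style prompt with explicit sampling settings and a stop string:

```sh
# Sketch: a chat-style completion request using several of the documented options.
curl --request POST \
    --url http://127.0.0.1:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "### Human: Hello, Assistant.\n### Assistant:", "temperature": 0.2, "top_k": 40, "top_p": 0.9, "n_predict": 128, "stop": ["\n### Human:"], "stream": false}'
```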

## More examples

### Interactive mode

This mode allows interacting in a chat-like manner. It is recommended for models designed as assistants such as `Vicuna`, `WizardLM`, `Koala`, among others. Make sure to add the correct stop word for the corresponding model.

The prompt should be generated by you, according to the model's guidelines. You should keep adding the model's completions to the context as well.

This example works well for `Vicuna - version 1`.

```javascript
const axios = require("axios");

let prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
### Human: Hello, Assistant.
### Assistant: Hello. How may I help you today?
### Human: Please tell me the largest city in Europe.
### Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia.`;

async function ChatCompletion(answer) {
// the user's next question to the prompt
prompt += `\n### Human: ${answer}\n`

result = await axios.post("http://127.0.0.1:8080/completion", {
prompt,
batch_size: 128,
temperature: 0.2,
top_k: 40,
top_p: 0.9,
n_keep: -1,
n_predict: 2048,
stop: ["\n### Human:"], // when detect this, stop completion
exclude: ["### Assistant:"], // no show in the completion
threads: 8,
as_loop: true, // use this to request the completion token by token
interactive: true, // enable the detection of a stop word
});

// create a loop to receive every token predicted
// note: this operation is blocking; avoid using it in a UI thread

let message = "";
while (true) {
// you can stop the inference adding '?stop=true' like this http://127.0.0.1:8080/next-token?stop=true
result = await axios.get("http://127.0.0.1:8080/next-token");
process.stdout.write(result.data.content);
message += result.data.content;

// to avoid an infinite loop
if (result.data.stop) {
console.log("Completed");
// make sure to add the completion to the prompt.
prompt += `### Assistant: ${message}`;
break;
}
}
}

// This function should be called every time a question to the model is needed.
async function Test() {
// the server can't run inference in parallel
await ChatCompletion("Write a long story about a time magician in a fantasy world");
await ChatCompletion("Summary the story");
}

Test();
```

### Alpaca example

**Temporary note:** not tested; if you have the model, please test it and report any issues.

```javascript
const axios = require("axios");

let prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.
`;

async function DoInstruction(instruction) {
prompt += `\n\n### Instruction:\n\n${instruction}\n\n### Response:\n\n`;
result = await axios.post("http://127.0.0.1:8080/completion", {
prompt,
batch_size: 128,
temperature: 0.2,
top_k: 40,
top_p: 0.9,
n_keep: -1,
n_predict: 2048,
stop: ["### Instruction:\n\n"], // when detect this, stop completion
exclude: [], // no show in the completion
threads: 8,
as_loop: true, // use this to request the completion token by token
interactive: true, // enable the detection of a stop word
});

// create a loop to receive every token predicted
// note: this operation is blocking; avoid using it in a UI thread

let message = "";
while (true) {
result = await axios.get("http://127.0.0.1:8080/next-token");
process.stdout.write(result.data.content);
message += result.data.content;

// to avoid an infinite loop
if (result.data.stop) {
console.log("Completed");
// make sure to add the completion and the user's next question to the prompt.
prompt += message;
break;
}
}
}

// This function should be called every time an instruction to the model is needed.
DoInstruction("Destroy the world"); // as joke
```

### Embeddings

First, run the server with `--embedding` option:

```bash
server -m models/7B/ggml-model.bin --ctx_size 2048 --embedding
```

Run this code in NodeJS:

```javascript
const axios = require('axios');

async function Test() {
let result = await axios.post("http://127.0.0.1:8080/embedding", {
content: `Hello`,
threads: 5
});
// print the embedding array
console.log(result.data.embedding);
}

Test();
```

Check the sample in [chat.mjs](chat.mjs).
Run with node:

```sh
node chat.mjs
```

### Tokenize

Run this code in NodeJS:

```javascript
const axios = require('axios');

async function Test() {
let result = await axios.post("http://127.0.0.1:8080/tokenize", {
content: `Hello`
});
// print the tokens array
console.log(result.data.tokens);
}

Test();
```

## Common Options

- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
- `-ngl N, --n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
- `--embedding`: Enable the embedding mode. **Completion function doesn't work in this mode**.
- `--host`: Set the hostname or IP address to listen on. Default: `127.0.0.1`.
- `--port`: Set the port to listen. Default: `8080`.

### RNG Seed

- `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).

The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run.
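
For example, a run with a fixed seed should reproduce the same completion for the same prompt and settings (a minimal sketch; the seed value 42 is an arbitrary choice):

```sh
# Sketch: fix the RNG seed so repeated runs are reproducible (42 is arbitrary).
./server -m models/7B/ggml-model.bin -c 2048 -s 42
```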

## Performance Tuning and Memory Options

### No Memory Mapping

- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance.

### Memory Float 32

- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. This doubles the context memory requirement but does not appear to increase generation quality in a measurable way. Not recommended.
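
As a sketch, both memory-related flags can be combined at startup, though as noted above neither is generally recommended:

```sh
# Sketch: disable memory mapping and use 32-bit KV cache floats (usually not recommended).
./server -m models/7B/ggml-model.bin -c 2048 --no-mmap --memory-f32
```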

## Limitations

- The current implementation of llama.cpp needs a `llama-state` to handle multiple contexts and clients, but this could require more powerful hardware.
61 changes: 61 additions & 0 deletions examples/server/chat.mjs
@@ -0,0 +1,61 @@
import * as readline from 'node:readline/promises';
import { stdin as input, stdout as output } from 'node:process';

const chat = [
{ human: "Hello, Assistant.",
assistant: "Hello. How may I help you today?" },
{ human: "Please tell me the largest city in Europe.",
assistant: "Sure. The largest city in Europe is Moscow, the capital of Russia." },
]

function format_prompt(question) {
return "A chat between a curious human and an artificial intelligence assistant. "
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.\n"
+ chat.map(m => `### Human: ${m.human}\n### Assistant: ${m.assistant}`).join("\n")
+ `\n### Human: ${question}\n### Assistant:`
}

async function ChatCompletion(question) {
const result = await fetch("http://127.0.0.1:8080/completion", {
method: 'POST',
body: JSON.stringify({
prompt: format_prompt(question),
temperature: 0.2,
top_k: 40,
top_p: 0.9,
n_keep: 29,
n_predict: 256,
stop: ["\n### Human:"], // when detect this, stop completion
stream: true,
})
})

if (!result.ok) {
return;
}

let answer = ''

for await (var chunk of result.body) {
const t = Buffer.from(chunk).toString('utf8')
if (t.startsWith('data: ')) {
const message = JSON.parse(t.substring(6))
answer += message.content
process.stdout.write(message.content)
if (message.stop) break;
}
}

process.stdout.write('\n')
chat.push({ human: question, assistant: answer })
}

const rl = readline.createInterface({ input, output });

while(true) {

const question = await rl.question('> ')
await ChatCompletion(question);

}