diff --git a/docs/arena.md b/docs/arena.md
index 979f41db5..9d83b8c0d 100644
--- a/docs/arena.md
+++ b/docs/arena.md
@@ -5,10 +5,11 @@ We invite the entire community to join this benchmarking effort by contributing
 ## How to add a new model
 If you want to see a specific model in the arena, you can follow the methods below.
-- Method 1: Hosted by LMSYS.
- 1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md#how-to-support-a-new-model).
- 2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
+### Method 1: Hosted by 3rd party API providers or yourself
+If you have a model hosted by a 3rd party API provider or yourself, please give us access to an API endpoint.
+ - We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/gradio/fastchat/serve/api_provider.py) for calling OpenAI models.
+ - If you have your own API protocol, please follow the [instructions](model_support.md) to add it. Contribute your code by sending a pull request.
-- Method 2: Hosted by 3rd party API providers or yourself.
- 1. If you have a model hosted by a 3rd party API provider or yourself, please give us an API endpoint. We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/33dca5cf12ee602455bfa9b5f4790a07829a2db7/fastchat/serve/gradio_web_server.py#L333-L358) for calling OpenAI models.
- 2. You can use FastChat's OpenAI API [server](openai_api.md) to serve your model with OpenAI-compatible APIs and provide us with the endpoint.
+### Method 2: Hosted by LMSYS
+1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md).
+2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
diff --git a/docs/model_support.md b/docs/model_support.md
index d75717fc9..4a3b703c3 100644
--- a/docs/model_support.md
+++ b/docs/model_support.md
@@ -1,8 +1,12 @@
 # Model Support
+This document describes how to support a new model in FastChat.
-## How to support a new model
+## Contents
+- [Local Models](#local-models)
+- [API-Based Models](#api-based-models)
-To support a new model in FastChat, you need to correctly handle its prompt template and model loading.
+## Local Models
+To support a new local model in FastChat, you need to correctly handle its prompt template and model loading.
 The goal is to make the following command run with the correct prompts.
 ```
@@ -27,32 +31,7 @@ FastChat uses the `Conversation` class to handle prompt templates and `BaseModel
 After these steps, the new model should be compatible with most FastChat features, such as CLI, web UI, model worker, and OpenAI-compatible API server. Please do some testing with these features as well.
-### API-based model
-
-For API-based model, you still need to follow the above steps to implement conversation template, adapter, and register the model. In addition, you need to
-1. Implement an API-based streaming token generator in [fastchat/serve/api_provider.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py)
-2. Specify your endpoint info in a JSON configuration file
-```
-{
-  "gpt-3.5-turbo-0613": {
-    "model_name": "gpt-3.5-turbo-0613",
-    "api_base": "https://api.openai.com/v1",
-    "api_key": "XXX",
-    "api_type": "openai"
-  }
-}
-```
-3. Invoke your API generator in `bot_response` of [fastchat/serve/gradio_web_server.py](https://github.com/lm-sys/FastChat/blob/22642048eeb2f1f06eb1c4e0490d802e91e62473/fastchat/serve/gradio_web_server.py#L427) accordingly.
-4. Launch the gradio web server with argument `--register [JSON-file]`.
-```
-python3 -m fastchat.serve.gradio_web_server --register [JSON-file]
-```
-You should be able to chat with your API-based model!
-
-Currently, FastChat supports OpenAI, Anthropic, Google Vertex AI, Mistral, and Nvidia NGC.
-
-
-## Supported models
+### Supported models
 
 - [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
   - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
@@ -121,3 +100,27 @@ Currently, FastChat supports OpenAI, Anthropic, Google Vertex AI, Mistral, and N
 setting the environment variable `PEFT_SHARE_BASE_WEIGHTS=true` in any model worker.
 
+## API-Based Models
+1. Implement an API-based streaming generator in [fastchat/serve/api_provider.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py). You can learn from the OpenAI example.
+2. Specify your endpoint info in a JSON configuration file.
+```
+{
+  "gpt-3.5-turbo-0613": {
+    "model_name": "gpt-3.5-turbo-0613",
+    "api_type": "openai",
+    "api_base": "https://api.openai.com/v1",
+    "api_key": "sk-******",
+    "anony_only": false
+  }
+}
+```
+ - "api_type" can be one of the following: openai, anthropic, gemini, mistral. For your own API, you can add a new type and implement it.
+ - "anony_only" controls whether to show this model in anonymous mode only.
+3. Launch the gradio web server with the argument `--register [JSON-file]`.
+
+```
+python3 -m fastchat.serve.gradio_web_server --controller "" --share --register [JSON-file]
+```
+
+You should be able to chat with your API-based model!
+Currently, FastChat supports OpenAI, Anthropic, Google Vertex AI, Mistral, and Nvidia NGC.
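As context for the "API-Based Models" instructions added above: a new provider is a generator in `fastchat/serve/api_provider.py` that yields dictionaries of the form `{"text": ..., "error_code": ...}`, and `get_api_provider_stream_iter` (added in the next file of this patch) dispatches to it based on the `api_type` field of the registered endpoint. The sketch below is illustrative only and is not part of this patch; the endpoint path, payload fields, and response format of the hypothetical `my_api` provider are assumptions, while the yielded dictionary contract mirrors the generators already in `api_provider.py`.

```
# Illustrative sketch (not part of this patch): a custom streaming generator
# for a hypothetical "my_api" api_type. The endpoint path, payload fields, and
# per-line response format are assumptions; the yielded dict contract
# ({"text": cumulative_text, "error_code": 0}) follows api_provider.py.
import json

import requests


def my_api_stream_iter(
    model_name, messages, temperature, top_p, max_new_tokens, api_base, api_key
):
    payload = {
        "model": model_name,
        "messages": messages,  # OpenAI-style [{"role": ..., "content": ...}]
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_new_tokens,
        "stream": True,
    }
    res = requests.post(
        f"{api_base}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        stream=True,
        timeout=60,
    )
    text = ""
    for line in res.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)  # assumed: one JSON object per streamed line
        text += chunk.get("delta", "")  # assumed field carrying the new tokens
        yield {"text": text, "error_code": 0}
```

Wiring it up would then amount to a new `elif model_api_dict["api_type"] == "my_api":` branch in `get_api_provider_stream_iter` and an entry with `"api_type": "my_api"` in the JSON file passed to `--register`.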
diff --git a/fastchat/serve/api_provider.py b/fastchat/serve/api_provider.py index c053b9071..9fa852bca 100644 --- a/fastchat/serve/api_provider.py +++ b/fastchat/serve/api_provider.py @@ -1,20 +1,93 @@ """Call API providers.""" -from json import loads -import os - import json +import os import random -import requests import time +import requests + from fastchat.utils import build_logger -from fastchat.constants import WORKER_API_TIMEOUT logger = build_logger("gradio_web_server", "gradio_web_server.log") +def get_api_provider_stream_iter( + conv, + model_name, + model_api_dict, + temperature, + top_p, + max_new_tokens, +): + if model_api_dict["api_type"] == "openai": + prompt = conv.to_openai_api_messages() + stream_iter = openai_api_stream_iter( + model_api_dict["model_name"], + prompt, + temperature, + top_p, + max_new_tokens, + api_base=model_api_dict["api_base"], + api_key=model_api_dict["api_key"], + ) + elif model_api_dict["api_type"] == "anthropic": + prompt = conv.get_prompt() + stream_iter = anthropic_api_stream_iter( + model_name, prompt, temperature, top_p, max_new_tokens + ) + elif model_api_dict["api_type"] == "gemini": + stream_iter = gemini_api_stream_iter( + model_api_dict["model_name"], + conv, + temperature, + top_p, + max_new_tokens, + api_key=model_api_dict["api_key"], + ) + elif model_api_dict["api_type"] == "bard": + prompt = conv.to_openai_api_messages() + stream_iter = bard_api_stream_iter( + model_api_dict["model_name"], + prompt, + temperature, + top_p, + api_key=model_api_dict["api_key"], + ) + elif model_api_dict["api_type"] == "mistral": + prompt = conv.to_openai_api_messages() + stream_iter = mistral_api_stream_iter( + model_name, prompt, temperature, top_p, max_new_tokens + ) + elif model_api_dict["api_type"] == "nvidia": + prompt = conv.to_openai_api_messages() + stream_iter = nvidia_api_stream_iter( + model_name, + prompt, + temperature, + top_p, + max_new_tokens, + model_api_dict["api_base"], + ) + elif model_api_dict["api_type"] == "ai2": + prompt = conv.to_openai_api_messages() + stream_iter = ai2_api_stream_iter( + model_name, + model_api_dict["model_name"], + prompt, + temperature, + top_p, + max_new_tokens, + api_base=model_api_dict["api_base"], + api_key=model_api_dict["api_key"], + ) + else: + raise NotImplementedError() + + return stream_iter + + def openai_api_stream_iter( model_name, messages, @@ -111,65 +184,6 @@ def anthropic_api_stream_iter(model_name, prompt, temperature, top_p, max_new_to yield data -def init_palm_chat(model_name): - import vertexai # pip3 install google-cloud-aiplatform - from vertexai.preview.language_models import ChatModel - from vertexai.preview.generative_models import GenerativeModel - - project_id = os.environ["GCP_PROJECT_ID"] - location = "us-central1" - vertexai.init(project=project_id, location=location) - - if model_name in ["palm-2"]: - # According to release note, "chat-bison@001" is PaLM 2 for chat. 
- # https://cloud.google.com/vertex-ai/docs/release-notes#May_10_2023 - model_name = "chat-bison@001" - chat_model = ChatModel.from_pretrained(model_name) - chat = chat_model.start_chat(examples=[]) - elif model_name in ["gemini-pro"]: - model = GenerativeModel(model_name) - chat = model.start_chat() - return chat - - -def palm_api_stream_iter(model_name, chat, message, temperature, top_p, max_new_tokens): - if model_name in ["gemini-pro"]: - max_new_tokens = max_new_tokens * 2 - parameters = { - "temperature": temperature, - "top_p": top_p, - "max_output_tokens": max_new_tokens, - } - gen_params = { - "model": model_name, - "prompt": message, - } - gen_params.update(parameters) - if model_name == "palm-2": - response = chat.send_message(message, **parameters) - else: - response = chat.send_message(message, generation_config=parameters, stream=True) - - logger.info(f"==== request ====\n{gen_params}") - - try: - text = "" - for chunk in response: - text += chunk.text - data = { - "text": text, - "error_code": 0, - } - yield data - except Exception as e: - logger.error(f"==== error ====\n{e}") - yield { - "text": f"**API REQUEST ERROR** Reason: {e}\nPlease try again or increase the number of max tokens.", - "error_code": 1, - } - yield data - - def gemini_api_stream_iter( model_name, conv, temperature, top_p, max_new_tokens, api_key=None ): @@ -353,7 +367,7 @@ def ai2_api_stream_iter( text = "" for line in res.iter_lines(): if line: - part = loads(line) + part = json.loads(line) if "result" in part and "output" in part["result"]: for t in part["result"]["output"]["text"]: text += t diff --git a/fastchat/serve/gradio_block_arena_anony.py b/fastchat/serve/gradio_block_arena_anony.py index c6949746d..c9d8aba6b 100644 --- a/fastchat/serve/gradio_block_arena_anony.py +++ b/fastchat/serve/gradio_block_arena_anony.py @@ -27,7 +27,6 @@ disable_btn, invisible_btn, acknowledgment_md, - ip_expiration_dict, get_ip, get_model_description_md, ) @@ -630,7 +629,6 @@ def build_side_by_side_ui_anony(models): Find out who is the 🥇LLM Champion! ## 👇 Chat now! 
- """ states = [gr.State() for _ in range(num_sides)] @@ -640,7 +638,9 @@ def build_side_by_side_ui_anony(models): gr.Markdown(notice_markdown, elem_id="notice_markdown") with gr.Group(elem_id="share-region-anony"): - with gr.Accordion("🔍 Expand to see 20+ Arena players", open=False): + with gr.Accordion( + f"🔍 Expand to see the descriptions of {len(models)} models", open=False + ): model_description_md = get_model_description_md(models) gr.Markdown(model_description_md, elem_id="model_description_markdown") with gr.Row(): diff --git a/fastchat/serve/gradio_block_arena_named.py b/fastchat/serve/gradio_block_arena_named.py index 66aad60fc..9774c3dea 100644 --- a/fastchat/serve/gradio_block_arena_named.py +++ b/fastchat/serve/gradio_block_arena_named.py @@ -25,9 +25,8 @@ disable_btn, invisible_btn, acknowledgment_md, - get_model_description_md, - ip_expiration_dict, get_ip, + get_model_description_md, ) from fastchat.utils import ( build_logger, @@ -307,7 +306,9 @@ def build_side_by_side_ui_named(models): container=False, ) with gr.Row(): - with gr.Accordion("🔍 Expand to see 20+ model descriptions", open=False): + with gr.Accordion( + f"🔍 Expand to see the descriptions of {len(models)} models", open=False + ): model_description_md = get_model_description_md(models) gr.Markdown(model_description_md, elem_id="model_description_markdown") diff --git a/fastchat/serve/gradio_web_server.py b/fastchat/serve/gradio_web_server.py index c58f5dd36..b81c169fd 100644 --- a/fastchat/serve/gradio_web_server.py +++ b/fastchat/serve/gradio_web_server.py @@ -14,7 +14,6 @@ import gradio as gr import requests -from fastchat.conversation import SeparatorStyle from fastchat.constants import ( LOGDIR, WORKER_API_TIMEOUT, @@ -29,25 +28,14 @@ ) from fastchat.model.model_adapter import ( get_conversation_template, - ANTHROPIC_MODEL_LIST, ) from fastchat.model.model_registry import get_model_info, model_info -from fastchat.serve.api_provider import ( - anthropic_api_stream_iter, - openai_api_stream_iter, - palm_api_stream_iter, - gemini_api_stream_iter, - bard_api_stream_iter, - mistral_api_stream_iter, - nvidia_api_stream_iter, - ai2_api_stream_iter, - init_palm_chat, -) +from fastchat.serve.api_provider import get_api_provider_stream_iter from fastchat.utils import ( build_logger, - moderation_filter, get_window_url_params_js, get_window_url_params_with_tos_js, + moderation_filter, parse_gradio_auth_creds, ) @@ -87,18 +75,18 @@ """ -ip_expiration_dict = defaultdict(lambda: 0) - # JSON file format of API-based models: # { -# "vicuna-7b": { -# "model_name": "vicuna-7b-v1.5", -# "api_base": "http://8.8.8.55:5555/v1", -# "api_key": "password", -# "api_type": "openai", # openai, anthropic, palm, mistral -# "anony_only": false, # whether to show this model in anonymous mode only -# }, +# "gpt-3.5-turbo-0613": { +# "model_name": "gpt-3.5-turbo-0613", +# "api_type": "openai", +# "api_base": "https://api.openai.com/v1", +# "api_key": "sk-******", +# "anony_only": false +# } # } +# "api_type" can be one of the following: openai, anthropic, gemini, mistral. +# "anony_only" means whether to show this model in anonymous mode only. 
api_endpoint_info = {} @@ -109,9 +97,6 @@ def __init__(self, model_name): self.skip_next = False self.model_name = model_name - if model_name in ["palm-2", "gemini-pro"]: - self.palm_chat = init_palm_chat(model_name) - def to_gradio_chatbot(self): return self.conv.to_gradio_chatbot() @@ -140,6 +125,8 @@ def get_conv_log_filename(): def get_model_list(controller_url, register_api_endpoint_file): global api_endpoint_info + + # Add models from the controller if controller_url: ret = requests.post(controller_url + "/refresh_all_workers") assert ret.status_code == 200 @@ -148,11 +135,12 @@ def get_model_list(controller_url, register_api_endpoint_file): else: models = [] - # Add API providers + # Add models from the API providers if register_api_endpoint_file: api_endpoint_info = json.load(open(register_api_endpoint_file)) models += list(api_endpoint_info.keys()) + # Remove anonymous models models = list(set(models)) visible_models = models.copy() for mdl in visible_models: @@ -162,6 +150,7 @@ def get_model_list(controller_url, register_api_endpoint_file): if mdl_dict["anony_only"]: visible_models.remove(mdl) + # Sort models and add descriptions priority = {k: f"___{i:03d}" for i, k in enumerate(model_info)} models.sort(key=lambda x: priority.get(x, x)) visible_models.sort(key=lambda x: priority.get(x, x)) @@ -178,7 +167,6 @@ def load_demo_single(models, url_params): selected_model = model dropdown_update = gr.Dropdown(choices=models, value=selected_model, visible=True) - state = None return state, dropdown_update @@ -188,7 +176,6 @@ def load_demo(url_params, request: gr.Request): ip = get_ip(request) logger.info(f"load_demo. ip: {ip}. params: {url_params}") - ip_expiration_dict[ip] = time.time() + SESSION_EXPIRATION_TIME if args.model_list_mode == "reload": models, all_models = get_model_list( @@ -285,17 +272,6 @@ def add_text(state, model_selector, text, request: gr.Request): return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 5 -def post_process_code(code): - sep = "\n```" - if sep in code: - blocks = code.split(sep) - if len(blocks) % 2 == 1: - for i in range(1, len(blocks), 2): - blocks[i] = blocks[i].replace("\\_", "_") - code = sep.join(blocks) - return code - - def model_worker_stream_iter( conv, model_name, @@ -424,78 +400,15 @@ def bot_response( top_p, max_new_tokens, ) - elif model_api_dict["api_type"] == "openai": - prompt = conv.to_openai_api_messages() - stream_iter = openai_api_stream_iter( - model_api_dict["model_name"], - prompt, - temperature, - top_p, - max_new_tokens, - api_base=model_api_dict["api_base"], - api_key=model_api_dict["api_key"], - ) - elif model_api_dict["api_type"] == "anthropic": - prompt = conv.get_prompt() - stream_iter = anthropic_api_stream_iter( - model_name, prompt, temperature, top_p, max_new_tokens - ) - elif model_api_dict["api_type"] == "palm": - stream_iter = palm_api_stream_iter( - model_name, - state.palm_chat, - conv.messages[-2][1], - temperature, - top_p, - max_new_tokens, - ) - elif model_api_dict["api_type"] == "gemini": - stream_iter = gemini_api_stream_iter( - model_api_dict["model_name"], + else: + stream_iter = get_api_provider_stream_iter( conv, - temperature, - top_p, - max_new_tokens, - api_key=model_api_dict["api_key"], - ) - elif model_api_dict["api_type"] == "bard": - prompt = conv.to_openai_api_messages() - stream_iter = bard_api_stream_iter( - model_api_dict["model_name"], - prompt, - temperature, - top_p, - api_key=model_api_dict["api_key"], - ) - elif model_api_dict["api_type"] == "mistral": - prompt = 
conv.to_openai_api_messages() - stream_iter = mistral_api_stream_iter( - model_name, prompt, temperature, top_p, max_new_tokens - ) - elif model_api_dict["api_type"] == "nvidia": - prompt = conv.to_openai_api_messages() - stream_iter = nvidia_api_stream_iter( model_name, - prompt, - temperature, - top_p, - max_new_tokens, - model_api_dict["api_base"], - ) - elif model_api_dict["api_type"] == "ai2": - prompt = conv.to_openai_api_messages() - stream_iter = ai2_api_stream_iter( - model_name, - model_api_dict["model_name"], - prompt, + model_api_dict, temperature, top_p, max_new_tokens, - api_base=model_api_dict["api_base"], - api_key=model_api_dict["api_key"], ) - else: - raise NotImplementedError conv.update_last_message("▌") yield (state, state.to_gradio_chatbot()) + (disable_btn,) * 5 @@ -518,8 +431,6 @@ def bot_response( ) return output = data["text"].strip() - if "vicuna" in model_name: - output = post_process_code(output) conv.update_last_message(output) yield (state, state.to_gradio_chatbot()) + (enable_btn,) * 5 except requests.exceptions.RequestException as e: @@ -687,12 +598,8 @@ def build_about(): HuggingFace """ - - # state = gr.State() gr.Markdown(about_markdown, elem_id="about_markdown") - # return [state] - def build_single_model_ui(models, add_promotion_links=False): promotion = ( diff --git a/fastchat/serve/gradio_web_server_multi.py b/fastchat/serve/gradio_web_server_multi.py index 5aa7d36ce..5429f2bbc 100644 --- a/fastchat/serve/gradio_web_server_multi.py +++ b/fastchat/serve/gradio_web_server_multi.py @@ -9,9 +9,6 @@ import gradio as gr -from fastchat.constants import ( - SESSION_EXPIRATION_TIME, -) from fastchat.serve.gradio_block_arena_anony import ( build_side_by_side_ui_anony, load_demo_side_by_side_anony, @@ -29,7 +26,6 @@ build_about, get_model_list, load_demo_single, - ip_expiration_dict, get_ip, ) from fastchat.serve.monitor.monitor import build_leaderboard_tab @@ -48,14 +44,13 @@ def load_demo(url_params, request: gr.Request): ip = get_ip(request) logger.info(f"load_demo. ip: {ip}. 
params: {url_params}") - ip_expiration_dict[ip] = time.time() + SESSION_EXPIRATION_TIME selected = 0 if "arena" in url_params: selected = 0 elif "compare" in url_params: selected = 1 - elif "single" in url_params: + elif "direct" in url_params or "model" in url_params: selected = 2 elif "leaderboard" in url_params: selected = 3 @@ -67,9 +62,9 @@ def load_demo(url_params, request: gr.Request): ) single_updates = load_demo_single(models, url_params) - side_by_side_anony_updates = load_demo_side_by_side_anony(all_models, url_params) side_by_side_named_updates = load_demo_side_by_side_named(models, url_params) + return ( (gr.Tabs(selected=selected),) + single_updates @@ -84,6 +79,7 @@ def build_demo(models, elo_results_file, leaderboard_table_file): load_js = get_window_url_params_with_tos_js else: load_js = get_window_url_params_js + head_js = """ """ @@ -99,6 +95,7 @@ def build_demo(models, elo_results_file, leaderboard_table_file): window.__gradio_mode__ = "app"; """ + with gr.Blocks( title="Chat with Open Large Language Models", theme=gr.themes.Default(text_size=text_size), @@ -116,9 +113,11 @@ def build_demo(models, elo_results_file, leaderboard_table_file): single_model_list = build_single_model_ui( models, add_promotion_links=True ) + if elo_results_file: with gr.Tab("Leaderboard", id=3): build_leaderboard_tab(elo_results_file, leaderboard_table_file) + with gr.Tab("About Us", id=4): about = build_about() diff --git a/fastchat/utils.py b/fastchat/utils.py index f02b286a7..cf0095f44 100644 --- a/fastchat/utils.py +++ b/fastchat/utils.py @@ -177,7 +177,7 @@ def oai_moderation(text): def moderation_filter(text, model_list): - MODEL_KEYWORDS = ["claude", "gpt-4", "gpt-3.5", "bard"] + MODEL_KEYWORDS = ["claude", "gpt-4", "bard"] for keyword in MODEL_KEYWORDS: for model in model_list: