
Conversation

@GoHomeToMacDonal
Contributor

An implementation of ChatGLM2 based on vLLM. It adapts the PagedAttentionWithRoPE and ParallelLinear layers for model inference.

@simon-mo
Collaborator

simon-mo commented Nov 2, 2023

Thank you for the contribution. Unfortunately, this PR seems to have some merge conflicts, and ChatGLM3 also came out. Feel free to coordinate the contribution here if you have the bandwidth!

#1552

@GoHomeToMacDonal
Contributor Author

@simon-mo Hi, we have resolved the merge conflict, and the PR can now be merged directly into the main branch.

As ChatGLM3 does not change the model structure, this implementation can be applied to ChatGLM3 directly. Below is the testing code:

from vllm import LLM, SamplingParams

prompts = ["""<|system|>
You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.
<|user|>
Hello
<|assistant|>
"""]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="/home/skim/.cache/modelscope/hub/ZhipuAI/chatglm3-6b", trust_remote_code=True)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The output will be: "Hello! How can I assist you today?"

@simon-mo simon-mo requested a review from zhuohan123 November 6, 2023 17:44
@simon-mo simon-mo mentioned this pull request Nov 6, 2023
Member

@zhuohan123 zhuohan123 left a comment

LGTM! Merged with main and fixed a small style issue. The code works with both ChatGLM2 and ChatGLM3 on one GPU in my case. Thank you for your contribution!

@zhuohan123 zhuohan123 merged commit 1a2bbc9 into vllm-project:main Nov 7, 2023
liuyhwangyh pushed a commit to liuyhwangyh/vllm that referenced this pull request Nov 8, 2023
add support modelscope mode

revert not affect file

Support Yi model (vllm-project#1567)

ChatGLM Support (vllm-project#1261)
xjpang pushed a commit to xjpang/vllm that referenced this pull request Nov 13, 2023
@Midnight-719

Hi, can you tell me how to use it? I still get this error: AttributeError: 'ChatGLMConfig' object has no attribute 'num_hidden_layers'. I have already updated to the latest version of vLLM.

@GoHomeToMacDonal
Contributor Author

This problem is caused by an old version of transformers. I suggest upgrading both your transformers package and the ChatGLM model to recent versions.

@Midnight-719

This problem is caused by an old version of transformers. I suggest upgrading both your transformers package and the ChatGLM model to recent versions.

Yes, I have tried; transformers==4.35.0

@GoHomeToMacDonal
Contributor Author

This problem is caused by an old version of transformers. I suggest upgrading both your transformers package and the ChatGLM model to recent versions.

Yes, I have tried; transformers==4.35.0

Please provide more information about your installed packages, and I will try to reproduce the problem later.

@Jeffwan
Contributor

Jeffwan commented Nov 15, 2023

@GoHomeToMacDonal If you use other prompts, the output shows a big difference from the native model. Did you try more examples?

This is one example; the generation seems to stop after hitting some token (see the attached screenshot).

@GoHomeToMacDonal
Contributor Author

@GoHomeToMacDonal If you use other prompts, the output shows a big difference from the native model. Did you try more examples?

This is one example; the generation seems to stop after hitting some token (see the attached screenshot).

I guess you used the default max_tokens=16 in the sampling parameters. I suggest setting max_tokens to a larger value, e.g., 1024. For more details, please refer to vllm/sampling_params.py.

In addition, since ChatGLM3 adds some special tokens, e.g., <|system|>, it is more stable to use ChatGLMTokenizer.build_chat_input to build the input token ids and feed them into vLLM.
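
For reference, a minimal sketch of that approach, assuming the same local model path as the testing code above and that llm.generate accepts pre-tokenized input via a prompt_token_ids argument (as in the vLLM version around this PR); build_chat_input is provided by the tokenizer shipped with the ChatGLM3 model repo:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/home/skim/.cache/modelscope/hub/ZhipuAI/chatglm3-6b"

# build_chat_input inserts the <|system|>/<|user|>/<|assistant|> special tokens for us.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer.build_chat_input("Hello", history=[], role="user")
token_ids = inputs["input_ids"][0].tolist()

llm = LLM(model=model_path, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

# Feed the pre-built token ids directly instead of a prompt string.
outputs = llm.generate(prompt_token_ids=[token_ids], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)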

@Jeffwan
Contributor

Jeffwan commented Nov 15, 2023

@GoHomeToMacDonal It was the max_tokens setting; adding max_tokens makes it work as expected. It's my first time using the OpenAI wrapper, thanks for the advice. BTW, I plan to use lm-sys/FastChat#2622 to build the conversation template; I did some tests and the result looks equivalent to what ChatGLMTokenizer.build_chat_input generates.
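
For anyone following the FastChat route, a quick sketch (the template name "chatglm3" is an assumption about how that PR registers it):

from fastchat.conversation import get_conv_template

conv = get_conv_template("chatglm3")
conv.append_message(conv.roles[0], "Hello")  # user turn
conv.append_message(conv.roles[1], None)     # empty assistant turn to be generated
prompt = conv.get_prompt()

# The resulting prompt string can then be passed to llm.generate as usual.
print(prompt)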

use ChatGLMTokenizer.build_chat_input to build the input token ids and feed them into vLLM

Just curious, does vLLM provide an interface that accepts token ids?

@wangruohui
Contributor

Hello, I am using ChatGLM2, but it seems the output is sometimes not aligned with the Hugging Face version. Could anyone help take a look at #1670?

@zengzikang

The latest version already supports GLM. Can GLM3 support official tool calls and other functions? Does it support the dialogue function?

@GoHomeToMacDonal
Contributor Author

The latest version already supports GLM. Can GLM3 support official tool calls and other functions? Does it support the dialogue function?

You need to implement the corresponding code for function calls and prompt building yourself. The vllm library focuses on model inference, i.e., it can replace ChatGLMForConditionalGeneration.generate with llm.generate. I suggest building prompts based on the official ChatGLM3 repository and replacing the model inference functions, e.g., ChatGLMForConditionalGeneration.chat and ChatGLMForConditionalGeneration.stream_chat, with vLLM.
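
As a rough illustration, a chat-style wrapper around llm.generate might look like the sketch below. The prompt format follows the ChatGLM3 special tokens shown earlier in this thread; the chat helper and its history handling are illustrative only, not part of vLLM or the ChatGLM repository.

from vllm import LLM, SamplingParams

llm = LLM(model="THUDM/chatglm3-6b", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

def chat(query, history):
    # Rebuild the whole conversation as a ChatGLM3-style prompt string.
    prompt = ""
    for user_turn, assistant_turn in history:
        prompt += f"<|user|>\n{user_turn}\n<|assistant|>\n{assistant_turn}\n"
    prompt += f"<|user|>\n{query}\n<|assistant|>\n"
    response = llm.generate([prompt], sampling_params)[0].outputs[0].text
    history.append((query, response))
    return response, history

response, history = chat("Hello", [])
print(response)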

@zengzikang

The latest version already supports GLM. Can GLM3 support official tool calls and other functions? Does it support the dialogue function?

You need to implement the corresponding code for function calls and prompt building yourself. The vllm library focuses on model inference, i.e., it can replace ChatGLMForConditionalGeneration.generate with llm.generate. I suggest building prompts based on the official ChatGLM3 repository and replacing the model inference functions, e.g., ChatGLMForConditionalGeneration.chat and ChatGLMForConditionalGeneration.stream_chat, with vLLM.

Using vLLM to run inference on the GLM3 model, the speed is only about 13% faster. Is that normal?

@junior-zsy

@GoHomeToMacDonal This is still not supported for the chatglm2-6b-32k version. I have left a message in issue #1725.

@kerthcet kerthcet mentioned this pull request Dec 16, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
This was referenced Mar 6, 2024
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
Refactor the token-wise padding mechanism to a more elegant implementation, correcting the padding logic errors introduced by the previous multimodal commit vllm-project#736. This is a clean version of vllm-project#1259.

Signed-off-by: Yizhou Liu <[email protected]>

Labels

new-model Requests to new models


8 participants