Allow using multiple GPUs without tensor parallelism #1031

@gjurdzinski-deepsense

Description

Feature request

Currently, to use multiple GPUs one must set `--num-shard` to a value greater than 1. This enables tensor parallelism, but using multiple GPUs can be done in other ways as well.

In fact, in the code, `from_pretrained` already has a `device_map` argument set to `"auto"`, which would spread the model across multiple GPUs if a single shard had them available. This suggests it's most likely not much work to rework TGI to allow that.
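For context, a minimal sketch of what such a load looks like outside TGI with the `transformers`/`accelerate` stack (the model id matches the one mentioned below; the dtype and generation arguments are illustrative, not TGI's actual code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate partition the layers across all
# visible GPUs (e.g. two Tesla T4s), so no single device needs to
# hold the whole model.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,  # may be needed for Falcon on older transformers versions
)
print(model.hf_device_map)  # shows which layers landed on which GPU

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

Note this is pipeline-style placement (each GPU holds a contiguous slice of layers) rather than tensor parallelism, which is exactly why it sidesteps the head-count constraint described in the motivation below.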

Motivation

This would allow more customization of LLM deployments.

Also, some models don't work with tensor parallelism. E.g. falcon-7b-instruct has 71 attention heads, which means it can run only on 1 or 71 shards. With e.g. two NVIDIA Tesla T4s available, Falcon 7B won't fit on a single one; it would fit across two, but TGI can't do that.
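The constraint is that tensor parallelism splits the attention heads evenly across shards, so the shard count must divide the head count; 71 is prime, hence only 1 or 71 work. A quick illustration (the helper name is hypothetical):

```python
def valid_shard_counts(num_heads: int) -> list[int]:
    # Tensor parallelism splits attention heads evenly across shards,
    # so a shard count is only valid if it divides the head count.
    return [n for n in range(1, num_heads + 1) if num_heads % n == 0]

print(valid_shard_counts(71))  # [1, 71] -- 71 is prime
print(valid_shard_counts(32))  # [1, 2, 4, 8, 16, 32]
```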

Your contribution

I'm happy to test the solution.
