Allow using multiple GPUs without tensor parallelism #1031

@gjurdzinski-deepsense

Description

Feature request

Currently, to use multiple GPUs one must set `--num-shard` to a value greater than 1. This enables tensor parallelism, but using multiple GPUs can be done in other ways as well.

In fact, in the code, `from_pretrained` already has a `device_map` argument set to `"auto"`, which would spread the model across multiple GPUs if a single shard had them available. This suggests it's most likely not much work to rework TGI to allow that.
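For context, a minimal sketch of what such a load looks like outside TGI with the `transformers`/`accelerate` stack (the model id matches the one mentioned below; the dtype and generation arguments are illustrative, not TGI's actual code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate partition the layers across all
# visible GPUs (e.g. two Tesla T4s), so no single device needs to
# hold the whole model.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,  # may be needed for Falcon on older transformers versions
)
print(model.hf_device_map)  # shows which layers landed on which GPU

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

Note this is pipeline-style placement (each GPU holds a contiguous slice of layers) rather than tensor parallelism, which is exactly why it sidesteps the head-count constraint described in the motivation below.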

Motivation

This would allow more customization of LLM deployments.

Also, some models don't work with tensor parallelism. E.g. falcon-7b-instruct has 71 attention heads, which means it can run only on 1 or 71 shards. With e.g. two NVIDIA Tesla T4s available, Falcon 7B won't fit on a single one; it would fit across two, but TGI can't do that.
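The constraint is that tensor parallelism splits the attention heads evenly across shards, so the shard count must divide the head count; 71 is prime, hence only 1 or 71 work. A quick illustration (the helper name is hypothetical):

```python
def valid_shard_counts(num_heads: int) -> list[int]:
    # Tensor parallelism splits attention heads evenly across shards,
    # so a shard count is only valid if it divides the head count.
    return [n for n in range(1, num_heads + 1) if num_heads % n == 0]

print(valid_shard_counts(71))  # [1, 71] -- 71 is prime
print(valid_shard_counts(32))  # [1, 2, 4, 8, 16, 32]
```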

Your contribution

I'm happy to test the solution.
