Feature request
Currently, to use multiple GPUs one must set `--num-shard` to a value greater than 1. This enables tensor parallelism, but multiple GPUs can be used in other ways as well. In fact, in the code, `from_pretrained` already has a `device_map` argument set to `"auto"`, which would spread the model across multiple GPUs if a single shard had them available. This means it's most likely not much work to rework TGI to allow that.
Motivation
This would allow more customization of the LLM deployment.
Also, some models don't work with tensor parallelism. E.g., falcon-7b-instruct has 71 attention heads, which means it can only run on 1 or 71 shards, since the head count must be divisible by the number of shards. With, e.g., two Nvidia Tesla T4s available, Falcon 7B won't fit on a single one; it would fit across two, but we can't do that with TGI.
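To illustrate the constraint (assuming the usual rule that tensor parallelism splits attention heads evenly across shards):

```python
# falcon-7b-instruct has 71 attention heads; 71 is prime, so the only
# shard counts that divide it evenly are 1 and 71.
num_heads = 71
valid_shards = [n for n in range(1, num_heads + 1) if num_heads % n == 0]
print(valid_shards)  # [1, 71]
```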
Your contribution
I'm happy to test the solution.