Get started
Go to the apps section in the web console and click the small, medium or large instance of vLLM. This gives you sensible default settings, but you can fully customise your deployment at the next step. Note: you will need to enter your HuggingFace model id and your HuggingFace API token. Be aware that many models are gated: you must go to the model page, sign an agreement and wait for approval before you can use the model. If you try to use a model without approval, vLLM won't work. You can check your access in advance with the sketch below.
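If you want to confirm ahead of time that your token has access to a gated model, a minimal sketch using the huggingface_hub package is shown below. The model id and token are placeholders; substitute your own values.

```python
from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder: a typically gated model
HF_TOKEN = "hf_..."  # placeholder: your HuggingFace API token

try:
    # model_info succeeds only if the token can access the repository
    info = model_info(MODEL_ID, token=HF_TOKEN)
    print(f"Access OK: {info.id}")
except GatedRepoError:
    print("Model is gated: sign the agreement on the model page and wait for approval.")
```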
Model selection
The supported models are listed here: vLLM models. They fall into these categories:
- Text Generation
- Text Embedding
- Reward Modeling
- Classification
- Sentence Pair Scoring
- Multimodal (Text, Image, Video, Audio)
- Transcription
GPU selection
The model(s) you wish to run determine how much VRAM your GPU will need. There is a calculator here: LLM-Model-VRAM-Calculator. For a quick ballpark figure, see the sketch below.
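The calculator gives precise numbers; if you just want a rough estimate, a common rule of thumb is weights = parameters × bytes per parameter, plus an overhead factor for the KV cache and activations. The 1.2 overhead factor below is an assumption, not a measured value.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead_factor: float = 1.2) -> float:
    """Very rough VRAM estimate for serving a model.

    params_billion:  model size in billions of parameters (e.g. 7 for a 7B model)
    bytes_per_param: 2.0 for fp16/bf16, 1.0 for 8-bit, 0.5 for 4-bit quantisation
    overhead_factor: assumed multiplier for KV cache, activations and CUDA overhead
    """
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte each ~= 1 GB
    return weights_gb * overhead_factor

# Example: a 7B model in fp16 needs roughly 7 * 2 * 1.2 ~= 16.8 GB of VRAM
print(f"{estimate_vram_gb(7):.1f} GB")
```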
Using vLLM
When you deploy the VM you will be shown the VM information page. On the left-hand side there is a pane called 'Metadata'. For vLLM you will see the following metadata:
- CUDO_HF_TOKEN: the HuggingFace token you provided
- CUDO_MODEL: the HuggingFace model you provided
- CUDO_TOKEN: the token generated to act as your API key / password
- port: the port to connect to your vLLM instance on
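Once the VM is up, you can use these metadata values to check that the server is reachable. vLLM's OpenAI-compatible server exposes a /v1/models endpoint that lists the model being served; the sketch below assumes the placeholder IP, port and token are replaced with your own values.

```python
import requests

VM_IP_ADDRESS = "203.0.113.10"  # placeholder: your VM's IP address
PORT = 8000                     # placeholder: the 'port' value from the Metadata pane
CUDO_TOKEN = "your-cudo-token"  # placeholder: the CUDO_TOKEN value

# A 200 response listing your model means the server is ready to accept requests.
resp = requests.get(
    f"http://{VM_IP_ADDRESS}:{PORT}/v1/models",
    headers={"Authorization": f"Bearer {CUDO_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```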
OpenAI API
Now the model is ready, you can use the openai Python library. Replace CUDO_TOKEN and VM-IP-ADDRESS below with the data from the previous step.
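A minimal sketch is shown below. It assumes the openai package (v1+) and the default vLLM port 8000; use the port value from the Metadata pane if yours differs.

```python
from openai import OpenAI

# Replace CUDO_TOKEN and VM-IP-ADDRESS with the values from the previous step.
# The port (8000 here) should match the 'port' value in the Metadata pane.
client = OpenAI(
    api_key="CUDO_TOKEN",
    base_url="http://VM-IP-ADDRESS:8000/v1",
)

# vLLM serves the model you configured; ask the server for its id
# rather than hard-coding it.
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello from vLLM!"}],
)
print(response.choices[0].message.content)
```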