This tutorial walks you through the steps to serve a Large Language Model (LLM) on the cloud using dstack. dstack is a framework that helps you provision jobs on the cloud of your choice. Furthermore, you can choose between on-demand and spot VM instances to meet your requirements.
The cloud service provider is called a backend in dstack, and currently supported backends include Google Cloud Platform (GCP), Amazon Web Services (AWS), Microsoft Azure, Lambda Labs, and TensorDock.
This tutorial does not cover how to set up your own cloud account, configure a gateway on it, or request GPU quotas. These topics will be covered in separate tutorials in the future, and this tutorial will be updated to point to them when they are up. For the sake of simplicity, this tutorial uses dstack Sky, which is dstack's fully managed cloud service.
Let's create a directory named mistral first:
$ mkdir mistral
$ cd mistral
Then, inside the mistral directory, run the dstack init command.
This tutorial assumes that you have already installed the dstack package via pip install "dstack[all]".
$ dstack init
OK
If you encounter a "No default project, specify project name" error, you need to run dstack server first, as below:
$ dstack server &
$ dstack init
OK
Since dstack is simply a tool to provision any kind of job on the cloud, we can leverage almost any existing framework to serve an LLM, such as Text Generation Inference (TGI), vLLM, or others.
For instance, write the below YAML file and save it as serve.dstack.yml in the mistral directory. It describes a job that serves mistralai/Mistral-7B-Instruct-v0.2 via vLLM on a machine with 24GB of GPU memory:
# serve.dstack.yml
type: service
python: "3.11"
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.2
port: 8000
resources:
  gpu: 24GB
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL_ID --port 8000
model:
  format: openai
  type: chat
  name: mistral-7b-it
Alternatively, the below YAML file describes a job that serves the same model via TGI on the same type of machine:
# serve.dstack.yml
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.2
port: 8000
resources:
  gpu: 24GB
commands:
  - text-generation-launcher --port 8000 --trust-remote-code
model:
  format: tgi
  type: chat
  name: mistral-7b-it
The model field in the YAML exposes the model's endpoint in an OpenAI API compatible format. Different LLM serving frameworks expose different API endpoints, so the model field helps expose them in a uniform way.
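To make this concrete, the sketch below calls that uniform endpoint directly over HTTP. This is only an illustration: the gateway domain and token are placeholders, and the /chat/completions path simply mirrors the route that the OpenAI client (shown later in this tutorial) appends to its base_url.

import requests

# Placeholders: fill in your own gateway domain and dstack token
resp = requests.post(
    "https://gateway.<gateway domain>/chat/completions",
    headers={"Authorization": "Bearer <dstack token>"},
    json={
        "model": "mistral-7b-it",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())

Whether the job runs vLLM or TGI behind the scenes, this client-side code stays the same; that is exactly what the model field buys you.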
Choose either of the YAML files, then save it under the mistral directory as serve.dstack.yml.
Now, we are ready to provision a VM instance that serves the mistralai/Mistral-7B-Instruct-v0.2 model via OpenAI API compatible endpoints. To do that, simply run the dstack run command as below:
$ dstack run . -f serve.dstack.yml
Then, it shows all the available plans to choose from and an interactive prompt to confirm the decision.
⠸ Getting run plan...
Configuration serve.dstack.yml
Project deep-diver-main
User deep-diver
Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
Max price -
Max duration -
Spot policy auto
Retry policy no
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
...
Shown 3 of 193 offers, $5.876 max
Continue? [y/n]: y
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling)
spicy-treefrog-1 provisioning completed (running)
Service is published at ...
On the Continue? [y/n]: prompt, if you type y, it will start the provisioning process.
If you haven't installed the openai package, install it via the pip install openai command.
Now, you can directly interact with the provisioned mistralai/Mistral-7B-Instruct-v0.2 model using the openai package as below:
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.<gateway domain>",
    api_key="<dstack token>"
)

completion = client.chat.completions.create(
    model="mistral-7b-it",
    messages=[
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
    ]
)

print(completion.choices[0].message)
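Note that completion.choices[0].message is a message object; if you only want the generated text, read its content attribute:

# Print only the assistant's reply text
print(completion.choices[0].message.content)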
Everything works the same as the usual OpenAI API usage; just remember the following:

- base_url is the endpoint exposed by dstack. You can find this information on the model menu in the dstack UI.
- api_key is dstack's access token.
- model in client.chat.completions.create is the model name that you configured in the YAML file from step 3.
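Since the endpoint is OpenAI API compatible, client-side features such as streaming should also work. Here is a minimal sketch, assuming the same base_url, token, and model name as in the example above:

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.<gateway domain>",  # same endpoint as before
    api_key="<dstack token>"
)

# stream=True makes the server send back incremental chunks
stream = client.chat.completions.create(
    model="mistral-7b-it",
    messages=[{"role": "user", "content": "Explain recursion in one sentence."}],
    stream=True,
)

for chunk in stream:
    # each chunk carries a small piece of the generated reply
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")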
We have gone through how to serve the Mistral-7B-Instruct-v0.2 model on the cloud with dstack. However, it is straightforward to serve a different LLM with exactly the same steps, since almost every LLM is supported by modern serving frameworks such as TGI and vLLM.
In the next steps, I am going to write bite-sized tutorials about setting up your own cloud account, configuring a gateway on it, and requesting GPU quotas. Stay tuned. When they are ready, this page will be updated accordingly as well.