This tutorial walks you through the steps to serve a Large Language Model (LLM) on the cloud using dstack. dstack is a framework that helps you provision jobs on any cloud of your choice. Furthermore, you can choose between on-demand and spot VM instances to meet your requirements.
A cloud service provider is called a backend in dstack, and currently supported backends include Google Cloud Platform (GCP), Amazon Web Services (AWS), Microsoft Azure, Lambda Labs, and TensorDock.
This tutorial does not cover how to set up your own cloud account, a gateway on it, or how to request GPU quotas. These topics will be covered in separate tutorials in the future, and this tutorial will be updated to point to them when they are up. For the sake of simplicity, this tutorial uses dstack Sky, which is dstack's fully managed cloud service.
Let’s create a directory named mistral first:
$ mkdir mistral
$ cd mistral
Then, inside the mistral directory, run the dstack init command.
This tutorial assumes that you have already installed the dstack package via
pip install "dstack[all]"
$ dstack init
OK
If you encounter a No default project, specify project name error, you need to run dstack server first, as below:
$ dstack server &
$ dstack init
OK
Since dstack is just a tool to provision any kind of job on the cloud, we can leverage almost any existing framework to serve an LLM, such as Text Generation Inference (TGI), vLLM, or something else.
For instance, write the YAML file below and save it as serve.dstack.yml in the mistral directory. It describes a job that serves mistralai/Mistral-7B-Instruct-v0.2 via vLLM on a machine with 24GB of GPU memory:
# serve.dstack.yml
type: service
python: "3.11"
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.2
port: 8000
resources:
  gpu: 24GB
commands:
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL_ID --port 8000
model:
  format: openai
  type: chat
  name: mistral-7b-it
Alternatively, the YAML file below describes a job that serves the same model via TGI on the same type of machine:
# serve.dstack.yml
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.2
port: 8000
resources:
  gpu: 24GB
commands:
  - text-generation-launcher --port 8000 --trust-remote-code
model:
  format: tgi
  type: chat
  name: mistral-7b-it
The model field in the YAML exposes the model's endpoint in an OpenAI-API-compatible format. Different LLM serving frameworks expose different API endpoints, so the model field helps expose them in a uniform way.
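To make this concrete: whichever serving framework you pick, the request your client sends once the service is up (we will provision it in a later step) looks the same. Below is a minimal sketch of such a request using the requests library; the gateway domain, token, and exact path are placeholders inferred from the openai example later in this tutorial, so treat them as assumptions rather than exact values.
import requests

# Placeholders: substitute your own gateway domain and dstack access token.
GATEWAY_URL = "https://gateway.<gateway domain>"
DSTACK_TOKEN = "<dstack token>"

# The same OpenAI-style request body works whether the backing job runs vLLM or TGI,
# because the model field maps both to the OpenAI chat completions format.
response = requests.post(
    f"{GATEWAY_URL}/chat/completions",
    headers={"Authorization": f"Bearer {DSTACK_TOKEN}"},
    json={
        "model": "mistral-7b-it",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
)
print(response.json())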
Choose either of the YAML files, then save it under the mistral directory as serve.dstack.yml.
Now, we are ready to provision a VM instance that serves the mistralai/Mistral-7B-Instruct-v0.2 model via an OpenAI-API-compatible endpoint. To do that, we can simply run the dstack run command as below:
$ dstack run . -f serve.dstack.yml
Then, it shows all the available offers to choose from and an interactive prompt to confirm the decision.
⠸ Getting run plan...
Configuration serve.dstack.yml
Project deep-diver-main
User deep-diver
Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
Max price -
Max duration -
Spot policy auto
Retry policy no
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
...
Shown 3 of 193 offers, $5.876 max
Continue? [y/n]: y
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling)
spicy-treefrog-1 provisioning completed (running)
Service is published at ...
At the Continue? [y/n]: prompt, type y to start the provisioning process.
If you haven’t installed the openai package, install it via the pip install openai command.
Now, you can directly interact with the provisioned mistralai/Mistral-7B-Instruct-v0.2 model using the openai package as below:
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.<gateway domain>",
    api_key="<dstack token>"
)

completion = client.chat.completions.create(
    model="mistral-7b-it",
    messages=[
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
    ]
)

print(completion.choices[0].message)
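If you want tokens to arrive incrementally rather than waiting for the full reply, the openai client also supports streaming via the stream=True flag. The sketch below assumes the provisioned endpoint forwards streamed responses (the OpenAI-compatible servers of vLLM and TGI generally support this); the placeholders are the same as in the example above.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.<gateway domain>",
    api_key="<dstack token>"
)

# Request the same chat completion, but stream the reply token by token.
stream = client.chat.completions.create(
    model="mistral-7b-it",
    messages=[
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
    ],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()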
Everything is the same as the usual OpenAI API usage, but just remember the following:
- base_url is the endpoint exposed by dstack. You can find this information on the model menu in the dstack UI.
- api_key is the dstack access token.
- model in client.chat.completions.create is the model name that you configured in the YAML file from step 3.

We have gone through how to serve the Mistral-7B-Instruct-v0.2 model on the cloud with dstack. However, it is straightforward to serve a different LLM with exactly the same steps, since almost every LLM is supported by the most modern LLM serving frameworks such as TGI and vLLM.
In the next step, I am going to write bite-sized tutorials about setting up your own cloud account, configuring a gateway on it, and requesting GPU quotas.
Stay tuned. When they are ready, this page will be updated accordingly as well.