# Running Two LLM Containers Simultaneously

This section describes how to run two language models at the same time on one multi-GPU server, each in its own container.

### When This Is Needed

Running two LLM containers is useful when:

* You have multiple GPUs and want to use them for different models
* You need to run different models at the same time (for example, one for chat, another for specialized tasks)
* You need to distribute the load between multiple models

### Requirements

* A server with at least 2 NVIDIA GPUs
* Each GPU must have enough memory for the chosen model
* Docker and Docker Compose installed
* NVIDIA Container Toolkit installed

### Setup

#### Step 1: Check Available GPUs

Make sure you have at least 2 GPUs:

```bash
nvidia-smi
```

**Expected Result:** You should see at least 2 GPUs in the list.
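
For scripted checks, the GPU count can also be verified automatically. The following is a minimal sketch, assuming `nvidia-smi -L` prints one line per GPU beginning with `GPU <index>:`. The function name `has_enough_gpus` is hypothetical and not part of the product.

```bash
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and
# succeeds only if it lists at least the requested number of GPUs.
has_enough_gpus() {
  local min="${1:-2}"
  local count
  # grep -c prints the match count; it exits non-zero on zero matches,
  # so `|| true` keeps the function going in that case.
  count=$(grep -c '^GPU [0-9]' || true)
  echo "Detected ${count} GPU(s), need at least ${min}"
  [ "$count" -ge "$min" ]
}

# Usage on the server:
#   nvidia-smi -L | has_enough_gpus 2 && echo "OK to continue"
```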

#### Step 2: Uncomment the Second LLM Container

Open the `docker-compose.yml` file and find the commented block `aiserver-llm-server2` (around lines 103-142).

Uncomment the entire block by removing the leading `#` from each line, keeping the indentation intact so the YAML stays valid:

```yaml
# Was:
# aiserver-llm-server2:
#   container_name: aiserver-llm-server2
#   image: aiserver-llm-server:latest
#   ...

# Now:
aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  ...
```
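
If you prefer to preview the uncommented block before editing the file by hand, a small filter can help. This is only a sketch, assuming every line of the block starts with a single `#` followed by at most one space. The function name `uncomment_lines` is hypothetical, and the line range must be adjusted to your copy of the file.

```bash
# Hypothetical helper: strips one leading "#" (and one following space,
# if present) from every line on stdin, preserving any indentation
# that precedes the "#".
uncomment_lines() {
  sed -E 's/^([[:space:]]*)# ?/\1/'
}

# Preview the result for the approximate range of the commented block
# (adjust 103,142 to the actual line numbers in your file):
#   sed -n '103,142p' docker-compose.yml | uncomment_lines
```

Note that this filter also uncomments any genuine comments inside the block, so review the output before applying it.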

#### Step 3: Configure Ports

Make sure the host ports (the left side of each `host:container` mapping) do not conflict:

* **aiserver-llm-server** (first container): mapping `3003:8000`
* **aiserver-llm-server2** (second container): the host port must be different, for example `3006:8000` or `3007:8000`

In the uncommented block, check the line:

```yaml
ports:
  - 3006:8000  # or another free port
```
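
Duplicate host ports can also be spotted mechanically. The sketch below assumes that all mappings use the short `HOST:CONTAINER` form shown above. It ignores other forms, such as mappings with an IP prefix. The function name `duplicate_host_ports` is hypothetical.

```bash
# Hypothetical helper: reads Compose YAML on stdin and prints every host
# port that appears more than once in short-form "HOST:CONTAINER" mappings.
# Commented-out lines are skipped because they do not start with "-".
duplicate_host_ports() {
  sed -nE 's/^[[:space:]]*-[[:space:]]*"?([0-9]+):[0-9]+.*/\1/p' \
    | sort \
    | uniq -d
}

# Usage (no output means no duplicates were found):
#   duplicate_host_ports < docker-compose.yml
```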

#### Step 4: Configure GPU for Each Container

Assign a specific GPU to each container so that the two models do not compete for the same device.

**For the First Container (aiserver-llm-server)**

The first container typically uses GPU 0 by default. Check the environment variables in the `.env` file or in `docker-compose.yml` itself:

```yaml
environment:
  LLM_CUDA_VISIBLE_DEVICES: 0  # or do not specify, then GPU 0 will be used
```

**For the Second Container (aiserver-llm-server2)**

In the uncommented block, find the line:

```yaml
environment:
  LLM_CUDA_VISIBLE_DEVICES: 1  # Uses GPU 1
```

Make sure the value matches the index of the second GPU as reported by `nvidia-smi` (usually `1`).
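
Once the containers have been created, the value each one actually received can be read back from Docker. This is a minimal sketch; the function name `gpu_setting` is hypothetical, and it only reports the configured variable, not which GPU the server process ends up using.

```bash
# Hypothetical helper: reads "KEY=value" lines on stdin and prints the
# value of LLM_CUDA_VISIBLE_DEVICES, if it is set.
gpu_setting() {
  sed -n 's/^LLM_CUDA_VISIBLE_DEVICES=//p'
}

# Usage once both containers exist:
#   for c in aiserver-llm-server aiserver-llm-server2; do
#     printf '%s -> GPU ' "$c"
#     docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' "$c" | gpu_setting
#   done
```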

#### Step 5: Configure Models

Ensure that each model is configured correctly:

**First Container (aiserver-llm-server)**

The first container takes its settings from the `.env` file or falls back to default values. Check the variable:

```bash
LLM_COMPLETION_MODEL_NAME=/model-store/model-name-1
```

**Second Container (aiserver-llm-server2)**

In the uncommented block, find the line:

```yaml
environment:
  LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
```

Change it to the required model if a different one is needed.
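
Before starting the container, it can be worth confirming that the model directory is present on the host. The sketch below assumes the `./llm-server/models:/model-store` volume mapping used in this guide's example configuration. The function name `host_model_path` is hypothetical.

```bash
# Hypothetical helper: maps a container path under /model-store to the
# corresponding host path, assuming the ./llm-server/models:/model-store
# volume mapping. An optional second argument overrides the host root.
host_model_path() {
  local container_path="$1"
  local host_root="${2:-./llm-server/models}"
  printf '%s/%s\n' "$host_root" "${container_path#/model-store/}"
}

# Usage: list the model directory on the host (fails if it is missing).
#   ls -d "$(host_model_path /model-store/Qwen3-30B-A3B-AWQ)"
```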

#### Step 6: Check Configuration

Before running, check the configuration:

```bash
# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config

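# Check that the ports are not occupied (no output means both are free;
# 3003 is listed if the first container is currently running)
netstat -tuln | grep -E ':(3003|3006)[[:space:]]'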
```
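
A quick way to confirm that the block was uncommented correctly is to check that Compose now lists the second service. This is a sketch only; the function name `service_defined` is hypothetical, while `docker compose config --services` is the standard command that prints one service name per line.

```bash
# Hypothetical helper: reads service names on stdin (one per line) and
# succeeds only if the given name matches a whole line exactly.
service_defined() {
  grep -qxF -- "$1"
}

# Usage:
#   docker compose -f docker-compose.yml config --services \
#     | service_defined aiserver-llm-server2 && echo "second LLM service is defined"
```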

#### Step 7: Start Containers

```bash
# Stop current containers (if running)
docker compose -f docker-compose.yml down

# Start all containers including the second LLM server
docker compose -f docker-compose.yml up -d

# Check that both containers are running
docker compose -f docker-compose.yml ps | grep llm-server
```

**Expected Result:** You should see two containers:

* `aiserver-llm-server` (port 3003)
* `aiserver-llm-server2` (port 3006)
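
The same check can be scripted. The sketch below reports which of the expected containers are not running, assuming the container names used in this guide. The function name `missing_containers` is hypothetical.

```bash
# Hypothetical helper: reads the names of running containers on stdin
# (one per line) and prints each expected name (given as arguments)
# that is not among them.
missing_containers() {
  local running name
  running=$(cat)
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qxF -- "$name" || echo "$name"
  done
}

# Usage (no output means both containers are running):
#   docker ps --format '{{.Names}}' \
#     | missing_containers aiserver-llm-server aiserver-llm-server2
```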

#### Step 8: Check Operation

```bash
# Check logs of the first container
docker logs aiserver-llm-server

# Check logs of the second container
docker logs aiserver-llm-server2

# Check GPU usage
nvidia-smi
```

**Expected Result:**

* Both containers should start successfully
* In `nvidia-smi`, processes should be visible on different GPUs
* Logs should not contain critical errors
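
To check the GPU distribution without reading the full `nvidia-smi` table, you can count how many distinct GPUs currently host compute processes. This is a minimal sketch; the function name `busy_gpu_count` is hypothetical, and unrelated GPU workloads on the server are counted as well.

```bash
# Hypothetical helper: reads CSV lines whose first field is a GPU UUID
# and prints the number of distinct GPUs that appear.
busy_gpu_count() {
  # grep -c exits non-zero when the count is 0, hence `|| true`.
  cut -d, -f1 | sort -u | grep -c . || true
}

# Usage (a result of 2 or more suggests the models are on different GPUs):
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader \
#     | busy_gpu_count
```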

### Setting Environment Variables

If you need to change the settings for the second container, edit the `environment` block in `docker-compose.yml`:

```yaml
aiserver-llm-server2:
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/your-model"
    LLM_CUDA_VISIBLE_DEVICES: 1  # GPU number (0, 1, 2, etc.)
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
    # ... other settings
```

### Possible Issues

#### Container Does Not Start

**Problem:** The second container does not start or crashes with an error.

**Solution:**

1. Check the logs: `docker logs aiserver-llm-server2`
2. Ensure the assigned GPU is available: `nvidia-smi`
3. Check that the port is free: `netstat -tuln | grep 3006`
4. Check that the model exists on the host: `ls -la llm-server/models/`

#### Port Conflict

**Problem:** Startup fails with the error `port is already allocated`.

**Solution:**

* Change the port of the second container to a free one (for example, `3007:8000`)
* Or stop the service occupying the port

#### Insufficient GPU Memory

**Problem:** The model fails to load and the logs show out-of-memory errors.

**Solution:**

* Decrease `LLM_GPU_MEMORY_UTILIZATION` (for example, to `0.7`)
* Use smaller models
* Free up GPU memory by stopping other processes
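
To see how much memory is actually free on each GPU, the `nvidia-smi` query output can be summarized. This is a sketch, assuming the CSV format produced by `--format=csv,noheader,nounits`, where values are in MiB. The function name `gpu_free_mib` is hypothetical.

```bash
# Hypothetical helper: reads "index, used, total" CSV lines (MiB, no units)
# on stdin and prints the free memory per GPU.
gpu_free_mib() {
  awk -F', *' '{ printf "GPU %s: %d MiB free of %d MiB\n", $1, $3 - $2, $3 }'
}

# Usage:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_free_mib
```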

#### Both Containers Use One GPU

**Problem:** Both containers use GPU 0 instead of different GPUs.

**Solution:**

* Ensure that `LLM_CUDA_VISIBLE_DEVICES` is set correctly for each container
* Check that the variable is not overridden in the `.env` file
* Restart the containers after changing settings

### Example Full Configuration

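Example setup of two containers in `docker-compose.yml` (both entries belong under the top-level `services:` key, and the `llm-net` network must be defined in the same file):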

```yaml
aiserver-llm-server:
  container_name: aiserver-llm-server
  image: aiserver-llm-server:latest
  restart: always
  env_file:
    - .env
  ports:
    - 3003:8000
  environment:
    LLM_CUDA_VISIBLE_DEVICES: 0  # GPU 0
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net

aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  restart: always
  ports:
    - 3006:8000
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
    LLM_CUDA_VISIBLE_DEVICES: 1  # GPU 1
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net
```

### Additional Settings

#### Using Different Models

You can run different models in each container:

```yaml
# First container - chat model
LLM_COMPLETION_MODEL_NAME: "/model-store/Llama-3-8B"

# Second container - model for specialized tasks
LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
```

#### Memory Configuration

If you have GPUs with different memory sizes, configure memory usage for each container:

```yaml
# For GPU with less memory
LLM_GPU_MEMORY_UTILIZATION: "0.7"

# For GPU with more memory
LLM_GPU_MEMORY_UTILIZATION: "0.9"
```
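
To get a feel for what a given value means in practice, you can estimate the memory budget per GPU. This is a rough sketch, assuming `LLM_GPU_MEMORY_UTILIZATION` is the fraction of the GPU's total memory that the server may use, which this page does not state explicitly. The function name `memory_budget_mib` is hypothetical.

```bash
# Hypothetical helper: prints total_mib * fraction, truncated to whole MiB.
# Assumes the utilization setting is a fraction of total GPU memory.
memory_budget_mib() {
  awk -v total="$1" -v frac="$2" 'BEGIN { printf "%d\n", total * frac }'
}

# Usage: budget for a 24576 MiB GPU at 0.85 utilization.
#   memory_budget_mib 24576 0.85   # prints 20889
```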

### Monitoring

To monitor the operation of both containers:

```bash
# Status of containers
docker compose -f docker-compose.yml ps

# Resource usage
docker stats aiserver-llm-server aiserver-llm-server2

# GPU usage
watch -n 1 nvidia-smi
```

**Expected Result:** Both containers should operate stably, using different GPUs.

