# Running the LLM Server on a Separate Server

This section describes the setup for running only the LLM server on a separate server, without the other components of Sherpa AI Server.

### When This Is Needed

Running the LLM server on a separate server is useful when:

* You need to distribute the load between servers
* The LLM server requires powerful GPUs and is better off on a separate server
* You need to scale out by running multiple LLM servers behind a load balancer
* You need to isolate the LLM server from the main application

### Requirements

* A server with an NVIDIA GPU (CUDA 11.8+)
* Docker and Docker Compose installed
* NVIDIA Container Toolkit installed
* LLM models downloaded to the `llm-server/models/` directory

### Setup

#### Step 1: Prepare the Server

Make sure all necessary components are installed on the server:

```bash
# Check GPU
nvidia-smi

# Check Docker
docker --version
docker compose version
```
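The same checks can be gathered into a small preflight helper. This is a minimal sketch, not part of Sherpa AI Server: the function names `check_cmd` and `preflight` are hypothetical, and `nvidia-ctk` is assumed to be the command-line tool installed with the NVIDIA Container Toolkit.

```bash
# Report whether a single command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check every tool this guide relies on; returns non-zero if any is missing.
# Assumption: nvidia-ctk ships with the NVIDIA Container Toolkit.
preflight() {
  rc=0
  for cmd in nvidia-smi docker nvidia-ctk; do
    check_cmd "$cmd" || rc=1
  done
  return "$rc"
}

# Usage:
#   preflight && echo "all prerequisites found"
```

A missing command only shows that the tool is not on `PATH`. It does not prove that the GPU is usable inside containers, so the checks in the later steps are still needed.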

#### Step 2: Prepare Files

Copy the following files and directories to the server:

```bash
# Required files:
# - docker-compose.yml (or docker-compose.main.yml)
# - .env file with settings
# - llm-server/models/ - directory with models
# - llm-server/templates/ - directory with templates (if used)
```
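After copying, it may be worth confirming that nothing was left behind. The sketch below lists required paths that are missing from a deployment directory. The helper name `missing_paths` is hypothetical, and the path list simply mirrors the files named above (the templates directory is treated as optional).

```bash
# Print each required path that is missing under the given directory.
# $1 = deployment directory (defaults to the current directory).
missing_paths() {
  dir="${1:-.}"
  for path in docker-compose.yml .env llm-server/models; do
    [ -e "$dir/$path" ] || echo "$path"
  done
}

# Usage: empty output means every required path is present.
#   missing_paths /path/to/deployment
```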

#### Step 3: Comment Out Unnecessary Services

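Open the `docker-compose.yml` file and comment out all services except `aiserver-llm-server`. Leave the top-level `networks` section in place, because the service still references the `llm-net` network.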

**Example: Commented Services**

```yaml
services:

  # aiserver-pg:
  #   container_name: aiserver-pg
  #   image: aiserver-pg:latest
  #   ...

  # aiserver-embed:
  #   container_name: aiserver-embed
  #   ...

  # aiserver:
  #   container_name: aiserver
  #   ...

  # The only active service:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - "3003:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net

  # aiserver-code_interpreter:
  #   ...

  # aiserver-whisper:
  #   ...

  # aiserver-bge_reranker:
  #   ...
```

#### Step 5: Configure Environment Variables

Create or edit the `.env` file with LLM server settings:

```bash
# LLM server settings
LLM_CUDA_VISIBLE_DEVICES=0
LLM_TENSOR_PARALLEL_SIZE=1
LLM_GPU_MEMORY_UTILIZATION=0.90
LLM_COMPLETION_MODEL_NAME=/model-store/meta-llama/Meta-Llama-3-8B-Instruct
LLM_DTYPE=auto
LLM_TRUST_REMOTE_CODE=false
LLM_QUANTIZATION=false
LLM_MAX_MODEL_LEN=8192
LLM_HOST=0.0.0.0
LLM_PORT=8000
LLM_MAX_NUM_BATCHED_TOKENS=16384
LLM_MAX_NUM_SEQS=16
LLM_ENABLE_TOOLS=true
LLM_TOOL_CALL_PARSER=llama3_json
LLM_EXCLUDE_TOOLS_WHEN_NONE=true
```

**Important:**

* Ensure that the model path is correct: `LLM_COMPLETION_MODEL_NAME=/model-store/model-name`
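Because `LLM_COMPLETION_MODEL_NAME` is a path inside the container, it can be helpful to translate it back to the host path before starting. The sketch below assumes the volume mapping `./llm-server/models:/model-store` from the compose file. The helper names `env_value` and `model_host_path` are hypothetical.

```bash
# Read one KEY=VALUE setting from an env file (the last assignment wins).
# $1 = variable name, $2 = env file
env_value() {
  grep -E "^$1=" "$2" | tail -n 1 | cut -d= -f2-
}

# Translate a container path under /model-store to the matching host path,
# assuming the volume mapping ./llm-server/models:/model-store.
model_host_path() {
  case "$1" in
    /model-store/*) echo "./llm-server/models/${1#/model-store/}" ;;
    *) echo "not under /model-store: $1" >&2; return 1 ;;
  esac
}

# Usage: list the model directory on the host; an error means the path is wrong.
#   model=$(env_value LLM_COMPLETION_MODEL_NAME .env)
#   ls -d "$(model_host_path "$model")"
```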

#### Step 6: Check Configuration

Before starting, check the configuration:

```bash
# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config

# Check if the port is free (no output means nothing is listening on 3003)
netstat -tuln | grep 3003

# Check for the model
ls -la llm-server/models/
```
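To confirm that only the LLM service is still active after commenting out the others, the list printed by `docker compose config --services` can be checked. This is a small sketch; the helper name `only_llm_service` is hypothetical.

```bash
# Succeeds only if the service list on stdin is exactly aiserver-llm-server.
only_llm_service() {
  services=$(sort -u | tr '\n' ' ')
  [ "$services" = "aiserver-llm-server " ]
}

# Usage: config --services prints the active service names, one per line.
#   docker compose -f docker-compose.yml config --services | only_llm_service \
#     && echo "only the LLM service is active"
```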

#### Step 7: Start the LLM Server

```bash
# Start only the LLM server
docker compose -f docker-compose.yml up -d aiserver-llm-server

# Or start everything (but only uncommented services will run)
docker compose -f docker-compose.yml up -d

# Check the status
docker compose -f docker-compose.yml ps
```

**Expected Result:** Only the `aiserver-llm-server` container should start.

#### Step 8: Check Operation

```bash
# Check the logs
docker logs aiserver-llm-server

# Check GPU usage
nvidia-smi

# Check API availability (should return model information)
curl http://localhost:3003/v1/models
```

**Expected Result:**

* The container should start successfully
* There should be no critical errors in the logs
* The API should respond to requests
* `nvidia-smi` should show GPU memory in use once the model is loaded
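Loading a model can take a while, so the API may not answer immediately after the container starts. The sketch below retries a command until it succeeds. The helper name `wait_for` and the retry counts are illustrative assumptions, not values from Sherpa AI Server.

```bash
# Retry a command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = delay in seconds, remaining args = command to run.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: poll the models endpoint up to 60 times, 5 seconds apart.
#   wait_for 60 5 curl -fsS http://localhost:3003/v1/models \
#     && echo "LLM server is ready"
```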

### Connecting from Another Server

If the LLM server is running on a separate server, configure the connection from the main server.

#### On the Main Server

In the `.env` file of the main server, specify the address of the LLM server:

```bash
# Address of the LLM server (replace with the IP address or domain of your LLM server)
LLM_SERVER_URL=http://192.168.1.100:3003
# or, using a domain name:
# LLM_SERVER_URL=http://llm-server.example.com:3003
```
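Before restarting the main application, the address can be tested from the main server. The sketch below builds the models endpoint from the base URL. The helper name `models_url` is hypothetical, and the example assumes `LLM_SERVER_URL` has been exported in the current shell.

```bash
# Build the models endpoint from a base URL, tolerating one trailing slash.
models_url() {
  echo "${1%/}/v1/models"
}

# Usage on the main server (fails within 5 seconds if the server is unreachable):
#   curl -fsS --max-time 5 "$(models_url "$LLM_SERVER_URL")"
```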

### Minimal Docker-Compose Configuration

Example of a minimal `docker-compose.yml` for the LLM server only:

```yaml
services:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - "3003:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net

networks:
  llm-net:
    name: llm-net
    driver: bridge
```

Save this file as `docker-compose.llm-only.yml` and start it with:

```bash
docker compose -f docker-compose.llm-only.yml up -d
```

### Possible Issues

#### Container Does Not Start

**Problem:** The container crashes immediately after starting.

**Solution:**

1. Check the logs: `docker logs aiserver-llm-server`
2. Ensure the GPU is available: `nvidia-smi`
3. Check that the model exists: `ls -la llm-server/models/`
4. Check permissions for the model directory

#### Model Does Not Load

**Problem:** Errors when loading the model.

**Solution:**

1. Check the model path in `.env`: `LLM_COMPLETION_MODEL_NAME`
2. Ensure the model files are present: `ls -la llm-server/models/`
3. Check logs for loading errors (the `2>&1` also captures messages the container wrote to stderr): `docker logs aiserver-llm-server 2>&1 | grep -i error`

#### Insufficient GPU Memory

**Problem:** The model does not fit in GPU memory.

**Solution:**

* Reduce `LLM_GPU_MEMORY_UTILIZATION` in `.env`
* Use the quantized version of the model (set `LLM_QUANTIZATION=true`)
* Use a smaller model
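Before changing these settings, it can help to see how much GPU memory is already in use, for example by other processes. The sketch below formats the output of an `nvidia-smi` query. The helper name `gpu_mem_percent` is hypothetical.

```bash
# Convert "used, total" pairs in MiB (one GPU per line) to a usage summary.
gpu_mem_percent() {
  awk -F', *' '$2 > 0 {
    printf "GPU %d: %d%% used (%d/%d MiB)\n", NR - 1, $1 * 100 / $2, $1, $2
  }'
}

# Usage:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_mem_percent
```

The GPU index in the output is just the line position in the query result.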

#### Port Not Accessible Externally

**Problem:** Cannot connect to the LLM server from another server.

**Solution:**

1. Check the firewall: `sudo ufw status`
2. Check that the port is published: `docker port aiserver-llm-server`
3. Check Docker network settings

### Monitoring

To monitor the operation of the LLM server:

```bash
# Container status
docker ps | grep llm-server

# Resource usage
docker stats aiserver-llm-server

# GPU usage
watch -n 1 nvidia-smi

# Real-time logs
docker logs -f aiserver-llm-server

# API check
curl http://localhost:3003/health
curl http://localhost:3003/v1/models
```

### Performance Optimization

To optimize the performance of the LLM server:

1. **Configure GPU Memory:**

   ```bash
   # Fraction of GPU memory the server may use (0.90 = 90%)
   LLM_GPU_MEMORY_UTILIZATION=0.90
   ```
2. **Configure Batching:**

   ```bash
   LLM_MAX_NUM_BATCHED_TOKENS=16384
   LLM_MAX_NUM_SEQS=16
   ```
3. **Use Quantization:**

   ```bash
   # For models that support quantization
   LLM_QUANTIZATION=true
   ```

After completing all steps, you should have the LLM server running on a separate server, which can be used from the main server or other applications via the API on port 3003.


