Running Mistral:7b LLM on OpenShift

Feb 05, 2024

10 min read

Running your local LLM model with Ollama on your OpenShift cluster.

Ollama is an open-source tool that simplifies running and managing Large Language Models (LLMs) locally. It provides a simple API for managing model lifecycles and making inference requests.

Mistral 7B is a powerful Large Language Model trained by the French start-up Mistral AI that currently outperforms other openly available models of a similar size. By combining Mistral with Ollama on OpenShift, you can run a production-grade LLM environment within your own infrastructure.

All the files and configurations referenced in this article are available in the openshift-ollama GitHub repository.

Running Mistral:7B on OpenShift

Architecture Overview

Our deployment consists of the following components:

  1. An Ollama server pod with GPU access for model execution

  2. Persistent storage for model weights and cache

  3. Internal and external routes for API access

  4. Optional integration with client applications (Telegram bot, VSCode plugins, iOS app)

Deploying Ollama

First, we create the ollama namespace that will host all of the following resources:

openshift-ollama.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama

We then create the PersistentVolumeClaim and Deployment using the official Ollama container image:

https://hub.docker.com/r/ollama/ollama

openshift-ollama.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /.ollama
          name: ollama-storage
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Always
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-storage

Note the nvidia.com/gpu: 1 resource limit: it schedules the pod onto a node with an available NVIDIA GPU and makes that GPU accessible to the Ollama container.
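Before applying the manifest, you can verify that the cluster actually advertises GPUs as an allocatable resource; this assumes the NVIDIA GPU Operator (or at least the NVIDIA device plugin) is already installed on the worker nodes:

$ oc describe nodes | grep -i nvidia.com/gpu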

Exposing the Ollama API endpoint

We expose the Ollama API endpoint both internally (for other containers like our Telegram Bot) and externally (for services such as VSCode plugins or mobile apps).

Warning

Security note: The external endpoint in this setup is only exposed on a private network accessed via VPN. Since we haven’t configured authentication for this endpoint, exposing it to a public IP would create a significant security risk.

openshift-ollama.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
  - port: 80
    name: http
    targetPort: http
    protocol: TCP

---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ollama
  namespace: ollama
  labels: {}
spec:
  to:
    kind: Service
    name: ollama
  tls: null
  port:
    targetPort: http
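Assuming all of the snippets above are kept in a single openshift-ollama.yaml file, as the file labels suggest, the whole stack is applied with one command and we can watch the pod come up:

$ oc apply -f openshift-ollama.yaml
$ oc get pods -n ollama -w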

Running Mistral LLM on Ollama

Now that our Ollama service is up and running, we can pull the Mistral:7b model using the exposed API endpoint:

$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "mistral:7b"}'

Note

Depending on your network speed, this may take several minutes as the roughly 4 GB of model weights are downloaded and verified.

After the download is complete, we can validate that our model has been successfully loaded:

$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/tags |jq .
{
  "models": [
    {
      "name": "mistral:7b",
      "model": "mistral:7b",
      "modified_at": "2024-02-03T19:44:00.872177836Z",
      "size": 4109865159,
      "digest": "61e88e884507ba5e06c49b40e6226884b2a16e872382c2b44a42f2d119d804a5",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "7B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}

The response contains important information:

  • format: GGUF (GPT-Generated Unified Format) - optimized for inference

  • family: llama - the base architecture family

  • parameter_size: 7B - the model has 7 billion parameters

  • quantization_level: Q4_0 - the precision level of the model weights
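With the model pulled, we can run a quick smoke test against the /api/generate endpoint; the prompt here is only an example:

$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "What is an OpenShift Route?",
  "stream": false
}' | jq -r .response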

Creating Custom Models with Specializations

One of Ollama’s powerful features is the ability to create custom models with specialized behaviors without retraining. Let’s create a model specialized in OpenShift documentation:

$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
  "name": "ocplibrarian",
  "modelfile": "FROM mistral:7b\nSYSTEM You are a Librarian, specialized in retrieving content from the OpenShift documentation."
}'

This creates a new model called “ocplibrarian” that inherits all capabilities from Mistral:7b but has been given a specific persona and purpose through the SYSTEM instruction.
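We can chat with this custom model through the /api/chat endpoint, the same endpoint the Telegram bot below relies on; the question is only illustrative:

$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/chat -d '{
  "model": "ocplibrarian",
  "messages": [{"role": "user", "content": "How do I expose a Service with a Route?"}],
  "stream": false
}' | jq -r .message.content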

Alternative Models: OpenHermes 2.5

Mistral is one of many models you can run on Ollama. Let’s also try OpenHermes 2.5, an instruction-tuned variant of Mistral that offers improved performance on many tasks:

$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "openhermes2.5-mistral:7b-q4_K_M"}'

After pulling the model, we can create a customized version with a specific personality:

$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
  "name": "hermes2",
  "modelfile": "FROM openhermes2.5-mistral:7b-q4_K_M\nSYSTEM You are \"Hermes 2\", a conversational AI assistant developed by Teknium. Your purpose is to assist users with helpful, accurate information and engaging conversation."
}'

The model is now available as “hermes2” and carries the specified system prompt that defines its behavior.
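Listing the model names again confirms everything that is now available to clients, including the two custom models created above:

$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/tags | jq -r '.models[].name'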

Interacting with the LLM

Building a Telegram Bot Interface

For a practical demonstration of the Ollama API, let’s create a simple Telegram bot that serves as an interface to our LLM. We’ll call it Tellama.

The core functionality retrieves user messages from Telegram, forwards them to the Ollama API, and returns the LLM’s responses:

tellama.py
import os

import requests
from telegram import Update
from telegram.ext import ContextTypes

# The Ollama endpoint is provided via the OLLAMA_ENDPOINT environment
# variable (set from the ollama-secret Secret shown below).
OLLAMA_ENDPOINT = os.environ["OLLAMA_ENDPOINT"]


async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Forward the incoming Telegram message to the Ollama chat API.
    data = {
        "model": "mistral:7b",
        "messages": [
            {
                "role": "user",
                "content": update.message.text
            }
        ],
        "stream": False
    }
    response = requests.post(OLLAMA_ENDPOINT, json=data)

    if response.status_code == 200:
        # The non-streaming response carries the answer in message.content.
        ollama_response = response.json().get('message').get('content')
        await update.message.reply_text(ollama_response)
    else:
        await update.message.reply_text('Sorry, there was an error processing your request.')

This implementation is intentionally simple and has some limitations:

  • No conversation history or context is maintained between messages

  • No support for streaming responses

  • Limited error handling

These limitations could be addressed in a more robust implementation, but this serves as a good starting point for testing the API.

Deploying the Telegram Bot on OpenShift

Now let’s containerize our Python script using a UBI (Universal Base Image) container and deploy it to our OpenShift cluster:

# Build the container image
cd tellama
podman build .

# Get the image ID and OpenShift registry URL
export image_id=$(echo $(podman images --format json |jq .[0].Id) | cut -c 2-13)
export ocp_registry=$(oc get route default-route -n openshift-image-registry -ojsonpath='{.spec.host}')

# Push to the OpenShift internal registry
podman login -u <user> -p $(oc whoami -t) ${ocp_registry}
podman push $image_id docker://${ocp_registry}/ollama/tellama:latest

After creating a new Telegram bot by messaging @BotFather, we create a Secret containing the Telegram token and the internal Ollama endpoint.

ollama-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ollama-secret
  namespace: ollama
stringData:
  TELEGRAM_TOKEN: <your_telegram_token>
  OLLAMA_ENDPOINT: "http://ollama/api/chat"

We can now deploy the image within the OpenShift cluster, replacing the registry hostname in image: with the value of $ocp_registry:

tellama.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tellama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: tellama
  template:
    metadata:
      labels:
        name: tellama
    spec:
      containers:
      - name: tellama
        image: default-route-openshift-image-registry.apps.da2.epheo.eu/ollama/tellama:latest
        imagePullPolicy: Always
        resources: {}
        envFrom:
        - secretRef:
            name: ollama-secret
      restartPolicy: Always
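Once the Secret and Deployment are applied, the bot should start polling Telegram; its pod logs are the quickest way to confirm it came up and can reach the Ollama Service:

$ oc apply -f ollama-secret.yaml -f tellama.yaml
$ oc logs -n ollama deployment/tellama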

Integrating with Development Tools

Developer Experience with VSCode Extensions

With our Ollama service running, we can integrate it with various development tools. Here are some VSCode extensions that work well with self-hosted LLMs:

Continue: A VSCode extension similar to GitHub Copilot that can leverage your self-hosted LLM for code completion, explanation, and generation.

https://github.com/continuedev/continue/raw/main/media/readme.gif

You can install Continue directly from the VSCode extension marketplace. After installation, you’ll need to configure it to use your self-hosted Ollama service:

  1. Create or edit the ~/.continue/config.json configuration file

  2. Add Ollama as your LLM provider with the appropriate endpoint

Note

Replace the example endpoint with your actual Ollama route: oc get route ollama -n ollama -ojsonpath='{.spec.host}'

{
  "models": [
    {
      "title": "Mistral",
      "provider": "ollama",
      "model": "mistral:7b",
      "apiBase": "http://ollama-ollama.apps.da2.epheo.eu"
    }
  ]
}

commitollama is a specialized VSCode extension that generates meaningful Git commit messages using your self-hosted LLM.

Configuration is straightforward through your VSCode settings:

{
  "commitollama.custom.endpoint": "http://ollama-ollama.apps.da2.epheo.eu",
  "commitollama.model": "custom",
  "commitollama.custom.model": "mistral:7b"
}

With this configuration, you can simply stage your changes and let commitollama analyze the diff to generate descriptive commit messages.

https://raw.githubusercontent.com/jepricreations/commitollama/main/commitollama-demo.gif

Mobile Integration: Enchanted for iOS

For on-the-go access to your self-hosted LLM, you can use Enchanted, an open-source iOS application specifically designed for Ollama compatibility.

Enchanted offers a ChatGPT-like interface for interacting with your self-hosted models, providing a clean, intuitive mobile experience with features like:

  • Conversation history

  • Different model selection

  • Conversation context management

  • Share and export options

Note

Since our Ollama endpoint is exposed on a private network, your iOS device can connect either through a local WiFi network or by configuring a VPN connection to your network.

(Screenshots: Enchanted in the App Store and the Enchanted settings screen.)

Performance Considerations

GPU Selection

While any NVIDIA GPU can run these models, performance varies considerably:

  • RTX 3080/3090 or better: Excellent for running multiple 7B models

  • CPU-only: Functional but with much slower inference speeds (5-10x slower)

Model Quantization

Quantization reduces model precision to improve performance at a small cost to accuracy:

  • Q4_0: Fastest inference, lowest VRAM usage (4-bit quantization)

  • Q5_K_M: Good balance between quality and performance

  • Q8_0: Higher quality results but requires more VRAM and slower inference
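You can check which quantization an installed model actually uses through the /api/show endpoint, which returns the same details block we saw in the /api/tags output earlier:

$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/show -d '{"name": "mistral:7b"}' | jq .details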

Memory Requirements

Typical memory requirements for a 7B parameter model:

  • Full precision (FP16): ~14GB VRAM

  • Q8_0 quantization: ~7GB VRAM

  • Q4_0 quantization: ~4GB VRAM

Troubleshooting

For common issues and their solutions, refer to the troubleshooting guide.

Common challenges include:

  1. GPU detection issues

  2. Memory limitations

  3. Network connectivity problems

  4. Model loading failures
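For all of these, the standard OpenShift diagnostics are a good first step; the commands below are generic but cover most of the cases above:

# Pod status, scheduling and image pull problems
$ oc get pods -n ollama
$ oc describe pod -n ollama -l name=ollama

# Server logs (GPU detection and model loading errors show up here)
$ oc logs -n ollama deployment/ollama

# Recent events (failed scheduling, PVC binding issues, OOM kills)
$ oc get events -n ollama --sort-by=.lastTimestamp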

Security Considerations

When deploying LLM services like Ollama, consider these important security aspects:

Authentication

The default Ollama API has no built-in authentication. For production use, consider:

  • Implementing a reverse proxy with authentication

  • Using OpenShift network policies to restrict access (see the sketch after this list)

  • Creating application-specific API keys
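As a minimal sketch of the network-policy option (the policy name is arbitrary and the tellama pod label is taken from the Deployment above; adapt both to your environment), the following only allows the Telegram bot pods and traffic arriving through the OpenShift router to reach Ollama:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-restrict-access
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      name: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    # the Telegram bot pods in the same namespace
    - podSelector:
        matchLabels:
          name: tellama
    # traffic arriving through the OpenShift router (the external Route)
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress
    ports:
    - protocol: TCP
      port: 11434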

Data Privacy

One of the main advantages of self-hosting LLMs is data privacy:

  • All prompts and completions remain within your infrastructure

  • No data is sent to third-party services

  • Sensitive information can be processed safely

However, remember that the model itself may contain biases or potentially generate problematic content. Implement appropriate guardrails for your specific use case.

Resource Isolation

Use OpenShift’s resource management features to ensure your Ollama deployment:

  • Has appropriate resource limits to prevent cluster resource exhaustion (a sketch follows this list)

  • Is isolated from other critical workloads

  • Has priority classes set according to your needs
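As a sketch of the first point, the resources block of the Ollama container from the Deployment above could be extended with CPU and memory requests and limits alongside the GPU; the values here are placeholders to adapt to your hardware:

        resources:
          requests:
            cpu: "2"
            memory: 8Gi
          limits:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: 1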

This approach is particularly valuable for:

  • Organizations with data privacy requirements

  • Development teams building AI-powered applications

  • Edge computing scenarios with limited internet connectivity

  • Research and experimentation with different LLM models

As LLMs continue to evolve, having a local deployment option provides both flexibility and cost control compared to purely cloud-based alternatives.