Skip to main content

One post tagged with "ai"

View All Tags

Setting Up Ollama and Running DeepSpeed on Linux

· 3 min read
Pourya Bagheri
Quantum Computing | Blockchain Soloution | MERN

Introduction

Ollama is a powerful tool for running large language models efficiently on local hardware. When combined with DeepSpeed, a deep learning optimization library, it enables even more efficient execution, particularly for fine-tuning and inference. In this guide, we will walk through the setup process for both Ollama and DeepSpeed on a Linux system.

Prerequisites

Before proceeding, ensure that your system meets the following requirements:

  • A Linux-based operating system (Ubuntu 20.04 or later recommended)

  • A modern NVIDIA GPU with CUDA support

  • Python 3.8 or later

  • Pip and Virtualenv installed

  • Sufficient storage and RAM for model execution

Step 1: Installing Ollama

Ollama provides an easy-to-use interface for managing large language models. To install it on Linux, follow these steps:

1. Open a terminal and update your system:

sudo apt update && sudo apt upgrade -y

2. Download and install Ollama:

curl -fsSL https://ollama.ai/install.sh | sh

3. Verify the installation:

ollama --version

alt text

If the installation was successful, you should see the version number displayed.

Step 2: Setting Up DeepSpeed

DeepSpeed optimizes deep learning models for better performance and scalability. To install and configure it:

1. Create and activate a Python virtual environment:

python3 -m venv deepspeed_env
source deepspeed_env/bin/activate

2. Install DeepSpeed and required dependencies:

pip install deepspeed torch transformers

3. Verify the installation:

deepspeed --version

Step 3: Running a Model with Ollama and DeepSpeed

Now that we have both tools installed, we can load a model and test it.

1. Pull a model with Ollama:

ollama pull mistral

This downloads the Mistral model, which we will use for testing.

2. Run inference with Ollama:

ollama run mistral "Hello, how are you?"

If successful, the model should generate a response.

3. Use DeepSpeed to optimize inference (example using a Hugging Face model):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import deepspeed

model_name = "meta-llama/Llama-2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

ds_model = deepspeed.init_inference(model, dtype=torch.float16, replace_with_kernel_inject=True)

prompt = "What is DeepSpeed?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = ds_model.generate(**inputs)
print(tokenizer.decode(outputs[0]))

Example : Deepseek-r1 1.5b

alt text

Conclusion

By installing Ollama and DeepSpeed on Linux, you can efficiently run and optimize large language models. This setup enables users to leverage local hardware for AI model execution, reducing dependency on cloud services. If further fine-tuning or model adaptation is required, both tools provide advanced functionalities to enhance performance.