
Working with AI Models

This guide explains how to use AI models in Libre WebUI. Whether you're new to AI or an experienced user, this guide will help you choose the right models and get the best performance.

Reading Time

~10 minutes - Complete guide from model selection to optimization

🎯 What You Can Do

Libre WebUI supports the core features you'd expect from a modern AI assistant:

💬 Chat & Conversations

  • Have natural conversations with AI models
  • Get streaming responses (words appear as they're generated; see the sketch after this list)
  • Use advanced settings like temperature and creativity controls
  • Create custom system prompts to change the AI's personality
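
Under the hood, Libre WebUI streams these responses from a local Ollama server. As a rough sketch of what streaming looks like at the API level (assuming Ollama on its default port 11434 and the Python requests package; the UI handles all of this for you):

```python
# Minimal streaming chat sketch against a local Ollama server.
import json
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": True,  # tokens arrive as they are generated
    },
    stream=True,
)

# Each line is a JSON object carrying the next chunk of the reply.
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk["message"]["content"], end="", flush=True)
        if chunk.get("done"):
            break
```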

🖼️ Vision & Images

  • Upload images and ask questions about them
  • Analyze charts, diagrams, and photographs
  • Get help with visual tasks like describing scenes or reading text in images

📝 Structured Responses

  • Request responses in specific formats (JSON, lists, etc.)
  • Get organized summaries and analysis
  • Use predefined templates for common tasks

🛠️ Model Management

  • Download and manage AI models locally
  • Switch between different models for different tasks
  • Monitor model performance and memory usage
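
The memory monitoring above maps to Ollama's /api/ps endpoint, the same data `ollama ps` prints in a terminal. A minimal sketch, assuming a local Ollama server and the requests package:

```python
# List currently loaded models and their memory footprint.
import requests

models = requests.get("http://localhost:11434/api/ps").json()["models"]
for m in models:
    print(f'{m["name"]}: ~{m["size"] / 1e9:.1f} GB loaded')
```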

🧠 AI Models Guide

Recommended starter models for systems with 8GB of VRAM or 16GB of system RAM:

| Model | Download Size | VRAM (Q4_K_M) | Speed | Best For |
|---|---|---|---|---|
| llama3.2:3b | 2GB | ~2GB | 60+ tok/s | Fast general use |
| gemma2:2b | 1.6GB | ~1.5GB | 70+ tok/s | Quick responses |
| phi3:3.8b | 2.3GB | ~2.5GB | 50+ tok/s | Reasoning tasks |
| llama3.1:8b | 4.7GB | ~5GB | 40+ tok/s | Recommended |
| qwen2.5:7b | 4.7GB | ~5GB | 40+ tok/s | Multilingual |

Best Starting Point

llama3.1:8b offers the best balance of quality and speed for most users with 8GB+ VRAM. Use Q4_K_M quantization.

Understanding Model Sizes and Quantization

Model sizes refer to the number of parameters. Quantization compresses these parameters to reduce memory usage:

| Parameters | Q4_K_M Size | Q8_0 Size | FP16 Size | Use Case |
|---|---|---|---|---|
| 1-3B | 1-2GB | 2-4GB | 2-6GB | Fast tasks, mobile |
| 7-8B | 4-5GB | 7-8GB | 14-16GB | General use |
| 13-14B | 8-9GB | 13-14GB | 26-28GB | Power users |
| 30-34B | 18-20GB | 30-34GB | 60-68GB | High-end |
| 70B | 40-42GB | 70GB | 140GB | Professional |

Quantization Recommendation

Q4_K_M is the recommended quantization for most users. It reduces memory by ~75% with minimal quality loss. Use Q8_0 when you have extra VRAM for better quality.
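
The table follows from simple arithmetic: FP16 stores each parameter in 2 bytes, Q8_0 in roughly 1 byte, and Q4_K_M in roughly 4.5-5 bits (K-quants mix precisions). A back-of-the-envelope estimator, for illustration only since real model files add overhead for metadata and embeddings:

```python
# Rough model-size estimator based on bits per parameter.
# Illustrative only: real GGUF files include extra overhead.
BITS_PER_PARAM = {"FP16": 16, "Q8_0": 8.5, "Q4_K_M": 4.8}

def estimate_gb(params_billion: float, quant: str) -> float:
    bits = params_billion * 1e9 * BITS_PER_PARAM[quant]
    return bits / 8 / 1e9  # bits -> bytes -> GB

for quant in ("Q4_K_M", "Q8_0", "FP16"):
    print(f"8B at {quant}: ~{estimate_gb(8, quant):.1f} GB")
# 8B at Q4_K_M: ~4.8 GB  (table: 4-5GB)
# 8B at Q8_0:   ~8.5 GB  (table: 7-8GB)
# 8B at FP16:  ~16.0 GB  (table: 14-16GB)
```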

🚀 Getting Started with Models

Step 1: Download Your First Model

  1. Go to the Models section in the sidebar
  2. Click "Pull Model"
  3. Enter a model name like gemma3:4b
  4. Wait for the download to complete
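
The UI drives the same pull endpoint the Ollama CLI uses, so downloads can also be scripted. A minimal sketch, assuming a recent local Ollama server and the requests package:

```python
# Pull a model through the local Ollama API; progress streams back
# as JSON lines. Equivalent to `ollama pull gemma3:4b` in a terminal.
import json
import requests

with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "gemma3:4b"},
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            status = json.loads(line)
            # Typical fields: "status", plus "total"/"completed" mid-download
            print(status.get("status"))
```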

Step 2: Start Chatting

  1. Go back to the Chat section
  2. You'll see your model is now available
  3. Type a message and press Enter
  4. Watch the AI respond in real-time!

Step 3: Try Advanced Features

  • Upload an image (with vision models like qwen2.5vl:32b)
  • Adjust settings like creativity and response length
  • Create custom prompts to change the AI's behavior

🎨 Creative Use Cases

Writing Assistant

"Help me write a professional email to..."
"Proofread this document and suggest improvements"
"Create a story outline about..."

Learning & Research

"Explain quantum physics in simple terms"
"What are the pros and cons of..."
"Help me understand this concept by giving examples"

Programming Helper (with devstral:24b)

"Create a complete web application with authentication"
"Debug this complex codebase and suggest improvements"
"Build an autonomous coding agent for this project"

Image Analysis (with qwen2.5vl:32b)

"What's in this image and what does it mean?"
"Extract all text from this document accurately"
"Analyze this complex chart and provide insights"

Advanced Reasoning (with deepseek-r1:32b)

"Think through this complex problem step by step"
"What are the hidden implications of this decision?"
"Solve this multi-step logical puzzle"

⚙️ Advanced Features

Custom System Prompts

Change how the AI behaves by setting a system prompt:

"You are a helpful programming tutor. Always explain concepts step by step."
"You are a creative writing assistant. Help me brainstorm ideas."
"You are a professional editor. Focus on clarity and grammar."

Structured Outputs

Ask for responses in specific formats:

"List the pros and cons in JSON format"
"Give me a summary with bullet points"
"Create a table comparing these options"

Temperature & Creativity

  • Low temperature (0.1-0.3): Focused, consistent responses
  • Medium temperature (0.5-0.7): Balanced creativity and coherence
  • High temperature (0.8-1.0): More creative and varied responses
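
In API terms, this is the temperature entry in the request's options object. A minimal sketch of a low-temperature (focused) request:

```python
# Temperature lives in "options"; 0.2 keeps the model focused and
# repeatable, while 0.9 makes it more varied and creative.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Summarize photosynthesis."}],
        "options": {"temperature": 0.2},
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```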

Hardware Requirements Quick Reference

| Your System | Recommended Models | Expected Speed |
|---|---|---|
| 8GB VRAM (RTX 3060, RTX 4060) | 7-8B Q4 | 40-60 tok/s |
| 12GB VRAM (RTX 3060 12GB, RTX 4070) | 8B Q8 or 13B Q4 | 35-50 tok/s |
| 16GB VRAM (RTX 4070 Ti, RX 7800 XT) | 13-14B Q4 | 30-45 tok/s |
| 24GB VRAM (RTX 4090, RX 7900 XTX) | 30B Q4 | 25-40 tok/s |
| 48GB+ VRAM (Dual GPU, Mac M3 Max) | 70B Q4 | 15-30 tok/s |
| CPU Only (16GB RAM) | 7B Q4 | 5-15 tok/s |

Need More Details?

See the complete Hardware Requirements Guide for GPU recommendations, Apple Silicon performance, and optimization tips.

💡 Tips for Better Results

Writing Better Prompts

  • Be specific: "Write a 200-word summary" vs "Summarize this"
  • Give context: "I'm a beginner" or "I'm an expert in..."
  • Ask for examples: "Show me examples of..."
  • Specify format: "Give me a numbered list" or "Explain step by step"

Managing Performance

  • Use smaller models for simple tasks to save memory
  • Switch models based on your current task
  • Monitor memory usage in the Models section
  • Keep frequently used models loaded for faster responses
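
That last point maps to Ollama's keep_alive parameter, which sets how long a model stays in memory after a request. A minimal sketch:

```python
# keep_alive controls how long the model stays loaded after the
# request: a duration like "30m", 0 to unload immediately, or -1
# to keep it resident until Ollama stops.
import requests

requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "keep_alive": "30m",  # stay loaded for 30 minutes
        "stream": False,
    },
)
```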

Privacy & Security

✅ Your data never leaves your computer
✅ No internet connection required (after downloading models)
✅ Full control over your conversations
✅ No tracking or data collection

Troubleshooting

Model won't download?

  • Check your internet connection
  • Ensure you have enough disk space (models can be 2-50GB)
  • Try: ollama pull llama3.1:8b from terminal to see detailed errors

Responses are slow?

  • Check if model fits in VRAM: ollama ps shows memory usage
  • Try a smaller model or lower quantization (Q4 instead of Q8)
  • Close other GPU-intensive applications
  • If using CPU only, expect 5-15 tokens/sec

Out of memory errors?

  • Use Q4_K_M quantization instead of Q8 or FP16
  • Try a smaller model size (8B instead of 14B)
  • Check VRAM usage: nvidia-smi (NVIDIA) or Activity Monitor (Mac)
  • Reduce context length in settings
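
The context window has its own memory cost (the KV cache), so shrinking it can rescue a model that almost fits. If the settings UI doesn't go low enough, the num_ctx option can also be set per request; a minimal sketch:

```python
# A smaller context window (num_ctx) reduces the KV-cache memory a
# loaded model needs, at the cost of shorter effective memory.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Hi!"}],
        "options": {"num_ctx": 2048},  # smaller than the usual default
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```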

AI gives strange or repetitive responses?

  • Lower the temperature setting (try 0.5-0.7)
  • Try a different model for your task
  • Clear the conversation and start fresh
  • Check if you're using the right model type (vision model for images, etc.)

Ready to get started? Head to the Quick Start Guide to install Libre WebUI, or check Hardware Requirements to optimize your setup.