Skip to content

Multi-Modal Support

Jan supports image attachments with both local and cloud AI models. Upload images directly in your chats and get visual understanding, analysis, and creative responses from compatible models.

Local models with image support work immediately without configuration. Popular vision models include the latest Gemma3 and Qwen3 series, which excel at image understanding while running entirely on your device.

Recommended Local Vision Models:

  • Gemma3 4B - Excellent balance of performance and resource usage
  • Qwen3 7B/14B - Superior image analysis capabilities
  • LLaVA models - Specialized for visual question answering

Here’s Gemma3 4B analyzing a meme with some personality:

AI meme for analysis

Load a vision model like Gemma3 4B and attach your image:

Vision model chat setup

Prompt used: “Describe what you see in the image please. Be a bit sarcastic.”

The model delivers contextual analysis with the requested tone:

Vision model response

Cloud providers like OpenAI (GPT-4V), Anthropic (Claude), and Google (Gemini) offer powerful vision capabilities. However, image support must be manually enabled for each model.

Navigate to your model settings and enable vision support:

Claude vision settings

Toggle both Tools and Vision if you want to combine image understanding with web search or other MCP capabilities.

With Claude 3.5 Sonnet configured for vision, upload an image and get creative responses:

Claude vision chat

Prompt used: “Write an AI joke about the image attached please.”

Claude combines image understanding with humor:

Claude vision response

  • Meme analysis and creation
  • Visual jokes and commentary
  • Art critique and style analysis
  • Creative writing from visual prompts
  • Document analysis and OCR
  • Chart and graph interpretation
  • Product identification and comparison
  • Technical diagram explanation
  • Historical photo analysis
  • Scientific image interpretation
  • Visual learning assistance
  • Research documentation
Model TypeImage SupportSetup RequiredPrivacyBest For
Local (Gemma3, Qwen3)AutomaticNoneCompletePrivacy, offline use
GPT-4VManual enableAPI key + toggleCloud processedAdvanced analysis
Claude 3.5 SonnetManual enableAPI key + toggleCloud processedCreative tasks
Gemini Pro VisionManual enableAPI key + toggleCloud processedMulti-language

Jan accepts common image formats:

  • JPEG/JPG - Most compatible
  • PNG - Full transparency support
  • WebP - Modern web format
  • GIF - Static images only
Analyze this circuit diagram and explain how it works. Identify any potential issues or improvements.
Look at this artwork and write a short story inspired by the mood and colors you see.
Help me understand this math problem shown in the image. Walk through the solution step by step.
Review this presentation slide and suggest improvements for clarity and visual impact.
Extract all the text from this document and format it as a clean markdown list.

We’re actively improving multi-modal support:

Automatic Detection: Models will show visual capabilities without manual configuration Batch Processing: Upload multiple images for comparison and analysis Better Indicators: Clear visual cues for vision-enabled models Enhanced Formats: Support for more image types and sizes

Local Models:

  • Ensure sufficient RAM (8GB+ recommended for vision models)
  • Use GPU acceleration for faster image processing
  • Start with smaller models if resources are limited

Cloud Models:

  • Monitor API usage as vision requests typically cost more
  • Resize large images before upload to save bandwidth
  • Combine with tools for enhanced workflows

Local Processing: Images processed by local models never leave your device. Complete privacy for sensitive visual content.

Cloud Processing: Images sent to cloud providers are processed on their servers. Check provider privacy policies for data handling practices.

Multi-modal AI opens new possibilities for visual understanding and creative assistance. Whether you prefer local privacy or cloud capabilities, Jan makes it easy to work with images and text together.