---
title: Vision & Image Understanding
description: >-
  Upload images and have AI agents analyze, describe, extract text, and answer
  questions about visual content.
tags:
  - LobeHub
  - Vision
  - Image Analysis
  - OCR
  - Multimodal
---

# Vision & Image Understanding

LobeHub supports vision capabilities — Agents can see and understand images you share. This multimodal feature enables conversations that go beyond text to include rich visual context.

## What AI Can Do with Images

Vision-enabled models can:

- **Analyze images** — Understand photos, screenshots, diagrams, and documents
- **Read text (OCR)** — Extract text from images, screenshots, handwritten notes, and signs
- **Describe visuals** — Provide detailed descriptions of scenes and objects
- **Answer questions** — Respond to queries about what's in an image
- **Compare images** — Analyze differences between multiple images
- **Recognize patterns** — Identify layouts, design styles, and trends

## Uploading Images

### Upload Methods

<Tabs>
<Tab title="Drag and Drop">
Drag an image file from your computer into the chat input area. Works with single or multiple images at once. The simplest method for files already on your desktop.
</Tab>

<Tab title="Click to Upload">
Click the attachment/image icon in the input area, browse your files, and select one or more images. Best for selecting files from specific folders.
</Tab>

<Tab title="Paste from Clipboard">
Copy any image (screenshot, copied from a web page, etc.), click in the message input, and press `Ctrl+V` (or `Cmd+V` on Mac). The image appears instantly — ideal for quick screenshot questions. A sketch of how paste capture works in the browser follows after these tabs.
</Tab>
</Tabs>
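
Under the hood, paste upload relies on the browser clipboard API. The sketch below shows one way a chat input could capture a pasted image; the element id and upload helper are hypothetical, not LobeHub's actual implementation:

```ts
// Hypothetical sketch of clipboard image capture; not LobeHub's real code.
declare function uploadImage(file: File): void; // stand-in for the app's upload path

const input = document.querySelector<HTMLTextAreaElement>('#chat-input'); // hypothetical element id

input?.addEventListener('paste', (event: ClipboardEvent) => {
  for (const item of event.clipboardData?.items ?? []) {
    // Pasted screenshots arrive as file items with an image/* MIME type.
    if (item.kind === 'file' && item.type.startsWith('image/')) {
      const file = item.getAsFile();
      if (file) {
        event.preventDefault(); // keep raw bytes out of the text box
        uploadImage(file);
      }
    }
  }
});
```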

### Supported Formats and Limits

Supported formats: JPEG/JPG, PNG, WebP, GIF (static frames only), BMP

- Maximum size: \~20 MB per image
- Recommended: under 5 MB for best performance
- Large images are automatically compressed
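
For self-hosted deployments that enforce these limits in their own UI, a minimal pre-upload check might look like the sketch below. The constants mirror the numbers above; the function and its return shape are hypothetical:

```ts
// Minimal sketch of a client-side check against the documented limits.
// The constants mirror the docs above; the function name is illustrative.
const SUPPORTED_TYPES = ['image/jpeg', 'image/png', 'image/webp', 'image/gif', 'image/bmp'];
const MAX_BYTES = 20 * 1024 * 1024; // ~20 MB hard limit
const RECOMMENDED_BYTES = 5 * 1024 * 1024; // under 5 MB performs best

function checkImage(file: File): { ok: boolean; warning?: string } {
  if (!SUPPORTED_TYPES.includes(file.type)) {
    return { ok: false, warning: `Unsupported format: ${file.type || 'unknown'}` };
  }
  if (file.size > MAX_BYTES) {
    return { ok: false, warning: 'Image exceeds the ~20 MB limit' };
  }
  if (file.size > RECOMMENDED_BYTES) {
    // Still accepted, but large images are compressed and may upload slowly.
    return { ok: true, warning: 'Over 5 MB: expect automatic compression' };
  }
  return { ok: true };
}
```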

<Callout type={'info'}>
The image upload button only appears when you are using a vision-capable model. If you don't see
it, switch to a model that supports vision (see supported models below).
</Callout>

<Callout type={'warning'}>
Vision features consume more tokens than text-only conversations, which may affect API costs for
self-hosted or API-key deployments.
</Callout>
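
To see where the extra tokens come from, consider how an attached image reaches the provider. The sketch below uses OpenAI's chat completions format, where the image travels as a base64 data URL content part alongside your text; it is an illustrative request, not LobeHub's internal code:

```ts
// Illustrative OpenAI-style multimodal request; not LobeHub internals.
// Each attached image becomes an extra content part and is billed as
// vision input tokens on top of the text prompt.
declare const base64Png: string; // assume an already-encoded image

const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: "What's in this image?" },
          { type: 'image_url', image_url: { url: `data:image/png;base64,${base64Png}` } },
        ],
      },
    ],
  }),
});
```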
## Using Vision Features

### Image Analysis

Ask general questions about an image:

```
"What's in this image?"
"Describe what you see in detail"
"What are the main elements of this photo?"
```

### Text Extraction (OCR)

Extract text from images, screenshots, and documents:

```
"What does the text say?"
"Transcribe all text from this image"
"Read the error message in this screenshot"
```

Works with screenshots, photos of signs, printed documents, and code in images. Handwriting recognition is less consistent; accuracy depends on legibility.

### Multiple Images

Upload several images at once and ask for comparison or combined analysis:

```
"Compare these three design variations and suggest which is most effective"
"What are the differences between these before/after photos?"
"Analyze the trends shown in these charts"
```

### Asking Specific Questions

The more specific your question, the better the analysis:

<Tabs>
<Tab title="Object Identification">
- "What type of plant is this?"
- "What brand of laptop is shown?"
- "Identify the components in this circuit board"
</Tab>

<Tab title="Scene Understanding">
- "Where was this photo likely taken?"
- "What time of day does this appear to be?"
- "Describe the setting and atmosphere"
</Tab>

<Tab title="Technical Analysis">
- "What colors are used in this design?"
- "Evaluate the layout and spacing"
- "What font family is being used?"
</Tab>

<Tab title="Content Analysis">
- "What's the main message of this infographic?"
- "Summarize the data shown in this chart"
- "What arguments does this slide present?"
</Tab>
</Tabs>

## Use Cases

<Tabs>
<Tab title="Software Development">
Share screenshots of error messages, UI bugs, stack traces, or whiteboard diagrams. Ask the AI to "fix this error", "review this interface design", or "convert this whiteboard diagram to code".
</Tab>

<Tab title="Education & Learning">
Upload textbook problems, diagrams, scientific images, or handwritten notes. Ask for explanations, summaries, or digital transcriptions.
</Tab>

<Tab title="Content & Design">
Get feedback on logo designs, poster layouts, color schemes, and compositions. Create captions, alt text, and writing prompts from images.
</Tab>

<Tab title="Professional Use">
Extract data from invoices, analyze dashboards and charts, review presentation slides, and digitize business cards and receipts.
</Tab>

<Tab title="Research">
Analyze scientific images, compare visualizations across papers, extract data from published figures, and identify patterns in visual data.
</Tab>

<Tab title="Daily Life">
Identify plants, products, or landmarks. Translate signs and menus. Get cooking or home repair guidance from photos.
</Tab>
</Tabs>

## Best Practices

<AccordionGroup>
<Accordion title="Use Clear, Well-Lit Images">
Blurry or dark images reduce accuracy significantly. Use good lighting and steady focus for best results.
</Accordion>

<Accordion title="Add Context with Text">
Combine images with a specific question or description of what you want to know. "What's wrong with this code?" alongside a screenshot is far more useful than uploading the image alone.
</Accordion>

<Accordion title="Crop to Relevant Areas">
Remove unnecessary parts of images to focus the AI's attention on what matters. This also reduces token usage.
</Accordion>

<Accordion title="Be Specific in Your Questions">
Instead of "What's this?", ask "What type of architectural style is this building?" Specific questions get more useful answers.
</Accordion>

<Accordion title="Verify Critical Information">
Vision AI can and does make mistakes. Always independently verify important details, especially for medical, legal, or financial content.
</Accordion>

<Accordion title="Optimize Image Size">
Keep images under 5 MB for best performance. Very large images are compressed automatically, which may reduce quality. A sketch of what downscaling involves follows below.
</Accordion>
</AccordionGroup>
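
For a sense of what automatic compression involves, the sketch below downscales an image in the browser with a canvas before upload. The helper and its 2048 px cap are hypothetical, not LobeHub's actual pipeline:

```ts
// Hypothetical browser-side downscaling sketch (not LobeHub's built-in logic):
// draw the image onto a smaller canvas and re-encode it as JPEG.
async function downscaleImage(file: File, maxDimension = 2048): Promise<Blob> {
  const bitmap = await createImageBitmap(file);
  const scale = Math.min(1, maxDimension / Math.max(bitmap.width, bitmap.height));

  const canvas = document.createElement('canvas');
  canvas.width = Math.round(bitmap.width * scale);
  canvas.height = Math.round(bitmap.height * scale);
  canvas.getContext('2d')!.drawImage(bitmap, 0, 0, canvas.width, canvas.height);

  return new Promise((resolve, reject) =>
    canvas.toBlob(
      (blob) => (blob ? resolve(blob) : reject(new Error('encoding failed'))),
      'image/jpeg',
      0.85, // quality: trades file size against detail the model can read
    ),
  );
}
```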

## Limitations

<Callout type={'warning'}>
Vision models have limitations. Always verify critical information independently.
</Callout>

- **People and faces** — Cannot identify specific individuals (privacy protection by design)
- **Fine details** — May miss very small text or details in low-resolution images
- **Handwriting** — Variable accuracy depending on legibility
- **Video** — Cannot process video files; only static images are supported
- **Medical/legal** — Not suitable for medical diagnosis or legal advice; treat as informational only
- **Privacy** — Images are processed by the AI provider's servers; avoid uploading sensitive or confidential content without redaction

## Supported Models

Vision requires a vision-capable model. Look for models with a vision indicator in the model selector:

| Provider  | Vision Models                                                      |
| --------- | ------------------------------------------------------------------ |
| OpenAI    | GPT-4V, GPT-4o, GPT-4o mini                                        |
| Anthropic | Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet+ |
| Google    | Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini Pro Vision                |

Other providers may also offer vision models — check the model's capability tags in the selector.

<Cards>
<Card href={'/docs/usage/getting-started/resource'} title={'Resource Library'} />
<Card href={'/docs/usage/getting-started/image-generation'} title={'Image Generation'} />
<Card href={'/docs/usage/providers'} title={'AI Providers'} />
</Cards>