---
title: Vision & Image Understanding
description: >-
  Upload images and have AI agents analyze, describe, extract text, and answer
  questions about visual content.
tags:
- LobeHub
- Vision
- Image Analysis
- OCR
- Multimodal
---
# Vision & Image Understanding
LobeHub supports vision capabilities — Agents can see and understand images you share. This multimodal feature enables conversations that go beyond text to include rich visual context.
## What AI Can Do with Images
Vision-enabled models can:
- **Analyze images** — Understand photos, screenshots, diagrams, and documents
- **Read text (OCR)** — Extract text from images, screenshots, handwritten notes, and signs
- **Describe visuals** — Provide detailed descriptions of scenes and objects
- **Answer questions** — Respond to queries about what's in an image
- **Compare images** — Analyze differences between multiple images
- **Recognize patterns** — Identify layouts, design styles, and trends
## Uploading Images
### Upload Methods
<Tabs>
<Tab title="Drag and Drop">
    Drag an image file from your computer into the chat input area. Works with a single image or several at once. This is the simplest method for files already on your desktop.
</Tab>
<Tab title="Click to Upload">
Click the attachment/image icon in the input area, browse your files, and select one or more images. Best for selecting files from specific folders.
</Tab>
<Tab title="Paste from Clipboard">
Copy any image (screenshot, copied from a web page, etc.), click in the message input, and press `Ctrl+V` (or `Cmd+V` on Mac). The image appears instantly — ideal for quick screenshot questions.
</Tab>
</Tabs>
### Supported Formats and Limits
- Supported formats: JPEG/JPG, PNG, WebP, GIF (static frames only), BMP
- Maximum size: \~20 MB per image
- Recommended: under 5 MB for best performance
- Large images are automatically compressed
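As a rough illustration, a client-side pre-upload check that mirrors these limits might look like the sketch below. This is a hypothetical helper, not LobeHub code; the extension set and size thresholds simply restate the values above.

```python
import os

# Hypothetical pre-upload check mirroring the documented limits
ALLOWED_EXTENSIONS = {".jpeg", ".jpg", ".png", ".webp", ".gif", ".bmp"}
MAX_BYTES = 20 * 1024 * 1024          # hard limit: ~20 MB per image
RECOMMENDED_BYTES = 5 * 1024 * 1024   # soft limit: best performance

def check_image(path: str, size_bytes: int) -> str:
    """Return 'ok', or 'compress' when over the soft limit.

    Raises ValueError for unsupported formats or oversized files.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext}")
    if size_bytes > MAX_BYTES:
        raise ValueError("image exceeds the 20 MB limit")
    return "compress" if size_bytes > RECOMMENDED_BYTES else "ok"
```

In practice LobeHub performs the equivalent checks for you (oversized images are compressed automatically); the sketch only makes the stated limits concrete.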
<Callout type={'info'}>
The image upload button only appears when you are using a vision-capable model. If you don't see
it, switch to a model that supports vision (see supported models below).
</Callout>
<Callout type={'warning'}>
Vision features consume more tokens than text-only conversations, which may affect API costs for
self-hosted or API-key deployments.
</Callout>
## Using Vision Features
### Image Analysis
Ask general questions about an image:
```
"What's in this image?"
"Describe what you see in detail"
"What are the main elements of this photo?"
```
### Text Extraction (OCR)
Extract text from images, screenshots, and documents:
```
"What does the text say?"
"Transcribe all text from this image"
"Read the error message in this screenshot"
```
OCR works on screenshots, photos of signs, printed documents, and code in images. Handwriting recognition is also supported, with accuracy that varies by legibility.
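For self-hosted or API-key deployments, it can help to see how an image question reaches a vision model: the image travels alongside the text prompt as one multimodal message. Below is a minimal sketch that assembles (but does not send) such a request, assuming the widely used OpenAI-compatible content-array shape; the model name, question, and image bytes are placeholders.

```python
import base64

def build_vision_request(model: str, question: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """Assemble an OpenAI-compatible multimodal chat payload (not sent here)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # The text question and the base64-encoded image share one message
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
    }

payload = build_vision_request(
    "gpt-4o", "Transcribe all text from this image", b"\x89PNG..."
)
```

Because the full image is base64-encoded into the request, larger images directly mean larger (and more token-expensive) requests, which is why cropping and compression matter.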
### Multiple Images
Upload several images at once and ask for comparison or combined analysis:
```
"Compare these three design variations and suggest which is most effective"
"What are the differences between these before/after photos?"
"Analyze the trends shown in these charts"
```
### Asking Specific Questions
The more specific your question, the better the analysis:
<Tabs>
<Tab title="Object Identification">
- "What type of plant is this?"
- "What brand of laptop is shown?"
- "Identify the components in this circuit board"
</Tab>
<Tab title="Scene Understanding">
- "Where was this photo likely taken?"
- "What time of day does this appear to be?"
- "Describe the setting and atmosphere"
</Tab>
<Tab title="Technical Analysis">
- "What colors are used in this design?"
- "Evaluate the layout and spacing"
- "What font family is being used?"
</Tab>
<Tab title="Content Analysis">
- "What's the main message of this infographic?"
- "Summarize the data shown in this chart"
- "What arguments does this slide present?"
</Tab>
</Tabs>
## Use Cases
<Tabs>
<Tab title="Software Development">
Share screenshots of error messages, UI bugs, stack traces, or whiteboard diagrams. Ask the AI to "fix this error", "review this interface design", or "convert this whiteboard diagram to code".
</Tab>
<Tab title="Education & Learning">
Upload textbook problems, diagrams, scientific images, or handwritten notes. Ask for explanations, summaries, or digital transcriptions.
</Tab>
<Tab title="Content & Design">
Get feedback on logo designs, poster layouts, color schemes, and compositions. Create captions, alt text, and writing prompts from images.
</Tab>
<Tab title="Professional Use">
Extract data from invoices, analyze dashboards and charts, review presentation slides, and digitize business cards and receipts.
</Tab>
<Tab title="Research">
Analyze scientific images, compare visualizations across papers, extract data from published figures, and identify patterns in visual data.
</Tab>
<Tab title="Daily Life">
Identify plants, products, or landmarks. Translate signs and menus. Get cooking or home repair guidance from photos.
</Tab>
</Tabs>
## Best Practices
<AccordionGroup>
<Accordion title="Use Clear, Well-Lit Images">
Blurry or dark images reduce accuracy significantly. Use good lighting and steady focus for best results.
</Accordion>
<Accordion title="Add Context with Text">
Combine images with a specific question or description of what you want to know. "What's wrong with this code?" alongside a screenshot is far more useful than uploading the image alone.
</Accordion>
<Accordion title="Crop to Relevant Areas">
Remove unnecessary parts of images to focus the AI's attention on what matters. This also reduces token usage.
</Accordion>
<Accordion title="Be Specific in Your Questions">
Instead of "What's this?", ask "What type of architectural style is this building?" Specific questions get more useful answers.
</Accordion>
<Accordion title="Verify Critical Information">
Vision AI can and does make mistakes. Always independently verify important details, especially for medical, legal, or financial content.
</Accordion>
<Accordion title="Optimize Image Size">
Keep images under 5 MB for best performance. Very large images are compressed automatically, which may reduce quality.
</Accordion>
</AccordionGroup>
## Limitations
<Callout type={'warning'}>
Vision models have limitations. Always verify critical information independently.
</Callout>
- **People and faces** — Cannot identify specific individuals (privacy protection by design)
- **Fine details** — May miss very small text or details in low-resolution images
- **Handwriting** — Variable accuracy depending on legibility
- **Video** — Cannot process video files; only static images are supported
- **Medical/legal** — Not suitable for medical diagnosis or legal advice; treat as informational only
- **Privacy** — Images are processed by the AI provider's servers; avoid uploading sensitive or confidential content without redaction
## Supported Models
Vision requires a vision-capable model. Look for models with a vision indicator in the model selector:
| Provider | Vision Models |
| --------- | ------------------------------------------------------------------ |
| OpenAI | GPT-4V, GPT-4o, GPT-4o mini |
| Anthropic | Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet+ |
| Google | Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini Pro Vision |
Other providers may also offer vision models — check the model's capability tags in the selector.
<Cards>
<Card href={'/docs/usage/getting-started/resource'} title={'Resource Library'} />
<Card href={'/docs/usage/getting-started/image-generation'} title={'Image Generation'} />
<Card href={'/docs/usage/providers'} title={'AI Providers'} />
</Cards>