---
title: Vision & Image Understanding
description: >-
  Upload images and have AI agents analyze, describe, extract text, and answer
  questions about visual content.
tags:
  - LobeHub
  - Vision
  - Image Analysis
  - OCR
  - Multimodal
---

# Vision & Image Understanding

LobeHub supports vision capabilities — Agents can see and understand images you share. This multimodal feature enables conversations that go beyond text to include rich visual context.

## What AI Can Do with Images

Vision-enabled models can:

- **Analyze images** — Understand photos, screenshots, diagrams, and documents
- **Read text (OCR)** — Extract text from images, screenshots, handwritten notes, and signs
- **Describe visuals** — Provide detailed descriptions of scenes and objects
- **Answer questions** — Respond to queries about what's in an image
- **Compare images** — Analyze differences between multiple images
- **Recognize patterns** — Identify layouts, design styles, and trends

## Uploading Images

### Upload Methods

<Tabs>
<Tab title="Drag and Drop">
Drag an image file from your computer into the chat input area. Works with single or multiple images at once. The simplest method for files already on your desktop.
</Tab>

<Tab title="Click to Upload">
Click the attachment/image icon in the input area, browse your files, and select one or more images. Best for selecting files from specific folders.
</Tab>

<Tab title="Paste from Clipboard">
Copy any image (screenshot, copied from a web page, etc.), click in the message input, and press `Ctrl+V` (or `Cmd+V` on Mac). The image appears instantly — ideal for quick screenshot questions. A sketch of how paste capture works in the browser follows after these tabs.
</Tab>
</Tabs>
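
Under the hood, paste upload relies on the browser clipboard API. The sketch below shows one way a chat input could capture a pasted image; the element id and upload helper are hypothetical, not LobeHub's actual implementation:

```ts
// Hypothetical sketch of clipboard image capture; not LobeHub's real code.
declare function uploadImage(file: File): void; // stand-in for the app's upload path

const input = document.querySelector<HTMLTextAreaElement>('#chat-input'); // hypothetical element id

input?.addEventListener('paste', (event: ClipboardEvent) => {
  for (const item of event.clipboardData?.items ?? []) {
    // Pasted screenshots arrive as file items with an image/* MIME type.
    if (item.kind === 'file' && item.type.startsWith('image/')) {
      const file = item.getAsFile();
      if (file) {
        event.preventDefault(); // keep raw bytes out of the text box
        uploadImage(file);
      }
    }
  }
});
```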

### Supported Formats and Limits

Supported formats: JPEG/JPG, PNG, WebP, GIF (static frames only), BMP

- Maximum size: \~20 MB per image
- Recommended: under 5 MB for best performance
- Large images are automatically compressed
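
For self-hosted deployments that enforce these limits in their own UI, a minimal pre-upload check might look like the sketch below. The constants mirror the numbers above; the function and its return shape are hypothetical:

```ts
// Minimal sketch of a client-side check against the documented limits.
// The constants mirror the docs above; the function name is illustrative.
const SUPPORTED_TYPES = ['image/jpeg', 'image/png', 'image/webp', 'image/gif', 'image/bmp'];
const MAX_BYTES = 20 * 1024 * 1024; // ~20 MB hard limit
const RECOMMENDED_BYTES = 5 * 1024 * 1024; // under 5 MB performs best

function checkImage(file: File): { ok: boolean; warning?: string } {
  if (!SUPPORTED_TYPES.includes(file.type)) {
    return { ok: false, warning: `Unsupported format: ${file.type || 'unknown'}` };
  }
  if (file.size > MAX_BYTES) {
    return { ok: false, warning: 'Image exceeds the ~20 MB limit' };
  }
  if (file.size > RECOMMENDED_BYTES) {
    // Still accepted, but large images are compressed and may upload slowly.
    return { ok: true, warning: 'Over 5 MB: expect automatic compression' };
  }
  return { ok: true };
}
```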

<Callout type={'info'}>
The image upload button only appears when you are using a vision-capable model. If you don't see
it, switch to a model that supports vision (see supported models below).
</Callout>

<Callout type={'warning'}>
Vision features consume more tokens than text-only conversations, which may affect API costs for
self-hosted or API-key deployments.
</Callout>
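
To see where the extra tokens come from, consider how an attached image reaches the provider. The sketch below uses OpenAI's chat completions format, where the image travels as a base64 data URL content part alongside your text; it is an illustrative request, not LobeHub's internal code:

```ts
// Illustrative OpenAI-style multimodal request; not LobeHub internals.
// Each attached image becomes an extra content part and is billed as
// vision input tokens on top of the text prompt.
declare const base64Png: string; // assume an already-encoded image

const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: "What's in this image?" },
          { type: 'image_url', image_url: { url: `data:image/png;base64,${base64Png}` } },
        ],
      },
    ],
  }),
});
```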
## Using Vision Features

### Image Analysis

Ask general questions about an image:

```
"What's in this image?"
"Describe what you see in detail"
"What are the main elements of this photo?"
```

### Text Extraction (OCR)

Extract text from images, screenshots, and documents:

```
"What does the text say?"
"Transcribe all text from this image"
"Read the error message in this screenshot"
```

Works with screenshots, photos of signs, printed documents, and code in images. Handwriting recognition is less consistent; accuracy depends on legibility.

### Multiple Images

Upload several images at once and ask for comparison or combined analysis:

```
"Compare these three design variations and suggest which is most effective"
"What are the differences between these before/after photos?"
"Analyze the trends shown in these charts"
```

### Asking Specific Questions

The more specific your question, the better the analysis:

<Tabs>
<Tab title="Object Identification">
- "What type of plant is this?"
- "What brand of laptop is shown?"
- "Identify the components in this circuit board"
</Tab>

<Tab title="Scene Understanding">
- "Where was this photo likely taken?"
- "What time of day does this appear to be?"
- "Describe the setting and atmosphere"
</Tab>

<Tab title="Technical Analysis">
- "What colors are used in this design?"
- "Evaluate the layout and spacing"
- "What font family is being used?"
</Tab>

<Tab title="Content Analysis">
- "What's the main message of this infographic?"
- "Summarize the data shown in this chart"
- "What arguments does this slide present?"
</Tab>
</Tabs>

## Use Cases

<Tabs>
<Tab title="Software Development">
Share screenshots of error messages, UI bugs, stack traces, or whiteboard diagrams. Ask the AI to "fix this error", "review this interface design", or "convert this whiteboard diagram to code".
</Tab>

<Tab title="Education & Learning">
Upload textbook problems, diagrams, scientific images, or handwritten notes. Ask for explanations, summaries, or digital transcriptions.
</Tab>

<Tab title="Content & Design">
Get feedback on logo designs, poster layouts, color schemes, and compositions. Create captions, alt text, and writing prompts from images.
</Tab>

<Tab title="Professional Use">
Extract data from invoices, analyze dashboards and charts, review presentation slides, and digitize business cards and receipts.
</Tab>

<Tab title="Research">
Analyze scientific images, compare visualizations across papers, extract data from published figures, and identify patterns in visual data.
</Tab>

<Tab title="Daily Life">
Identify plants, products, or landmarks. Translate signs and menus. Get cooking or home repair guidance from photos.
</Tab>
</Tabs>

## Best Practices

<AccordionGroup>
<Accordion title="Use Clear, Well-Lit Images">
Blurry or dark images reduce accuracy significantly. Use good lighting and steady focus for best results.
</Accordion>

<Accordion title="Add Context with Text">
Combine images with a specific question or description of what you want to know. "What's wrong with this code?" alongside a screenshot is far more useful than uploading the image alone.
</Accordion>

<Accordion title="Crop to Relevant Areas">
Remove unnecessary parts of images to focus the AI's attention on what matters. This also reduces token usage.
</Accordion>

<Accordion title="Be Specific in Your Questions">
Instead of "What's this?", ask "What type of architectural style is this building?" Specific questions get more useful answers.
</Accordion>

<Accordion title="Verify Critical Information">
Vision AI can and does make mistakes. Always independently verify important details, especially for medical, legal, or financial content.
</Accordion>

<Accordion title="Optimize Image Size">
Keep images under 5 MB for best performance. Very large images are compressed automatically, which may reduce quality. A sketch of what downscaling involves follows below.
</Accordion>
</AccordionGroup>
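
For a sense of what automatic compression involves, the sketch below downscales an image in the browser with a canvas before upload. The helper and its 2048 px cap are hypothetical, not LobeHub's actual pipeline:

```ts
// Hypothetical browser-side downscaling sketch (not LobeHub's built-in logic):
// draw the image onto a smaller canvas and re-encode it as JPEG.
async function downscaleImage(file: File, maxDimension = 2048): Promise<Blob> {
  const bitmap = await createImageBitmap(file);
  const scale = Math.min(1, maxDimension / Math.max(bitmap.width, bitmap.height));

  const canvas = document.createElement('canvas');
  canvas.width = Math.round(bitmap.width * scale);
  canvas.height = Math.round(bitmap.height * scale);
  canvas.getContext('2d')!.drawImage(bitmap, 0, 0, canvas.width, canvas.height);

  return new Promise((resolve, reject) =>
    canvas.toBlob(
      (blob) => (blob ? resolve(blob) : reject(new Error('encoding failed'))),
      'image/jpeg',
      0.85, // quality: trades file size against detail the model can read
    ),
  );
}
```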

## Limitations

<Callout type={'warning'}>
Vision models have limitations. Always verify critical information independently.
</Callout>

- **People and faces** — Cannot identify specific individuals (privacy protection by design)
- **Fine details** — May miss very small text or details in low-resolution images
- **Handwriting** — Variable accuracy depending on legibility
- **Video** — Cannot process video files; only static images are supported
- **Medical/legal** — Not suitable for medical diagnosis or legal advice; treat as informational only
- **Privacy** — Images are processed by the AI provider's servers; avoid uploading sensitive or confidential content without redaction

## Supported Models

Vision requires a vision-capable model. Look for models with a vision indicator in the model selector:

| Provider  | Vision Models                                                      |
| --------- | ------------------------------------------------------------------ |
| OpenAI    | GPT-4V, GPT-4o, GPT-4o mini                                        |
| Anthropic | Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet+ |
| Google    | Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini Pro Vision                |

Other providers may also offer vision models — check the model's capability tags in the selector.

<Cards>
<Card href={'/docs/usage/getting-started/resource'} title={'Resource Library'} />
<Card href={'/docs/usage/getting-started/image-generation'} title={'Image Generation'} />
<Card href={'/docs/usage/providers'} title={'AI Providers'} />
</Cards>