---
title: >-
  Configuring Online Search Functionality - Enhancing AI's Ability to Access Web
  Information
description: >-
  Learn how to configure the SearXNG online search functionality for LobeHub,
  enabling AI to access the latest web information.
tags:
  - Online Search
  - SearXNG
  - Web Information
  - AI Enhancement
---

# Configuring Online Search Functionality

LobeHub supports configuring **web search functionality** for AI, enabling it to retrieve real-time information from the internet and provide more accurate, up-to-date responses. Web search supports multiple search engine providers, including [SearXNG](https://github.com/searxng/searxng), [Search1API](https://www.search1api.com), [Google](https://programmablesearchengine.google.com), and [Brave](https://brave.com/search/api), among others.

<Callout type="info">
  Web search allows AI to access time-sensitive content, such as the latest news, technology trends,
  or product information. You can deploy the open-source SearXNG yourself, or integrate
  mainstream search services such as Search1API, Google, and Brave, combining them freely based on
  your use case.
</Callout>

Once you set the `SEARCH_PROVIDERS` environment variable and the corresponding API keys, LobeHub queries the configured sources and returns the results. You can also set crawler environment variables such as `CRAWLER_IMPLS` (e.g., `browserless`, `firecrawl`, `tavily`) to extract webpage content, so the AI can read pages as well as search for them.

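As a rough illustration of how these variables fit together, the `.env` sketch below enables two search providers and two crawlers. The provider names, crawler names, and variable names are the ones documented in the tables under Core Environment Variables; the key values are placeholders, and this particular combination is only an assumption about a typical setup.

```env
# Search: a SearXNG instance plus Brave (placeholder values)
SEARCH_PROVIDERS="searxng,brave"
SEARXNG_URL=https://searxng-instance.com
BRAVE_API_KEY=your-brave-api-key

# Crawling: Jina plus the built-in crawler
CRAWLER_IMPLS="jina,naive"
JINA_READER_API_KEY=your-jina-api-key
```
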
# Core Environment Variables

## `CRAWLER_IMPLS`

Configure available web crawlers for structured extraction of webpage content.

```env
CRAWLER_IMPLS="naive,search1api"
```

Supported crawler types are listed below:

| Value         | Description                                                                                                         | Environment Variable                                                        |
| ------------- | ------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `browserless` | Headless browser crawler based on [Browserless](https://www.browserless.io/), suitable for rendering complex pages. | `BROWSERLESS_TOKEN`                                                         |
| `exa`         | Crawler capabilities provided by [Exa](https://exa.ai/), API required.                                              | `EXA_API_KEY`                                                               |
| `firecrawl`   | [Firecrawl](https://firecrawl.dev/) headless browser API, ideal for modern websites.                                | `FIRECRAWL_API_KEY`                                                         |
| `jina`        | Crawler service from [Jina AI](https://jina.ai/), supports fast content summarization.                              | `JINA_READER_API_KEY`                                                       |
| `naive`       | Built-in general-purpose crawler for standard web structures.                                                       |                                                                             |
| `search1api`  | Page crawling capabilities from [Search1API](https://www.search1api.com), great for structured content extraction.  | `SEARCH1API_API_KEY` `SEARCH1API_CRAWL_API_KEY` `SEARCH1API_SEARCH_API_KEY` |
| `tavily`      | Web scraping and summarization API from [Tavily](https://www.tavily.com/).                                          | `TAVILY_API_KEY`                                                            |

> 💡 Configuring multiple crawlers increases the success rate; the system tries them in order of priority.

---

## `CRAWL_CONCURRENCY`

Controls how many crawl requests run concurrently within a single crawl task. The default is `3`. On low-resource servers, use `1` to reduce CPU spikes.

```env
CRAWL_CONCURRENCY=3
```

## `CRAWLER_RETRY`

Controls how many times a URL is retried after a crawl failure. The default is `1`, which allows up to 2 attempts in total.

```env
CRAWLER_RETRY=1
```

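For a low-resource server, the two settings above might be combined as in the following sketch. The value `2` for `CRAWLER_RETRY` is only an illustration; by the same counting as the default, it would presumably allow up to 3 attempts per URL.

```env
# One request at a time to reduce CPU spikes
CRAWL_CONCURRENCY=1
# Illustrative value: 2 retries, so presumably up to 3 attempts per URL
CRAWLER_RETRY=2
```
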
---

## `SEARCH_PROVIDERS`

Configure which search engine providers to use for web search.

```env
SEARCH_PROVIDERS="searxng"
```

Supported search engines include:

| Value        | Description                                                                             | Environment Variable                                                        |
| ------------ | --------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `anspire`    | Search service provided by [Anspire](https://anspire.ai/).                              | `ANSPIRE_API_KEY`                                                           |
| `bocha`      | Search service from [Bocha](https://open.bochaai.com/).                                 | `BOCHA_API_KEY`                                                             |
| `brave`      | [Brave](https://search.brave.com/help/api), a privacy-friendly search source.           | `BRAVE_API_KEY`                                                             |
| `exa`        | [Exa](https://exa.ai/), a search API designed for AI.                                   | `EXA_API_KEY`                                                               |
| `firecrawl`  | Search capabilities via [Firecrawl](https://firecrawl.dev/).                            | `FIRECRAWL_API_KEY`                                                         |
| `google`     | Uses [Google Programmable Search Engine](https://programmablesearchengine.google.com/). | `GOOGLE_PSE_API_KEY` `GOOGLE_PSE_ENGINE_ID`                                 |
| `jina`       | Semantic search provided by [Jina AI](https://jina.ai/).                                | `JINA_READER_API_KEY`                                                       |
| `kagi`       | Premium search API by [Kagi](https://kagi.com/), requires a subscription key.           | `KAGI_API_KEY`                                                              |
| `search1api` | Aggregated search capabilities from [Search1API](https://www.search1api.com).           | `SEARCH1API_API_KEY` `SEARCH1API_CRAWL_API_KEY` `SEARCH1API_SEARCH_API_KEY` |
| `searxng`    | Use a self-hosted or public [SearXNG](https://searx.space/) instance.                   | `SEARXNG_URL`                                                               |
| `tavily`     | [Tavily](https://www.tavily.com/), offers fast web summaries and answers.               | `TAVILY_API_KEY`                                                            |

> ⚠️ Some search providers require you to apply for an API Key and configure it in your `.env` file.

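Because each provider needs its own variable, a small preflight check can catch a missing key before the service starts. The shell sketch below is a hypothetical helper, not part of LobeHub: the function names `required_vars_for` and `missing_search_vars` are invented for illustration, and the provider-to-variable mapping is copied from the table above. `search1api` is left out because the table lists three variables for it without saying which are mandatory.

```shell
# Hypothetical preflight helper (not shipped with LobeHub).
# Maps a provider name to the variable(s) the table above lists for it.
required_vars_for() {
  case "$1" in
    anspire)   echo "ANSPIRE_API_KEY" ;;
    bocha)     echo "BOCHA_API_KEY" ;;
    brave)     echo "BRAVE_API_KEY" ;;
    exa)       echo "EXA_API_KEY" ;;
    firecrawl) echo "FIRECRAWL_API_KEY" ;;
    google)    echo "GOOGLE_PSE_API_KEY GOOGLE_PSE_ENGINE_ID" ;;
    jina)      echo "JINA_READER_API_KEY" ;;
    kagi)      echo "KAGI_API_KEY" ;;
    searxng)   echo "SEARXNG_URL" ;;
    tavily)    echo "TAVILY_API_KEY" ;;
    *)         ;;  # unknown provider, or one this sketch does not cover
  esac
}

# Prints one line per variable that is required by a provider listed in
# SEARCH_PROVIDERS but is empty or unset. Runs in a subshell so its loop
# variables do not leak into the caller.
missing_search_vars() (
  for provider in $(printf '%s' "${SEARCH_PROVIDERS:-}" | tr ',' ' '); do
    for var in $(required_vars_for "$provider"); do
      # Look up the value of the variable whose name is stored in $var.
      eval "value=\${$var:-}"
      [ -n "$value" ] || echo "$var"
    done
  done
)
```

After loading your `.env` into the current shell (for example with `set -a; . ./.env; set +a`, assuming the file is shell-compatible), run `missing_search_vars`. Empty output means every listed provider covered by the sketch has its variables set.
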
---

## `BROWSERLESS_URL`

Specifies the API endpoint for [Browserless](https://www.browserless.io/), used for web crawling tasks. Browserless is a browser automation platform based on Headless Chrome, well suited to rendering dynamic pages.

```env
BROWSERLESS_URL=https://chrome.browserless.io
```

> 📌 Usually used together with `CRAWLER_IMPLS=browserless`.

---

## `BROWSERLESS_BLOCK_ADS`

Enables ad blocking. When [Browserless](https://www.browserless.io/) is used for web scraping, this option automatically blocks common ad resources (such as scripts, images, and trackers), improving scraping speed and page clarity.

```env
BROWSERLESS_BLOCK_ADS=1
```

> 📌 Supported values:
>
> - `1`: Enable ad blocking (recommended);
> - `0`: Disable ad blocking (default).

> ✅ Using this together with `BROWSERLESS_STEALTH_MODE=1` is recommended to improve stealth and the scraping success rate.

---

## `BROWSERLESS_STEALTH_MODE`

Enables stealth mode. When [Browserless](https://www.browserless.io/) is used for web scraping, stealth mode applies anti-detection techniques (such as modifying the user agent, removing webdriver traits, and simulating user interactions) to bypass anti-bot mechanisms.

```env
BROWSERLESS_STEALTH_MODE=1
```

> 📌 Supported values:
>
> - `1`: Enable stealth mode (recommended);
> - `0`: Disable stealth mode (default).

> ⚠️ Some websites use advanced anti-scraping techniques. Enabling stealth mode can significantly improve the scraping success rate on such sites.

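Taken together, the Browserless settings might look like the sketch below. The token value is a placeholder, and listing `naive` after `browserless` as a second crawler is an assumption for illustration rather than a documented recommendation.

```env
CRAWLER_IMPLS="browserless,naive"
BROWSERLESS_URL=https://chrome.browserless.io
BROWSERLESS_TOKEN=your-browserless-token
# Both flags are off by default; 1 enables them
BROWSERLESS_BLOCK_ADS=1
BROWSERLESS_STEALTH_MODE=1
```
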
---

## `GOOGLE_PSE_ENGINE_ID`

Sets the Search Engine ID for Google Programmable Search Engine (Google PSE), which restricts the search scope. Must be used alongside `GOOGLE_PSE_API_KEY`.

```env
GOOGLE_PSE_ENGINE_ID=your-google-cx-id
```

> 🔑 How to get it: Visit [programmablesearchengine.google.com](https://programmablesearchengine.google.com/), create a search engine, and obtain the `cx` parameter.

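Since the engine ID only works together with the API key, a Google-only setup would presumably need all three lines in the sketch below. Both values are placeholders.

```env
SEARCH_PROVIDERS="google"
GOOGLE_PSE_API_KEY=your-google-api-key
# The cx value of the search engine you created
GOOGLE_PSE_ENGINE_ID=your-google-cx-id
```
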
---

## `FIRECRAWL_URL`

Sets the access URL for the [Firecrawl](https://firecrawl.dev/) API, used for web content scraping. Default value:

```env
FIRECRAWL_URL=https://api.firecrawl.dev/v2
```

> ⚙️ Usually does not need to be changed unless you’re using a self-hosted version or a proxy service.

---

## `TAVILY_SEARCH_DEPTH`

Configure the result depth for [Tavily](https://www.tavily.com/) searches.

```env
TAVILY_SEARCH_DEPTH=basic
```

Supported values:

- `basic`: Fast search, returns brief results;
- `advanced`: Deep search, returns more context and web page details.

---

## `TAVILY_EXTRACT_DEPTH`

Configure how deeply Tavily extracts content from web pages.

```env
TAVILY_EXTRACT_DEPTH=basic
```

Supported values:

- `basic`: Extracts basic info like title and content summary;
- `advanced`: Extracts structured data, lists, charts, and more from web pages.

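If you prefer richer results over speed, a Tavily-only setup could raise both depth settings, as in this sketch. The key value is a placeholder, and using Tavily as both search provider and crawler is an assumption made for illustration.

```env
SEARCH_PROVIDERS="tavily"
CRAWLER_IMPLS="tavily"
TAVILY_API_KEY=your-tavily-api-key
# advanced returns more context than basic for both search and extraction
TAVILY_SEARCH_DEPTH=advanced
TAVILY_EXTRACT_DEPTH=advanced
```
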
---

## `SEARXNG_URL`

The URL of the SearXNG instance, which is required to enable online search through SearXNG. For example:

```shell
SEARXNG_URL=https://searxng-instance.com
```

This URL should point to a working SearXNG instance. You can self-host SearXNG or use a publicly available instance.

You can find publicly available instances in the [SearXNG instance list](https://searx.space/). Choose one that is fast and reliable, then configure its URL in LobeHub.

> The SearXNG instance you use must have `json` output enabled; otherwise, calls from LobeHub will fail with an error. If self-hosting, find the SearXNG configuration file and add `json` under `search.formats` as shown below.

```bash
$ vi searxng/settings.yml
...
search:
  formats:
    - html
    - json
```

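One way to confirm that an instance serves JSON before pointing LobeHub at it is to request a search with `format=json`. The sketch below is a hypothetical helper, not part of LobeHub or SearXNG: the function names are invented, it assumes `curl` is installed, and it relies on the SearXNG `/search` endpoint accepting `q` and `format` query parameters. An instance without `json` in its formats typically rejects such a request instead of returning HTTP 200.

```shell
# Hypothetical helpers for checking a SearXNG instance.

# Build the probe URL; a trailing slash on the base URL is tolerated.
# The query is inserted as-is, so keep it to a single plain word.
searxng_probe_url() {
  printf '%s/search?q=%s&format=json\n' "${1%/}" "${2:-test}"
}

# Succeeds (exit status 0) when the instance answers the JSON request
# with HTTP 200; fails otherwise.
searxng_json_enabled() {
  status=$(curl -s -o /dev/null -w '%{http_code}' "$(searxng_probe_url "$1")")
  [ "$status" = "200" ]
}
```

For example, `searxng_json_enabled "$SEARXNG_URL" && echo "JSON output is enabled"` prints the message only when the probe succeeds.
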
# Verifying Online Search Functionality

After configuration, verify that online search works correctly by following these steps:

1. Restart the LobeHub service.
2. Start a new chat session, enable smart online search, and ask the AI a question that requires up-to-date information, such as "What is the gold price today?" or "What are the latest major news stories?"
3. Check whether the AI returns current information based on internet searches.

If the AI can answer these time-sensitive questions, online search has been configured successfully.

## References

- [LobeHub Online Search RFC Discussion](https://github.com/lobehub/lobehub/discussions/6447)
- [SearXNG GitHub Repository](https://github.com/searxng/searxng)
- [Discussion on Enabling JSON Output for SearXNG](https://github.com/searxng/searxng/discussions/3542)