lobehub/packages/web-crawler
Arvin Xu a6e330cfa9
🐛 fix(web-crawler): prevent happy-dom CSS parsing crash in htmlToMarkdown (#13652)
- Disable CSS file loading and JS evaluation in happy-dom Window (root cause)
- Add try-catch around Readability.parse() for defense in depth
- Add regression tests for invalid CSS selectors and external stylesheet links

Closes LOBE-6869

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 12:59:49 +08:00
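The two-layer fix described in the commit can be sketched roughly as follows. The `windowSettings` flag names mirror happy-dom's `Window` constructor options; `safeParse` is a hypothetical helper illustrating the defensive wrapper around `Readability.parse()`, not the actual source:

```typescript
// Sketch of the two-layer fix (helper names are hypothetical, not the actual source).

// Layer 1: options for happy-dom's `new Window(...)` that stop it from fetching
// external stylesheets or evaluating page scripts while parsing crawled HTML.
const windowSettings = {
  settings: {
    disableCSSFileLoading: true,
    disableJavaScriptEvaluation: true,
  },
};

// Layer 2: defense in depth — wrap the Readability.parse() call so a parser
// crash degrades to `null` instead of taking down the whole conversion.
function safeParse<T>(parse: () => T): T | null {
  try {
    return parse();
  } catch {
    return null;
  }
}
```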

@lobechat/web-crawler

LobeHub's built-in web crawling module, which intelligently extracts web content and converts it to Markdown.

📝 Introduction

@lobechat/web-crawler is a core component of LobeHub responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.

🛠️ Core Features

  • Intelligent Content Extraction: Identifies main content using the Mozilla Readability algorithm
  • Multi-level Crawling Strategy: Supports multiple crawling implementations including basic crawling, Jina, Search1API, and Browserless rendering
  • Custom URL Rules: Handles specific website crawling logic through a flexible rule system
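As a rough illustration of the multi-level strategy, the crawler can try each configured implementation in order and fall back to the next on failure. This is a hedged sketch; the implementation names come from the README, but the selection logic here is hypothetical:

```typescript
// Hypothetical fallback across crawl implementations ('naive', 'jina', ...).
type CrawlImpl = (url: string) => Promise<string | null>;

async function crawlWithFallback(
  url: string,
  impls: CrawlImpl[],
): Promise<string | null> {
  for (const impl of impls) {
    try {
      const content = await impl(url);
      if (content) return content; // first implementation that yields content wins
    } catch {
      // a failed implementation simply falls through to the next one
    }
  }
  return null; // every implementation failed or returned nothing
}
```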

🤝 Contribution

Web page structures are diverse and complex. We welcome community contributions of crawling rules for specific websites. You can participate as follows:

How to Contribute URL Rules

  1. Add new rules to the urlRules.ts file
  2. Rule example:
// Example: handling specific websites
const urlRules = [
  // ... other URL matching rules
  {
    // URL matching pattern, supports regex
    urlPattern: 'https://example.com/articles/(.*)',

    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',

    // Optional: specify crawling implementation, supports 'naive', 'jina', 'search1api', and 'browserless'
    impls: ['naive', 'jina', 'search1api', 'browserless'],

    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable Readability algorithm for filtering distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
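To make the `urlPattern` / `urlTransform` pair above concrete, here is a minimal, hypothetical sketch of how such a rule could be applied; the module's actual matching logic may differ:

```typescript
// Hypothetical rule application: match a URL against each pattern and,
// if a transform is configured, rewrite it via regex capture groups ($1, ...).
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
}

function resolveUrl(url: string, rules: UrlRule[]): string {
  for (const rule of rules) {
    const pattern = new RegExp(rule.urlPattern);
    if (pattern.test(url)) {
      return rule.urlTransform ? url.replace(pattern, rule.urlTransform) : url;
    }
  }
  return url; // no rule matched; crawl the URL as-is
}
```

With the example rule above, `resolveUrl('https://example.com/articles/guide', rules)` would yield the easier-to-crawl print version `https://example.com/print/guide`.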

Rule Submission Process

  1. Fork the LobeHub repository
  2. Add or modify URL rules
  3. Submit a Pull Request describing:
  • Target website characteristics
  • Problems solved by the rule
  • Test cases (example URLs)

📌 Note

This is an internal LobeHub module ("private": true), designed for use within LobeHub and not published as a standalone package.