mirror of
https://github.com/khoj-ai/khoj
synced 2026-04-21 15:57:17 +00:00
Fix extract_from_webpage discarding pre-fetched content (#1269)
## Summary
In `extract_from_webpage()`, the `content` parameter is unconditionally
overwritten to `None` on the line before the `is_none_or_empty(content)`
check. This means any pre-fetched content (e.g. text content already
retrieved by the Exa search engine) is always discarded, forcing an
unnecessary re-scrape of the webpage.
## Bug
```python
async def extract_from_webpage(
    url: str,
    subqueries: set[str] = None,
    content: str = None,  # <-- caller passes pre-fetched content
    ...
) -> Tuple[set[str], str, Union[None, str]]:
    content = None  # <-- BUG: immediately overwrites it
    if is_none_or_empty(content):  # always True
        content = await scrape_webpage_with_fallback(url)
```
## Fix
Remove the `content = None` assignment so the passed-in content is used
when available, falling back to scraping only when needed.
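A minimal sketch of the fixed control flow. The `is_none_or_empty` and `scrape_webpage_with_fallback` stand-ins below are simplified placeholders for illustration, not the real khoj helpers:

```python
import asyncio
from typing import Optional


def is_none_or_empty(value: Optional[str]) -> bool:
    # Simplified stand-in for the helper used in the real code.
    return value is None or value == ""


async def scrape_webpage_with_fallback(url: str) -> str:
    # Placeholder scraper; the real implementation fetches over the network.
    return f"<scraped content of {url}>"


async def extract_from_webpage(url: str, content: Optional[str] = None) -> str:
    # Fixed: the stray `content = None` is gone, so pre-fetched content
    # survives and scraping happens only when no content was passed in.
    if is_none_or_empty(content):
        content = await scrape_webpage_with_fallback(url)
    return content
```

With the fix, callers such as the Exa search path can hand in content they already hold and skip the redundant scrape entirely.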
This bug was introduced in a refactor and causes:
- Wasted API calls to web scrapers for pages whose content is already
available
- Increased latency for search results that include inline content (e.g.
Exa)
Signed-off-by: JiangNan <1394485448@qq.com>
This commit is contained in:
parent 6735d33af2
commit 678549c6b0
1 changed file with 0 additions and 1 deletions
```diff
@@ -556,7 +556,6 @@ async def extract_from_webpage(
     tracer: dict = {},
 ) -> Tuple[set[str], str, Union[None, str]]:
     # Read the web page
-    content = None
     if is_none_or_empty(content):
         content = await scrape_webpage_with_fallback(url)
```