Merged
91 commits
7c17057
fix: https://github.com/unclecode/crawl4ai/issues/756
aravindkarnam Mar 1, 2025
5edfea2
Fix LiteLLM branding and link
joshrad-dev Mar 2, 2025
1e819cd
fixes: https://github.com/unclecode/crawl4ai/issues/774
aravindkarnam Mar 3, 2025
f14e4a4
Merge pull request #776 from jawshoeadan/patch-1
aravindkarnam Mar 3, 2025
504207f
docs: update text in llm-strategies.md to reflect new changes in LlmC…
aravindkarnam Mar 3, 2025
341b7a5
🐛 Truncate width to integer string in parse_srcset
dvschuyl Mar 11, 2025
cbb8755
Merge branch 'next' into 2025-MAR-ALPHA-1
aravindkarnam Mar 13, 2025
a3954dd
refactor: Move the checking of protocol and prepending protocol insid…
aravindkarnam Mar 14, 2025
c190ba8
refactor: Instead of custom validation of question, rely on the built…
aravindkarnam Mar 14, 2025
84883be
Merge branch 'next' into 2025-MAR-ALPHA-1
aravindkarnam Mar 18, 2025
9109ecd
chore: Raise an exception with clear messaging when body tag is missi…
aravindkarnam Mar 18, 2025
529a797
docs: remove hallucinations from docs for CrawlerRunConfig + Add chun…
aravindkarnam Mar 18, 2025
4359b12
docs + fix: Update example for full page screenshot & PDF export. Fix…
aravindkarnam Mar 18, 2025
8cecbec
Merge branch 'next' into 2025-MAR-ALPHA-1
aravindkarnam Mar 20, 2025
eedda1a
fix: Truncate long urls in middle than end since users are confused t…
aravindkarnam Mar 20, 2025
ac2f9ae
fix: streamline url status logging via single entrypoint i.e. logger.…
aravindkarnam Mar 20, 2025
e0c2a7c
chore: remove mistakenly commited deps.txt file
aravindkarnam Mar 21, 2025
8b761f2
fix: improve logged url readability by decoding encoded urls
aravindkarnam Mar 21, 2025
6740e87
fix: remove trailing slash when the path is empty. This is causing du…
aravindkarnam Mar 21, 2025
f891133
fix: Move adding of visited urls to the 'visited' set, when queueing …
aravindkarnam Mar 21, 2025
471d110
fix: url normalisation ref: https://github.com/unclecode/crawl4ai/iss…
aravindkarnam Mar 21, 2025
e01d1e7
fix: link normalisation in BestFirstStrategy
aravindkarnam Mar 21, 2025
efa7325
Merge branch 'next' into 2025-MAR-ALPHA-1
aravindkarnam Mar 24, 2025
2f0e217
Chore: Add brotli as dependancy to fix: https://github.com/unclecode/…
aravindkarnam Mar 25, 2025
e3111d0
fix: prevent session closing after each request to maintain connectio…
aravindkarnam Mar 25, 2025
585e5e5
fix: https://github.com/unclecode/crawl4ai/issues/733
aravindkarnam Mar 25, 2025
7be5427
Merge branch 'next' into 2025-MAR-ALPHA-1
aravindkarnam Mar 27, 2025
c635f6b
refactor(browser): reorganize browser strategies and improve Docker i…
unclecode Mar 27, 2025
57e0423
fix:target_element should not affect link extraction. -> https://gith…
aravindkarnam Mar 28, 2025
64f20ab
refactor(docker): update Dockerfile and browser strategy to use Chromium
unclecode Mar 28, 2025
d8cbeff
fix: https://github.com/unclecode/crawl4ai/issues/842
aravindkarnam Mar 28, 2025
3ff7eec
refactor(browser): consolidate browser strategy implementations
unclecode Mar 28, 2025
bb02398
refactor(browser): improve browser strategy architecture and lifecycl…
unclecode Mar 30, 2025
1119f2f
fix: https://github.com/unclecode/crawl4ai/issues/911
maggie-edkey Mar 31, 2025
ef1f0c4
fix:https://github.com/unclecode/crawl4ai/issues/701
aravindkarnam Mar 31, 2025
d8357e8
Merge pull request #915 from maggie-edkey/css-selector
aravindkarnam Mar 31, 2025
757e317
fix: https://github.com/unclecode/crawl4ai/issues/839
aravindkarnam Mar 31, 2025
765f856
Merge pull request #808 from dvschuyl/bug/parse-srcset-fix-float-width
aravindkarnam Mar 31, 2025
555455d
feat(browser): implement browser pooling and page pre-warming
unclecode Mar 31, 2025
c5cac2b
feat(browser): add BrowserHub for centralized browser management and …
unclecode Apr 1, 2025
9e16a4b
Merge next and resolve conflicts
aravindkarnam Apr 2, 2025
179921a
fix(crawler): update get_page call to include additional return value
unclecode Apr 2, 2025
86df202
fix(crawler): handle exceptions in get_page call to ensure page retri…
unclecode Apr 2, 2025
73fda8a
fix: address the PR review: https://github.com/unclecode/crawl4ai/pul…
aravindkarnam Apr 3, 2025
4133e54
typo-fix: https://github.com/unclecode/crawl4ai/pull/918
aravindkarnam Apr 3, 2025
7155778
chore: move from faust-cchardet to chardet
aravindkarnam Apr 3, 2025
14894b4
feat(config): set DefaultMarkdownGenerator as the default markdown ge…
unclecode Apr 3, 2025
b1693b1
Remove old quickstart files
unclecode Apr 5, 2025
591f55e
refactor(browser): rename methods and update type hints in BrowserHub…
unclecode Apr 6, 2025
5b66208
Refactor next branch
unclecode Apr 6, 2025
02e627e
fix(crawler): simplify page retrieval logic in AsyncPlaywrightCrawler…
unclecode Apr 8, 2025
9038e9a
Merge branch 'main' into next
unclecode Apr 8, 2025
6f7ab9c
fix: Revert changes to session management in AsyncHttpWebcrawler and …
aravindkarnam Apr 8, 2025
a2061bf
feat(crawler): add MHTML capture functionality
unclecode Apr 9, 2025
66ac07b
feat(crawler): add network request and console message capturing
unclecode Apr 10, 2025
108b2a8
Fixed capturing console messages for case the url is the local file. …
unclecode Apr 10, 2025
7c358a1
fix(browser): add null check for crawlerRunConfig.url
unclecode Apr 10, 2025
18e8227
feat(crawler): add console message capture functionality
unclecode Apr 10, 2025
3179d6a
fix(core): improve error handling and stability in core components
unclecode Apr 11, 2025
022f5c9
Merged next branch
aravindkarnam Apr 12, 2025
d84508b
fix: revert the old target_elms code in regular webscraping strategy
aravindkarnam Apr 12, 2025
9fc5d31
fix: revert the old target_elms code in LXMLwebscraping strategy
aravindkarnam Apr 12, 2025
7d8e81f
fix: fix target_elements, in a less invasive and more efficient way s…
aravindkarnam Apr 12, 2025
ecec53a
Docker tested on Windows machine.
unclecode Apr 13, 2025
dcc2654
fix: Add a nominal wait time for remove overlay elements since it's a…
aravindkarnam Apr 14, 2025
c56974c
feat(docs): enhance documentation UI with ToC and GitHub stats
unclecode Apr 14, 2025
cd7ff6f
feat(docs): add AI assistant interface and code copy button
unclecode Apr 14, 2025
82aa53a
Merge branch 'next-alpine-docker' into next
unclecode Apr 14, 2025
793668a
Remove parameter_updates.txt
unclecode Apr 14, 2025
230f22d
refactor(proxy): move ProxyConfig to async_configs and improve LLM to…
unclecode Apr 15, 2025
5206c6f
Modify the test file
unclecode Apr 15, 2025
94d4865
docs(tests): clarify server URL comments in deep crawl tests
unclecode Apr 15, 2025
eed7f88
Merge branch 'next' into 2025-MAR-ALPHA-1
aravindkarnam Apr 17, 2025
7db6b46
feat(markdown): add content source selection for markdown generation
unclecode Apr 17, 2025
30ec4f5
feat(docs): add comprehensive Docker API demo script
unclecode Apr 17, 2025
fd899f6
Merge branch 'next-fix-markdown-source' into next
unclecode Apr 17, 2025
921e0c4
feat(tests): implement high volume stress testing framework
unclecode Apr 17, 2025
3bf78ff
refactor(docker-demo): enhance error handling and output formatting
unclecode Apr 17, 2025
907cba1
Merge branch 'next-stress' into next
unclecode Apr 17, 2025
16b2318
feat(api): implement crawler pool manager for improved resource handling
unclecode Apr 18, 2025
c2902fd
reverse:last change in order of execution for it introduced a new iss…
aravindkarnam Apr 19, 2025
d2648ea
fix: solved with deepcopy of elements https://github.com/unclecode/cr…
aravindkarnam Apr 19, 2025
b27bb36
merge next. Resolve conflicts. Fix some import errors and error hand…
aravindkarnam Apr 19, 2025
a58c800
refactor(server): migrate to pool-based crawler management
unclecode Apr 20, 2025
5297e36
feat(mcp): Implement MCP protocol and enhance server capabilities
unclecode Apr 21, 2025
b5c2573
feat(browser): add geolocation, locale and timezone support
unclecode Apr 21, 2025
0007aea
Update changelog
unclecode Apr 21, 2025
f3ebb38
Merge PR #899 into next, resolve conflicts in server.py and docs/brow…
unclecode Apr 22, 2025
4812f08
feat(docker): update Docker deployment for v0.6.0
unclecode Apr 22, 2025
c98ffe2
Update CHANGELOG
unclecode Apr 22, 2025
b0aa8bc
Update README
unclecode Apr 22, 2025
refactor(docker-demo): enhance error handling and output formatting
Improve the Docker API demo script with better error handling, more detailed output,
and enhanced visualization:
- Add detailed error messages and stack traces for debugging
- Implement better status code handling and display
- Enhance JSON output formatting with monokai theme and word wrap
- Add depth information display for deep crawls
- Improve proxy usage reporting
- Fix port number inconsistency

No breaking changes.
unclecode committed Apr 17, 2025
commit 3bf78ff47a67c82a962dbc0d19da166b42229961
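For context, the visualization changes in this commit lean on stock Rich features: a monokai-themed Syntax panel with word wrapping for payloads, and console.print_exception() for stack traces. A minimal, self-contained sketch of those calls follows; the show_payload helper name is illustrative, not part of the diff.

import json
from rich.console import Console
from rich.panel import Panel
from rich.syntax import Syntax

console = Console()

def show_payload(payload: dict) -> None:
    # Dark theme plus word wrap, as adopted by print_payload in the diff below
    syntax = Syntax(
        json.dumps(payload, indent=2),
        "json",
        theme="monokai",
        line_numbers=False,
        word_wrap=True,
    )
    console.print(Panel(syntax, title="Request Payload", border_style="blue", expand=False))

try:
    show_payload({"urls": ["https://httpbin.org/html"], "crawler_config": {}})
    raise RuntimeError("simulated failure")
except RuntimeError:
    # Rich renders a highlighted stack trace, mirroring the demo's error paths
    console.print_exception(show_locals=False)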
194 changes: 165 additions & 29 deletions docs/examples/docker/demo_docker_api.py
@@ -16,8 +16,8 @@
 console = Console()
 
 # --- Configuration ---
 BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")
+BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
 # Target URLs
 SIMPLE_URL = "https://httpbin.org/html"
 LINKS_URL = "https://httpbin.org/links/10/0"
@@ -50,8 +50,14 @@ async def check_server_health(client: httpx.AsyncClient):
         return False
 
 def print_payload(payload: Dict[str, Any]):
-    """Prints the JSON payload nicely."""
-    syntax = Syntax(json.dumps(payload, indent=2), "json", theme="default", line_numbers=False)
+    """Prints the JSON payload nicely with a dark theme."""
+    syntax = Syntax(
+        json.dumps(payload, indent=2),
+        "json",
+        theme="monokai", # <--- Changed theme here
+        line_numbers=False,
+        word_wrap=True # Added word wrap for potentially long payloads
+    )
     console.print(Panel(syntax, title="Request Payload", border_style="blue", expand=False))
 
 def print_result_summary(results: List[Dict[str, Any]], title: str = "Crawl Results Summary", max_items: int = 3):
@@ -126,12 +132,15 @@ async def stream_request(client: httpx.AsyncClient, endpoint: str, payload: Dict
     print_payload(payload)
     console.print(f"Sending POST stream request to {client.base_url}{endpoint}...")
     all_results = []
+    initial_status_code = None # Store initial status code
 
     try:
         start_time = time.time()
         async with client.stream("POST", endpoint, json=payload) as response:
+            initial_status_code = response.status_code # Capture initial status
             duration = time.time() - start_time # Time to first byte potentially
-            console.print(f"Initial Response Status: [bold {'green' if response.status_code == 200 else 'red'}]{response.status_code}[/] (first byte ~{duration:.2f}s)")
-            response.raise_for_status()
+            console.print(f"Initial Response Status: [bold {'green' if response.is_success else 'red'}]{initial_status_code}[/] (first byte ~{duration:.2f}s)")
+            response.raise_for_status() # Raise exception for bad *initial* status codes
 
             console.print("[magenta]--- Streaming Results ---[/]")
             completed = False
@@ -143,11 +152,16 @@ async def stream_request(client: httpx.AsyncClient, endpoint: str, payload: Dict
                             completed = True
                             console.print("[bold green]--- Stream Completed ---[/]")
                             break
-                        elif data.get("url"): # Looks like a result
+                        elif data.get("url"): # Looks like a result dictionary
                             all_results.append(data)
                             # Display summary info as it arrives
                             success_icon = "[green]✔[/]" if data.get('success') else "[red]✘[/]"
                             url = data.get('url', 'N/A')
-                            console.print(f" {success_icon} Received: [link={url}]{url}[/link]")
+                            # Display status code FROM THE RESULT DATA if available
+                            result_status = data.get('status_code', 'N/A')
+                            console.print(f" {success_icon} Received: [link={url}]{url}[/link] (Status: {result_status})")
+                            if not data.get('success') and data.get('error_message'):
+                                console.print(f" [red]Error: {data['error_message']}[/]")
                         else:
                             console.print(f" [yellow]Stream meta-data:[/yellow] {data}")
@@ -156,20 +170,23 @@ async def stream_request(client: httpx.AsyncClient, endpoint: str, payload: Dict
             console.print("[bold yellow]Warning: Stream ended without 'completed' marker.[/]")
 
     except httpx.HTTPStatusError as e:
-        console.print(f"[bold red]HTTP Error:[/]")
-        console.print(f"Status: {e.response.status_code}")
+        # Use the captured initial status code if available, otherwise from the exception
+        status = initial_status_code if initial_status_code is not None else e.response.status_code
+        console.print(f"[bold red]HTTP Error (Initial Request):[/]")
+        console.print(f"Status: {status}")
         try:
             console.print(Panel(Syntax(json.dumps(e.response.json(), indent=2), "json", theme="default"), title="Error Response"))
         except json.JSONDecodeError:
             console.print(f"Response Body: {e.response.text}")
     except httpx.RequestError as e:
         console.print(f"[bold red]Request Error: {e}[/]")
     except Exception as e:
-        console.print(f"[bold red]Unexpected Error: {e}[/]")
+        console.print(f"[bold red]Unexpected Error during streaming: {e}[/]")
+        console.print_exception(show_locals=False) # Print stack trace for unexpected errors
 
+    # Call print_result_summary with the *collected* results AFTER the stream is done
     print_result_summary(all_results, title=f"{title} Collected Results")
 
 
 def load_proxies_from_env() -> List[Dict]:
     """
     Load proxies from the PROXIES environment variable.
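Stepping back from the diff: stream_request assumes the /crawl/stream endpoint emits newline-delimited JSON, one result object per line, terminated by a {"status": "completed"} marker. Under that assumption, a stripped-down consumer of the same protocol might look like this (error display omitted):

import json
from typing import Any, Dict, List

import httpx

async def consume_stream(base_url: str, payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect per-URL results from the assumed newline-delimited JSON stream."""
    results: List[Dict[str, Any]] = []
    async with httpx.AsyncClient(base_url=base_url, timeout=300.0) as client:
        async with client.stream("POST", "/crawl/stream", json=payload) as response:
            response.raise_for_status()  # bad initial status -> raise before reading the body
            async for line in response.aiter_lines():
                if not line:
                    continue
                data = json.loads(line)
                if data.get("status") == "completed":  # end-of-stream marker
                    break
                if data.get("url"):  # looks like a per-URL result dictionary
                    results.append(data)
    return results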
@@ -583,7 +600,7 @@ async def demo_extract_llm(client: httpx.AsyncClient):
 
         if isinstance(extracted_data, dict):
             console.print("[cyan]Extracted Data (LLM):[/]")
-            syntax = Syntax(json.dumps(extracted_data, indent=2), "json", theme="default", line_numbers=False)
+            syntax = Syntax(json.dumps(extracted_data, indent=2), "json", theme="monokai", line_numbers=False)
             console.print(Panel(syntax, border_style="cyan", expand=False))
         else:
             console.print("[yellow]LLM extraction did not return expected dictionary.[/]")
@@ -618,6 +635,12 @@ async def demo_deep_basic(client: httpx.AsyncClient):
         }
     }
     results = await make_request(client, "/crawl", payload, "Demo 5a: Basic Deep Crawl")
     # print_result_summary is called by make_request, showing URLs and depths
+    for result in results:
+        if result.get("success") and result.get("metadata"):
+            depth = result["metadata"].get("depth", "N/A")
+            console.print(f" Depth: {depth}")
+        elif not result.get("success"):
+            console.print(f" [red]Error: {result['error_message']}[/]")
 
 # 5. Streaming Deep Crawl
 async def demo_deep_streaming(client: httpx.AsyncClient):
@@ -646,6 +669,109 @@ async def demo_deep_streaming(client: httpx.AsyncClient):
     # stream_request handles printing results as they arrive
     await stream_request(client, "/crawl/stream", payload, "Demo 5b: Streaming Deep Crawl")
 
+# 5a. Deep Crawl with Filtering & Scoring
+async def demo_deep_filtering_scoring(client: httpx.AsyncClient):
+    """Demonstrates deep crawl with advanced URL filtering and scoring."""
+    max_depth = 2 # Go a bit deeper to see scoring/filtering effects
+    max_pages = 6
+    excluded_pattern = "*/category-1/*" # Example pattern to exclude
+    keyword_to_score = "product" # Example keyword to prioritize
+
+    payload = {
+        "urls": [DEEP_CRAWL_BASE_URL],
+        "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
+        "crawler_config": {
+            "type": "CrawlerRunConfig",
+            "params": {
+                "stream": False,
+                "cache_mode": "BYPASS",
+                "deep_crawl_strategy": {
+                    "type": "BFSDeepCrawlStrategy",
+                    "params": {
+                        "max_depth": max_depth,
+                        "max_pages": max_pages,
+                        "filter_chain": {
+                            "type": "FilterChain",
+                            "params": {
+                                "filters": [
+                                    { # Stay on the allowed domain
+                                        "type": "DomainFilter",
+                                        "params": {"allowed_domains": [DEEP_CRAWL_DOMAIN]}
+                                    },
+                                    { # Only crawl HTML pages
+                                        "type": "ContentTypeFilter",
+                                        "params": {"allowed_types": ["text/html"]}
+                                    },
+                                    { # Exclude URLs matching the pattern
+                                        "type": "URLPatternFilter",
+                                        "params": {
+                                            "patterns": [excluded_pattern],
+                                            "reverse": True # Block if match
+                                        }
+                                    }
+                                ]
+                            }
+                        },
+                        "url_scorer": {
+                            "type": "CompositeScorer",
+                            "params": {
+                                "scorers": [
+                                    { # Boost score for URLs containing the keyword
+                                        "type": "KeywordRelevanceScorer",
+                                        "params": {"keywords": [keyword_to_score], "weight": 1.5} # Higher weight
+                                    },
+                                    { # Slightly penalize deeper pages
+                                        "type": "PathDepthScorer",
+                                        "params": {"optimal_depth": 1, "weight": -0.1}
+                                    }
+                                ]
+                            }
+                        },
+                        # Optional: Only crawl URLs scoring above a threshold
+                        # "score_threshold": 0.1
+                    }
+                }
+            }
+        }
+    }
+    results = await make_request(client, "/crawl", payload, "Demo 5c: Deep Crawl with Filtering & Scoring")
+
+    # --- Verification/Analysis ---
+    if results:
+        console.print("[cyan]Deep Crawl Filtering/Scoring Analysis:[/]")
+        excluded_found = False
+        prioritized_found_at_depth1 = False
+        prioritized_found_overall = False
+
+        for result in results:
+            url = result.get("url", "")
+            depth = result.get("metadata", {}).get("depth", -1)
+
+            # Check Filtering
+            if excluded_pattern.strip('*') in url: # Check if the excluded part is present
+                console.print(f" [bold red]Filter FAILED:[/bold red] Excluded pattern part '{excluded_pattern.strip('*')}' found in URL: {url}")
+                excluded_found = True
+
+            # Check Scoring (Observation)
+            if keyword_to_score in url:
+                prioritized_found_overall = True
+                if depth == 1: # Check if prioritized keywords appeared early (depth 1)
+                    prioritized_found_at_depth1 = True
+
+        if not excluded_found:
+            console.print(f" [green]Filter Check:[/green] No URLs matching excluded pattern '{excluded_pattern}' found.")
+        else:
+            console.print(f" [red]Filter Check:[/red] URLs matching excluded pattern '{excluded_pattern}' were found (unexpected).")
+
+        if prioritized_found_at_depth1:
+            console.print(f" [green]Scoring Check:[/green] URLs with keyword '{keyword_to_score}' were found at depth 1 (scoring likely influenced).")
+        elif prioritized_found_overall:
+            console.print(f" [yellow]Scoring Check:[/yellow] URLs with keyword '{keyword_to_score}' found, but not necessarily prioritized early (check max_pages/depth limits).")
+        else:
+            console.print(f" [yellow]Scoring Check:[/yellow] No URLs with keyword '{keyword_to_score}' found within crawl limits.")
+
+    # print_result_summary called by make_request already shows URLs and depths
+
 # 6. Deep Crawl with Extraction
 async def demo_deep_with_css_extraction(client: httpx.AsyncClient):
     # Schema to extract H1 and first paragraph from any page
@@ -782,16 +908,26 @@ async def demo_deep_with_proxy(client: httpx.AsyncClient):
                 "deep_crawl_strategy": {
                     "type": "BFSDeepCrawlStrategy",
                     "params": {
-                        "max_depth": 0, # Just crawl start URL via proxy
-                        "max_pages": 1,
+                        "max_depth": 1, # Just crawl start URL via proxy
+                        "max_pages": 5,
                     }
                 }
             }
         }
     }
     # make_request calls print_result_summary, which shows URL and success status
-    await make_request(client, "/crawl", payload, "Demo 6c: Deep Crawl + Proxies")
+    results = await make_request(client, "/crawl", payload, "Demo 6c: Deep Crawl + Proxies")
+    if not results:
+        console.print("[red]No results returned from the crawl.[/]")
+        return
+    console.print("[cyan]Proxy Usage Summary from Deep Crawl:[/]")
+    # Verification of specific proxy IP usage would require more complex setup or server logs.
+    for result in results:
+        if result.get("success") and result.get("metadata"):
+            proxy_ip = result["metadata"].get("proxy_ip", "N/A")
+            console.print(f" Proxy IP used: {proxy_ip}")
+        elif not result.get("success"):
+            console.print(f" [red]Error: {result['error_message']}[/]")
 
 
 # 6d. Deep Crawl with SSL Certificate Fetching
@@ -844,26 +980,26 @@ async def main_demo():
         return
 
     # --- Run Demos ---
-    # await demo_basic_single_url(client)
-    # await demo_basic_multi_url(client)
-    # await demo_streaming_multi_url(client)
+    await demo_basic_single_url(client)
+    await demo_basic_multi_url(client)
+    await demo_streaming_multi_url(client)
 
-    # await demo_markdown_default(client)
-    # await demo_markdown_pruning(client)
-    # await demo_markdown_bm25(client)
+    await demo_markdown_default(client)
+    await demo_markdown_pruning(client)
+    await demo_markdown_bm25(client)
 
-    # await demo_param_css_selector(client)
-    # await demo_param_js_execution(client)
-    # await demo_param_screenshot(client)
-    # await demo_param_ssl_fetch(client)
-    # await demo_param_proxy(client) # Skips if no PROXIES env var
+    await demo_param_css_selector(client)
+    await demo_param_js_execution(client)
+    await demo_param_screenshot(client)
+    await demo_param_ssl_fetch(client)
+    await demo_param_proxy(client) # Skips if no PROXIES env var
 
-    # await demo_extract_css(client)
+    await demo_extract_css(client)
     await demo_extract_llm(client) # Skips if no common LLM key env var
 
     await demo_deep_basic(client)
-    await demo_deep_streaming(client)
-    # demo_deep_filtering_scoring skipped for brevity, add if needed
+    await demo_deep_streaming(client) # This need extra work
 
 
     await demo_deep_with_css_extraction(client)
     await demo_deep_with_llm_extraction(client) # Skips if no common LLM key env var
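As a closing note on Demo 5c: the nested type/params envelope in its payload mirrors crawl4ai's Python deep-crawl API one-to-one. For readers driving the library directly instead of the Docker API, here is a hedged sketch of the equivalent configuration; the class names follow the crawl4ai deep-crawling docs, and the constructor signatures are inferred from the JSON envelope, so treat them as assumptions rather than a verified implementation.

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import (
    ContentTypeFilter,
    DomainFilter,
    FilterChain,
    URLPatternFilter,
)
from crawl4ai.deep_crawling.scorers import (
    CompositeScorer,
    KeywordRelevanceScorer,
    PathDepthScorer,
)

async def deep_crawl_filtered(start_url: str, allowed_domain: str):
    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=6,
        filter_chain=FilterChain([
            DomainFilter(allowed_domains=[allowed_domain]),  # stay on the allowed domain
            ContentTypeFilter(allowed_types=["text/html"]),  # only crawl HTML pages
            URLPatternFilter(patterns=["*/category-1/*"], reverse=True),  # block matches
        ]),
        url_scorer=CompositeScorer([
            KeywordRelevanceScorer(keywords=["product"], weight=1.5),  # boost keyword URLs
            PathDepthScorer(optimal_depth=1, weight=-0.1),  # slightly penalize deeper pages
        ]),
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(start_url, config=config)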