[Bug]: result formats markdown and cleaned_html will include damaged html tables #756

@Blackvz

Description

crawl4ai version

0.4.248

Expected Behavior

When I crawl a page that contains an HTML table (for example: https://www.german-tigers.de/trainingszeiten.php), the table should be exported correctly, at least in cleaned_html. In the html field of the result the table is intact, probably because that output is raw and not cleaned. The table should be exported just as correctly in cleaned_html. If columns or rows are missing, it is a bug.

Current Behavior

Empty columns in an HTML table get removed. This makes the table invalid, and the LLM cannot properly extract data from it because the table is already wrong in cleaned_html.
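To illustrate why a dropped empty cell matters, here is a minimal sketch using only the Python standard library. The sample table and the helper names (`RowCollector`, `parse_rows`, `as_records`) are hypothetical and not taken from the page in this report or from crawl4ai. It maps each row's cells to the header by position, which is roughly what any downstream consumer has to do.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            # Start a new (possibly empty) cell in the current row.
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_rows(html):
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def as_records(html):
    """Pair each body row with the header row by cell position."""
    header, *body = parse_rows(html)
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample data, not the real training schedule.
ORIGINAL = (
    "<table>"
    "<tr><th>Day</th><th>Kids</th><th>Adults</th></tr>"
    "<tr><td>Monday</td><td></td><td>19:00</td></tr>"
    "</table>"
)
# The same table after the empty <td> has been dropped.
DAMAGED = ORIGINAL.replace("<td></td>", "")

print(as_records(ORIGINAL))
print(as_records(DAMAGED))
```

With the empty cell removed, the value that belonged under "Adults" shifts left into "Kids", so the data is silently wrong rather than merely incomplete.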

Is this reproducible?

Yes

Inputs Causing the Bug

- Test URL (https://www.german-tigers.de/trainingszeiten.php)
- Use AsyncWebCrawler and run .arun() on that URL. No config is needed. Check the cleaned_html output and you will see that the table is wrong.

Steps to Reproduce

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.german-tigers.de/trainingszeiten.php",
        )

        print(result.cleaned_html)
        return result.cleaned_html

if __name__ == "__main__":
    asyncio.run(main())
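One possible way to confirm the problem without reading the output by eye is to count cells per table row in both the raw and the cleaned HTML. The sketch below uses only the standard library; `CellCounter`, `cells_per_row`, and `find_mismatched_rows` are hypothetical helper names, not part of crawl4ai. It ignores colspan and pairs rows purely by order, so it is a rough check and would misreport if the cleaning step also dropped whole rows.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count <td>/<th> cells for every <tr>, in document order."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.counts.append(0)
        elif tag in ("td", "th") and self.counts:
            self.counts[-1] += 1


def cells_per_row(html):
    parser = CellCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def find_mismatched_rows(raw_html, cleaned_html):
    """Return (row_index, raw_count, cleaned_count) where the counts differ.

    Assumption: both inputs contain the same rows in the same order.
    """
    raw = cells_per_row(raw_html)
    cleaned = cells_per_row(cleaned_html)
    return [
        (index, raw_count, cleaned_count)
        for index, (raw_count, cleaned_count) in enumerate(zip(raw, cleaned))
        if raw_count != cleaned_count
    ]
```

In the script above this could be called as `find_mismatched_rows(result.html, result.cleaned_html)`. An empty list would mean no row lost or gained cells under these assumptions.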

OS

macOS

Python version

3.12

Browser

Arc

Browser version

1.83.1

Error logs & Screenshots (if applicable)

Image

Metadata

Assignees

Labels

- ⚙ Done: Bug fix, enhancement, FR that's completed pending release
- ⚡ High: Priority - High
- 🐞 Bug: Something isn't working
- 💪 - Intermediate: Difficulty level - Intermediate
- 📌 Root caused: identified the root cause of bug
