@DevinTDHa
Member

Description

This PR changes how Reader2Doc behaves, especially for XML files:

  1. Reader2Doc now creates a single document annotation by default. This should give more predictable behavior when reading large documents. Lines are joined with the newline character \n by default; the separator can be changed with setJoinString.
  2. The XML reader now ignores tags that have no text content. To extract tag attributes, the new parameter setExtractTagAttributes(attributes: list[str]) adds the values of the listed attributes to the document output. For example, given the following test.xml:
<bookstore>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>

The category and lang values can be extracted with this Reader2Doc configuration:

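from sparknlp.reader.reader2doc import Reader2Doc

# Read the XML file and include the values of the listed tag attributes
reader2doc = Reader2Doc() \
    .setContentType("application/xml") \
    .setContentPath("../src/test/resources/reader/xml/test.xml") \
    .setOutputCol("document") \
    .setExtractTagAttributes(["category", "lang"])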

This results in the following document text:

children
en
Harry Potter
J K. Rowling
2005
29.99
web
en
Learning XML
Erik T. Ray
2003
39.95
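
To make the expected output easier to reason about, here is a minimal standalone sketch of the flattening described above. It is not the Reader2Doc or XML reader implementation and does not use Spark NLP; `xml_to_document` and its parameters are hypothetical names, and it assumes that attribute values are emitted before the text of their tag and that whitespace-only text counts as empty.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<bookstore>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>"""


def xml_to_document(xml_text, extract_attributes=(), join_string="\n"):
    """Hypothetical helper: flatten XML into a single string.

    Assumptions (not confirmed by the PR): attribute values come before
    the tag's own text, and whitespace-only text is treated as empty.
    """
    parts = []
    root = ET.fromstring(xml_text)
    # iter() walks the tree in document order, starting at the root
    for element in root.iter():
        # Emit the values of the requested attributes, if present
        for name in extract_attributes:
            if name in element.attrib:
                parts.append(element.attrib[name])
        # Skip tags that carry no text content
        text = (element.text or "").strip()
        if text:
            parts.append(text)
    # One joined string stands in for the single document annotation
    return join_string.join(parts)


document = xml_to_document(SAMPLE_XML, ["category", "lang"])
print(document)
```

Under these assumptions the printed text matches the twelve lines listed above. Passing a different `join_string`, for example `" "`, mirrors what setJoinString is described as doing: the same values are produced, only the separator changes.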

- The XML reader no longer outputs empty text
- The XML reader can extract tag attribute values
- Adjusted the Reader2Doc defaults so that a single large document is always output
- The join string can be set with a new parameter
- Adjusted the other readers for the new defaults
@DevinTDHa DevinTDHa merged commit 35c668e into JohnSnowLabs:release/621-release-candidate Nov 7, 2025
1 of 4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Nov 7, 2025
10 tasks
DevinTDHa added a commit that referenced this pull request Nov 7, 2025
* Reader2Doc new defaults to always output single document

* XMLReader improvements

- doesn't output empty text anymore
- Can extract tag attribute values

* Reader2Doc improvements

- adjusted defaults, so we always output a single large document
- can specify join char with new parameter
- adjusted other readers for new defaults

* Reader2Doc improvements python side

* ReaderAssembler: Fix failing test