XML Reader and Reader2Doc Improvements #14691

DevinTDHa · 2025-11-07T16:04:08Z

Description

This PR changes how Reader2Doc behaves, especially for XML files:

Reader2Doc now creates a single document annotation by default. This should give more predictable behavior when reading large documents. Lines are joined with the newline character \n by default; a different separator can be set with setJoinString.
The XML reader now ignores empty tags without text content. To extract tag attributes, the new parameter setExtractTagAttributes(attributes: list[str]) adds the values of the listed attributes to the document output. For example, given the following test.xml:

<bookstore>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>

We can extract the category and lang values with the following Reader2Doc configuration:

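from sparknlp.reader.reader2doc import Reader2Doc

# Read the XML file and also emit the values of the listed tag attributes
reader2doc = Reader2Doc() \
    .setContentType("application/xml") \
    .setContentPath("../src/test/resources/reader/xml/test.xml") \
    .setOutputCol("document") \
    .setExtractTagAttributes(["category", "lang"])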

This results in the following document output:

children
en
Harry Potter
J K. Rowling
2005
29.99
web
en
Learning XML
Erik T. Ray
2003
39.95
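
The extraction rules described above (skip tags without text, emit the values of selected attributes, join everything with a configurable string) can be illustrated with a minimal standard-library sketch. This is not the Spark NLP implementation; `extract_xml_text` and `SAMPLE_XML` are hypothetical names used only for illustration, and the ordering (attribute values before the tag's text) is assumed from the output shown above.

```python
import xml.etree.ElementTree as ET

# Same content as test.xml from the description.
SAMPLE_XML = """<bookstore>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>"""


def extract_xml_text(xml_string, attributes=(), join_string="\n"):
    """Hypothetical helper: collect attribute values and non-empty text.

    Elements are visited in document order. Text that follows a child
    element (tail text) is ignored to keep the sketch short.
    """
    wanted = set(attributes)
    parts = []
    for element in ET.fromstring(xml_string).iter():
        # Emit the requested attribute values first, in the order they
        # appear on the tag.
        for name, value in element.attrib.items():
            if name in wanted:
                parts.append(value)
        # Skip tags that have no text or only whitespace.
        text = (element.text or "").strip()
        if text:
            parts.append(text)
    # A single string is returned, mirroring the single-document default.
    return join_string.join(parts)


if __name__ == "__main__":
    print(extract_xml_text(SAMPLE_XML, ["category", "lang"]))
```

Passing a different `join_string`, for example `" | "`, changes only the separator between the collected pieces, which is the role the description gives to setJoinString.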

* Reader2Doc new defaults to always output single document
* XMLReader improvements
  - doesn't output empty text anymore
  - Can extract tag attribute values
* Reader2Doc improvements
  - adjusted defaults, so we always output a single large document
  - can specify join char with new parameter
  - adjusted other readers for new defaults
* Reader2Doc improvements python side
* ReaderAssembler: Fix failing test

DevinTDHa added 5 commits November 7, 2025 13:57

Reader2Doc new defaults to always output single document

e55708d

XMLReader improvements

76fff09

- doesn't output empty text anymore
- Can extract tag attribute values

Reader2Doc improvements

c1050c9

- adjusted defaults, so we always output a single large document
- can specify join char with new parameter
- adjusted other readers for new defaults

Reader2Doc improvements python side

325fb11

ReaderAssembler: Fix failing test

b0173c2

DevinTDHa merged commit 35c668e into JohnSnowLabs:release/621-release-candidate Nov 7, 2025
1 of 4 checks passed

DevinTDHa mentioned this pull request Nov 7, 2025

Release/621 release candidate #14687

Merged

10 tasks
