[SPARK-48497][PYTHON][DOCS] Add an example for Python data source writer in user guide #46833
cc @HyukjinKwon
| .. code-block:: python | ||
| from dataclasses import dataclass | ||
| from typing import Iterator, List |
| from typing import Iterator, List | |
| from typing import Iterator, List | |
per PEP 8
| class FakeDataSourceWriter(DataSourceWriter): | ||
| def write(self, rows: Iterator[Row]) -> SimpleCommitMessage: | ||
| from pyspark import TaskContext |
I think we can import this at the top, together with the other imports.
This import actually needs to be inside the `write` method; otherwise it will throw a serialization error.
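The deferred import described in the reply above can be sketched as follows. This is a minimal illustration, not the exact code from the user guide: to keep it importable without Spark installed, the class here is a plain class rather than a subclass of `DataSourceWriter`, `SimpleCommitMessage` is a stand-in dataclass, and the row-counting body is an assumption based on the fragments quoted in this review.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class SimpleCommitMessage:
    # Stand-in for the guide's commit message (assumed fields).
    partition_id: int
    count: int


class FakeDataSourceWriter:
    # In the guide this class extends pyspark.sql.datasource.DataSourceWriter.
    def write(self, rows: Iterator[Any]) -> SimpleCommitMessage:
        # Import inside the method, as the PR author requires, so the
        # import happens when the task runs rather than at module load.
        from pyspark import TaskContext

        context = TaskContext.get()
        partition_id = context.partitionId()
        # Count the rows handled by this task.
        cnt = sum(1 for _ in rows)
        return SimpleCommitMessage(partition_id=partition_id, count=cnt)
```

Note that `TaskContext.get()` only returns a context while running inside a Spark task, so `write` is meant to be called by Spark on the executors, not directly on the driver.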
| return SimpleCommitMessage(partition_id=partition_id, count=cnt) | ||
| def commit(self, messages: List[SimpleCommitMessage]) -> None: | ||
| total_count = sum([message.count for message in messages]) |
- total_count = sum([message.count for message in messages])
+ total_count = sum(message.count for message in messages)
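For context, the suggestion above replaces a list comprehension with a generator expression, so `sum` consumes the counts one at a time without building an intermediate list. A small self-contained sketch, where `SimpleCommitMessage` is a stand-in dataclass and `total_rows` is a hypothetical helper name:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleCommitMessage:
    # Stand-in for the guide's commit message (assumed fields).
    partition_id: int
    count: int


def total_rows(messages: List[SimpleCommitMessage]) -> int:
    # Generator expression: no temporary list is created.
    return sum(message.count for message in messages)


messages = [SimpleCommitMessage(0, 3), SimpleCommitMessage(1, 5)]
print(total_rows(messages))  # prints 8
```

Both forms return the same total; the generator form is simply the more idiomatic one.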
| # | Douglas James|2007-01-18| 46226| Alabama| | ||
| # +--------------+----------+-------+------------+ | ||
| Write to the fake datasource: |
Should we make it bold or use a smaller header, e.g., ~~~~~~?
| To write data to a custom sink, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`. |
- To write data to a custom sink, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.
+ To write data to a custom location, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.
"Sink" is actually used as a streaming term, so I would avoid using it.
Sounds good!
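As a rough illustration of the sentence under discussion, a write with an explicit `mode()` might look like the sketch below. The short name `"fake"` and the helper `write_to_fake_source` are assumptions for illustration; the real name is whatever the registered data source returns from its `name()` method.

```python
SUPPORTED_MODES = ("append", "overwrite")


def write_to_fake_source(df, mode: str = "append") -> None:
    """Write a DataFrame to the example data source with an explicit mode.

    `df` is expected to be a pyspark.sql.DataFrame; "fake" is an assumed
    short name for the registered Python data source.
    """
    if mode not in SUPPORTED_MODES:
        raise ValueError(
            f"Unsupported mode {mode!r}; expected one of {SUPPORTED_MODES}"
        )
    # DataFrameWriter calls: format() selects the source, mode() sets the
    # save mode, and save() triggers the write.
    df.write.format("fake").mode(mode).save()
```

With a live session this would be called as `write_to_fake_source(df, "append")` after the data source has been registered.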
python/pyspark/sql/datasource.py
Outdated
  ----------
  messages : list of :class:`WriterCommitMessage`\\s
- A list of commit messages.
+ A list of commit messages. Can contain `None` values if any task failed.
List[Optional["WriterCommitMessage"]]
I would change the type hint too.
Actually, I will improve this in a separate PR. I think there are other type hints that need modification.
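To show why the `Optional` element type matters, here is a hedged sketch of code that receives such a list and skips the `None` entries left by failed tasks. `SimpleCommitMessage` is a stand-in dataclass and `summarize_messages` is a hypothetical helper, not part of the PySpark API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SimpleCommitMessage:
    # Stand-in for a WriterCommitMessage subclass (assumed fields).
    partition_id: int
    count: int


def summarize_messages(
    messages: List[Optional[SimpleCommitMessage]],
) -> Tuple[int, int]:
    """Return (rows written by successful tasks, number of failed tasks)."""
    # Failed tasks are represented by None, so filter them out first.
    succeeded = [m for m in messages if m is not None]
    failed_tasks = len(messages) - len(succeeded)
    total_rows = sum(m.count for m in succeeded)
    return total_rows, failed_tasks


result = summarize_messages(
    [SimpleCommitMessage(0, 3), None, SimpleCommitMessage(2, 4)]
)
print(result)  # prints (7, 1)
```

Without the `None` check, accessing `m.count` on a failed task's entry would raise `AttributeError`.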
HyukjinKwon
left a comment
LGTM otherwise
Merged to master.
What changes were proposed in this pull request?
This PR adds an example for creating a simple data source writer in the user guide.
Why are the changes needed?
To improve the PySpark documentation.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Verified locally.
Was this patch authored or co-authored using generative AI tooling?
No