@allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR adds an example for creating a simple data source writer in the user guide.

Why are the changes needed?

To improve the PySpark documentation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Verified locally.

Was this patch authored or co-authored using generative AI tooling?

No

@allisonwang-db
Contributor Author

cc @HyukjinKwon

.. code-block:: python
from dataclasses import dataclass
from typing import Iterator, List
Member

Suggested change
from typing import Iterator, List
from typing import Iterator, List

per PEP 8

class FakeDataSourceWriter(DataSourceWriter):
    def write(self, rows: Iterator[Row]) -> SimpleCommitMessage:
        from pyspark import TaskContext
Member

I think we can import this on the top together.

Contributor Author

This import actually needs to be inside the `write` method; otherwise it will throw a serialization error.
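For readers following this thread, a minimal sketch of the deferred-import pattern described above. It assumes the `partition_id` and `count` fields seen in the diff; the plain `@dataclass` and the base-class-free writer are simplifications for illustration (the user guide version subclasses `DataSourceWriter`), and the row-counting loop is an assumption about what the elided method body does.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class SimpleCommitMessage:
    # Field names taken from the diff; the plain dataclass is a simplification.
    partition_id: int
    count: int


class FakeDataSourceWriter:  # simplified: no DataSourceWriter base class here
    def write(self, rows: Iterator) -> SimpleCommitMessage:
        # Deferred import: pyspark is only looked up when write() is called,
        # not when this class is defined.
        from pyspark import TaskContext

        context = TaskContext.get()
        partition_id = context.partitionId()
        # Assumed body: count the rows handed to this task.
        cnt = sum(1 for _ in rows)
        return SimpleCommitMessage(partition_id=partition_id, count=cnt)
```

Defining the class does not import `pyspark`; the import runs only inside `write`, which is the placement the author says is required.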

        return SimpleCommitMessage(partition_id=partition_id, count=cnt)
    def commit(self, messages: List[SimpleCommitMessage]) -> None:
        total_count = sum([message.count for message in messages])
Member

Suggested change
total_count = sum([message.count for message in messages])
total_count = sum(message.count for message in messages)
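A small standalone sketch of the suggested form, for context. `sum` accepts any iterable, so the generator expression avoids building an intermediate list. The helper name `total_rows` and the sample messages are hypothetical, and the dataclass mirrors only the fields visible in the diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleCommitMessage:
    partition_id: int
    count: int


def total_rows(messages: List[SimpleCommitMessage]) -> int:
    # Generator expression: counts are summed as they are produced,
    # without materializing a temporary list first.
    return sum(message.count for message in messages)


# Hypothetical sample data for illustration.
messages = [SimpleCommitMessage(0, 3), SimpleCommitMessage(1, 5)]
print(total_rows(messages))  # 8
```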

# | Douglas James|2007-01-18| 46226| Alabama|
# +--------------+----------+-------+------------+
Write to the fake datasource:
Member

Should we make it bold or use a smaller header? e.g., ~~~~~~

Write to the fake datasource:

To write data to a custom sink, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.
Member

Suggested change
To write data to a custom sink, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.
To write data to a custom location, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.

Sink is actually used as a streaming term, so I would avoid using it.

Contributor Author

Sounds good!
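As a rough illustration of the sentence under review, the sketch below wraps a write call so that the mode is always given explicitly. The helper `write_to_fake`, the `SUPPORTED_MODES` tuple, and the short name `"fake"` are assumptions made for this example; only `append` and `overwrite` come from the documented text. It expects a PySpark `DataFrame` and a data source already registered under that name.

```python
SUPPORTED_MODES = ("append", "overwrite")  # the two modes named in the doc text


def write_to_fake(df, mode: str = "append") -> None:
    """Write `df` to a custom data source, always passing an explicit mode.

    `df` is expected to be a pyspark.sql.DataFrame; "fake" is an assumed
    short name for a data source registered elsewhere.
    """
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"mode must be one of {SUPPORTED_MODES}, got {mode!r}")
    # DataFrameWriter chain: choose the format, set the mode, then save.
    df.write.format("fake").mode(mode).save()
```

With a live session this would be called as `write_to_fake(df, "overwrite")`.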

----------
messages : list of :class:`WriterCommitMessage`\\s
A list of commit messages.
A list of commit messages. Can contain `None` values if any task failed.
Member

List[Optional["WriterCommitMessage"]]

I would change the type hint too.

Contributor Author

Actually, I will improve this in a separate PR. I think there are other type hints that need modification.
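To make the updated docstring concrete, here is a hedged sketch of a commit-side helper that tolerates `None` entries, using the `List[Optional[...]]` shape suggested in this thread. The helper name `summarize_commit` and the returned tuple are hypothetical; the dataclass again mirrors only the fields shown in the diff and does not subclass `WriterCommitMessage`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SimpleCommitMessage:
    partition_id: int
    count: int


def summarize_commit(
    messages: List[Optional[SimpleCommitMessage]],
) -> Tuple[int, int]:
    """Return (total row count, number of missing messages)."""
    # Skip None entries, which the docstring says may appear for failed tasks.
    total_count = sum(m.count for m in messages if m is not None)
    missing = sum(1 for m in messages if m is None)
    return total_count, missing


# Hypothetical sample: two successful tasks and one missing message.
sample = [SimpleCommitMessage(0, 4), None, SimpleCommitMessage(2, 6)]
print(summarize_commit(sample))  # (10, 1)
```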

Member

@HyukjinKwon left a comment

LGTM otherwise

@HyukjinKwon
Member

Merged to master.
