[SPARK-48497][PYTHON][DOCS] Add an example for Python data source writer in user guide #46833

allisonwang-db · 2024-06-01T00:52:54Z

What changes were proposed in this pull request?

This PR adds an example for creating a simple data source writer in the user guide.

Why are the changes needed?

To improve the PySpark documentation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Verified locally.

Was this patch authored or co-authored using generative AI tooling?

No

allisonwang-db · 2024-06-01T00:53:03Z

cc @HyukjinKwon

HyukjinKwon · 2024-06-03T01:44:08Z

python/docs/source/user_guide/sql/python_data_source.rst

+.. code-block:: python
+
+    from dataclasses import dataclass
+    from typing import Iterator, List


Suggested change

from typing import Iterator, List

from typing import Iterator, List

per PEP 8

HyukjinKwon · 2024-06-03T01:44:45Z

python/docs/source/user_guide/sql/python_data_source.rst

+    class FakeDataSourceWriter(DataSourceWriter):
+
+        def write(self, rows: Iterator[Row]) -> SimpleCommitMessage:
+            from pyspark import TaskContext


I think we can import this on the top together.

This import actually needs to be inside the `write` method; otherwise it will throw a serialization error.

HyukjinKwon · 2024-06-03T01:46:09Z

python/docs/source/user_guide/sql/python_data_source.rst

+            return SimpleCommitMessage(partition_id=partition_id, count=cnt)
+
+        def commit(self, messages: List[SimpleCommitMessage]) -> None:
+            total_count = sum([message.count for message in messages])


Suggested change

total_count = sum([message.count for message in messages])

total_count = sum(message.count for message in messages)
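
The suggestion above swaps a list comprehension for a generator expression inside `sum()`. The sketch below illustrates that both forms give the same total. The `SimpleCommitMessage` here is a plain stand-in dataclass so the snippet runs without Spark (the one in the PR is used with `DataSourceWriter`), and `total_rows` is a hypothetical helper name used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleCommitMessage:
    # Stand-in for the commit message in the PR's example; assumed fields
    # match the diff (partition_id, count). No Spark dependency here.
    partition_id: int
    count: int


def total_rows(messages: List[SimpleCommitMessage]) -> int:
    # Generator expression: sum() consumes the counts one at a time,
    # so no intermediate list is built.
    return sum(message.count for message in messages)


messages = [SimpleCommitMessage(0, 3), SimpleCommitMessage(1, 5)]

# Original form from the diff: builds a temporary list first.
list_total = sum([message.count for message in messages])

# Suggested form: same result without the temporary list.
gen_total = total_rows(messages)

print(list_total, gen_total)
```

The two forms are interchangeable for this purpose; the generator version mainly avoids allocating a throwaway list.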

HyukjinKwon · 2024-06-03T01:47:17Z

python/docs/source/user_guide/sql/python_data_source.rst

    # | Douglas James|2007-01-18|  46226|     Alabama|
    # +--------------+----------+-------+------------+

+Write to the fake datasource:


Should we make it bold or use a smaller header? e.g., ~~~~~~

HyukjinKwon · 2024-06-03T01:48:28Z

python/docs/source/user_guide/sql/python_data_source.rst


+Write to the fake datasource:
+
+To write data to a custom sink, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.


Suggested change

To write data to a custom sink, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.

To write data to a custom location, make sure that you specify the `mode()` clause. Supported modes are `append` and `overwrite`.

"Sink" is actually used as a streaming term, so I would avoid using it.

Sounds good!

HyukjinKwon · 2024-06-03T01:49:26Z

python/pyspark/sql/datasource.py

        ----------
        messages : list of :class:`WriterCommitMessage`\\s
-            A list of commit messages.
+            A list of commit messages. Can contain `None` values if any task failed.


List[Optional["WriterCommitMessage"]]

I would change the type hint too.

Actually, I will improve this in a separate PR. I think there are other type hints that need modification.
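
Since the docstring change says the message list can contain `None` for failed tasks, a handler written against `List[Optional[...]]` needs to skip those entries before reading fields. The following is a minimal sketch of that idea, not the code from this PR: `SimpleCommitMessage` is a plain stand-in dataclass so it runs without Spark, and `summarize` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SimpleCommitMessage:
    # Stand-in for the commit message in the PR's example (assumed fields).
    partition_id: int
    count: int


def summarize(
    messages: List[Optional[SimpleCommitMessage]],
) -> Tuple[int, int]:
    # Keep only messages from tasks that reported back; a None entry
    # is treated as a failed task, per the updated docstring.
    succeeded = [m for m in messages if m is not None]
    failed_tasks = len(messages) - len(succeeded)
    total_count = sum(m.count for m in succeeded)
    return total_count, failed_tasks


result = summarize(
    [SimpleCommitMessage(0, 4), None, SimpleCommitMessage(2, 6)]
)
print(result)
```

Accessing `message.count` directly on every element would raise `AttributeError` on a `None` entry, which is why the filter comes first.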

HyukjinKwon

LGTM otherwise

HyukjinKwon · 2024-06-17T23:45:44Z

Merged to master.

add docs

155a0d4

github-actions bot added SQL DOCS PYTHON labels Jun 1, 2024

HyukjinKwon reviewed Jun 3, 2024

HyukjinKwon approved these changes Jun 3, 2024

zhengruifeng approved these changes Jun 6, 2024

allisonwang-db added 2 commits June 17, 2024 13:26

address comments

633c97f

update

307f553

HyukjinKwon closed this in f0b7cfa Jun 17, 2024

3 participants