[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark #19339
python/pyspark/sql/readwriter.py

```diff
@@ -438,9 +438,13 @@ def func(iterator):
             keyed = path.mapPartitions(func)
             keyed._bypass_serializer = True
             jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
+            # [SPARK-22112]
+            # There aren't any jvm api for creating a dataframe from rdd storing csv.
+            # We can do it through creating a jvm dataset firstly and using the jvm api
+            # for creating a dataframe from dataset storing csv.
             jdataset = self._spark._ssql_ctx.createDataset(
                 jrdd.rdd(),
-                self._spark._sc._jvm.Encoders.STRING())
+                self._spark._jvm.Encoders.STRING())
             return self._df(self._jreader.csv(jdataset))
         else:
             raise TypeError("path can be only string, list or RDD")
```

Review thread on the added `# [SPARK-22112]` comment block:

Member: Just personal preference:

Contributor (Author): OK, let me fix it. Thanks :)

Member: Yeah, the usual style.

Member: Let's fix these comments when we happen to fix some code around here, or when reviewing other PRs that fix code around here in the future.

Contributor (Author): OK, thanks.

Review thread on the `createDataset` call:

Member: Let's add a small comment here to explain why we should create the dataset (which could look a bit weird in PySpark, I believe).
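For context, a hedged usage sketch of what this change enables on the user side. The sample rows, app name, and expected output are illustrative, not from the PR:

```python
# Usage sketch for the behavior this PR adds: spark.read.csv accepting an RDD
# of CSV-formatted strings in addition to paths. Sample data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-from-rdd-sketch").getOrCreate()

# Each RDD element is one CSV line.
lines = spark.sparkContext.parallelize(["Alice,1", "Bob,2"])

df = spark.read.csv(lines)
df.show()
# Expected shape (columns default to strings named _c0, _c1):
# +-----+---+
# |  _c0|_c1|
# +-----+---+
# |Alice|  1|
# |  Bob|  2|
# +-----+---+
```

This mirrors the Scala-side `csv(Dataset[String])` overload; the Python path bridges the RDD into a JVM `Dataset[String]` first, as the diff above shows.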
Member: I tried a way within Python and this seems to work:
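The snippet itself does not appear above; what follows is a reconstruction of the route the final diff adopts, not necessarily the reviewer's exact code. The attribute names come from the diff; the surrounding method context is assumed:

```python
# Reconstruction (assumed, based on the final diff): this runs inside
# DataFrameReader.csv() when `path` is an RDD of CSV strings, and `func`
# serializes each partition's strings to UTF-8 bytes.
keyed = path.mapPartitions(func)
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
# The JVM reader has no csv(RDD[String]) entry point, but it does accept a
# Dataset[String], so wrap the JavaRDD in a Dataset with a string encoder.
jdataset = self._spark._ssql_ctx.createDataset(
    jrdd.rdd(),
    self._spark._jvm.Encoders.STRING())
return self._df(self._jreader.csv(jdataset))
```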
Member: @goldmedal, it'd be great if you could double-check whether this really works and whether it can be made shorter or cleaner. This was just my rough try to reach the goal, so I am not sure it is the best way.
Contributor (Author): OK, this way looks good. I'll try it. Thanks for your suggestion.