-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources #18581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
fb589a3
c4282bf
e7983d6
16cc0c2
b006d65
5ce9895
bc65e6b
a240973
265dd48
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -41,11 +41,15 @@ private[libsvm] class LibSVMOptions(@transient private val parameters: CaseInsen | |||||
| case o => throw new IllegalArgumentException(s"Invalid value `$o` for parameter " + | ||||||
| s"`$VECTOR_TYPE`. Expected types are `sparse` and `dense`.") | ||||||
| } | ||||||
|
|
||||||
| val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR) | ||||||
| lineSeparator.foreach(s => require(s.nonEmpty, s"'$LINE_SEPARATOR' cannot be an empty string.")) | ||||||
| } | ||||||
|
|
||||||
| private[libsvm] object LibSVMOptions { | ||||||
| val NUM_FEATURES = "numFeatures" | ||||||
| val VECTOR_TYPE = "vectorType" | ||||||
| val DENSE_VECTOR_TYPE = "dense" | ||||||
| val SPARSE_VECTOR_TYPE = "sparse" | ||||||
| val LINE_SEPARATOR = "lineSep" | ||||||
|
||||||
| * <li>`sep` (default `,`): sets the single character as a separator for each | |
| * field and value.</li> |
and was thinking of matching the name ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for history, it was delimiter but renamed to sep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need multi-characters for the separator? Hive assumes a single character in
LINES TERMINATED BYhttps://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTableThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, actually in case of Univocity, it requires 2 characters:
I don't see a reason to restrict this for now. At least, I can provide an usecase with Windows -
\r\n.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, ok. But I couldn't just imagine an usecase to use more than two characters. It's okay to me that we'll follow committers decisions on this. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen datasets that have multi-character delimiters of more than 2 characters. Specifically
|~|So yes there is a use case, but it's a long tail one. I'd be happy to get this progress of up to 2 characters and work towards 3+ in a future PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we put this option in a single place for these formats? I feel putting this option in each format looks a little messy...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we support one or two characters only, I feel we better explicitly throw an exception for more than two characters here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compressionis also there for many datasources. Probably, let me try to open up a discussion about tying up those later.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks!