[SPARK-24676][SQL] Project required data from CSV parsed data when column pruning disabled #21657
Conversation
Test build #92418 has finished for PR 21657 at commit
MaxGekk left a comment:
sgtm
Can you use something else instead of avg? I would avoid equality checks on floating-point operands; you can easily get 4.49999999999 instead of 4.5.
ok
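To illustrate the concern above with a minimal, hypothetical Scala snippet (not code from the PR):

```scala
// Floating-point accumulation is inexact, so an exact equality check on an
// average can fail even when the math should give a round number.
val xs = Seq(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
val avg = xs.sum / xs.length
println(avg == 0.5)                  // may print false due to rounding error
println(math.abs(avg - 0.5) < 1e-9)  // a tolerance-based check is robust
```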
It seems the bug is not related to the column pruning feature, and most likely it was present in previous versions. Should it be ported to the 2.3 branch?
IIUC, before the column pruning PR, UnivocityParser always parsed the required columns only, so we didn't need to project them in CSVFileFormat. cc: @HyukjinKwon
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala, line 50 in e3de6ab:

```scala
private val row = new GenericInternalRow(requiredSchema.length)
```
I think we'd better just merge this into master since we already added some changes related to the column pruning stuff. Let me double-check it before merging it in.
The string exprs and `selectExpr` can be replaced by `.select(sum('p), avg('c0))`.
I think both are ok. Do we have a preferred one?
I think both are fine~
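For reference, a sketch of the two equivalent forms being discussed, assuming a `df` built as in the test and the usual Spark imports:

```scala
import org.apache.spark.sql.functions.{avg, sum}
import spark.implicits._  // enables the 'p symbol-to-Column syntax

// The SQL-string form and the Column-expression form express the same query.
val viaSelectExpr = df.selectExpr("sum(p)", "avg(c0)")
val viaSelect     = df.select(sum('p), avg('c0))
```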
"true" -> true ?
ok
Just in case, does the fix work too if the required schema is empty?
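A hypothetical example of the empty-required-schema case the question refers to, e.g. a bare count that needs no columns from the file (`dir` is assumed):

```scala
// count() needs no column values, so the required schema pushed down to the
// parser can be empty.
val n = spark.read.option("header", true).csv(dir).count()
```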
Test build #92445 has finished for PR 21657 at commit
Actually the issue happened because I removed the mapping: 64fad0b#diff-d19881aceddcaa5c60620fdcda99b4c4L79. I would propose to revert it back, and remove all those "expensive" (comparing to look up
@MaxGekk Do you mean we remove the option for column pruning in csv?
I mean reverting back the index mapping - If

Not sure though, the
MaxGekk left a comment:
Taking into account that spark.sql.csv.parser.columnPruning.enabled is set to true by default, I think the performance of the other path does not matter so much. For me, both solutions are ok.
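For context, the flag under discussion can be toggled at runtime; a minimal sketch:

```scala
// Column pruning in the CSV parser is enabled by default; the slower path
// discussed here is only taken when the flag is explicitly turned off.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
```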
@HyukjinKwon WDYT?

@HyukjinKwon kindly ping

Eh .. actually can we revive 64fad0b#diff-d19881aceddcaa5c60620fdcda99b4c4L79? This sounds safer to me.

ok, will revive the code here

Test build #92590 has finished for PR 21657 at commit

@HyukjinKwon @MaxGekk plz check?
It seems the code here and at the line https://github.com/apache/spark/pull/21657/files#diff-d19881aceddcaa5c60620fdcda99b4c4R51 above is the same. Could you fold the lines?
ok
I would apply this small optimization: java.lang.Integer.valueOf(dataSchema.indexOf(f))
ok
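Presumably the suggestion leans on the JDK's small-integer cache; a sketch of that behavior, not code from the PR:

```scala
// Integer.valueOf caches boxed values in [-128, 127], so boxing small column
// indices through it reuses cached objects instead of allocating new ones.
val a = java.lang.Integer.valueOf(42)
val b = java.lang.Integer.valueOf(42)
println(a eq b)  // true: both references point at the same cached instance
```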
Using lazy for tokenIndexArr means the internal laziness flag will be checked for each token. I would remove lazy from tokenIndexArr.
ok
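A minimal sketch of the trade-off with hypothetical names; a Scala lazy val compiles to a guarded initialization that is re-checked on every read:

```scala
class Example(dataSchema: Seq[String], requiredSchema: Seq[String]) {
  // lazy val: every access first checks an internal "initialized" flag,
  // which is wasted work on a hot per-token path.
  lazy val lazyIndexArr: Array[Int] =
    requiredSchema.map(f => dataSchema.indexOf(f)).toArray

  // plain val: computed once at construction; reads are direct field accesses.
  val eagerIndexArr: Array[Int] =
    requiredSchema.map(f => dataSchema.indexOf(f)).toArray
}
```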
Test build #92629 has finished for PR 21657 at commit
Could you convert it to an Array explicitly? I have checked the type of tokenIndexArr; it is actually scala.collection.immutable.$colon$colon, i.e. a linked List.
Any side-effect?
You have O(n) instead of O(1) for getting a value from the collection by an index.
ah, I see. I'll recheck. Thanks!
ok, fixed.
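A quick sketch of the indexing-cost difference behind this exchange:

```scala
val asList: List[Int] = (0 until 100000).toList
val asArray: Array[Int] = asList.toArray

// List(i) walks i cons cells from the head: O(n) per lookup.
val slow = asList(99999)
// Array(i) is a direct offset into contiguous memory: O(1) per lookup.
val fast = asArray(99999)
```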
cc: @HyukjinKwon

Test build #92703 has finished for PR 21657 at commit

retest this please

Test build #92704 has finished for PR 21657 at commit

retest this please

Test build #92706 has finished for PR 21657 at commit

@MaxGekk, does this look good to you in general?

@HyukjinKwon yes
```scala
// This index is used to reorder parsed tokens
private val tokenIndexArr =
  requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray
```
Just in case, we can do a memory optimization here. The array is used only under the options.columnPruning flag. We could create an empty array (or null) if options.columnPruning is set to false.
This array is used in both cases: line 56 (options.columnPruning=true) and line 208 (options.columnPruning=false)?
ah, I see
```diff
   val options: CSVOptions) extends Logging {
   require(requiredSchema.toSet.subsetOf(dataSchema.toSet),
-    "requiredSchema should be the subset of schema.")
+    "requiredSchema should be the subset of dataSchema.")
```
Nit: generally, we should consider printing out the schemas.
ok
Test build #92834 has finished for PR 21657 at commit
@HyukjinKwon ping

@HyukjinKwon @gatorsmile Would you mind merging the PR?

Please let me do one pass within a few days.

ok to test

retest this please
gatorsmile left a comment:
Thanks for finding and fixing this issue!
LGTM except a few comments.
```scala
class UnivocityParser(
    dataSchema: StructType,
    requiredSchema: StructType,
```
Could you add the parameter descriptions of dataSchema and requiredSchema above class UnivocityParser?
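One possible shape for the requested docs, as an illustrative sketch (the wording is hypothetical, not what was merged):

```scala
/**
 * Parses CSV text into Spark's internal rows using the Univocity parser.
 *
 * @param dataSchema     the schema of the data in the underlying CSV files
 * @param requiredSchema the schema of the columns the query actually needs;
 *                       must be a subset of dataSchema
 */
class UnivocityParser(
    dataSchema: StructType,
    requiredSchema: StructType,
    val options: CSVOptions) extends Logging
```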
```scala
val dir = path.getAbsolutePath
spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1").write.partitionBy("p")
  .option("header", "true").csv(dir)
var df = spark.read.option("header", true).csv(dir).selectExpr("sum(p)", "count(c0)")
```
Normally, we do not use var for DataFrame even in test cases
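A sketch of the pattern the reviewer prefers, using the same query as the test above (the second query is hypothetical):

```scala
// Bind each DataFrame to its own val instead of reassigning a shared var.
val dfAgg = spark.read.option("header", true).csv(dir).selectExpr("sum(p)", "count(c0)")
val dfRaw = spark.read.option("header", true).csv(dir).select("c0", "c1")
```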
```scala
val tokenizer = {
  val parserSetting = options.asParserSettings
  if (options.columnPruning && requiredSchema.length < dataSchema.length) {
```
Can be simplified to:

```scala
// When to-be-parsed schema is shorter than the to-be-read data schema, we let
// Univocity CSV parser select a sequence of fields for reading by their positions.
if (parsedSchema.length < dataSchema.length)
```
```diff
-  private val schema = if (options.columnPruning) requiredSchema else dataSchema
   private val row = new GenericInternalRow(schema.length)
+  private val parsedSchema = if (options.columnPruning) requiredSchema else dataSchema
```
Add a comment like:

```scala
// When column pruning is enabled, the parser only parses the required columns
// based on their positions in the data schema.
```
ok
```diff
   var i = 0
-  while (i < schema.length) {
+  while (i < requiredSchema.length) {
     row(i) = valueConverters(i).apply(tokens(i))
```
Add the comment like:

```scala
// When the length of the returned tokens is identical to the length of the
// parsed schema, we just need to convert the tokens that correspond to the
// required columns.
```
ok
```diff
   private def convert(tokens: Array[String]): InternalRow = {
-    if (tokens.length != schema.length) {
+    if (tokens.length != parsedSchema.length) {
```
If possible, could you add a test case that satisfies tokens.length != parsedSchema.length?
will do
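A minimal sketch of such a test, with hypothetical paths; a row with fewer tokens than the header exercises the tokens.length != parsedSchema.length branch:

```scala
import java.nio.file.Files

// The second data row is missing its last token, so the parser sees fewer
// tokens than the three-column parsed schema.
val dir = Files.createTempDirectory("csv-short-row")
Files.write(dir.resolve("data.csv"), "c0,c1,c2\n0,1,2\n3,4\n".getBytes)

// In the default PERMISSIVE mode the missing trailing value should come back
// as null rather than failing the query.
val df = spark.read.option("header", true).csv(dir.toString).select("c2")
df.show()
```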
Test build #93010 has finished for PR 21657 at commit
HyukjinKwon left a comment:
LGTM too
Test build #93015 has finished for PR 21657 at commit
Thanks! Merged to master
What changes were proposed in this pull request?
This PR modifies the code to project the required data from the CSV parsed data when column pruning is disabled.

In the current master, an exception happens if spark.sql.csv.parser.columnPruning.enabled is false. This is because the required schema and the CSV parsed schema differ from each other.

How was this patch tested?
Added tests in CSVSuite.
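Putting the pieces together, a sketch of the scenario the new tests cover, reconstructed from the test snippet shown earlier in the review (names and values follow that snippet; `dir` is assumed):

```scala
// Write a partitioned CSV, then aggregate with column pruning disabled.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)

spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1")
  .write.partitionBy("p").option("header", "true").csv(dir)

// Before this fix, the non-pruning path handed the full parsed row to code
// expecting only the required columns, so queries like this could fail.
val df = spark.read.option("header", true).csv(dir).selectExpr("sum(p)", "count(c0)")
df.show()
```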