[WIP][SQL][SPARK-6632]: Read schema from each input split in the ReadSupport hook, reconciling with the metastore schema at that time #5298
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -98,12 +98,32 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging { | |
| val metadata = new JHashMap[String, String]() | ||
| val requestedAttributes = RowReadSupport.getRequestedSchema(configuration) | ||
|
|
||
| // convert fileSchema to attributes | ||
| val fileAttributes = ParquetTypesConverter.convertToAttributes(fileSchema, true, true) | ||
| val fileAttMap = fileAttributes.map(f => f.name.toLowerCase -> f.name).toMap | ||
|
|
||
| if (requestedAttributes != null) { | ||
| // reconcile names of requested Attributes | ||
| val modRequestedAttributes = requestedAttributes.map(attr => { | ||
| val lName = attr.name.toLowerCase | ||
| if (fileAttMap.contains(lName)) { | ||
| attr.withName(fileAttMap(lName)) | ||
| } else { | ||
| if (attr.nullable) { | ||
| attr | ||
| } else { | ||
| // field is not nullable but not present in the parquet file schema!! | ||
| // this is just a safety check since in hive all columns are nullable | ||
| // throw exception here | ||
| throw new RuntimeException(s"""Field ${attr.name} is non-nullable, | ||
| but not found in parquet file schema: ${fileSchema}""".stripMargin) | ||
|
Contributor
This exception message would look pretty weird when printed since there isn't a "margin" character (
throw new RuntimeException(
s"Field ${attr.name} is non-nullable, " +
s"but not found in parquet file schema: ${fileSchema}") |
||
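To make the point about the missing margin character concrete, here is a minimal sketch of how `stripMargin` behaves with and without a leading `|` on the continuation line, next to plain concatenation. The object name, the `fieldName` value, and the shortened message text are illustrative assumptions and are not taken from the patch.

```scala
object StripMarginDemo {
  // Hypothetical field name, used only for this illustration.
  val fieldName = "id"

  // No '|' on the continuation line: stripMargin leaves that line unchanged,
  // so the newline and the leading indentation stay in the message.
  val withoutMargin: String =
    s"""Field $fieldName is non-nullable,
        but not found in parquet file schema""".stripMargin

  // With '|': the leading whitespace and the '|' are removed,
  // but the newline itself is still part of the message.
  val withMargin: String =
    s"""Field $fieldName is non-nullable,
       |but not found in parquet file schema""".stripMargin

  // Plain concatenation, as suggested in the review: a single-line message.
  val concatenated: String =
    s"Field $fieldName is non-nullable, " +
      "but not found in parquet file schema"
}
```

Concatenation is the simplest choice here because the message is meant to be a single line.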
| }}}) | ||
|
|
||
|
Contributor
Are we essentially duplicating
Author
Yes, the difference being that this happens within each task, whereas ParquetRelation2.mergeMetastoreParquetSchema happens on the driver. This eliminates the need for the mergeMetastoreParquetSchema method.
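For readers following the thread, the per-task reconciliation in this hunk can be sketched outside Spark roughly as follows. `Attr` is a hypothetical stand-in for the Catalyst attribute type, and `ReconcileNames.reconcile` is an illustrative name, not a method from the patch; only the case-insensitive name lookup and the nullability check mirror the diff.

```scala
// Hypothetical stand-in for a Catalyst attribute; only name and nullability matter here.
final case class Attr(name: String, nullable: Boolean) {
  def withName(newName: String): Attr = copy(name = newName)
}

object ReconcileNames {
  // Rename requested attributes to the exact casing used in the file schema.
  def reconcile(requested: Seq[Attr], fileFieldNames: Seq[String]): Seq[Attr] = {
    // Lower-cased file field name -> name as written in the Parquet file.
    val fileNameByLower: Map[String, String] =
      fileFieldNames.map(n => n.toLowerCase -> n).toMap

    requested.map { attr =>
      fileNameByLower.get(attr.name.toLowerCase) match {
        case Some(fileName) => attr.withName(fileName) // adopt the file's casing
        case None if attr.nullable => attr             // missing but nullable: keep as requested
        case None =>
          throw new RuntimeException(
            s"Field ${attr.name} is non-nullable, " +
              s"but not found in parquet file schema: ${fileFieldNames.mkString(", ")}")
      }
    }
  }
}
```

With file fields `userId` and `ts`, a requested nullable `userid` would be renamed to `userId`, while a requested nullable column absent from the file is passed through unchanged.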
Contributor
And please don't put multiple |
||
| // If the parquet file is thrift derived, there is a good chance that | ||
| // it will have the thrift class in metadata. | ||
| val isThriftDerived = keyValueMetaData.keySet().contains("thrift.class") | ||
| parquetSchema = ParquetTypesConverter | ||
| .convertFromAttributes(requestedAttributes, isThriftDerived) | ||
| .convertFromAttributes(modRequestedAttributes, isThriftDerived) | ||
| metadata.put( | ||
| RowReadSupport.SPARK_ROW_REQUESTED_SCHEMA, | ||
| ParquetTypesConverter.convertToString(requestedAttributes)) | ||
|
|
||
I'm afraid we can't always set the last two arguments to true; they should be determined according to the corresponding SQLConf configurations.
These booleans are used to determine the data type of each attribute, whereas here we are only interested in the column names, to reconcile them with the metastore schema. Hence it is safe to always pass true for these parameters, since we do not have a SQL context here from which to derive them.
I see, thanks for the explanation. Would you please also add a comment for this?