
Conversation

@dcoliversun (Contributor)

What changes were proposed in this pull request?

This PR adds previously undocumented Parquet configurations to the documentation.

Why are the changes needed?

Helps users look up configurations in the documentation instead of having to read the source code.

Does this PR introduce any user-facing change?

Yes, more configurations are documented.

How was this patch tested?

Passes the existing GitHub Actions (GA) checks.

@github-actions bot added the DOCS label on Oct 8, 2022.
@dcoliversun (Author) left a comment:

cc @HyukjinKwon @dongjoon-hyun
It would be great if you have time to review :)

@dcoliversun added inline comments on the new table rows, each quoting the corresponding definition in SQLConf:

```html
  <td>1.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.int96TimestampConversion</code></td>
```

```scala
val PARQUET_INT96_TIMESTAMP_CONVERSION = buildConf("spark.sql.parquet.int96TimestampConversion")
  .doc("This controls whether timestamp adjustments should be applied to INT96 data when " +
    "converting to timestamps, for data written by Impala. This is necessary because Impala " +
    "stores INT96 data with a different timezone offset than Hive & Spark.")
  .version("2.3.0")
  .booleanConf
  .createWithDefault(false)
```
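
A minimal sketch of applying this setting at read time. Here and in the sketches below, `spark` is the session predefined in spark-shell and all paths are hypothetical:

```scala
// Apply the Impala timezone adjustment to INT96 timestamps while reading.
spark.conf.set("spark.sql.parquet.int96TimestampConversion", "true")

// Hypothetical path to a table written by Impala.
val df = spark.read.parquet("/warehouse/impala_events")
df.select("event_time").show()
```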

```html
  <td>2.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.outputTimestampType</code></td>
```

```scala
val PARQUET_OUTPUT_TIMESTAMP_TYPE = buildConf("spark.sql.parquet.outputTimestampType")
  .doc("Sets which Parquet timestamp type to use when Spark writes data to Parquet files. " +
    "INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS " +
    "is a standard timestamp type in Parquet, which stores number of microseconds from the " +
    "Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which " +
    "means Spark has to truncate the microsecond portion of its timestamp value.")
  .version("2.3.0")
  .stringConf
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(ParquetOutputTimestampType.values.map(_.toString))
  .createWithDefault(ParquetOutputTimestampType.INT96.toString)
```
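
The accepted values are INT96, TIMESTAMP_MICROS, and TIMESTAMP_MILLIS. A minimal sketch of switching to the standard micros type at write time:

```scala
import spark.implicits._

// Write timestamps as standard TIMESTAMP_MICROS instead of legacy INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

Seq(java.sql.Timestamp.valueOf("2022-10-08 12:00:00"))
  .toDF("ts")
  .write.mode("overwrite").parquet("/tmp/ts_micros")
```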

```html
  <td>1.2.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.aggregatePushdown</code></td>
```

```scala
val PARQUET_AGGREGATE_PUSHDOWN_ENABLED = buildConf("spark.sql.parquet.aggregatePushdown")
  .doc("If true, aggregates will be pushed down to Parquet for optimization. Support MIN, MAX " +
    "and COUNT as aggregate expression. For MIN/MAX, support boolean, integer, float and date " +
    "type. For COUNT, support all data types. If statistics is missing from any Parquet file " +
    "footer, exception would be thrown.")
  .version("3.3.0")
  .booleanConf
  .createWithDefault(false)
```
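
A sketch of a query shape that is eligible for pushdown under the constraints in the doc string; note the doc string's caveat that a footer with missing statistics makes the query fail rather than fall back:

```scala
import org.apache.spark.sql.functions.{count, max, min}

spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

// MIN/MAX/COUNT over top-level columns of supported types can be answered
// from Parquet footer statistics without scanning row data.
spark.read.parquet("/tmp/events")
  .agg(min("id"), max("id"), count("id"))
  .show()
```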

```html
  <td>1.5.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.respectSummaryFiles</code></td>
```

```scala
val PARQUET_SCHEMA_RESPECT_SUMMARIES = buildConf("spark.sql.parquet.respectSummaryFiles")
  .doc("When true, we make assumption that all part-files of Parquet are consistent with " +
    "summary files and we will ignore them when merging schema. Otherwise, if this is " +
    "false, which is the default, we will merge all part-files. This should be considered " +
    "as expert-only option, and shouldn't be enabled before knowing what it means exactly.")
  .version("1.5.0")
  .booleanConf
  .createWithDefault(false)
```
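
A sketch of the read pattern this flag affects: a schema-merging read that, with the flag on, trusts the Parquet summary files (_metadata / _common_metadata) instead of opening every part-file footer:

```scala
// Expert-only: assumes all part-files are consistent with the summary files.
spark.conf.set("spark.sql.parquet.respectSummaryFiles", "true")

val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/partitioned_table")
merged.printSchema()
```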

```html
  <td>1.6.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.enableVectorizedReader</code></td>
```

```scala
val PARQUET_VECTORIZED_READER_ENABLED =
  buildConf("spark.sql.parquet.enableVectorizedReader")
    .doc("Enables vectorized parquet decoding.")
    .version("2.0.0")
    .booleanConf
    .createWithDefault(true)
```
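
The usual reason to touch this flag is to disable it and fall back to the row-based parquet-mr reader, e.g. while isolating a decoding issue; a minimal sketch:

```scala
// Fall back to the non-vectorized, row-at-a-time Parquet reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Subsequent Parquet scans in this session take the fallback path.
spark.read.parquet("/tmp/events").count()  // hypothetical path
```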

```html
  <td>2.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.columnarReaderBatchSize</code></td>
```

```scala
val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.columnarReaderBatchSize")
  .doc("The number of rows to include in a parquet vectorized reader batch. The number should " +
    "be carefully chosen to minimize overhead and avoid OOMs in reading data.")
  .version("2.4.0")
  .intConf
  .createWithDefault(4096)
```
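
A sketch of tuning the batch size down for very wide rows or memory-constrained executors (1024 is an arbitrary illustration, not a recommendation):

```scala
// Smaller batches cap the memory held per column batch; larger batches
// amortize per-batch overhead. The default is 4096 rows.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "1024")

val wide = spark.read.parquet("/tmp/wide_table")  // hypothetical wide table
```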

```html
  <td>2.4.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.fieldId.write.enabled</code></td>
```

```scala
val PARQUET_FIELD_ID_WRITE_ENABLED =
  buildConf("spark.sql.parquet.fieldId.write.enabled")
    .doc("Field ID is a native field of the Parquet schema spec. When enabled, " +
      "Parquet writers will populate the field Id " +
      "metadata (if present) in the Spark schema to the Parquet schema.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(true)
```
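
On the Spark side the field ID travels in StructField metadata; to our understanding the metadata key Spark 3.3 expects is `parquet.field.id`. A sketch with an illustrative schema and path:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Attach Parquet field IDs via StructField metadata; with the write flag on,
// the writer copies them into the Parquet schema.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false,
    new MetadataBuilder().putLong("parquet.field.id", 1L).build()),
  StructField("name", StringType, nullable = true,
    new MetadataBuilder().putLong("parquet.field.id", 2L).build())))

val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
df.write.parquet("/tmp/with_field_ids")  // hypothetical path
```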

```html
  <td>3.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.fieldId.read.enabled</code></td>
```

```scala
val PARQUET_FIELD_ID_READ_ENABLED =
  buildConf("spark.sql.parquet.fieldId.read.enabled")
    .doc("Field ID is a native field of the Parquet schema spec. When enabled, Parquet readers " +
      "will use field IDs (if present) in the requested Spark schema to look up Parquet " +
      "fields instead of using column names")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(false)
```
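
A companion sketch to the write example above: with ID-based resolution on, the field IDs in the requested Spark schema (the same illustrative `schema`) are matched against the files even if column names changed:

```scala
// Resolve Parquet columns by field ID instead of by name.
spark.conf.set("spark.sql.parquet.fieldId.read.enabled", "true")

// `schema` carries the parquet.field.id metadata from the previous sketch.
val restored = spark.read.schema(schema).parquet("/tmp/with_field_ids")
```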

```html
  <td>3.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.fieldId.read.ignoreMissing</code></td>
```

```scala
val IGNORE_MISSING_PARQUET_FIELD_ID =
  buildConf("spark.sql.parquet.fieldId.read.ignoreMissing")
    .doc("When the Parquet file doesn't have any field IDs but the " +
      "Spark read schema is using field IDs to read, we will silently return nulls " +
      "when this flag is enabled, or error otherwise.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(false)
```
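
Continuing the same illustrative setup, a sketch of what changes when the files on disk carry no field IDs at all:

```scala
// With ID-based reads enabled, a file without any field IDs would normally be
// an error; per the doc string, this flag silently returns nulls instead.
spark.conf.set("spark.sql.parquet.fieldId.read.ignoreMissing", "true")

val lenient = spark.read.schema(schema).parquet("/tmp/legacy_no_ids")  // hypothetical
```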

```html
  <td>3.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.timestampNTZ.enabled</code></td>
```

```scala
val PARQUET_TIMESTAMP_NTZ_ENABLED =
  buildConf("spark.sql.parquet.timestampNTZ.enabled")
    .doc(s"Enables ${TimestampTypes.TIMESTAMP_NTZ} support for Parquet reads and writes. " +
      s"When enabled, ${TimestampTypes.TIMESTAMP_NTZ} values are written as Parquet timestamp " +
      "columns with annotation isAdjustedToUTC = false and are inferred in a similar way. " +
      s"When disabled, such values are read as ${TimestampTypes.TIMESTAMP_LTZ} and have to be " +
      s"converted to ${TimestampTypes.TIMESTAMP_LTZ} for writes.")
    .version("3.4.0")
    .booleanConf
    .createWithDefault(true)
```
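
TIMESTAMP_NTZ corresponds to the external Java type java.time.LocalDateTime; a minimal spark-shell sketch of a round trip:

```scala
import java.time.LocalDateTime
import spark.implicits._

// With the flag on (the default), LocalDateTime values round-trip as
// TIMESTAMP_NTZ, stored with the annotation isAdjustedToUTC = false.
Seq(LocalDateTime.parse("2022-10-08T12:00:00"))
  .toDF("ts_ntz")
  .write.mode("overwrite").parquet("/tmp/ntz_demo")

spark.read.parquet("/tmp/ntz_demo").printSchema()
```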

@srowen (Member) left a comment:

I think these are OK. I don't think any are meant to be hidden or internal-only.
Are these logically ordered?

@dcoliversun (Author) replied:

@srowen Yes, related configurations are grouped together and placed in a logical order.

@srowen closed this in f39b75c on Oct 9, 2022.
@dcoliversun deleted the SPARK-40710 branch on October 10, 2022 at 01:25.
@AmplabJenkins

Can one of the admins verify this patch?
