
Conversation

@dcoliversun (Contributor)

What changes were proposed in this pull request?

This PR adds previously undocumented Parquet configurations to the documentation.

Why are the changes needed?

Helps users look up configurations in the documentation instead of having to read the source code.

Does this PR introduce any user-facing change?

Yes, more configurations are documented.

How was this patch tested?

Passes the existing GitHub Actions (GA) checks.

@github-actions bot added the DOCS label on Oct 8, 2022.
@dcoliversun (Author) left a comment:

cc @HyukjinKwon @dongjoon-hyun
It would be great if you have time to review :)

@dcoliversun added inline comments on the new table rows, each quoting the corresponding definition in SQLConf:

```html
  <td>1.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.int96TimestampConversion</code></td>
```

```scala
val PARQUET_INT96_TIMESTAMP_CONVERSION = buildConf("spark.sql.parquet.int96TimestampConversion")
  .doc("This controls whether timestamp adjustments should be applied to INT96 data when " +
    "converting to timestamps, for data written by Impala. This is necessary because Impala " +
    "stores INT96 data with a different timezone offset than Hive & Spark.")
  .version("2.3.0")
  .booleanConf
  .createWithDefault(false)
```
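
A minimal sketch of applying this setting at read time. Here and in the sketches below, `spark` is the session predefined in spark-shell and all paths are hypothetical:

```scala
// Apply the Impala timezone adjustment to INT96 timestamps while reading.
spark.conf.set("spark.sql.parquet.int96TimestampConversion", "true")

// Hypothetical path to a table written by Impala.
val df = spark.read.parquet("/warehouse/impala_events")
df.select("event_time").show()
```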

```html
  <td>2.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.outputTimestampType</code></td>
```

```scala
val PARQUET_OUTPUT_TIMESTAMP_TYPE = buildConf("spark.sql.parquet.outputTimestampType")
  .doc("Sets which Parquet timestamp type to use when Spark writes data to Parquet files. " +
    "INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS " +
    "is a standard timestamp type in Parquet, which stores number of microseconds from the " +
    "Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which " +
    "means Spark has to truncate the microsecond portion of its timestamp value.")
  .version("2.3.0")
  .stringConf
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(ParquetOutputTimestampType.values.map(_.toString))
  .createWithDefault(ParquetOutputTimestampType.INT96.toString)
```
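
The accepted values are INT96, TIMESTAMP_MICROS, and TIMESTAMP_MILLIS. A minimal sketch of switching to the standard micros type at write time:

```scala
import spark.implicits._

// Write timestamps as standard TIMESTAMP_MICROS instead of legacy INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

Seq(java.sql.Timestamp.valueOf("2022-10-08 12:00:00"))
  .toDF("ts")
  .write.mode("overwrite").parquet("/tmp/ts_micros")
```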

```html
  <td>1.2.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.aggregatePushdown</code></td>
```

```scala
val PARQUET_AGGREGATE_PUSHDOWN_ENABLED = buildConf("spark.sql.parquet.aggregatePushdown")
  .doc("If true, aggregates will be pushed down to Parquet for optimization. Support MIN, MAX " +
    "and COUNT as aggregate expression. For MIN/MAX, support boolean, integer, float and date " +
    "type. For COUNT, support all data types. If statistics is missing from any Parquet file " +
    "footer, exception would be thrown.")
  .version("3.3.0")
  .booleanConf
  .createWithDefault(false)
```
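
A sketch of a query shape that is eligible for pushdown under the constraints in the doc string; note the doc string's caveat that a footer with missing statistics makes the query fail rather than fall back:

```scala
import org.apache.spark.sql.functions.{count, max, min}

spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

// MIN/MAX/COUNT over top-level columns of supported types can be answered
// from Parquet footer statistics without scanning row data.
spark.read.parquet("/tmp/events")
  .agg(min("id"), max("id"), count("id"))
  .show()
```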

```html
  <td>1.5.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.respectSummaryFiles</code></td>
```

```scala
val PARQUET_SCHEMA_RESPECT_SUMMARIES = buildConf("spark.sql.parquet.respectSummaryFiles")
  .doc("When true, we make assumption that all part-files of Parquet are consistent with " +
    "summary files and we will ignore them when merging schema. Otherwise, if this is " +
    "false, which is the default, we will merge all part-files. This should be considered " +
    "as expert-only option, and shouldn't be enabled before knowing what it means exactly.")
  .version("1.5.0")
  .booleanConf
  .createWithDefault(false)
```
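
A sketch of the read pattern this flag affects: a schema-merging read that, with the flag on, trusts the Parquet summary files (_metadata / _common_metadata) instead of opening every part-file footer:

```scala
// Expert-only: assumes all part-files are consistent with the summary files.
spark.conf.set("spark.sql.parquet.respectSummaryFiles", "true")

val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/partitioned_table")
merged.printSchema()
```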

```html
  <td>1.6.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.enableVectorizedReader</code></td>
```

```scala
val PARQUET_VECTORIZED_READER_ENABLED =
  buildConf("spark.sql.parquet.enableVectorizedReader")
    .doc("Enables vectorized parquet decoding.")
    .version("2.0.0")
    .booleanConf
    .createWithDefault(true)
```
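
The usual reason to touch this flag is to disable it and fall back to the row-based parquet-mr reader, e.g. while isolating a decoding issue; a minimal sketch:

```scala
// Fall back to the non-vectorized, row-at-a-time Parquet reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Subsequent Parquet scans in this session take the fallback path.
spark.read.parquet("/tmp/events").count()  // hypothetical path
```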

```html
  <td>2.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.columnarReaderBatchSize</code></td>
```

```scala
val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.columnarReaderBatchSize")
  .doc("The number of rows to include in a parquet vectorized reader batch. The number should " +
    "be carefully chosen to minimize overhead and avoid OOMs in reading data.")
  .version("2.4.0")
  .intConf
  .createWithDefault(4096)
```
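
A sketch of tuning the batch size down for very wide rows or memory-constrained executors (1024 is an arbitrary illustration, not a recommendation):

```scala
// Smaller batches cap the memory held per column batch; larger batches
// amortize per-batch overhead. The default is 4096 rows.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "1024")

val wide = spark.read.parquet("/tmp/wide_table")  // hypothetical wide table
```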

```html
  <td>2.4.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.fieldId.write.enabled</code></td>
```

```scala
val PARQUET_FIELD_ID_WRITE_ENABLED =
  buildConf("spark.sql.parquet.fieldId.write.enabled")
    .doc("Field ID is a native field of the Parquet schema spec. When enabled, " +
      "Parquet writers will populate the field Id " +
      "metadata (if present) in the Spark schema to the Parquet schema.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(true)
```
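
On the Spark side the field ID travels in StructField metadata; to our understanding the metadata key Spark 3.3 expects is `parquet.field.id`. A sketch with an illustrative schema and path:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Attach Parquet field IDs via StructField metadata; with the write flag on,
// the writer copies them into the Parquet schema.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false,
    new MetadataBuilder().putLong("parquet.field.id", 1L).build()),
  StructField("name", StringType, nullable = true,
    new MetadataBuilder().putLong("parquet.field.id", 2L).build())))

val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
df.write.parquet("/tmp/with_field_ids")  // hypothetical path
```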

```html
  <td>3.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.fieldId.read.enabled</code></td>
```

```scala
val PARQUET_FIELD_ID_READ_ENABLED =
  buildConf("spark.sql.parquet.fieldId.read.enabled")
    .doc("Field ID is a native field of the Parquet schema spec. When enabled, Parquet readers " +
      "will use field IDs (if present) in the requested Spark schema to look up Parquet " +
      "fields instead of using column names")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(false)
```
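
A companion sketch to the write example above: with ID-based resolution on, the field IDs in the requested Spark schema (the same illustrative `schema`) are matched against the files even if column names changed:

```scala
// Resolve Parquet columns by field ID instead of by name.
spark.conf.set("spark.sql.parquet.fieldId.read.enabled", "true")

// `schema` carries the parquet.field.id metadata from the previous sketch.
val restored = spark.read.schema(schema).parquet("/tmp/with_field_ids")
```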

```html
  <td>3.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.fieldId.read.ignoreMissing</code></td>
```

```scala
val IGNORE_MISSING_PARQUET_FIELD_ID =
  buildConf("spark.sql.parquet.fieldId.read.ignoreMissing")
    .doc("When the Parquet file doesn't have any field IDs but the " +
      "Spark read schema is using field IDs to read, we will silently return nulls " +
      "when this flag is enabled, or error otherwise.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(false)
```
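
Continuing the same illustrative setup, a sketch of what changes when the files on disk carry no field IDs at all:

```scala
// With ID-based reads enabled, a file without any field IDs would normally be
// an error; per the doc string, this flag silently returns nulls instead.
spark.conf.set("spark.sql.parquet.fieldId.read.ignoreMissing", "true")

val lenient = spark.read.schema(schema).parquet("/tmp/legacy_no_ids")  // hypothetical
```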

```html
  <td>3.3.0</td>
</tr>
<tr>
  <td><code>spark.sql.parquet.timestampNTZ.enabled</code></td>
```

```scala
val PARQUET_TIMESTAMP_NTZ_ENABLED =
  buildConf("spark.sql.parquet.timestampNTZ.enabled")
    .doc(s"Enables ${TimestampTypes.TIMESTAMP_NTZ} support for Parquet reads and writes. " +
      s"When enabled, ${TimestampTypes.TIMESTAMP_NTZ} values are written as Parquet timestamp " +
      "columns with annotation isAdjustedToUTC = false and are inferred in a similar way. " +
      s"When disabled, such values are read as ${TimestampTypes.TIMESTAMP_LTZ} and have to be " +
      s"converted to ${TimestampTypes.TIMESTAMP_LTZ} for writes.")
    .version("3.4.0")
    .booleanConf
    .createWithDefault(true)
```
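
TIMESTAMP_NTZ corresponds to the external Java type java.time.LocalDateTime; a minimal spark-shell sketch of a round trip:

```scala
import java.time.LocalDateTime
import spark.implicits._

// With the flag on (the default), LocalDateTime values round-trip as
// TIMESTAMP_NTZ, stored with the annotation isAdjustedToUTC = false.
Seq(LocalDateTime.parse("2022-10-08T12:00:00"))
  .toDF("ts_ntz")
  .write.mode("overwrite").parquet("/tmp/ntz_demo")

spark.read.parquet("/tmp/ntz_demo").printSchema()
```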

@srowen (Member) left a comment:

I think these are OK. I don't think any are meant to be hidden or internal-only.
Are these logically ordered?

@dcoliversun (Author) replied:

@srowen Yes, related configurations are grouped together and placed in a logical order.

@srowen closed this in f39b75c on Oct 9, 2022.
@dcoliversun deleted the SPARK-40710 branch on October 10, 2022 at 01:25.
@AmplabJenkins

Can one of the admins verify this patch?
