Comparing changes

base repository: HyukjinKwon/spark
base: 6b76741
head repository: HyukjinKwon/spark
compare: 1431a4a
  • 5 commits
  • 36 files changed
  • 5 contributors

Commits on Jan 13, 2022

  1. [SPARK-37686][PYTHON][SQL] Use _invoke_function helpers for all pyspark.sql.functions
    
    ### What changes were proposed in this pull request?
    
    This PR converts the functions not covered by SPARK-32084 to the `_invoke_function` style.
    
    Two new `_invoke` helpers were added to cover common cases:
    
    - `_invoke_function_over_columns`
    - `_invoke_function_over_seq_of_columns`
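
The helper pattern can be sketched without a Spark installation. This is only an illustration of the idea, not the code from the PR: the `_FAKE_JVM_FUNCTIONS` table and the string-based `_to_java_column` are hypothetical stand-ins for the JVM function lookup and column conversion, and the real helpers may differ in signature and behavior.

```python
from typing import Any, Callable, Dict, Iterable

# Hypothetical stand-in for the JVM-side function table; real PySpark
# resolves functions through the Py4J gateway instead.
_FAKE_JVM_FUNCTIONS: Dict[str, Callable[..., str]] = {
    "upper": lambda c: f"upper({c})",
    "atan2": lambda a, b: f"atan2({a}, {b})",
    "concat": lambda cols: "concat(" + ", ".join(cols) + ")",
}


def _to_java_column(col: Any) -> str:
    """Single place for argument type checking (strings only in this sketch)."""
    if not isinstance(col, str):
        raise TypeError(f"Invalid argument, not a string or column: {col!r}")
    return col


def _invoke_function(name: str, *args: Any) -> str:
    """Look up a function by name and call it with already-converted arguments."""
    return _FAKE_JVM_FUNCTIONS[name](*args)


def _invoke_function_over_columns(name: str, *cols: Any) -> str:
    """Convert each positional argument to a column, then invoke the function."""
    return _invoke_function(name, *(_to_java_column(c) for c in cols))


def _invoke_function_over_seq_of_columns(name: str, cols: Iterable[Any]) -> str:
    """Convert a sequence of columns and pass it as a single argument."""
    return _invoke_function(name, [_to_java_column(c) for c in cols])


# Public wrappers shrink to one line each, with no repeated type checks.
def upper(col: str) -> str:
    return _invoke_function_over_columns("upper", col)


def atan2(col1: str, col2: str) -> str:
    return _invoke_function_over_columns("atan2", col1, col2)


def concat(*cols: str) -> str:
    return _invoke_function_over_seq_of_columns("concat", cols)
```

Because conversion and validation live in the helpers, each public function body reduces to a single call.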
    
    ### Why are the changes needed?
    
    To reduce boilerplate (especially related to type checking) and improve manageability.
    
    Additionally, it opens an opportunity to reduce driver-side invocation latency.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes apache#34951 from zero323/SPARK-37686.
    
    Authored-by: zero323 <mszymkiewicz@gmail.com>
    Signed-off-by: zero323 <mszymkiewicz@gmail.com>
    zero323 committed Jan 13, 2022
    0e186e8
  2. [SPARK-35703][SQL][FOLLOWUP] Only eliminate shuffles if partition keys contain all the join keys
    
    ### What changes were proposed in this pull request?
    
    This is a followup of apache#32875. That PR made two improvements:
    1. allow bucket join even if the bucket hash function is different from Spark's shuffle hash function
    2. allow bucket join even if the hash partition keys are a subset of the join keys.
    
    The first improvement is the main goal of the SPIP "storage partition join". The second improvement is a side effect of the framework refactoring and was not planned.
    
    This PR disables the second improvement by default, because skipping the shuffle may introduce a performance regression when the data is skewed. More design work is needed before enabling it, such as checking the NDV (number of distinct values).
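
The rule described above can be sketched as a small predicate. This is a simplified illustration, not Spark's planner logic: the function name and the `allow_subset` flag are hypothetical, and the real check also deals with key ordering and expression matching.

```python
from typing import Sequence


def can_skip_shuffle(
    partition_keys: Sequence[str],
    join_keys: Sequence[str],
    allow_subset: bool = False,  # hypothetical switch for the second improvement
) -> bool:
    """Decide whether an existing hash partitioning can serve a join without a shuffle."""
    pk, jk = set(partition_keys), set(join_keys)
    # Partitioning on a column that is not a join key never co-locates matching rows.
    if not pk or not pk <= jk:
        return False
    # Default: the partition keys must contain all the join keys.
    # With allow_subset=True a strict subset is accepted as well, which
    # avoids the shuffle but can concentrate skewed keys in few partitions.
    return allow_subset or pk == jk
```

For example, data partitioned by `["a"]` joined on `["a", "b"]` is shuffled by default and only reuses the existing partitioning when the subset behaviour is switched on.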
    
    ### Why are the changes needed?
    
    Avoid a performance regression.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    Closes apache#35138 from cloud-fan/join.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Jan 13, 2022
    4b4ff4b
  3. [SPARK-37864][SQL] Support vectorized read boolean values use RLE encoding with Parquet DataPage V2
    
    ### What changes were proposed in this pull request?
    Parquet V2 data pages write Boolean values using RLE encoding. Reading such Boolean values through the vectorized reader currently throws the following exception:
    
    ```java
    Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: RLE
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.getValuesReader(VectorizedColumnReader.java:305) ~[classes/:?]
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:277) ~[classes/:?]
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:344) ~[classes/:?]
        at
    ```
    
    This PR extends `readBooleans` and `skipBooleans` of `VectorizedRleValuesReader` so that the above scenario succeeds.
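
For readers unfamiliar with the encoding, here is a minimal sketch of decoding Parquet's RLE/bit-packing hybrid format at bit width 1, which is how Boolean values are laid out. It is not the `VectorizedRleValuesReader` code; the function names are hypothetical, and the sketch assumes that any length prefix in front of the encoded runs has already been stripped.

```python
from typing import List, Tuple


def _read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_rle_booleans(buf: bytes, count: int) -> List[bool]:
    """Decode `count` Booleans from RLE/bit-packed hybrid runs (bit width 1)."""
    out: List[bool] = []
    pos = 0
    while len(out) < count:
        header, pos = _read_uleb128(buf, pos)
        if header & 1 == 0:
            # RLE run: (header >> 1) repetitions of one value stored in one byte.
            run_length = header >> 1
            value = bool(buf[pos] & 1)
            pos += 1
            out.extend([value] * run_length)
        else:
            # Bit-packed run: (header >> 1) groups of 8 values, one byte per
            # group at bit width 1, least significant bit first.
            for _ in range(header >> 1):
                byte = buf[pos]
                pos += 1
                out.extend(bool((byte >> bit) & 1) for bit in range(8))
    # The last bit-packed group may be padded, so trim to the requested count.
    return out[:count]
```

A skip operation can reuse the same run structure: it advances through run headers and discards values instead of appending them.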
    
    ### Why are the changes needed?
    To support Parquet V2 data page RLE encoding in the vectorized read path.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added a new test case.
    
    Closes apache#35163 from LuciferYang/SPARK-37864.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: Chao Sun <sunchao@apple.com>
    LuciferYang authored and sunchao committed Jan 13, 2022
    9980555
  4. [SPARK-37900][CORE] Use SparkMasterRegex.KUBERNETES_REGEX in `SecurityManager`
    
    ### What changes were proposed in this pull request?
    
    This PR removes `SecurityManager.k8sRegex` and uses `SparkMasterRegex.KUBERNETES_REGEX` in `SecurityManager` instead.
    
    ### Why are the changes needed?
    
    `SparkMasterRegex.KUBERNETES_REGEX` is more accurate and official than the existing `val k8sRegex = "k8s.*".r` pattern.
    
    https://github.com/apache/spark/blob/99805558fc80743747f32c7008cb7cc99c1cda01/core/src/main/scala/org/apache/spark/SparkContext.scala#L3063
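
The difference between a loose and a strict pattern can be shown with Python's `re` module. The old pattern `k8s.*` is quoted above; the strict pattern `k8s://(.*)` is an assumption about the shape of `SparkMasterRegex.KUBERNETES_REGEX` and should be checked against the linked source. `fullmatch` is used here on the assumption that the Scala code matches the whole master string.

```python
import re

# Pattern quoted in the commit message.
OLD_K8S_REGEX = re.compile(r"k8s.*")
# Assumed shape of SparkMasterRegex.KUBERNETES_REGEX (not confirmed here).
STRICT_K8S_REGEX = re.compile(r"k8s://(.*)")


def is_k8s_master(master: str, pattern: "re.Pattern[str]") -> bool:
    """Return True when the whole master string matches the pattern."""
    return pattern.fullmatch(master) is not None
```

With these definitions, a master such as `k8s://https://host:6443` is accepted by both patterns, while a string like `k8s-not-a-url` is accepted only by the loose one.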
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs with the existing test coverage.
    
    Closes apache#35195 from dongjoon-hyun/SPARK-37900.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Jan 13, 2022
    f7dd37c
  5. [SPARK-37887][CORE] Fix the check of repl log level

    ### What changes were proposed in this pull request?
    
    This patch fixes the check of the REPL's log level, so that we can correctly determine whether a log level has been set for the REPL class.
    
    ### Why are the changes needed?
    
    As with the check in `SparkShellLoggingFilter`, `getLevel` can no longer be used to check whether a log level has been set for a logger in log4j2.
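
The underlying problem is the difference between a logger's effective level and an explicitly configured one. The sketch below is an analogy using Python's standard `logging` module, not log4j2 code: it assumes that log4j2's `getLevel` reports an inherited level much as `getEffectiveLevel()` does here, so it cannot reveal whether the logger itself was configured. The logger names are hypothetical.

```python
import logging


def is_level_explicitly_set(logger: logging.Logger) -> bool:
    """True only when a level was configured on this logger itself."""
    return logger.level != logging.NOTSET


# Hypothetical logger hierarchy: the parent is configured, the child is not.
parent_logger = logging.getLogger("hypothetical_app")
parent_logger.setLevel(logging.WARNING)
repl_logger = logging.getLogger("hypothetical_app.repl")

# The effective level is inherited from the parent and is never "unset",
# so it says nothing about whether the child was configured.
inherited_level = repl_logger.getEffectiveLevel()
configured_before = is_level_explicitly_set(repl_logger)

# After an explicit configuration the check flips.
repl_logger.setLevel(logging.INFO)
configured_after = is_level_explicitly_set(repl_logger)
```

In this analogy, a check based on the effective level would always report a level, which mirrors why the log4j2 check had to change.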
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually verified locally.
    
    Closes apache#35198 from viirya/SPARK-37887.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    viirya authored and dongjoon-hyun committed Jan 13, 2022
    1431a4a