@yaooqinn
Member

What changes were proposed in this pull request?

Replacing String.format yields a roughly 177x speedup (see the benchmark below).

SparkStringUtils.getHexString is widely used by:

  • the Spark Thrift Server to convert binary to string when sending results to clients
  • the Spark SQL shell for display
  • the Spark Shell when calling show
  • the Spark Connect Scala client when stringifying binaries in Arrow vectors

OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
Apple M2 Max
Cardinality 100000:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                             42210          43595        1207          0.0      422102.9       1.0X
Java                                                238            243           2          0.4        2381.9     177.2X
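
To illustrate the kind of change described above, here is a minimal Java sketch. It is not the code from this PR. It compares a per-byte String.format baseline with java.util.HexFormat (available since JDK 17). The class and method names are hypothetical, and both the bracketed, space-separated uppercase output and the choice of HexFormat as the replacement are assumptions; the PR description only states that String.format was replaced.

```java
import java.util.HexFormat;

// Hypothetical illustration class; not part of Spark.
final class HexStringSketch {
    // Assumed output shape: uppercase hex pairs separated by spaces.
    private static final HexFormat SPACED_UPPER_HEX =
        HexFormat.of().withDelimiter(" ").withUpperCase();

    // Baseline: one String.format call per byte. Each call parses the
    // format string again and allocates a new String.
    static String hexViaStringFormat(byte[] bytes) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i]));
        }
        return sb.append(']').toString();
    }

    // Alternative: format the whole array in a single HexFormat call.
    static String hexViaHexFormat(byte[] bytes) {
        return "[" + SPACED_UPPER_HEX.formatHex(bytes) + "]";
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xDE, (byte) 0xAD, 0x00, 0x7F};
        System.out.println(hexViaStringFormat(sample)); // [DE AD 00 7F]
        System.out.println(hexViaHexFormat(sample));    // [DE AD 00 7F]
    }
}
```

Both methods produce the same string for the same input, so a swap of this kind would leave displayed results unchanged. The per-byte format-string parsing in the baseline is one plausible source of the gap between the two benchmark rows, though the benchmark output itself does not say which implementation each row measures.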

Why are the changes needed?

Performance improvement.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Verified by the results of the existing binary*.sql test files.

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jun 14, 2024
@yaooqinn yaooqinn requested review from cloud-fan and hvanhovell June 14, 2024 05:53
@yaooqinn
Member Author

"(?s)" + out.result() // (?s) enables dotall mode, causing "." to match new lines
}

/**
Contributor

Is this the function that should be deleted in #42184?

Member Author

Sorry, I cannot answer your question since I wasn't there.

Contributor

Both org.apache.spark.sql.catalyst.util.StringUtils and getHexString are public; should we keep them?

Member Author

The catalyst package is private by contract, so these are not public APIs.

@yaooqinn
Member Author

Thank you @HyukjinKwon, merged to master.

@yaooqinn yaooqinn closed this in 0c16624 Jun 17, 2024
@yaooqinn yaooqinn deleted the SPARK-48627 branch June 17, 2024 05:43