@zhengruifeng
Contributor

What changes were proposed in this pull request?

Propagate the cached schema in DataFrame operations that leave the schema unchanged:

  • DataFrame.alias
  • DataFrame.coalesce
  • DataFrame.repartition
  • DataFrame.repartitionByRange
  • DataFrame.dropDuplicates
  • DataFrame.distinct
  • DataFrame.filter
  • DataFrame.where
  • DataFrame.limit
  • DataFrame.sort
  • DataFrame.sortWithinPartitions
  • DataFrame.orderBy
  • DataFrame.sample
  • DataFrame.hint
  • DataFrame.randomSplit
  • DataFrame.observe

Why are the changes needed?

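To avoid unnecessary RPCs where possible: when the parent DataFrame's schema is already cached, the derived DataFrame can reuse it instead of requesting it again.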
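
For illustration only, here is a minimal sketch of the idea. It is not the code in this patch: `ToyDataFrame`, `_cached_schema`, `_derive`, and the `propagate` flag are hypothetical names, and the schema RPC is simulated by a plain Python callable that counts its calls.

```python
from typing import Callable, List, Optional


class ToyDataFrame:
    """Hypothetical stand-in for a client-side DataFrame (not the PySpark class)."""

    def __init__(self, fetch_schema: Callable[[], List[str]]) -> None:
        self._fetch_schema = fetch_schema  # simulates the schema RPC
        self._cached_schema: Optional[List[str]] = None

    @property
    def schema(self) -> List[str]:
        # Contact the "server" only when nothing is cached yet.
        if self._cached_schema is None:
            self._cached_schema = self._fetch_schema()
        return self._cached_schema

    def _derive(self, propagate: bool) -> "ToyDataFrame":
        child = ToyDataFrame(self._fetch_schema)
        if propagate:
            # The operation keeps the schema, so the parent's cache can be reused.
            child._cached_schema = self._cached_schema
        return child

    def filter(self, condition: str, propagate: bool = True) -> "ToyDataFrame":
        # The condition is ignored here; only the schema handling is modelled.
        return self._derive(propagate)


counter = {"rpcs": 0}


def fake_schema_rpc() -> List[str]:
    # Assumed schema, purely for the demo.
    counter["rpcs"] += 1
    return ["id", "name"]


df = ToyDataFrame(fake_schema_rpc)
df.schema                                   # first access triggers the simulated RPC

filtered = df.filter("id > 1")
filtered_schema = filtered.schema           # served from the propagated cache
calls_with_propagation = counter["rpcs"]

unpropagated = df.filter("id > 1", propagate=False)
unpropagated.schema                         # has to ask the "server" again
calls_without_propagation = counter["rpcs"]
```

In this toy model, reading the schema of the filtered DataFrame costs no additional call when the cache is propagated, and one additional call when it is not.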

Does this PR introduce any user-facing change?

No, this is an optimization only.

How was this patch tested?

Added tests.

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the py_connect_propagate_schema branch June 13, 2024 08:12