Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Dec 26, 2023

What changes were proposed in this pull request?

In the PR, I propose to do not catch DataFrame query context in DataFrame methods but leave that close to Column functions.

Why are the changes needed?

To improve user experience with Spark DataFrame/Dataset APIs, and provide more precise context of errors.

Does this PR introduce any user-facing change?

No, since the feature hasn't been released yet.

How was this patch tested?

By running the modified test suites.

Was this patch authored or co-authored using generative AI tooling?

No.

@MaxGekk MaxGekk changed the title [WIP][SQL] Skip query context catching in DataFrame methods [SPARK-46655][SQL] Skip query context catching in DataFrame methods Jan 10, 2024
@MaxGekk MaxGekk marked this pull request as ready for review January 10, 2024 06:34
@MaxGekk MaxGekk requested a review from cloud-fan January 10, 2024 06:34
@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 10, 2024

@cloud-fan Could you look at the PR, please.

private val sparkCodePattern = Pattern.compile("(org\\.apache\\.spark\\.sql\\." +
"(?:functions|Column|ColumnName|SQLImplicits|Dataset|DataFrameStatFunctions|DatasetHolder)" +
"(?:|\\..*|\\$.*))" +
"|(scala\\.collection\\..*)")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added Scala collections classes as Spark classes because they might be mixed with them inside of Spark code.

Comment on lines -717 to -718
queryContext = Array(
ExpectedContext(fragment = "withColumn", callSitePattern = getCurrentClassCallSitePattern)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about to revert changes for the withColumn() method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withColumn takes Column as parameters, why do we lose the error context if Column captures it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because when we throw the exception at UnsupportedOperationChecker:
Screenshot 2024-01-10 at 16 45 27
origin of Window is empty. It contains nothing because when we construct it at Analyzer, we put captured column context inside of a sequence of windowExpressions:
Screenshot 2024-01-10 at 16 58 00

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added origin from the head expression:

          throw QueryExecutionErrors.nonTimeWindowNotSupportedInStreamingError(
            windowFuncList,
            columnNameList,
            windowSpecList,
            windowExpression.head.origin)

for double check. In that case, the query context is propagated obviously.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the related PR: #41578

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan So far, I would revert changes in withColumn. Need to figure out how to properly fix the case of Window + Dataframe query context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we should add withOrigin in select as well? Any API that produces Project or Window should have error context.

@cloud-fan
Copy link
Contributor

@MaxGekk after a second thought, I feel that we don't need dataframe error context for analysis errors. The analysis is eager in dataframe, so people will know the dataframe call site after seeing the stacktrace of the analysis exception.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 11, 2024

after a second thought, I feel that we don't need dataframe error context for analysis errors.

ok, I have reverted all methods in Dataframe. Please, review this PR.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 11, 2024

Merging to master. Thank you, @cloud-fan for review.

@MaxGekk MaxGekk closed this in d307960 Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants