Skip to content

Conversation

@kevinjqliu
Copy link
Contributor

Fix #482

While working on #444, I couldn't get integration tests working since using spark.sql to count the number of data files returns the wrong result. This becomes more bizarre when Python's iceberg table state is not the same as Spark's iceberg table state.

This PR adds a simple test to verify that "python's iceberg table snapshot id" is the same as "spark's iceberg table's snapshot id".

The culprit is the spark.sql.catalog.catalog-name.cache-enabled setting, which defaults to True and caches table metadata.
Spark sql calls will use the cached iceberg metadata instead of the updated one.

spark.sql.catalog.catalog-name.cache-enabled	true or false	Whether to enable catalog cache, default value is true

https://iceberg.apache.org/docs/latest/spark-configuration/#catalog-configuration

@kevinjqliu kevinjqliu changed the title Fix issuing multiple spark sql queries in tests Fix issue with running multiple spark sql queries in the same integration test Mar 6, 2024
@amogh-jahagirdar amogh-jahagirdar changed the title Fix issue with running multiple spark sql queries in the same integration test Disable Spark Catalog caching for integration tests Mar 6, 2024
@amogh-jahagirdar
Copy link
Contributor

Thanks @kevinjqliu I think this change makes sense. I don't think there's ever a reason on the Python side where we want to have the spark caching enabled. On the Iceberg Java side we do have tests which validate the caching catalog behavior when it's enabled/disabled so we don't need to test that through PyIceberg (I think). I've triggered CI if it passes, I'll go ahead and merge



@pytest.mark.integration
def test_multiple_spark_sql_queries(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think a better name would be test_python_writes_with_spark_snapshot_reads or something more specific than what it currently is . It's mre verbose but I think it captures the goal of the test better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, added!

@sungwy
Copy link
Collaborator

sungwy commented Mar 7, 2024

Great idea @kevinjqliu ! Thanks for adding this

@amogh-jahagirdar
Copy link
Contributor

Sweet, thanks @kevinjqliu! I'm going to go ahead and merge this now.

@amogh-jahagirdar amogh-jahagirdar merged commit e56326d into apache:main Mar 7, 2024
@kevinjqliu kevinjqliu deleted the kevinjqliu/fix-multiple-spark-sql-queries branch March 7, 2024 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spark <> Iceberg bug integration test

3 participants