infra: use spark connect to run pytests#2491
Conversation
| # Hive/metastore files
| metastore_db/
|
| # Spark/metastore files
| spark-warehouse/
| derby.log
no longer needed since we no longer run spark and metastore locally
| CLEANUP_COMMAND = echo "Keeping containers running for debugging (KEEP_COMPOSE=1)"
| else
| CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans 2>/dev/null || true
| CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans --timeout 0 2>/dev/null || true
Don't wait for `docker compose down`; `--timeout 0` makes teardown more responsive.
| - 8888:8888
| - 8080:8080
Removed port 8888, which was previously used for notebooks, and replaced the Spark master web UI (8080) with the Spark application UI (4040).
| ENV SCALA_VERSION=2.12
| ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_${SCALA_VERSION}
| ENV ICEBERG_VERSION=1.9.2
| ENV PYICEBERG_VERSION=0.10.0
| ENV HADOOP_VERSION=3.3.4
| ENV AWS_SDK_VERSION=1.12.753
Copied over from tests/conftest; these were originally downloaded client-side.
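For reference, a rough sketch of how the client-side download used to be wired up in tests/conftest.py; the config key and jar coordinates below are assumptions reconstructed from the ENV block above, not the original fixture code:

```
# Hypothetical sketch of the previous client-side setup; versions mirror the ENV block above.
from pyspark.sql import SparkSession

scala_version = "2.12"
iceberg_version = "1.9.2"
runtime = f"iceberg-spark-runtime-3.5_{scala_version}"

spark = (
    SparkSession.builder
    # The Iceberg runtime and AWS bundle were previously fetched by the client at session start.
    .config(
        "spark.jars.packages",
        f"org.apache.iceberg:{runtime}:{iceberg_version},"
        f"org.apache.iceberg:iceberg-aws-bundle:{iceberg_version}",
    )
    .getOrCreate()
)
```

Moving these into the Spark image means the jars are baked in at build time instead of being re-downloaded by every local test session.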
| # Configure Spark's default session catalog (spark_catalog) to use Iceberg backed by the Hive Metastore
| spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
| spark.sql.catalog.spark_catalog.type hive
| spark.sql.catalog.spark_catalog.uri thrift://hive:9083
| spark.hadoop.fs.s3a.endpoint http://minio:9000
| spark.sql.catalogImplementation hive
| spark.sql.warehouse.dir s3a://warehouse/hive/
spark_catalog is primarily used by the test_migrate_table test. It calls <catalog>.system.snapshot which requires spark_catalog
It requires the SparkSessionCatalog 👍
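For context, a rough sketch of the kind of call that needs the SparkSessionCatalog; the table names are illustrative, not the actual fixtures used by test_migrate_table:

```
# The snapshot procedure reads a non-Iceberg (Hive) table registered in the session
# catalog and writes an Iceberg snapshot table, so spark_catalog must be a SparkSessionCatalog.
spark.sql(
    """
    CALL spark_catalog.system.snapshot(
        source_table => 'default.some_hive_table',
        table => 'default.some_hive_table_snapshot'
    )
    """
)
```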
tests/integration/test_add_files.py
| @pytest.mark.integration
| def test_add_files_snapshot_properties(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
| identifier = f"default.unpartitioned_table_v{format_version}"
| identifier = f"default.snapshot_properties_v{format_version}"
name conflict with another test above
iceberg-python/tests/integration/test_add_files.py, lines 160 to 161 in 6935b41
I actually prefer to use the test-name in the table name:
| identifier = f"default.snapshot_properties_v{format_version}" | |
| identifier = f"default. test_add_files_snapshot_properties_v{format_version}" |
This way we can relate the two 👍
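For illustration, a minimal sketch of the renamed test; imports are spelled out here for completeness and the body is elided (the real test lives in tests/integration/test_add_files.py):

```
import pytest
from pyiceberg.catalog import Catalog
from pyspark.sql import SparkSession


@pytest.mark.integration
def test_add_files_snapshot_properties(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
    # Embedding the test name in the identifier avoids clashes with other tests
    # and makes a leftover table easy to trace back to the test that created it.
    identifier = f"default.test_add_files_snapshot_properties_v{format_version}"
    ...
```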
| # ========================
|
| PYTEST_ARGS ?= -v # Override with e.g. PYTEST_ARGS="-vv --tb=short"
| PYTEST_ARGS ?= -v -x # Override with e.g. PYTEST_ARGS="-vv --tb=short"
`-x` so the run exits immediately on the first failure (e.g. after a test is interrupted with Ctrl-C).
-x, --exitfirst exit instantly on first error or failed test.
https://docs.pytest.org/en/6.2.x/reference.html#command-line-flags
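For example, the default can still be overridden per run with something like `make test-integration PYTEST_ARGS="-vv --tb=short"` (the target name here is assumed from the Makefile context).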
Fokko left a comment:
I like this a lot! It consolidates a lot of the configuration, thanks for working on this 👍
# Rationale for this change
Add two new Make commands: `make notebook` spins up a Jupyter notebook, and
`make notebook-infra` spins up a Jupyter notebook along with the integration
test infrastructure.
### Pyiceberg Example Notebook
The PyIceberg example notebook (`notebooks/pyiceberg_example.ipynb`) is
based on the
https://py.iceberg.apache.org/#getting-started-with-pyiceberg page and
doesn't require additional test infra.
### Spark Example Notebook
The Spark integration example notebook
(`notebooks/spark_integration_example.ipynb`) is based on
https://iceberg.apache.org/docs/nightly/spark-getting-started/ and
requires the integration test infrastructure (Spark, Iceberg REST catalog, S3).
With Spark Connect (#2491) and our testing setup, we can quickly spin up
a local env with `make test-integration-exec`, which includes:
* Spark
* Iceberg REST catalog
* Hive Metastore
* MinIO
In the Jupyter notebook, connect to Spark easily:
```
from pyspark.sql import SparkSession
# Create SparkSession against the remote Spark Connect server
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.sql("SHOW CATALOGS").show()
```
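As a hedged follow-up, the same Spark Connect session can exercise the integration catalogs; the catalog name `rest` and the table below are assumptions for illustration, not taken from the compose files:

```
# Create and query an Iceberg table through the Spark Connect session.
spark.sql("CREATE NAMESPACE IF NOT EXISTS rest.default")
spark.sql("CREATE TABLE IF NOT EXISTS rest.default.demo (id BIGINT, data STRING) USING iceberg")
spark.sql("INSERT INTO rest.default.demo VALUES (1, 'a'), (2, 'b')")
spark.sql("SELECT * FROM rest.default.demo").show()
```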
## Are these changes tested?
Yes, ran both `make notebook` and `make notebook-infra` locally and ran
the example notebooks.
## Are there any user-facing changes?
# Rationale for this change

Closes #2492

Run pytest using Spark Connect for a more consistent test env.
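A minimal sketch of what a Spark Connect-based test fixture could look like; the fixture name matches how the tests above use it, but the env var and exact implementation are assumptions:

```
import os

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Point the tests at the Spark Connect server exposed by docker-compose
    # instead of starting a local Spark + metastore per test session.
    remote_url = os.environ.get("SPARK_REMOTE", "sc://localhost:15002")
    return SparkSession.builder.remote(remote_url).getOrCreate()
```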
## Are these changes tested?

## Are there any user-facing changes?