infra: use spark connect to run pytests#2491
Conversation
| # Hive/metastore files
| metastore_db/
|
| # Spark/metastore files
| spark-warehouse/
| derby.log
no longer needed since we no longer run spark and metastore locally
| CLEANUP_COMMAND = echo "Keeping containers running for debugging (KEEP_COMPOSE=1)"
| else
| CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans 2>/dev/null || true
| CLEANUP_COMMAND = docker compose -f dev/docker-compose-integration.yml down -v --remove-orphans --timeout 0 2>/dev/null || true
Don't wait for `docker compose down`; `--timeout 0` makes teardown more responsive.
| - 8888:8888
| - 8080:8080
Removed port 8888, which was previously used for notebooks, and replaced the Spark master web UI (8080) with the Spark application UI (4040).
| ENV SCALA_VERSION=2.12
| ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_${SCALA_VERSION}
| ENV ICEBERG_VERSION=1.9.2
| ENV PYICEBERG_VERSION=0.10.0
| ENV HADOOP_VERSION=3.3.4
| ENV AWS_SDK_VERSION=1.12.753
Copied over from tests/conftest; these were originally downloaded client-side.
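For reference, a rough sketch of how the client-side download used to be wired up in tests/conftest.py; the config key and jar coordinates below are assumptions reconstructed from the ENV block above, not the original fixture code:

```
# Hypothetical sketch of the previous client-side setup; versions mirror the ENV block above.
from pyspark.sql import SparkSession

scala_version = "2.12"
iceberg_version = "1.9.2"
runtime = f"iceberg-spark-runtime-3.5_{scala_version}"

spark = (
    SparkSession.builder
    # The Iceberg runtime and AWS bundle were previously fetched by the client at session start.
    .config(
        "spark.jars.packages",
        f"org.apache.iceberg:{runtime}:{iceberg_version},"
        f"org.apache.iceberg:iceberg-aws-bundle:{iceberg_version}",
    )
    .getOrCreate()
)
```

Moving these into the Spark image means the jars are baked in at build time instead of being re-downloaded by every local test session.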
| # Configure Spark's default session catalog (spark_catalog) to use Iceberg backed by the Hive Metastore
| spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
| spark.sql.catalog.spark_catalog.type hive
| spark.sql.catalog.spark_catalog.uri thrift://hive:9083
| spark.hadoop.fs.s3a.endpoint http://minio:9000
| spark.sql.catalogImplementation hive
| spark.sql.warehouse.dir s3a://warehouse/hive/
spark_catalog is primarily used by the test_migrate_table test. It calls <catalog>.system.snapshot which requires spark_catalog
It requires the SparkSessionCatalog 👍
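For context, a rough sketch of the kind of call that needs the SparkSessionCatalog; the table names are illustrative, not the actual fixtures used by test_migrate_table:

```
# The snapshot procedure reads a non-Iceberg (Hive) table registered in the session
# catalog and writes an Iceberg snapshot table, so spark_catalog must be a SparkSessionCatalog.
spark.sql(
    """
    CALL spark_catalog.system.snapshot(
        source_table => 'default.some_hive_table',
        table => 'default.some_hive_table_snapshot'
    )
    """
)
```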
tests/integration/test_add_files.py
| @pytest.mark.integration
| def test_add_files_snapshot_properties(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
| identifier = f"default.unpartitioned_table_v{format_version}"
| identifier = f"default.snapshot_properties_v{format_version}"
name conflict with another test above
iceberg-python/tests/integration/test_add_files.py, lines 160 to 161 in 6935b41
I actually prefer to use the test-name in the table name:
| identifier = f"default.snapshot_properties_v{format_version}" | |
| identifier = f"default. test_add_files_snapshot_properties_v{format_version}" |
This way we can relate the two 👍
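For illustration, a minimal sketch of the renamed test; imports are spelled out here for completeness and the body is elided (the real test lives in tests/integration/test_add_files.py):

```
import pytest
from pyiceberg.catalog import Catalog
from pyspark.sql import SparkSession


@pytest.mark.integration
def test_add_files_snapshot_properties(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
    # Embedding the test name in the identifier avoids clashes with other tests
    # and makes a leftover table easy to trace back to the test that created it.
    identifier = f"default.test_add_files_snapshot_properties_v{format_version}"
    ...
```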
| # ========================
|
| PYTEST_ARGS ?= -v # Override with e.g. PYTEST_ARGS="-vv --tb=short"
| PYTEST_ARGS ?= -v -x # Override with e.g. PYTEST_ARGS="-vv --tb=short"
`-x` so the run exits immediately on the first failure (e.g. after a test is interrupted with Ctrl-C).
-x, --exitfirst exit instantly on first error or failed test.
https://docs.pytest.org/en/6.2.x/reference.html#command-line-flags
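For example, the default can still be overridden per run with something like `make test-integration PYTEST_ARGS="-vv --tb=short"` (the target name here is assumed from the Makefile context).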
Fokko left a comment:
I like this a lot! It consolidates a lot of the configuration, thanks for working on this 👍
# Rationale for this change
Add two new Make commands: `make notebook` spins up a Jupyter notebook, and
`make notebook-infra` spins up a Jupyter notebook along with the integration
test infrastructure.
### Pyiceberg Example Notebook
The PyIceberg example notebook (`notebooks/pyiceberg_example.ipynb`) is
based on the
https://py.iceberg.apache.org/#getting-started-with-pyiceberg page and
doesn't require additional test infra.
### Spark Example Notebook
The Spark integration example notebook
(`notebooks/spark_integration_example.ipynb`) is based on
https://iceberg.apache.org/docs/nightly/spark-getting-started/ and
requires the integration test infrastructure (Spark, Iceberg REST catalog, S3).
With Spark Connect (#2491) and our testing setup, we can quickly spin up
a local env with `make test-integration-exec`, which includes:
* Spark
* Iceberg REST catalog
* Hive Metastore
* MinIO
In the Jupyter notebook, connect to Spark easily:
```
from pyspark.sql import SparkSession
# Create SparkSession against the remote Spark Connect server
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.sql("SHOW CATALOGS").show()
```
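As a hedged follow-up, the same Spark Connect session can exercise the integration catalogs; the catalog name `rest` and the table below are assumptions for illustration, not taken from the compose files:

```
# Create and query an Iceberg table through the Spark Connect session.
spark.sql("CREATE NAMESPACE IF NOT EXISTS rest.default")
spark.sql("CREATE TABLE IF NOT EXISTS rest.default.demo (id BIGINT, data STRING) USING iceberg")
spark.sql("INSERT INTO rest.default.demo VALUES (1, 'a'), (2, 'b')")
spark.sql("SELECT * FROM rest.default.demo").show()
```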
## Are these changes tested?
Yes, ran both `make notebook` and `make notebook-infra` locally and ran
the example notebooks.
## Are there any user-facing changes?
# Rationale for this change

Closes #2492

Run pytest using Spark Connect for a more consistent test env.
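A minimal sketch of what a Spark Connect-based test fixture could look like; the fixture name matches how the tests above use it, but the env var and exact implementation are assumptions:

```
import os

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Point the tests at the Spark Connect server exposed by docker-compose
    # instead of starting a local Spark + metastore per test session.
    remote_url = os.environ.get("SPARK_REMOTE", "sc://localhost:15002")
    return SparkSession.builder.remote(remote_url).getOrCreate()
```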
## Are these changes tested?

## Are there any user-facing changes?