Add all_manifests metadata table with tests#1241
Conversation
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks for the PR! I've added some comments!
pyiceberg/table/inspect.py
Outdated
| import pyarrow as pa | ||
|
|
||
| all_manifests_schema = get_manifests_schema() | ||
| all_manifests_schema = all_manifests_schema.append(pa.field("reference_snapshot_id", pa.int64(), nullable=False)) |
There was a problem hiding this comment.
interestingly, this isnt in the documentation https://iceberg.apache.org/docs/latest/spark-queries/#all-manifests
but only in the code https://github.com/apache/iceberg/blame/2b55fef7cc2a249d864ac26d85a4923313d96a59/core/src/main/java/org/apache/iceberg/AllManifestsTable.java#L53-L54
There was a problem hiding this comment.
Yes, it's not present in iceberg docs.
| schema=get_all_manifests_schema() if is_all_manifests_table else get_manifests_schema(), | ||
| ) | ||
|
|
||
| def manifests(self) -> "pa.Table": |
There was a problem hiding this comment.
wdyt about adding an optional snapshot_id here? To allow users to look at the manifest for a specific snapshot, with the added benefit to iterate over all snapshot ids for all_manifests
There was a problem hiding this comment.
Yeah, I am aligned with this.
But there are two parameters that I'm passing to _generate_manifests_table method - snapshot_id and a boolean flag whether the output is for all_manifests table which add the additional column to all_manifests table.
So I'll need to add this second parameter for manifests method as well. Thoughts?
There was a problem hiding this comment.
yea i think thats fine since _generate_manifests_table is internal
| for column in df.column_names: | ||
| for left, right in zip(lhs[column].to_list(), rhs[column].to_list()): | ||
| assert left == right, f"Difference in column {column}: {left} != {right}" |
There was a problem hiding this comment.
nit: is it possible to use assert_frame_equal here?
There was a problem hiding this comment.
Yes, making the change.
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! Thanks for adding this metadata table!
|
@soumya-ghosh I see this one is still pending, are you still interested to get this in? |
|
Hey @Fokko nothing major is pending on my side, awaiting your approval. I will resolve the conflicts shortly. |
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! thank you for following up
| schema=get_all_manifests_schema() if is_all_manifests_table else get_manifests_schema(), | ||
| ) | ||
|
|
||
| def manifests(self) -> "pa.Table": |
There was a problem hiding this comment.
yea i think thats fine since _generate_manifests_table is internal
|
@Fokko bumping this up for review |
|
@soumya-ghosh do you mind resolving the conflict? |
|
@kevinjqliu conflict is resolved. As the PR is approved but not merged for over a month now, hence merge conflicts happen occasionally. |
|
Sorry about the delay, this dropped off my radar. Thanks again for the contribution @soumya-ghosh ! |
Implements
all_manifestsmetadata table - #1053Have refactored the code tor re-use logic of
manifestsmetadata table.The schema of
all_manifestscontains an additional column as compared tomanifeststable - reference_snapshot_id which indicates the snapshot id those manifests are contained in.Ref - Iceberg implementation - here and here