Conversation
|
Not specific feedback for the RFC, just sharing some context links and previous discussion that was helpful in getting the apache/datafusion#16779 Also tagging @ggershinsky in case he has any cycles to read this. His guidance was instrumental on the DataFusion and Arrow-rs PME work, as well as Iceberg Java encryption implementation. |
|
Thanks, I'll be glad to have a look. |
| ``` | ||
| Master Key (in KMS) | ||
| └── wraps → KEK (Key Encryption Key) — stored in table metadata as EncryptedKey | ||
| └── wraps → DEK (Data Encryption Key) — stored in StandardKeyMetadata per file |
There was a problem hiding this comment.
some DEKs (those for manifest list files) are also stored in table metadata as EncryptedKey. These DEKs are indeed packaged in a StandardKeyMetadata (along with AAD prefix and file length). The serialized StandardKeyMetadata is encrypted/wrapped by the KEK, and stored in the table metadata / encrypted_keys structure.
The manifest file DEKs are packaged in StandardKeyMetadata, and stored as-is (without encryption) in manifest list files. The latter are encrypted then.
The data file DEKs are packaged in StandardKeyMetadata, and stored as-is (without encryption) in manifest files. The latter are encrypted then.
|
|
||
| - **Master keys** live in the KMS and never leave it | ||
| - **KEKs** are wrapped by the master key and stored in `TableMetadata.encryption_keys` | ||
| - **DEKs** are wrapped by a KEK and stored per-file in `StandardKeyMetadata` |
There was a problem hiding this comment.
Only manifest list DEKs are wrapped by a KEK. Other DEKs are encrypted in the parent files, by the parent DEKs
| │ | ||
| ▼ | ||
| load_manifest_list(file_io, table_metadata) | ||
| 1. Look up encryption_key_id in table_metadata.encryption_keys |
There was a problem hiding this comment.
Also need to unwrap the KEK (via a KMS client)
| a. file_io.new_encrypted_output(path) → AGS1-encrypting OutputFile | ||
| b. em.wrap_key_metadata() → EncryptedKey for table metadata | ||
| c. Store key_id on Snapshot.encryption_key_id | ||
| 3. Table updates include AddEncryptionKey for new KEKs |
There was a problem hiding this comment.
Also need to wrap the KEK (via a KMS client)
|
Thanks for taking a look @ggershinsky I've tried to fill in some of the details here |
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @xanderbailey for this pr, generally LGTM, left some suggestions.
|
|
||
| ### Iceberg Spec: Encryption | ||
|
|
||
| The [Iceberg table spec](https://iceberg.apache.org/spec/#table-metadata) defines encryption |
| │ ├── kms/ | ||
| │ │ ├── mod.rs | ||
| │ │ └── in_memory.rs # InMemoryKms (testing only) | ||
| │ └── integration_tests.rs # End-to-end encryption round-trip tests |
There was a problem hiding this comment.
In rust we usually put integration tests in tests dir, next to src dir.
|
|
||
| ```rust | ||
| #[async_trait] | ||
| pub trait EncryptionManager: Debug + Send + Sync { |
There was a problem hiding this comment.
Do we really need this to be a trait? I think a no-op kms client would be enough?
There was a problem hiding this comment.
Yeah we actually don't need this to be a trait the standard encryption manager is the only one we need
| `InputFile` and `OutputFile` are enums with three variants each: | ||
|
|
||
| ```rust | ||
| pub enum InputFile { |
There was a problem hiding this comment.
I think we this enum should sth as following:
pub enum EncryptedInputFile {
Encrypted {
key_metadata: KeyMetadata
inner: InputFile
},
NativeedEncryted {
key_material: NativeKeyMaterial,
inner: InputFile
}
}
There was a problem hiding this comment.
I think this requires a lot more changes to the current code since encrypted inputs and outputs won't be handled transparently. Or is the suggestion to have InputFile as an enum of Plain, EncryptedInputFile and EncryptedInputFile is itself an enum?
There was a problem hiding this comment.
I find it a little difficult to imagine what it will look like given different choice? Could you give raise some examples? I'm even thinking if we actually need an enum. InputFile, EncryptedInputFile and NativeEncryptedInputFile are used in differenent cases. Let's use an example, ManifestReader, if we remove the enum, what we need is to add an extra api for read(EncryptedInputFile). This is acceptable to me, some we have a FileRead trait to abstract out the descryption process.
There was a problem hiding this comment.
I'm leaning towards the example snippet @blackmwk above and I don't think we should expand Input/OutputFile to an enum. EncryptedFile should be a wrapper of Input/OutputFile rather than a variation.
By the same logic, having a wrapper Storage like EncryptionStorage may not be ideal. For encryption, we only need to (un)wrap Input/OutputFile in the FileIO level
There was a problem hiding this comment.
I think we can do this if you both think this is better but this requires some reasonable surgery on structs like ManifestListWriter which today takes an OutputFile and will now need to take a Box<dyn FileWrite> and ManifestWriter will need Box<dyn FileWrite> and a location.
There was a problem hiding this comment.
Okay I tried it out and it's actually not too bad. If you're both happy with the changes I mentioned above then I think this is a good approach.
| // Via FileIOBuilder extension (works with RestCatalog and any extension-aware catalog) | ||
| let file_io = FileIOBuilder::new("s3") | ||
| .with_prop("s3.region", "us-east-1") | ||
| .with_extension(encryption_manager) |
| .build()?; | ||
|
|
||
| // Or via convenience method on FileIO | ||
| let file_io = file_io.with_encryption_manager(encryption_manager); |
| // Option A: EncryptionManager on the catalog | ||
| let catalog = GlueCatalogBuilder::default() | ||
| .with_storage_factory(Arc::new(OpenDalStorageFactory::S3)) | ||
| .with_encryption_manager(encryption_manager) |
There was a problem hiding this comment.
I think what we actually need from user is KmsClientFactory?
There was a problem hiding this comment.
Oh interesting, where they can provide a way to construct a number of KMS clients?
There was a problem hiding this comment.
I mean the code to be as following:
let catalog = GlueCatalogBuilder::default()
.with_storage_factory(Arc::new(OpenDalStorageFactory::S3))
.with_kms_client_factory(new AwsKmsClientFactory())
| .load("my_catalog", props) | ||
| .await?; | ||
|
|
||
| // Option B: Wrapping StorageFactory |
There was a problem hiding this comment.
I prefer option A due to https://github.com/apache/iceberg-rust/pull/2183/changes#r2927906969
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
Don't close it, still under development. |
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @xanderbailey , I think it's also done. Please address the comments and fix typos, then we will be able to merge it.
|
|
||
| ```rust | ||
| /// Wraps a plain InputFile with decryption capabilities. | ||
| pub enum EncryptedInputFile { |
There was a problem hiding this comment.
Could we remove the enum for now, just plain struct EncryptedInputFile, NativeEncryptedInputFile? I don't get the benefits of such an enum. We could always add it back later if necessary.
There was a problem hiding this comment.
This is cleaner!
| } | ||
|
|
||
| /// Wraps a plain OutputFile with encryption capabilities. | ||
| pub enum EncryptedOutputFile { |
|
|
||
| The `EncryptionManager` is stored as a type-safe `FileIOBuilder` extension. This integrates | ||
| naturally with catalogs that support extensions (e.g. `RestCatalog.with_file_io_extension()`): | ||
| The `EncryptionManager` is attached to `FileIO` via a convenience method. The encryption |
There was a problem hiding this comment.
I don't think it's a good idea to attach it to FileIO.
There was a problem hiding this comment.
Instead we should implement EncryptingStroage
There was a problem hiding this comment.
Interesting idea!
- Looks like Storage is serde serialisable which is hard to do for an
iceberg-rust/crates/iceberg/src/io/storage/mod.rs
Lines 79 to 80 in 7c2d5d1
EncryptionManagerwhich holds anArc<dyn KeyManagementClient>and that has caches and so on inside it. - PME isn't transparent so
StorageandFileIowould need to expose anative_encrypt/decryptmethod which maybe seems like we're breaking the abstractions?
What do you think?
There was a problem hiding this comment.
If you look at java, bot KeyManagementClient and EncryptionManager are serializable, so I think we should also mark them as serializable.
What's PME?
There was a problem hiding this comment.
PME is the Parquet Modular Encryption
There was a problem hiding this comment.
In this case how about we create an EncryptingFileIO, EncryptingStorage?
There was a problem hiding this comment.
TBH I currently don't have a full picture of all the data structures involved, so instead of adding too much abstractions early, I prefer to keep more types. We could always add abstractions later to simplify types. WDYT?
There was a problem hiding this comment.
EncryptingFileIO would mean more code changes since each call site would now need to take EncryptingFileIO which seems wrong also somehow?
I'm going to see if I can move EncryptionManager out of FileIo altogether and just have it on the Table and pass it down where it's needed. I'll report back when I have a better idea of how this looks
There was a problem hiding this comment.
Yeah okay so I think this is cleaner, TableBuilder constructs the EncryptionManager and passes that down. FileIo is not aware of encryption. It's a little awkward now to pass down the EM everywhere but honestly, I think it's more explicit than sort of assuming the FileIo has been configured correctly.
There was a problem hiding this comment.
Let's make it explicit first, and polish the api later. As long as we be careful with pub visibility, we could reduce user impact as much as possible.
Which issue does this PR close?
RFC for table encryption
Part of: #2034
Rough draft with some of the key parts: #2042
What changes are included in this PR?
Are these changes tested?