[Parquet][Python] API to decrypt parquet file using one DEK and no metadata #47435

@changhu-m

Describe the enhancement requested

I am having trouble using PyArrow to decrypt a Parquet file that was encrypted with the Arrow C++ library.

In C++, I encrypt the file with one key and no metadata.

  // Convert vector<uint8_t> to string for the FileEncryptionProperties::Builder
  std::string keyStr(reinterpret_cast<const char*>(key.data()), key.size());

  auto fileEncryptionProps = parquet::FileEncryptionProperties::Builder(keyStr)
                                 .algorithm(parquet::ParquetCipher::AES_GCM_V1)
                                 ->build();

  const auto props = parquet::WriterProperties::Builder()
                         .encryption(fileEncryptionProps)
                         ->build();

  const auto fields = convertSchema(table->schema());
  // GroupNode::Make returns a NodePtr, so a single cast to GroupNode is enough.
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make(
          "schema", parquet::Repetition::REQUIRED, fields));

  // Open output file
  std::shared_ptr<arrow::io::FileOutputStream> outFile;
  auto result = arrow::io::FileOutputStream::Open(outputFilePath);
  if (!result.ok()) {
    throw std::runtime_error("Failed to open output file: " + outputFilePath);
  }
  outFile = result.ValueOrDie();

  auto parquetStreamWriter = std::make_unique<parquet::StreamWriter>(
      parquet::ParquetFileWriter::Open(outFile, schema, props));

  return writeToStreamWriter(table, *parquetStreamWriter);
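
As a sanity check on what the writer above produces: as far as I understand the Parquet encryption format, a file written in encrypted-footer mode ends with the magic bytes `PARE`, while a plaintext-footer (or unencrypted) file ends with `PAR1`. I am assuming the builder above uses encrypted-footer mode since nothing overrides it. The helper names below are made up for illustration and only use the standard library; this is a rough sketch, not part of any Arrow API.

```python
import os

PLAINTEXT_MAGIC = b"PAR1"  # plaintext footer (unencrypted file or plaintext-footer mode)
ENCRYPTED_MAGIC = b"PARE"  # encrypted footer


def footer_magic(path):
    """Return the last four bytes of the file, where the Parquet magic is stored."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # position 4 bytes before the end of the file
        return f.read(4)


def has_encrypted_footer(path):
    """True for an encrypted footer, False for a plaintext footer."""
    magic = footer_magic(path)
    if magic == ENCRYPTED_MAGIC:
        return True
    if magic == PLAINTEXT_MAGIC:
        return False
    raise ValueError(f"{path!r} does not end with a Parquet magic: {magic!r}")
```

Calling `has_encrypted_footer(outputFilePath)` on the file written above should tell you which mode it was written in before trying to open it from Python.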

I can decrypt the file in C++ using the same key and no metadata.

  const std::string keyStr(key.begin(), key.end());
  auto decryptionProps =
      parquet::FileDecryptionProperties::Builder().footer_key(keyStr)->build();
  auto readerProps = parquet::ReaderProperties();
  readerProps.file_decryption_properties(decryptionProps);

  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(outputPath, false, readerProps);

  // Read the file content into an arrow table actualTable using reader
  std::shared_ptr<arrow::Table> actualTable;
  std::unique_ptr<parquet::arrow::FileReader> arrowReader;
  auto status = parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(), std::move(reader), &arrowReader);

  status = arrowReader->ReadTable(&actualTable);

However, I can't find an API in PyArrow to do the equivalent.

In Python, I can do the following, but it only works if the C++ writer also sets extra dummy key metadata (i.e., this code won't decrypt a file encrypted by the C++ approach at the beginning of this post).

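import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

# `dek` (the raw key bytes) and `path` (the encrypted file) are defined elsewhere.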

# Create a simple KMS client that returns our DEK
class SimpleKmsClient(pe.KmsClient):
    def __init__(self):
        pe.KmsClient.__init__(self)
    
    def unwrap_key(self, wrapped_key, master_key_identifier):
        return dek
    
    def wrap_key(self, key_bytes, master_key_identifier):
        raise NotImplementedError("wrap_key not needed for decryption")

# Create KMS factory
def kms_factory(kms_connection_configuration):
    return SimpleKmsClient()

crypto_factory = pe.CryptoFactory(kms_factory)

# Simple decryption config
decryption_config = pe.DecryptionConfiguration()
kms_connection_config = pe.KmsConnectionConfig()

# Create file decryption properties
file_decryption_props = crypto_factory.file_decryption_properties(
    kms_connection_config, decryption_config
)

# Read the file
table = pq.read_table(path, decryption_properties=file_decryption_props)
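
For completeness, `dek` in the snippet above is just the raw key bytes, i.e. the same bytes as the `key` vector on the C++ side. A minimal sketch of how it could be loaded, assuming the key is stored as a hex string (the `load_dek` helper and the key value are placeholders I made up for illustration). The length check reflects that AES keys are 16, 24, or 32 bytes long.

```python
VALID_AES_KEY_LENGTHS = (16, 24, 32)  # AES-128, AES-192, AES-256


def load_dek(hex_key):
    """Decode a hex-encoded data encryption key and check its length."""
    key = bytes.fromhex(hex_key)
    if len(key) not in VALID_AES_KEY_LENGTHS:
        raise ValueError(
            f"DEK must be 16, 24, or 32 bytes long, got {len(key)} bytes"
        )
    return key


# Placeholder key for illustration only, not a real secret.
dek = load_dek("00112233445566778899aabbccddeeff")
```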

My question is: Is there any Python API to decrypt a file using only one DEK and no key metadata?

Component(s)

Python
