
FileSystemMetricsRepository file to be a parquet (/ delta) #185

Description

@WiktorHawrylik

Is your feature request related to a problem? Please describe.
We run thousands of anomaly validations on hundreds of data sources, and we often want to explain why they fail. The first step is to look at the underlying metrics, which are persisted in the JSON file managed by PyDeeQu. Processing JSON in a Spark environment is inefficient, both programmatically and for manual inspection.

Describe the solution you'd like
One way to explain failed anomaly checks is to expose the JSON repository file underlying FileSystemMetricsRepository to analytical users. Since PyDeeQu is a PySpark framework, the natural choice is to store repository data as Parquet or Delta files instead of JSON. This would also:

  • Enable anomaly explainability by exposing the metrics repository as a live table
  • Establish a data contract for the repository (the current PyDeeQu JSON repository does not enforce a schema)
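
To make the idea concrete, here is a minimal sketch of flattening the repository JSON into rows with a fixed set of columns. It is not PyDeeQu API: the function name flatten_metrics, the METRIC_COLUMNS contract, and the sample payload are hypothetical, and the JSON layout (resultKey / analyzerContext / metricMap) is an assumption about what the repository file looks like, so adjust the keys to match your actual file.

```python
import json
from typing import Any, Dict, List

# Hypothetical column contract for the flattened metrics table.
METRIC_COLUMNS = ["dataset_date", "tags", "entity", "instance", "name", "value"]


def flatten_metrics(json_text: str) -> List[Dict[str, Any]]:
    """Turn repository JSON into one flat row per metric.

    Assumes (not verified against PyDeeQu) that the file is a JSON list whose
    items carry a "resultKey" and an "analyzerContext" with a "metricMap".
    """
    rows: List[Dict[str, Any]] = []
    for result in json.loads(json_text):
        key = result.get("resultKey", {})
        metric_map = result.get("analyzerContext", {}).get("metricMap", [])
        for entry in metric_map:
            metric = entry.get("metric", {})
            rows.append({
                "dataset_date": key.get("dataSetDate"),
                # Tags are serialized to a string so every row has the same shape.
                "tags": json.dumps(key.get("tags", {}), sort_keys=True),
                "entity": metric.get("entity"),
                "instance": metric.get("instance"),
                "name": metric.get("name"),
                "value": metric.get("value"),
            })
    return rows


# Made-up sample payload, for illustration only.
SAMPLE = json.dumps([
    {
        "resultKey": {"dataSetDate": 1700000000000, "tags": {"source": "orders"}},
        "analyzerContext": {
            "metricMap": [
                {"metric": {"entity": "Dataset", "instance": "*",
                            "name": "Size", "value": 5.0}},
                {"metric": {"entity": "Column", "instance": "id",
                            "name": "Completeness", "value": 1.0}},
            ]
        },
    }
])

rows = flatten_metrics(SAMPLE)
```

Rows in this shape could then be handed to spark.createDataFrame(rows) and written with df.write.mode("append").parquet(path), which would give analysts a queryable table with a stable set of columns.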

Describe alternatives you've considered
Consider the Delta format. It is probably a much better choice, but it requires more development to enable incremental processing and rollbacks.

Additional context
It would also be good to know why the JSON format was chosen. Is there any way we can benefit from this choice?
