FileSystemMetricsRepository file to be Parquet (or Delta) #185

**Is your feature request related to a problem? Please describe.**
We run thousands of anomaly validations across hundreds of data sources, and we often need to explain why they fail. The first step is to look at the underlying metrics, which are persisted in the JSON file managed by PyDeeQu. Processing JSON in a Spark environment is inefficient, both programmatically and when inspecting it by hand.

**Describe the solution you'd like**
One way to explain failed anomaly checks is to expose the JSON repository file underlying FileSystemMetricsRepository to analytical users. Since PyDeeQu is a PySpark framework, the natural choice is to store repository data as Parquet or Delta files instead of JSON. This would also:

- Enable anomaly explainability by exposing the metrics repository as a live table
- Establish a data contract for the repository (the current PyDeeQu JSON repository does not really assume a schema)
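
To make the "data contract" point concrete, here is a minimal sketch of flattening the repository JSON into fixed-column rows that could then be written as a table. It is not PyDeeQu code: `flatten_repository`, `COLUMNS`, and the sample payload are hypothetical, and the JSON layout (`resultKey` / `analyzerContext` / `metricMap`) is an assumption based on what our repository files look like, not a documented schema.

```python
import json
from typing import Any, Dict, List

# Hypothetical contract: the fixed set of columns every flattened row carries.
COLUMNS = ["dataset_date", "tags", "entity", "instance", "name", "value"]


def flatten_repository(raw: str) -> List[Dict[str, Any]]:
    """Turn repository JSON text into one flat row per stored metric.

    Assumes the file is a JSON array of entries, each with a `resultKey`
    and an `analyzerContext.metricMap` list (assumed layout, see above).
    """
    rows: List[Dict[str, Any]] = []
    for entry in json.loads(raw):
        key = entry.get("resultKey", {})
        metric_map = entry.get("analyzerContext", {}).get("metricMap", [])
        for item in metric_map:
            metric = item.get("metric", {})
            values = [
                key.get("dataSetDate"),
                # Tags are serialized to a string so the column type stays stable.
                json.dumps(key.get("tags", {}), sort_keys=True),
                metric.get("entity"),
                metric.get("instance"),
                metric.get("name"),
                metric.get("value"),
            ]
            rows.append(dict(zip(COLUMNS, values)))
    return rows


# Made-up sample payload, for illustration only.
SAMPLE = json.dumps([
    {
        "resultKey": {"dataSetDate": 1700000000000, "tags": {"source": "orders"}},
        "analyzerContext": {"metricMap": [
            {"analyzer": {"analyzerName": "Size"},
             "metric": {"metricName": "DoubleMetric", "entity": "Dataset",
                        "instance": "*", "name": "Size", "value": 5.0}},
            {"analyzer": {"analyzerName": "Completeness", "column": "id"},
             "metric": {"metricName": "DoubleMetric", "entity": "Column",
                        "instance": "id", "name": "Completeness", "value": 1.0}},
        ]},
    }
])

if __name__ == "__main__":
    for row in flatten_repository(SAMPLE):
        print(row)
```

In a Spark job, rows like these could presumably be passed to `spark.createDataFrame` with an explicit schema and appended with `df.write.mode("append").parquet(...)`, but having the repository write Parquet directly would avoid this extra step, which is the point of this request.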

**Describe alternatives you've considered**
Consider the Delta format: probably a much better choice, but it requires more development to enable incremental processing and rollbacks.

**Additional context**
It would also be good to know why the JSON format was chosen. Is there any way we can benefit from this choice?

