Is your feature request related to a problem? Please describe.
We run thousands of anomaly validations across hundreds of data sources, and we often need to explain why they fail. The first step is to look at the underlying metrics, which are persisted in a JSON file managed by PyDeeQu. Processing that JSON in a Spark environment is inefficient, both programmatically and manually.
Describe the solution you'd like
One way to explain failed anomaly checks is to expose the JSON repository file underlying FileSystemMetricsRepository to analytical users. Since PyDeeQu is a PySpark framework, the natural choice is to store repository data as Parquet or Delta files instead of JSON. This would also:
- Enable anomaly explainability by exposing the metrics repository as a live table
- Establish a data contract for the repository (the current PyDeeQu JSON repository does not assume a fixed schema)
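To illustrate the kind of tabular view we have in mind, here is a minimal sketch that flattens a repository file into one row per metric. The JSON layout (`resultKey`, `analyzerContext`, `metricMap`) is an assumption based on files we have seen, not a documented contract, and `flatten_repository` and the output column names are hypothetical.

```python
import json
from typing import Any, Dict, List

# Assumed shape of a repository file: a list of entries, each with a
# result key and a list of computed metrics. Values here are made up.
SAMPLE = """
[
  {
    "resultKey": {"dataSetDate": 1700000000000, "tags": {"source": "orders"}},
    "analyzerContext": {
      "metricMap": [
        {"analyzer": {"analyzerName": "Size"},
         "metric": {"entity": "Dataset", "instance": "*",
                    "name": "Size", "value": 1200.0}},
        {"analyzer": {"analyzerName": "Completeness", "column": "id"},
         "metric": {"entity": "Column", "instance": "id",
                    "name": "Completeness", "value": 0.98}}
      ]
    }
  }
]
"""


def flatten_repository(raw: str) -> List[Dict[str, Any]]:
    """Turn the nested repository JSON into flat rows, one per metric."""
    rows: List[Dict[str, Any]] = []
    for entry in json.loads(raw):
        key = entry.get("resultKey", {})
        metrics = entry.get("analyzerContext", {}).get("metricMap", [])
        for item in metrics:
            metric = item.get("metric", {})
            rows.append({
                "dataset_date": key.get("dataSetDate"),
                # Tags are kept as a JSON string so every row has the same columns.
                "tags": json.dumps(key.get("tags", {}), sort_keys=True),
                "entity": metric.get("entity"),
                "instance": metric.get("instance"),
                "name": metric.get("name"),
                "value": metric.get("value"),
            })
    return rows


rows = flatten_repository(SAMPLE)
for row in rows:
    print(row)
```

With a fixed set of columns like this, the rows could be loaded into a Spark DataFrame and written out as Parquet, which is roughly what we would like the repository to do natively.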
Describe alternatives you've considered
Consider the Delta format: it is probably a much better choice, but it requires more development to enable incremental processing and rollbacks.
Additional context
It would also be good to know why the JSON format was chosen. Is there any way we can benefit from that choice?