Validation result is sometimes incorrect when using group rules #38

@borchero

At yesterday's PyData meetup in Zurich, a question made me realize that we handle the interaction between group rules and row-level rules incorrectly: if a row-level rule removes a row, and the group rule would fail without that row, we do not detect the failure. For example:

import dataframely as dy
import polars as pl

class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis = dy.String(primary_key=True, regex="^[A-Z]{3}$")
    is_main = dy.Bool(nullable=False)

    # Group rule: evaluated once per invoice, not once per row.
    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis() -> pl.Expr:
        return pl.col("is_main").sum() == 1

df = pl.DataFrame(
    {
        "invoice_id": ["A", "A", "A"],
        "diagnosis": ["ABC", "ABD", "123"],
        "is_main": [False, False, True],
    }
)
good, _ = DiagnosisSchema.filter(df)
print(good)

results in

shape: (2, 3)
┌────────────┬───────────┬─────────┐
│ invoice_id ┆ diagnosis ┆ is_main │
│ ---        ┆ ---       ┆ ---     │
│ str        ┆ str       ┆ bool    │
╞════════════╪═══════════╪═════════╡
│ A          ┆ ABC       ┆ false   │
│ A          ┆ ABD       ┆ false   │
└────────────┴───────────┴─────────┘

which clearly violates the schema: the only row with `is_main = True` was dropped because its diagnosis `"123"` fails the regex, so the group is left without a main diagnosis.
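The failure mode can be sketched without the library. The snippet below is a minimal illustration, not dataframely's implementation: it assumes the bug comes from evaluating the group rule against the unfiltered input, and the helper names (`row_ok`, `groups_ok`, `filter_single_pass`, `filter_two_pass`) are hypothetical. It also assumes that the desired behaviour is to drop the whole group once it no longer satisfies the group rule.

```python
import re
from collections import defaultdict

rows = [
    {"invoice_id": "A", "diagnosis": "ABC", "is_main": False},
    {"invoice_id": "A", "diagnosis": "ABD", "is_main": False},
    {"invoice_id": "A", "diagnosis": "123", "is_main": True},
]


def row_ok(row):
    # Row-level rule: the diagnosis must consist of three uppercase letters.
    return re.search(r"^[A-Z]{3}$", row["diagnosis"]) is not None


def groups_ok(rows):
    # Group rule: return the invoice ids with exactly one main diagnosis.
    mains = defaultdict(int)
    for row in rows:
        mains[row["invoice_id"]] += int(row["is_main"])
    return {key for key, count in mains.items() if count == 1}


def filter_single_pass(rows):
    # Both kinds of rules are evaluated against the unfiltered input,
    # so the group rule still "sees" the row that is about to be removed.
    valid_groups = groups_ok(rows)
    return [r for r in rows if row_ok(r) and r["invoice_id"] in valid_groups]


def filter_two_pass(rows):
    # Row-level rules first; the group rule then only sees surviving rows.
    survivors = [r for r in rows if row_ok(r)]
    valid_groups = groups_ok(survivors)
    return [r for r in survivors if r["invoice_id"] in valid_groups]


print(len(filter_single_pass(rows)))  # 2 rows kept, group has no main diagnosis
print(len(filter_two_pass(rows)))     # 0 rows kept, the whole group is rejected
```

Here a single second pass is enough, because removing whole groups cannot make a row-level rule fail. Whether that also holds when several group rules use different grouping keys is not covered by this sketch.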
