Skip to content

Recursive non-JSON dataclass serialization#1565

Open
elibixby wants to merge 20 commits into
flyteorg:masterfrom
elibixby:custom_dataclass
Open

Recursive non-JSON dataclass serialization#1565
elibixby wants to merge 20 commits into
flyteorg:masterfrom
elibixby:custom_dataclass

Conversation

@elibixby

@elibixby elibixby commented Mar 27, 2023

Copy link
Copy Markdown
Contributor

TL;DR

Recursively call out to TypeEngine to serialize/deserialize dataclasses as LiteralMaps.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

This solves two problems:

  1. Bugs arising from discrepancies in the serialization and deserialization of dataclass members (e.g. Enums are serialized differently within a dataclass vs directly in a task signature, causing a bug when Enums are specified from UI forms)
  2. The inability to include plugin/extra/user-defined types in dataclasses (e.g. currently it is not possible to define a dataclass with a nn.Module field, nor with a field for a type where a user writes a custom TypeTransformer)

With some additional backend work, this could also solve: flyteorg/flyte#1670, since the dataclass would be serialized as a proto and not a json-struct, individual submessages would be accessible to the Go backend without parsing arbitrary json. If this were implemented we could potentially eliminate the need for separate typing.NamedTuple support as dataclass would support a superset of the functionality of a typing.NamedTuple.

Concerns:

  • Currently LiteralType doesn't support any good annotation to indicate a non-univariate map. This information can be stuffed inside metadata as Is currently done with json dataclasses, but the "Right" way to do this is likely to add a mapping field to LiteralType which maps to subtypes and represents the schema of a multivariate map. As a bonus, this would remove some of the custom logic used for serializing function signatures.
  • Likely flytectl as well as the Flyte UI, currently use metadata to perform yaml file generation, and UI form generation respectively.

Tracking Issue

flyteorg/flyte#3359

Follow-up issue

NA
OR
https://github.com/flyteorg/flyte/issues/

@welcome

welcome Bot commented Mar 27, 2023

Copy link
Copy Markdown

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@elibixby

elibixby commented Mar 28, 2023

Copy link
Copy Markdown
Contributor Author

@wild-endeavor @kumare3

I'd like to better understand in what casees/where this functionality is used guess_python_type. AFAICT it's only really used in Transformers which encode many types with one transformer class. Where this makes sense is e.g. in generic collections, where you can't attach additional logic to the class.

However, I think python should probably be the source of truth for a Dataclass schema, and not the wire format. In general I struggle to see how reconstructing the type from a wire serialized schema will ever be safe. You could do something like this (even without JSONSchema).

    def guess_python_type(self, literal_type: LiteralType) -> Type[T]:
        fields = tuple(
            (name, TypeEngine.guess_python_type(val))
            for name, val in literal_type.metadata.items()
        )

        return dataclasses.make_dataclass(cls_name=literal_type.structure.tag, fields=fields)

But you won't be able to capture, e.g. functions defined on the dataclass, defaults values, special constructors etc. Also it seems like a good number of transformers won't implement guess_python_type which means this will fail sporadically on dataclasses where this is a field type.

Instead, it seems much safer to simply make a new dataclass TypeTransformer for each dataclass that you encounter:

class DataclassTransformer(TypeTransformer[T]):

    def __init__(self, dataclass_type: Type[T]):
        super().__init__(name=f'Dataclass[{dataclass_type.__name__}]', t=dataclass_type, enable_type_assertions=True)
        self._dataclass_type = dataclass_type
    
    
    def guess_python_type(...):
        #some safety checks
        return self._dataclass_type
...
class TypeEngine(Generic[T]):
...
    @classmethod
    def get_transformer(cls, python_type: Type) -> TypeTransformer[T]:
         ...
         if dataclasses.is_dataclass(python_type):
                cls._REGISTRY[python_type] = DataclassTransformer(python_type)
                return cls._REGISTRY[python_type]

Afaict the only risk this runs is if someone subclasses a dataclass, you could accidently use the Transformer for the superclass, rather than creating a new transformer for the subclass. This is easily solved by making the is_dataclass check before checking for registered superclass transformers.

This has the added advantage of actually making the generic typing of DataclassTransformer useful from a type checking perspective.

Again, this may not be necessary if we can just dump guess_python_type from DataclassTransformer entirely, but I'm not sure enough in my understanding of what it does to suggest that.

@elibixby

elibixby commented Mar 31, 2023

Copy link
Copy Markdown
Contributor Author

There doesn't appear to be a valid LiteralType for a LiteralMap with differently typed values. This doesn't really matter though, because we never need the hint of the LiteralType at least in unit tests. I'm not sure what the functionality of the second order typing is... (get_literal_type and guess_python_type. It doesn't seem to be used anywhere in the codebase outside of the corresponding methods in TypeEngine).

EDIT: My guess is the JSON Schema is used by the UI and CLI to generate forms/input yamls. But I can't find where this happens.

@elibixby elibixby changed the title Sketch of non-JSON dataclass serialization Recursive non-JSON dataclass serialization Mar 31, 2023
@elibixby elibixby marked this pull request as ready for review April 3, 2023 19:51
Comment thread tests/flytekit/unit/core/test_dataclass.py

@pingsutw pingsutw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elibixby Thanks, I like this idea. we should also update flyteConsole, right? otherwise, it won't be able to show the inputs / outputs.

@elibixby

elibixby commented Apr 4, 2023

Copy link
Copy Markdown
Contributor Author

@elibixby Thanks, I like this idea. we should also update flyteConsole, right? otherwise, it won't be able to show the inputs / outputs.

Yeah I think we will need to update both the console (for displaying and building input forms), as well as flytectl for generating execution.yaml files. I imagine they should see similar code reductions from not needing to deal with jsonSchema, and being able to fall back on pre-existing parsing logic.

Would love someone more familiar with these side of things to take this on, or at least provide pointers to what needs to be changed.

@pingsutw

pingsutw commented Apr 5, 2023

Copy link
Copy Markdown
Member

yup. I'll help. let's get it merged first.

@elibixby

elibixby commented Apr 7, 2023

Copy link
Copy Markdown
Contributor Author

Another note: I needed to rework DictTransformer a little. The only reason the tests were passing using non-str type keys was because dataclass was using a different code-path for dict serialization. Since the point of this PR is to unify the behavior here, I figured removing this capability was reasonable.

In reality there's no way to serialize an int-key map in protobuf (structs always use string keys). There were even a couple tests that codified this behavior (creating a dict with an int key, and expecting it to be a string key), but this felt really quite confusing to users, so I removed it. Technically this is a breaking change, but I kinda doubt anyone is relying on this functionality given how weird it is.

Might be a way of restoring this behavior in a more principled way.

@wild-endeavor wild-endeavor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @elibixby - thinking about it a bit more, is it a big deal from here to preserve the current behavior for dataclass_json? that is, if it's a dataclass_json, still use the existing code. If it's a dataclass dataclass, then use the new logic.

I'm concerned about breaking existing users. Currently a dataclass_json gets encoded to a literal Scalar Generic. In the new world it would be a Literal with a map, and also represents a scalar object in Python as a map object in Flyte. Let me think about this a bit more, it feels a bit awkward right?

cc @pingsutw - what do you think?

@elibixby

elibixby commented Apr 8, 2023

Copy link
Copy Markdown
Contributor Author

@wild-endeavor I agree that leaving old logic in place for backwards compatibility is worth considering especially if there are concerns with flytectl or console version mismatch.

However I don't agree that this is an awkward use of the python <-> proto mapping. A dataclass is much closer to a map than it is to a scalar. A dataclass is defined by having a map __dataclass_fields__ as a class member. Not only that but my personal experience using flyte has been that the duplicate nesting logic of dataclasses is the source of lots of frustrating unexpected behavior, (why do unions work in functions but not in dataclasses, why are enums serialized differently, why don't custom types work in dataclasses, etc), and by removing json serialization you can see from the test cases that several other weird edge cases are removed as well (forward references breaking dataclasses, etc).

EDIT: The only slightly awkward thing is the stuffing of the type schema into a struct when it could just as easily be represented with a map field on LiteralType proto.

@wild-endeavor

Copy link
Copy Markdown
Contributor

Okay that sounds good. We can make this new transformer operate as a map/dictionary type. But yeah if we could preserve the existing dataclass_json behavior, I think that will be good. So no change for dataclass_json, this new logic for pure dataclass.

is that possible?

And then we can do a generic json transformer, and think about how to pull out the logic from run.py as a second/third step. what do you think?

@pingsutw

pingsutw commented Apr 18, 2023

Copy link
Copy Markdown
Member

@elibixby let's not remove the code in current dataclass transformer for now.
Could we rename dataclassTransformer to dataclassJsonTransformer? and create a new transformer called dataclassTransformer that can serialize the python value to literal instead of json schema. wdyt?

@dataclass_json
@dataclass
class DatasetStruct(object): # -> use dataclassJsonTransformer
    ...


@dataclass
class DatasetStruct(object): # -> use dataclassTransformer
    ...

This way we can make sure it will not break current users.

@ggydush

ggydush commented May 22, 2023

Copy link
Copy Markdown
Collaborator

Any update on this? I think this would help resolve an issue I'm seeing with the existing transformer! Happy to pair/help on any remaining outstanding issues!

@elibixby

Copy link
Copy Markdown
Contributor Author

Any update on this? I think this would help resolve an issue I'm seeing with the existing transformer! Happy to pair/help on any remaining outstanding issues!

Hey, I just haven't had time to restore the original json dataclass transformer to work along side. I've been focused on a couple more pressing issues with the type system that came out of this CL (namely writing an escape hatch for the type system in #1615 and changing pickling to be a universal default rather than only for top level types)

I think everything that is needed for merged is here, just some old code/tests need to be restored alongside the new stuff. If you'd be interested in helping with that I'd appreciate it!

@pingsutw pingsutw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to rename current DataclassTransformer to DataclassJsonTransformer, and move the code that @elibixby added to DataclassTransformer. last, we need to update this line. if python_type has @dataclass_json decorator, it should return DataclassJsonTransformer; otherwise, it should return DataclassTransformer

@ggydush would you like to help updating this PR

@ggydush

ggydush commented Jun 6, 2023

Copy link
Copy Markdown
Collaborator

@pingsutw taking a look today! Will see what I can do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants