Serializing GUFE objects#

When you create objects the follow the GUFE APIs, you will want them to be serializable; that is, you’ll want to be able to store everything to disk. In this section, we’ll discuss how GUFE handles serialization, and what a developer needs to do to serialize an object.

From a user perspective, this information is mainly relevant for understanding what you can expect from storage systems built for GUFE objects. From a developer perspective, this will tell you what you need to do to be compatible with (and get the most from) the GUFE serialization mechanisms.

In practice: What a developer needs to do#

For most use cases, an object can be made serializable by gufe if it (1) inherits from GufeTokenizable and (2) implements several abstract private methods related to serialization: _to_dict, _from_dict, and _defaults:

  • _to_dict returns a dictionary mapping string names for each attribute to preserve to a serializable representation of that attribute. A representation is considered “serializable” if it is (1) a GufeTokenizable, (2) a primitive type that can be handled by Python’s built-in JSON, or (3) a type for which a custom JSONCodec has been registered (see Supporting custom types in serialization).

  • _from_dict is a classmethod that takes the dictionary returned by _to_dict and reconstructs the object.

  • _defaults returns whatever _to_dict would return for any default values; i.e., it should be a dictionary where the keys are the the keys of the _to_dict dictionary that have default values, and the values are the associated defaults. Importantly, this may differ from the default value given at initialization – for example, it is common practice to use None as a stand-in for a mutable container (e.g., an empty list). In that case, the value given by _defaults should be the container, not None, since that is what will be in the _to_dict.

As a very simple example, consider the following complete implementation:

class Foo(GufeTokenizable):
    def __init__(self, bar, baz=None):
        self.bar = bar
        if baz is None:
            baz = []
        self.baz = baz

    def _to_dict(self):
        return {'bar': self.bar, 'baz': self.baz}

    @classmethod
    def _from_dict(cls, dct):
        return cls(**dct)

    def _defaults(self):
        return {'baz': []}

Testing your GufeTokenizable#

We provide a convenience mix-in for testing that a GufeTokenizable will be correctly stored and reloaded. Here’s an example of how to use it for the Foo class above:

import pytest
from gufe.tests.test_tokenization import GufeTokenizationTestsMixin

class TestFoo(GufeTokenizationTestsMixin):
    cls = Foo
    key = "Foo-8e7fd2803d73ef0cd9faceef751acfca"

    @pytest.fixture
    def instance(self):
        return Foo(5, baz=['qux', 'quux', 'quuux'])

You’ll need to get the key from creating the instance once and recording str(obj.key)– since the key is stable, it shouldn’t change (and, in fact, that is tested). This will add several tests to your test suite to ensure that you can save and reload your GufeTokenizable.

GufeTokenizables must be functionally immutable#

One important restriction on GufeTokenizable subclasses is that they must be functionally immutable. That is, they must have no mutable attributes that change their functionality. In most cases, this means that the object must be strictly immutable.

When an object is immutable, that means that none of its attributes change after initialization. So all attributes should be set when you create an object, and never changed after that. If your object is immutable, then it is suitable to be a GufeTokenizable.

There is a special case of mutability that is also allowed, which is if the object is functionally immutable. As an example, consider a flag to turn on or off usage of a cache of input-output pairs for some deterministic method. If the cache is turned on, you first try to return the value from it, and only perform the calculation if the inputs don’t have a cached output associated. In this case, the flag is mutable, but this has no effect on the results. Indeed, the cache itself may be implemented as a mutable attribute of the object, but again, this would not change the results that are returned. It would also be recommended that an attribute like a cache, which is only used internally, should be marked private with a leading underscore.

On the other hand, a flag that changes code path in a way that might change the results of any operation would mean that the object cannot be a GufeTokenizable.

Supporting custom types in serialization#

Occasionally, you may need to add support for a custom type that is not already known by GUFE. For example, there may be objects that are included in your GUFE object that come from another library, and are not GUFE tokenizables.

When this happens, you will need to create and register a custom JSON codec. To do this, at a minimum you need to define a to_dict and from_dict function for your type. You can then give the class of your type and those functions to a JSONCodec. As an example, consider the code for our codec for pathlib.Path objects:

PATH_CODEC = JSONCodec(
    cls=pathlib.Path,
    to_dict=lambda p: {'path': str(p)},
    from_dict=lambda dct: pathlib.Path(dct['path'])
)

Here the to_dict and from_dict are lambdas. The ideas of these are the same as the GufeTokenizable._to_dict and GufeTokenizable._from_dict described above.

In this default setup, the codec recognizes your object type by doing an isinstance check on the cls you gave it. It updates the dictionary that comes from your to_dict with the class and module of the object, as well as a key to mark this dictionary as coming from this codec. The key-value pairs that it adds make it so that the codec can recognize the dictionary when it is deserializes (decodes) the data.

Details of how the object is recognized for encoding or how the dictionary is recognized for decoding can be changed by passing functions to the is_my_obj or is_my_dict parameters of JSONCodec.

Warning

The Custom encoders & decoders only override the default JSON encoder/decoder if they are not able to natively handle the object. This leads to some odd / lossy behaviour for some objects such as np.float64 which is natively converted to a float type by the default encoder, whilst other numpy generic types are appropriately roundtripped.

On the use of NumPy generic types#

Due to their inconsistent behaviour in how they are handled by the default JSON encoder/decoder routines (see warning above), it is our suggestion that Python types should be used preferrentially instead of NumPy generic types. For example if one would be looking to store a single float value, a float would be prefered to a np.float32 or np.float64.

Please note that this only applied to generic types being used outside of numpy arrays. NumPy arrays are, as far as we know, always handled in a consistent manner.

Dumping arbitrary objects to JSON#

Any GufeTokenizable can be dumped to a JSON file using the custom JSON handlers. Given a GufeTokenizable called obj and a path-like called filename, you can dump to JSON with this recipe:

import json
from gufe.tokenization import JSON_HANDLER
with open(filename, mode='w') as f:
    json.dump(obj.to_dict(), f, cls=JSON_HANDLER.encoder)

Similarly, you can reload the object with:

import json
from gufe.tokenization import JSON_HANDLER, GufeTokenizable
with open(filename, mode='r') as f:
    obj = GufeTokenizable.from_dict(json.load(f, cls=JSON_HANDLER.decoder))

Note that these objects are not space-efficient: that is, if you have the same object in memory referenced by multiple objects (e.g., an identical ProteinComponent in more than one ChemicalSystem), then you will save multiple copies of its JSON representation.

On reloading, tools that use the recommended from_dict method will undo this duplication; see Deduplication in memory for details.

As a more space-efficient alternative to to_dict/from_dict, consider using to_keyed_chain/from_keyed_chain instead. This deals in a representation using the KeyedChain approach, which avoids duplication of dependent GufeTokenizables in the serialized JSON representation.

Convenient serialization ~~~~~~~~~~~~————

We also provide convenience methods to convert any GufeTokenizable to and from JSON using a space-efficient serialization strategy based on our KeyedChain representation. This is intended for developers that want to serialise these objects using the current best practice and are not concerned with the details of the process. The to_json API offers the flexibility to convert to JSON directly or to write to a filelike object:

# get a json representation in-memory
json = obj.to_json()

# save to a file directly
obj.to_json(file=filename)

Similarly, you can recreate the object using the from_json classmethod:

# load the object from a json file produced with `to_json`
obj = cls.from_json(file=filename)

# load from a string produced with `to_json`
obj = cls.from_json(content=json)

When your object has recursive references#

In some cases, your object may have recursive references to other objects. For example, you may have objects parent and child, where parent.children includes child and child.parent is parent. This means that the object dependency graph is not a directed acyclic graph. The best solution here is to avoid this design pattern whenever possible. Importantly, the child object in this case cannot be immutable, unless it is only created as a part of the creation of parent.

However, this could be made functionally immutable by (1) requiring that a valid child set its parent attribute exactly once; (2) not storing the child.parent attribute, and instead ensuring it gets set by the parent (e.g., in __init__). Note that this also assumes that all children are provided to parent on initialization, otherwise parent is also mutable.

Another approach here would be to make it so that child was not a GufeTokenizable, and instead explicitly handle its serialization in a more complicated parent._to_dict method, with similarly more complicated parent._from_dict. This approach is particularly useful if the child object isn’t too complicated, and if the child is unlikely to be reused outside the context of the parent.

Understanding the theory: The GUFE key#

One of the important concepts is that every GUFE object has a unique identifier, which we call its key. The key is a string, typically in the format {CLASS_NAME}-{HEXADECIMAL_LABEL}, e.g., ProteinComponent-7338abda590510f1dae764e068a65fdc. For most objects, the hexadecimal label is generated based on the contents of the class – in particular, it is based on contents of the _to_dict dictionary, filtered to remove anything that matches the _defaults dictionary.

This gives the GUFE key a number of important properties:

  • Because the key is based on a cryptographic hash, it is extremely unlikely that two objects that are functionally different will have the same key.

  • Because the key is created deterministically, it is preserved across different creation times, including across different hardware, across different Python sessions, and even within the same Python session.

  • Because the key is based on the non-default attributes, it is preserved across minor versions of the code (since we follow SemVer). The developer-provided _defaults method is used here.

These properties, in particular the stability across Python sessions, make the GUFE key a stable identifier for the object, which means that they can be used for store-by-reference.

Deduplication of GufeTokenizables#

There are two types of deduplication of GufeTokenizables. Objects can be deduplicated on storage to disk because we store by reference to the gufe key. Additionally, objects are deduplicated in memory because we keep a registry of all instantiated GufeTokenizables.

Deduplication in memory#

Memory deduplication means that only one object with a given GUFE key will exist in any single Python session. We ensure this by maintaining a registry of all GufeTokenizables that gets updated any time a GufeTokenizable is created. (This is a mapping to weak references, which allows Python’s garbage collection to clean up GufeTokenizables that are no longer needed.) This is essentially an implementation of the flyweight pattern.

This memory deduplication is ensured by the GufeTokenizable.from_dict, which is typically used in deserialization. It will always use the first object in memory with that key. This can lead to some unexpected behavior; for example, using the Foo class defined above:

>>> a = Foo(0)
>>> b = Foo(0)
>>> a is b
False
>>> c = Foo.from_dict(a.to_dict())
>>> c is a  # surprise!
True
>>> d = Foo.from_dict(b.to_dict())
>>> d is b
False
>>> d is a  # this is because `a` has the spot in the registry
True

Deduplication on disk#

Deduplication in disk storage is fundamentally the responsibility of the specific storage system, which falls outside the scope of gufe. However, gufe provides some tools to facilitate implementation of a storage system.

The main idea is again to use the key to ensure uniqueness, and to use it as a label for the object’s serialized representation. Additionally, the key, as a simple string, can be used as a stand-in for the object, so when an outer GufeTokenizable contains an inner GufeTokenizable, the outer can store the key in place of the inner object. That is, we can store by reference to the key.

To convert a GufeTokenizable obj into a dictionary that references inner GufeTokenizables by key, use obj.to_keyed_dict(). That method replaces each GufeTokenizable by a dict with a single key, ':gufe-key:', mapping to the key of the object. Of course, you’ll also need to do the same for all inner GufeTokenizables; to get a list of all of them, use get_all_gufe_objs() on the outermost obj.