Data parsing & validation

Prerequisites

This tutorial explains how pydantic is used in MMIC to parse and validate schemas. It assumes you are familiar with the Python programming language and the basics of MM components. If not, you should first go over the “getting started” tutorial.

Introduction

Python does not inherently enforce the types of function arguments or variables at runtime. Instead, Python allows type hints to be specified in order to inform users and other developers what a function expects and what it returns.

Pydantic is a Python package that builds on type hints. It enforces them at runtime, so that the values held by an object are validated when the object is created.

For primitive types, that is str, int, float, and bool, this simply takes the form of an annotation.

def name_typing(name: str) -> str:
    return name

The parameter name is annotated with a : followed by the type; in this case, we expect name to be a str. The function’s return type is hinted as well: after the parameter list, but before the final :, we write -> followed by the return type, which here is also a str.

Here we test the function with a couple of variables.

name_str = 'John'
name_num = 5
print(name_typing(name_str))
print(name_typing(name_num))

We get the following output:

John
5

Python does not enforce the type hints, so the function returns whatever it was given and both values are printed without error. If we want more complex types, such as List, Dict, Union, or Optional, we need to use Python’s typing library.

from typing import List, Dict, Union, Optional

We can then utilize them in type hints.

registrant_names: List[str]
birthday: Optional[str]
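
As a small illustration that such hints are recorded but not enforced, the sketch below annotates a function with List and Optional and then calls it with a value that violates the hint. The function register and its body are made up for this example.

```python
from typing import List, Optional, get_type_hints


def register(registrant_names: List[str], birthday: Optional[str] = None) -> int:
    """Hypothetical helper: return how many registrants were passed in."""
    return len(registrant_names)


# The hints are stored on the function and can be inspected ...
hints = get_type_hints(register)
print(hints["return"])  # <class 'int'>

# ... but they are not checked: a tuple of ints is accepted without complaint.
count = register((1, 2, 3))
print(count)  # 3
```

Plain Python runs the call even though the argument is neither a list nor made of strings, which is the gap pydantic is meant to close.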

Pydantic BaseModel

[TODO: Cover Base Pydantic model, and why it is used.]

Each component in MMIC has two schemas, an Input Schema and an Output Schema. Depending on the component, the two schemas may be the same or different. To define a schema, we use a pydantic model to declare a set of variables with specific types, so that a component can run on any input that conforms to its Input Schema.

The model used by pydantic is the BaseModel. We can import it from pydantic and then inherit from it when we create a schema.

from pydantic import BaseModel

class Schema(BaseModel):
    pass  # the typed variables of the schema go here

The benefit of inheriting from BaseModel is the validation that comes with it. When an instance of a model is created, the value assigned to each variable is validated against the variable’s declared type.

class SchemaExample(BaseModel):
    var1: int
    var2: str
    var3: Optional[float]
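
A minimal sketch of creating an instance of this model is shown below. The explicit = None default on var3 is an addition made here: depending on the pydantic version, an Optional variable without a default may be treated as required, so the default is spelled out to keep the example portable.

```python
from typing import Optional

from pydantic import BaseModel


class SchemaExample(BaseModel):
    var1: int
    var2: str
    # Explicit default (an addition here) so var3 may be omitted in any pydantic version.
    var3: Optional[float] = None


# Values are validated against the annotations when the instance is created.
example = SchemaExample(var1=1, var2="text")
print(example.var1, example.var2, example.var3)  # 1 text None
```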

To provide further information about each variable, we can use the Field function from pydantic.

from pydantic import BaseModel, Field
from typing import Optional

class SchemaExample(BaseModel):
    var1: int = Field(..., description="This is the first variable for the example schema.")
    var2: str = Field(..., description="This is the second variable for the example schema.")
    var3: Optional[float]

The first argument to Field is a positional argument giving the default value of the variable. Passing ... tells pydantic that the variable has no default value and is therefore required. Field accepts a number of other keyword arguments, which are listed in the pydantic documentation; description is a particularly useful one, as it allows more detailed documentation of each variable.
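
One way to see where the descriptions end up is to inspect the schema that pydantic generates for the model. This is a sketch under an assumption about versions: the schema method is named schema() in pydantic 1.x and model_json_schema() in 2.x, so the code picks whichever is available. The explicit default on var3 is also an addition for portability.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SchemaExample(BaseModel):
    var1: int = Field(..., description="This is the first variable for the example schema.")
    var2: str = Field(..., description="This is the second variable for the example schema.")
    var3: Optional[float] = None  # explicit default added for this sketch


# Assumption: `schema()` exists in pydantic 1.x and `model_json_schema()` in 2.x;
# use whichever this installation provides.
make_schema = getattr(SchemaExample, "model_json_schema", None) or SchemaExample.schema
properties = make_schema()["properties"]

# The description passed to Field is carried into the generated schema.
print(properties["var1"]["description"])

# Variables declared with ... are listed as required.
print(sorted(make_schema()["required"]))  # ['var1', 'var2']
```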

For a more concrete example, we can look at the following SimpleAtom model.

from pydantic import BaseModel, Field
from typing import Optional

class SimpleAtom(BaseModel):
    element: str = Field(
        ...,
        description="Element name e.g. C"
    )
    mass_number: int = Field(
        ...,
        description="Atomic mass number."
    )
    x: Optional[float] = Field(
        None,
        description="Atomic position along the x-axis."
    )

Note that the ellipsis ... passed to Field indicates a required field. Hence, element and mass_number are required fields whereas x can be optionally assigned and by default take the value None.
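
The difference between required and optional fields can be checked with a short sketch like the one below. It repeats the SimpleAtom model so that it runs on its own; the variable names carbon and missing are only illustrative.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class SimpleAtom(BaseModel):
    element: str = Field(..., description="Element name e.g. C")
    mass_number: int = Field(..., description="Atomic mass number.")
    x: Optional[float] = Field(None, description="Atomic position along the x-axis.")


# x is omitted, so it falls back to its default of None.
carbon = SimpleAtom(element="C", mass_number=12)
print(carbon.x)  # None

# Leaving out a required field is reported as a validation error.
try:
    SimpleAtom(element="C")
    missing = []
except ValidationError as err:
    missing = [e["loc"][0] for e in err.errors()]
print(missing)  # ['mass_number']
```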

We will also introduce another model, Atom, that describes an atom in 3D.

class Atom(SimpleAtom):
    y: Optional[float] = Field(
        None,
        description="Atomic position along the y-axis."
    )
    z: Optional[float] = Field(
        None,
        description="Atomic position along the z-axis."
    )

The first thing to notice about the Atom class is that it does not inherit directly from pydantic’s BaseModel class. Because BaseModel is still part of its inheritance tree (through SimpleAtom), typing is still enforced on the variables within the schema. As with normal inheritance, Atom inherits all of the typed variables from its parent class.

Therefore, the Atom model contains all the typed variables from the SimpleAtom schema (element, mass_number, and x), and adds two new variables, y and z, both of type Optional[float].
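
The inheritance can be checked with a short, self-contained sketch. Both models are repeated here in compact form, and the values passed in are arbitrary.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class SimpleAtom(BaseModel):
    element: str = Field(..., description="Element name e.g. C")
    mass_number: int = Field(..., description="Atomic mass number.")
    x: Optional[float] = Field(None, description="Atomic position along the x-axis.")


class Atom(SimpleAtom):
    y: Optional[float] = Field(None, description="Atomic position along the y-axis.")
    z: Optional[float] = Field(None, description="Atomic position along the z-axis.")


# Atom accepts the inherited variables as well as its own.
atom = Atom(element="C", mass_number=12, x=0.0, y=1.5, z=-2.0)
print(isinstance(atom, SimpleAtom), issubclass(Atom, BaseModel))  # True True
print(atom.element, atom.x, atom.y, atom.z)  # C 0.0 1.5 -2.0

# Typing is enforced on the new variables too.
try:
    Atom(element="C", mass_number=12, y="not a number")
    bad_fields = []
except ValidationError as err:
    bad_fields = [e["loc"][0] for e in err.errors()]
print(bad_fields)  # ['y']
```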

Schema Validation

Pydantic performs validation on all the variables in a schema. By default, this means that if you specify a type for a variable and pass it a value of a different type, pydantic raises an error whose message describes the problem. Note that schema variables are passed as keyword arguments, not positional arguments.

example = SchemaExample(var1="Hello", var2=5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pydantic/main.py", line 362, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for SchemaExample
var1
  value is not a valid integer (type=type_error.integer)

We can see that it reported a single validation error. Pydantic will try to coerce primitive types when a simple conversion exists. In this case, since var2 is declared as a str and the integer 5 can be converted to a string losslessly, pydantic assumes 5 was meant to be passed as a string. However, var1 expects an int and was passed a string that cannot be converted to one, so pydantic reports which variable had the error and what type was expected.
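
Instead of letting the traceback end the program, the error can be caught and inspected. The sketch below is written under the assumption that a numeric string is converted to int, which is one of the simple conversions pydantic attempts; exactly which conversions are tried can differ between pydantic versions. The explicit default on var3 is an addition for portability.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class SchemaExample(BaseModel):
    var1: int
    var2: str
    var3: Optional[float] = None  # explicit default added for this sketch


# A numeric string has a simple conversion to int, so it is coerced.
converted = SchemaExample(var1="5", var2="text")
print(type(converted.var1).__name__, converted.var1)  # int 5

# "Hello" has no integer form, so validation fails for var1 only.
try:
    SchemaExample(var1="Hello", var2="text")
    failed = []
except ValidationError as err:
    failed = [e["loc"] for e in err.errors()]
print(failed)  # [('var1',)]
```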

Single field validation

As schemas become more complicated, simple assignment validation may no longer be sufficient. In addition, you may wish to assign additional conditions to the values of variables outside of simply their type. Pydantic supports custom validators to cover these cases.

To write a custom validator, you must import a few additional things from pydantic.

from pydantic import validator, ValidationError

This imports the decorator @validator, which marks a method as a validator, and ValidationError, which can be caught to check whether validation failed.

from pydantic import BaseModel, Field, validator
from typing import Optional

class SchemaExample(BaseModel):
    var1: int = Field(..., description="This is the first variable for the example schema.")
    var2: str = Field(..., description="This is the second variable for the example schema.")
    var3: Optional[float]

    @validator('var2')
    def var2_alphanumeric(cls, v):
        assert v.isalnum(), 'var2 must be alphanumeric'
        return v

We have added a custom validator to ensure that var2 consists only of alphanumeric characters. The decorator @validator('var2') specifies that the function that follows performs validation on the variable var2. The first parameter is cls because validators are class methods: they run on the SchemaExample class and not on an instance. By pydantic convention, v is the value being assigned to the variable.

Here we can test the validator to ensure it is working.

example = SchemaExample(var1=5, var2='hellos12345_@#$')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pydantic/main.py", line 362, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for SchemaExample
var2
  var2 must be alphanumeric (type=assertion_error)

A more thorough description of validators and how to write custom ones can be found within the pydantic documentation.
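
The sketch below shows how ValidationError might be caught when a custom validator rejects a value. It assumes a pydantic release that still provides the validator decorator (newer major versions deprecate it in favour of field_validator). The helper check is hypothetical, and the explicit default on var3 is an addition for portability.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, validator


class SchemaExample(BaseModel):
    var1: int = Field(..., description="This is the first variable for the example schema.")
    var2: str = Field(..., description="This is the second variable for the example schema.")
    var3: Optional[float] = None  # explicit default added for this sketch

    @validator("var2")
    def var2_alphanumeric(cls, v):
        assert v.isalnum(), "var2 must be alphanumeric"
        return v


def check(var2):
    """Hypothetical helper: return None if var2 is accepted, else the error text."""
    try:
        SchemaExample(var1=5, var2=var2)
    except ValidationError as err:
        return str(err)
    return None


print(check("hello12345"))  # None

# The assertion message from the validator appears in the reported error.
message = check("hellos12345_@#$")
print("var2 must be alphanumeric" in message)  # True
```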

Multiple field validation

We can also apply the same validator to multiple fields, or to all of them. For example, let’s revisit the SimpleAtom model we defined earlier and redefine its schema.

from pydantic import BaseModel, Field, validator
from typing import Optional

class SimpleAtom(BaseModel):
    atomic_number: int = Field(
        ...,
        description="Atomic number, must be > 0."
    )
    mass_number: int = Field(
        ...,
        description="Atomic mass number, must be > 0."
    )
    x: Optional[float] = Field(
        None,
        description="Atomic position along the x-axis."
    )

    @validator("atomic_number", "mass_number")
    def _must_be_positive(cls, v):
        if v <= 0:
            raise ValueError("atomic_number and mass_number must be > 0!")
        return v

In this model, the atomic_number and mass_number are required fields of type int. In order to ensure both integers are always positive, we define the _must_be_positive validator that is applied to both fields.

Let’s test the validator and make sure it works as expected.

SimpleAtom(atomic_number=1, mass_number=0)

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    SimpleAtom(atomic_number=1, mass_number=0)
  File "pydantic/main.py", line 400, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for SimpleAtom
mass_number
  atomic_number and mass_number must be > 0! (type=value_error)
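
When both fields are invalid, the shared validator is expected to report one error per field. The sketch below repeats the model so that it runs on its own, and again assumes a pydantic release that still provides the validator decorator.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, validator


class SimpleAtom(BaseModel):
    atomic_number: int = Field(..., description="Atomic number, must be > 0.")
    mass_number: int = Field(..., description="Atomic mass number, must be > 0.")
    x: Optional[float] = Field(None, description="Atomic position along the x-axis.")

    @validator("atomic_number", "mass_number")
    def _must_be_positive(cls, v):
        if v <= 0:
            raise ValueError("atomic_number and mass_number must be > 0!")
        return v


# Both values violate the rule, so each field gets its own error entry.
try:
    SimpleAtom(atomic_number=-1, mass_number=0)
    rejected = []
except ValidationError as err:
    rejected = sorted(e["loc"][0] for e in err.errors())
print(rejected)  # ['atomic_number', 'mass_number']

# Positive values pass through the validator unchanged.
hydrogen = SimpleAtom(atomic_number=1, mass_number=1)
print(hydrogen.mass_number)  # 1
```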

[TODO: cover root_validator]

ProtoModel

For the MMIC project, we use an extension of the pydantic BaseModel called ProtoModel. It was originally developed for the QCArchive project and can be found in the QCElemental package.

Any schema based on this ProtoModel gains a few additional features.

ProtoModel provides several ways to parse a model. First, a model object can be parsed from a raw string or bytes using the parse_raw() method. This is useful if the values for a model are held in a serialized form, such as json or msgpack-ext. A model can also be parsed from a file using the parse_file() method. Currently supported encodings are json, msgpack-ext, and pickle.

Similarly, a model can be serialized into json, json-ext, or msgpack-ext. This allows the model to be shipped around as a set of bytes or as a string.

Finally, ProtoModel allows instances to be compared. The compare() method can be used to recursively compare the current object to a given object.
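
The idea of serializing a model and parsing it back can be illustrated without QCElemental. The sketch below does not call ProtoModel’s parse_raw(), parse_file(), or compare(); it uses a plain pydantic BaseModel, the standard json module, and ordinary equality as stand-ins, so it only approximates what ProtoModel offers.

```python
import json
from typing import Optional

from pydantic import BaseModel, Field


class SimpleAtom(BaseModel):
    element: str = Field(..., description="Element name e.g. C")
    mass_number: int = Field(..., description="Atomic mass number.")
    x: Optional[float] = Field(None, description="Atomic position along the x-axis.")


carbon = SimpleAtom(element="C", mass_number=12, x=0.5)

# Iterating a model yields (name, value) pairs, so dict() gives plain data
# that the standard json module can turn into a string.
raw = json.dumps(dict(carbon))

# Parsing the string and re-creating the model validates the data again.
restored = SimpleAtom(**json.loads(raw))
print(restored == carbon)  # True
```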