Validating and parsing

Inspired by a discussion in the joinmastodon.org Discord.

Someone's writing an app processing incoming HTTP requests with a Content-Type header. Specifically, they want to handle the case where the content type is application/json.

They noticed some clients are sending an unnecessary charset=utf-8 parameter in the content type, and wondered how they should properly handle this parameter.

They're doing this as a learning exercise so wanted to avoid leaning on an existing library if possible.

The discussion went back and forth between validation and parsing without really getting into the meat of either. The topic is too long for a chat message, and I'm not aware of a good discussion of it elsewhere, hence this post.

What to do?

The difference between validation and parsing

I use the terms to mean the following.

You validate incoming data to decide if your code should continue on the happy path, or branch to an error handler because the data is invalid. The incoming data will not be used again.

You parse incoming data to validate it, and to extract the information to a different type for use elsewhere in the program.

To use Pachli as an example, the code validates the content type of incoming data as application/json because if it's not then an error has occurred. The content type is not used after that.

Pachli parses a server's version string to a dedicated Version type to hold a semantic version because the version information is used elsewhere in the code to make decisions.

Parsing to the dedicated type once means the rest of the code knows the version is valid and can be reasoned about, instead of repeatedly having to re-parse the version string.
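To make the distinction concrete, here is a minimal sketch of both approaches applied to a version string. The names Version, is_valid_version and parse_version are made up for illustration and are not Pachli's actual code. The sketch also assumes a plain major.minor.patch string, which is simpler than full semantic versioning.

```python
import re
from dataclasses import dataclass

# Assumption: only "major.minor.patch" is accepted. Real semantic versions
# also allow pre-release and build metadata, which this sketch ignores.
_VERSION_RE = re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)')


@dataclass(frozen=True, order=True)
class Version:
    """Hypothetical parsed version; ordering compares field by field."""
    major: int
    minor: int
    patch: int


def is_valid_version(s: str) -> bool:
    """Validation: answers yes or no and keeps nothing."""
    return _VERSION_RE.fullmatch(s.strip()) is not None


def parse_version(s: str) -> Version:
    """Parsing: validates and returns a value the rest of the code can use."""
    m = _VERSION_RE.fullmatch(s.strip())
    if m is None:
        raise ValueError(f'not a version: {s!r}')
    return Version(*(int(g) for g in m.groups()))
```

Callers of parse_version can compare versions directly, for example parse_version('4.10.0') > parse_version('4.9.0'), without touching the original string again.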

The Content-Type header

The Content-Type header is defined in RFC 9110, section 8.3. Using ABNF, the grammar for the header is:

token          = 1*tchar

tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                 / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
                 / DIGIT / ALPHA
                 ; any VCHAR, except delimiters

parameters      = *( OWS ";" OWS [ parameter ] )
parameter       = parameter-name "=" parameter-value
parameter-name  = token
parameter-value = ( token / quoted-string )
  
media-type = type "/" subtype parameters
type       = token
subtype    = token

Content-Type = media-type

It also notes

The type and subtype tokens are case-insensitive.

and gives these examples

text/html;charset=utf-8
Text/HTML;Charset="utf-8"
text/html; charset="utf-8"
text/html;charset=UTF-8
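Two of those examples use the quoted-string form of the parameter value. As a sketch, and assuming the parameter has already been split into a name and a value, a hypothetical helper to recover the unquoted value could look like this. It relies on the quoted-string definition in RFC 9110 sect. 5.6.4, where a backslash escapes the following character.

```python
import re


def unquote_parameter_value(value: str) -> str:
    """
    Hypothetical helper: if `value` is a quoted-string, drop the
    surrounding double quotes and resolve backslash escapes.
    Anything else is returned unchanged.
    """
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        # A quoted-pair is a backslash followed by one character.
        return re.sub(r'\\(.)', r'\1', value[1:-1])
    return value
```

With this, utf-8, "utf-8" and UTF-8 all compare equal after unquoting and lower-casing. It does not help with splitting: a quoted value may itself contain a ;, so splitting the header on ; before unquoting is only an approximation.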

RFC 8259 defines the application/json media type, and is explicit that this type does not have a charset parameter, writing:

Note: No “charset” parameter is defined for this registration. Adding one really has no effect on compliant recipients.

What to do?

The original questioner can now decide whether they want to validate the header as application/json, or parse the header to a more specific set of types.

Validation

From the definition of the Content-Type header and the application/json media type we can see a sensible strategy for validation would be:

  1. Remove everything from the header value after-and-including the first occurrence of ;
  2. Trim any whitespace from the start and end of the string
  3. Perform a case-insensitive comparison of the value with application/json

Step 1 handles clients or servers that emit variations of application/json;charset=utf-8 (or any other charset). The parameter is not defined for application/json, but ignoring it is unlikely to impede interoperability.

Step 2 is partly leniency and partly required by the grammar. The specification does not allow whitespace before the primary type (application), but it does allow optional whitespace (OWS) between the subtype and the ; that was removed in step 1. Trimming both ends is harmless and improves interoperability.

Step 3 validates the header contains the correct value.
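The three steps can be sketched in a few lines of Python. The function name is_application_json is just a name chosen for this example.

```python
def is_application_json(header_value: str) -> bool:
    """Validate a Content-Type header value as application/json."""
    # Step 1: drop the first ";" and everything after it.
    media_type = header_value.split(';', 1)[0]
    # Step 2: trim whitespace from both ends.
    media_type = media_type.strip()
    # Step 3: case-insensitive comparison.
    return media_type.lower() == 'application/json'
```

The result is a plain bool. Nothing from the header survives the check, which is the point of validating rather than parsing.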

Parsing

Parsing the header is appropriate if the application is going to branch based on the type and subtype. For example, different handlers for application/json and application/activity+json.

To do that I would create specific Python types to represent the parameters and types that can appear in a Content-Type header, and have a function that parses the header and returns one of those types.

The types should include an explicit Unknown value to indicate an unhandled type. The parsing function can either raise an error on malformed input or return an additional error type.

For example:

import re
from dataclasses import dataclass

# This is an example to demonstrate parsing a value to one of a
# specific set of types, distinct from simply validating the value
# matches expectations.


@dataclass
class BaseParameter:
    """
    Base class for all parameters.

    value: Value of the parameter as it appeared in the header.
    """
    value: str


@dataclass
class Charset(BaseParameter):
    """A 'charset=...' parameter."""
    pass


@dataclass
class UnknownParameter(BaseParameter):
    """
    An unknown parameter.
    
    name: Parsed name of the parameter
    value: Parsed value of the parameter
    original: Original parameter string as it appeared in the header.
    """
    name: str
    original: str


Parameter = Charset | UnknownParameter
"""Parameters that can occur in a content-type header."""


@dataclass
class ApplicationJson:
    """The application/json content type."""
    pass


@dataclass
class ApplicationActivityPlusJson:
    """
    The application/activity+json content type

    charset: Parsed value of the 'charset' parameter, if present.
    """
    charset: Charset | None


@dataclass
class UnknownContentType:
    """
    An unknown content type.

    type: Parsed name of the type
    parameters: Parsed list of parameters, if present.
    original: Original content type as it appeared in the header.
    """
    type: str
    parameters: list[Parameter]
    original: str


ContentType = ApplicationJson | ApplicationActivityPlusJson | UnknownContentType
"""Possible content type values."""


# THIS IS NOT A PRODUCTION QUALITY PARSER
def parse_content_type(val: str) -> ContentType:
    m = re.search(r'^(?P<type>[^;]+)(?:;(?P<parameters>.*))?', val)
    if m is None:
        raise ValueError(f'not a content type: {val!r}')
    # Type and subtype are case-insensitive, so normalise before matching.
    media_type = m.group('type').strip().lower()
    match media_type:
        case 'application/json':
            # No parameters are defined for application/json; ignore any.
            return ApplicationJson()
        case 'application/activity+json':
            parameters = parse_parameters(m.group('parameters'))
            # First charset parameter, or None if there isn't one.
            charset = next(
                (p for p in parameters if isinstance(p, Charset)), None)
            return ApplicationActivityPlusJson(charset)
        case _:
            parameters = parse_parameters(m.group('parameters'))
            return UnknownContentType(media_type, parameters, val)


# THIS IS NOT A PRODUCTION QUALITY PARSER
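def parse_parameters(val: str | None) -> list[Parameter]:
    if val is None:
        return []

    parameters: list[Parameter] = []
    for parameter_str in val.split(';'):
        parameter_str = parameter_str.strip()
        # The grammar allows empty parameters ("text/html;"), and this
        # example skips anything without an "=" rather than failing.
        if '=' not in parameter_str:
            continue
        name, value = re.split(r'\s*=\s*', parameter_str, maxsplit=1)
        match name.lower():
            case 'charset':
                parameters.append(Charset(value.lower()))
            case _:
                # Keyword arguments, because the inherited 'value' field
                # comes first in the generated __init__.
                parameters.append(UnknownParameter(
                    name=name, value=value, original=parameter_str))
    return parameters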


if __name__ == "__main__":
    print(parse_content_type('application/json'))
    print(parse_content_type('application/JSON'))
    print(parse_content_type('  application/json  '))
    print(parse_content_type('application/json ; charset=utf-8'))
    print(parse_content_type('application/activity+json'))
    print(parse_content_type('application/activity+json;charset=us-ascii'))
    print(parse_content_type('made/up; charset=utf-8; some=parameter'))

    match parse_content_type('application/activity+json;charset=us-ascii'):
        case ApplicationJson():
            print('It was some form of application/json')
        case ApplicationActivityPlusJson(charset):
            print('It was some form of application/activity+json')
            if charset is not None:
                print(f'Charset was {charset.value}')
        case UnknownContentType(type, parameters, original):
            print(f'Content type "{original}" was not recognised')

Anything else?

Additional checks could be performed.