Validating and parsing
Inspired by a discussion in the joinmastodon.org Discord.
Someone's writing an app processing incoming HTTP requests with a Content-Type header. Specifically, they want to handle the case where the content type is application/json.
They noticed some clients are sending an unnecessary charset=utf-8 parameter in the content type, and wondered how they should properly handle this parameter.
They're doing this as a learning exercise so wanted to avoid leaning on an existing library if possible.
The discussion went back and forth a little between validation and parsing without really getting into the meat of it. The topic is too long for a chat message, and I'm not aware of a good discussion of it elsewhere, hence this post.
What to do?
The difference between validation and parsing
I use the terms to mean the following.
You validate incoming data to decide if your code should continue on the happy path, or branch to an error handler because the data is invalid. The incoming data will not be used again.
You parse incoming data to validate it, and to extract the information to a different type for use elsewhere in the program.
To use Pachli as an example, the code validates the content type of incoming data as application/json because if it's not then an error has occurred. The content type is not used after that.
Pachli parses a server's version string to a dedicated Version type to hold a semantic version because the version information is used elsewhere in the code to make decisions.
Parsing to the dedicated type once means the rest of the code knows the version is valid and can be reasoned about, instead of repeatedly having to re-parse the version string.
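The distinction can be sketched in a few lines of Python. This is only an illustration and is not Pachli's code: the Version class, is_valid_version, and parse_version are hypothetical names, and the sketch assumes a plain major.minor.patch version string rather than full semantic versioning.

```python
import re
from dataclasses import dataclass

# Assumption for this sketch: a version is exactly "major.minor.patch".
_VERSION_RE = re.compile(r'(\d+)\.(\d+)\.(\d+)')


@dataclass(frozen=True, order=True)
class Version:
    """Hypothetical parsed version. Field order defines comparison order."""
    major: int
    minor: int
    patch: int


def is_valid_version(val: str) -> bool:
    """Validation: answer yes or no, and keep nothing from the input."""
    return _VERSION_RE.fullmatch(val.strip()) is not None


def parse_version(val: str) -> Version:
    """Parsing: validate, and also return a value of a dedicated type."""
    m = _VERSION_RE.fullmatch(val.strip())
    if m is None:
        raise ValueError(f'not a version string: {val!r}')
    return Version(*(int(g) for g in m.groups()))
```

Code that only needs a yes/no answer calls is_valid_version and moves on. Code that needs to make decisions later calls parse_version once and passes the Version around, so a comparison such as parse_version('4.10.0') > parse_version('4.9.3') works on numbers rather than on strings.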
The Content-Type header
The Content-Type header is defined in RFC 9110 sect. 8.3. Using ABNF, the grammar for the header is:
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*"
      / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
      / DIGIT / ALPHA
      ; any VCHAR, except delimiters
parameters = *( OWS ";" OWS [ parameter ] )
parameter = parameter-name "=" parameter-value
parameter-name = token
parameter-value = ( token / quoted-string )
media-type = type "/" subtype parameters
type = token
subtype = token
Content-Type = media-type
It also notes
The type and subtype tokens are case-insensitive.
and gives these examples
text/html;charset=utf-8
Text/HTML;Charset="utf-8"
text/html; charset="utf-8"
text/html;charset=UTF-8
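As a rough illustration of how the grammar reads, the token rule can be written as a regular expression. This is a sketch only; is_token is a hypothetical helper name and does not come from the RFC.

```python
import re

# tchar from the grammar above: the listed punctuation, plus DIGIT and ALPHA.
_TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"

# token = 1*tchar, i.e. one or more tchar.
_TOKEN_RE = re.compile(_TCHAR + '+')


def is_token(val: str) -> bool:
    """Return True if val matches the token rule (hypothetical helper)."""
    return _TOKEN_RE.fullmatch(val) is not None
```

With this, is_token('charset') and is_token('utf-8') are true. is_token('text/html') is false because "/" is a delimiter that separates two tokens, and is_token('"utf-8"') is false because a quoted value is a quoted-string rather than a token.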
RFC 8259 defines the application/json media type, and is explicit that this type does not have a charset parameter, writing:
Note: No “charset” parameter is defined for this registration. Adding one really has no effect on compliant recipients.
What to do?
The original questioner can now decide whether they want to validate the header as application/json, or parse the header to a more specific set of types.
Validation
From the definition of the Content-Type header and the application/json media type we can see a sensible strategy for validation would be:
1. Remove everything from the header value after-and-including the first occurrence of ;
2. Trim any whitespace from the start and end of the string
3. Perform a case-insensitive comparison of the value with application/json
Step 1 handles malformed clients or servers that emit variations of application/json;charset=utf-8 (or any other charset). While technically not correct this is unlikely to impede interoperability.
Step 2 is mostly for malformed clients or servers. The specification does not allow spaces before the primary type (application), but there may legitimately be spaces between the subtype and the ; that was removed in step 1, because the grammar permits optional whitespace (OWS) there. Either way, removing these should be harmless and improves interoperability.
Step 3 validates the header contains the correct value.
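The three steps above can be sketched as a single function. This is a minimal illustration rather than a hardened implementation, and is_application_json is a hypothetical name chosen for the example.

```python
def is_application_json(header_value: str) -> bool:
    """Validate a Content-Type header value as application/json (sketch)."""
    # Step 1: drop everything from the first ';' onwards.
    media_type = header_value.split(';', 1)[0]
    # Step 2: trim whitespace from the start and end.
    media_type = media_type.strip()
    # Step 3: case-insensitive comparison with the expected value.
    return media_type.lower() == 'application/json'
```

The function returns a plain bool because validation only decides which path the code takes. Nothing from the header is kept for later use, so values such as 'Application/JSON; charset=utf-8' and ' application/json ' are accepted, and 'application/activity+json' is rejected.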
Parsing
Parsing the header is appropriate if the application is going to branch based on the type and subtype. For example, different handlers for application/json and application/activity+json.
To do that I would create specific Python types to represent the parameters and types that can appear in a Content-Type header, and have a function that parses the header and returns one of those types.
The types should include an explicit Unknown value to indicate an unhandled type, and the parsing function can either raise errors or return an additional error type.
For example:
import re
from dataclasses import dataclass

# This is an example to demonstrate parsing a value to one of a
# specific set of types, distinct from simply validating the value
# matches expectations.


@dataclass
class BaseParameter:
    """
    Base class for all parameters.

    value: Value of the parameter as it appeared in the header.
    """
    value: str


@dataclass
class Charset(BaseParameter):
    """A 'charset=...' parameter."""
    pass


@dataclass
class UnknownParameter(BaseParameter):
    """
    An unknown parameter.

    name: Parsed name of the parameter
    value: Parsed value of the parameter
    original: Original parameter string as it appeared in the header.
    """
    name: str
    original: str


Parameter = Charset | UnknownParameter
"""Parameters that can occur in a content-type header."""


@dataclass
class ApplicationJson:
    """The application/json content type."""
    pass


@dataclass
class ApplicationActivityPlusJson:
    """
    The application/activity+json content type

    charset: Parsed value of the 'charset' parameter, if present.
    """
    charset: Charset | None


@dataclass
class UnknownContentType:
    """
    An unknown content type.

    type: Parsed name of the type
    parameters: Parsed list of parameters, if present.
    original: Original content type as it appeared in the header.
    """
    type: str
    parameters: list[Parameter]
    original: str


ContentType = ApplicationJson | ApplicationActivityPlusJson | UnknownContentType
"""Possible content type values."""


# THIS IS NOT A PRODUCTION QUALITY PARSER
def parse_content_type(val: str) -> ContentType:
    # Everything before the first ';' is the type, the rest is parameters.
    r = re.compile(r'^(?P<type>[^;]+)(?:;(?P<parameters>.*))?')
    m = r.search(val)
    if m is None:
        raise ValueError(f'not a content type: {val!r}')

    # Type and subtype are case-insensitive, so normalise before matching.
    type = m.group('type').lower().strip()

    match type:
        case 'application/json':
            return ApplicationJson()
        case 'application/activity+json':
            parameters = parse_parameters(m.group('parameters'))
            # Use the first charset parameter, or None if there isn't one.
            charset = next(
                (e for e in parameters if isinstance(e, Charset)), None)
            return ApplicationActivityPlusJson(charset)
        case _:
            parameters = parse_parameters(m.group('parameters'))
            return UnknownContentType(type, parameters, val)
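

# THIS IS NOT A PRODUCTION QUALITY PARSER
def parse_parameters(val: str | None) -> list[Parameter]:
    if val is None:
        return []

    r = []
    for parameter_str in val.split(';'):
        parameter_str = parameter_str.strip()
        # The grammar allows empty parameters, e.g. from a trailing ';'.
        if not parameter_str:
            continue
        parts = re.split(r'\s*=\s*', parameter_str, maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f'malformed parameter: {parameter_str!r}')
        (name, value) = parts
        match name.lower():
            case 'charset':
                r.append(Charset(value.lower()))
            case _:
                # Keyword arguments, because the inherited 'value' field
                # comes first in the dataclass field order.
                r.append(UnknownParameter(
                    name=name, value=value, original=parameter_str))
    return r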


if __name__ == "__main__":
    print(parse_content_type('application/json'))
    print(parse_content_type('application/JSON'))
    print(parse_content_type(' application/json '))
    print(parse_content_type('application/json ; charset=utf-8'))
    print(parse_content_type('application/activity+json'))
    print(parse_content_type('application/activity+json;charset=us-ascii'))
    print(parse_content_type('made/up; charset=utf-8; some=parameter'))

    match parse_content_type('application/activity+json;charset=us-ascii'):
        case ApplicationJson():
            print('It was some form of application/json')
        case ApplicationActivityPlusJson(charset):
            print('It was some form of application/activity+json')
            if charset is not None:
                print(f'Charset was {charset.value}')
        case UnknownContentType(type, parameters, original):
            print(f'Content type "{original}" was not recognised')
Anything else?
Additional checks could be performed.
- Instead of ignoring the charset parameter, parse the value and confirm it is utf-8. If it is not then signal the error in some fashion so the operator of the offending client/server can be informed. For example, receiving application/json;charset=us-ascii should be a red flag somewhere.
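One way such a check might look is sketched below. The function name unexpected_charset is hypothetical, and the sketch only strips surrounding double quotes from the value rather than implementing the full quoted-string rules.

```python
from typing import Optional


def unexpected_charset(header_value: str) -> Optional[str]:
    """Return the charset if one is present and is not utf-8, else None."""
    # Everything after the first ';' is the parameter list.
    _, _, parameters = header_value.partition(';')
    for parameter in parameters.split(';'):
        name, sep, value = parameter.partition('=')
        # Skip empty or malformed parameters and anything that isn't charset.
        if not sep or name.strip().lower() != 'charset':
            continue
        # Simplification: remove surrounding quotes, ignore escapes.
        charset = value.strip().strip('"').lower()
        if charset != 'utf-8':
            return charset
    return None
```

A caller could log or report the returned value, for example when unexpected_charset('application/json;charset=us-ascii') returns 'us-ascii', while still accepting the request if that is the desired policy.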