Store Message Bodies as Binary Blobs

Registered by Kurt Griffiths on 2013-04-25

Rather than storing messages as JSON or BSON, we should use msgpack blobs in our storage drivers.

TODO: Create GIGO matrix for JSON, msgpack, AMQP

This provides several important benefits:

1. Incoming messages that we receive as msgpack or AMQP may include binary bodies or, in the case of msgpack, a body that mixes Unicode strings and binary strings. pymongo's BSON package requires a six.binary_type value to be flagged explicitly (at least on py2) if it doesn't contain valid UTF-8 sequences. This is done by wrapping the string in the bson.Binary class. Crawling the incoming message and decorating every six.binary_type field is impractical. Instead, we can simply marshal to msgpack, optionally compress the result (see 3), then wrap the final binary result in bson.Binary before passing it to pymongo.
2. In the case of other data stores, such as those accessed through SQLAlchemy, persisting as JSON will not work because strings must be valid Unicode (similar to BSON). If we change to msgpack and use its native bin type, six.binary_type values need not be valid UTF-8.
3. Moving to an opaque (vs. BSON) binary format for storage means we can also employ compression, so that data stores can keep more messages in RAM and fewer bytes need to be sent over the wire between web heads and storage nodes. (Compression will not be implemented as part of this blueprint, but moving to an opaque binary storage format opens the door to supporting it in the future.)
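
As a rough illustration of the marshal-then-optionally-compress idea, the sketch below packs a body that mixes text and binary fields and reads it back. The helper names pack_body and unpack_body are hypothetical, and zlib is only one possible compressor; neither is prescribed by this blueprint. It assumes a recent release of the msgpack package, where raw=False plays the role of the older encoding='utf-8' option. For MongoDB, the resulting blob would additionally be wrapped in bson.Binary, which is left out here to keep the example free of a pymongo dependency.

```python
import zlib

import msgpack  # third-party package: pip install msgpack


def pack_body(body, compress=False):
    """Serialize a message body to an opaque blob (hypothetical helper)."""
    # use_bin_type=True writes bytes using the msgpack bin type and text
    # using the str type, so the two can be told apart when unpacking.
    blob = msgpack.packb(body, use_bin_type=True)
    return zlib.compress(blob) if compress else blob


def unpack_body(blob, compressed=False):
    """Inverse of pack_body (hypothetical helper)."""
    if compressed:
        blob = zlib.decompress(blob)
    # raw=False decodes msgpack str values as UTF-8 text; bin values
    # come back as bytes.
    return msgpack.unpackb(blob, raw=False)


# The payload is not valid UTF-8, so it could not be stored as a plain string.
body = {"event": "backup", "payload": b"\xff\xfe\x00"}
blob = pack_body(body, compress=True)
restored = unpack_body(blob, compressed=True)
```

Because the blob is opaque to the data store, compression could be switched on later without changing how the driver hands data to the backend.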

Important Considerations

1. Migrating existing deployments (we might add a version field to our data store schemas to help detect and migrate from previous schema versions). This may be as simple as: if a message doesn't have a content-type field, assume it is an old message stored as BSON (mongo) or JSON (sqla).
2. This should work with both v1.0 and v1.1 of the API
3. Features requiring binary storage should be implemented as part of the v1.1 API, which will be released in tandem with this blueprint implementation
4. Queue metadata will also need to be stored in the same binary blob format as message bodies.
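
The content-type fallback suggested in (1) might look something like the sketch below. The record layout (a dict with "body" and an optional "content_type" key) and the load_body helper are assumptions made for illustration, not the actual driver schema. The decoder for new-style blobs is passed in as unpack_blob so that the sketch does not commit to a particular unpacking call.

```python
import json


def load_body(record, unpack_blob):
    """Decode a stored message body, falling back for legacy records.

    `record` is a hypothetical dict-shaped row or document; `unpack_blob`
    is whatever callable the driver uses to decode the new binary format.
    """
    if "content_type" not in record:
        # Legacy record, written before the binary format was introduced.
        body = record["body"]
        # Old sqla rows hold JSON text; old mongo documents were already
        # decoded from BSON by the driver, so they are returned unchanged.
        return json.loads(body) if isinstance(body, str) else body
    return unpack_blob(record["body"])


def demo_unpack(blob):
    # Stand-in decoder for this demo only; a real driver would pass its
    # msgpack unpacker here instead.
    return json.loads(blob.decode("utf-8"))


legacy_sqla = {"body": '{"a": 1}'}
legacy_mongo = {"body": {"a": 1}}
current = {"content_type": "application/json", "body": b'{"a": 1}'}

decoded = [load_body(r, demo_unpack) for r in (legacy_sqla, legacy_mongo, current)]
```

An explicit schema version field, as mentioned in (1), would make the check less dependent on which fields happen to be present.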

Implementation

1. Incoming JSON is decoded to six.text_type, then passed to json.loads, which yields fields of type six.text_type. The result is then passed to msgpack in the storage driver, which will apply the str type to those fields. The storage driver will also set the message's content_type to "application/json".
2. Incoming msgpack will be decoded using unpackb with encoding='utf-8'. Clients are expected to use the msgpack binary type for binary strings, and encode regular strings using UTF-8. Assuming that is true, after unpacking, we will have either a single Python six.binary_type, a single six.text_type, or a dict containing nested fields which may or may not be of binary or text types. This is then passed to msgpack in the storage driver, with use_bin_type=True, such that when unpacking we can distinguish between text and binary. The storage driver will set the message's content_type to "application/x-msgpack".
3. When listing messages, if the source was JSON, we can translate directly to the requested content type, either JSON or msgpack
4. When listing messages and JSON is requested, but the message was inserted using msgpack, we can crawl the dict and base64-encode any fields that are of type six.binary_type. This will work because of (2), above. Alternatively, we could return a "content_type" field in the response, set to "application/x-msgpack", and base64-encode the entire body as a msgpack blob.
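
The crawl described in (4) can be sketched with the standard library alone. The function name to_json_safe is hypothetical, and handling of lists is an assumption, since the text above only mentions dicts. The sketch is written for Python 3, where bytes corresponds to six.binary_type.

```python
import base64
import json


def to_json_safe(value):
    """Return a copy of `value` with every bytes field base64-encoded."""
    if isinstance(value, bytes):
        # base64 output is pure ASCII, so it is always a valid JSON string.
        return base64.b64encode(value).decode("ascii")
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    # Text, numbers, booleans and None pass through unchanged.
    return value


body = {"name": "report", "data": b"\x00\xff", "parts": [b"hi", "text"]}
safe = to_json_safe(body)
encoded = json.dumps(safe, sort_keys=True)
```

One thing to keep in mind with this approach: a client reading the JSON cannot tell which strings were originally binary. The alternative in (4), base64-encoding the whole msgpack blob and returning a "content_type" field, avoids that ambiguity.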

Blueprint information

Status: Not started
Approver: None
Priority: Low
Drafter: Kurt Griffiths
Direction: Needs approval
Assignee: None
Definition: Discussion
Series goal: Accepted for future
Implementation: Not started
Milestone target: None
