Writing JSONField from scratch
Last week, I improved an existing Django package and implemented some SQLite
lookups for its JSONField to get some sense of how Django model fields work.
This week, I tried to create a unified JSONField from scratch. It's
unified in the sense that it works on all database backends supported by
Django. I also set up Docker containers to run Django's test suite against all
database backends with the help of django-docker-box (though I ended up
making my own for Oracle since it's a bit more complex). However, I'm just
going to talk about writing the field, as that's the more interesting part.
As before, the docs on how to write custom model fields are very handy. I
started building my own JSONField and got it working in a relatively short
time. Not without bugs and quirks I had to deal with on each database backend,
though.
First, I wrote the code with SQLite as the top priority. I overrode the
db_type method so it returns json. It's also the name of the data type used
in MySQL and MariaDB for JSON values, so I guess it's a good start.
def db_type(self, connection):
    return 'json'
This means that the SQL query for the table creation will be something like:
CREATE TABLE ... (..., "column_name" json)
Now, is there a json datatype in SQLite, especially with the JSON1
extension available? Nope. I thought there was, as the JSON1 docs have some
examples that use JSON as the data type of some columns. Anyway, the docs
state that the extension stores JSON data as ordinary text.
You can actually use custom data type names in SQLite. It means we can still
define our column's data type as json, so the database schema will look
pretty nice.
However, SQLite uses dynamic typing. This means that data types aren't determined by the container (i.e. column) of the value. In SQLite, the data type is associated with the value itself. There's something called type affinity in SQLite. It's a system that tries to convert the values inserted into a table according to the preferred storage class for a column, called affinity.
There's a break down of how SQLite determines column affinity in the data type docs. I recommend giving it a quick read.
Anyway, according to the docs, our json data type will have a NUMERIC
affinity. This means SQLite will try to store any value inserted into the column as an
INTEGER or REAL (in that order). This can lead to some problems.
When we retrieve a JSON value from the database, we would like it to be
converted into its Python equivalent (bool, int, float, str, dict,
list), despite it being stored as text in the database. To illustrate:
>>> value = {'name': 'John', 'age': 20, 'height': 180.3}
>>> obj = JSONModel.objects.create(value=value)
>>> obj = JSONModel.objects.get(id=obj.id)
>>> type(obj.value)
<class 'dict'>
That means, the serialization (fancy word for translation) of the value
should be handled automatically by our JSONField. In Django, we implement
this by overriding two essential methods in our Field subclass:
get_prep_value and from_db_value (for serialization and deserialization,
respectively).
Thanks to Python's json library, we can use the json.dumps and json.loads
functions for that purpose.
def get_prep_value(self, value):
    """Convert the value into a JSON string."""
    if value is None:
        return value
    return json.dumps(value)

def from_db_value(self, value, expression, connection):
    """Convert the JSON string value into a Python object."""
    if value is None:
        return value
    return json.loads(value)
Keep in mind that our column has NUMERIC affinity in the database. Say we want
to insert a Python int, such as 3, as the value. It's a valid JSON
value. If we call json.dumps(3), we will get the Python string '3'. Suppose
we create a new object of our model. The value will be inserted into the
database like so:
INSERT INTO myapp_mymodel VALUES (..., '3', ...)
That SQL query is perfectly fine, since SQLite stores JSON as ordinary text.
However, with NUMERIC affinity, SQLite will try to convert the value into an
INTEGER in the database. It succeeds in doing so, therefore it's stored as an
INTEGER.
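To see the coercion outside Django, here's a minimal sqlite3 snippet (the demo table is made up for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')
# A column declared as json falls through to NUMERIC affinity.
conn.execute('CREATE TABLE demo ("value" json)')
# Insert the JSON string '3', just like json.dumps(3) would produce.
conn.execute("INSERT INTO demo VALUES ('3')")
print(conn.execute('SELECT value, typeof(value) FROM demo').fetchone())
# (3, 'integer') -- SQLite converted the text into an INTEGER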
If we go ahead and retrieve our object using Model.objects.get, our
from_db_value will receive the Python int object 3 as the value, instead of
the str object '3', from the database backend. Of course, calling
json.loads(3) would raise an exception (a TypeError, to be precise).
We could go ahead and add if isinstance(value, (int, float)) to our
from_db_value method, but I find the best solution is to just ditch json as
our db_type and use text instead, so our column will have a TEXT affinity
and no conversion will be done by SQLite.
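In code, one way to express that (just a sketch; the exact branching is up to you) is to special-case SQLite in db_type:

def db_type(self, connection):
    if connection.vendor == 'sqlite':
        # TEXT affinity: the JSON string is stored as-is, no coercion.
        return 'text'
    return 'json'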
Actually, SQLite will assign TEXT affinity if the declared type of a column
contains the string "CHAR", "CLOB", or "TEXT". So, we can actually
specify something like json text as our data type, but I'm not sure if that's
something people would like. (I kind of like it, so I might use that later.)
Anyway, that's just some SQLite magic.
For MySQL and MariaDB, we can use json as our data type and our JSONField
would work without having to change anything else. That's pretty cool!
If we want to make it cooler, we can add an SQL CHECK constraint using the
JSON_VALID function available in SQLite, MySQL, and MariaDB. To do so, we can
just override the db_check method in our field. This will prevent invalid JSON
values from getting inserted into the database table.
def db_check(self, connection):
    data = self.db_type_parameters(connection)
    if connection.vendor in ('mysql', 'sqlite'):
        return 'JSON_VALID(%(qn_column)s)' % data
    return super().db_check(connection)
Note that connection.vendor is also 'mysql' for MariaDB. Also, for MariaDB
10.4.3 and up, the JSON_VALID function is automatically used as a CHECK
constraint for JSON data type.
Now, let's move on to PostgreSQL.
PostgreSQL provides two data types for JSON values: json and jsonb. In
short, jsonb is better. It supports indexing and other cool stuff. You can see
the JSON Types docs for more details.
django.contrib.postgres.fields.JSONField also uses jsonb. If you look at
its source code, you can see that it doesn't override from_db_value. Why is
that?
Well, Django uses psycopg2 as the database adapter for its PostgreSQL
backend. It turns out, psycopg2 already does JSON adaptation when it
retrieves json and jsonb data from the database. The Python json module
and json.loads are used, just like what our from_db_value method does. You
can verify this in the psycopg2 source code.
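As a quick illustration (the connection string is a placeholder), fetching a jsonb value with bare psycopg2 already gives us a Python object:

import psycopg2

conn = psycopg2.connect('dbname=test')  # placeholder connection details
cur = conn.cursor()
cur.execute("""SELECT '{"name": "John"}'::jsonb""")
print(cur.fetchone()[0])  # {'name': 'John'} -- already a dict, not a str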
Calling json.loads with the adapted value could throw a TypeError. It's
kind of like what happened with SQLite earlier. We can easily handle this by
modifying our method like this:
def from_db_value(self, value, expression, connection):
    """Convert the JSON string value into a Python object."""
    if value is None or connection.vendor == 'postgresql':
        return value
    return json.loads(value)
However, I want to spice things up a bit. Let's allow a custom JSON encoder and decoder to be used in our serialization and deserialization process!
Both json.dumps and json.loads accept a keyword argument cls. It can be
used to specify the class of a JSON encoder and decoder, respectively. The
argument is None by default, which will make the functions use the built-in
json.JSONEncoder and json.JSONDecoder.
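For example, a custom encoder could take care of datetime objects (this DateTimeEncoder is just an illustration, not part of the field itself):

import json
from datetime import datetime

class DateTimeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()  # serialize datetimes as ISO 8601 strings
        return super().default(obj)

json.dumps({'created': datetime(2019, 6, 17)}, cls=DateTimeEncoder)
# '{"created": "2019-06-17T00:00:00"}'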
If we would like to allow a custom JSON encoder and decoder, we need to accept
them as arguments in our JSONField constructor and store them as instance
attributes. This way, we can pass the encoder and decoder to our json.dumps
and json.loads calls like so:
def __init__(self, encoder=None, decoder=None, default=dict, *args, **kwargs):
    self.encoder, self.decoder = encoder, decoder
    super().__init__(default=default, *args, **kwargs)

def get_prep_value(self, value):
    if value is None:
        return value
    return json.dumps(value, cls=self.encoder)

def from_db_value(self, value, expression, connection):
    if value is None:
        return value
    return json.loads(value, cls=self.decoder)
We also need to override the deconstruct method accordingly:
def deconstruct(self):
    name, path, args, kwargs = super().deconstruct()
    if self.default is dict:
        del kwargs['default']
    if self.encoder is not None:
        kwargs['encoder'] = self.encoder
    if self.decoder is not None:
        kwargs['decoder'] = self.decoder
    return name, path, args, kwargs
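With that in place, the field can be declared in a model like this (reusing the DateTimeEncoder example from above), and makemigrations will pick up the custom encoder through deconstruct:

from django.db import models

class JSONModel(models.Model):
    value = JSONField(encoder=DateTimeEncoder)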
You may have noticed that I left out the PostgreSQL conditional in
from_db_value. If we stuck with that solution, we wouldn't be able to use a
custom decoder, since we would just return the value already adapted by psycopg2.
A naive solution might be to call json.dumps on that value and call
json.loads with our decoder, basically serializing it and deserializing it
again. It could work, but that would be slow. We need a way to prevent
psycopg2 from adapting the value to Python objects.
According to the docs, we can either cast the column to text in the query, or
register a no-op loads with register_default_json (the registration is
shared for the same database connection). If we choose the latter, we might
break compatibility with contrib.postgres's JSONField, since it doesn't
allow a custom decoder and it relies on psycopg2's loads instead.
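For reference, the rejected alternative would look something like this (conn being an existing psycopg2 connection):

import psycopg2.extras

# Make psycopg2 hand json/jsonb values back as raw strings instead of
# parsing them. The registration applies to the whole connection, which is
# exactly why it could break contrib.postgres's JSONField.
psycopg2.extras.register_default_json(conn, loads=lambda value: value)
psycopg2.extras.register_default_jsonb(conn, loads=lambda value: value)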
Thankfully, we can implement the former by overriding select_format. It's not
documented as of this writing, but the docstring gives a clue on how it can be
used. I found examples of overridden select_format in contrib's GIS fields.
Apparently, we can do it like this:
def select_format(self, compiler, sql, params):
    if compiler.connection.vendor == 'postgresql':
        return '%s::text' % sql, params
    return super().select_format(compiler, sql, params)
Therefore, from_db_value will always retrieve a string value for non-NULL
values in the database, and we can call json.loads just like for other
backends.
Phew! Now, we've got our JSONField working on SQLite, MySQL, MariaDB, and
PostgreSQL. We've also allowed the use of custom JSON encoder and decoder.
Let's move on to the last database backend: Oracle.
Oracle can use VARCHAR2, BLOB, or CLOB data types to store JSON values.
It recommends us to use BLOB, but there are some downsides:
- When selecting data from a BLOB column, if you want to view it as printable text, you must use the SQL function to_clob.
- When performing insert or update operations on a BLOB column, you must explicitly convert character strings to BLOB format using the SQL function rawtohex.
I haven't verified if that's the case with cx_Oracle (Oracle Database adapter
for Python). Since an implementation of JSONField on Oracle is available
and it uses CLOB instead, I'm going to use CLOB too. If BLOB turns out to
be feasible to implement (without some complex things), I might switch to
BLOB.
Oracle also provides the IS JSON constraint for columns that hold JSON data.
It also recommends us to use this constraint in our table definition, because
we won't be able to use the simple dot-notation syntax to query the JSON data
otherwise.
However, from my experience, the IS JSON constraint only accepts a JSON
object or array as its value. Meaning, we can only use dict or list Python
objects and not any of bool, int, float, and str as our field's value.
To add that constraint in our field, we can override db_check like before:
def db_check(self, connection):
    data = self.db_type_parameters(connection)
    if connection.vendor in ('mysql', 'sqlite'):
        return 'JSON_VALID(%(qn_column)s)' % data
    if connection.vendor == 'oracle':
        return '%(qn_column)s IS JSON' % data
    return super().db_check(connection)
Another thing to note is that cx_Oracle returns a Python object of type LOB
for values with BLOB, CLOB, and NCLOB data types. We cannot simply use
json.loads, since the decoder doesn't know how to handle LOB objects. To decode
the value, we must first obtain the str equivalent of the LOB object by calling
its .read method.
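Here's roughly what that looks like with bare cx_Oracle (connection details and table name are placeholders):

import cx_Oracle

conn = cx_Oracle.connect('user/password@localhost/orcl')  # placeholder
cur = conn.cursor()
cur.execute('SELECT value FROM myapp_jsonmodel')
lob = cur.fetchone()[0]
print(type(lob))   # <class 'cx_Oracle.LOB'>
print(lob.read())  # the underlying str, e.g. '{"name": "John"}'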
The oracle-json-field package overcomes this by subclassing TextField.
TextField on Oracle is implemented using NCLOB in Django, and the database
backend already has a converter that calls the .read method.
We can either modify the backend so it also does the same for our JSONField,
or we can override the get_db_converters method in our field. I chose
the latter to be consistent with my previous decisions.
The get_db_converters method isn't documented as of this writing, but we
basically just need to create a list that contains the functions that we want
to be run before returning the final value. Then, we extend that list with the
one returned by the original get_db_converters.
I implemented mine like this:
def convert_jsonfield_value(value, expression, connection):
    if connection.vendor == 'oracle':
        return value.read()
    return value

class JSONField(Field):
    ...
    def get_db_converters(self, connection):
        return [convert_jsonfield_value] + super().get_db_converters(connection)
Now, our from_db_value method will get the equivalent str of the LOB as
the value.
And, that's it! We've successfully implemented our own unified JSONField. We
have yet to implement custom lookups and transforms, but that's on our to-do
list. I made a draft PR to Django for this field today (which also happens to
be my birthday!). If some things in that PR turn out differently from what I
wrote here, it probably means I found some better approaches along the way.
Whoa, this turned out long. Anyway, if you want to leave feedback, feel free to comment below or on the PR; I'd love to hear some thoughts about this. I'll be back here with another post next week!