Speed up bulk-exists check with python sets

Posted on Sat 01 August 2020 in Better Django

image

The .exists() function of Django's ORM is very useful. However, if you need to do this in bulk (think hundreds of thousands or more), this becomes a strain on your database. Let's say you are fetching records from an external system, and if the record doesn't exist locally, you need to do something:

for item in external_records:
    if not Data.objects.filter(external_id=item.id).exists():
        # Do something

If you are working with a large quantity of records, this will flood your DB with queries. Instead, you could first pull all external_id values from the DB:

existing_ids = Data.objects.all().values_list("external_id", flat=True)
for item in external_records:
    if item.id not in existing_ids:
        # Do something

But again, for large volumes of data, this will be slow in terms of computing time. The trick is to use a set instead of a list. The python in operator performs much, much better for sets, since items in sets are guaranteed to be unique. The above code thus becomes:

existing_ids = set(Data.objects.all().values_list("external_id", flat=True))
for item in external_records:
    if item.id not in existing_ids:
        # Do something

A very simple change that increases performance dramatically.

(Code above can be found on GitHub)