Squashing five years’ worth of database migrations

Hi folks! I wanted to share a recent minor success story in case it's helpful for others.

I maintain an open source project called [NetBox](https://github.com/netbox-community/netbox) that's been around for about five years. It's a fairly complex application with lots of models, and as you can imagine over the years we've accumulated quite a few database migrations as the project has grown. In all, I counted 319 discrete migrations across seven apps, excluding the migrations introduced by Django itself.

While having this many migrations isn't necessarily a problem, it's certainly not ideal. The biggest issue is the excessive delay introduced when running CI tests: It was taking around 80 seconds just to build the database each time. So for our upcoming version 3.0 release, we wanted to see if we could optimize all these old migrations.

You're probably familiar with Django's ability to [squash migrations](https://docs.djangoproject.com/en/3.2/topics/migrations/#squashing-migrations), which is generally very good. However, we've experimented with this approach in the past and kept running into issues regarding migration dependencies. (I believe this stems from the highly interconnected nature of our models: Applications with a simpler model structure would probably fare better.) So instead, we opted to try regenerating the migrations from scratch.

Now, it's important to understand the impact of doing this. Let's say you have a Django application with fifty migrations, numbered 0001 through 0050. You can delete them all and then run `manage.py makemigrations` to generate a single replacement migration (or maybe a handful, depending on your project's complexity). However, existing installations that are still on some migration from before 0050 won't be able to upgrade, since the indices get reset and there's no longer a migration path to follow. This is obviously a problem if you have thousands of installations in the wild as we do.

### Our Solution

What we decided to do was regenerate all migrations up to the _prior_ minor release (v2.11.0), and preserve the most recent migrations. This provides a more flexible upgrade path for existing installations, where they can upgrade to any v2.11.x release first, and then to v3.0 or later.

The overall process was as follows:

1. Copy all migrations post-2.11 out of the working project.
2. Check out the last release prior to v2.11.0 in git (in our case, v2.10.10).
3. Delete all migration files project-wide.
4. Run `manage.py makemigrations` to generate new, optimized migrations.
5. Copy the new migrations as well.
6. Restore the working branch to its original state.

At this point, we have copied a set of optimized migrations covering from the initial release to v2.10.10, and from v2.11.0 to v3.0.0. However, they don't match: We need to modify the new migrations to follow the same indices as the originals they replaced. For example, under one app which had 48 original migrations, we now have a list of migrations that look like this:

0001_initial.py
0002_auto_something.py
0047_original_migration_foo.py
0048_original_migration_bar.py

In order to "stitch" these migrations together, we need to:

1. Rename the new migration files (to keep things orderly).
2. Update the migration names as necessary in the `dependencies` list within each migration.
3. Add the list of migrations being replaced (`replaces`) to each _new_ migration.

So first, we rename our two new migration files. We append the index of the last covered migration to the last new file to indicate the scope of migrations being replaced.

0001_squashed.py
0002_squashed_0046.py
0047_original_migration_foo.py
0048_original_migration_bar.py

(Note that migrations 0001 and 0002 are essentially an atomic set. We just happen to have two files instead of one because of dependencies imposed by some other migration in a different app that depends on models in this app.)

Next, we update the dependencies named within each migration as needed. `0001_squashed` has no dependencies, because it's the first migration in the app. `0002_squashed_0046` and `0047_original_migration_foo` each list the migration that comes before it, and `0048_original_migration_bar` shouldn't need to be changed.

# 0002_squashed_0046.py

class Migration(migrations.Migration):

dependencies = [
('app_name', '0001_squashed'),
]

# 0047_original_migration_foo.py

class Migration(migrations.Migration):

dependencies = [
('app_name', '0002_squashed_0046'),
]

Finally, we define the list of original migrations being replaced in the _last_ new migration. In our case, we grabbed these names from the v2.10.10 version of the codebase.

# 0002_squashed_0046.py

class Migration(migrations.Migration):
dependencies = [
('app_name', '0001_squashed'),
]

replaces = [
('app_name', '0002_original_foo'),
('app_name', '0003_original_bar'),
('app_name', '0004_original_baz'),
...
('app_name', '0046_original_final'),

If everything went well, the output of `manage.py showmigrations` should now look exactly as it did with the original migrations.

### Validating the New Migrations

As were automatically rebuilding most of our migrations, we wanted to test their integrity very thoroughly. What we did was build one database using only the original migrations up through the v3.0 release, and another database using our optimized migrations. We then compared both the resulting schema and migrations table of each. To compare the database schemas, we queried `information_schema.columns` from either table to ensure consistent output suitable for performing a direct diff. (The ordering of table columns varied quite a bit when using `pg_dump` because of the order in which they were being added.)

SELECT table_name,
column_name,
data_type,
column_default,
is_nullable,
character_maximum_length,
numeric_precision
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name ASC, column_name ASC

We also verified that the `django_migrations` tables were consistent between the two databases:

SELECT app, name FROM django_migrations ORDER BY app, name

With everything in place, the only difference between the two resulting databases was the naming of a few sequences, which was merely cosmetic. (Some of the table names had been changed over time.) The best part is that we reduced our overall time to build a new database from ~84 seconds to ~21 seconds.

I know migrations can be a tricky topic for a lot of people (myself included), but hopefully this helps shed some light on the topic of optimizing migrations.

7 thoughts on “Squashing five years’ worth of database migrations”

  1. Almost 80% went over my head, but nonetheless it was a great to read your experience.

    With a project that big, and so many users depending, we have to think everything through. Maybe I’ll refer to your post sometime in the future for reference.

    Reply
  2. Hmm, isn’t it easier to just make a new database and import the old data?

    Step wise like this:

    * Make last migration and migrate to old db
    * delete all migrations
    * create new migration and migrate to the new database
    * use pg\_dump to dump the tables you need from old db to new db

    Reply
  3. What a great approach, really love this. Thanks for sharing this info – makes me want to analyze my own tests to see how much time is being spent just on spinning up the database.

    Legendary depth here – much respect for coming up with such a creative solution!

    Reply
  4. Nice solution in the end.

    I went for creating a completely new migration schema when I needed to squash my migrations (I see you mentioned that as well). Was much easier for me, as the code & DB lives just on 4 computers: 2 team members and 2 environments.

    Not sure I would’ve thought about your solution if I were in your situation. You live, you learn 🙂

    Reply
  5. If you use pytest and the issue is only the CI time, you can pass –no-migrarions flag to pytest and it will create the db based on the current models without running the migrations.

    I did this the issues that I had were on tests that relied on data inserted by previous migrations that were not being inserted anymore. So the fix was add the same data or just a subset in the test initialization.

    the no-migration is another alternative if your problem is the CI time

    Reply

Leave a Comment