Hi folks! I wanted to share a recent minor success story in case it's helpful for others.
I maintain an open source project called [NetBox](https://github.com/netbox-community/netbox) that's been around for about five years. It's a fairly complex application with lots of models, and as you can imagine, we've accumulated quite a few database migrations as the project has grown. In all, I counted 319 discrete migrations across seven apps, excluding the migrations introduced by Django itself.
While having this many migrations isn't necessarily a problem, it's certainly not ideal. The biggest issue was the excessive delay they introduced when running CI tests: It was taking around 80 seconds just to build the database for each run. So for our upcoming version 3.0 release, we wanted to see if we could optimize all these old migrations.
You're probably familiar with Django's ability to [squash migrations](https://docs.djangoproject.com/en/3.2/topics/migrations/#squashing-migrations), which is generally very good. However, we've experimented with this approach in the past and kept running into issues regarding migration dependencies. (I believe this stems from the highly interconnected nature of our models: Applications with a simpler model structure would probably fare better.) So instead, we opted to try regenerating the migrations from scratch.
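For reference, the built-in approach squashes everything up through a named migration with a single command (the app and migration names here are placeholders):

```bash
python manage.py squashmigrations myapp 0050
```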
Now, it's important to understand the impact of doing this. Let's say you have a Django application with fifty migrations, numbered 0001 through 0050. You can delete them all and then run `manage.py makemigrations` to generate a single replacement migration (or maybe a handful, depending on your project's complexity). However, existing installations that are still on some migration from before 0050 won't be able to upgrade, since the indices get reset and there's no longer a migration path to follow. This is obviously a problem if you have thousands of installations in the wild as we do.
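In shell terms, that naive regeneration amounts to something like this (app name is a placeholder):

```bash
# Delete every numbered migration (keeping __init__.py), then regenerate
rm myapp/migrations/0*.py
python manage.py makemigrations myapp
```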
### Our Solution
What we decided to do was regenerate all migrations up to the _prior_ minor release (v2.11.0), and preserve the most recent migrations. This provides a more flexible upgrade path for existing installations, where they can upgrade to any v2.11.x release first, and then to v3.0 or later.
The overall process was as follows:
1. Copy all migrations introduced in v2.11.0 or later out of the working project.
2. Check out the last release prior to v2.11.0 in git (in our case, v2.10.10).
3. Delete all migration files project-wide.
4. Run `manage.py makemigrations` to generate new, optimized migrations.
5. Copy the newly generated migrations out as well.
6. Restore the working branch to its original state.
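As a rough shell sketch (the paths match our repository layout, the branch name is illustrative, and step 1 is really a manual, per-app selection of files):

```bash
STASH=$(mktemp -d)

cp --parents netbox/*/migrations/0*.py "$STASH"  # 1. stash recent migrations (prune by hand)
git checkout v2.10.10                            # 2. last release prior to v2.11.0
find netbox -path '*/migrations/0*.py' -delete   # 3. delete all migration files
python netbox/manage.py makemigrations           # 4. generate new, optimized migrations
cp --parents netbox/*/migrations/0*.py "$STASH"  # 5. stash the newly generated files too
git checkout develop                             # 6. restore the working branch
```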
At this point, we have copied a set of optimized migrations covering everything from the initial release through v2.10.10, plus the original migrations from v2.11.0 through v3.0.0. However, they don't match up: We need to modify the new migrations to follow the same indices as the originals they replaced. For example, in one app which had 48 original migrations, we now have a list of migrations that looks something like this (the auto-generated name of the second new file will vary):
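```
0001_initial.py                 <- new
0002_auto_20210510_1525.py      <- new
0047_original_migration_foo.py  <- original
0048_original_migration_bar.py  <- original
```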
In order to "stitch" these migrations together, we need to:
1. Rename the new migration files (to keep things orderly).
2. Update the migration names as necessary in the `dependencies` list within each migration.
3. Add the list of migrations being replaced (`replaces`) to each _new_ migration.
So first, we rename our two new migration files. We append the index of the last covered migration to the last new file to indicate the scope of migrations being replaced.
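```
0001_initial.py            -> 0001_squashed.py
0002_auto_20210510_1525.py -> 0002_squashed_0046.py
```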
(Note that migrations 0001 and 0002 are essentially an atomic set. We just happen to have two files instead of one because of dependencies imposed by some other migration in a different app that depends on models in this app.)
Next, we update the dependencies named within each migration as needed. `0001_squashed` has no dependencies, because it's now the first migration in the app. `0002_squashed_0046` and `0047_original_migration_foo` each need to list the migration that immediately precedes them, and `0048_original_migration_bar` shouldn't need to be changed. (In the snippets below, the app label `dcim` stands in for the real one.)
In `0002_squashed_0046.py`:

```python
dependencies = [
    ('dcim', '0001_squashed'),
]
```

And in `0047_original_migration_foo.py`:

```python
dependencies = [
    ('dcim', '0002_squashed_0046'),
]
```
Finally, we define the list of original migrations being replaced in the _last_ new migration. In our case, we grabbed these names from the v2.10.10 version of the codebase (the names shown below are placeholders):
```python
dependencies = [
    ('dcim', '0001_squashed'),
]

replaces = [
    ('dcim', '0001_initial'),
    ('dcim', '0002_auto_20200101_0000'),
    # ...one entry for each of the 46 replaced migrations...
    ('dcim', '0046_auto_20210301_0000'),
]
```
If everything went well, the output of `manage.py showmigrations` should now look exactly as it did with the original migrations.
### Validating the New Migrations
As we were automatically rebuilding most of our migrations, we wanted to test their integrity very thoroughly. What we did was build one database using only the original migrations up through the v3.0 release, and another database using our optimized migrations. We then compared the resulting schema and migrations table of each. To compare the database schemas, we queried `information_schema.columns` in each database to ensure consistent output suitable for performing a direct diff. (The ordering of table columns varied quite a bit when using `pg_dump`, because of the order in which they had been added over time.)
```sql
-- The selected columns are illustrative; any consistent set will do
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name ASC, column_name ASC
```
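To produce diff-friendly listings from each database, something along these lines works (the database names are placeholders):

```bash
QUERY="SELECT table_name, column_name, data_type
       FROM information_schema.columns
       WHERE table_schema = 'public'
       ORDER BY table_name ASC, column_name ASC"

# -A (unaligned) and -t (tuples only) keep psql's output diff-friendly
psql -d netbox_original  -Atc "$QUERY" > original.txt
psql -d netbox_optimized -Atc "$QUERY" > optimized.txt
diff original.txt optimized.txt   # no output means the schemas match
```

The same pattern works for the `django_migrations` comparison below.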
We also verified that the `django_migrations` tables were consistent between the two databases:
```sql
SELECT app, name FROM django_migrations ORDER BY app, name
```
With everything in place, the only difference between the two resulting databases was the naming of a few sequences, which was merely cosmetic. (Some of the table names had been changed over time.) The best part is that we reduced our overall time to build a new database from ~84 seconds to ~21 seconds.
I know migrations can be a tricky topic for a lot of people (myself included), but hopefully this helps shed some light on optimizing them.