Addressing deadlock errors

TLDR

If your monitoring is seeing a bunch of django.db.utils:OperationalError errors with the message "Deadlock found when trying to get lock;", it can often be fixed by using decorators to exempt particular views from Django’s implicit request-level transaction. This PR shows how to do this for both function and class-based views:

But please read the rest of this post before making a PR to fix one of these…

What’s actually happening?

Studio and the LMS run with ATOMIC_REQUESTS enabled by default. This means that Django will implicitly open a database transaction at the start of a request and commit it at the end of request processing. But this can cause issues when slow, concurrent requests are trying to update the same row.

As an example, say a request spends most of its time processing and rendering courseware but also updates a row somewhere to record the most recent course this user was active in. It’s going slowly for some reason–maybe it’s a particularly expensive problem type, or maybe the system is just under load and being slow for other reasons. So the student hits reload or clicks a button a few extra times. Now there are multiple requests out there that are probably slow for the same underlying reason, but are now also trying to get lock on the same row that is being updated in the first request’s transaction (which hasn’t closed yet).

Why use ATOMIC_REQUESTS at all?

We used it way back when because it was easy to enable, and it usually does the thing you expect–i.e. keeping you from shooting yourself in the foot with inconsistent state because your request died halfway through the data you’re updating.

What’s the alternative?

The Right Thing to Do is to use transactions manually, in a much more thoughtful way than we do generally. Our apps are pretty inconsistent about this, but some of the newer ones wrap important operations in explicit transactions.

Why not turn off ATOMIC_REQUESTS altogether?

Because it’s quite likely things will break in non-obvious, slowly data-corrupting ways, and it’s difficult to set aside time to test that across the large set of code that is edx-platform and the other apps it installs.

So is it dangerous to allow specific views to be non-atomic?

You’ll want to examine the view to figure out what it’s actually deadlocking on (which will be in the error message). If it’s doing something important, like creating an enrollment, purchasing something, etc.– then yes, it can be dangerous to do so, and you’ll need to think it through carefully. But this more commonly shows up in things that are either only changing a single row, or where the result is just not that sensitive (e.g. “most recently used course”).

1 Like

Thanks for this write up @dave.

  1. I’ve also heard that using ATOMIC_REQUESTS was Django’s recommendation once upon a time, and that that recommendation changed.

  2. May be a silly proposal, but I’ve wondered if we couldn’t use linting to make it explicit everywhere via the decorator and then change the default. I know there would be much more to discuss, but something we could discuss.

I thought so as well (or maybe it was the TransactionMiddleware that preceded it?), but I didn’t find anything with a quick web search.

It’s not silly, but I don’t think it’s as pressing as many other issues we have. Using ATOMIC_REQUESTS keeps us from shooting ourselves in the foot with an overhead that seems relatively modest for most operations. I think it’s easier to tweak the specific views we know are causing deadlocks or other performance issues, rather than to make more invasive changes that introduce a risk of slow data corruption. Particularly since testing for this kind of regression is cumbersome and is not something we typically do.

1 Like