In short, when X users (in my case 60) simultaneously start a Quiz (either the one that is part of the demo course, or especially one created from “libraries”), the server CPU goes to 100% or 200%. Memory remains stable, but the server becomes unresponsive for 5-6 minutes. I use a quite powerful server: a 6-core Intel Xeon E5-1650V3 with 128 GB of memory.
@Regis suggested I report it here as it seems this affects the edx core.
Having trace data from a product like New Relic or Honeycomb is really helpful for diagnosing these issues. If you have any traces from that and can post them, please do so.
One of the things that affects the courseware experience prior to the Courseware MFE (available, but not the default, in Lilac) is that rendering a sequence slows down in proportion to the total number of things in the sequence, not just the number of things visible in a given unit. If you find that individual Units load much more quickly than the courseware as a whole, switching to the new Courseware MFE might help. You can test that by going to a URL like:
That may be compounded by codejail, which is an expensive sandboxed process that executes instructor code for certain ProblemBlocks.
Then again, it could also be that something that’s supposed to be cached is not persisting, necessitating constant recalculation. This can happen when your cache is misconfigured, or if your course is so large that the cached artifact exceeds the 1 MB default limit for memcached (causing set calls to silently fail). There are a lot of possibilities though, and it’s difficult to guess without more detailed profiling information.
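To make the silent-failure point above concrete, here is a standalone sketch (not edX code) of why this is so hard to notice: memcached rejects any item over its size limit (1 MB by default) but Django's `cache.set()` does not raise, so the only reliable check is a `get()` right after the `set()`. The `FakeMemcached` class is purely illustrative; against a real LMS cache you would run the same round-trip probe through Django's cache API.

```python
# Illustrative sketch: memcached silently drops values over its item size
# limit (1 MB by default), and the calling code sees no error. The only
# way to detect it is a get() after the set().
SIZE_LIMIT = 1024 * 1024  # memcached's default item size limit


class FakeMemcached:
    """Mimics memcached's silent rejection of oversized values."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        if len(value) > SIZE_LIMIT:
            return  # silently dropped, just like the real server
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


def survives_round_trip(cache, size_bytes):
    """Return True if a value of size_bytes makes it through the cache."""
    payload = b"x" * size_bytes
    cache.set("cache_size_probe", payload)
    return cache.get("cache_size_probe") == payload


cache = FakeMemcached()
print(survives_round_trip(cache, 512 * 1024))       # True
print(survives_round_trip(cache, 2 * 1024 * 1024))  # False: set silently failed
```

If the large-course artifact is the problem, the fix is usually configuration (e.g. raising memcached's item size limit) rather than code.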
My team is currently working on stories to speed up the courseware browsing experience, but those improvements wouldn’t be in a named release until Maple, it would require using the new Courseware MFE experience, and without profiling data I’m not sure it would help your exact issue.
If you’re looking for short term mitigation, you might want to try to divide your quiz into smaller ones. Focus on both reducing the total number of items in a sequence as well as the number of ProblemBlocks in any given Unit. You can reconfigure the grading policy so that it still weighs the content of the combined quizzes to have the same effect on their overall course grade.
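The grading arithmetic behind that suggestion can be sketched as follows (illustrative only, assuming edX-style grading where each assignment type carries a fixed share of the course grade, split evenly across assignments of that type):

```python
# Illustrative arithmetic: splitting one big quiz into several smaller
# quizzes of the same assignment type leaves the course-grade contribution
# unchanged, as long as the type's total weight stays the same.
def quiz_contribution(scores, type_weight):
    """Average the per-quiz scores, then apply the assignment type's weight."""
    return type_weight * sum(scores) / len(scores)


# One big quiz where the learner scores 80%, type worth 40% of the grade:
one_big = quiz_contribution([0.80], type_weight=0.40)

# Same material split into four quizzes, same average performance:
four_small = quiz_contribution([0.80, 0.80, 0.80, 0.80], type_weight=0.40)

print(one_big, four_small)  # both 0.32: same effect on the course grade
```

The weights and scores here are made up for illustration; the point is only that the split is grade-neutral.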
Hi Dave - thanks. Actually, replicating it is quite simple: you can use the quiz that is part of the Demo course. Just launch the quiz inside it and watch the `top` command; CPU immediately goes to approx. 28%. That’s not an issue with 1 user, but when 60 try to do the same, the server dies…
I realize that courseware is far more CPU intensive than it should be in pretty much all cases. But there are a number of contributing factors to poor performance, and a number of surprising edge cases that can make things dramatically worse (e.g. inline discussion blocks in CCX courses). It’s not clear to me at the moment which parts are hurting you the worst right now. Given how large your quiz is (~100 problems), it seems likely that you’re suffering from the general issues we have around rendering large subsections of content. But there are other possibilities as well.
For instance, say CPU is high and sluggish, and people start hitting reload or pressing the buttons many times. One thing that used to happen is that we’d throw a giant implicit transaction around certain courseware views, and multiple loads of that same view by the same user would cause workers to hang because multiple transactions were blocking on trying to update the same few rows of XBlock student state. This would essentially remove workers from the available pool, and dump the requests that they could have served onto other workers. The slower it got, the more frustrated users became, and the more likely it was to dogpile because they’d re-submit.
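A toy model of that dogpile (assumed behavior, not actual edX code) shows why duplicate requests from one impatient user can starve everyone else: each reload grabs a lock on the same “student state row” inside a long transaction, so the duplicates serialize and tie up workers that could have served other people.

```python
# Toy model of the dogpile: duplicate requests contend on the same row
# lock inside a long transaction, so they run back to back even though
# multiple workers are "available".
import threading
import time

row_lock = threading.Lock()  # stands in for the DB row lock
served = []


def courseware_view(request_id, work_seconds=0.05):
    with row_lock:                 # the implicit transaction's row lock
        time.sleep(work_seconds)   # expensive render inside the transaction
        served.append(request_id)


start = time.monotonic()
# One user hammering reload: 5 duplicate requests for the same row.
workers = [threading.Thread(target=courseware_view, args=(i,)) for i in range(5)]
for w in workers:
    w.start()
for w in workers:
    w.join()
elapsed = time.monotonic() - start

# Total time is roughly 5 * work_seconds despite 5 concurrent workers,
# because the lock forced the requests to serialize.
print(f"served {len(served)} requests in {elapsed:.2f}s")
```

In a real deployment those blocked threads are gunicorn/uwsgi workers, so the backlog spills onto the rest of the pool and latency compounds.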
If you’re looking for something actionable, I think the most likely thing to work is the bit I mentioned earlier about breaking the quiz up into smaller, separate quizzes for future runs. Another option is, of course, to scale up the cluster. We are working on these issues (all four tickets being worked on by my team this week relate to improving server-side performance in courseware), but those fixes will be part of the Maple release. It is possible that you’re running into an edge case that could be fixed by dropping a plugin, fixing a config value, flipping a feature toggle, or cherry-picking a particular commit. But that’s likely going to be hard to debug, with an uncertain outcome.
Another thing to consider in terms of future content is that we are definitely making optimizations on navigation (so not loading the whole sequence), and we will probably be optimizing composition (what things are in a particular Unit, which will help with content library module performance), but we will likely not be able to make a significant dent in Unit rendering performance without breaking backwards compatibility. Meaning that if you put 100 problem blocks in the same Unit in the Maple or Nutmeg release of Open edX, it’s very likely that none of our optimizations will make a significant difference.
I’m sorry that I can’t give better news. As @regis says, this is near and dear to my heart. We’re actively working on it and making progress, but there’s no easy fix: the courseware is slow because of deeply rooted data access issues that date back to the original prototype courseware that Open edX evolved from. To address these problems, we’re essentially creating new applications and data models and translating from one system to the other at publish time. I did a short writeup of that architectural shift, if you’re interested.
Hi Dave, sorry for not answering before… First of all, thanks a lot for your detailed answer.
I managed to track down the issue.
It seems this is related to “Libraries” and the “Randomized Content Block”: somewhere, something is making it not work as it should and causing the CPU issues…