I am currently working with @alecar on the final feature for the Survey Report project.
The goal of this step is to make it possible for instances to send the report (aggregated and anonymized data) automatically every six months.
Naturally this can be turned off and configured for each instance, but the goal is to make it easy for instances to report the data so we can measure the growth of the project.
As it stands now, the report can be generated and sent by a superuser in the admin panel. Sending it automatically requires that we have some way of launching async tasks with some regularity.
Currently there are two main ways of scheduling tasks in the platform:
Celery Beat: adding django_celery_beat as an additional dependency has been done before; it was even backported to the koa and lilac branches. (A minimal sketch of what such a schedule could look like follows this list.)
Schedule tasks with an external scheduler like Jenkins or crontab: this is the approach that has been favored by edX. It is also how we at eduNEXT manage our largest instances.
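To make the Celery Beat option concrete, here is a minimal sketch of what a six-month report schedule could look like. The task path, schedule name, and exact crontab below are hypothetical placeholders rather than the actual Survey Report implementation, and where the setting lives depends on how the Celery app is configured to read it:

```python
# Hypothetical sketch: schedule the survey report task roughly every six months.
# Names and paths below are placeholders, not the real Survey Report code.
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "send-survey-report": {
        "task": "survey_report.tasks.send_report",  # placeholder task path
        # 02:00 UTC on the 1st of January and July, i.e. twice a year.
        "schedule": crontab(minute=0, hour=2, day_of_month=1, month_of_year="1,7"),
    },
}
```

With django_celery_beat the same schedule can instead live in the database and be edited from the Django admin, which is one of the main reasons to pick it over a plain crontab entry.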
Now that we have mostly landed in a world where k8s is the way of hosting production-grade instances, and where Tutor is the supported way of writing the manifests for said k8s clusters, I'd like to bring the question of the scheduler back to the forefront.
Specifically, I would like to know:
Why was Celery Beat not used at edx.org? What problems guided you in the direction of Jenkins?
Would the core project be open to installing celery-beat now as a dependency?
If adding celery-beat is a no, what would the recommended approach be for smaller instances?
Your insights and feedback will be greatly appreciated. Thank you in advance for your time and expertise in this matter.
I'm tagging some people who guided most of these discussions in the past.
The only concrete case that I know of where edX was having celery beat failures was in the (now unsupported) notifier, way back in 2013:
I don't know what specific issues there were, though. The only reference to it I can find in my email is this edx-code post by Jim Abramson back in 2015:
The edx notifier service uses the Advanced Python Scheduler.
We started initially with Celery Beat but had problems getting it to work / debugging the problems. That was a couple years ago - it may be more reliable at this point, but APScheduler has been rock-solid for us.
In terms of edX's preference for Jenkins, I don't remember the exact rationale. I suspect it might have been because edX already had the Jenkins infrastructure for other things (running tests, preparing releases, etc.), so it was easier to build on top of that rather than introduce another tool with its own configuration and access control management. There might also have been a lack of enthusiasm for running another worker type/deployment specifically for celery-beat, since it would require its own instance and possibly get bogged down by some of the really large/slow tasks that take hours to run. @feanil or @e0d might remember better.
Would the core project be open to installing celery-beat now as a dependency?
I'd like to hear more about the operational experiences that OpenCraft and others in the community have had with it, but I'm certainly open to it. My hope would be that:
It should be optional: we should allow people to continue to use Jenkins for these things if they really want, even if Celery Beat is the simple and supported option.
We should have some way to test this functionality in automated tests, so that we don't accidentally break it for a few months and only discover issues when manually testing the next release.
If for some reason it doesn't work, we at least document the problems properly somewhere this time.
The core of the problem, as I recall, was ensuring that things only ran once. The way the edX infrastructure was set up, it was difficult to ensure that celery-beat only got started on one machine and not on all the celery workers, while keeping all the instances ephemeral.
This, combined with the fact that we already had Jenkins infrastructure where it was easy to guarantee run-once behavior, resulted in the choice of Jenkins for a lot of scheduled work on edx.org.
In the k8s world, starting up a separate celery-beat pod with at most one instance seems much easier, so it may be a much better option.
To continue the thread of old-timers going "uh, let's reach back into the ol' memory banks," but with a more modern twist: we were considering celery-beat while looking at how to set up event bus consumer infrastructure over the last year, and I think someone told us that there were performance issues on edx.org the last time someone tried to use it there, even though the community was using it just fine. (Take that with a grain of salt, because this memory is also somewhat fuzzy, and it was probably Dave who told us this anyway.)
To add more to this discussion, celery_beat is indeed used at edX. It is used in a private repository, video-encode-manager (https://github.com/edx/video-encode-manager). VEM uses celery beat extensively to schedule video encoding pipeline tasks.
With celery beat, we have been able to freely and quickly change task schedules from the admin when needed. The admin also shows when a task was last run, which has been really helpful in diagnosing issues.
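For readers who have not used it: editing schedules from the admin like this points at django_celery_beat's DatabaseScheduler, where each schedule is a database row. A rough sketch of what such an entry could look like follows; the task path and timings are made up for illustration and are not VEM's actual configuration:

```python
# Hypothetical sketch of a database-backed schedule entry, assuming
# django_celery_beat's DatabaseScheduler. Task path and timings are made up.
from django_celery_beat.models import CrontabSchedule, PeriodicTask

schedule, _ = CrontabSchedule.objects.get_or_create(
    minute="0",
    hour="3",
    day_of_week="*",
    day_of_month="*",
    month_of_year="*",
)
PeriodicTask.objects.update_or_create(
    name="encode-pending-videos",            # hypothetical entry name
    defaults={
        "task": "vem.tasks.encode_pending",  # hypothetical task path
        "crontab": schedule,
        "enabled": True,
    },
)
```

These are the same rows the Django admin exposes, which is what makes changing a schedule or disabling a task a quick admin operation rather than a deployment.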
VEM is deployed on k8s, so the beat worker runs in its own pod. The VEM server and celery workers are in separate pods and have different startup commands.
With celery_beat, the major issue we have encountered a few times is that it tends to break with celery updates (major and sometimes minor). This has led us to pin celery and celery beat to specific versions until celery beat releases a new version compatible with the latest celery. These issues were not detectable via unit tests, and we only found out when the changes reached stage/production. So, relating to Dave's point above ("We should have some way to test this functionality in automated tests, so that we don't accidentally break it for a few months and only discover issues when manually testing the next release."), this would be a challenge.
First of all, thanks @Felipe for pinging me on this!
I don't know much about it, but the last comments on the linked PR pretty much explain the reasoning behind Jenkins.
Not knowing about that comment and the previous attempts, I added a celery beat integration to the edX configuration repository ~2 years later, using single-beat, which exists exactly to solve the problem of multiple concurrently running celery beat schedulers by keeping a "lock" in Redis.
It has been running in production for ~1.5 years for one of our clients without any issues. We also used it for another client with ~4-10 auto-scaling workers, and there were no issues there either.
In Kubernetes, we have not experienced any issues on our test instances either. (Production traffic is not hosted there yet.)
Honestly, I cannot see why it would block anyone from continuing to use Jenkins, crontabs, or one-off commands. As I see it, celery-beat is just an "application-native" crontab.
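For anyone curious about the locking mechanism mentioned above, here is a minimal illustration of the general idea behind single-beat (not its actual code): a Redis key with an expiry acts as a lock, so at most one scheduler instance is active at a time. The key name, TTL, and connection details are made up for the example:

```python
# Illustration only of the "single active instance via a Redis lock" idea.
# Key name, TTL, and connection details are made up; this is not single-beat's code.
import os

import redis

LOCK_KEY = "celery-beat-leader"  # hypothetical lock key
LOCK_TTL = 30                    # seconds; the holder must keep refreshing it

client = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

# SET ... NX EX only succeeds if the key does not exist yet, so at most one
# process acquires the lock and gets to run the beat scheduler; the others
# stay idle and can retry once the TTL expires.
if client.set(LOCK_KEY, os.environ.get("HOSTNAME", "unknown"), nx=True, ex=LOCK_TTL):
    print("Lock acquired: this instance would start celery beat.")
else:
    print("Another instance holds the lock: staying idle.")
```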