Adding Celery Beat as a scheduler to the core offering

Hey dear all,

I am currently working with @alecar on the final feature for the Survey Report project.

The goal of this step is to make it possible for instances to send the report (aggregated and anonymized data) automatically every six months.
Naturally this can be turned off and configured for each instance, but the goal is to make it easy for instances to report the data to measure the growth of the project.

As it stands now, the report can be generated and sent by a superuser in the admin panel. Sending it automatically requires that we have some way of launching async tasks with some regularity.

Currently there are two main ways of scheduling tasks in the platform:

  • Celery Beat: adding django-celery-beat as an additional dependency has been done before. It was even backported to the koa and lilac branches.

  • An external scheduler like Jenkins or crontab: this is the approach that has been favored by edX. It is also how we at edunext manage our largest instances.
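For context, this is roughly what declaring a periodic task for Celery Beat looks like. It is only a minimal sketch: the entry name and the task path `survey_report.tasks.send_report` are hypothetical and not taken from edx-platform, and "every six months" is approximated as 182 days.

```python
from datetime import timedelta

# Hypothetical schedule entry. Celery's beat_schedule setting accepts a
# timedelta (or a number of seconds, or a crontab object) as "schedule".
CELERY_BEAT_SCHEDULE = {
    "send-survey-report": {
        # Assumed dotted path to the task; not the real platform path.
        "task": "survey_report.tasks.send_report",
        # Roughly every six months.
        "schedule": timedelta(days=182),
    },
}
```

With an external scheduler the same interval would live in the crontab or Jenkins job instead of in the application settings.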

During my research I found:

Now that we have mostly landed in a world where k8s is the way of hosting production-grade instances and where Tutor is the supported way of writing the manifests for said k8s clusters, I'd like to bring the question of the scheduler back to the forefront.

Specifically I would like to know:

  • why was celery beat not used on edx.org? What problems guided you in the direction of Jenkins?
  • would the core project be open to installing celery-beat now as a dependency?
  • if adding celery-beat is a no, what would the correct approach for smaller instances be?

Your insights and feedback will be greatly appreciated. Thank you in advance for your time and expertise in this matter.

I'm tagging some people who mostly guided those discussions in the past.

@jill @feanil @dave @gabor

The only concrete case that I know of where edX was having celery beat failures was in the (now unsupported) notifier, way back in 2013:

I don't know what specific issues there were though. The only reference to it I can find in my email is this edx-code post by Jim Abramson back in 2015:

The edx notifier service uses the Advanced Python Scheduler.

https://github.com/edx/notifier/blob/master/requirements.txt#L3
http://apscheduler.readthedocs.org/en/3.0/#

We started initially with Celery Beat but had problems getting it to work / debugging the problems. That was a couple years ago - it may be more reliable at this point, but APScheduler has been rock-solid for us.

In terms of edX's preference for Jenkins, I don't remember the exact rationale. I suspect it might have been because edX already had the Jenkins infrastructure for other things (running tests, preparing releases, etc.), so it was easier to build on top of that rather than introduce another tool with its own configuration and access control management. There might also have been a lack of enthusiasm for running another worker type/deployment specifically for celery-beat, since it would require its own instance and possibly get bogged down by some of the really large/slow tasks that take hours to run. @feanil or @e0d might remember better.

would the core project be open to installing celery-beat now as a dependency?

I'd like to hear more about the operational experiences that OpenCraft and others in the community have had with it, but I'm certainly open to it. My hope would be that:

  1. It should be optional: we should allow people to continue to use Jenkins for these things if they really want, even if Celery Beat is the simple and supported option.
  2. We should have some way to test this functionality in automated tests, so that we don't accidentally break it for a few months and only discover issues when manually testing the next release.
  3. If for some reason it doesn't work, we at least document the problems properly somewhere this time. :stuck_out_tongue:

Good luck!

The core of the problem, as I recall, was ensuring that things only ran once. The way that the edX infrastructure was set up, it was difficult to ensure that celery-beat only got started up on one machine and not on all the celery workers while keeping all the instances ephemeral.

This, combined with the fact that we already had Jenkins infrastructure and it was easy to guarantee run-once there, resulted in the choice of Jenkins for a lot of scheduled work for edx.org.

In the k8s world, starting up a separate celery-beat pod with at most one instance seems much easier, so it may be a much better option.

To continue the thread of old timers going 'uh, let's reach back into the ol memory banks,' but with a more modern twist: we were considering celery-beat while looking at how to set up event bus consumer infrastructure over the last year, and I think someone told us that there were performance issues on edx.org the last time someone tried to use it there, even though the community was using it just fine. (Take that with a grain of salt, because this memory is also somewhat fuzzy, and it was probably Dave that told us this anyway.)

I don't think that was me, FWIW. Maybe one of the SRE folks...?

To add more to this discussion, celery_beat is indeed used in edX. The repository where it is used is a private repository video-encode-manager (https://github.com/edx/video-encode-manager). VEM extensively uses celery beat to schedule video encode pipeline tasks.

  • With celery beat, we have been able to freely and quickly change the schedules of tasks from the admin when needed. The admin also gives insight into when a task was last run, which has been really helpful in diagnosing issues.
  • VEM is deployed on k8s, so the beat worker is running in its own Pod. VEM server and celery workers are in separate pods and have different startup commands.
  • With celery_beat, the major issue that has been encountered a few times is that it tends to break quite frequently with celery updates (major and sometimes minor). This has led us to constrain celery and celery beat to specific versions until celery beat releases a new version compatible with the latest celery. These issues were not detectable via unit tests and we only found out when the changes reached stage/production. So, relating to Dave's point from above (We should have some way to test this functionality in automated tests, so that we don't accidentally break it for a few months and only discover issues when manually testing the next release.), this would be a challenge.

First of all, thanks @Felipe for pinging me on this!

I don't know much about it, but the last comments of the linked PR pretty much explain the reason for Jenkins.

Not knowing about the comment and previous attempts, ~2 years later, I did add celery beat integration to the edX configuration repository utilizing single-beat, which exists exactly to solve the problem of multiple concurrently running celery beat schedulers by keeping a "lock" in Redis.
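To illustrate the idea of a shared lock (this is not single-beat's actual implementation), here is a minimal sketch in which an in-memory class stands in for Redis. All names are hypothetical. With a real Redis, acquiring the lock would typically map to a `SET key value NX EX ttl` command.

```python
class InMemoryLockStore:
    """Stand-in for Redis: one expiring lock per key (illustrative only)."""

    def __init__(self):
        self._locks = {}  # key -> (owner, expires_at)

    def acquire_or_refresh(self, key, owner, ttl, now):
        """Take the lock if it is free, expired, or already ours."""
        holder = self._locks.get(key)
        held_by_other = (
            holder is not None and holder[0] != owner and holder[1] > now
        )
        if held_by_other:
            return False
        self._locks[key] = (owner, now + ttl)
        return True


def elect_scheduler(store, candidates, now, ttl=30.0):
    """Return the candidates allowed to run the beat scheduler at `now`."""
    return [
        name for name in candidates
        if store.acquire_or_refresh("beat-lock", name, ttl, now)
    ]


store = InMemoryLockStore()
workers = ["worker-a", "worker-b", "worker-c"]

first = elect_scheduler(store, workers, now=0.0)    # worker-a takes the lock
second = elect_scheduler(store, workers, now=10.0)  # worker-a refreshes it
# worker-a disappears; its lock expires and another worker takes over.
third = elect_scheduler(store, ["worker-b", "worker-c"], now=50.0)
```

Only one candidate holds the lock at any time, and the expiry lets another worker take over if the current holder goes away.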

It is actually turned off by default, but because we have clients who must run instructor reports periodically, we integrated it with Grove, so anyone who needs it can turn it on.

It has been running in production for ~1.5 years for one of our clients without any issues. We also used it for another client with ~4-10 auto-scaling workers, and there were no issues either.

In Kubernetes, we have not experienced any issues on our test instances so far either. (Production traffic is not hosted there yet.)

Honestly, I cannot see why it would block anyone from continuing to use Jenkins, crontabs or one-off commands. As I see celery-beat, it is just an "application-native" crontab.

Thanks for all your opinions and recommendations. I have some ideas that I would like to write up in a document, and I hope to do it as soon as I can.

I will be out for the next three weeks, but I will spend some time writing the document.

Again, thanks to all.

Hello everyone,

We conducted a whole lot of tests around this feature by installing django-celery-beat in a test environment deployed with Kubernetes.

We used OPENEDX_EXTRA_PIP_REQUIREMENTS in the Tutor config file to install django-celery-beat, a tutor-inline-plugin to add django-celery-beat to INSTALLED_APPS, and add new tasks to be executed in celery-beat.

Additionally, we needed to make a change in the k8s/deployments.yml file, adding the "--beat" flag to the arguments of the lms-workers deployment.

With this setup, we managed to have a functional development environment for Open edX using django-celery-beat to run cron jobs.

With this environment up and running, we performed different tests, including scaling the lms-workers to have more pods/instances and testing the uniqueness of the scheduled tasks (which was an issue in the past). Sadly, we noticed that django-celery-beat has not changed in this regard and still does not solve the problem of multiple celery-beat instances coexisting at the same time. With multiple pods/instances started with the "--beat" flag, each one behaves as a scheduler and executes the scheduled tasks, resulting in task duplication, the same as it was before for edx.org.

In conclusion, using django-celery-beat requires us to have only one pod or instance running celery-beat at a time, so additional mechanisms need to be applied to ensure this.
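The duplication can be pictured with a toy model. This is a simplified sketch, not Celery code: `run_tick` and the task name are hypothetical, and each "beat instance" simply enqueues every due task without knowing about the others.

```python
from collections import Counter


def run_tick(beat_instances, due_tasks):
    """Count how often each due task is enqueued in one scheduling tick.

    Every beat instance acts as an independent scheduler, so each one
    enqueues every due task.
    """
    queue = []
    for _ in range(beat_instances):
        queue.extend(due_tasks)
    return Counter(queue)


single = run_tick(1, ["send_survey_report"])
scaled = run_tick(3, ["send_survey_report"])

print(single["send_survey_report"])  # one scheduler: task enqueued once
print(scaled["send_survey_report"])  # three schedulers: enqueued three times
```

Scaling the workers deployment to three pods that all pass the beat flag corresponds to the second call.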

There are some possible solutions for Open edX projects that have more than one pod/instance of the workers:

A: Add an external locking mechanism to ensure that only one of the workers is initialized with the "--beat" flag.

Pros: does not require Tutor changes. It would be global to the project, independently of the orchestration technology.
Cons: it requires a new locking mechanism to be added to the core.

B: Deploy celery-beat as a StatefulSet, which assigns unique identifiers to each pod and ensures that only one active instance of Celery Beat exists at all times.

Pros: easy to guarantee the uniqueness of the pod.
Cons: only applies to k8s. It will create a second worker pod even for instances that don't require it.

C: Add a new Kubernetes deployment that starts an lms or cms worker using the "--beat" flag. This could be done using a Tutor plugin and managed through a flag in the Tutor configuration. However, workers started with "--beat" would still be able to execute tasks of normal workers, implying duplicated code and at least 2 workers running.

Pros: does not require many changes to the current stack. It could be tested as a plugin and only then proposed for the Tutor core.
Cons: it would be each person's responsibility not to scale the number of pods for this. Only applies to k8s. Instances running with compose in prod would not have it.

D: Not using celery-beat. Instead, use an external scheduler like Kubernetes CronJobs to centrally schedule tasks. This would allow us to execute tasks not only in the LMS but also in other components, and we could manage it through a Tutor plugin.

Pros: from our testing, it's easy to manage this way.
Cons: only applies to k8s. Instances running with compose in prod would not have it.
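As a rough picture of option D, the sketch below builds a Kubernetes CronJob manifest as a Python dict and prints it as JSON, which kubectl also accepts. The job name, image and management command are hypothetical placeholders, not something we tested; the schedule fires at 03:00 on 1 January and 1 July.

```python
import json


def build_cronjob(name, schedule, image, command):
    """Build a batch/v1 CronJob manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Do not start a new run while the previous one is still going.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {
                                    "name": name,
                                    "image": image,
                                    "command": command,
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


manifest = build_cronjob(
    name="survey-report",
    schedule="0 3 1 1,7 *",  # minute hour day-of-month month day-of-week
    image="example.invalid/openedx:latest",  # placeholder image
    command=["./manage.py", "lms", "send_survey_report"],  # hypothetical command
)
print(json.dumps(manifest, indent=2))
```

Because the cluster itself triggers the job, the number of worker pods no longer affects how many times the task runs.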

We would like to hear your opinions and possible ideas about this implementation. Thank you in advance!

From my perspective, running a cron job in production is almost always a red flag. The presence of a cron job often means that there is some task that needs to be performed, but the code is so confusing that we don't know what conditions should cause the task to be triggered. So we pick some arbitrary schedule to trigger the task, and that schedule is typically either too often (causing unnecessary server load) or too seldom (resulting in stale data).

In addition, cron jobs are often poorly monitored. Error reporting requires ad-hoc settings, which are difficult to manage. Also, as you explained, cron jobs are tricky to scale.

In the current case, you want to generate a report periodically and send it to a user. What if some user wants a report to be generated right now, and not wait for the cron to trigger? What if they want to tweak the options for the report? (columns, destination address, etc.) What's the retry policy?

I would argue that a "better" solution to this problem would be to expose some interface to users so that they can trigger the task themselves. This could be a friendly user interface, or an API endpoint. In addition, to generate periodic reports, you can rely on your continuous deployment infrastructure, which was engineered precisely to run periodic tasks. Alternatively, the end user can use IFTTT, Zapier, n8n, or any other automation tool to trigger the jobs themselves.

Such an interface already exists for bulk email sending, via the Communications MFE. But it needs a scheduling backend, which could be django-celery-beat.

But the horizontal scaling limitations are kinda funky, yeah. For those scenarios, some other cloud-native thing probably makes more sense.

I just found this django-redbeat thing (based on this other redbeat thing), though, which could help solve these issues for all use cases.

This part is already done. Admin users can generate the reports every time they want, they can also see the status of previous reports and tweak their configurations.

The situation we are facing now is more like: what if said users never send the report? They don't care about it, they simply forget, or they find the sending instructions too complicated (I would argue this is not the case, as it is as easy as going to the /admin tab, but I'm not a regular user). The point of the survey report is that, as a software community, we have regular data that is aggregated and anonymous. We could send some messaging every so often asking operators to send the report, but it is a toss-up whether they will answer.

To the broader point of having some tasks that start from the server I agree that this could be a red flag sometimes, but I also think we must start by acknowledging that the edx-platform already has this behavior written in. By not allowing the platform to start some tasks by the server we are not making the architecture better, we are only making entire features not usable. This applies to the email nudges and the SAML certificate verification (that I know of).

@Alecar from the information I've gathered from the thread, I think we can still perform a test running celery beat as a separate deployment (not embedded in the worker process, since such an approach is not recommended by celery) and try to wrap the celery beat call with the single-beat suggested by @gabor. (BTW @gabor, do you remember the reason you used the fork here?)

Regarding RedBeat, I'm not quite sure whether the library forces a change of the Beat scheduler, or whether we can use it just to control the lock that prevents multiple beats.

Under the hood, django-redbeat uses the RedBeat scheduler and exposes a Django data model that interfaces with the periodic tasks stored in Redis. This does not seem to be a good fit to solve this problem if we want to keep the level of customization the django-celery-beat scheduler offers.

I feel like part of my message was ignored:

I would add to that list: native cron jobs on the native server and Kubernetes CronJob.

For all the reasons outlined above, Celerybeat is a terrible solution to a problem that 90% of the time does not exist. Please let's not add it to the core.

Sorry for leaving part of your message unanswered. It was more that I did not know what to answer to that, but here I will try.

As I said in the previous post, what we want to achieve is the automatic sending of anonymized aggregated data from Open edX instances. Having data to make product and community decisions is key. It was an important part of the annual report. This goal won't be achieved if we leave most instances behind. We already have only about 10% (as per the report) of known instances answering the survey. That is why we want to build it as an opt-out feature.

Not every production-grade instance uses k8s. For me that makes k8s-centric solutions also not great. Asking operators to go out of their way to add IFTTT, Zapier, n8n or a custom tool like Jenkins to their installations would drive the response rate of the survey into the ground.

I don't think that making it easy to run async tasks, which is part of the inner workings of several features, is that bad. Yes, getting celerybeat to run only once requires some work, but that is why we are having all this discussion.

I'm also happy to hear more ideas on how we can make this survey have a higher response rate.
Would you be open to including this (even with false as a default) as one of the questions of the interactive Tutor quickstart questionnaire? I imagine there is a way to connect such a decision to actually installing a cron job on the native server.

Would it be better to use a banner like Wikipedia's and ask users in the /admin/ page to send the report twice a year? I would really like to know what others in the community think about this.

For anyone keeping up with this discussion: we are going ahead with a proposal to use a Wikipedia-style banner as a way to invite people to send their reports.

You can read about it here: Proposal: Banner in the Django admin to send Survey Report information - Open edX Product Management - Open edX Community Wiki
