A proposal for BTR: Tag point releases whenever they're needed

In particular, why not tag release/ulmo.2 right now?

Historically, Open edX point releases go out on a 2-month cadence, something like this:

  • Dec 9: release/zebrawood.1
  • Feb 9: release/zebrawood.2
  • Apr 9: release/zebrawood.3

and when the .1 release is delayed, the point releases are delayed too. So in the case of Ulmo, we’d have:

  • Jan 18: release/ulmo.1
  • Mar 18: release/ulmo.2
  • May 18: release/ulmo.3

But here’s the thing: there are three important fixes sitting on the tip of the release/ulmo branch right now: a critical bugfix for the Catalog MFE, a Django security fix, and a critical fix to the openedx image build. We need to get these into a Tutor patch release ASAP, so that Tutor v20 (Ulmo) users can have a secure and functioning platform. But, due to the delay of the ulmo.1 release, we’re still over a month from the planned ulmo.2 tagging date.

When this has happened in past releases, the strategy has been to cherry-pick the critical fixes into the Tutor Dockerfile, and then do a Tutor patch release. For example, in Quince, Tutor had to do a v17.0.5 release to cherry-pick in a privilege escalation fix which merged into the quince branch but missed the quince.3 tag. Yes, we could do this now for the three Ulmo fixes I listed (and that’s what I’ll do tomorrow if this proposal is rejected :).

But, it seems backwards to me that we’re relying on Tutor to cherry-pick in critical fixes rather than doing our own Open edX patch releases. Plus, as a Tutor maintainer, I can vouch that these Dockerfile cherry-picks are a pain in the neck to keep track of, especially across the main and release branches :face_with_tongue:

My immediate proposal for BTR

Release ulmo.2 now so that we can push these important fixes out without modifying Tutor’s Dockerfile. We’ll then cut a Tutor patch release which simply updates OPENEDX_COMMON_VERSION from ulmo.1 to ulmo.2.
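For illustration, at the git level a point release is just an annotated tag on the release branch tip. A minimal local sketch (the throwaway repo and commit here are made up, not the real release procedure):

```shell
# Minimal local illustration: a point release is just an annotated tag
# on the release branch tip. The repo and commit here are throwaways.
git init -q ulmo-demo
cd ulmo-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "fix: critical Catalog MFE bug"
git -c user.name=demo -c user.email=demo@example.com \
    tag -a release/ulmo.2 -m "Ulmo point release 2"
git tag --list 'release/ulmo.*'   # prints: release/ulmo.2
```

The actual release process tags every released repo, but each individual tag is no heavier than this.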

My long-term proposal for BTR

:stop_sign: Stop: tagging .1, .2, .3 on a strict schedule.
:stop_sign: Stop: cherry-picking into Tutor’s Dockerfile.

:sparkles: Start: the following process…

  • .1 is tagged according to the schedule
  • .2, .3, .4, etc. can be tagged when either of the following is true:
    • the release manager deems it appropriate to push out fix(es)
    • approximately 2 months have gone by since the last point release.
  • The final point release goes out concurrently with .1 of the next named release, thus ensuring that there are no dangling “unreleased” commits on a named release at the point where it goes out of support.
  • Every named release will have at least four point releases (.1, .2, .3, .4), but it may have many more if necessary.

:eye: Example:

  • yucca.1 goes out on Jun 9 according to the schedule.
  • yucca.2 goes out on Jun 19 because a critical bug is found and fixed
  • yucca.3 goes out on Aug 19 because 2 months have passed
  • yucca.4 goes out on Aug 30 to fix a Django security bug
  • yucca.5 goes out on Aug 31 to fix another security issue
  • yucca.6 goes out on Oct 31 because 2 months have passed
  • yucca.7 goes out on Dec 9, which is when zebrawood.1 goes out. Yucca is now unsupported.

:plus: Benefits: Fixes are officially released and blessed by the Open edX project rather than relying on Tutor to decide what’s critical enough to patch. Tutor gets simpler: no more patch conflicts between the main and release branches. Critical fixes no longer get “lost” just because they merged after .3 but before the next release was tagged.

:minus: Drawbacks: The release manager will need to run the tagging script more often. They need to tag every released repo to do a point release, not just the affected ones. This will be some added work, but I’m happy to help improve the automation if this becomes a point of friction.

Thoughts?

10 Likes

Not that I necessarily know all the inner workings and what goes into doing all this, but in my opinion this makes a lot of sense. Delaying releases just to fit a calendar schedule somewhat undermines the importance of the fixes/updates waiting to be added.

Strong +1 from the Tutor maintainer perspective.

This proposal addresses a real pain point we’ve been dealing with. Cherry-picking fixes into Tutor’s Dockerfile across main and release branches is error-prone and creates unnecessary maintenance burden. Having official point releases cut when needed (especially for security fixes) would be much cleaner.

Those three fixes (Catalog MFE bug, Django security patch, and image build fix) are critical enough to warrant an immediate release rather than waiting until mid-March.

The only consideration: we should document the new cadence clearly so operators understand point releases may come more frequently than every 2 months.

1 Like

Thanks for the comments so far.

As an aside, I’ve also opened a Tutor PR to cherry-pick the ulmo.1 fixes into the Dockerfile (the old way), just so they aren’t blocked while we reach consensus on when to release ulmo.2.

Overall this proposal sounds great!

There is one scenario I want to think through though.

Scenario: Non-critical fix in review, critical fix lands.

For this example let’s assume the last point release was 1 month ago, on Jan 10.

Current (fully scheduled)

  • Feb 5: Non-critical fix PR opened
  • Feb 10: Critical fix lands
    • Tutor cherry-pick hotfix lands shortly after
  • Feb 15: Non-critical fix lands
  • Mar 10: New point release goes out, includes both fixes

Proposed (no delays for critical fixes)

  • Feb 5: Non-critical fix PR opened
  • Feb 10: Critical fix lands
  • Feb 13: New point release goes out
  • Feb 15: Non-critical fix lands
  • Apr 13: New point release goes out

In this scenario the release of the non-critical fix (which may still be quite important for some site operators) would be delayed by over a month compared to the current schedule.

I don’t have a strong opinion on how to best handle this scenario, but one possible option that comes to mind would be to shorten the post-critical-point-release window from 2 months to 1 month.

Updated yucca example:

  • yucca.1 goes out on Jun 9 according to the schedule.
  • yucca.2 goes out on Jun 19 because a critical bug is found and fixed
  • yucca.3 goes out on Jul 19 because 1 month has passed since the previous critical-fix release
  • yucca.4 goes out on Aug 29 to fix a django security bug
  • yucca.5 goes out on Aug 30 to fix another security issue
  • yucca.6 goes out on Sep 30 because 1 month has passed since the previous critical-fix release
  • yucca.7 goes out on Nov 30 because 2 months have passed
  • yucca.8 goes out on Dec 9, which is when zebrawood.1 goes out. Yucca is now unsupported.

1 Like

Not a maintainer here, but may I suggest adopting some kind of semantic versioning (https://semver.org/) ?

  • ulmo/verawood/yucca are the major versions: breaking changes.
  • .1, .2, .3, .4 are feature additions.
  • .X.1, .X.2, .X.3 are bug fixes.

That way:

  • ulmo.2 can remain on the fixed calendar.
  • ulmo.1.1 can be released now as a bug fix of ulmo.1.

1 Like

In any case, I strongly support releasing a new version of Tutor ASAP to include those critical bug fixes.

I’m generally in favor of this proposal, but I want to focus on this:

I don’t think this is a drawback. The current release script is in dire need of a refactor, primarily to make it actually automatic: faster and less prone to failure. It should be part of a GitHub workflow so that the release manager need only press a button. They shouldn’t even need to keep tabs on the log, and should instead be notified only in case of failure.

The flip side: we need to make sure new repositories follow the rules, so the script doesn’t fail due to permissions issues. This accounts for 90% of the failures we run into, time and again.
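To make the push-button idea concrete, here is a deliberately simplified dry-run sketch of a tag-everything workflow step; the repo list, org, and overall shape are illustrative assumptions, not the real BTR release script:

```shell
#!/bin/sh
# Dry-run sketch of a push-button point-release tagger.
# The repo names and org below are illustrative assumptions, not the
# actual list of Open edX release repositories.
TAG="release/ulmo.2"
BRANCH="release/ulmo"
for repo in edx-platform frontend-app-learning frontend-app-authn; do
    # A real workflow would clone, tag, and push; printing the commands
    # keeps this sketch side-effect free.
    echo "git clone --branch $BRANCH https://github.com/openedx/$repo"
    echo "git -C $repo tag -a $TAG -m 'Point release $TAG'"
    echo "git -C $repo push origin $TAG"
done
```

A permissions failure on any one repo (the 90% case mentioned above) would then surface as a single failed workflow run rather than a half-finished manual session.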

2 Likes

I’m in favor of the proposal. Critical platform bug fixes shouldn’t rely on Tutor cherry-picking.

@mboisson: I get the appeal of semantic release, but I’d actually push for keeping incremental feature updates out of the point releases altogether (i.e. only have bug fixes). Even the scheduled point releases don’t get nearly the kind of testing scrutiny that they should for feature additions.

2 Likes

I 100% agree with both: additional tagging when needed, and no longer using the git patches in the Dockerfile. The patches approach seems like it is from a time when Tutor wasn’t official, so landing upstream changes was harder.

In this scenario the release of the non-critical fix (which may still be quite important for some site operators) would be delayed by over a month compared to the current schedule.

I think we can keep the scheduled releases the same and add the additional releases in between. On March 18 we simply tag release/ulmo.3 instead of release/ulmo.2. Worst case scenario, ulmo.3 and ulmo.2 point to the same commit, which wouldn’t be the first time that has happened.

1 Like

@kmccormick Thank you for this post, I was about to make a similar post to discuss the release process and how it needs to evolve. Let me start by answering your queries and replying to your proposal, and then add some proposals of my own.

I wholeheartedly agree with you on this point. If a change is made in the platform and backported to the specific release branch, then there is no point in making changes to Tutor. Tutor’s responsibility is to run the platform, not to patch it, and operators deserve that fix irrespective of whether they use Tutor.

A very good point, and I agree with it. The only reason I didn’t propose to cut a release is that I felt the delta of changes being released was too small, and I wanted to ship a bit more. Had I known that there were critical fixes waiting, I would have shipped it or planned to cut the release that instant. I will actually schedule a release cut ceremony and try to ship it out by Wednesday.

Now, here are a few observations and an additional proposal on top of what you pointed out.

What does a tag signify, and why doesn’t semver work for us?

If we think about it, why do we have release/<branch>.1 and tags/releases like that? I believe it is because we have always had so many components that we wanted to freeze a frame and say that at this moment, everything worked as expected; this becomes our frame of reference to point out a bug or a change in behaviour.

This is one of the reasons we do very elaborate testing of the platform after we freeze the code in the main release branch (e.g. release/ulmo), and we are very careful about doing a .1 release, because we care about operators. This gives us the confidence to backport fixes and even features, because then we only have to test for that specific use case.

I think this is not achievable with semver, because shipping a small patch in one repo would then mean shipping all the repos to make sure all of them are on the same version. Semver works wonders when there is a mono repo and no other service depends on it.

But with us, we have so many repos and no way to mark which version of one repo worked with which version of another. The only way to unify all of them is the tag/release.

This is my 2 cents on the understanding I have of the Open edX ecosystem. I can be wrong.

Proposal

We should still do a point release on a strict schedule for .1, which helps us with the branch freeze and testing; this is what @kmccormick proposed and I support. After that, we should do bi-weekly releases, no matter the size of the delta since .1.

There should be no limit to this, so we can have any number of releases.

The point of friction we have is scheduling the ceremony and coordinating the time, which might be solved by making a few of these releases async. I believe we should ship often.

I totally resonate with @arbrandes on this. I think we should spend some time to have this kind of automation in place.

Question

The question I have is, what if some critical feature or fix is merged into a repo? How will the release manager know?

One possible approach could be to ping the release manager on the PR, telling the RM that it is a critical fix so that the RM can schedule a release cut. The RM could also be pinged in the BTR Slack channel.

I am open to hearing what people think about it.

1 Like

Thank you everybody for the replies! Glad to see we’re all leaning in the same direction.

Great! I think this will work in the vast majority of cases. There may be exceptional cases, namely a high-severity security fix, that would warrant an even quicker release time.

100%

We have this doc: https://openedx.atlassian.net/wiki/spaces/COMM/pages/2065367719/Backporting+Making+a+pull+request+for+a+named+release . Whatever process you choose, we could revise that doc with it and share it around.

That said, if we truly are releasing every two weeks, then possibly only the Security WG would ever need to reach out to you for a special release.

@kmccormick @farhaanbukhsh Very enthusiastic +1 from me on your proposals. We’re planning on bringing them to the Release Planning meeting on Thursday to try and help get buy-in so we can make this happen.

One note on Kyle’s original proposal: Cherry picking fixes into Tutor also only works for teams that use it. MIT for one does not use Tutor at all for any of our deployments.

2 Likes

One minor addition I just thought of: if we go with Farhaan’s bi-weekly proposal, there would be about 12 patch releases, so perhaps instead of .1, .2, etc., we should do .01, .02, etc., so that when it gets to .10 the tags continue to sort alphabetically.
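The sorting concern is easy to demonstrate: a plain lexicographic sort puts .10 before .2, which zero-padding avoids (version-aware sorts like GNU `sort -V` or git’s `--sort=version:refname` also handle the unpadded case, but not every tool is version-aware):

```shell
# Plain lexicographic sort misorders unpadded point releases:
printf 'release/ulmo.1\nrelease/ulmo.2\nrelease/ulmo.10\n' | sort
# prints: release/ulmo.1, release/ulmo.10, release/ulmo.2

# Zero-padded tags sort correctly even lexicographically:
printf 'release/ulmo.01\nrelease/ulmo.02\nrelease/ulmo.10\n' | sort
# prints: release/ulmo.01, release/ulmo.02, release/ulmo.10
```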

1 Like

Thank you all for moving this forward! I’m pretty excited for us as a community to implement these changes into the release process :slight_smile:

One possible approach could be to ping the release manager on the PR, telling the RM that it is a critical fix so that the RM can schedule a release cut. The RM could also be pinged in the BTR Slack channel.

We could start releasing after any release blocker is merged. However, this would require a couple of improvements on our side (independent of the proposal):

  1. Get better at identifying release blockers in a timely manner.
  2. Empower contributors and maintainers to classify issues as release blockers, or to reach out when they believe a fix should trigger a release.