How do we build continuous integration for the supported installation?

Last week the Juniper installation started failing when codecov yanked their old releases (reported here in this forum).

We find out about things like this when users report them, but it isn’t always clear what is being reported, or whether the problem affects everyone.

I would like to improve the reliability of the installations by finding out about these failures sooner. An automated daily installation would give us confidence that the out-of-the-box installation instructions are still working.

How can we make this happen? @arbrandes: I hear OpenCraft may have something partially built already? Can we make this a real project and get it done?

@nedbat That’s true. We already have daily periodic builds for Juniper and Ironwood at OpenCraft, so, for instance, we spotted the codecov bug earlier. What we don’t have is a check-and-alert mechanism: for now we have to check builds manually, do some triage, and then maybe alert the community. That adds latency, and we can miss failures too, since the process relies heavily on manual steps.

We are already writing a discovery about how to build a better Community CI, based on this Trello card. It is pretty clear that the first thing we need is an alerting system. Do you have any suggestions about how, where, and whom to notify when a build fails? Since the Open edX stack depends on multiple services, some false positives can also happen.
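One possible way to dampen false positives from flaky external services is to alert only after several consecutive failed builds. The sketch below is illustrative only: the `should_alert` helper and its `threshold` parameter are hypothetical names, not part of any existing OpenCraft or edX tooling.

```python
def should_alert(history, threshold=2):
    """Return True when the most recent `threshold` builds all failed.

    `history` is a list of booleans, oldest first; True means the build passed.
    `threshold` is a hypothetical tuning knob, not an existing setting.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    recent = history[-threshold:]
    # Require a full window of results, all of them failures.
    return len(recent) == threshold and not any(recent)
```

With a threshold of 2 and daily builds, a one-off failure stays quiet and a failure that persists into the next day triggers a notification, at the cost of one extra day of latency.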

This is great to hear. The simplest notification would be an email to a list. We could ask members of the working group to volunteer to receive the failure notifications.

What would be the best way to have a mailing list for the BTR working group? Can edX host one for us or should we do it ourselves? cc @regis

IMHO these forums are better than a mailing list. What feature would you be looking for in a mailing list which you don’t have here?

EDIT 2020/08/29: I posted my answer in reply to @toxinu’s comment and did not notice the topic title. I agree that it does not make much sense to post notifications to Discourse.

The good thing about a forum is that people can come to it at their own pace, and participate in discussions asynchronously.

A notification of a failed build isn’t something we need to keep around in case someone wants to contribute to it a week later. It’s something to act on now, either by writing up an issue to be fixed in the long run, or by fixing something now.

We can get the “notification effect” by putting messages into Discourse, but we don’t need the rest of what Discourse offers, and I wouldn’t want to clutter Discourse with notifications. Discourse should be for discussions.

@nedbat @regis I agree with both of you: a mailing list will suit this better. I will take that into account for the next improvements we want to add. :+1:

Can edX host one for us or should we do it ourself?

What about this question?

I can make an edX-hosted Google Group mailing list (like adr-notifications@googlegroups.com).

We’re discussing details here: https://openedx.atlassian.net/browse/BTR-8

The open-edx-btr-notifications@googlegroups.com mailing address is for these notifications.

Update: I made the link go to the Groups page rather than the email address.

Just landed on this thread while talking about things with the SRE team. I like the idea of using the existing OpenCraft tests that were mentioned in the https://openedx.atlassian.net/browse/BTR-8 and sending failures to this new mailing list.

@toxinu Does OpenCraft also run these on master and do you have an idea of who will respond to failures once an email is sent? (I joined the google group in case I break something, but it’s not clear to me where the tests mentioned in https://trello.com/c/PaAA0zFz/61-community-run-ci-open-edx-tests#comment-5ee179a0c265065f3f438eaf are actually being run or what the common failure patterns are)

I would also be interested in preventing configuration PRs from being merged, as @antoviaque mentioned in https://trello.com/c/PaAA0zFz/61-community-run-ci-open-edx-tests#comment-5eda16716241c26c867454d2, if there is a way to rapidly run a subset of these tests that would catch common issues. The edX SRE team has also been thinking about adding small checks like this with GitHub Actions in other repos, but it’s been hard to guess which guardrails/automations/linters/tests are worth implementing.

Thanks for your reply Adam. :slight_smile:

Does OpenCraft also run these on master and do you have an idea of who will respond to failures once an email is sent? (I joined the google group in case I break something, but it’s not clear to me where the tests mentioned in Trello are actually being run or what the common failure patterns are)

Yes, we also have a periodic build on master (not triggered by any push, just a periodic build at a fixed interval).

The tests mentioned on the Trello card don’t exist right now (or we don’t know about them). I was trying to get more context about what edX wants us to do to have a better periodic build that covers more layers of testing. So if you can provide more context, I would be happy to hear it.

As for who will need to respond to failures, it will probably depend on the severity of the failure.
I can share our work-in-progress discovery with you: SE-2879 - How we can help with edX community CI (Shared) - Google Docs

First, I would like to reduce the number of places we have information. Let’s forget about the Trello board for this working group, and use the Jira board instead.

As far as what edX wants, the important thing here is, what does the community (as represented by the BTR working group) want? My proposal was to install the latest named release branch (“open-release/juniper.master”) once a day, to detect when something had changed in the universe that breaks the installation. Running the installation on the “master” branch would also be helpful, to find problems early in the release cycle.

It’s great to have this in place, thank you!

+1, that sounds like a great way to start.

@geoffrey Do we already have a task for sending the relevant notifications from our CI to this address? When could it be done?

Yes, we have SE-2879 (Discovery: how we can help with edX community CI), which is still open for new ideas, but I can create a ticket to implement the relevant notifications from our periodic builds to the mailing list for the next sprint (Sept 21st - Oct 5th). Sounds good, @antoviaque?

@toxinu Sounds perfect - thank you!

We (OpenCraft) have updated our periodic build mechanism to send an email to the open-edx-btr-notifications@googlegroups.com mailing list when an installation fails. We will try to send a test email soon, so do not forget to subscribe to it. :slight_smile:

You can find more information on the BTR-8 ticket.
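For readers wondering what such a notification could look like in code, here is a minimal sketch using Python's standard `email.message.EmailMessage`. It is not OpenCraft's actual implementation: the `build_failure_email` function, the sender address, and the subject and body layout are all assumptions made for illustration. Only the list address comes from this thread.

```python
from email.message import EmailMessage

# The list address is the one announced in this thread.
LIST_ADDRESS = "open-edx-btr-notifications@googlegroups.com"


def build_failure_email(instance_name, domain, task_name, log_lines,
                        sender="ci@example.com"):
    """Compose a failure notification (hypothetical helper and layout)."""
    msg = EmailMessage()
    msg["From"] = sender  # placeholder sender, an assumption
    msg["To"] = LIST_ADDRESS
    msg["Subject"] = f"Deployment failed at instance: {instance_name} ({domain})"
    body = [f"Ansible task name: {task_name}", "Relevant log lines:"]
    body.extend(log_lines)
    msg.set_content("\n".join(body))
    return msg

# Sending is left out on purpose; with a reachable SMTP server it could be
# done with smtplib.SMTP(host).send_message(msg).
```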

We are now getting emails with reports of failures. This morning I got an email (“Deployment failed at instance: Periodic Build Master (periodic-build-master.opencraft.hosting)”) that ended with this:

Ansible task name: rbenv : if ruby-build exists, which versions we can install
Relevant log lines:
{'changed': True,
 'cmd': ['test', '-x', '/usr/local/bin/ruby-build'],
 'delta': '[Filtered data]',
 'end': '[Filtered data]',
 'msg': 'non-zero return code',
 'rc': 1,
 'start': '[Filtered data]',
 'stderr': '',
 'stderr_lines': [],
 'stdout': '',
 'stdout_lines': []}

What should we do with these?

Are other people seeing failures with master?

FWIW, I tried installing from master today, and the “if ruby-build exists” task was skipped, but then a later task failed:

TASK [forum : initialize elasticsearch] ****************************************

== cmd ===========================
['/edx/app/forum/cs_comments_service/bin/rake', 'search:initialize']
== msg ===========================
non-zero return code
== stderr ===========================
/edx/app/forum/cs_comments_service/lib/tasks/flags.rake:6: warning: already initialized constant ROOT
/edx/app/forum/cs_comments_service/lib/tasks/kpis.rake:7: warning: previous definition of ROOT was here
/edx/app/forum/cs_comments_service/lib/tasks/db.rake:28: warning: already initialized constant COURSE_ID
/edx/app/forum/cs_comments_service/models/constants.rb:2: warning: previous definition of COURSE_ID was here
/edx/app/forum/cs_comments_service/lib/tasks/deep_search.rake:7: warning: already initialized constant ROOT
/edx/app/forum/cs_comments_service/lib/tasks/flags.rake:6: warning: previous definition of ROOT was here
rake aborted!
Elasticsearch::Transport::Transport::Errors::InternalServerError: [500] {"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}
/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/base.rb:218:in `__raise_transport_error'
/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/base.rb:346:in `perform_request'
/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/http/faraday.rb:37:in `perform_request'
/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/client.rb:176:in `perform_request'
/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-api-7.8.0/lib/elasticsearch/api/namespace/common.rb:38:in `perform_request'
/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-api-7.8.0/lib/elasticsearch/api/actions/indices/create.rb:48:in `create'
/edx/app/forum/cs_comments_service/lib/task_helpers.rb:92:in `block in create_indices'
/edx/app/forum/cs_comments_service/lib/task_helpers.rb:89:in `each'
/edx/app/forum/cs_comments_service/lib/task_helpers.rb:89:in `create_indices'
/edx/app/forum/cs_comments_service/lib/task_helpers.rb:198:in `initialize_indices'
/edx/app/forum/cs_comments_service/lib/tasks/search.rake:30:in `block (2 levels) in <top (required)>'
/edx/app/forum/.gem/ruby/2.5.0/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'
Tasks: TOP => search:initialize
(See full trace by running task with --trace)
== stdout ===========================
W, [2020-10-23T15:48:43.689810 #31853]  WARN -- : Overwriting existing field _id in class User.
W, [2020-10-23T15:48:43.721014 #31853]  WARN -- : MONGODB | Unsupported client option 'max_retries'. It will be ignored.
W, [2020-10-23T15:48:43.721070 #31853]  WARN -- : MONGODB | Unsupported client option 'retry_interval'. It will be ignored.
W, [2020-10-23T15:48:43.721089 #31853]  WARN -- : MONGODB | Unsupported client option 'timeout'. It will be ignored.

How do we coordinate to make this continuous integration a better signal?

@nedbat I manually checked the error, and our master periodic build hit the same error as yours. We will fix the empty stdout_lines list, which should contain the actual error. (cc @gabor)
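As a rough illustration of the kind of fallback that could help when the line lists come back empty, the sketch below picks the first non-empty field from an Ansible-style task result. The `relevant_error_text` helper is hypothetical and is not the fix OpenCraft planned. It assumes the result is a dict with the keys shown in the email above.

```python
def relevant_error_text(result):
    """Return the most informative error text found in a task result dict."""
    # Prefer the per-line output, skipping blank lines.
    for key in ("stderr_lines", "stdout_lines"):
        lines = [line for line in (result.get(key) or []) if line.strip()]
        if lines:
            return "\n".join(lines)
    # Fall back to the raw streams, then to the generic message.
    for key in ("stderr", "stdout", "msg"):
        text = (result.get(key) or "").strip()
        if text:
            return text
    return "no error output captured"
```

For the result quoted in the failure email, every output field is empty, so this would fall through to the `msg` value. That is still not very informative, which suggests the report may also need to look at other tasks in the run.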