Privacy concerns with tracking module

Dear Open edX users and developers,

tl;dr: the tracking module collects a lot of personal data that we do not need. Can Open edX turn it off by default?

We have run an Open edX instance for a Digital Security training platform for a few years now. Our target audience includes people with elevated security risks, like journalists and activists.

After upgrading to Juniper, we investigated what personal data we gather in the log files. And we were shocked. The following data is collected by the tracker module almost every time a user clicks anything in Open edX:

{"accept_language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3", "page": null, "event": "{\"GET\": {\"videoId\": [\"xxx\"]}, \"POST\": {}}", "context": {"org_id": "x", "course_id": "x", "course_user_tags": {}, "path": "/courses/course-v1:x", "user_id": xxxx}, "host": "learn.totem-project.org", "time": "2021-02-22T10:47:47.576859+00:00", "event_source": "server", "event_type": "/courses/course-v1:x", "ip": "xxx.xxx.xxx.xxx", "username": "x", "referer": "x", "agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0"}

So here we find:

  • IP address
  • Browser
  • Operating system
  • Username
  • Page the user came from (referer)
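
To check your own logs for the same fields, something like the following Python sketch might help. It is only an illustration: the `PERSONAL_FIELDS` list and the helpers `personal_fields_in` and `redact` are names I made up, not part of Open edX, and the sample event is a shortened version of the one above with placeholder values.

```python
import json

# Assumption: the top-level keys treated as personal data here are taken
# from the sample event above. This is not an official Open edX list.
PERSONAL_FIELDS = ("ip", "username", "agent", "referer", "accept_language")


def personal_fields_in(raw_line):
    """Return the personal-data fields present in one tracking log line."""
    event = json.loads(raw_line)
    found = {key: event[key] for key in PERSONAL_FIELDS if event.get(key)}
    # The numeric user id sits inside the nested "context" object.
    user_id = (event.get("context") or {}).get("user_id")
    if user_id is not None:
        found["context.user_id"] = user_id
    return found


def redact(raw_line):
    """Return the event as a dict with the personal fields blanked out."""
    event = json.loads(raw_line)
    for key in PERSONAL_FIELDS:
        if key in event:
            event[key] = ""
    if isinstance(event.get("context"), dict):
        event["context"].pop("user_id", None)
    return event


# Hypothetical, shortened sample event with placeholder values.
sample = json.dumps({
    "accept_language": "fr",
    "ip": "192.0.2.1",
    "username": "x",
    "referer": "x",
    "agent": "Mozilla/5.0",
    "event_type": "/courses/course-v1:x",
    "context": {"user_id": 1234, "course_id": "x"},
})

print(sorted(personal_fields_in(sample)))
print(redact(sample))
```

Blanking fields after the fact does not change what the tracker writes in the first place; it only shows which parts of each event would need attention.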

Although it is possible to disable the tracking module (we wrote this Tutor plugin to do so), I think it should be disabled by default. Apart from the fact that we have a particularly vulnerable target audience, in Europe we need to consider the General Data Protection Regulation (GDPR), specifically Article 6. This information is more than enough to identify an individual person, so it is covered by that law.

As far as I can tell, the tracking module is only necessary if you enable an optional module, like Insights. Many organizations do not seem to use Insights, and are thus gathering information they do not use. If you do not have an explicit use for the data and you don’t have explicit consent, you’re basically breaking the law by gathering and storing it.

I am posting this here because I am wondering if others have considered their compliance with the GDPR as well. I am also wondering if we might have missed some reason why the tracking module needs all this information (even if you don’t use Insights). I would like to propose that the tracking module be disabled by default for new Open edX installations, to prevent European organizations from accidentally breaking the law.

On behalf of our DPO:

We have also observed a risk in the data that the Open edX software collects natively.

For our part, we need to collect and process data for educational science research carried out by our partners, mostly universities; to do so, we remove identifiers from the datasets before transmitting them. We find that the data linked to Insights may be excessive for our use, and that there is also a risk in other data that is not directly identifying, such as the referrer (which can, in certain cases, be indirectly identifying).

That being said, considering that the GDPR includes the notions of “privacy by design” and “privacy by default” (and also that data can be processed on a legal basis other than consent), disabling these functions by default does indeed seem necessary, and would help to minimize data collection and limit risks.

I agree that tracking should be disabled by default, and enabled explicitly by people who want to use Insights.

Other relevant info:

  • The tracking logs are not sent to external storage by default (see COMMON_OBJECT_STORE_LOG_SYNC).
    So while these logs will accumulate and be compressed and rotated on your Open edX servers, they’re not sent to persistent storage unless you request them to be.
    A small comfort if you’re using ephemeral VMs for hosting…
  • Rotated logs are kept for a minimum of around 2 years by default, which seems excessive too. (They are preserved for 16000 hourly rotations if you’re generating 1M of log data each hour.)
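
For anyone who wants to check the retention figure, here is a small Python sketch of the arithmetic. It assumes one rotation per hour, as in the 1M-per-hour case above; the helper name `retention_days` is my own, and on a quieter site rotations would happen less often, so the logs would be kept even longer.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def retention_days(rotations, hours_per_rotation=1.0):
    """Days of logs kept when `rotations` rotated files are preserved."""
    return rotations * hours_per_rotation / HOURS_PER_DAY


days = retention_days(16000)  # 16000 preserved rotations, one per hour
years = days / DAYS_PER_YEAR
print(f"{days:.0f} days, about {years:.1f} years")  # 667 days, about 1.8 years
```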

Thanks for raising this issue, I’d like to understand better what the implications of this would be.

I know that the Insights threshold is too narrow; there are numerous instructor analytics options beyond Insights.

Is there an example of an open-source project that does this well today?