Analytics Strategy for Data Teams

Good day Team!

We are currently using Nutmeg and we use Cairn for our team to create dashboards on some of the data.

We have a data science team that wants to pull data into another system so that they can answer some questions about it.

We are trying to determine the best route for them to get all the data from the Open edX system. Is there a strategy that others have used to send all the data to other systems?

We started with the team pulling from the MySQL database, but this is not all the data, and the data team doesn’t believe this database alone gives them everything they need.

It sounded like there was a separate package outside of Cairn at one point in time, based on this thread: analytics-in-maple-cairn-vs-nothing

Are there existing APIs that our external teams can integrate with? Or is it recommended to pull data from Cairn (the ClickHouse DB?) to get all the data?

I was just curious whether there is a defined pattern/package/strategy that Open edX recommends, or that the community has used, for exporting all the data to a data science team.

Thanks in advance!

Hi @justin-jones !

Yes, the MySQL database does not contain all of the data: it’s missing the events that are emitted whenever your users interact with Open edX. That event data is stored in ClickHouse. AFAIK, Cairn does not have APIs for fetching this data, so you’ll need to query ClickHouse directly to fetch these rows.

@regis is this correct?

Hi @justin-jones! Jill is absolutely correct that the best way to pull data from Cairn is to query the ClickHouse database directly. Fortunately, ClickHouse exposes its API in a couple of user-friendly ways. You could query ClickHouse from Python, using for instance the clickhouse-connect driver, or call the HTTP API directly.

More ways to interface with ClickHouse are documented here: Drivers and Interfaces | ClickHouse Docs
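
To make the HTTP route a bit more concrete, here is a minimal sketch that uses only the Python standard library. It assumes ClickHouse's HTTP interface is reachable from wherever the data science team runs it; the function names are made up for illustration, and the host, database name and credentials you pass in depend entirely on your own deployment, so treat it as a starting point rather than the way Cairn expects to be queried.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url, sql, user, password, database):
    """Build a POST request for ClickHouse's HTTP interface (hypothetical helper)."""
    # JSONEachRow returns one JSON object per line, which is easy to parse.
    body = f"{sql.rstrip().rstrip(';')} FORMAT JSONEachRow".encode("utf-8")
    params = urllib.parse.urlencode({"database": database})
    url = f"{base_url.rstrip('/')}/?{params}"
    # Credentials are passed with ClickHouse's HTTP authentication headers.
    headers = {"X-ClickHouse-User": user, "X-ClickHouse-Key": password}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_rows(payload):
    """Turn a JSONEachRow response body into a list of dicts."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def fetch_rows(base_url, sql, user, password, database):
    """Run a query and return its rows. This performs a network call."""
    request = build_request(base_url, sql, user, password, database)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_rows(response.read().decode("utf-8"))
```

A call would then look something like `fetch_rows("http://localhost:8123", "SELECT * FROM events LIMIT 10", user, password, "openedx")`, where port 8123 is ClickHouse's default HTTP port and the `events` table and `openedx` database are placeholders to replace with whatever your instance actually contains.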

@regis and @jill,

I just wanted to verify my understanding of analytics within Open edX. It sounds like there is/was a tool called edX Insights/analytics-pipeline that used to be the access point for other systems and teams to get analytics from Open edX installations.

It sounds like this tool may or may not be deprecated from the Open edX platform? Or, at least, it is not the preferred way for other systems and teams to interact with the analytics data within Open edX.

We are partnering with a team within our organization with the goal of providing Open edX analytical data so that they can do some data modeling and use it for other organizational learning.

The team that we are working with suggested installing the Insights tool, but it doesn’t seem to be supported in the newer Open edX releases, and it doesn’t appear to be the community’s preferred approach.

Is this a correct understanding?

Also, thank you both for the reply!

I’ll take a look at the ClickHouse docs!

There was a clarification by the Insights team at the conference last week: we were told that Insights is still actively maintained by 2U. However, my personal interpretation is that it is only maintained for the purpose of running on edX.org, and not for the community at large. I might be wrong here, so if someone from the Insights team could comment, that would be great.

FWIW, it was the difficulty of running Insights, as well as a couple of important missing features (real-time data, dashboard customisation, extensibility), that convinced me to create Cairn.
