analytics-pipeline has now been removed from devstack via DEPR-119, and a comment there mentions the “edx-analytics-pipeline repo eventually being archived”.
Does this mean the edX team is developing something to replace the current analytics-pipeline? Will the Insights app get replaced too? Is there a timeline for the new Insights/analytics?
I'm just wondering whether we should deploy Insights/analytics-pipeline now if it’s going to be removed or replaced soon.
I had asked a similar question in the analytics channel of the Open edX’s Slack.
My original question was the following:
And what is the status of Insights in the mid to long term? I saw at least one DEPR ticket (DEPR-119) about Insights and Devstack. What about the future and the native / community installation? @Nimisha Asthagiri I am tagging you to get the architectural perspective and try to understand where analytics is going with regard to Open edX.
Here is what @nimisha answered me on January 27th:
Insights Ownership and Future
Insights, as you know, has not received as much care as we would like. At edX, its ‘ownership’ has been tossed about. The good news is that, finally, there is a product-delivery team that has taken official ownership of Insights now. The meh news is that, they have not yet made a decision on its future but they have a goal for doing so in this coming quarter.
Analytics Architecture Future
From an architectural perspective for Open edX, here are 2 tickets that provide info:
I don’t believe I have anything more concrete to share. All I can say is we are still evaluating.
Here is what I know:
My product-delivery team took over the technical ownership of Insights only 2 weeks ago. Our ownership covers only the frontend data visualization part; the backend data warehouse is not owned by us.
The Insights data visualization component is being researched for next steps. However, the priority of those next steps needs to be decided in the next 2 months.
The Insights data translation and aggregation pipeline uses EMR, an older infrastructure compared to the rest of the data infrastructure. As time progresses, this area will require a more urgent decision.
This is a feature/product that is on my mind. Whatever you choose to do in this area, I am interested to hear your thoughts and feedback.
FWIW I discussed this with Brian Beggs last summer, and while I imagine a lot may have changed, here’s an excerpt from the internal note I shared with my team afterward:
edX has been focused on building up their own BI (Business intelligence) tooling and it’s very focused on their own needs and the sort of concerns and limitations that apply at edX.org scale. …
edx-analytics-pipeline is based on the ETL model (Extract raw data, Transform it, Load it into the separate analytics database), and on Hadoop Map-Reduce, Hive, Luigi, etc. But edX and [others] are moving away from an ETL approach and toward ELT instead: Extract raw data, Load it into your data warehouse, and then Transform it as needed when you need to run reports/queries. For this purpose, edX has been using an open-core tool called dbt, and I was told that the team loves dbt and that it has made their analytics way better, more flexible, easier to code, etc. The main difference is that before, to update edx-analytics-pipeline for some new report, one had to know Python, Hive, Hadoop, Luigi, SQL, Jenkins DSL, and more. With dbt, writing a new report only requires knowing SQL.
So where does this leave us?
I think that if customers ask for analytics in the future we should try to leverage that to create an open source “Open edX Data Warehouse” built on dbt. This could be built out fairly quickly, would rapidly surpass Insights in functionality, and scale to instances of any size. The edX dbt code is not open source, but Brian had mentioned they’re open to open-sourcing parts of it that would be in common with a community approach. For customers…that have their own BI tools, they can ingest data from dbt; for the community in general we can create some sample BI reports using Metabase. This would be so much simpler and cheaper than edX Insights…
We all liked this general approach but have not done any work toward it yet.
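For anyone who wants to see the ELT idea from the excerpt in concrete terms, here is a minimal sketch. It does not use dbt or any edX code: an in-memory SQLite database stands in for the data warehouse, and the table, view, and event names are all made up for illustration. The only point is that the raw data is loaded untouched, and a "report" is nothing more than a SQL SELECT defined on top of it, which is roughly the shape a dbt model takes.

```python
import sqlite3

# Hypothetical raw tracking events (username, event_type, course_id).
# These rows and names are illustrative, not real Open edX data.
RAW_EVENTS = [
    ("alice", "play_video", "course-1"),
    ("alice", "problem_check", "course-1"),
    ("bob", "play_video", "course-1"),
    ("alice", "play_video", "course-1"),
]


def load_raw(conn, events):
    """Extract + Load: copy the events into the warehouse without reshaping them."""
    conn.execute(
        "CREATE TABLE raw_events (username TEXT, event_type TEXT, course_id TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)


def define_report(conn):
    """Transform: the report is a SQL view over the raw table, computed when queried."""
    conn.execute(
        """
        CREATE VIEW video_plays_per_user AS
        SELECT username, COUNT(*) AS plays
        FROM raw_events
        WHERE event_type = 'play_video'
        GROUP BY username
        """
    )


def run_report(conn):
    """Query the view like any other table."""
    query = "SELECT username, plays FROM video_plays_per_user ORDER BY username"
    return conn.execute(query).fetchall()


def demo_elt():
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    load_raw(conn, RAW_EVENTS)
    define_report(conn)
    return run_report(conn)


if __name__ == "__main__":
    print(demo_elt())
```

Adding a second report here means writing another SELECT, with no changes to the loading step, which matches the "only requires knowing SQL" point in the note above.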
Thanks Nimisha and Braden for your comments on the analytics story at edX. We discovered Metabase and have been finding it super powerful and easy to deliver custom reports to our customers.
Our non-technical people can easily build reports without needing to know SQL, which I think is one of the biggest strengths of Metabase, but I’ve also heard very good things about dbt for doing more sophisticated analytics.
One of the downsides of Metabase is that you can’t do cross-database joins, so combining the data in MySQL with data captured from the event tracking stream is not possible unless you push all of that data into a single data warehouse.
For this purpose, we’re looking at tools like Dremio and Snowflake.
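As a rough illustration of the "push everything into a single data warehouse" workaround, here is a small sketch. It is only an assumption-laden toy: three in-memory SQLite databases stand in for the LMS MySQL database, the event store, and the warehouse, and every table name, column, and row is hypothetical. It does not use Metabase, Dremio, or Snowflake. Once both tables live in the same database, the join becomes an ordinary single-database query.

```python
import sqlite3


def make_lms_db():
    """Stand-in for the LMS MySQL database (hypothetical table and rows)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE enrollments (username TEXT, course_id TEXT)")
    db.executemany(
        "INSERT INTO enrollments VALUES (?, ?)",
        [("alice", "course-1"), ("bob", "course-1")],
    )
    return db


def make_events_db():
    """Stand-in for a store fed by the event tracking stream (hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (username TEXT, event_type TEXT)")
    db.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("alice", "play_video"), ("alice", "play_video"), ("bob", "problem_check")],
    )
    return db


def copy_table(source, warehouse, table, columns):
    """Copy every row of `table` from a source connection into the warehouse."""
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = source.execute(f"SELECT {col_list} FROM {table}").fetchall()
    warehouse.execute(f"CREATE TABLE {table} ({col_list})")
    warehouse.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


def video_plays_per_enrollment(warehouse):
    """Join the two formerly separate sources inside the single warehouse."""
    return warehouse.execute(
        """
        SELECT e.username, e.course_id, COUNT(ev.username) AS plays
        FROM enrollments AS e
        LEFT JOIN events AS ev
          ON ev.username = e.username AND ev.event_type = 'play_video'
        GROUP BY e.username, e.course_id
        ORDER BY e.username
        """
    ).fetchall()


def demo_warehouse():
    warehouse = sqlite3.connect(":memory:")
    copy_table(make_lms_db(), warehouse, "enrollments", ["username", "course_id"])
    copy_table(make_events_db(), warehouse, "events", ["username", "event_type"])
    return video_plays_per_enrollment(warehouse)


if __name__ == "__main__":
    print(demo_warehouse())
```

In a real deployment the copy step would be a scheduled sync rather than a one-off function, but the resulting warehouse is what a BI tool would then be pointed at.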