FWIW I discussed this with Brian Beggs last summer, and while I imagine a lot may have changed, here’s an excerpt from the internal note I shared with my team afterward:
edX has been focused on building up their own BI (Business intelligence) tooling and it’s very focused on their own needs and the sort of concerns and limitations that apply at edX.org scale. …
edx-analytics-pipeline is based on the ETL model (Extract raw data, Transform it, Load it into the separate analytics database), and on Hadoop MapReduce, Hive, Luigi, etc. But edX and [others] are moving away from an ETL approach and toward ELT instead: Extract raw data, Load it into your data warehouse, and then Transform it as needed when you need to run reports/queries. For this purpose, edX has been using an open-core tool called dbt and I was told that the team loves dbt and it’s made their analytics way better, more flexible, easier to code, etc. The main difference is that before, to update edx-analytics-pipeline for some new report, one had to know Python, Hive, Hadoop, Luigi, SQL, Jenkins DSL, and more. With dbt, writing a new report only requires knowing SQL.
So where does this leave us?
I think that if customers ask for analytics in the future we should try to leverage that to create an open source “Open edX Data Warehouse” built on dbt. This could be built out fairly quickly, would rapidly surpass Insights in functionality, and scale to instances of any size. The edX dbt code is not open source, but Brian had mentioned they’re open to open-sourcing parts of it that would be in common with a community approach. For customers…that have their own BI tools, they can ingest data from dbt; for the community in general, we can create some sample BI reports using Metabase. This would be so much simpler and cheaper than edX Insights…
We all liked this general approach but have not done any work toward it yet.
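To make the ELT idea from the excerpt a bit more concrete, here is a rough sketch of the pattern, using Python's built-in sqlite3 as a stand-in for a real data warehouse. The table name, event names, and sample rows are all made up for illustration; this is not edX's dbt code or the actual tracking-log schema. The point is only that raw data gets loaded untouched, and the "report" is nothing more than a SQL SELECT over it, which is roughly what a dbt model amounts to.

```python
import sqlite3

# Hypothetical raw events (username, course_id, event_type).
# These names are illustrative only, not the real Open edX event schema.
RAW_EVENTS = [
    ("alice", "course-v1:Demo+A+2020", "enrollment.activated"),
    ("bob", "course-v1:Demo+A+2020", "enrollment.activated"),
    ("alice", "course-v1:Demo+B+2020", "enrollment.activated"),
    ("bob", "course-v1:Demo+A+2020", "enrollment.deactivated"),
]

# The "T" in ELT: a report defined purely in SQL over the raw table.
ENROLLMENT_REPORT_SQL = """
CREATE VIEW course_enrollments AS
SELECT
    course_id,
    SUM(CASE event_type
            WHEN 'enrollment.activated' THEN 1
            WHEN 'enrollment.deactivated' THEN -1
            ELSE 0
        END) AS active_enrollments
FROM raw_events
GROUP BY course_id
"""


def load_raw(conn, events):
    """The "E" and "L": store the raw events as-is, with no transformation."""
    conn.execute(
        "CREATE TABLE raw_events (username TEXT, course_id TEXT, event_type TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)


def build_report(conn):
    """Create the SQL-only transformation and query it."""
    conn.execute(ENROLLMENT_REPORT_SQL)
    return conn.execute(
        "SELECT course_id, active_enrollments "
        "FROM course_enrollments ORDER BY course_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_EVENTS)
report = build_report(conn)
print(report)
```

Adding another report in this setup means writing another SELECT against raw_events; the loading code never changes, which is the appeal described in the note.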