Complex Cairn Migrations

I am looking to push a new dataset into ClickHouse that joins details currently stored in MongoDB and MySQL. I have a prototype that extracts and transforms this data using a Python script, but I am trying to determine the best way to load it into ClickHouse. All I have found in the tutorcairn source code are SQL migration files, which wouldn’t be sufficient for loading the data points from Mongo. Are there examples of Python migrations that others have created for loading data into Cairn?

Thanks in advance!

cc @tutor-maintainers

Hi Jeff! Cairn does have a script to import aggregated data from MongoDB to ClickHouse – it’s a bit clunky, but that’s really because the MongoDB data model is just difficult to work with. Have a look at the importcoursedata.py script at $(tutor config printroot)/env/plugins/cairn/apps/openedx/scripts/importcoursedata.py.

Basically, the only way I found to read data jointly from MySQL and MongoDB was to first import the data from MongoDB into ClickHouse. Cairn does this to import the structure of courses (section names and hierarchy). Then, joining this data with MySQL becomes easier.
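To make that first step concrete, here’s a minimal sketch of a MongoDB-to-ClickHouse import, assuming pymongo and clickhouse-driver are available. The connection settings, collection, table, and column names are all illustrative – they are not what Cairn’s importcoursedata.py actually uses:

    # Minimal sketch: copy documents from MongoDB into a ClickHouse table so
    # they can later be joined with MySQL-sourced tables inside ClickHouse.
    # All names below (databases, collection, table, columns) are placeholders.
    from clickhouse_driver import Client
    from pymongo import MongoClient

    mongo = MongoClient("mongodb://localhost:27017")
    clickhouse = Client(host="localhost")

    rows = [
        (str(doc["_id"]), doc.get("display_name", ""), str(doc.get("parent_id", "")))
        for doc in mongo["edxapp"]["course_blocks"].find()
    ]
    clickhouse.execute(
        "INSERT INTO course_blocks (block_id, display_name, parent_id) VALUES",
        rows,
    )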

Does this answer your question?

Thank you @regis! That is a great example. Can you tell me how to run a custom script similar to importcoursedata on a schedule to refresh the data in ClickHouse?

The importcoursedata script is a great start toward getting the data I need into ClickHouse. I would want to enhance it, though, to include fields such as parent_id in order to reconstruct the course hierarchy and definition_id in order to track library problem blocks that are reused in multiple course contexts.

There are many ways to achieve this. If you are running tutor local, you could, for instance, create a cron job on the host that executes the following command:

tutor local run -v /host/path/to/myscript.py:/tmp/myscript.py:ro lms /tmp/myscript.py
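For example, a crontab entry like the following would refresh the data every hour (the schedule, script path, and log file are placeholders to adapt):

    # Hypothetical crontab entry: re-run the import script hourly
    0 * * * * tutor local run -v /host/path/to/myscript.py:/tmp/myscript.py:ro lms /tmp/myscript.py >> /var/log/cairn-refresh.log 2>&1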

As for enhancing it: the importcoursedata script is not really extensible, and I’m not sure whether it would make sense to modify it. Maybe you could create a separate table to import your data?

Alternatively, I seem to remember hearing recently that the course structure is also stored in MySQL – but I’m not sure in which table. I propose to summon @dave’s omnipotent knowledge to enlighten us! Dave, could you give us a clue? I searched in edx-platform but could not find anything. Thanks!

In turn, I summon @BrianMesick and his recent work on getting edx-platform’s coursegraph app to push its course structure data into clickhouse.

The definition_id is a low-level implementation detail, and it will change with even minor edits to the source library content, without a backwards link to the previous one. You might be better served by looking at the library_content blocks in the exported course content data, which will give you explicit links to the library and version being used.
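For instance, here’s a rough sketch of pulling those links out of an exported course. It assumes a typical OLX export layout where library_content blocks live in a library_content/ directory and carry source_library_id / source_library_version attributes – verify against your own export before relying on this:

    # Sketch: list the library and version each library_content block points
    # at, reading from an exported course's OLX. The directory layout and
    # attribute names are assumptions based on a typical export.
    from pathlib import Path
    from xml.etree import ElementTree

    for olx_file in Path("course_export/library_content").glob("*.xml"):
        block = ElementTree.parse(olx_file).getroot()
        print(olx_file.stem, block.get("source_library_id"), block.get("source_library_version"))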

FWIW, in the new copy-paste Studio work, copied_from_block is an explicit attribute that will be mixed into the OLX, though I don’t know how useful vs. noisy that’s going to be in practice.

I am summoned! So the branch above is defunct, but I’m actively working on a more general version of it that will allow other signals / events to trigger sending data to ClickHouse in real-ish time (initially over Celery tasks, later over the event bus :crossed_fingers:). Work is happening over here, initially based on that branch, but only started yesterday: Add Course Published event listener and plugin plumbing by bmtcril · Pull Request #1 · openedx/openedx-event-sink-clickhouse · GitHub

This does a similar thing to what Coursegraph does: it listens for the COURSE_PUBLISHED signal and fires off a Celery task that uses the in-platform code to pull the course structure from the modulestore and serialize it out to ClickHouse as JSON. Once the signal version is working, I’ll add the management command that allows you to dump (one / some / all) courses on demand.
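The overall pattern looks roughly like this – a minimal sketch, with the receiver and task names invented for illustration (the real plugin’s code lives in the PR above):

    # Sketch of the publish-signal pattern: a receiver catches the
    # course-published signal and hands the heavy lifting to a Celery task.
    from celery import shared_task
    from django.dispatch import receiver
    from xmodule.modulestore.django import SignalHandler


    @shared_task
    def dump_course_to_clickhouse(course_key_string):
        # Hypothetical task body: pull the course from the modulestore,
        # serialize it to JSON, and insert it into ClickHouse.
        ...


    @receiver(SignalHandler.course_published)
    def on_course_published(sender, course_key, **kwargs):
        # Course keys aren't JSON-serializable, so pass the string form.
        dump_course_to_clickhouse.delay(str(course_key))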

I’m hoping to have that plugin ready for real review in a few days, but it’s currently in the ugly prototype-to-production phase of development.

Our goal is to track a user’s performance on a particular problem from a library across all courses that utilize that library content. I was under the impression that definition_id is the common identifier that is referenced by all instances. From what I have seen, this is how I understand the library_content block to work:

  1. When a particular library is selected in a library_content block, all of the library’s children are copied as children of the library_content block with new unique block_ids
  2. Corresponding children of the library and library_content block can be associated via their shared definition_id
  3. If a new library_content block references the same library, the children will be copied again and receive new unique block_ids but still share the underlying definition_id

Is this an accurate description, and can I rely on definition_id to associate these problem instances across contexts? Or am I missing details?

You’re correct in terms of what happens, but I think the issues are:

  1. The definition_id used by the library component changes whenever you make any sort of edit to that component, but the individual courses will still point to the old definition_id until you sync from the library.
  2. The definition_id is a low-level implementation detail (i.e. it’s not part of the OLX export), and may not be safe to rely on in the long term.

I dug into this a bit more this morning, and I think what you want is to call get_block_original_usage on the modulestore, and pass it the UsageKey of the ProblemBlock in the course. That’ll get you the stable UsageKey for the original in the library. The LibraryToolsService uses this for analytics.
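If it helps, here’s a minimal sketch of that call, meant to run inside edx-platform (e.g. from a Django shell or management command). The usage key is illustrative, and I’m assuming get_block_original_usage returns a (usage key, version) pair, with (None, None) for blocks that weren’t copied in from a library:

    # Sketch: resolve a course-local problem block back to the stable usage
    # key of its original in the content library.
    from opaque_keys.edx.keys import UsageKey
    from xmodule.modulestore.django import modulestore

    # Illustrative key for a problem block inside a course
    problem_key = UsageKey.from_string(
        "block-v1:Org+Course+Run+type@problem+block@abc123"
    )

    original_key, original_version = modulestore().get_block_original_usage(problem_key)
    if original_key is not None:
        print(f"Library original: {original_key} (version {original_version})")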

Thank you for your guidance! I’m sorry for not replying sooner; I have been absorbed in a different project and haven’t had a chance to follow up yet. I’ll share my results once I’ve been able to experiment with your advice. Thanks again.

@dave I finally got back around to this task and it appears that your suggestion to use get_block_original_usage is working like a charm. I am now able to track a user’s performance on a particular problem (defined in a library) over time across all contexts in which that library problem is used. This is flowing into Cairn for internal reporting and tracking. Thanks again!
