Guidance needed on processing course content text in course discovery

Muhammad_Ammar · August 25, 2022, 7:26pm

We (Markhors Squad) need your guidance to devise the right architecture for a new feature. Below is the expected flow of what we are trying to achieve.

For each unit in a course, fetch text content of the unit including video transcripts
Send the course text content to course-discovery. taxonomy-connector is an edX PyPI package that has all the logic to process text data. Text content will not be stored in course-discovery.
Translate the non-English text via the AWS Translate service. We store the source text plus the translated text in this case only.
taxonomy-connector will send the text to https://lightcast.io
https://lightcast.io will return the job skills data by analyzing the text content
Job skills data will be stored in the database, keyed as unit_block_id -> skills data
Job skills data will be exposed via an API
Job skills data will be shown for each unit when a learner visits that unit in LMS
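
A minimal sketch of the flow above, for illustration only. The function name and all four callables are hypothetical stand-ins: `translate` represents an AWS Translate call and `extract_skills` represents the taxonomy-connector/Lightcast call. Neither reflects the real client APIs.

```python
from typing import Callable, Dict, List


def tag_units(
    unit_texts: Dict[str, str],
    is_english: Callable[[str], bool],
    translate: Callable[[str], str],
    extract_skills: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each unit_block_id to the skills found in its text.

    Every callable here is a hypothetical placeholder, not a real API.
    """
    skills_by_unit: Dict[str, List[str]] = {}
    for unit_block_id, text in unit_texts.items():
        # Only non-English text goes through translation.
        english_text = text if is_english(text) else translate(text)
        skills_by_unit[unit_block_id] = extract_skills(english_text)
    return skills_by_unit
```

Passing the external services in as callables keeps the loop testable without network access, and leaves open where (edx-platform or discovery) the loop actually runs.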

Our thought process

Course content exists in Mongo in edx-platform, but we need to process the data in discovery because we already have all the taxonomy-related text processing code there.
On second thought, it feels like we should do the processing in edx-platform because course content natively exists there, but as mentioned previously, all the taxonomy-related text processing code only exists in taxonomy-connector.
Transferring a large amount of text from edx-platform to discovery feels uncomfortable.

Any guidance on this will be highly appreciated.

Thank you!

Zia_Fazal · August 26, 2022, 5:50am

It appears a similar effort has been in progress. In the Markhors case, tags/taxonomies need to be pulled from a third-party service (lightcast.io).

Similar discussion took place in this thread and it appears tagstore is the proposed solution to link taxonomies with course content.

robrap · August 26, 2022, 3:45pm

I’m having trouble parsing a sentence from the tagstore README:

Tagstore is a system for tagging entities. For example, a common use case would be applying difficulty tags like “easy”, “medium”, or “hard” to XBlocks (learnable content components) that are stored in Blockstore.

Does “that are stored in Blockstore” refer to some subset of XBlocks (Blockstore only, if that even makes sense), or is it just stating where tagstore data will be stored? I think it means:

Tagstore is a system for tagging entities. For example, a common use case would be applying difficulty tags like “easy”, “medium”, or “hard” to XBlocks (learnable content components). Tagstore data will be stored in Blockstore.

Can someone confirm?

robrap · August 29, 2022, 3:25pm

Thanks @Muhammad_Ammar. This is a very useful write-up.

Some additional notes from our conversation:

Although the first use case for the skills data is to display to learners in the LMS, we noted that discovery is likely to also need the skills data for marketing/catalog purposes.
I also want to highlight the ownership tension between the course data itself (which has clear ownership and a source of record in the CMS service), its related skills data (whose ownership is less clear), and the use of the third-party service lightcast (which thus far has lived in discovery).

jristau1984 · August 29, 2022, 3:35pm

I do not know where Tagstore data is stored, but Blockstore refers to the newer generation content store that will eventually replace ContentStore. Currently only “v2” blocks are stored in blockstore, and currently v2 blocks are only leveraged by LabXChange. Hope this helps!

robrap · August 29, 2022, 4:17pm

Thanks Jeremy. Your statement is clear, but the tagstore README still doesn’t make clear whether tagstore can only be used against newer generation content or any content. It can be read either way, so this would need to be answered by someone who knows about tagstore.

Also, noting that my tagstore question is a tangent from the original thread, which may not make any use of tagstore.

Parvin_Kumar · August 29, 2022, 5:17pm

Hi @Muhammad_Ammar ,
I am very interested in testing Tagstore. Could you please open source the code so that the whole community can benefit from it? It would be your contribution to this community. I also wanted to try the tagstore feature first and create tutorials, but I am not able to do so right now.

Muhammad_Ammar · August 29, 2022, 8:01pm

@Zia_Fazal Thank you for mentioning tagstore. Unfortunately, I am not familiar with it, and I am not sure whether we want to extend or use tagstore for our use case. We already store taxonomy data in course-discovery/taxonomy-connector, and we may want to use that for the new use case mentioned above. Having said that, we will definitely look into tagstore to see how it works.

@robrap Thank you for adding more details related to the original question asked.

Muhammad_Ammar · August 29, 2022, 8:04pm

@Parvin_Kumar I think tagstore is already open source. Please check the GitHub repo.

Parvin_Kumar · August 30, 2022, 5:32am

It is not compatible with Nutmeg. Could you please open source your latest version of the code so that it is useful for the whole community?

dave · August 31, 2022, 10:33am

(I started writing this response and then got sick. It’s not a full response, but I wanted to get it out sooner than later.)

I would encourage you to think about this in two different pieces that could be used together, but don’t necessarily have to be:

The part that exports this data from edx-platform and puts it somewhere (and is unaware of lightcast)
The part that sends the data to lightcast and knows what to do with the results.

One option is to use straight OLX course exports. The API for this exists already, and OLX is more-or-less stable for existing things (though we’ve sometimes made mistakes here). The downside is that you’d have to parse OLX, and that can be surprisingly hard with edge cases.

Another option is to do something that builds on top of the search indexing functionality that some prominent XBlocks expose. That is already meant to get a plaintext, search-indexable summary of a given XBlock. You could make a django user task that spins off a celery task and generates a simple (but well documented) export file format that maps usage keys to text like you need for lightcast.
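
As a rough illustration of what such an export file could look like, here is a sketch that serializes usage-key/text pairs as JSON Lines. The format, the field names (`usage_key`, `text`), and both function names are assumptions made for this example and are not an existing edx-platform format.

```python
import json
from typing import Dict, Iterable, Tuple


def build_export(blocks: Iterable[Tuple[str, str]]) -> str:
    """Serialize (usage_key, text) pairs as JSON Lines, one block per line."""
    lines = []
    for usage_key, text in blocks:
        # Skip blocks that have no indexable text.
        if not text.strip():
            continue
        lines.append(json.dumps({"usage_key": usage_key, "text": text}, sort_keys=True))
    return "\n".join(lines)


def parse_export(payload: str) -> Dict[str, str]:
    """Read the export back into a usage_key -> text mapping."""
    return {
        record["usage_key"]: record["text"]
        for record in map(json.loads, filter(None, payload.splitlines()))
    }
```

One record per line means the consumer can stream a large course without loading the whole file, and newlines inside block text are escaped by the JSON encoder so they do not break the line framing.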

If you need constant updates, you could hook into the course publish signal instead, and constantly push updates to some django-storages location, and then emit a message bus event to let course discovery know when it’s changed. One word of warning there is that we have a lot of meaningless publishes that won’t actually change course content (because course-discovery pushes some metadata updates to Studio… for scheduling maybe?), so you’d probably want to save the hash or have some other way of verifying that things have really changed before kicking off another send to lightcast.
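
One way the "save the hash" check might look, as a sketch only. The function names are hypothetical, and where the previous digest is stored (a model field, a cache, a file next to the export) is left open.

```python
import hashlib
from typing import Dict, Optional


def content_digest(text_by_usage_key: Dict[str, str]) -> str:
    """Return a stable SHA-256 digest of a course's extracted text."""
    hasher = hashlib.sha256()
    # Sort keys so the digest does not depend on dict ordering.
    for usage_key in sorted(text_by_usage_key):
        for part in (usage_key, text_by_usage_key[usage_key]):
            data = part.encode("utf-8")
            # Length-prefix each part so field boundaries are unambiguous.
            hasher.update(len(data).to_bytes(8, "big"))
            hasher.update(data)
    return hasher.hexdigest()


def should_resend(previous_digest: Optional[str], text_by_usage_key: Dict[str, str]) -> bool:
    """True when the text differs from what was last sent (or nothing was sent yet)."""
    return previous_digest != content_digest(text_by_usage_key)
```

Because the digest covers only the extracted text, a publish that changes nothing but scheduling metadata would produce the same digest and could be skipped.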

Your lightcast awareness can stay isolated to course-discovery. I’m guessing that there’s not much you can reuse in tagstore, and that a lot of your logic is going to be specific to this use case anyway.

(Will try to write more when I have working brain again.)

dave · August 31, 2022, 10:37am

You might be able to get the search index data by using the Course Blocks API (I don’t remember if that’s part of the API or not). If that’s the case, you wouldn’t have to build anything new in edx-platform for the first pass at this. Though you would have to be careful about race conditions if you’re triggering anything off of course publish because block transformers can take a long time to run.

navin · September 1, 2022, 6:06am

At OpenCraft, we are also planning to build something similar to help edX tag XBlocks (specifically vertical and video blocks) using lightcast (or other external APIs). Additionally, users will be able to validate these generated tags via the LMS.

We have started working on a discovery which looks similar to what @dave has described in his comment.

The plan is to

Store tags in the database as usage_id and tags, along with some metadata such as the number of users approving the tag for the block and a flag for whether the tags require validation, as they could be added by the author.
Handle updates to an XBlock, such as when it is deleted, moved, or duplicated. We can add openedx_filters in the contentstore to the respective methods, like _move_item, _delete_item, etc.
Create an XBlock mixin to add fields and functions to support validation of tags by users, extract text, and add tags.
Create an API to query these tags.
Add Celery tasks to extract text and send it to an external API for tagging.
Create a UI for validating tags in the LMS as well as adding tags from Studio.
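
To make the first item of the plan concrete, here is one possible shape for a stored tag record, sketched as a plain dataclass rather than a Django model. The class and field names are hypothetical, and treating author-added tags as not requiring validation is an assumption about the plan, not something stated in it.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class BlockTag:
    """One tag on one XBlock; all field names are hypothetical."""

    usage_id: str
    tag: str
    # Assumption: machine-generated tags need validation, author-added ones do not.
    requires_validation: bool = True
    approved_by: Set[str] = field(default_factory=set)

    def approve(self, username: str) -> None:
        # A set keeps repeat approvals by the same user from double counting.
        self.approved_by.add(username)

    @property
    def approval_count(self) -> int:
        return len(self.approved_by)
```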

To extract text content from XBlocks, we are planning to either query indexed data directly from the search engine (Elasticsearch) or use the index_dictionary function, which returns a plain text summary of the content, as mentioned by @dave.

(Currently, we are not planning to store the text data anywhere; we would send it directly for tagging and store only the tags.)

Muhammad_Ammar · September 1, 2022, 1:06pm

@dave @navin Many thanks for your detailed replies. You have provided a lot of food for thought. Let me dig into these details and understand how things work. As a first step, my main focus is on the recommendation below from Dave.

The part that exports this data from edx-platform and puts it somewhere (and is unaware of lightcast)

The part that sends the data to lightcast and knows what to do with the results.

Thank you again!

braden · September 7, 2022, 11:19pm

@robrap I am the author of Tagstore. I was away, so sorry for the slow reply here.

Tagstore can tag any kind of content whatsoever, as long as it has a stable ID. Since XBlocks have stable usage IDs, it can be used to tag XBlocks. It was intended to work with Blockstore so the focus was on tagging XBlocks that are stored within Blockstore, but it is extremely generic (essentially tags use what Django calls a “Generic Foreign Key”), so it can be used to tag users, content in external systems, XBlocks, or anything else.

The first version of Tagstore used Neo4j to store tags, but the latest version uses MySQL just like any other Django app.

@Parvin_Kumar Tagstore is and always has been open source: You can see the latest version of it at https://github.com/openedx/blockstore/tree/dbea0491b34a3e8ecef45e17ce93e9c6eb6a3c85/tagstore

What I think you don’t understand is that (1) Tagstore was never completed, (2) Tagstore is not being developed anymore and hasn’t been for 2 years, and (3) Tagstore is only a REST API and has no UI, so it can’t be directly used by users.

However, despite those caveats it works very well! It provides a REST API for tagging entities (things) of any type, and it supports a hierarchical taxonomy, so for example if you tag something as being a “black bear” but then ask Tagstore to list all entities tagged with any “Mammal” tag, it will return that thing because it knows that all black bears are mammals. That hierarchical taxonomy functionality was the main “feature” of Tagstore, and the rest of it is very straightforward.

Later I realized that you can easily get essentially the same functionality from ElasticSearch: if you store tags as a field on your content and index all of your content metadata (including tags) into ElasticSearch in the appropriate format, ElasticSearch can provide the same hierarchical search functionality. That’s what we did for LabXchange, which is how it has hierarchical tagging functionality despite not using Tagstore.

robrap · September 9, 2022, 10:52pm

Thanks @braden. That’s very helpful context. Consider updating the README and/or adding an ADR to leave this information close to the code.

braden · September 9, 2022, 11:10pm

I would, except that Tagstore is neither used nor developed, and only exists in the git history. There is currently no README in the master version of the repo that hosted it, so there’s not really anything to update. If someone ever resurrects it and starts using it, then of course we can do that.

Parvin_Kumar · September 12, 2022, 12:42pm

Thank you very much for your help, Braden. Could you please guide me on how to achieve this with ElasticSearch, and how to store tags as a field on our content?
Could you please explain how we can implement this? Thank you in advance for your time.

robrap · September 12, 2022, 3:21pm

I see. The link provided earlier was to the opencraft fork of blockstore/tagstore, which still had a README, so I was confused. Sounds good. Thanks again for the clarification.

braden · September 12, 2022, 6:23pm

@Parvin_Kumar When you are defining the ElasticSearch document that will index your XBlocks, use a field like this for the tags:

tags = Keyword(multi=True)

Now, let’s say your tag hierarchy looks like this:

Mammal
- Black bear
- Human
- Elephant

Now say you are indexing a particular XBlock that is about black bears, so it is tagged with the tag “Black bear”. So even though this XBlock only has one associated tag (Black bear), when you generate the tags field of your ElasticSearch index document, you actually want to store two tag strings:

Mammal
Mammal/Black bear

Then on the frontend, to do a search for all items tagged “Mammal” OR any sub-tag of Mammal, you simply do an exact match for tags=Mammal and it will correctly find the XBlock about black bears, even though the block was only tagged with “Black bear” and not Mammal. Likewise, if the user wants to find XBlocks tagged with Black bear, you do an exact match for tags=Mammal/Black bear and you’ll find all such XBlocks.
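
The expansion step described above can be sketched in plain Python. The `PARENTS` map and the function names are made up for this example; in practice the resulting list would be assigned to the `tags` keyword field when the document is indexed, and the exact match would be done by ElasticSearch rather than by `matches`.

```python
from typing import Dict, List, Optional

# Child -> parent map for the example hierarchy; root tags map to None.
PARENTS: Dict[str, Optional[str]] = {
    "Mammal": None,
    "Black bear": "Mammal",
    "Human": "Mammal",
    "Elephant": "Mammal",
}


def expand_tag(tag: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the path strings to index for `tag`, root first."""
    chain = []
    current: Optional[str] = tag
    while current is not None:
        chain.append(current)
        current = parents[current]
    chain.reverse()
    # Emit every prefix of the path: "Mammal", then "Mammal/Black bear".
    return ["/".join(chain[: i + 1]) for i in range(len(chain))]


def matches(indexed_tags: List[str], query: str) -> bool:
    """Exact match against the stored strings, as a keyword field would do."""
    return query in indexed_tags


indexed = expand_tag("Black bear", PARENTS)
print(indexed)  # ['Mammal', 'Mammal/Black bear']
```

With this in place, a query for `Mammal` matches the black bear block, while a query for `Mammal/Human` does not.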
