Library import: PDF to library.xml

Hi, I have PDF files (400+ pages) that need to be imported into a library. Is there a simple or automated way to convert them to library.xml in the proper format?

There are many ways to do that, but I wouldn’t call it “simple”.

First, you need to convert each of the PDFs to text. If it’s mostly plain text, you could write a Python script that uses a PDF library to extract the text and then converts it to OLX. If the documents have complex structure and/or images, you will likely want to use Python (or any other scripting / batch processing tool) to convert the PDF pages to images, and then write a script that uses a multi-modal LLM to convert those images to OLX. To get a good result, you’ll have to convert the first few pages manually and include them in your prompt as examples (“few-shot prompting”).
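For the plain-text case, here is a rough sketch of what such a script might look like. It assumes the third-party pypdf package for extraction, and it assumes that one inline `<html>` component per page inside a `<library>` element is acceptable OLX for your instance. The function names, the `url_name` scheme, and the attribute set are made up for illustration, so compare the output against a library you exported yourself before relying on it.

```python
import xml.etree.ElementTree as ET


def extract_pages(pdf_path):
    """Return the text of each page, using the third-party pypdf package."""
    from pypdf import PdfReader  # pip install pypdf; imported lazily on purpose

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for scanned or image-only pages
    return [page.extract_text() or "" for page in reader.pages]


def pages_to_olx(pages, display_name, org, library):
    """Build a <library> element with one inline <html> component per page.

    The attribute names here are an assumption; check them against an export.
    """
    root = ET.Element(
        "library",
        {"display_name": display_name, "org": org, "library": library},
    )
    for number, text in enumerate(pages, start=1):
        block = ET.SubElement(
            root,
            "html",
            {"display_name": f"Page {number}", "url_name": f"page_{number:04d}"},
        )
        # Treat blank-line-separated chunks as paragraphs; ElementTree
        # escapes characters such as & and < when serializing.
        for chunk in text.split("\n\n"):
            chunk = " ".join(chunk.split())
            if chunk:
                ET.SubElement(block, "p").text = chunk
    return ET.tostring(root, encoding="unicode")
```

You would call it as `pages_to_olx(extract_pages("handbook.pdf"), "Handbook", "MyOrg", "HB1")`, where the file name and IDs are placeholders. Extracted text often has no reliable paragraph breaks, so expect to tune the splitting logic for your documents.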

Then, you need to import that OLX into your Open edX instance. For legacy libraries, you can create a .tar.gz file in the correct format, using an exported library as an example. Legacy libraries don’t support static asset files like images, though. For the newest (Sumac) version of Open edX, the new libraries feature doesn’t yet have import/export support, but it does have a REST API and/or Python API that you can use to import each component along with its associated asset files, such as images.
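For the legacy route, packaging the OLX can be done with Python's standard tarfile module. This is only a sketch: the top-level `library/` folder and the single `library.xml` file are my assumption about the layout, and `write_library_archive` is a hypothetical helper name. Mirror whatever structure you see in a library exported from your own instance.

```python
import io
import tarfile


def write_library_archive(olx, out_path, root_dir="library"):
    """Write an OLX string as root_dir/library.xml inside a .tar.gz file.

    root_dir is an assumed layout; match it to an exported library.
    """
    data = olx.encode("utf-8")
    info = tarfile.TarInfo(name=f"{root_dir}/library.xml")
    info.size = len(data)  # must be set explicitly when adding from memory
    with tarfile.open(out_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(data))
    return out_path
```

The resulting file can then be tried with the legacy library import in Studio. If the import rejects it, diff the archive contents against a real export to see what is missing.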