Mongo: how to prune/delete orphaned course data?

i’m working on a special-purpose installation of Open edX containing several hundred courses, each of which has been re-imported dozens or even hundreds of times over the last 18 months. in some cases this has left enough orphaned data to significantly slow screen rendering in Course Management Studio.

Unsuccessful attempts include the following:

  1. manage.py cms delete_orphans. this finds and removes some ancillary items, but it misses the much larger volume of orphaned documents.

  2. exporting all courses, deleting edxapp.modulestore.definitions and edxapp.modulestore.structures, and then re-importing the courses. there is a minor bug in the course export related to the name of the course run, and i’ve found myself chasing my own tail trying to work around it.

  3. manual “Search & Destroy” from the MongoDB shell. so far, i have been unable to create sound logic that identifies only orphans. that is, my Mongo queries might also delete documents that are not orphans.

has anyone else needed to perform maintenance of this nature on Mongo? any suggestions?


Hi @lpm0073,

There is a script at https://github.com/edx/tubular/blob/master/tubular/scripts/structures.py. I am not sure, but I think it should do what you need.

I think the docstrings in that script explain it pretty well, but this is how I run it in the devstack:

git clone https://github.com/edx/tubular
pip install -e ./tubular
# Generate the plan
structures.py --connection="mongodb://edxapp:password@edx.devstack.mongo/edxapp" \
  --database-name edxapp \
  make_plan \
  -v DEBUG out.json \
  --details details.txt \
  --retain 5
# Perform the prune
structures.py --connection="mongodb://edxapp:password@edx.devstack.mongo/edxapp" \
  --database-name edxapp \
  prune out.json
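
For intuition about what a setting like `--retain 5` implies, here is a minimal Python sketch of the kind of reachability logic a safe prune needs. It is not tubular's actual code. It runs on in-memory dictionaries rather than MongoDB, and the field names `versions`, `previous_version` and `original_version` are my assumptions about the split modulestore layout, so check them against your own data first. The function names are made up for illustration.

```python
from typing import Dict, Iterable, Optional, Set


def structures_to_keep(
    active_versions: Iterable[dict],
    structures: Dict[str, dict],
    retain: int = 5,
) -> Set[str]:
    """Return ids of structures that should survive a prune (hypothetical helper)."""
    keep: Set[str] = set()
    for course in active_versions:
        # Assumed layout: "versions" maps a branch name to its head structure id.
        for head_id in course.get("versions", {}).values():
            current: Optional[str] = head_id
            # Keep the active structure plus up to `retain` of its ancestors.
            for _ in range(retain + 1):
                if current is None or current not in structures:
                    break
                keep.add(current)
                current = structures[current].get("previous_version")
            # Also keep the original version that the head points back to.
            original = structures.get(head_id, {}).get("original_version")
            if original in structures:
                keep.add(original)
    return keep


def orphaned_structures(
    active_versions: Iterable[dict],
    structures: Dict[str, dict],
    retain: int = 5,
) -> Set[str]:
    """Everything that no active branch reaches within the retain window."""
    return set(structures) - structures_to_keep(active_versions, structures, retain)


# Toy data: one course with history s0 <- s1 <- s2 <- s3 <- s4, plus a stray
# structure "x9" that no active_versions document references.
structures = {
    "s0": {"previous_version": None, "original_version": "s0"},
    "s1": {"previous_version": "s0", "original_version": "s0"},
    "s2": {"previous_version": "s1", "original_version": "s0"},
    "s3": {"previous_version": "s2", "original_version": "s0"},
    "s4": {"previous_version": "s3", "original_version": "s0"},
    "x9": {"previous_version": None, "original_version": "x9"},
}
active_versions = [{"versions": {"published-branch": "s4"}}]

print(sorted(orphaned_structures(active_versions, structures, retain=1)))
```

With `retain=1` the sketch keeps `s4`, `s3` and the original `s0`, and reports `s1`, `s2` and `x9` as prunable. Note that it only identifies candidates: once `s2` is gone, `s3` still points at it through `previous_version`, so a real tool also has to repair those links. That is one more reason to review the generated plan before running the prune.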

Hope this helps


@morenol Thank you! yes, this appears to address the problems i’m facing. much appreciated :smiley:
