Reducing Memory Usage: NLTK

An LMS web worker process uses over 300 MB of RAM once initialized. Over 10% of that comes from loading nltk, which we only use in one place: to parse chemical equation inputs in ProblemBlocks (see chemcalc.py in the openedx-chem repo).
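One low-effort mitigation, separate from removing NLTK entirely, would be deferring the heavy import until the parsing code is actually called. This is just a sketch of that pattern, not the chemcalc code; the function name is made up, and colorsys stands in for a heavy library like nltk:

```python
def parse_input(r, g, b):
    """Hypothetical example of a lazy import: the heavy dependency is
    imported on first call rather than at module import time, so worker
    processes that never exercise this code path never pay the memory cost."""
    import colorsys  # stand-in for a heavy library such as nltk
    return colorsys.rgb_to_hsv(r, g, b)
```

Python caches modules in sys.modules, so repeated calls after the first don't re-run the import machinery.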

I do not know why the grammar is specified using NLTK instead of pyparsing. It could have been to get around some limitation that pyparsing had twelve years ago, or it could simply be that the author was more familiar with NLTK and could hack the code out faster that way.

Request: Does someone have the time and interest to look into removing our dependency on NLTK by changing the parser implementation here? It would likely involve some digging, and exact backwards compatibility would be extremely important.
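For a sense of scale, the core tokenizing job is small. Here's a rough proof of concept using only the stdlib re module; a real replacement would presumably use pyparsing like the rest of the chem code, and would need to handle ions, parentheses, reaction arrows, and so on to stay exactly backwards compatible. parse_formula is a made-up name, not the chemcalc API:

```python
import re
from collections import Counter

# Hypothetical sketch, NOT the chemcalc grammar: tokenize a simple
# molecular formula like "H2SO4" into per-element counts.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    # Reject anything that isn't a run of element symbols with optional counts.
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unparseable formula: {formula!r}")
    counts = Counter()
    for element, count in _TOKEN.findall(formula):
        counts[element] += int(count or 1)  # a missing count means 1
    return dict(counts)
```

For example, parse_formula("H2SO4") yields {"H": 2, "S": 1, "O": 4}. The real parser's output format and edge-case behavior would need to be matched exactly, which is where the digging comes in.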


(The rest of this is explaining how I got this data, in case anyone else is interested in poking around our memory usage.)

I used a tool called memray for this, in particular its live reporting feature.

This is the output with nltk:

This is the output when I commented out this line, which is the only place in edx-platform where chemcalc.py is loaded (and that, in turn, is what loads nltk):

This is the relevant part from the first screenshot:

[screenshot: nltk-zoom]

The first column is total memory allocated when loading this module, including everything that module loads. So chem.chemcalc is using 10.73% of overall process memory, but the vast, vast majority of that is from the things that load when it imports nltk, which is responsible for using 10.71%.
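If you want a quick number like this without memray, the stdlib tracemalloc module can give a rough cumulative figure for what an import allocates. This is my own helper (the name is made up), and note that tracemalloc only sees Python-level allocations, so it will undercount libraries that allocate in native code:

```python
import importlib
import tracemalloc


def import_cost_bytes(module_name):
    """Rough bytes of Python objects allocated while importing module_name,
    including everything it imports transitively. This is the same
    cumulative view that the memray live screen's first column shows."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    importlib.import_module(module_name)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

This only gives a meaningful number in a fresh interpreter; a module already present in sys.modules will report close to zero, since Python won't re-import it.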

:warning: The output of this screen can be misleading. For instance, the memory usage for bulk_email.models looks shockingly high, but that’s only because it’s the first thing that loads course_groups.cohorts, which in turn loads all the courseware stuff in courseware.courses (and everything it calls). Also, this screen represents a bunch of threads smashed together; cycling through the individual threads gives a better idea of how the allocations are working in the background. For instance, pymongo maintains its own thread for MongoDB connections, and seems to use a fair amount of memory for that.

That being said, nltk seems like a pretty straightforward case.

I was just messing around with this on the weekend because I was looking into unit rendering issues, and I don’t have a nice Tutor plugin for this or anything. So, this is the hacky thing I did to try it out:

  1. Added memray==1.13.3 to the top of development.txt.
  2. Removed the Django Debug Toolbar and Middleware from devstack.py. DDT can add a lot of overhead and might distort the results.
  3. Ran tutor images build openedx-dev
  4. In my tutor fork, I altered the startup command for the LMS to be: memray run --live-remote --live-port 56857 ./manage.py lms runserver --noreload 0.0.0.0:8000
    This starts the runserver with memray doing memory tracing, and outputting its findings to port 56857 (if you don’t specify something, it’ll just randomly pick something different each time. The --noreload is important because Django does reloading by forking a new process, and if you allow that, you’re only going to get stats on the original process and not the one actually serving requests. (There is a --follow-fork feature, but it’s not compatible with --live-remote for some reason, so…)
  5. tutor dev stop lms
  6. tutor config save
  7. tutor dev start lms
  8. tutor dev exec lms /bin/bash
    Once you’ve finished this step and have a shell in the LMS container, do:
  9. memray3.11 live 56857
    This will put up the cool TUI memory viewer.

Note that since auto-reloading is disabled in these instructions, you have to manually restart the LMS if you want to test any code changes you’re making to improve memory usage.


FYI, I created an issue on the openedx-chem repo here: Remove NLTK dependency · Issue #85 · openedx/openedx-chem · GitHub

If you’re interested in taking this on, you can comment “assign me” on that ticket.