ISSUE: Comprehensive theming performance

# Performance issue with multiple themes.

Hi everyone. :smile:
Some time ago, when we were running our load tests, we found an issue in the platform around the response times when using comprehensive theming.

PROBLEM :fire:

Currently, edx-platform has a method to validate if a directory is a valid theme, that method searches within the folders that are set in COMPREHENSIVE_THEME_DIRS, but it does so recursively. Therefore, the more themes you have, the method will be called even more times which considerably increases the response time.
Adding 4 themes to the comprehensive themes directory will as much as triple the response time. Adding one single theme will have a noticeable effect.

TESTS :building_construction:

NOTE:

These tests were executed using K6 with a stage of:
"stages": [{ "target": 28, "duration": "2m" },{ "target": 28, "duration": "6m" },{ "target": 0, "duration": "2m" }]
Increase users from 0 to 28 in 2 minutes, keep the same number of users for 6 minutes more, and finally, decrease the number of users to 0 in 2 minutes.

To get the number of executions we use cProfiler.

All tests were executed in the same Kubernetes cluster.


Image without comprehensive theming

  • K6 load test
running (10m09.6s), 00/28 VUs, 602 complete and 0 interrupted iterations
http_req_duration..............: avg=553.57ms min=107.08ms med=492.97ms max=3.98s p(90)=1.02s p(95)=1.24s
  • cProfiler*
    is_theme_dir number of calls: 0

Image with tutor-indigo

  • K6 load test
running (10m15.3s), 00/28 VUs, 523 complete and 0 interrupted iterations
http_req_duration..............: avg=1.24s min=132.26ms med=1.23s max=2.95s p(90)=2s p(95)=2.17s
  • cProfiler*
    is_theme_dir number of calls: 556
ncalls tottime percall cumtime percall filename:lineno(function)
556 0.0009611 1.729e-06 0.005788 1.041e-05 helpers_dirs.py:70(is_theme_dir)

Image with multiple themes

NOTE:

The community themes that we used: cubitetech, digifab, indigo, mitxpro

  • K6 load test
running (10m11.9s), 00/28 VUs, 377 complete and 2 interrupted iterations
http_req_duration..............: avg=3.17s min=133.3ms med=3.56s max=7.38s p(90)=4.54s p(95)=4.8s
  • cProfiler*
    is_theme_dir number of calls: 1896
ncalls tottime percall cumtime percall filename:lineno(function)
1896 0.005246 2.767e-06 0.03027 1.596e-05 helpers_dirs.py:70(is_theme_dir)

CONCLUSION :interrobang:

These test results show how the response time drops considerably if you were to download several themes even if they are not in use.

SOLUTION :mag:

We are working on finding a solution to this, but we would like to know your opinions about it.
Is this something you have encountered before or even have a workaround for? is the ability to download more than one theme important to some of you?


cProfiler Reports:
indigo.prof (306.0 KB)
themes.prof (314.2 KB)
tutor.prof (301.4 KB)

4 Likes

Questions on your findings

  1. The cumtime for the slow case with themes is still only 0.03 seconds, so I wouldnā€™t expect it to have such a large impact on the overall load test times. Is it because thereā€™s disk contention when you get up to multiple concurrent requests? Or am I reading it wrongā€¦?
  2. Is this overhead for dev/debug mode, or does it happen in production configuration as well? (Both are worth optimizingā€“Iā€™m just trying to understand the problem better.)

Questions on theming use cases (for those that use them)

  1. Normally, I might be inclined to shift the burden of this to happen at startup, but it looks like the way the code is now would allow people to add new themes while the LMS is running, and that seems like it could be very valuable. Do people rely on this functionality?
  2. The code goes through this trouble to check that the things inside the theme directory really are themes. Do people put anything else in the themes directory? If so, what and why?

One you thing you might want to consider (if you havenā€™t already) is to run cProfile for a single request with the following scenarios:

  1. Uncached, no themes
  2. Cached, no themes (so just hit reload on the same page you just used)
  3. Uncached, one theme
  4. Cached, one theme
  5. Uncached, many themes
  6. Cached, many themes

Deterministic profilers like cProfile tend to add a substantial amount of overhead to requests, and that overhead isnā€™t always evenly spread out. Iā€™ve generally found it helpful to separate runs like K6 load tests that measure broad behavior of the system from detailed deterministic profiling like cProfileā€“i.e. run the load tests with cProfile entirely disabled to get the best understanding of how theming affects the response times (ā€œhow big is the problem?ā€), and then use cProfile to measure individual requests in different scenarios to dig down into the specific suspected bottlenecks (ā€œwhere could the problem be?ā€).

If you are going to run both code level profiling at the same time as a load test, you might want to look into statistical profilers, like pprofile. That gives less precise detail than cProfile, but it should give more real-world accuracy because of the low overhead.

Question about themes.prof:

get_current_theme is taking up 48% of the call time, which is madness. But It looks like itā€™s also throwing an exception, which Iā€™m guessing means that the Theme instance was never successfully created and it returns None here? Are you sure the theme was correctly set up during this test?

  1. We think that the problem is related to the disk concurrent request because each request to ā€œis_theme_dirā€ means to run the method ā€œposix.listdirā€.
  2. It is happening in production (Kubernetes), dev & local (docker-compose).

We use to think that we didnā€™t have this problem with the old infrastructure because currently, we arenā€™t having performance issues using ASG/EC2, this problem appears since we decided to use docker/Kubernetes, so maybe if we fix it, our old infrastructure can work better with fewer instances and we wonā€™t have problems in Kubernetes.

About theming use cases, maybe @Felipe or @dcoa could help me with a better answer.

And about the last part, we have already made a lot of tests without cProfile, in dev, stage, and production environments, and we found the same results, we reach a good number when we removed all the themes and just defined one.

The tests added in this post were just made to show the main problem that we found.

Hi @dave, great questions.

We do a lot of theme switching for the saas service, but we donā€™t add themes while the LMS is running. I would also move towards moving some of the burden to the server startup.

I have never seen anything but a theme be placed in a theme directory. I would bet that this comes from a time when COMPREHENSIVE_THEMES was a new thing and was largely unknown. Probably to prevent users from overreaching into directories that will be placed in the templates lookup.
Given the existing capacities that plugins already give to developers/operators, I donā€™t see why themes should have those restrictions.

Thanks, thatā€™s really interesting. Do you find that process startup times are slower as well (since itā€™s so disk-intensive)?

Okay, so maybe read in the themes directories in the ThemingConfig.ready method, load those values into a dict on that object (+ add some accessor method), and then have the existing helper functions call into that instead of hitting disk like they do now?

Any other heavy theme users have thoughts?

Thanks, @dave I created a new image without those warnings and with one theme more, these were the results.

running (10m12.7s), 00/28 VUs, 313 complete and 3 interrupted iterations

http_req_duration..............: avg=4.56s   min=111.86ms med=5.25s   max=7.57s    p(90)=6.25s    p(95)=6.5s  

multithemes.prof (307.1 KB)

Thanks for the new profiler dump. :slight_smile:

Based on my rough reading of the call order, you might want to try just slapping a @functools.lru_cache decorator over the get_theme_dirs and possibly get_themes functions. I think it will give you most of the benefit with much less risk and effort than refactoring to do this work on startup.

Good luck!

@dave I created a new image with lru_cache in both methods and itā€™s working perfectly, I tested with the same themes and configured a different theme for each site, which works perfectly.

default āœ“ [======================================] 00/28 VUs  10m0s
http_req_duration..............: avg=633.36ms min=101.13ms med=606.9ms  max=2.15s    p(90)=1.13s    p(95)=1.27s 

themes-cache.prof (236.4 KB)

2 Likes

Thatā€™s great news! Thanks so much for tracking this down and doing such detailed measurement/analysis. Please post the PR here when you have it, and please add a note about this to the Olive page after itā€™s merged. Even if itā€™s a relatively small code change, this is a substantial performance improvement for the many folks who use multiple themes, and I think it deserves getting called out.

1 Like

@dave Iā€™ve already created the PR feat: add lru_cache to improve performance with multiple themes by Alec4r Ā· Pull Request #31090 Ā· openedx/edx-platform Ā· GitHub, thanks for all your help.