Redis `used memory > maxmemory`: page load times and usability suffer dramatically

Has anyone run into this situation before, where Redis used memory exceeds maxmemory and the Open edX platform pages get really slow and unusable?

We’ve hit this issue with Redis before. At the moment we’re using the AWS ElastiCache Redis service, and we’ve reached used memory > maxmemory. Any suggestions on how to resolve this? We’re guessing that we’ll need to increase our memory or restart the Redis service to fix it.

Does anyone have documentation on best practices for controlling how long items placed in the Redis cache live (i.e., setting TTLs)?
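For context, Django lets you set a default TTL per cache via the `TIMEOUT` key in `CACHES`, and keys written with a TTL are the only ones a `volatile-*` eviction policy can remove. A minimal sketch — the backend, URL, and 3600-second value below are illustrative examples, not actual platform settings:

```python
# Sketch: give cached entries a default TTL so Redis "volatile-*" eviction
# policies have expirable keys to work with. All values here are examples.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/1",
        "TIMEOUT": 3600,  # default TTL (seconds) for entries set via this cache
    }
}

# Individual writes can still override the default, e.g.:
#   cache.set("key", "value", timeout=60)    # 60-second TTL
#   cache.set("key", "value", timeout=None)  # no TTL -> invisible to volatile-* eviction
```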

Increasing memory will be considerably more expensive for us on AWS.
We are currently at 3 GB of memory for Redis but considering going to 6 GB.
We’re using a cache.t4g.medium: 2 cores and 3.09 GiB RAM.

In the past, restarting Redis has helped clear out this memory issue; restarting the primary cluster node in particular seems to help a lot. What issues would we run into if we did that?
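One caveat worth noting: in a Tutor deployment the same Redis instance typically serves as both the Django cache and the Celery broker (on different DB numbers), so a full restart also drops any queued Celery tasks. Assuming Tutor’s defaults (broker on DB 0, cache on DB 1 — the `/1` in the `LOCATION` URL later in this thread is consistent with that), a gentler alternative to a restart is flushing only the cache DB. The hostname below is a placeholder:

```shell
# Flush only the Django cache DB (assumed DB 1, per Tutor defaults),
# leaving the Celery broker's DB 0 untouched. Host is a placeholder.
redis-cli -h my-redis-host -n 1 FLUSHDB
```

This still causes a brief cold-cache period while the application repopulates entries, but it avoids losing in-flight background tasks.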

CRITICAL/MainProcess] Unrecoverable error: ResponseError("OOM command not allowed when used memory > 'maxmemory'.")

tutor_local-cms-1        | [pid: 7|app: 0|req: 410/2073] () {58 vars in 2310 bytes} [Mon Apr 15 20:13:19 2024] GET /export_status/course-v1:REVVED+EV-ST-IEV+DEVELOPMENT => generated 23 bytes in 129 msecs (HTTP/1.1 200) 7 headers in 499 bytes (1 switches on core 0)
tutor_local-cms-worker-1 | [2024-04-15 20:13:19,819: CRITICAL/MainProcess] Unrecoverable error: ResponseError("OOM command not allowed when used memory > 'maxmemory'.")
tutor_local-cms-worker-1 | Traceback (most recent call last):
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/worker.py", line 208, in start
tutor_local-cms-worker-1 |     self.blueprint.start(self)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
tutor_local-cms-worker-1 |     step.start(parent)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/bootsteps.py", line 369, in start
tutor_local-cms-worker-1 |     return self.obj.start()
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
tutor_local-cms-worker-1 |     blueprint.start(self)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
tutor_local-cms-worker-1 |     step.start(parent)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/consumer/connection.py", line 23, in start
tutor_local-cms-worker-1 |     c.connection = c.connect()
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 407, in connect
tutor_local-cms-worker-1 |     conn.transport.register_with_event_loop(conn.connection, self.hub)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 1057, in register_with_event_loop
tutor_local-cms-worker-1 |     cycle.on_poll_init(loop.poller)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 331, in on_poll_init
tutor_local-cms-worker-1 |     return channel.qos.restore_visible(
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 196, in restore_visible
tutor_local-cms-worker-1 |     with Mutex(client, self.unacked_mutex_key,
tutor_local-cms-worker-1 |   File "/opt/pyenv/versions/3.8.12/lib/python3.8/contextlib.py", line 113, in __enter__
tutor_local-cms-worker-1 |     return next(self.gen)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 117, in Mutex
tutor_local-cms-worker-1 |     lock_acquired = lock.acquire(blocking=False)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/lock.py", line 187, in acquire
tutor_local-cms-worker-1 |     if self.do_acquire(token):
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/lock.py", line 203, in do_acquire
tutor_local-cms-worker-1 |     if self.redis.set(self.name, token, nx=True, px=timeout):
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/client.py", line 1801, in set
tutor_local-cms-worker-1 |     return self.execute_command('SET', *pieces)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/client.py", line 901, in execute_command
tutor_local-cms-worker-1 |     return self.parse_response(conn, command_name, **options)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/client.py", line 915, in parse_response
tutor_local-cms-worker-1 |     response = connection.read_response()
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/connection.py", line 756, in read_response
tutor_local-cms-worker-1 |     raise response
tutor_local-cms-worker-1 | redis.exceptions.ResponseError: OOM command not allowed when used memory > 'maxmemory'.

cc @dave

Sounds to me like you have too little memory for your system overall.
On my machines I try to give at least 16 GB, which seems to work well.
I’ve even had fresh installations or upgrades fail with as little as 12 GB…


I don’t recall seeing that exception when we had problems with Redis, but it may be my memory failing.

We also used ElastiCache and faced the problem of memory filling up. You can see a discussion on this PR: fix: allow to setup redis maxmemory by Ian2012 · Pull Request #984 · overhangio/tutor · GitHub. TL;DR: the course structure cache was not being evicted unless you added some extra configuration (this didn’t happen before because memcached behaves differently).

For elasticache what we ended up doing was to set maxmemory-policy to allkeys-lru from the default volatile-lru.
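For reference, ElastiCache doesn’t allow `CONFIG SET` at runtime; the eviction policy is changed through a custom parameter group attached to the cluster. A hedged sketch with a placeholder group name (on self-managed Redis, `redis-cli CONFIG SET maxmemory-policy allkeys-lru` accomplishes the same thing):

```shell
# Change the eviction policy via a custom parameter group.
# "my-redis-params" is a placeholder; the group must be a non-default
# group (defaults are immutable) already associated with the cluster.
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-redis-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lru"
```

The change applies to the running nodes without a restart for this particular parameter.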

@Zachary_Trabookis: I agree with @MoisesGonzalezS – edx-platform uses the cache extensively, and it should be configured to evict old entries.


@joel.edwards @MoisesGonzalezS @dave
Thanks for all your recommendations. Because we use ElastiCache Redis cluster I’ll look into what @MoisesGonzalezS mentions about maxmemory-policy.

I was reading up about maxmemory here:
Key eviction | Redis Docs

I thought at one point that we did that. We currently have this set as the following.

IIRC, those are the allowed values, in our case it looks something like this:

Yeah, ours currently indicates volatile-lru for maxmemory-policy. We’ll change this to allkeys-lru based on your recommendation which should perform the following action.

  • allkeys-lru: Keeps most recently used keys; removes least recently used (LRU) keys
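If it helps to picture the difference: `volatile-lru` only considers keys that have a TTL set, while `allkeys-lru` considers every key. Here is a toy model of the LRU part, purely illustrative (real Redis uses an approximated LRU over sampled keys, not an exact ordering):

```python
from collections import OrderedDict

class ToyLRU:
    """Toy model of allkeys-lru: when over capacity, evict the least recently used key."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.data = OrderedDict()  # ordered oldest -> newest use

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)            # mark as most recently used
        while len(self.data) > self.max_items:
            self.data.popitem(last=False)     # evict least recently used

    def get(self, key):
        self.data.move_to_end(key)            # reading a key also refreshes it
        return self.data[key]

cache = ToyLRU(max_items=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.set("c", 3)  # over capacity: "b" (least recently used) is evicted
```

Under `volatile-lru`, a key written without a TTL (like the course structure cache entries discussed above) would never appear in this eviction pool at all, which is why memory keeps climbing.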

For ElastiCache Redis are you able to set maxmemory parameter? It appears that this is a non-modifiable field.

Also, I was wondering whether your setup has Cluster mode: Enabled, and what Redis engine version you are running? It looks like Tutor is still on Redis engine 6.

This is what we currently have configured.

I was also wondering whether you’re doing something like this for the Django application, if you have a cluster configuration.

Django’s cache framework | Django documentation

    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.redis.RedisCache",
            "LOCATION": [
                "redis://127.0.0.1:6379",  # leader
                "redis://127.0.0.1:6378",  # read-replica 1
                "redis://127.0.0.1:6377",  # read-replica 2
            ],
        }
    }

It appears that with Tutor configuration we just point to one Redis cluster node.

    "default": {
      "KEY_PREFIX": "default",
      "VERSION": "1",
      "BACKEND": "django_redis.cache.RedisCache",
      "LOCATION": "redis://@hidden-redis-001.yks9fc.0001.use1.hidden:6379/1"
    }

We do not set maxmemory, only the policy. Our engine version is 6.2.6.


I updated the previous post above. Can you look at the Django configuration for CACHES and let me know if y’all use more than one LOCATION? That is, whether or not you have a cluster configuration set up.

We have several locations and use a list with the primary endpoint first and the reader endpoint second (in theory, AWS automatically balances the reader endpoint across all the replicas):


CACHES["course_structure_cache"]["LOCATION"] = __CACHE_LOCATION
CACHES["mongo_metadata_inheritance"]["LOCATION"] = __CACHE_LOCATION
CACHES["configuration"]["LOCATION"] = __CACHE_LOCATION

Thanks @MoisesGonzalezS. We’ll give this a try and we appreciate your support on this.

@MoisesGonzalezS @dave @regis
We’re considering AWS ElastiCache Serverless; however, the lowest Redis engine version available there is 7.

It looks like the Redis version was upgraded to engine 7 with this Palm Tutor update:
feat: upgrade to Palm · overhangio/tutor@b3c3c4a

Could we use Redis 7 with, say, Maple?

Would we need to update our redis Python package to accommodate this change on the platform?

I don’t think I’ve used Redis 7 on any installation, but to my understanding the API is rather stable, so the current version of django-redis should still work. I think the main factor for bumping django-redis is supporting newer versions of Django.


Thanks for mentioning that about Redis 7 API being stable.

It looks like the latest Tutor uses Redis 7.2.4:
tutor/tutor/templates/config/defaults.yml at master · overhangio/tutor

I checked versions of django-redis that get installed with Tutor Dockerfile for named releases.

Maple uses django-redis==4.12.1
Nutmeg uses django-redis==5.2.0
Palm uses django-redis==5.2.0
Quince uses django-redis==5.4.0

Looking at what the django-redis==5.4.0 requirements mention:

Our version of Maple has these requirements installed, so django-redis==5.4.0 and Redis 7 should work; we’re going to give AWS ElastiCache Serverless a try.

Python 3.8.12
Django 3.2.13
redis 3.5.3
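A quick way to confirm what’s actually installed in the virtualenv, using only the standard library (`importlib.metadata` is available on Python 3.8+):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found

# e.g. run inside the edx-platform venv:
print(installed_versions(["Django", "redis", "django-redis"]))
```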

One thing to note about AWS ElastiCache Serverless is that you cannot modify the maxmemory-policy.

Redis configuration and limits - Amazon ElastiCache for Redis

To get around that, the documentation above describes how serverless capacity settings handle out-of-memory conditions by evicting data when the maximum memory usage limit is reached.

Having read over the AWS ElastiCache Serverless capacity settings mentioned above, it seems like it would not evict the CourseStructureCache data from Redis, due to maxmemory-policy: volatile-lru and no TTL being set on those cache entries.

Therefore, we decided not to try serverless at this time, because Redis memory would keep increasing, even though serverless should be able to increase capacity and/or evict data when out of memory.

It appears the django-redis Python package doesn’t support Redis clusters:
Does not work with Redis Cluster. · Issue #606 · jazzband/django-redis

Would you be willing to share your AWS ElastiCache Redis provisioning settings? What size do y’all use in production? Is it a cluster configuration or not?

For the most part what we have is:

  • 3 replicas, 2 read, 1 write.
  • Engine 6.x
  • node type cache.m6g.large
  • cluster mode disabled