Redis `used memory > maxmemory`: page load times and usability suffer dramatically

Has anyone run into this situation before, where Redis used memory exceeds maxmemory and the Open edX platform pages get really slow and unusable?

We’ve hit this issue with Redis before. At the moment we’re using the AWS ElastiCache Redis service, and we’ve reached used memory > maxmemory. Any suggestions on how to resolve this? We’re guessing that we’ll need to increase our memory or restart the Redis service to fix it.

Does anyone have documentation on best practices for controlling how long items placed in the Redis cache live (i.e., setting TTLs)?
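For context, Django lets you set a default TTL per cache via the `TIMEOUT` key in `CACHES`, and keys written with a TTL are the only ones a `volatile-*` eviction policy can remove. A minimal sketch — the backend, URL, and 3600-second value below are illustrative examples, not actual platform settings:

```python
# Sketch: give cached entries a default TTL so Redis "volatile-*" eviction
# policies have expirable keys to work with. All values here are examples.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/1",
        "TIMEOUT": 3600,  # default TTL (seconds) for entries set via this cache
    }
}

# Individual writes can still override the default, e.g.:
#   cache.set("key", "value", timeout=60)    # 60-second TTL
#   cache.set("key", "value", timeout=None)  # no TTL -> invisible to volatile-* eviction
```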

Increasing memory will be considerably more expensive for us on AWS.
We are currently at 3 GB of memory for Redis but considering going to 6 GB.
We’re using a cache.t4g.medium: 2 cores and 3.09 GiB RAM.

In the past, restarting Redis has helped clear out this memory issue; restarting the primary cluster node in particular seems to help a lot. What issues would we run into if we did that?
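One caveat worth noting: in a Tutor deployment the same Redis instance typically serves as both the Django cache and the Celery broker (on different DB numbers), so a full restart also drops any queued Celery tasks. Assuming Tutor’s defaults (broker on DB 0, cache on DB 1 — the `/1` in the `LOCATION` URL later in this thread is consistent with that), a gentler alternative to a restart is flushing only the cache DB. The hostname below is a placeholder:

```shell
# Flush only the Django cache DB (assumed DB 1, per Tutor defaults),
# leaving the Celery broker's DB 0 untouched. Host is a placeholder.
redis-cli -h my-redis-host -n 1 FLUSHDB
```

This still causes a brief cold-cache period while the application repopulates entries, but it avoids losing in-flight background tasks.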

CRITICAL/MainProcess] Unrecoverable error: ResponseError("OOM command not allowed when used memory > 'maxmemory'.")

tutor_local-cms-1        | [pid: 7|app: 0|req: 410/2073] () {58 vars in 2310 bytes} [Mon Apr 15 20:13:19 2024] GET /export_status/course-v1:REVVED+EV-ST-IEV+DEVELOPMENT => generated 23 bytes in 129 msecs (HTTP/1.1 200) 7 headers in 499 bytes (1 switches on core 0)
tutor_local-cms-worker-1 | [2024-04-15 20:13:19,819: CRITICAL/MainProcess] Unrecoverable error: ResponseError("OOM command not allowed when used memory > 'maxmemory'.")
tutor_local-cms-worker-1 | Traceback (most recent call last):
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/worker.py", line 208, in start
tutor_local-cms-worker-1 |     self.blueprint.start(self)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
tutor_local-cms-worker-1 |     step.start(parent)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/bootsteps.py", line 369, in start
tutor_local-cms-worker-1 |     return self.obj.start()
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
tutor_local-cms-worker-1 |     blueprint.start(self)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
tutor_local-cms-worker-1 |     step.start(parent)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/consumer/connection.py", line 23, in start
tutor_local-cms-worker-1 |     c.connection = c.connect()
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 407, in connect
tutor_local-cms-worker-1 |     conn.transport.register_with_event_loop(conn.connection, self.hub)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 1057, in register_with_event_loop
tutor_local-cms-worker-1 |     cycle.on_poll_init(loop.poller)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 331, in on_poll_init
tutor_local-cms-worker-1 |     return channel.qos.restore_visible(
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 196, in restore_visible
tutor_local-cms-worker-1 |     with Mutex(client, self.unacked_mutex_key,
tutor_local-cms-worker-1 |   File "/opt/pyenv/versions/3.8.12/lib/python3.8/contextlib.py", line 113, in __enter__
tutor_local-cms-worker-1 |     return next(self.gen)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/kombu/transport/redis.py", line 117, in Mutex
tutor_local-cms-worker-1 |     lock_acquired = lock.acquire(blocking=False)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/lock.py", line 187, in acquire
tutor_local-cms-worker-1 |     if self.do_acquire(token):
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/lock.py", line 203, in do_acquire
tutor_local-cms-worker-1 |     if self.redis.set(self.name, token, nx=True, px=timeout):
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/client.py", line 1801, in set
tutor_local-cms-worker-1 |     return self.execute_command('SET', *pieces)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/client.py", line 901, in execute_command
tutor_local-cms-worker-1 |     return self.parse_response(conn, command_name, **options)
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/client.py", line 915, in parse_response
tutor_local-cms-worker-1 |     response = connection.read_response()
tutor_local-cms-worker-1 |   File "/openedx/venv/lib/python3.8/site-packages/redis/connection.py", line 756, in read_response
tutor_local-cms-worker-1 |     raise response
tutor_local-cms-worker-1 | redis.exceptions.ResponseError: OOM command not allowed when used memory > 'maxmemory'.

cc @dave

Sounds to me like you have too little memory for your system overall.
On my machines I try to give at least 16 GB, which seems to work well.
I’ve even had fresh installations or upgrades fail with as little as 12 GB…


I don’t recall seeing that exception when we had problems with Redis, but it may be my memory failing.

We also used ElastiCache and faced the problem of memory filling up. You can see a discussion on this PR: fix: allow to setup redis maxmemory by Ian2012 · Pull Request #984 · overhangio/tutor · GitHub. TL;DR: the course structure cache was not being evicted unless you added some extra configuration (this didn’t happen before because memcached behaves differently).

For elasticache what we ended up doing was to set maxmemory-policy to allkeys-lru from the default volatile-lru.
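For reference, ElastiCache doesn’t allow `CONFIG SET` at runtime; the eviction policy is changed through a custom parameter group attached to the cluster. A hedged sketch with a placeholder group name (on self-managed Redis, `redis-cli CONFIG SET maxmemory-policy allkeys-lru` accomplishes the same thing):

```shell
# Change the eviction policy via a custom parameter group.
# "my-redis-params" is a placeholder; the group must be a non-default
# group (defaults are immutable) already associated with the cluster.
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-redis-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lru"
```

The change applies to the running nodes without a restart for this particular parameter.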

@Zachary_Trabookis: I agree with @MoisesGonzalezS – edx-platform uses the cache extensively, and it should be configured to evict old entries.


@joel.edwards @MoisesGonzalezS @dave
Thanks for all your recommendations. Because we use ElastiCache Redis cluster I’ll look into what @MoisesGonzalezS mentions about maxmemory-policy.

I was reading up about maxmemory here:
Key eviction | Redis Docs

I thought at one point that we did that. We currently have this set as the following.

IIRC, those are the allowed values, in our case it looks something like this:

Yeah, ours currently indicates volatile-lru for maxmemory-policy. We’ll change this to allkeys-lru based on your recommendation which should perform the following action.

  • allkeys-lru: Keeps most recently used keys; removes least recently used (LRU) keys
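If it helps to picture the difference: `volatile-lru` only considers keys that have a TTL set, while `allkeys-lru` considers every key. Here is a toy model of the LRU part, purely illustrative (real Redis uses an approximated LRU over sampled keys, not an exact ordering):

```python
from collections import OrderedDict

class ToyLRU:
    """Toy model of allkeys-lru: when over capacity, evict the least recently used key."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.data = OrderedDict()  # ordered oldest -> newest use

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)            # mark as most recently used
        while len(self.data) > self.max_items:
            self.data.popitem(last=False)     # evict least recently used

    def get(self, key):
        self.data.move_to_end(key)            # reading a key also refreshes it
        return self.data[key]

cache = ToyLRU(max_items=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.set("c", 3)  # over capacity: "b" (least recently used) is evicted
```

Under `volatile-lru`, a key written without a TTL (like the course structure cache entries discussed above) would never appear in this eviction pool at all, which is why memory keeps climbing.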

For ElastiCache Redis are you able to set maxmemory parameter? It appears that this is a non-modifiable field.

Also, I was wondering whether your setup has Cluster mode: Enabled, and what Redis engine version you are running? It looks like Tutor is still on Redis engine 6.

This is what we currently have configured.

I was also wondering whether you’re doing something like this for the Django application, if you have a cluster configuration.

Django’s cache framework | Django documentation

    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.redis.RedisCache",
            "LOCATION": [
                "redis://127.0.0.1:6379",  # leader
                "redis://127.0.0.1:6378",  # read-replica 1
                "redis://127.0.0.1:6377",  # read-replica 2
            ],
        }
    }

It appears that with Tutor configuration we just point to one Redis cluster node.

    "default": {
      "KEY_PREFIX": "default",
      "VERSION": "1",
      "BACKEND": "django_redis.cache.RedisCache",
      "LOCATION": "redis://@hidden-redis-001.yks9fc.0001.use1.hidden:6379/1"
    }

We do not set maxmemory, only the policy. Our engine version is 6.2.6.


I updated the previous post above. Can you look at the Django configuration for CACHES and let me know if y’all use more than one LOCATION? That is, whether or not you have a cluster configuration set up.

We have several locations and use a list with the primary endpoint first and the reader endpoint second (in theory, AWS automatically balances the reader endpoint across all the replicas):


CACHES["course_structure_cache"]["LOCATION"] = __CACHE_LOCATION
CACHES["mongo_metadata_inheritance"]["LOCATION"] = __CACHE_LOCATION
CACHES["configuration"]["LOCATION"] = __CACHE_LOCATION

Thanks @MoisesGonzalezS. We’ll give this a try and we appreciate your support on this.

@MoisesGonzalezS @dave @regis
We’re considering AWS ElastiCache Serverless; however, the lowest Redis engine version available there is 7.

It looks like the Redis version was upgraded to engine 7 with this Palm Tutor update:
feat: upgrade to Palm · overhangio/tutor@b3c3c4a

Could we use Redis 7 with, say, Maple?

Would we need to update our redis Python package to accommodate this change on the platform?

I don’t think I’ve used Redis 7 on any installation, but to my understanding the API is rather stable, so the current version of django-redis should still work. I think the main factor for bumping django-redis is supporting newer versions of Django.


Thanks for mentioning that about Redis 7 API being stable.

It looks like the latest Tutor uses Redis 7.2.4:
tutor/tutor/templates/config/defaults.yml at master · overhangio/tutor

I checked versions of django-redis that get installed with Tutor Dockerfile for named releases.

Maple uses django-redis==4.12.1
Nutmeg uses django-redis==5.2.0
Palm uses django-redis==5.2.0
Quince uses django-redis==5.4.0

Looking at what the django-redis==5.4.0 requirements mention:

Our version of Maple has these requirements installed, so django-redis==5.4.0 and Redis 7 should work; we’re going to give AWS ElastiCache Serverless a try.

Python 3.8.12
Django 3.2.13
redis 3.5.3
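A quick way to confirm what’s actually installed in the virtualenv, using only the standard library (`importlib.metadata` is available on Python 3.8+):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found

# e.g. run inside the edx-platform venv:
print(installed_versions(["Django", "redis", "django-redis"]))
```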

One thing to note about AWS ElastiCache Serverless is that you cannot modify the maxmemory-policy.

Redis configuration and limits - Amazon ElastiCache for Redis

To get around that, the documentation above describes how serverless capacity settings handle out-of-memory conditions by evicting data when the maximum memory usage limit is reached.

Having read over the AWS ElastiCache Serverless capacity settings mentioned above, it seems like it would not evict the CourseStructureCache data from Redis, due to maxmemory-policy: volatile-lru and no TTL being set on those cache entries.

Therefore, we decided not to try serverless at this time, because Redis memory would keep increasing, even though serverless should be able to increase capacity and/or evict data when out of memory.

It appears the django-redis Python package doesn’t support Redis clusters:
Does not work with Redis Cluster. · Issue #606 · jazzband/django-redis

Would you be willing to share your AWS ElastiCache Redis provisioning settings? What size do y’all use in production? Is it a cluster configuration or not?

For the most part what we have is:

  • 3 replicas, 2 read, 1 write.
  • Engine 6.x
  • node type cache.m6g.large
  • cluster mode disabled