Report generation is taking too much time

Hi everyone! As I don’t know where to ask this, I’ll put it here, in development.

Our partners are generating student_state_from_block reports to get all the answers that students gave for a specific exercise. One of the main differences in these courses is that they contain a custom Python-evaluated problem we created, whose responses are extracted as well.

Currently, the extraction of these reports is taking way longer than expected and hitting the DB connection timeout we’ve set. I’m wondering whether, for response extraction purposes (to get the value of the input field), all of these Python-evaluated problems are being evaluated and adding to the processing time. Can somebody share some insights about this?

@dave is on vacation currently, but when he’s back, he might be able to give an answer!

Awesome. I’ll wait for @dave to be fully energized :slight_smile:
Thank you @sarina.

Hi @sandro! A couple of questions:

  1. What release are you running?
  2. Can you please post a screenshot of the LMS button that’s pushed, just to make sure I’m thinking of the correct report?

I suspect what’s going on is this bit:

That allows any XBlock to implement a generate_report_data() method in order to give better-formatted responses for this report. ProblemBlock implements this:
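
For reference, here’s roughly the shape of that hook as I remember it (a simplified sketch, not the actual edx-platform code; the real ProblemBlock implementation also reconstructs the capa problem so it can include question text and correctness):

```python
# Simplified sketch of the hook, from memory; not the actual implementation.
def generate_report_data(self, user_state_iterator, limit_responses=None):
    """
    Yield (username, row_dict) pairs for the problem responses report.

    `user_state_iterator` streams the stored state for every learner who
    interacted with this block, so rows can be produced without fully
    instantiating an XBlock runtime per learner.
    """
    count = 0
    for user_state in user_state_iterator:
        # user_state.state is the persisted JSON state for one learner;
        # the real code pulls the submitted answers out of it.
        answers = user_state.state.get('student_answers', {})
        for answer_id, answer in answers.items():
            if limit_responses is not None and count >= limit_responses:
                return
            count += 1
            yield user_state.username, {
                'Answer ID': answer_id,
                'Answer': answer,
            }
```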

So I think you’re right–the ProblemBlock is probably doing all that Python sandbox setup and calculation, slowing things down. It does look like the code takes some pain to avoid fully instantiating things in order to save time and memory:

It’s possible that this either wasn’t sufficient, or that there was some performance regression that happened later on that wasn’t caught. I’m afraid that’s about as far as I can get without setting up test data and doing profiling. I hope that helps narrow things down though.
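
For anyone who does want to dig in, even a quick cProfile pass around the report-building call should show where the time goes. This is just a generic sketch; `build_report_rows` is a placeholder, not the real task entry point:

```python
# Generic profiling sketch; build_report_rows is a placeholder callable,
# not the actual edx-platform task entry point.
import cProfile
import pstats


def profile_report(build_report_rows, *args, **kwargs):
    """Run the report-building callable under cProfile and print the hot spots."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = build_report_rows(*args, **kwargs)
    finally:
        profiler.disable()
    # Sort by cumulative time so sandbox setup / capa evaluation shows up near the top.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(25)
    return result
```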

Broadly speaking, I think some options are:

  1. See if there’s space to optimize this further, after inspecting it with a profiler.
  2. Offer a non-formatted version that always pulls raw state data.

That being said, any change to the CSV output would be a breaking change. We’d either need to create a new report entirely, or make it an opt-in flag, and that would likely get confusing for users.

It’s also possible that this data can be better accessed through Aspects now. On that, I defer to @TyHob and @Sara_Burns.

Hey @dave.
We’re using the following form to extract the problem responses:

We are currently running Redwood and hopefully migrating to Teak later this year.

These are the results from our analysis in the past week:

  • Every single Python-evaluated problem was sent to our codejail container for evaluation.
  • This created an issue with our only codejail container being overwhelmed with requests.
  • Due to the codejail issue, the connection to MySQL was timing out constantly, as codejail was taking too much time to complete tasks while the connection was open.

With that in mind, I have two questions for you.

First, why are these CSV report tasks not running in the lms-worker-highmem, as they tend to be quite intensive when a course has a lot of enrollments?
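
Roughly what I’m picturing is routing the report task to the high-memory queue explicitly, something like this (the task path, queue name, and exact Celery setting are just my guesses; I haven’t confirmed how the routing is actually configured):

```python
# Illustrative only: route the problem-responses report task to the
# high-memory worker queue. The task path, queue name, and even the exact
# Celery setting depend on how the deployment is configured; these are guesses.
CELERY_TASK_ROUTES = {
    'lms.djangoapps.instructor_task.tasks.calculate_problem_responses_csv': {
        'queue': 'edx.lms.core.high_mem',
    },
}
```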

Second, couldn’t we run these tasks using the read_replica?
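
For this second question, I mean the standard Django mechanism, something along these lines (the model fields and the 'read_replica' alias are from memory, just to show the idea):

```python
# Sketch only: point the heavy read at the replica instead of the primary.
# Field names are from memory, and a 'read_replica' alias must exist in DATABASES.
from lms.djangoapps.courseware.models import StudentModule


def iter_block_state(course_key, block_key):
    """Stream stored student state for one block from the read replica."""
    return (
        StudentModule.objects
        .using('read_replica')
        .filter(course_id=course_key, module_state_key=block_key)
        .iterator()
    )
```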

With these questions we are just trying to squeeze some extra performance out of the system, as we think this can be optimized a bit without breaking anything. We are interested in opening PRs if the two questions make sense to you.

For now, we have scaled the number of codejail pods to meet the demand and adjusted the timeout values to make sure it has everything it needs to evaluate the massive number of reports we have. We are currently monitoring everything.

I also like this idea, as most course staff only want to see the response a user submitted via a text input field. I don’t think we should be evaluating everything for this type of report.

I don’t know for sure, but my guess was that there’s no deep reason for it. Things tended to get moved over to the highmem workers in reaction to observed operational issues. It’s possible that this report just never rose to that level of attention on edx.org.

I’m not sure. Intuitively, we only need to read the data. But given how this kind of code has historically worked, it’s very possible that the simple act of figuring out whether or not the answer is correct will have the side-effect of rewriting state. (This was an intentional feature once upon a time, even if it’s much less useful these days.)