Export data to generate reports

Hi, im using tutor and i want to export data to generate reports, like generated certificates, users, etc…

i have already created this script:

from lms.djangoapps.certificates.models import GeneratedCertificate
from apice_registration.models import ExtendedProfile
import csv
from openedx.core.djangoapps.content.course_overviews.models import CourseOverview

report_dir = "/openedx/data/logs/"
header = ['name', 'country', 'state', 'city', 'birthyear', 'created_at',
'role', 'role_other', 'educational_stage', 'institution', 
'apice_1_cert', 'cert_date_1', 'apice_2_cert', 'cert_date_2', 'apice_3_cert', 'cert_date_3']
data = []
ex_profile = ExtendedProfile.objects.filter(user__is_staff=False)
cert_qs = GeneratedCertificate.objects.filter(status='downloadable', user__is_staff=False)
co = CourseOverview.objects.all()
course_cert = {
            'apice1': '',
            'cert_date_1': '',
            'apice2': '',
            'cert_date_2': '',
            'apice3': '',
            'cert_date_3': '',
        }
with open(report_dir + "stats.csv", 'w', encoding='UTF8') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for person in ex_profile:
        counter = 0
        for c in co:
            counter += 1
            person_cert = cert_qs.filter(user=person.user, course_id=c.id)
            if person_cert:
                course_cert['apice{}'.format(counter)] = 1
                course_cert['cert_date_{}'.format(counter)] = person_cert[0].created_date
        data = ([person.user.profile.name, person.current_country, person.current_state, person.current_city, 
                person.birthyear, person.user.date_joined, person.role, person.role_other, 
                person.educational_stage, person.institution, course_cert['apice1'], course_cert['cert_date_1'],
                course_cert['apice2'], course_cert['cert_date_2'], course_cert['apice3'],
                course_cert['cert_date_3']])
        course_cert = course_cert.fromkeys(course_cert, '')
        writer.writerow(data)

so i run it in tutor shell and it generate a csv with my data.

the problem is that sometimes it generates a csv with broken data, i think thats because some timeout issues with aws machine, cpu, or RAM. I dont know…
in django docs, it mentions StreamingHttpResponse. Its the way to generate large csv files, but i cant figure it out how to use in tutor shell.

is there any way to do this? how can i improve my script?

thanks!

If this is a script running on the server, you don’t have to worry about timeouts, and it’s not using HTTP in any way, so StreamingHttpResponse won’t help with anything. Your script generally looks fine from a quick glance. If the script runs and finishes without an explicit error, but there are issues with the data, it’s just a bug in the script or the data - it’s not an issue with timeouts nor resources.

What specifically is broken?

sometimes generates a csv with less certificates and users

Well, at the beginning of your script you can

print(f"Creating CSV for {ex_profile.count()} users")

Then at the end you can

print("Done!")

Now run it and see if it’s actually terminating in the middle (if “Done!” doesn’t get printed), and if the number of users in the CSV matches what it prints out at the beginning, and if the number of users at the beginning is the same every time.

Basically, you need to find a way to collect more information about what’s happening at each step in order to debug this issue.

thanks branden, but there are almost 500k users, it is impossible to debug this way, but I will think of a way to debug.

Why is it impossible? What I suggested only prints two lines, once at the start and once at the end. If you need per-user data to debug it or you need it to run faster, you can simply filter to a subset of users and then test and debug it with, say, 100 or 1,000 users rather than 500k.

If you have 500k users, do you also have a lot of courses? If so, one inefficiency in your script may be that it iterates over every possible course for each user, rather than just the courses they enrolled in or earned certificates in. Rather than iterating over all courses with for c in co, you should just use for person_cert in cert_qs.filter(user=person.user): to iterate over all of the certs for that user in the courses that they actually achieved a certificate.

I made a mistake, sorry!
I will test this method you mentioned above.
I will also review the way the csv is mounted. There are 3 courses, and they have been users since 2015, so there are many.

thank you!

Your script is performing the above request n * m times, where n is the number of users and m the number of courses. It is even performing twice that numbe: once in if person_cert and once in person_cert[0].created_date.
With 500000 users and three courses, it means that you are performing ~3M SQL requests. This will take a long time, even on a powerful server.

This is called the “N+1 query” problem. See this StackOverflow question for reference: database - What is the "N+1 selects problem" in ORM (Object-Relational Mapping)? - Stack Overflow

You should resolve this issue by performing a left join: SQL LEFT JOIN Keyword How this is implemented in Django is left as an exercise to the reader :wink:

2 Likes