HELP: Proper way to run ImportEnrollmentsIntoMysql task in Insights

Hello all,

I am running the enrollment task like this and getting accurate data:
remote-task --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait ImportEnrollmentsIntoMysql --local-scheduler --interval 2019-05-20-$TODAY --overwrite-n-days 0 --n-reduce-tasks 8 --verbose --overwrite-hive --overwrite-mysql

My problem is with the interval parameter.
If I pass --interval $YESTERDAY-$TODAY, I do not get accumulated enrollment at
http:///courses/course-v1:edxDemo+edxDemo+edxDemo/enrollment/activity/

But if I pass --interval 2019-05-20-$TODAY, I do get accumulated enrollment. For this, I need to give Hadoop the data from the start date every time this task is executed.
I am passing the beginning of time as the start date because it is mentioned here:
The interval here, should be the beginning of time essentially. It computes enrollment by observing state changes from the beginning of time.

This does not look like a scalable solution. Is this the right way, or is there another way to get accumulated enrollment?

Whatever the end date is, the report is calculated up to the day before it. Also, your overwrite-n-days value is 0, which might be the reason for the improper data.

It won’t recompute all the events from the initial date; the task only calculates events for dates whose files don’t exist in Hadoop.

Every time you run its history task or the ImportEnrollment task itself, files are written to Hadoop marking its completion by date. Say you first ran the task for a date range of 6 months: that first iteration will take a bit of time. If you then run the task for 6 months + 15 days, it will only calculate events for those 15 days, as the events for the earlier 6 months are already stored in HDFS by date.
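The "only pending dates get computed" idea described above can be pictured with a toy sketch. Everything here is hypothetical: `pending_dates`, `ymd`, and the one-marker-file-per-date layout are invented for illustration and are not the pipeline's actual storage layout. It assumes GNU `date`.

```shell
# Hypothetical helper: strip hyphens so dates compare as integers (2019-05-20 -> 20190520).
ymd() { printf '%s' "$1" | tr -d '-'; }

# Hypothetical illustration: print every date in [start, end) that has no
# marker file in marker_dir. The marker layout is an assumption of this sketch.
pending_dates() (
  start="$1"; end="$2"; marker_dir="$3"
  day="$start"
  while [ "$(ymd "$day")" -lt "$(ymd "$end")" ]; do
    [ -e "$marker_dir/$day" ] || echo "$day"
    day="$(date -u -d "$day 1 day" +%Y-%m-%d)"   # GNU date: step one day forward
  done
)

# Demo: two days already "done", so only the remaining two are listed.
demo_dir="$(mktemp -d)"
touch "$demo_dir/2019-05-20" "$demo_dir/2019-05-21"
pending_dates 2019-05-20 2019-05-24 "$demo_dir"
rm -rf "$demo_dir"
```

With the two marker files in place, the demo lists 2019-05-22 and 2019-05-23, which mirrors the "6 months + 15 days" example: only the dates without stored output are left to process.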

This here is also something that should only be used if you are rewriting your data completely every time, or if you messed up your data and want to correct it.
If you just want to append to your data, there’s no need to use it, and there’s also no need to use the full interval.

I run my ImportEnrollmentsIntoMysql with 7 days, and every time the interval is last-week-date-current-date. It works like a charm and very efficiently, but I do run its history task as well. You can either use its history task or set the overwrite-n-days value to what you need.
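A rolling "last week to today" interval like the one described above could be built along these lines. This is only a sketch: `rolling_interval` is a made-up helper name, it assumes GNU `date`, it uses UTC dates, and the command is echoed rather than executed. Only flags that already appear in this thread are used.

```shell
# Hypothetical helper: print "<start>-<end>", where start is N days before end.
# Assumes GNU date (-d); end defaults to today's date in UTC.
rolling_interval() (
  days="$1"
  end="${2:-$(date -u +%Y-%m-%d)}"
  start="$(date -u -d "$end $days days ago" +%Y-%m-%d)"
  printf '%s-%s\n' "$start" "$end"
)

INTERVAL="$(rolling_interval 7)"

# Echo only, so the sketch is safe to run; remove "echo" to launch the task.
echo remote-task ImportEnrollmentsIntoMysql --local-scheduler \
  --interval "$INTERVAL" --n-reduce-tasks 8
```

For a fixed end date, `rolling_interval 7 2019-05-27` prints `2019-05-20-2019-05-27`.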


Hi @chintan ,

First Run

The result of this was accurate, and the data was accumulated in Enrollments.

Second Run

This gave inaccurate data, and duplication too.

So I repeated the First Run and changed the Second Run as below:

In the above, I provided the data for 25-05-2019 and 26-05-2019 to Hadoop and deleted the previous files in the data directory.

But no luck.

Please check the attached screenshots of the First Run and the Second Run for reference.

This might be the reason for the duplication: overwrite-n-days 3 means that it goes 3 days back, so the data for the 25th, 24th and 23rd will be duplicated.
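To make the "3 days back" arithmetic concrete, here is a small sketch that lists the days such a window would cover. It follows the description given in this thread (the report stops the day before the end date) and has not been checked against the pipeline code. `overwrite_window` is a hypothetical helper, the end date 2019-05-26 is assumed for illustration, and GNU `date` is required.

```shell
# Hypothetical helper: print the N days that end the day before the interval's
# end date, newest first. Assumes GNU date (-d).
overwrite_window() (
  end="$1"; n="$2"
  i=1
  while [ "$i" -le "$n" ]; do
    date -u -d "$end $i days ago" +%Y-%m-%d
    i=$((i + 1))
  done
)

# Assumed end date for illustration; prints 2019-05-25, 2019-05-24, 2019-05-23.
overwrite_window 2019-05-26 3
```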

Also try the following and see if you get results.

  • First Run works well so no issues there
  • Second Run
    • CourseEnrollment: interval <last-date-of-first-run>-<date-you-want-to-cover>
    • ImportEnrollment: interval <first-date-of-first-run>-<date-you-want-to-cover> and overwrite-n-days 0

In this case it should not recalculate all the dates; it should only process the pending dates.

Apart from that, I don’t have any other solutions.

If I remember correctly (from trying this myself), you should run CourseEnrollmentEventsTask just once as a bootstrap, and after that run only the ImportEnrollmentsIntoMysql task for the incremental updates. This should solve the duplication.

Best,

Felipe.

Thanks @chintan and @felipe.espinoza.r for your reply.

I have tried both solutions you provided but no luck thus far.

Will update here if I get anything.

@jramnai I’m having this problem with duplicated data too! Have deployed a few analytics sites before, and have never run into this issue, but now I am, and I’m stumped.

I have run CourseEnrollmentEventsTask to bootstrap the historical data as specified in the docs. Each day, I run the daily enrollment task:

FROM_DATE='2020-10-01'
TO_DATE=`date +%Y-%m-%d`
analytics-configuration/automation/run-automated-task.sh ImportEnrollmentsIntoMysql --local-scheduler \
  --interval "$FROM_DATE-$TO_DATE" \
  --n-reduce-tasks $NUM_REDUCE_TASKS

from override.cfg:

[enrollments]
interval_start = 2015-01-01
blacklist_date = 2001-01-01
blacklist_path = s3://client-edxanalytics/enrollments-blacklist/  # this file is empty
overwrite_n_days = 3

This is what I’ve always done, but in this case, each day’s run adds the full enrollment count to the total enrollments (plus the actual enrollments added yesterday). Total enrollments Day 1: 306, Day 2: 616, Day 3: 922, …

I tried using --interval $(date +%Y-%m-%d -d "yesterday")-$(date +%Y-%m-%d -d "today") instead, but that didn’t help.

Did you ever work it out?