I'm running the enrollment task like this and getting accurate data:
remote-task --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait ImportEnrollmentsIntoMysql --local-scheduler --interval 2019-05-20-$TODAY --overwrite-n-days 0 --n-reduce-tasks 8 --verbose --overwrite-hive --overwrite-mysql
My problem is with the --interval parameter.
If I pass --interval $YESTERDAY-$TODAY, I do not get the accumulated enrollment at
http:///courses/course-v1:edxDemo+edxDemo+edxDemo/enrollment/activity/
But if I pass it like this: --interval 2019-05-20-$TODAY, I do get the accumulated enrollment, and for that I need to feed Hadoop data from the start date every time this task is executed.
I am passing the start date as the beginning of time because it is mentioned here: "The interval here, should be the beginning of time essentially. It computes enrollment by observing state changes from the beginning of time."
This does not look like a scalable solution, so is this the right way, or is there another way by which I can get the accumulated enrollment?
Whatever the end date is, the report will be calculated by subtracting a day from it. Also, your value of overwrite-n-days is 0, so that might be the reason for the improper data.
It won't compute all the events from the initial date; the task will only calculate events for those dates whose files don't already exist in Hadoop.
Every time you run its history task, or the ImportEnrollmentsIntoMysql task itself, files are written to Hadoop marking its completion by date. So say you first run the task for a date range of six months: that first iteration will take a while. If you then run the task for six months + 15 days, it will only calculate events for those 15 days, as the events for the earlier six months will already be stored in HDFS by date.
This is also something that should only be used if you are rewriting your data completely every time, or if you have messed up your data and want to correct it.
If you just want to append to your data there's no need to use it, and there's also no need to use the full interval.
I run my ImportEnrollmentsIntoMysql with a 7-day window, so every run's interval is last week's date to the current date. It works like a charm and in a very efficient manner, but I do run its history task as well. You can either use its history task, or keep the overwrite-n-days value at what you need; a rough sketch of that kind of incremental run is below.
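For reference, a minimal sketch of such an incremental run, assuming the same remote-task wrapper and deployment details (host, user, remote-name) as the command at the top of this thread, GNU date for generating the interval, and illustrative values for the window size and overwrite-n-days:

# Hypothetical incremental run: 7-day rolling interval, small overwrite window.
# Add --overwrite-hive / --overwrite-mysql back if your deployment relies on them.
FROM_DATE=$(date -d "7 days ago" +%Y-%m-%d)
TO_DATE=$(date +%Y-%m-%d)
remote-task --host localhost --user ubuntu --remote-name analyticstack \
  --skip-setup --wait ImportEnrollmentsIntoMysql \
  --local-scheduler \
  --interval $FROM_DATE-$TO_DATE \
  --overwrite-n-days 3 \
  --n-reduce-tasks 8 \
  --verbose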
If I remember correctly (from trying this myself), you should run CourseEnrollmentEventsTask just as a bootstrap the first time, and then only run the ImportEnrollmentsIntoMysql task for the incremental updates. This should solve the duplication.
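If it helps, here is a rough sketch of that bootstrap pattern, assuming the same remote-task wrapper and flags used elsewhere in this thread; the interval start date is just an example, not a prescription:

# One-time bootstrap: process the raw enrollment events from the beginning of time.
remote-task --host localhost --user ubuntu --remote-name analyticstack \
  --skip-setup --wait CourseEnrollmentEventsTask \
  --local-scheduler \
  --interval 2015-01-01-$(date +%Y-%m-%d) \
  --n-reduce-tasks 8 \
  --verbose
# After this completes, only the incremental ImportEnrollmentsIntoMysql runs
# (short interval, small overwrite-n-days) should be needed.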
@jramnai I'm having this problem with duplicated data too! I've deployed a few analytics sites before and have never run into this issue, but now I am, and I'm stumped.
[enrollments]
interval_start = 2015-01-01
blacklist_date = 2001-01-01
blacklist_path = s3://client-edxanalytics/enrollments-blacklist/ # this file is empty
overwrite_n_days = 3
This is what I've always done, but in this case, each day's run adds the full enrollment count to the total enrollments (plus the actual enrollments added yesterday). Total enrollments: Day 1: 306, Day 2: 616, Day 3: 922, …
I tried using --interval $(date +%Y-%m-%d -d "yesterday")-$(date +%Y-%m-%d -d "today") instead, but that didn’t help.