Analytics pipeline: Failed to run task AnswerDistributionWorkflow

Hi guys,

I am working on feature Insights, I also setup analytics-pipeline by running most of the command lines within ansible successfully.

I am struggling to execute some tasks on a separate machine (Hadoop, Hive, and Sqoop). Follow the document here Tasks to Run to Update Insights — EdX Analytics Pipeline Reference Guide 1.0 documentation

The first task is Performance (Graded and Ungraded) by running command line
launch-task AnswerDistributionToMySQLTaskWorkflow --remote-log-level DEBUG --local-scheduler --include '"[\"*tracking.log*\"]"' --src '"[\"hdfs://localhost:9000/data\"]"' --dest '"[\"/tmp/answer_dist\"]"' --mapreduce-engine local --name test_task

The output seem look good

However, I recognize the error message as below
WARNING:luigi-interface:Will not run AnswerDistributionToMySQLTaskWorkflow(database=reports, credentials=/edx/etc/edx-analytics-pipeline/output.json, name=test_task, src=["[", "\"", "h", "d", "f", "s", ":", "/", "/", "l", "o", "c", "a", "l", "h", "o", "s", "t", ":", "9", "0", "0", "0", "/", "d", "a", "t", "a", "\"", "]"], dest="[\"/tmp/answer_dist\"]", include=["[", "\"", "*", "t", "r", "a", "c", "k", "i", "n", "g", ".", "l", "o", "g", "*", "\"", "]"], manifest=None, answer_metadata=None, base_input_format=None) or any dependencies due to error in deps() method: Traceback (most recent call last): File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/worker.py", line 743, in _add deps = task.deps() File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/task.py", line 630, in deps return flatten(self._requires()) File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/task.py", line 602, in _requires return flatten(self.requires()) # base impl File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/common/mysql_load.py", line 64, in requires self.required_tasks['insert_source'] = self.insert_source_task File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/insights/answer_dist.py", line 861, in insert_source_task manifest=self.manifest, File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/task_register.py", line 88, in __call__ param_values = cls.get_param_values(params, args, kwargs) File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/task.py", line 411, in get_param_values raise parameter.MissingParameterException("%s: requires the '%s' parameter to be set" % (exc_desc, param_name)) MissingParameterException: AnswerDistributionPerCourse[args=(), kwargs={'src': (u'[', u'"', u'h', u'd', u'f', u's', u':', u'/', u'/', u'l', u'o', u'c', u'a', u'l', u'h', u'o', u's', u't', u':', u'9', u'0', u'0', u'0', u'/', u'd', u'a', u't', u'a', u'"', u']'), 'name': 'test_task', 'dest': '"[\\"/tmp/answer_dist\\"]"', 'answer_metadata': None, 'n_reduce_tasks': 25, 'manifest': None, 'base_input_format': None, 'lib_jar': (), 'include': (u'[', u'"', u'*', u't', u'r', u'a', u'c', u'k', u'i', u'n', u'g', u'.', u'l', u'o', u'g', u'*', u'"', u']'), 'mapreduce_engine': 'local'}]: requires the 'remote_log_level' parameter to be set

I also added --remote-log-level into the command. But it does not work.

Can anyone please help me in this case?

Thank you,
Thin Nguyen

The above issue is fixed if I add one more param remote_log_level in the code.
However, the new issue appears

    WARNING:luigi-interface:Will not run ImportEnrollmentsIntoMysql(source=["hdfs://localhost:9000/data/logs/tracking/"], expand_interval=4 w 2 d 0 h 0 m 0 s, pattern=[".*tracking.log-([0-9]).*"], date_pattern=%Y%m%d, warehouse_path=hdfs://localhost:9000/edx-analytics-pipeline/warehouse/, date=2021-03-26, partner_short_codes=[], partner_api_urls=[], api_root_url=None, interval=2021-03-25, enable_course_catalog=False) or any dependencies due to error in complete() method:
Traceback (most recent call last):
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 401, in check_complete
    is_complete = task.complete()
  File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/util/overwrite.py", line 47, in complete
    return super(OverwriteOutputMixin, self).complete()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 827, in complete
    return all(r.complete() for r in flatten(self.requires()))
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 827, in 
    return all(r.complete() for r in flatten(self.requires()))
  File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/util/overwrite.py", line 47, in complete
    return super(OverwriteOutputMixin, self).complete()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 573, in complete
    return all(map(lambda output: output.exists(), outputs))
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 573, in 
    return all(map(lambda output: output.exists(), outputs))
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hive.py", line 438, in exists
    return self.client.table_exists(self.table, self.database, self.partition)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hive.py", line 148, in table_exi
sts
    (%s)""" % (database, table, self.partition_spec(partition)))
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hive.py", line 80, in run_hive_cmd
    return run_hive(['-e', hivecmd], check_return_code)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hive.py", line 67, in run_hive
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

The new issue just looks like edX analytics pipeline fails

Here’s your issue – looks like your source and include parameters aren’t valid JSON strings. Could you share your override.cfg?

Thanks @jill for your reply. Really appreciate!

Here is my override.cfg

[hive]
release = apache
version = 1.0
database = default
warehouse_path = hdfs://localhost:9000/edx-analytics-pipeline/warehouse/

[database-export]
database = reports
credentials = /edx/etc/edx-analytics-pipeline/output.json


[database-import]
database = openedx
credentials = /edx/etc/edx-analytics-pipeline/input.json
destination = hdfs://localhost:9000/edx-analytics-pipeline/warehouse/

[otto-database-import]
database = ecommerce
credentials = /edx/etc/edx-analytics-pipeline/input.json

[map-reduce]
engine = hadoop
marker = hdfs://localhost:9000/edx-analytics-pipeline/marker/

[event-logs]
pattern = [".*tracking.log.*"]
source = ["hdfs://localhost:9000/data/logs/tracking/"]
expand_interval = 30 days

For the first error, I fixed it by adding remote_log_level into the code of analytics-pipeline. Then it works well.
For the second error, because I have set up EMR and ansible in the same machine. However, the command line launch-task requires running on a wapper of hive CLI. So I executed it on hadoop user (that contains hive, hadoop, sqoop services). It works well right now.

I struggled it the whole week, lol. I will update any issue that I am facing with insights here.

Thank you!

The new issue if I execute this command

launch-task AnswerDistributionToMySQLTaskWorkflow --local-scheduler --remote-log-level DEBUG --include '"[\"*tracking.log*\"]"' --src '"[\"hdfs://localhost:9000/data/logs/tracking/\"]"' --dest /tmp/answer_dist --mapreduce-engine local --name test_task
ERROR: [pid 23776] Worker Worker(salt=065418045, workers=1, host=ubuntu-pc, username=hadoop, pid=23776) failed    ProblemCheckEvent(name=test_task, src=["[", "\"", "h", "d", "f", "s", ":", "/", "/", "l", "o", "c", "a", "l", "h", "o", "s", "t", ":", "9", "0", "0", "0", "/", "d", "a", "t", "a", "/", "l", "o", "g", "s", "/", "t", "r", "a", "c", "k", "i", "n", "g", "/", "\"", "]"], dest=/tmp/answer_dist, include=["[", "\"", "*", "t", "r", "a", "c", "k", "i", "n", "g", ".", "l", "o", "g", "*", "\"", "]"], manifest=None)
Traceback (most recent call last):
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps
    task_gen = self.task.run()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hadoop.py", line 740, in run
    self.job_runner().run_job(self)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hadoop.py", line 642, in run_job
    for i in luigi.task.flatten(job.input_hadoop()):
  File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/common/mapreduce.py", line 134, in input_hadoop
    return convert_to_manifest_input_if_necessary(self.manifest_id, super(MapReduceJobTask, self).input_hadoop())
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hadoop.py", line 755, in input_hadoop
    return luigi.task.getpaths(self.requires_hadoop())
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 844, in getpaths
    return struct.output()
  File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/common/pathutil.py", line 104, in output
    return [task.output() for task in self.requires()]
  File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/common/pathutil.py", line 78, in generate_file_list
    yield ExternalURL(filepath)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task_register.py", line 99, in __call__
    h[k] = instantiate()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task_register.py", line 80, in instantiate
    return super(Register, cls).__call__(*args, **kwargs)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 445, in __init__
    self.task_id = task_id_str(self.get_task_family(), self.to_str_params(only_significant=True, only_public=True))
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/task.py", line 496, in to_str_params
    params_str[param_name] = params[param_name].serialize(param_value)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/parameter.py", line 278, in serialize
    return str(x)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0151' in position 63: ordinal not in range(128)

Looks like you’ve got some non-ascii in a config file somewhere? Could be in your override.cfg (though I can’t see it in what you pasted), or one the files it references, like your input.json or output.json?

Hi @jill ,
You’re right. There was a certificate file in my machine with the pattern u0151.
What I did is update the code to yield ExternalURL(filepath.encode('utf8')) instead of yield ExternalURL(filepath) in file edx-analytics-pipeline/edx/analytics/tasks/common/pathutil.py

Thanks for your suggestion

Hi guys,

I tried to list all command line to be executed. This one is an example

launch-task CourseEnrollmentEventsTask --local-scheduler \
    --interval 2021-01-01-2021-04-30 \
    --n-reduce-tasks 4 \
    --remote-log-level DEBUG

The error log looks like:

ERROR: [pid 9851] Worker Worker(salt=717026547, workers=1, host=ubuntu-pc, username=hadoop, pid=9851) failed    CourseEnrollmentEventsTask(source=["hdfs://localhost:9000/data/logs/tracking/"], interval=2021-01-01-2021-04-30, expand_interval=0 w 2 d 0 h 0 m 0 s, pattern=[".*tracking.log.*"], date_pattern=%Y%m%d, warehouse_path=hdfs://localhost:9000/edx-analytics-pipeline/warehouse/)
Traceback (most recent call last):
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps
    task_gen = self.task.run()
  File "/var/lib/analytics-tasks/analyticstack/edx-analytics-pipeline/edx/analytics/tasks/insights/enrollments.py", line 152, in run
    super(CourseEnrollmentEventsTask, self).run()
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hadoop.py", line 740, in run
    self.job_runner().run_job(self)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hadoop.py", line 581, in run_job
    run_and_track_hadoop_job(arglist, tracking_url_callback=job.set_tracking_url)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hadoop.py", line 353, in run_and_track_hadoop_job
    return track_process(arglist, tracking_url_callback, env)
  File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/luigi/contrib/hadoop.py", line 337, in track_process
    raise HadoopJobError(message + 'Also, no tracking url found.', out, err)
HadoopJobError: Streaming job failed with exit code 1. Also, no tracking url found.


1. override.cfg:

[event-logs]
pattern = [".*tracking.log.*"]
source = ["hdfs://localhost:9000/data/logs/tracking/"]
expand_interval = 2 days


2. tracking.log in  hdfs dfs -ls data/logs/tracking

21/03/29 11:13:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup   75269560 2021-03-26 18:10 data/logs/tracking/tracking.log


I believe this issue is a kind of configuration issue in tracking path, but I'm not sure how to correct it. Seem they're correct right now

There are 2 different tracking directories

  1. One for hdfs://localhost:9000/data/logs/tracking/, it will be located at hdfs dfs -ls /data/logs/tracking
  2. One for user hadoop that located at /user/hadoop/data/logs/tracking by executing command hdfs dfs -ls data/logs/tracking

The first one is accurate. So cronjob was correct, and I need to create the directory as option 1.

I ran some tasks and they’re ok. The problem is most of tables are generated in Hive.
Some tables has records, some tables does not have any records such as

course_enrollment
course_enrollment_birth_year_daily
course_enrollment_daily
course_enrollment_education_level_daily
course_enrollment_gender_daily
course_enrollment_mode_daily
course_enrollment_summary
course_grade_by_mode
course_meta_summary_enrollment
grades_persistentcoursegrade

hive> select * from course_activity; OK Time taken: 0.072 seconds hive> select * from course_catalog; OK Time taken: 0.078 seconds

hive> select * from course_enrollment;
OK
Time taken: 0.072 seconds
hive> select * from course_enrollment_birth_year_daily;
OK
Time taken: 0.073 seconds
hive> select * from course_enrollment_daily;
OK
Time taken: 0.117 seconds
hive> select * from course_enrollment_education_level_daily;
OK
Time taken: 0.131 seconds
hive> select * from course_enrollment_gender_daily;
OK
Time taken: 0.093 seconds
hive> select * from course_enrollment_mode_daily;
OK
Time taken: 0.079 seconds
hive> select * from course_enrollment_summary;
OK
Time taken: 0.083 seconds
hive> select * from course_grade_by_mode;
OK
Time taken: 0.075 seconds
hive> select * from course_meta_summary_enrollment;
OK
Time taken: 0.11 seconds
hive> select * from grades_persistentcoursegrade;
OK
Time taken: 0.1 seconds
hive> select * from user_activity_by_user;
OK
Time taken: 0.085 seconds

I tried to access MySQL database reports, there was no record there. Anybody can let me know why it was empty?

Thank you

Those tables are populated by the ImportEnrollmentsIntoMysql task, which should be run daily. The CourseEnrollmentEventsTask history task doesn’t populate MySQL. But now that you’ve run CourseEnrollmentEventsTask successfully over your historical data, you should be able to schedule ImportEnrollmentsIntoMysql to run daily with overwrite_n_days=3, and it will keep your MySQL tables updated.

Thanks for your suggestion @jill , I will keep you updated on the outcome.

Hi @jill ,

I tried to run this command manually everyday

launch-task ImportEnrollmentsIntoMysql --local-scheduler     
--interval 2021-03-25-2021-03-31     
--n-reduce-tasks 24     
--remote-log-level INFO     
--overwrite-mysql     
--overwrite-hive     
--overwrite-n-days 3


The output looks like:

INFO:luigi-interface:
===== Luigi Execution Summary =====

Scheduled 38 tasks of which:
* 6 complete ones were encountered:
    - 1 CourseEnrollmentSummaryTask(...)
    - 1 CourseEnrollmentTask(...)
    - 1 CourseTableTask(warehouse_path=hdfs://localhost:9000/edx-analytics-pipeline/warehouse/)
    - 1 ExternalURL(url=/edx/etc/edx-analytics-pipeline/output.json)
    - 1 ImportAuthUserProfileTask(...)
    ...
* 32 ran successfully:
    - 1 CourseEnrollmentPartitionTask(...)
    - 1 CourseEnrollmentSummaryPartitionTask(...)
    - 1 CourseEnrollmentSummaryTableTask(warehouse_path=hdfs://localhost:9000/edx-analytics-pipeline/warehouse/)
    - 1 CourseEnrollmentTableTask(warehouse_path=hdfs://localhost:9000/edx-analytics-pipeline/warehouse/)
    - 1 CourseGradeByModeDataTask(...)
    ...

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====


My concerns are:

1. How is tracking.log correct?
    In my tracking.log file has records as `2021-03-26 10:59:51,905 INFO 904 [tracking] logger.py:50 `
    I can see level log is INFO, I also adding --remote-log-level INFO into the command
2. The interval date is a range from March 25th to Mar 31st. I also do some activities on LMS such as create courses (CMS), enroll, learn courses,... in order to make log message. However, no different result.

Do you have any suggestions?

I think there’s some confusion here? The analytics tasks read tracking logs generated by the LMS, which on production instances, end up under /edx/var/log/tracking/tracking.log. Each line of these files is a JSON object which describes a user event. See edX Research Guide: Events in the Tracking Logs for details.

Your --remote-log-level INFO is different. That sets a level of logging used by the running analytics task itself. I’m not sure what this has to do with tracking.log files… unless you’re storing the output of your analytics task runs in hdfs? If so, you don’t need to do that.

Instead, you should rotate your LMS /edx/var/log/tracking/tracking.log file into your task’s --src location, so they can be matched by the --include pattern. In production, we do this with logrotate, and transfer date-stamped tracking log files to S3 so they can be read by the day’s analytics tasks.

From your 1st question above, I think that you’re not reading the updated LMS tracking.log file with your task runs. Once you do that, you will see your new enrollments in the total.

Another tip: as per the Enrollment task description, you can use today as your end date for this task, so that it always reads the latest data. E.g.,

FROM_DATE="2021-03-25"
TO_DATE="today"
ImportEnrollmentsIntoMysql --local-scheduler \
  --interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
  --n-reduce-tasks $NUM_REDUCE_TASKS

Hi @jill ,

  1. For question 1: I already have a LMS site that put in AWS EC2. It was deployed by tutor. The tracking.log was located at $(tutor config printroot)/data/lms/logs/. There are 2 files: all.log and tracking.log. I tried to interact on LMS site, some events also was tracked in tracking.log. It looks like
2021-03-26 09:57:57,353 INFO 796 [tracking] logger.py:50 - {"username": "openedx", "event_type": "/courses/course-v1:TEST+00088+2021_00088/instructor/api/list_instructor_tasks", "ip": "113.161.78.65", "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36", "host": "dev.sandbox.co", "referer": "https://dev.sandbox.co/courses/course-v1:TEST+00088+2021_00088/instructor", "accept_language": "en-US,en;q=0.9,vi-VN;q=0.8,vi;q=0.7,fr-FR;q=0.6,fr;q=0.5", "event": "{\"POST\": {}, \"GET\": {}}", "event_source": "server", "context": {"course_user_tags": {}, "user_id": 5, "org_id": "TEST", "course_id": "course-v1:TEST+00088+2021_00088", "path": "/courses/course-v1:TEST+00088+2021_00088/instructor/api/list_instructor_tasks"}, "time": "2021-03-26T09:57:57.353746+00:00", "page": null}

I also copied that file into EMR machine at /edx/var/log/tracking. In this machine, the was a cronjob to put this raw file to hdfs.
I can put it manually by running command line hdfs dfs -put /edx/var/log/tracking/tracking.log /data/logs/tracking

  1. For logrotate, i didn’t know that, I will try to setup and run it.

Finally, I am thinking of data in tracking.log was not clean up or re-formatted so it cannot generate any record in MySQL database.

Anyway, thank you @jill , I will try tomorrow, hope to see the light.

Hi @jill ,

I tried to learn logrotation, there was a question that looks similar my question Edx/var/log/tracking/

To focus on what I am doing, I am marking you answer as solution, I will continue on this topic Edx/var/log/tracking/

Thank you so much @jill what you helped last few days.

1 Like

A bit off topic. Hey @Nguyen_Truong_Thin , I don’t know if you got it working. However, I’m using tutor and was facing the issues you were having with the reports.course_enrollment_daily not containing any entries even though the remote task was completed. Have you tried removing the lines prior to the {event} and running the task ImportEnrollment? I was able to finally get an entry in my sql db recently