Error running AnswerDistributionWorkflow

Hi All,

I recently deployed the Ironwood LMS and managed to install Insights on a different server. I am at the point of running the analytics tasks but got stuck on the first one. Can anyone help me understand this error, please?

DEBUG:edx.analytics.tasks.launchers.local:Loading override configuration 'override.cfg'...
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 133, in _get_task_kwargs
res.update(((param_name, param_obj.parse(attr)),))
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py", line 940, in parse
return list(json.loads(x, object_pairs_hook=_FrozenOrderedDict))
File "/usr/lib/python2.7/json/__init__.py", line 352, in loads
return cls(encoding=encoding, **kw).decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
ERROR:luigi-interface:Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 133, in _get_task_kwargs
res.update(((param_name, param_obj.parse(attr)),))
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py", line 940, in parse
return list(json.loads(x, object_pairs_hook=_FrozenOrderedDict))
File "/usr/lib/python2.7/json/__init__.py", line 352, in loads
return cls(encoding=encoding, **kw).decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Connection to localhost closed.
Exiting with status = 40

I cannot figure out which JSON file it is talking about. My output.json file is shown below:

{
    "username": "pipeline001",
    "host": "localhost",
    "password": "passwordxxxx",
    "port": 3306
}
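
For what it's worth, the file itself seems to be valid JSON; a quick check like the one below (assuming the path from the --credentials option) reads it without complaint:

    python -c 'import json; print json.load(open("/edx/etc/edx-analytics-pipeline/output.json"))'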

Regards,
Neville

Looks like it’s trying to read your override.cfg

If I don't specify override.cfg, as below, I still get an error:
export UNIQUE_NAME=$(date +%Y-%m-%dT%H_%M_%SZ)
remote-task AnswerDistributionWorkflow --host localhost --user ubuntu --remote-name analyticstack --wait \
--local-scheduler --verbose \
--src [hdfs://localhost:9000/data] \
--dest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/dest \
--name $UNIQUE_NAME \
--output-root hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/course \
--include ["tracking.log.gz*"] \
--manifest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/manifest.txt \
--base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
--lib-jar [hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar] \
--n-reduce-tasks 1 \
--marker hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/marker \
--credentials /edx/etc/edx-analytics-pipeline/output.json

The error code is different:
DEBUG:edx.analytics.tasks.launchers.local:Configuration file 'override.cfg' does not exist!
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 133, in _get_task_kwargs
res.update(((param_name, param_obj.parse(attr)),))
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py", line 940, in parse
return list(json.loads(x, object_pairs_hook=_FrozenOrderedDict))
File "/usr/lib/python2.7/json/__init__.py", line 352, in loads
return cls(encoding=encoding, **kw).decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
ERROR:luigi-interface:Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 133, in _get_task_kwargs
res.update(((param_name, param_obj.parse(attr)),))
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py", line 940, in parse
return list(json.loads(x, object_pairs_hook=_FrozenOrderedDict))
File "/usr/lib/python2.7/json/__init__.py", line 352, in loads
return cls(encoding=encoding, **kw).decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Connection to localhost closed.
Exiting with status = 4

Any comments about this error would help greatly. Thanks :slight_smile:

@nevilleonline I don't think it's a JSON file that's causing the error... I think it's a JSON-formatted task parameter. I can see two in your command which aren't valid JSON: the --src and --lib-jar lists.

You need to put quotes around those strings:

--src ["hdfs://localhost:9000/data"]
--lib-jar ["hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar"]

See AnswerDistributionWorkflow task documentation.
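
For context, luigi parses list parameters with json.loads (that's the parameter.py frame in your traceback), so the items inside the brackets have to be valid JSON strings. A rough illustration, assuming Python 2.7 as shown in your stack:

    python -c 'import json; print json.loads("[\"hdfs://localhost:9000/data\"]")'   # parses fine
    python -c 'import json; print json.loads("[hdfs://localhost:9000/data]")'       # raises ValueError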

Thanks Jill for your response. I tried with the quotes and it seems to give the same error.

export UNIQUE_NAME=$(date +%Y-%m-%dT%H_%M_%SZ)
remote-task AnswerDistributionWorkflow --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait \
--local-scheduler --verbose \
--src ["hdfs://localhost:9000/data"] \
--dest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/dest \
--name $UNIQUE_NAME \
--output-root hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/course \
--include ["tracking.log.gz"] \
--manifest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/manifest.txt \
--base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
--lib-jar ["hdfs://localhost:9000/edx-analytics-pipeline/site-packages/edx-analytics-hadoop-util.jar"] \
--n-reduce-tasks 1 \
--marker hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/$UNIQUE_NAME/marker \
--credentials /edx/etc/edx-analytics-pipeline/output.json

Error below

DEBUG:edx.analytics.tasks.launchers.local:Loading override configuration 'override.cfg'...
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 133, in _get_task_kwargs
res.update(((param_name, param_obj.parse(attr)),))
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py", line 940, in parse
return list(json.loads(x, object_pairs_hook=_FrozenOrderedDict))
File "/usr/lib/python2.7/json/__init__.py", line 352, in loads
return cls(encoding=encoding, **kw).decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
ERROR:luigi-interface:Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 133, in _get_task_kwargs
res.update(((param_name, param_obj.parse(attr)),))
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py", line 940, in parse
return list(json.loads(x, object_pairs_hook=_FrozenOrderedDict))
File "/usr/lib/python2.7/json/__init__.py", line 352, in loads
return cls(encoding=encoding, **kw).decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Connection to localhost closed.
Exiting with status = 40

Is there anything else you can see that I am not seeing?

The error logs don't help, and I can't see from your command what's tripping it up :frowning:

This is really hacky, but you could try modifying the luigi code from your traceback (cmdline_parser.py's _get_task_kwargs) so that it prints out the param_name when the parsing fails, letting you see which parameter is causing the issue, e.g.

    def _get_task_kwargs(self):
        """
        Get the local task arguments as a dictionary. The return value is in
        the form ``dict(my_param='my_value', ...)``
        """
        res = {}
        for (param_name, param_obj) in self._get_task_cls().get_params():
            attr = getattr(self.known_args, param_name)
            if attr:
                try:
                    res.update(((param_name, param_obj.parse(attr)),))
                except ValueError as err:
                    print("Error parsing JSON %s, value=%s" % (param_name, param_obj))
                    raise err
        return res

Apologies, Jill, for responding late. I managed to make the change in the luigi code, and something new has popped up. Hopefully it makes sense to you.

"Error parsing JSON lib_jar, value=<luigi.parameter.ListParameter object at 0x7fd87f304dd0>"

DEBUG:edx.analytics.tasks.launchers.local:Loading override configuration 'override.cfg'...
Error parsing JSON lib_jar, value=<luigi.parameter.ListParameter object at 0x7fd87f304dd0>
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 137, in _get_task_kwargs
raise err
ValueError: No JSON object could be decoded
ERROR:luigi-interface:Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 137, in _get_task_kwargs
raise err
ValueError: No JSON object could be decoded
Connection to localhost closed.

Regards,
Neville

Well that’s cool, it confirms that the issue is with parsing your --lib-jar parameter at least. Unfortunately, the value printed doesn’t tell us what’s been parsed from the command line like I’d hoped it would.
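
For what it's worth, if you need to dig further, printing the raw value in that same except block, e.g. `print("Error parsing JSON %s, raw value=%r" % (param_name, attr))`, would show exactly what luigi received from the command line.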

I can’t tell from your previous post (because Discourse insists on using “smart quotes”), but can you confirm that you’re only using plain double quotes around the strings in that list?

Thanks Jill, for your response. I have tried putting quotes like below.

--src ["hdfs://localhost:9000/data"]
--lib-jar ["hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar"]

I tried deleting and re-uploading the lib jar file in Hadoop as well, but the error is the same.

I’ve found one of our client configurations which uses the following instead:

--lib-jar '"[\"hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar\"]"'

I'm sorry it's so convoluted, but here's an explanation, from the outside in (there's a small sketch after the list, too):

  • The single quotes around the whole thing wrap the value for the shell command.
  • The double quotes around the array make the array string parseable as a JSON string.
  • The double quotes around the hdfs:// URL need to be escaped with a \ so that they are read as double quotes once the JSON string itself is parsed.
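
To see those layers unwind, here is a rough sketch (assuming bash, and that remote-task re-evaluates the argument through a second shell on the remote host; eval stands in for that second pass):

    VALUE='"[\"hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar\"]"'

    # Layer 1: the local shell strips the single quotes.
    echo "$VALUE"
    # "[\"hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar\"]"

    # Layer 2: the remote shell strips the double quotes and the \" escapes,
    # leaving valid JSON for luigi's ListParameter.
    eval echo "$VALUE"
    # ["hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar"]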

If that works, we can work with edX to get the analytics task documentation updated too.


Thanks Jill for your response. I still can't shake that error off. The error message is slightly different.

--lib-jar '"[\"hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar\"]"' \

Same error shown below.

Error parsing JSON src, value=<luigi.parameter.ListParameter object at 0x7fbac9a0cd50>
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 137, in _get_task_kwargs
raise err
ValueError: No JSON object could be decoded
ERROR:luigi-interface:Uncaught exception in luigi
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/retcodes.py", line 74, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/interface.py", line 248, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 116, in get_task_obj
return self._get_task_cls()(**self._get_task_kwargs())
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/cmdline_parser.py", line 137, in _get_task_kwargs
raise err
ValueError: No JSON object could be decoded
Connection to localhost closed.

Ooh, this is progress! It's now complaining about the --src parameter, which means that we did find the magic incantation for --lib-jar! Just have to do the same for the other list parameters.

--src '"[\"hdfs://localhost:9000/data\"]"' 

And maybe also:

--include '"[\"*tracking.log*.gz\"]"' 

Shell escaping is painful! The good news is, for all the other analytics task types, you can use an override.cfg, which has a much easier syntax, e.g. luigi_docker.cfg


Thanks Jill. It worked. I no longer get the JSON error. But I am stuck with another error
“Cannot overwrite a table with an empty result set.”

Not sure how to deal with this. Do I need to delete the previous Hadoop data and MySQL data? I need to know which tables, if you can tell me, and then I can try to re-run this task.

2020-10-09 18:48:06,588 ERROR 34358 [edx.analytics.tasks.common.mysql_load] mysql_load.py:393 - Cannot overwrite a table with an empty result set.
2020-10-09 18:48:06,591 ERROR 34358 [luigi-interface] worker.py:213 - [pid 34358] Worker Worker(salt=367386724, workers=1, host=insights.millicenttechnologies.co.in, username=hadoop, pid=34358, sudo_user=millicentr) failed AnswerDistributionToMySQLTaskWorkflow(database=reports, credentials=/edx/etc/edx-analytics-pipeline/output.json, name=2020-10-09T18_44_02Z, src=["hdfs://localhost:9000/data"], dest=hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/2020-10-09T18_44_02Z/dest, include=["tracking.log.gz"], manifest=hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/2020-10-09T18_44_02Z/manifest.txt, answer_metadata=None, base_input_format=org.edx.hadoop.input.ManifestTextInputFormat)
Traceback (most recent call last):
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/worker.py", line 194, in run
new_deps = self._run_get_new_deps()
File "/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/worker.py", line 146, in _run_get_new_deps
requires = task_gen.send(next_send)
File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/common/mysql_load.py", line 377, in run
self.insert_rows(cursor)
File "/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/common/mysql_load.py", line 332, in insert_rows
raise Exception('Cannot overwrite a table with an empty result set.')
Exception: Cannot overwrite a table with an empty result set.

That error comes from this parameter: allow_empty_insert = False

A couple of tasks make this configurable (InsertToMysqlAllVideoTask, ModuleEngagement), but unfortunately, AnswerDistributionWorkflow isn’t one of them.

To work around it, you can change this line to be allow_empty_insert = True.

Thanks Jill once again. Got it to work finally. I changed it from False to True in two files.

edx-analytics-pipeline/edx/analytics/tasks/common/mysql_load.py

allow_empty_insert = False

to

allow_empty_insert = True

/var/lib/analytics-tasks/analyticstack/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/common/mysql_load.py

allow_empty_insert = False

to

allow_empty_insert = True

Finally a good result. I am not sure if it will affect the other tasks.

===== Luigi Execution Summary =====

Scheduled 7 tasks of which:

  • 2 present dependencies were encountered:
    • 1 ExternalURL(url=/edx/etc/edx-analytics-pipeline/output.json)
    • 1 PathSetTask(…)
  • 5 ran successfully:
    • 1 AnswerDistributionOneFilePerCourseTask(…)
    • 1 AnswerDistributionPerCourse(…)
    • 1 AnswerDistributionToMySQLTaskWorkflow(…)
    • 1 AnswerDistributionWorkflow(…)
    • 1 ProblemCheckEvent(…)

This progress looks :) because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====


Hallelujah! Sorry it was such a messy journey for you @nevilleonline. Could you mark this post as Solved, so others can have hope of solving this issue too?

A tip for the other analytics tasks: it's best to create and use an override.cfg file to specify the parameters for those tasks which remain the same for every run. For example, to set allow_empty_insert for the aforementioned Video task, your override.cfg file needs a stanza like the one below. You can see the section and parameter name in the parameter's config_path, which in this case is {'section': 'videos', 'name': 'allow_empty_insert'}.

[videos]
allow_empty_insert = true

Similarly, your array parameters get easier to specify in the .cfg file, and don’t require all that gnarly escaping:

[event-logs]
pattern = [".*tracking.log-(?P<date>[0-9]+).*"]
source = ["hdfs://localhost:9000/data/logs/tracking/"]