Does anyone run their analytics pipelines outside of the default us-east-1
region? What configuration did you have to change to make this possible?
One of our clients is running the analytics pipeline in the eu-west-1 region. The tasks run in an environment with export AWS_REGION=eu-west-1 set, which seems to take care of most of the issues. We've also added this stanza to override.cfg, which should override the default s3.amazonaws.com endpoint, as the code indicates it should:
[s3]
host = s3.eu-west-1.amazonaws.com
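To spell out the precedence we're expecting, here is a minimal sketch (Python 3 and stand-in names, not the pipeline's actual s3_util code) of the behavior we think the config should produce: an [s3] host in override.cfg winning over the hard-coded default.

```python
# Sketch only: DEFAULT_S3_HOST and resolve_s3_host are hypothetical
# stand-ins for the pipeline's hard-coded default and its config lookup.
import configparser

DEFAULT_S3_HOST = "s3.amazonaws.com"  # the hard-coded default we had to edit

def resolve_s3_host(cfg_text):
    """Return the S3 endpoint, preferring an [s3] host from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("s3", "host", fallback=DEFAULT_S3_HOST)

override_cfg = "[s3]\nhost = s3.eu-west-1.amazonaws.com\n"
print(resolve_s3_host(override_cfg))  # s3.eu-west-1.amazonaws.com
print(resolve_s3_host("[s3]\n"))      # s3.amazonaws.com (the default)
```

Our debug logging below suggests the real code resolves the host the same way, which is why the need to edit the default is so confusing.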
However, the only way we can get the tasks to succeed is to modify that hard-coded default in the code to match our configuration.
We added debug logging, and it shows that this hard-coded change shouldn't be necessary:
2019-07-23 05:10:24,099 ERROR 15440 [edx.analytics.tasks.util.s3_util] s3_util.py:138 - S3 'host' not in kwargs
2019-07-23 05:10:24,099 ERROR 15440 [edx.analytics.tasks.util.s3_util] s3_util.py:139 - S3 'host' = s3.eu-west-1.amazonaws.com
2019-07-23 05:10:24,099 ERROR 15440 [edx.analytics.tasks.util.s3_util] s3_util.py:141 - S3 'host' overridden to s3.eu-west-1.amazonaws.com
However, unless the hard-coded default is modified, the tasks don’t succeed!
The error isn’t very helpful either, and I’m baffled:
2019-07-23 05:42:54,943 ERROR 15440 [luigi-interface] worker.py:213 - [pid 15440] Worker Worker(salt=457984176, workers=1, host=ip-172-31-22-131, username=hadoop, pid=15440, sudo_user=hadoop) failed CourseEnrollmentEventsTask(source=["s3://redacted-eu-tracking-logs/logs/tracking/"], interval=2019-07-20-2019-07-23, expand_interval=4 w 2 d 0 h 0 m 0 s, pattern=[".*tracking.log-(?P<date>[0-9]+).*"], date_pattern=%Y%m%d, warehouse_path=s3://redacted-eu-edxanalytics/warehouse/hive/)
Traceback (most recent call last):
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/worker.py", line 194, in run
    new_deps = self._run_get_new_deps()
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/worker.py", line 131, in _run_get_new_deps
    task_gen = self.task.run()
  File "/var/lib/analytics-tasks/automation/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/insights/enrollments.py", line 152, in run
    super(CourseEnrollmentEventsTask, self).run()
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 781, in run
    self.job_runner().run_job(self)
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 622, in run_job
    run_and_track_hadoop_job(arglist, tracking_url_callback=job.set_tracking_url)
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 390, in run_and_track_hadoop_job
    return track_process(arglist, tracking_url_callback, env)
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 380, in track_process
    (tracking_url, e), out, err)
HadoopJobError: Streaming job failed with exit code 1. Additionally, an error occurred when fetching data from http://ip-172-31-22-131.eu-west-1.compute.internal:20888/proxy/application_1563857964707_0001/: No module named mechanize
These logs are from the ImportEnrollmentsIntoMysql task.
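One thing we suspect (an assumption on our part, sketched below with stand-in functions, not luigi's actual code) is that the "No module named mechanize" message is a secondary failure: the streaming job fails for some other reason, luigi then tries to fetch failure details from the tracker page, and that fetch itself blows up, so the secondary error is what we see appended to HadoopJobError, masking the real cause.

```python
# Hypothetical model of the error-reporting path in luigi.contrib.hadoop;
# fetch_failure_details and track_process are our stand-ins, not luigi code.
def fetch_failure_details():
    # stand-in for the tracker-page scrape, which needs mechanize
    raise ImportError("No module named mechanize")

def track_process():
    try:
        raise RuntimeError("Streaming job failed with exit code 1")
    except RuntimeError as job_err:
        try:
            details = fetch_failure_details()
        except Exception as fetch_err:
            # the fetch failure replaces any real failure details
            details = "an error occurred when fetching data: %s" % fetch_err
        return "%s. Additionally, %s" % (job_err, details)

print(track_process())
```

Running this reproduces the shape of our error message, which is why we think installing mechanize into the task venv might at least surface the underlying streaming failure.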
Any advice is welcome!