Analytics pipeline failing on non-US S3 buckets

Does anyone run their analytics pipelines outside of the default us-east-1 region? What configuration did you have to change to make this possible?

One of our clients is running the analytics pipeline in the eu-west-1 region. The tasks run in an environment with export AWS_REGION=eu-west-1, which seems to take care of most of the issues. We’ve also added this stanza to override.cfg, which the code indicates should override the default s3.amazonaws.com endpoint:

[s3]
host = s3.eu-west-1.amazonaws.com
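
As I read it, that host value is what should end up being passed to boto when the pipeline opens a bucket, roughly like this (a minimal sketch using plain boto, not the actual s3_util.py code):

import boto
from boto.s3.connection import OrdinaryCallingFormat

# hypothetical stand-in for what the pipeline does once the [s3] host override is applied
conn = boto.connect_s3(
    host='s3.eu-west-1.amazonaws.com',       # the value from override.cfg
    calling_format=OrdinaryCallingFormat(),  # path-style addressing, to be safe with the regional endpoint
)
bucket = conn.get_bucket('redacted-eu-tracking-logs')
for key in bucket.list(prefix='logs/tracking/'):
    print(key.name)
    break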

However, the only way we can get the tasks to succeed is to actually modify that hard-coded default in the code to match our configuration.

We added debug logging, and it shows that modifying the hard-coded default shouldn’t be necessary:

2019-07-23 05:10:24,099 ERROR 15440 [edx.analytics.tasks.util.s3_util] s3_util.py:138 - S3 'host' not in kwargs
2019-07-23 05:10:24,099 ERROR 15440 [edx.analytics.tasks.util.s3_util] s3_util.py:139 - S3 'host' = s3.eu-west-1.amazonaws.com
2019-07-23 05:10:24,099 ERROR 15440 [edx.analytics.tasks.util.s3_util] s3_util.py:141 - S3 'host' overridden to s3.eu-west-1.amazonaws.com

However, unless the hard-coded default is modified, the tasks don’t succeed!

The error isn’t very helpful either, and I’m baffled:

2019-07-23 05:42:54,943 ERROR 15440 [luigi-interface] worker.py:213 - [pid 15440] Worker Worker(salt=457984176, workers=1, host=ip-172-31-22-131, username=hadoop, pid=15440, sudo_user=hadoop) failed    CourseEnrollmentEventsTask(source=["s3://redacted-eu-tracking-logs/logs/tracking/"], interval=2019-07-20-2019-07-23, expand_interval=4 w 2 d 0 h 0 m 0 s, pattern=[".*tracking.log-(?P<date>[0-9]+).*"], date_pattern=%Y%m%d, warehouse_path=s3://redacted-eu-edxanalytics/warehouse/hive/)
Traceback (most recent call last):
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/worker.py", line 194, in run
    new_deps = self._run_get_new_deps()
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/worker.py", line 131, in _run_get_new_deps
    task_gen = self.task.run()
  File "/var/lib/analytics-tasks/automation/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/insights/enrollments.py", line 152, in run
    super(CourseEnrollmentEventsTask, self).run()
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 781, in run
    self.job_runner().run_job(self)
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 622, in run_job
    run_and_track_hadoop_job(arglist, tracking_url_callback=job.set_tracking_url)
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 390, in run_and_track_hadoop_job
    return track_process(arglist, tracking_url_callback, env)
  File "/mnt/var/lib/analytics-tasks/automation/venv/src/luigi/luigi/contrib/hadoop.py", line 380, in track_process
    (tracking_url, e), out, err)
HadoopJobError: Streaming job failed with exit code 1. Additionally, an error occurred when fetching data from http://ip-172-31-22-131.eu-west-1.compute.internal:20888/proxy/application_1563857964707_0001/: No module named mechanize

These logs are from the ImportEnrollmentsIntoMysql task.

Any advice is welcome!

Hello Jill,

We run all our pipelines in us-west-1, and we don’t even have an [s3] stanza in the override file. In fact, I can’t find ‘host’ being set anywhere on our systems. I do set AWS_REGION=us-west-1 in jenkins_env.

We’re running Hawthorn, and with the default boto version I could not get boto to connect from the core box to S3, although I tried formats in the Jenkins .boto file that should have worked. I ended up downgrading boto to the version used in Ginkgo. I mention this because the .boto file is the only other place we set region info:

[Boto]
debug = 1
ec2_region_name = us-west-1
ec2_region_endpoint = ec2.us-west-1.amazonaws.com
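
If you want to double-check what boto actually picks up from that file, something like this from a python shell (run with BOTO_CONFIG pointing at it) should show it; just a sanity check outside the pipeline:

import boto

# boto.config is parsed at import time from /etc/boto.cfg, ~/.boto, or $BOTO_CONFIG
print(boto.config.get('Boto', 'ec2_region_name'))
print(boto.config.get('Boto', 'ec2_region_endpoint'))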

Not much help, I know.
/David

@dcdams That’s very interesting…

How do you provide your custom .boto file, and where is it installed on your servers? I scanned through edx:configuration and didn’t find a variable for it, and the pipeline boto.cfg looks hardcoded too, but it wouldn’t be hard to maintain config differences if we needed to keep them there.

Was it boto or boto3 you had to downgrade? I ask because of this note on upstream master, which indicates that my host code change will cease working once everything moves to boto3.
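
(For what it’s worth, my understanding is that boto3 drops the host override entirely and works the endpoint out from the region, so once the move happens something like this ought to be all that’s needed; an untested sketch with our bucket name:)

import boto3

# boto3 picks the region up from AWS_REGION, or it can be passed explicitly; no host override needed
s3 = boto3.client('s3', region_name='eu-west-1')
resp = s3.list_objects_v2(Bucket='redacted-eu-tracking-logs', Prefix='logs/tracking/', MaxKeys=1)
print(resp.get('KeyCount'))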

In jenkins_env:
export BOTO_CONFIG=/edx/var/jenkins/.boto

It was boto; here’s the commit:

I remember going through the process of running a pipeline task, logging in to the core node, and then doing something to test the boto connection. Unfortunately, I can’t remember exactly what, but it was likely a re-run of the command from the error that I was seeing in the logs.
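
It may have been something as simple as this from a python shell on the core node (I’m guessing from memory, so treat it as a sketch):

import boto

# uses whatever credentials and endpoint boto resolves from the environment/config
conn = boto.connect_s3()
print(conn.get_all_buckets())   # fails here if boto can't reach S3 with the resolved endpoint/credentials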

I was expecting to find a .boto file somewhere on the core node but I never did find it. That part remains a mystery for me.

Additionally, I just grep’d through our logs, and I’m seeing the ‘No module named mechanize’ error come up consistently for ModuleEngagementRosterIndexTask. I don’t know what effect this failure is having, since Engagement looks good and up to date in our Insights UI. I also see that the error came up only once for SqoopImportFromMysql, as part of GradeDistFromSqoopToMySQLWorkflow.
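
If it really is a missing dependency, I suppose the quickest check would be to see whether the python that runs the pipeline can import it at all (just a guess at a diagnostic):

import mechanize   # raises ImportError ("No module named mechanize") if it's absent from that environment
print(mechanize.__file__)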

I’m really confused now, too.
/David

The ‘No module named mechanize’ error is reported as an additional error, so it might not be the root cause of my errors?

Me too… there’s so much bootstrapping and transfer of config from Jenkins to the core/master/task instances that I’m never sure what ends up where.

Unfortunately, I tried your suggested .boto file and the BOTO_CONFIG env override, but it didn’t fix the issue. I even tried enabling the use_endpoint_heuristics setting noted in the Boto config docs; that didn’t help either.
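
For the record, this is roughly what I ended up with in the .boto file (your stanza with our region substituted in, plus the heuristics flag):

[Boto]
debug = 1
ec2_region_name = eu-west-1
ec2_region_endpoint = ec2.eu-west-1.amazonaws.com
use_endpoint_heuristics = True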

So I guess I’ll just have to carry over this code drift for now. Maybe it will be fixable when everything is updated to boto3?

Thank you for your help!