edX analytics pipeline fails

Hello everyone,

I’m trying to install the edX analytics pipeline, following the steps in this link. I’m running it on a Vagrant box (VM) that can reach another VM with edX installed.

I had a few issues that I managed to solve (failing ssh, failing Ansible playbook execution…), but now I’m blocked by a luigi task issue that I’m unable to fix:

PLAY RECAP ********************************************************************
localhost : ok=24 changed=14 unreachable=0 failed=0
Running command = [‘ssh’, ‘-tt’, ‘-o’, ‘ForwardAgent=yes’, ‘-o’, ‘StrictHostKeyChecking=no’, ‘-o’, ‘UserKnownHostsFile=/dev/null’, ‘-o’, ‘KbdInteractiveAuthentication=no’, ‘-o’, ‘PasswordAuthentication=no’, ‘-o’, ‘User=vagrant’, ‘-o’, ‘ConnectTimeout=10’, ‘localhost’, “sudo -Hu hadoop /bin/bash -c ‘cd /var/lib/analytics-tasks/analyticstack/repo && . $HOME/.bashrc && . /var/lib/analytics-tasks/analyticstack/venv/bin/activate && launch-task TotalEventsDailyTask --interval 2016 --output-root hdfs://localhost:9000/output/ --local-scheduler’”]
Warning: Permanently added ‘localhost’ (ECDSA) to the list of known hosts.
No handlers could be found for logger “luigi-interface”
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘sqoop-import = edx.analytics.tasks.common.sqoop:SqoopImportFromMysql’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘run-vertica-sql-script = edx.analytics.tasks.warehouse.run_vertica_sql_script:RunVerticaSqlScriptTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘obfuscation = edx.analytics.tasks.export.obfuscation:ObfuscatedCourseTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘enrollment_validation = edx.analytics.tasks.monitor.enrollment_validation:CourseEnrollmentValidationTask’)
INFO:luigi-interface:Loaded [‘client.cfg’]
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘problem_response = edx.analytics.tasks.insights.problem_response:LatestProblemResponseDataTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-warehouse-bigquery = edx.analytics.tasks.warehouse.load_warehouse_bigquery:LoadWarehouseBigQueryTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘push_to_vertica_lms_courseware_link_clicked = edx.analytics.tasks.warehouse.lms_courseware_link_clicked:PushToVerticaLMSCoursewareLinkClickedTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-internal-active-users = edx.analytics.tasks.warehouse.load_internal_reporting_active_users:LoadInternalReportingActiveUsersToWarehouse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘video = edx.analytics.tasks.insights.video:InsertToMysqlAllVideoTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘ed_services_report = edx.analytics.tasks.warehouse.financial.ed_services_financial_report:BuildEdServicesReportTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-internal-database = edx.analytics.tasks.warehouse.load_internal_reporting_database:ImportMysqlToVerticaTask’)
DEBUG:snowflake.connector.ssl_wrap_socket:Injecting ssl_wrap_socket_with_ocsp
DEBUG:snowflake.connector.auth:cache directory: /home/hadoop/.cache/snowflake
DEBUG:snowflake.connector.cursor:Failed to import pyarrow. No Apache Arrow result set format can be used.
DEBUG:snowflake.connector.cursor:Failed to import ArrowResult. No Apache Arrow result set format can be used.
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-insights = edx.analytics.tasks.warehouse.load_warehouse_insights:LoadInsightsTableToVertica’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘export-student-module = edx.analytics.tasks.export.database_exports:StudentModulePerCourseAfterImportWorkflow’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘calendar = edx.analytics.tasks.insights.calendar_task:CalendarTableTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘snowflake-load = edx.analytics.tasks.common.snowflake_load:SnowflakeLoadTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘affiliate_window = edx.analytics.tasks.warehouse.financial.fees:LoadFeesToWarehouse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘orders = edx.analytics.tasks.warehouse.financial.orders_import:OrderTableTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘cybersource = edx.analytics.tasks.warehouse.financial.cybersource:DailyPullFromCybersourceTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-d-user = edx.analytics.tasks.warehouse.load_internal_reporting_user:LoadInternalReportingUserToWarehouse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘location-per-course = edx.analytics.tasks.insights.location_per_course:LastCountryOfUser’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘payment_reconcile = edx.analytics.tasks.warehouse.financial.reconcile:ReconcileOrdersAndTransactionsTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-warehouse = edx.analytics.tasks.warehouse.load_warehouse:LoadWarehouseWorkflow’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘engagement = edx.analytics.tasks.insights.module_engagement:ModuleEngagementDataTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘events_obfuscation = edx.analytics.tasks.export.events_obfuscation:ObfuscateCourseEventsTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘dump-student-module = edx.analytics.tasks.export.database_exports:StudentModulePerCourseTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘export-events-by-course = edx.analytics.tasks.export.event_exports_by_course:EventExportByCourseTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-ga-permissions = edx.analytics.tasks.warehouse.load_ga_permissions:LoadGoogleAnalyticsPermissionsWorkflow’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘noop = edx.analytics.tasks.monitor.performance:ParseEventLogPerformanceTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘course_blocks = edx.analytics.tasks.insights.course_blocks:CourseBlocksApiDataTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘export-vertica-sqoop = edx.analytics.tasks.common.vertica_export:ExportVerticaTableToS3Task’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-events = edx.analytics.tasks.warehouse.load_internal_reporting_events:TrackingEventRecordDataTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-google-sheet-snowflake = edx.analytics.tasks.warehouse.load_google_sheet_to_warehouse:LoadGoogleSpreadsheetsToSnowflakeWorkflow’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-d-certificates = edx.analytics.tasks.warehouse.load_internal_reporting_certificates:LoadInternalReportingCertificatesToWarehouse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘user-activity = edx.analytics.tasks.insights.user_activity:InsertToMysqlCourseActivityTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘tags-dist = edx.analytics.tasks.insights.tags_dist:TagsDistributionPerCourse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘bigquery-load = edx.analytics.tasks.common.bigquery_load:BigQueryLoadTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘run-vertica-sql-scripts = edx.analytics.tasks.warehouse.run_vertica_sql_scripts:RunVerticaSqlScriptTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-vertica-schema-bigquery = edx.analytics.tasks.warehouse.load_vertica_schema_to_bigquery:LoadVerticaSchemaFromS3ToBigQueryTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘paypal = edx.analytics.tasks.warehouse.financial.paypal:PaypalTransactionsByDayTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘grade-dist = edx.analytics.tasks.data_api.studentmodule_dist:GradeDistFromSqoopToMySQLWorkflow’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘database-import = edx.analytics.tasks.insights.database_imports:ImportAllDatabaseTablesTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-course-catalog = edx.analytics.tasks.warehouse.load_internal_reporting_course_catalog:PullDiscoveryCoursesAPIData’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘enrollments = edx.analytics.tasks.insights.enrollments:ImportEnrollmentsIntoMysql’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘event-type-dist = edx.analytics.tasks.warehouse.event_type_dist:PushToVerticaEventTypeDistributionTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-internal-course-structure = edx.analytics.tasks.warehouse.load_internal_reporting_course_structure:LoadCourseBlockRecordToVertica’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘program_reports = edx.analytics.tasks.programs.program_reports:BuildLearnerProgramReport’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘enterprise_enrollments = edx.analytics.tasks.enterprise.enterprise_enrollments:ImportEnterpriseEnrollmentsIntoMysql’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘export-events = edx.analytics.tasks.export.event_exports:EventExportTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘financial_reports = edx.analytics.tasks.warehouse.financial.finance_reports:BuildFinancialReportsTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-warehouse-snowflake = edx.analytics.tasks.warehouse.load_warehouse_snowflake:LoadWarehouseSnowflakeTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘data_obfuscation = edx.analytics.tasks.export.data_obfuscation:ObfuscatedCourseDumpTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘course_list = edx.analytics.tasks.insights.course_list:CourseListApiDataTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-d-user-course = edx.analytics.tasks.warehouse.load_internal_reporting_user_course:LoadUserCourseSummary’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-d-country = edx.analytics.tasks.warehouse.load_internal_reporting_country:LoadInternalReportingCountryToWarehouse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-google-sheet-vertica = edx.analytics.tasks.warehouse.load_google_sheet_to_warehouse:LoadGoogleSpreadsheetsToVerticaWorkflow’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘overall_events = edx.analytics.tasks.monitor.overall_events:TotalEventsDailyTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-f-user-activity = edx.analytics.tasks.warehouse.load_internal_reporting_user_activity:LoadInternalReportingUserActivityToWarehouse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘enterprise_user = edx.analytics.tasks.enterprise.enterprise_user:ImportEnterpriseUsersIntoMysql’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘paypal-report = edx.analytics.tasks.warehouse.financial.paypal_ftpreport:LoadPayPalCaseReportToVertica’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘answer-dist = edx.analytics.tasks.insights.answer_dist:AnswerDistributionPerCourse’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘load-vertica-schema-snowflake = edx.analytics.tasks.warehouse.load_vertica_schema_to_snowflake:LoadVerticaSchemaFromS3ToSnowflakeTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘student_engagement = edx.analytics.tasks.data_api.student_engagement:StudentEngagementTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘insert-into-table = edx.analytics.tasks.common.mysql_load:MysqlInsertTask’)
DEBUG:stevedore.extension:found extension EntryPoint.parse(‘all_events_report = edx.analytics.tasks.monitor.total_events_report:TotalEventsReportWorkflow’)
DEBUG:edx.analytics.tasks.launchers.local:Loading override configuration ‘override.cfg’…
/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py:261: UserWarning: Parameter “input_format” with value “None” is not of type string.
warnings.warn(‘Parameter “{}” with value “{}” is not of type string.’.format(param_name, param_value))
/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py:261: UserWarning: Parameter “n_reduce_tasks” with value “25” is not of type string.
warnings.warn(‘Parameter “{}” with value “{}” is not of type string.’.format(param_name, param_value))
/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py:261: UserWarning: Parameter “pool” with value “None” is not of type string.
warnings.warn(‘Parameter “{}” with value “{}” is not of type string.’.format(param_name, param_value))
/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py:261: UserWarning: Parameter “effective_user” with value “None” is not of type string.
warnings.warn(‘Parameter “{}” with value “{}” is not of type string.’.format(param_name, param_value))
/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/parameter.py:261: UserWarning: Parameter “namenode_host” with value “None” is not of type string.
warnings.warn(‘Parameter “{}” with value “{}” is not of type string.’.format(param_name, param_value))
2020-06-14 15:30:15,078 WARNING 10525 [luigi-interface] worker.py:560 - Will not run TotalEventsDailyTask(source=[“hdfs://localhost:9000/data/”], interval=2016, expand_interval=0 w 2 d 0 h 0 m 0 s, pattern=[“.tracking.log.”], date_pattern=%Y%m%d, output_root=hdfs://localhost:9000/output/) or any dependencies due to error in complete() method:
Traceback (most recent call last):
File “/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/worker.py”, line 334, in check_complete
is_complete = task.complete()
File “/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/task.py”, line 548, in complete
return all(map(lambda output: output.exists(), outputs))
File “/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/task.py”, line 548, in <lambda>
return all(map(lambda output: output.exists(), outputs))
File “/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/target.py”, line 243, in exists
return self.fs.exists(path)
File “/var/lib/analytics-tasks/analyticstack/venv/src/luigi/luigi/contrib/hdfs/hadoopcli_clients.py”, line 78, in exists
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True, universal_newlines=True)
File “/usr/lib/python2.7/subprocess.py”, line 711, in __init__
errread, errwrite)
File “/usr/lib/python2.7/subprocess.py”, line 1343, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
2020-06-14 15:30:15,084 INFO 10525 [luigi-interface] worker.py:501 - Informed scheduler that task TotalEventsDailyTask__Y_m_d_0_w_2_d_0_h_0_m__2016_1ce5578cb2 has status UNKNOWN
2020-06-14 15:30:15,084 INFO 10525 [luigi-interface] interface.py:206 - Done scheduling tasks
2020-06-14 15:30:15,085 INFO 10525 [luigi-interface] worker.py:1070 - Running Worker with 1 processes
2020-06-14 15:30:15,096 INFO 10525 [luigi-interface] worker.py:401 - Worker Worker(salt=169823526, workers=1, host=vm1, username=hadoop, pid=10525, sudo_user=vagrant) was stopped. Shutting down Keep-Alive thread
2020-06-14 15:30:15,098 INFO 10525 [luigi-interface] interface.py:208 -
===== Luigi Execution Summary =====
Scheduled 1 tasks of which:
* 1 failed scheduling:
    - 1 TotalEventsDailyTask(…)
Did not run any tasks
This progress looks :( because there were tasks whose scheduling failed
===== Luigi Execution Summary =====
Connection to localhost closed.
Exiting with status = 35

I’m using Ubuntu 16 in the Vagrant VM, which means I didn’t respect the requirement that the analytics pipeline be installed on an Ubuntu 12.04 machine. Since none of the errors I had before seemed to be dependency- or package-related, I decided to stick with the more recent version of Ubuntu.
Can this luigi error be system-, package-, or dependency-related, or am I doing something wrong?

Thanks.

The same problem for me.

@nablisoft @sefirosu have you tried Figures? It might cover your analytics needs: https://www.appsembler.com/blog/figures/

Code: https://github.com/appsembler/figures

Hi @sefirosu ,

I ran into the same issue as you. Did you find a way to fix it?

Take a look

TotalEventsDailyTask(source=[“hdfs://localhost:9000/data/”], interval=2016, expand_interval=0 w 2 d 0 h 0 m 0 s, pattern=[“.tracking.log.”], date_pattern=%Y%m%d, output_root=hdfs://localhost:9000/output/)

The problem is that tracking.log is not located at /data/tracking.log in HDFS.
You can check this by running hdfs dfs -ls /data/ on the command line.
Refer to my command
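
Separately from the command mentioned above, here is a minimal sketch of the same check as a script. It rests on a few assumptions: it is a standalone Python 3 script (not part of the pipeline, which runs on Python 2.7 in the log above), the function name list_hdfs_dir is hypothetical, and the hdfs command is expected to be on the PATH of the user running it. The traceback in the first post ends in OSError: [Errno 2] raised from subprocess.Popen, which Python typically raises when the program it was asked to launch cannot be found, so the sketch first confirms that the CLI can be located before listing the directory.

```python
import shutil
import subprocess


def list_hdfs_dir(path="/data/", cli="hdfs"):
    """Return the listing of an HDFS directory, or None if it cannot be listed.

    Hypothetical helper for illustration; not part of the edX pipeline.
    """
    # Launching a missing executable fails with OSError (Errno 2),
    # so check that the CLI is on PATH before calling it.
    if shutil.which(cli) is None:
        print("'{}' was not found on PATH".format(cli))
        return None

    # Equivalent to running: hdfs dfs -ls /data/
    result = subprocess.run(
        [cli, "dfs", "-ls", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        # A non-zero exit usually means the path does not exist
        # or HDFS could not be reached; show what the CLI reported.
        print(result.stderr.strip())
        return None
    return result.stdout
```

Calling list_hdfs_dir("/data/") as the hadoop user should return the directory listing if the tracking logs are in place, and None otherwise. If it reports that the CLI was not found, the PATH of that user is probably worth checking before looking at the data itself.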

Hey, have you solved this issue?
If yes, then how?