Hi @ettayeb_mohamed! Thanks for posting your question; installing analytics is a source of frustration for a lot of people.
Not sure where you saw this, but analytics isn't installed by default with Open edX. edX has made a lot of improvements to the devstack since it moved to Docker, so developing against the analytics pipeline is now supported out of the box on the Docker devstack, but AFAIK a production deployment still requires separate deployment steps.
Yep, currently AWS is the only officially supported environment for analytics deployments, because of all the pieces required to run the analytics pipeline, which feeds data into Insights (see the architecture diagram). We at OpenCraft set up analytics on AWS a lot for clients, so we've assembled some documentation on how to do this, but beware that it's not straightforward: openedx-deployment.doc.opencraft.com, under Analytics.
However, AWS is cost-prohibitive for a lot of deployments, and people with small or medium-sized LMS user bases don't really need the massively scaled infrastructure that Open edX's AWS analytics deployment provides. There are a couple of options.
Figures
@john and Appsembler built Figures, which provides some of the data reporting available in Open edX Insights/analytics.
Since it runs in the same Python environment as the LMS, it's much easier to install, use, and contribute to.
Depending on which version of Open edX you’re running, I’d totally recommend trying it out to see if it meets your needs. They’re happy to accept pull requests too, if you find bugs or have features you want to add!
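To give you a rough idea of the footprint, installation is basically the usual "pip-install a Django app into the LMS" shape. This is only a sketch under assumptions: the package name, the native-install paths, and the settings step below are things to verify against the Figures README for your Open edX release, not instructions from the Figures docs themselves.

```bash
# Rough sketch only: verify the package name, paths and settings hooks against
# the Figures README for your Open edX release before running any of this.

# 1. Install Figures into the LMS's own virtualenv
#    (paths assume a native/Ansible install; adjust for your deployment)
sudo -u edxapp /edx/bin/pip.edxapp install figures

# 2. Enable it via the LMS extra-settings / ADDL_INSTALLED_APPS mechanism that
#    the Figures docs describe, run its migrations, then restart the LMS:
sudo /edx/bin/supervisorctl restart edxapp:
```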
OpenStack Analytics
OpenCraft is working on enhancing our Open edX deployment service (Ocim) to make it possible to run Insights and the Analytics Pipeline on a single OpenStack (OVH) instance.
The timeline for completing this isn’t yet known, so nothing has been upstreamed or properly documented yet. But I can share what we’ve done so far, and you’re welcome to use what you like. Again beware: it’s not a simple process.
Also note: we use S3 buckets for cost and authentication reasons, but you can use any HDFS-compatible locations.
- Based my configuration branch on our ironwood.2 release branch; cf. the changes made.
- Deployed using this modified playbook and these ansible variables. Replace the FIXMEs with real values:

```yaml
SANDBOX_ENABLE_CERTIFICATES: false
SANDBOX_ENABLE_ANALYTICS_API: true
SANDBOX_ENABLE_INSIGHTS: true
SANDBOX_ENABLE_PIPELINE: true
INSIGHTS_NGINX_PORT: 80

# packages required to install and run the pipeline
analytics_pipeline_debian_pkgs:
  - "mysql-server-5.6"
  - python-mysqldb
  - libpq-dev

NGINX_INSIGHTS_APP_EXTRA: |
  # Use /status instead of /heartbeat endpoint to keep Ocim provisioning happy
  rewrite ^/heartbeat$ /status;

# Allows hadoop/hdfs to write to our S3 bucket.
HADOOP_CORE_SITE_EXTRA_CONFIG:
  fs.s3.awsAccessKeyId: "{{ AWS_ACCESS_KEY_ID }}"
  fs.s3.awsSecretAccessKey: "{{ AWS_SECRET_ACCESS_KEY }}"
  fs.s3.region: us-east-1  # FIXME: should be a variable
  fs.s3.impl: "org.apache.hadoop.fs.s3native.NativeS3FileSystem"

# Use our mysql database for the hive database
HIVE_METASTORE_DATABASE_HOST: "{{ EDXAPP_MYSQL_HOST }}"
HIVE_METASTORE_DATABASE_NAME: hive
HIVE_METASTORE_DATABASE_USER: # FIXME
HIVE_METASTORE_DATABASE_PASSWORD: # FIXME
HIVE_SITE_EXTRA_CONFIG:
  datanucleus.autoCreateSchema: true
  datanucleus.autoCreateTables: true
  datanucleus.fixedDatastore: true

# EDXAPP Variables needed by config below
EDXAPP_LMS_ROOT_URL: "{{ EDXAPP_LMS_BASE_SCHEME | default('https') }}://{{ EDXAPP_LMS_BASE }}"
ANALYTICS_API_LMS_BASE_URL: "{{ EDXAPP_LMS_ROOT_URL }}"

# ANALYTICS_API Variables needed by playbooks
ANALYTICS_API_EMAIL_HOST: localhost
ANALYTICS_API_EMAIL_HOST_PASSWORD: ''
ANALYTICS_API_EMAIL_HOST_USER: ''
ANALYTICS_API_EMAIL_PORT: 25
# ANALYTICS_API_GIT_IDENTITY: '{{ COMMON_GIT_IDENTITY }}'
ANALYTICS_API_LANGUAGE_CODE: en-us
ANALYTICS_API_PIP_EXTRA_ARGS: --use-wheel --no-index --find-links=http://edx-wheelhouse.s3-website-us-east-1.amazonaws.com/Ubuntu/precise/Python-2.7
ANALYTICS_API_SERVICE_CONFIG:
  ANALYTICS_DATABASE: reports
  API_AUTH_TOKEN: # FIXME
  DATABASES: '{{ ANALYTICS_API_DATABASES }}'
  # nb: using default localhost elasticsearch
  EMAIL_PORT: '{{ ANALYTICS_API_EMAIL_PORT }}'
  LANGUAGE_CODE: en-us
  SECRET_KEY: '{{ ANALYTICS_API_SECRET_KEY }}'
  STATICFILES_DIRS: []
  STATIC_ROOT: '{{ COMMON_DATA_DIR }}/{{ analytics_api_service_name }}/staticfiles'
  TIME_ZONE: UTC
# This password must be 40 characters or fewer
ANALYTICS_API_USER_PASSWORD: # FIXME
ANALYTICS_API_USERS:
  apiuser001: '{{ ANALYTICS_API_USER_PASSWORD }}'
  dummy-api-user: # FIXME

# INSIGHTS Variables needed by playbooks
INSIGHTS_APPLICATION_NAME: "Insights {{ EDXAPP_PLATFORM_NAME }}"
INSIGHTS_BASE_URL: # FIXME
INSIGHTS_CMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_CMS_BASE }}/course
INSIGHTS_CMS_NGINX_PORT: '{{ EDXAPP_PLATFORM_NAME }}'
INSIGHTS_CSRF_COOKIE_NAME: crsftoken
# INSIGHTS_DATABASES stanza defined above
INSIGHTS_DATA_API_AUTH_TOKEN: '{{ ANALYTICS_API_USER_PASSWORD }}'
INSIGHTS_DOC_BASE: http://edx-insights.readthedocs.org/en/latest
INSIGHTS_DOC_LOAD_ERROR_URL: http://edx-insights.readthedocs.org/en/latest/Reference.html#error-conditions
INSIGHTS_FEEDBACK_EMAIL: dashboard@example.com
INSIGHTS_GUNICORN_EXTRA: ''
INSIGHTS_GUNICORN_WORKERS: '8'
INSIGHTS_LANGUAGE_COOKIE_NAME: language
INSIGHTS_LMS_BASE: https://{{ EDXAPP_LMS_BASE }}
INSIGHTS_LMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_LMS_BASE }}/courses
INSIGHTS_MKTG_BASE: 'https://{{ EDXAPP_LMS_BASE }}'
# credentials should be auto-generated, not hardcoded here.
INSIGHTS_OAUTH2_KEY: # FIXME
INSIGHTS_OAUTH2_SECRET: # FIXME
INSIGHTS_OAUTH2_URL_ROOT: https://{{ EDXAPP_LMS_BASE }}/oauth2
INSIGHTS_OPEN_SOURCE_URL: http://code.edx.org/
INSIGHTS_PLATFORM_NAME: '{{ EDXAPP_PLATFORM_NAME }}'
INSIGHTS_PRIVACY_POLICY_URL: 'https://{{ EDXAPP_LMS_BASE }}/edx-privacy-policy'
INSIGHTS_SESSION_COOKIE_NAME: sessionid
INSIGHTS_SOCIAL_AUTH_REDIRECT_IS_HTTPS: true
INSIGHTS_SUPPORT_EMAIL: support@example.com
```
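The invocation itself is just standard `ansible-playbook` usage, roughly as follows. The host, user, and file paths here are placeholders for illustration, not the values we actually use; substitute the modified playbook and a YAML file containing the variables above.

```bash
# Placeholder invocation: swap in the modified playbook linked above and a
# YAML file holding the ansible variables shown above. The inventory host,
# SSH user and paths are illustrative only.
cd configuration/playbooks

ansible-playbook -i "analytics.example.com," -u ubuntu \
    /path/to/modified-analytics-sandbox.yml \
    -e @/path/to/analytics_extra_vars.yml
```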
- Made some minor modifications to the analytics pipeline (cf. the diff), and used that branch to run the pipeline.
- Used this configuration for the pipeline, saved as `override.cfg`. Replace THINGS-IN-ALL-CAPS with real values:

```ini
[hive]
warehouse_path = s3://BUCKET-NAME-HERE/analytics/warehouse/

[database-export]
database = CLIENT-PREFIX-HERE_reports
credentials = s3://BUCKET-NAME-HERE/analytics/config/output.json

[database-import]
database = CLIENT-PREFIX-HERE_edxapp
credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json
destination = s3://BUCKET-NAME-HERE/analytics/warehouse/

[otto-database-import]
database = CLIENT-PREFIX-HERE_ecommerce
credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json

[map-reduce]
engine = hadoop
marker = s3://BUCKET-NAME-HERE/analytics/marker/
lib_jar = [
    "hdfs://localhost:9000/lib/hadoop-aws-2.7.2.jar",
    "hdfs://localhost:9000/lib/aws-java-sdk-1.7.4.jar"]

[event-logs]
pattern = [".*tracking.log-(?P<date>[0-9]+).*"]
expand_interval = 30 days
source = ["s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/"]

[event-export]
output_root = s3://BUCKET-NAME-HERE/analytics/event-export/output/
environment = simple
config = s3://BUCKET-NAME-HERE/analytics/event_export/config.yaml
gpg_key_dir = s3://BUCKET-NAME-HERE/analytics/event_export/gpg-keys/
gpg_master_key = master@key.org
required_path_text = FakeServerGroup

[event-export-course]
output_root = s3://BUCKET-NAME-HERE/analytics/event-export-by-course/output/

[manifest]
threshold = 500
input_format = org.edx.hadoop.input.ManifestTextInputFormat
lib_jar = s3://BUCKET-NAME-HERE/analytics/packages/edx-analytics-hadoop-util.jar
path = s3://BUCKET-NAME-HERE/analytics/manifest/

[user-activity]
overwrite_n_days = 10
output_root = s3://BUCKET-NAME-HERE/analytics/activity/

[answer-distribution]
valid_response_types = customresponse,choiceresponse,optionresponse,multiplechoiceresponse,numericalresponse,stringresponse,formularesponse

[enrollments]
interval_start = 2017-01-01
overwrite_n_days = 3
blacklist_date = 2001-01-01
blacklist_path = s3://BUCKET-NAME-HERE/analytics/enrollments-blacklist/

[enrollment-reports]
src = s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/
destination = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/output/
offsets = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/offsets.tsv
blacklist = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/course_blacklist.tsv
history = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/enrollment_history.tsv

[course-summary-enrollment]
# JV - course catalog is optional, and was causing CourseProgramMetadataInsertToMysqlTask errors.
# enable_course_catalog = true
enable_course_catalog = false

[financial-reports]
shoppingcart-partners = {"DEFAULT": "edx"}

[geolocation]
geolocation_data = s3://BUCKET-NAME-HERE/analytics/packages/GeoIP.dat

[location-per-course]
interval_start = 2017-01-01
overwrite_n_days = 3

[calendar]
interval = 2017-01-01-2030-01-01

[videos]
dropoff_threshold = 0.05
allow_empty_insert = true
overwrite_n_days = 3

[elasticsearch]
host = ["http://localhost:9200/"]

[module-engagement]
alias = roster_1_2
number_of_shards = 5
overwrite_n_days = 3
allow_empty_insert = true

[ccx]
enabled = false

[problem-response]
report_fields = [
    "username",
    "problem_id",
    "answer_id",
    "location",
    "question",
    "score",
    "max_score",
    "correct",
    "answer",
    "total_attempts",
    "first_attempt_date",
    "last_attempt_date"]
report_output_root = s3://BUCKET-NAME-HERE/analytics/reports/

[edx-rest-api]
# Create using:
# ./manage.py lms --settings=devstack create_oauth2_client \
#   http://localhost:9999 \  # URL does not matter
#   http://localhost:9999/complete/edx-oidc/ \
#   confidential \
#   --client_name "Analytics Pipeline" \
#   --client_id oauth_id \
#   --client_secret oauth_secret \
#   --trusted
client_id = oauth_id
client_secret = oauth_secret
auth_url = https://LMS_URL_HERE/oauth2/access_token/

[course-list]
api_root_url = https://LMS_URL_HERE/api/courses/v1/courses/

[course-blocks]
api_root_url = https://LMS_URL_HERE/api/courses/v1/blocks/
```
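Before wiring this into the nightly script below, it's worth running one small task by hand to confirm the pipeline can actually read the config. The `launch-task` command and `OVERRIDE_CONFIG` variable match what the script below uses; the virtualenv and checkout paths are assumptions about that same layout, so adjust them if your install differs.

```bash
# Sanity-check override.cfg by running a single, small task manually.
# Paths assume the same layout as pipeline.sh below.
. $HOME/venvs/pipeline/bin/activate
cd $HOME/pipeline
export OVERRIDE_CONFIG=$HOME/override.cfg

# A one-day enrollment-events run fails fast if the S3 paths, credentials or
# database settings in the config are wrong.
launch-task CourseEnrollmentEventsTask \
    --interval 2017-01-01-2017-01-02 \
    --local-scheduler \
    --overwrite \
    --n-reduce-tasks 1
```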
- Then, the analytics tasks can be run on the local machine using the `pipeline.sh` script below. Replace the variables in the FIXME block with real values, and schedule it to run daily via cron to keep your data updated (see the crontab sketch after the script).

```bash
#!/bin/bash

# Acquire lock using this script itself as the lockfile.
# If another pipeline task is already running, then exit immediately.
exec 200<$0
flock -n 200 || { echo "`date` Another pipeline task is already running."; exit 1; }

# Run as hadoop user
. $HOME/hadoop/hadoop_env
. $HOME/venvs/pipeline/bin/activate
cd $HOME/pipeline
export OVERRIDE_CONFIG=$HOME/override.cfg

HIVE='hive'
HDFS="hadoop fs"

# FIXME set these variables
FROM_DATE=2017-01-01
NUM_REDUCE_TASKS=12
TRACKING_LOGS_S3_BUCKET="s3://TRACKING-LOG-BUCKET-GOES-HERE"
TRACKING_LOGS_S3_PATH="$TRACKING_LOGS_S3_BUCKET/logs/tracking/"
HADOOP_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"
# bucket/path for temporary/intermediate storage
HADOOP_S3_PATH="$HADOOP_S3_BUCKET/analytics"
HDFS_ROOT="$HADOOP_S3_PATH"
TASK_CONFIGURATION_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"
# bucket/path containing task configuration files
TASK_CONFIGURATION_S3_PATH="$TASK_CONFIGURATION_S3_BUCKET/analytics/packages/"
# /FIXME set these variables

END_DATE=$(date +"%Y-%m-%d")
INTERVAL="$FROM_DATE-$END_DATE"
REMOTE_TASK="launch-task"
WEEKS=10
ADD_PARAMS=""

LOCKFILE=/tmp/pipeline-tasks.lock
if [ -f $LOCKFILE ]; then
    echo "This script is already running."
    exit
else
    touch $LOCKFILE
fi

DO_SHIFT=0
getopts e:w:p: PARAM
while [ $? -eq 0 ]; do
    case "$PARAM" in
        (e)
            echo "Using end_date: $OPTARG"
            END_DATE=$OPTARG
            DO_SHIFT=$(( $DO_SHIFT + 2 ))
            ;;
        (w)
            echo "Using WEEKS=$OPTARG"
            WEEKS=$OPTARG
            DO_SHIFT=$(( $DO_SHIFT + 2 ))
            ;;
        (p)
            echo "Using file pattern: $OPTARG"
            ADD_PARAMS="--pattern '$OPTARG'"
            DO_SHIFT=$(( $DO_SHIFT + 2 ))
            ;;
    esac
    getopts e:w:p: PARAM
done
if [ $DO_SHIFT -gt 0 ]; then
    shift $DO_SHIFT
fi
if [ "$1x" != "x" ]; then
    echo "Adding parameters: $@"
    ADD_PARAMS="$@"
fi

# Run history tasks once to bootstrap new deployments.
RUN_ENROLLMENTS_HISTORY=0
RUN_GEOGRAPHY_HISTORY=0
RUN_LEARNER_ANALYTICS_HISTORY=0

# Run incremental tasks daily
RUN_ENROLLMENTS=1
RUN_PERFORMANCE=1
RUN_GEOGRAPHY=1
RUN_ENGAGEMENT=1
RUN_VIDEO=1
RUN_LEARNER_ANALYTICS=1

# Run engagement task if today is a Monday
if [ $(date +%u) -eq 1 ]; then
    RUN_ENGAGEMENT=1
fi

if [ ! -d /tmp/$END_DATE ]; then
    mkdir /tmp/$END_DATE
fi

if [ $RUN_ENROLLMENTS_HISTORY -gt 0 ]; then
    # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#history-task
    $REMOTE_TASK CourseEnrollmentEventsTask \
        --interval "$INTERVAL" \
        --local-scheduler \
        --overwrite \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/CourseEnrollmentEventsTask.log 2>&1
fi

if [ $RUN_ENROLLMENTS -gt 0 ]; then
    # https://groups.google.com/d/msg/openedx-ops/pCuzvbG1OyA/FehWsxTgBwAJ
    # Since Ginkgo, using a persistent Hive metastore causes issues with the enrollments summary data.
    # The workaround is to delete the previously calculated summary data.
    $HIVE -e 'USE default;DROP TABLE IF EXISTS course_grade_by_mode;' \
        >> /tmp/$END_DATE/cleanup.log 2>&1
    $HDFS -rm -r $HDFS_ROOT/warehouse/course_grade_by_mode/* \
        >> /tmp/$END_DATE/cleanup.log 2>&1
    $HIVE -e 'USE default;DROP TABLE IF EXISTS course_meta_summary_enrollment;' \
        >> /tmp/$END_DATE/cleanup.log 2>&1
    $HDFS -rm -r $HDFS_ROOT/warehouse/course_meta_summary_enrollment/* \
        >> /tmp/$END_DATE/cleanup.log 2>&1

    $REMOTE_TASK ImportEnrollmentsIntoMysql \
        --interval "$INTERVAL" \
        --local-scheduler \
        --overwrite \
        --overwrite-n-days 1 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ImportEnrollmentsIntoMysql.log 2>&1
fi

if [ $RUN_PERFORMANCE -gt 0 ]; then
    NOW=`date +%s`
    ANSWER_DIST_S3_BUCKET=$HADOOP_S3_PATH/intermediate/answer_dist/$NOW
    $REMOTE_TASK AnswerDistributionWorkflow \
        --local-scheduler \
        --src "[\"$TRACKING_LOGS_S3_PATH\"]" \
        --dest "$ANSWER_DIST_S3_BUCKET" \
        --name AnswerDistributionWorkflow \
        --output-root "$HADOOP_S3_PATH/grading_reports/" \
        --include "[\"*tracking.log*.gz\"]" \
        --manifest "$ANSWER_DIST_S3_BUCKET/manifest.txt" \
        --base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
        --lib-jar "[\"$TASK_CONFIGURATION_S3_PATH/edx-analytics-hadoop-util.jar\"]" \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --marker "$ANSWER_DIST_S3_BUCKET/marker" \
        $ADD_PARAMS > /tmp/$END_DATE/AnswerDistributionWorkflow.log 2>&1
fi

if [ $RUN_GEOGRAPHY_HISTORY -gt 0 ]; then
    # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id6
    $REMOTE_TASK LastDailyIpAddressOfUserTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/LastDailyIpAddressOfUserTask.log 2>&1
fi

if [ $RUN_GEOGRAPHY -gt 0 ]; then
    $REMOTE_TASK InsertToMysqlLastCountryPerCourseTask \
        --local-scheduler \
        --interval-end $END_DATE \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlLastCountryPerCourseTask.log 2>&1
fi

if [ $RUN_ENGAGEMENT -gt 0 ]; then
    WEEKS=24
    $REMOTE_TASK InsertToMysqlCourseActivityTask \
        --local-scheduler \
        --end-date $END_DATE \
        --weeks $WEEKS \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/CourseActivityWeeklyTask.log 2>&1
fi

if [ $RUN_VIDEO -gt 0 ]; then
    $REMOTE_TASK InsertToMysqlAllVideoTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlAllVideoTask.log 2>&1
fi

if [ $RUN_LEARNER_ANALYTICS_HISTORY -gt 0 ]; then
    # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id12
    $REMOTE_TASK ModuleEngagementIntervalTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite-from-date $END_DATE \
        --overwrite-mysql \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementIntervalTask.log 2>&1
fi

if [ $RUN_LEARNER_ANALYTICS -gt 0 ]; then
    $REMOTE_TASK ModuleEngagementWorkflowTask \
        --local-scheduler \
        --date $END_DATE \
        --indexing-tasks 5 \
        --throttle 0.5 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementWorkflowTask.log 2>&1
fi

rm -f $LOCKFILE
```
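For the cron scheduling mentioned above, a crontab entry along these lines does the job; the user, path, and time are placeholders to adapt to your setup.

```bash
# Placeholder crontab entry: run pipeline.sh nightly at 02:00 as the user that
# owns the pipeline environment, and keep a log of each run.
# Install it with `crontab -e` as that user.
0 2 * * * /home/hadoop/pipeline.sh >> /home/hadoop/pipeline-cron.log 2>&1
```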