Installing Insights

Hi @ettayeb_mohamed! Thanks for posting your question here, since installing analytics is a source of frustration for a lot of people.

Not sure where you saw this, but analytics isn’t installed by default with Open edX. edX has made a lot of improvements to the devstack since it moved to Docker, so developing against the analytics pipeline is now supported out of the box on the Docker devstack, but AFAIK a production deployment still requires separate deployment steps.

Yep, currently AWS is the only officially supported environment for analytics deployments, because of all the pieces required to run the analytics pipeline, which feeds data into Insights (see architecture diagram). We at OpenCraft set up analytics on AWS a lot for clients, so we’ve assembled some documentation on how to do this, but beware that it’s not straightforward: openedx-deployment.doc.opencraft.com, under Analytics.

However, AWS is cost-prohibitive for a lot of deployments, and people with small- and medium-sized LMS user bases don’t really need the massively-scaled infrastructure that Open edX’s AWS analytics deployment provides. There are a couple of options.

Figures
@john and Appsembler built Figures, which provides some of the data reporting available in Open edX Insights/analytics.

Since it runs in the same python environment as the LMS, it’s much easier to install, use, and contribute to.

Depending on which version of Open edX you’re running, I’d totally recommend trying it out to see if it meets your needs. They’re happy to accept pull requests too, if you find bugs or have features you want to add!
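To give a sense of how much simpler it is, here’s a rough sketch of what the install can look like on a native (Ansible-based) deployment. The paths and commands below are assumptions based on a standard native install; follow the Figures README for the exact steps for your release and Figures version:

    # Rough sketch only -- see the Figures README for the steps specific to your release.
    # Install Figures into the LMS virtualenv (paths assume a standard native install):
    sudo -H -u edxapp /edx/app/edxapp/venvs/edxapp/bin/pip install figures

    # After registering the app as described in the README (settings/urls, or the plugin
    # mechanism on newer releases), run its migrations and restart the LMS:
    sudo -H -u edxapp /edx/app/edxapp/venvs/edxapp/bin/python \
        /edx/app/edxapp/edx-platform/manage.py lms migrate figures --settings=production  # or aws on older releases
    sudo /edx/bin/supervisorctl restart lms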

OpenStack Analytics

OpenCraft is working on enhancing our Open edX deployment service (Ocim) to make it possible to run Insights and the Analytics Pipeline on a single OpenStack (OVH) instance.

The timeline for completing this isn’t yet known, so nothing has been upstreamed or properly documented yet. But I can share what we’ve done so far, and you’re welcome to use what you like. Again beware: it’s not a simple process.

Also note: we use S3 buckets for cost and authentication reasons, but you can use any HDFS-friendly location, e.g. an hdfs:// URL on the local HDFS instead of the s3:// URLs in the configuration below.

  • Based my configuration branch on our ironwood.2 release branch; cf. the changes made

  • Deployed using this modified playbook and these ansible variables (there’s a rough example of the ansible-playbook invocation after the variable list):

    ansible variables (replace the FIXMEs with real values):

    SANDBOX_ENABLE_CERTIFICATES: false
    SANDBOX_ENABLE_ANALYTICS_API: true
    SANDBOX_ENABLE_INSIGHTS: true
    SANDBOX_ENABLE_PIPELINE: true
    INSIGHTS_NGINX_PORT: 80
    
    # packages required to install and run the pipeline
    analytics_pipeline_debian_pkgs:
      - "mysql-server-5.6"
      - python-mysqldb
      - libpq-dev
    
    NGINX_INSIGHTS_APP_EXTRA: |
      # Use /status instead of /heartbeat endpoint to keep Ocim provisioning happy
      rewrite ^/heartbeat$ /status;
    
    # Allows hadoop/hdfs to write to our S3 bucket.
    HADOOP_CORE_SITE_EXTRA_CONFIG:
      fs.s3.awsAccessKeyId: "{{ AWS_ACCESS_KEY_ID }}"
      fs.s3.awsSecretAccessKey: "{{ AWS_SECRET_ACCESS_KEY }}"
      fs.s3.region: us-east-1   # FIXME: should be a variable
      fs.s3.impl: "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
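    # Once provisioned, hadoop's access to the bucket can be sanity-checked with
    # something like the following (as the hadoop user):
    #   hadoop fs -ls s3://BUCKET-NAME-HERE/analytics/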
    
    # Use our mysql database for the hive database
    HIVE_METASTORE_DATABASE_HOST: "{{ EDXAPP_MYSQL_HOST }}"
    HIVE_METASTORE_DATABASE_NAME: hive
    HIVE_METASTORE_DATABASE_USER:  # FIXME
    HIVE_METASTORE_DATABASE_PASSWORD: # FIXME
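    # Assumption: the hive database/user above are not created for you, so create them
    # on the MySQL server first, e.g.:
    #   CREATE DATABASE hive;
    #   CREATE USER 'hive'@'%' IDENTIFIED BY 'SOME-PASSWORD';
    #   GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';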
    
    HIVE_SITE_EXTRA_CONFIG:
      datanucleus.autoCreateSchema: true
      datanucleus.autoCreateTables: true
      datanucleus.fixedDatastore: true
    
    # EDXAPP Variables needed by config below
    EDXAPP_LMS_ROOT_URL: "{{ EDXAPP_LMS_BASE_SCHEME | default('https') }}://{{ EDXAPP_LMS_BASE }}"
    ANALYTICS_API_LMS_BASE_URL: "{{ EDXAPP_LMS_ROOT_URL }}"
    
    # ANALYTICS_API Variables needed by playbooks
    ANALYTICS_API_EMAIL_HOST: localhost
    ANALYTICS_API_EMAIL_HOST_PASSWORD: ''
    ANALYTICS_API_EMAIL_HOST_USER: ''
    ANALYTICS_API_EMAIL_PORT: 25
    # ANALYTICS_API_GIT_IDENTITY: '{{ COMMON_GIT_IDENTITY }}'
    ANALYTICS_API_LANGUAGE_CODE: en-us
    ANALYTICS_API_PIP_EXTRA_ARGS: --use-wheel --no-index --find-links=http://edx-wheelhouse.s3-website-us-east-1.amazonaws.com/Ubuntu/precise/Python-2.7
    ANALYTICS_API_SERVICE_CONFIG:
      ANALYTICS_DATABASE: reports
      API_AUTH_TOKEN: # FIXME
      DATABASES: '{{ ANALYTICS_API_DATABASES }}'
      # nb: using default localhost elasticsearch
      EMAIL_PORT: '{{ ANALYTICS_API_EMAIL_PORT }}'
      LANGUAGE_CODE: en-us
      SECRET_KEY: '{{ ANALYTICS_API_SECRET_KEY }}'
      STATICFILES_DIRS: []
      STATIC_ROOT: '{{ COMMON_DATA_DIR }}/{{ analytics_api_service_name }}/staticfiles'
      TIME_ZONE: UTC
    # This password must be 40 characters or fewer
    ANALYTICS_API_USER_PASSWORD: # FIXME
    ANALYTICS_API_USERS:
      apiuser001: '{{ ANALYTICS_API_USER_PASSWORD }}'
      dummy-api-user: # FIXME
    
    # INSIGHTS Variables needed by playbooks
    INSIGHTS_APPLICATION_NAME: "Insights {{ EDXAPP_PLATFORM_NAME }}"
    INSIGHTS_BASE_URL: # FIXME
    INSIGHTS_CMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_CMS_BASE }}/course
    INSIGHTS_CMS_NGINX_PORT: '{{ EDXAPP_PLATFORM_NAME }}'  # FIXME: double-check this value; it should presumably be a port number
    INSIGHTS_CSRF_COOKIE_NAME: crsftoken
    # INSIGHTS_DATABASES stanza defined above
    INSIGHTS_DATA_API_AUTH_TOKEN: '{{ ANALYTICS_API_USER_PASSWORD }}'
    INSIGHTS_DOC_BASE: http://edx-insights.readthedocs.org/en/latest
    INSIGHTS_DOC_LOAD_ERROR_URL: http://edx-insights.readthedocs.org/en/latest/Reference.html#error-conditions
    INSIGHTS_FEEDBACK_EMAIL: dashboard@example.com
    INSIGHTS_GUNICORN_EXTRA: ''
    INSIGHTS_GUNICORN_WORKERS: '8'
    INSIGHTS_LANGUAGE_COOKIE_NAME: language
    INSIGHTS_LMS_BASE: https://{{ EDXAPP_LMS_BASE }}
    INSIGHTS_LMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_LMS_BASE }}/courses
    INSIGHTS_MKTG_BASE: 'https://{{ EDXAPP_LMS_BASE }}'
    # credentials should be auto-generated, not hardcoded here.
    INSIGHTS_OAUTH2_KEY: # FIXME
    INSIGHTS_OAUTH2_SECRET: # FIXME
    INSIGHTS_OAUTH2_URL_ROOT: https://{{ EDXAPP_LMS_BASE }}/oauth2
    INSIGHTS_OPEN_SOURCE_URL: http://code.edx.org/
    INSIGHTS_PLATFORM_NAME: '{{ EDXAPP_PLATFORM_NAME }}'
    INSIGHTS_PRIVACY_POLICY_URL: 'https://{{ EDXAPP_LMS_BASE }}/edx-privacy-policy'
    INSIGHTS_SESSION_COOKIE_NAME: sessionid
    INSIGHTS_SOCIAL_AUTH_REDIRECT_IS_HTTPS: true
    INSIGHTS_SUPPORT_EMAIL: support@example.com
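
    With the variables above saved to a file, the deploy itself is an ordinary ansible-playbook run against the target host. Roughly something like this (the playbook and file names are placeholders, not the actual ones from our branch):

    # Placeholder names -- substitute your host, vars file, and the modified playbook.
    ansible-playbook -u ubuntu --become \
        -i "analytics.example.com," \
        -e @analytics_extra_vars.yml \
        playbooks/edx_sandbox.yml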
    
  • Made some minor modifications to the analytics pipeline (cf. the diff), and used that branch to run the pipeline.

  • Used this configuration for the pipeline:

    override.cfg (replace the THINGS-IN-ALL-CAPS with real values):
    [hive]
    warehouse_path = s3://BUCKET-NAME-HERE/analytics/warehouse/
    
    [database-export]
    database = CLIENT-PREFIX-HERE_reports
    credentials = s3://BUCKET-NAME-HERE/analytics/config/output.json
    
    [database-import]
    database = CLIENT-PREFIX-HERE_edxapp
    credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json
    destination = s3://BUCKET-NAME-HERE/analytics/warehouse/
    
    [otto-database-import]
    database = CLIENT-PREFIX-HERE_ecommerce
    credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json
    
    [map-reduce]
    engine = hadoop
    marker = s3://BUCKET-NAME-HERE/analytics/marker/
    lib_jar = [
        "hdfs://localhost:9000/lib/hadoop-aws-2.7.2.jar",
        "hdfs://localhost:9000/lib/aws-java-sdk-1.7.4.jar"]
    
    [event-logs]
    pattern = [".*tracking.log-(?P<date>[0-9]+).*"]
    expand_interval = 30 days
    source = ["s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/"]
    
    [event-export]
    output_root = s3://BUCKET-NAME-HERE/analytics/event-export/output/
    environment = simple
    config = s3://BUCKET-NAME-HERE/analytics/event_export/config.yaml
    gpg_key_dir = s3://BUCKET-NAME-HERE/analytics/event_export/gpg-keys/
    gpg_master_key = master@key.org
    required_path_text = FakeServerGroup
    
    [event-export-course]
    output_root = s3://BUCKET-NAME-HERE/analytics/event-export-by-course/output/
    
    [manifest]
    threshold = 500
    input_format = org.edx.hadoop.input.ManifestTextInputFormat
    lib_jar = s3://BUCKET-NAME-HERE/analytics/packages/edx-analytics-hadoop-util.jar
    path = s3://BUCKET-NAME-HERE/analytics/manifest/
    
    [user-activity]
    overwrite_n_days = 10
    output_root = s3://BUCKET-NAME-HERE/analytics/activity/
    
    [answer-distribution]
    valid_response_types = customresponse,choiceresponse,optionresponse,multiplechoiceresponse,numericalresponse,stringresponse,formularesponse
        
    [enrollments]
    interval_start = 2017-01-01
    overwrite_n_days = 3
    blacklist_date = 2001-01-01
    blacklist_path = s3://BUCKET-NAME-HERE/analytics/enrollments-blacklist/
    
    [enrollment-reports]
    src = s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/
    destination = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/output/
    offsets = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/offsets.tsv
    blacklist = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/course_blacklist.tsv
    history = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/enrollment_history.tsv
    
    [course-summary-enrollment]
    # JV - course catalog is optional, and was causing CourseProgramMetadataInsertToMysqlTask errors.
    # enable_course_catalog = true
    enable_course_catalog = false
    
    [financial-reports]
    shoppingcart-partners = {"DEFAULT": "edx"}
    
    [geolocation]
    geolocation_data = s3://BUCKET-NAME-HERE/analytics/packages/GeoIP.dat
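    # This is the legacy MaxMind GeoIP country database; it needs to be uploaded to the
    # location above beforehand, e.g. (assuming the aws CLI is configured):
    #   aws s3 cp GeoIP.dat s3://BUCKET-NAME-HERE/analytics/packages/GeoIP.dat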
     
    [location-per-course]
    interval_start = 2017-01-01
    overwrite_n_days = 3
    
    [calendar]
    interval = 2017-01-01-2030-01-01
    
    [videos]
    dropoff_threshold = 0.05
    allow_empty_insert = true
    overwrite_n_days = 3
    
    [elasticsearch]
    host = ["http://localhost:9200/"]
    
    [module-engagement]
    alias = roster_1_2
    number_of_shards = 5
    overwrite_n_days = 3
    allow_empty_insert = true
    
    [ccx]
    enabled = false
    
    [problem-response]
    report_fields = [
        "username",
        "problem_id",
        "answer_id",
        "location",
        "question",
        "score",
        "max_score",
        "correct",
        "answer",
        "total_attempts",
        "first_attempt_date",
        "last_attempt_date"]
    report_output_root = s3://BUCKET-NAME-HERE/analytics/reports/
    
    [edx-rest-api]
    # Create using:
    # ./manage.py lms --settings=devstack create_oauth2_client  \
    #   http://localhost:9999  # URL does not matter \
    #   http://localhost:9999/complete/edx-oidc/  \
    #   confidential \
    #   --client_name "Analytics Pipeline" \
    #   --client_id oauth_id \
    #   --client_secret oauth_secret \
    #   --trusted
    client_id = oauth_id
    client_secret = oauth_secret
    auth_url = https://LMS_URL_HERE/oauth2/access_token/
    
    [course-list]
    api_root_url = https://LMS_URL_HERE/api/courses/v1/courses/
    
    [course-blocks]
    api_root_url = https://LMS_URL_HERE/api/courses/v1/blocks/
    
  • Then, the analytics tasks can be run on the local machine using this script. Schedule it to run daily via cron to keep your data updated (there’s an example crontab entry after the script).

    pipeline.sh (replace the variables in the FIXME block with real values):
    #!/bin/bash
    
    # Acquire lock using this script itself as the lockfile.
    # If another pipeline task is already running, then exit immediately.
    exec 200<$0
    flock -n 200 || { echo "`date` Another pipeline task is already running."; exit 1; }
    
    # Run as hadoop user
    . $HOME/hadoop/hadoop_env
    . $HOME/venvs/pipeline/bin/activate
    cd $HOME/pipeline
    
    export OVERRIDE_CONFIG=$HOME/override.cfg
    
    HIVE='hive'
    HDFS="hadoop fs"
    
    # FIXME set these variables
    FROM_DATE=2017-01-01
    NUM_REDUCE_TASKS=12
    TRACKING_LOGS_S3_BUCKET="s3://TRACKING-LOG-BUCKET-GOES-HERE"
    TRACKING_LOGS_S3_PATH="$TRACKING_LOGS_S3_BUCKET/logs/tracking/"
    HADOOP_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"  # bucket/path for temporary/intermediate storage
    HADOOP_S3_PATH="$HADOOP_S3_BUCKET/analytics"
    HDFS_ROOT="$HADOOP_S3_PATH"
    TASK_CONFIGURATION_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"  # bucket/path containing task configuration files
    TASK_CONFIGURATION_S3_PATH="$TASK_CONFIGURATION_S3_BUCKET/analytics/packages/"
    # /FIXME set these variables
    
    END_DATE=$(date +"%Y-%m-%d")
    INTERVAL="$FROM_DATE-$END_DATE"
    REMOTE_TASK="launch-task"
    WEEKS=10
    ADD_PARAMS=""
    LOCKFILE=/tmp/pipeline-tasks.lock
    
    # Secondary lock file (in addition to the flock above); the trap makes sure it is
    # removed even if the script dies partway through.
    if [ -f $LOCKFILE ]; then
            echo "This script is already running."
            exit
    else
            touch $LOCKFILE
            trap "rm -f $LOCKFILE" EXIT
    fi
    
    # Optional arguments: -e <end_date>, -w <weeks>, -p <event log file pattern>.
    # Any remaining arguments are passed through to the tasks verbatim.
    DO_SHIFT=0
    getopts e:w:p: PARAM
    while [ $? -eq 0 ]; do
            case "$PARAM" in
                    (e)
                            echo "Using end_date: $OPTARG"
                            END_DATE=$OPTARG
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
                    (w)
                            echo "Using WEEKS=$OPTARG"
                            WEEKS=$OPTARG
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
                    (p)
                            echo "Using file pattern: $OPTARG"
                            ADD_PARAMS="--pattern '$OPTARG'"
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
            esac
            getopts e:w:p: PARAM
    done
    
    if [ $DO_SHIFT -gt 0 ]; then
            shift $DO_SHIFT
    fi

    # Recompute the interval in case END_DATE was overridden with -e above.
    INTERVAL="$FROM_DATE-$END_DATE"
    
    if [ "$1x" != "x" ]; then
            echo "Adding parameters: $@"
            ADD_PARAMS="$@"
    fi
    
    # Run history tasks once to bootstrap new deployments.
    RUN_ENROLLMENTS_HISTORY=0
    RUN_GEOGRAPHY_HISTORY=0
    RUN_LEARNER_ANALYTICS_HISTORY=0
    
    # Run incremental tasks daily
    RUN_ENROLLMENTS=1
    RUN_PERFORMANCE=1
    RUN_GEOGRAPHY=1
    RUN_ENGAGEMENT=1
    RUN_VIDEO=1
    RUN_LEARNER_ANALYTICS=1
    
    # If you prefer to run the engagement task weekly rather than daily, set
    # RUN_ENGAGEMENT=0 above; this block then re-enables it on Mondays only.
    if [ $(date +%u) -eq 1 ]; then
            RUN_ENGAGEMENT=1
    fi
    
    if [ ! -d /tmp/$END_DATE ]; then
            mkdir /tmp/$END_DATE
    fi
    
    
    if [ $RUN_ENROLLMENTS_HISTORY -gt 0 ]; then
    
       # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#history-task
       $REMOTE_TASK CourseEnrollmentEventsTask \
         --interval "$INTERVAL" \
         --local-scheduler \
         --overwrite \
         --n-reduce-tasks $NUM_REDUCE_TASKS \
         $ADD_PARAMS > /tmp/$END_DATE/CourseEnrollmentEventsTask.log 2>&1
    fi
    
    if [ $RUN_ENROLLMENTS -gt 0 ]; then
    
      # https://groups.google.com/d/msg/openedx-ops/pCuzvbG1OyA/FehWsxTgBwAJ
      # Since Ginkgo, using a persistent Hive metastore causes issues with the enrollments summary data.
      # The workaround is to delete the previously calculated summary data.
      $HIVE -e 'USE default;DROP TABLE IF EXISTS course_grade_by_mode;' \
          >> /tmp/$END_DATE/cleanup.log 2>&1
      $HDFS -rm -r $HDFS_ROOT/warehouse/course_grade_by_mode/* \
          >> /tmp/$END_DATE/cleanup.log 2>&1
      $HIVE -e 'USE default;DROP TABLE IF EXISTS course_meta_summary_enrollment;' \
        >> /tmp/$END_DATE/cleanup.log 2>&1
      $HDFS -rm -r $HDFS_ROOT/warehouse/course_meta_summary_enrollment/* \
          >> /tmp/$END_DATE/cleanup.log 2>&1
    
      $REMOTE_TASK ImportEnrollmentsIntoMysql \
        --interval "$INTERVAL" \
        --local-scheduler \
        --overwrite \
        --overwrite-n-days 1 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ImportEnrollmentsIntoMysql.log 2>&1
    fi
    
    if [ $RUN_PERFORMANCE -gt 0 ]; then
    
      NOW=`date +%s`
      ANSWER_DIST_S3_BUCKET=$HADOOP_S3_PATH/intermediate/answer_dist/$NOW
    
      $REMOTE_TASK AnswerDistributionWorkflow \
        --local-scheduler \
        --src "[\"$TRACKING_LOGS_S3_PATH\"]" \
        --dest "$ANSWER_DIST_S3_BUCKET" \
        --name AnswerDistributionWorkflow \
        --output-root "$HADOOP_S3_PATH/grading_reports/" \
        --include "[\"*tracking.log*.gz\"]" \
        --manifest "$ANSWER_DIST_S3_BUCKET/manifest.txt" \
        --base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
        --lib-jar "[\"$TASK_CONFIGURATION_S3_PATH/edx-analytics-hadoop-util.jar\"]" \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --marker "$ANSWER_DIST_S3_BUCKET/marker" \
        $ADD_PARAMS > /tmp/$END_DATE/AnswerDistributionWorkflow.log 2>&1
    fi
    
    if [ $RUN_GEOGRAPHY_HISTORY -gt 0 ]; then
    
      # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id6
      $REMOTE_TASK LastDailyIpAddressOfUserTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/LastDailyIpAddressOfUserTask.log 2>&1
    fi
    
    if [ $RUN_GEOGRAPHY -gt 0 ]; then
    
      $REMOTE_TASK InsertToMysqlLastCountryPerCourseTask \
        --local-scheduler \
        --interval-end $END_DATE \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlLastCountryPerCourseTask.log 2>&1
    fi
    
    if [ $RUN_ENGAGEMENT -gt 0 ]; then
    
      # NB: hard-coded here; this overrides the default above (and any -w option) for the engagement task.
      WEEKS=24
    
      $REMOTE_TASK InsertToMysqlCourseActivityTask \
        --local-scheduler \
        --end-date $END_DATE \
        --weeks $WEEKS \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/CourseActivityWeeklyTask.log 2>&1
    fi
    
    if [ $RUN_VIDEO -gt 0 ]; then
      $REMOTE_TASK InsertToMysqlAllVideoTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlAllVideoTask.log 2>&1
    fi
    
    if [ $RUN_LEARNER_ANALYTICS_HISTORY -gt 0 ]; then
    
      # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id12
      $REMOTE_TASK ModuleEngagementIntervalTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite-from-date $END_DATE \
        --overwrite-mysql \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementIntervalTask.log 2>&1
    fi
    
    if [ $RUN_LEARNER_ANALYTICS -gt 0 ]; then
    
      $REMOTE_TASK ModuleEngagementWorkflowTask \
        --local-scheduler \
        --date $END_DATE \
        --indexing-tasks 5 \
        --throttle 0.5 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementWorkflowTask.log 2>&1
    fi
    
    rm -f $LOCKFILE
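
    For example, a crontab entry for the user that owns the pipeline environment, running the script nightly (the paths here are assumptions; point them at wherever you keep the script):

    # crontab -e -- run the pipeline every night at 02:00 and keep a log of each run
    0 2 * * * /home/hadoop/pipeline.sh >> /home/hadoop/pipeline-cron.log 2>&1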
    