Installing Insights

Hi @ettayeb_mohamed! Thanks for posting your question here, since installing analytics is a source of frustration for a lot of people.

Not sure where you saw this, but analytics isn’t installed by default with Open edX. edX has made a lot of improvements to the devstack since it moved to Docker, so developing against the analytics pipeline is now supported out of the box on the Docker devstack, but AFAIK a production deployment still requires separate deployment steps.

Yep, currently AWS is the only officially supported environment for analytics deployments, because of all the pieces required to run the analytics pipeline, which feeds data into Insights (see architecture diagram). We at OpenCraft set up analytics on AWS a lot for clients, so we’ve assembled some documentation on how to do this, but beware that it’s not straightforward: openedx-deployment.doc.opencraft.com, under Analytics.

However, AWS is cost-prohibitive for a lot of deployments, and people with small- and medium-sized LMS user bases don’t really need the massively-scaled infrastructure that Open edX’s AWS analytics deployment provides. There are a couple of options.

Figures
@john and Appsembler built Figures, which provides some of the data reporting available in Open edX Insights/analytics.

Since it runs in the same python environment as the LMS, it’s much easier to install, use, and contribute to.

Depending on which version of Open edX you’re running, I’d totally recommend trying it out to see if it meets your needs. They’re happy to accept pull requests too, if you find bugs or have features you want to add!
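To give a sense of how much simpler it is, here’s a rough sketch of what the install can look like on a native (Ansible-based) deployment. The paths and commands below are assumptions based on a standard native install; follow the Figures README for the exact steps for your release and Figures version:

    # Rough sketch only -- see the Figures README for the steps specific to your release.
    # Install Figures into the LMS virtualenv (paths assume a standard native install):
    sudo -H -u edxapp /edx/app/edxapp/venvs/edxapp/bin/pip install figures

    # After registering the app as described in the README (settings/urls, or the plugin
    # mechanism on newer releases), run its migrations and restart the LMS:
    sudo -H -u edxapp /edx/app/edxapp/venvs/edxapp/bin/python \
        /edx/app/edxapp/edx-platform/manage.py lms migrate figures --settings=production  # or aws on older releases
    sudo /edx/bin/supervisorctl restart lms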

OpenStack Analytics

OpenCraft is working on enhancing our Open edX deployment service (Ocim) to make it possible to run Insights and the Analytics Pipeline on a single OpenStack (OVH) instance.

The timeline for completing this isn’t yet known, so nothing has been upstreamed or properly documented yet. But I can share what we’ve done so far, and you’re welcome to use what you like. Again beware: it’s not a simple process.

Also note: we use S3 buckets for cost and authentication reasons, but you can use any HDFS-friendly location, e.g. an hdfs:// URL on the local HDFS instead of the s3:// URLs in the configuration below.

  • Based my configuration branch on our ironwood.2 release branch; cf. the changes made

  • Deployed using this modified playbook and these ansible variables (there’s a rough example of the ansible-playbook invocation after the variable list):

    ansible variables (replace the FIXMEs with real values):

    SANDBOX_ENABLE_CERTIFICATES: false
    SANDBOX_ENABLE_ANALYTICS_API: true
    SANDBOX_ENABLE_INSIGHTS: true
    SANDBOX_ENABLE_PIPELINE: true
    INSIGHTS_NGINX_PORT: 80
    
    # packages required to install and run the pipeline
    analytics_pipeline_debian_pkgs:
      - "mysql-server-5.6"
      - python-mysqldb
      - libpq-dev
    
    NGINX_INSIGHTS_APP_EXTRA: |
      # Use /status instead of /heartbeat endpoint to keep Ocim provisioning happy
      rewrite ^/heartbeat$ /status;
    
    # Allows hadoop/hdfs to write to our S3 bucket.
    HADOOP_CORE_SITE_EXTRA_CONFIG:
      fs.s3.awsAccessKeyId: "{{ AWS_ACCESS_KEY_ID }}"
      fs.s3.awsSecretAccessKey: "{{ AWS_SECRET_ACCESS_KEY }}"
      fs.s3.region: us-east-1   # FIXME: should be a variable
      fs.s3.impl: "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
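    # Once provisioned, hadoop's access to the bucket can be sanity-checked with
    # something like the following (as the hadoop user):
    #   hadoop fs -ls s3://BUCKET-NAME-HERE/analytics/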
    
    # Use our mysql database for the hive database
    HIVE_METASTORE_DATABASE_HOST: "{{ EDXAPP_MYSQL_HOST }}"
    HIVE_METASTORE_DATABASE_NAME: hive
    HIVE_METASTORE_DATABASE_USER:  # FIXME
    HIVE_METASTORE_DATABASE_PASSWORD: # FIXME
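    # Assumption: the hive database/user above are not created for you, so create them
    # on the MySQL server first, e.g.:
    #   CREATE DATABASE hive;
    #   CREATE USER 'hive'@'%' IDENTIFIED BY 'SOME-PASSWORD';
    #   GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';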
    
    HIVE_SITE_EXTRA_CONFIG:
      datanucleus.autoCreateSchema: true
      datanucleus.autoCreateTables: true
      datanucleus.fixedDatastore: true
    
    # EDXAPP Variables needed by config below
    EDXAPP_LMS_ROOT_URL: "{{ EDXAPP_LMS_BASE_SCHEME | default('https') }}://{{ EDXAPP_LMS_BASE }}"
    ANALYTICS_API_LMS_BASE_URL: "{{ EDXAPP_LMS_ROOT_URL }}"
    
    # ANALYTICS_API Variables needed by playbooks
    ANALYTICS_API_EMAIL_HOST: localhost
    ANALYTICS_API_EMAIL_HOST_PASSWORD: ''
    ANALYTICS_API_EMAIL_HOST_USER: ''
    ANALYTICS_API_EMAIL_PORT: 25
    # ANALYTICS_API_GIT_IDENTITY: '{{ COMMON_GIT_IDENTITY }}'
    ANALYTICS_API_LANGUAGE_CODE: en-us
    ANALYTICS_API_PIP_EXTRA_ARGS: --use-wheel --no-index --find-links=http://edx-wheelhouse.s3-website-us-east-1.amazonaws.com/Ubuntu/precise/Python-2.7
    ANALYTICS_API_SERVICE_CONFIG:
      ANALYTICS_DATABASE: reports
      API_AUTH_TOKEN: # FIXME
      DATABASES: '{{ ANALYTICS_API_DATABASES }}'
      # nb: using default localhost elasticsearch
      EMAIL_PORT: '{{ ANALYTICS_API_EMAIL_PORT }}'
      LANGUAGE_CODE: en-us
      SECRET_KEY: '{{ ANALYTICS_API_SECRET_KEY }}'
      STATICFILES_DIRS: []
      STATIC_ROOT: '{{ COMMON_DATA_DIR }}/{{ analytics_api_service_name }}/staticfiles'
      TIME_ZONE: UTC
    # This password must be 40 characters or fewer
    ANALYTICS_API_USER_PASSWORD: # FIXME
    ANALYTICS_API_USERS:
      apiuser001: '{{ ANALYTICS_API_USER_PASSWORD }}'
      dummy-api-user: # FIXME
    
    # INSIGHTS Variables needed by playbooks
    INSIGHTS_APPLICATION_NAME: "Insights {{ EDXAPP_PLATFORM_NAME }}"
    INSIGHTS_BASE_URL: # FIXME
    INSIGHTS_CMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_CMS_BASE }}/course
    INSIGHTS_CMS_NGINX_PORT: '{{ EDXAPP_PLATFORM_NAME }}'  # FIXME: double-check this value; it should presumably be a port number
    INSIGHTS_CSRF_COOKIE_NAME: crsftoken
    # INSIGHTS_DATABASES stanza defined above
    INSIGHTS_DATA_API_AUTH_TOKEN: '{{ ANALYTICS_API_USER_PASSWORD }}'
    INSIGHTS_DOC_BASE: http://edx-insights.readthedocs.org/en/latest
    INSIGHTS_DOC_LOAD_ERROR_URL: http://edx-insights.readthedocs.org/en/latest/Reference.html#error-conditions
    INSIGHTS_FEEDBACK_EMAIL: dashboard@example.com
    INSIGHTS_GUNICORN_EXTRA: ''
    INSIGHTS_GUNICORN_WORKERS: '8'
    INSIGHTS_LANGUAGE_COOKIE_NAME: language
    INSIGHTS_LMS_BASE: https://{{ EDXAPP_LMS_BASE }}
    INSIGHTS_LMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_LMS_BASE }}/courses
    INSIGHTS_MKTG_BASE: 'https://{{ EDXAPP_LMS_BASE }}'
    # credentials should be auto-generated, not hardcoded here.
    INSIGHTS_OAUTH2_KEY: # FIXME
    INSIGHTS_OAUTH2_SECRET: # FIXME
    INSIGHTS_OAUTH2_URL_ROOT: https://{{ EDXAPP_LMS_BASE }}/oauth2
    INSIGHTS_OPEN_SOURCE_URL: http://code.edx.org/
    INSIGHTS_PLATFORM_NAME: '{{ EDXAPP_PLATFORM_NAME }}'
    INSIGHTS_PRIVACY_POLICY_URL: 'https://{{ EDXAPP_LMS_BASE }}/edx-privacy-policy'
    INSIGHTS_SESSION_COOKIE_NAME: sessionid
    INSIGHTS_SOCIAL_AUTH_REDIRECT_IS_HTTPS: true
    INSIGHTS_SUPPORT_EMAIL: support@example.com
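
    With the variables above saved to a file, the deploy itself is an ordinary ansible-playbook run against the target host. Roughly something like this (the playbook and file names are placeholders, not the actual ones from our branch):

    # Placeholder names -- substitute your host, vars file, and the modified playbook.
    ansible-playbook -u ubuntu --become \
        -i "analytics.example.com," \
        -e @analytics_extra_vars.yml \
        playbooks/edx_sandbox.yml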
    
  • Made some minor modifications to the analytics pipeline (cf. the diff), and used that branch to run the pipeline.

  • Used this configuration for the pipeline:

    override.cfg (replace the THINGS-IN-ALL-CAPS with real values):
    [hive]
    warehouse_path = s3://BUCKET-NAME-HERE/analytics/warehouse/
    
    [database-export]
    database = CLIENT-PREFIX-HERE_reports
    credentials = s3://BUCKET-NAME-HERE/analytics/config/output.json
    
    [database-import]
    database = CLIENT-PREFIX-HERE_edxapp
    credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json
    destination = s3://BUCKET-NAME-HERE/analytics/warehouse/
    
    [otto-database-import]
    database = CLIENT-PREFIX-HERE_ecommerce
    credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json
    
    [map-reduce]
    engine = hadoop
    marker = s3://BUCKET-NAME-HERE/analytics/marker/
    lib_jar = [
        "hdfs://localhost:9000/lib/hadoop-aws-2.7.2.jar",
        "hdfs://localhost:9000/lib/aws-java-sdk-1.7.4.jar"]
    
    [event-logs]
    pattern = [".*tracking.log-(?P<date>[0-9]+).*"]
    expand_interval = 30 days
    source = ["s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/"]
    
    [event-export]
    output_root = s3://BUCKET-NAME-HERE/analytics/event-export/output/
    environment = simple
    config = s3://BUCKET-NAME-HERE/analytics/event_export/config.yaml
    gpg_key_dir = s3://BUCKET-NAME-HERE/analytics/event_export/gpg-keys/
    gpg_master_key = master@key.org
    required_path_text = FakeServerGroup
    
    [event-export-course]
    output_root = s3://BUCKET-NAME-HERE/analytics/event-export-by-course/output/
    
    [manifest]
    threshold = 500
    input_format = org.edx.hadoop.input.ManifestTextInputFormat
    lib_jar = s3://BUCKET-NAME-HERE/analytics/packages/edx-analytics-hadoop-util.jar
    path = s3://BUCKET-NAME-HERE/analytics/manifest/
    
    [user-activity]
    overwrite_n_days = 10
    output_root = s3://BUCKET-NAME-HERE/analytics/activity/
    
    [answer-distribution]
    valid_response_types = customresponse,choiceresponse,optionresponse,multiplechoiceresponse,numericalresponse,stringresponse,formularesponse
        
    [enrollments]
    interval_start = 2017-01-01
    overwrite_n_days = 3
    blacklist_date = 2001-01-01
    blacklist_path = s3://BUCKET-NAME-HERE/analytics/enrollments-blacklist/
    
    [enrollment-reports]
    src = s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/
    destination = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/output/
    offsets = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/offsets.tsv
    blacklist = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/course_blacklist.tsv
    history = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/enrollment_history.tsv
    
    [course-summary-enrollment]
    # JV - course catalog is optional, and was causing CourseProgramMetadataInsertToMysqlTask errors.
    # enable_course_catalog = true
    enable_course_catalog = false
    
    [financial-reports]
    shoppingcart-partners = {"DEFAULT": "edx"}
    
    [geolocation]
    geolocation_data = s3://BUCKET-NAME-HERE/analytics/packages/GeoIP.dat
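    # This is the legacy MaxMind GeoIP country database; it needs to be uploaded to the
    # location above beforehand, e.g. (assuming the aws CLI is configured):
    #   aws s3 cp GeoIP.dat s3://BUCKET-NAME-HERE/analytics/packages/GeoIP.dat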
     
    [location-per-course]
    interval_start = 2017-01-01
    overwrite_n_days = 3
    
    [calendar]
    interval = 2017-01-01-2030-01-01
    
    [videos]
    dropoff_threshold = 0.05
    allow_empty_insert = true
    overwrite_n_days = 3
    
    [elasticsearch]
    host = ["http://localhost:9200/"]
    
    [module-engagement]
    alias = roster_1_2
    number_of_shards = 5
    overwrite_n_days = 3
    allow_empty_insert = true
    
    [ccx]
    enabled = false
    
    [problem-response]
    report_fields = [
        "username",
        "problem_id",
        "answer_id",
        "location",
        "question",
        "score",
        "max_score",
        "correct",
        "answer",
        "total_attempts",
        "first_attempt_date",
        "last_attempt_date"]
    report_output_root = s3://BUCKET-NAME-HERE/analytics/reports/
    
    [edx-rest-api]
    # Create using:
    # ./manage.py lms --settings=devstack create_oauth2_client  \
    #   http://localhost:9999  # URL does not matter \
    #   http://localhost:9999/complete/edx-oidc/  \
    #   confidential \
    #   --client_name "Analytics Pipeline" \
    #   --client_id oauth_id \
    #   --client_secret oauth_secret \
    #   --trusted
    client_id = oauth_id
    client_secret = oauth_secret
    auth_url = https://LMS_URL_HERE/oauth2/access_token/
    
    [course-list]
    api_root_url = https://LMS_URL_HERE/api/courses/v1/courses/
    
    [course-blocks]
    api_root_url = https://LMS_URL_HERE/api/courses/v1/blocks/
    
  • Then, the analytics tasks can be run on the local machine using this script. Schedule it to run daily via cron to keep your data updated (there’s an example crontab entry after the script).

    pipeline.sh (replace the variables in the FIXME block with real values):
    #!/bin/bash
    
    # Acquire lock using this script itself as the lockfile.
    # If another pipeline task is already running, then exit immediately.
    exec 200<$0
    flock -n 200 || { echo "`date` Another pipeline task is already running."; exit 1; }
    
    # Run as hadoop user
    . $HOME/hadoop/hadoop_env
    . $HOME/venvs/pipeline/bin/activate
    cd $HOME/pipeline
    
    export OVERRIDE_CONFIG=$HOME/override.cfg
    
    HIVE='hive'
    HDFS="hadoop fs"
    
    # FIXME set these variables
    FROM_DATE=2017-01-01
    NUM_REDUCE_TASKS=12
    TRACKING_LOGS_S3_BUCKET="s3://TRACKING-LOG-BUCKET-GOES-HERE"
    TRACKING_LOGS_S3_PATH="$TRACKING_LOGS_S3_BUCKET/logs/tracking/"
    HADOOP_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"  # bucket/path for temporary/intermediate storage
    HADOOP_S3_PATH="$HADOOP_S3_BUCKET/analytics"
    HDFS_ROOT="$HADOOP_S3_PATH"
    TASK_CONFIGURATION_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"  # bucket/path containing task configuration files
    TASK_CONFIGURATION_S3_PATH="$TASK_CONFIGURATION_S3_BUCKET/analytics/packages/"
    # /FIXME set these variables
    
    END_DATE=$(date +"%Y-%m-%d")
    INTERVAL="$FROM_DATE-$END_DATE"
    REMOTE_TASK="launch-task"
    WEEKS=10
    ADD_PARAMS=""
    LOCKFILE=/tmp/pipeline-tasks.lock
    
    # Secondary lock file (in addition to the flock above); the trap makes sure it is
    # removed even if the script dies partway through.
    if [ -f $LOCKFILE ]; then
            echo "This script is already running."
            exit
    else
            touch $LOCKFILE
            trap "rm -f $LOCKFILE" EXIT
    fi
    
    # Optional arguments: -e <end_date>, -w <weeks>, -p <event log file pattern>.
    # Any remaining arguments are passed through to the tasks verbatim.
    DO_SHIFT=0
    getopts e:w:p: PARAM
    while [ $? -eq 0 ]; do
            case "$PARAM" in
                    (e)
                            echo "Using end_date: $OPTARG"
                            END_DATE=$OPTARG
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
                    (w)
                            echo "Using WEEKS=$OPTARG"
                            WEEKS=$OPTARG
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
                    (p)
                            echo "Using file pattern: $OPTARG"
                            ADD_PARAMS="--pattern '$OPTARG'"
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
            esac
            getopts e:w:p: PARAM
    done
    
    if [ $DO_SHIFT -gt 0 ]; then
            shift $DO_SHIFT
    fi

    # Recompute the interval in case END_DATE was overridden with -e above.
    INTERVAL="$FROM_DATE-$END_DATE"
    
    if [ "$1x" != "x" ]; then
            echo "Adding parameters: $@"
            ADD_PARAMS="$@"
    fi
    
    # Run history tasks once to bootstrap new deployments.
    RUN_ENROLLMENTS_HISTORY=0
    RUN_GEOGRAPHY_HISTORY=0
    RUN_LEARNER_ANALYTICS_HISTORY=0
    
    # Run incremental tasks daily
    RUN_ENROLLMENTS=1
    RUN_PERFORMANCE=1
    RUN_GEOGRAPHY=1
    RUN_ENGAGEMENT=1
    RUN_VIDEO=1
    RUN_LEARNER_ANALYTICS=1
    
    # If you prefer to run the engagement task weekly rather than daily, set
    # RUN_ENGAGEMENT=0 above; this block then re-enables it on Mondays only.
    if [ $(date +%u) -eq 1 ]; then
            RUN_ENGAGEMENT=1
    fi
    
    if [ ! -d /tmp/$END_DATE ]; then
            mkdir /tmp/$END_DATE
    fi
    
    
    if [ $RUN_ENROLLMENTS_HISTORY -gt 0 ]; then
    
       # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#history-task
       $REMOTE_TASK CourseEnrollmentEventsTask \
         --interval "$INTERVAL" \
         --local-scheduler \
         --overwrite \
         --n-reduce-tasks $NUM_REDUCE_TASKS \
         $ADD_PARAMS > /tmp/$END_DATE/CourseEnrollmentEventsTask.log 2>&1
    fi
    
    if [ $RUN_ENROLLMENTS -gt 0 ]; then
    
      # https://groups.google.com/d/msg/openedx-ops/pCuzvbG1OyA/FehWsxTgBwAJ
      # Since Ginkgo, using a persistent Hive metastore causes issues with the enrollments summary data.
      # The workaround is to delete the previously calculated summary data.
      $HIVE -e 'USE default;DROP TABLE IF EXISTS course_grade_by_mode;' \
          >> /tmp/$END_DATE/cleanup.log 2>&1
      $HDFS -rm -r $HDFS_ROOT/warehouse/course_grade_by_mode/* \
          >> /tmp/$END_DATE/cleanup.log 2>&1
      $HIVE -e 'USE default;DROP TABLE IF EXISTS course_meta_summary_enrollment;' \
        >> /tmp/$END_DATE/cleanup.log 2>&1
      $HDFS -rm -r $HDFS_ROOT/warehouse/course_meta_summary_enrollment/* \
          >> /tmp/$END_DATE/cleanup.log 2>&1
    
      $REMOTE_TASK ImportEnrollmentsIntoMysql \
        --interval "$INTERVAL" \
        --local-scheduler \
        --overwrite \
        --overwrite-n-days 1 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ImportEnrollmentsIntoMysql.log 2>&1
    fi
    
    if [ $RUN_PERFORMANCE -gt 0 ]; then
    
      NOW=`date +%s`
      ANSWER_DIST_S3_BUCKET=$HADOOP_S3_PATH/intermediate/answer_dist/$NOW
    
      $REMOTE_TASK AnswerDistributionWorkflow \
        --local-scheduler \
        --src "[\"$TRACKING_LOGS_S3_PATH\"]" \
        --dest "$ANSWER_DIST_S3_BUCKET" \
        --name AnswerDistributionWorkflow \
        --output-root "$HADOOP_S3_PATH/grading_reports/" \
        --include "[\"*tracking.log*.gz\"]" \
        --manifest "$ANSWER_DIST_S3_BUCKET/manifest.txt" \
        --base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
        --lib-jar "[\"$TASK_CONFIGURATION_S3_PATH/edx-analytics-hadoop-util.jar\"]" \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --marker "$ANSWER_DIST_S3_BUCKET/marker" \
        $ADD_PARAMS > /tmp/$END_DATE/AnswerDistributionWorkflow.log 2>&1
    fi
    
    if [ $RUN_GEOGRAPHY_HISTORY -gt 0 ]; then
    
      # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id6
      $REMOTE_TASK LastDailyIpAddressOfUserTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/LastDailyIpAddressOfUserTask.log 2>&1
    fi
    
    if [ $RUN_GEOGRAPHY -gt 0 ]; then
    
      $REMOTE_TASK InsertToMysqlLastCountryPerCourseTask \
        --local-scheduler \
        --interval-end $END_DATE \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlLastCountryPerCourseTask.log 2>&1
    fi
    
    if [ $RUN_ENGAGEMENT -gt 0 ]; then
    
      # NB: hard-coded here; this overrides the default above (and any -w option) for the engagement task.
      WEEKS=24
    
      $REMOTE_TASK InsertToMysqlCourseActivityTask \
        --local-scheduler \
        --end-date $END_DATE \
        --weeks $WEEKS \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/CourseActivityWeeklyTask.log 2>&1
    fi
    
    if [ $RUN_VIDEO -gt 0 ]; then
      $REMOTE_TASK InsertToMysqlAllVideoTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlAllVideoTask.log 2>&1
    fi
    
    if [ $RUN_LEARNER_ANALYTICS_HISTORY -gt 0 ]; then
    
      # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id12
      $REMOTE_TASK ModuleEngagementIntervalTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite-from-date $END_DATE \
        --overwrite-mysql \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementIntervalTask.log 2>&1
    fi
    
    if [ $RUN_LEARNER_ANALYTICS -gt 0 ]; then
    
      $REMOTE_TASK ModuleEngagementWorkflowTask \
        --local-scheduler \
        --date $END_DATE \
        --indexing-tasks 5 \
        --throttle 0.5 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementWorkflowTask.log 2>&1
    fi
    
    rm -f $LOCKFILE
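
    For example, a crontab entry for the user that owns the pipeline environment, running the script nightly (the paths here are assumptions; point them at wherever you keep the script):

    # crontab -e -- run the pipeline every night at 02:00 and keep a log of each run
    0 2 * * * /home/hadoop/pipeline.sh >> /home/hadoop/pipeline-cron.log 2>&1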
    