Installing insights

Hello there,
As far as I can see in the docs, analytics is installed by default with Open edX starting from the Ironwood release. But after I installed Open edX, analytics was not working on port 18110, so I followed the docs and tried to install it. It looks like the docs only cover installing it on an AWS instance, but I want to install it directly on my VPS.
I tried with this script:
https://openedx.atlassian.net/wiki/spaces/OpenOPS/pages/43385371/edX+Analytics+Installation

I updated the required values, but I am receiving an error:

TASK [aws : Gather ec2 facts for use in other roles] **************************************************************************************************
fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
to retry, use: --limit @/root/configuration/playbooks/analytics_single.retry

PLAY RECAP ********************************************************************************************************************************************
localhost : ok=34 changed=7 unreachable=0 failed=1

After that, I receive another error:

GATHERING FACTS ***************************************************************
previous known host file not found
fatal: [localhost] => SSH encountered an unknown error during the connection. We recommend you re-run the command using -vvvv, which will enable SSH debugging output to help diagnose the issue

TASK: [luigi | configuration directory created] *******************************
FATAL: no hosts matched or all hosts have already failed – aborting

PLAY RECAP ********************************************************************
to retry, use: --limit @/root/task.retry

localhost : ok=0 changed=0 unreachable=1 failed=0

My configs are:

#!/bin/bash

LMS_HOSTNAME="http://xxx.xxx.xxx.243"
INSIGHTS_HOSTNAME="http://xxx.xxx.xxx.243:8110/" # Change this to the externally visible domain and scheme for your Insights install, ideally HTTPS
DB_USERNAME="xxxxxx"
DB_HOST="localhost"
DB_PASSWORD="xxxxxxxxx"
DB_PORT="3306"

Could anyone help?

Hi @ettayeb_mohamed ! Thanks for posting your question, since installing analytics is a source of frustration for a lot of people.

Not sure where you saw this, but analytics isn’t installed by default with Open edX. edX has made a lot of improvements to the devstack since it moved to Docker, so doing development with the analytics pipeline is now supported by default on the Docker devstack. But AFAIK, production deployment still requires separate deployment steps.

Yep, currently AWS is the only officially supported environment for analytics deployments, because of all the pieces required to run the analytics pipeline, which feeds data into Insights (see architecture diagram). We at OpenCraft set up analytics on AWS a lot for clients, so we’ve assembled some documentation on how to do this, but beware that it’s not straightforward: openedx-deployment.doc.opencraft.com, under Analytics.

However, AWS is cost-prohibitive for a lot of deployments, and people with small- and medium-sized LMS user bases don’t really need the massively scaled infrastructure that Open edX’s AWS analytics deployment provides. There are a couple of options.

Figures
@john and Appsembler built Figures, which provides some of the data reporting available in Open edX Insights/analytics.

Since it runs in the same python environment as the LMS, it’s much easier to install, use, and contribute to.

Depending on which version of Open edX you’re running, I’d totally recommend trying it out to see if it meets your needs. They’re happy to accept pull requests too, if you find bugs or have features you want to add!

OpenStack Analytics

OpenCraft is working on enhancing our Open edX deployment service (Ocim) to make it possible to run Insights and the Analytics Pipeline on a single OpenStack (OVH) instance.

The timeline for completing this isn’t yet known, so nothing has been upstreamed or properly documented yet. But I can share what we’ve done so far, and you’re welcome to use what you like. Again beware: it’s not a simple process.

Also note: we use S3 buckets for cost and authentication reasons, but you can use any HDFS-compatible location.

  • Based my configuration branch on our ironwood.2 release branch, cf changes made

  • Deployed using this modified playbook and these ansible variables:

    Ansible variables (replace FIXMEs with real values):

    SANDBOX_ENABLE_CERTIFICATES: false
    SANDBOX_ENABLE_ANALYTICS_API: true
    SANDBOX_ENABLE_INSIGHTS: true
    SANDBOX_ENABLE_PIPELINE: true
    INSIGHTS_NGINX_PORT: 80
    
    # packages required to install and run the pipeline
    analytics_pipeline_debian_pkgs:
      - "mysql-server-5.6"
      - python-mysqldb
      - libpq-dev
    
    NGINX_INSIGHTS_APP_EXTRA: |
      # Use /status instead of /heartbeat endpoint to keep Ocim provisioning happy
      rewrite ^/heartbeat$ /status;
    
    # Allows hadoop/hdfs to write to our S3 bucket.
    HADOOP_CORE_SITE_EXTRA_CONFIG:
      fs.s3.awsAccessKeyId: "{{ AWS_ACCESS_KEY_ID }}"
      fs.s3.awsSecretAccessKey: "{{ AWS_SECRET_ACCESS_KEY }}"
      fs.s3.region: us-east-1   # FIXME: should be a variable
      fs.s3.impl: "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
    
    # Use our mysql database for the hive database
    HIVE_METASTORE_DATABASE_HOST: "{{ EDXAPP_MYSQL_HOST }}"
    HIVE_METASTORE_DATABASE_NAME: hive
    HIVE_METASTORE_DATABASE_USER:  # FIXME
    HIVE_METASTORE_DATABASE_PASSWORD: # FIXME
    
    HIVE_SITE_EXTRA_CONFIG:
      datanucleus.autoCreateSchema: true
      datanucleus.autoCreateTables: true
      datanucleus.fixedDatastore: true
    
    # EDXAPP Variables needed by config below
    EDXAPP_LMS_ROOT_URL: "{{ EDXAPP_LMS_BASE_SCHEME | default('https') }}://{{ EDXAPP_LMS_BASE }}"
    ANALYTICS_API_LMS_BASE_URL: "{{ EDXAPP_LMS_ROOT_URL }}"
    
    # ANALYTICS_API Variables needed by playbooks
    ANALYTICS_API_EMAIL_HOST: localhost
    ANALYTICS_API_EMAIL_HOST_PASSWORD: ''
    ANALYTICS_API_EMAIL_HOST_USER: ''
    ANALYTICS_API_EMAIL_PORT: 25
    # ANALYTICS_API_GIT_IDENTITY: '{{ COMMON_GIT_IDENTITY }}'
    ANALYTICS_API_LANGUAGE_CODE: en-us
    ANALYTICS_API_PIP_EXTRA_ARGS: --use-wheel --no-index --find-links=http://edx-wheelhouse.s3-website-us-east-1.amazonaws.com/Ubuntu/precise/Python-2.7
    ANALYTICS_API_SERVICE_CONFIG:
      ANALYTICS_DATABASE: reports
      API_AUTH_TOKEN: # FIXME
      DATABASES: '{{ ANALYTICS_API_DATABASES }}'
      # nb: using default localhost elasticsearch
      EMAIL_PORT: '{{ ANALYTICS_API_EMAIL_PORT }}'
      LANGUAGE_CODE: en-us
      SECRET_KEY: '{{ ANALYTICS_API_SECRET_KEY }}'
      STATICFILES_DIRS: []
      STATIC_ROOT: '{{ COMMON_DATA_DIR }}/{{ analytics_api_service_name }}/staticfiles'
      TIME_ZONE: UTC
    # This password must be 40 characters or fewer
    ANALYTICS_API_USER_PASSWORD: # FIXME
    ANALYTICS_API_USERS:
      apiuser001: '{{ ANALYTICS_API_USER_PASSWORD }}'
      dummy-api-user: # FIXME
    
    # INSIGHTS Variables needed by playbooks
    INSIGHTS_APPLICATION_NAME: "Insights {{ EDXAPP_PLATFORM_NAME }}"
    INSIGHTS_BASE_URL: # FIXME
    INSIGHTS_CMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_CMS_BASE }}/course
    INSIGHTS_CMS_NGINX_PORT: '{{ EDXAPP_PLATFORM_NAME }}'
    INSIGHTS_CSRF_COOKIE_NAME: crsftoken
    # INSIGHTS_DATABASES stanza defined above
    INSIGHTS_DATA_API_AUTH_TOKEN: '{{ ANALYTICS_API_USER_PASSWORD }}'
    INSIGHTS_DOC_BASE: http://edx-insights.readthedocs.org/en/latest
    INSIGHTS_DOC_LOAD_ERROR_URL: http://edx-insights.readthedocs.org/en/latest/Reference.html#error-conditions
    INSIGHTS_FEEDBACK_EMAIL: dashboard@example.com
    INSIGHTS_GUNICORN_EXTRA: ''
    INSIGHTS_GUNICORN_WORKERS: '8'
    INSIGHTS_LANGUAGE_COOKIE_NAME: language
    INSIGHTS_LMS_BASE: https://{{ EDXAPP_LMS_BASE }}
    INSIGHTS_LMS_COURSE_SHORTCUT_BASE_URL: https://{{ EDXAPP_LMS_BASE }}/courses
    INSIGHTS_MKTG_BASE: 'https://{{ EDXAPP_LMS_BASE }}'
    # credentials should be auto-generated, not hardcoded here.
    INSIGHTS_OAUTH2_KEY: # FIXME
    INSIGHTS_OAUTH2_SECRET: # FIXME
    INSIGHTS_OAUTH2_URL_ROOT: https://{{ EDXAPP_LMS_BASE }}/oauth2
    INSIGHTS_OPEN_SOURCE_URL: http://code.edx.org/
    INSIGHTS_PLATFORM_NAME: '{{ EDXAPP_PLATFORM_NAME }}'
    INSIGHTS_PRIVACY_POLICY_URL: 'https://{{ EDXAPP_LMS_BASE }}/edx-privacy-policy'
    INSIGHTS_SESSION_COOKIE_NAME: sessionid
    INSIGHTS_SOCIAL_AUTH_REDIRECT_IS_HTTPS: true
    INSIGHTS_SUPPORT_EMAIL: support@example.com
    
  • Made some minor mods to the analytics pipeline, cf diff, and used that branch to run the pipeline.

  • Used this configuration for the pipeline:

    override.cfg (replace THINGS-IN-ALL-CAPS with real values):
    [hive]
    warehouse_path = s3://BUCKET-NAME-HERE/analytics/warehouse/
    
    [database-export]
    database = CLIENT-PREFIX-HERE_reports
    credentials = s3://BUCKET-NAME-HERE/analytics/config/output.json
    
    [database-import]
    database = CLIENT-PREFIX-HERE_edxapp
    credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json
    destination = s3://BUCKET-NAME-HERE/analytics/warehouse/
    
    [otto-database-import]
    database = CLIENT-PREFIX-HERE_ecommerce
    credentials = s3://BUCKET-NAME-HERE/analytics/config/input.json
    
    [map-reduce]
    engine = hadoop
    marker = s3://BUCKET-NAME-HERE/analytics/marker/
    lib_jar = [
        "hdfs://localhost:9000/lib/hadoop-aws-2.7.2.jar",
        "hdfs://localhost:9000/lib/aws-java-sdk-1.7.4.jar"]
    
    [event-logs]
    pattern = [".*tracking.log-(?P<date>[0-9]+).*"]
    expand_interval = 30 days
    source = ["s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/"]
    
    [event-export]
    output_root = s3://BUCKET-NAME-HERE/analytics/event-export/output/
    environment = simple
    config = s3://BUCKET-NAME-HERE/analytics/event_export/config.yaml
    gpg_key_dir = s3://BUCKET-NAME-HERE/analytics/event_export/gpg-keys/
    gpg_master_key = master@key.org
    required_path_text = FakeServerGroup
    
    [event-export-course]
    output_root = s3://BUCKET-NAME-HERE/analytics/event-export-by-course/output/
    
    [manifest]
    threshold = 500
    input_format = org.edx.hadoop.input.ManifestTextInputFormat
    lib_jar = s3://BUCKET-NAME-HERE/analytics/packages/edx-analytics-hadoop-util.jar
    path = s3://BUCKET-NAME-HERE/analytics/manifest/
    
    [user-activity]
    overwrite_n_days = 10
    output_root = s3://BUCKET-NAME-HERE/analytics/activity/
    
    [answer-distribution]
    valid_response_types = customresponse,choiceresponse,optionresponse,multiplechoiceresponse,numericalresponse,stringresponse,formularesponse
        
    [enrollments]
    interval_start = 2017-01-01
    overwrite_n_days = 3
    blacklist_date = 2001-01-01
    blacklist_path = s3://BUCKET-NAME-HERE/analytics/enrollments-blacklist/
    
    [enrollment-reports]
    src = s3://BUCKET-NAME-HERE/CLIENT-PREFIX-HERE/logs/tracking/
    destination = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/output/
    offsets = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/offsets.tsv
    blacklist = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/course_blacklist.tsv
    history = s3://BUCKET-NAME-HERE/analytics/enrollment_reports/enrollment_history.tsv
    
    [course-summary-enrollment]
    # JV - course catalog is optional, and was causing CourseProgramMetadataInsertToMysqlTask errors.
    # enable_course_catalog = true
    enable_course_catalog = false
    
    [financial-reports]
    shoppingcart-partners = {"DEFAULT": "edx"}
    
    [geolocation]
    geolocation_data = s3://BUCKET-NAME-HERE/analytics/packages/GeoIP.dat
     
    [location-per-course]
    interval_start = 2017-01-01
    overwrite_n_days = 3
    
    [calendar]
    interval = 2017-01-01-2030-01-01
    
    [videos]
    dropoff_threshold = 0.05
    allow_empty_insert = true
    overwrite_n_days = 3
    
    [elasticsearch]
    host = ["http://localhost:9200/"]
    
    [module-engagement]
    alias = roster_1_2
    number_of_shards = 5
    overwrite_n_days = 3
    allow_empty_insert = true
    
    [ccx]
    enabled = false
    
    [problem-response]
    report_fields = [
        "username",
        "problem_id",
        "answer_id",
        "location",
        "question",
        "score",
        "max_score",
        "correct",
        "answer",
        "total_attempts",
        "first_attempt_date",
        "last_attempt_date"]
    report_output_root = s3://BUCKET-NAME-HERE/analytics/reports/
    
    [edx-rest-api]
    # Create using:
    # ./manage.py lms --settings=devstack create_oauth2_client  \
    #   http://localhost:9999  # URL does not matter \
    #   http://localhost:9999/complete/edx-oidc/  \
    #   confidential \
    #   --client_name "Analytics Pipeline" \
    #   --client_id oauth_id \
    #   --client_secret oauth_secret \
    #   --trusted
    client_id = oauth_id
    client_secret = oauth_secret
    auth_url = https://LMS_URL_HERE/oauth2/access_token/
    
    [course-list]
    api_root_url = https://LMS_URL_HERE/api/courses/v1/courses/
    
    [course-blocks]
    api_root_url = https://LMS_URL_HERE/api/courses/v1/blocks/
    
  • Then, the analytics tasks can be run on the local machine using this script. Schedule it to run daily via cron to keep your data updated.

    pipeline.sh (replace the variables in the FIXME block with real values):
    #!/bin/bash
    
    # Acquire lock using this script itself as the lockfile.
    # If another pipeline task is already running, then exit immediately.
    exec 200<$0
    flock -n 200 || { echo "`date` Another pipeline task is already running."; exit 1; }
    
    # Run as hadoop user
    . $HOME/hadoop/hadoop_env
    . $HOME/venvs/pipeline/bin/activate
    cd $HOME/pipeline
    
    export OVERRIDE_CONFIG=$HOME/override.cfg
    
    HIVE='hive'
    HDFS="hadoop fs"
    
    # FIXME set these variables
    FROM_DATE=2017-01-01
    NUM_REDUCE_TASKS=12
    TRACKING_LOGS_S3_BUCKET="s3://TRACKING-LOG-BUCKET-GOES-HERE"
    TRACKING_LOGS_S3_PATH="$TRACKING_LOGS_S3_BUCKET/logs/tracking/"
    HADOOP_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"  # bucket/path for temporary/intermediate storage
    HADOOP_S3_PATH="$HADOOP_S3_BUCKET/analytics"
    HDFS_ROOT="$HADOOP_S3_PATH"
    TASK_CONFIGURATION_S3_BUCKET="$TRACKING_LOGS_S3_BUCKET"  # bucket/path containing task configuration files
    TASK_CONFIGURATION_S3_PATH="$TASK_CONFIGURATION_S3_BUCKET/analytics/packages/"
    # /FIXME set these variables
    
    END_DATE=$(date +"%Y-%m-%d")
    INTERVAL="$FROM_DATE-$END_DATE"
    REMOTE_TASK="launch-task"
    WEEKS=10
    ADD_PARAMS=""
    LOCKFILE=/tmp/pipeline-tasks.lock
    
    if [ -f $LOCKFILE ]; then
            echo "This script is already running."
            exit
    else
            touch $LOCKFILE
    fi
    
    DO_SHIFT=0
    getopts e:w:p: PARAM
    while [ $? -eq 0 ]; do
            case "$PARAM" in
                    (e)
                            echo "Using end_date: $OPTARG"
                            END_DATE=$OPTARG
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
                    (w)
                            echo "Using WEEKS=$OPTARG"
                            WEEKS=$OPTARG
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
                    (p)
                            echo "Using file pattern: $OPTARG"
                            ADD_PARAMS="--pattern '$OPTARG'"
                            DO_SHIFT=$(( $DO_SHIFT + 2 ))
                            ;;
            esac
            getopts e:w:p: PARAM
    done
    
    if [ $DO_SHIFT -gt 0 ]; then
            shift $DO_SHIFT
    fi
    
    if [ "$1x" != "x" ]; then
            echo "Adding parameters: $@"
            ADD_PARAMS="$@"
    fi
    
    # Run history tasks once to bootstrap new deployments.
    RUN_ENROLLMENTS_HISTORY=0
    RUN_GEOGRAPHY_HISTORY=0
    RUN_LEARNER_ANALYTICS_HISTORY=0
    
    # Run incremental tasks daily
    RUN_ENROLLMENTS=1
    RUN_PERFORMANCE=1
    RUN_GEOGRAPHY=1
    RUN_ENGAGEMENT=1
    RUN_VIDEO=1
    RUN_LEARNER_ANALYTICS=1
    
    # Run engagement task if today is a Monday
    if [ $(date +%u) -eq 1 ]; then
            RUN_ENGAGEMENT=1
    fi
    
    if [ ! -d /tmp/$END_DATE ]; then
            mkdir /tmp/$END_DATE
    fi
    
    
    if [ $RUN_ENROLLMENTS_HISTORY -gt 0 ]; then
    
       # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#history-task
       $REMOTE_TASK CourseEnrollmentEventsTask \
         --interval "$INTERVAL" \
         --local-scheduler \
         --overwrite \
         --n-reduce-tasks $NUM_REDUCE_TASKS \
         $ADD_PARAMS > /tmp/$END_DATE/CourseEnrollmentEventsTask.log 2>&1
    fi
    
    if [ $RUN_ENROLLMENTS -gt 0 ]; then
    
      # https://groups.google.com/d/msg/openedx-ops/pCuzvbG1OyA/FehWsxTgBwAJ
      # Since Ginkgo, using a persistent Hive metastore causes issues with the enrolments summary data.
      # The workaround is to delete the previously calculated summary data.
      $HIVE -e 'USE default;DROP TABLE IF EXISTS course_grade_by_mode;' \
          >> /tmp/$END_DATE/cleanup.log 2>&1
      $HDFS -rm -r $HDFS_ROOT/warehouse/course_grade_by_mode/* \
          >> /tmp/$END_DATE/cleanup.log 2>&1
      $HIVE -e 'USE default;DROP TABLE IF EXISTS course_meta_summary_enrollment;' \
        >> /tmp/$END_DATE/cleanup.log 2>&1
      $HDFS -rm -r $HDFS_ROOT/warehouse/course_meta_summary_enrollment/* \
          >> /tmp/$END_DATE/cleanup.log 2>&1
    
      $REMOTE_TASK ImportEnrollmentsIntoMysql \
        --interval "$INTERVAL" \
        --local-scheduler \
        --overwrite \
        --overwrite-n-days 1 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ImportEnrollmentsIntoMysql.log 2>&1
    fi
    
    if [ $RUN_PERFORMANCE -gt 0 ]; then
    
      NOW=`date +%s`
      ANSWER_DIST_S3_BUCKET=$HADOOP_S3_PATH/intermediate/answer_dist/$NOW
    
      $REMOTE_TASK AnswerDistributionWorkflow \
        --local-scheduler \
        --src "[\"$TRACKING_LOGS_S3_PATH\"]" \
        --dest "$ANSWER_DIST_S3_BUCKET" \
        --name AnswerDistributionWorkflow \
        --output-root "$HADOOP_S3_PATH/grading_reports/" \
        --include "[\"*tracking.log*.gz\"]" \
        --manifest "$ANSWER_DIST_S3_BUCKET/manifest.txt" \
        --base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
        --lib-jar "[\"$TASK_CONFIGURATION_S3_PATH/edx-analytics-hadoop-util.jar\"]" \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --marker "$ANSWER_DIST_S3_BUCKET/marker" \
        $ADD_PARAMS > /tmp/$END_DATE/AnswerDistributionWorkflow.log 2>&1
    fi
    
    if [ $RUN_GEOGRAPHY_HISTORY -gt 0 ]; then
    
      # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id6
      $REMOTE_TASK LastDailyIpAddressOfUserTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/LastDailyIpAddressOfUserTask.log 2>&1
    fi
    
    if [ $RUN_GEOGRAPHY -gt 0 ]; then
    
      $REMOTE_TASK InsertToMysqlLastCountryPerCourseTask \
        --local-scheduler \
        --interval-end $END_DATE \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlLastCountryPerCourseTask.log 2>&1
    fi
    
    if [ $RUN_ENGAGEMENT -gt 0 ]; then
    
      WEEKS=24
    
      $REMOTE_TASK InsertToMysqlCourseActivityTask \
        --local-scheduler \
        --end-date $END_DATE \
        --weeks $WEEKS \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/CourseActivityWeeklyTask.log 2>&1
    fi
    
    if [ $RUN_VIDEO -gt 0 ]; then
      $REMOTE_TASK InsertToMysqlAllVideoTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/InsertToMysqlAllVideoTask.log 2>&1
    fi
    
    if [ $RUN_LEARNER_ANALYTICS_HISTORY -gt 0 ]; then
    
      # http://edx-analytics-pipeline-reference.readthedocs.io/en/latest/running_tasks.html#id12
      $REMOTE_TASK ModuleEngagementIntervalTask \
        --local-scheduler \
        --interval $INTERVAL \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        --overwrite-from-date $END_DATE \
        --overwrite-mysql \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementIntervalTask.log 2>&1
    fi
    
    if [ $RUN_LEARNER_ANALYTICS -gt 0 ]; then
    
      $REMOTE_TASK ModuleEngagementWorkflowTask \
        --local-scheduler \
        --date $END_DATE \
        --indexing-tasks 5 \
        --throttle 0.5 \
        --n-reduce-tasks $NUM_REDUCE_TASKS \
        $ADD_PARAMS > /tmp/$END_DATE/ModuleEngagementWorkflowTask.log 2>&1
    fi
    
    rm -f $LOCKFILE
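
For the daily scheduling mentioned above, a cron entry along these lines could work. It is only a sketch with made-up specifics: the 02:00 schedule, the `hadoop` user, the `/home/hadoop/pipeline.sh` location and the log file path are all assumptions. The snippet just prints the entry so it can be reviewed before it goes into a file under `/etc/cron.d/`.

```shell
#!/bin/sh
# Hypothetical values; adjust to where pipeline.sh actually lives.
PIPELINE_USER="${PIPELINE_USER:-hadoop}"
PIPELINE_SCRIPT="${PIPELINE_SCRIPT:-/home/hadoop/pipeline.sh}"

cron_entry() {
    # /etc/cron.d format: minute hour day-of-month month day-of-week user command
    printf '0 2 * * * %s /bin/bash %s >> /tmp/pipeline-cron.log 2>&1\n' \
        "$PIPELINE_USER" "$PIPELINE_SCRIPT"
}

cron_entry
```

Since the script takes its own lock on startup, an overlapping run started by cron should exit instead of running twice.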
    

Hello @jill! Thank you for replying. Sadly I am not able to use AWS, but I got some help on Slack from @sambapete (big thanks to him), so I disabled the AWS-related tasks this way:

AWS_GATHER_FACTS: false
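
For anyone following along, one way to apply an override like this is through an extra-vars file passed to the playbook. This is only a sketch: the file path is made up, the playbook name and the `-i localhost, -c local` flags are taken from the command that appears later in this thread, and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch: keep local overrides in one extra-vars file.
# VARS_FILE is a hypothetical location; adjust to taste.
VARS_FILE="${VARS_FILE:-/tmp/analytics-overrides.yml}"

write_overrides() {
    # Skip the AWS fact-gathering task when not running on EC2.
    printf '%s\n' 'AWS_GATHER_FACTS: false' > "$1"
}

playbook_command() {
    # Print the ansible-playbook invocation instead of running it,
    # so it can be reviewed first. "-e @file" loads variables from a file.
    printf 'ansible-playbook -i localhost, -c local analytics_single.yml -e @%s\n' "$1"
}

write_overrides "$VARS_FILE"
playbook_command "$VARS_FILE"
```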

Then I solved another problem with the database migration, and now I am stuck on the following task:

Task analytics_pipeline : enable Hadoop services
Placement: configuration/playbooks/roles/analytics_pipeline/tasks/main.yml:136
Error message: Could not find the requested service ['hdfs-namenode', 'hdfs-datanode', 'yarn-resourcemanager', 'yarn-nodemanager', 'yarn-proxyserver', 'mapreduce-historyserver

The hadoop user is there and everything seems fine… I am trying to solve this for now.

Glad you’re making progress here @ettayeb_mohamed!

A quick google search for that error suggests that there’s an issue with the ansible service module on some systems. If you add daemon_reload: yes to that task as suggested here, does it help?
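
To try the same thing by hand, something like the following could be used. It is a rough sketch that assumes the Hadoop services are systemd units with the names shown in the error message; by default it only prints the commands (`DRY_RUN=1`), so nothing is changed until that is switched off.

```shell
#!/bin/sh
# Service names copied from the error message above.
HADOOP_SERVICES="hdfs-namenode hdfs-datanode yarn-resourcemanager yarn-nodemanager yarn-proxyserver mapreduce-historyserver"

# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

enable_hadoop_services() {
    # Make systemd re-read its unit files, which is what
    # daemon_reload: yes asks Ansible to do.
    run systemctl daemon-reload
    for svc in $HADOOP_SERVICES; do
        run systemctl enable "$svc"
        run systemctl start "$svc"
    done
}

enable_hadoop_services
```

Run with `DRY_RUN=0` as root to actually apply the commands.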

Hi @jill! Actually I just enabled those services manually and commented out the check there, because even after adding daemon_reload: yes and trying a lot of other things it still wasn’t working… So I bypassed those steps and fixed some other things, and finally it seems like everything went well and the installation is done :grin: :grin: Now I am having problems with setting up the authentication :frowning: It redirects me to 127.0.0.1:8000 even after updating insights.yml and lms.env.json with the public IP address… I also added the trusted client in my admin dashboard… this is weird…
Are there any updated or clearer steps to fix this? I believe I did everything that is in the docs.

Do any of these tips help?

https://openedx-deployment.doc.opencraft.com/en/latest/analytics/insights/#oauth2

Most of the steps there are already done… the problem is this redirect to 127.0.0.1:8000. I cannot find where I should update it to make the redirection go to the public IP and not 127.0.0.1.

There are a couple of places where redirect URLs are specified during authentication:

  • /edx/etc/insights.yml – the SOCIAL_AUTH_EDX_OIDC_* variables.
  • The LMS Django Admin, URL ending in /admin/oauth2/client/: the redirect URI
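
A quick way to hunt for a leftover localhost address is to grep the config files. A minimal sketch, assuming the usual file locations for this kind of install (the exact paths are assumptions, adjust them for your setup):

```shell
#!/bin/sh
# Print every line in the given files that still mentions the
# localhost address from the bad redirect. Missing files are skipped.
find_localhost_refs() {
    for f in "$@"; do
        [ -f "$f" ] && grep -Hn '127\.0\.0\.1:8000' "$f"
    done
    return 0
}

# Assumed locations; pass whichever files exist on your server.
find_localhost_refs \
    /edx/etc/insights.yml \
    /edx/app/edxapp/lms.env.json \
    /edx/app/insights/edx_analytics_dashboard/analytics_dashboard/settings/base.py
```

Each hit is printed as `file:line:content`. The redirect URI stored in the LMS Django Admin lives in the database, so it still has to be checked in the admin page itself.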

I really appreciate your reply!
I solved that issue, sadly by editing /edx/app/insights/edx_analytics_dashboard/analytics_dashboard/settings/base.py directly. It looks like restarting Insights does not pick up any changes… (very weird… anyway).
Right now I am having another issue, which is:
invalid_request The requested redirect didn't match the client settings.
I tried the troubleshooting section here: https://openedx-deployment.doc.opencraft.com/en/latest/analytics/insights/#oauth2
The links are all good… but I am still getting that error… :frowning:

Yep, OAuth is tricky. Note that it’s not Open edX making this hard, the django social authentication settings have to be exactly right.

I need more information about your config to debug this… can you post your /edx/etc/insights.yml, your full LMS URL, the LMS_BASE_SCHEME from /edx/etc/lms.yml, and the values in the /admin/oauth2/client/ created for Insights? (with keys and secrets redacted of course) There’s a mismatch in there somewhere.
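
As an illustration of the kind of mismatch to look for: the redirect URI registered in the LMS has to match, character for character, what Insights sends. The sketch below assumes the callback path is `/complete/edx-oidc/` (the path used in the pipeline config comment earlier in this thread; a different Insights version may use a different backend name), and the URLs are placeholders.

```shell
#!/bin/sh
# Build the callback URL Insights is expected to send.
# $1 = Insights base URL, $2 = auth backend name (assumed, e.g. edx-oidc)
expected_redirect_uri() {
    base="${1%/}"            # drop a single trailing slash, if any
    printf '%s/complete/%s/\n' "$base" "$2"
}

# Compare against the redirect URI registered under /admin/oauth2/client/.
# $3 = the redirect URI copied from the client entry
check_redirect_uri() {
    expected="$(expected_redirect_uri "$1" "$2")"
    if [ "$expected" = "$3" ]; then
        echo "OK: $3"
    else
        echo "MISMATCH: expected $expected but client has $3"
        return 1
    fi
}

# Placeholder values, for illustration only.
check_redirect_uri "http://insights.example.com:18110" "edx-oidc" \
    "http://insights.example.com:18110/complete/edx-oidc/"
```

Scheme (http vs https), port, and the trailing slash all count, so it helps to paste the values in rather than retype them.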

Hello @jill, here are my files and everything… I am totally tired of this…
My client config with everything in the clear (I don’t care anymore about keys and secrets… I will remove this afterwards):

Here is the important part of the insights.yml:

SOCIAL_AUTH_EDX_OIDC_ID_TOKEN_DECRYPTION_KEY: 92fb605d041bfaa8e8f69ccb4abfb620e3f7c35a
SOCIAL_AUTH_EDX_OIDC_ISSUER: http://51.91.253.243/oauth2
SOCIAL_AUTH_EDX_OIDC_KEY: 3d7050fb2085a2c2a325
SOCIAL_AUTH_EDX_OIDC_LOGOUT_URL: http://51.91.253.243/logout
SOCIAL_AUTH_EDX_OIDC_SECRET: 92fb605d041bfaa8e8f69ccb4abfb620e3f7c35a
SOCIAL_AUTH_EDX_OIDC_URL_ROOT: http://51.91.253.243/oauth2
SOCIAL_AUTH_REDIRECT_IS_HTTPS: false

Here is the important part from my lms.env.json:

"JWT_EXPIRATION": 30, 
"JWT_ISSUER": "http://51.91.253.243/oauth2", 
"JWT_PRIVATE_SIGNING_KEY": null, 
"LANGUAGE_CODE": "en", 
"LANGUAGE_COOKIE": "openedx-language-preference", 
"LMS_BASE": "51.91.253.243", 
"LMS_INTERNAL_ROOT_URL": "http://51.91.253.243", 
"LMS_ROOT_URL": "http://51.91.253.243", 



"OAUTH_DELETE_EXPIRED": true,
"OAUTH_ENFORCE_SECURE": false,
"OAUTH_EXPIRE_CONFIDENTIAL_CLIENT_DAYS": 365,
"OAUTH_EXPIRE_PUBLIC_CLIENT_DAYS": 30,
"OAUTH_OIDC_ISSUER": "http://51.91.253.243/oauth2",

@ettayeb_mohamed Hey, looks like you sorted it out? What was the fix?

I was able to register a new account on your LMS, and was able to authenticate. Getting a 403 on the Insights home page, but that’s usual (unfortunately) if the pipeline tasks haven’t run yet.

Hello @jill,
I think the problem was with the Insights version… it seems the LMS was working with OIDC and Insights with OAuth2 (/complete/edx-oidc vs /complete/edx-oauth2/).
The solution was to add some variables to the ansible-playbook command, this way:
ansible-playbook -i localhost, -c local analytics_single.yml --extra-vars "INSIGHTS_LMS_BASE=<LMS DOMAIN> INSIGHTS_VERSION=open-release/ironwood.master ANALYTICS_API_VERSION=open-release/ironwood.master"
I think it was installing another version of Insights, that’s all…
Right now all is good, I even solved all the Hadoop problems etc… but I cannot run the pipeline tasks and I don’t know why :frowning:

(pipeline) root@vps759767:~/edx-analytics-pipeline# remote-task --host localhost --user root --remote-name analyticstack --skip-setup --wait ImportEnrollmentsIntoMysql --interval 2016 --local-scheduler
Parsed arguments = Namespace(branch='release', extra_repo=None, host='localhost', job_flow_id=None, job_flow_name=None, launch_task_arguments=['ImportEnrollmentsIntoMysql', '--interval', '2016', '--local-scheduler'], log_path=None, override_config=None, package=None, private_key=None, python_version=None, remote_name='analyticstack', repo=None, secure_config=None, secure_config_branch=None, secure_config_repo=None, shell=None, skip_setup=True, sudo_user='hadoop', user='root', vagrant_path=None, verbose=False, virtualenv_extra_args=None, wait=True, wheel_url=None, workflow_profiler=None)
Running commands from path = /root/pipeline/share/edx.analytics.tasks
Remote name = analyticstack
Running command = ['ssh', '-tt', '-o', 'ForwardAgent=yes', '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PasswordAuthentication=no', '-o', 'User=root', '-o', 'ConnectTimeout=10', 'localhost', "sudo -Hu hadoop /bin/bash -c 'cd /var/lib/analytics-tasks/analyticstack/repo && . $HOME/.bashrc && . /var/lib/analytics-tasks/analyticstack/venv/bin/activate && launch-task ImportEnrollmentsIntoMysql --interval 2016 --local-scheduler'"]
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
/bin/bash: line 0: cd: /var/lib/analytics-tasks/analyticstack/repo: No such file or directory
Connection to localhost closed.
Exiting with status = 1
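
The `cd` failure in the log suggests that the checkout `remote-task` expects does not exist yet, which would presumably be the case when `--skip-setup` is passed before a first full run has created it. A small pre-flight check, assuming the directory layout shown in the log:

```shell
#!/bin/sh
# Decide whether --skip-setup looks safe for a given remote name.
# TASKS_ROOT matches the path in the log; override it for testing.
TASKS_ROOT="${TASKS_ROOT:-/var/lib/analytics-tasks}"

skip_setup_flag() {
    # $1 = remote name (e.g. analyticstack)
    if [ -d "$TASKS_ROOT/$1/repo" ]; then
        echo "--skip-setup"
    else
        echo ""   # no checkout yet, so leave the flag off
    fi
}

echo "remote-task flag for analyticstack: '$(skip_setup_flag analyticstack)'"
```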

I am getting this error when I run the tasks:

There’s an error in your screenshot that could be the culprit:

Required argument: -input

Are there any tracking logs under hdfs://localhost:9000/data/ that match the configured pattern .*tracking.log.*?
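
One way to answer that from the shell is to list the HDFS directory and filter the listing with the configured pattern. A minimal sketch: `hadoop fs -ls` is the standard listing command and the HDFS URL is the one above, while the helper function names are made up. The filter reads a listing on stdin, so it can be tried without a cluster.

```shell
#!/bin/sh
# Keep only lines of a listing that match the configured tracking-log pattern.
filter_tracking_logs() {
    grep -E '.*tracking.log.*'
}

# Count matching entries; prints 0 when nothing matches.
count_tracking_logs() {
    filter_tracking_logs | wc -l | tr -d ' '
}

# On the analytics host (needs Hadoop on the PATH), something like:
#   hadoop fs -ls hdfs://localhost:9000/data/ | count_tracking_logs
# Here a fake two-line listing stands in for the real one.
printf '%s\n' '/data/tracking.log-20200101.gz' '/data/other.txt' | count_tracking_logs
```

A count of 0 would mean the pipeline task has no input to read.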

I have a tracking.log file under /edx/var/log/tracking, that’s it!
I think hdfs://localhost:9000/data/ should point there?

The pipeline tasks want to read the tracking logs from HDFS (or S3, when configured to read from there), so you should sync your tracking logs to that HDFS store periodically.

The analytics devstack does this with a cron job, see analytics_pipeline playbook.
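
Such a sync could look roughly like the following. This is a sketch, not the devstack's actual cron job: it assumes the local log directory and HDFS target mentioned above, that `hadoop` is on the PATH of the user running it, and it defaults to printing the commands (`DRY_RUN=1`) rather than running them.

```shell
#!/bin/sh
LOG_DIR="${LOG_DIR:-/edx/var/log/tracking}"
HDFS_TARGET="${HDFS_TARGET:-hdfs://localhost:9000/data/}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

sync_tracking_logs() {
    # Create the target directory if it does not exist yet.
    run hadoop fs -mkdir -p "$HDFS_TARGET"
    for f in "$LOG_DIR"/tracking.log*; do
        # Skip the unexpanded glob when there are no log files.
        [ -e "$f" ] || continue
        # -f overwrites an existing copy, so re-runs pick up the
        # current log file as it grows.
        run hadoop fs -put -f "$f" "$HDFS_TARGET"
    done
}

sync_tracking_logs
```

Running it with `DRY_RUN=0` as a user with write access to HDFS, before the pipeline tasks start, should give them something to read.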


Actually that was the main problem!!! There was nothing in that HDFS store!
When I ran that playbook I didn’t receive any error, so I thought that all was good :frowning:
I reran that manually and restarted the tasks, and everything worked. Then I ran the sync db command, and finally everything is working well and the dashboard is there!!!
Big thanks to you @jill!!! :heart_eyes: :smiling_face_with_three_hearts:


I will prepare a full guide in the next few days and share it with you.


Thank you for your persistence @ettayeb_mohamed! So pleased you got it working.