I am currently in the process of determining the steps we would need to take to migrate our environment from Native to Tutor.
Today, I have a question about tracking logs, and I thought the data working group would be the appropriate place to ask.
How are tracking logs currently generated?
I am asking in the context of Tutor, where the standard /edx/var/log/tracking/tracking.log files do not seem to be generated in the same format as under Native.
Yes, they are in $TUTOR_ROOT/data/logs/tracking.log, but each line has a prefix like “2022-03-07 13:29:06,940 INFO 26 [tracking] [user None] [ip 66.249.93.13] logger.py:41 -”.
Therefore, I would need to process them beforehand in order for them to be usable by Insights.
Am I assuming correctly? Or is there already a process to convert from the Tutor format used in tracking.log to the old format used in the Native tracking.log?
I am asking because we have invested a lot of time and effort in Insights. When we migrate from Native to Tutor, we will still need to provide our instructors with data on their courses, some of which, I will admit, might still be in progress when moving from one type of installation to the other.
At this time, this looks like a no-win scenario if we don’t have a “free” working analytics solution in the near future to replace Insights.
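To illustrate what I mean by processing them beforehand, here is a rough, untested sketch, assuming the JSON event always starts at the first “{” on the line (the file names are just placeholders):

import json

def strip_tutor_prefix(src_path, dst_path):
    # Drop the "2022-03-07 13:29:06,940 INFO 26 [tracking] ... logger.py:41 -" prefix
    # and keep only the raw JSON event, i.e. the Native-style line.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            start = line.find("{")
            if start == -1:
                continue  # no JSON payload on this line
            event = line[start:].strip()
            try:
                json.loads(event)  # sanity check that what remains is valid JSON
            except json.JSONDecodeError:
                continue
            dst.write(event + "\n")

strip_tutor_prefix("tracking.log", "tracking.converted.log")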
So I have something to add here that might or might not be helpful for you.
A while ago, for a personal project (which is not active at the moment), I spent some time creating a Python library that converts Nginx logs to a CSV-like format.
The idea is that once the logs are converted to CSV, they can easily be used with Excel, pandas, etc.
Here is a link to a GitHub gist with an example of using a regex pattern to parse the logs.
And here is my code, which I used at some point to convert nginx logs to a CSV file:
import os
import re
from datetime import datetime

import numpy as np
import pandas as pd

# Regex for one access-log line; the named groups become the CSV columns.
lineformat = re.compile( r"""( ["](?P<host>.*)["]) - (?P<ipaddress>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (?P<remoteuser>.+) \[(?P<dateandtime>\d{2}\/[a-z]{3}\/\d{4}:\d{2}:\d{2}:\d{2} (\+|\-)\d{4})\] (((\"(?P<method>.+) )(?P<url>.+)(http\/[1-2]\.[0-9]"))|(["](?P<badurl>.*)["])) (?P<statuscode>\d{3}) (?P<bytessent>\d+) (["](?P<refferer>(\-)|(.+))["]) (["](?P<useragent>.+)["]) (["](?P<xforward>.+)["])""", re.IGNORECASE)

# The regex pattern above is based on this log format (as configured in the nginx config file):
#   log_format main ' "$host" - $remote_addr - $remote_user [$time_local] "$request" '
#                   '$status $body_bytes_sent "$http_referer" '
#                   '"$http_user_agent" "$http_x_forwarded_for"';

DIRECTORY = "logs_last/"

def readfile(path):
    # Parse one log file; return the parsed lines and the lines the regex could not handle.
    bad = []
    result = []
    with open(os.path.join(DIRECTORY, path)) as f:
        for line in f:
            data = lineformat.search(line.rstrip("\n"))
            if data is None:
                bad.append(line.rstrip("\n"))
            else:
                data = data.groupdict()
                data["dateandtime"] = datetime.strptime(data["dateandtime"], "%d/%b/%Y:%H:%M:%S %z")
                result.append(data)
    return {"result": result, "bad": bad}

def readfiles():
    # Parse every file in DIRECTORY whose name starts with 'access' and merge the results.
    files = [f for f in os.listdir(DIRECTORY) if f.startswith("access")]
    all_bad = []
    all_result = []
    for f in files:
        file_result = readfile(f)
        all_bad += file_result["bad"]
        all_result += file_result["result"]
    return {"result": all_result, "bad": all_bad}

def create_csv():
    # Build a DataFrame from the parsed lines, flag bot traffic, and write everything to CSV.
    dicts = readfiles()
    df = pd.DataFrame.from_dict(dicts["result"])
    df["is_bot"] = False
    df["is_bot"] = np.where(df["useragent"].str.contains("bot|Spotify|Bot|iTMS", regex=True, na=False), True, df["is_bot"])
    df.to_csv("logs_last.csv")
    return df

create_csv()
print(readfile("access.log"))
Here are some facts/considerations about the above code:
The parsing regex is inspired by the GitHub gist shared above.
It is still a work in progress; it might not work, might be buggy, and is a bit chaotic.
There might be some things you don’t need; for example, the code was written to handle the case where there are multiple log files in a directory, all of them prefixed with access.
I also added a condition to parse/count requests issued by bots, e.g. Google search, Bing, etc. (you might need to discard that).
I imagine it might be useful for you once you tweak the regex pattern above, but I might be wrong; I leave it to you to judge.
A bad count/instance means that re.search failed to parse the line; I used that to keep track of which lines the regex pattern failed on, and I kept tweaking until len(all_bad) was 0.
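For example, something along these lines (using the functions above) is roughly how I checked the pattern's coverage:

parsed = readfiles()
print(len(parsed["result"]), "lines parsed")
print(len(parsed["bad"]), "lines the regex could not handle")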
Thanks @ghassan, I will look into it. There might also be hints in the edx-platform code on how the tracking logs are generated. I haven’t really checked.
@ghassan you might want to look at the post from @regis on Tutor’s Discourse
That might be a simpler solution at this time if we decide to stay with Insights.
As we are still in the process of evaluating our migration from Native to Tutor, we still need to acquire a licence for the Tutor Wizard Edition to check if Cairn would do what we need.
@sambapete yeah, I have just checked it, and I guess you were referring to the type of logs, which is a Python dict / JSON per line. That being said, yes, it should be way simpler to parse; I suppose one could do it with a Python script calling json.loads on each line, etc.
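As a minimal sketch of that, assuming anything before the first “{” on a line is just the logging prefix (the file name is a placeholder):

import json
import pandas as pd

events = []
with open("tracking.log") as f:
    for line in f:
        start = line.find("{")
        if start == -1:
            continue  # line carries no JSON payload
        try:
            events.append(json.loads(line[start:]))
        except json.JSONDecodeError:
            pass  # skip lines the parser cannot handle

# Each tracking event is now a plain Python dict, and can go straight into a DataFrame.
df = pd.DataFrame(events)
print(df.head())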