Cleaning up HDFS in Insights

I have a question about Insights and cleaning up the /edx-analytics-pipeline/warehouse in HDFS.

Is it safe to assume that, once all the processing is done by Hadoop, Hive, and Sqoop and the data has been loaded into MySQL, we could clean up some of the directories under /edx-analytics-pipeline/warehouse/module_engagement_roster/dt=2021-*, for example?

I know the size of the subdirectories grows day by day because of the incremental tasks; do I need to keep them on disk?
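To get a sense of which partitions are candidates for cleanup before touching anything, here is a minimal sketch that picks out `dt=` partitions older than a cutoff date from a path listing (e.g. the paths column of `hdfs dfs -ls -R /edx-analytics-pipeline/warehouse`). The helper name and the example listing are hypothetical; the `warehouse/<table>/dt=YYYY-MM-DD` layout matches the paths discussed in this thread.

```python
from datetime import date

def old_partitions(paths, cutoff):
    """Return paths whose dt=YYYY-MM-DD component is before `cutoff`."""
    out = []
    for p in paths:
        for part in p.split("/"):
            if part.startswith("dt="):
                if date.fromisoformat(part[3:]) < cutoff:
                    out.append(p)
    return out

# Hypothetical listing, as you might collect from `hdfs dfs -ls -R`:
listing = [
    "/edx-analytics-pipeline/warehouse/module_engagement_roster/dt=2021-08-30",
    "/edx-analytics-pipeline/warehouse/module_engagement_roster/dt=2021-09-01",
]
print(old_partitions(listing, date(2021, 9, 1)))
# → ['/edx-analytics-pipeline/warehouse/module_engagement_roster/dt=2021-08-30']
```

This only identifies candidates; as the rest of the thread shows, whether they are actually safe to remove is a separate question.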

Any suggestions would be greatly appreciated.

I’ve found that when the corresponding files are not present on the initial run of a task, it can fail with an error, so I’ve left them in place as is. That said, the incremental runs do go a lot faster when all of the files are not present.

It’s a 50-50 sort of thing: sometimes it works for me to run the task with the files in place and remove them later, and sometimes it doesn’t. When it doesn’t, I just remove everything and start from scratch.

Starting from scratch was what we usually did. But recalculating everything since 2015 would take a long time, especially for the historical tasks.

I tried removing some of the incremental directories created in HDFS by the different daily tasks, but I discovered a strange side effect this morning: I had removed almost everything up to 2021-09-01, and suddenly the dashboard only presented me data from 2021-09-01 onward. Not exactly what I was expecting.

I guess I am going back to the backup I made yesterday, before I tried to clean up the disk space. I might have no choice now but to resize the disk used for the HDFS data. It was worth a try.
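For anyone else attempting this, one way to make the experiment reversible is to copy each partition aside before deleting it. This is a sketch that only prints the `hdfs dfs` commands for review rather than running them; `hdfs dfs -cp`, `-mkdir -p`, and `-rm -r -skipTrash` are standard HDFS shell commands, but the `/backup` location is a made-up example and you would pick a path with enough free space on your cluster.

```python
def backup_then_remove_cmds(path, backup_root="/backup"):
    """Generate (not execute) commands to copy a partition aside, then remove it."""
    dest = backup_root + path
    return [
        f"hdfs dfs -mkdir -p {dest.rsplit('/', 1)[0]}",  # create the backup parent dir
        f"hdfs dfs -cp {path} {dest}",                   # copy the partition aside
        f"hdfs dfs -rm -r -skipTrash {path}",            # then remove the original
    ]

for cmd in backup_then_remove_cmds(
    "/edx-analytics-pipeline/warehouse/module_engagement_roster/dt=2021-08-30"
):
    print(cmd)
```

Printing the commands first, instead of executing them, leaves room to sanity-check the paths before anything destructive happens, which matters given the side effect described above.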

If I find something else, I’ll add it here.

Thanks for your comment @chintan
