Is there an easy way to add a robots.txt to a tutor-based OpenEdX deployment?

I already found related posts, but nothing usable. I have no idea how to create and then install a Django plugin… this should be as simple as “drop a robots.txt in your theme”, with at most writing a Tutor patch, but alas it is not.

Note that I already found Adding to urls.py using Tutor Plugin - #5 by crivet - Tutor - Overhang.IO, which was not helpful to me.

@mboisson could you describe your requirements a bit more? I’m not exactly sure what this question is about, so I’m having trouble directing your question.

I want this file, robots.txt (at the root of the LMS_HOST), to be served as a text file. This is to prevent crawlers from indexing the site: robots.txt - Wikipedia

At the moment, the file is there, but the web server does not serve it.

Gotcha, thank you.

@tutor-maintainers - any ideas here?

Note that I’m not proclaiming this to be THE definitive way to do it, but I was able to get it to work.
Not 100% certain if it will survive a rebuild but I think it might.

I took advantage of the directory ~/.local/share/tutor/data/caddy/ since that is already mounted into the Caddy container, so there is no need to manage new mounts.
In there, create a file robots.txt with the following contents:

User-agent: *
Disallow: /

Next, modify your Caddyfile, preferably by way of caddyfile-cms and caddyfile-lms patches to make the change permanent. For a quick test (which will not survive a rebuild), you can instead manually modify BOTH the cms and lms blocks of ~/.local/share/tutor/env/apps/caddy/Caddyfile to include the robots.txt handling, like this:

    # Serve robots.txt
    route /robots.txt {
        root * /data
        file_server
    }

Here’s an example of the modified block:

cms.mydomain.tld{$default_site_port} {
    @favicon_matcher {
        path_regexp ^/favicon.ico$
    }
    rewrite @favicon_matcher /theming/asset/images/favicon.ico

    import proxy "cms:8000"

    # Serve robots.txt
    route /robots.txt {
        root * /data
        file_server
    }

    handle_path /* {
        request_body {
            max_size 2048MB
        }
    }
}

Restart the Caddy container with tutor local restart caddy, then test whether you can access the URL:

curl -o - https://lms.mydomain.tld/robots.txt
User-agent: *
Disallow: /

This topic that you linked to is exactly what you need. You should create an edx-platform plugin application, as documented here: edx-django-utils/edx_django_utils/plugins/README.rst at master · openedx/edx-django-utils · GitHub. This application will introduce a new /robots.txt URL in your LMS (and in your Studio, if you wish), as well as a new robots.txt static file. Then, push that application to GitHub. Add it to your edx-platform extra requirements via the OPENEDX_EXTRA_PIP_REQUIREMENTS configuration setting (in Tutor). Re-build your image with tutor images build openedx, then restart your platform with tutor local start -d.
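
Purely for illustration (this is not taken from that documentation, and the module and view names are made up), the core of such a plugin app boils down to a tiny Django view and URL pattern along these lines:

# Hypothetical urls.py inside the plugin app; all names here are illustrative.
from django.http import HttpResponse
from django.urls import path


def robots_txt(request):
    # Serve a static "disallow everything" policy as plain text.
    return HttpResponse("User-agent: *\nDisallow: /\n", content_type="text/plain")


urlpatterns = [
    path("robots.txt", robots_txt, name="robots_txt"),
]

The README linked above covers the plugin_app declaration in the AppConfig that makes edx-platform actually discover those URLs.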

Yes, it’s a complex process. You may discover easier ways to achieve what you want, but I suspect those approaches will not be as future-proof.

Alternatively, we could include that feature in tutor-indigo, along with a nice {{ patch("indigo-robots.txt") }} statement to allow users to customise the robots.txt file. Maybe this is something for the maintainers of Indigo @HammadYousaf01 @Ahmed_Khalid?

1 Like

@kmccormick and I discussed this during one of our Tutor Users Group meetings. What Kyle mentioned is that we don’t want to stop indexing of all MFEs, but rather keep some pages of learner-facing MFEs, such as learning and learner-dashboard, indexable.

For this, he also suggested mounting a robots.txt file into the MFE container, similar to what @joel.edwards mentioned here, and having that file use conditional rendering for each MFE based on user preferences. An example file can be found in the meeting notes.
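
Purely as an illustration (this is not the example file from the meeting notes), a rendered robots.txt for the MFE host might end up looking something like this, assuming all MFEs are served from a single host under path prefixes:

# Hypothetical output: allow the learner-facing MFEs, block everything else
User-agent: *
Allow: /learning/
Allow: /learner-dashboard/
Disallow: /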

@regis What do you think about this approach instead of tutor-indigo?

I would say that what “we” want is up to the platform operator. In my case, I do want to stop indexing everything. This is for a development platform and somehow Google found it. It should not be indexed at all, period.

I’m sorry, but creating a Django plugin is like a foreign language to me. I don’t want to become a Django developer just to serve a simple text file.

Adding a patch for indigo-robots.txt would be nice.

This seems like the most approachable solution at the moment. I will keep it in mind.

In case you want to replicate my setup, here’s a patch file for you to try.

Create a robots.txt file in the Caddy data directory, ~/.local/share/tutor/data/caddy/robots.txt, with the following contents:

User-agent: *
Disallow: /

Create a plugin file at ~/.local/share/tutor-plugins/robots_crawlers.py (or give it your own preferred name) with the following content:

from tutor import hooks

hooks.Filters.ENV_PATCHES.add_item(
    (
        "caddyfile-cms",
        """
# Serve robots.txt
route /robots.txt {
    root * /data
    file_server
}
        """
    )
) 

hooks.Filters.ENV_PATCHES.add_item(
    (
        "caddyfile-lms",
        """
# Serve robots.txt
route /robots.txt {
    root * /data
    file_server
}
        """
    )
)

Activate the plugin: tutor plugins enable robots_crawlers
Regenerate the environment: tutor config save
Restart Caddy: tutor local restart caddy
Test to confirm functionality, and it should hopefully work :slight_smile:

1 Like

Thanks! This seems rather straightforward and should be easy to implement. I will try it as soon as I get a break from meetings :sweat_smile:

Mmmm, I configured it, and the Caddyfile inside the caddy container is OK. However, I still don’t get the file served. I suspect that’s because the robots.txt does not make its way into the container?

/srv # cat /data/caddy/robots.txt
cat: can't open '/data/caddy/robots.txt': No such file or directory

Should it be found in the caddy container?

Does something else need to be done after this file is created? Note that I had to create the caddy directory inside ~/.local/share/tutor/data/ since it did not exist.

In the caddy logs, I see:

[tutor@edx1 ~]$ tutor local logs caddy | grep robots.txt | cut -d',' -f 10,11,15,16 | tail -n 10 | sed -e "s/$HOSTNAME/<REDACTED>/g"
"host":"edx.<REDACTED>","uri":"/robots.txt"},"size":9442,"status":404}
"host":"edx.<REDACTED>","uri":"/robots.txt"},"size":3288,"status":404}
"host":"edx.<REDACTED>","uri":"/robots.txt"},"size":3288,"status":404}
"host":"studio.edx.<REDACTED>","uri":"/robots.txt"},"size":2254,"status":404}
"host":"apps.edx.<REDACTED>","uri":"/robots.txt"},"size":0,"status":200}
"host":"preview.edx.<REDACTED>","uri":"/robots.txt"},"size":3288,"status":404}
"host":"edx.<REDACTED>","uri":"/robots.txt"},"size":3288,"status":404}
"host":"apps.edx.<REDACTED>","uri":"/robots.txt"},"size":0,"status":200}
"host":"edx.<REDACTED>","uri":"/robots.txt"},"size":0,"status":404}
"host":"apps.edx.<REDACTED>","uri":"/robots.txt"},"size":0,"status":200}

Are you by chance NOT using SSL/HTTPS on your site? I think that directory might not be created/mounted if you aren’t using HTTPS. Sorry, I just took it as given that it would be there, but neglected to consider whether it was conditional on other parameters.

My docker-compose.prod.yml looks like this for the Caddy entry; perhaps you can create a custom mount point somewhere else if that suits your requirements better:

  # Web proxy for load balancing and SSL termination
  caddy:
    image: docker.io/caddy:2.7.4
    restart: unless-stopped
    ports:
      - "80:80"
      
      - "443:443"
      # include support for http/3
      - "443:443/udp"
      
    environment:
      default_site_port: ""
    volumes:
      - ../apps/caddy/Caddyfile:/etc/caddy/Caddyfile:ro
      - ../../data/caddy:/data    # <--- This line here is my mount point
1 Like

Ah! That must be it.

Indeed, in this setup, EdX is running on an internal VM, so EdX’s Caddy is not doing HTTPS. HTTPS is handled by the external Caddy, which manages other web services and reroutes requests for EdX to EdX’s Caddy.

Somehow, however, no amount of

tutor mounts add caddy:/tutor/.local/share/tutor/data:/data

manages to get this into the compose file…

[tutor@edx1 ~]$ grep -r '../data/caddy' .local/share/tutor/env/
[tutor@edx1 ~]$ tutor mounts add caddy:../../data/caddy:/data
Adding bind-mount: caddy:../../data/caddy:/data
Configuration saved to /tutor/.local/share/tutor/config.yml
Environment generated in /tutor/.local/share/tutor/env
[tutor@edx1 ~]$ grep -r '../data/caddy' .local/share/tutor/env/
[tutor@edx1 ~]$

It seems like the mounts command is being ignored… :person_shrugging:

It appears that the mounts command merely modifies the MOUNTS Tutor config variable, which itself is not used for the caddy container, only for lms and cms :person_facepalming:

The only way that I can get robots.txt visible in the caddy container is by manually editing docker-compose.prod.yml, but that will likely get overwritten by Tutor, so I don’t know how to fix this.

This is so complicated for something that should be so simple :cry:
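
For what it’s worth, one untested idea for persisting that extra Caddy volume without hand-editing docker-compose.prod.yml would be another ENV_PATCHES plugin, this time targeting Tutor’s local-docker-compose-services patch. Whether that patch is available and whether docker compose merges the partial caddy entry cleanly with the existing service definition depends on the Tutor version, so treat this strictly as a sketch:

from tutor import hooks

# Untested sketch: declare the extra Caddy volume through a Tutor patch instead
# of editing docker-compose.prod.yml by hand. The patch name and the compose
# merge behaviour are assumptions, not verified on this deployment.
hooks.Filters.ENV_PATCHES.add_item(
    (
        "local-docker-compose-services",
        """
caddy:
  volumes:
    - ../../data/caddy:/data
""",
    )
)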

At least for now, I ended up doing it in the configuration of the external caddy that sits in front of OpenEdX’s caddy. It was much easier to do there.

Talking with my colleague, who knows more about Caddy, I realised it should be possible to do it with this instead, which does not require adding a file to the caddy container:

from tutor import hooks

hooks.Filters.ENV_PATCHES.add_item(
    (
        "caddyfile-cms",
        """
# Serve robots.txt
respond /robots.txt 200 {
    body "User-agent: *
Disallow: /"
    close
}
        """
    )
) 

hooks.Filters.ENV_PATCHES.add_item(
    (
        "caddyfile-lms",
        """
# Serve robots.txt
respond /robots.txt 200 {
    body "User-agent: *
Disallow: /"
    close
}
        """
    )
)

I did not implement this since I’ve implemented it in the outside Caddy server instead, but it could be of use to someone, so I’m dropping it here anyway.

4 Likes

This is actually a pretty solid method, colour me impressed :slight_smile:

1 Like