[Re-opened] Text in problems gets truncated in non-Latin languages

Dimitris shared with me at the conference that this issue is still unsolved:

Since that thread is locked, let’s start a new one here.

@Dimitris_Angelakis to help me reproduce this, could you share with me the XML of the problem that is suffering from this issue?

You can access the XML of a single problem by going to it in Studio, clicking Edit, clicking the Advanced editor (might be hidden under “Show advanced settings”), ignoring the warning, and copy-pasting the XML. Make sure you exit by clicking “Cancel” instead of “Save”–that way, the problem will not be permanently converted to the advanced editor.

Hello @kmccormick

Thank you for reopening this thread!

We usually edit the problems in XML in the advanced editor, but thank you for your concern with it!

In the following link are 3 screen captures. The first two show the problem in both Studio and LMS, and the 3rd picture shows the proper text when using the workaround (“span” tags in between the text) as I describe below, to overcome the problem with the truncated texts:

https://imgur.com/a/EGu0CMN

The XML of this particular problem that shows the truncated text is here:

<p>5. Έστω το παρακάτω απόσπασμα προγράμματος tkinter:</p>
<pre>ttk.Style().configure('mystyle', bg= "yellow", font='Arial 30')
self.button = ttk.Button(self.root, style='mystyle',
					text="show text", command=self.showText)</pre><br></br>
<p>Ποια η λειτουργία του προγράμματος;</p>
<multiplechoiceresponse>
  <choicegroup type="MultipleChoice">
    <choice correct="false">Ορίζει το στυλ mystyle (χρώμα υποβάθρου, γραμματοσειρά) και, στη συνέχεια, το στυλ αυτό χρησιμοποιείται σε ένα γραφικό αντικείμενο τύπου ttk.Button.</choice>
    <choice correct="true">_tkinter.TclError: Layout mystyle not found</choice>
    <choice correct="false">Ορίζεται το στυλ mystyle (χρώμα υποβάθρου, γραμματοσειρά), όμως στη συνέχεια δεν εφαρμόζεται στο γραφικό αντικείμενο self.button, αφού δεν είναι το καθορισμένο όνομα στυλ για αντικείμενα τύπου Button, οπότε το γραφικό αντικείμενο σχεδιάζεται με default layout.</choice>
  </choicegroup>
</multiplechoiceresponse>

<solution>
<div class="detailed-solution">
<p>Εξήγηση</p>

<p>Δίνει σφάλμα, γιατί το όνομα mystyle δεν είναι το καθορισμένο όνομα στυλ για αντικείμενα τύπου Button. Όπως αναφέρθηκε και στη διάλεξη, το στυλ αυτό είναι το button.TButton.</p>
</div>
</solution>

The problem appears at the last “choice” tag.

And here is the same problem where we use the “span” tag as a workaround. Adding a “span”, “p”, or some other tag helps to show the text untruncated:

<p>5. Έστω το παρακάτω απόσπασμα προγράμματος tkinter:</p>
<pre>ttk.Style().configure('mystyle', bg= "yellow", font='Arial 30')
self.button = ttk.Button(self.root, style='mystyle',
					text="show text", command=self.showText)</pre><br></br>
<p>Ποια η λειτουργία του προγράμματος;</p>
<multiplechoiceresponse>
  <choicegroup type="MultipleChoice">
    <choice correct="false">Ορίζει το στυλ mystyle (χρώμα υποβάθρου, γραμματοσειρά) και, στη συνέχεια, το στυλ αυτό χρησιμοποιείται σε ένα γραφικό αντικείμενο τύπου ttk.Button.</choice>
    <choice correct="true">_tkinter.TclError: Layout mystyle not found</choice>
    <choice correct="false">Ορίζεται το στυλ mystyle (χρώμα υποβάθρου, γραμματοσειρά), όμως στη συνέχεια δεν εφαρμόζεται στο γραφικό αντικείμενο self.button, αφού δεν είναι το καθορισμένο όνομα στυλ για <span> αντικείμενα τύπου Button, οπότε το γραφικό αντικείμενο σχεδιάζεται με default layout. </span></choice>
  </choicegroup>
</multiplechoiceresponse>

<solution>
<div class="detailed-solution">
<p>Εξήγηση</p>

<p>Δίνει σφάλμα, γιατί το όνομα mystyle δεν είναι το καθορισμένο όνομα στυλ για αντικείμενα τύπου Button. Όπως αναφέρθηκε και στη διάλεξη, το στυλ αυτό είναι το button.TButton.</p>
</div>
</solution>

I don’t know if you can replicate the problem with this example, but as you can see in the screenshots, a stray “/div>” appears at the end of the truncated text, which is a bit strange. It doesn’t always appear there: in the screenshot on the initial thread the text is truncated, but this broken “div” tag is not present.

Moreover, the strangest thing is that this whole situation is happening a bit at random!
We get messages from our students that they see some texts truncated like that in their browser, while in my browser and my colleagues’ browsers the whole text is shown correctly. Still, we have to “break” the text using “span” tags in order for everyone to see it in full.

Also, sometimes, in some cases, it doesn’t appear immediately after we create the problem, but after a while…

I hope this is enough for you to get an idea of the problem.
Anything else you might need from us, please let me know, and again, thank you for helping with this!

Happy to help @Dimitris_Angelakis . So far, the problem is displaying correctly for me in both Chrome and Firefox. Have you been able to reproduce it yourselves, or have you only seen student reports? If it’s the latter, then my guess is that there may be a common browser extension that several students are using that is mangling the problem text. When I worked at edX, there were a couple times where some students’ auto-translation browser extension was disrupting CAPA problem rendering.

Hello again @kmccormick

Yes, I have been able to reproduce it in several problems inside our platform just like our users reported it.

I have tried several browsers (Firefox, Chrome, Chromium, Brave etc.), some freshly installed, even with all extensions/plugins and auto-translation disabled, and the issue is still there in several cases.

Like I mentioned before, it’s a bit random sometimes, but in some cases it happens the same for everyone…

Would it be ok if I PM you a link to a particular problem in a course on our platform where the issue appears to be visible to everyone, so you can create an account and log in to see it directly there?

Maybe this will help you a bit more…

Got it, very interesting. Sure thing!

I couldn’t send a PM from here (it forbade me for some reason) so I emailed you.
I hope everything you need is there.

Thank you @kmccormick

I was able to observe the truncation issue using the live problem block link that @Dimitris_Angelakis sent me, although the issue does not manifest when I add the problem block locally. To isolate the issue, I opened the problem block in the chromeless view (https://LMS_BASE/xblock/block-v1:ORG+COURSE+RUN+type@problem+block@ID) and confirmed that the truncation issue still exists. Looking at the source of the page, I dug into the value of the data-content attribute, which contains escaped HTML. Decoding that, I can see that the truncated problem text actually exists in the page source. This tells me that the truncation is happening on the server side! This is surprising, but also encouraging, because it makes debugging this much more tractable.
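
The decoding step described above could look roughly like this in Python. This is a minimal sketch, assuming the page source has been saved as a string and that the escaped problem HTML sits in a `data-content` attribute as described. The class name `DataContentParser`, the helper `extract_data_content`, and the sample markup are all hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class DataContentParser(HTMLParser):
    """Collects the value of every data-content attribute in a page."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser unescapes attribute values (e.g. &lt; becomes <),
        # so each value arrives here as the decoded inner HTML.
        for name, value in attrs:
            if name == "data-content" and value is not None:
                self.contents.append(value)


def extract_data_content(page_source):
    """Return the decoded data-content values found in page_source."""
    parser = DataContentParser()
    parser.feed(page_source)
    parser.close()
    return parser.contents


# Hypothetical sample markup, not copied from a real Open edX page.
sample = '<div class="example" data-content="&lt;p&gt;Ποια η λειτουργία/p&gt;"></div>'
decoded = extract_data_content(sample)
```

If the decoded value is already missing text, the truncation happened before the browser ran any JavaScript on the page.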

That is all I was able to find today, but I will dig in again when I have a chance. @Dimitris_Angelakis , could you let me know what version of Open edX you are running, whether you have a theme that overrides html templates and edx-platform, and whether you have any customizations to the xmodule/capa (or common/lib/xmodule/capa) source folders in edx-platform? Lastly, have you ever observed the truncation occur in Studio, or is this only happening in LMS views?

It truncates in Studio too. Tested on a clean 17.0.5 installation without custom theming.

Hello again @kmccormick

We use the Palm release of Open edX, but I have a test installation of Quince which also has the same problem. If I recall correctly, this problem first appeared with the Juniper release, so it’s not something new. Before Juniper we had no such problem.

We have indeed been using our own theme that overrides some of the html templates, not all though. And I should mention that while I was testing the Quince release with the Indigo theme enabled, the problem also appeared, so I don’t think it’s our particular theming that causes this, but I actually haven’t tried it yet without any theming enabled. I will try to make a fresh testing installation of Open edX without any theming enabled and check if the problem is still there, but if @kosolapovlb also has this problem without any theming enabled, I think I’ll have the same result.

We haven’t done any customizations to the xmodule/capa (or common/lib/xmodule/capa) source folders though.

Yes, as @kosolapovlb already mentioned, this also happens in Studio the same way it happens in the LMS.

I hope these pieces of information were helpful. Thank you for your time Kyle! If you need anything else, let me know!

Juniper was the first release of Open edX to support Python 3, which means that we were going through the codebase and modifying a lot of “are we treating this as a string or as bytes” code throughout the platform. That includes some bits that modify capa content like this. It’s possible that something got introduced as a result of this.

“Also, sometimes, in some cases, it doesn’t appear immediately after we create the problem, but after a while…”

That is one of the weirdest parts of this for me. Is it at all possible that content is getting written/modified in a way that outwardly looks the same but changes the representation underneath? For example, part of it getting saved as ISO-8859-7 instead of UTF-8? Or shifting to different Unicode code points that happen to look the same?
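
Both possibilities can be checked against the raw bytes of an exported file. The following is a minimal sketch using only the Python standard library; the helper names (`looks_like_utf8`, `describe_code_points`, `is_nfc`) and the sample strings are hypothetical and chosen for illustration.

```python
import unicodedata


def looks_like_utf8(raw):
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def describe_code_points(text):
    """List (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def is_nfc(text):
    """Return True if the text is already in NFC (precomposed) form."""
    return unicodedata.normalize("NFC", text) == text


word = "στυλ"
utf8_bytes = word.encode("utf-8")        # two bytes per Greek letter
greek_bytes = word.encode("iso-8859-7")  # one byte per Greek letter

# Greek omicron and Latin "o" render alike but are different code points.
greek_omicron = describe_code_points("\u03bf")
latin_o = describe_code_points("o")

# A precomposed accented alpha versus alpha plus a combining accent.
precomposed = "\u03ac"
decomposed = "\u03b1\u0301"
```

Running `describe_code_points` over the text just before the break in a working copy and in a broken copy would show whether the two differ underneath while looking identical on screen.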

Would it be possible for you to upload the XML file of the problem in question (the actual file in the export, rather than copy-paste)?

Thank you.

I can provide my sample problem. Hope it helps.


977116fa302a45649c3b8d0cae7be854.xml (7.1 KB)

Sure, here is the XML file from the exported course, for a particular problem that has this issue (this one actually appears the same way to everyone). I have included a screenshot of the part of the problem where the phrase is truncated, and the corresponding XML file.

The issue appears at the line:

<p>Σημείωση: Όταν η ζητούμενη αριθμητική απάντηση έχει μονάδες (π.χ. sec, cm, Kg κ.λπ.) ή απαιτεί κάποιον επιπλέον προσδιορισμό (όπως π.Χ.) τότε αυτός λέγεται ρητά στην εκφώνηση ή/και αναγράφεται έξω από το κουτί της απάντησης. Μέσα στο κουτί μπαίνει μόνο ο αριθμός και ποτέ οι μονάδες ή ο όποιος επιπλέον προσδιορισμός.</p>

I hope it’s helpful

519f608574f81be1c49e.xml (2.1 KB)

@Dimitris_Angelakis: @kmccormick and I looked into this a bit more on Friday, and I poked at it a bit on the weekend. I can’t pursue this full time right now, but I’ll write down some findings/thoughts in case they help others looking into this:

We have a slightly simplified reproducing case (the spacing is very goofy looking, but it matters):
minimum-reproduction.xml (1.5 KB)

Some initial findings:

I can’t reproduce this issue in development.

It reproduces reliably all the way out to Redwood on our sandbox, but I can’t reproduce it in our local tutor dev environment. For a while I thought this might have been because we switched from bleach to nh3 after Redwood was cut, but reverting that commit locally did not let me reproduce it. Also, the timing is off, since the commit adding bleach usage to capa happened just after Juniper was cut, and wouldn’t have been part of that release.

It’s possible that there is something specific about dev vs. production settings that triggers this bug, though I can’t imagine what that would be. I have not pursued this lead (e.g. run tutor local with the latest master).

The full OLX is saved properly, and shows up in the staff debug.

No surprises here, but it means that the code causing issues is almost certainly in the capa problem processing, and not something lower-level.

It occurs even outside of any response/input type.

In the example above, it’s happening in a paragraph tag that comes after the last numericalresponse was closed. So it looks like it’s part of the top-level problem processing, not a particular input or response type.

It breaks the </p> tag.

In the output, the tag is partly broken by the truncation–it outputs as /p>. So this is likely some kind of strip() or sub() operation on the raw text.
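
A quick way to scan rendered output for this symptom is to look for closing tags that have lost their leading “<”. This is a rough heuristic sketch, not part of capa; the pattern, the tag list, and the function name `find_broken_close_tags` are assumptions made for illustration.

```python
import re

# A closing tag that lost its leading "<", such as "/p>" or "/div>".
# The tag list is an assumption; extend it as needed.
BROKEN_CLOSE_TAG = re.compile(r"(?<!<)/(?:p|div|span)>")


def find_broken_close_tags(rendered_html):
    """Return the character offsets of closing tags missing their "<"."""
    return [m.start() for m in BROKEN_CLOSE_TAG.finditer(rendered_html)]


intact = "<p>Μέσα στο κουτί</p>"
broken = "<p>Μέσα στο κουτί/p>"
offsets = find_broken_close_tags(broken)
```

The pattern can give false positives on ordinary text that happens to contain something like “/p>”, so any hit still needs a manual look.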

Speculation: It’s something where offset/range code used for stripping is not Unicode aware.

You can see this from the fact that substituting an “á” with an “a” in the XML before the broken text shifts the location where the break happens. But if you substitute the “á” with two “a”s, the break location remains the same, because an “á” is stored as two bytes in UTF-8, while an “a” is serialized as one byte.

Which is also weird, because you’d expect everything to have been converted to Unicode before this text processing happens at all…
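
To illustrate why that substitution behaves this way, here is a minimal sketch of a cut applied at a fixed byte offset into UTF-8 text. It only models the observed symptom; the actual code path in capa has not been identified, and the function name `tail_after_byte_offset`, the sample strings, and the offset of 8 are all hypothetical.

```python
def tail_after_byte_offset(text, byte_offset):
    """Return the text that remains after skipping byte_offset UTF-8 bytes.

    errors="ignore" drops a partial character if the cut lands inside one.
    """
    raw = text.encode("utf-8")
    return raw[byte_offset:].decode("utf-8", errors="ignore")


original = "sá-0123456789"   # "á" takes two bytes in UTF-8
one_a = "sa-0123456789"      # one byte shorter before the digits
two_a = "saa-0123456789"     # same byte length as the original

print(tail_after_byte_offset(original, 8))  # 456789
print(tail_after_byte_offset(one_a, 8))     # 56789
print(tail_after_byte_offset(two_a, 8))     # 456789
```

With a single “a” the cut lands one character later in the visible text, while with two “a”s it lands in the same place as the original, which matches the behaviour seen in the reproduction files.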

Edit: Example of that substitution carried as far as I could while still reproducing:
minimum-reproduction2.xml (1.5 KB)

@Dimitris_Angelakis: A couple of questions:

  1. Does this happen with content authored in Markdown, or only raw OLX?
  2. Does it go away if you wrap the whole paragraph in a <text> tag, e.g. <text><p>The whole paragraph where the truncation happens in..</p></text>?

Thank you.

It happens in both cases. Usually, when we use Markdown to create a problem, we switch to raw HTML to embed several span tags in between the phrases that break, so they will appear as continuous. When we embed span, p, or other such tags in the middle of a long sentence, it appears to be ok; if not, it usually breaks…

No. If we use a single text or span or any other tag around the whole sentence and the sentence is fairly long, it usually breaks, so a single text tag around the whole sentence doesn’t fix it…


I have noticed something which I mentioned in my initial thread, and forgive me if it sounds stupid: as far as I understand, every problem (even if it’s just plain text) is treated as a potential LON-CAPA problem, but it also makes use of the MathJax library, which parses the whole problem, and the parsed end result appears encoded in the final HTML that is produced.
Could this (the MathJax library) be the source of the problem?
Again, if this is a stupid question, just ignore it :)

Thank you!

That is an absolutely valid suspicion to have, and was the first thing I suspected as well. But @kmccormick pointed out (and I’ve also verified) that the truncation happens on the server side code, before MathJax even enters the picture.
