Improve use of CharacterEncoding by mmatera · Pull Request #1735 · Mathics3/mathics-core

mmatera · 2026-03-15T18:20:38Z

This PR covers #1678 by improving the use of CharacterEncoding and $CharacterEncoding, to align
better with the behavior in WMA: all string manipulation and formatting
functions work with the internal encoding, and CharacterEncoding is
only used in rendering functions, which are used in the frontend and in the
ToString evaluation.

List of changes:

A module mathics.eval.encoding was added, with a function
encode_string_value(value:str, encoding:str). The function takes
a string in the internal encoding, and produces a string in the target
encoding.
The encode_string_value function is used in rendering String and Box expressions
according to the encoding parameter.
In Evaluation.format, the current $CharacterEncoding is taken into account to
format the output text.
DocTest.compare now takes a parameter encoding to convert the expected result
to a specific encoding.
Docpipeline can now use the system character encoding for tests.
More use of Mathics3-scanner character tables for finding characters.
The encoding parameter is no longer used in formatting In/Pre/Postfix to Box expressions.
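
A minimal sketch of what the first item describes. Only the function name and signature come from the description above; the table is a tiny hand-written stand-in (an assumption for illustration), since the final tables are meant to come from Mathics3-scanner.

```python
# Hypothetical, hand-written replacement tables for illustration only.
# The real tables are expected to come from the Mathics3-scanner data.
ENCODING_TABLES = {
    "ASCII": {"≥": ">=", "→": "->"},
    "Unicode": {},
    "UTF-8": {},
}


def encode_string_value(value: str, encoding: str) -> str:
    """Convert a string in the internal (Unicode) encoding to `encoding`."""
    table = ENCODING_TABLES[encoding]  # raises KeyError for unknown names
    if not table:
        # Nothing to translate: the target can represent every character.
        return value
    # Replace each character that has an entry; keep the others unchanged.
    return "".join(table.get(ch, ch) for ch in value)


print(encode_string_value("a ≥ b", "ASCII"))  # a >= b
```

Because the translation is per character, the cost is one table lookup for each character of the rendered string.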

Missing stuff

Add to the Unicode to ASCII encoding the letter-like characters (e.g. 'α' → '\[Alpha]')
Adjust doctests to use Unicode instead of ASCII encoding for the expected results. With the new changes,
Use FullForm in pytests instead of OutputForm in all the cases where the format is not relevant.

mmatera · 2026-03-15T19:16:14Z

+}
+
+
+def encode_string_value(value: str, encoding: str):


This function is just a proof of concept. The final version should look into the MathicsScanner tables

rocky · 2026-03-16T09:27:47Z

            value = value[1:-1]
+
+    if "encoding" in options and options["encoding"] != "Unicode":
+        value = encode_string_value(value, options["encoding"])


Looking at this more closely, there may be a deeper problem here.

If the Mathics3 string was encoded with Unicode under the user's control, that should remain. If Mathics3 added the Unicode because an operator appeared, that is probably wrong, and the code that added the Unicode should be fixed.

So, what is a specific scenario or situation where line 200 is triggered?

Line 200 is triggered when the required encoding is not the standard Unicode. It happens when the SystemCharacterEncoding is not Unicode (for example by setting MATHICS_CHARACTER_ENCODING="ASCII") or when it is called from ToString with a specific CharacterEncoding option.

This paraphrases the if condition. I meant, what is it that is causing an operator to get converted before ToString was called. This, I think, is the real source of the problem.

rocky · 2026-03-16T09:32:37Z

A suggestion for a check that things are fixed would be to run pytest without setting MATHICS_CHARACTER_ENCODING, but changing pytest/helper.py so that an encoding of ASCII in the ToString calls does not cause tests to fail.

rocky · 2026-03-25T13:10:03Z

        ("ArcTan[0, 1]", None, "Pi / 2", None),
        ("ArcTan[0, -1]", None, "-Pi / 2", None),
-        ("Cos[1.5 Pi]", None, "-1.83697×10^-16", None),
+        ("Cos[1.5 Pi]", None, "-1.83697 x 10^-16", None),


How can I see this behavior using wolframscript?

When I try using InputForm I see this:

In[1]:= Cos[1.5 Pi] // InputForm
Out[1]//InputForm= -1.8369701987210297*^-16

Actually, you can't: InputForm uses this base*^exp notation, while OutputForm uses the 2D form

In[3]:= 2.3*^43
              43
Out[3]= 2.3 10

The closest form is the EngineeringForm:

In[4]:= EngineeringForm[2.3*^32]
                                  30
Out[4]//EngineeringForm= 230. × 10

which uses the times operator. By setting the character encoding to ASCII, we see the use of x instead of Times:

In[5]:= $CharacterEncoding="ASCII"
Out[5]= ASCII
In[6]:= EngineeringForm[2.3*^32]
                                  30
Out[6]//EngineeringForm= 230. x 10

Ok. Thanks for the clarification. So, in what environment do we see exactly "-1.83697 x 10^-16" as the output of "2.3*^43"?

And how am I to understand that in this test, I am matching that kind of environment?

In the test/format/ folder, there are two files, one is format_tests.yaml which we use for the test, and another format_tests-WMA.yaml which contains the corresponding outputs in WMA.
Regarding which environment, I am running WMA 12.04 in Ubuntu 22.04.

Regarding which environment,

I am sorry that I wasn't clear. By environment, I mean in what Wolfram program or product do we see this kind of output appearing?

This is because we have not yet implemented a 2D OutputForm. In any case, looking at this, what we should test in most of the cases, is the match regarding the internal representation, and not regarding the format, which is just one aspect to be tested.

Normally, I'd say yes, please let's match the internal representation. But...

Unless we know that there is a WMA internal representation that we are trying to match, or unless right now this internal representation has a concrete impact on output we can see today that matches WMA, I'd say, let's not test things (let alone exhaustive tests) of something that could very well change (even if ever so slightly) as we fill things in, such as for 2D character layout.

OK, but that's the point. With the "internal" representation of the result, I mean FullForm, which is readily accessible. Then, aspects related to the format can be tested in focused tests. This allows us to avoid worrying about the representation of \[Times], when what we want to test is whether two numerical results are the same.
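
As an illustration of comparing values rather than rendered text, here is a small sketch. It assumes the test can obtain the result in the base*^exp notation shown earlier in this thread; parse_wl_real is a hypothetical helper, not part of the test suite.

```python
import math


def parse_wl_real(text: str) -> float:
    """Parse a real written in base*^exponent notation, e.g. 1.*^-6."""
    # "*^" plays the role of "e" in Python's float syntax.
    return float(text.replace("*^", "e"))


expected = -1.83697e-16
actual = parse_wl_real("-1.8369701987210297*^-16")

# The comparison does not depend on whether the multiplication sign
# is rendered as "×" or as "x".
assert math.isclose(actual, expected, rel_tol=1e-5)
```

A check like this keeps passing when the character encoding changes, so the representation of the times operator can be covered by a few focused formatting tests instead.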

Testing against an internal representation as a quicker, more understandable, and more trackable way to ensure that FullForm output is correct is not only fine, but it is preferable. (Of course, there will be some end-to-end "blackbox" FullForm tests as well.)

Writing tests to track the internal representation for how that might be used in 2D character output that hasn't been fleshed out and is not implemented is, however, not a good idea.

That should be delayed until we have the 2D renderer in place.

Logic note:

In "Unless x then y", when x is false, y can happen.

rocky · 2026-03-25T13:18:36Z

  text:
    System`InputForm: 1.*^-6
-    System`OutputForm: "1.\xD710^-6"
+    System`OutputForm: "1. x 10^-6"


I don't understand.

Here is what I see:

In[67]:= 10.^6 //TeXForm
Out[67]//TeXForm= 1.\times 10^6
In[68]:= 10.^6 //TeXForm//OutputForm
Out[68]//OutputForm= 1.\times 10^6
In[69]:= 10.^6 //TeXForm//InputForm
Out[69]//InputForm= 1.\times 10^6

How should I reconcile the output with this test with what I see in WolframScript?

This test is for "text" form, not "TeX" form.

What Form corresponds to "text" form?

"OutputForm", I guess... Actually, what you want I think is to compare against ToString, which by default produces expressions formatted as OutputForm.

mmatera · 2026-03-25T15:16:41Z

  mathml:
    System`InputForm: <mtext>&lt;|a&nbsp;-&gt;&nbsp;x,&nbsp;b&nbsp;-&gt;&nbsp;y,&nbsp;c&nbsp;-&gt;&nbsp;&lt;|d&nbsp;-&gt;&nbsp;t|&gt;|&gt;</mtext>
-    System`OutputForm: '<mtext>&lt;|a&nbsp;-&gt;&nbsp;x,&nbsp;b&nbsp;-&gt;&nbsp;y,&nbsp;c&nbsp;-&gt;&nbsp;&lt;|d&nbsp;-&gt;&nbsp;t|&gt;|&gt;</mtext>'
+    System`OutputForm: '<mtext>&lt;|a&nbsp;⇾&nbsp;x,&nbsp;b&nbsp;⇾&nbsp;y,&nbsp;c&nbsp;⇾&nbsp;&lt;|d&nbsp;⇾&nbsp;t|&gt;|&gt;</mtext>'


In WMA, MathMLForm produces Unicode characters, but encoded as &#charcode;. For example,

In[1]:= x->b //MathMLForm
Out[1]//MathMLForm=
<math>
 <mrow>
  <mi>x</mi>
  <semantics>
   <mo>→</mo>
   <annotation encoding='Mathematica'>"\[Rule]"</annotation>
  </semantics>
  <mi>b</mi>
 </mrow>
</math>

Notice that -> was converted into <mo>→</mo>. We should go over this in another round.
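
For reference, the &#charcode; form can be produced in Python with the standard xmlcharrefreplace error handler. This is only a sketch of the escaping step; escape_non_ascii is a hypothetical helper name, not a function in mathics-core.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character by a numeric character reference."""
    # "xmlcharrefreplace" turns unencodable characters into &#NNNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(escape_non_ascii("<mo>→</mo>"))  # <mo>&#8594;</mo>
```

ASCII characters, including the markup itself, pass through unchanged.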

Going over in another round is fine. But here, instead of testing wrong behavior, let's comment out the test.

When we have the test correct, it would get uncommented.

rocky · 2026-03-25T16:32:36Z

 def eval_ToString(
    expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation
 ) -> String:
+    from mathics.format.render.encoding import EncodingNameError


I looked into what's causing the circular import, and this is a mess.

Everything in mathics.format.render is imported inside __init__. And in doing that, rendering for OutputForm needs this eval routine, which then imports something else in mathics.format.render.

The idea behind the automated dynamic import in mathics.format.render was that one could just drop in new rendering routines. mathics.format.render.encoding is not a renderer, so it should not be imported.

Doing this here is just working around a design flaw elsewhere.

Probably mathics.format.render.encoding should be inside mathics.eval

rocky · 2026-03-25T16:48:39Z

Make that CharacterEncoding option in ToString works as expected

This does not fully do this. In this git branch:

In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a ≥ b"

PR #1749 does.

Aside from it being declared incomplete, e.g., there are hard-coded tables, it is also a bit unfocused. Is there a way we can break this up into smaller pieces to go over them individually?

For example, just handling code pages for MS Windows is a suggestion for breaking this down into a smaller part that can be reviewed in isolation.

mmatera · 2026-03-25T18:04:27Z

This is because I am using a minimal encoding table coded by hand. The next step is to pick the conversion tables from Mathics3-Scanner, as in #1749

rocky · 2026-03-25T19:34:09Z

-    text = boxes.to_text(evaluation=evaluation)
-    return String(text)
+
+    boxes = format_element(expr, evaluation, form)


If the final idea is that the strings in format_element are going to get converted, then I think this is approaching this the wrong way.

Instead, format_element needs to take the parameters expr, form, and encoding to produce boxes that have the appropriate strings in them initially.

Okay, but this doesn't align with how the experiments I showed you suggest WMA works. It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied. And if you do the conversion earlier, a double conversion spoils the result.
Handling encoding at the level of format_element is like modifying the underlying structure of a Graphics object because you know that in the end it is going to be converted into a PNG file.

Okay, but this doesn't align with how the experiments I showed you suggest WMA works.

I did not find anywhere in those experiments that there was a string that was encoded one way, and inside ToString, it got reencoded, as opposed to being encoded correctly initially.

It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied.

That is not at issue here. What is at issue here is taking a string that was wrongly encoded and re-encoding it.

Consider this example where I set a breakpoint at the location we are discussing:

$ mathics3 ...
In[1]:= ToString[a >= b, CharacterEncoding -> "ASCII"]
(/tmp/Mathics3/mathics-core/mathics/eval/strings.py:30:5 @46): eval_ToString
-- 30     try:
(trepan3k) list
 25        expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation
 26    ) -> String:
 27
 28        boxes = format_element(expr, evaluation, form)
 29        breakpoint()
 30  ->    try:
 31            return String(boxes.to_text(evaluation=evaluation, encoding=encoding))
 32        except EncodingNameError:
 33            # Mimic the WMA behavior. In the future, we can implement the mechanism
 34            # with encodings stored in .m files, and give a chance with it.
(trepan3k) boxes.elements
(<Expression: <Symbol: System`PaneBox>[<String: ""a ≥ b"">]>, <Expression: <Symbol: ...

<String: ""a ≥ b""> is wrong. That should be <String: ""a >= b"">.

And if you do the conversion earlier, a double conversion spoils the result. Handling encoding at the level of format_element is like modifying the underlying structure of a Graphics object because you know that in the end it is going to be converted into a PNG file.

This is not relevant here. We started with a Mathics3 Expression, and inside format_element, this expression got turned into an incorrect string, because encoding information indicating that strings are supposed to be ASCII was not respected inside format_element.

Another viable solution might be to have format_element not convert the expression a >= b to a String, and leave it as an Expression for later. But, I am not sure that is possible or correct. I believe only that what is done is incorrect and there's no evidence right now that WMA is reencoding strings instead of encoding them correctly initially.

<String: ""a ≥ b""> is wrong. That should be <String: ""a >= b"">.

I have been looking at this again, and again, this is a central misunderstanding: as I see it, line 28

boxes = format_element(expr, evaluation, form)

must return a boxed expression that uses the internal representation (Unicode/UTF-8). Then, the result <String: ""a ≥ b""> is correct. The encoding is applied in line 31

return String(boxes.to_text(evaluation=evaluation, encoding=encoding))

which takes the box expression and converts it into a Python string, in the requested encoding.

The advantage of this approach is that all the codepage translation machinery is completely localized in one module. The drawback is that we have to scan each character to see if we need to translate it. But this is how WMA does it, and I guess its developers had very good reasons to do it this way.
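
The two-stage approach argued for here can be sketched as follows. The nested-list "boxes" and the small ASCII table are stand-ins invented for illustration; they are not the mathics-core box classes or the Mathics3-scanner tables.

```python
from typing import List, Union

# Hypothetical stand-in for a box expression: a string leaf or a list of boxes.
Box = Union[str, List["Box"]]

# Hand-written table, for illustration only.
ASCII_TABLE = {"≥": ">=", "→": "->"}


def to_text(box: Box, encoding: str = "Unicode") -> str:
    """Render boxes to text, applying the encoding once, at the leaves."""
    if isinstance(box, list):
        return "".join(to_text(item, encoding) for item in box)
    if encoding == "ASCII":
        return "".join(ASCII_TABLE.get(ch, ch) for ch in box)
    return box


# The boxes always hold the internal (Unicode) representation ...
boxes = ["a", " ≥ ", "b"]
print(to_text(boxes))           # a ≥ b
# ... and only the rendering step depends on the requested encoding.
print(to_text(boxes, "ASCII"))  # a >= b
```

In this layout the formatting step never needs to know the target encoding, which is the localization argument made above; the per-character scan in to_text is the cost.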

mmatera · 2026-03-25T19:43:15Z

ToString[a>=b, CharacterEncoding->"ASCII"]

With the last changes, we are loading the encoding from Mathics3-scanner, and the case you mention is also covered:

In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a >= b"

rocky · 2026-03-25T20:31:38Z

Comparing the amount of code needed to do this and the conceptual complexity, I'd say #1749 is cleaner, if not also a lot less code, if making this example work was all that we wanted to accomplish.

I think the code here is making some point or doing something with respect to handling Microsoft Windows code pages, and that is a good thing.

The problem, though, is that I am having a hard time understanding exactly what this is and how to isolate it from the reencoding part, which I think is misguided.

rocky · 2026-03-25T20:38:57Z

Comparing the amount of code ...

In support of this, this PR has 20 files changed, 14 commits, 93 lines added, and 64 removed. And I suspect there should be more changes.

#1749 has 5 commits, 6 files changed, 73 lines added, and 43 removed.

Admittedly, its scope may be narrower than what this PR attempts. But in that case, I'd say commit #1749 and then build on that here.

mmatera · 2026-03-25T20:47:11Z

I am quite convinced that #1749 is shorter but wrong. I just do not have the time at this moment to give a more complete explanation of why. Later, I will try to do that.

mmatera · 2026-03-27T13:20:13Z


  >> a -> 1 + 2
-  = a -> 3
+  = a ⇾ 3


I put this change just as an example. In principle, changing all the docstrings that involve Unicode symbols would make the docpipeline work with any encoding. I didn't do that because I would like to have some feedback before facing this task.
Also, I think that another possibility would be to use named characters instead of Unicode in these expected lines. But for that, I would also need to adjust some code in the doctest parser. @rocky, thoughts?

@mmatera I thought about this for several minutes, and right now I just don't feel confident in suggesting anything one way or another. It might be something to talk over and discuss. For example, it might be that we decide to try one thing on a small scale, see how it goes, and then try another.

Is there some way we can discuss in a manner other than PR review comments?

mmatera · 2026-03-27T13:21:11Z


    options = {
-        "CharacterEncoding": '"Unicode"',
+        "CharacterEncoding": "$CharacterEncoding",


For ToString, the default CharacterEncoding should be $CharacterEncoding

mmatera · 2026-03-27T13:23:20Z

        if result is None:
            return None

+        try:


This ensures that when expressions are formatted as text, the encoding is always applied. With this change, if we specify that the encoding is "ASCII", all the tests must match with ASCII outputs.

mmatera · 2026-03-27T13:25:38Z

+            "UTF-8": {},
+        }[encoding]
+    except KeyError:
+        raise EncodingNameError


Instead of raising this exception, we could try to create an evaluation object and load the encodings from .m files in some special folder, like in WMA. I plan to do that in another round, which would be relatively easy.
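
A sketch of the lookup-and-raise pattern in the hunk above, together with one possible caller. The exception name is taken from the diff; the tables and the fallback (leaving the text in the internal encoding) are assumptions for illustration, not the behavior of the PR or of WMA.

```python
class EncodingNameError(Exception):
    """Raised when no table is known for the requested encoding name."""


# Hypothetical minimal tables, hand-written for this example.
_TABLES = {"ASCII": {"≥": ">="}, "Unicode": {}, "UTF-8": {}}


def get_encoding_table(encoding: str) -> dict:
    try:
        return _TABLES[encoding]
    except KeyError:
        # Hide the KeyError; callers only need to know the name is unknown.
        raise EncodingNameError(encoding) from None


def to_text(value: str, encoding: str) -> str:
    try:
        table = get_encoding_table(encoding)
    except EncodingNameError:
        # Assumed fallback: keep the internal encoding unchanged.
        table = {}
    return "".join(table.get(ch, ch) for ch in value)
```

Loading further tables from files, as suggested above, would fit inside get_encoding_table before the exception is raised.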

rocky · 2026-03-27T13:28:52Z

        if not show_string_characters:
            value = value[1:-1]
-    return value
+    return encode_string_value(value, options["encoding"])


How do we know that "encoding" has always been passed in the options dictionary? Should this be options.get instead?

I wrote it this way to check that we had indeed passed the option. But yes, we can use get instead.
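
The difference between the two lookups, in a small sketch. render_string and the single-character replacement are hypothetical stand-ins, and the "Unicode" default is an assumption.

```python
def render_string(value: str, **options) -> str:
    # options["encoding"] would raise KeyError when the caller forgot the
    # option; .get() falls back to an assumed default instead.
    encoding = options.get("encoding", "Unicode")
    if encoding == "ASCII":
        # Stand-in for the real encoding pass.
        value = value.replace("≥", ">=")
    return value


print(render_string("a ≥ b"))                    # a ≥ b
print(render_string("a ≥ b", encoding="ASCII"))  # a >= b
```

The strict lookup is useful while checking that every caller passes the option; the .get() form is the safer choice once that has been verified.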

rocky · 2026-03-27T13:29:45Z

            'stream = StringToStream["1.523E-19"]; Read[stream, Real]',
            None,
-            "1.523×10^-19",
+            "1.523 x 10^-19",


I thought you were going to rewrite so that we aren't testing impossible behavior?

Such as testing at the encoding-independent level.

Yes, but to do that, I need to do it in several steps, in order to avoid a much larger PR. The plan is to go over this in the next round.

How can we break up this PR into smaller, self-contained conceptual pieces?

Not sure, because when I tried to do it in smaller pieces, I was not able to make you understand what I was doing. Maybe now I can split out the MathML part, then the encoding function, then docpipeline, and then the removal of the internal encoding and the other fixes.

rocky · 2026-03-27T13:53:25Z

@rocky, I think with the last changes I covered some of your observations:

As I look at mathics.format.form, I see other uses of str() that probably need to be addressed too.

Not sure what you are referring to here. What I did now is to remove all the references to encoding in the box making routines.

You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there. That option is just hidden inside **kwargs. I consciously split that out because explicit parameters are better than implicit parameters, but at this point, I don't really care to make an issue of it.

rocky · 2026-03-27T13:58:42Z

Now I made some progress on this. MathML tests now test that the output is closer to the WMA output: all the non-ansi characters are converted to escaped char codes: for example,

I guess this is good. I don't understand if this was a user-noticeable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.)

mmatera · 2026-03-27T14:17:45Z

You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there.

I would like to say that I took the idea from #1749, but I just didn't understand it from the code there, but from our discussion here and there.

That option is just hidden inside **kwargs. I consciously split that out because explicit parameters are better than implicit parameters, but at this point, I don't really care to make an issue of it.

I know that. However, since the options to be passed among these render functions are variable, it is hard to say which parameters a render function for one of the nested elements would require.

mmatera · 2026-03-27T14:19:00Z

The problem with TeX and MathML in WMA is that it sucks, and is not usable for our purposes...

mmatera · 2026-03-27T14:21:59Z

The noticeable problem that this PR solves is that the CharacterEncoding parameter in ToString now does something similar to what is expected.

rocky · 2026-03-27T14:32:36Z

@mmatera Let me see if I understand this correctly. So you are saying that the character code in MathMLForm boxes should change depending on the CharacterEncoding used? For example, whether using UTF-8 versus WindowsANSI versus WindowsGreek? If so, that's a great improvement.

rocky · 2026-03-27T14:41:13Z

You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there.

I would like to say that I took the idea from #1749, but I just didn't understand it from the code there, but from our discussion here and there.

You have just emphasized the problem I am having with this PR! It is hard to follow when the approach is code first, and then start to discuss, and then discover bugs, or more things to add, and rewrite and code some more ... and then discuss, ...

And #1749 is a lot smaller than this even in its earlier versions. So presumably that would have been easier to do :-)

At this point, I'd like to break this up into all the separable ideas contained in this code and go over them one by one and get them merged in.

For example, one problem with the code in the master branch is that the character-encoding information from built-in functions like ToString has to propagate down into the render routine as an option, e.g. via **kwargs.

That is an example of one self-contained bug in the master-branch code that can be fixed in isolation from making proper use of the option.

mmatera · 2026-03-27T14:56:56Z

@mmatera Let me see if I understand this correctly. So you are saying that the character code in MathMLForm boxes should change depending on the CharacterEncoding used? For example, whether using UTF-8 versus WindowsANSI versus WindowsGreek? If so, that's a great improvement.

According to my experiments, in WMA, MathML is insensitive to the encoding:

In[1]:= a->b//MathMLForm                                                        

Out[1]//MathMLForm= 
   <math>
    <mrow>
     <mi>a</mi>
     <semantics>
      <mo>&#8594;</mo>
      <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
     </semantics>
     <mi>b</mi>
    </mrow>
   </math>

In[2]:= $CharacterEncoding="ASCII"                                              

Out[2]= ASCII

In[3]:= a->b//MathMLForm                                                        

Out[3]//MathMLForm= 
   <math>
    <mrow>
     <mi>a</mi>
     <semantics>
      <mo>&#8594;</mo>
      <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
     </semantics>
     <mi>b</mi>
    </mrow>
   </math>

In[4]:= $CharacterEncoding="WindowsANSI"                                                                                                                                      

Out[4]= WindowsANSI

In[5]:= a->b//MathMLForm                                                                                                                                                      

Out[5]//MathMLForm= <math>
                     <mrow>
                      <mi>a</mi>
                      <semantics>
                       <mo>&#8594;</mo>
                       <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
                      </semantics>
                      <mi>b</mi>
                     </mrow>
                    </math>

In[6]:= $CharacterEncoding="Klingon"                                                                                                                                          

Out[6]= Klingon

In[7]:= a->b//MathMLForm                                                                                                                                                     

Out[7]//MathMLForm= <math>
                      <mrow>
                       <mi>a</mi>
                       <semantics>
                        <mo>&#8594;</mo>
                        <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
                       </semantics>
                       <mi>b</mi>
                      </mrow>
                     </math>

mmatera · 2026-03-27T14:59:34Z

This is probably also wrong, because the browser that reads it could not handle these characters in its codepage. But this is something to handle in another round.

mmatera · 2026-03-27T15:11:31Z

OK, today I ran out of all the time I had for this. Maybe during the weekend, I can try to split it up into smaller parts.

rocky · 2026-03-27T15:18:05Z

Sure. I understand. This PR is too large and all over the place as it is. Either split it up so we can go over smaller pieces individually, possibly find bugs or improvements, and, at any rate, discuss them more easily.

Or punt on the whole thing and wait until after release. Your choice.

mmatera · 2026-03-27T15:21:54Z

I think that it would be great to have this in the release to complete the format/render refactor. At this point, it takes some hours of work, but I have a more or less clear sequence of steps on how to do it.

rocky · 2026-03-27T15:29:45Z

OK, today I ran out of all the time I had for this. Maybe during the weekend I can try to split it up into smaller parts.

Sure. I understand. This PR is too large and all over the place as it is. Either split it up, so we can go over smaller pieces individually and possibly find bugs or make improvements (at any rate, we can discuss them more easily).

Or punt on the whole thing and wait until after release. Your choice.

I think that it would be great to have this in the release to complete the format/render refactor. At this point, it takes some hours of work, but I have a more or less clear sequence of steps on how to do it.

Even with everything here, format/render will not be complete. But there will be significant improvements, and since this is API breaking, it is good to get more of it in sooner rather than later.

I look forward to a cleaner breakdown into PRs that follow those steps, and to understanding the changes made in each.

…und-trip `FullForm` output (#1763) This PR adds support for the options `ShowSpecialCharacters` and `ShowStringCharacters` used in StyleBox, Style, and Cell builtin functions. These options control how strings are rendered. In WMA, when the `ShowSpecialCharacters` option is set to `False`, and `ShowStringCharacters` is set to `True`, strings are rendered using an ASCII representation in which any non-ASCII characters are represented by their character names. This provides an "invertible" representation of the internal original String. In WMA, this representation is used in `FullForm`. This would also provide better grounds for #1735.
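As a rough illustration of the ASCII representation described in that commit message, here is a minimal sketch. Everything in it is an assumption made for illustration: `to_named_ascii` is a hypothetical helper, the three-entry table stands in for the Mathics3-scanner character tables, and the `\:xxxx` hexadecimal fallback for characters without a name is assumed rather than taken from #1763.

```python
# Hypothetical sketch; the real character names live in the
# Mathics3-scanner tables, not in a hand-written dictionary.
NAMED_CHARACTERS = {"α": "Alpha", "β": "Beta", "π": "Pi"}


def to_named_ascii(text: str) -> str:
    """Replace each non-ASCII character with an ASCII-only escape."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Plain ASCII is kept as-is.
            out.append(ch)
        elif ch in NAMED_CHARACTERS:
            # Named characters become \[Name].
            out.append(f"\\[{NAMED_CHARACTERS[ch]}]")
        else:
            # Assumed fallback: hexadecimal code point, at least 4 digits.
            out.append(f"\\:{ord(ch):04x}")
    return "".join(out)


print(to_named_ascii("α+β"))
print(to_named_ascii("café"))
```

The sketch does not escape backslashes or quotes already present in the input, which a truly invertible representation would also have to handle.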

author Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1774807641 -0300 committer Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1775313022 -0300 Handle encodings in docpipeline tests

author Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1774807641 -0300 committer Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1775312913 -0300 Handle encodings in docpipeline tests Add changes in documentation.

Improve use of CharacterEncoding

e02e3a9

mmatera commented Mar 15, 2026


mmatera added 2 commits March 15, 2026 20:33

Merge branch 'master' into fix_ToStringEncoding

586d3a4

Merge branch 'master' into fix_ToStringEncoding

96ea4e8

rocky reviewed Mar 16, 2026


Merge branch 'master' into fix_ToStringEncoding

cd526f5

rocky mentioned this pull request Mar 20, 2026

Pass **kwargs through Form functions so rendering has access to a CharacterEncoding option set, e.g. in ToString #1749

Closed

mmatera added 2 commits March 24, 2026 17:42

Merge remote-tracking branch 'origin/master' into fix_ToStringEncoding

dc9c8ad

not finished

1a53c1a

rocky reviewed Mar 25, 2026


mmatera added 2 commits March 25, 2026 12:05

hangle encoding in doctests

3d4b0a5

adjust tests

0218bd9

mmatera commented Mar 25, 2026


mmatera added 4 commits March 25, 2026 12:32

commenting out the Mathml tests

1324c41

Merge remote-tracking branch 'origin/master' into fix_ToStringEncoding

f58574c

adding missing module

79dcf9d

avoid circular import

e7f88e5

mmatera marked this pull request as ready for review March 25, 2026 15:59

rocky reviewed Mar 25, 2026


mmatera added 2 commits March 25, 2026 16:04

using Mathics3-scanner tables. Moving encoding.py to mathics.eval

0595532

remove hard coded table

1670da6

rocky reviewed Mar 25, 2026


mmatera commented Mar 27, 2026


rocky reviewed Mar 27, 2026


mmatera mentioned this pull request Mar 29, 2026

Add ShowSpecialCharacters and ShowStringCharacters options and round-trip FullForm output #1763

Merged

mmatera added 9 commits March 29, 2026 14:36

Merge branch 'master' into fix_ToStringEncoding

b01006c

Merge branch 'master' into fix_ToStringEncoding

8a33df5

Merge branch 'master' into fix_ToStringEncoding

2d82c6d

parent 6239c83

68cb8b9

author Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1774807641 -0300 committer Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1775313022 -0300 Handle encodings in docpipeline tests

parent 6239c83

cd3cf0d

author Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1774807641 -0300 committer Juan Mauricio Matera <matera@fisica.unlp.edu.ar> 1775312913 -0300 Handle encodings in docpipeline tests Add changes in documentation.

strip result before the comparison

b743191

fix wrong character

ba5d790

Merge remote-tracking branch 'origin/master' into fix_ToStringEncoding

17ab410

merge with handle_encoding_in_docpipeline

55ac83b
