Skip to content

Improve use of CharacterEncoding#1735

Draft
mmatera wants to merge 15 commits intomasterfrom
fix_ToStringEncoding
Draft

Improve use of CharacterEncoding#1735
mmatera wants to merge 15 commits intomasterfrom
fix_ToStringEncoding

Conversation

@mmatera
Copy link
Copy Markdown
Contributor

@mmatera mmatera commented Mar 15, 2026

This PR covers #1678 by improving the use of CharacterEncoding and $CharacterEncoding, to aling
better to the behavior in WMA: all string manipulations and formatting
function works with the internal encoding, and CharacterEncoding is
only used in rendering functions, used in the frontend and in the
ToString evaluation.

List of changes:

  • A module mathics.eval.encoding was added, with a function
    encode_string_value(value:str, encoding:str). The function takes
    a string in the internal encoding, and produce a string in the target
    encoding.
  • The encode_string_value function is used in rendering String an Box expressions
    according to the encoding parameter.
  • In Evaluation.format, current $CharacterEncoding is taken into account to
    format the output in output text.
  • DocTest.compare now takes a parameter encoding to convert the expected result
    to an specific encoding.
  • Docpipeline now can use the system character encoding for tests.
  • More use of Mathics3-scanner character tables for finding characters.
  • the encoding parameter is not used anymore in formatting In/Pre/Postfix to Box expressions.

Missing stuff

  • Add to the Unicode to ASCII encoding the letter-like characters (e.g. 'α' \rightarrow '[Alpha]')
  • Adjust doctests to use Unicode instead of ASCII encoding for the expected results. With the new changes,
  • Using FullForm in pytests instead OutputForm in all the cases where the format is not relevant.

}


def encode_string_value(value: str, encoding: str):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a just a proof of concept. The final version should look into the MathicsScanner tables

value = value[1:-1]

if "encoding" in options and options["encoding"] != "Unicode":
value = encode_string_value(value, options["encoding"])
Copy link
Copy Markdown
Member

@rocky rocky Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this more closely, there may be a deeper problem here.

If the Mathics3 string was encoded with Unicode under the user's control, that should remain. If Mathics3 added the Unicode because an operator appeared, that is probably wrong, and the code that added the Unicode should be fixed.

So, what is a specific scenario or situation where line 200 is triggered?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 200 is triggered when the required encoding is not the standard Unicode. It happens when the SystemCharacterEncoding is not Unicode (for example by setting MATHICS_CHARACTER_ENCODING="ASCII") or when it is call from ToString with a specific CharacterEncoding option.

Copy link
Copy Markdown
Member

@rocky rocky Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 200 is triggered when the required encoding is not the standard Unicode. It happens when the SystemCharacterEncoding is not Unicode (for example by setting MATHICS_CHARACTER_ENCODING="ASCII") or when it is call from ToString with a specific CharacterEncoding option.

This paraphrases the if condition. I meant, what is it that is causing an operator to get converted before ToString was called. This, I think, is the real source of the problem.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 16, 2026

A suggestion for a check that things are fixed would be to run pytest without setting MATHICS_CHARACTER_ENCODING, but changing pytest/helper.py so that an encoding of ASCII in the ToString calls does not cause tests to fail.

("ArcTan[0, 1]", None, "Pi / 2", None),
("ArcTan[0, -1]", None, "-Pi / 2", None),
("Cos[1.5 Pi]", None, "-1.83697×10^-16", None),
("Cos[1.5 Pi]", None, "-1.83697 x 10^-16", None),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can I see this behavior using wolframscript?

When I try using InputForm I see this:

In[1]:= Cos[1.5 Pi] // InputForm                                                            

Out[1]//InputForm= -1.8369701987210297*^-16

Copy link
Copy Markdown
Contributor Author

@mmatera mmatera Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you can't: InputForm uses this base*^exp notation, while OutputForm uses the 2D form

In[3]:= 2.3*^43                                                                 

              43
Out[3]= 2.3 10

The closest form is the EngineeringForm:

In[4]:= EngineeringForm[2.3*^32]                                                

                                   30
Out[4]//EngineeringForm= 230. × 10

which uses the times operator. By setting the character encoding to ASCII, we see the use of x instead of Times:

In[5]:= $CharacterEncoding="ASCII"                                              

Out[5]= ASCII

In[6]:= EngineeringForm[2.3*^32]                                                

                                  30
Out[6]//EngineeringForm= 230. x 10

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks for the clarification. So, in what environment do we see exactly "-1.83697 x 10^-16" as the output of "2.3*^43"?

And how am I to understand that in this test, I am matching that kind of environment?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the test/format/ folder, there are two files, one is format_tests.yaml which we use for the test, and another format_tests-WMA.yaml which contains the corresponding outputs in WMA.
Regarding which environment, I am running WMA 12.04 in Ubuntu 22.04.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding which environment,

I am sorry that I wasn't clear. By environment, I mean in what Wolfram program or product do we see this kind of output appearing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because we have not yet implemented a 2D OutputForm. In any case, looking at this, what we should test in most of the cases, is the match regarding the internal representation, and not regarding the format, which is just one aspect to be tested.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, I'd say yes, please let's match the internal representation. But...

Unless we know that there is a WMA internal representation that we are trying to match, or unless right now this internal representation has a concrete impact on output we can see today that matches WMA, I'd say, let's not test things (let alone exhaustive tests) of something that could very well change (even if ever so slightly) as we fill things in, such as for 2D character layout.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but that's the point. With the "internal" representation of the result, I mean FullForm, which is readily accessible. Then, aspects related to the format can be tested in focused tests. This allows us to avoid worring about the representation of \[Times], when what we want to test if whether two numerical results are the same.

Copy link
Copy Markdown
Member

@rocky rocky Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing against an internal representation as a quicker, more understandable, and more trackable way to ensure that FullForm output is correct is not only fine, but it is preferable. (Of course, there will be some end-to-end "blackbox" FullForm tests as well.)

Writing tests to track the internal representation for how that might be used in 2D character output that hasn't been fleshed out and is not implemented is, however, not a good idea.

That should be delayed until we have the 2D renderer in place.

Copy link
Copy Markdown
Member

@rocky rocky Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic note:

In "Unless x then y", when x is false, y can happen.

text:
System`InputForm: 1.*^-6
System`OutputForm: "1.\xD710^-6"
System`OutputForm: "1. x 10^-6"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand.

Here is what I see:

In[67]:= 10.^6 //TeXForm                                                                     

Out[67]//TeXForm= 1.\times 10^6

In[68]:= 10.^6 //TeXForm//OutputForm                                                         

Out[68]//OutputForm= 1.\times 10^6

In[69]:= 10.^6 //TeXForm//InputForm                                                          

Out[69]//InputForm= 1.\times 10^6

How should I reconcile the output with this test with what I see in WolframScript?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is for "text" form, not "TeX" form.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Form corresponds to "text" form?

Copy link
Copy Markdown
Contributor Author

@mmatera mmatera Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"OutputForm", I guess... Actually, what you want I think is to compare against ToString, which by default produces expressions formatted as OutputForm.

mathml:
System`InputForm: <mtext>&lt;|a&nbsp;-&gt;&nbsp;x,&nbsp;b&nbsp;-&gt;&nbsp;y,&nbsp;c&nbsp;-&gt;&nbsp;&lt;|d&nbsp;-&gt;&nbsp;t|&gt;|&gt;</mtext>
System`OutputForm: '<mtext>&lt;|a&nbsp;-&gt;&nbsp;x,&nbsp;b&nbsp;-&gt;&nbsp;y,&nbsp;c&nbsp;-&gt;&nbsp;&lt;|d&nbsp;-&gt;&nbsp;t|&gt;|&gt;</mtext>'
System`OutputForm: '<mtext>&lt;|a&nbsp;⇾&nbsp;x,&nbsp;b&nbsp;⇾&nbsp;y,&nbsp;c&nbsp;⇾&nbsp;&lt;|d&nbsp;&nbsp;t|&gt;|&gt;</mtext>'
Copy link
Copy Markdown
Contributor Author

@mmatera mmatera Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In WMA, MathMLForm produces Unicode characters, but encoded as &#charcode;. For example,

In[1]:= x->b //MathMLForm                                                       

Out[1]//MathMLForm= 
   <math>
    <mrow>
     <mi>x</mi>
     <semantics>
      <mo>&#8594;</mo>
      <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
     </semantics>
     <mi>b</mi>
    </mrow>
   </math>

Notice that -> was converted into <mo>&#8594;</mo>. We should go over this in another round.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going over in another round is fine. But here, instead of testing wrong behavior, let's comment out the test.

When we have the test correct, it would get uncommented.

@mmatera mmatera marked this pull request as ready for review March 25, 2026 15:59
def eval_ToString(
expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation
) -> String:
from mathics.format.render.encoding import EncodingNameError
Copy link
Copy Markdown
Member

@rocky rocky Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into what's causing the circular import, and this is a mess.

Everything in mathics.form.render is imported inside __init__. And in doing that, rendering for OutputForm needs this eval routine, which then imports something else in mathics.form.render.

The idea behind the automated dynamic import in mathics.form.render was that one could just drop in new rendering routines. mathics.form.render.encoding is not a renderer, so it should not be imported.

Doing this here is just working around a design flaw elsewhere.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably mathics.format.render.encoding should be inside mathics.eval

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 25, 2026

  • Make that CharacterEncoding option in ToString works as expected

This does not fully do this. In this git branch:

In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a ≥ b"

PR #1749 does.

Aside from it being declared incomplete, e.g., there are hard-coded tables, it is also a bit unfocused. Is there way we can break this up into smaller pieces to go over them individually?

For example, just handling code pages for MS Windows is a suggestion for breaking this down into a smaller part that can be reviewed in isolation.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 25, 2026

  • Make that CharacterEncoding option in ToString works as expected

This does not fully do this. In this git branch:

In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a ≥ b"

PR #1749 does.

Aside from it being declared incomplete, e.g., there are hard-coded tables, it is also a bit unfocused. Is there way we can break this up into smaller pieces to go over them individually?

For example, just handling code pages for MS Windows is a suggestion for breaking this down into a smaller part that can be reviewed in isolation.

  • Make that CharacterEncoding option in ToString works as expected

This does not fully do this. In this git branch:

In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a ≥ b"

PR #1749 does.

Aside from it being declared incomplete, e.g., there are hard-coded tables, it is also a bit unfocused. Is there way we can break this up into smaller pieces to go over them individually?

For example, just handling code pages for MS Windows is a suggestion for breaking this down into a smaller part that can be reviewed in isolation.

This is because I am using a minimal encoding table coded by hand. The next step is to pick the conversion tables from Mathics3-Scanner, as in #1749

text = boxes.to_text(evaluation=evaluation)
return String(text)

boxes = format_element(expr, evaluation, form)
Copy link
Copy Markdown
Member

@rocky rocky Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the final idea is that the strings in format_element are going to get converted, then I think this is approaching this the wrong way.

Instead, format_element needs to take the parameters expr, form, and encoding to produce boxes that have the appropriate strings in them initially.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but this doesn't align with how the experiments I showed you suggest WMA works. It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied. And if you do the conversion earlier, a double conversion spoils the result.
Handling encoding at the level of format_element is like to modify the underlying structure of a Graphics object, because you know in the end it is going to be converted into a PNG file.

Copy link
Copy Markdown
Member

@rocky rocky Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but this doesn't align with how the experiments I showed you suggest WMA works.

I did not find anywhere in those experiments that there was a string that was encoded one way, and inside ToString, it got reencoded, as opposed to being encoded correctly initially.

It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied.

That is not at issue here. What is at issue here is taking a string that was wrongly encoded and re-encoding it.

Consider this example where I set a breakpoint at the location we are discussing:

$ mathics3
...
In[1]:= ToString[a >= b, CharacterEncoding -> "ASCII"]
(/tmp/Mathics3/mathics-core/mathics/eval/strings.py:30:5 @46): eval_ToString
-- 30     try:
(trepan3k) list
 25    	    expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation
 26    	) -> String:
 27    	
 28    	    boxes = format_element(expr, evaluation, form)
 29    	    breakpoint()
 30  ->	    try:
 31    	        return String(boxes.to_text(evaluation=evaluation, encoding=encoding))
 32    	    except EncodingNameError:
 33    	        # Mimic the WMA behavior. In the future, we can implement the mechanism
 34    	        # with encodings stored in .m files, and give a chance with it.
(trepan3k) boxes.elements
(<Expression: <Symbol: System`PaneBox>[<String: ""a ≥ b"">]>, <Expression: <Symbol: ...

<String: ""a ≥ b""> is wrong. That should be <String: ""a >= b"">.

And if you do the conversion earlier, a double conversion spoils the result. Handling encoding at the level of format_element is like to modify the underlying structure of a Graphics object, because you know in the end it is going to be converted into a PNG file.

This is not relevant here. We started with a Mathics3 Expression, and inside format_element, this expression got turned into an incorrect string, because encoding information indicating that strings are supposed to be ASCII was not respected inside format_element.

Another viable solution might be to have format_element not convert the expression a >= b to a String, and leave it as an Expression for later. But, I am not sure that is possible or correct. I believe only that what is done is incorrect and there's no evidence right now that WMA is reencoding strings instead of encoding them correctly initially.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<String: ""a ≥ b""> is wrong. That should be <String: ""a >= b"">.

I have been looking again this, and again, this is a central misunderstanding: as I see this, the line 28

    	    boxes = format_element(expr, evaluation, form)

must return a boxed expression that uses the internal representation (Unicode/UTF-8). Then, the result <String: ""a ≥ b""> is correct. The encoding is applied in line 31

    	        return String(boxes.to_text(evaluation=evaluation, encoding=encoding))

which takes the box expression and converts it into a Python string, in the request encoding.

The advantage of this approach is that all the codepage translation machinary is completely localized in one module. The drawback is that we have to scan each character to see if we need to translate it. But this is how WMA does it, and I guess they developers had very good reasons to do in this way.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 25, 2026

ToString[a>=b, CharacterEncoding->"ASCII"]

With the last changes, we are loading the encoding from Mathics3-scanner, and the case you mention is also covered:

In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a >= b"

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 25, 2026

ToString[a>=b, CharacterEncoding->"ASCII"]

With the last changes, we are loading the encoding from Mathics3-scanner, and the case you mention is also covered:

In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a >= b"

Comparing the amount of code needed to do this and the conceptual complexity, I'd say #1749 is cleaner, if not also a lot less code, if making this example work was all that we wanted to accomplish.

I think the code here is making some point or doing something with respect to handling Microsoft Windows code pages, and that is a good thing.

The problem, though, is that I am having a hard time understanding exactly what this is and how to isolate it from the reencoding part, which I think is misguided.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 25, 2026

Comparing the amount of code ...

In support of this, this PR has 20 files changed, 14 commits, 93 lines added, and -64 removed. And I suspect there should be more changes.

#1749 has 5 commits, 6 files changed, 73 lines added, and 43 removed.

Admittedly, though its scope may be narrower than what this attempts. But in that case, I'd say commit #1749 and then build on that here.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 25, 2026

In support of this, this PR has 20 files changed, 14 commits, 93 lines added, and -64 removed. And I suspect there should be more changes.

I quite conviced that #1749 is shorter but wrong. Just I do not have the time at this moment to give a more complete explanation of why. Later, I will try to do that.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 25, 2026

A box expression is a representation of how an expression should look like, disregarding which front end, encoding or file format are you use it for rendering it. As I show you before, the output of ToBoxes on any expression is independent of the encoding. What could be confuse is that when you render the Box Expression as it it were an expression, the result is printed in the system encoding. In a similar way, ToString takes a box expression and convert it into a string, with a given encoding. But the box expression itself is build of strings in the internal encoding.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 25, 2026

Again, this is the example that shows that at least in WMA, encoding happens after formatting and MakeBoxes:

In[1]:= $CharacterEncoding                                                                                                                                                    

Out[1]= UTF-8

The initial encoding is UTF-8

Now, let's take one expression that, in formatting, produces an special character:

In[2]:= box1=EngineeringForm[1.*^32]//OutputForm//ToBoxes                                                                                                                     

                                                                        30
Out[2]= InterpretationBox[PaneBox["         30\n100. × 10"], 100. × 10  , Editable -> False]

Now, let's change the encoding, and evaluate the same expression:

In[3]:= $CharacterEncoding="ASCII"                                                                                                                                            

Out[3]= ASCII

In[4]:= box2=EngineeringForm[1.*^32]//OutputForm//ToBoxes                                                                                                                     

                                                                      30
Out[4]= InterpretationBox[PaneBox["         30\n100. x 10"], 100. x 10  , Editable -> False]

Notice that now, when the box expression is "rendered" , "[Times]" is replaced by "x". However,

In[5]:= box1==box2                                                                                                                                                            

Out[5]= True

This happens because, internally, both Box expressions are exactly the same, disregarding the encoding.

If we look inside, both are printed in the current encoding:

In[6]:= box1                                                                                                                                                                  

                                                                      30
Out[6]= InterpretationBox[PaneBox["         30\n100. x 10"], 100. x 10  , Editable -> False]

In[7]:= box2                                                                                                                                                                  

                                                                      30
Out[7]= InterpretationBox[PaneBox["         30\n100. x 10"], 100. x 10  , Editable -> False]

Going back to UTF-8, both Box expressions are now shown in the original encoding:


In[8]:= $CharacterEncoding="UTF-8"                                                                                                                                           

Out[8]= UTF-8

In[9]:= box1                                                                                                                                                                 

                                                                         30
Out[9]= InterpretationBox[PaneBox["         30\n100. × 10"], 100. × 10  , Editable -> False]

In[10]:= box2                                                                                                                                                                 

                                                                         30
Out[10]= InterpretationBox[PaneBox["         30\n100. × 10"], 100. × 10  , Editable -> False]

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 26, 2026

Again, this is the example that shows that at least in WMA, encoding happens after formatting and MakeBoxes:

We keep going over the same thing. This is not in dispute. I think I've said this twice before.

This PR reencodes a string that shouldn't have gotten in there. #1749 influences the renderer erroneously called from Form handling, so that it encodes the string properly.

The most precise fix is in understanding how that string gets in there in the first place, and adjusting that.

Doing a little investigation, the culprit seems to be https://github.com/Mathics3/mathics-core/blob/master/mathics/builtin/forms/print.py#L260

@is_print_form_callback("System`OutputForm")
def eval_makeboxes_outputform(expr: BaseElement, evaluation: Evaluation, **kwargs):
    """
    Build a 2D representation of the expression using only keyboard characters.
    """

    text_outputform = str(render_output_form(expr, evaluation, **kwargs))  # probably wrong
    pane = PaneBox(String('"' + text_outputform + '"'))  # probably wrong
    return InterpretationBox(
        pane, Expression(SymbolOutputForm, expr), **{"System`Editable": SymbolFalse}
    )

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 26, 2026

Again, this is the example that shows that at least in WMA, encoding happens after formatting and MakeBoxes:

We keep going over the same thing. This is not in dispute. I think I've said this twice before.

OK, but then, what does an encoding parameter in format?

The most precise fix is in understanding how that string gets in there in the first place, and adjusting that.

Fully agree with this

Doing a little investigation, the culprit seems to be https://github.com/Mathics3/mathics-core/blob/master/mathics/builtin/forms/print.py#L260

Also agree with this. OutputForm produces a (2D in WMA) textual representation of an expression, but using the internal representation. When I wrote this refactor, I was not fully aware of this detail, and it seemed like a way to reproduce the behavior expected in the tests. Also, removing it now is easier than it was before the refactor (in version 9.0)

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 26, 2026

We keep going over the same thing. This is not in dispute. I think I've said this twice before.

OK, but then, what does an encoding parameter in format?

I explained this, too, a couple of times. It was a workaround, same as this PR.

The most precise fix is in understanding how that string gets in there in the first place, and adjusting that.

Fully agree with this

Ok. So then we should go with the longer-term solution. Right now, I am tempted to punt on both of these PRs until after a release.

There is still a lot to do in boxing and rendering. And I think that should be continued after release.

@mmatera Your thoughts?

@rocky rocky marked this pull request as draft March 26, 2026 15:04
@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

I am working on the last tweaks, in order to complete this discussion.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 27, 2026

I am working on the last tweaks, in order to complete this discussion.

As I look at mathics.format.form, I see other uses of str() that probably need to be addressed too.

And then there is the issue of adjusting the output correctly, or adjusting the tests to test against a coding-independent representation.

And dealing properly with what in conventional terms is a called a CodePage, but in Mathematica is called a CharacterEncoding.

All of these issues to be fixed properly take a bit more work.

Therefore, I think we should do this slowly and more carefully in another pass after release.

{
"×": r" x ",
operator_to_unicode["Times"]: r" x ",
"": r"\[DifferentialD]",
Copy link
Copy Markdown
Member

@rocky rocky Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 or F74C is the WL encoding of DifferentialD mentioned in named-characters.yml

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, all the letter-like characters are missing in the current approach. However, I guess this should be handled by adding a table that takes this into account.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

@rocky, I think with the last changes I covered some of your observations:

As I look at mathics.format.form, I see other uses of str() that probably need to be addressed too.

Not sure what do you refer with this. What I did now is to remove all the references to encoding in the box making routines.

And then there is the issue of adjusting the output correctly, or adjusting the tests to test against a coding-independent representation.

Now I made some progress on this. MathML tests now test that the output is closer to the WMA output: all the non-ansi characters are converted to escaped char codes: for example,

Out[3]//MathMLForm= <math>
                     <ms>&#8747;f[x]&#8518;x</ms>
                    </math>
In[1]:= "\[Integral]f[x]\[DifferentialD]x"//MathMLForm
Out[1]//MathMLForm= <mtext>&#8747;f[x]&#119889;x</mtext>

format tests were adjusted accordingly. Also, doctests now handle comparisons in the system encoding instead of being restricted to ASCII. With some adjustments in the docstrings, now the doctests should work with any codepage.

And dealing properly with what in conventional terms is a called a CodePage, but in Mathematica is called a CharacterEncoding.

If there are still more things to cover this issues, now they are more localized: it requires just to improve a function in a module, instead of changing code in many unrelated modules.

All of these issues to be fixed properly take a bit more work.

I think we can do it in a couple of other rounds.

Therefore, I think we should do this slowly and more carefully in another pass after release.

OK


>> a -> 1 + 2
= a -> 3
= a 3
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this change just as an example. In principle, changing all the docstrings that involve Unicode symbols would make the docpipeline work with any encoding. I didn't do that because I would like to have some feedback before facing this task.
Also, I think that another possibility would be to use named characters instead of Unicode in these expected lines. But for it, I would also need to adjust some code in the doctest parser. @rocky, thoughts?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmatera I thought about this for several minutes, and right now I just don't feel confident in suggesting anything one way or another. It might be something to talk over and discuss. For example, it might be that we decide to try one thing on a small scale, see how it goes, and then try another.

Is there some way we can discuss in a manner other than PR review comments?


options = {
"CharacterEncoding": '"Unicode"',
"CharacterEncoding": "$CharacterEncoding",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ToString, the default CharacterEncoding should be $CharacterEncoding

if result is None:
return None

try:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that when expressions are formatted as text, the encoding is always applied. With this change, if we specify that the encoding is "ASCII", all the tests must match with ASCII outputs.

"UTF-8": {},
}[encoding]
except KeyError:
raise EncodingNameError
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of raising this exception, we could try to create an evaluation object and load the encodings from .m files in some special folder, like in WMA. I plan to do that in another round, which would be relatively easy.

if not show_string_characters:
value = value[1:-1]
return value
return encode_string_value(value, options["encoding"])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know that "encoding" has always been passed in the options dictionary? Should this be options.get instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this way to check that indeed we passed the option. But yes, we can use get instead.

'stream = StringToStream["1.523E-19"]; Read[stream, Real]',
None,
"1.523×10^-19",
"1.523 x 10^-19",
Copy link
Copy Markdown
Member

@rocky rocky Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you were going to rewrite so that we aren't testing impossible behavior?

Such as testing at the encoding-independent level.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but to do that, I need to do it in several steps, in order to avoid a much larger PRs. The plan is go over this in the next round.

Copy link
Copy Markdown
Member

@rocky rocky Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we break up this PR into smaller, self-contained conceptual pieces?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, because when I tried to do it in smaller pieces, I was not able to make you understand what I was doing. Maybe now I can split the part of MathML, then the encoding function, then docpipeline, then remove the internal encoding and the other fixes.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 27, 2026

@rocky, I think with the last changes I covered some of your observations:

As I look at mathics.format.form, I see other uses of str() that probably need to be addressed too.

Not sure what do you refer with this. What I did now is to remove all the references to encoding in the box making routines.

You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there. That option is just hidden inside **kwargs. I consciously split that out because explicit parameters are better than implicit parameters, but at this point, I don't really care to make an issue of it.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 27, 2026

Now I made some progress on this. MathML tests now test that the output is closer to the WMA output: all the non-ansi characters are converted to escaped char codes: for example,

I guess this is good. I don't understand if this was a user-noticable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.)

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there.

I would like to say that I took the idea from #1749, but I just didn't understand it from the code there, but from our discussion here and there.

That option is just hidden inside **kwargs. I consciously split that out because explicit parameters are better than implicit parameters, but at this point, I don't really care to make an issue of it.

I know that. However, since options to be passes among thse render functions are variable, it is hard to say which parameters would require a render function of some of the nested elements.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

Now I made some progress on this. MathML tests now test that the output is closer to the WMA output: all the non-ansi characters are converted to escaped char codes: for example,

I guess this is good. I don't understand if this was a user-noticable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.)

The problem with TeX and MathML in WMA is that it sucks, and is not usable for our purposes...

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

I guess this is good. I don't understand if this was a user-noticable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.)

The noticeable problem that this PR solves is that the CharacterEncoding parameter in ToString now does something similar to what is expected.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 27, 2026

I guess this is good. I don't understand if this was a user-noticable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.)

The noticeable problem that this PR solves is that the CharacterEncoding parameter in ToString now does something similar to what is expected.

@mmatera Let me see if I understand this correctly. So you are saying that the character code in MathMLFormat boxes should change depending on the CharacterEncoding used? For example, whether using UTF-8 versus WindowsANSI versus WindowsGreek? If so, that's a great improvement.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 27, 2026

You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there.

I would like to say that I took the idea from #1749, but I just didn't understand it from the code there, but from our discussion here and there.

You have just emphasized the problem I am having with this PR! It is hard to follow when the approach is code first, and then start to discuss, and then discover bugs, or more things to add, and rewrite and code some more ... and then discuss, ...

And #1749 is a lot smaller than this even in its earlier versions. So presumably that would have been easier to do :-)

At this point, I'd like to break this up into all the separable ideas contained in this code and go over them one by one and get them merged in.

For example, one problem with the code in the master branch is that the character-encoding information from built-in functions like ToString has to propagate down into the render routine as an option, e.g. via **kwargs.

That is an example of one self-contained bug in the master-branch code that can be fixed in isolation from making proper use of the option.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

I guess this is good. I don't understand if this was a user-noticable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.)

The noticeable problem that this PR solves is that the CharacterEncoding parameter in ToString now does something similar to what is expected.

@mmatera Let me see if I understand this correctly. So you are saying that the character code in MathMLFormat boxes should change depending on the CharacterEncoding used? For example, whether using UTF-8 versus WindowsANSI versus WindowsGreek? If so, that's a great improvement.

According to my experiments, in WMA, MathML is insensitive to the encoding:

In[1]:= a->b//MathMLForm                                                        

Out[1]//MathMLForm= 
   <math>
    <mrow>
     <mi>a</mi>
     <semantics>
      <mo>&#8594;</mo>
      <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
     </semantics>
     <mi>b</mi>
    </mrow>
   </math>

In[2]:= $CharacterEncoding="ASCII"                                              

Out[2]= ASCII

In[3]:= a->b//MathMLForm                                                        

Out[3]//MathMLForm= 
   <math>
    <mrow>
     <mi>a</mi>
     <semantics>
      <mo>&#8594;</mo>
      <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
     </semantics>
     <mi>b</mi>
    </mrow>
   </math>

In[4]:= $CharacterEncoding="WindowsANSI"                                                                                                                                      

Out[4]= WindowsANSI

In[5]:= a->b//MathMLForm                                                                                                                                                      

Out[5]//MathMLForm= <math>
                     <mrow>
                      <mi>a</mi>
                      <semantics>
                       <mo>&#8594;</mo>
                       <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
                      </semantics>
                      <mi>b</mi>
                     </mrow>
                    </math>

In[6]:= $CharacterEncoding="Klingon"                                                                                                                                          

Out[6]= Klingon

In[7]:= a->b//MathMLForm                                                                                                                                                     

Out[7]//MathMLForm= <math>
                      <mrow>
                       <mi>a</mi>
                       <semantics>
                        <mo>&#8594;</mo>
                        <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
                       </semantics>
                       <mi>b</mi>
                      </mrow>
                     </math>

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

I guess this is good. I don't understand if this was a user-noticable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.)

The noticeable problem that this PR solves is that the CharacterEncoding parameter in ToString now does something similar to what is expected.

@mmatera Let me see if I understand this correctly. So you are saying that the character code in MathMLFormat boxes should change depending on the CharacterEncoding used? For example, whether using UTF-8 versus WindowsANSI versus WindowsGreek? If so, that's a great improvement.

According to my experiments, in WMA, MathML is insensitive to the encoding:

In[1]:= a->b//MathMLForm                                                        

Out[1]//MathMLForm= 
   <math>
    <mrow>
     <mi>a</mi>
     <semantics>
      <mo>&#8594;</mo>
      <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
     </semantics>
     <mi>b</mi>
    </mrow>
   </math>

In[2]:= $CharacterEncoding="ASCII"                                              

Out[2]= ASCII

In[3]:= a->b//MathMLForm                                                        

Out[3]//MathMLForm= 
   <math>
    <mrow>
     <mi>a</mi>
     <semantics>
      <mo>&#8594;</mo>
      <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
     </semantics>
     <mi>b</mi>
    </mrow>
   </math>

In[4]:= $CharacterEncoding="WindowsANSI"                                                                                                                                      

Out[4]= WindowsANSI

In[5]:= a->b//MathMLForm                                                                                                                                                      

Out[5]//MathMLForm= <math>
                     <mrow>
                      <mi>a</mi>
                      <semantics>
                       <mo>&#8594;</mo>
                       <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
                      </semantics>
                      <mi>b</mi>
                     </mrow>
                    </math>

In[6]:= $CharacterEncoding="Klingon"                                                                                                                                          

Out[6]= Klingon

In[7]:= a->b//MathMLForm                                                                                                                                                     

Out[7]//MathMLForm= <math>
                      <mrow>
                       <mi>a</mi>
                       <semantics>
                        <mo>&#8594;</mo>
                        <annotation encoding='Mathematica'>&quot;\[Rule]&quot;</annotation>
                       </semantics>
                       <mi>b</mi>
                      </mrow>
                     </math>

This is probably also wrong, because the browser that reads it could not handle these characters in its codepage. But this is something to handle in another round.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there.

I would like to say that I took the idea from #1749, but I just didn't understand it from the code there, but from our discussion here and there.

You have just emphasized the problem I am having with this PR! It is hard to follow when the approach is code: first, and then start to discuss, and then code some more ... and then discuss, ...

And #1749 is a lot smaller than this even in its earlier versions. So presumably that would have been easier to do :-)

At this point, I'd like to break this up into all the separable ideas contained in this code and go over them one by one and get them merged in.

For example, one problem with the code in the master branch is that the character-encoding information from built-in functions like ToString has to propagate down into the render routine as an option, e.g. via **kwargs.

That is an example of one self-contained bug in the master-branch code that can be fixed in isolation from making proper use of the option.

OK, today I ran out all the time I had for this. Maybe during the weekend, I can try to split up into smaller parts

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 27, 2026

OK, today I ran out all the time I had for this. Maybe during the weekend, I can try to split up into smaller parts

Sure. I understand. This PR is too large and all over the place as it is. Either split it up so we can go over smaller pieces individually and possibly find bugs or improve. But, at any rate, we can discuss more easily.

Or punt on the whole thing and wait until after release. Your choice.

@mmatera
Copy link
Copy Markdown
Contributor Author

mmatera commented Mar 27, 2026

OK, today I ran out all the time I had for this. Maybe during the weekend, I can try to split up into smaller parts

Sure. I understand. This PR is too large and all over the place as it is. Either split it up so we can go over smaller pieces individually and possibly find bugs or improve. But, at any rate, we can discuss more easily.

Or punt on the whole thing and wait until after release. Your choice.

I think that it would be great to have this in the release to complete the format/render refactor. At this point, it takes some hours of work, but I have a more or less clear sequence of steps on how to do it.

@rocky
Copy link
Copy Markdown
Member

rocky commented Mar 27, 2026

OK, today I ran out all the time I had for this. Maybe during the weekend, I can try to split up into smaller parts

Sure. I understand. This PR is too large and all over the place as it is. Either split it up so we can go over smaller pieces individually and possibly find bugs or improve. But, at any rate, we can discuss more easily.
Or punt on the whole thing and wait until after release. Your choice.

I think that it would be great to have this in the release to complete the format/render refactor. At this point, it takes some hours of work, but I have a more or less clear sequence of steps on how to do it.

Even with everything here, format/render will not be complete. But there will be significant improvements, and since this is API breaking, it is good to get more of it in sooner rather than later.

I look forward to understanding a cleaner breakdown in terms of PRs, which follow the steps, and the changes made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants