Conversation
mathics/format/render/text.py
Outdated
| } | ||
|
|
||
|
|
||
| def encode_string_value(value: str, encoding: str): |
There was a problem hiding this comment.
This function is a just a proof of concept. The final version should look into the MathicsScanner tables
mathics/format/render/text.py
Outdated
| value = value[1:-1] | ||
|
|
||
| if "encoding" in options and options["encoding"] != "Unicode": | ||
| value = encode_string_value(value, options["encoding"]) |
There was a problem hiding this comment.
Looking at this more closely, there may be a deeper problem here.
If the Mathics3 string was encoded with Unicode under the user's control, that should remain. If Mathics3 added the Unicode because an operator appeared, that is probably wrong, and the code that added the Unicode should be fixed.
So, what is a specific scenario or situation where line 200 is triggered?
There was a problem hiding this comment.
Line 200 is triggered when the required encoding is not the standard Unicode. It happens when the SystemCharacterEncoding is not Unicode (for example by setting MATHICS_CHARACTER_ENCODING="ASCII") or when it is call from ToString with a specific CharacterEncoding option.
There was a problem hiding this comment.
Line 200 is triggered when the required encoding is not the standard Unicode. It happens when the SystemCharacterEncoding is not Unicode (for example by setting
MATHICS_CHARACTER_ENCODING="ASCII") or when it is call fromToStringwith a specificCharacterEncodingoption.
This paraphrases the if condition. I meant, what is it that is causing an operator to get converted before ToString was called. This, I think, is the real source of the problem.
|
A suggestion for a check that things are fixed would be to run |
| ("ArcTan[0, 1]", None, "Pi / 2", None), | ||
| ("ArcTan[0, -1]", None, "-Pi / 2", None), | ||
| ("Cos[1.5 Pi]", None, "-1.83697×10^-16", None), | ||
| ("Cos[1.5 Pi]", None, "-1.83697 x 10^-16", None), |
There was a problem hiding this comment.
How can I see this behavior using wolframscript?
When I try using InputForm I see this:
In[1]:= Cos[1.5 Pi] // InputForm
Out[1]//InputForm= -1.8369701987210297*^-16There was a problem hiding this comment.
Actually, you can't: InputForm uses this base*^exp notation, while OutputForm uses the 2D form
In[3]:= 2.3*^43
43
Out[3]= 2.3 10
The closest form is the EngineeringForm:
In[4]:= EngineeringForm[2.3*^32]
30
Out[4]//EngineeringForm= 230. × 10
which uses the times operator. By setting the character encoding to ASCII, we see the use of x instead of Times:
In[5]:= $CharacterEncoding="ASCII"
Out[5]= ASCII
In[6]:= EngineeringForm[2.3*^32]
30
Out[6]//EngineeringForm= 230. x 10
There was a problem hiding this comment.
Ok. Thanks for the clarification. So, in what environment do we see exactly "-1.83697 x 10^-16" as the output of "2.3*^43"?
And how am I to understand that in this test, I am matching that kind of environment?
There was a problem hiding this comment.
In the test/format/ folder, there are two files, one is format_tests.yaml which we use for the test, and another format_tests-WMA.yaml which contains the corresponding outputs in WMA.
Regarding which environment, I am running WMA 12.04 in Ubuntu 22.04.
There was a problem hiding this comment.
Regarding which environment,
I am sorry that I wasn't clear. By environment, I mean in what Wolfram program or product do we see this kind of output appearing?
There was a problem hiding this comment.
This is because we have not yet implemented a 2D OutputForm. In any case, looking at this, what we should test in most of the cases, is the match regarding the internal representation, and not regarding the format, which is just one aspect to be tested.
There was a problem hiding this comment.
Normally, I'd say yes, please let's match the internal representation. But...
Unless we know that there is a WMA internal representation that we are trying to match, or unless right now this internal representation has a concrete impact on output we can see today that matches WMA, I'd say, let's not test things (let alone exhaustive tests) of something that could very well change (even if ever so slightly) as we fill things in, such as for 2D character layout.
There was a problem hiding this comment.
OK, but that's the point. With the "internal" representation of the result, I mean FullForm, which is readily accessible. Then, aspects related to the format can be tested in focused tests. This allows us to avoid worring about the representation of \[Times], when what we want to test if whether two numerical results are the same.
There was a problem hiding this comment.
Testing against an internal representation as a quicker, more understandable, and more trackable way to ensure that FullForm output is correct is not only fine, but it is preferable. (Of course, there will be some end-to-end "blackbox" FullForm tests as well.)
Writing tests to track the internal representation for how that might be used in 2D character output that hasn't been fleshed out and is not implemented is, however, not a good idea.
That should be delayed until we have the 2D renderer in place.
There was a problem hiding this comment.
Logic note:
In "Unless x then y", when x is false, y can happen.
| text: | ||
| System`InputForm: 1.*^-6 | ||
| System`OutputForm: "1.\xD710^-6" | ||
| System`OutputForm: "1. x 10^-6" |
There was a problem hiding this comment.
I don't understand.
Here is what I see:
In[67]:= 10.^6 //TeXForm
Out[67]//TeXForm= 1.\times 10^6
In[68]:= 10.^6 //TeXForm//OutputForm
Out[68]//OutputForm= 1.\times 10^6
In[69]:= 10.^6 //TeXForm//InputForm
Out[69]//InputForm= 1.\times 10^6
How should I reconcile the output with this test with what I see in WolframScript?
There was a problem hiding this comment.
This test is for "text" form, not "TeX" form.
There was a problem hiding this comment.
What Form corresponds to "text" form?
There was a problem hiding this comment.
"OutputForm", I guess... Actually, what you want I think is to compare against ToString, which by default produces expressions formatted as OutputForm.
test/format/format_tests.yaml
Outdated
| mathml: | ||
| System`InputForm: <mtext><|a -> x, b -> y, c -> <|d -> t|>|></mtext> | ||
| System`OutputForm: '<mtext><|a -> x, b -> y, c -> <|d -> t|>|></mtext>' | ||
| System`OutputForm: '<mtext><|a ⇾ x, b ⇾ y, c ⇾ <|d ⇾ t|>|></mtext>' |
There was a problem hiding this comment.
In WMA, MathMLForm produces Unicode characters, but encoded as &#charcode;. For example,
In[1]:= x->b //MathMLForm
Out[1]//MathMLForm=
<math>
<mrow>
<mi>x</mi>
<semantics>
<mo>→</mo>
<annotation encoding='Mathematica'>"\[Rule]"</annotation>
</semantics>
<mi>b</mi>
</mrow>
</math>
Notice that -> was converted into <mo>→</mo>. We should go over this in another round.
There was a problem hiding this comment.
Going over in another round is fine. But here, instead of testing wrong behavior, let's comment out the test.
When we have the test correct, it would get uncommented.
mathics/eval/strings.py
Outdated
| def eval_ToString( | ||
| expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation | ||
| ) -> String: | ||
| from mathics.format.render.encoding import EncodingNameError |
There was a problem hiding this comment.
I looked into what's causing the circular import, and this is a mess.
Everything in mathics.form.render is imported inside __init__. And in doing that, rendering for OutputForm needs this eval routine, which then imports something else in mathics.form.render.
The idea behind the automated dynamic import in mathics.form.render was that one could just drop in new rendering routines. mathics.form.render.encoding is not a renderer, so it should not be imported.
Doing this here is just working around a design flaw elsewhere.
There was a problem hiding this comment.
Probably mathics.format.render.encoding should be inside mathics.eval
This does not fully do this. In this git branch: In[1]:= ToString[a>=b, CharacterEncoding->"ASCII"]
Out[1]= "a ≥ b"PR #1749 does. Aside from it being declared incomplete, e.g., there are hard-coded tables, it is also a bit unfocused. Is there way we can break this up into smaller pieces to go over them individually? For example, just handling code pages for MS Windows is a suggestion for breaking this down into a smaller part that can be reviewed in isolation. |
This is because I am using a minimal encoding table coded by hand. The next step is to pick the conversion tables from Mathics3-Scanner, as in #1749 |
| text = boxes.to_text(evaluation=evaluation) | ||
| return String(text) | ||
|
|
||
| boxes = format_element(expr, evaluation, form) |
There was a problem hiding this comment.
If the final idea is that the strings in format_element are going to get converted, then I think this is approaching this the wrong way.
Instead, format_element needs to take the parameters expr, form, and encoding to produce boxes that have the appropriate strings in them initially.
There was a problem hiding this comment.
Okay, but this doesn't align with how the experiments I showed you suggest WMA works. It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied. And if you do the conversion earlier, a double conversion spoils the result.
Handling encoding at the level of format_element is like to modify the underlying structure of a Graphics object, because you know in the end it is going to be converted into a PNG file.
There was a problem hiding this comment.
Okay, but this doesn't align with how the experiments I showed you suggest WMA works.
I did not find anywhere in those experiments that there was a string that was encoded one way, and inside ToString, it got reencoded, as opposed to being encoded correctly initially.
It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied.
That is not at issue here. What is at issue here is taking a string that was wrongly encoded and re-encoding it.
Consider this example where I set a breakpoint at the location we are discussing:
$ mathics3
...
In[1]:= ToString[a >= b, CharacterEncoding -> "ASCII"]
(/tmp/Mathics3/mathics-core/mathics/eval/strings.py:30:5 @46): eval_ToString
-- 30 try:
(trepan3k) list
25 expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation
26 ) -> String:
27
28 boxes = format_element(expr, evaluation, form)
29 breakpoint()
30 -> try:
31 return String(boxes.to_text(evaluation=evaluation, encoding=encoding))
32 except EncodingNameError:
33 # Mimic the WMA behavior. In the future, we can implement the mechanism
34 # with encodings stored in .m files, and give a chance with it.
(trepan3k) boxes.elements
(<Expression: <Symbol: System`PaneBox>[<String: ""a ≥ b"">]>, <Expression: <Symbol: ...
<String: ""a ≥ b""> is wrong. That should be <String: ""a >= b"">.
And if you do the conversion earlier, a double conversion spoils the result. Handling encoding at the level of
format_elementis like to modify the underlying structure of a Graphics object, because you know in the end it is going to be converted into a PNG file.
This is not relevant here. We started with a Mathics3 Expression, and inside format_element, this expression got turned into an incorrect string, because encoding information indicating that strings are supposed to be ASCII was not respected inside format_element.
Another viable solution might be to have format_element not convert the expression a >= b to a String, and leave it as an Expression for later. But, I am not sure that is possible or correct. I believe only that what is done is incorrect and there's no evidence right now that WMA is reencoding strings instead of encoding them correctly initially.
There was a problem hiding this comment.
<String: ""a ≥ b"">is wrong. That should be<String: ""a >= b"">.
I have been looking again this, and again, this is a central misunderstanding: as I see this, the line 28
boxes = format_element(expr, evaluation, form)
must return a boxed expression that uses the internal representation (Unicode/UTF-8). Then, the result <String: ""a ≥ b""> is correct. The encoding is applied in line 31
return String(boxes.to_text(evaluation=evaluation, encoding=encoding))
which takes the box expression and converts it into a Python string, in the request encoding.
The advantage of this approach is that all the codepage translation machinary is completely localized in one module. The drawback is that we have to scan each character to see if we need to translate it. But this is how WMA does it, and I guess they developers had very good reasons to do in this way.
With the last changes, we are loading the encoding from Mathics3-scanner, and the case you mention is also covered: |
Comparing the amount of code needed to do this and the conceptual complexity, I'd say #1749 is cleaner, if not also a lot less code, if making this example work was all that we wanted to accomplish. I think the code here is making some point or doing something with respect to handling Microsoft Windows code pages, and that is a good thing. The problem, though, is that I am having a hard time understanding exactly what this is and how to isolate it from the reencoding part, which I think is misguided. |
In support of this, this PR has 20 files changed, 14 commits, 93 lines added, and -64 removed. And I suspect there should be more changes. #1749 has 5 commits, 6 files changed, 73 lines added, and 43 removed. Admittedly, though its scope may be narrower than what this attempts. But in that case, I'd say commit #1749 and then build on that here. |
I quite conviced that #1749 is shorter but wrong. Just I do not have the time at this moment to give a more complete explanation of why. Later, I will try to do that. |
|
A box expression is a representation of how an expression should look like, disregarding which front end, encoding or file format are you use it for rendering it. As I show you before, the output of ToBoxes on any expression is independent of the encoding. What could be confuse is that when you render the Box Expression as it it were an expression, the result is printed in the system encoding. In a similar way, ToString takes a box expression and convert it into a string, with a given encoding. But the box expression itself is build of strings in the internal encoding. |
|
Again, this is the example that shows that at least in WMA, encoding happens after formatting and MakeBoxes: The initial encoding is UTF-8 Now, let's take one expression that, in formatting, produces an special character: Now, let's change the encoding, and evaluate the same expression: Notice that now, when the box expression is "rendered" , "[Times]" is replaced by "x". However, This happens because, internally, both Box expressions are exactly the same, disregarding the encoding. If we look inside, both are printed in the current encoding: Going back to UTF-8, both Box expressions are now shown in the original encoding: |
We keep going over the same thing. This is not in dispute. I think I've said this twice before. This PR reencodes a string that shouldn't have gotten in there. #1749 influences the renderer erroneously called from Form handling, so that it encodes the string properly. The most precise fix is in understanding how that string gets in there in the first place, and adjusting that. Doing a little investigation, the culprit seems to be https://github.com/Mathics3/mathics-core/blob/master/mathics/builtin/forms/print.py#L260 @is_print_form_callback("System`OutputForm")
def eval_makeboxes_outputform(expr: BaseElement, evaluation: Evaluation, **kwargs):
"""
Build a 2D representation of the expression using only keyboard characters.
"""
text_outputform = str(render_output_form(expr, evaluation, **kwargs)) # probably wrong
pane = PaneBox(String('"' + text_outputform + '"')) # probably wrong
return InterpretationBox(
pane, Expression(SymbolOutputForm, expr), **{"System`Editable": SymbolFalse}
) |
OK, but then, what does an
Fully agree with this
Also agree with this. OutputForm produces a (2D in WMA) textual representation of an expression, but using the internal representation. When I wrote this refactor, I was not fully aware of this detail, and it seemed like a way to reproduce the behavior expected in the tests. Also, removing it now is easier than it was before the refactor (in version 9.0) |
I explained this, too, a couple of times. It was a workaround, same as this PR.
Ok. So then we should go with the longer-term solution. Right now, I am tempted to punt on both of these PRs until after a release. There is still a lot to do in boxing and rendering. And I think that should be continued after release. @mmatera Your thoughts? |
|
I am working on the last tweaks, in order to complete this discussion. |
As I look at And then there is the issue of adjusting the output correctly, or adjusting the tests to test against a coding-independent representation. And dealing properly with what in conventional terms is a called a CodePage, but in Mathematica is called a All of these issues to be fixed properly take a bit more work. Therefore, I think we should do this slowly and more carefully in another pass after release. |
| { | ||
| "×": r" x ", | ||
| operator_to_unicode["Times"]: r" x ", | ||
| "": r"\[DifferentialD]", |
There was a problem hiding this comment.
or F74C is the WL encoding of DifferentialD mentioned in named-characters.yml
There was a problem hiding this comment.
Also, all the letter-like characters are missing in the current approach. However, I guess this should be handled by adding a table that takes this into account.
|
@rocky, I think with the last changes I covered some of your observations:
Not sure what do you refer with this. What I did now is to remove all the references to
Now I made some progress on this. MathML tests now test that the output is closer to the WMA output: all the non-ansi characters are converted to escaped char codes: for example, format tests were adjusted accordingly. Also, doctests now handle comparisons in the system encoding instead of being restricted to ASCII. With some adjustments in the docstrings, now the doctests should work with any codepage.
If there are still more things to cover this issues, now they are more localized: it requires just to improve a function in a module, instead of changing code in many unrelated modules.
I think we can do it in a couple of other rounds.
OK |
|
|
||
| >> a -> 1 + 2 | ||
| = a -> 3 | ||
| = a ⇾ 3 |
There was a problem hiding this comment.
I put this change just as an example. In principle, changing all the docstrings that involve Unicode symbols would make the docpipeline work with any encoding. I didn't do that because I would like to have some feedback before facing this task.
Also, I think that another possibility would be to use named characters instead of Unicode in these expected lines. But for it, I would also need to adjust some code in the doctest parser. @rocky, thoughts?
There was a problem hiding this comment.
@mmatera I thought about this for several minutes, and right now I just don't feel confident in suggesting anything one way or another. It might be something to talk over and discuss. For example, it might be that we decide to try one thing on a small scale, see how it goes, and then try another.
Is there some way we can discuss in a manner other than PR review comments?
|
|
||
| options = { | ||
| "CharacterEncoding": '"Unicode"', | ||
| "CharacterEncoding": "$CharacterEncoding", |
There was a problem hiding this comment.
For ToString, the default CharacterEncoding should be $CharacterEncoding
| if result is None: | ||
| return None | ||
|
|
||
| try: |
There was a problem hiding this comment.
This ensures that when expressions are formatted as text, the encoding is always applied. With this change, if we specify that the encoding is "ASCII", all the tests must match with ASCII outputs.
| "UTF-8": {}, | ||
| }[encoding] | ||
| except KeyError: | ||
| raise EncodingNameError |
There was a problem hiding this comment.
Instead of raising this exception, we could try to create an evaluation object and load the encodings from .m files in some special folder, like in WMA. I plan to do that in another round, which would be relatively easy.
| if not show_string_characters: | ||
| value = value[1:-1] | ||
| return value | ||
| return encode_string_value(value, options["encoding"]) |
There was a problem hiding this comment.
How do we know that "encoding" has always been passed in the options dictionary? Should this be options.get instead?
There was a problem hiding this comment.
I put this way to check that indeed we passed the option. But yes, we can use get instead.
| 'stream = StringToStream["1.523E-19"]; Read[stream, Real]', | ||
| None, | ||
| "1.523×10^-19", | ||
| "1.523 x 10^-19", |
There was a problem hiding this comment.
I thought you were going to rewrite so that we aren't testing impossible behavior?
Such as testing at the encoding-independent level.
There was a problem hiding this comment.
Yes, but to do that, I need to do it in several steps, in order to avoid a much larger PRs. The plan is go over this in the next round.
There was a problem hiding this comment.
How can we break up this PR into smaller, self-contained conceptual pieces?
There was a problem hiding this comment.
Not sure, because when I tried to do it in smaller pieces, I was not able to make you understand what I was doing. Maybe now I can split the part of MathML, then the encoding function, then docpipeline, then remove the internal encoding and the other fixes.
You covered my concerns. But I also note that you were basically taking the idea in #1749. The removal of the "encoding" parameter, which seemed to be a great concern because (whatever) are still in there. That option is just hidden inside |
I guess this is good. I don't understand if this was a user-noticable problem or if it is just a nice-to-have because it matches the same output as WMA. (In some situations, for handling TeX output, I thought you had a desire to not follow WMA output.) |
I would like to say that I took the idea from #1749, but I just didn't understand it from the code there, but from our discussion here and there.
I know that. However, since options to be passes among thse render functions are variable, it is hard to say which parameters would require a render function of some of the nested elements. |
The problem with TeX and MathML in WMA is that it sucks, and is not usable for our purposes... |
The noticeable problem that this PR solves is that the CharacterEncoding parameter in |
@mmatera Let me see if I understand this correctly. So you are saying that the character code in MathMLFormat boxes should change depending on the CharacterEncoding used? For example, whether using |
You have just emphasized the problem I am having with this PR! It is hard to follow when the approach is code first, and then start to discuss, and then discover bugs, or more things to add, and rewrite and code some more ... and then discuss, ... And #1749 is a lot smaller than this even in its earlier versions. So presumably that would have been easier to do :-) At this point, I'd like to break this up into all the separable ideas contained in this code and go over them one by one and get them merged in. For example, one problem with the code in the master branch is that the character-encoding information from built-in functions like That is an example of one self-contained bug in the master-branch code that can be fixed in isolation from making proper use of the option. |
According to my experiments, in WMA, MathML is insensitive to the encoding: |
This is probably also wrong, because the browser that reads it could not handle these characters in its codepage. But this is something to handle in another round. |
OK, today I ran out all the time I had for this. Maybe during the weekend, I can try to split up into smaller parts |
Sure. I understand. This PR is too large and all over the place as it is. Either split it up so we can go over smaller pieces individually and possibly find bugs or improve. But, at any rate, we can discuss more easily. Or punt on the whole thing and wait until after release. Your choice. |
I think that it would be great to have this in the release to complete the format/render refactor. At this point, it takes some hours of work, but I have a more or less clear sequence of steps on how to do it. |
Even with everything here, format/render will not be complete. But there will be significant improvements, and since this is API breaking, it is good to get more of it in sooner rather than later. I look forward to understanding a cleaner breakdown in terms of PRs, which follow the steps, and the changes made. |
This PR covers #1678 by improving the use of
CharacterEncodingand$CharacterEncoding, to alingbetter to the behavior in WMA: all string manipulations and formatting
function works with the internal encoding, and
CharacterEncodingisonly used in rendering functions, used in the frontend and in the
ToStringevaluation.List of changes:
mathics.eval.encodingwas added, with a functionencode_string_value(value:str, encoding:str). The function takesa string in the internal encoding, and produce a string in the target
encoding.
encode_string_valuefunction is used in rendering String an Box expressionsaccording to the
encodingparameter.$CharacterEncodingis taken into account toformat the output in output text.
encodingto convert the expected resultto an specific encoding.
Missing stuff