-
-
Notifications
You must be signed in to change notification settings - Fork 65
Improve use of CharacterEncoding #1735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
e02e3a9
586d3a4
96ea4e8
cd526f5
dc9c8ad
1a53c1a
3d4b0a5
0218bd9
1324c41
f58574c
79dcf9d
e7f88e5
0595532
1670da6
d9e7eb5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -419,10 +419,15 @@ def format_output(self, expr, format=None): | |
| if result is None: | ||
| return None | ||
|
|
||
| try: | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This ensures that when expressions are formatted as text, the encoding is always applied. With this change, if we specify that the encoding is "ASCII", all the tests must match with ASCII outputs. |
||
| encoding = self.definitions.get_ownvalue("System`$CharacterEncoding").value | ||
| except AttributeError: | ||
| encoding = "Unicode" | ||
|
|
||
| try: | ||
| # With the new implementation, if result is not a ``BoxExpression`` | ||
| # then we should raise a BoxError here. | ||
| boxes = result.to_text(evaluation=self) | ||
| boxes = result.to_text(evaluation=self, encoding=encoding) | ||
| except BoxError: | ||
| self.message( | ||
| "General", "notboxes", Expression(SymbolFullForm, result).evaluate(self) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -656,10 +656,10 @@ You can also specify a list of rules: | |
| There is a "delayed" version of 'Rule' which can be specified by ':>' (similar to the relation of ':=' to '='): | ||
|
|
||
| >> a :> 1 + 2 | ||
| = a :> 1 + 2 | ||
| = a ⧴ 1 + 2 | ||
|
|
||
| >> a -> 1 + 2 | ||
| = a -> 3 | ||
| = a ⇾ 3 | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I put this change just as an example. In principle, changing all the docstrings that involve Unicode symbols would make the docpipeline work with any encoding. I didn't do that because I would like to have some feedback before facing this task.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mmatera I thought about this for several minutes, and right now I just don't feel confident in suggesting anything one way or another. It might be something to talk over and discuss. For example, it might be that we decide to try one thing on a small scale, see how it goes, and then try another. Is there some way we can discuss in a manner other than PR review comments? |
||
|
|
||
| This is useful when the right side of a rule should not be evaluated immediately (before matching): | ||
| >> {1, 2} /. x_Integer -> N[x] | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| """ | ||
| Functions to format strings in a given encoding. | ||
| """ | ||
|
|
||
| from typing import Dict | ||
|
|
||
| from mathics.core.convert.op import operator_to_ascii, operator_to_unicode | ||
|
|
||
| # Map WMA encoding names to Python encoding names | ||
| ENCODING_WMA_TO_PYTHON = { | ||
| "WindowsEastEurope": "cp1250", | ||
| "WindowsCyrillic": "cp1251", | ||
| "WindowsANSI": "cp1252", | ||
| "WindowsGreek": "cp1252", | ||
| "WindowsTurkish": "cp1254", | ||
| } | ||
|
|
||
| UNICODE_CHARACTER_TO_ASCII = { | ||
| ch: operator_to_ascii.get(name, rf"\[{name}]") | ||
| for name, ch in operator_to_unicode.items() | ||
| } | ||
|
|
||
| # These characters are used in encoding | ||
| # in WMA, and differs from what we have | ||
| # in Mathics3-scanner tables: | ||
| UNICODE_CHARACTER_TO_ASCII.update( | ||
| { | ||
| operator_to_unicode["Times"]: r" x ", | ||
| "": r"\[DifferentialD]", | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or F74C is the WL encoding of DifferentialD mentioned in named-characters.yml
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, all the letter-like characters are missing in the current approach. However, I guess this should be handled by adding a table that takes this into account. |
||
| } | ||
| ) | ||
|
|
||
|
|
||
| class EncodingNameError(Exception): | ||
| pass | ||
|
|
||
|
|
||
| def get_encoding_table(encoding: str) -> Dict[str, str]: | ||
| """ | ||
| Return a dictionary with a map from | ||
| character codes in the internal (Unicode) | ||
| representation to the request encoding. | ||
| """ | ||
| if encoding == "Unicode": | ||
| return {} | ||
|
|
||
| # In the final implementation, this should load the corresponding | ||
| # json table or an encoding file as in WMA | ||
| # SystemFiles/CharacterEncodings/*.m | ||
| # If the encoding is not available, raise an EncodingError | ||
| try: | ||
| return { | ||
| "ASCII": UNICODE_CHARACTER_TO_ASCII, | ||
| "UTF-8": {}, | ||
| }[encoding] | ||
| except KeyError: | ||
| raise EncodingNameError | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of raising this exception, we could try to create an evaluation object and load the encodings from .m files in some special folder, like in WMA. I plan to do that in another round, which would be relatively easy. |
||
|
|
||
|
|
||
| def encode_string_value(value: str, encoding: str) -> str: | ||
| """Convert an Unicode string `value` to the required `encoding`""" | ||
|
|
||
| # In WMA, encodings are readed from SystemFiles/CharacterEncodings/*.m | ||
| # on the fly. We should load them from Mathics3-Scanner tables. | ||
| encoding_table = get_encoding_table(encoding) | ||
| if not encoding_table: | ||
| return value | ||
| result = "" | ||
| for ch in value: | ||
| ch = encoding_table.get(ch, ch) | ||
| result += ch | ||
| return result | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,15 +17,23 @@ | |
| from mathics.core.expression_predefined import MATHICS3_INFINITY | ||
| from mathics.core.list import ListExpression | ||
| from mathics.core.symbols import Symbol, SymbolTrue | ||
| from mathics.eval.encoding import EncodingNameError | ||
| from mathics.format.box import format_element | ||
|
|
||
|
|
||
| def eval_ToString( | ||
| expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation | ||
| ) -> String: | ||
| boxes = format_element(expr, evaluation, form, encoding=encoding) | ||
| text = boxes.to_text(evaluation=evaluation) | ||
| return String(text) | ||
|
|
||
| boxes = format_element(expr, evaluation, form) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the final idea is that the strings in Instead,
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, but this doesn't align with how the experiments I showed you suggest WMA works. It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied. And if you do the conversion earlier, a double conversion spoils the result.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I did not find anywhere in those experiments that there was a string that was encoded one way, and inside
That is not at issue here. What is at issue here is taking a string that was wrongly encoded and re-encoding it. Consider this example where I set a breakpoint at the location we are discussing:
This is not relevant here. We started with a Mathics3 Expression, and inside Another viable solution might be to have
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I have been looking again this, and again, this is a central misunderstanding: as I see this, the line 28 must return a boxed expression that uses the internal representation (Unicode/UTF-8). Then, the result which takes the box expression and converts it into a Python string, in the request encoding. The advantage of this approach is that all the codepage translation machinary is completely localized in one module. The drawback is that we have to scan each character to see if we need to translate it. But this is how WMA does it, and I guess they developers had very good reasons to do in this way. |
||
| try: | ||
| return String(boxes.to_text(evaluation=evaluation, encoding=encoding)) | ||
| except EncodingNameError: | ||
| # Mimic the WMA behavior. In the future, we can implement the mechanism | ||
| # with encodings stored in .m files, and give a chance with it. | ||
| evaluation.message("Get", "noopen", String("encodings/" + encoding + "." + "m")) | ||
|
|
||
| return String(boxes.to_text(evaluation=evaluation, encoding="Unicode")) | ||
|
|
||
|
|
||
| def eval_StringContainsQ(name, string, patt, evaluation, options, matched): | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For
ToString, the default CharacterEncoding should be$CharacterEncoding