feat(string): add UTF-8 string conversion and validation functions#2528
feat(string): add UTF-8 string conversion and validation functions#2528bobtista wants to merge 11 commits intoTheSuperHackers:mainfrom
Conversation
|
| Filename | Overview |
|---|---|
| Core/Libraries/Source/WWVegas/WWLib/utf8.h | New header: UTF-8 helper declarations with #pragma once, correct per-function doc comments, and bool/size_t typed API. |
| Core/Libraries/Source/WWVegas/WWLib/utf8.cpp | New implementation: correct Win32 wrapping and structural UTF-8 validation; minor style violation — inline if bodies in Utf8_Num_Bytes. |
| Core/GameEngine/Source/Common/System/AsciiString.cpp | Replaces 7-bit ASCII loop with Get_Utf8_Size + Unicode_To_Utf8; buffer sizing and null termination are correct. |
| Core/GameEngine/Source/Common/System/UnicodeString.cpp | Mirrors AsciiString change for the wide-char direction; same correct buffer pattern. |
| Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp | Removes raw heap allocations and misplaced null write; switches to std::wstring/std::string with WWLib helpers — cleaner and correct. |
| Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt | Adds utf8.cpp and utf8.h to the WWLib source list; straightforward build wiring. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[AsciiString::translate] -->|wcslen| B[Get_Utf8_Size]
B -->|WideCharToMultiByte size query| W32A[Win32 WideCharToMultiByte]
A -->|getBufferForRead size+1| BUF[String buffer]
A -->|Unicode_To_Utf8| C[WideCharToMultiByte write]
C --> BUF
A -->|buf size = 0| NUL[null terminator]
NUL --> BUF
D[UnicodeString::translate] -->|strlen| E[Get_Unicode_Size]
E -->|MultiByteToWideChar size query| W32B[Win32 MultiByteToWideChar]
D -->|getBufferForRead size+1| BUF2[Wide string buffer]
D -->|Utf8_To_Unicode| F[MultiByteToWideChar write]
F --> BUF2
D -->|buf size = L'0'| NUL2[null terminator]
NUL2 --> BUF2
G[ThreadUtils MultiByteToWideCharSingleLine] -->|Get_Unicode_Size + Utf8_To_Unicode| H[std::wstring RAII]
H -->|wcschr replace newlines| I[return wstring]
J[ThreadUtils WideCharStringToMultiByte] -->|Get_Utf8_Size + Unicode_To_Utf8| K[std::string RAII]
K --> L[return string]
Prompt To Fix All With AI
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 34-38
Comment:
**Inline `if` bodies in `Utf8_Num_Bytes`**
All four `if` statements place their `return` on the same line as the condition. This prevents precise debugger breakpoint placement on the return statements individually.
```suggestion
if ((lead & 0x80) == 0x00)
return 1;
if ((lead & 0xE0) == 0xC0)
return 2;
if ((lead & 0xF0) == 0xE0)
return 3;
if ((lead & 0xF8) == 0xF0)
return 4;
```
**Rule Used:** Always place if/else/for/while statement bodies on... ([source](https://app.greptile.com/review/custom-context?memory=16b9b669-b823-49be-ba5b-2bd30ff3ba6d))
**Learnt From**
[TheSuperHackers/GeneralsGameCode#2067](https://github.com/TheSuperHackers/GeneralsGameCode/pull/2067#discussion_r2706274626)
How can I resolve this? If you propose a fix, please make it concise.Reviews (5): Last reviewed commit: "refactor(string): Add srcLen parameter t..." | Re-trigger Greptile
|
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp
Outdated
Show resolved
Hide resolved
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp
Outdated
Show resolved
Hide resolved
| DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting")); | ||
| break; | ||
| } | ||
| // TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name |
There was a problem hiding this comment.
This appears to be beyond the scope of this change. It is not describes in the title. Perhaps is a separate change?
…char to char and update non-Win32 error message
… remove UTF-8 validation from ParseAsciiStringToGameInfo
…uffers to avoid intermediate allocations
xezon
left a comment
There was a problem hiding this comment.
Get_Utf8_Size should not include the null terminator in its size.
| size_t Get_Wchar_Size(const char* src) | ||
| { | ||
| int wchars = MultiByteToWideChar(CP_UTF8, 0, src, -1, nullptr, 0); | ||
| return (wchars > 0) ? (size_t)wchars : 1; |
| size_t i = 0; | ||
| while (i < length) | ||
| { | ||
| int bytes = Utf8_Num_Bytes(str[i]); |
| return false; | ||
| for (int j = 1; j < bytes; ++j) | ||
| { | ||
| if (!Is_Trail_Byte((unsigned char)str[i + j])) |
| delete[] dest; | ||
| } | ||
| size_t size = Get_Utf8_Size(orig); | ||
| std::string ret(size - 1, '\0'); |
There was a problem hiding this comment.
Will crash if size is 0.
This interface is confusing. Get_Utf8_Size should not return string size WITH null terminator, because STL never does this as well.
| delete[] dest; | ||
| } | ||
| size_t size = Get_Utf8_Size(orig); | ||
| std::string ret(size - 1, '\0'); |
…e null terminator
…n bool on failure
| if (size == 0) | ||
| return std::wstring(); | ||
| std::wstring ret(size, L'\0'); | ||
| Utf8_To_Unicode(&ret[0], orig, size + 1); |
There was a problem hiding this comment.
I expect writing to size + 1 with std::string is not legal. It would imply that someone is allowed to write a non-null character in there.
| if (dest_size == 0) | ||
| return; | ||
| return false; | ||
| int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size); |
There was a problem hiding this comment.
What happens if dest_size does not have enough room for a null terminator?
| WideChar* buf = getBufferForRead((Int)(size - 1)); | ||
| Utf8_To_Wchar(buf, src, size); | ||
| WideChar* buf = getBufferForRead((Int)size); | ||
| if (!Utf8_To_Unicode(buf, src, size + 1)) |
There was a problem hiding this comment.
Get_Unicode_Size will count length of src, and Utf8_To_Unicode will do it again,
The new functions should take size_t srcLen to allow specify exactly how long src should be.
…id double string scan
Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.
Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.
New:
WWLib/utf8.h/utf8.cppUtf8_Num_Bytes(char lead)— byte count of a UTF-8 character from its lead byteUtf8_Trailing_Invalid_Bytes(const char* str, size_t length)— count of invalid trailing bytes due to an incomplete multi-byte sequenceUtf8_Validate(const char* str)/Utf8_Validate(const char* str, size_t length)— returns true if the string is valid UTF-8Get_Utf8_Size(const wchar_t* src)/Get_Wchar_Size(const char* src)— required buffer sizes including null terminatorWchar_To_Utf8(char* dest, const wchar_t* src, size_t dest_size)Utf8_To_Wchar(wchar_t* dest, const char* src, size_t dest_size)Naming follows the
Snake_Caseconvention used in WWVegas. Arguments are ordereddest, srcmatchingmemcpyconvention. Implementation wraps Win32WideCharToMultiByte/MultiByteToWideChar.AsciiString::translate/UnicodeString::translateReplaces the broken implementations that only worked for 7-bit ASCII (marked
@todosince the original code) with proper UTF-8 conversion using the new WWLib functions.ThreadUtils.cppReplaces raw Win32 API calls in
MultiByteToWideCharSingleLineandWideCharStringToMultiBytewith the new WWLib functions. Also removes the manualdest[len] = 0null terminator write, which was writing at the wrong position for multi-byte UTF-8 input.