Skip to content

feat(string): add UTF-8 string conversion and validation functions#2528

Open
bobtista wants to merge 11 commits intoTheSuperHackers:mainfrom
bobtista:bobtista/feat/utf8-string-functions
Open

feat(string): add UTF-8 string conversion and validation functions#2528
bobtista wants to merge 11 commits intoTheSuperHackers:mainfrom
bobtista:bobtista/feat/utf8-string-functions

Conversation

@bobtista
Copy link
Copy Markdown

@bobtista bobtista commented Apr 3, 2026

Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.

Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: WWLib/utf8.h / utf8.cpp

  • Utf8_Num_Bytes(char lead) — byte count of a UTF-8 character from its lead byte
  • Utf8_Trailing_Invalid_Bytes(const char* str, size_t length) — count of invalid trailing bytes due to an incomplete multi-byte sequence
  • Utf8_Validate(const char* str) / Utf8_Validate(const char* str, size_t length) — returns true if the string is valid UTF-8
  • Get_Utf8_Size(const wchar_t* src) / Get_Wchar_Size(const char* src) — required buffer sizes including null terminator
  • Wchar_To_Utf8(char* dest, const wchar_t* src, size_t dest_size)
  • Utf8_To_Wchar(wchar_t* dest, const char* src, size_t dest_size)

Naming follows the Snake_Case convention used in WWVegas. Arguments are ordered dest, src matching memcpy convention. Implementation wraps Win32 WideCharToMultiByte / MultiByteToWideChar.

AsciiString::translate / UnicodeString::translate

Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

ThreadUtils.cpp

Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions. Also removes the manual dest[len] = 0 null terminator write, which was writing at the wrong position for multi-byte UTF-8 input.

@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR introduces a shared WWLib/utf8.h + utf8.cpp module providing UTF-8 ↔ wide-char conversion and validation helpers, then plumbs them through AsciiString::translate, UnicodeString::translate, and the two ThreadUtils.cpp string converters, replacing the old "7-bit ASCII only" loops and raw Win32 API calls.

The overall approach is sound and the buffer arithmetic is correct throughout: Get_Utf8_Size / Get_Unicode_Size intentionally exclude the null terminator, callers pass size to getBufferForRead(size) which allocates size + 1 slots, and the null is written explicitly with buf[size] = '\0'. The std::wstring usage in ThreadUtils.cpp is also safe because C++11 guarantees ret[ret.size()] == L'\0', so the wcschr C-string traversal cannot overrun.

Key changes by file:

  • WWLib/utf8.h / utf8.cpp — new UTF-8 utility module; wraps Win32 WideCharToMultiByte / MultiByteToWideChar behind type-safe helpers with #pragma once header guard
  • AsciiString.cpp / UnicodeString.cpp — replaces long-broken character-by-character translation with a single correct UTF-8 round-trip
  • ThreadUtils.cpp — removes fragile manual heap allocations and the misplaced dest[len] = 0 write; switches to std::wstring / std::string RAII wrappers
  • CMakeLists.txt — wires utf8.cpp / utf8.h into the WWLib build

One style rule violation was found in the new utf8.cpp (inline if bodies in Utf8_Num_Bytes); everything else is clean.

Confidence Score: 5/5

PR is safe to merge; only a minor style rule violation remains.

All logic is correct: buffer sizes and null terminator placement are accurate across all three call sites, std::wstring/std::string null-termination guarantees make the ThreadUtils wcschr traversal safe, and UnicodeString::str() returns a static empty char on null m_data preventing any wcslen crash. The single remaining finding is a P2 style issue (inline if bodies in Utf8_Num_Bytes) that does not affect runtime correctness. Prior P1 concerns (overlong encoding in the validator, dead utf8.h include in GameInfo.cpp) have been marked as addressed by the developer.

utf8.cpp has a minor inline-if style issue in Utf8_Num_Bytes (lines 34-38); all other files are clean.

Important Files Changed

Filename Overview
Core/Libraries/Source/WWVegas/WWLib/utf8.h New header: UTF-8 helper declarations with #pragma once, correct per-function doc comments, and bool/size_t typed API.
Core/Libraries/Source/WWVegas/WWLib/utf8.cpp New implementation: correct Win32 wrapping and structural UTF-8 validation; minor style violation — inline if bodies in Utf8_Num_Bytes.
Core/GameEngine/Source/Common/System/AsciiString.cpp Replaces 7-bit ASCII loop with Get_Utf8_Size + Unicode_To_Utf8; buffer sizing and null termination are correct.
Core/GameEngine/Source/Common/System/UnicodeString.cpp Mirrors AsciiString change for the wide-char direction; same correct buffer pattern.
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp Removes raw heap allocations and misplaced null write; switches to std::wstring/std::string with WWLib helpers — cleaner and correct.
Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt Adds utf8.cpp and utf8.h to the WWLib source list; straightforward build wiring.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AsciiString::translate] -->|wcslen| B[Get_Utf8_Size]
    B -->|WideCharToMultiByte size query| W32A[Win32 WideCharToMultiByte]
    A -->|getBufferForRead size+1| BUF[String buffer]
    A -->|Unicode_To_Utf8| C[WideCharToMultiByte write]
    C --> BUF
    A -->|buf size = 0| NUL[null terminator]
    NUL --> BUF

    D[UnicodeString::translate] -->|strlen| E[Get_Unicode_Size]
    E -->|MultiByteToWideChar size query| W32B[Win32 MultiByteToWideChar]
    D -->|getBufferForRead size+1| BUF2[Wide string buffer]
    D -->|Utf8_To_Unicode| F[MultiByteToWideChar write]
    F --> BUF2
    D -->|buf size = L'0'| NUL2[null terminator]
    NUL2 --> BUF2

    G[ThreadUtils MultiByteToWideCharSingleLine] -->|Get_Unicode_Size + Utf8_To_Unicode| H[std::wstring RAII]
    H -->|wcschr replace newlines| I[return wstring]

    J[ThreadUtils WideCharStringToMultiByte] -->|Get_Utf8_Size + Unicode_To_Utf8| K[std::string RAII]
    K --> L[return string]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 34-38

Comment:
**Inline `if` bodies in `Utf8_Num_Bytes`**

All four `if` statements place their `return` on the same line as the condition. This prevents precise debugger breakpoint placement on the return statements individually.

```suggestion
	if ((lead & 0x80) == 0x00)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
```

**Rule Used:** Always place if/else/for/while statement bodies on... ([source](https://app.greptile.com/review/custom-context?memory=16b9b669-b823-49be-ba5b-2bd30ff3ba6d))

**Learnt From**
[TheSuperHackers/GeneralsGameCode#2067](https://github.com/TheSuperHackers/GeneralsGameCode/pull/2067#discussion_r2706274626)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "refactor(string): Add srcLen parameter t..." | Re-trigger Greptile

@bobtista
Copy link
Copy Markdown
Author

bobtista commented Apr 3, 2026

  • Fixed the if formatting
  • added RFC 3629 overlong and out-of-range checks
  • RE the theoretical memory leak, can that even happen here? set() allocates via the engine's custom memory allocator which crashes on failure rather than throwing, so the leak path can't really be reached right?

DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting"));
break;
}
// TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be beyond the scope of this change. It is not describes in the title. Perhaps is a separate change?

@xezon xezon added Enhancement Is new feature or request Minor Severity: Minor < Major < Critical < Blocker labels Apr 3, 2026
bobtista added 3 commits April 3, 2026 12:17
…char to char and update non-Win32 error message
… remove UTF-8 validation from ParseAsciiStringToGameInfo
Copy link
Copy Markdown

@xezon xezon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get_Utf8_Size should not include the null terminator in its size.

size_t Get_Wchar_Size(const char* src)
{
int wchars = MultiByteToWideChar(CP_UTF8, 0, src, -1, nullptr, 0);
return (wchars > 0) ? (size_t)wchars : 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::max

size_t i = 0;
while (i < length)
{
int bytes = Utf8_Num_Bytes(str[i]);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size_t early

return false;
for (int j = 1; j < bytes; ++j)
{
if (!Is_Trail_Byte((unsigned char)str[i + j]))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to cast?

delete[] dest;
}
size_t size = Get_Utf8_Size(orig);
std::string ret(size - 1, '\0');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will crash if size is 0.

This interface is confusing. Get_Utf8_Size should not return string size WITH null terminator, because STL never does this as well.

delete[] dest;
}
size_t size = Get_Utf8_Size(orig);
std::string ret(size - 1, '\0');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why zero fill?

if (size == 0)
return std::wstring();
std::wstring ret(size, L'\0');
Utf8_To_Unicode(&ret[0], orig, size + 1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect writing to size + 1 with std::string is not legal. It would imply that someone is allowed to write a non-null character in there.

if (dest_size == 0)
return;
return false;
int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if dest_size does not have enough room for a null terminator?

WideChar* buf = getBufferForRead((Int)(size - 1));
Utf8_To_Wchar(buf, src, size);
WideChar* buf = getBufferForRead((Int)size);
if (!Utf8_To_Unicode(buf, src, size + 1))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get_Unicode_Size will count length of src, and Utf8_To_Unicode will do it again,

The new functions should take size_t srcLen to allow specify exactly how long src should be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Enhancement Is new feature or request Minor Severity: Minor < Major < Critical < Blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants