feat(string): add UTF-8 string conversion and validation functions by bobtista · Pull Request #2528 · TheSuperHackers/GeneralsGameCode

bobtista · 2026-04-03T00:07:30Z

Relates to refactor(string): Add functions for handling UTF8 encoded strings #2045

Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.

Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: `WWLib/utf8.h` / `utf8.cpp`

- `Utf8_Num_Bytes(char lead)`: byte count of a UTF-8 character from its lead byte
- `Utf8_Trailing_Invalid_Bytes(const char* str, size_t length)`: count of invalid trailing bytes due to an incomplete multi-byte sequence
- `Utf8_Validate(const char* str)` / `Utf8_Validate(const char* str, size_t length)`: returns true if the string is valid UTF-8
- `Get_Utf8_Size(const wchar_t* src)` / `Get_Wchar_Size(const char* src)`: required buffer sizes including null terminator
- `Wchar_To_Utf8(char* dest, const wchar_t* src, size_t dest_size)`
- `Utf8_To_Wchar(wchar_t* dest, const char* src, size_t dest_size)`

Naming follows the Snake_Case convention used in WWVegas. Arguments are ordered dest, src matching memcpy convention. Implementation wraps Win32 WideCharToMultiByte / MultiByteToWideChar.
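To make the `Utf8_Trailing_Invalid_Bytes` description concrete, here is a minimal portable sketch of how such a count could be computed. The names `lead_length` and `trailing_incomplete_bytes` are hypothetical and this is not the PR's implementation; it assumes the buffer is otherwise well-formed and only inspects the tail.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sequence length implied by a lead byte, or 0 if the
// byte cannot start a sequence (for example a continuation byte).
static std::size_t lead_length(unsigned char lead)
{
	if ((lead & 0x80) == 0x00)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
	return 0;
}

// Hypothetical sketch: number of bytes at the end of str[0..length) that
// belong to a multi-byte sequence cut off before it was complete.
static std::size_t trailing_incomplete_bytes(const char* str, std::size_t length)
{
	assert(str != nullptr || length == 0);
	// Walk back over at most three continuation bytes (10xxxxxx).
	std::size_t back = 0;
	while (back < length && back < 3 &&
		(static_cast<unsigned char>(str[length - 1 - back]) & 0xC0) == 0x80)
	{
		++back;
	}
	if (back == length)
		return back; // nothing but continuation bytes, no lead byte in sight
	const std::size_t needed = lead_length(static_cast<unsigned char>(str[length - 1 - back]));
	const std::size_t have = back + 1;
	// The tail is incomplete when the lead byte promises more bytes than exist.
	return (needed > have) ? have : 0;
}
```

A caller that receives text in fixed-size chunks could use a count like this to hold back the last few bytes until the rest of the character arrives.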

`AsciiString::translate` / `UnicodeString::translate`

Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

`ThreadUtils.cpp`

Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions. Also removes the manual dest[len] = 0 null terminator write, which was writing at the wrong position for multi-byte UTF-8 input.
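The removed code is not shown here, so the following is only an illustration of why an index taken from the source length can be the wrong place for a terminator: the number of UTF-8 bytes and the number of wide characters differ as soon as the input contains multi-byte sequences. The helper `utf16_units` is hypothetical and assumes valid UTF-8 and a 16-bit `wchar_t` as on Windows.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: UTF-16 code units needed for valid UTF-8 input.
// Every non-continuation byte starts a code point; 4-byte sequences need a
// surrogate pair, so they count twice.
static std::size_t utf16_units(const char* utf8, std::size_t length)
{
	assert(utf8 != nullptr || length == 0);
	std::size_t units = 0;
	for (std::size_t i = 0; i < length; ++i)
	{
		const unsigned char c = static_cast<unsigned char>(utf8[i]);
		if ((c & 0xC0) != 0x80)
			units += ((c & 0xF8) == 0xF0) ? 2 : 1;
	}
	return units;
}
```

For the 6-byte UTF-8 string `"h\xC3\xA9llo"` this yields 5, so a terminator written at index 6 of the wide buffer would sit one slot past the converted text.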

greptile-apps · 2026-04-03T00:12:52Z

Greptile Summary

This PR introduces a shared WWLib/utf8.h + utf8.cpp module providing UTF-8 ↔ wide-char conversion and validation helpers, then plumbs them through AsciiString::translate, UnicodeString::translate, and the two ThreadUtils.cpp string converters, replacing the old "7-bit ASCII only" loops and raw Win32 API calls.

The overall approach is sound and the buffer arithmetic is correct throughout: Get_Utf8_Size / Get_Unicode_Size intentionally exclude the null terminator, callers pass size to getBufferForRead(size) which allocates size + 1 slots, and the null is written explicitly with buf[size] = '\0'. The std::wstring usage in ThreadUtils.cpp is also safe because C++11 guarantees ret[ret.size()] == L'\0', so the wcschr C-string traversal cannot overrun.
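The `std::wstring` guarantee mentioned above can be illustrated with a small sketch. The function `replace_newlines` is a hypothetical stand-in, not the code in `ThreadUtils.cpp`; it only shows why a `wcschr` loop over a `std::wstring` buffer stays inside the string's own storage.

```cpp
#include <cassert>
#include <cwchar>
#include <string>

// Hypothetical helper: replaces every newline with a space by walking the
// buffer as a C string.
static void replace_newlines(std::wstring& text)
{
	if (text.empty())
		return;
	// Since C++11, text[text.size()] is guaranteed to read as L'\0', so
	// wcschr always meets a terminator before leaving the string's storage.
	assert(text[text.size()] == L'\0');
	wchar_t* cursor = &text[0];
	while ((cursor = std::wcschr(cursor, L'\n')) != nullptr)
	{
		*cursor = L' ';
	}
}
```

Note that an embedded `L'\0'` inside the string would stop the scan early; the sketch assumes converted text without embedded nulls.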

Key changes by file:

- WWLib/utf8.h / utf8.cpp: new UTF-8 utility module; wraps Win32 WideCharToMultiByte / MultiByteToWideChar behind type-safe helpers with #pragma once header guard
- AsciiString.cpp / UnicodeString.cpp: replaces long-broken character-by-character translation with a single correct UTF-8 round-trip
- ThreadUtils.cpp: removes fragile manual heap allocations and the misplaced dest[len] = 0 write; switches to std::wstring / std::string RAII wrappers
- CMakeLists.txt: wires utf8.cpp / utf8.h into the WWLib build

One style rule violation was found in the new utf8.cpp (inline if bodies in Utf8_Num_Bytes); everything else is clean.

Confidence Score: 5/5

PR is safe to merge; only a minor style rule violation remains.

All logic is correct: buffer sizes and null terminator placement are accurate across all three call sites, std::wstring/std::string null-termination guarantees make the ThreadUtils wcschr traversal safe, and UnicodeString::str() returns a static empty char on null m_data preventing any wcslen crash. The single remaining finding is a P2 style issue (inline if bodies in Utf8_Num_Bytes) that does not affect runtime correctness. Prior P1 concerns (overlong encoding in the validator, dead utf8.h include in GameInfo.cpp) have been marked as addressed by the developer.

utf8.cpp has a minor inline-if style issue in Utf8_Num_Bytes (lines 34-38); all other files are clean.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Core/Libraries/Source/WWVegas/WWLib/utf8.h | New header: UTF-8 helper declarations with `#pragma once`, correct per-function doc comments, and `bool`/`size_t` typed API. |
| Core/Libraries/Source/WWVegas/WWLib/utf8.cpp | New implementation: correct Win32 wrapping and structural UTF-8 validation; minor style violation: inline `if` bodies in `Utf8_Num_Bytes`. |
| Core/GameEngine/Source/Common/System/AsciiString.cpp | Replaces 7-bit ASCII loop with `Get_Utf8_Size` + `Unicode_To_Utf8`; buffer sizing and null termination are correct. |
| Core/GameEngine/Source/Common/System/UnicodeString.cpp | Mirrors AsciiString change for the wide-char direction; same correct buffer pattern. |
| Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp | Removes raw heap allocations and misplaced null write; switches to `std::wstring`/`std::string` with WWLib helpers, which is cleaner and correct. |
| Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt | Adds `utf8.cpp` and `utf8.h` to the WWLib source list; straightforward build wiring. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AsciiString::translate] -->|wcslen| B[Get_Utf8_Size]
    B -->|WideCharToMultiByte size query| W32A[Win32 WideCharToMultiByte]
    A -->|getBufferForRead size+1| BUF[String buffer]
    A -->|Unicode_To_Utf8| C[WideCharToMultiByte write]
    C --> BUF
    A -->|buf size = 0| NUL[null terminator]
    NUL --> BUF

    D[UnicodeString::translate] -->|strlen| E[Get_Unicode_Size]
    E -->|MultiByteToWideChar size query| W32B[Win32 MultiByteToWideChar]
    D -->|getBufferForRead size+1| BUF2[Wide string buffer]
    D -->|Utf8_To_Unicode| F[MultiByteToWideChar write]
    F --> BUF2
    D -->|buf size = L'0'| NUL2[null terminator]
    NUL2 --> BUF2

    G[ThreadUtils MultiByteToWideCharSingleLine] -->|Get_Unicode_Size + Utf8_To_Unicode| H[std::wstring RAII]
    H -->|wcschr replace newlines| I[return wstring]

    J[ThreadUtils WideCharStringToMultiByte] -->|Get_Utf8_Size + Unicode_To_Utf8| K[std::string RAII]
    K --> L[return string]

Prompt To Fix All With AI

This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 34-38

Comment:
**Inline `if` bodies in `Utf8_Num_Bytes`**

All four `if` statements place their `return` on the same line as the condition. This prevents precise debugger breakpoint placement on the return statements individually.

```suggestion
	if ((lead & 0x80) == 0x00)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
```

**Rule Used:** Always place if/else/for/while statement bodies on... ([source](https://app.greptile.com/review/custom-context?memory=16b9b669-b823-49be-ba5b-2bd30ff3ba6d))

**Learnt From**
[TheSuperHackers/GeneralsGameCode#2067](https://github.com/TheSuperHackers/GeneralsGameCode/pull/2067#discussion_r2706274626)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "refactor(string): Add srcLen parameter t..."

bobtista · 2026-04-03T02:32:50Z

- Fixed the `if` formatting.
- Added RFC 3629 overlong and out-of-range checks.
- Re the theoretical memory leak: can that even happen here? `set()` allocates via the engine's custom memory allocator, which crashes on failure rather than throwing, so the leak path can't really be reached, right?
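For readers unfamiliar with the RFC 3629 checks mentioned above, the following is a minimal sketch of a validator that rejects overlong encodings, UTF-16 surrogates, and code points above U+10FFFF. The function name `is_valid_utf8` is hypothetical, and this is not the code merged in the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of RFC 3629 validation over str[0..length).
static bool is_valid_utf8(const char* str, std::size_t length)
{
	assert(str != nullptr || length == 0);
	const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
	// Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
	static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	std::size_t i = 0;
	while (i < length)
	{
		const unsigned char lead = s[i];
		std::size_t bytes;
		unsigned long cp;
		if (lead < 0x80)
		{
			bytes = 1;
			cp = lead;
		}
		else if ((lead & 0xE0) == 0xC0)
		{
			bytes = 2;
			cp = lead & 0x1F;
		}
		else if ((lead & 0xF0) == 0xE0)
		{
			bytes = 3;
			cp = lead & 0x0F;
		}
		else if ((lead & 0xF8) == 0xF0)
		{
			bytes = 4;
			cp = lead & 0x07;
		}
		else
		{
			return false; // stray continuation byte or invalid lead byte
		}
		if (bytes > length - i)
			return false; // sequence is cut off by the end of the buffer
		for (std::size_t j = 1; j < bytes; ++j)
		{
			if ((s[i + j] & 0xC0) != 0x80)
				return false; // expected a continuation byte (10xxxxxx)
			cp = (cp << 6) | (s[i + j] & 0x3F);
		}
		if (cp < min_cp[bytes])
			return false; // overlong encoding
		if (cp > 0x10FFFF)
			return false; // beyond the Unicode range
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return false; // UTF-16 surrogates are not valid in UTF-8
		i += bytes;
	}
	return true;
}
```

With this sketch, `"\xC0\x80"` (an overlong encoding of NUL) and `"\xF4\x90\x80\x80"` (U+110000) are both rejected, while a purely structural check on lead and continuation bytes would accept them.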


xezon · 2026-04-03T10:02:10Z

Core/GameEngine/Source/GameNetwork/GameInfo.cpp

 								DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting"));
 								break;
 							}
+							// TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name


This appears to be beyond the scope of this change. It is not described in the title. Perhaps it belongs in a separate change?


xezon

Get_Utf8_Size should not include the null terminator in its size.


xezon · 2026-04-04T17:08:11Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

+size_t Get_Wchar_Size(const char* src)
+{
+	int wchars = MultiByteToWideChar(CP_UTF8, 0, src, -1, nullptr, 0);
+	return (wchars > 0) ? (size_t)wchars : 1;


xezon · 2026-04-04T17:08:59Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

+	size_t i = 0;
+	while (i < length)
+	{
+		int bytes = Utf8_Num_Bytes(str[i]);


size_t early

xezon · 2026-04-04T17:11:10Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

+			return false;
+		for (int j = 1; j < bytes; ++j)
+		{
+			if (!Is_Trail_Byte((unsigned char)str[i + j]))


Does this need to cast?
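Whether the cast matters depends on how `Is_Trail_Byte` is written, and its signature is not shown in this thread. As a hedged illustration with two hypothetical helpers: a mask-based test gives the same answer for a plain `char`, while a range comparison relies on the unsigned value.

```cpp
#include <cassert>

// Hypothetical mask-based test taking a plain char.
static bool is_trail_masked(char c)
{
	// Masking keeps only the top two bits of the byte, so sign extension
	// during integer promotion does not change the outcome.
	return (c & 0xC0) == 0x80;
}

// Hypothetical range-based test taking an unsigned char.
static bool is_trail_range(unsigned char c)
{
	// A range comparison needs the unsigned value: a signed char holding the
	// bit pattern 0x80 would promote to -128 and fail `c >= 0x80`.
	return c >= 0x80 && c <= 0xBF;
}
```

So the cast is required for the second style and redundant, though harmless, for the first.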

xezon · 2026-04-04T17:12:54Z

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

-		delete[] dest;
-	}
+	size_t size = Get_Utf8_Size(orig);
+	std::string ret(size - 1, '\0');


Will crash if size is 0.

This interface is confusing. Get_Utf8_Size should not return the string size WITH the null terminator, because the STL never does this either.
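The failure mode behind this comment can be sketched as follows. The helper names are hypothetical; the point is only that `size - 1` on an unsigned `size_t` wraps around when the size query reports 0, and requesting that many characters is far beyond `max_size()`, which mainstream standard libraries report by throwing `std::length_error`.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical: what `size - 1` evaluates to for an unsigned size.
static std::size_t length_without_terminator_unchecked(std::size_t size_with_terminator)
{
	return size_with_terminator - 1; // wraps to the maximum value when the input is 0
}

// Hypothetical guarded variant: treats a reported size of 0 as "nothing to convert".
static std::string make_buffer_guarded(std::size_t size_with_terminator)
{
	if (size_with_terminator == 0)
		return std::string();
	return std::string(size_with_terminator - 1, '\0');
}
```

Returning the length without the terminator, as `std::string::size()` does, removes the subtraction and with it the need for the guard.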

xezon · 2026-04-04T17:15:07Z

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

-		delete[] dest;
-	}
+	size_t size = Get_Utf8_Size(orig);
+	std::string ret(size - 1, '\0');


Why zero fill?


xezon · 2026-04-05T12:09:05Z

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

+	if (size == 0)
+		return std::wstring();
+	std::wstring ret(size, L'\0');
+	Utf8_To_Unicode(&ret[0], orig, size + 1);


I expect writing to size + 1 with std::string is not legal. It would imply that someone is allowed to write a non-null character in there.

xezon · 2026-04-05T12:17:27Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

 	if (dest_size == 0)
-		return;
+		return false;
 	int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size);


What happens if dest_size does not have enough room for a null terminator?

xezon · 2026-04-05T12:20:24Z

Core/GameEngine/Source/Common/System/UnicodeString.cpp

-	WideChar* buf = getBufferForRead((Int)(size - 1));
-	Utf8_To_Wchar(buf, src, size);
+	WideChar* buf = getBufferForRead((Int)size);
+	if (!Utf8_To_Unicode(buf, src, size + 1))


Get_Unicode_Size will count the length of src, and Utf8_To_Unicode will do it again.

The new functions should take size_t srcLen so the caller can specify exactly how long src is.
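A minimal sketch of the API shape requested here: the caller measures the source once and passes that length to both the size query and the conversion. All names are hypothetical, the target type is `char32_t` rather than `wchar_t` to keep the example portable, and the input is assumed to be valid UTF-8; the PR's functions wrap Win32 calls instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical size query: code points in src[0..srcLen), terminator excluded.
static std::size_t utf8_to_utf32_size(const char* src, std::size_t srcLen)
{
	std::size_t count = 0;
	for (std::size_t i = 0; i < srcLen; ++i)
	{
		if ((static_cast<unsigned char>(src[i]) & 0xC0) != 0x80)
			++count;
	}
	return count;
}

// Hypothetical conversion: writes at most destLen code points and no
// terminator. Returns false if dest is too small or src is cut off.
static bool utf8_to_utf32(char32_t* dest, std::size_t destLen, const char* src, std::size_t srcLen)
{
	std::size_t out = 0;
	std::size_t i = 0;
	while (i < srcLen)
	{
		const unsigned char lead = static_cast<unsigned char>(src[i]);
		std::size_t bytes = 1;
		char32_t cp = lead;
		if ((lead & 0xE0) == 0xC0)
		{
			bytes = 2;
			cp = lead & 0x1F;
		}
		else if ((lead & 0xF0) == 0xE0)
		{
			bytes = 3;
			cp = lead & 0x0F;
		}
		else if ((lead & 0xF8) == 0xF0)
		{
			bytes = 4;
			cp = lead & 0x07;
		}
		if (bytes > srcLen - i || out >= destLen)
			return false;
		for (std::size_t j = 1; j < bytes; ++j)
		{
			cp = (cp << 6) | (static_cast<unsigned char>(src[i + j]) & 0x3F);
		}
		dest[out++] = cp;
		i += bytes;
	}
	return true;
}

// The source is scanned for its length once; both calls reuse srcLen.
static std::u32string convert(const char* src)
{
	const std::size_t srcLen = std::strlen(src);
	std::u32string result(utf8_to_utf32_size(src, srcLen), U'\0');
	if (!result.empty() && !utf8_to_utf32(&result[0], result.size(), src, srcLen))
		result.clear();
	return result;
}
```

Because the conversion writes exactly `result.size()` elements and never a terminator, it also stays within what a `std::basic_string` allows callers to modify.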


feat(string): Add UTF-8 string conversion and validation functions

93c21ae

greptile-apps bot reviewed Apr 3, 2026


xezon reviewed Apr 3, 2026

xezon added the Enhancement and Minor labels Apr 3, 2026

bobtista added 3 commits April 3, 2026 12:17

refactor(string): Change Utf8_Num_Bytes parameter type from unsigned char to char and update non-Win32 error message

e9942c1

refactor(network): Move utf8.h include after bare library headers and remove UTF-8 validation from ParseAsciiStringToGameInfo

31d140d

refactor(string): Write conversion output directly into destination buffers to avoid intermediate allocations

d4dd10d

greptile-apps bot reviewed Apr 3, 2026


refactor(network): Remove unused utf8.h include from GameInfo

893df58

xezon reviewed Apr 4, 2026


bobtista added 5 commits April 4, 2026 13:43

refactor(string): Change Utf8_Num_Bytes return type from int to size_t

5161e27

refactor(string): Replace "loop" with "implementation" in translate comments

2535562

refactor(string): Rename Wchar/Utf8 conversion functions to use Unicode prefix

c110f97

refactor(string): Change Get_Utf8_Size and Get_Unicode_Size to exclude null terminator

1918826

refactor(string): Change Unicode_To_Utf8 and Utf8_To_Unicode to return bool on failure

7fac56f

xezon reviewed Apr 5, 2026

refactor(string): Add srcLen parameter to conversion functions to avoid double string scan

452a8bf
