DataContent stores Data as a UTF-16 URI string, a format not used by consumers or wire #7257

This is less of a "must fix" issue and more of an observation.

DataContent serializes through its `Uri` property:
https://github.com/dotnet/extensions/blob/99b3272cd18d6a0a73e6ce0d98c3dc472afa1c16/src/Libraries/Microsoft.Extensions.AI.Abstractions/Contents/DataContent.cs#L275

This is a managed string, so whenever a DataContent is sent on the wire, that string is transcoded from UTF-16 to UTF-8. When one is received from the wire, it is transcoded back from UTF-8 to UTF-16.

Whenever someone reads a DataContent as data, that string must additionally be decoded from UTF-16 Base64 to the raw bytes.

So doing anything with a DataContent requires a full copy of the entire data, either to transcode it or to decode it from Base64. In other words, *no one* needs the UTF-16 string (other than in debugging / test scenarios), so I think it's completely extraneous and could be removed from our happy paths. We haven't heard feedback on this yet, but I suspect that if folks end up doing more with large data, they might start to observe the excess memory usage / GC work. If we had customer scenarios that were DataContent-intensive, we could measure them to see whether this is a meaningful concern.
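To make the copies concrete, here is a minimal sketch that uses only BCL calls rather than DataContent itself. The `DataUriCopies` class and its method names are hypothetical and exist only for illustration; the point is that each consumer path allocates a new buffer proportional to the payload.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; this is not part of Microsoft.Extensions.AI.
public static class DataUriCopies
{
    // Holds the payload as a managed (UTF-16) data URI string.
    // Base64 uses about 4 chars per 3 raw bytes, and each char is 2 bytes in memory.
    public static string ToDataUri(ReadOnlySpan<byte> data, string mediaType) =>
        $"data:{mediaType};base64,{Convert.ToBase64String(data)}";

    // Wire path: the whole string is transcoded from UTF-16 to a new UTF-8 buffer.
    public static byte[] ToWire(string dataUri) => Encoding.UTF8.GetBytes(dataUri);

    // Data path: the Base64 portion is decoded from UTF-16 chars into a new byte[].
    // (Substring adds yet another copy here; a span-based decode could avoid that one.)
    public static byte[] ToRawBytes(string dataUri)
    {
        int comma = dataUri.IndexOf(',');
        if (comma < 0)
        {
            throw new FormatException("Not a data URI.");
        }

        return Convert.FromBase64String(dataUri.Substring(comma + 1));
    }
}
```

Neither `ToWire` nor `ToRawBytes` can hand back the string's own storage, which is why the UTF-16 form serves as an intermediate that every consumer has to convert away from.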

I think we can assume that folks using a DataContent are doing so in order to work with other MEAI types and serialize it, but they may not need to read/write the raw bytes. So I propose that instead of storing the data as a managed string, we store it as UTF-8 bytes. That way, sending on the wire will not require an additional transcoding copy during serialization.
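As a rough illustration of the wire path with UTF-8 storage, the sketch below writes an already-UTF-8 data URI with `Utf8JsonWriter`, which accepts UTF-8 string values directly. The `Utf8WireSketch` class and the `"uri"` property name are assumptions made for this example. Note that the writer still validates the input and may escape some characters depending on the configured encoder, so this removes the UTF-16 round trip rather than all processing.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical helper for illustration; not an existing MEAI API.
public static class Utf8WireSketch
{
    // Writes {"uri": "<utf8Uri>"} without ever creating a managed string for the value.
    public static byte[] WriteUri(ReadOnlySpan<byte> utf8Uri)
    {
        using var stream = new MemoryStream();
        using (var writer = new Utf8JsonWriter(stream))
        {
            writer.WriteStartObject();
            // This overload takes the value as UTF-8 bytes.
            writer.WriteString("uri", utf8Uri);
            writer.WriteEndObject();
        }

        return stream.ToArray();
    }
}
```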

I think we could make this change compatibly in DataContent. We change the "required" storage to be a UTF-8, Base64-encoded `ReadOnlyMemory<byte>`. We make the `string Uri` property transcode lazily. We add a JsonConverter to avoid populating the UTF-16 string on happy paths. We could add a property for `ReadOnlyMemory<byte>` access. We could add constructors that help folks construct zero-copy DataContent instances.
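A minimal sketch of what that storage shape might look like follows. `Utf8DataUri` and all of its members are hypothetical names chosen for illustration, not the actual or proposed DataContent API; it only shows UTF-8 storage, a lazily materialized `Uri` string, zero-copy access to the Base64 payload, and decoding straight from UTF-8 with `System.Buffers.Text.Base64`.

```csharp
#nullable enable
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

// Hypothetical type for illustration only; not the real DataContent.
public sealed class Utf8DataUri
{
    private static readonly byte[] s_marker = Encoding.ASCII.GetBytes(";base64,");

    private readonly ReadOnlyMemory<byte> _utf8Uri; // "data:<mediaType>;base64,<payload>" as UTF-8
    private readonly int _payloadStart;             // offset of the Base64 payload in _utf8Uri
    private string? _uri;                           // UTF-16 form, created only if requested

    // Wraps the caller's buffer without copying it.
    public Utf8DataUri(ReadOnlyMemory<byte> utf8Uri)
    {
        ReadOnlySpan<byte> marker = s_marker;
        int index = utf8Uri.Span.IndexOf(marker);
        if (index < 0)
        {
            throw new ArgumentException("Expected a Base64 data URI.", nameof(utf8Uri));
        }

        _utf8Uri = utf8Uri;
        _payloadStart = index + marker.Length;
    }

    // UTF-8 bytes that a serializer could write as-is.
    public ReadOnlyMemory<byte> Utf8Uri => _utf8Uri;

    // Slice of the same buffer that holds only the Base64 payload.
    public ReadOnlyMemory<byte> Base64Utf8Data => _utf8Uri.Slice(_payloadStart);

    // Lazily transcodes to UTF-16 for callers that still want a string.
    public string Uri => _uri ??= Encoding.UTF8.GetString(_utf8Uri.Span);

    // Decodes the payload directly from UTF-8 Base64 to raw bytes.
    public byte[] DecodeData()
    {
        ReadOnlySpan<byte> base64 = Base64Utf8Data.Span;
        byte[] buffer = new byte[Base64.GetMaxDecodedFromUtf8Length(base64.Length)];
        OperationStatus status = Base64.DecodeFromUtf8(base64, buffer, out _, out int written);
        if (status != OperationStatus.Done)
        {
            throw new FormatException("Invalid Base64 payload.");
        }

        return written == buffer.Length ? buffer : buffer.AsSpan(0, written).ToArray();
    }
}
```

In this shape, a caller that only forwards the content never pays for the UTF-16 string, and a caller that needs the raw bytes performs a single decode instead of a transcode followed by a decode.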

This is similar to / inspired by https://github.com/modelcontextprotocol/csharp-sdk/issues/1064. If we did this, we'd probably want to update other consumers of DataContent that care about minimizing copies to use the new API.
