DataContent stores Data as a UTF-16 URI string, a format not used by consumers or wire #7257

@ericstj

This is less of a "must fix" issue and more of an observation.

DataContent serializes through its Uri property.

This is a managed (UTF-16) string. So whenever a DataContent is sent on the wire, that string is transcoded to UTF-8. When one is received from the wire, it is transcoded back from UTF-8 to UTF-16.

Whenever someone reads a DataContent as data, that string will further need to be decoded from UTF-16 Base64 to the raw bytes.

So doing anything with a DataContent requires a full copy of the entire data, either to transcode it or to decode it from Base64. In other words, no one needs the UTF-16 string (other than in debugging / test scenarios), so I think it's completely extraneous and could be removed from our happy paths. We haven't heard feedback on this yet, but I suspect that if folks end up doing more with large data they might start to observe this excess memory usage / GC work. If we had customer scenarios that were DataContent-intensive, we could measure them to see whether this is a meaningful concern.
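To make the copies described above concrete, here is a minimal sketch that builds a data URI the way a managed string would hold it and reports the size of each representation. `DataUriCopies` and `Measure` are hypothetical names invented for this illustration; this is not how DataContent is implemented, only a way to see the relative sizes.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class DataUriCopies
{
    // Returns the byte size of each representation the payload passes through.
    public static (int Utf16Bytes, int Utf8Bytes, int RawBytes) Measure(byte[] raw, string mediaType)
    {
        // 1. The data URI held as a managed (UTF-16) string: two bytes per char.
        string uri = $"data:{mediaType};base64,{Convert.ToBase64String(raw)}";
        int utf16Bytes = uri.Length * sizeof(char);

        // 2. Sending on the wire: a full transcoded UTF-8 copy.
        byte[] utf8 = Encoding.UTF8.GetBytes(uri);

        // 3. Reading as data: a full decoded copy of the Base64 payload
        //    (Substring allocates yet another intermediate string here).
        int comma = uri.IndexOf(',');
        byte[] decoded = Convert.FromBase64String(uri.Substring(comma + 1));

        return (utf16Bytes, utf8.Length, decoded.Length);
    }
}
```

For a 3000-byte payload with media type `image/png`, the URI is 4022 characters long, so this reports 8044 bytes for the UTF-16 string, 4022 bytes for the UTF-8 copy, and 3000 bytes for the raw data.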

I think we can assume that folks using a DataContent are doing so in order to work with other MEAI types and serialize it, but they may not be needing to read/write the raw bytes. So I propose that instead of storing as a managed string, we store it as UTF-8 bytes. That way sending on the wire will not require an additional copy during serialization.

I think we could make this change compatibly in DataContent. We change the "required" storage to be a UTF-8, Base64-encoded ROM (ReadOnlyMemory<byte>). We make the string Uri property transcode lazily. We add a JsonConverter to avoid populating the UTF-16 string on happy paths. We could add a property for ROM access. We could add constructors that help folks construct zero-copy DataContent instances.
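A rough sketch of what that shape might look like, under stated assumptions: `Utf8DataContentSketch` and all of its members are hypothetical names made up for this illustration, not proposed API. It stores the data URI as UTF-8 bytes, transcodes to a string only on demand, decodes the payload directly from UTF-8, and writes JSON without materializing a managed string. A real JsonConverter would also need a read path, which is omitted here.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;
using System.Text.Json;

// Hypothetical type for illustration only; not the actual DataContent.
public sealed class Utf8DataContentSketch
{
    // "data:<mediaType>;base64,<payload>" stored as UTF-8 bytes.
    private readonly ReadOnlyMemory<byte> _utf8Uri;
    // UTF-16 form, created only if someone asks for it.
    private string? _uri;

    // Wraps the caller's buffer without copying it.
    public Utf8DataContentSketch(ReadOnlyMemory<byte> utf8Uri) => _utf8Uri = utf8Uri;

    // Zero-copy access to the stored UTF-8 data URI.
    public ReadOnlyMemory<byte> Utf8Uri => _utf8Uri;

    // Lazily transcodes to a managed string; happy paths never call this.
    public string Uri => _uri ??= Encoding.UTF8.GetString(_utf8Uri.Span);

    // Decodes the Base64 payload straight from UTF-8, skipping the UTF-16 string.
    public byte[] DecodeData()
    {
        ReadOnlySpan<byte> span = _utf8Uri.Span;
        int comma = span.IndexOf((byte)',');
        if (comma < 0) throw new FormatException("Not a data URI.");
        ReadOnlySpan<byte> payload = span.Slice(comma + 1);

        byte[] buffer = new byte[Base64.GetMaxDecodedFromUtf8Length(payload.Length)];
        OperationStatus status = Base64.DecodeFromUtf8(payload, buffer, out _, out int written);
        if (status != OperationStatus.Done) throw new FormatException("Invalid Base64 payload.");

        // Padding can make the buffer slightly larger than the decoded data.
        return written == buffer.Length ? buffer : buffer.AsSpan(0, written).ToArray();
    }

    // Writes the URI as a JSON string value directly from the UTF-8 bytes.
    public void WriteTo(Utf8JsonWriter writer) => writer.WriteStringValue(_utf8Uri.Span);
}
```

With this shape, a caller that already holds UTF-8 bytes from the wire can wrap them and forward them again without ever allocating the UTF-16 string; only code that reads `Uri` pays for the transcode.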

This is similar to / inspired by modelcontextprotocol/csharp-sdk#1064. If we did this, we'd probably want to tune other consumers of DataContent that care about minimizing copies to use the new API.
