This is less of a "must fix" issue and more of an observation.
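DataContent serializes through its Uri property (extensions/src/Libraries/Microsoft.Extensions.AI.Abstractions/Contents/DataContent.cs, line 275 in 99b3272).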
This is a managed string. So whenever a DataContent is sent on the wire, that string will be transcoded to UTF-8. When received on the wire it will also be transcoded back from UTF-8 to UTF-16.
Whenever someone reads a DataContent as data, that string will further need to be decoded from UTF-16 Base64 to the raw bytes.
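To make the copies concrete, here is a rough sketch that mimics the string-based flow using only BCL calls. It does not touch DataContent itself; the payload size and the media type are arbitrary assumptions chosen for illustration.

```csharp
using System;
using System.Text;

// Hypothetical 1 KiB payload standing in for image or audio bytes.
byte[] payload = new byte[1024];
new Random(42).NextBytes(payload);

// The data URI held as a managed (UTF-16) string.
string uri = "data:application/octet-stream;base64," + Convert.ToBase64String(payload);

// Copy 1: sending on the wire transcodes the whole string to UTF-8.
byte[] wireBytes = Encoding.UTF8.GetBytes(uri);

// Copy 2: reading the content as data decodes the Base64 portion into a new array.
ReadOnlySpan<char> base64Chars = uri.AsSpan(uri.IndexOf(',') + 1);
byte[] decoded = new byte[base64Chars.Length / 4 * 3]; // upper bound on the decoded size
if (!Convert.TryFromBase64Chars(base64Chars, decoded, out int written))
{
    throw new FormatException("The Base64 portion of the URI is invalid.");
}

Console.WriteLine($"UTF-16 string:   {uri.Length * sizeof(char)} bytes");
Console.WriteLine($"UTF-8 wire form: {wireBytes.Length} bytes");
Console.WriteLine($"Decoded payload: {written} bytes");
```

Because a data URI is pure ASCII, the UTF-16 string occupies twice as many bytes as its UTF-8 wire form, and both are larger than the raw payload.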
So doing anything with a DataContent requires a full copy of the entire data, either to transcode it or to decode it from Base64. In other words, no one needs the UTF-16 string (outside of debugging and test scenarios), so I think it's completely extraneous and could be removed from our happy paths. We haven't heard feedback on this yet, but I suspect that as folks do more with large data they may start to notice the excess memory usage and GC work. If we had customer scenarios that were DataContent-intensive, we could measure them to see whether this is a meaningful concern.
I think we can assume that folks using a DataContent are doing so in order to work with other MEAI types and serialize it, but they may not need to read or write the raw bytes. So I propose that instead of storing the URI as a managed string, we store it as UTF-8 bytes. That way, sending on the wire will not require an additional copy during serialization.
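As a minimal sketch of what UTF-8 storage buys on the send path: the data URI can be built directly as UTF-8 bytes with `System.Buffers.Text.Base64` and handed to `Utf8JsonWriter` without ever materializing a string. The payload, media type, and the `"uri"` property name are assumptions for illustration, not the actual DataContent wire format.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;
using System.Text.Json;

// Hypothetical payload and data URI prefix.
byte[] payload = Encoding.ASCII.GetBytes("hello, DataContent");
byte[] prefix = Encoding.ASCII.GetBytes("data:text/plain;base64,");

// Build the data URI directly as UTF-8: prefix followed by Base64 text.
byte[] utf8Uri = new byte[prefix.Length + Base64.GetMaxEncodedToUtf8Length(payload.Length)];
prefix.CopyTo(utf8Uri, 0);
OperationStatus status = Base64.EncodeToUtf8(
    payload, utf8Uri.AsSpan(prefix.Length), out _, out int written);
if (status != OperationStatus.Done)
{
    throw new InvalidOperationException($"Base64 encoding failed: {status}");
}

// Write the stored UTF-8 bytes as a JSON string value; no UTF-16 string is created.
ReadOnlySpan<byte> uriSpan = utf8Uri.AsSpan(0, prefix.Length + written);
var buffer = new ArrayBufferWriter<byte>();
using (var writer = new Utf8JsonWriter(buffer))
{
    writer.WriteStartObject();
    writer.WriteString("uri", uriSpan);
    writer.WriteEndObject();
}

// Shown as text only for inspection; a transport would send buffer.WrittenMemory as-is.
Console.WriteLine(Encoding.UTF8.GetString(buffer.WrittenSpan));
```

The writer still copies the bytes into its output buffer (and applies JSON escaping), but the UTF-16 to UTF-8 transcoding step and the intermediate string disappear.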
I think we could make this change compatibly in DataContent. We change the "required" storage to a ReadOnlyMemory<byte> (ROM) holding the UTF-8, Base64-encoded URI. We make the string Uri property transcode lazily. We add a JsonConverter so that happy paths never populate the UTF-16 string. We could add a property for ROM access. We could add constructors that help folks construct zero-copy DataContent instances.
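A rough sketch of how the receive side of that proposal might look, using top-level statements rather than a real type. `ReadUriAsUtf8`, the `"uri"` property name, and the `Lazy<string>` stand-in for the `Uri` property are all hypothetical; a real change would live inside DataContent and a `JsonConverter`. `Utf8JsonReader.CopyString` requires .NET 7 or later.

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical incoming JSON; a real payload would carry a much larger Base64 body.
byte[] json = Encoding.UTF8.GetBytes("{\"uri\":\"data:text/plain;base64,aGVsbG8=\"}");

// "Required" storage: the data URI as UTF-8 bytes, exposed as ReadOnlyMemory<byte>.
ReadOnlyMemory<byte> utf8Uri = ReadUriAsUtf8(json);

// Stand-in for a lazily transcoding `string Uri` property.
var uri = new Lazy<string>(() => Encoding.UTF8.GetString(utf8Uri.Span));

Console.WriteLine($"String materialized after read: {uri.IsValueCreated}");
Console.WriteLine(uri.Value); // first access pays for the UTF-8 to UTF-16 transcode

// The read path a custom converter might take: copy the unescaped JSON string
// value straight into a byte buffer instead of calling reader.GetString().
static ReadOnlyMemory<byte> ReadUriAsUtf8(ReadOnlySpan<byte> utf8Json)
{
    var reader = new Utf8JsonReader(utf8Json);
    while (reader.Read())
    {
        if (reader.TokenType == JsonTokenType.PropertyName && reader.ValueTextEquals("uri"))
        {
            reader.Read();
            // The unescaped value is never longer than the raw token.
            byte[] buffer = new byte[reader.ValueSpan.Length];
            int written = reader.CopyString(buffer);
            // Wrapping the array in ReadOnlyMemory<byte> does not copy it.
            return buffer.AsMemory(0, written);
        }
    }
    throw new JsonException("No \"uri\" property found.");
}
```

Callers that only forward the content never touch `uri.Value`, so the UTF-16 string is only created for the debugging and test scenarios that actually ask for it.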
This is similar to, and inspired by, modelcontextprotocol/csharp-sdk#1064. If we did this, we'd probably want to update other consumers of DataContent that care about minimizing copies to use the new API.