Theoretical limitations and embeddings

Hi,

I'm a psychiatry resident and computer scientist interested in using open-cesp for synthetic dataset generation.

I was wondering about the theoretical limitations of your sampling technique for different data types, particularly high-dimensional data like text embeddings.

Also, I'm concerned about potential privacy leakage with what I call "rich" data formats, specifically text embeddings. Recent work (https://arxiv.org/html/2505.12540v2) has demonstrated that text embeddings can be reverse-engineered back to original text without even knowing the embedding model used.

Given that your technique preserves statistical characteristics while anonymizing data, I'm wondering:
1. Are there specific data types that shouldn't be cloned with this approach? If you're unsure, are there established methods to check how faithful the sampled data is and how vulnerable it is to de-anonymisation?
2. Have you evaluated it for embedding-based or high-dimensional continuous features?
3. Are there recommended preprocessing steps for such rich data?
4. Are there types of data that *could* be cloned but would be computationally intractable? Any useful ballpark figures to give me an idea of the scales here?
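
To make question 1 concrete, this is the kind of check I have in mind: a distance-to-closest-record (DCR) comparison, where synthetic rows are compared against the training rows and against a held-out baseline. It is only a minimal sketch using the Python standard library; it does not use any open-cesp API, and the function names (`dcr`, `dcr_ratio`, `demo`) and the toy Gaussian "embeddings" are my own assumptions for illustration.

```python
import math
import random
import statistics


def dcr(queries, reference):
    """For each query row, the Euclidean distance to its nearest reference row."""
    return [min(math.dist(q, r) for r in reference) for q in queries]


def dcr_ratio(synthetic, train, holdout):
    """Median DCR of synthetic rows to train, divided by the holdout-to-train baseline.

    A ratio well below 1 suggests synthetic rows sit closer to training rows
    than genuinely unseen data does, which hints at memorisation.
    """
    baseline = statistics.median(dcr(holdout, train))
    return statistics.median(dcr(synthetic, train)) / baseline


def gaussian_rows(rng, n, dim):
    """Toy stand-in for embeddings: n rows of independent standard normals."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def demo(seed=0, n=200, dim=16):
    rng = random.Random(seed)
    train = gaussian_rows(rng, n, dim)
    holdout = gaussian_rows(rng, n, dim)
    # "Leaky" synthetic data: training rows with tiny noise added.
    leaky = [[x + rng.gauss(0.0, 0.01) for x in row] for row in train]
    # "Fresh" synthetic data: new draws from the same distribution.
    fresh = gaussian_rows(rng, n, dim)
    return dcr_ratio(leaky, train, holdout), dcr_ratio(fresh, train, holdout)


if __name__ == "__main__":
    leaky_ratio, fresh_ratio = demo()
    print(f"leaky ratio: {leaky_ratio:.3f}, fresh ratio: {fresh_ratio:.3f}")
```

The brute-force nearest-neighbour search here is quadratic in the number of rows, so for real embedding sets I would expect to need an indexed search instead. I am also unsure whether Euclidean distance is the right choice for embeddings, so treat the metric as a placeholder.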

Thanks for this interesting project!

Theoretical limitations and embeddings #2752
