Theoretical limitations and embeddings #2752

@thiswillbeyourgithub

Hi,

I'm a psychiatry resident and computer scientist interested in using open-cesp for synthetic dataset generation.

I was wondering about the theoretical limitations of your sampling technique for different data types, particularly high-dimensional data such as text embeddings.

Also, I'm concerned about potential privacy leakage with what I call "rich" data formats, specifically text embeddings. Recent work (https://arxiv.org/html/2505.12540v2) has demonstrated that text embeddings can be reverse-engineered back to original text without even knowing the embedding model used.

Given that your technique preserves statistical characteristics while anonymizing data, I'm wondering:

  1. Are there specific data types that shouldn't be cloned with this approach? If you're unsure, are there established methods to check how faithful the sampled data is and how vulnerable it is to de-anonymisation?
  2. Have you evaluated it for embedding-based or high-dimensional continuous features?
  3. Are there recommended preprocessing steps for such rich data?
  4. Are there types of data that could be cloned in principle but would be computationally intractable in practice? Any ballpark figures would help me get a sense of the scales involved.

Thanks for this interesting project!

Labels: new (automatic label applied to new issues), question (general question about the software)
