Open
Labels
new (automatic label applied to new issues), question (general question about the software)
Description
Hi,
I'm a psychiatry resident and computer scientist interested in using open-cesp for synthetic dataset generation.
I was wondering about the theoretical limitations of your sampling technique for different data types, particularly high-dimensional data like text embeddings.
Also, I'm concerned about potential privacy leakage with what I call "rich" data formats, specifically text embeddings. Recent work (https://arxiv.org/html/2505.12540v2) has demonstrated that text embeddings can be reverse-engineered back to original text without even knowing the embedding model used.
Given that your technique preserves statistical characteristics while anonymizing data, I'm wondering:
- Are there specific data types that shouldn't be cloned with this approach? If you're unsure, are there established methods to check how well the sampling worked and how vulnerable the output is to de-anonymisation?
- Have you evaluated it for embedding-based or high-dimensional continuous features?
- Are there recommended preprocessing steps for such rich data?
- Are there types of data that could be cloned but would be computationally intractable? Any useful ballpark figures to give me an idea of the scales here?
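To make the first question concrete, one commonly used sanity check is the distance to closest record (DCR): for each synthetic row, measure the distance to its nearest real row and look for values at or near zero, which suggest that real records were copied. The snippet below is only a minimal sketch of that idea. The function name and the toy data are hypothetical and are not part of open-cesp, and Euclidean distance is an assumption (cosine distance may suit embeddings better).

```python
import math

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    Hypothetical helper for illustration; not part of open-cesp.
    """
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy 2-D data, made up for illustration only.
real = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
synthetic = [(0.0, 0.0), (3.0, 4.0)]

dcr = distance_to_closest_record(synthetic, real)

# Synthetic rows that coincide exactly with a real row are the clearest red flag.
exact_copies = sum(1 for d in dcr if d == 0.0)
print(dcr, exact_copies)
```

A DCR near zero points to memorised records, but a large DCR is not a privacy guarantee on its own. The brute-force comparison also costs on the order of (synthetic rows × real rows × dimensions) operations, which hints at how quickly such checks scale with embedding size.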
Thanks for this interesting project!