This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This repository contains Databricks User-Defined Functions (UDFs) that integrate with Skyflow's Data Privacy Vault for data tokenization and de-identification. The functions are designed to be registered in Unity Catalog and shared across Databricks accounts.
- `deidentify_string`: SQL function with embedded Python that detects and replaces sensitive entities in unstructured text
- `tokenize_from_csv`: Python UDF that reads CSV data, inserts sensitive values into the Skyflow vault, and returns tokens
- SQL function definitions with embedded Python code for Databricks Unity Catalog
- Python UDFs using Skyflow SDK for vault operations
- Databricks secrets integration for secure credential management
- Batch processing capabilities for CSV tokenization (25 records per batch)
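The 25-record batching mentioned above can be sketched in plain Python (the helper name and record shape are illustrative, not the repo's actual code):

```python
def chunked(records, batch_size=25):
    """Yield successive fixed-size batches from a list of records.

    Mirrors the CSV tokenization flow, which inserts rows into the
    Skyflow vault 25 records at a time.
    """
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# 60 input rows -> batches of 25, 25, and 10
rows = [{"name": f"person-{i}"} for i in range(60)]
batches = list(chunked(rows))
print([len(b) for b in batches])  # [25, 25, 10]
```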
Functions require these Skyflow parameters:
- `vault_id`: Skyflow vault identifier
- `vault_url`: Skyflow vault URL prefix
- `account_id`: Skyflow account identifier
- Service account credentials file (`credentials.json`)
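A configuration JSON holding these parameters might look like the following (the field names and values here are illustrative; check the actual keys expected by the code in this repo):

```json
{
  "vault_id": "a1b2c3d4e5f6",
  "vault_url": "https://example.vault.skyflowapis.com",
  "account_id": "g7h8i9j0"
}
```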
Store Skyflow credentials in Databricks secret scopes:
- API keys: use `dbutils.secrets.get()` to retrieve
- Configuration JSON: store as stringified JSON in secrets
- Default scope name patterns: `sky-agentic-demo` or `demoscope`
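Because the configuration is stored as stringified JSON, retrieving it is a two-step operation: fetch the secret, then parse it. A minimal sketch (the `dbutils` call only works inside Databricks, so it is shown as a comment; the secret key name and JSON fields are illustrative):

```python
import json

# Inside a Databricks notebook this string would come from the secret scope:
#   config_str = dbutils.secrets.get(scope="sky-agentic-demo", key="skyflow_config")
config_str = (
    '{"vault_id": "a1b2c3d4e5f6", '
    '"vault_url": "https://example.vault.skyflowapis.com", '
    '"account_id": "g7h8i9j0"}'
)

config = json.loads(config_str)
print(config["vault_id"])  # a1b2c3d4e5f6
```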
- Copy the SQL from `deidentify_string/deidentify_string.sql`
- Replace the placeholder values for `vault_id` and `vault_url`
- Execute in a Databricks query to register the function
- Function signature: `deidentify_string(input_text STRING, sky_api_key STRING)`
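The registration SQL follows the Unity Catalog pattern of a SQL function with an embedded Python body. A stripped-down sketch of that shape (the body here is a placeholder, not the repo's actual detection logic; see `deidentify_string.sql` for the real definition):

```sql
CREATE OR REPLACE FUNCTION agentic.default.deidentify_string(
  input_text STRING,
  sky_api_key STRING
)
RETURNS STRING
LANGUAGE PYTHON
AS $$
  # The real implementation calls the Skyflow detection API with
  # sky_api_key and the configured vault_id/vault_url, then returns
  # the text with sensitive entities replaced.
  return input_text
$$;
```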
- Upload `credentials.json` to a Databricks volume
- Create configuration JSON with Skyflow details
- Store configuration as Databricks secret
- Copy the Python code from `tokenize_from_csv/tokenize_from_csv.py` into a notebook
- Update `CONFIG_SECRET_KEY_NAME` and `DATABRICKS_SECRETS_SCOPE_NAME`
- Run the notebook to register the UDF
- Function signature: `tokenizeCSV(csv_path, skyflow_table, column_map)`
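The `column_map` argument maps CSV column names to Skyflow vault column names. The remapping step can be sketched with the standard library (the function name and behavior are illustrative; the repo's UDF additionally batches these records and inserts them into the vault via the Skyflow SDK):

```python
import csv
import io

def remap_csv_columns(csv_text, column_map):
    """Read CSV text and rename columns per column_map ({csv_col: skyflow_col}).

    Columns absent from column_map are dropped, reflecting the idea that
    only mapped sensitive columns are sent to the vault.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {skyflow_col: row[csv_col] for csv_col, skyflow_col in column_map.items()}
        for row in reader
    ]

sample = "full_name,city\nJoseph McCarron,Austin\n"
records = remap_csv_columns(sample, {"full_name": "name"})
print(records)  # [{'name': 'Joseph McCarron'}]
```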
- pandas==2.2.2
- pyspark==3.5.1
- skyflow==1.15.1
Required for secret management:

```shell
brew tap databricks/tap
brew install databricks
databricks configure
```

Usage example for `deidentify_string`:

```python
sky_api_key = dbutils.secrets.get(scope="sky-agentic-demo", key="sky_api_key")
input_text = "Hi my name is Joseph McCarron and I live in Austin TX"
result_df = spark.sql(f"SELECT agentic.default.deidentify_string('{input_text}', '{sky_api_key}') AS deidentified_text")
```

Usage example for `tokenizeCSV`:

```sql
SELECT tokenizeCSV("/path/to/file.csv", "persons", MAP("csv_col", "skyflow_col")) AS tokenized_data;
```

- All code is provided as sample code without warranty
- Functions require active Skyflow vault with appropriate permissions
- Service accounts need insert and tokenization permissions
- Test and validate code before production deployment
- CSV tokenization processes data in batches of 25 records