CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

This repository contains Databricks User-Defined Functions (UDFs) that integrate with Skyflow's Data Privacy Vault for data tokenization and de-identification. The functions are designed to be registered in Unity Catalog and shared across Databricks accounts.

Architecture

Function Types

deidentify_string: SQL function with embedded Python that detects sensitive entities in unstructured text and replaces them
tokenize_from_csv: Python UDF that reads CSV data, inserts the sensitive values into a Skyflow vault, and returns the corresponding tokens

Key Components

SQL function definitions with embedded Python code for Databricks Unity Catalog
Python UDFs using Skyflow SDK for vault operations
Databricks secrets integration for secure credential management
Batch processing capabilities for CSV tokenization (25 records per batch)
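
The batching behavior mentioned above can be illustrated with a minimal sketch. It is not the repository's implementation; the helper name `batched` and the sample rows are hypothetical, and only the batch size of 25 comes from this document.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

BATCH_SIZE = 25  # batch size stated in this document


def batched(records: Sequence[T], size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


# Hypothetical sample data: 60 rows split into batches of 25, 25, and 10.
rows = [{"name": f"person-{i}"} for i in range(60)]
batch_sizes = [len(batch) for batch in batched(rows)]
print(batch_sizes)  # [25, 25, 10]
```

Each yielded batch would then be sent to the vault in a single insert call, so a failure affects at most 25 records at a time.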

Configuration Requirements

Skyflow Configuration

Functions require these Skyflow parameters:

vault_id: Skyflow vault identifier
vault_url: Skyflow vault URL prefix
account_id: Skyflow account identifier
Service account credentials file (credentials.json)

Databricks Secrets

Store Skyflow credentials in Databricks secret scopes:

API keys: Retrieve with dbutils.secrets.get()
Configuration JSON: Store as stringified JSON in a secret
Default scope names: sky-agentic-demo or demoscope
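
As a rough sketch of how a stringified configuration secret might be consumed, the snippet below parses the JSON and checks for the three Skyflow parameters listed earlier. The helper name `parse_skyflow_config`, the use of those parameter names as JSON keys, and the placeholder values are assumptions for illustration; the actual key layout expected by the UDF may differ.

```python
import json

# Assumed JSON keys, taken from the parameter names listed in this document.
REQUIRED_KEYS = ("vault_id", "vault_url", "account_id")


def parse_skyflow_config(raw: str) -> dict:
    """Parse a stringified JSON config and verify the required Skyflow keys."""
    config = json.loads(raw)
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"missing Skyflow config keys: {', '.join(missing)}")
    return config


# In a Databricks notebook the raw string would come from a secret, e.g.:
#   raw = dbutils.secrets.get(scope="sky-agentic-demo", key="<your-config-key>")
# Placeholder values are used here so the sketch runs anywhere.
raw = json.dumps({
    "vault_id": "<vault_id>",
    "vault_url": "<vault_url>",
    "account_id": "<account_id>",
})
config = parse_skyflow_config(raw)
print(sorted(config))  # ['account_id', 'vault_id', 'vault_url']
```

Validating the secret up front surfaces a misconfigured scope or key as a clear error instead of a failure deep inside a vault call.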

Function Installation

Deidentify String

Copy SQL from deidentify_string/deidentify_string.sql
Replace placeholder values for vault_id and vault_url
Execute the SQL in a Databricks query to register the function
Function signature: deidentify_string(input_text STRING, sky_api_key STRING)

Tokenize from CSV

Upload credentials.json to Databricks volume
Create configuration JSON with Skyflow details
Store configuration as Databricks secret
Copy Python code from tokenize_from_csv/tokenize_from_csv.py into notebook
Update CONFIG_SECRET_KEY_NAME and DATABRICKS_SECRETS_SCOPE_NAME
Run notebook to register UDF
Function signature: tokenizeCSV(csv_path, skyflow_table, column_map)
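
To make the role of column_map concrete, here is a small sketch of mapping CSV column names to vault column names before insertion. It uses only the standard library and inline sample data; the helper name `map_csv_rows`, the sample columns, and the choice to drop unmapped columns are assumptions, not a description of what tokenize_from_csv.py actually does.

```python
import csv
import io
from typing import Dict, List


def map_csv_rows(csv_text: str, column_map: Dict[str, str]) -> List[Dict[str, str]]:
    """Rename mapped CSV columns to their vault column names; drop unmapped columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [col for col in column_map if col not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"CSV is missing mapped columns: {missing}")
    return [
        {vault_col: row[csv_col] for csv_col, vault_col in column_map.items()}
        for row in reader
    ]


# Hypothetical CSV content; "notes" is not in the map, so it is left out.
sample_csv = "full_name,city,notes\nJane Doe,Austin,vip\nJohn Roe,Dallas,\n"
records = map_csv_rows(sample_csv, {"full_name": "name", "city": "city"})
print(records)
```

The resulting list of dictionaries has the shape one would expect to pass to a vault insert, one dictionary per CSV row.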

Dependencies

Python Packages (for tokenize_from_csv)

pandas==2.2.2
pyspark==3.5.1
skyflow==1.15.1
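
One possible way to confirm that a cluster or local environment matches these pins is to compare them against the installed package metadata. This is a sketch only; the helper name `find_mismatches` is hypothetical, and the pinned versions are the ones listed above.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Version pins as listed in this document.
PINNED = {"pandas": "2.2.2", "pyspark": "3.5.1", "skyflow": "1.15.1"}


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs from its pin to (pinned, installed)."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


for name, (pinned, installed) in find_mismatches(PINNED).items():
    print(f"{name}: pinned {pinned}, installed {installed or 'not installed'}")
```

An empty result means every listed package is installed at exactly the pinned version.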

Databricks CLI Setup

Required for secret management (the commands below install it with Homebrew):

brew tap databricks/tap
brew install databricks
databricks configure

Testing and Usage

Test Deidentify String

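# Retrieve the Skyflow API key from a Databricks secret scope
sky_api_key = dbutils.secrets.get(scope="sky-agentic-demo", key="sky_api_key")
input_text = "Hi my name is Joseph McCarron and I live in Austin TX"
# Values are interpolated directly into the SQL text, so input_text must not contain single quotes
result_df = spark.sql(f"SELECT agentic.default.deidentify_string('{input_text}', '{sky_api_key}') AS deidentified_text")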
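
Because the test above interpolates values into the SQL text, input containing a single quote would break the statement, and the API key becomes part of the query string. A hedged alternative is to pass both values as named parameters, which PySpark supports through the `args` argument of `spark.sql` in version 3.4 and later. The helper name `build_deidentify_call` and the sample text are hypothetical, and the sketch assumes the runtime accepts parameter markers as arguments to this function.

```python
from typing import Dict, Tuple


def build_deidentify_call(
    input_text: str,
    sky_api_key: str,
    function_name: str = "agentic.default.deidentify_string",
) -> Tuple[str, Dict[str, str]]:
    """Return a SQL string with named parameter markers and the matching args."""
    query = (
        f"SELECT {function_name}(:input_text, :sky_api_key) "
        "AS deidentified_text"
    )
    return query, {"input_text": input_text, "sky_api_key": sky_api_key}


# Sample text with apostrophes, which would break plain string interpolation.
query, args = build_deidentify_call("It's Joseph's address in Austin TX", "<sky_api_key>")
print(query)
# In a Databricks notebook (not executed here):
#   result_df = spark.sql(query, args=args)
```

Only the function name is formatted into the SQL text; the user-supplied text and the key travel separately as arguments.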

Test Tokenize CSV

SELECT tokenizeCSV("/path/to/file.csv", "persons", MAP("csv_col", "skyflow_col")) AS tokenized_data;

Important Notes

All code is provided as sample code without warranty
Functions require active Skyflow vault with appropriate permissions
Service accounts need insert and tokenization permissions
Test and validate code before production deployment
CSV tokenization processes data in batches of 25 records
