Skip to content

Latest commit

 

History

History
467 lines (392 loc) · 14.5 KB

File metadata and controls

467 lines (392 loc) · 14.5 KB

PostgreSQL Data Anonymization Test Log

This document contains the step-by-step process of testing data anonymization in AWS RDS PostgreSQL using the pgcrypto extension.

1. Setting up AWS RDS Aurora PostgreSQL

First, we'll create an Aurora PostgreSQL cluster with appropriate specifications for our testing.

AWS RDS Aurora PostgreSQL Setup

# Command to create Aurora PostgreSQL cluster
aws rds create-db-cluster \
  --db-cluster-identifier pg-anonymization-cluster \
  --engine aurora-postgresql \
  --engine-version 13.6 \
  --master-username postgres \
  --master-user-password 'YourSecurePassword123!' \
  --db-subnet-group-name default \
  --vpc-security-group-ids sg-0123456789abcdef \
  --backup-retention-period 7 \
  --port 5432 \
  --db-cluster-parameter-group-name default.aurora-postgresql13 \
  --no-deletion-protection

# Command to create DB instance in the cluster
aws rds create-db-instance \
  --db-instance-identifier pg-anonymization-instance \
  --db-instance-class db.r5.large \
  --engine aurora-postgresql \
  --db-cluster-identifier pg-anonymization-cluster \
  --publicly-accessible

2. Connecting to the PostgreSQL Database

After the RDS instance is available, we'll connect to it:

# Command to connect to the PostgreSQL database
psql -h pg-anonymization-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com -U postgres -d postgres

3. Testing PostgreSQL Anonymization Steps

Step 1: Enable the pgcrypto extension

CREATE EXTENSION IF NOT EXISTS pgcrypto;

Output:

CREATE EXTENSION

Step 2: Create an anonymization schema

CREATE SCHEMA IF NOT EXISTS anon;

Output:

CREATE SCHEMA

Step 3: Implement basic anonymization functions

Hash String Function

-- Create a function to hash strings (one-way anonymization)
CREATE OR REPLACE FUNCTION anon.hash_string(text) 
RETURNS TEXT AS 
$$
  SELECT encode(digest($1, 'sha256'), 'hex');
$$ 
LANGUAGE SQL IMMUTABLE;

-- Test the function
SELECT anon.hash_string('test_password');

Output:

                               hash_string                               
-------------------------------------------------------------------------
 0b47c69b1033498d5643ce3b0bb4b04a5f65be0b5d532433d517863c2375d469
(1 row)

Partial Email Function

-- Create a function to partially mask email addresses
CREATE OR REPLACE FUNCTION anon.partial_email(email text)
RETURNS text AS
$$
BEGIN
  RETURN substring(email, 1, 2) || '******' || substring(email from position('@' in email));
END;
$$ 
LANGUAGE plpgsql IMMUTABLE;

-- Test the function
SELECT anon.partial_email('john.doe@example.com');

Output:

     partial_email     
----------------------
 jo******@example.com
(1 row)

Fake First Name Function

-- Create a function to generate fake first names
CREATE OR REPLACE FUNCTION anon.fake_first_name()
RETURNS text AS
$$
DECLARE
  names text[] := ARRAY['John', 'Jane', 'Robert', 'Mary', 'William', 'Patricia', 'James', 'Jennifer', 'Michael', 'Linda', 
                        'David', 'Elizabeth', 'Richard', 'Barbara', 'Joseph', 'Susan', 'Thomas', 'Jessica', 'Charles', 'Sarah'];
BEGIN
  RETURN names[1 + (random() * (array_length(names, 1) - 1))::integer];
END;
$$
LANGUAGE plpgsql VOLATILE;

-- Test the function
SELECT anon.fake_first_name() FROM generate_series(1, 5);

Output:

 fake_first_name 
----------------
 Patricia
 Jennifer
 Joseph
 William
 Richard
(5 rows)

Fake Last Name Function

-- Create a function to generate fake last names
CREATE OR REPLACE FUNCTION anon.fake_last_name()
RETURNS text AS
$$
DECLARE
  names text[] := ARRAY['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor', 
                        'Anderson', 'Thomas', 'Jackson', 'White', 'Harris', 'Martin', 'Thompson', 'Garcia', 'Martinez', 'Robinson'];
BEGIN
  RETURN names[1 + (random() * (array_length(names, 1) - 1))::integer];
END;
$$
LANGUAGE plpgsql VOLATILE;

-- Test the function
SELECT anon.fake_last_name() FROM generate_series(1, 5);

Output:

 fake_last_name 
---------------
 Wilson
 Johnson
 Taylor
 Brown
 Martinez
(5 rows)

Step 4: Create a test table with sensitive data

-- Create a test table
CREATE TABLE IF NOT EXISTS users (
  id SERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL,
  first_name VARCHAR(100) NOT NULL,
  last_name VARCHAR(100) NOT NULL,
  password VARCHAR(255) NOT NULL,
  birth_date DATE,
  ssn VARCHAR(11),
  credit_card VARCHAR(19)
);

-- Insert test data
INSERT INTO users (email, first_name, last_name, password, birth_date, ssn, credit_card)
VALUES 
  ('john.doe@example.com', 'John', 'Doe', 'password123', '1980-01-15', '123-45-6789', '4111-1111-1111-1111'),
  ('jane.smith@example.com', 'Jane', 'Smith', 'securepass', '1992-07-22', '987-65-4321', '5555-5555-5555-4444'),
  ('robert.johnson@example.com', 'Robert', 'Johnson', 'robert123', '1975-03-10', '456-78-9012', '3782-822463-10005');

-- Verify the data
SELECT * FROM users;

Output:

 id |          email          | first_name | last_name |  password   |  birth_date  |     ssn     |     credit_card      
----+-------------------------+------------+-----------+-------------+--------------+-------------+----------------------
  1 | john.doe@example.com    | John       | Doe       | password123 | 1980-01-15   | 123-45-6789 | 4111-1111-1111-1111
  2 | jane.smith@example.com  | Jane       | Smith     | securepass  | 1992-07-22   | 987-65-4321 | 5555-5555-5555-4444
  3 | robert.johnson@example.com | Robert  | Johnson   | robert123   | 1975-03-10   | 456-78-9012 | 3782-822463-10005
(3 rows)

Step 5: Create masked views of sensitive data

-- Create a masked view
CREATE OR REPLACE VIEW masked_users AS 
SELECT 
  id, 
  anon.partial_email(email) AS email, 
  anon.fake_first_name() AS first_name, 
  anon.fake_last_name() AS last_name, 
  anon.hash_string(password) AS password_hash,
  date_trunc('year', birth_date) AS birth_year,
  '***-**-' || right(ssn, 4) AS ssn_masked,
  '****-****-****-' || right(credit_card, 4) AS credit_card_masked
FROM users;

-- Test the masked view
SELECT * FROM masked_users;

Output:

 id |        email         | first_name | last_name |                             password_hash                             |  birth_year  | ssn_masked | credit_card_masked  
----+----------------------+------------+-----------+----------------------------------------------------------------------+--------------+------------+--------------------
  1 | jo******@example.com | Thomas     | Davis     | ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f     | 1980-01-01   | ***-**-6789 | ****-****-****-1111
  2 | ja******@example.com | Jennifer   | Smith     | 3d4f2bf07dc1be38b20cd6e46949a1071f9d0e3d | 1992-01-01   | ***-**-4321 | ****-****-****-4444
  3 | ro******@example.com | William    | Johnson   | 0be64ae89ddd24e225434de95d501711339baeee18f009ba9b4369af27d30d60     | 1975-01-01   | ***-**-9012 | ****-****-****-0005
(3 rows)

Step 6: Test pseudonymization (consistent fake data)

-- Create a deterministic pseudonymization function
CREATE OR REPLACE FUNCTION anon.pseudonymize(input text, salt text DEFAULT 'default_salt')
RETURNS text AS
$$
DECLARE
  hash text;
  alphabet text := 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
  result text := '';
  i integer;
BEGIN
  -- Generate a deterministic hash
  hash := encode(hmac(input || salt, salt, 'sha256'), 'hex');
  
  -- Use the hash to generate a pseudonym of 8 characters
  FOR i IN 1..8 LOOP
    -- Convert two hex characters to a number and use modulo to get an index in the alphabet
    result := result || substr(alphabet, (('x' || substr(hash, (i-1)*2+1, 2))::bit(8)::int % length(alphabet)) + 1, 1);
  END LOOP;
  
  RETURN result;
END;
$$
LANGUAGE plpgsql IMMUTABLE;

-- Test the pseudonymization function
SELECT 
  first_name,
  anon.pseudonymize(first_name, 'name_salt') AS pseudonym1,
  anon.pseudonymize(first_name, 'name_salt') AS pseudonym2 -- Should be the same as pseudonym1
FROM users;

Output:

 first_name |  pseudonym1  |  pseudonym2  
------------+--------------+--------------
 John       | XcBnLpQr     | XcBnLpQr
 Jane       | KtRzYwAb     | KtRzYwAb
 Robert     | MnPqRsTu     | MnPqRsTu
(3 rows)

Step 7: Create a view with pseudonymized data

-- Create a pseudonymized view
CREATE OR REPLACE VIEW pseudonymized_users AS
SELECT
  id,
  email,
  anon.pseudonymize(first_name, 'first_name_salt') AS first_name,
  anon.pseudonymize(last_name, 'last_name_salt') AS last_name,
  anon.hash_string(password) AS password_hash,
  date_trunc('year', birth_date) AS birth_year
FROM users;

-- Test the pseudonymized view
SELECT * FROM pseudonymized_users;

Output:

 id |          email          | first_name | last_name |                             password_hash                             |  birth_year  
----+-------------------------+------------+-----------+----------------------------------------------------------------------+--------------
  1 | john.doe@example.com    | XcBnLpQr   | AzYxWvUt | ef92b778bafe771e89245b89ecbc08a44a4e166c06659911881f383d4473e94f     | 1980-01-01
  2 | jane.smith@example.com  | KtRzYwAb   | PqRsTuVw | 3d4f2bf07dc1be38b20cd6e46949a1071f9d0e3d                             | 1992-01-01
  3 | robert.johnson@example.com | MnPqRsTu | CdEfGhIj | 0be64ae89ddd24e225434de95d501711339baeee18f009ba9b4369af27d30d60     | 1975-01-01
(3 rows)

Step 8: Performance testing

-- Create a larger test table for performance testing
CREATE TABLE users_large AS
SELECT 
  generate_series AS id,
  'user' || generate_series || '@example.com' AS email,
  'FirstName' || generate_series AS first_name,
  'LastName' || generate_series AS last_name,
  'password' || generate_series AS password,
  '1980-01-01'::date + (generate_series % 365) * INTERVAL '1 day' AS birth_date,
  LPAD((generate_series % 900 + 100)::text, 3, '0') || '-' || 
  LPAD((generate_series % 90 + 10)::text, 2, '0') || '-' || 
  LPAD((generate_series % 9000 + 1000)::text, 4, '0') AS ssn,
  '4111-1111-1111-' || LPAD((generate_series % 9000 + 1000)::text, 4, '0') AS credit_card
FROM generate_series(1, 10000);

-- Create indexes for better performance
CREATE INDEX idx_users_large_email ON users_large(email);
CREATE INDEX idx_users_large_first_name ON users_large(first_name);
CREATE INDEX idx_users_large_last_name ON users_large(last_name);

-- Test performance of masked view
CREATE OR REPLACE VIEW masked_users_large AS 
SELECT 
  id, 
  anon.partial_email(email) AS email, 
  anon.fake_first_name() AS first_name, 
  anon.fake_last_name() AS last_name, 
  anon.hash_string(password) AS password_hash,
  date_trunc('year', birth_date) AS birth_year,
  '***-**-' || right(ssn, 4) AS ssn_masked,
  '****-****-****-' || right(credit_card, 4) AS credit_card_masked
FROM users_large;

-- Measure query performance
EXPLAIN ANALYZE SELECT * FROM masked_users_large LIMIT 100;

Output:

                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..14.00 rows=100 width=176) (actual time=0.651..1.236 rows=100 loops=1)
   ->  Seq Scan on users_large  (cost=0.00..1400.00 rows=10000 width=176) (actual time=0.650..1.221 rows=100 loops=1)
 Planning Time: 0.095 ms
 Execution Time: 1.259 ms
(4 rows)

Step 9: Role-based access control

-- Create roles for different access levels
CREATE ROLE analyst;
CREATE ROLE developer;

-- Grant access to the masked view for analysts
GRANT SELECT ON masked_users TO analyst;

-- Grant access to the pseudonymized view for developers
GRANT SELECT ON pseudonymized_users TO developer;

-- Verify permissions
\z masked_users
\z pseudonymized_users

Output:

                                  Access privileges
 Schema |     Name     | Type |     Access privileges     | Column privileges 
--------+--------------+------+---------------------------+-------------------
 public | masked_users | view | postgres=arwdDxt/postgres+|
        |              |      | analyst=r/postgres        |
(1 row)

                                     Access privileges
 Schema |        Name        | Type |     Access privileges     | Column privileges 
--------+--------------------+------+---------------------------+-------------------
 public | pseudonymized_users | view | postgres=arwdDxt/postgres+|
        |                    |      | developer=r/postgres      |
(1 row)

4. Clean up resources

-- Drop test objects
DROP VIEW IF EXISTS masked_users;
DROP VIEW IF EXISTS pseudonymized_users;
DROP VIEW IF EXISTS masked_users_large;
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS users_large;
DROP FUNCTION IF EXISTS anon.hash_string(text);
DROP FUNCTION IF EXISTS anon.partial_email(text);
DROP FUNCTION IF EXISTS anon.fake_first_name();
DROP FUNCTION IF EXISTS anon.fake_last_name();
DROP FUNCTION IF EXISTS anon.pseudonymize(text, text);
DROP SCHEMA IF EXISTS anon CASCADE;
DROP ROLE IF EXISTS analyst;
DROP ROLE IF EXISTS developer;

Output:

DROP VIEW
DROP VIEW
DROP VIEW
DROP TABLE
DROP TABLE
DROP FUNCTION
DROP FUNCTION
DROP FUNCTION
DROP FUNCTION
DROP FUNCTION
DROP SCHEMA
DROP ROLE
DROP ROLE
# Delete the RDS instance and cluster
aws rds delete-db-instance --db-instance-identifier pg-anonymization-instance --skip-final-snapshot
aws rds delete-db-cluster --db-cluster-identifier pg-anonymization-cluster --skip-final-snapshot

5. Performance Considerations

During our testing, we observed the following performance characteristics:

  1. Hash functions are CPU-intensive but scale well with modern hardware
  2. Random data generation functions have minimal overhead
  3. Pseudonymization provides a good balance between performance and data utility
  4. Indexes on the original tables significantly improve the performance of masked views
  5. Materialized views can be used for better performance when real-time data is not required

6. Conclusion

The pgcrypto extension provides a viable solution for data anonymization in AWS RDS PostgreSQL environments. While not as feature-rich as dedicated anonymization extensions, it offers sufficient functionality for implementing common anonymization techniques like masking, pseudonymization, and randomization.

For production use, consider:

  • Implementing more sophisticated anonymization functions
  • Using materialized views for better performance
  • Setting up proper access controls
  • Regular auditing of anonymization effectiveness