Introduction to tidydp: Tidy Differential Privacy

What is Differential Privacy?

Differential privacy is a rigorous mathematical framework for protecting individual privacy while sharing aggregate information about a dataset. It provides formal guarantees that the presence or absence of any single individual in a dataset has minimal impact on the results of statistical queries.

Key Concepts

Privacy Parameters:

epsilon (ε): The privacy budget. Smaller values provide stronger privacy but add more noise. Typical values range from 0.01 (very private) to 10 (less private).
delta (δ): The probability that the privacy guarantee fails. Usually set to a very small value like 1e-5 or 1e-6.

Privacy Mechanisms:

Laplace Mechanism: Adds noise from the Laplace distribution. Provides pure ε-differential privacy (δ = 0).
Gaussian Mechanism: Adds noise from the Gaussian (normal) distribution. Provides (ε, δ)-differential privacy.

Installation

# Install from CRAN (when available)
install.packages("tidydp")

# Or install development version from GitHub
devtools::install_github("ttarler/tidydp")

Getting Started

library(tidydp)

Basic Example: Adding Noise to Data

The most straightforward use case is adding differentially private noise to columns in a data frame:

# Create sample data
employee_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  age = c(28, 35, 42, 31, 38),
  salary = c(65000, 75000, 85000, 70000, 80000)
)

# View original data
head(employee_data)
#>      name age salary
#> 1   Alice  28  65000
#> 2     Bob  35  75000
#> 3 Charlie  42  85000
#> 4   Diana  31  70000
#> 5     Eve  38  80000

# Add differential privacy noise
private_data <- employee_data %>%
  dp_add_noise(
    columns = c("age", "salary"),
    epsilon = 0.5,
    lower = c(age = 22, salary = 50000),
    upper = c(age = 65, salary = 150000)
  )

# View privatized data
head(private_data)
#>      name       age     salary
#> 1   Alice -19.56794 -414130.96
#> 2     Bob 108.91375   86570.53
#> 3 Charlie  24.71835  392272.89
#> 4   Diana 155.92213   91710.91
#> 5     Eve 221.01506   61846.38

Notice how the numeric columns now have noise added while preserving the names column.

Core Functions

Differentially Private Counting

Count the number of records with privacy guarantees:

# Create sample data
city_data <- data.frame(
  city = rep(c("New York", "Los Angeles", "Chicago"), c(150, 120, 80)),
  category = sample(c("A", "B", "C"), 350, replace = TRUE)
)

# Overall count
overall_count <- city_data %>%
  dp_count(epsilon = 0.1)
print(overall_count)
#>   count
#> 1   343

# Grouped count by city
city_counts <- city_data %>%
  dp_count(epsilon = 0.1, group_by = "city")
print(city_counts)
#>          city count
#> 1     Chicago    95
#> 2 Los Angeles   130
#> 3    New York   157

# Count by multiple groups
city_category_counts <- city_data %>%
  dp_count(epsilon = 0.1, group_by = c("city", "category"))
head(city_category_counts)
#>          city category count
#> 1     Chicago        A    15
#> 2 Los Angeles        A    43
#> 3    New York        A    74
#> 4     Chicago        B    26
#> 5 Los Angeles        B    35
#> 6    New York        B    42

Differentially Private Mean

Compute private averages:

# Create sample data
income_data <- data.frame(
  region = rep(c("North", "South", "East", "West"), each = 100),
  income = c(
    rnorm(100, mean = 60000, sd = 15000),
    rnorm(100, mean = 55000, sd = 12000),
    rnorm(100, mean = 65000, sd = 18000),
    rnorm(100, mean = 58000, sd = 14000)
  )
)

# Overall mean income
avg_income <- income_data %>%
  dp_mean(
    "income",
    epsilon = 0.2,
    lower = 20000,
    upper = 150000
  )
print(avg_income)
#>   income_mean
#> 1    61648.21

# Mean by region
regional_avg <- income_data %>%
  dp_mean(
    "income",
    epsilon = 0.2,
    lower = 20000,
    upper = 150000,
    group_by = "region"
  )
print(regional_avg)
#>   region income_mean
#> 1   East    51274.24
#> 2  North    78910.84
#> 3  South    33916.84
#> 4   West    61293.99

Differentially Private Sum

Compute private totals:

# Create sales data
sales_data <- data.frame(
  store = rep(c("Store A", "Store B", "Store C"), each = 50),
  sales = c(
    rpois(50, lambda = 1000),
    rpois(50, lambda = 1200),
    rpois(50, lambda = 900)
  )
)

# Total sales by store
store_totals <- sales_data %>%
  dp_sum(
    "sales",
    epsilon = 0.3,
    lower = 0,
    upper = 5000,
    group_by = "store"
  )
print(store_totals)
#>     store sales_sum
#> 1 Store A  74639.31
#> 2 Store B  28654.55
#> 3 Store C  58905.43

Privacy Budget Management

When performing multiple queries on the same dataset, you need to track your total privacy expenditure using a privacy budget:

# Create a privacy budget
budget <- new_privacy_budget(
  epsilon_total = 1.0,
  delta_total = 1e-5
)

print(budget)
#> Privacy Budget
#> ==============
#> Total:     epsilon = 1.0000, delta = 1.00e-05
#> Spent:     epsilon = 0.0000, delta = 0.00e+00
#> Remaining: epsilon = 1.0000, delta = 1.00e-05
#> Composition: basic
#> Operations executed: 0

# Perform first query
result1 <- city_data %>%
  dp_count(epsilon = 0.3, .budget = budget)

print(budget)
#> Privacy Budget
#> ==============
#> Total:     epsilon = 1.0000, delta = 1.00e-05
#> Spent:     epsilon = 0.0000, delta = 0.00e+00
#> Remaining: epsilon = 1.0000, delta = 1.00e-05
#> Composition: basic
#> Operations executed: 0

# Perform second query
result2 <- city_data %>%
  dp_count(epsilon = 0.4, group_by = "city", .budget = budget)

print(budget)
#> Privacy Budget
#> ==============
#> Total:     epsilon = 1.0000, delta = 1.00e-05
#> Spent:     epsilon = 0.0000, delta = 0.00e+00
#> Remaining: epsilon = 1.0000, delta = 1.00e-05
#> Composition: basic
#> Operations executed: 0

# Check if we have enough budget for another query
can_query <- check_privacy_budget(budget, epsilon_required = 0.5)
print(paste("Can perform query with epsilon=0.5?", can_query))
#> [1] "Can perform query with epsilon=0.5? TRUE"

# We only have 0.3 epsilon remaining
can_query <- check_privacy_budget(budget, epsilon_required = 0.2)
print(paste("Can perform query with epsilon=0.2?", can_query))
#> [1] "Can perform query with epsilon=0.2? TRUE"

Budget Composition

The tidydp package uses basic composition by default, where the total privacy cost is the sum of individual query costs:

\[\epsilon_{total} = \sum_{i=1}^{k} \epsilon_i\]

This is a conservative approach that ensures strong privacy guarantees.

Choosing Privacy Parameters

Epsilon (ε)

Epsilon Value	Privacy Level	Use Case
0.01 - 0.1	Very Strong	Highly sensitive medical or financial data
0.1 - 1.0	Strong	Personal information, general sensitive data
1.0 - 5.0	Moderate	Less sensitive aggregate statistics
5.0+	Weak	Public or minimally sensitive data

Delta (δ)

Typically set to 1/n² or 1/n³ where n is the dataset size
Common values: 1e-5, 1e-6, 1e-7
Should always be much smaller than 1/n

Data Bounds

Providing accurate bounds is crucial for utility:

# Example: Impact of bounds on utility
test_data <- data.frame(age = c(25, 30, 35, 40, 45))

# Tight bounds (accurate)
tight_bounds <- test_data %>%
  dp_add_noise(
    columns = "age",
    epsilon = 0.5,
    lower = c(age = 20),
    upper = c(age = 50)
  )

# Loose bounds (less accurate)
loose_bounds <- test_data %>%
  dp_add_noise(
    columns = "age",
    epsilon = 0.5,
    lower = c(age = 0),
    upper = c(age = 100)
  )

# Compare results
data.frame(
  Original = test_data$age,
  Tight_Bounds = round(tight_bounds$age, 1),
  Loose_Bounds = round(loose_bounds$age, 1)
)
#>   Original Tight_Bounds Loose_Bounds
#> 1       25         75.7        -86.4
#> 2       30         16.6         80.6
#> 3       35        157.7         60.0
#> 4       40         25.2       -465.3
#> 5       45         35.1        179.9

Tighter bounds lead to better utility (less noise) while maintaining the same privacy guarantee.

Mechanism Selection

When to use Laplace vs Gaussian

Use Laplace (default): - When you need pure ε-differential privacy (δ = 0) - For counting queries - When δ > 0 is not acceptable

Use Gaussian: - When (ε, δ)-differential privacy is acceptable - Often provides better utility for the same privacy level - When working with continuous data and aggregates

# Compare mechanisms
test_values <- data.frame(value = c(100, 200, 300, 400, 500))

# Laplace mechanism
laplace_result <- test_values %>%
  dp_add_noise(
    columns = "value",
    epsilon = 0.5,
    lower = c(value = 0),
    upper = c(value = 1000),
    mechanism = "laplace"
  )

# Gaussian mechanism
gaussian_result <- test_values %>%
  dp_add_noise(
    columns = "value",
    epsilon = 0.5,
    delta = 1e-5,
    lower = c(value = 0),
    upper = c(value = 1000),
    mechanism = "gaussian"
  )

data.frame(
  Original = test_values$value,
  Laplace = round(laplace_result$value, 1),
  Gaussian = round(gaussian_result$value, 1)
)
#>   Original Laplace Gaussian
#> 1      100 -1668.7  -5407.2
#> 2      200   950.8   1149.2
#> 3      300   391.7  -3991.6
#> 4      400  1020.9 -17116.3
#> 5      500  1036.1   2874.8

Complete Workflow Example

Here’s a complete example analyzing employee data while maintaining differential privacy:

# Create employee dataset
employees <- data.frame(
  department = rep(c("Engineering", "Sales", "Marketing", "HR"), each = 25),
  salary = c(
    rnorm(25, 85000, 15000),  # Engineering
    rnorm(25, 70000, 12000),  # Sales
    rnorm(25, 65000, 10000),  # Marketing
    rnorm(25, 60000, 8000)    # HR
  ),
  years_experience = c(
    rpois(25, 5),
    rpois(25, 4),
    rpois(25, 3),
    rpois(25, 4)
  )
)

# Ensure realistic bounds
employees$salary <- pmax(40000, pmin(150000, employees$salary))
employees$years_experience <- pmax(0, pmin(20, employees$years_experience))

# Initialize privacy budget
analysis_budget <- new_privacy_budget(epsilon_total = 2.0)

# Query 1: Count by department (epsilon = 0.5)
dept_counts <- employees %>%
  dp_count(
    epsilon = 0.5,
    group_by = "department",
    .budget = analysis_budget
  )

cat("\nEmployee counts by department:\n")
#> 
#> Employee counts by department:
print(dept_counts)
#>    department count
#> 1 Engineering    24
#> 2          HR    25
#> 3   Marketing    23
#> 4       Sales    28

# Query 2: Average salary by department (epsilon = 0.8)
dept_salaries <- employees %>%
  dp_mean(
    "salary",
    epsilon = 0.8,
    lower = 40000,
    upper = 150000,
    group_by = "department",
    .budget = analysis_budget
  )

cat("\nAverage salaries by department:\n")
#> 
#> Average salaries by department:
print(dept_salaries)
#>    department salary_mean
#> 1 Engineering    81784.28
#> 2          HR    50731.66
#> 3   Marketing    74232.57
#> 4       Sales    67084.91

# Query 3: Average experience (epsilon = 0.4)
avg_experience <- employees %>%
  dp_mean(
    "years_experience",
    epsilon = 0.4,
    lower = 0,
    upper = 20,
    .budget = analysis_budget
  )

cat("\nAverage years of experience:\n")
#> 
#> Average years of experience:
print(avg_experience)
#>   years_experience_mean
#> 1              4.210764

# Check remaining budget
cat("\nFinal budget status:\n")
#> 
#> Final budget status:
print(analysis_budget)
#> Privacy Budget
#> ==============
#> Total:     epsilon = 2.0000, delta = 1.00e-05
#> Spent:     epsilon = 0.0000, delta = 0.00e+00
#> Remaining: epsilon = 2.0000, delta = 1.00e-05
#> Composition: basic
#> Operations executed: 0

Best Practices

Start with a clear privacy budget: Decide upfront how much privacy loss is acceptable
Use tight bounds: Provide accurate lower and upper bounds for better utility
Track your budget: Use new_privacy_budget() for multiple queries
Test with synthetic data: Validate your privacy-utility tradeoff before deploying
Document your choices: Record epsilon, delta, and bounds for reproducibility
Consider query sensitivity: Different queries have different privacy costs
Aggregate before privatizing: Reduce sensitivity by aggregating data first

Common Pitfalls

Pitfall 1: Repeated Queries

Don’t run the same query multiple times without accounting for cumulative privacy loss:

# BAD: Running same query multiple times
for (i in 1:10) {
  result <- data %>% dp_count(epsilon = 0.1)
}
# Total cost: 10 * 0.1 = 1.0 epsilon!

Pitfall 2: Ignoring Bounds

Not providing bounds forces the algorithm to use data-dependent bounds, which can leak information:

# BETTER: Provide explicit bounds
result <- data %>%
  dp_mean("income", epsilon = 0.5, lower = 0, upper = 200000)

# WORSE: Let algorithm infer bounds from data
result <- data %>%
  dp_mean("income", epsilon = 0.5)

Pitfall 3: Epsilon Too Large

Using epsilon > 10 provides minimal privacy protection:

# Very weak privacy
weak_privacy <- test_values %>%
  dp_add_noise(
    columns = "value",
    epsilon = 50,  # Too large!
    lower = c(value = 0),
    upper = c(value = 1000)
  )

# The noise is minimal
data.frame(
  Original = test_values$value,
  With_Noise = round(weak_privacy$value, 1),
  Difference = round(abs(test_values$value - weak_privacy$value), 1)
)
#>   Original With_Noise Difference
#> 1      100       95.6        4.4
#> 2      200      197.8        2.2
#> 3      300      277.2       22.8
#> 4      400      375.2       24.8
#> 5      500      475.8       24.2

Getting Help

If you encounter issues or have questions:

File an issue: https://github.com/ttarler/tidydp/issues
Check documentation: ?tidydp, ?dp_add_noise, ?dp_count, etc.
View examples: Run example(dp_add_noise)