Differential Privacy for Complex Data: Answering Queries Across Multiple Data Tables

So far in this blog series, we have discussed the challenges of ensuring differential privacy for queries over a single database table. In practice, however, databases are often organized into multiple tables, and queries over the data involve joins between these tables. In this post, we discuss the additional challenges of differential privacy for queries with joins, and describe some of the solutions for this setting.

Queries with Joins

Consider the U.S. Census Bureau, which would like to release employment statistics. An example of a query used to compute these statistics is "how many workers exist with age > 20?" To answer this query, we only need a single database table: the example "Workers" table shown below on the left.

What if we need two tables to answer a query though? For example: "how many jobs were filled by employees with age > 20 in 2020?" The "Workers" table only has the demographic information of a worker, and each worker can take multiple jobs (or change jobs). Hence, to answer this query, we need to join the "Workers" table with the "Jobs" table, shown below on the right, matching the P_ID column of the "Workers" table with the P_ID column of the "Jobs" table.

Figure 1: An example database with a "workers" table (left) which includes a unique identifier (P_ID) for people, and a "jobs" table (right) which links people with jobs through their identifiers.
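To make the two example queries concrete, here is a minimal sketch in plain Python. The rows are made up for illustration and are not the contents of Figure 1. Only P_ID comes from the post; the `age`, `year`, and `J_ID` field names and both function names are assumptions. The sketch computes the exact (non-private) answers only, to show what the join does.

```python
# Hypothetical rows; the actual tables from Figure 1 are not reproduced here.
workers = [
    {"P_ID": 1, "age": 45},
    {"P_ID": 2, "age": 19},
    {"P_ID": 3, "age": 30},
]
jobs = [
    {"J_ID": 101, "P_ID": 1, "year": 2020},
    {"J_ID": 102, "P_ID": 1, "year": 2020},  # worker 1 held two jobs in 2020
    {"J_ID": 103, "P_ID": 2, "year": 2020},
    {"J_ID": 104, "P_ID": 3, "year": 2019},
    {"J_ID": 105, "P_ID": 3, "year": 2020},
]


def count_workers_over(workers, min_age):
    """Single-table query: how many workers have age > min_age?"""
    return sum(1 for w in workers if w["age"] > min_age)


def count_jobs_by_workers_over(workers, jobs, min_age, year):
    """Join query: how many jobs in `year` were filled by workers with age > min_age?

    Joins the two tables on P_ID by looking up each job's worker.
    """
    age_by_pid = {w["P_ID"]: w["age"] for w in workers}
    return sum(
        1
        for j in jobs
        if j["year"] == year
        and j["P_ID"] in age_by_pid
        and age_by_pid[j["P_ID"]] > min_age
    )


print(count_workers_over(workers, 20))
print(count_jobs_by_workers_over(workers, jobs, 20, 2020))

# Dropping worker 1 from the "Workers" table removes both of that worker's
# jobs from the joined result, not just one row.
without_worker_1 = [w for w in workers if w["P_ID"] != 1]
print(count_jobs_by_workers_over(without_worker_1, jobs, 20, 2020))
```

In this toy data, removing a single worker changes the single-table count by one, but changes the join count by the number of jobs that worker held. That follows directly from each worker being able to take multiple jobs.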