Skip to content

[Model] ConsistencyOfDatabaseFrequencyTables #452

@isPANN

Description

@isPANN

Motivation

CONSISTENCY OF DATABASE FREQUENCY TABLES (P183) from Garey & Johnson, A4 SR35. A classical NP-complete problem at the intersection of database theory and statistical disclosure control. It asks whether published frequency tables (cross-tabulations of attribute pairs) are consistent with a set of known attribute values — that is, whether there exists a complete assignment of attribute values to all objects that matches both the frequency tables and the known values. The NP-completeness result (Reiss, 1977) has important implications for database privacy: it shows that "compromising" a database by deducing individual attribute values from published frequency tables is computationally intractable unless P = NP.

Associated reduction rules:

  • As target: R129: 3SAT -> Consistency of Database Frequency Tables (GJ SR35)
  • As source: (none known in GJ)

Definition

Name: ConsistencyOfDatabaseFrequencyTables
Canonical name: Consistency of Database Frequency Tables (also: Frequency Table Satisfiability, Statistical Database Consistency, Contingency Table Realizability)
Reference: Garey & Johnson, Computers and Intractability, A4 SR35

Mathematical definition:

INSTANCE: Set A of attribute names, domain set D_a for each a in A, set V of objects, collection F of frequency tables for some pairs a,b in A (where a frequency table for a,b in A is a function f_{a,b}: D_a x D_b -> Z+ with the sum, over all pairs x in D_a and y in D_b, of f_{a,b}(x,y) equal to |V|), and a set K of triples (v,a,x) with v in V, a in A, and x in D_a, representing the known attribute values.

QUESTION: Are the frequency tables in F consistent with the known attribute values in K, i.e., is there a collection of functions g_a: V -> D_a, for each a in A, such that g_a(v) = x if (v,a,x) in K and such that, for each f_{a,b} in F, x in D_a, and y in D_b, the number of v in V for which g_a(v) = x and g_b(v) = y is exactly f_{a,b}(x,y)?

The problem is a decision (satisfaction) problem: given frequency tables and partial knowledge, determine if there exists a complete assignment consistent with everything.

Variables

  • Count: |V| * |A| (one variable per object-attribute pair, representing the attribute value assigned to each object)
  • Per-variable domain: For object v and attribute a, the variable g_a(v) takes values in D_a. If (v,a,x) is in K, then g_a(v) is fixed to x.
  • Meaning: Variable g_a(v) = x means object v has value x for attribute a. The configuration (g_a(v) for all v in V, a in A) encodes a complete database. The assignment is valid if all frequency table constraints are satisfied and all known values in K are respected.

For a brute-force encoding, one can use |V| * |A| categorical variables, each with domain size max_a |D_a|.

Schema (data type)

Type name: ConsistencyOfDatabaseFrequencyTables
Variants: none (no graph or weight type parameter; the database schema is stored directly)

Field Type Description
num_objects usize Number of objects in V (indexed 0..num_objects)
attributes Vec<usize> Domain sizes: attributes[i] =
frequency_tables Vec<(usize, usize, Vec<Vec<usize>>)> Collection F: each entry (a, b, table) where table[x][y] = f_{a,b}(x,y)
known_values Vec<(usize, usize, usize)> Set K of triples (object_index, attribute_index, value)

Complexity

  • Decision complexity: NP-complete (Reiss, 1977; transformation from 3SAT).
  • Best known exact algorithm: Brute-force enumeration over all possible attribute-value assignments: for each object v and each attribute a, try all values in D_a. The total search space is the product over all (v,a) of |D_a|, which is |D|^(|V||A|) in the worst case where all domains have the same size |D|. For binary attributes (|D_a| = 2 for all a), this is 2^(|V||A|).
  • Complexity string for declare_variants!: "2^(num_objects * num_attributes)" (worst-case binary domains where all |D_a| = 2)
  • Related modern work: The problem is closely related to FREQSAT (Calders, 2004), which asks whether given itemset-frequency pairs can be realized by some database. FREQSAT is also NP-complete, and when the maximum number of duplicate transactions is bounded, it becomes PP-hard. These results generalize the Reiss (1977) result to more general frequency constraints.
  • Practical implications: The NP-completeness means that there is no efficient general algorithm for determining whether published frequency tables can be used to deduce individual attribute values, providing a theoretical foundation for the security of statistical databases that release aggregate frequency information.
  • References:
    • [Reiss, 1977b] S. P. Reiss, "Statistical database confidentiality", Dept. of Statistics, University of Stockholm, 1977.
    • [Reiss, 1980] S. P. Reiss, "Practical data-swapping: The first steps", Proc. IEEE Symposium on Security and Privacy, pp. 36-44, 1980.
    • [Calders, 2004] T. Calders, "Computational complexity of itemset frequency satisfiability", Proc. 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 143-154, 2004.

Specialization

  • This is related to: Contingency table realizability, FREQSAT (itemset frequency satisfiability)
  • Known special cases: When all domains are binary ({0, 1}) and F contains all pairwise frequency tables, the problem is equivalent to finding a binary matrix with given 2x2 marginals. When K is empty, the problem reduces to finding any database consistent with F.
  • Restriction: When only univariate (single-attribute) frequency tables are given (not pairwise), the problem is solvable in polynomial time by independent assignment.

Extra Remark

Full book text:

INSTANCE: Set A of attribute names, domain set Da for each a ∈ A, set V of objects, collection F of frequency tables for some pairs a,b ∈ A (where a frequency table for a,b ∈ A is a function fa,b: Da×Db → Z+ with the sum, over all pairs x ∈ Da and y ∈ Db, of fa,b(x,y) equal to |V|), and a set K of triples (v,a,x) with v ∈ V, a ∈ A, and x ∈ Da, representing the known attribute values.
QUESTION: Are the frequency tables in F consistent with the known attribute values in K, i.e., is there a collection of functions ga: V → Da, for each a ∈ A, such that ga(v) = x if (v,a,x) ∈ K and such that, for each fa,b ∈ F, x ∈ Da, and y ∈ Db, the number of v ∈ V for which ga(v) = x and gb(v) = y is exactly fa,b(x,y)?
Reference: [Reiss, 1977b]. Transformation from 3SAT.
Comment: Above result implies that no polynomial time algorithm can be given for "compromising" a data base from its frequency tables by deducing prespecified attribute values, unless P = NP (see reference for details).

How to solve

  • It can be solved by (existing) bruteforce. Enumerate all possible assignments of attribute values to objects (g_a(v) for all v, a), check that all known values in K are respected and all frequency tables in F are exactly matched.
  • It can be solved by reducing to integer programming. Introduce integer variables y_{v,a,x} in {0,1} for each object v, attribute a, value x (indicating g_a(v) = x). Constraints: (1) exactly one value per object-attribute: sum_x y_{v,a,x} = 1; (2) known values: y_{v,a,x} = 1 if (v,a,x) in K; (3) frequency tables: sum_v y_{v,a,x} * y_{v,b,y} = f_{a,b}(x,y) for all (a,b) in F, x, y. The quadratic constraint (3) can be linearized using McCormick envelopes: introduce auxiliary binary variables z_{v,a,x,b,y} = y_{v,a,x} * y_{v,b,y} with constraints z_{v,a,x,b,y} <= y_{v,a,x}, z_{v,a,x,b,y} <= y_{v,b,y}, and z_{v,a,x,b,y} >= y_{v,a,x} + y_{v,b,y} - 1; then replace constraint (3) with sum_v z_{v,a,x,b,y} = f_{a,b}(x,y).
  • Other: (none identified)

Example Instance

Objects V = {v_0, v_1, v_2, v_3, v_4, v_5} (6 objects)
Attributes A = {a_0, a_1, a_2} with domains:

  • D_{a_0} = {0, 1} (binary, e.g., "gender")
  • D_{a_1} = {0, 1, 2} (ternary, e.g., "age group")
  • D_{a_2} = {0, 1} (binary, e.g., "employed")

Frequency tables F (2 tables):

f_{a_0, a_1} (gender x age_group):

a_0 \ a_1 0 1 2
0 1 1 1
1 1 1 1

(Each cell has count 1; row sums = 3, column sums = 2, total = 6 = |V|)

f_{a_1, a_2} (age_group x employed):

a_1 \ a_2 0 1
0 1 1
1 0 2
2 1 1

(Row sums: 2, 2, 2; column sums: 2, 4; total = 6 = |V|)

Known values K = {(v_0, a_0, 0), (v_3, a_0, 1), (v_1, a_2, 1)}
(We know: v_0 has gender=0, v_3 has gender=1, v_1 is employed=1)

Consistent assignment:

Object a_0 (gender) a_1 (age_group) a_2 (employed)
v_0 0 0 0
v_1 0 1 1
v_2 0 2 1
v_3 1 0 1
v_4 1 1 1
v_5 1 2 0

Verification of f_{a_0, a_1}:

  • (0,0): v_0 -> count=1 ✓
  • (0,1): v_1 -> count=1 ✓
  • (0,2): v_2 -> count=1 ✓
  • (1,0): v_3 -> count=1 ✓
  • (1,1): v_4 -> count=1 ✓
  • (1,2): v_5 -> count=1 ✓

Verification of f_{a_1, a_2}:

  • (0,0): v_0 -> count=1 ✓
  • (0,1): v_3 -> count=1 ✓
  • (1,0): (none) -> count=0 ✓
  • (1,1): v_1, v_4 -> count=2 ✓
  • (2,0): v_5 -> count=1 ✓
  • (2,1): v_2 -> count=1 ✓

Verification of known values K:

  • (v_0, a_0, 0): g_{a_0}(v_0) = 0 ✓
  • (v_3, a_0, 1): g_{a_0}(v_3) = 1 ✓
  • (v_1, a_2, 1): g_{a_2}(v_1) = 1 ✓

All frequency tables match and all known values are respected. The tables are consistent. ✓

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Done

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions