Skip to content

[Model] ClosestString #1032

@zhangjieqingithub

Description

@zhangjieqingithub

Motivation

ClosestString is a standard string-combinatorics model that is not currently in the catalog.

It is useful because it captures the consensus-string problem under Hamming distance: given several equal-length strings, find one center string that minimizes the worst Hamming distance to the inputs. This problem appears in computational biology, coding theory, and clustering over discrete sequences.

Associated rule:

Definition

Name: ClosestString
Reference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150

Let Sigma = {0, ..., q-1} be a finite alphabet. The input is a list of strings
s_1, ..., s_n in Sigma^m
all having the same length m.

A solution is a center string
c in Sigma^m.

The objective is to minimize the maximum Hamming distance from the center to the input strings:

minimize max_i d_H(c, s_i).

So this is a minimization problem.

Variables

Let m be the common string length and q = alphabet_size.

  • Count: m variables
  • Per-variable domain: {0, ..., q-1}
  • Meaning: variable c_j is the symbol chosen for the center string at position j

A configuration is always syntactically feasible if every symbol lies in the alphabet. Its objective value is the maximum Hamming distance to the input strings.

Schema (data type)

Type name: ClosestString

Field Mathematical type Description
alphabet_size positive integer Size q of the finite alphabet
strings list of equal-length strings over {0,...,q-1} Input strings s_1,...,s_n

Suggested size fields:

  • alphabet_size
  • num_strings
  • string_length
  • total_length = num_strings * string_length

Input validation should require:

  • alphabet_size > 0 unless all strings and the center length are zero,
  • all input strings have equal length,
  • every symbol is < alphabet_size.

Complexity

The direct exact baseline enumerates all possible center strings and evaluates the maximum Hamming distance.

This gives:

O(alphabet_size^string_length * num_strings * string_length).

The registry complexity expression can use:

alphabet_size ^ string_length.

Extra Remark

This model is distinct from ClosestSubstring. In ClosestString, every input string has the same length as the center, so there is no window-selection decision.

Reduction Rule Crossref

How to solve

Example Instance

Use binary alphabet Sigma = {0,1} and input strings of length 3:

String Value
s_1 000
s_2 011
s_3 101
s_4 110

Expected Outcome

An optimal center string is:

c = 000.

Its distances are:

  • d_H(000, 000) = 0
  • d_H(000, 011) = 2
  • d_H(000, 101) = 2
  • d_H(000, 110) = 2

So the objective value is 2.

No center has radius 1: the four input strings are pairwise arranged so that every binary length-3 string is at distance at least 2 from at least one input string. Therefore the optimum radius is exactly 2.

BibTeX

@article{LiMaWang2002ClosestStringSubstring,
  author  = {Ming Li and Bin Ma and Lusheng Wang},
  title   = {On the Closest String and Substring Problems},
  journal = {Journal of the ACM},
  volume  = {49},
  number  = {2},
  pages   = {157--171},
  year    = {2002},
  doi     = {10.1145/506147.506150},
  url     = {https://doi.org/10.1145/506147.506150}
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions