Version 0.2.0
FoldMatch is a Python toolkit to encode macromolecular 3D structures into fixed-length vector embeddings for efficient large-scale structure similarity search and clustering.
Reference: Multi-scale structural similarity embedding search across entire proteomes.
A web-based implementation using this tool for structure similarity search is available at rcsb-embedding-search.
If you are interested in training a new model with a new structure dataset, visit the rcsb-embedding-search repository, which provides scripts and documentation for training.
- Residue-level embeddings computed using the ESM3 protein language model
- Sequence-based embeddings from FASTA files without requiring 3D structures
- Structure-level embeddings aggregated via a transformer-based aggregator network
- Fast and efficient FAISS-based similarity search
- Structural clustering using the Leiden algorithm for biological assembly identification
- Command-line interface implemented with Typer for high-throughput inference workflows
- Python API for interactive embedding computation and integration into analysis pipelines
- High-performance inference leveraging PyTorch Lightning, with multi-node and multi-GPU support
Install from PyPI:

```shell
pip install foldmatch
```

Or install from source:

```shell
git clone https://github.com/rcsb/foldmatch.git
cd foldmatch
pip install -e .
```

Requirements:
- Python ≥ 3.12
- ESM 3.2.3
- Lightning 2.6.1
- Typer 0.24.1
- Biotite 1.6.0
- FAISS 1.13.2
- igraph 1.0.0
- leidenalg 0.11.0
- PyTorch with CUDA support (recommended for GPU acceleration)
Optional Dependencies:
- `faiss-gpu` for GPU-accelerated similarity search (instead of `faiss-cpu`)
Before using the package, download the pre-trained ESM3 and aggregator models:
```shell
fm-inference download-models
```

The package provides two main interfaces:
- Command-line Interface (CLI) for batch processing and high-throughput workflows
- Python API for interactive use and integration into custom pipelines
The CLI provides four main command groups: fm-embedding for computing embeddings from a folder of structure files, fm-sequence for computing embeddings from protein sequences in FASTA files, fm-inference for computing embeddings from CSV file lists, and fm-search for similarity search operations.
Calculate residue-level embeddings using ESM3 from a folder of structure files. All chains in each structure are processed. Outputs are stored as PyTorch tensor files (default) or CSV files.
```shell
fm-embedding residue \
  --src-folder data/structures \
  --output-path results/residue_embeddings \
  --structure-format mmcif \
  --batch-size 8 \
  --devices auto
```

Key Options:

- `--src-folder`: Folder containing structure files (`.cif`, `.pdb`, or `.bcif`, including `.gz` variants)
- `--output-path`: Directory to store embedding files
- `--output-format`: `separated` (individual files) or `grouped` (single JSON)
- `--output-name`: Filename when using `grouped` format (default: `inference`)
- `--write-csv`/`--no-write-csv`: Write embeddings as CSV files instead of tensor files when using `separated` format (default: disabled)
- `--structure-format`: `mmcif`, `binarycif`, or `pdb`
- `--min-res-n`: Minimum residue count for chain filtering (default: 0)
- `--batch-size`: Batch size for processing (default: 1)
- `--num-workers`: Data loader workers (default: 0)
- `--num-nodes`: Number of nodes for distributed inference (default: 1)
- `--accelerator`: Device type - `auto`, `cpu`, `cuda`, `gpu` (default: `auto`)
- `--devices`: Device indices (can specify multiple with `--devices 0 --devices 1`) or `auto`
- `--strategy`: Lightning distribution strategy (default: `auto`)
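The saved tensor files can be read back with `torch.load` for downstream analysis. Below is a minimal, hypothetical round-trip sketch; the filename and the residue count are illustrative, and the actual output naming scheme depends on your input structure IDs:

```python
import os
import tempfile

import torch

# Round-trip sketch: residue embeddings are stored as PyTorch tensor files
# of shape [num_residues, embedding_dim]. The 120-residue example and the
# "1abc.A.pt" filename are assumptions, not the tool's guaranteed layout.
emb = torch.randn(120, 1536)
path = os.path.join(tempfile.mkdtemp(), "1abc.A.pt")
torch.save(emb, path)

loaded = torch.load(path)
print(loaded.shape)  # torch.Size([120, 1536])
```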
Compute chain-level embeddings from a folder of structure files. By default, residue embeddings are computed as a first step and stored in --res-embedding-location, then aggregated into chain embeddings. Use --no-compute-residue-embedding to skip the residue step and use pre-computed residue embeddings.
```shell
# End-to-end: compute residue + chain embeddings
fm-embedding chain \
  --src-folder data/structures \
  --res-embedding-location results/residue_embeddings \
  --output-path results/chain_embeddings \
  --batch-size 4

# Using pre-computed residue embeddings
fm-embedding chain \
  --src-folder data/structures \
  --res-embedding-location results/residue_embeddings \
  --output-path results/chain_embeddings \
  --no-compute-residue-embedding \
  --batch-size 4
```

Key Options:

- `--src-folder`: Folder containing structure files
- `--res-embedding-location`: Directory for residue embedding tensor files (output when computing, input for chain aggregation)
- `--output-path`: Directory to store chain embedding CSV files
- `--compute-residue-embedding`/`--no-compute-residue-embedding`: Compute residue embeddings first (default: enabled)
- `--output-format`: `separated` (individual files) or `grouped` (single JSON)
- `--output-name`: Filename when using `grouped` format (default: `inference`)
- All other options similar to `fm-embedding residue`
Compute assembly-level embeddings from a folder of structure files. By default, residue embeddings are computed as a first step and stored in --res-embedding-location, then aggregated into assembly embeddings. Use --no-compute-residue-embedding to skip the residue step and use pre-computed residue embeddings.
```shell
# End-to-end: compute residue + assembly embeddings
fm-embedding assembly \
  --src-folder data/structures \
  --res-embedding-location results/residue_embeddings \
  --output-path results/assembly_embeddings \
  --min-res-n 10 \
  --max-res-n 10000

# Using pre-computed residue embeddings
fm-embedding assembly \
  --src-folder data/structures \
  --res-embedding-location results/residue_embeddings \
  --output-path results/assembly_embeddings \
  --no-compute-residue-embedding \
  --min-res-n 10 \
  --max-res-n 10000
```

Key Options:

- `--src-folder`: Folder containing structure files
- `--res-embedding-location`: Directory for residue embedding tensor files (output when computing, input for assembly aggregation)
- `--output-path`: Directory to store assembly embedding CSV files
- `--compute-residue-embedding`/`--no-compute-residue-embedding`: Compute residue embeddings first (default: enabled)
- `--output-format`: `separated` (individual files) or `grouped` (single JSON)
- `--output-name`: Filename when using `grouped` format (default: `inference`)
- `--min-res-n`: Minimum residues per chain (default: 0)
- `--max-res-n`: Maximum total residues for assembly (default: unlimited)
- All other options similar to `fm-embedding residue`
Download ESM3 and aggregator models from Hugging Face.
```shell
fm-embedding download-models
```

Calculate residue-level ESM embeddings from protein sequences in a FASTA file. No 3D structure information is required. Outputs are stored as PyTorch tensor files (default) or CSV files.
```shell
fm-sequence residue \
  --fasta-file sequences.fasta \
  --output-path results/residue_embeddings \
  --batch-size 8 \
  --devices auto
```

Key Options:

- `--fasta-file`: FASTA file containing protein sequences
- `--output-path`: Directory to store embedding files
- `--output-format`: `separated` (individual files) or `grouped` (single JSON)
- `--output-name`: Filename when using `grouped` format (default: `inference`)
- `--write-csv`/`--no-write-csv`: Write embeddings as CSV files instead of tensor files when using `separated` format (default: disabled)
- `--min-res-n`: Minimum residue count for sequence filtering (default: 0)
- `--batch-size`: Batch size for processing (default: 1)
- `--num-workers`: Data loader workers (default: 0)
- `--num-nodes`: Number of nodes for distributed inference (default: 1)
- `--accelerator`: Device type - `auto`, `cpu`, `cuda`, `gpu` (default: `auto`)
- `--devices`: Device indices (can specify multiple with `--devices 0 --devices 1`) or `auto`
- `--strategy`: Lightning distribution strategy (default: `auto`)
Compute chain-level embeddings from protein sequences in a FASTA file. By default, residue embeddings are computed as a first step and stored in --res-embedding-location, then aggregated into chain embeddings using the transformer-based aggregator. Use --no-compute-residue-embedding to skip the residue step and use pre-computed residue embeddings.
```shell
# End-to-end: compute residue + chain embeddings
fm-sequence chain \
  --fasta-file sequences.fasta \
  --res-embedding-location results/residue_embeddings \
  --output-path results/chain_embeddings \
  --batch-size 4

# Using pre-computed residue embeddings
fm-sequence chain \
  --fasta-file sequences.fasta \
  --res-embedding-location results/residue_embeddings \
  --output-path results/chain_embeddings \
  --no-compute-residue-embedding \
  --batch-size 4
```

Key Options:

- `--fasta-file`: FASTA file containing protein sequences
- `--res-embedding-location`: Directory for residue embedding tensor files (output when computing, input for chain aggregation)
- `--output-path`: Directory to store chain embedding CSV files
- `--compute-residue-embedding`/`--no-compute-residue-embedding`: Compute residue embeddings first (default: enabled)
- `--output-format`: `separated` (individual files) or `grouped` (single JSON)
- `--output-name`: Filename when using `grouped` format (default: `inference`)
- All other options similar to `fm-sequence residue`
Download ESM3 and aggregator models from Hugging Face.
```shell
fm-sequence download-models
```

Build a FAISS database from structure files for similarity search.
```shell
fm-search build-db \
  --structure-dir data/pdb_files \
  --output-db databases/my_structures \
  --tmp-dir tmp \
  --granularity chain \
  --min-res 10 \
  --use-gpu-index
```

Key Options:

- `--structure-dir`: Directory containing structure files
- `--output-db`: Database path (prefix for `.index` and `.metadata` files)
- `--tmp-dir`: Temporary directory for intermediate files
- `--structure-format`: `mmcif`, `binarycif`, or `pdb`
- `--granularity`: `chain` or `assembly` level embeddings
- `--file-extension`: Filter files by extension (e.g., `.cif`, `.bcif`, `.pdb`)
- `--min-res`: Minimum residue count (default: 10)
- `--use-gpu-index`: Use GPU for FAISS index construction
- `--accelerator`, `--devices`, `--strategy`: Inference device settings
- `--batch-size-res`, `--num-workers-res`, `--num-nodes-res`: Residue embedding settings
- `--batch-size-aggregator`, `--num-workers-aggregator`, `--num-nodes-aggregator`: Aggregator settings
Update an existing FAISS database with new or replacement structure files. Structures with IDs already present in the database are replaced; new IDs are added. The FAISS index is fully rebuilt after merging.
```shell
fm-search update-db \
  --structure-dir data/new_structures \
  --output-db databases/my_structures \
  --tmp-dir tmp \
  --structure-format mmcif \
  --granularity chain \
  --min-res 10 \
  --batch-size-res 8
```

Key Options:

- `--structure-dir`: Directory containing new or updated structure files
- `--output-db`: Path to the existing FAISS database to update
- `--tmp-dir`: Temporary directory for intermediate files
- `--structure-format`: `mmcif`, `binarycif`, or `pdb`
- `--granularity`: `chain` or `assembly` level embeddings
- `--file-extension`: Filter files by extension (e.g., `.cif`, `.bcif`, `.pdb`)
- `--min-res`: Minimum residue count (default: 10)
- `--use-gpu-index`: Use GPU for FAISS index construction
- `--accelerator`, `--devices`, `--strategy`: Inference device settings
- `--batch-size-res`, `--num-workers-res`, `--num-nodes-res`: Residue embedding settings
- `--batch-size-aggregator`, `--num-workers-aggregator`, `--num-nodes-aggregator`: Aggregator settings
- `--log-level`: Logging level - `info`, `warn`, or `debug` (default: `info`)
Search the database for structures similar to a query structure.
```shell
fm-search query \
  --db-path databases/my_structures \
  --query-structure query.cif \
  --structure-format mmcif \
  --granularity chain \
  --top-k 100 \
  --threshold 0.8 \
  --output-csv results.csv
```

Key Options:

- `--db-path`: Path to FAISS database
- `--query-structure`: Query structure file
- `--structure-format`: `mmcif` or `pdb`
- `--granularity`: `chain` or `assembly` search mode
- `--chain-id`: Specific chain to search (optional)
- `--assembly-id`: Specific assembly ID (optional)
- `--top-k`: Number of results per query (default: 100)
- `--threshold`: Minimum similarity score (default: 0.8)
- `--output-csv`: Export results to CSV (optional)
- `--min-res`: Minimum residue filter (default: 10)
- `--max-res`: Maximum residue filter (optional)
- `--device`: `cuda`, `cpu`, or `auto`
- `--use-gpu-index`: Use GPU for FAISS search
Compare all entries from a query database against a subject database.
```shell
fm-search query-db \
  --query-db-path databases/query_set \
  --subject-db-path databases/target_set \
  --top-k 100 \
  --threshold 0.8 \
  --output-csv comparisons.csv
```

Key Options:

- `--query-db-path`: Query database path
- `--subject-db-path`: Subject database to search
- `--top-k`: Results per query (default: 100)
- `--threshold`: Similarity threshold (default: 0.8)
- `--output-csv`: Export results to CSV
- `--use-gpu-index`: Use GPU acceleration
Display database statistics.
```shell
fm-search stats --db-path databases/my_structures
```

Cluster database embeddings using the Leiden algorithm.
```shell
fm-search cluster \
  --db-path databases/my_structures \
  --threshold 0.8 \
  --resolution 1.0 \
  --output clusters.csv \
  --max-neighbors 1000 \
  --min-cluster-size 5
```

Key Options:

- `--db-path`: Database path
- `--threshold`: Similarity threshold for edge creation (default: 0.8)
- `--resolution`: Leiden resolution parameter - higher values create more clusters (default: 1.0)
- `--output`: Output file (`.csv` or `.json`)
- `--max-neighbors`: Maximum neighbors per chain (default: 1000)
- `--min-cluster-size`: Filter out smaller clusters (optional)
- `--use-gpu-index`: Use GPU for FAISS operations
- `--seed`: Random seed for reproducibility (optional)
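The cluster assignments can then be post-processed with standard tools. Below is a minimal standard-library sketch that groups chains by cluster label; the column names `chain_id` and `cluster` are illustrative assumptions, not a guaranteed output schema:

```python
import csv
import io
from collections import defaultdict

# Hypothetical cluster output: one row per chain with its cluster label.
# Column names here are assumptions, not guaranteed by fm-search.
sample = """chain_id,cluster
1abc.A,0
1abc.B,0
2xyz.A,1
"""

clusters = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    clusters[row["cluster"]].append(row["chain_id"])

print(dict(clusters))  # {'0': ['1abc.A', '1abc.B'], '1': ['2xyz.A']}
```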
Build a similarity graph from database embeddings and export it in GraphML format. Each node represents a chain (identified by its chain ID) and each edge carries a weight attribute with the cosine similarity score between the two connected chains.
```shell
fm-search similarity-graph \
  --db-path databases/my_structures \
  --threshold 0.8 \
  --output similarity_graph.graphml \
  --max-neighbors 1000
```

Key Options:

- `--db-path`: Database path
- `--threshold`: Minimum similarity score to create an edge (default: 0.8)
- `--output`: Output GraphML file (default: `similarity_graph.graphml`)
- `--max-neighbors`: Maximum neighbors per chain considered during k-NN search (default: 1000)
- `--use-gpu-index`: Use GPU for FAISS operations
- `--log-level`: Logging verbosity level (default: `info`)
The resulting GraphML file can be loaded directly into tools such as Gephi, Cytoscape, or Python's NetworkX:
```python
import networkx as nx

G = nx.read_graphml("similarity_graph.graphml")
```

The `FoldMatch` class provides methods for computing embeddings programmatically.
```python
from foldmatch import FoldMatch

# Initialize model
model = FoldMatch(min_res=10, max_res=5000)

# Load models (optional - loads automatically on first use)
model.load_models()  # Auto-detects CUDA

# or specify device:
# import torch
# model.load_models(device=torch.device("cuda:0"))
```

Load both residue and aggregator models:

```python
import torch

model.load_models(device=torch.device("cuda"))
```

Load only the ESM3 residue embedding model:

```python
model.load_residue_embedding()
```

Load only the aggregator model:

```python
model.load_aggregator_embedding()
```

Compute per-residue embeddings for a structure.
Parameters:
- `src_structure`: File path, URL, or file-like object
- `structure_format`: `'mmcif'`, `'binarycif'`, or `'pdb'`
- `chain_id`: Specific chain ID (optional, uses all chains if None)
- `assembly_id`: Assembly ID for biological assembly (optional)
Returns: torch.Tensor of shape [num_residues, embedding_dim]
```python
# Single chain
residue_emb = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)

# All chains concatenated
all_residues = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif"
)

# Biological assembly
assembly_residues = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    assembly_id="1"
)
```

Compute per-residue embeddings separately for each chain.
Returns: dict[str, torch.Tensor] mapping chain IDs to embeddings
```python
chain_embeddings = model.residue_embedding_by_chain(
    src_structure="1abc.cif",
    structure_format="mmcif"
)
# Returns: {'A': tensor(...), 'B': tensor(...), ...}

# Get specific chain
chain_a = model.residue_embedding_by_chain(
    src_structure="1abc.cif",
    chain_id="A"
)
```

Compute residue embeddings for an assembly.
Returns: dict[str, torch.Tensor] mapping assembly ID to concatenated embeddings
```python
assembly_emb = model.residue_embedding_by_assembly(
    src_structure="1abc.cif",
    structure_format="mmcif",
    assembly_id="1"
)
# Returns: {'1': tensor(...)}
```

Compute residue embeddings from amino acid sequence (no structural information).
Parameters:
- `sequence`: Amino acid sequence string (plain or FASTA format)
Returns: torch.Tensor of shape [sequence_length, embedding_dim]
```python
# Plain sequence
seq_emb = model.sequence_embedding("ACDEFGHIKLMNPQRSTVWY")

# FASTA format
fasta = """>Protein1
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY"""
seq_emb = model.sequence_embedding(fasta)
```

Aggregate residue embeddings into a single structure-level vector.
Parameters:
- `residue_embedding`: `torch.Tensor` from residue embedding methods
Returns: torch.Tensor of shape [1536]
```python
residue_emb = model.residue_embedding("1abc.cif", chain_id="A")
structure_emb = model.aggregator_embedding(residue_emb)
```

End-to-end: compute residue embeddings and aggregate in one call.
```python
# Complete structure embedding
structure_emb = model.structure_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)
# Returns: tensor of shape [1536]
```

A complete workflow combining these methods:

```python
from foldmatch import FoldMatch
import torch

# Initialize
model = FoldMatch(min_res=10, max_res=5000)

# Option 1: Full structure embedding (one-shot)
embedding = model.structure_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)

# Option 2: Step-by-step with residue embeddings
residue_emb = model.residue_embedding(
    src_structure="1abc.cif",
    structure_format="mmcif",
    chain_id="A"
)
structure_emb = model.aggregator_embedding(residue_emb)

# Option 3: Process multiple chains
chain_embeddings = model.residue_embedding_by_chain(
    src_structure="1abc.cif"
)
for chain_id, res_emb in chain_embeddings.items():
    chain_emb = model.aggregator_embedding(res_emb)
    print(f"Chain {chain_id}: {chain_emb.shape}")

# Sequence-based embedding
seq_emb = model.sequence_embedding("ACDEFGHIKLMNPQRSTVWY")
structure_from_seq = model.aggregator_embedding(seq_emb)
```

See the examples/ and tests/ directories for more use cases.
The embedding model is trained to predict structural similarity by approximating TM-scores using cosine distances between embeddings. It consists of two main components:
- Protein Language Model (PLM): Computes residue-level embeddings from a given 3D structure.
- Residue Embedding Aggregator: A transformer-based neural network that aggregates these residue-level embeddings into a single vector.
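Because the model approximates TM-scores with cosine distances, comparing two structures reduces to a cosine similarity between their embedding vectors. A minimal standard-library sketch of that comparison (real embeddings are 1536-dimensional tensors; the toy 2-D vectors here are for illustration only):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors:
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```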
Residue-wise embeddings of protein structures are computed using the ESM3 generative protein language model.
The aggregation component consists of six transformer encoder layers, each with a 3,072-neuron feedforward layer and ReLU activations. After processing through these layers, a summation pooling operation is applied, followed by 12 fully connected residual layers that refine the embeddings into a single 1,536-dimensional vector.
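The architecture described above can be sketched in PyTorch roughly as follows. This is an illustrative, untrained sketch, not the released model: the number of attention heads and the exact design of the residual blocks are assumptions.

```python
import torch
import torch.nn as nn

class AggregatorSketch(nn.Module):
    """Illustrative sketch of the aggregator architecture (NOT the trained
    model). Head count and residual-block internals are assumptions."""

    def __init__(self, dim: int = 1536, n_layers: int = 6,
                 ff_dim: int = 3072, n_residual: int = 12, n_heads: int = 8):
        super().__init__()
        # Six transformer encoder layers with 3,072-neuron feedforward and ReLU
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=ff_dim,
            activation="relu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Twelve fully connected residual refinement layers
        self.residual_blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
             for _ in range(n_residual)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, num_residues, dim] residue-level embeddings
        h = self.encoder(x)
        v = h.sum(dim=1)              # summation pooling over residues
        for block in self.residual_blocks:
            v = v + block(v)          # residual refinement
        return v                      # [batch, dim] structure-level embedding
```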
After installation, run the test suite:
```shell
pytest
```

Segura, J., et al. (2026). Multi-scale structural similarity embedding search across entire proteomes. https://doi.org/10.1093/bioinformatics/btag058
This project uses the EvolutionaryScale ESM-3 model and is distributed under the Cambrian Non-Commercial License Agreement.
