This repository contains the research prototype implementation of SAGA (State-Aware Graph Analytics), a distributed framework for maintaining combinatorial graph properties on dynamic graphs under continuous updates and concurrent queries.
The implementation is Flink-native and is intended for research evaluation and artifact review, not as a production library.
SAGA is designed to incrementally maintain graph properties as the graph evolves, without global recomputation. The system emphasizes:
- State-aware vertex management
- Localized, incremental repair
- Non-blocking query processing
- Streaming execution on Apache Flink
The current implementation supports the following algorithms:
- SAGA-MIS — Maximal Independent Set
- SAGA-GC — Graph Coloring
- SAGA-MM — Maximal Matching
Each algorithm operates on per-vertex state and is integrated into a unified streaming runtime.
- **VertexState**: Authoritative per-vertex logical state stored as Flink keyed state.
- **S-Store (State Store)**: A state-aware working context that captures before/after vertex state during update processing.
- **Algorithms**: MIS, GC, and MM are implemented as pluggable modules with minimal algorithm-specific state.
- **Ω (Coordination Engine)**: A lightweight coordination layer used when updates affect boundary (replicated) vertices.
- **Flink Runtime**: Apache Flink is used for streaming execution, fault tolerance, and state management.
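To make the pluggable-algorithm idea concrete, here is a minimal, self-contained sketch in the spirit of the components above. The interface name, method signatures, and the toy MIS rule are illustrative assumptions, not the repository's actual API:

```java
// Hypothetical sketch of a pluggable per-vertex algorithm module with a
// toy MIS-style implementation; names are illustrative, not SAGA's API.
import java.util.HashSet;
import java.util.Set;

interface SagaAlgorithm {
    void onEdgeAdd(long u, long v);
    String query(long vertexId);
}

// Toy greedy rule: when both endpoints of a new edge are in the
// independent set, the higher-numbered endpoint leaves the set.
class ToyMis implements SagaAlgorithm {
    private final Set<Long> inSet = new HashSet<>();
    private final Set<Long> seen = new HashSet<>();

    private void touch(long v) {
        if (seen.add(v)) inSet.add(v); // newly seen vertices join the set
    }

    @Override public void onEdgeAdd(long u, long v) {
        touch(u); touch(v);
        if (inSet.contains(u) && inSet.contains(v)) {
            inSet.remove(Math.max(u, v)); // break the conflict locally
        }
    }

    @Override public String query(long vertexId) {
        return inSet.contains(vertexId) ? "IN" : "OUT";
    }
}

public class AlgorithmSketch {
    public static void main(String[] args) {
        ToyMis mis = new ToyMis();
        mis.onEdgeAdd(0, 1);
        mis.onEdgeAdd(1, 2);
        System.out.println(mis.query(0) + " " + mis.query(1) + " " + mis.query(2));
    }
}
```

In the real system such a module would operate on Flink keyed state rather than an in-memory set, but the contract (localized repair on edge change, read-only query) is the same.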
```
saga/
├── api/            # Update & query APIs
├── state/
│   ├── vertex/     # VertexState and algorithm state
│   └── sstore/     # State-aware working context
├── algorithms/
│   ├── mis/        # Maximal Independent Set
│   ├── gc/         # Graph Coloring
│   └── mm/         # Maximal Matching
├── runtime/
│   ├── flink/      # Flink job and state backend
│   └── operators/  # Flink operators
└── coordination/   # Ω coordination layer
```
The codebase is intentionally modular to allow reviewers to inspect algorithms, coordination, and runtime independently.
- Java 17
- Apache Flink 1.17+ (tested with modern Flink releases)
- Maven 3.8+
- Multi-core machine (≥16 cores recommended)
- ≥64 GB RAM for large graphs
- Distributed cluster recommended for scaling experiments
From the repository root:
```
mvn clean package
```

This produces a runnable JAR under `target/`.
Each line represents a dynamic edge update:
```
ADD u v
DEL u v
```
Where u and v are vertex identifiers.
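A minimal sketch of how such lines could be parsed (the `UpdateParser` class and `EdgeUpdate` record are hypothetical names, not SAGA classes):

```java
// Minimal parser for "ADD u v" / "DEL u v" update lines.
// The EdgeUpdate record is a hypothetical stand-in, not a SAGA class.
public class UpdateParser {
    public record EdgeUpdate(boolean isAdd, long u, long v) {}

    public static EdgeUpdate parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 'ADD|DEL u v': " + line);
        }
        boolean isAdd = switch (parts[0]) {
            case "ADD" -> true;
            case "DEL" -> false;
            default -> throw new IllegalArgumentException("unknown op: " + parts[0]);
        };
        return new EdgeUpdate(isAdd, Long.parseLong(parts[1]), Long.parseLong(parts[2]));
    }

    public static void main(String[] args) {
        System.out.println(parse("ADD 3 7"));
    }
}
```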
Each line represents a read-only query:
```
QUERY vertexId
```
Example (Maximal Independent Set):
```
flink run \
  -c saga.runtime.flink.SagaJob \
  target/saga.jar \
  --algorithm MIS \
  --updates /path/to/updates.txt \
  --queries /path/to/queries.txt
```

Supported algorithm values: `MIS`, `GC`, `MM`.
Queries are optional. If omitted, the system runs in update-only mode.
SAGA is evaluated on a collection of large real-world graphs, referred to as G1–G10, which are treated as dynamic graphs by applying incremental update batches over an initial static base graph.
Table below summarizes the graphs used in evaluation. All graphs are publicly available benchmark datasets commonly used in large-scale graph processing and graph algorithm research.
| S. No. | Graph Name | Vertices | Edges | Avg Degree |
|---|---|---|---|---|
| G1 | citationCiteseer | 0.27M | 2.31M | 8.55 |
| G2 | amazon0601 | 0.40M | 4.89M | 12.22 |
| G3 | as-Skitter | 1.69M | 22.19M | 13.14 |
| G4 | cit-Patents | 3.77M | 33.04M | 8.80 |
| G5 | rmat22 | 4.19M | 65.66M | 15.67 |
| G6 | soc-LiveJournal1 | 4.85M | 85.70M | 17.67 |
| G7 | delaunay_n24 | 16.78M | 100.66M | 5.99 |
| G8 | kron_g500-logn21 | 2.09M | 182.08M | 87.11 |
| G9 | uk-2002 | 18.52M | 523.57M | 28.27 |
| G10 | com-Friendster | 65.61M | 1.82B | 27.58 |
Notes:
- Graphs span social, web, citation, and synthetic categories.
- They cover a wide range of:
  - graph sizes,
  - degree distributions,
  - sparsity/density characteristics.
- This diversity is intentional to stress different aspects of SAGA's design.
The collection includes both real-world graphs and synthetic generators (e.g., RMAT and Kronecker graphs) to evaluate behavior across diverse structural characteristics.
Before graph construction, each dataset undergoes the following preprocessing steps:
- **Removal of self-loops**: Edges of the form `(u, u)` are discarded.
- **Duplicate edge elimination**: Multiple occurrences of the same edge are collapsed into a single edge.
- **Undirected normalization**: For directed inputs, edges are symmetrized so that `(u, v)` implies `(v, u)`.
- **Vertex ID normalization**: Vertex identifiers are remapped to a dense range `[0, n)` to improve memory locality.
These steps ensure that all graphs conform to a consistent structure suitable for combinatorial graph algorithms.
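The four preprocessing steps can be sketched as a single pass over the raw edge list; class and method names here are illustrative, not taken from the repository:

```java
// Sketch of the preprocessing pipeline: drop self-loops, symmetrize,
// deduplicate, and remap vertex IDs to a dense [0, n) range.
// Class and method names are illustrative, not SAGA's actual code.
import java.util.*;

public class Preprocess {
    // Input: raw (possibly directed) edges as {u, v} pairs.
    // Output: cleaned, undirected, deduplicated edges over dense IDs.
    public static List<long[]> clean(List<long[]> rawEdges) {
        Set<Long> vertices = new TreeSet<>();
        Set<List<Long>> edges = new LinkedHashSet<>();
        for (long[] e : rawEdges) {
            if (e[0] == e[1]) continue;               // drop self-loop
            long a = Math.min(e[0], e[1]), b = Math.max(e[0], e[1]);
            edges.add(List.of(a, b));                 // symmetrize + dedup
            vertices.add(a); vertices.add(b);
        }
        // Remap vertex IDs to a dense range [0, n).
        Map<Long, Long> remap = new HashMap<>();
        long next = 0;
        for (long v : vertices) remap.put(v, next++);
        List<long[]> out = new ArrayList<>();
        for (List<Long> e : edges) {
            out.add(new long[]{remap.get(e.get(0)), remap.get(e.get(1))});
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> raw = List.of(
            new long[]{5, 5},   // self-loop, dropped
            new long[]{5, 9},
            new long[]{9, 5},   // duplicate after symmetrization
            new long[]{9, 12});
        for (long[] e : clean(raw)) System.out.println(e[0] + " " + e[1]);
    }
}
```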
Internally, SAGA represents graphs using the Compressed Sparse Row (CSR) format.
For each graph:
- an adjacency list is constructed after preprocessing,
- neighbors are stored contiguously in memory,
- an offset array indexes the start of each vertex’s adjacency.
Conceptually, the CSR representation consists of:
- a row pointer array of size `|V| + 1`, and
- a column index array of size `|E|`.
This representation:
- enables efficient neighbor traversal,
- improves cache locality,
- minimizes memory overhead,
- aligns with high-performance graph processing practices.
The CSR structure serves as the initial base graph state before dynamic updates are applied.
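The two-array CSR layout described above can be built in two passes (degree counting, then placement); this is a generic sketch, not the repository's implementation:

```java
// Generic sketch of building CSR arrays (row pointers + column indices)
// from a preprocessed undirected edge list; illustrative, not SAGA code.
public class CsrBuilder {
    public final int[] rowPtr;  // size |V| + 1
    public final int[] colIdx;  // each undirected edge stored in both directions

    public CsrBuilder(int n, int[][] edges) {
        rowPtr = new int[n + 1];
        // First pass: count the degree of each vertex.
        for (int[] e : edges) { rowPtr[e[0] + 1]++; rowPtr[e[1] + 1]++; }
        // Prefix sum turns counts into row offsets.
        for (int v = 0; v < n; v++) rowPtr[v + 1] += rowPtr[v];
        colIdx = new int[rowPtr[n]];
        // Second pass: place each endpoint, tracking a write cursor per row.
        int[] cursor = rowPtr.clone();
        for (int[] e : edges) {
            colIdx[cursor[e[0]]++] = e[1];
            colIdx[cursor[e[1]]++] = e[0];
        }
    }

    public static void main(String[] args) {
        // Triangle 0-1-2 plus a pendant vertex 3 attached to 2.
        CsrBuilder g = new CsrBuilder(4, new int[][]{{0,1},{1,2},{0,2},{2,3}});
        System.out.println(java.util.Arrays.toString(g.rowPtr));
        System.out.println(java.util.Arrays.toString(g.colIdx));
    }
}
```

The neighbors of vertex `v` are then `colIdx[rowPtr[v] .. rowPtr[v+1])`, which is the contiguous, cache-friendly traversal the text describes.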
While updates are provided as a stream of ADD/DEL operations, SAGA internally maintains vertex adjacency using CSR-derived structures that are incrementally updated during execution.
This choice also keeps neighborhood access efficient during incremental updates.
To model graph evolution, a subset of edges from each CSR graph is converted into a stream of dynamic updates.
Each update is encoded textually as:
```
ADD u v
DEL u v
```
where u and v are vertex identifiers consistent with the CSR mapping.
Updates are applied incrementally on top of the base CSR graph during execution.
To evaluate scalability and responsiveness under varying workloads, update streams are divided into batches of increasing size.
The following batch sizes are used throughout the evaluation:
| Batch Size | Number of Updates |
|---|---|
| 0.01K | 10 |
| 0.1K | 100 |
| 1K | 1,000 |
| 10K | 10,000 |
| 100K | 100,000 |
Each batch is processed independently to measure:
- update processing time,
- throughput,
- query latency under concurrent updates.
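Throughput for a batch follows directly from the batch size and the measured wall-clock time; a minimal sketch (the class name is illustrative):

```java
// Sketch: deriving update throughput from batch size and measured
// wall-clock time. The class name is illustrative.
public class Throughput {
    public static double updatesPerSecond(long updates, long elapsedNanos) {
        return updates / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        // e.g. a 10K batch processed in 0.5 s -> 20,000 updates/s
        System.out.println(updatesPerSecond(10_000, 500_000_000L));
    }
}
```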
Given a preprocessed CSR graph, update batches are generated as follows:
- Select a subset of edges from the CSR representation.
- Randomly partition the selected edges into batches of the desired size.
- Emit each batch as a sequence of `ADD` or `DEL` operations.
This approach:
- preserves the original graph topology,
- avoids synthetic graph generation,
- reflects realistic incremental graph evolution.
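The steps above can be sketched as follows; the `BatchGenerator` name and the ADD-only emission are simplifying assumptions (real streams mix `ADD` and `DEL`):

```java
// Sketch of batch generation: shuffle a subset of CSR edges and slice
// them into fixed-size batches of update lines. Names are illustrative.
import java.util.*;

public class BatchGenerator {
    public static List<List<String>> makeBatches(List<long[]> edges,
                                                 int batchSize, long seed) {
        List<long[]> shuffled = new ArrayList<>(edges);
        Collections.shuffle(shuffled, new Random(seed)); // random partition
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < shuffled.size(); i += batchSize) {
            List<String> batch = new ArrayList<>();
            int end = Math.min(i + batchSize, shuffled.size());
            for (long[] e : shuffled.subList(i, end)) {
                // Simplification: every update is an ADD here;
                // DEL streams are generated the same way.
                batch.add("ADD " + e[0] + " " + e[1]);
            }
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<long[]> edges = new ArrayList<>();
        for (long i = 0; i < 25; i++) edges.add(new long[]{i, i + 1});
        List<List<String>> batches = makeBatches(edges, 10, 42L);
        System.out.println(batches.size() + " batches, last has "
                + batches.get(batches.size() - 1).size() + " updates");
    }
}
```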
SAGA supports read-only vertex queries that can be issued during or after update processing.
Each query is encoded as:
```
QUERY vertexId
```
where vertexId refers to a vertex in the current graph.
- Queries are read-only
- Queries do not modify graph or algorithm state
- Queries do not block update processing
- Queries are served directly from live keyed vertex state
Depending on the selected algorithm:
- MIS: query returns whether the vertex is in the independent set
- GC: query returns the current color of the vertex
- MM: query returns the matched partner (or −1 if unmatched)
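These per-algorithm result semantics can be summarized in a small dispatcher; the `VertexView` record is a hypothetical stand-in for SAGA's keyed vertex state, not the actual state layout:

```java
// Sketch of per-algorithm query result semantics. The VertexView record
// and its fields are assumptions, not SAGA's actual state layout.
public class QueryResult {
    // Toy per-vertex view: MIS membership, GC color, MM partner (-1 if unmatched).
    public record VertexView(boolean inMis, int color, long partner) {}

    public static String answer(String algorithm, VertexView v) {
        return switch (algorithm) {
            case "MIS" -> v.inMis() ? "IN_SET" : "NOT_IN_SET";
            case "GC"  -> Integer.toString(v.color());
            case "MM"  -> Long.toString(v.partner()); // -1 means unmatched
            default    -> throw new IllegalArgumentException(algorithm);
        };
    }

    public static void main(String[] args) {
        VertexView v = new VertexView(true, 2, -1);
        System.out.println(answer("MIS", v) + " / "
                + answer("GC", v) + " / " + answer("MM", v));
    }
}
```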
Queries may be issued:
- concurrently with update batches, or
- after a batch has completed.
This allows evaluation of query latency under active graph evolution, which is a core design goal of SAGA.
Query latency depends on:
- keyed state access,
- Flink task scheduling,
- operator execution overhead.
No log replay, joins, or recomputation is required.
SAGA relies on Apache Flink’s runtime logging and metrics infrastructure rather than fine-grained per-update application logging.
- SAGA does not log individual graph updates or per-vertex decisions by default.
- This design avoids excessive overhead during high-throughput streaming execution.
- Instead, logging focuses on runtime-level events, including:
  - job startup and termination
  - operator initialization
  - checkpoint creation and completion
  - failures and recovery
  - backpressure and task warnings
All such events are handled by Flink’s logging subsystem.
When running SAGA on a Flink cluster, logs are written to `$FLINK_HOME/log/`.
Typical log files include:

- `flink-*-jobmanager-*.log`
- `flink-*-taskexecutor-*.log`
These logs provide visibility into:
- state backend initialization
- operator execution
- checkpointing and recovery
- runtime errors, if any
This section describes how to run the experimental workloads used to evaluate SAGA once the code is compiled.
All experiments follow the same execution pattern:
- start a Flink cluster,
- provide a base graph (preprocessed),
- provide an update stream,
- optionally provide a query stream,
- execute the SAGA Flink job with the desired algorithm.
Before running experiments, ensure that:

- Apache Flink is installed and configured
- `$FLINK_HOME/bin` is in the system path
- sufficient memory is available for the chosen graph
- the project has been compiled using:

```
mvn clean package
```

On a standalone or cluster setup:

```
$FLINK_HOME/bin/start-cluster.sh
```

Verify that the cluster is running by visiting:

http://localhost:8081

or by listing running jobs:

```
$FLINK_HOME/bin/flink list
```

All experiments are launched using the same Flink job entry point:
```
flink run \
  -c saga.runtime.flink.SagaJob \
  target/saga.jar \
  --algorithm <ALGO> \
  --updates <UPDATE_FILE> \
  --queries <QUERY_FILE>
```

Where:

- `<ALGO>` ∈ {`MIS`, `GC`, `MM`}
- `<UPDATE_FILE>` is a text file containing `ADD`/`DEL` operations
- `<QUERY_FILE>` is optional and contains `QUERY vertexId` entries
```
flink run \
  -c saga.runtime.flink.SagaJob \
  target/saga.jar \
  --algorithm MIS \
  --updates updates_G5_10K.txt \
  --queries queries.txt
```

This runs incremental MIS maintenance while processing dynamic updates.
```
flink run \
  -c saga.runtime.flink.SagaJob \
  target/saga.jar \
  --algorithm GC \
  --updates updates_G5_10K.txt \
  --queries queries.txt
```

This maintains a proper vertex coloring under dynamic updates.
```
flink run \
  -c saga.runtime.flink.SagaJob \
  target/saga.jar \
  --algorithm MM \
  --updates updates_G5_10K.txt \
  --queries queries.txt
```

This maintains a maximal matching incrementally.
To evaluate scalability, the same experiment is repeated with update batches of different sizes:
| Batch Label | Updates |
|---|---|
| 0.01K | 10 |
| 0.1K | 100 |
| 1K | 1,000 |
| 10K | 10,000 |
| 100K | 100,000 |
Example:
```
--updates updates_G6_100K.txt
```

Each update file corresponds to a batch generated as described in Section 7.
To measure pure update throughput, omit the query argument:
```
flink run \
  -c saga.runtime.flink.SagaJob \
  target/saga.jar \
  --algorithm MIS \
  --updates updates_G6_100K.txt
```

This runs the system in update-only mode.
During execution, the following can be observed:
- Flink Web UI: http://localhost:8081
- Job status and runtime
- Operator parallelism
- Backpressure and throughput
- Checkpoint progress

Runtime logs are available under `$FLINK_HOME/log/`.