
fix: use atomic writes for state.json to prevent corruption #600

Open
MayankSharmaCSE wants to merge 1 commit into urunc-dev:main from MayankSharmaCSE:fix/atomic-state-write


@MayankSharmaCSE

Description

This PR fixes the non-atomic state.json writes in saveContainerState() that could lead to file corruption if the urunc process is killed mid-write.

Problem

  • saveContainerState() in pkg/unikontainers/unikontainers.go was using os.WriteFile() directly
  • If the process was killed during a write, state.json could be left partially written or corrupted
  • Subsequent lifecycle operations (urunc start, kill, delete) would fail with JSON parse errors
  • VMM processes would become orphaned and unmanageable

Solution

  • Added atomicWriteFile() helper in pkg/unikontainers/utils.go that:
    • Writes to a temporary file in the same directory
    • Syncs the file to ensure durability
    • Atomically renames the temp file to the target path
  • Refactored saveContainerState() to use the new helper
  • Refactored writePidFile() to use the same helper for consistency
  • Added unit tests for atomicWriteFile()

Files changed

  • pkg/unikontainers/utils.go - Added atomicWriteFile(), refactored writePidFile()
  • pkg/unikontainers/unikontainers.go - Changed saveContainerState() to use atomic writes
  • pkg/unikontainers/utils_test.go - Added TestAtomicWriteFile with 4 test cases

Related issues

How was this tested?

  • Added unit tests for atomicWriteFile() covering:
    • Basic atomic write
    • Overwriting existing file
    • Temp file cleanup on success
    • Error handling for invalid directory
  • Verified writePidFile still works correctly (existing test passes)
  • Verified go vet passes for Linux target
  • Verified golangci-lint passes with no new warnings

LLM usage

Claude Opus 4.7 (Anthropic) assisted with analysis, implementation, and code review.

Checklist

  • I have read the contribution guide.
  • The linter passes locally (make lint).
  • The e2e tests of at least one tool pass locally (make test_ctr, make test_nerdctl, make test_docker, make test_crictl).
  • If LLMs were used: I have read the llm policy.

Signed-off-by: mayanksharmaCSE <mayanksharmacse1@gmail.com>
Copilot AI review requested due to automatic review settings May 1, 2026 21:04
@netlify

netlify Bot commented May 1, 2026

Deploy Preview for urunc canceled.

Name | Link
--- | ---
🔨 Latest commit | b8e3312
🔍 Latest deploy log | https://app.netlify.com/projects/urunc/deploys/69f5154125a06900087ec3c5


Copilot AI left a comment


Pull request overview

This PR hardens unikontainer state persistence by replacing direct os.WriteFile() updates with an atomic temp-file-and-rename approach to reduce the risk of state.json corruption on mid-write termination (Fixes #598).

Changes:

  • Added atomicWriteFile() helper for atomic file updates via temp file + sync + rename.
  • Refactored saveContainerState() (state.json) and writePidFile() to use atomicWriteFile().
  • Added unit tests covering basic atomicWriteFile() behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File | Description
--- | ---
pkg/unikontainers/utils.go | Introduces atomicWriteFile() and refactors writePidFile() to use it
pkg/unikontainers/unikontainers.go | Switches saveContainerState() to atomic writes
pkg/unikontainers/utils_test.go | Adds unit tests for atomicWriteFile()


```go
	return fmt.Errorf("failed to close temp file: %w", closeErr)
}

return os.Rename(tmpName, path)
```

Copilot AI May 1, 2026


atomicWriteFile returns os.Rename(tmpName, path) directly. If the rename fails, the temp file is left behind and the returned error lacks context about the target path. Consider removing the temp file on rename failure and wrapping the error so callers get a clearer failure reason.

Suggested change
```diff
-return os.Rename(tmpName, path)
+if err := os.Rename(tmpName, path); err != nil {
+	_ = os.Remove(tmpName)
+	return fmt.Errorf("failed to rename temp file %q to %q: %w", tmpName, path, err)
+}
+return nil
```

```go
dir := filepath.Dir(path)
tmpName := filepath.Join(dir, "."+filepath.Base(path)+".tmp")

f, err := os.OpenFile(tmpName, os.O_RDWR|os.O_CREATE|os.O_TRUNC|os.O_SYNC, perm)
```

Copilot AI May 1, 2026


atomicWriteFile opens the temp file with O_SYNC and also calls f.Sync(). This is redundant and can significantly slow down writes (O_SYNC already forces synchronous writes). Consider dropping O_SYNC and relying on an explicit f.Sync() (or vice versa) to keep the durability semantics while avoiding extra overhead.

Suggested change
```diff
-f, err := os.OpenFile(tmpName, os.O_RDWR|os.O_CREATE|os.O_TRUNC|os.O_SYNC, perm)
+f, err := os.OpenFile(tmpName, os.O_RDWR|os.O_CREATE|os.O_TRUNC, perm)
```

Comment on lines +73 to +79
```go
t.Run("fails on invalid directory", func(t *testing.T) {
	t.Parallel()
	target := filepath.Join("/nonexistent/dir", "state.json")

	err := atomicWriteFile(target, []byte("data"), 0o644)
	assert.Error(t, err)
})
```

Copilot AI May 1, 2026


TestAtomicWriteFile covers the success path and a create-temp failure, but it doesn’t cover the rename-failure path (e.g., when the target exists as a directory or permissions prevent replacing it). Adding a test for rename failure is useful to verify temp-file cleanup and to prevent accumulating stray temp files over time.


```diff
 stateName := filepath.Join(u.BaseDir, stateFilename)
-return os.WriteFile(stateName, data, 0o644) //nolint: gosec
+return atomicWriteFile(stateName, data, 0o644) //nolint: gosec
```

Copilot AI May 1, 2026


saveContainerState still has a //nolint: gosec suppression on the atomicWriteFile call. Since this is no longer a direct os.WriteFile with a literal permission, the suppression may be unnecessary and could hide future issues on this line. Consider removing the nolint (or moving it to the specific line that triggers gosec, if any).

Suggested change
```diff
-return atomicWriteFile(stateName, data, 0o644) //nolint: gosec
+return atomicWriteFile(stateName, data, 0o644)
```

Comment on lines +114 to 118
```go
dir := filepath.Dir(path)
tmpName := filepath.Join(dir, "."+filepath.Base(path)+".tmp")

f, err := os.OpenFile(tmpName, os.O_RDWR|os.O_CREATE|os.O_TRUNC|os.O_SYNC, perm)
if err != nil {
```

Copilot AI May 1, 2026


atomicWriteFile uses a deterministic temp path ("."+base+".tmp") and opens it with O_TRUNC (no O_EXCL / unique name). If two goroutines/processes call atomicWriteFile for the same target concurrently, they can share the same temp inode; one caller may rename it to the final path while the other still has the FD open and continues writing, which can corrupt the final file (defeating the purpose of atomic writes). Use a unique temp file per call (e.g., os.CreateTemp in the same dir with a prefix) and then rename that unique file into place (optionally chmod to perm).


Development

Successfully merging this pull request may close these issues.

saveContainerState() should use atomic writes to prevent state.json corruption
