ENH: Add lychee link-checker as a Utilities/Maintenance script#5377
Not sure why it complains about the links to the commits in the release notes files. |
|
Many failures are due to HTTP code 429: Network error: Too Many Requests. Can we slow down request rate or something similar? |
Force-pushed from ae36dc6 to cb55e14.
|
Looks like limiting the rate of requests through caching does not work. Not sure whether the cache has to be built first for it to work. Have gone through these but have not found how to modify the arguments to make throttling/request-pace limiting work. Have seen other workflows accepting 429, e.g. but I guess that will effectively not check the links at issue. Also, checks will not work until this issue is solved: I do not know any other tool that does this job for rst files. This one only seems to work for md files: |
|
@jhlegarreta It seems like this effort has been abandoned due to difficulties. Perhaps it can be turned into a manual script in the Utilities directory that is run periodically, rather than adding it to the CI in a way that will slow down other efforts? I'm making a pass through issues, trying to reduce the number of stale issues that are unlikely to proceed. |
|
@hjmjohnson thanks for the heads-up; I am going through challenging times on my end, so I have been unable to push ITK items forward. My sincere apologies. The approach you propose sounds reasonable. Feel free to close the PR. |
Force-pushed from cb55e14 to cc6bfed.
Replaces the original GHA workflow attempt (PR InsightSoftwareConsortium#5377) with a manual script under Utilities/Maintenance/. Per author + maintainer discussion on the original PR, integrating lychee into per-PR CI is not workable: lycheeverse/lychee#1574 and HTTP 429 rate-limits on github.com / DOI hosts produce too many spurious failures to act on in CI.
New artifacts:
- Utilities/Maintenance/check-links.sh: wrapper that resolves the repository root, requires lychee on PATH, runs against the supplied paths (or the whole tree by default), and writes a Markdown report plus a persistent cache.
- Utilities/Maintenance/lychee.toml: configuration with the rate-limit-aware accept list (treats 429/999 as non-broken), commit-URL skip pattern that motivated the original CI failures, exclusions for ThirdParty trees, and path globs limited to documentation file types.
- .gitignore: ignore the local cache (.lycheecache) and report (.lychee-report.md) artifacts so re-running does not pollute the working tree.
This script is intended for periodic / on-demand runs by maintainers, not the per-PR pipeline.
Co-Authored-By: Jon Haitz Legarreta Gorroño <5576557+jhlegarreta@users.noreply.github.com>
|
@jhlegarreta, thank you for the original work and for the OK to pivot. Force-pushed. Holler if you'd rather I close this and open the script under a fresh PR; both work, but reusing this branch felt cleaner since the conversation history is already here. |
Force-pushed from cc6bfed to b9db686.
|
| Filename | Overview |
|---|---|
| Utilities/Maintenance/check-links.sh | New lychee wrapper script; set -euo pipefail conflicts with the manual exit-code capture pattern — the "Report written to" message and exit $status are skipped when lychee finds broken links. |
| Utilities/Maintenance/lychee.toml | New lychee config; exclude_path entry Documentation/Release is stale — release notes live at Documentation/docs/releases/ — making the exclusion a silent no-op. |
| .gitignore | Adds .lycheecache and .lychee-report.md gitignore entries; correct and complete. |
| Documentation/docs/releases/1.0.md | Dead links updated; CDash URL is duplicated in the resource list (two entries with identical URL replacing two formerly distinct resources). |
| Documentation/Maintenance/Release.md | Corrects release-notes directory reference from Documentation/ReleaseNotes to Documentation/docs/releases. |
| Documentation/docs/migration_guides/itk_5_migration_guide.md | Doxygen links updated to docs.itk.org; an incomplete "Update scripts" stub section removed cleanly. |
| Documentation/docs/releases/5.0b03.md | Multiple dead links replaced with current GitHub/main-branch equivalents; JIRA tracker noted as decommissioned. |
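The `set -euo pipefail` concern raised in the table for check-links.sh can be illustrated with a minimal sketch. This is not the script from the PR: `fake_link_checker` is a hypothetical stand-in for the lychee invocation, and the report path is taken from the .gitignore entry above. The idea is that a command on the left-hand side of `||` is exempt from errexit, so its status can be captured and the confirmation message still runs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the lychee invocation: always fails with status 2,
# the way a link checker might when it finds broken links.
fake_link_checker() {
  return 2
}

# A command on the left of `||` does not trigger errexit, so the script
# keeps running and the failing status can be recorded.
status=0
fake_link_checker || status=$?

echo "Report written to .lychee-report.md (checker exit status: ${status})"
# A real wrapper would propagate the result with: exit "$status"
```

With a bare `fake_link_checker` followed by `status=$?`, the shell would stop at the failing call and neither the message nor the final `exit` would be reached, which is the behavior the review describes.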
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([Maintainer runs check-links.sh]) --> B{lychee on PATH?}
B -- No --> C[exit 127]
B -- Yes --> D{lychee.toml present?}
D -- No --> E[exit 1]
D -- Yes --> F[cd REPO_ROOT]
F --> G{Args supplied?}
G -- No --> H[default scan]
G -- Yes --> I[use supplied paths]
H & I --> J[lychee runs with cache and markdown output]
J -- exit 0 --> K[echo Report written, exit 0]
J -- exit non-zero --> L[set -e terminates: echo and exit skipped]
K --> M([Done])
L --> N([Script exits with lychee code without confirmation])
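The first two decision nodes of the flowchart (tool on PATH, config present) might look roughly like the following sketch. The function name `preflight` and its arguments are assumptions made for illustration; only the exit codes 127 and 1 come from the flowchart.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight checks. The tool name and config path are passed in
# as arguments so the sketch can be exercised without lychee installed.
preflight() {
  local tool="$1" config="$2"
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "error: ${tool} not found on PATH" >&2
    return 127
  fi
  if [ ! -f "$config" ]; then
    echo "error: config file ${config} not found" >&2
    return 1
  fi
  return 0
}

# A tool name that should not exist takes the "exit 127" branch.
rc_tool=0
preflight "no-such-tool-for-this-sketch" "/nonexistent/lychee.toml" || rc_tool=$?
echo "missing tool   -> ${rc_tool}"

# `sh` exists on POSIX systems, so here the missing config is what fails.
rc_config=0
preflight "sh" "/nonexistent/lychee.toml" || rc_config=$?
echo "missing config -> ${rc_config}"
```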
Reviews (1): Last reviewed commit: "ENH: lychee.toml absorbs gateway-timeout..."
Updates four URLs flagged by Utilities/Maintenance/check-links.sh that have well-defined modern equivalents:
- Doxygen `\tparam` reference: stack.nl/~dimitri/doxygen -> doxygen.nl/manual (the canonical Doxygen documentation site since the project's move).
- SCI Institute Seg3D landing page: cibc-software/seg3d.html -> sci.utah.edu/seg3d.
- SCI Institute SCIRun landing page: cibc-software/scirun.html -> sci.utah.edu/scirun.
- MITK home: mitk.org/wiki/The_Medical_Imaging_Interaction_Toolkit (the wiki page no longer resolves) -> mitk.org/ (the project home).
All four replacements were verified to return HTTP 200. Other lychee findings (release-notes link rot, ipfs.io connection resets, Kitware blog post moves) need case-by-case research and are deferred.
Force-pushed from 33fbffc to e9f9781.
@hjmjohnson This is fine. Thanks for all this work. |
Bulk update of broken links flagged by
Utilities/Maintenance/check-links.sh on top of the prior obvious-fixes
commit:
- itk.org/Insight/Doxygen/html/...
-> docs.itk.org/projects/doxygen/en/stable/...
(the legacy doxygen path moved to docs.itk.org).
- classitk_1_1Experimental_1_1<Range>.html
-> classitk_1_1<Range>.html
for ImageBufferRange, IndexRange, ShapedImageNeighborhoodRange,
ImageRegionRange. These classes were promoted out of the
itk::Experimental namespace; only the un-Experimental URL resolves
on the modern doxygen build. HyperrectangularImageNeighborhoodShape
is touched the same way (the un-Experimental name is the live one).
- namespaceitk_1_1Experimental.html -> namespaceitk.html
(the namespace was retired; its members live in itk:: now).
- github.com/.../ITK/blob/master/Documentation/ITK5MigrationGuide.md
-> github.com/.../ITK/blob/main/Documentation/docs/migration_guides/itk_5_migration_guide.md
(file moved + renamed; anchors in the old URL resolve in the new
layout).
- Documentation/Maintenance/Release.md: ReleaseNotes folder rename
-> Documentation/docs/releases.
- courses.md: uu.nl/en/master/medical-imaging/study-programme
-> uu.nl/en/masters/medical-imaging.
All replacement URLs verified to return HTTP 200 before staging.
Re-running check-links.sh on the touched files reduces error count
from 90 to 17 (residual = dead course pages, ipfs.io infra, HDF5
license relocation, opencollective rate-limit transient — all need
case-by-case research).
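The Doxygen-related rewrites listed in this commit are mechanical enough to sketch with `sed`. This is an illustration, not the procedure the commit used: the function name `rewrite_doxygen_url` is hypothetical, and the `https://` scheme and optional `www.` prefix on the sample URL are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: applies the three Doxygen URL mappings described above to text
# read from stdin.
rewrite_doxygen_url() {
  sed -E \
    -e 's#(https?://)?(www\.)?itk\.org/Insight/Doxygen/html/#https://docs.itk.org/projects/doxygen/en/stable/#g' \
    -e 's#classitk_1_1Experimental_1_1#classitk_1_1#g' \
    -e 's#namespaceitk_1_1Experimental\.html#namespaceitk.html#g'
}

# Assumed sample input; IndexRange is one of the classes named in the commit.
sample='https://itk.org/Insight/Doxygen/html/classitk_1_1Experimental_1_1IndexRange.html'
rewritten="$(printf '%s\n' "$sample" | rewrite_doxygen_url)"
echo "$rewritten"
# prints https://docs.itk.org/projects/doxygen/en/stable/classitk_1_1IndexRange.html
```

A rewrite like this only produces candidate URLs; each one would still need to be checked against the live site, as the commit message notes was done.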
Six links in Documentation/docs/releases/5.0b03.md pointed at files
that have moved or been renamed since the 5.0b03 release. Repoint
each to its current location on main:
- .github/ISSUE_TEMPLATE.md -> .github/ISSUE_TEMPLATE/ (templates
moved to a directory of files).
- .github/PULL_REQUEST_TEMPLATE.md
-> .github/pull_request_template.md
(renamed lowercase).
- Documentation/CodeOfConduct/Motivation.md
-> CODE_OF_CONDUCT.md (the standalone
Motivation.md was merged into the
top-level Code of Conduct).
- Documentation/Data.md -> Documentation/docs/contributing/data.md
(docs reorganised under
Documentation/docs/).
- Documentation/UploadBinaryData.md
-> Documentation/docs/contributing/upload_binary_data.md.
- Utilities/UploadBinaryData.sh (script removed)
-> Documentation/docs/contributing/upload_binary_data.md
(the canonical doc that supersedes
the now-removed helper script).
- Documentation/ReleaseNotes/ -> Documentation/docs/releases/.
Replacement URLs verified to return HTTP 200. Two remaining 404s in
this file (atlassian.net JIRA project — decommissioned, the broken
link itself communicates that fact; GitCheatSheet.pdf — file removed
without a successor) are left in place for historical accuracy.
The legacy ITK JIRA project at insightsoftwareconsortium.atlassian.net was decommissioned by Atlassian; the URL no longer resolves. Remove the broken link and rephrase the surrounding sentence to state the decommissioning directly.
The University of Central Florida (cs.ucf.edu/~bagci/teaching/mic17), Uppsala (it.uu.se/edu/course/homepage/bild1/vt14), and Western University (eng.uwo.ca/biomed/courses/courses_9519) course pages have been retired with no announced successor URLs. Remove the three bullets rather than carry permanently broken links.
gdcm.sourceforge.net/Copyright.html no longer resolves; GDCM development moved to github.com/malaterre/GDCM, where the canonical copyright file is Copyright.txt at the repo root.
The Utilities/ITKv5Preparation directory contained one-shot bash scripts used during the 4 -> 5 migration; the directory was removed once the migration completed. The "Update scripts" section pointed at that now-deleted directory and trailed off mid-sentence; remove it rather than carry a permanently broken link to scripts that no longer exist.
The TIFF row of the supported-formats table had a markdown link with an empty URL ([\`itk::TIFFImageIO\`]()). Repoint to the actual TIFFImageIO doxygen page on docs.itk.org.
HDF Group reorganised the HDF5 license layout: COPYING_LBNL_HDF5 was renamed to LICENSE_LBNL_HDF5 and then consolidated into the single top-level LICENSE file at the repository root. The legacy support.hdfgroup.org/ftp/HDF5/releases/COPYING_LBNL_HDF5 URL no longer resolves. Update the licenses.md note to point at https://github.com/HDFGroup/hdf5/blob/develop/LICENSE, which contains the LBNL Copyright Notice and Licensing Terms verbatim.
The HDF Group retired the support.hdfgroup.org/HDF5/ landing page; the canonical HDF5 home is now www.hdfgroup.org/solutions/hdf5/.
itk.org/CourseWare/Training/RegistrationMethodsOverview.pdf no longer exists. Repoint the "registration overview" sentence in faq.md to the Registration chapter of the ITK Software Guide (Book 2, Chapter 3), which is the canonical successor and is actively maintained.
The Hyperrectangular shape no longer has its own doxygen page on the modern docs.itk.org build; demoting that markdown link to inline code keeps the class name visible without pointing at a 404. The sibling ShapedImageNeighborhoodRange page does still exist and remains linked.
Documentation/GitCheatSheet.pdf was removed from the repository without a successor. Remove the trailing "We also have a Git cheatsheet for quick reference." sentence rather than leave a permanent 404; the surrounding prose still points at the Software Guide and CONTRIBUTING.md as starting points.
Insight Journal moved from the legacy InsightJournalManager/view_reviews.php?...&pubid=N URL scheme to the canonical /browse/publication/N form years ago. All 14 publication URLs flagged by Utilities/Maintenance/check-links.sh in Documentation/docs/releases/3.2.md are mechanically rewritten to the modern path; each was verified to return HTTP 200 individually.
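A mechanical rewrite of this kind could be sketched as follows. The function name `rewrite_ij_url`, the sample URL (its host, the `back=` query parameter, and publication number 42), and the `https://` scheme on the target are assumptions for illustration; only the `pubid=N` to `/browse/publication/N` mapping comes from the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: maps a legacy view_reviews.php?...&pubid=N URL to the
# /browse/publication/N form, carrying the publication number across.
rewrite_ij_url() {
  sed -E 's#https?://[^ )]*/InsightJournalManager/view_reviews\.php\?[^ )]*pubid=([0-9]+)#https://insight-journal.org/browse/publication/\1#g'
}

# Hypothetical legacy URL used purely as test input.
legacy='https://www.insight-journal.org/InsightJournalManager/view_reviews.php?back=publications.php&pubid=42'
modern="$(printf '%s\n' "$legacy" | rewrite_ij_url)"
echo "$modern"
# prints https://insight-journal.org/browse/publication/42
```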
The itk.org wiki was retired and snapshot-archived under
insightsoftwareconsortium.github.io/ITKWikiArchive (gh-pages source at
github.com/InsightSoftwareConsortium/ITKWikiArchive). Four entries
in the 4.0 release notes still pointed at the dead itk.org/Wiki/...
URLs (or used a malformed escape sequence on the archive URL).
Update them to the verified archive paths:
- Modern_C\%2B\%2B (broken backslash escape)
-> Modern_C%252B%252B/ (the directory's own name is double-encoded
in the archive layout).
- itk.org/Wiki/Refactoring_itk::FEM_framework_-_V4
-> ITK_Release_4/Refactoring_FEM_Framework/.
- itk.org/Wiki/Refactoring_Level-Set_framework_-_V4
-> ITK/Release_4/Refactoring_Level_Set_Framework/Refactoring_Level_Set_Framework/.
- itk.org/Wiki/GPU_Acceleration_-_V4
-> ITK_Release_4/GPU_Acceleration/GPU_Acceleration/.
Each replacement was verified to return HTTP 200 individually.
The 1.0 release notes pre-date most of ITK's current infrastructure; the broken URLs flagged by Utilities/Maintenance/check-links.sh have well-defined modern successors:
- public.kitware.com/dashboard.php?name=itk (the legacy Kitware dashboard) and public.kitware.com/Dart (Dart, the predecessor of CDash) -> open.cdash.org/index.php?project=Insight.
- public.kitware.com/Cable (the CABLE C++ wrapping system) -> github.com/CastXML/CastXML. CABLE was succeeded by GCC-XML and then by CastXML, which is the wrapping toolchain ITK uses today.
- www.cmake.org/CMake/HTML/Download.html -> cmake.org/download/.
- www.itk.org/HTML/Download.php -> docs.itk.org/en/latest/download.html.
- www.itk.org/HTML/Examples.htm -> examples.itk.org/.
- www.itk.org/mailman/listinfo/insight-users (the legacy mailing list) -> discourse.itk.org/ (the modern discussion forum that replaced it).
All replacements verified to return HTTP 200 individually.
creatis.insa-lyon.fr/Public/Gdcm/Main.html no longer resolves; GDCM development moved to github.com/malaterre/GDCM (matches the same fix in faq.md).
review.source.kitware.com/p/ITK no longer resolves; the Gerrit instance was decommissioned after the move to GitHub pull requests. Replace the dead link with a parenthetical noting the migration.
The scanco.ch customer-login FAQ page is behind authentication and returns 404 to anonymous fetches; the format-name cell stands on its own without the link. No public Scanco format reference is currently discoverable to substitute.
Both 'changes in style' and 'Coding Style Guide' references in 5.0a01.md pointed at Book1ch13.html#x57-259000C in the ITK Software Guide. The 5.x edition of the SG renumbered the coding-style chapter to Book1ch9, and the per-section anchor IDs (`x57-...`) were regenerated. Drop the chapter-specific anchor and link to the current chapter file (HTTP 200 verified); readers can navigate the chapter TOC for the relevant section.
Kitware retired the kitware.com/blog/home/post/N URL scheme; the
specific posts referenced in the 4.8 release notes were never
captured by the Wayback Machine and are not discoverable via the
modern Kitware site search. Replace each broken "Details:" link
with the canonical ITK successor where one exists, and drop the
link entirely where no successor is available (the surrounding
prose already describes the topic):
- post/888 (CastXML wrapping replaces GCCXML)
-> https://github.com/CastXML/CastXML
- post/912 (Emscripten / JavaScript build)
-> https://wasm.itk.org/ (the canonical itk-wasm successor)
- post/890 (Software Guide HTML edition)
-> https://itk.org/ITKSoftwareGuide/html/
- post/904 (cross-compilation/packaging),
post/887 (Raspberry Pi),
post/893 (Android),
post/883 (MXE/MinGW-w64),
post/891 (POWER8),
post/899 (UpdateThirdPartyFromUpstream.sh / Git subtree)
-> "Details:" link removed; topic line retained.
Same Kitware-blog URL retirement as the 4.8 fixup; canonical
successors used where available, dead link dropped otherwise:
- post/942 (AnisotropicDiffusionLBR web-browser reproducibility)
-> https://wasm.itk.org/ (the itk-wasm in-browser runtime that
evolved out of the original Emscripten experiments).
- post/997 (External Modules outside the ITK source tree)
-> https://docs.itk.org/en/latest/contributing/module_workflows.html
(the canonical module workflows doc).
- post/939 (Option to export all library symbols on Windows)
-> "Details:" link removed; topic line retained.
… URLs
Two hdl.handle.net handles consistently return HTTP 500; the matching
publications were located in the modern insight-journal.org catalog
and verified to return HTTP 200:
- 10380/320 (SplitComponents new-class entry in 4.6 release notes)
-> insight-journal.org/browse/publication/774
("An ITK Class that Splits Multi-Component Images").
- 1926/3596 (SLIC super-pixel filter in 5.0b01 release notes,
referenced twice in the same file)
-> insight-journal.org/browse/publication/989
("Scalable Simple Linear Iterative Clustering (SSLIC) Using a
Generic and Parallel Approach", Lowekamp et al.).
Each replacement was confirmed by matching publication metadata in
the IJ /browse listing.
Both jeffro.net/mind and caddlab.rad.unc.edu/software/MIND have been offline for years and have no Wayback Machine snapshot. Drop the hyperlinks, keep the URLs as inline code so the historical record of where the project was hosted is preserved, and add a one-clause note that the URLs are no longer reachable.
visual.nlm.nih.gov no longer hosts the 2010 ITKv4 kick-off meeting agenda; the Internet Archive captured the page on 2012-03-13. Repoint the link to the Wayback snapshot so the historical information remains reachable.
After all link-rot fixes for ITK's documentation, the periodic check-links.sh run still surfaced ~14 "errors" that were not link rot but transient infrastructure responses from the maintainer's network:
- 504 from eth.limo gateway in front of content-link-upload.itk.eth.limo.
- 522 from Cloudflare in front of opencollective.org.
- TCP-level resets and connection-failed signals from ipfs.io, monai.io, dicom.nema.org.
Extend the accept list to cover 504 and 522, and turn on accept_timeouts so the residual reachability artifacts don't drown the report. Real link rot (404, 5xx other than 504/522, etc.) still surfaces normally.
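The effect of the extended accept list can be pictured with a small shell sketch. lychee applies this logic itself from its configuration; the function below is only an illustration. The name `classify_status` is hypothetical, and treating every 2xx code as "ok" is an assumption, while 429, 999, 504, and 522 are the codes named in the commit messages of this PR.

```shell
#!/usr/bin/env bash
# Codes treated as "not broken" according to the commit messages in this PR.
ACCEPTED_EXTRA="429 999 504 522"

# Classify an HTTP status code as ok, accepted (tolerated), or broken.
classify_status() {
  local code="$1" accepted
  case "$code" in
    2??)
      echo "ok"            # assumption: any 2xx response counts as success
      ;;
    *)
      for accepted in $ACCEPTED_EXTRA; do
        if [ "$code" = "$accepted" ]; then
          echo "accepted"  # rate-limit or gateway noise, not link rot
          return 0
        fi
      done
      echo "broken"        # e.g. 404, or 5xx other than 504/522
      ;;
  esac
}

for code in 200 429 504 522 404 500; do
  printf '%s -> %s\n' "$code" "$(classify_status "$code")"
done
```

The trade-off is the one raised earlier in the conversation: an accepted code keeps the report readable, but a link answering 429 or 504 is effectively left unverified.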
Force-pushed from e9f9781 to 8bee451.
hjmjohnson
left a comment
Thanks @jhlegarreta for initiating this effort.
Merged commit 7d30daa into InsightSoftwareConsortium:main
Add a manual lychee link-checker (Utilities/Maintenance/check-links.sh + lychee.toml) for periodic maintainer-driven runs, instead of the per-PR GHA workflow originally proposed.
Why a manual script instead of a GHA workflow
The original workflow attempt failed CI consistently due to:
In the Feb 2026 discussion on this PR, @jhlegarreta agreed the per-PR CI approach was not workable and authorized a pivot. This commit takes over the branch (with Co-Authored-By: credit) and converts to the script approach.
What's included
- Utilities/Maintenance/check-links.sh: requires lychee on PATH, accepts optional path arguments, writes a Markdown report and persistent cache.
- Utilities/Maintenance/lychee.toml: accept = [..., 429, 999], commit-URL skip regex, ThirdParty exclusion, and document-only include globs.
- .gitignore: ignores .lycheecache and .lychee-report.md (local artifacts).
Intended use: a maintainer runs Utilities/Maintenance/check-links.sh periodically (or scopes it to a subdirectory) and acts on the report. No CI gate is added.