out_s3: Add parquet compression type with pure C#10691
Conversation
WalkthroughAdds Parquet compression support gated by Arrow GLib Parquet detection, extends S3 output to treat any non-NONE compression uniformly, enforces put-object for Arrow/Parquet, updates build scripts and CI to install Arrow/Parquet GLib, and exposes a new compressor API plus option wiring. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User as User/Config
participant S3 as S3 Plugin
participant Comp as Compressor (gzip/arrow/parquet)
participant AWS as Amazon S3
User->>S3: Configure compression=(none|gzip|arrow|parquet), use_put_object
alt compression in {arrow, parquet}
S3->>S3: Verify use_put_object == true
alt use_put_object false
S3-->>User: Error: require put-object for Arrow/Parquet
note right of S3: Abort send
else use_put_object true
S3->>Comp: Compress records
alt parquet
note over Comp: out_s3_compress_parquet\n(Arrow Table -> Parquet buffer)
else arrow
note over Comp: out_s3_compress_arrow\n(Arrow Table -> Feather)
else gzip
note over Comp: gzip buffer
end
Comp-->>S3: Compressed payload, size
S3->>S3: Validate size (multipart 5GB compressed limit)
S3->>AWS: Upload (PUT Object or Multipart)
AWS-->>S3: Response
S3-->>User: Result
end
else compression == none
S3->>S3: No compression
S3->>S3: Validate size (50MB uncompressed limit for multipart)
S3->>AWS: Upload (PUT Object or Multipart)
AWS-->>S3: Response
S3-->>User: Result
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30 minutes Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
85316ab to
0afb495
Compare
243f704 to
4eaed0d
Compare
4eaed0d to
e69bbd2
Compare
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Using LIBRARIES does not point into the right place of the apache arrow-glib and parquet-glib libraries. Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
f5b2741 to
01fe1bb
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
plugins/out_s3/s3.c (2)
706-714: Bug: using ctx->compression before assignment; use ‘ret’ for the checkYou compute ret with flb_aws_compression_get_type(tmp) but validate ctx->compression (still uninitialized). This can bypass the intended restriction or cause undefined behavior.
Apply:
- if (ctx->use_put_object == FLB_FALSE && - (ctx->compression == FLB_AWS_COMPRESS_ARROW || - ctx->compression == FLB_AWS_COMPRESS_PARQUET)) { + if (ctx->use_put_object == FLB_FALSE && + (ret == FLB_AWS_COMPRESS_ARROW || + ret == FLB_AWS_COMPRESS_PARQUET)) { flb_plg_error(ctx->ins, "use_put_object must be enabled when Apache Arrow or Parquet is enabled"); return -1; } ctx->compression = ret;
1131-1136: Fix: unlock chunk on compression failure to avoid stuck locked filesIf compression fails here, you return FLB_RETRY without unlocking the file (construct_request_buffer locked it). That can wedge the file in a permanently locked state.
- if (ret == -1) { - flb_plg_error(ctx->ins, "Failed to compress data"); - return FLB_RETRY; - } else { + if (ret == -1) { + flb_plg_error(ctx->ins, "Failed to compress data"); + if (chunk != NULL) { + s3_store_file_unlock(chunk); + chunk->failures += 1; + } + return FLB_RETRY; + } else {
♻️ Duplicate comments (1)
.github/workflows/unit-tests.yaml (1)
94-103: Streamline Arrow APT setup: canonical repo URL and one apt-get updateTwo small cleanups:
- Use the canonical Apache repo URL (apache.jfrog.io) to reduce future 404s.
- Drop the second apt-get update; running it after adding the apt source is sufficient.
Note: Per our past learning about Arrow’s changing checksums, I’m not recommending pinned checksums here.
Proposed tweak:
- sudo apt-get update - sudo apt-get install -y -V ca-certificates lsb-release wget - wget https://packages.apache.org/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb - sudo apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb - sudo apt-get update + sudo apt-get update + sudo apt-get install -y -V ca-certificates lsb-release wget + wget -q https://apache.jfrog.io/artifactory/arrow/apt/$(lsb_release --codename --short)/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb + sudo apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb + sudo apt-get update sudo apt-get install -y -V libarrow-glib-dev libparquet-glib-dev
🧹 Nitpick comments (8)
src/aws/compression/arrow/compress.c (4)
10-15: Include flb_mem to use project allocatorsFollow-up to allocator change: include the Fluent Bit memory header.
#include <arrow-glib/arrow-glib.h> #ifdef FLB_HAVE_ARROW_PARQUET #include <parquet-glib/parquet-glib.h> #endif -#include <fluent-bit/flb_log.h> +#include <fluent-bit/flb_log.h> +#include <fluent-bit/flb_mem.h>
154-223: Parquet writer row group sizing could explode memory; consider bounded row-group lengthPassing n_rows as max_row_group_length writes a single, potentially huge row group. That can increase peak memory usage and degrade parallel reads. Consider a bounded row-group length (e.g., 64k–1M rows) or exposing a config to control it. If left as-is, please document the implication.
If you want, I can propose a small helper to compute a capped row-group length based on table size.
21-31: Guard g_error_free and widen size type to avoid truncation
- parse_json takes int size but is passed size_t; large inputs risk truncation. Arrow GLib APIs accept gsize; adopt size_t/gsize end-to-end.
- Several g_error_free(error) calls assume error != NULL. Guard to avoid accidental NULL deref with unexpected provider behavior.
Minimal patch:
-static GArrowTable* parse_json(uint8_t *json, int size) +static GArrowTable* parse_json(uint8_t *json, size_t size) { @@ - buffer = garrow_buffer_new(json, size); + buffer = garrow_buffer_new(json, (gsize) size); @@ - reader = garrow_json_reader_new(GARROW_INPUT_STREAM(input), options, &error); + reader = garrow_json_reader_new(GARROW_INPUT_STREAM(input), options, &error); if (reader == NULL) { - g_error_free(error); + if (error != NULL) { g_error_free(error); } g_object_unref(buffer); g_object_unref(input); g_object_unref(options); return NULL; } @@ - table = garrow_json_reader_read(reader, &error); + table = garrow_json_reader_read(reader, &error); if (table == NULL) { - g_error_free(error); + if (error != NULL) { g_error_free(error); } g_object_unref(buffer); g_object_unref(input); g_object_unref(options); g_object_unref(reader); return NULL; }Note: also update the two call sites of parse_json later in this file to pass size_t (they already do).
Also applies to: 48-55, 57-65
189-195: Parquet error paths should guard g_error_free(error)When writer creation/write/close fail, error might still be NULL depending on GLib build/behavior. Guard g_error_free(error) before freeing.
Example:
- flb_error("[aws][compress] Failed to create parquet writer: %s", error->message); - g_error_free(error); + if (error != NULL) { + flb_error("[aws][compress] Failed to create parquet writer: %s", error->message); + g_error_free(error); + } else { + flb_error("[aws][compress] Failed to create parquet writer"); + }Repeat similarly for write_table/close failures.
Also applies to: 201-208, 211-218
plugins/out_s3/s3.c (2)
1173-1176: Log message wording nit: “Pre-compression chunk size” instead of “upload_chunk_size”Minor clarity: this log refers to the current buffer size, not the configuration.
- flb_plg_info(ctx->ins, "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, " + flb_plg_info(ctx->ins, "Pre-compression chunk size= %zu, After compression, chunk is only %zu bytes, " "the chunk was too small, using PutObject to upload", preCompress_size, body_size);
189-199: Optional: default Content-Type for Arrow/Parquet when user doesn’t set oneIf ctx->content_type is NULL and compression is Arrow or Parquet, consider setting:
- Arrow: application/vnd.apache.arrow.stream or application/vnd.apache.arrow.file (depending on format you emit)
- Parquet: application/vnd.apache.parquet
Helps downstream consumers. Keep gzip behavior unchanged.
If you want, I can send a patch that sets these defaults only when content_type is unset.
packaging/distros/debian/Dockerfile (1)
150-159: Install Arrow libs only when FLB_ARROW=On (and use canonical repo URL)In bookworm stages, Arrow libs are always installed. Gating by FLB_ARROW keeps images lean and avoids unnecessary network.
Minimal change (move ARG above RUN and conditionalize):
-FROM debian:bookworm-slim AS debian-bookworm-base +FROM debian:bookworm-slim AS debian-bookworm-base @@ -# hadolint ignore=DL3008,DL3015 +ARG FLB_ARROW=On +# hadolint ignore=DL3008,DL3015 RUN apt-get -qq update && \ @@ - curl -jksSL "${cmake_download_url}" | tar -xzf - -C "${CMAKE_HOME}" --strip-components 1 && \ - wget https://packages.apache.org/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \ - apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \ - apt-get -qq update && \ - apt-get install -y libarrow-glib-dev libparquet-glib-dev + curl -jksSL "${cmake_download_url}" | tar -xzf - -C "${CMAKE_HOME}" --strip-components 1 && \ + if [ "$FLB_ARROW" = "On" ]; then \ + wget -q https://apache.jfrog.io/artifactory/arrow/apt/$(lsb_release --codename --short)/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \ + apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \ + apt-get -qq update && \ + apt-get install -y libarrow-glib-dev libparquet-glib-dev ; \ + fi @@ -ARG FLB_ARROW=On -ENV FLB_ARROW=$FLB_ARROW +ENV FLB_ARROW=$FLB_ARROWApply the same pattern to the arm64 bookworm stage below.
packaging/distros/amazonlinux/Dockerfile (1)
82-84: Arrow deps installation on AL2023 is fine; consider using dnf consistently.AL2023’s yum is a wrapper for dnf, but using dnf throughout this stage improves clarity and avoids subtle behavior diffs.
-RUN yum -y update && \ +RUN dnf -y update && \ yum install -y rpm-build curl-minimal ca-certificates gcc gcc-c++ make bash \ ... - dnf install -y https://packages.apache.org/artifactory/arrow/amazon-linux/$(cut -d: -f6 /etc/system-release-cpe)/apache-arrow-release-latest.rpm && \ + dnf install -y https://packages.apache.org/artifactory/arrow/amazon-linux/$(cut -d: -f6 /etc/system-release-cpe)/apache-arrow-release-latest.rpm && \ dnf install -y arrow-glib-devel parquet-glib-devel && \ - yum clean all && \ + dnf clean all && \Also applies to: 113-115
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (14)
.github/workflows/unit-tests.yaml(3 hunks)CMakeLists.txt(1 hunks)include/fluent-bit/aws/flb_aws_compress.h(1 hunks)packaging/distros/almalinux/Dockerfile(6 hunks)packaging/distros/amazonlinux/Dockerfile(7 hunks)packaging/distros/centos/Dockerfile(10 hunks)packaging/distros/debian/Dockerfile(7 hunks)packaging/distros/rockylinux/Dockerfile(6 hunks)packaging/distros/ubuntu/Dockerfile(10 hunks)plugins/out_s3/s3.c(9 hunks)src/aws/compression/arrow/CMakeLists.txt(1 hunks)src/aws/compression/arrow/compress.c(2 hunks)src/aws/compression/arrow/compress.h(1 hunks)src/aws/flb_aws_compress.c(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/aws/compression/arrow/CMakeLists.txt
- include/fluent-bit/aws/flb_aws_compress.h
- src/aws/compression/arrow/compress.h
- CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
PR: fluent/fluent-bit#10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.
Applied to files:
packaging/distros/centos/Dockerfile.github/workflows/unit-tests.yaml
🧬 Code graph analysis (1)
src/aws/flb_aws_compress.c (1)
src/aws/compression/arrow/compress.c (1)
out_s3_compress_parquet(226-274)
🪛 Checkov (3.2.334)
packaging/distros/centos/Dockerfile
[HIGH] 20-34: Ensure that certificate validation isn't disabled with curl
(CKV2_DOCKER_2)
[HIGH] 57-71: Ensure that certificate validation isn't disabled with curl
(CKV2_DOCKER_2)
[HIGH] 223-243: Ensure that certificate validation isn't disabled with curl
(CKV2_DOCKER_2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-centos-7
- GitHub Check: PR - fuzzing test
🔇 Additional comments (12)
src/aws/flb_aws_compress.c (1)
58-64: Parquet compression option wiring looks correctEntry is properly guarded by FLB_HAVE_ARROW_PARQUET and mirrors the Arrow entry. No concerns.
plugins/out_s3/s3.c (2)
735-745: Multipart size checks generalized for all non-NONE compressions look goodApplying the 5GB per-part cap for compressed multipart uploads vs. 50MB when uncompressed is consistent and conservative.
3956-4000: Config help text updated to include ‘parquet’ — LGTMClear and matches build-time guards. No issues.
.github/workflows/unit-tests.yaml (1)
55-55: Matrix setup to exercise -DFLB_ARROW=On looks solidCoverage for Arrow builds, and excluding clang for Arrow in this matrix, is clear and maintainable.
Also applies to: 70-74
packaging/distros/rockylinux/Dockerfile (1)
145-162: FLB_ARROW flag wiring looks good.ARG + CMake propagation is consistent with other distros. No issues here.
packaging/distros/amazonlinux/Dockerfile (1)
35-37: ARG/ENV and CMake propagation for FLB_ARROW are consistent.Good exposure of the flag across base stages and the builder; matches project-wide pattern.
Also applies to: 66-67, 93-95, 125-126, 167-168
packaging/distros/almalinux/Dockerfile (1)
146-163: CMake: FLB_ARROW wiring is correct.Flag is introduced and forwarded consistently. Looks good.
packaging/distros/ubuntu/Dockerfile (2)
143-145: FLB_ARROW exposure and CMake propagation are consistent across Ubuntu variants.Defaults Off for 16.04/18.04 and On for 20.04+ are reasonable. No further action.
Also applies to: 175-177, 205-207, 237-239, 267-269, 299-301
137-141: APT Arrow repo enablement is correct for Ubuntu ≥ 20.04.Using the official apache-arrow apt-source .deb, then installing libarrow-glib-dev/libparquet-glib-dev, matches upstream guidance. Looks good.
Also applies to: 169-173, 199-203, 231-235, 261-265, 293-297
packaging/distros/centos/Dockerfile (3)
207-211: Flag wiring (FLB_UNICODE_ENCODER/FLB_ARROW) and CMake propagation look good.Consistent with the rest of the PR; no issues.
Also applies to: 253-255, 300-301
181-201: Minor: consistent plugin install before dnf config-manager across both CentOS 9 stages.centos-9-base installs
'dnf-command(config-manager)'; the arm64v8 stage mirrors the same pattern—good. Keep this symmetry if future edits touch these blocks.Also applies to: 223-243
192-201: CentOS Stream 9: please manually verify Arrow “almalinux” repo resolution on both archesThis change relies on the AlmaLinux Arrow RPM landing cleanly on CentOS Stream 9. Because automated Docker tests aren’t available here, please confirm on real or VM hosts for both x86_64 and aarch64 that metadata and packages resolve without errors. For example:
# On a CentOS Stream 9 (x86_64 or aarch64) host/VM: sudo dnf -y install 'dnf-command(config-manager)' epel-release sudo dnf -y install https://packages.apache.org/artifactory/arrow/almalinux/$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)/apache-arrow-release-latest.rpm sudo dnf config-manager --set-enabled epel powertools crb ol$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)_codeready_builder sudo dnf config-manager --set-enabled codeready-builder-for-rhel-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)-rhui-rpms sudo subscription-manager repos --enable codeready-builder-for-rhel-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)-$(arch)-rpms sudo dnf -y install arrow-glib-devel parquet-glib-develAlso verify the same steps apply cleanly to the snippet at lines 234–243.
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/out_s3/s3.c (1)
706-714: Critical: validation uses ctx->compression before it’s assigned (guard never triggers).You compute
ret = flb_aws_compression_get_type(tmp)but then checkctx->compression(still holding its old/default value) instead ofret. This allows Arrow/Parquet withuse_put_object == FLB_FALSEto slip past initialization. Compare againstret(or assign first, then validate).Apply this minimal fix:
ret = flb_aws_compression_get_type(tmp); if (ret == -1) { flb_plg_error(ctx->ins, "unknown compression: %s", tmp); return -1; } - if (ctx->use_put_object == FLB_FALSE && - (ctx->compression == FLB_AWS_COMPRESS_ARROW || - ctx->compression == FLB_AWS_COMPRESS_PARQUET)) { + if (ctx->use_put_object == FLB_FALSE && + (ret == FLB_AWS_COMPRESS_ARROW || + ret == FLB_AWS_COMPRESS_PARQUET)) { flb_plg_error(ctx->ins, - "use_put_object must be enabled when Apache Arrow or Parquet is enabled"); + "use_put_object must be enabled when Apache Arrow or Parquet compression is selected"); return -1; } ctx->compression = ret;
♻️ Duplicate comments (1)
plugins/out_s3/s3.c (1)
1173-1177: Fix misleading log message (it’s reporting chunk size, not upload_chunk_size).The message says “Pre-compression upload_chunk_size” but prints the pre-compression chunk size. Suggest clarifying wording.
This also reflects the earlier feedback about the Arrow/Parquet fallback branch; it’s now correctly limited to the gzip-only case.
Apply this tweak:
- flb_plg_info(ctx->ins, "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, " - "the chunk was too small, using PutObject to upload", preCompress_size, body_size); + flb_plg_info(ctx->ins, + "Pre-compression chunk size was %zu bytes; after compression it's %zu bytes; " + "chunk too small, using PutObject to upload", + preCompress_size, body_size);
🧹 Nitpick comments (2)
plugins/out_s3/s3.c (2)
1195-1197: De-duplicate compressed-buffer frees with a single cleanup path.You correctly free
payload_bufacross all exit paths, but the repetition makes future edits error-prone. Consider a singlecleanup:block (or a small inline helper) to free the compressed buffer when set.Example pattern:
int ret = FLB_OK; void *payload_buf = NULL; bool compressed = false; /* ... set compressed=true and payload_buf on success ... */ put_object: ret = s3_put_object(...); cleanup: if (compressed && payload_buf != NULL) { flb_free(payload_buf); } return ret;This keeps ownership clear and reduces maintenance overhead.
Also applies to: 1222-1224, 1236-1238, 1246-1248, 1263-1265
3996-3999: Clarify option docs: Arrow/Parquet require PutObject.Since initialization enforces PutObject for Arrow/Parquet, reflect that in the
compressionoption description to prevent config surprises.- "Compression type for S3 objects. 'gzip', 'arrow' and 'parquet' are the supported values. " - "'arrow' and 'parquet' are only available if Apache Arrow was enabled at compile time. " + "Compression type for S3 objects. Supported values: 'gzip', 'arrow', 'parquet'. " + "'arrow' and 'parquet' are available only if Apache Arrow was enabled at compile time, " + "and require 'use_put_object' to be enabled. " "Defaults to no compression. " "If 'gzip' is selected, the Content-Encoding HTTP Header will be set to 'gzip'."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
CMakeLists.txt(1 hunks)plugins/out_s3/s3.c(9 hunks)src/aws/compression/arrow/CMakeLists.txt(1 hunks)src/aws/compression/arrow/compress.c(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/compression/arrow/compress.c
- CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-centos-7
- GitHub Check: PR - fuzzing test
🔇 Additional comments (3)
plugins/out_s3/s3.c (3)
735-745: Multipart size caps: compressed vs. uncompressed limit — LGTM.Good call enforcing 5 GB per-part cap for compressed payloads and retaining 50 MB for uncompressed. This aligns the code with S3 multipart constraints and avoids accidental oversizing.
1130-1141: Unified compression preprocessing for all non-NONE types — LGTM.Centralizing gzip/arrow/parquet into a single compression pathway simplifies control flow and makes memory handling consistent. The early return on compression failure with FLB_RETRY is appropriate.
1352-1364: Graceful fallback to uncompressed on compression failure — LGTM.When compression fails in put_all_chunks(), you log and proceed with the original buffer to prevent data loss. Memory ownership transitions (
buffer->payload_buf) and frees look correct.
|
Question, this doesn't seem to support specify codec or row size and data page size. Will that be added in the future? |
With apache arrow glib parquet library, we're able to support parquet format on out_s3.
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
With leaks on macOS, there's no leaks:
With valgrind:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
New Features
Improvements
Documentation