Skip to content

out_s3: Add parquet compression type with pure C#10691

Merged
edsiper merged 12 commits into
masterfrom
cosmo0920-implement-parquet-comression-with-apache-arrow-glib-parquet
Aug 28, 2025
Merged

out_s3: Add parquet compression type with pure C#10691
edsiper merged 12 commits into
masterfrom
cosmo0920-implement-parquet-comression-with-apache-arrow-glib-parquet

Conversation

@cosmo0920

@cosmo0920 cosmo0920 commented Aug 5, 2025

Copy link
Copy Markdown
Contributor

With apache arrow glib parquet library, we're able to support parquet format on out_s3.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    trace
    HTTP_Server  Off
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name dummy
    Tag  dummy.local
    dummy {"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}

[OUTPUT]
    Name  s3
    Match dummy*
    Region us-east-2
    bucket fbit-parquet-s3
    Use_Put_object true
    compression parquet
    # No need to specify schema
  • Debug log output from testing the change
Fluent Bit v4.1.0
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/08/05 17:27:09] [ info] Configuration:
[2025/08/05 17:27:09] [ info]  flush time     | 5.000000 seconds
[2025/08/05 17:27:09] [ info]  grace          | 5 seconds
[2025/08/05 17:27:09] [ info]  daemon         | 0
[2025/08/05 17:27:09] [ info] ___________
[2025/08/05 17:27:09] [ info]  inputs:
[2025/08/05 17:27:09] [ info]      dummy
[2025/08/05 17:27:09] [ info] ___________
[2025/08/05 17:27:09] [ info]  filters:
[2025/08/05 17:27:09] [ info] ___________
[2025/08/05 17:27:09] [ info]  outputs:
[2025/08/05 17:27:09] [ info]      s3.0
[2025/08/05 17:27:09] [ info] ___________
[2025/08/05 17:27:09] [ info]  collectors:
[2025/08/05 17:27:09] [ info] [fluent bit] version=4.1.0, commit=0afb495f86, pid=81424
[2025/08/05 17:27:09] [debug] [engine] coroutine stack size: 36864 bytes (36.0K)
[2025/08/05 17:27:09] [ info] [storage] ver=1.5.3, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/08/05 17:27:09] [ info] [simd    ] NEON
[2025/08/05 17:27:09] [ info] [cmetrics] version=1.0.5
[2025/08/05 17:27:09] [ info] [ctraces ] version=0.6.6
[2025/08/05 17:27:09] [ info] [input:dummy:dummy.0] initializing
[2025/08/05 17:27:09] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2025/08/05 17:27:09] [debug] [dummy:dummy.0] created event channels: read=26 write=27
[2025/08/05 17:27:09] [debug] [s3:s3.0] created event channels: read=28 write=29
[2025/08/05 17:27:09] [debug] [tls] attempting to load certificates from system keychain of macOS
# <snip of loading certificates logs>
[2025/08/05 17:27:09] [debug] [tls] finished loading keychain certificates, total loaded: 153
[2025/08/05 17:27:09] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2025/08/05 17:27:09] [debug] [aws_credentials] creating profile (null) provider
[2025/08/05 17:27:09] [debug] [aws_credentials] Initialized AWS Profile Provider in standard chain
[2025/08/05 17:27:09] [debug] [aws_credentials] Not initializing EKS provider because AWS_ROLE_ARN was not set
[2025/08/05 17:27:09] [debug] [aws_credentials] Not initializing ECS/EKS HTTP Provider because AWS_CONTAINER_CREDENTIALS_RELATIVE_URI and AWS_CONTAINER_CREDENTIALS_FULL_URI is not set
[2025/08/05 17:27:09] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2025/08/05 17:27:09] [debug] [aws_credentials] Sync called on the EC2 provider
[2025/08/05 17:27:09] [debug] [aws_credentials] Init called on the env provider
[2025/08/05 17:27:09] [ info] [output:s3:s3.0] Sending locally buffered data from previous executions to S3; buffer=/tmp/fluent-bit/s3/fbit-parquet-s3
[2025/08/05 17:27:09] [ info] [output:s3:s3.0] Pre-compression chunk size is 2394, After compression, chunk is 2384 bytes
[2025/08/05 17:27:10] [debug] [upstream] KA connection #34 to s3.us-east-2.amazonaws.com:443 is connected
[2025/08/05 17:27:10] [debug] [http_client] not using http_proxy for header
[2025/08/05 17:27:10] [debug] [aws_credentials] Requesting credentials from the env provider..
[2025/08/05 17:27:10] [debug] [upstream] KA connection #34 to s3.us-east-2.amazonaws.com:443 is now available
[2025/08/05 17:27:10] [debug] [output:s3:s3.0] PutObject http status=200
[2025/08/05 17:27:10] [ info] [output:s3:s3.0] Successfully uploaded object /fluent-bit-logs/dummy.local/2025/08/05/08/27/09-objectt33OX2sM
[2025/08/05 17:27:10] [debug] [aws_credentials] upstream_set called on the EC2 provider
[2025/08/05 17:27:10] [ info] [output:s3:s3.0] worker #0 started
[2025/08/05 17:27:10] [ info] [sp] stream processor started
[2025/08/05 17:27:10] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2025/08/05 17:27:15] [debug] [task] created task=0x1137040a0 id=0 OK
[2025/08/05 17:27:15] [debug] [output:s3:s3.0] task_id=0 assigned to thread #0
[2025/08/05 17:27:15] [debug] [output:s3:s3.0] Creating upload timer with frequency 60s
[2025/08/05 17:27:15] [debug] [out flush] cb_destroy coro_id=0
[2025/08/05 17:27:15] [debug] [task] destroy task=0x1137040a0 (task_id=0)
[2025/08/05 17:27:18] [engine] caught signal (SIGTERM)
[2025/08/05 17:27:18] [ info] [input] pausing dummy.0
[2025/08/05 17:27:18] [ info] [output:s3:s3.0] thread worker #0 stopping...
[2025/08/05 17:27:18] [ info] [output:s3:s3.0] thread worker #0 stopped
[2025/08/05 17:27:18] [ info] [output:s3:s3.0] Sending all locally buffered data to S3
[2025/08/05 17:27:18] [ info] [output:s3:s3.0] Pre-compression chunk size is 504, After compression, chunk is 1904 bytes
[2025/08/05 17:27:18] [debug] [output:s3:s3.0] PutObject http status=200
[2025/08/05 17:27:18] [ info] [output:s3:s3.0] Successfully uploaded object /fluent-bit-logs/dummy.local/2025/08/05/08/27/15-objectkoBRRSAl
  • Attached Valgrind output that shows no leaks or memory corruption was found

With leaks on macOS, there's no leaks:

Process:         fluent-bit [81424]
Path:            /Users/USER/*/fluent-bit
Load Address:    0x100f58000
Identifier:      fluent-bit
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [81423]
Target Type:     live task

Date/Time:       2025-08-05 17:27:19.085 +0900
Launch Time:     2025-08-05 17:27:09.515 +0900
OS Version:      macOS 15.5 (24F74)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         15.6M
Physical footprint (peak):  18.8M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 81424: 3008 nodes malloced for 428 KB
Process 81424: 0 leaks for 0 total leaked bytes.

[2025/08/05 17:27:19] [engine] caught signal (SIGCONT)
[2025/08/05 17:27:19] Fluent Bit Dump

With valgrind:

==361658== LEAK SUMMARY:
==361658==    definitely lost: 0 bytes in 0 blocks
==361658==    indirectly lost: 0 bytes in 0 blocks
==361658==      possibly lost: 0 bytes in 0 blocks
==361658==    still reachable: 68,984 bytes in 677 blocks
==361658==         suppressed: 0 bytes in 0 blocks
==361658== Reachable blocks (those to which a pointer was found) are not shown.

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Added Parquet as a compression option for the S3 output (now supports gzip, arrow, parquet). Requires Apache Arrow support at build time.
  • Improvements

    • Enforced requirement to enable use_put_object when using Arrow or Parquet compression.
    • Clarified upload size handling: compressed uploads (any non-NONE compression) follow the 5GB multipart limit; uncompressed (NONE) follows the 50MB single-part limit.
  • Documentation

    • Updated configuration descriptions to include Parquet and build requirements.

@coderabbitai

coderabbitai Bot commented Aug 5, 2025

Copy link
Copy Markdown

Walkthrough

Adds Parquet compression support gated by Arrow GLib Parquet detection, extends S3 output to treat any non-NONE compression uniformly, enforces put-object for Arrow/Parquet, updates build scripts and CI to install Arrow/Parquet GLib, and exposes a new compressor API plus option wiring.

Changes

Cohort / File(s) Summary
Build system: Arrow/Parquet detection
CMakeLists.txt, src/aws/compression/arrow/CMakeLists.txt
Adds pkg-config check for parquet-glib; defines FLB_HAVE_ARROW_PARQUET when FLB_ARROW and ARROW_GLIB_PARQUET_FOUND; links Parquet GLib includes/libs to flb-aws-arrow; optionally links jemalloc.
Public API: compression constants
include/fluent-bit/aws/flb_aws_compress.h
Introduces FLB_AWS_COMPRESS_PARQUET 3; aligns constant formatting.
Compressor implementation (Arrow/Parquet)
src/aws/compression/arrow/compress.c, src/aws/compression/arrow/compress.h
Adds Parquet compression: table_to_parquet_buffer helper and out_s3_compress_parquet API; declaration guarded by FLB_HAVE_ARROW_PARQUET.
Compression option registry
src/aws/flb_aws_compress.c
Registers "parquet" option with FLB_AWS_COMPRESS_PARQUET and out_s3_compress_parquet under FLB_HAVE_ARROW_PARQUET.
S3 output plugin behavior
plugins/out_s3/s3.c
Supports parquet in config; requires use_put_object for Arrow/Parquet; generalizes compression handling for any non-NONE type; enforces 5GB compressed multipart limit for non-NONE and 50MB uncompressed limit for NONE; adjusts buffer free paths; gzip still sets Content-Encoding.
CI: Arrow/Parquet dependencies
.github/workflows/unit-tests.yaml
Adds -DFLB_ARROW=On to matrix; installs libarrow-glib-dev and libparquet-glib-dev when FLB_ARROW is on; adds clang row testing Arrow path.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as User/Config
  participant S3 as S3 Plugin
  participant Comp as Compressor (gzip/arrow/parquet)
  participant AWS as Amazon S3

  User->>S3: Configure compression=(none|gzip|arrow|parquet), use_put_object
  alt compression in {arrow, parquet}
    S3->>S3: Verify use_put_object == true
    alt use_put_object false
      S3-->>User: Error: require put-object for Arrow/Parquet
      note right of S3: Abort send
    else use_put_object true
      S3->>Comp: Compress records
      alt parquet
        note over Comp: out_s3_compress_parquet\n(Arrow Table -> Parquet buffer)
      else arrow
        note over Comp: out_s3_compress_arrow\n(Arrow Table -> Feather)
      else gzip
        note over Comp: gzip buffer
      end
      Comp-->>S3: Compressed payload, size
      S3->>S3: Validate size (multipart 5GB compressed limit)
      S3->>AWS: Upload (PUT Object or Multipart)
      AWS-->>S3: Response
      S3-->>User: Result
    end
  else compression == none
    S3->>S3: No compression
    S3->>S3: Validate size (50MB uncompressed limit for multipart)
    S3->>AWS: Upload (PUT Object or Multipart)
    AWS-->>S3: Response
    S3-->>User: Result
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

I nibbled bytes and hopped through trees,
Found Parquet paths on Arrow’s breeze.
With PUT I stash my tidy pack,
Compress, confess—no turning back.
From CI burrows, green lights glow—
To S3’s clouds, my payloads go. 🐇📦✨

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cosmo0920-implement-parquet-comression-with-apache-arrow-glib-parquet

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-implement-parquet-comression-with-apache-arrow-glib-parquet branch from 85316ab to 0afb495 Compare August 5, 2025 07:55
@cosmo0920 cosmo0920 force-pushed the cosmo0920-implement-parquet-comression-with-apache-arrow-glib-parquet branch from 243f704 to 4eaed0d Compare August 5, 2025 08:16
@cosmo0920 cosmo0920 force-pushed the cosmo0920-implement-parquet-comression-with-apache-arrow-glib-parquet branch from 4eaed0d to e69bbd2 Compare August 5, 2025 08:21
@cosmo0920 cosmo0920 marked this pull request as ready for review August 5, 2025 09:28
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Using LIBRARIES does not point into the right place of the apache
arrow-glib and parquet-glib libraries.

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@cosmo0920 cosmo0920 force-pushed the cosmo0920-implement-parquet-comression-with-apache-arrow-glib-parquet branch from f5b2741 to 01fe1bb Compare August 25, 2025 08:55

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
plugins/out_s3/s3.c (2)

706-714: Bug: using ctx->compression before assignment; use ‘ret’ for the check

You compute ret with flb_aws_compression_get_type(tmp) but validate ctx->compression (still uninitialized). This can bypass the intended restriction or cause undefined behavior.

Apply:

-        if (ctx->use_put_object == FLB_FALSE &&
-            (ctx->compression == FLB_AWS_COMPRESS_ARROW ||
-             ctx->compression == FLB_AWS_COMPRESS_PARQUET)) {
+        if (ctx->use_put_object == FLB_FALSE &&
+            (ret == FLB_AWS_COMPRESS_ARROW ||
+             ret == FLB_AWS_COMPRESS_PARQUET)) {
             flb_plg_error(ctx->ins,
                           "use_put_object must be enabled when Apache Arrow or Parquet is enabled");
             return -1;
         }
         ctx->compression = ret;

1131-1136: Fix: unlock chunk on compression failure to avoid stuck locked files

If compression fails here, you return FLB_RETRY without unlocking the file (construct_request_buffer locked it). That can wedge the file in a permanently locked state.

-        if (ret == -1) {
-            flb_plg_error(ctx->ins, "Failed to compress data");
-            return FLB_RETRY;
-        } else {
+        if (ret == -1) {
+            flb_plg_error(ctx->ins, "Failed to compress data");
+            if (chunk != NULL) {
+                s3_store_file_unlock(chunk);
+                chunk->failures += 1;
+            }
+            return FLB_RETRY;
+        } else {
♻️ Duplicate comments (1)
.github/workflows/unit-tests.yaml (1)

94-103: Streamline Arrow APT setup: canonical repo URL and one apt-get update

Two small cleanups:

  • Use the canonical Apache repo URL (apache.jfrog.io) to reduce future 404s.
  • Drop the second apt-get update; running it after adding the apt source is sufficient.

Note: Per our past learning about Arrow’s changing checksums, I’m not recommending pinned checksums here.

Proposed tweak:

-          sudo apt-get update
-          sudo apt-get install -y -V ca-certificates lsb-release wget
-          wget https://packages.apache.org/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
-          sudo apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
-          sudo apt-get update
+          sudo apt-get update
+          sudo apt-get install -y -V ca-certificates lsb-release wget
+          wget -q https://apache.jfrog.io/artifactory/arrow/apt/$(lsb_release --codename --short)/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+          sudo apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+          sudo apt-get update
           sudo apt-get install -y -V libarrow-glib-dev libparquet-glib-dev
🧹 Nitpick comments (8)
src/aws/compression/arrow/compress.c (4)

10-15: Include flb_mem to use project allocators

Follow-up to allocator change: include the Fluent Bit memory header.

 #include <arrow-glib/arrow-glib.h>
 #ifdef FLB_HAVE_ARROW_PARQUET
 #include <parquet-glib/parquet-glib.h>
 #endif
-#include <fluent-bit/flb_log.h>
+#include <fluent-bit/flb_log.h>
+#include <fluent-bit/flb_mem.h>

154-223: Parquet writer row group sizing could explode memory; consider bounded row-group length

Passing n_rows as max_row_group_length writes a single, potentially huge row group. That can increase peak memory usage and degrade parallel reads. Consider a bounded row-group length (e.g., 64k–1M rows) or exposing a config to control it. If left as-is, please document the implication.

If you want, I can propose a small helper to compute a capped row-group length based on table size.


21-31: Guard g_error_free and widen size type to avoid truncation

  • parse_json takes int size but is passed size_t; large inputs risk truncation. Arrow GLib APIs accept gsize; adopt size_t/gsize end-to-end.
  • Several g_error_free(error) calls assume error != NULL. Guard to avoid accidental NULL deref with unexpected provider behavior.

Minimal patch:

-static GArrowTable* parse_json(uint8_t *json, int size)
+static GArrowTable* parse_json(uint8_t *json, size_t size)
 {
@@
-        buffer = garrow_buffer_new(json, size);
+        buffer = garrow_buffer_new(json, (gsize) size);
@@
-        reader = garrow_json_reader_new(GARROW_INPUT_STREAM(input), options, &error);
+        reader = garrow_json_reader_new(GARROW_INPUT_STREAM(input), options, &error);
         if (reader == NULL) {
-            g_error_free(error);
+            if (error != NULL) { g_error_free(error); }
             g_object_unref(buffer);
             g_object_unref(input);
             g_object_unref(options);
             return NULL;
         }
@@
-        table = garrow_json_reader_read(reader, &error);
+        table = garrow_json_reader_read(reader, &error);
         if (table == NULL) {
-            g_error_free(error);
+            if (error != NULL) { g_error_free(error); }
             g_object_unref(buffer);
             g_object_unref(input);
             g_object_unref(options);
             g_object_unref(reader);
             return NULL;
         }

Note: also update the two call sites of parse_json later in this file to pass size_t (they already do).

Also applies to: 48-55, 57-65


189-195: Parquet error paths should guard g_error_free(error)

When writer creation/write/close fail, error might still be NULL depending on GLib build/behavior. Guard g_error_free(error) before freeing.

Example:

-            flb_error("[aws][compress] Failed to create parquet writer: %s", error->message);
-            g_error_free(error);
+            if (error != NULL) {
+                flb_error("[aws][compress] Failed to create parquet writer: %s", error->message);
+                g_error_free(error);
+            } else {
+                flb_error("[aws][compress] Failed to create parquet writer");
+            }

Repeat similarly for write_table/close failures.

Also applies to: 201-208, 211-218

plugins/out_s3/s3.c (2)

1173-1176: Log message wording nit: “Pre-compression chunk size” instead of “upload_chunk_size”

Minor clarity: this log refers to the current buffer size, not the configuration.

-                flb_plg_info(ctx->ins, "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, "
+                flb_plg_info(ctx->ins, "Pre-compression chunk size= %zu, After compression, chunk is only %zu bytes, "
                                        "the chunk was too small, using PutObject to upload", preCompress_size, body_size);

189-199: Optional: default Content-Type for Arrow/Parquet when user doesn’t set one

If ctx->content_type is NULL and compression is Arrow or Parquet, consider setting:

  • Arrow: application/vnd.apache.arrow.stream or application/vnd.apache.arrow.file (depending on format you emit)
  • Parquet: application/vnd.apache.parquet

Helps downstream consumers. Keep gzip behavior unchanged.

If you want, I can send a patch that sets these defaults only when content_type is unset.

packaging/distros/debian/Dockerfile (1)

150-159: Install Arrow libs only when FLB_ARROW=On (and use canonical repo URL)

In bookworm stages, Arrow libs are always installed. Gating by FLB_ARROW keeps images lean and avoids unnecessary network.

Minimal change (move ARG above RUN and conditionalize):

-FROM debian:bookworm-slim AS debian-bookworm-base
+FROM debian:bookworm-slim AS debian-bookworm-base
@@
-# hadolint ignore=DL3008,DL3015
+ARG FLB_ARROW=On
+# hadolint ignore=DL3008,DL3015
 RUN apt-get -qq update && \
@@
-    curl -jksSL "${cmake_download_url}" | tar -xzf - -C "${CMAKE_HOME}" --strip-components 1 && \
-    wget https://packages.apache.org/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \
-    apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \
-    apt-get -qq update && \
-    apt-get install -y libarrow-glib-dev libparquet-glib-dev
+    curl -jksSL "${cmake_download_url}" | tar -xzf - -C "${CMAKE_HOME}" --strip-components 1 && \
+    if [ "$FLB_ARROW" = "On" ]; then \
+      wget -q https://apache.jfrog.io/artifactory/arrow/apt/$(lsb_release --codename --short)/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \
+      apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb && \
+      apt-get -qq update && \
+      apt-get install -y libarrow-glib-dev libparquet-glib-dev ; \
+    fi
@@
-ARG FLB_ARROW=On
-ENV FLB_ARROW=$FLB_ARROW
+ENV FLB_ARROW=$FLB_ARROW

Apply the same pattern to the arm64 bookworm stage below.

packaging/distros/amazonlinux/Dockerfile (1)

82-84: Arrow deps installation on AL2023 is fine; consider using dnf consistently.

AL2023’s yum is a wrapper for dnf, but using dnf throughout this stage improves clarity and avoids subtle behavior diffs.

-RUN yum -y update && \
+RUN dnf -y update && \
     yum install -y rpm-build curl-minimal ca-certificates gcc gcc-c++ make bash \
 ...
-    dnf install -y https://packages.apache.org/artifactory/arrow/amazon-linux/$(cut -d: -f6 /etc/system-release-cpe)/apache-arrow-release-latest.rpm && \
+    dnf install -y https://packages.apache.org/artifactory/arrow/amazon-linux/$(cut -d: -f6 /etc/system-release-cpe)/apache-arrow-release-latest.rpm && \
     dnf install -y arrow-glib-devel parquet-glib-devel && \
-    yum clean all && \
+    dnf clean all && \

Also applies to: 113-115

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f5b2741 and 01fe1bb.

📒 Files selected for processing (14)
  • .github/workflows/unit-tests.yaml (3 hunks)
  • CMakeLists.txt (1 hunks)
  • include/fluent-bit/aws/flb_aws_compress.h (1 hunks)
  • packaging/distros/almalinux/Dockerfile (6 hunks)
  • packaging/distros/amazonlinux/Dockerfile (7 hunks)
  • packaging/distros/centos/Dockerfile (10 hunks)
  • packaging/distros/debian/Dockerfile (7 hunks)
  • packaging/distros/rockylinux/Dockerfile (6 hunks)
  • packaging/distros/ubuntu/Dockerfile (10 hunks)
  • plugins/out_s3/s3.c (9 hunks)
  • src/aws/compression/arrow/CMakeLists.txt (1 hunks)
  • src/aws/compression/arrow/compress.c (2 hunks)
  • src/aws/compression/arrow/compress.h (1 hunks)
  • src/aws/flb_aws_compress.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/aws/compression/arrow/CMakeLists.txt
  • include/fluent-bit/aws/flb_aws_compress.h
  • src/aws/compression/arrow/compress.h
  • CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
PR: fluent/fluent-bit#10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.

Applied to files:

  • packaging/distros/centos/Dockerfile
  • .github/workflows/unit-tests.yaml
🧬 Code graph analysis (1)
src/aws/flb_aws_compress.c (1)
src/aws/compression/arrow/compress.c (1)
  • out_s3_compress_parquet (226-274)
🪛 Checkov (3.2.334)
packaging/distros/centos/Dockerfile

[HIGH] 20-34: Ensure that certificate validation isn't disabled with curl

(CKV2_DOCKER_2)


[HIGH] 57-71: Ensure that certificate validation isn't disabled with curl

(CKV2_DOCKER_2)


[HIGH] 223-243: Ensure that certificate validation isn't disabled with curl

(CKV2_DOCKER_2)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: PR - fuzzing test
🔇 Additional comments (12)
src/aws/flb_aws_compress.c (1)

58-64: Parquet compression option wiring looks correct

Entry is properly guarded by FLB_HAVE_ARROW_PARQUET and mirrors the Arrow entry. No concerns.

plugins/out_s3/s3.c (2)

735-745: Multipart size checks generalized for all non-NONE compressions look good

Applying the 5GB per-part cap for compressed multipart uploads vs. 50MB when uncompressed is consistent and conservative.


3956-4000: Config help text updated to include ‘parquet’ — LGTM

Clear and matches build-time guards. No issues.

.github/workflows/unit-tests.yaml (1)

55-55: Matrix setup to exercise -DFLB_ARROW=On looks solid

Coverage for Arrow builds, and excluding clang for Arrow in this matrix, is clear and maintainable.

Also applies to: 70-74

packaging/distros/rockylinux/Dockerfile (1)

145-162: FLB_ARROW flag wiring looks good.

ARG + CMake propagation is consistent with other distros. No issues here.

packaging/distros/amazonlinux/Dockerfile (1)

35-37: ARG/ENV and CMake propagation for FLB_ARROW are consistent.

Good exposure of the flag across base stages and the builder; matches project-wide pattern.

Also applies to: 66-67, 93-95, 125-126, 167-168

packaging/distros/almalinux/Dockerfile (1)

146-163: CMake: FLB_ARROW wiring is correct.

Flag is introduced and forwarded consistently. Looks good.

packaging/distros/ubuntu/Dockerfile (2)

143-145: FLB_ARROW exposure and CMake propagation are consistent across Ubuntu variants.

Defaults Off for 16.04/18.04 and On for 20.04+ are reasonable. No further action.

Also applies to: 175-177, 205-207, 237-239, 267-269, 299-301


137-141: APT Arrow repo enablement is correct for Ubuntu ≥ 20.04.

Using the official apache-arrow apt-source .deb, then installing libarrow-glib-dev/libparquet-glib-dev, matches upstream guidance. Looks good.

Also applies to: 169-173, 199-203, 231-235, 261-265, 293-297

packaging/distros/centos/Dockerfile (3)

207-211: Flag wiring (FLB_UNICODE_ENCODER/FLB_ARROW) and CMake propagation look good.

Consistent with the rest of the PR; no issues.

Also applies to: 253-255, 300-301


181-201: Minor: consistent plugin install before dnf config-manager across both CentOS 9 stages.

centos-9-base installs 'dnf-command(config-manager)'; the arm64v8 stage mirrors the same pattern—good. Keep this symmetry if future edits touch these blocks.

Also applies to: 223-243


192-201: CentOS Stream 9: please manually verify Arrow “almalinux” repo resolution on both arches

This change relies on the AlmaLinux Arrow RPM landing cleanly on CentOS Stream 9. Because automated Docker tests aren’t available here, please confirm on real or VM hosts for both x86_64 and aarch64 that metadata and packages resolve without errors. For example:

# On a CentOS Stream 9 (x86_64 or aarch64) host/VM:
sudo dnf -y install 'dnf-command(config-manager)' epel-release
sudo dnf -y install https://packages.apache.org/artifactory/arrow/almalinux/$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)/apache-arrow-release-latest.rpm
sudo dnf config-manager --set-enabled epel powertools crb ol$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)_codeready_builder
sudo dnf config-manager --set-enabled codeready-builder-for-rhel-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)-rhui-rpms
sudo subscription-manager repos --enable codeready-builder-for-rhel-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)-$(arch)-rpms
sudo dnf -y install arrow-glib-devel parquet-glib-devel

Also verify the same steps apply cleanly to the snippet at lines 234–243.

Comment thread packaging/distros/almalinux/Dockerfile Outdated
Comment thread packaging/distros/centos/Dockerfile Outdated
Comment thread packaging/distros/rockylinux/Dockerfile Outdated
Comment thread packaging/distros/ubuntu/Dockerfile Outdated
Comment thread src/aws/compression/arrow/compress.c Outdated
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/out_s3/s3.c (1)

706-714: Critical: validation uses ctx->compression before it’s assigned (guard never triggers).

You compute ret = flb_aws_compression_get_type(tmp) but then check ctx->compression (still holding its old/default value) instead of ret. This allows Arrow/Parquet with use_put_object == FLB_FALSE to slip past initialization. Compare against ret (or assign first, then validate).

Apply this minimal fix:

         ret = flb_aws_compression_get_type(tmp);
         if (ret == -1) {
             flb_plg_error(ctx->ins, "unknown compression: %s", tmp);
             return -1;
         }
-        if (ctx->use_put_object == FLB_FALSE &&
-            (ctx->compression == FLB_AWS_COMPRESS_ARROW ||
-             ctx->compression == FLB_AWS_COMPRESS_PARQUET)) {
+        if (ctx->use_put_object == FLB_FALSE &&
+            (ret == FLB_AWS_COMPRESS_ARROW ||
+             ret == FLB_AWS_COMPRESS_PARQUET)) {
             flb_plg_error(ctx->ins,
-                          "use_put_object must be enabled when Apache Arrow or Parquet is enabled");
+                          "use_put_object must be enabled when Apache Arrow or Parquet compression is selected");
             return -1;
         }
         ctx->compression = ret;
♻️ Duplicate comments (1)
plugins/out_s3/s3.c (1)

1173-1177: Fix misleading log message (it’s reporting chunk size, not upload_chunk_size).

The message says “Pre-compression upload_chunk_size” but prints the pre-compression chunk size. Suggest clarifying wording.

This also reflects the earlier feedback about the Arrow/Parquet fallback branch; it’s now correctly limited to the gzip-only case.

Apply this tweak:

-                flb_plg_info(ctx->ins, "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, "
-                                       "the chunk was too small, using PutObject to upload", preCompress_size, body_size);
+                flb_plg_info(ctx->ins,
+                             "Pre-compression chunk size was %zu bytes; after compression it's %zu bytes; "
+                             "chunk too small, using PutObject to upload",
+                             preCompress_size, body_size);
🧹 Nitpick comments (2)
plugins/out_s3/s3.c (2)

1195-1197: De-duplicate compressed-buffer frees with a single cleanup path.

You correctly free payload_buf across all exit paths, but the repetition makes future edits error-prone. Consider a single cleanup: block (or a small inline helper) to free the compressed buffer when set.

Example pattern:

int ret = FLB_OK;
void *payload_buf = NULL;
bool compressed = false;

/* ... set compressed=true and payload_buf on success ... */

put_object:
ret = s3_put_object(...);

cleanup:
if (compressed && payload_buf != NULL) {
    flb_free(payload_buf);
}
return ret;

This keeps ownership clear and reduces maintenance overhead.

Also applies to: 1222-1224, 1236-1238, 1246-1248, 1263-1265


3996-3999: Clarify option docs: Arrow/Parquet require PutObject.

Since initialization enforces PutObject for Arrow/Parquet, reflect that in the compression option description to prevent config surprises.

-    "Compression type for S3 objects. 'gzip', 'arrow' and 'parquet' are the supported values. "
-    "'arrow' and 'parquet' are only available if Apache Arrow was enabled at compile time. "
+    "Compression type for S3 objects. Supported values: 'gzip', 'arrow', 'parquet'. "
+    "'arrow' and 'parquet' are available only if Apache Arrow was enabled at compile time, "
+    "and require 'use_put_object' to be enabled. "
     "Defaults to no compression. "
     "If 'gzip' is selected, the Content-Encoding HTTP Header will be set to 'gzip'."
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d165b26 and fbaab06.

📒 Files selected for processing (4)
  • CMakeLists.txt (1 hunks)
  • plugins/out_s3/s3.c (9 hunks)
  • src/aws/compression/arrow/CMakeLists.txt (1 hunks)
  • src/aws/compression/arrow/compress.c (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/aws/compression/arrow/CMakeLists.txt
  • src/aws/compression/arrow/compress.c
  • CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: PR - fuzzing test
🔇 Additional comments (3)
plugins/out_s3/s3.c (3)

735-745: Multipart size caps: compressed vs. uncompressed limit — LGTM.

Good call enforcing 5 GB per-part cap for compressed payloads and retaining 50 MB for uncompressed. This aligns the code with S3 multipart constraints and avoids accidental oversizing.


1130-1141: Unified compression preprocessing for all non-NONE types — LGTM.

Centralizing gzip/arrow/parquet into a single compression pathway simplifies control flow and makes memory handling consistent. The early return on compression failure with FLB_RETRY is appropriate.


1352-1364: Graceful fallback to uncompressed on compression failure — LGTM.

When compression fails in put_all_chunks(), you log and proceed with the original buffer to prevent data loss. Memory ownership transitions (buffer -> payload_buf) and frees look correct.

@xie317

xie317 commented Sep 4, 2025

Copy link
Copy Markdown

Question, this doesn't seem to support specify codec or row size and data page size. Will that be added in the future?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants