From b1dee3f0603245e5a130cb9572ee94ab31afb898 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 24 Oct 2025 14:34:44 +0900
Subject: [PATCH 1/4] out_s3: Add an instruction for enabling parquet compression

- Apply suggestion from @esmerel

Co-authored-by: Lynette Miles <6818907+esmerel@users.noreply.github.com>
Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 57 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index d1f602b44..2092d09fb 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -6,7 +6,7 @@ description: Send logs, data, and metrics to Amazon S3
 
 ![AWS logo](../../.gitbook/assets/image%20(9).png)
 
-The _Amazon S3_ output plugin lets you ingest records into the [S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) cloud object store.
+The _Amazon S3_ output plugin lets you ingest records into the [S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) cloud object store.
 
 The plugin can upload data to S3 using the [multipart upload API](https://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html) or [`PutObject`](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html). Multipart is the default and is recommended. Fluent Bit will stream data in a series of _parts_. This limits the amount of data buffered on disk at any point in time. By default, every time 5 MiB of data have been received, a new part will be uploaded. The plugin can create files up to gigabytes in size from many small chunks or parts using the multipart API. All aspects of the upload process are configurable.
 
@@ -36,7 +36,7 @@ The [Prometheus success/retry/error metrics values](../../administration/monitor
 | `blob_database_file` | Absolute path to a database file to be used to store blob files contexts. | _none_ |
 | `bucket` | S3 Bucket name | _none_ |
 | `canned_acl` | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | _none_ |
-| `compression` | Compression type for S3 objects. `gzip`, `arrow`, `parquet` and `zstd` are the supported values, `arrow` and `parquet` are only available if Apache Arrow was enabled at compile time. Defaults to no compression. | _none_ |
+| `compression` | Compression/format for S3 objects. Supported: `gzip` (always available) and `parquet` (requires Arrow build). For `gzip`, the `Content-Encoding` header is set to `gzip`. `parquet` is available **only when Fluent Bit is built with `-DFLB_ARROW=On`** and Arrow GLib/Parquet GLib are installed. Parquet is typically used with `use_put_object On`. | _none_ |
 | `content_type` | A standard MIME type for the S3 object, set as the Content-Type HTTP header. | _none_ |
 | `endpoint` | Custom endpoint for the S3 API. Endpoints can contain scheme and port. | _none_ |
 | `external_id` | Specify an external ID for the STS API. Can be used with the `role_arn` parameter if your role requires an external ID. | _none_ |
@@ -695,3 +695,56 @@ The following example uses `pyarrow` to analyze the uploaded data:
 3 2021-04-27T09:33:56.539430Z 0.0 0.0 0.0 0.0 0.0 0.0
 4 2021-04-27T09:33:57.539803Z 0.0 0.0 0.0 0.0 0.0 0.0
 ```
+
+## Enable Parquet support
+
+### Build requirements for Parquet
+
+To enable Parquet, build Fluent Bit with Apache Arrow support and install Arrow GLib/Parquet GLib:
+
+```bash
+# Ubuntu/Debian example
+sudo apt-get update
+sudo apt-get install -y -V ca-certificates lsb-release wget
+wget https://packages.apache.org/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+sudo apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+sudo apt-get update
+sudo apt-get install -y -V libarrow-glib-dev libparquet-glib-dev
+
+# Build Fluent Bit with Arrow:
+cd build/
+cmake -DFLB_ARROW=On ..
+cmake --build .
+```
+
+For other Linux distributions, refer [the document for installation instructions of Apache Parquet](https://arrow.apache.org/install/).
+Apache Parquet GLib is a part of Apache Arrow project.
+
+### Testing Parquet compression
+
+## Testing (Parquet)
+
+Example configuration:
+
+```yaml
+service:
+  flush: 5
+  daemon: Off
+  log_level: debug
+  http_server: Off
+
+pipeline:
+  inputs:
+    - name: dummy
+      tag: dummy.local
+      dummy {"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}
+
+  outputs:
+    - name: s3
+      match: dummy*
+      region: us-east-2
+      bucket:
+      use_put_object: On
+      compression: parquet
+      # other parameters
+```

From 0e3ab85e789639a503e810cc0662b172e4c2c45b Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 21 Nov 2025 19:16:59 +0900
Subject: [PATCH 2/4] out_s3: Add classic format configuration for Parquet

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 2092d09fb..7193a5c79 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -639,6 +639,7 @@ After being compiled, Fluent Bit can upload incoming data to S3 in Apache Arrow
 For example:
 
 {% tabs %}
+
 {% tab title="fluent-bit.yaml" %}
 
 ```yaml
@@ -726,6 +727,10 @@ Apache Parquet GLib is a part of Apache Arrow project.
 
 Example configuration:
 
+{% tabs %}
+
+{% tab title="fluent-bit.yaml" %}
+
 ```yaml
 service:
   flush: 5
@@ -748,3 +753,31 @@ pipeline:
       compression: parquet
       # other parameters
 ```
+
+{% endtab %}
+{% tab title="fluent-bit.conf" %}
+
+```text
+[SERVICE]
+  Flush       5
+  Daemon      Off
+  Log_Level   debug
+  HTTP_Server Off
+
+[INPUT]
+  Name  dummy
+  Tag   dummy.local
+  Dummy {"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}
+
+[OUTPUT]
+  Name           s3
+  Match          dummy*
+  Region         us-east-2
+  Bucket
+  Use_Put_Object On
+  Compression    parquet
+  # other parameters
+```
+
+{% endtab %}
+{% endtabs %}

From 848e027a81402d8288e3af196067d00c99f266b5 Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Fri, 21 Nov 2025 19:22:47 +0900
Subject: [PATCH 3/4] out_s3: Align headers

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 7193a5c79..1cae2696b 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -721,9 +721,7 @@ cmake --build .
 For other Linux distributions, refer [the document for installation instructions of Apache Parquet](https://arrow.apache.org/install/).
 Apache Parquet GLib is a part of Apache Arrow project.
 
-### Testing Parquet compression
-
-## Testing (Parquet)
+### Testing Parquet support
 
 Example configuration:
 

From 84b8c9a6190d0da31fbb028b1f2d158965bdbc9e Mon Sep 17 00:00:00 2001
From: Hiroshi Hatake
Date: Tue, 25 Nov 2025 10:51:08 +0900
Subject: [PATCH 4/4] out_s3: Remove a needless newline

Signed-off-by: Hiroshi Hatake
---
 pipeline/outputs/s3.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 1cae2696b..106282aa4 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -726,7 +726,6 @@ Apache Parquet GLib is a part of Apache Arrow project.
 Example configuration:
 
 {% tabs %}
-
 {% tab title="fluent-bit.yaml" %}
 
 ```yaml