out_s3: Add parquet compression type#8837

Closed
cosmo0920 wants to merge 34 commits into master from cosmo0920-add-parquet-compression-type-for-out_s3

cosmo0920 commented May 20, 2024

Contributor

With the columnify command, out_s3 can support the Parquet format.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    trace
    HTTP_Server  Off
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name dummy
    Tag  dummy.local
    dummy {"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}

[OUTPUT]
    Name  s3
    Match dummy*
    Region us-east-2
    bucket fbit-parquet-s3
    Use_Put_object true
    compression parquet
    parquet.schema_file schema-dummy.avsc

schema-dummy.avsc

{
  "type": "record",
  "name": "DummyMessages",
  "fields" : [
    {"name": "boolean", "type": "boolean"},
    {"name": "int",     "type": "int"},
    {"name": "long",    "type": "long"},
    {"name": "float",   "type": "float"},
    {"name": "double",  "type": "double"},
    {"name": "bytes",   "type": "bytes"},
    {"name": "string",  "type": "string"}
  ]
}
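
As a quick sanity check that the dummy record above lines up with schema-dummy.avsc, the two can be compared field by field. The sketch below is not part of this PR or of columnify: the helper `check_record` and the `AVRO_TO_PY` mapping are illustrative assumptions, and accepting a JSON string for an Avro `bytes` field simply mirrors the `"bytes": "foo"` value in the dummy input rather than any documented conversion rule.

```python
import json

# Schema and record copied from the configuration above.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "DummyMessages",
  "fields": [
    {"name": "boolean", "type": "boolean"},
    {"name": "int",     "type": "int"},
    {"name": "long",    "type": "long"},
    {"name": "float",   "type": "float"},
    {"name": "double",  "type": "double"},
    {"name": "bytes",   "type": "bytes"},
    {"name": "string",  "type": "string"}
  ]
}
""")

RECORD = json.loads(
    '{"boolean": false, "int": 1, "long": 1, "float": 1.1, '
    '"double": 1.1, "bytes": "foo", "string": "foo"}'
)

# Assumed mapping from Avro primitive names to JSON-decoded Python types.
AVRO_TO_PY = {
    "boolean": (bool,),
    "int": (int,),
    "long": (int,),
    "float": (int, float),
    "double": (int, float),
    "bytes": (str,),
    "string": (str,),
}


def check_record(schema, record):
    """Return a list of mismatches (empty when every field fits)."""
    problems = []
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_bool_mismatch = isinstance(value, bool) and avro_type != "boolean"
        if is_bool_mismatch or not isinstance(value, AVRO_TO_PY[avro_type]):
            problems.append(
                f"{name}: expected {avro_type}, got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    print(check_record(SCHEMA, RECORD) or "record matches schema")
```

This only covers the flat, primitive-typed schema used here; unions, nested records, and logical types would need more handling.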
  • Debug log output from testing the change
Fluent Bit v4.1.0
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/07/28 19:17:51] [ info] Configuration:
[2025/07/28 19:17:51] [ info]  flush time     | 5.000000 seconds
[2025/07/28 19:17:51] [ info]  grace          | 5 seconds
[2025/07/28 19:17:51] [ info]  daemon         | 0
[2025/07/28 19:17:51] [ info] ___________
[2025/07/28 19:17:51] [ info]  inputs:
[2025/07/28 19:17:51] [ info]      dummy
[2025/07/28 19:17:51] [ info] ___________
[2025/07/28 19:17:51] [ info]  filters:
[2025/07/28 19:17:51] [ info] ___________
[2025/07/28 19:17:51] [ info]  outputs:
[2025/07/28 19:17:51] [ info]      s3.0
[2025/07/28 19:17:51] [ info] ___________
[2025/07/28 19:17:51] [ info]  collectors:
[2025/07/28 19:17:51] [ info] [fluent bit] version=4.1.0, commit=c8e4dffcd9, pid=234222
[2025/07/28 19:17:51] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2025/07/28 19:17:51] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/07/28 19:17:51] [ info] [simd    ] disabled
[2025/07/28 19:17:51] [ info] [cmetrics] version=1.0.5
[2025/07/28 19:17:51] [ info] [ctraces ] version=0.6.6
[2025/07/28 19:17:51] [ info] [input:dummy:dummy.0] initializing
[2025/07/28 19:17:51] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2025/07/28 19:17:51] [debug] [dummy:dummy.0] created event channels: read=25 write=26
[2025/07/28 19:17:51] [debug] [s3:s3.0] created event channels: read=27 write=28
[2025/07/28 19:17:51] [ info] [output:s3:s3.0] Using upload size 100000000 bytes
[2025/07/28 19:17:51] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2025/07/28 19:17:51] [debug] [aws_credentials] creating profile (null) provider
[2025/07/28 19:17:51] [debug] [aws_credentials] Initialized AWS Profile Provider in standard chain
[2025/07/28 19:17:51] [debug] [aws_credentials] Not initializing EKS provider because AWS_ROLE_ARN was not set
[2025/07/28 19:17:51] [debug] [aws_credentials] Not initializing ECS/EKS HTTP Provider because AWS_CONTAINER_CREDENTIALS_RELATIVE_URI and AWS_CONTAINER_CREDENTIALS_FULL_URI is not set
[2025/07/28 19:17:51] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2025/07/28 19:17:51] [debug] [aws_credentials] Sync called on the EC2 provider
[2025/07/28 19:17:51] [debug] [aws_credentials] Init called on the env provider
[2025/07/28 19:17:51] [ info] [output:s3:s3.0] Sending locally buffered data from previous executions to S3; buffer=/tmp/fluent-bit/s3/fbit-parquet-s3
[2025/07/28 19:17:51] [ info] [output:s3:s3.0] Pre-compression chunk size is 5922, After compression, chunk is 1070 bytes
[2025/07/28 19:17:52] [debug] [upstream] KA connection #32 to s3.us-east-2.amazonaws.com:443 is connected
[2025/07/28 19:17:52] [debug] [http_client] not using http_proxy for header
[2025/07/28 19:17:52] [debug] [aws_credentials] Requesting credentials from the env provider..
[2025/07/28 19:17:53] [debug] [upstream] KA connection #32 to s3.us-east-2.amazonaws.com:443 is now available
[2025/07/28 19:17:53] [debug] [output:s3:s3.0] PutObject http status=200
[2025/07/28 19:17:53] [ info] [output:s3:s3.0] Successfully uploaded object /fluent-bit-logs/dummy.local/2025/07/28/10/17/51-object7DWAHnqM
[2025/07/28 19:17:53] [debug] [aws_credentials] upstream_set called on the EC2 provider
[2025/07/28 19:17:53] [ info] [sp] stream processor started
[2025/07/28 19:17:53] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2025/07/28 19:17:53] [ info] [output:s3:s3.0] worker #0 started
[2025/07/28 19:17:57] [debug] [task] created task=0x8661470 id=0 OK
[2025/07/28 19:17:57] [debug] [output:s3:s3.0] task_id=0 assigned to thread #0
[2025/07/28 19:17:57] [debug] [output:s3:s3.0] Creating upload timer with frequency 60s
[2025/07/28 19:17:57] [debug] [out flush] cb_destroy coro_id=0
[2025/07/28 19:17:57] [debug] [task] destroy task=0x8661470 (task_id=0)
[2025/07/28 19:18:02] [debug] [task] created task=0x86f0a30 id=0 OK
[2025/07/28 19:18:02] [debug] [output:s3:s3.0] task_id=0 assigned to thread #0
[2025/07/28 19:18:02] [debug] [out flush] cb_destroy coro_id=1
[2025/07/28 19:18:02] [debug] [task] destroy task=0x86f0a30 (task_id=0)
[2025/07/28 19:18:07] [debug] [task] created task=0x877db90 id=0 OK
[2025/07/28 19:18:07] [debug] [output:s3:s3.0] task_id=0 assigned to thread #0
[2025/07/28 19:18:07] [debug] [out flush] cb_destroy coro_id=2
[2025/07/28 19:18:07] [debug] [task] destroy task=0x877db90 (task_id=0)
^C[2025/07/28 19:18:08] [engine] caught signal (SIGINT)
[2025/07/28 19:18:08] [debug] [task] created task=0x87f0160 id=0 OK
[2025/07/28 19:18:08] [debug] [output:s3:s3.0] task_id=0 assigned to thread #0
[2025/07/28 19:18:08] [ warn] [engine] service will shutdown in max 5 seconds
[2025/07/28 19:18:08] [debug] [out flush] cb_destroy coro_id=3
[2025/07/28 19:18:08] [debug] [engine] retry=0x75f54b0 for task 0 already scheduled to run, not re-scheduling it.
[2025/07/28 19:18:08] [ info] [input] pausing dummy.0
[2025/07/28 19:18:08] [debug] [task] destroy task=0x87f0160 (task_id=0)
[2025/07/28 19:18:09] [ info] [engine] service has stopped (0 pending tasks)
[2025/07/28 19:18:09] [ info] [input] pausing dummy.0
[2025/07/28 19:18:09] [ info] [output:s3:s3.0] thread worker #0 stopping...
[2025/07/28 19:18:09] [ info] [output:s3:s3.0] Sending all locally buffered data to S3
[2025/07/28 19:18:09] [ info] [output:s3:s3.0] thread worker #0 stopped
[2025/07/28 19:18:09] [ info] [output:s3:s3.0] Pre-compression chunk size is 2016, After compression, chunk is 1002 bytes
[2025/07/28 19:18:09] [debug] [upstream] KA connection #32 to s3.us-east-2.amazonaws.com:443 has been assigned (recycled)
[2025/07/28 19:18:09] [debug] [http_client] not using http_proxy for header
[2025/07/28 19:18:09] [debug] [aws_credentials] Requesting credentials from the env provider..
[2025/07/28 19:18:10] [debug] [upstream] KA connection #32 to s3.us-east-2.amazonaws.com:443 is now available
[2025/07/28 19:18:10] [debug] [output:s3:s3.0] PutObject http status=200
[2025/07/28 19:18:10] [ info] [output:s3:s3.0] Successfully uploaded object /fluent-bit-logs/dummy.local/2025/07/28/10/17/57-objectKvzl4lte
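
The two "Pre-compression chunk size" lines in the log above give the chunk sizes before and after conversion. A minimal sketch for extracting those numbers and computing the size ratio follows; the regular expression is written against the exact wording shown in this log, and `compression_ratios` is a hypothetical helper, not something provided by Fluent Bit.

```python
import re

# The two compression lines copied from the debug log above.
LOG = """\
[2025/07/28 19:17:51] [ info] [output:s3:s3.0] Pre-compression chunk size is 5922, After compression, chunk is 1070 bytes
[2025/07/28 19:18:09] [ info] [output:s3:s3.0] Pre-compression chunk size is 2016, After compression, chunk is 1002 bytes
"""

PATTERN = re.compile(
    r"Pre-compression chunk size is (\d+), After compression, chunk is (\d+) bytes"
)


def compression_ratios(log_text):
    """Yield (before, after, after / before) for each matching log line."""
    for match in PATTERN.finditer(log_text):
        before, after = int(match.group(1)), int(match.group(2))
        yield before, after, after / before


if __name__ == "__main__":
    for before, after, ratio in compression_ratios(LOG):
        print(f"{before} -> {after} bytes ({ratio:.1%} of original)")
```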

Install columnify with:

$ go install github.com/reproio/columnify/cmd/columnify@latest
# ...
$ which columnify
/path/to/columnify
$ echo $?
0
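
The `which columnify` check above can also be scripted, for example as a pre-flight step before starting Fluent Bit. This is only a sketch: `find_tool` is a hypothetical helper, and it assumes the binary is expected to be discoverable on `PATH`, which this page does not confirm for the plugin itself.

```python
import shutil


def find_tool(name="columnify"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    print(path if path else "columnify not found on PATH")
```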
  • Attached Valgrind output that shows no leaks or memory corruption was found
==234222== 
==234222== HEAP SUMMARY:
==234222==     in use at exit: 0 bytes in 0 blocks
==234222==   total heap usage: 20,870 allocs, 20,870 frees, 5,028,035 bytes allocated
==234222== 
==234222== All heap blocks were freed -- no leaks are possible
==234222== 
==234222== Use --track-origins=yes to see where uninitialised values come from
==234222== For lists of detected and suppressed errors, rerun with: -s
==234222== ERROR SUMMARY: 17 errors from 5 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1380

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0. By submitting this pull request, I understand that this code will be released under the terms of that license.

Comment thread plugins/out_s3/s3.c Outdated
Comment thread plugins/out_s3/s3_win32_compat.h Outdated
Comment thread plugins/out_s3/s3.c Outdated
cosmo0920 and others added 27 commits July 28, 2025 17:37
Signed-off-by: Hiroshi Hatake <hatake@calyptia.com>
Signed-off-by: Hiroshi Hatake <hatake@calyptia.com>
Signed-off-by: Hiroshi Hatake <hatake@calyptia.com>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hatake@calyptia.com>
Signed-off-by: Hiroshi Hatake <hatake@calyptia.com>
Signed-off-by: Hiroshi Hatake <hatake@calyptia.com>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
…n Windows

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
…ompat

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
…rable

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
…ects

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
…ects

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@bilalmughal


Hi! Is there any update or ETA on merging this PR? We'd love to use Parquet with S3. Thanks!

@cosmo0920

Contributor Author

This could be superseded by #10691.

@agup006

agup006 commented Sep 4, 2025

Member

Closing this in favor of #10691
