add benchmark with sequential reading and parallel processing by oliverpool · Pull Request #5 · actforgood/bigcsvreader

oliverpool · 2023-06-19T13:35:52Z

Description

This package has an interesting approach of splitting the file. However I think that usual csv processing is CPU-bound and not IO-bound.

This is reflected in the benchmarks with the fakeProcessRow function.

I created a benchmark with uses the stdlib csv reader and spawns multiple workers for record processing: Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell.

The result is not far from Benchmark50000Rows_50Mb_withBigCsvReader:

go test -timeout=20m -benchmem -benchtime=1x -bench=.
goos: linux
goarch: amd64
pkg: github.com/actforgood/bigcsvreader
cpu: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
Benchmark50000Rows_50Mb_withBigCsvReader-8                                     1        7249230548 ns/op        61740648 B/op     100291 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadAll-8                               1        56917609645 ns/op       68507744 B/op     100044 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneAndReuseRecord-8            1        57368588456 ns/op       57606088 B/op      50019 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell-8           1        7275138093 ns/op        61607144 B/op     100037 allocs/op

Type of change

Benchmark

Checklist:

I added examples and changed the documentation accordingly (not yet).

Contributor License Agreement
By submitting this pull request, I confirm that my contribution is made under the terms of project's license.

bogcon · 2023-06-20T11:05:41Z

Hi @oliverpool
Nice discovery.

I tried to catch only the I/O (reading) improvement on a bigger file, ~500Mb instead of ~50Mb, and got the bellow results:

// diff
 func fakeProcessRow(_ []string) {
-       time.Sleep(time.Millisecond)
+       // time.Sleep(time.Millisecond)
 }

 func Benchmark50000Rows_50Mb_withBigCsvReader(b *testing.B) {
-       benchmarkBigCsvReader(5e4)(b)
+       benchmarkBigCsvReader(5e5)(b)
 }

 func Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell(b *testing.B) {
-       benchmarkGoCsvReaderReadOneByOneProcessParalell(5e4)(b)
+       benchmarkGoCsvReaderReadOneByOneProcessParalell(5e5)(b)
 }

// results
Benchmark50000Rows_50Mb_withBigCsvReader-8                             	       2	1417864642 ns/op	616139968 B/op	 1000259 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell-8   	       2	4566982168 ns/op	616006568 B/op	 1000028 allocs/op

oliverpool · 2023-06-20T19:18:12Z

Thanks!

Maybe it would be worth mentioning in the description/readme that this package is mostly useful for IO-bound workloads (since as the newly added benchmarks shows, when the CPU is the bottleneck, the stdlib csv reader is fast enough :)

bogcon · 2023-06-20T20:09:16Z

I added this note in the readme.

add benchmark with sequential reading and parallel processing

4fd9954

bogcon merged commit 364d029 into actforgood:main Jun 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add benchmark with sequential reading and parallel processing#5

add benchmark with sequential reading and parallel processing#5
bogcon merged 1 commit intoactforgood:mainfrom
oliverpool:main

oliverpool commented Jun 19, 2023

Uh oh!

bogcon commented Jun 20, 2023

Uh oh!

oliverpool commented Jun 20, 2023 •

edited

Loading

Uh oh!

bogcon commented Jun 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

oliverpool commented Jun 19, 2023

Description

Type of change

Checklist:

Uh oh!

bogcon commented Jun 20, 2023

Uh oh!

oliverpool commented Jun 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bogcon commented Jun 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

oliverpool commented Jun 20, 2023 •

edited

Loading