string_decoder: support Uint8Array input to methods by addaleax · Pull Request #11613 · nodejs/node

addaleax · 2017-02-28T17:15:25Z

This includes a bit of refactoring for the Buffer internals to keep up performance. Some quick benchmark results (only the string-decoder benchmark, excluding the bigger input/chunk sizes and with reduced n):

$ ./node benchmark/compare.js --new ./node --old ./node-d08836003c57 --runs 5 --filter string-decoder.js string_decoder| Rscript benchmark/compare.R
[00:01:37|% 100| 1/1 files | 10/10 runs | 20/20 configs]: Done
                                                                                      improvement confidence      p.value
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="ascii"            14.65 %        *** 1.092735e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-ascii"      8.52 %        *** 6.202910e-05
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-utf8"       5.27 %            7.176679e-02
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf16le"           4.37 %          * 1.891394e-02
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf8"             14.01 %        *** 1.475088e-04
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="ascii"             20.33 %        *** 1.058625e-08
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-ascii"       8.51 %          * 1.025246e-02
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-utf8"       -0.67 %            8.260933e-01
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf16le"           10.16 %        *** 6.193190e-05
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf8"               7.54 %        *** 6.250636e-04
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="ascii"            16.56 %        *** 1.856548e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-ascii"      8.68 %        *** 2.254509e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-utf8"       6.20 %            7.669403e-02
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf16le"           7.12 %        *** 2.359718e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf8"              3.33 %          * 1.040546e-02
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="ascii"             18.07 %        *** 9.677031e-07
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-ascii"      12.49 %         ** 4.455920e-03
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-utf8"        2.96 %            2.299725e-01
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf16le"            9.82 %        *** 8.056526e-06
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf8"               7.09 %        *** 7.882838e-04

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included
documentation is changed or added
commit message follows commit guidelines

Affected core subsystem(s)

string_decoder, buffer

addaleax · 2017-02-28T17:16:14Z

test/parallel/test-string-decoder.js

@mscdex What was/is this testing? The 2 was always going to be out of range…

This was added by @EricPoker in 48f8869.

jasnell · 2017-02-28T20:41:34Z

lib/buffer.js

nit: I'm really not a fan of this syntax.. but oh well.

Me neither. but what are the alternatives if I want to avoid the cost of property lookups? Splitting this into 6 lines? That’s something I can do if you prefer

nah, this is fine as is. Just had to gripe about it ;-)

addaleax · 2017-03-06T17:01:07Z

@mscdex Any further thoughts on this? Otherwise I’d like to land this in the next 1 or 2 days.

mscdex · 2017-03-07T01:47:34Z

lib/string_decoder.js

This will create hidden classes, which could add to the lookup overhead.

mscdex · 2017-03-07T01:50:11Z

lib/string_decoder.js

Using a prototypeless object would be better, but I'm not sure that having a lookup object like this is best.

Taking into account my comment from below, I would suggest creating a prototypeless object with the properties assigned at the same time using Object.create(null, { ... }).

Also, it might be worthwhile comparing other lookup strategies, such as using a function that returns the correct function based on the encoding.
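For reference, the two lookup strategies mentioned above could be sketched roughly as follows. This is only an illustration: `utf8Text`, `utf16Text`, `base64Text`, `textFns` and `getTextFn` are hypothetical names, not the identifiers used in `lib/string_decoder.js`, and the sketch says nothing about which variant is faster.

```javascript
'use strict';

// Hypothetical stand-ins for the per-encoding decode routines.
function utf8Text(buf) { return buf.toString('utf8'); }
function utf16Text(buf) { return buf.toString('utf16le'); }
function base64Text(buf) { return buf.toString('base64'); }

// Strategy 1: a prototypeless lookup object whose properties are all
// defined in the same Object.create() call, so no properties are added
// one by one afterwards and nothing is inherited from Object.prototype.
const textFns = Object.create(null, {
  utf8: { value: utf8Text, enumerable: true },
  utf16le: { value: utf16Text, enumerable: true },
  base64: { value: base64Text, enumerable: true },
});

// Strategy 2: a function that returns the correct function for the
// (already normalized) encoding name.
function getTextFn(encoding) {
  switch (encoding) {
    case 'utf8': return utf8Text;
    case 'utf16le': return utf16Text;
    case 'base64': return base64Text;
    default: return undefined;
  }
}

const chunk = Buffer.from('hi');
console.log(textFns.utf8(chunk), getTextFn('base64')(chunk));
```

Both variants resolve to the same function for a known encoding; which one is cheaper would have to be settled by benchmarks.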

mscdex · 2017-03-07T01:56:53Z

Shouldn't this be semver-major if we're now (explicitly) changing the behavior of strings passed to .write()?

mscdex · 2017-03-07T02:04:23Z

lib/string_decoder.js

I don't understand the comment here. Is the suggestion that in the future it should throw on a string? Otherwise the comment seems at odds with the string check below.

Yes… do you have different thoughts? It doesn’t really make sense to pass in a string here, does it?

Maybe just leave the behavior about string inputs as it is and make it throw in another PR? (that would be semver-major, I guess) (EDIT: OK this PR is already semver-major..)

mscdex · 2017-03-07T02:05:10Z

Also, have you benchmarked the Buffer changes independently?

jasnell · 2017-03-17T18:03:40Z

ping @addaleax :-)

addaleax · 2017-03-20T21:14:57Z

@jasnell Thanks for the ping…

I’ve rebased this and edited it a bit; @mscdex was right to ask for individual benchmarks for the Buffer changes. Instead, the *Slice methods are now made available twice, once on the Buffer prototype and once on the binding.

Here’s the current benchmark situation:

$ ./node benchmark/compare.js --new ./node --old ./node-bd496e0187 --runs 15 string_decoder| Rscript benchmark/compare.R
[00:07:51|% 100| 2/2 files | 30/30 runs | 20/20 configs]: Done
                                                                                      improvement confidence      p.value
 string_decoder/string-decoder-create.js n=2500000 encoding="ascii"                      -25.22 %        *** 3.102600e-30
 string_decoder/string-decoder-create.js n=2500000 encoding="AscII"                      -14.26 %        *** 7.347390e-14
 string_decoder/string-decoder-create.js n=2500000 encoding="base64"                      -4.54 %        *** 2.573403e-07
 string_decoder/string-decoder-create.js n=2500000 encoding="ucs2"                        -6.64 %        *** 6.163783e-09
 string_decoder/string-decoder-create.js n=2500000 encoding="UTF-16LE"                    -3.15 %         ** 8.447201e-03
 string_decoder/string-decoder-create.js n=2500000 encoding="utf-8"                       -7.28 %        *** 1.657616e-17
 string_decoder/string-decoder-create.js n=2500000 encoding="utf8"                        -7.37 %        *** 7.429393e-10
 string_decoder/string-decoder-create.js n=2500000 encoding="UTF-8"                        0.21 %            8.402134e-01
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="ascii"            23.66 %        *** 2.569594e-23
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-ascii"    -10.56 %        *** 1.821717e-08
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="base64-utf8"      -9.02 %        *** 1.316690e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf16le"          11.54 %        *** 9.783854e-05
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=128 encoding="utf8"              5.14 %        *** 1.415359e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="ascii"             28.12 %        *** 1.476634e-22
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-ascii"      -9.58 %        *** 2.745890e-06
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="base64-utf8"      -10.01 %        *** 6.174522e-08
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf16le"            9.09 %        *** 8.884515e-07
 string_decoder/string-decoder.js n=250000 chunk=16 inlen=32 encoding="utf8"               9.04 %        *** 7.448446e-10
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="ascii"            27.16 %        *** 3.700006e-14
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-ascii"    -11.61 %        *** 1.095245e-07
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="base64-utf8"     -11.04 %        *** 7.624943e-08
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf16le"           8.53 %        *** 1.069139e-11
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=128 encoding="utf8"              3.68 %        *** 4.301468e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="ascii"             28.03 %        *** 3.022976e-20
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-ascii"      -7.10 %        *** 1.286362e-06
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="base64-utf8"       -7.74 %        *** 1.830142e-05
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf16le"            9.93 %        *** 6.290946e-04
 string_decoder/string-decoder.js n=250000 chunk=64 inlen=32 encoding="utf8"               8.06 %        *** 8.187544e-09

I would be okay with accepting these, especially given how the improvements tend to affect the more common encodings (esp. utf8).

Shouldn't this be semver-major if we're now (explicitly) changing the behavior of strings passed to .write()?

@mscdex I wouldn’t consider string input covered as part of the API, but it’s a reasonable point of view. I’m changing the label to be careful.

mscdex · 2017-03-20T21:44:29Z

lib/string_decoder.js

These additions concern me a bit because they make the function size exceed Crankshaft's max inlineable source size.

@mscdex … yeah. Do you have a better alternative in mind?

I'd have to look into it.

jasnell · 2017-04-04T18:12:16Z

ping @addaleax @mscdex .. what do you want to do with this one? If this is going to make it into 8.0.0 it needs to get landed this week. Today is technically the cut off but I'll be doing the "release candidate" build next Tuesday so there's a slight bit more time.

mscdex · 2017-04-04T20:07:40Z

@jasnell I plan on taking a look this week.

mscdex · 2017-04-07T17:05:58Z

FWIW I think I've found some more optimizations to avoid the current performance regressions and then some. I am running benchmarks now ...

mscdex · 2017-04-07T19:32:06Z

Ok, here are the results (compared to the current node master branch) with this PR + my changes to StringDecoder's encoding normalization:

                                                                                        improvement confidence      p.value
 string_decoder/string-decoder-create.js n=25000000 encoding="ascii"                       37.19 %        *** 7.731410e-79
 string_decoder/string-decoder-create.js n=25000000 encoding="AscII"                       33.64 %        *** 1.707788e-27
 string_decoder/string-decoder-create.js n=25000000 encoding="base64"                      40.86 %        *** 2.873623e-47
 string_decoder/string-decoder-create.js n=25000000 encoding="ucs2"                        27.75 %        *** 9.452042e-60
 string_decoder/string-decoder-create.js n=25000000 encoding="UTF-16LE"                    25.59 %        *** 1.022009e-54
 string_decoder/string-decoder-create.js n=25000000 encoding="utf-8"                       31.80 %        *** 1.731379e-56
 string_decoder/string-decoder-create.js n=25000000 encoding="UTF-8"                       32.08 %        *** 1.138630e-72
 string_decoder/string-decoder-create.js n=25000000 encoding="utf8"                        32.78 %        *** 4.094873e-61
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="ascii"            22.32 %        *** 1.476634e-48
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="base64-ascii"      0.88 %            4.800656e-01
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="base64-utf8"       0.53 %            7.488514e-01
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="utf16le"           9.56 %        *** 2.426733e-32
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=128 encoding="utf8"              5.75 %        *** 2.816096e-28
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="ascii"             21.25 %        *** 2.168653e-39
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="base64-ascii"       1.45 %          * 2.909510e-02
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="base64-utf8"        2.09 %            1.014384e-01
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="utf16le"            9.81 %        *** 1.713600e-09
 string_decoder/string-decoder.js n=2500000 chunk=16 inlen=32 encoding="utf8"               7.90 %        *** 2.832152e-07
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="ascii"            26.26 %        *** 3.872336e-32
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="base64-ascii"     -0.13 %            8.786887e-01
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="base64-utf8"      -0.43 %            6.920594e-01
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="utf16le"           6.79 %        *** 6.201167e-16
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=128 encoding="utf8"              4.28 %        *** 1.457799e-10
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="ascii"             24.15 %        *** 3.802638e-17
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="base64-ascii"      -0.92 %            4.692021e-01
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="base64-utf8"        2.65 %        *** 1.026323e-04
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="utf16le"            4.14 %         ** 2.253467e-03
 string_decoder/string-decoder.js n=2500000 chunk=64 inlen=32 encoding="utf8"               5.45 %        *** 9.106784e-11

addaleax · 2017-04-09T15:27:33Z

@mscdex Are your modifications pushed somewhere? Are you okay with this change (possibly pending applying them)?

mscdex · 2017-04-09T17:32:42Z

Not yet, I wasn't sure where to push it for review.

addaleax · 2017-04-09T17:33:26Z

You can just push to this branch if you like.

addaleax · 2017-04-14T11:27:52Z

CI is green. @jasnell Do you mind taking another look?

TimothyGu · 2017-04-14T20:25:55Z

After #12223, I wonder if it would make sense to add support for all ArrayBuffer views instead of just Uint8Array, here rather than in a later PR.

addaleax · 2017-04-14T20:27:10Z

@TimothyGu I am not sure that makes sense … for something that decodes byte sequences, shouldn’t the input be an Uint8Array?

joyeecheung · 2017-05-05T08:51:14Z

lib/string_decoder.js

Would it affect performance if we use constants with names instead of number literals for indices?

Makes the string slice methods of buffers available on the binding object in addition to the `Buffer` prototype. This enables subsequent `string_decoder` changes to use these methods directly without performance loss, since all parameters are known by the string decoder in those cases.

This is a bit odd since `string_decoder` does currently not perform any type checking. Also, this adds an explicit check for `string` input, which does not really make sense but is relied upon by our test suite.

addaleax · 2017-05-05T09:34:37Z

Rebased. @mscdex Do my changes LGTY? This is basically only waiting for a second CTC member approval.

mscdex · 2017-05-05T14:59:31Z

src/node_buffer.cc

+                              Local<Value> buffer_arg,
+                              Local<Value> start_arg,
+                              Local<Value> end_arg) {
  Isolate* isolate = args.GetIsolate();


Perhaps we could match StringSlice() above and instead do Isolate* isolate = env->isolate(); for consistency?

mscdex · 2017-05-05T15:03:18Z

LGTM with one minor nit that shouldn't block this from landing.

CI again: https://ci.nodejs.org/job/node-test-pull-request/7898/

TimothyGu · 2017-05-05T15:16:24Z

test/parallel/test-string-decoder.js

-    sequence.forEach((write) => {
-      output += decoder.write(input.slice(write[0], write[1]));
+  for (const useUint8array of [ false, true ]) {
+    sequences.forEach((sequence) => {


While at it, change this to a for-of loop?
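A for-of version of such a nested test loop might look like the sketch below. The `input` and `sequences` values are made up for illustration and are not the ones in `test/parallel/test-string-decoder.js`; the sketch also assumes a Node.js release whose `StringDecoder#write()` accepts a `Uint8Array`, which is what this PR proposes.

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// Hypothetical input: the euro sign (3 bytes in UTF-8) followed by "a".
const input = Buffer.from('€a', 'utf8');

// Each sequence is a list of [start, end) byte ranges written one by one.
const sequences = [
  [[0, 4]],
  [[0, 1], [1, 4]],
  [[0, 2], [2, 3], [3, 4]],
];

const outputs = [];
for (const useUint8array of [false, true]) {
  for (const sequence of sequences) {
    const decoder = new StringDecoder('utf8');
    let output = '';
    for (const [start, end] of sequence) {
      let chunk = input.subarray(start, end);
      if (useUint8array) {
        // Same bytes, but viewed as a plain Uint8Array instead of a Buffer.
        chunk = new Uint8Array(chunk.buffer, chunk.byteOffset,
                               chunk.byteLength);
      }
      output += decoder.write(chunk);
    }
    output += decoder.end();
    outputs.push(output);
  }
}

console.log(outputs);
```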

TimothyGu · 2017-05-05T17:55:09Z

for something that decodes byte sequences, shouldn’t the input be an Uint8Array?

Well maybe, but I feel it is plausible for the user to use a Uint16Array for UTF-16/UCS-2 input, for example
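For context, a user holding a `Uint16Array` can already hand its bytes to the decoder by wrapping the underlying `ArrayBuffer` in a `Buffer`, which does not copy. A minimal sketch, where `asBuffer` is a hypothetical helper and the result assumes a little-endian host (the `utf16le` encoding reads the bytes as little-endian, while typed arrays use the platform byte order):

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: view the bytes of any ArrayBuffer view as a Buffer
// without copying the underlying memory.
function asBuffer(view) {
  return Buffer.from(view.buffer, view.byteOffset, view.byteLength);
}

// "hi" as UTF-16 code units, stored in platform byte order.
const units = Uint16Array.from([0x68, 0x69]);

const decoder = new StringDecoder('utf16le');
const text = decoder.write(asBuffer(units)) + decoder.end();

console.log(text);
```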

mcollina

LGTM with a polyfill for https://github.com/rvagg/string_decoder (it can be done there too!)

BridgeAR · 2017-08-26T08:45:38Z

Needs a rebase but I guess this is otherwise pretty much good to go?

BridgeAR · 2017-09-12T19:34:52Z

Ping @addaleax I would love to get this in and I think this only needs a rebase. Otherwise I would go ahead and close this.

mscdex · 2017-09-12T19:49:51Z

@BridgeAR benchmarks would probably need to be re-run before merging because this was all done before TurboFan.

addaleax · 2017-09-13T09:24:45Z

@BridgeAR @mcollina’s review was dependent on there being a polyfill for the corresponding npm module, which I haven’t done yet; feel free to take this over if you like

@mscdex I agree, but I wouldn’t expect much of a difference since I don’t think TurboFan had impact on how native bindings are called

mscdex · 2017-09-13T13:45:54Z

@addaleax I was referring more to the js-land stuff, especially the commit I pushed.

BridgeAR · 2017-09-23T00:33:27Z

@mcollina I think it would be fine to land this as is for now, as your PR to update the module to 8.1.2 has not landed yet either. So that should be merged first, from my perspective.

@addaleax this needs a rebase though.

mcollina · 2017-09-25T16:52:01Z

@BridgeAR this can land independently, we pick the content from core releases, so we can fetch them

But good catch on the other PR, I'll get it updated and landed. This is semver-major anyway, so we have time.

BridgeAR · 2017-10-02T11:58:53Z

Ping @addaleax

BridgeAR · 2017-11-22T12:56:39Z

Closing due to long inactivity. @addaleax please reopen if you want to follow up on this :-) (in that case the benchmarks should be rerun though).

addaleax · 2017-11-22T13:05:54Z

Yeah, I think there’s no point in pursuing this given that we now have TextDecoder support …
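As an illustration of that alternative, `TextDecoder` takes `Uint8Array` chunks directly and, with the `stream` option, buffers incomplete multi-byte sequences between calls. A minimal sketch, assuming a Node.js version that exports `TextDecoder` from `util`:

```javascript
'use strict';
const { TextDecoder } = require('util');

// The euro sign is three bytes in UTF-8; split it across two chunks.
const bytes = new Uint8Array([0xe2, 0x82, 0xac]);

const decoder = new TextDecoder('utf-8');
let text = '';
// { stream: true } tells the decoder that more input may follow, so an
// incomplete sequence is held back instead of being replaced with U+FFFD.
text += decoder.decode(bytes.subarray(0, 1), { stream: true });
text += decoder.decode(bytes.subarray(1), { stream: true });
// A final call without arguments flushes any remaining state.
text += decoder.decode();

console.log(text);
```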

addaleax added buffer Issues and PRs related to the buffer subsystem. semver-minor PRs that contain new features and should be released in the next minor version. string_decoder Issues and PRs related to the string_decoder subsystem. labels Feb 28, 2017

nodejs-github-bot added buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. string_decoder Issues and PRs related to the string_decoder subsystem. labels Feb 28, 2017

addaleax requested review from jasnell and mscdex February 28, 2017 17:17

jasnell reviewed Feb 28, 2017

jasnell approved these changes Feb 28, 2017

mscdex reviewed Mar 7, 2017

mscdex reviewed Mar 7, 2017

addaleax force-pushed the string-decoder-uint8array branch from 7014a3e to db0181e Compare March 20, 2017 20:42

addaleax added semver-major PRs that contain breaking changes and should be released in the next major version. and removed semver-minor PRs that contain new features and should be released in the next minor version. labels Mar 20, 2017

mscdex reviewed Mar 20, 2017

refack force-pushed the master branch from 16073c0 to fbe946b Compare April 14, 2017 04:11

jasnell approved these changes Apr 14, 2017

joyeecheung reviewed May 5, 2017

addaleax and others added 3 commits May 5, 2017 11:28

string_decoder: support Uint8Array input to methods

f4e7b55

This is a bit odd since `string_decoder` does currently not perform any type checking. Also, this adds an explicit check for `string` input, which does not really make sense but is relied upon by our test suite.

string_decoder: refactor encoding normalization

525fabd

addaleax force-pushed the string-decoder-uint8array branch from b67e1e9 to 525fabd Compare May 5, 2017 09:30

mscdex reviewed May 5, 2017

TimothyGu reviewed May 5, 2017

addaleax mentioned this pull request May 11, 2017

bring string_decoder into the Foundation nodejs/TSC#260

Closed

mcollina approved these changes May 15, 2017

BridgeAR added the stalled Issues and PRs that are stalled. label Sep 8, 2017

BridgeAR closed this Nov 22, 2017

addaleax deleted the string-decoder-uint8array branch November 22, 2017 13:05
