Comments (4)
@jamiebuilds can you give use case(s) where you only care about the byte length and don't need the encoded data?
from encoding.
@jakearchibald I work on an end-to-end encrypted messaging app where we can't inspect the types of payloads being sent between clients on the server, so there are many places where we need to enforce a max byte length on the client to prevent certain types of abuse overloading client apps.
Right now we mostly do encode the data in Node buffers but found that it would be more efficient to catch these things earlier and have the option of dropping payloads that are too large before we start doing anything with that data.
After implementing some of this though, I actually found an even better way of doing this:
function maxLimitCheck(maxByteSize: number) {
  let encoder = new TextEncoder()
  // 4 spare bytes: encodeInto never writes a partial code point, so an
  // oversized input always yields written > maxByteSize
  let maxSizeArray = new Uint8Array(maxByteSize + 4)
  return (input: string): boolean => {
    return encoder.encodeInto(input, maxSizeArray).written <= maxByteSize
  }
}

let check = maxLimitCheck(5e6) // 5MB
check("a".repeat(5)) // true
check("a".repeat(5e6)) // true
check("a".repeat(5e6 - 1) + "¢") // false: "¢" is 2 bytes, so 5e6 + 1 bytes total
check("a".repeat(5e6 + 1)) // false
check("a".repeat(2 ** 29 - 24)) // false
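The spare bytes matter because encodeInto stops early rather than writing part of a code point. A minimal sketch of that behavior, using tiny illustrative sizes (the 3-byte buffer and the limit of 4 are assumptions for the example, not values from the app):

```javascript
// Sketch: how encodeInto behaves when the destination is too small.
const encoder = new TextEncoder();

// "a€" needs 1 + 3 = 4 bytes, but the destination only has 3.
const small = new Uint8Array(3);
const partial = encoder.encodeInto("a€", small);
// "€" does not fit in the remaining 2 bytes, so only "a" is written.
console.log(partial.read, partial.written); // 1 1

// With 4 bytes of slack past the limit, an oversized input always ends
// with written > limit, because one code point needs at most 4 bytes.
const limit = 4;
const scratch = new Uint8Array(limit + 4);
const over = encoder.encodeInto("aaaa€", scratch); // 4 + 3 = 7 bytes
console.log(over.written > limit); // true
```

Without the slack, an oversized input could stop at or below the limit and look like it fits.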
Testing this out in my benchmark repo with the max size array enforcing a couple different limits:
./benchmarks/blob.js: 4.8 ops/sec (±0.1, p=0.001, o=0/10)
./benchmarks/buffer.js: 54.5 ops/sec (±3.0, p=0.001, o=0/10)
./benchmarks/implementation.js: 0.7 ops/sec (±0.0, p=0.001, o=0/10)
./benchmarks/textencoder.js: 11.9 ops/sec (±1.0, p=0.001, o=0/10)
5MB:
6’318.7 ops/sec (±743.3, p=0.001, o=8/100) severe outliers=6
50MB:
551.8 ops/sec (±7.6, p=0.001, o=7/100) severe outliers=4
500MB:
51.5 ops/sec (±4.6, p=0.001, o=6/100) severe outliers=4
I still believe this is a useful function to have: there are more than 10k results for Buffer.byteLength( on GitHub (most of which, from looking around, pass in strings, although the API also accepts Buffers and other typed arrays).
It seems a lot of people are using it for Content-Length headers too.
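The Content-Length case comes up because a string's length counts UTF-16 code units, not bytes on the wire. A small sketch, assuming Node.js (where Buffer is a global) and a made-up JSON body:

```javascript
// Sketch: string length vs. UTF-8 byte length in Node.js.
// The body below is illustrative only.
const body = JSON.stringify({ msg: "héllo 👋" });

const codeUnits = body.length;                 // UTF-16 code units
const bytes = Buffer.byteLength(body, "utf8"); // bytes actually sent

console.log(codeUnits, bytes); // 18 21
```

Using body.length as the Content-Length here would under-report the size by 3 bytes.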
I am not 100% sure this is correct ... but ... it's also pretty slow, and I'm starting to wonder whether the slowness comes directly from the string's internal code points:
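"use strict"
module.exports = (input) => {
  let total = 0;
  // for...of iterates by code point, not by UTF-16 code unit
  for (const c of input) {
    const p = c.codePointAt(0);
    if (p < 0x80) total += 1;
    else if (p < 0x800) total += 2;
    // 3 bytes for the rest of the BMP, 4 bytes above U+FFFF
    else total += p < 0x10000 ? 3 : 4;
  }
  return total;
};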
Results on my laptop:
./benchmarks/blob.js: 405’174.6 ops/sec (±5’563.9, p=0.001, o=6/100) severe outliers=2
./benchmarks/buffer.js: 45’447’421.7 ops/sec (±659’453.6, p=0.001, o=0/100)
./benchmarks/codepoint.js: 15’096’778.8 ops/sec (±185’463.1, p=0.001, o=0/100)
./benchmarks/implementation.js: 65’565’103.6 ops/sec (±1’127’578.0, p=0.001, o=4/100) severe outliers=2
./benchmarks/textencoder.js: 2’698’465.4 ops/sec (±97’198.0, p=0.001, o=0/100)
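One way to gain confidence in a hand-written counter like this is to compare it with TextEncoder on a few boundary cases. The sketch below uses its own standalone counter with explicit UTF-8 range boundaries (0x80, 0x800, 0x10000); the name countUtf8Bytes and the sample strings are hypothetical:

```javascript
// Sketch: cross-check a code point counter against TextEncoder.
function countUtf8Bytes(input) {
  let total = 0;
  for (const c of input) {
    const p = c.codePointAt(0);
    if (p < 0x80) total += 1;
    else if (p < 0x800) total += 2;
    // Lone surrogates land here too; TextEncoder emits U+FFFD (3 bytes).
    else if (p < 0x10000) total += 3;
    else total += 4;
  }
  return total;
}

const encoder = new TextEncoder();
// Empty, ASCII, 2-byte, 3-byte, first 3-byte code point, 4-byte, lone surrogate.
const samples = ["", "abc", "¢", "€", "\u0800", "👋", "a\uD800b"];
const mismatches = samples.filter(
  (s) => countUtf8Bytes(s) !== encoder.encode(s).length
);
console.log(mismatches.length); // 0
```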
I ran an extra test to check whether buffer creation is the reason for the slowdown, and it is.
With a new buffer each time:
"use strict"
let input = require("../input")
let encoder = new TextEncoder()
module.exports = () => {
  // size as worst case scenario
  const ui8Array = new Uint8Array(input.length * 4);
  return encoder.encodeInto(input, ui8Array).written;
}
This is still faster than encode(input).byteLength:
./benchmarks/textencoder.js: 3’329’442.3 ops/sec (±174’287.7, p=0.001, o=0/100)
Now, if there is no new buffer creation at all:
"use strict"
let input = require("../input")
let encoder = new TextEncoder()
// size as worst case scenario
const ui8Array = new Uint8Array(input.length * 4);
module.exports = () => {
  return encoder.encodeInto(input, ui8Array).written;
}
The result is better than the code point loop:
./benchmarks/textencoder.js: 23’922’510.0 ops/sec (±547’321.7, p=0.001, o=4/100) severe outliers=3
I suppose a method that only counts the byte length, without writing any output, would make it possible to get performance closer to Node.js's Buffer.byteLength.
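Until something like that exists, the no-allocation variant can be approximated for arbitrary inputs with a scratch buffer that is reused and only grown when needed. A minimal sketch; utf8ByteLength is a hypothetical helper, not a platform API, and the initial 1024-byte size is an arbitrary assumption:

```javascript
// Sketch: byte length via encodeInto with a reusable, growable scratch buffer.
const encoder = new TextEncoder();
let scratch = new Uint8Array(1024);

function utf8ByteLength(input) {
  // UTF-8 needs at most 3 bytes per UTF-16 code unit.
  const worstCase = input.length * 3;
  if (scratch.length < worstCase) {
    // Grow once; later calls with inputs of this size reuse the buffer.
    scratch = new Uint8Array(worstCase);
  }
  return encoder.encodeInto(input, scratch).written;
}

console.log(utf8ByteLength("héllo")); // 6
```

This still does the full encoding work and keeps the largest buffer alive, so it is a workaround rather than a substitute for a dedicated counting method.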