Comments (4)
Bummer that we are delivering invalid UTF-8 data. I'd love to see the invalid data too so that we can fix things on our side.
Couple options here:
- patch DelimitedStringReader so that you wrap the new String(...) in a try-catch block with logging, incrementing a stat and returning null in the case of an exception.
- create a new HosebirdMessageProcessor (eg a ByteArrayMessageProcessor) that is parameterized on byte[] and use it in your ClientBuilder.
- I'm not familiar with your 2nd option (CharsetDecoder changes), but if there is a good reason that is preferable to either of those 1st 2 options, please explain.
I think the simplest possible thing is patching DelimitedStringReader, but it also doesn't give much flexibility. I suspect it's good enough for this though.
from hbc.
I was mistaken about the data in the firehose messages being invalid UTF-8. The data was valid UTF-8, just not valid JSON, and the cause was traced back to multibyte UTF-8 mangling. This occurred around March 12-14, 2013. I communicated with Arne and it was fixed. To quote Arne:
this wound up being a bit related to the unencoded em-dash issue. Along the way, the street_address field was mangling multibyte UTF-8 characters. This basically manifested itself as \u2013 (em-dash) being delivered as \u0013 (byte 13, a control character). I imagine that the other characters you saw were mangled in a similar way.
In any case, I would like to cover both possibilities (invalid UTF-8 and invalid JSON).
I'm fairly new to Java so I might have overlooked something, but it is my understanding that the String constructors do not throw on invalid data. Instead they silently replace it. This is why I ended up at CharsetDecoder. I've written this JUnit test to demonstrate the problem:
import static org.junit.Assert.fail;

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import org.junit.Test;

@Test
public void testUnicode() {
    // 0xC3 opens a two-byte UTF-8 sequence, but 0x2E is not a continuation byte
    final byte[] badBytes = { (byte) 0xC3, (byte) 0x2E };
    Charset utf8 = Charset.forName("UTF-8");

    // a decoder configured with REPORT throws on malformed input
    CharsetDecoder utf8Decoder = utf8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        utf8Decoder.decode(ByteBuffer.wrap(badBytes));
        fail("should not decode");
    } catch (CharacterCodingException expected) {
        // expected: the malformed sequence is reported
    }

    // this will silently replace the bad utf8 sequence with U+FFFD (replacement character)
    // no exception is thrown
    try {
        String z = new String(badBytes, utf8);
    } catch (Exception e) {
        fail("String constructor should not throw");
    }
}
I realize this is a bit pedantic, and if you think checking for encoding errors is out of scope for hbc, I'd agree and just write my own HosebirdMessageProcessor.
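For reference, a minimal sketch of the strict decoding described above, using only JDK classes. The class name StrictUtf8 and the method decodeOrNull are hypothetical and not part of hbc. The helper returns null on malformed input, in the spirit of the "return null in the case of an exception" option, and it assumes the caller takes care of logging or incrementing a stat.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of hbc): strict UTF-8 decoding with the JDK only.
final class StrictUtf8 {
    private StrictUtf8() {}

    /** Returns the decoded text, or null if the bytes are not well-formed UTF-8. */
    static String decodeOrNull(byte[] bytes) {
        try {
            // A fresh decoder per call: CharsetDecoder instances are not thread-safe.
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            // Malformed input: signal it instead of substituting U+FFFD.
            return null;
        }
    }
}
```

A custom message processor could call something like StrictUtf8.decodeOrNull(rawBytes) in place of new String(rawBytes, utf8) and drop or count the message when the result is null. How that hooks into hbc's processor classes is not shown here.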
from hbc.
Ahhh didn't know that bit about the String constructor replacing bad data.
So, my take is that this was a bug on our end that we don't expect to regress again and I'd rather leave this bit simple in our implementation.
from hbc.
Sounds good to me.
from hbc.