Corrupted messages are lost about hbc HOT 4 CLOSED

twitter commented on June 19, 2024
Corrupted messages are lost

from hbc.

Comments (4)

kevinoliver commented on June 19, 2024

Bummer that we are delivering invalid UTF-8 data. I'd love to see the invalid data too so that we can fix things on our side.

A couple of options here:

  • Patch DelimitedStringReader so that the new String(...) call is wrapped in a try-catch block that logs the failure, increments a stat, and returns null when an exception is thrown.
  • Create a new HosebirdMessageProcessor (e.g. a ByteArrayMessageProcessor) that is parameterized on byte[] and use it in your ClientBuilder.
  • I'm not familiar with your second option (CharsetDecoder changes), but if there is a good reason it is preferable to either of the first two options, please explain.

I think the simplest possible thing is patching DelimitedStringReader, but it also doesn't give much flexibility. I suspect it's good enough for this though.

toffaletti commented on June 19, 2024

I was mistaken about the data in the firehose messages being invalid UTF-8. The data was valid UTF-8, just not valid JSON, and the cause was traced back to multibyte UTF-8 mangling. This occurred around March 12-14, 2013. I communicated with Arne and it was fixed. To quote Arne:

this wound up being a bit related to the unencoded em-dash issue. Along the way, the street_address field was mangling multibyte UTF-8 characters. This basically manifested itself as \u2013 (em-dash) being delivered as \u0013 (byte 13, a control character). I imagine that the other characters you saw were mangled in a similar way.

In any case, I would like to cover both possibilities (invalid UTF-8 and invalid JSON).

I'm fairly new to Java so I might have overlooked something, but it is my understanding that the String constructors do not throw on invalid data. Instead they silently replace it. This is why I ended up at CharsetDecoder. I've written this JUnit test to demonstrate the problem:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import org.junit.Test;
    import static org.junit.Assert.assertTrue;
    import static org.junit.Assert.fail;

    @Test
    public void testUnicode() {
        // 0xC3 starts a two-byte UTF-8 sequence, but 0x2E ('.') is not a continuation byte
        final byte[] badBytes = {
                (byte)0xC3, (byte)0x2E
        };
        Charset utf8 = Charset.forName("UTF-8");
        CharsetDecoder utf8Decoder = utf8.newDecoder();
        utf8Decoder.onMalformedInput(CodingErrorAction.REPORT);
        utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            utf8Decoder.decode(ByteBuffer.wrap(badBytes));
            fail("should not decode");
        } catch (CharacterCodingException expected) {
            // the strict decoder reports the malformed sequence
        }

        // this will silently replace the bad utf8 sequence with U+FFFD (replacement character)
        // no exception is thrown
        String z = new String(badBytes, utf8);
        assertTrue(z.indexOf('\uFFFD') >= 0);
    }

I realize this is a bit pedantic, and if you think checking for encoding errors is out of scope for hbc, I'd agree and just write my own HosebirdMessageProcessor.
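For readers who want the strict behavior outside of a test, here is a minimal sketch of a standalone helper that decodes bytes with a reporting CharsetDecoder, returns null on malformed input, and counts failures. The class name StrictUtf8 and its methods are hypothetical and are not part of hbc; how such a helper would be wired into a HosebirdMessageProcessor is left open, since that depends on hbc's API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not part of hbc.
final class StrictUtf8 {
    // Counts inputs rejected as malformed, standing in for "incrementing a stat".
    private static final AtomicLong MALFORMED = new AtomicLong();

    private StrictUtf8() {}

    /** Returns the decoded String, or null if the bytes are not valid UTF-8. */
    static String decodeOrNull(byte[] bytes) {
        // CharsetDecoder instances are not thread-safe, so create one per call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            MALFORMED.incrementAndGet();
            return null;
        }
    }

    /** Number of inputs rejected so far. */
    static long malformedCount() {
        return MALFORMED.get();
    }
}
```

With this sketch, StrictUtf8.decodeOrNull(new byte[] {(byte) 0xC3, (byte) 0x2E}) returns null and bumps the counter, while well-formed input comes back as an ordinary String. Creating a decoder per call trades a little allocation for thread safety; a per-thread decoder would be an alternative.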

kevinoliver commented on June 19, 2024

Ahhh didn't know that bit about the String constructor replacing bad data.

So, my take is that this was a bug on our end that we don't expect to regress, and I'd rather keep this part of our implementation simple.

toffaletti commented on June 19, 2024

Sounds good to me.
