protoscan's Introduction

protoscan

Package protoscan is a low-level reader for protocol buffers encoded data in Go. Its main feature is support for lazy/conditional decoding of fields.

This library can improve decoding performance in two ways:

  1. fields can be decoded conditionally, skipping over fields that are not needed for a specific use case,

  2. values can be decoded directly into application-specific types or transformed on the fly, which avoids the intermediate state a general-purpose decoder would build first.

Please be aware that to decode an entire message it is still faster to use gogoprotobuf. After much testing I think this is due to the generated code inlining almost all code to eliminate the function call overhead.

Warning: Writing code with this library is like writing the auto-generated protobuf decoder and is very time-consuming. It should only be used for specific use cases and for stable protobuf definitions.

Usage

First, the encoded protobuf data is used to initialize a new Message. Then you iterate over the fields, reading or skipping them.

msg := protoscan.New(encodedData)
for msg.Next() {
    switch msg.FieldNumber() {
    case 1: // an int64 type
        v, err := msg.Int64()
        if err != nil {
            // handle
        }

    case 3: // repeated number types can be returned as a slice
        ids, err := msg.RepeatedInt64(nil)
        if err != nil {
            // handle
        }

    case 2: // for more control repeated+packed fields can be read using an iterator
        iter, err := msg.Iterator(nil)
        if err != nil {
            // handle
        }

        userIDs := make([]UserID, 0, iter.Count(protoscan.WireTypeVarint))
        for iter.HasNext() {
            v, err := iter.Int64()
            if err != nil {
                // handle
            }

            userIDs = append(userIDs, UserID(v))
        }
    default:
        msg.Skip() // required if value not needed.
    }
}

if msg.Err() != nil {
    // handle
}

After calling Next() you MUST call an accessor function (Int64(), RepeatedInt64(), Iterator(), etc.) or Skip() to ignore the field. All these functions, including Next() and Skip(), must not be called twice in a row.
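The read-or-skip rule is easier to see at the wire-format level. Each field is a varint key, (field_number << 3) | wire_type, followed by a value whose size depends on the wire type, so the next key can only be found by consuming the current value. The following is a minimal standard-library sketch of that idea, not protoscan's implementation; readTag, skipValue, and fieldNumbers are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Wire types from the protobuf encoding specification (groups omitted).
const (
	wireVarint  = 0
	wireFixed64 = 1
	wireBytes   = 2
	wireFixed32 = 5
)

// readTag (hypothetical helper) decodes the varint key at the start of buf and
// returns the field number, the wire type, and the number of bytes consumed.
func readTag(buf []byte) (fieldNum, wireType, size int, err error) {
	key, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, 0, 0, errors.New("truncated or overflowing tag")
	}
	return int(key >> 3), int(key & 7), n, nil
}

// skipValue (hypothetical helper) returns how many bytes the value at the
// start of buf occupies for the given wire type.
func skipValue(buf []byte, wireType int) (int, error) {
	switch wireType {
	case wireVarint:
		_, n := binary.Uvarint(buf)
		if n <= 0 {
			return 0, errors.New("bad varint")
		}
		return n, nil
	case wireFixed64:
		if len(buf) < 8 {
			return 0, errors.New("short 64-bit value")
		}
		return 8, nil
	case wireBytes:
		l, n := binary.Uvarint(buf)
		if n <= 0 || uint64(len(buf)-n) < l {
			return 0, errors.New("bad length-delimited value")
		}
		return n + int(l), nil
	case wireFixed32:
		if len(buf) < 4 {
			return 0, errors.New("short 32-bit value")
		}
		return 4, nil
	}
	return 0, fmt.Errorf("unsupported wire type %d", wireType)
}

// fieldNumbers walks a message and returns its field numbers in order.
// Every value is skipped; without that step the next tag cannot be located.
func fieldNumbers(buf []byte) ([]int, error) {
	var nums []int
	for len(buf) > 0 {
		num, wt, n, err := readTag(buf)
		if err != nil {
			return nil, err
		}
		buf = buf[n:]

		n, err = skipValue(buf, wt)
		if err != nil {
			return nil, err
		}
		buf = buf[n:]
		nums = append(nums, num)
	}
	return nums, nil
}

func main() {
	// Field 1 as varint 150, then field 2 as the string "hi".
	data := []byte{0x08, 0x96, 0x01, 0x12, 0x02, 'h', 'i'}
	nums, err := fieldNumbers(data)
	fmt.Println(nums, err)
}
```

Because a length-delimited value carries its size up front, skipping it is a single slice operation regardless of how large the payload is.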

Value Accessor Functions

There is an accessor for each of the protobuf scalar value types.

For repeated fields there is a corresponding set of functions like RepeatedInt64(buf []int64) ([]int64, error). Repeated fields may or may not be packed, so the accessor may be called more than once for the same field. Pass in the same pre-created buffer variable on each call so the values accumulate. For example:

var ids []int64

msg := protoscan.New(encodedData)
for msg.Next() {
    switch msg.FieldNumber() {
    case 1: // repeated int64 field
        var err error
        ids, err = msg.RepeatedInt64(ids)
        if err != nil {
            // handle
        }
    default:
        msg.Skip()
    }
}

if msg.Err() != nil {
    // handle
}

If the ids are packed, RepeatedInt64() will be called once. If the ids are simply repeated (unpacked), RepeatedInt64() will be called N times, but the resulting slice of ids will be the same.
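The difference between the two encodings can be sketched with the standard library alone. A packed field is one length-delimited value (wire type 2) holding all the varints back to back, while an unpacked field repeats the key before every element (wire type 0). The helper below, appendInt64s, is a hypothetical name and not part of protoscan; it assumes the message contains only varint and length-delimited fields.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendInt64s (hypothetical helper) appends every value of the given varint
// field to out, accepting both the packed and the unpacked encoding.
func appendInt64s(buf []byte, field uint64, out []int64) ([]int64, error) {
	for len(buf) > 0 {
		key, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, errors.New("bad tag")
		}
		buf = buf[n:]

		// For wire type 0 this is the value, for wire type 2 the length.
		v, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, errors.New("bad varint")
		}
		buf = buf[n:]

		switch key & 7 {
		case 0: // unpacked: one element per key
			if key>>3 == field {
				out = append(out, int64(v))
			}
		case 2: // packed: all elements inside one length-delimited value
			if uint64(len(buf)) < v {
				return nil, errors.New("truncated length-delimited value")
			}
			payload := buf[:v]
			buf = buf[v:]
			if key>>3 != field {
				continue
			}
			for len(payload) > 0 {
				x, m := binary.Uvarint(payload)
				if m <= 0 {
					return nil, errors.New("bad packed varint")
				}
				out = append(out, int64(x))
				payload = payload[m:]
			}
		default:
			return nil, fmt.Errorf("unsupported wire type %d", key&7)
		}
	}
	return out, nil
}

func main() {
	packed := []byte{0x0a, 0x03, 0x01, 0x02, 0x03}         // key, length 3, then 1 2 3
	unpacked := []byte{0x08, 0x01, 0x08, 0x02, 0x08, 0x03} // key before each element

	a, errA := appendInt64s(packed, 1, nil)
	b, errB := appendInt64s(unpacked, 1, nil)
	fmt.Println(a, errA)
	fmt.Println(b, errB)
}
```

Both inputs produce the same slice, which is why the caller-supplied buffer is appended to rather than replaced.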

For more control over the values in a packed, repeated field use an Iterator. See above for an example.

Decoding Embedded Messages

Embedded messages can be handled recursively, or the raw data can be returned and decoded using a standard/auto-generated proto.Unmarshal function.

msg := protoscan.New(encodedData)
for msg.Next() {
    fn := msg.FieldNumber()

    switch {
    // use protoscan recursively
    case fn == 1 && needFieldNumber1:
        embeddedMsg, err := msg.Message()
        if err != nil {
            // handle
        }

        for embeddedMsg.Next() {
            switch embeddedMsg.FieldNumber() {
            case 1:
                // do something
            default:
                embeddedMsg.Skip()
            }
        }

    // if you need the whole message decode the message in the standard way.
    case fn == 2 && needFieldNumber2:
        data, err := msg.MessageData()
        if err != nil {
            // handle
        }

        v := &ProtoBufThing{}
        err = proto.Unmarshal(data, v)
        if err != nil {
            // handle
        }

    default:
        msg.Skip() // every field must be read or skipped
    }
}

Handling errors

Errors can occur for two reasons:

  1. The field is being read as the incorrect type.
  2. The data is corrupted or somehow invalid.
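The two failure modes can be illustrated at the wire-format level with the standard library. This is a sketch only: errWrongType, errCorrupt, and readVarintField are hypothetical names, and protoscan's own error values may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Hypothetical error values, one per failure mode described above.
var (
	errWrongType = errors.New("field read as the wrong wire type")
	errCorrupt   = errors.New("data is truncated or invalid")
)

// readVarintField (hypothetical helper) reads the field at the start of buf
// and requires it to be varint encoded (wire type 0).
func readVarintField(buf []byte) (uint64, error) {
	key, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, errCorrupt
	}
	if key&7 != 0 {
		// The bytes are well formed, but the caller asked for the wrong type.
		return 0, errWrongType
	}
	v, m := binary.Uvarint(buf[n:])
	if m <= 0 {
		// A continuation bit promised more bytes than the input holds,
		// or the varint overflowed 64 bits.
		return 0, errCorrupt
	}
	return v, nil
}

func main() {
	inputs := [][]byte{
		{0x08, 0x2a},      // field 1, varint 42
		{0x0a, 0x01, 'x'}, // field 1, but length-delimited
		{0x08, 0x96},      // varint cut off after its first byte
	}
	for _, data := range inputs {
		v, err := readVarintField(data)
		fmt.Println(v, err)
	}
}
```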

Larger Example

Start with a customer message that has embedded orders and items, and suppose you only want to count the number of items in open orders.

message Customer {
  required int64 id = 1;
  optional string username = 2;

  repeated Order orders = 3;
  repeated int64 favorite_ids = 4 [packed=true];
}

message Order {
  required int64 id = 1;
  required bool open = 2;
  repeated Item items = 3;
}

message Item {
  // a big object
}

Sample Code:

openCount := 0
itemCount := 0
favoritesCount := 0

customer := protoscan.New(data)
for customer.Next() {
    switch customer.FieldNumber() {
    case 1: // id
        id, err := customer.Int64()
        if err != nil {
            panic(err)
        }
        _ = id // do something or skip this case if not needed

    case 2: // username
        username, err := customer.String()
        if err != nil {
            panic(err)
        }
        _ = username // do something or skip this case if not needed

    case 3: // orders
        open := false
        count := 0

        orderData, _ := customer.MessageData()
        order := protoscan.New(orderData)
        for order.Next() {
            switch order.FieldNumber() {
            case 2: // open
                v, _ := order.Bool()
                open = v
            case 3: // item
                count++

                // we're not reading the data but we still need to skip it.
                order.Skip()
            default:
                // required to move past unneeded fields
                order.Skip()
            }
        }

        if open {
            openCount++
            itemCount += count
        }
    case 4: // favorite ids
        iter, err := customer.Iterator(nil)
        if err != nil {
            panic(err)
        }

        // Typically this section would only be run once but it is valid
        // protobuf to contain multiple sections of repeated fields that should
        // be concatenated together.
        favoritesCount += iter.Count(protoscan.WireTypeVarint)
    default:
        // unread fields must be skipped
        customer.Skip()
    }
}

fmt.Printf("Open Orders: %d\n", openCount)
fmt.Printf("Items:       %d\n", itemCount)
fmt.Printf("Favorites:   %d\n", favoritesCount)

// Output:
// Open Orders: 2
// Items:       4
// Favorites:   8

Wire Type Start Group and End Group

Groups are an old protobuf wire type that has been deprecated for a long time. They function as parentheses but carry no "data length" information, so their content cannot be skipped efficiently. Only the start and end group indicators themselves can be read and skipped like any other field. Doing so would cause the data to be read without the parentheses, whatever that may mean in practice. To get the raw protobuf data inside a group, try something like:

var (
    groupFieldNum = 123
    groupData []byte
)

msg := New(data)
for msg.Next() {
    if msg.FieldNumber() == groupFieldNum && msg.WireType() == WireTypeStartGroup {
        start, end := msg.Index, msg.Index
        for msg.Next() {
            msg.Skip()
            if msg.FieldNumber() == groupFieldNum && msg.WireType() == WireTypeEndGroup {
                break
            }
            end = msg.Index
        }
        // groupData would be the raw protobuf encoded bytes of the fields in the group.
        groupData = msg.Data[start:end]
    }
}
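For comparison, here is a standard-library sketch of why a group cannot be jumped over in one step. With no length prefix, the only way to find the end is to walk every field until the matching end-group key (wire type 4) appears, tracking nesting along the way. groupBody is a hypothetical helper, not part of protoscan, and it does not verify that the end-group field number matches the start.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errBadGroup = errors.New("malformed or unterminated group")

// groupBody (hypothetical helper) expects buf to begin right after a
// start-group key and returns the bytes up to the matching end-group key.
func groupBody(buf []byte) ([]byte, error) {
	depth := 1
	pos := 0
	for pos < len(buf) {
		key, n := binary.Uvarint(buf[pos:])
		if n <= 0 {
			return nil, errBadGroup
		}
		tagStart := pos
		pos += n

		switch key & 7 {
		case 0: // varint
			_, m := binary.Uvarint(buf[pos:])
			if m <= 0 {
				return nil, errBadGroup
			}
			pos += m
		case 1: // 64-bit
			pos += 8
		case 2: // length-delimited
			l, m := binary.Uvarint(buf[pos:])
			if m <= 0 || l > uint64(len(buf)-pos-m) {
				return nil, errBadGroup
			}
			pos += m + int(l)
		case 3: // nested start group
			depth++
		case 4: // end group
			depth--
			if depth == 0 {
				return buf[:tagStart], nil
			}
		case 5: // 32-bit
			pos += 4
		default:
			return nil, errBadGroup
		}
	}
	return nil, errBadGroup // ran out of data before the group closed
}

func main() {
	// After the start-group key of field 1 (0x0b): field 1 = 42, then the
	// end-group key of field 1 (0x0c), then an unrelated field 2 = 1.
	data := []byte{0x08, 0x2a, 0x0c, 0x10, 0x01}
	body, err := groupBody(data)
	fmt.Println(body, err)
}
```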

Similar libraries in other languages

  • protozero - C++, the inspiration for this library
  • pbf - JavaScript

protoscan's Issues

io.Reader/streaming support

Thanks for this library. As far as I can tell, the Go Proto ecosystem seems to be otherwise entirely void of a proto reader that isn't "greedy" and copies things a million times.
I was just reading the code and noticed you must provide a []byte. My use cases are large files/blobs with a fixed proto schema with certain large bytes fields I want to scan through once. Any plans or ideas on how io.Reader could be supported? (Proto seems amenable to lazily passing through a stream once while only keeping the current field in memory - but I might be mistaken.)
