ianlopshire / go-fixedwidth


Encoding and decoding for fixed-width formatted data

Home Page: http://godoc.org/github.com/ianlopshire/go-fixedwidth

License: MIT License

Go 100.00%

fixed-width encoding decoding

go-fixedwidth's Introduction

fixedwidth

Package fixedwidth provides encoding and decoding for fixed-width formatted Data.

go get github.com/ianlopshire/go-fixedwidth

Usage

Struct Tags

The struct tag schema used by fixedwidth is: fixed:"{startPos},{endPos},[{alignment},[{padChar}]]"¹.

The startPos and endPos arguments control the position within a line. startPos and endPos must both be positive integers greater than 0. Positions start at 1. The interval is inclusive.
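To make the 1-based, inclusive convention concrete, here is a minimal sketch that maps such an interval onto a Go slice expression. The helper name field is hypothetical and is not part of the fixedwidth API; it works on bytes and clips intervals that run past the end of the line, which is an assumption made for this example only.

```go
package main

import "fmt"

// field returns the bytes of line covered by the 1-based, inclusive
// interval [startPos, endPos]. Hypothetical helper, not library code.
// Intervals that run past the end of the line are clipped.
func field(line string, startPos, endPos int) string {
	if startPos < 1 || endPos < startPos || startPos > len(line) {
		return ""
	}
	if endPos > len(line) {
		endPos = len(line)
	}
	// Position 1 is index 0; the inclusive end maps to an exclusive index.
	return line[startPos-1 : endPos]
}

func main() {
	line := "1    Ian       Lopshire"
	fmt.Printf("%q\n", field(line, 1, 5))
	fmt.Printf("%q\n", field(line, 6, 15))
	fmt.Printf("%q\n", field(line, 16, 25))
}
```

An interval such as 6,15 therefore covers ten positions, not nine.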

The alignment argument controls the alignment of the value within its interval. The valid options are default², right, left, and none. The alignment is optional and can be omitted.

The padChar argument controls the character that will be used to pad any empty characters in the interval after writing the value. The default padding character is a space. The padChar is optional and can be omitted.
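As a rough illustration of how alignment and padChar interact when a value is written, the sketch below pads a value to a given width. The pad function is hypothetical and byte-based; truncating over-long values on the right is an assumption for the example, not a statement about what the encoder does.

```go
package main

import (
	"fmt"
	"strings"
)

// pad fits value into width bytes. align is "left" or "right" and padChar
// fills the unused positions. Over-long values are cut on the right
// (an assumption made for this sketch).
func pad(value string, width int, align string, padChar byte) string {
	if len(value) >= width {
		return value[:width]
	}
	fill := strings.Repeat(string(padChar), width-len(value))
	if align == "right" {
		return fill + value
	}
	return value + fill
}

func main() {
	fmt.Printf("%q\n", pad("42", 5, "right", '0')) // "00042"
	fmt.Printf("%q\n", pad("Ian", 6, "left", ' ')) // "Ian   "
}
```

In tag form, the first call corresponds to something like fixed:"1,5,right,0".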

Fields without tags are ignored.

Encode

// define some data to encode
people := []struct {
    ID        int     `fixed:"1,5"`
    FirstName string  `fixed:"6,15"`
    LastName  string  `fixed:"16,25"`
    Grade     float64 `fixed:"26,30"`
    Age       uint    `fixed:"31,33"`
    Alive     bool    `fixed:"34,39"`
}{
    {1, "Ian", "Lopshire", 99.5, 20, true},
}

data, err := fixedwidth.Marshal(people)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%s", data)
// Output:
// 1    Ian       Lopshire  99.5020 true

Decode

// define the format
var people []struct {
    ID        int     `fixed:"1,5"`
    FirstName string  `fixed:"6,15"`
    LastName  string  `fixed:"16,25"`
    Grade     float64 `fixed:"26,30"`
    Age       uint    `fixed:"31,33"`
    Alive     bool    `fixed:"34,39"`
    Github    bool    `fixed:"40,41"`
}

// define some fixed-width data to parse
data := []byte("" +
    "1    Ian       Lopshire  99.50 20 false f" + "\n" +
    "2    John      Doe       89.50 21 true t" + "\n" +
    "3    Jane      Doe       79.50 22 false F" + "\n" +
    "4    Ann       Carraway  79.59 23 false T" + "\n")

err := fixedwidth.Unmarshal(data, &people)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("%+v\n", people[0])
fmt.Printf("%+v\n", people[1])
fmt.Printf("%+v\n", people[2])
fmt.Printf("%+v\n", people[3])
// Output:
//{ID:1 FirstName:Ian LastName:Lopshire Grade:99.5 Age:20 Alive:false Github:false}
//{ID:2 FirstName:John LastName:Doe Grade:89.5 Age:21 Alive:true Github:true}
//{ID:3 FirstName:Jane LastName:Doe Grade:79.5 Age:22 Alive:false Github:false}
//{ID:4 FirstName:Ann LastName:Carraway Grade:79.59 Age:23 Alive:false Github:true}

It is also possible to read data incrementally:

decoder := fixedwidth.NewDecoder(bytes.NewReader(data))
for {
    var element myStruct
    err := decoder.Decode(&element)
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    handle(element)
}

UTF-8, Codepoints, and Multibyte Characters

fixedwidth supports encoding and decoding fixed-width data where indices are expressed in Unicode codepoints rather than raw bytes. The data must be UTF-8 encoded.

decoder := fixedwidth.NewDecoder(strings.NewReader(data))
decoder.SetUseCodepointIndices(true)
// Decode as usual now

buff := new(bytes.Buffer)
encoder := fixedwidth.NewEncoder(buff)
encoder.SetUseCodepointIndices(true)
// Encode as usual now
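The difference between byte indices and codepoint indices can be seen without the library at all. The sketch below uses two hypothetical helpers, byBytes and byCodepoints, and performs no bounds checking; it only shows why the same positions select different data once a multi-byte character is present.

```go
package main

import "fmt"

// byBytes slices s using 1-based, inclusive byte positions.
func byBytes(s string, start, end int) string {
	return s[start-1 : end]
}

// byCodepoints slices s using 1-based, inclusive codepoint positions.
func byCodepoints(s string, start, end int) string {
	r := []rune(s)
	return string(r[start-1 : end])
}

func main() {
	line := "Çabc" // "Ç" occupies two bytes in UTF-8

	// Byte positions 2-4 start in the middle of "Ç".
	fmt.Printf("%q\n", byBytes(line, 2, 4)) // "\x87ab"

	// Codepoint positions 2-4 skip "Ç" as a single unit.
	fmt.Printf("%q\n", byCodepoints(line, 2, 4)) // "abc"
}
```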

Alignment Behavior

Alignment	Encoding	Decoding
`default`	Field is left aligned	The padding character is trimmed from both right and left of value
`left`	Field is left aligned	The padding character is trimmed from right of value
`right`	Field is right aligned	The padding character is trimmed from left of value
`none`	Field is left aligned	The padding character is not trimmed from value. Useful for nested structs.
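The decoding column of the table above can be approximated with the standard strings trimming functions. This is a sketch only: trimField is a hypothetical name, and the real decoder may differ in details (the default alignment in particular exists for backwards compatibility).

```go
package main

import (
	"fmt"
	"strings"
)

// trimField applies the trimming described for each alignment option.
// Illustration only; not the library's implementation.
func trimField(raw, alignment, padChar string) string {
	switch alignment {
	case "left":
		return strings.TrimRight(raw, padChar)
	case "right":
		return strings.TrimLeft(raw, padChar)
	case "none":
		return raw
	default: // "default" or omitted
		return strings.Trim(raw, padChar)
	}
}

func main() {
	raw := "  Ian  "
	for _, a := range []string{"default", "left", "right", "none"} {
		fmt.Printf("%-7s %q\n", a, trimField(raw, a, " "))
	}
}
```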

Notes

¹ {} indicates an argument. [] indicates an optional segment.
² The default alignment is similar to left but has slightly different behavior, which is required to maintain backwards compatibility.

Licence

MIT

go-fixedwidth's Issues

Use codepoint indices should be opt-out instead of opt-in

The encoder and decoder currently support multi-byte characters, but it requires library users to enable the feature explicitly. This behavior is unintuitive and leads to confusion (see #52).

decoder := fixedwidth.NewDecoder(strings.NewReader(data))
decoder.SetUseCodepointIndices(true)
// Decode as usual now

buff := new(bytes.Buffer)
encoder := fixedwidth.NewEncoder(buff)
encoder.SetUseCodepointIndices(true)
// Encode as usual now

There is still a performance cost associated with supporting multi-byte characters, so I'd like to keep the feature in the library.

Multi-byte character support should be enabled by default, but there should still be an option to opt out.

This is a breaking change and should be released as part of a major version bump.

Go get not finding v0.4.4

I think this is because the tag has a capital V, but I'm not positive.

Encoder Strict Mode

There should be an opt-in strict mode on the encoder that returns an error when a value does not fit in the available space. It should also return an error when the intervals defined for a struct overlap.
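One way the two checks might look is sketched below. Everything here is hypothetical (the interval type, fits, and findOverlap are invented for the example) and the sketch measures lengths in bytes; it only illustrates the checks the proposal describes.

```go
package main

import (
	"fmt"
	"sort"
)

// interval describes one tagged field: 1-based, inclusive positions.
type interval struct {
	name       string
	start, end int
}

// fits reports whether value fits in the space reserved by iv.
func fits(value string, iv interval) bool {
	return len(value) <= iv.end-iv.start+1
}

// findOverlap returns an error naming the first pair of overlapping fields.
// After sorting by start, any overlap shows up between neighbours.
func findOverlap(fields []interval) error {
	sorted := append([]interval(nil), fields...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })
	for i := 1; i < len(sorted); i++ {
		prev, cur := sorted[i-1], sorted[i]
		if cur.start <= prev.end {
			return fmt.Errorf("fields %s [%d,%d] and %s [%d,%d] overlap",
				prev.name, prev.start, prev.end, cur.name, cur.start, cur.end)
		}
	}
	return nil
}

func main() {
	fields := []interval{{"ID", 1, 5}, {"Name", 5, 15}}
	fmt.Println(findOverlap(fields))
	fmt.Println(fits("123456", fields[0]))
}
```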

Add formatting options (e.g. left pad)

The encoder should support common formatting needs such as left padding.

Proposed Spec

Formatting Options

default - No padding is applied to the value.
rightpad - The value is padded on the right to fill available space.
leftpad - The value is padded on the left to fill available space.

In all cases the value is written to the available space in a left-to-right manner. If the value length is greater than available space, the right most characters will be omitted.

Struct Tags

The struct tag schema will be updated to support an optional third option to specify formatting: fixed:"{startPos},{endPos},{format}".

Padding Characters

Types	Padding Character
`int`, `int8`, `int16`, `int32`, `int64`	`0`
`uint`, `uint8`, `uint16`, `uint32`, `uint64`	`0`
`float32`, `float64`	`0`
`string`, `[]byte`	`\u0020` (space)

Any type not listed will default to being padded with \u0020 (space).
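The table above could be expressed as a switch on reflect.Kind. This is a sketch of the proposal only; defaultPadChar is a hypothetical name and nothing here is claimed to match the released library.

```go
package main

import (
	"fmt"
	"reflect"
)

// defaultPadChar returns the padding character proposed for v's type:
// '0' for integer and float kinds, a space for everything else.
func defaultPadChar(v interface{}) rune {
	if v == nil {
		return ' '
	}
	switch reflect.TypeOf(v).Kind() {
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64,
		reflect.Float32, reflect.Float64:
		return '0'
	default:
		// Strings, []byte, bools, and any unlisted type.
		return ' '
	}
}

func main() {
	fmt.Printf("%q %q %q\n", defaultPadChar(42), defaultPadChar(1.5), defaultPadChar("x"))
}
```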

`Decoder.readLine` might have an error

When reading code for readLine, I spotted a naked return:

// readLine reads the next line of data. False is returned if there is no remaining data
// to read.
func (d *Decoder) readLine(v reflect.Value) (err error, ok bool) {
	ok = d.scanner.Scan()
	if !ok {
		if d.scanner.Err() != nil {
			return d.scanner.Err(), false
		}

		d.done = true
		return nil, false
	}

	line := string(d.scanner.Bytes())

	rawValue, err := newRawValue(line, d.useCodepointIndices)
	if err != nil {
		return
	}
...
}

This might be an error, as it appears to return the zero value for the error and false for ok, which doesn't seem right. Shouldn't it return err, false?
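When weighing this report, it may help to see how Go treats a naked return in a function shaped like the excerpt. The sketch below is stand-alone; parse and step are hypothetical names. Under the Go specification, a := at the top level of a function body reuses a named result from the signature instead of shadowing it, so the naked return here carries the parse error together with the ok value assigned earlier. Whether that is the intended result for readLine is a separate question, since the rest of that function is elided.

```go
package main

import (
	"errors"
	"fmt"
)

// parse stands in for a call that may fail.
func parse(fail bool) (string, error) {
	if fail {
		return "", errors.New("parse failed")
	}
	return "value", nil
}

// step mimics the shape of the excerpt: named results, ok assigned first,
// then := at function scope followed by a naked return.
func step(fail bool) (err error, ok bool) {
	ok = true
	v, err := parse(fail) // v is new; err is the named result, reused
	if err != nil {
		return // returns the current values of err and ok
	}
	_ = v
	return nil, true
}

func main() {
	err, ok := step(true)
	fmt.Println(err, ok) // parse failed true
}
```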

Cannot unmarshal when a struct has just one field of length one

func TestUnmarshalLengthOne(t *testing.T) {
	type simpleType struct {
		One string `fixed:"1,1"`
	}

	b := []byte("B")
	var simple simpleType

	err := Unmarshal(b, &simple)
	if err != nil {
		t.Errorf("Unmarshal should succeed, have %s", err)
	}

	if simple.One != "B" {
		t.Errorf("value should be %q, have %q", "B", simple.One)
	}
}

This test does not pass.

Nested structs should not require the use of the `none` format.

Currently, to decode nested structs properly, you must use the none format option.

type Nested struct {
	First  string `fixed:"1,3"`
	Second string `fixed:"4,6"`
}

type Test struct {
	First  string `fixed:"1,3"`
	Second Nested `fixed:"4,9,none"`
}

This is a breaking change and should be released as part of a major version bump.

Behavior of complex types

The behavior of more complex types needs to be defined and implemented.

nested structs with tag
nested structs without tag
embedded struct with tag
embedded struct without tag

type Nested struct {
	F1 string `fixed:"1,10"`
	F2 struct {
		E1 string `fixed:"11,20"`
		E2 string `fixed:"21,30"`
	}
}

type NestedWithTag struct {
	F1 string `fixed:"1,10"`
	F2 struct {
		E1 string `fixed:"1,10"`
		E2 string `fixed:"11,20"`
	} `fixed:"11,30"`
}

type S1 struct {
	F1 string `fixed:"1,10"`
	F4 string `fixed:"31,40"`
}

type Embedded struct {
	S1
	F2 string `fixed:"11,20"`
	F3 string `fixed:"21,30"`
}

type S2 struct {
	F3 string `fixed:"1,10"`
	F4 string `fixed:"11,20"`
}

type EmbeddedWithTag struct {
	F2 string `fixed:"1,10"`
	F3 string `fixed:"11,20"`
	S2 `fixed:"21,40"`
}

Simplify the design of go-fixedwidth

Hi @ianlopshire,

When I have a struct with about 80 fields, I have to spend time calculating the start and end position for each field, which makes the package really hard to use.
So why don't we just use a fixed length (in bytes or codepoints) for each field? I think it would be easier to implement and would give a better user experience.

My suggestion:

type Nested struct {
	F1 string `fixed:"10"`
	F2 struct {
		E1 string `fixed:"10"`
		E2 string `fixed:"10"`
	}
}
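The length-based scheme suggested above can be reduced to start and end positions with a running total. The following sketch assumes fields are listed in file order; span and spansFromWidths are hypothetical names used only for illustration.

```go
package main

import "fmt"

// span holds 1-based, inclusive start and end positions.
type span struct{ start, end int }

// spansFromWidths converts ordered field widths into positions.
func spansFromWidths(widths []int) []span {
	spans := make([]span, 0, len(widths))
	next := 1
	for _, w := range widths {
		spans = append(spans, span{start: next, end: next + w - 1})
		next += w
	}
	return spans
}

func main() {
	// Three fields of width 10, as in the suggestion above.
	fmt.Println(spansFromWidths([]int{10, 10, 10})) // [{1 10} {11 20} {21 30}]
}
```

The trade-off is that widths alone cannot describe gaps or out-of-order fields, which explicit start and end positions can.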

Cache descriptions of structs to avoid repeated reparsing of tags

In a benchmark I have of repeatedly calling Decode(&myStruct), nearly 20% of the time is being spent parsing tags.

The stdlib json package caches a computed description of each struct type it receives in order to avoid this sort of overhead.
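A cache of this kind might be keyed by reflect.Type and stored in a sync.Map, roughly as sketched below. The names fieldSpec, specCache, cachedSpecs, and parseCount are hypothetical, the tag is stored unparsed to keep the example short, and the input is assumed to be a struct value.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
	"sync/atomic"
)

// fieldSpec records which struct field carries which fixed tag.
type fieldSpec struct {
	index int
	tag   string
}

var (
	specCache sync.Map // reflect.Type -> []fieldSpec
	parses    int64    // number of cache misses, for demonstration
)

// parseCount reports how many times tags were actually inspected.
func parseCount() int64 { return atomic.LoadInt64(&parses) }

// cachedSpecs returns the tagged fields of the struct value v, inspecting
// the tags only the first time a given type is seen.
func cachedSpecs(v interface{}) []fieldSpec {
	t := reflect.TypeOf(v)
	if cached, ok := specCache.Load(t); ok {
		return cached.([]fieldSpec)
	}
	atomic.AddInt64(&parses, 1)
	var specs []fieldSpec
	for i := 0; i < t.NumField(); i++ {
		if tag, ok := t.Field(i).Tag.Lookup("fixed"); ok {
			specs = append(specs, fieldSpec{index: i, tag: tag})
		}
	}
	actual, _ := specCache.LoadOrStore(t, specs)
	return actual.([]fieldSpec)
}

func main() {
	type record struct {
		ID   int    `fixed:"1,5"`
		Note string // untagged, ignored
		Name string `fixed:"6,15"`
	}
	for i := 0; i < 3; i++ {
		cachedSpecs(record{})
	}
	fmt.Println(len(cachedSpecs(record{})), parseCount())
}
```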

Panic when using SetUseCodepointIndices

Version: v0.9.3

To recreate the issue, I've boiled it down to:

encoding with SetUseCodepointIndices enabled
have some field which contains a non-ASCII character
have some field at the end
the last field is right aligned
the last field is empty

package main

import (
	"bytes"
	"fmt"

	"github.com/ianlopshire/go-fixedwidth"
)

func main() {
	var v struct {
		Foo string `fixed:"1,1"`
		Bar string `fixed:"2,2,right"`
	}
	v.Foo = "Ç"

	buf := new(bytes.Buffer)
	e := fixedwidth.NewEncoder(buf)
	e.SetUseCodepointIndices(true)
	_ = e.Encode(v)
	fmt.Printf("%q\n", buf)
}

Output:

panic: runtime error: index out of range [2] with length 2

goroutine 1 [running]:
github.com/ianlopshire/go-fixedwidth.(*lineBuilder).WriteValue(0xc000092ce0, 0x2, {{0x0, 0x0}, {0x0, 0x0, 0x0}})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/buff.go:63 +0x4a5
github.com/ianlopshire/go-fixedwidth.valueEncoder.Write(0x0?, 0x0?, {0x10acd00?, 0xc0000b8010?, 0x0?}, {0x2, 0x2, 0xc0000ac050, 0xc0000ac060, 0x10c7e38, ...})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:198 +0x31e
github.com/ianlopshire/go-fixedwidth.structEncoder.func1({0x10b2540?, 0xc0000b8000?, 0x203000?})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:244 +0x415
github.com/ianlopshire/go-fixedwidth.(*Encoder).writeLine(0xc000092f30, {0x10b2540?, 0xc0000b8000?, 0x1009c6b?})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:146 +0x159
github.com/ianlopshire/go-fixedwidth.(*Encoder).Encode(0xc000092f30, {0x10b2540, 0xc0000b8000?})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:112 +0x16f
main.main()
        /.../main.go:20 +0x171
exit status 2

Expected output:

"Ç "

Optional TrimSpace

The decoder automatically trims spaces which I guess works for most people. I have a case where I do not want to trim spaces. Would it be possible to allow this to be configured via an option on the decoder?

Version 1.0.0

This issue exists to track breaking changes that should be made for a v1.0.0 release.

does not support bool type?

go-fixedwidth does not support bool type in Struct?

package main

import (
	"fmt"
	"github.com/ianlopshire/go-fixedwidth"
	"log"
	"strings"
)

func main() {

	type Book struct {
		F1 bool `fixed:"1,5,left, "`
		F2 bool `fixed:"6,10,left, "`
		F3 int  `fixed:"11,12"`
	}

	s := `true true 1
falsefalse2
true false3`

	for _, s := range strings.Split(s, "\n") {
		var b Book
		if err := fixedwidth.Unmarshal([]byte(s), &b); err != nil {
			log.Fatal(err)
		}

		fmt.Printf("%+v\n", b)
	}

}

The output is below.

2021/01/31 23:23:16 fixedwidth: cannot unmarshal true true 1 into Go struct field Book.F1 of type bool:fixedwidth: unknown type
exit status 1

I want support for bool types.

Encoder Does Not Support Multi-Byte Codepoints.

#10 added support for decoding based on UTF-8 codepoints. The Encoder needs to be updated in a similar manner.

Allow default formatting to be configured at the struct level

As of v0.7.0 formatting is configurable at the struct field level.

type Record struct {
	Field1 string `fixed:"1,5,left,#"`
	Field2 string `fixed:"6,10,left,#"`
	Field3 string `fixed:"11,15,left,#"`
	Field4 string `fixed:"16,20,left,#"`
	...
}

Adding all of the required tags can be tedious when all of the fields require a specific format. To alleviate this, there should be a mechanism to set the default format for all the fields in a struct.

My current thought is to implement something similar to xml.Name.

type Record struct {
	// Format is a special struct that can be embedded into a struct to control
	// the default formatting of its fields.
	fixedwidth.Format `fixed:"left,#"`

	Field1 string `fixed:"1,5"`
	Field2 string `fixed:"6,10"`
	Field3 string `fixed:"11,15"`
	Field4 string `fixed:"16,20,right,0"` // Override the default formatting.
	...
}

Incorrectly getting io.EOF before the end of file when using Decoder.Decode

When there is a very long line (specifically, 64 * 1024 = 65536 characters or longer) in a file processed by Decoder.Decode, the Decode method returns io.EOF when it reaches the very long line, even when it is not the end of the file.

To reproduce this bug, see the sample code and test file in this small repo. These are based on the sample code in the fixedwidth repo's README here. In the small repo I provided, the test file contains 4 data lines, similar to the sample data in the README, except that the 2nd line has over 65536 characters. The sample code uses the Decoder to read each line and print out the struct version of the line. It prints the first line correctly, then prints a message indicating that io.EOF was returned for the second line, e.g.:

go run main.go
{ID:1 FirstName:Ian LastName:Lopshire Grade:99.5 Age:20 Alive:false Github:f}
Got EOF%

I believe this may be happening because lines 108-113 of decode.go return io.EOF if, after calling readLine, the Decoder’s done flag is true and the returned values of readLine are err == nil && !ok. Looking into readLine (specifically, lines 162-166), if the result of the Decoder’s underlying Scanner object’s call to Scan() returns false, these conditions will be met. Looking into the Scan function in bufio/scan.go here, if the buffer length is greater than the maxTokenSize (which is 64 * 1024 = 65536), the method returns false. This false is then being used by Decoder to return io.EOF, even though this is not the end of the file.
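The underlying bufio.Scanner behavior can be reproduced with the standard library alone. In the sketch below, countLines and longLineInput are hypothetical helpers. With the default limit, Scan stops at the long line and Err reports bufio.ErrTooLong rather than a clean end of input; raising the limit with Scanner.Buffer lets the same data scan fully.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// longLineInput builds three lines, the second of which is exactly
// bufio.MaxScanTokenSize (64 * 1024) characters long.
func longLineInput() string {
	long := strings.Repeat("x", bufio.MaxScanTokenSize)
	return "short\n" + long + "\nlast\n"
}

// countLines scans input line by line and returns the number of lines read
// plus the scanner's error. maxLine <= 0 keeps bufio's default limit.
func countLines(input string, maxLine int) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(input))
	if maxLine > 0 {
		sc.Buffer(make([]byte, 0, 64*1024), maxLine)
	}
	n := 0
	for sc.Scan() {
		n++
	}
	// Scan returning false is not necessarily EOF; Err tells them apart.
	return n, sc.Err()
}

func main() {
	n, err := countLines(longLineInput(), 0)
	fmt.Println(n, err) // 1 bufio.Scanner: token too long

	n, err = countLines(longLineInput(), 1024*1024)
	fmt.Println(n, err) // 3 <nil>
}
```

A decoder built on a Scanner would need to check Err after a false Scan before reporting io.EOF, and possibly expose a way to raise the buffer limit.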

feature request: Allow for specifying the lengths of the fields instead of start and end

It's much more convenient to be able to say fieldA:3 fieldB:4 etc. than to calculate the start and end for all the fields.

I realize this means you have to put them in order, but that's what most people will do anyway.

Potential panic or invalid data when using UTF-8 codepoint boundaries when decoding into a nested struct

Hello,

I have noticed a bug that causes a panic when decoding into a nested struct when using codepoint indices as your boundaries rather than bytes. Take the following example:

func TestDecodeSetUseCodepointIndices_Nested(t *testing.T) {
	type Nested struct {
		First  string `fixed:"1,3"`
		Second string `fixed:"4,6"`
	}

	type Test struct {
		First  string `fixed:"1,3"`
		Second Nested `fixed:"4,9"`
		Third  string `fixed:"10,12"`
		Fourth Nested `fixed:"13,18"`
		Fifth  string `fixed:"19,21"`
	}

	for _, tt := range []struct {
		name     string
		raw      []byte
		expected Test
	}{
		{
			name: "Multi-byte characters",
			raw:  []byte("123x☃x456x☃x789x☃x012\n"),
			expected: Test{
				First:  "123",
				Second: Nested{First: "x☃x", Second: "456"},
				Third:  "x☃x",
				Fourth: Nested{First: "789", Second: "x☃x"},
				Fifth:  "012",
			},
		},
	} {
		t.Run(tt.name, func(t *testing.T) {
			d := NewDecoder(bytes.NewReader(tt.raw))
			d.SetUseCodepointIndices(true)
			var s Test
			err := d.Decode(&s)
			if err != nil {
				t.Errorf("Unexpected err: %v", err)
			}
			if !reflect.DeepEqual(tt.expected, s) {
				t.Errorf("Decode(%v) want %v, have %v", tt.raw, tt.expected, s)
			}
		})
	}
}

Currently, this causes a panic due to codepoint indices not being adjusted when trimming data from the front of the string in decode.go:rawValueFromLine.

I believe the issue is here (decode.go Ln. 217):

	if value.codepointIndices != nil {
		if len(value.codepointIndices) == 0 || startPos > len(value.codepointIndices) {
			return rawValue{data: ""}
		}
		var relevantIndices []int
		var lineData string
		if endPos >= len(value.codepointIndices) {
			relevantIndices = value.codepointIndices[startPos-1:]
			lineData = value.data[relevantIndices[0]:]
		} else {
			relevantIndices = value.codepointIndices[startPos-1 : endPos]
			lineData = value.data[relevantIndices[0]:value.codepointIndices[endPos]]
		}
	} else { // truncated
	}

Note that lineData is trimmed from the left but the codepoint indices are not adjusted to match, which can cause an index out of bounds, or reading from the wrong part of the data string.
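The adjustment described here amounts to rebasing the offsets whenever the data is trimmed from the left. The sketch below is independent of the library and of PR #60; codepointOffsets and sub are hypothetical names showing one way to keep a substring and its codepoint offsets in step.

```go
package main

import "fmt"

// codepointOffsets returns the byte offset at which each codepoint starts.
func codepointOffsets(s string) []int {
	offsets := make([]int, 0, len(s))
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

// sub returns the codepoints in the 1-based, inclusive interval
// [start, end] together with offsets rebased to index the returned string.
func sub(s string, offsets []int, start, end int) (string, []int) {
	if start < 1 || start > len(offsets) {
		return "", nil
	}
	if end > len(offsets) {
		end = len(offsets)
	}
	if end < start {
		return "", nil
	}
	base := offsets[start-1]
	stop := len(s)
	if end < len(offsets) {
		stop = offsets[end]
	}
	rebased := make([]int, 0, end-start+1)
	for _, off := range offsets[start-1 : end] {
		rebased = append(rebased, off-base) // shift to the new origin
	}
	return s[base:stop], rebased
}

func main() {
	line := "123x☃x456"
	nested, nestedOffsets := sub(line, codepointOffsets(line), 4, 9)
	fmt.Printf("%q %v\n", nested, nestedOffsets) // "x☃x456" [0 1 4 5 6 7]

	inner, _ := sub(nested, nestedOffsets, 4, 6)
	fmt.Printf("%q\n", inner) // "456"
}
```

Without the rebasing step, the second call would index the shorter string with offsets that still refer to the original line.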

I have created a fix in PR #60 for your review.

Decoding into slices doesn't return EOF

When v in Decode(v interface{}) is a pointer to a slice, readLines doesn't return EOF.

I suggest this change below:

func (d *Decoder) readLines(v reflect.Value) (err error) {
	ct := v.Type().Elem()
	for {
		nv := reflect.New(ct).Elem()
		err, ok := d.readLine(nv)
		if err != nil {
			return err
		}
		if ok {
			v.Set(reflect.Append(v, nv))
		}
		if d.done {
-			break
+			return io.EOF
		}
	}
	return nil
}

Does the library support lists?

Hi,
Does the library support repeated lists, where the list length is not known?

type Individual struct {
	Name string
	Age  int
}

type Population struct {
	Country    string
	Citizens   []Individual
	AverageAge float64
}

[Question] Fixed width files with Header and Trailer

Any suggestions for using this package to deal with fixed-width files that have a header and a trailer?
It's common to have a line prefix that identifies different schemas in the same file.
Do you think it makes sense for this package to handle this type of file?
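One approach that needs no library support is to split the input into lines, look at the prefix, and hand each line to the struct that matches. The sketch below only shows the dispatch step; the "H", "D", and "T" prefixes, the sample data, and the classify helper are all made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// classify returns the record kind for a line based on a one-character
// prefix. The prefixes are invented for this example.
func classify(line string) string {
	switch {
	case strings.HasPrefix(line, "H"):
		return "header"
	case strings.HasPrefix(line, "D"):
		return "detail"
	case strings.HasPrefix(line, "T"):
		return "trailer"
	default:
		return "unknown"
	}
}

func main() {
	data := "H20240101\nD0001Ian\nD0002Ann\nT0002\n"
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		// Each kind could be unmarshalled into its own tagged struct here.
		fmt.Printf("%-7s %q\n", classify(line), line)
	}
	if err := sc.Err(); err != nil {
		fmt.Println("read error:", err)
	}
}
```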

No way to tell if at EOF when using Decoder.Decode() with struct

Looking at:

https://github.com/ianlopshire/go-fixedwidth/blob/master/decode.go#L81-L93

and

https://github.com/ianlopshire/go-fixedwidth/blob/master/decode.go#L113-L128

As far as I can tell, readLine eats the EOF error -- it returns an ok bool, but Decode throws that away. This could be fixed by having Decode transform the ok bool back into an err == io.EOF, or the done bool could be exposed via a public API.
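The first strategy, turning the ok flag back into io.EOF, might look like the sketch below. lineReader, next, and readAll are stand-ins invented for this example; they are not the library's Decoder.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// lineReader is a stand-in for a decoder that wraps a bufio.Scanner.
type lineReader struct {
	sc *bufio.Scanner
}

// next returns the next line, the scanner's error if reading failed, or
// io.EOF once the input is exhausted.
func (r *lineReader) next() (string, error) {
	if r.sc.Scan() {
		return r.sc.Text(), nil
	}
	if err := r.sc.Err(); err != nil {
		return "", err
	}
	return "", io.EOF
}

// readAll drains input using the usual "loop until io.EOF" pattern.
func readAll(input string) ([]string, error) {
	r := &lineReader{sc: bufio.NewScanner(strings.NewReader(input))}
	var lines []string
	for {
		line, err := r.next()
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
		lines = append(lines, line)
	}
}

func main() {
	lines, err := readAll("1    Ian\n2    John\n")
	fmt.Println(len(lines), err) // 2 <nil>
}
```

Returning io.EOF keeps the caller's loop identical to the one used with other streaming decoders in the standard library.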

I'm happy to implement a PR for either one of these strategies that you prefer.

Thanks for this package!

Support for codepoint-based fixedwidth files

I have an extremely bonkers fixed-width file format where the file is utf-8 encoded, but the fixed offsets are expressed in decoded codepoints. Naturally this doesn't play super well with go-fixedwidth's (ENTIRELY REASONABLE) byte-based approach :-)

I'd like to find some way to shoe-horn this into go-fixedwidth (and obviously contribute this upstream).

I think the most efficient implementation is probably, if you're doing this codepoint mode, to convert the line to []rune in readLine and, in rawValueFromLine, convert it back to a []byte with the UTF-8 value.

And I think that'd all work fine. But I don't see an obvious way to do this without throwing a bunch of if statements in here and changing a bunch of types (e.g. I could replace all the []byte with struct { bytes []byte; runes []rune } and then stick ifs in readLine and rawValueFromLine).

Do you have an opinion on if there's a better implementation strategy? Would I be better off just forking decode.go?
