ianlopshire / go-fixedwidth

Encoding and decoding for fixed-width formatted data

Home Page: http://godoc.org/github.com/ianlopshire/go-fixedwidth

License: MIT License

Go 100.00%
fixed-width encoding decoding

go-fixedwidth's Introduction

fixedwidth

Package fixedwidth provides encoding and decoding for fixed-width formatted data.

go get github.com/ianlopshire/go-fixedwidth

Usage

Struct Tags

The struct tag schema used by fixedwidth is: fixed:"{startPos},{endPos},[{alignment},[{padChar}]]" (see note 1).

The startPos and endPos arguments control the position within a line. startPos and endPos must both be positive integers greater than 0. Positions start at 1. The interval is inclusive.

The alignment argument controls the alignment of the value within its interval. The valid options are default (see note 2), right, left, and none. The alignment is optional and can be omitted.

The padChar argument controls the character that will be used to pad any empty characters in the interval after writing the value. The default padding character is a space. The padChar is optional and can be omitted.

Fields without tags are ignored.
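
To make the tag schema concrete, here is a minimal sketch of how such a tag could be split into its parts using only the standard library. It is not the package's own parser: fieldSpec and parseFixedTag are hypothetical names, the check that endPos is not smaller than startPos is an assumption, and the pad character is assumed to be a single byte.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// fieldSpec is a hypothetical holder for the parsed tag values; it is not a
// type from the fixedwidth package.
type fieldSpec struct {
	Start, End int
	Alignment  string
	PadChar    byte
}

// parseFixedTag parses "{startPos},{endPos},[{alignment},[{padChar}]]".
// Omitted optional parts fall back to "default" alignment and a space.
func parseFixedTag(tag string) (fieldSpec, error) {
	spec := fieldSpec{Alignment: "default", PadChar: ' '}
	parts := strings.SplitN(tag, ",", 4)
	if len(parts) < 2 {
		return spec, fmt.Errorf("tag %q: startPos and endPos are required", tag)
	}
	var err error
	if spec.Start, err = strconv.Atoi(parts[0]); err != nil {
		return spec, fmt.Errorf("tag %q: bad startPos: %w", tag, err)
	}
	if spec.End, err = strconv.Atoi(parts[1]); err != nil {
		return spec, fmt.Errorf("tag %q: bad endPos: %w", tag, err)
	}
	// Positions start at 1. Requiring End >= Start is an assumption here.
	if spec.Start < 1 || spec.End < spec.Start {
		return spec, fmt.Errorf("tag %q: need 1 <= startPos <= endPos", tag)
	}
	if len(parts) > 2 && parts[2] != "" {
		spec.Alignment = parts[2]
	}
	if len(parts) > 3 && parts[3] != "" {
		spec.PadChar = parts[3][0]
	}
	return spec, nil
}

func main() {
	type person struct {
		ID   int    `fixed:"1,5,right,0"`
		Name string `fixed:"6,15"`
		Note string // no tag, so it is skipped
	}
	t := reflect.TypeOf(person{})
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		tag, ok := f.Tag.Lookup("fixed")
		if !ok {
			fmt.Printf("%s: ignored\n", f.Name)
			continue
		}
		spec, err := parseFixedTag(tag)
		fmt.Printf("%s: %+v err=%v\n", f.Name, spec, err)
	}
}
```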

Encode

// define some data to encode
people := []struct {
    ID        int     `fixed:"1,5"`
    FirstName string  `fixed:"6,15"`
    LastName  string  `fixed:"16,25"`
    Grade     float64 `fixed:"26,30"`
    Age       uint    `fixed:"31,33"`
    Alive     bool    `fixed:"34,39"`
}{
    {1, "Ian", "Lopshire", 99.5, 20, true},
}

data, err := fixedwidth.Marshal(people)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%s", data)
// Output:
// 1    Ian       Lopshire  99.5020 true

Decode

// define the format
var people []struct {
    ID        int     `fixed:"1,5"`
    FirstName string  `fixed:"6,15"`
    LastName  string  `fixed:"16,25"`
    Grade     float64 `fixed:"26,30"`
    Age       uint    `fixed:"31,33"`
    Alive     bool    `fixed:"34,39"`
    Github    bool    `fixed:"40,41"`
}

// define some fixed-width data to parse
data := []byte("" +
    "1    Ian       Lopshire  99.50 20 false f" + "\n" +
    "2    John      Doe       89.50 21 true t" + "\n" +
    "3    Jane      Doe       79.50 22 false F" + "\n" +
    "4    Ann       Carraway  79.59 23 false T" + "\n")

err := fixedwidth.Unmarshal(data, &people)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("%+v\n", people[0])
fmt.Printf("%+v\n", people[1])
fmt.Printf("%+v\n", people[2])
fmt.Printf("%+v\n", people[3])
// Output:
//{ID:1 FirstName:Ian LastName:Lopshire Grade:99.5 Age:20 Alive:false Github:false}
//{ID:2 FirstName:John LastName:Doe Grade:89.5 Age:21 Alive:true Github:true}
//{ID:3 FirstName:Jane LastName:Doe Grade:79.5 Age:22 Alive:false Github:false}
//{ID:4 FirstName:Ann LastName:Carraway Grade:79.59 Age:23 Alive:false Github:true}

It is also possible to read data incrementally:

decoder := fixedwidth.NewDecoder(bytes.NewReader(data))
for {
    var element myStruct
    err := decoder.Decode(&element)
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    handle(element)
}

UTF-8, Codepoints, and Multibyte Characters

fixedwidth supports encoding and decoding fixed-width data where indices are expressed in Unicode codepoints and not raw bytes. The data must be UTF-8 encoded.

decoder := fixedwidth.NewDecoder(strings.NewReader(data))
decoder.SetUseCodepointIndices(true)
// Decode as usual now

buff := new(bytes.Buffer)
encoder := fixedwidth.NewEncoder(buff)
encoder.SetUseCodepointIndices(true)
// Encode as usual now
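
The difference between byte indices and codepoint indices can be illustrated with a small standalone sketch. sliceBytes and sliceCodepoints are hypothetical helpers written for this example and are not part of the package; they only show why the same 1-based, inclusive interval selects different text once a multibyte character is present.

```go
package main

import "fmt"

// sliceBytes returns the 1-based, inclusive interval [start, end] of line,
// counting positions in bytes.
func sliceBytes(line string, start, end int) string {
	if start > len(line) {
		return ""
	}
	if end > len(line) {
		end = len(line)
	}
	return line[start-1 : end]
}

// sliceCodepoints returns the same interval, counting positions in Unicode
// codepoints instead of bytes.
func sliceCodepoints(line string, start, end int) string {
	runes := []rune(line)
	if start > len(runes) {
		return ""
	}
	if end > len(runes) {
		end = len(runes)
	}
	return string(runes[start-1 : end])
}

func main() {
	line := "Ç12" // "Ç" occupies two bytes in UTF-8

	// Byte positions 2-3 start in the middle of "Ç", so the result is not "12".
	fmt.Printf("%q\n", sliceBytes(line, 2, 3))

	// Codepoint positions 2-3 select the two digits.
	fmt.Printf("%q\n", sliceCodepoints(line, 2, 3)) // "12"
}
```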

Alignment Behavior

Alignment | Encoding               | Decoding
default   | Field is left aligned  | The padding character is trimmed from both right and left of value
left      | Field is left aligned  | The padding character is trimmed from right of value
right     | Field is right aligned | The padding character is trimmed from left of value
none      | Field is left aligned  | The padding character is not trimmed from value. Useful for nested structs.
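
The alignment rules above can be sketched with the standard strings package. encodeField and decodeField are hypothetical names, not the package's API; the sketch counts widths in bytes, assumes an ASCII pad character, and truncating an over-long value is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeField pads value to width following the Encoding column: only
// "right" is right aligned, every other option is left aligned.
func encodeField(value string, width int, alignment string, pad byte) string {
	if len(value) >= width {
		return value[:width] // assumption: over-long values are cut on the right
	}
	padding := strings.Repeat(string(pad), width-len(value))
	if alignment == "right" {
		return padding + value
	}
	return value + padding
}

// decodeField trims the pad character following the Decoding column.
func decodeField(raw, alignment string, pad byte) string {
	cutset := string(pad)
	switch alignment {
	case "left":
		return strings.TrimRight(raw, cutset)
	case "right":
		return strings.TrimLeft(raw, cutset)
	case "none":
		return raw
	default: // "default" trims both sides
		return strings.Trim(raw, cutset)
	}
}

func main() {
	fmt.Printf("%q\n", encodeField("42", 5, "right", '0')) // "00042"
	fmt.Printf("%q\n", decodeField("00042", "right", '0')) // "42"
	fmt.Printf("%q\n", decodeField(" ab ", "left", ' '))   // " ab"
	fmt.Printf("%q\n", decodeField(" ab ", "none", ' '))   // " ab "
}
```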

Notes

  1. {} indicates an argument. [] indicates an optional segment.
  2. The default alignment is similar to left but has slightly different behavior required to maintain backwards compatibility.

Licence

MIT

go-fixedwidth's People

Contributors

alex, ianlopshire, knl, leonb, sidkurella, werber, zorchenhimer

go-fixedwidth's Issues

Use codepoint indices should be opt-out instead of opt-in

The encoder and decoder currently support multi-byte characters, but it requires library users to enable the feature explicitly. This behavior is unintuitive and leads to confusion (see #52).

decoder := fixedwidth.NewDecoder(strings.NewReader(data))
decoder.SetUseCodepointIndices(true)
// Decode as usual now

buff := new(bytes.Buffer)
encoder := fixedwidth.NewEncoder(buff)
encoder.SetUseCodepointIndices(true)
// Encode as usual now

There is still a performance cost associated with supporting multi-byte characters, so I'd like to keep the feature in the library.

Multi-byte character support should be enabled by default, but there should still be an option to opt out.

This is a breaking change and should be released as part of a major version bump.

Encoder Strict Mode

There should be an opt-in strict mode on the encoder that returns an error when a value does not fit in the available space. It should also return an error when the intervals defined for a struct overlap.
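
As a rough illustration of the two checks, here is a standalone sketch. The interval type and the checkOverlap and checkFits helpers are hypothetical and only show one way the validation could work; widths are counted in bytes and nothing here reflects how the encoder is actually structured.

```go
package main

import (
	"fmt"
	"sort"
)

// interval is a hypothetical description of one tagged field.
type interval struct {
	Name       string
	Start, End int // 1-based, inclusive
}

// checkOverlap returns an error if any two intervals share a position.
// After sorting by Start it is enough to compare neighbours.
func checkOverlap(fields []interval) error {
	sorted := append([]interval(nil), fields...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })
	for i := 1; i < len(sorted); i++ {
		prev, cur := sorted[i-1], sorted[i]
		if cur.Start <= prev.End {
			return fmt.Errorf("fields %s [%d,%d] and %s [%d,%d] overlap",
				prev.Name, prev.Start, prev.End, cur.Name, cur.Start, cur.End)
		}
	}
	return nil
}

// checkFits returns an error if value is longer than the field's interval.
func checkFits(f interval, value string) error {
	width := f.End - f.Start + 1
	if len(value) > width {
		return fmt.Errorf("field %s: value %q needs %d bytes, only %d available",
			f.Name, value, len(value), width)
	}
	return nil
}

func main() {
	fields := []interval{{"ID", 1, 5}, {"FirstName", 5, 15}}
	fmt.Println(checkOverlap(fields))
	fmt.Println(checkFits(fields[0], "1234567"))
}
```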

Add formatting options (e.g. left pad)

The encoder should support common formatting needs such as left padding.

Proposed Spec

Formatting Options

  • default - No padding is applied to the value.
  • rightpad - The value is padded on the right to fill available space.
  • leftpad - The value is padded on the left to fill available space.

In all cases the value is written to the available space in a left-to-right manner. If the value is longer than the available space, the rightmost characters will be omitted.

Struct Tags

The struct tag schema will be updated to support an optional third option to specify formatting – fixed:"{startPos},{endPos},{format}".

Padding Characters

Types                               | Padding Character
int, int8, int16, int32, int64      | 0
uint, uint8, uint16, uint32, uint64 | 0
float32, float64                    | 0
string, []byte                      | \u0020 (space)

Any type not listed will default to being padded with \u0020 (space).
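
A sketch of how the proposed spec might look in code, assuming the type-to-padding mapping listed above. defaultPadChar and formatValue are hypothetical helpers for this proposal only, and lengths are counted in bytes.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// defaultPadChar picks the padding character from the kind of v, following
// the proposed table: numeric kinds use '0', everything else a space.
func defaultPadChar(v interface{}) rune {
	switch reflect.ValueOf(v).Kind() {
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64,
		reflect.Float32, reflect.Float64:
		return '0'
	default:
		return ' '
	}
}

// formatValue applies one of the proposed options: "default", "rightpad",
// or "leftpad". Values longer than width lose their rightmost characters.
func formatValue(value string, width int, option string, pad rune) string {
	if len(value) > width {
		return value[:width]
	}
	fill := strings.Repeat(string(pad), width-len(value))
	switch option {
	case "leftpad":
		return fill + value
	case "rightpad":
		return value + fill
	default:
		return value
	}
}

func main() {
	fmt.Printf("%q\n", formatValue("42", 5, "leftpad", defaultPadChar(42)))      // "00042"
	fmt.Printf("%q\n", formatValue("Ian", 5, "rightpad", defaultPadChar("Ian"))) // "Ian  "
	fmt.Printf("%q\n", formatValue("toolong", 5, "default", ' '))                // "toolo"
}
```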

`Decoder.readLine` might have an error

When reading code for readLine, I spotted a naked return:

// readLine reads the next line of data. False is returned if there is no remaining data
// to read.
func (d *Decoder) readLine(v reflect.Value) (err error, ok bool) {
	ok = d.scanner.Scan()
	if !ok {
		if d.scanner.Err() != nil {
			return d.scanner.Err(), false
		}

		d.done = true
		return nil, false
	}

	line := string(d.scanner.Bytes())

	rawValue, err := newRawValue(line, d.useCodepointIndices)
	if err != nil {
		return
	}
...
}

This might be an error, as it appears to return the zero value for the error and false for ok, which doesn't seem right. Shouldn't it return err, false?
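
For reference, a small standalone sketch of what a naked return yields when the results are named. parse and nakedReturn are hypothetical stand-ins, not library code. Under Go's scoping rules a := at the top level of the function body reuses a result variable named in the signature, so the naked return hands back the non-nil error together with whatever ok currently holds. If readLine follows the same shape as the excerpt, the value to look at would be ok rather than err.

```go
package main

import (
	"errors"
	"fmt"
)

// parse is a hypothetical stand-in for a call that can fail.
func parse(fail bool) (string, error) {
	if fail {
		return "", errors.New("parse failed")
	}
	return "ok", nil
}

// nakedReturn mirrors the shape of the excerpt: named results, an early
// assignment to ok, then ":=" followed by a naked return on error.
func nakedReturn(fail bool) (err error, ok bool) {
	ok = true

	// err here is the named result, not a new variable, because the
	// declaration is in the same scope as the function's results.
	value, err := parse(fail)
	if err != nil {
		return // returns the current err and ok
	}
	_ = value
	return nil, true
}

func main() {
	err, ok := nakedReturn(true)
	fmt.Println(err, ok) // parse failed true
}
```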

Cannot unmarshal when a struct has just one field of length one

func TestUnmarshalLengthOne(t *testing.T) {
	type simpleType struct {
		One string `fixed:"1,1"`
	}

	b := []byte("B")
	var simple simpleType

	err := Unmarshal(b, &simple)
	if err != nil {
		t.Errorf("Unmarshal should succeed, have %s", err)
	}

	if simple.One != "B" {
		t.Errorf("value should be %q, have %q", "B", simple.One)
	}
}

This test does not pass.

Nested structs should not require the use of the `none` format.

Currently, to decode nested structs properly, you must use the none format option.

type Nested struct {
	First  string `fixed:"1,3"`
	Second string `fixed:"4,6"`
}

type Test struct {
	First  string `fixed:"1,3"`
	Second Nested `fixed:"4,9,none"`
}

This is a breaking change and should be released as part of a major version bump.

Behavior of complex types

The behavior of more complex types needs to be defined/implemented.

  • nested structs with tag
  • nested structs without tag
  • embedded struct with tag
  • embedded struct without tag

type Nested struct {
	F1 string `fixed:"1,10"`
	F2 struct {
		E1 string `fixed:"11,20"`
		E2 string `fixed:"21,30"`
	}
}

type NestedWithTag struct {
	F1 string `fixed:"1,10"`
	F2 struct {
		E1 string `fixed:"1,10"`
		E2 string `fixed:"11,20"`
	} `fixed:"11,30"`
}

type S1 struct {
	F1 string `fixed:"1,10"`
	F4 string `fixed:"31,40"`
}

type Embedded struct {
	S1
	F2 string `fixed:"11,20"`
	F3 string `fixed:"21,30"`
}

type S2 struct {
	F3 string `fixed:"1,10"`
	F4 string `fixed:"11,20"`
}

type EmbeddedWithTag struct {
	F2 string `fixed:"1,10"`
	F3 string `fixed:"11,20"`
	S2 `fixed:"21,40"`
}

Simplify the design of go-fixedwidth

Hi @ianlopshire,

When I have a struct with about 80 fields, I have to spend time calculating the start and end position of each field, which makes the package hard to use.
So why don't we just use a fixed length (in bytes or codepoints) for each field? I think it would be easier to implement and would also give a better user experience.

My suggestion:

type Nested struct {
	F1 string `fixed:"10"`
	F2 struct {
		E1 string `fixed:"10"`
		E2 string `fixed:"10"`
	}
}
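
To show how the two schemes relate, here is a minimal sketch that turns a list of field lengths into the 1-based, inclusive start and end positions the current tags expect. positions is a hypothetical helper and assumes fields are laid out back to back with no gaps.

```go
package main

import "fmt"

// positions converts consecutive field lengths into 1-based, inclusive
// [start, end] pairs.
func positions(lengths []int) [][2]int {
	out := make([][2]int, 0, len(lengths))
	start := 1
	for _, n := range lengths {
		out = append(out, [2]int{start, start + n - 1})
		start += n
	}
	return out
}

func main() {
	fmt.Println(positions([]int{10, 10, 10})) // [[1 10] [11 20] [21 30]]
}
```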

Panic when using SetUseCodepointIndices

Version: v0.9.3

To recreate the issue, I've boiled it down to:

  1. encoding with SetUseCodepointIndices enabled
  2. have some field which contains a non-ASCII character
  3. have some field at the end
  4. the last field is right aligned
  5. the last field is empty

package main

import (
	"bytes"
	"fmt"

	"github.com/ianlopshire/go-fixedwidth"
)

func main() {
	var v struct {
		Foo string `fixed:"1,1"`
		Bar string `fixed:"2,2,right"`
	}
	v.Foo = "Ç"

	buf := new(bytes.Buffer)
	e := fixedwidth.NewEncoder(buf)
	e.SetUseCodepointIndices(true)
	_ = e.Encode(v)
	fmt.Printf("%q\n", buf)
}

Output:

panic: runtime error: index out of range [2] with length 2

goroutine 1 [running]:
github.com/ianlopshire/go-fixedwidth.(*lineBuilder).WriteValue(0xc000092ce0, 0x2, {{0x0, 0x0}, {0x0, 0x0, 0x0}})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/buff.go:63 +0x4a5
github.com/ianlopshire/go-fixedwidth.valueEncoder.Write(0x0?, 0x0?, {0x10acd00?, 0xc0000b8010?, 0x0?}, {0x2, 0x2, 0xc0000ac050, 0xc0000ac060, 0x10c7e38, ...})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:198 +0x31e
github.com/ianlopshire/go-fixedwidth.structEncoder.func1({0x10b2540?, 0xc0000b8000?, 0x203000?})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:244 +0x415
github.com/ianlopshire/go-fixedwidth.(*Encoder).writeLine(0xc000092f30, {0x10b2540?, 0xc0000b8000?, 0x1009c6b?})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:146 +0x159
github.com/ianlopshire/go-fixedwidth.(*Encoder).Encode(0xc000092f30, {0x10b2540, 0xc0000b8000?})
        /.../go/pkg/mod/github.com/ianlopshire/go-fixedwidth@v0.9.3/encode.go:112 +0x16f
main.main()
        /.../main.go:20 +0x171
exit status 2

Expected output:

"Ç "

Optional TrimSpace

The decoder automatically trims spaces which I guess works for most people. I have a case where I do not want to trim spaces. Would it be possible to allow this to be configured via an option on the decoder?

Version 1.0.0

This issue exists to track breaking changes that should be made for a v1.0.0 release.

does not support bool type?

Does go-fixedwidth not support the bool type in structs?

package main

import (
	"fmt"
	"github.com/ianlopshire/go-fixedwidth"
	"log"
	"strings"
)

func main() {

	type Book struct {
		F1 bool `fixed:"1,5,left, "`
		F2 bool `fixed:"6,10,left, "`
		F3 int  `fixed:"11,12"`
	}

	s := `true true 1
falsefalse2
true false3`

	for _, s := range strings.Split(s, "\n") {
		var b Book
		if err := fixedwidth.Unmarshal([]byte(s), &b); err != nil {
			log.Fatal(err)
		}

		fmt.Printf("%+v\n", b)
	}

}

The output is below.

2021/01/31 23:23:16 fixedwidth: cannot unmarshal true true 1 into Go struct field Book.F1 of type bool:fixedwidth: unknown type
exit status 1

I want support for bool types.

Allow default formatting to be configured at the struct level

As of v0.7.0 formatting is configurable at the struct field level.

type Record struct {
	Field1 string `fixed:"1,5,left,#"`
	Field2 string `fixed:"6,10,left,#"`
	Field3 string `fixed:"11,15,left,#"`
	Field4 string `fixed:"16,20,left,#"`
	...
}

Adding all of the required tags can be tedious when all of the fields require a specific format. To alleviate this, there should be a mechanism to set the default format for all the fields in a struct.

My current thought is to implement something similar to xml.Name.

type Record struct {
	// Format is a special struct that can be embedded into a struct to control
	// the default formatting of its fields.
	fixedwidth.Format `fixed:"left,#"`

	Field1 string `fixed:"1,5"`
	Field2 string `fixed:"6,10"`
	Field3 string `fixed:"11,15"`
	Field4 string `fixed:"16,20,right,0"` // Override the default formatting.
	...
}

Incorrectly getting io.EOF before the end of file when using Decoder.Decode

When there is a very long line (specifically, 64 * 1024 = 65536 characters or longer) in a file processed by Decoder.Decode, the Decode method returns io.EOF when it reaches the very long line, even when it is not the end of the file.

To reproduce this bug, see the sample code and test file in this small repo. These are based off of the sample code in the fixedwidth repo's README here. In the small repo I provided, the test file contains 4 data lines, similar to the sample data in the README, except the 2nd line has over 65536 characters. The sample code uses the Decoder to read each line and print out the struct version of the line. It prints out the first line correctly, then prints a message indicating io.EOF was returned from the second line, e.g.:

go run main.go
{ID:1 FirstName:Ian LastName:Lopshire Grade:99.5 Age:20 Alive:false Github:f}
Got EOF%

I believe this may be happening because lines 108-113 of decode.go return io.EOF if, after calling readLine, the Decoder’s done flag is true and the returned values of readLine are err == nil && !ok. Looking into readLine (specifically, lines 162-166), if the result of the Decoder’s underlying Scanner object’s call to Scan() returns false, these conditions will be met. Looking into the Scan function in bufio/scan.go here, if the buffer length is greater than the maxTokenSize (which is 64 * 1024 = 65536), the method returns false. This false is then being used by Decoder to return io.EOF, even though this is not the end of the file.
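
The Scanner behavior described here can be reproduced with the standard library alone. The sketch below is not the Decoder's code: scanAll and longInput are hypothetical helpers. It shows that bufio.Scanner stops at an over-long line with Err() reporting bufio.ErrTooLong, and that raising the limit with Scanner.Buffer lets the same input scan to the end. Whether the Decoder should expose such a setting is left open.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// longInput builds three lines where the middle one exceeds
// bufio.MaxScanTokenSize (64 * 1024 bytes).
func longInput() string {
	long := strings.Repeat("x", 70000)
	return "first\n" + long + "\nlast\n"
}

// scanAll counts the lines a Scanner yields. A positive maxLine raises the
// maximum token size; zero keeps the default limit.
func scanAll(input string, maxLine int) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(input))
	if maxLine > 0 {
		sc.Buffer(make([]byte, 0, 64*1024), maxLine)
	}
	n := 0
	for sc.Scan() {
		n++
	}
	// Scan returning false is not the same as end of input: Err must be
	// checked to tell the two apart.
	return n, sc.Err()
}

func main() {
	n, err := scanAll(longInput(), 0)
	fmt.Println(n, errors.Is(err, bufio.ErrTooLong)) // 1 true

	n, err = scanAll(longInput(), 1<<20)
	fmt.Println(n, err) // 3 <nil>
}
```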

Potential panic or invalid data when using UTF-8 codepoint boundaries when decoding into a nested struct

Hello,

I have noticed a bug that causes a panic when decoding into a nested struct when using codepoint indices as your boundaries rather than bytes. Take the following example:

func TestDecodeSetUseCodepointIndices_Nested(t *testing.T) {
	type Nested struct {
		First  string `fixed:"1,3"`
		Second string `fixed:"4,6"`
	}

	type Test struct {
		First  string `fixed:"1,3"`
		Second Nested `fixed:"4,9"`
		Third  string `fixed:"10,12"`
		Fourth Nested `fixed:"13,18"`
		Fifth  string `fixed:"19,21"`
	}

	for _, tt := range []struct {
		name     string
		raw      []byte
		expected Test
	}{
		{
			name: "Multi-byte characters",
			raw:  []byte("123x☃x456x☃x789x☃x012\n"),
			expected: Test{
				First:  "123",
				Second: Nested{First: "x☃x", Second: "456"},
				Third:  "x☃x",
				Fourth: Nested{First: "789", Second: "x☃x"},
				Fifth:  "012",
			},
		},
	} {
		t.Run(tt.name, func(t *testing.T) {
			d := NewDecoder(bytes.NewReader(tt.raw))
			d.SetUseCodepointIndices(true)
			var s Test
			err := d.Decode(&s)
			if err != nil {
				t.Errorf("Unexpected err: %v", err)
			}
			if !reflect.DeepEqual(tt.expected, s) {
				t.Errorf("Decode(%v) want %v, have %v", tt.raw, tt.expected, s)
			}
		})
	}
}

Currently, this causes a panic due to codepoint indices not being adjusted when trimming data from the front of the string in decode.go:rawValueFromLine.

I believe the issue is here (decode.go Ln. 217):

	if value.codepointIndices != nil {
		if len(value.codepointIndices) == 0 || startPos > len(value.codepointIndices) {
			return rawValue{data: ""}
		}
		var relevantIndices []int
		var lineData string
		if endPos >= len(value.codepointIndices) {
			relevantIndices = value.codepointIndices[startPos-1:]
			lineData = value.data[relevantIndices[0]:]
		} else {
			relevantIndices = value.codepointIndices[startPos-1 : endPos]
			lineData = value.data[relevantIndices[0]:value.codepointIndices[endPos]]
		}
	} else { // truncated
	}

Note that lineData is trimmed from the left but the codepoint indices are not adjusted to match, which can cause an index out of bounds, or reading from the wrong part of the data string.
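
The idea of keeping the indices in step with the trimmed data can be sketched independently of the library. indexCodepoints and subValue are hypothetical helpers and do not reproduce the fix in the PR; they only show that the byte offsets must be rebased by the offset of the first kept codepoint before the substring is sliced again.

```go
package main

import "fmt"

// indexCodepoints returns the byte offset at which each codepoint of s starts.
func indexCodepoints(s string) []int {
	var idx []int
	for i := range s {
		idx = append(idx, i)
	}
	return idx
}

// subValue returns codepoints startPos..endPos (1-based, inclusive) of data
// together with indices rebased so they are valid for the returned string.
func subValue(data string, indices []int, startPos, endPos int) (string, []int) {
	if len(indices) == 0 || startPos > len(indices) {
		return "", nil
	}
	begin := indices[startPos-1]
	end := len(data)
	if endPos < len(indices) {
		end = indices[endPos]
	} else {
		endPos = len(indices)
	}
	rebased := make([]int, 0, endPos-startPos+1)
	for _, off := range indices[startPos-1 : endPos] {
		rebased = append(rebased, off-begin) // shift offsets to the new start
	}
	return data[begin:end], rebased
}

func main() {
	line := "123x☃x456"
	nested, nestedIdx := subValue(line, indexCodepoints(line), 4, 9)

	first, _ := subValue(nested, nestedIdx, 1, 3)
	second, _ := subValue(nested, nestedIdx, 4, 6)
	fmt.Printf("%q %q\n", first, second) // "x☃x" "456"
}
```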

I have created a fix in PR #60 for your review.

Decoding into slices doesn't return EOF

When v in Decode(v interface{}) is a pointer to a slice, readLines doesn't return EOF.

I suggest this change below:

func (d *Decoder) readLines(v reflect.Value) (err error) {
	ct := v.Type().Elem()
	for {
		nv := reflect.New(ct).Elem()
		err, ok := d.readLine(nv)
		if err != nil {
			return err
		}
		if ok {
			v.Set(reflect.Append(v, nv))
		}
		if d.done {
-			break
+			return io.EOF
		}
	}
	return nil
}

Does the library support lists?

Hi,
Does the library support repeated lists, where the list length is not known?

type Individual struct {
	Name string
	Age  int
}

type Population struct {
	Country    string
	Citizens   []Individual
	AverageAge float64
}

[Question] Fixed width files with Header and Trailer

Any suggestions for using this package to handle fixed-width files with a header and trailer?
It's common to have a line prefix that identifies different schemas in the same file.
Do you think it makes sense for this package to handle this type of file?
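
One possible approach, sketched here with the standard library only, is to split the input into lines and pick a schema from the line prefix before decoding. The single-character prefixes "H" and "T" and the helpers classify and countKinds are assumptions made for this example; real files will use their own markers.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// classify maps a line to a record kind by its prefix.
// The prefixes "H" and "T" are assumptions made for this sketch.
func classify(line string) string {
	switch {
	case strings.HasPrefix(line, "H"):
		return "header"
	case strings.HasPrefix(line, "T"):
		return "trailer"
	default:
		return "detail"
	}
}

// countKinds scans the input line by line and counts each record kind.
func countKinds(input string) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		counts[classify(sc.Text())]++
	}
	return counts
}

func main() {
	input := "H20240101\nD0001Ian\nD0002Jane\nT0002\n"
	counts := countKinds(input)
	fmt.Println(counts["header"], counts["detail"], counts["trailer"]) // 1 2 1
}
```

In a real program each branch would hand the line to Unmarshal with the struct type that matches that record kind, instead of just counting it.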

No way to tell if at EOF when using Decoder.Decode() with struct

Looking at:

https://github.com/ianlopshire/go-fixedwidth/blob/master/decode.go#L81-L93

and

https://github.com/ianlopshire/go-fixedwidth/blob/master/decode.go#L113-L128

As far as I can tell, readLine eats the EOF error -- it returns an ok bool, but Decode throws that away. This could be fixed by having Decode transform the ok bool back into an err == io.EOF, or the done bool could be exposed via a public API.
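
The first strategy can be sketched on a stripped-down reader. lineReader, next, and readAll are hypothetical stand-ins for the Decoder and show only the control flow: a false result from the scanner becomes either the scanner's own error or io.EOF, so callers can tell the end of input apart from a failure.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// lineReader is a hypothetical stand-in for the Decoder.
type lineReader struct {
	scanner *bufio.Scanner
}

func newLineReader(r io.Reader) *lineReader {
	return &lineReader{scanner: bufio.NewScanner(r)}
}

// next returns the next line, the scanner's error if reading failed, or
// io.EOF once the input is exhausted.
func (lr *lineReader) next() (string, error) {
	if lr.scanner.Scan() {
		return lr.scanner.Text(), nil
	}
	if err := lr.scanner.Err(); err != nil {
		return "", err
	}
	return "", io.EOF
}

// readAll drains the reader, treating io.EOF as the normal end of input.
func readAll(input string) ([]string, error) {
	lr := newLineReader(strings.NewReader(input))
	var lines []string
	for {
		line, err := lr.next()
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
		lines = append(lines, line)
	}
}

func main() {
	lines, err := readAll("a\nb\n")
	fmt.Println(lines, err) // [a b] <nil>
}
```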

I'm happy to implement a PR for either one of these strategies that you prefer.

Thanks for this package!

Support for codepoint-based fixedwidth files

I have an extremely bonkers fixed-width file format where the file is utf-8 encoded, but the fixed offsets are expressed in decoded codepoints. Naturally this doesn't play super well with go-fixedwidth's (ENTIRELY REASONABLE) byte-based approach :-)

I'd like to find some way to shoe-horn this into go-fixedwidth (and obviously contribute this upstream).

I think the most efficient implementation is probably, if you're doing this codepoint mode, to convert the line to []rune in readLine and in rawValueFromLine convert it back to a []byte with the utf-8 value.

And I think that'd all work fine. But I don't see an obvious way to do this without throwing a bunch of if statements here and changing a bunch of types (e.g. I could replace all the []byte with struct { bytes []byte; runes []rune } and then stick ifs in readLine and rawValueFromLine).

Do you have an opinion on if there's a better implementation strategy? Would I be better off just forking decode.go?

Client-Configurable Padding Character

As a consumer of this package, I have a need for generating fixedwidth-encoded output that's padded with 0s, so that I can integrate with a system which treats spaces as significant characters.

omitempty

Would a pull request that implements the omitempty struct tag be accepted?
