thedatashed / xlsxreader Goto Github PK

View Code? Open in Web Editor NEW

69.0 17.0 20.0 169 KB

A low-memory high performance library for reading data from an xlsx file

Home Page: https://godoc.org/github.com/TheDataShed/xlsxreader

License: MIT License

Go 100.00%

golang go excel excelreader xlsx-files xlsxreader

xlsxreader's Introduction

xlsxreader: A Go Package for reading data from an xlsx file

Overview

A low-memory high performance library for reading data from an xlsx file.

Suitable for reading .xlsx data and designed to aid with the bulk uploading of data where the key requirement is to parse and read raw data.

The reader will read data out row by row (1->n) and has no concept of headers or data types (this is to be managed by the consumer).

The reader is currently not concerned with handling some of the more advanced cell data that can be stored in a xlsx file.

Further reading on how this came to be is available on our blog

Install

go get github.com/thedatashed/xlsxreader

Example Usage

Reading from the file system:

package main

import (
  "github.com/thedatashed/xlsxreader"
)

func main() {
    // Create an instance of the reader by opening a target file
    xl, _ := xlsxreader.OpenFile("./test.xlsx")

    // Ensure the file reader is closed once utilised
    defer xl.Close()

    // Iterate on the rows of data
    for row := range xl.ReadRows(xl.Sheets[0]){
    ...
    }
}

Reading from an already in-memory source

package main

import (
  "io/ioutil"
  "github.com/thedatashed/xlsxreader"
)

func main() {

    // Preprocessing of file data
    file, _ := os.Open("./test/test-small.xlsx")
    defer file.Close()
    bytes, _ := ioutil.ReadAll(file)

    // Create an instance of the reader by providing a data stream
    xl, _ := xlsxreader.NewReader(bytes)

    // Iterate on the rows of data
    for row := range xl.ReadRows(xl.Sheets[0]){
    ...
    }
}

Key Concepts

Files

The reader operates on a single file and will read data from the specified file using the OpenFile function.

Data

The Reader can also be instantiated with a byte array by using the NewReader function.

Sheets

An xlsx workbook can contain many worksheets, when reading data, the target sheet name should be passed. To process multiple sheets, either iterate on the array of sheet names identified by the reader or make multiple calls to the ReadRows function with the desired sheet names.

Rows

A sheet contains n rows of data, the reader returns an iterator that can be accessed to cycle through each row of data in a worksheet. Each row holds an index and contains n cells that contain column data.

Cells

A cell represents a row/column value and contains a string representation of that data. Currently numeric data is parsed as found, with dates parsed to ISO 8601 / RFC3339 format.

xlsxreader's People

Contributors

Stargazers

Watchers

Forkers

mwain dglsparsons kdming vjeantet emilien-puget xakep666 durango alicebob moxxx1e esourceful mylxsw nathj07 amitech troutsoftware wkbae cinakyn henryosei muktihari kra53n

xlsxreader's Issues

date and number formatting

Hi!

Two more things, they are related, so I'll make them a single issue:

date formatting is currently hardcoded to use the format time.RFC3339 (for a datetime) or "2006-01-02" (for a date). That's a very reasonable choice, however for (boring business) reasons I need to be able to format a date differently.
Cell.Value is currently not changed from what's stored in the file, which means that for numbers it will be in scientific notation when they are "big". Our xlsx have phone numbers in them, and are uploaded by customers. Those should be stored by our customers as strings, but you know, sometimes they aren't, and then they end up as 1.23456789E8. So I need to reformat numbers with strconv.ParseFloat(...); strconv.FormatFloat(f, 'f', -1, 64).

how about

How about we define constants for the possible cell types: [string, numerical, datetime, date, boolean] (more?) and add a field with the cell type to the Cell struct.
That way I can parse and reformat numerical types how I see fit, and same with date cells: parse them as 2006-01-02 and then reformat them how I want.
Adding a custom type, as opposed to exposing the raw xlsx cell type, makes there is no distinction between 'inline strings' and normal strings, and makes it clear what's a datetime/date.

I'm happy to make an example PR, but I didn't want to submit it out of the blue.

Thanks!

Automate release process

Ideally each new merge to master should result in a new tag, allowing any potential fixes to be used ASAP without manual intervention required to create releases.

Update documentation link

It looks like godoc is deprecated in favour of pkg.go.dev -> https://pkg.go.dev/github.com/thedatashed/xlsxreader?tab=doc

I'm not sure why TheDataShed appears in lower case there though... and it's missing the latest tag. seems suspect?

Extracting Cell Hyperlinks

If a cell has the link in it, as plain text, then all is well. However, if the cell has text with a link associated with it that link is not extracted.

The underling XML for the sheet has:

 <hyperlinks>
        <hyperlink ref="B2" r:id="rId1" xr:uid="{758EED76-BA86-204E-84F7-E3C6CCB25DF4}"/>
    </hyperlinks>

That r:id can be looked up in the rels file, so if that snippet is from sheet1, then _rels/sheet1.xml.rels looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships
	xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
	<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments" Target="../comments1.xml"/>
	<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/vmlDrawing" Target="../drawings/vmlDrawing1.vml"/>
	<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://en.wikipedia.org/wiki/Wales" TargetMode="External"/>
</Relationships>

It would be really useful to have a way to extract these links from each sheet, and associate them with the cell.

One thought would be to have a function such as getHyperlinks that is called from the XlsxFile.init method. This could be made optional, as I imagine not every user wants this information.

This new function could then read each sheet, and pull the links, and references and create a data structure on the XlsxFile struct; map[string]map[string][]string - a map of the sheet to a map of the cell reference to a slice of the links associated with that cell.

Having that structure populated would make it very simple during processing a sheet to find the all the hyperlinks on that sheet and associate them correctly with the text fro the cell.

If this is an agreeable approach I could possibly work on this. If you have another idea, let me know and I can take look at that too.

proposal: fix goroutine leaks when calling ReadRows

This is well-known issue since it is documented in the code but actually the fix is simple. But first, let me explain the issue to give more context for anyone reading this proposal:

When we only want to read some rows using ReadRows, we have to read the whole rows in the sheet, otherwise, it will produce a goroutine leaks. The goroutine leaks will never be clean up and will only be accumulated until the program exit. This happen because ReadRows spawned a goroutine that will always try to send the data to a non-buffered channel rowChannel but there are no longer readers at the other end:

xlsxreader/rows.go

Lines 363 to 367 in f3dd3ae

    
           func (x *XlsxFile) ReadRows(sheet string) chan Row { 
        
           	rowChannel := make(chan Row) 
        
           	go x.readSheetRows(sheet, rowChannel) 
        
           	return rowChannel 
        
           }

xlsxreader/rows.go

Line 258 in f3dd3ae

func (x *XlsxFile) readSheetRows(sheet string, ch chan<- Row) {

xlsxreader/rows.go

Lines 286 to 294 in f3dd3ae

    
           case xml.StartElement: 
        
           	if startElement.Name.Local == "row" { 
        
           		row := x.parseRow(decoder, &startElement) 
        
           		if len(row.Cells) < 1 && row.Error == nil { 
        
           			continue 
        
           		} 
        
           		ch <- row 
        
           	} 
        
           }

Leaks proof:

func TestGoroutineLeaks(t *testing.T) {
	goroutineUsedByTest := runtime.NumGoroutine()

	xl, err := OpenFile("test/test-small.xlsx")
	if err != nil {
		t.Fatal(err)
	}

	row := <-xl.ReadRows(xl.Sheets[0])
	_ = row // pull one row

	ng := runtime.NumGoroutine() - goroutineUsedByTest
	t.Logf("active goroutine before: %d\n", ng)

	xl.Close()   // Close XlsxFile
	runtime.GC() // Force GC to run

	ng = runtime.NumGoroutine() - goroutineUsedByTest

	t.Logf("active goroutine after: %d\n", ng)

	if ng != 0 {
		t.Fatalf("expected active goroutine is zero, got: %d", ng)
	}
}

Even though the XlsxFIle has been closed and no longer used, runtime could not clean it up.

If left unaddressed, this issue will compromise the integrity of this library slogan: "A low-memory high performance library for reading data from an xlsx file.".

The memory leaks may contradict with "low-memory" claim, and if users are compelled to read entire rows unnecessarily, their program performance will suffer especially if the xlsx file is huge in size, this may contradict with "high performance" claim.

So, here is the fix:

We need to make x.readSheetRows aware when user is calling Close(), all operations should be dropped:

if startElement.Name.Local == "row" {
        row := x.parseRow(decoder, &startElement)
        if len(row.Cells) < 1 && row.Error == nil {
	        continue
        }
        select {
        case <-x.doneChannel:
	        return
        case ch <- row:
        }
}

If this proposal is being accepted, I will send the PR with proper test so you can review it.

cannot read some spreadsheet cells

I have a spreadsheet (attached) that has 81 columns, yet some of the rows read only 79 or 80 columns . I have tried Excelize and it also has the same trouble with two of the spreadsheet columns when the cells are empty (see the columns "Customer request date" and "Target date" and the last two rows in the attached spreadsheet).

Here is some code for reading the file and seeing the problem:

xl, err := xlsxreader.OpenFile("")

if err != nil {
	panic(err)
}
// Ensure the file reader is closed once utilised
defer xl.Close()

rowcount := 0
cellcount := 0
columncount := 0
brokencount := 0

// Iterate on the rows of data
for row := range xl.ReadRows("Sheet1") {
	rowcount++

	record := make([]string, 0, len(row.Cells))

	for _, cell := range row.Cells {
		cellcount++
		record = append(record, cell.Value)
		if rowcount == 1 {
			columncount++
			fmt.Println("found cell header value:", cell.Value)
		}
	}
	if len(record) != columncount {
		brokencount++
		fmt.Println(rowcount, ": found row that has count different from columncount (", len(record), "vs", columncount, ")")
	 }
}

fmt.Println("The input file", *infile, "is", size, "bytes long")
fmt.Println("found", cellcount, "cells in", rowcount, "rows. There are", columncount, "columns")
fmt.Println("There are ", brokencount, "broken rows!")

The xml in the file appears ok from manual examination. As an aside, I also tried parsing with NodeJS and one library fails similarly, but another works just fine.

It is a bit of a puzzle and I'd love to see a fix or explanation for the problem! If it can be solved that would make this module absolutely amazing and a critical piece of functionality I have to deliver.

sample_parse_failure.xlsx

Thanks so much for any help!

Shared Strings XML not found

If xl/SharedStrings.xml or xl/sharedStrings.xml is not found in the excel file, can we not consider an empty table and parse the sheets?

Wrap errors better

This library does a really bad job of wrapping errors - it often just returns the naked error which makes it hard to trace where the error occurred within the code.

Instead, we should look to always wrap errors before returning.

i.e.

if err != nil {
   return nil, fmt.Errorf("something broke: %W", err)
}

xml/unmarshal/rawcell: Is this float64 rounding really necessary?

Hi @KingMichaelPark, is this float64 rounding really necessary?

xlsxreader/rows.go

Lines 235 to 240 in 6f2298a

    
           v, err := strconv.ParseFloat(*r.Value, 64) 
        
           if err != nil { 
        
           	return *r.Value, nil 
        
           } 
        
           *r.Value = strconv.FormatFloat(v, 'f', -1, 64) 
        
           return *r.Value, nil

What I can see from issue #55 is that the problem is in the file itself, not this library. The value in the file is indeed "4.4000000000000004" not "4.4". Using vscode' debug:

Or, if you prefer to look at xml file directly, we can extract like this:

package main

import (
	"archive/zip"
	"io"
	"os"
)

func main() {
	f, err := zip.OpenReader("file1.xlsx")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	out, err := os.OpenFile("debug_file1.xlsx_sheet1.xml", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	for _, file := range f.File {
		if file.Name == "xl/worksheets/sheet1.xml" {
			sheet1, err := file.Open()
			if err != nil {
				panic(err)
			}
			_, err = io.Copy(out, sheet1)
			if err != nil {
				sheet1.Close()
				panic(err)
			}
			sheet1.Close()
			break
		}
	}
}

Here is the resulting file: debug_file1.xlsx_sheet1.xml.zip

I'm afraid it might affect performance since we need to re-alloc string twice, it's small but remember it's per affected cell:

func BenchmarkParseFloatRoundTrip(b *testing.B) {
	v := "4.4000000000000004"
	for i := 0; i < b.N; i++ {
		v2, _ := strconv.ParseFloat(v, 64)
		_ = strconv.FormatFloat(v2, 'f', -1, 64)
	}
}
// Output:
// BenchmarkParseFloatRoundTrip-4    3446553    342.2 ns/op	   27 B/op    2 allocs/op

I believe this floating-point handling should be placed at user space rather than in this library since they are the one who own the file so they are expected to know how to manage. Moreover, it's the nature of floating-point number to have that kind of precision, it's expected.

CMIIW.

Merged Cells

Is there a way in the library to view merged cells, because at the moment the library skips the merged cells and return them as empty.

File Not Found after deleting a worksheet

test3.xlsx

Say, if we have a workbook with 3 worksheets and sheet1 is deleted, we start getting an error when opening the file using the Open method:

open file: File not found: xl/worksheets/sheet7.xml

Can we not skip the sheets ids that are not associated with a worksheet?

Release changes for shared_strings.go

Please make a release for #37. This seems to solves the problem of xlsx read getting stuck for large sharedStrings.xml

Floating point data rounding error

Here is the code that I have written to read the attached file:

package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"

	"github.com/thedatashed/xlsxreader"
)

// copy of https://godoc.org/github.com/tealeg/xlsx#ColLettersToIndex
// column is read from the right to left, for each char, res += (n+char-A)*26pow(i) (0<= i <=len(col)-1).
func columnIndex(col string) int {
	sum, mul, n := 0, 1, 0
	for i := len(col) - 1; i >= 0; i, mul, n = i-1, mul*26, 1 {
		c := col[i]
		switch {
		case c >= 'A' && c <= 'Z':
			n += int(c - 'A')
		case c >= 'a' && c <= 'z':
			n += int(c - 'a')
		}
		sum += n * mul
	}
	return sum
}

func getColCount(f *xlsxreader.XlsxFileCloser) int {
	row := <-f.ReadRows(f.Sheets[0])
	if row.Error != nil {
		return 0
	}
	return columnIndex(row.Cells[len(row.Cells)-1].Column) + 1
}

func main() {
	var result []string
	// Create an instance of the reader by opening a target file
	xl, _ := xlsxreader.OpenFile("./file1.xlsx")

	// Ensure the file reader is closed once utilised
	defer xl.Close()
	numColumns := getColCount(xl)
	for row := range xl.ReadRows(xl.Sheets[0]) {
		if row.Error != nil {
			continue
		}
		csvRow := make([]string, numColumns)
		data := make([][]string, 0)
		for _, curCell := range row.Cells {
			colIndex := columnIndex(curCell.Column)
			if colIndex < numColumns {
				csvRow[colIndex] = curCell.Value
			}
		}
		data = append(data, csvRow)
		buf := new(bytes.Buffer)
		w := csv.NewWriter(buf)
		err := w.WriteAll(data)
		if err != nil {
			panic(err)
		}
		if err := w.Error(); err != nil {
			log.Fatalln("error writing csv:", err)
		}
		csvString := buf.String()[:len(buf.String())-1]
		result = append(result, csvString)
	}
	fmt.Println(result)
}

However, When the data from the second row of the L column i.e. the value of the Facebook score is read then it's printed as 4.4000000000000004 instead of 4.4. Why is this rounding error coming in the case of some float-type data?

File_Attached_Here

Invalid Path for File Name

Error:

copy excel to csv: open file: File not found: xl//xl/worksheets/sheet1.xml

We're using the relationship file to make the name of the spreadsheet file. Can we not just get the names of the spreadsheets that are present in the root directory of the xml

For the above file, the file listing is:

     docProps/app.xml
     docProps/core.xml
     xl/theme/theme1.xml
     xl/worksheets/sheet1.xml
     xl/styles.xml
     _rels/.rels
     xl/workbook.xml
     xl/_rels/workbook.xml.rels
     [Content_Types].xml

As we can see, xl/worksheets/sheet1.xml is already present in the listing

Library skips empty cells in the returned data

Let's say the excel file contains the following data

Email,Name,Age
[email protected],,18

The returned row will only contain

Email,Name,Age
[email protected],18

This is an issue as 18 will be under the Name header wherein we want empty string under Name.

	func (x *XlsxFile) ReadRows(sheet string) chan Row {
	rowChannel := make(chan Row)
	go x.readSheetRows(sheet, rowChannel)
	return rowChannel
	}

	case xml.StartElement:
	if startElement.Name.Local == "row" {
	row := x.parseRow(decoder, &startElement)
	if len(row.Cells) < 1 && row.Error == nil {
	continue
	}
	ch <- row
	}
	}

	v, err := strconv.ParseFloat(*r.Value, 64)
	if err != nil {
	return *r.Value, nil
	}
	*r.Value = strconv.FormatFloat(v, 'f', -1, 64)
	return *r.Value, nil

thedatashed / xlsxreader Goto Github PK

xlsxreader's Introduction

xlsxreader: A Go Package for reading data from an xlsx file

Overview

Install

Example Usage

Key Concepts

Files

Data

Sheets

Rows

Cells

xlsxreader's People

Contributors

Stargazers

Watchers

Forkers

xlsxreader's Issues

how about

Recommend Projects

Recommend Topics

Recommend Org