Coder Social home page Coder Social logo

Comments (12)

joinr avatar joinr commented on July 1, 2024 1

@tsulej wrote:

Possibly you're right that structure of the last row causes a problem. Thanks for investigation.
This issue points out regression.

Either that, or the original Excel parsing was too lenient and possibly incorrect :)

from incanter.

tsulej avatar tsulej commented on July 1, 2024

Worth to mention that this example works well on original versions, ie:

[org.clojure/clojure "1.6.0"]
[incanter/incanter "1.5.5"]

from incanter.

joinr avatar joinr commented on July 1, 2024

Possible dupe / related to #372 .
Note, that's unresolved. Behavior at the moment seems to be "don't have empties/nils," which is weak IMO. I can dig through the core.matrix stuff to try to figure out why, I think the problem is inputs going to mp/get-cols (called during dataset creation, passed "lines" read from excel zipmapped with the assumed field names from the first line) is crapping out because of the specific record not having the same number of entries as the field name, leading to a bumb call to nth, leading to index out of bounds. That's a guess at this point.

@mikera is better positioned to speculate on the core.matrix details.

The other option is to provide better parsing and guarantees on the excel read, and ensure we're not pushing stuff that will break cmm and trip an error.

from incanter.

rjungemann avatar rjungemann commented on July 1, 2024

So, I re-defined the read-sheet method from incanter.excel to output some information before trying to place the rows into the dataset.

I discovered that if I (drop-last 1 rows) that it worked for the UK2010.xls spreadsheet. This is because the last row of the spreadsheet is like a summary row, and for some reason, only has one cell in the row instead of the usual 144. I think this is the root of the problem, that this row should have some nil cells instead of the cells being completely missing.

From the lein repl:

(require '[incanter.excel])
(in-ns 'incanter.excel)

(defn- read-sheet [rows-it header-keywords]
  (let [colnames  (read-line-values (first rows-it))
        rows (->> (rest rows-it)
                  (map read-line-values)
                  (filter (complement empty?))
                  (drop-last 1))] ; NOTE: This line was my change, just to test my theory
    (dataset
      (if header-keywords
        (map keyword colnames)
        colnames)
      rows)))

(incanter.excel/read-xls "UK2010.xls")

So I have a feeling that the issue isn't with core.matrix, but rather with the Excel parsing. I'm still learning, so I'm not going to be the best person to solve the issue, but if there's a way I can help, I'd love to.

from incanter.

tsulej avatar tsulej commented on July 1, 2024

Possibly you're right that structure of the last row causes a problem. Thanks for investigation.
This issue points out regression (implementation of dataset is moved to clojure.core.matrix.dataset).

from incanter.

joinr avatar joinr commented on July 1, 2024

So I have a feeling that the issue isn't with core.matrix, but rather with the Excel parsing.

Great diagnosis. This is a similar corner case I ran into when parsing the excel file for a different library. Basically, you assume a tabular region of cells...but that's not always the case. We'd like there to be behavior like "current region" in excel, which may select empty cells to create a rectangular region. It could be the case that when reading cells, you get a sparse representation, which means you have to account for that during parsing (either fill in with empties or default values). The problem is, core.matrix is assuming consistent-width rows - that is row values match the expected number of fields. There's a call to mp/get-columns (which I think has an implementation for sequences of maps to coerce a vector of columns from the according to each field). In the protocol definitions, there are a couple of calls to nth (none of which exist in the incanter excel wrappers), so I think that's where the indexed access error is occurring.

core.matrix.dataset/dataset is what incanter.core/dataset wraps/delegates to, and what incanter.excel/read-sheet calls (passing the row-maps from the sheet iterator and the derived column names). That uses the core.matrix.impl.dataset/dataset-from-row-maps to construct the dataset. That uses map accessors and just tries to extract the known field names from the row maps (no problem, since get will return nil if there's no key, say in a sparse map like the "summary row" you mentioned). This builds up a nested vector representation for the rows from the maps.

Then dataset-from-row-maps then passes the nested vector representation of rows to dataset-from-rows, which internally coerces the row-major representation to a column-major representation (still nested vectors) via mp/get-columns. That's the current trace I've found.

from incanter.

tsulej avatar tsulej commented on July 1, 2024

Either that, or the original Excel parsing was too lenient and possibly incorrect :)

Yes, yes, this also should be considered here.

from incanter.

joinr avatar joinr commented on July 1, 2024

In-depth diagnosis of the problem is here

tldr: incanter is passing data to core.matrix.impl.dataset/dataset-from-rows that violates the documented assumptions core.matrix has about the data, namely that each row has an entry for each column. In the sample data, the last row only has a single entry, so when (ultimately) nth is called on the last row for anything beyong the first column, we get said exception.

Solution would be to include validation in incanter.excel/read-sheet to validate the input prior to invoking core.matrix, or detect this condition if there's an index out of bound exception and report the error with useful information (like which row is not normalized). Alternately, provide a way for caller to normalized data with default values, and during prep (prior to core.matrix invocation), ensure the row is normalized by adding default values if the size of a row-vector is < expected size.

Finally.....there's a behavior (complement empty?) that filters out empty rows in the xls file. In my practice, we parse tabular, contiguous data where empty rows indicate a "Break" in the data (as with current-region in Excel), so the remainder is not processed. Incanter will process these types of sheets and filter out "Break" rows at the moment....

from incanter.

rjungemann avatar rjungemann commented on July 1, 2024

If you open up the spreadsheet, the cell in the summary row is a few columns over, like this:

image

My instinct would be that there should be empty cells in the data imported from read-sheet. Essentially, read-sheet should indeed return the same number of cells per row, I think.

I attached the XLS file in case it is helpful in any way.

UK2010.xls.zip

from incanter.

joinr avatar joinr commented on July 1, 2024

Yeah, the current behavior is piss poor IMO, and definitely indicates non-regression compared to incanter 1.5.x (incanter 1.5.x was processing the inputs wrong; core.matrix is throwing an opaque, but useful error indicating such).

I work with Excel a lot (most analysts use it for data entry), and my simulation has a tight data protocol that establishes contiguous tabular regions of a worksheet (default rooted at A1), with selection behavior similar to "current region" behavior in excel. So analysts can mess with the inputs via a known medium (Excel), and then re-run the application providing the workbook as input for a project (basically a bunch of tables).

The problem the incanter.excel takes here is that it just uses each rows iterator, and doesn't check ANYTHING regarding contiguity or indexing. In my library, tabular-region uses contiguous-rows to get the region, then fills out empty cells (i.e. cells that have no recorded value in the workbook) by checking field count vs. row count and normalizing the resulting row vector with nil entries for missing fields. As I examine this, I probably need to 2x check the actual cell addresses instead of just counting the cells in each row ala row iterator.

So, detecting where there are gaps in the data (like the singleton cell in the last example, which should be the 144th column entry) is the correct parsing behavior....

I think a solution would perform a simple comparison of row-count vs. expected field count, and if they aren't the same, unpack the sparse data from the row (using the Apache POI cell object information) to map the cell values to where they should be in the resulting row vector, then populating nil values for all other fields/columns, to get things normalized for core.matrix. That keeps the "fix" isolated to incanter.excel, and satisfies assumptions the core.matrix makes about the inputs to dataset-from-rows

from incanter.

joinr avatar joinr commented on July 1, 2024

Hah, I forgot I do actually parse missing/non-contiguous column entries (up to the last element of the cell in the row), and fill in blanks up to the last cell in the row iterator sequence. This could probably be trivially applied to incanter.excel, along with the logic that normalizes the cell entries, packaged into read-sheet.

from incanter.

happy15 avatar happy15 commented on July 1, 2024

any update/fix on this issue?

from incanter.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.