
rxls's Introduction

RXLS

Short for Read XLS[X|B].

Read both XLSX and XLSB files from Python into PyArrow, fast and memory-safe.

Description

This module provides a single function, read, for reading both .xlsx and .xlsb files.

Basic usage:

import polars as pl
import pandas as pd

from rxls.reader import read

some_file = "some_file.xlsx"  # path to an .xlsx or .xlsb file, or a BytesIO

polars_df = pl.from_arrow(read(some_file, header=True))
pandas_df = read(some_file, header=True).to_pandas()

Advanced usage examples:

The file has no header:

from rxls.reader import read

table = read(some_file, header=0)  # produces `Unnamed: 0` ... `Unnamed: N` columns
table = read(some_file, header=['a', 'b', 'c'])  # produces a table with columns `a`, `b`, `c`; raises an exception if the column count != 3

The header is in an unknown row (but we know it is within the top 30 rows, and one of its cells contains the word timestamp):

from rxls.reader import read

table = read(some_file, header=True, lookup_head='^timestamp$')  # the first header row must contain a cell matching `timestamp`

Some rows in the timestamp column contain unwanted cells (subscripts, totals, or something else), but the row_id column doesn't:

from rxls.reader import read

table = read(some_file, header=True, row_filters='^row_id$')  # after the header, keep only rows with a non-null value in column `row_id`

Some rows in the timestamp column contain unwanted cells, the row_id column doesn't, and we also want to ensure that the client column is non-null:

from rxls.reader import read

table = read(some_file, header=True, row_filters=['^row_id$', '^client$'])  # after the header, keep only rows with non-null values in columns `row_id` AND `client`

Some rows in the sheet contain unwanted cells, but we know that every wanted row has a non-null row_id cell or a non-null client cell:

from rxls.reader import read

table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or')  # after the header, keep only rows with a non-null value in column `row_id` OR `client`

To track row events (e.g. for progress reporting):

from rxls.reader import read
from tqdm import tqdm

with tqdm(desc='Reading some file', unit=' rows') as tq:
    table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or', row_callback=tq.update)

Some notes on the algorithm:

RXLS uses only TWO datatypes for cells of Excel files: NUMERIC and STRING. All other datatypes, as I discovered, are just styles applied to these two (called representation in RXLS).

So when MS Excel shows us a date (or a time, a timestamp, or a duration), it is actually a floating-point number under the hood. When this number is in [0.0, 1.0), it represents a part of a day (0.5 = 12 hours, 0.125 = 3 hours, and so on), which is useful for the time and duration datatypes; when it is greater than or equal to 1, it represents days since 1900 plus the part of the day. For dates, the fractional part is zero. All temporal cells (date/time, datetime, timedelta) are converted to timestamp[ms] during the prepare step. This behaviour may be overridden with the dtypes argument; after all, pyarrow's timestamp[ms] datatype can be converted to date64 or time32['ms'] instantly and without any problems. The duration datatype is usually not present in Excel files, and pyarrow cannot cast timestamp['ms'] to duration['ms']. For such a conversion, you can extract the timestamp column and convert it like this:

import pyarrow as pa

from rxls.reader import read

column_name = 'duration_column'

table = read(some_file, header=True)

# set_column returns a new table, so reassign the result:
table = table.set_column(
    table.column_names.index(column_name),
    pa.field(column_name, pa.duration('ms')),
    table[column_name].combine_chunks().view(pa.duration('ms'))  # original dtype is pyarrow.timestamp['ms'], which is binary-equivalent to pyarrow.duration['ms']
)

The most common representation of a string is a 4-byte little-endian integer (in XLSB) or a decimal number (in XLSX). This number is an index into the shared strings stored in a special part of the Excel file (xl/sharedStrings.[xml|bin]). That part is scanned on each read call, before all other steps.
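Conceptually, resolving shared-string indices looks like the following (a minimal sketch, not RXLS's actual code; shared and indices are illustrative names):

import pyarrow as pa
import pyarrow.compute as pc

# Shared strings table, as parsed from xl/sharedStrings.[xml|bin]:
shared = pa.array(["foo", "bar", "baz"], type=pa.large_string())
# Indices stored in the sheet's cells:
indices = pa.array([2, 0, 2, 1], type=pa.uint64())
# Resolve the indices to the actual strings:
resolved = pc.take(shared, indices)  # ["baz", "foo", "baz", "bar"]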

With all this in mind, I've implemented the chunk class, which keeps the original data as-is until it is accessed during final table creation. Chunks are column parts containing non-null values of the same type and representation. A chunk may also be of NULL type, in which case it stores just the count of null values between non-null chunks.

Original data is kept inside a chunk as-is, intact until the prepare step, as one of:

  • numpy.NDArray[numpy.float64] - plain numeric values (with an optional temporal flag)
  • numpy.NDArray[numpy.uint32] - XLSB RkNumber-encoded floating-point or integer values
  • pyarrow.LargeStringArray - XLSX and XLSB strings that are not shared but stored inline
  • pyarrow.UInt64Array - XLSX and XLSB shared-string indices
  • int - for null chunks: the count of null values
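The idea can be sketched like this (a hypothetical illustration; the actual chunk class in RXLS differs in its details):

from dataclasses import dataclass
from typing import Union

import numpy as np
import pyarrow as pa

@dataclass
class Chunk:
    # Payload kept as-is until the prepare step; a plain int encodes
    # a run of null values between non-null chunks.
    data: Union[np.ndarray, pa.Array, int]
    temporal: bool = False  # numeric payload is an Excel date/time serial number
    rknumber: bool = False  # numeric payload is XLSB RkNumber-encoded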

RXLS can drop entire chunks (or parts of them) before the prepare step when they are not needed (see the lookup_head and row_filters argument descriptions); they are never accessed afterwards.

RXLS performs two main steps for each sheet of an Excel file: reading and prepare.

During the reading step, no computation is performed: xl_series objects are simply filled with chunks of original data.

During the prepare step, unwanted chunks are dropped and the rest are converted:

  • numpy.NDArray[numpy.uint32] -> numpy.NDArray[numpy.float64] - when a numeric chunk has the rknumber flag.
  • numpy.NDArray[numpy.float64] -> numpy.NDArray[numpy.int64] - when a numeric chunk has the temporal flag: days (with a fractional part) since 1900 (Windows) -> milliseconds since 1970-01-01 (UNIX timestamp); see the sketch after this list.
  • pyarrow.UInt64Array -> pyarrow.LargeStringArray - shared-string indices resolved to the actual shared strings.
  • int -> pyarrow.NullArray, or pyarrow.Array[T, pyarrow.Scalar[T]] if the column has chunks of another common datatype T.
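For example, the temporal conversion boils down to plain arithmetic (a sketch, not RXLS's actual code; 25569 is the Excel serial number of 1970-01-01):

import numpy as np

serial = np.array([45292.5], dtype=np.float64)  # Excel serial for 2024-01-01 12:00
# Shift from the 1900 epoch to the UNIX epoch, then scale days to milliseconds:
unix_ms = ((serial - 25569.0) * 86_400_000.0).astype(np.int64)
print(unix_ms)  # [1704110400000] == 2024-01-01T12:00:00 as timestamp[ms]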

These operations are heavily optimized (and can run much faster with numba, with or without tbb; see the Dependencies section).

Also, a note about lookup_head and headers with height > 1.

If the header has more than one row, each column fills its topmost empty cells with the corresponding non-empty cell values from its left neighbour. All other cells are left as-is. After this, the non-empty cells are concatenated top to bottom with a ', ' separator. These operations are performed for each column, from left to right, as in the following example.
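For example, this two-row header:

2023 | (empty) | 2024 | (empty)
Q1   | Q2      | Q1   | Q2

produces the column names `2023, Q1`, `2023, Q2`, `2024, Q1` and `2024, Q2`: the empty top cells of the second and fourth columns are filled from their left neighbours before concatenation.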

So, RXLS features are:

  • Lazy and fast column-wise computations
  • Smart header searching and row filtering
  • Datatype preservation and conflict resolution

Notes:

  • At the moment, RXLS simply skips empty columns (columns with no header and no data).

Parameters:

  • Positional-only:
    • file: path to the file, or a BytesIO containing the file.
    • sheet: index or name of the sheet to read (default: 0).
  • Positional:
    • header:
      • bool: True if the header is present in the sheet and is one row high.
      • int: 0 if no header is present in the sheet, N for a header N rows high.
      • list[str]: if no header is present but we know what it should be.
  • Keyword-only:
    • dtypes: specify datatypes for columns.
      • Sequence[pyarrow.DataType]: when we know the dtype of each non-empty column.
      • dict[str, pyarrow.DataType]: to override the dtype of some columns.
      • pyarrow.DataType: when all columns must be of this dtype.
      • None: keep the original dtypes of the columns.
    • skip_cols: skip some columns by their 0-based indices (A == 0). Applied during the reading step.
    • skip_rows: skip some rows at the top of the file. Applied during the reading step.
    • skip_rows_after_header: skip some rows after the header. Applied during the prepare step.
    • take_rows: stop reading after this row (0-based). Applied during the reading step.
    • take_rows_non_empty: keep empty rows in the resulting table. Applied during the reading step.
    • lookup_head: regular expression for smart search of the first header row, or the index of the column whose first non-empty cell is the top cell of the header.
    • lookup_size: number of rows to scan when searching for the header. Note: RXLS raises an exception if lookup_head fails within lookup_size rows.
    • row_filters: regular expression(s) for the columns whose content determines which rows are empty and non-empty.
    • row_filters_strategy: boolean operator(s) for when row_filters names two or more columns.
    • float_precision: all numeric values in MS Excel are floating-point under the hood, so when rounding a whole float64 column to this precision gives the same result as simply truncating the decimals, the column is converted to int64.
    • datetime_formats: one or more formats that may appear in columns with conflicts.
    • conflict_resolve: a column containing two or more datatypes is a conflict. When a conflict cannot be resolved, the whole column is converted to utf-8. Conflicts may be resolved as:
      • no: all parts of columns with conflicts are converted to utf8.
      • temporal: try to convert the non-temporal parts of a column with some temporal parts to temporal (float64 -> timestamp and utf8 -> timestamp, using the default formats (ISO 8601) or those given in datetime_formats).
      • numeric: try to convert the non-numeric parts of a column with some numeric parts to numeric (utf8 -> float64).
      • all: use both strategies. When some parts of a column are temporal, try to convert all other parts to temporal (also enables two-step string conversion: utf8 -> float64 -> timestamp).
    • utf8_type_infers: (WIP) when a resulting column is utf-8 and all of its non-null cells match the regular expression for numeric values, convert it to float64 (and possibly to int64 afterwards).
    • null_values: list of values that should be skipped during the reading step (or a callable predicate for them).
    • row_callback: any callable that can be called without arguments on each row event. Useful for progress tracking.
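Putting several of these parameters together (an illustrative sketch; the file and column names are made up):

import pyarrow as pa

from rxls.reader import read

table = read(
    "report.xlsb",                    # hypothetical file
    0,                                # first sheet
    header=2,                         # two-row header, flattened as described above
    dtypes={"amount": pa.float64()},  # override the dtype of one column
    skip_cols=[0],                    # ignore column A
    row_filters="^row_id$",           # keep only rows with non-null `row_id`
    conflict_resolve="all",
)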

Dependencies

Required:

  • pyarrow>=14.0.2
  • numpy>=1.24.4
  • recordclass>=0.21.1
  • typing_extensions>=4.9.0

Optional:

  • numba>=0.58.1 (increases import time, but reading speed increases 3-4x and more)
  • tbb>=2021.11.0 (only useful together with numba; additional performance gain)
  • polars>=0.20.4 (needed to parse timestamps with milliseconds/microseconds/nanoseconds, or AM/PM with a timezone)
  • pandas>=2.0.3 (for pyarrow's to_pandas functionality)
  • tqdm>=4.66.1 (fast progress tracking)


rxls's Issues

Please consider API standardization

First of all - thank you for making a reader that can finally work with both XLSX and XLSB files. This is a long-awaited solution.

However, considering that the main 'target user-base' consists almost entirely of those who already use either pandas or polars, it might be a good strategy to offer an API that is as close and familiar to the intended user as it can be. This would probably make adoption of this promising project even easier / faster.

With this in mind, please consider the following suggestions:

  1. Use read as the main function name.
    xl_scan probably better reflects what is happening inside the function, but read better reflects the intention of the whole idea behind this Reader, and it is the first verb used in the description of this package.

  2. Since the output of the main function can be of very different types, it might be a good idea to offer the user functions that explicitly reflect the intended result type: something like read_to_pandas, read_to_polars.

  3. Use the same names for entities already used in pandas / polars.

  • header instead of head
  • float_precision instead of frounding
  • skip_rows and skip_rows_after_header instead of skip_rows and drop_rows, which sound synonymous and are bound to be a cause of confusion.
  • n_rows instead of take_rows
  4. Avoid shortening terms where the shortening might be difficult to remember. E.g. drop_cels is missing an l, and the benefit is not obvious - it would be more natural to type cells than cels.

  5. Perhaps drop_cels would be better named first_cell_regex to reflect its actual intent, which is currently described with literally these words:

    ...cell, that matches this [regex], will be first non-empty cell...

  6. Use progress or show_progress instead of with_tqdm, which exposes an implementation detail that might not be familiar to all users. This might be accompanied by progress_kwargs to allow fine-tuning the title, etc.

  7. Use filter_col instead of index_col.
    Based on the description:

    Column, that defines, whether row must exist, or not...

    the intended use sounds like it is about filtering, not indexing - and this is bound to cause confusion for pandas / polars users, for whom index is a very specific term / operation. See for reference the description of index_col in pandas.read_excel.

  8. Use ignore_empty_rows instead of keep_rows to better reflect the intent, and because it is already used in Arrow.

  9. Use dtypes instead of inferring, and accept a dict similarly to pandas/polars, allowing a dtype to be a type name (like UTF8) or one of the DtypeInferenceStrategy enum values (like: SAFE | DROP | REGEX | ...).

For reference, below is a summary of the APIs used by pandas / polars / Arrow for reading Excel and CSV files.

Read Excel API Summary

  • pandas.read_excel: sheet_name, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, date_format, thousands, decimal, comment, skipfooter, storage_options, use_nullable_dtypes

  • polars.read_excel: file, sheet_id, sheet_name, xlsx2csv_options, read_csv_options (see: Read CSV API Summary)

  • xlsx2csv_options: sheetid, sheetname, dateformat, timeformat, floatformat, quoting, delimiter, sheetdelimiter, skip_empty_lines, skip_trailing_columns, hyperlinks, include_sheet_pattern, exclude_sheet_pattern, exclude_hidden_sheets, skip_hidden_rows

Read CSV API Summary

  • polars.read_csv: batch_size, columns, comment_char, dtypes, encoding, eol_char, file, has_header, ignore_errors, infer_schema_length, low_memory, missing_utf8_is_empty_string, n_rows, n_threads, new_columns, null_values, quote_char, rechunk, row_count_name, row_count_offset, sample_size, sep, skip_rows, skip_rows_after_header, storage_options, try_parse_dates, use_pyarrow

  • arrow read_csv: input_file, memory_pool, plus three option groups:
    • read_options: use_threads, block_size, skip_rows, skip_rows_after_names, column_names, autogenerate_column_names, encoding
    • parse_options: delimiter, quote_char, double_quote, escape_char, newlines_in_values, ignore_empty_lines, invalid_row_handler
    • convert_options: check_utf8, column_types, null_values, true_values, false_values, decimal_point, strings_can_be_null, quoted_strings_can_be_null, include_columns, include_missing_columns, auto_dict_encode, auto_dict_max_cardinality, timestamp_parsers

  • pandas.read_csv: cache_dates, chunksize, comment, compression, converters, date_parser, dayfirst, decimal, delim_whitespace, delimiter, dialect, doublequote, dtype, encoding, encoding_errors, engine, error_bad_lines, escapechar, false_values, filepath_or_buffer, float_precision, header, index_col, infer_datetime_format, iterator, keep_date_col, keep_default_na, lineterminator, low_memory, mangle_dupe_cols, memory_map, na_filter, na_values, names, nrows, on_bad_lines, parse_dates, prefix, quotechar, quoting, sep, skip_blank_lines, skipfooter, skipinitialspace, skiprows, squeeze, storage_options, thousands, true_values, usecols, verbose, warn_bad_lines
