
rxls's Introduction

RXLS

Short for Read XLS[X|B].

Read both XLSX and XLSB files from Python into PyArrow, fast and memory-safe.

Description

This module provides a single function, read, for reading both .xlsx and .xlsb files.

Basic usage:

import polars as pl
import pandas as pd

from rxls.reader import read

some_file = "some_file.xlsx"  # path to an .xlsx or .xlsb file, or a BytesIO

polars_df = pl.from_arrow(read(some_file, header=True))
pandas_df = read(some_file, header=True).to_pandas()

Advanced usage examples:

The file has no header:

from rxls.reader import read

table = read(some_file, header=0)  # produces `Unnamed: 0` ... `Unnamed: N` columns
table = read(some_file, header=['a', 'b', 'c'])  # produces a table with columns `a`, `b`, `c`; raises an exception if the column count != 3

The header is in an unknown row (but we know it is within the top 30 rows, and one of its cells contains the word timestamp):

from rxls.reader import read

table = read(some_file, header=True, lookup_head='^timestamp$')  # the first header row must contain a cell matching `timestamp`

Some rows in the timestamp column contain unwanted cells (subscripts, totals, or something else), but the row_id column doesn't:

from rxls.reader import read

table = read(some_file, header=True, row_filters='^row_id$')  # after the header, keep only rows with a non-null value in column `row_id`

Some rows in the timestamp column contain unwanted cells, the row_id column doesn't, and we also want to ensure that the client column is non-null:

from rxls.reader import read

table = read(some_file, header=True, row_filters=['^row_id$', '^client$'])  # after the header, keep only rows with non-null values in columns `row_id` AND `client`

Some rows in the sheet contain unwanted cells, but we know that every wanted row has a non-null row_id cell or a non-null client cell:

from rxls.reader import read

table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or')  # after the header, keep only rows with a non-null value in column `row_id` OR `client`

To track row events (e.g. for progress reporting):

from rxls.reader import read
from tqdm import tqdm

with tqdm(desc='Reading some file', unit=' rows') as tq:
    table = read(some_file, header=True, row_filters=['^row_id$', '^client$'], row_filters_strategy='or', row_callback=tq.update)

Some notes on the algorithm:

RXLS uses only TWO datatypes for cells of Excel files: NUMERIC and STRING. All other datatypes, as I discovered, are just styles applied to these two (called representation in RXLS).

So when MS Excel shows us a date (or a time, a timestamp, or a duration), it is actually a floating-point number under the hood. When this number is in [0.0, 1.0), it represents a part of a day (0.5 = 12 hours, 0.125 = 3 hours, and so on), which is useful for the time and duration datatypes; when it is greater than or equal to 1, it represents days since 1900 plus the part of the day. For dates, the fractional part is zero. All temporal cells (date/time, datetime, timedelta) are converted to timestamp[ms] during the prepare step. This behaviour may be overridden with the dtypes argument; after all, pyarrow's timestamp[ms] datatype can be converted to date64 or time32['ms'] instantly and without any problems. The duration datatype is usually not present in Excel files, and pyarrow cannot cast timestamp['ms'] to duration['ms']. For such a conversion, you can extract the timestamp column and convert it like this:

import pyarrow as pa

from rxls.reader import read

column_name = 'duration_column'

table = read(some_file, header=True)

# set_column returns a new table, so reassign the result:
table = table.set_column(
    table.column_names.index(column_name),
    pa.field(column_name, pa.duration('ms')),
    table[column_name].combine_chunks().view(pa.duration('ms'))  # original dtype is pyarrow.timestamp['ms'], which is binary-equivalent to pyarrow.duration['ms']
)

The most common representation of a string is a 4-byte little-endian integer (in XLSB) or a decimal number (in XLSX). This number is an index into the shared strings stored in a special part of the Excel file (xl/sharedStrings.[xml|bin]). That part is scanned on each read call, before all other steps.
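Conceptually, resolving shared-string indices looks like the following (a minimal sketch, not RXLS's actual code; shared and indices are illustrative names):

import pyarrow as pa
import pyarrow.compute as pc

# Shared strings table, as parsed from xl/sharedStrings.[xml|bin]:
shared = pa.array(["foo", "bar", "baz"], type=pa.large_string())
# Indices stored in the sheet's cells:
indices = pa.array([2, 0, 2, 1], type=pa.uint64())
# Resolve the indices to the actual strings:
resolved = pc.take(shared, indices)  # ["baz", "foo", "baz", "bar"]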

With all this in mind, I've implemented the chunk class, which keeps the original data as-is until it is accessed during final table creation. Chunks are column parts containing non-null values of the same type and representation. A chunk may also be of NULL type, in which case it stores just the count of null values between non-null chunks.

Original data is kept inside a chunk as-is, intact until the prepare step, as one of:

  • numpy.NDArray[numpy.float64] - plain numeric values (with an optional temporal flag)
  • numpy.NDArray[numpy.uint32] - XLSB RkNumber-encoded floating-point or integer values
  • pyarrow.LargeStringArray - XLSX and XLSB strings that are not shared but stored inline
  • pyarrow.UInt64Array - XLSX and XLSB shared-string indices
  • int - for null chunks: the count of null values
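The idea can be sketched like this (a hypothetical illustration; the actual chunk class in RXLS differs in its details):

from dataclasses import dataclass
from typing import Union

import numpy as np
import pyarrow as pa

@dataclass
class Chunk:
    # Payload kept as-is until the prepare step; a plain int encodes
    # a run of null values between non-null chunks.
    data: Union[np.ndarray, pa.Array, int]
    temporal: bool = False  # numeric payload is an Excel date/time serial number
    rknumber: bool = False  # numeric payload is XLSB RkNumber-encoded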

RXLS can drop entire chunks (or parts of them) before the prepare step when they are not needed (see the lookup_head and row_filters argument descriptions); they are never accessed afterwards.

RXLS performs two main steps for each sheet of an Excel file: reading and prepare.

During the reading step, no computation is performed: xl_series objects are simply filled with chunks of original data.

During the prepare step, unwanted chunks are dropped and the rest are converted:

  • numpy.NDArray[numpy.uint32] -> numpy.NDArray[numpy.float64] - when a numeric chunk has the rknumber flag.
  • numpy.NDArray[numpy.float64] -> numpy.NDArray[numpy.int64] - when a numeric chunk has the temporal flag: days (with a fractional part) since 1900 (Windows) -> milliseconds since 1970-01-01 (UNIX timestamp); see the sketch after this list.
  • pyarrow.UInt64Array -> pyarrow.LargeStringArray - shared-string indices resolved to the actual shared strings.
  • int -> pyarrow.NullArray, or pyarrow.Array[T, pyarrow.Scalar[T]] if the column has chunks of another common datatype T.
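For example, the temporal conversion boils down to plain arithmetic (a sketch, not RXLS's actual code; 25569 is the Excel serial number of 1970-01-01):

import numpy as np

serial = np.array([45292.5], dtype=np.float64)  # Excel serial for 2024-01-01 12:00
# Shift from the 1900 epoch to the UNIX epoch, then scale days to milliseconds:
unix_ms = ((serial - 25569.0) * 86_400_000.0).astype(np.int64)
print(unix_ms)  # [1704110400000] == 2024-01-01T12:00:00 as timestamp[ms]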

These operations are heavily optimized (and can run much faster with numba, with or without tbb; see the Dependencies section).

Also, a note about lookup_head and headers with height > 1.

If the header has more than one row, each column fills its topmost empty cells with the corresponding non-empty cell values from its left neighbour. All other cells are left as-is. After this, the non-empty cells are concatenated top to bottom with a ', ' separator. These operations are performed for each column, from left to right, as in the following example.
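For example, this two-row header:

2023 | (empty) | 2024 | (empty)
Q1   | Q2      | Q1   | Q2

produces the column names `2023, Q1`, `2023, Q2`, `2024, Q1` and `2024, Q2`: the empty top cells of the second and fourth columns are filled from their left neighbours before concatenation.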

So, RXLS features are:

  • Lazy and fast column-wise computations
  • Smart header searching and row filtering
  • Datatype preservation and conflict resolution

Notes:

  • At the moment, RXLS simply skips empty columns (columns with no header and no data).

Parameters:

  • Positional-only:
    • file: path to the file, or a BytesIO containing the file.
    • sheet: index or name of the sheet to read (default: 0).
  • Positional:
    • header:
      • bool: True if the header is present in the sheet and is one row high.
      • int: 0 if no header is present in the sheet, N for a header N rows high.
      • list[str]: if no header is present but we know what it should be.
  • Keyword-only:
    • dtypes: specify datatypes for columns.
      • Sequence[pyarrow.DataType]: when we know the dtype of each non-empty column.
      • dict[str, pyarrow.DataType]: to override the dtype of some columns.
      • pyarrow.DataType: when all columns must be of this dtype.
      • None: keep the original dtypes of the columns.
    • skip_cols: skip some columns by their 0-based indices (A == 0). Applied during the reading step.
    • skip_rows: skip some rows at the top of the file. Applied during the reading step.
    • skip_rows_after_header: skip some rows after the header. Applied during the prepare step.
    • take_rows: stop reading after this row (0-based). Applied during the reading step.
    • take_rows_non_empty: keep empty rows in the resulting table. Applied during the reading step.
    • lookup_head: regular expression for smart search of the first header row, or the index of the column whose first non-empty cell is the top cell of the header.
    • lookup_size: number of rows to scan when searching for the header. Note: RXLS raises an exception if lookup_head fails within lookup_size rows.
    • row_filters: regular expression(s) for the columns whose content determines which rows are empty and non-empty.
    • row_filters_strategy: boolean operator(s) for when row_filters names two or more columns.
    • float_precision: all numeric values in MS Excel are floating-point under the hood, so when rounding a whole float64 column to this precision gives the same result as simply truncating the decimals, the column is converted to int64.
    • datetime_formats: one or more formats that may appear in columns with conflicts.
    • conflict_resolve: a column containing two or more datatypes is a conflict. When a conflict cannot be resolved, the whole column is converted to utf-8. Conflicts may be resolved as:
      • no: all parts of columns with conflicts are converted to utf8.
      • temporal: try to convert the non-temporal parts of a column with some temporal parts to temporal (float64 -> timestamp and utf8 -> timestamp, using the default formats (ISO 8601) or those given in datetime_formats).
      • numeric: try to convert the non-numeric parts of a column with some numeric parts to numeric (utf8 -> float64).
      • all: use both strategies. When some parts of a column are temporal, try to convert all other parts to temporal (also enables two-step string conversion: utf8 -> float64 -> timestamp).
    • utf8_type_infers: (WIP) when a resulting column is utf-8 and all of its non-null cells match the regular expression for numeric values, convert it to float64 (and possibly to int64 afterwards).
    • null_values: list of values that should be skipped during the reading step (or a callable predicate for them).
    • row_callback: any callable that can be called without arguments on each row event. Useful for progress tracking.
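Putting several of these parameters together (an illustrative sketch; the file and column names are made up):

import pyarrow as pa

from rxls.reader import read

table = read(
    "report.xlsb",                    # hypothetical file
    0,                                # first sheet
    header=2,                         # two-row header, flattened as described above
    dtypes={"amount": pa.float64()},  # override the dtype of one column
    skip_cols=[0],                    # ignore column A
    row_filters="^row_id$",           # keep only rows with non-null `row_id`
    conflict_resolve="all",
)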

Dependencies

Required:

  • pyarrow>=14.0.2
  • numpy>=1.24.4
  • recordclass>=0.21.1
  • typing_extensions>=4.9.0

Optional:

  • numba>=0.58.1 (increases import time, but reading speed increases 3-4x and more)
  • tbb>=2021.11.0 (only useful together with numba; additional performance gain)
  • polars>=0.20.4 (needed to parse timestamps with milliseconds/microseconds/nanoseconds, or AM/PM with a timezone)
  • pandas>=2.0.3 (for pyarrow's to_pandas functionality)
  • tqdm>=4.66.1 (fast progress tracking)


rxls's Issues

Please consider API standardization

First of all - thank you for making a reader that can finally work with both XLSX and XLSB files. This is a long-awaited solution.

However, considering that the main 'target user-base' consists almost entirely of those who already use either pandas or polars, it might be a good strategy to offer an API that is as close and familiar to the intended user as it can be. This would probably make adoption of this promising project even easier / faster.

With this in mind, please consider the following suggestions:

  1. Use read as the main function name.
    xl_scan probably better reflects what is happening inside the function, but read better reflects the intention of the whole idea behind this Reader, and it is the first verb used in the description of this package.

  2. Since the output of the main function can be of very different types, it might be a good idea to offer the user functions that explicitly reflect the intended result type: something like read_to_pandas, read_to_polars.

  3. Use the same names for entities already used in pandas / polars.

  • header instead of head
  • float_precision instead of frounding
  • skip_rows and skip_rows_after_header instead of skip_rows and drop_rows, which sound synonymous and are bound to be a cause of confusion.
  • n_rows instead of take_rows
  4. Avoid shortening terms where the shortening might be difficult to remember. E.g. drop_cels is missing an l, and the benefit is not obvious - it would be more natural to type cells than cels.

  5. Perhaps drop_cels would be better named first_cell_regex to reflect its actual intent, which is currently described with literally these words:

    ...cell, that matches this [regex], will be first non-empty cell...

  6. Use progress or show_progress instead of with_tqdm, which exposes an implementation detail that might not be familiar to all users. This might be accompanied by progress_kwargs to allow fine-tuning the title, etc.

  7. Use filter_col instead of index_col.
    Based on the description:

    Column, that defines, whether row must exist, or not...

    the intended use sounds like it is about filtering, not indexing - and this is bound to cause confusion for pandas / polars users, for whom index is a very specific term / operation. See for reference the description of index_col in pandas.read_excel.

  8. Use ignore_empty_rows instead of keep_rows to better reflect the intent, and because it is already used in Arrow.

  9. Use dtypes instead of inferring, and accept a dict similarly to pandas/polars, allowing a dtype to be a type name (like UTF8) or one of the DtypeInferenceStrategy enum values (like: SAFE | DROP | REGEX | ...).

For reference, below is a summary of the APIs used by pandas / polars / Arrow for reading Excel and CSV files.

Read Excel API Summary

  • pandas.read_excel: sheet_name, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, date_format, thousands, decimal, comment, skipfooter, storage_options, use_nullable_dtypes

  • polars.read_excel: file, sheet_id, sheet_name, xlsx2csv_options, read_csv_options (see: Read CSV API Summary)

  • xlsx2csv_options: sheetid, sheetname, dateformat, timeformat, floatformat, quoting, delimiter, sheetdelimiter, skip_empty_lines, skip_trailing_columns, hyperlinks, include_sheet_pattern, exclude_sheet_pattern, exclude_hidden_sheets, skip_hidden_rows

Read CSV API Summary

  • polars.read_csv: batch_size, columns, comment_char, dtypes, encoding, eol_char, file, has_header, ignore_errors, infer_schema_length, low_memory, missing_utf8_is_empty_string, n_rows, n_threads, new_columns, null_values, quote_char, rechunk, row_count_name, row_count_offset, sample_size, sep, skip_rows, skip_rows_after_header, storage_options, try_parse_dates, use_pyarrow

  • arrow read_csv: input_file, memory_pool, plus three option groups:
    • read_options: use_threads, block_size, skip_rows, skip_rows_after_names, column_names, autogenerate_column_names, encoding
    • parse_options: delimiter, quote_char, double_quote, escape_char, newlines_in_values, ignore_empty_lines, invalid_row_handler
    • convert_options: check_utf8, column_types, null_values, true_values, false_values, decimal_point, strings_can_be_null, quoted_strings_can_be_null, include_columns, include_missing_columns, auto_dict_encode, auto_dict_max_cardinality, timestamp_parsers

  • pandas.read_csv: cache_dates, chunksize, comment, compression, converters, date_parser, dayfirst, decimal, delim_whitespace, delimiter, dialect, doublequote, dtype, encoding, encoding_errors, engine, error_bad_lines, escapechar, false_values, filepath_or_buffer, float_precision, header, index_col, infer_datetime_format, iterator, keep_date_col, keep_default_na, lineterminator, low_memory, mangle_dupe_cols, memory_map, na_filter, na_values, names, nrows, on_bad_lines, parse_dates, prefix, quotechar, quoting, sep, skip_blank_lines, skipfooter, skipinitialspace, skiprows, squeeze, storage_options, thousands, true_values, usecols, verbose, warn_bad_lines
