
mongodb / bson-numpy

This project has been superseded by PyMongoArrow - https://github.com/mongodb-labs/mongo-arrow/tree/main/bindings/python

Home Page: https://bson-numpy.readthedocs.io/en/latest/

License: Apache License 2.0

Python 9.15% C 89.63% CMake 0.22% C++ 1.00%

bson-numpy's People

Contributors

aherlihy, ajdavis, behackett, juliusgeo, prashantmital, shaneharvey


bson-numpy's Issues

Require field names in dtypes

If you pass a dtype that doesn't have named fields, it causes an odd exception:

bsonnumpy.sequence_to_ndarray(iter([data]), np.dtype('i'), 1)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
error: TODO: constant loaded with _load_dcument, shouldn't happen

BSON-NumPy should error earlier and more clearly, describing what kind of dtypes we require.
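A minimal sketch of the kind of early check this could do, assuming validation happens at the Python/C boundary before any BSON parsing begins (the name validate_dtype is hypothetical, not part of the library):

```python
import numpy as np

def validate_dtype(dtype):
    # Hypothetical early check: reject dtypes without named fields
    # up front, with a message describing what we require.
    if dtype.fields is None:
        raise TypeError(
            "sequence_to_ndarray requires a dtype with named fields, "
            "e.g. np.dtype([('a', np.int32)]); got %r" % dtype)
    return dtype

validate_dtype(np.dtype([('a', np.int32)]))  # accepted
```

Calling `validate_dtype(np.dtype('i'))` would then raise a clear TypeError instead of the internal "_load_dcument" error above.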

sequence_to_ndarray should also accept iterables

This works now:

>>> data = bson.BSON().encode({'a': 1})
>>> bsonnumpy.sequence_to_ndarray(iter([data]), np.dtype([('a', np.int32)]), 1)
array([(1,)], 
      dtype=[('a', '<i4')])
...

The iter call shouldn't be required; sequence_to_ndarray should auto-convert iterables to iterators.
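A sketch of the auto-conversion, assuming a thin Python wrapper in front of the C function (wrapped_sequence_to_ndarray and _c_impl are hypothetical names for illustration):

```python
def wrapped_sequence_to_ndarray(docs, dtype, length, _c_impl=None):
    # iter() accepts both iterables and iterators, and returns the
    # iterator itself when given one, so this is a no-op for callers
    # who already pass an iterator.
    docs = iter(docs)
    if _c_impl is None:
        return docs  # stand-in: the real code would call the C function
    return _c_impl(docs, dtype, length)
```

With this in place, both `wrapped_sequence_to_ndarray([data], dt, 1)` and `wrapped_sequence_to_ndarray(iter([data]), dt, 1)` would behave the same.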

Implement ndarray_to_sequence()

I'm not certain if ndarray_to_sequence is the perfect name for the inverse of sequence_to_ndarray. Perhaps we should change both names: ndarray_to_bson and bson_to_ndarray.

Before closing this, remove the skip decorators from the tests in TestToBSONScalars, and update those tests to actually check the outcome of the conversion to BSON, not simply call the conversion function.

Clean up branches

There are a number of branches containing work produced since the 0.1 release on PyPI. Work with @ajdavis and @aherlihy to figure out their state and decide what to do with them.

dtype for datetime

Which dtype should we define for BSON datetimes in the format "2018-12-18 12:17:34.000Z"?
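BSON stores datetimes as a signed 64-bit count of milliseconds since the Unix epoch, so one plausible mapping (an assumption, not a decision this project has made) is NumPy's datetime64 with millisecond resolution:

```python
import numpy as np

# datetime64[ms] has the same representation as a BSON datetime:
# milliseconds since the Unix epoch, in a 64-bit integer.
dt = np.dtype('datetime64[ms]')

# The example string "2018-12-18 12:17:34.000Z" would correspond to:
value = np.datetime64('2018-12-18T12:17:34.000')
millis = value.astype('datetime64[ms]').astype(np.int64)
```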

Create wheels

Might be tricky with libbson vendored and a binary dependency on NumPy's C extension; needs research.

Add pandas tutorial.

This is currently a one-liner in the docs. We should consider expanding it to have a more comprehensive pandas tutorial.
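A candidate starting point for such a tutorial, assuming only that sequence_to_ndarray returns a structured ndarray (stubbed here with a hand-built array so the snippet stands alone):

```python
import numpy as np
import pandas as pd

# Stand-in for the structured ndarray that sequence_to_ndarray
# would return for documents like {'a': i, 'b': i * 2}.
arr = np.array([(0, 0), (1, 2), (2, 4)],
               dtype=[('a', np.int32), ('b', np.int32)])

# pandas maps each named field of a structured array to a column.
df = pd.DataFrame(arr)
```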

Travis and Evergreen config

Try Travis first, see if we can install latest NumPy including C headers on Travis nicely. Otherwise try Evergreen.

Various comments

I took a look at the project from a new user's perspective and came up with the following feedback. My apologies for dumping all of it in one issue:

Pip install

You probably want to prepend https:// to the pip install lines

pip install git+github.com/...  # this doesn't work
pip install git+https://github.com/...  # this does

Exceptions rather than hard failures

Following an example I tried the following

In [1]: import bson

In [2]: import bsonnumpy

In [3]: import numpy as np

In [4]: data = [{'x': i, 'y': i**2, 'z': 'hello'} for i in range(10000)]

In [5]: dt = np.dtype([('x', 'i4'), ('y', 'i8'), ('z', 'S5')])

In [6]: byte_list = list(map(bson.dumps, data))

In [7]: history
import bson
import bsonnumpy
import numpy as np
data = [{'x': i, 'y': i**2, 'z': 'hello'} for i in range(10000)]
dt = np.dtype([('x', 'i4'), ('y', 'i8'), ('z', 'S5')])
byte_list = list(map(bson.dumps, data))
history

In [8]: bsonnumpy.sequence_to_ndarray(data, dt, 1)
src/bson/bson-iter.c:45 bson_iter_init(): precondition failed: bson
Aborted (core dumped)

Generally it would be nice to raise Python exceptions rather than kill the interpreter.

Docstrings

In [2]: bsonnumpy.sequence_to_ndarray?
Docstring: Convert an iterator containing BSON documents into an ndarray
Type:      builtin_function_or_method

This function is used often in the docs and appears to be user-facing. It might be worth investing in a proper docstring showing parameters, examples, etc. I personally prefer the numpydoc standard.
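A sketch of what a numpydoc-style docstring might look like; the parameter names are assumptions based on the call signature shown in the docs, and this Python stub only illustrates the docstring, not the builtin itself:

```python
def sequence_to_ndarray(iterator, dtype, length):
    """Convert a sequence of BSON documents into a NumPy ndarray.

    Parameters
    ----------
    iterator : iterator of bytes
        An iterator yielding raw BSON documents.
    dtype : numpy.dtype
        A structured dtype with named fields describing each document.
    length : int
        Maximum number of documents to load.

    Returns
    -------
    numpy.ndarray
        An array of the given dtype with at most ``length`` rows.
    """
```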

Use Numpy API macros as much as possible

The main goal of this ticket was to reduce the amount of code and rely more on the NumPy C API instead of manually manipulating pointers and using memcpy. The bottom line: because we are using named dtypes, many of the macros will not work.

PyArray_SetItem: requires a PyObject, which would introduce overhead converting BSON values into PyObjects just so we could use it instead of memcpy. Additionally, we need to separate the validation by type, so even if we used it instead of memcpy we could not get rid of the separate loading functions. Lastly, flexible dtypes that may be shorter than the assigned size (i.e. string, bytes, etc.) need to be zero-padded. We could swap out memcpy in simple scalar cases, but that would not make the code any less complex.

PyArray_GetPtr: See #30

NpyIter_*: We iterate through the BSON object instead of through the ndarray. Can revisit this while implementing #5.

Ultimately, the issues we have with the major NumPy C API macros are not ones that are going away anytime soon, so I believe we can close this issue.

PyArray_GetPtr does not work with named dtypes

PyArray_GetPtr does not descend into named dtypes.

Example:

array([(0, 10), (1,  9), (2,  8), (3,  7), (4,  6), (5,  5), (6,  4), (7,  3), (8,  2), (9,  1)],
       dtype=[('x', '<i4'), ('y', '<i4')])

In C:

PyArray_GetPtr(ndarray, [1, 0]) == PyArray_GetPtr(ndarray, [1, 10])
PyArray_GETPTR2(ndarray, 1, 0) == PyArray_GETPTR2(ndarray, 1, 10)

(For the record, it works up to the first named dtype, and with regular dtypes that don't have any named fields: PyArray_GetPtr([1, 0]) != PyArray_GetPtr([2, 0]).)

Potentially this is because named dtypes don't really have an order? Either way, that's why we have "sub_coordinates" and "offset" counters to keep track of where we are in the array. I would love to get rid of that extra code and computation and rely on a NumPy API call.

How to handle heterogeneous documents that lack expected fields

I have an optional field (geo) in my dataset. Its value is either null or a nested structure.

How can I structure my dtype var for this use case?

What I have so far:

dtype = np.dtype([('geo',
                   np.dtype([('coordinates', np.dtype([('0', np.float64), ('1', np.float64)]))]))])
ndarray = bsonnumpy.sequence_to_ndarray((doc.raw for doc in collection.find()), dtype, collection.count())

Thanks!
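Since missing or null fields are not handled gracefully (see the segfault issues in this tracker), one workaround is to normalize documents in Python before encoding, filling absent or null fields with a sentinel so a single fixed dtype describes every document. A sketch; fill_geo and the sentinel value are my own, not part of the library:

```python
def fill_geo(doc, sentinel=(0.0, 0.0)):
    # Ensure every document has a 'geo.coordinates' pair so one
    # structured dtype can describe the whole collection.
    geo = doc.get('geo')
    if not geo or 'coordinates' not in geo:
        doc = dict(doc, geo={'coordinates': sentinel})
    return doc

docs = [{'geo': {'coordinates': (1.0, 2.0)}}, {'geo': None}, {}]
cleaned = [fill_geo(d) for d in docs]
```

The cleaned documents could then be BSON-encoded and passed to sequence_to_ndarray with the nested dtype shown above.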

Memory leak in _load_element_from_bson

Currently my team at work uses this code to convert ~5000 documents per MongoDB query. When repeating our queries with bson-numpy, we noticed a memory leak that would gradually crash the program. We traced the leak to _load_element_from_bson, inside the first if statement, where parsed->node_type == DTYPE_ARRAY. If that statement is modified so that PyArray_OutputConverter isn't called, the memory leak doesn't occur (the leak is ~4 MB for each query conversion).

Would the devs have any idea what is causing this memory leak in _load_element_from_bson? I've been looking into it too, but have only been able to trace it to that specific if statement.

UPDATE: After some more digging, it seems to be caused by the loop over subarray_tuple where PyTuple_GET_ITEM is repeatedly called with this variable, which may be messing up the reference counts for some reason (it is currently unknown why this would cause a problem).

Segmentation Fault when loading data

* thread #1: tid = 0xcae262, 0x00007fffa0298885 libsystem_platform.dylib`_platform_memcmp + 293, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x100400029)
  * frame #0: 0x00007fffa0298885 libsystem_platform.dylib`_platform_memcmp + 293
    frame #1: 0x00000001037260ba bsonnumpy.so`_load_document_from_bson + 271
    frame #2: 0x0000000103725c94 bsonnumpy.so`sequence_to_ndarray + 511

The data is loaded with:

from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client.twitter

import numpy as np
dtype = np.dtype([('created_at', 'S32'), ('text', 'S512'), ('user', np.dtype([('name', 'S128')]))])


import bsonnumpy
ndarray = bsonnumpy.sequence_to_ndarray(db.tweets.find_raw_batches(), dtype, db.tweets.count())

and db.tweets contains tweets in JSON format as returned by the Twitter API, around 30,000 of them.

Support all the natural NumPy types

Allow double, string, array, binary, null, bool, datetime, int32, int64, decimal128, objectid.

Reconsider whether null should be supported or prohibited.

Probably convert decimal128 to NumPy double.
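Converting decimal128 to double is simple but lossy; Python's decimal module illustrates the precision trade-off (pymongo's bson.Decimal128 can be converted to a decimal.Decimal, so the same conversion would apply):

```python
from decimal import Decimal

# Decimal128 holds up to 34 significant decimal digits; a C double
# (NumPy float64) keeps only ~15-17, so the tail is rounded away.
exact = Decimal('0.1234567890123456789012345678901234')
approx = float(exact)  # what storing as NumPy double would keep
```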

Preparse the dtype before parsing BSON document

sequence_to_ndarray would be simpler to understand and probably faster (for complex documents) if it preparses the input dtype into a simpler C structure that we design. We could move all the error-checking and branching into a separate function that fills out a struct, and if that succeeds, proceed to parse the BSON document, comparing its contents to our simple struct.
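A Python sketch of the idea; the real preparse would fill a C struct, but NumPy already exposes the pieces such a node would record (field path, byte offset, sub-dtype):

```python
import numpy as np

def preparse(dtype, prefix='', base=0):
    # Flatten a structured dtype into (path, absolute offset, sub-dtype)
    # entries, the information a preparsed C-side node would carry.
    nodes = []
    for name in dtype.names:
        sub, offset = dtype.fields[name][:2]
        path = prefix + name
        if sub.names:  # nested document: recurse with accumulated offset
            nodes.extend(preparse(sub, path + '.', base + offset))
        else:
            nodes.append((path, base + offset, sub))
    return nodes

dt = np.dtype([('x', np.int32), ('user', [('name', 'S8')])])
nodes = preparse(dt)
```

Error checking would live entirely in preparse; the BSON-parsing loop would then only compare document contents against this flat list.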

Update setup.py metadata

  • Fix the url field to point to this project instead of @aherlihy's repo.
  • Add classifiers field listing the supported operating systems, supported Python versions, Python implementations, development status, topics, intended audience, license, etc.
  • Add long_description field set to the content of the README
  • Add license field set to "Apache License, Version 2.0"
  • Add keywords field
  • Remove 'and vice versa' from the description, since that isn't true yet.

"DEBUG" mode

Control Python and C print statements with an env var.

Add standard MongoDB header to each source file

/*
 * Copyright 2016-present MongoDB, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

Choose subdocument access API

Choose what to do when loading a subdocument into a NumPy array.

  • Prohibit?
  • Load as buffer of BSON bytes if the target NumPy type is a "void *"?
  • Allow loading elements from subdocuments with dot-notation like "x.y.z"?
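If dot-notation were chosen, the lookup itself is straightforward; a Python sketch over decoded documents (the real implementation would walk a bson_iter_t rather than dicts):

```python
def resolve(doc, path):
    # Follow "x.y.z"-style paths into nested documents.
    for key in path.split('.'):
        doc = doc[key]
    return doc

doc = {'x': {'y': {'z': 7}}}
```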

Test in Evergreen

So we can test on a larger set of platforms and operating systems. This is a prerequisite for #47.

use field names or not for sequence_to_ndarray

There are multiple ways of interpreting a sequence of BSON documents.

Right now, we are using field names so:

For a collection named "coll" on db "test" containing documents [{'a': 0, 'b': 0} ... {'a': n, 'b': n}], the result would be an ndarray of length test.coll.count() with dtype = np.dtype([('a', 'int32'), ('b', 'int32')]).

Should we do what monary did, which is to pick an arbitrary order for the document fields and avoid field names in the dtype? That collection would then become np.dtype('int32') with shape (test.coll.count(), 2).
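The two interpretations side by side, sketched with a tiny collection (hand-built arrays standing in for the conversion output):

```python
import numpy as np

docs = [{'a': 0, 'b': 0}, {'a': 1, 'b': 1}, {'a': 2, 'b': 2}]

# Current behaviour: field names in the dtype, one record per document.
named = np.array([(d['a'], d['b']) for d in docs],
                 dtype=[('a', np.int32), ('b', np.int32)])

# Monary-style: pick a field order, drop the names, get a 2-D array.
positional = np.array([[d['a'], d['b']] for d in docs], dtype=np.int32)
```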

Support for windows platform

Hi, thanks for sharing this awesome package. I am wondering when we could get a Windows version of the library?

Provide a script to "clean" data

Since there isn't much checking for badly formed data, if a user's documents don't have certain fields, or the fields are of the wrong type, it will error in ugly ways. To prevent the case where a user tries bson-numpy and it mysteriously segfaults, we should write a script that "checks" a user's collection to make sure the documents are complete and consistently typed.

Potential ideas:
Get a schema, through sampling/findOne/user input, and create a view that enforces that each document in the view has the same fields with the same types. That way, if a user without the right kind of data tries to use bson-numpy, they have an easy way to check its limitations.

Or, we could go through and audit the code so that we check return values and error intelligently.
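A sketch of the sampling-based check; check_collection is a hypothetical helper operating on decoded documents rather than a MongoDB view:

```python
def check_collection(docs):
    # Verify every document has the same fields with the same Python
    # types as the first one; report the first mismatch found.
    docs = iter(docs)
    first = next(docs)
    schema = {k: type(v) for k, v in first.items()}
    for i, doc in enumerate(docs, start=1):
        if {k: type(v) for k, v in doc.items()} != schema:
            return 'document %d does not match schema %r' % (i, schema)
    return None  # collection is consistent

ok = check_collection([{'a': 1}, {'a': 2}])
bad = check_collection([{'a': 1}, {'a': 'x'}])
```

A real script might run this over a random sample of the collection before the user attempts a full conversion.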

Prohibit BSON types that aren't natural with NumPy

Prohibit regex, code, symbol, undefined, dbpointer, minkey, maxkey.

Regex, code, timestamp, and the deprecated symbol type are rarely used but easy to support. Prohibit them for now and enable them if requested.
