
mongodb / bson-numpy

This project has been superseded by PyMongoArrow - https://github.com/mongodb-labs/mongo-arrow/tree/main/bindings/python

Home Page: https://bson-numpy.readthedocs.io/en/latest/

License: Apache License 2.0

Python 9.15% C 89.63% CMake 0.22% C++ 1.00%

bson-numpy's People

Contributors

aherlihy, ajdavis, behackett, juliusgeo, prashantmital, shaneharvey


bson-numpy's Issues

Require field names in dtypes

If you pass a dtype that doesn't have named fields, it causes an odd exception:

bsonnumpy.sequence_to_ndarray(iter([data]), np.dtype('i'), 1)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
error: TODO: constant loaded with _load_dcument, shouldn't happen

BSON-NumPy should error earlier and more clearly, describing what kind of dtypes we require.
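A minimal sketch of the kind of early check this could do, assuming validation happens at the Python/C boundary before any BSON parsing begins (the name validate_dtype is hypothetical, not part of the library):

```python
import numpy as np

def validate_dtype(dtype):
    # Hypothetical early check: reject dtypes without named fields
    # up front, with a message describing what we require.
    if dtype.fields is None:
        raise TypeError(
            "sequence_to_ndarray requires a dtype with named fields, "
            "e.g. np.dtype([('a', np.int32)]); got %r" % dtype)
    return dtype

validate_dtype(np.dtype([('a', np.int32)]))  # accepted
```

Calling `validate_dtype(np.dtype('i'))` would then raise a clear TypeError instead of the internal "_load_dcument" error above.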

sequence_to_ndarray should also accept iterables

This works now:

>>> data = bson.BSON().encode({'a': 1})
>>> bsonnumpy.sequence_to_ndarray(iter([data]), np.dtype([('a', np.int32)]), 1)
array([(1,)], 
      dtype=[('a', '<i4')])
...

The iter call shouldn't be required; sequence_to_ndarray should auto-convert iterables to iterators.
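A sketch of the auto-conversion, assuming a thin Python wrapper in front of the C function (wrapped_sequence_to_ndarray and _c_impl are hypothetical names for illustration):

```python
def wrapped_sequence_to_ndarray(docs, dtype, length, _c_impl=None):
    # iter() accepts both iterables and iterators, and returns the
    # iterator itself when given one, so this is a no-op for callers
    # who already pass an iterator.
    docs = iter(docs)
    if _c_impl is None:
        return docs  # stand-in: the real code would call the C function
    return _c_impl(docs, dtype, length)
```

With this in place, both `wrapped_sequence_to_ndarray([data], dt, 1)` and `wrapped_sequence_to_ndarray(iter([data]), dt, 1)` would behave the same.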

Implement ndarray_to_sequence()

I'm not certain if ndarray_to_sequence is the perfect name for the inverse of sequence_to_ndarray. Perhaps we should change both names: ndarray_to_bson and bson_to_ndarray.

Before closing this, remove the skip decorators from the tests in TestToBSONScalars, and update those tests to actually check the outcome of the conversion to BSON, not simply call the conversion function.

Clean up branches

There are a number of branches containing work produced since the 0.1 release on PyPI. Work with @ajdavis and @aherlihy to figure out their state and decide what to do with them.

dtype for datetime

Which dtype should we define for BSON datetimes in the format "2018-12-18 12:17:34.000Z"?
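BSON stores datetimes as a signed 64-bit count of milliseconds since the Unix epoch, so one plausible mapping (an assumption, not a decision this project has made) is NumPy's datetime64 with millisecond resolution:

```python
import numpy as np

# datetime64[ms] has the same representation as a BSON datetime:
# milliseconds since the Unix epoch, in a 64-bit integer.
dt = np.dtype('datetime64[ms]')

# The example string "2018-12-18 12:17:34.000Z" would correspond to:
value = np.datetime64('2018-12-18T12:17:34.000')
millis = value.astype('datetime64[ms]').astype(np.int64)
```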

Create wheels

Might be tricky with libbson vendored and a binary dependency on NumPy's C extension; needs research.

Add pandas tutorial.

This is currently a one-liner in the docs. We should consider expanding it to have a more comprehensive pandas tutorial.
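A candidate starting point for such a tutorial, assuming only that sequence_to_ndarray returns a structured ndarray (stubbed here with a hand-built array so the snippet stands alone):

```python
import numpy as np
import pandas as pd

# Stand-in for the structured ndarray that sequence_to_ndarray
# would return for documents like {'a': i, 'b': i * 2}.
arr = np.array([(0, 0), (1, 2), (2, 4)],
               dtype=[('a', np.int32), ('b', np.int32)])

# pandas maps each named field of a structured array to a column.
df = pd.DataFrame(arr)
```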

Travis and Evergreen config

Try Travis first, see if we can install latest NumPy including C headers on Travis nicely. Otherwise try Evergreen.

Various comments

I took a look at the project from a new user's perspective and came up with the following feedback. My apologies for dumping all of it in one issue:

Pip install

You probably want to prepend https:// to the pip install lines

pip install git+github.com/...  # this doesn't work
pip install git+https://github.com/...  # this does

Exceptions rather than hard failures

Following an example I tried the following

In [1]: import bson

In [2]: import bsonnumpy

In [3]: import numpy as np

In [4]: data = [{'x': i, 'y': i**2, 'z': 'hello'} for i in range(10000)]

In [5]: dt = np.dtype([('x', 'i4'), ('y', 'i8'), ('z', 'S5')])

In [6]: byte_list = list(map(bson.dumps, data))

In [7]: history
import bson
import bsonnumpy
import numpy as np
data = [{'x': i, 'y': i**2, 'z': 'hello'} for i in range(10000)]
dt = np.dtype([('x', 'i4'), ('y', 'i8'), ('z', 'S5')])
byte_list = list(map(bson.dumps, data))
history

In [8]: bsonnumpy.sequence_to_ndarray(data, dt, 1)
src/bson/bson-iter.c:45 bson_iter_init(): precondition failed: bson
Aborted (core dumped)

Generally it would be nice to raise Python exceptions rather than kill the interpreter.

Docstrings

In [2]: bsonnumpy.sequence_to_ndarray?
Docstring: Convert an iterator containing BSON documents into an ndarray
Type:      builtin_function_or_method

This function is used often in the docs and appears to be user-facing. It might be worth investing in a proper docstring showing parameters, examples, etc. I personally prefer the numpydoc standard.
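A sketch of what a numpydoc-style docstring might look like; the parameter names are assumptions based on the call signature shown in the docs, and this Python stub only illustrates the docstring, not the builtin itself:

```python
def sequence_to_ndarray(iterator, dtype, length):
    """Convert a sequence of BSON documents into a NumPy ndarray.

    Parameters
    ----------
    iterator : iterator of bytes
        An iterator yielding raw BSON documents.
    dtype : numpy.dtype
        A structured dtype with named fields describing each document.
    length : int
        Maximum number of documents to load.

    Returns
    -------
    numpy.ndarray
        An array of the given dtype with at most ``length`` rows.
    """
```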

Use Numpy API macros as much as possible

The main goal of this ticket was to reduce the amount of code and rely more on the NumPy C API instead of manually manipulating pointers and using memcpy. The bottom line: because we are using named dtypes, many of the macros will not work.

PyArray_SetItem: requires a PyObject, which would introduce overhead converting BSON values into PyObjects just so we could use it instead of memcpy. Additionally, we need to separate the validation by type, so even if we used it instead of memcpy we could not get rid of the separate loading functions. Lastly, flexible dtypes that may be shorter than the assigned size (i.e. string, bytes, etc.) need to be zero-padded. We could swap out memcpy in simple scalar cases, but that would not make the code any less complex.

PyArray_GetPtr: See #30

NpyIter_*: We iterate through the BSON object instead of through the ndarray. Can revisit this while implementing #5.

Ultimately, the issues we have with the major NumPy C API macros are not ones that are going away anytime soon, so I believe we can close this issue.

PyArray_GetPtr does not work with named dtypes

PyArray_GetPtr does not descend into named dtypes.

Example:

array([(0, 10), (1,  9), (2,  8), (3,  7), (4,  6), (5,  5), (6,  4), (7,  3), (8,  2), (9,  1)],
       dtype=[('x', '<i4'), ('y', '<i4')])

In C:

PyArray_GetPtr(ndarray, [1, 0]) == PyArray_GetPtr(ndarray, [1, 10])
PyArray_GETPTR2(ndarray, 1, 0) == PyArray_GETPTR2(ndarray, 1, 10)

(For the record, it works up to the first named dtype, and with regular dtypes that don't have any named fields: PyArray_GetPtr([1, 0]) != PyArray_GetPtr([2, 0]).)

Potentially this is because named dtypes don't really have an order? Either way, that's why we have "sub_coordinates" and "offset" counters to keep track of where we are in the array. I would love to get rid of that extra code and computation and rely on a NumPy API call.

How to handle heterogeneous documents that lack expected fields

I have an optional field (geo) in my dataset. Its value is either null or a nested structure.

How can I structure my dtype var for this use case?

What I have so far:

dtype = np.dtype([('geo',
                   np.dtype([('coordinates', np.dtype([('0', np.float64), ('1', np.float64)]))]))])
ndarray = bsonnumpy.sequence_to_ndarray((doc.raw for doc in collection.find()), dtype, collection.count())

Thanks!
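Since missing or null fields are not handled gracefully (see the segfault issues in this tracker), one workaround is to normalize documents in Python before encoding, filling absent or null fields with a sentinel so a single fixed dtype describes every document. A sketch; fill_geo and the sentinel value are my own, not part of the library:

```python
def fill_geo(doc, sentinel=(0.0, 0.0)):
    # Ensure every document has a 'geo.coordinates' pair so one
    # structured dtype can describe the whole collection.
    geo = doc.get('geo')
    if not geo or 'coordinates' not in geo:
        doc = dict(doc, geo={'coordinates': sentinel})
    return doc

docs = [{'geo': {'coordinates': (1.0, 2.0)}}, {'geo': None}, {}]
cleaned = [fill_geo(d) for d in docs]
```

The cleaned documents could then be BSON-encoded and passed to sequence_to_ndarray with the nested dtype shown above.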

Memory leak in _load_element_from_bson

Currently my team at work uses this code to convert ~5000 documents per MongoDB query. When repeating our queries with bson-numpy, we noticed a memory leak that would gradually crash the program. We traced the leak to _load_element_from_bson, inside the first if statement, where parsed->node_type == DTYPE_ARRAY. If that statement is modified so that PyArray_OutputConverter isn't called, the memory leak doesn't occur (the leak is ~4 MB for each query conversion).

Would the devs have any idea what is causing this memory leak in _load_element_from_bson? I've been looking into it too, but have only been able to trace it to that specific if statement.

UPDATE: After some more digging, it seems to be caused by the loop over subarray_tuple where PyTuple_GET_ITEM is repeatedly called with this variable, which may be messing up the reference counts for some reason (it is currently unknown why this would cause a problem).

Segmentation Fault when loading data

* thread #1: tid = 0xcae262, 0x00007fffa0298885 libsystem_platform.dylib`_platform_memcmp + 293, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x100400029)
  * frame #0: 0x00007fffa0298885 libsystem_platform.dylib`_platform_memcmp + 293
    frame #1: 0x00000001037260ba bsonnumpy.so`_load_document_from_bson + 271
    frame #2: 0x0000000103725c94 bsonnumpy.so`sequence_to_ndarray + 511

The data is loaded with:

from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client.twitter

import numpy as np
dtype = np.dtype([('created_at', 'S32'), ('text', 'S512'), ('user', np.dtype([('name', 'S128')]))])


import bsonnumpy
ndarray = bsonnumpy.sequence_to_ndarray(db.tweets.find_raw_batches(), dtype, db.tweets.count())

and db.tweets contains tweets in JSON format as returned by the Twitter API, around 30,000 of them.

Support all the natural NumPy types

Allow double, string, array, binary, null, bool, datetime, int32, int64, decimal128, objectid.

Reconsider whether null should be supported or prohibited.

Probably convert decimal128 to NumPy double.
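Converting decimal128 to double is simple but lossy; Python's decimal module illustrates the precision trade-off (pymongo's bson.Decimal128 can be converted to a decimal.Decimal, so the same conversion would apply):

```python
from decimal import Decimal

# Decimal128 holds up to 34 significant decimal digits; a C double
# (NumPy float64) keeps only ~15-17, so the tail is rounded away.
exact = Decimal('0.1234567890123456789012345678901234')
approx = float(exact)  # what storing as NumPy double would keep
```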

Preparse the dtype before parsing BSON document

sequence_to_ndarray would be simpler to understand and probably faster (for complex documents) if it preparses the input dtype into a simpler C structure that we design. We could move all the error-checking and branching into a separate function that fills out a struct, and if that succeeds, proceed to parse the BSON document, comparing its contents to our simple struct.
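A Python sketch of the idea; the real preparse would fill a C struct, but NumPy already exposes the pieces such a node would record (field path, byte offset, sub-dtype):

```python
import numpy as np

def preparse(dtype, prefix='', base=0):
    # Flatten a structured dtype into (path, absolute offset, sub-dtype)
    # entries, the information a preparsed C-side node would carry.
    nodes = []
    for name in dtype.names:
        sub, offset = dtype.fields[name][:2]
        path = prefix + name
        if sub.names:  # nested document: recurse with accumulated offset
            nodes.extend(preparse(sub, path + '.', base + offset))
        else:
            nodes.append((path, base + offset, sub))
    return nodes

dt = np.dtype([('x', np.int32), ('user', [('name', 'S8')])])
nodes = preparse(dt)
```

Error checking would live entirely in preparse; the BSON-parsing loop would then only compare document contents against this flat list.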

Update setup.py metadata

  • Fix the url field to point to this project instead of @aherlihy's repo.
  • Add classifiers field listing the supported operating systems, supported Python versions, Python implementations, development status, topics, intended audience, license, etc.
  • Add long_description field set to the content of the README
  • Add license field set to "Apache License, Version 2.0"
  • Add keywords field
  • Remove 'and vice versa' from the description, since that isn't true yet.

"DEBUG" mode

Control Python and C print statements with an env var.

Add standard MongoDB header to each source file

/*
 * Copyright 2016-present MongoDB, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

Choose subdocument access API

Choose what to do when loading a subdocument into a NumPy array.

  • Prohibit?
  • Load as buffer of BSON bytes if the target NumPy type is a "void *"?
  • Allow loading elements from subdocuments with dot-notation like "x.y.z"?
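If dot-notation were chosen, the lookup itself is straightforward; a Python sketch over decoded documents (the real implementation would walk a bson_iter_t rather than dicts):

```python
def resolve(doc, path):
    # Follow "x.y.z"-style paths into nested documents.
    for key in path.split('.'):
        doc = doc[key]
    return doc

doc = {'x': {'y': {'z': 7}}}
```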

Test in Evergreen

So we can test on a larger set of platforms and operating systems. This is a prerequisite for #47.

use field names or not for sequence_to_ndarray

There are multiple ways of interpreting a sequence of BSON documents.

Right now, we are using field names so:

For a collection named "coll" on db "test" containing documents [{'a': 0, 'b': 0} ... {'a': n, 'b': n}], the result would be an ndarray of length test.coll.count() with dtype = np.dtype([('a', 'int32'), ('b', 'int32')]).

Should we do what monary did, which is to pick an arbitrary order for the document fields and avoid field names in the dtype? That collection would then become np.dtype('int32') with shape (test.coll.count(), 2).
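The two interpretations side by side, sketched with a tiny collection (hand-built arrays standing in for the conversion output):

```python
import numpy as np

docs = [{'a': 0, 'b': 0}, {'a': 1, 'b': 1}, {'a': 2, 'b': 2}]

# Current behaviour: field names in the dtype, one record per document.
named = np.array([(d['a'], d['b']) for d in docs],
                 dtype=[('a', np.int32), ('b', np.int32)])

# Monary-style: pick a field order, drop the names, get a 2-D array.
positional = np.array([[d['a'], d['b']] for d in docs], dtype=np.int32)
```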

Support for windows platform

Hi, thanks for sharing this awesome package. I am wondering when we could get a Windows version of the library?

Provide a script to "clean" data

Since there isn't much checking for badly formed data, if a user's documents don't have certain fields, or the fields are of the wrong type, it will error in ugly ways. To prevent the case where a user tries bson-numpy and it mysteriously segfaults, we should write a script that "checks" a user's collection to make sure the documents are complete and consistently typed.

Potential ideas:
Get a schema, through sampling/findOne/user input, and create a view that enforces that each document in the view has the same fields with the same types. That way, if a user without the right kind of data tries to use bson-numpy, they have an easy way to check its limitations.

Or, we could go through and audit the code so that we check return values and error intelligently.
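A sketch of the sampling-based check; check_collection is a hypothetical helper operating on decoded documents rather than a MongoDB view:

```python
def check_collection(docs):
    # Verify every document has the same fields with the same Python
    # types as the first one; report the first mismatch found.
    docs = iter(docs)
    first = next(docs)
    schema = {k: type(v) for k, v in first.items()}
    for i, doc in enumerate(docs, start=1):
        if {k: type(v) for k, v in doc.items()} != schema:
            return 'document %d does not match schema %r' % (i, schema)
    return None  # collection is consistent

ok = check_collection([{'a': 1}, {'a': 2}])
bad = check_collection([{'a': 1}, {'a': 'x'}])
```

A real script might run this over a random sample of the collection before the user attempts a full conversion.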

Prohibit BSON types that aren't natural with NumPy

Prohibit regex, code, symbol, undefined, dbpointer, minkey, maxkey.

Regex, code, timestamp, and the deprecated symbol type are rarely used but easy to support. Prohibit them for now and enable them if requested.
