smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS, WebHDFS or local (compressed) files.
It is well tested (using moto), well documented and sports a simple, Pythonic API:
>>> import smart_open
>>> # stream lines from an S3 object
>>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
...     print(line)
>>> # can use context managers too:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print(line)
...     fin.seek(0)  # seek to the beginning
...     print(fin.read(1000))  # read 1000 bytes
>>> # stream from HDFS
>>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...     print(line)
>>> # stream from WebHDFS
>>> for line in smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
...     print(line)
>>> # stream content *into* S3 (write mode; the stream is binary, so write bytes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in [b'first line', b'second line', b'third line']:
...         fout.write(line + b'\n')
>>> # stream content *into* WebHDFS (write mode):
>>> with smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
...     for line in [b'first line', b'second line', b'third line']:
...         fout.write(line + b'\n')
>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print(line)
>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write(b"some content\n")
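The local examples above suggest that the compression format is picked from the file extension (.gz, .bz2). As a rough illustration of that idea using only the standard library, a minimal sketch might look like the following. The helper name open_by_extension and the _OPENERS table are hypothetical, and this is not smart_open's actual implementation:

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping from file extension to an opener; illustration only.
_OPENERS = {'.gz': gzip.open, '.bz2': bz2.BZ2File}


def open_by_extension(path, mode='rb'):
    """Open `path`, (de)compressing transparently based on its extension."""
    _, ext = os.path.splitext(path)
    opener = _OPENERS.get(ext, open)  # fall back to the plain built-in open
    return opener(path, mode)


# Round trip through a temporary .bz2 file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'foo.txt.bz2')
with open_by_extension(path, 'wb') as fout:
    fout.write(b'some content\n')
with open_by_extension(path, 'rb') as fin:
    lines = list(fin)  # iterating yields decompressed lines as bytes
```

The caller never names a codec; changing the file name from foo.txt.bz2 to foo.txt.gz is enough to switch formats.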
Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra method smart_open.s3_iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):
>>> # get all JSON files under "mybucket/foo/"
>>> import boto
>>> bucket = boto.connect_s3().get_bucket('mybucket')
>>> for key, content in smart_open.s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
...     print('%s %s' % (key, len(content)))
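To make the pattern concrete without needing an S3 account, here is a small sketch of the same idea: filter keys by prefix and by an accept_key callback, then fetch the matching keys concurrently. Everything in it is an assumption for illustration: FAKE_BUCKET is an in-memory stand-in for a bucket, iter_bucket is a hypothetical function, and a thread pool from multiprocessing.pool is swapped in for worker processes to keep the example self-contained. It is not how smart_open itself is implemented:

```python
from multiprocessing.pool import ThreadPool

# Stand-in for a bucket: key name -> content. Purely illustrative data.
FAKE_BUCKET = {
    'foo/a.json': b'{"a": 1}',
    'foo/b.txt': b'plain text',
    'foo/c.json': b'{"c": [1, 2, 3]}',
    'bar/d.json': b'{}',
}


def iter_bucket(bucket, prefix='', accept_key=lambda key: True, workers=4):
    """Yield (key, content) for matching keys, fetching them concurrently."""
    keys = [k for k in sorted(bucket) if k.startswith(prefix) and accept_key(k)]

    def fetch(key):
        # A real implementation would download the key here.
        return key, bucket[key]

    pool = ThreadPool(workers)
    try:
        # imap_unordered hands back each result as soon as it is ready,
        # so the order of the yielded pairs is not guaranteed.
        for key, content in pool.imap_unordered(fetch, keys):
            yield key, content
    finally:
        pool.terminate()


results = dict(iter_bucket(FAKE_BUCKET, prefix='foo/',
                           accept_key=lambda key: key.endswith('.json')))
```

Because results arrive in completion order, callers that need a stable order should sort the pairs themselves.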
For more info (S3 credentials in URI, minimum S3 part size...) and full method signatures, check out the API docs:
>>> import smart_open
>>> help(smart_open.smart_open_lib)
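All of the examples above select a storage backend through the URI scheme (s3://, hdfs://, webhdfs://) or fall back to a plain local path. As a hedged sketch of how such a URI can be taken apart with the Python 3 standard library, consider the following. The function split_uri is hypothetical, and treating a bare path as the 'file' scheme is an assumption based on the file:// default mentioned in this README, not a description of smart_open's internals:

```python
from urllib.parse import urlsplit  # Python 3 standard library


def split_uri(uri):
    """Return (scheme, location, path); bare paths are treated as 'file'."""
    parts = urlsplit(uri)
    scheme = parts.scheme or 'file'  # assumption: no scheme means a local file
    return scheme, parts.netloc, parts.path


s3_parts = split_uri('s3://mybucket/mykey.txt')
# -> ('s3', 'mybucket', '/mykey.txt')
local_parts = split_uri('./foo.txt.gz')
# -> ('file', '', './foo.txt.gz')
```

For an S3 URI the location is the bucket name and the path (minus its leading slash) is the key.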
Working with large S3 files using Amazon's default Python library, boto, is a pain. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). There are nasty hidden gotchas when using boto's multipart upload functionality, and a lot of boilerplate.
smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.
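To give a feel for the boilerplate involved, streaming writes to S3 generally means buffering data until a part is large enough and then sending it as one piece of a multipart upload. The sketch below shows only that buffering step. PartBuffer and min_part_size are hypothetical names, the tiny part size is chosen for demonstration, and no boto calls are made, so treat it as an illustration under those assumptions rather than smart_open's implementation:

```python
import io


class PartBuffer(object):
    """Collect written bytes and emit parts of a minimum size (illustration only)."""

    def __init__(self, min_part_size):
        self.min_part_size = min_part_size
        self.buffer = io.BytesIO()
        self.parts = []  # stand-in for parts sent to a multipart upload

    def write(self, data):
        self.buffer.write(data)
        # Emit a part once enough bytes have accumulated.
        if self.buffer.tell() >= self.min_part_size:
            self._flush()

    def close(self):
        # The final part is allowed to be smaller than min_part_size.
        if self.buffer.tell() > 0:
            self._flush()

    def _flush(self):
        self.parts.append(self.buffer.getvalue())
        self.buffer = io.BytesIO()


buf = PartBuffer(min_part_size=16)
for line in [b'first line\n', b'second line\n', b'third\n']:
    buf.write(line)
buf.close()
# buf.parts -> [b'first line\nsecond line\n', b'third\n']
```

Only one part's worth of data is held in memory at a time, which is what makes writing very large objects feasible.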
The module has no dependencies beyond Python >= 2.6 (or Python >= 3.3) and boto:
pip install smart_open
Or, if you prefer to install from the source tar.gz:
python setup.py test # run unit tests
python setup.py install
To run the unit tests (optional), you'll also need to install mock, moto and responses (https://github.com/getsentry/responses), for example with pip install mock moto responses. The tests are also run automatically with Travis CI on every commit push & pull request.
smart_open is an ongoing effort. Suggestions, pull requests and improvements are welcome!
On the roadmap:
- better documentation for the default file:// scheme
smart_open lives on GitHub. You can file issues or pull requests there.
smart_open is open source software released under the MIT license.
Copyright (c) 2015-now Radim Řehůřek.