pabloem / beam_utils
A repo with a few tiny Apache Beam utilities that I've coded.
License: Apache License 2.0
Thank you for providing the package.
The following import is failing:
from apache_beam.io import fileio
You are only using fileio for CompressionTypes. The correct import would be:
from apache_beam.io.filesystem import CompressionTypes
(Are you still actively developing this package? I'd be happy to add more stuff from my local beam_utils if you are.)
I had some trouble using the JSONLinesFileSource; it was hanging when trying to process a large file and processing smaller files unpredictably.
After having a look at the source, I realised that JSONLinesFileSource is designed to process streaming JSON where all of the objects are on one line, using a buffer. JSONLines is a format where each object is on a separate line (http://jsonlines.org).
It might be a good idea to rename the Source in order to avoid this confusion?
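To make the difference between the two formats concrete, here is a minimal sketch using only the standard library. It is not the library's code: the helper names parse_json_lines and parse_concatenated_json and the sample strings are made up for illustration. The first helper assumes one JSON value per line (JSON Lines); the second handles values written back to back, possibly all on one line.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines: one complete JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_concatenated_json(text):
    """Parse JSON values written back to back, possibly all on one line."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Hypothetical sample inputs, one per format.
json_lines = '{"id": 1}\n{"id": 2}\n'
one_line_stream = '{"id": 1}{"id": 2} {"id": 3}'

records_a = parse_json_lines(json_lines)
records_b = parse_concatenated_json(one_line_stream)
```

A reader built for the second format has to buffer until a value is complete, which may be related to the hanging on large files, although that is only a guess.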
Have you run into the problem where the csv reader will not read more than one record? Any csv I try only reads one line and converts it into a dict.
Hi,
Do you have any sample code showing how to use this utility?
Many thanks
Hi, and thanks very much for this library.
I noticed that when trying to feed the CsvFileSource a gzip file, I got this error:
TypeError: argument 1 must be an iterator
After some digging I found out that csv.reader() expects an iterator, which, unfortunately, is not what self.open_file() returns (see FileBasedSource).
This test exposes the problem, and this addition to sources.py fixed my problem.
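For readers who hit the same error, the general shape of a workaround can be sketched with the standard library. This is not necessarily the fix referred to above: ReadlineOnlyFile is a hypothetical stand-in for a file handle that offers readline() but cannot be iterated, and rows_from is an illustrative helper name. The sketch assumes the handle yields text lines and returns an empty string at end of file.

```python
import csv


class ReadlineOnlyFile:
    """Hypothetical stand-in for a handle with readline() but no __iter__."""

    def __init__(self, lines):
        self._lines = list(lines)

    def readline(self):
        # Return the next line, or '' at end of file, like a real text file.
        return self._lines.pop(0) if self._lines else ''


def rows_from(handle):
    """Wrap a readline-only handle in an iterator so csv.reader accepts it."""
    # iter(callable, sentinel) calls readline() until it returns ''.
    line_iter = iter(handle.readline, '')
    return list(csv.reader(line_iter))


# Passing the handle directly fails because it is not iterable.
try:
    csv.reader(ReadlineOnlyFile(['a,b\n']))
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

rows = rows_from(ReadlineOnlyFile(['name,qty\n', 'apple,3\n']))
```

If the handle returns bytes rather than text, the sentinel would need to be b'' and each line decoded before it reaches csv.reader.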
Hi,
thanks a lot for writing this little library. It really helped me handle CSV files.
But since the delimiter in my files wasn't the default ",", I had problems getting the library to work.
It seems that in CsvFileSource you included an option for different CSV delimiters, but the options are actually never used.
reader = csv.reader(self._file)
should be:
reader = csv.reader(self._file, delimiter=self.delimiter)
Thanks for your help and best regards,
Tobias
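The effect of the missing argument is easy to reproduce with the standard csv module alone. The sketch below uses a made-up semicolon-separated sample held in memory; it does not touch CsvFileSource itself.

```python
import csv
import io

# Hypothetical semicolon-delimited sample data.
data = io.StringIO('name;qty\napple;3\n')

# Without the delimiter argument, each line comes back as a single field.
default_rows = list(csv.reader(data))

# Rewind and pass the delimiter explicitly, as proposed in the fix above.
data.seek(0)
semicolon_rows = list(csv.reader(data, delimiter=';'))
```

Here default_rows holds one unsplit string per row, while semicolon_rows holds the separated fields.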