ibmstreams / administration Goto Github PK

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

License: Other

ibm-streams administration stream-processing

administration's Introduction

README -- IBMStreams/administration

The IBMStreams toolkits project is an open source umbrella project focused on the development of projects within the IBMStreams organization. resources for IBM Streams. This currently includes toolkits, samples, tutorials and utilities.

Use this repository to open enhancement issues for new proposals for repositories such as a new toolkit. If you have a proposal or idea for a new utility or sample, or a new operator in an existing toolkit project, open an issue in the corresponding repository.

Refer to the contribute.md file for details on how to get involved with this project.

Refer to the process.md file for a description of the processes used by the projects with the IBMStreams organization. Refer to the project's wiki for more information. �

administration's People

Contributors

Stargazers

Watchers

Forkers

smarterkey mikespicer rohitsw peteyc23 anouri wmarshall484 digideskio cephurs natashadsilva

administration's Issues

Regex RE2 toolkit

A toolkit brings RE2 regular expression library to Streams.
RE2 uses automata theory to guarantee that regular expression searches run in time linear in the size of the input.
The most important feature of the toolkit - regular expressions are compiled on the first call giving much better performance than builtin ones.

Datetime toolkit

A toolkit that provides additional functionality for dates and times. SPL applications usually need timestamps but the timestamp primitive can require too much memory, thus create types and functions to handle times at the required resolution. Including easy integration with time formats expected in Java and Javascript.

Discuss wiki page "Contributing to IBMStreams"

Hi all, I created a "Contributing to IBMStreams" wiki page.
https://github.com/IBMStreams/toolkits/wiki/Contributing-to-IBMStreams

In it, I tried to describe the process for

proposing a toolkit
contributing to an existing toolkit
editing a wiki page
creating a sample
creating a demo

For the samples and demos, I don't really know what the process is, so I mostly guessed. If I guessed wrong, please fix.

Proposal/Namespace question

I put up some hdfs2 formatting composites to make it easier to write txt & json with the hdfs2:
http://hildrum.github.io/streams.samples/

I'd be happy to contribute them to IBMStreams (as a sample, as part of a toolkit, or as a toolkit by themselves) if they are wanted, but in the meantime, is there a recommended namespace? My understanding is that com.ibm.streamsx was the namespace for IBMStreams, so I shouldn't use that. Is there anything in particular that should be used? com.ibm.stream.test? com.ibm.streamdev? com.ibm.streams.incubator?

Parquet Toolkit

Parquet is a columnar storage format for Hadoop. Parquet becoming more and more popular due to its very efficient compression and encoding schemes.
See more details at Parquet home page: http://parquet.io/

The Parquet toolkit will allow to read/write data in Parquet format to/from streaming applications. The toolkit is about to be implemented in Java and will
contain Sink operator in its initial version.

Resolve toolkit layout

it would be good to have a recommended project layout for SPL toolkits.

Some proposals are being discussed on this wiki page.,

https://github.com/IBMStreams/toolkits/wiki/ToolkitLayout

Working with toolkit/repository collaborators

Is there a way in github we can work with repository collaborators who only have read access to the repository, so that for example issues can be assigned to them?

It's good to have volunteers working on the code who are not yet committers, but how to manage that if they can't have issues assigned to them, or can't assign issues to themselves.I think at ASF anyone with a user identifier can assign themselves a JIRA issue, regardless of their commit status.

Create a new streamsx.parsers repository

Repository for toolkits containing operators that parse and convert to and from SPL types.

Document updates

Ideally, we'd like document updates (ie, run spl-make-doc) whenever new code is pushed to GitHub. I don't see a way to get all the way there just yet.

(1) If we can have an email address that can run a script on the when an email is received, then I believe the whole thing is do-able. We'd add a hook to email that email address on a push;when email is received, the script would run and update the docs. But I don't think we have such an email address, so...

(2) What we can do fairly easily is add commit hook that updates the documentation in the local gh-pages branch. The person committing would still have to push to the gh-pages branch when they push to the master. (We could include the push automatically in the pre-commit script, but the need to give a password makes this a little ugly, and if one made a commit but didn't push, the documentation would be out there without the corresponding code.) We could then have another script (updateDocs.pl) that did the actual push. So the workflow would be:
git commit .... // doc stuff invisibly generated
git push origin master // code pushed to github
updateDocs.pl // doc stuff pushed to github

Proposal: Transit toolkit with initial NextBus content.

NextBus provides a public api with live position and meta data for various transit systems in the US, e.g. positional data for buses in the San Francisco Muni system.

http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf

A NextBus toolkit would provide composite operators, types and applications to pull live data from NextBus.

I think this could be a good live data set to build sample applications for Streams, demonstrating use of various open source and product toolkits. E.g. initially it would use HTTPXMLContent from streamsx.inet, XMLParse from SPL standard toolkit, the geospatial toolkit and time functionality from streamsx.datetime. Subsequent improvements might use the timeseries toolkit.

Operator Conventions

I started a wiki page containing operator conventions to help people who want to contribute and during code reviews. The idea is to try to ensure that operators have a consistent look & feel, e.g. parameters that represent time are always float64 and have seconds as their units.

If you think it's a good idea, please add any conventions you know about.

Document Text eXtraction (DTX) Toolkit

DTX provides a canonical operator that combines several different open source text extractors. The operator extracts text and metadata from various document types.
The operator automatically determines the MIME document type and extracts the text and its metadata
DTX uses multiple third party and open source document extraction technologies, and can be enhanced with additional commercial /proprietary extractors.

streamsx.log4jAppender - toolkit

Toolkit Proposal

The log4j logging faciliy can be configured to append log messages to a network socket which will transmit them to a server. Messages arriving at the server are processed by the server's appender. Streams can act as the server's appender injecting the logging messages into Streams. Once the messages are in Streams they can be processed filter/analyzed/alerted.

Process suggestion

Reading the process comment, I see:
"The proposal will be discussed on the IBMStreams/toolkits wiki. A new page will be created for each proposal request."

I think it would make more sense to discuss the proposal in the issue itself. The issue thread keeps track of who said what when. With a wiki, that gets hidden. The proposal itself might belong on the wiki, so that it can be clear what the lastest proposal is, but I think discussion around it better fits in an issue or pull request.

Adding external libraries as dependencies

I would like to discuss the best way to add external libraries (with acceptable licenses) required by our toolkits to the build process.
For the Yarn resourcemanager, we used maven to specify our dependencies and build.
Should we consider maven for toolkits as well?
An alternative would be to do this manually. For every library/dependency, use an environment variable to point to it and include it in the operator and build.xml.
Any other alternatives/thoughts?

streamsx.dps

streamsx.dps is a Streams Distributed Process Store (DPS) toolkit. It allows the Streams operators belonging to one or more applications running on one or more machines to share application specific state in a seamless way using an external NoSQL K/V data stores. Some of the popular K/V data stores supported by this toolkit are listed below:

memcached
Redis
Cassandra
IBM Cloudant
HBase
Mongo
Couchbase

Resolve toolkit versioning

Be good to get guidelines on how to version toolkits, otherwise it's likely that multiple different copies of the toolkit will exist with the same version number, as commits are made and people copy the code without using a "formal" release.

https://github.com/IBMStreams/toolkits/wiki/Toolkit-Versions

Sound

Link to a toolkit that use libsnd.
https://www.ibm.com/developerworks/community/forums/html/topic?id=8753a6fe-8d0a-471c-9ea7-502af3dfdc0d&ps=25

This toolkit permits to read sound files

SmartParser toolkit

One of the repeating tasks of each PoC/Project I've worked with Streams is a data ingestion.
Yes, CSV format is very popular and for small PoT's is a good choice, but in the field a lot of customers/products have their proprietary formats and integrating Streams with them can take 20-30% time from entire PoC.
Another issue - CSV produces a flat tuple format, so if a data consists from 100 fields then the tuple should be defined with 100 attributes.
SmartParser toolkit comes to ease the parsing, tuple structure definition as well as tuple mapping development steps. The toolkit allows to parse custom formats producing desired hierarchical tuple (including lists, maps and even sets) saving the common step of mapping the flat format to the required Streams tuple.

Example 1:
Custom data format - fatherName, motherName, childName* (0 or more childs)
Streams tuple format - tuple< rstring fatherName, rstring motherName, list childs >

Example 2:
Custom data format - key1 : value1, key2 : value3, key3 : value3
Streams tuple format - tuple< map<rstring,rstring> keyValue >

Example 3:
Custom data format - personal data: \n business data:
Streams tuple format - tuple< tuple personalData, tuple businessData >

samples repository is empty, can't issue pull request

Because the samples repository is empty, I can't issue a pull request on it. I have two sample applications I would like to add, which are currently in my repo: https://github.com/scotts/samples

Perhaps we can just clone my repo to be samples.

Ngrams toolkit

A toolkit intented to be a supplement to a wide range of algorithms that need to calculate n-grams, such as NLP, machine learning, speech recognition, compression and etc.
The toolkit implements rolling hash technique and usually runs with better performance that brute force. Also the performance is almost the same for any 'n'.

Display

Link to a toolkit that use GNU Plot to display near real-time data

https://www.ibm.com/developerworks/community/forums/html/topic?id=6c3d3084-38f3-4177-a45e-1cbce7a017f5&ps=25

Discuss web page instructions

This is an issue for discussing the wiki page: Creating a web page for your toolkit

Adding consistent labels to toolkit repositories to help manage issues

Issues in Github is a flat list of work items. There is no concept of priority, severity, and status like other systems. When this list is small, it may be easy to prioritize and manage, but can quickly get out-of-hand as the repositories grow. We need a way to help us manage the issues.

I am not suggesting that we need these for all repositories, but should be considered by Repository Committers based on how they want to manage issues.

Here's a list of proposed labels to help with issues management:

Category Labels

Task
Bug
Feature
Discussion - I think this discussion category is helpful as we have been using issues for a lot of discussions.
Idea - anyone who wants to propose an idea, can use this.  Different from feature where idea is a proposal / suggestion.  Features may result in actual coding being done.  We may want to use this tag for repository proposal to separate them from project admin discussions.

Priority Label

HIGH
LOW

If unassigned, it is by default medium priority. Allow developers to prioritize work.

Severity Label
BLOCKING
MAJOR
NORMAL
MINOR

If unassigned, it is by default normal. Allow issue reporter to let us know how bad the problem is.

Status Labels

triaged
in-progress
fixed
invalid (not-reproducible, working as designed, etc.)
need feedback

With the labels, we can then create queries to help us look at issues that we want. With no status label, issue is new and not-triaged.

Thoughts?

MongoDB toolkit

The toolkit adds support for MongoDB database.
It was created to fulfil a customer request to have an ability to insert a data from Streams to MongoDB.
For now only Insert operator is implemented.
The toolkit is written in C++ and compiled with the latest MongoDB compat26 (stable) driver.
Like MongoDB the operator is schemaless (no XML config needed) and supports any nested tuple or map - the corresponding code is generated in a build time.
Documentation and samples will be added soon.
As the next step I plan to add Query operator.
COO is complete, so the toolkit can be published on IBMStreams.

Are the management committee members listed anywhere?

Is there somewhere here that has the management committee members? Could they be listed either in the process document itself, a separate document in the repository, or in the wiki?

Build repository proposal

Create a repository for utilities related to building SPL applications and toolkits.

E.g. plugins for Apache Ant, or Maven.

First code would be the Apache Ant tasks for Streams that are on Streams Exchange.

Add a streamsx.plumbing repository

The streamsx.plumbing repository will contain operators that change the flow of tuples in an application, but are not part of the logic of the application. I would like to add ElasticLoadBalance to this repository initially, which is a drop-in replacement for ThreadedSplit, but with elasticity.

However, I see other operators falling under this category. For example, I have encountered several implementations of merges, to be used in conjunction with UDP, for preserving tuple ordering in parallel regions. We also have several examples from our own Standard Toolkit which fall under this category (such as Gate, Switch and Throttle), and may be more appropriate here than in the generic "utility" category.

images for IBMStreams repository guidelines

copy repository URL button, from github.com

clone repository button, from EGit

new empty repository, from EGIT

new toolkit operator, from Eclipse

add toolkit to repository, step A, from Eclipse

add toolkit to repository, step B, from Eclipse

git directory and file icon annotations, from Eclipse

team commit button, from Eclipse

team push to upstream button, from Eclipse

team pull button, from Eclipse

changed files, from Eclipse

adding a sample to a repository, from Eclipse

toolkit and sample in git, from EGit

displaying version numbers:

setting version number:

add SSH protocol to Clone Git Repository dialog:

select Import for Clone Git Repository dialog:

Having a tookit for parsing Hexadecimal Data

Having a operator that can extract/decode data from it Hexadecimal format.

We can have thousands of sensors so we could not have one implementation per sensor. As the data extraction is usually the same, we could have a Descriptor for all sensor's data.

This data extraction is needed for example in the automotive industry.

Illustration

InputData

      sensorID,data in Hex
      001,"195E32004CFFFF1E"
      002,"00009779AF3B00"
      003,"0000000000000000"
      001,"195E32004CFFFF1E"
      001,"195E32004CFFFF1E"
      1234,"1000000000780000"
      003,"0000000000000000"
      001,"195E32004CFFFF1E"
      001,"197032004CFFFF1E"
      002,"00009779AF4A00"
      003,"0000000000000000"
      001,"197032004CFFFF1E"
      001,"197032004CFFFF1E"
      1600,"F320014300"

Each sensorId has it own way to extract data.

Descriptor definition

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <raws>
      <raw id="001" name="ABC" length="8" >
        <packet id="1" name="Data_1" start="7" length="9" function="(0.125 * x) +     0"></packet>
      </raw>
      <raw id="1234" name="XXX" length="8" multiplex="a">
            <packet id="11" name="a" start="1" length="8" signed="true" ></packet>
            <packet id="12" name="b" multiplex="10" start="8" length="2" signed="true" function="(12 * x) + 4"></packet>
        <packet id="13" name="c" multiplex="20" start="8" length="2" signed="true" ></packet>
            <packet id="14" name="d" start="11" length="5" signed="true" function="pow(x,2) / 23.7658" ></packet>
      </raw>
    </raws>

Create web page for streamx.hbase

I'd like to have a web page for streamsx.hbase. Whatever template works--perhaps the one used by streamsx.inet, for consistency?

Mail

Link to a toolkit that could read emails and send emails for alerting

https://www.ibm.com/developerworks/community/forums/html/topic?id=6385f52c-6f05-4702-a817-bd328ab6a00d&ps=25

Add a patterns repository

Java primitive operators benefit from being able to use pattern classes that provide common functionality. Examples of these are provided by the Streams product, such as PollingTupleProducer and TupleInTupleOut. This repository would contain a Java project (not an SPL toolkit) with (typically abstract) classes providing additional patterns and common functionality. Eventually maybe the product patterns (which already provide the code) would move to this repoository. Users of Streams have found the existing patterns useful, allowing the Java developer to focus on the required logic, rather than boilerplate code.

An example is a Filter pattern that allow the ability to implement a Filter in Java just by implementing a single method: public boolean filter(Tuple tuple). The pattern classes provides the implementation of the process method, schema checking etc.

SPL toolkits that reused the patterns would typically take the pattern jar from the tool and included it in their toolkit.

The repository would be laid out to support patterns in additional languages, not just Java.

A set of initial contents, including Filter, Split, RegexFilter and RegexSplit is at:

https://github.com/ddebrunner/streamsx.patterns

New sample application LogWatch

New sample application to be pulled into samples: https://github.com/scotts/LogWatch

Cannot label issues in this project

The process says to label an issue to propose a toolkit, but there doesn't seem to be a way to do it.

DBLoader

The 'DB Loader' is a SPL application for IBM InfoSphere Streams
It uses DB2 load command tool or Oracle SQL Loader to load files directly into database.

Thrift

This toolkit brings Apache Thrift framework to Streams (http://thrift.apache.org).
It enables cross-language clients to communicate with Streams transparently in async and sync ways.
ThriftSource operator automatically generates contract (thrift IDL file) at compile time based on output/input ports.
Client code can be easily created based on contract in any language supported by Thrift (C++, C#, Java, JavaScript, Node.js, Perl, Python, PHP, Ruby, Delphi, Erlang, Haskell, Cocoa, Smalltalk, OCaml and other languages).

What is the process for contributing samples?

I have created two issues under "samples" for pulling in my two sample repositories (found at https://github.com/scotts/LogWatch and https://github.com/scotts/LogWatchUDP). However, I now suspect that no one monitors the "samples" repository.

So, several questions:

Should I contribute samples by opening an issue under "toolkits"?
If "yes" to 1, then what is the purpose of the "samples" repository?
If "yes" to 1, then is "toolkits" the best name for this repository? Perhaps we should name it "administration" if it is for managing all issues related to tookits, utilities, samples and tutorials.

Non-toolkit projects

With StreamsExchange it was found that we needed an additional category beyond toolkits and applications (or samples) for utility code.

Can we host such projects here, examples are (on Streams Exchange):

Apache Ant utilities
lscpus

New utility to add: lscpus

My lscpus utility can be pulled from here: https://github.com/scotts/lscpus

moving half a dozen toolkits and demos from developerWorks

I have half a dozen toolkits and demos in 'Streams Exchange' on 'IBM developerWorks' that I want to move to 'IBMStreams' on 'github.com'. They are unrelated to each other, so I will need a separate repository for each one.

I'm not comfortable with the repository names that have been chosen for the first few toolkits. Names like 'com.ibm.streamsx.whatever' are fine for the operator namespaces and the Studio projects containing them. However, I think of repository names as 'titles', and 'https://github.com/IBMStreams' as a 'table of contents'. In a few years, when there are hundreds of repositories, we'll repository names that are more descriptive than 'streamsx.whatever'. I'd like my repository names to read the way we talk about them. When we talk to customers, we say 'the Streams whatever toolkit'. When customers look for that code, I'd like them to see a title like 'StreamsWhateverToolkit' in the table of contents.

I think its unfortunate that the product overloads the word 'toolkit'. When we talk to customers about source code, we distinguish between toolkits (collections of related functions and operators), sample applications, and demos. I'd like my repository names to reflect this distinction, so that after we have talked about 'the Streams Whatever Demo', customers looking for that code will see a title like 'StreamsWhateverDemo'.

So I'll propose these repository names for 'toolkits', which contain new functions and operators, plus sample applications that depend only upon them:

StreamsSynchronizeToolkit
StreamsOpenCVToolkit
StreamsLinuxShellToolkit
StreamsFileToolkit
StreamsNetworkPacketToolkit
StreamsNetworkContentToolkit

I'll propose these repository names for 'demos', which contain applications that depend upon other toolkits and do not contain new operators:

StreamsSignalGeneratorDemo
StreamsNetworkTrafficDemo

Does that make sense?

Toolkit suggestion: updated JSONHelper using SPL Java Operator API

Since the JSONEncoding interface has been added to the SPL Java Operator API it would seem to be a good idea to update the JSONHelper toolkit to use it. Not only would it simplify the implementation, it would ensure that exactly the same underlying encoding logic was executed regardless of access mechanism.

The only reason I can think of to keep the "manual" implementation is to allow for a some kind of different encoding than what the Java Operator API does. In that case it would be useful to have a second helper toolkit that uses the native API to allow SPL programmers to convert tuples to and from JSON using that encoding logic without everyone having to create their own custom java operator.

streamsx.ps

streamsx.ps is a Streams Process Store toolkit which will allow sharing of application specific state among multiple operators that are fused together into a single PE (Processing Element).

Toolkit for HBASE

I'd like to create a GitHub repository for the HBASE toolkit I have posted on Streams Exchange:
https://www.ibm.com/developerworks/community/files/app#/file/d65b213d-96b5-474d-b911-969e3aff6a84

This toolkit allows Streams to write tuples into HBASE and to read tuples from HBASE. To use it, you set HADOOP_HOME and HBASE_HOME, and it picks up the HBASE configuarion from there. It uses the operator parameter values to determine what's a row, columnFamily, columnQualifier, value, and so on. It includes:

HBASEPut (including checkAndPut support)
HBASEGet
HBASEDelete (including checkAndDelete support)
HBASEIncrement
HBASEScan

I have also included at least one sample for each of those operators. I've gotten feedback on the toolkit from at least two groups, so there is interest.

It wouldn't be my preference to put it with the other bigdata operators at the moment. First, it might undergo a somewhat more rapid evolution than the BigData toolkit does, and from a different set of contributors. In that case, it makes sense to keep the contributor sets separate for now. Second, there is another HBASE toolkit available, from a different author, modeled somewhat more after the database toolkit. Picking one of the two hbase toolkits to include in the bigdata package might be seen as making the other less relevant.

I could put it under my own GitHub account if the management committee thinks that's a better place to incubate it.

Proposed namespace: com.ibm.streamsx.bigdata.hbase. (For the other hbase toolkit, if it ends up being contributed, I would suggest com.ibm.streamsx.db.hbase, since it has closer ties to the database toolkit.)

wsserver.samples - repository request

Repository of samples using the com.ibm.streamsx.inet.wsserver::* operators in conjunction with other toolkits. The two examples that I have staged....

wsserve operators with the com.ibm.streamsx.inet.rest::WebContext to access iPhone accelerometer (this content is for a Developers Works article)
wsserve operators with the CEP toolkit to create gesture based voting application.

Discuss "How to put a toolkit in an empty repository"

Wiki page: https://github.com/IBMStreams/toolkits/wiki/How-to-put-a-toolkit-into-an-empty-repository

The page was seeded with directions from @chanskw.

Toolkit creation

I had to developed several toolkits for PoCs and wanted to create more.

Is there any way to create a kind of "toolkit incubator" where we can post toolkits that was developed for PoC, demos ?

If other guys find an interest on, then we could create a community ?

What about having a common structure for those functions ?

streamsx.utility.datetime
streamsx.utility.math
...........