telefonicaid / fiware-cygnus Goto Github PK

A connector in charge of persisting context data sources into other third-party databases and storage systems, creating a historical view of the context

Home Page: https://fiware-cygnus.rtfd.io/

License: GNU Affero General Public License v3.0

Shell 5.80% Java 82.26% Python 6.09% HTML 0.10% Gherkin 4.73% Dockerfile 1.02%

fiware fiware-cygnus historical-data

fiware-cygnus's Introduction

Cygnus

Cygnus is a connector in charge of persisting context data sources into other third-party databases and storage systems, creating a historical view of the context. Internally, Cygnus is based on Apache Flume, Flume is a data flow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. It was built to automate the flow of data between systems. While the term 'dataflow' can be used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems.

Each data persistence agent within Cygnus is composed of three parts - a listener or source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.

This project is part of FIWARE. For more information check the FIWARE Catalogue entry for the Core Context Management.

📚 Documentation	🎓 Academy	quay.io	🐳 Docker Hub	🎯 Roadmap

Background
Install
Roadmap
Further Documentation
License

Background

Internally, Cygnus is based on Apache Flume, a technology addressing the design and execution of data collection and persistence agents. An agent is basically composed of a listener or source in charge of receiving the data, a channel where the source puts the data once it has been transformed into a Flume event, and a sink, which takes Flume events from the channel in order to persist the data within its body into a third-party storage.

Cygnus is designed to run a specific Flume agent per source of data.

Current stable release is able to persist the following sources of data in the following third-party storages:

NGSI-like context data in:
- HDFS, the Hadoop distributed file system.
- MySQL, the well-known relational database manager.
- CKAN, an Open Data platform.
- MongoDB, the NoSQL document-oriented database.
- STH Comet, a Short-Term Historic database built on top of MongoDB.
- Kafka, the publish-subscribe messaging broker.
- DynamoDB, a cloud-based NoSQL database by Amazon Web Services.
- PostgreSQL, the well-known relational database manager.
- Carto, the database specialized in geolocated data.
- PostGIS, a spatial database extender for PostgreSQL object-relational database.
- Orion, the FIWARE Context Broker.
- Elasticsearch, the distributed full-text search engine with JSON documents.
- Arcgis, the Arcgis is a geographic information system (GIS).
Twitter data in:
- HDFS, the Hadoop distributed file system.

IMPORTANT NOTE: for the time being, cygnus-ngsi, cygnus-twitter and cygnus-ngsi-ld agents cannot be installed in the same base path, because of an incompatibility with the required version of the httpclient library. Of course, if you are going to use just one of the agents, there is no problem at all.

cygnus-ngsi:
cygus-twitter :
cygus-ngsi-ld :

Cygnus place in FIWARE architecture

Cygnus (more specifically, cygnus-ngsi agent) plays the role of a connector between Orion Context Broker (which is a NGSI source of data) and many FIWARE storages such as CKAN, Cosmos Big Data (Hadoop) and STH Comet. Of course, as previously said, you may add MySQL, Kafka, Carto, etc as other non FIWARE storages to the FIWARE architecture.

Install

Fiware/Cygnus has four sub-modules cygnus-common, cygnus-ngsi and cygnus-twitter, cygnus-ngsi-ld. Information about how to install these modules can be found at the corresponding section of Installation.

Following are the links of install section of Fiware/Cygnus sub-modules:

cygnus-common:
- Install from sources
- Install with docker
cygnus-ngsi:
- Install from sources
- Install with docker
cygnus-twitter:
- Install from sources
- Install with docker
cygnus-ngsi-ld:
- Install from sources
- Install with docker

Roadmap

The roadmap of this FIWARE GE is described here.

Further documentation

The per agent Quick Start Guide found at readthedocs.org provides a good documentation summary (cygnus-ngsi, cygnus-twitter).

Nevertheless, both the Installation and Administration Guide and the User and Programmer Guide for each agent also found at readthedocs.org cover more advanced topics.

The per agent Flume Extensions Catalogue completes the available documentation for Cygnus (cygnus-ngsi, cygnus-twitter).

Reporting issues and contact information

Any doubt you may have, please refer to the Cygnus Core Team.

License

Cygnus is licensed under Affero General Public License (GPL) version 3. You can find a copy of this license in the repository.

Further information on the use of the AGPL open source license

Are there any legal issues with AGPL 3.0? Is it safe for me to use?

There is absolutely no problem in using a product licensed under AGPL 3.0. Issues with GPL (or AGPL) licenses are mostly related with the fact that different people assign different interpretations on the meaning of the term “derivate work” used in these licenses. Due to this, some people believe that there is a risk in just using software under GPL or AGPL licenses (even without modifying it).

For the avoidance of doubt, the owners of this software licensed under an AGPL 3.0 license wish to make a clarifying public statement as follows:

Please note that software derived as a result of modifying the source code of this software in order to fix a bug or incorporate enhancements is considered a derivative work of the product. Software that merely uses or aggregates (i.e. links to) an otherwise unmodified version of existing software is not considered a derivative work, and therefore it does not need to be released as under the same license, or even released as open source.

fiware-cygnus's People

Contributors

Stargazers

Watchers

Forkers

frarape b-rich palbaladejo neich nikolai5319 joseluisrgupct gianluca-isgro arturloewen andresgeezar c-castillo xdelox javimolla dtorresnava web-bert viur c10a frbattid javipalanca gizero geographicags charwliu rianwouters gti-ia pcoello25 netzahdzc sabrine2020 silvamatteus smartsdk mfcardenas theill dmartinezgomez eduingles dcalvoalonso mriveros joudsi nmatsui matsdemeyer vjain641 tech-sketch yujiazama lmreyes mdrio crs4 grenouille06 smartfog s1mbi0se amfgomez jason-fox luisdelacalle amfergom pooja1pathak jagatjot roboticbase dmtzz thenextlvl manucarrace anmunoz lammoth chandradeep11 gorkagal1983 yatinarora-nec daigo0415 nhasumi madhu-nec vicentelt srrspace stdri ging ryo-ma joelcamus samueltjackson saxionsmartcities ssavva05 angpa dacak mapedraza fiware-ges bobeal flopezag bespanto ggkioka 2212david cafsenra jgcoronado jespalomo vitorinojp pchaozhong seshan01 aradait kajal583 takano32 danielvillalbamota santcorc madhu1029 savvasmohito keshavsoni2511 anjali-nec fisuda skorpinakos nilismith

fiware-cygnus's Issues

CKAN default_dataset prefixed by org name

Giving CKAN limitations about not having same dataset name in different organizations (see http://stackoverflow.com/questions/24203808/is-it-possible-to-create-packages-with-the-same-name-in-different-organizations) the "default_dataset" name in CKAN configuration has to be prefixed by the name of the organiation, to make it unique.

Cygnus: implement a WebHDFS backend manager

Do the same work that has been done with HttpFS in https://github.com/telefonicaid/fiware-connectors/blob/develop/cosmos/cygnus/src/main/java/es/tid/fiware/orionconnectors/cosmosinjector/hdfs/HttpFSBackend.java

OrionSTHSink

Implement another sink for MongoDB and test it along with the Cosmos sink, e.g. run the program with 2 sinks activated and check that when a notifyContextRequest arrives it is persisted in both places, check if Flume can implement "all or nothing" semantics (i.e. if one of the sinks is offline, for instance the connection to MongoDB is broken, then the notifyContextRequest is not persisted in Cosmos), etc.

Remove/relax the Orion version check

The Orion version check is carrying out more troubles than advantages.: it was initially introduced in order to check incompatibility issues among Orion and Cygnus versions, but that point seems to be relevant anymore. In addition, It is currently a source of problems because people forget to change the accepted version when moving from an Orion version to another.

Should we relax the checking, by inspecting only if the header is about Orion? Or should we remove it completely?

Cygnus: unit test

Complement the current code with unit test covering its functionality.

This issue will be closed when the unit test base reach the current "functionality" code. After that, develop should follow TDD principles and unit test should be produced in a regular basic each new PR.

[poc branch] Relax version header checking

Look for a string starint with "orion/", but not for an actual version checking.

Alternatvely, we can move this to the configuration, eg:

check_version_header = true
orion_version = 0.10.*

or something like that.

Cygnus: ability to set the per file consolidation level

Migrated from telefonicaid/fiware-livedemoapp#5:

Currently, each attribute goes to a different file always. However, a more flexible approach will be to use a selector (in the process configuration) to chose between different consolidation levels:

Per attribute files (as it is now)
Per entity files: all the attributes of the given entity goes to the same file
One file: all attributes of all entities goes to the same file.

The naming of the file would be adjusted consequently.

@frbattid aditions, a brief explanation on the reasons that leaded to this:

Not all the attributes regarding an event may be updated at the same time. Therefore, if we record a line per each event, and that line must contain values for each attribute, then the result is a file containing thousands of lines, having each line several null values. Thus, we decided to store only pairs in order to “save” disk space (using big data is not a reason for being inefficient). Then we decided to split each pair in a separated file from the “human being” perspective, because having a single file will show a mesh of attributes, but having a file per attribute will show in a clear way the evolution of such attribute. The last point was probably unnecessary and could be avoided as proposed by Fermín.

Cygnus: management API for FI-LAB users

Moved (partially) from telefonicaid/fiware-livedemoapp#8:

Currently, the ngsi2cosmos is configured by FI-LAB staff in a "static" way. So if a FI-LAB user wants to store the data he/she publishes at CB to Cosmos (as happened during Santander hackaton), he/she has to talk with FI-LAB staff to do that configuration. This is a quite unflexible and not scalable approach.

Thus, the ngs2cosmos component should provide an API so users can configure CB->Cosmos rules. The processing of the request in this API will involve all the underlying actions in CB, Cosmos and ngsi2cosmos components:

In Cosmos: creating a dataset for the user
In CB: set up the proper subscribeContext for the user entities
In ngsi2cosmos: configure the logic so the notifiyContext corresponding to the above subscription goes at the end to Cosmos through its HttpFs or WebHDFS API.

[poc branch] Notification URL to listen should be included in configuration

The URL in which listen should be also included in the configuration, instead of "hardwiring" it in the code.

Cygnus: implement a Infinity backend manager

Do the same work that has been done with HttpFS https://github.com/telefonicaid/fiware-connectors/blob/develop/cosmos/cygnus/src/main/java/es/tid/fiware/orionconnectors/cosmosinjector/hdfs/HttpFSBackend.java

This is a requirement for Cosmos 0.x | x >= 16.

Cygnus: multi-tenancy features

Current version of Cygnus is ready for managing the context data regarding a single user and a single dataset.

There already exists an issue for the multi-dataset managemet (telefonicaid/fiware-connectors#16), it must be analyzed the degree of disjuntion among it and the multi-user support problem.

Multy-user capabilities, or multi-tenancy, is about 2 aspects:

How the data sent by Orion is somehow "labeled" in a per-user fashion.
How to setup the, conceptually speaking, correspondence table matching users and datasets (here is where the issue telefonicaid/fiware-connectors#16 may have relation).

Problem using "/" in orionagent.sinks.hdfs-sink.cosmos_dataset

Using

orionagent.sinks.hdfs-sink.cosmos_dataset = foo/bar

Causes using the directory:

/user/fermin/foo-bar

instead of (as expected):

/user/fermin/foo/bar

Cygnus 0.1 README update

I identify the following missed/updatable content in the README:

No explanation about how to create the RPM based on the neore folder.
No git checkout release/0.1 command after git clone. This is a common error reported by our users.
The file is not UTF-8.
"the_json_part" in the XML section.
Several appearances of "cosmos-injector" or "fiware-connectors" instead of "cygnus".

I think an update on the README is necessary. What do you think?

Cygnus: Create dataset and Hive table at startup time if they doesn't exist

Same behaviour that the one implemented in ngsi2cosmos at the end of its lifecycle (see comments to telefonicaid/fiware-livedemoapp#8).

CKAN: cache management for orgs/datasets/resources

Currently CKAN persistence backend uses several HashMaps for f(Org,Entiy) -> resourceId and f(Org) -> defaultPackageId maping. These HashMaps act as a cache that is populated one time but never updated. Thus, if an out-of-the band change takes place in CKAN (e.g. an Org is renamed or deleted) the CKAN module will fail until Cygnus gets restarted.

This should be improved with a better cache management.

Effort: 6 man day

Cygnus: flexible clasification of data into many HDFS directories

Migrated from telefonicaid/fiware-livedemoapp#4

Use a text file as ngsi2cosmos configuration in the following format:

<id_pattern1>|<id_type1>|<dataset1>
<id_pattern2>|<id_type2>|<dataset2>
<id_pattern3>|<id_type3>|<dataset3>
...

So, each time a new context element is received, ngsi2cosmos check that table (from top to bottom) to find to with HDFS directory send the data.

Eg.:

OUTSMART.NODE.*|Node|/user/opendata/smartcities/santander/llamas
OUTSMART.AMMS.*|AMMS|/user/opendata/smartcities/santander/llamas
OUTSMART.RG.*|Regulator|/user/opendata/smartcities/santander/llamas
urn:smartsantander:testbed:.*|Sensor|/user/opendata/smartcities/santander/smart

New approach for Hive tables creation due to column mode persistence

In the row mode no table name was needed because all the files within the same HDFS folder were going to be added to the same table. Why and how was this possible? Because the external tables work by pointing to a HDFS folder and because all the tables have the same attributes. But with the column mode we have several files containing each one a different number of columns. Thus a unique Hive table for all the files within the same HDFS folder in not valid anymore.

The first obvious solution seems to be to have a subfolder per each file/entity:

Current folder structure (only valid for the row mode, a single Hive table):

hdfs:///user/basuras_valencia/data/entity_1_file.txt

hdfs:///user/basuras_valencia/data/entity_2_file.txt
New proposal (valid for both row and column mode, a Hive table per each entity, as MySQL and CKAN do):

hdfs:///user/basuras_valencia/entity_1/entity_1_file.txt

hdfs:///user/basuras_valencia/entity_2/entity_2_file.txt

Some design idead for the new NGSI connector

Based on our previous experience with the ngsi2cosmos prototype, this issue describe some design ideas to define the evolution of this component.

Although the interface towards Orion Context Broker doesn't change (i.e. it will be based on notifyContextRequest, produced by a subscription at Orion), the new version of the component will allow different persistence backends, that can be used simultaneously (i.e. the process could be configured to persist each notified context element to both a Cosmos backend and a CKAN backend, side note: using several backends at the same time introduces the issue of transactionality, I'd suggest to keep first version best-effort and simple and doesn't take into account transactionality). In fact, we should think in a modular and extensible approach, so new backends could be added in the future. At the present moment the list will be Cosmos, MongoDB and CKAN.

Thus, the name "ngsi2cosmos" should be changed for something more general, e.g. "ngsi_connector" (short but not precise, as connector may refer to input or output), or "ngsi_output_connector" (more precise, but long). Any idea? :)

Next, the command line should be improved (ngsi2cosmos relies on putting the argument in one exact order, combining mandatory ones with optional ones... that is a mess!). The command line could be something very simple, with just 3 parameters:

-c <conf_file>, path for the configuration file (by default, if this is omitted a default value could be used, e.g. the file named "ngsi_injector.conf" in the current directory).
-u, for printing the usage message
-v, for printing version (side note: we need to define some kind of packaging mechanism, to keep control of the different versions of the program, in the same way Orion has; has Python any facility to do this?).

Optionally, we could consider the same approach used by nova_event_listener: parameters specified in the conf file but with the possibility to be overridden by the same parameter name in the command line (makes sense for the dictionary-based format, but not sure in the case of JSON-based format, see discussion on this below).

Regarding the configuration file, it could be structured in a common section, for information that applies to all the possible backends and per-backend type sections (e.g. a section for Cosmos, a section for MongoDB). The way of enabling a particular backend would be including a corresponding section in the configuration file (in which case, the particular information in that section could be used).

What format to use in the configuration file? I think that two approaches are possible: dictionary-based (as the one used by nova_event_listener) or JSON-based. The advantage of the later versus the former could be the possibility of including structured information easily, that, for some configuration pieces could be needed (e.g. solving the functionality at telefonicaid/fiware-livedemoapp#4). On the other hand, the advantage of dictionary-based is a more clear syntax (maybe the trend will change in the future, but I think that nowadays there is more people used to dictionary-based configuration files than to JSON-based configuration files).

Cygnus: Tool to subscribe Cygnus

Moved from telefonicaid/fiware-livedemoapp#11:

Configuring Orion-Cosmos integration currently needs a NGSI10 suscribeContext operation so ngsi2cosmos receives Orion notifications. Currently, this subscription has to be done "manually".

However, it could be great to provide a simple script tool for user, so such tools does the subscription on behalf the user.

Cygnus: cygnus.conf template

Create a template for the cygnus.conf file

HDFS files translation script from 0.1 to 0.3

The purpose of this script is the same than the existing one about the conversion from 0.1 to 0.2. In fact, the new script should derive from that one.

The reason a new script is needed is becasue:

The HDFS file name changes, from cygnus-mysusername-mydataset-entityId-entityType to entityId-entityType (the username and dataset information was redundant, and the new naming convention is coherent with other persistence sinks).
Certain fields within the files have been renamed, specificaly ts is now recv_time_ts and iso8601date is now recv_time.

Per column attribute persistence (MySQL only)

Certain context data is necessary to be persisted in tables/CSV files containing a column per each context attribute, instead of persisting rows for each attribute-value received.

This involves both HDFS, CKAN and MySQL sinks, where a new parameter will govern the behaviour of the sink (per column attributes, or per row attributes), but only MySQL will be considered within this sprint

Cygnus: versioning system

Implement a full fledged versioning system, so the .jar get marked with a proper version instead of using always the same "1.0-SNAPSHOT" and a changelog between version can be maintained.

Probably Maven has some plugin to manage that. Research needed.

ckan2hdfs connector

This is currently under study, but possibly a mechanism for persisting CKAN data into HDFS could be necessary.

This connector is different than the Cygnus one in the sense it is not driven by data events, i.e. an event is sent to Orion and a Notification is automatically received by Cygnus, but a batch copying process that should be programmed in some way, or a batch copying process activated by a single triggered management event (coming from the CKAN web portal).

Cygnus: create hierarchies for backends and sinks

Recent development of OrionCKANSink has revealed common parts with OrionHDFSSink. This suggest the creation of a parent class.

In the case of HDFS backends, there exists an interface class called HDFSBackend, but it only governs the "look and feel" of the HDFS-related classes, but does not implements any common part of code.

Cygnus: HDFS persistence of context data in Json format

Create a new HDFS-related sink for Flume in order the context data coming from Orion can be persisted as Json data.

This is totally different way of persisting Orion context data in Cosmos. Instead of the current method based on "building" lines of CSV-like text and appending them to a file, one per each entity-attribute pair, the goal is to persist Json data containing the same semantic information:

{“ts”:”XXX”, “dateStr”:”XXX”, “entityId”:”XXX”, “entityType”:”XXX”, “attrName”:”XXX”, “attrType”:”XXX”, “attrVlaue:”XXX”}

The advantage is that being the line build in Json format, the attrValue can also be a Json data:

{"ts":"3453453245", "dateStr":"23-04-2013", "entityId":"room1", "entityType":"Room", "attrName":"complex2", "attrType":"json", "attrValue": '{"c":["d":"3", "e":"4"]}'

Prelimirary tests have reveal this is possible from the HDFS storage point of view (for HDFS this is only "text"):

$ hadoop fs -cat test/*
{"ts":"3453453245", "dateStr":"23-04-2013", "entityId":"room1", "entityType":"Room", "attrName":"temperature", "attrType":"integer", "attrValue": "20"}
{"ts":"3423453454", "dateStr":"14-05-2013", "entityId":"room2", "entityType":"Room", "attrName":"temperature", "attrType":"integer", "attrValue": "23"}
{"ts":"3453453245", "dateStr":"23-04-2013", "entityId":"room3", "entityType":"Room", "attrName":"temperature", "attrType":"integer", "attrValue": "29"}
{"ts":"3453453245", "dateStr":"23-04-2013", "entityId":"room1", "entityType":"Room", "attrName":"temperature", "attrType":"integer", "attrValue": "21.5"}
{"ts":"3453453245", "dateStr":"23-04-2013", "entityId":"room3", "entityType":"Room", "attrName":"complex1", "attrType":"json", "attrValue": '{"a":"1", "b":"2"}'}
{"ts":"3453453245", "dateStr":"23-04-2013", "entityId":"room1", "entityType":"Room", "attrName":"complex2", "attrType":"json", "attrValue": '{"c":["d":"3", "e":"4"]}'

Even, it is possible to get managed with HiveQL if a serde (serializer/deserializer) is used (https://github.com/rcongiu/Hive-JSON-Serde is the recommended one):

hive> describe opendata_orion_json;
OK
ts      bigint  from deserializer
datestr string  from deserializer
entityid        string  from deserializer
entitytype      string  from deserializer
attrname        string  from deserializer
attrtype        string  from deserializer
attrvalue       string  from deserializer
hive> select * from opendata_orion_json;                           
OK
3453453245      23-04-2013      room1   Room    temperature     integer 20
3423453454      14-05-2013      room2   Room    temperature     integer 23
3453453245      23-04-2013      room3   Room    temperature     integer 29
3453453245      23-04-2013      room1   Room    temperature     integer 21.5
3453453245      23-04-2013      room3   Room    complex1        json    {"a":"1", "b":"2"}
3453453245      23-04-2013      room1   Room    complex2        json    {"c":["d":"3", "e":"4"]}

A MapReduce job should be able to deal with this new format, everything depends on how the Mapper interprets the incoming data.

The existent data from Parque de las Llamas and SmartSantander a converter script must be developed.

Effort: 1 man day

[poc branch] XML rendering

Once we complete JSON testing, next step is to develop the XML source handler.

Cygnus: Check special characters for filenames

Moved from telefonicaid/fiware-livedemoapp#15:

As experienced with FINESCE people, if an entity or attribute name or type contains the "/" character then the persisted file for this entity-attribute pair will contain the "/" character. When creating thar file, WebHDFS/HttpFS understands a subdirectory must be created and then a file.

E.g. entity=mycar, entity_tpe=car, attribute_name=speed, attribute_type=km/h will be persisted as mycar-car-speed-km/h.txt, which in the end is the following directory /user/myuser/mystorage/mycar-car-speed-km containing a h.txt file.

Special characters usage must be checked before creating the file name.

sql2hdfs connector

As feedback got from Sevilla's FI-WARE event: it would be desirable a component able to dump all the data within a SQL batabase into Cosmos HDFS.

This could start a new kind of "dumping" components which must be disabled once all the data has been moved to Cosmos.

MySQL connector included in pom (and in jar with dependencies)

Currently, the MySQL connector must be manually added to /path/to/flume/plugins.d/cygnus/libext/. There is no need on doing such thing, specially when all the other third-party libraries are automatically bundled in the "jar-with-dependencies".

Thus, include the MySQL connector dependency in the pom.xml.

Rethink the the channel-based reliability mechanism given by Flume

Currently, if an event within the channel is not properly processed by a sink, it is not deleted from the channel in order the sink retries. This is part of the reliability mechanism of Flume, but I think this mechanism assumes the events are well formed and that the persistence problems are related to connectivity issues; thus makes sense to retry later. Nevertheless, the new OrionMySQLSink presents scenarios where the persistence error is given by malformed events. Thus, something has to be done, alternatives:

Add a enable/disable option in the configuration in order to totally disable the reliability mechanism.
Be smart and only allow for retries when a connectivity problem arises, otherwise discard the event and log about it.

Effort: 3 man day

Cygnus: change packaging domains and Java folders accordingly to the repository name

Current Cygnus Java folder is:

fiware-connectors/flume/src/main/java/es/tid/fiware/orionconnectors/cosmosinjector

This leads to Java packages like this one for the the OrionHDFSSink.java class:

package es.tid.fiware.orionconnectors.cosmosinjector;

Accordingly to the repository name, these should be:

fiware-connectors/flume/src/main/java/es/tid/fiware/fiwareconnectors/cygnus
package es.tid.fiware.fiwareconnectors.cygnus;

.NET Hive driver

Raised in Sevilla workshop on March 2014:

Some people would be interested in using Hive from .NET. Although it seems that there aren't drivers for that platform, maybe a workaround is possible.

Cygnus: attribute metadata persistence

Orion 0.13.0 (to be released in May 2014) is expected to include metadata associated to NGSI10 attributes. In this sense, we should consider how these metadata are going to be persisted with Cygnus components.

Taking into account the format defined in telefonicaid/fiware-connectors#29, it would be a matter of including a new optional field, named "md" of type JSON vector:

{"ts":"3453453245", 
 "dateStr":"23-04-2013", 
 "entityId":"room1", 
 "entityType":"Room", 
 "attrName":"complex2", 
 "attrType":"json", 
 "attrValue": {"c":[ {"d":"3"}, {"e":"4"}] }, 
 "md": [ 
    {"name": "md1, "type": "string", "value": "somemd" }, 
    {"name": "md2, "type": "string", "value": "someothermd" } 
 ]  
}

The "md" field will be include only if the notifiyContextRequest includes metadata.

Cygnus: selective scaping of delimiter character to avoid "Cosmos injection"

Moved from telefonicaid/fiware-livedemoapp#7:

ngsi2cosmos.py should parse the contextValue before writing it to Cosmos, escaping the delimiter (usually "|"). Otherwise, the user could "inject several columns in a single field" potentially breaking the schema defined by tools such as Hive.

This escaping should be a optional feature (typically, a flag in the CLI or configuration file), given that some cases it could be useful to have this injection to simplify NGSI model definition.

@frbattid is also having a look to this.

[poc branch] README.md modifications

Include JAVA_HOME configuration in the part about Java installation.

Include --confin the CLI to use to start the injector (it seems that in some systems, such as orion.lab it is needed).

CKAN: set resource URL with the convenience operation to retrieve data

Instead of the dummy value that is shown now:

Cygnus: dealing with '&' in notifications?

Moved from telefonicaid/fiware-livedemoapp#12:

@fgalan: Found during Campus Party Brazil 2013: We have found that when the updateContext to Orion uses & in attributes value (e.g. "H&M") the notifyContextRequest sent to ngsi2cosmos breaks the program in some place (a 500 error is returned by the Flaks stack).

@kzangeli: It would be nice to know whether it happens with other special characters as well.
E.g. '/' and '?'
I don't think this is the brokers fault, but it must be investigated.
For now I will test whether the orion broker accepts these special chars in input payload.

@fgalan: As additional information, it seems the ContextBroker is ok (the notifyContextRequest message is received by ngsi2cosmos and event printed in the log just after getting it before start parsing). The problem is in ngsi2cosmos.

@kzangeli: Yeah, I have seen that orion id okay with 'special chars'.
My emacs in xml-mode paints the '&' and everything after it in red though.
That is normally an indication of a possible problem.
Could XML say something about 'necessity to escape certain chars'?

Script to remove old data in MySQL database

As part of the resource folder, with some instructions on how to use it (e.g. how to install in cron).

Per column attribute persistence (CKAN)

Recently we introduced the ability of configuring row vs. column storage style in MySQL, depending on a configuration parameter. The CKAN sink, which stores also information in tables, should use the same parametrization.

Hive queries to Cosmos from WireCloud

Raised in Seville workshop in March 2014:

It would be great to been able to perform the Hive query from a WirecCloud widget. Need to be explored.

Documentation proposal

Documenting Cygnus through the README was OK in first stages of the development, but now is producing a long document. In addition, documenting how to package Cygnus in a RPM and how to use such RPM will make it longer and thus, the user experience when dealing with the README may result very poor.

I've seen in other big projects such as Cosmos that a doc/ folder is created, where the different guides are put: installation, packaging, operation, etc. Following this approach I foresee these split documents for Cygnus:

Installation from sources.
Installation using a RPM (assuming an already built RPM is hosted in somerepo.fiware.org)
Packaging Cygnus.
Arquitecture.
Quick start guide.
(if desired), specific documents regarding the detailed design and development of the different developed sinks, handlers, etc

Finally, the README should have an introduction on Cygnus and links to the different documents in doc/.

Convert README.md to UTF

This way we will avoid problems with accents.

[poc branch] Serialize the timestamps in ISO 8601

toString() method from java.sql.Timestamp writes "yyyy-mm-dd hh:mm:ss" instead of ISO 8601 "yyyy-mm-ddThh:mm:ss".

Move reception timestamping from the sink to the source

Current timestamping for the received notifications is done in the sink, when the internal Flume event is read from the queue. This could not be a very accurate timestamping if the event remained for a while in the queue. Thus, I think a better mechanism would be to obtain the reception time in the source, before an internal Flume event is generated, and add the timestamp as a heacer of the Flume event.

[poc branch] Log to file

Added on behalf @frbattid

To allow logging to file, not only in console.

Change the name of the OrionMySQLSink databases and tables

There has been a change on the requirements for OrionMySQLSink. Current databases and tables are named, respectively:

cygnus_<client/service/tenant>
cygnus_<entityId>_<entityType>

New namig conventions remove the "cygnus_" part:

<client/service/tenant>
<entityId>_<entityType>

Per timestamp persisted context value replacement

Raised while meeting with IDP people (http://www.idp.es/) from DoF project (http://www.districtoffuture.eu/).

IDP people comes from the SCADA and critical infrastructures world, and for them it is very common a sensor becomes "fool" for a certain period of time, providing wrong measures (that in addition are timestamped). When the sensor is fixed, it may provide the correct measures with the correct timestamp, which is in the past, and using Orion+Cygnus+Cosmos the result is two different measures for the same contenxt elelemnt in the same tiemstamp.

IDP people suggest to replace the old measure with the new correct one.

Should Cugnys take care about this? Or should this be resolved by implementing fixing processes in the backend?

Per column attribute persistence (HDFS)

Certain context data is necessary to be persisted in tables/CSV files containing a column per each context attribute, instead of persisting rows for each attribute-value received.

This involves both HDFS and CKAN sinks (MySQL one was implemented in release/0.2), where a new parameter will govern the behaviour of the sink (per column attributes, or per row attributes).

Cygnus: MySQL database connector

The objetive is to create a new sink to persit context elements (sent to Cygnus by Orion using notifyContextRequest) in a relational database.

Regarding the data model to use in the relational database, three alternatives are on the table:

One "big table". All the entities and attributes are in the same table. The columns would be: timestamp, human readable timestamp, entity id, entity type, attr name, atrr type, attr value, attr md (serialized in JSON).
- Pros: simplifies table creation (static), simplifies aggregated queries.
- Cons: big table scalability?
Per-entity table. The name of the table is the name of the entity (actually, the concatenation of id and type, as we are doing now in other parts of Cygnus). The columns of the table are as in solution 1 except from entity id and entity type.
- Pros: better scalability than in alternative 1
- Cons: dynamic table creation (Cygnus has to verify that the table exist before inserting a new row).
Per context element (i.e. entity + attribute) table. The name of the table is the name of the context element (actually, the concatenation of entity id, entity type, attribute name and attribute type, as we are doing now in other parts of Cygnus). The columns of the table are as in solution 2 except from attr name and attr type.
- Pros: better scalability than in alternative 2 (really?)
- Cons: dynamic table creation (Cygnus has to verify that the table exist before inserting a new row), too "scattered model"?

In addition, as some SQL stamement may vary depending on the particular database technology, we should decide ASAP in which one we focus for the first version of the sink (e.g. MySQL, etc.).