enis / gora
Gora has moved to Apache Incubator, please go to http://incubator.apache.org/gora/
Home Page: http://incubator.apache.org/gora/
License: Apache License 2.0
File-backed data stores such as DataFileAvroStore can be used as mapreduce inputs. However, we should support reading more than one file as mapreduce input.
Moreover, when file-backed data stores are used as mapreduce outputs, we need to set the file names accordingly so that more than one reduce task can run.
Since hbase-mapping.xml is used by external clients, having gora in the name will be much clearer for them.
It would be convenient to add a deleteByQuery(Query) method to DataStore so that entries matching a specific query can be removed.
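A minimal sketch of the idea, using a plain in-memory map and a key predicate in place of a real gora DataStore and Query (both the class and method names here are illustrations, not the actual API):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Predicate;

// Sketch of the proposed deleteByQuery contract: remove every entry
// matching the query and report how many were deleted.
public class DeleteByQuerySketch {
  private final Map<String, String> rows = new ConcurrentSkipListMap<>();

  public void put(String key, String value) { rows.put(key, value); }

  /** Removes every entry whose key matches the query; returns the count. */
  public long deleteByQuery(Predicate<String> query) {
    long deleted = 0;
    Iterator<Map.Entry<String, String>> it = rows.entrySet().iterator();
    while (it.hasNext()) {
      if (query.test(it.next().getKey())) {
        it.remove();
        deleted++;
      }
    }
    return deleted;
  }

  public int size() { return rows.size(); }

  public static void main(String[] args) {
    DeleteByQuerySketch store = new DeleteByQuerySketch();
    store.put("com.example/a", "1");
    store.put("com.example/b", "2");
    store.put("org.example/c", "3");
    long deleted = store.deleteByQuery(k -> k.startsWith("com.example"));
    System.out.println(deleted + " deleted, " + store.size() + " left");
  }
}
```

The real method would take a gora Query, so backends like SQL could translate it into a single DELETE ... WHERE statement instead of iterating.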
Gora intends to be the core architecture for IO-heavy applications and mapreduce jobs, so performance is critical in many ways. There are lots of possible improvements, and performance bottlenecks that need to be identified.
We should implement benchmarks for measuring the performance of gora. In particular, we should compare using the raw avro and HBase APIs against going through Gora.
HBaseStore does not read the configuration for the mapping file from properties. It should use DataStoreFactory.getMappingFile() to get the mapping file name.
Null field treatment is a tricky issue in the context of persistence. Whichever strategy we choose, the semantics should be the same for all the supported data stores (except maybe file-based data stores).
The use cases for null fields are as follows:
We should add a memory-based data store (MemStore). It will help internal tests as well as testing by our clients.
MemStore should implement all public DataStore methods, including query operations, field, key and time filters.
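A hedged sketch of the core of such a store: it does not implement the real gora DataStore interface, but shows how a sorted in-memory map gives us get/put/delete plus the key-range scans that query execution needs.

```java
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// MemStore sketch: an in-memory, key-sorted store. ConcurrentSkipListMap
// keeps keys ordered, so start/end-key queries map directly to subMap().
public class MemStoreSketch<K extends Comparable<K>, T> {
  private final NavigableMap<K, T> map = new ConcurrentSkipListMap<>();

  public void put(K key, T obj) { map.put(key, obj); }

  public T get(K key) { return map.get(key); }

  public boolean delete(K key) { return map.remove(key) != null; }

  /** Range query over [startKey, endKey], the shape gora queries use. */
  public NavigableMap<K, T> execute(K startKey, K endKey) {
    return map.subMap(startKey, true, endKey, true);
  }

  public static void main(String[] args) {
    MemStoreSketch<String, Integer> store = new MemStoreSketch<>();
    store.put("a", 1);
    store.put("b", 2);
    store.put("c", 3);
    System.out.println(store.execute("a", "b").size()); // keys a..b inclusive
  }
}
```

Field and time filters would then be applied per returned object, since everything is local anyway.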
Pig is a data processing language. Pig supports reading/writing data in various formats through its Store concept, similar to Gora's DataStores.
We should build gora -> pig bindings so that any data store gora supports can be used with pig.
We should be able to use avro compression on SQL tables.
We may benefit from an avro store backed by a map file / TFile. Unlike the DataFile-backed avro store, map files support random gets by key, so some applications (such as tests, etc.) can use this as the main data store.
We extend HBase's test cases, which set up a mini cluster from scratch each time. This takes up to 1 minute. We should start the cluster once and run all the tests against it.
As reported in https://issues.apache.org/jira/browse/NUTCH-890, SQLStore does not work with nested record types.
We can rename DataStoreFactory to Gora and use this class as the public-facing Facade for third parties.
The following will make more sense:
Properties properties = Gora.properties;
Gora.createDataStore(HBaseStore.class, ....)
gora inter-module dependencies should be resolved local-first. This can speed up the build a lot.
If not, we end up with an incorrect SQL query, e.g.:
INSERT INTO webpages (id) VALUES ('uk.co.bbc.news:http/sport2/hi/video_and_audio/default.stm') ON DUPLICATE KEY UPDATE ;
where nothing follows UPDATE.
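A sketch of one possible fix when only the key column is set: MySQL's no-op idiom of assigning the key column to itself keeps the statement well-formed. The UpsertBuilder class and its parameters are illustrations (the table and column names are taken from the example above), not the actual SqlStore code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Builds an INSERT ... ON DUPLICATE KEY UPDATE statement that is valid
// even when there are no non-key columns to update.
public class UpsertBuilder {
  public static String buildUpsert(String table, String keyColumn,
                                   Map<String, String> nonKeyColumns) {
    StringJoiner cols = new StringJoiner(", ");
    StringJoiner vals = new StringJoiner(", ");
    StringJoiner updates = new StringJoiner(", ");
    cols.add(keyColumn);
    vals.add("?");
    for (String col : nonKeyColumns.keySet()) {
      cols.add(col);
      vals.add("?");
      updates.add(col + " = VALUES(" + col + ")");
    }
    // With no non-key columns, emit a no-op assignment so that
    // something always follows ON DUPLICATE KEY UPDATE.
    String updateClause = nonKeyColumns.isEmpty()
        ? keyColumn + " = " + keyColumn
        : updates.toString();
    return "INSERT INTO " + table + " (" + cols + ") VALUES (" + vals
        + ") ON DUPLICATE KEY UPDATE " + updateClause;
  }

  public static void main(String[] args) {
    System.out.println(buildUpsert("webpages", "id", new LinkedHashMap<>()));
  }
}
```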
The AvroStore uses the value set in the mapping file for naming the underlying table. It would be nice to be able to set the name via properties, e.g. for cases where we want to generate a temporary structure but keep a common mapping file.
Having a parameter name that is generic and not tied to a specific datastore implementation would be better, as the client code might need to specify this through the API and does not necessarily know what type of DataStore is actually used.
As per the discussion in https://issues.apache.org/jira/browse/NUTCH-891, until Gora releases its first version, nutch needs snapshot version numbers in its jars.
buffer.limit() throws an IllegalArgumentException when count == -1:

public static byte[] readFully(InputStream in) throws IOException {
  List<ByteBuffer> buffers = new ArrayList<ByteBuffer>(4);
  while (true) {
    ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
    buffers.add(buffer);
    int count = in.read(buffer.array(), 0, BUFFER_SIZE);
    buffer.limit(count); // count == -1 at end of stream, so this throws
    if (count < BUFFER_SIZE) break;
  }
  return getAsBytes(buffers);
}
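A minimal sketch of a fix: check for end of stream before using the count, so limit() is never called with -1. ByteArrayOutputStream is used here to avoid the list-of-buffers bookkeeping; BUFFER_SIZE and the class name are assumptions for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyFix {
  private static final int BUFFER_SIZE = 4096;

  public static byte[] readFully(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[BUFFER_SIZE];
    int count;
    // read() returns -1 at end of stream; test before using the count
    while ((count = in.read(buffer, 0, BUFFER_SIZE)) != -1) {
      out.write(buffer, 0, count); // copy only the bytes actually read
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello gora".getBytes("UTF-8");
    byte[] copy = readFully(new ByteArrayInputStream(data));
    System.out.println(copy.length); // 10
  }
}
```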
Currently, HBaseResult does not reuse objects, whereas AvroResult reuses them. Moreover, there is no way to pass in an object to be reused when using DataStore#get().
In short, we need to make object reuse configurable, and it should be explicit whether objects are reused or not.
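One possible shape for an explicit reuse contract, sketched against a hypothetical Result-style class (not the actual gora interface): pass null to get a fresh object, or pass a previously returned instance to have it cleared and refilled.

```java
import java.util.Iterator;
import java.util.List;

public class ReusableResult {
  public static class Row {
    StringBuilder value = new StringBuilder();

    void clear() { value.setLength(0); } // reset all fields before refill
  }

  private final Iterator<String> source;

  public ReusableResult(List<String> values) { this.source = values.iterator(); }

  /** Fills {@code reuse} if given, otherwise allocates; returns null at end. */
  public Row next(Row reuse) {
    if (!source.hasNext()) return null;
    Row row = (reuse != null) ? reuse : new Row();
    row.clear(); // stale fields must not leak into the next record
    row.value.append(source.next());
    return row;
  }

  public static void main(String[] args) {
    ReusableResult r = new ReusableResult(java.util.Arrays.asList("a", "b"));
    Row row = r.next(null);
    Row same = r.next(row);          // reuses the same instance
    System.out.println(row == same); // true
  }
}
```

Making the reuse parameter explicit in the signature is what makes the behavior unambiguous for callers.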
Mapreduce support is incomplete for SqlStore. We need to implement proper mapreduce support for SQL.
DataStoreFactory.createDataStore may return null when a Store class is present but it's unable to create a data store (e.g. wrong config, no connection, etc..). Instead it should preserve the stack trace and re-throw a DataStore-specific exception.
HSQL has proven to be hard to work with regarding database shutdown logic. Back in the day, when I added support for DB operations to Hadoop, derby did not support LIMIT/OFFSET-type queries, so HSQLDB was chosen as the DB for implementing test cases. However, as of 10.4 JavaDB supports these types of queries (http://db.apache.org/derby/docs/10.6/ref/ref-single.html#rrefsqljoffsetfetch). So it is time we decide whether to continue with HSQL or switch to JavaDB.
Also, note that AFAIK JavaDB does not yet support MERGE statements or INSERT ... ON DUPLICATE KEY statements, so we need to find a fix for the insert/update problem before the switch.
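One standard way to emulate the upsert on a database without MERGE / ON DUPLICATE KEY support is update-first, insert-on-miss. Sketched here against a plain Map so it runs standalone; with JDBC the same shape is "if UPDATE's executeUpdate() returns 0, issue the INSERT":

```java
import java.util.HashMap;
import java.util.Map;

public class UpsertEmulation {
  /** Returns true if a new row was inserted, false if an existing one was updated. */
  public static <K, V> boolean upsert(Map<K, V> table, K key, V value) {
    if (table.containsKey(key)) { // UPDATE path: a row already exists
      table.put(key, value);
      return false;
    }
    table.put(key, value);        // INSERT path: no row matched the key
    return true;
  }

  public static void main(String[] args) {
    Map<String, Integer> table = new HashMap<>();
    System.out.println(upsert(table, "id1", 1)); // true (inserted)
    System.out.println(upsert(table, "id1", 2)); // false (updated)
  }
}
```

Note that against a real database this pattern has a race between concurrent writers, so it should run inside a transaction or retry on a duplicate-key error.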
Support map of maps, map of lists, list of maps, etc...
Gora has the ability to reuse objects. However, since not all the fields of an object need to be read from the data store, reused objects should be cleared by calling the clear method.
HBase store currently does not support executing partition queries. Moreover, the mapred tests for HBaseStore did not reveal this before, so we also need to check the tests.
We could have several tables using the same schema but requiring different names (e.g. the main webtable in NutchBase and tables for the segments), and these names can be generated dynamically and are not necessarily known in advance. AFAIU Gora currently assumes that there is 1 table per schema and gets its name from there, which is a limitation.
I suggest that we separate the name of the schema from the names of the tables; by default, if no name is specified for a table, the name of the schema would be used.
Should the method be modified like this, so that it checks a second time that the key is not null?
public boolean next() throws IOException {
  if (key == null) {
    readNext(result);
    // return true only if readNext actually set the key
    return (key != null);
  }
  return false;
}
HBaseStore, line 387: result.getNoVersionMap() returns null, which triggers the NPE:
java.lang.NullPointerException
at org.gora.hbase.store.HBaseStore.newInstance(HBaseStore.java:387)
at org.gora.hbase.query.HBaseResult.readNext(HBaseResult.java:35)
at org.gora.hbase.query.HBaseGetResult.next(HBaseGetResult.java:32)
After introducing GoraHBaseTestDriver, which made the HBase tests complete in much less time, we realized some of the tests were broken during the period in which they were not run. This issue should keep track of those tests and fix HBaseStore to pass them.
The following SQL query is used for creating a table in NutchBase :
CREATE TABLE webpages (id VARCHAR(512) PRIMARY KEY,headers LONGVARBINARY,text VARCHAR(32000),status INTEGER,markers LONGVARBINARY,parseStatus LONGVARBINARY,modifiedTime BIGINT,score FLOAT,typ VARCHAR(32),baseUrl VARCHAR(512),content LONGVARBINARY,title VARCHAR(512),reprUrl VARCHAR(512),fetchInterval INTEGER,prevFetchTime BIGINT,inlinks LONGVARBINARY,prevSignature LONGVARBINARY,outlinks LONGVARBINARY,fetchTime BIGINT,retriesSinceFetch INTEGER,protocolStatus LONGVARBINARY,signature LONGVARBINARY,metadata LONGVARBINARY)
Unfortunately LONGVARBINARY is not recognised by mysql, but 'LONG VARBINARY' is.
In addition to the create/delete schema methods, a method for checking whether the schema exists is needed.
Sometimes clients need to easily clone the objects generated by gora. Adding a deep-copy clone method will solve this.
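A sketch of the deep-copy contract on a hypothetical bean (not an actual gora-generated class): the clone must share no mutable state with the original, so container fields are duplicated rather than aliased.

```java
import java.util.ArrayList;
import java.util.List;

public class DeepCloneSketch {
  public static class WebPage {
    String url;
    List<String> outlinks = new ArrayList<>();

    /** Deep copy: container fields are duplicated, not shared. */
    public WebPage deepClone() {
      WebPage copy = new WebPage();
      copy.url = url;                            // String is immutable, safe to share
      copy.outlinks = new ArrayList<>(outlinks); // fresh list, same contents
      return copy;
    }
  }

  public static void main(String[] args) {
    WebPage page = new WebPage();
    page.url = "http://example.org/";
    page.outlinks.add("http://example.org/a");
    WebPage copy = page.deepClone();
    copy.outlinks.add("http://example.org/b");
    // mutating the copy leaves the original untouched
    System.out.println(page.outlinks.size() + " vs " + copy.outlinks.size());
  }
}
```

For avro-backed records the same effect could be achieved generically by serializing and deserializing the object, at some performance cost.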
The Query interface should support setting filters, which will be powerful enough to support SQL WHERE clauses and HBase native filters.
We should be able to use DB specific types such as MEDIUMBLOB (in MySQL) with Gora.
Cascading is a nice framework for working with Mapreduce at a higher level. Cascading defines a Tap architecture which is the source/sink for records. This is very similar to gora's DataStores.
We should develop a GoraTap as an adapter for gora -> cascading. This way any data store gora supports can be used with Cascading.
It would be nice if we could implement data stores for Lucene and Solr.
Most data processing projects use Lucene/Solr as their indexing backend, so people should be able to define domain-level objects via gora and use the indexing backend just like any other data store.
Having a separate gora-examples module is very logical and useful for the users. However, all of the tests for gora-core and the other modules depend on the data structures and jobs in gora-examples. Until now, thanks to ivy, we have managed this as follows:
gora-core compile configuration does not depend on anything,
gora-examples compile dependency depends on gora-core
gora-core tests dependency depends on gora-examples.
What looks like a cyclic dependency above was resolved by a clever build order among compile and test dependencies. However, this has proven to be a major source of headaches. Long story short, I propose we merge gora-examples into gora-core.
We track elements' individual statuses (dirty, readable, etc.) for maps. We can do the same for arrays.
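A sketch of what per-element dirty tracking for arrays/lists could look like (the wrapper class and method names are illustrations, not the gora persistency API): a BitSet records which indices were written since the last clearDirty(), so a store can persist only the changed elements.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DirtyTrackingList<E> {
  private final List<E> data = new ArrayList<>();
  private final BitSet dirty = new BitSet();

  public void add(E element) {
    data.add(element);
    dirty.set(data.size() - 1); // a new element is dirty by definition
  }

  public E set(int index, E element) {
    dirty.set(index);
    return data.set(index, element);
  }

  public E get(int index) { return data.get(index); }

  public boolean isDirty(int index) { return dirty.get(index); }

  public void clearDirty() { dirty.clear(); } // e.g. after flushing to the store

  public static void main(String[] args) {
    DirtyTrackingList<String> list = new DirtyTrackingList<>();
    list.add("a");
    list.clearDirty();  // pretend we persisted the list
    list.set(0, "b");   // marks index 0 dirty again
    System.out.println(list.isDirty(0)); // true
  }
}
```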
We need to add a default constructor to avro query so that it can be used in mapreduce.
Query objects should optionally take Filters, which are used to accept or reject objects. This can be useful for backends like gora-hbase, where filters can run server-side.
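A sketch of the contract, with a hypothetical Filter interface (not the gora API): a backend that can push the filter down, as HBase can, would translate it to a server-side filter instead of evaluating accept() locally as done here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilteredQuerySketch {
  public interface Filter<T> {
    boolean accept(T obj); // reject means the row is skipped entirely
  }

  /** Local fallback evaluation: apply the filter while iterating results. */
  public static <T> List<T> execute(List<T> rows, Filter<T> filter) {
    List<T> result = new ArrayList<>();
    for (T row : rows) {
      if (filter == null || filter.accept(row)) {
        result.add(row);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Integer> rows = Arrays.asList(1, 2, 3, 4);
    List<Integer> even = execute(rows, n -> n % 2 == 0);
    System.out.println(even); // [2, 4]
  }
}
```

Keeping the filter optional (null means accept everything) matches the "optionally take Filter-s" wording above.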
It would be nice to integrate tools for continuous integration, findbugs, test coverage (clover) and a style checker into the build process.
We need to wait on the possible ASF adoption before deciding on CI; however, the other tools can be added easily to the build process via ant.
GoraInputFormat line 76:

query = (Query<K, T>) DefaultStringifier.load(conf,
    QUERY_KEY, Class.forName(queryClass));

throws an exception unless MemQuery is changed to non-static and moved outside MemStore. The other implementations of QueryBase are not static and live outside their corresponding Store.
This is required in order to be able to use AvroStore with MapReduce.
We need to add a deleteSchema method to the interface so that the schema and all of the schema's data will be deleted.
The JUnit ant task supports running an individual unit test. We need such functionality, since some test cases take too much time.