gora's People

Contributors

dogacan, enis

gora's Issues

Support using file backed data store as a mapreduce in/output

File backed data stores such as DataFileAvroStore can be used as mapreduce inputs. However, we should also support reading more than one file as mapreduce input.

Moreover, when file backed data stores are used as mapreduce outputs, we need to set the file names accordingly so that more than one reduce task can be run.

Add deleteByQuery() to DataStore

It will be convenient if we can add a deleteByQuery(Query) method to DataStore so that entries matching a specific query can be removed.
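
A minimal sketch of what such a method might look like. It uses simplified stand-in types rather than Gora's actual DataStore and Query interfaces: the names KeyRangeQuery and DeleteSketchStore are hypothetical, and the query is reduced to an inclusive key range.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a Gora Query: an inclusive key range.
// A null bound means "unbounded" on that side.
final class KeyRangeQuery<K extends Comparable<K>> {
    final K startKey;
    final K endKey;

    KeyRangeQuery(K startKey, K endKey) {
        this.startKey = startKey;
        this.endKey = endKey;
    }

    boolean matches(K key) {
        return (startKey == null || key.compareTo(startKey) >= 0)
            && (endKey == null || key.compareTo(endKey) <= 0);
    }
}

// Hypothetical in-memory store, used only to illustrate the proposed method.
final class DeleteSketchStore<K extends Comparable<K>, T> {
    private final TreeMap<K, T> data = new TreeMap<>();

    void put(K key, T value) { data.put(key, value); }

    T get(K key) { return data.get(key); }

    int size() { return data.size(); }

    // Removes every entry matching the query and returns how many were removed.
    long deleteByQuery(KeyRangeQuery<K> query) {
        long deleted = 0;
        Iterator<Map.Entry<K, T>> it = data.entrySet().iterator();
        while (it.hasNext()) {
            if (query.matches(it.next().getKey())) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }
}
```

Returning the number of deleted entries is one possible design choice; a boolean or void return would work as well.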

Implement benchmarks

Gora intends to be the core architecture for IO heavy applications and mapreduce jobs, so performance is critical in many ways. There are many possible improvements, and there are performance bottlenecks that need to be identified.

We should implement benchmarks for measuring the performance of Gora. In particular, we should compare using the raw APIs of Avro and HBase against using Gora.

Fix null field treatment across data stores

The treatment of null fields is a tricky issue in the context of persistence. Whichever strategy we choose, the semantics should be the same for all the supported data stores (except maybe file based data stores).

The use cases for null fields are as follows:

  1. record.field == null: setting the field to null again should not make any changes to the data store
  2. record.field != null: setting the field to null should delete (or override) the value in the data store
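
The two cases above can be sketched with per-field dirty tracking. This is only an illustration under stated assumptions: NullAwareRecord is a hypothetical class, fields are kept in a plain map, and the "data store" is another map rather than a real Gora backend.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical record that remembers which fields changed since the last flush.
final class NullAwareRecord {
    private final Map<String, Object> values = new HashMap<>();
    private final Set<String> dirty = new HashSet<>();

    void set(String field, Object value) {
        // Case 1: null -> null (or any other no-op assignment) leaves the field clean.
        if (Objects.equals(values.get(field), value)) {
            return;
        }
        // Case 2: a real change, including non-null -> null, marks the field dirty.
        if (value == null) {
            values.remove(field);
        } else {
            values.put(field, value);
        }
        dirty.add(field);
    }

    Object get(String field) { return values.get(field); }

    boolean isDirty(String field) { return dirty.contains(field); }

    // Writes only dirty fields to the backend; a dirty null field becomes a delete.
    void flushTo(Map<String, Object> backend) {
        for (String field : dirty) {
            Object value = values.get(field);
            if (value == null) {
                backend.remove(field);
            } else {
                backend.put(field, value);
            }
        }
        dirty.clear();
    }
}
```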

Add a memory based data store

We should add a memory based data store (MemStore). It will help with our internal tests as well as with the tests written by our clients.

MemStore should implement all public DataStore methods, including query operations, field, key and time filters.
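
As a rough illustration, a sorted map already covers get, put, delete and key-range queries. The class below is a sketch with a hypothetical name and simplified method signatures; it does not implement Gora's DataStore interface and leaves out field and time filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical memory based store backed by a sorted map.
final class MemStoreSketch<K extends Comparable<K>, T> {
    private final TreeMap<K, T> map = new TreeMap<>();

    void put(K key, T obj) { map.put(key, obj); }

    T get(K key) { return map.get(key); }

    // Returns true if an entry was actually removed.
    boolean delete(K key) { return map.remove(key) != null; }

    // Returns the values whose keys lie in [startKey, endKey], both inclusive.
    // A null bound leaves that side of the range open.
    List<T> query(K startKey, K endKey) {
        if (startKey != null && endKey != null && startKey.compareTo(endKey) > 0) {
            return new ArrayList<>(); // empty range
        }
        NavigableMap<K, T> view = map;
        if (startKey != null) {
            view = view.tailMap(startKey, true);
        }
        if (endKey != null) {
            view = view.headMap(endKey, true);
        }
        return new ArrayList<>(view.values());
    }
}
```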

Implement gora - pig bindings

Pig is a data processing language. Pig supports reading and writing data in various formats through its Store concept, which is similar to Gora's DataStores.

We should build bindings for gora -> pig so that any data store that Gora supports can be used with Pig.

TFile/MapFile backed avro store

We may benefit from an avro store which is backed by a map file / TFile. Unlike the DataFile backed avro store, map files support random gets for keys, so some applications (such as tests) can use this as the main data store.

HBase tests take forever to finish

We extend HBase's test cases, which set up a mini cluster from scratch each time. This takes up to 1 min. We should start the cluster once and run all the tests against it.

Introduce class Gora as the main Facade

We can rename DataStoreFactory to Gora and use this class as the public facing Facade for third parties.

The following will make more sense:
Properties properties = Gora.properties;

Gora.createDataStore( HBaseStore.class, .... )

Override storage name using properties

The AvroStore uses the value set in the mapping file for naming the underlying table. It would be nice to be able to set the name using the properties, e.g. for cases where we want to generate a temporary structure but want to keep a common mapping file.
A parameter name which is generic and not tied to a specific datastore implementation would be better, as the client code might need to specify this using the API and does not necessarily need to know what type of DataStore is actually used.

java.lang.IllegalArgumentException in IOUtils.readFully

buffer.limit() throws the IllegalArgumentException when count == -1

public static byte[] readFully(InputStream in) throws IOException {
  List buffers = new ArrayList(4);
  while(true) {
    ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
    buffers.add(buffer);
    int count = in.read(buffer.array(), 0, BUFFER_SIZE);
    buffer.limit(count);
    if(count < BUFFER_SIZE) break;
  }

  return getAsBytes(buffers);
}
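
One possible fix, sketched as a standalone class. It swaps the ByteBuffer list for a ByteArrayOutputStream, so it is not a drop-in patch for the method above, and the BUFFER_SIZE value is an assumption since the original constant is not shown. Besides handling count == -1, it also avoids treating a short read as end of stream, which InputStream.read does not guarantee.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadFullySketch {
    // Assumed value; the original BUFFER_SIZE constant is not shown in the report.
    private static final int BUFFER_SIZE = 8192;

    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[BUFFER_SIZE];
        int count;
        // read() returns -1 only at end of stream; a read shorter than
        // BUFFER_SIZE may still be followed by more data.
        while ((count = in.read(buffer, 0, BUFFER_SIZE)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }
}
```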

Make object reuse by DataStores and Results explicit

Currently, HBaseResult does not reuse objects, whereas AvroResult reuses them. Moreover, there is no way to reuse an object at hand when using DataStore#get().

In short, we need to make object reuse configurable, and it should be explicit whether objects are reused or not.

Evaluate switching to JavaDB (aka Apache Derby) as the default embedded database for gora-sql

HSQL has proven to be hard to work with regarding database shutdown logic. Back when I added support for DB operations to Hadoop, Derby did not support LIMIT/OFFSET type queries, so HSQLDB was chosen as the DB for implementing test cases. However, from 10.4 JavaDB supports these types of queries (http://db.apache.org/derby/docs/10.6/ref/ref-single.html#rrefsqljoffsetfetch). So it is time we decide whether to continue with HSQL or switch to JavaDB.

Also, note that AFAIK, JavaDB does not yet support MERGE statements or INSERT ... ON DUPLICATE KEY statements, so we need to find a fix for the insert/update problem before the switch.

Persistent.clear()

Gora has the ability to reuse objects. However, since not all the fields of an object need to be read from the data store, a reused object may still carry values from its previous use. Reused objects should therefore be cleared by calling the clear method.
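
A small sketch of why the clear call matters when only some fields are loaded. ReusableRecord and PartialReader are hypothetical names, and a row is modelled as a plain map rather than an actual data store result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical persistent object with a clear() method.
final class ReusableRecord {
    private final Map<String, Object> fields = new HashMap<>();

    void put(String field, Object value) { fields.put(field, value); }

    Object get(String field) { return fields.get(field); }

    // Resets the object so that nothing from a previous use survives.
    void clear() { fields.clear(); }
}

final class PartialReader {
    // Copies the requested fields that are present in the row into the record.
    // If reuse is non-null it is cleared and reused instead of allocating.
    static ReusableRecord read(Map<String, Object> row, String[] requested, ReusableRecord reuse) {
        ReusableRecord rec = (reuse != null) ? reuse : new ReusableRecord();
        rec.clear(); // without this, fields missing from this row would keep old values
        for (String field : requested) {
            if (row.containsKey(field)) {
                rec.put(field, row.get(field));
            }
        }
        return rec;
    }
}
```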

HBaseStore does not respect partition queries

HBaseStore currently does not support executing partition queries. Moreover, the mapred tests for HBaseStore did not reveal this before, so we also need to check the tests.

Specify name of table separately from the schema

We could have several tables using the same schema but requiring different names (e.g. the main webtable in NutchBase and the tables for the segments). In addition, these names can be generated dynamically and are not necessarily known in advance. AFAIU Gora currently assumes that there is 1 table per schema and gets the table name from the schema, which is a limitation.
I suggest that we separate the name of the schema from the names of the tables. By default, if no name is specified for a table, the name of the schema would be used.
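
The suggested fallback could be sketched as follows. The property key and the class name are hypothetical and chosen for illustration only; they are not actual Gora configuration names.

```java
import java.util.Properties;

final class TableNameResolver {
    // Hypothetical property key, not an actual Gora setting.
    static final String TABLE_NAME_KEY = "sketch.datastore.table.name";

    // Returns the explicitly configured table name, or the schema name
    // when no (non-blank) table name has been set.
    static String resolve(Properties props, String schemaName) {
        String configured = props.getProperty(TABLE_NAME_KEY);
        if (configured == null || configured.trim().isEmpty()) {
            return schemaName;
        }
        return configured.trim();
    }
}
```

With this shape, a job that generates segment tables dynamically would set the property per run, while everything else keeps using the schema name.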

NPE in HBase store when querying on a key that does not exist

HBaseStore - line 387 : result.getNoVersionMap() returns null which triggers the NPE

java.lang.NullPointerException
at org.gora.hbase.store.HBaseStore.newInstance(HBaseStore.java:387)
at org.gora.hbase.query.HBaseResult.readNext(HBaseResult.java:35)
at org.gora.hbase.query.HBaseGetResult.next(HBaseGetResult.java:32)

Fix failing HBase tests

After introducing GoraHBaseTestDriver, which lets the HBase tests complete in much less time, we realized that some of the tests had been broken during the period in which they were not run. This issue should keep track of these tests and of the fixes needed in HBaseStore to make them pass.

MySQL does not recognise LONGVARBINARY

The following SQL query is used for creating a table in NutchBase :

CREATE TABLE webpages (id VARCHAR(512) PRIMARY KEY,headers LONGVARBINARY,text VARCHAR(32000),status INTEGER,markers LONGVARBINARY,parseStatus LONGVARBINARY,modifiedTime BIGINT,score FLOAT,typ VARCHAR(32),baseUrl VARCHAR(512),content LONGVARBINARY,title VARCHAR(512),reprUrl VARCHAR(512),fetchInterval INTEGER,prevFetchTime BIGINT,inlinks LONGVARBINARY,prevSignature LONGVARBINARY,outlinks LONGVARBINARY,fetchTime BIGINT,retriesSinceFetch INTEGER,protocolStatus LONGVARBINARY,signature LONGVARBINARY,metadata LONGVARBINARY)

Unfortunately, LONGVARBINARY is not recognised by MySQL, but 'LONG VARBINARY' is.
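
One way to handle this could be to pick the column type name per database when generating the CREATE TABLE statement. The helper below is a hypothetical sketch: it only encodes the MySQL behaviour reported above and assumes the database is identified by a product name string such as the one returned by JDBC's DatabaseMetaData.getDatabaseProductName().

```java
import java.util.Locale;

final class ColumnTypeNames {
    // Returns the type name to use for long binary columns on the given database.
    // Only the MySQL special case from the report is handled; every other
    // database keeps the name used in the original statement.
    static String longVarBinary(String dbProductName) {
        if (dbProductName != null
                && dbProductName.toLowerCase(Locale.ROOT).contains("mysql")) {
            return "LONG VARBINARY";
        }
        return "LONGVARBINARY";
    }
}
```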

Implement query filters

The Query interface should support setting filters, which will be powerful enough to support SQL where clauses and HBase native filters.

Implement gora - cascading bindings

Cascading is a nice framework for working with MapReduce at a higher level. Cascading defines a Tap architecture, where a Tap is the source/sink for records. This is very similar to Gora's DataStores.

We should develop a GoraTap as an adapter for gora -> cascading. This way any data store Gora supports can be used with Cascading.

Implement gora - Lucene / Solr bindings

It would be nice if we could implement data stores for Lucene and Solr.
Most data processing projects use Lucene/Solr as their indexing backend, so people should be able to use domain level objects (defined via Gora) and use the indexing backend just like any other data store.

merge gora-examples into gora-core test

Having a separate gora-examples module is very logical and useful for the users. However, all of the tests for gora-core and the other modules depend on the data structures and jobs in gora-examples. Until now, thanks to ivy, we have managed this as follows:
the gora-core compile configuration does not depend on anything,
the gora-examples compile configuration depends on gora-core,
the gora-core test configuration depends on gora-examples.

What looks like a cyclic dependency above was resolved by a clever build order among compile and test dependencies. However, this has proven to be a major source of headaches. So, long story short, I propose we merge gora-examples into the gora-core tests.

Implement a Filter interface for Query-s

Query objects should optionally take Filter-s, which are used to accept/reject objects. This can be useful for backends like gora-hbase, where filters can work on server-side.
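
A minimal sketch of such a contract, applied on the client side. RecordFilter and FilteredScan are hypothetical names, and the rows are a plain map instead of a real query result; a backend like gora-hbase would instead translate the filter into something the server can evaluate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical filter contract: decides whether an object is accepted.
interface RecordFilter<K, T> {
    boolean accept(K key, T value);
}

final class FilteredScan {
    // Returns the values accepted by the filter; a null filter accepts everything,
    // which keeps the filter optional for callers.
    static <K, T> List<T> scan(Map<K, T> rows, RecordFilter<K, T> filter) {
        List<T> accepted = new ArrayList<>();
        for (Map.Entry<K, T> entry : rows.entrySet()) {
            if (filter == null || filter.accept(entry.getKey(), entry.getValue())) {
                accepted.add(entry.getValue());
            }
        }
        return accepted;
    }
}
```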

Integrate clover, findbugs, checkstyle and CI into build process

It would be nice to integrate tools for continuous integration, static analysis (findbugs), test coverage (clover) and style checking (checkstyle) into the build process.
We need to wait for the possible ASF adoption before deciding on CI; however, the other tools can easily be added to the build process via ant.

Can't use MemStore in Mapreduce

MemStore cannot be used in MapReduce unless MemQuery is changed to non-static and moved outside MemStore.
The following call in GoraInputFormat (line 76) throws an exception:
query = (Query<K, T>) DefaultStringifier.load(conf,
QUERY_KEY, Class.forName(queryClass));
The other implementations of QueryBase are not static and live outside their corresponding Store.
