enis / gora
Gora has moved to Apache Incubator, please go to http://incubator.apache.org/gora/
Home Page: http://incubator.apache.org/gora/
License: Apache License 2.0
File-backed data stores such as DataFileAvroStore can be used as mapreduce inputs. However, we should support reading more than one file as mapreduce input.
Moreover, when file-backed data stores are used as mapreduce outputs, we need to set the file names accordingly so that more than one reduce task can run.
Since hbase-mapping.xml is used by external clients, having gora in the name will be much clearer for them.
It would be convenient to add a deleteByQuery(Query) method to DataStore so that entries matching a specific query can be removed.
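A minimal sketch of the idea, using a plain in-memory map and a key predicate in place of a real gora DataStore and Query (both the class and method names here are illustrations, not the actual API):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Predicate;

// Sketch of the proposed deleteByQuery contract: remove every entry
// matching the query and report how many were deleted.
public class DeleteByQuerySketch {
  private final Map<String, String> rows = new ConcurrentSkipListMap<>();

  public void put(String key, String value) { rows.put(key, value); }

  /** Removes every entry whose key matches the query; returns the count. */
  public long deleteByQuery(Predicate<String> query) {
    long deleted = 0;
    Iterator<Map.Entry<String, String>> it = rows.entrySet().iterator();
    while (it.hasNext()) {
      if (query.test(it.next().getKey())) {
        it.remove();
        deleted++;
      }
    }
    return deleted;
  }

  public int size() { return rows.size(); }

  public static void main(String[] args) {
    DeleteByQuerySketch store = new DeleteByQuerySketch();
    store.put("com.example/a", "1");
    store.put("com.example/b", "2");
    store.put("org.example/c", "3");
    long deleted = store.deleteByQuery(k -> k.startsWith("com.example"));
    System.out.println(deleted + " deleted, " + store.size() + " left");
  }
}
```

The real method would take a gora Query, so backends like SQL could translate it into a single DELETE ... WHERE statement instead of iterating.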
Gora intends to be the core architecture for IO-heavy applications and mapreduce jobs, so performance is critical in many ways. There are lots of possible improvements, and performance bottlenecks that need to be identified.
We should implement benchmarks for measuring the performance of gora. In particular, we should compare using the raw avro and HBase APIs against going through Gora.
HBaseStore does not read the configuration for the mapping file from properties. It should use DataStoreFactory.getMappingFile() to get the mapping file name.
Null field treatment is a tricky issue in the context of persistence. Whichever strategy we choose, the semantics should be the same for all the supported data stores (except maybe file-based data stores).
The use cases for null fields are as follows:
We should add a memory-based data store (MemStore). It will help internal tests as well as testing by our clients.
MemStore should implement all public DataStore methods, including query operations, field, key and time filters.
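A hedged sketch of the core of such a store: it does not implement the real gora DataStore interface, but shows how a sorted in-memory map gives us get/put/delete plus the key-range scans that query execution needs.

```java
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// MemStore sketch: an in-memory, key-sorted store. ConcurrentSkipListMap
// keeps keys ordered, so start/end-key queries map directly to subMap().
public class MemStoreSketch<K extends Comparable<K>, T> {
  private final NavigableMap<K, T> map = new ConcurrentSkipListMap<>();

  public void put(K key, T obj) { map.put(key, obj); }

  public T get(K key) { return map.get(key); }

  public boolean delete(K key) { return map.remove(key) != null; }

  /** Range query over [startKey, endKey], the shape gora queries use. */
  public NavigableMap<K, T> execute(K startKey, K endKey) {
    return map.subMap(startKey, true, endKey, true);
  }

  public static void main(String[] args) {
    MemStoreSketch<String, Integer> store = new MemStoreSketch<>();
    store.put("a", 1);
    store.put("b", 2);
    store.put("c", 3);
    System.out.println(store.execute("a", "b").size()); // keys a..b inclusive
  }
}
```

Field and time filters would then be applied per returned object, since everything is local anyway.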
Pig is a data processing language. Pig supports reading/writing data in various formats through its Store concept, similar to Gora's DataStores.
We should build gora -> pig bindings so that any data store gora supports can be used with pig.
We should be able to use avro compression on SQL tables.
We may benefit from an avro store backed by a map file / TFile. Unlike the DataFile-backed avro store, map files support random gets by key, so some applications (such as tests, etc.) can use this as the main data store.
We extend HBase's test cases, which set up a mini cluster from scratch each time. This takes up to 1 minute. We should start the cluster once and run all the tests against it.
As reported in https://issues.apache.org/jira/browse/NUTCH-890, SQLStore does not work with nested record types.
We can rename DataStoreFactory to Gora and use this class as the public-facing Facade for third parties.
The following will make more sense:
Properties properties = Gora.properties;
Gora.createDataStore(HBaseStore.class, ....)
gora inter-module dependencies should be resolved local-first. This can speed up the build a lot.
If not, we end up with an incorrect SQL query, e.g.:
INSERT INTO webpages (id) VALUES ('uk.co.bbc.news:http/sport2/hi/video_and_audio/default.stm') ON DUPLICATE KEY UPDATE ;
where nothing follows UPDATE.
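A sketch of one possible fix when only the key column is set: MySQL's no-op idiom of assigning the key column to itself keeps the statement well-formed. The UpsertBuilder class and its parameters are illustrations (the table and column names are taken from the example above), not the actual SqlStore code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Builds an INSERT ... ON DUPLICATE KEY UPDATE statement that is valid
// even when there are no non-key columns to update.
public class UpsertBuilder {
  public static String buildUpsert(String table, String keyColumn,
                                   Map<String, String> nonKeyColumns) {
    StringJoiner cols = new StringJoiner(", ");
    StringJoiner vals = new StringJoiner(", ");
    StringJoiner updates = new StringJoiner(", ");
    cols.add(keyColumn);
    vals.add("?");
    for (String col : nonKeyColumns.keySet()) {
      cols.add(col);
      vals.add("?");
      updates.add(col + " = VALUES(" + col + ")");
    }
    // With no non-key columns, emit a no-op assignment so that
    // something always follows ON DUPLICATE KEY UPDATE.
    String updateClause = nonKeyColumns.isEmpty()
        ? keyColumn + " = " + keyColumn
        : updates.toString();
    return "INSERT INTO " + table + " (" + cols + ") VALUES (" + vals
        + ") ON DUPLICATE KEY UPDATE " + updateClause;
  }

  public static void main(String[] args) {
    System.out.println(buildUpsert("webpages", "id", new LinkedHashMap<>()));
  }
}
```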
The AvroStore uses the value set in the mapping file for naming the underlying table. It would be nice to be able to set the name via properties, e.g. for cases where we want to generate a temporary structure but keep a common mapping file.
Having a parameter name that is generic and not tied to a specific datastore implementation would be better, as the client code might need to specify this through the API and does not necessarily know what type of DataStore is actually used.
As per the discussion in https://issues.apache.org/jira/browse/NUTCH-891, until Gora releases its first version, nutch needs snapshot version numbers in its jars.
buffer.limit() throws an IllegalArgumentException when count == -1:

public static byte[] readFully(InputStream in) throws IOException {
  List<ByteBuffer> buffers = new ArrayList<ByteBuffer>(4);
  while (true) {
    ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
    buffers.add(buffer);
    int count = in.read(buffer.array(), 0, BUFFER_SIZE);
    buffer.limit(count); // count == -1 at end of stream, so this throws
    if (count < BUFFER_SIZE) break;
  }
  return getAsBytes(buffers);
}
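A minimal sketch of a fix: check for end of stream before using the count, so limit() is never called with -1. ByteArrayOutputStream is used here to avoid the list-of-buffers bookkeeping; BUFFER_SIZE and the class name are assumptions for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyFix {
  private static final int BUFFER_SIZE = 4096;

  public static byte[] readFully(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[BUFFER_SIZE];
    int count;
    // read() returns -1 at end of stream; test before using the count
    while ((count = in.read(buffer, 0, BUFFER_SIZE)) != -1) {
      out.write(buffer, 0, count); // copy only the bytes actually read
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello gora".getBytes("UTF-8");
    byte[] copy = readFully(new ByteArrayInputStream(data));
    System.out.println(copy.length); // 10
  }
}
```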
Currently, HBaseResult does not reuse objects, whereas AvroResult reuses them. Moreover, there is no way to pass in an object to be reused when using DataStore#get().
In short, we need to make object reuse configurable, and it should be explicit whether objects are reused or not.
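One possible shape for an explicit reuse contract, sketched against a hypothetical Result-style class (not the actual gora interface): pass null to get a fresh object, or pass a previously returned instance to have it cleared and refilled.

```java
import java.util.Iterator;
import java.util.List;

public class ReusableResult {
  public static class Row {
    StringBuilder value = new StringBuilder();

    void clear() { value.setLength(0); } // reset all fields before refill
  }

  private final Iterator<String> source;

  public ReusableResult(List<String> values) { this.source = values.iterator(); }

  /** Fills {@code reuse} if given, otherwise allocates; returns null at end. */
  public Row next(Row reuse) {
    if (!source.hasNext()) return null;
    Row row = (reuse != null) ? reuse : new Row();
    row.clear(); // stale fields must not leak into the next record
    row.value.append(source.next());
    return row;
  }

  public static void main(String[] args) {
    ReusableResult r = new ReusableResult(java.util.Arrays.asList("a", "b"));
    Row row = r.next(null);
    Row same = r.next(row);          // reuses the same instance
    System.out.println(row == same); // true
  }
}
```

Making the reuse parameter explicit in the signature is what makes the behavior unambiguous for callers.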
Mapreduce support is incomplete for SqlStore. We need to implement proper mapreduce support for SQL.
DataStoreFactory.createDataStore may return null when a Store class is present but it's unable to create a data store (e.g. wrong config, no connection, etc..). Instead it should preserve the stack trace and re-throw a DataStore-specific exception.
HSQL has proven to be hard to work with regarding database shutdown logic. Back in the day, when I added support for DB operations to Hadoop, derby did not support LIMIT/OFFSET-type queries, so HSQLDB was chosen as the DB for implementing test cases. However, as of 10.4 JavaDB supports these types of queries (http://db.apache.org/derby/docs/10.6/ref/ref-single.html#rrefsqljoffsetfetch). So it is time we decide whether to continue with HSQL or switch to JavaDB.
Also, note that AFAIK JavaDB does not yet support MERGE statements or INSERT ... ON DUPLICATE KEY statements, so we need to find a fix for the insert/update problem before the switch.
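One standard way to emulate the upsert on a database without MERGE / ON DUPLICATE KEY support is update-first, insert-on-miss. Sketched here against a plain Map so it runs standalone; with JDBC the same shape is "if UPDATE's executeUpdate() returns 0, issue the INSERT":

```java
import java.util.HashMap;
import java.util.Map;

public class UpsertEmulation {
  /** Returns true if a new row was inserted, false if an existing one was updated. */
  public static <K, V> boolean upsert(Map<K, V> table, K key, V value) {
    if (table.containsKey(key)) { // UPDATE path: a row already exists
      table.put(key, value);
      return false;
    }
    table.put(key, value);        // INSERT path: no row matched the key
    return true;
  }

  public static void main(String[] args) {
    Map<String, Integer> table = new HashMap<>();
    System.out.println(upsert(table, "id1", 1)); // true (inserted)
    System.out.println(upsert(table, "id1", 2)); // false (updated)
  }
}
```

Note that against a real database this pattern has a race between concurrent writers, so it should run inside a transaction or retry on a duplicate-key error.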
Support map of maps, map of lists, list of maps, etc...
Gora has the ability to reuse objects. However, since not all the fields of an object need to be read from the data store, reused objects should be cleared by calling the clear method.
HBase store currently does not support executing partition queries. Moreover, the mapred tests for HBaseStore did not reveal this before, so we also need to check the tests.
We could have several tables using the same schema but requiring different names (e.g. the main webtable in NutchBase and tables for the segments), and these names can be generated dynamically and are not necessarily known in advance. AFAIU Gora currently assumes that there is 1 table per schema and gets its name from there, which is a limitation.
I suggest that we separate the name of the schema from the names of the tables; by default, if no name is specified for a table, the name of the schema would be used.
Should the method be modified like this, so that it checks a second time that the key is not null?
public boolean next() throws IOException {
  if (key == null) {
    readNext(result);
    // return true only if readNext actually set the key
    return (key != null);
  }
  return false;
}
HBaseStore, line 387: result.getNoVersionMap() returns null, which triggers the NPE:
java.lang.NullPointerException
at org.gora.hbase.store.HBaseStore.newInstance(HBaseStore.java:387)
at org.gora.hbase.query.HBaseResult.readNext(HBaseResult.java:35)
at org.gora.hbase.query.HBaseGetResult.next(HBaseGetResult.java:32)
After introducing GoraHBaseTestDriver, which made the HBase tests complete in much less time, we realized some of the tests were broken during the period in which they were not run. This issue should keep track of those tests and fix HBaseStore to pass them.
The following SQL query is used for creating a table in NutchBase :
CREATE TABLE webpages (id VARCHAR(512) PRIMARY KEY,headers LONGVARBINARY,text VARCHAR(32000),status INTEGER,markers LONGVARBINARY,parseStatus LONGVARBINARY,modifiedTime BIGINT,score FLOAT,typ VARCHAR(32),baseUrl VARCHAR(512),content LONGVARBINARY,title VARCHAR(512),reprUrl VARCHAR(512),fetchInterval INTEGER,prevFetchTime BIGINT,inlinks LONGVARBINARY,prevSignature LONGVARBINARY,outlinks LONGVARBINARY,fetchTime BIGINT,retriesSinceFetch INTEGER,protocolStatus LONGVARBINARY,signature LONGVARBINARY,metadata LONGVARBINARY)
Unfortunately LONGVARBINARY is not recognised by mysql, but 'LONG VARBINARY' is.
In addition to the create/delete schema methods, a method for checking whether the schema exists is needed.
Sometimes clients need to easily clone the objects generated by gora. Adding a deep-copy clone method will solve this.
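A sketch of the deep-copy contract on a hypothetical bean (not an actual gora-generated class): the clone must share no mutable state with the original, so container fields are duplicated rather than aliased.

```java
import java.util.ArrayList;
import java.util.List;

public class DeepCloneSketch {
  public static class WebPage {
    String url;
    List<String> outlinks = new ArrayList<>();

    /** Deep copy: container fields are duplicated, not shared. */
    public WebPage deepClone() {
      WebPage copy = new WebPage();
      copy.url = url;                            // String is immutable, safe to share
      copy.outlinks = new ArrayList<>(outlinks); // fresh list, same contents
      return copy;
    }
  }

  public static void main(String[] args) {
    WebPage page = new WebPage();
    page.url = "http://example.org/";
    page.outlinks.add("http://example.org/a");
    WebPage copy = page.deepClone();
    copy.outlinks.add("http://example.org/b");
    // mutating the copy leaves the original untouched
    System.out.println(page.outlinks.size() + " vs " + copy.outlinks.size());
  }
}
```

For avro-backed records the same effect could be achieved generically by serializing and deserializing the object, at some performance cost.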
The Query interface should support setting filters, which will be powerful enough to support SQL WHERE clauses and HBase native filters.
We should be able to use DB specific types such as MEDIUMBLOB (in MySQL) with Gora.
Cascading is a nice framework for working with Mapreduce at a higher level. Cascading defines a Tap architecture which is the source/sink for records. This is very similar to gora's DataStores.
We should develop a GoraTap as an adapter for gora -> cascading. This way any data store gora supports can be used with Cascading.
It would be nice if we could implement data stores for Lucene and Solr.
Most data processing projects use Lucene/Solr as their indexing backend, so people should be able to define domain-level objects via gora and use the indexing backend just like any other data store.
Having a separate gora-examples module is very logical and useful for the users. However, all of the tests for gora-core and the other modules depend on the data structures and jobs in gora-examples. Until now, thanks to ivy, we have managed this as follows:
gora-core compile configuration does not depend on anything,
gora-examples compile dependency depends on gora-core
gora-core tests dependency depends on gora-examples.
What looks like a cyclic dependency above was resolved by a clever build order among compile and test dependencies. However, this has proven to be a major source of headaches. Long story short, I propose we merge gora-examples into gora-core.
We track elements' individual statuses (dirty, readable, etc.) for maps. We can do the same for arrays.
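A sketch of what per-element dirty tracking for arrays/lists could look like (the wrapper class and method names are illustrations, not the gora persistency API): a BitSet records which indices were written since the last clearDirty(), so a store can persist only the changed elements.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DirtyTrackingList<E> {
  private final List<E> data = new ArrayList<>();
  private final BitSet dirty = new BitSet();

  public void add(E element) {
    data.add(element);
    dirty.set(data.size() - 1); // a new element is dirty by definition
  }

  public E set(int index, E element) {
    dirty.set(index);
    return data.set(index, element);
  }

  public E get(int index) { return data.get(index); }

  public boolean isDirty(int index) { return dirty.get(index); }

  public void clearDirty() { dirty.clear(); } // e.g. after flushing to the store

  public static void main(String[] args) {
    DirtyTrackingList<String> list = new DirtyTrackingList<>();
    list.add("a");
    list.clearDirty();  // pretend we persisted the list
    list.set(0, "b");   // marks index 0 dirty again
    System.out.println(list.isDirty(0)); // true
  }
}
```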
We need to add a default constructor to avro query so that it can be used in mapreduce.
Query objects should optionally take Filters, which are used to accept or reject objects. This can be useful for backends like gora-hbase, where filters can run server-side.
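A sketch of the contract, with a hypothetical Filter interface (not the gora API): a backend that can push the filter down, as HBase can, would translate it to a server-side filter instead of evaluating accept() locally as done here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilteredQuerySketch {
  public interface Filter<T> {
    boolean accept(T obj); // reject means the row is skipped entirely
  }

  /** Local fallback evaluation: apply the filter while iterating results. */
  public static <T> List<T> execute(List<T> rows, Filter<T> filter) {
    List<T> result = new ArrayList<>();
    for (T row : rows) {
      if (filter == null || filter.accept(row)) {
        result.add(row);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Integer> rows = Arrays.asList(1, 2, 3, 4);
    List<Integer> even = execute(rows, n -> n % 2 == 0);
    System.out.println(even); // [2, 4]
  }
}
```

Keeping the filter optional (null means accept everything) matches the "optionally take Filter-s" wording above.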
It would be nice to integrate tools for continuous integration, findbugs, test coverage (clover) and a style checker into the build process.
We need to wait on the possible ASF adoption before deciding on CI; however, the other tools can be added easily to the build process via ant.
GoraInputFormat line 76:

query = (Query<K, T>) DefaultStringifier.load(conf,
    QUERY_KEY, Class.forName(queryClass));

throws an exception unless MemQuery is changed to non-static and moved outside MemStore. The other implementations of QueryBase are not static and live outside their corresponding Store.
This is required in order to be able to use AvroStore with MapReduce.
We need to add a deleteSchema method to the interface so that the schema and all of the schema's data will be deleted.
The JUnit ant task supports running an individual unit test. We need such functionality, since some test cases take too much time.