Comments (10)

GoogleCodeExporter commented on July 27, 2024
That sounds more like an enhancement request than a defect report.

To generalize things a bit, support for a standard data access API would allow people to plug in multiple DB backends.

Original comment by tfmorris on 10 May 2010 at 7:26

from google-refine.

GoogleCodeExporter commented on July 27, 2024
One of the major challenges here is how to support undo/redo when changes can get into the back-end database without going through Gridworks. Another major challenge is where to store metadata (such as reconciliation records) that is specific to Gridworks and not native to any existing back-end database.

Original comment by [email protected] on 10 May 2010 at 7:32

  • Added labels: Priority-Low, Type-Enhancement
  • Removed labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented on July 27, 2024
Hmm... I presume a new project would have to be created when doing this, and that could hold the meta information.

As for undo/redo, maybe adding a commit button would make that easier. Changes would be made to a snapshot of the data, and nothing would be written to the DB itself until commit is pressed?
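
To make the commit-button idea concrete, here is a minimal sketch in Python. It is not anything Gridworks actually does: the `StagedSession` helper is hypothetical, and an in-memory SQLite table named `people` stands in for the back-end database. Edits sit in an undo/redo stack and only reach the database when `commit()` is called.

```python
import sqlite3


class StagedSession:
    """Hypothetical helper: edits are staged locally and written only on commit()."""

    def __init__(self, conn, table, key_column):
        self.conn = conn
        self.table = table
        self.key_column = key_column
        self.done = []    # staged edits, oldest first
        self.undone = []  # edits removed by undo(), available for redo()

    def edit(self, key, column, value):
        self.done.append((key, column, value))
        self.undone.clear()  # a new edit invalidates the redo stack

    def undo(self):
        if self.done:
            self.undone.append(self.done.pop())

    def redo(self):
        if self.undone:
            self.done.append(self.undone.pop())

    def commit(self):
        # Table and column names cannot be bound as SQL parameters, so they are
        # interpolated here; a real tool would have to validate them first.
        with self.conn:  # one transaction: all staged edits apply, or none do
            for key, column, value in self.done:
                self.conn.execute(
                    f'UPDATE "{self.table}" SET "{column}" = ? '
                    f'WHERE "{self.key_column}" = ?',
                    (value, key),
                )
        self.done.clear()
        self.undone.clear()


# Stand-in for the remote database (assumption: a table called "people").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'alice'), (2, 'bob')")
conn.commit()

session = StagedSession(conn, "people", "id")
session.edit(1, "name", "Alice")
session.edit(2, "name", "Robert")
session.undo()    # drops the staged edit to row 2
session.commit()  # only the edit to row 1 reaches the database
```

After this runs, row 1 reads 'Alice' while row 2 still reads 'bob', because its edit was undone before the commit. Undo/redo never touches the database in this design, which is what makes it cheap.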

Original comment by mjlissner on 10 May 2010 at 7:36

GoogleCodeExporter commented on July 27, 2024
You can either keep it synchronous with the database (effectively using the database as the backend), but lose undo/redo support and reconciliation with Freebase (unless you add suitable tables to the database). You're then using Gridworks for just the facets really.

The other way, as you suggest, is to make it similar to a disconnected session with a commit transaction. The issue is then ensuring consistency between the remote database and the snapshot held in Gridworks. Merging the two back in would be an issue. You'd also need to hold keys from the remote database in Gridworks for updating records.
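
A rough sketch of that consistency check, in Python, under loud assumptions: an in-memory SQLite table `people` stands in for the remote database, the snapshot is a plain dict keyed by the remote primary key, and `find_conflicts` is a hypothetical helper. Before merging, each snapshotted row is compared with what the database holds now.

```python
import sqlite3


def find_conflicts(conn, snapshot):
    """Return keys whose remote row no longer matches the snapshot.

    `snapshot` maps primary key -> name as it was when the data was pulled.
    """
    conflicts = []
    for key, old_name in snapshot.items():
        row = conn.execute(
            "SELECT name FROM people WHERE id = ?", (key,)
        ).fetchone()
        # A missing row (deleted remotely) or a changed value is a conflict.
        if row is None or row[0] != old_name:
            conflicts.append(key)
    return conflicts


# Stand-in for the remote database (assumption: a table called "people").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'alice'), (2, 'bob'), (3, 'carol')")

# Pull a snapshot, keeping the remote primary keys alongside the values.
snapshot = dict(conn.execute("SELECT id, name FROM people ORDER BY id"))

# Meanwhile, the database changes without going through the tool.
conn.execute("UPDATE people SET name = 'bobby' WHERE id = 2")
conn.execute("DELETE FROM people WHERE id = 3")

conflicts = find_conflicts(conn, snapshot)
```

Here rows 2 and 3 are reported as conflicts. What to do with them (overwrite, skip, or ask the user) is the hard part, and the sketch leaves it open.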

Frameworks such as Hibernate or Spring would be worth considering for their database abstraction layers.

Original comment by iainsproat on 10 May 2010 at 7:46

GoogleCodeExporter commented on July 27, 2024
I have read that good ORMs such as Hibernate now support ordered lists and other features incorporated into JPA 2.0 as of Dec 2009.

Original comment by thadguidry on 10 May 2010 at 8:05

GoogleCodeExporter commented on July 27, 2024

Original comment by iainsproat on 25 May 2010 at 7:59

  • Changed state: Accepted

GoogleCodeExporter commented on July 27, 2024

Original comment by iainsproat on 14 Oct 2010 at 9:26

  • Added labels: Component-Logic, Component-Persistence, Usability

GoogleCodeExporter commented on July 27, 2024
Initially, what I believe is most important and useful is simply the ability to connect directly to MySQL/PostgreSQL/etc. data sources from the get-go (when creating a new project). Also initially, being able to set and save multiple data sources, and multiple DBs within those sources. When the user creates a new project, this would in effect create a disconnected session (as mentioned above), wherein the data is treated the same way project data is treated now. Adding 'commit' features can come next, followed by more advanced connectivity options, until synchronous functionality is in place to one degree or another. But for now, it would certainly be nice to add data sources and pull data from those sources!

Eric Jarvies

Original comment by [email protected] on 11 Nov 2010 at 7:08

GoogleCodeExporter commented on July 27, 2024
Eric, following industry best practices, a cleanup tool is best used offline as one step within a process. There are existing ETL tools that easily consume from MySQL/PostgreSQL etc. and offer excellent flow control, exporting, and connectivity, so producing delimited files is relatively easy. Talend is one such product that I use along with Google Refine. Talend (Open Source Community edition) does my scheduled daily gathering from 3 databases (MySQL and Oracle) and then dumps a customized TSV file that I open with Google Refine for further analysis and sometimes cleanup. There are other tools that provide ETL (Extract, Transform, Load) like Talend. I'm not 100% sure the team really feels the need to copy that and flesh out a full ETL platform, since Talend and other tools fill that need very nicely.

Incidentally, using Google Refine and a bit of clustering, I was able to find a few loopholes in our data storage processing that we fixed with a few stored procedures within Oracle. Google Refine was instrumental as a discovery tool for that. Talend does have an MDM component, but it does not have the interactivity of a discovery tool like Google Refine.

If you do NOT need a daily process, but only a one-time cleanup, just dumping with MySQL or PostgreSQL would offer about the same, and depending on the size of the database takes only seconds to minutes. Dumping also avoids potential locks on the live database, which Refine, if it supported direct connections, might have to tip-toe around, depending on the team's chosen implementation of database connectivity. If you have large database size needs, give Talend or another ETL tool a try with Google Refine, and you'll soon see the powerful left-right combination.

I'm not sure how far the team will ultimately decide to absorb direct connectivity support within Google Refine. I'd like to hear other opinions as well on this Issue-12.
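
For the one-time dump route, a minimal Python sketch might look like the following. It is only an illustration: SQLite stands in for MySQL/PostgreSQL (with those you would use their own client library or dump tool), and the `dump_tsv` helper and the `people` table are made up for the example.

```python
import csv
import io
import sqlite3


def dump_tsv(conn, query, out):
    """Write the result of `query` to `out` as tab-separated text with a header row."""
    cursor = conn.execute(query)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # cursor.description holds one tuple per column; the name is its first item.
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


# Stand-in for the source database (assumption: a table called "people").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'alice'), (2, 'bob')")

buffer = io.StringIO()
dump_tsv(conn, "SELECT id, name FROM people ORDER BY id", buffer)
tsv_text = buffer.getvalue()
```

In practice `out` would be a file opened with `newline=""`, and the resulting TSV is what gets loaded into Google Refine as a new project.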

Original comment by thadguidry on 11 Nov 2010 at 2:36

GoogleCodeExporter commented on July 27, 2024
I agree with thadguidry. Let the Refine team focus on bringing data quality issues to light. Let Talend focus on data quality (they do have a data profiling tool that can identify some of this stuff). Talend is what we use for basic ETL. You could write some SQL to get the data out of MySQL anyway. And if you run data quality analysis against an entire large table while directly connected to the DB server, your DBA might become angry too.

C

Original comment by [email protected] on 14 Apr 2011 at 2:23
