finraos / megasparkdiff

A Spark-based tool for comparing data at scale. It lets software engineers compare many different pairings of data sources. Multiple execution modes in multiple environments let the user generate a diff report as a Java/Scala-friendly DataFrame or as a file for later use. Comes with SparkFactory and SparkCompare tools out of the box.

Home Page: https://finraos.github.io/MegaSparkDiff/

License: Apache License 2.0

Java 19.72% Shell 0.08% Scala 79.68% HTML 0.53%

megasparkdiff's People

Contributors

aosama, codejunglin, dependabot[bot], finraoss, matthewgillett, mmlinford, nrthyrk, steven-wei

megasparkdiff's Issues

Cross compiled to Scala 2.12

Are there any plans to release this library cross compiled to Scala 2.12?

Currently, I am getting this error:
NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;

The code runs on a Databricks cluster running Scala 2.11, but does not run on Scala 2.12.

Return results of compareAppleTables in a wrapper class

Have the compareAppleTables method return its results in a wrapper class that provides further APIs.

The code should then look something like this:

result = compareAppleTables(x, y).getLeft().sortByCols("a", "b")
OR
result = compareAppleTables(x, y).getJoinedByCols("a", "b").sortBy("a", "b")
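The chained style proposed above could be sketched in plain Java as follows. This is only an illustration of the wrapper idea, not the project's API: the class names DiffResultSketch and RowsSketch are hypothetical, and rows are modelled as Map<String, String> so that the sketch runs without Spark on the classpath. A real implementation would presumably wrap DataFrames instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper returned by a comparison call. It holds the rows found
// only on the left side and only on the right side.
final class DiffResultSketch {
    private final List<Map<String, String>> left;
    private final List<Map<String, String>> right;

    DiffResultSketch(List<Map<String, String>> left, List<Map<String, String>> right) {
        this.left = new ArrayList<>(left);
        this.right = new ArrayList<>(right);
    }

    RowsSketch getLeft() { return new RowsSketch(left); }

    RowsSketch getRight() { return new RowsSketch(right); }
}

// Hypothetical row collection that supports further chained calls.
final class RowsSketch {
    private final List<Map<String, String>> rows;

    RowsSketch(List<Map<String, String>> rows) {
        this.rows = new ArrayList<>(rows);
    }

    // Returns a new RowsSketch sorted by the given columns, in order.
    // Assumes every row contains every named column; values compare as strings.
    RowsSketch sortByCols(String first, String... rest) {
        Comparator<Map<String, String>> cmp = Comparator.comparing(r -> r.get(first));
        for (String col : rest) {
            cmp = cmp.thenComparing(r -> r.get(col));
        }
        List<Map<String, String>> sorted = new ArrayList<>(rows);
        sorted.sort(cmp);
        return new RowsSketch(sorted);
    }

    List<Map<String, String>> rows() {
        return Collections.unmodifiableList(rows);
    }
}
```

With this shape, a call such as new DiffResultSketch(leftRows, rightRows).getLeft().sortByCols("a", "b") reads the same way as the first line of the proposal, and each step returns a new object rather than mutating the result.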

Fix or remove checkstyle configuration

Looks like there's old checkstyle plugin configuration in the POM and some configuration files in the config folder. We should either fix it or remove all of it.

Fix BlackDuck security issues

Right now there's a scary "1/10 (high risk)" reported by BlackDuck for our project. We should really see what we can do to remedy this. It might not be possible for all dependencies, but in those cases we can at least document why we can't resolve it.

Refactor Java Code into Scala Code

The classes under these packages should be refactored into Scala:

  • CmdLine
  • SourceVars
  • SourceTypes
  • VisualResultType
  • Launcher
  • FileUtil

Getting Errors

Hi, I cloned this project and wanted to learn its capabilities. I ran a test case inside the examples folder and got this error:

java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x7f63425a) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x7f63425a

Besides that, I was going to try comparing two CSV files locally and see the differences in the resulting report. Could someone please assist me with this issue?
Thank you in advance!
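An error of this form usually means that the JVM is a recent release (16 or later) that strongly encapsulates JDK internals by default, while the code on the classpath still reaches into sun.nio.ch. A commonly used workaround for Spark on such JVMs is to start the JVM with --add-opens=java.base/sun.nio.ch=ALL-UNNAMED. Whether that is the right fix for this project's examples is an assumption. The small Java sketch below (Java 9 or later; the class name ModuleAccessCheck is hypothetical) only checks whether a given java.base package is currently exported to classpath code, which can help confirm the diagnosis.

```java
// Diagnostic sketch: reports whether java.base exports a package to the
// unnamed module, i.e. to ordinary classpath code such as Spark.
final class ModuleAccessCheck {
    static boolean exportedToUnnamed(String packageName) {
        Module javaBase = Object.class.getModule();
        // The unnamed module of the class loader that loaded this class.
        Module unnamed = ModuleAccessCheck.class.getClassLoader().getUnnamedModule();
        return javaBase.isExported(packageName, unnamed);
    }

    public static void main(String[] args) {
        // The result depends on the JVM version and on any --add-exports or
        // --add-opens flags passed at startup.
        System.out.println("sun.nio.ch exported to classpath code: "
                + exportedToUnnamed("sun.nio.ch"));
    }
}
```

If this prints false in the same JVM configuration that runs the examples, the IllegalAccessError above is the expected outcome.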

Improve enums in SparkCompare.scala

Maybe sourceType should have a method like supportsSchemaCompare. That way we have the logic in one place, it won't get messy if we add more source types, and we could avoid the nested ifs by just having if (left.sourceType.supportsSchemaCompare() && right.sourceType.supportsSchemaCompare()).
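A minimal Java sketch of that suggestion follows. The enum name, its constants, and which of them support schema comparison are all assumptions made for illustration. They are not taken from the project's source.

```java
// Hypothetical enum: each source type declares once whether it can take part
// in a schema comparison.
enum SourceTypeSketch {
    JDBC(true),   // assumed to expose a schema
    HIVE(true),   // assumed to expose a schema
    FILE(false);  // assumed to have no schema to compare

    private final boolean supportsSchemaCompare;

    SourceTypeSketch(boolean supportsSchemaCompare) {
        this.supportsSchemaCompare = supportsSchemaCompare;
    }

    boolean supportsSchemaCompare() {
        return supportsSchemaCompare;
    }
}

final class SchemaCompareSketch {
    // One flat condition replaces nested ifs over individual source types.
    static boolean canCompareSchemas(SourceTypeSketch left, SourceTypeSketch right) {
        return left.supportsSchemaCompare() && right.supportsSchemaCompare();
    }
}
```

Adding a new source type then means adding one constant with its flag. The call site that decides whether to compare schemas does not change.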

Improve Spark Option Configuration

APIs that use Spark configuration options are a tad restrictive. We need to move away from overloaded methods toward a paradigm that gives the user cleaner, more configurable API calls.

This can possibly be done by providing a few basic APIs as we do now, while also allowing the user to interact with the SparkConf directly and append options from there.
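One way to picture the proposal is a single entry point that sets a few defaults and then hands the configuration to a caller-supplied callback, instead of one overload per option. The Java sketch below is an assumption-laden illustration: SparkOptionsSketch and buildOptions are hypothetical names, the defaults are made up, and a plain Map stands in for SparkConf so the example runs without Spark. The keys themselves are standard Spark configuration properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

final class SparkOptionsSketch {
    // Builds the option map: defaults first, then whatever the caller appends
    // or overrides through the customizer callback.
    static Map<String, String> buildOptions(String appName,
                                            Consumer<Map<String, String>> customizer) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("spark.app.name", appName);
        options.put("spark.master", "local[*]"); // illustrative default only
        customizer.accept(options);
        return options;
    }
}
```

A caller could then write buildOptions("diff", o -> o.put("spark.sql.shuffle.partitions", "8")) to add an option, or override a default, without the library needing a dedicated overload for it.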

Publish ScalaDoc to Maven Central

Currently only the JavaDoc is published to Maven Central. We need to look into a way to merge the ScalaDoc and JavaDoc, if possible.
