mrpowers-io / spark-fast-tests

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

Home Page: https://mrpowers-io.github.io/spark-fast-tests/

License: MIT License

Scala 99.58% Shell 0.42%
spark testing-framework

spark-fast-tests's People

Contributors

alfonsorr, carlsverre, cchepelov, dressingoak, dvirtz, gitter-badger, mrpowers, nightscape, oscarvarto, reynoldsm88, semyonsinchenko, skestle, snithish, zeotuan


spark-fast-tests's Issues

Remove all references to deep in this library

I just noticed that this code errors out in Scala 2.13!?

Array(1, 2, 3).deep == Array(1, 2, 3).deep

Here's the error message:

error: value deep is not a member of Array[Int]

sameElements still works in Scala 2.13, so Array(1, 2, 3) sameElements Array(1, 2, 3) is fine.

Spark 3 will support Scala 2.12. I'm not sure when Scala 2.13 will be supported by Spark, but we might as well proactively make this library as compatible as possible.

We should probably start compiling this library with Scala 2.13 to programmatically make sure it stays compatible.
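For context, the sameElements replacement can be sketched in plain Scala (no Spark needed):

```scala
// Array.deep was removed in Scala 2.13; sameElements compares
// one-dimensional arrays element by element instead.
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
val equal = a.sameElements(b)
println(equal) // true
```

Note that for nested arrays sameElements alone is not enough (the inner arrays are compared by reference), which is part of why removing deep needs care.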

Create a generic assertDatasetEquality method

assertDatasetEquality[U](expected: Dataset[U], result: Dataset[U]) could indirectly call assertSmallDatasetEquality or assertLargeDatasetEquality depending on the size of the Dataset.

I'm not really sure of a good way to implement this. assertDatasetEquality would need to run a count, which could be expensive. The definition of a "large" Dataset would depend on the machine... we would somehow need to compare the Dataset size with the available machine resources.

Any good ideas on how to implement this @snithish?
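One possible shape for the dispatch, sketched without Spark. The threshold and the returned method names are purely illustrative, not part of the library; a real implementation would have to run count() once, which is exactly the cost concern raised above.

```scala
// Hypothetical dispatch logic: pick the comparer based on a row-count
// threshold. The threshold value here is arbitrary.
def chooseComparer(rowCount: Long, smallThreshold: Long = 10000L): String =
  if (rowCount <= smallThreshold) "assertSmallDatasetEquality"
  else "assertLargeDatasetEquality"

val small = chooseComparer(42L)
val large = chooseComparer(1000000L)
println(small)
println(large)
```

A machine-aware threshold (comparing the Dataset size to executor memory) would slot into the same shape, but deriving that number reliably is the hard part.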

Managing the Spark Session

In the spark-fast-tests README, we encourage users to wrap the Spark Session in a trait and mix in the trait to test classes that need access to the Spark Session.

import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession.builder().master("local").appName("spark session").getOrCreate()
  }

}

The spark-testing-base library uses the EvilSessionTools approach to extract the SQL context.

I don't think the testing framework should have any knowledge or control over the Spark Session. The Spark Session management should take place in the application and the test framework should simply provide tools that help with assertions.

@snithish @eclosson - I would like your feedback on this intentional design decision. Thanks!

Comparison error while assertApproximateDataFrameEquality

Hello,
I'm trying to use the approximate equality assertion between two DataFrames, but I get an exception saying the DataFrames don't have the same number of rows, which isn't true, as you can see.

assertApproximateDataFrameEquality(output, expected, 0.01, ignoreNullable = true)
[info]   com.github.mrpowers.spark.fast.tests.DatasetContentMismatch: Actual DataFrame Row Count: '61'
[info] Expected DataFrame Row Count: '61'
[info]   at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.throwIfDatasetsAreUnequal$1(DatasetComparer.scala:197)
[info]   at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.assertLargeDatasetEquality(DatasetComparer.scala:213)
[info]   at com.test.TestSpec.assertLargeDatasetEquality(TestSpec.scala:14)
[info]   at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.assertApproximateDataFrameEquality(DatasetComparer.scala:240)
[info]   at com.test.TestSpec.assertApproximateDataFrameEquality(TestSpec.scala:14)

Note: there are a lot of DoubleType columns in these DataFrames. Is this related to #29?

Issue with schemas and assertSmallDataFrameEquality

Hi:

I was following your Testing Spark Applications Medium article, and ran into an issue relating to schemas and spark-fast-tests' assertSmallDataFrameEquality method.

I created the source DataFrame using toDF as suggested, and the expected DataFrame using an explicitly defined schema (a List of StructFields) as suggested. The issue I ran into is that toDF always marks fields as nullable, while in my case the expected DataFrame's fields were not nullable, so the schema comparison would always fail for that reason.

Perhaps there could be an option to ignore the nullability when doing the compare, such that as long as the names and types are the same it's a match?

Thanks,
Ken
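The nullability-insensitive comparison suggested above could be sketched like this. Field here is a stand-in for Spark's StructField, so this is a model of the idea rather than library code:

```scala
// Stand-in for org.apache.spark.sql.types.StructField.
case class Field(name: String, dataType: String, nullable: Boolean)

// Match on name and type only, ignoring the nullable flag.
def schemasMatchIgnoreNullable(a: Seq[Field], b: Seq[Field]): Boolean =
  a.map(f => (f.name, f.dataType)) == b.map(f => (f.name, f.dataType))

val fromToDF    = Seq(Field("age", "IntegerType", nullable = true))
val handWritten = Seq(Field("age", "IntegerType", nullable = false))
val matched = schemasMatchIgnoreNullable(fromToDF, handWritten)
println(matched) // true
```

The same projection trick works on real StructFields by mapping each one to (name, dataType) before comparing.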

assertLargeDatasetEquality | False positives

Input:

val df1 = spark.read.parquet("/PATH/datasetA").limit(10)
val df2 = spark.read.parquet("/PATH/datasetA").limit(10)

Step:

assertLargeDatasetEquality(df1, df2, orderedComparison = false)

Expected Result:

  • The assertion should pass

Actual Result:

  • The assertion fails

Solution:

  • Yuan to implement a new assertLargeDatasetEqualityWithoutOrder method.

assertSmallRDDEquality bug

Here is the code that errors out:

      val yRDD = spark.sparkContext.parallelize(List("a", "b", "c"))

      val actualRDD = yRDD.zipWithIndex()

      val expectedRDD = spark.sparkContext.parallelize(
        List(
          (1, "a"),
          (2, "b"),
          (3, "c")
        )
      )

      assertSmallRDDEquality(actualRDD, expectedRDD)

This is the error:

[error] /Users/powers/Documents/code/my_apps/spark-spec/src/test/scala/com/github/mrpowers/spark/spec/rdd/RDDSpec.scala:439: type mismatch;
[error] found : org.apache.spark.rdd.RDD[(String, Long)]
[error] required: org.apache.spark.rdd.RDD[(Any, Any)]
[error] Note: (String, Long) <: (Any, Any), but class RDD is invariant in type T.
[error] You may wish to define T as +T instead. (SLS 4.5)
[error] assertSmallRDDEquality(actualRDD, expectedRDD)
[error] ^
[error] /Users/powers/Documents/code/my_apps/spark-spec/src/test/scala/com/github/mrpowers/spark/spec/rdd/RDDSpec.scala:439: type mismatch;
[error] found : org.apache.spark.rdd.RDD[(Int, String)]
[error] required: org.apache.spark.rdd.RDD[(Any, Any)]
[error] Note: (Int, String) <: (Any, Any), but class RDD is invariant in type T.
[error] You may wish to define T as +T instead. (SLS 4.5)
[error] assertSmallRDDEquality(actualRDD, expectedRDD)
[error] ^
[error] two errors found
[error] (test:compileIncremental) Compilation failed
[error] Total time: 5 s, completed Apr 11, 2017 9:01:28 AM
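The compile errors above come from RDD being invariant in its type parameter, so an assertion typed at RDD[(Any, Any)] rejects RDD[(String, Long)]. A Spark-free sketch of the fix is to make the assertion generic; Box below is a stand-in for RDD:

```scala
// Box mimics RDD's invariance: a Box[(String, Long)] is NOT a Box[(Any, Any)].
class Box[T](val items: List[T])

// A generic signature sidesteps the invariance problem entirely:
// T is inferred from the arguments, so no upcast to (Any, Any) is needed.
def assertSameElements[T](actual: Box[T], expected: Box[T]): Boolean =
  actual.items == expected.items

val actual   = new Box(List(("a", 0L), ("b", 1L), ("c", 2L)))
val expected = new Box(List(("a", 0L), ("b", 1L), ("c", 2L)))
val ok = assertSameElements(actual, expected)
println(ok) // true
```

For real RDDs the generic version would also need a ClassTag context bound, but the shape of the fix is the same.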

Detected merged artifact: Failed

I'm trying to download the library using sbt, but a warning is thrown:

[info] Updating ...
[info] downloading http://dl.bintray.com/spark-packages/maven/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11.jar ...
[info] 	[SUCCESSFUL ] MrPowers#spark-fast-tests;0.17.2-s_2.11!spark-fast-tests.jar (1873ms)
[info] Done updating.
[warn] 	Detected merged artifact: [FAILED     ] MrPowers#spark-fast-tests;0.17.2-s_2.11!spark-fast-tests.jar(src):  (0ms).
[warn] ==== local: tried
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11-sources.jar
[warn] ==== local-preloaded-ivy: tried
[warn]   /Users/ginfante/.sbt/preloaded/MrPowers/spark-fast-tests/0.17.2-s_2.11/srcs/spark-fast-tests-sources.jar
[warn] ==== local-preloaded: tried
[warn]   file:////Users/ginfante/.sbt/preloaded/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11-sources.jar
[warn] ==== Spark Packages Repo: tried
[warn]   http://dl.bintray.com/spark-packages/maven/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11-sources.jar
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::              FAILED DOWNLOADS            ::
[warn] 	:: ^ see resolution messages for details  ^ ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: MrPowers#spark-fast-tests;0.17.2-s_2.11!spark-fast-tests.jar(src)
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::

My sbt file looks like:

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided" withSources() withJavadoc(),
    "org.apache.spark" %% "spark-sql" % sparkVersion % "provided" withSources() withJavadoc(),
    "org.typelevel" %% "frameless-dataset" % framelessVersion,
    "org.scalactic" %% "scalactic" % "3.0.6",
    "org.scalatest" %% "scalatest" % "3.0.6" % "test",
    "MrPowers" % "spark-fast-tests" % "0.17.2-s_2.11" % Test
)

Repurposing DataFrameComparer to compare Dataset

@MrPowers I was wondering if we can repurpose DataFrameComparer to compare Datasets too. Since a DataFrame is nothing but Dataset[Row], we can use generics to make the same trait work for Datasets.

I think

def assertSmallDataFrameEquality[T](actualDS: Dataset[T], expectedDS: Dataset[T])

def assertLargeDataFrameEquality[T](actualDS: Dataset[T], expectedDS: Dataset[T])

can be used with exactly the same logic used now. Making these functions generic will make them more versatile, and type inference will keep invocation simple without the need to specify type information.

Let me know what you think about this.

Publish to Maven Central?

I noticed that this library (which is awesome BTW!) was being published to Maven Central some time ago, but then stopped: https://search.maven.org/classic/#search|ga|1|spark-fast-tests (last release is in June 2018). Is it possible to publish newer versions to Maven Central again?

The issue is that using third-party repositories is often very hard at certain companies, which only allow a proxying artifact repository configured to work with Maven Central and almost nothing else, and are very reluctant to add any other repository to the proxy. Thus there is really no way to use your library: even if it works on dev machines, it won't work on CI, which is only allowed to use the internal repository.

Enhancement suggestion for assertSmallDataFrameEquality

Hi:

When assertSmallDataFrameEquality fails, the error message prints the top 5 rows of each DataSet. This makes it extremely difficult to tell what exactly is different, and therefore it's hard to know what to correct.

I found some code on Stack Overflow that diffs the DataSets and shows the differences. In my case, I added some code based on it to run show() on the differences and was able to determine what they were, although the output was hard to spot amongst all the Spark job output.

Perhaps some version of this could be integrated, maybe highlighting the differences in a different color within the pretty-printed DataSet, so it's much easier to focus on what's different?

Thanks,
Ken

Allow ErrorMessage formatting for scalatest users

First off, fantastic library.

Secondly, I've incorporated spark-fast-tests into a project that was already using scalatest and maven, but when I run a failing test I get lines like so:

##teamcity[testFailed name='Derives a correct 14-digit Completion Api' message='|n[01-001-00000-00-00|] || [01-001-00000-00-01|]|n[02-001-00000-00-01|] || [02-001-00000-00-02|]|n[03-001-00000-00-02|] || [03-001-00000-00-03|]|n[04-001-00000-00-03|] || [04-001-00000-00-04|]|n[05-001-00000-00-04|] || [05-001-00000-00-05|]|n[06-001-00000-00-05|] || [06-001-00000-00-01|]' details='com.github.mrpowers.spark.fast.tests.DatasetContentMismatch: |n[01-001-00000-00-00|] || [01-001-00000-00-01|]|n[02-001-00000-00-01|] || [02-001-00000-00-02|]|n[03-001-00000-00-02|] || [03-001-00000-00-03|]|n[04-001-00000-00-03|] || [04-001-00000-00-04|]|n[05-001-00000-00-04|] || [05-001-00000-00-05|]|n[06-001-00000-00-05|] || [06-001-00000-00-01|]|r|n	at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.assertSmallDatasetEquality(DatasetComparer.scala:70)|r|n	at com.rseg.spark.test.etl.WellApiHelperSpec.assertSmallDatasetEquality(WellApiHelperSpec.scala:15)|r|n	at com.github.mrpowers.spark.fast.tests.DataFrameComparer$class.assertSmallDataFrameEquality(DataFrameComparer.scala:16)|r|n	at com.rseg.spark.test.etl.WellApiHelperSpec.assertSmallDataFrameEquality(WellApiHelperSpec.scala:15)|r|n	at com.rseg.spark.te...

Is there a way to change the formatting to print like other scalatest messages?

assertSmallDataFrameEquality throwing DatasetSchemaMismatch while schema is same but different column order

Hi,

I am trying to use assertSmallDataFrameEquality for the test below.

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StringType, StructField}

"createDataFrame" should "be able to create dataframe from a map" in {
        val error = ErrorLogger("Test")
        val dateTimeFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
        val sysDateTime = LocalDateTime.now
        error.logData("JobId") = "0"
        error.logData("ProcessName") = error.processName
        error.logData("SourceType") = "Hive"
        error.logData("StepName") = error.processName
        error.logData("ReasonOfFailure") = "NullPointerError"
        error.logData("JobStartTime") = dateTimeFormat.format(sysDateTime)
        error.logData("JobEndTime") = dateTimeFormat.format(sysDateTime)

        val actualDF = createLogDataFrame(error.logData)
        actualDF.show()
        actualDF.printSchema()

        val expectedDF = spark.createDF(
            List(
                ("0" , error.processName, "Hive", error.processName, "NullPointerError",
                  dateTimeFormat.format(sysDateTime), dateTimeFormat.format(sysDateTime))
            ),
            List(
                StructField("JobId",StringType, nullable = true),
                StructField("ProcessName", StringType, nullable = true),
                StructField("SourceType", StringType, nullable = true),
                StructField("StepName", StringType, nullable = true),
                StructField("ReasonOfFailure", StringType, nullable = true),
                StructField("JobStartTime", StringType, nullable = true),
                StructField("JobEndTime", StringType, nullable = true)
            )
        )
        expectedDF.show()

        assertSmallDataFrameEquality(actualDF, expectedDF, orderedComparison=false)
    }

Here is the function I am testing:

import org.apache.spark.sql.{DataFrame, Row}
import scala.collection.mutable
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.rdd.RDD

def createLogDataFrame(data:mutable.Map[String,String]): DataFrame = {
        val schema = StructType(data.keys.toSeq.map(StructField(_, StringType, nullable = true)))
        val rowRDD: RDD[Row] = spark.sparkContext.parallelize(Seq(Row.fromSeq(data.values.toSeq))) // Seq[row]
        spark.createDataFrame(rowRDD, schema)
    }

I am getting the following error

Actual Schema:
StructType(StructField(JobStartTime,StringType,true), StructField(JobEndTime,StringType,true), StructField(ProcessName,StringType,true), StructField(StepName,StringType,true), StructField(SourceType,StringType,true), StructField(JobId,StringType,true), StructField(ReasonOfFailure,StringType,true))
Expected Schema:
StructType(StructField(JobId,StringType,true), StructField(ProcessName,StringType,true), StructField(SourceType,StringType,true), StructField(StepName,StringType,true), StructField(ReasonOfFailure,StringType,true), StructField(JobStartTime,StringType,true), StructField(JobEndTime,StringType,true))

The schemas of the two DataFrames are the same; only the order of the columns differs. I was wondering if there is a way to ignore column order when comparing two DataFrames in spark-fast-tests.

Thank you for your help in advance! :)
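One way to sketch the requested behavior (again modeling StructField with a plain case class, so this is illustrative rather than library code) is to sort both schemas by column name before comparing:

```scala
case class Field(name: String, dataType: String)

// Sort both schemas by column name so the comparison ignores order.
def schemasMatchIgnoreOrder(a: Seq[Field], b: Seq[Field]): Boolean =
  a.sortBy(_.name) == b.sortBy(_.name)

val actualSchema   = Seq(Field("JobStartTime", "StringType"), Field("JobId", "StringType"))
val expectedSchema = Seq(Field("JobId", "StringType"), Field("JobStartTime", "StringType"))
val ok = schemasMatchIgnoreOrder(actualSchema, expectedSchema)
println(ok) // true
```

Note that for the full assertion to pass, the rows would also have to be compared with the columns reordered (e.g. by selecting the expected column order on the actual DataFrame first), not just the schemas.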

Make the Dataset inequality messages better

Here's the current content inequality message:

(screenshot: Screen Shot 2020-03-31 at 5 33 18 AM)

I think it'd be better to align this output. It'd also be better to put "Actual Content | Expected Content" on a newline.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
[info] Actual Content      | Expected Content
[info] [frank,44,us]       | [frank,44,us]
[info] [li,30,china]       | [li,30,china]
[info] [bob,1,uk]          | [bob,1,france]
[info] [camila,5,peru]     | [camila,5,peru]
[info] [maria,19,colombia] | [maria,19,colombia]

It'd be really nice to suppress all the [info] prefixes too, but I'm not sure that's possible with Scalatest.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
[frank,44,us]       | [frank,44,us]
[li,30,china]       | [li,30,china]
[bob,1,uk]          | [bob,1,france]
[camila,5,peru]     | [camila,5,peru]
[maria,19,colombia] | [maria,19,colombia]

Should we get rid of the square brackets for each row of data too?

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content    | Expected Content
frank,44,us       | frank,44,us
li,30,china       | li,30,china
bob,1,uk          | bob,1,france
camila,5,peru     | camila,5,peru
maria,19,colombia | maria,19,colombia
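The alignment proposed above can be sketched in plain Scala by padding the "actual" column to a common width; this is a formatting illustration, not the library's actual implementation:

```scala
// Pad the left column to the widest cell so the "|" separators line up.
def alignRows(rows: Seq[(String, String)]): String = {
  val left  = "Actual Content" +: rows.map(_._1)
  val width = left.map(_.length).max
  val lines = ("Actual Content" -> "Expected Content") +: rows
  lines.map { case (a, e) => a.padTo(width, ' ') + " | " + e }.mkString("\n")
}

val table = alignRows(Seq(
  ("[frank,44,us]", "[frank,44,us]"),
  ("[maria,19,colombia]", "[maria,19,colombia]")
))
println(table)
```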

java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$

Hi Guys,

I am about to use this nice framework in our tests (combined with ScalaTest). Simply copy-pasting the sample code from your page, we got the error described in the title.

scalaVersion: 2.12.7
sparkVersion: 2.4.2
scalatestVersion: 3.1.1
spark-fast-testsVersion: 0.19.1

class Test extends AnyFlatSpec with DataFrameComparer {

  "a" should "b" in {
    print("works")
  }

}

Can you guys help me out?

Data Volume grown when ran Compaction API

Hi Matt,

I have gone through the articles you have written (and your YouTube videos), and I wanted to say they are just wonderful. Thank you very much!!!

I have one question about something I observed, which I have also raised with the Delta Lake community.

I found that while running a compaction operation, the data volume grows enormously. I can't understand the reason behind it. Could you please help me understand?

Here is the test case that I performed, with the details:

Took a sample of 166 MB of input data. Total number of records: 1630168
Once the Delta Lake table was formed, it became 176 MB. Total number of records: 1630168
Ran the merge API to update id; the size became 221 MB
Ran the vacuum API; the size became 205 MB. Total number of records: 1630168
Ran compaction; the size became 781 MB. Total number of records: 3260336 (double the original)
Ran the vacuum API again; the size became 576 MB. Total number of records: 1630168

My code is also attached over here : delta-io/delta#254

After using the library my logging level has changed to INFO

I was trying out the library and found that, after including it, the logging level changed to INFO from DEBUG.

I tried a few things, but nothing worked.

plugins.sbt:

logLevel := Level.Error
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
addSbtPlugin("org.scoverage" % "sbt-scoverage" % "1.6.1")

Spark session:

SparkSession.builder
  .appName(appName)
  .master("local[2]")
  .config("spark.driver.bindAddress", "localhost") // to fix issue of port assignment on local
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
If I remove the library, logging goes back to ERROR.

How can I get this fixed while keeping the library?
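One common workaround, assuming Spark 2.x with Log4j 1.x and that some jar on the test classpath is providing a default log4j.properties: put an explicit log4j.properties in your own test resources, which normally takes precedence over one buried in a dependency.

```properties
# src/test/resources/log4j.properties (sketch; Log4j 1.x, as used by Spark 2.x)
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Whether this applies here depends on which jar is actually injecting the INFO configuration, so treat it as a first thing to try rather than a confirmed fix.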

Is it possible to expand DataSetContentMismatch?

Is there a way to expand the diff so that I can see every column of the dataset? Otherwise it is impossible to see where the difference lies exactly.

[info]   com.github.mrpowers.spark.fast.tests.DatasetContentMismatch: Diffs
[info] +--------------------+--------------------+
[info] |      Actual Content|    Expected Content|
[info] +--------------------+--------------------+
[info] |Event(1,161097120...|Event(1,161097120...|
[info] +--------------------+--------------------+

The above snippet illustrates the problem. The two datasets look exactly the same since they are truncated before the difference is shown.

Support for Scala 2.12

Hi @MrPowers !

Today I started my journey with Apache Spark, once I learned that Spark 2.4.0 can be run with Scala 2.12.
So, I forked your https://github.com/MrPowers/spark-sbt.g8 giter8 template, tweaked it somewhat to work with the current stable Scala version, and then I got a run-time error.

My hunch is that the release of spark-fast-tests I am using is for Scala 2.11.x only, and might not be binary compatible with Scala 2.12.x libraries (ScalaTest?).

The error I am getting is this:

[info] MannersSpec:
[info] com.intersysconsulting.TubularSpec *** ABORTED *** (28 milliseconds)
[info]   java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[info]   at com.intersysconsulting.TubularSpec.<init>(TubularSpec.scala:11)
[info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info]   at java.lang.Class.newInstance(Class.java:442)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   ...
[info] com.intersysconsulting.MannersSpec *** ABORTED *** (4 milliseconds)
[info]   java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[info]   at com.intersysconsulting.MannersSpec.<init>(MannersSpec.scala:10)
[info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info]   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info]   at java.lang.Class.newInstance(Class.java:442)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   ...
[error] Uncaught exception when running com.intersysconsulting.MannersSpec: java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[error] sbt.ForkMain$ForkError: java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[error] 	at com.intersysconsulting.MannersSpec.<init>(MannersSpec.scala:10)
[error] 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[error] 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[error] 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[error] 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[error] 	at java.lang.Class.newInstance(Class.java:442)
[error] 	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
[error] 	at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] 	at java.lang.Thread.run(Thread.java:748)

My complete source code, so that you can reproduce the error above is here: https://github.com/oscarvarto/learning-spark/
I am depending on a SBT-1.X compatible plugin, as shown here: https://github.com/oscarvarto/learning-spark/blob/master/project/plugins.sbt#L3-L4

Publish sources

It'd be good to publish the sources with the JAR so when a developer wants to find out implementation details the IDE can show that.
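For reference, the sbt setting involved might look like this (a sketch in sbt 1.x slash syntax; source jars are on by default in sbt, so the point is to make sure the publishing setup doesn't disable them):

```scala
// build.sbt (sketch): build and publish the -sources.jar alongside
// the main artifact so IDEs can navigate into the implementation.
Compile / packageSrc / publishArtifact := true
```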

assertColumnEquality truncates long columns

Here's some example code:

it("provides a good error message for wide columns") {
  val df = spark.createDF(
    List(
      ("this is a really really really really long sentence", "this is a really really really really long sentence and the diff is hidden"),
      ("animation", "animation"),
      ("bill", "bill")
    ), List(
      ("sentence1", StringType, true),
      ("sentence2", StringType, true)
    )
  )

  assertColumnEquality(df, "sentence1", "sentence2")
}

It'll output this error message:

(screenshot: Screen Shot 2020-04-15 at 6 48 47 AM)

assertColumnEquality should at least provide an option so the output doesn't truncate. Something like assertColumnEquality(df, "sentence1", "sentence2", truncate=false).
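The proposed truncate flag could be sketched like this; renderCell is a hypothetical helper (the parameter name mirrors DataFrame.show(truncate = ...)), not an existing library function:

```scala
// Hypothetical rendering helper: with truncate = true, cut long values
// to `width` characters with a "..." marker; with truncate = false,
// show the full value, as the issue requests.
def renderCell(value: String, truncate: Boolean, width: Int = 20): String =
  if (truncate && value.length > width) value.take(width - 3) + "..."
  else value

val long = "this is a really really really really long sentence"
val cut  = renderCell(long, truncate = true)
val full = renderCell(long, truncate = false)
println(cut)
println(full)
```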

Spark 3 comparison of strings fails due to a missing method

It looks like Spark 3 deleted the dateToString method in org.apache.spark.sql.catalyst.util.DateTimeUtils to prevent the use of SimpleDateFormat (which is not thread safe).

This library should provide a custom implementation valid for both Spark 2.x and 3.x.

assertSmallDataFrameEquality Scalatest bug

(screenshot: Screen Shot 2020-03-15 at 3 18 07 PM)

The assertSmallDataFrameEquality method always shows the first row as failing, even when it's the same. It's also not adding the line break before printing the first row comparison line.

uTest outputs the first line properly, so it looks like this bug only affects Scalatest.

(screenshot: Screen Shot 2020-03-15 at 3 08 19 PM)

Any suggestions on how to fix this?
