mrpowers-io / spark-fast-tests
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Home Page: https://mrpowers-io.github.io/spark-fast-tests/
License: MIT License
I see the examples are in Scala; can this library be used with Java?
The problem: I am comparing two DataFrames with a DoubleType field, and this comparison needs to be approximate. But I'm having problems with the schema because of nullable fields.
A flag for this would be very useful for my case. Is this possible?
I just noticed that this code errors out in Scala 2.13!?
Array(1, 2, 3).deep == Array(1, 2, 3).deep
Here's the error message:
error: value deep is not a member of Array[Int]
sameElements still works in Scala 2.13, so Array(1, 2, 3) sameElements Array(1, 2, 3) is fine.
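A minimal demonstration of the replacements that compile on both 2.12 and 2.13:

```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// == on arrays is reference equality, so it's false here
assert(!(a == b))

// sameElements compares contents and works on both 2.12 and 2.13
assert(a.sameElements(b))

// java.util.Arrays.equals is another content-based comparison
assert(java.util.Arrays.equals(a, b))
```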
Spark 3 will support Scala 2.12. Not sure when Scala 2.13 will be supported by Spark, but might as well proactively make this library as compatible as possible.
Should probably start compiling this library with Scala 2.13 to programmatically make sure it stays compatible.
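Assuming the project builds with sbt, the first step would be a cross-build setting so that 2.13 incompatibilities surface at compile time (the exact version numbers below are illustrative):

```scala
// build.sbt: compile and test against both Scala versions with `sbt +test`
crossScalaVersions := Seq("2.12.15", "2.13.8")
```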
It would be nice if there was a method similar to assertSmallDataFrameEquality that returned a Boolean instead of Unit/Exception.
Use DatasetCountMismatch for count differences. Use basicMismatchMessage for DataFrames that aren't equal.
It seems all the JARs in the Maven repo only contain HTML javadocs, not the compiled code, e.g.
https://repo1.maven.org/maven2/com/github/mrpowers/spark-fast-tests_2.11/0.12.5/
All the JARs have the same file size.
I don't think the assertColumnEquality method will work for StructType columns. Perhaps an assertStructTypeColumnEquality method would be a nice addition.
spark-testing-base has a good example of this
See if we can get all tests passing with OpenJDK 11.
@qnob opened an interesting pull request in spark-testing-base: holdenk/spark-testing-base#291
It seems highly likely this would help the assertLargeDataFrameEquality method run faster. Would this also help the assertSmallDataFrameEquality method run faster? We should look into this...
assertDatasetEquality[U](expected: Dataset[U], result: Dataset[U]) could indirectly call assertSmallDatasetEquality and assertLargeDatasetEquality depending on the size of the Dataset.
I'm not really sure of a good way to implement this. assertDatasetEquality would need to run a count, which could be expensive. The definition of a "large" Dataset would depend on the machine... we would somehow need to compare the Dataset size with the available machine resources.
Any good ideas on how to implement this @snithish?
In the spark-fast-tests README, we encourage users to wrap the Spark Session in a trait and mix in the trait to test classes that need access to the Spark Session.
import org.apache.spark.sql.SparkSession
trait SparkSessionTestWrapper {
lazy val spark: SparkSession = {
SparkSession.builder().master("local").appName("spark session").getOrCreate()
}
}
The spark-testing-base library uses the EvilSessionTools approach to extract the SQL context.
I don't think the testing framework should have any knowledge or control over the Spark Session. The Spark Session management should take place in the application and the test framework should simply provide tools that help with assertions.
@snithish @eclosson - I would like your feedback on this intentional design decision. Thanks!
I'd like to run some official benchmarking analyses to quantify how much faster assertSmallDataFrameEquality is than assertLargeDataFrameEquality.
I'm not sure how to run these benchmarking studies in the Scala / Spark environment.
Hello,
I'm trying to use the approximate equality assertion between 2 DataFrames, but I get an exception saying that the DataFrames don't have the same number of rows, which isn't true, as you can see.
assertApproximateDataFrameEquality(output, expected, 0.01, ignoreNullable = true)
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch: Actual DataFrame Row Count: '61'
[info] Expected DataFrame Row Count: '61'
[info] at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.throwIfDatasetsAreUnequal$1(DatasetComparer.scala:197)
[info] at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.assertLargeDatasetEquality(DatasetComparer.scala:213)
[info] at com.test.TestSpec.assertLargeDatasetEquality(TestSpec.scala:14)
[info] at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.assertApproximateDataFrameEquality(DatasetComparer.scala:240)
[info] at com.test.TestSpec.assertApproximateDataFrameEquality(TestSpec.scala:14)
Note: There are a lot of DoubleType columns in those DataFrames. Is this related to #29?
Hi:
I was following your Testing Spark Applications Medium article and ran into an issue relating to schemas and spark-fast-tests' assertSmallDataFrameEquality method.
I created the source DataFrame using toDF as suggested, and the expected DataFrame using an explicitly defined schema built from a List of StructFields, as suggested. The issue I ran into was that, since toDF always marks fields as nullable, and in my case the expected DataFrame's fields were not nullable, the schema comparison would always fail for that reason.
Perhaps there could be an option to ignore the nullability when doing the compare, such that as long as the names and types are the same it's a match?
Thanks,
Ken
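The ignore-nullability option suggested above is simple to sketch. To keep the example dependency free, a plain case class stands in for Spark's StructField; the idea is to normalize the nullable flag on both sides before comparing:

```scala
// Stand-in for org.apache.spark.sql.types.StructField, for illustration only
case class Field(name: String, dataType: String, nullable: Boolean)

// Normalize the nullable flag on both schemas so the comparison only
// considers field names and data types
def schemasEqualIgnoringNullable(s1: Seq[Field], s2: Seq[Field]): Boolean =
  s1.map(_.copy(nullable = true)) == s2.map(_.copy(nullable = true))
```

With real Spark schemas the equivalent normalization is schema.fields.map(_.copy(nullable = true)), since StructField is also a case class.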
I'm trying to unit test DataFrame equality using spark-fast-tests, but I get a NullPointerException when I use assertSmallDatasetEquality. I don't understand why, because I'm able to show the content of both DataFrames before the assertion. Anyone know what I'm missing?
Logs
Input:
val df1 = spark.read.parquet("/PATH/datasetA").limit(10)
val df2 = spark.read.parquet("/PATH/datasetA").limit(10)
Step:
assertLargeDatasetEquality(df1, df2, orderedComparison = false)
Expected Result:
True
Actual Result:
False
Solution:
Use the assertLargeDatasetEqualityWithoutOrder method.
Here is the code that errors out:
val yRDD = spark.sparkContext.parallelize(List("a", "b", "c"))
val actualRDD = yRDD.zipWithIndex()
val expectedRDD = spark.sparkContext.parallelize(
List(
(1, "a"),
(2, "b"),
(3, "c")
)
)
assertSmallRDDEquality(actualRDD, expectedRDD)
This is the error:
[error] /Users/powers/Documents/code/my_apps/spark-spec/src/test/scala/com/github/mrpowers/spark/spec/rdd/RDDSpec.scala:439: type mismatch;
[error] found : org.apache.spark.rdd.RDD[(String, Long)]
[error] required: org.apache.spark.rdd.RDD[(Any, Any)]
[error] Note: (String, Long) <: (Any, Any), but class RDD is invariant in type T.
[error] You may wish to define T as +T instead. (SLS 4.5)
[error] assertSmallRDDEquality(actualRDD, expectedRDD)
[error] ^
[error] /Users/powers/Documents/code/my_apps/spark-spec/src/test/scala/com/github/mrpowers/spark/spec/rdd/RDDSpec.scala:439: type mismatch;
[error] found : org.apache.spark.rdd.RDD[(Int, String)]
[error] required: org.apache.spark.rdd.RDD[(Any, Any)]
[error] Note: (Int, String) <: (Any, Any), but class RDD is invariant in type T.
[error] You may wish to define T as +T instead. (SLS 4.5)
[error] assertSmallRDDEquality(actualRDD, expectedRDD)
[error] ^
[error] two errors found
[error] (test:compileIncremental) Compilation failed
[error] Total time: 5 s, completed Apr 11, 2017 9:01:28 AM
Trying to download the library using SBT but a warning is thrown:
[info] Updating ...
[info] downloading http://dl.bintray.com/spark-packages/maven/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11.jar ...
[info] [SUCCESSFUL ] MrPowers#spark-fast-tests;0.17.2-s_2.11!spark-fast-tests.jar (1873ms)
[info] Done updating.
[warn] Detected merged artifact: [FAILED ] MrPowers#spark-fast-tests;0.17.2-s_2.11!spark-fast-tests.jar(src): (0ms).
[warn] ==== local: tried
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11-sources.jar
[warn] ==== local-preloaded-ivy: tried
[warn] /Users/ginfante/.sbt/preloaded/MrPowers/spark-fast-tests/0.17.2-s_2.11/srcs/spark-fast-tests-sources.jar
[warn] ==== local-preloaded: tried
[warn] file:////Users/ginfante/.sbt/preloaded/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11-sources.jar
[warn] ==== Spark Packages Repo: tried
[warn] http://dl.bintray.com/spark-packages/maven/MrPowers/spark-fast-tests/0.17.2-s_2.11/spark-fast-tests-0.17.2-s_2.11-sources.jar
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: FAILED DOWNLOADS ::
[warn] :: ^ see resolution messages for details ^ ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: MrPowers#spark-fast-tests;0.17.2-s_2.11!spark-fast-tests.jar(src)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
My sbt file looks like:
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided" withSources() withJavadoc(),
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided" withSources() withJavadoc(),
"org.typelevel" %% "frameless-dataset" % framelessVersion,
"org.scalactic" %% "scalactic" % "3.0.6",
"org.scalatest" %% "scalatest" % "3.0.6" % "test",
"MrPowers" % "spark-fast-tests" % "0.17.2-s_2.11" % Test
)
Both printout formats would be nice
@MrPowers I was wondering if we can repurpose DataFrameComparer to compare Datasets too; since a DataFrame is nothing but Dataset[Row],
we can use generics to use the same trait for Datasets.
I think
def assertSmallDataFrameEquality[T](actualDS: Dataset[T], expectedDS: Dataset[T])
def assertLargeDataFrameEquality[T](actualDS: Dataset[T], expectedDS: Dataset[T])
can be used with the exact same logic used now. Making these functions generic will make them versatile, and type inference will keep invocation simple without the need to specify type information.
Let me know what you think about this.
I noticed that this library (which is awesome BTW!) was being published to Maven Central some time ago, but then stopped: https://search.maven.org/classic/#search|ga|1|spark-fast-tests (last release is in June 2018). Is it possible to publish newer versions to Maven Central again?
The issue is, using third-party repositories is often very hard in certain companies which only allow using a proxying artifacts repository which is configured to work with Maven Central and almost nothing else, and are very reluctant to add any other repository to proxy. Thus, there is really no way to use your library, because even if it works on the dev machines, it won't work on CI which is only allowed to use the internal repository.
Integrate with scalafmt to ensure coding consistency/standards across project.
https://scalameta.org/scalafmt/
Hi:
When assertSmallDataFrameEquality fails, the error message prints the top 5 rows of each Dataset. This makes it extremely difficult to tell what exactly is different, and therefore it's hard to know what to correct.
I found some code on Stack Overflow that diffs the Datasets and shows the differences. In my case, I added some code based on this to run show on the differences and was able to determine what they were, although the output was tough to spot amongst all the Spark job output.
Perhaps some version of this, maybe one that highlights differences in a different color within the pretty-printed Dataset, could be integrated so it's much easier to focus in on what's different?
Thanks,
Ken
First off, fantastic library.
Secondly, I've incorporated spark-fast-tests into a project that was already using scalatest and maven, but when I run a failing test I get lines like so:
##teamcity[testFailed name='Derives a correct 14-digit Completion Api' message='|n[01-001-00000-00-00|] || [01-001-00000-00-01|]|n[02-001-00000-00-01|] || [02-001-00000-00-02|]|n[03-001-00000-00-02|] || [03-001-00000-00-03|]|n[04-001-00000-00-03|] || [04-001-00000-00-04|]|n[05-001-00000-00-04|] || [05-001-00000-00-05|]|n[06-001-00000-00-05|] || [06-001-00000-00-01|]' details='com.github.mrpowers.spark.fast.tests.DatasetContentMismatch: |n[01-001-00000-00-00|] || [01-001-00000-00-01|]|n[02-001-00000-00-01|] || [02-001-00000-00-02|]|n[03-001-00000-00-02|] || [03-001-00000-00-03|]|n[04-001-00000-00-03|] || [04-001-00000-00-04|]|n[05-001-00000-00-04|] || [05-001-00000-00-05|]|n[06-001-00000-00-05|] || [06-001-00000-00-01|]|r|n at com.github.mrpowers.spark.fast.tests.DatasetComparer$class.assertSmallDatasetEquality(DatasetComparer.scala:70)|r|n at com.rseg.spark.test.etl.WellApiHelperSpec.assertSmallDatasetEquality(WellApiHelperSpec.scala:15)|r|n at com.github.mrpowers.spark.fast.tests.DataFrameComparer$class.assertSmallDataFrameEquality(DataFrameComparer.scala:16)|r|n at com.rseg.spark.test.etl.WellApiHelperSpec.assertSmallDataFrameEquality(WellApiHelperSpec.scala:15)|r|n at com.rseg.spark.te...
Is there a way to change the formatting to print like other scalatest messages?
Hi,
I am trying to use assertSmallDataFrameEquality for the test below.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StringType, StructField}
"createDataFrame" should "be able to create dataframe from a map" in {
val error = ErrorLogger("Test")
val dateTimeFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val sysDateTime = LocalDateTime.now
error.logData("JobId") = "0"
error.logData("ProcessName") = error.processName
error.logData("SourceType") = "Hive"
error.logData("StepName") = error.processName
error.logData("ReasonOfFailure") = "NullPointerError"
error.logData("JobStartTime") = dateTimeFormat.format(sysDateTime)
error.logData("JobEndTime") = dateTimeFormat.format(sysDateTime)
val actualDF = createLogDataFrame(error.logData)
actualDF.show()
actualDF.printSchema()
val expectedDF = spark.createDF(
List(
("0" , error.processName, "Hive", error.processName, "NullPointerError",
dateTimeFormat.format(sysDateTime), dateTimeFormat.format(sysDateTime))
),
List(
StructField("JobId",StringType, nullable = true),
StructField("ProcessName", StringType, nullable = true),
StructField("SourceType", StringType, nullable = true),
StructField("StepName", StringType, nullable = true),
StructField("ReasonOfFailure", StringType, nullable = true),
StructField("JobStartTime", StringType, nullable = true),
StructField("JobEndTime", StringType, nullable = true)
)
)
expectedDF.show()
assertSmallDataFrameEquality(actualDF, expectedDF, orderedComparison=false)
}
Here is the function I am testing:
import org.apache.spark.sql.{DataFrame, Row}
import scala.collection.mutable
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.rdd.RDD
def createLogDataFrame(data:mutable.Map[String,String]): DataFrame = {
val schema = StructType(data.keys.toSeq.map(StructField(_, StringType, nullable = true)))
val rowRDD: RDD[Row] = spark.sparkContext.parallelize(Seq(Row.fromSeq(data.values.toSeq))) // Seq[row]
spark.createDataFrame(rowRDD, schema)
}
I am getting the following error
Actual Schema:
StructType(StructField(JobStartTime,StringType,true), StructField(JobEndTime,StringType,true), StructField(ProcessName,StringType,true), StructField(StepName,StringType,true), StructField(SourceType,StringType,true), StructField(JobId,StringType,true), StructField(ReasonOfFailure,StringType,true))
Expected Schema:
StructType(StructField(JobId,StringType,true), StructField(ProcessName,StringType,true), StructField(SourceType,StringType,true), StructField(StepName,StringType,true), StructField(ReasonOfFailure,StringType,true), StructField(JobStartTime,StringType,true), StructField(JobEndTime,StringType,true))
The schemas for both data frames are similar, only the order of columns is different. I was wondering if there is a way to ignore the order of the columns while comparing two data frames in spark-fast-tests.
Thank you for your help in advance! :)
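One workaround until such an option exists is to select both DataFrames' columns in a fixed order before asserting, e.g. df.select(df.columns.sorted.map(col): _*). The effect on the two schemas from this issue can be shown without Spark:

```scala
// Column names from the two schemas above, in their differing orders
val actualCols = Seq(
  "JobStartTime", "JobEndTime", "ProcessName", "StepName",
  "SourceType", "JobId", "ReasonOfFailure")
val expectedCols = Seq(
  "JobId", "ProcessName", "SourceType", "StepName",
  "ReasonOfFailure", "JobStartTime", "JobEndTime")

// The raw orders differ, which is why the schema check fails...
assert(actualCols != expectedCols)

// ...but after sorting (what the select above does), they agree
assert(actualCols.sorted == expectedCols.sorted)
```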
Here's the current content inequality message:
I think it'd be better to align this output. It'd also be better to put "Actual Content | Expected Content" on a newline.
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
[info] Actual Content | Expected Content
[info] [frank,44,us] | [frank,44,us]
[info] [li,30,china] | [li,30,china]
[info] [bob,1,uk] | [bob,1,france]
[info] [camila,5,peru] | [camila,5,peru]
[info] [maria,19,colombia] | [maria,19,colombia]
It'd be really nice to suppress all the info warnings, but not sure if that's possible with Scalatest.
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content | Expected Content
[frank,44,us] | [frank,44,us]
[li,30,china] | [li,30,china]
[bob,1,uk] | [bob,1,france]
[camila,5,peru] | [camila,5,peru]
[maria,19,colombia] | [maria,19,colombia]
Should we get rid of the square brackets for each row of data too?
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content | Expected Content
frank,44,us | frank,44,us
li,30,china | li,30,china
bob,1,uk | bob,1,france
camila,5,peru | camila,5,peru
maria,19,colombia | maria,19,colombia
See here: https://stackoverflow.com/a/49083566/1125159
Possibly create an assertArrayColumnEquality method that uses java.util.Arrays.equals under the hood, if the performance is a lot better.
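The potential win here comes from avoiding boxing: java.util.Arrays.equals has a specialized overload for each primitive array type, while the Scala collections route may box elements. A minimal comparison of the two approaches:

```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// Content comparison via the Scala collections API (may box elements)
val viaScala = a.sameElements(b)

// Content comparison via the JDK; Arrays.equals has a specialized
// overload per primitive array type, so no boxing occurs
val viaJava = java.util.Arrays.equals(a, b)

assert(viaScala && viaJava)
```

Whether the speedup justifies a new method would need benchmarking on realistic column sizes.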
Hi Guys,
I am about to use this nice framework in our tests (combined with Scalatest). Simply copy-pasting the sample code from your page, we got the error described in the title.
scalaVersion: 2.12.7
sparkVersion: 2.4.2
scalatestVersion: 3.1.1
spark-fast-testsVersion: 0.19.1
class Test extends AnyFlatSpec with DataFrameComparer {
"a" should "b" in {
print("works")
}
}
Can you guys help me out?
assertSmallDataFrameEquality doesn't work on DataFrames with StructType columns, as illustrated in this PR: #36
Need to fix this bug... not sure how yet. Possibly will use flattenSchema and then do the comparison.
It's a cool idea, but I need to explore if the cost of the additional code complexity is higher than the benefit of this customization option...
Option 1: Use the built in org.apache.spark.sql.Row#equals method?
Option 2: Use something like the approxEquals method.
@snithish - I think we should stick with Option 1 and delete the RowComparer class. All the tests are passing with Option 1 and it's less complex. Thoughts?
Hi Matt,
I have gone through the articles (and YouTube videos) you have written, and I wanted to say they are just wonderful. Thank you very much for that!
I have one question, which I observed and raised with the Delta Lake community as well.
I found that while doing a compaction operation, the data volume grows dramatically. I could not understand the reason behind it. Could you please help me understand?
Here is the test case that I performed:
Took a sample of 166MB of input data. Total number of records: 1630168
Once the Delta Lake was formed it became 176MB. Total number of records: 1630168
Ran the merge API to update id; now the size became 221MB
Ran the vacuum API; now the size became 205MB. Total number of records: 1630168
Ran compaction; now the size became 781MB. Total number of records: 3260336 (double the original)
Ran the vacuum API; now the size became 576MB. Total number of records: 1630168
My code is also attached over here : delta-io/delta#254
I was trying the library and found that after including it the logging changed to INFO.
I tried a few things, but they are not working.
plugins.sbt:
logLevel := Level.Error
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
addSbtPlugin("org.scoverage" % "sbt-scoverage" % "1.6.1")
Spark session:
SparkSession
  .builder
  .appName(appName)
  .master("local[2]")
  // to fix issue of port assignment on local
  .config("spark.driver.bindAddress", "localhost")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
If I remove the library, logging is back to ERROR.
How can I get this fixed with the library?
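A common fix (assuming Spark's Log4j 1.x logging, which these Spark versions use) is to put a log4j.properties file on the test classpath, which typically takes precedence over whatever configuration a transitive dependency brings in:

```
# src/test/resources/log4j.properties
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```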
Is there a way to expand the diff so that I can see every column of the dataset? Otherwise it is impossible to see where the difference lies exactly.
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch: Diffs
[info] +--------------------+--------------------+
[info] | Actual Content| Expected Content|
[info] +--------------------+--------------------+
[info] |Event(1,161097120...|Event(1,161097120...|
[info] +--------------------+--------------------+
The above snippet illustrates the problem. The two datasets look exactly the same since they are truncated before the difference is shown.
Hi @MrPowers !
Today I started my journey with Apache Spark, once I learned that Spark 2.4.0 can be run with Scala 2.12.
So, I forked your https://github.com/MrPowers/spark-sbt.g8 giter8 template, tweaked it somewhat to work with current stable Scala version,
and then I got a run-time error.
My hunch is that the release of spark-fast-tests I am using is for Scala 2.11.x only, and might not be binary compatible with Scala 2.12.x libraries (ScalaTest?).
The error I am getting is this:
[info] MannersSpec:
[info] com.intersysconsulting.TubularSpec *** ABORTED *** (28 milliseconds)
[info] java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[info] at com.intersysconsulting.TubularSpec.<init>(TubularSpec.scala:11)
[info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info] at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info] at java.lang.Class.newInstance(Class.java:442)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[info] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] ...
[info] com.intersysconsulting.MannersSpec *** ABORTED *** (4 milliseconds)
[info] java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[info] at com.intersysconsulting.MannersSpec.<init>(MannersSpec.scala:10)
[info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info] at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info] at java.lang.Class.newInstance(Class.java:442)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[info] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] ...
[error] Uncaught exception when running com.intersysconsulting.MannersSpec: java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[error] sbt.ForkMain$ForkError: java.lang.NoSuchMethodError: com.github.mrpowers.spark.fast.tests.DatasetComparer.$init$(Lcom/github/mrpowers/spark/fast/tests/DatasetComparer;)V
[error] at com.intersysconsulting.MannersSpec.<init>(MannersSpec.scala:10)
[error] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[error] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[error] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[error] at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[error] at java.lang.Class.newInstance(Class.java:442)
[error] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
[error] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:304)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:748)
My complete source code, so that you can reproduce the error above is here: https://github.com/oscarvarto/learning-spark/
I am depending on an sbt 1.x compatible plugin, as shown here: https://github.com/oscarvarto/learning-spark/blob/master/project/plugins.sbt#L3-L4
It'd be good to publish the sources with the JAR so when a developer wants to find out implementation details the IDE can show that.
See this question: #78
There's already an option to compare DataFrames that are equal but have different row orderings. This feature is to compare DataFrames that have different column orderings.
Here's some example code:
it("provides a good error message for wide columns") {
val df = spark.createDF(
List(
("this is a really really really really long sentence", "this is a really really really really long sentence and the diff is hidden"),
("animation", "animation"),
("bill", "bill")
), List(
("sentence1", StringType, true),
("sentence2", StringType, true)
)
)
assertColumnEquality(df, "sentence1", "sentence2")
}
It outputs a truncated error message. assertColumnEquality should at least provide an option so the output doesn't truncate, something like assertColumnEquality(df, "sentence1", "sentence2", truncate = false).
Use color highlighting to make these diff tools easier to use.
It looks like Spark 3 deleted the dateToString method in org.apache.spark.sql.catalyst.util.DateTimeUtils to prevent the use of SimpleDateFormat (which is not thread safe).
spark-fast-tests should have a custom implementation that is valid for Spark 2.x and 3.x.
The assertSmallDataFrameEquality method always shows the first row as failing, even when it's the same. It's also not adding the line break before printing the first row comparison line.
uTest properly outputs the first line, so it looks like this bug is only for Scalatest.
Any suggestions on how to fix this?