Databricks Scala Coding Style Guide
There's a missing "t" in the first line of the last section:
"When there is an existing well-tesed method" instead of
"When there is an existing well-tested method".
It seems we mention source compatibility, for example in https://github.com/databricks/scala-style-guide#varargs.
However, maybe we should explicitly describe that changing an existing method signature is discouraged.
I have seen several Spark PRs that try to change a signature, and some reviewers leave the same comment repeatedly. Also, IMHO, they sometimes do not describe the specific reason, and I was even asked by a few reviewers to break this rule once.
I think it would be great if this were described here, so we could simply leave the link as a pointer.
To be more concrete,
If we have both classes as below:
object A {
  def toString(v: Int): String = v.toString
}

object Main {
  def main(args: Array[String]): Unit = println(A.toString(1))
}
We could run as below:
scalac A.scala
scalac Main.scala
scala Main
prints
1
However, suppose someone mistakenly changes the parameter type from Int to Any, and the newer version of A.scala is released as below:
object A {
  def toString(v: Any): String = v.toString
}
scalac A.scala
scala Main
The previously compiled Main is broken as below, and we need to recompile Main.scala:
java.lang.NoSuchMethodError: A$.toString(I)Ljava/lang/String;
at Main$.main(Main.scala:2)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
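One common way to avoid this breakage is to keep the old method and add an overload, rather than widening the parameter type in place. A minimal sketch (not an official recommendation from the guide):

```scala
object A {
  // Keep the original method so code compiled against it still links:
  // the descriptor A$.toString(I)Ljava/lang/String; remains on the class.
  def toString(v: Int): String = v.toString

  // Add an overload instead of widening Int to Any in place.
  def toString(v: Any): String = v.toString
}
```

Callers compiled against the Int version keep working, and new callers can use the Any overload; overload resolution still picks the more specific Int method for integer arguments.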
Does anyone know of a good Scala style checker that is flexible enough to incorporate this style guide? It would be useful for automation / continuous integration. Thanks!
Didn't you mean "Avoid defining apply methods on objects"?
Could we put most of the settings, like indentation and method naming, together in an .editorconfig file so that everyone can use it, if possible?
Hi guys,
There is a really strong statement here: https://github.com/databricks/scala-style-guide#recursion-and-tail-recursion
Avoid using recursion,...
...then the author proceeds to recommend a for loop instead of a tail-recursive solution. Scala is an (impure) functional language where it's recommended to solve your problems with recursion, and tailrec will outperform a for-loop solution every time: http://ideone.com/J13mbU
The linked test is a simplistic benchmarking on using for loop, while loop and a tail rec function to sum the first 1M integers: the for loop takes 75x(!) and the while loop 1.75x more time than the solution with tailrec.
In nanoseconds:
139091940 : for loop
3254375 : while loop
1853058 : tail rec
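For reference, a tail-recursive sum of the first 1M integers has this shape (a sketch of what is being benchmarked, not the linked benchmark code itself):

```scala
import scala.annotation.tailrec

// Tail-recursive sum of the integers 0 until n. @tailrec makes the
// compiler reject the code unless the call is in tail position, so it
// compiles down to a loop with no stack growth.
def sum(n: Int): Long = {
  @tailrec
  def loop(i: Int, acc: Long): Long =
    if (i == n) acc else loop(i + 1, acc + i.toLong)
  loop(0, 0L)
}
```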
In Scala, the only place for the "for loop" is when you want to block on your Future(s) - which isn't recommended either - or when you're using high-level language constructs with scalaz (https://github.com/scalaz/scalaz).
It seems like the author tries to put the reader off with boilerplate:
def max(data: Array[Int]): Int = {
  @tailrec
  def max0(data: Array[Int], pos: Int, max: Int): Int = {
    if (pos == data.length) {
      max
    } else {
      max0(data, pos + 1, if (data(pos) > max) data(pos) else max)
    }
  }
  max0(data, 0, Int.MinValue)
}
This is how you do it with Scala and @tailrec:
def max(data: Array[Int]): Option[Int] = {
  @tailrec def getMax(pos: Int, max: Int): Int =
    if (pos == data.length) max
    else getMax(pos + 1, if (data(pos) > max) data(pos) else max)
  if (data.isEmpty) None else Some(getMax(1, data(0)))
}
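A self-contained copy of that Option-returning variant, for anyone who wants to try it (note it also handles empty input, which the Int.MinValue version cannot distinguish):

```scala
import scala.annotation.tailrec

// Option-returning max: None for empty input, Some(maximum) otherwise.
def max(data: Array[Int]): Option[Int] = {
  @tailrec def getMax(pos: Int, best: Int): Int =
    if (pos == data.length) best
    else getMax(pos + 1, if (data(pos) > best) data(pos) else best)
  // Seed with data(0) and start scanning at index 1.
  if (data.isEmpty) None else Some(getMax(1, data(0)))
}
```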
In Scala, the concept of "null" is almost taboo, revisit http://www.lihaoyi.com/post/StrategicScalaStylePracticalTypeSafety.html#scalazzi-scala
The mentioned 3 ways are primitive and too Java-ish. There is no safe way there to handle mutable memory. One commonly accepted way is to encapsulate the state with actors (http://akka.io/), and another is to implement uniqueness types (http://lampwww.epfl.ch/~phaller/capabilities.html).
Avoid defining apply methods on classes. These methods tend to make the code less readable, especially for people less familiar with Scala.
It's used as the indexing operator - pretty common in Scala and it can remove lots of boilerplate. Remember: Scala isn't Java.
Do NOT use symbolic method names
It's OK to use sane operators - we have scaladoc and the like.
Also, regarding
stream1 >>= stream2 ... stream1.join(stream2)
the ">>=" is flatMap, not join (or flatten), and these operators are quite useful for people using scalaz/cats.
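To make the trade-off concrete, here is a toy container (hypothetical, not from the guide or from scalaz/cats) with both a named flatMap and a symbolic >>= alias:

```scala
// A toy container, purely for illustrating symbolic vs. named methods.
class Box[A](val value: A) {
  def flatMap[B](f: A => Box[B]): Box[B] = f(value)
  // Symbolic alias in the scalaz/cats spirit; same behavior as flatMap.
  def >>=[B](f: A => Box[B]): Box[B] = flatMap(f)
}

val viaName   = new Box(21).flatMap(x => new Box(x * 2))
val viaSymbol = new Box(21) >>= (x => new Box(x * 2))
```

Both calls do the same thing; the question the guide raises is only which spelling is more readable to people unfamiliar with the convention.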
Prefer java.util.concurrent.ConcurrentHashMap over scala.collection.concurrent.Map.
Avoid "concurrent maps" - they're useless. There are better ways for local key-value stores.
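For what it's worth, the usual argument for java.util.concurrent.ConcurrentHashMap is its atomic compute methods; a minimal sketch:

```scala
import java.util.concurrent.ConcurrentHashMap

val cache = new ConcurrentHashMap[String, Int]()

// computeIfAbsent runs the mapping function at most once per key, even
// under concurrent access, so the check-then-insert is atomic.
def cachedLength(key: String): Int =
  cache.computeIfAbsent(key, new java.util.function.Function[String, Int] {
    override def apply(k: String): Int = k.length
  })
```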
Add the following to the section "Spacing and Indentation":
make it clear that there must be a space between the class declaration and the curly brace.
// Correct:
class Person {
  // some code
}

// Wrong: missing space before the curly brace
class Person{
  // some code
}
I find the wording around Scala concurrent collections a bit too harsh. ConcurrentMap is deprecated and it's just an interface. It explicitly documents each method that is atomic (including putIfAbsent).
I agree with the observation that TrieMap.getOrElseUpdate was unfortunately not atomic, but according to SI-7943 this was fixed in 2.11.6.
I know this is a coding style that works for the Databricks team, but a lot of people are going to look at it and follow it. I feel at least a link to the ticket and a more nuanced approach would be better.
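For context, the method under discussion looks like this (a minimal sketch; the atomicity claim is per SI-7943 as described in the comment above):

```scala
import scala.collection.concurrent.TrieMap

val m = TrieMap.empty[String, Int]

// Since the SI-7943 fix (Scala 2.11.6), getOrElseUpdate on TrieMap is
// atomic: the default is inserted at most once per key, and later calls
// return the already-stored value.
val first  = m.getOrElseUpdate("key", 1)
val second = m.getOrElseUpdate("key", 2)
```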
Scala class definitions are often over 100 chars, so it'll be useful to give some clear guidance. The purpose of this issue is to clarify the whitespace guidelines for the different scenarios. Once that's clarified, I'll update the style guide with a PR detailing the scenarios.
Thanks again for making this style guide. It's been incredibly helpful over the years.
Scenario 1: When possible, put everything on one line
class Bar(val param1: String) extends BarInterface
Scenario 2: (params + traits) > 100 chars
The style guide currently has this:
class Foo(
    val param1: String,  // 4 space indent for parameters
    val param2: String,
    val param3: Array[Byte])
  extends FooInterface  // 2 space indent here
  with Logging {

  def firstMethod(): Unit = { ... }  // blank line above
}
Here's the formatting I'm actually seeing in the Spark codebase:
class Foo(val param1: String, val param2: String, val param3: Array[Byte])
  extends FooInterface with Logging {

  def firstMethod(): Unit = { ... }  // blank line above
}
Here's some actual Spark code:
case class SubtractTimestamps(endTimestamp: Expression, startTimestamp: Expression)
  extends BinaryExpression with ExpectsInputTypes with NullIntolerant {
Seems like the style guide is suggesting we should break SubtractTimestamps up into 6 lines. I like the 2-line formatting personally.
Scenario 3: params > 100 chars and traits > 100 chars
Here's another actual Spark class:
case class MonthsBetween(
    date1: Expression,
    date2: Expression,
    roundOff: Expression,
    timeZoneId: Option[String] = None)
  extends TernaryExpression with TimeZoneAwareExpression with ImplicitCastInputTypes
  with NullIntolerant {
Is the rule "when the traits > 100 chars, then they should all be put on a newline with 2 space indentation"? e.g. is this the desired formatting?
case class MonthsBetween(
    date1: Expression,
    date2: Expression,
    roundOff: Expression,
    timeZoneId: Option[String] = None)
  extends TernaryExpression
  with TimeZoneAwareExpression
  with ImplicitCastInputTypes
  with NullIntolerant {
Or do we want this?
case class MonthsBetween(
    date1: Expression,
    date2: Expression,
    roundOff: Expression,
    timeZoneId: Option[String] = None)
  extends TernaryExpression with TimeZoneAwareExpression with ImplicitCastInputTypes
  with NullIntolerant {
Scenario 4: params < 100 chars and traits > 100 chars
Is this the correct formatting?
case class AddHours(startTime: Expression, numHours: Expression, timeZoneId: Option[String] = None)
  extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
  with TimeZoneAwareExpression {
Or is it like the params: whenever the traits exceed 100 chars, does each trait need to be broken out onto its own line?
Thank you for the help!
I think we should not use Option.get in our code. When the object in question is None, the error message is a cryptic NoSuchElementException: None.get. Instead, we should use Option.getOrElse(throw new Exception(exceptionMsg)) so that it is easier to discover where and why the object was None in the first place.
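Sketching the suggestion (the case class and error message below are illustrative, not from any codebase):

```scala
import scala.util.Try

case class User(email: Option[String])

// Avoid: User(None).email.get fails with the cryptic
// "NoSuchElementException: None.get".
// Prefer an explicit error that says what was missing:
def emailOrFail(u: User): String =
  u.email.getOrElse(
    throw new IllegalArgumentException("user has no email address"))
```

The stack trace now points at a message describing the missing value rather than at an anonymous None.get.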
Could you help me read an ASC X12 electronic data interchange (EDI) file in Spark with Scala? Any examples?
FunSuite provides === for comparison. I see == and === used in a mixed way across unit tests. It might be great if one of them is explicitly encouraged and documented.
See http://www.scalatest.org/getting_started_with_fun_suite
ScalaTest lets you use Scala's assertion syntax, but defines a triple equals operator (===) to give you better error messages. The following code would give you an error indicating only that an assertion failed:
assert(1 == 2)
Using triple equals instead would give you the more informative error message, "1 did not equal 2":
assert(1 === 2)
Consider the first line of an implementation method like this:
| <- 100 columns
override def moveObject(audit: AuditInfo, source: String, destination: String): Unit = synchronized {
Is the appropriate style to change this to something like
| <- 100 columns
override def moveObject(audit: AuditInfo, source: String, destination: String): Unit = {
  synchronized {
or
| <- 100 columns
override def moveObject(audit: AuditInfo, source: String, destination: String): Unit =
  synchronized {
If we remove the outermost block, the inner indentation level will be one stop closer to the left margin.
Per https://docs.scala-lang.org/style/naming-conventions.html#constants-values-variable-and-methods:
val DEFAULT_PORT = 10000
val MyConstant = ...
Which is better?
We updated the apply method section today: https://github.com/databricks/scala-style-guide#apply_method
@Hawstein can you submit a pull request to update the Chinese version? Thanks.
It would be great if there were a Korean version of this.
If you accept this, I would like to try, since I am Korean, as you know :)
Apache Spark (especially what I've seen of MLlib) and the current Scala style guide differ on one item: Should multiline method and class constructor invocations be written with 1 arg per line or multiple args per line?
Spark mostly uses multiple args per line (for invocation, not for declarations). The current Scala style guide says to put 1 arg per line: See "Methods with Numerous Arguments" here: http://docs.scala-lang.org/style/indentation.html
Can we add our standard to this doc?
E.g.:
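To make the two competing styles concrete (the method and its arguments below are hypothetical, just to show the layouts):

```scala
// Hypothetical builder, purely to illustrate the two invocation layouts.
def makeModel(input: String, maxIter: Int, regParam: Double, tol: Double): String =
  s"$input maxIter=$maxIter regParam=$regParam tol=$tol"

// Style A: one argument per line (scala-lang.org style guide).
val a = makeModel(
  "data.parquet",
  100,
  0.01,
  1e-6)

// Style B: multiple arguments per line, wrapped at the 100-char limit
// (common for invocations in the Spark codebase).
val b = makeModel(
  "data.parquet", 100, 0.01, 1e-6)
```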