Databricks Scala Coding Style Guide
There's a missing "t" in the first line of the last section:
"When there is an existing well-tesed method" instead of
"When there is an existing well-tested method".
It seems we mention source compatibility, for example in https://github.com/databricks/scala-style-guide#varargs.
However, maybe we should explicitly describe that changing an existing method signature is discouraged.
I have seen several Spark PRs that try to change a signature, and some reviewers leave the same comment repeatedly. Also, IMHO, they sometimes do not describe the specific reason, and I was even asked by a few reviewers to break this rule once.
I think it would be great if this were described here, so we could simply leave the link as a pointer.
To be more concrete,
If we have both classes as below:
object A {
  def toString(v: Int): String = v.toString
}

object Main {
  def main(args: Array[String]): Unit = println(A.toString(1))
}
We could run as below:
scalac A.scala
scalac Main.scala
scala Main
prints
1
However, suppose someone mistakenly changes the parameter type from Int to Any, and the newer version of A.scala is released as below:
object A {
  def toString(v: Any): String = v.toString
}
scalac A.scala
scala Main
The previously compiled Main is broken as below, and we need to recompile Main.scala:
java.lang.NoSuchMethodError: A$.toString(I)Ljava/lang/String;
at Main$.main(Main.scala:2)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
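One common way to avoid this breakage is to keep the old method and add an overload, rather than widening the parameter type in place. A minimal sketch (not an official recommendation from the guide):

```scala
object A {
  // Keep the original method so code compiled against it still links:
  // the descriptor A$.toString(I)Ljava/lang/String; remains on the class.
  def toString(v: Int): String = v.toString

  // Add an overload instead of widening Int to Any in place.
  def toString(v: Any): String = v.toString
}
```

Callers compiled against the Int version keep working, and new callers can use the Any overload; overload resolution still picks the more specific Int method for integer arguments.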
Does anyone know of a good Scala style checker that is flexible enough to incorporate this style guide? It would be useful for automation / continuous integration. Thanks!
Didn't you mean "Avoid defining apply methods on objects"?
Could we put most of the settings, like indentation and method naming, together in an .editorconfig file so that everyone can use it, if possible?
Hi guys,
There is a really strong statement here: https://github.com/databricks/scala-style-guide#recursion-and-tail-recursion
Avoid using recursion,...
...then the author proceeds to recommend a for loop instead of a tail-recursive solution. Scala is an (impure) functional language where it's recommended to solve your problems with recursion, and tailrec will outperform a for-loop solution every time: http://ideone.com/J13mbU
The linked test is a simplistic benchmarking on using for loop, while loop and a tail rec function to sum the first 1M integers: the for loop takes 75x(!) and the while loop 1.75x more time than the solution with tailrec.
In nanoseconds:
139091940 : for loop
3254375 : while loop
1853058 : tail rec
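For reference, a tail-recursive sum of the first 1M integers has this shape (a sketch of what is being benchmarked, not the linked benchmark code itself):

```scala
import scala.annotation.tailrec

// Tail-recursive sum of the integers 0 until n. @tailrec makes the
// compiler reject the code unless the call is in tail position, so it
// compiles down to a loop with no stack growth.
def sum(n: Int): Long = {
  @tailrec
  def loop(i: Int, acc: Long): Long =
    if (i == n) acc else loop(i + 1, acc + i.toLong)
  loop(0, 0L)
}
```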
In Scala, the only place for the "for loop" is when you want to block on your Future(s) - which isn't recommended either - or when you're using high-level language constructs with scalaz (https://github.com/scalaz/scalaz).
It seems like the author tries to put the reader off with boilerplate:
def max(data: Array[Int]): Int = {
  @tailrec
  def max0(data: Array[Int], pos: Int, max: Int): Int = {
    if (pos == data.length) {
      max
    } else {
      max0(data, pos + 1, if (data(pos) > max) data(pos) else max)
    }
  }
  max0(data, 0, Int.MinValue)
}
This is how you do it with Scala and @tailrec:
def max(data: Array[Int]): Option[Int] = {
  @tailrec def getMax(pos: Int, max: Int): Int =
    if (pos == data.length) max
    else getMax(pos + 1, if (data(pos) > max) data(pos) else max)
  if (data.isEmpty) None else Some(getMax(1, data(0)))
}
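A self-contained copy of that Option-returning variant, for anyone who wants to try it (note it also handles empty input, which the Int.MinValue version cannot distinguish):

```scala
import scala.annotation.tailrec

// Option-returning max: None for empty input, Some(maximum) otherwise.
def max(data: Array[Int]): Option[Int] = {
  @tailrec def getMax(pos: Int, best: Int): Int =
    if (pos == data.length) best
    else getMax(pos + 1, if (data(pos) > best) data(pos) else best)
  // Seed with data(0) and start scanning at index 1.
  if (data.isEmpty) None else Some(getMax(1, data(0)))
}
```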
In Scala, the concept of "null" is almost taboo, revisit http://www.lihaoyi.com/post/StrategicScalaStylePracticalTypeSafety.html#scalazzi-scala
The mentioned 3 ways are primitive and too Java-ish. There is no safe way there to handle mutable memory. One commonly accepted way is to encapsulate the state with actors (http://akka.io/), and another is to implement uniqueness types (http://lampwww.epfl.ch/~phaller/capabilities.html).
Avoid defining apply methods on classes. These methods tend to make the code less readable, especially for people less familiar with Scala.
It's used as the indexing operator - pretty common in Scala and it can remove lots of boilerplate. Remember: Scala isn't Java.
Do NOT use symbolic method names
It's OK to use sane operators - we have scaladoc and the like.
Also, regarding
stream1 >>= stream2 ... stream1.join(stream2)
the ">>=" is flatMap, not join (or flatten), and these operators are quite useful for people using scalaz/cats.
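To make the trade-off concrete, here is a toy container (hypothetical, not from the guide or from scalaz/cats) with both a named flatMap and a symbolic >>= alias:

```scala
// A toy container, purely for illustrating symbolic vs. named methods.
class Box[A](val value: A) {
  def flatMap[B](f: A => Box[B]): Box[B] = f(value)
  // Symbolic alias in the scalaz/cats spirit; same behavior as flatMap.
  def >>=[B](f: A => Box[B]): Box[B] = flatMap(f)
}

val viaName   = new Box(21).flatMap(x => new Box(x * 2))
val viaSymbol = new Box(21) >>= (x => new Box(x * 2))
```

Both calls do the same thing; the question the guide raises is only which spelling is more readable to people unfamiliar with the convention.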
Prefer java.util.concurrent.ConcurrentHashMap over scala.collection.concurrent.Map.
Avoid "concurrent maps" - they're useless. There are better ways for local key-value stores.
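For what it's worth, the usual argument for java.util.concurrent.ConcurrentHashMap is its atomic compute methods; a minimal sketch:

```scala
import java.util.concurrent.ConcurrentHashMap

val cache = new ConcurrentHashMap[String, Int]()

// computeIfAbsent runs the mapping function at most once per key, even
// under concurrent access, so the check-then-insert is atomic.
def cachedLength(key: String): Int =
  cache.computeIfAbsent(key, new java.util.function.Function[String, Int] {
    override def apply(k: String): Int = k.length
  })
```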
Add the following to the section "Spacing and Indentation":
make it clear that there must be a space between the class declaration and the curly brace.
// Correct:
class Person {
  // some code
}

// Wrong: missing space before the curly brace
class Person{
  // some code
}
I find the wording around Scala concurrent collections a bit too harsh. ConcurrentMap is deprecated and it's just an interface. It explicitly documents each method that is atomic (including putIfAbsent).
I agree with the observation that TrieMap.getOrElseUpdate was unfortunately not atomic, but according to SI-7943 this was fixed in 2.11.6.
I know this is a coding style that works for the Databricks team, but a lot of people are going to look at it and follow it. I feel at least a link to the ticket and a more nuanced approach would be better.
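For context, the method under discussion looks like this (a minimal sketch; the atomicity claim is per SI-7943 as described in the comment above):

```scala
import scala.collection.concurrent.TrieMap

val m = TrieMap.empty[String, Int]

// Since the SI-7943 fix (Scala 2.11.6), getOrElseUpdate on TrieMap is
// atomic: the default is inserted at most once per key, and later calls
// return the already-stored value.
val first  = m.getOrElseUpdate("key", 1)
val second = m.getOrElseUpdate("key", 2)
```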
Scala class definitions are often over 100 chars, so it'll be useful to give some clear guidance. The purpose of this issue is to clarify the whitespace guidelines for the different scenarios. Once that's clarified, I'll update the style guide with a PR detailing the scenarios.
Thanks again for making this style guide. It's been incredibly helpful over the years.
Scenario 1: When possible, put everything on one line
class Bar(val param1: String) extends BarInterface
Scenario 2: (params + traits) > 100 chars
The style guide currently has this:
class Foo(
    val param1: String,  // 4 space indent for parameters
    val param2: String,
    val param3: Array[Byte])
  extends FooInterface  // 2 space indent here
  with Logging {

  def firstMethod(): Unit = { ... }  // blank line above
}
Here's the formatting I'm actually seeing in the Spark codebase:
class Foo(val param1: String, val param2: String, val param3: Array[Byte])
  extends FooInterface with Logging {

  def firstMethod(): Unit = { ... }  // blank line above
}
Here's some actual Spark code:
case class SubtractTimestamps(endTimestamp: Expression, startTimestamp: Expression)
  extends BinaryExpression with ExpectsInputTypes with NullIntolerant {
Seems like the style guide is suggesting we should break SubtractTimestamps up into 6 lines. I like the 2-line formatting personally.
Scenario 3: params > 100 chars and traits > 100 chars
Here's another actual Spark class:
case class MonthsBetween(
    date1: Expression,
    date2: Expression,
    roundOff: Expression,
    timeZoneId: Option[String] = None)
  extends TernaryExpression with TimeZoneAwareExpression with ImplicitCastInputTypes
  with NullIntolerant {
Is the rule "when the traits > 100 chars, then they should all be put on a newline with 2 space indentation"? e.g. is this the desired formatting?
case class MonthsBetween(
    date1: Expression,
    date2: Expression,
    roundOff: Expression,
    timeZoneId: Option[String] = None)
  extends TernaryExpression
  with TimeZoneAwareExpression
  with ImplicitCastInputTypes
  with NullIntolerant {
Or do we want this?
case class MonthsBetween(
    date1: Expression,
    date2: Expression,
    roundOff: Expression,
    timeZoneId: Option[String] = None)
  extends TernaryExpression with TimeZoneAwareExpression with ImplicitCastInputTypes
  with NullIntolerant {
Scenario 4: params < 100 chars and traits > 100 chars
Is this the correct formatting?
case class AddHours(startTime: Expression, numHours: Expression, timeZoneId: Option[String] = None)
  extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant
  with TimeZoneAwareExpression {
Or is it like the params: whenever the traits exceed 100 chars, does each trait need to be broken out onto its own line?
Thank you for the help!
I think we should not use Option.get in our code. When the object in question is None, the error message is a cryptic NoSuchElementException: None.get. Instead, we should use Option.getOrElse(throw new Exception(exceptionMsg)) so that it is easier to discover where and why the object was None in the first place.
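Sketching the suggestion (the case class and error message below are illustrative, not from any codebase):

```scala
import scala.util.Try

case class User(email: Option[String])

// Avoid: User(None).email.get fails with the cryptic
// "NoSuchElementException: None.get".
// Prefer an explicit error that says what was missing:
def emailOrFail(u: User): String =
  u.email.getOrElse(
    throw new IllegalArgumentException("user has no email address"))
```

The stack trace now points at a message describing the missing value rather than at an anonymous None.get.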
Could you help me read an ASC X12 electronic data interchange (EDI) file in Spark with Scala? Any examples?
FunSuite provides === for comparison. I see == and === used in a mixed way across unit tests. It might be great if one of them is explicitly encouraged and documented.
See http://www.scalatest.org/getting_started_with_fun_suite
ScalaTest lets you use Scala's assertion syntax, but defines a triple equals operator (===) to give you better error messages. The following code would give you an error indicating only that an assertion failed:
assert(1 == 2)
Using triple equals instead would give you the more informative error message, "1 did not equal 2":
assert(1 === 2)
Consider the first line of an implementation method like this:
| <- 100 columns
override def moveObject(audit: AuditInfo, source: String, destination: String): Unit = synchronized {
Is the appropriate style to change this to something like
| <- 100 columns
override def moveObject(audit: AuditInfo, source: String, destination: String): Unit = {
  synchronized {
or
| <- 100 columns
override def moveObject(audit: AuditInfo, source: String, destination: String): Unit =
  synchronized {
If we remove the outermost block, the inner indentation level will be one stop closer to the left margin.
Per https://docs.scala-lang.org/style/naming-conventions.html#constants-values-variable-and-methods:
val DEFAULT_PORT = 10000
val MyConstant = ...
Which is better?
We updated the apply method section today: https://github.com/databricks/scala-style-guide#apply_method
@Hawstein can you submit a pull request to update the Chinese version? Thanks.
It would be great if there were a Korean version of this.
If you accept this, I would like to try, since I am Korean, as you know :)
Apache Spark (especially what I've seen of MLlib) and the current Scala style guide differ on one item: Should multiline method and class constructor invocations be written with 1 arg per line or multiple args per line?
Spark mostly uses multiple args per line (for invocation, not for declarations). The current Scala style guide says to put 1 arg per line: See "Methods with Numerous Arguments" here: http://docs.scala-lang.org/style/indentation.html
Can we add our standard to this doc?
E.g.:
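To make the two competing styles concrete (the method and its arguments below are hypothetical, just to show the layouts):

```scala
// Hypothetical builder, purely to illustrate the two invocation layouts.
def makeModel(input: String, maxIter: Int, regParam: Double, tol: Double): String =
  s"$input maxIter=$maxIter regParam=$regParam tol=$tol"

// Style A: one argument per line (scala-lang.org style guide).
val a = makeModel(
  "data.parquet",
  100,
  0.01,
  1e-6)

// Style B: multiple arguments per line, wrapped at the 100-char limit
// (common for invocations in the Spark codebase).
val b = makeModel(
  "data.parquet", 100, 0.01, 1e-6)
```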