
findspark's Introduction

Find spark

PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.
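
For a sense of what that means in practice, here is a minimal sketch of the runtime approach that findspark automates (the install path below is hypothetical; findspark locates it for you and also handles the bundled py4j):

import glob
import os
import sys

spark_home = "/opt/spark"  # hypothetical Spark install location, normally taken from SPARK_HOME
spark_python = os.path.join(spark_home, "python")
# PySpark ships its own py4j as a zip under python/lib
py4j = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
sys.path[:0] = [spark_python, py4j]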

To initialize PySpark, just call

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName="myAppName")

Without any arguments, the SPARK_HOME environment variable will be used, and if that isn't set, other possible install locations will be checked. If you've installed spark with

brew install apache-spark

on OS X, the location /usr/local/opt/apache-spark/libexec will be searched.

Alternatively, you can specify a location with the spark_home argument.

findspark.init('/path/to/spark_home')

To verify the automatically detected location, call

findspark.find()

Findspark can add a startup file to the current IPython profile so that the environment variables will be properly set and pyspark will be imported on IPython startup. This file is created when edit_profile is set to true.

ipython --profile=myprofile
findspark.init('/path/to/spark_home', edit_profile=True)

Findspark can also add to the .bashrc configuration file if it is present so that the environment variables will be properly set whenever a new shell is opened. This is enabled by setting the optional argument edit_rc to true.

findspark.init('/path/to/spark_home', edit_rc=True)

If changes are persisted, findspark will not need to be called again unless the spark installation is moved.

findspark's People

Contributors

abdealiloko, alope107, barik, caioaao, carreau, freeman-lab, gglanzani, lrabbade, minrk, prashant-shahi, rjurney, shaynali, snoe925, stared


findspark's Issues

python path dir in spark 2.4.4

In Spark 2.4.4, the Python path dir needs to change

from
spark_python = os.path.join(spark_home, 'python')
to
spark_python = os.path.join(spark_home, 'libexec/python')

Findspark IndexError on Windows 10

Hi,

I've previously used findspark on Mac OS, but am setting up a Windows system so I can write an instructional guide.

I'm getting the following error, and I'm not sure where to start with troubleshooting:

In [2]: findspark.init()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-2-a4bc4c9af84d> in <module>()
----> 1 findspark.init()

C:\Users\josh\Anaconda3\lib\site-packages\findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
    132     # add pyspark to sys.path
    133     spark_python = os.path.join(spark_home, 'python')
--> 134     py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
    135     sys.path[:0] = [spark_python, py4j]
    136

IndexError: list index out of range

Any suggestions on where I might start with this?
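
Not a fix, but one way to see what init() is tripping over is to run the same glob it uses (line 134 in the traceback above) against your Spark directory; if it prints an empty list, SPARK_HOME is pointing somewhere that has no python/lib/py4j-*.zip (the path below is just an example):

import os
from glob import glob

# example Windows path; use wherever you unpacked Spark
spark_home = os.environ.get("SPARK_HOME", r"C:\spark\spark-2.4.0-bin-hadoop2.7")
spark_python = os.path.join(spark_home, "python")
# findspark indexes [0] of this list, so an empty result raises IndexError
print(glob(os.path.join(spark_python, "lib", "py4j-*.zip")))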

what version of spark does this work with?

I have not used Spark in several years. I have Jupyter installed on my Mac. With spark-2.3.0-bin-hadoop2, I could start it as follows:

export SPARK_HOME=sparkpath
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

pyspark  $extraPkgs $*

PYSPARK_DRIVER_PYTHON is still supported
https://spark.apache.org/docs/latest/configuration.html#environment-variables

PYSPARK_DRIVER_PYTHON_OPTS is undocumented. Not sure if it is needed or not

This approach continues to work (I tested it with a trivial example from https://spark.apache.org/docs/latest/quick-start.html on spark-3.1.2-bin-hadoop3.2).

I am not sure what your solution does.

I think your solution might be better: I would be able to start the Jupyter server and only use Spark in the notebooks that need it.

Kind regards

Andy
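
For comparison, a minimal sketch of the findspark-based flow being discussed (assuming SPARK_HOME points at an unpacked distribution such as spark-3.1.2-bin-hadoop3.2): start Jupyter normally, then initialize Spark only in the notebooks that need it:

import findspark
findspark.init()  # or findspark.init("/path/to/spark-3.1.2-bin-hadoop3.2")

import pyspark
sc = pyspark.SparkContext(appName="myAppName")
print(sc.parallelize(range(10)).count())  # trivial sanity check, as in the quick-start guide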

Version of findspark not up to date on pypi

The current version of this repo, with add_packages, is not present on PyPI, even though PyPI lists it as 1.0.0. The current (GitHub) version of findspark needs to be pushed to PyPI with a version number increment.

Can't find SPARK_HOME (Windows 10)

Hi everyone, I experienced an issue while using FindSpark as instructed on the DataQuest website. An easy solution from the community hasn't arisen and it's been suggested I open an issue here.

I am using findspark to use PySpark in a Jupyter Notebook. If I launch Spark from the command shell, it works as expected. However, findspark is unable to recognize the SPARK_HOME variable, although os.environ on it works correctly. Here are two screenshots:
https://www.dropbox.com/s/mi3wpjobqoplna0/Screenshot%202016-07-18%2016.32.08.png?dl=0
https://www.dropbox.com/s/r894asfx990ll4u/Screenshot%202016-07-18%2016.32.03.png?dl=0

I wonder if you are already aware of this issue or if I have done something incorrectly.

Not working with Remote Cluster with Jupyter on local

I need to set

os.environ["PYSPARK_PYTHON"] = "/home/ubuntu/anaconda3/bin/python"

in my local Jupyter, where PYSPARK_PYTHON is the Python of the worker machines. Otherwise my local sys.executable gets sent to the cluster and I get errors from the workers stating the Python file is not found.
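
A sketch of that workaround in full (the interpreter path, Spark path, and master URL are examples): set PYSPARK_PYTHON to the workers' interpreter after findspark.init() and before creating the context, so the local sys.executable is not shipped to the cluster:

import os
import findspark

findspark.init("/path/to/spark_home")

# Python interpreter on the worker machines, not on the local Jupyter host
os.environ["PYSPARK_PYTHON"] = "/home/ubuntu/anaconda3/bin/python"

import pyspark
sc = pyspark.SparkContext(master="spark://your-cluster-master:7077", appName="myAppName")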

does not support add_packages and add_jars

As a workaround, you need to adjust the initial value of PYSPARK_SUBMIT_ARGS to include either your packages or your jars, but you cannot use both functions to construct a correct PYSPARK_SUBMIT_ARGS, because both functions append the fixed value pyspark-shell to the end of the string when they are called.
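
Concretely, a sketch of that workaround (the package coordinates and jar path are examples): build PYSPARK_SUBMIT_ARGS yourself with both flags and a single trailing pyspark-shell, instead of calling add_packages and add_jars:

import os
import findspark

# example coordinates/paths; the important part is the single trailing "pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.example:some-package:1.0.0 "
    "--jars /path/to/extra.jar "
    "pyspark-shell"
)

findspark.init()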

Support for finding "pip install pyspark"

I have been using findspark for a while, and it makes life a lot simpler.
But I recently started using the pip install pyspark approach in a CI environment.
I found that when I do this and use import pyspark directly, the default behavior of pyspark is to use the system "python3" on the PATH. I'd rather use the venv I am running in, as it has my Python packages.

Proposed change

When detecting the Spark installation, also detect the pyspark installed in the current Python environment.

Alternative options

Option 1: I could set my PYTHON_PATH manually - but I find that using findspark would just be simpler as it would do the standard set of configs I use with my own custom SPARK_HOME.

Option 2: I could set SPARK_HOME=/venv/lib/python3.6/site-packages/pyspark/ - but that's a bit more work I need to do; it would be better if findspark could auto-detect it.

Option 3: To avoid hard-coding the Python version in the path, I can do the following: venv/bin/python -c "import os, pyspark; print(os.path.dirname(pyspark.__file__))"

I am using option 3 as a workaround as of now
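
For reference, option 3 can also be done inside Python before initializing findspark, which is roughly what the proposed detection would automate (a sketch, not existing findspark behaviour):

import os
import pyspark  # the pip-installed package in the current venv

# point SPARK_HOME at the pyspark package bundled in this environment
os.environ["SPARK_HOME"] = os.path.dirname(pyspark.__file__)

import findspark
findspark.init()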

Who would use this feature?

Anyone using pip install pyspark, which I think is mainly a convenience for CI and local installs.

(Optional): Suggest a solution

I think it's just adding the following:

  • Check if pyspark is installed (Using pkg_resources?)
  • And adding the path to it as a possible SPARK_HOME

Spark env - executor cores

Hi, I'm using findspark on Jupyter with Spark 2.1.0, which works great. However, I am trying to increase the number of executor cores in standalone mode. I thought by default it uses all cores on my OS X machine (8 cores), but it's only using 1. Currently tried:

spark = (SparkSession.builder\
                    .master("local[*]")\
                    .config("spark.sql.warehouse.dir", "target/spark-warehouse")\
                    .appName("Ad analysis")\
                    .enableHiveSupport()\
                    .getOrCreate())

spark.sparkContext.defaultParallelism <-- returns 1

Secondly, how do I provide a conf file when running an interactive shell like this with findspark?
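
Not findspark-specific, but for the core count you can request cores explicitly in the master URL, and arbitrary Spark properties can be passed through the builder's .config() (or placed in $SPARK_HOME/conf/spark-defaults.conf, which spark-submit reads by default); a sketch:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[8]")  # request 8 local cores explicitly instead of relying on local[*]
         .config("spark.sql.warehouse.dir", "target/spark-warehouse")
         .appName("Ad analysis")
         .enableHiveSupport()
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)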

Does it work with Python 3.4?

Spark 1.4.0 and above support Python 3.4. However, when I am trying to run Spark using

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName="myAppName")

rdd = sc.parallelize([1,2,3])
rdd.count()

Then I am getting an error:

"Exception: Python in worker has different version 2.7 than that in driver 3.4, PySpark cannot run with different minor versions".

Is it tested on Python 3.4?
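
That error usually means the workers picked up a different Python (here 2.7) than the driver (3.4). A hedged way to pin both to the same interpreter before creating the context (the interpreter path is an example):

import os
import findspark

findspark.init()

# make the workers use the same Python 3.4 interpreter as the driver (example path)
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.4"

import pyspark
sc = pyspark.SparkContext(appName="myAppName")
print(sc.parallelize([1, 2, 3]).count())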

incorrect SPARK_HOME in 2.0

Version 2.0.0 fires the spark-submit command from the wrong directory. It tries
'/usr/lib/spark/python/pyspark/./bin/spark-submit', i.e. SPARK_HOME + './bin/spark-submit'.

When we rolled back to version 1.4.2 it worked without any issues.

RuntimeError: Java gateway process exited before sending its port number

Bug description

RuntimeError: Java gateway process exited before sending its port number

Expected behaviour

Launching bin/pyspark from a terminal shell works. However, it fails in Jupyter Lab:
% env | grep SPARK
PYSPARK_SUBMIT_ARGS=--master spark://localhost:7077
SPARK_HOME=/Users/davidlaxer/spark

Actual behaviour

import findspark
findspark.init("/Users/davidlaxer/spark")
findspark.find()
'/Users/davidlaxer/spark/python/pyspark'

import pyspark
sc = pyspark.SparkContext(appName="myAppName", master="spark://127.0.0.1:7077")


RuntimeError                              Traceback (most recent call last)
in <module>
      1 import pyspark
----> 2 sc = pyspark.SparkContext(appName="myAppName", master="spark://127.0.0.1:7077")

~/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                              " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

~/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    337         with SparkContext._lock:
    338             if not SparkContext._gateway:
--> 339                 SparkContext._gateway = gateway or launch_gateway(conf)
    340                 SparkContext._jvm = SparkContext._gateway.jvm
    341 

~/spark/python/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107     if not os.path.isfile(conn_info_file):
--> 108         raise RuntimeError("Java gateway process exited before sending its port number")
    109 
    110     with open(conn_info_file, "rb") as info:

RuntimeError: Java gateway process exited before sending its port number

How to reproduce

  1. import pyspark
  2. sc = pyspark.SparkContext(appName="myAppName", master="spark://127.0.0.1:7077")
  3. See error

Your personal set up

  • OS / Version: OS X Monterey 12.1 Beta
  • Anaconda
  • Configuration: Jupyter Lab version 2.2.6

Spark version 3.3.0-SNAPSHOT
Python version 3.8.5

From the jupyter lab output:

Error: Missing application resource.

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).
  --archives ARCHIVES         Comma-separated list of archives to be extracted into the
                              working directory of each executor.

  --conf, -c PROP=VALUE       Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.

 Spark standalone, Mesos or K8s with cluster deploy mode only:
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone, Mesos and Kubernetes only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone, YARN and Kubernetes only:
  --executor-cores NUM        Number of cores used by each executor. (Default: 1 in
                              YARN and K8S modes, or all available cores on the worker
                              in standalone mode).

 Spark on YARN and Kubernetes only:
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --principal PRINCIPAL       Principal to be used to login to KDC.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above.

 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
      
[I 07:41:21.056 LabApp] Saving file at /findspark/Untitled.ipynb
[I 07:53:21.067 LabApp] Saving file at /findspark/Untitled.ipynb
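
The "Missing application resource" usage dump suggests that the pre-set PYSPARK_SUBMIT_ARGS (--master spark://localhost:7077) lacks the trailing pyspark-shell token that PySpark normally appends, so spark-submit is invoked without an application. A hedged check is to set the variable with that token included before creating the context:

import os
import findspark

# keep the custom master, but end with "pyspark-shell" so spark-submit gets an application resource
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://localhost:7077 pyspark-shell"

findspark.init("/Users/davidlaxer/spark")

import pyspark
sc = pyspark.SparkContext(appName="myAppName")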


Exception when parsing existing PYSPARK_SUBMIT_ARGS

Parsing of an existing PYSPARK_SUBMIT_ARGS environment variable is currently broken:

In [1]: import findspark                                                                                                                                                                                    

In [2]: findspark.add_packages("'org.elasticsearch:elasticsearch-hadoop:7.7.0'")                                                                                                                            

In [3]: findspark.add_packages("'org.elasticsearch:elasticsearch-hadoop:7.7.0'")                                                                                                                            
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-20322a6fe280> in <module>
----> 1 findspark.add_packages("'org.elasticsearch:elasticsearch-hadoop:7.7.0'")

~/anaconda3/lib/python3.7/site-packages/findspark.py in add_packages(packages)
    195         packages = [packages]
    196 
--> 197     _add_to_submit_args("--packages " + ",".join(packages))
    198 
    199 

~/anaconda3/lib/python3.7/site-packages/findspark.py in _add_to_submit_args(to_add, exe)
    162     existing_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    163     if existing_args:
--> 164         args, existing_exe = existing_args.rpartition(" ")
    165     else:
    166         args = ""

ValueError: too many values to unpack (expected 2)

This is because rpartition() returns 3 values, including the separator, while the code expects 2.

E.g.:

>>> "a b".rpartition(" ")
('a', ' ', 'b')
>>> x,y = "a b".rpartition(" ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 2)

Furthermore, it seems to be expecting the last part of the value to be the name of an executable. This isn't necessarily the case - e.g. during the second call above, the last value is just the name of a package.
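
For reference, the fix on line 164 is to unpack all three values that rpartition() returns (or discard the separator), in the same style as the example above (the package coordinate here is illustrative):

>>> args, _sep, existing_exe = "--packages foo:bar:1.0 pyspark-shell".rpartition(" ")
>>> args
'--packages foo:bar:1.0'
>>> existing_exe
'pyspark-shell'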

How about a new version on PyPI?

This is a nice little tool and it looks like quite a lot of work has been done since 1.3.0. (Indeed some issues have already been addressed.) I'm happy to install from github, but it would be great to have a 1.4.0.

Thanks!

Java gateway process exited before sending its port number in Google Colab with Spark 2.4.5

Recently, I've encountered the "Java gateway process exited before sending its port number" error when using findspark in Google Colab. I was puzzled, because the code worked perfectly before. After some investigation, it turned out to be connected to the latest 1.4.0 version of findspark. I'm using Spark 2.4.5 and OpenJDK 8.

Here's the minimal example to reproduce the error in Google Colab:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

import findspark
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext

findspark.init("spark-2.4.5-bin-hadoop2.7")
sc = pyspark.SparkContext('local[*]')  # Here is the error
spark = SparkSession.builder.appName('abc').getOrCreate()

Here's the full stack trace:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-118d2d9079d2> in <module>()
     14 
     15 findspark.init("spark-2.4.5-bin-hadoop2.7")
---> 16 sc = pyspark.SparkContext('local[*]')
     17 spark = SparkSession.builder.appName('abc').getOrCreate()

3 frames
/usr/local/lib/python3.6/dist-packages/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number

A temporary workaround for my code is obviously downgrading findspark by doing !pip install -q findspark==1.3.0 pyspark, but it does not resolve the problem at its source.

Add relevant trove classifiers to setup.py

Adding classifiers will allow findspark's PyPI package to show up when users filter based on different categories (such as Python version support). pip also uses these classifiers to help ensure dependencies are met during installation.
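
A sketch of what this could look like in setup.py (the exact classifier strings and supported versions are for the maintainers to decide; the values below are illustrative):

from setuptools import setup

setup(
    name="findspark",
    version="1.4.0",  # illustrative
    py_modules=["findspark"],
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "Programming Language :: Python :: 2.7",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
    ],
)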

findspark not working after installation

Hi, I used pip3 install findspark. After the installation completed I tried import findspark, but it said No module named 'findspark'. I don't know what the problem is here.
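
This usually means pip3 installed findspark into a different interpreter than the one running your code. A quick hedged check is to compare the interpreter you are importing from with the one pip3 installed into:

import sys

# interpreter actually running this code; compare it with the location reported by
# `pip3 show -f findspark` in a shell, and reinstall with that interpreter's pip if they differ
print(sys.executable)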

Quiet Spark Logging

Spark produces a huge amount of logging by default, which clutters up the terminal and confuses new users. Findspark should cut down on this logging. @freeman-lab recommended using the following to change the logging level at runtime:

log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

This could be implemented in Findspark by monkey-patching the SparkContext like so:

import pyspark
old_init = pyspark.SparkContext.__init__
def new_init(self, *args, **kwargs):
    old_init(self, *args, **kwargs)
    log4j = self._jvm.org.apache.log4j
    log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)
pyspark.SparkContext.__init__ = new_init

This however feels like a fragile solution to me. We could instead modify the logger properties file at $SPARK_HOME/conf/log4j.properties, but this changes the logging for all uses of Spark and may be too heavyweight a solution.

Error when I try to use the Spark Context

Sorry to be bothersome with this. I am trying to use findspark, but in my Jupyter logs I get the error

"Must specify a primary resource (JAR or Python or R file)"

I have installed Spark 1.5.0 with Homebrew. findspark.find() returns '/usr/local/opt/apache-spark/libexec'. Any ideas? Thanks in advance.
