xmur / bigdatarstrata2017

This project forked from winvector/bigdatarstrata2017

All material for "Modeling big data with R, sparklyr, and Apache Spark" Strata Hadoop 2017.

Home Page: https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/55791

License: GNU General Public License v3.0

HTML 99.98% R 0.02%

bigdatarstrata2017's Introduction

Materials for:

1:30pm–5:00pm Tuesday, March 14, 2017
Data science & advanced analytics
Location: LL21 C/D
Level: Intermediate
Secondary topics:  R

John Mount  (Win-Vector LLC)

We have a short video showing how to install Spark using R and RStudio here.

Also please click through for slides from Edgar Ruiz's excellent Strata Sparklyr presentation and cheat-sheet.

Description from Strata announcement

Modeling big data with R, sparklyr, and Apache Spark

Who is this presentation for?

Data scientists, data analysts, modelers, R users, Spark users, statisticians, and those in IT

Prerequisite knowledge

Basic familiarity with R

Experience using the dplyr R package (If you have not used dplyr before, please read this chapter before coming to class.)

Materials or downloads needed in advance

A WiFi-enabled laptop (You'll be provided an RStudio Server Pro login to use on the day of the workshop.)

What you'll learn

Learn how to quickly set up a local Spark instance, store big data in Spark and then connect to the data with R, use R to apply machine-learning algorithms to big data stored in Spark, and filter and aggregate big data stored in Spark and then import the results into R for analysis and visualization. Understand how to extend R and use sparklyr to access the entire Spark API.
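The workflow listed above can be sketched in a few lines of R. This is a minimal illustration and not part of the workshop materials. It assumes that sparklyr and dplyr are installed and that a local Spark is available (the default version 2.0.0 matches the `spark_install()` call in the config section). The function names `mean_mpg_by_cyl` and `demo_local_spark`, the built-in `mtcars` data, and the table name `mtcars_spark` are hypothetical choices made for this example.

```r
library(dplyr)

# Filter and aggregate with dplyr verbs. The same code runs on a local
# data frame or on a Spark table reference; for a Spark table, sparklyr
# translates the verbs to Spark SQL and the work happens in Spark.
mean_mpg_by_cyl <- function(tbl) {
  tbl %>%
    filter(cyl > 4) %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
    arrange(cyl)
}

# Hypothetical driver: connect to a local Spark instance, store the data
# in Spark, aggregate there, and bring only the small result back into R.
demo_local_spark <- function(version = "2.0.0") {
  sc <- sparklyr::spark_connect(master = "local", version = version)
  on.exit(sparklyr::spark_disconnect(sc))
  cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)
  collect(mean_mpg_by_cyl(cars_tbl))
}
```

Calling `demo_local_spark()` should return an ordinary R data frame. Keeping the dplyr pipeline in its own function makes it possible to try it on a local data frame such as `mtcars` before pointing it at Spark.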

Description

Sparklyr, developed by RStudio in conjunction with IBM, Cloudera, and H2O, provides an R interface to Spark’s distributed machine-learning algorithms and much more. Sparklyr makes practical machine learning scalable and easy. With sparklyr, you can interactively manipulate Spark data using both dplyr and SQL (via DBI); filter and aggregate Spark datasets then bring them into R for analysis and visualization; orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water; create extensions that call the full Spark API and provide interfaces to Spark packages; and establish Spark connections and browse Spark data frames within the RStudio IDE.
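Two of the capabilities mentioned above, SQL through DBI and model fitting with Spark MLlib, might look roughly like the following. This is a sketch under stated assumptions: DBI, dplyr, and sparklyr are installed and a local Spark is available. The helper names `count_by_cyl` and `demo_sql_and_ml`, the `mtcars` data, and the table name are hypothetical and do not come from the workshop materials.

```r
library(DBI)

# Run an aggregate query through DBI. `con` may be any DBI-compatible
# connection; a connection returned by sparklyr::spark_connect() is one.
# `table_name` is pasted into the SQL text, so pass only trusted names.
count_by_cyl <- function(con, table_name) {
  sql <- paste0("SELECT cyl, COUNT(*) AS n FROM ", table_name,
                " GROUP BY cyl ORDER BY cyl")
  dbGetQuery(con, sql)
}

# Hypothetical driver: query a Spark table with SQL, then fit a linear
# regression with Spark MLlib on the same table.
demo_sql_and_ml <- function(version = "2.0.0") {
  sc <- sparklyr::spark_connect(master = "local", version = version)
  on.exit(sparklyr::spark_disconnect(sc))
  cars_tbl <- dplyr::copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)
  counts <- count_by_cyl(sc, "mtcars_spark")
  # The model is trained inside Spark; only the fitted summary returns to R.
  fit <- sparklyr::ml_linear_regression(cars_tbl, mpg ~ wt + cyl)
  list(counts = counts, fit = fit)
}
```

Because `count_by_cyl` depends only on DBI, it can be tried against a small local database (for example, an in-memory RSQLite connection) before it is run against Spark.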

John Mount demonstrates how to use sparklyr to analyze big data in Spark, covering filtering and manipulating Spark data to import into R and using R to run machine-learning algorithms on data in Spark. John also explores the sparklyr integration built into the RStudio IDE.

Derived from R for big data (GitHub: https://github.com/rstudio/Strata2016).

Public repository is: https://github.com/WinVector/BigDataRStrata2017.

config

Current list of CRAN packages used:

# often a good idea, though try "n" to build source
# may interfere with us pinning h2o to a specific version
# update.packages(ask=FALSE) 
cranpkgs <- c(
 'babynames',
 'caret',
 'DBI',
 'devtools',
 'dplyr',
 'dygraphs',
 'e1071',
 'formatR',
 'ggplot2',
  # 'h2o', # installed a bit later
 'lubridate',
 'nycflights13',
 'plotly',
 'rbokeh',
 'rsparkling',
 'RSQLite',
 'sparklyr',
 'tidyr',
 'tidyverse',
 'titanic',
 'xtable'
 )
install.packages(cranpkgs)
devpkgs <- c(
  'RStudio/EDAWR',
  'WinVector/replyr',
  'WinVector/WVPlots' )

for(pkgi in devpkgs) {
  devtools::install_github(pkgi)
}

Also it is critical to look at Exercises/solutions/RsparklingExample.Rmd as it installs and configures some packages. A refresh of all packages will break the matching version numbers required by h2o and rsparkling. So please work through the details in RsparklingExample.Rmd after updating and installing all the above packages.

A copy of those notes is below (but it is better to look at RsparklingExample.Rmd).

# updated from https://gist.github.com/edgararuiz/6453d44a91c85a87998cfeb0dfed9fa9
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("methods", "statmod", "stats",
          "graphics", "RCurl", "jsonlite",
          "tools", "utils")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) {
     install.packages(pkg)
  }
}

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-turnbull/2/R")

# Installing 'rsparkling' from CRAN
install.packages("rsparkling")
options(rsparkling.sparklingwater.version = "2.0.3")
# Reinstalling 'sparklyr' 
install.packages("sparklyr")
sparklyr::spark_install(version = "2.0.0")
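Since a package refresh can disturb the matching version numbers that h2o and rsparkling need, one simple check is to print what is installed before and after any update. The helper below is a hypothetical sketch (the name `report_versions` is not from the workshop materials) and uses only base R.

```r
# Report the installed version of each package whose versions must stay
# matched. Returns a named character vector; NA marks a package that is
# not installed.
report_versions <- function(pkgs = c("h2o", "rsparkling", "sparklyr")) {
  installed <- rownames(installed.packages())
  vapply(pkgs, function(pkg) {
    if (pkg %in% installed) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
}

print(report_versions())
```

Comparing the output before and after running `update.packages()` shows at a glance whether any of the three packages changed.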
