
Kaggle Titanic Challenge with PySpark

Team 4:

Student                          Student ID
Mauricio Juárez Sánchez          A01660336
Alfredo Jeong Hyun Park          A01658259
Fernando Alfonso Arana Salas     A01272933
Miguel Ángel Bustamante Pérez    A01781583

Kaggle Challenge – Titanic Classification

The objective of this project is to solve the Titanic - Machine Learning from Disaster problem from the Kaggle competition using classification algorithms. Specifically, we use the PySpark library instead of Pandas, and the two are contrasted below.
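
As a starting point, here is a minimal sketch of how the competition's training file could be loaded with PySpark (assuming train.csv from Kaggle has been downloaded to the working directory):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the challenge.
spark = SparkSession.builder.appName("titanic-challenge").getOrCreate()

# Read the Kaggle training file; header=True keeps the column names and
# inferSchema=True casts numeric columns such as Age and Fare.
train_df = spark.read.csv("train.csv", header=True, inferSchema=True)

train_df.printSchema()
train_df.show(5)
```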

Pandas vs PySpark:

1. In-memory Computation

Pandas operates primarily in-memory, which means it loads the entire dataset into memory for processing. This can be limiting when working with very large datasets that don't fit in memory.

PySpark also performs in-memory computation, but it can efficiently handle large datasets by distributing the data across a cluster's memory.
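
To illustrate the difference, a small sketch (assuming the same train.csv is available locally): Pandas parses the whole file into the driver's memory, while the PySpark DataFrame is split into partitions that can be spread across executors.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Pandas: the entire file is parsed into the memory of this single process.
pdf = pd.read_csv("train.csv")

# PySpark: the DataFrame is divided into partitions that can live in the
# memory of many executors, so it is not limited to one machine's RAM.
sdf = spark.read.csv("train.csv", header=True, inferSchema=True)
print(sdf.rdd.getNumPartitions())  # number of in-memory partitions
```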

2. Distributed Processing using Parallelize

Pandas does not have native support for distributed processing or parallelization. It's designed for single-machine data analysis.

PySpark is designed for distributed processing and can parallelize computations across a cluster of machines, making it suitable for big data processing.
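
For example, a plain Python collection on the driver can be distributed and processed in parallel (a minimal sketch, not tied to the Titanic data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelize-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local Python range across 4 partitions as an RDD and
# square each element in parallel on the executors.
rdd = sc.parallelize(range(10), numSlices=4)
print(rdd.map(lambda x: x * x).collect())  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```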

3. Cluster Managers

Pandas does not integrate with cluster managers such as Spark Standalone, YARN, or Mesos.

PySpark is designed to work with various cluster managers, such as Spark's built-in (standalone) cluster manager, YARN, and Mesos, allowing it to leverage cluster resources efficiently.
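
The cluster manager is chosen when the session is created (or via the --master flag of spark-submit); a hedged sketch that simply runs locally:

```python
from pyspark.sql import SparkSession

# "local[*]" uses every core of the current machine. On a real cluster the
# same code would normally be submitted with spark-submit --master yarn
# (or another cluster manager) rather than hard-coding the master here.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")
    .getOrCreate()
)
print(spark.sparkContext.master)  # local[*]
```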

4. Fault-Tolerant

Pandas does not have built-in fault tolerance features.

PySpark is designed for fault tolerance. It can recover from node failures in a cluster and continue processing.
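
This recovery is based on lineage: Spark records the chain of transformations that produced each partition and can recompute lost partitions from it. A small sketch that prints that lineage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 50)

# The debug string below shows the lineage Spark would replay to rebuild any
# partition that is lost when an executor or node fails.
print(rdd.toDebugString())
```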

5. Immutable

Pandas DataFrames are mutable, meaning you can modify them in place.

PySpark DataFrames are immutable, which means any transformation on a DataFrame creates a new DataFrame. This immutability simplifies parallel processing and fault tolerance.
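
For instance, adding a column never modifies the original DataFrame; it returns a new one (a minimal sketch with made-up ages):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("immutability-demo").getOrCreate()

df = spark.createDataFrame([(22.0,), (38.0,), (4.0,)], ["Age"])

# withColumn does not touch df; it returns a brand-new DataFrame.
df_with_flag = df.withColumn("IsAdult", col("Age") >= 18)

print(df.columns)            # ['Age']            -- the original is unchanged
print(df_with_flag.columns)  # ['Age', 'IsAdult'] -- a new DataFrame
```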

6. Lazy-evaluation

Pandas does not support lazy evaluation.

PySpark supports lazy evaluation, which means transformations on DataFrames are not executed immediately but are deferred until an action is performed. This optimizes query execution.
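
A quick sketch: the filter below only extends the query plan; nothing is computed until the count action runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 35.0), (2, 8.0), (3, 54.0)], ["PassengerId", "Age"]
)

# Transformation: recorded in the plan, not executed yet.
adults = df.filter(col("Age") >= 18)

# Action: triggers the actual execution of the plan above.
print(adults.count())  # 2
```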

7. Cache & Persistence

Pandas does not provide built-in mechanisms for caching or persisting data.

PySpark allows you to cache intermediate DataFrames in memory for faster access during iterative computations, improving performance.
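
A hedged sketch of caching a DataFrame that will be queried repeatedly (again assuming the Kaggle train.csv is available locally):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

train_df = spark.read.csv("train.csv", header=True, inferSchema=True)

# Keep the parsed DataFrame in memory (spilling to disk if needed) so that
# repeated queries do not re-read and re-parse the CSV.
train_df.persist(StorageLevel.MEMORY_AND_DISK)

print(train_df.count())                    # first action materializes the cache
train_df.groupBy("Pclass").count().show()  # served from the cached data

train_df.unpersist()
```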

8. Inbuilt Optimization with DataFrames

Pandas does not provide built-in optimization for distributed computing.

PySpark's DataFrames are designed for optimized distributed computing. The Catalyst query optimizer and Tungsten execution engine help improve query performance.
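
The plan produced by the Catalyst optimizer can be inspected with explain(); a minimal sketch with made-up rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "male", 22.0), (2, "female", 38.0)], ["PassengerId", "Sex", "Age"]
)

query = df.filter(col("Age") > 18).select("PassengerId", "Sex")

# Prints the parsed, analyzed and optimized logical plans plus the physical
# plan chosen by Catalyst and executed by the Tungsten engine.
query.explain(True)
```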

9. Supports ANSI SQL

Pandas does not directly support ANSI SQL, but you can use SQL-like syntax with the pandasql library.

PySpark has built-in support for ANSI SQL through its Spark SQL module, allowing you to run SQL queries on DataFrames.
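
A short sketch of the Spark SQL route (the column names are the real Kaggle Titanic ones; the rows here are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 0, "male", 22.0), (2, 1, "female", 38.0), (3, 1, "female", 26.0)],
    ["PassengerId", "Survived", "Sex", "Age"],
)

# Register the DataFrame as a temporary view and query it with plain SQL.
df.createOrReplaceTempView("titanic")
spark.sql("""
    SELECT Sex, AVG(Survived) AS survival_rate
    FROM titanic
    GROUP BY Sex
""").show()
```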
