
Kaggle Titanic Challenge with PySpark

Team 4:

Student                          Student ID
Mauricio Juárez Sánchez          A01660336
Alfredo Jeong Hyun Park          A01658259
Fernando Alfonso Arana Salas     A01272933
Miguel Ángel Bustamante Pérez    A01781583

Kaggle Challenge – Titanic Classification

The objective of this project is to solve the Titanic - Machine Learning from Disaster problem from the Kaggle competition using classification algorithms. Specifically, we use the PySpark library instead of Pandas, and the two are contrasted below.
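
As a starting point, here is a minimal sketch of how the competition's training file could be loaded with PySpark (assuming train.csv from Kaggle has been downloaded to the working directory):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the challenge.
spark = SparkSession.builder.appName("titanic-challenge").getOrCreate()

# Read the Kaggle training file; header=True keeps the column names and
# inferSchema=True casts numeric columns such as Age and Fare.
train_df = spark.read.csv("train.csv", header=True, inferSchema=True)

train_df.printSchema()
train_df.show(5)
```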

Pandas vs PySpark:

1. In-memory Computation

Pandas operates primarily in-memory, which means it loads the entire dataset into memory for processing. This can be limiting when working with very large datasets that don't fit in memory.

PySpark also performs in-memory computation, but it can efficiently handle large datasets by distributing the data across a cluster's memory.
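
To illustrate the difference, a small sketch (assuming the same train.csv is available locally): Pandas parses the whole file into the driver's memory, while the PySpark DataFrame is split into partitions that can be spread across executors.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Pandas: the entire file is parsed into the memory of this single process.
pdf = pd.read_csv("train.csv")

# PySpark: the DataFrame is divided into partitions that can live in the
# memory of many executors, so it is not limited to one machine's RAM.
sdf = spark.read.csv("train.csv", header=True, inferSchema=True)
print(sdf.rdd.getNumPartitions())  # number of in-memory partitions
```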

2. Distributed Processing using Parallelize

Pandas does not have native support for distributed processing or parallelization. It's designed for single-machine data analysis.

PySpark is designed for distributed processing and can parallelize computations across a cluster of machines, making it suitable for big data processing.
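
For example, a plain Python collection on the driver can be distributed and processed in parallel (a minimal sketch, not tied to the Titanic data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelize-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local Python range across 4 partitions as an RDD and
# square each element in parallel on the executors.
rdd = sc.parallelize(range(10), numSlices=4)
print(rdd.map(lambda x: x * x).collect())  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```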

3. Cluster Managers

Pandas does not integrate with cluster managers such as Spark Standalone, YARN, or Mesos.

PySpark is designed to work with various cluster managers, such as Spark's built-in (standalone) cluster manager, YARN, and Mesos, allowing it to leverage cluster resources efficiently.
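
The cluster manager is chosen when the session is created (or via the --master flag of spark-submit); a hedged sketch that simply runs locally:

```python
from pyspark.sql import SparkSession

# "local[*]" uses every core of the current machine. On a real cluster the
# same code would normally be submitted with spark-submit --master yarn
# (or another cluster manager) rather than hard-coding the master here.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")
    .getOrCreate()
)
print(spark.sparkContext.master)  # local[*]
```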

4. Fault-Tolerant

Pandas does not have built-in fault tolerance features.

PySpark is designed for fault tolerance. It can recover from node failures in a cluster and continue processing.
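
This recovery is based on lineage: Spark records the chain of transformations that produced each partition and can recompute lost partitions from it. A small sketch that prints that lineage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 50)

# The debug string below shows the lineage Spark would replay to rebuild any
# partition that is lost when an executor or node fails.
print(rdd.toDebugString())
```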

5. Immutable

Pandas DataFrames are mutable, meaning you can modify them in place.

PySpark DataFrames are immutable, which means any transformation on a DataFrame creates a new DataFrame. This immutability simplifies parallel processing and fault tolerance.
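
For instance, adding a column never modifies the original DataFrame; it returns a new one (a minimal sketch with made-up ages):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("immutability-demo").getOrCreate()

df = spark.createDataFrame([(22.0,), (38.0,), (4.0,)], ["Age"])

# withColumn does not touch df; it returns a brand-new DataFrame.
df_with_flag = df.withColumn("IsAdult", col("Age") >= 18)

print(df.columns)            # ['Age']            -- the original is unchanged
print(df_with_flag.columns)  # ['Age', 'IsAdult'] -- a new DataFrame
```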

6. Lazy-evaluation

Pandas does not support lazy evaluation.

PySpark supports lazy evaluation, which means transformations on DataFrames are not executed immediately but are deferred until an action is performed. This optimizes query execution.
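
A quick sketch: the filter below only extends the query plan; nothing is computed until the count action runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 35.0), (2, 8.0), (3, 54.0)], ["PassengerId", "Age"]
)

# Transformation: recorded in the plan, not executed yet.
adults = df.filter(col("Age") >= 18)

# Action: triggers the actual execution of the plan above.
print(adults.count())  # 2
```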

7. Cache & Persistence

Pandas does not provide built-in mechanisms for caching or persisting data.

PySpark allows you to cache intermediate DataFrames in memory for faster access during iterative computations, improving performance.
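
A hedged sketch of caching a DataFrame that will be queried repeatedly (again assuming the Kaggle train.csv is available locally):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

train_df = spark.read.csv("train.csv", header=True, inferSchema=True)

# Keep the parsed DataFrame in memory (spilling to disk if needed) so that
# repeated queries do not re-read and re-parse the CSV.
train_df.persist(StorageLevel.MEMORY_AND_DISK)

print(train_df.count())                    # first action materializes the cache
train_df.groupBy("Pclass").count().show()  # served from the cached data

train_df.unpersist()
```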

8. Inbuilt Optimization with DataFrames

Pandas does not provide built-in optimization for distributed computing.

PySpark's DataFrames are designed for optimized distributed computing. The Catalyst query optimizer and Tungsten execution engine help improve query performance.
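
The plan produced by the Catalyst optimizer can be inspected with explain(); a minimal sketch with made-up rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "male", 22.0), (2, "female", 38.0)], ["PassengerId", "Sex", "Age"]
)

query = df.filter(col("Age") > 18).select("PassengerId", "Sex")

# Prints the parsed, analyzed and optimized logical plans plus the physical
# plan chosen by Catalyst and executed by the Tungsten engine.
query.explain(True)
```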

9. Supports ANSI SQL

Pandas does not directly support ANSI SQL, but you can use SQL-like syntax with the pandasql library.

PySpark has built-in support for ANSI SQL through its Spark SQL module, allowing you to run SQL queries on DataFrames.
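
A short sketch of the Spark SQL route (the column names are the real Kaggle Titanic ones; the rows here are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 0, "male", 22.0), (2, 1, "female", 38.0), (3, 1, "female", 26.0)],
    ["PassengerId", "Survived", "Sex", "Age"],
)

# Register the DataFrame as a temporary view and query it with plain SQL.
df.createOrReplaceTempView("titanic")
spark.sql("""
    SELECT Sex, AVG(Survived) AS survival_rate
    FROM titanic
    GROUP BY Sex
""").show()
```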
