
talks's Introduction

PyData Bangalore Talks

To propose a talk, please open an issue!

Join the conversation on our Slack workspace 🗣️

List of talks at PyData Bangalore meetups (in reverse chronological order):

| # | Date | Talk | Speaker | Slides | Twitter/GitHub/LinkedIn handle | YouTube URL |
|---|------|------|---------|--------|--------------------------------|-------------|
| 7 | January 18, 2020 | Defensive Programming for Deep Learning with PyTorch | Akash Swamy | Slides | GitHub | |
| 7 | January 18, 2020 | A look at state-of-the-art dimensionality reduction techniques | Tamojit | Slides | LinkedIn | |
| 7 | January 18, 2020 | Interpretability: Cracking open the Black Box | Manu Joseph | Slides | LinkedIn | |
| 7 | January 18, 2020 | OpenStreetMap Data Processing with Python | Ramya Ragupathy | Slides | LinkedIn | |
| 6 | December 14, 2019 | Streaming Data Pipelines with Apache Beam | Tanay Tummalapalli | Slides | LinkedIn | |
| 6 | December 14, 2019 | Interactive Visualization with `plotly.express` | AbdulMajedRaja RS | Slides | LinkedIn | |
| 6 | December 14, 2019 | Data Visualizations with R | Baijayanti Chakraborty | Slides | LinkedIn | |
| 6 | December 14, 2019 | Machine Learning in R - the Tidy way | Saurav Ghosh | Slides | LinkedIn | |
| 5 | November 9, 2019 | spaCy Transformers | Matthew Honnibal | Slides | GitHub | |
| 5 | November 9, 2019 | Globally Scalable ClickStream Data Collection | Ramjee Ganti | Slides | GitHub | |
| 5 | November 9, 2019 | PyTorch Lightning | William Falcon | Slides | GitHub | |
| 4 | September 21, 2019 | Making ML Models available as API using R | Saurav Ghosh | Slides | GitHub | YouTube |
| 4 | September 21, 2019 | Spatial thinking with Python | Sangarshanan | Slides | GitHub | YouTube |
| 4 | September 21, 2019 | Getting started with Apache Beam using Python | Ramjee Ganti | Slides | GitHub | YouTube |
| 4 | September 21, 2019 | Text Analytics using R | Prof. Aruna Devi & Dr. Vinothina V | Slides | GitHub | YouTube |
| 3 | August 24, 2019 | Building Pluto: Compute for Exploratory Financial Data Analysis | Shyam Sunder | Slides | GitHub | YouTube |
| 3 | August 24, 2019 | Machine Learning Bias | AbdulMajedRaja RS | Slides | GitHub | YouTube |
| 3 | August 24, 2019 | Transfer Learning in Computer Vision | Viswanath Ravindran | Slides | GitHub | YouTube |
| 2 | July 13, 2019 | [Lightning talk] Learn Pipenv for great good! | Jaysinh Shukla | - | GitHub | YouTube |
| 2 | July 13, 2019 | FlashText – A Python Library 28x faster than Regular Expressions for NLP tasks | Nandan Thakur | Slides | GitHub | YouTube |
| 2 | July 13, 2019 | Automating your pipelines with Self Describing Data | Ramjee Ganti | Slides | GitHub | YouTube |
| 2 | July 13, 2019 | Extracting Names from multi-lingual conversation | Meghana Bhange | Slides | GitHub | YouTube |
| 1 | June 15, 2019 | Approximate deduplication at scale: LSH to the rescue | Chirag Yadav | - | GitHub | YouTube |
| 1 | June 15, 2019 | Experiment to Deployment NLP with spaCy | Nirant Kasliwal | Slides | GitHub | YouTube |

talks's People

Contributors

amrrs, vinayak-mehta


talks's Issues

Experiment to Deployment NLP with spaCy

Experiment to Deployment NLP

Description

Today's main challenges in NLP are two-fold:

  • Building with small or untagged datasets
  • Building fast-enough systems

I share a bag of tricks, including:

  • Lightning fast tokenization
  • Building your own Entity Recognition rulekit
  • Using linguistics to "generate" questions & answers for a quiz

We will get a glimpse of spaCy along the way.

Duration

  • 30 min

Audience

Prerequisites: Exposure to typical ML challenges in an NLP context, e.g. compute time, cleaning large datasets

Best suited for an intermediate audience

Outline

Topics marked with * can/will be skipped to honour time constraints

  • Introduction [2 min]

    • Tools & Ideas: spaCy, displacy, textacy
    • Context of what to expect today
  • Building fast NLP Pipelines [5 min]

    • Why pipelines?
    • Why are pipelines slow?
    • Why do you need faster, leaner pipelines for production systems?
    • 1-line change to get a faster tokenization pipeline with spaCy (see the sketch after this outline)
  • Custom Entity Recognition with spaCy [10 minutes]

    • Writing rules specific to your dataset, and/or
    • Using Language Modeling pretraining*
  • Challenge: Lack of Tagged QA (Question & Answer) Data [15 minutes]

    • How do we generate QA from a free text corpus?
    • Linguistics 101
    • Making the Questions: Using Sentence Inflection
    • Getting the quiz together in 1 place*
  • Wrap up with end to end demo of QA generation
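For the fast-pipelines segment, here is a minimal sketch of the kind of 1-line change involved (an illustration, not necessarily the talk's exact example; assumes en_core_web_sm is installed):

import spacy

# Loading the full pipeline runs the tagger, parser and NER on every document.
nlp_full = spacy.load("en_core_web_sm")

# One-line change: disable the components you don't need, keeping tokenization only.
nlp_fast = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

doc = nlp_fast("PyData Bangalore meets every month.")
print([token.text for token in doc])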

Additional notes

  1. Previous Talks:

    • inMobi Tech Talks: A Nightmare on the LM Street; Slides
    • Wingify DevFest: NLP for Indian Languages; Slides, Youtube
  2. About me: I have written a book on Practical NLP for Developers and won the NLP Kernel Prize from Kaggle. I maintain awesome-nlp, a repository of the world's best NLP tools

I am a FastAI International Fellow, Part 2 for both 2018 & 2019.


  • Don't record this talk.

Building Python Applications with smaller memory & time foot-print

Building Python Applications with a Smaller Memory & Time foot-print

Description

Python has a limited set of native types (dict, int, bool, float, tuple, str, etc.) but a plethora of other defined types with similar functionality and a smaller memory footprint, like named tuples, frozen sets, and immutable dicts.
Along with these data types, there are many nuggets of helper functions in functools, cachetools, and itertools to improve the running time of the functions in your application.
We'll largely focus on three libraries: functools, cachetools, and itertools.
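As a taste of these helpers, a minimal sketch (illustrative only; the talk's own examples may differ):

import sys
from collections import namedtuple
from functools import lru_cache

# A namedtuple instance is leaner than the equivalent dict.
Point = namedtuple("Point", ["x", "y"])
print(sys.getsizeof(Point(1, 2)), sys.getsizeof({"x": 1, "y": 2}))

# lru_cache memoizes a pure function, trading a little memory for a lot of time.
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))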

Duration

  • 30 min

Audience

Pre-requisite: Love for Python and Optimisation

Outline

0-5 minutes: How to calculate the memory footprint of your function (see the sketch after this outline)
5-10 minutes: Replacing native objects with memory-efficient objects
10-15 minutes: Using functools and cachetools to improve your functions' memory and time footprint
15-20 minutes: Writing idiomatic Python code using memory-efficient design patterns: the Adapter and Decorator patterns
20-25 minutes: Iterators and generators to the rescue
25-30 minutes: Q & A
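For the first segment, a minimal sketch of measuring memory footprint with the standard tracemalloc module (an assumed approach; the talk may use different tooling):

import tracemalloc

def build_squares():
    return [i * i for i in range(100_000)]

tracemalloc.start()
data = build_squares()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1024:.1f} KiB, peak: {peak / 1024:.1f} KiB")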

Additional notes

View Github Repo: https://github.com/atifadib/python_best_practises for more details.

Python in Gravitational Wave Astronomy : A bird's eye view

Title

Python in Gravitational Wave Astronomy : A bird's eye view

Description

Albert Einstein's General Theory of Relativity interprets space-time as a 'fabric', and gravity as distortions in this 'fabric'. Einstein also predicted, in 1916, the existence of ripples in this fabric called gravitational waves. But it was only in 2015, almost a hundred years after the prediction, that the Laser Interferometer Gravitational-wave Observatory (LIGO) detected the first gravitational wave from the merger of two black holes. Since then, we have detected a number of gravitational waves from black holes and neutron stars spiralling towards each other. (Rest of the description is WIP.)

Duration

  • 30 min
  • 45 min

Audience

The talk is intended as an introduction to the field of Gravitational Wave astronomy and its computational challenges to people working in computer science. As such, there are no prerequisites for the talk.

Outline

Detailed outline to be added (WIP).

Additional notes

I am a PhD student in the Astrophysical Relativity group at ICTS-TIFR.


  • Don't record this talk.


Evaluating Agents against Environment at Scale

Evaluating AI Agents against Environment at Scale

Description

Talk about my project that will help users to evaluate Agents against Environment at Scale.

Duration

  • 30 min

Outline

  • The rise of reinforcement learning based problems, or any problem which requires that an agent interact with an environment, introduces additional challenges for benchmarking. In contrast to the supervised learning setting, where performance is measured by evaluating on a static test set, it is less straightforward to measure the generalization performance of these agents in the context of their interactions with the environment. Evaluating these agents involves running the associated code on a collection of unseen environments that constitutes a hidden test set for such a scenario. This project deals with setting up a robust pipeline for uploading prediction code in the form of Docker containers (as opposed to a test prediction file) that will be evaluated on remote machines.
  • Presentation: http://bit.do/gsoc-cloudcv

Additional notes

  • Product Engineer at Gojek, previously worked as Google Summer of Code Developer at CloudCV and as intern at BookMyShow

Understanding Extractive Text Summarization

Title

Understanding Extractive Text Summarization

Description

A glance at the evolution of extractive summarization techniques, from word-frequency-based methods to fine-tuning BERT.

Duration

  • 30 min
  • 45 min

Audience

Talk is for beginners as well as intermediate NLP practitioners.

Prerequisites:

  1. Familiarity with NLP terminology.
  2. Basic understanding of deep learning techniques.
  3. Basic Python.

Outline

As the use of ML/AI technology grows, there is a need to tap the potential of unstructured data. One of the primary tasks of NLP is to get the summary and keywords from a text corpus.
Although abstractive summarization is desirable, at this point in time the readability of lengthy content created by algorithms is not production ready. Extractive summarization, with better
readability, is an implementable solution to the summarization problem.
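To ground the word-frequency end of that evolution, a minimal sketch of frequency-scored extractive summarization (illustrative only; the TextRank and BERTSum stages discussed in the talk need far more machinery):

import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the summed frequency of its words.
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
                    reverse=True)
    # Return the top-n sentences in their original order.
    top = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in top)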

Topics

What is summarization?
Importance of text summarization
Types of text summarization
Why extractive?
Evolution from word-frequency-based methods to TextRank to BERTSum
The BERTSum approach in detail: use of the attention mechanism

  • Don't record this talk.

Slide not available

4 | September 21, 2019 | Text Analytics using R | Prof. Aruna Devi & Dr. Vinothina V

Lessons learned by a rookie data scientist from working in a real data team

Title

Lessons learned by a rookie data scientist from working in a real data team.

Description

Things about data that can only be learned when working at scale and in a team. Both technological and ideological.

Duration

  • 30 min

Audience

The talk is centered on some key takeaways for a beginner data scientist planning to work in the field. It blends technology insights with ways of approaching one's work as part of a data science team. The audience is expected to have a broad understanding of what data science entails and some intermediate knowledge of Python and its libraries to follow the technological aspects.

Outline

The talk will broadly be around this Medium article that I have written. In addition, I will talk in some detail about how one might work with big data efficiently, such as CSV files of around 20-30 GB, using chunks and other methods (a small sketch follows at the end of this outline). Following is an outline of the talk -

  1. Scale is the nemesis (5 mins)
    If it were a Kaggle competition, a 7 GB model that gives out your final predictions in a CSV with x% accuracy is as good as a 2 MB one with (x-1)%. In real life, not so much. Taking care of not just the current performance of the model, but also of how it would evolve over time, is of the essence. Faster iterations need agile models: models that can improve with the addition of new data and that do not take a huge amount of resources to deploy and obtain feedback from. Learning that accuracy is not the only metric and that scalability is an equally important factor was crucial.

  2. The team delivers, not just the data scientist (10-15 mins)

In my usual pet projects, I was the one who cleaned the data, the one who tried models, and the one who ignored reproducibility and sustenance. Projects that add value are made by the combined effort of a data team. This can be a group of roughly 7–8 people, depending also sometimes on the maturity of the team and the org.

A data scientist is one member of the team, one whose primary role is to drive analysis forward from the data and to gather and report insights, using statistical and deep learning models if required to aid in the process. Another is a data engineer, who sits at the intersection of backend software development and big data analytics and is typically in charge of managing data workflows, pipelines, and ETL processes. Then there are members of the business team, who communicate with clients, understand their problems, and convey them to the team. They also take the insights from the team, break them down in terms of what the end user needs, and convey that back. Besides these people, usually in a well-working data team, there is also someone from high up in the ranks of the org who participates in the day to day functioning of the team. They help make the team’s presence felt in the org and communicate the work of its members to others.

  3. Security and Trust (5 mins)

Data is the new oil. Data is also the new electricity. Both can be stolen or misused if not taken care of. Working with sensitive data on local machines is a big no-no. Data cannot be freely exported in and out according to one’s whims and fancies. Trust of people and organizations who share their data with your org is of utmost importance. Trust is ensured through security. Security of each data source and each piece of code that is written to work on that data source. The people in charge of this have a huge responsibility on their shoulders. Being an intern, it was very important for me to remember the security aspect of my work and use and store my data judiciously. This almost never happens if it is one of your personal projects. So, being alert while working with the data and its flow, is right up there in the list of necessary steps to take.

  4. Communicate. Ask. Do not get stuck. (5 mins)

One thing that is quite different when you are working in an organization as compared to when you are working on an individual project is that you have a number of wiser heads all around you. There is someone who must have worked on this new framework that you are going to use or that new preprocessing technique which you are about to try out. It is best to politely ask these people for some of their valuable time and once you have a sound background of what you are about to attempt, have a quick one to one brainstorm with those guys. This will give you a definite roadmap of the project ahead as well as remove any wrong notions you might have had. The same applies if you have been stuck on a little bug for a while. Ask!

This is the link to the complete medium post
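As promised above, a minimal sketch of the chunked approach to big CSVs with pandas (file and column names are hypothetical):

import pandas as pd

total = 0
# Stream a 20-30 GB CSV in one-million-row chunks instead of loading it at once.
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()  # "amount" is a hypothetical column
print(total)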

Additional notes

I am a final year computer science undergraduate from BITS Pilani, Goa. I just completed my summer internship at SocialCops, Delhi, and I am now in Bangalore for a 6-month internship at American Express, Big Data Labs. I have been working in the field of data science and machine learning for the past 2 years. I have given talks at my college on getting started with data science as a student. I have also been a part of the core organizing committee of Google Developers Group, Goa. During my first year in college, I hosted Mr. Joel Spolsky, CEO and founder of Stack Overflow and Trello, for a talk at my college's tech fest, Quark. Since I am in Bangalore for at least the next 6 months, I look forward to being involved in more tech meetups and contributing to them if possible.


  • Don't record this talk.

A look at state-of-the-art dimensionality reduction techniques

Title

A look at state-of-the-art dimensionality reduction techniques

Description

A glance at dimensionality reduction techniques applied to public datasets

Duration

  • 30 min
  • 45 min

Audience

The talk is for beginners as well as intermediate data scientists.

Outline

With the advent of large amounts of data, it can be difficult to develop a feel for the data and to identify the latent features that possess the most predictive power. We take a look at traditional matrix-factorization approaches to dimensionality reduction, introduce the audience to current state-of-the-art manifold-based approaches, and compare their performance on popular publicly available datasets.
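As a small taste, a sketch of running t-SNE (one of the manifold-based techniques covered) on a public dataset with scikit-learn (assumed tooling, not necessarily the talk's):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                      # 64-dimensional digits data
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_2d.shape)                                        # (1797, 2)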

Topics

  1. Dimensionality reduction: what and why?
  2. Overview of matrix-factorization methods
  3. Introduction to manifold-based approaches
  4. t-SNE: the current state-of-the-art technique
  5. Performance comparison of the two approaches on public datasets

Additional notes

https://www.linkedin.com/in/tamojit-maiti-635691157/


  • Don't record this talk.


Building Pluto: Compute for Exploratory Financial Data Analysis

Building Pluto: Compute for Exploratory Financial Data Analysis

Description

Financial data, like price series, fundamental information, corporate actions, meta-data, etc., are locked behind firewalls.

Pluto aims to provide a set of open source libraries and a compute cloud running behind one of these firewalls for data scientists who wish to explore these datasets without having to spend weeks collecting data and setting up databases.

This is a call for contributors, mentors and users for our open source project.

Duration

  • 45 min

Audience

People who have tried to analyze time-series data pertaining to financial markets in the past or those who are curious about applying their Python or R skills on financial datasets.

Outline

  1. What makes financial data special?
  2. A survey of current approaches.
  3. My previous attempts at solving this problem.
  4. Why open is the way to go.
  5. Proof of concept.
  6. Call for volunteers!

Additional notes

A collection of blogs that gives an idea about the kind of analysis that is possible using Pluto: https://stockviz.biz/category/collections/


  • Don't record this talk.

Transformer Architectures

Title

Introduction to Transformers and their various descendants like BERT, GPT-2, XLNet, etc.

Description

I will introduce the Transformer architecture in full from its base (encoder, decoder, and self-attention layers), and then, if time permits, I would like to talk about its descendants like BERT, GPT-2, XLNet, etc.

Duration

  • 45 min

Audience

A basic understanding of neural networks

Outline

A detailed outline for my talk is described in my medium articles for the same:
https://medium.com/@tejanm

Additional notes

Here is a link to my medium articles that I will be referring to:
https://medium.com/@tejanm


  • [Yes] Do you require internet for the presentation?
  • [Ok] Do you want your talk to be recorded?

Evolution of word Embedding in NLP

Title

Evolution of word Embedding in NLP

Description

Modeling natural language is a really complex problem because natural language doesn't follow any mathematical construct.
It is very abstract, which makes it very hard to model. In the past there were rule-based solutions, but they had limited performance in terms of automation.
Over time it became evident that if we want to solve NLP problems efficiently, we have to treat them as mathematical problems, and from there started the journey of embeddings.
From representing text with simple one-hot encoding to more advanced representations like vector notation, we have come a long way.
This talk takes a journey through how this representation, known as the embedding, has evolved over the years.
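As a taste of the early part of that journey, a minimal sketch of count and TF-IDF representations with scikit-learn (illustrative only):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the dog barked"]

# One-hot style count representation: sparse, with no notion of relevance.
counts = CountVectorizer(binary=True).fit_transform(corpus)
print(counts.toarray())

# TF-IDF: same dimensions, but down-weights words common to every document.
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.toarray().round(2))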

Duration

  • 30 min
  • 45 min

Audience

Basic understanding of NLP would be good. This talk is for Beginner to Intermediate level

Outline


  • Intro -

  • Text analytics and how it is relevant today

  • Challenges in working with NLP data - the need to represent text in a mathematical format

  • Early stage - representing text in mathematical space
    1. Early forms of representation using one hot encoding
    2. Measuring text relevancy using TF-IDF and how it improves representation quality
    3. Practical applications of word embedding leveraging TF-IDF

  • Mid Stage - capturing relationships between words
    1. Intro to the word2vec model
    Architectural overview - Skip-gram and Continuous Bag of Words
    2. How word2vec started the legacy of pretrained models
    3. Limitations of the word2vec model

  • Current - Capturing contextual information in embeddings
    Intro to the attention mechanism
    1. Overview of attention-based models - Bi, BERT
    2. Transfer Learning in NLP - the ImageNet moment
    3. Architectural comparison between current state-of-the-art NLP models - BERT, XLNet, and GPT-1/2
    Applications leveraging SOTA models

Additional notes

I am a data science professional at VMware with 6+ years of experience. I have completed my master's in software engineering. My interests are NLP and language models.


  • Don't record this talk.


FlashText – A Python Library 28x faster than Regular Expressions for NLP tasks

Title

FlashText – A Python Library 28x faster than Regular Expressions for NLP tasks

Description

Data science starts with data cleaning. When developers work with text, they often clean it up first, sometimes by replacing keywords (“Javascript” with “JavaScript”), other times to find out whether a keyword (“JavaScript”) was mentioned in a document. In today’s fast-moving world, bigger and bigger datasets are coming up, with tens of thousands to millions of documents. Cleaning these gigantic datasets with RegEx can take days (~5 days for 20K keywords and 3 million documents). FlashText, a blazingly fast library, reduces days of computation time to a few minutes (~15 mins for 20K keywords and 3 million documents). FlashText can search and replace keywords in text really fast; it is implemented using the Aho-Corasick algorithm and a trie data structure.
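A minimal sketch of the library's keyword search and replace (matching is case-insensitive by default):

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword("Javascript", "JavaScript")   # map a variant to its canonical form
kp.add_keyword("Python")                     # just find this keyword

text = "I love python and Javascript."
print(kp.extract_keywords(text))             # ['Python', 'JavaScript']
print(kp.replace_keywords(text))             # 'I love Python and JavaScript.'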

Duration

  • 30 min

Audience

This talk is centered around people who are interested in ML engineering, especially in the NLP domain; basic familiarity with Python, dictionaries, and regex should be sufficient.

Outline

Slides can be viewed here - https://docs.google.com/presentation/d/1qv0EKUCmjcvbIMDJSfUYvmpG_nlmFznZzQOM14JEyZE/edit?usp=sharing

[0-3mins]: Brief Introduction about Myself. Introduction to FlashText and compare FlashText vs. Regular Expressions Performance.

[3-10mins]: How is FlashText so blazingly fast?

[10-15mins]: When to Use FlashText?

[15-20mins]: Installing FlashText.

[20-24mins]: UseCase 1: Code – Searching for words in a text document

[24-28mins]: UseCase 2: Code – Replacing words in a text document

[28-30mins]: End Notes and Feedback for Future Talks.

Additional notes

The repository has over 2700+ Stars on GitHub and 15,000+ claps on Medium. Radim Rehurek (Founder of RaRe Technologies (Gensim)) has tweeted about this repository here: https://twitter.com/RadimRehurek/status/904989624589803520

Medium Article: https://www.freecodecamp.org/news/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f/ (Over 15,000+ Claps)

GitHub Repo: https://github.com/vi3k6i5/flashtext (Over 2700+ Stars)

FlashText Documentation: https://buildmedia.readthedocs.org/media/pdf/flashtext/latest/flashtext.pdf

FlashText Research Paper: https://arxiv.org/pdf/1711.00046.pdf

LinkedIn: https://linkedin.com/in/nthakur20/

Video Preview: https://youtu.be/s8WP79QU1zw

Slides: https://docs.google.com/presentation/d/1qv0EKUCmjcvbIMDJSfUYvmpG_nlmFznZzQOM14JEyZE/edit?usp=sharing

About Me: My name is Nandan Thakur, a BITS graduate currently working as a Data Scientist (R&D) at Knolskape, Bangalore. I am a perpetual, quick learner, keen to explore the realm of data analytics and science. I am deeply excited about the times we live in and the rate at which data is being generated and transformed into an asset. I am well versed in domains such as Natural Language Processing, Machine Learning, and Signal Processing, and have a keen interest in learning interdisciplinary concepts involving machine learning. I look forward to being involved in more tech meetups and contributing to open source more actively.


  • Record this talk.

Globally Scalable ClickStream Data Collection

Globally Scalable ClickStream Data Collection

Description

Clickstream and event data are ubiquitous these days. With serverless and the cloud, building a scalable data collection architecture is a one-person job; in the past, it would have required a team of 30+ people.

In this talk we will go over how one can go about building a data collection system for streaming data.

Duration

  • 30 min
  • 45 min

Audience

It will help if you have used, or built, streaming data systems.

Outline

The talk covers the following:

  • What is clickstream data?
  • Components of a streaming data collection system
  • Deep dive into each component

Additional notes

About me: I have built and scaled technology products at multiple startups. Currently building a marketing analytics solution.


  • Don't record this talk.


Extracting Names from multi-lingual conversation

Extracting Names from multi-lingual conversation

Description

Chatbots are an upcoming, automated way for businesses to communicate with their clients. An important aspect of personalising this communication is to employ natural language rather than text boxes with strict bounds. As part of this, it is important to extract named entities (to understand the customer’s name) from messages written in unstructured, natural language. This problem is called Named Entity Recognition.

Named Entity Recognition for chats can face several issues, like inputs consisting of:

  • Inconsistent grammar
  • Emojis and other unexpected Unicode characters
  • Typos and inconsistent capitalisation

This talk focuses on creating an efficient NER for chats by tweaking the current state-of-the-art NERs.

Duration

  • 30 min

Audience

Prerequisites: This is a beginner friendly talk. Prior knowledge of NLP jargon is not expected.

Outline

  • Introduction [5 mins]
    • Why do we need NER(Named Entity Recognition) in chatbots?
    • What are the challenges faced in NER?
  • Current state-of-the-art work [5 mins]
    • Exploring the NER offerings of natural language libraries like spaCy and Flair (Zalando Research)
    • Where do these models fail for chats? (see the sketch after this outline)
      • Humans do not always use complete grammar (e.g. "What is your name? raj")
  • Dealing with chat (human conversation) specific cases [10 mins]
    • Random Capitalisation (Eg. RobErT)
    • Grammatical Errors (E.g. Name Robert mine)
    • Unicode Handling (E.g. Emojis, Accents)
    • Training model for chats
      • FastText Model for NER and where it fails
      • Tweaking Flair NER for multiple languages and case insensitive data.
  • Taking context into consideration [5 mins]
    • How do Conditional Random Fields help bring context into NER tagging?
  • Wrap up and questions [5 mins]
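A minimal sketch of the failure mode above, using an off-the-shelf spaCy model (assumes en_core_web_sm; exact behaviour varies by model version):

import spacy

nlp = spacy.load("en_core_web_sm")

# Well-formed text: the pretrained NER usually finds the PERSON entity.
print([(e.text, e.label_) for e in nlp("My name is Robert.").ents])

# Chat-style input with no capitalisation: the same model typically misses it.
print([(e.text, e.label_) for e in nlp("what is your name? raj").ents])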

  • Don't record this talk.

'Cloud Firestore' - The startup world's nitro

Title

'Cloud Firestore' - The startup world's nitro

Description

The latest trending technologies look brighter at the start but slowly take us towards a horizon that scatters the focus of the business. This talk is a quick overview of why to use Firestore for your startup or next project, and how efficiently the Firebase family helps you launch a scalable solution, faster. More importantly, you can understand the limitations and make them work in your favour.

Duration

  • 45 min

Audience

The talk is for an intermediate and advanced audience.

Outline

We would want to cover sub-topics around Firestore:

  • Introduction
  • The Firebase family
  • Usual interaction layers
  • Architecture
  • How it works
  • and more

Additional notes

Fluxon is a global product development company founded by ex-Googlers and startup founders. We work with fast-growing startups and tech leaders like Google, Stripe and Zapier to deliver the world’s most innovative products. Bringing together strong expertise across disciplines and industries, Fluxon offers full-cycle software development: from ideation and design to build and go-to-market.
Intro to Fluxon (New).pdf

Our Website - https://www.fluxon.com/
Our LinkedIn Page - https://www.linkedin.com/company/fluxon/


  • Do you require internet for the presentation? - Yes
  • Do you want your talk to be recorded? - Yes

Apache Beam with Python: getting started

Title

Apache Beam with Python: getting started

Description

Apache Beam introduces a new paradigm to data processing. No longer do you need to suffer the duality of the lambda architecture, worrying about maintaining two variations of your code, one for batch and another for streaming.
This talk introduces the concepts behind Beam and how it decouples data processing from the underlying infrastructure.

We will then walk through a small example in Python.

Duration

  • 30 min
  • 45 min

Audience

Experience with data processing helps. This is a 101-level talk.

Outline

In the talk, we start off by providing an overview of Apache Beam using the Python SDK and the problems it tries to address from an end user’s perspective. We cover the core programming constructs in the Beam model, such as PCollections, ParDo, GroupByKey, windowing, and triggers. We describe how these constructs make it possible for pipelines to be executed in a unified fashion in both batch and streaming. Then we use examples to demonstrate these capabilities. The examples showcase using Beam for stream processing and real-time data analysis, and how Beam can be used for feature engineering in some machine learning applications using TensorFlow. Finally, we end with Beam's vision of creating runner- and execution-independent graphs using the Beam FnApi [2].
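As a minimal taste of the Python SDK constructs named above (a sketch, not the talk's exact demo):

import apache_beam as beam

# A tiny batch word count; swapping the bounded source for an unbounded one
# (e.g. Pub/Sub) is what lets the same code run as a streaming pipeline.
with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create(["to be or not to be"])
     | "Split" >> beam.FlatMap(str.split)
     | "Count" >> beam.combiners.Count.PerElement()
     | "Print" >> beam.Map(print))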

Link to Presentation.

Additional notes

About me: I have built and scaled technology products at multiple startups. Currently building a marketing analytics solution.


  • Don't record this talk.


Transfer Learning in Computer Vision

Title

Transfer Learning in Computer Vision

Description

This talk introduces members of the data science community to applying transfer learning for common computer vision tasks. As an example, I will demonstrate using transfer learning for multi-class classification.

Duration

  • 30 min

Audience

Basic Python code understanding is the only requirement, and members who are working on images would certainly be the most appropriate audience. Besides, I am also looking at converting non-CV tasks (see the outline below).

Outline

The talk is a demonstration of how easy it is to work on computer vision tasks using the power of transfer learning.
The agenda covers:

  1. Intro to transfer learning
  2. Different Computer Vision (CV) tasks
  3. Demo of multi-class classification
  4. Some of the astonishing results that have been achieved on CV tasks
  5. Converting non-CV tasks to achieve good results


Additional notes

I work as a Principal Data Scientist at Idexcel, where we are integrating machine learning into Idexcel's own fintech product suite and helping some of our clients leverage analytics.

  • Don't record this talk.


Defensive Programming for Deep Learning with Pytorch

Title

Defensive Programming for Deep Learning with PyTorch

Description

More often than not, quality production code is ignored while building deep learning or machine learning models. The proper level of abstraction and well-built core APIs will help the entire project shine. To make this happen, Python has a tool called “typing”. It introduces explicit data types and data structures, which were previously left to Python and were hard to debug.

Duration

  • 30 min

Audience

Advanced Audience

Outline

Building a High-Level Abstraction

Build modules and use Python's abc module to build high-level abstractions and APIs.

Defining Data Types and Data Structures

Using the python’s typing module. This module will help build data abstractions at a high level which will be later used by downstream modules. It will help in debugging, readability and in some cases fishing out logical bugs which otherwise will only be possible in runtime.

Error Handling

Error handling, even though it seems very obvious, is missed by most people. It’s always best to imagine a careless user who will input random or unknown values, proceed with that assumption, and build proper error-handling assert statements. Emphasis should be placed on precise messages that instantly reveal the nature of the error.
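A minimal sketch of such assertions in a PyTorch context (hypothetical function, illustrative only):

import torch

def forward(x: torch.Tensor) -> torch.Tensor:
    # Fail fast with a precise message instead of a cryptic downstream error.
    assert x.dim() == 4, f"expected a 4D (N, C, H, W) tensor, got {x.dim()}D"
    assert not torch.isnan(x).any(), "input contains NaNs"
    return x.mean(dim=(2, 3))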

Unit Testing

Unit testing is the most important aspect of production code; an effective unit test written for a module will not only help identify logic bugs but also help build a better model. Once this step is done, the three steps above have to be repeated to make everything work seamlessly.

Additional notes

I used to teach data science and machine learning on an online learning platform called Digital Vidya. Beyond that, I do not have much formal experience with public speaking, but I believe I can manage it.

Currently, I am a Sr. Machine Learning Engineer working on building scalable solutions for various clients.

Please refer to my GitHub profile for more information on me.

Thanks


  • Don't record this talk.


Understanding 3D CNN

Title

Understanding 3D CNN

Description

What happens when we extend 2D CNNs to the temporal domain? This talk is about understanding the 3D CNN architecture and how the feature maps change as we traverse each layer. I will be using Keras and pictorial slides for the explanation.
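A minimal sketch of the kind of model the talk walks through (illustrative shapes; assumes TensorFlow's Keras):

from tensorflow.keras import layers, models

# 16 frames of 64x64 RGB video; model.summary() shows the feature-map
# shape shrinking layer by layer, which is the focus of the talk.
model = models.Sequential([
    layers.Input(shape=(16, 64, 64, 3)),
    layers.Conv3D(8, kernel_size=(3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Conv3D(16, kernel_size=(3, 3, 3), activation="relu"),
])
model.summary()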

Duration

  • 45 min

Audience

Understanding of neural networks is the basic prerequisite. It will be a big advantage if the audience already knows the basic architecture of a 2D CNN. This is for an intermediate audience.

Outline

I will update the detailed outline in a few days.

Additional notes

Anuj Shah
Linkedin - https://www.linkedin.com/in/anuj-shah-759149117/
Github - https://github.com/anujshah1003
Youtube - https://www.youtube.com/channel/UCjAOM-s-f2YfeLNGGZuY-Bg
Medium -


  • Don't record this talk.


Machine Learning Bias

Title

Machine Learning Bias

Description

ML bias has become a global discussion, yet awareness in the developer and data science community around us doesn't seem significant. This is an attempt to spread the word.

Duration

  • 30 min
  • 45 min

Audience

ML / Data Science Practitioners

Outline

A model wouldn’t lie unless the machine learning engineer wants it to lie. Humans are filled with unconscious biases, and when these are fed to a machine to learn from in the form of data, the resulting AI model won't be fair or free of biases. This talk tries to introduce you to the world of machine learning bias.

Additional notes

https://speakerdeck.com/amrrs/machine-learning-bias


  • Don't record this talk.


Learn Pipenv for great good!

Learn Pipenv for great good

Description

Pipenv is a development workflow tool for humans. Pipenv isolates Python dependencies in a dedicated environment. It is a higher-level wrapper over virtualenv and virtualenvwrapper. The advantage of Pipenv compared to other tools is its simplicity and elegance. Learning Pipenv is an investment of a few minutes that pays back a handsome amount of time saved managing dependencies.

  • No pip and virtualenv anymore!
  • No requirements.txt or requirements_test.txt anymore!
  • Get your dependencies as a graph!
  • Written by Kenneth Reitz (author of the requests module)!

Duration

  • 5 minutes

Audience

Anyone who is developing applications using Python

Outline

Installing Pipenv

pip install pipenv

Example Flask app for demonstration (app.py)

#! /usr/bin/env python
# -*- coding: utf-8 -*-

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello world!"

Installing a new package

pipenv install flask

Installing developer dependencies

pipenv install --dev pytest

Pipfile

This file maintains meta information about the project: the names of its packages, the index URL, dependencies, and the version of Python it depends on.

Example

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
pytest = "*"

[packages]
flask = "*"

[requires]
python_version = "3.7"

You should commit this file in your version control system of choice.

Pipfile.lock

I am not sharing an example here because it would contain too many entries, one per package and its metadata. Pipenv tracks packages by their hash values; every time a new version is available, it updates the hashes automatically and regenerates the lock file. You should commit this file to your preferred version control system.

Using an installed package

pipenv run flask run


Spawn a virtual environment

pipenv shell

Graph of dependency

pipenv graph

Checking safety of installed packages

pipenv check

This command checks for known security vulnerabilities and asserts that PEP 508 requirements are met by the current environment.

Command to reproduce the build anywhere

pipenv install

Run in a fresh checkout, this installs every dependency pinned in the Pipfile and Pipfile.lock.

Approximate deduplication at scale: LSH to the rescue

Approximate deduplication at scale: LSH to the rescue

Description

Recent advancements in deep learning have opened a Pandora's box of applications utilising NLP techniques to solve business problems. But one of the important tasks starts at the pre-processing stage, which involves deduplicating similar documents. Though one can use various measures like Jaccard distance, cosine distance, Jaro-Winkler, or Levenshtein distance (depending on the document size and the nature of the data) to find the similarity among documents, scaling them to datasets of millions is not time-optimal, as the number of operations grows as O(N^2). In this talk I will cover an LSH and MinHash based deduplication approach which, at a small compromise in accuracy, reduces the problem to O(N) complexity; in actual numbers, this cut our pre-processing time from around 48 hours to around 15 minutes for a problem involving deduplicating millions of company names.
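A minimal sketch of the approach using the datasketch library (an assumed implementation; the talk's production code may differ):

from datasketch import MinHash, MinHashLSH

def minhash(name, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in name.lower().split():
        m.update(token.encode("utf8"))
    return m

# Candidate lookup cost stays roughly constant per item, giving ~O(N) overall.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("a", minhash("Acme Corporation Pvt Ltd"))
lsh.insert("b", minhash("Zen Industries Limited"))
print(lsh.query(minhash("Acme Corporation Private Ltd")))  # likely ['a']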

Duration

  • 30 min
  • 45 min

Audience

This talk is intended for people who are interested in ML engineering, especially in the NLP domain; basic familiarity with Python, probability, and algorithms should be sufficient.

Outline

  • Detailed description of the problem statement
    • Applications of deduplication in practical cases
  • Walkthrough of the results and time complexity of the LSH-based solution
    • Using MinHash-based LSH in Python
    • Performance improvement compared to the naive approach
    • Comparative analysis of results obtained from probabilistic and deterministic approaches
  • MinHash-based document similarity
    • Abstraction of similarity of documents
    • Jaccard similarity and its approximation by MinHash
  • Theoretical understanding of how LSH works
    • Selecting the signature size and bucket size

About Myself

I have around 5 years of experience in the machine learning domain, with exposure to multiple industries like fintech, insurtech, and eCommerce. I like to work on developing machine learning products; one of the products we developed at my last company is currently used on millions of financial transactions daily.
LinkedIn: https://www.linkedin.com/in/chirag-yadav-85227340/


  • Don't record this talk.


ML Models and Dataset versioning

Title

ML Models and Dataset versioning

Description

In this talk we will discuss the current best practices for organizing ML projects and why traditional open-source tools like Git fall short. I will focus on one of these best practices: ML model and dataset versioning.

Duration

  • 30 min
  • 45 min

Audience

Intermediate

Outline

In this talk we will discuss the current best practices for organizing ML projects,
and why traditional open-source tools like Git and Git-LFS won't help us here.

Currently the life-cycle of any Machine learning model goes through following process:

  • An ML practitioner tries out a new image classification algorithm with an input dataset
  • She tweaks the algorithm, tries other ideas, and fixes bugs, all on her local system
  • Some of her training might require long runs, and the code may change while the weights remain the same
  • She keeps the model weights and evaluation scores for all her runs, and picks which weights to release as the final model once she’s out of time to run more experiments.
  • She publishes her results, with code and the trained weights.

Git can’t handle large amounts of data, gigabytes in size, while Git-LFS has the built-in limitation of supporting at most 2 GB per file (a GitHub limitation), and even more problems exist.

Data Version Control, or DVC (dvc.org), is an open-source, command-line tool written in Python. We will show how to version datasets with dozens of gigabytes of data, how to version ML models, how to use your favourite cloud storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to embrace the best engineering practices in your ML projects. I will also discuss tools in the market for both experiment tracking and dataset versioning, and the best features of these products (PS: no comparison among one another).
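A minimal sketch of the DVC workflow (bucket and paths are hypothetical):

pip install dvc
dvc init                             # inside an existing Git repo
dvc add data/images                  # track a large dataset; writes data/images.dvc
git add data/images.dvc .gitignore
git commit -m "Track dataset with DVC"
dvc remote add -d storage s3://my-bucket/dvc-cache
dvc push                             # upload the data to the remote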

Talk Outline

  • Startup Adventures
  • Challenges
  • Model and dataset versioning
  • How I discovered DVC
  • Use case: versioning a Cats vs Dogs deep learning problem (8 min)
  • Conclusion

Slides

Additional notes

Kurian Benoy is an open-source contributor at CloudCV and DVC. He is the lead organiser of School of AI, Kochi, and an AI enthusiast working on deep learning and computer vision. Kurian is a FOSSASIA Open TechNights winner and gave a talk at the FOSSASIA OpenTechSummit about the keralarescue.in team.

I am an active Kaggler and was the first person to introduce Data Version Control on Kaggle, and am among the top 10 contributors to DVC so far.


  • Don't record this talk.


Streaming Data Pipelines with Apache Beam

Title

Streaming Data Pipelines with Apache Beam

Description

Concepts involved in writing Streaming Data Pipelines with example using Apache Beam.

Duration

  • 30 min
  • 45 min

Audience

Intermediate

Outline

  • Introduction
  • Apache Beam Model
    • Unbounded and Bounded Data
    • Event time - Processing time Skew
    • What/When/Where/How
    • Windowing (see the sketch after this outline)
    • Watermarks
    • Triggers
    • Accumulation Mode
  • Data Processing in Apache Beam
    • Core PTransforms
    • Beam execution model
  • Applications / Use-cases
  • Contributing
  • More Resources
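A minimal sketch of the windowing model in the Python SDK (illustrative, not the talk's demo):

import apache_beam as beam
from apache_beam.transforms import window

events = [("click", 0), ("click", 10), ("click", 70)]  # (key, event time in seconds)

with beam.Pipeline() as p:
    (p
     | beam.Create(events)
     | beam.Map(lambda kv: window.TimestampedValue((kv[0], 1), kv[1]))
     | beam.WindowInto(window.FixedWindows(60))  # fixed 60 s event-time windows
     | beam.CombinePerKey(sum)                   # -> ('click', 2) and ('click', 1)
     | beam.Map(print))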

Additional notes

GSoC '19 student with Apache Beam. Intern @clarisights. Previously @atlanhq.


  • Don't record this talk.


OpenStreetMap Data Processing with Python

Title

OpenStreetMap Data Processing with Python

Description

OpenStreetMap (OSM), with its community of more than 5 million contributors, is the Wikipedia of maps. If you find your favorite store missing from the map, you can directly add it to OSM. It’s for everyone, and anyone interested can create their own maps using the data. If needed, the whole world's data can be downloaded for offline use. It is a highly detailed dataset that can be used for data science and exploratory data analysis. This presentation covers an overview of the whole data pipeline with OSM and Python.
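As a taste of that pipeline, a minimal sketch with GeoPandas and Folium (file name and coordinates are hypothetical):

import geopandas as gpd
import folium

# Load OSM features previously exported as GeoJSON.
gdf = gpd.read_file("bangalore_cafes.geojson").to_crs(epsg=4326)

m = folium.Map(location=[12.97, 77.59], zoom_start=12)  # centre on Bangalore
folium.GeoJson(gdf).add_to(m)
m.save("cafes.html")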

Duration

  • 45 min

Audience

Beginner friendly

Outline

  • Introduce OpenStreetMap (OSM) and its data structure
  • Various methods to retrieve OSM data
  • Data Transformation with GeoPandas
  • Interactive visualization with Folium/mapboxgl

Additional notes

https://twitter.com/ramya_ragupathy

A Credential Management Tool using Google Cloud KMS and Datastore

Title

A Credential Management Tool using Google Cloud KMS and Datastore

Description

Software systems need access to credentials like database passwords or API keys for third-party services. A credential management tool helps you encrypt, store, and distribute these secrets without keeping them in a config file or an environment variable.

Duration

  • 30 min
  • 45 min

Audience

This talk is for beginner or intermediate audience.

Outline

  • Problem statement
  • Gcredstash - A credential management tool
  • Brief introduction to Google Cloud KMS and Cloud Datastore
  • How will this tool help?
  • Examples

Additional notes

Linkedin - https://www.linkedin.com/in/rajeshphegde/
Gcredstash - https://github.com/RajeshHegde/gcredstash


  • Don't record this talk.


Automating your pipelines with Self Describing Data

Title

Automating your pipelines with Self Describing Data

Description

Every data stream has to comply with a schema and go through basic type checks and other validation criteria. If all of your data has to go through this, why do we have to write source-specific pipelines? I present an approach where the data describes itself, and we build generic pipelines that span schemas.
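A minimal sketch of the idea using JSON Schema as a stand-in (the talk's actual schema format and repository may differ):

import jsonschema

# Each record carries (a pointer to) its own schema, so one generic pipeline
# can validate and route any source without source-specific code.
schema = {
    "type": "object",
    "properties": {"user_id": {"type": "string"}, "amount": {"type": "number"}},
    "required": ["user_id", "amount"],
}
record = {"user_id": "u42", "amount": 99.5}
jsonschema.validate(instance=record, schema=schema)  # raises on invalid data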

Duration

  • 30 min

Audience

This talk has no prerequisites; experience with data pipelines will be useful.

Outline

  • The Problem
  • Live Demo
  • Self Describing Data
  • Self Describing Schema
  • Schema Repository
  • In practice

Additional notes

About me: I have built and scaled technology products at multiple startups. Currently building a marketing analytics solution.


  • Don't record this talk.

Spatial thinking with Python

Title

Spatial thinking and location intelligence

Description

Data can take many forms: it could be text, an image, or maybe a bunch of coordinates. Though computer vision and natural language processing have hit it off, spatial data science doesn’t get the attention it deserves. Spatial data has both social and industrial impact. It is useful in agriculture and for observing weather patterns to predict natural disasters. It is also very important for industries that deal with logistics and supply chain management. So in this talk I would like to focus on spatial data and how effective use of location intelligence can add immense value.

Duration

  • 30 min
  • 45 min

Audience

This is a beginner-friendly talk. Basic Python knowledge is more than enough.

Outline

  • What spatial thinking is and the state of location intelligence
  • The need for spatial analytics
  • Spatial support in databases and packages, and why OSGeo is awesome
  • Problems that can be solved and processes that can be optimized with spatial data
  • Talking through some cool spatial use cases from Uber, Airbnb, Walmart, etc.

  • Don't record this talk.

Interpretability: Cracking open the Black Box

Title

Interpretability: Cracking open the Black Box

Description

A review of interpretable models, how to interpret them, and common pitfalls. We then move on to techniques for explaining black-box models and how to correctly interpret their results.

Duration

  • 30 min
  • 45 min

Audience

The talk is for intermediate as well as experienced ML practitioners.
Prerequisites:
1. Basic knowledge of machine learning algorithms
2. Python (a soft requirement, as the examples are in Python, but the concepts are universal)

Outline

As the machine learning field has matured and more and more models have entered use in the wild, the need for explainability has become more and more important. The talk would be structured as below (a small permutation-importance sketch follows the outline):

  • What is Interpretability?
  • Interpretable Models - How to interpret them?
  • Blackbox model explanation techniques
    • Mean Decrease in Impurity
    • Permutation Importance
    • Partial Dependence Plots
    • LIME
    • Shapley Values and SHAP
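For a taste of one technique from the list, a minimal permutation-importance sketch with scikit-learn (illustrative, not the talk's example):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle one feature at a time and measure the drop in score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.argsort()[::-1][:5])  # top-5 feature indices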

Additional notes

https://www.linkedin.com/in/manujosephv/


  • Don't record this talk.


Learn Apache Spark Using PySpark

Learn Apache Spark - PySpark

Description

Apache Spark is a distributed, general-purpose cluster-computing framework. The talk teaches the basics of Spark. A hands-on, code-snippet-driven style of teaching will allow Python developers to quickly ramp up on the basics and workings of Spark.
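A minimal sketch of that snippet-driven style (file and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pydata-demo").getOrCreate()

# Read a large CSV as a DataFrame, then filter and aggregate.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.filter(df["value"] > 10).groupBy("category").count().show()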

Duration

  • 45 min

Knowledge of

  • Python 3.5+

Outline

  • Installation
  • Hello World
  • Read a large CSV and run basic computation using
    • RDD
    • dataframes
  • Data Transformations
    • Filter
    • Flatmap
    • Sample
    • Union
    • Intersection
    • Distinct
    • Cartesian
    • Pipe
    • Coalesce
    • Repartition
    • .agg()
    • .sql()
  • Persisting Data
  • Datasets & SQL
  • Streaming Data into Spark

Additional notes

Linkedin Profile - https://www.linkedin.com/in/ankurgupta1982/
