
data_science

Seeing is believing. A witty saying proves nothing.

"When solving a problem of interest, do not solve a more general problem as an intermediate step." (Vladimir Vapnik)

Winning solutions

Case studies:

DS Coursera

Heroes of DL

Top conferences:

Deep Learning

Events: I will put a word cloud here.

EMNLP 2017: http://noisy-text.github.io/2017/

NLPStan reading

LXMLS16:

ACL2017

VietAI

My SOTA

  • My ATIS: sequence tagging, bi-LSTM, number of params: 324,335 (a minimal sketch follows below)
  • Quora question duplicate detection: accuracy 85% on Wang's test set
 - best F1 scores: 94.92/94.64
 - train scores: 97.54/96.17
 - val scores: 93.66/92.94
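
A minimal Keras sketch of a bi-LSTM tagger in this spirit (vocabulary size, label count and dimensions are placeholders of my own, not the 324,335-parameter config above):

```python
# Hypothetical bi-LSTM slot tagger, ATIS-style (all sizes are placeholders).
from tensorflow.keras import Input, layers, models

VOCAB_SIZE = 572   # token vocabulary (placeholder)
NUM_LABELS = 127   # slot labels (placeholder)
MAX_LEN = 46       # padded sentence length (placeholder)

model = models.Sequential([
    Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),  # index 0 = padding
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(NUM_LABELS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()  # prints the parameter count for comparison
```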

Yandex

ICLR 2017 Review

LearningNewThingIn2017

Conf events

NIPS 2016 slides

Theano-based DL applications

Learning to learn: optimization algorithms

Pin:

Data type: NOQ (a small mapping sketch follows below)

  • Nominal (N): cat, dog --> x, o | vis: shape, color
  • Ordinal (O): Jan - Feb - Mar - Apr | vis: area, density
  • Quantitative (Q): numerical, 0.42, 0.58 | vis: length, position
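
A tiny illustrative mapping of the NOQ taxonomy to the visual channels above (the helper and its heuristic are my own, purely for illustration):

```python
# Map NOQ data types to the visual channels listed above (illustrative only).
VISUAL_CHANNELS = {
    "nominal": ["shape", "color"],           # categories: cat, dog, ...
    "ordinal": ["area", "density"],          # ordered: Jan < Feb < Mar < Apr
    "quantitative": ["length", "position"],  # numbers: 0.42, 0.58
}

def infer_noq(values):
    """Crude NOQ guess: all numbers -> quantitative, otherwise nominal."""
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative"
    return "nominal"  # ordinal needs domain knowledge (e.g. month order)

print(VISUAL_CHANNELS[infer_noq([0.42, 0.58])])  # ['length', 'position']
```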

People:

Fin data:

Projects:

Wikidata:

Cartoons & Quotes:

Books:

Done:

  1. EMNLP 2016, Austin, 2-4 Nov: http://www.emnlp2016.net/tutorials.html#practical

day 1:

  • Hugo Larochelle (Twitter): feed-forward NNs
  • Andrej Karpathy (OpenAI): ConvNets
  • Richard Socher (MetaMind): NLP = word2vec/GloVe + GRU + MemNet
  • TensorFlow tutorial: from 5:55:49
  • Ruslan Salakhutdinov: deep unsupervised learning, from 7:10:39
  • Andrew Ng: nuts and bolts of applied DL, from 9:09:46

day 2:

AI mistakes:

Keras:

NLP:

Apps:

German word embedding:

PyGotham:

Journalist LDA and ML:

Europython:

Scipy 2016:

Performance Evaluation (PE):

Hypothesis testing

Metrics:

Rock, Metal and NLP:

Financial:

Twitter:

Deep Learning Frameworks/Toolkits:

  • Tensorflow
  • Torch
  • Theano
  • Keras
  • Dynet
  • CNTK

ElasticSearch + Kibana:

Attention-based:

ResNet: Residual Networks

Sentiment

NER

ML Stacking

Tensorflow tutorials

Covariate shift

#PydataLondon2017

NLP course

Dataset

Tricks of DL

Pointer network

Attention

Log likelihood test


MLtrainings.ru

GCloud

Current conference

https://github.com/aymericdamien/TensorFlow-Examples

Timeline

07.12

06.12

05.12

04.12

02.12

online marketing applications

01.12

30.11

29.11

28.11

27.11

24.11

23.11

22.11

21.11

17.11

16.11

15.11

14.11

13.11

10.11

09.11

08.11

3.11

2.11

1.11

31.10

30.10

29.10

28.10

27.10

26.10

25.10

24.10

23.10

20.10

19.10

18.10

17.10

16.10

15.10

13.10

12.10

11.10

10.10

07.10

05.10

04.10

03.10

02.10

30.09

29.09

28.09

27.09

25.09

22.09

21.09

19.09

18.09

17.09

16.09

15.09

14.09

13.09

12.09

11.09

10.09

09.09

08.09

07.09

06.09

05.09

04.09

03.09

02.09

01.09

31.08

30.08

29.08

28.08

26.08

25.08

24.08

22.08

21.08

18.08

17.08

16.08

15.08

14.08

13.08

11.08

10.08

09.08

08.08

07.08

06.08

04.08

01.08

31.07

25.07

24.07

23.07

22.07

21.07

20.07

19.07

18.07

17.07

15.07

14.07

13.07

12.07

10.07

06.07

Maxout:

05.07

04.07

03.07

02.07

30.06

29.06

28.06

27.06

26.06

24.06

23.06

22.06

21.06

19.06

14.06

13.06

12.06

09.06

07.06

05.06

02.06

01.06

31.05

30.05

29.05

26.05

25.05

21.05

20.05

19.05

18.05

17.05

16.05

15.05

13.05

12.05

11.05

10.05

09.05

08.05

05.05

04.05

03.05

02.05

30.04

27.04

26.04

25.04

24.04

21.04

20.04

19.04

18.04

17.04

16.04

15.04

14.04

13.04

12.04

10.04

08.04

07.04

06.04

05.04

04.04

03.04

01.04

31.03

30.03

29.03

28.03

27.03

26.03

25.03

23.03

21.03

20.03

I haven't gone back to check what they are suggesting in their original paper, but I can guarantee that recent code written by Christian applies relu before BN. It is still occasionally a topic of debate, though.
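
For reference, a minimal Keras sketch of the two orderings under debate (my own illustration with placeholder sizes, not Christian's code):

```python
# Batch norm before vs. after the nonlinearity (sizes are placeholders).
from tensorflow.keras import Input, layers, models

def conv_bn_relu(x):
    # Ordering in the original BN paper: conv -> BN -> ReLU
    x = layers.Conv2D(32, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def conv_relu_bn(x):
    # Ordering described above: conv -> ReLU -> BN
    x = layers.Conv2D(32, 3, padding="same")(x)
    x = layers.Activation("relu")(x)
    return layers.BatchNormalization()(x)

inputs = Input(shape=(32, 32, 3))
model = models.Model(inputs, conv_relu_bn(conv_bn_relu(inputs)))
model.summary()
```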

17.03

16.03

15.03

14.03

13.03

10.03

09.03

08.03

07.03

06.03

05.03

04.03

02.03

01.03

28.02

27.02

26.02

25.02

24.02

23.02

22.02

21.02

20.02

19.02

18.02

17.02

16.02

15.02

14.02

13.02

12.02

10.02

08.02

07.02

06.02

27.1

26.1

25.1

24.1

23.1

20.1

19.1

18.1

17.1

16.1

15.1

14.1

13.1

12.1

11.1

10.1

9.1

7.1

5.1

4.1

3.1

2.1.17

31.12

30.12

29.12

28.12

27.12

26.12

24.12

23.12

22.12

21.12

20.12

19.12

17.12

16.12

15.12

14.12

13.12

12.12

11.12

9.12

8.12

7.12

6.12

5.12

2.12

1.12

30.11

29.11

28.11

27.11

26.11

25.11

24.11

23.11

Multithread in Theano:

Debug

22.11

21.11

19.11

18.11

17.11

16.11

15.11

14.11

13.11

12.11

11.11

10.11

9.11

8.11

7.11

6.11

04.11

3.11

2.11

1.11

31.10

30.10

29.10

28.10

27.10

26.10

25.10

24.10

23.10

22.10

21.10

20.10

18.10

17.10

16.10

15.10

14.10

13.10

12.10

11.10

10.10

7.10

6.10

5.10

3.10

30.9

29.9

28.9

27.9

26.9

25.9

23.9

22.9

21.9

20.9

19.9

15.9

14.9

13.9

9.9

8.9

7.9

6.9

5.9

2.9

1.9

31.8

29.8

28.8

26.8

25.8

24.8

23.8

22.8

19.8

18.8

16.8

15.8

14.8

13.8

12.8

10.8

9.8

8.8

  • Twitter buys Magic Pony for $150M, Apple buys Turi for $200M: the era of ML
  • Spotify releases Release Radar: brand-new music from acoustic frames

7.8

5.8

4.8

3.8

2.8

1.8

28.7

27.7

26.7

25.7

24.7

22.7

21.7

20.7

19.7

18.7

15.7

14.7

data science summit:

daily

13.7

12.7

11.7

8.7

7.7

6.7

5.7

4.7

1.7

30.6

29.6

28.6

27.6

24.6

23.6

22.6

21.6

20.6

18.6

17.6

16.6

To read:

15.6

13.6

11.6

9.6

user classifiers:

Readings:

8.6

7.6

6.6

1.6

10 lessons learned from Xavier (Amatriain), recapped:

  • Implicit signals beat explicit ones (almost always): clickbait, rating psychology
  • Your model will learn what you teach it to learn: features, objective function, F-score
  • Supervised + unsupervised = life
  • Everything is an ensemble
  • Model sequences: the output of one model is the input of another (see the sketch after this list)
  • Feature engineering: reusable, transformable, interpretable, reliable
  • ML infra: the experimentation phase needs easiness, flexibility and reusability; the production phase needs performance and scalability
  • Debug feature values
  • You don't need to distribute your ML algorithm
  • DS + ML engineering = perfection
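
A minimal scikit-learn sketch of the "model sequences" point (out-of-fold predictions of one model feeding the next; the dataset and models are placeholders of mine):

```python
# Chain two models: out-of-fold predictions of model A become a feature for model B.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Level 0: logistic regression, predicted out-of-fold to avoid leakage.
level0 = LogisticRegression(max_iter=1000)
oof = cross_val_predict(level0, X, y, cv=5, method="predict_proba")[:, 1]

# Level 1: boosted trees on the original features plus the level-0 output.
X_stacked = np.column_stack([X, oof])
level1 = GradientBoostingClassifier(random_state=0).fit(X_stacked, y)
print(level1.score(X_stacked, y))
```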

31.5

30.5

29.5

26.5

25.5

In summary, here is what I recommend if you plan to use word2vec: choose the right training parameters and training data for word2vec, pick a dominant word set, use an averaged-vector predictor for queries, sentences and paragraphs (code here; a minimal sketch also follows below), and then apply deep learning to the resulting vectors.
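
A minimal gensim sketch of that averaged-vector predictor (toy corpus and naming are mine, not the linked code; Mikolov's summing trick for short phrases, quoted below, is the same idea without the division):

```python
# Average the word vectors of in-vocabulary tokens to embed a query/sentence.
# gensim >= 4 assumed (older versions use size= instead of vector_size=).
import numpy as np
from gensim.models import Word2Vec

sentences = [["auto", "insurance"], ["car", "insurance"], ["car", "loan"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, seed=0)

def avg_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)  # no known words: zero vector
    return np.mean(vecs, axis=0)

print(avg_vector(["car", "insurance"], model)[:5])
```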

===

For SGNS, here is what I believe really happens during training: if two words appear together, training will try to increase their cosine similarity; if two words never appear together, training will reduce it. So if there are a lot of user queries such as "auto insurance" and "car insurance", then the "auto" vector will be similar to the "insurance" vector (cosine similarity ~0.3), and the "car" vector will also be similar to the "insurance" vector. Since "insurance", "loan" and "repair" rarely appear together in the same context, their vectors have small mutual cosine similarity (~0.1). We can treat them as orthogonal to each other and think of them as different dimensions. After training is complete, the "auto" vector will be very similar to the "car" vector (cosine similarity ~0.6) because both of them are similar along the "insurance", "loan" and "repair" dimensions. This intuition is useful if you want to better design your training data to meet the goal of your text-learning task.

===

For short sentences/phrases, Tomas Mikolov recommends simply adding up the individual word vectors to get a "sentence vector" (see his recent NIPS slides).

For longer documents, it is an open research question how to derive their representation, so no wonder you're having trouble :)

I like the way word2vec runs (no need for serious hardware to process huge collections of text). It's more usable than LSA or any system that requires a term-document matrix.

Actually LSA requires less structured data (only a bag-of-words matrix, whereas word2vec requires exact word sequences), so there's no fundamental difference in input complexity.

24.5

t-SNE:

Conferences:

20.5

19.5

18.5

sentifi:

http://davidrosenberg.github.io/ml2016/#home

pydatalondon 2016:

spotify:

LDA async, auto alpha: http://rare-technologies.com/python-lda-in-gensim-christmas-edition/ (a minimal sketch follows below)
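
A minimal gensim sketch of the auto-learned asymmetric alpha that post discusses (toy corpus and sizes are placeholders of mine):

```python
# Fit LDA with gensim, letting it learn an asymmetric document-topic prior.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["car", "insurance"], ["car", "loan"], ["rock", "metal", "band"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               alpha="auto", passes=10, random_state=0)  # alpha learned during training
print(lda.print_topics())
```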

mapk: https://github.com/benhamner/Metrics/tree/master/Python/ml_metrics
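
For reference, MAP@k as a short self-contained sketch in the spirit of that repo (function names are mine):

```python
# Average precision at k for one example, then the mean over a dataset.
def apk(actual, predicted, k=10):
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:  # count first occurrence only
            hits += 1
            score += hits / (i + 1.0)  # precision at this cut-off
    return score / min(len(actual), k) if actual else 0.0

def mapk(actual, predicted, k=10):
    return sum(apk(a, p, k) for a, p in zip(actual, predicted)) / len(actual)

print(mapk([[1, 2], [3]], [[1, 3, 2], [3]], k=3))  # (0.833 + 1.0) / 2 = 0.917
```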

ICLR 2016: https://tensortalk.com/?cat=conference-iclr-2016

l.m.thang

https://github.com/jxieeducation/DIY-Data-Science

http://drivendata.github.io/cookiecutter-data-science/

http://ofey.me/papers/sparse_ijcai16.pdf

Spotify:

skflow:

A Few Useful Things to Know about Machine Learning:

tdb: https://github.com/ericjang/tdb

dask for task parallel, delayed: http://dask.pydata.org/en/latest/examples-tutorials.html

skflow:

http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/

https://medium.com/a-year-of-artificial-intelligence/lenny-2-autoencoders-and-word-embeddings-oh-my-576403b0113a#.ecj0iv4n8

https://github.com/andrewt3000/DL4NLP/blob/master/README.md

tf:

tf chatbot: https://github.com/nicolas-ivanov/tf_seq2seq_chatbot

Bayesian Opt: https://github.com/fmfn/BayesianOptimization/blob/master/examples/visualization.ipynb

Click-o-Tron RNN: http://clickotron.com

Auto-generating clickbait with recurrent neural networks: https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/

http://blog.computationalcomplexity.org/2016/04/the-master-algorithm.html

http://jyotiska.github.io/blog/posts/python_libraries.html

LSTM: http://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/

CS224d:

SOTA of sentiment analysis, Mikolov and me :)

Thang M. L: http://web.stanford.edu/class/cs224n/handouts/cs224n-lecture16-nmt.pdf

CS224d reports:

QA in keras:

Chinese LSTM + word2vec:

DL with SA: https://cs224d.stanford.edu/reports/HongJames.pdf

MAB:

cnn nudity detection: http://blog.clarifai.com/what-convolutional-neural-networks-see-at-when-they-see-nudity/#.VxbdB0xcSko

sigopt: https://github.com/sigopt/sigopt_sklearn

first contact with TF: http://www.jorditorres.org/first-contact-with-tensorflow/

Evaluating ML with A/B tests or multi-armed bandits: http://blog.dato.com/how-to-evaluate-machine-learning-models-the-pitfalls-of-ab-testing

how to make mistakes in Python: www.oreilly.com/programming/free/files/how-to-make-mistakes-in-python.pdf

keras tut: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/keras_tutorial.pdf

Ogrisel word embedding: https://speakerd.s3.amazonaws.com/presentations/31f18ad0522c0132b9b662e7bb117668/Word_Embeddings.pdf

Tensorflow whitepaper: http://download.tensorflow.org/paper/whitepaper2015.pdf

Arimo distributed tensorflow: https://arimo.com/machine-learning/deep-learning/2016/arimo-distributed-tensorflow-on-spark/

Best ever word2vec in code: http://nbviewer.jupyter.org/github/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb

TF japanese: http://www.slideshare.net/yutakashino/tensorflow-white-paper

TF tut101: https://github.com/aymericdamien/TensorFlow-Examples

Jeff Dean: http://learningsys.org/slides/NIPS-Learning-Systems-Workshop-TensorFlow-Jeff-Dean.pdf

DL: http://www.thoughtly.co/blog/deep-learning-lesson-1/

Distributed TF: https://www.tensorflow.org/versions/r0.8/how_tos/distributed/index.html

playground: http://playground.tensorflow.org/

Hoang Duong blog: http://hduongtrong.github.io/

Word2vec short explanation: http://hduongtrong.github.io/2015/11/20/word2vec/

ForestSpy: https://github.com/jvns/forestspy/blob/master/inspecting%20random%20forest%20models.ipynb

Netflix:

Lessons learned

WMD:

Hanoi trip:

VinhKhuc:

RS:

Data science bootcamp: https://cambridgecoding.com/datascience-bootcamp#outline

CambridgeCoding NLP:

Annoy:

RPForest: https://github.com/lyst/rpforest

LightFM: https://github.com/lyst/lightfm

Secure because of math: https://www.youtube.com/watch?v=TYVCVzEJhhQ

Talking machines: http://www.thetalkingmachines.com/

Dive into DS: https://github.com/rasbt/dive-into-machine-learning

DS process: https://www.oreilly.com/ideas/building-a-high-throughput-data-science-machine

Friendship paradox: https://vuhavan.wordpress.com/2016/03/25/ban-ban-ban-nhieu-hon-ban-ban/

AB test:

EMNLP 2015:

To read:

Idols:

IPython/Jupyter:

LSTM:

RNN:

Unicode:

EVENTS:

  • April 8-10 2016: PyData Madrid
  • April 15-17 2016: PyData Florence
  • May 6-8 2016: PyData London hosted by Bloomberg
  • May 20-21 2016: PyData Berlin
  • September 14-16 2016: PyData Carolinas hosted by IBM
  • October 7-9 2016: PyData DC hosted by Capital One
  • November 28-30 2016: PyData Cologne


QUOTES:

  • My name is Sherlock Holmes. It is my business to know what other people don't know.
  • Take the first step in faith. You don't have to see the whole staircase, just take the first step. [M. L. King, Jr.]
  • "Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." [Arthur Conan Doyle]

STATS:

BOOKS:

CLUSTER:

EMBEDDING:

Linux:

BENCHMARK:
