
Sujit Jena's Projects

100daysofmlcode

This repository contains all the required resources for the 100+ Days Of ML Code Telegram group, which I ran from 1-1-2019 to 31-12-2019!

a-path-finding-visualization

A Python visualization of the A* pathfinding algorithm. It lets you pick your start and end locations and watch the process of finding the shortest path.
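
The repository animates the search; for reference, here is a minimal, non-visual sketch of A* itself on a 2D grid with a Manhattan-distance heuristic (the grid encoding and function name are illustrative, not taken from the repo):

```python
# A* on a grid of 0 (free) / 1 (wall) cells; returns the shortest path
# from start to goal as a list of (row, col) tuples, or None if unreachable.
import heapq

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_heap = [(h(start), 0, start)]            # entries are (f = g + h, g, node)
    came_from, g_score = {}, {start: 0}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:                          # rebuild the path by walking parents
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                if g + 1 < g_score.get(nxt, float("inf")):  # found a cheaper route
                    g_score[nxt] = g + 1
                    came_from[nxt] = node
                    heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt))
    return None  # no path exists

print(astar([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 0)))
```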

artificial-intelligence-deep-learning-machine-learning-tutorials

A comprehensive list of Deep Learning / Artificial Intelligence and Machine Learning tutorials, rapidly expanding into areas of AI / Deep Learning / Machine Vision / NLP and industry-specific areas such as Automotive, Retail, Pharma, Medicine, and Healthcare, maintained by Tarry Singh until at least 2020, when he finishes his Ph.D. (which might end up being about interstellar cosmic networks! Who knows! 😀)

clsutering-__-kmeans-allagglomerative-hierarchical-clustering

Step 1: Choose any vectorizer (data matrix) that you have worked with in any of the assignments and that gave the best AUC value.
Step 2: Choose any feature selection/reduction algorithm (e.g. SelectKBest, pretrained word vectors, model-based feature selection) and reduce the number of features to 5k.
Step 3: Apply all three of K-Means, Agglomerative clustering, and DBSCAN.
K-Means clustering:
● Find the best 'k' using the elbow-knee method (plot k vs inertia_); a sketch of this search follows after these steps.
Agglomerative clustering:
● Apply the agglomerative algorithm and try different numbers of clusters, e.g. 2, 5, etc.
● As this is computationally very expensive, take only 5k datapoints for hierarchical clustering, since it takes a considerable amount of time to run.
DBSCAN clustering:
● Find the best 'eps' using the elbow-knee method.
● Take only 5k datapoints.
Step 4: Summarize each cluster by manually observing a few points from each cluster.
Step 5: Plot a word cloud of the essay text for each cluster, for each of the algorithms mentioned in step 3.
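
A minimal sketch of the elbow-knee search for K-Means in step 3. Synthetic blobs stand in for the reduced ~5k-feature matrix from step 2 (an assumption of this sketch):

```python
# Elbow-knee search for the best k in K-Means: fit over a range of k and
# plot k vs inertia_; the "knee" of the curve suggests the best k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=5, random_state=42)  # stand-in data

inertias = []
k_range = range(2, 16)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(k_range), inertias, marker="o")
plt.xlabel("k"); plt.ylabel("inertia_"); plt.title("Elbow-knee plot for K-Means")
plt.show()
```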

desision-trees-with-various-hyperparametrs

Apply both Random Forest and GBDT on these feature sets:
Set 1: categorical (instead of one-hot encoding, try response coding: use probability values), numerical features + project_title (BOW) + preprocessed_eassay (BOW)
Set 2: categorical (response coding as above), numerical features + project_title (TFIDF) + preprocessed_eassay (TFIDF)
Set 3: categorical (response coding as above), numerical features + project_title (AVG W2V) + preprocessed_eassay (AVG W2V). For this set take only 20K datapoints.
Set 4: categorical (response coding as above), numerical features + project_title (TFIDF W2V) + preprocessed_eassay (TFIDF W2V). For this set take only 20K datapoints.
Hyperparameter tuning (consider any two hyperparameters, preferably n_estimators and max_depth):
● Consider the following ranges: n_estimators = [10, 50, 100, 150, 200, 300, 500, 1000], max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10].
● Find the best hyperparameters, i.e. those that give the maximum AUC value, using simple cross-validation data; you can write your own for loops for this task (see the sketch below).
Representation of results:
● Plot the model's performance on both train data and cross-validation data for each hyperparameter combination, either as a 3D plot with n_estimators on the X-axis, max_depth on the Y-axis, and AUC score on the Z-axis (the notebook 3d_scatter_plot.ipynb in the same drive explains how to make this plot), or as a seaborn heat map with rows as n_estimators, columns as max_depth, and cell values as AUC score. You can choose either plotting technique.
● Once you have found the best hyperparameters, train your model with them, find the AUC on test data, and plot the ROC curve on both train and test.
● Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints.
Conclusion:
● Summarize the results at the end of the notebook in table format. To print a table, please refer to the prettytable library link.
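
A minimal sketch of the hand-rolled grid search over n_estimators × max_depth with a simple train/CV split and the seaborn heat map. Synthetic data stands in for the prepared DonorsChoose feature sets (an assumption of this sketch):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for one of the prepared feature sets and a simple CV split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=42)

n_estimators_list = [10, 50, 100, 150, 200, 300, 500, 1000]
max_depth_list = [2, 3, 4, 5, 6, 7, 8, 9, 10]
cv_auc = np.zeros((len(n_estimators_list), len(max_depth_list)))

# Own for loops over the two hyperparameters, scoring CV AUC for each pair.
for i, n in enumerate(n_estimators_list):
    for j, d in enumerate(max_depth_list):
        clf = RandomForestClassifier(n_estimators=n, max_depth=d, n_jobs=-1)
        clf.fit(X_train, y_train)
        cv_auc[i, j] = roc_auc_score(y_cv, clf.predict_proba(X_cv)[:, 1])

# Heat map: rows = n_estimators, columns = max_depth, cells = CV AUC.
sns.heatmap(cv_auc, annot=True, fmt=".3f",
            xticklabels=max_depth_list, yticklabels=n_estimators_list)
plt.xlabel("max_depth"); plt.ylabel("n_estimators"); plt.show()
```

The same two loops can also be run on the train split to fill a second matrix for the train-vs-CV comparison the assignment asks for.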

exploratory-data-analysis

DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

homemade-machine-learning

🤖 Python examples of popular machine learning algorithms, with interactive Jupyter demos and the math explained

inltk

Natural Language Toolkit for Indic Languages; aims to provide out-of-the-box support for various NLP tasks that an application developer might need.

k-nearest-neighbors-knn-algorithm-

In the journey of exploring the field of data science and predictive modeling, I explored the very interesting k-nearest neighbors (KNN) algorithm. I tried to leverage this classification algorithm, which falls under the supervised-learning side of predictive modeling, and used it to classify the approval of projects submitted by teachers in the United States for students.

The main business context of the project was to reduce the manual evaluation of projects by volunteers, as the evaluation process can take a long time, may be biased by some factors, and can introduce irreducible errors into the process. Some other important points are:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance

The goal of the project is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

The steps followed for data preparation and predictive modeling are as follows. Note: giving unstructured data (garbage, in common terms) to a machine-learning algorithm gives you random data (garbage) back. All the code is written in a clean and understandable manner, avoiding fancy methods wherever possible, and a reference for everything used is given above the code, so that it is easy for everyone to understand it and leverage the potential that AI has; I believe in growing together and helping others, which makes me a good team player. It also strengthens the ability to tell a story with data. For the implementation of all the code, I used the scikit-learn library.

1. Apply KNN (brute-force version) on these feature sets. I formed different sets of the data to check which vectorization of the text data works best:
   Set 1: categorical, numerical features + project title (BOW) + preprocessed essay (BOW)
   Set 2: categorical, numerical features + project title (TFIDF) + preprocessed essay (TFIDF)
   Set 3: categorical, numerical features + project title (AVG W2V) + preprocessed essay (AVG W2V)
   Set 4: categorical, numerical features + project title (TFIDF W2V) + preprocessed essay (TFIDF W2V)
2. Hyperparameter tuning to find the best K, and the metric used to evaluate the model:
   1. Find the best hyperparameter, i.e. the one that results in the maximum AUC value.
   2. Find the best hyperparameter using k-fold cross-validation or simple cross-validation data.
   3. Use GridSearchCV, RandomizedSearchCV, or write your own for loops to do this task.
3. Representation of results:
   1. Plot the performance of the model on both train data and cross-validation data for each hyperparameter.
   2. Once you find the best hyperparameter, train your model M with it; then find the AUC on test data and plot the ROC curve on both train and test using model M.
   3. Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints.
4. Select the top 2000 features from feature Set 2 using `SelectKBest` and then apply KNN on top of these features (this is the section where we select the best features from all the features we have); a sketch of this step follows below.
   1. Repeat steps 2 and 3 on the data matrix after feature selection.
5. Conclusion:
   1. Summarize the results at the end of the notebook in table format.
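
A minimal sketch of step 4: keep the top 2000 features with `SelectKBest`, then tune K for brute-force KNN via grid search. Random non-negative data stands in for the TFIDF-based feature Set 2 (an assumption of this sketch):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X_train = rng.random((1000, 5000))   # stand-in for 5000 TFIDF features
y_train = rng.integers(0, 2, 1000)   # stand-in for approved / not-approved labels

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=2000)),   # top 2000 features (chi2 needs non-negative input)
    ("knn", KNeighborsClassifier(algorithm="brute")),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 11, 21, 31, 51]},
                    scoring="roc_auc", cv=3, return_train_score=True)
grid.fit(X_train, y_train)
print("best K:", grid.best_params_["knn__n_neighbors"], "CV AUC:", grid.best_score_)
```

Putting the selector inside the `Pipeline` makes the feature selection refit on each CV fold, which avoids leaking cross-validation data into the selection step.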

logistic-regression-on-donorschoose-dataset

In the journey of exploring the field of data science and predictive modeling, I explored the very interesting logistic regression algorithm. I tried to leverage this classification algorithm, which falls under the supervised-learning side of predictive modeling, and used it to classify the approval of projects submitted by teachers in the United States for students.

The main business context of the project was to reduce the manual evaluation of projects by volunteers, as the evaluation process can take a long time, may be biased by some factors, and can introduce irreducible errors into the process. Some other important points are:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance

The goal of the project is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

The steps followed for data preparation and predictive modeling are as follows. Note: giving unstructured data (garbage, in common terms) to a machine-learning algorithm gives you random data (garbage) back. All the code is written in a clean and understandable manner, avoiding fancy methods wherever possible, and a reference for everything used is given above the code, so that it is easy for everyone to understand it and leverage the potential that AI has; I believe in growing together and helping others, which makes me a good team player. It also strengthens the ability to tell a story with data. For the implementation of all the code, I used the scikit-learn library.

1. Apply logistic regression (either SGDClassifier with log loss, or LogisticRegression) on these feature sets:
   Set 1: categorical, numerical features + project_title (BOW) + preprocessed_eassay (`BOW with bi-grams` with `min_df=10` and `max_features=5000`)
   Set 2: categorical, numerical features + project_title (TFIDF) + preprocessed_eassay (`TFIDF with bi-grams` with `min_df=10` and `max_features=5000`)
   Set 3: categorical, numerical features + project_title (AVG W2V) + preprocessed_eassay (AVG W2V)
   Set 4: categorical, numerical features + project_title (TFIDF W2V) + preprocessed_essay (TFIDF W2V)
2. Hyperparameter tuning (find the best hyperparameters for the algorithm you choose); a sketch follows below:
   1. Find the best hyperparameter, i.e. the one that gives the maximum AUC value.
   2. Find the best hyperparameter using k-fold cross-validation or simple cross-validation data.
   3. Use GridSearchCV, RandomizedSearchCV, or write your own for loops to do this hyperparameter tuning.
3. Representation of results:
   1. Plot the performance of the model on both train data and cross-validation data for each hyperparameter.
   2. Once you have found the best hyperparameter, train your model with it, find the AUC on test data, and plot the ROC curve on both train and test.
   3. Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints. Please visualize your confusion matrices using seaborn heatmaps.

Task 2: apply logistic regression on the feature set below (Set 5), finding the best hyperparameter as suggested in steps 2 and 3.
Set 5:
   school state: categorical data
   clean categories: categorical data
   clean subcategories: categorical data
   project_grade_category: categorical data
   teacher prefix: categorical data
   quantity: numerical data
   teacher_number_of_previously_posted_projects: numerical data
   price: numerical data
   sentiment scores of each of the essays: numerical data
   number of words in the title: numerical data
   number of words in the combined essays: numerical data

4. Conclusion: summarize the results at the end of the notebook in table format. To print a table, please refer to the prettytable library link.
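
A minimal sketch of step 2's tuning, taking the SGDClassifier-with-log-loss option. Synthetic data stands in for the prepared feature sets (an assumption of this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for one of the prepared DonorsChoose feature sets.
X_train, y_train = make_classification(n_samples=2000, n_features=20, random_state=42)

params = {"alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]}  # regularization strengths
sgd = SGDClassifier(loss="log_loss", penalty="l2", random_state=42)  # loss="log" in older scikit-learn
grid = GridSearchCV(sgd, params, scoring="roc_auc", cv=3, return_train_score=True)
grid.fit(X_train, y_train)
print("best alpha:", grid.best_params_["alpha"], "CV AUC:", grid.best_score_)
```

`return_train_score=True` exposes the per-alpha train scores in `grid.cv_results_`, which is what step 3's train-vs-CV plot needs.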

ml-from-scratch

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

models

Models and examples built with TensorFlow

naive-bayes-algorithm-for-classification-of-donorschoose

In the journey of exploring the field of data science and predictive modeling, I explored the very interesting naive Bayes algorithm. I tried to leverage this classification algorithm, which falls under the supervised-learning side of predictive modeling, and used it to classify the approval of projects submitted by teachers in the United States for students.

The main business context of the project was to reduce the manual evaluation of projects by volunteers, as the evaluation process can take a long time, may be biased by some factors, and can introduce irreducible errors into the process. Some other important points are:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance

The goal of the project is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

The steps followed for data preparation and predictive modeling are as follows. Note: giving unstructured data (garbage, in common terms) to a machine-learning algorithm gives you random data (garbage) back. All the code is written in a clean and understandable manner, avoiding fancy methods wherever possible, and a reference for everything used is given above the code, so that it is easy for everyone to understand it and leverage the potential that AI has; I believe in growing together and helping others, which makes me a good team player. It also strengthens the ability to tell a story with data. For the implementation of all the code, I used the scikit-learn library.

1. Apply Multinomial NB on these feature sets:
   Set 1: categorical, numerical features + preprocessed_eassay (BOW)
   Set 2: categorical, numerical features + preprocessed_eassay (TFIDF)
2. Hyperparameter tuning (find the best alpha, the smoothing parameter):
   1. Find the best hyperparameter, i.e. the one that gives the maximum AUC value.
   2. Find the best hyperparameter using k-fold cross-validation (use GridSearchCV or RandomizedSearchCV) or simple cross-validation data (write a for loop to iterate over hyperparameter values).
3. Representation of results:
   1. Plot the performance of the model on both train data and cross-validation data for each hyperparameter.
   2. Once you have found the best hyperparameter, train your model with it, find the AUC on test data, and plot the ROC curve on both train and test.
   3. Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints.
   4. Get the top 20 features from either feature Set 1 or feature Set 2 using the absolute values of the `feature_log_prob_` attribute of `MultinomialNB` (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and print their corresponding feature names; a sketch of this step follows below.
4. Summarize the results at the end of the notebook in table format.
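
A minimal sketch of pulling the top features from a fitted `MultinomialNB` via `feature_log_prob_`. A toy CountVectorizer corpus stands in for the BOW feature Set 1, and the alpha value is illustrative (both are assumptions of this sketch):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the preprocessed essay text and approval labels.
corpus = ["students need books", "need classroom supplies",
          "books for classroom", "supplies for students"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(corpus)
feature_names = vec.get_feature_names_out()

nb = MultinomialNB(alpha=0.5).fit(X, labels)  # alpha would come from the tuning step
for cls, log_probs in zip(nb.classes_, nb.feature_log_prob_):
    # Log probabilities are negative, so the smallest absolute values mark
    # the most probable (top) features for that class; take up to 20.
    top = np.argsort(np.abs(log_probs))[:20]
    print(f"class {cls}:", [feature_names[i] for i in top])
```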
