
Sujit Jena's Projects

100daysofmlcode

This repository contains all the required resources for the 100+ Days Of ML Code Telegram group, which I ran from 1-1-2019 to 31-12-2019!

a-path-finding-visualization

A Python visualization of the A* pathfinding algorithm. It lets you pick your start and end locations and watch the process of finding the shortest path.
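
The repository animates the search; for reference, here is a minimal, non-visual sketch of A* itself on a 2D grid with a Manhattan-distance heuristic (the grid encoding and function name are illustrative, not taken from the repo):

```python
# A* on a grid of 0 (free) / 1 (wall) cells; returns the shortest path
# from start to goal as a list of (row, col) tuples, or None if unreachable.
import heapq

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_heap = [(h(start), 0, start)]            # entries are (f = g + h, g, node)
    came_from, g_score = {}, {start: 0}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:                          # rebuild the path by walking parents
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                if g + 1 < g_score.get(nxt, float("inf")):  # found a cheaper route
                    g_score[nxt] = g + 1
                    came_from[nxt] = node
                    heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt))
    return None  # no path exists

print(astar([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 0)))
```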

artificial-intelligence-deep-learning-machine-learning-tutorials

A comprehensive list of Deep Learning / Artificial Intelligence and Machine Learning tutorials, rapidly expanding into areas of AI / Deep Learning / Machine Vision / NLP and industry-specific areas such as Automotive, Retail, Pharma, Medicine, and Healthcare, maintained by Tarry Singh until at least 2020, when he finishes his Ph.D. (which might end up being about interstellar cosmic networks! Who knows! 😀)

clsutering-__-kmeans-allagglomerative-hierarchical-clustering

Step 1: Choose any vectorizer (data matrix) that you have worked with in any of the assignments and that gave the best AUC value.
Step 2: Choose any feature selection/reduction algorithm (e.g. SelectKBest, pretrained word vectors, model-based feature selection) and reduce the number of features to 5k.
Step 3: Apply all three of K-Means, Agglomerative clustering, and DBSCAN.
K-Means clustering:
● Find the best 'k' using the elbow-knee method (plot k vs inertia_); a sketch of this search follows after these steps.
Agglomerative clustering:
● Apply the agglomerative algorithm and try different numbers of clusters, e.g. 2, 5, etc.
● As this is computationally very expensive, take only 5k datapoints for hierarchical clustering, since it takes a considerable amount of time to run.
DBSCAN clustering:
● Find the best 'eps' using the elbow-knee method.
● Take only 5k datapoints.
Step 4: Summarize each cluster by manually observing a few points from each cluster.
Step 5: Plot a word cloud of the essay text for each cluster, for each of the algorithms mentioned in step 3.
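
A minimal sketch of the elbow-knee search for K-Means in step 3. Synthetic blobs stand in for the reduced ~5k-feature matrix from step 2 (an assumption of this sketch):

```python
# Elbow-knee search for the best k in K-Means: fit over a range of k and
# plot k vs inertia_; the "knee" of the curve suggests the best k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=5, random_state=42)  # stand-in data

inertias = []
k_range = range(2, 16)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(k_range), inertias, marker="o")
plt.xlabel("k"); plt.ylabel("inertia_"); plt.title("Elbow-knee plot for K-Means")
plt.show()
```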

desision-trees-with-various-hyperparametrs

Apply both Random Forest and GBDT on these feature sets:
Set 1: categorical (instead of one-hot encoding, try response coding: use probability values), numerical features + project_title (BOW) + preprocessed_eassay (BOW)
Set 2: categorical (response coding as above), numerical features + project_title (TFIDF) + preprocessed_eassay (TFIDF)
Set 3: categorical (response coding as above), numerical features + project_title (AVG W2V) + preprocessed_eassay (AVG W2V). For this set take only 20K datapoints.
Set 4: categorical (response coding as above), numerical features + project_title (TFIDF W2V) + preprocessed_eassay (TFIDF W2V). For this set take only 20K datapoints.
Hyperparameter tuning (consider any two hyperparameters, preferably n_estimators and max_depth):
● Consider the following ranges: n_estimators = [10, 50, 100, 150, 200, 300, 500, 1000], max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10].
● Find the best hyperparameters, i.e. those that give the maximum AUC value, using simple cross-validation data; you can write your own for loops for this task (see the sketch below).
Representation of results:
● Plot the model's performance on both train data and cross-validation data for each hyperparameter combination, either as a 3D plot with n_estimators on the X-axis, max_depth on the Y-axis, and AUC score on the Z-axis (the notebook 3d_scatter_plot.ipynb in the same drive explains how to make this plot), or as a seaborn heat map with rows as n_estimators, columns as max_depth, and cell values as AUC score. You can choose either plotting technique.
● Once you have found the best hyperparameters, train your model with them, find the AUC on test data, and plot the ROC curve on both train and test.
● Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints.
Conclusion:
● Summarize the results at the end of the notebook in table format. To print a table, please refer to the prettytable library link.
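
A minimal sketch of the hand-rolled grid search over n_estimators × max_depth with a simple train/CV split and the seaborn heat map. Synthetic data stands in for the prepared DonorsChoose feature sets (an assumption of this sketch):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for one of the prepared feature sets and a simple CV split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=42)

n_estimators_list = [10, 50, 100, 150, 200, 300, 500, 1000]
max_depth_list = [2, 3, 4, 5, 6, 7, 8, 9, 10]
cv_auc = np.zeros((len(n_estimators_list), len(max_depth_list)))

# Own for loops over the two hyperparameters, scoring CV AUC for each pair.
for i, n in enumerate(n_estimators_list):
    for j, d in enumerate(max_depth_list):
        clf = RandomForestClassifier(n_estimators=n, max_depth=d, n_jobs=-1)
        clf.fit(X_train, y_train)
        cv_auc[i, j] = roc_auc_score(y_cv, clf.predict_proba(X_cv)[:, 1])

# Heat map: rows = n_estimators, columns = max_depth, cells = CV AUC.
sns.heatmap(cv_auc, annot=True, fmt=".3f",
            xticklabels=max_depth_list, yticklabels=n_estimators_list)
plt.xlabel("max_depth"); plt.ylabel("n_estimators"); plt.show()
```

The same two loops can also be run on the train split to fill a second matrix for the train-vs-CV comparison the assignment asks for.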

exploratory-data-analysis

DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

homemade-machine-learning

🤖 Python examples of popular machine learning algorithms, with interactive Jupyter demos and the math explained

inltk

Natural Language Toolkit for Indic Languages; aims to provide out-of-the-box support for various NLP tasks that an application developer might need.

k-nearest-neighbors-knn-algorithm-

In the journey of exploring the field of data science and predictive modeling, I explored the very interesting k-nearest neighbors (KNN) algorithm. I tried to leverage this classification algorithm, which falls under the supervised-learning side of predictive modeling, and used it to classify the approval of projects submitted by teachers in the United States for students.

The main business context of the project was to reduce the manual evaluation of projects by volunteers, as the evaluation process can take a long time, may be biased by some factors, and can introduce irreducible errors into the process. Some other important points are:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance

The goal of the project is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

The steps followed for data preparation and predictive modeling are as follows. Note: giving unstructured data (garbage, in common terms) to a machine-learning algorithm gives you random data (garbage) back. All the code is written in a clean and understandable manner, avoiding fancy methods wherever possible, and a reference for everything used is given above the code, so that it is easy for everyone to understand it and leverage the potential that AI has; I believe in growing together and helping others, which makes me a good team player. It also strengthens the ability to tell a story with data. For the implementation of all the code, I used the scikit-learn library.

1. Apply KNN (brute-force version) on these feature sets. I formed different sets of the data to check which vectorization of the text data works best:
   Set 1: categorical, numerical features + project title (BOW) + preprocessed essay (BOW)
   Set 2: categorical, numerical features + project title (TFIDF) + preprocessed essay (TFIDF)
   Set 3: categorical, numerical features + project title (AVG W2V) + preprocessed essay (AVG W2V)
   Set 4: categorical, numerical features + project title (TFIDF W2V) + preprocessed essay (TFIDF W2V)
2. Hyperparameter tuning to find the best K, and the metric used to evaluate the model:
   1. Find the best hyperparameter, i.e. the one that results in the maximum AUC value.
   2. Find the best hyperparameter using k-fold cross-validation or simple cross-validation data.
   3. Use GridSearchCV, RandomizedSearchCV, or write your own for loops to do this task.
3. Representation of results:
   1. Plot the performance of the model on both train data and cross-validation data for each hyperparameter.
   2. Once you find the best hyperparameter, train your model M with it; then find the AUC on test data and plot the ROC curve on both train and test using model M.
   3. Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints.
4. Select the top 2000 features from feature Set 2 using `SelectKBest` and then apply KNN on top of these features (this is the section where we select the best features from all the features we have); a sketch of this step follows below.
   1. Repeat steps 2 and 3 on the data matrix after feature selection.
5. Conclusion:
   1. Summarize the results at the end of the notebook in table format.
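
A minimal sketch of step 4: keep the top 2000 features with `SelectKBest`, then tune K for brute-force KNN via grid search. Random non-negative data stands in for the TFIDF-based feature Set 2 (an assumption of this sketch):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X_train = rng.random((1000, 5000))   # stand-in for 5000 TFIDF features
y_train = rng.integers(0, 2, 1000)   # stand-in for approved / not-approved labels

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=2000)),   # top 2000 features (chi2 needs non-negative input)
    ("knn", KNeighborsClassifier(algorithm="brute")),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 11, 21, 31, 51]},
                    scoring="roc_auc", cv=3, return_train_score=True)
grid.fit(X_train, y_train)
print("best K:", grid.best_params_["knn__n_neighbors"], "CV AUC:", grid.best_score_)
```

Putting the selector inside the `Pipeline` makes the feature selection refit on each CV fold, which avoids leaking cross-validation data into the selection step.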

logistic-regression-on-donorschoose-dataset

In the journey of exploring the field of data science and predictive modeling, I explored the very interesting logistic regression algorithm. I tried to leverage this classification algorithm, which falls under the supervised-learning side of predictive modeling, and used it to classify the approval of projects submitted by teachers in the United States for students.

The main business context of the project was to reduce the manual evaluation of projects by volunteers, as the evaluation process can take a long time, may be biased by some factors, and can introduce irreducible errors into the process. Some other important points are:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance

The goal of the project is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

The steps followed for data preparation and predictive modeling are as follows. Note: giving unstructured data (garbage, in common terms) to a machine-learning algorithm gives you random data (garbage) back. All the code is written in a clean and understandable manner, avoiding fancy methods wherever possible, and a reference for everything used is given above the code, so that it is easy for everyone to understand it and leverage the potential that AI has; I believe in growing together and helping others, which makes me a good team player. It also strengthens the ability to tell a story with data. For the implementation of all the code, I used the scikit-learn library.

1. Apply logistic regression (either SGDClassifier with log loss, or LogisticRegression) on these feature sets:
   Set 1: categorical, numerical features + project_title (BOW) + preprocessed_eassay (`BOW with bi-grams` with `min_df=10` and `max_features=5000`)
   Set 2: categorical, numerical features + project_title (TFIDF) + preprocessed_eassay (`TFIDF with bi-grams` with `min_df=10` and `max_features=5000`)
   Set 3: categorical, numerical features + project_title (AVG W2V) + preprocessed_eassay (AVG W2V)
   Set 4: categorical, numerical features + project_title (TFIDF W2V) + preprocessed_essay (TFIDF W2V)
2. Hyperparameter tuning (find the best hyperparameters for the algorithm you choose); a sketch follows below:
   1. Find the best hyperparameter, i.e. the one that gives the maximum AUC value.
   2. Find the best hyperparameter using k-fold cross-validation or simple cross-validation data.
   3. Use GridSearchCV, RandomizedSearchCV, or write your own for loops to do this hyperparameter tuning.
3. Representation of results:
   1. Plot the performance of the model on both train data and cross-validation data for each hyperparameter.
   2. Once you have found the best hyperparameter, train your model with it, find the AUC on test data, and plot the ROC curve on both train and test.
   3. Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints. Please visualize your confusion matrices using seaborn heatmaps.

Task 2: apply logistic regression on the feature set below (Set 5), finding the best hyperparameter as suggested in steps 2 and 3.
Set 5:
   school state: categorical data
   clean categories: categorical data
   clean subcategories: categorical data
   project_grade_category: categorical data
   teacher prefix: categorical data
   quantity: numerical data
   teacher_number_of_previously_posted_projects: numerical data
   price: numerical data
   sentiment scores of each of the essays: numerical data
   number of words in the title: numerical data
   number of words in the combined essays: numerical data

4. Conclusion: summarize the results at the end of the notebook in table format. To print a table, please refer to the prettytable library link.
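
A minimal sketch of step 2's tuning, taking the SGDClassifier-with-log-loss option. Synthetic data stands in for the prepared feature sets (an assumption of this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for one of the prepared DonorsChoose feature sets.
X_train, y_train = make_classification(n_samples=2000, n_features=20, random_state=42)

params = {"alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]}  # regularization strengths
sgd = SGDClassifier(loss="log_loss", penalty="l2", random_state=42)  # loss="log" in older scikit-learn
grid = GridSearchCV(sgd, params, scoring="roc_auc", cv=3, return_train_score=True)
grid.fit(X_train, y_train)
print("best alpha:", grid.best_params_["alpha"], "CV AUC:", grid.best_score_)
```

`return_train_score=True` exposes the per-alpha train scores in `grid.cv_results_`, which is what step 3's train-vs-CV plot needs.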

ml-from-scratch

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

models

Models and examples built with TensorFlow

naive-bayes-algorithm-for-classification-of-donorschoose

In the journey of exploring the field of data science and predictive modeling, I explored the very interesting naive Bayes algorithm. I tried to leverage this classification algorithm, which falls under the supervised-learning side of predictive modeling, and used it to classify the approval of projects submitted by teachers in the United States for students.

The main business context of the project was to reduce the manual evaluation of projects by volunteers, as the evaluation process can take a long time, may be biased by some factors, and can introduce irreducible errors into the process. Some other important points are:
● How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
● How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
● How to focus volunteer time on the applications that need the most assistance

The goal of the project is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

The steps followed for data preparation and predictive modeling are as follows. Note: giving unstructured data (garbage, in common terms) to a machine-learning algorithm gives you random data (garbage) back. All the code is written in a clean and understandable manner, avoiding fancy methods wherever possible, and a reference for everything used is given above the code, so that it is easy for everyone to understand it and leverage the potential that AI has; I believe in growing together and helping others, which makes me a good team player. It also strengthens the ability to tell a story with data. For the implementation of all the code, I used the scikit-learn library.

1. Apply Multinomial NB on these feature sets:
   Set 1: categorical, numerical features + preprocessed_eassay (BOW)
   Set 2: categorical, numerical features + preprocessed_eassay (TFIDF)
2. Hyperparameter tuning (find the best alpha, the smoothing parameter):
   1. Find the best hyperparameter, i.e. the one that gives the maximum AUC value.
   2. Find the best hyperparameter using k-fold cross-validation (use GridSearchCV or RandomizedSearchCV) or simple cross-validation data (write a for loop to iterate over hyperparameter values).
3. Representation of results:
   1. Plot the performance of the model on both train data and cross-validation data for each hyperparameter.
   2. Once you have found the best hyperparameter, train your model with it, find the AUC on test data, and plot the ROC curve on both train and test.
   3. Along with the ROC curve, print the confusion matrix with predicted and original labels of the test datapoints.
   4. Get the top 20 features from either feature Set 1 or feature Set 2 using the absolute values of the `feature_log_prob_` attribute of `MultinomialNB` (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and print their corresponding feature names; a sketch of this step follows below.
4. Summarize the results at the end of the notebook in table format.
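
A minimal sketch of pulling the top features from a fitted `MultinomialNB` via `feature_log_prob_`. A toy CountVectorizer corpus stands in for the BOW feature Set 1, and the alpha value is illustrative (both are assumptions of this sketch):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the preprocessed essay text and approval labels.
corpus = ["students need books", "need classroom supplies",
          "books for classroom", "supplies for students"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(corpus)
feature_names = vec.get_feature_names_out()

nb = MultinomialNB(alpha=0.5).fit(X, labels)  # alpha would come from the tuning step
for cls, log_probs in zip(nb.classes_, nb.feature_log_prob_):
    # Log probabilities are negative, so the smallest absolute values mark
    # the most probable (top) features for that class; take up to 20.
    top = np.argsort(np.abs(log_probs))[:20]
    print(f"class {cls}:", [feature_names[i] for i in top])
```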
