
dataset-1's Introduction

Software Requirements

python 3

python packages: pandas, numpy, keras, scikit-learn (imported as sklearn), matplotlib; math is part of the Python standard library and needs no installation

jupyter notebook

Datasets

link:

Steps to follow:

The project goal is to develop a framework to predict if a biological sequence belongs to a promoter or non-promoter category. Steps are as follows:

  1. Remove duplicate sequences from the promoter data.
  2. Generate shuffled data from the promoter sequences.
  3. Perform k-merization.
  4. Apply frequency-based tokenization.
  5. Build and train the classification algorithms:

     RandomForestClassifier (from sklearn.ensemble)

     LSTM (from keras.layers)

     CNN (Conv1D and MaxPooling1D, from keras.layers)

  6. Evaluate performance.
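The final step, performance evaluation, has no snippet in this README, and the source does not say which metrics were used. As a minimal sketch, assuming binary labels where 1 means promoter and 0 means non-promoter, the common metrics can be computed from the confusion counts. The function name `binary_metrics` and the choice of metrics are illustrative assumptions, not the project's own code.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and MCC for 0/1 labels.

    Assumption for illustration: 1 = promoter, 0 = non-promoter.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mcc": mcc}


if __name__ == "__main__":
    print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

For the sigmoid models, predicted probabilities would first have to be thresholded (for example at 0.5) to obtain the 0/1 predictions this sketch expects.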

Code snippets to develop a framework

Drop Duplicates:

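# X holds the sequences; df holds the full data with the same row order
X_drop_dup = X.drop_duplicates()
# Index labels of the rows that survived de-duplication
idx = X_drop_dup.index
# Positional selection: this assumes df has the default 0..n-1 index,
# so the labels in idx are also row positions. Column names are not kept.
data = np.array(df)
df = pd.DataFrame(data[idx])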

Shuffle Data Snippet:

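def shuffler(sequence, k_let):
    # Split the sequence into consecutive chunks of k_let characters
    # (the last chunk may be shorter)
    chunks = [sequence[i:i+k_let] for i in range(0, len(sequence), k_let)]
    # Shuffle the chunks in place, then join them back into one string
    np.random.shuffle(chunks)
    return ''.join(chunks)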
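The README does not show how the shuffled sequences are combined with the originals. A minimal sketch, assuming the shuffled copies serve as the non-promoter (label 0) class and the originals as the promoter (label 1) class, could look like the following. It uses the standard-library `random` module with a fixed seed instead of `np.random` so the result is reproducible. The names `shuffle_k_lets` and `make_labeled_set` are hypothetical.

```python
import random


def shuffle_k_lets(sequence, k_let, rng):
    """Shuffle non-overlapping chunks of k_let characters within one sequence."""
    chunks = [sequence[i:i + k_let] for i in range(0, len(sequence), k_let)]
    rng.shuffle(chunks)  # in-place shuffle of the chunk list
    return "".join(chunks)


def make_labeled_set(promoters, k_let=2, seed=0):
    """Return (sequences, labels): originals labeled 1, shuffled copies labeled 0.

    Assumption for illustration: shuffled promoters act as negatives.
    """
    rng = random.Random(seed)
    shuffled = [shuffle_k_lets(seq, k_let, rng) for seq in promoters]
    sequences = list(promoters) + shuffled
    labels = [1] * len(promoters) + [0] * len(shuffled)
    return sequences, labels


if __name__ == "__main__":
    seqs, labels = make_labeled_set(["ACGTACGTAA", "TTGACATATAAT"])
    for seq, label in zip(seqs, labels):
        print(label, seq)
```

Shuffling whole chunks keeps the length and the k-let composition of each sequence while destroying the order of the chunks. A shuffled copy can occasionally equal its original, especially for short sequences, so a real pipeline may want to check for that.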

K-merization:

k = 2  # k-mer size: 2, 4, or 8

def getKmers(X, size=k):
    # Slide a window of `size` characters over X, one position at a time
    return [X[x:x+size].lower() for x in range(len(X) - size + 1)]
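Frequency-based tokenization (step 4) is listed in the steps but has no snippet. The sketch below is one plausible reading, not the project's confirmed code: each k-mer gets an integer id ordered by how often it occurs (most frequent first, starting at 1), and id 0 is reserved for padding. The helpers `build_vocab` and `encode`, and the choice to pad at the end, are assumptions made for illustration.

```python
from collections import Counter


def get_kmers(sequence, size=2):
    """Overlapping k-mers, lower-cased, as in the K-merization snippet."""
    return [sequence[i:i + size].lower()
            for i in range(len(sequence) - size + 1)]


def build_vocab(kmer_lists):
    """Map each k-mer to an id by descending frequency; ids start at 1."""
    counts = Counter(kmer for kmers in kmer_lists for kmer in kmers)
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda kmer: (-counts[kmer], kmer))
    return {kmer: idx for idx, kmer in enumerate(ordered, start=1)}


def encode(kmers, vocab, max_len):
    """Convert k-mers to ids, then truncate or zero-pad to max_len.

    Unknown k-mers also map to 0 here, which is a simplification.
    """
    ids = [vocab.get(kmer, 0) for kmer in kmers][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    kmer_lists = [get_kmers(s) for s in ["ACGT", "ACGA"]]
    vocab = build_vocab(kmer_lists)
    vocab_size = len(vocab) + 1  # +1 for the padding id 0
    print(vocab, vocab_size)
    print([encode(kmers, vocab, max_len=5) for kmers in kmer_lists])
```

Under these assumptions, `vocab_size` and `max_len` correspond to the arguments passed to the `Embedding` layer in the model snippets.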

CNN architecture:

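from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Dense, Dropout, Flatten

model = Sequential()
# Map each token id to a 128-dimensional vector
model.add(Embedding(vocab_size, 128, input_length=max_len))
# Three Conv1D + MaxPooling1D stages; each pooling step divides the sequence length by 4
model.add(Conv1D(filters=128, kernel_size=5, padding='same'))
model.add(MaxPooling1D(pool_size=4))
model.add(Conv1D(filters=64, kernel_size=5, padding='same'))
model.add(MaxPooling1D(pool_size=4))
model.add(Conv1D(filters=32, kernel_size=5, padding='same'))
model.add(MaxPooling1D(pool_size=4))
# These Dense layers act on each remaining time step, because Flatten comes after them
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Flatten())
# Single sigmoid unit for binary classification
model.add(Dense(1, activation='sigmoid'))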
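Because the three pooling layers each divide the sequence length by 4, `max_len` constrains whether this stack produces any output at all. The following sketch works through the shape arithmetic in plain Python, assuming the Keras defaults (stride 1 for `Conv1D`, and `MaxPooling1D` with `'valid'` padding and a stride equal to `pool_size`). The function name `cnn_flatten_size` and the example value of `max_len` are hypothetical.

```python
def cnn_flatten_size(max_len, pool_sizes=(4, 4, 4), last_dense_units=128):
    """Return (time_steps, flattened_size) for the CNN stack above.

    Conv1D with padding='same' and stride 1 keeps the number of time steps.
    MaxPooling1D with default settings reduces it to steps // pool_size.
    The Dense layers only change the last axis, so Flatten sees
    time_steps * last_dense_units values.
    """
    steps = max_len
    for pool in pool_sizes:
        steps = steps // pool
    if steps < 1:
        raise ValueError(
            f"max_len={max_len} is too short for pooling sizes {pool_sizes}")
    return steps, steps * last_dense_units


if __name__ == "__main__":
    # Hypothetical max_len of 300 tokens: 300 -> 75 -> 18 -> 4 time steps
    print(cnn_flatten_size(300))
```

With three pooling layers of size 4, any `max_len` below 64 leaves no time steps for the final layers under these assumptions.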

LSTM architecture:

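from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential()
# Map each token id to a 50-dimensional vector
model.add(Embedding(vocab_size, 50, input_length=max_len))
# The LSTM returns only its final state (128 units)
model.add(LSTM(128))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
# Single sigmoid unit for binary classification
model.add(Dense(1, activation='sigmoid'))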

For binary classification:

Loss function: binary_crossentropy

Optimizer: adam

Number of epochs: 10

For multispecies classification:

Loss function: sparse_categorical_crossentropy

Optimizer: adam

Number of epochs: 10
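To make the difference between the two loss functions concrete, here is a plain-Python sketch of what each one computes. It is an illustration of the standard definitions, not Keras code. Binary cross-entropy takes one probability per sample. Sparse categorical cross-entropy takes an integer class label and a probability vector per sample, so the multispecies model would presumably need a softmax output with one unit per species instead of the single sigmoid unit shown above. The README does not show that layer, so treat this as an assumption.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over samples; y is 0 or 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def sparse_categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -log(p[label]) over samples; labels are integer class ids."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1.0)  # avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


if __name__ == "__main__":
    # Two samples, one sigmoid probability each
    print(binary_crossentropy([1, 0], [0.9, 0.1]))
    # Two samples, three hypothetical species, one softmax vector each
    print(sparse_categorical_crossentropy(
        [2, 0], [[0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]))
```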
