Machine Learning with KERAS:

Mothi Sriram
5 min read · Jun 13, 2021


PROTEIN FUNCTION CLASSIFIER:

In this project, we have a FASTA file containing proteins. Proteins are made up of amino acids, and in our case the amino acids appear as a sequence of letters. Each protein has an ID and a sequence, and what we are doing is predicting whether a given protein can perform ATP binding.

We are going to see this project in 5 phases:

  1. Processing the data.
  2. Loading and shaping up the data.
  3. Training and testing.
  4. Sequential model.
  5. Functional API

Key points in the five phases are explained below:

Processing the data:

The general idea in the prediction is to establish a classification model between the sequence of a protein and its functional class, based on data available from proteins of known sequence and known function.

First we import the necessary libraries: the re package for regular expressions, plus the os and glob packages for working with files. Next we define where our data lives. The files sit in a data scrapes folder, so the path would be ../data scrapes on Linux and .\data scrapes on Windows; building it with os.path.join keeps the code portable, so it also runs on free platforms such as Google Colab. We then import datetime so we can record which file was created and when, initialize num_proteins to 0, and gather the FASTA files, since there is more than one of them.
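
A minimal sketch of that setup (the folder name, the *.fasta extension, and the variable names are assumptions based on the description above):

    import os
    import re          # regular expressions, for parsing FASTA headers later
    import glob        # file-pattern matching
    from datetime import datetime  # to record which file was created and when

    # os.path.join keeps the path portable across Linux, Windows, and hosted
    # notebooks such as Google Colab.
    data_dir = os.path.join("..", "data scrapes")

    # There are several FASTA files, so collect them all.
    fasta_files = glob.glob(os.path.join(data_dir, "*.fasta"))

    num_proteins = 0  # incremented as proteins are read
    print(len(fasta_files), "FASTA files found at", datetime.now())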

A helper function is a function that performs part of the computation of another function. Helper functions are used to make your programs easier to read by giving descriptive names to computations.
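
For example, reading the FASTA records could be factored out into a helper like this (a sketch; the article's actual helper isn't reproduced here):

    def read_fasta(path):
        """Helper function: yield (protein_id, sequence) records from a FASTA file."""
        protein_id, parts = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    # A ">" header line starts a new record; emit the previous one.
                    if protein_id is not None:
                        yield protein_id, "".join(parts)
                    protein_id, parts = line[1:].split()[0], []
                else:
                    parts.append(line)
            if protein_id is not None:
                yield protein_id, "".join(parts)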

Loading and shaping up the data:

Initially we set the maximum sequence length to 500, so proteins with more than 500 residues are ignored. But the remaining proteins are not all exactly 500 long either; they are shorter than that (for example 300, 199, or 480).

So we use padding (from keras.preprocessing import sequence) to make all the sequences in a batch fit a given standard length. The protein we are looking at has 379 residues; padding brings its length up to 500 by appending empty positions to the sequence, which are encoded as 0.

Padding, in short: since we initially set the maximum sequence length to 500, data points with more than 500 residues are ignored, but the rest are shorter than 500. Padding fills the difference, so a data point with 490 residues gets 10 trailing '0's in the empty positions, and every data point fits the batch's standard length.
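
A hedged sketch of the padding step, assuming the sequences have already been integer-encoded (one code per amino-acid letter) and that the 0s are appended at the end:

    from keras.preprocessing.sequence import pad_sequences

    max_seq_len = 500  # sequences longer than this were already dropped

    # Hypothetical integer-encoded sequences; real lengths vary (300, 199, 480, ...).
    encoded = [[4, 7, 1, 9], [2, 3, 5]]

    x_all = pad_sequences(encoded, maxlen=max_seq_len, padding="post", value=0)
    print(x_all.shape)  # (2, 500); the empty positions are filled with 0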

Training and testing:

Dataset:

A single row of data is called an instance. A collection of instances is a dataset.

Train dataset:

A dataset that we feed into our machine learning algorithm to train our model.

Test dataset:

A dataset that we use to validate the accuracy of our model but is not used to train the model.

Splitting a dataset:

Datasets are split into two parts:
1.) Train dataset
2.) Test dataset (validate dataset)

Data splitting is the act of partitioning available data into two portions, usually for cross-validation purposes. One portion of the data is used to develop a predictive model and the other to evaluate the model's performance.

Previously we used padding to overcome the length issue. Now we are going to split the data into training and test sets.
Before that, we print x_all.shape and y_all.shape to sanity-check the data.
We then use np.random.shuffle to randomize the order, because we want to train on a random sample rather than simply training on the first five examples and testing on the next two.
Then we split the data into training and test sets (a sketch follows below):
Number of data points = 7
Training set = 5
Test set = 2
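
A sketch of the shuffle-and-split step using the article's 7/5/2 example (the random arrays here are just stand-ins for the real encoded proteins and their ATP-binding labels):

    import numpy as np

    # Stand-in data: 7 padded sequences (x_all) and their binary labels (y_all).
    x_all = np.random.randint(0, 27, size=(7, 500))
    y_all = np.random.randint(0, 2, size=(7,))

    print(x_all.shape, y_all.shape)  # sanity check before splitting

    # Shuffle a shared index so sequences and labels stay aligned.
    idx = np.arange(len(x_all))
    np.random.shuffle(idx)
    x_all, y_all = x_all[idx], y_all[idx]

    split = 5  # 5 examples for training, 2 for testing
    x_train, x_test = x_all[:split], x_all[split:]
    y_train, y_test = y_all[:split], y_all[split:]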

Sequential Model:

A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs.

Embedding layer: enables us to convert each word (in our case, each amino-acid code) into a fixed-length vector of a defined size.

Flatten layer: a flatten operation reshapes a tensor so that its shape equals the number of elements contained in the tensor, not including the batch dimension.

Activation layer: applies an activation function to its input. Passing an activation argument to a dense layer is equivalent to using a dense layer followed by a separate Activation layer.
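
To make that equivalence concrete (the layer sizes here are arbitrary):

    from keras.models import Sequential
    from keras.layers import Dense, Activation

    # These two stacks compute the same thing: the activation can be passed
    # to Dense directly, or applied afterwards as a separate Activation layer.
    a = Sequential([Dense(25, activation="relu", input_shape=(10,))])
    b = Sequential([Dense(25, input_shape=(10,)), Activation("relu")])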

The output generated by a dense layer is an 'm'-dimensional vector, so a dense layer is basically used for changing the dimensionality of the vector.

The definition and use of the above layers in our project are shown below.

Use of optimizers: optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and the learning rate, in order to reduce the losses.

Optimizer in our project: SGD (stochastic gradient descent).
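
Putting the pieces together, a minimal sketch of what such a Sequential model could look like (the vocabulary size and embedding dimension are assumptions; the 25- and 1-unit dense layers follow the sizes mentioned in the functional API section below):

    from keras.models import Sequential
    from keras.layers import Embedding, Flatten, Dense, Activation
    from keras.optimizers import SGD

    max_seq_len = 500   # padded length from earlier
    vocab_size = 26     # assumed: one integer code per amino-acid letter

    model = Sequential([
        # Turn each amino-acid code into an 8-dimensional vector (8 is assumed).
        Embedding(input_dim=vocab_size + 1, output_dim=8,
                  input_length=max_seq_len),
        # Collapse the (500, 8) embedding output into one long vector.
        Flatten(),
        Dense(25),
        Activation("relu"),
        # One output unit for the binary ATP-binding prediction.
        Dense(1),
        Activation("sigmoid"),
    ])

    model.compile(optimizer=SGD(), loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.summary()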

Functional API:

The Keras functional API is a way to create models that are more flexible than the tf.keras.Sequential API.
The functional API can handle models with non-linear topology, shared layers, and even multiple inputs or outputs.

We again add an embedding layer with the same data, but this time we don't specify the input size, because we pass the input to the embedding layer directly. Then we create a Flatten layer, pass the embedding output through it, and add dense layers of sizes 25 and 1. And thus we run the model using the functional API.
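
A hedged sketch of the same model rebuilt with the functional API (same assumed sizes as in the Sequential sketch above):

    from keras.models import Model
    from keras.layers import Input, Embedding, Flatten, Dense

    max_seq_len = 500
    vocab_size = 26  # same assumption as above

    # The Input layer carries the shape, so no input size is given to Embedding.
    inputs = Input(shape=(max_seq_len,))
    x = Embedding(input_dim=vocab_size + 1, output_dim=8)(inputs)
    x = Flatten()(x)
    x = Dense(25, activation="relu")(x)
    outputs = Dense(1, activation="sigmoid")(x)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer="sgd", loss="binary_crossentropy",
                  metrics=["accuracy"])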
