8key/Multilayer-Perceptron-Model-for-XOR-Gated-Variable-Detection

Multilayer-Perceptron-Model-for-XOR-Gated-Variable-Detection

1 – Exploring XOR Gate and Neural Networks

In this code, we build a multilayer perceptron (MLP) model for XOR-gated data sets. Exploring the data shows that the gate variables determine the magnitude of the target variable. The model reaches a mean squared error (MSE) below 0.5 in roughly 900 epochs. We also study the dynamic behavior of training by plotting the model against the input data at regular intervals, which leads to an interesting observation: the MLP learns the three simpler functions concurrently and leaves the more complicated one until the end. We discuss this at length in subsections 1.4 and 1.6. Finally, we ask how efficiently an MLP can rule out irrelevant information. Adding 4, 10, and even 100 dummy binary variables does not affect MLP performance, which shows how robust the model is to irrelevant features. This report contains six subsections: we explore the data (1.1), build the model (1.2), summarize initial results (1.3), analyze dynamic fitting (1.4) and the impact of dummy variables (1.5), and conclude with a discussion (1.6). A copy of the basic code for this exercise can be found in the file main_file.py.

1.1 Data exploration

The given data has already been split into a training set and a testing set, each containing 10000 observations. Columns A and B are binary gate variables that form four combinations, namely [A, B] = [0, 1], [1, 1], [0, 0], and [1, 0], whereas column t, another independent feature, ranges from 0 to 1 with a standard deviation of 0.29. The target variable y ranges from 10 to 65. However, the magnitude of y clearly depends on the values of A and B. Figure 1 shows this behavior by plotting y against t in a separate subplot for each combination of A and B, with key statistics summarized in the subplot titles. As can be seen, both the mean and the standard deviation of y differ from one category to another.

Figure 1. Behavior of t-y over gated variables A and B

Having noticed that the data is divided over the value combinations of A and B, it is worth asking whether we have balanced training samples in each category. By investigating the data structure, we find that the training set is well divided over the four categories, i.e., categories [0, 0] and [1, 1] have 4989 observations, and categories [0, 1] and [1, 0] have 2511 observations.
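The per-category statistics and the balance check described above can be sketched as follows. This is a minimal, framework-free illustration: the data files are not shown in this README, so `make_synthetic_rows` is a hypothetical stand-in with a made-up target formula, and all function names are assumptions rather than the contents of main_file.py.

```python
import random
import statistics
from collections import defaultdict


def make_synthetic_rows(n, seed=0):
    """Hypothetical stand-in for the training file: rows of (A, B, t, y)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        t = rng.random()  # t drawn uniformly from [0, 1)
        # Illustrative target only: the gate pair shifts the level of y.
        y = 10.0 + 15.0 * (2 * a + b) + 5.0 * t
        rows.append((a, b, t, y))
    return rows


def summarize_by_gate(rows):
    """Group rows by (A, B) and report count, mean and std of y per group."""
    groups = defaultdict(list)
    for a, b, _t, y in rows:
        groups[(a, b)].append(y)
    return {
        gate: {
            "count": len(ys),
            "mean_y": statistics.fmean(ys),
            "std_y": statistics.pstdev(ys),
        }
        for gate, ys in groups.items()
    }


rows = make_synthetic_rows(10000)
summary = summarize_by_gate(rows)
for gate in sorted(summary):
    stats = summary[gate]
    print(gate, stats["count"], round(stats["mean_y"], 2), round(stats["std_y"], 2))
```

As a side note, a variable drawn uniformly from [0, 1] has a standard deviation of 1/sqrt(12) ≈ 0.289, which is consistent with the value of 0.29 reported for t.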

1.2 MLP Model

Our modelling journey begins with an MLP with four hidden layers. The number of neurons decreases with the depth of the model: the first layer has 128 neurons, the second 64, the third 32, and the last 16.

Several considerations were taken into account when selecting this initial structure. The nature of the problem comes first. The numbers of features and observations are both low, and data visualization shows a clear trend within each category. Secondly, we have only limited time and computational power, so reaching a satisfactory accuracy is the top priority; fine-tuning the hyper-parameters would be beneficial if extra time were allowed. Thirdly, the data has structure on more than one level. There is a clear relationship between t and y, but on top of that there is also gated behavior over the values of A and B. Thus, analogous to a CNN, where features of increasing abstraction are learned by successive hidden layers, we want our model to have multiple layers with a decreasing number of neurons in order to capture that gated behavior from the training data.

Given these considerations, our detailed model is as follows. After the input layer, we have four hidden layers, each activated by the hyperbolic tangent function. Since our target is a continuous value, we fully connect the last hidden layer to the output unit through a linear weights-and-bias operation. At each layer, a dropout operation is added to prevent the model from overfitting.
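To make the layer structure concrete, here is a minimal pure-Python sketch of the forward pass for a 3 → 128 → 64 → 32 → 16 → 1 network with tanh hidden layers and a linear output. It assumes the three inputs are A, B, and t; the initialisation scale and the helper names are illustrative assumptions, and dropout is left out because it only applies during training. It is not the code from main_file.py.

```python
import math
import random

LAYER_SIZES = [3, 128, 64, 32, 16, 1]  # inputs (A, B, t) -> hidden layers -> y


def init_params(sizes, seed=0):
    """Random weights and zero biases for each consecutive pair of layers."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)  # assumed initialisation scale
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """Tanh on every hidden layer, plain affine map on the output layer."""
    activation = list(x)
    last = len(params) - 1
    for index, (weights, biases) in enumerate(params):
        pre = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        activation = pre if index == last else [math.tanh(v) for v in pre]
    return activation[0]


def count_parameters(params):
    """Total number of weights and biases across all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


params = init_params(LAYER_SIZES)
print(count_parameters(params))          # 11393 trainable values
print(forward(params, [1.0, 0.0, 0.5]))  # untrained output for A=1, B=0, t=0.5
```

The parameter count follows directly from the layer sizes: 512 + 8256 + 2080 + 528 + 17 = 11393, which is small relative to 10000 training observations per epoch.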

1.3 Initial Results

We define the loss function as the mean squared error and use the Adam optimizer with the learning rate set to 0.001. With a batch size of 500, we achieve a training loss as low as 0.41 after the first 900 epochs. Beyond epoch 900, the loss stagnates with only minor fluctuations. Overfitting is insignificant even without dropout. Applying the model to the testing set yields a mean squared error of 0.49. Figure 2 summarizes the training performance of our program.
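For readers unfamiliar with the pieces named above, the following sketch shows the MSE loss and a textbook Adam update in plain Python. Only the learning rate of 0.001 and the batch size of 500 come from this report; the beta and epsilon values are the commonly used Adam defaults and are assumed here, and the one-weight toy problem is purely hypothetical.

```python
import math


def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


class Adam:
    """Textbook Adam update for a flat list of parameters."""

    def __init__(self, n_params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params  # running mean of gradients
        self.v = [0.0] * n_params  # running mean of squared gradients
        self.step_count = 0

    def step(self, params, grads):
        """Return updated parameters after one Adam step."""
        self.step_count += 1
        t = self.step_count
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** t)  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** t)
            updated.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


# With 10000 training rows and a batch size of 500, one epoch is 20 updates.
batches_per_epoch = 10000 // 500

# Toy problem (hypothetical): fit y = w * x with a single weight.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [3.0 * x for x in xs]
w = [0.0]
optimizer = Adam(n_params=1)
initial_loss = mse([w[0] * x for x in xs], ys)
for _ in range(2000):
    # Gradient of the MSE with respect to w: mean of 2 * (w*x - y) * x.
    grad = sum(2 * (w[0] * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w = optimizer.step(w, [grad])
final_loss = mse([w[0] * x for x in xs], ys)
print(batches_per_epoch, initial_loss, final_loss)
```

Because Adam normalises each gradient by its running magnitude, a single step moves a parameter by at most roughly the learning rate, which is one reason a rate of 0.001 needs hundreds of epochs to bring the cost down from the thousands.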

Figure 2. Base model performance: MSE on a log scale, plotted every 100 epochs

The base model is contained in the file main_file.py. After running it, the output should look like:

    Epoch: 000/1500 Training cost: 1458.615
    Epoch: 100/1500 Training cost: 289.980
    ……
    Epoch: 1400/1500 Training cost: 0.473
    Testing data MSE value: 0.495615

1.4 Dynamic Fitting

We divided the training run into intervals and inspected, every 40 epochs, what the model looks like and how well it fits the data. The results are very interesting. Recall from Figure 1 that we have four categories of the t-y relationship. Three of them show a similar pattern: [0, 0], [0, 1], and [1, 0] each have roughly two local maxima and one local minimum. The last category, [1, 1], has four local maxima and five local minima. In other words, the last category oscillates at a distinctly higher frequency than the rest. Interestingly, our model is affected by this and has a hard time learning the last category. Figures 3 and 4 show the MLP model during training and how well it fits the data.

Figure 3. MLP model after 800 epochs of training

Figure 4. MLP model after 1000 epochs of training

(A complete animation can be found in the file dynamic_fitting.flv.)

Some observations are worth mentioning. Training learns the simple features concurrently and leaves the hard work for later. This is supported by Figure 3, where all features of the first three categories have been learned while the last remains almost untouched. The reason, we postulate, is that the Adam search is largely greedy: it moves in the direction where an immediate and sizeable payoff can be secured. Thus the categories with the easier structure are prioritized and learned quickly. This is also consistent with the fact that after 800 epochs, as shown in Figure 3, the MSE is already smaller than 10. In other words, the rest of the training mainly refines the model on the easy categories and explores the hard one. As can be seen in Figure 4, the model eventually succeeds in making a near-perfect prediction over all four categories of our data sets.
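One way to put numbers on the "easy first, hard later" behavior is to track the MSE separately for each gate combination at every snapshot. The report itself relies on plots and an animation, so the sketch below is an assumed complement rather than the author's method; the four data rows and the `half_trained` predictor are hypothetical.

```python
from collections import defaultdict

SNAPSHOT_EVERY = 40  # the report inspects the model every 40 epochs


def should_snapshot(epoch):
    """True on the epochs where the model would be inspected."""
    return epoch % SNAPSHOT_EVERY == 0


def mse_by_gate(rows, predict):
    """Return {(A, B): MSE} for rows of (A, B, t, y) and a predict(a, b, t) callable."""
    squared_errors = defaultdict(list)
    for a, b, t, y in rows:
        squared_errors[(a, b)].append((predict(a, b, t) - y) ** 2)
    return {gate: sum(errs) / len(errs) for gate, errs in squared_errors.items()}


# Hypothetical data: one observation per gate combination.
rows = [(0, 0, 0.5, 12.0), (0, 1, 0.5, 20.0), (1, 0, 0.5, 30.0), (1, 1, 0.5, 60.0)]


def half_trained(a, b, t):
    """Hypothetical model that fits every gate except [1, 1]."""
    learned = {(0, 0): 12.0, (0, 1): 20.0, (1, 0): 30.0}
    return learned.get((a, b), 40.0)


per_gate = mse_by_gate(rows, half_trained)
print(per_gate)  # only the (1, 1) entry is non-zero
```

Logging this dictionary at each snapshot epoch would show the three simple categories dropping early while the [1, 1] error stays high, matching what Figure 3 shows visually.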

1.5 Adding Dummy Variables

In this section, we explore the impact of adding dummy binary variables on model behavior. Since the gate variables are binary, it is of interest to ask what would happen if random gate variables were added and whether the model can recognize them as irrelevant. Three scenarios are created: adding four, ten, and one hundred dummy variables to the original features, respectively. We ran the algorithm and reached the following conclusions:

  1. In general, adding dummy variables at any of these scales has no significant impact on modelling performance, including training speed and predictive power.
  2. The three models show similar MSE reduction over time. This indicates that the MLP can quickly determine which of a set of features are most relevant. It also suggests that after only a few epochs, the model effectively assigns zero to the input-layer weights attached to the dummy variables.
  3. The speed and efficiency with which the MLP converges are not affected. As can be seen in Figure 5, all models converge quickly within the first 800 epochs.

Figure 5. Training comparison amongst three models in MSE
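The experiment setup can be sketched as follows: random 0/1 columns are appended to each feature row, and after training the first-layer weights attached to each input column can be compared. Both helpers are assumptions for illustration; the report does not show how the dummy columns were generated or whether the weights were inspected, and the sample rows and weights below are hypothetical.

```python
import random


def add_dummy_gates(features, n_dummies, seed=0):
    """Append n_dummies random 0/1 columns to each feature row [A, B, t]."""
    rng = random.Random(seed)
    return [list(row) + [rng.randint(0, 1) for _ in range(n_dummies)]
            for row in features]


def column_magnitudes(first_layer_weights):
    """Mean absolute first-layer weight attached to each input column.

    first_layer_weights[u][i] is the weight from input i to hidden unit u.
    """
    n_units = len(first_layer_weights)
    n_inputs = len(first_layer_weights[0])
    return [sum(abs(first_layer_weights[u][i]) for u in range(n_units)) / n_units
            for i in range(n_inputs)]


base = [[0, 1, 0.12], [1, 1, 0.80], [0, 0, 0.45]]  # hypothetical rows
for n_dummies in (4, 10, 100):
    widened = add_dummy_gates(base, n_dummies)
    print(n_dummies, len(widened[0]))  # input width becomes 3 + n_dummies

# Hypothetical trained weights for 2 hidden units and 5 inputs (A, B, t, 2 dummies).
example_weights = [[0.5, -0.4, 0.9, 0.0, 0.01],
                   [-0.6, 0.3, 1.1, 0.02, 0.0]]
print(column_magnitudes(example_weights))
```

If the second conclusion above holds, the magnitudes for the dummy columns should be much smaller than those for A, B, and t once training has converged.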

1.6 Discussions

In this exercise, we explored our training and testing data and identified that variables A and B are gate variables controlling the magnitude of the target variable. The data sets are evenly separated across the categories, which benefits training by avoiding bias towards any one category. If the data sets had imbalanced quantities across categories, techniques such as balanced sampling and data augmentation would be needed to reduce the bias.

We used an MLP model. With a structure of four hidden layers, the algorithm reaches an MSE of less than 0.5 within 2 minutes, over about 1500 epochs. The model performs well without overfitting the training set, and this is true even when no dropout is present.

Some interesting observations come from the dynamic investigation of the training process. The model concurrently learns the features of the simple data curves, i.e., the first three categories, and turns its search effort to the remaining category only once the “easy” learning is done. Readers should refer to the video recorded in the file dynamic_fitting.flv.

Finally, we explored how far an MLP can go when extra noise is present. This is done by adding dummy binary variables to the original data sets. The MLP is almost insensitive to irrelevant information in this XOR problem: adding four, ten, and even one hundred dummy variables shows no significant impact on training or performance. This observation is recorded and discussed in section 1.5.

If more time is allowed, we propose to continue exploring and refining the model structure. Our algorithm can easily be tuned with a multi-dimensional grid over learning rate, dropout ratio, number of layers, activation functions, etc.
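The kind of multi-dimensional tuning grid mentioned above could be organised as in the sketch below. The grid values are illustrative only, since the report does not list any, and `fake_train_and_score` is a hypothetical placeholder: a real version would train the MLP with the given configuration and return its test MSE.

```python
from itertools import product

GRID = {  # illustrative values only; the report does not specify a grid
    "learning_rate": [0.01, 0.001],
    "dropout": [0.0, 0.2],
    "hidden_layers": [(128, 64, 32, 16), (64, 32)],
    "activation": ["tanh", "relu"],
}


def iter_configs(grid):
    """Yield one dict per point of the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, train_and_score):
    """Return (best_config, best_score), where a lower score (MSE) is better."""
    best_config, best_score = None, float("inf")
    for config in iter_configs(grid):
        score = train_and_score(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def fake_train_and_score(config):
    """Hypothetical placeholder for training the MLP and returning test MSE."""
    return abs(config["learning_rate"] - 0.001) + config["dropout"]


best_config, best_score = grid_search(GRID, fake_train_and_score)
print(len(list(iter_configs(GRID))), best_config, best_score)
```

The grid size is the product of the number of options per dimension (here 2 × 2 × 2 × 2 = 16), so at roughly two minutes per run the cost of an exhaustive search grows quickly as dimensions are added.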
