- D-Tree = Decision Tree
- IV = Independent Variable (input attribute)
- DV = Dependent Variable (target variable)
- SD = Standard Deviation
- SDR = Standard Deviation Reduction
- ID3 = Iterative Dichotomiser 3
- The ID3 (Iterative Dichotomiser 3) algorithm is one of the most popular algorithms used to solve classification problems. It is mainly used when the input and output variables are categorical in nature.
- The core algorithm for building decision trees, called ID3, was developed by J. R. Quinlan; it employs a top-down, greedy search through the space of possible branches with no backtracking.
- The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.
SDR Formula: SDR(T, X) = S(T) − Σ P(c) × S(c), where the sum runs over the values c of attribute X and P(c) is the fraction of rows falling in branch c
SD Formula: S = √( Σ (x − x̄)² / n ), where x̄ is the mean of the n target values
- Please find below the flowchart, which explains the pseudocode of the ID3 algorithm for a regression problem.
- Now we are going to see how to implement a D-Tree using the ID3 algorithm. For the step-by-step implementation we are going to use the famous Play-Tennis data set, where the target variable is a real number. Please find the data set below for your reference.
- It has 4 IVs, which are categorical in nature, and 1 DV, which is continuous. The problem we are trying to solve: based on the categorical IVs, predict the output target (DV) using SDR.
Step - 1: Find SD of Target Variable
- Finding the SD of the system (target variable) is important since it appears in the SDR gain for each attribute, so we find the SD of the system first. To do that, apply the SD formula mentioned above to the target variable.
- Please find the below image for reference. You will end up with a standard deviation of 9.32.
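As a quick check of Step 1, here is a minimal sketch in Python. The hours-played values are an assumption, taken from the commonly used Play-Tennis regression table, since the original data-set image is not reproduced here:

```python
import math

# Target values (hours played) -- assumed from the commonly used
# Play-Tennis regression data set; the original table is an image.
hours = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]

def sd(values):
    """Population standard deviation: sqrt(sum((x - mean)^2) / n)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

print(round(sd(hours), 2))  # 9.32, the SD of the whole system
```

This matches the 9.32 value quoted above.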
Step - 2: Find SD of Input Attribute(IV)
- Find the SD of each input attribute in the same way as the step above: for each attribute, compute the SD of every branch and weight it by the branch's share of the rows. You will end up with the SD values for each input attribute shown below.
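Step 2 can be sketched like this for the Outlook attribute, again assuming the hours-played values from the commonly used version of the data set:

```python
import math

# (Outlook, hours played) pairs -- assumed values, since the original
# data-set table is an image.
rows = [("Sunny", 25), ("Sunny", 30), ("Overcast", 46), ("Rainy", 45),
        ("Rainy", 52), ("Rainy", 23), ("Overcast", 43), ("Sunny", 35),
        ("Sunny", 38), ("Rainy", 46), ("Sunny", 48), ("Overcast", 52),
        ("Overcast", 44), ("Rainy", 30)]

def sd(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

# SD of each Outlook branch, weighted by the branch's share of the rows
weighted_sd = 0.0
for value in {outlook for outlook, _ in rows}:
    branch = [h for outlook, h in rows if outlook == value]
    print(value, round(sd(branch), 2))
    weighted_sd += len(branch) / len(rows) * sd(branch)

print(round(weighted_sd, 2))  # 7.66 = S(T, Outlook)
```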
Step - 3: Find SDR of Input Attribute(IV)
- To find the SDR, use the above-mentioned formula, which gives the results below.
- As you can see, the attribute Outlook has the highest SDR, so we take it as the root node and build our tree accordingly.
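The SDR comparison in Step 3 can be sketched as follows. The full table of rows is an assumption (the original is an image), using the hours-played values consistent with the SD of 9.32 computed earlier:

```python
import math

# Assumed Play-Tennis regression rows: outlook, temp, humidity, windy, hours
data = [
    ("Sunny",    "Hot",  "High",   False, 25),
    ("Sunny",    "Hot",  "High",   True,  30),
    ("Overcast", "Hot",  "High",   False, 46),
    ("Rainy",    "Mild", "High",   False, 45),
    ("Rainy",    "Cool", "Normal", False, 52),
    ("Rainy",    "Cool", "Normal", True,  23),
    ("Overcast", "Cool", "Normal", True,  43),
    ("Sunny",    "Mild", "High",   False, 35),
    ("Sunny",    "Cool", "Normal", False, 38),
    ("Rainy",    "Mild", "Normal", False, 46),
    ("Sunny",    "Mild", "Normal", True,  48),
    ("Overcast", "Mild", "High",   True,  52),
    ("Overcast", "Hot",  "Normal", False, 44),
    ("Rainy",    "Mild", "High",   True,  30),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Windy": 3}

def sd(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

def sdr(rows, col):
    """SDR(T, X) = S(T) - sum over values c of P(c) * S(c)."""
    target = [r[-1] for r in rows]
    weighted = 0.0
    for v in {r[col] for r in rows}:
        branch = [r[-1] for r in rows if r[col] == v]
        weighted += len(branch) / len(rows) * sd(branch)
    return sd(target) - weighted

for name, col in ATTRS.items():
    print(name, round(sdr(data, col), 2))
# Outlook comes out highest (about 1.66), so it becomes the root.
```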
- Now divide the data set into sub-branches (sub-partitions) and check whether any branch meets the conditions below. If a branch meets our condition, make it a leaf node.
- In practice, we need some termination criteria, for example stopping when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%).
- Let's assume the threshold for our problem is 10% and calculate the coefficient of variation for the sub-branches.
- coefficient of variation (CV):
- The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard deviation to the mean (average).
- It is also known as relative standard deviation (RSD) and is a standardized measure of dispersion of a probability distribution or frequency distribution.
- Formula for coefficient of variation (CV): CV = SD / mean (usually expressed as a percentage)
- If we apply the above formula to the sub-branches of attribute Outlook, we get the following results, which clearly show that Overcast has the lowest CV value (8%), smaller than the threshold value (10%). A CV value smaller than the threshold means low variance, so we can make that branch a leaf node.
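The CV check can be sketched like this, with the per-branch hours-played values assumed from the same data set as before:

```python
import math

# Hours played per Outlook branch (assumed values)
branches = {
    "Sunny":    [25, 30, 35, 38, 48],
    "Overcast": [46, 43, 52, 44],
    "Rainy":    [45, 52, 23, 46, 30],
}

def cv(values):
    """Coefficient of variation: SD / mean, as a percentage.
    Assumes a non-zero mean."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return sd / mean * 100

for name, hours in branches.items():
    print(name, f"{cv(hours):.0f}%")
# Overcast is about 8%, below the 10% threshold, so it becomes a
# leaf node predicting the branch mean.
```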
- The tree now looks like this.
- Perform the same steps for the remaining sub-branches and continue the iteration until all sub-branches have been resolved.
- Once all the above steps are finished, you will see a tree like the one below.
- How to write your first program using Python
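Putting all the steps together, here is a minimal sketch of the whole algorithm. The data-set values, the 10% CV threshold, and the minimum-rows cutoff are all assumptions for illustration, not the only possible choices:

```python
import math

def sd(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

def cv(values):
    # coefficient of variation; assumes a non-zero mean
    return sd(values) / (sum(values) / len(values))

def sdr(rows, attr, target):
    """Standard Deviation Reduction of splitting rows on attr."""
    t = [r[target] for r in rows]
    weighted = 0.0
    for v in {r[attr] for r in rows}:
        branch = [r[target] for r in rows if r[attr] == v]
        weighted += len(branch) / len(rows) * sd(branch)
    return sd(t) - weighted

def build(rows, attrs, target, cv_threshold=0.10, min_rows=3):
    """Recursively grow the tree.  A branch becomes a leaf (the mean of
    its target values) once its CV drops below the threshold, it has
    too few rows, or no attributes remain."""
    t = [r[target] for r in rows]
    if not attrs or len(rows) < min_rows or cv(t) < cv_threshold:
        return sum(t) / len(t)
    best = max(attrs, key=lambda a: sdr(rows, a, target))
    children = {}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        children[v] = build(subset, [a for a in attrs if a != best],
                            target, cv_threshold, min_rows)
    return {best: children}

def predict(tree, row):
    # walk down the tree until we hit a numeric leaf
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Assumed Play-Tennis regression data (the original table is an image)
data = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Windy": False, "Hours": 25},
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Windy": True,  "Hours": 30},
    {"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "High",   "Windy": False, "Hours": 46},
    {"Outlook": "Rainy",    "Temp": "Mild", "Humidity": "High",   "Windy": False, "Hours": 45},
    {"Outlook": "Rainy",    "Temp": "Cool", "Humidity": "Normal", "Windy": False, "Hours": 52},
    {"Outlook": "Rainy",    "Temp": "Cool", "Humidity": "Normal", "Windy": True,  "Hours": 23},
    {"Outlook": "Overcast", "Temp": "Cool", "Humidity": "Normal", "Windy": True,  "Hours": 43},
    {"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "High",   "Windy": False, "Hours": 35},
    {"Outlook": "Sunny",    "Temp": "Cool", "Humidity": "Normal", "Windy": False, "Hours": 38},
    {"Outlook": "Rainy",    "Temp": "Mild", "Humidity": "Normal", "Windy": False, "Hours": 46},
    {"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "Normal", "Windy": True,  "Hours": 48},
    {"Outlook": "Overcast", "Temp": "Mild", "Humidity": "High",   "Windy": True,  "Hours": 52},
    {"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "Normal", "Windy": False, "Hours": 44},
    {"Outlook": "Rainy",    "Temp": "Mild", "Humidity": "High",   "Windy": True,  "Hours": 30},
]

tree = build(data, ["Outlook", "Temp", "Humidity", "Windy"], "Hours")
print(predict(tree, {"Outlook": "Overcast", "Temp": "Mild",
                     "Humidity": "High", "Windy": True}))  # 46.25
```

With these settings the root split is on Outlook, and the Overcast branch (CV about 8%) becomes a leaf predicting the branch mean of 46.25 hours, matching the hand calculation above.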
- If you have completed the above steps, you can see that the D-Tree algorithm predicts test samples based on the training result.
- Since ID3 for regression does not give high accuracy, it is used mainly in approximate computing rather than general-purpose regression.
