Understanding the Data Mining Decision Tree
A data mining decision tree is a classification model structured like a tree. It consists of a root node, branches, and leaf nodes. A leaf node represents a class label, an internal (branch) node represents a test on an attribute, and the root is the topmost node, corresponding to the best predictor.
Decision trees are mainly used to divide datasets into smaller, more manageable subsets, which makes it possible to handle large volumes of both categorical and numerical data. In data mining, a decision tree can be described as the use of computational and mathematical techniques to describe, categorize, and generalize a set of data.
A decision tree in data mining is primarily used to describe data, though it can also support decision making. There are two main types of decision trees used in data mining: classification trees and regression trees.
- Classification trees – These are used when the predicted outcome is a discrete class label (for example, spam or not spam).
- Regression trees – These are used when the predicted outcome is a continuous number, such as a price (see the sketch below).
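As an illustration, here is a minimal sketch contrasting the two tree types using scikit-learn; the library choice and the toy data are assumptions for demonstration, not part of the original text:

```python
# Minimal sketch contrasting the two tree types with scikit-learn.
# Library choice and toy data are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the predicted outcome is a discrete class label.
X_cls = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]  # [age, income]
y_cls = [0, 1, 1, 0]                                          # 0 = no, 1 = yes
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[30, 50000]]))     # -> a class label such as [1]

# Regression tree: the predicted outcome is a continuous number.
X_reg = [[1], [2], [3], [4]]            # e.g. house size
y_reg = [100.0, 150.0, 210.0, 260.0]    # e.g. price
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[2.5]]))           # -> a number, e.g. [150.] or [210.]
```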
Decision Tree Algorithms
Several algorithms are used to build decision trees, including ID3, C4.5, and CART. (Algorithms such as k-means and EM are also common in data mining, but they perform clustering rather than building trees.) Here we will focus mainly on ID3 and C4.5.
ID3 Algorithm
Many decision tree algorithms can be traced back to the Iterative Dichotomiser 3, popularly known as ID3. It was developed in the 1980s by the computer science researcher J. Ross Quinlan. ID3 constructs the tree in a top-down manner with no backtracking, and it relies mainly on entropy and information gain.
- Entropy

A decision tree is built from top to bottom by repeatedly subdividing the data into subsets of similar values. In the ID3 algorithm, entropy measures how mixed a given sample is. For a two-class sample, entropy ranges from 0 to 1: it is 0 when the sample is completely homogeneous (all items belong to one class) and 1 when the sample is split equally between the two classes.

- Information Gain
The main aim of constructing a decision tree is to extract the most information from the data. Information gain measures the reduction in entropy achieved by splitting the dataset on an attribute; ID3 chooses the attribute whose split yields the highest gain.
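To make these two measures concrete, here is a minimal sketch in plain Python; the function names and the toy weather-style data are assumptions for illustration:

```python
# Minimal sketch of entropy and information gain as used by ID3.
# Function names and the toy data are illustrative assumptions.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy H(S) = -sum(p_i * log2(p_i)) over class proportions p_i."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = H(S) - sum(|S_v|/|S| * H(S_v)) over values v of attribute A."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy example: does an "outlook" attribute help predict the labels?
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))                            # 1.0 (evenly split sample)
print(information_gain(rows, labels, "outlook"))  # 1.0 (the split is perfect)
```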
C4.5 Algorithm
C4.5 is the successor of ID3 and was also developed by J. Ross Quinlan. Like ID3, it builds the tree top-down with no backtracking. Notable differences are that C4.5 can handle continuous as well as discrete attributes, and that it splits on gain ratio (information gain normalized by how broadly an attribute splits the data) rather than raw information gain.
The C4.5 algorithm also has a single-pass pruning process that sets it apart from many other algorithms, and it can work with both complete and incomplete datasets, since it has a built-in way of dealing with missing attribute values.
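As a rough illustration of the gain-ratio idea, here is a small sketch that reuses the entropy(), information_gain(), rows, and labels definitions from the ID3 sketch above; the split_info and gain_ratio names are assumptions, not Quinlan's own code:

```python
# Sketch of C4.5's gain ratio, reusing entropy(), information_gain(),
# rows, and labels from the ID3 sketch above. Names are assumed.
from collections import Counter
from math import log2

def split_info(rows, attribute):
    """SplitInfo(S, A) = -sum(|S_v|/|S| * log2(|S_v|/|S|)),
    i.e. how broadly attribute A splits the sample S."""
    total = len(rows)
    counts = Counter(row[attribute] for row in rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain_ratio(rows, labels, attribute):
    """C4.5 splits on gain ratio: information gain normalized by split info."""
    si = split_info(rows, attribute)
    return information_gain(rows, labels, attribute) / si if si > 0 else 0.0

print(gain_ratio(rows, labels, "outlook"))  # 1.0 for the toy data above
```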
Tree Pruning
Just as real trees are pruned, decision trees should be pruned too. Pruning removes anomalies, such as branches that reflect noise or outliers in the training data, and makes the tree easier to use: a pruned decision tree is smaller and less complex.
There are two types of pruning: pre-pruning and post-pruning. As the names suggest, pre-pruning restricts the tree while it is still under construction, while post-pruning removes branches from an already complete tree. The sketch below illustrates both.
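Here is a minimal sketch of both approaches using scikit-learn; the dataset and parameter values are assumptions chosen for demonstration:

```python
# Sketch of pre-pruning and post-pruning with scikit-learn.
# Dataset and parameter values are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: limit growth while the tree is being built,
# e.g. by capping depth and requiring a minimum leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the full tree, then cut it back using
# cost-complexity pruning (larger ccp_alpha -> more pruning).
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # pick a mid-range alpha
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```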
Pros of Data Mining Trees
There are many benefits to using decision trees in data mining. Below are a few:
- Decision trees are relatively easy to use and interpret, which simplifies both data preparation and decision making.
- Decision trees can be used to screen variables or select features: the attributes tested near the top of the tree are often the most important ones in the whole tree, which helps in predictive analysis (as shown in the sketch below).
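As an illustration of the feature-screening point, here is a minimal sketch using scikit-learn's feature_importances_ attribute; the dataset choice is an assumption:

```python
# Sketch: reading feature importances off a trained decision tree.
# The dataset is an illustrative assumption; feature_importances_
# is a standard attribute of fitted scikit-learn tree models.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Features tested near the top of the tree tend to score highest.
for name, score in sorted(zip(data.feature_names, tree.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```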