Classification by Decision Tree Induction
Attribute Selection Measure:
–Information gain (ID3/C4.5); see the sketch after this list
•All attributes are assumed to be categorical
•Can be modified for continuous-valued attributes
–Gini index (IBM IntelligentMiner)
•All attributes are assumed continuous-valued
•Assume there exist several possible split values for each attribute
•May need other tools, such as clustering, to get the possible split values
•Can be modified for categorical attributes
•Avoid overfitting
•Extract classification rules from trees
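A minimal Python sketch of the information-gain computation used by ID3/C4.5 for categorical attributes; the function names and the small weather-style dataset are illustrative, not from the original material.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the attribute at attr_index."""
    # Partition the class labels by the attribute's categorical values.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Example: attribute 0 ("outlook") against play/no-play labels.
rows = [("sunny",), ("sunny",), ("overcast",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes", "no"]
print(information_gain(rows, labels, 0))  # ~0.571 bits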
Gini Index (IBM IntelligentMiner)
•If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 − Σj pj², summed over the n classes,
where pj is the relative frequency of class j in T.
•If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2), where N = N1 + N2
•The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible split points for each attribute).
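A minimal sketch of the computation above for a continuous-valued attribute, enumerating every observed value as a candidate split point and keeping the one with the smallest gini_split(T); the names and data are illustrative.

from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum of p_j^2 over the class frequencies p_j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    """Weighted gini of the subsets T1 (values <= threshold) and T2 (values > threshold)."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(values, labels):
    """Enumerate candidate split points; return the one minimizing gini_split(T)."""
    candidates = sorted(set(values))[:-1]  # splitting above the maximum is useless
    return min(candidates, key=lambda t: gini_split(values, labels, t))

values = [2.1, 3.5, 3.6, 4.0, 5.2]
labels = ["a", "a", "b", "b", "b"]
print(best_split(values, labels))  # -> 3.5 (a pure split on this data)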
Approaches to Determine the Final Tree Size:
•Separate training (2/3) and testing (1/3) sets
•Use cross validation, e.g., 10-fold cross validation (see the sketch after this list)
•Use all the data for training
–but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
•Use minimum description length (MDL) principle:
–halting growth of the tree when the encoding is minimized
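A hedged sketch of the cross-validation approach: scikit-learn's DecisionTreeClassifier is used as a stand-in learner, tree size is controlled via max_depth, and the synthetic data is illustrative only, not part of the original method.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Score each candidate tree depth with 10-fold cross validation
# and keep the depth with the best mean accuracy.
scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=10
    ).mean()
    for depth in range(1, 11)
}
best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))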
Enhancements to Basic Decision Tree Induction
•Allow for continuous-valued attributes
–Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
•Handle missing attribute values (see the sketch after this list)
–Assign the most common value of the attribute
–Assign probability to each of the possible values
•Attribute construction
–Create new attributes based on existing ones that are sparsely represented
–This reduces fragmentation, repetition, and replication
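A minimal sketch of the two missing-value strategies listed above, using None as the missing marker; the column data and function names are illustrative.

from collections import Counter

def fill_most_common(column):
    """Strategy 1: replace missing entries with the attribute's most common value."""
    most_common = Counter(v for v in column if v is not None).most_common(1)[0][0]
    return [most_common if v is None else v for v in column]

def value_probabilities(column):
    """Strategy 2: relative frequency of each observed value, usable to
    propagate a fractional example down each branch of the tree."""
    observed = [v for v in column if v is not None]
    return {v: c / len(observed) for v, c in Counter(observed).items()}

column = ["red", "red", None, "blue", None]
print(fill_most_common(column))     # ['red', 'red', 'red', 'blue', 'red']
print(value_probabilities(column))  # {'red': 0.666..., 'blue': 0.333...}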
Classification in Large Databases
•Classification—a classical problem extensively studied by statisticians and machine learning researchers
•Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
•Why decision tree induction in data mining?
–relatively faster learning speed (than other classification methods)
–convertible to simple and easy-to-understand classification rules
–can use SQL queries for accessing databases
–comparable classification accuracy with other methods