# Measuring Entropy (data disorder) and Information Gain

This is a very short post about two of the most basic metrics in the Information Theory

• is a measure of the amount of uncertainty in the (data) set S (i.e. entropy characterizes the (data) set S).
• in other words, it is the average amount of information contained in each message received (message here stands for an event, sample or character drawn from a distribution or data stream)
• it characterizes the uncertainty about our source of information (Entropy is best understood as a measure of uncertainty rather than certainty, as entropy is larger for more random sources)
• a data source is also characterized by the probability distribution of the samples drawn from it (the less likely an event is, the more information it provides when it occurs)
• it makes sense to define information as the negative of the logarithm of the probability distribution (the probability distribution of the events, coupled with the information amount of every event, forms a random variable whose average (expected) value is the average amount of information (entropy) generated by this distribution).
• because entropy is average information, it is also measured in shannons, nats, or hartleys, depending on the base of the logarithm used to define it

Math interpretation:   Python implementation:

```# Calculates the entropy of the given data set for the target attribute.
def entropy(data, target_attr):

val_freq = {}
data_entropy = 0.0

# Calculate the frequency of each of the values in the target attr
for record in data:
if (val_freq.has_key(record[target_attr])):
val_freq[record[target_attr]] += 1.0
else:
val_freq[record[target_attr]]  = 1.0

# Calculate the entropy of the data for the target attribute
for freq in val_freq.values():
data_entropy += (-freq/len(data)) * math.log(freq/len(data), 2)

return data_entropy
```

• is the measure of the difference in entropy from before to after the data set S is split on an attribute A
• in other words, how much uncertainty in S was reduced after splitting data set S on attribute A
• it is a synonym for Kullback–Leibler divergence (in the context of decision trees, the term is sometimes used synonymously with mutual information, which is the expectation value of the Kullback–Leibler divergence of a conditional probability distribution. The expected value of the information gain is the mutual information I(X; A) of X and A – i.e. the reduction in the entropy of X achieved by learning the state of the random variable A. In machine learning, this concept is used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree. Usually an attribute with high mutual information should be preferred to other attributes).

Math interpretation:  Python implementation:

```# Calculates the information gain (reduction in entropy) that would result by splitting the data on the chosen attribute (attr).
def gain(data, attr, target_attr):

val_freq = {}
subset_entropy = 0.0

# Calculate the frequency of each of the values in the target attribute
for record in data:
if (val_freq.has_key(record[attr])):
val_freq[record[attr]] += 1.0
else:
val_freq[record[attr]]  = 1.0

# Calculate the sum of the entropy for each subset of records weighted by their probability of occuring in the training set.
for val in val_freq.keys():
val_prob = val_freq[val] / sum(val_freq.values())
data_subset = [record for record in data if record[attr] == val]
subset_entropy += val_prob * entropy(data_subset, target_attr)

# Subtract the entropy of the chosen attribute from the entropy of the whole data set with respect to the target attribute (and return it)
return (entropy(data, target_attr) - subset_entropy)
```

Cheers!

Resources:

Tagged:

## One thought on “Measuring Entropy (data disorder) and Information Gain”

1. Zbigniew May 30, 2017 at 5:34 am Reply

Thanks very much for the explicit labeling of variables in those equations.

Could not get what I () stood for in all other articles on the web.

Cheers!