Some time ago, while reading the journal Knowledge and Information Systems (KAIS; vol. Dec24, 2007), I came across a paper titled “Top 10 Algorithms in Data Mining”.

This paper was presented at the IEEE International Conference on Data Mining (ICDM; 2006, Hong Kong), and a companion book, edited by the authors of the paper (Xindong Wu, Vipin Kumar et al.), was published in 2009.

It was “the paper” that attracted me to the field of Machine Learning, and with this post I’m starting a series of articles related to this exciting area 🙂

Below you’ll find my rough notes (which you may still find useful), as well as my descriptions of some of the Top 10 Algorithms presented in that paper.

Basics:

- Machine Learning – “making sense of the data”
- **Supervised learning** – we specify a target variable, and the machine learns from our data (by identifying patterns) to eventually predict that target variable. (“You know what you are looking for.”)
  - Sample tasks:
    - Classification – predicting what class an instance of data should fall into
    - Regression – predicting a numeric value
- **Unsupervised learning** – you don’t know what you’re looking for, and you ask the machine to tell you instead. (“What do these data have in common?”)
  - Sample tasks:
    - Clustering – grouping similar items together
    - Density estimation – finding statistical values that describe the data
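
To make the regression task a bit more concrete, here is a minimal sketch of my own (it is not taken from the paper): a least-squares line fitted to a few made-up points. The function name `fit_line` and the "hours studied vs. exam score" data are assumptions chosen purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised,
    # because the 1/n factors cancel in the ratio below).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: hours studied -> exam score (the target variable).
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

slope, intercept = fit_line(hours, scores)

# Supervised learning in a nutshell: predict the target for unseen input.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

The target variable here is numeric, which is what makes this regression rather than classification.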

Examples of supervised learning algorithms:

- **k-Nearest Neighbors (kNN)** – uses a distance metric to classify items
- **Decision Trees** – map observations about an item to conclusions about the item’s target value
- **Naïve Bayes** – uses probability distributions for classification
- **Logistic Regression** – finds the best parameters to properly classify data
- **Support Vector Machines** – construct a hyperplane or set of hyperplanes in a high- or infinite-dimensional space
- **AdaBoost** – is made up of a collection of classifiers (a meta-algorithm)
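
As an illustration of the first item, here is a rough sketch of kNN in plain Python. Everything specific in it is my assumption: Euclidean distance as the metric, `k=3`, a simple majority vote, and the tiny made-up data set with classes "A" and "B".

```python
import math
from collections import Counter


def knn_classify(query, points, labels, k=3):
    """Classify `query` by majority vote among its k nearest points."""
    # Pair every training point's distance to the query with its label,
    # then sort so that the closest points come first.
    distances = sorted(
        (math.dist(query, point), label) for point, label in zip(points, labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2D training data forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify((1.1, 1.0), points, labels))
print(knn_classify((5.1, 5.0), points, labels))
```

Note that there is no real training step: kNN keeps the data around and does all of its work at query time.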

Examples of unsupervised learning algorithms:

- **k-Means clustering**
- **Apriori algorithm**
- **FP-Growth**
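
For a feel of how k-Means clustering works, here is a minimal sketch under simplifying assumptions of mine: the first `k` points are used as the initial centroids (real implementations usually pick them randomly), the number of iterations is fixed, and the data is made up.

```python
import math


def k_means(points, k, iterations=10):
    """Tiny k-means: returns (centroids, cluster index of each point)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Made-up, unlabeled data: no target variable is given.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)
```

Unlike the supervised examples, nobody tells the algorithm which group a point belongs to; the grouping falls out of the distances alone.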

Classification:

- In classification, the target variable (“class”) can take:
  - nominal values (true, false, car, plane, human, animal, etc.)
  - an infinite number of numeric values (in this case we’re talking about regression)

- Classification Imbalance – a real-world problem where you have more data from one class than from the other classes
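
A quick way to see why classification imbalance matters is to look at the class distribution before training anything. The sketch below is my own illustration; the labels "ok" and "fraud" and the 95/5 split are made up.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of the data that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Made-up, heavily imbalanced label set.
labels = ["ok"] * 95 + ["fraud"] * 5

distribution = class_distribution(labels)
majority_label = max(distribution, key=distribution.get)

# A "classifier" that always answers with the majority class already
# reaches this accuracy without detecting a single minority instance.
baseline_accuracy = distribution[majority_label]
print(distribution, majority_label, baseline_accuracy)
```

So on imbalanced data, plain accuracy can look impressive while the classifier is useless for the class you actually care about.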

[Figure: Overview of the classification algorithms based on their design (Haralambos Marmanis)]

Steps in developing a machine learning application:

- **Data collection** – using publicly available sources, APIs, RSS feeds, sensors, etc.
- **Input data preparation** – algorithm-specific data formatting
- **Input data analysis** – it’s important to “understand the data”
- **Algorithm training** – extraction of knowledge or information (this step does not apply to unsupervised learning)
- **Algorithm testing** – putting to use the information learned in the previous step (evaluating the algorithm)
- **Algorithm usage** – solving a problem
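
The steps above can be sketched end to end in a few lines. This is only an illustration under my own assumptions: the data is hard-coded and made up, the classifier is a simple nearest-centroid rule (not one of the Top 10 algorithms), and the 75/25 train/test split and the helper names are arbitrary choices.

```python
import math
import random
from collections import Counter, defaultdict


def train_centroids(rows):
    """Training: compute the mean feature vector of every class."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Usage: pick the class whose centroid is closest to the input."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


def accuracy(centroids, rows):
    """Testing: fraction of rows whose predicted label is correct."""
    hits = sum(predict(centroids, f) == label for f, label in rows)
    return hits / len(rows)


# 1. Data collection: hard-coded, made-up (features, label) pairs.
raw = [
    ((1.0, 1.1), "small"), ((0.9, 1.3), "small"),
    ((1.2, 0.8), "small"), ((1.1, 1.0), "small"),
    ((5.0, 5.2), "large"), ((4.8, 5.1), "large"),
    ((5.3, 4.9), "large"), ((5.1, 5.0), "large"),
]

# 2. Preparation: shuffle with a fixed seed and split into train/test.
rows = list(raw)
random.Random(0).shuffle(rows)
split = int(len(rows) * 0.75)
train_rows, test_rows = rows[:split], rows[split:]

# 3. Analysis: a first look at the data, here just the class counts.
print(Counter(label for _, label in rows))

# 4. Training, 5. testing, 6. usage.
model = train_centroids(train_rows)
test_accuracy = accuracy(model, test_rows)
prediction = predict(model, (4.9, 5.0))
print(test_accuracy, prediction)
```

Keeping the test rows out of training is the important part: the evaluation only means something if the algorithm has never seen those rows.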

Take care.

Resources:

- Knowledge and Information Systems, An International Journal (http://www.cs.uvm.edu/~kais/)
- Top 10 algorithms in data mining; SURVEY PAPER (http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf)
- “The Top Ten Algorithms in Data Mining”, Companion Book (http://www.crcpress.com/product/isbn/9781420089646)
- Top 10 Algorithms in Data Mining (http://www.cs.uvm.edu/~icdm/algorithms/index.shtml)
- Statistical classification (http://en.wikipedia.org/wiki/Classification_(machine_learning))
- Regression analysis (http://en.wikipedia.org/wiki/Regression_analysis)
- Cluster analysis (http://en.wikipedia.org/wiki/Cluster_analysis)
- Density estimation (http://en.wikipedia.org/wiki/Density_estimation)