# Machine Learning 1

INF554 - Machine Learning I at École Polytechnique, taught by Prof. Vazirgiannis.

## Dimension Reduction

• PCA

Principal Components Analysis

Maximizes the variance of the projected data; the projection directions are the leading eigenvectors of the covariance matrix.
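
As an illustration (not from the course material), a minimal NumPy sketch of PCA via the eigendecomposition of the covariance matrix; the helper name `pca` and the toy data are placeholders:

```python
import numpy as np

def pca(X, d):
    # Center the data, then diagonalize its covariance matrix.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvec[:, np.argsort(eigval)[::-1][:d]]   # d leading eigenvectors (max variance)
    return Xc @ top                                 # projected data, shape (n, d)

X = np.random.randn(100, 5)                         # toy data
Z = pca(X, 2)
```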

• SVD

Singular Value Decomposition
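
For reference (an added illustration, not from the slides), a short NumPy sketch of the decomposition $X = U\Sigma V^t$ and the rank-$k$ truncation used for dimension reduction:

```python
import numpy as np

X = np.random.randn(100, 20)                       # toy data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt

k = 5                                              # keep the k largest singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of X
```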

• MDS

MultiDimensional Scaling

• Classical MDS

Seeks an optimal configuration of points $x$ in a lower-dimensional space whose pairwise distances $\hat{d}_{ij}$ are as close as possible to the original dissimilarities $d_{ij}$.
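
A minimal NumPy sketch (added as an illustration) of classical MDS via double-centering of the squared distance matrix; `classical_mds` is a hypothetical helper name:

```python
import numpy as np

def classical_mds(D, d):
    # D: (n, n) symmetric matrix of pairwise distances; returns an n x d configuration.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered (Gram) matrix
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:d]           # d largest eigenvalues
    scale = np.sqrt(np.maximum(eigval[idx], 0))
    return eigvec[:, idx] * scale                # coordinates whose Gram matrix approximates B
```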

• Metric MDS

Relax $d_{ij}\approx \hat{d}_{ij}$ by allowing $f(d_{ij})\approx \hat{d}_{ij}$

for some monotone function $f$.

The objective is to minimize the stress, e.g. $S=\left(\frac{\sum\limits_{i<j}\left(f(d_{ij})-\hat{d}_{ij}\right)^2}{\sum\limits_{i<j}\hat{d}_{ij}^2}\right)^{1/2}$

Even if $f(d_{ij}) = d_{ij}$, the solution is not the same as classical MDS

Sammon’s stress, $E=\frac{1}{\sum\limits_{i<j}d_{ij}}\sum\limits_{i<j}\frac{(d_{ij}-\hat{d}_{ij})^2}{d_{ij}}$, emphasizes preserving the small $d_{ij}$.

• Non-metric MDS

$f$ only preserves the order of the distances $d_{ij}$: only the ranking of the dissimilarities matters.
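
For comparison, a short sketch (assuming scikit-learn) where `metric=False` switches `MDS` from metric to non-metric stress minimization; the data below is a toy stand-in:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = np.random.rand(30, 5)                     # toy high-dimensional points
D = pairwise_distances(X)                     # dissimilarities d_ij

nonmetric = MDS(n_components=2, metric=False, dissimilarity='precomputed')
Z = nonmetric.fit_transform(D)                # embedding preserving only the rank order of d_ij
```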

• NMF

Non-negative Matrix Factorization

Find non-negative $U, V$ such that $X\approx UV^t$
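
A minimal sketch of the standard Lee–Seung multiplicative updates for the Frobenius-norm objective (one common way to fit NMF, not necessarily the variant covered in class); `nmf` is a hypothetical helper:

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9):
    # Multiplicative updates keeping U, V non-negative so that X ~ U @ V.T.
    n, m = X.shape
    U = np.random.rand(n, r)
    V = np.random.rand(m, r)
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V
```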

• LSI

Latent Semantic Indexing, an application of (truncated) SVD to the term-document matrix
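
As an illustration (assuming scikit-learn), LSI amounts to a truncated SVD of the (TF-IDF) document-term matrix; the documents and component count below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs and cats", "stock markets fell", "the market rallied"]
X = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix
lsi = TruncatedSVD(n_components=2)          # truncated SVD of that matrix = LSI
doc_topics = lsi.fit_transform(X)           # documents represented in the latent space
```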

## Supervised Learning

• Basic algorithms

• Linear Regression

• Logistic Regression - Classification

logistic function $h(x)=\frac{1}{1+\exp(-x)}$
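
A minimal sketch (added for illustration) of logistic regression fitted by batch gradient descent on the cross-entropy loss; the helper names are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    # Batch gradient descent on the negative log-likelihood; labels y are in {0, 1}.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)                # predicted P(y = 1 | x)
        w -= lr * X.T @ (p - y) / len(y)      # gradient of the cross-entropy loss w.r.t. w
        b -= lr * np.mean(p - y)              # gradient w.r.t. b
    return w, b
```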

• SVM

• Primal formulation

Minimize $\frac{1}{2}\|w\|^2 + C\sum\limits_i \epsilon_i$ subject to $y_i(w^t x_i + b)\ge 1-\epsilon_i$ and $\epsilon_i \ge 0$.

• Dual formulation

Setting the gradient of the Lagrangian with respect to $w$, $b$ and $\epsilon_i$ to zero gives $w=\sum\limits_i \alpha_i y_i x_i$, $\sum\limits_i \alpha_i y_i = 0$ and $\alpha_i = C-\mu_i$ (hence $0\le \alpha_i\le C$).

Therefore, the dual problem is to maximize $\sum\limits_i \alpha_i - \frac{1}{2}\sum\limits_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^t x_j$ subject to $\sum\limits_i \alpha_i y_i = 0$ and $0\le\alpha_i\le C$.
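
As an illustration only, a sketch that fits a linear soft-margin SVM by sub-gradient descent on the primal objective (rather than by solving the dual QP above); names and hyperparameters are placeholders:

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, n_iter=1000):
    # Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)), with y_i in {-1, +1}.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        viol = margins < 1                                       # margin violators
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```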

• Generative / Discriminative models

• Discriminative models learn the (hard or soft) boundary between classes

Perceptron, Logistic Regression, SVM, Decision Tree, Nearest neighbor

• Generative models model the distribution of individual classes

Naive Bayes, LDA (note: logistic regression is discriminative, not generative)

• LDA

Linear Discriminant Analysis

LDA assumes that the class-conditional density of each class is a Gaussian distribution.

The objective is to project the data into a lower-dimensional space so as to

• maximize the distance between the class centers
• minimize the within-class variance

Therefore, the objective is to maximize the ratio between the two.

Fisher’s linear discriminant: maximize $J(w)=\frac{w^t S_B w}{w^t S_W w}$, where $S_B$ is the between-class scatter matrix and $S_W$ the within-class scatter matrix.

Remark: LDA produces at most $C-1$ features, where $C$ is the number of classes.
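
A minimal two-class sketch (added for illustration) of Fisher's discriminant direction $w\propto S_W^{-1}(m_1-m_0)$, which maximizes the ratio above; `fisher_lda_direction` is a hypothetical helper:

```python
import numpy as np

def fisher_lda_direction(X, y):
    # Two-class Fisher discriminant: w is proportional to S_W^{-1} (m1 - m0).
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw, m1 - m0)          # within-class scatter "whitens" the mean difference
    return w / np.linalg.norm(w)              # unit-norm projection direction
```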

## Feature selection

• Univariate - Filter methods

Rank the features and take the top k

• Information Gain
• Pearson correlation
• Chi-square test

A chi-square test for independence compares two categorical variables in a contingency table to see whether they are related; more generally, it tests whether the distributions of categorical variables differ from one another. The test statistic is compared to a chi-square distribution to obtain a p-value (for numerical features, an F-test plays the analogous role).
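
A short sketch of chi-square filtering with scikit-learn's `SelectKBest` (the Iris data is only a stand-in; chi2 requires non-negative features such as counts):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2)            # keep the 2 features with the highest chi2 score
X_new = selector.fit_transform(X, y)
print(selector.scores_, selector.get_support())
```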

• The Relief method
• Multivariate methods

Two individually irrelevant features may become relevant when combined

Subset generation

Find the subset of $k$ variables that predicts best.

How to search all the possibilities? Exhaustive search over all $\binom{p}{k}$ subsets of $p$ features is infeasible, hence the heuristic strategies below.

• Wrapper methods
• Sequential Forward Selection: start with the variable with the lowest p-value, then repeatedly add the feature with the highest F-test value until performance stops improving (F-value too small or p-value too large). A scikit-learn sketch follows this list.

• Sequential Backward Elimination: start with all features and progressively eliminate the least promising ones (smallest F-value or highest p-value).

• Bidirectional selection: add and remove features simultaneously.

• (Adaptive) Sequential Forward/Backward (Floating) Search
• Genetic algorithms
• Simulated annealing
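
A sketch of greedy forward selection with scikit-learn's `SequentialFeatureSelector`; note that it scores candidate features by cross-validation rather than by the F-test/p-value criterion described above, and the estimator and dataset are placeholders (`direction='backward'` gives backward elimination):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=5,
                                direction='forward', cv=5)
sfs.fit(X, y)
print(sfs.get_support())                     # boolean mask of the selected features
```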
• Embedded methods
• Non-linear methods
• Locally Linear Embedding

Assuming the data is sampled from some smooth underlying manifold, we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold.

• Compute the $k$ nearest neighbors of each data point in the original $D\gg d$ dimensions
• Compute the weights that best reconstruct each data point from its neighbors only, i.e. minimize $\sum\limits_i \left\|x_i - \sum\limits_j w_{ij} x_j\right\|^2$ (with $\sum\limits_j w_{ij} = 1$, which makes the weights invariant to rotations, rescalings and translations)

• Compute the vectors $y_i$ in $d$ dimensions that are best reconstructed by the fixed weights, i.e. minimize $\sum\limits_i \left\|y_i - \sum\limits_j w_{ij} y_j\right\|^2$ (an eigenvector problem)
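
These three steps are what scikit-learn's `LocallyLinearEmbedding` implements; a minimal usage sketch on toy data (assuming scikit-learn):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000)                          # toy 3-D manifold data
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)    # steps 1-3 above
Z = lle.fit_transform(X)                                        # 2-D embedding
```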

• Laplacian EigenMaps
• Isometric Feature Mapping (Isomap)
• Kernel PCA

## Ensembling

• Bagging (Bootstrap Aggregating)

Random forest, Extremely randomized trees (Extra trees)

• Sample $B$ data sets $\{D_1,\cdots , D_B\}$, each of size $n$, randomly with replacement from the training set $D$
• For each $D_b$, train a tree $f_b$
• Given an unseen sample $x'$, the final decision can be
• $\frac{1}{B}\sum\limits_b f_b(x')$ (regression)
• majority voting (classification)
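
A minimal bagging sketch for regression, assuming scikit-learn decision trees as the base learners; `bagging_predict` is a hypothetical helper. For classification one would take a majority vote over the trees instead of the mean:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, B=50):
    # Train B trees on bootstrap resamples D_b and average their predictions.
    n = len(X_train)
    preds = []
    for _ in range(B):
        idx = np.random.randint(0, n, size=n)              # sample D_b with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)                          # (1/B) * sum_b f_b(x')
```
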
• Boosting

• At each round $t$, train a weak learner $h_t$ using distribution $D_t$ over the training set
• The weak learner has (weighted) error $\epsilon_t$
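
For reference (a standard AdaBoost formulation added here, not necessarily verbatim from the course), the steps that follow are:

• Compute the learner weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
• Update the distribution $D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t\, y_i h_t(x_i))$
• Output the final classifier $H(x) = \mathrm{sign}\left(\sum\limits_t \alpha_t h_t(x)\right)$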