Machine Learning 1
INF554 Machine Learning I at École Polytechnique, taught by Prof. Vazirgiannis
Dimension Reduction
 PCA
Principal Component Analysis
Maximization of the variance of the projected data (the principal directions are the top eigenvectors of the covariance matrix)
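A minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix (function name and data below are illustrative, not from the course):

```python
import numpy as np

def pca(X, d):
    """Project X (n x D) onto its top-d principal components."""
    Xc = X - X.mean(axis=0)               # center the data
    C = Xc.T @ Xc / (len(Xc) - 1)         # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]           # top-d principal directions
    return Xc @ W                         # n x d projected data

X = np.random.randn(100, 5)
Z = pca(X, 2)
```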
 SVD
Singular Value Decomposition
 MDS

Classical MDS
Seeks an optimal configuration $\hat{x}_i$ in a lower-dimensional space whose pairwise distances $\hat{d}_{ij}$ reproduce the original distances $d_{ij}$ as closely as possible
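A sketch of the usual classical MDS construction (double-center the squared distance matrix, then take the top eigenpairs); names are illustrative:

```python
import numpy as np

def classical_mds(D, d):
    """Embed n points in d dimensions from an n x n distance matrix D."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]   # top-d eigenpairs
    L = np.sqrt(np.maximum(eigvals[idx], 0))
    return eigvecs[:, idx] * L            # embedded coordinates
```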

Metric MDS
Relax $d_{ij}\approx \hat{d}_{ij}$ by allowing $\hat{d}_{ij}\approx f(d_{ij})$ for some monotone function $f$
The objective is to minimize the stress, e.g. $S = \left(\sum_{i<j}\big(f(d_{ij})-\hat{d}_{ij}\big)^2 \,/\, \sum_{i<j}\hat{d}_{ij}^2\right)^{1/2}$
Even if $f(d_{ij}) = d_{ij}$, the solution is not the same as classical MDS
Sammon's stress weights each term by $1/d_{ij}$ and thus emphasizes preserving small $d_{ij}$

Nonmetric MDS
$f$ only preserves the order of the distances $d_{ij}$, so only the ranking of the dissimilarities matters
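Both variants are available in scikit-learn's `MDS` (assuming its current API; `metric=False` gives the nonmetric, rank-preserving version):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X = load_iris().data
# Metric MDS fits the distances themselves; nonmetric MDS only their order
metric_emb = MDS(n_components=2, metric=True, random_state=0).fit_transform(X)
nonmetric_emb = MDS(n_components=2, metric=False, random_state=0).fit_transform(X)
```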

 NMF
Non-negative Matrix Factorization
Find nonnegative $U, V$ such that $X\approx UV^T$
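A short sketch with scikit-learn's `NMF` (note that `components_` is $V^T$ in the notation above; the data here is synthetic):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.randn(100, 20))  # NMF requires nonnegative input
model = NMF(n_components=5, init='nndsvda', max_iter=500)
U = model.fit_transform(X)            # n x k, nonnegative
V = model.components_.T               # m x k, nonnegative, so X ~ U V^T
error = np.linalg.norm(X - U @ V.T)   # reconstruction quality
```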
 LSI
Latent Semantic Indexing, an application of SVD
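A minimal LSI sketch: build a document-term matrix and truncate its SVD (the documents and dimensions are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "machine learning with matrices"]
X = TfidfVectorizer().fit_transform(docs)  # document-term matrix
lsi = TruncatedSVD(n_components=2)         # truncated SVD = LSI
doc_topics = lsi.fit_transform(X)          # documents in the latent space
```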
Supervised Learning

Basic algorithms

Linear Regression
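A quick sketch of ordinary least squares via the normal equations $\hat{w} = (X^T X)^{-1} X^T y$, on synthetic data (`lstsq` is the numerically stable route):

```python
import numpy as np

X = np.random.randn(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * np.random.randn(100)

Xb = np.hstack([X, np.ones((100, 1))])          # append a bias column
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # least-squares solution
```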

Logistic Regression (Classification)
logistic function $h(x)=\frac{1}{1+\exp(-x)}$
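A minimal sketch of binary logistic regression fit by batch gradient descent on the log-loss (names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # note the minus sign

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Binary logistic regression, y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                # predicted P(y = 1 | x)
        w -= lr * X.T @ (p - y) / len(y)  # gradient of the log-loss
    return w
```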


SVM (Support Vector Machine)
Primal formulation
$\min_{w,b,\epsilon}\ \frac{1}{2}\|w\|^2 + C\sum_i \epsilon_i$ subject to $y_i(w^T x_i + b) \ge 1-\epsilon_i$ and $\epsilon_i \ge 0$
Dual formulation
Setting the gradient of the Lagrangian with respect to $w, b, \epsilon_i$ to zero gives $w=\sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, and $\alpha_i = C - \mu_i$.
Therefore, the dual is $\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^T x_j$ subject to $0\le\alpha_i\le C$ and $\sum_i \alpha_i y_i = 0$
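The stationarity condition $w=\sum_i \alpha_i y_i x_i$ can be checked with scikit-learn's `SVC`, whose `dual_coef_` stores $\alpha_i y_i$ for the support vectors (toy data below):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(60, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable toy labels

clf = SVC(kernel='linear', C=1.0).fit(X, y)
# Recover w from the dual solution: w = sum_i alpha_i y_i x_i
w = clf.dual_coef_ @ clf.support_vectors_
```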



Generative / Discriminative models

Discriminative models learn the (hard or soft) boundary between classes, i.e. model $P(y\mid x)$ directly
Logistic regression, Perceptron, SVM, Decision Tree, Nearest neighbor

Generative models model the distribution of individual classes, i.e. $P(x\mid y)$ and $P(y)$
Naive Bayes


LDA
Linear Discriminant Analysis
LDA assumes that the density within each class is a Gaussian distribution.
The objective is to project the data into a lower-dimensional space so as to
 maximize the distance between the class centers (between-class scatter $S_B$)
 minimize the within-class variance (within-class scatter $S_W$)
Therefore, the objective is to maximize the ratio between them.
Fisher's linear discriminant: $\max_w J(w) = \frac{w^T S_B w}{w^T S_W w}$
Remark: LDA produces at most $C-1$ features, where $C$ is the number of classes
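For two classes the Fisher direction has the closed form $w \propto S_W^{-1}(m_1 - m_0)$; a small NumPy sketch (names are illustrative):

```python
import numpy as np

def fisher_direction(X0, X1):
    """Fisher's discriminant direction for two classes (rows are samples)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = (np.cov(X0, rowvar=False) * (len(X0) - 1)
           + np.cov(X1, rowvar=False) * (len(X1) - 1))  # within-class scatter
    w = np.linalg.solve(S_W, m1 - m0)
    return w / np.linalg.norm(w)
```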
Feature selection
 Univariate (filter) methods
Rank the features and take the top k
 Information Gain
 Pearson correlation
 Chi-square test
A chi-square test for independence compares two variables in a contingency table to see if they are related. More generally, it tests whether the distributions of categorical variables differ from one another; the resulting p-values (or F-test scores) can be used to rank features.
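Filter-style ranking in scikit-learn, here keeping the top k features by chi-square score (chi2 requires nonnegative features, which holds for Iris):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)
X_top = selector.fit_transform(X, y)  # keep the 2 highest-scoring features
scores, pvalues = selector.scores_, selector.pvalues_
```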
 The Relief method
 Multivariate methods
Two individually irrelevant features may become relevant in combination
Subset generation
Find the subset of $k$ variables that predicts best.
How can we search all the possible subsets?
 Wrapper methods

Sequential Forward Selection: start with the variable with the lowest p-value, then repeatedly add the feature with the highest F-test value until performance stops improving (F-value too small or p-value too large); see the sketch after this list.

Sequential Backward Elimination: start with all features and progressively eliminate the least promising ones (smallest F-value or highest p-value)

Bidirectional selection: add and remove features simultaneously
 (Adaptive) Sequential Forward/Backward (Floating) Search
 Genetic algorithms
 Simulated annealing
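A wrapper-method sketch with scikit-learn's `SequentialFeatureSelector` (it scores candidate features by cross-validated model performance rather than F-tests, but the greedy forward/backward logic is the same):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction='forward')  # or 'backward'
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```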

 Embedded methods
Feature selection is performed as part of model training itself (e.g. an L1 penalty shrinks some coefficients to exactly zero)
 Non-linear methods
 Locally Linear Embedding
Suppose the data is sampled from some smooth underlying manifold; we then expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold.
 Compute the $k$ nearest neighbors of each data point in the original $D\gg d$ dimensions

Compute the weights that best reconstruct each data point from its neighbors only (with $\sum\limits_j w_{ij} = 1$, which makes the weights invariant to rotations, rescalings, and translations)

Compute the vectors in $d$ dimensions that are best reconstructed by the fixed weights (an eigenvector problem)
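These three steps are implemented in scikit-learn's `LocallyLinearEmbedding`; a quick sketch on a synthetic manifold:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000)  # points on a 3-D manifold
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_2d = lle.fit_transform(X)             # unrolled 2-D embedding
```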
 Laplacian EigenMaps
 Isometric Feature Mapping (Isomap)
 Kernel PCA
Ensembling
 Bagging (Bootstrap Aggregating)
Random forest, Extremely randomized trees (Extra-Trees)
 Sample $B$ data sets $\{D_1,\cdots,D_B\}$, each of size $n$, randomly with replacement from the training set $D$
 For each $D_b$, train a tree $f_b$
 Given an unseen sample $x'$, the final decision can be
 $\frac{1}{B}\sum\limits_b f_b(x')$ (regression)
 majority voting (classification)
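A hand-rolled bagging sketch following these steps (the tree learner and dataset are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
B, n = 25, len(X)

trees = []
for _ in range(B):
    idx = np.random.randint(0, n, size=n)  # bootstrap: sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

preds = np.array([t.predict(X) for t in trees])                   # B x n votes
y_hat = np.array([np.bincount(col).argmax() for col in preds.T])  # majority vote
```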
 Boosting (Adaboost)
Combines many relatively weak learners; after each round, misclassified examples gain weight so that subsequent weak learners focus more on them.
For each round $t = 1,\dots,T$:
 Train a weak learner $h_t$ using distribution $D_t$
 The weak learner has weighted error $\epsilon_t = \sum_{i:\,h_t(x_i)\neq y_i} D_t(i)$
 Set $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and update $D_{t+1}(i)\propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
Final classifier: $H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$
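A minimal AdaBoost sketch with decision stumps, following the update rule above (labels must be in $\{-1,+1\}$; names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=20):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(X)
    D = np.full(n, 1.0 / n)               # D_1: uniform weights
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()          # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
        D *= np.exp(-alpha * y * pred)    # up-weight misclassified examples
        D /= D.sum()                      # renormalize to a distribution
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def ada_predict(learners, alphas, X):
    return np.sign(sum(a * h.predict(X) for h, a in zip(learners, alphas)))
```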