Machine Learning 1
INF554 - Machine Learning I at École Polytechnique, taught by Prof. Vazirgiannis
Dimension Reduction
- PCA
Principal Components Analysis
Finds directions that maximize the variance of the projected data; equivalently, the top eigenvectors of the covariance matrix
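A minimal NumPy sketch of this view of PCA, computing the top principal components as eigenvectors of the sample covariance matrix (array names and shapes are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    components = eigvecs[:, order]          # principal directions (columns)
    return Xc @ components                  # projected data (n_samples x k)

# Example: reduce 5-dimensional data to 2 components
X = np.random.rand(100, 5)
Z = pca(X, k=2)
```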
- SVD
Singular Value Decomposition
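Written out, with the truncation that underlies its use for dimension reduction (keeping the top $k$ singular values gives the best rank-$k$ approximation in the Frobenius norm, by the Eckart-Young theorem):

$$X = U\Sigma V^\top,\qquad X_k = U_k\Sigma_k V_k^\top = \operatorname*{arg\,min}_{\operatorname{rank}(A)\le k}\ \|X - A\|_F$$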
- MDS
Multidimensional Scaling
- Classical MDS
Seeks an optimal configuration of points $\hat x_i$ in a lower-dimensional space whose pairwise distances $\hat{d}_{ij}$ match the original distances $d_{ij}$ as closely as possible (see the sketch after the MDS items)
- Metric MDS
Relax $d_{ij}\approx \hat{d}_{ij}$ by allowing $\hat{d}_{ij}\approx f(d_{ij})$ for some monotone function $f$
The objective is to minimize the stress (one common form is written out after the MDS items)
Even if $f(d_{ij}) = d_{ij}$, the solution is not the same as classical MDS
Sammon’s stress preserves small $d_{ij}$ by weighting each error by $1/d_{ij}$
- Non-metric MDS
$f$ only preserves the order of the distances $d_{ij}$
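The objectives mentioned above, written out in one common convention (definitions of the stress vary slightly between references). Here $D^{(2)}$ is the matrix of squared pairwise distances:

$$B = -\tfrac12\,H\,D^{(2)}\,H,\quad H = I - \tfrac1n\mathbf{1}\mathbf{1}^\top
\;\Longrightarrow\; \hat X = E_k\Lambda_k^{1/2}\quad\text{(classical MDS: top-}k\text{ eigenpairs of }B\text{)}$$

$$\text{Stress}(f) = \left(\frac{\sum_{i<j}\big(f(d_{ij})-\hat d_{ij}\big)^2}{\sum_{i<j}\hat d_{ij}^{\,2}}\right)^{1/2}\qquad\text{(metric MDS)}$$

$$E_{\text{Sammon}} = \frac{1}{\sum_{i<j} d_{ij}}\sum_{i<j}\frac{\big(d_{ij}-\hat d_{ij}\big)^2}{d_{ij}}\qquad\text{(the }1/d_{ij}\text{ weight emphasizes small distances)}$$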
- NMF
Non-negative Matrix Factorization
Find non-negative $U, V$ such that $X\approx UV^t$
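One common formulation uses the Frobenius objective; the multiplicative updates below are the classic Lee-Seung rules (other divergences and solvers exist), with $\odot$ and the fractions taken element-wise:

$$\min_{U\ge 0,\ V\ge 0}\ \big\|X - UV^\top\big\|_F^2,\qquad
U \leftarrow U\odot\frac{XV}{UV^\top V},\qquad
V \leftarrow V\odot\frac{X^\top U}{VU^\top U}$$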
- LSI
Latent Semantic Indexing, an application of SVD
Supervised Learning
- Basic algorithms
- Linear Regression
- Logistic Regression (classification)
logistic function $h(x)=\frac{1}{1+\exp(-x)}$
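A minimal gradient-descent sketch for binary logistic regression with labels in {0, 1}; the learning rate and iteration count are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Fit w, b by gradient descent on the average cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)          # predicted P(y = 1 | x)
        grad_w = X.T @ (p - y) / n      # gradient of the average cross-entropy
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Example on toy data
X = np.random.randn(200, 3)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w, b = fit_logistic(X, y)
```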
- SVM (Support Vector Machine)
- Primal formulation
- Dual formulation
Setting the gradient of the Lagrangian with respect to $w$, $b$, and $\epsilon_i$ to zero gives the stationarity conditions; substituting them back therefore yields the dual problem, sketched below.
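A sketch of these conditions and of the resulting dual, assuming the standard soft-margin primal $\min_{w,b,\epsilon}\ \tfrac12\|w\|^2 + C\sum_i\epsilon_i$ s.t. $y_i(w^\top x_i + b)\ge 1-\epsilon_i$, $\epsilon_i\ge 0$ (an assumption, but it matches the variables $w$, $b$, $\epsilon_i$ used above):

$$\frac{\partial L}{\partial w}=0 \Rightarrow w=\sum_i \alpha_i y_i x_i,\qquad
\frac{\partial L}{\partial b}=0 \Rightarrow \sum_i \alpha_i y_i = 0,\qquad
\frac{\partial L}{\partial \epsilon_i}=0 \Rightarrow \alpha_i = C - \mu_i \;\ (\text{so } 0\le\alpha_i\le C)$$

$$\max_{\alpha}\ \sum_i \alpha_i - \tfrac12\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j
\quad\text{s.t.}\quad \sum_i \alpha_i y_i = 0,\ \ 0\le \alpha_i \le C$$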
- Generative / Discriminative models
- Discriminative models learn the (hard or soft) boundary between classes
Perceptron, Logistic Regression, SVM, Decision Tree, Nearest Neighbor
- Generative models model the distribution of individual classes
Naive Bayes, LDA
- LDA
Linear Discriminant Analysis
LDA assumes that the class-conditional density of each class is Gaussian.
The objective is to project the data into a lower-dimensional space so as to
- maximize the distance between the class centers
- minimize the within-class variance
i.e., to maximize the ratio of between-class to within-class scatter.
Fisher’s linear discriminant
Remark: LDA produces at most $C-1$ features, where $C$ is the number of classes
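The ratio in question, written as Fisher's criterion (two-class case for simplicity; the multi-class version uses between- and within-class scatter matrices $S_B$, $S_W$):

$$J(w)=\frac{w^\top S_B\, w}{w^\top S_W\, w},\qquad
S_B=(m_1-m_2)(m_1-m_2)^\top,\qquad
S_W=\sum_{c}\sum_{i\in c}(x_i-m_c)(x_i-m_c)^\top$$

It is maximized by the generalized eigenvectors of $S_W^{-1}S_B$; with $C$ classes, $S_B$ has rank at most $C-1$, which is why LDA yields at most $C-1$ features.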
Feature selection
- Univariate - Filter methods
Rank the features and take the top k
- Information Gain
- Pearson correlation
- Chi-square test
A chi-square test of independence compares two categorical variables in a contingency table to see whether they are related; more generally, it tests whether the distributions of categorical variables differ from one another. Features whose test gives a small p-value (strong dependence on the class) rank highest; the statistic is written out after this list. (The F-test and its p-values play a similar role in the wrapper methods below.)
- The relief method
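The chi-square statistic referenced above, computed from observed counts $O_{ij}$ and expected counts $E_{ij}$ (under independence) in the contingency table:

$$\chi^2=\sum_{i}\sum_{j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},\qquad
E_{ij}=\frac{(\text{row}_i\ \text{total})\times(\text{column}_j\ \text{total})}{\text{grand total}}$$

A large $\chi^2$ (small p-value) indicates that the feature and the class label are not independent.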
- Multivariate methods
Two individually irrelevant features may become relevant after combination
Subset generation
Find the subset of k variables that predicts best
How to search all the possibilities?
- Wrapper methods
- Sequential Forward Selection: start with the variable with the lowest p-value, then add the feature with the highest F-test value until performance stops improving (F-value too small or p-value too large); a code sketch follows this list
- Sequential Backward Elimination: start with all features and progressively eliminate the least promising ones (smallest F-value or highest p-value)
- Bidirectional selection: add and remove features simultaneously
- (Adaptive) Sequential Forward/Backward (Floating) Search
- Genetic algorithms
- Simulated annealing
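A sketch of sequential forward selection using scikit-learn's SequentialFeatureSelector; note that it greedily adds features based on cross-validated score rather than the F-test/p-value thresholds described above, so it approximates rather than reproduces that procedure:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Toy data: 20 features, only 5 of which are informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Greedily add features one at a time based on cross-validated score
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",   # "backward" gives sequential backward elimination
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```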
- Embedded methods
- Non-linear methods
- Locally Linear Embedding
Assuming the data is sampled from some smooth underlying manifold, we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold.
- Compute the $k$ nearest neighbors of each data point in the original $D\gg d$ dimensions
- Compute the weights that best reconstruct each data point from its neighbors only (the constraint $\sum\limits_j w_{ij} = 1$ makes the weights invariant to rotations, rescalings, and translations)
- Compute the vectors in $d$ dimensions that are best reconstructed by the fixed weights (an eigenvector problem; both objectives are written out after these steps)
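The two optimization problems in the steps above, in the usual formulation:

$$\min_{W}\ \sum_i \Big\|x_i-\sum_{j\in N(i)} w_{ij}x_j\Big\|^2
\quad\text{s.t.}\quad \sum_j w_{ij}=1,\qquad\qquad
\min_{Y}\ \sum_i \Big\|y_i-\sum_{j} w_{ij}y_j\Big\|^2$$

The second problem is solved (under centering and unit-covariance constraints on $Y$) by the bottom non-trivial eigenvectors of $M=(I-W)^\top(I-W)$.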
- Laplacian EigenMaps
- Isometric Feature Mapping (Isomap)
- Kernel PCA
Ensembling
- Bagging (Bootstrap Aggregating)
Random forest, Extremely randomized trees (Extra-Trees)
- Sample $B$ data sets $\{D_1,\cdots , D_B\}$, each of size $n$, randomly with replacement from the training set $D$
- For each $D_b$, train a tree $f_b$
- Given an unseen sample $x'$, the final decision can be (a code sketch follows this list)
- $\frac{1}{B}\sum\limits_b f_b(x')$ (regression)
- majority voting (classification)
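A minimal sketch of the bagging procedure above for regression, using decision trees as the base learner ($B$, the tree depth, and the synthetic data are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

B = 50
n = len(X)
trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                       # bootstrap sample D_b (with replacement)
    trees.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

x_new = np.array([[0.5]])
prediction = np.mean([t.predict(x_new) for t in trees])    # average the B trees (regression)
print(prediction)
```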
- Boosting (AdaBoost)
Combine many relatively weak learners; after each learner is trained, misclassified examples gain weight so that subsequent weak learners focus more on them.
For each round $t$:
- Train the weak learner using distribution $D_t$
- The weak learner has error $\epsilon_t$; the remaining steps of the round are given below
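The remaining steps of each AdaBoost round, in the standard formulation with labels $y_i\in\{-1,+1\}$ and weak hypothesis $h_t$:

$$\alpha_t=\tfrac12\ln\frac{1-\epsilon_t}{\epsilon_t},\qquad
D_{t+1}(i)=\frac{D_t(i)\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t},\qquad
H(x)=\operatorname{sign}\Big(\sum_t \alpha_t h_t(x)\Big)$$

where $Z_t$ is a normalization constant so that $D_{t+1}$ remains a distribution.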