Hello all. This summer I decided to try something new: reading research papers to improve my knowledge and understanding of current trends in the industry. Over the last week, I read these 3 research papers on computer vision. I have written a summary of each, and also listed some of the useful links I read or watched to get more clarity on the papers.
Paper 1: Objects as Points
- Detection identifies objects as axis-aligned boxes in an image. Enumerating and classifying a nearly exhaustive list of potential boxes is called “wasteful, inefficient” by the authors.
- Here, the authors model an object as a single point, the center point of its bounding box. They call this method “CenterNet” and report that it is simpler, faster and more accurate than corresponding bounding-box-based detectors.
- Applications of Object Detection: Instance Segmentation, Pose Estimation, Tracking, Action Recognition, Surveillance, Autonomous Driving, and VQA.
- Current Detection Techniques Approach:
- They represent each object as an axis-aligned bounding box that tightly encloses it.
- This way we get an extensive number of potential object bounding boxes, and detection reduces to classifying each of them.
- For each bounding box, the Classifier determines if the image content is a specific object or background.
- One-stage detectors slide a complex arrangement of possible bounding boxes, called anchors, over the image and classify them directly, without specifying the box content.
- Two-stage detectors recompute image features for each potential box, then classify those features.
- DISADVANTAGE OF CURRENT TECHNIQUES: Post-processing (non-maxima suppression) removes duplicate detections of the same instance by computing the IoU between bounding boxes. This step is hard to differentiate and train, which is why most current detectors are not end-to-end trainable.
- Since sliding-window-based object detectors enumerate all possible object locations and dimensions, they are quite wasteful.
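To make the post-processing step concrete, here is a minimal NumPy sketch of IoU and greedy non-maximum suppression. It is my own illustration and not code from the paper; the function names, the (x1, y1, x2, y2) box format and the 0.5 threshold are all assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Width and height of the overlap, clipped at 0 when boxes do not intersect.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Two heavily overlapping boxes and one separate box (made-up numbers).
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

The hard keep-or-drop decision inside the loop is the part that has no useful gradient, which is one way to see why this step sits outside end-to-end training.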
In this paper, the approach simply feeds the input image to a fully convolutional network that generates a heatmap. Peaks in this heatmap correspond to object centers, and the image features at each peak predict the object’s bounding box height and width.
Model trains using Standard supervised learning. The algorithm is end-to-end Differentiable. CenterNet assigns the “anchor” solely on location, not box overlap.
A point to note here is that we do not need non-max suppression, because there is only one positive “anchor” per object. There are no manual thresholds for foreground and background. The model builds on keypoint estimation. CornerNet and ExtremeNet are related keypoint-based detectors, but they need a grouping step after keypoint detection, which CenterNet avoids.
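The “peaks in the heatmap” idea can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions: a single-class heatmap, a size map holding (width, height) at every location, and a hypothetical score threshold. It also leaves out details of the real model such as the sub-pixel offset prediction.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.3):
    """Return (y, x) positions that are >= all 8 neighbours and above threshold."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the 9 shifted views of each 3x3 neighbourhood and take their max.
    neighbourhood = np.stack([padded[dy:dy + h, dx:dx + w]
                              for dy in range(3) for dx in range(3)])
    local_max = neighbourhood.max(axis=0)
    mask = (heatmap == local_max) & (heatmap > threshold)
    return np.argwhere(mask)

def decode_boxes(heatmap, size_map, threshold=0.3):
    """Turn each peak into (x1, y1, x2, y2, score) using the size stored at the peak."""
    boxes = []
    for y, x in find_peaks(heatmap, threshold):
        w, h = size_map[:, y, x]
        boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2, heatmap[y, x]))
    return boxes

# Toy 8x8 heatmap with two objects; (2, 4) is a strong response next to a peak.
heatmap = np.zeros((8, 8))
heatmap[2, 3] = 0.9
heatmap[2, 4] = 0.6
heatmap[6, 6] = 0.8
size_map = np.zeros((2, 8, 8))
size_map[:, 2, 3] = (4, 2)  # width 4, height 2
size_map[:, 6, 6] = (2, 2)
boxes = decode_boxes(heatmap, size_map)
```

Note that the response at (2, 4) is discarded simply because its neighbour is larger, so no IoU computation between boxes is needed.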
Dataset: COCO is an excellent object detection dataset with 80 classes. The paper uses the 2017 split, which has 118k training images, 5k validation images, and 20k hold-out test images.
- Understanding the Object Detection Metrics: https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
- CornerNet, an algorithm that this paper refers to: https://www.youtube.com/watch?v=aJnvTT1-spc
- Found this Video of Siraj Raval useful to understand the basics of OD: https://www.youtube.com/watch?v=4eIBisqx9_g
Paper 2: MixMatch: A Holistic Approach to Semi-Supervised Learning
In this paper, the authors propose a new algorithm for semi-supervised learning (SSL), called “MixMatch”. SSL seeks to alleviate the need for labeled data by allowing a model to leverage unlabelled data. To help the model generalize better, many SSL methods add a loss term which is computed on unlabelled data.
MixMatch introduces a single loss function which unifies the dominant approaches to SSL, i.e.
- Entropy minimization – Encourages the model to output confident predictions on unlabelled data
- Consistency Regularization – Encourages the model to produce the same output distribution when the inputs are perturbed
- Generic Regularization – Encourages the model to generalize well and avoid overfitting on training data
Basic Approach: (For single Unlabelled data Point/Image)
- Stochastic Data Augmentation is applied to unlabelled image K times
- Each augmented image is fed through a classifier
- The average of these K predictions is “sharpened” by adjusting the distribution’s temperature
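The three steps above can be sketched as follows. This is a minimal NumPy illustration, assuming the K predictions are already available as rows of class probabilities; the function names and the value T = 0.5 are my own choices for the example. Sharpening raises each probability to the power 1/T and renormalizes, so a lower T gives a more peaked distribution.

```python
import numpy as np

def sharpen(p, T=0.5):
    """Raise probabilities to the power 1/T and renormalise so they sum to 1."""
    p = np.asarray(p, dtype=float) ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def guess_label(predictions, T=0.5):
    """Average K class distributions (shape K x C), then sharpen the mean."""
    return sharpen(np.mean(predictions, axis=0), T)

# Made-up predictions for K = 3 augmentations of one unlabelled image, 3 classes.
preds = np.array([[0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.7, 0.2, 0.1]])
q = guess_label(preds)
```

With T = 1 the average is returned unchanged, and as T approaches 0 the guessed label approaches a one-hot vector.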
Related Work: The paper discusses Regularization techniques it has taken inspiration from in detail.
- Consistency Regularization:
- This approach has been applied to image classification benchmarks using sophisticated augmentation techniques
- Consistency regularization carries data augmentation (normally used as a regularizer in supervised learning) over to SSL, by leveraging the idea that a classifier should output the same class distribution for an unlabelled example even after it has been augmented.
- Drawback: Usage of domain-specific Data Augmentation strategies.
- MixMatch utilizes a form of consistency regularization through the use of standard data Augmentation for images.
- Entropy Minimization:
- A common assumption in many SSL methods is that the classifier’s decision boundary should not pass through high-density regions of the data space.
- One way to enforce this is to require the classifier to output low-entropy (i.e. confident) predictions on unlabelled data, which pushes the decision boundary away from the data points.
- This can be done explicitly by adding a loss term that minimizes the entropy of the model’s predictions on unlabelled data.
- “Pseudo-Label” is an earlier method that does entropy minimization implicitly, by constructing hard labels from high-confidence predictions on unlabelled data and using them as training targets in a standard cross-entropy loss. MixMatch also achieves entropy minimization implicitly, by applying a “sharpening” function to the target distribution for unlabelled data.
- Traditional Regularization:
- It is generally used so that the Model doesn’t memorize the training data and hence generalizes better to unseen data points.
- The authors use a recently proposed regularizer called “MixUp”, both as a regularizer (for labelled data points) and as an SSL method (for unlabelled data points).
- It trains a model on convex combinations of pairs of inputs and their labels. It encourages the model to have strictly linear behavior between examples.
- MixUp is applied to both labeled examples and unlabeled examples with label guesses.
- Because labelled examples get mixed with unlabelled ones and vice versa, MixMatch makes a small change to the MixUp equations (Section 3.3 of the paper): the mixing weight lambda is replaced by max(lambda, 1 - lambda), so that each mixed example stays closer to its first input.
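Here is a small NumPy sketch of MixUp with the tweak from Section 3.3 of the paper, where the mixing weight sampled from a Beta(alpha, alpha) distribution is replaced by max(lambda, 1 - lambda). The function name, the toy inputs and alpha = 0.75 are assumptions made for this example only.

```python
import numpy as np

def mixmatch_mixup(x1, p1, x2, p2, alpha=0.75, rng=None):
    """Mix two (input, label distribution) pairs, keeping the result closer to the first."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)          # the MixMatch change: lam is now >= 0.5
    x = lam * x1 + (1.0 - lam) * x2    # convex combination of the inputs
    p = lam * p1 + (1.0 - lam) * p2    # same combination of the labels
    return x, p, lam

# Toy "images" (flat vectors) with a one-hot label and a guessed soft label.
x1, p1 = np.array([1.0, 0.0, 0.0, 1.0]), np.array([1.0, 0.0])
x2, p2 = np.array([0.0, 1.0, 1.0, 0.0]), np.array([0.3, 0.7])
x, p, lam = mixmatch_mixup(x1, p1, x2, p2, rng=np.random.default_rng(0))
```

Plain MixUp would use the sampled weight directly, so a mixed example could end up closer to either input. Taking the maximum matters in MixMatch because the first input decides whether the mixed example is counted in the labelled or the unlabelled loss term.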
- Given Batch X = Labelled Data with One-hot Targets, and Batch U = Unlabelled Data,
- Sizes |X| = |U|
- The algorithm produces a processed batch of augmented Labelled examples X’ and a batch of augmented unlabeled examples with “guessed” labels U’.
- U’ and X’ are then used in computing separate labeled and unlabelled loss terms.
- The combined Loss function is calculated using:
- Loss = Loss (X) + (Lambda (u) * Loss (U))
- Here Lambda (u) is a hyperparameter that weights the unlabelled loss. Basically, we are combining the losses computed on X’ and U’. Keep in mind that X’ and U’ are each formed by mixing both labeled and unlabeled points.
- The algorithm is elaborated on page 4 of the paper.
- Augmentation is applied to both labeled and unlabeled data points. For each labeled point we produce one augmented version, while for each unlabelled point we produce K versions. The model’s outputs on these K augmentations are averaged to generate a “guessed label” for that unlabeled point.
- The label guessing step is inspired by entropy minimization. Given the average prediction over augmentations, we apply the “sharpening” function described in the paper (page 4).
- Various Hyperparameters used:
- T = Sharpening Temperature (Label Guessing)
- K = # of Augmentations for Unlabeled data points
- Alpha = Parameter in MixUp
- Lambda(u) = Unsupervised Loss weight used to compute the combined loss
- Unlike cross-entropy, the squared L2 loss is bounded and less sensitive to completely incorrect predictions. As a result, it has frequently been used as the loss on unlabeled data in SSL and as a measure of predictive uncertainty. (The bounded part is easy to see: the squared distance between two probability vectors can never exceed 2, whereas cross-entropy grows without limit as the probability assigned to the target class goes to 0.)
- “Mean Teacher”, VAT, Pseudo-Label, and MixUp are used as baseline models. Refer to the paper for a further breakdown of performance.
- A Wide ResNet-28 model is used.
- Datasets used: CIFAR-10, CIFAR-100, SVHN, and STL-10. STL-10 is specifically for SSL as it has 5,000 labeled images and 100,000 unlabeled ones which are drawn from a slightly different distribution than the labeled data.
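To tie the loss pieces together, here is a rough NumPy sketch of the combined loss from above: cross-entropy on the labelled batch plus Lambda (u) times a squared L2 loss on the unlabelled batch. The function names and all numbers are my own assumptions, and the model outputs are passed in as ready-made probability arrays. The last lines also show why the squared L2 loss is called bounded.

```python
import numpy as np

def cross_entropy(targets, probs, eps=1e-12):
    """Mean cross-entropy between target distributions and predicted probabilities."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

def squared_l2(targets, probs):
    """Mean squared L2 distance, divided by the number of classes."""
    num_classes = targets.shape[1]
    return float(np.mean(np.sum((targets - probs) ** 2, axis=1)) / num_classes)

def combined_loss(x_targets, x_probs, u_targets, u_probs, lambda_u):
    """Loss = Loss(X) + lambda_u * Loss(U), as in the formula above."""
    return cross_entropy(x_targets, x_probs) + lambda_u * squared_l2(u_targets, u_probs)

# One labelled and one unlabelled example with made-up predictions.
total = combined_loss(np.array([[1.0, 0.0]]), np.array([[0.8, 0.2]]),
                      np.array([[0.7, 0.3]]), np.array([[0.6, 0.4]]),
                      lambda_u=10.0)

# A confidently wrong prediction: cross-entropy blows up, squared L2 stays small.
target = np.array([[1.0, 0.0]])
wrong = np.array([[1e-9, 1.0 - 1e-9]])
ce_wrong = cross_entropy(target, wrong)
l2_wrong = squared_l2(target, wrong)
```

Since guessed labels on unlabelled data can be wrong, a loss that does not explode on a confident mismatch seems like a sensible choice for that term.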
Paper 3: Deep Residual Learning for Image Recognition
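Deeper networks are more difficult to train, so the paper presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously. Instead of learning a desired mapping directly, a group of layers learns a residual function with reference to its input, and the input is added back through a shortcut connection.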
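The core idea of residual learning is that a block outputs F(x) + x, where F is what the layers learn and x is passed through a shortcut connection. Below is a tiny NumPy sketch of one such block. It uses plain matrix multiplications instead of the convolutions in the real architecture, and the names and shapes are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F(x) = relu(x @ w1) @ w2 is the residual branch."""
    residual = relu(x @ w1) @ w2
    return relu(residual + x)  # shortcut connection: add the input back

x = np.array([[1.0, 2.0, 3.0]])
zeros = np.zeros((3, 3))
y = residual_block(x, zeros, zeros)  # with all-zero weights, F(x) is 0
```

With zero weights the block simply passes its non-negative input through, which hints at why stacking more of these blocks should not make a network harder to fit: an extra block can fall back to the identity.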
I found a number of videos on YouTube extremely useful for understanding this paper. The paper explains the architecture, motivation, and results of “ResNet”, which has become quite famous recently and won 1st place in the ILSVRC and COCO 2015 competitions.
- Deep Residual Learning for Image Recognition – https://www.youtube.com/watch?v=C6tLw-rPQ2o
- ResNet Architecture – https://www.youtube.com/watch?v=0tBPSxioIZE
- Andrew Ng’s Video on ResNet – https://www.youtube.com/watch?v=ahkBkIGdnWQ
These are the summaries of the 3 papers I read over the last week. I’ll try to be more regular and post the excerpts of the papers I read every week. Next week, I am planning to read papers on NLP. Let us see how that goes. Thank you for reading so far. Hope you got to know something new about the field after reading the article.