Using Neural Nets to learn the best images to discover a new local establishment

magicpin Engineering
7 min read · Jun 26, 2018

A well-known saying amongst chefs is, “You eat first with your eyes”, and nothing captures this inherently social experience better than a picture. magicpin is a social network for local experiences, and pictures are central to sharing those experiences on the platform. In this age of high-res smartphone cameras, the best way to check out a trendy neighbourhood eatery is through pictures, before, of course, you team up with your friends to group-buy a magicpin voucher at an amazing price :). Our users spend close to 30 minutes a day exploring and discovering trending local eateries, pubs, salons and even grocery stores through user-uploaded pictures of a place. In this post we describe how we built our system to automatically feature the best and most interesting pictures uploaded by users.

A user uploads a picture whenever they transact at a local establishment and claim a cashback. We select the best pictures uploaded by users and feature them on the merchant page and in the feeds of users in the same locality. Rating pictures is, at best, a subjective task, and we try to mitigate this by having multiple analysts rate and re-rate each picture. When we started out, our team of analysts was solely responsible for selecting the pictures to feature, but as our user base and transaction volume grew, keeping up with the pictures became next to impossible. Recent developments in the field of neural nets have greatly improved the performance of state-of-the-art image classification systems. In this post we describe how we built and tested different machine learning techniques on a 1 million+ image dataset, progressing from classic ML techniques like Random Forest to multi-million-parameter CNNs. To gain a thorough, in-depth understanding of some of these models, we recommend going through the material in this course on CNNs.

The main objective of this system is to develop a machine-learning-based model capable of classifying pictures uploaded by users. As with almost all such systems, the model sits in the middle of a much larger system that is responsible for other tasks like data preparation, evaluation, etc. The diagram below describes one such system at a high level.

Dataset — 1 Million manually rated images

Our first dataset comprised 1 million user-uploaded images and the corresponding transactions (image metadata), rated on a scale of 1–5 by our human analysts. We went on to extract 20+ features from the image metadata, like the time, day, location, transaction amount, etc. The intuition was to build a classic machine-learning model like an SVM to establish a baseline and compare its performance with neural nets. In the sections below we describe the first cut of our models and how we iterated on the datasets and progressed to neural nets.

This step also involved the following data massaging steps:

  1. Replace missing data entries with mean values.
  2. Convert string data into numerical format using one-hot encoding or label encoding.
  3. Feature engineering — categorize createTime into 4 groups (morning, afternoon, evening, night) based on the hour of the transaction, along with weekday and month; bucket continuous attributes like cost for two, transaction amount and cashback amount into bins 1, 2, 3 and 4; and create columns to capture the proximity between the user's home location and the merchant's location.
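As an illustration, the massaging steps above can be sketched with pandas. The column names (create_time, txn_amount, city) and the bin edges are hypothetical stand-ins, not our actual schema:

```python
import pandas as pd

# Toy metadata frame; columns are illustrative, not the production schema.
df = pd.DataFrame({
    "create_time": pd.to_datetime(
        ["2018-01-05 09:30", "2018-01-06 14:10", None, "2018-01-07 22:45"]),
    "txn_amount": [250.0, None, 800.0, 1200.0],
    "city": ["delhi", "mumbai", "delhi", None],
})

# 1. Replace missing numeric entries with the column mean.
df["txn_amount"] = df["txn_amount"].fillna(df["txn_amount"].mean())

# 2. One-hot encode string columns.
df = pd.get_dummies(df, columns=["city"], dummy_na=True)

# 3. Bucket the transaction hour into morning/afternoon/evening/night,
#    and bin the continuous amount into 4 ordinal levels.
hour = df["create_time"].dt.hour
df["time_of_day"] = pd.cut(hour, bins=[0, 6, 12, 18, 24],
                           labels=["night", "morning", "afternoon", "evening"],
                           right=False)
df["amount_bin"] = pd.qcut(df["txn_amount"], q=4, labels=[1, 2, 3, 4])
```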

First Cut — Classification models based on techniques like Random Forest and SVM

For our first cut we decided to train 9 different classification models with the objective of predicting one of the 5 rating classes (1, 2, 3, 4 or 5). The results of the different models are below. All these models were implemented using scikit-learn or LibSVM.
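To make the setup concrete, here is a minimal scikit-learn baseline in the spirit of this first cut. The synthetic 20-feature dataset stands in for the real transaction metadata, so the numbers mean nothing beyond illustrating the train/evaluate loop:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for the ~20 metadata features; the real features come from
# transaction metadata (time, location, amount, ...), not image pixels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```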

The last column indicates the precision on the training set. Further, the accuracy on the test set for the best-performing models (Random Forest/Decision Tree) was even lower, at close to 40%. To understand the reason for this lack of generalization, we decided to plot the rating against different attributes. The figure below captures why none of these models worked as expected for this problem.

Looking at the picture, it is obvious that we need neural nets for this task, since they incorporate attributes of the images themselves rather than working only on the metadata, as the models above do.

Migration to Deep Learning — CNNs

In order to improve the system, the next step was to train neural nets. For this we expanded our dataset to 1.5M images and split it into training, validation and test sets.
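The split itself is straightforward; a sketch with scikit-learn, using an illustrative 80/10/10 ratio (not the actual proportions we used) and file paths as stand-ins for the real images:

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the 1.5M image paths and their 1-5 ratings.
paths = [f"img_{i}.jpg" for i in range(1000)]
labels = [i % 5 + 1 for i in range(1000)]

# First carve off 80% for training, then split the remainder
# evenly into validation and test (80/10/10 overall), stratifying
# so each split keeps the same rating distribution.
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.2, random_state=42, stratify=labels)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.5, random_state=42, stratify=rest_y)
```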

Training via Transfer Learning

To train our CNNs we first started with transfer learning on existing image classification models. Neural nets work best when they have many parameters, which makes them powerful function approximators but also means they must be trained on very large datasets. Because training models from scratch can be a very computationally intensive process requiring days or even weeks, we decided to start with transfer learning on models pre-trained on the ILSVRC-2012-CLS image classification dataset.

We did transfer learning on the following models:

  • VGG 16
  • MobileNet V2 1.4
  • Inception V3
  • ResNet50
  • InceptionResNetV2
  • Xception
  • NASNet-Mobile
  • NASNet-Large
  • DenseNet 201

We worked with everything from popular models like VGG16 and Inception V3 to state-of-the-art models like Xception and NASNet.
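The core idea of transfer learning — freeze the pre-trained convolutional base and train only a new classifier head — can be illustrated without a deep learning framework by treating the frozen base as a fixed feature extractor. The 2048-dimensional features below are random stand-ins for what a base like Inception V3 would emit; in the real pipeline they come from a forward pass through the pre-trained network:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in bottleneck features: one 2048-d vector per image, as a
# frozen Inception V3 base would produce. Random here, purely to
# show the shape of the pipeline.
n_images, feat_dim = 400, 2048
features = rng.normal(size=(n_images, feat_dim))
ratings = rng.integers(1, 6, size=n_images)  # labels 1-5

# Transfer learning: the base stays frozen; only this head is trained.
head = LogisticRegression(max_iter=1000)
head.fit(features, ratings)
```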

Below are the training and test results for these models:

Surprisingly, the test accuracies are lower than what we achieved with Random Forest. To understand these results, we plotted the confusion matrix for Inception V3.

The model is unable to predict any images for classes 1, 3 and 5. To understand why, we looked at the class distribution in our training set of 500,000 images.

The skew in the distribution of images across classes meant the model was unable to generalize: all images of classes 1–3 were predicted as 2, and images of classes 4–5 were predicted as 4.

Deep Learning Pipeline Enhancements

To improve the performance of the above models we decided to explore the following:

  1. Eliminate skew in the training data — we built a training set with an equal number of images in each class.
  2. Binary classification instead of categorical classification — our original aim was to classify images as {1,2,3,4,5}; however, for our system, classifying images as {0,1} would suffice. We aggregated images rated {1,2} into the label {0} and images rated {4,5} into the label {1}, and removed class 3 from the dataset to improve the generalization power of the model.
  3. Fine-tuning the models instead of transfer learning — in transfer learning the weights of the pre-trained model are not changed; we only add our own final layer to make predictions on top of the previously learned representations. When we fine-tune a model, however, we run backpropagation through the entire network again, which ensures the weights are adjusted with respect to our dataset. The advantage is the extremely high number of trainable parameters; the cost is that fine-tuning takes substantially longer than transfer learning.
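Steps 1 and 2 above amount to a small relabelling-and-resampling pass over the ratings table. A sketch with pandas, on toy data with hypothetical column names:

```python
import pandas as pd

# Toy ratings table; the real one has ~1.5M rows.
df = pd.DataFrame({
    "image": [f"img_{i}.jpg" for i in range(10)],
    "rating": [1, 2, 2, 3, 3, 4, 4, 4, 5, 5],
})

# Step 2: drop the ambiguous middle class and collapse to binary labels:
# ratings {1, 2} -> 0 (don't feature), {4, 5} -> 1 (feature).
df = df[df["rating"] != 3].copy()
df["label"] = (df["rating"] >= 4).astype(int)

# Step 1: remove skew by downsampling each class to the minority count.
n_min = df["label"].value_counts().min()
balanced = df.groupby("label").sample(n=n_min, random_state=42)
```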

We trained all the above models again using fine-tuning on a dataset of 12k images and selected the best 3. The 3 best models — NASNet-Mobile, InceptionResNetV2 and Xception — were then trained on our entire dataset. Below are the training, validation and test accuracies.

The above results were obtained using the Adam optimizer with a learning rate of 0.001. On changing the hyperparameters to the RMSProp optimizer with a learning rate of 0.00001, the results improved further.

Conclusion

Moving to a deep-learning-based system to identify the best images has led to a substantial reduction in the workload of our analysts and to an increase in the number of images we can show to our users. At the same time, we realize that neural nets, like any machine learning technique, are far from perfect, and their shortcomings are well documented and reported. We will continue our endeavours to make this system better — investing substantially in a hybrid system of analysts and neural nets. Ping us if you have questions about what we did or would like to tell us how we could have done something better. If you want to come and build with us, even better!
