Understanding customers better through neural network embeddings

Adam Hornsby | @adamnhornsby

Recording of this presentation



If you prefer reading slides, please click onward...

Today


  1. Why item similarity is so important in retail science
  2. What are embeddings?
  3. How can 2vec algorithms help?
  4. Results & conclusions


  5. (Massive thanks to Josh Cooper for doing lots of the thinking in this presentation)

Item similarity

How similar are these two products?



Questions of similarity are everywhere in retail:

"Your selected product X goes well with product Y" (i.e. product complementarity)
"Your product X was not available, so how about alternative Y?" (i.e. product substitutability)
"Customers like you also tend to buy Y" (i.e. customer similarity)

Solving similarity is a huge goal for data scientists working to improve recommendations, ranging, pricing, assortment and more

Traditional methods don't help with similarity


Most scientists will use dummy (or one-hot) encoding to represent categorical data

Dummy-coded data is too sparse (i.e. too many zeros)
With this approach, any similarity measure will tell you that cat food and dog food are totally unrelated

We need a way to represent these items using dense vectors
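As a minimal sketch (using a hypothetical four-product catalogue), one-hot coding makes every pair of distinct products orthogonal, so their cosine similarity is always zero:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy/one-hot coding over a hypothetical 4-product catalogue
cat_food = np.array([1, 0, 0, 0])
dog_food = np.array([0, 1, 0, 0])

print(cosine(cat_food, dog_food))  # 0.0 -- the encoding says they are unrelated
```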

Embeddings

Embeddings are dense representations


Traditionally used in natural language processing (NLP)
Learnt dense representations of otherwise sparse data (vector size is a parameter)
Typically learnt with neural networks

When learnt well, they produce similar vectors for similar items (e.g. cats vs. dogs)
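For contrast with the one-hot sketch earlier, here is a hedged example with hypothetical dense vectors, showing how related items can sit close together while unrelated items do not:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical learnt embeddings (vector size of 4 chosen only for illustration)
cat_food = np.array([0.92, -0.31, 0.44, 0.10])
dog_food = np.array([0.88, -0.25, 0.51, 0.05])
shampoo  = np.array([-0.40, 0.77, -0.12, 0.66])

print(cosine(cat_food, dog_food))  # high -- related items sit close together
print(cosine(cat_food, shampoo))   # low  -- unrelated items sit far apart
```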

*2vec Algorithms



A set of unsupervised learning algorithms that learn vector embeddings (Mikolov et al.)

word2vec is used to learn vector embeddings for items (e.g. words or products)
doc2vec is used to learn vector embeddings for documents (e.g. sentences, baskets, customers etc.)

They create lots of small classification problems (e.g. use the vector of one item to predict the next item in its document, then backpropagate)
Typically very fast to train and scale well to large data
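A rough sketch of that idea, using a made-up basket of product codes to show how a sliding window turns one document into many small (context, target) prediction problems; the real models then learn the embeddings by backpropagating through a shallow network over these pairs:

```python
# Toy sketch of the *2vec training signal; the basket contents are made up
basket = ["CAT_FOOD", "CAT_LITTER", "DOG_FOOD", "MILK", "BREAD"]
window = 2

pairs = []
for i, target in enumerate(basket):
    start, stop = max(0, i - window), min(len(basket), i + window + 1)
    for j in range(start, stop):
        if j != i:
            pairs.append((basket[j], target))  # one tiny classification example

print(pairs[:4])
```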

Sentences vs. Baskets



NLP algorithms (like doc2vec) assume that you have words (i.e. items) and sentences (i.e. documents)

Products are analogous to words and baskets/customers are analogous to sentences
(e.g. a basket is represented as a sequence of product codes)
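A minimal sketch of this analogy using gensim's TaggedDocument, with made-up basket IDs and product codes standing in for real transaction data:

```python
from gensim.models.doc2vec import TaggedDocument

# Hypothetical baskets: each is a "document" whose "words" are product codes
baskets = {
    "basket_001": ["PROD_0012", "PROD_0345", "PROD_0007"],
    "basket_002": ["PROD_0345", "PROD_1290"],
}

documents = [
    TaggedDocument(words=products, tags=[basket_id])
    for basket_id, products in baskets.items()
]
```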

The model

The modelling problem



Aim: Understand whether doc2vec can learn useful vector representations of products and baskets

Model developed on the free transactional dataset from dunnhumby's source files
2 years' worth of transactions from 2,500 households

Using gensim implementation of doc2vec (Python)
Learn product and document vectors at same time

Trained on Google Cloud Platform Compute Engine (4 vCPUs and 15GB memory) in ~5 minutes per experiment
Model solutions evaluated using TensorFlow's amazing Embedding Projector
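A minimal sketch of the training step with gensim's Doc2Vec (4.x API); the hyperparameters below are illustrative rather than the tuned values used in this work, and `documents` is the list of TaggedDocument objects from the earlier sketch:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    documents,
    vector_size=100,   # dimensionality of the product/basket embeddings
    window=5,          # context window within a basket
    min_count=1,       # keep rare products in this toy example; raise on real data
    workers=4,         # matches the 4 vCPUs mentioned above
    epochs=20,
)

product_vector = model.wv["PROD_0345"]   # learnt product embedding
basket_vector = model.dv["basket_001"]   # learnt basket/document embedding
```

The learnt vectors, together with a metadata file of product descriptions, can then be written out as TSV files and loaded into the Embedding Projector for visual inspection.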

Evaluation

Analogical reasoning



Semantic relationships between words are typically preserved within embedding space

$King - Man + Woman = y$

Vectors most similar to $y$ tend to be related to "queen"

What would be the equivalent using product embeddings?
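A sketch of how such a query could be run with gensim's most_similar, which adds and subtracts the chosen vectors before ranking neighbours by cosine similarity; the token names below are hypothetical placeholders for real product codes, and `model` is the trained Doc2Vec model from earlier:

```python
neighbours = model.wv.most_similar(
    positive=["PREMIUM_MEAT", "ECONOMY"],  # vectors that are added
    negative=["PREMIUM"],                  # vectors that are subtracted
    topn=4,
)
for product, cosine_similarity in neighbours:
    print(product, round(cosine_similarity, 4))
```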

Identifying cheaper alternatives

Premium Meat - Premium + Economy = ?

Rank  Product description       Cosine similarity
1     Economy meat               0.542439
2     Margarine stick            0.518072
3     Carrots (bagged)           0.500584
4     Chocolate milk             0.489882


The most similar product to the target vector is economy meat. Other items are also cheaper alternatives.
Suggests that there is a "price sensitivity" direction within the space
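One hedged way to probe that idea: take the difference between a premium and an economy product vector and rank every product by its projection onto that direction (the token names here are placeholders, and `model` is the trained Doc2Vec model):

```python
import numpy as np

direction = model.wv["PREMIUM_MEAT"] - model.wv["ECONOMY_MEAT"]
direction = direction / np.linalg.norm(direction)

# Project every product embedding onto the candidate "price sensitivity" axis
scores = {
    token: float(np.dot(model.wv[token], direction))
    for token in model.wv.index_to_key
}

# Higher scores ~ more "premium"; lower scores ~ more "economy"
most_premium = sorted(scores, key=scores.get, reverse=True)[:5]
print(most_premium)
```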

Identifying vegetarian alternatives

Frozen Burgers - Beef + Tofu = ?

Rank  Product description          Cosine similarity
1     Frozen entrees                0.518292
2     Frozen meat (vegetarian)      0.484718
3     Frozen meal combo/dinners     0.464290
4     Eggs                          0.462903


The most similar products to the target vector tend to be frozen and/or meat-free
Suggests there are "frozen" and "vegetarian" directions within the embedding space

Applications

Hands-free learning

We can use the trained document vectors to automatically generate modelling features at the basket and customer level
We can then grid-search to find the best configuration for predicting a target variable (see the sketch below)
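A minimal sketch of this workflow, assuming the documents were tagged with customer IDs so that `model.dv` holds one vector per customer; the target labels below are random placeholders, not real coupon-redemption data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Customer embeddings as "hands-free" model features
customer_ids = list(model.dv.index_to_key)
X = np.vstack([model.dv[customer_id] for customer_id in customer_ids])
y = np.random.randint(0, 2, size=len(customer_ids))  # placeholder target

# Grid-search a simple classifier on the embedding features, scored by AUC
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```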

How well does this work?

A model using sparse features to predict coupon redemptions achieved an AUC of 0.74
A model using the customer embeddings achieved an AUC of 0.85

The embeddings also showed predictive power for demographics such as age and number of kids

Conclusions

The model looks great



Very visual application of machine learning in retail
The model appears to understand product similarity well
Analogical reasoning results seem intuitive
Scales better than rival algorithms (e.g. NMF)

Harder with very low-velocity (infrequently purchased) items
Need some more automated ways of evaluating unsupervised models

This model has many applications


Global deployment of model in collaboration with partners

Hands-free feature generation for models and segmentations
Identifying alternative & supplementary products during store ranging
Recommending new items in personalised recommender systems

A significant leap forward in the endeavour to describe similarity

Thank You

@adamnhornsby