Generating text similarity scores using BERT.

Text and sentence similarity has long been a popular problem in NLP, and with the release of libraries like Sentence Transformers and models like BERT, it has become very easy to build a text similarity generator. That said, the Hugging Face documentation can feel overwhelming to someone new to NLP (it was pretty difficult for me 😄).

So this article aims to help newcomers get started with NLP and Sentence Transformers.

Procedure:-

First, install the sentence-transformers and scikit-learn libraries:

pip install sentence-transformers
pip install scikit-learn

Once you have installed these libraries, import them like this:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

After importing these libraries, you need to create the embeddings. An embedding is a learned representation, generally in the form of a vector, where words or sentences that have a similar meaning or are somehow related end up with similar vectors. Earlier, to create embeddings you had to train a machine learning model yourself and then use its weights to generate them, but Sentence Transformers gives you access to some of the very best pre-trained models out of the box.
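For instance, here is a minimal sketch of that idea using the same pre-trained model we will load later in this article (the 768-dimensional output is specific to BERT-base models):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
vector = model.encode("dogs")
print(vector.shape)  # (768,): a single 768-dimensional vector for the input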

To read more about word embeddings, see this article: https://machinelearningmastery.com/what-are-word-embeddings/

For now we are going to focus on BERT (Bidirectional Encoder Representations from Transformers), a model that uses an attention mechanism to learn contextual relations between the words in a text, from which a word or sentence vector can be generated.

To read more about BERT, see this article: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

# Load a pre-trained BERT-based sentence embedding model
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

a = "I love dogs"
b = "I hate dogs"

# Encode both sentences into fixed-length vectors
sentences = [a, b]
sentence_embeddings = sbert_model.encode(sentences)

First we initialize the model, then we create a list of our two sentences and encode it with the model. This gives us the sentence embeddings of both sentences.
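As a quick sanity check, encode returns one vector per input sentence (again, the 768 dimensions are specific to BERT-base models):

print(sentence_embeddings.shape)  # (2, 768): two sentences, one 768-dimensional vector each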

Once the sentence embeddings have been created, use the cosine_similarity function to get the cosine similarity between the two sentences. Cosine similarity measures the angle between the two vectors and gives an approximate similarity between the two sentences: the higher the cosine similarity, the more similar they are.
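If you are curious what cosine_similarity does under the hood, it simply takes the dot product of the two vectors and divides it by the product of their lengths. Here is a minimal numpy sketch of the same calculation:

import numpy as np

v1, v2 = sentence_embeddings
manual_score = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(manual_score)

The sklearn function below does this for every pair of sentences in one call.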

cos_sim = cosine_similarity(sentence_embeddings)

Note that the cosine_similarity function returns a matrix containing the similarity score of every sentence paired with every other sentence, including itself, so the diagonal entries are always 1.
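To pull out the score for our two sentences, just index into that matrix:

print(cos_sim)        # 2x2 matrix; the diagonal is each sentence compared with itself (1.0)
print(cos_sim[0][1])  # similarity between sentence a and sentence b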

That’s it. Just like that, in less than 15 lines of code and about 10 minutes, you have created a text similarity generator that will output the similarity between any two pieces of text.

Sources:-