With the advent of chatbots, training computers to read, understand and write language has become big business. The training may seem easy at first, but as you start your journey with Natural Language Processing (NLP) you realize that surmounting the challenges is no easy task. Sentence similarity is among the toughest of these problems.
Why is it so hard to teach language to a computer?
After all, a child picks up languages intuitively. Computers, however, are good at number crunching but are not capable of reading or understanding human language directly.
For example, when we say ‘An apple a day keeps you healthy’, how do we teach the computer that ‘apple’ in this context is a fruit and not the company Apple Inc.?
We take great pains to transform words and sentences into numerical representations and then train the computers with these numerical representations of language.
These representations have to capture the meaning of words (semantics), how they occur in a sentence (syntax), the context and the intertwining of words.
A new phase
NLP got a big boost when Tomas Mikolov and his colleagues at Google introduced word2vec. Word2vec is a method that converts a word into a vector in an n-dimensional space; this representation is referred to as a word embedding.
GloVe is another method, from Stanford, that creates vector representations of words.
Another popular and more recent method is fastText from Facebook. These methods make it easy for the computer to compare words based on their context and meaning. Armed with this technique, let’s see if we can extend the comparison from words to complete sentences.
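To make the idea of comparing words through their vectors concrete, here is a minimal sketch. The three-dimensional vectors are invented for illustration and do not come from word2vec, GloVe or fastText; real embeddings are learned from large text corpora and usually have a few hundred dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_vectors = {
    "apple":  [0.9, 0.1, 0.3],
    "orange": [0.8, 0.2, 0.1],
    "car":    [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

fruit_score = cosine(toy_vectors["apple"], toy_vectors["orange"])
vehicle_score = cosine(toy_vectors["apple"], toy_vectors["car"])

print(f"apple vs orange: {fruit_score:.2f}")
print(f"apple vs car:    {vehicle_score:.2f}")
```

With these made-up numbers, ‘apple’ scores closer to ‘orange’ than to ‘car’, which is the kind of comparison trained embeddings are meant to support.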
spaCy, a popular open source NLP library, provides an out-of-the-box feature to find the similarity between sentences.
The numbers represent sentence similarity. The greater the similarity value, the more similar the sentences are. The first two sentences are more similar since a city and a country occur in both of them. However, why do we get similarities of 0.65 and 0.55 with the third one?
Let’s play with spaCy’s sentence similarity models to figure out what’s going on behind the scenes.
Test Case 1: Very similar sentences
The sentences used are: ‘Apple sells iOS smartphones.’ and ‘Google sells Android smartphones.’
The two models give around 0.85 similarity which is decent.
Let’s fool the model by stating something that is not a fact.
The sentences used are: ‘Apple sells iOS smartphones.’ and ‘Google sells iOS smartphones.’
The two models give around 0.94 similarity, which makes sense semantically, since the sentences now differ by a single word, even though the second sentence is factually incorrect.
Notice how replacing Android with iOS in the second sentence raised the similarity from 0.85 to 0.94.
The models used here have not been trained on ‘facts’; they have been trained to compare sentences semantically. Would a human have performed better? Yes, but only if the human knows that Google does not sell iOS phones. So, if this knowledge were imparted to the model, it could in principle perform at the same level as a human.
Test Case 2: Somewhat similar sentences
The sentences used are: ‘Apple invented iOS.’ and ‘Google bought Android.’
The two models give 0.73 and 0.62 similarity, which may be acceptable.
Test Case 3: Dissimilar sentences
The sentences used are: ‘Apple invented iOS.’ and ‘Apple a day keeps you healthy.’
The two models give 0.6 similarity, which is high for two unrelated sentences.
The results from the above tests suggest that we still have a long way to go before these models do an acceptable job.
By default, spaCy averages the word vectors in a sentence to come up with a sentence-level vector, that is, a vector that represents the meaning of the sentence in an n-dimensional vector space.
It then compares the sentence-level vectors of the two sentences using cosine similarity to come up with the similarity number.
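The averaging and comparison steps described above can be sketched in a few lines. This is not spaCy’s code: the word vectors are hypothetical three-dimensional values chosen by hand, the tokenizer is a naive whitespace split, and the helper names `sentence_vector` and `similarity` are made up for this example.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration.
# They are not taken from spaCy or from any trained model.
TOY_VECTORS = {
    "apple":       [0.9, 0.1, 0.0],
    "google":      [0.7, 0.3, 0.1],
    "sells":       [0.1, 0.8, 0.1],
    "ios":         [0.8, 0.0, 0.5],
    "android":     [0.3, 0.1, 0.9],
    "smartphones": [0.4, 0.2, 0.8],
}

def sentence_vector(sentence):
    """Average the vectors of all known words in the sentence."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in sentence")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity(a, b):
    """Cosine similarity between the averaged vectors of two sentences."""
    return cosine(sentence_vector(a), sentence_vector(b))

base = "Apple sells iOS smartphones."
score_android = similarity(base, "Google sells Android smartphones.")
score_ios = similarity(base, "Google sells iOS smartphones.")
score_shuffled = similarity(base, "Smartphones iOS sells Apple.")

print(f"vs Google/Android: {score_android:.2f}")
print(f"vs Google/iOS:     {score_ios:.2f}")
print(f"vs shuffled words: {score_shuffled:.2f}")
```

Even with toy numbers, the pattern from Test Case 1 shows up: swapping Android for iOS raises the score, because the two sentences then share three of their four words. The shuffled sentence scores essentially 1.0, since an average ignores word order altogether.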
Cosine in sentence similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them.
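Written out for two n-dimensional vectors, the definition reads as follows. The symbols A, B and θ are chosen here for illustration.

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

As a small worked case, A = (1, 0) and B = (1, 1) give a dot product of 1 and norms of 1 and √2, so the similarity is 1/√2, or about 0.71. Vectors pointing in the same direction score 1, and perpendicular vectors score 0.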
Isn’t this non-intuitive? Would a human compare sentences in the same manner as this?
Recent developments in Deep Learning have shown promise that semantic similarity at a sentence level can be solved with better accuracy using recurrent and recursive neural networks. This will be the subject of discussion in a future post.
In the meantime, if you’d like to know more about NLP, machine learning or the chatbot technology then feel free to reach out. We have some of the best in the industry at our office who can help you with it all. Their expertise will better help you understand the technology and make use of it at work.
We have an in-house chatbot product named Engati that offers a free chatbot platform. Businesses can build their own chatbot for free and upgrade as their needs grow. Engati supports more than 20 international languages, and businesses can select the one they prefer. The sooner you get started the better, because a chatbot improves over time as it matures with more data.