How to use BERT for sentiment analysis?

BERT stands for Bidirectional Encoder Representations from Transformers. Before we get to it, first let’s understand what we mean by chatbots with sentiment analysis and why sentiment analysis is so important in chatbot development.

Why is Sentiment Analysis crucial for Chatbots?

Customer support Chatbots have become an integral part of businesses to improve customer experience. The current focus in the industry is to build a better chatbot enriching the human experience. Comprehension of customer reactions thus becomes a natural expectation., To achieve this, the business chatbot needs to understand the computational linguistics, context, and tone of the customer.

The opinion or sentiment expressed in a document or sentence can be binary (positive, negative) or fine-grained (positive, negative, strong-positive, strong-negative, neutral, mixed). Emotion analysis can go beyond polarity that includes the classification of emotions (angry, sad, happy).

Enabling sentiment analysis on chatbots not only helps tweak the responses according to user mood but also helps in advance analysis of services and breakdowns.

‍

How does sentiment analysis work?

Sentiment analysis on chat messages is not easy as opinions can carry sarcasm, ambiguity, and implicit negation. Some implicit negations like “When can I expect an answer” or a query like “How to cancel the order?” convolute the analysis as they are not directly negative but affect the business. This brings in the need to have a sentiment analyzer trained on domain data.

‍

‍

Data collection

In the case of chatbots that cater to multiple domains, variance in the data can be high.

Any ML or DL model needs sufficiently enough data for it to learn patterns, extracting features out of it. The optimal number of data points for training is not a magic number. Underfitting or Overfitting can lead to poor generalization.

‍

Pre-processing

Clean the data to remove URLs, emails, digits, special characters as they do not contribute in sentiment detection but produce noise. For a model to imitate how a human comprehends sentiment, It does not side with the generally followed practice of removing stop-words and lemmatisation in NLP. “How can I trust you” v/s “I trust you”, “Do not make me angry” v/s “I am not angry”, “You could be smarter” v/s “You are smart”

Imagine these sentences reduce to the same thing while carrying completely different sentiments.

‍

Labeling

When we talk about sentiments, often we forget about a neutral class, which is nothing but a lack of sentiment. It is okay to believe that the model should not classify it as Positive or Negative based on confidence scores.

But keeping a separate Neutral class helps the model clearly distinguish between the two classes and improves accuracy.

Labelling the dataset is one of the most crucial steps as it drives the learning.

Rule-based Natural Language Processing Systems like VADER can be used for first-level sentiment tracking to make Human review for the next stage easier.

‍

Balancing the dataset

Learning on an Imbalanced dataset tends to favor the majority class, which may lead to misleading accuracies. This is particularly problematic when we are interested in the correct classification of a minority class.

Most of the chatbots today are for customer support; hence messages are highly variant queries that carry no sentiment at all.

Under-sample the Neutral samples (on the basis of frequency distribution) and oversample the others to balance the dataset.

Oversample the minority classes using Data Augmentation techniques. To improve generalization, Data Augmentation is a well-known strategy. Augmenting meaningful Positive and Negative samples using Back Translation with multiple languages and synonym replacement for the minority classes helps reduce manual labeling of more texts by generating new similar texts, and increases the accuracy significantly.

‍

Modeling

To learn a good representation of the sentence, Keras trainable embeddings along with models like CNN and LSTMs can be used. Tokenizers like sentencepiece and wordpiece can handle misspelled words.

Optimized CNN networks with embedding_dimension: 300, filters: [32, 64], kernels: [2, 3, 5], kernel_initialization: ‘he_uniform’, pooling strategy: (average and max on concatenated CNN layers) yield accuracy as high as 92%. This network could outperform Bidirectional LSTMs for the task.

‍

Despite the high accuracy, both fail to generalize well and handle negation.

‍

A good sentence representation

Using pre-trained embeddings like Glove and Word2Vec didn’t help in boosting the performance as they are not contextual. A word can carry different meanings in different contexts. Like “block” with reference to “Block D” vs “The account is blocked” can mean two very different things. Contextual Embeddings are thus important to perceive the correct meaning, and training those requires a huge amount of data.

For any business, the availability of data and resources to model it are the major concerns. It is not practical to continually feed the model with more and more data. This requirement opened the gates for Transfer Learning.

‍

Carrying out sentiment analysis with BERT

Bidirectional Encoder Representations from Transformers

Bert is a Contextual model. Instead of generating a single word embedding representation for each word in the vocabulary. It generates the representation of each word that is based on the other words in the sentence.

Applying the bidirectional training of Transformer, a popular attention model, to masked language modelling can have a deeper sense of language context and flow than single-direction language models. It is pre-trained on massive Wikipedia and book corpus datasets. BERT only uses the Encoder of the transformer model. A total of N encoder blocks are stacked together to generate the Encoder’s output.

A specific block is in charge of finding and encoding patterns or relationships between the input representations. BERT uses 12 separate attention mechanisms for each layer. Each token can focus on 12 distinct aspects of other tokens. Each head can focus on a different kind of constituent combinations.

‍

‍

BERT is a pre-trained model that expects input data in a specific format. Special tokens to mark the beginning ([CLS]) and separation/end of sentences ([SEP]).

BERT passes each input token through a Token Embedding layer so that each token is transformed into a vector representation, Segment Embedding Layer (to distinguish different sentences), and Position Embedding Layer (to show token position within the sequence).

BERT tokenizer has a WordPiece model. It greedily creates a fixed-size vocabulary. Its vocabulary size limits to 30,000. It includes words, subwords (front and back), and characters.

The detailed working of a Transformer model and BERT by Google.

‍

Fine-tuning Bert

Bert can be used as a feature extractor, where meaningful sentence representation can be constructed by concatenating the output of the last few layers or averaging out the output of the last layer of the pre-trained model.

Fine-tuning with respect to a particular task is very important as BERT was pre-trained for next word and next sentence prediction. BERT Sentiment analysis can be done by adding a classification layer on top of the Transformer output for the [CLS] token.

The [CLS] token representation becomes a meaningful sentence representation if the model has been fine-tuned, where the last hidden layer of this token is used as the “sentence vector” for sequence classification.

‍

Architecture of BERT analysis — BERT architecture

‍

Pre-trained Model: bert-base-uncased (12-layer, 768-hidden, 12-heads, 110M parameters)

Processor: Custom or Cola Processor with Label_list: [“Neutral”, “Negative”, “Positive”]Dataset: ChatBot conversations

Fine-tuning parameters: max_seq_length= 30batch_size= 32learning_rate=2e-5num_train_epochs=4.0warmup_proportion=0.1 (The learning rate increases linearly over the warm-up period.If the target learning rate is p and the warm-up period is n, then the first batch iteration uses 1*p/n for its learning rate; the second uses 2*p/n , and so on until we hit the nominal rate at iteration n)

‍

Results

eval_accuracy = 0.93949044eval_loss = 0.22454698Test_accuracy = 0.9247

Experiments with the output of the last encoder:[CLS] token representation: eval_acc 0.939Global Max Pooling: eval_acc 0.9458Global Average Pooling: eval_acc 0.9522

This model generalizes well on un-seen cases and can capture negation cases or implicit negation even with less training data.

‍

Interpretability and Explainability

Deep Learning models are thought of as a black box, which is unlikely to hold true anymore. You can interpret false cases and explain them by looking at the attention on each token that lead to a particular prediction.Much work has been done on the pre-trained models for Next Sentence Prediction like BertViz in pytorch.

In BERT Sentiment Analysis, similar analysis on self-attention layers can be done.

Algorithm:

Take the attention weights from the last multi-head attention layer assigned to the [CLS] token.
Average each token across multiple heads
Normalize across tokens

Visualization

It is hard for a human to perceive large n dimensional arrays.

However, all of it is not so formidable when put up into a visualization.

One can do a Synopsis with the visualization of the intensity or attention of each token in a sentence.

‍

BERT is a trained Transformer Encoder stack, with twelve in the Base version, and twenty-four in the Large version.
- Towards Data Science.

‍

What is sentiment analysis example?

‍

‍

Conclusion

Sentiment analysis is vital to creating better customer experiences. Without NLP sentiment analysis, your bot might annoy your customers, appearing rude or inconsiderate.

‍

References

Attention Is All You Need; Vaswani et al., 2017.
Pre-training of Deep Bidirectional Transformers for Language Understanding; Devlin et al., 2018.
XLNet: Generalized Autoregressive Pretraining for Language Understanding; Zhilin Yang., 2019
A Robustly Optimized Pretraining Approach; Yinhan Liu., 2019
ERNIE: Enhanced Language Representation with Informative Entities; Zhengyan Zhang., 2019
A tool for visualizing multi-head self-attention in the model; Jesse Vig., 2019
Bert as Service; Hanxio

‍

3x your revenue with Chatbots and Live Chat

Schedule a demo

‍

How to use BERT for sentiment analysis?

Why is Sentiment Analysis crucial for Chatbots?