
Text analysis with BERT and Conditional Random Fields for prediction support

Context

Language is a vital part of our lives, connecting us with each other, our past, and the future. It's not just a tool for communication; it's how we pass down stories through generations. The longstanding desire for computers to understand our language has driven researchers to create advanced algorithms and models.

Recent technological leaps have made these algorithms more sophisticated, precise, and efficient. As data scientists, our task is to keep pace with these advancements. The growing volume of data we handle requires us to rethink our strategies, seeking better, more complex, and advanced approaches. This is the ongoing work of researchers, and as data scientists, our role is to comprehend and maximize the potential of these innovations.

Natural Language Processing (NLP) is a field focused on exploring how machines can understand and manipulate human language for practical purposes. Researchers aim to leverage knowledge from various disciplines, including computer science, linguistics, and artificial intelligence, to develop tools that enable computers to comprehend and use natural language. NLP applications span areas such as machine translation, text processing, user interfaces, and more.

In this context, we specifically examined the intersection of Machine Learning and NLP, evaluating the interpretive behavior crucial for successfully comprehending a Named Entity Recognition (NER) dataset.


Models overview

CRF

The bag-of-words (BoW) approach works well for many text classification problems. It assumes that the presence or absence of words matters more than their order. However, there are problems, such as named entity recognition and part-of-speech tagging, where the sequence of words matters just as much, if not more. Let's now look at how a CRF (Conditional Random Field) is formulated. Below is the formula for a CRF, where Y is the hidden state (for example, the entity tag or part of speech) and X is the observed variable (in our example, the word and the words around it).
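For reference, a linear-chain CRF is commonly written as follows (the notation in the original figure may differ slightly):

    P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \right)

    Z(X) = \sum_{Y'} \exp\left( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \right)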

  1. Normalization: You may have noticed that there are no probabilities on the right-hand side of the equation, only weights and features. Since the result should be a probability, normalization is needed: the normalization constant Z(x) sums over all possible state sequences so that the probabilities add up to 1.

  2. Weights and features: This component can be thought of as the logistic regression formula with corresponding weights and features. The weights are estimated by maximum likelihood estimation, and the features are defined by us (a brief code sketch follows below).

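To make this concrete, here is a minimal sketch of training a CRF for entity tagging in Python. It assumes the sklearn-crfsuite library and a toy corpus, so it illustrates the idea rather than the exact setup used in this work.

    # Minimal CRF tagging sketch (assumes: pip install sklearn-crfsuite).
    # Features are hand-crafted; the weights are estimated when calling fit().
    import sklearn_crfsuite

    def word2features(sent, i):
        word = sent[i][0]
        features = {
            "word.lower()": word.lower(),
            "word.istitle()": word.istitle(),
            "word.isdigit()": word.isdigit(),
            "suffix3": word[-3:],
        }
        if i > 0:
            features["-1:word.lower()"] = sent[i - 1][0].lower()
        else:
            features["BOS"] = True  # beginning of sentence
        if i < len(sent) - 1:
            features["+1:word.lower()"] = sent[i + 1][0].lower()
        else:
            features["EOS"] = True  # end of sentence
        return features

    # Toy corpus: each sentence is a list of (token, entity tag) pairs.
    train_sents = [
        [("Paris", "B-LOC"), ("is", "O"), ("in", "O"), ("France", "B-LOC")],
        [("Alice", "B-PER"), ("visited", "O"), ("Berlin", "B-LOC")],
    ]
    X_train = [[word2features(s, i) for i in range(len(s))] for s in train_sents]
    y_train = [[tag for _, tag in s] for s in train_sents]

    # L-BFGS training with L1/L2 regularization (c1, c2 are example values).
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))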

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model designed to capture the context of a word from both its left and right sides. BERT marks a new era for NLP, and its high accuracy rests on two fundamental ideas. The BERT recipe consists of two main steps: pre-training and fine-tuning. In the first stage, BERT is trained on unlabeled data using two unsupervised pre-training tasks:

  1. Masked LM: Here, a deep bidirectional model is trained by randomly masking some of the input tokens, so the model cannot directly see the word it has to predict (a small illustration follows this list).

  2. Next sentence prediction: For each pre-training instance, 50% of the time the sentence S2 that follows S1 is the actual next sentence and is labeled "IsNext"; the other 50% of the time S2 is a random sentence from the corpus and is labeled "NotNext."
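As a small illustration of the masked-language-modeling idea, the snippet below asks a pre-trained BERT to fill in a masked token. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by this article.

    from transformers import pipeline

    # Ask a pre-trained BERT to predict the masked token from its left and right context.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in fill_mask("The capital of France is [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))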

After this, it's time for fine-tuning. In this phase, all model parameters are refined using labeled data from "downstream tasks." Each downstream task gets its own fine-tuned model, even though they all start from the same pre-trained parameters. BERT can be applied to various tasks, such as named entity recognition and question answering. To use BERT, one can choose either TensorFlow or PyTorch; in our case, we will be using PyTorch.
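Below is a minimal sketch of what fine-tuning BERT for NER (token classification) can look like in PyTorch. It assumes the Hugging Face transformers library; the label set, toy example, and hyperparameters are placeholders rather than the exact configuration used here.

    import torch
    from transformers import BertTokenizerFast, BertForTokenClassification

    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # example tag set
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(labels))

    # One toy training example: pre-tokenized words with their entity tags.
    words = ["Alice", "lives", "in", "Paris"]
    word_labels = [1, 0, 0, 3]  # indices into `labels`

    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    # Align labels with word pieces; special tokens get -100 (ignored by the loss),
    # and each word piece inherits the label of its word.
    aligned = [-100 if idx is None else word_labels[idx] for idx in encoding.word_ids()]
    encoding["labels"] = torch.tensor([aligned])

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for _ in range(3):  # a few gradient steps on the toy example
        outputs = model(**encoding)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.eval()
    predictions = model(**encoding).logits.argmax(dim=-1)
    print(predictions)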

Training and evaluation of models

Results

The comparison between BERT and the CRF classifier shows that both perform well on Named Entity Recognition (NER). The difference is small, about 0.05 percentage points, but BERT consistently outperforms the CRF model. The CRF is the more stable classifier across different hyperparameters, giving lower-variance results. This agrees with theoretical expectations and previous results on NER tasks, where CRFs have long been a popular classifier.

[Figure: comparison of BERT and CRF results]

For more information or details, reach out to me on LinkedIn.

© 2023 Fatimatou El Moussaoui
