Scientific keyword extraction with SciBERT and KeyBERT

1. What is SciBERT?

Keyword extraction is a text analysis technique that automatically extracts the words and phrases most relevant to an input text. There are many approaches to keyword extraction:

  • Statistical approaches, e.g. word frequency, n-gram statistics, term frequency-inverse document frequency (TF-IDF), and Rapid Automatic Keyword Extraction (RAKE); a TF-IDF sketch follows this list.
  • Linguistic approaches: morphological or syntactic information (such as the part of speech of words or the relations between words in a dependency grammar representation of sentences) is used to determine which keywords should be extracted [3].
  • Graph-based approaches, e.g. the TextRank model.
  • Machine learning approaches, e.g. Support Vector Machines (SVM).
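
As a quick illustration of the statistical family, here is a minimal TF-IDF sketch (assuming scikit-learn is installed; any TF-IDF implementation works the same way, and the toy documents are mine):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Supervised learning maps an input to an output based on example pairs.",
    "Unsupervised learning finds structure in unlabeled data.",
]

# Weight each term by how frequent it is in a document
# and how rare it is across the collection.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)

# Rank the terms of the first document by their TF-IDF score.
terms = vectorizer.get_feature_names_out()
scores = tfidf[0].toarray().ravel()
print(sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:5])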

In November 2018, Google open-sourced BERT as state-of-the-art pre-training for natural language processing [4]. BERT stands for Bidirectional Encoder Representations from Transformers; it is a bidirectional transformer model that allows us to transform phrases and documents into vectors that capture their meaning.
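
For instance, embedding a sentence into such a vector takes only a few lines with the flair library, which we install later in this post (a sketch; the model name bert-base-uncased and the example sentence are just illustrative choices):

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# Load a pretrained BERT model as a document-level embedding.
embedding = TransformerDocumentEmbeddings('bert-base-uncased')

# Embed a sentence; flair stores the vector on the Sentence object.
sentence = Sentence('Keyword extraction finds the most relevant phrases in a text.')
embedding.embed(sentence)
print(sentence.embedding.shape)  # e.g. torch.Size([768])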

Figure 1: BERT [5]

SciBERT is a BERT model trained on scientific text, developed by the Allen Institute for Artificial Intelligence (AI2). It achieves state-of-the-art performance on a wide range of scientific-domain natural language processing (NLP) tasks [1]. You can find more information on the evaluation in the paper SciBERT: A Pretrained Language Model for Scientific Text.

In this post, we will use the PyTorch version, which can be downloaded from the Hugging Face model hub. If you work with TensorFlow, you can find the TensorFlow version from Google Research.
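
Loading the PyTorch weights directly looks like this (a sketch with the transformers library; the rest of this post loads the same model through flair instead, and the example sentence is mine):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# Tokenize a sentence and run it through SciBERT.
inputs = tokenizer('SciBERT was pretrained on scientific text.', return_tensors='pt')
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, token count, hidden size)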

To keep the implementation simple, KeyBERT is a good option. It is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document [2]. KeyBERT supports Flair embeddings via keybert[flair], which makes it easy to plug in the SciBERT model.
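
To see how minimal the API is, here is KeyBERT with its default embedding backend (a sketch; the input sentence is just an example, and we swap in SciBERT in the next section):

from keybert import KeyBERT

# KeyBERT with its default sentence-transformers backend.
kw_model = KeyBERT()
print(kw_model.extract_keywords('Transformers turn documents into vectors that capture their meaning.'))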

2. Scientific keyword extraction with SciBERT and KeyBERT

First, create and activate the virtual environment for the project:

# virtualenv scibert-env
# scibert-env\Scripts\activate

Install keybert[flair] using pip

(scibert-env) # pip install keybert[flair]
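
Note: on shells such as zsh, square brackets are interpreted as glob patterns, so quote the package name:

(scibert-env) # pip install "keybert[flair]"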

In this example, I use the following description as the input text:

"Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.[1] It infers a function from labeled training data consisting of a set of training examples.[2] In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias)."

Now, we are ready to run the keyword extraction:

from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
         """

# Load SciBERT through flair and use it as KeyBERT's embedding backend.
scibert = TransformerDocumentEmbeddings('allenai/scibert_scivocab_uncased')
model = KeyBERT(model=scibert)

# Extract two-word keyphrases, dropping English stop words and using
# Maximal Marginal Relevance (MMR) to diversify the results.
keywords = model.extract_keywords(
    doc,
    keyphrase_ngram_range=(2, 2),
    stop_words='english',
    use_mmr=True,
    diversity=0.7,
)
print(keywords)

And the extraction result:

[('supervised learning', 0.7159),
 ('data consisting', 0.4804),
 ('function used', 0.5651),
 ('algorithm analyzes', 0.6165),
 ('produces inferred', 0.6131)]

You can modify these parameters (an example follows the list):

  • "keyphrase_ngram_range" sets the length of the resulting keywords/keyphrases.
  • "use_mmr" (Maximal Marginal Relevance) and "diversity" control how diverse the results are.

3. Troubleshooting

- If you get the error "ERROR: torch has an invalid wheel, .dist-info directory not found" when installing KeyBERT, try running this command first:

(scibert-env) # pip install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

and then install keybert[flair] with pip as before.

- If you get the error "ImportError: cannot import name 'NoReturn'" with Python 3.6.1, try upgrading your Python version, e.g. to 3.8.x.

- If you get the error "Transformer: Error importing packages. ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler'", try changing the torch version, for example as shown below.
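
One option is to pin torch to the CPU build from the previous step (a sketch; pick a version that matches your transformers release):

(scibert-env) # pip install torch==1.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html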

 

References

[1] https://github.com/allenai/scibert

[2] https://github.com/MaartenGr/KeyBERT/

[3] https://monkeylearn.com/keyword-extraction/

[4] https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

[5] https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270