Automatic text summarization refers to a family of methods that use algorithms to condense a body of text while preserving its key points. Although it may not receive as much attention as other machine learning successes, this field has advanced steadily, and systems capable of extracting the key concepts from a text while maintaining its overall meaning have the potential to transform a variety of industries, including banking, law, and even healthcare.
Types of Text Summarization
There are two main approaches to automatic text summarization:
- Extractive summarization
- Abstractive summarization
Extractive Summarization
Extractive summarization algorithms generate a summary by selecting and combining key passages from the source material. Rather than writing new sentences as a human would, these models identify and extract the most essential sentences from the original text.
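To make the idea concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is illustrative only, and much simpler than the TextRank approach covered next: it scores each sentence by the frequencies of the words it contains and keeps the top-scoring sentences in their original order.
Python
from collections import Counter
import re

def naive_extractive_summary(text, num_sentences=2):
    # Split into sentences and count word frequencies across the document
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence as the sum of its word frequencies
    scores = {
        i: sum(freq[w] for w in re.findall(r'\w+', s.lower()))
        for i, s in enumerate(sentences)
    }
    # Keep the top-scoring sentences, restored to document order
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:num_sentences])
    return ' '.join(sentences[i] for i in top)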
A popular choice for extractive summarization is the TextRank algorithm, a graph-based ranking method that is highly suitable for the task. Let's explore how it works with a sample text summarization scenario.
Utilizing TextRank Algorithm for Extractive Text Summarization
The TextRank implementation used here is provided as a spaCy pipeline component. spaCy is an excellent Python library for addressing challenges in natural language processing, and pytextrank is a spaCy extension that implements the TextRank algorithm. As the results below show, TextRank can produce reasonably satisfactory summaries. Keep in mind, however, that extractive techniques merely return a filtered version of the original text, retaining the sentences that were not eliminated, rather than generating new text to summarize the information in the source.
Prerequisite
spaCy
To install spaCy and download the English language model, run the commands below (the leading ! is for notebook cells; omit it in a terminal):
!pip install spacy
To download the English language model:
!python3 -m spacy download en_core_web_lg
TextRank
To install pytextrank:
!pip install pytextrank
Text Summarization
This code uses spaCy and PyTextRank to automatically summarize a given text. It loads the spaCy language model, adds the TextRank component to the pipeline, processes a lengthy text, and prints a summary built from the text's top-ranked phrases and sentences. The summary is limited to 2 phrases and 2 sentences.
Python
import spacy
import pytextrank

# Load the spaCy language model and add the TextRank component to the pipeline
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textrank")
example_text = """Deep learning (also known as deep structured learning) is part of a
broader family of machine learning methods based on artificial neural networks with
representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to
and in some cases surpassing human expert performance. Artificial neural networks
(ANNs) were inspired by information processing and distributed communication nodes
in biological systems. ANNs have various differences from biological brains. Specifically,
neural networks tend to be static and symbolic, while the biological brain of most living organisms
is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple
layers in the network. Early work showed that a linear perceptron cannot be a universal classifier,
but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can.
Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size,
which permits practical application and optimized implementation, while retaining theoretical universality
under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely
from biologically informed connectionist models, for the sake of efficiency, trainability and understandability,
whence the structured part."""
print('Original Document Size:', len(example_text))
doc = nlp(example_text)

# Print the top-ranked sentences; len(sent) is the sentence length in tokens
for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=2):
    print(sent)
    print('Summary Length:', len(sent))
Output:
Original Document Size: 1808
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.
Summary Length: 76
Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue.
Summary Length: 27
Abstractive Summarization
Abstractive summarization techniques emulate human writing by generating entirely new sentences to convey key concepts from the source text, rather than merely rephrasing portions of it. These fresh sentences distill the vital information while eliminating irrelevant details, often introducing vocabulary absent from the original text. Transformer models have recently come to dominate the natural language processing field, displacing earlier abstractive approaches built on recurrent neural networks (RNNs).
What Are Transformers?
Transformers are a family of architectures that use an encoder-decoder structure to transform an input sequence into an output sequence. What sets them apart is their distinctive "self-attention" mechanism, along with several other enhancements such as positional encoding.
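As a rough illustration of the core idea, here is a minimal NumPy sketch of scaled dot-product self-attention. It is a simplification: real Transformers add learned query/key/value projections, multiple attention heads, and positional encodings.
Python
import numpy as np

def self_attention(X):
    # X: (sequence_length, model_dim) matrix of token embeddings
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise similarity between tokens
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token becomes a weighted mix of all tokens it attends to
    return weights @ X

X = np.random.rand(4, 8)  # 4 tokens, 8-dimensional embeddings
print(self_attention(X).shape)  # (4, 8)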
NOTE: Not all Transformers are intended for use in text summarization. Let's delve into the recently released model called PEGASUS, which appears to excel in terms of output quality for text summarization.
PEGASUS shares similarities with other transformer models, with its primary distinction lying in a unique approach used during the model's pre-training. Specifically, the most crucial sentences in the training text corpora are "masked" (hidden from the model) during PEGASUS pre-training. The model is then tasked with generating these concealed sentences as a single output sequence.
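The following toy sketch illustrates the spirit of this gap-sentence objective. It scores sentence importance by simple word overlap with the rest of the document (an assumption made here for illustration; actual PEGASUS pre-training selects sentences with a ROUGE-based criterion over large corpora) and replaces the chosen sentence with PEGASUS's <mask_1> token.
Python
import re

def make_gap_sentence_example(text):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    def overlap(i):
        # Word overlap between sentence i and the rest of the document
        words = set(re.findall(r'\w+', sentences[i].lower()))
        rest = ' '.join(s for j, s in enumerate(sentences) if j != i)
        return len(words & set(re.findall(r'\w+', rest.lower())))

    # Mask the "most important" sentence; the model must regenerate it
    target_idx = max(range(len(sentences)), key=overlap)
    source = ' '.join('<mask_1>' if i == target_idx else s
                      for i, s in enumerate(sentences))
    return source, sentences[target_idx]  # (masked input, target sentence)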
Prerequisite
To run the text summarization code below, first install the following Python libraries:
!pip install transformers
!pip install sentencepiece
!pip install torch
This code uses the Hugging Face Transformers library to summarize text with the PEGASUS model. It loads the pretrained tokenizer and model, tokenizes the input text, generates a summary, and prints it. It then demonstrates the same task using the high-level summarization pipeline.
Python
from transformers import pipeline
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
# Pick model
model_name = "google/pegasus-xsum"
# Load pretrained tokenizer
pegasus_tokenizer = PegasusTokenizer.from_pretrained(model_name)
example_text = """
Deep learning (also known as deep structured learning) is part of a broader family of machine learning
methods based on artificial neural networks with representation learning.
Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as
deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis,
material inspection and board game programs, where they have produced results
comparable to and in some cases surpassing human expert performance.
Artificial neural networks (ANNs) were inspired by information processing and
distributed communication nodes in biological systems. ANNs have various differences
from biological brains. Specifically, neural networks tend to be static and symbolic,
while the biological brain of most living organisms is dynamic (plastic) and analogue.
The adjective "deep" in deep learning refers to the use of multiple layers in the network.
Early work showed that a linear perceptron cannot be a universal classifier,
but that a network with a nonpolynomial activation function with one hidden layer of
unbounded width can. Deep learning is a modern variation which is concerned with an
unbounded number of layers of bounded size, which permits practical application and
optimized implementation, while retaining theoretical universality under mild conditions.
In deep learning the layers are also permitted to be heterogeneous and to deviate widely
from biologically informed connectionist models, for the sake of efficiency, trainability
and understandability, whence the structured part."""
print('Original Document Size:',len(example_text))
# Define PEGASUS model
pegasus_model = PegasusForConditionalGeneration.from_pretrained(model_name)
# Create tokens
tokens = pegasus_tokenizer(example_text, truncation=True, padding="longest", return_tensors="pt")
# Generate the summary
encoded_summary = pegasus_model.generate(**tokens)
# Decode the summarized text
decoded_summary = pegasus_tokenizer.decode(encoded_summary[0], skip_special_tokens=True)
# Print the summary
print('Decoded Summary :',decoded_summary)
# Summarize with the high-level pipeline API
summarizer = pipeline(
    "summarization",
    model=model_name,
    tokenizer=pegasus_tokenizer,
    framework="pt"
)
summary = summarizer(example_text, min_length=30, max_length=150)
print(summary[0]["summary_text"])
Output:
Original Document Size: 1825
Decoded Summary : Deep learning is a branch of computer science that deals with the study and training of machine learning.
'Deep learning is a branch of computer science which deals with the study and training of complex systems such as speech recognition, natural language processing, machine translation and medical image analysis. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.'
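The shape of the summary can be adjusted through standard Hugging Face generation parameters passed to generate(). Reusing the tokens and pegasus_model objects from the code above, a brief sketch (the values here are illustrative, not recommended settings):
Python
# Tune the generated summary via standard generation parameters
encoded_summary = pegasus_model.generate(
    **tokens,
    max_length=64,       # cap the summary length in tokens
    num_beams=5,         # wider beam search explores more candidate summaries
    early_stopping=True, # stop once all beams have finished
)
print(pegasus_tokenizer.decode(encoded_summary[0], skip_special_tokens=True))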
Conclusion
In closing, the future of text summarization looks bright. Researchers continue to push toward summaries with greater accuracy and more human-like intuition, using both extractive and abstractive approaches alongside powerful models like PEGASUS. This work is transforming how we condense massive volumes of information into succinct, insightful summaries, and it promises a future in which we can distill knowledge more effectively than ever before. The development of text summarization is a testament to the ever-expanding potential of AI to improve human comprehension.