In this project, after researching the available methods for each task, I built an end-to-end text analytics system that performs summarization, coherence analysis, context analysis, POS tagging, and grammar checking on text data. I created an API for each task so they could be integrated into our internal no-code/low-code platform.
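The platform integration itself is internal, but as a rough sketch of the API layer (FastAPI, the route name, and the request model below are assumptions for illustration, not the actual internal setup):

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TextIn(BaseModel):
    text: str

@app.post("/summarize")  # hypothetical route; each task gets its own endpoint
def summarize_endpoint(payload: TextIn):
    # Summarization() is the T5-based function defined later in this post
    return {"summary": Summarization(payload.text)}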
Automatic text summarization is a common problem in machine learning and natural language processing (NLP). Existing approaches are either extractive or abstractive. Extractive methods often fail to capture the coherence between the sentences they select, while abstractive methods require large amounts of training data and, hence, time. The main goal was to implement a comparatively fast text summarizer by leveraging transformers, supplemented with part-of-speech tagging, grammar checking, and coherence and context extraction from the text.
Data Acquisition and Preprocessing
Data Source:
Used the CNN/Daily Mail news text summarization dataset to evaluate the available models for text summarization.
Preprocessing:
Removed stop words, extra whitespace, non-alphanumeric characters, etc., using NLTK and Gensim.
import nltk
from nltk.stem import WordNetLemmatizer
from gensim.parsing.preprocessing import (
    strip_tags, remove_stopwords, strip_multiple_whitespaces,
    strip_non_alphanum, strip_numeric, strip_punctuation)

def Text_Preprocessing(text):
    lemmatizer = WordNetLemmatizer()
    text = strip_tags(text)                  # drop HTML/XML tags
    text = remove_stopwords(text)            # drop common stop words
    text = strip_multiple_whitespaces(text)  # collapse repeated whitespace
    text = strip_non_alphanum(text)          # keep only letters and digits
    text = strip_numeric(text)               # drop digits
    text = strip_punctuation(text)           # drop punctuation
    tokens = nltk.word_tokenize(text.lower())
    # Lemmatize token by token; calling lemmatize() on the whole string
    # treats it as a single word and does nothing useful
    return [lemmatizer.lemmatize(t) for t in tokens]
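For example, with lowercase input (the exact tokens depend on Gensim's stop-word list and the NLTK WordNet data):

tokens = Text_Preprocessing("the 3 quick brown foxes jumped!")
# e.g. ['quick', 'brown', 'fox', 'jumped']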
Methodology and Modeling
Implemented some of the common text summarization methods (a minimal TF-IDF sketch follows this list):
- TF-IDF
- PageRank
- Sequence-to-sequence models
- Transformers
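As an illustration of the simplest of these, here is a minimal sketch of TF-IDF extractive summarization; the scikit-learn vectorizer and the helper name tfidf_summary are assumptions for the example, not the exact code used in the project:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(text, n_sentences=3):
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= n_sentences:
        return text
    # Score each sentence by the sum of its TF-IDF term weights
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1
    # Keep the top-scoring sentences, in their original order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))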
Evaluated summaries using ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a set of metrics that scores automatic summaries by comparing them against reference summaries (usually human-written).
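For reference, a minimal sketch of computing ROUGE in Python, assuming the rouge-score package (any ROUGE implementation would do); reference_summary and generated_summary are placeholder strings:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print(scores["rougeL"].fmeasure)  # F1 of the longest-common-subsequence overlap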
Summarization
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def Summarization(text):
    # AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the current class for T5
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    # T5 is text-to-text: the "summarize:" prefix selects the summarization task
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt",
                              max_length=512, truncation=True)
    outputs = model.generate(inputs, max_length=150, min_length=40,
                             length_penalty=2.0, num_beams=4, early_stopping=True)
    # skip_special_tokens drops the <pad>/</s> markers from the decoded text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
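Usage is simply summary = Summarization(article_text), where article_text is any input string (anything beyond 512 tokens after encoding is truncated). Note that from_pretrained loads the full model on every call; in a real service you would load the model and tokenizer once at startup and reuse them across requests.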
POS Tagging
import spacy

def POS_Tagging(text):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    # Pair each token's text with its coarse-grained part-of-speech tag
    return [(token.text, token.pos_) for token in doc]
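A quick illustrative call (the exact tags depend on the spaCy model version):

POS_Tagging("Time flies like an arrow")
# e.g. [('Time', 'NOUN'), ('flies', 'VERB'), ('like', 'ADP'), ('an', 'DET'), ('arrow', 'NOUN')]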
Context Analysis
from textblob import TextBlob

def Context_Analysis(text):
    # Noun phrases serve as a rough proxy for the context/topics of the text
    return TextBlob(text).noun_phrases
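For example (the output shown is illustrative; TextBlob lowercases extracted phrases):

Context_Analysis("Natural language processing powers modern text analytics")
# e.g. ['natural language processing', 'modern text analytics']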
Grammar Checking
import language_tool_python

def Grammar_Checking(text):
    tool = language_tool_python.LanguageTool('en-US')
    # Return both the detected issues and an auto-corrected version of the text
    return tool.check(text), tool.correct(text)
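Example usage (the sample sentence and its expected correction are illustrative):

matches, corrected = Grammar_Checking("She go to school every day.")
# matches is a list of rule violations (here, subject-verb agreement);
# corrected is the auto-fixed string, e.g. "She goes to school every day."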
Coherence Analysis
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

def Coherence_Analysis(sentence_list):
    # Tokenize every sentence with the same preprocessing used above
    tokenized = [Text_Preprocessing(s) for s in sentence_list]
    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    # Fit a small LDA topic model over the sentences
    num_topics = 3
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=50)
    # Collect the top 20 words of each topic
    topic_list = []
    for i in range(num_topics):
        terms = lda.get_topic_terms(i, 20)
        topic_list.append([dictionary[word_id] for word_id, _ in terms])
    # u_mass coherence measures how often the topic words co-occur in the corpus
    coh_m = CoherenceModel(topics=topic_list, corpus=corpus,
                           dictionary=dictionary, coherence='u_mass')
    return coh_m.get_coherence()
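A small usage sketch (the sentences are placeholders; u_mass coherence is typically negative, with values closer to zero roughly indicating more coherent text):

sentences = [
    "The market rallied on strong earnings.",
    "Tech stocks led the gains.",
    "Investors expect further growth.",
]
score = Coherence_Analysis(sentences)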
Results and Evaluation
Based on the evaluation of the various models, transformers (Google's T5, via Hugging Face) provided the best results for summarization. For POS tagging, I used spaCy; context analysis was done with TextBlob; and grammar checking used the language-tool-python package.
Conclusion and Insights
- Extractive summarization methods are fast and popular in industry because they are easy to implement: they reuse existing natural-language phrases and are reasonably accurate.
- Abstractive methods generate more human-like summaries, but they take much longer to train and require far more data. Sometimes the network fails to focus on the core of the source text and instead summarizes a less important, secondary detail.
- Text summarization remains a growing research area.
Reflection and Future Work
While there are simple text summarization tools in Python, such as Gensim and Sumy, there are far more powerful but somewhat more complicated summarizers, such as T5 and, more recently, GPT-3.
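For comparison, extractive summarization with Sumy takes only a few lines; this sketch uses Sumy's LSA summarizer, where text is a placeholder string:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
for sentence in summarizer(parser.document, 3):  # keep the top 3 sentences
    print(sentence)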
Several areas of future work and research directions in text summarization include:
- Improving the quality of summaries produced by abstractive methods.
- Investigating techniques for summarizing multimodal content, including text, images, audio, and video.
Libraries
- TextBlob
- NLTK
- spaCy
- Transformers
- Gensim
- language-tool-python