In this project, after researching the available methods for each task, I built an end-to-end text analytics system that performs summarization, coherence analysis, context analysis, POS tagging, and grammar checking on text data. I created an API for each task so they could be integrated into our internal no-code/low-code platform.
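The platform integration itself is internal, but as a rough sketch of the API layer (FastAPI, the route name, and the request model below are assumptions for illustration, not the actual internal setup):

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TextIn(BaseModel):
    text: str

@app.post("/summarize")  # hypothetical route; each task gets its own endpoint
def summarize_endpoint(payload: TextIn):
    # Summarization() is the T5-based function defined later in this post
    return {"summary": Summarization(payload.text)}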
Automatic text summarization is a common problem in machine learning and natural language processing (NLP). Existing approaches are either extractive or abstractive. Extractive methods often fail to capture the coherence between the sentences they select, while abstractive methods require large amounts of training data and, hence, time. The main goal was to implement a comparatively fast text summarizer by leveraging transformers, supplemented with part-of-speech tagging, grammar checking, and coherence and context extraction from the text.
Data Acquisition and Preprocessing
Data Source:
Used the CNN/Daily Mail news text summarization dataset to evaluate the available models for text summarization.
Preprocessing:
Removed stop words, extra whitespace, non-alphanumeric characters, etc., using NLTK and Gensim.
import nltk
from nltk.stem import WordNetLemmatizer
from gensim.parsing.preprocessing import (
    strip_tags, remove_stopwords, strip_multiple_whitespaces,
    strip_non_alphanum, strip_numeric, strip_punctuation)

def Text_Preprocessing(text):
    lemmatizer = WordNetLemmatizer()
    text = strip_tags(text)                  # drop HTML/XML tags
    text = remove_stopwords(text)            # drop common stop words
    text = strip_multiple_whitespaces(text)  # collapse repeated whitespace
    text = strip_non_alphanum(text)          # keep only letters and digits
    text = strip_numeric(text)               # drop digits
    text = strip_punctuation(text)           # drop punctuation
    tokens = nltk.word_tokenize(text.lower())
    # Lemmatize token by token; calling lemmatize() on the whole string
    # treats it as a single word and does nothing useful
    return [lemmatizer.lemmatize(t) for t in tokens]
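For example, with lowercase input (the exact tokens depend on Gensim's stop-word list and the NLTK WordNet data):

tokens = Text_Preprocessing("the 3 quick brown foxes jumped!")
# e.g. ['quick', 'brown', 'fox', 'jumped']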
Methodology and Modeling
Implemented some of the common text summarization methods (a minimal TF-IDF sketch follows this list):
- TF-IDF
- PageRank
- Sequence-to-sequence models
- Transformers
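As an illustration of the simplest of these, here is a minimal sketch of TF-IDF extractive summarization; the scikit-learn vectorizer and the helper name tfidf_summary are assumptions for the example, not the exact code used in the project:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(text, n_sentences=3):
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= n_sentences:
        return text
    # Score each sentence by the sum of its TF-IDF term weights
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1
    # Keep the top-scoring sentences, in their original order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))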
Evaluated summaries using ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a set of metrics that scores automatic summaries by comparing them against reference summaries (usually human-written).
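For reference, a minimal sketch of computing ROUGE in Python, assuming the rouge-score package (any ROUGE implementation would do); reference_summary and generated_summary are placeholder strings:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print(scores["rougeL"].fmeasure)  # F1 of the longest-common-subsequence overlap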
Summarization
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def Summarization(text):
    # AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the current class for T5
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    # T5 is text-to-text: the "summarize:" prefix selects the summarization task
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt",
                              max_length=512, truncation=True)
    outputs = model.generate(inputs, max_length=150, min_length=40,
                             length_penalty=2.0, num_beams=4, early_stopping=True)
    # skip_special_tokens drops the <pad>/</s> markers from the decoded text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
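Usage is simply summary = Summarization(article_text), where article_text is any input string (anything beyond 512 tokens after encoding is truncated). Note that from_pretrained loads the full model on every call; in a real service you would load the model and tokenizer once at startup and reuse them across requests.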
POS Tagging
import spacy

def POS_Tagging(text):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    # Pair each token's text with its coarse-grained part-of-speech tag
    return [(token.text, token.pos_) for token in doc]
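A quick illustrative call (the exact tags depend on the spaCy model version):

POS_Tagging("Time flies like an arrow")
# e.g. [('Time', 'NOUN'), ('flies', 'VERB'), ('like', 'ADP'), ('an', 'DET'), ('arrow', 'NOUN')]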
Context Analysis
from textblob import TextBlob

def Context_Analysis(text):
    # Noun phrases serve as a rough proxy for the context/topics of the text
    return TextBlob(text).noun_phrases
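For example (the output shown is illustrative; TextBlob lowercases extracted phrases):

Context_Analysis("Natural language processing powers modern text analytics")
# e.g. ['natural language processing', 'modern text analytics']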
Grammar Checking
import language_tool_python

def Grammar_Checking(text):
    tool = language_tool_python.LanguageTool('en-US')
    # Return both the detected issues and an auto-corrected version of the text
    return tool.check(text), tool.correct(text)
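Example usage (the sample sentence and its expected correction are illustrative):

matches, corrected = Grammar_Checking("She go to school every day.")
# matches is a list of rule violations (here, subject-verb agreement);
# corrected is the auto-fixed string, e.g. "She goes to school every day."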
Coherence Analysis
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

def Coherence_Analysis(sentence_list):
    # Tokenize every sentence with the same preprocessing used above
    tokenized = [Text_Preprocessing(s) for s in sentence_list]
    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    # Fit a small LDA topic model over the sentences
    num_topics = 3
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=50)
    # Collect the top 20 words of each topic
    topic_list = []
    for i in range(num_topics):
        terms = lda.get_topic_terms(i, 20)
        topic_list.append([dictionary[word_id] for word_id, _ in terms])
    # u_mass coherence measures how often the topic words co-occur in the corpus
    coh_m = CoherenceModel(topics=topic_list, corpus=corpus,
                           dictionary=dictionary, coherence='u_mass')
    return coh_m.get_coherence()
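A small usage sketch (the sentences are placeholders; u_mass coherence is typically negative, with values closer to zero roughly indicating more coherent text):

sentences = [
    "The market rallied on strong earnings.",
    "Tech stocks led the gains.",
    "Investors expect further growth.",
]
score = Coherence_Analysis(sentences)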
Results and Evaluation
Based on the evaluation of the various models, transformers (Google's T5, via Hugging Face) provided the best results for summarization. For POS tagging, I used spaCy; context analysis was done with TextBlob; and grammar checking used the language-tool-python package.
Conclusion and Insights
- Extractive summarization methods are fast and popular in industry because they are easy to implement: they reuse existing natural-language phrases and are reasonably accurate.
- Abstractive methods generate more human-like summaries, but they take much longer to train and require far more data. Sometimes the network fails to focus on the core of the source text and instead summarizes a less important, secondary detail.
- Text summarization remains a growing research area.
Reflection and Future Work
While there are simple text summarization tools in Python, such as Gensim and Sumy, there are far more powerful but somewhat more complicated summarizers, such as T5 and, more recently, GPT-3.
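For comparison, extractive summarization with Sumy takes only a few lines; this sketch uses Sumy's LSA summarizer, where text is a placeholder string:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
for sentence in summarizer(parser.document, 3):  # keep the top 3 sentences
    print(sentence)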
Several areas of future work and research directions in text summarization include:
- Improving the quality of summaries produced by abstractive methods.
- Investigating techniques for summarizing multimodal content, including text, images, audio, and video.
Libraries
- TextBlob
- NLTK
- spaCy
- Transformers
- Gensim
- language-tool-python