Documents Summary in Python

Published Aug 29, 2023

Hello Coders!

This article explains how to process documents using Python. Analyzing the content of a Microsoft Word document, extracting text, and creating a summary involves a multi-step process. Python provides libraries that can help you achieve these tasks.

Here's a general approach:

✅ Install Libraries

You'll need the docx2txt library to extract text from Word documents, the nltk library for tokenization and stop words, and numpy and networkx for the similarity matrix and the ranking step. Install them using pip:

pip install docx2txt nltk numpy networkx

✅ Extract Text from Word Document

Use the docx2txt library to extract text from the Word document:

import docx2txt

# Replace 'document.docx' with your file path
text = docx2txt.process("document.docx")
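Text pulled out of a .docx file can carry leftover tabs, repeated spaces, and blank lines. A minimal cleanup sketch, assuming you want one non-empty line per paragraph; the helper name normalize_whitespace and the sample string are made up for illustration and are not part of docx2txt:

```python
import re

def normalize_whitespace(raw_text: str) -> str:
    """Collapse runs of spaces/tabs and drop blank lines left over from extraction."""
    # Replace tabs and repeated spaces with a single space
    collapsed = re.sub(r"[ \t]+", " ", raw_text)
    # Trim each line and keep only the non-empty ones
    lines = [line.strip() for line in collapsed.splitlines()]
    return "\n".join(line for line in lines if line)

# Hypothetical extraction output
sample = "Title\t\tof  report\n\n\n  First paragraph.  \n\nSecond paragraph."
print(normalize_whitespace(sample))
```

You could apply it right after extraction, for example text = normalize_whitespace(text), before tokenizing.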

✅ Preprocess Text

Before summarization, it's a good idea to preprocess the extracted text. You can use the nltk library to tokenize the text into sentences and words:

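import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")      # one-time download; newer NLTK releases may ask for "punkt_tab"
nltk.download("stopwords")  # stop-word list used in the summarization step

sentences = sent_tokenize(text)
words = word_tokenize(text)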
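Word tokens become more useful once common filler words are filtered out, which is what the stop-word list does in the summarization step. Here is a rough sketch of that filtering, using a regex tokenizer and a tiny hand-picked stop-word set as stand-ins for nltk so that it runs without any downloads; STOP_WORDS and content_word_counts are hypothetical names for this example:

```python
import re
from collections import Counter

# Tiny hand-picked stop-word set, a stand-in for nltk's English list
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it"}

def content_word_counts(text):
    """Count lowercase word tokens that are not stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

counts = content_word_counts("The summary of the report is in the report appendix.")
print(counts.most_common(1))  # the most frequent content word and its count
```

Words that survive the filter and repeat often are a rough hint at what the document is about.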

✅ Text Summarization

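To create a summary, you can use various techniques. One common approach is extractive summarization, where the most important sentences in the text are selected to form the summary. The code below builds a basic TextRank-style ranking by combining three libraries: nltk supplies the stop words and the cosine distance, numpy holds the sentence similarity matrix, and networkx runs PageRank over it: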

from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import networkx as nx
import numpy as np

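def sentence_similarity(sent1, sent2, stop_words=None):
    if stop_words is None:
        stop_words = []
    # Split each sentence into lowercase word tokens (iterating a string directly yields characters)
    words1 = [w.lower() for w in word_tokenize(sent1)]
    words2 = [w.lower() for w in word_tokenize(sent2)]
    all_words = list(set(words1 + words2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    # Count every non-stop-word occurrence in each sentence
    for w in words1:
        if w not in stop_words:
            vector1[all_words.index(w)] += 1
    for w in words2:
        if w not in stop_words:
            vector2[all_words.index(w)] += 1
    # A sentence made up only of stop words gives a zero vector; treat it as dissimilar
    if not any(vector1) or not any(vector2):
        return 0.0
    return 1 - cosine_distance(vector1, vector2)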

def build_similarity_matrix(sentences, stop_words):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j], stop_words)
    return similarity_matrix

def generate_summary(text, top_n=5):
    stop_words = stopwords.words('english')
    sentences = sent_tokenize(text)
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    # Slice instead of indexing so documents with fewer than top_n sentences do not raise an IndexError
    top_sentences = [s for _, s in ranked_sentences[:top_n]]
    return " ".join(top_sentences)

# Generate a summary with the top 5 sentences
summary = generate_summary(text, top_n=5)

The above code defines functions to calculate sentence similarity and generate a summary using the TextRank algorithm.
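To make the ranking step less of a black box, here is a minimal pure-Python sketch of the power iteration behind PageRank, run on a small hypothetical similarity matrix. It is a simplified illustration and not networkx's implementation; the function name pagerank_scores, its parameters, and the sample matrix are assumptions chosen for the example.

```python
def pagerank_scores(matrix, damping=0.85, iterations=200, tol=1e-10):
    """Score the nodes of a weighted graph by power iteration (simplified PageRank)."""
    n = len(matrix)
    if n == 0:
        return []
    # Normalize each row so its outgoing weights sum to 1; a row with no
    # outgoing weight spreads its score evenly over all nodes
    transition = []
    for row in matrix:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each node keeps a small base score and collects weighted score from its neighbors
        new_scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores

# Hypothetical similarity matrix for three sentences: sentence 0 overlaps
# with both of the others, which share nothing with each other
similarity = [
    [0.0, 0.6, 0.4],
    [0.6, 0.0, 0.0],
    [0.4, 0.0, 0.0],
]
scores = pagerank_scores(similarity)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The sentence that is similar to the most other sentences collects the highest score, which is why it is picked first for the summary.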

Please note that this is a simplified example, and there are more advanced methods and libraries available for text summarization.

Additionally, the quality of the summary might vary depending on the complexity of the document and the chosen summarization technique.
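One factor that affects readability is sentence order: generate_summary joins the sentences by rank, so they can appear out of sequence. A possible refinement, sketched here with a hypothetical helper and made-up scores, is to pick the top sentences by score and then emit them in their original document order:

```python
def summarize_in_document_order(sentences, scores, top_n=5):
    """Join the top_n highest-scoring sentences in their original order."""
    top_n = min(top_n, len(sentences))
    # Indices of the best-scoring sentences, highest score first
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Re-sort the chosen indices so the summary follows the document
    return " ".join(sentences[i] for i in sorted(best))

sentences = ["Intro.", "Key finding.", "Side note.", "Conclusion."]
scores = [0.1, 0.35, 0.05, 0.5]  # hypothetical scores, e.g. from PageRank
print(summarize_in_document_order(sentences, scores, top_n=2))
```

Because the helper only indexes into scores, it accepts either a list or a dict keyed by sentence index.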

Thank you!

Resources

👉 Generate Django Apps using Rocket Generator
👉 Join the Community and chat with the support team

Adi Chirilov - Sm0ke

Founder @ AppSeed.us
