Visualize News Word Cloud using Python, Flask and JQCloud

Published May 28, 2017. Last updated Aug 05, 2017.

The learning objectives of this post are:

  1. Get a news feed using the REST-based API from NewsAPI.org
  2. Get words and their frequencies
  3. Visualize the word cloud using JQCloud

We will build a Flask app to put everything in place.

Goal-1: Get news feed from NewsAPI.org

NewsAPI.org provides a free and simple-to-use RESTful API. It has two endpoints, sources and articles.
The sources endpoint returns the list of news sources and blogs available on News API, and the articles endpoint returns the list of articles for a particular source.
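To make the sources endpoint a little more concrete, here is a minimal sketch of how its response could be turned into a lookup table. The sample dictionary is hypothetical: the field names `sources`, `id` and `name` are my assumption about the response shape, so check them against a real response before relying on them.

```python
# Hypothetical sample shaped like a sources response. The field names
# ("sources", "id", "name") are assumptions, not copied from the API docs.
sample_sources_response = {
    "status": "ok",
    "sources": [
        {"id": "bbc-news", "name": "BBC News"},
        {"id": "bloomberg", "name": "Bloomberg"},
    ],
}


def source_codes(response):
    """Map each source identifier to its display name."""
    if response.get("status") != "ok":
        return {}
    return {s["id"]: s["name"] for s in response.get("sources", [])}


codes = source_codes(sample_sources_response)
print(codes)
```

The identifiers collected this way are the values you would pass to the articles endpoint.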

Let's now look at the request URL for the articles endpoint.

https://newsapi.org/v1/articles?source={source}&apiKey={apikey}

As we can see, the API requires two parameters, source and apiKey. The source is a short code that NewsAPI.org assigns to each news source or blog it lists. You can pick any source you like; some examples with their codes are Bloomberg (bloomberg), BBC News (bbc-news) and Business Insider (business-insider).

You can generate your API key on NewsAPI.org once you register there.
So, the request will look like this:

https://newsapi.org/v1/articles?source=bbc-news&apiKey=123456
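If you would rather not glue the URL together by hand, the same request URL can be assembled with the standard library. This is only a sketch; `build_articles_url` and `ARTICLES_ENDPOINT` are names I made up for illustration.

```python
from urllib.parse import urlencode

ARTICLES_ENDPOINT = "https://newsapi.org/v1/articles"


def build_articles_url(source, api_key):
    """Return the articles request URL for one source."""
    # urlencode escapes any characters that are not safe in a query string
    query = urlencode({"source": source, "apiKey": api_key})
    return f"{ARTICLES_ENDPOINT}?{query}"


url = build_articles_url("bbc-news", "123456")
print(url)
```

The benefit over plain string concatenation is that unusual characters in the source code or key are escaped for you.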

Let's write some Python code to get the news data:

import requests  # this we will use to call API and get data
import json  # to convert python dictionary/list to string format

# get API key from NewsAPI.org
NEWS_API_KEY = "123456"

# url for articles endpoint
# I'm using bbc-news source, you can choose a source of your choice 
# or can pull data from multiple sources
url = "https://newsapi.org/v1/articles?source=bbc-news&apiKey="+NEWS_API_KEY

# call the api
response = requests.get(url)

# get the data in json format
result = response.json()
print(result)

This is the JSON response we will get:

{

    "status":"ok",
    "source":"bbc-news",
    "sortBy":"top",
    "articles":[
        {
            "author":"BBC News",
            "title":"British Airways to resume most flights but delays still expected",
            "description":"British Airways warns there will still be some delays and cancellations, a day after its IT crash.",
            "url":"http://www.bbc.co.uk/news/uk-40074751",
            "urlToImage":"https://ichef.bbci.co.uk/news/1024/cpsprodpb/11F52/production/_96245537_ba_reuters.jpg",
            "publishedAt":"2017-05-28T07:54:49+00:00"
        }
    ]

}
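Before using a response like the one above, it is worth checking the status field and guarding against articles with no description. Here is a small sketch using a trimmed copy of that response; the second article is hypothetical, and the idea that a description can come back empty or null is my assumption.

```python
sample_result = {
    "status": "ok",
    "source": "bbc-news",
    "sortBy": "top",
    "articles": [
        {
            "title": "British Airways to resume most flights but delays still expected",
            "description": "British Airways warns there will still be some delays "
                           "and cancellations, a day after its IT crash.",
        },
        # hypothetical article with a missing description
        {"title": "Hypothetical article", "description": None},
    ],
}


def extract_descriptions(result):
    """Return the non-empty descriptions from an articles response."""
    if result.get("status") != "ok":
        return []
    return [
        article["description"]
        for article in result.get("articles", [])
        if article.get("description")
    ]


descriptions = extract_descriptions(sample_result)
print(len(descriptions))
```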

Goal-2: Get words and their frequency

To achieve this, we will first collect the description of each news article returned by the API. Then we will split the descriptions into words using NLTK, and finally use collections.Counter to get the words and their frequencies.

Let's achieve this step-by-step!

from nltk.tokenize import word_tokenize  # to split sentences into words
from nltk.corpus import stopwords  # to get a list of stopwords
from collections import Counter  # to get words-frequency

descriptions = []
# this is in continuation of above code
# result variable holds the json response
# all the news articles are listed under 'articles' key
# we are interested in the description of each news article
for each_article in result['articles']:
    # skip articles that come back without a description
    if each_article['description']:
        descriptions.append(each_article['description'])

# split sentences into words
words = []
for description in descriptions:
    tokens = word_tokenize(description)
    words.extend(tokens)

# remove stopwords from our words list and also remove any word whose length is less than 3
# stopwords are commonly occurring words like is, am, are, they, some, etc.
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words and len(word)>2]

# now, get the words and their frequency
words_freq = Counter(words)
print(words_freq)
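Two things to keep in mind with the code above: NLTK needs its tokenizer and stopword data downloaded once (via `nltk.download`), and its English stopword list is lowercase, so a capitalised "The" slips through unless you lowercase the tokens first. Below is a rough standard-library sketch of the same counting step that lowercases the text and keeps only alphabetic tokens. It swaps in a regular expression for NLTK's tokenizer, and `SMALL_STOP_WORDS` is a tiny hypothetical list, not NLTK's.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list, only for this example.
SMALL_STOP_WORDS = {"the", "there", "will", "still", "some",
                    "and", "its", "after", "but", "most"}


def count_words(descriptions, stop_words=SMALL_STOP_WORDS, min_length=3):
    """Count lowercase words, skipping stopwords and very short tokens."""
    words = []
    for description in descriptions:
        # lowercase first, then keep runs of letters (and apostrophes)
        tokens = re.findall(r"[a-z']+", description.lower())
        words.extend(
            token for token in tokens
            if len(token) >= min_length and token not in stop_words
        )
    return Counter(words)


freq = count_words([
    "British Airways warns there will still be some delays and "
    "cancellations, a day after its IT crash.",
    "British Airways to resume most flights but delays still expected",
])
print(freq.most_common(3))
```

Lowercasing also merges "Delays" and "delays" into one entry, which usually gives a cleaner cloud.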

Goal-3: Visualize word cloud using JQCloud

In this step we will pass the word cloud data from Python to JQCloud for visualization.

JQCloud expects data in the following format, so we have to convert our word counts before returning them.

[

    {
        'text':'police',
        'weight':100
    },
    {
        'text':'parents',
        'weight':80
    }
]

Here is the code to convert the data into a JQCloud-compatible format and dump it as a JSON string:

words_json = [{'text': word, 'weight': count} for word, count in words_freq.items()]

# json.dumps converts a Python object, i.e. a dictionary or list, into a JSON string
print(json.dumps(words_json))
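A word cloud with every single word in it can get crowded, so you may want to send only the most frequent ones. Here is a minimal sketch of that idea; `to_jqcloud` and its `limit` parameter are names I introduced for illustration, and the sample counts are made up.

```python
import json
from collections import Counter


def to_jqcloud(words_freq, limit=50):
    """Convert a Counter into JQCloud's list of text/weight dictionaries."""
    # most_common returns (word, count) pairs sorted by count, highest first
    return [
        {"text": word, "weight": count}
        for word, count in words_freq.most_common(limit)
    ]


sample_freq = Counter({"police": 100, "parents": 80, "school": 5})
payload = json.dumps(to_jqcloud(sample_freq, limit=2))
print(payload)
```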

Now let's write some jQuery code to call our Flask app endpoint, get the data, and build the word cloud.

First, we will write our HTML page (index.html), which includes the CSS and JS for JQCloud as well as our own jQuery script. Flask's render_template looks for this file in a templates folder next to the app.

<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/html">
<head>
    <meta charset="UTF-8">
    <title>News Word Cloud</title>

    <!-- You can download css and js from https://github.com/mistic100/jQCloud/tree/master/dist -->
    <link rel="stylesheet" href="../static/css/jqcloud.min.css">
    
    <!-- jQuery must be included before jqcloud.js; here it is loaded from the Google CDN -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
    <script type="text/javascript" src="../static/js/jqcloud.min.js"></script>
    
    <!-- here we included our script.js -->
    <script type="text/javascript" src="../static/js/script.js"></script>
</head>
<body>
    <!-- Empty div where JQCloud will build the word cloud-->
    <div id="word_cloud">
    </div>
</body>
</html>

In script.js, we will call our Flask app endpoint 'word_cloud', get the data, and visualize it using JQCloud:

$(document).ready(function () {
    // on page load this will fetch data from our flask-app asynchronously
   $.ajax({url: '/word_cloud', success: function (data) {
       // the returned data is a JSON string, so we parse it into an array of objects
       // (JSON.parse replaces $.parseJSON, which is deprecated since jQuery 3.0)
       var words_data = JSON.parse(data);
       // we will build a word cloud into our div with id=word_cloud
       // we have to specify width and height of the word_cloud chart
       $('#word_cloud').jQCloud(words_data, {
           width: 800,
           height: 600
       });
   }});
});

Here is our complete Flask app code:

from flask import Flask, render_template
from nltk.tokenize import word_tokenize  # to split sentences into words
from nltk.corpus import stopwords  # to get a list of stopwords
from collections import Counter  # to get words-frequency
import requests  # this we will use to call API and get data
import json  # to convert python dictionary to string format

app = Flask(__name__)

# get API key from NewsAPI.org
NEWS_API_KEY = "123456"


@app.route('/')
def home_page():
    return render_template('index.html')


@app.route('/word_cloud', methods=['GET'])
def word_cloud():
    try:
        # url for articles endpoint
        # I'm using bbc-news source, you can choose a source of your choice
        # or can pull data from multiple sources
        url = "https://newsapi.org/v1/articles?source=bbc-news&apiKey="+NEWS_API_KEY

        # call the api
        response = requests.get(url)

        # get the data in json format
        result = response.json()

        # all the news articles are listed under 'articles' key
        # we are interested in the description of each news article
        sentences = ""
        for news in result['articles']:
            # fall back to an empty string when an article has no description
            description = news['description'] or ""
            sentences = sentences + " " + description

        # split sentences into words
        words = word_tokenize(sentences)

        # get stopwords
        stop_words = set(stopwords.words('english'))

        # remove stopwords from our words list and also remove any word whose length is less than 3
        # stopwords are commonly occurring words like is, am, are, they, some, etc.
        words = [word for word in words if word not in stop_words and len(word) > 2]

        # now, get the words and their frequency
        words_freq = Counter(words)

        # JQCloud requires words in the format {'text': 'sample', 'weight': 100}
        # so, let's convert our words_freq into that format
        words_json = [{'text': word, 'weight': count} for word, count in words_freq.items()]

        # now convert it into a string format and return it
        return json.dumps(words_json)
    except Exception:
        # on any failure, return an empty list so the page still renders
        return '[]'


if __name__ == '__main__':
    app.run()
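The comments in the app mention pulling data from multiple sources. Here is one way that could look, sketched with a pluggable `fetch` function so it can be tried without calling the API. `collect_descriptions`, `fake_fetch` and the canned responses are all hypothetical, and the shape of the failed response is an assumption.

```python
def collect_descriptions(sources, fetch):
    """Gather non-empty article descriptions across several sources.

    `fetch` is any callable that takes a source code and returns the parsed
    JSON response as a dict (for example, a wrapper around requests.get).
    """
    descriptions = []
    for source in sources:
        result = fetch(source)
        if result.get("status") != "ok":
            continue  # skip a failed source instead of aborting everything
        for article in result.get("articles", []):
            if article.get("description"):
                descriptions.append(article["description"])
    return descriptions


# Hypothetical canned responses standing in for real API calls.
FAKE_RESPONSES = {
    "bbc-news": {"status": "ok",
                 "articles": [{"description": "Delays expected at airports."}]},
    "bloomberg": {"status": "ok",
                  "articles": [{"description": None},
                               {"description": "Markets rally."}]},
    "business-insider": {"status": "error"},  # assumed failure shape
}


def fake_fetch(source):
    return FAKE_RESPONSES[source]


all_descriptions = collect_descriptions(
    ["bbc-news", "bloomberg", "business-insider"], fake_fetch
)
print(all_descriptions)
```

In the real app, `fetch` would build the articles URL for the given source and return `requests.get(url).json()`.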

I also added a jumbotron from Bootstrap to make the page look a little better.

You can fork the complete code from the Git repository: https://github.com/prateekkrjain/newsapi_word_cloud

Discover and read more posts from Prateek Jain