DataViz Mastery Part 2 - Word Clouds

Posted by Mubaris NK on November 11, 2017

This is part 2 of DataViz Mastery. In part 1, we learned how to create Treemaps using Python (read it here). In this post, we will learn how to create Word Clouds using Python. So, let’s get started.

Word Cloud

A Word Cloud (or tag cloud) is a visual representation of text data, typically used to depict keyword metadata (tags) on websites, to visualize free-form text, or to analyze speeches (e.g. election campaigns). Tags are usually single words, and the importance of each tag is shown with font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence.
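
To make the size encoding concrete, here is a minimal sketch that maps tag counts to font sizes by linear scaling. The function name `scale_font_sizes`, the size range, and the sample counts are assumptions for illustration; this is not how any particular library computes its sizes.

```python
def scale_font_sizes(counts, min_size=12, max_size=72):
    """Map each tag's count to a font size between min_size and max_size."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All tags are equally frequent, so give them the same size.
        return {tag: max_size for tag in counts}
    span = hi - lo
    return {
        tag: round(min_size + (n - lo) / span * (max_size - min_size))
        for tag, n in counts.items()
    }

# Hypothetical tag counts, for illustration only.
tag_counts = {"python": 40, "dataviz": 25, "treemap": 10}
sizes = scale_font_sizes(tag_counts)
```

The most frequent tag gets the largest size and the least frequent the smallest, with everything else placed proportionally in between.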

Examples

  • Top 1000 most common passwords

Password

  • Word Cloud of Trump Insults

Trump Insult

The Code

Required Libraries

Creating a Word Cloud is very easy with the help of the wordcloud library developed by Andreas Mueller.

Word Cloud 1 - Simple

We will create a Word Cloud of the top words from the Wonder Woman movie, using the movie script provided on this website. Before creating the cloud, we need to remove stop words from the script. The wordcloud library provides a list of stop words, which we will use here.

from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt

# Jupyter magic: render figures inline (remove this line outside a notebook)
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
# Reading the script
with open("wonderwoman.txt") as f:
    script = f.read()
# Set of Stop words
stopwords = set(STOPWORDS)
stopwords.add("will")
# Create WordCloud Object
wc = WordCloud(background_color="white", stopwords=stopwords, 
               width=1600, height=900, colormap=matplotlib.cm.inferno)
# Generate WordCloud
wc.generate(script)
# Show the WordCloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")

In the resulting cloud, “Diana” is clearly the most frequent word in the script.
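
The most frequent word in a text can also be found by counting directly. The sketch below counts words after dropping stop words, using only the standard library; the sample text and the tiny stop word set are made up for illustration, and the tokenizer is simpler than the one wordcloud uses.

```python
import re
from collections import Counter

def top_words(text, stopwords, n=3):
    """Return the n most common words in text, ignoring stop words and case."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words if w not in stopwords]
    return Counter(kept).most_common(n)

# Made-up sample text and stop words, for illustration only.
sample = "Diana will fight. Steve said Diana is brave. Diana will win."
stop = {"will", "is", "said"}
print(top_words(sample, stop, n=2))
```

Running the same function on the full script with `set(STOPWORDS)` should give a rough numeric counterpart to what the cloud shows visually.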

Word Cloud 2 - With Mask

We can also create Word Clouds with custom masks. Here we will create a word cloud of the top words from “The Dark Knight” (2008), shaped by a Batman symbol mask. Script Link
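
As a rough illustration of how a mask constrains the layout: in wordcloud, pure white mask pixels are treated as masked out, and all other pixels are free to draw on. The sketch below applies that rule to a tiny grayscale grid written as nested lists. The grid and the function name `drawable_fraction` are assumptions for illustration; a real mask would be a NumPy array loaded from an image.

```python
WHITE = 255  # pure white marks pixels where no words are drawn

def drawable_fraction(mask_rows):
    """Return the share of pixels in a grayscale mask that can hold words."""
    pixels = [value for row in mask_rows for value in row]
    drawable = sum(1 for value in pixels if value != WHITE)
    return drawable / len(pixels)

# Hypothetical 3x4 mask: 0 is the black logo shape, 255 is white background.
tiny_mask = [
    [255,   0,   0, 255],
    [  0,   0,   0,   0],
    [255,   0,   0, 255],
]
fraction = drawable_fraction(tiny_mask)
```

A mask whose background is not pure white would leave the background drawable, so the shape would not show up in the cloud.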

from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
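# Reading the script
with open("batman.txt") as f:
    script = f.read()
# Set of Stop words
stopwords = set(STOPWORDS)
# Load the Batman logo as a NumPy array to use as the mask
batman_mask = np.array(Image.open("batman-logo.png"))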

# Custom Colormap
from matplotlib.colors import LinearSegmentedColormap
colors = ["#000000", "#111111", "#101010", "#121212", "#212121", "#222222"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)

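# Create WordCloud Object
# Note: when a mask is given, wordcloud uses the mask's shape and ignores width and height
wc = WordCloud(background_color="white", stopwords=stopwords, mask=batman_mask,
               width=1987, height=736, colormap=cmap)
# Generate WordCloud
wc.generate(script)
# Show the WordCloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")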

Word Cloud 3 - Colored Mask

We will create a Word Cloud of the “Captain America: Civil War” script with the following mask.

Civil War Mask

This method colors each word with the average color of the mask image in the area that the word covers.
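
The idea behind this coloring can be sketched in a few lines: take the rectangle a word occupies and average the image's RGB values inside it. The code below is a simplified stand-in written with nested lists, not the ImageColorGenerator implementation; the image data, box coordinates, and the function name `mean_color` are assumptions for illustration.

```python
def mean_color(image, top, left, height, width):
    """Average the RGB values inside a rectangle of a nested-list image."""
    region = [
        image[y][x]
        for y in range(top, top + height)
        for x in range(left, left + width)
    ]
    n = len(region)
    # Average each channel separately and round to whole 0-255 values.
    return tuple(round(sum(px[c] for px in region) / n) for c in range(3))

# Hypothetical 2x2 image: red on the left half, blue on the right half.
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [
    [RED, BLUE],
    [RED, BLUE],
]
left_color = mean_color(image, top=0, left=0, height=2, width=1)
whole_color = mean_color(image, top=0, left=0, height=2, width=2)
```

A word sitting entirely on the red half stays red, while a word spanning both halves gets a blend of the two, which is why large words over busy areas of the mask can look muddy.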

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
mask = np.array(Image.open("civilwar.jpg"))
# Reading the script
with open("civilwar.txt") as f:
    script = f.read()
# Set of Stop words
stopwords = set(STOPWORDS)
# Create WordCloud Object
wc = WordCloud(background_color="white", stopwords=stopwords, 
               width=1280, height=628, mask=mask)
wc.generate(script)
# Image Color Generator
image_colors = ImageColorGenerator(mask)

plt.figure()
plt.imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")

Word Cloud 4 - Canon of Sherlock Holmes

In this example, we will create a word cloud from the “Canon of Sherlock Holmes”.

import random
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
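# Custom Color Function: a random shade of light grey for each word
def grey_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    # Use the random_state passed in by recolor() so the result is reproducible
    rng = random_state if random_state is not None else random
    return "hsl(0, 0%%, %d%%)" % rng.randint(60, 100)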

# Reading the script
with open("canon.txt") as f:
    script = f.read()
# Set of Stop words
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("will")
# Load the mask image as a NumPy array
mask = np.array(Image.open("sherlock.jpeg"))

wc = WordCloud(background_color="black", stopwords=stopwords, mask=mask,
               width=875, height=620,  font_path="lato.ttf")
wc.generate(script)
plt.figure()
plt.imshow(wc.recolor(color_func=grey_color_func, random_state=3), 
            interpolation="bilinear")
plt.axis("off")

Word Cloud 5 - Trump Tweets

I have collected the last 193 tweets from Mr. Donald Trump, excluding retweets, and removed URLs and hashtags from them. We will make a Word Cloud of the top words from these tweets.
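
The cleaning step itself is not shown in this post. One possible way to do it is sketched below with regular expressions from the standard library. The patterns, the "RT @" retweet check, the function name `clean_tweets`, and the sample tweets are all assumptions for illustration, not the procedure actually used to build trump.txt.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_tweets(tweets):
    """Drop retweets, then strip URLs and hashtags from the remaining tweets."""
    cleaned = []
    for tweet in tweets:
        if tweet.startswith("RT @"):
            continue  # assumption: retweets are marked with a leading "RT @"
        text = URL_RE.sub("", tweet)
        text = HASHTAG_RE.sub("", text)
        # Collapse the whitespace left behind by the removals.
        cleaned.append(" ".join(text.split()))
    return cleaned

# Made-up tweets, for illustration only.
raw = [
    "Great rally tonight! #Example https://example.com/abc",
    "RT @someone: this retweet is skipped",
]
script = "\n".join(clean_tweets(raw))
```

The joined string plays the same role as the contents of the text file read in the code for this example.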

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
mask = np.array(Image.open("trump.jpg"))
# Reading the script
with open("trump.txt") as f:
    script = f.read()
# Set of Stop words
stopwords = set(STOPWORDS)
stopwords.add("will")

from matplotlib.colors import LinearSegmentedColormap
colors = ["#BF0A30", "#002868"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)

# Create WordCloud Object
# Note: when a mask is given, wordcloud uses the mask's shape and ignores width and height
wc = WordCloud(background_color="white", stopwords=stopwords,
               font_path="titilium.ttf",
               width=853, height=506, mask=mask, colormap=cmap)
wc.generate(script)


plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")

Word Cloud 6 - All Star Wars Scripts
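
Since this cloud covers several scripts at once, the texts have to be combined into one string before the cloud is generated. How starwars.txt was actually assembled is not described here; the sketch below is one possible approach, assuming each script is stored as its own UTF-8 text file. The function name `combine_scripts` and the file names in the comment are hypothetical.

```python
from pathlib import Path

def combine_scripts(paths):
    """Read several text files and join them into one string."""
    return "\n".join(Path(p).read_text(encoding="utf-8") for p in paths)

# Example call with hypothetical file names:
# script = combine_scripts(["episode4.txt", "episode5.txt", "episode6.txt"])
```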

import random
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
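# Custom Color Function: a random shade of light grey for each word
def grey_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    # Use the random_state passed in by recolor() so the result is reproducible
    rng = random_state if random_state is not None else random
    return "hsl(0, 0%%, %d%%)" % rng.randint(60, 100)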

# Reading the script
with open("starwars.txt") as f:
    script = f.read()
# Set of Stop words
stopwords = set(STOPWORDS)
stopwords.add("will")
# Load the mask image as a NumPy array
mask = np.array(Image.open("darthvader.jpg"))

wc = WordCloud(background_color="black", stopwords=stopwords, mask=mask,
               width=736, height=715, font_path="lato.ttf")
wc.generate(script)
plt.figure()
plt.imshow(wc.recolor(color_func=grey_color_func, random_state=3),
           interpolation="bilinear")
plt.axis("off")

That’s all for Word Clouds. We will continue this series with more visualization tutorials. Check out the following references and books to learn more, and check out this GitHub Repo for the code and more visualizations.

Resources

Data Visualization Books

1) Storytelling with Data: A Data Visualization Guide for Business Professionals

2) The Truthful Art: Data, Charts, and Maps for Communication

3) Data Visualization: a successful design process

4) Data Visualisation: A Handbook for Data Driven Design