DataViz Mastery Part 2 - Word Clouds
This is part 2 of DataViz Mastery. In part 1, we learned how to create Treemaps using Python (read it here). In this post, we will learn how to create Word Clouds using Python. So, let's get started.
Word Cloud
A Word Cloud (or tag cloud) is a visual representation of text data, typically used to depict keyword metadata (tags) on websites, to visualize free-form text, or to analyze speeches (e.g. election campaigns). Tags are usually single words, and the importance of each tag is shown with font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence.
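To make "importance shown with font size" concrete, here is a minimal sketch that counts words and scales each count linearly to a font size. The helper name `font_sizes`, the size range, and the sample sentence are illustrative assumptions; the wordcloud library used later in this post has its own, more elaborate sizing and layout logic.

```python
from collections import Counter
import re


def font_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size proportional to its frequency.

    Hypothetical helper for illustration; not part of any library.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; rarer words scale down toward min_size.
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


sizes = font_sizes("data beats opinion and data beats guesswork")
for word, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {size:.0f}")
```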
Examples
- Top 1000 most common passwords
- Word Cloud of Trump Insults
The Code
Required Libraries
Creating a Word Cloud is very easy with the help of the wordcloud library developed by Andreas Mueller.
Word Cloud 1 - Simple
We will create a Word Cloud of the top words from the Wonder Woman movie. We will use the movie script provided on this website. We need to remove stop words from the script before creating the cloud. The wordcloud library provides a list of stop words, which we will use here.
from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
# Read the script
with open("wonderwoman.txt") as f:
    script = f.read()
# Copy the built-in stop words and add one of our own
stopwords = set(STOPWORDS)
stopwords.add("will")
# Create the WordCloud object
wc = WordCloud(background_color="white", stopwords=stopwords,
               width=1600, height=900, colormap=matplotlib.cm.inferno)
# Generate the word cloud from the raw text
wc.generate(script)
# Show the word cloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
It's very clear that "Diana" is the most frequent word in the script once stop words are removed.
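The effect of removing stop words can be sketched with the standard library alone. The stop list below is a tiny, hypothetical stand-in (the real STOPWORDS set in wordcloud is much larger), and `top_words` and the sample text are made up for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop list; wordcloud's STOPWORDS is much larger.
DEMO_STOPWORDS = {"the", "a", "is", "of", "and", "to", "will"}


def top_words(text, stopwords, n=3):
    """Return the n most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words if w not in stopwords]
    return Counter(kept).most_common(n)


sample = "Diana will go to the front and Diana will fight. The war is the enemy of Diana."
print(top_words(sample, set()))           # 'the' and 'will' rank near the top
print(top_words(sample, DEMO_STOPWORDS))  # only content words remain
```

Without filtering, words such as "the" compete with the terms we actually care about, which is why the stop-word set is passed to WordCloud.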
Word Cloud 2 - With Mask
We can also create Word Clouds with custom masks. We will create a word cloud of the top words from "The Dark Knight" (2008) with a Batman symbol mask. Script Link
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
# Read the script
with open("batman.txt") as f:
    script = f.read()
stopwords = set(STOPWORDS)
# Load the mask image as an array
batman_mask = np.array(Image.open("batman-logo.png"))
# Custom colormap of near-black shades
colors = ["#000000", "#111111", "#101010", "#121212", "#212121", "#222222"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)
# When a mask is given, its shape is used and width/height are ignored
wc = WordCloud(background_color="white", stopwords=stopwords, mask=batman_mask,
               width=1987, height=736, colormap=cmap)
wc.generate(script)
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
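According to the wordcloud documentation, pure white pixels in a mask are treated as masked out and words are drawn on everything else. Before using an image as a mask, it can help to check how much of it is actually drawable. The sketch below assumes an RGB or greyscale array; `drawable_fraction` is a hypothetical helper, and the tiny array stands in for a real image such as batman-logo.png.

```python
import numpy as np


def drawable_fraction(mask):
    """Fraction of mask pixels that are not pure white (i.e. free to draw on)."""
    mask = np.asarray(mask)
    if mask.ndim == 3:
        # Colour image: a pixel is masked out only if all RGB channels are 255.
        white = np.all(mask[:, :, :3] == 255, axis=-1)
    else:
        # Greyscale image: 255 means masked out.
        white = mask == 255
    return 1.0 - white.mean()


# 4x4 white canvas with a 2x2 black square in the middle.
demo = np.full((4, 4, 3), 255, dtype=np.uint8)
demo[1:3, 1:3] = 0
print(demo.shape[:2], drawable_fraction(demo))
```

A fraction near 1 suggests the image has no white background, so the cloud would fill the whole rectangle; a fraction near 0 leaves almost no room for words.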
Word Cloud 3 - Colored Mask
We will create a Word Cloud of the "Captain America: Civil War" script with the following mask.
This method colors each word with the average color of the mask image in the area the word covers.
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
# Load the colored image; it serves as both mask and color source
mask = np.array(Image.open("civilwar.jpg"))
# Read the script
with open("civilwar.txt") as f:
    script = f.read()
# Set of stop words
stopwords = set(STOPWORDS)
# Create the WordCloud object
wc = WordCloud(background_color="white", stopwords=stopwords,
               width=1280, height=628, mask=mask)
wc.generate(script)
# Color generator that samples colors from the image
image_colors = ImageColorGenerator(mask)
plt.figure()
plt.imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")
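The idea of coloring a word with the average color of the image area beneath it can be sketched with NumPy. This only illustrates the averaging step and is not the library's code; `average_color`, the box coordinates, and the two-color test image are hypothetical.

```python
import numpy as np


def average_color(image, top, left, height, width):
    """Mean RGB colour of a rectangular patch, as an (r, g, b) tuple of ints."""
    patch = image[top:top + height, left:left + width, :3]
    if patch.size == 0:
        raise ValueError("patch lies outside the image")
    r, g, b = patch.reshape(-1, 3).mean(axis=0)
    return int(r), int(g), int(b)


# Left half red, right half blue.
image = np.zeros((2, 4, 3), dtype=np.uint8)
image[:, :2] = (255, 0, 0)
image[:, 2:] = (0, 0, 255)

print(average_color(image, 0, 0, 2, 2))  # patch inside the red half
print(average_color(image, 0, 1, 2, 2))  # patch straddling both halves
```

A word that straddles two regions ends up with a blend of their colors, which is why large words near color boundaries can look muddy.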
Word Cloud 4 - Canon of Sherlock Holmes
In this example, we will create a word cloud from the "Canon of Sherlock Holmes".
import random
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
# Custom color function: a random light grey for every word
def grey_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
# Read the text
with open("canon.txt") as f:
    script = f.read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("will")
mask = np.array(Image.open("sherlock.jpeg"))
# When a mask is given, its shape is used and width/height are ignored
wc = WordCloud(background_color="black", stopwords=stopwords, mask=mask,
               width=875, height=620, font_path="lato.ttf")
wc.generate(script)
plt.figure()
plt.imshow(wc.recolor(color_func=grey_color_func, random_state=3),
           interpolation="bilinear")
plt.axis("off")
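The color function above draws from the global `random` module, so the `random_state=3` passed to `recolor` does not make the greys reproducible. A possible variant, assuming the library hands the function a `random.Random` instance as `random_state`, uses that generator when it is present:

```python
import random


def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Return a random light grey as an HSL string.

    Uses the supplied random_state when there is one, so a fixed seed
    gives the same colours on every run.
    """
    rng = random_state if random_state is not None else random
    return "hsl(0, 0%%, %d%%)" % rng.randint(60, 100)


# Calling it by hand with two identically seeded generators.
first = grey_color_func("Holmes", 40, (0, 0), None, random_state=random.Random(3))
second = grey_color_func("Holmes", 40, (0, 0), None, random_state=random.Random(3))
print(first, first == second)
```

The doubled `%%` in the format string produces a literal percent sign, so the result is a string like `hsl(0, 0%, N%)` with a lightness N between 60 and 100.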
Word Cloud 5 - Trump Tweets
I have collected the last 193 tweets from Mr. Donald Trump, excluding retweets and with URLs and hashtags removed. We will make a Word Cloud of the top words from these tweets.
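The cleaning script itself is not shown in this post. One plausible way to do this kind of preprocessing is sketched below; the regular expressions, the "RT @" retweet convention, and the sample tweets are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweets(tweets):
    """Drop retweets, then strip URLs and hashtags from the remaining tweets."""
    cleaned = []
    for tweet in tweets:
        if tweet.startswith("RT @"):  # assumed retweet convention
            continue
        text = URL_RE.sub("", tweet)
        text = HASHTAG_RE.sub("", text)
        # Collapse the whitespace left behind by the removals.
        cleaned.append(" ".join(text.split()))
    return cleaned


# Made-up sample tweets, not taken from the real data set.
sample = [
    "Big rally tonight! #Example https://example.com/abc",
    "RT @someone: this one is skipped",
    "Thank you all",
]
print("\n".join(clean_tweets(sample)))
```

Joining the cleaned tweets with newlines gives a single text that could be saved to a file such as trump.txt and passed to `generate`.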
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.colors import LinearSegmentedColormap
from wordcloud import WordCloud, STOPWORDS
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
mask = np.array(Image.open("trump.jpg"))
# Read the cleaned tweets
with open("trump.txt") as f:
    script = f.read()
# Set of stop words
stopwords = set(STOPWORDS)
stopwords.add("will")
# Two-color colormap running from red to blue
colors = ["#BF0A30", "#002868"]
cmap = LinearSegmentedColormap.from_list("mycmap", colors)
# Create the WordCloud object
# When a mask is given, its shape is used and width/height are ignored
wc = WordCloud(background_color="white", stopwords=stopwords,
               font_path="titilium.ttf",
               width=853, height=506, mask=mask, colormap=cmap)
wc.generate(script)
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
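A colormap built from two colors, as in the Trump cloud above, gives each word some shade between the red and the blue. The blending can be sketched as plain linear interpolation in RGB. This is an illustration of the idea, not matplotlib's implementation; `hex_to_rgb` and `blend` are hypothetical helpers.

```python
def hex_to_rgb(color):
    """'#rrggbb' -> (r, g, b) with 0-255 integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def blend(start, end, t):
    """Linearly interpolate between two hex colours; t=0 gives start, t=1 gives end."""
    a, b = hex_to_rgb(start), hex_to_rgb(end)
    mixed = (round(x + (y - x) * t) for x, y in zip(a, b))
    return "#{:02x}{:02x}{:02x}".format(*mixed)


# The two colours used for the Trump cloud, sampled at five points.
for step in range(5):
    print(blend("#BF0A30", "#002868", step / 4))
```

Shades near the middle of the range come out purple, which explains the mix of red, blue, and purple words in the cloud.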
Word Cloud 6 - All Star Wars Scripts
import random
from PIL import Image
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (16.0, 9.0)
# Custom color function: a random light grey for every word
def grey_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
# Read the scripts
with open("starwars.txt") as f:
    script = f.read()
stopwords = set(STOPWORDS)
stopwords.add("will")
mask = np.array(Image.open("darthvader.jpg"))
# When a mask is given, its shape is used and width/height are ignored
wc = WordCloud(background_color="black", stopwords=stopwords, mask=mask,
               width=736, height=715, font_path="lato.ttf")
wc.generate(script)
plt.figure()
plt.imshow(wc.recolor(color_func=grey_color_func, random_state=3),
           interpolation="bilinear")
plt.axis("off")
That's all for Word Clouds. We will continue this series with more visualization tutorials. Check out the following references and books to learn more, and check out this GitHub repo for the code and more visualizations.
Resources
- Word Cloud - DataViz Project
- Word Clouds in Python - wordcloud Documentation
- Twitter Timeline to Word Cloud
Data Visualization Books
- Storytelling with Data: A Data Visualization Guide for Business Professionals
- The Truthful Art: Data, Charts, and Maps for Communication
- Data Visualization: a successful design process
- Data Visualisation: A Handbook for Data Driven Design