Generating a Word Cloud In Python - Issue 30

Learn how to make a word cloud for text analysis

Feb 03, 2021

Building word or text clouds is a very common task for analysts who work with textual, qualitative, or semantic data. They are also common take-home assignments that test candidates’ knowledge of handling, processing, and visualizing text data. Below, I’ll showcase one of the ways to build a word cloud in Python.

There are many applications, tools, and libraries that can help you to generate a word cloud in mere seconds for free (you can check some of those below). That being said, as an analyst, you should be able to create your own visuals in either R or Python, both of which should grant you the freedom to tailor your dataset as needed. Pick a style and customization that works best for you!

Word clouds - what and why

First things first! You’ll want a word cloud when you need to see which words appear most often in your dataset. The more often a word is used, the larger it appears in the cloud. Text clouds are a quick way to spot a pattern or insight, or to gauge how frequently words are used in your data. That makes them a common first request in Exploratory Data Analysis tasks with text data.
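To make the "more frequent means larger" idea concrete, here is a minimal sketch that counts words with the standard library and turns the counts into relative sizes. The sample sentence and the variable names are made up for illustration, and the scaling is a simplification rather than the exact formula the wordcloud package uses.

```python
from collections import Counter
import re

# Hypothetical sample text, not taken from the article's dataset
sample_text = "Data analysts love data. Clean data makes analysts happy."

# Lowercase the text and keep only alphabetic tokens
words = re.findall(r"[a-z]+", sample_text.lower())
counts = Counter(words)

# Scale each word relative to the most frequent one (1.0 = largest)
max_count = max(counts.values())
relative_sizes = {word: count / max_count for word, count in counts.items()}

for word, count in counts.most_common(3):
    print(word, count, round(relative_sizes[word], 2))
```

A word cloud draws the same information visually: the word with the highest count gets the largest font, and the rest shrink in proportion.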

Getting started

For my analysis today, I am choosing the Python wordcloud package. We’ll use NumPy and Pandas for data processing.

import numpy as np # linear algebra

import pandas as pd # data processing

import seaborn as sns # statistical plotting package

import matplotlib.pyplot as plt # plotting package

import wordcloud # used for the word cloud plot

from wordcloud import WordCloud, STOPWORDS # optional to filter out the stopwords

You don’t have to use stopwords to generate a word cloud. It’s advisable to use them, however, to cut down on text noise. You can also define your own set of stop words with anything you like:

stop_words = set(['have', 'when', 'about', 'according', 'who', 'actually', 'zero', ''])
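To see what a stop-word list does to your text, here is a small sketch that filters tokens by hand. The wordcloud package applies the same kind of filtering internally when you pass a set through the stopwords parameter. The sample sentence and the tokenizing regex are my own assumptions for illustration.

```python
import re

# The custom stop words from above (empty string left out)
stop_words = {"have", "when", "about", "according", "who", "actually", "zero"}

# Hypothetical sample sentence
sample_text = "According to analysts who have zero doubts, word clouds actually help"

# Lowercase and split into simple word tokens
tokens = re.findall(r"[a-z']+", sample_text.lower())

# Keep only the tokens that are not stop words
kept = [token for token in tokens if token not in stop_words]
print(kept)
```

Only the words that survive the filter would compete for space in the cloud.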

💡 Tip: if you are unfamiliar with the package, its functions, or its limitations, run ?WordCloud in a Jupyter notebook (or help(WordCloud) in plain Python) to pull up its documentation.

Prepare dataset

Before we proceed with the cloud, we have to tailor our dataset to ensure the values are in an appropriate format.

First, we have to replace NULL values in the title column of our DataFrame (df) with empty strings:

df["title"] = df["title"].fillna(value="")

Now, let’s lowercase the titles and join them into one long string, since the cloud is generated from a single piece of text:

word_string = " ".join(df["title"].str.lower())
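If you want to try the preparation steps without the original data, here is a minimal sketch on a tiny, made-up DataFrame. The three titles are hypothetical; only the title column name and the two steps come from the walkthrough above.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: one title is missing
df = pd.DataFrame({"title": ["Hello World", None, "Data Analysis"]})

# Replace missing titles with empty strings so the join below does not fail
df["title"] = df["title"].fillna(value="")

# Lowercase every title and join them into one string
word_string = " ".join(df["title"].str.lower())
print(word_string.split())
```

Skipping the fillna step would leave a non-string value in the column, and the join would raise a TypeError.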

and... Plotting!

plt.figure(figsize=(15,15))

wc = WordCloud(background_color="purple", stopwords=STOPWORDS, max_words=2000, max_font_size=300, width=1600, height=800)

wc.generate(word_string)

plt.imshow(wc.recolor(colormap="viridis", random_state=17), interpolation="bilinear")

plt.axis('off')

We set a limit of 2,000 words for this cloud. Let’s try lowering the limit to 50 and changing the background color:

plt.figure(figsize=(15,15))

wc = WordCloud(background_color="yellow", stopwords=STOPWORDS, max_words=50, max_font_size=300, width=1400, height=800)

wc.generate(word_string)

plt.imshow(wc.recolor(colormap="viridis", random_state=17), interpolation="bilinear")

plt.axis('off')

The full code is on Kaggle - Word Cloud using Python Pandas.

💡 You probably noticed I am using imshow. It’s a function from the matplotlib package that displays your data as an image. To set or change its parameters, follow this guide.

That’s it for now. In one of my next issues, I’ll demonstrate using masks to generate clouds in the form of a star, a circle, or any shape that you could possibly ever want! (Maybe.)

Check out a list of my favorite go-to online word cloud generators:

TagCrowd.com
MonkeyLearn.com - you have to create an account, but once you are set, they provide a lot of semantic text analysis.
WordArt.com
WordItOut.com - works best with cleaned text.
WordCloudMaker.com

Thanks for reading, everyone. Until next Wednesday!