Analysis: Exploratory Data Analysis Using Python Pandas and SQL

What is EDA? What is the easiest and fastest way to start? A walk through the most common commands in Python Pandas.

The Pandas Python library is becoming more and more popular among data scientists and analysts. It lets you quickly load, process, transform, analyze, and visualize data.

When you work with Pandas, the most important thing to understand is that there are two main data structures — Series and DataFrame:

  • Series is a one-dimensional indexed array that can hold any data type (integers, floats, strings, etc.).

  • DataFrame is the primary Pandas data structure. It is a two-dimensional data class (rows and columns), where every column may contain a different data type. You may also pass an index and additional columns when constructing a DataFrame.
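A minimal sketch of constructing both structures directly (the values here are made up for illustration):

```python
import pandas as pd

# A Series: one-dimensional, indexed array of a single dtype
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: two-dimensional; each column may hold a different dtype
df = pd.DataFrame(
    {"title": ["post one", "post two"], "score": [42, 7]},
    index=[100, 101],
)

print(s["b"])             # look up a Series value by its index label
print(list(df.columns))   # the DataFrame's column labels
```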

Below I will walk through some common commands for exploratory data analysis, using Pandas and SQL on a public sample of random Reddit posts.

Importing the packages

We start by importing the packages that will be used for our data analysis:

import numpy as np   # linear algebra
import pandas as pd  # data processing
import seaborn as sns  # statistical graph package
import matplotlib.pyplot as plt  # plot package for visualisations
import pandasql as ps  # sql package

You will only need the pandasql package if you are going to run SQL. If you are working with visuals, you may also want to choose a plot style:

#plt.style.use('bmh')  # 'bmh' is the "Bayesian Methods for Hackers" style sheet
plt.style.use('ggplot')  # R ggplot style

To see all available plot styles, you can run:

print(plt.style.available)

Reading the data

As I mentioned above, I’ll be running the analysis on a publicly released dataset of random Reddit posts published on Kaggle. You can find it on Kaggle by searching for the r/dataisbeautiful posts dataset.

Reading the CSV file in Pandas:

df = pd.read_csv('/kaggle/input/dataisbeautiful/r_dataisbeautiful_posts.csv') 
df.head(5)

We can see that there are 183,890 entries in the given dataset. A caveat: not all columns are complete, and there are some missing/null values. You can flag non-missing and missing values with df.notnull() or df.isnull() respectively, and drop the affected rows with df.dropna().
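For example, missing values can be counted per column and then dropped. A minimal sketch on a small made-up frame (so it runs without the Kaggle file):

```python
import pandas as pd

# Toy stand-in for the Reddit dataset; None marks a missing value
df = pd.DataFrame({"id": [1, 2, 3], "removed_by": ["moderator", None, "user"]})

print(df.isnull().sum())  # count of missing values in each column

# Keep only rows where removed_by is present
clean = df.dropna(subset=["removed_by"])
print(len(clean))
```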

Running EDA

Exploratory data analysis is often a first, brief look at your dataset that helps you understand its structure, shape, and size, and discover patterns. Below I’ll demonstrate a few common commands for EDA and show how to run SQL statements in Pandas.

We can start by running basic DataFrame exploratory commands:

df.info()            # column dtypes, non-null counts, memory usage
df.describe()        # summary statistics for numeric columns
#or df.count()       # non-null count per column
df.score.describe()  # summary statistics for the score column only
df.score.median()    # median score
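On a small toy Series these commands behave as follows (the score name matches the Reddit dataset, but the values here are made up):

```python
import pandas as pd

scores = pd.Series([1, 2, 2, 10, 100], name="score")

print(scores.describe())  # count, mean, std, min, quartiles, max
print(scores.median())    # middle value; robust to the outlier 100
```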

Running SQL in Pandas

One of the biggest advantages of the Pandas library is that it works well with SQL and tabular data. One way to run SQL statements is to import the pandasql package and call the following:

q1 = """SELECT removed_by, count(distinct id)as number_of_removed_posts
FROM df 
where removed_by is not null 
group by removed_by 
order by 2 desc """ 
grouped_df = ps.sqldf(q1, locals())
grouped_df
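The same aggregation can also be expressed natively in Pandas with groupby, avoiding the SQL round trip. A sketch on a small made-up frame (the column names match the Reddit dataset, the values do not):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "removed_by": ["moderator", "moderator", "user", None, "automod_filtered"],
})

# Filter nulls, count distinct ids per remover, sort descending
grouped_df = (
    df[df["removed_by"].notnull()]
    .groupby("removed_by")["id"]
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name="number_of_removed_posts")
)
print(grouped_df)
```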

This returns the following DataFrame:

Visualizing data using SQL output:

Let’s build a bar chart from the output of the code above to visualize our data:
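A minimal matplotlib sketch of such a chart, using a made-up stand-in for the grouped SQL output above (the real counts come from the Kaggle dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the SQL output; values are illustrative only
grouped_df = pd.DataFrame({
    "removed_by": ["moderator", "deleted", "automod_filtered", "user"],
    "number_of_removed_posts": [62000, 18000, 6000, 5500],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(grouped_df["removed_by"], grouped_df["number_of_removed_posts"])
ax.set_xlabel("removed_by")
ax.set_ylabel("number_of_removed_posts")
ax.set_title("Removed posts by remover")
fig.tight_layout()
fig.savefig("removed_posts.png")  # write the chart to a file
```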

From the chart above we can see that, of all removed posts, 68% were removed by a moderator and 6% by the users themselves.

There are many types of plots and analyses, from basic statistics to complex visuals and forecasts, that you can build and perform in Pandas. You can find a more complete analysis on Kaggle.
