SQL Joins Don’t Scale? - Issue 2

Hello and welcome to the second issue of my Data Analytics Journal!

If you don’t remember subscribing to my newsletter, you might be one of my LinkedIn contacts. You can unsubscribe at the bottom (and miss interesting reads on data analysis). If you choose to stick around, you will receive a weekly newsletter every Wednesday.

This week: 

  • Nail your SQL interview (all levels break down)  

  • R makes a comeback.

  • Joins don’t scale?

  • Join a small group of Data Scientists for COVID19 research! 


Weekend Longread

I explored a public sample compiled of random Reddit posts to give a detailed overview of how to use Pandas and SQL for exploratory data analysis. Read about it here.

What’s New This Week

  • Data Engineering is about to get even more exciting if you use Google Cloud and AWS. This week Google introduced BigQuery Omni for Multi-Cloud Data Analytics. This multi-cloud data analysis tool lets you analyze data in Google Cloud and AWS (Azure coming soon). It is based on SQL and BigQuery APIs.

  • R makes a comeback. Yes, you read that right. According to the 2020 TIOBE July report, R climbs back into 8th place (up from 20th) of most used programming languages (still, far behind Python). Being an R fan myself (I made a switch from R to Python only a few years ago), I am optimistic that this marks the beginning of a revival. I’ll speak more about R’s return to relevancy in one of my next issues.

Nailed It

Be prepared for your next interview

In Issue 2, we’re going to focus on SQL a little more. Regardless of which position you’re applying for - product analyst, data analyst, data scientist, or data engineer - your SQL knowledge will be tested thoroughly.

I’m a big SQL enthusiast myself. I use it today extensively, not only for data extraction, but also for formatting, casting, and analysis. Therefore, when interviewing candidates, I put a  lot of emphasis on a candidate’s SQL knowledge and comfort level. 

Tested SQL skills for different levels: 

Beginner: At the beginning level, I am primarily concerned that the candidate has an understanding of general SQL concepts like types of JOINS, WHERE, HAVING. Be prepared to explain the differences between LEFT and RIGHT JOIN, OUTER and INNER, WHERE and HAVING.

Intermediate: At the intermediate level, I’m looking for the ability to not only extract data but get that data into the right format for analysis. Moreover, I like to see that the candidate is comfortable applying different math operations, like AVG, MIN, %, while maintaining appropriate SQL format. Be prepared to use a subquery, apply aggregations, to explain the difference between UNION and UNION ALL.

Advanced: Finally at the advanced level, I want to see a comfort level with data manipulations, feature engineering, and formatting, as well as a deep and thorough understanding of cost and time optimizations. Be prepared to explain the difference between DROP, DELETE, and TRUNCATE, along with INSERT and UPDATE, evaluate the pros and cons of each of these, apply SELF JOIN, and window functions. 

Read the full article here.

Good luck! 

Do Some Good.

It pays back.

  • If you are looking for ways to get involved in machine learning while doing some good for the world, you have the opportunity to join a small team of data scientists to work together on Covid19 analysis (in Kaggle or GitHub). You need to have an understanding of exploratory data analysis, basic ML knowledge, and some python. Please let me know (respond to this email) if you’re interested, and I will put you in touch with the project owner. 

Expert Spotlight on NoSQL vs. RDBMS

Discussion to follow — JOINs don’t scale?

A long story short: Alex DeBrie (AWS Data Engineer and the author of The DynamoDB Book) wrote an article about how NoSQL (DynamoDB specifically) scales where relational databases do not. Franck Pachot (database evangelist at DBIServices) responded with his in-depth research and analysis that demonstrates that joins actually scale just fine. Some NoSQL fans disagreed, providing more and more valid points. Suspense. Regardless who is right or wrong, I highly recommend reading Franck’s analysis. 

Here are a few takeaways:

  • As you probably already know, there are two ways to approach and work with data — structured and unstructured, or RDBMS and NoSQL. Both have their own specifics, pros, and cons.

  • SQL isn't “a black box” and you can use the EXPLAIN feature. But, it's hard to interpret how it's going to work as data scales.

  • The cost depends only on the height of the data structure and not the size of the table. You can increase the table exponentially before the height of the index increases. It’s a very important concept which demonstrates RDBMS scalability.

  • Yes, adding joins increases the time complexity for each table.  

  • But the size of the table has nothing to do with the inner loop cost.

  • The major RDBMS databases of today offer many features to enhance performance and the predictability of response time. 

  • NoSQL databases are key-value optimized data stores. “When you start to split a document into multiple entities with their own keys, remember that relational databases were invented for that.” (read the article comments for more details)

  • “...Distributed NewSQL will win the war against NoSQL”

As a data analyst who spends a lot of time working with reports and analysis, I stand behind the RDBM and am convinced that SQL is the best way to retrieve and manipulate data. 

Drink and Mingle

Upcoming events, meetups, talk, webinars

Try It Out

A quick Python exercise. Because it’s fun.

Assign 200 to x. If x > 200, print out “Big”; If x > 100 and x <= 200, print out “Average”; and If x <= 100, print out “Small”

Try it yourself and check your solution here.

Until next Wednesday!