Hello and welcome to the second issue of my Data Analytics Journal!
If you don’t remember subscribing to my newsletter, you might be one of my LinkedIn contacts. You can unsubscribe at the bottom (and miss interesting reads on data analysis). If you choose to stick around, you will receive a weekly newsletter every Wednesday.
Nail your SQL interview (all levels broken down)
R makes a comeback.
Joins don’t scale?
Join a small group of Data Scientists for COVID-19 research!
I explored a public sample of random Reddit posts to give a detailed overview of how to use Pandas and SQL for exploratory data analysis. Read about it here.
What’s New This Week
Data Engineering is about to get even more exciting if you use Google Cloud and AWS. This week Google introduced BigQuery Omni for Multi-Cloud Data Analytics. This multi-cloud data analysis tool lets you analyze data in Google Cloud and AWS (Azure coming soon). It is based on SQL and BigQuery APIs.
R makes a comeback. Yes, you read that right. According to the July 2020 TIOBE report, R climbed back to 8th place (up from 20th) among the most popular programming languages (still far behind Python). Being an R fan myself (I switched from R to Python only a few years ago), I am optimistic that this marks the beginning of a revival. I’ll say more about R’s return to relevance in one of my next issues.
Interested in NLP deep learning? Check out the slides from last Saturday's NLP deep-dive webinar covering the basics of sentiment analysis and NLP feature engineering. You’re welcome.
Be prepared for your next interview
In Issue 2, we’re going to focus on SQL a little more. Regardless of which position you’re applying for - product analyst, data analyst, data scientist, or data engineer - your SQL knowledge will be tested thoroughly.
I’m a big SQL enthusiast myself. I use it today extensively, not only for data extraction, but also for formatting, casting, and analysis. Therefore, when interviewing candidates, I put a lot of emphasis on a candidate’s SQL knowledge and comfort level.
Tested SQL skills for different levels:
Beginner: At the beginner level, I am primarily concerned that the candidate understands general SQL concepts like the types of JOINs, WHERE, and HAVING. Be prepared to explain the differences between LEFT and RIGHT JOIN, OUTER and INNER JOIN, and WHERE and HAVING.
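To make those distinctions concrete, here is a small sketch you can run with Python's built-in sqlite3 module. The customers/orders tables and all their contents are invented purely for illustration:

```python
import sqlite3

# In-memory database with two toy tables (names and data are made up)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO orders VALUES (1, 1, 50), (2, 1, 70), (3, 2, 20);
""")

# INNER JOIN keeps only customers who have at least one order
inner = conn.execute("""
    SELECT DISTINCT c.name
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# LEFT JOIN keeps every customer, even those with no matching orders
left = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# WHERE filters rows before grouping; HAVING filters groups after aggregation
big_spenders = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.amount >= 50          -- row-level filter (drops Bob's small order)
    GROUP BY c.name
    HAVING SUM(o.amount) > 100    -- group-level filter on the aggregate
""").fetchall()

print(inner)         # [('Ann',), ('Bob',)]
print(left)          # [('Ann', 2), ('Bob', 1), ('Cid', 0)]
print(big_spenders)  # [('Ann', 120.0)]
```

Note how Cid survives the LEFT JOIN with a zero count, and how WHERE removes Bob's order before aggregation while HAVING filters the aggregated totals afterward.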
Intermediate: At the intermediate level, I’m looking for the ability to not only extract data but also get that data into the right format for analysis. Moreover, I like to see that the candidate is comfortable applying different math operations, like AVG, MIN, and percentages, while maintaining proper SQL formatting. Be prepared to use subqueries, apply aggregations, and explain the difference between UNION and UNION ALL.
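Here is a minimal sketch of those three skills, again using sqlite3 with an invented sales table:

```python
import sqlite3

# Toy data, made up for illustration
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 10), ('east', 30), ('west', 50);
""")

# Subquery + aggregation: regions whose average sale beats the overall average
above_avg = conn.execute("""
    SELECT region, AVG(amount)
    FROM sales
    GROUP BY region
    HAVING AVG(amount) > (SELECT AVG(amount) FROM sales)
""").fetchall()

# UNION removes duplicate rows; UNION ALL keeps every row
union = conn.execute(
    "SELECT region FROM sales UNION SELECT region FROM sales").fetchall()
union_all = conn.execute(
    "SELECT region FROM sales UNION ALL SELECT region FROM sales").fetchall()

print(above_avg)       # [('west', 50.0)]  (overall average is 30)
print(len(union))      # 2  (deduplicated: 'east', 'west')
print(len(union_all))  # 6  (3 rows + 3 rows, duplicates kept)
```

UNION ALL is also the cheaper of the two, since it skips the deduplication pass.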
Advanced: Finally, at the advanced level, I want to see a comfort level with data manipulation, feature engineering, and formatting, as well as a deep and thorough understanding of cost and time optimizations. Be prepared to explain the differences between DROP, DELETE, and TRUNCATE, along with INSERT and UPDATE, evaluate the pros and cons of each, and apply SELF JOINs and window functions.
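A sketch of the last two items, with an invented employees table (window functions require SQLite 3.25 or newer; TRUNCATE doesn't exist in SQLite, so that distinction is better rehearsed on a server database like Postgres or MySQL):

```python
import sqlite3

# Toy org chart, made up for illustration
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT,
                            manager_id INTEGER, salary REAL);
    INSERT INTO employees VALUES
        (1, 'Ann', NULL, 120), (2, 'Bob', 1, 90), (3, 'Cid', 1, 100);
""")

# Window function: rank by salary without collapsing rows into groups
ranked = conn.execute("""
    SELECT name, RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
    ORDER BY rnk
""").fetchall()

# SELF JOIN: pair each employee with their manager from the same table
pairs = conn.execute("""
    SELECT e.name, m.name
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    ORDER BY e.name
""").fetchall()

print(ranked)  # [('Ann', 1), ('Cid', 2), ('Bob', 3)]
print(pairs)   # [('Bob', 'Ann'), ('Cid', 'Ann')]
```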
Read the full article here.
Do Some Good.
It pays back.
If you are looking for ways to get involved in machine learning while doing some good for the world, you have the opportunity to join a small team of data scientists working together on COVID-19 analysis (in Kaggle or GitHub). You need to have an understanding of exploratory data analysis, basic ML knowledge, and some Python. Please let me know (respond to this email) if you’re interested, and I will put you in touch with the project owner.
Expert Spotlight on NoSQL vs. RDBMS
Discussion to follow — JOINs don’t scale?
Long story short: Alex DeBrie (AWS Data Engineer and the author of The DynamoDB Book) wrote an article about how NoSQL (DynamoDB specifically) scales where relational databases do not. Franck Pachot (database evangelist at DBIServices) responded with in-depth research and analysis demonstrating that joins actually scale just fine. Some NoSQL fans disagreed, raising more valid points of their own. Suspense. Regardless of who is right or wrong, I highly recommend reading Franck’s analysis.
Here are a few takeaways:
As you probably already know, there are two ways to approach and work with data — structured and unstructured, or RDBMS and NoSQL. Both have their own specifics, pros, and cons.
SQL isn't a “black box”: you can use the EXPLAIN feature to inspect the query plan. Still, it can be hard to predict how that plan will behave as the data scales.
The lookup cost depends only on the height of the index structure (a B-tree), not on the size of the table. You can grow the table exponentially before the height of the index increases by one level. It’s a very important concept, and it demonstrates RDBMS scalability.
Yes, adding joins increases the time complexity for each table.
But the size of the table has nothing to do with the inner loop cost.
The major RDBMS databases of today offer many features to enhance performance and the predictability of response time.
NoSQL databases are key-value optimized data stores. “When you start to split a document into multiple entities with their own keys, remember that relational databases were invented for that.” (read the article comments for more details)
“...Distributed NewSQL will win the war against NoSQL”
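The height argument from the takeaways above is easy to sanity-check with a few lines of Python. Assuming a made-up but plausible fanout of 200 keys per B-tree page, the index height grows only logarithmically with the row count:

```python
import math

# Fanout: how many keys fit in one index page (illustrative order of magnitude)
FANOUT = 200

def index_height(n_rows: int) -> int:
    """Approximate B-tree height: ceil(log base FANOUT of n_rows)."""
    return max(1, math.ceil(math.log(n_rows, FANOUT)))

for n in (10_000, 1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:>14,} rows -> index height {index_height(n)}")
```

The table grows by a factor of a million (from ten thousand rows to ten billion) while the height only goes from 2 to 5, which is the scalability claim in a nutshell: each lookup touches a handful of pages no matter how big the table gets.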
As a data analyst who spends a lot of time working with reports and analysis, I stand behind the RDBMS and am convinced that SQL is the best way to retrieve and manipulate data.
Drink and Mingle
Upcoming events, meetups, talks, and webinars
July 22 AI model training and inference on Protected Health Information
July 22 Galvanize: Intro to deep learning and neural networks. Classification.
July 22 Alexa Live 2020: Developer community event
July 22 R Ladies: I built an R model. Now what?
July 22 AWS: AI Powered Applications With AWS AI Service
July 30 DSS elevate women in data
July 30 Zillow: Women in tech event covering data science, engineering, and analytics.
July 31 CDS: What is the Chartered Data Scientist Program?
Aug 6 Grid Dynamics: Machine learning for enterprise
Sept 6 TigerGraph 2020 Hackathon: (relevant for data architects, data engineers, and all graph database fans)
Try It Out
A quick Python exercise. Because it’s fun.
Assign 200 to x. If x > 200, print out “Big”; if x > 100 and x <= 200, print out “Average”; and if x <= 100, print out “Small”.
Try it yourself and check your solution here.
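If you want to compare notes before clicking through, here is one way to write it. Checking x > 200 first lets the second branch's upper bound stay implicit:

```python
def size_label(x: int) -> str:
    if x > 200:
        return "Big"
    elif x > 100:   # x <= 200 is implied, since the first branch failed
        return "Average"
    else:           # x <= 100
        return "Small"

x = 200
print(size_label(x))  # Average
```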
Until next Wednesday!