I can’t believe it’s been a month already since I started writing my Data Analytics Journal! Thank you all for your feedback, support, and love.
I started writing this journal to bring classic Data Analysis back.
As technologies mature, they stop being a cutting-edge advantage and become a competitive expectation for many companies. This pushes analysts to focus on “how” rather than “what” or “why” in their work. I want to encourage and inspire my fellow data fans and enthusiasts to go back to the basics and enjoy genuine data analysis.
I received many questions and requests to cover specific topics: A/B testing, multivariate product experiments, ML model selection, tackling SaaS metrics, SQL for calculating user retention, etc. I’ll make sure to focus on those in upcoming issues. Today, I wanted to wrap up my first month of writing. And, as I am finishing my glass of merlot, I wanted to summarise some quick takes from the past month’s newsletters and published stories.
Weekend Longreads From Data Analytics Journal:
Product Analysis: read about User Segmentation Technique, learn how to identify power users on your platform, and how to use this data to scale your user growth.
Data Analysis: I explored a public sample compiled from random Reddit posts to give a detailed overview of how to use Pandas and SQL for exploratory data analysis. Read about it here.
Data Science: Learn about Supervised Machine Learning, and the Logistic Regression model here, where I talk about common use cases and walk through the steps of applying this algorithm for data forecasting.
Data Science: Another classification model for Supervised Machine Learning - Support Vector Machines. Learn when to apply it, what it stands for, and appropriate use-cases here.
10 Quick Takes From Last Month’s Newsletters:
Sad truth - Data Scientists spend most of their time doing data cleaning, according to the recently published Anaconda survey. Here is the full breakdown of time spent: data loading - 19%, data cleaning - 26%, data visualisation - 21%, model selection - 11%, model training and scoring - 12%, deploying models - 11%.
Don’t rush to adopt or average industry-accepted metrics for your project. Different businesses often measure the same metric completely differently. According to Brian Balfour, benchmarks can be helpful, but watch out for benchmark traps. Check out Brian’s analysis to learn the best practices for approaching and measuring growth.
R makes a comeback. According to the July 2020 TIOBE report, R has climbed back into 8th place (up from 20th) among the most used programming languages (still far behind Python). Given that most academic medical data research is done in R, most COVID researchers are encouraged to use R as well. That could be one of the reasons R has seen a recent resurgence in popularity.
Check out this guide for SQL Interview questions and preparation tips.
Discussion to follow - Joins Don’t Scale? Adding joins increases the time complexity for each table, but the size of the table has nothing to do with the inner loop cost. Today’s major RDBMSs offer many features to improve performance and the predictability of response time.
Interested in learning Spark? Check out this complete guide.
Last week I wrote about an ongoing Meow attack that wiped out 1,000 databases. Hackers deleted all user data, leaving only the word “Meow” behind. This week, that number has increased to 4,000 databases! You’d better encourage your company to secure sensitive user data ASAP. Cats do not joke around.
If you want to brush up on the core concepts for data science, analysis, or statistics, check this list of the best data science books.
If you want to connect with other women in technology for networking, idea sharing, or the best margarita recipes, you can do so here.
Watch the live Penguin Cam from the California Academy of Sciences in San Francisco. It’s not related to big data, but it’s calming, and penguins are cute.
Drink and Mingle
Upcoming free events, meetups, talks, and webinars
Aug 6, DAA: Analytics Trivia
Aug 6, Data Science Summit: Interpretable Machine Learning For Enterprise
Aug 8, Spectra: Join Hackathon
Aug 15, PyBay 2020
Aug 18, RMDS Lab: How to leverage AI and ML to improve Data Quality
Aug 21, Girls in Tech SF: Hacking for humanity 2020
Aug 27, DSS: Data Science Salon Elevate
Sept 26, OpenMined: Data Privacy Conference 2020
Try It Out
A quick Python exercise. Because it’s fun.
You have the following list of numbers:
n = [1,2,3,4,5,6]
Iterate over this list, printing out each list value multiplied by 10.
Try it yourself and check your solution here.
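If you want to compare notes, here is one possible way to do it. Treat it as a minimal sketch rather than the only right answer. The `scaled` list at the end is my own addition, not part of the exercise: it builds the same values with a list comprehension in case you want to keep them around instead of just printing them.

```python
n = [1, 2, 3, 4, 5, 6]

# Walk through the list and print each value multiplied by 10.
for number in n:
    print(number * 10)

# Optional extra (not required by the exercise): keep the results in a new list.
scaled = [number * 10 for number in n]
```

The loop prints 10, 20, 30, 40, 50, and 60, one per line.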
If you missed my previous weekly newsletters, here are the links:
Thank you all again for your support and for sharing this ride with me.
Until next Wednesday!