Welcome to Issue 5 of my Data Analytics Journal newsletter, where I write about data analysis, data science, and business intelligence.
If you just joined us, this is a weekly newsletter to summarise the events and news of the last week and to highlight the new articles and helpful materials recently published in my Data Analytics Journal.
In today’s newsletter, we’ll be reviewing:
Decision Tree Classifier overview - a popular algorithm commonly used in supervised machine learning for classification and regression
How to migrate 40TB SQL Server Database
Free workshops, books, and courses to learn data science, math, and algorithms
The Adjacent User concept and how to use it to improve user engagement and retention
Let’s get started!
Read about the Decision Tree Classifier - a popular supervised machine learning algorithm which can be used for predicting categorical and continuous variables, and learn how to apply it for a binary classification problem.
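To give a feel for what a decision tree does at each node, here is a minimal sketch of a single split (a one-level "decision stump") chosen by Gini impurity. It is not the code from the article: the function names `gini` and `best_split` and the toy churn data are made up for illustration, and a real project would more likely use a library implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must put at least one sample on each side
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: weekly hours of product use and whether the user churned (1) or stayed (0).
hours = [1, 2, 3, 8, 9, 10]
churned = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(hours, churned)
print(threshold, impurity)
```

A full decision tree repeats this search recursively on each side of the split until the groups are pure enough or a depth limit is reached.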
What’s new this week
Read a comprehensive guide on how to scale Relational SQL Databases, which describes how to use data types and indexes appropriately, normalize and decompute data, leverage materialized views, compress data storage, and work with bulk INSERT and UPDATE statements. An amazing write-up with all the essential database components gathered together in one piece. If you are a data analyst, it’s very helpful to see a breakdown of RDBMS components to understand why you need to optimize your SQL queries.
Amazon Fraud Detector is now available to Amazon's customers. It’s a new service that helps them identify potentially fraudulent activity (fake accounts, fraudulent transactions, payments made from stolen credit cards, etc.). To use a detector, you have to upload your historical data and select the ML detector model for your analysis. The service automatically analyzes your data, performs feature engineering, selects algorithms, and trains your model. Simple and easy.
Have you ever wondered what it takes to migrate data from one infrastructure to another? Read this long and painful step-by-step account of how to migrate a 40TB SQL Server database at Stack Overflow. It’s a lot of data, and it took them 11 months to complete the migration. I worked on a similar large data migration project two years ago, where we had to move all user data from the old infrastructure into the new one. It took us many months of tests, cookies, sleepless nights, and prayers. So I can definitely relate to some of the frustrations and learning points highlighted by the author.
Check these free learning resources
Packtpub, a new workshop platform, is offering free workshops in web development and programming. Learn Python, Go, and SQL for free with hands-on exercises. No, they don’t pay me for the shoutout.
Looking for Data Science courses? Check 365datascience. It’s one of my favorite Data Science learning platforms (with some free courses open now). I like how it structures the content you need: Statistics, Data Analysis, Math, SQL, Python, Probabilities, and is focused mostly on the terminology and concepts you will be using at your work. It sets the right expectation on what you need to learn and is “up to date” with the industry demands. They don’t pay me for this promotion either.
If you are struggling to learn and understand algorithms, check these cards! The I Love Algorithms card deck was developed by a group of Stanford students to help you understand the logic and remember the concepts.
How about math? I know you don’t like it, but if you are tackling machine learning, it never hurts to brush up on your calculus. If you like the idea, enjoy this free full undergraduate-level textbook on calculus.
And here is a nice resource on computational linear algebra, a course taught in the University of San Francisco's Master of Science in Analytics program. You are welcome.
Volunteers at S-Cube World (a non-profit community of students) will be conducting a Python workshop starting this week. During this two-week bootcamp, participants will learn Python from scratch and also code their own game. Anyone looking to learn or brush up on Python can find more details here. Registrations are open here.
Expert Spotlight on Growth and Strategy
For today’s issue, I wanted to highlight the concept of Adjacent Users developed by Bangaly Kaba (Former Head of Growth at Instagram and Instacart). If you have read my theory about Power Users, you already know that when working on product growth and strategy, you have to define different user segments. The Adjacent User segment is a group of users who know about your product but haven’t become engaged users yet.
Below are some quick takes from the Adjacent User theory on how to transform this group into your Power Users:
"The Adjacent Users are aware of a product and have possibly tried using it, but are not able to successfully become an engaged user. This is typically because the current product positioning or experience has too many barriers to adoption for them."
Segmenting the Adjacent Users is critical because it helps you capture the full potential of your product marketing positioning.
To convert the Adjacent Users into your Power Users, you have to know who is successful today and why. This gives you a set of user attributes that help you differentiate the Power Users from the Adjacent Users.
Here are four techniques Bangaly recommends:
Be the adjacent user by simulating their environment.
Watch the adjacent user through research studies.
Talk to the adjacent user through customer discovery.
Visit the adjacent user to watch them in their actual environment.
Recognizing the Adjacent Users group is challenging. There are many personas for your product/service, but targeting the right user segment is very important. When it’s done correctly, you will see an improvement in your user retention and engagement.
Drink and Mingle
Upcoming free events, meetups, talks, and webinars
Aug 13, Databricks: Modern Cloud Data Architecture
Aug 15, PyBay 2020
Aug 18, RMDS Lab: How to leverage AI and ML to improve Data Quality
Aug 19, Women in Analytics: Mitigating Bias in Analytics
Aug 20, DSS: Breaking into AI: ML in the Real World
Aug 21, Girls in Tech SF: Hacking for humanity 2020
Aug 27, DSS: Data Science Salon Elevate
Aug 27, R-Ladies: Tangible Steps Towards Algorithmic Accountability
Sept 26, OpenMined: Data Privacy Conference 2020
Try It Out
A quick Python exercise. Because it’s fun.
Given an int n, return the absolute difference between n and 21, except return double the absolute difference if n is over 21.
Try it yourself and check your solution here.
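If you get stuck, here is one possible solution, as a minimal sketch. The function name `diff21` is my own choice for illustration; the exercise only fixes the behaviour, not the name.

```python
def diff21(n: int) -> int:
    """Return |n - 21|, doubled when n is over 21."""
    difference = abs(n - 21)
    # The doubling applies only when n is strictly greater than 21.
    return 2 * difference if n > 21 else difference


print(diff21(19))  # n is under 21, so the plain difference is returned
print(diff21(25))  # n is over 21, so the difference is doubled
```

Computing the absolute difference once and then deciding whether to double it keeps the condition in a single place.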
Thanks for reading, everyone. Until next Wednesday!