How To Choose The Correct ML Algorithm For Your Problem - Issue 9

Sep 09, 2020

Today is Wednesday, and it’s time for a weekly recap of interesting stories and events in the big data world from Data Analysis Journal.

✨ This week’s highlights:

Working with Python, PySpark, and SQL on Azure Databricks.
A new Residency Program for ML and AI from Apple.
Supervised ML algorithm guide for problem-solving.
15 must-read books on data analysis.

Enjoy!

📚 Weekend Longread

Azure Databricks is Microsoft’s big data analytics service that offers cloud data storage and processing. Read this Azure Databricks guide, which covers how to use Python, PySpark, and SQL, and how to convert a Spark DataFrame to a pandas DataFrame.

🔥 What’s new this week

If you missed PyBay2020 (5th annual Python conference) this July, you can find its talks and recordings on this YouTube channel.
Apple recently launched its first residency program for ML and AI. It’s a year-long program, and the candidates must meet some qualifications. Will we see a similar one from Facebook soon?
Optimizely, a famous testing platform, was acquired by the content management company Episerver. The acquisition might bring more experimentation support and new features to Optimizely.
Microsoft is developing its own Hadoop distribution, HDInsight 4.0, which is based on Apache Hadoop and YARN. Customers will be able to add Kafka or Spark to it.
Which has better performance - PostgreSQL or Elasticsearch? Check this basic comparison of full-text search query performance (spoiler alert: Elasticsearch is better).
Want to be a data engineer? Start with a roadmap of data engineering in 2020 - a modern data engineering landscape covering must-know tools and frameworks.

🎓 Level Up

Certifications, internships, schools, and courses.

Whether you’re a data analyst beginner or looking to take your skills to the next level, refresh your reading list with these 15 must-read books on data analysis. One book I would like to highlight from that list is Hacking Growth: How Today’s Fastest-Growing Companies Drive Breakout Success by Sean Ellis & Morgan Brown. It introduces a new growth hacking mindset about your product, which is based on product loops rather than funnels.

If you are a beginner and learning a programming language, check this free September LeetCode challenge. Solve quick daily problems, practice algorithms, and get a prize.

The Adobe India scholarship is currently open for the 2021 season. Indian female citizens who are full-time students pursuing a Major or Minor in Data Science, Maths, Computing, or Engineering are eligible to apply.

🏆 Nailed It

Be prepared for your next interview

How to choose the right ML algorithm?

There are many ML algorithms with different levels of complexity, and it can be challenging to figure out which model to choose for your analysis.

A common practice is to run a few models against a held-out 20% of your data and pick the one with the highest accuracy score (as I demonstrated here). That said, a high accuracy score is not always the best measure of model performance, especially if you are dealing with imbalanced data. Choosing the appropriate algorithm is also partly a business question.
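To see why accuracy can mislead on imbalanced data, here is a minimal sketch in plain Python. The labels are made up for illustration (95 negatives and 5 positives), and the "model" is a hypothetical baseline that always predicts the majority class; none of this comes from a real dataset.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1), e.g. rare fraud cases.
y_true = [0] * 95 + [1] * 5

# A baseline "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model identified."""
    actual_pos = sum(t == positive for t in y_true)
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return true_pos / actual_pos if actual_pos else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f} recall={rec:.2f}")  # accuracy=0.95 recall=0.00
```

The baseline scores 95% accuracy while finding none of the positive cases, so for a problem like this a metric such as recall or precision is likely a better fit than accuracy alone.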

To start with, describe the output and the type of your problem:

classification
regression
clustering

Then, depending on how much data you have and the type of your business case, decide which type of ML to proceed with:

Supervised Learning - you know how to classify the input data and the type of behavior you want to predict, but you need the algorithm to calculate it for you on new data

Unsupervised Learning - you do not know how to classify the data, and you want the algorithm to find patterns and classify the data for you

Reinforcement Learning - you don’t have a lot of training data; you cannot clearly define the ideal end state, or the only way to learn about the environment is to interact with it
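The difference between the first two types can be sketched on toy data. This is a simplified illustration, not a recommended implementation: the numbers and the "low"/"high" labels are hypothetical, the supervised part is a bare nearest-centroid rule, and the unsupervised part is a tiny two-cluster k-means loop written by hand.

```python
from statistics import mean

# Hypothetical 1-D feature, e.g. monthly spend per customer.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

# Supervised: the labels are known, so we learn one centroid per label
# and use it to classify new data.
labels = ["low", "low", "low", "high", "high", "high"]
centroids = {
    label: mean(v for v, lab in zip(values, labels) if lab == label)
    for label in set(labels)
}


def predict(x):
    """Assign x to the label whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Unsupervised: no labels are given; the algorithm has to find the groups.
def two_means(data, iterations=10):
    """Split data into two groups around two iteratively updated centers."""
    c0, c1 = min(data), max(data)  # start from the two extremes
    for _ in range(iterations):
        g0 = [x for x in data if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in data if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return g0, g1


groups = two_means(values)
print(predict(4.2))
print(groups)
```

In the supervised case the algorithm only has to apply a labeling you already defined; in the unsupervised case the groups come back unnamed, and interpreting them is up to you.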

For supervised ML, check my most recent publication to decide which ML algorithm to pick for which business problem.

🍸 Drink and Mingle

Upcoming free events, meetups, talks, and webinars

Sep 10, Databricks: Data pipelines and ML with Spark
Sep 10, R-Ladies: Fireside chat about DS Career, Communities, and AI
Sep 23, Introduction to Python
Sep 15, Looker: Light Up Your Data Direction
Sep 16, Affirm: ML Talks
Sep 16, Claravine: Optimizing for event-based analytics
Sep 17, Algorithmia: Eight must-haves for MLOps success
Sep 26, OpenMined: Data Privacy Conference 2020
Sep 28, Grid Dynamics: Enterprise AI: Case Studies From Leading Companies
Sep 30, Anaconda: Performance Tips For Pandas

If you want to attend the paid DSS event “Applying ML and AI to Media, Advertisement, and Entertainment” on Sep 22-25, I have some free tickets left! You can respond to this email or email me at olga@berezovsky.me.

Thank you for reading! Until next Wednesday!