Today is Wednesday, and it’s time for a weekly recap of interesting stories and events in the big data world from the Data Analytics Journal.
If you don’t remember subscribing to my newsletter, you might be one of my LinkedIn contacts. You can unsubscribe at the bottom (and miss interesting reads on data analysis). If you choose to stick around, you will receive a weekly newsletter every Wednesday.
Today we will be discussing:
The ongoing Meow attack has hit 1,000+ databases, deleting all user data
What’s the point of statistics?
The best Data Science books to read right now.
Learn about Supervised Machine Learning, and the Logistic Regression model here, where I talk about common use cases and walk through the steps of applying this algorithm for data forecasting.
Uber launched a (free) service that provides health agencies with data on drivers and riders who may have been in contact with someone infected with COVID-19. This service allows health officials to locate and contact affected individuals. Yes, your Uber data is now shared and re-shared with myriad healthcare data vendors and providers. For a good cause.
Google launched Recommendations AI, its machine learning automation service for e-commerce. It helps online retailers deliver personalized product recommendations to their customers based on their shopping history. It’s very similar to Amazon Personalize, which was released last year and runs on AWS.
The growing use of cloud services has led to an increased number of targeted attacks on unsecured databases. If you store user data and haven’t heard about the Meow attack yet, you should be alarmed: more than 1,000 unsecured databases (987 Elasticsearch and 70 MongoDB) have been permanently wiped over the past few days. All unprotected sensitive user data gets deleted, leaving only the word “Meow” behind. I know. Cats have been watching us and waiting for their time to attack. Don’t leave your laptop unlocked with your cats around.
Recently I ran across an article on Medium, Don’t waste your time on statistics, which questions the value of statistics in daily data analysis work, and I couldn’t stay silent. The author argues that instead of racking your brain over complicated statistical concepts and formulas, you should just follow your “gut feeling”.
By and large, the author argues that statistics don’t always have a point and that you can't obtain certainty from uncertainty, and she advises you to go with “your best guess”, which she nicely defines as “analytics”.
This is misleading, and it oversimplifies the statistical approach to data analysis.
In fact, there is a big difference between descriptive statistics (this is what the author calls “analytics”) and inferential statistics (which she describes as “decisions under uncertainty”). The two are very different approaches to describing and solving problems, and they shouldn’t be merged into a single method and oversimplified. Furthermore, statistical theory, methods, and analysis aim to provide certainty out of uncertainty.
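To make the distinction concrete, here is a minimal sketch using only Python’s standard library. The sample numbers are made up for illustration, and the helper names `describe` and `mean_confidence_interval` are my own. The first function only summarizes the data you have (descriptive); the second makes a statement about the wider population the sample came from (inferential). It uses a normal approximation, which is rough for a sample this small; a t-distribution would usually be the more careful choice.

```python
import math
import statistics

# Hypothetical sample: daily session counts (made-up numbers for illustration).
sample = [112, 98, 105, 120, 101, 95, 118, 107, 110, 99, 103, 115]


def describe(data):
    """Descriptive statistics: summarize the observed data, nothing more."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
    }


def mean_confidence_interval(data, confidence=0.95):
    """Inferential statistics: estimate a range for the population mean.

    Assumes a normal approximation (z-interval). This is a simplification
    for illustration, not a recommendation for small samples.
    """
    n = len(data)
    mean = statistics.mean(data)
    standard_error = statistics.stdev(data) / math.sqrt(n)
    # Two-sided critical value, e.g. roughly 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


if __name__ == "__main__":
    print(describe(sample))
    print(mean_confidence_interval(sample))
```

The summary describes these twelve days only. The interval is a claim about days you have not observed, and it comes with a stated confidence level, which is exactly what a “best guess” lacks.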
I recently noticed a downward trend in content quality on the Towards Data Science blog. The Medium blog editors used to carefully review and curate publications, but lately I often see “catchy” article titles that misrepresent Data Science concepts.
With the democratization of Data Science, there are more tutorials and classes that aim to make machine learning and statistics fast and simple. While these can be simplified and focused on practical usage, they should still cover the foundations of statistics first.
If you want to brush up on the core concepts of data science, analysis, or statistics, check out this list of the best data science books. I especially want to recommend two must-read books from that list:
An Introduction To Statistical Learning is perfect for beginners who do not have statistical knowledge or background. It covers the most important machine learning algorithms with examples and use cases.
Naked Statistics: Stripping The Dread From Data offers less academic theory and more practical statistical concepts and examples. It is easy to read and understand.
Drink and Mingle
Upcoming events, meetups, talks, webinars
July 30: Subsurface: the cloud data lake conference
July 30: DSS elevate women in data
Aug 6: Data Science Summit: Interpretable Machine Learning For Enterprise Applications
Aug 21: Girls in Tech SF: Hacking for humanity 2020 - join the annual hackathon!
If you are a woman and want to connect with other women in technology for networking, idea sharing, or the best margarita recipes, you can do so here.
Try It Out
A quick Python exercise. Because it’s fun.
I have been given this same exercise in at least three different interviews.
Write a program that outputs the string representation of numbers from 1 to n.
But for multiples of three, it should output “Fizz” instead of the number, and for multiples of five, it should output “Buzz”. For numbers that are multiples of both three and five, it should output “FizzBuzz”.
Try it yourself and check your solution here.
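If you want something to compare against, here is one possible solution (spoiler ahead). It is a minimal sketch, not the only right answer; the function name `fizz_buzz` and the choice to return a list of strings are my own assumptions, since the exercise only asks for output.

```python
def fizz_buzz(n):
    """Return the FizzBuzz strings for the numbers 1 through n."""
    result = []
    for i in range(1, n + 1):
        # Check the combined case first: a multiple of both 3 and 5
        # is a multiple of 15.
        if i % 15 == 0:
            result.append("FizzBuzz")
        elif i % 3 == 0:
            result.append("Fizz")
        elif i % 5 == 0:
            result.append("Buzz")
        else:
            result.append(str(i))
    return result


if __name__ == "__main__":
    print("\n".join(fizz_buzz(15)))
```

The one thing interviewers tend to watch for is the order of the checks: if you test for three or five before testing for both, “FizzBuzz” never gets printed.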
Until next Wednesday!