Expert Insight: Duong Vu - Best Ways to Learn Data Science

An insightful look into what it takes to become a data scientist and what to look for when interviewing or growing your team

If you have not walked the Great Wall, you are not a great man - the sign read. Being a great woman instead, I left and walked a new path.

Duong Vu is a data scientist at UrbanLogiq, a mentor, a Vancouver Women in Data Science (WiDS) Ambassador, and a Women in ML&DS co-organizer. She is also the author of several DataCamp tutorials.

Her path to becoming a data scientist was not a common one, and yet her experience and knowledge have inspired many people (including myself) to pursue the difficult and exciting world of data. I am so excited that Duong agreed to let me interview her and share some of her tips, guidance, values, and wisdom on learning data science and growing data teams.

1. When you interview candidates for your team, what qualities and skills do you look for? And what would be a red flag or a stopper for you to proceed with a candidate?

When interviewing candidates, one of the things I look for is whether they are lifelong learners. What I mean by that is whether they are willing to keep learning and improving their skills. Especially for people who have been working for a while: are they continually reading papers, learning something new, and improving? In this field, everything moves so fast, with new tools and new improvements every month or every year, and if you don’t keep learning, you won’t be able to adapt.

For graduates, juniors, or people who are finishing school and/or bootcamp courses, I look at their passion, either for the field of data or for other fields they are interested in. If they are applying to a company in finance, for example, are they genuinely interested in the work they will be doing at the company, or in the finance field in general? That is what sets candidates apart.

Passion, patience, interest - these all reinforce each other. Candidates must be interested in the work they will be doing. Patience goes a long way. As a data analyst or a data scientist, you will sometimes be required to take on some “boring” tasks, such as cleaning or transforming data, or writing documentation.

To summarize, lifelong learning and passion are the two main things I look for when interviewing candidates. Additionally, if a candidate has a GitHub repo or a website, it gives me a glimpse into their programming skills, logical thinking, and other skills such as storytelling.

Proficiency is important, but so is whether the person is serious: have they done their homework researching the company, have they spelled the company name correctly, things like that. Recent graduates apply to many companies to increase their chances of being seen, but when they demonstrate that they spent some time researching what the company does, it shows they are serious and prepared.

2. What was the most exciting project you have worked on? 

There are many interesting projects that I have been fortunate to work on at UrbanLogiq, and it’s hard to choose one, so I will name two. One is a data science project and the other leans more towards data engineering.

The first project required me to use education data to discover the contributing factors behind students’ success. I leveraged SHAP values to measure the impact of a variety of features, including academic performance, proximity to services, and others, on the success of different groups of students. This allowed us to answer questions like: what types of projects or policies can improve student success? First, we had to define what success is and how to measure and quantify it. From there, we looked at what kinds of features we could construct. Features that are hard to change, such as income or socioeconomic background, should be avoided. Instead, we used features that are easier to act on: distance to the closest job center, number of math classes, and so on. Feature engineering with context becomes a lot more meaningful. It was one of the most memorable projects for me.
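To make the idea behind SHAP values concrete, here is a minimal sketch that computes exact Shapley values by hand for a toy model. It is not the project's code and does not use the SHAP library, which relies on a background dataset and faster approximations. The feature names, the scoring formula, and the single baseline student are all assumptions invented for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names, invented for illustration only.
FEATURES = ["math_classes", "km_to_job_center", "attendance_rate"]


def predict(x):
    """Toy 'success score': a made-up formula with one interaction term."""
    score = (
        0.5 * x["math_classes"]
        - 0.2 * x["km_to_job_center"]
        + 3.0 * x["attendance_rate"]
    )
    score += 0.1 * x["math_classes"] * x["attendance_rate"]
    return score


def shapley_values(predict_fn, x, baseline):
    """Exact Shapley value of each feature for x, relative to a baseline.

    Features outside a coalition keep their baseline value.
    """
    n = len(FEATURES)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in FEATURES}
        return predict_fn(mixed)

    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            # Weight of a coalition of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi


# Made-up student and baseline ("average student") values.
student = {"math_classes": 4, "km_to_job_center": 2.0, "attendance_rate": 0.9}
baseline = {"math_classes": 2, "km_to_job_center": 5.0, "attendance_rate": 0.8}

phi = shapley_values(predict, student, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The contributions add up to the difference between the student's score and the baseline score, which is what makes this kind of attribution useful for comparing actionable features such as distance to a job center.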

Another interesting but painful project involved building a data pipeline to ingest 10 years of data for one of our clients. As many may be aware, data in the wild is a beast! The data we had to work with came from multiple sources, each with a different callback. Many things can change over the course of 10 years, and so can some of the attributes in our data. Inconsistency in format is one thing, but what can you do if a street changes its name and the same object now has two different names in the system? It is incredibly challenging to think about how to clean, transform, and make the data readable. On top of that, it is important to inform our client clearly and concisely about the transformations and about data quality if there is a problem. It was a very interesting yet challenging project.
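One common way to handle the renamed-street problem is an alias table that maps every historical name to a single canonical name, while keeping the original value so the transformation can be reported to the client. The sketch below is only an illustration of that idea, not the approach used in the project; the street names, records, and the `canonical_street` helper are hypothetical.

```python
# Hypothetical alias table: historical street names mapped to the current name.
STREET_ALIASES = {
    "old mill rd": "Heritage Way",
    "old mill road": "Heritage Way",
}


def canonical_street(name: str) -> str:
    """Return the canonical street name, ignoring case and extra whitespace."""
    tidy = " ".join(name.split())
    return STREET_ALIASES.get(tidy.lower(), tidy)


# Made-up records spanning several years.
records = [
    {"year": 2012, "street": "Old Mill Rd", "count": 120},
    {"year": 2015, "street": "old  mill road", "count": 130},
    {"year": 2019, "street": "Heritage Way", "count": 150},
]

# Keep the original value next to the cleaned one for traceability.
cleaned = [
    {**r, "street": canonical_street(r["street"]), "original_street": r["street"]}
    for r in records
]

# Rows whose name was changed; useful for a data quality report.
renamed = [r for r in cleaned if r["street"] != r["original_street"]]
print(f"{len(renamed)} of {len(cleaned)} records were renamed")
```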

These two projects are the most representative of what we do.

3. If you had to start your career over, what would you do differently?

My first degree was in public finance. And then I went into compliance, and then law. After my graduation, I was working at a law firm. 

Law school really changed me. Working at the law office wasn’t that exciting compared to law school. There was a lot of paperwork and documentation. I worked with a database there for the first time, and it was interesting. So I started looking for a part-time job as a data analyst somewhere else, and they offered me a full-time job as an analyst. I started learning programming at that job - how to solve problems or make a process more efficient. I realized my knowledge was not enough, so I went on to do an MS in data science to explore the field. That’s how I got into data science.

It took me a while to find out what I wanted to do. I wouldn’t do anything differently. Everything contributes to who you are today. One thing I keep reminding myself is that you should not compare yourself to other people. Everyone has a different starting point. Your background gives you a different perspective, and you might see a problem from a different angle. People have to try out different things and see what they are good at.

4. Do you have a favorite tool, application, or software you use a lot?

There are two tools I would like to highlight.

The first one is PyCharm, because I mainly use Python for work. It is a great tool that helps me write better code. PyCharm has type checking, linting, PEP 8 checking, and a good debugger if you run into problems. And don’t forget its Git integration. When I started out, I was not a strong programmer, so this tool helped me build good programming habits.

The second tool is Kedro from QuantumBlack Labs (McKinsey). It is an open-source pipeline package used for building ETL processes and data science pipelines. It makes the process reproducible and enables easier collaboration.

It can manage anything from a simple pipeline to something much more complicated.
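The core idea of a pipeline tool like Kedro is that each step is a node with named inputs and outputs, and the tool works out the execution order from those names. The plain-Python sketch below illustrates that idea only; it is not Kedro's API, and the `Node` type, `run_pipeline` function, and dataset names are hypothetical.

```python
from typing import Callable, NamedTuple


class Node(NamedTuple):
    """One pipeline step: a function plus the names of its inputs and output."""
    func: Callable
    inputs: list
    output: str


def run_pipeline(nodes, catalog):
    """Run every node once its inputs exist; return all datasets by name."""
    data = dict(catalog)
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            missing = sorted({i for n in pending for i in n.inputs if i not in data})
            raise ValueError(f"Cannot resolve inputs: {missing}")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


def drop_missing(rows):
    """Cleaning step: remove missing values."""
    return [r for r in rows if r is not None]


def average(rows):
    """Summary step: mean of the cleaned values."""
    return sum(rows) / len(rows)


# Nodes are declared out of order on purpose; the names decide the run order.
nodes = [
    Node(average, ["clean_counts"], "mean_count"),
    Node(drop_missing, ["raw_counts"], "clean_counts"),
]

result = run_pipeline(nodes, {"raw_counts": [10, None, 20, 30]})
print(result["mean_count"])
```

Because every intermediate dataset has a name, a teammate can rerun or swap a single step without touching the rest, which is where the reproducibility and collaboration benefits come from.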

5. Can you share the values you foster within your team?

One important value of my team is that we support each other. For example, if one person has problems or blockers in their code, we can stop (almost) everything and do pair programming with that person. Pair programming helps you learn other people’s coding styles and logical thinking. We encourage people to ask for help any time they have problems.

Another one is encouraging learning and exploring new things. We often propose new papers, lectures, or ideas related to what we do, for example, how to embed privacy by design into our system, or a method to remove bias from datasets, and many more. We bring these ideas and papers to our weekly Data Science Learning session for discussion. We also have a bookshelf at the office, where you can pass books on to each other.

6. Anything else you want to share to encourage or inspire people to learn data?

Currently, many people are interested in the DS and DA fields for various reasons. There are many requirements, such as statistics, programming, storytelling, and so on, and it is hard to know where to start. I want to tell people that instead of starting with theory, look at the applications to find the value of analysis and to see how DS is used. This is a great way to start. For example, companies like Spotify or YouTube use recommendation systems to suggest songs that you may be interested in. Stores use computer vision to check inventories. Google uses NLP to translate between different languages. There are many applications in almost every aspect of life. Find the field that interests you, then find the application and start digging deeper.

Looking from the application standpoint can help you understand what type of analysis should be used or which evaluation metrics best serve your purpose.

Follow the Women in Data Science Vancouver LinkedIn page to stay in touch with Duong and register for upcoming events.