What’s the key to cracking data science competitions? How do you use this experience to break into the data science industry? We regularly come across these questions from aspiring data scientists wondering how to make a name for themselves in data science.
Who better to answer these questions and provide an in-depth insight into the data science world than a Kaggle Master and an Analytics Vidhya hackathon expert? Ladies and gentlemen, I’m delighted to present Sonny Laskar!
Sonny is an MBA post-graduate from IIM Indore, the place he credits for starting his data science journey. So for any of you wondering if it’s possible to make a career transition to data science from a non-data science field – this article is for you.
I found Sonny to be a very approachable person and his answers, as you’ll soon see, are very interesting, knowledgeable and rich with experience. Despite holding a senior role in the industry, Sonny loves taking part in data science competitions and hackathons and regularly scales the top echelons of competition leaderboards.
Sonny also holds a lot of experience in the data engineering side of this field. As you can imagine, there is a LOT we can learn from him. I had the opportunity to pick his brain about various data science topics and bring this article to you.
We covered a variety of data science topics during our conversation:
- Sonny’s background and his first role in data science
- The difference between data science competitions and industry projects
- Sonny’s framework and approach to data science competitions
- His advice to aspiring data scientists
And a whole lot more! There is SO much to learn from Sonny’s knowledge and thought process. Enjoy the discussion!
Sonny Laskar’s Background and First Role in Data Science
Pranav Dar: You are currently the Associate Director of Automation and Analytics at Microland, finished 4 times in the top 3 in AV’s hackathons, and hold a runner-up finish in a Kaggle competition. It’s been quite a ride!
How and where did your data science journey begin?
Sonny Laskar: My Data Science journey started when I was pursuing my MBA at IIM Indore. Analytics was the go-to area for every aspirant. One of the early topics of discussion was how Target figured out a teen girl was pregnant before her father did. This made me very curious and I started to dive deep into the world of Data Science.
I had already worked extensively with data but mostly around engineering problems and business intelligence. No serious machine learning stuff was popular back then with organizations in India.
“I spent two months at the University of Texas, Austin in early 2014 and was surprised by the level of maturity they had with data. My visit to Dell’s headquarters in Austin and how they used social media data to enhance their product positioning was amazing. By the end of this, I was completely convinced that I needed to work on this.”
PD: Your professional career didn’t start off in data science. The first 6 years or so were spent on data warehousing and infrastructure.
So what kind of challenges did you face when you were getting into data science? How did you overcome them?
SL: I started my career in 2007 in the world of IT Infrastructure. In the initial six years, I was primarily working on building massive scale data warehousing applications (processing ~10TB data every). The focus was more on ETL and BI. Dashboards and Data marts were the primary output of all these efforts. This was what we called “Descriptive Analytics”.
By 2014-15, “Predictive Analytics” was already getting a lot of attention and adoption in the US. It was then that many organizations in India started looking at “Predictive Analytics” with significant focus. We were already processing Terabytes of data and were very well versed with the engineering side of things.
I was able to understand the fundamentals of Data Science very well since my Mathematics and Statistics concepts are strong and I had a fair exposure to programming.
I started with R since that was the programming language popular in academics and improved my understanding by practicing writing code and replicating other work.
During my MBA, I got a bird’s eye view of many statistical and Data Science approaches. Since the focus during MBA was more on business, it didn’t allow me to master the technical skills as much as the industry needs. Post my MBA, I started spending roughly 4-5 hours every day writing code and building on top of it.
Patience, Perseverance & Practice have been my rule of thumb for everything in life, and that is what I applied here as well.
Industry Experience versus Data Science Competitions
PD: We often hear from hiring managers how aspiring data scientists participate in hackathons and competitions and struggle to bridge the gap during their transition into an industry role.
You have been on both sides of this – you hold rich experience in data science and have excelled in hackathons. What has been your experience in the industry vs. hackathon debate?
SL: Data Science is getting a lot of attention from the workforce in the market. It is in fact very easy to get some training to understand the basic concepts (thanks to MOOCs). This leads to excessive supply and recruiters then need some ways to filter.
One of the best ways that work is establishing credibility by participating in data science competitions.
Just like most things in life, competitions have their pros & cons. There is a lot of preparatory work that gets done before a competition is published. That work is at times extremely complex, time-taking and needs multi-domain understanding.
Similarly, the competition ends with a leaderboard score without any view on what was done with the winners’ solutions. These are grey areas for many first-timers into Data Science which creates a lot of issues when they join the industry.
I have conducted at least 100 in-person interviews in the last year and I can see this struggle very prominently. Data Scientists are not expected to just design a machine learning model to predict something. In many organizations, discussions in meeting rooms end up with a task for the Data Scientist such as “Let us build a model to predict X”.
A good Data Scientist might end up concluding that many such X use cases should not be solved at all with machine learning! A Data Science team is not expected to be very large in the real world. They might get involved in many tasks which are either not valuable or can be easily solved without using Machine Learning.
If they feel it can be solved with Machine Learning, then there must be a series of discussions to understand what data would help them address that.
“Unlike competitions, nobody gives you two .csv files called train and test and a nicely written evaluation metric. Almost 80% of the efforts go into defining the problem and getting and processing data. Remaining 20% effort goes into pure modeling and deployment.”
Exposure to competitions helps address a few parts of this:
- Processing data and feature engineering
- Building different types of models and getting the best score
These are very significant activities and hence recruiters use “competitions” as a good filter to focus on a smaller set of candidates.
To summarize, below are the key issues which competition-focused people face when they join the industry:
- Building a business acumen for understanding how a problem statement helps the business goals and what data drives that
- Having a problem solver attitude
- Understanding the software engineering side of production deployment
- Story-telling: Ability to communicate the results to non-technical folks
Data Science Hackathons and Competitions
PD: Ever since data science started becoming mainstream in the last 5 years, multiple competitions keep happening across platforms simultaneously. How do you pick and choose which data science hackathon or competition you’ll participate in?
SL: I got hooked on data science competitions back in 2016. I used to participate in as many competitions as I could! Lately, my personal interest has plateaued as the incremental learning has diminished. Now I participate only if I have time and the problem is very interesting.
I also try to participate in offline hackathons along with my Kaggle Grandmaster friend Sudalai Rajkumar (SRK). I usually participate based on three factors:
- The novelty of the problem: If the problem statement is something new to me from an existing or new domain which I might not have enough experience in, I would like to play with the data as it helps me build some perception on that problem/domain
- Data size: I love problems where the data size is extremely large. I like the kick I get when I run models on machines with 500 GB RAM and 64 Core processors. It is a lot of fun!
- Multiple possible approaches: I look for problems where there are multiple techniques I can experiment with. In fact, our first Kaggle competition needed us to perform both Text Analytics & Image Analytics and find a clear way to merge both
PD: How should a beginner go about participating in these data science hackathons? Which kind of competition should they first dip their toes into?
SL: As a beginner, it is important for folks to know the basic building blocks.
“I would strictly advise that they should not participate in any competition where the data set is large, and the problem statement is complex.”
They should start with relatively easy data science competitions. Below is what aspiring data scientists should do in the initial few weeks:
- Understand the data well. Do not get directly into running xgb.train
- Read about what transformations are effective for your problem & model:
  - Example: Does one-hot encoding help, or is numeric labeling better? Does the column have too many categories? Can we reduce them? Is that numeric field really a number or a category?
- Feature Engineering is key and your early learning on feature engineering will come from other people’s code. So, build a practice of reading others’ code line-by-line and replicate it. Ask yourselves questions like why did the author do that, and how does that help?
  - Kaggle kernels are an excellent place to read
  - On Analytics Vidhya, participants upload their code which beginners should read
- Get familiar with the process of building models using different algorithms
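The encoding questions in the list above can be made concrete with a small sketch. It uses only the Python standard library and an invented `city` column; the data, the `min_count` threshold and the helper names are assumptions for illustration, not part of Sonny's workflow. In practice a library such as pandas or scikit-learn would usually handle this.

```python
from collections import Counter

# Hypothetical categorical column; the values are invented for illustration.
cities = ["Pune", "Delhi", "Pune", "Indore", "Delhi", "Pune", "Kochi"]


def collapse_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def label_encode(values):
    """Map each category to an integer (alphabetical order, not meaningful)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each value into a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


reduced = collapse_rare(cities)          # fewer categories to encode
labels, mapping = label_encode(reduced)  # one integer column
one_hot, categories = one_hot_encode(reduced)  # one column per category

print(reduced)
print(labels, mapping)
print(categories, one_hot[0])
```

Numeric labels keep the data to a single column but imply an ordering that does not exist, while one-hot vectors avoid that at the cost of extra columns. Which works better depends on the model and the data, which is why it is worth trying both.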
PD: How should aspiring data scientists approach a competition?
SL: As we participate in many competitions, we realize that there are a common set of steps that we always follow. We should try to create a template out of it which we can easily modify in every competition. This makes life simpler.
I follow the below process:
- Build a naïve base model using all features and basic feature engineering
- Record each change and score in an Excel sheet to track progress
- Do hyperparameter tuning by hand (without spending too much time) to get something decent
- Go back to data understanding and rework the features completely
- Explore the data, build visual plots to see the patterns, etc.
- Read discussions, kernels, etc.
- Repeat all these steps
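The habit of recording each change and its score can be sketched as a small reusable helper. This is a minimal illustration, not Sonny's actual template: it writes to a CSV log rather than a spreadsheet, and the column names, the scores and the `log_experiment` and `best_experiment` helpers are all hypothetical.

```python
import csv
import io
from datetime import date

FIELDS = ["date", "change", "cv_score"]


def log_experiment(handle, change, cv_score, when=None):
    """Append one row describing a change and the score it produced."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    if handle.tell() == 0:  # write the header only when the log is empty
        writer.writeheader()
    writer.writerow({
        "date": (when or date.today()).isoformat(),
        "change": change,
        "cv_score": f"{cv_score:.4f}",
    })


def best_experiment(handle):
    """Return the logged row with the highest score (assumes higher is better)."""
    handle.seek(0)
    rows = list(csv.DictReader(handle))
    return max(rows, key=lambda row: float(row["cv_score"]))


# In-memory stand-in for a real log file on disk; the scores are made up.
log = io.StringIO()
log_experiment(log, "baseline: all features, default parameters", 0.7012)
log_experiment(log, "hand-tuned learning rate and depth", 0.7150)
log_experiment(log, "reworked features after plotting the data", 0.7421)

print(best_experiment(log)["change"])
```

Keeping the log in a plain file makes it easy to carry the same template from one competition to the next and to see which change actually moved the score.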
Data Science Industry-Related
PD: What are 3 critical aspects of a data science project which you feel are often overlooked by newcomers?
SL: Interesting question. Here is what I would recommend focusing on:
- Taking Models to Production:
  - In the real world, taking models to production takes a lot of effort. There are many things that data scientists need to do from a software engineering perspective, like building Docker containers, setting up a CI/CD pipeline, exposing REST APIs for prediction, version control, etc.
- Understanding the Importance of SQL:
  - SQL is the one thing that every data scientist should learn, irrespective of which programming framework they use. They will end up using SQL for sure
- Learning to write efficient code for Big Data:
  - Badly written code might not be a problem when working on a small dataset, but it becomes a show-stopper when we run it against large datasets. Such scenarios can be handled by changing how the code is written. For example, if you use “for-loops” in your code, it can be very slow when it has to iterate over a long list. Instead, use lambda functions. There are many functional programming guidelines that need to be followed
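The advice above about replacing for-loops with a functional style can be illustrated with a short sketch. The data, the tax rate and the function names are invented for the example. How much speed is gained depends heavily on the workload, so treat this as a comparison of styles rather than a benchmark.

```python
from functools import reduce

# Hypothetical order amounts and tax rate, invented for illustration.
amounts = [120.0, 80.5, 300.0, 45.25]
TAX_RATE = 0.18


def total_with_loop(values):
    """Explicit for-loop: accumulate taxed amounts one at a time."""
    total = 0.0
    for v in values:
        total += v * (1 + TAX_RATE)
    return total


def total_functional(values):
    """Same computation with map, a lambda and reduce (no mutable loop state)."""
    taxed = map(lambda v: v * (1 + TAX_RATE), values)
    return reduce(lambda acc, v: acc + v, taxed, 0.0)


def total_builtin(values):
    """Generator expression handed to the built-in sum."""
    return sum(v * (1 + TAX_RATE) for v in values)


print(total_with_loop(amounts))
print(total_functional(amounts))
print(total_builtin(amounts))
```

All three functions compute the same total. On large numeric datasets the bigger gains usually come from vectorized operations in libraries such as NumPy or pandas, and the standard `timeit` module is a simple way to measure the difference on your own data.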
PD: AutoML is coming up huge in the industry. What are some other trends in data science we can expect to see in the next 2-3 years?
SL: AutoML will eventually automate most of the model building & model deployment part of the work. This will include feature engineering (to quite an extent).
“Domain knowledge, logical reasoning, and a problem-solving attitude are all that a Data Scientist would be expected to excel at.”
Other key trends that I see:
- Adoption of Graphs in Machine Learning: Most folks do not use graphs. That’s a travesty! Graphs are such amazing structures for solving many complex problems
- Augmented Analytics: Augmented Analytics automates data insight by utilizing machine learning and natural language to automate data preparation and enable data sharing
- Autonomous Systems: Autonomous Systems are like Driverless Cars which can take decisions on their own. Reinforcement learning is behind this. One of the products we are building in Microland is for “Autonomous IT” which will replicate what a human does when there is a problem and learn that behavior to replicate it in real time
Rapid Fire Questions: Sonny’s Take on Various Data Science Aspects
PD: Tell us 3 things you have learned working in data science.
SL: There are too many to list down! But here are my top 3 picks:
- Domain Knowledge is key
- Being “Jack of Many Trades” helps a lot
- Always think out-of-the-box
PD: Which is your favorite machine learning/deep learning algorithm and why?
SL: I use XGBoost & LightGBM for most of my tasks. They work almost every time. For deep learning, Keras with TensorFlow seems perfect to me.
PD: Which data science professional would you pick to take part in a high-stakes data science competition?
SL: Sudalai Rajkumar (SRK) any day!
PD: What’s your advice to people trying to get their first data science role?
SL: Here are a few tips from my experience:
- Do not try to learn two languages at the same time. Master any one which you like. Ignore all the news that you hear like “Language X is better than language Y”, etc.
- Build a decent GitHub profile with all the different types of problems you have tried to solve
- Take an open problem where you can get data and build some Data Science application around that
- Finally, participate in competitions and make it to the top!
I thoroughly enjoyed interacting with Sonny Laskar for this interview. His knowledge, his thought process and the way he articulates and structures his thoughts are things we can all learn from.
What did you learn from this interview? Are there other data science leaders you would want us to interview? Let me know in the comments section below!