DataHack Radio #23: Ines Montani and Matthew Honnibal – The Brains behind spaCy

Introduction

What would you do if you had the chance to pick the brains behind one of the most popular Natural Language Processing (NLP) libraries of our era? A library that has helped usher in the current boom in NLP applications and nurtured tons of NLP scientists?

Well – you invite the creators on our popular DataHack Radio podcast and let them do the talking! We are delighted to welcome Ines Montani and Matt Honnibal, the developers of spaCy – a powerful and advanced library for NLP.

That’s right – everything you’ve ever wanted to know about the wonderful spaCy library is right here in the latest DataHack Radio podcast.

This podcast is a 40+ minute bonanza for NLP enthusiasts and practitioners. Ines and Matt spoke about all things spaCy and NLP, including:

  • The idea behind developing spaCy
  • spaCy’s awesome evolution from the first alpha release to the current version 2.1
  • Use cases of spaCy including a couple of surprising applications
  • Ines and Matt’s advice to NLP enthusiasts

And much, much more!

I’ve put together the key takeaways and highlights in this article. Enjoy the episode and don’t forget to subscribe to Analytics Vidhya’s DataHack Radio podcast.

 

The Brains behind the spaCy Library

The story behind spaCy goes like this. During his student days, Matt had written code for a specific Natural Language Processing (NLP) task. He had been working in NLP for a long time and a few companies wanted to use his research code.

The code, however, wasn’t ready to be used for general purposes (it was geared more towards performing a specific task). Matt wanted to build something that could be used for a broader set of NLP functions. And from that, the concept of spaCy was born.

Matt met Ines shortly after the first alpha version of spaCy was released. They have collaborated on spaCy ever since and founded their own company, Explosion AI (explosion.ai), a digital studio specializing in Artificial Intelligence and Natural Language Processing.

I really liked the story behind Ines and Matt’s initial days working on spaCy. Ines was intrigued by the computational linguistics aspect of working on spaCy and by the thought that companies could use it to build really significant systems.

There’s a funny anecdote about Ines’ first reaction to understanding the algorithm behind NLP tasks back then. I won’t spoil it here – so make sure you listen to this section!

 

The Motivation and Idea for Developing spaCy

The NLTK library existed before spaCy was developed. So why create spaCy in the first place? What was the motivation behind creating a different NLP library? I’m sure most of you must have asked this question back when spaCy came out.

As Matt explained, it’s important to understand the NLTK library before we answer these questions. NLTK was built on a different code base and is designed primarily for teaching NLP topics.

“The need for commercial NLP comes from a different set of priorities. One of the most important of those priorities is efficiency. And the best way to get efficient Python usable things is to write C extensions with Cython – the language spaCy is ultimately implemented in.” – Matt Honnibal

Linear models used to be quite popular when Matt started working on spaCy because:

  • These models used a lot of the machine’s memory
  • They could be implemented quickly using C or Cython

There was no practical way to write these C extensions within NLTK. Matt also wanted to build an NLP library that took a different approach. Hence spaCy was designed to fill these gaps and offer a fresh perspective to folks working in NLP.
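To make the linear model idea a little more concrete, here is a minimal sketch in plain Python of a sparse linear classifier, trained perceptron-style as a toy part-of-speech tagger. This is my own illustration and not spaCy's implementation: the class name SparseLinearModel, the feature templates and the six-word training set are all assumptions invented for this example.

```python
from collections import defaultdict


class SparseLinearModel:
    """Toy multiclass perceptron over sparse string features (illustrative only)."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        # weights[feature][label] -> float; only features touched by an update are stored
        self.weights = defaultdict(lambda: defaultdict(float))

    def predict(self, features):
        # The score of a label is the sum of the weights of the active features
        scores = {label: 0.0 for label in self.classes}
        for feat in features:
            if feat not in self.weights:
                continue
            for label, weight in self.weights[feat].items():
                scores[label] += weight
        # Ties are broken deterministically by label name
        return max(self.classes, key=lambda label: (scores[label], label))

    def update(self, features, truth, guess):
        # Perceptron rule: reward the true label, penalise the wrong guess
        if truth == guess:
            return
        for feat in features:
            self.weights[feat][truth] += 1.0
            self.weights[feat][guess] -= 1.0


def extract_features(word, prev_word):
    # Hypothetical feature templates, chosen only for this example
    return {
        "bias",
        f"word={word.lower()}",
        f"suffix={word[-2:].lower()}",
        f"prev={prev_word.lower()}",
    }


# Made-up training data: ((word, previous word), tag)
TRAIN = [
    (("the", "<s>"), "DET"),
    (("dog", "the"), "NOUN"),
    (("runs", "dog"), "VERB"),
    (("a", "<s>"), "DET"),
    (("cat", "a"), "NOUN"),
    (("sleeps", "cat"), "VERB"),
]

model = SparseLinearModel({"DET", "NOUN", "VERB"})
for _ in range(5):
    for (word, prev), tag in TRAIN:
        feats = extract_features(word, prev)
        model.update(feats, tag, model.predict(feats))

print(model.predict(extract_features("dog", "the")))   # NOUN
print(model.predict(extract_features("bird", "the")))  # NOUN (unseen word, known context)
```

Prediction here is nothing more than a handful of table lookups and additions, which is the kind of tight loop that translates naturally into C or Cython.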

 

The Business Side of spaCy and Prodigy

There was scope for building a business around spaCy right from the beginning. The question was whether it should come from consulting or from something else.

Ines and Matt settled on annotation tools early in their journey because, according to Ines, the topic kept coming up. So the question was – what were people using and what actually worked for them? Two features stood out:

  • Named entity recognition
  • Creating labeled data and running experiments

I’m sure most of you working in NLP can relate to these! Ines and Matt actually came up with the concept behind Prodigy thanks to these features. Prodigy, for those who haven’t seen it before, is an annotation tool that data scientists can use to do the annotation themselves, enabling a new level of rapid iteration. Whether you’re working on entity recognition, intent detection or image classification, Prodigy can help you train and evaluate your models faster. And who doesn’t want that?

 

The Evolution of spaCy (from v1 to v2.1)

“This is influenced to a large degree by the research and development around NLP.”

The first version of spaCy was built with the linear model technology we saw above. This was back when neural networks were still in their infancy, not yet ready to take the machine learning world by storm.

Once the revolution came and neural networks became more and more mainstream, spaCy made the switch from version 1 to 2. Many of the key features in spaCy 2.0 centered on the various ML pipeline components (such as plug-and-play architectures), inspired and influenced by the ever-evolving machine learning community.
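The plug-and-play idea can be pictured as an ordered list of functions, each of which takes a document, adds its own annotations and hands the document on. The sketch below is a simplified illustration in plain Python and not spaCy's actual API: Doc, Pipeline and the three components are hypothetical names I made up for this example.

```python
class Doc:
    """Hypothetical document container (not spaCy's Doc class)."""

    def __init__(self, text):
        self.text = text
        self.tokens = []
        self.annotations = {}


class Pipeline:
    """Runs named components in order; each one takes a Doc and returns it."""

    def __init__(self):
        self._components = []  # list of (name, function) pairs

    @property
    def names(self):
        return [name for name, _ in self._components]

    def add(self, name, component):
        if name in self.names:
            raise ValueError(f"component {name!r} is already in the pipeline")
        self._components.append((name, component))

    def remove(self, name):
        self._components = [(n, c) for n, c in self._components if n != name]

    def __call__(self, text):
        doc = Doc(text)
        for _, component in self._components:
            doc = component(doc)
        return doc


def tokenizer(doc):
    # Naive whitespace tokenization, good enough for the illustration
    doc.tokens = doc.text.split()
    return doc


def capitalized_finder(doc):
    # Collects tokens that start with an uppercase letter
    doc.annotations["capitalized"] = [t for t in doc.tokens if t[:1].isupper()]
    return doc


def token_counter(doc):
    doc.annotations["n_tokens"] = len(doc.tokens)
    return doc


pipeline = Pipeline()
pipeline.add("tokenizer", tokenizer)
pipeline.add("capitalized", capitalized_finder)
pipeline.add("counter", token_counter)

doc = pipeline("Ines and Matt founded Explosion AI")
print(pipeline.names)   # ['tokenizer', 'capitalized', 'counter']
print(doc.annotations)  # {'capitalized': ['Ines', 'Matt', 'Explosion', 'AI'], 'n_tokens': 6}

# Swapping a component out does not affect the others
pipeline.remove("capitalized")
print(pipeline("Ines and Matt").annotations)  # {'n_tokens': 3}
```

Because every component shares the same "document in, document out" contract, one can be added, removed or replaced without touching the rest.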

spaCy 2.1, the current version, is geared more towards stability and performance. One of its stand-out features is support for transfer learning and language models – the two concepts that have accelerated research and progress in the NLP field.

So what’s next for spaCy? What kind of new features can we expect from future updates? Here’s Matt:

“One of the core uses of spaCy is in information extraction. Basically, going from unstructured text to structured knowledge. What we have in the works is a new component for entity linking – resolving names to knowledge base entries. We have a developer working on this currently and should have it ready by July, when the spaCy conference comes around.”

They are also working on adding support for more regional languages to spaCy. Exciting times ahead!
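If "resolving names to knowledge base entries" sounds abstract, the toy sketch below shows the basic idea in plain Python: look up the candidate entries for a name, then pick the one that best fits the surrounding words. This is only an illustration under my own assumptions and says nothing about how the planned spaCy component works. The knowledge base, its identifiers, the prior probabilities and the keyword lists are all made up.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    entity_id: str       # made-up knowledge base identifier
    label: str
    prior: float         # assumed probability of this entity given the name
    keywords: frozenset  # words that hint at this entity when seen in context


# Hypothetical two-entry knowledge base keyed by lowercased name
TOY_KB = {
    "paris": [
        Candidate("KB-001", "Paris, France", 0.9, frozenset({"france", "louvre", "seine"})),
        Candidate("KB-002", "Paris, Texas", 0.1, frozenset({"texas", "usa", "rodeo"})),
    ],
}


def link_entity(mention, context, kb=TOY_KB):
    """Return the best-scoring Candidate for a mention, or None if the name is unknown."""
    candidates = kb.get(mention.lower(), [])
    if not candidates:
        return None
    context_words = set(re.findall(r"\w+", context.lower()))

    def score(candidate):
        # Prior plus one point for every hint word found in the context
        return candidate.prior + len(context_words & candidate.keywords)

    return max(candidates, key=score)


print(link_entity("Paris", "We spent a week in Paris.").label)              # Paris, France
print(link_entity("Paris", "The rodeo in Paris, Texas was packed.").label)  # Paris, Texas
print(link_entity("Springfield", "No entry for this name."))                # None
```

With no helpful context the prior wins, and with context words such as "Texas" the less common entry takes over. A real entity linker would learn this scoring from data rather than count keywords.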

 

A Few Surprising Use Cases of spaCy

The beauty of machine learning is in its sheer scope and complexity. Even the creators of a library or package can sometimes be taken by surprise looking at its use cases. Here are a few such use cases of spaCy which Ines and Matt did not see coming:

  • Extracting information from resumes (PDF parsing)
  • Working with network logs

I would love to hear from our community here – are there any unique spaCy use cases you have come across? Or perhaps ones you’ve experimented with yourself? Let me know in the comments section below!

 

Future Trends in NLP and spaCy

NLP has advanced by leaps and bounds in the last 12-18 months with the release of breakthrough frameworks like OpenAI’s GPT-2, Google’s BERT and fast.ai’s ULMFiT, among others. So what do the next 2-5 years hold for NLP and spaCy?

Ines and Matt provided quite an in-depth and insightful answer to this. Here are my key takeaways:

  • A shift towards making NLP models smaller and more efficient. In other words, which algorithms will be able to scale as datasets keep growing?
  • Ines and Matt are already working on improving spaCy and Prodigy with these future trends in mind
  • We can expect to see a lot of transfer learning aspects in spaCy soon (hello pre-trained models!). Spending a lot less time training our NLP models and not having to wait forever for them to converge sounds perfect to me

 

What’s the one problem you would want to solve in NLP over the next 3-4 years?

I loved this question by Kunal and here’s Matt’s answer in full:

“A better process for information extraction that can be trained on custom problems and that supports entity linking as well. And with an integrated process for annotation.

People end up making ad-hoc systems at the moment. spaCy has some useful components for it but we can do better for that.”

And here is Ines weighing in with her thoughts:

“Building out a set of best practices and putting everything together that we’ve learned and seen. At the moment a lot of it is trial and error. It’s very difficult to recommend something because every use case is different.

Over the next 2-3 years, I’m hoping we can have a better sense of best practices, workflows, and how to build end-to-end systems that generalize well to different use cases.”

 

Advice for NLP Enthusiasts and Aspiring Data Scientists

Here’s a summary of Ines and Matt’s golden advice for NLP enthusiasts and aspiring data scientists in general:

  • Hone your existing skills and craft what you’re learning so you can contribute towards a team that’s already delivering successful projects in this space. Don’t start out and say “I want to be an NLP specialist”. This could lead you to run projects that are potentially set up to fail because you are unlikely to have enough influence with the leadership and other decision makers
  • There’s more to NLP than the modern deep learning algorithms that attract aspiring data scientists. There are numerous components to it, such as:
    • Building software
    • Building applications/products
    • Understanding the logic, structure and rules of an NLP problem, rather than finding a one-size-fits-all approach to build a model

 

End Notes and Resources

Our data scientists here at Analytics Vidhya are huge fans and avid users of spaCy. We’ve been using it since it was launched and have integrated it into our research and our courses as well. It is a truly superb library for NLP tasks.
