If you are interested in Natural Language Processing, this interview with Rajesh Muppalla, Senior Director of Engineering at Avalara, is a must-read. A regular speaker at Scale By the Bay, Rajesh is presenting the talk Lessons Learnt Building Domain Specific NLP Pipelines this year. In this interview, he shares how he got started with NLP, how he has been building the world’s largest product catalog, what the biggest open problems in NLP are, which trends are shaping the future of NLP, and why he is travelling all the way from Chennai, India, to speak at Scale By the Bay.
Welcome to Scale By the Bay! Please tell us more about yourself: how did you join Avalara and how did you get interested in natural language processing (NLP)?
Thank you. I am a Senior Director of Engineering at Avalara, leading the teams working on product classification and tax content sourcing automation. Earlier this year, I joined Avalara through the acquisition of my previous company Indix, which I co-founded in 2012. At the time of acquisition, I was leading the machine learning (ML) and data platform teams. Prior to Indix, I was at Thoughtworks as a tech lead on Go-CD, an open source Continuous Delivery (CD) tool, where I was fortunate to work with some of the pioneers in the area of CD.
At Indix, our mission was to build the world’s largest product catalog. It was an ambitious goal that involved crawling the web to gather product information from 5,000+ brand and retailer websites, classifying the products into a taxonomy of 5,000+ nodes, and extracting relevant product attributes to match products across retailers.
This structured data was then exposed via a search API to support customer use cases that needed product information. I initially worked on the crawling infrastructure using Scala and Akka, and then led the team that built the data pipelines to process this data using the principles of the Lambda Architecture. After that, my team focused on solving NLP problems like product classification and product attribute extraction, and that work resulted in my presentation at Scale by the Bay in 2017 on “Applying Continuous Delivery principles in Machine Learning” (video, slides).
We spent most of 2018 building an e-commerce knowledge graph with 100 million nodes and about a billion edges to solve problems like Query Intent Recognition and Query Understanding for Product Search. During the last six months at Avalara, my team has been working on using Transfer Learning techniques to solve NLP problems that classify and extract tax related entities from unstructured tax rules and regulations.
What's your current role and what exciting things are you working on at the moment?
Avalara is a tax technology company that helps businesses of every size. Our mission is to help businesses manage complicated tax compliance obligations imposed by their state, local, and other tax authorities throughout the world. Each year, we process billions of indirect tax transactions for more than 25,000 customers. Figuring out the exact tax rules for what they sell is a big hurdle for many businesses.
The teams that I lead are applying NLP techniques to identify these tax rules and their changes automatically and classify products into the right categories, so the appropriate rules can be applied. Here are more details on both of these problem areas.
Automatic Rate and Rule Sourcing - One universal truth about sales tax compliance is that nothing is uniform - rules and rates change not only over time, but also between state, city, county, and local jurisdictions. Most often, these rules and rates are published on the individual jurisdiction websites a few weeks or days before they are effective. These rate and rule changes are either available in tables or in plain text. My team crawls these sources, parses them using NLP techniques, and publishes them to our content repository so that these can then be used by our tax engine.
Product Classification - Some states in the U.S. exempt certain products like baby diapers because they’re essential, while others may subject products like soda to a higher rate because they pose a health risk. To accurately calculate the sales tax for a product, businesses have to classify it accurately, which is where our product classification comes in. We classify products into one of the thousands of internal tax codes in our system, allowing us to apply the applicable tax rules for each product. In addition to classification in the US, my team also helps classify products that are shipped internationally into thousands of Harmonized System codes. These codes are used by customs authorities on every cross-border shipment around the world to allocate the correct rate of duty and tax for each product. Both of these classification systems build upon the work we did at Indix, where we classified products into our own taxonomy.
What are the three trends that will shape the future of NLP?
I have been working in Natural Language Processing (NLP) for the last five years as a practitioner and have been fortunate enough to experience firsthand some of the massive changes that have happened in this field.
Unless you have been living under a rock, you know that Deep Learning has taken NLP by storm recently. Numerous papers published in the last 2-3 years have beaten the state-of-the-art (SOTA) results on various NLP tasks like document classification, named entity recognition, and machine translation. Based on what I know and what I am seeing, below are the three trends that I think will shape the next few years in NLP.
Transfer Learning - Recent advances in Computer Vision (CV) tasks like image classification, segmentation, and captioning can be attributed to transfer learning. The idea behind transfer learning is to fine-tune models for a new task on top of models that have already been pre-trained on large datasets like ImageNet and MS-COCO; these pre-trained models provide a head start for downstream tasks. Inspired by this work, NLP researchers demonstrated that by using language models pre-trained on large corpora of unlabelled text, they could achieve state-of-the-art results on downstream NLP tasks like classification and named entity recognition. ELMo, ULMFiT, OpenAI GPT, BERT, and XLNet are some examples where transfer learning beats the earlier SOTA results on standard datasets. Another important benefit of transfer learning is that it allows you to train a model for a target task with a small number of labelled examples. For people who are already familiar with pre-trained word embeddings like Word2Vec, GloVe, or FastText, there is one key difference: instead of using a single embedding layer, transfer learning reuses an entire pre-trained deep learning model, which captures deeper features and representations, including context.
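The core mechanic can be illustrated in a few lines. This is a minimal sketch, not any model mentioned above: a stand-in "pre-trained" encoder is frozen, and only a small task-specific head is trained on a tiny labelled dataset, which is the essence of the fine-tuning setup described here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained model: a fixed projection that maps raw
# inputs to richer representations. Its weights are frozen below.
W_pretrained = rng.normal(size=(8, 4))

def encode(x):
    """Frozen 'pre-trained' encoder: raw features -> representation."""
    return np.tanh(x @ W_pretrained)

# Tiny labelled dataset for the downstream task (toy binary labels).
X = rng.normal(size=(20, 8))
y = (X[:, 0] > 0).astype(float)

# Task-specific head: a logistic-regression layer trained by gradient
# descent. Note that only w and b are updated; W_pretrained never is.
w = np.zeros(4)
b = 0.0
for _ in range(500):
    z = encode(X) @ w + b
    p = 1.0 / (1.0 + np.exp(-z))            # sigmoid
    grad = p - y                            # dLoss/dz for log loss
    w -= 0.1 * encode(X).T @ grad / len(y)  # head weights only
    b -= 0.1 * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(encode(X) @ w + b)))) > 0.5
print("training accuracy:", (preds == y.astype(bool)).mean())
```

In a real pipeline the frozen projection would be a deep pre-trained language model and the head would be fine-tuned (and often the encoder gradually unfrozen), but the division of labor is the same.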
Parameter Tuning Improvements - One of the most time-consuming tasks in training ML models and deep learning networks is hyperparameter tuning, where you try to choose a set of optimal parameters for the learning algorithm. The learning rate is one of the most important hyperparameters to tune. Leslie Smith published a couple of papers that help one find an optimal learning rate with fewer experiments without compromising on accuracy. This helps reduce your training time and cost, which may be significant if you are training your models for several epochs on GPUs. The two papers worth looking at are Cyclical Learning Rates for Training Neural Networks and A Disciplined Approach to Neural Network Hyper-Parameters.
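To make the first paper concrete, here is a small sketch of its triangular cyclical learning-rate policy, where the rate ramps linearly between a lower and upper bound each cycle. The bounds and step size are illustrative placeholders, not recommended values.

```python
def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate: ramps linearly from base_lr up
    to max_lr and back, completing one cycle every 2 * step_size steps."""
    cycle = iteration // (2 * step_size)            # which cycle we are in
    x = abs(iteration / step_size - 2 * cycle - 1)  # position within cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# At the start of a cycle the rate equals base_lr; at the midpoint, max_lr.
print(triangular_clr(0))     # 0.0001
print(triangular_clr(2000))  # 0.01
```

Sweeping the rate like this over a short "range test" run also reveals a sensible learning-rate range: the region where the loss is still decreasing as the rate grows.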
Democratization of Deep Learning - I remember back in 2014-2015, when one had to spend quite a lot of time setting up CUDA libraries to do any deep learning work. We have come a long way since then: it is now easier than ever to get started and build your first deep learning NLP model.
With courses like Andrew Ng’s deeplearning.ai specialization, which provides the theoretical and mathematical underpinnings of various techniques, and courses from fast.ai, which follow a more top-down approach by helping you build something end-to-end before peeling back the layers underneath, you don’t need a PhD in ML to build something practical while also understanding how it works.
With algorithmic improvements, compute requirements are falling rapidly and model sizes are becoming more manageable, so you don’t need machines with a lot of GPU power to train your models. Additionally, with platforms like Google Colab, Kaggle Kernels, and Azure Notebooks that provide free credits or free GPU hours, you don’t need to spend a single penny while you are learning.
Finally, one can now replicate the SOTA results from a research paper a few days after it is published, as most authors now share their code on GitHub. Papers with Code is a site where you can browse the latest research papers along with their code, allowing you to reproduce their results.
What are some of the biggest open problems in NLP?
Although the pre-trained language models in transfer learning are able to generate sentences that are syntactically correct with a certain degree of common-sense reasoning and basic factual knowledge, a big drawback with these models is that their knowledge is limited to memorizing the facts that were observed during training.
As a result, for entities that are rare or unseen, the models are unable to generate factually correct sentences that mimic “real” understanding of the world.
We attempted to solve this problem at Indix by building a high-quality and scalable Knowledge Graph (KG) for the e-commerce domain. The expectation was that the embeddings from the KG would act as external knowledge and augment what our models learned from the training data. However, we were not able to show significant improvements in performance with this approach. This is an active area of research, and I hope that in the near future we will see models whose internal memory, the knowledge learned from the training data, is enriched with an external memory, like the knowledge inherited from a KG.
What will you talk about at Scale By the Bay and why did you choose to cover this subject?
To build the world’s largest product catalog, we needed a robust NLP pipeline to make sense of unstructured text data at scale.
The first part of my talk will cover the evolution of this architecture and the building blocks and algorithms of the NLP pipeline. The building blocks will include language models, word embeddings, and knowledge graphs; the algorithms will include classification, entity extraction, document similarity, and query understanding (for the e-commerce domain). At Avalara, the team has been tasked with making sense of unstructured text data in the tax compliance domain with limited data. The second part of the talk will focus on how we are using transfer learning techniques and our learnings from the e-commerce domain to solve language understanding problems in the tax compliance domain.
In 2017, during my talk on applying Continuous Delivery principles to Machine Learning, I introduced the concept of a “Machine Learning Sandwich” (see figure below).
There are three key parts in any ML workflow:
Data Engineering - This stage deals with data cleaning, transformation, and feature engineering
ML Algorithms - This stage deals with experimentation, hyperparameter tuning, and evaluation of the right ML algorithms
Productionization of ML models - This stage involves taking these models to production.
One would assume that in an ML workflow the “meatier” part would be the ML algorithms, but from our experience we realized that only 20% of our time was spent on the actual algorithms. The majority of our time was spent cleaning, transforming, and featurizing data, and taking the ML models to production. Based on our learnings, we built some great tooling for data engineering and ML model productionization that allowed us to truly focus on the meatier part. While my talk focused on the productionization of ML models, my colleague Manoj Mahalingam spoke about the tooling we built for solving the data engineering problems at the same conference (video, slides).
My talk this time will mostly focus on the ML algorithms and associated techniques and lessons learnt while solving the NLP problems at Indix and Avalara.
Who should attend your talk and what will they learn?
If you are interested in NLP and want to know more about how to build a robust NLP pipeline specific to your domain using some of the latest techniques, especially the latest deep learning approaches, this talk will provide you with the necessary background and pointers for further exploration. I will not only cover the techniques that worked, but also shed some light on the approaches that did not.
Anything else you'd like to add?
This will be my third time speaking at Scale by the Bay. I really enjoy being part of the Scale by the Bay community and look forward to traveling to speak at the conference all the way from Chennai, India.