How to select the best NLP framework for your data science and artificial intelligence (AI) project.
What is NLP and Why is it Important?
Natural Language Processing (NLP) is a hugely popular subfield of machine learning that combines computer science, artificial intelligence (AI) and linguistics. NLP analyzes large amounts of natural language data using a collection of methods and algorithms aimed at understanding human languages. Through these techniques, language rules, including semantics and syntax, are learned and analyzed to extract meaning. Common NLP applications today include Google Translate, automated telephone systems, writing assistants such as Grammarly, mobile predictive text, and personal assistant applications such as Siri (Apple) and Alexa (Amazon). Other examples of applications that leverage NLP can be seen in TAUS Matching Data and the cleaning library for TAUS Data Marketplace.
Due to its wide variety of applications, NLP has become an important field for today’s businesses. The rise of AI and personalized technology has created a surge in further research and development in NLP techniques, and with this popularity and demand, many NLP frameworks have become available. An NLP framework is a toolkit that provides the technology and modules needed to execute an NLP task or model. In this article, we will introduce and describe the top 10 most popular NLP frameworks available today.
Top 10 NLP Frameworks
1. TensorFlow
TensorFlow is arguably the most popular framework for a variety of deep learning applications. It is a free and open-source library originally released by the Google Brain team in 2015. TensorFlow is applauded for its comprehensive toolkit and its flexible, production-scale architecture that runs on both CPUs and GPUs. Historically, the library used static graphs for computation, meaning the computation graph had to be fully defined before a model was run; since TensorFlow 2.0, eager execution is the default.
TensorFlow supports a number of NLP text-processing methodologies such as tokenization and vectorization and is well-suited for most NLP applications.
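To make those two terms concrete, here is a minimal, framework-free Python sketch of what tokenization and vectorization mean; a real TensorFlow pipeline would use the library's own text-processing utilities rather than hand-rolled code like this.

```python
def tokenize(text):
    # Lowercase and split on whitespace: the simplest possible tokenizer.
    return text.lower().split()

def vectorize(tokens, vocabulary):
    # Bag-of-words vectorization: count how often each vocabulary word occurs.
    return [tokens.count(word) for word in vocabulary]

corpus = [
    "NLP combines computer science and linguistics",
    "NLP analyzes natural language data",
]
tokenized = [tokenize(doc) for doc in corpus]

# Build a sorted vocabulary over the whole corpus, then turn each
# document into a fixed-length vector of word counts.
vocabulary = sorted({tok for doc in tokenized for tok in doc})
vectors = [vectorize(doc, vocabulary) for doc in tokenized]
```

Each document becomes a vector with one count per vocabulary word, which is the numeric form a model can actually learn from.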
2. PyTorch
PyTorch is another widely popular open-source library, built on Torch. It uses dynamic computation graphs: developers can define, modify, and execute nodes at runtime. PyTorch’s Tensor is very similar to a NumPy array, but it provides more operations and can run on GPUs to accelerate computation. PyTorch is commonly used in practice for computer vision and NLP applications. Its advantages are that it is easy to learn, lives in the Python ecosystem, and has extensive documentation.
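The define-by-run style can be sketched in a few lines (assuming PyTorch is installed): the graph is built on the fly as each operation executes, and gradients flow back through whatever code actually ran.

```python
import torch

# A Tensor behaves much like a NumPy array, but can track gradients
# and can be moved to a GPU with .to("cuda") when one is available.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The computation graph is constructed dynamically, as this line executes.
y = (x ** 2).sum()

# Backpropagate: d(sum(x^2))/dx = 2x.
y.backward()
print(x.grad)  # tensor([2., 4., 6.])
```

Because the graph is rebuilt on every forward pass, ordinary Python control flow (loops, conditionals) can change the model's structure from one input to the next, which is handy for variable-length text.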
3. Keras
Keras is a high-level, Python-based neural network API that runs on both CPU and GPU, making it suitable for deep learning. Because it is part of the Python ecosystem, Keras, like PyTorch, can be quite handy for data scientists on a given AI or NLP application. As a high-level API that sits on top of TensorFlow, Keras makes models fast and easy to implement with well-defined parameters, and it is therefore easier to understand and use than raw TensorFlow.
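A minimal sketch of how compact a Keras model definition can be (assuming TensorFlow is installed; the layer sizes and vocabulary size below are arbitrary placeholders, not recommendations):

```python
import numpy as np
import tensorflow as tf

# A tiny text classifier: integer token ids in, one probability out.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# One fake batch of four token ids, just to show the shapes flow through.
sample = np.array([[1, 2, 3, 4]])
probs = model(sample)  # shape (1, 1): one probability per input sequence
```

The entire architecture is declared in one list of layers; with raw TensorFlow the same model would require noticeably more plumbing.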
4. SpaCy
SpaCy is a fast and lightweight open-source library in Python built specifically for advanced NLP. It supports custom models in both PyTorch and TensorFlow, and it boasts simple workflow management and model deployment. SpaCy supports named entity recognition, dependency parsing, part-of-speech tagging, tokenization, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and other NLP methodologies.
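A quick sketch of the library's feel (assuming spaCy is installed): `spacy.blank` creates a bare English pipeline with only a tokenizer, so no pretrained model download is needed; loading a full pipeline such as `en_core_web_sm` would add tagging, parsing, and named entity recognition on top of the same `Doc` interface.

```python
import spacy

# A blank English pipeline: tokenizer only, no model download required.
nlp = spacy.blank("en")

doc = nlp("SpaCy was built specifically for advanced NLP.")
tokens = [token.text for token in doc]
print(tokens)
```

Note that the tokenizer is smarter than a whitespace split: the trailing period comes out as its own token.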
5. Facebook AI XLM/mBERT
XLM/mBERT is a cross-lingual language model (XLM) created by Facebook AI in 2019. The model achieved astounding results in cross-lingual classification, using both supervised and unsupervised learning. The pre-trained model is built on top of PyTorch and supports over 100 languages, including underrepresented ones. Hence, if you are working with a low-resource language and lack large data sources, XLM/mBERT could be a great option.
6. Theano
Theano is an open-source Python library for numerical computation that allows developers to efficiently define and optimize expressions involving multidimensional arrays. These computations can be compiled to run on both CPU and GPU architectures, making the framework handy for big data. Theano lets users define their own machine learning models, and NLP tasks such as machine translation, word embeddings, speech recognition, and text classification can be built on top of it.
7. Chainer
The Chainer framework was the Open Source Data Science Project winner at ODSC in 2018. Built on the Python ecosystem, Chainer is a standalone open-source deep learning framework. It provides a flexible, high-performance implementation of a wide variety of neural network architectures, including feed-forward, convolutional, recurrent, and recursive networks.
8. Scikit-Learn
Scikit-Learn is a hugely popular Python library with a wide variety of machine learning implementations. It is beginner-friendly yet powerful, supporting baseline models for tasks such as classification and regression. Scikit-Learn is best suited to simple, small-scale NLP tasks; it may not be the best option for higher-order NLP tasks, but it does give beginners a good starting point.
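A sketch of the kind of small-scale NLP task Scikit-Learn handles well (assuming scikit-learn is installed; the sentences and labels below are toy examples invented for illustration): a bag-of-words vectorizer chained to a linear classifier in one pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data, purely illustrative.
texts = [
    "I loved this film, it was fantastic",
    "great acting and a wonderful story",
    "terrible plot, I hated every minute",
    "an awful, boring waste of time",
]
labels = ["positive", "positive", "negative", "negative"]

# Pipeline: bag-of-words features -> logistic regression classifier.
classifier = make_pipeline(CountVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["what a wonderful, fantastic story"]))
```

For a quick baseline like this, the whole train-and-predict loop is a handful of lines, which is exactly the niche the paragraph above describes.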
9. Stanford CoreNLP
Stanford’s CoreNLP is a highly flexible and extensible framework that can also run standalone as a web service. It is acclaimed for its ability to execute NLP tasks simply by applying a wide range of linguistic analyses and modern grammatical tools. CoreNLP has APIs available for common programming languages, including Python: the Python interface, called Stanza, connects to the CoreNLP Java package and provides additional NLP functionality. Six languages are currently supported: English, Arabic, Chinese, French, German, and Spanish. Common NLP tasks that CoreNLP supports include token and sentence boundary detection, part-of-speech tagging, named entity recognition, numeric and time value extraction, dependency and constituency parsing, coreference resolution, sentiment analysis, and quote attribution.
10. Deeplearning4j
Deeplearning4j is an open-source, Java-based library for the Java Virtual Machine. It supports a variety of deep learning models and can integrate with Apache Spark for distributed parallel computing on CPUs and GPUs. Deeplearning4j supports text and NLP tasks including tokenization, sentence iteration, and word vectorization.
Keras vs TensorFlow vs PyTorch
If you are looking to get your feet wet with NLP in a framework, Keras, TensorFlow, and PyTorch are perhaps the three most popular options. If you are looking to train a highly complex model on a large volume of data, TensorFlow or PyTorch may be a good fit; if you have a smaller dataset and a simple model, Keras would be the better option.
As the field of NLP continues to grow and improve, the surrounding technologies advance in parallel. TensorFlow, Keras, and PyTorch are the most widely used general deep learning frameworks that provide NLP support, while SpaCy, CoreNLP, and XLM/mBERT are built specifically for NLP. Depending on your application, the ten frameworks above are great to explore when beginning development of an NLP-based project.
Need help to process your language data? Contact TAUS NLP Experts for a consultation.