What is this about?
It all started with language. Then it got all mixed up with computers. What a great cocktail! Let's enjoy the party! This is a quick overview of some fundamental NLP concepts.
After years of working as a linguist with computational tools at my disposal, I found Natural Language Processing to be the perfect field to work in. I have worked in computational linguistics, corpus linguistics and natural language processing for two decades now. Nowadays, NLP is a revolution. It’s everywhere. So I think it would be nice to talk a little bit about it and share a compilation of common NLP tasks.
You might know that Python is a programming language. For linguists, the first part of the term, programming, might be intimidating at first. So, let’s focus on the second part: language. Python is a code for communication, like Spanish or Greek. If I need to talk to French people, I’d better speak French, right? If I’d like to talk to a computer about NLP matters, I’d better write Python. That’s it. You don’t need to speak it yet, just write it.
Python works like any other language: it basically has grammar rules, orthotypographic rules and a vocabulary. Once you understand it, it’s like uncorking a bottle of wine: you get access to the wine, not just the bottle. And the thing is that there are hundreds of useful libraries available, most of them open source. You just need to “speak” Python.
This is not a Python course, but we will need the language for the tasks ahead. I leave that in your hands. Below you will find a list of common NLP tasks and their definitions, together with sample code. If you would like to test the code and play around with it, you can use this Colab Notebook. There are several examples, like text generation, in which you can complete a sentence like “As far as I see, the future of localization…” and get automatically generated text, like a fortune teller!
Common NLP tasks
Tokenization
Tokenization is the process of breaking a piece of text into smaller pieces called tokens. Here’s an example of how to tokenize a sentence using the nltk
library:
import nltk
# nltk.download('punkt')  # run once if the tokenizer data is not installed yet
# Tokenize a sentence
sentence = "This is a sentence that we want to tokenize."
tokens = nltk.word_tokenize(sentence)
print(tokens)
# Output: ['This', 'is', 'a', 'sentence', 'that', 'we',
# 'want', 'to', 'tokenize', '.']
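Tokens do not have to be words. The same library can also split a text into sentences; here is a minimal sketch (assuming the punkt tokenizer data is installed):
import nltk
# nltk.download('punkt')  # tokenizer data, if not downloaded yet
text = "Tokenization works at several levels. We can also split a text into sentences."
sentences = nltk.sent_tokenize(text)
print(sentences)
# Expected output: ['Tokenization works at several levels.',
#                   'We can also split a text into sentences.']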
Part-of-speech tagging
Part-of-speech (POS) tagging is the process of marking each word in a text with its corresponding part of speech. Here’s an example of how to perform POS tagging using the nltk
library:
import nltk
# nltk.download('punkt')  # tokenizer data
# nltk.download('averaged_perceptron_tagger')  # POS tagger data
# POS tagging
sentence = "This is a sentence that we want to tag."
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)
# Output: [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
# ('sentence', 'NN'), ('that', 'IN'), ('we', 'PRP'),
# ('want', 'VBP'), ('to', 'TO'), ('tag', 'VB'),
# ('.', '.')]
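If the Penn Treebank abbreviations (DT, VBZ, NN and so on) look cryptic, nltk can describe them for you. A quick sketch, assuming the tagsets help data has been downloaded:
import nltk
# nltk.download('tagsets')  # tag documentation, if not downloaded yet
nltk.help.upenn_tagset('VBZ')
# Output (abridged): VBZ: verb, present tense, 3rd person singular ...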
Named entity recognition
Named entity recognition (NER) is the process of identifying and classifying named entities in a text, such as person names, organizations, locations, etc. Here’s an example of how to perform NER using the spacy
library:
import spacy
# Load the 'en_core_web_sm' model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# Text to process
text = "John Smith works for Google in Mountain View, California."
# Process the text
doc = nlp(text)
# Print named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
# Output:
# John Smith PERSON
# Google ORG
# Mountain View GPE
# California GPE
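If a label such as GPE is not self-explanatory, spaCy can expand it for you:
import spacy
# Look up the meaning of an entity label
print(spacy.explain('GPE'))
# Output: Countries, cities, states
print(spacy.explain('ORG'))
# Output: Companies, agencies, institutions, etc.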
Sentiment analysis
Sentiment analysis is the process of determining the sentiment (positive, negative, or neutral) of a piece of text. Here’s an example of how to perform sentiment analysis using the textblob
library:
from textblob import TextBlob
# Text to process
text = "I had a great time at the movie. The acting was excellent and the story was very engaging."
# Process the text
blob = TextBlob(text)
# Get the sentiment
sentiment = blob.sentiment
# Print the sentiment
print(sentiment.polarity)
# Output: 0.7
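TextBlob also reports subjectivity, and polarity turns negative for negative text. A small sketch with an invented negative sentence, purely for illustration:
from textblob import TextBlob
# Polarity ranges from -1 (negative) to 1 (positive);
# subjectivity ranges from 0 (objective) to 1 (subjective)
negative = TextBlob("The movie was terrible and the plot was boring.")
print(negative.sentiment.polarity)      # a negative value, close to -1 for this sentence
print(negative.sentiment.subjectivity)  # a high value, close to 1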
Lemmatization
Lemmatization is the process of reducing a word to its base form, or lemma. For example, the lemma of the word “was” is “be”, and the lemma of the word “better” is “good”. Here’s an example of how to perform lemmatization using the nltk
library:
import nltk
# nltk.download('punkt')  # tokenizer data
# nltk.download('wordnet')  # WordNet data used by the lemmatizer
# Text to process
text = "The cats were playing in the garden. They were having a great time."
# Tokenize the text
tokens = nltk.word_tokenize(text)
# Perform lemmatization (no part-of-speech hint, so each word is treated as a noun)
lemmatizer = nltk.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmas)
# Output: ['The', 'cat', 'were', 'playing', 'in', 'the', 'garden', '.', 'They', 'were', 'having', 'a', 'great', 'time', '.']
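Without a part-of-speech hint the WordNet lemmatizer treats every word as a noun, which is why “were” and “playing” stay unchanged above. Passing the POS recovers the lemmas mentioned in the definition; a quick sketch:
import nltk
# With a part-of-speech hint, the lemmatizer returns the dictionary forms
lemmatizer = nltk.WordNetLemmatizer()
print(lemmatizer.lemmatize('was', pos='v'))      # be
print(lemmatizer.lemmatize('playing', pos='v'))  # play
print(lemmatizer.lemmatize('better', pos='a'))   # good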
Stemming
Stemming is the process of reducing a word to its root form. It is similar to lemmatization, but it is usually simpler and less accurate. Here’s an example of how to perform stemming using the nltk
library:
import nltk
# Text to process
text = "The cats were playing in the garden. They were having a great time."
# Tokenize the text
tokens = nltk.word_tokenize(text)
# Perform stemming
stemmer = nltk.PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)
# Output: ['the', 'cat', 'were', 'play', 'in', 'the', 'garden', '.', 'they', 'were', 'have', 'a', 'great', 'time', '.']
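Because the Porter stemmer simply strips suffixes, it can return stems that are not real words, which is where the “less accurate” part shows. A small comparison with the lemmatizer, for illustration:
import nltk
stemmer = nltk.PorterStemmer()
lemmatizer = nltk.WordNetLemmatizer()
print(stemmer.stem('studies'))          # studi (not a real word)
print(lemmatizer.lemmatize('studies'))  # study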
Topic modeling
Topic modeling is the process of automatically identifying the topics present in a collection of documents. Here’s an example of how to perform topic modeling using the gensim
library:
import gensim
import nltk  # needed for word_tokenize below
# Text to process
text = ["The cats were playing in the garden.", "They were having a great time.", "I love cats."]
# Tokenize the text
tokens = [nltk.word_tokenize(doc) for doc in text]
# Create a dictionary from the tokens
dictionary = gensim.corpora.Dictionary(tokens)
# Create a bag-of-words representation of the documents
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
# Train a LDA model on the corpus
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary)
# Print the topics
for idx, topic in lda_model.print_topics(-1):
    print("Topic: ", idx)
    print(topic)
# Output: the word weights for each of the two topics, e.g.
# Topic: 0
# 0.205*"in" + ... (the exact weights vary between runs)
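Once trained, the model can also estimate the topic mixture of an unseen document. A minimal sketch that rebuilds the toy corpus above (the exact weights will differ on every run):
import gensim
import nltk
# Same toy corpus as above
docs = ["The cats were playing in the garden.", "They were having a great time.", "I love cats."]
tokens = [nltk.word_tokenize(doc.lower()) for doc in docs]
dictionary = gensim.corpora.Dictionary(tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary)
# Topic mixture of a new, unseen document
new_bow = dictionary.doc2bow(nltk.word_tokenize("my cats love the garden"))
print(lda_model.get_document_topics(new_bow))
# Output: a list of (topic_id, probability) pairs, e.g. [(0, 0.72), (1, 0.28)] -- values vary per run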
Word sense disambiguation
Word sense disambiguation (WSD) is the process of determining the sense of a word in a particular context. Here’s an example of how to perform WSD with the Lesk algorithm using the nltk library:
import nltk
from nltk.wsd import lesk
# nltk.download('punkt')  # tokenizer data
# nltk.download('wordnet')  # WordNet data used by Lesk
# Text to process
text = "I saw the man with the telescope."
# Tokenize the text
tokens = nltk.word_tokenize(text)
# Disambiguate the verb "saw" in this context with the Lesk algorithm
sense = lesk(tokens, 'saw', 'v')
print(sense, '-', sense.definition() if sense else 'no sense found')
# Output: the WordNet synset that Lesk picks for "saw" in this sentence, together with its gloss
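Lesk picks one sense out of WordNet’s inventory for the word. If you want to see what that inventory looks like, you can list every candidate sense; here with the classic ambiguous word “bank”, purely for illustration:
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')  # WordNet data, if not downloaded yet
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())
# Output (abridged):
# bank.n.01 - sloping land (especially the slope beside a body of water)
# depository_financial_institution.n.01 - a financial institution that accepts deposits ...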
Language translation
Language translation is the process of converting text from one language to another. Here’s an example of how to perform language translation using the googletrans
library:
from googletrans import Translator
# Text to translate
text = "Bonjour, comment ça va?"
# Translate the text
translator = Translator()
translation = translator.translate(text, dest='en')
print(translation.text)
# Output: "Hello, how are you?"
Text classification
Text classification is the process of assigning a label to a piece of text based on its content. Here’s an example of how to perform text classification using the scikit-learn
library:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# Training data
X_train = ["I love cats.", "I hate dogs.", "I love dogs."]
y_train = [1, 0, 1]
# Test data
X_test = ["I love cats.", "I hate dogs."]
# Vectorize the text data
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train_vectors, y_train)
# Predict the labels for the test data
predictions = model.predict(X_test_vectors)
print(predictions)
# Output: [1 0]
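Beyond hard labels, the classifier can also return class probabilities, which is handy when you need a confidence score. A self-contained sketch with the same toy data (the exact probabilities depend on the fitted model):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# Same toy training data as above
X_train = ["I love cats.", "I hate dogs.", "I love dogs."]
y_train = [1, 0, 1]
vectorizer = CountVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(X_train), y_train)
# Probability of each class (0 and 1) for a new sentence
probabilities = model.predict_proba(vectorizer.transform(["I love my cat."]))
print(probabilities)
# Output: something like [[0.3 0.7]] -- probability of class 0 and class 1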
Information extraction
Information extraction is the process of extracting structured information from unstructured text. Here’s an example of how to perform information extraction using the spacy
library:
import spacy
# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')
# Text to process
text = "John Smith works for Google in Mountain View, California. He can be reached at john.smith@gmail.com."
# Process the text
doc = nlp(text)
# Extract the information from the named entities and token attributes
# (hard-coded token positions would break as soon as the text changes)
name = next((ent.text for ent in doc.ents if ent.label_ == 'PERSON'), None)
organization = next((ent.text for ent in doc.ents if ent.label_ == 'ORG'), None)
location = next((ent.text for ent in doc.ents if ent.label_ == 'GPE'), None)
email = next((token.text for token in doc if token.like_email), None)
print(f"Name: {name}")
print(f"Organization: {organization}")
print(f"Location: {location}")
print(f"Email: {email}")
# Output:
# Name: John Smith
# Organization: Google
# Location: Mountain View
# Email: john.smith@gmail.com
Text Summarization
Summarization is the process of generating a shortened version of a text that conveys its main points. Here’s an example of how to perform summarization using the sumy
library:
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
LANGUAGE = "english"
SENTENCES_COUNT = 10
if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
Text generation
Text generation is the process of automatically creating text using a given model. Here’s an example of how to perform text generation with GPT-2 using the transformers library:
import transformers
from transformers import pipeline
# Create generator
gpt2_generator = pipeline('text-generation', model='gpt2')
# Generator setup
sentences = gpt2_generator("As far as I see, the future", do_sample=True, top_k=50, temperature=0.6, max_length=128, num_return_sequences=3)
# Generate texts
for sentence in sentences:
    print(sentence["generated_text"])
    print("=" * 50)
Spelling correction
Spelling correction is the process of correcting spelling errors in a piece of text. Here’s an example of how to perform spelling correction using the autocorrect
library:
from autocorrect import Speller
# Text to correct
text = "I weint to the stoie to bguy some vebetagles."
# Create a spellchecker
spell = Speller()
# Correct the spelling
corrected = spell(text)
print(f"Corrected text: {corrected}")
# Output: Corrected text: I went to the store to buy some vegetables.
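autocorrect is not limited to English. As a hedged sketch, the lang parameter selects another word-frequency model, here Spanish, provided your installed version ships it:
from autocorrect import Speller
# 'es' selects the Spanish word-frequency model, if it is available in your installed version
spell_es = Speller(lang='es')
print(spell_es("Me gusta comer mansanas"))
# Expected output: Me gusta comer manzanas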
In a nutshell
The realm of Natural Language Processing offers a diverse array of tools and techniques that can significantly enhance our interaction with technology. From basic tokenization to advanced text generation and sentiment analysis, each NLP task enables us to bridge the gap between human language and machine understanding.
As you embark on experimenting with the provided sample codes, remember that the field of NLP is not just about coding, but also about understanding language in its many forms. Embrace these tools, and you may find that NLP not only makes technology more accessible, but also enriches your understanding of human language.
Whether you’re a linguist, a developer, or just a curious mind, the power of NLP opens up a world of possibilities that can transform both mundane and complex tasks into something more intuitive and engaging.