What is this about?
It all started with language. Then it got all mixed up with computers. What a great cocktail! Let's enjoy the party! This is a quick overview of some fundamental NLP concepts.
After years of working as a linguist with computational tools at my disposal, I found Natural Language Processing to be the perfect field to work in. I have worked in computational linguistics, corpus linguistics and natural language processing for two decades now. Nowadays, NLP is a revolution. It’s everywhere. So I think it would be nice to talk a little bit about it and share a compilation of common NLP tasks.
You might know that Python is a programming language. For linguists, the first part of the term, programming, might be intimidating at first. So, let’s focus on the second part: language. Python is a code for communication, like Spanish or Greek. If I need to talk to French people, I’d better speak French, right? If I’d like to talk to a computer about NLP matters, I’d better write Python. That’s it. You don’t need to speak it yet, just write it.
Python works like any other language: it basically has grammar rules, orthotypographic rules and a vocabulary. Once you understand it, it’s like uncorking a bottle of wine: you get access to the wine, not just the bottle. And the thing is that there are hundreds of useful libraries available, most of them open source. You just need to “speak” Python.
This is not a Python course, but we will need the language for the tasks ahead. I leave that in your hands. Below you will find a list of common NLP tasks and their definitions, together with sample code. If you would like to test the code and play around with it, you can use this Colab Notebook. There are several examples, like text generation, in which you can complete a sentence like “As far as I see, the future of localization…” and get automatically generated text, like a fortune teller!
Common NLP tasks
Tokenization
Tokenization is the process of breaking a piece of text into smaller pieces called tokens. Here’s an example of how to tokenize a sentence using the nltk
library:
import nltk
# nltk.download('punkt')  # run once if the tokenizer data is not installed yet
# Tokenize a sentence
sentence = "This is a sentence that we want to tokenize."
tokens = nltk.word_tokenize(sentence)
print(tokens)
# Output: ['This', 'is', 'a', 'sentence', 'that', 'we',
# 'want', 'to', 'tokenize', '.']
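Tokens do not have to be words. The same library can also split a text into sentences; here is a minimal sketch (assuming the punkt tokenizer data is installed):
import nltk
# nltk.download('punkt')  # tokenizer data, if not downloaded yet
text = "Tokenization works at several levels. We can also split a text into sentences."
sentences = nltk.sent_tokenize(text)
print(sentences)
# Expected output: ['Tokenization works at several levels.',
#                   'We can also split a text into sentences.']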
Part-of-speech tagging
Part-of-speech (POS) tagging is the process of marking each word in a text with its corresponding part of speech. Here’s an example of how to perform POS tagging using the nltk
library:
import nltk
# nltk.download('punkt')  # tokenizer data
# nltk.download('averaged_perceptron_tagger')  # POS tagger data
# POS tagging
sentence = "This is a sentence that we want to tag."
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)
# Output: [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
# ('sentence', 'NN'), ('that', 'IN'), ('we', 'PRP'),
# ('want', 'VBP'), ('to', 'TO'), ('tag', 'VB'),
# ('.', '.')]
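If the Penn Treebank abbreviations (DT, VBZ, NN and so on) look cryptic, nltk can describe them for you. A quick sketch, assuming the tagsets help data has been downloaded:
import nltk
# nltk.download('tagsets')  # tag documentation, if not downloaded yet
nltk.help.upenn_tagset('VBZ')
# Output (abridged): VBZ: verb, present tense, 3rd person singular ...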
Named entity recognition
Named entity recognition (NER) is the process of identifying and classifying named entities in a text, such as person names, organizations, locations, etc. Here’s an example of how to perform NER using the spacy
library:
import spacy
# Load the 'en_core_web_sm' model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# Text to process
text = "John Smith works for Google in Mountain View, California."
# Process the text
doc = nlp(text)
# Print named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
# Output:
# John Smith PERSON
# Google ORG
# Mountain View GPE
# California GPE
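If a label such as GPE is not self-explanatory, spaCy can expand it for you:
import spacy
# Look up the meaning of an entity label
print(spacy.explain('GPE'))
# Output: Countries, cities, states
print(spacy.explain('ORG'))
# Output: Companies, agencies, institutions, etc.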
Sentiment analysis
Sentiment analysis is the process of determining the sentiment (positive, negative, or neutral) of a piece of text. Here’s an example of how to perform sentiment analysis using the textblob
library:
from textblob import TextBlob
# Text to process
text = "I had a great time at the movie. The acting was excellent and the story was very engaging."
# Process the text
blob = TextBlob(text)
# Get the sentiment
sentiment = blob.sentiment
# Print the sentiment
print(sentiment.polarity)
# Output: 0.7
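TextBlob also reports subjectivity, and polarity turns negative for negative text. A small sketch with an invented negative sentence, purely for illustration:
from textblob import TextBlob
# Polarity ranges from -1 (negative) to 1 (positive);
# subjectivity ranges from 0 (objective) to 1 (subjective)
negative = TextBlob("The movie was terrible and the plot was boring.")
print(negative.sentiment.polarity)      # a negative value, close to -1 for this sentence
print(negative.sentiment.subjectivity)  # a high value, close to 1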
Lemmatization
Lemmatization is the process of reducing a word to its base form, or lemma. For example, the lemma of the word “was” is “be”, and the lemma of the word “better” is “good”. Here’s an example of how to perform lemmatization using the nltk
library:
import nltk
# nltk.download('punkt')  # tokenizer data
# nltk.download('wordnet')  # WordNet data used by the lemmatizer
# Text to process
text = "The cats were playing in the garden. They were having a great time."
# Tokenize the text
tokens = nltk.word_tokenize(text)
# Perform lemmatization (no part-of-speech hint, so each word is treated as a noun)
lemmatizer = nltk.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmas)
# Output: ['The', 'cat', 'were', 'playing', 'in', 'the', 'garden', '.', 'They', 'were', 'having', 'a', 'great', 'time', '.']
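Without a part-of-speech hint the WordNet lemmatizer treats every word as a noun, which is why “were” and “playing” stay unchanged above. Passing the POS recovers the lemmas mentioned in the definition; a quick sketch:
import nltk
# With a part-of-speech hint, the lemmatizer returns the dictionary forms
lemmatizer = nltk.WordNetLemmatizer()
print(lemmatizer.lemmatize('was', pos='v'))      # be
print(lemmatizer.lemmatize('playing', pos='v'))  # play
print(lemmatizer.lemmatize('better', pos='a'))   # good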
Stemming
Stemming is the process of reducing a word to its root form. It is similar to lemmatization, but it is usually simpler and less accurate. Here’s an example of how to perform stemming using the nltk
library:
import nltk
# Text to process
text = "The cats were playing in the garden. They were having a great time."
# Tokenize the text
tokens = nltk.word_tokenize(text)
# Perform stemming
stemmer = nltk.PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)
# Output: ['the', 'cat', 'were', 'play', 'in', 'the', 'garden', '.', 'they', 'were', 'have', 'a', 'great', 'time', '.']
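Because the Porter stemmer simply strips suffixes, it can return stems that are not real words, which is where the “less accurate” part shows. A small comparison with the lemmatizer, for illustration:
import nltk
stemmer = nltk.PorterStemmer()
lemmatizer = nltk.WordNetLemmatizer()
print(stemmer.stem('studies'))          # studi (not a real word)
print(lemmatizer.lemmatize('studies'))  # study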
Topic modeling
Topic modeling is the process of automatically identifying the topics present in a collection of documents. Here’s an example of how to perform topic modeling using the gensim
library:
import gensim
import nltk  # needed for word_tokenize below
# Text to process
text = ["The cats were playing in the garden.", "They were having a great time.", "I love cats."]
# Tokenize the text
tokens = [nltk.word_tokenize(doc) for doc in text]
# Create a dictionary from the tokens
dictionary = gensim.corpora.Dictionary(tokens)
# Create a bag-of-words representation of the documents
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
# Train a LDA model on the corpus
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary)
# Print the topics
for idx, topic in lda_model.print_topics(-1):
    print("Topic: ", idx)
    print(topic)
# Output: the word weights for each of the two topics, e.g.
# Topic: 0
# 0.205*"in" + ... (the exact weights vary between runs)
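Once trained, the model can also estimate the topic mixture of an unseen document. A minimal sketch that rebuilds the toy corpus above (the exact weights will differ on every run):
import gensim
import nltk
# Same toy corpus as above
docs = ["The cats were playing in the garden.", "They were having a great time.", "I love cats."]
tokens = [nltk.word_tokenize(doc.lower()) for doc in docs]
dictionary = gensim.corpora.Dictionary(tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary)
# Topic mixture of a new, unseen document
new_bow = dictionary.doc2bow(nltk.word_tokenize("my cats love the garden"))
print(lda_model.get_document_topics(new_bow))
# Output: a list of (topic_id, probability) pairs, e.g. [(0, 0.72), (1, 0.28)] -- values vary per run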
Word sense disambiguation
Word sense disambiguation (WSD) is the process of determining the sense of a word in a particular context. Here’s an example of how to perform WSD with the Lesk algorithm using the nltk library:
import nltk
from nltk.wsd import lesk
# nltk.download('punkt')  # tokenizer data
# nltk.download('wordnet')  # WordNet data used by Lesk
# Text to process
text = "I saw the man with the telescope."
# Tokenize the text
tokens = nltk.word_tokenize(text)
# Disambiguate the verb "saw" in this context with the Lesk algorithm
sense = lesk(tokens, 'saw', 'v')
print(sense, '-', sense.definition() if sense else 'no sense found')
# Output: the WordNet synset that Lesk picks for "saw" in this sentence, together with its gloss
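Lesk picks one sense out of WordNet’s inventory for the word. If you want to see what that inventory looks like, you can list every candidate sense; here with the classic ambiguous word “bank”, purely for illustration:
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')  # WordNet data, if not downloaded yet
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())
# Output (abridged):
# bank.n.01 - sloping land (especially the slope beside a body of water)
# depository_financial_institution.n.01 - a financial institution that accepts deposits ...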
Language translation
Language translation is the process of converting text from one language to another. Here’s an example of how to perform language translation using the googletrans
library:
from googletrans import Translator
# Text to translate
text = "Bonjour, comment ça va?"
# Translate the text
translator = Translator()
translation = translator.translate(text, dest='en')
print(translation.text)
# Output: "Hello, how are you?"
Text classification
Text classification is the process of assigning a label to a piece of text based on its content. Here’s an example of how to perform text classification using the scikit-learn
library:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# Training data
X_train = ["I love cats.", "I hate dogs.", "I love dogs."]
y_train = [1, 0, 1]
# Test data
X_test = ["I love cats.", "I hate dogs."]
# Vectorize the text data
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train_vectors, y_train)
# Predict the labels for the test data
predictions = model.predict(X_test_vectors)
print(predictions)
# Output: [1 0]
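Beyond hard labels, the classifier can also return class probabilities, which is handy when you need a confidence score. A self-contained sketch with the same toy data (the exact probabilities depend on the fitted model):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# Same toy training data as above
X_train = ["I love cats.", "I hate dogs.", "I love dogs."]
y_train = [1, 0, 1]
vectorizer = CountVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(X_train), y_train)
# Probability of each class (0 and 1) for a new sentence
probabilities = model.predict_proba(vectorizer.transform(["I love my cat."]))
print(probabilities)
# Output: something like [[0.3 0.7]] -- probability of class 0 and class 1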
Information extraction
Information extraction is the process of extracting structured information from unstructured text. Here’s an example of how to perform information extraction using the spacy
library:
import spacy
# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')
# Text to process
text = "John Smith works for Google in Mountain View, California. He can be reached at john.smith@gmail.com."
# Process the text
doc = nlp(text)
# Extract the information from the named entities and token attributes
# (hard-coded token positions would break as soon as the text changes)
name = next((ent.text for ent in doc.ents if ent.label_ == 'PERSON'), None)
organization = next((ent.text for ent in doc.ents if ent.label_ == 'ORG'), None)
location = next((ent.text for ent in doc.ents if ent.label_ == 'GPE'), None)
email = next((token.text for token in doc if token.like_email), None)
print(f"Name: {name}")
print(f"Organization: {organization}")
print(f"Location: {location}")
print(f"Email: {email}")
# Output:
# Name: John Smith
# Organization: Google
# Location: Mountain View
# Email: john.smith@gmail.com
Text Summarization
Summarization is the process of generating a shortened version of a text that conveys its main points. Here’s an example of how to perform summarization using the sumy
library:
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
LANGUAGE = "english"
SENTENCES_COUNT = 10
if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
Text generation
Text generation is the process of automatically creating text using a given model. Here’s an example of how to perform text generation with GPT-2 using the transformers library:
import transformers
from transformers import pipeline
# Create generator
gpt2_generator = pipeline('text-generation', model='gpt2')
# Generator setup
sentences = gpt2_generator("As far as I see, the future", do_sample=True, top_k=50, temperature=0.6, max_length=128, num_return_sequences=3)
# Generate texts
for sentence in sentences:
    print(sentence["generated_text"])
    print("=" * 50)
Spelling correction
Spelling correction is the process of correcting spelling errors in a piece of text. Here’s an example of how to perform spelling correction using the autocorrect
library:
from autocorrect import Speller
# Text to correct
text = "I weint to the stoie to bguy some vebetagles."
# Create a spellchecker
spell = Speller()
# Correct the spelling
corrected = spell(text)
print(f"Corrected text: {corrected}")
# Output: Corrected text: I went to the store to buy some vegetables.
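autocorrect is not limited to English. As a hedged sketch, the lang parameter selects another word-frequency model, here Spanish, provided your installed version ships it:
from autocorrect import Speller
# 'es' selects the Spanish word-frequency model, if it is available in your installed version
spell_es = Speller(lang='es')
print(spell_es("Me gusta comer mansanas"))
# Expected output: Me gusta comer manzanas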
In a nutshell
The realm of Natural Language Processing offers a diverse array of tools and techniques that can significantly enhance our interaction with technology. From basic tokenization to advanced text generation and sentiment analysis, each NLP task enables us to bridge the gap between human language and machine understanding.
As you embark on experimenting with the provided sample codes, remember that the field of NLP is not just about coding, but also about understanding language in its many forms. Embrace these tools, and you may find that NLP not only makes technology more accessible, but also enriches your understanding of human language.
Whether you’re a linguist, a developer, or just a curious mind, the power of NLP opens up a world of possibilities that can transform both mundane and complex tasks into something more intuitive and engaging.