Introduction

In the last module we worked with images. In this module we pivot to text as our input data, drawing on an open-access repository of full-text electronic books. We will learn to extract structure from text documents (splitting them into pieces such as chapters, paragraphs, sentences, and words) and to clean the text: removing punctuation, altering character case, skipping non-informative stop words, and reducing plural forms to their singular counterparts (a simple case of stemming).
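To make the cleaning steps concrete, here is a minimal sketch in Python. The stop-word set, the sample sentence, and the helper names (naive_singular, preprocess) are assumptions made for illustration only, and the plural rule is deliberately crude; it is not a real stemmer.

```python
import string

# Hypothetical, deliberately tiny stop-word set; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "are"}


def naive_singular(word):
    """Strip a trailing 's' from longer words (a crude stand-in for stemming)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Return the cleaned, lower-cased, stop-word-free words of a text."""
    # Remove ASCII punctuation, then normalise the character case.
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.translate(table).lower()
    # Split on whitespace, drop stop words, and reduce simple plurals.
    words = cleaned.split()
    return [naive_singular(w) for w in words if w not in STOP_WORDS]


sample = "The cats and the dogs are sleeping on the mats."
print(preprocess(sample))  # ['cat', 'dog', 'sleeping', 'mat']
```

A rule this simple will mishandle many words (for example, it turns "this" into "thi"), which is one reason stemming is usually delegated to a dedicated library.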

Learning outcomes

This module will help you do the following:

Load electronic books as input to a program
Decode the raw bytes of a book into text using the desired encoding (UTF-8)
Search for substrings within text
Search and replace characters within text
Split text into a list of words
Alter the character case (upper, lower)
Filter a list with a yes/no subroutine
Discard stop words
Calculate the frequencies of appearance of values in a list
Draw a word cloud using word frequencies
Create a directed graph of words based on their sequence within a text
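Several of these outcomes can be previewed with the Python standard library alone. The sketch below is only an illustration under stated assumptions: the byte string stands in for a downloaded book, the one-word stop-word set is hypothetical, and the graph is stored as a plain dictionary of successor sets rather than with a graph library.

```python
from collections import Counter, defaultdict

# Stand-in for the bytes of a downloaded book; in practice these would come
# from a file or a web request (not shown here).
raw = "The quick fox saw the lazy dog. The dog saw the fox.".encode("utf-8")

# Decode the bytes into a string using UTF-8.
text = raw.decode("utf-8")

# Search for a substring: find() returns the index of the first match or -1.
position = text.find("lazy")

# Search and replace characters, alter the case, and split into words.
words = text.replace(".", "").lower().split()

# Filter the list with a yes/no function (hypothetical stop-word set).
stop_words = {"the"}


def keep(word):
    return word not in stop_words


kept = list(filter(keep, words))

# Count how often each remaining word appears.
frequencies = Counter(kept)

# Directed graph: an edge from each word to the word that follows it.
graph = defaultdict(set)
for current, following in zip(kept, kept[1:]):
    graph[current].add(following)

print(frequencies.most_common(3))
print(dict(graph))
```

The frequencies mapping is the kind of input a word-cloud drawing tool typically expects, and the successor sets can be handed to a graph library for drawing.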

Warm-up

Warm-up assessment

Based on the warm-up video, make a list of about a dozen applications you can envision for NLP but have not yet encountered in existence. Then, for each, assess whether you think it will be easy, moderate, or hard to implement. Also assess whether it would be a wholesome, positive thing to have, a neutral development of technology, or potentially harmful if used for dishonest or discriminatory purposes.

Concepts

After this module, you should be familiar with the following concepts:

Repository
Preprocessing
Regular expression
Stop word
Stemming
Frequency
Word cloud

Remember that you can always look concepts up in the glossary. Should anything be missing or insufficient, please report it.