1. Language Processing and Python (2024)

It is easy to get our hands on millions of words of text. What can we do with it, assuming we can write some simple programs? In this chapter we'll address the following questions:

  1. What can we achieve by combining simple programming techniques with large quantities of text?
  2. How can we automatically extract key words and phrases that sum up the style and content of a text?
  3. What tools and techniques does the Python programming language provide for such work?
  4. What are some of the interesting challenges of natural language processing?

This chapter is divided into sections that skip between two quite different styles. In the "computing with language" sections we will take on some linguistically motivated programming tasks without necessarily explaining how they work. In the "closer look at Python" sections we will systematically review key programming concepts. We'll flag the two styles in the section titles, but later chapters will mix both styles without being so up-front about it. We hope this style of introduction gives you an authentic taste of what will come later, while covering a range of elementary concepts in linguistics and computer science. If you have basic familiarity with both areas, you can skip to 5; we will repeat any important points in later chapters, and if you miss anything you can easily consult the online reference material at http://nltk.org/. If the material is completely new to you, this chapter will raise more questions than it answers, questions that are addressed in the rest of this book.

We're all very familiar with text, since we read and write it every day. Here we will treat text as raw data for the programs we write, programs that manipulate and analyze it in a variety of interesting ways. But before we can do this, we have to get started with the Python interpreter.

1.1 Getting Started with Python

One of the friendly things about Python is that it allows you to type directly into the interactive interpreter — the program that will be running your Python programs. You can access the Python interpreter using a simple graphical interface called the Interactive DeveLopment Environment (IDLE). On a Mac you can find this under Applications → MacPython, and on Windows under All Programs → Python. Under Unix you can run Python from the shell by typing idle (if this is not installed, try typing python). The interpreter will print a blurb about your Python version; simply check that you are running Python 3.2 or later (here it is for 3.4.2):

Python 3.4.2 (default, Oct 15 2014, 22:01:37)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>


If you are unable to run the Python interpreter, you probably don't have Python installed correctly. Please visit http://python.org/ for detailed instructions. NLTK 3.0 works for Python 2.6 and 2.7. If you are using one of these older versions, note that the / operator rounds fractional results downwards (so 1/3 will give you 0). In order to get the expected behavior of division you need to type: from __future__ import division
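If you are unsure which division behavior your interpreter uses, a quick check like the following sketch can help (in Python 3, / is always true division and // is floor division):

```python
# Python 3 division: / always yields a float ("true division"),
# while // floors the result (the behavior of / for ints in Python 2).
print(1 / 3)    # 0.3333333333333333
print(1 // 3)   # 0
```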

The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book, don't type the ">>>" yourself. Now, let's begin by using Python as a calculator:

>>> 1 + 5 * 2 - 3
8
>>>

Once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the Python interpreter is waiting for another instruction.


Your Turn: Enter a few more expressions of your own. You can use asterisk (*) for multiplication and slash (/) for division, and parentheses for bracketing expressions.

The preceding examples demonstrate how you can work interactively with the Python interpreter, experimenting with various expressions in the language to see what they do. Now let's try a nonsensical expression to see how the interpreter handles it:

>>> 1 +
  File "<stdin>", line 1
    1 +
      ^
SyntaxError: invalid syntax
>>>

This produced a syntax error. In Python, it doesn't make sense to end an instruction with a plus sign. The Python interpreter indicates the line where the problem occurred (line 1 of <stdin>, which stands for "standard input").

Now that we can use the Python interpreter, we're ready to start working with language data.

1.2 Getting Started with NLTK

Before going further you should install NLTK 3.0, downloadable for free from http://nltk.org/. Follow the instructions there to download the version required for your platform.

Once you've installed NLTK, start up the Python interpreter as before, and install the data required for the book by typing the following two commands at the Python prompt, then selecting the book collection as shown in 1.1.

>>> import nltk
>>> nltk.download()


Figure 1.1: Downloading the NLTK Book Collection: browse the available packages using nltk.download(). The Collections tab on the downloader shows how the packages are grouped into sets, and you should select the line labeled book to obtain all data required for the examples and exercises in this book. It consists of about 30 compressed files requiring about 100Mb disk space. The full collection of data (i.e., all in the downloader) is nearly ten times this size (at the time of writing) and continues to expand.

Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The first step is to type a special command at the Python prompt which tells the interpreter to load some texts for us to explore: from nltk.book import *. This says "from NLTK's book module, load all items." The book module contains all the data you will need as you read this chapter. After printing a welcome message, it loads the text of several books (this will take a few seconds). Here's the command again, together with the output that you will see. Take care to get spelling and punctuation right, and remember that you don't type the >>>.

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>

Any time we want to find out about these texts, we just have to enter their names at the Python prompt:

>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>

Now that we can use the Python interpreter, and have some data to work with, we're ready to get started.

1.3 Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>>

The first time you use a concordance on a particular text, it takes a few extra seconds to build an index so that subsequent searches are fast.


Your Turn: Try searching for other words; to save re-typing, you might be able to use up-arrow, Ctrl-up-arrow or Alt-p to access the previous command and modify the word being searched. You can also try searches on some of the other texts we have included. For example, search Sense and Sensibility for the word affection, using text2.concordance("affection"). Search the book of Genesis to find out how long some people lived, using text3.concordance("lived"). You could look at text4, the Inaugural Address Corpus, to see examples of English going back to 1789, and search for words like nation, terror, god to see how these words have been used differently over time. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!)

Once you've spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language. In the next chapter you will learn how to access a broader range of text, including text in languages other than English.

A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such as the ___ pictures and a ___ size. What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:

>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
>>>

Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.

The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We have to enclose these words by square brackets as well as parentheses, and separate them with a comma:

>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
>>>


Your Turn: Pick another pair of words and compare their usage in two different texts, using the similar() and common_contexts() functions.

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text. In 1.2 we see some striking patterns of word usage over the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You can produce this plot as shown below. You might like to try more words (e.g., liberty, constitution), and different texts. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets and parentheses exactly right.

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>


Figure 1.2: Lexical Dispersion Plot for Words in U.S. Presidential Inaugural Addresses: This can be used to investigate changes in language use over time.


Important: You need to have Python's NumPy and Matplotlib packages installed in order to produce the graphical plots used in this book. Please see http://nltk.org/ for installation instructions.


You can also plot the frequency of word usage through time using https://books.google.com/ngrams.

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)

>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
>>>


The generate() method is not available in NLTK 3.0 but will be reinstated in a subsequent version.

1.4 Counting Vocabulary

The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section we will see how to use the computer to count the words in a text in a variety of useful ways. As before, you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet. Test your understanding by modifying the examples, and trying the exercises at the end of the chapter.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We use the term len to get the length of something, which we'll apply here to the book of Genesis:

>>> len(text3)
44764
>>>

So Genesis has 44,764 words and punctuation symbols, or "tokens." A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of to, two of be, and one each of or and not. But there are only four distinct vocabulary items in this phrase. How many distinct words does the book of Genesis contain? To work this out in Python, we have to pose the question slightly differently. The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items of text3 with the command: set(text3). When you do this, many screens of words will fly past. Now try the following:
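The token/type distinction for the example phrase can be checked directly with plain Python lists, before applying the same idea to a full text:

```python
# Tokens vs. types for the phrase "to be or not to be".
phrase = ['to', 'be', 'or', 'not', 'to', 'be']
print(len(phrase))          # 6 tokens
print(len(set(phrase)))     # 4 distinct types
print(sorted(set(phrase)))  # ['be', 'not', 'or', 'to']
```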

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>

By wrapping sorted() around the Python expression set(text3), we obtain a sorted list of vocabulary items, beginning with various punctuation symbols and continuing with words starting with A. All capitalized words precede lowercase words. We discover the size of the vocabulary indirectly, by asking for the number of items in the set, and again we can use len to obtain this number. Although it has 44,764 tokens, this book has only 2,789 distinct words, or "word types." A word type is the form or spelling of the word independently of its specific occurrences in a text — that is, the word considered as a unique item of vocabulary. Our count of 2,789 items will include punctuation symbols, so we will generally call these unique items types instead of word types.

Now, let's calculate a measure of the lexical richness of the text. The next example shows us that the number of distinct words is just 6% of the total number of words, or equivalently that each word is used 16 times on average (remember if you're using Python 2, to start with from __future__ import division).

>>> len(set(text3)) / len(text3)
0.06230453042623537
>>>

Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:

>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>


Your Turn: How many times does the word lol appear in text5? How much is this as a percentage of the total number of words in this text?

You may want to repeat such calculations on several texts, but it is tedious to keep retyping the formula. Instead, you can come up with your own name for a task, like "lexical_diversity" or "percentage", and associate it with a block of code. Now you only have to type a short name instead of one or more complete lines of Python code, and you can re-use it as often as you like. The block of code that does a task for us is called a function, and we define a short name for our function with the keyword def. The next example shows how to define two new functions, lexical_diversity() and percentage():

>>> def lexical_diversity(text):
...     return len(set(text)) / len(text)
...
>>> def percentage(count, total):
...     return 100 * count / total
...


The Python interpreter changes the prompt from >>> to ... after encountering the colon at the end of the first line. The ... prompt indicates that Python expects an indented code block to appear next. It is up to you to do the indentation, by typing four spaces or hitting the tab key. To finish the indented block just enter a blank line.

In the definition of lexical_diversity(), we specify a parameter named text. This parameter is a "placeholder" for the actual text whose lexical diversity we want to compute, and reoccurs in the block of code that will run when the function is used. Similarly, percentage() is defined to take two parameters, named count and total.

Once Python knows that lexical_diversity() and percentage() are the names for specific blocks of code, we can go ahead and use these functions:

>>> lexical_diversity(text3)
0.06230453042623537
>>> lexical_diversity(text5)
0.13477005109975562
>>> percentage(4, 5)
80.0
>>> percentage(text4.count('a'), len(text4))
1.4643016433938312
>>>

To recap, we use or call a function such as lexical_diversity() by typing its name, followed by an open parenthesis, the name of the text, and then a close parenthesis. These parentheses will show up often; their role is to separate the name of a task — such as lexical_diversity() — from the data that the task is to be performed on — such as text3. The data value that we place in the parentheses when we call a function is an argument to the function.

You have already encountered several functions in this chapter, such as len(), set(), and sorted(). By convention, we will always add an empty pair of parentheses after a function name, as in len(), just to make clear that what we are talking about is a function rather than some other kind of Python expression. Functions are an important concept in programming, and we only mention them at the outset to give newcomers a sense of the power and creativity of programming. Don't worry if you find it a bit confusing right now.

Later we'll see how to use functions when tabulating data, as in 1.1. Each row of the table will involve the same computation but with different data, and we'll do this repetitive work using a function.

Table 1.1:

Lexical Diversity of Various Genres in the Brown Corpus

Genre                 Tokens    Types    Lexical diversity
skill and hobbies      82345    11935    0.145
fiction: science       14470     3233    0.223
press: reportage      100554    14394    0.143
fiction: romance       70022     8452    0.121
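A table like this is easy to produce once the computation lives in a function: apply the same function to each genre's word list. Here is a minimal sketch, using tiny made-up word lists rather than the actual Brown Corpus data:

```python
def lexical_diversity(text):
    """Proportion of distinct types among all tokens."""
    return len(set(text)) / len(text)

# Tiny made-up word lists standing in for Brown Corpus genres
# (illustrative only; not the real corpus data).
genres = {
    'hobbies': ['build', 'the', 'model', 'paint', 'the', 'model'],
    'romance': ['she', 'said', 'he', 'said', 'she', 'sighed'],
}
for name, words in genres.items():
    print(name, round(lexical_diversity(words), 3))
```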

You've seen some important elements of the Python programming language.Let's take a few moments to review them systematically.


What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here's how we represent text in Python, in this case the opening sentence of Moby Dick:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

After the prompt we've given a name we made up, sent1, followed by the equals sign, and then some quoted words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in Python: it is how we store a text. We can inspect it by typing the name. We can ask for its length. We can even apply our own lexical_diversity() function to it.

>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> lexical_diversity(sent1)
1.0
>>>

Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2 through sent9. We inspect two of them here; you can see the rest for yourself using the Python interpreter (if you get an error which says that sent2 is not defined, you need to first type from nltk.book import *).

>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']
>>>


Your Turn: Make up a few sentences of your own, by typing a name, equals sign, and a list of words, like this: ex1 = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']. Repeat some of the other Python operations we saw earlier in 1, e.g., sorted(ex1), len(set(ex1)), ex1.count('the').

A pleasant surprise is that we can use Python's addition operator on lists. Adding two lists creates a new list with everything from the first list, followed by everything from the second list:

>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>>


This special use of the addition operation is called concatenation; it combines the lists together into a single list. We can concatenate sentences to build up a text.

We don't have to literally type the lists either; we can use short names that refer to pre-defined lists.

>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>

What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list itself is updated as a result of the operation.

>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
>>>
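The difference between append() and concatenation is worth noting: append() updates the list in place, while + builds a brand-new list and leaves the originals untouched. A small sketch:

```python
# append() modifies the list in place; + builds a new list.
a = ['Call', 'me']
b = a + ['Ishmael']   # new list; a is unchanged
print(a)              # ['Call', 'me']
a.append('Ishmael')   # now a itself is updated
print(a == b)         # True
```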

2.2 Indexing Lists

As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count the occurrences in a text of a particular word — say, 'heaven' — using text1.count('heaven').

With some patience, we can pick out the 1st, 173rd, or even 14,278th word in a printed text. Analogously, we can identify the elements of a Python list by their order of occurrence in the list. The number that represents this position is the item's index. We instruct Python to show us the item that occurs at an index such as 173 in a text by writing the name of the text followed by the index inside square brackets:

>>> text4[173]
'awaken'
>>>

We can do the converse; given a word, find the index of when it first occurs:

>>> text4.index('awaken')
173
>>>
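Note that index() reports only the first occurrence of an item, which can be checked with a small list:

```python
# index() returns the position of the first matching element only.
words = ['to', 'be', 'or', 'not', 'to', 'be']
print(words.index('to'))   # 0, even though 'to' also occurs at index 4
print(words.index('not'))  # 3
```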

Indexes are a common way to access the words of a text, or, more generally, the elements of any list. Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing.

>>> text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
'buying', 'it']
>>> text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We',
'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive',
'officer', 'for', 'the', 'week']
>>>

Indexes have some subtleties, and we'll explore these with the help of an artificial sentence:

>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10']
>>> sent[0]
'word1'
>>> sent[9]
'word10'
>>>

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', whereas sent element 9 is 'word10'. The reason is simple: the moment Python accesses the content of a list from the computer's memory, it is already at the first element; we have to tell it how many elements forward to go. Thus, zero steps forward leaves it at the first element.


This practice of counting from zero is initially confusing, but typical of modern programming languages. You'll quickly get the hang of it if you've mastered the system of counting centuries where 19XY is a year in the 20th century, or if you live in a country where the floors of a building are numbered from 1, and so walking up n-1 flights of stairs takes you to level n.

Now, if we accidentally use an index that is too large, we get an error:

>>> sent[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>

This time it is not a syntax error, because the program fragment is syntactically correct. Instead, it is a runtime error, and it produces a Traceback message that shows the context of the error, followed by the name of the error, IndexError, and a brief explanation.

Let's take a closer look at slicing, using our artificial sentence again. Here we verify that the slice 5:8 includes sent elements at indexes 5, 6, and 7:

>>> sent[5:8]
['word6', 'word7', 'word8']
>>> sent[5]
'word6'
>>> sent[6]
'word7'
>>> sent[7]
'word8'
>>>

By convention, m:n means elements m…n-1. As the next example shows, we can omit the first number if the slice begins at the start of the list, and we can omit the second number if the slice goes to the end:

>>> sent[:3]
['word1', 'word2', 'word3']
>>> text2[141525:]
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne',
',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',',
'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of',
'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between',
'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
>>>

We can modify an element of a list by assigning to one of its index values. In the next example, we put sent[0] on the left of the equals sign. We can also replace an entire slice with new material. A consequence of this last change is that the list only has four elements, and accessing a later value generates an error.

>>> sent[0] = 'First'
>>> sent[9] = 'Last'
>>> len(sent)
10
>>> sent[1:9] = ['Second', 'Third']
>>> sent
['First', 'Second', 'Third', 'Last']
>>> sent[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>


Your Turn: Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier. Check your understanding by trying the exercises on lists at the end of this chapter.


From the start of 1, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves in the previous sections, e.g., defining a variable sent1, as follows:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process is called assignment. It does not generate any output; you have to type the variable on a line of its own to inspect its contents. The equals sign is slightly misleading, since information is moving from the right side to the left. It might help to think of it as a left-arrow. The name of the variable can be anything you like, e.g., my_sent, sentence, xyzzy. It must start with a letter, and can include numbers and underscores. Here are some examples of variables and assignments:

>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
>>>

Remember that capitalized words appear before lowercase words in sorted lists.


Notice in the previous example that we split the definition of my_sent over two lines. Python expressions can be split across multiple lines, so long as this happens within any kind of brackets. Python uses the "..." prompt to indicate that more input is expected. It doesn't matter how much indentation is used in these continuation lines, but some indentation usually makes them easier to read.

It is good to choose meaningful variable names to remind you — and to help anyone else who reads your Python code — what your code is meant to do. Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as one = 'two' or two = 3. The only restriction is that a variable name cannot be any of Python's reserved words, such as def, if, not, and import. If you use a reserved word, Python will produce a syntax error:

>>> not = 'Camelot'
  File "<stdin>", line 1
    not = 'Camelot'
        ^
SyntaxError: invalid syntax
>>>
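If you are ever unsure whether a name is reserved, the standard library's keyword module can tell you:

```python
# The keyword module lists Python's reserved words, so a name can be
# checked before it is used as a variable.
import keyword

print(keyword.iskeyword('not'))    # True: cannot be a variable name
print(keyword.iskeyword('nltk'))   # False: fine as a variable name
```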

We will often use variables to hold intermediate steps of a computation, especially when this makes the code easier to follow. Thus len(set(text1)) could also be written:

>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size
19317
>>>


Take care with your choice of names (or identifiers) for Python variables. First, you should start the name with a letter, optionally followed by digits (0 to 9) or letters. Thus, abc23 is fine, but 23abc will cause a syntax error. Names are case-sensitive, which means that myVar and myvar are distinct variables. Variable names cannot contain whitespace, but you can separate words using an underscore, e.g., my_var. Be careful not to insert a hyphen instead of an underscore: my-var is wrong, since Python interprets the "-" as a minus sign.


Some of the methods we used to access the elements of a list also work with individual words, or strings. For example, we can assign a string to a variable, index a string, and slice a string:

>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>>

We can also perform multiplication and addition with strings:

>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>>

We can join the words of a list to make a single string, or split a string into a list, as follows:

>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
>>>

We will come back to the topic of strings in 3. For the time being, we have two important building blocks — lists and strings — and are ready to get back to some language analysis.

Let's return to our exploration of the ways we can bring our computational resources to bear on large quantities of text. We began this discussion in 1, and saw how to search for words in context, how to compile the vocabulary of a text, how to generate random text in the same style, and so on.

In this section we pick up the question of what makes a text distinct, and use automatic methods to find characteristic words and expressions of a text. As in 1, you can try new features of the Python language by copying them into the interpreter, and you'll learn about these features systematically in the following section.

Before continuing further, you might like to check your understanding of the last section by predicting the output of the following code. You can use the interpreter to check whether you got it right. If you're not sure how to do this task, it would be a good idea to review the previous section before continuing further.

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done',
...           'more', 'is', 'said', 'than', 'done']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
what output do you expect here?
>>>

3.1 Frequency Distributions

How can we automatically identify the words of a text that are most informative about the topic and genre of the text? Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item, like that shown in 3.1. The tally would need thousands of rows, and it would be an exceedingly laborious process — so laborious that we would rather assign the task to a machine.


Figure 3.1: Counting Words Appearing in a Text (a frequency distribution)

The table in 3.1 is known as a frequency distribution, and it tells us the frequency of each vocabulary item in the text. (In general, it could count any kind of observable event.) It is a "distribution" because it tells us how the total number of word tokens in the text are distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to find the 50 most frequent words of Moby Dick:

>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
>>>

When we first invoke FreqDist, we pass the name of the text as an argument. We can inspect the total number of words ("outcomes") that have been counted up — 260,819 in the case of Moby Dick. The expression most_common(50) gives us a list of the 50 most frequently occurring types in the text.


Your Turn: Try the preceding frequency distribution example for yourself, for text2. Be careful to use the correct parentheses and uppercase letters. If you get an error message NameError: name 'FreqDist' is not defined, you need to start your work with from nltk.book import *

Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just English "plumbing." What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words, using fdist1.plot(50, cumulative=True), to produce the graph in 3.2. These 50 words account for nearly half the book!
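The cumulative proportion that the plot visualizes can also be computed directly. Here is a minimal sketch using Python's built-in collections.Counter in place of FreqDist, on an invented toy token list (the variable names and data are ours, not from NLTK):

```python
from collections import Counter

# Toy stand-in for a tokenized text (invented data)
tokens = "the whale the sea the whale and the ship".split()
fdist = Counter(tokens)

# Proportion of all tokens covered by the top 3 most common words
total = sum(fdist.values())
cumulative = 0.0
for word, count in fdist.most_common(3):
    cumulative += count / total

print(round(cumulative, 2))
```

On Moby Dick, the same loop over fdist1.most_common(50) reproduces the "nearly half the book" figure without drawing the plot.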


Figure 3.2: Cumulative Frequency Plot for the 50 Most Frequent Words in Moby Dick: these account for nearly half of the tokens.

If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? View them by typing fdist1.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others. It seems that there are too many rare words, and without seeing the context we probably can't guess what half of the hapaxes mean in any case! Since neither frequent nor infrequent words help, we need to try something else.
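Hapaxes are simply the samples whose count is one. A sketch of what fdist1.hapaxes() computes, again using collections.Counter and toy data of our own:

```python
from collections import Counter

# Toy token list; 'sea' and 'shore' each occur exactly once
tokens = "the whale the sea whale shore".split()
fdist = Counter(tokens)

# Hapaxes: vocabulary items with a count of exactly one
hapaxes = sorted(w for w, n in fdist.items() if n == 1)
print(hapaxes)
```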

3.2 Fine-grained Selection of Words

Next, let's look at the long words of a text; perhaps these will be more characteristic and informative. For this we adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are more than 15 characters long. Let's call this property P, so that P(w) is true if and only if w is more than 15 characters long. Now we can express the words of interest using mathematical set notation as shown in (1a). This means "the set of all w such that w is an element of V (the vocabulary) and w has property P".

a. {w | w ∈ V & P(w)}
b. [w for w in V if p(w)]

The corresponding Python expression is given in (1b). (Note that it produces a list, not a set, which means that duplicates are possible.) Observe how similar the two notations are. Let's go one more step and write executable Python code:

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>>

For each word w in the vocabulary V, we check whether len(w) is greater than 15; all other words will be ignored. We will discuss this syntax more carefully later.


Your Turn: Try out the previous statements in the Python interpreter, and experiment with changing the text and changing the length condition. Does it make a difference to your results if you change the variable names, e.g., using [word for word in vocab if ...]?

Let's return to our task of finding words that characterize a text. Notice that the long words in text4 reflect its national focus — constitutionally, transcontinental — whereas those in text5 reflect its informal content: boooooooooooglyyyyyy and yuuuuuuuuuuuummmmmmmmmmmm. Have we succeeded in automatically extracting words that typify a text? Well, these very long words are often hapaxes (i.e., unique) and perhaps it would be better to find frequently occurring long words. This seems promising since it eliminates frequent short words (e.g., the) and infrequent long words (e.g. antiphilosophists). Here are all words from the chat corpus that are longer than seven characters and occur more than seven times:

>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>

Notice how we have used two conditions: len(w) > 7 ensures that the words are longer than seven letters, and fdist5[w] > 7 ensures that these words occur more than seven times. At last we have managed to automatically identify the frequently-occurring content-bearing words of the text. It is a modest but important milestone: a tiny piece of code, processing tens of thousands of words, produces some informative output.

3.3 Collocations and Bigrams

A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds definitely odd.

To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():

>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>


If you omitted list() above, and just typed bigrams(['more', ...]), you would have seen output of the form <generator object bigrams at 0x10fb8b3a8>. This is Python's way of saying that it is ready to compute a sequence of items, in this case, bigrams. For now, you just need to know to tell Python to convert it into a list, using list().

Here we see that the pair of words than-done is a bigram, and we write it in Python as ('than', 'done'). Now, collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. The collocations() function does this for us. We will see how it works later.
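The idea of "more often than we would expect from the individual word frequencies" can be made concrete with an association score. NLTK's collocations() uses its own measure internally; as an illustration of the underlying idea, here is pointwise mutual information (PMI) computed with plain Counters on invented toy data:

```python
from collections import Counter
from math import log2

# Toy data: 'red' always precedes 'wine', while 'the wine' is incidental
tokens = "red wine and red wine but the wine the cheese".split()
unigrams = Counter(tokens)
pair_counts = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    # How much more frequent the pair is than the word frequencies predict
    p_pair = pair_counts[(w1, w2)] / (n - 1)
    return log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))
```

Here pmi('red', 'wine') comes out higher than pmi('the', 'wine'), matching the intuition that red wine is a collocation and the wine is not.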

>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
>>>

The collocations that emerge are very specific to the genre of the texts. In order to find red wine as a collocation, we would need to process a much larger body of text.

3.4 Counting Other Things

Counting words is useful, but we can count other things too. For example, we can look at the distribution of word lengths in a text, by creating a FreqDist out of a long list of numbers, where each number is the length of the corresponding word in the text:

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})
>>>

We start by deriving a list of the lengths of words in text1, and the FreqDist then counts the number of times each of these occurs. The result is a distribution containing a quarter of a million items, each of which is a number corresponding to a word token in the text. But there are at most only 20 distinct items being counted, the numbers 1 through 20, because there are only 20 different word lengths. I.e., there are words consisting of just one character, two characters, ..., twenty characters, but none with twenty-one or more characters. One might wonder how frequent the different lengths of word are (e.g., how many words of length four appear in the text, are there more words of length five than length four, etc.). We can find out as follows:

>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
>>>

From this we see that the most frequent word length is 3, and that words of length 3 account for roughly 50,000 (or 20%) of the words making up the book. Although we will not pursue it here, further analysis of word length might help us understand differences between authors, genres, or languages.

3.1 summarizes the functions defined in frequency distributions.

Table 3.1:

Functions Defined for NLTK's Frequency Distributions

fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist[sample] += 1             increment the count for this sample
fdist['monstrous']             count of the number of times a given sample occurred
fdist.freq('monstrous')        frequency of a given sample
fdist.N()                      total number of samples
fdist.most_common(n)           the n most common samples and their frequencies
for sample in fdist:           iterate over the samples
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   graphical plot of the frequency distribution
fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
fdist1 |= fdist2               update fdist1 with counts from fdist2
fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
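Several of these operations have direct parallels on Python's built-in collections.Counter, which NLTK 3's FreqDist builds on. A rough sketch of the analogues, using invented toy samples:

```python
from collections import Counter

fdist = Counter(['a', 'b', 'a', 'c', 'a'])   # like FreqDist(samples)
fdist['b'] += 1                              # increment the count for a sample

top = fdist.most_common(2)                   # the n most common samples
n_samples = sum(fdist.values())              # total samples, like fdist.N()
biggest = max(fdist, key=fdist.get)          # sample with the greatest count
freq_a = fdist['a'] / n_samples              # like fdist.freq('a')

print(top, n_samples, biggest, freq_a)
```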

Our discussion of frequency distributions has introduced some important Python concepts, and we will look at them systematically in 4.

4 Back to Python: Making Decisions and Taking Control

So far, our little programs have had some interesting qualities: the ability to work with language, and the potential to save human effort through automation. A key feature of programming is the ability of machines to make decisions on our behalf, executing instructions when certain conditions are met, or repeatedly looping through text data until some condition is satisfied. This feature is known as control, and is the focus of this section.


4.1 Conditionals

Python supports a wide range of operators, such as < and >=, for testing the relationship between values. The full set of these relational operators is shown in 4.1.

Table 4.1:

Numerical Comparison Operators

<     less than
<=    less than or equal to
==    equal to (note this is two "=" signs, not one)
!=    not equal to
>     greater than
>=    greater than or equal to

We can use these to select different words from a sentence of news text. Here are some examples — only the operator is changed from one line to the next. They all use sent7, the first sentence from text7 (Wall Street Journal). As before, if you get an error saying that sent7 is undefined, you need to first type: from nltk.book import *

>>> sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board',
'as', 'a', 'nonexecutive', 'director', '29', '.']
>>>

There is a common pattern to all of these examples: [w for w in text if condition], where condition is a Python "test" that yields either true or false. In the cases shown in the previous code example, the condition is always a numerical comparison. However, we can also test various properties of words, using the functions listed in 4.2.

Table 4.2:

Some Word Comparison Operators

s.startswith(t)    test if s starts with t
s.endswith(t)      test if s ends with t
t in s             test if t is a substring of s
s.islower()        test if s contains cased characters and all are lowercase
s.isupper()        test if s contains cased characters and all are uppercase
s.isalpha()        test if s is non-empty and all characters in s are alphabetic
s.isalnum()        test if s is non-empty and all characters in s are alphanumeric
s.isdigit()        test if s is non-empty and all characters in s are digits
s.istitle()        test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)

Here are some examples of these operators being used to select words from our texts: words ending with -ableness; words containing gnt; words having an initial capital; and words consisting entirely of digits.

>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]
>>> sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(item for item in set(text6) if item.istitle())
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]
>>> sorted(item for item in set(sent7) if item.isdigit())
['29', '61']
>>>

We can also create more complex conditions. If c is a condition, then not c is also a condition. If we have two conditions c1 and c2, then we can combine them to form a new condition using conjunction and disjunction: c1 and c2, c1 or c2.


Your Turn: Run the following examples and try to explain what is going on in each one. Next, try to make up some conditions of your own.

>>> sorted(w for w in set(text7) if '-' in w and 'index' in w)
>>> sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
>>> sorted(w for w in set(sent7) if not w.islower())
>>> sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)

4.2 Operating on Every Element

In 3, we saw some examples of counting items other than words. Let's take a closer look at the notation we used:

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
>>>

These expressions have the form [f(w) for ...] or [w.f() for ...], where f is a function that operates on a word to compute its length, or to convert it to uppercase. For now, you don't need to understand the difference between the notations f(w) and w.f(). Instead, simply learn this Python idiom which performs the same operation on every element of a list. In the preceding examples, it goes through each word in text1, assigning each one in turn to the variable w and performing the specified operation on the variable.


The notation just described is called a "list comprehension." This is our first example of a Python idiom, a fixed notation that we use habitually without bothering to analyze each time. Mastering such idioms is an important part of becoming a fluent Python programmer.

Let's return to the question of vocabulary size, and apply the same idiom here:

>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set(word.lower() for word in text1))
17231
>>>

Now that we are not double-counting words like This and this, which differ only in capitalization, we've wiped 2,000 off the vocabulary count! We can go a step further and eliminate numbers and punctuation from the vocabulary count by filtering out any non-alphabetic items:

>>> len(set(word.lower() for word in text1 if word.isalpha()))
16948
>>>

This example is slightly complicated: it lowercases all the purely alphabetic items. Perhaps it would have been simpler just to count the lowercase-only items, but this gives the wrong answer (why?).
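To see why, compare the two counts on a small example (the toy list here is our own, not from the book's texts). Counting only the items that are already lowercase silently drops words that never occur in lowercase form:

```python
# Toy token list, invented for illustration
text = ['This', 'is', 'a', 'test', '.', 'THIS', 'is', 'It', '!']

# What the expression above does: lowercase each alphabetic item first
right = len(set(w.lower() for w in text if w.isalpha()))

# The tempting shortcut: keep only already-lowercase items; this misses
# 'this' and 'it', which appear only in capitalized form
wrong = len(set(w for w in text if w.islower()))

print(right, wrong)
```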

Don't worry if you don't feel confident with list comprehensions yet, since you'll see many more examples along with explanations in the following chapters.

4.3 Nested Code Blocks

Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. We already saw examples of conditional tests in code like [w for w in sent7 if len(w) < 4]. In the following program, we have created a variable called word containing the string value 'cat'. The if statement checks whether the test len(word) < 5 is true. It is, so the body of the if statement is invoked and the print statement is executed, displaying a message to the user. Remember to indent the print statement by typing four spaces.

>>> word = 'cat'
>>> if len(word) < 5:
...     print('word length is less than 5')
...
word length is less than 5
>>>

When we use the Python interpreter we have to add an extra blank line in order for it to detect that the nested block is complete.


If you are using Python 2.6 or 2.7, you need to include the following line in order for the above print function to be recognized:

>>> from __future__ import print_function

If we change the conditional test to len(word) >= 5, to check that the length of word is greater than or equal to 5, then the test will no longer be true. This time, the body of the if statement will not be executed, and no message is shown to the user:

>>> if len(word) >= 5:
...     print('word length is greater than or equal to 5')
...
>>>

An if statement is known as a control structure because it controls whether the code in the indented block will be run. Another control structure is the for loop. Try the following, and remember to include the colon and the four spaces:

>>> for word in ['Call', 'me', 'Ishmael', '.']:
...     print(word)
...
Call
me
Ishmael
.
>>>

This is called a loop because Python executes the code in circular fashion. It starts by performing the assignment word = 'Call', effectively using the word variable to name the first item of the list. Then, it displays the value of word to the user. Next, it goes back to the for statement, and performs the assignment word = 'me', before displaying this new value to the user, and so on. It continues in this fashion until every item of the list has been processed.

4.4 Looping with Conditions

Now we can combine the if and for statements. We will loop over every item of the list, and print the item only if it ends with the letter l. We'll pick another name for the variable to demonstrate that Python doesn't try to make sense of variable names.

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print(xyzzy)
...
Call
Ishmael
>>>

You will notice that if and for statements have a colon at the end of the line, before the indentation begins. In fact, all Python control structures end with a colon. The colon indicates that the current statement relates to the indented block that follows.

We can also specify an action to be taken if the condition of the if statement is not met. Here we see the elif (else if) statement, and the else statement. Notice that these also have colons before the indented code.

>>> for token in sent1:
...     if token.islower():
...         print(token, 'is a lowercase word')
...     elif token.istitle():
...         print(token, 'is a titlecase word')
...     else:
...         print(token, 'is punctuation')
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
>>>

As you can see, even with this small amount of Python knowledge, you can start to build multiline Python programs. It's important to develop such programs in pieces, testing that each piece does what you expect before combining them into a program. This is why the Python interactive interpreter is so invaluable, and why you should get comfortable using it.

Finally, let's combine the idioms we've been exploring. First, we create a list of cie and cei words, then we loop over each item and print it. Notice the extra information given in the print statement: end=' '. This tells Python to print a space (not the default newline) after each word.

>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky:
...     print(word, end=' ')
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>

5 Automatic Natural Language Understanding

We have been exploring language bottom-up, with the help of texts and the Python programming language. However, we're also interested in exploiting our knowledge of language and computation by building useful language technologies. We'll take the opportunity now to step back from the nitty-gritty of code in order to paint a bigger picture of natural language processing.

At a purely practical level, we all need help to navigate the universe of information locked up in text on the Web. Search engines have been crucial to the growth and popularity of the Web, but have some shortcomings. It takes skill, knowledge, and some luck, to extract answers to such questions as: What tourist sites can I visit between Philadelphia and Pittsburgh on a limited budget? What do experts say about digital SLR cameras? What predictions about the steel market were made by credible commentators in the past week? Getting a computer to answer them automatically involves a range of language processing tasks, including information extraction, inference, and summarization, and would need to be carried out on a scale and with a level of robustness that is still beyond our current capabilities.

On a more philosophical level, a long-standing challenge within artificial intelligence has been to build intelligent machines, and a major part of intelligent behaviour is understanding language. For many years this goal has been seen as too difficult. However, as NLP technologies become more mature, and robust methods for analyzing unrestricted text become more widespread, the prospect of natural language understanding has re-emerged as a plausible goal.

In this section we describe some language understanding technologies, to give you a sense of the interesting challenges that are waiting for you.

5.1 Word Sense Disambiguation

In word sense disambiguation we want to work out which sense of a word was intended in a given context. Consider the ambiguous words serve and dish:

a. serve: help with food or drink; hold an office; put ball into play
b. dish: plate; course of a meal; communications device

In a sentence containing the phrase: he served the dish, you can detect that both serve and dish are being used with their food meanings. It's unlikely that the topic of discussion shifted from sports to crockery in the space of three words. This would force you to invent bizarre images, like a tennis pro taking out his or her frustrations on a china tea-set laid out beside the court. In other words, we automatically disambiguate words using context, exploiting the simple fact that nearby words have closely related meanings. As another example of this contextual effect, consider the word by, which has several meanings, e.g.: the book by Chesterton (agentive — Chesterton was the author of the book); the cup by the stove (locative — the stove is where the cup is); and submit by Friday (temporal — Friday is the time of the submitting). Observe in (3c) that the meaning of the italicized word helps us interpret the meaning of by.

a. The lost children were found by the searchers (agentive)
b. The lost children were found by the mountain (locative)
c. The lost children were found by the afternoon (temporal)
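The idea that nearby words disambiguate can be sketched in a few lines: score each sense of dish by how many of its indicative context words appear nearby, and pick the best-scoring sense. The sense descriptions and scoring below are our own toy illustration (a Lesk-style overlap heuristic), not an NLTK algorithm:

```python
# Each sense of "dish" is described by a few indicative context words
# (these word sets are invented for illustration)
senses = {
    'plate':   {'served', 'food', 'meal', 'kitchen'},
    'antenna': {'satellite', 'signal', 'receiver', 'broadcast'},
}

def disambiguate(context_words):
    # Choose the sense whose indicative words overlap most with the context
    return max(senses, key=lambda s: len(senses[s] & set(context_words)))

print(disambiguate('he served the dish at the meal'.split()))
```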

5.2 Pronoun Resolution

A deeper kind of language understanding is to work out "who did what to whom" — i.e., to detect the subjects and objects of verbs. You learnt to do this in elementary school, but it's harder than you might think. In the sentence the thieves stole the paintings it is easy to tell who performed the stealing action. Consider three possible following sentences in (4c), and try to determine what was sold, caught, and found (one case is ambiguous).

a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.

Answering this question involves finding the antecedent of the pronoun they, either thieves or paintings. Computational techniques for tackling this problem include anaphora resolution — identifying what a pronoun or noun phrase refers to — and semantic role labeling — identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on).

5.3 Generating Language Output

If we can automatically solve such problems of language understanding, we will be able to move on to tasks that involve generating language output, such as question answering and machine translation. In the first case, a machine should be able to answer a user's questions relating to a collection of texts:

a. Text: ... The thieves stole the paintings. They were subsequently sold. ...
b. Human: Who or what was sold?
c. Machine: The paintings.

The machine's answer demonstrates that it has correctly worked out that they refers to paintings and not to thieves. In the second case, the machine should be able to translate the text into another language, accurately conveying the meaning of the original text. In translating the example text into French, we are forced to choose the gender of the pronoun in the second sentence: ils (masculine) if the thieves are found, and elles (feminine) if the paintings are found. Correct translation actually depends on correct understanding of the pronoun.

a. The thieves stole the paintings. They were subsequently found.
b. Les voleurs ont volé les peintures. Ils ont été trouvés plus tard. (the thieves)
c. Les voleurs ont volé les peintures. Elles ont été trouvées plus tard. (the paintings)

In all of these examples, working out the sense of a word, the subject of a verb, and the antecedent of a pronoun are steps in establishing the meaning of a sentence, things we would expect a language understanding system to be able to do.

5.4 Machine Translation

For a long time now, machine translation (MT) has been the holy grail of language understanding, ultimately seeking to provide high-quality, idiomatic translation between any pair of languages. Its roots go back to the early days of the Cold War, when the promise of automatic translation led to substantial government sponsorship, and with it, the genesis of NLP itself.

Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings, which are starkly revealed by translating a sentence back and forth between a pair of languages until equilibrium is reached, e.g.:

0> how long before the next flight to Alice Springs?

1> wie lang vor dem folgenden Flug zu Alice Springs?

2> how long before the following flight to Alice jump?

3> wie lang vor dem folgenden Flug zu Alice springen Sie?

4> how long before the following flight to Alice do you jump?

5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?

6> how long, before the following flight to Alice does, do you jump?

7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?

8> how long before the following flight to Alice does, do you jump?

9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?

10> how long, before the following flight does to Alice, do do you jump?

11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?

12> how long before the following flight does leap to Alice, does you?

Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences become nonsensical (but notice the various phrasings indicated by the commas, and the change from jump to leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the grammatical structure.


Your Turn: Try this yourself using http://translationparty.com/

Machine translation is difficult because a given word could have several possible translations (depending on its meaning), and because word order must be changed in keeping with the grammatical structure of the target language. Today these difficulties are being faced by collecting massive quantities of parallel texts from news and government websites that publish documents in two or more languages. Given a document in German and English, and possibly a bilingual dictionary, we can automatically pair up the sentences, a process called text alignment. Once we have a million or more sentence pairs, we can detect corresponding words and phrases, and build a model that can be used for translating new text.
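The first step of "detecting corresponding words" can be sketched by counting how often word pairs co-occur across aligned sentences; pairs that co-occur consistently are translation candidates. The aligned pairs below are invented toy data, not output of a real alignment system:

```python
from collections import Counter

# Invented toy "aligned" sentence pairs (English, German)
pairs = [
    ('the house'.split(), 'das Haus'.split()),
    ('the book'.split(), 'das Buch'.split()),
    ('a book'.split(), 'ein Buch'.split()),
]

# Count how often each (English, German) word pair co-occurs in an
# aligned sentence pair; high counts suggest translation equivalents
cooc = Counter()
for en, de in pairs:
    for e in en:
        for d in de:
            cooc[(e, d)] += 1

print(cooc[('book', 'Buch')], cooc[('book', 'Haus')])
```

Real systems refine such raw co-occurrence counts with statistical alignment models, but the intuition is the same.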

5.5 Spoken Dialog Systems

In the history of artificial intelligence, the chief measure of intelligence has been a linguistic one, namely the Turing Test: can a dialogue system, responding to a user's text input, perform so naturally that we cannot distinguish it from a human-generated response? In contrast, today's commercial dialogue systems are very limited, but still perform useful functions in narrowly-defined domains, as we see here:

S: How may I help you?

U: When is Saving Private Ryan playing?

S: For what theater?

U: The Paramount theater.

S: Saving Private Ryan is not playing at the Paramount theater, but
it's playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.

You could not ask this system to provide driving instructions or details of nearby restaurants unless the required information had already been stored and suitable question-answer pairs had been incorporated into the language processing system.

Observe that this system seems to understand the user's goals: the user asks when a movie is showing and the system correctly determines from this that the user wants to see the movie. This inference seems so obvious that you probably didn't notice it was made, yet a natural language system needs to be endowed with this capability in order to interact naturally. Without it, when asked Do you know when Saving Private Ryan is playing?, a system might unhelpfully respond with a cold Yes. However, the developers of commercial dialogue systems use contextual assumptions and business logic to ensure that the different ways in which a user might express requests or provide information are handled in a way that makes sense for the particular application. So, if you type When is ..., or I want to know when ..., or Can you tell me when ..., simple rules will always yield screening times. This is enough for the system to provide a useful service.
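A minimal sketch of that rule idea: a handful of surface patterns (invented here for illustration) that all map to the same "give me screening times" request. Real systems use far richer matching, but the principle is this simple:

```python
# Surface patterns that all signal a request for screening times
# (invented examples, matching the phrasings in the text above)
patterns = ('when is', 'i want to know when', 'can you tell me when')

def wants_showtimes(utterance):
    # Case-insensitive prefix match against any of the known patterns;
    # str.startswith accepts a tuple of alternatives
    return utterance.lower().startswith(patterns)

print(wants_showtimes('When is Saving Private Ryan playing?'))
```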


Figure 5.1: Simple Pipeline Architecture for a Spoken Dialogue System: Spoken input (top left) is analyzed, words are recognized, sentences are parsed and interpreted in context, application-specific actions take place (top right); a response is planned, realized as a syntactic structure, then as suitably inflected words, and finally as spoken output; different types of linguistic knowledge inform each stage of the process.

Dialogue systems give us an opportunity to mention the commonly assumed pipeline for NLP. 5.1 shows the architecture of a simple dialogue system. Along the top of the diagram, moving from left to right, is a "pipeline" of some language understanding components. These map from speech input via syntactic parsing to some kind of meaning representation. Along the middle, moving from right to left, is the reverse pipeline of components for converting concepts to speech. These components make up the dynamic aspects of the system. At the bottom of the diagram are some representative bodies of static information: the repositories of language-related data that the processing components draw on to do their work.
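The pipeline idea can be rendered as a chain of functions, each handing its result to the next. The function names below are illustrative placeholders for the stages in Figure 5.1, not NLTK APIs, and every stage is stubbed out.

```python
def recognize(speech):
    """Speech recognition stage (stubbed as tokenization of a transcript)."""
    return speech.split()

def parse(words):
    """Syntactic parsing stage (stubbed)."""
    return {"words": words}

def interpret(tree):
    """Semantic interpretation stage (stubbed)."""
    return {"intent": "query", "content": tree["words"]}

def plan_response(meaning):
    """Response planning stage of the reverse pipeline (stubbed)."""
    return ["OK:"] + meaning["content"]

def realize(words):
    """Realization stage: inflected words rendered as output (stubbed)."""
    return " ".join(words)

def dialogue_turn(speech):
    # understanding runs left to right; generation runs right to left
    meaning = interpret(parse(recognize(speech)))
    return realize(plan_response(meaning))

print(dialogue_turn("when is it playing"))  # OK: when is it playing
```

The point of the sketch is the shape, not the contents: real systems replace each stub with a component that draws on the static language resources at the bottom of the diagram.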


Your Turn: For an example of a primitive dialogue system, try having a conversation with an NLTK chatbot. To see the available chatbots, run nltk.chat.chatbots(). (Remember to import nltk first.)

5.6 Textual Entailment

The challenge of language understanding has been brought into focus in recent years by a public "shared task" called Recognizing Textual Entailment (RTE). The basic scenario is simple. Suppose you want to find evidence to support the hypothesis: Sandra Goudie was defeated by Max Purnell, and that you have another short text that seems to be relevant, for example, Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly winning the seat of Coromandel by defeating Labour candidate Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into third place. Does the text provide enough evidence for you to accept the hypothesis? In this particular case, the answer will be "No." You can draw this conclusion easily, but it is very hard to come up with automated methods for making the right decision. The RTE Challenges provide data that allow competitors to develop their systems, but not enough data for "brute force" machine learning techniques (a topic we will cover in chap-data-intensive). Consequently, some linguistic analysis is crucial. In the previous example, it is important for the system to note that Sandra Goudie names the person being defeated in the hypothesis, not the person doing the defeating in the text. As another illustration of the difficulty of the task, consider the following text-hypothesis pair:

a.Text: David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles, sermons and books
b.Hypothesis: Golinkin has written eighteen books

In order to determine whether the hypothesis is supported by the text, the system needs the following background knowledge: (i) if someone is an author of a book, then he/she has written that book; (ii) if someone is an editor of a book, then he/she has not written (all of) that book; (iii) if someone is editor or author of eighteen books, then one cannot conclude that he/she is author of eighteen books.
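Rules (i)-(iii) can be hand-coded as a toy predicate. This is a sketch of the background knowledge only, not a real RTE system; the function and its arguments are invented for the illustration.

```python
def entails_written(role, book_count, hyp_count):
    """Does 'is <role> of <book_count> books' entail 'has written <hyp_count> books'?"""
    if book_count < hyp_count:
        return False
    if role == "author":
        return True   # (i) an author of a book has written that book
    # (ii) an editor has not written (all of) the books;
    # (iii) "editor or author" leaves the role unresolved, so no entailment
    return False

print(entails_written("author", 18, 18))            # True
print(entails_written("editor or author", 18, 18))  # False: the Golinkin case
```

Applied to the text-hypothesis pair above, the unresolved "editor or author" role is exactly what blocks the entailment, matching the human judgment.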

5.7 Limitations of NLP

Despite the research-led advances in tasks like RTE, natural language systems that have been deployed for real-world applications still cannot perform common-sense reasoning or draw on world knowledge in a general and robust manner. We can wait for these difficult artificial intelligence problems to be solved, but in the meantime it is necessary to live with some severe limitations on the reasoning and knowledge capabilities of natural language systems. Accordingly, right from the beginning, an important goal of NLP research has been to make progress on the difficult task of building technologies that "understand language," using superficial yet powerful techniques instead of unrestricted knowledge and reasoning capabilities. Indeed, this is one of the goals of this book, and we hope to equip you with the knowledge and skills to build useful NLP systems, and to contribute to the long-term aspiration of building intelligent machines.

  • Texts are represented in Python using lists: ['Monty', 'Python']. We can use indexing, slicing, and the len() function on lists.
  • A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the word as a particular sequence of letters. We count word tokens using len(text) and word types using len(set(text)).
  • We obtain the vocabulary of a text t using sorted(set(t)).
  • We operate on each item of a text using [f(x) for x in text].
  • To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write set(w.lower() for w in text if w.isalpha()).
  • We process each word in a text using a for statement, such as for w in t: or for word in text:. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
  • We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
  • A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
  • A function is a block of code that has been assigned a name and can be reused. Functions are defined using the def keyword, as in def mult(x, y); x and y are parameters of the function, and act as placeholders for actual data values.
  • A function is called by specifying its name followed by zero or more arguments inside parentheses, like this: texts(), mult(3, 4), len(text1).
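The points above can be seen in one short runnable fragment. Here collections.Counter stands in for NLTK's FreqDist (the two have similar counting behavior), so the sketch runs without NLTK installed; the sample text is invented.

```python
from collections import Counter

text = ["Monty", "Python", "and", "the", "Holy", "Grail", "and", "the", "Knights"]

print(len(text))        # word tokens: 9
print(len(set(text)))   # word types: 7
print(sorted(set(w.lower() for w in text if w.isalpha())))  # normalized vocabulary

short = []
for w in text:          # a for statement...
    if len(w) < 5:      # ...with an if test in its indented block
        short.append(w)
print(short)

fdist = Counter(text)   # a frequency distribution (FreqDist stand-in)
print(fdist.most_common(2))

def mult(x, y):         # x and y are parameters of the function
    return x * y

print(mult(3, 4))       # calling the function with two arguments → 12
```

Every construct in the summary list appears here at least once, which makes the fragment a handy self-check before moving on.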

This chapter has introduced new concepts in programming, natural language processing, and linguistics, all mixed in together. Many of them are consolidated in the following chapters. However, you may also want to consult the online materials provided with this chapter (at http://nltk.org/), including links to additional background materials, and links to online NLP systems. You may also like to read up on some linguistics and NLP-related concepts in Wikipedia (e.g., collocations, the Turing Test, the type-token distinction).

You should acquaint yourself with the Python documentation available at http://docs.python.org/, including the many tutorials and comprehensive reference materials linked there. A Beginner's Guide to Python is available at http://wiki.python.org/moin/BeginnersGuide. Miscellaneous questions about Python might be answered in the FAQ at http://python.org/doc/faq/general/.

As you delve into NLTK, you might want to subscribe to the mailing list where new releases of the toolkit are announced. There is also an NLTK-Users mailing list, where users help each other as they learn how to use Python and NLTK for language analysis work. Details of these lists are available at http://nltk.org/.

For more information on the topics covered in 5, and on NLP more generally, you might like to consult one of the following excellent books:

  • Indurkhya, Nitin and Fred Damerau (eds) (2010) Handbook of Natural Language Processing (Second Edition). Chapman & Hall/CRC.
  • Jurafsky, Daniel and James Martin (2008) Speech and Language Processing (Second Edition). Prentice Hall.
  • Mitkov, Ruslan (ed) (2003) The Oxford Handbook of Computational Linguistics. Oxford University Press. (Second edition expected in 2010.)

The Association for Computational Linguistics is the international organization that represents the field of NLP. The ACL website (http://www.aclweb.org/) hosts many useful resources, including: information about international and regional conferences and workshops; the ACL Wiki with links to hundreds of useful resources; and the ACL Anthology, which contains most of the NLP research literature from the past 50+ years, fully indexed and freely downloadable.

Some excellent introductory linguistics textbooks are: (Finegan, 2007), (O'Grady et al., 2004), (OSU, 2007). You might like to consult LanguageLog, a popular linguistics blog with occasional posts that use the techniques described in this book.

  1. ☼ Try using the Python interpreter as a calculator, and typing expressions like 12 / (4 + 1).

  2. ☼ Given an alphabet of 26 letters, there are 26 to the power 10, or 26 ** 10, ten-letter strings we can form. That works out to 141167095653376. How many hundred-letter strings are possible?

  3. ☼ The Python multiplication operation can be applied to lists. What happens when you type ['Monty', 'Python'] * 20, or 3 * sent1?

  4. ☼ Review 1 on computing with language. How many words are there in text2? How many distinct words are there?

  5. ☼ Compare the lexical diversity scores for humor and romance fiction in 1.1. Which genre is more lexically diverse?

  6. ☼ Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne, Edward, and Willoughby. What can you observe about the different roles played by the males and females in this novel? Can you identify the couples?

  7. ☼ Find the collocations in text5.

  8. ☼ Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.

  9. ☼ Review 2 on lists and strings.

    1. Define a string and assign it to a variable, e.g., my_string = 'My String' (but put something more interesting in the string). Print the contents of this variable in two ways, first by simply typing the variable name and pressing enter, then by using the print function.
    2. Try adding the string to itself using my_string + my_string, or multiplying it by a number, e.g., my_string * 3. Notice that the strings are joined together without any spaces. How could you fix this?
  10. ☼ Define a variable my_sent to be a list of words, using the syntax my_sent = ["My", "sent"] (but with your own words, or a favorite saying).

    1. Use ' '.join(my_sent) to convert this into a string.
    2. Use split() to split the string back into the list form you had to start with.
  11. ☼ Define several variables containing lists of words, e.g., phrase1, phrase2, and so on. Join them together in various combinations (using the plus operator) to form whole sentences. What is the relationship between len(phrase1 + phrase2) and len(phrase1) + len(phrase2)?

  12. ☼ Consider the following two expressions, which have the same value. Which one will typically be more relevant in NLP? Why?

    1. "Monty Python"[6:12]
    2. ["Monty", "Python"][1]
  13. ☼ We have seen how to represent a sentence as a list of words, where each word is a sequence of characters. What does sent1[2][2] do? Why? Experiment with other index values.

  14. ☼ The first sentence of text3 is provided to you in the variable sent3. The index of the in sent3 is 1, because sent3[1] gives us 'the'. What are the indexes of the two other occurrences of this word in sent3?

  15. ☼ Review the discussion of conditionals in 4. Find all words in the Chat Corpus (text5) starting with the letter b. Show them in alphabetical order.

  16. ☼ Type the expression list(range(10)) at the interpreter prompt. Now try list(range(10, 20)), list(range(10, 20, 2)), and list(range(20, 10, -2)). We will see a variety of uses for this built-in function in later chapters.

  17. ◑ Use text9.index() to find the index of the word sunset. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.

  18. ◑ Using list addition, and the set and sorted operations, compute the vocabulary of the sentences sent1 ... sent8.

  19. ◑ What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?

    >>> sorted(set(w.lower() for w in text1))
    >>> sorted(w.lower() for w in set(text1))
  20. ◑ What is the difference between the following two tests: w.isupper() and not w.islower()?

  21. ◑ Write the slice expression that extracts the last two words of text2.

  22. ◑ Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution (FreqDist), show these words in decreasing order of frequency.

  23. ◑ Review the discussion of looping with conditions in 4. Use a combination of for and if statements to loop over the words of the movie script for Monty Python and the Holy Grail (text6) and print all the uppercase words, one per line.

  24. ◑ Write expressions for finding all words in text6 that meet the conditions listed below. The result should be in the form of a list of words: ['word1', 'word2', ...].

    1. Ending in ise
    2. Containing the letter z
    3. Containing the sequence of letters pt
    4. Having all lowercase letters except for an initial capital (i.e., titlecase)
  25. ◑ Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. Now write code to perform the following tasks:

    1. Print all words beginning with sh
    2. Print all words longer than four characters
  26. ◑ What does the following Python code do? sum(len(w) for w in text1) Can you use it to work out the average word length of a text?

  27. ◑ Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text.

  28. ◑ Define a function percent(word, text) that calculates how often a given word occurs in a text, and expresses the result as a percentage.

  29. ◑ We have been using sets to store vocabularies. Try the following Python expression: set(sent3) < set(text1). Experiment with this using different arguments to set(). What does it do? Can you think of a practical application for this?

About this document...

UPDATED FOR NLTK 3.0. This is a chapter from Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2019 the authors. It is distributed with the Natural Language Toolkit [http://nltk.org/], Version 3.0, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License [http://creativecommons.org/licenses/by-nc-nd/3.0/us/].

This document was built on Wed 4 Sep 2019 11:40:48 ACST.

