NLP - POS and N-Gram Models


Outline

Why part of speech tagging?
Word classes
Tag sets and problem definition
N-Grams

Slide 1
POS

“The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin)
WORDS: the, mother, kissed, the, child, on, the, cheek
TAGS: N, V, P, DET

Slide 2
An Example

WORD     LEMMA    TAG
the      the      DET
mother   mother   NOUN
kissed   kiss     VPAST
the      the      DET
child    child    NOUN
on       on       PREP
the      the      DET
cheek    cheek    NOUN
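
One way to recover lemmas like these programmatically is NLTK's WordNetLemmatizer; a minimal sketch (assuming nltk is installed and its 'wordnet' data has been downloaded):

    # Minimal sketch: lemma lookup with NLTK's WordNetLemmatizer.
    # Assumes nltk is installed and the 'wordnet' data is downloaded.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # The POS hint ('v' for verb) is what maps "kissed" to "kiss".
    print(lemmatizer.lemmatize("kissed", pos="v"))  # kiss
    print(lemmatizer.lemmatize("mother", pos="n"))  # mother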

Slide 3
Word Classes and Tag Sets

Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …

Tag sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and
purpose

Open vs. Closed classes

Open: Nouns, Verbs, Adjectives, Adverbs
Closed: Determiners, pronouns, prepositions, conjunctions, …

Slide 4
Open Class Words

Every known human language has nouns and verbs.

Nouns: people, places, things
  Classes of nouns: proper vs. common; count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: Unfortunately, John walked home extremely slowly yesterday
Numerals: one, two, three, third, …

Slide 5
Closed Class Words

Closed class words differ more from language to language, and over time, than open class words.

Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …

Slide 6
Prepositions from CELEX

Slide 7
Pronouns in CELEX

Slide 8
Conjunctions

Slide 9
Auxiliaries

Slide 10
Word Classes: Tag set example

PRP   personal pronoun (I, you, he, …)
PRP$  possessive pronoun (my, your, his, …)

Slide 11
Example of Penn Treebank Tagging of
Brown Corpus Sentence

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB DT NN .
Book that flight .

VBZ DT NN VB NN ?
Does that flight serve dinner ?
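
Reading this word/TAG format back into (word, tag) pairs is straightforward; a small sketch (splitting on the last "/" keeps the final "./." token intact):

    # Split Penn-Treebank-style "word/TAG" tokens into (word, tag) pairs.
    sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
                "number/NN of/IN other/JJ topics/NNS ./.")
    pairs = [tok.rsplit("/", 1) for tok in sentence.split()]
    for word, tag in pairs:
        print(word, tag)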

Slide 12
The Problem

Words often have more than one word class. Consider “this”:

This is a nice day = PRP
This day is nice = DT
You can go this far = RB
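
One quick way to see this ambiguity in practice is an off-the-shelf tagger such as NLTK's pos_tag; a sketch (assuming nltk plus its tokenizer and tagger data are installed; the tags it assigns may differ from the slide's):

    # Tag the three example sentences with NLTK's default tagger.
    # Assumes nltk and its 'punkt' / 'averaged_perceptron_tagger'
    # data are installed; output tags may differ from the slide's.
    import nltk

    for sent in ["This is a nice day",
                 "This day is nice",
                 "You can go this far"]:
        print(nltk.pos_tag(nltk.word_tokenize(sent)))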

Slide 13
Part-of-Speech Tagging

• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger (Brill)

Slide 14
Word prediction

Guess the next word...

... I notice three guys standing on the ???

- by simply looking at the preceding words and keeping track of some fairly simple counts.

We can formalize this task using what are called N-gram models.
N-grams are token sequences of length N.

The above example contains the following 2-grams (bigrams):

(I notice), (notice three), (three guys), (guys standing), (standing on), (on the)
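
A minimal sketch of pulling those bigrams out of the token sequence:

    # Extract bigrams (2-grams) by pairing each token with its successor.
    tokens = "I notice three guys standing on the".split()
    bigrams = list(zip(tokens, tokens[1:]))
    print(bigrams)
    # [('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
    #  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]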

Slide 15
N-Gram Models

More formally, we can use knowledge of the counts of N-grams to assess the conditional probability of candidate words as the next word in a sequence.

Or, we can use them to assess the probability of an entire sequence of words.
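
As an illustration of the first use, here is a hedged sketch that ranks candidate next words by bigram counts; the toy corpus is invented purely for the example:

    # Rank candidates for the word after "the" by bigram count.
    # The corpus below is made up for illustration only.
    from collections import Counter

    corpus = ("i notice three guys standing on the corner . "
              "they sat on the grass . he stood on the grass .").split()

    bigram_counts = Counter(zip(corpus, corpus[1:]))
    candidates = {w2: c for (w1, w2), c in bigram_counts.items()
                  if w1 == "the"}
    print(sorted(candidates.items(), key=lambda kv: -kv[1]))
    # [('grass', 2), ('corner', 1)]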

Slide 16
Counting

Simple counting lies at the core of any probabilistic approach. So let’s first take a look at what we’re counting.

He stepped out into the hall, was delighted to encounter a water brother.
15 tokens (counting punctuation)

Not always that simple:

I do uh main- mainly business data processing

Spoken language poses various challenges.
Should we count “uh” and other fillers as tokens?
What about the repetition of “mainly”?

The answers depend on the application. If we’re focusing on something like ASR to support indexing for search, then “uh” isn’t helpful (it’s not likely to occur as a query). But filled pauses are very useful in dialog management, so we might want them there.

Slide 17
Counting: Types and Tokens

How about
They picnicked by the pool, then lay back on the grass and looked at the
stars.
18 tokens (again counting punctuation)

But we might also note that “the” is used 3 times, so there are only 16
unique types (as opposed to tokens).
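
The same count as a quick sketch (punctuation split off as its own token, as in the slide):

    # Tokens vs. types: len counts tokens, set collapses duplicates.
    tokens = ("They picnicked by the pool , then lay back on the grass "
              "and looked at the stars .").split()
    print(len(tokens))       # 18 tokens
    print(len(set(tokens)))  # 16 types ("the" occurs 3 times)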

Slide 18
Counting: Corpora

So what happens when we look at large bodies of text instead of single utterances?

Brown et al. (1992): large corpus of English text
583 million wordform tokens
293,181 wordform types

Google
Crawl of 1,024,908,267,229 English tokens
13,588,391 wordform types

That seems like a lot of types... After all, even large dictionaries of English have only around 500k types. Why so many here?

Numbers
Misspellings
Names
Acronyms
etc.
Slide 19
Language Modeling

Back to word prediction.

We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence:

P(wn | w1, w2, …, wn-1)

Calculating the conditional probability:

One way is to use the definition of conditional probabilities and look for counts. So to get

P(the | its water is so transparent that)

by definition that’s

P(its water is so transparent that the) / P(its water is so transparent that)

We can get each of those from counts in a large corpus.
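
A hedged sketch of that definition as code, using sub-sequence counts over a toy corpus (both the corpus and the helper names here are invented for illustration):

    # Estimate P(word | history) = Count(history + word) / Count(history).
    def count_seq(tokens, seq):
        """Count occurrences of the sub-sequence seq in tokens."""
        n = len(seq)
        return sum(tokens[i:i + n] == seq
                   for i in range(len(tokens) - n + 1))

    def cond_prob(tokens, history, word):
        return count_seq(tokens, history + [word]) / count_seq(tokens, history)

    corpus = "its water is so transparent that the fish can see".split()
    print(cond_prob(corpus,
                    "its water is so transparent that".split(), "the"))
    # 1.0: the prefix occurs once, and is always followed by "the"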

Slide 20
Very Easy Estimate

How to estimate P(the | its water is so transparent that)?

P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)

According to Google those counts are 5/9.

But 2 of those were to these slides... So maybe it’s really 3/7
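
For the record, the arithmetic behind the slide's two estimates:

    # Raw count ratio, then the corrected ratio after dropping the
    # 2 hits that were these slides themselves.
    print(5 / 9)              # ~0.556
    print((5 - 2) / (9 - 2))  # ~0.429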

Slide 21
