NLP - Pos and N-Gram Models
Slide 1
POS
Slide 2
An Example
Slide 3
Word Classes and Tag Sets
Tag sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
Closed:
Determiners, pronouns, prepositions, conjunctions …
Slide 4
Open Class Words
Slide 5
Closed Class Words
Slide 6
Prepositions from CELEX
Slide 7
Pronouns in CELEX
Slide 8
Conjunctions
Slide 9
Auxiliaries
Slide 10
Word Classes: Tag set example
PRP
PRP$
Slide 11
Example of Penn Treebank Tagging of Brown Corpus Sentence
VB    DT    NN      .
Book  that  flight  .

VBZ   DT    NN      VB     NN      ?
Does  that  flight  serve  dinner  ?
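A tagged sentence is simply a sequence of tokens paired one-to-one with tags. The sketch below stores the two example sentences above in that form and prints them in the common word/TAG notation. The helper names pair_tags and format_tagged are hypothetical and exist only for this illustration.

```python
# Words and tags copied from the two example sentences above.
tagged_sentences = [
    (["Book", "that", "flight", "."], ["VB", "DT", "NN", "."]),
    (["Does", "that", "flight", "serve", "dinner", "?"],
     ["VBZ", "DT", "NN", "VB", "NN", "?"]),
]


def pair_tags(words, tags):
    """Return a list of (word, tag) pairs; every token needs exactly one tag."""
    if len(words) != len(tags):
        raise ValueError("each token needs exactly one tag")
    return list(zip(words, tags))


def format_tagged(pairs):
    """Render (word, tag) pairs in word/TAG notation."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


tagged = [pair_tags(words, tags) for words, tags in tagged_sentences]
for pairs in tagged:
    print(format_tagged(pairs))
```

Note that "Book" is tagged VB here, while in other contexts the same word would be a noun; that ambiguity is what makes tagging non-trivial.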
Slide 12
The Problem
Slide 13
Part-of-Speech Tagging
Slide 14
Word prediction
The next word can often be predicted by simply looking at the preceding words and keeping track of some fairly simple counts.
We can formalize this task using what are called N-gram models.
N-grams are token sequences of length N.
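As a minimal sketch of these ideas, the code below extracts N-grams from a token list and uses bigram counts to guess the next word. The toy text and the function names ngrams and predict_next are assumptions made for this illustration; a real model would be trained on a large corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous token sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy text, chosen only for illustration.
tokens = "the cat sat on the mat and the cat slept".split()

# Count every bigram (N = 2) in the toy text.
bigram_counts = Counter(ngrams(tokens, 2))


def predict_next(prev_word):
    """Return the word seen most often after prev_word, or None if unseen."""
    followers = Counter({
        pair[1]: count
        for pair, count in bigram_counts.items()
        if pair[0] == prev_word
    })
    if not followers:
        return None
    return followers.most_common(1)[0][0]


print(ngrams(tokens, 3)[:2])  # first two trigrams
print(predict_next("the"))    # "cat" follows "the" twice, "mat" once
```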
Slide 15
N-Gram Models
Slide 16
Counting
Simple counting lies at the core of any probabilistic approach. So let’s first
take a look at what we’re counting.
He stepped out into the hall, was delighted to encounter a water brother.
15 tokens (counting the comma and the period as tokens; 13 otherwise)
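The count can be checked with a very small tokenizer. This is only a sketch: the regular expression below is an assumed, simplistic tokenization rule (runs of word characters, plus each punctuation mark on its own), not a standard one.

```python
import re

sentence = ("He stepped out into the hall, was delighted to encounter "
            "a water brother.")

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Keep only the tokens that consist entirely of word characters.
words_only = [t for t in tokens if re.fullmatch(r"\w+", t)]

print(len(tokens))      # 15 when punctuation counts
print(len(words_only))  # 13 when it does not
```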
Whether fillers such as “uh” should be counted as tokens depends on the application. If we’re focusing on something like ASR (automatic speech recognition)
to support indexing for search, then “uh” isn’t helpful (it’s not likely to occur as a
query). But filled pauses are very useful in dialog management, so we might want
them there.
Slide 17
Counting: Types and Tokens
How about
They picnicked by the pool, then lay back on the grass and looked at the
stars.
18 tokens (again counting punctuation)
But we might also note that “the” is used 3 times, so there are only 16
unique types (as opposed to tokens).
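The distinction between tokens and types can be sketched in a few lines. The tokenization rule is an assumption (word runs plus single punctuation marks), and case is deliberately preserved, so "They" and "the" count as different types.

```python
import re
from collections import Counter

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Tokens: every word and every punctuation mark, in order.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Types: the distinct tokens, with how often each one occurs.
type_counts = Counter(tokens)

print(len(tokens))         # 18 tokens
print(len(type_counts))    # 16 types
print(type_counts["the"])  # 3
```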
Slide 18
Counting: Corpora
Numbers
Misspellings
Names
Acronyms
etc
Slide 19
Language Modeling
Slide 20
Very Easy Estimate
How to estimate?
P(the | its water is so transparent that)
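One simple way to estimate this probability is by relative frequency: count how often the history is followed by the word, and divide by how often the history occurs at all. The sketch below does that on a tiny made-up corpus; the corpus text and the function names count_sequence and estimate are hypothetical, and real estimates need far more data.

```python
def count_sequence(tokens, seq):
    """Count how often seq occurs as a contiguous run inside tokens."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)


def estimate(tokens, history, word):
    """Relative-frequency estimate of P(word | history).

    Returns None when the history never occurs in the corpus.
    """
    history_count = count_sequence(tokens, history)
    if history_count == 0:
        return None
    return count_sequence(tokens, history + [word]) / history_count


# Hypothetical toy corpus, invented for this illustration only.
corpus = (
    "its water is so transparent that the fish are visible . "
    "its water is so transparent that you can see the bottom . "
    "its water is so transparent that the sand glows ."
).split()

history = "its water is so transparent that".split()
p = estimate(corpus, history, "the")
print(p)  # the history occurs 3 times and is followed by "the" twice
```

On real text, a history this long is rarely seen often enough to give reliable counts, which is why shorter N-gram histories are used in practice.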
Slide 21