Natural Language Processing Knowledge
2016-08-15 09:33
Natural Language Processing
[Based on the Stanford NLP course]
Preface
Language Technology
Mostly solved:
Spam detection
Part-of-speech (POS) tagging
Named Entity Recognition (NER)
Making good progress
Sentiment analysis
Co-reference resolution
Word sense disambiguation (WSD)
Parsing
Machine translation (MT)
Information Extraction (IE)
Still really hard
Question answering (QA)
Paraphrase
Summarization
Dialog
Why is NLP difficult?
Non-standard language
Segmentation issues
idioms
neologisms
world knowledge
tricky entity names
Basic skills
Regular Expressions
Tokenization
Word Normalization and Stemming
Classifier
Decision Tree
Logistic Regression
SVM
Neural Nets
Edit Distance
Used for:
Spell correction
Computational Biology
Basic operations
Insertion
Deletion
Substitution
Algorithm
Levenshtein
Back trace
Needleman-Wunsch
Smith-Waterman
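Not in the original notes, but a minimal dynamic-programming sketch of Levenshtein distance may make the basic operations concrete (unit cost for insertion, deletion and substitution; some formulations charge 2 for substitution):

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target` (each operation costs 1)."""
    m, n = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all remaining source characters
    for j in range(n + 1):
        dp[0][j] = j          # insert all remaining target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

print(levenshtein("intention", "execution"))  # 5 with unit substitution cost
```

Keeping back-pointers alongside `dp` recovers the alignment (the back trace mentioned above); Needleman-Wunsch and Smith-Waterman use the same table with different scoring.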
Language Model
Probabilistic Language Models
Machine translation
Spell correction
Speech Recognition
Summarization
Question answering
Markov Assumption
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i | \omega_{i-k} \dots \omega_{i-1}) $$
Unigram Model
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i) $$
Bigram Model
$$ P(\omega_i | \omega_1 \omega_2 \dots \omega_{i-1}) \approx P(\omega_i | \omega_{i-1}) $$
Add-k Smoothing
$$ P_{Add-k}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+k}{c(\omega_{i-1})+kV} $$
Unigram prior smoothing
$$ P_{Add-k}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+m(\tfrac{1}{V})}{c(\omega_{i-1})+m} $$
Smoothing Algorithms
Good-Turing
Kneser-Ney
Witten-Bell
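A minimal sketch of estimating bigram probabilities with add-k smoothing, following the $P_{Add-k}$ formula above; the toy corpus and the value of k are illustrative choices:

```python
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_addk(w_prev, w, unigrams, bigrams, k=0.5):
    """Add-k smoothed bigram probability P(w | w_prev)."""
    V = len(unigrams)  # vocabulary size
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
uni, bi = train_bigram(corpus)
print(p_addk("I", "am", uni, bi))   # smoothed P(am | I)
```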
Spelling Correction
Tasks:
Spelling error detection
Spelling error correction
Autocorrect
Suggest a correction
Suggestion lists
Real word spelling errors:
For each word $w$, generate candidate set
Choose best candidate
Find the correct word $w$
Candidate generation
words with similar spelling
words with similar pronunciation
Factors that could influence p(misspelling | word)
source letter
target letter
surrounding letter
the position in the word
nearby keys on the keyboard
homology on the keyboard
pronunciation
likely morpheme transformations
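A minimal candidate-generation sketch in the noisy-channel spirit: enumerate every string within one edit of the observed word and keep those in a vocabulary. The toy vocabulary is made up; a real corrector would then score each candidate with a channel model p(misspelling | word) built from the factors above, times a language-model prior:

```python
import string

def edits1(word):
    """All strings one insertion, deletion, substitution or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

vocab = {"acres", "actress", "across", "access", "caress"}  # toy vocabulary
print(edits1("acress") & vocab)   # candidate corrections for "acress"
```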
Text Classification
Used for:
Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
…
Methods: Supervised Machine Learning
Naive Bayes
Logistic Regression
Support Vector Machines
k-Nearest Neighbors
Naive Bayes
$$ C_{MAP}=\arg\max_{c\in C}P(x_1,x_2,\dots,x_n|c)P(c) $$
Laplace (add-1) Smoothing (see the sketch below)
Used for Spam Filtering
Training data:
No training data: manually written rules
Very little data:
Use Naive Bayes
Get more labeled data
Try semi-supervised training methods
A reasonable amount of data:
All clever Classifiers:
SVM
Regularized Logistic Regression
User-interpretable decision trees
A huge amount of data:
At a cost
SVM (train time)
kNN (test time)
Regularized Logistic Regression can be somewhat better
Naive Bayes
Tweak performance
Domain-specific
Collapse terms
Upweighting some words
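A minimal sketch of multinomial Naive Bayes with Laplace (add-1) smoothing, illustrating the $C_{MAP}$ rule above; the two-document spam/ham training set is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (class_label, list_of_tokens)."""
    class_counts = Counter(c for c, _ in docs)
    word_counts = defaultdict(Counter)
    for c, tokens in docs:
        word_counts[c].update(tokens)
    vocab = {w for _, tokens in docs for w in tokens}
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    n_docs = sum(class_counts.values())
    best_class, best_logp = None, float("-inf")
    for c in class_counts:
        logp = math.log(class_counts[c] / n_docs)   # log prior P(c)
        total = sum(word_counts[c].values())
        for w in tokens:
            if w not in vocab:
                continue                            # ignore unseen words
            # Laplace (add-1) smoothed likelihood P(w | c)
            logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

train = [("spam", "win cash now".split()), ("ham", "meeting at noon".split())]
model = train_nb(train)
print(classify("cash prize now".split(), *model))   # -> 'spam'
```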
F Measure
Precision: % of selected items that are correct
Recall: % of correct items that are selected

|              | correct | not correct |
|--------------|---------|-------------|
| selected     | tp      | fp          |
| not selected | fn      | tn          |
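From the table's counts, precision, recall and the balanced F measure follow directly; a small sketch (the example counts are arbitrary):

```python
def precision_recall_f1(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F_beta is the weighted harmonic mean of precision and recall
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, ~0.667, ~0.727)
```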
Sentiment Analysis
Typology of Affective States
Emotion
Mood
Interpersonal stances
Attitudes (Sentiment Analysis)
Personality traits
Baseline Algorithm
Tokenization
Feature Extraction
Classification
Naive Bayes
MaxEnt (better)
SVM (better)
Issues
HTML, XML and other markups
Capitalization
Phone numbers, dates
Emoticons
Sentiment Lexicons
Semi-supervised learning of lexicons
use a small amount of information to bootstrap a lexicon
adjectives conjoined by “and” have same polarity
adjectives conjoined by “but” do not
Process
label seed set of adjectives
expand seed set to conjoined adjectives
supervised classifier assigns “polarity similarity” to each word pair
clustering for partitioning
Turney Algorithm
extract a phrasal lexicon from reviews
learn polarity of each phrase
rate a review by the average polarity of its phrases
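The notes leave the polarity-learning step implicit; in Turney's formulation it is computed from pointwise mutual information against two seed words:

$$ Polarity(phrase)=PMI(phrase,\text{"excellent"})-PMI(phrase,\text{"poor"}) $$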
Advantages
domain-specific
more robust
Assume classes have equal frequencies:
if not balanced: need to use F-scores
severe imbalance can also degrade classifier performance
solutions:
Resampling in training
Cost-sensitive learning
penalize SVM more for misclassification of the rare thing
Features:
Negation is important
Using all words works well for some tasks (NB)
Finding subsets of words may help in other tasks
Features
Joint and Discriminative
Joint (generative) models: place probabilities over both observed data and the hidden stuff ---- P(c,d)
N-gram models
Naive Bayes classifiers
Hidden Markov models
Probabilistic context-free grammars
IBM machine translation alignment models
Discriminative (conditional) models: take the data as given, and put a probability over hidden structure given the data ---- P(c|d)
Logistic regression
Conditional loglinear
Maximum Entropy models
Conditional Random Fields
SVMs
Perceptron
Features
Feature Expectations:
Empirical count
Model expectation
Feature-Based Models:
Text Categorization
Word-Sense Disambiguation
POS Tagging
Maximum Entropy
$$ \log P(C|D,\lambda)=\sum_{(c,d)\in (C,D)}\log P(c|d,\lambda)=\sum_{(c,d)\in(C,D)}\log \tfrac{\exp \sum_{i} \lambda_i f_i(c,d)}{\sum_{c'} \exp\sum_i \lambda_i f_i(c',d)} $$
Find the optimal parameters
Gradient descent (GD), Stochastic gradient descent (SGD)
Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
Conjugate gradient (CG), perhaps with preconditioning
Quasi-Newton methods - limited memory variable metric (LMVM) methods, in particular, L-BFGS
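All of these optimizers need the gradient of the conditional log-likelihood, which for maxent takes the standard "empirical count minus model expectation" form that the Feature Expectations entry above refers to:

$$ \frac{\partial \log P(C|D,\lambda)}{\partial \lambda_i}=\sum_{(c,d)\in(C,D)}f_i(c,d)-\sum_{(c,d)\in(C,D)}\sum_{c'}P(c'|d,\lambda)f_i(c',d) $$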
Feature Overlap
Maxent models handle overlapping features well
Unlike NB, there is no double counting
Feature Interaction
Maxent models handle overlapping features well, but do not automatically model feature interactions
If you want interaction terms, you have to add them
A disjunctive feature would also have done it
Smoothing:
Issues of scale
Lots of features
Lots of sparsity
Optimization problems
Methods
Early stopping
Priors (MAP)
Regularization
Virtual Data
Count Cutoffs
Named Entity Recognition (NER)
The uses:
Named entities can be indexed, linked off, etc.
Sentiment can be attributed to companies or products
A lot of IE relations are associations between named entities
For question answering, answers are often named entities
Training:
Collect a set of representative training documents
Label each token for its entity class or other
Design feature extractors appropriate to the text and classes
Train a sequence classifier to predict the labels from the data
Inference
Greedy
Fast, no extra memory requirements
Very easy to implement
With rich features including observations to the right, it may perform quite well
Greedy: we make commitments we cannot recover from
Beam
Fast, beam sizes of 3-5 are almost as good as exact inference in many cases
Easy to implement
Inexact: the globally best sequence can fall off the beam
Viterbi
Exact: the global best sequence is returned
Harder to implement long-distance state-state interactions
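A minimal Viterbi sketch for a sequence labeler with per-position scores and a transition matrix; the tags, probabilities and scoring functions here are illustrative stand-ins for whatever features a real MEMM/CRF would use:

```python
import math

def viterbi(obs, states, log_trans, log_emit, log_start):
    """Return the globally best state sequence for `obs`.
    log_trans[s1][s2], log_emit[s][o], log_start[s] are log scores."""
    # best[t][s] = best log score of any path ending in state s at position t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            score, prev = max(
                (best[t - 1][p] + log_trans[p][s] + log_emit[s][obs[t]], p)
                for p in states)
            best[t][s] = score
            back[t][s] = prev
    # Follow back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy example: two tags, two observed words
states = ["O", "PER"]
log_start = {"O": math.log(0.8), "PER": math.log(0.2)}
log_trans = {"O": {"O": math.log(0.7), "PER": math.log(0.3)},
             "PER": {"O": math.log(0.6), "PER": math.log(0.4)}}
log_emit = {"O": {"said": math.log(0.9), "Smith": math.log(0.1)},
            "PER": {"said": math.log(0.05), "Smith": math.log(0.95)}}
print(viterbi(["Smith", "said"], states, log_trans, log_emit, log_start))
```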
CRFs
Training is slower, but CRFs avoid causal-competition biases
In practice usually work much the same as MEMMs
Relation Extraction
How to build relation extractors
Hand-written patterns
High-precision and low-recall
Specific domains
A lot of work
Supervised machine learning
MaxEnt, Naive Bayes, SVM
Can get high accuracies with enough training data
Labeling a large training set is expensive
Brittle, don’t generalize well to different genres
Semi-supervised and unsupervised
Bootstrapping (using seeds)
Find sentences with these pairs
Look at the context between or around the pair and generalize the context to create patterns
Use the patterns to grep for more pairs
Distant supervision
Doesn’t require iteratively expanding patterns
Uses very large amounts of unlabeled data
Not sensitive to genre issues in training corpus
Unsupervised learning from the web
Use parsed data to train a “trustworthy tuple” classifier
Single-pass extract all relations between NPs, keep if trustworthy
Assessor ranks relations based on text redundancy
POS Tagging
Performance
About 97% currently
But baseline is already 90%
Partly easy because
Many words are unambiguous
You get points for them and for punctuation marks
Difficulty
ambiguous words
common words can be ambiguous
Source of information
knowledge of neighboring words
knowledge of word probabilities
word
lowercased word
prefixes
suffixes
capitalization
word shapes
Summary
the change from generative to discriminative model does not by itself result in great improvement
the higher accuracy of discriminative models comes at the price of much slower training
Parsing
Treebank
reusability of the labor
many parsers, POS taggers, etc.
valuable resource for linguistics
broad coverage
frequencies and distributional information
a way to evaluate systems
Statistical parsing applications
high precision question answering
improving biological named entity finding
syntactically based sentence compression
extracting interaction in computer games
helping linguists find data
source sentence analysis for machine translation
relation extraction systems
Phrase structure grammars (= context-free grammars, CFGs) in NLP
G = (T, C, N, S, L, R)
T is a set of terminal symbols
C is a set of preterminal symbols
N is a set of nonterminal symbols
S is the start symbol
L is the lexicon, a set of items of the form X -> x
R is the grammar, a set of items of the form X -> $\gamma$
e is the empty symbol
Probabilistic Parsing
Probabilistic - or stochastic - context-free grammars (PCFGs)
G = (T, N, S, R, P)
P is a probability function
Chomsky Normal Form
Reconstructing n-aries is easy
Reconstructing unaries/empties is trickier
Binarization is crucial for cubic time CFG parsing
Cocke-Kasami-Younger (CKY) Constituency Parsing
Unaries can be incorporated into the algorithm
Empties can be incorporated
Binarization is vital
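A minimal probabilistic CKY sketch for a grammar already in Chomsky Normal Form; the tiny lexicon and grammar are illustrative, and unaries/empties are omitted for brevity:

```python
from collections import defaultdict

def pcky(words, lexicon, grammar):
    """lexicon: {terminal: [(preterminal, prob), ...]}
       grammar: {(B, C): [(A, prob), ...]} for binary rules A -> B C.
       Returns chart[(i, j)] = {nonterminal: best probability over span i..j}."""
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for tag, p in lexicon.get(w, []):
            chart[(i, i + 1)][tag] = max(chart[(i, i + 1)].get(tag, 0.0), p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for B, pb in chart[(i, k)].items():
                    for C, pc in chart[(k, j)].items():
                        for A, pr in grammar.get((B, C), []):
                            p = pr * pb * pc
                            if p > chart[(i, j)].get(A, 0.0):
                                chart[(i, j)][A] = p
    return chart

lexicon = {"fish": [("N", 0.6), ("V", 0.4)], "people": [("N", 1.0)]}
grammar = {("N", "V"): [("S", 0.9)], ("N", "N"): [("NP", 0.5)]}
chart = pcky(["people", "fish"], lexicon, grammar)
print(chart[(0, 2)])   # best probabilities for constituents over the whole span
```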
Performance
Robust
Partial solution for grammar ambiguity
Give a probabilistic language model
The problem seems to be that PCFGs lack the lexicalization of a trigram model
Lexicalized Parsing
Charniak
Probabilistic conditioning is “top-down” like a regular PCFG, but actual parsing is bottom-up, somewhat like the CKY algorithm we saw
Non-Independence
The independence assumptions of a PCFG are often too strong
We can relax independence assumptions by encoding dependencies into the PCFG symbols, by state splitting (sparseness)
Accurate Unlexicalized Parsing
Grammar rules are not systematically specified to the level of lexical items
Closed vs. open class words
Learning Latent Annotations
brackets are known
base categories are known
induce subcategories
clever split/merge category refinement
EM, like Forward-Backward for HMMs, but constrained by tree
Dependency Parsing
Methods
Dynamic programming (like in the CKY algorithm)
Graph algorithm
Constraint Satisfaction
Deterministic Parsing
Sources of information
Bilexical affinities
Dependency distance
Intervening material
Valency of heads
MaltParser
Greedy
Bottom up
Has
a stack
a buffer
a set of dependency arcs
a set of actions
Each action is predicted by a discriminative classifier (SVM)
No search
Provides close to state of the art parsing performance
Provides very fast linear time parsing
Projective
Dependencies from a CFG tree using heads, must be projective
But dependency theory normally does allow non-projective structure to account for displaced constituents
The arc-eager algorithm only builds projective dependency trees
Stanford Dependencies
Projective
Can be generated by postprocessing headed phrase structure parses, or dependency parsers like MaltParser or the Easy-First Parser
Information Retrieval
Used for:
web search
e-mail search
searching your laptop
corporate knowledge bases
legal information retrieval
Classic search
User task
Info need
Query & Collection
Search engine
Results
Initial stages of text processing
Tokenization
Normalization
Stemming
Stop words
Query processing
AND
“merge” algorithm
Phrase queries
Biword indexes
false positives
bigger dictionary
Positional indexes
Extract inverted index entries for each distinct term
Merge their doc:position lists to enumerate all positions
Same general method for proximity searches
A positional index is 2-4 times as large as a non-positional index
Caveat: all of this holds for “English-like” language
These two approaches can be profitably combined
Ranked Retrieval
Advantage
Free text queries
large result sets are not an issue
Query-document matching scores
Jaccard coefficient
Doesn’t consider term frequency
Length normalization needed
Bag of words model
Term frequency (tf)
Log-frequency weighting
Inverse document frequency (idf)
tf-idf weighting
$$ W_{t,d}=(1+\log tf_{t,d})\times \log_{10}(N/df_t) $$
Best known weighting scheme in information retrieval
Distance: cosine(query, document)
$$ \cos(\vec q,\vec d)=\tfrac{\vec q \bullet \vec d}{|\vec q||\vec d|}=\tfrac{\vec q}{|\vec q|}\bullet \tfrac{\vec d}{|\vec d|}=\tfrac{\sum_{i=1}^{|V|}q_i d_i}{\sqrt{\sum_{i=1}^{|V|}q_i^2}\sqrt{\sum_{i=1}^{|V|}d_i^2}} $$
Weighting
Many search engines allow for different weightings for queries vs. documents
A very standard weighting scheme is: lnc.ltc
Document: logarithmic tf, no idf and cosine normalization
Query: logarithmic tf, idf, cosine normalization
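A minimal sketch of this kind of scoring: log tf on both sides, idf on the query only (roughly the lnc.ltc idea, without length pivoting); the two-document collection is illustrative:

```python
import math
from collections import Counter

def log_tf(counts):
    """1 + log10(tf) weighting for every term with a nonzero count."""
    return {t: 1 + math.log10(c) for t, c in counts.items()}

def cosine_score(query, doc, idf):
    """Score one document against a query with (roughly) lnc.ltc weighting."""
    q = {t: w * idf.get(t, 0.0) for t, w in log_tf(Counter(query)).items()}
    d = log_tf(Counter(doc))                       # no idf on the document side
    qnorm = math.sqrt(sum(w * w for w in q.values()))
    dnorm = math.sqrt(sum(w * w for w in d.values()))
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    return dot / (qnorm * dnorm) if qnorm and dnorm else 0.0

docs = [["best", "car", "insurance"], ["car", "repair", "shop"]]
N = len(docs)
df = Counter(t for doc in docs for t in set(doc))
idf = {t: math.log10(N / df_t) for t, df_t in df.items()}
query = ["car", "insurance"]
print([cosine_score(query, doc, idf) for doc in docs])
```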
Evaluation
Mean average precision (MAP)
Semantics
Situation
Reminder: lemma and wordform
Homonymy
Homographs
Homophones
Polysemy
Synonyms
Antonyms
Hyponymy and Hypernymy
Hyponyms and Instances
Applications of Thesauri and Ontologies
Information Extraction
Information Retrieval
Question Answering
Bioinformatics and Medical Informatics
Machine Translation
Word Similarity
Synonymy and similarity
Similarity algorithms
Thesaurus-based algorithm
Distributional algorithms
Thesaurus-based similarity
$LCS(c_1,c_2)=$ The most informative (lowest) node in the hierarchy subsuming both $c_1$ and $c_2$
$$ Sim_{path}(c_1,c_2)=\tfrac{1}{pathlen(c_1,c_2)} $$
$$ Sim_{resnik}(c_1,c_2)=-\log P(LCS(c_1,c_2)) $$
$$ Sim_{lin}(c_1,c_2)=\tfrac{2\log P(LCS(c_1,c_2))}{\log P(c_1)+\log P(c_2)} $$
$$ Sim_{jiangconrath}(c_1,c_2)=\tfrac{1}{\log P(c_1)+\log P(c_2)-2\log P(LCS(c_1,c_2))} $$
$$ Sim_{eLesk}(c_1,c_2)=\sum_{r,q\in RELS}overlap(gloss(r(c_1)),gloss(q(c_2))) $$
Evaluating
Intrinsic
Correlation between algorithm and human word similarity ratings
Extrinsic (task-based, end-to-end)
Malapropism (Spelling error) detection
WSD
Essay grading
Taking TOEFL multiple-choice vocabulary tests
Problems
We don’t have a thesaurus for every language
recall
missing words
missing phrases
missing connections between senses
works less well for verbs, adj.
Distributional models of meaning
For the term-document matrix: tf-idf
For the term-context matrix: Positive Pointwise Mutual Information (PPMI) is common
$$ PMI(w_1,w_2)=\log_2\tfrac{P(w_1,w_2)}{P(w_1)P(w_2)} $$
PMI is biased toward infrequent events
various weighting schemes
add-one smoothing
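A minimal PPMI sketch over a term-context count matrix, following the PMI definition above; the counts are made up, and add-one smoothing of the counts (as mentioned above) could be applied before this step:

```python
import numpy as np

def ppmi(counts):
    """counts: term-by-context co-occurrence matrix. Returns the PPMI matrix."""
    total = counts.sum()
    p_joint = counts / total                     # P(w, c)
    p_w = p_joint.sum(axis=1, keepdims=True)     # P(w)
    p_c = p_joint.sum(axis=0, keepdims=True)     # P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                 # zero counts -> PPMI 0
    return np.maximum(pmi, 0.0)                  # clip negative PMI to 0

counts = np.array([[2.0, 1.0, 0.0],
                   [1.0, 0.0, 3.0]])
print(ppmi(counts))
```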
Question Answering
Question processing
Detect question type, answer type (NER), focus, relations
Formulate queries to send a search engine
Passage Retrieval
Retrieve ranked documents
Break into suitable passages and re-rank
Answer processing
Extract candidate answers
Rank candidates
Approaches
Knowledge-based
build a semantic representation of the query
Map from this semantics to query structured data or resources
Hybrid
build a shallow semantic representation of the query
generate answer candidate using IR methods
Score each candidate using richer knowledge sources
Answer type taxonomy
6 coarse classes
Abbreviation
Entity
Description
Human
Location
Numeric
50 finer classes
Detection
Hand-written rules
Machine Learning
Hybrids
Features
Question words and phrases
Part-of-speech tags
Parse features
Named Entities
Semantically related words
Keyword selection
Non-stop words
NNP words in recognized named entities
Complex nominals with their adjectival modifiers
Other complex nominals
Nouns with their adjectival modifiers
Other nouns
Verbs
Adverbs
QFW word (skipped in all previous steps)
Other words
Passage Retrieval
IR engine retrieves documents using query terms
Segment the documents into shorter units
Passage ranking
number of named entities of the right type in passage
number of query words in passage
number of question N-grams also in passage
proximity of query keywords to each other in passage
longest sequence of question words
rank of the document containing passage
Features for ranking candidate answers
answer type match
pattern match
question keywords
keyword distance
novelty factor
apposition features
punctuation location
sequences of question terms
Common Evaluation Metrics
Accuracy
Mean Reciprocal Rank (MRR)
$$ MRR = \tfrac{\sum_{i=1}^N \tfrac{1}{rank_i}}{N} $$
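The MRR formula as a small sketch (ranks are 1-based; a question with no correct answer returned contributes 0):

```python
def mrr(ranks):
    """ranks: 1-based rank of the first correct answer per question (None if absent)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

print(mrr([1, 3, None, 2]))   # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```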
Summarization
Applications
outlines or abstracts
summaries
action items
simplifying
Three stages
content selection
information ordering
sentence realization
salient words
tf-idf
topic signature
mutual information
log-likelihood ratio (LLR)
$$ weight(w_i)=\begin{cases}1, & \text{if } -2\log \lambda(w_i)>10 \\ 0, & \text{otherwise}\end{cases} $$
Supervised content selection problem
hard to get labeled training data
alignment difficult
performance not better than unsupervised algorithm
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Intrinsic metric for automatically evaluating summaries
based on BLEU
not as good as human evaluation
much more convenient
$$ ROUGE\text{-}2=\tfrac{\sum_{S\in RefSummaries}\sum_{bigram\ i\in S}\min(count(i,X),count(i,S))}{\sum_{S\in RefSummaries}\sum_{bigram\ i\in S}count(i,S)} $$
(X is the candidate summary, S ranges over the reference summaries)
Maximal Marginal Relevance (MMR)
Iteratively (greedily)
Relevant: high cosine similarity to the query
Novel: low cosine similarity to the summary
Stop when desired length
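A minimal greedy MMR sketch over bag-of-words sentence vectors; the cosine helper, λ = 0.7 and the toy vectors are illustrative choices, not from the notes:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(sent_vecs, query_vec, k=2, lam=0.7):
    """Greedy MMR: pick sentences relevant to the query but not redundant
    with the summary built so far."""
    selected, candidates = [], list(range(len(sent_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(sent_vecs[i], query_vec)
            redundancy = max((cosine(sent_vecs[i], sent_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy vectors: sentences 0 and 1 are duplicates, sentence 2 covers new content
sent_vecs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
query_vec = [1, 1, 1, 1]
print(mmr_select(sent_vecs, query_vec))   # [0, 2]: the duplicate is skipped
```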
Information Ordering
Chronological ordering
Coherence
Topical ordering