
The Evolution of NLP Techniques

Basics

Text to Sequence


Tokenization algorithms fall roughly into two categories:

1. Dictionary-based tokenization algorithms

Forward maximum matching, backward maximum matching, and bidirectional maximum matching.

2. Statistics-based machine learning algorithms

N-gram, HMM, CRF, SVM, LSTM+CRF

jieba's framework combines both approaches: dictionary-based matching over a prefix tree (with dynamic programming to pick the most probable segmentation), plus an HMM with the Viterbi algorithm for out-of-vocabulary words.
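As a concrete illustration, here is a minimal jieba usage sketch (assuming `pip install jieba`; the sentence is a standard demo sentence):

```python
import jieba

sentence = "我来到北京清华大学"

# Precise mode (default): dictionary matching + max-probability path.
print(jieba.lcut(sentence))                 # ['我', '来到', '北京', '清华大学']

# Full mode: list every word the dictionary can match.
print(jieba.lcut(sentence, cut_all=True))

# The HMM can be toggled; it is what recovers out-of-vocabulary words.
print(jieba.lcut(sentence, HMM=False))
```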

Word Embedding

  1. First, represent words using one-hot vectors.

    1. Suppose the dictionary contains V unique words;
    2. then each one-hot vector $\mathbf{e}_i$ is V-dimensional;

  2. Second, map the one-hot vectors to low-dimensional vectors: $\mathbf{x}_i = P^\top \mathbf{e}_i$ (see the sketch after this list).

    1. P is a parameter matrix that can be learned from the training data;
    2. $\mathbf{e}_i$ is the one-hot vector of the i-th word in the dictionary;
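A minimal numpy sketch of this mapping (V, d, and the random P are placeholder values; in practice P is learned):

```python
import numpy as np

V, d = 10000, 300                  # dictionary size, embedding dimension
P = np.random.randn(V, d) * 0.01   # parameter matrix; learned during training

i = 42                             # index of some word in the dictionary
e_i = np.zeros(V); e_i[i] = 1.0    # one-hot vector of the i-th word

x_i = P.T @ e_i                    # low-dimensional word vector
assert np.allclose(x_i, P[i])      # the product simply selects row i of P
```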

How to interpret the parameter matrix?

Multiplying $P^\top$ by the one-hot vector $\mathbf{e}_i$ simply selects the i-th row of P, so each row of P is the learned low-dimensional word vector of one dictionary word.

Model

word2vec can be trained without supervision using two models: CBOW and skip-gram.

[Figures: CBOW and skip-gram model architectures]
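A minimal sketch of training both models with gensim (assuming gensim ≥ 4.0 is installed; the two-sentence corpus is a toy placeholder):

```python
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

# sg=0 selects CBOW (predict the center word from its context).
cbow = Word2Vec(corpus, vector_size=50, window=2, sg=0, min_count=1)

# sg=1 selects skip-gram; negative=5 enables negative sampling (see the Tip below).
skipgram = Word2Vec(corpus, vector_size=50, window=2, sg=1, negative=5, min_count=1)

print(skipgram.wv["cat"].shape)        # (50,)
print(skipgram.wv.most_similar("cat"))
```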

Tip

If the corpus is even moderately large, the number of possible outputs explodes; the final layer is essentially a softmax over the entire vocabulary, which is very expensive to compute. Is there a way around this?

  • Feed in two words and predict whether they genuinely occur as an input–output (context–target) pair; this reduces the task to binary classification.
  • The idea is sound, but a training set built this way has labels that are all 1, so the model cannot be trained properly.
  • Improvement: mix in some negative samples (the negative sampling model; see the sketch below).
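A minimal numpy sketch of the negative-sampling objective (all sizes are placeholders, and the noise distribution is simplified; real word2vec samples negatives from a unigram^0.75 distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 50
W_in = rng.normal(0, 0.01, (V, d))      # center-word vectors
W_out = rng.normal(0, 0.01, (V, d))     # context-word vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(center, context, k=5):
    """Binary cross-entropy: label 1 for the real pair, 0 for k noise pairs."""
    pos = sigmoid(W_out[context] @ W_in[center])   # real pair: push towards 1
    negs = rng.integers(0, V, size=k)              # noise words (uniform here)
    neg = sigmoid(-W_out[negs] @ W_in[center])     # noise pairs: push towards 0
    return -np.log(pos) - np.log(neg).sum()

print(neg_sampling_loss(center=12, context=345))
```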

Classification with Word Representations

At this point, we can try classifying text with various classifiers built on top of these word representations.

Shortcoming

This only exploits word statistics and frequencies for classification; it ignores the order of the sequence.

RNN (Recurrent Neural Networks)

Using an RNN instead of a simple classifier


Simple RNN


Note that the hyperbolic tangent in the structure above is essential: it prevents the repeated application of A (effectively raising A to the n-th power) from washing out the effect of the input on later positions in the sequence!
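A minimal numpy sketch of one SimpleRNN step (dimensions are placeholder values):

```python
import numpy as np

def simple_rnn_step(A, h_prev, x_t):
    """One SimpleRNN step: h_t = tanh(A @ [h_{t-1}; x_t]).
    The tanh keeps every entry in (-1, 1), so repeated application does not
    blow up or die out the way a purely linear map (A^n) would."""
    return np.tanh(A @ np.concatenate([h_prev, x_t]))

h_dim, x_dim = 4, 3
A = np.random.randn(h_dim, h_dim + x_dim) * 0.1  # the single shared matrix
h = np.zeros(h_dim)
for x_t in np.random.randn(10, x_dim):           # a length-10 input sequence
    h = simple_rnn_step(A, h, x_t)
print(h)
```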

Shortcomings

  • SimpleRNN is good at short-term dependence;
  • SimpleRNN is bad at long-term dependence;
  • There is only one such parameter matrix, no matter how long the sequence is;


LSTM Model (Using LSTM instead of Simple RNN)

Hochreiter and Schmidhuber. Long short-term memory. Neural Computation, 1997.

  • Conveyor belt: the past information flows directly to the future. (This mitigates the vanishing-gradient problem.)
  • Each of the following blocks has its own parameter matrix:
    • Forget gate: e.g., forget the gender of the old subject.
    • Input gate: decides which values on the conveyor belt we will update.
    • New value ($\tilde{\mathbf{c}}_t$): the candidate to be added to the conveyor belt.
    • Output gate: decides what flows from the conveyor belt $\mathbf{c}_t$ to the state $\mathbf{h}_t$.
  • Number of parameters (checked in the sketch after the Update subsection):
    • 4 * shape(h) * [shape(h) + shape(x)]

Gate Structure

Each gate applies a sigmoid to a linear function of $[\mathbf{h}_{t-1}; \mathbf{x}_t]$, yielding values in (0, 1) that scale the flow of information element-wise.

Update

$$\mathbf{c}_t = \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \tilde{\mathbf{c}}_t, \qquad \mathbf{h}_t = \mathbf{o}_t \circ \tanh(\mathbf{c}_t)$$
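A minimal numpy sketch of a single LSTM step, matching the four gates above (dimensions are placeholder values; biases are omitted so the count matches the formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(Wf, Wi, Wc, Wo, h_prev, c_prev, x_t):
    """One LSTM step: four gates, one conveyor-belt update."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)              # forget gate: what to erase from the belt
    i = sigmoid(Wi @ z)              # input gate: which values to update
    c_tilde = np.tanh(Wc @ z)        # new candidate value
    c = f * c_prev + i * c_tilde     # conveyor belt: past info flows onward
    o = sigmoid(Wo @ z)              # output gate: what flows to the state h_t
    return o * np.tanh(c), c

h_dim, x_dim = 4, 3
Ws = [np.random.randn(h_dim, h_dim + x_dim) * 0.1 for _ in range(4)]
# 4 matrices of shape (h, h + x)  ->  4 * shape(h) * [shape(h) + shape(x)] params
h, c = np.zeros(h_dim), np.zeros(h_dim)
h, c = lstm_step(*Ws, h, c, np.random.randn(x_dim))
print(h, c)
```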

Making RNNs More Effective

Stacked RNN


Bidirectional RNN

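A minimal Keras sketch combining both ideas, stacking and bidirectionality (assuming TensorFlow 2.x; all sizes are placeholder values):

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 64       # placeholder sizes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),      # variable-length token-id sequences
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # Stacked RNN: the lower layer must return its full sequence of states.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    # Bidirectional RNN: one LSTM reads left-to-right, the other right-to-left.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g., binary sentiment
])
model.summary()
```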

Pretraining

The lower layers (e.g., the embedding layer) can be pretrained on a large corpus and then fine-tuned together with the task-specific layers.

Seq2Seq

An encoder RNN compresses the source sequence into its final state, from which a decoder RNN generates the target sequence step by step.

How to Improve?

  • Bi-LSTM instead of LSTM (encoder ONLY!);
  • Word-level tokenization (instead of char-level);
  • Multitask learning;
  • Attention.

Attention

Bahdanau, Cho, & Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

  • Attention tremendously improves the Seq2Seq model.
  • With attention, the Seq2Seq model does not forget the source input.
  • With attention, the decoder knows where to focus.
  • Downside: much more computation.


There are many ways to calculate the attention weights:

1. Additive attention (the method used in the original paper): concatenate $\mathbf{h}_i$ with $\mathbf{s}_0$, compute $\tilde{\alpha}_i = \mathbf{v}^\top \tanh\big(W\,[\mathbf{h}_i; \mathbf{s}_0]\big)$, then normalize: $[\alpha_1,\dots,\alpha_m] = \mathrm{Softmax}\big([\tilde{\alpha}_1,\dots,\tilde{\alpha}_m]\big)$.

2. Dot-product attention (more popular; the method later used by Transformer): linearly map both sides, $\mathbf{k}_i = W_K \mathbf{h}_i$ and $\mathbf{q}_0 = W_Q \mathbf{s}_0$, score with the inner product $\tilde{\alpha}_i = \mathbf{k}_i^\top \mathbf{q}_0$, then apply the same Softmax.
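A minimal numpy sketch of both variants (all matrices are random placeholders; in practice they are learned):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

d, m = 8, 5
h = np.random.randn(m, d)               # encoder hidden states h_1 .. h_m
s0 = np.random.randn(d)                 # current decoder state

# Variant 1: additive attention (the original paper's method).
W = np.random.randn(d, 2 * d) * 0.1
v = np.random.randn(d)
scores_add = np.array([v @ np.tanh(W @ np.concatenate([h_i, s0])) for h_i in h])

# Variant 2: dot-product attention (the Transformer's method).
W_K, W_Q = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
scores_dot = (h @ W_K.T) @ (W_Q @ s0)   # k_i^T q_0 for every i

alpha = softmax(scores_dot)             # attention weights; they sum to 1
context = alpha @ h                     # context vector: weighted sum of the h_i
print(alpha.round(3), context.shape)
```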

Self-Attention

Self-attention is more general: it is not confined to the seq2seq setting;

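A minimal numpy sketch of one self-attention layer (dimensions and weights are placeholders):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

m, d = 5, 8                            # sequence length, model dimension
X = np.random.randn(m, d)              # input sequence (e.g., word embeddings)

W_Q, W_K, W_V = (np.random.randn(d, d) * 0.1 for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # queries, keys, values from the SAME input

A = softmax(Q @ K.T / np.sqrt(d))      # (m, m): every position attends to all
C = A @ V                              # (m, d): one context vector per position
print(C.shape)
```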

Applying Attention to Seq2Seq

At every decoding step, the decoder recomputes attention weights over all encoder hidden states and feeds the resulting context vector in as an extra input.

Transformer

  1. Bahdanau, Cho, & Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  2. Cheng, Dong, & Lapata. Long Short-Term Memory-Networks for Machine Reading. In EMNLP, 2016.
  3. Vaswani et al. Attention Is All You Need. In NIPS, 2017.

  • Transformer is a Seq2Seq model.
  • Transformer is not an RNN.
  • It is based purely on attention and dense layers.
  • Higher accuracy than RNNs on large datasets.


Attention without RNN

The attention layer can replace the RNN entirely: queries $\mathbf{q}_{:j} = W_Q \mathbf{x}'_j$ are computed from the decoder inputs, while keys $\mathbf{k}_{:i} = W_K \mathbf{x}_i$ and values $\mathbf{v}_{:i} = W_V \mathbf{x}_i$ are computed from the encoder inputs; each output is the weighted sum $\mathbf{c}_{:j} = \sum_i \alpha_{ij}\, \mathbf{v}_{:i}$.

Self-Attention without RNN

A self-attention layer uses the same construction, except that queries, keys, and values are all computed from the same input sequence.

Transformer


  • Transformer is a Seq2Seq model; it has an encoder and a decoder.
  • The Transformer model is not an RNN.
  • Transformer is based on attention and self-attention (multi-head; see the sketch below).
  • Transformer outperforms all the state-of-the-art RNN models.
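A minimal numpy sketch of multi-head self-attention, the Transformer's core building block (head counts and dimensions are placeholder values; layer norm, residual connections, and masking are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, heads, W_O):
    """Each head is an independent self-attention; the head outputs are
    concatenated and mixed by the output matrix W_O."""
    outs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # scaled dot-product
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1) @ W_O

m, d, n_heads = 5, 8, 2
d_head = d // n_heads
heads = [tuple(np.random.randn(d, d_head) * 0.1 for _ in range(3))
         for _ in range(n_heads)]
W_O = np.random.randn(n_heads * d_head, d) * 0.1
print(multi_head_self_attention(np.random.randn(m, d), heads, W_O).shape)  # (5, 8)
```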

BERT (Bidirectional Encoder Representations from Transformers)

  1. Devlin, Chang, Lee, and Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

  2. Vaswani et al. Attention is all you need. In NIPS, 2017.

Main pretraining tasks:

  • Predict masked word.
  • Predict next sentence.

Combining the two methods

  • Loss 1 is for binary classification (i.e., predicting the next sentence).
  • Loss 2 and Loss 3 are for multi-class classification (i.e., predicting the masked words).
  • The objective function is the sum of the three loss functions.
  • Update the model parameters by performing one gradient descent step.

Data

  • BERT does not need manually labeled data. (Nice! Manual labeling is expensive.)
  • Use large-scale data, e.g., English Wikipedia (2.5 billion words).
  • Randomly mask words (with some tricks).
  • 50% of the next sentences are real. (The other 50% are fake.)
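To see masked-word prediction in action, here is a minimal sketch using the Hugging Face transformers library (assuming `pip install transformers`; this downloads a pretrained bert-base-uncased checkpoint, it does not reproduce the pretraining itself):

```python
from transformers import pipeline

# Fill-mask pipeline: BERT predicts the word hidden behind [MASK].
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```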
