
Models

| Model | Project | Paper | Description |
|---|---|---|---|
| GloVe | https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Global Vectors for Word Representation |
| The Annotated Transformer | http://nlp.seas.harvard.edu/2018/04/03/attention.html | | |
| GPT (from OpenAI) | https://github.com/openai/finetune-transformer-lm | Improving Language Understanding with Unsupervised Learning | Language understanding |
| GPT-2 (from OpenAI) | https://github.com/openai/gpt-2 | https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf | Better Language Models and Their Implications |
| Transformer-XL (from Google/CMU) | https://github.com/kimiyoung/transformer-xl | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | |
| XLNet (from Google/CMU) | https://github.com/zihangdai/xlnet/ | XLNet: Generalized Autoregressive Pretraining for Language Understanding | |
| XLM (from Facebook) | https://github.com/facebookresearch/XLM/ | Cross-lingual Language Model Pretraining | |
| RoBERTa (from Facebook) | https://github.com/pytorch/fairseq/tree/master/examples/roberta | Robustly Optimized BERT Pretraining Approach | |
| DistilBERT (from HuggingFace) | https://github.com/huggingface/transformers/tree/master/examples/distillation | DistilBERT, a distilled version of BERT | Smaller, faster, cheaper, lighter |
| BERT | https://github.com/google-research/bert | https://arxiv.org/abs/1810.04805 | |
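GloVe vectors are distributed as plain text: one word per line followed by its float components. A minimal sketch (pure Python, with toy 3-dimensional vectors standing in for the real 50/100/300-dimensional `glove.6B.*.txt` files) of loading that format and comparing words by cosine similarity:

```python
import io
import math

def load_glove(fh):
    """Parse GloVe's text format: `word v1 v2 ... vn`, one word per line."""
    vecs = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors in the same layout as the real GloVe files.
sample = io.StringIO(
    "king 1.0 0.9 0.1\n"
    "queen 0.9 1.0 0.1\n"
    "apple 0.1 0.0 1.0\n"
)
vecs = load_glove(sample)
print(round(cosine(vecs["king"], vecs["queen"]), 3))  # → 0.995
```

With real GloVe files the only change is opening the file with `encoding="utf-8"`; the parsing logic is identical.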

Projects

| Project | Model | Paper | Description |
|---|---|---|---|
| https://allennlp.org | | | |
| https://github.com/didi/delta | | https://arxiv.org/pdf/1908.01853.pdf | DiDi DELTA |
| https://nlp.stanford.edu/software/CRF-NER.html | CRF | | Stanford CRF NER |
| https://github.com/huggingface/transformers | Transformers | https://huggingface.co/transformers | Implements many models (BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, DistilBERT); demo at https://transformer.huggingface.co |
| https://github.com/hundredblocks/concrete_NLP_tutorial | | | An NLP workshop by Emmanuel Ameisen (@EmmanuelAmeisen), from Insight AI |
| https://github.com/BrikerMan/Kashgari | Word2Vec, BERT, and GPT2 | | Kashgari is a production-ready NLP transfer-learning framework for text labeling and text classification, including Word2Vec, BERT, and GPT2 language embeddings |
| https://github.com/kyzhouhzau/BERT-NER | BERT | | NER implementation on the CoNLL-2003 dataset |
| https://github.com/ProHiryu/bert-chinese-ner | BERT | | Implementation based on the People's Daily dataset |
| https://github.com/macanv/BERT-BiLSTM-CRF-NER | BERT | | Training and serving |
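BERT and the models derived from it in the tables above split words into subword units before embedding them. A minimal sketch of greedy longest-match-first WordPiece-style tokenization, using a tiny hypothetical vocabulary (the real models ship vocab files with roughly 30k entries):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece-style."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry a '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no split possible: the whole word becomes [UNK]
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary; real BERT vocabularies are learned from a corpus.
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # → ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # → ['play', '##ing']
```

This is the core of why BERT-family models handle rare words without a huge vocabulary: unseen words decompose into known pieces instead of mapping straight to `[UNK]`.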
| Project | Description |
|---|---|
| https://github.com/marcotcr/lime | Explains the predictions of machine-learning classifiers. |

Paper: https://arxiv.org/abs/1602.04938
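LIME explains a single prediction by perturbing the input and watching how the black-box classifier reacts. A heavily simplified, self-contained sketch of that idea for bag-of-words text (the real library fits a locally weighted linear model; this toy version uses the mean difference in the classifier's output with each word kept vs. dropped):

```python
import random

def explain(text, predict, n_samples=500, seed=0):
    """Score each word by how much masking it shifts the black-box output."""
    rng = random.Random(seed)
    words = text.split()
    sums = {w: [0.0, 0.0] for w in words}    # [sum when kept, sum when dropped]
    counts = {w: [0, 0] for w in words}
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in words]
        sample = " ".join(w for w, keep in zip(words, mask) if keep)
        p = predict(sample)
        for w, keep in zip(words, mask):
            idx = 0 if keep else 1
            sums[w][idx] += p
            counts[w][idx] += 1
    return {
        w: sums[w][0] / max(counts[w][0], 1) - sums[w][1] / max(counts[w][1], 1)
        for w in words
    }

# Toy black box: positive iff the word "good" appears.
clf = lambda s: 1.0 if "good" in s.split() else 0.0
importance = explain("the movie was good", clf)
print(max(importance, key=importance.get))  # → good
```

The word the classifier actually depends on gets an importance near 1.0, the rest stay near 0 - which is exactly the kind of local, model-agnostic explanation LIME produces (with better statistical machinery).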

Datasets

| Dataset | Description |
|---|---|
| https://github.com/SophonPlus/ChineseNlpCorpus | |
| https://github.com/SophonPlus/ChineseWordVectors | |
| ChnSentiCorp_htl_all | |
| https://github.com/thunlp/CAIL | Chinese AI & Law Challenge, http://cail.cipsc.org.cn |

NER Datasets

| Dataset | Description |
|---|---|
| https://github.com/ontonotes/conll-formatted-ontonotes-5.0 | A CoNLL-formatted version of the OntoNotes 5.0 release. |
| https://github.com/juand-r/entity-recognition-datasets#references | A collection of corpora for named entity recognition (NER) and entity recognition tasks, covering a variety of languages, domains, and entity types. |
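NER corpora like CoNLL-2003 and the CoNLL-formatted OntoNotes release above use a simple column format: one token per line with its tag in the last column, and blank lines separating sentences. A minimal sketch of reading it (the sample here is a hypothetical two-column excerpt; the real CoNLL-2003 files also carry POS and chunk columns, which `cols[-1]` skips over correctly):

```python
def read_conll(text):
    """Split CoNLL-style column data into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))  # token and last column (the tag)
    if current:
        sentences.append(current)
    return sentences

sample = """EU B-ORG
rejects O
German B-MISC
call O

Peter B-PER
Blackburn I-PER
"""
print(read_conll(sample))
```

Most of the BERT-based NER projects listed earlier consume exactly this shape of data before converting tokens to subword IDs.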

Learning Resources

NLP roadmap

https://github.com/graykode/nlp-roadmap

The roadmap covers four tracks: Probability & Statistics, Machine Learning, Text Mining, and Natural Language Processing.

Annotation Tools

| Tool | Link | Description |
|---|---|---|
| Prodigy | https://prodi.gy/docs/ | Prodigy (from explosion.ai, the company behind spaCy) |
| brat | https://github.com/nlplab/brat | |
| Knowtator | http://knowtator.sourceforge.net/index.shtml | |
| Protégé + Knowtator plugin | https://github.com/UCDenver-ccp/Knowtator-2.0 https://protege.stanford.edu/short-courses.php | |
| | http://deepdive.stanford.edu/labeling | |
| | https://github.com/SongRb/DeepDiveChineseApps | |
| | https://github.com/qiangsiwei/DeepDive_Chinese | |
| | https://github.com/jiesutd/SUTDAnnotator | |
| | https://github.com/HazyResearch/snorkel | |
| | https://bitbucket.org/dainkaplan/slate/ | |
| iepy | https://github.com/machinalis/iepy | Annotation and information extraction |
| doccano | https://github.com/chakki-works/doccano | |
| YEDDA | https://github.com/jiesutd/YEDDA | |
| Chinese-Annotator | https://github.com/deepwel/Chinese-Annotator | Combines online and offline annotation for Chinese; a good idea, but the project is still incomplete |
| HanLP | https://github.com/hankcs/HanLP | NLP toolkit in Java: Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, new-word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion |
| poplar | https://github.com/synyi/poplar | Developed by the Chinese company Synyi (森亿) |

brat

brat configuration

Annotation configuration: annotation.conf

  • [entities]
[entities]	 
Person
Location
Organization
  • [relations]

Arguments are specified in the format "ARG:TYPE".

[relations]	 
Family	Arg1:Person, Arg2:Person
Employment	Arg1:Person, Arg2:Organization

An argument can allow multiple types, separated by "|":

[relations]	 	 
Located	Arg1:Person,	Arg2:Building|City|Country
Located	Arg1:Building,	Arg2:City|Country
Located	Arg1:City,	Arg2:Country
  • [events]

Event arguments use the format "ROLE:TYPE"; ROLE can be any name.

[events]	 
Marriage	Participant1:Person, Participant2:Person
Bankruptcy	Org:Company
  • [attributes]

Attributes use the format "ARG:TYPE" and can be applied to both relations and events.

Multi-valued attributes are defined as "Value:VAL1|VAL2|VAL3[…]".

[attributes]	 
Negated	Arg:<EVENT>
Confidence	Arg:<EVENT>, Value:L1|L2|L3

brat standoff annotation format

http://brat.nlplab.org/standoff.html

T1	Organization 0 4	Sony
T2	MERGE-ORG 14 27	joint venture
T3	Organization 33 41	Ericsson
E1	MERGE-ORG:T2 Org1:T1 Org2:T3
T4	Country 75 81	Sweden
R1	Origin Arg1:T3 Arg2:T4
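The standoff example above can be parsed mechanically: fields are tab-separated, and each line's ID prefix (T for text-bound entities, R for relations, E for events, and so on) tells you the annotation kind. A minimal sketch covering just the T and R lines:

```python
def parse_ann(text):
    """Parse brat .ann standoff lines: T* are text-bound entities, R* relations."""
    entities, relations = {}, []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        aid = fields[0]
        if aid.startswith("T"):
            # "TYPE START END" in the second field, surface text in the third.
            ann_type, start, end = fields[1].split(" ")
            entities[aid] = (ann_type, int(start), int(end), fields[2])
        elif aid.startswith("R"):
            # "TYPE Arg1:ID Arg2:ID" in the second field.
            rel_type, arg1, arg2 = fields[1].split(" ")
            relations.append((rel_type, arg1.split(":")[1], arg2.split(":")[1]))
    return entities, relations

ann = "T1\tOrganization 0 4\tSony\n" \
      "T3\tOrganization 33 41\tEricsson\n" \
      "T4\tCountry 75 81\tSweden\n" \
      "R1\tOrigin Arg1:T3 Arg2:T4\n"
entities, relations = parse_ann(ann)
print(entities["T4"])  # → ('Country', 75, 81, 'Sweden')
print(relations)       # → [('Origin', 'T3', 'T4')]
```

A full parser would also handle E (event), A (attribute), and N (normalization) lines, plus discontinuous spans separated by ";", per the standoff specification linked above.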

Protege & Knowtator

Build the Knowtator plugin:

git clone https://github.com/UCDenver-ccp/Knowtator-2.0.git
mvn clean install
cp xxx/plugins/knowtator-2.1.5.jar /Applications/Protégé.app/Contents/Java/plugins/

Restart Protégé.

iepy

NOTE

iepy's preprocess.py can fail on macOS; the corenlp.sh script needs to be modified as follows.

Preprocess not running under MacOS

Problems with the preprocess under macOS? Apparently a change to the CoreNLP launcher script is needed. Edit the file corenlp.sh, located at /Users//Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/, and change scriptdir=`dirname $0` to scriptdir=`dirname "$0"` (i.e., add double quotes around $0).

https://iepy.readthedocs.io/en/stable/troubleshooting.html#troubleshooting

NLP Toolkits

Natural language processing: Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, new-word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion

https://github.com/hankcs/HanLP

NLP Applications

Chatbots

Papers

http://nlp.seas.harvard.edu/2018/04/03/attention.html