
Models

| Model | Project | Paper | Description |
|---|---|---|---|
| GloVe | https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Global Vectors for Word Representation |
| The Annotated Transformer | http://nlp.seas.harvard.edu/2018/04/03/attention.html | | |
| GPT (from OpenAI) | https://github.com/openai/finetune-transformer-lm | Improving Language Understanding with Unsupervised Learning | Language understanding |
| GPT-2 (from OpenAI) | https://github.com/openai/gpt-2 | https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf | Better Language Models and Their Implications |
| Transformer-XL (from Google/CMU) | https://github.com/kimiyoung/transformer-xl | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | |
| XLNet (from Google/CMU) | https://github.com/zihangdai/xlnet/ | XLNet: Generalized Autoregressive Pretraining for Language Understanding | |
| XLM (from Facebook) | https://github.com/facebookresearch/XLM/ | Cross-lingual Language Model Pretraining | |
| RoBERTa (from Facebook) | https://github.com/pytorch/fairseq/tree/master/examples/roberta | Robustly Optimized BERT Pretraining Approach | |
| DistilBERT (from HuggingFace) | https://github.com/huggingface/transformers/tree/master/examples/distillation | DistilBERT, a distilled version of BERT | Smaller, faster, cheaper, lighter |
| BERT | https://github.com/google-research/bert | https://arxiv.org/abs/1810.04805 | |
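GloVe vectors are distributed as plain text: one word per line followed by its float components. A minimal sketch (pure Python, with toy 3-dimensional vectors standing in for the real 50/100/300-dimensional `glove.6B.*.txt` files) of loading that format and comparing words by cosine similarity:

```python
import io
import math

def load_glove(fh):
    """Parse GloVe's text format: `word v1 v2 ... vn`, one word per line."""
    vecs = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors in the same layout as the real GloVe files.
sample = io.StringIO(
    "king 1.0 0.9 0.1\n"
    "queen 0.9 1.0 0.1\n"
    "apple 0.1 0.0 1.0\n"
)
vecs = load_glove(sample)
print(round(cosine(vecs["king"], vecs["queen"]), 3))  # → 0.995
```

With real GloVe files the only change is opening the file with `encoding="utf-8"`; the parsing logic is identical.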

Projects

| Project | Model | Paper | Description |
|---|---|---|---|
| https://allennlp.org | | | |
| https://github.com/didi/delta | | https://arxiv.org/pdf/1908.01853.pdf | DiDi DELTA |
| https://nlp.stanford.edu/software/CRF-NER.html | CRF | | Stanford CRF NER |
| https://github.com/huggingface/transformers | Transformers | https://huggingface.co/transformers | Implements many models (BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, DistilBERT); demo at https://transformer.huggingface.co |
| https://github.com/hundredblocks/concrete_NLP_tutorial | | | An NLP workshop by Emmanuel Ameisen (@EmmanuelAmeisen), from Insight AI |
| https://github.com/BrikerMan/Kashgari | Word2Vec, BERT, and GPT2 | | Kashgari is a production-ready NLP transfer-learning framework for text labeling and text classification, including Word2Vec, BERT, and GPT2 language embeddings |
| https://github.com/kyzhouhzau/BERT-NER | BERT | | NER implementation on the CoNLL-2003 dataset |
| https://github.com/ProHiryu/bert-chinese-ner | BERT | | Implementation based on the People's Daily dataset |
| https://github.com/macanv/BERT-BiLSTM-CRF-NER | BERT | | Training and serving |
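BERT and the models derived from it in the tables above split words into subword units before embedding them. A minimal sketch of greedy longest-match-first WordPiece-style tokenization, using a tiny hypothetical vocabulary (the real models ship vocab files with roughly 30k entries):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece-style."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry a '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no split possible: the whole word becomes [UNK]
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary; real BERT vocabularies are learned from a corpus.
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # → ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # → ['play', '##ing']
```

This is the core of why BERT-family models handle rare words without a huge vocabulary: unseen words decompose into known pieces instead of mapping straight to `[UNK]`.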
| Project | Description |
|---|---|
| https://github.com/marcotcr/lime | Explains the predictions of machine-learning classifiers. |

Paper: https://arxiv.org/abs/1602.04938
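LIME explains a single prediction by perturbing the input and watching how the black-box classifier reacts. A heavily simplified, self-contained sketch of that idea for bag-of-words text (the real library fits a locally weighted linear model; this toy version uses the mean difference in the classifier's output with each word kept vs. dropped):

```python
import random

def explain(text, predict, n_samples=500, seed=0):
    """Score each word by how much masking it shifts the black-box output."""
    rng = random.Random(seed)
    words = text.split()
    sums = {w: [0.0, 0.0] for w in words}    # [sum when kept, sum when dropped]
    counts = {w: [0, 0] for w in words}
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in words]
        sample = " ".join(w for w, keep in zip(words, mask) if keep)
        p = predict(sample)
        for w, keep in zip(words, mask):
            idx = 0 if keep else 1
            sums[w][idx] += p
            counts[w][idx] += 1
    return {
        w: sums[w][0] / max(counts[w][0], 1) - sums[w][1] / max(counts[w][1], 1)
        for w in words
    }

# Toy black box: positive iff the word "good" appears.
clf = lambda s: 1.0 if "good" in s.split() else 0.0
importance = explain("the movie was good", clf)
print(max(importance, key=importance.get))  # → good
```

The word the classifier actually depends on gets an importance near 1.0, the rest stay near 0 - which is exactly the kind of local, model-agnostic explanation LIME produces (with better statistical machinery).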

Datasets

| Dataset | Description |
|---|---|
| https://github.com/SophonPlus/ChineseNlpCorpus | |
| https://github.com/SophonPlus/ChineseWordVectors | |
| ChnSentiCorp_htl_all | |
| https://github.com/thunlp/CAIL | Chinese AI & Law Challenge, http://cail.cipsc.org.cn |

NER Datasets

| Dataset | Description |
|---|---|
| https://github.com/ontonotes/conll-formatted-ontonotes-5.0 | A CoNLL-formatted version of the OntoNotes 5.0 release. |
| https://github.com/juand-r/entity-recognition-datasets#references | A collection of corpora for named entity recognition (NER) and entity recognition tasks, covering a variety of languages, domains, and entity types. |
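NER corpora like CoNLL-2003 and the CoNLL-formatted OntoNotes release above use a simple column format: one token per line with its tag in the last column, and blank lines separating sentences. A minimal sketch of reading it (the sample here is a hypothetical two-column excerpt; the real CoNLL-2003 files also carry POS and chunk columns, which `cols[-1]` skips over correctly):

```python
def read_conll(text):
    """Split CoNLL-style column data into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))  # token and last column (the tag)
    if current:
        sentences.append(current)
    return sentences

sample = """EU B-ORG
rejects O
German B-MISC
call O

Peter B-PER
Blackburn I-PER
"""
print(read_conll(sample))
```

Most of the BERT-based NER projects listed earlier consume exactly this shape of data before converting tokens to subword IDs.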

Learning Resources

NLP roadmap

https://github.com/graykode/nlp-roadmap

The roadmap covers four tracks: Probability & Statistics, Machine Learning, Text Mining, and Natural Language Processing.

Annotation Tools

| Tool | Link | Description |
|---|---|---|
| Prodigy | https://prodi.gy/docs/ | Prodigy (from explosion.ai, the company behind spaCy) |
| brat | https://github.com/nlplab/brat | |
| Knowtator | http://knowtator.sourceforge.net/index.shtml | |
| Protégé + Knowtator plugin | https://github.com/UCDenver-ccp/Knowtator-2.0 https://protege.stanford.edu/short-courses.php | |
| | http://deepdive.stanford.edu/labeling | |
| | https://github.com/SongRb/DeepDiveChineseApps | |
| | https://github.com/qiangsiwei/DeepDive_Chinese | |
| | https://github.com/jiesutd/SUTDAnnotator | |
| | https://github.com/HazyResearch/snorkel | |
| | https://bitbucket.org/dainkaplan/slate/ | |
| iepy | https://github.com/machinalis/iepy | Annotation and information extraction |
| doccano | https://github.com/chakki-works/doccano | |
| YEDDA | https://github.com/jiesutd/YEDDA | |
| Chinese-Annotator | https://github.com/deepwel/Chinese-Annotator | Combines online and offline annotation for Chinese; a good idea, but the project is still incomplete |
| HanLP | https://github.com/hankcs/HanLP | NLP toolkit in Java: Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, new-word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion |
| poplar | https://github.com/synyi/poplar | Developed by the Chinese company Synyi (森亿) |

brat

brat configuration

Annotation configuration: annotation.conf

  • [entities]
[entities]	 
Person
Location
Organization
  • [relations]

Arguments are specified in the format "ARG:TYPE".

[relations]	 
Family	Arg1:Person, Arg2:Person
Employment	Arg1:Person, Arg2:Organization

An argument can allow multiple types, separated by "|":

[relations]	 	 
Located	Arg1:Person,	Arg2:Building|City|Country
Located	Arg1:Building,	Arg2:City|Country
Located	Arg1:City,	Arg2:Country
  • [events]

Event arguments use the format "ROLE:TYPE"; ROLE can be any name.

[events]	 
Marriage	Participant1:Person, Participant2:Person
Bankruptcy	Org:Company
  • [attributes]

Attributes use the format "ARG:TYPE" and can be applied to both relations and events.

Multi-valued attributes are defined as "Value:VAL1|VAL2|VAL3[…]".

[attributes]	 
Negated	Arg:<EVENT>
Confidence	Arg:<EVENT>, Value:L1|L2|L3

brat standoff annotation format

http://brat.nlplab.org/standoff.html

T1	Organization 0 4	Sony
T2	MERGE-ORG 14 27	joint venture
T3	Organization 33 41	Ericsson
E1	MERGE-ORG:T2 Org1:T1 Org2:T3
T4	Country 75 81	Sweden
R1	Origin Arg1:T3 Arg2:T4
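The standoff example above can be parsed mechanically: fields are tab-separated, and each line's ID prefix (T for text-bound entities, R for relations, E for events, and so on) tells you the annotation kind. A minimal sketch covering just the T and R lines:

```python
def parse_ann(text):
    """Parse brat .ann standoff lines: T* are text-bound entities, R* relations."""
    entities, relations = {}, []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        aid = fields[0]
        if aid.startswith("T"):
            # "TYPE START END" in the second field, surface text in the third.
            ann_type, start, end = fields[1].split(" ")
            entities[aid] = (ann_type, int(start), int(end), fields[2])
        elif aid.startswith("R"):
            # "TYPE Arg1:ID Arg2:ID" in the second field.
            rel_type, arg1, arg2 = fields[1].split(" ")
            relations.append((rel_type, arg1.split(":")[1], arg2.split(":")[1]))
    return entities, relations

ann = "T1\tOrganization 0 4\tSony\n" \
      "T3\tOrganization 33 41\tEricsson\n" \
      "T4\tCountry 75 81\tSweden\n" \
      "R1\tOrigin Arg1:T3 Arg2:T4\n"
entities, relations = parse_ann(ann)
print(entities["T4"])  # → ('Country', 75, 81, 'Sweden')
print(relations)       # → [('Origin', 'T3', 'T4')]
```

A full parser would also handle E (event), A (attribute), and N (normalization) lines, plus discontinuous spans separated by ";", per the standoff specification linked above.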

Protege & Knowtator

Build the Knowtator plugin:

git clone https://github.com/UCDenver-ccp/Knowtator-2.0.git
mvn clean install
cp xxx/plugins/knowtator-2.1.5.jar /Applications/Protégé.app/Contents/Java/plugins/

Restart Protégé.

iepy

NOTE

iepy's preprocess.py can fail on macOS; the corenlp.sh script needs to be modified as follows.

Preprocess not running under MacOS

Problems with the preprocess under macOS? Apparently a change to the CoreNLP launcher script is needed. Edit the file corenlp.sh, located at /Users//Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/, and change scriptdir=`dirname $0` to scriptdir=`dirname "$0"` (i.e., add double quotes around $0).

https://iepy.readthedocs.io/en/stable/troubleshooting.html#troubleshooting

NLP Toolkits

Natural language processing: Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, new-word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion

https://github.com/hankcs/HanLP

NLP Applications

Chatbots

Papers

http://nlp.seas.harvard.edu/2018/04/03/attention.html