NLP Resources
Contents
Models
Projects
Project | Model | Paper | Description |
---|---|---|---|
https://allennlp.org | | | |
https://github.com/didi/delta | | https://arxiv.org/pdf/1908.01853.pdf | DiDi Delta |
https://nlp.stanford.edu/software/CRF-NER.html | CRF | | Stanford CRF NER |
Project | Description |
---|---|
https://github.com/marcotcr/lime | Explains the predictions of machine learning classifiers. Paper: https://arxiv.org/abs/1602.04938 |
Datasets
NER Datasets
Dataset | Description |
---|---|
https://github.com/ontonotes/conll-formatted-ontonotes-5.0 | This is a CoNLL formatted version of the OntoNotes 5.0 release. |
https://github.com/juand-r/entity-recognition-datasets#references | A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types. |
Learning Resources
NLP roadmap
https://github.com/graykode/nlp-roadmap
Probability & Statistics
Machine Learning
Text Mining
Natural Language Processing
Annotation Tools
| Tool | Link | Description |
|---|---|---|
| Prodigy | https://prodi.gy/docs/ | Developed by Explosion AI (explosion.ai), the company behind spaCy |
| brat | https://github.com/nlplab/brat | |
| Knowtator | http://knowtator.sourceforge.net/index.shtml | |
| Protégé + Knowtator plugin | https://github.com/UCDenver-ccp/Knowtator-2.0 https://protege.stanford.edu/short-courses.php | |
| | http://deepdive.stanford.edu/labeling | |
| | https://github.com/SongRb/DeepDiveChineseApps | |
| | https://github.com/qiangsiwei/DeepDive_Chinese | |
| | https://github.com/jiesutd/SUTDAnnotator | |
| | https://github.com/HazyResearch/snorkel | |
| | https://bitbucket.org/dainkaplan/slate/ | |
| iepy | https://github.com/machinalis/iepy | Annotation and information extraction |
| doccano | https://github.com/chakki-works/doccano | |
| YEDDA | https://github.com/jiesutd/YEDDA | |
| Chinese-Annotator | https://github.com/deepwel/Chinese-Annotator | Chinese annotation tool that combines online and offline workflows. The idea is good, but the project is still incomplete. |
| HanLP | https://github.com/hankcs/HanLP | NLP toolkit written in Java (Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified/Traditional Chinese conversion) |
| poplar | https://github.com/synyi/poplar | Developed by the Chinese company Synyi (森亿) |
brat
brat configuration
Annotation configuration: annotation.conf
- [entities]
[entities]
Person
Location
Organization
- [relations]
Each argument is specified in the format “ARG:TYPE”.
[relations]
Family Arg1:Person, Arg2:Person
Employment Arg1:Person, Arg2:Organization
An argument can accept several types, separated by “|”.
[relations]
Located Arg1:Person, Arg2:Building|City|Country
Located Arg1:Building, Arg2:City|Country
Located Arg1:City, Arg2:Country
- [events]
Event arguments use the format “ROLE:TYPE”; the ROLE name can be chosen freely.
[events]
Marriage Participant1:Person, Participant2:Person
Bankruptcy Org:Company
- [attributes]
The scope of an attribute is given as “Arg:TYPE”; attributes can be used on relations and events.
An attribute with multiple possible values lists them as “Value:VAL1|VAL2|VAL3[…]”.
[attributes]
Negated Arg:<EVENT>
Confidence Arg:<EVENT>, Value:L1|L2|L3
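To check an annotation.conf before loading it into brat, the section layout described above can be read with a few lines of Python. This is a minimal sketch, not brat's own loader: the function name `parse_annotation_conf` is hypothetical, and it assumes only the constructs shown here (section headers, one definition per line, comma-separated “ROLE:TYPE” arguments, “|” between alternatives, `#` comment lines). Indentation-based type hierarchies and other advanced syntax are ignored.

```python
import re
from typing import Dict, List, Tuple

# One definition: (type name, {role: [allowed types or values]})
Entry = Tuple[str, Dict[str, List[str]]]


def parse_annotation_conf(text: str) -> Dict[str, List[Entry]]:
    """Group definitions by section name, e.g. 'entities' or 'relations'."""
    sections: Dict[str, List[Entry]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            # A repeated header keeps appending to the same section.
            current = sections.setdefault(header.group(1), [])
            continue
        if current is None:
            raise ValueError(f"definition outside of a section: {line!r}")
        parts = line.split(None, 1)
        args: Dict[str, List[str]] = {}
        if len(parts) == 2:
            for item in parts[1].split(","):
                role, _, types = item.strip().partition(":")
                args[role] = types.split("|")
        current.append((parts[0], args))
    return sections


SAMPLE_CONF = """\
[entities]
Person
Location
Organization
[relations]
Family Arg1:Person, Arg2:Person
Located Arg1:Person, Arg2:Building|City|Country
[events]
Marriage Participant1:Person, Participant2:Person
[attributes]
Negated Arg:<EVENT>
Confidence Arg:<EVENT>, Value:L1|L2|L3
"""

conf = parse_annotation_conf(SAMPLE_CONF)
for section, entries in conf.items():
    print(section, [name for name, _ in entries])
```

Keeping each section as a list rather than a dictionary preserves repeated names such as the three `Located` definitions above.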
brat annotation format (standoff)
http://brat.nlplab.org/standoff.html
T1 Organization 0 4 Sony
T2 MERGE-ORG 14 27 joint venture
T3 Organization 33 41 Ericsson
E1 MERGE-ORG:T2 Org1:T1 Org2:T3
T4 Country 75 81 Sweden
R1 Origin Arg1:T3 Arg2:T4
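The standoff lines above can be loaded into simple Python objects for downstream processing. The following is a rough sketch under stated assumptions: the names `TextBound`, `Link`, and `parse_standoff` are hypothetical, only T (text-bound), E (event), and R (relation) lines are handled, and discontinuous spans are not supported. Real .ann files separate the ID, the type with its offsets, and the covered text with tabs; splitting on any whitespace lets the same code read the space-separated rendering shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBound:
    """A T line: an entity or event trigger anchored to a character span."""
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Link:
    """An E or R line: a type plus a list of (role, target ID) arguments."""
    id: str
    type: str
    trigger: str  # ID of the trigger T line for events; "" for relations
    args: List[Tuple[str, str]]


def parse_standoff(lines):
    """Parse T, E and R lines; any other annotation kind is skipped."""
    spans: Dict[str, TextBound] = {}
    links: Dict[str, Link] = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("T"):
            # ID, type, start offset, end offset, then the covered text.
            ann_id, ann_type, start, end, text = line.split(None, 4)
            spans[ann_id] = TextBound(ann_id, ann_type, int(start), int(end), text)
        elif line.startswith(("E", "R")):
            ann_id, head, *rest = line.split()
            args = [tuple(item.split(":", 1)) for item in rest]
            if line.startswith("E"):
                # Events carry TYPE:TRIGGER as their first field.
                ann_type, trigger = head.split(":", 1)
            else:
                ann_type, trigger = head, ""
            links[ann_id] = Link(ann_id, ann_type, trigger, args)
    return spans, links


SAMPLE_ANN = """\
T1 Organization 0 4 Sony
T2 MERGE-ORG 14 27 joint venture
T3 Organization 33 41 Ericsson
E1 MERGE-ORG:T2 Org1:T1 Org2:T3
T4 Country 75 81 Sweden
R1 Origin Arg1:T3 Arg2:T4
"""

spans, links = parse_standoff(SAMPLE_ANN.splitlines())
for link in links.values():
    # Resolve each argument ID back to the text it covers.
    print(link.type, [(role, spans[target].text) for role, target in link.args])
```

In this example every argument points at a T line, so the lookup in `spans` succeeds; an event argument that refers to another event would need an extra lookup in `links`.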
Protégé & Knowtator
Build the Knowtator plugin:
git clone https://github.com/UCDenver-ccp/Knowtator-2.0.git
cd Knowtator-2.0
mvn clean install
cp xxx/plugins/knowtator-2.1.5.jar /Applications/Protégé.app/Contents/Java/plugins/
Restart Protégé.
iepy
NOTE
iepy's preprocess.py fails on macOS; the corenlp.sh script has to be modified as described below.
Preprocess not running under MacOS
Problems with the preprocess under MacOS? Apparently a change in the CoreNLP script is needed for it to run. You need to change the file corenlp.sh that is located in /Users/
/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/ and change scriptdir=`dirname $0`
to scriptdir=`dirname "$0"`
(i.e., add double quotes around $0)
https://iepy.readthedocs.io/en/stable/troubleshooting.html#troubleshooting
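The quoting presumably matters because the install path contains a space ("Application Support"): with an unquoted $0 the shell splits the path into two words, so dirname does not receive the script path intact. The edit can be made by hand; the sketch below applies the same replacement in Python. The helper names `quote_script_dir` and `patch_corenlp_script` are hypothetical, and the path to corenlp.sh has to be supplied by the caller.

```python
from pathlib import Path

UNQUOTED = "`dirname $0`"
QUOTED = '`dirname "$0"`'


def quote_script_dir(script_text: str) -> str:
    """Return the script text with $0 quoted inside the dirname call."""
    # An already patched script does not contain UNQUOTED, so this is idempotent.
    return script_text.replace(UNQUOTED, QUOTED)


def patch_corenlp_script(path: Path) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    original = path.read_text()
    patched = quote_script_dir(original)
    if patched == original:
        return False
    path.write_text(patched)
    return True


before = "scriptdir=`dirname $0`\n"
print(quote_script_dir(before), end="")
```

Calling `patch_corenlp_script(Path(...))` with the location of corenlp.sh on your machine applies the change to the file; running it a second time leaves the file untouched.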
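NLP Tools
Natural language processing: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin conversion, and Simplified/Traditional Chinese conversion.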
https://github.com/hankcs/HanLP