🏁 World's largest Chinese corpus (100 million characters); patch version +1, release v1.6.8
hankcs committed Aug 25, 2018
commit 2a071ec36fc32112e03f5ea93009a4a39acd0573
46 changes: 22 additions & 24 deletions README.md
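@@ -16,7 +16,7 @@ HanLP provides the following features:
 * Chinese word segmentation
 * HMM-Bigram (best balance of speed and accuracy; 100 MB of memory)
 * [Shortest-path segmentation](https://github.com/hankcs/HanLP#1-%E7%AC%AC%E4%B8%80%E4%B8%AAdemo), [N-shortest-path segmentation](https://github.com/hankcs/HanLP#5-n-%E6%9C%80%E7%9F%AD%E8%B7%AF%E5%BE%84%E5%88%86%E8%AF%8D)
-* Character-based word formation (accuracy-oriented, can recognize new words; suited to NLP tasks)
+* Character-based word formation (accuracy-oriented, world's largest corpus, can recognize new words; suited to NLP tasks)
 * [Perceptron segmentation](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6), [CRF segmentation](https://github.com/hankcs/HanLP#6-crf%E5%88%86%E8%AF%8D)
 * Dictionary-based segmentation (speed-oriented, tens of millions of characters per second; memory-efficient)
 * [Ultra-fast dictionary segmentation](https://github.com/hankcs/HanLP#7-%E6%9E%81%E9%80%9F%E8%AF%8D%E5%85%B8%E5%88%86%E8%AF%8D)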
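@@ -54,15 +54,15 @@ HanLP provides the following features:
 * Word-vector training, loading, word-similarity computation, semantic arithmetic, querying, KMeans clustering
 * Document semantic-similarity computation
 * [Corpus tools](https://github.com/hankcs/HanLP/tree/master/src/main/java/com/hankcs/hanlp/corpus)
-- The default models are trained on small corpora; users are encouraged to train their own. All modules provide [training interfaces](https://github.com/hankcs/HanLP/wiki); for corpora, see [OpenCorpus](https://github.com/hankcs/OpenCorpus).
+- Some default models are trained on small corpora; users are encouraged to train their own. All modules provide [training interfaces](https://github.com/hankcs/HanLP/wiki); for corpora, see [OpenCorpus](https://github.com/hankcs/OpenCorpus).

-While offering rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, exposes services statically, and publishes dictionaries in plain text, which makes it very easy to use. It also ships with some corpus-processing tools to help users train their own models.
+While offering rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, exposes services statically, and publishes dictionaries in plain text, which makes it very easy to use. The default models are trained on the world's largest Chinese corpus, and HanLP also ships with some corpus-processing tools to help users train their own models.

 ------

 ## Project homepage

-[Online demo](http://hanlp.hankcs.com/), [Python bindings](https://github.com/hankcs/pyhanlp), [Solr and Lucene plugin](https://github.com/hankcs/hanlp-lucene-plugin), [Download mirror in China](http://hanlp.dksou.com/HanLP.html), [More information](https://github.com/hankcs/HanLP/wiki).
+[Online demo](http://hanlp.hankcs.com/), [Python bindings](https://github.com/hankcs/pyhanlp), [Solr and Lucene plugin](https://github.com/hankcs/hanlp-lucene-plugin), [Citing the paper](https://github.com/hankcs/HanLP/wiki/%E8%AE%BA%E6%96%87%E5%BC%95%E7%94%A8), [More information](https://github.com/hankcs/HanLP/wiki).

 ------
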
@@ -76,7 +76,7 @@ HanLP provides the following features:
 <dependency>
     <groupId>com.hankcs</groupId>
     <artifactId>hanlp</artifactId>
-    <version>portable-1.6.7</version>
+    <version>portable-1.6.8</version>
 </dependency>
 ```

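@@ -110,15 +110,15 @@ HanLP's data is divided into *dictionaries* and *models*; *dictionaries* are required for lexical analysis

 to be the **parent directory** of data. For example, if the data directory is `/Users/hankcs/Documents/data`, then `root=/Users/hankcs/Documents/`.

-Finally, put `hanlp.properties` on the classpath. For any project, it can be placed in the src or resources directory, and the IDE will copy it to the classpath automatically at compile time. Besides the configuration file, the environment variable `HANLP_ROOT` can also be used to set `root`.
+Finally, put `hanlp.properties` on the classpath. For most projects, it can be placed in the src or resources directory, and the IDE will copy it to the classpath automatically at compile time. Besides the configuration file, the environment variable `HANLP_ROOT` can also be used to set `root`. For Android projects, see the [demo](https://github.com/hankcs/HanLPAndroidDemo)

 If the file is misplaced, HanLP reports a suitable path for the current environment and tries to read the dataset from the project root directory.

 ## Usage

 Almost all of HanLP's functionality can be called through the utility class `HanLP`. When you cannot remember how to call something, just type `HanLP.`; the IDE should offer completions and show HanLP's thorough documentation.

-All demos live under [com.hankcs.demo](https://github.com/hankcs/HanLP/tree/master/src/test/java/com/hankcs/demo); they cover more details than the documentation and are updated more promptly. **Running them once is strongly recommended**.
+All demos live under [com.hankcs.demo](https://github.com/hankcs/HanLP/tree/master/src/test/java/com/hankcs/demo); they cover more details than the documentation and are updated more promptly. **Running them once is strongly recommended**. Only some commonly used interfaces are listed here.

 ### 1. First demo
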
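@@ -155,8 +155,8 @@ System.out.println(NLPTokenizer.analyze("我的希望是希望张晚霞的背影
 System.out.println(NLPTokenizer.analyze("支援臺灣正體香港繁體:微软公司於1975年由比爾·蓋茲和保羅·艾倫創立。"));
 ```
 - Notes
-* NLP segmentation with `NLPTokenizer` performs all named entity recognition and part-of-speech tagging
-* The default models are trained on the [revised Microsoft Research corpus](https://github.com/hankcs/OpenCorpus/tree/master/msra-ne) or the [revised People's Daily corpus from January 1998](https://github.com/hankcs/OpenCorpus/tree/master/pku98) (only 1.83 million characters). Corpus size determines real-world performance; a production corpus should be on the order of tens of millions of characters. Users are welcome to [train new models](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6) on their own corpora to adapt to new domains and recognize new named entities.
+* NLP segmentation with `NLPTokenizer` performs part-of-speech tagging and named entity recognition, backed by the [structured perceptron sequence labeling framework](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)
+* The default models are trained on a large general corpus of 99.7 million characters, the **world's largest** known Chinese word segmentation corpus. Corpus size determines real-world performance; a production corpus should be on the order of tens of millions of characters. Users are welcome to [train new models](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6) on their own corpora to adapt to new domains and recognize new named entities.

 ### 4. Index segmentation
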
@@ -194,23 +194,21 @@ for (String sentence : testCase)
 ### 6. CRF segmentation

 ```java
-Segment segment = new CRFSegment();
-segment.enablePartOfSpeechTagging(true);
-List<Term> termList = segment.seg("你看过穆赫兰道吗");
-System.out.println(termList);
-for (Term term : termList)
-{
-    if (term.nature == null)
-    {
-        System.out.println("识别到新词:" + term.word);
-    }
-}
+CRFLexicalAnalyzer analyzer = new CRFLexicalAnalyzer();
+String[] tests = new String[]{
+    "商品和服务",
+    "上海华安工业(集团)公司董事长谭旭光和秘书胡花蕊来到美国纽约现代艺术博物馆参观",
+    "微软公司於1975年由比爾·蓋茲和保羅·艾倫創立,18年啟動以智慧雲端、前端為導向的大改組。" // Traditional Chinese is supported
+};
+for (String sentence : tests)
+{
+    System.out.println(analyzer.analyze(sentence));
+}
 ```
 - Notes
 * CRF is very good at recognizing new words, but its overhead is relatively high.
 - Algorithm details
-* [Pure Java implementation of CRF segmentation](http://www.hankcs.com/nlp/segment/crf-segmentation-of-the-pure-java-implementation.html)
-* [CRF++ model format description](http://www.hankcs.com/nlp/the-crf-model-format-description.html)
+* [CRF Chinese word segmentation, part-of-speech tagging, and named entity recognition](https://github.com/hankcs/HanLP/wiki/CRF%E8%AF%8D%E6%B3%95%E5%88%86%E6%9E%90)

 ### 7. Ultra-fast dictionary segmentation

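@@ -282,8 +280,8 @@ public class DemoCustomDictionary
 }
 ```
 - Notes
-* `CustomDictionary` is a global user-defined dictionary; entries can be added or removed at any time, and it affects all segmenters.
-* It can also be disabled in any segmenter. Additions and removals made dynamically in code are not saved to the dictionary file
+* `CustomDictionary` is a global user-defined dictionary; entries can be added or removed at any time, and it affects all segmenters. It can also be disabled in any segmenter. Additions and removals made dynamically in code are not saved to the dictionary file.
+* Chinese word segmentation ≠ dictionary; a dictionary alone cannot solve Chinese word segmentation. `Segment` provides high and low priority levels for different scenarios; see the [FAQ](https://github.com/hankcs/HanLP/wiki/FAQ#%E4%B8%BA%E4%BB%80%E4%B9%88%E4%BF%AE%E6%94%B9%E4%BA%86%E8%AF%8D%E5%85%B8%E8%BF%98%E6%98%AF%E6%B2%A1%E6%9C%89%E6%95%88%E6%9E%9C)
 - Appending dictionaries
 * The main `CustomDictionary` text file is at `data/dictionary/custom/CustomDictionary.txt`; users can add their own words there (not recommended), or create a separate text file and append it through the configuration entry `CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 我的词典.txt;` (recommended).
 * It is always advisable to put words with the same part of speech in the same dictionary file, which makes them easier to maintain and share.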
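A note on the `root` setting touched by the README hunks above: the text says `root` should be the parent directory of `data`, with a trailing slash. As a minimal sketch of that rule (this is not HanLP code; `RootPathSketch` and `rootFor` are hypothetical names, and the sketch assumes `/`-separated paths), the value to write into `hanlp.properties` could be derived like this:

```java
final class RootPathSketch {
    /**
     * Returns the parent directory of the given data directory, with a trailing slash,
     * i.e. the value one would write after "root=" in hanlp.properties.
     * Hypothetical helper for illustration only; assumes '/'-separated paths.
     */
    static String rootFor(String dataDir) {
        String trimmed = dataDir.replaceAll("/+$", ""); // drop any trailing slashes
        int lastSlash = trimmed.lastIndexOf('/');
        if (lastSlash < 0) {
            throw new IllegalArgumentException("Expected a path with a parent directory: " + dataDir);
        }
        return trimmed.substring(0, lastSlash + 1); // keep the trailing slash
    }

    public static void main(String[] args) {
        // Uses the example path from the README text.
        System.out.println("root=" + rootFor("/Users/hankcs/Documents/data"));
    }
}
```

For the README's own example, `/Users/hankcs/Documents/data` yields `/Users/hankcs/Documents/`.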
1 change: 0 additions & 1 deletion data/dictionary/custom/CustomDictionary.txt
@@ -1015,7 +1015,6 @@
 市惠 v 1
 布警 v 1
 希世 nz 1
-希望 v 7685 vn 616
 帕金森综合征 v 1
 带手儿 d 1
 带音 v 3
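The entries in this hunk appear to follow a `word nature frequency [nature frequency ...]` layout, for example the removed `希望 v 7685 vn 616`. A minimal parsing sketch under that assumption follows. It is not HanLP's dictionary loader: `DictionaryLineSketch` and its methods are hypothetical names, and words that contain whitespace are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DictionaryLineSketch {
    /** Returns the word, i.e. the first whitespace-separated field of the line. */
    static String parseWord(String line) {
        return line.trim().split("\\s+")[0];
    }

    /**
     * Parses the (nature, frequency) pairs that follow the word,
     * preserving their order in the line.
     */
    static Map<String, Integer> parseNatures(String line) {
        String[] parts = line.trim().split("\\s+");
        // Need the word plus at least one complete (nature, frequency) pair.
        if (parts.length < 3 || (parts.length - 1) % 2 != 0) {
            throw new IllegalArgumentException("Malformed dictionary line: " + line);
        }
        Map<String, Integer> natures = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i += 2) {
            natures.put(parts[i], Integer.parseInt(parts[i + 1]));
        }
        return natures;
    }

    public static void main(String[] args) {
        String line = "希望 v 7685 vn 616"; // the entry removed by this commit
        System.out.println(parseWord(line) + " -> " + parseNatures(line));
    }
}
```

Read this way, the removed entry lists 希望 as a verb (`v`, 7685) and as `vn` (616). That matches the `DemoNLPSegment` comment in this commit, which asks the reader to compare the parts of speech of the two occurrences of 希望.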
2 changes: 1 addition & 1 deletion pom.xml
@@ -4,7 +4,7 @@

     <groupId>com.hankcs</groupId>
     <artifactId>hanlp</artifactId>
-    <version>1.6.7</version>
+    <version>1.6.8</version>

     <name>HanLP</name>
     <url>http://www.hankcs.com/</url>
4 changes: 3 additions & 1 deletion src/main/java/com/hankcs/hanlp/HanLP.java
@@ -167,6 +167,8 @@ public static final class Config
         public static String CRFSegmentModelPath = "data/model/segment/CRFSegmentModel.txt";
         /**
          * HMM segmentation model
+         *
+         * @deprecated Deprecated; use {@link PerceptronLexicalAnalyzer} instead
          */
         public static String HMMSegmentModelPath = "data/model/segment/HMMSegmentModel.bin";
         /**
@@ -184,7 +186,7 @@ public static final class Config
         /**
          * Perceptron segmentation model
          */
-        public static String PerceptronCWSModelPath = "data/model/perceptron/msra/cws.bin";
+        public static String PerceptronCWSModelPath = "data/model/perceptron/large/cws.bin";
         /**
          * Perceptron part-of-speech tagging model
          */
1 change: 1 addition & 0 deletions src/test/java/com/hankcs/demo/DemoNLPSegment.java
@@ -26,6 +26,7 @@ public class DemoNLPSegment extends TestUtility
 {
     public static void main(String[] args)
     {
+        NLPTokenizer.ANALYZER.enableCustomDictionary(false); // Chinese word segmentation ≠ dictionary; segmentation works without one.
         System.out.println(NLPTokenizer.segment("我新造一个词叫幻想乡你能识别并正确标注词性吗?")); // “正确” is an adverbial adjective.
         // Note the parts of speech of the two “希望” and the two “晚霞” below
         System.out.println(NLPTokenizer.analyze("我的希望是希望张晚霞的背影被晚霞映红").translateLabels());
[file name not shown]
@@ -29,8 +29,9 @@ public static void main(String[] args)
                 "我在上海林原科技有限公司兼职工作,",
                 "我经常在台川喜宴餐厅吃饭,",
                 "偶尔去开元地中海影城看电影。",
+                "不用词典,福哈生态工程有限公司是动态识别的结果。",
         };
-        Segment segment = HanLP.newSegment().enableOrganizationRecognize(true);
+        Segment segment = HanLP.newSegment().enableCustomDictionary(false).enableOrganizationRecognize(true);
         for (String sentence : testCase)
         {
             List<Term> termList = segment.seg(sentence);
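[file name not shown]
@@ -17,7 +17,9 @@
 import java.io.IOException;

 /**
- * A lexical analyzer based on perceptron sequence labeling. The default model is trained on the January portion of a personally revised 1998 People's Daily corpus, which has only 1.83 million characters.
+ * A lexical analyzer based on perceptron sequence labeling, with several models to choose from.
+ * - large is trained on a large general corpus of 100 million characters, the world's largest known Chinese word segmentation corpus.
+ * - pku199801 is trained on the January portion of a personally revised 1998 People's Daily corpus, which has only 1.83 million characters.
  * Corpus size determines real-world performance; a production corpus should be on the order of tens of millions of characters. Users are welcome to train new models on their own corpora to adapt to new domains and recognize new named entities.
  * Whatever corpus the model is trained on, simplified/traditional, full-width/half-width, and upper/lower case are all fully supported.
  *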
2 changes: 1 addition & 1 deletion src/test/java/com/hankcs/hanlp/seg/SegmentTest.java
@@ -486,7 +486,7 @@ public void testIssue784() throws Exception
         String s = "苏苏中级会计什么时候更新";
         CustomDictionary.add("苏苏");
         StandardTokenizer.SEGMENT.enableCustomDictionaryForcing(true);
-        assertEquals("[苏苏/nz, 中级会计/nz, 什么/ry, 时候/n, 更新/v]", HanLP.segment(s).toString());
+        assertTrue(HanLP.segment(s).toString().contains("苏苏"));
     }

     public void testIssue790() throws Exception