diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md deleted file mode 100644 index 36e02cda4..000000000 --- a/.github/ISSUE_TEMPLATE.md +++ /dev/null @@ -1,65 +0,0 @@ - - -## 注意事项 -请确认下列注意事项: - -* 我已仔细阅读下列文档,都没有找到答案: - - [首页文档](https://github.com/hankcs/HanLP) - - [wiki](https://github.com/hankcs/HanLP/wiki) - - [常见问题](https://github.com/hankcs/HanLP/wiki/FAQ) -* 我已经通过[Google](https://www.google.com/#newwindow=1&q=HanLP)和[issue区检索功能](https://github.com/hankcs/HanLP/issues)搜索了我的问题,也没有找到答案。 -* 我明白开源社区是出于兴趣爱好聚集起来的自由社区,不承担任何责任或义务。我会礼貌发言,向每一个帮助我的人表示感谢。 -* [ ] 我在此括号内输入x打钩,代表上述事项确认完毕。 - -## 版本号 - - -当前最新版本号是: -我使用的版本是: - - - -## 我的问题 - - - -## 复现问题 - - -### 步骤 - -1. 首先…… -2. 然后…… -3. 接着…… - -### 触发代码 - -``` - public void testIssue1234() throws Exception - { - CustomDictionary.add("用户词语"); - System.out.println(StandardTokenizer.segment("触发问题的句子")); - } -``` -### 期望输出 - - - -``` -期望输出 -``` - -### 实际输出 - - - -``` -实际输出 -``` - -## 其他信息 - - - diff --git a/.github/bug_report.md b/.github/bug_report.md new file mode 100644 index 000000000..ae3589912 --- /dev/null +++ b/.github/bug_report.md @@ -0,0 +1,44 @@ +--- +name: 🐛发现一个bug +about: 需提交版本号、触发代码、错误日志 +title: '' +labels: bug +assignees: hankcs + +--- + + + +**Describe the bug** +A clear and concise description of what the bug is. + +**Code to reproduce the issue** +Provide a reproducible test case that is the bare minimum necessary to generate the problem. + +```python +``` + +**Describe the current behavior** +A clear and concise description of what happened. + +**Expected behavior** +A clear and concise description of what you expected to happen. + +**System information** +- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): +- Python/Java version: +- HanLP version: + +**Other info / logs** +Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. 
Large logs and files should be attached. + +* [ ] I've completed this form and searched the web for solutions. + + + \ No newline at end of file diff --git a/.github/config.yml b/.github/config.yml new file mode 100755 index 000000000..0180a0e49 --- /dev/null +++ b/.github/config.yml @@ -0,0 +1,5 @@ +blank_issues_enabled: false +contact_links: + - name: ⁉️ 提问求助请上论坛 + url: https://bbs.hanlp.com/ + about: 欢迎前往中文社区求助 diff --git a/.github/feature_request.md b/.github/feature_request.md new file mode 100644 index 000000000..af4b92452 --- /dev/null +++ b/.github/feature_request.md @@ -0,0 +1,36 @@ +--- +name: 🚀新功能请愿 +about: 建议增加一个新功能 +title: '' +labels: feature request +assignees: hankcs + +--- + + + +**Describe the feature and the current behavior/state.** + +**Will this change the current api? How?** + +**Who will benefit with this feature?** + +**Are you willing to contribute it (Yes/No):** + +**System information** +- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): +- Python/Java version: +- HanLP version: + +**Any other info** + +* [ ] I've carefully completed this form. 
+ + + \ No newline at end of file diff --git a/README.md b/README.md index ec1130891..b90a91473 100644 --- a/README.md +++ b/README.md @@ -2,79 +2,69 @@ HanLP: Han Language Processing ===== 汉语言处理包 -[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.hankcs/hanlp/badge.svg)](https://maven-badges.herokuapp.com/maven-central/com.hankcs/hanlp/) +[![Maven Central](https://img.shields.io/maven-central/v/com.hankcs/hanlp?label=maven)](https://mvnrepository.com/artifact/com.hankcs/hanlp) [![GitHub release](https://img.shields.io/github/release/hankcs/HanLP.svg)](https://github.com/hankcs/hanlp/releases) [![License](https://img.shields.io/badge/license-Apache%202-4EB1BA.svg)](https://www.apache.org/licenses/LICENSE-2.0.html) [![Docker Stars](https://img.shields.io/docker/stars/samurais/hanlp-api.svg?maxAge=2592000)](https://hub.docker.com/r/samurais/hanlp-api/) ------ -**HanLP**是由一系列模型与算法组成的Java工具包,目标是普及自然语言处理在生产环境中的应用。**HanLP**具备功能完善、性能高效、架构清晰、语料时新、可自定义的特点。 +HanLP是一系列模型与算法组成的NLP工具包,目标是普及自然语言处理在生产环境中的应用。HanLP具备功能完善、性能高效、架构清晰、语料时新、可自定义的特点。内部算法经过工业界和学术界考验,配套书籍[《自然语言处理入门》](http://nlp.hankcs.com/book.php)已经出版。目前,基于深度学习的[HanLP 2.x](https://github.com/hankcs/HanLP/tree/doc-zh)已正式发布,次世代最先进的NLP技术,支持包括简繁中英日俄法德在内的104种语言上的联合任务。如果您在研究中使用了HanLP,请引用我们的[EMNLP论文](https://aclanthology.org/2021.emnlp-main.451/)。 -**HanLP**提供下列功能: +HanLP提供下列功能: * 中文分词 - * 最短路分词 - * N-最短路分词 - * CRF分词 - * 索引分词 - * 极速词典分词 - * 用户自定义词典 + * HMM-Bigram(速度与精度最佳平衡;一百兆内存) + * [最短路分词](https://github.com/hankcs/HanLP/tree/1.x#1-%E7%AC%AC%E4%B8%80%E4%B8%AAdemo)、[N-最短路分词](https://github.com/hankcs/HanLP/tree/1.x#5-n-%E6%9C%80%E7%9F%AD%E8%B7%AF%E5%BE%84%E5%88%86%E8%AF%8D) + * 由字构词(侧重精度,全世界最大语料库,可识别新词;适合NLP任务) + * [感知机分词](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)、[CRF分词](https://github.com/hankcs/HanLP/tree/1.x#6-crf%E5%88%86%E8%AF%8D) + 
* 词典分词(侧重速度,每秒数千万字符;省内存) + * [极速词典分词](https://github.com/hankcs/HanLP/tree/1.x#7-%E6%9E%81%E9%80%9F%E8%AF%8D%E5%85%B8%E5%88%86%E8%AF%8D) + * 所有分词器都支持: + * [索引全切分模式](https://github.com/hankcs/HanLP/tree/1.x#4-%E7%B4%A2%E5%BC%95%E5%88%86%E8%AF%8D) + * [用户自定义词典](https://github.com/hankcs/HanLP/tree/1.x#8-%E7%94%A8%E6%88%B7%E8%87%AA%E5%AE%9A%E4%B9%89%E8%AF%8D%E5%85%B8) + * [兼容繁体中文](https://github.com/hankcs/HanLP/blob/1.x/src/test/java/com/hankcs/demo/DemoPerceptronLexicalAnalyzer.java#L29) + * [训练用户自己的领域模型](https://github.com/hankcs/HanLP/wiki) * 词性标注 + * [HMM词性标注](https://github.com/hankcs/HanLP/blob/1.x/src/main/java/com/hankcs/hanlp/seg/Segment.java#L584)(速度快) + * [感知机词性标注](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)、[CRF词性标注](https://github.com/hankcs/HanLP/wiki/CRF%E8%AF%8D%E6%B3%95%E5%88%86%E6%9E%90)(精度高) * 命名实体识别 - * 中国人名识别 - * 音译人名识别 - * 日本人名识别 - * 地名识别 - * 实体机构名识别 + * 基于HMM角色标注的命名实体识别 (速度快) + * [中国人名识别](https://github.com/hankcs/HanLP/tree/1.x#9-%E4%B8%AD%E5%9B%BD%E4%BA%BA%E5%90%8D%E8%AF%86%E5%88%AB)、[音译人名识别](https://github.com/hankcs/HanLP/tree/1.x#10-%E9%9F%B3%E8%AF%91%E4%BA%BA%E5%90%8D%E8%AF%86%E5%88%AB)、[日本人名识别](https://github.com/hankcs/HanLP/tree/1.x#11-%E6%97%A5%E6%9C%AC%E4%BA%BA%E5%90%8D%E8%AF%86%E5%88%AB)、[地名识别](https://github.com/hankcs/HanLP/tree/1.x#12-%E5%9C%B0%E5%90%8D%E8%AF%86%E5%88%AB)、[实体机构名识别](https://github.com/hankcs/HanLP/tree/1.x#13-%E6%9C%BA%E6%9E%84%E5%90%8D%E8%AF%86%E5%88%AB) + * 基于线性模型的命名实体识别(精度高) + * [感知机命名实体识别](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)、[CRF命名实体识别](https://github.com/hankcs/HanLP/wiki/CRF%E8%AF%8D%E6%B3%95%E5%88%86%E6%9E%90) * 关键词提取 - * 
TextRank关键词提取 + * [TextRank关键词提取](https://github.com/hankcs/HanLP/tree/1.x#14-%E5%85%B3%E9%94%AE%E8%AF%8D%E6%8F%90%E5%8F%96) * 自动摘要 - * TextRank自动摘要 + * [TextRank自动摘要](https://github.com/hankcs/HanLP/tree/1.x#15-%E8%87%AA%E5%8A%A8%E6%91%98%E8%A6%81) * 短语提取 - * 基于互信息和左右信息熵的短语提取 -* 拼音转换 - * 多音字 - * 声母 - * 韵母 - * 声调 -* 简繁转换 - * 繁体中文分词 + * [基于互信息和左右信息熵的短语提取](https://github.com/hankcs/HanLP/tree/1.x#16-%E7%9F%AD%E8%AF%AD%E6%8F%90%E5%8F%96) +* [拼音转换](https://github.com/hankcs/HanLP/tree/1.x#17-%E6%8B%BC%E9%9F%B3%E8%BD%AC%E6%8D%A2) + * 多音字、声母、韵母、声调 +* [简繁转换](https://github.com/hankcs/HanLP/tree/1.x#18-%E7%AE%80%E7%B9%81%E8%BD%AC%E6%8D%A2) * 简繁分歧词(简体、繁体、臺灣正體、香港繁體) -* 文本推荐 - * 语义推荐 - * 拼音推荐 - * 字词推荐 +* [文本推荐](https://github.com/hankcs/HanLP/tree/1.x#19-%E6%96%87%E6%9C%AC%E6%8E%A8%E8%8D%90) + * 语义推荐、拼音推荐、字词推荐 * 依存句法分析 - * 基于神经网络的高性能依存句法分析器 - * MaxEnt依存句法分析 - * CRF依存句法分析 -* 文本分类 - * 情感分析 -* word2vec + * [基于神经网络的高性能依存句法分析器](https://github.com/hankcs/HanLP/tree/1.x#21-%E4%BE%9D%E5%AD%98%E5%8F%A5%E6%B3%95%E5%88%86%E6%9E%90) + * [基于ArcEager转移系统的柱搜索依存句法分析器](https://github.com/hankcs/HanLP/blob/1.x/src/test/java/com/hankcs/demo/DemoDependencyParser.java#L34) +* [文本分类](https://github.com/hankcs/HanLP/wiki/%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E4%B8%8E%E6%83%85%E6%84%9F%E5%88%86%E6%9E%90) + * [情感分析](https://github.com/hankcs/HanLP/wiki/%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E4%B8%8E%E6%83%85%E6%84%9F%E5%88%86%E6%9E%90#%E6%83%85%E6%84%9F%E5%88%86%E6%9E%90) +* [文本聚类](https://github.com/hankcs/HanLP/wiki/%E6%96%87%E6%9C%AC%E8%81%9A%E7%B1%BB) + - KMeans、Repeated Bisection、自动推断聚类数目k +* [word2vec](https://github.com/hankcs/HanLP/wiki/word2vec) * 词向量训练、加载、词语相似度计算、语义运算、查询、KMeans聚类 * 文档语义相似度计算 -* 语料库工具 - * 分词语料预处理 - * 词频词性词典制作 - * BiGram统计 - * 词共现统计 - * CoNLL语料预处理 - * CoNLL UA/LA/DA评测工具 +* 
[语料库工具](https://github.com/hankcs/HanLP/tree/1.x/src/main/java/com/hankcs/hanlp/corpus) + - 部分默认模型训练自小型语料库,鼓励用户自行训练。所有模块提供[训练接口](https://github.com/hankcs/HanLP/wiki),语料可参考[98年人民日报语料库](http://file.hankcs.com/corpus/pku98.zip)。 -在提供丰富功能的同时,**HanLP**内部模块坚持低耦合、模型坚持惰性加载、服务坚持静态提供、词典坚持明文发布,使用非常方便,同时自带一些语料处理工具,帮助用户训练自己的模型。 +在提供丰富功能的同时,HanLP内部模块坚持低耦合、模型坚持惰性加载、服务坚持静态提供、词典坚持明文发布,使用非常方便。默认模型训练自全世界最大规模的中文语料库,同时自带一些语料处理工具,帮助用户训练自己的模型。 ------ ## 项目主页 -HanLP下载地址:https://github.com/hankcs/HanLP/releases - -国内下载地址:http://hanlp.dksou.com/HanLP.html - -Solr、Lucene插件:https://github.com/hankcs/hanlp-solr-plugin - -更多细节:https://github.com/hankcs/HanLP/wiki +[《自然语言处理入门》🔥](http://nlp.hankcs.com/book.php)、[随书代码](https://github.com/hankcs/HanLP/tree/v1.7.5/src/test/java/com/hankcs/book)、[在线演示](http://hanlp.com/)、[Python调用](https://github.com/hankcs/pyhanlp)、[Solr及Lucene插件](https://github.com/hankcs/hanlp-lucene-plugin)、[论坛](https://bbs.hankcs.com/)、[论文引用](https://github.com/hankcs/HanLP/wiki/papers)、[更多信息](https://github.com/hankcs/HanLP/wiki)。 ------ @@ -84,70 +74,53 @@ Solr、Lucene插件:https://github.com/hankcs/hanlp-solr-plugin 为了方便用户,特提供内置了数据包的Portable版,只需在pom.xml加入: -``` +```xml com.hankcs hanlp - portable-1.5.3 + portable-1.8.6 ``` -零配置,即可使用基本功能(除CRF分词、依存句法分析外的全部功能)。如果用户有自定义的需求,可以参考方式二,使用hanlp.properties进行配置。 +零配置,即可使用基本功能(除由字构词、依存句法分析外的全部功能)。如果用户有自定义的需求,可以参考方式二,使用hanlp.properties进行配置(Portable版同样支持hanlp.properties)。 ### 方式二、下载jar、data、hanlp.properties -**HanLP**将数据与程序分离,给予用户自定义的自由。 +HanLP将数据与程序分离,给予用户自定义的自由。 -#### 1、下载jar - -[hanlp.jar](https://github.com/hankcs/HanLP/releases) - -#### 2、下载data - -| 数据包 | 功能 | 体积(MB) | -| -------- | -----: | :----: | -| [data.zip](https://github.com/hankcs/HanLP/releases) | 全部 | 255 | +#### 1、下载:[data.zip](http://nlp.hankcs.com/download.php?file=data) 
下载后解压到任意目录,接下来通过配置文件告诉HanLP数据包的位置。 -**HanLP**中的数据分为*词典*和*模型*,其中*词典*是词法分析必需的,*模型*是句法分析必需的。 +HanLP中的数据分为*词典*和*模型*,其中*词典*是词法分析必需的,*模型*是句法分析必需的。 data │ ├─dictionary └─model -用户可以自行增删替换,如果不需要句法分析功能的话,随时可以删除model文件夹。 +用户可以自行增删替换,如果不需要句法分析等功能的话,随时可以删除model文件夹。 + - 模型跟词典没有绝对的区别,隐马模型被做成人人都可以编辑的词典形式,不代表它不是模型。 - GitHub代码库中已经包含了data.zip中的词典,直接编译运行自动缓存即可;模型则需要额外下载。 -#### 3、配置文件 -示例配置文件:[hanlp.properties](https://github.com/hankcs/HanLP/releases) -在GitHub的发布页中,```hanlp.properties```一般和```jar```打包在同一个```zip```包中。 +#### 2、下载jar和配置文件:[hanlp-release.zip](http://nlp.hankcs.com/download.php?file=jar) 配置文件的作用是告诉HanLP数据包的位置,只需修改第一行 - root=usr/home/HanLP/ + root=D:/JavaProjects/HanLP/ 为data的**父目录**即可,比如data目录是`/Users/hankcs/Documents/data`,那么`root=/Users/hankcs/Documents/` 。 -- 如果选用mini词典的话,则需要修改配置文件: -``` -CoreDictionaryPath=data/dictionary/CoreNatureDictionary.mini.txt -BiGramDictionaryPath=data/dictionary/CoreNatureDictionary.ngram.mini.txt -``` - -最后将`hanlp.properties`放入classpath即可,对于任何项目,都可以放到src或resources目录下,编译时IDE会自动将其复制到classpath中。 +最后将`hanlp.properties`放入classpath即可,对于多数项目,都可以放到src或resources目录下,编译时IDE会自动将其复制到classpath中。除了配置文件外,还可以使用环境变量`HANLP_ROOT`来设置`root`。安卓项目请参考[demo](https://github.com/hankcs/HanLPAndroidDemo)。 如果放置不当,HanLP会提示当前环境下的合适路径,并且尝试从项目根目录读取数据集。 ## 调用方法 -**HanLP**几乎所有的功能都可以通过工具类`HanLP`快捷调用,当你想不起来调用方法时,只需键入`HanLP.`,IDE应当会给出提示,并展示**HanLP**完善的文档。 - -*推荐用户始终通过工具类`HanLP`调用,这么做的好处是,将来**HanLP**升级后,用户无需修改调用代码。* +HanLP几乎所有的功能都可以通过工具类`HanLP`快捷调用,当你想不起来调用方法时,只需键入`HanLP.`,IDE应当会给出提示,并展示HanLP完善的文档。 -所有Demo都位于[com.hankcs.demo](https://github.com/hankcs/HanLP/tree/master/src/test/java/com/hankcs/demo)下,比文档覆盖了更多细节,强烈建议运行一遍。 +所有Demo都位于[com.hankcs.demo](https://github.com/hankcs/HanLP/tree/1.x/src/test/java/com/hankcs/demo)下,比文档覆盖了更多细节,更新更及时,**强烈建议运行一遍**。此处仅列举部分常用接口。 ### 1. 
第一个Demo @@ -155,10 +128,10 @@ BiGramDictionaryPath=data/dictionary/CoreNatureDictionary.ngram.mini.txt System.out.println(HanLP.segment("你好,欢迎使用HanLP汉语处理包!")); ``` - 内存要求 - * **HanLP**对词典的数据结构进行了长期的优化,可以应对绝大多数场景。哪怕**HanLP**的词典上百兆也无需担心,因为在内存中被精心压缩过。 - * 如果内存非常有限,请使用小词典。**HanLP**默认使用大词典,同时提供小词典,请参考配置文件章节。 -- 写给正在编译**HanLP**的开发者 - * 如果你正在编译运行从Github检出的**HanLP**代码,并且没有下载data缓存,那么首次加载词典/模型会发生一个*自动缓存*的过程。 + * 内存120MB以上(-Xms120m -Xmx120m -Xmn64m),标准数据包(35万核心词库+默认用户词典),分词测试正常。全部词典和模型都是惰性加载的,不使用的模型相当于不存在,可以自由删除。 + * HanLP对词典的数据结构进行了长期的优化,可以应对绝大多数场景。哪怕HanLP的词典上百兆也无需担心,因为在内存中被精心压缩过。如果内存非常有限,请使用小词典。HanLP默认使用大词典,同时提供小词典,请参考配置文件章节。 +- 写给正在编译HanLP的开发者 + * 如果你正在编译运行从Github检出的HanLP代码,并且没有下载data缓存,那么首次加载词典/模型会发生一个*自动缓存*的过程。 * *自动缓存*的目的是为了加速词典载入速度,在下次载入时,缓存的词典文件会带来毫秒级的加载速度。由于词典体积很大,*自动缓存*会耗费一些时间,请耐心等待。 * *自动缓存*缓存的不是明文词典,而是双数组Trie树、DAWG、AhoCorasickDoubleArrayTrie等数据结构。 @@ -169,7 +142,7 @@ List termList = StandardTokenizer.segment("商品和服务"); System.out.println(termList); ``` - 说明 - * **HanLP**中有一系列“开箱即用”的静态分词器,以`Tokenizer`结尾,在接下来的例子中会继续介绍。 + * HanLP中有一系列“开箱即用”的静态分词器,以`Tokenizer`结尾,在接下来的例子中会继续介绍。 * `HanLP.segment`其实是对`StandardTokenizer.segment`的包装。 * 分词结果包含词性,每个词性的意思请查阅[《HanLP词性标注集》](http://www.hankcs.com/nlp/part-of-speech-tagging.html#h2-8)。 - 算法详解 @@ -178,11 +151,14 @@ System.out.println(termList); ### 3. 
NLP分词 ```java -List termList = NLPTokenizer.segment("中国科学院计算技术研究所的宗成庆教授正在教授自然语言处理课程"); -System.out.println(termList); +System.out.println(NLPTokenizer.segment("我新造一个词叫幻想乡你能识别并标注正确词性吗?")); +// 注意观察下面两个“希望”的词性、两个“晚霞”的词性 +System.out.println(NLPTokenizer.analyze("我的希望是希望张晚霞的背影被晚霞映红").translateLabels()); +System.out.println(NLPTokenizer.analyze("支援臺灣正體香港繁體:微软公司於1975年由比爾·蓋茲和保羅·艾倫創立。")); ``` - 说明 - * NLP分词`NLPTokenizer`会执行全部命名实体识别和词性标注。 + * NLP分词`NLPTokenizer`会执行词性标注和命名实体识别,由[结构化感知机序列标注框架](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)支撑。 + * 默认模型训练自`9970`万字的大型综合语料库,是已知范围内**全世界最大**的中文分词语料库。语料库规模决定实际效果,面向生产环境的语料库应当在千万字量级。欢迎用户在自己的语料上[训练新模型](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)以适应新领域、识别新的命名实体。 ### 4. 索引分词 @@ -195,6 +171,7 @@ for (Term term : termList) ``` - 说明 * 索引分词`IndexTokenizer`是面向搜索引擎的分词器,能够对长词全切分,另外通过`term.offset`可以获取单词在文本中的偏移量。 + * 任何分词器都可以通过基类`Segment`的`enableIndexMode`方法激活索引模式。 ### 5. N-最短路径分词 @@ -219,23 +196,21 @@ for (String sentence : testCase) ### 6. 
CRF分词 ```java -Segment segment = new CRFSegment(); -segment.enablePartOfSpeechTagging(true); -List termList = segment.seg("你看过穆赫兰道吗"); -System.out.println(termList); -for (Term term : termList) -{ - if (term.nature == null) - { - System.out.println("识别到新词:" + term.word); - } -} + CRFLexicalAnalyzer analyzer = new CRFLexicalAnalyzer(); + String[] tests = new String[]{ + "商品和服务", + "上海华安工业(集团)公司董事长谭旭光和秘书胡花蕊来到美国纽约现代艺术博物馆参观", + "微软公司於1975年由比爾·蓋茲和保羅·艾倫創立,18年啟動以智慧雲端、前端為導向的大改組。" // 支持繁体中文 + }; + for (String sentence : tests) + { + System.out.println(analyzer.analyze(sentence)); + } ``` - 说明 * CRF对新词有很好的识别能力,但是开销较大。 - 算法详解 - * [《CRF分词的纯Java实现》](http://www.hankcs.com/nlp/segment/crf-segmentation-of-the-pure-java-implementation.html) - * [《CRF++模型格式说明》](http://www.hankcs.com/nlp/the-crf-model-format-description.html) + * [《CRF中文分词、词性标注与命名实体识别》](https://github.com/hankcs/HanLP/wiki/CRF%E8%AF%8D%E6%B3%95%E5%88%86%E6%9E%90) ### 7. 极速词典分词 @@ -263,7 +238,7 @@ public class DemoHighSpeedSegment ``` - 说明 * 极速分词是词典最长分词,速度极其快,精度一般。 - * 在i7上跑出了2000万字每秒的速度。 + * 在i7-6700K上跑出了4500万字每秒的速度。 - 算法详解 * [《Aho Corasick自动机结合DoubleArrayTrie极速多模式匹配》](http://www.hankcs.com/program/algorithm/aho-corasick-double-array-trie.html) @@ -290,7 +265,7 @@ public class DemoCustomDictionary String text = "攻城狮逆袭单身狗,迎娶白富美,走上人生巅峰"; // 怎么可能噗哈哈! 
- // AhoCorasickDoubleArrayTrie自动机分词 + // AhoCorasickDoubleArrayTrie自动机扫描文本中出现的自定义词语 final char[] charArray = text.toCharArray(); CustomDictionary.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit() { @@ -300,29 +275,22 @@ public class DemoCustomDictionary System.out.printf("[%d:%d]=%s %s\n", begin, end, new String(charArray, begin, end - begin), value); } }); - // trie树分词 - BaseSearcher searcher = CustomDictionary.getSearcher(text); - Map.Entry entry; - while ((entry = searcher.next()) != null) - { - System.out.println(entry); - } - // 标准分词 + // 自定义词典在所有分词器中都有效 System.out.println(HanLP.segment(text)); } } ``` - 说明 - * `CustomDictionary`是一份全局的用户自定义词典,可以随时增删,影响全部分词器。 - * 另外可以在任何分词器中关闭它。通过代码动态增删不会保存到词典文件。 + * `CustomDictionary`是一份全局的用户自定义词典,可以随时增删,影响全部分词器。另外可以在任何分词器中关闭它。通过代码动态增删不会保存到词典文件。 + * 中文分词≠词典,词典无法解决中文分词,`Segment`提供高低优先级应对不同场景,请参考[FAQ](https://github.com/hankcs/HanLP/wiki/FAQ#%E4%B8%BA%E4%BB%80%E4%B9%88%E4%BF%AE%E6%94%B9%E4%BA%86%E8%AF%8D%E5%85%B8%E8%BF%98%E6%98%AF%E6%B2%A1%E6%9C%89%E6%95%88%E6%9E%9C)。 - 追加词典 * `CustomDictionary`主词典文本路径是`data/dictionary/custom/CustomDictionary.txt`,用户可以在此增加自己的词语(不推荐);也可以单独新建一个文本文件,通过配置文件`CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 我的词典.txt;`来追加词典(推荐)。 * 始终建议将相同词性的词语放到同一个词典文件里,便于维护和分享。 - 词典格式 * 每一行代表一个单词,格式遵从`[单词] [词性A] [A的频次] [词性B] [B的频次] ...` 如果不填词性则表示采用词典的默认词性。 * 词典的默认词性默认是名词n,可以通过配置文件修改:`全国地名大全.txt ns;`如果词典路径后面空格紧接着词性,则该词典默认是该词性。 - * 在基于层叠隐马模型的最短路分词中,并不保证自定义词典中的词一定被切分出来。 + * 在统计分词中,并不保证自定义词典中的词一定被切分出来。用户可在理解后果的情况下通过`Segment#enableCustomDictionaryForcing`强制生效。 * 关于用户词典的更多信息请参考**词典说明**一章。 - 算法详解 * [《Trie树分词》](http://www.hankcs.com/program/java/tire-tree-participle.html) @@ -351,6 +319,7 @@ for (String sentence : testCase) * 目前分词器基本上都默认开启了中国人名识别,比如`HanLP.segment()`接口中使用的分词器等等,用户不必手动开启;上面的代码只是为了强调。 * 有一定的误命中率,比如误命中`关键年`,则可以通过在`data/dictionary/person/nr.txt`加入一条`关键年 A 1`来排除`关键年`作为人名的可能性,也可以将`关键年`作为新词登记到自定义词典中。 * 如果你通过上述办法解决了问题,欢迎向我提交pull request,词典也是宝贵的财富。 + * 
建议NLP用户使用感知机或CRF词法分析器,精度更高。 - 算法详解 * [《实战HMM-Viterbi角色标注中国人名识别》](http://www.hankcs.com/nlp/chinese-name-recognition-in-actual-hmm-viterbi-role-labeling.html) @@ -409,6 +378,7 @@ for (String sentence : testCase) - 说明 * 目前标准分词器都默认关闭了地名识别,用户需要手动开启;这是因为消耗性能,其实多数地名都收录在核心词典和用户自定义词典中。 * 在生产环境中,能靠词典解决的问题就靠词典解决,这是最高效稳定的方法。 + * 建议对命名实体识别要求较高的用户使用[感知机词法分析器](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)。 - 算法详解 * [《实战HMM-Viterbi角色标注地名识别》](http://www.hankcs.com/nlp/ner/place-names-to-identify-actual-hmm-viterbi-role-labeling.html) @@ -430,6 +400,7 @@ for (String sentence : testCase) - 说明 * 目前分词器默认关闭了机构名识别,用户需要手动开启;这是因为消耗性能,其实常用机构名都收录在核心词典和用户自定义词典中。 * HanLP的目的不是演示动态识别,在生产环境中,能靠词典解决的问题就靠词典解决,这是最高效稳定的方法。 + * 建议对命名实体识别要求较高的用户使用[感知机词法分析器](https://github.com/hankcs/HanLP/wiki/%E7%BB%93%E6%9E%84%E5%8C%96%E6%84%9F%E7%9F%A5%E6%9C%BA%E6%A0%87%E6%B3%A8%E6%A1%86%E6%9E%B6)。 - 算法详解 * [《层叠HMM-Viterbi角色标注模型下的机构名识别》](http://www.hankcs.com/nlp/ner/place-name-recognition-model-of-the-stacked-hmm-viterbi-role-labeling.html) @@ -563,9 +534,9 @@ public class DemoPinyin } ``` - 说明 - * **HanLP**不仅支持基础的汉字转拼音,还支持声母、韵母、音调、音标和输入法首字母首声母功能。 - * **HanLP**能够识别多音字,也能给繁体中文注拼音。 - * 最重要的是,**HanLP**采用的模式匹配升级到`AhoCorasickDoubleArrayTrie`,性能大幅提升,能够提供毫秒级的响应速度! + * HanLP不仅支持基础的汉字转拼音,还支持声母、韵母、音调、音标和输入法首字母首声母功能。 + * HanLP能够识别多音字,也能给繁体中文注拼音。 + * 最重要的是,HanLP采用的模式匹配升级到`AhoCorasickDoubleArrayTrie`,性能大幅提升,能够提供毫秒级的响应速度! 
- 算法详解 * [《汉字转拼音与简繁转换的Java实现》](http://www.hankcs.com/nlp/java-chinese-characters-to-pinyin-and-simplified-conversion-realization.html#h2-17) @@ -586,7 +557,7 @@ public class DemoTraditionalChinese2SimplifiedChinese } ``` - 说明 - * **HanLP**能够识别简繁分歧词,比如`打印机=印表機`。许多简繁转换工具不能区分“以后”“皇后”中的两个“后”字,**HanLP**可以。 + * HanLP能够识别简繁分歧词,比如`打印机=印表機`。许多简繁转换工具不能区分“以后”“皇后”中的两个“后”字,HanLP可以。 - 算法详解 * [《汉字转拼音与简繁转换的Java实现》](http://www.hankcs.com/nlp/java-chinese-characters-to-pinyin-and-simplified-conversion-realization.html#h2-17) @@ -622,7 +593,7 @@ public class DemoSuggester } ``` - 说明 - * 在搜索引擎的输入框中,用户输入一个词,搜索引擎会联想出最合适的搜索词,**HanLP**实现了类似的功能。 + * 在搜索引擎的输入框中,用户输入一个词,搜索引擎会联想出最合适的搜索词,HanLP实现了类似的功能。 * 可以动态调节每种识别器的权重 ### 20. 语义距离 @@ -675,7 +646,7 @@ public class DemoWord2Vec ```java /** - * 依存句法分析(CRF句法模型需要-Xms512m -Xmx512m -Xmn256m,MaxEnt和神经网络句法模型需要-Xms1g -Xmx1g -Xmn512m) + * 依存句法分析(MaxEnt和神经网络句法模型需要-Xms1g -Xmx1g -Xmn512m) * @author hankcs */ public class DemoDependencyParser @@ -708,14 +679,12 @@ public class DemoDependencyParser ``` - 说明 * 内部采用`NeuralNetworkDependencyParser`实现,用户可以直接调用`NeuralNetworkDependencyParser.compute(sentence)` - * 也可以调用基于最大熵的依存句法分析器`MaxEntDependencyParser.compute(sentence)` + * 也可以调用基于ArcEager转移系统的柱搜索依存句法分析器`KBeamArcEagerDependencyParser` - 算法详解 * [《基于神经网络分类模型与转移系统的判决式依存句法分析器》](http://www.hankcs.com/nlp/parsing/neural-network-based-dependency-parser.html) - * [《最大熵依存句法分析器的实现》](http://www.hankcs.com/nlp/parsing/to-achieve-the-maximum-entropy-of-the-dependency-parser.html) - * [《基于CRF序列标注的中文依存句法分析器的Java实现》](http://www.hankcs.com/nlp/parsing/crf-sequence-annotation-chinese-dependency-parser-implementation-based-on-java.html) ## 词典说明 -本章详细介绍**HanLP**中的词典格式,满足用户自定义的需要。**HanLP**中有许多词典,它们的格式都是相似的,形式都是文本文档,随时可以修改。 +本章详细介绍HanLP中的词典格式,满足用户自定义的需要。HanLP中有许多词典,它们的格式都是相似的,形式都是文本文档,随时可以修改。 ### 基本格式 词典分为词频词性词典和词频词典。 @@ -733,9 +702,9 @@ public class DemoDependencyParser ### 数据结构 -Trie树(字典树)是**HanLP**中使用最多的数据结构,为此,我实现了通用的Trie树,支持泛型、遍历、储存、载入。 
+Trie树(字典树)是HanLP中使用最多的数据结构,为此,我实现了通用的Trie树,支持泛型、遍历、储存、载入。 -用户自定义词典采用AhoCorasickDoubleArrayTrie和二分Trie树储存,其他词典采用基于[双数组Trie树(DoubleArrayTrie)](http://www.hankcs.com/program/java/%E5%8F%8C%E6%95%B0%E7%BB%84trie%E6%A0%91doublearraytriejava%E5%AE%9E%E7%8E%B0.html)实现的[AC自动机AhoCorasickDoubleArrayTrie](http://www.hankcs.com/program/algorithm/aho-corasick-double-array-trie.html)。 +用户自定义词典采用AhoCorasickDoubleArrayTrie和二分Trie树储存,其他词典采用基于[双数组Trie树(DoubleArrayTrie)](http://www.hankcs.com/program/java/%E5%8F%8C%E6%95%B0%E7%BB%84trie%E6%A0%91doublearraytriejava%E5%AE%9E%E7%8E%B0.html)实现的[AC自动机AhoCorasickDoubleArrayTrie](http://www.hankcs.com/program/algorithm/aho-corasick-double-array-trie.html)。关于一些常用数据结构的性能评估,请参考[wiki](https://github.com/hankcs/HanLP/wiki/%E6%95%B0%E6%8D%AE%E7%BB%93%E6%9E%84)。 ### 储存形式 @@ -763,23 +732,37 @@ HanLP.Config.enableDebug(); * 基于角色标注的命名实体识别比较依赖词典,所以词典的质量大幅影响识别质量。 * 这些词典的格式与原理都是类似的,请阅读[相应的文章](http://www.hankcs.com/category/nlp/ner/)或代码修改它。 -如果问题解决了,欢迎向我提交一个pull request,这是我在代码库中保留明文词典的原因,众人拾柴火焰高! +若还有疑问,请参考[《自然语言处理入门》](http://nlp.hankcs.com/book.php)相应章节。如果问题解决了,欢迎向我提交一个pull request,这是我在代码库中保留明文词典的原因,众人拾柴火焰高! 
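The basic dictionary format described above (`[单词] [词性A] [A的频次] [词性B] [B的频次] ...`, e.g. `希望 动词 386 名词 96`) can be sketched as a tiny stand-alone parser. This is only an illustrative sketch of the plain-text line format, not HanLP's actual loader (which lives in its corpus/dictionary packages); the class and method names here are invented for demonstration.

```java
import java.util.*;

// Illustrative parser for the word-frequency-POS dictionary line format:
// `word posA freqA posB freqB ...`. Not HanLP's real loader; names are
// assumptions for demonstration only.
public class DictLineParser {
    public static Map<String, Integer> parse(String line) {
        String[] fields = line.trim().split("\\s+");
        Map<String, Integer> posToFreq = new LinkedHashMap<>();
        // fields[0] is the word itself; the rest come in (POS, frequency) pairs
        for (int i = 1; i + 1 < fields.length; i += 2) {
            posToFreq.put(fields[i], Integer.parseInt(fields[i + 1]));
        }
        return posToFreq;
    }

    public static void main(String[] args) {
        // prints {动词=386, 名词=96}
        System.out.println(parse("希望 动词 386 名词 96"));
    }
}
```

Because the format is plain whitespace-separated text, user dictionaries can be edited with any text editor; remember to delete the cached `.bin`/`.trie.dat` file afterwards so the change takes effect.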
------ +## [《自然语言处理入门》](http://nlp.hankcs.com/book.php) + +![img](http://file.hankcs.com/img/nlp-book-squre.jpg) + +一本配套HanLP的NLP入门书,基础理论与生产代码并重,Python与Java双实现。从基本概念出发,逐步介绍中文分词、词性标注、命名实体识别、信息抽取、文本聚类、文本分类、句法分析这几个热门问题的算法原理与工程实现。书中通过对多种算法的讲解,比较了它们的优缺点和适用场景,同时详细演示生产级成熟代码,助你真正将自然语言处理应用在生产环境中。 + +[《自然语言处理入门》](http://nlp.hankcs.com/book.php)由南方科技大学数学系创系主任夏志宏、微软亚洲研究院副院长周明、字节跳动人工智能实验室总监李航、华为诺亚方舟实验室语音语义首席科学家刘群、小米人工智能实验室主任兼NLP首席科学家王斌、中国科学院自动化研究所研究员宗成庆、清华大学副教授刘知远、北京理工大学副教授张华平和52nlp作序推荐。感谢各位前辈老师,希望这个项目和这本书能成为大家工程和学习上的“蝴蝶效应”,帮助大家在NLP之路上蜕变成蝶。 + ## 版权 -### Apache License Version 2.0 +HanLP 的授权协议为 **Apache License 2.0**,可免费用做商业用途。请在产品说明中附加HanLP的链接和授权协议。HanLP受版权法保护,侵权必究。 + +##### 自然语义(青岛)科技有限公司 + +HanLP从v1.7版起独立运作,由自然语义(青岛)科技有限公司作为项目主体,主导后续版本的开发,并拥有后续版本的版权。 + +##### 大快搜索 + +HanLP v1.3~v1.65版由大快搜索主导开发,继续完全开源,大快搜索拥有相关版权。 -如不特殊注明,所有模块都以此协议授权使用。 +##### 上海林原公司 -### 上海林原信息科技有限公司 -- HanLP产品初始知识产权归上海林原信息科技有限公司所有,任何人和企业可以无偿使用,可以对产品、源代码进行任何形式的修改,可以打包在其他产品中进行销售。 -- 任何使用了HanLP的全部或部分功能、词典、模型的项目、产品或文章等形式的成果必须显式注明HanLP及此项目主页。 +HanLP 早期得到了上海林原公司的大力支持,并拥有1.28及前序版本的版权,相关版本也曾在上海林原公司网站发布。 ### 其他版权方 -- 自`1.2.4`后是个人维护,还会接受任何人与任何公司向本项目开源的模块。 -- 充分尊重所有版权方的贡献,本项目不占有这些新模块的版权。 +- 实施上由个人维护,欢迎任何人与任何公司向本项目开源模块。 +- 充分尊重所有版权方的贡献,本项目不占有用户贡献模块的版权。 ### 鸣谢 感谢下列优秀开源项目: @@ -799,9 +782,9 @@ HanLP.Config.enableDebug(); - An Efficient Implementation of Trie Structures, JUN-ICHI AOE AND KATSUSHI MORIMOTO - TextRank: Bringing Order into Texts, Rada Mihalcea and Paul Tarau -感谢上海林原信息科技有限公司的刘先生,允许我利用工作时间开发HanLP,提供服务器和域名,并且促成了开源。感谢诸位用户的关注和使用,HanLP并不完善,未来还恳求各位NLP爱好者多多关照,提出宝贵意见。 +感谢诸位用户的关注和使用,HanLP并不完善,未来还恳求各位NLP爱好者多多关照,提出宝贵意见。 作者 [@hankcs](http://weibo.com/hankcs/) -2014年12月16日 +2016年9月16日 diff --git a/data/dictionary/pinyin/pinyin.txt b/data/dictionary/pinyin/pinyin.txt new file mode 100644 index 000000000..58432c069 --- /dev/null +++ b/data/dictionary/pinyin/pinyin.txt @@ -0,0 +1,30728 @@ +〇=ling2 +一=yi1 +一丁点儿=yi1,ding1,dian3,er5 +一不小心=yi1,bu4,xiao3,xin1 +一丘之貉=yi1,qiu1,zhi1,he2 +一丝不差=yi4,si1,bu4,cha1 +一丝不苟=yi1,si1,bu4,gou3 
+一个=yi1,ge4 +一个半个=yi1,ge4,ban4,ge4 +一个巴掌拍不响=yi1,ge4,ba1,zhang3,pai1,bu4,xiang3 +一个萝卜一个坑=yi1,ge4,luo2,bo5,yi1,ge4,keng1 +一举两得=yi1,ju3,liang3,de2 +一之为甚=yi1,zhi1,wei2,shen4 +一了=yi1,liao3 +一了百了=yi1,liao3,bai3,liao3 +一了百当=yi1,liao3,bai3,dang4 +一事无成=yi1,shi4,wu2,cheng2 +一五一十=yi4,wu3,yi4,shi2 +一些=yi1,xie1 +一些半些=yi1,xie1,ban4,xie1 +一人做事一人当=yi1,ren2,zuo4,shi4,yi1,ren2,dang1 +一仆二主=yi1,pu2,er4,zhu3 +一代=yi1,dai4 +一代不如一代=yi1,dai4,bu4,ru2,yi1,dai4 +一代楷模=yi1,dai4,kai3,mo2 +一代风流=yi1,dai4,feng1,liu2 +一令纸=yi1,ling3,zhi3 +一会=yi1,hui4 +一会儿=yi1,hui4,er5 +一体两面=yi1,ti3,liang3,mian4 +一倍=yi2,bei4 +一倡百和=yi1,chang4,bai3,he4 +一共=yi1,gong4 +一出戏=yi4,chu1,xi4 +一刀两断=yi1,dao1,liang3,duan4 +一分为二=yi1,fen1,wei2,er4 +一分子=yi1,fen4,zi3 +一分钟=yi4,fen1,zhong1 +一切=yi1,qie4 +一切万物=yi1,qie1,wan4,wu4 +一动不动=yi1,dong4,bu4,dong4 +一劳永逸=yi1,lao2,yong3,yi4 +一匹=yi1,pi3 +一匹马=yi4,pi3,ma3 +一半=yi1,ban4 +一卷=yi1,juan4 +一厢情愿=yi1,xiang1,qing2,yuan4 +一去不复返=yi1,qu4,bu4,fu4,fan3 +一发=yi1,fa4 +一发千钧=yi1,fa4,qian1,jun1 +一口咬定=yi1,kou3,yao3,ding4 +一口气=yi1,kou3,qi4 +一只=yi1,zhi1 +一叶扁舟=yi1,ye4,pian1,zhou1 +一吐为快=yi1,tu3,wei2,kuai4 +一向=yi1,xiang4 +一呼百应=yi1,hu1,bai3,ying4 +一命呜呼=yi1,ming4,wu1,hu1 +一哄而上=yi1,hong3,er2,shang4 +一哄而散=yi2,hong4,er2,san4 +一哄而起=yi1,hong4,er2,qi3 +一哭二闹=yi4,ku1,er4,nao4 +一唱一和=yi1,chang4,yi1,he4 +一回儿=yi4,hui2,er5 +一场=yi4,chang2 +一场春梦=yi1,chang3,chun1,meng4 +一场梦=yi4,chang2,meng4 +一场空=yi1,chang2,kong1 +一场雨=yi1,chang2,yu3 +一块石头落了地=yi1,kuai4,shi2,tou5,luo4,le5,di4 +一块石头落地=yi1,kuai4,shi2,tou2,luo4,di4 +一声不吭=yi1,sheng1,bu4,keng1 +一声不响=yi1,sheng1,bu4,xiang3 +一夕一朝=yi1,xi1,yi1,zhao1 +一夕之间=yi4,xi1,zhi1,jian1 +一夜情=yi1,ye4,qing2 +一夜走红=yi2,ye4,zou3,hong2 +一大堆=yi1,da4,dui1 +一天星斗=yi1,tian1,xing1,dou3 +一如既往=yi1,ru2,ji4,wang3 +一定=yi1,ding4 +一定不易=yi1,ding4,bu4,yi4 +一定不移=yi1,ding4,bu4,yi2 +一宿=yi1,xiu3 +一尘不到=yi1,chen2,bu4,dao4 +一尘不染=yi1,chen2,bu4,ran3 +一差二错=yi1,cha1,er4,cuo4 +一帆风顺=yi4,fan1,feng1,shun4 +一幕=yi1,mu4 +一年一度=yi1,nian2,yi1,du4 +一年到头=yi1,nian2,dao4,tou2 +一年半载=yi1,nian2,ban4,zai3 +一年级=yi1,nian2,ji2 +一度=yi1,du4 +一得=yi1,de5 
+一得之功=yi1,de2,zhi1,gong1 +一得之愚=yi1,de2,zhi1,yu2 +一得之见=yi1,de2,zhi1,jian4 +一心一意=yi1,xin1,yi1,yi4 +一心向往=yi4,xin1,xiang4,wang3 +一念之差=yi1,nian4,zhi1,cha1 +一念之间=yi1,nian4,zhi1,jian1 +一怒之下=yi1,nu4,zhi1,xia4 +一怔=yi1,zheng1 +一意孤行=yi1,yi4,gu1,xing2 +一成=yi1,cheng2 +一成不变=yi1,cheng2,bu4,bian4 +一成不易=yi1,cheng2,bu4,yi4 +一扎=yi1,za1 +一打=yi1,da3 +一技之长=yi1,ji4,zhi1,chang2 +一把子=yi1,ba4,zi5 +一抔黄沙=yi4,pou2,huang2,sha1 +一报还一报=yi1,bao4,huan2,yi1,bao4 +一掊土=yi1,pou2,tu3 +一撮=yi1,zuo3 +一文不值=yi1,wen2,bu4,zhi2 +一方面=yi1,fang1,mian4 +一无可取=yi1,wu2,ke3,qu3 +一无所得=yi1,wu2,suo3,de2 +一无所有=yi1,wu2,suo3,you3 +一无所知=yi1,wu2,suo3,zhi1 +一无是处=yi1,wu2,shi4,chu4 +一无长物=yi1,wu2,chang2,wu4 +一日三省=yi1,ri4,san1,xing3 +一日三秋=yi1,ri4,san1,qiu1 +一旦=yi1,dan4 +一时千载=yi1,shi2,qian1,zai3 +一时半会儿=yi1,shi2,ban4,hui4,er5 +一时得逞=yi4,shi2,de2,cheng3 +一时疏忽=yi4,shi2,shu1,hu1 +一暴十寒=yi1,pu4,shi2,han2 +一曝十寒=yi1,pu4,shi2,han2 +一月=yi1,yue4 +一服=yi1,fu4 +一服药=yi1,fu4,yao4 +一望无垠=yi1,wang4,wu2,yin2 +一望无涯=yi1,wang4,wu2,ya2 +一望无际=yi1,wang4,wu2,ji4 +一朝=yi1,zhao1 +一朝一夕=yi1,zhao1,yi1,xi1 +一朝之忿=yi1,zhao1,zhi1,fen4 +一朝天子一朝臣=yi1,chao2,tian1,zi3,yi1,chao2,chen2 +一棍子打死=yi1,gun4,zi5,da3,si3 +一模一样=yi4,mu2,yi2,yang4 +一次性=yi1,ci4,xing4 +一步一个脚印=yi1,bu4,yi1,ge4,jiao3,yin4 +一步一步=yi2,bu4,yi2,bu4 +一步步=yi2,bu4,bu4 +一步步地=yi2,bu4,bu4,de5 +一毛不拔=yi1,mao2,bu4,ba2 +一毫不差=yi1,hao2,bu4,cha1 +一气之下=yi1,qi4,zhi1,xia4 +一沓纸=yi4,da2,zhi3 +一沓钞票=yi4,da2,chao1,piao4 +一波三折=yi1,bo1,san1,zhe2 +一泻千里=yi1,xie4,qian1,li3 +一派胡言=yi1,pai4,hu2,yan2 +一派胡言乱语=yi2,pai4,hu2,yan2,luan4,yu3 +一流=yi1,liu2 +一清二楚=yi1,qing1,er4,chu3 +一清宿弊=yi4,qing1,su4,bi4 +一溜儿=yi1,liu4,er2 +一溜烟=yi2,liu4,yan1 +一溜烟跑掉=yi2,liu4,yan1,pao3,diao4 +一点=yi1,dian3 +一点钟=yi4,dian3,zhong1 +一片散沙=yi1,pian4,san4,sha1 +一片昏黑=yi2,pian4,hun1,hei1 +一片混乱=yi1,pian4,hun4,luan4 +一片空白=yi1,pian4,kong1,bai2 +一物降一物=yi1,wu4,xiang2,yi1,wu4 +一现昙华=yi1,xian4,tan2,hua1 +一生一世=yi1,sheng1,yi1,shi4 +一百=yi1,bai3 +一百二十行=yi1,bai3,er4,shi2,hang2 +一百亿=yi4,bai3,yi4 +一百八十度=yi1,bai3,ba1,shi2,du4 +一盘散沙=yi1,pan2,san3,sha1 +一目了然=yi1,mu4,liao3,ran2 
+一目五行=yi1,mu4,wu3,hang2 +一目十行=yi1,mu4,shi2,hang2 +一目数行=yi1,mu4,shu4,hang2 +一直=yi1,zhi2 +一直走=yi4,zhi2,zou3 +一窍不通=yi1,qiao4,bu4,tong1 +一笑了之=yi1,xiao4,liao3,zhi1 +一笑了事=yi1,xiao4,le5,shi4 +一笔抹摋=yi1,bi3,mo4,sa4 +一笔抹煞=yi1,bi3,mo3,sha1 +一等奖=yi1,deng3,jiang3 +一筹莫展=yi1,chou2,mo4,zhan3 +一箭上垛=yi1,jian4,shang4,duo4 +一篇文章=yi4,pian1,wen2,zhang1 +一线之隔=yi1,xian4,zhi1,ge2 +一缘一会=yi1,yuan2,yi1,hui4 +一网打尽=yi1,wang3,da3,jin4 +一而再=yi4,er2,zai4 +一肚子=yi1,du3,zi5 +一股劲=yi1,gu3,jin4 +一股劲儿=yi1,gu3,jin4,er5 +一臂之力=yi1,bi4,zhi1,li4 +一致=yi1,zhi4 +一般=yi1,ban1 +一般见识=yi1,ban1,jian4,shi2 +一艘船=yi4,sou1,chuan2 +一见了然=yi1,jian4,le5,ran2 +一见钟情=yi1,jian4,zhong1,qing2 +一视同仁=yi1,shi4,tong2,ren2 +一览无遗=yi4,lan3,wu2,yi2 +一觉=yi1,jiao4 +一言中的=yi1,yan2,zhong1,di4 +一言为定=yi1,yan2,wei2,ding4 +一言九鼎=yi1,yan2,jiu3,ding3 +一言兴邦=yi1,yan2,xing1,bang1 +一言半语=yi1,yan2,ban4,yu3 +一言难尽=yi1,yan2,nan2,jin4 +一语中的=yi1,yu3,zhong4,di4 +一语破的=yi1,yu3,po4,di4 +一诺千金=yi1,nuo4,qian1,jin1 +一贫如洗=yi1,pin2,ru2,xi3 +一贯=yi1,guan4 +一走了之=yi1,zou3,liao3,zhi1 +一起=yi1,qi3 +一路平安=yi1,lu4,ping2,an1 +一路神祇=yi1,lu4,shen2,qi2 +一路顺风=yi1,lu4,shun4,feng1 +一蹴而就=yi1,cu4,er2,jiu4 +一蹶不兴=yi1,jue3,bu4,xing1 +一蹶不振=yi1,jue2,bu4,zhen4 +一边倒=yi1,bian1,dao3 +一还一报=yi1,huan2,yi1,bao4 +一退六二五=yi1,tui1,liu4,er4,wu3 +一通=yi2,tong4 +一邱之貉=yi1,qiu1,zhi1,he4 +一重一掩=yi1,chong2,yi1,yan3 +一针见血=yi1,zhen1,jian4,xie3 +一钱不落虚空地=yi1,qian2,bu4,luo4,xu1,kong1,di4 +一隅=yi1,yu2 +一隅之地=yi1,yu2,zhi1,di4 +一隅之见=yi1,yu2,zhi1,jian4 +一隅之说=yi1,yu2,zhi1,shuo1 +一面=yi1,mian4 +一面之缘=yi1,mian4,zhi1,yuan2 +一面之识=yi1,mian4,zhi1,shi2 +一面倒=yi2,mian4,dao3 +一鞭先著=yi1,bian1,xian1,zhuo2 +一驮粮=yi2,duo4,liang2 +一骨碌=yi1,gu1,lu4 +一鸣惊人=yi1,ming2,jing1,ren2 +丁=ding1,zheng1 +丁点儿=ding1,dian3,er5 +丂=kao3,qiao3,yu2 +七=qi1 +七十二行=qi1,shi2,er4,hang2 +七拱八翘=qi1,gong3,ba1,qiao4 +七次量衣一次裁=qi1,ci4,liang2,yi1,yi1,ci4,cai2 +七行俱下=qi1,hang2,ju4,xia4 +七返还丹=qi1,fan3,huan2,dan1 +丄=shang4 +丅=xia4 +丆=han3 +万=wan4,mo4 +万俟=mo4,qi2 +万别千差=wan4,bie2,qian1,cha1 +万卷=wan4,juan4 +万夫不当=wan4,fu1,bu4,dang1 +万夫不当之勇=wan4,fu1,bu4,dang1,zhi1,yong3 
+万头攒动=wan4,tou2,cuan2,dong4 +万应灵丹=wan4,ying4,ling2,dan1 +万应锭=wan4,ying4,ding4 +万户侯=wan4,hu4,hou4 +万无一失=wan4,wu2,yi1,shi1 +万箭攒心=wan4,jian4,cuan2,xin1 +万象更新=wan4,xiang4,geng1,xin1 +万贯家私=wan4,guan4,ji5,si1 +万里长城=wan4,li3,chang2,cheng2 +万里长征=wan4,li3,chang2,zheng1 +丈=zhang4 +丈量=zhang4,liang2 +三=san1 +三不拗六=san1,bu4,niu4,liu4 +三人为众=san1,ren2,wei4,zhong4 +三十六行=san1,shi2,liu4,hang2 +三占从二=san1,zhan1,cong2,er4 +三句不离本行=san1,ju4,bu4,li2,ben3,hang2 +三句话不离本行=san1,ju4,hua4,bu4,li2,ben3,hang2 +三只手=san1,zhi1,shou3 +三天两宿=san1,tian1,liang3,xiu3 +三差两错=san1,cha1,liang3,cuo4 +三差五错=san1,cha1,wu3,cuo4 +三年五载=san1,nian2,wu3,zai3 +三座大山=san1,zuo4,da4,shan1 +三徙成都=san1,xi3,cheng2,dou1 +三战三北=san1,zhan1,san1,bei3 +三折肱为良医=san1,zhe2,gong1,wei2,liang2,yi1 +三更=san1,geng1 +三灾八难=san1,zai1,ba1,nan4 +三百六十行=san1,bai3,liu4,shi2,hang2 +三省吾身=san1,xing3,wu2,shen1 +三色堇=san1,se4,jin3 +三藏=san1,zang4 +三角裤衩=san1,jiao3,ku4,cha3 +三重=san1,chong2 +上=shang4,shang3 +上供=shang4,gong4 +上卷=shang4,juan4 +上去=shang3,qu4 +上吐下泻=shang4,tu3,xia4,xie4 +上声=shang3,sheng1 +上头=shang4,tou5 +上将=shang4,jiang4 +上岁数=shang4,sui4,shu4 +上当=shang4,dang4 +上当学乖=shang4,dang1,xue2,guai1 +上相=shang4,xiang4 +上调=shang4,diao4 +上铺=shang4,pu4 +下=xia4 +下不为例=xia4,bu4,wei2,li4 +下乘=xia4,sheng4 +下卷=xia4,juan4 +下处=xia4,chu3 +下头=xia4,tou5 +下巴颏=xia4,ba1,ke1 +下调=xia4,diao4 +下辈子=xia4,bei4,zi5 +下铺=xia4,pu4 +丌=qi2,ji1 +不=bu4,fou3 +不一会儿=bu2,yi4,hui2,er5 +不一定=bu4,yi1,ding4 +不一样=bu4,yi2,yang4 +不三不四=bu4,san1,bu4,si4 +不为五斗米折腰=bu4,wei4,wu3,dou3,mi3,zhe2,yao1 +不为人知=bu4,wei2,ren2,zhi1 +不为名不为利=bu4,wei2,ming2,bu4,wei2,li4 +不为已甚=bu4,wei2,yi3,shen4 +不义=bu4,yi4 +不义之财=bu4,yi4,zhi1,cai2 +不了=bu4,liao3 +不了不当=bu4,liao3,bu4,dang4 +不了了之=bu4,liao3,liao3,zhi1 +不了而了=bu4,liao3,er2,liao3 +不亦乐乎=bu4,yi4,le4,hu1 +不亦善夫=bu4,yi5,shan4,fu1 +不亦说乎=bu4,yi4,yue4,hu1 +不以为然=bu4,yi3,wei2,ran2 +不以语人=bu4,yi3,yu4,ren2 +不会=bu4,hui4 +不但=bu4,dan4 +不住=bu4,zhu4 +不值识者一笑=bu4,zhi2,shi2,zhe3,yi2,xiao4 +不兴=bu4,xing1 +不切实际=bu2,qie4,shi2,ji4 +不切题=bu2,qie4,ti2 +不到=bu4,dao4 +不到位=bu2,dao4,wei4 
+不到长城非好汉=bu4,dao4,chang2,cheng2,fei1,hao3,han4 +不到黄河心不死=bu4,dao4,huang2,he2,xin1,bu4,si3 +不务正业=bu4,wu4,zheng4,ye4 +不动如山=bu2,dong4,ru2,shan1 +不卑不亢=bu4,bei1,bu4,kang4 +不厌其烦=bu4,yan4,qi2,fan2 +不可奈何=bu4,ke3,mai4,he2 +不可揆度=bu4,ke3,kui2,duo2 +不可胜举=bu4,ke3,sheng4,ju4 +不吝赐教=bu4,lin4,ci4,jiao4 +不含糊=bu4,han2,hu4 +不在乎=bu2,zai4,hu1 +不堪一击=bu4,kan1,yi1,ji1 +不声不吭=bu4,sheng1,bu4,keng1 +不大方便=bu2,da4,fang1,bian4 +不安分=bu4,an1,fen4 +不客气=bu2,ke4,qi4 +不寐=bu2,mei4 +不对=bu4,dui4 +不对劲=bu4,dui4,jin4 +不对头=bu4,dui4,tou2 +不对茬儿=bu2,dui4,cha2,er2 +不屑=bu4,xie4 +不屑一读=bu2,xie4,yi4,du2 +不屑一顾=bu4,xie4,yi1,gu4 +不屑于=bu2,xie4,yu2 +不屑意=bu2,xie4,yi4 +不屑教诲=bu4,xie4,jiao4,hui4 +不屑毁誉=bu4,xie4,hui3,yu4 +不差=bu4,cha4 +不差上下=bu4,cha1,shang4,xia4 +不差什么=bu4,cha4,shi2,mo3 +不差毫厘=bu4,cha1,hao2,li2 +不差毫发=bu4,cha1,hao2,fa4 +不差累黍=bu4,cha1,lei3,shu3 +不干=bu4,gan4 +不干不净=bu4,gan1,bu4,jing4 +不幸而言中=bu4,xing4,er2,yan2,zhong4 +不当=bu2,dang4 +不当不正=bu4,dang1,bu4,zheng4 +不当人子=bu4,dang1,ren2,zi3 +不徇私情=bu4,xun2,si1,qing2 +不得了=bu4,de2,liao3 +不得已而为之=bu4,de2,yi3,er2,wei2,zhi1 +不必=bu4,bi4 +不忘宿志=bu2,wang4,su4,zhi4 +不忿=bu4,fen4 +不怀好意=bu4,huai2,hao4,yi4 +不悦=bu2,yue4 +不惜一切代价=bu4,xi1,yi2,qie4,dai4,jia4 +不愧=bu4,kui4 +不愧下学=bu4,kui4,xia4,xue2 +不愧不作=bu4,kui4,bu4,zuo4 +不愧不怍=bu4,kui4,bu4,zuo4 +不愧屋漏=bu4,kui4,wu1,lou4 +不慎=bu4,shen4 +不懈=bu4,xie4 +不战而胜=bu2,zhan4,er2,sheng4 +不战而降=bu2,zhan4,er2,xiang2 +不拔一毛=bu4,ba2,yi4,mao2 +不拘形迹=bu4,ju1,xing2,ji1 +不揣冒昧=bu4,chuai3,mao4,mei4 +不揪不睬=bu4,chou3,bu4,cai3 +不敢为天下先=bu4,gan3,wei2,tian1,xia4,xian1 +不敢越雷池一步=bu4,gan3,yue4,lei2,chi2,yi1,bu4 +不料=bu4,liao4 +不易之论=bu4,yi4,zhi1,lun4 +不是=bu2,shi4 +不是冤家不聚头=bu4,shi4,yuan1,jia1,bu4,ju4,tou2 +不是味儿=bu2,shi4,wei4,er2 +不根之谈=bu4,gan1,zhi1,tan2 +不正常=bu2,zheng4,chang2 +不治之症=bu4,zhi4,zhi1,zheng4 +不点儿=bu4,dian3,er5 +不爽累黍=bu4,shuang3,lei4,shu3 +不犯着=bu4,fan4,zhao2 +不甚了了=bu4,shen4,liao3,liao3 +不用说=bu2,yong4,shuo1 +不由得=bu4,you2,de5 +不痛不痒=bu4,tong4,bu4,yang3 +不登大雅=bu4,deng1,da4,ya3 +不登大雅之堂=bu4,deng1,da4,ya3,zhi1,tang2 +不相为谋=bu4,xiang1,wei2,mou2 +不省人事=bu4,xing3,ren2,shi4 
+不着边际=bu4,zhuo2,bian1,ji4 +不知薡蕫=bu4,zhi1,ding1,dong3 +不紧不慢=bu4,jin1,bu4,man4 +不绝如发=bu4,jue2,ru2,fa4 +不置可否=bu4,zhi4,ke3,fou3 +不翼而飞=bu4,yi4,er2,fei1 +不老少=bu4,lao3,shao4 +不耐烦=bu4,nai4,fan2 +不肖=bu2,xiao4 +不肖子=bu2,xiao4,zi3 +不肖子孙=bu4,xiao4,zi3,sun1 +不胜=bu4,sheng4 +不胜其任=bu4,sheng4,qi2,ren4 +不胜其烦=bu4,sheng4,qi2,fan2 +不胜其苦=bu4,sheng4,qi2,ku3 +不胜杯杓=bu4,sheng4,bei1,shao2 +不胜枚举=bu4,sheng4,mei2,ju3 +不胜桮杓=bu4,sheng4,bei1,shao2 +不胫而走=bu4,jing4,er2,zou3 +不能登大雅之堂=bu4,neng2,deng1,da4,ya3,zhi1,tang2 +不自量力=bu4,zi4,liang4,li4 +不舒服=bu4,shu1,fu5 +不蔓不枝=bu4,man4,bu4,zhi1 +不要=bu4,yao4 +不要怕=bu2,yao4,pa4 +不要紧=bu4,yao4,jin3 +不要脸=bu4,yao4,lian3 +不见=bu4,jian4 +不见不散=bu2,jian4,bu2,san4 +不见了=bu2,jian4,le5 +不见天日=bu4,jian4,tian1,ri4 +不见得=bu2,jian4,de2 +不见棺材不下泪=bu4,jian4,guan1,cai2,bu4,xia4,lei4 +不见棺材不落泪=bu4,jian4,guan1,cai2,bu4,luo4,lei4 +不见经传=bu4,jian4,jing1,zhuan4 +不见舆薪=bu4,jian4,yu2,xin1 +不计其数=bu4,ji4,qi2,shu4 +不计得失=bu2,ji4,de2,shi1 +不记得=bu2,ji4,de2 +不论=bu4,lun4 +不论如何=bu2,lun4,ru2,he2 +不费吹灰之力=bu4,fei4,chui1,hui1,zhi1,li4 +不赖=bu4,lai4 +不赞一词=bu4,zan4,yi1,ci2 +不越雷池=bu4,yue4,lei2,shi5 +不足为凭=bu4,zu2,wei2,ping2 +不足为外人道=bu4,zu2,wei2,wai4,ren2,dao4 +不足为奇=bu4,zu2,wei2,qi2 +不足为意=bu4,zu2,wei2,yi4 +不足为据=bu4,zu2,wei2,ju4 +不足为法=bu4,zu2,wei2,fa3 +不足为训=bu4,zu2,wei2,xun4 +不辟斧钺=bu4,bi4,fu3,yue4 +不辩菽麦=bu4,bian4,shu1,mai4 +不适合=bu2,shi4,he2 +不透明=bu2,tou4,ming2 +不速之客=bu4,su4,zhi1,ke4 +不遂=bu4,sui2 +不遑启处=bu4,huang2,qi3,chu3 +不遑宁处=bu4,huang2,ning2,chu3 +不道德=bu4,dao4,de2 +不重要=bu2,zhong4,yao4 +不错=bu4,cuo4 +不问事实真相=bu2,wen4,shi4,shi2,zhen1,xiang4 +不问年龄大小=bu2,wen4,nian2,ling3,da4,xiao3 +不问是非曲直=bu2,wen4,shi4,fei1,qu1,zhi2 +不间不界=bu4,gan1,bu4,ga4 +不闻不问=bu4,wen2,bu4,wen4 +不随以止=bu4,sui2,yi3,zhi3 +不露圭角=bu4,lu4,gui1,jiao3 +不露声色=bu4,lu4,sheng1,se4 +不露形色=bu4,lu4,xing2,se4 +不露神色=bu4,lu4,shen2,se4 +不露锋芒=bu4,lu4,feng1,mang2 +不露锋铓=bu4,lu4,feng1,hui4 +不顾一切=bu4,gu4,yi1,qie4 +与=yu3,yu4,yu2 +与世无争=yu2,shi4,wu2,zheng1 +与世沉浮=yu2,shi4,chen2,fu2 +与人为善=yu3,ren2,wei2,shan4 +与会=yu4,hui4 +与时消息=yu3,shi2,xiao1,xi4 +与民更始=yu3,ren2,geng1,shi3 
+与民除害=yu3,hu3,chu2,hai4 +与赛=yu4,sai4 +与闻=yu4,wen2 +丏=mian3 +丐=gai4 +丑=chou3 +丑媳妇总得见公婆=chou3,xi2,fu4,zong3,de5,jian4,gong1,po2 +丑态毕露=chou3,tai4,bi4,lu4 +丑相=chou3,xiang4 +丑角=chou3,jue2 +丒=chou3 +专=zhuan1 +专差=zhuan1,chai1 +专心一致=zhuan1,xin1,yi1,zhi4 +专横=zhuan1,heng4 +专横跋扈=zhuan1,heng4,ba2,hu4 +且=qie3,ju1 +且住为佳=qie3,zhu4,wei2,jia1 +且食蛤蜊=qie3,shi2,ha2,li2 +丕=pi1 +世=shi4 +丗=shi4 +丘=qiu1 +丙=bing3 +业=ye4 +丛=cong2 +东=dong1 +东三省=dong1,san1,xing3 +东墙处子=dong1,qiang2,chu3,zi3 +东扯西拽=dong1,che3,xi1,zhuai1 +东抹西涂=dong1,mo4,xi1,tu2 +东方将白=dong4,fang5,jiang4,bai4 +东猜西揣=dong1,cai1,xi1,chuai1 +东莞=dong1,guan3 +东西南北人=dong1,xi4,nan2,bei3,ren2 +东西南北客=dong1,xi4,nan2,bei3,ke4 +东观西望=dong1,guang1,xi1,wang4 +东躲西藏=dong1,duo3,xi1,cang2 +东量西折=dong1,liang4,xi1,she2 +东阿=dong1,e1 +丝=si1 +丝恩发怨=si1,en1,fa4,yuan4 +丞=cheng2 +丞相=cheng2,xiang4 +丟=diu1 +丠=qiu1 +両=liang3 +丢=diu1 +丢三落四=diu1,san1,la4,si4 +丢下耙儿弄扫帚=diu1,xia4,pa2,er5,nong4,sao4,zhou3 +丢卒保车=diu1,zu2,bao3,ju1 +丢面子=diu1,mian4,zi5 +丣=you3 +两=liang3 +两世为人=liang3,shi4,wei2,ren2 +两口子=liang3,kou3,zi5 +严=yan2 +严丝合缝=yan2,si1,he2,feng4 +严处=yan2,chu3 +严查=yan2,zha1 +严禁=yan2,jin4 +並=bing4 +丧=sang4,sang1 +丧乱=sang1,luan4 +丧事=sang1,shi4 +丧假=sang1,jia4 +丧偶=sang4,ou3 +丧失=sang4,shi1 +丧家=sang1,jia1 +丧家之犬=sang4,jia1,zhi1,quan3 +丧家之狗=sang4,jia1,zhi1,gou3 +丧家犬=sang4,jia1,quan3 +丧尽天良=sang4,jin4,tian1,liang2 +丧服=sang1,fu2 +丧权=sang4,quan2 +丧气=sang4,qi4 +丧生=sang4,sheng1 +丧礼=sang1,li3 +丧胆=sang4,dan3 +丧胆销魂=sang4,hun2,xiao1,hun2 +丧葬=sang1,zang4 +丧钟=sang1,zhong1 +丧魂落魄=sang4,hun2,luo4,po4 +丨=gun3 +丩=jiu1 +个=ge4,ge3 +个头儿=ge4,tou5,er5 +个子=ge4,zi5 +丫=ya1 +丫头=ya1,tou5 +丫杈=ya1,cha4 +丬=pan2 +中=zhong1,zhong4 +中伤=zhong4,shang1 +中卷=zhong1,juan4 +中奖=zhong4,jiang3 +中将=zhong1,jiang4 +中弹=zhong4,dan4 +中彩=zhong4,cai3 +中招=zhong4,zhao1 +中暑=zhong4,shu3 +中标=zhong4,biao1 +中毒=zhong4,du2 +中牟=zhong1,mu4 +中的=zhong1,di4 +中缝=zhong1,feng4 +中行=zhong1,hang2 +中计=zhong4,ji4 +中靶=zhong4,ba3 +中风=zhong4,feng1 +中风瘫痪=zhong4,feng1,tan1,huan4 +丮=ji3 +丯=jie4 +丰=feng1 +丰屋蔀家=feng1,wu1,zhi1,jia1 +丱=guan4,kuang4 +串=chuan4 
+丳=chan3 +临=lin2 +临危不惧=lin2,wei1,bu4,ju4 +临帖=lin2,tie4 +临敌易将=lin2,di2,yi4,jiang4 +临深履薄=lin2,shen1,lv3,bo2 +临难=lin2,nan4 +临难不恐=lin2,nan4,bu4,kong3 +临难不惧=lin2,nan4,bu4,ju3 +临难不避=lin2,nan2,bu4,bi4 +临难无慑=lin2,nan2,wu2,she4 +临难苟免=lin2,nan4,gou3,mian3 +临难铸兵=lin2,nan4,zhu4,bing1 +丵=zhuo2 +丶=zhu3 +丷=ba1 +丸=wan2 +丸子=wan2,zi5 +丹=dan1 +丹凤朝阳=dan1,feng4,chao2,yang2 +丹参=dan1,shen1 +丹心碧血=dan1,xin1,bi4,xue4 +为=wei4,wei2 +为主=wei2,zhu3 +为之语塞=wei2,zhi1,yu3,se4 +为五斗米折腰=wei4,wu3,dou3,mi3,zhe2,yao1 +为人=wei2,ren2 +为人作嫁=wei4,ren2,zuo4,jia4 +为人师表=wei2,ren2,shi1,biao3 +为人说项=wei4,ren2,shuo1,xiang4 +为什么=wei4,shen2,me5 +为仁不富=wei2,ren2,bu4,fu4 +为伍=wei2,wu3 +为啥=wei4,sha2 +为善最乐=wei2,shan4,zui4,le4 +为国为民=wei2,guo2,wei2,min2 +为国捐躯=wei4,guo2,juan1,qu1 +为好成歉=wei2,hao3,cheng2,qian4 +为害=wei2,hai4 +为富不仁=wei2,fu4,bu4,ren2 +为山止篑=wei2,shan1,zhi3,kui4 +为德不卒=wei2,de2,bu4,zu2 +为德不终=wei2,de2,bu4,zhong1 +为恶不悛=wei2,e4,bu4,quan1 +为患=wei2,huan4 +为所欲为=wei2,suo3,yu4,wei2 +为政=wei2,zheng4 +为数=wei2,shu4 +为时=wei2,shi2 +为时过早=wei2,shi2,guo4,zao3 +为期=wei2,qi1 +为止=wei2,zhi3 +为此=wei4,ci3 +为民父母=wei2,min2,fu4,mu3 +为法自弊=wei2,fa3,zi4,bi4 +为生=wei2,sheng1 +为着=wei2,zhe5 +为虎作伥=wei4,hu3,zuo4,chang1 +为虺弗摧=wei2,hui3,fu2,cui1 +为蛇添足=wei2,she2,tian1,zu2 +为蛇画足=wei2,she2,hua4,zu2 +为裘为箕=wei2,qiu2,wei2,ji1 +为限=wei2,xian4 +为难=wei2,nan2 +为非作恶=wei2,fei1,zuo4,e4 +为非作歹=wei2,fei1,zuo4,dai3 +为饥寒所迫=wei2,ji1,han2,suo3,po4 +为首=wei2,shou3 +为鬼为蜮=wei2,gui3,wei2,yu4 +主=zhu3 +主仆=zhu3,pu2 +主将=zhu3,jiang4 +主干=zhu3,gan4 +主干线=zhu3,gan4,xian4 +主角=zhu3,jue2 +主调=zhu3,diao4 +丼=jing3 +丽=li4,li2 +丽水=li2,shui3 +丽都=li4,du1 +举=ju3 +举不胜举=ju3,bu4,sheng4,ju3 +举手相庆=ju3,shou3,xiang1,qing4 +举措不当=ju3,cuo4,bu4,dang4 +丿=pie3 +乀=fu2 +乁=yi2,ji2 +乂=yi4 +乃=nai3 +乄=wu3 +久=jiu3 +久要不忘=jiu3,yao1,bu4,wang4 +乆=jiu3 +乇=tuo1,zhe2 +么=me5,mo2,ma5,yao1 +义=yi4 +义薄云天=yi4,bo2,yun2,tian1 +乊=yi1 +之=zhi1 +乌=wu1 +乌头白马生角=wu1,tou2,bai2,ma3,sheng1,jiao3 +乌拉=wu4,la5 +乍=zha4 +乎=hu1 +乏=fa2 +乏累=fa2,lei4 +乐=le4,yue4,yao4,lao4 +乐善好施=le4,shan4,hao4,shi1 +乐器=yue4,pu3 +乐团=yue4,tuan2 +乐坛=yue4,tan2 +乐子=le4,zi5 
+乐官=yue4,guan1 +乐山乐水=yao4,shan1,yao4,shui3 +乐工=yue4,gong1 +乐师=yue4,shi1 +乐府=yue4,fu3 +乐府诗=yue4,fu3,shi1 +乐律=yue4,lv4 +乐得=le4,de5 +乐感=yue4,gan3 +乐户=yue4,hu4 +乐曲=yue4,qu3 +乐歌=yue4,ge1 +乐毅=yue4,yi4 +乐池=yue4,chi2 +乐清=yue4,qing1 +乐章=yue4,zhang1 +乐舞=yue4,wu3 +乐谱=yue4,qi4 +乐迷=yue4,mi2 +乐道好古=le4,dao4,hao3,gu3 +乐队=yue4,dui4 +乐音=yue4,yin1 +乑=yin2 +乒=ping1 +乓=pang1 +乔=qiao2 +乕=hu3 +乖=guai1 +乗=cheng2,sheng4 +乘=cheng2,sheng4 +乘便=cheng2,bian4 +乘兴=cheng2,xing4 +乘势=cheng2,shi4 +乘壶=sheng4,hu2 +乘客=cheng2,ke4 +乘数=cheng2,shu4 +乘肥衣轻=cheng2,fei2,yi4,qing1 +乘舆=sheng4,yu2 +乘舆播越=cheng2,yu2,bo1,yue4 +乘间伺隙=cheng2,jian1,si4,xi4 +乘风兴浪=cheng2,feng1,xing1,lang4 +乘风破浪=cheng2,feng1,po4,lang4 +乙=yi3 +乚=hao2,yi3 +乛=yi3 +乜=mie1,nie4 +乜斜=nie4,xie2 +乜斜缠帐=nie4,xie2,chan2,zhang4 +九=jiu3 +九大行星=jiu3,da4,hang2,xing1 +九曲回肠=jiu3,qu1,hui2,chang2 +九死一生=jiu3,si3,yi1,sheng1 +九牛一毛=jiu3,niu2,yi1,mao2 +九牛拉不转=jiu3,niu2,la1,bu4,zhuan4 +九蒸三熯=jiu3,zheng1,san1,sheng1 +九行八业=jiu3,hang2,ba1,ye4 +九转功成=jiu3,zhuan4,gong1,cheng2 +九重霄=jiu3,chong2,xiao1 +九鼎不足为重=jiu3,ding3,bu4,zu2,wei2,zhong4 +乞=qi3 +乞降=qi3,xiang2 +也=ye3 +习=xi2 +习以为常=xi2,yi3,wei2,chang2 +习焉不察=xi1,yan1,bu4,cha2 +乡=xiang1 +乡曲=xiang1,qu1 +乢=gai4 +乣=jiu3 +乤=xia4 +书=shu1 +书卷=shu1,juan4 +书卷气=shu1,juan4,qi4 +书缺有间=shu1,que1,you3,jian4 +书背=shu1,bei4 +乧=dou3 +乨=shi3 +乩=ji1 +乪=nang2 +乫=jia1 +乬=ju4 +乭=shi2 +乮=mao3 +乯=hu1 +买=mai3 +买得起=mai3,de5,qi3 +买椟还珠=mai3,du2,huan2,zhu1 +乱=luan4 +乱七八糟=luan4,qi1,ba1,zao1 +乱作胡为=luan4,zuo4,hu2,wei2 +乱哄哄=luan4,hong3,hong3 +乱成一团=luan4,cheng2,yi1,tuan2 +乱箭攒心=luan4,jian4,cuan2,xin1 +乳=ru3 +乳晕=ru3,yun4 +乳臭=ru3,xiu4 +乳臭未除=ru3,chou4,wei4,chu2 +乴=xue2 +乵=yan3 +乶=fu3 +乷=sha1 +乸=na3 +乹=qian2 +乺=suo3 +乻=yu2 +乼=zhu4 +乽=zhe3 +乾=qian2,gan1 +乿=zhi4,luan4 +亀=gui1 +亁=qian2 +亂=luan4 +亃=lin3,lin4 +亄=yi4 +亅=jue2 +了=le5,liao3 +了不可见=liao3,bu4,ke3,jian4 +了不得=liao3,bu4,de2 +了不起=liao3,bu4,qi3 +了不长进=liao3,bu4,zhang3,jin3 +了了=liao3,liao3 +了了可见=liao3,liao3,ke3,jian4 +了事=liao3,shi4 +了却=liao3,que4 +了如指掌=liao3,ru2,zhi3,zhang3 +了局=liao3,ju2 +了当=liao3,dang4 +了得=liao3,de2 
+了悟=liao3,wu4 +了断=liao3,duan4 +了无=liao3,wu2 +了无惧色=liao3,wu2,ju4,se4 +了望台=liao4,wang4,tai2 +了然=liao3,ran2 +了然于胸=liao3,ran2,yu2,xiong1 +了然无闻=liao3,ran2,wu2,wen2 +了结=liao3,jie2 +了若指掌=liao3,ruo4,zhi3,zhang3 +了解=liao3,jie3 +了账=liao3,zhang4 +了身达命=liao3,shen1,da2,ming4 +亇=ge4,ma1 +予=yu2,yu3 +予人口实=yu3,ren2,kou3,shi2 +予以=yu3,yi3 +予取予求=yu2,qu3,yu2,qiu2 +予夺生杀=yu3,duo2,sheng1,sha1 +予齿去角=yu3,chi3,qu4,jiao3 +争=zheng1 +争得=zheng1,de5 +亊=shi4 +事=shi4 +事与心违=shi4,yu4,xin1,wei2 +事假=shi4,jia4 +事危累卵=shi4,wei1,lei4,luan3 +事后诸葛亮=shi4,hou4,zhu1,ge2,liang4 +事在人为=shi4,zai4,ren2,wei2 +二=er4 +二万五千里长征=er4,wan4,wu3,qian1,li3,chang2,zheng1 +二人转=er4,ren2,zhuan4 +二十八宿=er4,shi2,ba1,xiu4 +二竖为虐=er4,shu4,wei2,nve4 +二重=er4,chong2 +二重奏=er4,chong2,zou4 +二重性=er4,chong2,xing4 +二重根=er4,chong2,gen1 +亍=chu4 +于=yu2 +于今为烈=yu2,jin1,wei2,lie4 +于家为国=yu2,jia1,wei2,guo2 +于思=yu2,sai1 +亏=kui1 +亏得=kui1,de5 +亏累=kui1,lei3 +亐=yu2 +云=yun2 +云兴霞蔚=yun2,xing1,xia2,wei4 +云屯席卷=yun2,tun2,xi2,juan3 +云朝雨暮=yun2,zhao1,yu3,mu4 +云泥之差=yun2,ni2,zhi1,cha1 +互=hu4 +互为因果=hu4,wei2,yin1,guo4 +互为表里=hu4,wei2,biao3,li3 +互见=hu4,xian4 +亓=qi2 +五=wu3 +五侯七贵=wu3,hou4,qi1,gui4 +五侯蜡烛=wu3,hou4,la4,zhu2 +五斗折腰=wu3,dou3,zhe2,yao1 +五斗柜=wu3,dou3,gui4 +五斗橱=wu3,dou3,chu2 +五方杂处=wu3,fang1,za2,chu3 +五更=wu3,geng1 +五石六鹢=wu3,shi2,liu4,yi1 +五羖大夫=wu3,gu3,da4,fu1 +五脊六兽=wu3,ji2,liu4,shou4 +五色相宣=wu3,se4,xiang1,xuan1 +五行八作=wu3,hang2,ba1,zuo4 +五行并下=wu3,hang2,bing4,xia4 +五行生克=wu3,xing2,sheng1,ke4 +五金行=wu3,jin1,hang2 +五陵年少=wu3,ling2,nian2,shao4 +井=jing3 +井底虾蟆=jing3,di3,xia1,ma2 +井底蛤蟆=jing3,di3,ha2,ma2 +亖=si4 +亗=sui4 +亘=gen4 +亘古奇闻=gen4,gu3,qi1,wen2 +亙=gen4 +亚=ya4 +亚得里亚海=ya4,de5,li3,ya4,hai3 +亚肩叠背=ya4,jian1,die2,bei4 +亚肩迭背=ya4,jian1,die2,bei4 +些=xie1,suo4 +亜=ya4 +亝=qi2,zhai1 +亞=ya4,ya1 +亟=ji2,qi4 +亠=tou2 +亡=wang2,wu2 +亡国大夫=wang2,guo2,da4,fu1 +亡魂失魄=wang2,hun2,shi1,hun2 +亢=kang4 +亣=da4 +交=jiao1 +交卷=jiao1,juan4 +交响乐=jiao1,xiang3,yue4 +交差=jiao1,chai1 +交恶=jiao1,wu4 +交白卷=jiao1,bai2,juan4 +交行=jiao1,hang2 +交还=jiao1,huan2 +交通梗塞=jiao1,tong1,geng3,se4 +亥=hai4 +亦=yi4 +产=chan3 +产假=chan3,jia4 
+亨=heng1,peng1 +亩=mu3 +亪=ye5 +享=xiang3 +京=jing1 +京片子=jing1,pian4,zi5 +京都=jing1,du1 +亭=ting2 +亭子=ting2,zi5 +亮=liang4 +亮相=liang4,xiang4 +亯=xiang3 +亰=jing1 +亱=ye4 +亲=qin1,qing4 +亲切=qin1,qie4 +亲家=qing4,jia1 +亲密无间=qin1,mi4,wu2,jian4 +亳=bo2 +亴=you4 +亵=xie4 +亶=dan3,dan4 +亷=lian2 +亸=duo3 +亹=wei3,men2 +亹亹不倦=tan1,wei3,bu4,juan4 +人=ren2 +人不可以貌相=ren2,bu4,ke3,yi3,mao4,xiang4 +人不可貌相=ren2,bu4,ke3,mao4,xiang4 +人中狮子=ren2,zhong1,shi1,zi3 +人为=ren2,wei2 +人为刀俎=ren2,wei2,dao1,zu3 +人事不省=ren2,shi4,bu4,xing3 +人们=ren2,men5 +人参=ren2,shen1 +人口总数=ren2,kou3,zong3,shu4 +人多阙少=ren2,duo1,que4,shao3 +人心向背=ren2,xin1,xiang4,bei4 +人才难得=ren2,cai2,cai2,de2 +人数=ren2,shu4 +人数众多=ren2,shu4,zhong4,duo1 +人模狗样=ren2,mu2,gou3,yang4 +人涉卬否=ren2,she4,ang2,fou3 +人满为患=ren2,man3,wei2,huan4 +人生朝露=ren2,sheng1,chao2,lu4 +人生自古谁无死=ren2,sheng1,zi4,gu3,shui2,wu2,si3 +人给家足=ren2,ji3,jia1,zu2 +人自为战=ren2,zi4,wei2,zhan4 +人自为政=ren2,zi4,wei2,zheng4 +人行道=ren2,xing2,dao4 +人言藉藉=ren2,yan2,ji2,ji2 +人足家给=ren2,zu2,jia1,ji3 +人轧人=ren2,ga2,ren2 +亻=ren2 +亼=ji2 +亽=ji2 +亾=wang2 +亿=yi4 +什=shen2,shi2 +什么=shen2,me5 +什件儿=shi2,jian4,er2 +什伍东西=shi2,wu3,dong1,xi1 +什围伍攻=shi2,wei2,wu3,gong1 +什物=shi2,wu4 +什袭以藏=shi2,xi2,yi3,cang2 +什袭珍藏=shi2,xi2,zhen1,cang2 +什袭而藏=shi2,xi2,er2,cang2 +什锦=shi2,jin3 +仁=ren2 +仂=le4 +仃=ding1 +仄=ze4 +仅=jin3,jin4 +仆=pu1,pu2 +仆人=pu2,ren2 +仆仆亟拜=pu2,pu2,ji2,bai4 +仆仆道途=pu2,pu2,dao4,tu2 +仆仆风尘=pu2,pu2,feng1,chen2 +仆从=pu2,cong2 +仆妇=pu2,fu4 +仆役=pu2,yi4 +仇=chou2,qiu2 +仇姓=qiu2,xing4 +仈=ba1 +仉=zhang3 +今=jin1 +今朝=jin1,zhao1 +今朝有酒今朝醉=jin1,zhao1,you3,jiu3,jin1,zhao1,zui4 +介=jie4 +仌=bing1 +仍=reng2 +从=cong2,zong4 +从一而终=cong2,yi1,er2,zhong1 +从从容容=cong2,cong2,rong2,rong2 +从俗就简=cong2,su2,jiu4,jia3 +从容不迫=cong2,rong2,bu4,po4 +从容自在=cong1,rong2,zi4,zai4 +从容自若=cong2,rong2,zi4,ruo4 +仏=fo2 +仐=jin1,san3 +仑=lun2 +仒=bing1 +仓=cang1 +仓卒=cang1,cu4 +仓卒主人=cang1,cu4,zhu3,ren2 +仓卒之际=cang1,cu4,zhi1,ji4 +仔=zi3,zi1,zai3 +仔猪=zai3,zhu1 +仔鸡=zai3,ji1 +仕=shi4 +他=ta1 +他们=ta1,men5 +他们俩=ta1,men1,lia3 +他们的=ta1,men5,de5 +他们自己=ta1,men5,zi4,ji3 +他们自己的=ta1,men5,zi4,ji3,de5 
+他处=ta1,chu3 +仗=zhang4 +付=fu4 +付之一炬=fu4,zhi1,yi1,ju4 +仙=xian1 +仙露明珠=xian1,lu4,ming2,zhu1 +仚=xian1 +仛=tuo1,cha4,duo2 +仜=hong2 +仝=tong2 +仞=ren4 +仟=qian1 +仠=gan3,han4 +仡=yi4,ge1 +仡佬族=ge1,lao3,zu2 +仢=bo2 +代=dai4 +代为=dai4,wei2 +代为说项=dai4,wei2,shuo1,xiang4 +代人说项=dai4,ren2,shuo1,xiang4 +代数=dai4,shu4 +代数和=dai4,shu4,he2 +代数式=dai4,shu4,shi4 +代数方程=dai4,shu4,fang1,cheng2 +令=ling4,ling2,ling3 +令人发指=ling4,ren2,fa4,zhi3 +令人捧腹=ling4,ren2,peng3,fu3 +令原之戚=ling2,yuan2,zhi1,qi1 +令狐=ling2,hu2 +以=yi3 +以一当十=yi3,yi1,dang1,shi2 +以不济可=yi3,fou3,ji4,ke3 +以为=yi3,wei2 +以书为御=yi3,shu1,wei2,yu4 +以人为鉴=yi3,ren2,wei2,jian4 +以人为镜=yi3,ren2,wei2,jing4 +以冠补履=yi3,guan1,bu3,lv3 +以利累形=yi3,li4,lei3,xing2 +以升量石=yi3,sheng1,liang2,dan4 +以古为鉴=yi3,gu3,wei2,jian4 +以古为镜=yi3,gu3,wei2,jing4 +以夜继朝=yi3,ye4,ji4,zhao1 +以大恶细=yi3,da4,wu4,xi4 +以天下为己任=yi3,tian1,xia4,wei2,ji3,ren4 +以守为攻=yi3,shou3,wei2,gong1 +以宫笑角=yi3,gong1,xiao4,jue2 +以己度人=yi3,ji3,duo2,ren2 +以微知着=yi3,wei1,zhi1,zhu4 +以忍为阍=yi3,ren3,wei2,hun1 +以意为之=yi3,yi4,wei2,zhi1 +以慎为键=yi3,shen4,wei2,jian4 +以攻为守=yi3,gong1,wei2,shou3 +以日为年=yi3,ri4,wei2,nian2 +以毁为罚=yi3,hui3,wei2,fa2 +以毛相马=yi3,mao2,xiang4,ma3 +以水洗血=yi3,shui3,xi3,xue4 +以水济水=yi3,shui3,ji3,shui3 +以法为教=yi3,fa3,wei2,jiao4 +以泽量尸=yi3,ze2,liang2,shi1 +以牙还牙=yi3,ya2,huan2,ya2 +以疏间亲=yi3,shu1,jian4,qin1 +以白为黑=yi3,bai2,wei2,hei1 +以眼还眼=yi3,yan3,huan2,yan3 +以筌为鱼=yi3,quan2,wei2,yu2 +以紫为朱=yi3,zi3,wei2,zhu1 +以耳为目=yi3,er3,wei2,mu4 +以茶当酒=yi3,cha2,dang4,jiu3 +以蠡测海=yi3,li2,ce4,hai3 +以血洗血=yi3,xue4,xi3,xue4 +以规为瑱=yi3,gui1,wei2,tian4 +以言为讳=yi3,yan2,wei2,hui4 +以誉为赏=yi3,yu4,wei2,shang3 +以讹传讹=yi3,e2,chuan2,e2 +以身许国=yi3,sheng1,xu3,guo2 +以还=yi3,huan2 +以退为进=yi3,tui4,wei2,jin4 +以邻为壑=yi3,lin2,wei2,he4 +以郄视文=yi3,xi4,shi4,wen2 +以鹿为马=yi3,lu4,wei2,ma3 +以黑为白=yi3,hei1,wei2,bai2 +仦=chao4 +仧=chang2,zhang3 +仨=sa1 +仩=chang2 +仪=yi2 +仫=mu4 +们=men2 +仭=ren4 +仮=fan3 +仯=chao4,miao3 +仰=yang3,ang2 +仰不愧天=yang3,bu4,kui4,tian1 +仰事俯畜=yang3,shi4,fu3,xu4 +仰屋着书=yang3,wu1,zhu4,shu1 +仰给=yang3,ji3 +仱=qian2 +仲=zhong4 +仳=pi3,pi2 +仴=wo4 +仵=wu3 +件=jian4 +件数=jian4,shu4 
+价=jia4,jie4,jie5 +仸=yao3,fo2 +仹=feng1 +仺=cang1 +任=ren4,ren2 +任丘=ren2,qiu1 +任姓=ren2,xing4 +任达不拘=ren4,lao2,bu4,ju1 +仼=wang2 +份=fen4,bin1 +份子=fen4,zi5 +仾=di1 +仿=fang3 +仿佛=fang3,fu2 +伀=zhong1 +企=qi3 +伂=pei4 +伃=yu2 +伄=diao4 +伅=dun4 +伆=wen3 +伇=yi4 +伈=xin3 +伉=kang4 +伊=yi1 +伋=ji2 +伌=ai4 +伍=wu3 +伎=ji4,qi2 +伏=fu2 +伏帖=fu2,tie1 +伏瘕=fu3,jia3 +伏而咶天=fu2,er2,shi4,tian1 +伏虎降龙=fu2,hu3,xiang2,long2 +伐=fa2 +休=xiu1,xu3 +休假=xiu1,jia4 +伒=jin4,yin2 +伓=pi1 +伔=dan3 +伕=fu1 +伖=tang3 +众=zhong4 +众啄同音=zhong4,zhou4,tong2,yin1 +众好众恶=zhong4,hao4,zhong4,wu4 +众数=zhong4,shu4 +众星攒月=zhong4,xing1,cuan2,yue4 +众毛攒裘=zhong4,mao2,cuan2,qiu2 +众生相=zhong4,sheng1,xiang4 +众矢之的=zhong4,shi3,zhi1,di4 +优=you1 +优孟衣冠=you1,meng4,yi1,guan1 +伙=huo3 +会=hui4,kuai4 +会儿=hui4,er5 +会计=kuai4,ji4 +伛=yu3 +伛偻=yu3,lv3 +伜=cui4 +伝=yun2 +伞=san3 +伟=wei3 +传=chuan2,zhuan4 +传为佳话=chuan2,wei2,jia1,hua4 +传为笑柄=chuan2,wei2,xiao4,bing3 +传为笑谈=chuan2,wei2,xiao4,tan2 +传为美谈=chuan2,wei2,mei3,tan2 +传单炸弹=chuan2,dan1,zha4,dan4 +传承=chuan2,cheng2 +传略=zhuan4,lve4 +传纸条=chuan2,chi3,tiao2 +传记=zhuan4,ji4 +传说=chuan2,shuo1 +传问=chuan2,wen4 +传闻=chuan2,wen2 +传风扇火=chuan2,feng1,shan1,huo3 +传风搧火=chuan2,feng1,you3,huo3 +伡=che1,ju1 +伢=ya2 +伣=qian4 +伤=shang1 +伤言扎语=shang1,yan2,zha1,yu3 +伥=chang1 +伦=lun2 +伧=cang1,chen5 +伨=xun4 +伩=xin4 +伪=wei3 +伪君子=wei3,jun1,zi3 +伫=zhu4 +伬=chi3 +伭=xian2,xuan2 +伮=nu2,nu3 +伯=bo2,bai3,ba4 +伯乐一顾=bo1,le4,yi1,gu4 +伯乐相马=bo2,le4,xiang4,ma3 +伯伯=bo2,bo5 +估=gu1,gu4 +估摸=gu1,mo5 +估衣=gu4,yi1 +估量=gu1,liang5 +伱=ni3 +伲=ni3,ni4 +伳=xie4 +伴=ban4 +伵=xu4 +伶=ling2 +伷=zhou4 +伸=shen1 +伸手不见五指=shen1,shou3,bu4,jian4,wu3,zhi3 +伸曲=shen1,qu1 +伹=qu1 +伺=ci4,si4 +伺侯=ci4,hou4 +伺候=ci4,hou4 +伺机=si4,ji1 +伺瑕导蠙=si4,xia2,dao3,pin2 +伺瑕导隙=si4,xia2,dao3,xi4 +伺瑕抵蠙=si4,xia2,di3,pin2 +伺瑕抵隙=si4,xia2,di3,xi4 +伺隙=si4,xi4 +伻=beng1 +似=si4,shi4 +似的=shi4,de5 +伽=jia1,qie2,ga1 +伽蓝=qie2,lan2 +伽马=ga1,ma3 +伾=pi1 +伿=yi4 +佀=si4 +佁=yi3,chi4 +佂=zheng1 +佃=dian4,tian2 +佄=han1,gan4 +佅=mai4 +但=dan4 +佇=zhu4 +佈=bu4 +佉=qu1 +佊=bi3 +佋=zhao1,shao4 +佌=ci3 +位=wei4 +低=di1 +低唱浅斟=di1,chang4,qian3,zhen1 
+低唱浅酌=di4,chang4,qian3,zhuo2 +低情曲意=di1,qing2,qu1,yi4 +低血压=di1,xue4,ya1 +低调=di1,diao4 +住=zhu4 +住一宿=zhu4,yi4,xiu3 +佐=zuo3 +佑=you4 +佒=yang3 +体=ti3,ti1 +体己=ti1,ji5 +体胀系数=ti3,zhang4,xi4,shu4 +佔=zhan4,dian1 +何=he2,he1,he4 +何乐不为=he2,le4,bu4,wei2 +何乐而不为=he2,le4,er2,bu4,wei2 +何处=he2,chu3 +何所不为=he2,suo3,bu4,wei2 +何曾=he2,zeng1 +何足为奇=he2,zu2,wei2,qi2 +佖=bi4 +佗=tuo2 +佘=she2 +余=yu2 +余勇可贾=yu2,yong3,ke3,gu3 +余数=yu2,shu4 +佚=yi4,die2 +佛=fo2,fu2,bi4,bo2 +佛头著粪=fo2,tou2,zhuo2,fen4 +作=zuo4 +作为=zuo4,wei2 +作乐=zuo4,yue4 +作兴=zuo4,xing1 +作坊=zuo1,fang5 +作嫁衣裳=zuo4,jia4,yi1,shang1 +作弄=zuo1,nong4 +作数=zuo4,shu4 +作歹为非=zuo4,dai3,wei2,fei1 +作浪兴风=zuo4,lang4,xing1,feng1 +佝=gou1,kou4 +佝偻=gou1,lou2 +佝偻着腰=gou1,lou2,zhe5,yao1 +佞=ning4 +佟=tong2 +你=ni3 +你们=ni3,men5 +你们俩=ni3,men1,lia3 +你们的=ni3,men5,de5 +你们自己=ni3,men5,zi4,ji3 +你们自己的=ni3,men5,zi4,ji3,de5 +佡=xian1 +佢=qu2 +佣=yong1,yong4 +佣中佼佼=yong4,zhong1,jiao3,jiao3 +佣人=yong4,ren2 +佣工=yong4,gong1 +佣金=yong4,jin1 +佣钱=yong4,qian2 +佤=wa3 +佥=qian1 +佦=you4 +佧=ka3 +佨=bao1 +佩=pei4 +佪=hui2,huai2 +佫=ge2 +佬=lao3 +佭=xiang2 +佮=ge2 +佯=yang2 +佰=bai3 +佱=fa3 +佲=ming3 +佳=jia1 +佳人薄命=jia1,ren2,bo2,ming4 +佴=er4,nai4 +併=bing4 +佶=ji2 +佷=hen3 +佸=huo2 +佹=gui3 +佺=quan2 +佻=tiao1 +佻薄=tiao1,bo2 +佼=jiao3 +佽=ci4 +佾=yi4 +使=shi3 +使出浑身解数=shi3,chu1,hun2,shen1,xie4,shu4 +使得=shi3,de5 +使性子=shi3,xing4,zi5 +使羊将狼=shi3,yang2,jiang4,lang2 +侀=xing2 +侁=shen1 +侂=tuo1 +侃=kan3 +侃大山=kan3,tai4,shan1 +侄=zhi2 +侄儿=zhi2,er5 +侄女儿=zhi2,nv3,er5 +侄子=zhi2,zi5 +侅=gai1 +來=lai2 +侇=yi2 +侈=chi3 +侉=kua3 +侉子=kua3,zi5 +侊=gong1 +例=li4 +例假=li4,jia4 +例子=li4,zi5 +例直禁简=li4,zhi2,jin4,jian3 +侌=yin1 +侍=shi4 +侍应生=shi4,ying4,sheng1 +侎=mi3 +侏=zhu1 +侏儒一节=zhu1,ru3,yi1,jie2 +侏儒观戏=zhu1,ru3,guan1,xi4 +侐=xu4 +侑=you4 +侒=an1 +侓=lu4 +侔=mou2 +侔色揣称=mou2,se4,chuai3,chen4 +侕=er2 +侖=lun2 +侗=dong4,tong2,tong3 +侘=cha4 +侙=chi4 +侚=xun4 +供=gong4,gong1 +供不应求=gong1,bu4,ying4,qiu2 +供人=gong1,ren2 +供佛=gong4,fo2 +供养=gong4,yang3 +供品=gong4,pin3 +供奉=gong4,feng4 +供孩子上学=gong1,hai2,zi5,shang4,xue2 +供应=gong1,ying4 +供应站=gong1,ying4,zhan4 +供料=gong1,liao4 
+供旅客休息=gong1,lv3,ke4,xiu1,xi1 +供暖=gong1,nuan3 +供果=gong4,guo3 +供桌=gong4,zhuo1 +供欣赏=gong1,xin1,shang3 +供气=gong1,qi4 +供水=gong1,shui3 +供求=gong1,qiu2 +供状=gong4,zhuang4 +供献=gong4,xian4 +供电=gong1,dian4 +供给=gong1,ji3 +供职=gong4,zhi2 +供认=gong4,ren4 +供认不讳=gong4,ren4,bu4,hui4 +供词=gong4,ci2 +供读者参考=gong1,du2,zhe3,can1,kao3 +供过于求=gong1,guo4,yu2,qiu2 +供销=gong1,xiao1 +供需=gong1,xu1 +侜=zhou1 +侜张为幻=zhou1,zhang1,wei2,huan4 +依=yi1 +依丱附木=yi1,kuang4,fu4,mu4 +依头缕当=yi1,tou2,lv3,dang4 +依阿取容=yi1,e1,qu3,rong2 +侞=ru2 +侟=cun2,jian4 +侠=xia2 +価=si4 +侢=dai4 +侣=lv3 +侤=ta5 +侥=jiao3,yao2 +侥幸=jiao3,xing4 +侦=zhen1 +侦查=zhen1,zha1 +侦缉=zhen1,ji1 +侧=ce4,ze4,zhai1 +侧歪=zhai1,wai1 +侨=qiao2 +侩=kuai4 +侪=chai2 +侫=ning4 +侬=nong2 +侭=jin3 +侮=wu3 +侯=hou2,hou4 +侰=jiong3 +侱=cheng3,ting3 +侲=zhen4,zhen1 +侳=zuo4 +侴=hao4 +侵=qin1 +侵占=qin1,zhan1 +侶=lv3 +侷=ju2 +侸=shu4,dou1 +侹=ting3 +侺=shen4 +侻=tuo2,tui4 +侼=bo2 +侽=nan2 +侾=xiao1 +便=bian4,pian2 +便了=bian4,liao3 +便人=bian4,ren2 +便佞=pian2,ning4 +便嬖=pian2,bi4 +便宜=pian2,yi5 +便宜从事=bian4,yu2,cong2,shi4 +便宜行事=bian4,yi2,xing2,shi4 +便宜货=bian4,yi2,huo4 +便当=bian4,dang4 +便溺=bian4,niao4 +便血=bian4,xie3 +俀=tui3 +俁=yu3 +係=xi4 +促=cu4 +俄=e2 +俅=qiu2 +俆=xu2 +俇=guang4 +俈=ku4 +俉=wu4 +俊=jun4 +俋=yi4 +俌=fu3 +俍=liang2 +俎=zu3 +俏=qiao4,xiao4 +俏头=qiao4,tou5 +俐=li4 +俑=yong3 +俒=hun4 +俓=jing4 +俔=qian4 +俕=san4 +俖=pei3 +俗=su2 +俘=fu2 +俙=xi1 +俚=li3 +俛=fu3 +俛拾地芥=bi4,shi2,di4,jie4 +俛首帖耳=ma3,shou3,tie1,er3 +俜=ping1 +保=bao3 +保不住=bao3,bu2,zhu4 +保得住=bao3,de5,zhu4 +俞=yu2,yu4,shu4 +俟=si4,qi2 +俠=xia2 +信=xin4,shen1 +信号弹=xin4,hao4,dan4 +信差=xin4,chai1 +信皮儿=xin4,pi2,er5 +俢=xiu1 +俣=yu3 +俤=di4 +俥=che1,ju1 +俦=chou2 +俧=zhi4 +俨=yan3 +俩=liang3,lia3 +俩人=lia3,ren2 +俪=li4 +俫=lai2 +俬=si1 +俭=jian3 +俭不中礼=jian3,bu4,zhong4,li3 +俭朴=jian3,pu3 +修=xiu1 +俯=fu3 +俯首帖耳=fu3,shou3,tie1,er3 +俰=huo4 +俱=ju4 +俲=xiao4 +俳=pai2 +俴=jian4 +俵=biao4 +俶=chu4,ti4 +俷=fei4 +俸=feng4 +俹=ya4 +俺=an3 +俺们=an3,men5 +俻=bei4 +俼=yu4 +俽=xin1 +俾=bi3 +俿=hu3,chi2 +倀=chang1 +倁=zhi1 +倂=bing4 +倃=jiu4 +倄=yao2 +倅=cui4,zu2 +倆=liang3,lia3 +倇=wan3 +倈=lai2 +倉=cang1 +倊=zong3 +個=ge4,ge3 
+倌=guan1 +倍=bei4 +倍数=bei4,shu4 +倎=tian3 +倏=shu1 +倐=shu1 +們=men2 +倒=dao3,dao4 +倒不=dao4,bu4 +倒不是=dao4,bu2,shi4 +倒不错=dao4,bu2,cuo4 +倒也=dao4,ye3 +倒产=dao4,chan3 +倒仓=dao3,cang1 +倒伏=dao3,fu3 +倒像=dao4,xiang4 +倒儿爷=dao3,er5,ye2 +倒冠落佩=dao3,guan1,luo4,pei4 +倒出=dao4,chu1 +倒出去=dao4,chu1,qu4 +倒出来=dao4,chu1,lai2 +倒刺=dao4,ci4 +倒剪=dao4,jian3 +倒卖违禁品=dao3,mai4,wei2,jin4,pin3 +倒卵形=dao4,luan3,xing2 +倒卷=dao4,juan3 +倒反=dao4,fan3 +倒叙=dao4,xu4 +倒吊=dao4,diao4 +倒向=dao4,xiang4 +倒嗓=dao3,sang3 +倒嚼=dao3,jiao4 +倒因为果=dao3,yin1,wei2,guo3 +倒圈=dao3,juan4 +倒在地上=dao3,zai4,di4,shang5 +倒垂=dao4,chui2 +倒垃圾=dao4,la1,ji1 +倒头便睡=dao4,tou2,bian4,shui4 +倒好=dao4,hao3 +倒好儿=dao4,hao3,er5 +倒屣相迎=dao4,xi3,xiang1,ying2 +倒带=dao4,dai4 +倒带键=dao4,dai4,jian4 +倒序=dao4,xu4 +倒序词典=dao4,xu4,ci2,dian3 +倒彩=dao4,cai3 +倒影=dao4,ying3 +倒悬=dao4,xuan2 +倒憋气=dao4,bie1,qi4 +倒戈=dao3,ge1 +倒打一瓦=dao4,da3,yi1,wa3 +倒打一耙=dao4,da3,yi1,pa2 +倒找=dao4,zhao3 +倒抽一口冷气=dao4,chou1,yi4,kou3,leng3,qi4 +倒持太阿=dao4,chi2,tai4,e1 +倒持泰阿=dao4,chi2,tai4,e1 +倒挂=dao4,gua4 +倒接=dao4,jie1 +倒插=dao4,cha1 +倒插门=dao4,cha1,men2 +倒收付息=dao4,shou1,fu4,xi1 +倒放=dao4,fang4 +倒数=dao4,shu4 +倒数式=dao4,shu4,shi4 +倒数比=dao4,shu4,bi3 +倒数计时=dao4,shu3,ji4,shi2 +倒映=dao4,ying4 +倒春寒=dao4,chun1,han2 +倒是=dao4,shi4 +倒板=dao3,ban3 +倒果为因=dao4,guo3,wei2,yin1 +倒栽葱=dao4,zai1,cong1 +倒档=dao4,dang4 +倒欠=dao4,qian4 +倒水=dao4,shui3 +倒流=dao4,liu2 +倒灌=dao4,guan4 +倒片=dao4,pian4 +倒片机=dao4,pian4,ji1 +倒牌子=dao3,pai2,zi5 +倒睫=dao4,jie2 +倒空=dao4,kong1 +倒立=dao4,li4 +倒算=dao4,suan4 +倒粪=dao4,fen4 +倒绷孩儿=dao4,beng1,hai2,er2 +倒置=dao4,zhi4 +倒背如流=dao4,bei4,ru2,liu2 +倒背手=dao4,bei4,shou3 +倒脉冲=dao4,mai4,chong1 +倒苦水=dao4,ku3,shui3 +倒茶=dao4,cha2 +倒虹吸管=dao4,hong2,xi1,guan3 +倒血霉=dao3,xue4,mei2 +倒行逆施=dao4,xing2,ni4,shi1 +倒装=dao4,zhuang1 +倒装句=dao4,zhuang1,ju4 +倒装词序=dao4,zhuang1,ci2,xu4 +倒裳索领=dao4,chang2,suo3,ling3 +倒许=dao4,xu3 +倒读数=dao4,du2,shu4 +倒账卷逃=dao3,zhang4,juan3,tao3 +倒贴=dao4,tie1 +倒赔=dao4,pei2 +倒踏门=dao4,ta4,men2 +倒车=dao4,che1 +倒转=dao4,zhuan4 +倒转来说=dao4,zhuan3,lai2,shuo1 +倒轮闸=dao4,lun2,zha2 +倒载干戈=dao4,zai4,gan1,ge1 +倒过儿=dao4,guo4,er2 
+倒过去=dao4,guo4,qu4 +倒过来=dao4,guo4,lai2 +倒退=dao4,tui4 +倒退一步=dao4,tui4,yi2,bu4 +倒酒=dao4,jiu3 +倒金字塔=dao4,jin1,zi4,ta3 +倒锁=dao4,suo3 +倒锁上门=dao4,suo3,shang4,men2 +倒风=dao4,feng1 +倒飞=dao4,fei1 +倓=tan2,tan4 +倔=jue4,jue2 +倔头倔脑=jue4,tou2,jue4,nao3 +倔头强脑=jue4,tou2,jiang4,nao3 +倔强=jue2,jiang4 +倕=chui2 +倖=xing4 +倗=peng2 +倘=tang3,chang2 +倘佯=chang2,yang2 +候=hou4 +倚=yi3 +倛=qi1 +倜=ti4 +倝=gan4 +倞=liang4,jing4 +借=jie4 +借尸还阳=jie4,shi1,huan2,yang2 +借尸还魂=jie4,shi1,huan2,hun2 +借调=jie4,diao4 +倠=sui1 +倡=chang4,chang1 +倡条冶叶=chang1,tiao2,ye3,ye4 +倡而不和=chang4,er2,bu4,he4 +倢=jie2 +倣=fang3 +値=zhi2 +倥=kong1,kong3 +倥侗=kong1,tong2 +倦=juan4 +倦鸟知还=juan4,niao3,zhi1,huan2 +倧=zong1 +倨=ju4 +倩=qian4 +倪=ni2 +倫=lun2 +倬=zhuo1 +倭=wo1,wei1 +倮=luo3 +倯=song1 +倰=leng4 +倱=hun4 +倲=dong1 +倳=zi4 +倴=ben4 +倵=wu3 +倶=ju4 +倷=nai3 +倸=cai3 +倹=jian3 +债=zhai4 +债务重组=zhai4,wu4,chong2,zu3 +倻=ye1 +值=zhi2 +值当=zhi2,dang4 +值得=zhi2,de5 +值得一提=zhi2,de2,yi4,ti2 +倽=sha4 +倾=qing1 +倾筐倒庋=qing1,kuang1,dao4,gui3 +倾筐倒箧=qing1,kuang1,dao4,qie4 +倾箱倒箧=qing1,xiang1,dao4,qie4 +倾肠倒肚=qing1,chang2,dao4,du3 +倿=ning4 +偀=ying1 +偁=cheng1,chen4 +偂=qian2 +偃=yan3 +偃旗仆鼓=yan3,qi2,pu2,gu3 +偃武兴文=yan3,wu3,xing1,wen2 +偃革为轩=yan3,ge2,wei2,xuan1 +偄=ruan3 +偅=zhong4,tong2 +偆=chun3 +假=jia3,jia4 +假分数=jia3,fen1,shu4 +假日=jia4,ri4 +假期=jia4,qi1 +假条=jia4,tiao2 +假洋鬼子=jia3,yang2,gui3,zi5 +偈=ji4,jie2 +偈语=ji4,yu3 +偉=wei3 +偊=yu3 +偋=bing3,bing4 +偌=ruo4 +偍=ti2 +偎=wei1 +偎干就湿=wei1,gan4,jiu4,shi1 +偏=pian1 +偏差=pian1,cha1 +偏裨=pian1,pi2 +偐=yan4 +偑=feng1 +偒=tang3,dang4 +偓=wo4 +偔=e4 +偕=xie2 +偖=che3 +偗=sheng3 +偘=kan3 +偙=di4 +做=zuo4 +做好事=zuo4,hao3,shi4 +做衣服=zuo4,yi1,fu5 +偛=cha1 +停=ting2 +停当=ting2,dang4 +停泊=ting2,bo2 +停留长智=ting2,liu2,zhang3,zhi4 +偝=bei4 +偞=xie4 +偟=huang2 +偠=yao3 +偡=zhan4 +偢=chou3,qiao4 +偣=an1 +偤=you2 +健=jian4 +健将=jian4,jiang4 +偦=xu1 +偧=zha1 +偨=ci1 +偩=fu4 +偪=bi1 +偫=zhi4 +偬=zong3 +偭=mian3 +偮=ji2 +偯=yi3 +偰=xie4 +偱=xun2 +偲=cai1,si1 +偳=duan1 +側=ce4,ze4,zhai1 +偵=zhen1 +偶=ou3 +偶一为之=ou3,yi1,wei2,zhi1 +偷=tou1 +偷空=tou1,kong4 +偷鸡不着蚀把米=tou1,ji1,bu4,zhao2,shi2,ba3,mi3 +偸=tou1 +偹=bei4 
+偺=zan2,za2,za3 +偻=lv3,lou2 +偼=jie2 +偽=wei3 +偾=fen4 +偿=chang2 +偿还=chang2,huan2 +傀=kui3,gui1 +傁=sou3 +傂=zhi4,si1 +傃=su4 +傄=xia1 +傅=fu4 +傆=yuan4,yuan2 +傇=rong3 +傈=li4 +傉=nu4 +傊=yun4 +傋=jiang3,gou4 +傌=ma4 +傍=bang4 +傍若无人=pang2,ruo4,wu2,ren2 +傎=dian1 +傏=tang2 +傐=hao4 +傑=jie2 +傒=xi1,xi4 +傓=shan1 +傔=qian4,jian1 +傕=que4,jue2 +傖=cang1,chen5 +傗=chu4 +傘=san3 +備=bei4 +傚=xiao4 +傛=rong2 +傜=yao2 +傝=ta4,tan4 +傞=suo1 +傟=yang3 +傠=fa2 +傡=bing4 +傢=jia1 +傣=dai3 +傤=zai4 +傥=tang3 +傦=gu3 +傧=bin1 +傧相=bin1,xiang4 +储=chu3 +储备=chu3,bei4 +储存=chu3,cun2 +储蓄=chu3,xu4 +储蓄银行=chu3,xu4,yin2,hang2 +储藏=chu2,cang2 +傩=nuo2 +傪=can1,can4 +傫=lei3 +催=cui1 +催泪弹=cui1,lei4,dan4 +催泪炸弹=cui1,lei4,zha4,dan4 +傭=yong1 +傮=zao1,cao2 +傯=zong3 +傰=peng2 +傱=song3 +傲=ao4 +傲不可长=ao4,bu4,ke3,zhang3 +傳=chuan2,zhuan4 +傴=yu3 +債=zhai4 +傶=qi1,cou4 +傷=shang1 +傸=chuang3 +傹=jing4 +傺=chi4 +傻=sha3 +傼=han4 +傽=zhang1 +傾=qing1 +傿=yan1,yan4 +僀=di4 +僁=xie4 +僂=lv3,lou2 +僃=bei4 +僄=piao4,biao1 +僅=jin3,jin4 +僆=lian4 +僇=lu4 +僈=man4 +僉=qian1 +僊=xian1 +僋=tan3,tan4 +僌=ying2 +働=dong4 +僎=zhuan4 +像=xiang4 +像煞有介事=xiang4,sha4,you3,jie4,shi4 +僐=shan4 +僑=qiao2 +僒=jiong3 +僓=tui3,tui2 +僔=zun3 +僕=pu2 +僖=xi1 +僗=lao2 +僘=chang3 +僙=guang1 +僚=liao2 +僛=qi1 +僜=cheng1,deng1 +僝=zhan4,zhuan4,chan2 +僞=wei3 +僟=ji1 +僠=bo1 +僡=hui4 +僢=chuan3 +僣=tie3,jian4 +僤=dan4 +僥=jiao3,yao2 +僦=jiu4 +僧=seng1 +僨=fen4 +僩=xian4 +僪=yu4,ju2 +僫=e4,wu4,wu1 +僬=jiao1 +僬侥=jiao1,yao2 +僭=jian4 +僮=tong2,zhuang4 +僮仆=tong2,pu2 +僯=lin3 +僰=bo2 +僱=gu4 +僲=xian1 +僳=su4 +僴=xian4 +僵=jiang1 +僶=min3 +僷=ye4 +僸=jin4 +價=jia4,jie5 +僺=qiao4 +僻=pi4 +僼=feng1 +僽=zhou4 +僾=ai4 +僿=sai4 +儀=yi2 +儁=jun4 +儂=nong2 +儃=chan2,tan3,shan4 +億=yi4 +儅=dang1,dang4 +儆=jing3 +儇=xuan1 +儈=kuai4 +儉=jian3 +儊=chu4 +儋=dan1,dan4 +儋石之储=dan4,shi2,zhi1,chu3 +儌=jiao3 +儍=sha3 +儎=zai4 +儏=can4 +儐=bin1,bin4 +儑=an2,an4 +儒=ru2 +儒将=ru2,jiang4 +儓=tai2 +儔=chou2 +儕=chai2 +儖=lan2 +儗=ni3,yi4 +儗不于伦=li3,bu4,yu2,lun2 +儘=jin3 +儙=qian4 +儚=meng2 +儛=wu3 +儜=ning2 +儝=qiong2 +儞=ni3 +償=chang2 +儠=lie4 +儡=lei3 +儢=lv3 +儣=kuang3 +儤=bao4 +儥=yu4 +儦=biao1 +儧=zan3 +儨=zhi4 +儩=si4 
+優=you1 +儫=hao2 +儬=qing4 +儭=chen4 +儮=li4 +儯=teng2 +儰=wei3 +儱=long3,long2,long4 +儲=chu3 +儳=chan2,chan4 +儴=rang2,xiang1 +儵=shu1 +儶=hui4,xie2 +儷=li4 +儸=luo2 +儹=zan3 +儺=nuo2 +儻=tang3 +儼=yan3 +儽=lei2 +儾=nang4,nang1 +儿=er2 +儿女成行=er2,nv3,cheng2,hang2 +儿媳妇儿=er2,xi2,fu5,er5 +儿子=er2,zi5 +兀=wu4 +兀秃=wu4,tu1 +允=yun3 +允当=yun3,dang4 +兂=zan1 +元=yuan2 +兄=xiong1 +兄死弟及=xiong1,si3,di4,ji2 +兄长=xiong1,zhang3 +充=chong1 +充分=chong1,fen4 +充塞=chong1,se4 +充数=chong1,shu4 +充血=chong1,xue4 +兆=zhao4 +兆头=zhao4,tou5 +兆载永劫=zhao4,zai3,yong3,jie2 +兇=xiong1 +先=xian1 +先下手为强=xian1,xia4,shou3,wei2,qiang2 +先入为主=xian1,ru4,wei2,zhu3 +先睹为快=xian1,du3,wei2,kuai4 +光=guang1 +光晕=guang1,yun4 +光杆=guang1,gan3 +光杆儿=guang1,gan3,er2 +光栅=guang1,shan1 +兊=dui4,rui4,yue4 +克=ke4 +克什米尔=ke4,shi2,mi3,er3 +克分子=ke4,fen4,zi3 +兌=dui4,rui4,yue4 +免=mian3 +免冠=mian3,guan1 +免得=mian3,de5 +免服=wen4,fu2 +免袒=mian3,tan3 +免麻=mian3,ma2 +兎=tu4 +兏=chang2,zhang3 +児=er2 +兑=dui4,rui4,yue4 +兒=er2 +兓=qin1 +兔=tu4 +兔丝燕麦=tu4,si1,yan4,mai4 +兔头麞脑=tu4,tou2,suo1,nao3 +兔子=tu4,zi5 +兔葵燕麦=tu4,kui2,yan4,mai4 +兔角龟毛=tu4,jiao3,gui1,mao2 +兔起鹘落=tu4,qi3,hu2,luo4 +兕=si4 +兖=yan3 +兗=yan3 +兘=shi3 +兙=shi2,ke4 +党=dang3 +党参=dang3,shen1 +党禁=dang3,jin4 +党豺为虐=dang3,chai2,wei2,nve4 +兛=qian1 +兜=dou1 +兜肚连肠=dou1,du3,lian2,chang2 +兝=fen1 +兞=mao2 +兟=shen1 +兠=dou1 +兡=bai3,ke4 +兢=jing1 +兢兢干干=jing1,jing1,gan4,gan4 +兣=li3 +兤=huang3 +入=ru4 +入吾彀中=ru4,wu3,gou4,zhong1 +入国问禁=ru4,guo2,wen4,jin4 +入土为安=ru4,tu2,wei2,an1 +入境问禁=ru4,jing4,wen4,jin4 +入孝出弟=ru4,xiao4,chu1,ti4 +入竟问禁=ru4,jing4,wen4,jin4 +兦=wang2 +內=nei4 +全=quan2 +全军覆没=quan2,jun1,fu4,mo4 +全数=quan2,shu4 +兩=liang3 +兪=yu2,shu4 +八=ba1 +八大山人=ba1,da4,shan1,ren2 +八字没一撇=ba1,zi4,mei2,yi1,pie3 +八字没见一撇=ba1,zi4,mei2,jian4,yi1,pie3 +八字还没有一撇=ba1,zi4,hai2,mei2,you3,yi1,pie3 +八斗之才=ba1,dou3,zhi1,cai2 +八斗才=ba1,dou3,cai2 +八方呼应=ba1,fang1,hu1,ying4 +八旗子弟=ba1,qi2,zi5,di4 +八竿子打不着=ba1,gan1,zi3,da3,bu4,zhao2 +八行书=ba1,hang2,shu1 +八难三灾=ba1,nan4,san1,zai1 +公=gong1 +公了=gong1,liao3 +公仆=gong1,pu2 +公倍数=gong1,bei4,shu4 +公假=gong1,jia4 +公差=gong1,chai1 +公帑=gong1,tang3 
+公干=gong1,gan4 +公正不阿=gong1,zheng4,bu4,e1 +公约数=gong1,yue1,shu4 +公诸同好=gong1,zhu1,tong2,hao4 +公转=gong1,zhuan4 +六=liu4,lu4 +六十花甲子=liu4,shi2,hua1,jia2,zi3 +六合=lu4,he2 +六合之内=liu4,he2,zhi1,nei4 +六安=lu4,an1 +六尺之讬=liu4,chi3,zhi1,quan4 +六神不安=liu4,shen2,bu3,an1 +兮=xi1 +兯=han5 +兰=lan2 +兰若=lan2,re3 +共=gong4,gong1 +共为唇齿=gong4,wei2,chun2,chi3 +共处=gong4,chu3 +共枝别干=gong4,zhi1,bie2,gan4 +兲=tian1 +关=guan1 +关切=guan1,qie4 +关卡=guan1,qia3 +关情脉脉=guan1,qing2,mai4,mai4 +兴=xing4,xing1 +兴业=xing1,ye4 +兴中会=xing1,zhong1,hui4 +兴义=xing1,yi4 +兴云致雨=xing1,yun2,zhi4,yu3 +兴亡=xing1,wang2 +兴亡祸福=xing1,wang2,huo4,fu2 +兴亡继绝=xing1,wang2,ji4,jue2 +兴修=xing1,xiu1 +兴兵=xing1,bing1 +兴兵动众=xing1,bing1,dong4,zhong4 +兴冲冲=xing1,chong1,chong1 +兴利除弊=xing1,li4,chu2,bi4 +兴办=xing1,ban4 +兴化=xing1,hua4 +兴叹=xing1,tan4 +兴味=xing4,wei4 +兴国=xing1,guo2 +兴城=xing1,cheng2 +兴头=xing4,tou5 +兴奋=xing1,fen4 +兴妖作乱=xing1,yao1,zuo4,luan4 +兴妖作孽=xing1,yao1,zuo4,nie4 +兴妖作怪=xing1,yao1,zuo4,guai4 +兴学=xing1,xue2 +兴安=xing1,an1 +兴家立业=xing1,jia1,li4,ye4 +兴工=xing1,gong1 +兴师=xing1,shi1 +兴师动众=xing1,shi1,dong4,zhong4 +兴师问罪=xing1,shi1,wen4,zui4 +兴平=xing1,ping2 +兴废=xing1,fei4 +兴废继绝=xing1,fei4,ji4,jue2 +兴建=xing1,jian4 +兴微继绝=xing1,wei1,ji4,jue2 +兴文=xing1,wen2 +兴文匽武=xing1,wen2,diao4,wu3 +兴旺=xing1,wang4 +兴替=xing1,ti4 +兴灭继绝=xing1,mie4,ji4,jue2 +兴盛=xing1,sheng4 +兴筑=xing1,zhu4 +兴致=xing4,zhi4 +兴衰=xing1,shuai1 +兴许=xing1,xu3 +兴讹造讪=xing1,e2,zao4,shan4 +兴词构讼=xing1,ci2,gou4,song4 +兴起=xing1,qi3 +兴邦=xing1,bang1 +兴邦立国=xing1,bang1,li4,guo2 +兴隆=xing1,long2 +兴革=xing1,ge2 +兴风作浪=xing1,feng1,zuo4,lang4 +兴高采烈=xing4,gao1,cai3,lie4 +兵=bing1 +兵不由将=bing1,bu4,you2,jiang4 +兵不血刃=bing1,bu4,xue4,ren4 +兵多将广=bing1,duo1,jiang4,guang3 +兵差=bing1,chai1 +兵强将勇=bing1,qiang2,ang4,yong3 +兵微将寡=bing1,wei1,jiang4,gua3 +兵未血刃=bing1,wei4,xue3,ren4 +其=qi2,ji1 +其应如响=qi2,ying4,ru2,xiang3 +其应若响=qi2,ying4,ruo4,xiang3 +具=ju4 +典=dian3 +典当=dian3,dang4 +兹=zi1,ci2 +养=yang3 +养分=yang3,fen4 +养家糊口=yang3,jia1,hu2,kou3 +养尊处优=yang3,zun1,chu3,you1 +养精畜锐=yang3,jing1,xu4,rui4 +养虎为患=yang3,hu3,wei2,huan4 +兼=jian1 
+兼差=jian1,chai1 +兽=shou4 +兾=ji4 +兿=yi4 +冀=ji4 +冁=chan3 +冂=jiong1 +冃=mao4 +冄=ran3 +内=nei4,na4 +内传=nei4,zhuan4 +内外夹攻=nei4,wai4,jia1,gong1 +内应=nei4,ying4 +内查外调=nei4,cha2,wai4,diao4 +内省=nei4,xing3 +内行=nei4,hang2 +円=yuan2 +冇=mao3 +冈=gang1 +冉=ran3 +冊=ce4 +冋=jiong1 +册=ce4 +册子=ce4,zi5 +再=zai4 +再一次=zai4,yi2,ci4 +冎=gua3 +冏=jiong3 +冐=mao4 +冑=zhou4 +冒=mao4,mo4 +冒名接脚=mao4,ming2,jie3,jiao3 +冒天下之大不韪=mao4,tian1,xia4,zhi1,da4,bu4,wei2 +冒着=mao4,zhe5 +冒起火苗=mao4,huo3,qi3,miao2 +冒顿=mo4,dun4 +冓=gou4 +冔=xu2 +冕=mian3 +冖=mi4 +冗=rong3 +冘=yin2,you2 +写=xie3 +冚=kan3 +军=jun1 +军乐=jun1,yue4 +军长=jun1,zhang3 +农=nong2 +农舍=nong2,she4 +农行=nong2,hang2 +冝=yi2 +冞=mi2 +冟=shi4 +冠=guan4,guan1 +冠上加冠=guan1,shang4,jia1,guan1 +冠上履下=guan1,shang4,lv3,xia4 +冠冕=guan1,mian3 +冠冕堂皇=guan1,mian3,tang2,huang2 +冠军=guan4,jun1 +冠子=guan1,zi5 +冠履倒易=guan1,lv3,dao4,yi4 +冠履倒置=guan1,lv3,dao4,zhi4 +冠心病=guan1,xin1,bing4 +冠状动脉=guan1,zhuang4,dong4,mai4 +冠状动脉硬化=guan1,zhuang4,dong4,mai4,ying4,hua4 +冠状动脉血栓形成=guan1,zhuang4,dong4,mai4,xue4,shuan1,xing2,cheng2 +冠状动脉阻塞=guan1,zhuang4,dong4,mai4,zu3,se4 +冠状静脉=guan1,zhuang4,jing4,mai4 +冠玉=guan1,yu4 +冠盖=guan1,gai4 +冠盖云集=guan1,gai4,yun2,ji2 +冠盖如云=guan1,gai4,ru2,yun2 +冠盖相属=guan1,gai4,xiang1,zhu3 +冠盖相望=guan1,gai4,xiang1,wang4 +冠袍带履=guan1,pao2,dai4,lv3 +冡=meng3 +冢=zhong3 +冣=zui4 +冤=yuan1 +冤家对头=yuan1,jia5,dui4,tou2 +冤家路狭=yuan1,jia5,lu4,xia2 +冥=ming2 +冥行擿埴=ming2,xing2,zhi4,zhi2 +冥顽不化=ming2,wan2,bu2,hua4 +冦=kou4 +冧=lin2 +冨=fu4 +冩=xie3 +冪=mi4 +冫=bing1 +冬=dong1 +冬裘夏葛=dong1,qiu2,xia4,ge3 +冭=tai4 +冮=gang1 +冯=feng2,ping2 +冯河=ping2,he2 +冯河暴虎=feng2,he2,bao4,hu3 +冯生弹铗=feng2,sheng1,dan4,jia2 +冯驩弹铗=feng2,huan1,dan4,jia2 +冰=bing1 +冰斗=bing1,dou3 +冰解的破=bing1,jie3,di4,po4 +冱=hu4 +冲=chong1,chong4 +冲冠怒发=chong1,guan4,nu4,fa4 +冲劲=chong4,jing4 +冲劲儿=chong4,jin4,er5 +冲压=chong4,ya1 +冲子=chong4,zi5 +冲孔=chong4,kong3 +冲床=chong4,chuang2 +冲模=chong4,mu2 +冲盹儿=chong4,dun3,er5 +决=jue2 +冴=ya4 +况=kuang4 +冶=ye3 +冷=leng3 +冷场=leng3,chang3 +冷水浇头=leng3,shui3,jiao1,tou2 +冷水浇背=leng3,shui3,jiao1,bei4 +冷血=leng3,xue4 +冷血动物=leng3,xue4,dong4,wu4 
+冷颤=leng3,zhan4 +冸=pan4 +冹=fa1 +冺=min3 +冻=dong4 +冼=xian3 +冽=lie4 +冾=qia4 +冿=jian1 +净=jing4,cheng1 +净得=jing4,de5 +凁=sou1 +凂=mei3 +凃=tu2 +凄=qi1 +凄切=qi1,qie4 +凅=gu4 +准=zhun3 +准予=zhun3,yu3 +准头=zhun3,tou5 +准将=zhun3,jiang4 +凇=song1 +凈=jing4,cheng1 +凉=liang2,liang4 +凉一凉=liang4,yi1,liang4 +凊=qing4 +凋=diao1 +凌=ling2 +凌云壮志=ling2,yun2,zhuang1,zhi4 +凍=dong4 +凎=gan4 +减=jian3 +凐=yin1 +凑=cou4 +凑份子=cou4,fen4,zi5 +凑数=cou4,shu4 +凒=ai2 +凓=li4 +凔=cang1 +凕=ming3 +凖=zhun3 +凗=cui1 +凘=si1 +凙=duo2 +凚=jin4 +凛=lin3 +凜=lin3 +凝=ning2 +凝血酶=ning2,xue4,mei2 +凞=xi1 +凟=du2 +几=ji3,ji1 +几不欲生=ji1,bu4,yu4,sheng1 +几乎=ji1,hu1 +几十年如一日=ji3,shi2,nian2,ru2,yi2,ri4 +几只=ji3,zhi1 +几希=ji1,xi1 +几微=ji1,wei1 +几曾=ji3,zeng1 +几案=ji1,an4 +几率=ji1,lv4 +几至=ji1,zhi4 +凡=fan2 +凢=fan2 +凣=fan2 +凤=feng4 +凤冠=feng4,guan1 +凤冠霞帔=feng4,guan1,xia2,pei4 +凤只鸾孤=feng4,zhi1,luan2,gu1 +凤楼龙阙=feng4,lou2,long2,que4 +凤靡鸾吪=feng4,mi3,luan2,e2 +凥=ju1 +処=chu4,chu3 +凧=zheng1 +凨=feng1 +凩=mu4 +凪=zhi3 +凫=fu2 +凬=feng1 +凭=ping2 +凭一己之力=ping2,yi4,ji3,zhi1,li4 +凭几据杖=ping2,ji1,ju4,zhang4 +凮=feng1 +凯=kai3 +凰=huang2 +凱=kai3 +凲=gan1 +凳=deng4 +凳子=deng4,zi5 +凴=ping2 +凵=kan3,qian3 +凶=xiong1 +凶横=xiong1,heng4 +凶煞=xiong1,sha4 +凶相=xiong1,xiang4 +凶相毕露=xiong1,xiang4,bi4,lu4 +凶神恶煞=xiong1,shen2,e4,sha4 +凷=kuai4 +凸=tu1 +凹=ao1,wa1 +出=chu1 +出丧=chu1,sang1 +出乖露丑=chu1,guai1,lu4,chou3 +出份子=chu1,fen4,zi5 +出入将相=chu1,ru4,jiang1,xiang1 +出入无间=chu1,ru4,wu2,jian1 +出塞=chu1,sai4 +出处殊涂=chu1,chu3,shu1,tu2 +出处殊途=chu1,chu3,shu1,tu2 +出处语默=chu1,chu3,yu3,mo4 +出处进退=chu1,chu3,jin4,tui4 +出头露面=chu1,tou2,lu4,mian4 +出将入相=chu1,jiang4,ru4,xiang4 +出岔子=chu1,cha4,zi5 +出差=chu1,chai1 +出差错=chu1,cha1,cuo4 +出没=chu1,mo4 +出没无常=chu1,mo4,wu2,chang2 +出洋相=chu1,yang2,xiang4 +出落=chu1,la4 +出血=chu1,xue4 +出言不逊=chu1,yan2,bu4,xun4 +出风头=chu1,feng1,tou5 +击=ji1 +击中=ji1,zhong4 +击排冒没=ji1,pai2,mao4,mo4 +凼=dang4 +函=han2 +函数=han2,shu4 +凾=han2 +凿=zao2 +凿坏以遁=zao2,pi1,yi3,dun4 +凿坏而遁=zao2,pi1,er2,dun4 +刀=dao1 +刀光血影=dao1,guang1,xue4,ying3 +刀削=dao1,xiao1 +刀把=dao1,ba4 +刀把子=dao1,ba4,zi5 +刀耕火种=dao1,geng1,huo3,zhong4 +刀背=dao1,bei4 +刁=diao1 
+刁斗森严=diao1,dou3,sen1,yan2 +刁横=diao1,heng4 +刁钻促搯=diao1,zuan4,cu4,chao1 +刁钻促狭=diao1,zuan4,cu4,xia2 +刁难=diao1,nan4 +刂=dao1 +刃=ren4 +刄=ren4 +刅=chuang1 +分=fen1,fen4 +分为=fen1,wei2 +分内=fen4,nei4 +分内之事=fen4,nei4,zhi1,shi4 +分叉=fen1,cha4 +分外=fen4,wai4 +分子=fen4,zi5 +分子力=fen4,zi3,li4 +分子式=fen4,zi3,shi4 +分子物理学=fen4,zi3,wu4,li3,xue2 +分子生物学=fen4,zi3,sheng1,wu4,xue2 +分子筛=fen4,zi3,shai1 +分子运动论=fen4,zi3,yun4,dong4,lun4 +分子量=fen4,zi3,liang4 +分当=fen4,dang1 +分得=fen1,de5 +分散=fen1,san4 +分散主义=fen1,san4,zhu3,yi4 +分散染料=fen1,san3,ran3,liao4 +分数=fen1,shu4 +分数线=fen1,shu4,xian4 +分毫不差=fen1,hao2,bu4,cha1 +分行=fen1,hang2 +分量=fen4,liang5 +分风劈流=fen1,feng1,pi3,liu2 +分馏=fen1,liu2 +切=qie1,qie4 +切不=qie4,bu4 +切中=qie4,zhong4 +切中事理=qie4,zhong4,shi4,li3 +切中时弊=qie4,zhong4,shi2,bi4 +切中要害=qie4,zhong4,yao4,hai4 +切切=qie4,qie4 +切切实实=qie4,qie4,shi2,shi2 +切切此令=qie4,qie4,ci3,ling4 +切切此布=qie4,qie4,ci3,bu4 +切切牢记=qie4,qie4,lao2,ji4 +切切私语=qie4,qie4,si1,yu3 +切切请求=qie4,qie4,qing3,qiu2 +切削=qie1,xiao1 +切割=qie1,ge1 +切力效应=qie1,li4,xiao4,ying4 +切勿=qie4,wu4 +切勿倒置=qie4,wu4,dao4,zhi4 +切勿受潮=qie4,wu4,shou4,chao2 +切勿吸烟=qie4,wu4,xi1,yan1 +切勿靠近=qie4,wu4,kao4,jin4 +切勿颠倒=qie4,wu4,dian1,dao3 +切口=qie4,kou3 +切合=qie4,he2 +切合实际=qie4,he2,shi2,ji4 +切嘱=qie4,zhu3 +切实=qie4,shi2 +切己=qie4,ji3 +切当=qie4,dang4 +切忌=qie4,ji4 +切望=qie4,wang4 +切末=qie4,mo4 +切激=qie4,ji1 +切瑳琢磨=qie1,cun4,zhuo2,mo2 +切盼=qie4,pan4 +切磋=qie1,cuo1 +切磋琢磨=qie1,cuo1,zhuo2,mo2 +切肤之痛=qie4,fu1,zhi1,tong4 +切脉=qie4,mai4 +切莫=qie4,mo4 +切要=qie4,yao4 +切记=qie4,ji4 +切诊=qie4,zhen3 +切谏=qie4,jian4 +切责=qie4,ze2 +切贴=qie4,tie1 +切身=qie4,shen1 +切身体会=qie4,shen1,ti3,hui4 +切身体验=qie4,shen1,ti3,yan4 +切身利害=qie4,shen1,li4,hai4 +切身大事=qie4,shen1,da4,shi4 +切近=qie4,jin4 +切近的当=qie1,jin4,de5,dang1 +切迫=qie4,po4 +切音=qie4,yin1 +切题=qie4,ti2 +切骨=qie4,gu3 +切骨之仇=qie4,gu3,zhi1,chou2 +切骨之寒=qie4,gu3,zhi1,han2 +切骨之恨=qie4,gu3,zhi1,hen4 +切齿=qie4,chi3 +切齿咒骂=qie4,chi3,zhou4,ma4 +切齿痛恨=qie4,chi3,tong4,hen4 +切齿腐心=qie4,chi3,fu3,xin1 +刈=yi4 +刉=ji1 +刊=kan1 +刊载=kan1,zai3 +刋=qian4 +刌=cun3 +刍=chu2 +刎=wen3 +刏=ji1 +刐=dan3 +刑=xing2 
+划=hua2,hua4 +划一=hua4,yi1 +划一不二=hua4,yi1,bu4,er4 +划价=hua4,jia4 +划分=hua4,fen1 +划地为牢=hua2,di4,wei2,lao2 +划定=hua4,ding4 +划归=hua4,gui1 +划得来=hua2,de5,lai2 +划拨=hua4,bo1 +划时代=hua4,shi2,dai4 +划清=hua4,qing1 +划界=hua4,jie4 +划算=hua4,suan4 +划粥割齑=hua4,zhou1,ge1,ji1 +划线=hua4,xian4 +刓=wan2 +刓方为圆=shu1,fang1,wei2,yuan2 +刔=jue2 +刕=li2 +刖=yue4 +列=lie4 +列传=lie4,zhuan4 +列车长=lie4,che1,zhang3 +刘=liu2 +则=ze2 +刚=gang1 +刚劲=gang1,jing4 +刚正不阿=gang1,zheng4,bu4,e1 +刚直不阿=gang1,zhi2,bu4,e1 +创=chuang4,chuang1 +创举=chuang4,ju3 +创伤=chuang1,shang1 +创作=chuang4,zuo4 +创口=chuang1,kou3 +创巨痛深=chuang1,ju4,tong4,shen1 +创痕=chuang1,hen2 +创痛=chuang1,tong4 +创造=chuang4,zao4 +创面=chuang1,mian4 +刜=fu2 +初=chu1 +初生之犊不畏虎=chu1,sheng1,zhi1,du2,bu4,wei4,hu3 +初生牛犊不怕虎=chu1,sheng1,niu2,du2,bu4,pa4,hu3 +初露=chu1,lu4 +初露头角=chu1,lu4,tou2,jiao3 +初露锋芒=chu1,lu4,feng1,mang2 +刞=qu4 +刟=diao1 +删=shan1 +删削=shan1,xue1 +刡=min3 +刢=ling2 +刣=zhong1 +判=pan4 +判处=pan4,chu3 +別=bie2,bie4 +刦=jie2 +刧=jie2 +刨=pao2,bao4 +刨冰=bao4,bing1 +刨刀=bao4,dao1 +刨削=pao2,xue1 +刨子=bao4,zi5 +刨平=bao4,ping2 +刨床=bao4,chuang2 +刨木板=bao4,mu4,ban3 +刨花=bao4,hua1 +刨花板=pao2,hua1,ban3 +利=li4 +利令志惛=li4,ling4,zhi4,zao4 +利口捷给=li4,kou3,jie2,ji3 +利得=li4,de5 +利爪=li4,zhua3 +利用=li4,yong4 +刪=shan1 +别=bie2,bie4 +别传=bie2,zhuan4 +别具只眼=bie2,ju4,zhi1,yan3 +别创一格=bie2,chuang4,yi2,ge2 +别开一格=bie2,kai1,yi2,ge2 +别开蹊径=bie2,kai1,xi1,jing4 +别扭=bie4,niu3 +别无长物=bie2,wu2,chang2,wu4 +别类分门=bie2,lei4,fan1,men2 +刬=chan3,chan4 +刭=jing3 +刮=gua1 +刯=geng1 +到=dao4 +到此为止=dao4,ci3,wei2,zhi3 +到目前为止=dao4,mu4,qian2,wei2,zhi3 +刱=chuang4 +刲=kui1 +刳=ku1 +刴=duo4 +刵=er4 +制=zhi4 +刷=shua1,shua4 +刷白=shua4,bai2 +券=quan4,xuan4 +刹=cha4,sha1 +刹不住=sha1,bu4,zhu4 +刹住=sha1,zhu4 +刹刹=sha1,sha1 +刹把=sha1,ba3 +刹时=sha1,shi2 +刹车=sha1,che1 +刺=ci4,ci1 +刺啦=ci1,la1 +刺溜=ci1,liu1 +刺痒=ci4,yang2 +刺的一声=ci1,de5,yi1,sheng1 +刻=ke4 +刻木为吏=ke4,mu4,wei2,li4 +刻木为鹄=ke4,mu4,wei2,hu2 +刻薄=ke4,bo2 +刼=jie2 +刽=gui4 +刽子手=gui4,zi5,shou3 +刾=ci4 +刿=gui4 +剀=kai3 +剀切=kai3,qie4 +剀切中理=kai3,qie4,zhong4,li3 +剀切教导=kai3,qie4,jiao4,dao3 +剀切详明=kai3,qie4,xiang2,ming2 +剁=duo4 
+剂=ji4 +剃=ti4 +剃发=ti4,fa4 +剄=jing3 +剅=lou2 +剆=luo3 +則=ze2 +剈=yuan1 +剉=cuo4 +削=xiao1,xue1 +削价=xue1,jia4 +削减=xue1,jian3 +削削=xue1,xue1 +削发=xue1,fa4 +削地=xue1,di4 +削壁=xue1,bi4 +削尖脑袋=xue1,jian1,nao3,dai4 +削平=xue1,ping2 +削弱=xue1,ruo4 +削木为吏=xue1,mu4,wei2,li4 +削株掘根=xue1,zhu1,jue2,gen1 +削球=xiao1,qiu2 +削瘦=xue1,shou4 +削皮=xiao1,pi2 +削职=xue1,zhi2 +削职为民=xue1,zhi2,wei2,min2 +削肩=xue1,jian1 +削草除根=xue1,cao3,chu2,gen1 +削足适履=xue1,zu2,shi4,lv3 +削趾适屦=xue1,zhi3,shi4,ju4 +削铁如泥=xue1,tie3,ru2,ni2 +削铁无声=xue1,tie3,wu2,sheng1 +削铅笔=xiao1,qian1,bi3 +削除=xue1,chu2 +剋=kei1,ke4 +剌=la4,la2 +前=qian2 +前仆后踣=qian2,pu2,hou4,bo2 +前头=qian2,tou5 +前爪=qian2,zhua3 +前跋后疐=qian2,ba2,hou4,mao2 +剎=cha4,sha1 +剏=chuang4 +剐=gua3 +剑=jian4 +剑首一吷=jian4,shou3,yi1,gui1 +剒=cuo4 +剓=li2 +剔=ti1 +剔抽秃揣=ti1,chou1,tu1,chuai3 +剔蝎撩蜂=ti1,xie1,liao2,feng1 +剕=fei4 +剖=pou1 +剖心泣血=pou1,xin1,qi4,xue4 +剖肝泣血=pou1,gan1,qi4,xue4 +剗=chan3,chan4 +剘=qi2 +剙=chuang4 +剚=zi4 +剛=gang1 +剜=wan1 +剝=bao1,bo1 +剞=ji1 +剟=duo1 +剠=qing2 +剡=yan3,shan4 +剢=du1,zhuo2 +剣=jian4 +剤=ji4 +剥=bo1,bao1 +剥削=bo1,xue1 +剥剥=bao1,bao1 +剥取=bao1,qu3 +剥壳=bao1,ke2 +剥皮=bao1,pi2 +剥皮抽筋=bo1,pi2,chou1,jin1 +剥肤椎髓=bo1,fu1,chui2,sui3 +剥脱=bao1,tuo1 +剥花生=bao1,hua1,sheng1 +剥苹果=bao1,ping2,guo3 +剥除=bao1,chu2 +剦=yan1 +剧=ju4 +剨=huo4 +剩=sheng4 +剪=jian3 +剪发=jian3,fa4 +剪发披缁=jian3,fa1,pi1,zi1 +剪发杜门=jian3,fa4,du4,men2 +剪发被褐=jian3,fa1,bei4,he4 +剫=duo2 +剬=zhi4,duan1 +剭=wu1 +剮=gua3 +副=fu4,pi4 +副行长=fu4,hang2,zhang3 +剰=sheng4 +剱=jian4 +割=ge1 +剳=da2,zha2 +剴=kai3 +創=chuang4,chuang1 +剶=chuan2 +剷=chan3 +剸=tuan2,zhuan1 +剸繁决剧=shi2,fan2,jue2,ju4 +剸繁治剧=shi2,fan2,zhi4,ju4 +剹=lu4,jiu1 +剺=li2 +剻=peng1 +剼=shan1 +剽=piao1 +剾=kou1 +剿=jiao3,chao1 +剿袭=chao1,xi2 +剿说=chao1,shuo1 +劀=gua1 +劁=qiao1 +劂=jue2 +劃=hua2,hua4 +劄=zha1,zha2 +劅=zhuo2 +劆=lian2 +劇=ju4 +劈=pi1,pi3 +劈叉=pi3,cha4 +劈成=pi3,cheng2 +劈木头=pi1,mu4,tou5 +劈柴=pi3,chai2 +劈里啪啦=pi1,li3,pa1,la5 +劉=liu2 +劊=gui4 +劋=jiao3,chao1 +劌=gui4 +劍=jian4 +劎=jian4 +劏=tang1 +劐=huo1 +劑=ji4 +劒=jian4 +劓=yi4 +劔=jian4 +劕=zhi4 +劖=chan2 +劗=zuan1 +劘=mo2 +劙=li2 +劚=zhu2 +力=li4 
+力不胜任=li4,bu4,sheng4,ren4 +力有未逮=li4,you3,wei4,dai3 +力能扛鼎=li4,neng2,gang1,ding3 +力透纸背=li4,tou4,zhi3,bei4 +劜=ya4 +劝=quan4 +劝降=quan4,xiang2 +办=ban4 +办不到=ban4,bu2,dao4 +办差=ban4,chai1 +办得到=ban4,de2,dao4 +功=gong1 +功亏一篑=gong1,kui1,yi1,kui4 +功成行满=gong1,cheng2,xing2,man3 +功薄蝉翼=gong1,bo2,chan2,yi4 +加=jia1 +加数=jia1,shu4 +务=wu4 +劢=mai4 +劣=lie4 +劣迹昭着=lie4,ji4,zhao1,zhe5 +劤=jin4,jing4 +劥=keng1 +劦=xie2,lie4 +劧=zhi3 +动=dong4 +动中窾要=dong4,zhong1,zhe5,yao4 +动如参商=dong4,ru2,shen1,shang1 +动弹=dong4,dan4 +动画影片=dong4,hua4,ying3,pian1 +动画片=dong4,hua4,pian1 +助=zhu4,chu2 +助人为乐=zhu4,ren2,wei2,le4 +助兴=zhu4,xing4 +助天为虐=zhu4,tian1,wei2,nve4 +助桀为恶=zhu4,jie2,wei2,e4 +助桀为暴=zhu4,jie2,wei2,bao4 +助桀为虐=zhu4,jie2,wei2,nve4 +助纣为虐=zhu4,zhou4,wei2,nve4 +助长=zhu4,zhang3 +努=nu3 +劫=jie2 +劫数=jie2,shu4 +劫数难逃=jie2,shu4,nan2,tao2 +劫难=jie2,nan4 +劬=qu2 +劭=shao4 +劮=yi4 +劯=zhu3 +劰=miao3 +励=li4 +劲=jin4,jing4 +劲儿=jin4,er5 +劲兵=jing4,bing1 +劲吹=jing4,chui1 +劲头=jin4,tou2 +劲射=jing4,she4 +劲敌=jing4,di2 +劲旅=jing4,lv3 +劲松=jing4,song1 +劲直=jing4,zhi2 +劲草=jing4,cao3 +劲风=jing4,feng1 +劲骨丰肌=jing4,gu3,feng1,ji1 +劳=lao2 +劳什子=lao2,shi2,zi3 +劳碌=lao2,lu4 +劳累=lao2,lei4 +労=lao2 +劵=juan4 +劶=kou3 +劷=yang2 +劸=wa1 +効=xiao4 +劺=mou2 +劻=kuang1 +劼=jie2 +劽=lie4 +劾=he2 +势=shi4 +勀=ke4 +勁=jin4,jing4 +勂=gao4 +勃=bo2,bei4 +勃兴=bo2,xing1 +勄=min3 +勅=chi4 +勆=lang2 +勇=yong3 +勈=yong3 +勉=mian3 +勉为其难=mian3,wei2,qi2,nan2 +勉强=mian3,qiang3 +勊=ke4 +勋=xun1 +勌=juan4,juan1 +勍=qing2 +勎=lu4 +勏=bu4 +勐=meng3 +勑=chi4 +勒=le4,lei1 +勒令=le4,ling4 +勒住=lei1,zhu4 +勒派=le4,pai4 +勒索=le4,suo3 +勒紧=lei1,jin3 +勒逼=le4,bi1 +勓=kai4 +勔=mian3 +動=dong4 +勖=xu4 +勗=xu4 +勘=kan1 +勘查=kan1,zha1 +勘校=kan1,jiao4 +務=wu4 +勚=yi4 +勛=xun1 +勜=weng3,yang3 +勝=sheng4 +勞=lao2 +募=mu4 +勠=lu4 +勡=piao1 +勢=shi4 +勣=ji4 +勤=qin2 +勤朴=qin2,piao2 +勥=jiang4 +勦=jiao3,chao1 +勧=quan4 +勨=xiang4 +勩=yi4 +勪=qiao1 +勫=fan1 +勬=juan1 +勭=tong2,dong4 +勮=ju4 +勯=dan1 +勰=xie2 +勱=mai4 +勲=xun1 +勳=xun1 +勴=lv4 +勵=li4 +勶=che4 +勷=rang2,xiang1 +勸=quan4 +勹=bao1 +勺=shao2 +勻=yun2 +勼=jiu1 +勽=bao4 +勾=gou1,gou4 +勾当=gou4,dang4 +勿=wu4 
+勿谓言之不预=wu4,wei4,yan2,zhi1,bu4,yu4 +勿谓言之不预也=wu4,wei4,yan2,zhi1,bu4,yu4,ye3 +匀=yun2 +匀称=yun2,chen4 +匂=xiong1 +匃=gai4 +匄=gai4 +包=bao1 +包乘制=bao1,cheng2,zhi4 +包馄饨=bao1,hun2,tun5 +匆=cong1 +匇=yi4 +匈=xiong1 +匉=peng1 +匊=ju1 +匋=tao2,yao2 +匌=ge2 +匍=pu2 +匎=e4 +匏=pao2 +匐=fu2 +匑=gong1 +匒=da2 +匓=jiu4 +匔=gong1 +匕=bi3 +化=hua4,hua1 +化为乌有=hua4,wei2,wu1,you3 +化为泡影=hua4,wei2,pao1,ying3 +化学反应=hua4,xue2,fan3,ying4 +化干戈为玉帛=hua4,gan1,ge1,wei2,yu4,bo2 +化敌为友=hua4,di2,wei2,you3 +化整为零=hua4,zheng3,wei2,ling2 +化枭为鸠=hua4,xiao1,wei2,jiu1 +化腐为奇=hua4,fu3,wei2,qi2 +化腐朽为神奇=hua4,fu3,xiu3,wei2,shen2,qi2 +化险为夷=hua4,xian3,wei2,yi2 +化零为整=hua4,ling2,wei2,zheng3 +化鸱为凤=hua4,chi1,wei2,feng4 +北=bei3,bei4 +北斗=bei3,dou3 +北斗七星=bei3,dou3,qi1,xing1 +北斗之尊=bei3,dou3,zhi1,zun1 +北斗星=bei3,dou3,xing1 +北窗高卧=bei1,chuang1,gao1,wo4 +北辰星拱=bei1,chen2,xing1,gong3 +北鄙之声=bei1,bi3,zhi1,sheng1 +北鄙之音=bei1,bi3,zhi1,yin1 +匘=nao3 +匙=chi2,shi5 +匙子=chi2,zi5 +匚=fang1 +匛=jiu4 +匜=yi2 +匝=za1 +匞=jiang4 +匟=kang4 +匠=jiang4 +匡=kuang1 +匡其不逮=kuang1,qi2,bu4,dai3 +匡救弥缝=kuang1,jiu4,mi2,feng4 +匢=hu1 +匣=xia2 +匤=qu1 +匥=fan2 +匦=gui3 +匧=qie4 +匨=zang1,cang2 +匩=kuang1 +匪=fei3 +匪伊朝夕=fei3,yi1,zhao1,xi1 +匫=hu1 +匬=yu3 +匭=gui3 +匮=kui4,gui4 +匯=hui4 +匰=dan1 +匱=kui4,gui4 +匲=lian2 +匳=lian2 +匴=suan3 +匵=du2 +匶=jiu4 +匷=jue2 +匸=xi4 +匹=pi3 +匹马只轮=pi3,ma3,zhi1,lun2 +区=qu1,ou1 +区划=qu1,hua4 +区域网路=qu1,yu4,wang3,luo4 +医=yi1 +匼=ke1,qia4 +匽=yan3,yan4 +匾=bian3 +匿=ni4 +區=qu1,ou1 +十=shi2 +十一月=shi2,yi2,yue4 +十一点=shi2,yi4,dian3 +十一点钟=shi2,yi4,dian3,zhong1 +十不当一=shi2,bu4,huo4,yi1 +十四行诗=shi2,si4,hang2,shi1 +十夫楺椎=shi2,fu1,zhi1,zhui1 +十年九不遇=shi2,nian2,jiu3,bu4,yu4 +十捉九着=shi2,zhuo1,jiu3,zhe5 +十行俱下=shi2,hang2,ju4,xia4 +十载寒窗=shi2,zai3,han2,chuang1 +十里堡=shi2,li3,pu4 +十里长亭=shi2,li3,chang2,ting2 +十魔九难=shi2,mo2,jiu3,nan4 +卂=xun4 +千=qian1 +千乘之国=qian1,sheng4,zhi1,guo2 +千了万当=qian1,le5,wan4,dang4 +千了百当=qian1,liao3,bai3,dang4 +千古绝调=qian1,gu3,jue2,diao4 +千差万别=qian1,cha1,wan4,bie2 +千年万载=qian1,nian2,wan4,zai3 +千磨百折=qian1,mo2,bai3,she2 +千虑一得=qian1,lv4,yi1,de2 +千载一会=qian1,zai3,yi1,hui4 
+千载一圣=qian1,zai3,yi1,sheng4 +千载一弹=qian1,zai3,yi1,dan4 +千载一日=qian1,zai3,yi1,ri4 +千载一时=qian1,zai3,yi1,shi2 +千载一逢=qian1,zai3,yi1,feng2 +千载一遇=qian1,zai3,yi1,yu4 +千载奇遇=qian1,zai3,qi2,yu4 +千载难逢=qian1,zai3,nan2,feng2 +千载难遇=qian1,zai3,nan2,yu4 +千钧一发=qian1,jun1,yi1,fa4 +卄=nian4 +卅=sa4 +卆=zu2 +升=sheng1 +午=wu3 +午觉=wu3,jiao4 +卉=hui4 +半=ban4 +半上落下=ban4,shang4,luo4,xia4 +半半拉拉=ban4,ban4,la1,la1 +半吐半露=ban4,tu3,ban4,lu4 +半宿=ban4,xiu3 +半拉=ban4,la3 +半拉子=ban4,la3,zi3 +半数=ban4,shu4 +半筹莫展=ban4,chou2,mo4,chan3 +半身不遂=ban4,shen1,bu4,sui2 +半载=ban4,zai3 +半间不界=ban4,gan1,bu4,ga4 +半间半界=ban4,gan1,ban4,ga4 +卋=shi4 +卌=xi4 +卍=wan4 +华=hua2,hua4,hua1 +华亭鹤唳=hua4,ting2,he4,li4 +华佗=hua4,tuo2 +华冠丽服=hua2,guan1,li4,fu2 +华发=hua2,fa4 +华山=hua4,shan1 +华氏温度计=hua4,shi4,wen1,du4,ji4 +华达呢=hua2,da2,ni2 +协=xie2 +协调=xie2,tiao2 +卐=wan4 +卑=bei1 +卑宫菲食=bei1,gong1,fei3,shi2 +卒=zu2,cu4 +卒然=cu4,ran2 +卓=zhuo2 +協=xie2 +单=dan1,shan4,chan2 +单于=chan2,yu2 +单口相声=dan1,kou3,xiang4,sheng4 +单姓=shan4,xing4 +单子=dan1,zi5 +单子叶植物=dan1,zi5,ye4,zhi2,wu4 +单干=dan1,gan4 +单干户=dan1,gan4,hu4 +单数=dan1,shu4 +单枪匹马=dan1,qiang1,pi2,ma3 +单薄=dan1,bo2 +单调=dan1,diao4 +卖=mai4 +卖文为生=mai4,wen2,wei2,shen1 +卖相=mai4,xiang4 +南=nan2,na1 +南冠楚囚=nan2,guan1,chu3,qiu2 +南无=na1,mo2 +南箕北斗=nan2,ji1,bei3,dou3 +南腔北调=nan2,qiang1,bei3,diao4 +南蛮鴂舌=nan2,man2,xiang1,she2 +南贩北贾=nan2,fan4,bei3,gu3 +南郭处士=nan2,guo1,chu3,shi4 +単=dan1 +卙=ji2 +博=bo2 +博得=bo2,de2 +博文约礼=bo2,wen2,yue4,li3 +博闻强识=bo2,wen2,qiang2,zhi4 +卛=shuai4,lv4 +卜=bu3,bo5 +卝=guan4,kuang4 +卞=bian4 +卟=bu3 +占=zhan4,zhan1 +占便宜=zhan4,bian4,yi2 +占卜=zhan1,bu3 +占卦=zhan1,gua4 +占多数=zhan4,duo1,shu4 +占据=zhan1,ju4 +占星=zhan1,xing1 +占梦=zhan1,meng4 +占着茅坑不拉屎=zhan1,zhe5,mao2,keng1,bu4,la1,shi3 +占筮=zhan1,shi4 +占课=zhan1,ke4 +占风使帆=zhan1,feng1,shi3,fan1 +占风望气=zhan1,feng1,wang4,qi4 +卡=ka3,qia3 +卡具=qia3,ju4 +卡壳=qia3,ke2 +卡子=qia3,zi5 +卡脖子=qia3,bo2,zi5 +卢=lu2 +卣=you3 +卤=lu3 +卥=xi1 +卦=gua4 +卧=wo4 +卧铺=wo4,pu4 +卨=xie4 +卩=jie2 +卪=jie2 +卫=wei4 +卬=yang3,ang2 +卬头阔步=ang2,tou2,kuo4,bu4 +卬首信眉=ang2,shou3,shen1,mei2 +卭=qiong2 +卮=zhi1 +卯=mao3 +印=yin4 
+印把=yin4,ba4 +印把子=yin4,ba4,zi5 +印相纸=yin4,xiang4,zhi3 +危=wei1 +危如累卵=wei1,ru2,lei3,luan3 +危机四伏=wei1,ji1,si4,fu2 +危机重重=wei1,ji1,chong2,chong2 +危难=wei1,nan4 +卲=shao4 +即=ji2 +即兴=ji2,xing4 +却=que4 +卵=luan3 +卶=chi3 +卷=juan3,juan4 +卷卷=juan4,juan4 +卷发=juan3,fa4 +卷土重来=juan3,tu3,chong2,lai2 +卷地皮=juan4,di4,pi2 +卷子=juan3,zi5 +卷子本=juan3,zi5,ben3 +卷宗=juan4,zong1 +卷层云=juan4,ceng2,yun2 +卷帘=juan4,lian2 +卷帙=juan4,zhi4 +卷席而居=juan4,xi2,er2,ju1 +卷心菜=juan4,xin1,cai4 +卷曲=juan4,qu1 +卷柏=juan4,bai3 +卷甲束兵=juan4,jia3,shu4,bing1 +卷甲衔枚=juan4,jia3,xian2,mei2 +卷甲韬戈=juan3,jia3,tao1,ge1 +卷积云=juan4,ji1,yun2 +卷筒纸=juan4,tong3,zhi3 +卷缩=juan4,suo1 +卷轴=juan4,zhou2 +卷铺盖=juan3,pu4,gai4 +卷面=juan4,mian4 +卷须=juan4,xu1 +卸=xie4 +卸磨杀驴=xie4,mo4,sha1,lv2 +卸载=xie4,zai3 +卹=xu4 +卺=jin3 +卻=que4 +卼=wu4 +卽=ji2 +卾=e4 +卿=qing1 +厀=xi1 +厁=san1 +厂=chang3,an1,han4 +厂长=chang3,zhang3 +厃=wei1,yan2 +厄=e4 +厄难=e4,nan4 +厅=ting1 +历=li4 +历精为治=li4,jing1,wei2,zhi4 +历精更始=li4,jing1,geng4,shi3 +厇=zhe2,zhai2 +厈=han4,an4 +厉=li4 +厊=ya3 +压=ya1,ya4 +压卷=ya1,juan4 +压卷之作=ya1,juan4,zhi1,zuo4 +压担子=ya1,dan4,zi5 +压板=ya4,ban3 +压根儿=ya4,gen1,er2 +压肩叠背=ya1,jian1,die2,bei4 +压肩迭背=ya1,jian1,die2,bei4 +压良为贱=ya1,liang2,wei2,jian4 +压蔓=ya1,wan4 +压轴=ya1,zhou4 +厌=yan4 +厌恶=yan4,wu4 +厍=she4 +厎=di3 +厏=zha3,zhai3 +厐=pang2 +厑=ya2 +厒=qie4 +厓=ya2 +厔=zhi4,shi1 +厕=ce4 +厖=mang2 +厗=ti2 +厘=li2 +厙=she4 +厚=hou4 +厚今薄古=hou4,jin1,bo2,gu3 +厚古薄今=hou4,gu3,bo2,jin1 +厚味腊毒=hou4,wei4,xi1,du2 +厚德载福=hou4,de2,zai3,fu2 +厚朴=hou4,po4 +厚此薄彼=hou4,ci3,bo2,bi3 +厚积薄发=hou4,ji1,bo2,fa1 +厚薄=hou4,bao2 +厛=ting1 +厜=zui1 +厝=cuo4 +厝火燎原=cuo4,huo3,liao3,yuan2 +厞=fei4 +原=yuan2 +原处=yuan2,chu3 +原子反应堆=yuan2,zi3,fan3,ying4,dui1 +原封不动=yuan2,feng1,bu4,dong4 +原形毕露=yuan2,xing2,bi4,lu4 +原形败露=yuan2,xing2,bai4,lu4 +厠=ce4 +厡=yuan2 +厢=xiang1 +厣=yan3 +厤=li4 +厥=jue2 +厦=sha4,xia4 +厦门=xia4,men2 +厧=dian1 +厨=chu2 +厩=jiu4 +厪=jin3 +厫=ao2 +厬=gui3 +厭=yan4 +厮=si1 +厯=li4 +厰=chang3 +厱=qian1,lan2 +厲=li4 +厳=yan2 +厴=yan3 +厵=yuan2 +厶=si1,mou3 +厷=gong1,hong2 +厸=lin2,miao3 +厹=rou2,qiu2 +厺=qu4 +去=qu4 +厽=lei3 +厾=du1 +县=xian4,xuan2 
+县长=xian4,zhang3 +叀=zhuan1 +叁=san1 +参=can1,shen1,cen1,san1 +参与=can1,yu4 +参伍错综=cen1,wu3,cuo4,zong1 +参商=shen1,shang1 +参商之虞=shen1,shang1,zhi1,yu2 +参回斗转=shen1,hui2,dou3,zhuan3 +参差=cen1,ci1 +参数=can1,shu4 +参校=can1,jiao4 +参横斗转=shen1,heng2,dou3,zhuan3 +参茸=shen1,rong2 +参谋长=can1,mou2,zhang3 +参辰卯酉=shen1,chen2,mao3,you3 +参辰日月=shen1,chen2,ri4,yue4 +参错=cen1,cuo4 +參=can1,shen1,cen1,san1 +叄=can1,shen1,cen1,san1 +叅=can1,shen1,cen1,san1 +叆=ai4 +叇=dai4 +又=you4 +又吐又泻=you4,tu4,you4,xie4 +又弱一个=you4,ruo4,yi1,ge4 +叉=cha1,cha2,cha3 +叉儿=cha1,er5 +叉子=cha1,zi5 +叉开=cha3,kai1 +叉开双腿=cha3,kai1,shuang1,tui3 +及=ji2 +及时行乐=ji2,shi2,xing2,le4 +友=you3 +双=shuang1 +双柑斗酒=shuai4,gan1,dou3,jiu3 +双足重茧=shuang1,zu2,chong2,jian3 +双重=shuang1,chong2 +双重人格=shuang1,chong2,ren2,ge2 +双重国籍=shuang1,chong2,guo2,ji2 +反=fan3 +反倒=fan3,dao4 +反切=fan3,qie4 +反动分子=fan3,dong4,fen4,zi5 +反劳为逸=fan3,lao2,wei2,yi4 +反客为主=fan3,ke4,wei2,zhu3 +反应=fan3,ying4 +反应两极=fan3,ying4,liang3,ji2 +反应器=fan3,ying4,qi4 +反应堆=fan3,ying4,dui1 +反应式=fan3,ying4,shi4 +反应过度=fan3,ying4,guo4,du4 +反应过渡=fan3,ying4,guo4,du4 +反弹=fan3,dan4 +反战分子=fan3,zhan4,fen4,zi5 +反攻倒算=fan3,gong1,dao4,suan4 +反朴还淳=fan3,pu3,huan2,chun2 +反正还淳=fan3,zheng4,huan2,chun2 +反省=fan3,xing3 +反调=fan3,diao4 +反败为胜=fan3,bai4,wei2,sheng4 +反躬自省=fan3,gong1,zi4,xing3 +反间=fan3,jian4 +収=shou1 +叏=guai2 +叐=ba2 +发=fa1,fa4 +发丧=fa1,sang1 +发乳=fa4,ru3 +发人深省=fa1,ren2,shen1,xing3 +发刷=fa4,shua1 +发卡=fa4,qia3 +发型=fa4,xing2 +发夹=fa1,jia1 +发奸擿伏=fa1,jian1,ti4,fu2 +发妻=fa4,qi1 +发屋=fa4,wu1 +发屋求狸=fa1,wu1,qiu2,li2 +发帖=fa1,tie1 +发廊=fa4,lang2 +发式=fa4,shi4 +发引千钧=fa4,yin3,qian1,jun1 +发怒穿冠=fa4,nu4,chuan1,guan1 +发怔=fa1,zheng4 +发指=fa4,zhi3 +发指眦裂=fa4,zhi3,zi4,lie4 +发晕=fa1,yun4 +发植穿冠=fa4,zhi2,chuan1,guan1 +发毛=fa4,mao2 +发疟子=fa1,yao4,zi3 +发短心长=fa4,duan3,xin1,chang2 +发秃齿豁=fa4,tu1,chi3,huo4 +发箍=fa4,gu1 +发菜=fa4,cai4 +发蒙=fa1,meng2 +发蒙解缚=fa1,meng2,jie3,fu5 +发踊冲冠=fa4,yong3,chong1,guan1 +发辫=fa4,bian4 +发还=fa1,huan2 +发际=fa4,ji4 +发难=fa1,nan4 +发露=fa1,lu4 +发颤=fa1,chan4 +发髻=fa4,ji4 +发鬓=fa4,bin4 +叒=ruo4 +叓=li4 +叔=shu1 +叕=zhuo2,yi3,li4,jue2 +取=qu3 
+取之不尽=qu3,zhi1,bu4,jin4 +取予有节=qu3,yu4,you3,jie2 +取得=qu3,de5 +取给=qu3,ji3 +受=shou4 +受累=shou4,lei4 +受降=shou4,xiang2 +受难=shou4,nan4 +变=bian4 +变危为安=bian4,wei1,wei2,an1 +变幻不测=bian4,hua4,bu4,ce4 +变态反应=bian4,tai4,fan3,ying4 +变数=bian4,shu4 +变更=bian4,geng1 +变相=bian4,xiang4 +变调=bian4,diao4 +变贪厉薄=bian3,tan1,li4,bo2 +变风改俗=bian4,feng1,yi4,su2 +叙=xu4 +叚=jia3 +叛=pan4 +叜=sou3 +叝=ji2 +叞=wei4,yu4 +叟=sou3 +叠=die2 +叠矩重规=die2,ju3,chong2,gui1 +叡=rui4 +叢=cong2 +口=kou3 +口不应心=kou3,bu4,ying4,xin1 +口供=kou3,gong4 +口出大言=kou3,chu1,da1,yan2 +口子=kou3,zi5 +口干舌焦=kou3,gan4,she2,jiao1 +口腹之累=kou3,fu4,zhi1,lei3 +口血未干=kou3,xue4,wei4,gan1 +口觉=kou3,jue2 +口角=kou3,jue2 +口角春风=kou3,jiao3,chun1,feng1 +口角生风=kou3,jiao3,sheng1,feng1 +口角风情=kou3,jiao3,feng1,qing2 +口轻舌薄=kou3,qing1,she2,bo2 +古=gu3 +古为今用=gu3,wei2,jin1,yong4 +古刹=gu3,cha4 +古朴=gu3,piao2 +古细菌域=gu3,xi4,jun4,yu4 +古细菌界=gu3,xi4,jun4,jie4 +古调不弹=gu3,diao4,bu4,tan2 +古调单弹=gu3,diao4,dan1,tan2 +古道热肠=gu3,dao4,re4,chang2 +古都=gu3,du1 +句=ju4,gou1 +句子=ju4,zi5 +句读=ju4,dou4 +另=ling4 +另一方面=ling4,yi4,fang1,mian4 +另一面=ling4,yi2,mian4 +另当别论=ling4,dang1,bie2,lun4 +另辟蹊径=ling4,pi4,xi1,jing4 +叧=gua3 +叨=dao1,tao1 +叨光=tao1,guang1 +叨叨=dao1,dao4 +叨咕=dao2,gu4 +叨在知己=tao1,zai4,zhi1,ji3 +叨扰=tao1,rao3 +叨拢=tao1,long3 +叨陪=tao1,pei2 +叩=kou4 +叩心泣血=kou4,xin1,qi4,xue4 +只=zhi3,zhi1 +只字=zhi1,zi4 +只字不提=zhi1,zi4,bu4,ti2 +只得=zhi3,de5 +只此=zhi1,ci3 +只见=zhi1,jian4 +只言片语=zhi1,yan2,pian4,yu3 +只读存储器=zhi1,du2,cun2,chu3,qi4 +只身=zhi1,shen1 +只轮不反=zhi1,lun2,bu4,fan3 +只轮不返=zhi1,lun2,bu4,fan3 +只轮无反=zhi1,lun2,wu2,fan3 +只骑不反=zhi1,qi2,bu4,fan3 +只鸡斗酒=zhi1,ji1,dou3,jiu3 +只鸡樽酒=zhi1,ji1,zun1,jiu3 +只鸡絮酒=zhi1,ji1,xu4,jiu3 +叫=jiao4 +召=zhao4,shao4 +叭=ba1 +叮=ding1 +可=ke3,ke4 +可不是=ke3,bu2,shi4 +可怜相=ke3,lian2,xiang4 +可恶=ke3,wu4 +可曾=ke3,zeng1 +可汗=ke4,han2 +可的松=ke3,di4,song1 +可着=ke3,zhe5 +台=tai2,tai1 +台子=tai2,zi5 +台州=tai1,zhou1 +台柱子=tai2,zhu4,zi5 +台湾话=tai1,wan1,hua4 +台观=tai2,guan4 +叱=chi4 +叱咤=chi4,zha4 +叱咤风云=chi4,zha4,feng1,yun2 +叱喝=chi4,he4 +史=shi3 +史乘=shi3,sheng4 +右=you4 +叴=qiu2 +叵=po3 +叶=ye4,xie2 +叶公好龙=ye4,gong1,hao4,long2 
+叶子=ye4,zi5 +叶子烟=ye4,zi5,yan1 +叶落归根=ye4,luo4,gui1,gen1 +叶韵=xie2,yun4 +号=hao4,hao2 +号丧=hao2,sang1 +号叫=hao2,jiao4 +号召=hao2,zhao4 +号咷大哭=hao2,tao2,da4,ku1 +号哭=hao2,ku1 +号啕=hao2,tao2 +号啕大哭=hao2,tao2,da4,ku1 +号寒啼饥=hao2,han2,ti2,ji1 +号数=hao4,shu4 +司=si1 +司务长=si1,wu4,zhang3 +司长=si1,zhang3 +叹=tan4 +叹为观止=tan4,wei2,guan1,zhi3 +叺=chi3 +叻=le4 +叼=diao1 +叽=ji1 +叽哩咕噜=ji1,li3,gu1,lu1 +叿=hong1,hong2 +吀=mie1 +吁=xu1,yu4 +吁咈都俞=yu4,fu2,dou1,yu2 +吁天呼地=yu4,tian1,hu1,di4 +吁求=yu4,qiu2 +吁请=yu4,qing3 +吂=mang2 +吃=chi1 +吃不了兜着走=chi1,bu4,liao3,dou1,zhe5,zou3 +吃不住=chi1,bu2,zhu4 +吃人不吐骨头=chi1,ren2,bu4,tu3,gu2,tou5 +吃哑巴亏=chi1,ya3,ba5,kui1 +吃得下=chi1,de5,xia4 +吃得住=chi1,de5,zhu4 +吃得开=chi1,de5,kai1 +吃得来=chi1,de5,lai2 +吃得消=chi1,de5,xiao1 +吃相=chi1,xiang4 +吃着碗里瞧着锅里=chi1,zhe5,wan3,li3,qiao2,zhe5,guo1,li3 +吃苦受累=chi1,ku3,shou4,lei4 +吃里扒外=chi1,li3,pa2,wai4 +各=ge4,ge3 +各奔前程=ge4,ben4,qian2,cheng2 +各有所好=ge4,you3,suo3,hao4 +各有所长=ge4,you3,suo3,cheng2 +各自为战=ge4,zi4,wei2,zhan4 +各自为政=ge4,zi4,wei2,zheng4 +各色名样=ge4,se4,ge4,yang4 +各行各业=ge4,hang2,ge4,ye4 +吅=xuan1,song4 +吆=yao1 +吆五喝六=yao1,wu3,he4,liu4 +吆喝=yao1,he4 +吇=zi3 +合=he2,ge3 +合两为一=he2,liang3,wei2,yi1 +合二为一=he2,er4,wei2,yi1 +合从连衡=he2,zong4,lian2,heng2 +合得来=he2,de5,lai2 +合数=he2,shu4 +合浦珠还=he2,pu3,zhu1,huan2 +合浦还珠=he2,pu3,huan2,zhu1 +合着=he2,zhe5 +合缝=he2,feng4 +合而为一=he2,er2,wei2,yi1 +吉=ji2 +吉人天相=ji2,ren2,tian1,xiang4 +吉人自有天相=ji2,ren2,zi4,you3,tian1,xiang4 +吊=diao4 +吊丧=diao4,sang1 +吊儿郎当=diao4,er5,lang2,dang1 +吊卷=diao4,juan4 +吊尔郎当=diao4,er5,lang2,dang1 +吊铺=diao4,pu4 +吋=dou4,cun4 +同=tong2,tong4 +同声相应=tong2,sheng1,xiang1,ying4 +同好=tong2,hao4 +同心僇力=tong2,xin1,jie2,li4 +同恶相党=tong2,e4,xiang1,dang3 +同恶相助=tong2,wu4,xiang1,zhu4 +同恶相恤=tong2,wu4,xiang1,xu4 +同恶相求=tong2,e4,xiang1,qiu2 +同恶相济=tong2,e4,xiang1,ji4 +同行=tong2,hang2 +同调=tong2,diao4 +名=ming2 +名不见经传=ming2,bu2,jian4,jing1,zhuan4 +名刹=ming2,sha1 +名实相副=ming2,shi2,xiang1,fu4 +名实相符=ming2,shi2,xiang1,fu2 +名将=ming2,jiang4 +名我固当=ming2,wo3,gu4,dang1 +名数=ming2,shu4 +名角=ming2,jue2 +后=hou4 +后劲=hou4,jing4 +后头=hou4,tou5 +后爪=hou4,zhua3 +吏=li4 
+吐=tu3,tu4 +吐口水=tu4,kou3,shui3 +吐哺握发=tu3,bu3,wo4,fa4 +吐沫=tu4,mo4 +吐泻=tu4,xie4 +吐肝露胆=tu3,gan1,lu4,dan3 +吐蕃=tu3,bo1 +吐血=tu4,xue3 +吐谷浑=tu3,yu4,hun2 +吐露=tu3,lu4 +吐露真情=tu3,lu4,zhen1,qing2 +向=xiang4 +向上一纵=xiang4,shang4,yi2,zong4 +向声背实=xiang4,sheng1,bei4,shi2 +向着=xiang4,zhe5 +向背=xiang4,bei4 +吒=zha4,zha1 +吓=xia4,he4 +吓一大跳=xia4,yi2,da4,tiao4 +吓唬=xia4,hu4 +吔=ye1 +吕=lv3 +吖=ya1,a1 +吖啶=a1,ding4 +吗=ma5,ma2,ma3 +吗啡=ma3,fei1 +吗玩意儿=ma2,wan2,yi4,er5 +吘=ou3 +吙=huo1 +吚=yi1 +君=jun1 +君子=jun1,zi3 +君子好逑=jun1,zi5,hao3,qiu2 +吜=chou3 +吝=lin4 +吞=tun1 +吞咽=tun1,yan4 +吞没=tun1,mo4 +吞言咽理=tun1,yan2,yan1,li3 +吟=yin2 +吟哦=yin2,e2 +吠=fei4 +吡=pi3,bi3 +吡咯=bi3,luo4 +吡啶=bi3,ding4 +吢=qin4 +吣=qin4 +吤=jie4,ge4 +吥=bu4 +否=fou3,pi3 +否则=fou2,ze2 +否去泰来=pi3,qu4,tai4,lai2 +否往泰来=pi3,wang3,tai4,lai2 +否极泰回=pi3,ji2,tai4,hui2 +否极泰来=pi3,ji2,tai4,lai2 +否极阳回=pi3,ji2,yang2,hui2 +否终则泰=pi3,zhong1,ze2,tai4 +否终复泰=pi3,zhong1,fu4,tai4 +否认=fou2,ren4 +吧=ba1,ba5 +吧的一声=ba1,de5,yi1,sheng1 +吨=dun1 +吩=fen1 +吪=e2,hua1 +含=han2 +含含糊糊=han2,han2,hu4,hu1 +含垢藏疾=han2,gou3,cang2,ji2 +含着骨头露着肉=han2,zhe5,gu3,tou2,lu4,zhe5,rou4 +含糊=han2,hu2 +含糊不清=han2,hu2,bu4,qing1 +含血=han2,xue4 +含血喷人=han2,xue4,pen1,ren2 +含血噀人=han2,xue4,xun4,ren2 +含血潠人=han2,xue4,xun4,ren2 +听=ting1 +听差=ting1,chai1 +听得到=ting1,de5,dao4 +听得懂=ting1,de5,dong3 +听得见=ting1,de5,jian4 +听而不闻=ting1,er2,bu2,wen2 +吭=hang2,keng1 +吭哧=keng1,chi1 +吭声=keng1,sheng1 +吭气=keng1,qi4 +吮=shun3 +启=qi3 +启蒙=qi3,meng2 +吰=hong2 +吱=zhi1,zi1 +吱哩哇啦=zhi1,li1,wa1,la1 +吱声=zi1,sheng1 +吱扭=zi1,niu3 +吲=yin3,shen3 +吳=wu2 +吴=wu2 +吴下阿蒙=wu2,xia4,a1,meng2 +吴堡=wu2,bu3 +吵=chao3,chao1 +吵吵=chao1,chao4 +吵吵闹闹=chao1,chao4,nao4,nao4 +吶=na4,ne4 +吷=xue4,chuo4,jue2 +吸=xi1 +吸血鬼=xi1,xue4,gui3 +吹=chui1 +吹毛数睫=chui1,mao2,shu4,jie2 +吹竹弹丝=chui1,zhu2,tan2,si1 +吺=dou1,ru2 +吻=wen3 +吼=hou3 +吽=hou3,hong1,ou1 +吾=wu2,yu4 +吾自有处=wu4,zi5,you4,chu5 +吿=gao4 +呀=ya1,ya5 +呁=jun4 +呂=lv3 +呃=e4 +呄=ge2 +呅=wen3 +呆=dai1 +呆子=dai1,zi5 +呆板=dai1,ban3 +呇=qi3 +呈=cheng2 +呈露=cheng2,lu4 +呉=wu2 +告=gao4 +告一段落=gao4,yi1,duan4,luo4 +告假=gao4,jia4 +告朔饩羊=gu4,shuo4,xi4,yang2 
+告老还家=gao4,lao3,huan2,jia1 +呋=fu1 +呌=jiao4 +呍=hong1 +呎=chi3 +呏=sheng1 +呐=na4,ne4 +呑=tun1,tian1 +呒=fu3 +呓=yi4 +呔=dai1 +呕=ou3,ou1,ou4 +呕吐=ou3,tu4 +呕哑=ou1,ya1 +呕心沥血=ou3,xin1,li4,xue4 +呕心滴血=ou3,xin1,di1,xue4 +呕气=ou4,qi4 +呕血=ou3,xue4 +呖=li4 +呗=bei5,bai4 +呗唱=bai4,chang4 +员=yuan2,yun2,yun4 +呙=wai1,he2,wo3,wa1,gua3,guo1 +呚=hua2,qi4 +呛=qiang1,qiang4 +呛人=qiang4,ren2 +呛到=qiang4,dao4 +呛眼=qiang4,yan3 +呛鼻=qiang4,bi2 +呜=wu1 +呜呜咽咽=wu1,wu1,ye4,ye4 +呜咽=wu1,ye4 +呝=e4 +呞=shi1 +呟=juan3 +呠=pen3 +呡=wen3,min3 +呢=ne5,ni2 +呢喃=ni2,nan2 +呢喃细语=ni2,nan2,xi4,yu3 +呢子=ni2,zi3 +呢绒=ni2,rong2 +呣=mou2 +呤=ling2 +呥=ran2 +呦=you1 +呧=di3 +周=zhou1 +周正=zhou1,zheng1 +呩=shi4 +呪=zhou4 +呫=tie4,che4 +呬=xi4 +呭=yi4 +呮=qi4,zhi1 +呯=ping2 +呰=zi3,ci1 +呱=gua1,gu1,gua3 +呱呱坠地=gu1,gu1,zhui4,di4 +呱呱堕地=gu1,gu1,duo4,di4 +呲=zi1,ci1 +味=wei4 +味同嚼蜡=wei4,tong2,jiao2,la4 +呴=xu3,hou3,gou4 +呵=he1,a5,ke1 +呵叻=ke1,le4 +呵喝=he1,he4 +呵欠=he1,qian5 +呵欠连天=he1,qian4,lian2,tian1 +呶=nao2 +呷=xia1 +呸=pei1 +呹=yi4 +呺=xiao1,hao2 +呻=shen1 +呼=hu1 +呼不给吸=hu1,bu4,ji3,xi1 +呼卢喝雉=hu1,lu2,he4,zhi4 +呼号=hu1,hao2 +呼吁=hu1,yu4 +呼喝=hu1,he4 +呼天吁地=hu1,tian1,yu4,di4 +呼天抢地=hu1,tian1,qiang1,di4 +呼天钥地=hu1,tian1,yao4,di4 +呼幺喝六=hu1,yao1,he4,liu4 +呼应=hu1,ying4 +呼来喝去=hu1,lai2,he4,qu4 +命=ming4 +命中=ming4,zhong4 +命中注定=ming4,zhong1,zhu4,ding4 +命数=ming4,shu4 +命薄=ming4,bo2 +命薄缘悭=ming4,bao2,yuan2,qian1 +呾=da2,dan4 +呿=qu1 +咀=ju3,zui3 +咀嚼=ju3,jue2 +咁=xian2,gan1 +咂=za1 +咃=tuo1 +咄=duo1 +咅=pou3 +咆=pao2 +咇=bi4 +咈=fu2 +咉=yang3 +咊=he2,he4 +咋=za3,ze2,zha1 +咋办=zha2,ban4 +咋呼=zha1,hu1 +咋舌=ze2,she2 +和=he2,he4,huo2,huo4,hu2 +和了=hu2,le5 +和平共处=he2,ping2,gong4,chu3 +和平共处五项原则=he2,ping2,gong4,chu3,wu3,xiang4,yuan2,ze2 +和弄=huo4,nong4 +和数=he2,shu4 +和泥=huo2,ni2 +和稀泥=huo4,xi1,ni2 +和药=huo4,yao4 +和诗=he4,shi1 +和面=huo2,mian4 +和颜说色=he2,yan2,yue4,se4 +咍=hai1 +咎=jiu4 +咎有应得=jiu4,you3,ying1,de2 +咏=yong3 +咏叹调=yong3,tan4,diao4 +咐=fu4 +咑=da1 +咒=zhou4 +咓=wa3 +咔=ka3 +咔嗒=ka1,da1 +咔嚓=ka1,cha1 +咕=gu1 +咖=ka1,ga1 +咖喱=ga1,li2 +咗=zuo5 +咘=bu4 +咙=long2 +咚=dong1 +咛=ning2 +咜=tuo1 +咝=si1 +咞=xian4,xian2 +咟=huo4 +咠=qi4 +咡=er4 +咢=e4 
+咣=guang1 +咤=zha4 +咥=die2,xi1 +咦=yi2 +咧=lie1,lie3,lie2,lie5 +咧嘴=lie3,zui3 +咧着嘴笑=lie3,zhe5,zui3,xiao4 +咨=zi1 +咩=mie1 +咪=mi1 +咫=zhi3 +咬=yao3 +咬人狗儿不露齿=yao3,ren2,gou3,er2,bu4,lou4,chi3 +咬文嚼字=yao3,wen2,jiao2,zi4 +咬牙切齿=yao3,ya2,qie4,chi3 +咬血为盟=yao3,xue4,wei2,meng2 +咭=ji1,xi1,qia4 +咮=zhou4 +咯=ka3,luo4,lo5,ge1 +咯吱=ge1,zhi1 +咯咯=ge1,ge1 +咯噔=ge1,deng1 +咰=shu4,xun2 +咱=zan2,za2,za3 +咱们=zan2,men5 +咱俩=zan2,lia3 +咱家=za2,jia1 +咲=xiao4 +咳=ke2,hai1 +咳咳=hai1,hai1 +咳声叹气=hai1,sheng1,tan4,qi4 +咳血=ke2,xue4 +咴=hui1 +咵=kua1 +咶=huai4,shi4 +咷=tao2 +咸=xian2 +咹=e4,an4 +咺=xuan3,xuan1 +咻=xiu1 +咼=wai1,he2,wo3,wa1,gua3,guo1 +咽=yan4,yan1,ye4 +咽下去=yan4,xia4,qu4 +咽口水=yan4,kou3,shui3 +咽喉=yan1,hou2 +咽头=yan1,tou2 +咽峡炎=yan1,xia2,yan2 +咽炎=yan1,yan2 +咾=lao3 +咿=yi1 +哀=ai1 +哀乐=ai1,yue4 +哀号=ai1,hao2 +品=pin3 +品竹调弦=pin3,zhu2,diao4,xian2 +哂=shen3 +哃=tong2 +哄=hong1,hong3,hong4 +哄人=hong3,ren2 +哄劝=hong3,quan4 +哄哄=hong3,hong3 +哄场=hong4,chang3 +哄堂=hong1,tang2 +哄孩子=hong3,hai2,zi5 +哄小孩=hong3,xiao3,hai2 +哄弄=hong3,nong4 +哄抢=hong4,qiang3 +哄诱=hong3,you4 +哄逗=hong3,dou4 +哄闹=hong4,nao4 +哄骗=hong3,pian4 +哅=xiong1 +哆=duo1 +哇=wa1,wa5 +哈=ha1,ha3,ha4 +哈什蚂=ha4,shi2,ma3 +哈巴狗=ha3,ba1,gou3 +哈罗=ha1,luo5 +哈达=ha3,da2 +哉=zai1 +哊=you4 +哋=die4,di4 +哌=pai4 +响=xiang3 +响应=xiang3,ying4 +哎=ai1 +哎呀=ai1,ya5 +哎哟=ai1,yo5 +哏=gen2,hen3 +哐=kuang1 +哑=ya3,ya1 +哑哑=ya1,ya1 +哑场=ya3,chang3 +哑子=ya3,zi5 +哑子做梦=ya3,zi3,zuo4,meng4 +哑子吃黄连=ya3,zi3,chi1,huang2,lian2 +哑子寻梦=ya3,zi3,xun2,meng4 +哑子托梦=ya3,zi3,tuo1,meng4 +哑巴=ya3,ba5 +哑巴亏=ya3,ba5,kui1 +哑巴吃黄连=ya3,ba5,chi1,huang2,lian2 +哑然=ya3,ran2 +哒=da1 +哓=xiao1 +哔=bi4 +哕=yue3,hui4 +哖=nian2 +哗=hua2,hua1 +哗哗=hua1,hua1 +哗啦=hua1,la1 +哗啦啦=hua1,la1,la1 +哘=xing2 +哙=kuai4 +哚=duo3 +哜=ji4,jie1,zhai1 +哝=nong2 +哞=mou1 +哟=yo1,yo5 +哠=hao4 +員=yuan2,yun2,yun4 +哢=long4 +哣=pou3 +哤=mang2 +哥=ge1 +哥们=ge1,men5 +哥们儿=ge1,men5,er5 +哥儿们=ge1,er5,men5 +哦=o4,o2,e2 +哧=chi1 +哧哧地笑=chi1,chi1,de5,xiao4 +哧的一声=chi1,de5,yi4,sheng1 +哨=shao4 +哨卡=shao4,qia3 +哩=li5,li3,li1 +哩哩啦啦=li1,li1,la1,la1 +哩哩罗罗=li5,li5,luo1,luo1 +哪=na3,nei3,na5,ne2 +哪个=nei3,ge4 +哪些=nei3,xie1 
+哪会儿=nei3,hui4,er5 +哪儿=na3,er5 +哪吒=ne2,zha1 +哪里都=na3,li5,dou1 +哫=zu2 +哬=he4 +哭=ku1 +哭丧=ku1,sang5 +哭丧着脸=ku1,sang5,zhe5,lian3 +哭天抢地=ku1,tian1,qiang1,di4 +哮=xiao4 +哯=xian4 +哰=lao2 +哱=po4,ba1,bo1 +哲=zhe2 +哳=zha1 +哴=liang4,lang2 +哵=ba1 +哶=mie1 +哷=lie4,lv4 +哸=sui1 +哹=fu2 +哺=bu3 +哺乳假=bu3,ru3,jia4 +哻=han1 +哼=heng1 +哼哈二将=heng1,ha1,er4,jiang4 +哽=geng3 +哽咽=geng3,ye4 +哽塞=geng3,se4 +哾=chuo4,yue4 +哿=ge3,jia1 +唀=you4 +唁=yan4 +唂=gu1 +唃=gu1 +唄=bei5,bai4 +唅=han2,han4 +唆=suo1 +唇=chun2 +唇辅相连=chun2,fu3,xiang1,lian2 +唈=yi4 +唉=ai1,ai4 +唊=jia2,qian3 +唋=tu3,tu4 +唌=dan4,xian2,yan2 +唍=wan3 +唎=li4 +唏=xi1 +唐=tang2 +唐临晋帖=tang2,lin2,jin4,tie1 +唑=zuo4 +唒=qiu2 +唓=che1 +唔=wu4,wu2 +唕=zao4 +唖=ya3 +唗=dou1 +唘=qi3 +唙=di2 +唚=qin4 +唛=mai4 +唛头=ma4,tou2 +唝=gong4,hong3,gong3 +唞=dou2 +唠=lao4,lao2 +唠叨=lao2,dao1 +唠唠叨叨=lao1,lao1,dao1,dao1 +唠嗑=lao4,ke4 +唡=liang3 +唢=suo3 +唣=zao4 +唤=huan4 +唤头=huan4,tou5 +唥=leng2 +唦=sha1 +唧=ji1 +唨=zu3 +唩=wo1,wei3 +唪=feng3 +唫=jin4,yin2 +唬=hu3,xia4 +唭=qi4 +售=shou4 +唯=wei2 +唯唯否否=wei3,wei3,fou3,fou3 +唯恐天下不乱=wei2,kong3,tian1,xia4,bu4,luan4 +唯所欲为=wei2,suo3,yu4,wei2 +唰=shua1 +唱=chang4 +唱反调=chang4,fan3,diao4 +唱和=chang4,he4 +唱片儿=chang4,pian1,er5 +唱筹量沙=chang4,chou2,liang2,sha1 +唱高调=chang4,gao1,diao4 +唲=er2,wa1 +唳=li4 +唴=qiang4 +唵=an3 +唵嘛呢叭咪吽=an3,ma5,ne5,ba5,mi1,hong1 +唶=jie4,ze2,ji2 +唷=yo1 +唸=nian4 +唹=yu1 +唺=tian3 +唻=lai4 +唼=sha4 +唽=xi1 +唾=tuo4 +唿=hu1 +啀=ai2 +啁=zhou1,zhao1,tiao4 +啂=gou4 +啃=ken3 +啄=zhuo2 +啅=zhuo2,zhao4 +商=shang1 +商行=shang1,hang2 +商调=shang1,diao4 +商贾=shang1,gu3 +商量=shang1,liang2 +啇=di2 +啈=heng4 +啉=lan2,lin2 +啊=a1,a2,a3,a4,a5 +啋=cai3 +啌=qiang1 +啍=zhun1,tun1,xiang1,dui3 +啎=wu3 +問=wen4 +啐=cui4,qi5 +啑=sha4,jie2,die2,ti4 +啒=gu3 +啓=qi3 +啔=qi3 +啕=tao2 +啖=dan4 +啗=dan4 +啘=yue1,wa1 +啙=zi3,ci3 +啚=bi3,tu2 +啛=cui4 +啛啛喳喳=cui4,cui4,cha1,cha1 +啜=chuo4,chuai4 +啝=he2 +啞=ya3,ya1 +啟=qi3 +啠=zhe2 +啡=fei1 +啢=liang3 +啣=xian2 +啤=pi2 +啥=sha2 +啥子=sha2,zi5 +啦=la1,la5 +啧=ze2 +啨=qing2,ying1 +啩=gua4 +啪=pa1 +啫=ze2,shi4 +啬=se4 +啭=zhuan4 +啮=nie4 +啮血为盟=nie4,xue4,wei2,meng2 +啮血沁骨=nie4,xue4,qin4,gu3 +啯=guo1 
+啰=luo1,luo2,luo5 +啱=yan2 +啲=di1 +啳=quan2 +啴=tan1,chan3,tuo1 +啵=bo5 +啶=ding4 +啷=lang1 +啸=xiao4 +啹=ju2 +啺=tang2 +啻=chi4 +啼=ti2 +啼血=ti2,xue4 +啼饥号寒=ti2,ji1,hao2,han2 +啽=an1,an2 +啾=jiu1 +啿=dan4 +喀=ka1 +喀什=ka1,shi2 +喀嚓=ka1,cha1 +喁=yong2 +喂=wei4 +喃=nan2 +善=shan4 +善为士者不武=shan4,wei2,shi4,zhe3,bu4,wu3 +善为说辞=shan4,wei2,shuo1,ci2 +善于应对=shan4,yu2,ying4,dui4 +善善恶恶=shan4,shan4,wu4,e4 +善处=shan4,chu3 +善自为谋=shan4,zi4,wei2,mou2 +善贾而沽=shan4,jia4,er2,gu1 +喅=yu4 +喆=zhe2 +喇=la3 +喇嘛=la3,ma5 +喈=jie1 +喉=hou2 +喊=han3 +喊倒好儿=han3,dao4,hao3,er5 +喊哑嗓子=han3,ya3,sang3,zi5 +喋=die2,zha2 +喋血=die2,xue4 +喌=zhou1 +喍=chai2 +喎=wai1 +喏=nuo4,re3 +喐=huo4,guo2,xu4 +喑=yin1 +喒=zan2,za2,za3 +喓=yao1 +喔=o1,wo1 +喔喔=wo1,wo1 +喕=mian3 +喖=hu2 +喗=yun3 +喘=chuan3 +喙=hui4 +喚=huan4 +喛=huan4,yuan2,xuan3,he2 +喜=xi3 +喜好=xi3,hao4 +喜怒哀乐=xi3,nu4,ai1,le4 +喝=he1,he4,ye4 +喝令=he4,ling4 +喝倒彩=he4,dao4,cai3 +喝彩=he4,cai3 +喝斥=he4,chi4 +喝止=he4,zhi3 +喝道=he4,dao4 +喝采=he4,cai3 +喝问=he4,wen4 +喞=ji1 +喟=kui4 +喠=zhong3,chuang2 +喡=wei2,wei4 +喢=sha4 +喣=xu3 +喤=huang2 +喥=duo2,zha4 +喦=yan2 +喧=xuan1 +喨=liang4 +喩=yu4 +喪=sang4,sang1 +喫=chi1 +喬=qiao2,jiao1 +喭=yan4 +單=dan1,shan4,chan2 +喯=ben1,pen4 +喰=can1,sun1,qi1 +喱=li2 +喲=yo1,yo5 +喳=zha1,cha1 +喴=wei1 +喵=miao1 +営=ying2 +喷=pen1,pen4 +喷嚏=pen1,ti4 +喷撒=pen1,sa3 +喷薄=pen1,bo2 +喷薄欲出=pen1,bo2,yu4,chu1 +喷血自污=pen1,xue4,zi4,wu1 +喷香=pen4,xiang1 +喹=kui2 +喹啉=kui2,lin2 +喺=xi2 +喻=yu4 +喼=jie1 +喽=lou2,lou5 +喾=ku4 +喿=zao4,qiao1 +嗀=hu4 +嗁=ti2 +嗂=yao2 +嗃=he4,xiao1,xiao4,hu4 +嗄=sha4,a2 +嗅=xiu4 +嗅觉丧失=xiu4,jue2,min3,rui4 +嗅觉减退=xiu4,jue2,min3,rui4 +嗅觉迟钝=xiu4,jue2,min3,rui4 +嗆=qiang1,qiang4 +嗇=se4 +嗈=yong1 +嗉=su4 +嗊=gong4,hong3,gong3 +嗋=xie2 +嗌=yi4,ai4 +嗍=suo1 +嗎=ma5,ma2,ma3 +嗏=cha1 +嗐=hai4 +嗑=ke1,ke4 +嗒=da1,ta4 +嗒丧=ta4,sang4 +嗒然=ta4,ran2 +嗒然若丧=ta4,ran2,ruo4,sang4 +嗓=sang3 +嗔=chen1 +嗔目切齿=chen1,mu4,qie1,chi3 +嗔着=chen1,zhe5 +嗕=ru4 +嗖=sou1 +嗗=wa1,gu3 +嗘=ji1 +嗙=beng1,pang3 +嗚=wu1 +嗛=xian2,qian4,qie4 +嗜=shi4 +嗜好=shi4,hao4 +嗜血=shi4,xue4 +嗝=ge2 +嗞=zi1 +嗟=jie1 +嗠=lao4 +嗡=weng1 +嗢=wa4 +嗣=si4 +嗣子=si4,zi3 +嗤=chi1 +嗥=hao2 +嗦=suo1 +嗧=jia1,lun2 
+嗨=hai1,hei1 +嗨哟=hai1,yo5 +嗩=suo3 +嗪=qin2 +嗫=nie4 +嗬=he1 +嗭=zi5 +嗮=sai3 +嗯=en4 +嗰=ge3 +嗱=na2 +嗲=dia3 +嗳=ai4,ai3,ai1 +嗳酸=ai3,suan1 +嗴=qiang1 +嗵=tong1 +嗶=bi4 +嗷=ao2 +嗸=ao2 +嗹=lian2 +嗺=zui1,sui1 +嗻=zhe1,zhe4,zhu4,zhe5 +嗼=mo4 +嗽=sou4 +嗾=sou3 +嗿=tan3 +嘀=di2 +嘀嗒=di1,da1 +嘁=qi1 +嘁哩喀喳=qi1,li3,ka1,cha1 +嘂=jiao4 +嘃=chong1 +嘄=jiao4,dao3 +嘅=kai3,ge3 +嘆=tan4 +嘇=shan1,can4 +嘈=cao2 +嘈嘈切切=cao2,cao2,qie1,qie1 +嘉=jia1 +嘉兴市=jia1,xing1,shi4 +嘉应=jia1,ying4 +嘊=ai2 +嘋=xiao4 +嘌=piao1 +嘌呤=piao4,ling2 +嘍=lou2,lou5 +嘎=ga1,ga2,ga3 +嘎子=ga3,zi3 +嘏=gu3 +嘐=xiao1,jiao1 +嘑=hu1 +嘒=hui4 +嘓=guo1 +嘔=ou3,ou1,ou4 +嘕=xian1 +嘖=ze2 +嘗=chang2 +嘘=xu1,shi1 +嘙=po2 +嘚=de1,dei1 +嘛=ma2,ma5 +嘜=ma4 +嘝=hu2 +嘞=lei5,le1 +嘟=du1 +嘟噜=du1,lu5 +嘠=ga1,ga2,ga3 +嘡=tang1 +嘢=ye3 +嘣=beng1 +嘤=ying1 +嘥=sai1 +嘦=jiao4 +嘧=mi4 +嘨=xiao4 +嘩=hua2,hua1 +嘪=mai3 +嘫=ran2 +嘬=zuo1 +嘭=peng1 +嘮=lao4,lao2 +嘯=xiao4 +嘰=ji1 +嘱=zhu3 +嘲=chao2,zhao1 +嘲哳=zhao1,zha1 +嘲笑=chao2,xiao4 +嘲讽=chao2,feng3 +嘲骂=chao2,ma4 +嘳=kui4 +嘴=zui3 +嘴尖舌头快=zui3,jian1,she2,tou2,kuai4 +嘵=xiao1 +嘶=si1 +嘷=hao2 +嘸=fu3 +嘹=liao2 +嘺=qiao2,qiao4 +嘻=xi1 +嘼=shou4,chu4,xu4 +嘽=tan1,chan3 +嘾=dan4,tan2 +嘿=hei1,mo4 +噀=xun4 +噁=e3 +噂=zun1 +噃=fan1,bo5 +噄=chi1 +噅=hui1 +噆=zan3 +噇=chuang2 +噈=cu4,za1,he2 +噉=dan4 +噊=jue2 +噋=tun1,kuo4 +噌=ceng1 +噍=jiao4 +噎=ye1 +噏=xi1 +噐=qi4 +噑=hao2 +噒=lian2 +噓=xu1,shi1 +噔=deng1 +噕=hui1 +噖=yin2 +噗=pu1 +噗哧一声=pu1,chi1,yi4,sheng1 +噘=jue1 +噙=qin2 +噚=xun2 +噛=nie4 +噜=lu1 +噝=si1 +噞=yan3 +噟=ying1 +噠=da1 +噡=zhan1 +噢=o1 +噣=zhou4,zhuo2 +噤=jin4 +噥=nong2 +噦=yue3,hui4 +噧=xie4 +器=qi4 +器乐=qi4,yue4 +噩=e4 +噪=zao4 +噫=yi1 +噬=shi4 +噭=jiao4,qiao4,chi1 +噮=yuan4 +噯=ai4,ai3,ai1 +噰=yong1,yong3 +噱=jue2,xue2 +噱头=xue2,tou5 +噲=kuai4 +噳=yu3 +噴=pen1,pen4 +噵=dao4 +噶=ga2 +噶厦=ga2,xia4 +噷=xin1,hen3,hen4 +噸=dun1 +噹=dang1 +噺=xin1 +噻=sai1 +噼=pi1 +噽=pi3 +噾=yin1 +噿=zui3 +嚀=ning2 +嚁=di2 +嚂=lan4 +嚃=ta4 +嚄=huo4,o3 +嚅=ru2 +嚆=hao1 +嚇=he4,xia4 +嚈=yan4 +嚉=duo1 +嚊=xiu4,pi4 +嚋=zhou1,chou2 +嚌=ji4,jie1,zhai1 +嚍=jin4 +嚎=hao2 +嚏=ti4 +嚏喷=ti4,pen5 +嚐=chang2 +嚑=xun1 +嚒=me1 +嚓=ca1,cha1 +嚔=ti4 +嚕=lu1 +嚖=hui4 +嚗=bao4,bo2,pao4 
+嚘=you1 +嚙=nie4 +嚚=yin2 +嚛=hu4 +嚜=mei4,me5,mo4 +嚝=hong1 +嚞=zhe2 +嚟=li2 +嚠=liu2 +嚡=xie2,hai2 +嚢=nang2 +嚣=xiao1 +嚤=mo1 +嚥=yan4 +嚦=li4 +嚧=lu2 +嚨=long2 +嚩=po2 +嚪=dan4 +嚫=chen4 +嚬=pin2 +嚭=pi3 +嚮=xiang4 +嚯=huo4 +嚰=me4 +嚱=xi1 +嚲=duo3 +嚳=ku4 +嚴=yan2 +嚵=chan2 +嚶=ying1 +嚷=rang3,rang1 +嚸=dian3 +嚹=la2 +嚺=ta4 +嚻=xiao1 +嚼=jiao2,jue2,jiao4 +嚼墨喷纸=jue2,mo4,pen1,zhi3 +嚼穿龈血=jiao2,chuan1,yin2,xue4 +嚽=chuo4 +嚾=huan4,huan1 +嚿=huo4 +囀=zhuan4 +囁=nie4 +囂=xiao1 +囃=za2,ca4 +囄=li2 +囅=chan3 +囆=chai4 +囇=li4 +囈=yi4 +囉=luo1,luo2,luo5 +囊=nang2,nang1 +囊萤照读=nang2,ying2,zhao4,shu1 +囊锥露颖=nang2,zhui1,lu4,ying3 +囋=zan4,za2,can1 +囌=su1 +囍=xi3 +囎=zeng4 +囏=jian1 +囐=yan4,za2,nie4 +囑=zhu3 +囒=lan2 +囓=nie4 +囔=nang1 +囖=luo2,luo1,luo5 +囗=wei2,guo2 +囘=hui2 +囙=yin1 +囚=qiu2 +囚徒=qiu2,tu2 +囚禁=qiu2,jin4 +四=si4 +四不拗六=si4,bu4,niu4,liu4 +四亭八当=si4,ting2,ba1,dang4 +四海为家=si4,hai3,wei2,jia1 +四舍五入=si4,she3,wu3,ru4 +四马攒蹄=si4,ma3,cuan2,ti2 +囜=nin2 +囝=jian3,nan1 +回=hui2 +回佣=hui2,yong4 +回天运斗=hui2,tian1,yun4,dou3 +回家心切=hui2,jia1,xin1,qie4 +回帖=hui2,tie1 +回应=hui2,ying4 +回船转舵=hui2,chuan2,zhan3,duo4 +回还=hui2,huan2 +囟=xin4 +因=yin1 +因为=yin1,wei4 +因公假私=yin1,gong1,jia3,si1 +因应=yin1,ying4 +因敌为资=yin1,di2,wei2,zi1 +因数=yin1,shu4 +因果报应=yin1,guo3,bao4,ying4 +因树为屋=yin1,shu4,wei2,wu1 +因祸为福=yin1,huo4,wei2,fu2 +因缘为市=yin1,yuan2,wei2,shi4 +囡=nan1 +团=tuan2 +团团转=tuan2,tuan2,zhuan4 +团头聚面=tuan4,tou2,ju4,mian4 +团结一致=tuan2,jie2,yi1,zhi4 +团长=tuan2,zhang3 +団=tuan2 +囤=tun2,dun4 +囤积=tun2,ji1 +囥=kang4 +囦=yuan1 +囧=jiong3 +囨=pian1 +囩=yun2 +囪=cong1 +囫=hu2 +囬=hui2 +园=yuan2 +囮=e2 +囯=guo2 +困=kun4 +困处=kun4,chu3 +困觉=kun4,jiao4 +困难=kun4,nan2 +困难重重=kun4,nan2,chong2,chong2 +囱=cong1 +囲=wei2,tong1 +図=tu2 +围=wei2 +围剿=wei2,jiao3 +囵=lun2 +囶=guo2 +囷=qun1 +囸=ri4 +囹=ling2 +固=gu4 +固执不变=gu4,zhi2,bu2,bian4 +囻=guo2 +囼=tai1 +国=guo2 +国丧=guo2,sang1 +国子监=guo2,zi3,jian4 +国帑=guo2,tang3 +国无宁日=guo2,wu2,ning2,ri4 +国都=guo2,du1 +国际复兴开发银行=guo2,ji4,fu4,xing1,kai1,fa1,yin2,hang2 +国难=guo2,nan4 +图=tu2 +图卷=tu2,juan4 +图穷匕见=tu2,qiong2,bi3,xian4 +图穷匕首见=tu2,qiong2,bi3,shou3,xian4 +囿=you4 +圀=guo2 +圁=yin2 
+圂=hun4 +圃=pu3 +圄=yu3 +圅=han2 +圆=yuan2 +圇=lun2 +圈=quan1,juan4,juan1 +圈子=quan1,zi5 +圈牢养物=juan4,lao2,yang3,wu4 +圉=yu3 +圊=qing1 +國=guo2 +圌=chuan2,chui2 +圍=wei2 +圎=yuan2 +圏=quan1,juan4,juan1 +圐=ku1 +圑=pu3 +園=yuan2 +圓=yuan2 +圔=ya4 +圕=tuan1 +圖=tu2 +圗=tu2 +團=tuan2 +圙=lve4 +圚=hui4 +圛=yi4 +圜=huan2,yuan2 +圝=luan2 +圞=luan2 +土=tu3 +土偶蒙金=tu3,ou3,meng2,jin1 +土堡=tu3,pu4 +土生土长=tu3,sheng1,tu3,zhang3 +土著=tu3,zhu4 +圠=ya4 +圡=tu3 +圢=ting3 +圣=sheng4 +圣人不为而成=sheng4,ren2,bu4,wei2,er2,cheng2 +圣君贤相=sheng4,jing1,xian2,xiang4 +圣经贤传=sheng4,jing1,xian2,zhuan4 +圤=pu2 +圥=lu4 +圦=kuai4 +圧=ya1 +在=zai4 +在劫难逃=zai4,jie2,nan2,tao2 +在行=zai4,hang2 +圩=xu1,wei2 +圩场=xu1,chang2 +圩垸=wei2,yuan4 +圩堤=wei2,di1 +圩子=wei2,zi3 +圪=ge1 +圫=yu4,zhun1 +圬=wu1 +圭=gui1 +圭角不露=gui1,jiao3,bu4,lu4 +圮=pi3 +圯=yi2 +地=di4,de5 +地上=di4,shang5 +地上天官=di4,shang4,tian1,guan1 +地上天宫=di4,shang4,tian1,gong1 +地上茎=di4,shang4,jing1 +地动山摇=di4,dong4,shan1,yao2 +地堡=di4,pu4 +地壳=di4,qiao4 +地处=di4,chu3 +地暴十寒=di4,pu4,shi2,han2 +地铺=di4,pu4 +圱=qian1,su2 +圲=qian1 +圳=zhen4 +圴=zhuo2 +圵=dang4 +圶=qia4 +圷=xia4 +圸=shan1 +圹=kuang4 +场=chang3,chang2 +场合=chang3,he2 +场子=chang3,zi5 +场所=chang3,suo3 +场院=chang2,yuan4 +圻=qi2,yin2 +圼=nie4 +圽=mo4 +圾=ji1 +圿=jia2 +址=zhi3 +坁=zhi3,zhi4 +坂=ban3 +坃=xun1 +坄=yi4 +坅=qin3 +坆=mei2,fen2 +均=jun1 +坈=rong3,keng1 +坉=tun2,dun4 +坊=fang1,fang2 +坊巷=fang1,xiang4 +坋=ben4,fen4 +坌=ben4 +坍=tan1 +坎=kan3 +坏=huai4 +坏兆头=huai4,zhao4,tou5 +坏分子=huai4,fen4,zi3 +坏血病=huai4,xue4,bing4 +坏裳为裤=huai4,shang5,wei2,ku4 +坐=zuo4 +坐不重席=zuo4,bu4,chong2,xi2 +坐台子=zuo4,tai2,zi5 +坐月子=zuo4,yue4,zi5 +坐禁闭=zuo4,jin4,bi4 +坐视不救=zuo4,shi1,bu4,jiu4 +坐起来=zuo4,qi5,lai2 +坑=keng1 +坑蒙=keng1,meng2 +坑蒙拐骗=keng1,meng2,guai3,pian4 +坒=bi4 +坓=jing3 +坔=di4,lan4 +坕=jing1 +坖=ji4 +块=kuai4 +坘=di3 +坙=jing1 +坚=jian1 +坚持不懈=jian1,chi2,bu4,xie4 +坛=tan2 +坛子=tan2,zi5 +坜=li4 +坝=ba4 +坞=wu4 +坟=fen2 +坠=zhui4 +坡=po1 +坢=ban4,pan3 +坣=tang2 +坤=kun1 +坤角儿=kun1,jue2,er2 +坥=qu1 +坦=tan3 +坦率=tan3,shuai4 +坦露=tan3,lu4 +坧=zhi3 +坨=tuo2 +坩=gan1 +坪=ping2 +坫=dian4 +坬=gua4 +坭=ni2 +坮=tai2 +坯=pi1 +坯子=pi1,zi5 +坰=jiong1 +坱=yang3 
+坲=fo2 +坳=ao4 +坴=lu4 +坵=qiu1 +坶=mu4,mu3 +坷=ke3,ke1 +坷垃=ke1,la1 +坸=gou4 +坹=xue4 +坺=fa2 +坻=di3,chi2 +坼=che4 +坽=ling2 +坾=zhu4 +坿=fu4 +垀=hu1 +垁=zhi4 +垂=chui2 +垂头搨翼=chui2,tou2,da2,yi4 +垂首帖耳=chui2,shou3,tie1,er3 +垃=la1 +垄=long3 +垅=long3 +垆=lu2 +垇=ao4 +垈=dai4 +垉=pao2 +垊=min2 +型=xing2 +垌=dong4,tong2 +垍=ji4 +垎=he4 +垏=lv4 +垐=ci2 +垑=chi3 +垒=lei3 +垓=gai1 +垔=yin1 +垕=hou4 +垖=dui1 +垗=zhao4 +垘=fu2 +垙=guang1 +垚=yao2 +垛=duo3,duo4 +垛子=duo3,zi5 +垜=duo3,duo4 +垝=gui3 +垞=cha2 +垟=yang2 +垠=yin2 +垡=fa2 +垢=gou4 +垣=yuan2 +垤=die2 +垥=xie2 +垦=ken3 +垦种=ken3,zhong4 +垧=shang3 +垨=shou3 +垩=e4 +垪=bing4 +垫=dian4 +垫圈=dian4,juan4 +垫背=dian4,bei4 +垬=hong2 +垭=ya1 +垮=kua3 +垯=da2 +垰=ka3 +垱=dang4 +垲=kai3 +垳=hang2 +垴=nao3 +垵=an3 +垶=xing1 +垷=xian4 +垸=yuan4,huan2 +垹=bang1 +垺=pou2,fu2 +垻=ba4 +垼=yi4 +垽=yin4 +垾=han4 +垿=xu4 +埀=chui2 +埁=cen2 +埂=geng3 +埃=ai1 +埄=beng3,feng1 +埅=di4,fang2 +埆=que4,jue2 +埇=yong3 +埈=jun4 +埉=xia2,jia1 +埊=di4 +埋=mai2,man2 +埋三怨四=man2,san1,yuan4,si4 +埋天怨地=man2,tian1,yuan4,di4 +埋头苦干=mai2,tou2,ku3,gan4 +埋怨=man2,yuan4 +埋没=mai2,mo4 +埌=lang4 +埍=juan3 +城=cheng2 +城阙=cheng2,que4 +埏=yan2,shan1 +埐=qin2,jin1 +埑=zhe2 +埒=lie4 +埒才角妙=lie4,cai2,jue2,miao4 +埓=lie4 +埔=pu3,bu4 +埕=cheng2 +埖=hua1 +埗=bu4 +埘=shi2 +埙=xun1 +埙篪相和=xun1,chi2,xiang1,he4 +埚=guo1 +埛=jiong1 +埜=ye3 +埝=nian4 +埞=di1 +域=yu4 +埠=bu4 +埠头=bu4,tou5 +埡=ya4 +埢=quan2 +埣=sui4,su4 +埤=pi2,pi4 +埥=qing1,zheng1 +埦=wan3,wan1 +埧=ju4 +埨=lun3 +埩=zheng1,cheng2 +埪=kong1 +埫=chong3,shang3 +埬=dong1 +埭=dai4 +埮=tan2,tan4 +埯=an3 +埰=cai3,cai4 +埱=chu4,tou4 +埲=beng3 +埳=xian4,kan3 +埳井之蛙=kan3,jing3,zhi1,wa1 +埴=zhi2 +埵=duo3 +埶=yi4,shi4 +執=zhi2 +埸=yi4 +培=pei2 +基=ji1 +基数=ji1,shu4 +基调=ji1,diao4 +埻=zhun3 +埼=qi2 +埽=sao4,sao3 +埾=ju4 +埿=ni2 +堀=ku1 +堁=ke4 +堂=tang2 +堂皇冠冕=tang2,huang2,guan4,mian3 +堃=kun1 +堄=ni4 +堅=jian1 +堆=dui1 +堆案盈几=dui1,an4,ying2,ji1 +堇=jin1 +堇菜=jin3,cai4 +堈=gang1 +堉=yu4 +堊=e4 +堋=peng2,beng4 +堌=gu4 +堍=tu4 +堎=leng4 +堏=fang1 +堐=ya2 +堑=qian4,jian4 +堒=kun1 +堓=an4 +堔=shen1 +堕=duo4,hui1 +堖=nao3 +堗=tu1 +堘=cheng2 +堙=yin1 +堙没=yin1,mo4 +堚=huan2 +堛=bi4 +堜=lian4 +堝=guo1 +堞=die2 
+堟=zhuan4 +堠=hou4 +堡=bao3,bu3,pu4 +堡垒=bao3,lei3 +堡子=bu3,zi5 +堢=bao3 +堣=yu2 +堤=di1 +堥=mao2,mou2,wu3 +堦=jie1 +堧=ruan2 +堨=e4,ai4,ye4 +堩=geng4 +堪=kan1 +堫=zong1 +堬=yu2 +堭=huang2 +堮=e4 +堯=yao2 +堰=yan4 +堰塞湖=yan4,se4,hu2 +報=bao4 +堲=ji2 +堳=mei2 +場=chang3,chang2 +堵=du3 +堵塞=du3,se4 +堶=tuo2 +堷=yin4 +堸=feng2 +堹=zhong4 +堺=jie4 +堻=jin1 +堼=feng1 +堽=gang1 +堾=chuan3 +堿=jian3 +塀=ping2 +塁=lei3 +塂=jiang3 +塃=huang1 +塄=leng2 +塅=duan4 +塆=wan1 +塇=xuan1 +塈=ji4 +塉=ji2 +塊=kuai4 +塋=ying2 +塌=ta1 +塌鼻子=ta1,bi2,zi5 +塍=cheng2 +塎=yong3 +塏=kai3 +塐=su4 +塑=su4 +塑料炸弹=su4,liao4,zha4,dan4 +塒=shi2 +塓=mi4 +塔=ta3 +塔什干=ta3,shi2,gan4 +塕=weng3 +塖=cheng2 +塗=tu2 +塘=tang2 +塙=que4 +塚=zhong3 +塛=li4 +塜=peng2 +塝=bang4 +塞=sai1,sai4,se4 +塞北=sai4,bei3 +塞北江南=sai1,bei3,jiang1,nan2 +塞外=sai4,wai4 +塞子=sai1,zi5 +塞翁之马=sai4,weng1,zhi1,ma3 +塞翁失马=sai4,weng1,shi1,ma3 +塞翁得马=sai4,weng1,de2,ma3 +塞责=se4,ze2 +塞车=sai1,che1 +塞门=sai4,men2 +塞音=se4,yin1 +塟=zang4 +塠=dui1 +塡=tian2 +塢=wu4 +塣=zheng4 +塤=xun1 +塥=ge2 +塦=zhen4 +塧=ai4 +塨=gong1 +塩=yan2 +塪=xian4 +填=tian2,zhen4 +填空=tian2,kong4 +塬=yuan2 +塭=wen1 +塮=xie4 +塯=liu4 +塰=hai3 +塱=lang3 +塲=chang2,chang3 +塳=peng2 +塴=beng4 +塵=chen2 +塶=lu4 +塷=lu3 +塸=ou1,qiu1 +塹=qian4 +塺=mei2 +塻=mo4 +塼=zhuan1,tuan2 +塽=shuang3 +塾=shu2 +塿=lou3 +墀=chi2 +墁=man4 +墂=biao1 +境=jing4 +墄=qi1 +墅=shu4 +墆=zhi4,di4 +墇=zhang4 +墈=kan4 +墉=yong1 +墊=dian4 +墋=chen3 +墌=zhi3,zhuo2 +墍=xi4 +墎=guo1 +墏=qiang3 +墐=jin4 +墑=di4 +墒=shang1 +墓=mu4 +墔=cui1 +墕=yan4 +墖=ta3 +増=zeng1 +墘=qian2 +墙=qiang2 +墙缝=qiang2,feng4 +墚=liang2 +墛=wei4 +墜=zhui4 +墝=qiao1 +增=zeng1 +增长=zeng1,zhang3 +墟=xu1 +墠=shan4 +墡=shan4 +墢=fa2 +墣=pu2 +墤=kuai4,tui2 +墥=tuan3,dong3 +墦=fan2 +墧=qiao2,que4 +墨=mo4 +墨斗=mo4,dou3 +墨斗鱼=mo4,dou3,yu2 +墨晕=mo4,yun4 +墩=dun1 +墪=dun1 +墫=zun1,dun1 +墬=di4 +墭=sheng4 +墮=duo4,hui1 +墯=duo4 +墰=tan2 +墱=deng4 +墲=wu2 +墳=fen2 +墴=huang2 +墵=tan2 +墶=da1 +墷=ye4 +墸=zhu4 +墹=jian4 +墺=ao4 +墻=qiang2 +墼=ji1 +墽=qiao1,ao2 +墾=ken3 +墿=yi4,tu2 +壀=pi2 +壁=bi4 +壂=dian4 +壃=jiang1 +壄=ye3 +壅=yong1 +壅塞=yong1,se4 +壆=xue2,bo2,jue2 +壇=tan2 +壈=lan3 +壉=ju4 +壊=huai4 +壋=dang4 +壌=rang3 +壍=qian4 
+壎=xun1 +壏=xian4,lan4 +壐=xi3 +壑=he4 +壒=ai4 +壓=ya1,ya4 +壔=dao3 +壕=hao2 +壖=ruan2 +壗=jin4 +壘=lei3 +壙=kuang4 +壚=lu2 +壛=yan2 +壜=tan2 +壝=wei2 +壞=huai4 +壟=long3 +壠=long3 +壡=rui3 +壢=li4 +壣=lin2 +壤=rang3 +壥=chan2 +壦=xun1 +壧=yan2 +壨=lei3 +壩=ba4 +壪=wan1 +士=shi4 +士大夫=shi4,da4,fu1 +壬=ren2 +壭=san5 +壮=zhuang4 +壯=zhuang4 +声=sheng1 +声乐=sheng1,yue4 +声应气求=sheng1,ying4,qi4,qiu2 +声求气应=sheng1,qiu2,qi4,ying4 +声调=sheng1,diao4 +壱=yi1 +売=mai4 +壳=ke2,qiao4 +壴=zhu4 +壵=zhuang4 +壶=hu2 +壷=hu2 +壸=kun3 +壹=yi1 +壺=hu2 +壻=xu4 +壼=kun3 +壽=shou4 +壾=mang3 +壿=cun2 +夀=shou4 +夁=yi1 +夂=zhi3,zhong1 +夃=gu3,ying2 +处=chu4,chu3 +处世=chu3,shi4 +处之泰然=chu3,zhi1,tai4,ran2 +处事=chu3,shi4 +处于=chu3,yu2 +处于昏睡状态=chu3,yu2,hun1,shui4,zhuang4,tai4 +处于昏迷状态=chu3,yu2,hun1,mi2,zhuang4,tai4 +处决=chu3,jue2 +处分=chu3,fen4 +处刑=chu3,xing2 +处在瓶颈=chu3,zai4,ping2,jing3 +处堂燕雀=chu3,tang2,yan4,que4 +处堂燕鹊=chu3,tang2,yan4,que4 +处境=chu3,jing4 +处士=chu3,shi4 +处处=chu3,chu4 +处女=chu3,nv3 +处子=chu3,zi3 +处实效功=chu3,shi2,xiao4,gong1 +处尊居显=chu3,zun1,ju1,xian3 +处心积虑=chu3,xin1,ji1,lv4 +处所=chu4,suo3 +处方=chu3,fang1 +处暑=chu3,shu3 +处死=chu3,si3 +处治=chu3,zhi4 +处理=chu3,li3 +处理失当=chu3,li3,shi1,dang4 +处理得当=chu3,li3,de2,dang4 +处罚=chu3,fa2 +处置=chu3,zhi4 +处身=chu3,shen1 +处长=chu4,zhang3 +处高临深=chu3,gao1,lin2,shen1 +夅=jiang4,xiang2 +夆=feng2,feng1,pang2 +备=bei4 +备不住=bei4,bu2,zhu4 +备位充数=bei4,wei4,chong1,shu4 +备查=bei4,cha2 +夈=zhai1 +変=bian4 +夊=sui1 +夋=qun1 +夌=ling2 +复=fu4 +复兴=fu4,xing1 +复名数=fu4,ming2,shu4 +复数=fu4,shu4 +复查=fu4,zha1 +复辟=fu4,bi4 +夎=cuo4 +夏=xia4 +夏种=xia4,zhong4 +夏虫朝菌=xia4,chong2,zhao1,jun1 +夐=xiong4,xuan4 +夑=xie4 +夒=nao2 +夓=xia4 +夔=kui2 +夕=xi1 +夕寐宵兴=xi1,mei4,xiao1,xing1 +夕惕朝乾=xi1,ti4,zhao1,qian2 +外=wai4 +外传=wai4,zhuan4 +外合里应=wai4,he2,li3,ying4 +外场=wai4,chang2 +外壳=wai4,qiao4 +外头=wai4,tou5 +外强=wai4,jiang1 +外强中干=wai4,qiang2,zhong1,gan1 +外强中瘠=wai4,qiang2,zhong1,ji2 +外相=wai4,xiang4 +外行=wai4,hang2 +外行人=wai4,hang2,ren2 +外调=wai4,diao4 +外长=wai4,zhang3 +外露=wai4,lu4 +夗=yuan4,wan3,wan1,yuan1 +夘=mao3,wan3 +夙=su4 +夙兴夜处=su4,xing1,ye4,chu3 +夙兴夜寐=su4,xing1,ye4,mei4 
+夙兴昧旦=su4,xing1,mei4,dan4 +多=duo1 +多一事不如少一事=duo1,yi1,shi4,bu4,ru2,shao3,yi1,shi4 +多一点=duo1,yi4,dian3 +多会儿=duo1,hui4,er5 +多劳多得=duo1,lao2,duo1,de2 +多口相声=duo1,kou3,xiang4,sheng1 +多咱=duo1,za2 +多数=duo1,shu4 +多文为富=duo1,wen2,wei2,fu4 +多普勒效应=duo1,pu3,le4,xiao4,ying4 +多灾多难=duo1,zai1,duo1,nan4 +多端寡要=duo1,duan1,gua3,yao4 +多行不义必自毙=duo1,xing2,bu4,yi4,bi4,zi4,bi4 +多言数穷=duo1,yan2,shuo4,qiong2 +多财善贾=duo1,cai2,shan4,gu3 +多钱善贾=duo1,qian2,shan4,gu3 +多难兴邦=duo1,nan4,xing1,bang1 +夛=duo1 +夜=ye4 +夜禁=ye4,jin4 +夜静更深=ye4,jing4,geng1,shen1 +夜静更阑=ye4,jing4,geng1,lan2 +夝=qing2 +够=gou4 +够呛=gou4,qiang4 +够戗=gou4,qiang4 +夠=gou4 +夡=qi4 +夢=meng4 +夣=meng4 +夤=yin2 +夥=huo3 +夦=chen3 +大=da4,dai4,tai4 +大个子=da4,ge4,zi5 +大事不妙=da4,shi4,bu2,miao4 +大人先生=da4,ren2,xian1,sheng1 +大伯=da4,bo2 +大伯子=da4,bai3,zi3 +大兴=da4,xing1 +大兴土木=da4,xing1,tu3,mu4 +大兴安岭=da4,xing1,an1,ling3 +大吃一惊=da4,chi1,yi1,jing1 +大吃大喝=da4,chi1,da4,he1 +大喜若狂=da4,xi3,ruo4,kuang2 +大嚼=da4,jue2 +大城=dai4,cheng2 +大埔=da4,bu4 +大堡礁=da4,pu4,jiao1 +大多数=da4,duo1,shu4 +大大咧咧=da4,da4,lie1,lie1 +大夫=dai4,fu1 +大子=tai4,zi3 +大家伙=da4,jia1,huo5 +大将=da4,jiang4 +大师傅=da4,shi1,fu1 +大帽子=da4,mao4,zi5 +大幅增长=da4,fu2,zeng1,zhang3 +大干=da4,gan4 +大换血=da4,huan4,xie3 +大数=da4,shu4 +大显神通=da4,xian3,shen2,tong1 +大曲=da4,qu1 +大有可为=da4,you3,ke3,wei2 +大模大样=da4,mu2,da4,yang4 +大步流星=da4,bu4,liu2,xing1 +大汗=da4,han2 +大溜=da4,liu4 +大煞风趣=da4,sha4,feng1,qu4 +大率=da4,shuai4 +大王=dai4,wang2 +大璞不完=tai4,pu2,bu4,wan2 +大缪不然=da4,miu4,bu4,ran2 +大老爷们儿=da4,lao3,ye2,men5,er2 +大而无当=da4,er2,wu2,dang4 +大肚子=da4,du3,zi5 +大肠杆菌=da4,chang2,gan3,jun1 +大腹便便=da4,fu4,pian2,pian2 +大藏=da4,zang4 +大藏经=da4,zang4,jing1 +大行大市=da4,hang2,da4,shi4 +大行星=da4,xing2,xing1 +大调=da4,diao4 +大轴子=da4,zhou4,zi3 +大部分=da4,bu4,fen4 +大都=da4,du1 +大难=da4,nan4 +大难不死=da4,nan4,bu4,si3 +大难临头=da4,nan4,lin2,tou2 +大雅=da4,ya3 +大雅之堂=da4,ya3,zhi1,tang2 +大雅君子=da4,ya3,jun1,zi3 +大雨滂沱=da4,yu3,pang2,tuo2 +大颊鼠=da4,jia2,shu3 +大黄=da4,huang2 +夨=ce4 +天=tian1 +天下一家=tian1,xia4,yi1,jia1 +天下为公=tian1,xia4,wei2,gong1 +天下为家=tian1,xia4,wei2,jia1 +天下为笼=tian1,xia4,wei2,long2 
+天兵天将=tian1,bing1,tian1,jiang4 +天冠地屦=tian1,guan1,di4,ju4 +天分=tian1,fen4 +天华乱坠=tian1,hua1,luan4,zhui4 +天台=tian1,tai1 +天台路迷=tian1,tai2,lu4,mi2 +天姥山=tian1,mu3,shan1 +天宝当年=tian1,bao3,dang1,nian2 +天差地远=tian1,cha1,di4,yuan3 +天年不遂=tian1,nian2,bu4,sui4 +天数=tian1,shu4 +天旋地转=tian1,xuan2,di4,zhuan4 +天晓得=tian1,xiao3,de5 +天王老子=tian1,wang2,lao3,zi3 +天生一对=tian1,sheng1,yi1,dui4 +天衣无缝=tian1,yi1,wu2,feng4 +天覆地载=tian1,fu4,di4,zai3 +天道好还=tian1,dao4,hao3,huan2 +天阙=tian1,que4 +太=tai4 +太冲=tai4,chong4 +太山北斗=tai4,shan1,bei3,dou3 +太监=tai4,jian4 +太行山=tai4,hang2,shan1 +太过分=tai4,guo4,fen4 +太阿=tai4,e1 +太阿倒持=tai4,e1,dao4,chi2 +太阿在握=tai4,e1,zai4,wo4 +夫=fu1,fu2 +夬=guai4 +夭=yao1 +央=yang1 +央行=yang1,hang2 +夯=hang1,ben4 +夰=gao3 +失=shi1 +失当=shi1,dang4 +失着=shi1,zhao1 +失禁=shi1,jin4 +失而复得=shi1,er2,fu4,de2 +失血=shi1,xue4 +失马塞翁=shi1,ma3,sai4,weng1 +失魂落魄=shi1,hun2,luo4,po4 +夲=tao1,ben3 +夳=tai4 +头=tou2,tou5 +头上著头=tou2,shang4,zhuo2,tou2 +头会箕赋=tou2,kuai4,ji1,fu4 +头儿=tou5,er5 +头出头没=tou2,chu1,tou2,mo4 +头发=tou2,fa4 +头子=tou2,zi5 +头昏眼晕=tou2,hun1,yan3,yun1 +头晕=tou2,yun1 +头没杯案=tou2,mo4,bei1,an4 +头破血出=tou2,po4,xue4,chu1 +头破血淋=tou2,po4,xue4,lin2 +头童齿豁=tou2,tong2,chi3,huo4 +头足异处=tou2,zu2,yi4,chu3 +夵=yan3,tao1 +夶=bi3 +夷=yi2 +夷为平地=yi2,wei2,ping2,di4 +夸=kua1,kua4 +夹=jia2,jia1,ga1 +夹七夹八=jia1,qi1,jia1,ba1 +夹克=jia1,ke4 +夹具=jia1,ju4 +夹击=jia1,ji1 +夹剪=jia1,jian3 +夹墙=jia1,qiang2 +夹子=jia1,zi5 +夹层=jia1,ceng2 +夹层玻璃=jia1,ceng2,bo1,li5 +夹峙=jia1,zhi4 +夹带=jia1,dai4 +夹心=jia1,xin1 +夹批=jia1,pi1 +夹持=jia1,chi2 +夹攻=jia1,gong1 +夹杂=jia1,za2 +夹板=jia1,ban3 +夹板医驼子=jia1,ban3,yi1,tuo2,zi3 +夹板气=jia1,ban3,qi4 +夹枪带棍=jia1,qiang1,dai4,gun4 +夹枪带棒=jia1,qiang1,dai4,bang4 +夹棍=jia1,gun4 +夹注=jia1,zhu4 +夹生=jia1,sheng1 +夹生饭=jia1,sheng1,fan4 +夹竹桃=jia1,zhu2,tao2 +夹紧=jia1,jin3 +夹缝=jia1,feng4 +夹袄=jia2,ao3 +夹袋人物=jia1,dai4,ren2,wu4 +夹裤=jia2,ku4 +夹角=jia1,jiao3 +夹道=jia1,dao4 +夹钳=jia1,qian2 +夹馅=jia1,xian4 +夺=duo2 +夺人所好=duo2,ren2,suo3,hao4 +夺冠=duo2,guan4 +夺得=duo2,de2 +夻=hua4 +夼=kuang3 +夽=yun3 +夾=jia2,jia1,ga1 +夿=ba1 +奀=en1 +奁=lian2 +奂=huan4 +奃=di1,ti4 +奄=yan3,yan1 
+奄奄一息=yan3,yan3,yi1,xi1 +奅=pao4 +奆=juan4 +奇=qi2,ji1 +奇偶=ji1,ou3 +奇函数=ji1,han2,shu4 +奇数=ji1,shu4 +奇零=ji1,ling2 +奈=nai4 +奉=feng4 +奉为圭臬=feng4,wei2,gui1,nie4 +奉为楷模=feng4,wei2,kai3,mo2 +奉为至宝=feng4,wei2,zhi4,bao3 +奉公不阿=feng4,gong1,bu4,e1 +奉还=feng4,huan2 +奊=xie2 +奋=fen4 +奋发有为=fen4,fa1,you3,wei2 +奌=dian3 +奍=quan1,juan4 +奎=kui2 +奏=zou4 +奏乐=zou4,yue4 +奐=huan4 +契=qi4,qie4,xie4 +奒=kai1 +奓=she1,chi3,zha4 +奔=ben1,ben4 +奔丧=ben1,sang1 +奔命=ben4,ming4 +奔头=ben4,tou2 +奔头儿=ben4,tou5,er5 +奕=yi4 +奖=jiang3 +套=tao4 +套数=tao4,shu4 +套种=tao4,zhong4 +奘=zang4,zhuang3 +奙=ben3 +奚=xi1 +奛=huang3 +奜=fei3 +奝=diao1 +奞=xun4,zhui4 +奟=beng1 +奠=dian4 +奠都=dian4,du1 +奡=ao4 +奢=she1 +奢侈浪费=she1,chi2,lang4,fei4 +奢靡=she1,mi2 +奣=weng3 +奤=po4,ha3,tai3 +奥=ao4,yu4 +奦=wu4 +奧=ao4,yu4 +奨=jiang3 +奩=lian2 +奪=duo2 +奫=yun1 +奬=jiang3 +奭=shi4 +奮=fen4 +奯=huo4 +奰=bi4 +奱=luan2 +奲=duo3,che3 +女=nv3,ru3 +女仆=nv3,pu2 +女佣人=nv3,yong1,ren2 +女大不中留=nv3,da4,bu4,zhong4,liu2 +女大难留=nv3,da4,nan2,liu2 +女孩儿=nv3,hai2,er5 +女孩子=nv3,hai2,zi5 +女将=nv3,jiang4 +女强人=nv3,qiang3,ren2 +女红=nv3,gong1 +女长当嫁=nv3,zhang3,dang1,jia4 +女长须嫁=nv3,zhang3,xu1,jia4 +奴=nu2 +奴仆=nu2,pu2 +奵=ding3,ding1,tian3 +奶=nai3 +奷=qian1 +奸=jian1 +她=ta1,jie3 +她们=ta1,men5 +奺=jiu3 +奻=nuan2 +奼=cha4 +好=hao3,hao4 +好丹非素=hao4,dan1,fei1,su4 +好为事端=hao4,wei2,shi4,duan1 +好为人师=hao4,wei2,ren2,shi1 +好事=hao4,shi4 +好事之徒=hao4,shi4,zhi1,tu2 +好事多悭=hao3,shi4,duo1,qian1 +好事多磨=hao3,shi4,duo1,mo2 +好事天悭=hao3,shi4,tian1,qian1 +好事者=hao4,shi4,zhe3 +好佚恶劳=hao3,yi4,wu4,lao2 +好兆头=hao3,zhao4,tou5 +好动=hao4,dong4 +好勇斗狠=hao4,yong3,dou4,hen3 +好古=hao4,gu3 +好吃=hao4,chi1 +好吃好喝=hao4,chi1,hao4,he1 +好吃懒做=hao4,chi1,lan3,zuo4 +好吧=hao3,ba5 +好吹牛=hao4,chui1,niu2 +好善乐施=hao4,shan4,le4,shi1 +好善恶恶=hao3,shan4,wu4,e4 +好处=hao3,chu4 +好大喜功=hao4,da4,xi3,gong1 +好奇=hao4,qi2 +好奇尚异=hao3,qi2,shang4,yi4 +好好学习=hao3,hao3,xue2,xi2 +好学=hao4,xue2 +好客=hao4,ke4 +好家伙=hao3,jia1,huo5 +好尚=hao4,shang4 +好强=hao4,qiang2 +好得很=hao3,de5,hen3 +好恶=hao4,wu4 +好恶不同=hao3,e4,bu4,tong2 +好战=hao4,zhan4 +好战分子=hao4,zhan4,fen4,zi5 +好整以暇=hao4,zheng3,yi3,xia2 +好斗=hao4,dou4 
+好生之德=hao4,sheng1,zhi1,de2 +好管闲事=hao4,guan3,xian2,shi4 +好累=hao3,lei4 +好胜=hao4,sheng4 +好自为之=hao4,zi4,wei2,zhi1 +好色=hao4,se4 +好行小惠=hao4,xing2,xiao3,hui4 +好读书=hao4,du2,shu1 +好谋善断=hao4,mou2,shan4,duan4 +好逸恶劳=hao4,yi4,wu4,lao2 +好酒贪杯=hao4,jiu3,tan1,bei1 +好问决疑=hao4,wen4,jue2,yi2 +好问则裕=hao4,wen4,ze2,yu4 +好骑者堕=hao4,qi2,zhe3,duo4 +好高务远=hao4,gao1,wu4,yuan3 +好高骛远=hao4,gao1,wu4,yuan3 +奾=xian1 +奿=fan4 +妀=ji3 +妁=shuo4 +如=ru2 +如不胜衣=ru2,bu4,sheng4,yi1 +如履薄冰=ru2,lv3,bo2,bing1 +如应斯响=ru2,ying4,si1,xiang3 +如数=ru2,shu4 +如数家珍=ru2,shu3,jia1,zhen1 +如水投石=ru2,shui3,tou2,shi2 +如法炮制=ru2,fa3,pao2,zhi4 +如登春台=ru2,deng1,chun1,tai2 +如芒刺背=ru2,mang2,ci4,bei4 +如芒在背=ru2,mang2,zai4,bei4 +妃=fei1,pei4 +妃嫔=fei1,bin1 +妃子=fei1,zi5 +妄=wang4 +妄为=wang4,wei2 +妄自菲薄=wang4,zi4,fei3,bo2 +妅=hong2 +妆=zhuang1 +妇=fu4 +妈=ma1 +妉=dan1 +妊=ren4 +妋=fu1,you1 +妌=jing4 +妍=yan2 +妍蚩好恶=yan2,chi1,hao3,e4 +妎=hai4,jie4 +妏=wen4 +妐=zhong1 +妑=pa1 +妒=du4 +妓=ji4 +妔=keng1,hang2 +妕=zhong4 +妖=yao1 +妖不胜德=yao1,bu4,sheng4,de2 +妖由人兴=yao1,you2,ren2,xing1 +妗=jin4 +妗子=jin4,zi5 +妘=yun2 +妙=miao4 +妙处=miao4,chu4 +妙着=miao4,zhao1 +妙龄女子=miao4,ling2,nv3,zi5 +妚=fou3,pei1,pi1 +妛=chi1 +妜=yue4,jue2 +妝=zhuang1 +妞=niu1 +妞儿=niu1,er5 +妞子=niu1,zi5 +妟=yan4 +妠=na4,nan4 +妡=xin1 +妢=fen2 +妣=bi3 +妤=yu2 +妥=tuo3 +妥帖=tuo3,tie1 +妥当=tuo3,dang4 +妥首帖耳=tuo3,shou3,tie1,er3 +妦=feng1 +妧=wan4,yuan2 +妨=fang2 +妩=wu3 +妪=yu4 +妫=gui1 +妬=du4 +妭=ba2 +妮=ni1 +妯=zhou2 +妯娌=zhou2,li5 +妰=zhuo2 +妱=zhao1 +妲=da2 +妳=ni3,nai3 +妴=yuan4 +妵=tou3 +妶=xian2,xuan2,xu4 +妷=zhi2,yi4 +妸=e1 +妹=mei4 +妺=mo4 +妻=qi1,qi4 +妻儿老少=qi1,er2,lao3,shao3 +妻子=qi1,zi5 +妻梅子鹤=qi1,mei2,zi3,he4 +妼=bi4 +妽=shen1 +妾=qie4 +妿=e1 +姀=he2 +姁=xu3,xu1 +姂=fa2 +姃=zheng1 +姄=min2 +姅=ban4 +姆=mu3 +姇=fu1,fu2 +姈=ling2 +姉=zi3 +姊=zi3 +始=shi3 +始终不懈=shi3,zhong1,bu4,xie4 +姌=ran3 +姍=shan1,shan4 +姎=yang1 +姏=man2 +姐=jie3 +姑=gu1 +姑射神人=gu1,ye4,shen2,ren2 +姒=si4 +姓=xing4 +姓仇=xing4,qiu2 +姓任=xing4,ren2 +委=wei3,wei1 +委委佗佗=wei1,wei1,tuo2,tuo2 +委曲=wei3,qu1 +委曲成全=wei3,qu1,cheng2,quan2 +委曲求全=wei3,qu1,qiu2,quan2 +委肉虎蹊=wei3,rou4,hu3,xi1 +委蛇=wei1,yi2 +委靡=wei3,mi3 +姕=zi1 
+姖=ju4 +姗=shan1,shan4 +姘=pin1 +姘头=pin1,tou5 +姙=ren4 +姚=yao2 +姛=dong4 +姜=jiang1 +姝=shu1 +姞=ji2 +姟=gai1 +姠=xiang4 +姡=hua2,huo2 +姢=juan1 +姣=jiao1,xiao2 +姣好=jiao3,hao3 +姤=gou4,du4 +姥=lao3,mu3 +姥姥=lao3,lao4 +姦=jian1 +姧=jian1 +姨=yi2 +姨姥姥=yi2,lao3,lao4 +姩=nian2,nian4 +姪=zhi2 +姫=zhen3 +姬=ji1 +姭=xian4 +姮=heng2 +姯=guang1 +姰=xun2,jun1 +姱=kua1,hu4 +姲=yan4 +姳=ming3 +姴=lie4 +姵=pei4 +姶=e4,ya4 +姷=you4 +姸=yan2 +姹=cha4 +姺=shen1,xian1 +姻=yin1 +姼=shi2 +姽=gui3 +姾=quan2 +姿=zi1 +姿意妄为=zi1,yi4,wang4,wei2 +娀=song1 +威=wei1 +威吓=wei1,he4 +娂=hong2 +娃=wa2 +娃娃亲=wa2,wa5,qin1 +娄=lou2 +娅=ya4 +娆=rao2,rao3 +娇=jiao1 +娈=luan2 +娉=ping1 +娉婷婀娜=ping1,ting2,e1,na4 +娊=xian4 +娋=shao4,shao1 +娌=li3 +娍=cheng2,sheng4 +娎=xie1 +娏=mang2 +娐=fu1 +娑=suo1 +娒=wu3,mu3 +娓=wei3 +娔=ke4 +娕=chuo4,lai4 +娖=chuo4 +娗=ting3 +娘=niang2 +娘子=niang2,zi5 +娙=xing2 +娚=nan2 +娛=yu2 +娜=nuo2,na4 +娝=pou1,bi3 +娞=nei3,sui1 +娟=juan1 +娠=shen1 +娡=zhi4 +娢=han2 +娣=di4 +娤=zhuang1 +娥=e2 +娦=pin2 +娧=tui4 +娨=man3 +娩=mian3 +娪=wu2,wu4,yu2 +娫=yan2 +娬=wu3 +娭=xi1,ai1 +娮=yan2 +娯=yu2 +娰=si4 +娱=yu2 +娲=wa1 +娳=li4 +娴=xian2 +娵=ju1 +娶=qu3 +娶媳妇儿=qu3,xi2,fu5,er5 +娷=zhui4,shui4 +娸=qi1 +娹=xian2 +娺=zhuo2 +娻=dong1,dong4 +娼=chang1 +娽=lu4 +娾=ai3,ai2,e4 +娿=e1,e3 +婀=e1 +婀娜=e1,nuo2 +婁=lou2 +婂=mian2 +婃=cong2 +婄=pei2,pou3,bu4 +婅=ju2 +婆=po2 +婆婆=po2,po5 +婆婆妈妈=po2,po5,ma1,ma1 +婇=cai3 +婈=ling2 +婉=wan3 +婊=biao3 +婊子=biao3,zi5 +婋=xiao1 +婌=shu1 +婍=qi3 +婎=hui1 +婏=fu4,fan4 +婐=wo3 +婑=wo3 +婒=tan2 +婓=fei1 +婔=fei1 +婕=jie2 +婖=tian1 +婗=ni2,ni3 +婘=juan4,quan2 +婙=jing4 +婚=hun1 +婚丧喜庆=hun1,sang1,xi3,qing4 +婚假=hun1,jia4 +婚龄=hun1,ling2 +婛=jing1 +婜=qian1,jin3 +婝=dian4 +婞=xing4 +婟=hu4 +婠=wan1,wa4 +婡=lai2,lai4 +婢=bi4 +婣=yin1 +婤=zhou1,chou1 +婥=chuo4,nao4 +婦=fu4 +婧=jing4 +婨=lun2 +婩=nve4 +婪=lan2 +婫=hun4,kun1 +婬=yin2 +婭=ya4 +婮=ju1 +婯=li4 +婰=dian3 +婱=xian2 +婲=hua1 +婳=hua4 +婴=ying1 +婵=chan2 +婵媛=chan2,yuan2 +婶=shen3 +婷=ting2 +婸=dang4,yang2 +婹=yao3 +婺=wu4 +婻=nan4 +婼=ruo4,chuo4 +婽=jia3 +婾=tou1,yu2 +婿=xu4 +媀=yu2,yu4 +媁=wei2,wei3 +媂=di4,ti2 +媃=rou2 +媄=mei3 +媅=dan1 +媆=ruan3,nen4 +媇=qin1 +媈=hui1 +媉=wo4 +媊=qian2 
+媋=chun1 +媌=miao2 +媍=fu4 +媎=jie3 +媏=duan1 +媐=yi2,pei4 +媑=zhong4 +媒=mei2 +媓=huang2 +媔=mian2,mian3 +媕=an1 +媖=ying1 +媗=xuan1 +媘=jie1 +媙=wei1 +媚=mei4 +媛=yuan4,yuan2 +媜=zheng1 +媝=qiu1 +媞=ti2 +媟=xie4 +媠=tuo2,duo4 +媡=lian4 +媢=mao4 +媣=ran3 +媤=si1 +媥=pian1 +媦=wei4 +媧=wa1 +媨=cu4 +媩=hu2 +媪=ao3 +媫=jie2 +媬=bao3 +媭=xu1 +媮=tou1,yu2 +媯=gui1 +媰=chu2,zou4 +媱=yao2 +媲=pi4 +媳=xi2 +媳妇儿=xi2,fu5,er5 +媴=yuan2 +媵=ying4 +媶=rong2 +媷=ru4 +媸=chi1 +媹=liu2 +媺=mei3 +媻=pan2 +媼=ao3 +媽=ma1 +媾=gou4 +媿=kui4 +嫀=qin2,shen1 +嫁=jia4 +嫂=sao3 +嫂子=sao3,zi5 +嫃=zhen1,zhen3 +嫄=yuan2 +嫅=jie1,suo3 +嫆=rong2 +嫇=ming2,ming3 +嫈=ying1 +嫉=ji2 +嫊=su4 +嫋=niao3 +嫌=xian2 +嫌恶=xian2,wu4 +嫍=tao1 +嫎=pang2 +嫏=lang2 +嫐=nao3 +嫑=biao2 +嫒=ai4 +嫓=pi4 +嫔=pin2 +嫕=yi4 +嫖=piao2,piao1 +嫗=yu4 +嫘=lei2 +嫙=xuan2 +嫚=man4 +嫚子=man1,zi5 +嫛=yi1 +嫜=zhang1 +嫝=kang1 +嫞=yong1 +嫟=ni4 +嫠=li2 +嫡=di2 +嫡子=di2,zi3 +嫢=gui1 +嫣=yan1 +嫣然一笑=yan1,ran2,yi1,xiao4 +嫤=jin3,jin4 +嫥=zhuan1 +嫦=chang2 +嫧=ze2 +嫨=han1,nan3 +嫩=nen4 +嫪=lao4 +嫫=mo2 +嫬=zhe1 +嫭=hu4 +嫮=hu4 +嫯=ao4 +嫰=nen4 +嫱=qiang2 +嫲=ma1,ma2 +嫳=pie4 +嫴=gu1 +嫵=wu3 +嫶=qiao2 +嫷=tuo3 +嫸=zhan3 +嫹=miao2 +嫺=xian2 +嫻=xian2 +嫼=mo4 +嫽=liao2 +嫾=lian2 +嫿=hua4 +嬀=gui1 +嬁=deng1 +嬂=zhi2 +嬃=xu1 +嬄=yi1 +嬅=hua4 +嬆=xi1 +嬇=kui4 +嬈=rao2,rao3 +嬉=xi1 +嬊=yan4 +嬋=chan2 +嬌=jiao1 +嬍=mei3 +嬎=fan4 +嬏=fan1 +嬐=xian1,yan3,jin4 +嬑=yi4 +嬒=hui4 +嬓=jiao4 +嬔=fu4 +嬕=shi4 +嬖=bi4 +嬗=shan4 +嬘=sui4 +嬙=qiang2 +嬚=lian3 +嬛=huan2,xuan1,qiong2 +嬜=xin1 +嬝=niao3 +嬞=dong3 +嬟=yi3 +嬠=can1 +嬡=ai4 +嬢=niang2 +嬣=ning2 +嬤=mo2 +嬥=tiao3 +嬦=chou2 +嬧=jin4 +嬨=ci2 +嬩=yu2 +嬪=pin2 +嬫=rong2 +嬬=ru2 +嬭=nai3 +嬮=yan1,yan4 +嬯=tai2 +嬰=ying1 +嬱=qian4 +嬲=niao3 +嬳=yue4 +嬴=ying2 +嬵=mian2 +嬶=bi2 +嬷=mo2 +嬸=shen3 +嬹=xing4 +嬺=ni4 +嬻=du2 +嬼=liu3 +嬽=yuan1 +嬾=lan3 +嬿=yan4 +孀=shuang1 +孁=ling2 +孂=jiao3 +孃=niang2 +孄=lan3 +孅=xian1,qian1 +孆=ying1 +孇=shuang1 +孈=xie2,hui1 +孉=huan1,quan2 +孊=mi3 +孋=li4,li2 +孌=luan2 +孍=yan3 +孎=zhu2,chuo4 +孏=lan3 +子=zi3 +子为父隐=zi3,wei2,fu4,yin3 +子弹=zi3,dan4 +子母弹=zi3,mu3,dan4 +孑=jie2 +孒=jue2 +孓=jue2 +孔=kong3 +孕=yun4 +孕吐=yun4,tu4 +孖=zi1,ma1 +字=zi4 +字帖=zi4,tie4 
+字帖儿=zi4,tie3,er2 +字数=zi4,shu4 +字模=zi4,mu2 +字调=zi4,diao4 +字里行间=zi4,li3,hang2,jian1 +存=cun2 +存储=cun2,chu3 +存查=cun2,zha1 +存而不论=cun2,er2,bu4,lun4 +孙=sun1,xun4 +孙子=sun1,zi5 +孚=fu2 +孛=bei4 +孜=zi1 +孜孜不懈=zi1,zi1,bu4,xie4 +孝=xiao4 +孞=xin4 +孟=meng4 +孟什维克=meng4,shi2,wei2,ke4 +孠=si4 +孡=tai1 +孢=bao1 +孢子=bao1,zi5 +季=ji4 +孤=gu1 +孤单一人=gu1,dan1,yi1,ren2 +孤文只义=gu1,wen2,zhi1,yi4 +孤注一掷=gu1,zhu4,yi1,zhi4 +孤独矜寡=gu1,du2,guan1,gua3 +孤身只影=gu1,shen1,zhi1,ying3 +孥=nu2 +学=xue2 +学子=xue2,zi3 +学长=xue2,zhang3 +孧=you4,niu1 +孨=zhuan3 +孩=hai2 +孩子=hai2,zi5 +孩子气=hai2,zi5,qi4 +孪=luan2 +孫=sun1,xun4 +孬=nao1 +孭=mie1 +孮=cong2 +孯=qian1 +孰=shu2 +孱=chan2,can4 +孲=ya1 +孳=zi1 +孴=ni3 +孵=fu1 +孶=zi1 +孷=li2 +學=xue2 +孹=bo4 +孺=ru2 +孻=nai2 +孼=nie4 +孽=nie4 +孽障种子=nie4,zhang4,zhong3,zi3 +孾=ying1 +孿=luan2 +宀=mian2 +宁=ning2,ning4,zhu4 +宁可=ning4,ke3 +宁帖=ning2,tie1 +宁愿=ning4,yuan4 +宁折不弯=ning4,zhe2,bu4,wan1 +宁曲勿折=ning2,qu1,wu4,zhe2 +宁死不屈=ning4,si3,bu4,qu1 +宁缺勿滥=ning4,que1,wu4,lan4 +宁缺毋滥=ning4,que1,wu2,lan4 +宁肯=ning4,ken3 +宂=rong3 +它=ta1 +宄=gui3 +宅=zhai2 +宆=qiong2 +宇=yu3 +守=shou3 +守丧=shou3,sang1 +守分=shou3,fen4 +守分安常=shou3,fen1,an1,chang2 +守正不阿=shou3,zheng4,bu4,e1 +守阙抱残=shou3,que4,bao4,can2 +安=an1 +安全系数=an1,quan2,xi4,shu4 +安分=an1,fen4 +安分守己=an1,fen4,shou3,ji3 +安分知足=an1,fen4,zhi1,zu2 +安宁=an1,ning2 +安常处顺=an1,chang2,chu3,shun4 +安常守分=an1,chang2,shou3,fen4 +安时处顺=an1,shi2,chu3,shun4 +安步当车=an1,bu4,dang1,che1 +安老怀少=an1,lao3,huai2,shao4 +安营扎寨=an1,ying2,zha1,zhai4 +安身为乐=an1,shen1,wei2,le4 +宊=tu1,jia1 +宋=song4 +宋斤鲁削=song4,jin1,lu3,xue1 +完=wan2 +宍=rou4 +宎=yao3 +宏=hong2 +宐=yi2 +宑=jing3 +宒=zhun1 +宓=mi4,fu2 +宔=zhu3 +宕=dang4 +宖=hong2 +宗=zong1 +官=guan1 +官官相为=guan1,guan1,xiang1,wei2 +官差=guan1,chai1 +官运亨通=guan1,yun4,heng1,tong1 +官长=guan1,zhang3 +宙=zhou4 +定=ding4 +定名为=ding4,ming2,wei2 +定当=ding4,dang4 +定数=ding4,shu4 +定时炸弹=ding4,shi2,zha4,dan4 +定调=ding4,diao4 +定调子=ding4,diao4,zi5 +定都=ding4,du1 +宛=wan3,yuan1 +宛转悠扬=wan3,zhuan3,you1,yang2 +宜=yi2 +宝=bao3 +宝坻=bao3,di3 +宝应=bao3,ying4 +宝藏=bao3,zang4 +宝贝疙瘩=bao3,bei4,ge1,da1 +实=shi2 +实与有力=shi2,yu4,you3,li4 
+实偪处此=shi2,beng4,chu3,ci3 +实干=shi2,gan4 +实弹=shi2,dan4 +实数=shi2,shu4 +实相=shi2,xiang4 +实蕃有徒=shi2,fan1,you3,tu2 +实逼处此=shi2,bi1,chu3,ci3 +実=shi2 +宠=chong3 +审=shen3 +审处=shen3,chu3 +审己度人=shen3,ji3,duo2,ren2 +审干=shen3,gan4 +审度=shen3,duo2 +审时度势=shen3,shi2,duo2,shi4 +审曲面埶=shen3,qu3,mian4,xin1 +审查=shen3,zha1 +审校=shen3,jiao4 +客=ke4 +宣=xuan1 +宣传弹=xuan1,chuan2,dan4 +室=shi4 +宥=you4 +宦=huan4 +宧=yi2 +宨=tiao3 +宩=shi3 +宪=xian4 +宫=gong1 +宫禁=gong1,jin4 +宫调=gong1,diao4 +宫阙=gong1,que4 +宬=cheng2 +宭=qun2 +宮=gong1 +宯=xiao1 +宰=zai3 +宰相=zai3,xiang4 +宱=zha4 +宲=bao3,shi2 +害=hai4 +害臊=hai4,sao4 +宴=yan4 +宵=xiao1 +宵禁=xiao1,jin4 +家=jia1,jia5,jie5 +家什=jia1,shi2 +家仆=jia1,pu2 +家伙=jia1,huo5 +家当=jia1,dang4 +家无儋石=jia1,wu2,dan4,shi2 +家无担石=jia1,wu2,dan4,shi2 +家畜=jia1,chu4 +家种=jia1,zhong4 +家累=jia1,lei3 +家累千金=jia1,lei4,qian1,jin1 +家给人足=jia1,ji3,ren2,zu2 +家给户足=jia1,ji3,hu4,zu2 +家给民足=jia1,ji3,min2,zu2 +家道从容=jia1,dao4,cong1,rong2 +家长=jia1,zhang3 +家长礼短=jia1,chang2,li3,duan3 +家长里短=jia1,chang2,li3,duan3 +家雀=jia1,qiao3 +家雀儿=jia1,qiao3,er2 +宷=shen3 +宸=chen2 +容=rong2 +宺=huang1,huang3 +宻=mi4 +宼=kou4 +宽=kuan1 +宽大为怀=kuan1,da4,wei2,huai2 +宽绰=kuan1,chuo5 +宾=bin1 +宿=su4,xiu3,xiu4 +宿将=su4,jiang4 +宿水飡风=xiu3,shui3,can1,feng1 +宿水餐风=xiu3,shui3,can1,feng1 +宿舍=su4,she4 +宿雨餐风=xiu3,yu3,can1,feng1 +寀=cai3,cai4 +寁=zan3 +寂=ji4 +寃=yuan1 +寄=ji4 +寅=yin2 +密=mi4 +密发=mi4,fa4 +寇=kou4 +寈=qing1 +寉=he4 +寊=zhen1 +寋=jian4 +富=fu4 +富商巨贾=fu4,shang1,ju4,jia3 +富国彊兵=fu4,guo2,jiang1,bing1 +寍=ning2,ning4 +寎=bing3,bing4 +寏=huan2 +寐=mei4 +寑=qin3 +寒=han2 +寒伧=han2,chen5 +寒假=han2,jia4 +寒号=han2,hao4 +寒碜=han2,chen3 +寒舍=han2,she4 +寒蝉凄切=han2,chan2,qi1,qie4 +寒酸相=han2,suan1,xiang4 +寒酸落魄=han2,suan1,luo4,po4 +寒露=han2,lu4 +寒颤=han2,zhan4 +寓=yu4 +寔=shi2 +寕=ning2,ning4 +寖=qin3,jin4 +寗=ning2,ning4 +寘=zhi4 +寙=yu3 +寚=bao3 +寛=kuan1 +寜=ning2,ning4 +寝=qin3 +寝苫枕干=qin3,shan1,zhen3,gan4 +寞=mo4 +察=cha2 +察察为明=cha2,cha2,wei2,ming2 +寠=ju4,lou2 +寡=gua3 +寡不胜众=gua3,bu4,sheng4,zhong4 +寡廉鲜耻=gua3,lian2,xian3,chi3 +寡见鲜闻=gua3,jian4,xian3,wen2 +寢=qin3 +寣=hu1 +寤=wu4 +寥=liao2 
+寥寥数语=liao2,liao2,shu4,yu3 +實=shi2 +寧=ning2,ning4 +寨=zhai4 +審=shen3 +寪=wei3 +寫=xie3,xie4 +寬=kuan1 +寭=hui4 +寮=liao2 +寯=jun4 +寰=huan2 +寱=yi4 +寲=yi2 +寳=bao3 +寴=qin1,qin4 +寵=chong3 +寶=bao3 +寷=feng1 +寸=cun4 +寸利必得=cun4,li4,bi4,de2 +寸积铢累=cun4,ji1,zhu1,lei3 +寸量铢称=cun4,liang2,zhu1,cheng1 +对=dui4 +对不住=dui4,bu4,zhu4 +对事不对人=dui4,shi4,bu2,dui4,ren2 +对口相声=dui4,kou3,xiang4,sheng4 +对应=dui4,ying4 +对得住=dui4,de5,zhu4 +对得数=dui4,de2,shu4 +对得起=dui4,de5,qi3 +对数=dui4,shu4 +对着干=dui4,zhe5,gan4 +对称=dui4,chen4 +对薄公堂=dui4,bu4,gong1,tang2 +对调=dui4,diao4 +寺=si4 +寺观=si4,guan4 +寻=xun2 +寻思=xin2,si1 +寻欢作乐=xun2,huan1,zuo4,le4 +寻瑕伺隙=xun2,xia2,si4,xi4 +寻行数墨=xun2,hang2,shu3,mo4 +导=dao3 +导弹=dao3,dan4 +寽=lve4,luo2 +対=dui4 +寿=shou4 +寿数=shou4,shu4 +寿终正寝=shou4,zhong1,zheng4,qin3 +尀=po3 +封=feng1 +封妻荫子=feng1,qi1,yin4,zi3 +封禁=feng1,jin4 +封禅=feng1,shan4 +封豨修蛇=feng1,xi1,you3,she2 +専=zhuan1 +尃=fu1 +射=she4,ye4,yi4 +射干=ye4,gan4 +尅=ke4,kei1 +将=jiang1,jiang4 +将令=jiang4,ling4 +将伯之助=qiang1,bo2,zhi1,zhu4 +将伯之呼=qiang1,bo2,zhi1,hu1 +将兵=jiang4,bing1 +将功折过=jiang1,gong1,she2,guo4 +将取固予=jiang1,qu3,gu1,yu3 +将士=jiang4,shi4 +将夺固与=jiang1,duo2,gu1,yu3 +将官=jiang4,guan1 +将将=qiang1,qiang1 +将尉=jiang4,wei4 +将帅=jiang4,shuai4 +将才=jiang4,cai2 +将校=jiang4,xiao4 +将相=jiang4,xiang4 +将遇良材=jiang4,yu4,liang2,cai2 +将门=jiang4,men2 +将门无犬子=jiang4,men2,wu2,quan3,zi3 +将门有将=jiang4,men2,you3,jiang4 +将门虎子=jiang4,men2,hu3,zi3 +将领=jiang4,ling3 +將=jiang1,jiang4 +專=zhuan1 +尉=wei4,yu4 +尉犁=yu4,li2 +尉迟=yu4,chi2 +尊=zun1 +尋=xun2 +尌=shu4,zhu4 +對=dui4 +導=dao3 +小=xiao3 +小不点儿=xiao3,bu4,dian3,er5 +小事一件=xiao3,shi4,yi2,jian4 +小传=xiao3,zhuan4 +小便宜=xiao3,bian4,yi2 +小册子=xiao3,ce4,zi5 +小圈子=xiao3,quan1,zi5 +小妞儿=xiao3,niu1,er5 +小姑独处=xiao3,gu1,du2,chu3 +小孩子=xiao3,hai2,zi5 +小家伙=xiao3,jia1,huo5 +小家子气=xiao3,jia1,zi5,qi4 +小将=xiao3,jiang4 +小尕子=xiao3,ga3,zi5 +小数=xiao3,shu4 +小数点=xiao3,shu3,dian3 +小时了了=xiao3,shi2,liao3,liao3 +小眼薄皮=xiao3,yan3,bo2,pi2 +小肚子=xiao3,du3,zi5 +小舅子=xiao3,jiu4,zi5 +小调=xiao3,diao4 +小鬼头=xiao3,gui3,tou5 +尐=jie2,ji2 +少=shao3,shao4 +少一点=shao3,yi4,dian3 
+少不了=shao4,bu4,liao3 +少不得=shao4,bu4,de2 +少不更事=shao4,bu4,geng1,shi4 +少不经事=shao4,bu4,jing1,shi4 +少壮=shao4,zhuang4 +少壮派=shao4,zhuang4,pai4 +少女=shao4,nv3 +少妇=shao4,fu4 +少将=shao3,jiang4 +少尉=shao4,wei4 +少小=shao4,xiao3 +少小无猜=shao4,xiao3,wu2,cai1 +少年=shao4,nian2 +少年得志=shao4,nian2,de2,zhi4 +少年老成=shao4,nian2,lao3,cheng2 +少年老诚=shao3,nian2,lao3,cheng2 +少府=shao4,fu3 +少成若性=shao4,cheng2,ruo4,xing4 +少数=shao3,shu4 +少校=shao4,xiao4 +少爷=shao4,ye2 +少男=shao4,nan2 +少相=shao4,xiang1 +尒=er3 +尓=er3 +尔=er3 +尕=ga3 +尕驴儿=ga3,lv2,er5 +尖=jian1 +尖嘴薄舌=jian1,zui3,bo2,she2 +尖担两头脱=jian1,dan4,liang3,tou2,tuo1 +尗=shu2 +尘=chen2 +尙=shang4 +尚=shang4 +尛=mo2 +尜=ga2 +尝=chang2 +尝鼎一脔=chang2,ding3,yi1,luan2 +尞=liao2 +尟=xian3 +尠=xian3 +尡=hun4 +尢=you2 +尣=wang1 +尤=you2 +尤云殢雨=you2,yun2,ti4,yu3 +尥=liao4 +尥蹶子=liao4,jue3,zi5 +尦=liao4 +尧=yao2 +尨=long2,mang2,meng2,pang2 +尩=wang1 +尪=wang1 +尫=wang1 +尬=ga4 +尭=yao2 +尮=duo4 +尯=kui4,kui3 +尰=zhong3 +就=jiu4 +尲=gan1 +尳=gu3 +尴=gan1 +尵=tui2 +尶=gan1 +尷=gan1 +尸=shi1 +尸居龙见=shi1,ju1,long2,xian4 +尹=yin3 +尺=chi3,che3 +尺二冤家=chi3,er4,yuan1,jia5 +尺头=chi3,tou2 +尺子=chi3,zi5 +尺寸=chi3,cun4 +尺寸之功=chi3,cu4,zhi1,gong1 +尺布斗粟=chi3,bu4,dou3,su4 +尺短寸长=chi3,duan3,cun4,chang2 +尻=kao1 +尼=ni2 +尽=jin4,jin3 +尽先=jin3,xian1 +尽其所有=jin3,qi2,suo3,you3 +尽力=jin3,li4 +尽力而为=jin4,li4,er2,wei2 +尽可能=jin3,ke3,neng2 +尽善尽美=jin4,shan4,jin4,mei3 +尽多尽少=jin3,duo1,jin3,shao3 +尽心尽力=jin4,xin1,jin4,li4 +尽快=jin3,kuai4 +尽数=jin3,shu4 +尽早=jin3,zao3 +尽着=jin3,zhe5 +尽管=jin3,guan3 +尽自=jin3,zi4 +尽说漂亮话=jin4,shuo1,piao4,liang4,hua4 +尽里头=jin3,li3,tou5 +尽量=jin3,liang4 +尾=wei3,yi3 +尾大难掉=wei3,da4,nan2,diao4 +尾巴=wei3,ba1 +尾数=wei3,shu4 +尾随不舍=wei3,sui2,bu2,she4 +尿=niao4,sui1 +尿泡=sui1,pao4 +尿脬=sui1,pao1 +尿血=niao4,xie3 +局=ju2 +局地钥天=ju2,di4,yao4,tian1 +局长=ju2,zhang3 +屁=pi4 +层=ceng2 +层台累榭=ceng2,tai2,lei3,xie4 +层见迭出=ceng2,chu1,die2,jian4 +屃=xi4 +屄=bi1 +居=ju1 +居下讪上=ju1,xia4,shan4,shang4 +居不重席=ju1,bu4,chong2,xi2 +居不重茵=ju1,bu4,chong2,yin1 +居丧=ju1,sang1 +居处=ju1,chu3 +居轴处中=ju1,zhou2,chu3,zhong1 +屆=jie4 +屇=tian2 +屈=qu1 +屈折=qu1,she2 +屉=ti4 +届=jie4 +屋=wu1 
+屋子=wu1,zi5 +屌=diao3 +屍=shi1 +屎=shi3 +屎壳郎=shi3,ke2,lang4 +屏=ping2,bing3 +屏住=bing3,zhu4 +屏声息气=bing3,sheng1,xi1,qi4 +屏幕=ping2,mu4 +屏弃=bing3,qi4 +屏息=bing3,xi1 +屏气=bing3,qi4 +屏气吞声=bing3,qi4,tun1,sheng1 +屏藩=ping2,fan1 +屏退=bing3,tui4 +屏除=bing3,chu2 +屏风=ping2,feng1 +屐=ji1 +屑=xie4 +屑子=xie4,zi3 +屒=zhen3 +屓=xi4 +屔=ni2 +展=zhan3 +屖=xi1 +屗=wei3 +屘=man3 +屙=e1 +屙金溺银=e1,jin1,niao4,yin2 +屚=lou4 +屛=ping3,bing3 +屜=ti4 +屝=fei4 +属=shu3,zhu3 +属垣有耳=zhu3,yuan2,you3,er3 +属意=zhu3,yi4 +属文=zhu3,wen2 +属望=zhu3,wang4 +属毛离里=zhu3,mao2,li2,li3 +属词比事=zhu3,ci2,bi3,shi4 +属辞比事=zhu3,ci2,bi3,shi4 +屟=xie4,ti4 +屠=tu2 +屠门大嚼=tu2,men2,da4,jiao2 +屡=lv3 +屡教不改=lv3,jian4,bu4,gai3 +屡见不鲜=lv3,jian4,bu4,xian1 +屢=lv3 +屣=xi3 +層=ceng2 +履=lv3 +履薄临深=lv3,bo2,lin2,shen1 +屦=ju4 +屧=xie4 +屨=ju4 +屩=jue1 +屪=liao2 +屫=jue1 +屬=shu3,zhu3 +屭=xi4 +屮=che4,cao3 +屯=tun2,zhun1 +屯扎=tun2,zha1 +屯蹶否塞=tun2,jue3,fou3,sai1 +屰=ni4,ji3 +山=shan1 +山公倒载=shan1,gong1,dao3,zai3 +山大王=shan1,dai4,wang2 +山峙渊渟=shan1,zhi4,yuan1,zi1 +山崩钟应=shan1,beng1,zhong1,ying4 +山殽野湋=shan1,yao1,ye3,fu4 +山溜穿石=shan1,liu4,chuan1,shi2 +山节藻棁=shan1,jie2,zao3,li4 +山行海宿=shan1,xing2,hai3,xiu3 +山阴乘兴=shan1,yin1,cheng2,xing1 +山高岭削=shan1,gao1,ling3,xue1 +山鸣谷应=shan1,ming2,gu3,ying4 +屲=wa1 +屳=xian1 +屴=li4 +屵=an4 +屶=hui4 +屷=hui4 +屸=hong2,long2 +屹=yi4 +屺=qi3 +屻=ren4 +屼=wu4 +屽=han4,an4 +屾=shen1 +屿=yu3 +岀=chu1 +岁=sui4 +岁数=sui4,shu4 +岁月不居=sui4,yue4,bu4,ju2 +岁聿其莫=sui4,yu4,qi2,mu4 +岂=qi3,kai3 +岂弟君子=kai3,ti4,jun1,zi3 +岃=ren4 +岄=yue4 +岅=ban3 +岆=yao3 +岇=ang2 +岈=ya2 +岉=wu4 +岊=jie2 +岋=e4 +岌=ji2 +岍=qian1 +岎=fen2 +岏=wan2 +岐=qi2 +岑=cen2 +岒=qian2 +岓=qi2 +岔=cha4 +岔子=cha4,zi5 +岕=jie4 +岖=qu1 +岗=gang3 +岗头泽底=gang1,tou2,ze2,di3 +岘=xian4 +岙=ao4 +岚=lan2 +岛=dao3 +岜=ba1 +岝=zuo4 +岞=zuo4 +岟=yang3 +岠=ju4 +岡=gang1 +岢=ke3 +岣=gou3 +岤=xue4 +岥=po1 +岦=li4 +岧=tiao2 +岨=ju1,ju3 +岩=yan2 +岩居穴处=yan2,ju1,xue2,chu3 +岩栖穴处=yan2,qi1,xue2,chu3 +岩芯=yan2,xin4 +岪=fu2 +岫=xiu4 +岬=jia3 +岭=ling3,ling2 +岮=tuo2 +岯=pi1 +岰=ao4 +岱=dai4 +岲=kuang4 +岳=yue4 +岳镇渊渟=yue4,zhen4,yuan1,ting1 +岴=qu1 +岵=hu4 +岶=po4 +岷=min2 +岸=an4 +岹=tiao2 +岺=ling3,ling2 +岻=di1 
+岼=ping2 +岽=dong1 +岾=zhan1 +岿=kui1 +岿然不动=kui1,ran2,bu4,dong4 +峀=xiu4 +峁=mao3 +峂=tong2 +峃=xue2 +峄=yi4 +峅=bian4 +峆=he2 +峇=ke4,ba1 +峈=luo4 +峉=e2 +峊=fu4,nie4 +峋=xun2 +峌=die2 +峍=lu4 +峎=en3 +峏=er2 +峐=gai1 +峑=quan2 +峒=tong2,dong4 +峓=yi2 +峔=mu3 +峕=shi2 +峖=an1 +峗=wei2 +峘=huan2 +峙=zhi4,shi4 +峚=mi4 +峛=li3 +峜=fa3 +峝=tong2 +峞=wei2 +峟=you4 +峠=qia3 +峡=xia2 +峢=li3 +峣=yao2 +峤=qiao2,jiao4 +峥=zheng1 +峦=luan2 +峧=jiao1 +峨=e2 +峨冠博带=e2,guan1,bo2,dai4 +峨峨汤汤=e2,e2,shang1,shang1 +峩=e2 +峪=yu4 +峫=xie2,ye2 +峬=bu1 +峭=qiao4 +峮=qun2 +峯=feng1 +峰=feng1 +峱=nao2 +峲=li3 +峳=you1 +峴=xian4 +峵=rong2 +島=dao3 +峷=shen1 +峸=cheng2 +峹=tu2 +峺=geng3 +峻=jun4 +峻阪盐车=jun4,ban3,yun2,che1 +峼=gao4 +峽=xia2 +峾=yin2 +峿=wu2 +崀=lang3 +崁=kan4 +崂=lao2 +崃=lai2 +崄=xian3 +崅=que4 +崆=kong1 +崇=chong2 +崈=chong2 +崉=ta4 +崊=lin2 +崋=hua4 +崌=ju1 +崍=lai2 +崎=qi2 +崏=min2 +崐=kun1 +崑=kun1 +崒=zu2,cui4 +崓=gu4 +崔=cui1 +崕=ya2 +崖=ya2 +崗=gang3,gang1 +崘=lun2 +崙=lun2 +崚=ling2,leng2 +崛=jue2 +崜=duo3 +崝=zheng1 +崞=guo1 +崟=yin2 +崠=dong1,dong4 +崡=han2 +崢=zheng1 +崣=wei3 +崤=xiao2 +崥=pi2,bi3 +崦=yan1 +崧=song1 +崨=jie2 +崩=beng1 +崪=zu2 +崫=jue2 +崬=dong1 +崭=zhan3,chan2 +崭露头脚=zhan3,lu4,tou2,jiao3 +崭露头角=zhan3,lu4,tou2,jiao3 +崮=gu4 +崯=yin2 +崰=zi1 +崱=ze4 +崲=huang2 +崳=yu2 +崴=wei1,wai3 +崵=yang2,dang4 +崶=feng1 +崷=qiu2 +崸=yang2 +崹=ti2 +崺=yi3 +崻=zhi4,shi4 +崼=shi4,die2 +崽=zai3 +崾=yao3 +崿=e4 +嵀=zhu4 +嵁=kan1,zhan4 +嵂=lv4 +嵃=yan3 +嵄=mei3 +嵅=han2 +嵆=ji1 +嵇=ji1 +嵈=huan4 +嵉=ting2 +嵊=sheng4 +嵋=mei2 +嵌=qian4,kan4 +嵍=wu4,mao2 +嵎=yu2 +嵏=zong1 +嵐=lan2 +嵑=ke3,jie2 +嵒=yan2 +嵓=yan2 +嵔=wei1,wei3 +嵕=zong1 +嵖=cha2 +嵗=sui4 +嵘=rong2 +嵙=ke1 +嵚=qin1 +嵛=yu2 +嵜=qi2 +嵝=lou3 +嵞=tu2 +嵟=cui1 +嵠=xi1 +嵡=weng3 +嵢=cang1 +嵣=tang2,dang4 +嵤=rong2,ying2 +嵥=jie2 +嵦=kai3,ai2 +嵧=liu2 +嵨=wu4 +嵩=song1 +嵪=kao1,qiao1 +嵫=zi1 +嵬=wei2 +嵬眼澒耳=wei2,yan3,hong4,er3 +嵭=beng1 +嵮=dian1 +嵯=cuo2 +嵰=qin1,qian3 +嵱=yong3 +嵲=nie4 +嵳=cuo2 +嵴=ji3 +嵵=shi2 +嵶=ruo4 +嵷=song3 +嵸=zong3 +嵹=jiang4 +嵺=liao2 +嵻=kang1 +嵼=chan3 +嵽=die2,di4 +嵾=cen1 +嵿=ding3 +嶀=tu1 +嶁=lou3 +嶂=zhang4 +嶃=zhan3,chan2 +嶄=zhan3,chan2 +嶅=ao2,ao4 +嶆=cao2 +嶇=qu1 +嶈=qiang1 
+嶉=wei3 +嶊=zui3 +嶋=dao3 +嶌=dao3 +嶍=xi2 +嶎=yu4 +嶏=pi3,pei4 +嶐=long2 +嶑=xiang4 +嶒=ceng2 +嶓=bo1 +嶔=qin1 +嶕=jiao1 +嶖=yan1 +嶗=lao2 +嶘=zhan4 +嶙=lin2 +嶚=liao2 +嶛=liao2 +嶜=qin2 +嶝=deng4 +嶞=tuo4 +嶟=zun1 +嶠=jiao4,qiao2 +嶡=jue2,gui4 +嶢=yao2 +嶣=jiao1 +嶤=yao2 +嶥=jue2 +嶦=zhan1,shan4 +嶧=yi4 +嶨=xue2 +嶩=nao2 +嶪=ye4 +嶫=ye4 +嶬=yi2 +嶭=nie4 +嶮=xian3 +嶯=ji2 +嶰=xie4,jie4 +嶱=ke3,jie2 +嶲=gui1,xi1,juan4 +嶳=di4 +嶴=ao4 +嶵=zui4 +嶶=wei1 +嶷=yi2 +嶸=rong2 +嶹=dao3 +嶺=ling3 +嶻=jie2 +嶼=yu3 +嶽=yue4 +嶾=yin3 +嶿=ru1 +巀=jie2 +巁=li4,lie4 +巂=gui1,xi1,juan4 +巃=long2 +巄=long2 +巅=dian1 +巆=ying2,hong1 +巇=xi1 +巈=ju2 +巉=chan2 +巊=ying3 +巋=kui1 +巌=yan2 +巍=wei1 +巎=nao2 +巏=quan2 +巐=chao3 +巑=cuan2 +巒=luan2 +巓=dian1 +巔=dian1 +巕=nie4 +巖=yan2 +巗=yan2 +巘=yan3 +巙=kui2 +巚=yan3 +巛=chuan1 +巜=kuai4 +川=chuan1 +川渟岳峙=chuan1,ting1,yue4,zhi4 +州=zhou1 +州长=zhou1,zhang3 +巟=huang1 +巠=jing1,xing2 +巡=xun2 +巡更=xun2,geng1 +巡查=xun2,zha1 +巢=chao2 +巢居穴处=chao2,ju1,xue2,chu3 +巣=chao2 +巤=lie4 +工=gong1 +工尺=gong1,che3 +工行=gong1,hang2 +左=zuo3 +左传=zuo3,zhuan4 +左宜右有=zuo3,yi2,you4,fu2 +左撇子=zuo3,pie3,zi5 +左支右吾=zuo3,zhi1,you4,wu1 +左枝右梧=zuo3,zhi1,you4,wu1 +左邻右舍=zuo3,lin2,you4,she4 +巧=qiao3 +巧发奇中=qiao3,fa1,qi2,zhong4 +巧妇难为无米之炊=qiao3,fu4,nan2,wei2,wu2,mi3,zhi1,chui1 +巧干=qiao3,gan4 +巨=ju4 +巨贾=ju4,gu3 +巩=gong3 +巪=ju4 +巫=wu1 +巬=gu1 +巭=gu1 +差=cha4,cha1,chai1,ci1 +差一点=cha4,yi1,dian3 +差一点儿=cha4,yi4,dian3,er5 +差三错四=cha1,san1,cuo4,si4 +差不多=cha4,bu4,duo1 +差不离=cha4,bu4,li2 +差之千里=cha1,zhi1,qian1,li3 +差之毫厘=cha1,zhi1,hao2,li2 +差之毫厘失之千里=cha1,zhi1,hao2,li2,shi1,zhi1,qian1,li3 +差事=chai1,shi4 +差人去=chai1,ren2,qu4 +差人去请医生=chai1,ren2,qu4,qing3,yi1,sheng1 +差价=cha1,jia4 +差使=chai1,shi2 +差值=cha1,zhi2 +差分=cha1,fen1 +差分放大器=cha4,fen1,fang4,da4,qi4 +差别=cha1,bie2 +差动=cha1,dong4 +差可=cha1,ke3 +差可告慰=cha1,ke3,gao4,wei4 +差失=cha1,shi1 +差异=cha1,yi4 +差强人意=cha1,qiang2,ren2,yi4 +差役=chai1,yi4 +差得很远=cha4,de5,hen3,yuan3 +差得远=cha4,de5,yuan3 +差拨=chai1,bo1 +差数=cha1,shu4 +差旅费=chai1,lv3,fei4 +差池=cha1,chi2 +差点=cha4,dian3 +差等=cha1,deng3 +差等生=cha4,deng3,sheng1 +差缺=chai1,que1 +差误=cha1,wu4 +差距=cha1,ju4 
+差遣=chai1,qian3 +差错=cha1,cuo4 +差额=cha1,e2 +差饷=chai1,xiang3 +巯=qiu2 +巰=qiu2 +己=ji3 +已=yi3 +已知数=yi3,zhi1,shu4 +巳=si4 +巴=ba1 +巴尔干半岛=ba1,er3,gan4,ban4,dao3 +巴尔扎克=ba1,er3,zha1,ke4 +巴巴结结=ba1,ba5,jie1,jie1 +巴结=ba1,jie5 +巵=zhi1 +巶=zhao1 +巷=xiang4,hang4 +巷子=xiang4,zi5 +巷道=hang4,dao4 +巸=yi2 +巹=jin3 +巺=xun4 +巻=juan3,juan4 +巽=xun4 +巾=jin1 +巿=fu2 +帀=za1 +币=bi4 +市=shi4 +市长=shi4,zhang3 +布=bu4 +布尔什维克=bu4,er3,shi2,wei2,ke4 +布鲁塞尔=bu4,lu3,sai4,er3 +帄=ding1 +帅=shuai4 +帆=fan1 +帇=nie4 +师=shi1 +师父=shi1,fu5 +师直为壮=shi1,zhi2,wei2,zhuang4 +师长=shi1,zhang3 +帉=fen1 +帊=pa4 +帋=zhi3 +希=xi1 +帍=hu4 +帎=dan4 +帏=wei2 +帐=zhang4 +帑=nu2,tang3 +帒=dai4 +帓=mo4,wa4 +帔=pei4 +帕=pa4 +帖=tie3,tie4,tie1 +帖服=tie1,fu2 +帗=fu2 +帘=lian2 +帘子=lian2,zi5 +帙=zhi4 +帚=zhou3 +帛=bo2 +帜=zhi4 +帝=di4 +帝都=di4,du1 +帞=mo4 +帟=yi4 +帠=yi4 +帡=ping2 +帡天极地=ju2,tian1,ji2,di4 +帢=qia4 +帣=juan4,juan3 +帤=ru2 +帥=shuai4 +带=dai4 +带着铃铛去做贼=dai4,zhe5,ling2,dang1,qu4,zuo4,zei2 +带累=dai4,lei3 +帧=zhen1 +帨=shui4 +帩=qiao1 +帪=zhen1 +師=shi1 +帬=qun2 +席=xi2 +席卷=xi2,juan4 +席卷八荒=xi2,juan3,ba1,huang1 +席卷天下=xi2,juan3,tian1,xia4 +席卷而逃=xi2,juan3,er2,tao2 +帮=bang1 +帮倒忙=bang1,dao4,mang2 +帯=dai4 +帰=gui1 +帱=chou2,dao4 +帲=ping2 +帳=zhang4 +帴=jian3,jian1,san4 +帵=wan1 +帶=dai4 +帷=wei2 +帷薄不修=wei2,bo2,bu4,xiu1 +常=chang2 +常备不懈=chang2,bei4,bu4,xie4 +常年累月=chang2,nian2,lei4,yue4 +常用对数=chang2,yong4,dui4,shu4 +常规炸弹=chang2,gui1,zha4,dan4 +帹=sha4,qie4 +帺=qi2,ji4 +帻=ze2 +帼=guo2 +帽=mao4 +帽子=mao4,zi5 +帾=zhu3 +帿=hou2 +幀=zhen1 +幁=zheng4 +幂=mi4 +幂数=mi4,shu4 +幂级数=mi4,ji2,shu4 +幃=wei2 +幄=wo4 +幅=fu2 +幆=yi4 +幇=bang1 +幈=ping2 +幉=die2 +幊=gong1 +幋=pan2 +幌=huang3 +幌子=huang3,zi5 +幍=tao1 +幎=mi4 +幏=jia4 +幐=teng2 +幑=hui1 +幒=zhong1 +幓=shan1,qiao1,shen1 +幔=man4 +幔子=man4,zi5 +幕=mu4 +幖=biao1 +幗=guo2 +幘=ze2 +幙=mu4 +幚=bang1 +幛=zhang4 +幛子=zhang4,zi5 +幜=jing3 +幝=chan3,chan4 +幞=fu2 +幟=zhi4 +幠=hu1 +幡=fan1 +幢=chuang2,zhuang4 +幣=bi4 +幤=bi4 +幥=zhang3 +幦=mi4 +幧=qiao1 +幨=chan1,chan4 +幩=fen2 +幪=meng2 +幫=bang1 +幬=chou2,dao4 +幭=mie4 +幮=chu2 +幯=jie2 +幰=xian3 +幱=lan2 +干=gan1,gan4 +干么=gan4,mo3 +干事=gan4,shi4 
+干什么=gan4,shen2,me5 +干仗=gan4,zhang4 +干净利索=gan4,jing4,li4,suo3 +干劲=gan4,jin4 +干号=gan1,hao2 +干名犯义=gan4,ming2,fan4,yi4 +干吗=gan1,ma2 +干员=gan4,yuan2 +干咳=gan1,hai1 +干啼湿哭=gan4,ti2,shi1,ku1 +干嘛=gan4,ma2 +干城之将=gan1,cheng2,zhi1,jiang4 +干将=gan4,jiang4 +干将莫邪=gan1,jiang1,mo4,ye2 +干干=gan4,gan4 +干干净净=gan4,gan1,jing4,jing4 +干干翼翼=gan4,gan4,yi4,yi4 +干性油=gan4,xing4,you2 +干戈载戢=gan4,ge1,zai3,ji2 +干才=gan4,cai2 +干打垒=gan4,da3,lei3 +干掉=gan4,diao4 +干政=gan4,zheng4 +干松=gan4,song1 +干架=gan4,jia4 +干校=gan4,xiao4 +干活=gan4,huo2 +干流=gan4,liu2 +干渠=gan4,qu2 +干点=gan4,dian3 +干父之蛊=gan4,fu4,zhi1,gu3 +干电池=gan4,dian4,chi2 +干瘪=gan1,bie3 +干白=gan4,bai2 +干瞪眼=gan1,deng4,yan3 +干硬=gan4,ying4 +干端坤倪=gan4,duan1,kun1,ni2 +干粉=gan4,fen3 +干线=gan4,xian4 +干练=gan4,lian4 +干结=gan4,jie2 +干脆利索=gan4,cui4,li4,suo3 +干警=gan4,jing3 +干路=gan4,lu4 +干道=gan4,dao4 +干部=gan4,bu4 +干酵母=gan4,jiao4,mu3 +干霄蔽日=gan4,xiao1,bi4,ri4 +干预=gan4,yu4 +干馏=gan1,liu2 +平=ping2 +平地一声雷=ping2,di4,yi1,sheng1,lei2 +平均数=ping2,jun1,shu4 +平头数=ping2,tou2,shu4 +平峒=ping2,dong4 +平巷=ping2,hang4 +平调=ping2,diao4 +平铺=ping2,pu4 +平铺直叙=ping2,pu1,zhi2,xu4 +平铺直序=ping2,pu4,zhi2,xu4 +年=nian2 +年休假=nian2,xiu1,jia4 +年假=nian2,jia4 +年少=nian2,shao4 +年谊世好=nian2,yi4,shi4,hao4 +年轻有为=nian2,qing1,you3,wei2 +年长=nian2,zhang3 +幵=jian1 +并=bing4,bing1 +并为一谈=bing4,wei2,yi1,tan2 +并处=bing4,chu3 +并州=bing1,zhou1 +并赃拿贼=bing4,zhuo1,na2,zei2 +幷=bing4,bing1 +幸=xing4 +幸得=xing4,de5 +幹=gan4 +幺=yao1 +幺麽小丑=yao1,mo3,xiao3,chou3 +幻=huan4 +幻数=huan4,shu4 +幼=you4 +幼畜=you4,chu4 +幽=you1 +幽咽=you1,ye4 +幽禁=you1,jin4 +幾=ji3,ji1 +广=guang3,an1 +广文先生=guang3,wen2,xian1,sheng4 +广种薄收=guang3,zhong4,bo2,shou1 +广陵散绝=guang3,ling2,san3,jue2 +庀=pi3 +庁=ting1 +庂=ze4 +広=guang3 +庄=zhuang1 +庄严宝相=zhuang1,yan2,bao3,xiang4 +庅=mo2,ma1,me5 +庆=qing4 +庇=bi4 +庇荫=bi4,yin4 +庈=qin2 +庉=dun4,tun2 +床=chuang2 +床铺=chuang2,pu4 +庋=gui3 +庌=ya3 +庍=bai4,ting1 +庎=jie4 +序=xu4 +序数=xu4,shu4 +序数词=xu4,shu4,ci2 +庐=lu2 +庑=wu3 +庒=zhuang1 +库=ku4 +应=ying1,ying4 +应举=ying4,ju3 +应从=ying4,cong2 +应付=ying4,fu4 +应付帐款=ying1,fu4,zhang4,kuan3 +应傲=ying4,ao4 
+应刃而解=ying4,ren4,er2,jie3 +应分=ying1,fen4 +应制=ying4,zhi4 +应募=ying4,mu4 +应卯=ying4,mao3 +应变=ying4,bian4 +应召=ying4,zhao4 +应名儿=ying1,ming2,er5 +应名点卯=ying4,ming2,dian3,mao3 +应命=ying4,ming4 +应和=ying1,he4 +应声=ying1,sheng1 +应声虫=ying4,sheng1,chong2 +应天从人=ying4,tian1,cong2,ren2 +应天承运=ying4,tian1,cheng2,yun4 +应天顺人=ying4,tian1,shun4,ren2 +应天顺民=ying4,tian1,shun4,min2 +应对=ying4,dui4 +应对得体=ying4,dui4,de2,ti3 +应届=ying1,jie4 +应市=ying4,shi4 +应弦而倒=ying4,xian2,er2,dao3 +应征=ying4,zheng1 +应得=ying1,de5 +应急=ying4,ji2 +应战=ying4,zhan4 +应手=ying4,shou3 +应承=ying4,cheng2 +应接=ying4,jie1 +应接不暇=ying4,jie1,bu4,xia2 +应敌=ying4,di2 +应时=ying4,shi2 +应景=ying4,jing3 +应机立断=ying4,ji1,li4,duan4 +应权通变=ying4,quan2,tong1,bian4 +应用=ying4,yong4 +应答=ying4,da2 +应答如响=ying4,da2,ru2,xiang3 +应答如流=ying4,da2,ru2,liu2 +应约=ying4,yue1 +应考=ying4,kao3 +应聘=ying4,pin4 +应节合拍=ying4,jie2,he2,pai1 +应许=ying1,xu3 +应诊=ying4,zhen3 +应试=ying4,shi4 +应诺=ying4,nuo4 +应运=ying4,yun4 +应选=ying4,xuan3 +应邀=ying4,yao1 +应酬=ying4,chou2 +应门=ying4,men2 +应验=ying4,yan4 +底=di3,de5 +底子=di3,zi5 +底数=di3,shu4 +底死谩生=di3,si3,man4,sheng1 +底片儿=di3,pian1,er5 +庖=pao2 +店=dian4 +店铺=dian4,pu4 +店长=dian4,zhang3 +庘=ya1 +庙=miao4 +庚=geng1 +庛=ci4 +府=fu3 +庝=tong2 +庞=pang2 +庞眉白发=pang2,mei2,bai2,fa4 +庞眉皓发=pang2,mei2,hao4,fa4 +废=fei4 +废寝忘食=fei4,qin3,wang4,shi2 +庠=xiang2 +庡=yi3 +庢=zhi4 +庣=tiao1 +庤=zhi4 +庥=xiu1 +度=du4,duo2 +度假=du4,jia4 +度假村=du4,jia4,cun1 +度己以绳=duo2,ji3,yi3,sheng2 +度德量力=duo2,de2,liang4,li4 +度数=du4,shu4 +座=zuo4 +庨=xiao1 +庩=tu2 +庪=gui3 +庫=ku4 +庬=pang2,mang2,meng2 +庭=ting2 +庭长=ting2,zhang3 +庮=you2 +庯=bu1 +庰=bing4,ping2 +庱=cheng3 +庲=lai2 +庳=bei1 +庴=cuo4,ji1 +庵=an1 +庶=shu4 +康=kang1 +庸=yong1 +庸中皦皦=yong1,zhong1,bi4,tong2 +庸碌=yong1,lu4 +庹=tuo3 +庺=song1 +庻=shu4 +庼=qing3 +庽=yu4 +庾=yu3 +庿=miao4 +廀=sou1 +廁=ce4 +廂=xiang1 +廃=fei4 +廄=jiu4 +廅=e4 +廆=gui1,wei3,hui4 +廇=liu4 +廈=sha4,xia4 +廉=lian2 +廊=lang2 +廋=sou1 +廌=zhi4 +廍=bu4 +廎=qing3 +廏=jiu4 +廐=jiu4 +廑=jin3,qin2 +廒=ao2 +廓=kuo4 +廔=lou2 +廕=yin4 +廖=liao4 +廗=dai4 +廘=lu4 +廙=yi4 +廚=chu2 +廛=chan2 +廜=tu2 +廝=si1 +廞=xin1 +廟=miao4 +廠=chang3 
+廡=wu3 +廢=fei4 +廣=guang3 +廤=ku4 +廥=kuai4 +廦=bi4 +廧=qiang2,se4 +廨=xie4 +廩=lin3 +廪=lin3 +廫=liao2 +廬=lu2 +廭=ji4 +廮=ying3 +廯=xian1 +廰=ting1 +廱=yong1 +廲=li2 +廳=ting1 +廴=yin3,yin4 +廵=xun2 +延=yan2 +延颈跂踵=yan2,jing3,qi3,zhong3 +廷=ting2 +廸=di2 +廹=po4,pai3 +建=jian4 +建行=jian4,hang2 +建都=jian4,du1 +廻=hui2 +廼=nai3 +廽=hui2 +廾=gong3 +廿=nian4 +开=kai1 +开华结果=kai1,hua1,jie2,guo3 +开卷=kai1,juan4 +开卷有益=kai1,juan4,you3,yi4 +开小差=kai1,xiao3,chai1 +开弓不放箭=kai1,gong1,bu4,fang4,jian4 +开禁=kai1,jin4 +开花结实=kai1,hua1,jie2,shi2 +开花结果=kai1,hua1,jie2,guo3 +开蒙=kai1,meng2 +弁=bian4 +异=yi4 +弃=qi4 +弃好背盟=qi4,hao3,bei4,meng2 +弃甲曳兵=qi4,jia3,ye4,bing1 +弄=nong4,long4 +弄兵潢池=nong4,bing1,huang2,shi5 +弄口=long4,kou3 +弄口鸣舌=nong4,kou3,ming2,she2 +弄堂=long4,tang2 +弄成一团=nong4,cheng2,yi4,tuan2 +弄玉吹箫=nong4,yu4,chui2,xiao1 +弄竹弹丝=nong4,zhu2,dan4,si1 +弄管调弦=nong4,guan3,diao4,xian2 +弄粉调朱=nong4,fen3,diao4,zhu1 +弄脏=nong4,zang1 +弅=fen4 +弆=ju3 +弇=yan3 +弈=yi4 +弉=zang4 +弊=bi4 +弋=yi4 +弌=yi1 +弍=er4 +弎=san1 +式=shi4 +弐=er4 +弑=shi4 +弒=shi4 +弓=gong1 +弓背=gong1,bei4 +弓腰曲背=gong1,yao1,qu1,bei4 +弓调马服=gong1,diao4,ma3,fu2 +弔=diao4 +引=yin3 +引以为戒=yin3,yi3,wei2,jie4 +引以为鉴=yin3,yi3,wei2,jian4 +引吭高歌=yin3,hang2,gao1,ge1 +引得=yin3,de5 +引芯=yin3,xin4 +弖=hu4 +弗=fu2 +弘=hong2 +弙=wu1 +弚=tui2 +弛=chi2 +弜=jiang4 +弝=ba4 +弞=shen3 +弟=di4,ti4,tui2 +张=zhang1 +张冠李戴=zhang1,guan1,li3,dai4 +张眼露睛=zhang1,yan3,lu4,jing1 +张脉偾兴=zhang1,mai4,fen4,xing1 +张靓颖=zhang1,liang4,ying3 +弡=jue2,zhang1 +弢=tao1 +弣=fu3 +弤=di3 +弥=mi2,mi3 +弥日累夜=mi2,ri4,lei4,ye4 +弥缝其阙=mi2,feng2,qi2,que4 +弥蒙=mi2,meng2 +弥补损失=mi3,bu3,sun3,shi1 +弦=xian2 +弦乐=xian2,yue4 +弧=hu2 +弨=chao1 +弩=nu3 +弪=jing4 +弫=zhen3 +弬=yi5 +弭=mi3 +弮=juan4,quan1 +弯=wan1 +弯弯曲曲=wan1,wan1,qu1,qu1 +弯曲=wan1,qu1 +弰=shao1 +弱=ruo4 +弱不禁风=ruo4,bu4,jin1,feng1 +弱不胜衣=ruo4,bu4,sheng4,yi1 +弲=xuan1,yuan1 +弳=jing4 +弴=diao1 +張=zhang1 +弶=jiang4 +強=qiang2,qiang3,jiang4 +弸=peng2 +弹=tan2,dan4 +弹丸=dan4,wan2 +弹丸脱手=tan2,wan2,tuo1,shou3 +弹冠振衣=tan2,guan1,zhen4,yi1 +弹冠振衿=tan2,guan1,zhen4,jin1 +弹冠相庆=tan2,guan1,xiang1,qing4 +弹冠结绶=tan2,guan1,jie2,shou4 +弹壳=dan4,ke2 
+弹头=dan4,tou2 +弹子=dan4,zi3 +弹尽援绝=dan4,jin4,yuan2,jue2 +弹尽粮绝=dan4,jin4,liang2,jue2 +弹弓=dan4,gong1 +弹无虚发=dan4,wu2,xu1,fa1 +弹腿=dan4,tui3 +弹药=dan4,yao4 +弹铗无鱼=dan4,jia2,wu2,yu2 +弹雨枪林=dan4,yu3,qiang1,lin2 +强=qiang2,qiang3,jiang4 +强不知以为知=qiang3,bu4,zhi1,yi3,wei2,zhi1 +强人=qiang3,ren2 +强人所难=qiang3,ren2,suo3,nan2 +强作解人=qiang3,zuo4,jie3,ren2 +强凫变鹤=qiang3,fu2,bian4,he4 +强劲=qiang2,jing4 +强嘴=jiang4,zui3 +强嘴拗舌=jiang4,zui3,niu4,she2 +强嘴硬牙=jiang4,zui3,ying4,ya2 +强将=qiang2,jiang4 +强将手下无弱兵=qiang2,jiang4,shou3,xia4,wu2,ruo4,bing1 +强干=qiang2,gan4 +强干弱枝=qiang2,gan1,ruo4,zhi1 +强弓劲弩=qiang2,gong1,jing4,nu3 +强得易贫=qiang3,de2,yi4,pin2 +强文假醋=qiang3,wen2,jia3,cu4 +强文浉醋=qiang3,wen2,jia3,cu4 +强横=qiang2,heng4 +强死强活=qiang3,si3,qiang3,huo2 +强死赖活=qiang3,si3,lai4,huo2 +强求=qiang3,qiu2 +强留=qiang3,liu2 +强直自遂=qiang2,zhi2,zi4,sui2 +强笑=qiang3,xiao4 +强而后可=qiang3,er2,hou4,ke3 +强聒不舍=qiang3,guo1,bu4,she3 +强自取折=qiang2,zi4,qu3,she2 +强识博闻=qiang3,shi2,bo2,wen2 +强词=qiang3,ci2 +强词夺理=qiang3,ci2,duo2,li3 +强调=qiang2,diao4 +强迫=qiang3,po4 +强逼=qiang3,bi1 +强颜=qiang3,yan2 +强颜欢笑=qiang3,yan2,huan1,xiao4 +强食自爱=qiang3,shi2,zi4,ai4 +强食靡角=qiang3,shi2,mi2,jiao3 +弻=bi4 +弼=bi4 +弽=she4 +弾=tan2,dan4 +弿=jian3 +彀=gou4 +彁=ge1 +彂=fa1 +彃=bi4 +彄=kou1 +彅=jian3 +彆=bie4 +彇=xiao1 +彈=tan2,dan4 +彉=guo1 +彊=qiang2,qiang3,jiang4 +彋=hong2 +彌=mi2,mi3 +彍=guo1 +彎=wan1 +彏=jue2 +彐=ji4,xue3 +彑=ji4 +归=gui1 +归省=gui1,xing3 +归还=gui1,huan2 +归降=gui1,xiang2 +当=dang1,dang4 +当一天和尚撞一天钟=dang1,yi1,tian1,he2,shang4,zhuang4,yi1,tian1,zhong1 +当仁不让=dang1,ren2,bu4,rang4 +当作=dang4,zuo4 +当做=dang4,zuo4 +当儿=dang1,er5 +当出去=dang4,chu1,qu4 +当务始终=dang1,wu4,shi3,zhong1 +当口儿=dang1,kou3,er5 +当地=dang1,di4 +当外人看=dang4,wai4,ren2,kan4 +当夜=dang4,ye4 +当天=dang4,tian1 +当头一棒=dang1,tou2,yi1,bang4 +当头棒喝=dang1,tou2,bang4,he4 +当差=dang1,chai1 +当年=dang1,nian2 +当成=dang4,cheng2 +当断不断=dang1,duan4,bu4,duan4 +当日=dang1,ri4 +当时=dang1,shi2 +当晚=dang4,wan3 +当月=dang4,yue4 +当真=dang4,zhen1 +当票=dang4,piao4 +当行出色=dang1,hang2,chu1,se4 +当轴处中=dang1,zhou2,chu3,zhong1 +当铺=dang4,pu1 +当铺老板=dang4,pu4,lao3,ban3 
+当间儿=dang1,jian4,er2 +当面输心背面笑=dang1,mian4,shu1,xin1,bei4,mian4,xiao4 +彔=lu4 +录=lu4 +彖=tuan4 +彗=hui4 +彘=zhi4 +彙=hui4 +彚=hui4 +彛=yi2 +彜=yi2 +彝=yi2 +彞=yi2 +彟=huo4 +彠=huo4 +彡=shan1,xian3 +形=xing2 +形劫势禁=xing2,jie2,shi4,jin4 +形单影只=xing2,dan1,ying3,zhi1 +形只影单=xing2,zhi1,ying3,dan1 +形孤影只=xing2,gu1,ying3,zhi1 +形数=xing2,shu4 +形枉影曲=xing2,wang3,ying3,qu1 +形格势禁=xing2,ge2,shi4,jin4 +形禁势格=xing2,jin4,shi4,ge2 +彣=wen2 +彤=tong2 +彥=yan4 +彦=yan4 +彧=yu4 +彨=chi1 +彩=cai3 +彩色玻璃=cai3,se4,bo1,li5 +彪=biao1 +彫=diao1 +彬=bin1 +彭=peng2,bang1 +彮=yong3 +彯=piao1,piao4 +彰=zhang1 +彰明昭着=zhang1,ming2,zhao1,zhe5 +彰明较着=zhang1,ming2,jiao4,zhu4 +影=ying3 +影只形单=ying3,zhi1,xing2,dan1 +影只形孤=ying3,zhi1,xing2,gu1 +影子=ying3,zi5 +影片=ying3,pian1 +影片儿=ying3,pian1,er5 +影调=ying3,diao4 +影调剧=ying3,diao4,ju4 +彲=chi1 +彳=chi4 +彴=zhuo2,bo2 +彵=tuo3,yi2 +彶=ji2 +彷=pang2,fang3 +彸=zhong1 +役=yi4 +役畜=yi4,xu4 +彺=wang3 +彻=che4 +彻查=che4,cha2 +彼=bi3 +彼一时此一时=bi3,yi1,shi2,ci3,yi1,shi2 +彼倡此和=bi3,chang4,ci3,he4 +彼唱此和=bi3,chang4,ci3,he4 +彽=di1 +彾=ling2 +彿=fu4 +往=wang3 +往渚还汀=wang3,zhu3,huan2,ting1 +往还=wang3,huan2 +征=zheng1 +征调=zheng1,diao4 +徂=cu2 +徃=wang3 +径=jing4 +径行直遂=jing4,xing2,zhi2,sui2 +待=dai4,dai1 +待定系数法=dai4,ding4,xi4,shu4,fa3 +待时守分=dai4,shi2,shou3,fen4 +待查=dai4,zha1 +徆=xi1 +徇=xun4 +很=hen3 +很不错=hen3,bu2,cuo4 +很累=hen3,lei4 +很脏=hen3,zang1 +徉=yang2 +徊=huai2 +律=lv4 +後=hou4 +徍=jia1,wang4,wa1 +徎=cheng3,zheng4 +徏=zhi4 +徐=xu2 +徑=jing4 +徒=tu2 +徒讬空言=tu2,tun2,kong1,yan2 +徒长=tu2,zhang3 +従=cong2 +徔=cong2 +徕=lai2,lai4 +徖=cong2 +得=de2,dei3,de5 +得了吧=de2,le5,ba5 +得亏=dei3,kui1 +得喝水了=dei3,he1,shui3,le5 +得当=de2,dang4 +得心应手=de2,xin1,ying4,shou3 +得意起来=de2,yi4,qi5,lai2 +得数=de2,shu4 +得未曾有=de2,wei4,ceng2,you3 +得空=de2,kong4 +得薄能鲜=de2,bo2,neng2,xian1 +得间=de2,jian4 +得马折足=de2,ma3,she2,zu2 +徘=pai2 +徙=xi3 +徙薪曲突=xi3,xin1,qu1,tu1 +徚=dong1 +徛=ji4 +徜=chang2 +徝=zhi4 +從=cong2,zong4 +徟=zhou1 +徠=lai2,lai4 +御=yu4 +徢=xie4 +徣=jie4 +徤=jian4 +徥=shi4,ti3 +徦=jia3,xia2 +徧=bian4 +徨=huang2 +復=fu4 +循=xun2 +徫=wei3 +徬=pang2 +徭=yao2 +微=wei1 +微晕=wei1,yun4 +微薄=wei1,bo2 +徯=xi1 +徰=zheng1 
+徱=piao4 +徲=ti2,chi2 +徳=de2 +徴=zheng1,zhi3 +徵=zheng1,zhi3 +徶=bie2 +德=de2 +德兴市=de2,xing1,shi4 +德薄才疏=de2,bo2,cai2,shu1 +德薄能鲜=de2,bo2,neng2,xian3 +德行=de2,xing4 +徸=zhong3,chong1 +徹=che4 +徺=jiao3,yao2 +徻=hui4 +徼=jiao3,jiao4 +徽=hui1 +徽调=hui1,diao4 +徾=mei2 +徿=long4,long3 +忀=xiang1 +忁=bao4 +忂=qu2,ju4 +心=xin1 +心不在焉=xin1,bu4,zai4,yan1 +心中一懔=xin1,zhong1,yi4,lin3 +心中有数=xin1,zhong1,you3,shu4 +心事重重=xin1,shi4,chong2,chong2 +心切=xin1,qie4 +心口相应=xin1,kou3,xiang1,ying1 +心同止水=xin1,ru2,zhi3,shui3 +心在魏阙=xin1,zai4,wei4,que4 +心存疑虑=xin1,cun2,yi2,huo4 +心宽体胖=xin1,kuan1,ti3,pan2 +心广体胖=xin1,guang3,ti3,pan2 +心急火燎=xin1,ji2,huo3,liao3 +心慌撩乱=xin1,huang1,liao2,luan4 +心手相应=xin1,shou3,xiang1,ying4 +心手相忘=xin1,shou3,xiang1,wang4 +心拙口夯=xin1,zhuo1,kou3,ben4 +心数=xin1,shu4 +心有灵犀一点通=xin1,you3,ling2,xi1,yi1,dian3,tong1 +心电感应=xin1,dian4,gan3,ying4 +心痒难挝=xin1,yang3,nan2,zhua1 +心瞻魏阙=xin1,zhan1,wei4,que4 +心肌梗塞=xin1,ji1,geng3,se4 +心血=xin1,xue4 +心长发短=xin1,chang2,fa4,duan3 +心驰魏阙=xin1,chi2,wei4,que4 +忄=xin1 +必=bi4 +必得=bi4,dei3 +忆=yi4 +忇=le4 +忈=ren2 +忉=dao1 +忊=ding4,ting4 +忋=gai3 +忌=ji4 +忍=ren3 +忍俊不住=ren3,jun4,bu4,zhu4 +忍俊不禁=ren3,jun4,bu4,jin4 +忎=ren2 +忏=chan4 +忐=tan3 +忑=te4 +忒=te4,tui1 +忓=gan1,han4 +忔=yi4,qi4 +忕=shi4,tai4 +忖=cun3 +忖度=cun3,duo2 +志=zhi4 +忘=wang4 +忘恩背义=wang4,en1,bei4,yi4 +忙=mang2 +忙得不亦乐乎=mang2,de5,bu2,yi4,le4,hu1 +忚=xi1,lie3 +忛=fan1 +応=ying1,ying4 +忝=tian3 +忞=min3,wen3,min2 +忟=min3,wen3,min2 +忠=zhong1 +忠仆=zhong1,pu2 +忡=chong1 +忢=wu4 +忣=ji2 +忤=wu3 +忥=xi4 +忦=jia2 +忧=you1 +忨=wan2 +忩=cong1 +忪=song1,zhong1 +快=kuai4 +忬=yu4,shu1 +忭=bian4 +忮=zhi4 +忯=qi2,shi4 +忰=cui4 +忱=chen2 +忲=tai4 +忳=tun2,zhun1,dun4 +忴=qian2,qin2 +念=nian4 +念头=nian4,tou5 +念念不忘=nian4,nian4,bu4,wang4 +念念有词=nian4,nian4,you3,ci2 +忶=hun2 +忷=xiong1 +忸=niu3 +忹=kuang2,wang3 +忺=xian1 +忻=xin1 +忼=kang1,hang4 +忽=hu1 +忾=kai4,xi4 +忿=fen4 +怀=huai2 +怀才不遇=huai2,cai2,bu2,yu4 +怀着鬼胎=huai2,zhe5,gui3,tai1 +怀透了=huai4,tou4,le5 +态=tai4 +怂=song3 +怃=wu3 +怄=ou4 +怅=chang4 +怆=chuang4 +怇=ju4 +怈=yi4 +怉=bao3,bao4 +怊=chao1 +怋=min2,men2 +怌=pei1 +怍=zuo4,zha4 +怎=zen3 
+怎么着=zen3,me5,zhao1 +怏=yang4 +怏怏不悦=yang4,yang4,bu4,yue4 +怐=kou4,ju4 +怑=ban4 +怒=nu4 +怒发冲冠=nu4,fa4,chong1,guan1 +怒号=nu4,hao2 +怒喝=nu4,he4 +怓=nao2,niu2 +怔=zheng1 +怔住=zheng4,zhu4 +怔忪=zheng1,zhong1 +怕=pa4 +怖=bu4 +怗=tie1,zhan1 +怘=hu4,gu4 +怙=hu4 +怚=cu1,ju4,zu1 +怛=da2 +怜=lian2 +思=si1,sai1 +思所逐之=si1,suo3,zhu2,zhi1 +思量=si1,liang5 +怞=you2,chou2 +怟=di4 +怠=dai4 +怡=yi2 +怢=tu1,die2 +怣=you2 +怤=fu1 +急=ji2 +急公好义=ji2,gong1,hao4,yi4 +急公好施=ji2,gong1,hao4,shi1 +急切=ji2,qie4 +急功好利=ji2,gong1,hao4,li4 +急惊风撞着慢郎中=ji2,jing1,feng1,zhuang4,zhe5,man4,lang2,zhong1 +急景凋年=ji2,ying3,diao1,nian2 +急脉缓灸=ji2,mai4,huan3,jiu3 +急难=ji2,nan4 +怦=peng1 +性=xing4 +怨=yuan4 +怨声载道=yuan4,sheng1,zai4,dao4 +怩=ni2 +怪=guai4 +怪相=guai4,xiang4 +怫=fu2 +怫然不悦=fu2,ran2,bu4,yue4 +怬=xi4 +怭=bi4 +怮=you1,yao4 +怯=qie4 +怰=xuan4 +怱=cong1 +怲=bing3 +怳=huang3 +怴=xu4,xue4 +怵=chu4 +怶=bi4,pi1 +怷=shu4 +怸=xi1,shu4 +怹=tan1 +怺=yong3 +总=zong3 +总得=zong3,dei3 +总数=zong3,shu4 +总长=zong3,zhang3 +怼=dui4 +怽=mi4 +怿=yi4 +恀=shi4 +恁=nen4,nin2 +恂=xun2 +恃=shi4 +恄=xi4 +恅=lao3 +恆=heng2 +恇=kuang1 +恈=mou2 +恉=zhi3 +恊=xie2 +恋=lian4 +恌=tiao1,yao2 +恍=huang3 +恎=die2 +恏=hao4 +恐=kong3 +恐吓=kong3,he4 +恐怖分子=kong3,bu4,fen4,zi5 +恑=gui3 +恒=heng2 +恒河沙数=heng2,he2,sha1,shu4 +恓=xi1,qi1,xu4 +恔=xiao4,jiao3 +恕=shu4 +恕不奉陪=shu4,bu4,feng4,pei2 +恖=si1 +恗=hu1,kua1 +恘=qiu1 +恙=yang4 +恚=hui4 +恛=hui2 +恜=chi4 +恝=jia2 +恞=yi2 +恟=xiong1 +恠=guai4 +恡=lin4 +恢=hui1 +恣=zi4 +恣意妄为=zi4,yi4,wang4,wei2 +恣睢=zi4,sui1 +恤=xu4 +恥=chi3 +恦=shang4 +恧=nv4 +恨=hen4 +恨海难填=hen4,hai3,nan2,tian2 +恩=en1 +恪=ke4 +恫=dong4 +恫吓=dong4,he4 +恫疑虚喝=dong4,yi2,xu1,he4 +恫疑虚猲=dong4,yi2,xu1,ge2 +恬=tian2 +恬不为怪=tian2,bu4,wei2,guai4 +恬不为意=tian2,bu4,wei2,yi4 +恬淡无为=tian2,dan4,wu2,wei2 +恭=gong1 +恮=quan2,zhuan1 +息=xi1 +恰=qia4 +恰如其分=qia4,ru2,qi2,fen4 +恰当=qia4,dang4 +恱=yue4 +恲=peng1 +恳=ken3 +恳切=ken3,qie4 +恴=de2 +恵=hui4 +恶=e4,wu4,e3,wu1 +恶不去善=wu4,bu4,qu4,shan4 +恶少=e4,shao4 +恶居下流=wu4,ju1,xia4,liu2 +恶心=e3,xin1 +恶恶从短=wu4,wu4,cong2,duan3 +恶湿居下=wu4,shi1,ju1,xia4 +恶煞=e4,sha4 +恶相=e4,xiang4 +恶紫夺朱=wu4,zi3,duo2,zhu1 +恶迹昭着=e4,ji4,zhao1,zhe5 
+恶醉强酒=wu4,zui4,qiang3,jiu3 +恶露=e4,lu4 +恷=qiu1 +恸=tong4 +恹=yan1 +恺=kai3 +恻=ce4 +恼=nao3 +恽=yun4 +恾=mang2 +恿=yong3 +悀=yong3 +悁=yuan1,juan4 +悂=pi1,pi3 +悃=kun3 +悄=qiao1,qiao3 +悄声=qiao3,sheng1 +悄寂=qiao3,ji4 +悄悄=qiao1,qiao1 +悄没声=qiao3,mei2,sheng1 +悄然=qiao3,ran2 +悅=yue4 +悆=yu4,shu1 +悇=tu2 +悈=jie4,ke4 +悉=xi1 +悉索薄赋=xi1,suo3,bo2,fu4 +悊=zhe2 +悋=lin4 +悌=ti4 +悍=han4 +悍将=han4,jiang4 +悎=hao4,jiao4 +悏=qie4 +悐=ti4 +悑=bu4 +悒=yi4 +悓=qian4 +悔=hui3 +悔不当初=hui3,bu4,dang1,chu1 +悔过自责=hui3,guo4,zi4,ze4 +悕=xi1 +悖=bei4 +悗=man2,men4 +悘=yi1,yi4 +悙=heng1,heng4 +悚=song3 +悛=quan1 +悜=cheng3 +悝=kui1,li3 +悞=wu4 +悟=wu4 +悠=you1 +悡=li2 +悢=liang4 +患=huan4 +患难=huan4,nan4 +患难与共=huan4,nan4,yu3,gong4 +患难之交=huan4,nan4,zhi1,jiao1 +悤=cong1 +悥=yi4,nian4 +悦=yue4 +悧=li4 +您=nin2 +悩=nao3 +悪=e4 +悫=que4 +悬=xuan2 +悬崖勒马=xuan2,ya2,le4,ma3 +悬狟素飡=xuan2,huan2,su4,kou4 +悬石程书=xuan2,dan4,cheng2,shu1 +悬钩子=xuan2,gou1,zi5 +悬首吴阙=xuan2,shou3,wu2,que4 +悭=qian1 +悮=wu4 +悯=min3 +悰=cong2 +悱=fei3 +悲=bei1 +悲切=bei1,qie4 +悲咽=bei1,ye4 +悳=de2 +悴=cui4 +悵=chang4 +悶=men4,men1 +悷=li4 +悸=ji4 +悹=guan4 +悺=guan4 +悻=xing4 +悼=dao4 +悽=qi1 +悾=kong1,kong3 +悿=tian3 +惀=lun3,lun4 +惁=xi1 +惂=kan3 +惃=gun3 +惄=ni4 +情=qing2 +情不自禁=qing2,bu2,zi4,jin1 +情真意切=qing2,zhen1,yi4,qie4 +情见乎辞=qing2,xian4,hu1,ci2 +情见力屈=qing2,xian4,li4,qu1 +情见势屈=qing2,xian4,shi4,qu1 +情见埶竭=qing2,jian4,zhou1,jie2 +情调=qing2,diao4 +情非得已=qing2,fei1,de2,yi3 +惆=chou2 +惇=dun1 +惇笃=dun1,du3 +惈=guo3 +惉=zhan1 +惊=jing1 +惊魂落魄=jing1,hun2,luo4,po4 +惋=wan3 +惌=yuan1,wan3 +惍=jin1 +惎=ji4 +惏=lan2,lin2 +惐=yu4,xu4 +惑=huo4 +惒=he2,he4 +惓=juan4,quan2 +惔=tan2,dan4 +惕=ti4 +惖=ti4 +惗=nian4 +惘=wang3 +惙=chuo4,chui4 +惚=hu1 +惛=hun1,men4 +惜=xi1 +惝=chang3 +惞=xin1 +惟=wei2 +惟利是趋=wei2,li4,shi4,qu2 +惟妙惟肖=wei2,miao4,wei2,xiao4 +惟所欲为=wei2,suo3,yu4,wei2 +惟日为岁=wei2,ri4,wei2,sui4 +惠=hui4 +惠更斯=hui4,geng1,si1 +惡=e4,wu4,e3,wu1 +惢=rui3,suo3 +惣=zong3 +惤=jian1 +惥=yong3 +惦=dian4 +惧=ju4 +惨=can3 +惩=cheng2 +惩处=cheng2,chu3 +惩艾=cheng2,yi4 +惪=de2 +惫=bei4 +惬=qie4 +惭=can2 +惮=dan4,da2 +惯=guan4 +惰=duo4 +惱=nao3 +惲=yun4 +想=xiang3 +想不到=xiang3,bu2,dao4 
+想头=xiang3,tou5 +想望风褱=xiang3,wang4,feng1,sheng4 +想着=xiang3,zhe5 +惴=zhui4 +惵=die2 +惶=huang2 +惷=chun3 +惸=qiong2 +惹=re3 +惺=xing1 +惻=ce4 +惼=bian3 +惽=min3 +惾=zong1 +惿=ti2,shi4 +愀=qiao3 +愁=chou2 +愁眉不展=chou2,mei2,bu4,zhan1 +愂=bei4 +愃=xuan1 +愄=wei1 +愅=ge2 +愆=qian1 +愇=wei3 +愈=yu4 +愉=yu2,tou1 +愊=bi4 +愋=xuan1 +愌=huan4 +愍=min3 +愎=bi4 +意=yi4 +意兴索然=yi4,xing1,suo3,ran2 +意味着=yi4,wei4,zhe5 +意想不到=yi4,xiang3,bu4,dao4 +意气相得=yi4,qi4,xiang1,de2 +意淫=yi4,yin3 +愐=mian3 +愑=yong3 +愒=qi4,kai4 +愓=dang4,shang1,tang2,yang2 +愔=yin1 +愕=e4 +愖=chen2,xin4,dan1 +愗=mao4 +愘=ke4,qia4 +愙=ke4 +愚=yu2 +愚蒙=yu2,meng2 +愛=ai4 +愜=qie4 +愝=yan3 +愞=nuo4 +感=gan3 +感应=gan3,ying4 +感应圈=gan3,ying4,quan1 +感应电流=gan3,ying4,dian4,liu2 +感性认识=gan3,xing4,ren4,shi5 +感恩荷德=gan3,en1,he4,de2 +感激不尽=gan3,ji1,bu4,jin4 +感荷=gan3,he4 +愠=yun4 +愡=cong4,song1 +愢=sai1,si1,si3 +愣=leng4 +愤=fen4 +愥=ying1 +愦=kui4 +愧=kui4 +愨=que4 +愩=gong1,gong4,hong3 +愪=yun2 +愫=su4 +愬=su4,shuo4 +愭=qi2 +愮=yao2,yao4 +愯=song3 +愰=huang4 +愱=ji2 +愲=gu3 +愳=ju4 +愴=chuang4 +愵=ni4 +愶=xie2 +愷=kai3 +愸=zheng3 +愹=yong3 +愺=cao3 +愻=xun4 +愼=shen4 +愽=bo2 +愾=kai4,xi4 +愿=yuan4 +慀=xi4,xie2 +慁=hun4 +慂=yong3 +慃=yang3 +慄=li4 +慅=sao1,cao3 +慆=tao1 +慇=yin1 +慈=ci2 +慈悲为本=ci2,bei1,wei2,ben3 +慉=xu4,chu4 +慊=qian4,qie4 +態=tai4 +慌=huang1 +慍=yun4 +慎=shen4 +慏=ming3 +慐=gong1,gong4,hong3 +慑=she4 +慒=cao2,cong2 +慓=piao1 +慔=mu4 +慕=mu4 +慕古薄今=mu4,gu3,bo2,jin1 +慖=guo2 +慗=chi4 +慘=can3 +慙=can2 +慚=can2 +慛=cui1 +慜=min2 +慝=te4 +慞=zhang1 +慟=tong4 +慠=ao2,ao4 +慡=shuang3 +慢=man4 +慣=guan4 +慤=que4 +慥=zao4 +慦=jiu4 +慧=hui4 +慨=kai3 +慩=lian2,lian3 +慪=ou4 +慫=song3 +慬=jin3,qin2,jin4 +慭=yin4 +慮=lv4 +慯=shang1 +慰=wei4 +慱=tuan2 +慲=man2 +慳=qian1 +慴=she4 +慵=yong1 +慶=qing4 +慷=kang1 +慸=di4,chi4 +慹=zhi2,zhe2 +慺=lv3,lou2 +慻=juan4 +慼=qi1 +慽=qi1 +慾=yu4 +慿=ping2 +憀=liao2 +憁=cong4 +憂=you1 +憃=chong1 +憄=zhi1,zhi4 +憅=tong4 +憆=cheng1 +憇=qi4 +憈=qu1 +憉=peng2 +憊=bei4 +憋=bie1 +憋闷气=bie1,men4,qi4 +憌=qiong2 +憍=jiao1 +憎=zeng1 +憎恶=zeng1,wu4 +憏=chi4 +憐=lian2 +憑=ping2 +憒=kui4 +憓=hui4 +憔=qiao2 +憕=cheng2,deng4,zheng4 +憖=yin4 +憗=yin4 +憘=xi3,xi1 +憙=xi3 
+憚=dan4,da2 +憛=tan2 +憜=duo4 +憝=dui4 +憞=dui4,dun4,tun1 +憟=su4 +憠=jue2 +憡=ce4 +憢=xiao1,jiao1 +憣=fan1 +憤=fen4 +憥=lao2 +憦=lao4,lao2 +憧=chong1 +憨=han1 +憩=qi4 +憪=xian2,xian4 +憫=min3 +憬=jing3 +憭=liao3,liao2 +憮=wu3 +憯=can3 +憰=jue2 +憱=cu4 +憲=xian4 +憳=tan3 +憴=sheng2 +憵=pi1 +憶=yi4 +憷=chu4 +憸=xian1 +憹=nao2,nao3,nang2 +憺=dan4 +憻=tan3 +憼=jing3,jing4 +憽=song1 +憾=han4 +憿=jiao3,ji3 +懀=wei4 +懁=xuan1,huan1 +懂=dong3 +懂得=dong3,de5 +懂行=dong3,hang2 +懃=qin2 +懄=qin2 +懅=ju4 +懆=cao3,sao1,sao4 +懇=ken3 +懈=xie4 +應=ying1,ying4 +懊=ao4 +懊丧=ao4,sang4 +懋=mao4 +懌=yi4 +懍=lin3 +懎=se4 +懏=jun4 +懐=huai2 +懑=men4 +懒=lan3 +懒得=lan3,de5 +懒散=lan3,san3 +懒骨头=lan3,gu3,tou5 +懓=ai4 +懔=lin3 +懕=yan1 +懖=guo1 +懗=xia4 +懘=chi4 +懙=yu3,yu2 +懚=yin4 +懛=dai1 +懜=meng4,meng2,meng3 +懝=ai4,yi4,ni3 +懞=meng2,meng3 +懟=dui4 +懠=qi2,ji1,ji4 +懡=mo3 +懢=lan2,xian4 +懣=men4 +懤=chou2 +懥=zhi4 +懦=nuo4 +懧=nuo4 +懨=yan1 +懩=yang3 +懪=bo2 +懫=zhi4 +懬=kuang4 +懭=kuang3 +懮=you1,you3 +懯=fu1 +懰=liu2,liu3 +懱=mie4 +懲=cheng2 +懳=hui4 +懴=chan4 +懵=meng3 +懶=lan3 +懷=huai2 +懸=xuan2 +懹=rang4 +懺=chan4 +懻=ji4 +懼=ju4 +懽=huan1 +懾=she4 +懿=yi4 +戀=lian4 +戁=nan3 +戂=mi2,mo2 +戃=tang3 +戄=jue2 +戅=gang4,zhuang4 +戆=gang4,zhuang4 +戇=gang4,zhuang4 +戈=ge1 +戉=yue4 +戊=wu4 +戋=jian1 +戌=xu1 +戍=shu4 +戎=rong2 +戎行=rong2,hang2 +戎马倥偬=rong2,ma3,kong3,zong3 +戎马倥傯=rong2,ma3,kong3,zong3 +戎马劻勷=rong2,ma3,dan1,xiao4 +戏=xi4,hu1 +戏班子=xi4,ban1,zi5 +戏馆子=xi4,guan3,zi3 +成=cheng2 +成一家言=cheng2,yi1,jia1,yan2 +成为=cheng2,wei2 +成分=cheng2,fen4 +成千累万=cheng2,qian1,lei3,wan4 +成吉思汗=cheng2,ji2,si1,han2 +成宿=cheng2,xiu3 +成年累月=cheng2,nian2,lei3,yue4 +成数=cheng2,shu4 +成绩单=cheng2,ji4,dan1 +成行=cheng2,hang2 +成败兴废=cheng2,bai4,xing1,fei4 +成都=cheng2,du1 +成都平原=cheng2,du1,ping2,yuan2 +成长=cheng2,zhang3 +我=wo3 +我们=wo3,men5 +我们俩=wo3,men1,lia3 +我们的=wo3,men5,de5 +我们自己=wo3,men5,zi4,ji3 +我们自己的=wo3,men5,zi4,ji3,de5 +戒=jie4 +戒奢宁俭=jie4,she1,ning4,jian1 +戓=ge1 +戔=jian1 +戕=qiang1 +或=huo4 +戗=qiang1,qiang4 +戗住=qiang4,zhu4 +戗脊=qiang4,ji3 +戗金=qiang4,jin1 +戗面=qiang4,mian4 +戗面馒头=qiang4,mian4,man2,tou5 +战=zhan4 +战将=zhan4,jiang4 
+战无不胜=zhan4,wu2,bu4,sheng4 +战栗=zhan4,li4 +戙=dong4 +戚=qi1 +戛=jia2 +戜=die2 +戝=zei2 +戞=jia2 +戟=ji3 +戠=zhi2 +戡=kan1 +戢=ji2 +戣=kui2 +戤=gai4 +戥=deng3 +戦=zhan4 +戧=qiang1,qiang4 +戨=ge1 +戩=jian3 +截=jie2 +截铁斩钉=jie2,tie3,zhan3,ding4 +戫=yu4 +戬=jian3 +戭=yan3 +戮=lu4 +戯=xi4,hu1 +戰=zhan4 +戱=xi4,hu1 +戲=xi4,hu1 +戳=chuo1 +戴=dai4 +戴帽子=dai4,mao4,zi5 +戵=qu2 +戶=hu4 +户=hu4 +户枢不蠹=hu4,shu1,bu4,du4 +户调=hu4,diao4 +户限为穿=hu4,xian4,wei2,chuan1 +戸=hu4 +戹=e4 +戺=shi4 +戻=ti4 +戼=mao3 +戽=hu4 +戽斗=hu4,dou3 +戾=li4 +房=fang2 +房子=fang2,zi5 +房舍=fang2,she4 +所=suo3 +所作所为=suo3,zuo4,suo3,wei2 +所得=suo3,de5 +所得税=suo3,de5,shui4 +扁=bian3,pian1 +扁担=bian3,dan4 +扁舟=pian1,zhou1 +扂=dian4 +扃=jiong1 +扄=shang3,jiong1 +扅=yi2 +扆=yi3 +扇=shan4,shan1 +扇动=shan1,dong4 +扇子=shan4,zi5 +扇惑=shan1,huo4 +扇枕温席=shan1,zhen3,wen1,xi2 +扇枕温被=shan1,zhen3,wen1,chuang2 +扇风=shan1,feng1 +扇风点火=shan4,feng1,dian3,huo3 +扇骨子=shan4,gu3,zi5 +扈=hu4 +扉=fei1 +扊=yan3 +手=shou3 +手下败将=shou3,xia4,bai4,jiang4 +手不释卷=shou3,bu4,shi4,juan4 +手卷=shou3,juan4 +手夹=shou3,jia1 +手拷=shou3,kao4 +手榴弹=shou3,liu2,dan4 +手相=shou3,xiang4 +手背=shou3,bei4 +手脚干净=shou3,jiao3,gan4,jing4 +手臂=shou3,bi4 +手足异处=shou3,zu2,yi4,chu3 +手足重茧=shou3,zu2,chong2,jian3 +扌=shou3 +才=cai2 +才分=cai2,fen4 +才占八斗=cai2,zhan1,ba1,dou3 +才大难用=cai2,da4,nan2,yong4 +才夸八斗=cai2,kua1,ba1,dou3 +才干=cai2,gan4 +才疏德薄=cai2,shu1,de2,bo2 +才薄智浅=cai1,bo2,zhi4,qian3 +才轻德薄=cai2,qing1,de2,bo2 +才高八斗=cai2,gao1,ba1,dou3 +扎=zha1,za1,zha2 +扎堆=zha1,dui1 +扎实=zha1,shi2 +扎手=zha1,shou3 +扎扎=zha1,zha1 +扎挣=zha2,zheng1 +扎根=zha1,gen1 +扎根串连=zha1,gen1,chuan4,lian2 +扎猛子=zha1,meng3,zi3 +扎眼=zha1,yan3 +扎破=zha2,po4 +扎耳朵=zha1,er3,duo3 +扎花=zha1,hua1 +扎营=zha1,ying2 +扎针=zha1,zhen1 +扏=qiu2 +扐=le4,li4,cai2 +扑=pu1 +扑扇=pu1,shan1 +扑棱=pu1,leng1 +扒=ba1,pa2 +扒开=pa2,kai1 +扒手=pa2,shou3 +扒灰=pa2,hui1 +扒窃=pa2,qie4 +扒粪=pa2,fen4 +扒糕=pa2,gao1 +扒耳搔腮=pa2,er3,sao1,sai1 +扒草=pa2,cao3 +打=da3,da2 +打一折=da3,yi4,zhe2 +打不住=da3,bu2,zhu4 +打中=da3,zhong4 +打击乐器=da3,ji1,yue4,qi4 +打呵欠=da3,he1,qian4 +打哆嗦=da3,duo1,suo4 +打哈欠=da3,ha1,qian4 +打圈子=da3,quan1,zi5 +打场=da3,chang2 +打家劫舍=da3,jia1,jie2,she4 
+打工仔=da3,gong1,zai3 +打底子=da3,di3,zi5 +打拍子=da3,pai1,zi5 +打擂=da3,lei4 +打更=da3,geng1 +打杈=da3,cha4 +打棍子=da3,gun4,zi5 +打点=da3,dian3 +打烊=da3,yang4 +打的=da3,di1 +打肿脸充胖子=da3,zhong3,lian3,chong1,pang4,zi1 +打躬作揖=da3,gong1,zuo1,yi1 +打量=da3,liang5 +打颤=da3,zhan4 +打马虎眼=da3,ma3,hu4,yan3 +扔=reng1 +払=fan3,fu2 +扖=ru4 +扗=zai4 +托=tuo1 +托物寓兴=tuo1,wu4,yu4,xing1 +扙=zhang4 +扚=diao3,di2,yue1,li4 +扛=kang2,gang1 +扛得住=kang2,de5,zhu4 +扛鼎=gang1,ding3 +扛鼎抃牛=gang1,ding3,bian4,niu2 +扛鼎拔山=gang1,ding3,ba2,shan1 +扜=yu1,wu1 +扝=yu1,wu1,ku1 +扞=han4 +扟=shen1 +扠=cha1 +扡=tuo1,chi3,yi3 +扢=gu3,xi4,ge1,jie2 +扣=kou4 +扣子=kou4,zi5 +扣帽子=kou4,mao4,zi5 +扣盘扪钥=kou4,pan2,men2,yao4 +扤=wu4 +扥=den4 +扦=qian1 +执=zhi2 +执拗=zhi2,niu4 +执经问难=zhi2,jing1,wen4,nan2 +执著=zhi2,zhuo2 +执迷不悟=zhi2,mi2,bu4,wu4 +扨=ren4 +扩=kuo4 +扪=men2 +扪参历井=men2,shen1,li4,jing3 +扫=sao3,sao4 +扫兴=sao3,xing4 +扫帚=sao4,zhou3 +扫把=sao4,ba3 +扫数=sao3,shu4 +扬=yang2 +扬厉铺张=yang2,li4,pu4,zhang1 +扬场=yang2,chang2 +扬己露才=yang2,ji3,lu4,cai2 +扬眉眴目=yang2,mei2,shun4,mu4 +扬风扢雅=yang2,feng1,bao4,ya3 +扭=niu3 +扭曲=niu3,qu1 +扭直作曲=niu3,zhi2,zuo4,qu1 +扭转干坤=niu3,zhuan3,gan4,kun1 +扮=ban4 +扮相=ban4,xiang4 +扯=che3 +扯篷拉纤=che3,peng2,la1,qian4 +扯纤拉烟=che3,qian4,la1,yan1 +扯顺风旗=che3,shun3,feng1,qi2 +扰=rao3 +扱=xi1,cha1,qi4 +扲=qian2,qin2 +扳=ban1 +扴=jia2 +扵=yu2 +扶=fu2 +扷=ba1,ao4 +扸=xi1,zhe2 +批=pi1 +批假=pi1,jia4 +批砉导窾=pi1,hua1,dao3,tao2 +批隙导窾=pi1,xi4,dao3,yin2 +批风抹月=pi1,feng1,mo4,yue4 +扺=zhi3 +扻=zhi4,sun3,kan3 +扼=e4 +扼亢拊背=e4,kang4,fu3,bei4 +扼吭夺食=e4,hang2,duo2,shi2 +扼吭拊背=e4,hang2,fu3,bei4 +扼喉抚背=e4,hou2,fu3,bei4 +扼襟控咽=e4,jin1,kong4,yan1 +扽=den4 +找=zhao3 +找乐子=zhao3,le4,zi5 +找头=zhao3,tou5 +找得到=zhao3,de5,dao4 +找着=zhao3,zhao2 +承=cheng2 +承应=cheng2,ying1 +承溜=cheng2,liu4 +承蒙=cheng2,meng2 +技=ji4 +抁=yan3 +抂=kuang2,wang3,zai4 +抃=bian4 +抄=chao1 +抄查=chao1,cha2 +抅=ju1 +抆=wen3 +抇=hu2,gu3 +抈=yue4 +抉=jue2 +把=ba3,ba4 +把子=ba4,zi5 +把玩无厌=ba3,wan2,wu3,yan4 +把马子=ba3,ma3,zi5 +抋=qin4 +抌=dan3,shen3 +抍=zheng3 +抎=yun3 +抏=wan2 +抐=ne4,ni4,rui4,na4 +抑=yi4 +抑塞磊落=yi4,se4,lei3,luo4 +抒=shu1 +抓=zhua1 +抓差=zhua1,chai1 +抔=pou2 +投=tou2 
+投传而去=tou2,zhuan4,er2,qu4 +投其所好=tou2,qi2,suo3,hao4 +投奔=tou2,ben4 +投弹=tou2,dan4 +投机分子=tou2,ji1,fen4,zi5 +投降=tou2,xiang4 +抖=dou3 +抖搂=dou3,lou1 +抖擞=dou3,sou3 +抗=kang4 +抗颜为师=kang4,yan2,wei2,shi1 +折=zhe2,zhe1,she2 +折堕=she2,duo4 +折头=zhe2,tou5 +折损=she2,sun3 +折本=she2,ben3 +折秤=she2,cheng4 +折箭为誓=she2,jian4,wei2,shi4 +折而族之=zhe2,er2,zu2,zhi1 +折耗=she2,hao4 +折腰五斗=she2,yao1,wu3,dou4 +折腾=zhe1,teng2 +折衷=she2,zhong1 +折跟头=zhe1,gen1,tou2 +折辱=she2,ru3 +抙=pou2,pou1,fu1 +抚=fu3 +抛=pao1 +抛到一边=pao1,dao4,yi4,bian1 +抛头露面=pao1,tou2,lu4,mian4 +抛舍=pao1,she3 +抜=ba2 +抝=ao3,ao4,niu4 +択=ze2 +抟=tuan2 +抠=kou1 +抠心挖血=kou1,xin1,wa1,xue4 +抠着手掌=kou1,zhe5,shou3,zhang3 +抡=lun1,lun2 +抢=qiang3,qiang1,cheng1 +抢地呼天=qiang1,di4,hu1,tian1 +抢种=qiang3,zhong4 +抣=yun2 +护=hu4 +护士长=hu4,shi4,zhang3 +报=bao4 +报丧=bao4,sang1 +报仇雪耻=bao4,chou2,xue3,chi3 +报喜不报忧=bao4,xi3,bu2,bao4,you1 +报应=bao4,ying4 +报应不爽=bao4,ying4,bu4,shuang3 +报数=bao4,shu4 +报答=bao4,da2 +报载=bao4,zai3 +报销差旅费=bao4,xiao1,chai1,lv3,fei4 +抦=bing3 +抧=zhi3,zhai3 +抨=peng1 +抩=nan2 +抪=bu4,pu1 +披=pi1 +披卷=pi1,juan4 +披发=pi1,fa4 +披发入山=pi1,fa1,ru4,shan1 +披发左衽=pi1,fa4,zuo3,ren4 +披发文身=pi1,fa4,wen2,shen1 +披发缨冠=pi1,fa1,ying1,guan4 +披头散发=pi1,tou2,san4,fa4 +披心沥血=pi1,xin1,li4,xue4 +披散=pi1,san3 +披肝沥血=pi1,gan1,li4,xue4 +披肝露胆=pi1,gan1,lu4,dan3 +披露=pi1,lu4 +披露肝胆=pi1,lu4,gan1,dan3 +披露腹心=pi1,lu4,fu4,xin1 +披靡=pi1,mi3 +抬=tai2 +抭=yao3,tao1 +抮=zhen3 +抯=zha1 +抰=yang1 +抱=bao4 +抱法处势=bao4,fa3,chu3,shi4 +抱璞泣血=bao4,pu2,qi4,xue4 +抱蔓摘瓜=bao4,wan4,zhai1,gua1 +抲=he1,he4,qia1 +抳=ni3,ni2 +抴=ye4 +抵=di3 +抵背扼喉=di3,bei4,e4,hou2 +抶=chi4 +抷=pi1,pei1 +抸=jia1 +抹=mo3,mo4,ma1 +抹下来=ma1,xia4,lai2 +抹不开=mo4,bu4,kai1 +抹布=ma1,bu4 +抹月秕风=mo3,yue4,pi1,feng1 +抹桌子=ma1,zhuo1,zi5 +抹灰=mo4,hui1 +抹煞=mo3,sha4 +抹粉施脂=mo4,fen3,shi1,zhi1 +抹胸=mo4,xiong1 +抹脸=ma1,lian3 +抺=mei4 +抻=chen1 +押=ya1 +押头=ya1,tou5 +押当=ya1,dang4 +押禁=ya1,jin4 +押解=ya1,jie4 +抽=chou1 +抽丝剥茧=chou1,si1,bao1,jian3 +抽咽=chou1,ye4 +抽斗=chou1,dou3 +抽查=chou1,cha2 +抽祕骋妍=chou1,bi4,cheng3,yan2 +抽调=chou1,diao4 +抾=qu1 +抿=min3 +拀=zhu4 +拁=jia1,ya2 +拂=fu2,bi4 +拃=zha3 +拄=zhu3 
+担=dan1,dan4,dan3 +担不是=dan1,bu2,shi4 +担子=dan4,zi5 +拆=chai1,ca1 +拇=mu3 +拈=nian1 +拉=la1,la2 +拉家常=la2,jia1,chang2 +拉杆=la1,gan3 +拉枯折朽=la1,ku1,she2,xiu3 +拉纤=la1,qian4 +拉肚子=la1,du4,zi5 +拉闲散闷=la1,xian2,san4,men4 +拊=fu3 +拊心泣血=fu3,xin1,qi4,xue4 +拊背扼吭=fu3,bei4,e4,keng1 +拊背扼喉=fu3,bei4,e4,hou2 +拊背搤吭=fu3,bei4,he4,keng1 +拋=pao1 +拌=ban4,pan4 +拌和=ban4,huo4 +拍=pai1 +拍片儿=pai1,pian1,er5 +拎=lin1 +拏=na2 +拐=guai3 +拐弯抹角=guai3,wan1,mo4,jiao3 +拑=qian2 +拒=ju4 +拓=tuo4,ta4,zhi2 +拓印=ta4,yin4 +拓本=ta4,ben3 +拓片=ta4,pian4 +拔=ba2 +拔山扛鼎=ba2,shan1,gang1,ding3 +拔本塞原=ba2,ben3,se4,yuan2 +拔本塞源=ba2,ben3,se4,yuan2 +拔缝=ba2,feng4 +拔苗助长=ba2,miao2,zhu4,zhang3 +拕=tuo1 +拖=tuo1 +拖斗=tuo1,dou3 +拖累=tuo1,lei3 +拗=ao4,niu4,ao3 +拗不过=niu4,bu4,guo4 +拗口=ao4,kou3 +拗断=ao3,duan4 +拘=ju1,gou1 +拘泥=ju1,ni4 +拘神遣将=ju1,shen2,qian3,jiang4 +拘禁=ju1,jin4 +拙=zhuo1 +拙朴=zhuo1,piao2 +拚=pin1,pan4,fan1 +拚命=pan4,ming4 +招=zhao1 +招供=zhao1,gong4 +招待会=zhao1,dai4,hui4 +招数=zhao1,shu4 +招架不住=zhao1,jia4,bu4,zhu4 +招行=zhao1,hang2 +招降=zhao1,xiang2 +招降纳叛=zhao1,xiang2,na4,pan4 +拜=bai4 +拜将=bai4,jiang4 +拜把子=bai4,ba4,zi5 +拝=bai4 +拞=di3 +拟=ni3 +拠=ju4 +拡=kuo4 +拢=long3 +拣=jian3 +拤=qia3 +拥=yong1 +拥塞=yong1,se4 +拦=lan2 +拦不住=lan2,bu4,zhu4 +拧=ning3,ning2,ning4 +拧一把=ning2,yi4,ba3 +拧成一股绳=ning2,cheng2,yi1,gu3,sheng2 +拧毛巾=ning2,mao2,jin1 +拧紧=ning2,jin3 +拧脾气=ning4,pi2,qi4 +拧衣服=ning2,yi1,fu5 +拨=bo1 +拨乱为治=bo1,luan4,wei2,zhi4 +拨云撩雨=bo1,yun2,liao2,yu3 +拨云雾见青天=bo1,yun2,wu1,jian4,qing1,tian1 +拨嘴撩牙=bo1,zui3,liao2,ya2 +拨雨撩云=bo1,yu3,liao2,yun2 +择=ze2,zhai2 +择不开=zhai2,bu4,kai1 +择席=zhai2,xi2 +择菜=zhai2,cai4 +拪=qian1 +拫=hen2 +括=kuo4,gua1 +括起来=kuo4,qi5,lai2 +拭=shi4 +拮=jie2,jia2 +拮据=jie2,ju1 +拯=zheng3 +拰=nin3 +拱=gong3 +拱券=gong3,xuan4 +拱手而降=gong3,shou4,er2,xiang2 +拲=gong3 +拳=quan2 +拳头=quan2,tou5 +拳曲=quan2,qu1 +拴=shuan1 +拵=cun2,zun4 +拶=za1,zan3 +拶子=zan3,zi5 +拶指=zan3,zhi3 +拷=kao3 +拸=yi2,chi3,hai4 +拹=xie2 +拺=ce4,se4,chuo4 +拻=hui1 +拼=pin1 +拽=zhuai4,zhuai1,ye4 +拽巷啰街=zhuai4,xiang4,luo2,jie1 +拽巷攞街=zhuai4,xiang4,luo3,jie1 +拽布拖麻=zhuai1,bu4,tuo1,ma2 +拽耙扶犁=zhuai1,pa2,fu2,li2 +拾=shi2,she4 
+拾带重还=shi2,dai4,zhong4,huan2 +拾掇=shi2,duo5 +拾掇无遗=shi2,duo1,wu2,yi2 +拾级而上=she4,ji2,er2,shang4 +拿=na2 +拿得起=na2,de5,qi3 +拿得起来=na2,de5,qi3,lai2 +拿粗夹细=na2,cu1,jia1,xi4 +拿腔作调=na2,qiang1,zuo4,diao4 +挀=bai1 +持=chi2 +持续不断=chi2,xu4,bu2,duan4 +挂=gua4 +挂不住=gua4,bu2,zhu4 +挂冠=gua4,guan1 +挂冠归去=gua4,guan1,gui1,qu4 +挂冠求去=gua4,guan1,qiu2,qu4 +挂席为门=gua4,xi2,wei2,men2 +挂斗=gua4,dou3 +挂累=gua4,lei3 +挃=zhi4 +挄=kuo4,guang1 +挅=duo4 +挆=duo3,duo4 +指=zhi3 +指不胜偻=zhi3,bu4,sheng4,lv3 +指不胜屈=zhi3,bu4,sheng4,qu1 +指东划西=zhi3,dong1,hua4,xi1 +指囷相赠=zhi3,que4,xiang1,zeng4 +指天为誓=zhi3,tian1,wei2,shi4 +指头=zhi3,tou5 +指山卖磨=zhi3,shan1,mai4,mo4 +指山说磨=zhi3,shan1,shuo1,mo4 +指手划脚=zhi3,shou3,hua4,jiao3 +指数=zhi3,shu4 +指树为姓=zhi3,shu4,wei2,xing4 +指甲盖=zhi3,jia2,gai4 +指皁为白=zhi3,zao4,wei2,bai2 +指皂为白=zhi3,zao4,wei2,bai2 +指腹为婚=zhi3,fu4,wei2,hun1 +指雁为羹=zhi3,yan4,wei2,geng1 +指鹿为马=zhi3,lu4,wei2,ma3 +挈=qie4 +挈瓶之知=qie4,ping2,zhi1,zhi4 +按=an4 +按捺不住=an4,na4,bu4,zhu4 +按章给付=an4,zhang1,ji3,fu4 +挊=nong4 +挋=zhen4 +挌=ge2 +挍=jiao4 +挎=kua4,ku1 +挎斗=kua4,dou3 +挏=dong4 +挐=ru2,na2 +挑=tiao1,tiao3 +挑中=tiao1,zhong4 +挑么挑六=tiao1,yao1,tiao1,liu4 +挑动=tiao3,dong4 +挑唆=tiao3,suo1 +挑唇料嘴=tiao3,chun2,liao4,zui3 +挑嘴=tiao3,zui3 +挑大梁=tiao3,da4,liang2 +挑子=tiao1,zi5 +挑弄=tiao3,nong4 +挑得篮里便是菜=tiao3,de2,lan2,li3,bian4,shi4,cai4 +挑战=tiao3,zhan4 +挑担=tiao1,dan4 +挑拨=tiao3,bo1 +挑拨离间=tiao3,bo1,li2,jian4 +挑明=tiao3,ming2 +挑灯=tiao3,deng1 +挑牙料唇=tiao3,ya2,liao4,chun2 +挑花=tiao3,hua1 +挑衅=tiao3,xin4 +挑逗=tiao3,dou4 +挒=lie4 +挓=zha1 +挔=lv3 +挕=die2,she4 +挖=wa1 +挗=jue2 +挘=lie3 +挙=ju3 +挚=zhi4 +挛=luan2 +挜=ya4,ya3 +挝=wo1,zhua1 +挞=ta4 +挟=xie2,jia1 +挟主行令=jia1,zhu3,xing2,ling4 +挟势弄权=jia1,shi4,nong4,quan2 +挠=nao2 +挠曲=nao2,qu1 +挠曲枉直=nao2,qu1,wang3,zhi2 +挠直为曲=nao2,zhi2,wei2,qu1 +挡=dang3,dang4 +挢=jiao3 +挢抂过正=jiao3,kuang1,guo4,zheng4 +挣=zheng4,zheng1 +挣扎=zheng1,zha2 +挤=ji3 +挥=hui1 +挦=xian2 +挦章撦句=long2,zhang1,zong1,ju4 +挧=yu3 +挨=ai1,ai2 +挨个=ai1,ge4 +挨冻受饿=ai2,dong4,shou4,e4 +挨家挨户=ai1,jia1,ai1,hu4 +挨山塞海=ai1,shan1,se4,hai3 +挨打=ai2,da3 +挨日子=ai2,ri4,zi5 +挨肩叠背=ai1,jian1,die2,bei4 
+挨肩搭背=ai1,jian1,da1,bei4 +挨肩擦背=ai1,jian1,ca1,bei4 +挨肩迭背=ai1,jian1,die2,bei4 +挨说=ai2,shuo1 +挨近=ai1,jin4 +挨闷雷=ai2,men4,lei2 +挨风缉缝=ai1,feng1,ji1,feng4 +挨饿=ai2,e4 +挨骂=ai2,ma4 +挩=tuo1,shui4 +挪=nuo2 +挫=cuo4 +挬=bo2 +挭=geng3 +挮=ti3,ti4 +振=zhen4 +振兴=zhen4,xing1 +振兵泽旅=zhen4,bing1,shi4,lv3 +挰=cheng2 +挱=suo1,sha1 +挲=suo1,sha1 +挳=keng1,qian1 +挴=mei3 +挵=nong4 +挶=ju2 +挷=bang4,peng2 +挸=jian3 +挹=yi4 +挹斗扬箕=yi4,dou3,yang2,ji1 +挺=ting3 +挺括=ting3,gua1 +挻=shan1 +挼=ruo2 +挼好长发=ruo2,hao3,chang2,fa4 +挽=wan3 +挾=xie2,jia1 +挿=cha1 +捀=peng2 +捁=jiao3,ku4 +捂=wu3 +捃=jun4 +捄=jiu4 +捅=tong3 +捅娄子=tong3,lou2,zi5 +捆=kun3 +捆扎=kun3,zha1 +捇=huo4,chi4 +捈=tu2,shu1,cha2 +捉=zhuo1 +捉衿露肘=zhuo1,jin1,lu4,zhou3 +捉襟露肘=zhuo1,jin1,lu4,zhou3 +捊=pou2,pou1,fu1 +捋=luo1,lv3 +捋胡子=lv3,hu2,zi3 +捋袖子=luo1,xiu4,zi5 +捌=ba1 +捍=han4 +捎=shao1,shao4 +捏=nie1 +捏一把汗=nie1,yi1,ba3,han4 +捐=juan1 +捐躯赴难=juan1,qu1,fu4,nan4 +捑=ze4 +捒=shu4,song3,sou1 +捓=ye2,yu2 +捔=jue2,zhuo2 +捕=bu3 +捖=wan2 +捗=bu4,pu2,zhi4 +捘=zun4 +捙=ye4 +捚=zhai1 +捛=lv3 +捜=sou1 +捝=tuo1,shui4 +捞=lao1 +损=sun3 +损兵折将=sun3,bing1,zhe2,jiang4 +损军折将=sun3,jun1,zhe2,jiang4 +捠=bang1 +捡=jian3 +捡起来=jian3,qi5,lai2 +换=huan4 +换斗移星=huan4,dou3,yi2,xing1 +换衣服=huan4,yi1,fu5 +捣=dao3 +捣乱分子=dao3,luan4,fen4,zi5 +捤=wei3 +捥=wan4,wan3,wan1,yu4 +捦=qin2 +捧=peng3 +捧场=peng3,chang3 +捧腹大笑=peng3,fu4,da4,xiao4 +捨=she3 +捩=lie4 +捪=min2 +捫=men2 +捬=fu3,fu4,bu3 +捭=bai3 +据=ju4,ju1 +据为己有=ju4,wei2,ji3,you3 +捯=dao2 +捰=wo3,luo4,luo3 +捰袖揎拳=luo4,xiu4,xuan1,quan2 +捱=ai2 +捱风缉缝=ai1,feng1,qi1,feng4 +捲=juan3,quan2 +捳=yue4 +捴=zong3 +捵=chen1 +捶=chui2 +捶背=chui2,bei4 +捷=jie2 +捸=tu1 +捹=ben4 +捺=na4 +捻=nian3,nie1 +捻土为香=nian3,tu3,wei2,xiang1 +捼=ruo2,wei3,re2 +捽=zuo2 +捾=wo4,xia2 +捿=qi1 +掀=xian1 +掁=cheng2 +掂=dian1 +掂掇=dian1,duo5 +掂斤抹两=dian1,jin1,mo4,liang3 +掂量=dian1,liang5 +掃=sao3,sao4 +掄=lun1,lun2 +掅=qing4,qian4 +掆=gang1 +掇=duo1 +授=shou4 +掉=diao4 +掊=pou3,pou2 +掊斗折衡=pou3,dou3,zhe2,heng2 +掋=di3 +掌=zhang3 +掌掴=zhang3,guai1 +掍=hun4 +掎=ji3 +掎挈伺诈=ji3,qie4,si4,zha4 +掎裳连襼=ji3,shang5,lian2,zheng1 +掏=tao1 +掐=qia1 +掐着指头=qia1,zhe5,zhi3,tou5 +掑=qi2 
+排=pai2,pai3 +排场=pai2,chang5 +排子车=pai3,zi3,che1 +排山倒海=pai2,shan1,dao3,hai3 +排忧解难=pai2,you1,jie3,nan4 +排行=pai2,hang2 +排行榜=pai2,hang2,bang3 +排长=pai2,zhang3 +排难解纷=pai2,nan4,jie3,fen1 +掓=shu1 +掔=qian1,wan4 +掕=ling2 +掖=ye4,ye1 +掖在怀里=ye1,zai4,huai2,li3 +掖满=ye1,man3 +掖给=ye1,gei3 +掖进去=ye1,jin4,qu4 +掗=ya4,ya3 +掘=jue2 +掙=zheng1,zheng4 +掚=liang3 +掛=gua4 +掜=ni3,nie4,yi4 +掝=huo4,xu4 +掞=shan4,yan4,yan3 +掞藻飞声=shan3,zao3,fei1,sheng1 +掟=zheng3,ding4 +掠=lve4 +採=cai3 +探=tan4 +探囊胠箧=tan4,nang2,wu2,qie4 +掣=che4 +掣襟露肘=che4,jin1,lu4,zhou3 +掤=bing1 +接=jie1 +接着=jie1,zhe5 +掦=ti4 +控=kong4 +推=tui1 +推干就湿=tui1,gan4,jiu4,shi1 +推枯折腐=tui1,ku1,she2,fu3 +推磨=tui1,mo4 +掩=yan3 +措=cuo4 +措辞不当=cuo4,ci2,bu2,dang4 +措辞得当=cuo4,ci2,de2,dang4 +掫=zou1,zhou1,chou1 +掬=ju1 +掭=tian4 +掮=qian2 +掯=ken4 +掰=bai1 +掱=pa2 +掲=jie1 +掳=lu3 +掴=guo2 +掴手=guai1,shou3 +掴耳光=guai1,er3,guang1 +掵=ming4 +掶=jie2 +掷=zhi4 +掷骰子=zhi4,tou2,zi3 +掸=dan3,shan4 +掸子=dan3,zi5 +掸桌子=dan3,zhuo1,zi5 +掸衣服=dan3,yi1,fu5 +掹=meng1 +掺=chan1,xian1,can4,shan3 +掺和=chan1,huo4 +掻=sao1 +掼=guan4 +掽=peng4 +掾=yuan4 +掿=nuo4 +揀=jian3 +揁=zheng1,keng1 +揂=jiu1,you2 +揃=jian3,jian1 +揄=yu2 +揅=yan2 +揆=kui2 +揆情度理=kui2,qing2,duo2,li3 +揆理度情=kui2,li3,duo2,qing2 +揇=nan3 +揈=hong1 +揉=rou2 +揊=pi4,che4 +揋=wei1 +揌=sai1 +揍=zou4 +揎=xuan1 +揎拳捰袖=xuan1,quan2,long3,xiu4 +描=miao2 +提=ti2,di1,di3 +提供=ti2,gong1 +提干=ti2,gan4 +提溜=di1,liu1 +提调=ti2,diao4 +提防=di1,fang2 +揑=nie1 +插=cha1 +插头=cha1,tou2 +揓=shi4 +揔=zong3,song1 +揕=zhen4,zhen1 +揖=yi1 +揗=xun2 +揘=huang2,yong2 +揙=bian3 +揚=yang2 +換=huan4 +揜=yan3 +揝=zan3,zuan4 +揞=an3 +揟=xu1,ju1 +揠=ya4 +揠苗助长=ya4,miao2,zhu4,zhang3 +握=wo4 +握粟出卜=wo4,su4,chu1,bo5 +揢=ke2,qia1 +揣=chuai4,chuai3,chuai1,tuan2,zhui1 +揣合逢迎=chuai3,he2,feng2,ying2 +揣奸把猾=chuai1,jian1,ba3,hua2 +揣度=chuai3,duo2 +揣想=chuai3,xiang3 +揣手=chuai1,shou3 +揣手儿=chuai1,shou3,er5 +揣摩=chuai3,mo2 +揣摸=chuai3,mo1 +揣时度力=chuai3,shi2,duo2,li4 +揣测=chuai3,ce4 +揣骨听声=chuai1,gu3,ting1,sheng1 +揤=ji2 +揥=ti4,di4 +揦=la4,la2 +揧=la4 +揨=cheng2 +揩=kai1 +揪=jiu1 +揫=jiu1 +揬=tu2 +揭=jie1,qi4 +揭露=jie1,lu4 +揮=hui1 +揯=gen4 +揰=chong4,dong3 
+揱=xiao1 +揲=she2,die2,ye4 +揳=xie1 +援=yuan2 +揵=qian2,jian4,jian3 +揶=ye2 +揷=cha1 +揸=zha1 +揹=bei1 +揺=yao2 +揻=wei1 +揼=beng4 +揽=lan3 +揾=wen4 +揿=qin4 +搀=chan1 +搀和=chan1,huo5 +搀行夺市=chan1,hang2,duo2,shi4 +搁=ge1,ge2 +搁不住=ge2,bu2,zhu4 +搁得住=ge2,de5,zhu4 +搁置=ge1,zhi4 +搂=lou3,lou1 +搃=zong3 +搄=gen4 +搅=jiao3 +搅合=jiao3,he2 +搅和=jiao3,huo5 +搅混=jiao3,gun3 +搆=gou4 +搇=qin4 +搈=rong2 +搉=que4 +搊=chou1,zou3 +搋=chuai1 +搋子=chuai1,zi5 +搌=zhan3 +損=sun3 +搎=sun1 +搏=bo2 +搐=chu4 +搑=rong2,nang2,nang3 +搒=bang4,peng2 +搓=cuo1 +搔=sao1 +搔着痒处=sao1,zhe5,yang3,chu4 +搕=ke1,e4 +搖=yao2 +搗=dao3 +搘=zhi1 +搙=nu4,nuo4,nou4 +搚=la1,xie2,xian4 +搛=jian1 +搜=sou1 +搜岩采干=sou1,yan2,cai3,gan4 +搜括=sou1,gua1 +搜查=sou1,zha1 +搝=qiu3 +搞=gao3 +搟=xian3,xian1 +搠=shuo4 +搡=sang3 +搢=jin4 +搣=mie4 +搤=e4 +搥=chui2 +搦=nuo4 +搧=shan1 +搨=ta4 +搩=jie2,zhe2 +搪=tang2 +搪塞=tang2,se4 +搪差使=tang2,chai1,shi5 +搫=pan2,ban1,po2 +搬=ban1 +搬起石头打自己的脚=ban1,qi3,shi2,tou2,da3,zi4,ji3,de5,jiao3 +搭=da1 +搭理=da1,li3 +搭载=da1,zai4 +搮=li4 +搯=tao1 +搰=hu2 +搱=zhi4,nai2 +搲=wa1,wa3,wa4 +搳=hua2 +搴=qian1 +搴旗取将=qian1,qi2,qu3,jiang4 +搴旗斩将=qian1,qi2,zhan3,jiang4 +搵=wen4 +搶=qiang1,qiang3,cheng1 +搷=tian2,shen1 +搸=zhen1 +搹=e4 +携=xie2 +搻=na2,nuo4 +搼=quan2 +搽=cha2 +搾=zha4 +搿=ge2 +摀=wu3 +摁=en4 +摂=she4 +摃=gang1 +摄=she4,nie4 +摅=shu1 +摆=bai3 +摆擂=bai3,lei4 +摇=yao2 +摇头晃脑=yao2,tou2,huang4,nao3 +摇手触禁=yao2,shou3,chu4,jin4 +摇晃=yao2,huang4 +摇曳=yao2,ye4 +摇滚乐=yao2,gun3,yue4 +摈=bin4 +摉=sou1 +摊=tan1 +摊子=tan1,zi5 +摋=sa4,sha1,shai3 +摌=chan3,sun4 +摍=suo1 +摎=jiu1,liu2,liao2,jiao3,nao2 +摏=chong1 +摐=chuang1 +摑=guo2 +摒=bing4 +摓=feng2,peng3 +摔=shuai1 +摔打=shuai1,da2 +摔跟头=shuai1,gen1,tou5 +摕=di4,tu2,zhi2 +摖=qi4,ji4,cha2 +摗=sou1,song3 +摘=zhai1 +摙=lian3,lian4 +摚=cheng1 +摛=chi1 +摜=guan4 +摝=lu4 +摞=luo4 +摟=lou3,lou1 +摠=zong3 +摡=gai4,xi4 +摢=hu4,chu1 +摣=zha1 +摤=qiang1 +摥=tang4 +摦=hua4 +摧=cui1 +摧刚为柔=cui1,gang1,wei2,rou2 +摧折=cui1,she2 +摧折豪强=cui1,zhe2,hao2,qiang2 +摨=zhi4,nai2 +摩=mo2,ma1 +摩挲=ma1,sa1 +摪=jiang1,qiang4 +摫=gui1 +摬=ying3 +摭=zhi2 +摮=ao2,qiao2 +摯=zhi4 +摰=nie4,che4 +摱=man2,man4 +摲=chan4,can2 +摳=kou1 +摴=chu1 
+摵=se4,mi2,su4 +摶=tuan2 +摷=jiao3,chao1 +摸=mo1 +摸不着=mo1,bu4,zhao2 +摸不着头脑=mo1,bu4,zhao2,tou2,nao3 +摸不着边=mo1,bu4,zhuo2,bian1 +摸头不着=mo1,tou2,bu4,zhao2 +摸着石头过河=mo1,zhe5,shi2,tou5,guo4,he2 +摸门不着=mo1,men2,bu4,zhao2 +摹=mo2 +摺=zhe2 +摻=chan1,xian1,can4,shan3 +摼=keng1,qian1 +摽=biao4,biao1 +摽梅之年=biao4,mei2,zhi1,nian2 +摾=jiang4 +摿=yao2 +撀=gou4 +撁=qian1 +撂=liao4 +撂挑子=liao4,tiao1,zi5 +撃=ji1 +撄=ying1 +撅=jue1,jue2 +撅坑撅堑=jue2,keng1,jue2,qian4 +撆=pie1 +撇=pie1,pie3 +撇呆打堕=pie3,dai1,da3,duo4 +撇嘴=pie3,zui3 +撇开=pie1,kai1 +撇弃=pie1,qi4 +撇条=pie3,tiao2 +撈=lao1 +撉=dun1 +撊=xian4 +撋=ruan2 +撌=gui4 +撍=zan3,zan1,zen1,qian2 +撎=yi1 +撏=xian2 +撐=cheng1 +撑=cheng1 +撒=sa1,sa3 +撒呓挣=sa1,yi4,zheng1 +撒布=sa3,bu4 +撒播=sa3,bo1 +撒施=sa3,shi1 +撒种=sa3,zhong3 +撒豆成兵=sa3,dou4,cheng2,bing1 +撒鸭子=sa1,ya1,zi3 +撓=nao2 +撔=hong4 +撕=si1 +撖=han4 +撗=heng2,guang4 +撘=da1 +撙=zun3 +撚=nian3 +撛=lin3 +撜=zheng3,cheng2 +撝=hui1,wei2 +撞=zhuang4 +撟=jiao3 +撠=ji3 +撡=cao1 +撢=dan3 +撣=dan3,shan4 +撤=che4 +撤差=che4,chai1 +撥=bo1 +撦=che3 +撧=jue1 +撨=xiao1,sou1 +撩=liao1,liao2 +撩乱=liao2,luan4 +撩云拨雨=liao2,yun2,bo1,yu3 +撩人=liao2,ren2 +撩动=liao2,dong4 +撩开=liao2,kai1 +撩拨=liao2,bo1 +撩火加油=liao2,huo3,jia1,you2 +撩蜂剔蝎=liao2,feng1,ti4,xie1 +撩蜂吃螫=liao2,feng1,chi1,shi4 +撩逗=liao2,dou4 +撪=ben4 +撫=fu3 +撬=qiao4 +播=bo1 +播撒=bo1,sa3 +播种=bo1,zhong3 +播穅眯目=bo1,kang1,mi3,mu4 +播糠眯目=bo1,kang1,mi3,mu4 +撮=cuo1,zuo3 +撮土焚香=cuo1,gu3,fen2,xiang1 +撮科打哄=cuo1,ke1,da3,hong4 +撯=zhuo2 +撰=zhuan4 +撱=wei3,tuo3 +撲=pu1 +撳=qin4 +撴=dun1 +撵=nian3 +撶=hua2 +撷=xie2 +撸=lu1 +撹=jiao3 +撺=cuan1 +撺掇=cuan1,duo5 +撻=ta4 +撼=han4 +撽=qiao4,yao1,ji1 +撾=zhua1,wo1 +撿=jian3 +擀=gan3 +擁=yong1 +擂=lei2,lei4 +擂主=lei4,zhu3 +擂台=lei4,tai2 +擃=nang3 +擄=lu3 +擅=shan4 +擆=zhuo2 +擇=ze2,zhai2 +擈=pu3 +擉=chuo4 +擊=ji1 +擋=dang3,dang4 +擌=se4 +操=cao1 +操之过切=cao1,zhi1,guo4,qie4 +操切=cao1,qie4 +操奇逐赢=cao1,qi4,zhu4,ying2 +擎=qing2 +擏=qing2,jing3 +擐=huan4 +擑=jie1 +擒=qin2 +擒奸擿伏=qin2,jian1,ti1,fu2 +擓=kuai3 +擔=dan1,dan4 +擕=xie2 +擖=qia1,jia1,ye4 +擗=pi3,bo4 +擘=bo4,bai1 +擘两分星=bo2,liang3,fen1,xing1 +擘划=bo4,hua4 +擙=ao4 +據=ju4,ju1 +擛=ye4 +擜=e4 +擝=meng1 
+擞=sou4,sou3 +擟=mi2 +擠=ji3 +擡=tai2 +擢=zhuo2 +擢发莫数=zhuo2,fa4,mo4,shu3 +擢发难数=zhuo2,fa4,nan2,shu3 +擣=dao3 +擤=xing3 +擥=lan3 +擦=ca1 +擦拳抹掌=ca1,quan2,mo4,zhang3 +擦背=ca1,bei4 +擧=ju3 +擨=ye1 +擩=ru3 +擪=ye4 +擫=ye4 +擬=ni3 +擭=huo4 +擮=jie2 +擯=bin4 +擰=ning2,ning3,ning4 +擱=ge1,ge2 +擲=zhi4 +擳=zhi4,jie2 +擴=kuo4 +擵=mo2 +擶=jian4 +擷=xie2 +擸=lie4,la4 +擹=tan1 +擺=bai3 +擻=sou3,sou4 +擼=lu1 +擽=li4,luo4,yue4 +擾=rao3 +擿=ti1,zhi4,zhai1 +擿埴索涂=zhai1,zhi2,suo3,tu2 +擿埴索途=zhai1,zhi2,suo3,tu2 +擿植索涂=zhai1,zhi2,suo3,tu2 +攀=pan1 +攀花折柳=pan1,hua1,zhe2,liu3 +攀藤揽葛=pan1,teng2,lan3,ge3 +攀藤附葛=pan1,teng2,fu4,ge3 +攀蟾折桂=pan1,chan2,she2,gui4 +攁=yang3 +攂=lei2,lei4 +攃=ca1,sa3 +攄=shu1 +攅=zan3 +攆=nian3 +攇=xian3 +攈=jun4,pei4 +攉=huo1 +攊=li4,luo4 +攋=la4,lai4 +攌=huan4 +攍=ying2 +攎=lu2,luo2 +攏=long3 +攐=qian1 +攑=qian1 +攒=zan3,cuan2 +攒三聚五=cuan2,san1,ju4,wu3 +攒三集五=cuan2,san1,ji2,wu3 +攒动=cuan2,dong4 +攒射=cuan2,she4 +攒盒=cuan2,he2 +攒眉=cuan2,mei2 +攒眉苦脸=zan3,mei2,ku3,lian3 +攒眉蹙额=cuan2,mei4,cu4,e2 +攒聚=cuan2,ju4 +攒锋聚镝=cuan2,feng1,ju4,di2 +攒集=cuan2,ji2 +攒零合整=cuan2,ling2,he2,zheng3 +攓=qian1 +攔=lan2 +攕=xian1,jian1 +攖=ying1 +攗=mei2 +攘=rang3 +攙=chan1 +攚=weng3 +攛=cuan1 +攜=xie2 +攝=she4,nie4 +攞=luo2 +攟=jun4 +攠=mi2,mi3,mo2 +攡=chi1 +攢=zan3,cuan2 +攣=luan2 +攤=tan1 +攥=zuan4 +攦=li4,shai4 +攧=dian1 +攨=wa1 +攩=dang3 +攪=jiao3 +攫=jue2 +攫为己有=jue2,wei2,ji3,you3 +攫金不见人=jue2,jin1,bu4,jian4,ren2 +攬=lan3 +攭=li4,luo3 +攮=nang3 +支=zhi1 +支吾其词=zhi1,wu1,qi2,ci2 +支差=zhi1,chai1 +支应=zhi1,ying4 +支数=zhi1,shu4 +支着=zhi1,zhao1 +攰=gui4 +攱=gui3,gui4 +攲=qi1,yi3,ji1 +攳=xun2 +攴=pu1 +攵=pu1 +收=shou1 +收因结果=shou1,yin1,jie2,guo3 +收园结果=shou1,yuan2,jie2,guo3 +收旗卷伞=shou1,qi2,juan4,san3 +收煞=shou1,sha1 +收缘结果=shou1,yuan2,jie2,guo3 +收起来=shou1,qi5,lai2 +收载=shou1,zai3 +攷=kao3 +攸=you1 +改=gai3 +改为=gai3,wei2 +改变心意=gai3,bian4,zhu3,yi4 +改姓更名=gai3,xing4,geng1,ming2 +改张易调=gai3,zhang1,yi4,diao4 +改弦易调=gai3,xian2,yi4,diao4 +改弦更张=gai3,xian2,geng1,zhang1 +改恶为善=gai3,e4,wei2,shan4 +改曲易调=gai3,qu3,yi4,diao4 +改玉改行=gai3,yu4,gai3,xing2 +改而更张=gai3,er2,geng4,zhang1 +改行=gai3,hang2 +改行为善=gai3,xing2,wei2,shan4 
+改行从善=gai3,xing2,cong2,shan4 +改行迁善=gai3,xing2,qian1,shan4 +改调=gai3,diao4 +攺=yi3 +攻=gong1 +攻城掠地=gong1,cheng2,lve4,di4 +攻心扼吭=gong1,xin1,e4,keng1 +攻过箴阙=gong1,guo4,zhen1,que4 +攼=gan1,han4 +攽=ban1 +放=fang4 +放假=fang4,jia4 +放大率=fang4,da4,lv4 +放血=fang4,xue4 +放辟邪侈=fang4,pi4,xie2,chi3 +放长线钓大鱼=fang4,chang2,xian4,diao4,da4,yu2 +政=zheng4 +敀=po4 +敁=dian1 +敂=kou4 +敃=min3 +敄=wu4,mou2 +故=gu4 +故伎重演=gu4,ji4,chong2,yan3 +故态复还=gu4,tai4,fu4,huan2 +故技重演=gu4,ji4,chong2,yan3 +故都=gu4,du1 +敆=he2 +敇=ce4 +效=xiao4 +效应=xiao4,ying4 +敉=mi3 +敊=chu4,shou1 +敋=ge2,guo2,e4 +敌=di2 +敍=xu4 +敎=jiao4,jiao1 +敏=min3 +敐=chen2 +救=jiu4 +救寒莫如重裘=jiu4,han2,mo4,ru2,chong2,qiu2 +救苦救难=jiu4,ku3,jiu4,nan4 +敒=shen1 +敓=duo2,dui4 +敔=yu3 +敕=chi4 +敖=ao2 +敖不可长=ao4,bu4,ke3,zhang3 +敗=bai4 +敘=xu4 +教=jiao4,jiao1 +教一识百=jiao1,yi1,shi2,bai3 +教中文=jiao1,zhong1,wen2 +教书=jiao1,shu1 +教学相长=jiao4,xue2,xiang1,zhang3 +教猱升木=jiao1,nao2,sheng1,mu4 +教给=jiao1,gei3 +教音乐=jiao1,yin1,yue4 +敚=duo2,dui4 +敛=lian3 +敛声屏息=lian3,sheng1,bing3,xi1 +敛骨吹魂=lian3,gu3,chui1,hun2 +敜=nie4 +敝=bi4 +敝帷不弃=bi4,wei2,bu4,qi4 +敝盖不弃=bi4,gai4,bu4,qi4 +敝綈恶粟=bi4,ti4,e4,su4 +敞=chang3 +敞胸露怀=chang3,xiong1,lu4,huai2 +敟=dian3 +敠=duo1,que4 +敡=yi4 +敢=gan3 +敢为敢做=gan3,wei2,gan3,zuo4 +敢作敢为=gan3,zuo4,gan3,wei2 +散=san4,san3 +散乱=san3,luan4 +散件=san3,jian4 +散体=san3,ti3 +散光=san3,guang1 +散兵=san3,bing1 +散兵游勇=san3,bing1,you2,yong3 +散剂=san3,ji4 +散发=san3,fa4 +散射=san3,she4 +散居=san3,ju1 +散工=san3,gong1 +散文=san3,wen2 +散曲=san3,qu3 +散板=san3,ban3 +散架=san3,jia4 +散沙=san3,sha1 +散漫=san3,man4 +散碎=san3,sui4 +散装=san3,zhuang1 +散见=san3,jian4 +散记=san3,ji4 +敤=ke3 +敥=yan4 +敦=dun1,dui4 +敦朴=dun1,pu3 +敧=qi1,yi3,ji1 +敨=tou3 +敩=xiao4,xue2 +敩学相长=xiao4,xue2,xiang1,zhang3 +敪=duo1,que4 +敫=jiao3 +敬=jing4 +敬业乐群=jing4,ye4,yao4,qun2 +敭=yang2 +敮=xia2 +敯=min3 +数=shu3,shu4,shuo4 +数不着=shu3,bu4,zhao2 +数不胜数=shu3,bu4,sheng4,shu3 +数位=shu4,wei4 +数值=shu4,zhi2 +数列=shu4,lie4 +数制=shu4,zhi4 +数叨=shu3,dao1 +数字=shu4,zi4 +数学=shu4,xue2 +数年如一日=shu4,nian2,ru2,yi2,ri4 +数得上=shu3,de5,shang4 +数得着=shu3,de5,zhao2 +数据=shu4,ju4 +数数=shuo4,shuo4
+数珠=shu4,zhu1 +数理逻辑=shu4,li3,luo2,ji5 +数目=shu4,mu4 +数码=shu4,ma3 +数米量柴=shu3,mi3,liang2,chai2 +数罪并罚=shu4,zui4,bing4,fa2 +数落=shu3,luo4 +数表=shu4,biao3 +数见不鲜=shuo4,jian4,bu4,xian1 +数论=shu4,lun4 +数词=shu4,ci2 +数量=shu4,liang4 +数量词=shu4,liang4,ci2 +数额=shu4,e2 +数黄道白=shu3,huang2,dao4,bai2 +数黄道黑=shu3,huang2,dao4,hei1 +数黑论白=shu3,hei1,lun4,bai2 +敱=ai2,zhu2 +敲=qiao1 +敲敲打打=qiao1,qiao1,da3,da3 +敲竹杠=qiao1,zhu2,gang4 +敲骨剥髓=qiao1,gu3,bao1,sui3 +敳=ai2 +整=zheng3 +整年累月=zheng3,nian2,lei4,yue4 +整数=zheng3,shu4 +整躬率物=zheng3,gong1,shuai4,wu4 +整顿干坤=zheng3,dun4,qian2,kun1 +敵=di2 +敶=chen2 +敷=fu1 +敷衍塞责=fu1,yan3,se4,ze2 +敷衍搪塞=fu1,yan3,tang2,se4 +數=shu3,shu4,shuo4 +敹=liao2 +敺=qu1 +敻=xiong4,xuan4 +敼=yi3 +敽=jiao3 +敾=shan4 +敿=jiao3 +斀=zhuo2,zhu2 +斁=yi4,du4 +斂=lian3 +斃=bi4 +斄=li2,tai2 +斅=xiao4 +斆=xiao4 +文=wen2 +文不对题=wen2,bu4,dui4,ti2 +文件夹=wen2,jian4,jia1 +文卷=wen2,juan4 +文坛宿将=wen2,tan2,su4,jiang4 +文房四侯=wen2,fang2,si4,hou2 +文武差事=wen2,wu3,chai1,shi4 +文章星斗=wen2,zhang1,xing1,dou3 +文艺复兴=wen2,yi4,fu4,xing1 +文蛤=wen2,ge2 +文行出处=wen2,xing2,chu1,chu3 +文身剪发=wen2,shen1,jian3,fa4 +文过遂非=wen2,guo4,sui2,fei1 +斈=xue2 +斉=qi2 +斊=qi2 +斋=zhai1 +斌=bin1 +斍=jue2,jiao4 +斎=zhai1 +斏=lang2 +斐=fei3,fei1 +斑=ban1 +斒=ban1 +斓=lan2 +斔=yu3,zhong1 +斕=lan2 +斖=wei3,men2 +斗=dou4,dou3 +斗升之水=dou3,sheng1,zhi1,shui3 +斗南一人=dou3,nan2,yi1,ren2 +斗子=dou3,zi5 +斗室=dou3,shi4 +斗折蛇行=dou3,zhe2,she2,xing2 +斗拱=dou3,gong3 +斗挹箕扬=dou3,yi4,ji1,yang2 +斗方=dou3,fang1 +斗方名士=dou3,fang1,ming2,shi4 +斗榫合缝=dou3,sun3,he2,feng4 +斗渠=dou3,qu2 +斗笠=dou3,li4 +斗筲之人=dou3,shao1,zhi1,ren2 +斗筲之器=dou3,shao1,zhi1,qi4 +斗箕=dou4,ji5 +斗篷=dou3,peng2 +斗米尺布=dou3,mi3,chi3,bu4 +斗粟尺布=dou3,su4,chi3,bu4 +斗绝一隅=dou3,jue2,yi1,yu2 +斗胆=dou3,dan3 +斗车=dou3,che1 +斗转参横=dou3,zhuan3,shen1,heng2 +斗转星移=dou3,zhuan3,xing1,yi2 +斗酒只鸡=dou3,jiu3,zhi1,ji1 +斗酒学士=dou3,jiu3,xue2,shi4 +斗酒百篇=dou3,jiu3,bai3,pian1 +斗量筲计=dou3,liang2,shao1,ji4 +斗量车载=dou3,liang2,che1,zai4 +斗门=dou3,men2 +斘=sheng1 +料=liao4 +料斗=liao4,dou3 +斚=jia3 +斛=hu2 +斜=xie2 +斝=jia3 +斞=yu3 +斟=zhen1 +斠=jiao4 +斡=wo4,guan3 +斢=tou3,tiao3 +斣=dou4 +斤=jin1 +斤斗=jin1,dou3
+斥=chi4 +斦=yin2,zhi4 +斧=fu3 +斧头=fu3,tou5 +斨=qiang1 +斩=zhan3 +斩头沥血=zhan3,tou2,li4,xue4 +斩将刈旗=zhan3,jiang4,yi4,qi2 +斩将搴旗=zhan3,jiang4,qian1,qi2 +斪=qu2 +斫=zhuo2 +斫木为舟=zhuo2,mu4,wei2,zhou1 +斫琱为朴=zhuo2,diao1,wei2,pu3 +斫雕为朴=zhuo2,diao1,wei2,pu3 +斬=zhan3 +断=duan4 +断发文身=duan4,fa4,wen2,shen1 +断还归宗=duan4,huan2,gui1,zong1 +断长续短=duan4,chang2,xu4,duan3 +断长补短=duan4,chang2,bu3,duan3 +斮=zhuo2 +斯=si1 +新=xin1 +新兴=xin1,xing1 +斱=zhuo2 +斲=zhuo2 +斳=qin2 +斴=lin2 +斵=zhuo2 +斶=chu4 +斷=duan4 +斸=zhu2 +方=fang1 +方兴未已=fang1,xing1,wei4,yi3 +方兴未艾=fang1,xing1,wei4,ai4 +方寸万重=fang1,cun4,wan4,chong2 +方正不阿=fang1,zheng4,bu4,e1 +斺=chan3,jie4 +斻=hang2 +於=yu2,wu1 +於菟=wu1,tu2 +施=shi1 +施为=shi1,wei2 +施予=shi1,yu3 +施工缝=shi1,gong1,feng4 +斾=pei4 +斿=liu2,you2 +旀=mei4 +旁=pang2,bang4 +旁观者效应=pang2,guan1,zhe3,xiao4,ying4 +旂=qi2 +旃=zhan1 +旄=mao2,mao4 +旅=lv3 +旅舍=lv3,she4 +旅贲=lv3,ben1 +旆=pei4 +旇=pi1,bi4 +旈=liu2 +旉=fu1 +旊=fang3 +旋=xuan2,xuan4 +旋干转坤=xuan2,qian2,zhuan3,kun1 +旋转干坤=xuan2,zhuan3,qian2,kun1 +旌=jing1 +旍=jing1 +旎=ni3 +族=zu2 +旐=zhao4 +旑=yi3 +旒=liu2 +旓=shao1 +旔=jian4 +旖=yi3 +旗=qi2 +旘=zhi4 +旙=fan1 +旚=piao1 +旛=fan1 +旜=zhan1 +旝=kuai4 +旞=sui4 +旟=yu2 +无=wu2 +无下箸处=wu2,xia4,zhu4,chu3 +无与为比=wu2,yu3,wei2,bi3 +无为=wu2,wei2 +无为之治=wu2,wei2,zhi1,zhi4 +无为而成=wu2,wei2,er2,cheng2 +无为而治=wu2,wei2,er2,zhi4 +无为自化=wu2,wei2,zi4,hua4 +无为自成=wu2,wei2,zi4,cheng2 +无了无休=wu2,liao3,wu2,xiu1 +无以塞责=wu2,yi3,se4,ze2 +无伤大雅=wu2,shang1,da4,ya3 +无伤无臭=wu2,shang1,wu2,xiu4 +无动为大=wu2,dong4,wei2,da4 +无可否认=wu2,ke3,fou3,ren4 +无可比拟=wu2,ke3,bi3,ni3 +无地自处=wu2,di4,zi4,chu3 +无声无臭=wu2,sheng1,wu2,xiu4 +无处=wu2,chu3 +无孔不入=wu2,kong3,bu4,ru4 +无宁=wu2,ning4 +无寇暴死=wu2,kou4,bao4,si3 +无往不胜=wu2,wang3,bu4,sheng4 +无恶不为=wu2,e4,bu4,wei2 +无恶不作=wu2,e4,bu4,zuo4 +无所不为=wu2,suo3,bu4,wei2 +无所不在=wu2,suo3,bu4,zai4 +无数=wu2,shu4 +无理数=wu2,li3,shu4 +无的放矢=wu2,di4,fang4,shi3 +无缝=wu2,feng4 +无缝天衣=wu2,feng4,tian1,yi1 +无缝钢管=wu2,feng4,gang1,guan3 +无背无侧=wu2,bei4,wu2,ce4 +无能为力=wu2,neng2,wei2,li4 +无能为役=wu2,neng2,wei2,yi4 +无臭=wu2,xiu4 +无色无臭=wu2,se4,wu2,xiu4 +无辜受累=wu2,gu1,shou4,lei3 +无適无莫=wu2,di2,wu2,mo4
+无间=wu2,jian4 +无间冬夏=wu2,jian1,dong1,xia4 +无间可乘=wu2,jian1,ke3,cheng2 +无间可伺=wu2,jian1,ke3,si4 +无间是非=wu2,jian4,shi4,fei1 +无颜落色=wu2,yan2,luo4,se4 +旡=ji4 +既=ji4 +旣=ji4 +旤=huo4 +日=ri4 +日不暇给=ri4,bu4,xia2,ji3 +日中为市=ri4,zhong1,wei2,shi4 +日中将昃=ri4,zhong1,jiang1,ze4 +日中必湲=ri4,zhong1,bi4,yuan2 +日削月割=ri4,xue1,yue4,ge1 +日削月朘=ri4,xue1,yue4,juan1 +日复一日=ri4,fu4,yi1,ri4 +日头=ri4,tou5 +日子=ri4,zi5 +日晕=ri4,yun4 +日月参辰=ri4,yue4,shen1,chen2 +日月重光=ri4,yue4,chong2,guang1 +日朘月削=ri4,juan1,yue4,xue1 +日省月修=ri4,xing3,yue4,xiu1 +日省月试=ri4,xing3,yue4,shi4 +日省月课=ri4,xing3,yue4,ke4 +日积月累=ri4,ji1,yue4,lei3 +日累月积=ri4,lei4,yue4,ji1 +日薄西山=ri4,bo2,xi1,shan1 +日进斗金=ri4,jin4,dou3,jin1 +旦=dan4 +旦不报夕=dan4,bu2,bao4,xi1 +旦种暮成=dan4,zhong4,mu4,cheng2 +旦角=dan4,jue2 +旧=jiu4 +旧事重提=jiu4,shi4,chong2,ti2 +旧地重游=jiu4,di4,chong2,you2 +旧话重提=jiu4,hua4,chong2,ti2 +旧调重弹=jiu4,diao4,chong2,tan2 +旧都=jiu4,du1 +旨=zhi3 +早=zao3 +早占勿药=zao3,zhan1,wu4,yao4 +早该淘汰=zao3,gai1,tao2,tai4 +旪=xie2 +旫=tiao1 +旬=xun2 +旭=xu4 +旮=ga1 +旯=la2 +旰=gan4,han4 +旱=han4 +旱冰场=han4,bing1,chang3 +旱冰鞋=han4,bing1,xie2 +旲=tai2,ying1 +旳=di4,di2,de5 +旴=xu1,xu4 +旵=chan3 +时=shi2 +时兴=shi2,xing1 +时差=shi2,cha1 +时行=shi2,xing2 +时调=shi2,diao4 +时运不齐=shi2,yun4,bu4,ji4 +旷=kuang4 +旷日累时=kuang4,ri4,lei3,shi2 +旸=yang2 +旹=shi2 +旺=wang4 +旻=min2 +旼=min2 +旽=tun1,zhun4 +旾=chun1 +旿=wu4,wu3 +昀=yun2 +昁=bei4 +昂=ang2 +昃=ze4 +昄=ban3 +昅=jie2 +昆=kun1 +昇=sheng1 +昈=hu4 +昉=fang3 +昊=hao4 +昋=gui4 +昌=chang1 +昌亭旅食=chang1,ting2,lv3,shi2 +昍=xuan1 +明=ming2 +明了=ming2,liao3 +明人不做暗事=ming2,ren2,bu4,zuo4,an4,shi4 +明发不寐=ming2,fa1,bu4,mei4 +明摆着=ming2,bai3,zhe5 +明效大验=ming2,xiao4,da4,yan4 +明昭昏蒙=ming2,zhao1,hun1,meng2 +明晃晃=ming2,huang3,huang3 +明窗净几=ming2,chuang1,jing4,ji1 +昏=hun1 +昏定晨省=hun1,ding4,chen2,xing3 +昏迷不省=hun1,mi2,bu4,xing3 +昏镜重明=hun1,jing4,chong2,ming2 +昏镜重磨=hun1,jing4,chong2,mo2 +昐=fen1 +昑=qin3 +昒=hu1 +易=yi4 +易地而处=yi4,di4,er2,chu3 +昔=xi1 +昕=xin1 +昖=yan2 +昗=ze4 +昘=fang3 +昙=tan2 +昚=shen4 +昛=ju4 +昜=yang2 +昝=zan3 +昞=bing3 +星=xing1 +星占=xing1,zhan1 +星宿=xing1,xiu4 +星斗=xing1,dou3 +星期日=xing1,qi1,ri4
+星相=xing1,xiang4 +星移斗转=xing1,yi2,dou3,zhuan3 +映=ying4 +昡=xuan4 +昢=po4 +昣=zhen3 +昤=ling2 +春=chun1 +春假=chun1,jia4 +春种=chun1,zhong4 +春笋怒发=chun1,sun3,nu4,fa1 +春露秋霜=chun1,lu4,qiu1,shuang1 +春风一度=chun1,feng1,yi1,du4 +春风雨露=chun1,feng1,yu3,lu4 +春风风人=chun1,feng1,feng4,ren2 +昦=hao4 +昧=mei4 +昧旦晨兴=mei4,dan4,chen2,xing1 +昨=zuo2 +昩=mo4 +昪=bian4 +昫=xu4 +昬=hun1 +昭=zhao1 +昭德塞违=zhao1,de2,se4,wei2 +昮=zong4 +是=shi4 +是不是=shi4,bu4,shi4 +是否=shi4,fou3 +是非分明=shi4,fei1,fen1,ming2 +是非只为多开口=shi4,fei1,zhi3,wei4,duo1,kai1,kou3 +是非得失=shi4,fei1,de2,shi1 +是非曲直=shi4,fei1,qu1,zhi2 +昰=shi4 +昱=yu4 +昲=fei4 +昳=die2,yi4 +昴=mao3 +昵=ni4 +昶=chang3 +昷=wen1 +昸=dong1 +昹=ai3 +昺=bing3 +昻=ang2 +昼=zhou4 +昼干夕惕=zhou4,gan4,xi1,ti4 +昼度夜思=zhou4,duo2,ye4,si1 +昽=long2 +显=xian3 +显得=xian3,de5 +显豁=xian3,huo4 +显露=xian3,lu4 +显露头角=xian3,lu4,tou2,jiao3 +昿=kuang4 +晀=tiao3 +晁=chao2 +時=shi2 +晃=huang3,huang4 +晃动=huang4,dong4 +晃悠=huang4,you1 +晃摇=huang4,yao2 +晃晃=huang4,huang3 +晃晃悠悠=huang4,huang3,you1,you1 +晃荡=huang4,dang4 +晄=huang3 +晅=xuan1 +晆=kui2 +晇=xu4,kua1 +晈=jiao3 +晉=jin4 +晊=zhi4 +晋=jin4 +晌=shang3 +晍=tong2 +晎=hong3 +晏=yan4 +晐=gai1 +晑=xiang3 +晒=shai4 +晒场=shai4,chang2 +晓=xiao3 +晓得=xiao3,de5 +晔=ye4 +晕=yun1,yun4 +晕乎乎=yun1,hu1,hu1 +晕倒=yun1,dao3 +晕场=yun4,chang3 +晕头转向=yun1,tou2,zhuan4,xiang4 +晕影=yun4,ying3 +晕池=yun4,chi2 +晕船=yun4,chuan2 +晕车=yun4,che1 +晕过去=yun1,guo4,qu4 +晕针=yun4,zhen1 +晕高=yun4,gao1 +晕高儿=yun1,gao1,er2 +晖=hui1 +晗=han2 +晘=han4 +晙=jun4 +晚=wan3 +晚一点=wan3,yi4,dian3 +晚食当肉=wan3,shi2,dang4,rou4 +晛=xian4 +晜=kun1 +晝=zhou4 +晞=xi1 +晟=sheng4,cheng2 +晠=sheng4 +晡=bu1 +晢=zhe2 +晣=zhe2 +晤=wu4 +晥=wan3 +晦=hui4 +晦盲否塞=hui4,mang2,pi3,se4 +晧=hao4 +晨=chen2 +晨兴夜寐=chen2,xing1,ye4,mei4 +晨昏定省=chen2,hun1,ding4,xing3 +晩=wan3 +晪=tian3 +晫=zhuo2 +晬=zui4 +晬面盎背=zui4,mian4,ang4,bei4 +晭=zhou3 +普=pu3 +普天率土=pu3,tian1,shuai4,tu3 +普查=pu3,cha2 +景=jing3,ying3 +晰=xi1 +晰毛辨发=xi1,mao2,bian4,fa4 +晱=shan3 +晲=ni3 +晳=xi1 +晴=qing2 +晵=qi3,du4 +晶=jing1 +晷=gui3 +晸=zheng3 +晹=yi4 +智=zhi4 +智力商数=zhi4,li4,shang1,shu4 +晻=an4,an3,yan3 +晼=wan3 +晽=lin2 +晾=liang4 +晾衣服=liang4,yi1,fu5
+晿=cheng1 +暀=wang3,wang4 +暁=xiao3 +暂=zan4 +暃=fei1 +暄=xuan1 +暅=xuan3 +暆=yi2 +暇=xia2 +暈=yun1,yun4 +暉=hui1 +暊=xu3 +暋=min3,min2 +暌=kui2 +暍=ye1 +暎=ying4 +暏=shu3,du3 +暐=wei3 +暑=shu3 +暑假=shu3,jia4 +暒=qing2 +暓=mao4 +暔=nan2 +暕=jian3,lan2 +暖=nuan3 +暖和=nuan3,huo5 +暗=an4 +暗中行事=an4,zhong1,xing2,shi4 +暗箭中人=an4,jian4,zhong4,ren2 +暘=yang2 +暙=chun1 +暚=yao2 +暛=suo3 +暜=pu3 +暝=ming2 +暞=jiao3 +暟=kai3 +暠=hao4 +暡=weng3 +暢=chang4 +暣=qi4 +暤=hao4 +暥=yan4 +暦=li4 +暧=ai4 +暨=ji4 +暩=ji4 +暪=men4 +暫=zan4 +暬=xie4 +暭=hao4 +暮=mu4 +暮暮朝朝=mu4,mu4,zhao1,zhao1 +暮虢朝虞=mu4,guo2,zhao1,yu2 +暮雨朝云=mu4,yu3,zhao1,yun2 +暯=mu4 +暰=cong1 +暱=ni4 +暲=zhang1 +暳=hui4 +暴=bao4,pu4 +暴戾恣睢=bao4,li4,zi4,sui1 +暴晒=bao4,shai4 +暴腮龙门=pu4,sai1,long2,men2 +暴衣露冠=pu4,yi1,lu4,guan4 +暴衣露盖=pu4,yi1,lu4,gai4 +暴露=bao4,lu4 +暴露无遗=bao4,lu4,wu2,yi2 +暵=han4 +暶=xuan2 +暷=chuan2 +暸=liao2 +暹=xian1 +暺=tan3 +暻=jing3 +暼=pie1 +暽=lin2 +暾=tun1 +暿=xi1,xi3 +曀=yi4 +曁=ji4 +曂=huang4 +曃=dai4 +曄=ye4 +曅=ye4 +曆=li4 +曇=tan2 +曈=tong2 +曉=xiao3 +曊=fei4 +曋=shen3 +曌=zhao4 +曍=hao4 +曎=yi4 +曏=xiang4 +曐=xing1 +曑=shen1 +曒=jiao3 +曓=bao4 +曔=jing4 +曕=yan4 +曖=ai4 +曗=ye4 +曘=ru2 +曙=shu3 +曚=meng2 +曛=xun1 +曜=yao4 +曝=pu4,bao4 +曝光=bao4,guang1 +曝背食芹=pu4,bei4,shi2,qin2 +曝露=pu4,lu4 +曞=li4 +曟=chen2 +曠=kuang4 +曡=die2 +曢=liao3 +曣=yan4 +曤=huo4 +曥=lu2 +曦=xi1 +曧=rong2 +曨=long2 +曩=nang3 +曪=luo3 +曫=luan2 +曬=shai4 +曭=tang3 +曮=yan3 +曯=zhu2 +曰=yue1 +曱=yue1 +曲=qu3,qu1 +曲子=qu3,zi5 +曲学诐行=qu1,xue2,bi4,xing2 +曲尺=qu1,chi3 +曲尽其妙=qu1,jin4,qi2,miao4 +曲径=qu1,jing4 +曲径通幽=qu1,jing4,tong1,you1 +曲意逢迎=qu1,yi4,feng2,ying2 +曲折=qu1,zhe2 +曲曲=qu1,qu1 +曲曲弯弯=qu1,qu1,wan1,wan1 +曲曲折折=qu1,qu1,zhe2,zhe2 +曲柄=qu1,bing3 +曲棍球=qu1,gun4,qiu2 +曲牌=qu3,pai2 +曲直=qu1,zhi2 +曲突徙薪=qu1,tu1,xi3,xin1 +曲笔=qu1,bi3 +曲线=qu1,xian4 +曲线图=qu1,xian4,tu2 +曲艺=qu3,yi4 +曲解=qu1,jie3 +曲调=qu3,diao4 +曲里拐弯=qu1,li3,guan3,wan1 +曲阜=qu1,fu4 +曲高和寡=qu3,gao1,he4,gua3 +曳=ye4 +曳光弹=ye4,guang1,dan4 +更=geng4,geng1 +更为=geng4,wei2 +更事=geng1,shi4 +更仆难尽=geng4,pu2,nan2,jin4 +更仆难数=geng1,pu2,nan2,shu3 +更仆难终=geng1,pu2,nan2,zhong1 +更令明号=geng1,ling4,ming2,hao4 +更动=geng1,dong4 
+更卒=geng1,zu2 +更名=geng1,ming2 +更名改姓=geng4,ming2,gai3,xing4 +更唱叠和=geng1,chang4,die2,he2 +更唱迭和=geng1,chang4,die2,he2 +更夫=geng1,fu1 +更始=geng1,shi3 +更姓改物=geng1,xing4,gai3,wu4 +更定=geng1,ding4 +更年期=geng1,nian2,qi1 +更张=geng1,zhang1 +更弦改辙=geng1,xian2,gai3,zhe2 +更弦易辙=geng1,xian2,yi4,zhe2 +更待干罢=geng4,dai4,gan4,ba4 +更换=geng1,huan4 +更改=geng1,gai3 +更新=geng1,xin1 +更新换代=geng1,xin1,huan4,dai4 +更易=geng1,yi4 +更替=geng1,ti4 +更次=geng1,ci4 +更正=geng1,zheng4 +更正错误=geng1,zheng4,cuo4,wu4 +更深=geng1,shen1 +更深人静=geng1,shen1,ren2,jing4 +更深夜静=geng1,shen1,ye4,jing4 +更生=geng1,sheng1 +更番=geng1,fan1 +更衣=geng1,yi1 +更衣室=geng1,yi1,shi4 +更进一步=geng4,jin4,yi1,bu4 +更迭=geng1,die2 +更长梦短=geng1,chang2,meng4,duan3 +更阑人静=geng1,lan2,ren2,jing3 +更难仆数=geng1,nan2,pu2,shu4 +更鼓=geng1,gu3 +曵=ye4 +曶=hu1,hu4 +曷=he2 +書=shu1 +曹=cao2 +曺=cao2 +曻=sheng1 +曼=man4 +曽=ceng2,zeng1 +曾=ceng2,zeng1 +曾不惨然=ceng2,bu4,can3,ran2 +曾参杀人=zeng1,shen1,sha1,ren2 +曾孙=zeng1,sun1 +曾母投杼=zeng1,mu3,tou2,zhu4 +曾祖=zeng1,zu3 +曾祖母=zeng1,zu3,mu3 +曾祖父=zeng1,zu3,fu4 +替=ti4 +替天行道=ti4,tian1,xing2,dao4 +最=zui4 +最后一刻=zui4,hou4,yi1,ke4 +最大公约数=zui4,da4,gong1,yue1,shu4 +最小公倍数=zui4,xiao3,gong1,bei4,shu4 +朁=can3,qian2,jian4 +朂=xu4 +會=hui4,kuai4 +朄=yin3 +朅=he2,qie4 +朆=fen1 +朇=bi4,pi2 +月=yue4 +月中折桂=yue4,zhong1,she2,gui4 +月夕花朝=yue4,xi1,hua1,zhao1 +月夜花朝=yue4,ye4,hua1,zhao1 +月头儿=yue4,tou5,er5 +月晕=yue4,yun4 +月晕础润=yue4,yun1,chu3,run4 +月氏=rou4,zhi1 +月没参横=yue4,mo4,shen1,heng2 +月相=yue4,xiang4 +月落参横=yue4,luo4,shen1,heng2 +月露之体=yue4,lu4,zhi1,ti3 +月露风云=yue4,lu4,feng1,yun2 +有=you3,you4 +有一些=you3,yi4,xie1 +有一天=you3,yi4,tian1 +有一得一=you3,yi1,de2,yi1 +有为=you3,wei2 +有以善处=you3,yi3,shan4,chu3 +有借无还=you3,jie4,wu2,huan2 +有冯有翼=you3,ping2,you3,yi4 +有加无已=you3,jia1,wu3,yi3 +有国难投=you3,guo2,nan2,tou2 +有天没日头=you3,tian1,mei2,ri4,tou2 +有得=you3,de5 +有数=you3,shu4 +有朝一日=you3,zhao1,yi1,ri4 +有模有样=you3,mu2,you3,yang4 +有求必应=you3,qiu2,bi4,ying4 +有理数=you3,li3,shu4 +有的放矢=you3,di4,fang4,shi3 +有着=you3,zhe5 +有空=you3,kong4 +有蠙可乘=you3,bin1,ke3,cheng2 +有血有肉=you3,xue4,you3,rou4 +有隙可乘=you3,xi4,ke3,cheng4 +朊=ruan3 
+朋=peng2 +朋比为奸=peng2,bi3,wei2,jian1 +朌=fen2,ban1 +服=fu2,fu4 +服丧=fu2,sang1 +服差役=fu2,chai1,yi4 +服帖=fu2,tie1 +服服帖帖=fu2,fu5,tie1,tie1 +朎=ling2 +朏=fei3,ku1 +朐=qu2,xu4,chun3 +朑=ti4 +朒=nv4,ga3 +朓=tiao3 +朔=shuo4 +朕=zhen4 +朖=lang3 +朗=lang3 +朘=juan1,zui1 +朙=ming2 +朚=huang1,mang2,wang2 +望=wang4 +望其项背=wang4,qi2,xiang4,bei4 +望尘僄声=wang4,chen2,piao4,sheng1 +望影揣情=wang4,ying3,chuai3,qing2 +望洋兴叹=wang4,yang2,xing1,tan4 +望阙谢恩=wang4,que4,xie4,en1 +望风响应=wang4,feng1,xiang3,ying4 +朜=tun1 +朝=chao2,zhao1 +朝三暮二=zhao1,san1,mu4,er4 +朝三暮四=zhao1,san1,mu4,si4 +朝不保夕=zhao1,bu4,bao3,xi1 +朝不保暮=zhao1,bu4,bao3,mu4 +朝不及夕=zhao1,bu4,ji2,xi1 +朝不虑夕=zhao1,bu4,lv4,xi1 +朝不谋夕=zhao1,bu4,mou2,xi1 +朝乾夕惕=zhao1,qian2,xi1,ti4 +朝乾夕愓=zhao1,qian2,xi1,dang4 +朝云暮雨=zhao1,yun2,mu4,yu3 +朝令夕改=zhao1,ling4,xi1,gai3 +朝令暮改=zhao1,ling4,mu4,gai3 +朝会=chao2,hui4 +朝前夕惕=zhao1,qian2,xi1,ti4 +朝升暮合=zhao1,sheng1,mu4,ge3 +朝华夕秀=zhao1,hua2,xi1,xiu4 +朝发夕至=zhao1,fa1,xi1,zhi4 +朝发暮至=zhao1,fa1,mu4,zhi4 +朝夕=zhao1,xi1 +朝夷暮跖=zhao1,yi2,mu4,zhi2 +朝奏夕召=zhao1,zou4,xi1,zhao4 +朝奏暮召=zhao1,zou4,mu4,zhao4 +朝思暮想=zhao1,si1,mu4,xiang3 +朝成夕毁=zhao1,cheng2,xi1,hui3 +朝成暮徧=zhao1,cheng2,mu4,bian4 +朝成暮毁=zhao1,cheng2,mu4,hui3 +朝成暮遍=zhao1,cheng2,mu4,bian4 +朝折暮折=zhao1,she2,mu4,she2 +朝攀暮折=zhao1,pan1,mu4,she2 +朝斯夕斯=zhao1,si1,xi1,si1 +朝日=zhao1,ri4 +朝晖=zhao1,hui1 +朝暮=zhao1,mu4 +朝更暮改=zhao1,geng1,mu4,gai3 +朝朝暮暮=zhao1,zhao1,mu4,mu4 +朝梁暮周=zhao1,liang2,mu4,zhou1 +朝梁暮晋=zhao1,liang2,mu4,jin4 +朝梁暮陈=zhao1,liang2,mu4,chen2 +朝欢暮乐=zhao1,huan1,mu4,le4 +朝歌夜弦=zhao1,ge1,ye4,xian2 +朝歌暮弦=zhao1,ge1,mu4,xian2 +朝气=zhao1,qi4 +朝生夕死=zhao1,sheng1,xi1,si3 +朝生暮死=zhao1,sheng1,mu4,si3 +朝着=chao2,zhe5 +朝种暮获=zhao1,zhong4,mu4,huo4 +朝秦暮楚=zhao1,qin2,mu4,chu3 +朝穿暮塞=zhao1,chuan1,mu4,sai1 +朝经暮史=zhao1,jing1,mu4,shi3 +朝荣夕灭=zhao1,rong2,xi1,mie4 +朝衣东市=zhao1,yi1,dong1,shi4 +朝觐圣地=chao2,jin4,sheng4,di4 +朝趁暮食=zhao1,chen4,mu4,shi2 +朝过夕改=zhao1,guo4,xi1,gai3 +朝钟暮鼓=zhao1,zhong1,mu4,gu3 +朝锺暮鼓=zhao1,zhong1,mu4,gu3 +朝闻夕改=zhao1,wen2,xi1,gai3 +朝闻夕死=zhao1,wen2,xi1,si3 +朝闻道夕死可矣=zhao1,wen2,dao4,xi1,si3,ke3,yi3 +朝阳=zhao1,yang2
+朝阳丹凤=chao2,yang2,dan1,feng4 +朝阳花=chao2,yang2,hua1 +朝霞=zhao1,xia2 +朝露=zhao1,lu4 +朝饔夕飧=zhao1,yong1,xi1,sun1 +朝鲜族=chao2,xian3,zu2 +朝齑夕盐=zhao1,ji1,xi1,yan2 +朝齑暮盐=zhao1,ji1,mu4,yan2 +朞=ji1 +期=qi1,ji1 +期年=ji1,nian2 +期数=qi1,shu4 +朠=ying1 +朡=zong1 +朢=wang4 +朣=tong2,chuang2 +朤=lang3 +朥=lao2 +朦=meng2 +朧=long2 +木=mu4 +木人石心=mu4,ren2,shi2,xin1 +木头=mu4,tou5 +木头屑=mu4,tou5,xie4 +木头木脑=mu4,tou2,mu4,nao3 +木干鸟栖=mu4,gan4,niao3,qi1 +木杆=mu4,gan3 +木栅=mu4,shan1 +木模=mu4,mu2 +木钻=mu4,zuan4 +朩=deng3 +未=wei4 +未为不可=wei4,wei2,bu4,ke3 +未了=wei4,liao3 +未曾=wei4,ceng2 +未爆炸弹=wei4,bao4,zha4,dan4 +未知数=wei4,zhi1,shu4 +未艾方兴=wei4,ai4,fang1,xing1 +末=mo4 +末了=mo4,liao3 +本=ben3 +本分=ben3,fen4 +本本分分=ben3,ben3,fen4,fen4 +本相=ben3,xiang4 +本相毕露=ben3,xiang4,bi4,lu4 +本着=ben3,zhe5 +本行=ben3,hang2 +札=zha2 +朮=shu4,shu2,zhu2 +术=shu4,shu2,zhu2 +朱=zhu1,shu2 +朱槃玉敦=zhu1,pan2,yu4,dui4 +朱盘玉敦=zhu1,pan2,yu4,dui4 +朱轓皁盖=zhu1,fan1,zao4,gai4 +朲=ren2 +朳=ba1 +朴=pu3,po4,po1,piao2 +朴刀=po1,dao1 +朴硝=po4,xiao1 +朴质=pu3,zhi4 +朵=duo3 +朵颐大嚼=duo3,yi2,da4,jiao2 +朶=duo3 +朷=dao1,tiao2,mu4 +朸=li4 +朹=qiu2,gui3 +机=ji1 +机长=ji1,zhang3 +朻=jiu1 +朼=bi3 +朽=xiu3 +朾=cheng2,cheng1 +朿=ci4 +杀=sha1 +杀人不见血=sha1,ren2,bu2,jian4,xie3 +杀伤炸弹=sha1,shang1,zha4,dan4 +杀出重围=sha1,chu1,chong2,wei2 +杀妻求将=sha1,qi1,qiu2,jiang4 +杀衣缩食=shai4,yi1,suo1,shi2 +杀鸡为黍=sha1,ji1,wei2,shu3 +杁=ru4 +杂=za2 +杂处=za2,chu3 +杂货铺=za2,huo4,pu4 +权=quan2 +权数=quan2,shu4 +杄=qian1 +杅=yu2,wu1 +杅穿皮蠹=yu2,chuan1,pi2,du4 +杆=gan1,gan3 +杆塔=gan3,ta3 +杆子=gan3,zi5 +杆秤=gan3,cheng4 +杆菌=gan3,jun1 +杇=wu1 +杈=cha1,cha4 +杉=shan1,sha1 +杉木=sha1,mu4 +杊=xun2 +杋=fan2 +杌=wu4 +杍=zi3 +李=li3 +李子=li3,zi5 +李广不侯=li3,guang3,bu4,hou2 +杏=xing4 +材=cai2 +材优干济=cai2,you1,gan4,ji4 +材大难用=cai2,da4,nan2,yong4 +材茂行絜=cai2,mao4,xing2,jie2 +材薄质衰=cai2,bo2,zhi4,shuai1 +材轻德薄=cai2,qing1,de2,bo2 +材高知深=cai2,gao1,zhi4,shen1 +村=cun1 +村舍=cun1,she4 +村长=cun1,zhang3 +杒=ren4,er2 +杓=shao2,biao1 +杔=tuo1,zhe2 +杕=di4,duo4 +杖=zhang4 +杗=mang2 +杘=chi4 +杙=yi4 +杚=gu1,gai4 +杛=gong1 +杜=du4 +杜鹃啼血=du4,juan1,ti2,xue4 +杜默为诗=du4,mo4,wei2,shi1 +杝=yi2,li4,li2,duo4,tuo4 +杞=qi3
+束=shu4 +束发=shu4,fa4 +束发封帛=shu4,fa1,feng1,bo2 +束带结发=shu4,dai4,jie2,fa1 +束戈卷甲=shu4,ge1,juan4,jia3 +束椽为柱=shu4,chuan2,wei2,zhu4 +束缊举火=shu4,yun1,ju3,huo3 +束缊还妇=shu4,yun1,huan2,fu4 +束蒲为脯=shu4,pu2,wei2,pu2 +束身自好=shu4,shen1,zi4,hao4 +杠=gang4,gang1 +杠子=gang4,zi5 +杠杆=gang4,gan3 +条=tiao2,tiao1 +条几=tiao2,ji1 +条子=tiao2,zi5 +条干=tiao2,gan4 +条畅=di2,dang4 +条贯部分=tiao2,guan4,bu4,fen1 +杢=jie2 +杣=mian2 +杤=wan4 +来=lai2 +来头=lai2,tou5 +来得=lai2,de5 +来得及=lai2,de5,ji2 +来日大难=lai2,ri4,da4,nan4 +来着=lai2,zhe5 +来者不善=lai2,zhe3,bu4,shan4 +杦=jiu3 +杧=mang2 +杨=yang2 +杩=ma3,ma4 +杪=miao3 +杫=si4,zhi3,xi3 +杬=yuan2,wan2 +杭=hang2 +杮=fei4,bei4 +杯=bei1 +杯子=bei1,zi5 +杯水车薪=bei1,shui3,che1,xin1 +杰=jie2 +東=dong1 +杲=gao3 +杳=yao3 +杴=xian1 +杵=chu3 +杶=chun1 +杷=pa2 +杸=shu1,dui4 +杹=hua4 +杺=xin1 +杻=niu3,chou3 +杼=zhu4 +杼柚之空=zhu4,zhou2,zhi1,kong1 +杼柚其空=zhu4,zhou2,qi2,kong1 +杼柚空虚=zhu4,zhou2,kong1,xu1 +杽=chou3 +松=song1 +松散=song1,san3 +松筠之节=song1,jun1,zhi1,jie2 +板=ban3 +板上钉钉=ban3,shang4,ding4,ding1 +板子=ban3,zi5 +板铺=ban3,pu4 +枀=song1 +极=ji2 +极为=ji2,wei2 +极少数=ji2,shao3,shu4 +极深研几=ji2,shen1,yan2,ji1 +极深研幾=ji2,shen1,yan2,ji1 +枂=wo4,yue4 +枃=jin4 +构=gou4 +枅=ji1 +枆=mao2 +枇=pi2 +枈=pi1,mi4 +枉=wang3 +枉曲直凑=wang3,qu3,zhi2,cou4 +枊=ang4 +枋=fang1,bing4 +枌=fen2 +枍=yi4 +枎=fu2,fu1 +枏=nan2 +析=xi1 +枑=hu4,di3 +枒=ya1 +枓=dou1 +枔=xin2 +枕=zhen3 +枕头=zhen3,tou5 +枕席还师=zhen3,xi2,huan2,shi1 +枕干之雠=zhen3,gan4,zhi1,chou2 +枕戈泣血=zhen3,ge1,qi4,xue4 +枕戈饮血=zhen3,ge1,yin3,xue4 +枕曲藉糟=zhen3,qu1,jie4,zao1 +枕石嗽流=zhen3,shi2,shu4,liu2 +枕石漱流=zhen3,shi2,sou4,liu2 +枖=yao3,yao1 +林=lin2 +林冠=lin2,guan1 +林荫道=lin2,yin1,dao4 +枘=rui4 +枙=e3,e4 +枚=mei2 +枛=zhao4 +果=guo3 +果子露=guo3,zi3,lu4 +果干=guo3,gan4 +枝=zhi1,qi2 +枝叶相持=zhi1,ye4,xing1,chi2 +枝大于本=zhi1,da4,yu4,ben3 +枝干=zhi1,gan4 +枝杈=zhi1,cha4 +枝蔓=zhi1,wan4 +枞=cong1,zong1 +枞树=cong1,shu4 +枞阳=zong1,yang2 +枟=yun4 +枠=hua4 +枡=sheng1 +枢=shu1 +枣=zao3 +枣子=zao3,zi5 +枤=di4,duo4 +枥=li4 +枦=lu2 +枧=jian3 +枨=cheng2 +枩=song1 +枪=qiang1 +枪弹=qiang1,dan4 +枪杆=qiang1,gan3 +枪杆子=qiang1,gan3,zi5 +枪林弹雨=qiang1,lin2,dan4,yu3 +枪榴弹=qiang1,liu2,dan4 +枫=feng1 
+枬=zhan1 +枭=xiao1 +枭将=xiao1,jiang4 +枮=xian1,zhen1 +枯=ku1 +枯树生华=ku1,shu4,sheng1,hua1 +枰=ping2 +枱=si4,tai2 +枲=xi3 +枳=zhi3 +枴=guai3 +枵=xiao1 +架=jia4 +架不住=jia4,bu2,zhu4 +架子=jia4,zi5 +枷=jia1 +枸=gou3,ju3 +枸杞=gou3,qi3 +枸橘=gou1,ju2 +枸橼=ju3,yuan2 +枹=bao1,fu2 +枺=mo4 +枻=yi4,xie4 +枼=ye4 +枽=ye4 +枾=shi4 +枿=nie4 +柀=bi3 +柁=tuo2,duo4 +柂=yi2,duo4,li2 +柃=ling2 +柄=bing3 +柅=ni3,chi4 +柆=la1 +柇=he2 +柈=pan2,ban4 +柉=fan2 +柊=zhong1 +柋=dai4 +柌=ci2 +柍=yang3,yang4,yang1,ying1 +柎=fu1,fu3,fu4 +柏=bai3,bo2,bo4 +柏拉图=bo2,la1,tu2 +柏林=bo2,lin2 +柏油=bai3,you2 +柏油纸=bai3,you2,zhi3 +柏油路=bai3,you2,lu4 +某=mou3 +柑=gan1 +柒=qi1 +染=ran3 +染坊=ran3,fang2 +柔=rou2 +柔情绰态=rou2,qing2,chuo4,tai4 +柕=mao4 +柖=shao2,shao4 +柗=song1 +柘=zhe4 +柙=xia2 +柚=you4,you2 +柚子=you4,zi5 +柚木=you2,mu4 +柛=shen1 +柜=gui4,ju3 +柜子=gui4,zi5 +柝=tuo4 +柞=zuo4,zha4 +柞水=zha4,shui3 +柞绸=zuo4,chou2 +柞蚕=zuo4,can2 +柟=nan2 +柠=ning2 +柡=yong3 +柢=di3,chi2 +柣=zhi4,die2 +柤=zha1,zu3,zu1 +查=cha2,zha1 +查勤=cha2,qin2 +查处=cha2,chu3 +查岗=cha2,gang3 +查帐=cha2,zhang4 +查查=cha2,cha2 +查核=cha2,he2 +查检=cha2,jian3 +查照=cha2,zhao4 +查看=cha2,kan4 +查禁=cha2,jin4 +查缉=cha2,ji1 +查铺=cha2,pu4 +柦=dan4 +柧=gu1 +柨=bu4,pu1 +柩=jiu4 +柪=ao1,ao4 +柫=fu2 +柬=jian3 +柭=ba1,fu2,pei4,bo2,bie1 +柮=duo4,zuo2,wu4 +柯=ke1 +柰=nai4 +柱=zhu4 +柲=bi4,bie2 +柳=liu3 +柳巷花街=liu3,xiang4,hua1,jie1 +柳眉倒竖=liu3,mei2,dao4,shu4 +柳街花巷=liu3,jie1,hua1,xiang4 +柳骨颜筋=liu3,gu3,yan2,jin1 +柴=chai2 +柴沟堡=chai2,gou1,bu3 +柴立不阿=chai2,li4,bu4,e1 +柵=shan1 +柶=si4 +柷=zhu4 +柸=bei1,pei1 +柹=shi4,fei4 +柺=guai3 +査=cha2,zha1 +柼=yao3 +柽=cheng1 +柾=jiu4 +柿=shi4 +柿子=shi4,zi5 +栀=zhi1 +栁=liu3 +栂=mei2 +栃=li4 +栄=rong2 +栅=zha4,shan1,shi5,ce4 +栅极=shan1,ji2 +栆=zao3 +标=biao1 +标的=biao1,di4 +标识=biao1,shi2 +栈=zhan4 +栉=zhi4 +栉比鳞差=zhi4,bi3,lin2,ci1 +栉霜沐露=zhi4,shuang1,mu4,lu4 +栊=long2 +栋=dong4 +栋折榱坏=dong4,she2,cui1,huai4 +栌=lu2 +栎=li4,yue4 +栏=lan2 +栏干=lan2,gan1 +栏杆=lan2,gan3 +栏栅=lan2,shan1 +栐=yong3 +树=shu4 +树冠=shu4,guan1 +树干=shu4,gan4 +树杈=shu4,cha4 +树荫=shu4,yin1 +树行子=shu4,hang4,zi5 +栒=xun2 +栓=shuan1 +栓塞=shuan1,se4 +栔=qi4,qie4 +栕=chen2 +栖=qi1,xi1 +栖栖=xi1,xi1
+栖栖遑遑=qi1,qi1,huang2,huang2 +栖风宿雨=qi1,feng1,xiu3,yu3 +栗=li4 +栗子=li4,zi5 +栘=yi2 +栙=xiang2 +栚=zhen4 +栛=li4 +栜=se4 +栝=gua1,tian3 +栞=kan1 +栟=ben1,bing1 +栠=ren3 +校=xiao4,jiao4 +校准=jiao4,zhun3 +校勘=jiao4,kan1 +校园=xiao4,yuan2 +校场=jiao4,chang3 +校对=jiao4,dui4 +校改=jiao4,gai3 +校样=jiao4,yang4 +校核=jiao4,he2 +校正=jiao4,zheng4 +校注=jiao4,zhu4 +校点=jiao4,dian3 +校短量长=jiao4,duan3,liang2,chang2 +校舍=xiao4,she4 +校订=jiao4,ding4 +校长=xiao4,zhang3 +校阅=jiao4,yue4 +校验=jiao4,yan4 +栢=bai3 +栣=ren3 +栤=bing4 +栥=zi1 +栦=chou2 +栧=yi4,xie4 +栨=ci4 +栩=xu3 +株=zhu1 +栫=jian4,zun4 +栬=zui4 +栭=er2 +栮=er3 +栯=you3,yu4 +栰=fa2 +栱=gong3 +栲=kao3 +栳=lao3 +栴=zhan1 +栵=lie4 +栶=yin1 +样=yang4 +核=he2,hu2 +核儿=hu2,er2 +核反应=he2,fan3,ying4 +核反应堆=he2,fan3,ying4,dui1 +核子=he2,zi5 +核弹=he2,dan4 +核查=he2,cha2 +核桃凹=he2,tao2,wa1 +根=gen1 +根生土长=gen1,sheng1,tu3,zhang3 +栺=zhi1,yi4 +栻=shi4 +格=ge2 +格杀不论=ge2,sha1,bu4,lun4 +格格不入=ge2,ge2,bu4,ru4 +格格不吐=ge1,ge1,bu4,tu3 +格格不纳=ge1,ge1,bu4,na4 +格调=ge2,diao4 +栽=zai1 +栽种=zai1,zhong4 +栽跟头=zai1,gen1,tou5 +栾=luan2 +栿=fu2 +桀=jie2 +桀骜不恭=jie2,ao4,bu4,gong1 +桀骜不逊=jie2,ao4,bu4,xun4 +桀骜不驯=jie2,ao4,bu4,xun4 +桀骜难驯=jie2,ao4,nan2,xun4 +桁=heng2,hang2 +桂=gui4 +桂冠=gui4,guan1 +桂折一枝=gui4,she2,yi1,zhi1 +桂折兰摧=gui4,she2,lan2,cui1 +桃=tao2 +桄=guang1,guang4 +桅=wei2 +桅杆=wei2,gan3 +框=kuang4 +桇=ru2 +案=an4 +案卷=an4,juan4 +案子=an4,zi3 +桉=an1 +桊=juan4 +桋=yi2,ti2 +桌=zhuo1 +桌子=zhuo1,zi5 +桍=ku1 +桎=zhi4 +桏=qiong2 +桐=tong2 +桑=sang1 +桑土绸缪=sang1,tu3,chou2,mou2 +桑户棬枢=sang1,hu4,juan4,shu1 +桑椹=sang1,shen4 +桑葚=sang1,shen4 +桑葚儿=sang1,ren4,er5 +桒=sang1 +桓=huan2 +桔=jie2,ju2 +桕=jiu4 +桖=xue4 +桗=duo4 +桘=chui2 +桙=yu2,mou2 +桚=za1,zan3 +桜=ying1 +桝=jie2 +桞=liu3 +桟=zhan4 +桠=ya1 +桠杈=ya1,cha4 +桡=rao2,nao2 +桢=zhen1 +档=dang4 +桤=qi1 +桥=qiao2 +桦=hua4 +桧=gui4,hui4 +桨=jiang3 +桩=zhuang1 +桪=xun2 +桫=suo1 +桬=sha1 +桭=chen2,zhen4 +桮=bei1 +桯=ting1,ying2 +桰=gua1 +桱=jing4 +桲=bo2 +桳=ben4,fan4 +桴=fu2 +桴鼓相应=fu2,gu3,xiang1,ying4 +桵=rui2 +桶=tong3 +桷=jue2 +桸=xi1 +桹=lang2 +桺=liu3 +桻=feng1,feng4 +桼=qi1 +桽=wen3 +桾=jun1 +桿=gan3 +梀=su4,yin4 +梁=liang2 +梁上君子=liang2,shang4,jun1,zi3
+梁孟相敬=liang2,meng4,xiang1,jin4 +梂=qiu2 +梃=ting3,ting4 +梄=you3 +梅=mei2 +梆=bang1 +梆子=bang1,zi5 +梆子腔=bang1,zi5,qiang1 +梇=long4 +梈=peng1 +梉=zhuang1 +梊=di4 +梋=xuan1,juan1,xie2 +梌=tu2,cha2 +梍=zao4 +梎=ao1,you4 +梏=gu4 +梐=bi4 +梑=di2 +梒=han2 +梓=zi3 +梔=zhi1 +梕=ren4,er2 +梖=bei4 +梗=geng3 +梗塞=geng3,se4 +梗着脖子=geng3,zhe5,bo2,zi5 +梘=jian3 +梙=huan4 +梚=wan3 +梛=nuo2 +梜=jia1 +條=tiao2,tiao1 +梞=ji4 +梟=xiao1 +梠=lv3 +梡=kuan3 +梢=shao1,sao4 +梣=chen2 +梤=fen1 +梥=song1 +梦=meng4 +梦撒撩丁=meng4,sa1,liao2,ding1 +梧=wu2 +梨=li2 +梩=si4,qi3 +梪=dou4 +梫=qin3 +梬=ying3 +梭=suo1 +梮=ju1 +梯=ti1 +梯子=ti1,zi5 +械=xie4 +梱=kun3 +梲=zhuo1 +梳=shu1 +梳子=shu1,zi5 +梴=chan1,yan2 +梵=fan4 +梵呗=fan4,bai4 +梶=wei3 +梷=jing4 +梸=li2 +梹=bin1,bing1 +梺=xia4 +梻=fo2 +梼=chou2,tao2,dao4 +梼杌=tao2,wu4 +梽=zhi4 +梾=lai2 +梿=lian2,lian3 +检=jian3 +棁=zhuo1 +棂=ling2 +棃=li2 +棄=qi4 +棅=bing3 +棆=lun2 +棇=cong1,song1 +棈=qian4 +棉=mian2 +棊=qi2 +棋=qi2 +棋输先着=qi2,shu1,xian1,zhao1 +棋输先著=qi2,shu1,xian1,zhuo2 +棋高一着=qi2,gao1,yi1,zhao1 +棌=cai3 +棍=gun4,hun4 +棍子=gun4,zi5 +棎=chan2 +棏=de2,zhe2 +棐=fei3 +棑=pai2,bei4,pei4 +棒=bang4 +棒喝=bang4,he4 +棒子面=bang4,zi5,mian4 +棓=bang4,pou3,bei4,bei1 +棔=hun1 +棕=zong1 +棖=cheng2 +棗=zao3 +棘=ji2 +棙=li4,lie4 +棚=peng2 +棚圈=peng2,juan4 +棛=yu4 +棜=yu4 +棝=gu4 +棞=jun4 +棟=dong4 +棠=tang2 +棡=gang1 +棢=wang3 +棣=di4,dai4,ti4 +棤=que4 +棥=fan2 +棦=cheng1 +棧=zhan4 +棨=qi3 +棩=yuan1 +棪=yan3,yan4 +棫=yu4 +棬=quan1,juan4 +棭=yi4 +森=sen1 +棯=ren3,shen3 +棰=chui2 +棱=leng2,leng1,ling2 +棲=qi1 +棳=zhuo1 +棴=fu2,su4 +棵=ke1 +棶=lai2 +棷=zou1,sou3 +棸=zou1 +棹=zhao4,zhuo1 +棺=guan1 +棻=fen1 +棼=fen2 +棽=chen1,shen1 +棾=qing2 +棿=ni2,ni3 +椀=wan3 +椁=guo3 +椂=lu4 +椃=hao2 +椄=jie1,qie4 +椅=yi3,yi1 +椅子=yi3,zi5 +椆=chou2,zhou4,diao1 +椇=ju3 +椈=ju2 +椉=cheng2,sheng4 +椊=zu2,cui4 +椋=liang2 +椌=qiang1,kong1 +植=zhi2 +植发冲冠=zhi2,fa4,chong1,guan4 +植发穿冠=zhi2,fa4,chuan1,guan4 +椎=zhui1,chui2 +椎埋屠狗=chui2,mai2,tu2,gou3 +椎埋狗窃=chui2,mai2,gou3,qie4 +椎天抢地=chui2,tian1,qiang3,di4 +椎心呕血=chui2,xin1,ou3,xue4 +椎心泣血=chui2,xin1,qi4,xue4 +椎心顿足=chui2,xin1,dun4,zu2 +椎心饮泣=chui2,xin1,yin3,qi4 +椎牛发冢=chui2,niu2,fa1,zhong3 
+椎牛歃血=chui2,niu2,sha4,xue4 +椎牛飨士=chui2,niu2,xiang3,shi4 +椎肤剥体=chui2,fu1,bo1,ti3 +椎肤剥髓=chui2,fu1,bo1,sui3 +椎胸跌足=chui2,xiong1,die1,zu2 +椎胸顿足=chui2,xiong1,dun4,zu2 +椎膺顿足=chui2,ying1,dun4,zu2 +椎锋陷阵=chui2,feng1,xian4,zhen4 +椎锋陷陈=chui1,feng1,xian4,chen2 +椏=ya1 +椐=ju1 +椑=bei1 +椒=jiao1 +椓=zhuo2 +椔=zi1 +椕=bin1 +椖=peng2 +椗=ding4 +椘=chu3 +椙=chang1 +椚=men1 +椛=hua1 +検=jian3 +椝=gui1 +椞=xi4 +椟=du2 +椠=qian4 +椡=dao4 +椢=gui4 +椣=dian3 +椤=luo2 +椥=zhi1 +椦=quan1,juan4,quan2 +椨=fu3 +椩=geng1 +椪=peng4 +椫=shan4 +椬=yi2 +椭=tuo3 +椮=sen1 +椯=duo3,chuan2 +椰=ye1 +椰子=ye1,zi5 +椰子树=ye1,zi5,shu4 +椱=fu4 +椲=wei3,hui1 +椳=wei1 +椴=duan4 +椵=jia3,jia1 +椶=zong1 +椷=jian1,han2 +椸=yi2 +椹=zhen1,shen4 +椺=xi2 +椻=yan4,ya4 +椼=yan3 +椽=chuan2 +椾=jian1 +椿=chun1 +楀=yu3 +楁=he2 +楂=zha1,cha2 +楃=wo4 +楄=pian1 +楅=bi1 +楆=yao1 +楇=guo1,kua3 +楈=xu1 +楉=ruo4 +楊=yang2 +楋=la4 +楌=yan2 +楍=ben3 +楎=hui1 +楏=kui2 +楐=jie4 +楑=kui2 +楒=si1 +楓=feng1 +楔=xie1 +楔子=xie1,zi5 +楕=tuo3 +楖=ji2,zhi4 +楗=jian4 +楘=mu4 +楙=mao2 +楚=chu3 +楛=ku3,hu4 +楜=hu2 +楝=lian4 +楞=leng2 +楞眉横眼=leng4,mei2,heng2,yan3 +楟=ting2 +楠=nan2 +楡=yu2 +楢=you2,you3 +楣=mei2 +楤=song3,cong1 +楥=xuan4,yuan2 +楦=xuan4 +楧=yang3,yang4,ying1 +楨=zhen1 +楩=pian2 +楪=die2,ye4 +楫=ji2 +楬=jie1 +業=ye4 +楮=chu3 +楯=shun3,dun4 +楰=yu2 +楱=cou4,zou4 +楲=wei1 +楳=mei2 +楴=di4,di3,shi4 +極=ji2 +楶=jie2 +楷=kai3,jie1 +楸=qiu1 +楹=ying2 +楺=rou2,rou4 +楻=huang2 +楼=lou2 +楼观=lou2,guan4 +楽=le4,yue4 +楾=quan2 +楿=xiang1 +榀=pin3 +榁=shi3 +概=gai4 +概数=gai4,shu4 +榃=tan2 +榄=lan3 +榅=wen1,yun4 +榆=yu2 +榇=chen4 +榈=lv2 +榉=ju3 +榊=shen2 +榋=chu1 +榌=bi1,pi5 +榍=xie4 +榎=jia3 +榏=yi4 +榐=zhan3,nian3,zhen4 +榑=fu2,fu4,bo2 +榒=nuo4 +榓=mi4 +榔=lang2 +榔头=lang2,tou5 +榕=rong2 +榖=gu3 +榗=jian4,jin4 +榘=ju3 +榙=ta1 +榚=yao3 +榛=zhen1 +榜=bang3,bang4 +榝=sha1,xie4 +榞=yuan2 +榟=zi3 +榠=ming2 +榡=su4 +榢=jia4 +榣=yao2 +榤=jie2 +榥=huang4 +榦=gan4 +榧=fei3 +榨=zha4 +榩=qian2 +榪=ma4,ma1 +榫=sun3 +榫头=sun3,tou5 +榫子=sun3,zi5 +榬=yuan2 +榭=xie4 +榮=rong2 +榯=shi2 +榰=zhi1 +榱=cui1 +榱崩栋折=cui1,beng1,dong4,she2 +榱栋崩折=cui1,dong4,beng1,she2 +榲=wen1 +榳=ting2 +榴=liu2 +榴弹=liu2,dan4 +榴弹炮=liu2,dan4,pao4 
+榵=rong2 +榶=tang2 +榷=que4 +榸=zhai1 +榹=si4 +榺=sheng4 +榻=ta4 +榼=ke1 +榽=xi1 +榾=gu4 +榾柮=gu3,duo4 +榿=qi1 +槀=gao3 +槁=gao3 +槁项没齿=gao3,xiang4,mei2,chi3 +槂=sun1 +槃=pan2 +槄=tao1 +槅=ge2 +槆=chun1 +槇=dian1 +槈=nou4 +槉=ji2 +槊=shuo4 +槊血满袖=shuo4,xue4,man3,xiu4 +構=gou4 +槌=chui2 +槍=qiang1 +槎=cha2 +槏=qian3,lian2,xian4 +槐=huai2 +槐南一梦=huai2,nan2,yi1,meng1 +槑=mei2 +槒=xu4 +槓=gang4 +槔=gao1 +槕=zhuo1 +槖=tuo2 +槗=qiao2 +様=yang4 +槙=dian1,zhen3,zhen1 +槚=jia3 +槛=jian4,kan3 +槜=zui4 +槝=dao3 +槞=long2 +槟=bin1,bing1 +槟榔=bing1,lang2 +槠=zhu1 +槡=sang1 +槢=xi2,die2 +槣=ji1,gui1 +槤=lian2,lian3 +槥=hui4 +槦=rong2,yong1 +槧=qian4 +槨=guo3 +槩=gai4 +槪=gai4 +槫=tuan2,shuan4,quan2 +槬=hua4 +槭=qi4,se4 +槮=sen1 +槯=cui1,zhi3 +槰=peng4 +槱=you3,chao3 +槲=hu2 +槳=jiang3 +槴=hu4 +槵=huan4 +槶=gui4 +槷=nie4 +槸=yi4 +槹=gao1 +槺=kang1 +槻=gui1 +槼=gui1 +槽=cao2 +槾=man4,wan4 +槿=jin3 +樀=di1 +樁=zhuang1 +樂=le4,yue4,yao4,lao4 +樃=lang2 +樄=chen2 +樅=cong1,zong1 +樆=li2,chi1 +樇=xiu1 +樈=qing2 +樉=shang3 +樊=fan2 +樋=tong1 +樌=guan4 +樍=ze2 +樎=su4 +樏=lei2,lei3 +樐=lu3 +樑=liang2 +樒=mi4 +樓=lou2 +樔=chao2,jiao3,chao1 +樕=su4 +樖=ke1 +樗=chu1 +樘=tang2 +標=biao1 +樚=lu4 +樛=jiu1,liao2 +樜=zhe4 +樝=zha1 +樞=shu1 +樟=zhang1 +樠=man2 +模=mo2,mu2 +模具=mu2,ju4 +模子=mu2,zi3 +模板=mu2,ban3 +模样=mu2,yang4 +模模糊糊=mo2,mo2,hu4,hu1 +模糊=mo2,hu4 +樢=niao3,mu4 +樣=yang4 +樤=tiao2 +樥=peng2 +樦=zhu4 +樧=sha1,xie4 +樨=xi1 +権=quan2 +横=heng2,heng4 +横事=heng4,shi4 +横加=heng4,jia1 +横加梗阻=heng4,jia1,geng3,zu3 +横征暴赋=heng4,zheng1,bao4,fu4 +横征苛役=heng4,zheng1,ke1,yi4 +横征苛敛=heng4,zheng1,ke1,lian3 +横恣=heng4,zi4 +横恩滥赏=heng4,en1,lan4,shang3 +横抢武夺=heng4,qiang3,wu3,duo2 +横抢硬夺=heng4,qiang3,ying4,duo2 +横拖倒扯=heng2,tuo1,dao4,che3 +横拖倒拽=heng2,tuo1,dao4,zhuai1 +横无忌惮=heng4,wu2,ji4,dan4 +横暴=heng4,bao4 +横死=heng4,si3 +横殃飞祸=heng4,yang1,fei1,huo4 +横灾飞祸=heng4,zai1,fei1,huo4 +横祸=heng4,huo4 +横科暴敛=heng4,ke1,bao4,lian3 +横蛮=heng4,man2 +横蛮不讲理=heng4,man2,bu4,jiang3,li3 +横议=heng4,yi4 +横话=heng4,hua4 +横财=heng4,cai2 +横逆=heng4,ni4 +樫=jian1 +樬=cong1 +樭=ji1 +樮=yan1 +樯=qiang2 +樰=xue3 +樱=ying1 +樲=er4 +樳=xun2 +樴=zhi2 +樵=qiao2 +樶=zui1 +樷=cong2 +樸=pu3 
+樹=shu4 +樺=hua4 +樻=gui4 +樼=zhen1 +樽=zun1 +樾=yue4 +樿=shan4 +橀=xi1 +橁=chun1 +橂=dian4 +橃=fa2,fei4 +橄=gan3 +橅=mo2 +橆=wu2 +橇=qiao1 +橈=rao2,nao2 +橉=lin4 +橊=liu2 +橋=qiao2 +橌=xian4 +橍=run4 +橎=fan3 +橏=zhan3,jian3 +橐=tuo2 +橑=liao2 +橒=yun2 +橓=shun4 +橔=tui2,dun1 +橕=cheng1 +橖=tang2,cheng1 +橗=meng2 +橘=ju2 +橘化为枳=ju2,hua4,wei2,zhi3 +橘子=ju2,zi5 +橘子汁=ju2,zi5,zhi1 +橙=cheng2 +橙黄桔绿=cheng2,huang2,ju2,lv4 +橚=su4,qiu1 +橛=jue2 +橜=jue2 +橝=tan2,dian4 +橞=hui4 +機=ji1 +橠=nuo2 +橡=xiang4 +橡皮图章=xiang4,pi2,tu2,zhang1 +橡皮钉子=xiang4,pi2,ding4,zi3 +橢=tuo3 +橣=ning2 +橤=rui3 +橥=zhu1 +橦=tong2,chuang2 +橧=zeng1,ceng2 +橨=fen2,fen4,fei4 +橩=qiong2 +橪=ran3,yan1 +橫=heng2,heng4 +橬=qian2 +橭=gu1 +橮=liu3 +橯=lao4 +橰=gao1 +橱=chu2 +橲=xi3 +橳=sheng4 +橴=zi3 +橵=zan1 +橶=ji3 +橷=dou1 +橸=jing1 +橹=lu3 +橺=xian4 +橻=cu1,chu5 +橼=yuan2 +橽=ta4 +橾=shu1,qiao1 +橿=jiang1 +檀=tan2 +檁=lin3 +檂=nong2 +檃=yin3 +檄=xi2 +檅=hui4 +檆=shan1 +檇=zui4 +檈=xuan2 +檉=cheng1 +檊=gan4 +檋=ju1 +檌=zui4 +檍=yi4 +檎=qin2 +檏=pu3 +檐=yan2 +檐溜=yan2,liu4 +檑=lei2 +檒=feng1 +檓=hui3 +檔=dang4 +檕=ji4 +檖=sui4 +檗=bo4 +檘=ping2,bo4 +檙=cheng2 +檚=chu3 +檛=zhua1 +檜=gui4,hui4 +檝=ji2 +檞=jie3 +檟=jia3 +檠=qing2 +檡=zhai2,shi4,tu2 +檢=jian3 +檣=qiang2 +檤=dao4 +檥=yi3 +檦=biao1,biao3 +檧=song1 +檨=she1 +檩=lin3 +檪=li4 +檫=cha2 +檬=meng2 +檭=yin2 +檮=chou2,tao2,dao3 +檯=tai2 +檰=mian2 +檱=qi2 +檲=tuan2 +檳=bin1,bing1 +檴=huo4 +檵=ji4 +檶=qian1,lian2 +檷=ni3,mi2 +檸=ning2 +檹=yi1 +檺=gao3 +檻=jian4,kan3 +檼=yin3 +檽=nou4,ruan3,ru2 +檾=qing3 +檿=yan3 +櫀=qi2 +櫁=mi4 +櫂=zhao4 +櫃=gui4 +櫄=chun1 +櫅=ji1,ji4 +櫆=kui2 +櫇=po2 +櫈=deng4 +櫉=chu2 +櫊=ge2 +櫋=mian2 +櫌=you1 +櫍=zhi4 +櫎=huang3,guo3,gu3 +櫏=qian1 +櫐=lei3 +櫑=lei2,lei3 +櫒=sa4 +櫓=lu3 +櫔=li4 +櫕=cuan2 +櫖=lv4,chu1 +櫗=mie4,mei4 +櫘=hui4 +櫙=ou1 +櫚=lv2 +櫛=zhi4 +櫜=gao1 +櫝=du2 +櫞=yuan2 +櫟=li4,yue4 +櫠=fei4 +櫡=zhuo2,zhu4 +櫢=sou3 +櫣=lian2,lian3 +櫤=jiang4 +櫥=chu2 +櫦=qing4 +櫧=zhu1 +櫨=lu2 +櫩=yan2 +櫪=li4 +櫫=zhu1 +櫬=chen4 +櫭=jue2,ji4 +櫮=e4 +櫯=su1 +櫰=huai2,gui1 +櫱=nie4 +櫲=yu4 +櫳=long2 +櫴=la4,lai4 +櫵=qiao2 +櫶=xian3 +櫷=gui1 +櫸=ju3 +櫹=xiao1 +櫺=ling2 +櫻=ying1 +櫼=jian1 +櫽=yin3 +櫾=you4,you2 +櫿=ying2 
+欀=xiang1 +欁=nong2 +欂=bo2 +欃=chan2,zhan4 +欄=lan2 +欅=ju3 +欆=shuang1 +欇=she4 +欈=wei2,zui4 +欉=cong2 +權=quan2 +欋=qu2 +欌=cang2 +欍=jiu4 +欎=yu4 +欏=luo2 +欐=li4 +欑=cuan2 +欒=luan2 +欓=dang3 +欔=qu2 +欕=yan2 +欖=lan3 +欗=lan2 +欘=zhu2 +欙=lei2 +欚=li3 +欛=ba4 +欜=nang2 +欝=yu4 +欞=ling2 +欟=guan4 +欠=qian4 +次=ci4 +次数=ci4,shu4 +次长=ci4,zhang3 +欢=huan1 +欣=xin1 +欤=yu2 +欥=yu4,yi4 +欦=qian1,xian1 +欧=ou1 +欨=xu1 +欩=chao1 +欪=chu4,qu4,xi4 +欫=qi4 +欬=kai4,ai4 +欭=yi4,yin1 +欮=jue2 +欯=xi4,kai4 +欰=xu4 +欱=he1 +欲=yu4 +欲取姑予=yu4,qu3,gu1,yu3 +欲取故予=yu4,qu3,gu4,yu3 +欳=kuai4 +欴=lang2 +欵=kuan3 +欶=shuo4,sou4 +欷=xi1 +欸=ei4,ei3,ai3 +欹=qi1 +欺=qi1 +欺上蒙下=qi1,shang4,meng1,xia4 +欺哄=qi1,hong3 +欺天诳地=qi1,tian1,kuang1,di4 +欺蒙=qi1,meng2 +欺行霸市=qi1,hang2,ba4,shi4 +欻=xu1,chua1 +欼=chi3,chuai4 +欽=qin1 +款=kuan3 +款识=kuan3,zhi4 +欿=kan3,qian4 +歀=kuan3 +歁=kan3,ke4 +歂=chuan3,chuan2 +歃=sha4 +歃血=sha4,xue4 +歃血为盟=sha4,xue4,wei2,meng2 +歄=gua1 +歅=yan1,yin1 +歆=xin1 +歇=xie1 +歈=yu2 +歉=qian4 +歊=xiao1 +歋=ye1 +歌=ge1 +歌仔戏=ge1,zai3,xi4 +歌片儿=ge1,pian1,er5 +歌莺舞燕=ge1,ying2,wu3,yan4 +歍=wu1 +歎=tan4 +歏=jin4,qun1 +歐=ou1 +歑=hu1 +歒=ti4 +歓=huan1 +歔=xu1 +歕=pen1 +歖=xi3 +歗=xiao4 +歘=xu1 +歙=xi1,she4 +歙县=she4,xian4 +歙漆阿胶=she4,qi1,e1,jiao1 +歚=shan4 +歛=lian3,han1 +歜=chu4 +歝=yi4 +歞=e4 +歟=yu2 +歠=chuo4 +歡=huan1 +止=zhi3 +止戈为武=zhi3,ge1,wei2,wu3 +止戈兴仁=zhi3,ge1,xing1,ren2 +止暴禁非=zhi3,bao4,jin4,fei1 +止血=zhi3,xue4 +正=zheng4,zheng1 +正中下怀=zheng4,zhong4,xia4,huai2 +正中己怀=zheng4,zhong4,ji3,huai2 +正人君子=zheng4,ren2,jun1,zi3 +正传=zheng4,zhuan4 +正当防卫=zheng4,dang4,fang2,wei4 +正数=zheng4,shu4 +正旦=zheng1,dan4 +正月=zheng1,yue4 +正朔=zheng1,shuo4 +正身率下=zheng4,shen1,shuai4,xia4 +此=ci3 +此唱彼和=ci3,chang4,bi3,he4 +此起彼伏=ci3,qi3,bi3,fu2 +此风不可长=ci3,feng1,bu4,ke3,zhang3 +步=bu4 +步斗踏罡=bu4,dou3,ta4,gang1 +步步为营=bu4,bu4,wei2,ying2 +步罡踏斗=bu4,gang1,ta4,dou3 +步调=bu4,diao4 +步调一致=bu4,diao4,yi1,zhi4 +步调快速=bu4,diao4,kuai4,su4 +武=wu3 +武将=wu3,jiang4 +武断专横=wu3,duan4,zhuan1,heng2 +武断乡曲=wu3,duan4,xiang1,qu1 +歧=qi2 +歨=bu4 +歩=bu4 +歪=wai1 +歪打正着=wai1,da3,zheng4,zhao2 +歪曲=wai1,qu1 +歫=ju4 +歬=qian2 +歭=zhi4,chi2 +歮=se4 +歯=chi3 
+歰=se4,sha4 +歱=zhong3 +歲=sui4 +歳=sui4 +歴=li4 +歵=ze2 +歶=yu2 +歷=li4 +歸=gui1 +歹=dai3 +歺=e4 +死=si3 +死劲儿=si3,jing4,er5 +死当=si3,dang4 +死要面子=si3,yao4,mian4,zi3 +死记硬背=si3,ji4,ying4,bei4 +死诸葛吓走生仲达=si3,zhu1,ge2,xia4,zou3,sheng1,zhong4,da2 +死诸葛能走生仲达=si3,zhu1,ge2,neng2,zou3,sheng1,zhong4,da2 +死难=si3,nan4 +死马当活马医=si3,ma3,dang1,huo2,ma3,yi1 +歼=jian1 +歽=zhe2 +歾=mo4,wen3 +歿=mo4 +殀=yao1 +殁=mo4 +殂=cu2 +殃=yang1 +殄=tian3 +殅=sheng1 +殆=dai4 +殇=shang1 +殈=xu4 +殉=xun4 +殉难=xun4,nan4 +殊=shu1 +殊功劲节=shu1,gong1,jing4,jie2 +残=can2 +残兵败将=can2,bing1,bai4,jiang4 +残卷=can2,juan4 +殌=jing3 +殍=piao3 +殎=qia4 +殏=qiu2 +殐=su4 +殑=qing2,jing4 +殒=yun3 +殒身不恤=yun3,shen1,bu2,xu4 +殓=lian4 +殔=yi4 +殕=fou3,bo2 +殖=zhi2,shi5 +殗=ye4,yan1,yan4 +殘=can2 +殙=hun1,mei4 +殚=dan1 +殛=ji2 +殜=die2 +殝=zhen1 +殞=yun3 +殟=wen1 +殠=chou4 +殡=bin4 +殢=ti4 +殣=jin4 +殤=shang1 +殥=yin2 +殦=chi1 +殧=jiu4 +殨=kui4,hui4 +殩=cuan4 +殪=yi4 +殫=dan1 +殬=du4 +殭=jiang1 +殮=lian4 +殯=bin4 +殰=du2 +殱=jian1 +殲=jian1 +殳=shu1 +殴=ou1 +段=duan4 +殶=zhu4 +殷=yin1,yan1,yin3 +殷切=yin1,qie4 +殷殷=yin3,yin3 +殷殷屯屯=yin1,yin1,tun2,tun2 +殷红=yan1,hong2 +殸=qing4,keng1,sheng1 +殹=yi4 +殺=sha1 +殻=ke2,qiao4 +殼=ke2,qiao4 +殽=xiao2,yao2,xiao4 +殾=xun4 +殿=dian4 +毀=hui3 +毁=hui3 +毁冠裂裳=hui3,guan1,lie4,chang2 +毁家纾难=hui3,jia1,shu1,nan4 +毁舟为杕=hui3,zhou1,wei2,duo4 +毁钟为铎=hui3,zhong1,wei2,duo2 +毂=gu3 +毃=qiao1 +毄=ji1 +毅=yi4 +毆=ou1 +毇=hui3 +毈=duan4 +毉=yi1 +毊=xiao1 +毋=wu2 +毋宁=wu2,ning4 +毌=guan4,wan1 +母=mu3 +毎=mei3 +每=mei3 +毐=ai3 +毑=jie3 +毒=du2,dai4 +毒气炸弹=du2,qi4,zha4,dan4 +毓=yu4 +比=bi3 +比兴=bi3,xing1 +比划=bi3,hua4 +比喻失当=bi3,yu4,shi1,dang4 +比干=bi3,gan4 +比手划脚=bi3,shou3,hua4,jiao3 +比物属事=bi3,wu4,zhu3,shi4 +比量=bi3,liang2 +比量齐观=bi3,liang4,qi2,guan1 +毕=bi4 +毕剥=bi4,bao1 +毕肖=bi4,xiao4 +毕露=bi4,lu4 +毖=bi4 +毗=pi2 +毘=pi2 +毙=bi4 +毚=chan2 +毛=mao2 +毛发=mao2,fa4 +毛发不爽=mao2,fa1,bu4,shuang3 +毛发倒竖=mao2,fa1,dao3,shu4 +毛发悚然=mao2,fa1,song3,ran2 +毛发耸然=mao2,fa1,song3,ran2 +毛呢=mao2,ni2 +毛头小子=mao2,tou2,xiao3,zi5 +毛葛=mao2,ge3 +毛遂=mao2,sui2 +毛遂自荐=mao2,sui2,zi4,jian4 +毜=hao2 +毝=cai3 +毞=bi3 +毟=lie3 +毠=jia1 +毡=zhan1 +毡子=zhan1,zi5 +毢=sai1 
+毣=mu4 +毤=tuo4 +毥=xun2,xun4 +毦=er3 +毧=rong2 +毨=xian3 +毩=ju1 +毪=mu2 +毫=hao2 +毫不含糊=hao2,bu4,han2,hu2 +毫不屑意=hao2,bu2,xie4,yi4 +毫发不爽=hao2,fa4,bu4,shuang3 +毬=qiu2 +毭=dou4,nuo4 +毮=sha1 +毯=tan3 +毯子=tan3,zi5 +毰=pei2 +毱=ju1 +毲=duo1 +毳=cui4 +毴=bi1 +毵=san1 +毶=san1 +毷=mao4 +毸=sai1,sui1 +毹=shu1 +毺=shu1 +毻=tuo4 +毼=he2 +毽=jian4 +毾=ta4 +毿=san1 +氀=lv2 +氁=mu2 +氂=mao2 +氃=tong2 +氄=rong3 +氅=chang3 +氆=pu3 +氇=lu3 +氈=zhan1 +氉=sao4 +氊=zhan1 +氋=meng2 +氌=lu3 +氍=qu2 +氎=die2 +氏=shi4,zhi1 +氐=di1,di3 +民=min2 +民为贵君为轻=min2,wei2,gui4,jun1,wei2,qing1 +民乐=min2,yue4 +民以食为天=min2,yi3,shi2,wei2,tian1 +氒=jue2 +氓=meng2,mang2 +气=qi4 +气不忿儿=qi4,bu4,fen4,er2 +气克斗牛=qi4,ke4,dou3,niu2 +气冲斗牛=qi4,chong1,dou3,niu2 +气冲牛斗=qi4,chong1,niu2,dou3 +气吞牛斗=qi4,tun1,niu2,dou3 +气喘吁吁=qi4,chuan3,xu1,xu1 +气息奄奄=qi4,xi1,yan1,yan1 +气数=qi4,shu4 +气血=qi4,xue4 +气血方刚=qi4,xue4,fang1,gang1 +氕=pie1 +氖=nai3 +気=qi4 +氘=dao1 +氙=xian1 +氚=chuan1 +氛=fen1 +氜=yang2,ri4 +氝=nei4 +氞=nei4 +氟=fu2 +氠=shen1 +氡=dong1 +氢=qing1 +氢弹=qing1,dan4 +氣=qi4 +氤=yin1 +氥=xi1 +氦=hai4 +氧=yang3 +氨=an1 +氩=ya4 +氪=ke4 +氫=qing1 +氬=ya4 +氭=dong1 +氮=dan4 +氯=lv4 +氰=qing2 +氱=yang3 +氲=yun1 +氳=yun1 +水=shui3 +水中捉月=shui3,zhong1,zhuo1,yue4 +水中著盐=shui3,zhong1,zhuo2,yan2 +水佩风裳=shui3,pei4,feng1,shang5 +水分=shui3,fen4 +水宿山行=shui3,xiu3,shan1,xing2 +水宿风餐=shui3,xiu3,feng1,can1 +水杉=shui3,shan1 +水栅=shui3,shan1 +水泊=shui3,po1 +水浒传=shui3,hu3,zhuan4 +水溜=shui3,liu4 +水漂儿=shui3,piao3,er2 +水火不兼容=shui3,huo3,bu4,xiang1,rong2 +水米无干=shui3,mi3,wu2,gan4 +水调歌头=shui3,diao4,ge1,tou2 +水过鸭背=shui3,guo4,ya1,bei4 +水长船高=shui3,zhang3,chuan2,gao1 +氵=shui3 +氶=zheng3,cheng2,zheng4 +氷=bing1 +永=yong3 +永无宁日=yong3,wu2,ning2,ri4 +氹=dang4 +氺=shui3 +氻=le4 +氼=ni4 +氽=tun3 +氾=fan4 +氿=gui3,jiu3 +汀=ting1 +汁=zhi1 +求=qiu2 +求降=qiu2,xiang2 +汃=bin1,pa4,pa1 +汄=ze4 +汅=mian3 +汆=cuan1 +汇=hui4 +汈=diao1 +汉=han4 +汉堡包=han4,pu4,bao1 +汊=cha4 +汋=zhuo2,que4 +汌=chuan4 +汍=wan2 +汎=fan4 +汏=tai4,da4 +汐=xi1 +汑=tuo1 +汒=mang2 +汓=qiu2 +汔=qi4 +汕=shan4 +汖=pin4 +汗=han4,han2 +汗出洽背=han4,chu1,qia4,bei4 +汗出浃背=han4,chu1,jia1,bei4 +汗流夹背=han4,liu2,jia1,bei4 
+汗流洽背=han4,liu2,qia4,bei4 +汗流浃背=han4,liu2,jia1,bei4 +汗血盐车=han4,xue4,yan2,che1 +汗褂儿=han4,gua4,er5 +汘=qian1 +汙=wu1 +汚=wu1 +汛=xun4 +汜=si4 +汝=ru3 +汝成人耶=ru3,cheng2,ren2,ye2 +汞=gong3 +江=jiang1 +江都=jiang1,du1 +池=chi2 +污=wu1 +汢=tu5 +汣=jiu3 +汤=tang1,shang1 +汥=zhi1,ji4 +汦=zhi3 +汧=qian1 +汨=mi4 +汩=gu3,yu4 +汩没=gu3,mo4 +汪=wang1 +汫=jing3 +汬=jing3 +汭=rui4 +汮=jun1 +汯=hong2 +汰=tai4 +汱=tai4 +汲=ji2 +汳=bian4 +汴=bian4 +汵=gan4,han2,cen2 +汶=wen4,men2 +汷=zhong1 +汸=fang1,pang1 +汹=xiong1 +汹涌淜湃=xiong1,yong3,peng2,pai4 +決=jue2 +汻=hu3,huang3 +汼=niu2,you2 +汽=qi4 +汾=fen2 +汿=xu4 +沀=xu4 +沁=qin4 +沂=yi2 +沃=wo4 +沄=yun2 +沅=yuan2 +沆=hang4 +沇=yan3 +沈=shen3,chen2 +沈博绝丽=chen2,bo2,jue2,li4 +沉=chen2 +沉吟不决=chen2,yin2,bu4,jue2 +沉吟章句=chen2,yin2,zhang1,ju4 +沉没=chen2,mo4 +沉疴宿疾=chen2,ke1,su4,ji4 +沉谋重虑=chen2,mou2,chong2,lv4 +沉降缝=chen2,jiang4,feng4 +沊=dan4 +沋=you2 +沌=dun4 +沍=hu4 +沎=huo4 +沏=qi1 +沐=mu4 +沐猴而冠=mu4,hou2,er2,guan4 +沐猴衣冠=mu4,hou2,yi1,guan4 +沐露梳风=mu4,lu4,shu1,feng1 +沐露沾霜=mu4,lu4,zhan1,shuang1 +沑=nv4,niu3 +沒=mei2,mo4 +沓=ta4,da2 +沓子=ta4,zi3 +沓来踵至=ta4,lai2,zhong3,zhi4 +沓来麕至=ta4,lai2,jun1,zhi4 +沔=mian3 +沕=mi4,wu4 +沖=chong1 +沗=hong2,pang1 +沘=bi3 +沙=sha1,sha4 +沙参=sha1,shen1 +沙子=sha1,zi5 +沙拉=sha1,la1 +沙鸥翔集=sha1,ou1,xiang2,ji2 +沚=zhi3 +沛=pei4 +沜=pan4 +沝=zhui3,zi3 +沞=za1 +沟=gou1 +沠=pai4 +没=mei2,mo4 +没世=mo4,shi4 +没世不忘=mo4,shi4,bu4,wang4 +没世不渝=mo4,shi4,bu4,yu2 +没世无称=mo4,shi4,wu2,cheng1 +没世无闻=mo4,shi4,wu2,wen2 +没世穷年=mo4,shi4,qiong2,nian2 +没世难忘=mo4,shi4,nan2,wang4 +没入=mo4,ru4 +没头没尾=mei2,tou2,mei2,wei3 +没头苍蝇=mei2,tou2,cang1,ying2 +没奈何=mo4,nai4,he2 +没完没了=mei2,wan2,mei2,liao3 +没收=mo4,shou1 +没有空=mei2,you3,kong4 +没没无闻=mo4,mo4,wu2,wen2 +没空=mei2,kong4 +没药=mo4,yao4 +没落=mo4,luo4 +没衷一是=mo4,zhong1,yi1,shi4 +没谱儿=mei2,pu3,er5 +没过=mo4,guo4 +没金饮羽=mo4,jin1,yin3,yu3 +没面子=mei2,mian4,zi5 +没齿=mo4,chi3 +没齿不忘=mo4,chi3,bu4,wang4 +没齿无怨=mo4,chi3,wu2,yuan4 +没齿难忘=mo4,chi3,nan2,wang4 +沢=ze2 +沣=feng1 +沤=ou4,ou1 +沤沫槿艳=ou1,mo4,jin3,yan4 +沤浮泡影=ou1,fu2,pao4,ying3 +沤珠槿艳=ou1,zhu1,jin3,yan4 +沥=li4 +沥血叩心=li4,xue4,kou4,xin1 +沥血披心=li4,xue4,pi1,xin1
+沥血披肝=li4,xue4,pi1,gan1 +沦=lun2 +沦没=lun2,mo4 +沧=cang1 +沨=feng1 +沩=wei2 +沪=hu4 +沫=mo4 +沬=mei4 +沭=shu4 +沮=ju3,ju4 +沮丧=ju3,sang4 +沮洳=ju4,ru4 +沯=za2 +沰=tuo1,duo2 +沱=tuo2 +沲=tuo2,duo4 +河=he2 +河伯为患=he2,bo2,wei2,huan4 +河曲=he2,qu1 +河涸海干=he2,he2,hai3,gan1 +沴=li4 +沵=mi3,li4 +沶=yi2,chi2 +沷=fa1 +沸=fei4 +沸沸汤汤=fei4,fei4,shang1,shang1 +油=you2 +油坊=you2,fang2 +油干灯尽=you2,gan1,deng1,jin4 +油炸=you2,zha2 +油炸土豆片=you2,zha2,tu3,dou4,pian4 +油炸果=you2,zha2,guo3 +油炸鬼=you2,zha2,gui3 +油电混合车=you2,dian4,hun4,he2,che1 +油腔滑调=you2,qiang1,hua2,diao4 +沺=tian2 +治=zhi4 +治丧=zhi4,sang1 +沼=zhao3 +沽=gu1 +沽名干誉=gu1,ming2,gan1,yu4 +沾=zhan1 +沾沾自好=zhan1,zhan1,zi4,hao4 +沿=yan2 +沿着=yan2,zhe5 +泀=si1 +況=kuang4 +泂=jiong3 +泃=ju1 +泄=xie4,yi4 +泄露天机=xie4,lou4,tian1,ji1 +泅=qiu2 +泆=yi4,die2 +泇=jia1 +泈=zhong1 +泉=quan2 +泊=bo2,po1 +泊地=po1,di4 +泋=hui4 +泌=mi4,bi4 +泌阳=bi4,yang2 +泍=ben1,ben4 +泎=ze2 +泏=chu4,she4 +泐=le4 +泑=you1,you4,ao1 +泒=gu1 +泓=hong2 +泔=gan1 +法=fa3 +法不阿贵=fa3,bu4,e1,gui4 +法家拂士=fa3,jia1,bi4,shi4 +法帖=fa3,tie4 +法轮常转=fa3,lun2,chang2,zhuan4 +泖=mao3 +泗=si4 +泘=hu1 +泙=peng1,ping2 +泚=ci3 +泛=fan4 +泜=zhi1 +泝=su4 +泞=ning4 +泟=cheng1 +泠=ling2 +泡=pao4,pao1 +泡子=pao1,zi3 +泡桐=pao1,tong2 +泡涨=pao4,zhang4 +泡货=pao1,huo4 +波=bo1 +波属云委=bo1,zhu3,yun2,wei3 +波骇云属=bo1,hai4,yun2,zhu3 +泣=qi4 +泣下如雨=qi4,xia4,ru2,yu3 +泣不成声=qi4,bu4,cheng2,sheng1 +泣数行下=qi4,shu4,hang2,xia4 +泣血=qi4,xue4 +泣血捶膺=qi4,xue4,chui2,ying1 +泣血枕戈=qi4,xue4,zhen3,ge1 +泣血稽颡=qi4,xue4,ji1,sang3 +泤=si4 +泥=ni2,ni4 +泥古=ni4,gu3 +泥古不化=ni4,gu3,bu4,hua4 +泥古拘方=ni4,gu3,ju1,fang1 +泥古非今=ni4,gu3,fei1,jin1 +泥名失实=ni4,ming2,shi1,shi2 +泥娃娃=ni2,wa2,wa5 +泥子=ni4,zi3 +泥巴=ni2,ba5 +泥泞=ni2,ning4 +泥而不滓=nie4,er2,bu4,zi3 +泥蟠不滓=ni2,pan2,bu4,zi3 +泦=ju2 +泧=yue4,sa4 +注=zhu4 +泩=sheng1 +泪=lei4 +泫=xuan4 +泬=jue2,xue4 +泭=fu2 +泮=pan4 +泯=min3 +泯没=min3,mo4 +泰=tai4 +泰山北斗=tai4,shan1,bei3,dou3 +泰斗=tai4,dou3 +泰来否往=tai4,lai2,pi3,wang3 +泰极而否=tai4,ji2,er2,pi3 +泰然处之=tai4,ran2,chu3,zhi1 +泱=yang1 +泲=ji3 +泳=yong3 +泴=guan4 +泵=beng4 +泶=xue2 +泷=long2,shuang1 +泸=lu2 +泹=dan4 +泺=luo4,po1 +泻=xie4 +泼=po1 +泽=ze2,shi4 +泾=jing1 +泿=yin2
+洀=pan2 +洁=jie2 +洁身累行=jie2,shen1,lei4,xing2 +洁身自好=jie2,shen1,zi4,hao4 +洂=ye4 +洃=hui1 +洄=hui2 +洅=zai4 +洆=cheng2 +洇=yin1 +洈=wei2 +洉=hou4 +洊=jian4 +洋=yang2 +洋为中用=yang2,wei2,zhong1,yong4 +洋洋纚纚=yang2,yang2,sa3,sa3 +洋相=yang2,xiang4 +洋葱头=yang2,cong1,tou2 +洋行=yang2,hang2 +洌=lie4 +洍=si4 +洎=ji4 +洏=er2 +洐=xing2 +洑=fu2,fu4 +洒=sa3,xi3 +洒心更始=sa3,xin1,geng4,shi3 +洒扫应对=sa3,sao4,ying4,dui4 +洓=se4,qi4,zi4 +洔=zhi3 +洕=yin4 +洖=wu2 +洗=xi3,xian3 +洗劫一空=xi3,jie2,yi4,kong1 +洗发精=xi3,fa4,jing1 +洗手不干=xi3,shou3,bu4,gan4 +洗衣服=xi3,yi1,fu5 +洘=kao3,kao4 +洙=zhu1 +洚=jiang4 +洛=luo4 +洜=luo4 +洝=an4,yan4,e4 +洞=dong4 +洞见症结=dong4,jian4,zheng4,jie2 +洞鉴废兴=dong4,jian4,fei4,xing1 +洟=yi2 +洠=si4 +洡=lei3,lei4 +洢=yi1 +洣=mi3 +洤=quan2 +津=jin1 +津关险塞=jin1,guan1,xian3,sai4 +洦=po4 +洧=wei3 +洨=xiao2 +洩=xie4 +洪=hong2 +洪炉燎发=hong2,lu2,liao2,fa4 +洫=xu4 +洬=su4,shuo4 +洭=kuang1 +洮=tao2 +洯=qie4,jie2 +洰=ju4 +洱=er3 +洲=zhou1 +洲际导弹=zhou1,ji4,dao3,dan4 +洲际弹道导弹=zhou1,ji4,dan4,dao4,dao3,dan4 +洳=ru4 +洴=ping2 +洵=xun2 +洶=xiong1 +洷=zhi4 +洸=guang1 +洹=huan2 +洺=ming2 +活=huo2 +活儿=huo2,er5 +活剥生吞=huo2,bao1,sheng1,tun1 +活动分子=huo2,dong4,fen4,zi3 +活塞=huo2,sai1 +活着=huo2,zhe5 +活血=huo2,xue4 +活靶子=huo2,ba3,zi5 +洼=wa1 +洽=qia4 +派=pai4 +派不是=pai4,bu2,shi4 +洿=wu1 +浀=qu1 +流=liu2 +流弹=liu2,dan4 +流氓=liu2,mang2 +流汗浃背=liu2,han4,jia1,bei4 +流离颠疐=liu2,li2,dian1,zhi4 +流血=liu2,xue4 +流露=liu2,lu4 +浂=yi4 +浃=jia1 +浃背汗流=jia1,bei4,han4,liu2 +浄=jing4 +浅=qian3,jian1 +浅浅=jian1,jian1 +浅薄=qian3,bo2 +浅露=qian3,lu4 +浆=jiang1,jiang4 +浆糊=jiang1,hu2 +浇=jiao1 +浇头=jiao1,tou5 +浇薄=jiao1,bo2 +浇风薄俗=jiao1,feng1,bo2,su2 +浈=zhen1 +浉=shi1 +浊=zhuo2 +测=ce4 +测度=ce4,duo2 +测量=ce4,liang2 +浌=fa2 +浍=kuai4,hui4 +济=ji4,ji3 +济南=ji3,nan2 +济宁=ji3,ning2 +济济=ji3,ji3 +济济一堂=ji3,ji3,yi1,tang2 +浏=liu2 +浐=chan3 +浑=hun2 +浑抡吞枣=hun2,lun2,tun1,zao3 +浑朴=hun2,pu3 +浑水摸鱼=hun2,shui3,mo1,yu2 +浑然一体=hun2,ran2,yi1,ti3 +浑球儿=hun2,qiu2,er5 +浑身冒汗=hun2,shen1,mao4,han4 +浑身解数=hun2,shen1,xie4,shu4 +浒=hu3,xu3 +浒墅关=xu3,shu4,guan1 +浓=nong2 +浓抹淡妆=nong2,mo4,dan4,zhuang1 +浓装艳抹=nong2,zhuang1,yan4,mo4 +浔=xun2 +浕=jin4 +浖=lie4 +浗=qiu2 +浘=wei3
+浙=zhe4 +浚=jun4,xun4 +浛=han2 +浜=bang1 +浝=mang2 +浞=zhuo2 +浟=you1,di2 +浠=xi1 +浡=bo2 +浢=dou4 +浣=huan4 +浤=hong2 +浥=yi4 +浦=pu3 +浧=ying3,cheng2,ying2 +浨=lan3 +浩=hao4 +浩浩汤汤=hao4,hao4,shang1,shang1 +浪=lang4 +浪头=lang4,tou5 +浫=han3 +浬=li3 +浭=geng1 +浮=fu2 +浮名薄利=fu2,ming2,bo2,li4 +浮声切响=fu2,sheng1,qie4,xiang3 +浮收勒折=fu2,shou1,le4,she2 +浮生切响=fu2,sheng1,qie4,xiang3 +浮白载笔=fu2,bai2,zai3,bi3 +浯=wu2 +浰=li4 +浱=chun2 +浲=feng2,hong2 +浳=yi4 +浴=yu4 +浴血=yu4,xue4 +浴血奋战=yu4,xue4,fen4,zhan4 +浵=tong2 +浶=lao2 +海=hai3 +海参=hai3,shen1 +海参崴=hai3,shen1,wei1 +海德堡大学=hai3,de2,pu4,da4,xue2 +海水不可斗量=hai3,shui3,bu4,ke3,dou3,liang2 +海水难量=hai3,shui3,nan2,liang2 +海禁=hai3,jin4 +海难=hai3,nan4 +浸=jin4 +浸没=jin4,mo4 +浹=jia1 +浺=chong1 +浻=jiong3,jiong1 +浼=mei3 +浽=sui1,nei3 +浾=cheng1 +浿=pei4 +涀=xian4 +涁=shen4 +涂=tu2 +涃=kun4 +涄=ping1 +涅=nie4 +涆=han4 +涇=jing1 +消=xiao1 +消长=xiao1,zhang3 +涉=she4 +涊=nian3 +涋=tu1 +涌=yong3,chong1 +涍=xiao4 +涎=xian2 +涎着脸=xian2,zhe5,lian3 +涏=ting3 +涐=e2 +涑=su4 +涒=tun1,yun1 +涓=juan1 +涓滴不剩=juan1,di1,bu2,sheng4 +涔=cen2 +涕=ti4 +涖=li4 +涗=shui4 +涘=si4 +涙=lei4 +涚=shui4 +涛=tao1 +涜=du2 +涝=lao4 +涞=lai2 +涟=lian2 +涠=wei2 +涡=wo1,guo1 +涡河=guo1,he2 +涢=yun2 +涣=huan4 +涤=di2 +涤纶=di2,lun2 +涥=heng1 +润=run4 +涧=jian4 +涨=zhang3,zhang4 +涨红=zhang4,hong2 +涨红了脸=zhang4,hong2,le5,lian3 +涩=se4 +涪=fu2 +涫=guan1 +涬=xing4 +涭=shou4,tao1 +涮=shuan4 +涯=ya2 +涰=chuo4 +涱=zhang4 +液=ye4 +液体炸弹=ye4,ti3,zha4,dan4 +涳=kong1,nang2 +涴=wan3,wo4,yuan1 +涵=han2 +涶=tuo1,tuo4 +涷=dong1 +涸=he2 +涸思干虑=he2,si1,qian2,lv4 +涹=wo1 +涺=ju1 +涻=she4 +涼=liang2,liang4 +涽=hun1 +涾=ta4 +涿=zhuo1 +淀=dian4 +淁=qie4,ji2 +淂=de2 +淃=juan4 +淄=zi1 +淅=xi1 +淆=xiao2 +淇=qi2 +淈=gu3 +淉=guo3,guan4 +淊=yan1 +淋=lin2,lin4 +淋病=lin4,bing4 +淌=tang3,chang3 +淍=zhou1 +淎=peng3 +淏=hao4 +淐=chang1 +淑=shu1 +淒=qi1 +淓=fang1 +淔=zhi2 +淕=lu4 +淖=nao4,chuo4,zhuo1 +淗=ju2 +淘=tao2 +淙=cong2 +淚=lei4 +淛=zhe4 +淜=ping2,peng2 +淝=fei2 +淞=song1 +淟=tian3 +淠=pi4,pei4 +淡=dan4 +淡妆轻抹=dan4,zhuang1,qing1,mo4 +淡汝浓抹=dan4,zhuang1,nong2,mo3 +淡泊=dan4,bo2 +淡然处之=dan4,ran2,chu3,zhi1 +淡薄=dan4,bo2 +淢=yu4,xu4 +淣=ni2 +淤=yu1 +淤塞=yu1,se4 
+淤血=yu1,xue4 +淥=lu4 +淦=gan4 +淧=mi4 +淨=jing4,cheng1 +淩=ling2 +淪=lun2 +淫=yin2 +淫言媟语=yin2,yan2,liang3,yu3 +淬=cui4 +淭=qu2 +淮=huai2 +淮橘为枳=huai2,ju2,wei2,zhi3 +淯=yu4 +淰=nian3,shen3 +深=shen1 +深仇宿怨=shen1,chou2,xiu3,yuan4 +深切=shen1,qie4 +深切着明=shen1,qie1,zhe5,ming2 +深切着白=shen1,qie1,zhe5,bai2 +深切著明=shen1,qie4,zhu4,ming2 +深切著白=shen1,qie4,zhu4,bai2 +深厉浅揭=shen1,li4,qian3,qi4 +深恶痛嫉=shen1,wu4,tong4,ji2 +深恶痛疾=shen1,wu4,tong4,ji2 +深恶痛绝=shen1,wu4,tong4,jue2 +深扃固钥=shen1,jiong1,gu4,yao4 +深文周内=shen1,wen2,zhou1,na4 +深文曲折=shen1,wen2,qu3,she2 +深更半夜=shen1,geng1,ban4,ye4 +深水炸弹=shen1,shui3,zha4,dan4 +深省=shen1,xing3 +深谷为陵=shen1,gu3,wei2,ling2 +淲=biao1,hu3 +淳=chun2,zhun1 +淴=hu1 +淵=yuan1 +淶=lai2 +混=hun4,hun2 +混为一体=hun4,wei2,yi4,ti3 +混为一谈=hun4,wei2,yi1,tan2 +混凝纸浆=hun4,ning2,zhi3,jiang4 +混子=hun4,zi5 +混战一场=hun4,zhan4,yi4,chang3 +混日子=hun4,ri4,zi5 +混活=hun2,huo2 +混浊=hun2,zhuo2 +混混噩噩=hun2,hun2,e4,e4 +混然一体=hun2,ran2,yi1,ti3 +混球儿=hun4,qiu2,er5 +混血=hun4,xue4 +混血儿=hun4,xue4,er2 +淸=qing1 +淹=yan1 +淹没=yan1,mo4 +淺=qian3 +添=tian1 +添盐着醋=tian1,yan2,zhe5,cu4 +添砖加瓦=tian1,zhuan1,jie1,wa3 +淼=miao3 +淽=zhi3 +淾=yin3 +淿=bo2 +渀=ben4 +渁=yuan1 +渂=wen4,min2 +渃=ruo4,re4,luo4 +渄=fei1 +清=qing1 +清净无为=qing1,jing4,wu2,wei2 +清官能断家务事=qing1,guan1,neng2,duan4,jia1,wu4,shi4 +清寒=qing1,han2 +清都绛阙=qing1,dou1,jiang4,que4 +清静无为=qing1,jing4,wu2,wei2 +清风劲节=qing1,feng1,jing4,jie2 +渆=yuan1 +渇=ke3 +済=ji4,ji3 +渉=she4 +渊=yuan1 +渊涓蠖濩=yuan1,juan1,huo4,hu4 +渊清玉絜=yuan1,qing1,yu4,jie2 +渋=se4 +渌=lu4 +渍=zi4 +渎=du2,dou4 +渏=yi1 +渐=jian4,jian1 +渐不可长=jian4,bu4,ke3,zhang3 +渑=mian3,sheng2 +渑池=mian3,chi2 +渒=pai4 +渓=xi1 +渔=yu2 +渔阳鞞鼓=yu3,yang2,pi2,gu3 +渕=yuan1 +渖=shen3 +渗=shen4 +渘=rou2 +渙=huan4 +渚=zhu3 +減=jian3 +渜=nuan3,nuan2 +渝=yu2 +渞=qiu2,wu4 +渟=ting2,ting1 +渠=qu2,ju4 +渡=du4 +渢=feng1 +渣=zha1 +渤=bo2 +渥=wo4 +渦=wo1,guo1 +渧=ti2,di1,di4 +渨=wei3 +温=wen1 +温凊定省=wen1,qing3,ding4,sheng3 +温差=wen1,cha1 +温席扇枕=wen1,xi2,shan1,zhen3 +温枕扇席=wen1,zhen3,shan1,xi2 +温衾扇枕=wen1,qin1,shan1,zhen3 +渪=ru2 +渫=xie4 +測=ce4 +渭=wei4 +渮=he2 +港=gang3,jiang3 +渰=yan1,yan3 +渱=hong2 +渲=xuan4 +渳=mi3 
+渴=ke3 +渵=mao2 +渶=ying1 +渷=yan3 +游=you2 +游兴=you2,xing4 +游必有方=you1,bi4,you3,fang1 +游手好闲=you2,shou3,hao4,xian2 +游泳裤衩=you2,yong3,ku4,cha3 +游说=you2,shui4 +渹=hong1,qing4 +渺=miao3 +渻=sheng3 +渼=mei3 +渽=zai1 +渾=hun2 +渿=nai4 +湀=gui3 +湁=chi4 +湂=e4 +湃=pai4 +湄=mei2 +湅=lian4 +湆=qi4 +湇=qi4 +湈=mei2 +湉=tian2 +湊=cou4 +湋=wei2 +湌=can1 +湍=tuan1 +湎=mian3 +湏=hui4,min3,xu1 +湐=po4 +湑=xu3,xu1 +湒=ji2 +湓=pen2 +湔=jian1 +湕=jian3 +湖=hu2 +湖泊=hu2,po1 +湗=feng4 +湘=xiang1 +湙=yi4 +湚=yin4 +湛=zhan4 +湛恩汪濊=zhan4,en1,wang1,hun2 +湜=shi2 +湝=jie1 +湞=zhen1 +湟=huang2 +湠=tan4 +湡=yu2 +湢=bi4 +湣=min3,hun1 +湤=shi1 +湥=tu1 +湦=sheng1 +湧=yong3 +湨=ju2 +湩=dong4 +湪=tuan4,nuan3 +湫=qiu1,jiao3 +湫隘=jiao3,ai4 +湬=qiu1,jiao3 +湭=qiu2 +湮=yan1,yin1 +湮没=yan1,mo4 +湯=tang1,shang1 +湰=long2 +湱=huo4 +湲=yuan2 +湳=nan3 +湴=ban4,pan2 +湵=you3 +湶=quan2 +湷=zhuang1,hun2 +湸=liang4 +湹=chan2 +湺=xian2 +湻=chun2 +湼=nie4 +湽=zi1 +湾=wan1 +湿=shi1 +満=man3 +溁=ying2 +溂=la4 +溃=kui4,hui4 +溃烂=kui4,lan4 +溃脓=hui4,nong2 +溄=feng2,hong2 +溅=jian4,jian1 +溆=xu4 +溇=lou2 +溈=wei2 +溉=gai4 +溊=bo1 +溋=ying2 +溌=po1 +溍=jin4 +溎=yan4,gui4 +溏=tang2 +源=yuan2 +溑=suo3 +溒=yuan2 +溓=lian2,lian3,nian2,xian2,xian4 +溔=yao3 +溕=meng2 +準=zhun3 +溗=cheng2 +溘=ke4 +溙=tai4 +溚=da2,ta3 +溛=wa1 +溜=liu1,liu4 +溜子=liu1,zi5 +溜旱冰=liu1,han2,bing1 +溜溜转=liu1,liu1,zhuan4 +溜达=liu1,da5 +溝=gou1 +溞=sao1 +溟=ming2 +溠=zha4 +溡=shi2 +溢=yi4 +溢美溢恶=yi4,mei3,yi4,wu4 +溣=lun4 +溤=ma3 +溥=pu3 +溦=wei1 +溧=li4 +溨=zai1 +溩=wu4 +溪=xi1 +溫=wen1 +溬=qiang1 +溭=ze2 +溮=shi1 +溯=su4 +溰=ai2 +溱=zhen1,qin2 +溲=sou1 +溳=yun2 +溴=xiu4 +溵=yin1 +溶=rong2 +溶血=rong2,xue4 +溷=hun4 +溸=su4 +溹=suo4 +溺=ni4,niao4 +溻=ta1 +溼=shi1 +溽=ru4 +溾=ai1 +溿=pan4 +滀=chu4,xu4 +滁=chu2 +滂=pang1 +滂沱大雨=pang2,tuo2,da4,yu3 +滃=weng3,weng1 +滄=cang1 +滅=mie4 +滆=ge2 +滇=dian1 +滈=hao4,xue4 +滉=huang4 +滊=qi4,xi4,xie1 +滋=zi1 +滋长=zi1,zhang3 +滌=di2 +滍=zhi4 +滎=xing2,ying2 +滏=fu3 +滐=jie2 +滑=hua2 +滒=ge1 +滓=zi3 +滔=tao1 +滕=teng2 +滖=sui1 +滗=bi4 +滘=jiao4 +滙=hui4 +滚=gun3 +滛=yin2 +滜=ze2,hao4 +滝=long2 +滞=zhi4 +滟=yan4 +滠=she4 +满=man3 +满不在乎=man3,bu4,zai4,hu1 +满处=man3,chu3 +满天星斗=man3,tian1,xing1,dou3 
+满载=man3,zai4 +滢=ying2 +滣=chun2 +滤=lv4 +滥=lan4 +滥竽充数=lan4,yu2,chong1,shu4 +滥调=lan4,diao4 +滦=luan2 +滧=yao2 +滨=bin1 +滩=tan1 +滪=yu4 +滫=xiu3 +滬=hu4 +滭=bi4 +滮=biao1 +滯=zhi4 +滰=jiang4 +滱=kou4 +滲=shen4 +滳=shang1 +滴=di1 +滴水不漏=di1,shui3,bu4,lou4 +滴溜儿=di1,liu4,er2 +滴滴答答=di1,di1,da1,da1 +滴露研朱=di1,lu4,yan2,zhu1 +滴露研珠=di1,lu4,yan2,zhu1 +滵=mi4 +滶=ao2 +滷=lu3 +滸=hu3,xu3 +滹=hu1 +滺=you1 +滻=chan3 +滼=fan4 +滽=yong1 +滾=gun3 +滿=man3 +漀=qing3 +漁=yu2 +漂=piao1,piao3,piao4 +漂亮=piao4,liang4 +漂亮话=piao4,liang4,hua4 +漂染=piao3,ran3 +漂泊=piao1,bo2 +漂洗=piao3,xi3 +漂白=piao3,bai2 +漃=ji4 +漄=ya2 +漅=chao2 +漆=qi1 +漇=xi3 +漈=ji4 +漉=lu4 +漊=lou2 +漋=long2 +漌=jin3 +漍=guo2 +漎=cong2,song3 +漏=lou4 +漏尽更阑=lou4,jin4,geng1,lan2 +漏斗=lou4,dou3 +漐=zhi2 +漑=gai4 +漒=qiang2 +漓=li2 +演=yan3 +漕=cao2 +漖=jiao4 +漗=cong1 +漘=chun2 +漙=tuan2,zhuan1 +漚=ou4,ou1 +漛=teng2 +漜=ye3 +漝=xi2 +漞=mi4 +漟=tang2 +漠=mo4 +漡=shang1 +漢=han4 +漣=lian2 +漤=lan3 +漥=wa1 +漦=chi2 +漧=gan1 +漨=feng2,peng2 +漩=xuan2 +漪=yi1 +漫=man4 +漫卷=man4,juan4 +漫天遍地=man4,shan1,bian4,di4 +漬=zi4 +漭=mang3 +漮=kang1 +漯=luo4,ta4 +漯河=ta4,he2 +漰=ben1,peng1 +漱=shu4 +漲=zhang3,zhang4 +漳=zhang1 +漴=chong2,zhuang4 +漵=xu4 +漶=huan4 +漷=huo3,huo4,kuo4 +漸=jian4,jian1 +漹=yan1 +漺=shuang3 +漻=liao2,liu2 +漼=cui3,cui1 +漽=ti2 +漾=yang4 +漿=jiang1,jiang4 +潀=cong2,zong3 +潁=ying3 +潂=hong2 +潃=xiu3 +潄=shu4 +潅=guan4 +潆=ying2 +潇=xiao1 +潈=cong2,zong1 +潉=kun1 +潊=xu4 +潋=lian4 +潌=zhi4 +潍=wei2 +潎=pi4,pie1 +潏=yu4 +潐=jiao4,qiao2 +潑=po1 +潒=dang4,xiang4 +潓=hui4 +潔=jie2 +潕=wu3 +潖=pa2 +潗=ji2 +潘=pan1 +潙=wei2 +潚=su4 +潛=qian2 +潜=qian2 +潜血=qian2,xue4 +潝=xi1,ya4 +潞=lu4 +潟=xi4 +潠=xun4 +潡=dun4 +潢=huang2,guang1 +潢池盗弄=huang2,shi5,dao4,nong4 +潣=min3 +潤=run4 +潥=su4 +潦=lao3,lao4,liao2 +潦倒=liao2,dao3 +潦草=liao2,cao3 +潧=zhen1 +潨=cong1,zong4 +潩=yi4 +潪=zhi2,zhi4 +潫=wan1 +潬=tan1,shan4 +潭=tan2 +潮=chao2 +潮差=chao2,cha1 +潯=xun2 +潰=kui4,hui4 +潱=ye1 +潲=shao4 +潳=tu2,zha1 +潴=zhu1 +潵=san4,sa3 +潶=hei1 +潷=bi4 +潸=shan1 +潹=chan2 +潺=chan2 +潻=shu3 +潼=tong2 +潽=pu1 +潾=lin2 +潿=wei2 +澀=se4 +澁=se4 +澂=cheng2 +澃=jiong3 +澄=cheng2,deng4 +澄汰=deng4,tai4 +澄沙=deng4,sha1 
+澄沙汰砾=deng4,sha1,tai4,li4 +澄清=cheng2,qing1 +澄结=deng4,jie2 +澅=hua4 +澆=jiao1 +澇=lao4 +澈=che4 +澉=gan3 +澊=cun1,cun2 +澋=jing3 +澌=si1 +澍=shu4,zhu4 +澎=peng2 +澏=han2 +澐=yun2 +澑=liu1,liu4 +澒=hong4,gong3 +澓=fu2 +澔=hao4 +澕=he2 +澖=xian2 +澗=jian4 +澘=shan1 +澙=xi4 +澚=ao4,yu4 +澛=lu3 +澜=lan2 +澝=ning4 +澞=yu2 +澟=lin3 +澠=mian3,sheng2 +澡=zao3 +澢=dang1 +澣=huan4 +澤=ze2,shi4 +澥=xie4 +澦=yu4 +澧=li3 +澨=shi4 +澩=xue2 +澪=ling2 +澫=wan4,man4 +澬=zi1 +澭=yong1,yong3 +澮=kuai4,hui4 +澯=can4 +澰=lian4 +澱=dian4 +澲=ye4 +澳=ao4 +澴=huan2 +澵=zhen1 +澶=chan2 +澷=man4 +澸=gan3 +澹=dan4,tan2 +澹台=tan2,tai2 +澺=yi4 +澻=sui4 +澼=pi4 +澽=ju4 +澾=ta4 +澿=qin2 +激=ji1 +激切=ji1,qie4 +激将=ji1,jiang4 +激将法=ji1,jiang4,fa3 +激薄停浇=ji1,bo2,ting2,jiao1 +激进分子=ji1,jin4,fen4,zi3 +濁=zhuo2 +濂=lian2 +濃=nong2 +濄=guo1,wo1 +濅=jin4 +濆=fen2,pen1 +濇=se4 +濈=ji2,sha4 +濉=sui1 +濊=hui4,huo4 +濋=chu3 +濌=ta4 +濍=song1 +濎=ding3,ting4 +濏=se4 +濐=zhu3 +濑=lai4 +濒=bin1 +濓=lian2 +濔=mi3,ni3 +濕=shi1 +濖=shu4 +濗=mi4 +濘=ning4 +濙=ying2 +濚=ying2 +濛=meng2 +濜=jin4 +濝=qi2 +濞=bi4,pi4 +濟=ji4,ji3 +濠=hao2 +濡=ru2 +濢=cui4,zui3 +濣=wo4 +濤=tao1 +濥=yin3 +濦=yin1 +濧=dui4 +濨=ci2 +濩=huo4,hu4 +濪=qing4 +濫=lan4 +濬=jun4,xun4 +濭=ai3,kai4,ke4 +濮=pu2 +濯=zhuo2,zhao4 +濰=wei2 +濱=bin1 +濲=gu3 +濳=qian2 +濴=ying2 +濵=bin1 +濶=kuo4 +濷=fei4 +濸=cang1 +濹=me4 +濺=jian4,jian1 +濻=wei3,dui4 +濼=luo4,po1 +濽=zan4,cuan2 +濾=lv4 +濿=li4 +瀀=you1 +瀁=yang3,yang4 +瀂=lu3 +瀃=si4 +瀄=zhi4 +瀅=ying2 +瀆=du2,dou4 +瀇=wang3,wang1 +瀈=hui1 +瀉=xie4 +瀊=pan2 +瀋=shen3 +瀌=biao1 +瀍=chan2 +瀎=mie4,mo4 +瀏=liu2 +瀐=jian1 +瀑=pu4,bao4 +瀒=se4 +瀓=cheng2,deng4 +瀔=gu3 +瀕=bin1 +瀖=huo4 +瀗=xian4 +瀘=lu2 +瀙=qin4 +瀚=han4 +瀛=ying2 +瀜=rong2 +瀝=li4 +瀞=jing4 +瀟=xiao1 +瀠=ying2 +瀡=sui3 +瀢=wei3,dui4 +瀣=xie4 +瀤=huai2,wai1 +瀥=xue4 +瀦=zhu1 +瀧=long2,shuang1 +瀨=lai4 +瀩=dui4 +瀪=fan4 +瀫=hu2 +瀬=lai4 +瀭=shu1 +瀮=lian2 +瀯=ying2 +瀰=mi2 +瀱=ji4 +瀲=lian4 +瀳=jian4,zun4 +瀴=ying1,ying3,ying4 +瀵=fen4 +瀶=lin2 +瀷=yi4 +瀸=jian1 +瀹=yue4 +瀺=chan2 +瀻=dai4 +瀼=rang2,nang3 +瀽=jian3 +瀾=lan2 +瀿=fan2 +灀=shuang4 +灁=yuan1 +灂=zhuo2,jiao4,ze2 +灃=feng1 +灄=she4 +灅=lei3 +灆=lan2 +灇=cong2 +灈=qu2 +灉=yong1 
+灊=qian2 +灋=fa3 +灌=guan4 +灍=jue2 +灎=yan4 +灏=hao4 +灐=ying2 +灑=sa3 +灒=zan4,cuan2 +灓=luan2,luan4 +灔=yan4 +灕=li2 +灖=mi3 +灗=shan4 +灘=tan1 +灙=dang3,tang3 +灚=jiao3 +灛=chan3 +灜=ying2 +灝=hao4 +灞=ba4 +灟=zhu2 +灠=lan3 +灡=lan2 +灢=nang3 +灣=wan1 +灤=luan2 +灥=xun2,quan2,quan4 +灦=xian3 +灧=yan4 +灨=gan4 +灩=yan4 +灪=yu4 +火=huo3 +火急火燎=huo3,ji2,huo3,liao3 +火烧火燎=huo3,shao1,huo3,liao3 +火耕水种=huo3,geng1,shui3,zhong4 +灬=huo3,biao1 +灭=mie4 +灭景追风=mie4,ying3,zhui1,feng1 +灭此朝食=mie4,ci3,zhao1,shi2 +灮=guang1 +灯=deng1 +灯尽油干=deng1,jin4,you2,gan1 +灯晕=deng1,yun4 +灯蛾扑火=deng1,e2,pu1,huo3 +灰=hui1 +灰蒙蒙=hui1,meng2,meng2 +灱=xiao1 +灲=xiao1 +灳=hui1 +灴=hong1 +灵=ling2 +灵长目=ling2,zhang3,mu4 +灶=zao4 +灶头=zao4,tou5 +灷=zhuan4 +灸=jiu3 +灸艾分痛=jiu3,ai4,fen1,tong4 +灹=zha4,yu4 +灺=xie4 +灻=chi4 +灼=zhuo2 +災=zai1 +灾=zai1 +灾难=zai1,nan4 +灾难深重=zai1,nan4,shen1,zhong4 +灿=can4 +炀=yang2 +炁=qi4 +炂=zhong1 +炃=fen2,ben4 +炄=niu3 +炅=jiong3,gui4 +炆=wen2 +炇=pu1 +炈=yi4 +炉=lu2 +炊=chui1 +炋=pi1 +炌=kai4 +炍=pan4 +炎=yan2 +炏=yan2 +炐=pang4,feng1 +炑=mu4 +炒=chao3 +炓=liao4 +炔=que1 +炕=kang4 +炖=dun4 +炗=guang1 +炘=xin4 +炙=zhi4 +炚=guang1 +炛=guang1 +炜=wei3 +炝=qiang4 +炞=bian1 +炟=da2 +炠=xia2 +炡=zheng1 +炢=zhu2 +炣=ke3 +炤=zhao4,zhao1 +炥=fu2 +炦=ba2 +炧=xie4 +炨=xie4 +炩=ling4 +炪=zhuo1,chu4 +炫=xuan4 +炫昼缟夜=xuan4,zhou4,gao3,ye4 +炫玉贾石=xuan4,yu4,gu3,shi2 +炫石为玉=xuan4,shi2,wei2,yu4 +炬=ju4 +炭=tan4 +炮=pao4,pao2,bao1 +炮凤烹龙=pao2,feng4,peng1,long2 +炮制=pao2,zhi4 +炮子儿=pao4,zi3,er5 +炮弹=pao4,dan4 +炮烙=pao2,luo4 +炯=jiong3 +炰=pao2,fou3 +炰鳖脍鲤=pao2,bie1,kuai4,li3 +炱=tai2 +炲=tai2 +炳=bing3 +炴=yang3 +炵=tong1 +炶=shan3,qian2,shan1 +炷=zhu4 +炸=zha4,zha2 +炸丸子=zha2,wan2,zi3 +炸元宵=zha2,yuan2,xiao1 +炸土豆条=zha2,tu3,dou4,tiao2 +炸弹=zha4,dan4 +炸弹坑=zha4,dan4,keng1 +炸油条=zha2,you2,tiao2 +炸油饼=zha2,you2,bing3 +炸烹大虾=zha2,peng1,da4,xia1 +炸糕=zha2,gao1 +炸肉丸子=zha2,rou4,wan2,zi5 +炸虾=zha2,xia1 +炸酱=zha2,jiang4 +炸鱼=zha2,yu2 +炸鸡蛋=zha2,ji1,dan4 +点=dian3 +点将=dian3,jiang4 +点心瓤子=dian3,xin1,rang2,zi5 +点手划脚=dian3,shou3,hua4,jiao3 +点指划脚=dian3,zhi3,hua4,jiao3 +点数=dian3,shu4 +点着=dian3,zhao2 +点石为金=dian3,shi2,wei2,jin1
+点种=dian3,zhong4 +為=wei2,wei4 +炻=shi2 +炼=lian4 +炽=chi4 +炾=huang3 +炿=zhou1 +烀=hu1 +烁=shuo4 +烂=lan4 +烂糊=lan4,hu2 +烃=ting1 +烄=jiao3,yao4 +烅=xu4 +烆=heng2 +烇=quan3 +烈=lie4 +烉=huan4 +烊=yang2,yang4 +烋=xiao1 +烌=xiu1 +烍=xian3 +烎=yin2 +烏=wu1 +烐=zhou1 +烑=yao2 +烒=shi4 +烓=wei1 +烔=tong2,dong4 +烕=mie4 +烖=zai1 +烗=kai4 +烘=hong1 +烘干机=hong2,gan1,ji1 +烙=lao4,luo4 +烙印=lao4,yin4 +烙铁=lao4,tie3 +烙饼=lao4,bing3 +烚=xia2 +烛=zhu2 +烛照数计=zhu2,zhao4,shu4,ji4 +烜=xuan3 +烝=zheng1 +烞=po4 +烟=yan1 +烟卷=yan1,juan4 +烟卷儿=yan1,juan3,er2 +烟幕弹=yan1,mu4,dan4 +烟斗=yan1,dou3 +烟杆=yan1,gan3 +烟熏火燎=yan1,xun1,huo3,liao3 +烟筒=yan1,tong2 +烟雾弹=yan1,wu4,dan4 +烠=hui2,hui3 +烡=guang1 +烢=che4 +烣=hui1 +烤=kao3 +烥=ju4 +烦=fan2 +烧=shao1 +烨=ye4 +烩=hui4 +烫=tang4 +烫发=tang4,fa4 +烬=jin4 +热=re4 +热切=re4,qie4 +热和=re4,huo5 +热哄哄=re4,hong3,hong3 +热得快=re4,de5,kuai4 +热核反应=re4,he2,fan3,ying4 +热核炸弹=re4,he2,zha4,dan4 +热水澡=re4,shui4,zao3 +热熬翻饼=re3,ao2,fan1,bing3 +热血=re4,xue4 +烮=lie4 +烯=xi1 +烰=fu2,pao2 +烱=jiong3 +烲=xie4,che4 +烳=pu3 +烴=ting1 +烵=zhuo2 +烶=ting3 +烷=wan2 +烸=hai3 +烹=peng1 +烹龙炮凤=peng1,long2,pao2,feng4 +烺=lang3 +烻=yan4 +烼=xu4 +烽=feng1 +烾=chi4 +烿=rong2 +焀=hu2 +焁=xi1 +焂=shu1 +焃=he4 +焄=xun1,hun1 +焅=ku4 +焆=juan1,ye4 +焇=xiao1 +焈=xi1 +焉=yan1 +焊=han4 +焊缝=han4,feng4 +焋=zhuang4 +焌=qu1,jun4 +焍=di4 +焎=xie4,che4 +焏=ji2,qi4 +焐=wu4 +焑=yan1 +焒=lv3 +焓=han2 +焔=yan4 +焕=huan4 +焕然一新=huan4,ran2,yi1,xin1 +焖=men4 +焗=ju2 +焘=dao4 +焙=bei4 +焚=fen2 +焛=lin4 +焜=kun1 +焝=hun4 +焞=tun1 +焟=xi1 +焠=cui4 +無=wu2 +焢=hong1 +焣=chao3,ju4 +焤=fu3 +焥=wo4,ai4 +焦=jiao1 +焦唇干舌=jiao1,chun2,gan4,she2 +焦沙烂石=jiao1,sha1,shi2,lan4 +焦熬投石=jiao1,ao2,tou2,shi2 +焦糊=jiao1,hu2 +焧=zong3,cong1 +焨=feng4 +焩=ping2 +焪=qiong2 +焫=ruo4 +焬=xi1,yi4 +焭=qiong2 +焮=xin4 +焯=zhuo1,chao1 +焰=yan4 +焱=yan4 +焲=yi4 +焳=jue2 +焴=yu4 +焵=gang4 +然=ran2 +焷=pi2 +焸=xiong3,ying1 +焹=gang4 +焺=sheng1 +焻=chang4 +焼=shao1 +焽=xiong3,ying1 +焾=nian3 +焿=geng1 +煀=qu1 +煁=chen2 +煂=he4 +煃=kui3 +煄=zhong3 +煅=duan4 +煆=xia1 +煇=hui1,yun4,xun1 +煈=feng4 +煉=lian4 +煊=xuan1 +煋=xing1 +煌=huang2 +煍=jiao3,qiao1 +煎=jian1 +煎炸=jian1,zha2 +煎熬=jian1,ao2 +煏=bi4 
+煐=ying1 +煑=zhu3 +煒=wei3 +煓=tuan1 +煔=shan3,qian2,shan1 +煕=xi1,yi2 +煖=nuan3 +煗=nuan3 +煘=chan2 +煙=yan1 +煚=jiong3 +煛=jiong3 +煜=yu4 +煝=mei4 +煞=sha1,sha4 +煞尾=sha1,wei3 +煞星=sha4,xing1 +煞有介事=sha4,you3,jie4,shi4 +煞气=sha4,qi4 +煞白=sha4,bai2 +煞神=sha4,shen2 +煞费心机=sha4,fei4,xin1,ji1 +煞费苦心=sha4,fei4,ku3,xin1 +煟=wei4 +煠=ye4,zha2 +煡=jin4 +煢=qiong2 +煣=rou2 +煤=mei2 +煤核儿=mei2,hu2,er2 +煤熏=mei2,xun4 +煥=huan4 +煦=xu4 +照=zhao4 +照应=zhao4,ying4 +照明弹=zhao4,ming2,dan4 +照片=zhao4,pian1 +照相=zhao4,xiang4 +煨=wei1 +煨干就湿=wei1,gan4,jiu4,shi1 +煨干避湿=wei1,gan4,bi4,shi1 +煩=fan2 +煪=qiu2 +煫=sui4 +煬=yang2,yang4 +煭=lie4 +煮=zhu3 +煯=jie1 +煰=zao4 +煱=gua1 +煲=bao1 +煳=hu2 +煴=yun1,yun3 +煵=nan3 +煶=shi4 +煷=huo3 +煸=bian1 +煹=gou4 +煺=tui4 +煻=tang2 +煼=chao3 +煽=shan1 +煾=en1,yun1 +煿=bo2 +熀=huang3 +熁=xie2 +熂=xi4 +熃=wu4 +熄=xi1 +熅=yun1,yun3 +熆=he2 +熇=he4,xiao1 +熈=xi1 +熉=yun2 +熊=xiong2 +熊据虎跱=xiong2,ju4,hu3,shen1 +熊爪子=xiong2,zhua3,zi5 +熊腰虎背=xiong2,yao1,hu3,bei4 +熋=xiong2 +熌=shan3 +熍=qiong2 +熎=yao4 +熏=xun1,xun4 +熏倒=xun4,dao3 +熏着=xun4,zhao2 +熏莸不同器=xun2,you2,bu4,tong2,qi4 +熏莸同器=xun2,you2,tong2,qi4 +熏蚊子=xun1,wen2,zi5 +熏豆腐=xun1,dou4,fu5 +熏透=xun4,tou4 +熐=mi4 +熑=lian2 +熒=ying2 +熓=wu3 +熔=rong2 +熕=gong4 +熖=yan4 +熗=qiang4 +熘=liu1 +熙=xi1 +熚=bi4 +熛=biao1 +熜=cong1,zong3 +熝=lu4,ao1 +熞=jian1 +熟=shu2 +熟思审处=shu2,si1,shen3,chu3 +熠=yi4 +熡=lou2 +熢=peng2,feng1 +熣=sui1,cui3 +熤=yi4 +熥=teng1 +熦=jue2 +熧=zong1 +熨=yun4,yu4 +熨帖=yu4,tie1 +熨斗=yun4,dou3 +熨烫=yun4,tang4 +熩=hu4 +熪=yi2 +熫=zhi4 +熬=ao1,ao2 +熬出头=ao2,chu1,tou2 +熬夜=ao2,ye4 +熬姜呷醋=ao2,jiang1,xia1,cu4 +熬心=ao2,xin1 +熬心费力=ao2,xin1,fei4,li4 +熬更守夜=ao2,geng1,shou3,ye4 +熬枯受淡=ao2,ku1,shou4,dan4 +熬油费火=ao2,you2,fei4,huo3 +熬清受淡=ao2,qing1,shou4,dan4 +熬清守淡=ao2,qing1,shou3,dan4 +熬清守谈=ao2,qing1,shou3,tan2 +熬煎=ao2,jian1 +熬熬=ao2,ao2 +熬肠刮肚=ao2,chang2,gua1,du4 +熭=wei4 +熮=liu3 +熯=han4,ran3 +熰=ou1,ou3 +熱=re4 +熲=jiong3 +熳=man4 +熴=kun1 +熵=shang1 +熶=cuan4 +熷=zeng4 +熸=jian1 +熹=xi1 +熺=xi1 +熻=xi1 +熼=yi4 +熽=xiao4 +熾=chi4 +熿=huang2,huang3 +燀=chan3,dan3,chan4 +燁=ye4 +燂=tan2 +燃=ran2 +燃烧弹=ran2,shao1,dan4 +燃烧炸弹=ran2,shao1,zha4,dan4 
+燄=yan4 +燅=xun2 +燆=qiao1 +燇=jun4 +燈=deng1 +燉=dun4 +燊=shen1 +燋=jiao1,qiao2,jue2,zhuo2 +燌=fen2 +燍=si1 +燎=liao2,liao3 +燎发摧枯=liao3,fa4,cui1,ku1 +燎如观火=liao3,ru2,guan1,huo3 +燏=yu4 +燐=lin2 +燑=tong2,dong4 +燒=shao1 +燓=fen2 +燔=fan2 +燕=yan4,yan1 +燕京=yan1,jing1 +燕处危巢=yan4,chu3,wei1,chao2 +燕处焚巢=yan4,chu3,fen2,chao2 +燕子=yan4,zi5 +燕子衔食=yan4,zi3,xian2,shi2 +燕山=yan1,shan1 +燕岱之石=yan1,dai4,zhi1,shi2 +燕巢幙上=yan4,chao2,yu2,shang4 +燕市悲歌=yan1,shi4,bei1,ge1 +燕昭好马=yan1,zhao1,hao3,ma3 +燕昭市骏=yan1,zhao1,shi4,jun4 +燕歌赵舞=yan1,ge1,zhao4,wu3 +燕石妄珍=yan1,shi2,wang4,zhen1 +燕赵=yan1,zhao4 +燕金募秀=yan1,jin1,mu4,xiu4 +燕雀处堂=yan4,que4,chu3,tang2 +燕雀处屋=yan4,que4,chu3,wu1 +燕驾越毂=yan1,jia4,yue4,gu1 +燕骏千金=yan1,jun4,qian1,jin1 +燕麦=yan1,mai4 +燕麦粥=yan1,mai4,zhou1 +燖=xun2 +燗=lan4 +燘=mei3 +燙=tang4 +燚=yi4 +燛=jiong3 +燜=men4 +燝=zhu3 +燞=jiao3 +營=ying2 +燠=yu4 +燡=yi4 +燢=xue2 +燣=lan2 +燤=tai4,lie4 +燥=zao4 +燦=can4 +燧=sui4 +燨=xi1 +燩=que4 +燪=zong3 +燫=lian2 +燬=hui3 +燭=zhu2 +燮=xie4 +燯=ling2 +燰=wei1 +燱=yi4 +燲=xie2 +燳=zhao4 +燴=hui4 +燵=da2 +燶=nong2 +燷=lan2 +燸=xu1 +燹=xian3 +燺=he4 +燻=xun1 +燼=jin4 +燽=chou2 +燾=dao4 +燿=yao4 +爀=he4 +爁=lan4 +爂=biao1 +爃=rong2,ying2 +爄=li4,lie4 +爅=mo4 +爆=bao4 +爆破炸弹=bao4,po4,zha4,dan4 +爇=ruo4 +爈=lv4 +爉=la4,lie4 +爊=ao1 +爋=xun1,xun4 +爌=kuang4,huang3 +爍=shuo4 +爎=liao2,liao3 +爏=li4 +爐=lu2 +爑=jue2 +爒=liao2,liao3 +爓=yan4,xun2 +爔=xi1 +爕=xie4 +爖=long2 +爗=ye4 +爘=can1 +爙=rang3 +爚=yue4 +爛=lan4 +爜=cong2 +爝=jue2 +爞=chong2 +爟=guan4 +爠=qu2 +爡=che4 +爢=mi2 +爣=tang3 +爤=lan4 +爥=zhu2 +爦=lan3,lan4 +爧=ling2 +爨=cuan4 +爩=yu4 +爪=zhao3,zhua3 +爪儿=zhua3,er5 +爪子=zhua3,zi5 +爪尖儿=zhua3,jian1,er5 +爪牙=zhao3,ya2 +爫=zhao3,zhua3 +爬=pa2 +爭=zheng1 +爮=pao2 +爯=cheng1,chen4 +爰=yuan2 +爱=ai4 +爱丽舍宫=ai4,li4,she4,gong1 +爱人好士=ai4,ren2,hao4,shi4 +爱好=ai4,hao4 +爱生恶死=ai4,sheng1,wu4,si3 +爱面子=ai4,mian4,zi5 +爲=wei2,wei4 +爳=han5 +爴=jue2 +爵=jue2 +爵士乐=jue2,shi4,yue4 +父=fu4,fu3 +父为子隐=fu4,wei2,zi3,yin3 +父债子还=fu4,zhai4,zi3,huan2 +爷=ye2 +爸=ba4 +爹=die1 +爺=ye2 +爻=yao2 +爼=zu3 +爽=shuang3 +爾=er3 +爿=pan2 +牀=chuang2 +牁=ke1 +牂=zang1 +牃=die2 +牄=qiang1 +牅=yong1 +牆=qiang2 
+片=pian4,pian1 +片儿=pian1,er5 +片儿汤=pian1,er5,tang1 +片头=pian1,tou2 +片子=pian1,zi5 +片文只事=pian4,wen2,zhi1,shi4 +片甲不还=pian4,jia3,bu4,huan2 +片纸只字=pian4,zhi3,zhi1,zi4 +片言只字=pian4,yan2,zhi3,zi4 +片词只句=pian4,ci2,zhi1,ju4 +片语只辞=pian4,yan2,zhi3,ci2 +片长薄技=pian4,chang2,bo2,ji4 +片鳞只甲=pian4,lin2,zhi1,jia3 +版=ban3 +牉=pan4 +牊=chao2 +牋=jian1 +牌=pai2 +牌坊=pai2,fang1 +牌子=pai2,zi5 +牍=du2 +牎=chuang1 +牏=yu2 +牐=zha2 +牑=bian1,mian4 +牒=die2 +牓=bang3 +牔=bo2 +牕=chuang1 +牖=you3 +牗=you3,yong1 +牘=du2 +牙=ya2 +牙碜=ya2,chen5 +牙缝=ya2,feng4 +牚=cheng1,cheng4 +牛=niu2 +牛不喝水强按头=niu2,bu4,he1,shui3,qiang3,an4,tou2 +牛仔=niu2,zai3 +牛头不对马嘴=niu2,tou2,bu4,dui4,ma3,zui3 +牛头不对马面=niu2,tou2,bu4,dui4,ma3,mian4 +牛头刨=niu2,tou2,bao4 +牛肚=niu2,du3 +牛蒡子=niu2,bang4,zi5 +牛鞅=niu2,yang4 +牛骥同皁=niu2,ji4,tong2,wen3 +牜=niu2 +牝=pin4 +牞=jiu1,le4 +牟=mou2,mu4 +牟平=mu4,ping2 +牠=ta1 +牡=mu3 +牢=lao2 +牢什古子=lao2,shi2,gu3,zi5 +牢笼=lao2,long2 +牣=ren4 +牤=mang1 +牥=fang1 +牦=mao2 +牧=mu4 +牧猪奴戏=mu4,zhou4,nu2,xi4 +牨=gang1 +物=wu4 +物以希为贵=wu4,yi3,xi1,wei2,gui4 +物以稀为贵=wu4,yi3,xi1,wei2,gui4 +物稀为贵=wu4,xi1,wei2,gui4 +物竞天择=wu4,jin4,tian1,ze2 +物美价廉=jia4,lian2,wu4,mei3 +牪=yan4 +牫=ge1,qiu2 +牬=bei4 +牭=si4 +牮=jian4 +牯=gu3 +牯牛=gu3,niu2 +牰=you4,chou1 +牱=ke1 +牲=sheng1 +牲畜=sheng1,chu4 +牳=mu3 +牴=di3 +牵=qian1 +牵一发而动全身=qian1,yi1,fa4,er2,dong4,quan2,shen1 +牵强=qian1,qiang3 +牵强附会=qian1,qiang2,fu4,hui4 +牵强附合=qian1,qiang2,fu4,he2 +牵着鼻子走=qian1,zhe5,bi2,zi5,zou3 +牵累=qian1,lei3 +牵羊担酒=qian1,yang2,dan4,jiu3 +牶=quan4 +牷=quan2 +牸=zi4 +特=te4 +特徵=te4,zhi3 +牺=xi1 +牻=mang2 +牼=keng1 +牽=qian1 +牾=wu3 +牿=gu4 +犀=xi1 +犁=li2 +犁庭扫闾=li2,ting2,sao3,lv3 +犁牛骍角=li2,niu2,xing1,jiao3 +犁生骍角=li2,sheng1,xing1,jiao3 +犂=li2 +犃=pou3 +犄=ji1 +犅=gang1 +犆=zhi2,te4 +犇=ben1 +犈=quan2 +犉=chun2 +犊=du2 +犋=ju4 +犌=jia1 +犍=jian1,qian2 +犍为=qian2,wei2 +犎=feng1 +犏=pian1 +犐=ke1 +犑=ju2 +犒=kao4 +犓=chu2 +犔=xi4 +犕=bei4 +犖=luo4 +犗=jie4 +犘=ma2 +犙=san1 +犚=wei4 +犛=mao2,li2 +犜=dun1 +犝=tong2 +犞=qiao2 +犟=jiang4 +犠=xi1 +犡=li4 +犢=du2 +犣=lie4 +犤=pai2 +犥=piao1 +犦=bao4 +犧=xi1 +犨=chou1 +犩=wei2 +犪=kui2 +犫=chou1 +犬=quan3 +犭=quan3 +犮=quan3,ba2 
+犯=fan4 +犯不着=fan4,bu4,zhao2 +犯得上=fan4,de5,shang4 +犯得着=fan4,de5,zhao2 +犯禁=fan4,jin4 +犯而不校=fan4,er2,bu4,jiao4 +犰=qiu2 +犱=ji3 +犲=chai2 +犳=zhuo2,bao4 +犴=han1,an4 +犵=ge1 +状=zhuang4 +犷=guang3 +犸=ma3 +犹=you2 +犹疑=you2,ni3 +犺=kang4,gang3 +犻=pei4,fei4 +犼=hou3 +犽=ya4 +犾=yin2 +犿=huan1,fan1 +狀=zhuang4 +狁=yun3 +狂=kuang2 +狂风怒号=kuang2,feng1,nu4,hao2 +狃=niu3 +狄=di2 +狅=kuang2 +狆=zhong4 +狇=mu4 +狈=bei4 +狉=pi1 +狊=ju2 +狋=yi2,quan2,chi2 +狌=sheng1,xing1 +狍=pao2 +狎=xia2 +狏=tuo2,yi2 +狐=hu2 +狐狸尾巴=hu2,li5,wei3,ba5 +狐裘尨茸=hu2,qiu2,meng2,rong2 +狐裘蒙戎=hu2,qiu2,meng2,rong2 +狐裘蒙茸=hu2,qiu2,meng2,rong2 +狑=ling2 +狒=fei4 +狓=pi1 +狔=ni3 +狕=yao3 +狖=you4 +狗=gou3 +狗续侯冠=gou3,xu4,hou4,guan4 +狗血喷头=gou3,xue4,pen1,tou2 +狗血淋头=gou3,xue4,lin2,tou2 +狗追耗子=gou3,zhui1,hao4,zi3 +狘=xue4 +狙=ju1 +狚=dan4 +狛=bo2 +狜=ku3 +狝=xian3 +狞=ning2 +狟=huan2,huan1 +狠=hen3 +狡=jiao3 +狢=he2,mo4 +狣=zhao4 +狤=jie2 +狥=xun4 +狦=shan1 +狧=ta4,shi4 +狧穅及米=shi4,kan3,ji2,mi3 +狨=rong2 +狩=shou4 +狪=tong2,dong4 +狫=lao3 +独=du2 +独具只眼=du2,ju4,zhi1,yan3 +独处=du2,chu3 +独当一面=du2,dang1,yi1,mian4 +独有千秋=du2,you4,qian1,qiu1 +独辟蹊径=du2,pi4,xi1,jing4 +狭=xia2 +狭缝=xia2,feng4 +狮=shi1 +狮子=shi1,zi5 +狮子大开口=shi1,zi1,da4,kai1,kou3 +狯=kuai4 +狰=zheng1 +狱=yu4 +狲=sun1 +狳=yu2 +狴=bi4 +狴犴=bi4,an4 +狵=mang2,dou4 +狶=xi1,shi3 +狷=juan4 +狸=li2 +狹=xia2 +狺=yin2 +狻=suan1 +狼=lang2 +狼号鬼哭=lang2,hao2,gui3,ku1 +狼吞虎咽=lang2,tun1,hu3,yan4 +狼狈为奸=lang2,bei4,wei2,jian1 +狼艰狈蹶=lang2,jian1,bei4,jue3 +狼藉=lang2,ji2 +狼飡虎咽=lang2,can1,hu3,yan1 +狽=bei4 +狾=zhi4 +狿=yan2 +猀=sha1 +猁=li4 +猂=han4 +猃=xian3 +猄=jing1 +猅=pai2 +猆=fei1 +猇=xiao1 +猈=bai4,pi2 +猉=qi2 +猊=ni2 +猋=biao1 +猌=yin4 +猍=lai2 +猎=lie4 +猏=jian1,yan4 +猐=qiang1 +猑=kun1 +猒=yan4 +猓=guo1 +猔=zong4 +猕=mi2 +猖=chang1 +猗=yi1,yi3 +猘=zhi4 +猙=zheng1 +猚=ya2,wei4 +猛=meng3 +猛将=meng3,jiang4 +猜=cai1 +猜度=cai1,duo2 +猜着=cai1,zhao2 +猜闷葫芦=cai1,men4,hu2,lu5 +猝=cu4 +猞=she1 +猟=lie4 +猡=luo2 +猢=hu2 +猣=zong1 +猤=gui4 +猥=wei3 +猦=feng1 +猧=wo1 +猨=yuan2 +猩=xing1 +猪=zhu1 +猪肚=zhu1,du3 +猫=mao1,mao2 +猫哭耗子=mao1,ku1,hao4,zi5 +猫爪子=mao1,zhua3,zi5 +猫鼠同处=mao1,shu3,tong2,chu3 +猬=wei4 +猭=chuan4,chuan1 
+献=xian4 +献血=xian4,xue4 +猯=tuan1,tuan4 +猰=ya4,jia2,qie4 +猱=nao2 +猲=xie1,he4,ge2,hai4 +猳=jia1 +猴=hou2 +猴子=hou2,zi5 +猵=bian1,pian4 +猶=you2 +猷=you2 +猸=mei2 +猹=cha2 +猺=yao2 +猻=sun1 +猼=bo2,po4 +猽=ming2 +猾=hua2 +猿=yuan2 +獀=sou1 +獁=ma3 +獂=huan2 +獃=dai1 +獄=yu4 +獅=shi1 +獆=hao2 +獇=qiang1 +獈=yi4 +獉=zhen1 +獊=cang1 +獋=hao2,gao1 +獌=man4 +獍=jing4 +獎=jiang3 +獏=mo4 +獐=zhang1 +獐子=zhang1,zi5 +獑=chan2 +獒=ao2 +獓=ao2 +獔=hao2 +獕=suo3 +獖=fen2,fen4 +獗=jue2 +獘=bi4 +獙=bi4 +獚=huang2 +獛=pu2 +獜=lin2,lin4 +獝=xu4 +獞=tong2 +獟=yao4,xiao1 +獠=liao2 +獡=shuo4,xi1 +獢=xiao1 +獣=shou4 +獤=dun1 +獥=jiao4 +獦=ge2,lie4,xie1 +獧=juan4 +獨=du2 +獩=hui4 +獪=kuai4 +獫=xian3 +獬=xie4 +獭=ta3 +獮=xian3 +獯=xun1 +獰=ning2 +獱=bian1,pian4 +獲=huo4 +獳=nou4,ru2 +獴=meng2 +獵=lie4 +獶=nao2,nao3,you1 +獷=guang3 +獸=shou4 +獹=lu2 +獺=ta3 +獻=xian4 +獼=mi2 +獽=rang2 +獾=huan1 +獿=nao2,you1 +玀=luo2 +玁=xian3 +玂=qi2 +玃=jue2 +玄=xuan2 +玅=miao4 +玆=zi1 +率=lv4,shuai4 +率以为常=shuai4,yi3,wei2,chang2 +率先=shuai4,xian1 +率兽食人=shuai4,shou4,shi2,ren2 +率军=shuai4,jun1 +率土同庆=shuai4,tu3,tong2,qing4 +率土宅心=shuai4,tu3,zhai2,xin1 +率土归心=shuai4,tu3,gui1,xin1 +率尔=shuai4,er3 +率尔操觚=shuai4,er3,cao1,gu1 +率性=shuai4,xing4 +率然=shuai4,ran2 +率由旧则=shuai4,you2,jiu4,ze2 +率由旧章=shuai4,you2,jiu4,zhang1 +率直=shuai4,zhi2 +率真=shuai4,zhen1 +率队=shuai4,dui4 +率领=shuai4,ling3 +率马以骥=shuai4,ma3,yi3,ji4 +玈=lu2 +玉=yu4 +玉卮无当=yu4,zhi1,wu2,dang4 +玉尺量才=yu4,chi3,liang2,cai2 +玉质金相=yu4,zhi4,jin1,xiang4 +玊=su4 +王=wang2,wang4 +王冠=wang2,guan1 +王天下=wang4,tian1,xia4 +王蒙=wang2,meng2 +王贡弹冠=wang2,gong4,dan4,guan4 +玌=qiu2 +玍=ga3 +玎=ding1 +玏=le4 +玐=ba1 +玑=ji1 +玒=hong2 +玓=di4 +玔=chuan4 +玕=gan1 +玖=jiu3 +玗=yu2 +玘=qi3 +玙=yu2 +玚=chang4,yang2 +玛=ma3 +玜=hong2 +玝=wu3 +玞=fu1 +玟=min2,wen2 +玠=jie4 +玡=ya4 +玢=bin1,fen1 +玣=bian4 +玤=bang4 +玥=yue4 +玦=jue2 +玧=men2,yun3 +玨=jue2 +玩=wan2 +玩儿不转=wan2,er2,bu4,zhuan4 +玩岁愒日=wan2,sui4,kai4,ri4 +玩岁愒时=wan2,sui4,kai4,shi2 +玩岁愒月=wan2,sui4,kai4,yue4 +玩日愒时=wan2,ri4,kai4,shi2 +玩时愒日=wan2,shi2,kai4,ri4 +玪=jian1,qian2 +玫=mei2 +玬=dan3 +玭=pin2 +玮=wei3 +环=huan2 +环伺=huan2,si4 +环晕=huan2,yun4 +现=xian4 
+现在为止=xian4,zai4,wei2,zhi3 +玱=qiang1,cang1 +玲=ling2 +玳=dai4 +玴=yi4 +玵=an2,gan1 +玶=ping2 +玷=dian4 +玸=fu2 +玹=xuan2,xian2 +玺=xi3 +玻=bo1 +玻璃=bo1,li5 +玻璃杯=bo2,li5,bei1 +玻璃碴=bo1,li5,cha2 +玻璃窗=bo2,li5,chuang1 +玻璃纸=bo2,li5,zhi3 +玼=ci1,ci3 +玽=gou3 +玾=jia3 +玿=shao2 +珀=po4 +珁=ci2 +珂=ke1 +珃=ran3 +珄=sheng1 +珅=shen1 +珆=yi2,tai1 +珇=zu3,ju4 +珈=jia1 +珉=min2 +珊=shan1 +珋=liu3 +珌=bi4 +珍=zhen1 +珎=zhen1 +珏=jue2 +珐=fa4 +珑=long2 +珒=jin1 +珓=jiao4 +珔=jian4 +珕=li4 +珖=guang1 +珗=xian1 +珘=zhou1 +珙=gong3 +珚=yan1 +珛=xiu4 +珜=yang2 +珝=xu3 +珞=luo4 +珟=su4 +珠=zhu1 +珠宫贝阙=zhu1,gong1,bei4,que4 +珠还合浦=zhu1,huan2,he2,pu3 +珡=qin2 +珢=yin2,ken4 +珣=xun2 +珤=bao3 +珥=er3 +珦=xiang4 +珧=yao2 +珨=xia2 +珩=heng2 +珪=gui1 +珫=chong1 +珬=xu4 +班=ban1 +班长=ban1,zhang3 +珮=pei4 +珯=lao3 +珰=dang1 +珱=ying1 +珲=hun2,hui1 +珳=wen2 +珴=e2 +珵=cheng2 +珶=di4,ti2 +珷=wu3 +珸=wu2 +珹=cheng2 +珺=jun4 +珻=mei2 +珼=bei4 +珽=ting3 +現=xian4 +珿=chu4 +琀=han2 +琁=xuan2,qiong2 +琂=yan2 +球=qiu2 +球迷=qiu2,mi2 +琄=xuan4 +琅=lang2 +理=li3 +理不胜辞=li3,bu4,sheng4,ci2 +理发=li3,fa4 +琇=xiu4 +琈=fu2,fu1 +琉=liu2 +琊=ya2 +琋=xi1 +琌=ling2 +琍=li2 +琎=jin1 +琏=lian3 +琐=suo3 +琑=suo3 +琒=feng1 +琓=wan2 +琔=dian4 +琕=pin2,bing3 +琖=zhan3 +琗=cui4,se4 +琘=min2 +琙=yu4 +琚=ju1 +琛=chen1 +琜=lai2 +琝=min2 +琞=sheng4 +琟=wei2,yu4 +琠=tian3,tian4 +琡=shu1 +琢=zhuo2,zuo2 +琢磨=zuo2,mo5 +琣=beng3,pei3 +琤=cheng1 +琥=hu3 +琦=qi2 +琧=e4 +琨=kun1 +琩=chang1 +琪=qi2 +琫=beng3 +琬=wan3 +琭=lu4 +琮=cong2 +琯=guan3 +琰=yan3 +琱=diao1 +琲=bei4 +琳=lin2 +琴=qin2 +琴瑟之好=qi2,se4,zhi1,hao3 +琵=pi2 +琶=pa2 +琷=que4 +琸=zhuo2 +琹=qin2 +琺=fa4 +琻=jin1 +琼=qiong2 +琽=du3 +琾=jie4 +琿=hun2,hui1 +瑀=yu3 +瑁=mao4 +瑂=mei2 +瑃=chun1 +瑄=xuan1 +瑅=ti2 +瑆=xing1 +瑇=dai4 +瑈=rou2 +瑉=min2 +瑊=jian1 +瑋=wei3 +瑌=ruan3 +瑍=huan4 +瑎=xie2,jie1 +瑏=chuan1 +瑐=jian3 +瑑=zhuan4 +瑒=chang4,yang2 +瑓=lian4 +瑔=quan2 +瑕=xia2 +瑕瑜互见=xia2,yu2,hu4,jian4 +瑖=duan4 +瑗=yuan4 +瑗珲=yuan4,hui1 +瑘=ye2 +瑙=nao3 +瑚=hu2 +瑛=ying1 +瑜=yu2 +瑝=huang2 +瑞=rui4 +瑟=se4 +瑠=liu2 +瑡=shi1 +瑢=rong2 +瑣=suo3 +瑤=yao2 +瑥=wen1 +瑦=wu3 +瑧=zhen1 +瑨=jin4 +瑩=ying2 +瑪=ma3 +瑫=tao1 +瑬=liu2 +瑭=tang2 +瑮=li4 +瑯=lang2 +瑰=gui1 
+瑱=tian4,tian2,zhen4 +瑲=qiang1,cang1 +瑳=cuo1 +瑴=jue2 +瑵=zhao3 +瑶=yao2 +瑶台银阙=yao2,tai2,yin2,que4 +瑶池女使=yao2,shi5,nv3,shi3 +瑷=ai4 +瑸=bin1,pian2 +瑹=tu2,shu1 +瑺=chang2 +瑻=kun1 +瑼=zhuan1 +瑽=cong1 +瑾=jin3 +瑿=yi1 +璀=cui3 +璁=cong1 +璂=qi2 +璃=li2 +璄=jing3 +璅=zao3,suo3 +璆=qiu2 +璇=xuan2 +璇霄丹阙=xuan2,xiao1,dan1,que4 +璈=ao2 +璉=lian3 +璊=men2 +璋=zhang1 +璌=yin2 +璍=ye4 +璎=ying1 +璏=zhi4 +璐=lu4 +璑=wu2 +璒=deng1 +璓=xiu4 +璔=zeng1 +璕=xun2 +璖=qu2 +璗=dang4 +璘=lin2 +璙=liao2 +璚=qiong2,jue2 +璛=su4 +璜=huang2 +璝=gui1 +璞=pu2 +璟=jing3 +璠=fan2 +璡=jin1 +璢=liu2 +璣=ji1 +璤=hui4 +璥=jing3 +璦=ai4 +璧=bi4 +璧还=bi4,huan2 +璨=can4 +璩=qu2 +璪=zao3 +璫=dang1 +璬=jiao3 +璭=guan3 +璮=tan3 +璯=hui4,kuai4 +環=huan2 +璱=se4 +璲=sui4 +璳=tian2 +璴=chu3 +璵=yu2 +璶=jin4 +璷=lu2,fu1 +璸=bin1,pian2 +璹=shu2 +璺=wen4 +璻=zui3 +璼=lan2 +璽=xi3 +璾=ji4,zi1 +璿=xuan2 +瓀=ruan3 +瓁=wo4 +瓂=gai4 +瓃=lei2 +瓄=du2 +瓅=li4 +瓆=zhi4 +瓇=rou2 +瓈=li2 +瓉=zan4 +瓊=qiong2 +瓋=ti4 +瓌=gui1 +瓍=sui2 +瓎=la4 +瓏=long2 +瓐=lu2 +瓑=li4 +瓒=zan4 +瓓=lan4 +瓔=ying1 +瓕=mi2,xi3 +瓖=xiang1 +瓗=qiong2,wei3,wei4 +瓘=guan4 +瓙=dao4 +瓚=zan4 +瓛=huan2,ye4,ya4 +瓜=gua1 +瓜葛=gua1,ge2 +瓜葛相连=gua1,ge3,xiang1,lian2 +瓜蔓=gua1,wan4 +瓝=bo2 +瓞=die2 +瓟=bo2,pao2 +瓠=hu4 +瓡=zhi2,hu2 +瓢=piao2 +瓣=ban4 +瓤=rang2 +瓤子=rang2,zi5 +瓥=li4 +瓦=wa3,wa4 +瓦刀=wa4,dao1 +瓦岗军=wa3,gang1,jun1 +瓦窑堡=wa3,yao2,bu3 +瓦舍=wa3,she4 +瓨=xiang2,hong2 +瓩=qian2,wa3 +瓪=ban3 +瓫=pen2 +瓬=fang3 +瓭=dan3 +瓮=weng4 +瓯=ou1 +瓳=hu2 +瓴=ling2 +瓵=yi2 +瓶=ping2 +瓶塞=ping2,sai1 +瓶子=ping2,zi5 +瓷=ci2 +瓸=bai3,wa3 +瓹=juan4,juan1 +瓺=chang2 +瓻=chi1 +瓽=dang4 +瓾=wa1 +瓿=bu4 +甀=zhui4 +甁=ping2 +甂=bian1 +甃=zhou4 +甄=zhen1 +甆=ci2 +甇=ying1 +甈=qi4 +甉=xian2 +甊=lou3 +甋=di4 +甌=ou1 +甍=meng2 +甎=zhuan1 +甏=beng4 +甐=lin4 +甑=zeng4 +甒=wu3 +甓=pi4 +甔=dan1 +甕=weng4 +甖=ying1 +甗=yan3 +甘=gan1 +甘分随时=gan1,fen4,sui2,shi2 +甘处下流=gan1,chu3,xia4,liu2 +甘油炸药=gan1,you2,zha4,yao4 +甘贫守分=gan1,pin2,shou3,fen1 +甘露=gan1,lu4 +甙=dai4 +甚=shen4,shen2 +甚为=shen4,wei2 +甛=tian2 +甜=tian2 +甝=han2 +甞=chang2 +生=sheng1 +生发=sheng1,fa1 +生发未燥=sheng1,fa4,wei4,zao4 +生拖死拽=sheng1,tuo1,si3,zhuai1 +生杀予夺=sheng1,sha1,yu3,duo2 
+生死予夺=sheng1,si3,yu3,duo2 +生肖=sheng1,xiao4 +生角=sheng1,jue2 +生还=sheng1,huan2 +生还者=sheng1,huan2,zhe3 +生长=sheng1,zhang3 +生闷气=sheng1,men4,qi4 +甠=qing2 +甡=shen1 +產=chan3 +産=chan3 +甤=rui2 +甥=sheng1 +甦=su1 +甧=shen1 +用=yong4 +用人不当=yong4,ren2,bu2,dang4 +用力一搡=yong4,li4,yi4,sang3 +用处=yong4,chu3 +用智铺谋=yong4,zhi4,pu4,mou2 +用水和面=yong4,shui3,huo4,mian4 +用行舍藏=yong4,xing2,cang2,she3 +用词不当=yong4,ci2,bu4,dang4 +用词切当=yong4,ci2,qie4,dang4 +甩=shuai3 +甪=lu4 +甫=fu3 +甬=yong3 +甭=beng2 +甮=beng2 +甯=ning2,ning4 +田=tian2 +田父=tian2,fu3 +田父之功=tian2,fu3,zhi1,gong1 +田父献曝=tian2,fu3,xian4,pu4 +田舍=tian2,she4 +由=you2 +由头=you2,tou5 +由得=you2,de5 +甲=jia3 +甲壳=jia3,qiao4 +申=shen1 +甴=you2,zha2 +电=dian4 +电刨=dian4,bao4 +电势差=dian4,shi4,cha1 +电子乐器=dian4,zi3,yue4,qi4 +电磁感应=dian4,ci2,gan3,ying4 +电磨=dian4,mo4 +电线杆=dian4,xian4,gan3 +电荷=dian4,he4 +电钻=dian4,zuan4 +甶=fu2 +男=nan2 +男仆=nan2,pu2 +男傧相=nan2,bin1,xiang4 +男女老小=nan2,nv3,lao3,xiao3 +男女老少=nan2,nv3,lao3,shao4 +男孩儿=nan2,hai2,er5 +男孩子=nan2,hai2,zi5 +甸=dian4,tian2,sheng4 +甹=ping1 +町=ting3,ding1 +画=hua4 +画卷=hua4,juan4 +画地为牢=hua4,di4,wei2,lao2 +画地为狱=hua4,di4,wei2,yu4 +画夹=hua4,jia1 +画帖=hua4,tie4 +画片=hua4,pian1 +画片儿=hua4,pian1,er5 +画肖像=hua4,xiao4,xiang4 +画荻和丸=hua4,di2,huo4,wan2 +画蛇著足=hua4,she2,zhuo2,zu2 +画龙不成反为狗=hua4,long2,bu4,cheng2,fan3,wei2,gou3 +甼=ting3,ding1 +甽=zhen4 +甾=zai1,zi1 +甿=meng2 +畀=bi4 +畁=bi4,qi2 +畂=mu3 +畃=xun2 +畄=liu2 +畅=chang4 +畅所欲为=chang4,suo3,yu4,wei2 +畆=mu3 +畇=yun2 +畈=fan4 +畉=fu2 +畊=geng1 +畋=tian2 +界=jie4 +畍=jie4 +畎=quan3 +畏=wei4 +畐=fu2,bi4 +畑=tian2 +畒=mu3 +畓=duo1 +畔=pan4 +畕=jiang1 +畖=wa1 +畗=da2,fu2 +畘=nan2 +留=liu2 +留空=liu2,kong4 +留难=liu2,nan4 +畚=ben3 +畛=zhen3 +畜=chu4,xu4 +畜产=xu4,chan3 +畜养=xu4,yang3 +畜力=chu4,li4 +畜圈=chu4,juan4 +畜妻养子=xu4,qi1,yang3,zi3 +畜牧=xu4,mu4 +畝=mu3 +畞=mu3 +畟=ce4,ji4 +畠=zai1,zi1 +畡=gai1 +畢=bi4 +畣=da2 +畤=zhi4,chou2,shi4 +略=lve4 +畦=qi2 +畧=lve4 +畨=fan1,pan1 +畩=yi1 +番=fan1,pan1 +番将=fan1,jiang4 +番禺=pan1,yu2 +畫=hua4 +畬=she1,yu2 +畭=she1 +畮=mu3 +畯=jun4 +異=yi4 +畱=liu2 +畲=she1 +畳=die2 +畴=chou2 +畵=hua4 +當=dang1,dang4,dang3 +畷=zhui4 +畸=ji1 
+畹=wan3 +畺=jiang1,jiang4 +畻=cheng2 +畼=chang4 +畽=tuan3 +畾=lei2 +畿=ji1 +疀=cha1 +疁=liu2 +疂=die2 +疃=tuan3 +疄=lin2,lin4 +疅=jiang1 +疆=jiang1,qiang2 +疇=chou2 +疈=pi4 +疉=die2 +疊=die2 +疋=pi3,ya3,shu1 +疌=jie2,qie4 +疍=dan4 +疎=shu1 +疏=shu1 +疏不间亲=shu1,bu4,jian4,qin1 +疐=zhi4,di4 +疑=yi2,ni3 +疑为=yi2,wei2 +疒=ne4 +疓=nai3 +疔=ding1 +疕=bi3 +疖=jie1 +疖子=jie1,zi5 +疗=liao2 +疘=gang1 +疙=ge1,yi4 +疙疙瘩瘩=ge1,ge1,da1,da2 +疙瘩=ge1,da5 +疙瘩汤=ge1,da1,tang1 +疙里疙瘩=ge1,li3,ge1,da1 +疚=jiu4 +疛=zhou3 +疜=xia4 +疝=shan4 +疞=xu1 +疟=nve4,yao4 +疟子=yao4,zi3 +疟疾=nve4,ji2 +疠=li4,lai4 +疡=yang2 +疢=chen4 +疣=you2 +疤=ba1 +疥=jie4 +疦=jue2,xue4 +疧=qi2 +疨=ya3,xia1 +疩=cui4 +疪=bi4 +疫=yi4 +疬=li4 +疭=zong4 +疮=chuang1 +疯=feng1 +疰=zhu4 +疱=pao4 +疲=pi2 +疲于奔命=pi2,yu2,ben1,ming4 +疲沓=pi2,ta5 +疲疲沓沓=pi2,pi2,ta5,ta5 +疲累=pi2,lei4 +疳=gan1 +疴=ke1 +疵=ci1 +疶=xue1 +疷=zhi1 +疸=dan3 +疹=zhen3 +疺=fa2,bian3 +疻=zhi3 +疼=teng2 +疽=ju1 +疾=ji2 +疾不可为=ji2,bu4,ke3,wei2 +疾风劲草=ji2,feng1,jing4,cao3 +疾风彰劲草=ji2,feng1,zhang1,jing4,cao3 +疾风知劲草=ji2,feng1,zhi1,jing4,cao3 +疿=fei4,fei2 +痀=gou1 +痁=shan1,dian4 +痂=jia1 +痃=xuan2 +痄=zha4 +病=bing4 +病假=bing4,jia4 +病入骨隨=bing4,ru4,gu3,sui3 +病革=bing4,ji2 +痆=nie4 +症=zheng4,zheng1 +症状=zheng1,zhuang4 +症瘕=zheng1,jia3 +症瘕积聚=zheng1,jia3,ji1,ju4 +症结=zheng1,jie2 +痈=yong1 +痉=jing4 +痊=quan2 +痋=teng2,chong2 +痌=tong1,tong2 +痌瘝在抱=tong1,guan1,zao4,bao4 +痍=yi2 +痎=jie1 +痏=wei3,you4,yu4 +痐=hui2 +痑=tan1,shi3 +痒=yang3 +痓=zhi4 +痔=zhi4 +痕=hen2 +痖=ya3 +痗=mei4 +痘=dou4 +痙=jing4 +痚=xiao1 +痛=tong4 +痛不欲生=tong4,bu4,yu4,sheng1 +痛切=tong4,qie4 +痛恶=tong4,wu4 +痛深恶绝=tong4,shen1,wu4,jue2 +痛自创艾=tong4,zi4,chuang1,yi4 +痜=tu1 +痝=mang2 +痞=pi3 +痞子=pi3,zi5 +痟=xiao1 +痠=suan1 +痡=pu1,pu4 +痢=li4 +痣=zhi4 +痤=cuo2 +痥=duo2 +痦=wu4 +痧=sha1 +痨=lao2 +痩=shou4 +痪=huan4 +痫=xian2 +痬=yi4 +痭=beng1,peng2 +痮=zhang4 +痯=guan3 +痰=tan2 +痱=fei4,fei2 +痲=ma2 +痳=ma2,lin4 +痴=chi1 +痵=ji4 +痶=tian3,dian4 +痷=an1,ye4,e4 +痸=chi4 +痹=bi4 +痺=bi4 +痻=min2 +痼=gu4 +痽=dui1 +痾=ke1,e1 +痿=wei3 +瘀=yu1 +瘀血=yu1,xue4 +瘁=cui4 +瘂=ya3 +瘃=zhu2 +瘄=cu4 +瘅=dan4,dan1 +瘅暑=dan1,shu3 +瘅热=dan1,re4 +瘆=shen4 +瘇=zhong3 +瘈=zhi4,chi4 
+瘉=yu4 +瘊=hou2 +瘊子=hou2,zi5 +瘋=feng1 +瘌=la4 +瘍=yang2 +瘎=chen2 +瘏=tu2 +瘐=yu3 +瘑=guo1 +瘒=wen2 +瘓=huan4 +瘔=ku4 +瘕=jia3,xia2,xia1 +瘖=yin1 +瘗=yi4 +瘘=lou4 +瘙=sao4 +瘚=jue2 +瘛=chi4 +瘜=xi1 +瘝=guan1 +瘞=yi4 +瘟=wen1 +瘠=ji2 +瘡=chuang1 +瘢=ban1 +瘣=hui4,lei3 +瘤=liu2 +瘥=chai4,cuo2 +瘦=shou4 +瘦削=shou4,xue1 +瘦骨嶙峋=shou4,gu3,lin2,xun2 +瘦骨梭棱=shou4,gu3,leng2,leng2 +瘦高挑儿=shou4,gao1,tiao3,er2 +瘧=nve4,yao4 +瘨=dian1,chen1 +瘩=da2,da5 +瘪=bie1,bie3 +瘫=tan1 +瘬=zhang4 +瘭=biao1 +瘮=shen4 +瘯=cu4 +瘰=luo3 +瘱=yi4 +瘲=zong4 +瘳=chou1 +瘴=zhang4 +瘵=zhai4 +瘶=sou4 +瘷=se4 +瘸=que2 +瘸子=que2,zi5 +瘹=diao4 +瘺=lou4 +瘻=lou4 +瘼=mo4 +瘽=qin2 +瘾=yin3 +瘾君子=yin3,jun1,zi3 +瘿=ying3 +癀=huang2 +癁=fu2 +療=liao2 +癃=long2 +癄=qiao2,jiao4 +癅=liu2 +癆=lao2 +癇=xian2 +癈=fei4 +癉=dan4,dan1 +癊=yin4 +癋=he4 +癌=ai2 +癍=ban1 +癎=xian2 +癏=guan1 +癐=gui4,wei1 +癑=nong4,nong2 +癒=yu4 +癓=wei1 +癔=yi4 +癕=yong1 +癖=pi3 +癖好=pi3,hao4 +癗=lei3 +癘=li4,lai4 +癙=shu3 +癚=dan4 +癛=lin3 +癜=dian4 +癝=lin3 +癞=lai4 +癞蛤蟆=lai4,ha2,ma5 +癞蛤蟆想吃天鹅肉=lai4,ha2,ma5,xiang3,chi1,tian1,e2,rou4 +癟=bie1,bie3 +癠=ji4 +癡=chi1 +癢=yang3 +癣=xuan3 +癤=jie1 +癥=zheng1 +癦=meng4 +癧=li4 +癨=huo4 +癩=lai4 +癪=ji1 +癫=dian1 +癬=xuan3 +癭=ying3 +癮=yin3 +癯=qu2 +癰=yong1 +癱=tan1 +癲=dian1 +癳=luo3 +癴=luan2 +癵=luan2 +癶=bo1 +癷=bo1,bo3 +癸=gui3 +癹=ba2 +発=fa1 +登=deng1 +登载=deng1,zai3 +發=fa1 +白=bai2 +白不呲咧=bai2,bu4,ci1,lie3 +白云亲舍=bai2,yun2,qin1,she4 +白云观=bai2,yun2,guan4 +白卷=bai2,juan4 +白发=bai2,fa4 +白发丹心=bai2,fa4,dan1,xin1 +白发千丈=bai2,fa4,qian1,zhang4 +白发朱颜=bai2,fa4,zhu1,yan2 +白发红颜=bai2,fa4,hong2,yan2 +白发苍苍=bai2,fa4,cang1,cang1 +白发苍颜=bai2,fa4,cang1,yan2 +白发青衫=bai2,fa4,qing1,shan1 +白干=bai2,gan1 +白干儿=bai2,gan1,er2 +白晃晃=bai2,huang3,huang3 +白术=bai2,zhu2 +白相=bai2,xiang4 +白纸坊=bai2,zhi3,fang1 +白血=bai2,xue4 +白血病=bai2,xue4,bing4 +白衣卿相=bai2,yi1,qing1,xiang4 +白露=bai2,lu4 +白面儒冠=bai2,mian4,ru2,guan1 +白首为郎=bai2,shou3,wei2,lang2 +白首相庄=bai2,shou3,xiang1,zhuang1 +白首相知=bai2,shou3,xiang1,zhi1 +白骨露野=bai2,gu3,lu4,ye3 +百=bai3 +百下百着=bai3,xia4,bai3,zhao2 +百不当一=bai3,bu4,dang1,yi1 +百中百发=bai3,zhong4,bai3,fa1 +百了千当=bai3,liao3,qian1,dang1 
+百兽率舞=bai3,shou4,shuai4,wu3 +百分数=bai3,fen1,shu4 +百发百中=bai3,fa1,bai3,zhong4 +百善孝为先=bai3,shan4,xiao4,wei2,xian1 +百堕俱举=bai3,hui1,ju4,ju3 +百夫长=bai3,fu1,zhang3 +百孔千创=bai3,kong3,qian1,chuang1 +百尺竿头更进一步=bai3,chi3,gan1,tou2,geng4,jin4,yi1,bu4 +百年好事=bai3,nian2,hao3,shi4 +百废俱兴=bai3,fei4,ju4,xing1 +百废具兴=bai3,fei4,ju4,xing1 +百废待兴=bai3,fei4,dai4,xing1 +百战不殆=bai3,zhan4,bu4,dai4 +百日咳=bai3,ri4,ke2 +百星不如一月=bai3,xing1,bu4,ru2,yi1,yue4 +百舍重茧=bai3,she4,chong2,jian3 +百舍重趼=bai3,she4,chong2,jian3 +百读不厌=bai3,du2,bu4,yan4 +百载树人=bai3,zai3,shu4,ren2 +百闻不如一见=bai3,wen2,bu4,ru2,yi1,jian4 +癿=qie2 +皀=ji2,bi1 +皁=zao4 +皂=zao4 +皃=mao4 +的=de5,di2,di4 +的一确二=di2,yi1,que4,er4 +的哥=di1,ge1 +的士=di2,shi4 +的当=di2,dang4 +的情=di2,qing2 +的款=di2,kuan3 +的确=di2,que4 +的证=di2,zheng4 +皅=pa1,ba4 +皆=jie1 +皇=huang2 +皇冠=huang2,guan1 +皈=gui1 +皉=ci3 +皊=ling2 +皋=gao1,hao2 +皌=mo4 +皍=ji2 +皎=jiao3 +皎阳似火=jiao3,yang2,shi4,huo3 +皏=peng3 +皐=gao1,yao2 +皑=ai2 +皒=e2 +皓=hao4 +皔=han4 +皕=bi4 +皖=wan3 +皗=chou2 +皘=qian4 +皙=xi1 +皚=ai2 +皛=xiao3 +皜=hao4 +皝=huang4 +皞=hao4 +皟=ze2 +皠=cui3 +皡=hao4 +皢=xiao3 +皣=ye4 +皤=po2 +皥=hao4 +皦=jiao3 +皧=ai4 +皨=xing1 +皩=huang4 +皪=li4,luo4,bo1 +皫=piao3 +皬=he2 +皭=jiao4 +皮=pi2 +皮包骨头=pi2,bao1,gu2,tou5 +皮夹=pi2,jia1 +皮夹子=pi2,jia1,zi5 +皮相=pi2,xiang4 +皮相之见=pi2,xiang4,zhi1,jian4 +皮相之谈=pi2,xiang4,zhi1,tan2 +皮笑肉不笑=pi2,xiao4,rou4,bu4,xiao4 +皯=gan3 +皰=pao4 +皱=zhou4 +皱眉头=zhou4,mei2,tou5 +皲=jun1 +皳=qiu2 +皴=cun1 +皵=que4 +皶=zha1 +皷=gu3 +皸=jun1 +皹=jun1 +皺=zhou4 +皻=zha1,cu3 +皼=gu3 +皽=zhao1,zhan3,dan3 +皾=du2 +皿=min3 +盀=qi3 +盁=ying2 +盂=yu2 +盃=bei1 +盄=diao4 +盅=zhong1 +盆=pen2 +盇=he2 +盈=ying2 +盈千累万=ying2,qian1,lei3,wan4 +盈篇累牍=ying2,pian1,lei3,du2 +盉=he2 +益=yi4 +盋=bo1 +盌=wan3 +盍=he2 +盎=ang4 +盏=zhan3 +盐=yan2 +盐分=yan2,fen4 +盐巴=yan2,ba1 +监=jian1,jian4 +监利=jian4,li4 +监生=jian4,sheng1 +监禁=jian1,jin4 +盒=he2 +盒子=he2,zi5 +盓=yu1 +盔=kui1 +盕=fan4 +盖=gai4,ge3,he2 +盖子=gai4,zi5 +盗=dao4 +盗录行为=dao4,lu4,xing2,wei2 +盘=pan2 +盘子=pan2,zi5 +盘曲=pan2,qu1 +盙=fu3 +盚=qiu2 +盛=sheng4,cheng2 +盛器=cheng2,qi4 +盛水不漏=cheng2,shui3,bu4,lou4 +盛菜=cheng2,cai4 
+盛衰兴废=sheng4,shuai1,xing1,fei4 +盛饭=cheng2,fan4 +盜=dao4 +盝=lu4 +盞=zhan3 +盟=meng2 +盠=li2 +盡=jin4 +盢=xu4 +監=jian1,jian4 +盤=pan2 +盥=guan4 +盦=an1 +盧=lu2 +盨=xu3 +盩=zhou1,chou2 +盪=dang4 +盫=an1 +盬=gu3 +盭=li4 +目=mu4 +目下十行=mu4,xia4,shi2,hang2 +目不暇给=mu4,bu4,xia2,ji3 +目不见睫=mu4,bu4,jian4,jie2 +目前为止=mu4,qian2,wei2,zhi3 +目挑心招=mu4,tiao3,xin1,zhao1 +目无尊长=mu4,wu2,zun1,zhang3 +目的=mu4,di4 +目眢心忳=mu4,yuan1,xin1,wang3 +目瞪舌彊=mu4,deng4,she2,jiang4 +目空余子=mu4,kong1,yu2,zi3 +盯=ding1 +盰=gan4 +盱=xu1 +盲=mang2 +盲翁扪钥=mang2,weng1,men2,yao4 +盳=mang2,wang4 +直=zhi2 +直扑无华=zhi2,pu3,wu2,hua2 +直率=zhi2,shuai4 +直言不讳=zhi2,yan2,bu4,hui4 +直言切谏=zhi2,yan2,qie1,jian4 +直言贾祸=zhi2,yan2,gu3,huo4 +盵=qi4 +盶=yuan3 +盷=xian2,tian2 +相=xiang1,xiang4 +相与为一=xiang1,yu3,wei2,yi1 +相中=xiang1,zhong4 +相亲=xiang4,qin1 +相亲相爱=xiang1,qin1,xiang1,ai4 +相似=xiang1,si4 +相位=xiang4,wei4 +相依为命=xiang1,yi1,wei2,ming4 +相公=xiang4,gong1 +相册=xiang4,ce4 +相反数=xiang1,fan3,shu4 +相国=xiang4,guo2 +相图=xiang4,tu2 +相士=xiang4,shi4 +相声=xiang4,sheng4 +相处=xiang1,chu3 +相女配夫=xiang4,nv3,pei4,fu1 +相差=xiang1,cha1 +相差不大=xiang1,cha1,bu2,da4 +相差甚远=xiang1,cha1,shen4,yuan3 +相应=xiang1,ying4 +相得=xiang1,de5 +相得甚欢=xiang1,de2,shen4,huan1 +相得益彰=xiang1,de2,yi4,zhang1 +相得益章=xiang1,de2,yi4,zhang1 +相态=xiang4,tai4 +相扑=xiang4,pu1 +相时而动=xiang4,shi2,er2,dong4 +相术=xiang4,shu4 +相机=xiang4,ji1 +相机而动=xiang4,ji1,er2,dong4 +相机行事=xiang4,ji1,xing2,shi4 +相片=xiang4,pian4 +相片儿=xiang4,pian1,er5 +相率=xiang1,shuai4 +相称=xiang1,chen4 +相纸=xiang4,zhi3 +相角=xiang4,jiao3 +相貌=xiang4,mao4 +相貌寒碜=xiang4,mao4,han2,chen5 +相辅相成=xiang1,fu3,xiang1,cheng2 +相遗以水=xiang1,wei4,yi3,shui3 +相门有相=xiang4,men2,you3,xiang4 +相间=xiang1,jian4 +相面=xiang4,mian4 +相马=xiang4,ma3 +相鼠有皮=xiang4,shu3,you3,pi2 +盹=dun3 +盺=xin1 +盻=xi4,pan3 +盼=pan4 +盼头=pan4,tou5 +盽=feng1 +盾=dun4 +盿=min2 +眀=ming2 +省=sheng3,xing3 +省亲=xing3,qin1 +省察=xing3,cha2 +省得=sheng3,de5 +省悟=xing3,wu4 +省视=xing3,shi4 +省长=sheng3,zhang3 +眂=shi4 +眃=yun2,hun4 +眄=mian3 +眅=pan1 +眆=fang3 +眇=miao3 +眈=dan1 +眉=mei2 +眉头一皱=mei2,tou5,yi2,zhou4 +眉毛胡子一把抓=mei2,mao2,hu2,zi5,yi1,ba3,zhua1 +眊=mao4 
+看=kan4,kan1 +看不见=kan4,bu2,jian4 +看中=kan4,zhong4 +看吧=kan4,ba5 +看头=kan4,tou5 +看守=kan1,shou3 +看家=kan1,jia1 +看得懂=kan4,de5,dong3 +看得清=kan4,de5,qing1 +看得起=kan4,de5,qi3 +看护=kan1,hu4 +看押=kan1,ya1 +看样子=kan4,yang4,zi5 +看相=kan4,xiang4 +看管=kan1,guan3 +看门=kan1,men2 +県=xian4 +眍=kou1 +眎=shi4 +眏=yang1,yang3,ying4 +眐=zheng1 +眑=yao3,ao1,ao3 +眒=shen1 +眓=huo4 +眔=da4 +眕=zhen3 +眖=kuang4 +眗=ju1,xu1,kou1 +眘=shen4 +眙=yi2,chi4 +眚=sheng3 +眛=mei4 +眜=mo4,mie4 +眝=zhu4 +眞=zhen1 +真=zhen1 +真倔=zhen1,jue4 +真分数=zhen1,fen1,shu4 +真切=zhen1,qie4 +真数=zhen1,shu4 +真枪实弹=zhen1,qiang1,shi2,dan4 +真率=zhen1,shuai4 +真相=zhen1,xiang4 +真相大白=zhen1,xiang4,da4,bai2 +真相毕露=zhen1,xiang1,bi4,lu4 +真菌=zhen1,jun4 +真菌界=zhen1,jun4,jie4 +眠=mian2 +眡=shi4 +眢=yuan1 +眣=die2,ti4 +眤=ni4 +眥=zi4 +眦=zi4 +眦裂发指=zi4,lie4,fa4,zhi3 +眧=chao3 +眨=zha3 +眩=xuan4 +眩晕=xuan4,yun4 +眪=bing3,fang3 +眫=pang4,pan2 +眬=long2 +眭=gui4,sui1 +眮=tong2 +眯=mi1,mi2 +眰=die2,zhi4 +眱=di4 +眲=ne4 +眳=ming2 +眴=xuan4,shun4,xun2 +眵=chi1 +眶=kuang4 +眷=juan4 +眸=mou2 +眸子=mou2,zi3 +眹=zhen4 +眺=tiao4 +眻=yang2 +眼=yan3 +眼不见为净=yan3,bu4,jian4,wei2,jing4 +眼晕=yan3,yun4 +眼花撩乱=yan3,hua1,liao2,luan4 +眼见得=yan3,jian4,de5 +眽=mo4 +眾=zhong4 +眿=mo4 +着=zhe5,zhuo2,zhao2,zhao1 +着三不着两=zhao2,san1,bu4,zhao2,liang3 +着书立说=zhu4,shu1,li4,shuo1 +着人办理=zhuo2,ren2,ban4,li4 +着凉=zhao2,liang2 +着力=zhuo2,li4 +着墨=zhuo2,mo4 +着忙=zhao2,mang2 +着急=zhao2,ji2 +着想=zhuo2,xiang3 +着意=zhuo2,yi4 +着慌=zhao2,huang1 +着手=zhuo2,shou3 +着数=zhao1,shu4 +着水=zhe5,shui3 +着法=zhao1,fa3 +着火=zhao2,huo3 +着眼=zhuo2,yan3 +着笔=zhuo2,bi3 +着色=zhuo2,se4 +着落=zhuo2,luo4 +着装=zhuo2,zhuang1 +着迷=zhao2,mi2 +着重=zhuo2,zhong4 +着陆=zhuo2,lu4 +着风=zhao2,feng1 +着魔=zhao2,mo2 +睁=zheng1 +睂=mei2 +睃=suo1 +睄=qiao2,shao4,xiao1 +睅=han4 +睆=huan3 +睇=di4 +睈=cheng3 +睉=cuo2,zhuai4 +睊=juan4 +睋=e2 +睌=mian3 +睍=xian4 +睎=xi1 +睏=kun4 +睐=lai4 +睑=jian3 +睒=shan3 +睓=tian3 +睔=gun4 +睕=wan1 +睖=leng4 +睗=shi4 +睘=qiong2 +睙=li4 +睚=ya2 +睛=jing1 +睜=zheng1 +睝=li2 +睞=lai4 +睟=sui4,zui4 +睟面盎背=sui4,mian4,ang4,bei4 +睠=juan4 +睡=shui4 +睡懒觉=shui4,lan3,jiao4 +睡相=shui4,xiang4 +睡眼蒙眬=shui4,yan3,meng2,long2 
+睡着=shui4,zhao2 +睡觉=shui4,jiao4 +睢=hui1,sui1 +督=du1 +督率=du1,shuai4 +睤=bi4 +睥=bi4,pi4 +睥睨=pi4,ni4 +睥睨一切=pi4,ni4,yi1,qie4 +睦=mu4 +睧=hun1 +睨=ni4 +睩=lu4 +睪=yi4,ze2,gao1 +睫=jie2 +睬=cai3 +睭=zhou3 +睮=yu2 +睯=hun1 +睰=ma4 +睱=xia4 +睲=xing3,xing4 +睳=hui1 +睴=hun4 +睵=zai1 +睶=chun3 +睷=jian1 +睸=mei4 +睹=du3 +睹物兴情=du3,wu4,xing1,qing2 +睹着知微=du3,zhe5,zhi1,wei1 +睺=hou2 +睻=xuan1 +睼=ti2 +睽=kui2 +睾=gao1 +睿=rui4 +瞀=mao4 +瞁=xu4 +瞂=fa2 +瞃=wo4 +瞄=miao2 +瞅=chou3 +瞆=gui4,wei4,kui4 +瞇=mi1,mi2 +瞈=weng3 +瞉=kou4,ji4 +瞊=dang4 +瞋=chen1 +瞋目切齿=chen1,mu4,qie4,chi3 +瞌=ke1 +瞍=sou3 +瞎=xia1 +瞎子=xia1,zi5 +瞎琢磨=xia1,zuo2,mo5 +瞎蒙=xia1,meng1 +瞏=qiong2,huan2 +瞐=mo4 +瞑=ming2 +瞒=man2,men2 +瞒哄=man2,hong3 +瞓=fen4 +瞔=ze2 +瞕=zhang4 +瞖=yi4 +瞗=diao1,dou1 +瞘=kou1 +瞙=mo4 +瞚=shun4 +瞛=cong1 +瞜=lou2,lv2,lou5 +瞝=chi1 +瞞=man2,men2 +瞟=piao3 +瞠=cheng1 +瞡=gui1 +瞢=meng2,meng3 +瞣=wan4 +瞤=run2,shun4 +瞥=pie1 +瞥一眼=pie1,yi1,yan3 +瞦=xi1 +瞧=qiao2 +瞧得起=qiao2,de5,qi3 +瞨=pu2 +瞩=zhu3 +瞪=deng4 +瞪眼咋舌=deng4,yan3,ze2,she2 +瞫=shen3 +瞬=shun4 +瞭=liao3,liao4 +瞭望=liao4,wang4 +瞭望台=liao4,wang4,tai2 +瞮=che4 +瞯=xian2,jian4 +瞰=kan4 +瞰瑕伺隙=kan4,xia2,si4,xi4 +瞱=ye4 +瞲=xue4 +瞳=tong2 +瞴=wu3,mi2 +瞵=lin2 +瞶=gui4,kui4 +瞷=jian4 +瞸=ye4 +瞹=ai4 +瞺=hui4 +瞻=zhan1 +瞼=jian3 +瞽=gu3 +瞾=zhao4 +瞿=qu2,ju4 +瞿然=ju4,ran2 +矀=wei2 +矁=chou3 +矂=sao4 +矃=ning3,cheng1 +矄=xun1 +矅=yao4 +矆=huo4,yue4 +矇=meng1 +矈=mian2 +矉=pin2 +矊=mian2 +矋=lei3 +矌=kuang4,guo1 +矍=jue2 +矎=xuan1 +矏=mian2 +矐=huo4 +矑=lu2 +矒=meng2,meng3 +矓=long2 +矔=guan4,quan2 +矕=man3,man2 +矖=xi3 +矗=chu4 +矘=tang3 +矙=kan4 +矚=zhu3 +矛=mao2 +矛头=mao2,tou5 +矜=jin1,qin2,guan1 +矜功负气=jin1,gong1,fu3,qi4 +矜名嫉能=jin1,ming2,ji4,neng2 +矜己自饰=jin1,ji3,zhi4,shi4 +矝=jin1,qin2,guan1 +矞=yu4,xu4,jue2 +矟=shuo4 +矠=ze2 +矡=jue2 +矢=shi3 +矢口否认=shi3,kou3,fou3,ren4 +矣=yi3 +矤=shen3 +知=zhi1,zhi4 +知之为知之=zhi1,zhi1,wei2,zhi1,zhi1 +知了=zhi1,liao3 +知其不可为而为之=zhi1,qi2,bu4,ke3,wei2,er2,wei2,zhi1 +知疼着热=zhi1,teng2,zhao2,re4 +知疼着痒=zhi1,teng2,zhao2,yang3 +知识=zhi1,shi5 +知识产业=zhi1,shi5,chan3,ye4 +知识产权=zhi1,shi5,chan3,quan2 +知识分子=zhi1,shi5,fen4,zi5 +知识化=zhi1,shi5,hua4 
+知识宝库=zhi1,shi5,bao3,ku4 +知识密集产业=zhi1,shi5,mi4,ji2,chan3,ye4 +知识就是力量=zhi1,shi5,jiu4,shi4,li4,liang4 +知识工厂=zhi1,shi5,gong1,chang3 +知识工程=zhi1,shi5,gong1,cheng2 +知识库=zhi1,shi5,ku4 +知识更新=zhi1,shi5,geng1,xin1 +知识爆炸=zhi1,shi5,bao4,zha4 +知识界=zhi1,shi5,jie4 +知识经济=zhi1,shi5,jing1,ji4 +知识老化=zhi1,shi5,lao3,hua4 +知识财产=zhi1,shi5,cai2,chan3 +知识贬值=zhi1,shi5,bian3,zhi2 +知识青年=zhi1,shi5,qing1,nian2 +知难而退=zhi1,nan2,er2,tui4 +矦=hou2,hou4 +矧=shen3 +矨=ying3 +矩=ju3 +矩形=ju3,xing2 +矪=zhou1 +矫=jiao3,jiao2 +矫国更俗=jiao3,guo2,geng1,su2 +矬=cuo2 +短=duan3 +短褐不完=duan1,he4,bu4,wan2 +短见薄识=duan3,jian4,bo2,shi2 +矮=ai3 +矮个子=ai3,ge4,zi5 +矮人观场=ai3,ren2,guan1,chang2 +矮子观场=ai3,zi3,guan1,chang2 +矯=jiao3,jiao2 +矰=zeng1 +矱=yue1 +矲=ba4 +石=shi2,dan4 +石头=shi2,tou5 +石子儿=shi2,zi3,er5 +石室金匮=shi2,shi4,jin1,gui4 +石磨=shi2,mo4 +矴=ding4 +矵=qi4 +矶=ji1 +矷=zi3 +矸=gan1 +矹=wu4 +矺=zhe2 +矻=ku1 +矼=gang1,qiang1,kong4 +矽=xi1 +矾=fan2 +矿=kuang4 +矿芯=kuang4,xin4 +矿藏=kuang4,cang2 +矿难=kuang4,nan4 +砀=dang4 +码=ma3 +码头=ma3,tou5 +砂=sha1 +砃=dan1 +砄=jue2 +砅=li4 +砆=fu1 +砇=min2 +砈=e4 +砉=xu1,hua1 +砊=kang1 +砋=zhi3 +砌=qi4,qie4 +砍=kan3 +砎=jie4 +砏=pin1,bin1,fen1 +砐=e4 +砑=ya4 +砒=pi1 +砓=zhe2 +研=yan2,yan4 +砕=sui4 +砖=zhuan1 +砖头=zhuan1,tou5 +砗=che1 +砘=dun4 +砙=wa3 +砚=yan4 +砚台=yan4,tai1 +砛=jin1 +砜=feng1 +砝=fa3 +砞=mo4 +砟=zha3 +砠=ju1 +砡=yu4 +砢=ke1,luo3 +砣=tuo2 +砤=tuo2 +砥=di3 +砥砺琢磨=di3,li4,zhuo2,mo2 +砦=zhai4 +砧=zhen1 +砨=e3 +砩=fu2,fei4 +砪=mu3 +砫=zhu4,zhu3 +砬=li4,la1,la2 +砭=bian1 +砮=nu3 +砯=ping1 +砰=peng1 +砱=ling2 +砲=pao4 +砳=le4 +破=po4 +破家为国=po4,jia1,wei4,guo2 +破愁为笑=po4,chou2,wei2,xiao4 +破涕为笑=po4,ti4,wei2,xiao4 +破相=po4,xiang4 +破矩为圆=po4,ju3,wei2,yuan2 +破衣服=po4,yi1,fu5 +破觚为圜=po4,gu1,wei2,yuan2 +破镜重合=po4,jing4,zhong4,he2 +破镜重圆=po4,jing4,chong2,yuan2 +砵=bo1 +砶=po4 +砷=shen1 +砸=za2 +砹=ai4 +砺=li4 +砻=long2 +砼=tong2 +砽=yong4 +砾=li4 +砿=kuang4 +础=chu3 +硁=keng1 +硂=quan2 +硃=zhu1 +硄=kuang1,guang1 +硅=gui1 +硆=e4 +硇=nao2 +硈=qia4 +硉=lu4 +硊=wei3,gui4 +硋=ai4 +硌=luo4,ge4 +硍=ken4,xian4,gun3,yin3 +硎=xing2 +硏=yan2,yan4 +硐=dong4 +硑=peng1,ping2 +硒=xi1 +硓=lao3 +硔=hong2 +硕=shuo4,shi2 
+硕望宿德=shuo4,wang4,xiu3,de2 +硖=xia2 +硗=qiao1 +硘=qing2 +硙=wei2,wei4 +硚=qiao2 +硜=keng1 +硝=xiao1 +硝云弹雨=xiao1,yun2,dan4,yu3 +硞=que4,ke4,ku4 +硟=chan4 +硠=lang2 +硡=hong1 +硢=yu4 +硣=xiao1 +硤=xia2 +硥=mang3,bang4 +硦=luo4,long4 +硧=yong3,tong2 +硨=che1 +硩=che4 +硪=wo4 +硫=liu2 +硬=ying4 +硭=mang2 +确=que4 +确切=que4,qie4 +确当=que4,dang4 +硯=yan4 +硰=sha1 +硱=kun3 +硲=yu4 +硴=hua1 +硵=lu3 +硶=chen3 +硷=jian3 +硸=nve4 +硹=song1 +硺=zhuo2 +硻=keng1,keng3 +硼=peng2 +硽=yan1,yan3 +硾=zhui4,chui2,duo3 +硿=kong1 +碀=cheng1 +碁=qi2 +碂=zong4,cong2 +碃=qing4 +碄=lin2 +碅=jun1 +碆=bo1 +碇=ding4 +碈=min2 +碉=diao1 +碉堡=diao1,bao3 +碊=jian1,zhan4 +碋=he4 +碌=lu4,liu4 +碌碌无为=lu4,lu4,wu2,wei2 +碌碡=liu4,zhou4 +碍=ai4 +碎=sui4 +碏=que4,xi1 +碐=leng2 +碑=bei1 +碑帖=bei1,tie4 +碒=yin2 +碓=dui4 +碔=wu3 +碔砆混玉=zhi4,fu1,hun4,yu4 +碕=qi2 +碖=lun2,lun3,lun4 +碗=wan3 +碘=dian3 +碘酊=dian3,ding3 +碙=nao2,gang1 +碚=bei4 +碛=qi4 +碜=chen3 +碝=ruan3 +碞=yan2 +碟=die2 +碟子=die2,zi5 +碠=ding4 +碡=zhou2 +碢=tuo2 +碣=jie2,ya4 +碤=ying1 +碥=bian3 +碦=ke4 +碧=bi4 +碧血=bi4,xue4 +碧血丹心=bi4,xue4,dan1,xin1 +碨=wei3,wei4 +碩=shuo4,shi2 +碪=zhen1 +碫=duan4 +碬=xia2 +碭=dang4 +碮=ti2,di1 +碯=nao3 +碰=peng4 +碰头会=peng4,tou2,kuai4 +碰钉子=peng4,ding4,zi3 +碱=jian3 +碲=di4 +碳=tan4 +碴=cha2,cha1 +碵=tian2 +碶=qi4 +碷=dun4 +碸=feng1 +碹=xuan4 +確=que4 +碻=que4,qiao1 +碼=ma3 +碽=gong1 +碾=nian3 +碾坊=nian3,fang2 +碿=su4,xie4 +磀=e2 +磁=ci2 +磂=liu2,liu4 +磃=si1,ti2 +磄=tang2 +磅=bang4,pang2 +磅礴=pang2,bo2 +磆=hua2,ke3,gu1 +磇=pi1 +磈=kui3,wei3 +磉=sang3 +磊=lei3 +磊落豪横=lei3,luo4,hao2,heng2 +磋=cuo1 +磌=tian2 +磍=xia2,qia4,ya4 +磎=xi1 +磏=lian2,qian1 +磐=pan2 +磑=ai2,wei4 +磒=yun3 +磓=dui1 +磔=zhe2 +磕=ke1 +磖=la2,la1 +磗=zhuan1 +磘=yao2 +磙=gun3 +磚=zhuan1 +磛=chan2 +磜=qi4 +磝=ao2,qiao1 +磞=peng1,peng4 +磟=liu4 +磠=lu3 +磡=kan4 +磢=chuang3 +磣=chen3 +磤=yin1,yin3 +磥=lei3,lei2 +磦=biao1 +磧=qi4 +磨=mo2,mo4 +磨叨=mo4,dao1 +磨坊=mo4,fang2 +磨子=mo4,zi3 +磨棱刓角=mo2,leng2,liang3,jiao3 +磨烦=mo4,fan2 +磨牙吮血=mo2,ya2,shun3,xue4 +磨盘=mo4,pan2 +磨米=mo4,mi3 +磨豆腐=mo4,dou4,fu3 +磨难=mo2,nan4 +磨面=mo4,mian4 +磩=qi4,zhu2 +磪=cui1 +磫=zong1 +磬=qing4 +磭=chuo4 +磮=lun2 +磯=ji1 +磰=shan4 +磱=lao2,luo4 +磲=qu2 
+磳=zeng1 +磴=deng4 +磵=jian4 +磶=xi4 +磷=lin2 +磸=ding4 +磹=dian4 +磺=huang2 +磻=pan2,bo1 +磼=ji2,she2 +磽=qiao1 +磾=di1 +磿=li4 +礀=jian4 +礁=jiao1 +礂=xi1 +礃=zhang3 +礄=qiao2 +礅=dun1 +礆=jian3 +礇=yu4 +礈=zhui4 +礉=he2,qiao1,qiao4 +礊=ke4,huo4 +礋=ze2 +礌=lei2,lei3 +礍=jie2 +礎=chu3 +礏=ye4 +礐=que4,hu2 +礑=dang4 +礒=yi3 +礓=jiang1 +礔=pi1 +礕=pi1 +礖=yu4 +礗=pin1 +礘=e4,qi4 +礙=ai4 +礚=ke1 +礛=jian1 +礜=yu4 +礝=ruan3 +礞=meng2 +礟=pao4 +礠=ci2 +礡=bo1 +礢=yang3 +礣=mie4 +礤=ca3 +礥=xian2,xin2 +礦=kuang4 +礧=lei2,lei3,lei4 +礨=lei3 +礩=zhi4 +礪=li4 +礫=li4 +礬=fan2 +礭=que4 +礮=pao4 +礯=ying1 +礰=li4 +礱=long2 +礲=long2 +礳=mo4 +礴=bo2 +礵=shuang1 +礶=guan4 +礷=jian1 +礸=ca3 +礹=yan2,yan3 +示=shi4 +礻=shi4 +礼=li3 +礼为情貌=li3,wei2,qing2,mao4 +礼乐=li3,yue4 +礼坏乐崩=li3,huai4,yue4,beng1 +礼崩乐坏=li3,beng1,yue4,huai4 +礼数=li3,shu4 +礼让为国=li3,rang4,wei2,guo2 +礽=reng2 +社=she4 +礿=yue4 +祀=si4 +祁=qi2 +祂=ta1 +祃=ma4 +祄=xie4 +祅=yao1 +祆=xian1 +祇=zhi3,qi2 +祈=qi2 +祉=zhi3 +祊=beng1,fang1 +祋=dui4 +祌=zhong4 +祍=ren4 +祎=yi1 +祏=shi2 +祐=you4 +祑=zhi4 +祒=tiao2 +祓=fu2 +祔=fu4 +祕=mi4,bi4 +祖=zu3 +祗=zhi1 +祘=suan4 +祙=mei4 +祚=zuo4 +祛=qu1 +祜=hu4 +祝=zhu4 +祝不胜诅=zhu4,bu4,sheng4,zu3 +祝咽祝哽=zhu4,yan1,zhu4,geng3 +神=shen2 +神不守舍=shen2,bu4,shou3,she4 +神出鬼没=shen2,chu1,gui3,mo4 +神号鬼哭=shen2,hao2,gui3,ku1 +神差鬼使=shen2,chai1,gui3,shi3 +神曲=shen2,qu1 +神武挂冠=shen2,wu3,gua4,guan4 +神祇=shen2,qi2 +神霄绛阙=shen2,xiao1,jiang4,que4 +神魂不定=shen2,hun2,bu2,ding4 +神龙失埶=shen2,long2,shi1,zhi4 +神龙见首不见尾=shen2,long2,jian4,shou3,bu4,jian4,wei3 +祟=sui4 +祠=ci2 +祡=chai2 +祢=mi2 +祣=lv3 +祤=yu3 +祥=xiang2 +祦=wu2 +祧=tiao1 +票=piao4,piao1 +祩=zhu4 +祪=gui3 +祫=xia2 +祬=zhi1 +祭=ji4,zhai4 +祮=gao4 +祯=zhen1 +祰=gao4 +祱=shui4,lei4 +祲=jin4 +祲威盛容=long2,wei1,sheng4,rong2 +祳=shen4 +祴=gai1 +祵=kun3 +祶=di4 +祷=dao3 +祸=huo4 +祸为福先=huo4,wei2,fu2,xian1 +祸福相依=huo4,fu2,xiang1,yi1 +祸福相倚=huo4,fu2,xiang1,yi1 +祸福相生=huo4,fu2,xiang1,sheng1 +祹=tao2 +祺=qi2 +祻=gu4 +祼=guan4 +祽=zui4 +祾=ling2 +祿=lu4 +禀=bing3 +禁=jin1,jin4 +禁不住=jin1,bu2,zhu4 +禁不起=jin1,bu4,qi3 +禁中颇牧=jin4,zhong1,po1,mu4 +禁书=jin4,shu1 +禁令=jin4,ling4 +禁例=jin4,li4 +禁军=jin4,jun1 +禁制=jin4,zhi4 
+禁区=jin4,qu1 +禁卫=jin4,wei4 +禁卫军=jin4,wei4,jun1 +禁受=jin1,shou4 +禁地=jin4,di4 +禁子=jin4,zi3 +禁律=jin4,lv4 +禁得住=jin1,de2,zhu4 +禁得起=jin1,de5,qi3 +禁忌=jin4,ji4 +禁情割欲=jin4,qing2,ge1,yu4 +禁攻寝兵=jin4,gong1,qin3,bing1 +禁暴正乱=jin4,bao4,zheng4,luan4 +禁暴诛乱=jin4,bao4,zhu1,luan4 +禁条=jin4,tiao2 +禁果=jin4,guo3 +禁欲=jin4,yu4 +禁止=jin4,zhi3 +禁毁=jin4,hui3 +禁渔=jin4,yu2 +禁火=jin4,huo3 +禁烟=jin4,yan1 +禁物=jin4,wu4 +禁猎=jin4,lie4 +禁绝=jin4,jue2 +禁网疏阔=jin4,wang3,shu1,kuo4 +禁脔=jin4,luan2 +禁舍开塞=jin4,she3,kai1,sai1 +禁足=jin4,zu2 +禁运=jin4,yun4 +禁酒=jin4,jiu3 +禁锢=jin4,gu4 +禁闭=jin4,bi4 +禁闭室=jin4,bi4,shi4 +禁阻=jin4,zu3 +禁食=jin4,shi2 +禁鼎一脔=jin4,ding3,yi1,luan2 +禂=dao3 +禃=zhi2 +禄=lu4 +禅=chan2,shan4 +禅絮沾泥=chan2,xu1,zhan1,ni2 +禅让=shan4,rang4 +禆=bi4,pi2 +禇=chu3 +禈=hui1 +禉=you3 +禊=xi4 +禋=yin1 +禌=zi1 +禍=huo4 +禎=zhen1 +福=fu2 +福为祸先=fu2,wei2,huo4,xian1 +福为祸始=fu2,wei2,huo4,shi3 +福孙荫子=fu2,sun1,yin4,zi3 +福相=fu2,xiang4 +禐=yuan4 +禑=xu2 +禒=xian3 +禓=shang1,yang2 +禔=ti2,zhi3 +禕=yi1 +禖=mei2 +禗=si1 +禘=di4 +禙=bei4 +禚=zhuo2 +禛=zhen1 +禜=ying2 +禝=ji4 +禞=gao4 +禟=tang2 +禠=si1 +禡=ma4 +禢=ta4 +禣=fu4 +禤=xuan1 +禥=qi2 +禦=yu4 +禧=xi3 +禨=ji1,ji4 +禩=si4 +禪=shan4,chan2 +禫=dan4 +禬=gui4 +禭=sui4 +禮=li3 +禯=nong2 +禰=mi2 +禱=dao3 +禲=li4 +禳=rang2 +禴=yue4 +禵=ti2 +禶=zan4 +禷=lei4 +禸=rou2 +禹=yu3 +禺=yu2,yu4,ou3 +离=li2 +离乡背井=li2,xiang1,bei4,jing3 +离乡背土=li2,xiang1,bei4,tu3 +离山调虎=li2,shan1,diao4,hu3 +离本徼末=li2,ben3,yao1,mo4 +离本趣末=li2,ben3,qu1,mo4 +离间=li2,jian4 +禼=xie4 +禽=qin2 +禾=he2 +禾场=he2,chang2 +禿=tu1 +秀=xiu4 +秀出班行=xiu4,chu1,ban1,hang2 +私=si1 +私了=si1,liao3 +私处=si1,chu4 +秂=ren2 +秃=tu1 +秄=zi3,zi4 +秅=cha2,na2 +秆=gan3 +秇=yi4,zhi2 +秈=xian1 +秉=bing3 +秊=nian2 +秋=qiu1 +秋实春华=qiu1,shi2,chun1,hua1 +秋行夏令=qiu1,xing2,xia4,ling2 +秌=qiu1 +种=zhong3,zhong4,chong2 +种地=zhong4,di4 +种子=zhong3,zi5 +种树=zhong4,shu4 +种植=zhong4,zhi2 +种牛=zhong4,niu2 +种牛痘=zhong4,niu2,dou4 +种瓜得瓜=zhong4,gua1,de2,gua1 +种田=zhong4,tian2 +种痘=zhong4,dou4 +种稻=zhong4,dao4 +种花=zhong4,hua1 +种草=zhong4,cao3 +种菜=zhong4,cai4 +秎=fen4 +秏=hao4,mao4 +秐=yun2 +科=ke1 +科教片=ke1,jiao4,pian1 +科长=ke1,zhang3 +秒=miao3 +秓=zhi1 +秔=jing1 
+秕=bi3 +秕言谬说=bi3,yan2,miu4,shuo4 +秖=zhi3 +秗=yu4 +秘=mi4,bi4 +秘而不露=mi4,er2,bu4,lu4 +秘鲁=bi4,lu3 +秙=ku4,ku1 +秚=ban4 +秛=pi1 +秜=ni2,ni4 +秝=li4 +秞=you2 +租=zu1 +租庸调=zu1,yong1,diao4 +秠=pi1 +秡=bo2 +秢=ling2 +秣=mo4 +秤=cheng4 +秤平斗满=cheng4,ping2,dou3,man3 +秤杆=cheng4,gan3 +秥=nian2 +秦=qin2 +秦桧=qin2,gui4 +秧=yang1 +秨=zuo2 +秩=zhi4 +秪=di1 +秫=shu2 +秬=ju4 +秭=zi3 +秮=huo2,kuo4 +积=ji1 +积功兴业=ji1,gong1,xing1,ye4 +积岁累月=ji1,sui4,lei3,yue4 +积年累岁=ji1,nian2,lei3,sui4 +积年累月=ji1,nian2,lei3,yue4 +积德累仁=ji1,de2,lei3,ren2 +积德累功=ji1,de2,lei3,gong1 +积德累善=ji1,de2,lei3,shan4 +积攒=ji1,zan3 +积数=ji1,shu4 +积日累久=ji1,ri4,lei3,jiu3 +积日累岁=ji1,ri4,lei3,sui4 +积日累月=ji1,ri4,lei3,yue4 +积时累日=ji1,shi2,lei3,ri4 +积极分子=ji1,ji2,fen4,zi3 +积素累旧=ji1,su4,lei3,jiu4 +积累=ji1,lei3 +积蓄=ji1,xu4 +积谗糜骨=ji1,chan2,mei2,gu3 +积金累玉=ji1,jin1,lei4,yu4 +积铢累寸=ji1,zhu1,lei3,cun4 +积露为波=ji1,lu4,wei2,bo1 +称=cheng1,chen4,cheng4 +称为=cheng1,wei2 +称体裁衣=chen4,ti3,cai2,yi1 +称体载衣=chen4,ti3,cai2,yi1 +称呼=cheng1,hu1 +称家有无=chen4,jia1,you3,wu2 +称得上=cheng1,de5,shang4 +称德度功=cheng1,de2,duo2,gong1 +称心=chen4,xin1 +称意=cheng1,yi4 +称愿=chen4,yuan4 +称职=chen4,zhi2 +称身=chen4,shen1 +称道=cheng1,dao4 +称量=cheng1,liang2 +秱=tong2 +秲=shi4,zhi4 +秳=huo2,kuo4 +秴=huo1 +秵=yin1 +秶=zi1 +秷=zhi4 +秸=jie1 +秹=ren3 +秺=du4 +移=yi2 +移孝为忠=yi2,xiao4,wei2,zhong1 +移星换斗=yi2,xing1,huan4,dou3 +移的就箭=yi2,di4,jiu4,jian4 +移解=yi2,jie4 +秼=zhu1 +秽=hui4 +秾=nong2 +秿=fu4,pu1 +稀=xi1 +稀薄=xi1,bo2 +稀里哗啦=xi1,li3,hua1,la1 +稁=gao3 +稂=lang2 +稃=fu1 +稄=xun4,ze4 +稅=shui4 +稆=lv3 +稇=kun3 +稈=gan3 +稉=jing1 +稊=ti2 +程=cheng2 +稌=tu2,shu3 +稍=shao1,shao4 +稍为=shao1,wei2 +稍息=shao4,xi1 +税=shui4 +税卡=shui4,qia3 +稏=ya4 +稐=lun3 +稑=lu4 +稒=gu1 +稓=zuo2 +稔=ren3 +稕=zhun4,zhun3 +稖=bang4 +稗=bai4 +稗子面=bai4,zi5,mian4 +稗子面馍馍=bai4,zi5,mian4,mo2,mo2 +稘=ji1,qi2 +稙=zhi1 +稚=zhi4 +稚齿婑媠=zhi4,chi3,wo3,tuo3 +稛=kun3 +稜=leng2,leng1,ling2 +稝=peng2 +稞=ke1 +稟=bing3 +稠=chou2 +稡=zui4,zu2,su1 +稢=yu4 +稣=su1 +稤=lve4 +稥=xiang1 +稦=yi1 +稧=xi4,qie4 +稨=bian3 +稩=ji4 +稪=fu2 +稫=pi4,bi4 +稬=nuo4 +稭=jie1 +種=zhong3,zhong4 +稯=zong1,zong3 +稰=xu3,xu1 +稱=cheng1,chen4,cheng4 +稲=dao4 
+稳=wen3 +稳当=wen3,dang4 +稳扎稳打=wen3,zha1,wen3,da3 +稳操左券=wen3,cao1,zuo3,quan4 +稳操胜券=wen3,cao1,sheng4,quan4 +稳稳当当=wen3,wen3,dang1,dang1 +稴=xian2,jian1,lian4 +稵=zi1,jiu1 +稶=yu4 +稷=ji4 +稸=xu4 +稹=zhen3 +稺=zhi4 +稻=dao4 +稻子=dao4,zi5 +稼=jia4 +稽=ji1,qi3 +稽古揆今=ji1,gu3,kui2,jin1 +稽查=ji1,zha1 +稽首=qi3,shou3 +稾=gao3 +稿=gao3 +稿子=gao3,zi5 +穀=gu3 +穁=rong2 +穂=sui4 +穃=rong4 +穄=ji4 +穅=kang1 +穆=mu4 +穇=can3,shan1,cen1 +穈=men2,mei2 +穉=zhi4 +穊=ji4 +穋=lu4 +穌=su1 +積=ji1 +穎=ying3 +穏=wen3 +穐=qiu1 +穑=se4 +穓=yi4 +穔=huang2 +穕=qie4 +穖=ji3,ji4 +穗=sui4 +穘=xiao1,rao4 +穙=pu2 +穚=jiao1 +穛=zhuo1,bo2 +穜=tong2,zhong3 +穝=zuo1 +穞=lu3 +穟=sui4 +穠=nong2 +穡=se4 +穢=hui4 +穣=rang2 +穤=nuo4 +穥=yu3 +穦=pin1 +穧=ji4 +穨=tui2 +穩=wen3 +穪=cheng1,chen4,cheng4 +穫=huo4 +穬=kuang4 +穭=lv3 +穮=biao1,pao1 +穯=se4 +穰=rang2 +穱=zhuo1,jue2 +穲=li2 +穳=cuan2,zan4 +穴=xue2 +穴居野处=xue2,ju1,ye3,chu3 +穵=wa1 +究=jiu1 +穷=qiong2 +穷年累世=qiong2,nian2,lei3,shi4 +穷年累月=qiong2,nian2,lei3,yue4 +穷形尽相=qiong2,xing2,jin4,xiang4 +穷骨头=qiong2,gu2,tou5 +穸=xi1 +穹=qiong2 +空=kong1,kong4,kong3 +空余=kong4,yu2 +空儿=kong4,er2 +空出=kong4,chu1 +空包弹=kong1,bao1,dan4 +空地=kong4,di4 +空子=kong4,zi5 +空心吃药=kong1,xin1,chi1,yao4 +空无一人=kong1,wu2,yi1,ren2 +空暇=kong4,xia2 +空瘪=kong1,bie3 +空白=kong4,bai2 +空缺=kong4,que1 +空腹便便=kong1,fu4,pian2,pian2 +空袭惊报=kong1,xi2,jing3,bao4 +空转=kong1,zhuan4 +空闲=kong4,xian2 +空隙=kong4,xi4 +空难=kong1,nan4 +空额=kong4,e2 +穻=yu1,yu3 +穼=shen1 +穽=jing3 +穾=yao4 +穿=chuan1 +穿着=chuan1,zhuo2 +穿着打扮=chuan1,zhuo2,da3,ban4 +穿衣服=chuan1,yi1,fu5 +窀=zhun1 +突=tu1 +窂=lao2 +窃=qie4 +窄=zhai3 +窅=yao3 +窆=bian3 +窇=bao2 +窈=yao3 +窉=bing4 +窊=wa1 +窋=zhu2,ku1 +窌=jiao4,liao2,liu4 +窍=qiao4 +窎=diao4 +窏=wu1 +窐=wa1,gui1 +窑=yao2 +窒=zhi4 +窓=chuang1 +窔=yao4 +窕=tiao3,yao2 +窖=jiao4 +窗=chuang1 +窗明几净=chuang1,ming2,ji1,jing4 +窘=jiong3 +窘相=jiong3,xiang4 +窙=xiao1 +窚=cheng2 +窛=kou4 +窜=cuan4 +窝=wo1 +窝囊=wo1,nang1 +窝囊废=wo1,nang1,fei4 +窝囊气=wo1,nang1,qi4 +窝里反=wo1,li5,fan3 +窝铺=wo1,pu4 +窞=dan4 +窟=ku1 +窠=ke1 +窡=zhuo2 +窢=huo4 +窣=su1 +窤=guan1 +窥=kui1 +窥伺=kui1,si4 +窥度=kui1,duo2 +窥间伺隙=kui1,jian4,si4,xi4 +窦=dou4 +窧=zhuo1 
+窨=yin4,xun1 +窩=wo1 +窪=wa1 +窫=ya4,ye1 +窬=yu2 +窭=ju4 +窮=qiong2 +窯=yao2 +窰=yao2 +窱=tiao3 +窲=chao2 +窳=yu3 +窴=tian2,dian1,yan3 +窵=diao4 +窶=ju4 +窷=liao4 +窸=xi1 +窹=wu4 +窺=kui1 +窻=chuang1 +窼=chao1,ke1 +窽=kuan3,cuan4 +窾=kuan3,cuan4 +窿=long2 +竀=cheng1,cheng4 +竁=cui4 +竂=liao2 +竃=zao4 +竄=cuan4 +竅=qiao4 +竆=qiong2 +竇=dou4 +竈=zao4 +竉=long3 +竊=qie4 +立=li4 +立于不败之地=li4,yu2,bu4,bai4,zhi1,di4 +立传=li4,zhuan4 +立身处世=li4,shen1,chu3,shi4 +竌=chu4 +竍=shi2 +竎=fu4 +竏=qian1 +竐=chu4,qi4 +竑=hong2 +竒=qi2 +竓=hao2 +竔=sheng1 +竕=fen1 +竖=shu4 +竗=miao4 +竘=qu3,kou3 +站=zhan4 +站起来=zhan4,qi5,lai2 +站长=zhan4,zhang3 +竚=zhu4 +竛=ling2 +竜=long2 +竝=bing4 +竞=jing4 +竟=jing4 +章=zhang1 +章句小儒=zhang1,ju4,xiao1,ru2 +竡=bai3 +竢=si4 +竣=jun4 +竤=hong2 +童=tong2 +童仆=tong2,pu2 +童蒙=tong2,meng2 +童颜鹤发=tong2,yan2,he4,fa4 +竦=song3 +竧=jing4,zhen3 +竨=diao4 +竩=yi4 +竪=shu4 +竫=jing4 +竬=qu3 +竭=jie2 +竭尽=jie1,jin4 +竭尽全力=jie2,jin4,quan2,li4 +竭尽心力=jie2,jin4,xin1,li4 +竭智尽力=jie2,zhi4,jin4,li4 +竮=ping2 +端=duan1 +端量=duan1,liang2 +竰=li2 +竱=zhuan3 +竲=ceng2,zeng1 +竳=deng1 +竴=cun1 +竵=wai1 +競=jing4 +竷=kan3,kan4 +竸=jing4 +竹=zhu2 +竹筒倒豆子=zhu2,tong3,dao3,dou4,zi5 +竹篓子=zhu2,lou3,zi5 +竹篮打水=zhu2,lan2,da2,shui3 +竹篮打水一场空=zhu2,lan2,da3,shui3,yi1,chang3,kong1 +竺=zhu2,du3 +竻=le4,jin1 +竼=peng2 +竽=yu2 +竾=chi2 +竿=gan1 +笀=mang2 +笁=zhu2 +笂=wan2 +笃=du3 +笃志好学=du3,zhi4,hao3,xue2 +笃近举远=du3,jin4,ju3,juan3 +笄=ji1 +笅=jiao3,jiao4 +笆=ba1 +笆斗=ba1,dou3 +笇=suan4 +笈=ji2 +笉=qin3 +笊=zhao4 +笋=sun3 +笌=ya2 +笍=zhui4,rui4 +笎=yuan2 +笏=hu4 +笐=hang2,hang4 +笑=xiao4 +笒=cen2,jin4,han2 +笓=pi2,bi4 +笔=bi3 +笔划=bi3,hua4 +笔削=bi3,xue1 +笔削褒贬=bi3,xue1,bao1,bian3 +笔头儿=bi3,tou5,er5 +笔杆=bi3,gan3 +笔调=bi3,diao4 +笕=jian3 +笖=yi3 +笗=dong1 +笘=shan1 +笙=sheng1 +笚=da1,xia2,na4 +笛=di2 +笜=zhu2 +笝=na4 +笞=chi1 +笟=gu1 +笠=li4 +笡=qie4 +笢=min3 +笣=bao1 +笤=tiao2 +笥=si4 +符=fu2 +符号逻辑=fu2,hao4,luo2,ji5 +笧=ce4 +笨=ben4 +笩=fa2 +笪=da2 +笫=zi3 +第=di4 +第一信号系统=di4,yi1,xin4,hao4,xi4,tong3 +第一名=di4,yi4,ming2 +第一次世界大战=di4,yi1,ci4,shi4,jie4,da4,zhan4 +笭=ling2 +笮=zuo2,ze2 +笯=nu2 +笰=fu2,fei4 +笱=gou3 +笲=fan2 +笳=jia1 +笴=ge3 +笵=fan4 +笶=shi3 
+笷=mao3 +笸=po3 +笸箩=po3,luo5 +笹=ti4 +笺=jian1 +笻=qiong2 +笼=long2,long3 +笼子=long2,zi5 +笼络=long3,luo4 +笼统=long3,tong3 +笼罩=long3,zhao4 +笼鸟池鱼=long2,niao3,shi5,yu2 +笽=min3 +笾=bian1 +笿=luo4 +筀=gui4 +筁=qu1 +筂=chi2 +筃=yin1 +筄=yao4 +筅=xian3 +筆=bi3 +筇=qiong2 +筈=kuo4 +等=deng3 +等一下=deng3,yi2,xia4 +等差级数=deng3,cha4,ji2,shu4 +等比数列=deng3,bi3,shu4,lie4 +等比级数=deng3,bi3,ji2,shu4 +等量齐观=deng3,liang4,qi2,guan1 +筊=jiao3,jiao4 +筋=jin1 +筋斗=jin1,dou3 +筌=quan2 +筍=sun3 +筎=ru2 +筏=fa2 +筐=kuang1 +筑=zhu4,zhu2 +筑舍道傍=zhu4,she4,dao4,bang4 +筒=tong3 +筓=ji1 +答=da2,da1 +答允=da1,yun3 +答卷=da2,juan4 +答复=da2,fu4 +答应=da1,ying4 +答数=da2,shu4 +答理=da1,li3 +答腔=da1,qiang1 +筕=hang2 +策=ce4 +策划=ce4,hua4 +策应=ce4,ying4 +筗=zhong4 +筘=kou4 +筙=lai2 +筚=bi4 +筛=shai1 +筛子=shai1,zi5 +筜=dang1 +筝=zheng1 +筞=ce4 +筟=fu1 +筠=yun2,jun1 +筡=tu2 +筢=pa2 +筣=li2 +筤=lang2,lang4 +筥=ju3 +筦=guan3 +筧=jian3 +筨=han2 +筩=tong3 +筪=xia2 +筫=zhi4,zhi3 +筬=cheng2 +筭=suan4 +筮=shi4 +筯=zhu4 +筰=zuo2 +筱=xiao3 +筲=shao1 +筳=ting2 +筴=ce4 +筵=yan2 +筶=gao4 +筷=kuai4 +筷子=kuai4,zi5 +筸=gan1 +筹=chou2 +筹划=chou2,hua4 +筺=kuang1 +筻=gang4 +筼=yun2 +筽=o5 +签=qian1 +签子=qian1,zi5 +筿=xiao3 +简=jian3 +简分数=jian3,fen1,shu4 +简切了当=jian3,qie4,liao3,dang4 +简帖=jian3,tie1 +简明扼要=jian3,ming2,e2,yao4 +简朴=jian3,piao2 +箁=pou2,bu4,fu2,pu2 +箂=lai2 +箃=zou1 +箄=pai2,bei1 +箅=bi4 +箆=bi4 +箇=ge4 +箈=tai2,chi2 +箉=guai3,dai4 +箊=yu1 +箋=jian1 +箌=zhao4,dao4 +箍=gu1 +箎=chi2 +箏=zheng1 +箐=qing4,jing1 +箑=sha4 +箒=zhou3 +箓=lu4 +箔=bo2 +箕=ji1 +箖=lin2,lin3 +算=suan4 +算得=suan4,de5 +算数=suan4,shu4 +箘=jun4,qun1 +箙=fu2 +箚=zha2 +箛=gu1 +箜=kong1 +箝=qian2 +箞=quan1 +箟=jun4 +箠=chui2 +管=guan3 +管乐=guan3,yue4 +管乐器=guan3,yue4,qi4 +管弦乐=guan3,xian2,yue4 +管窥蠡测=guan3,kui1,li2,ce4 +箢=wan3,yuan1 +箣=ce4 +箤=zu2 +箥=po3 +箦=ze2 +箧=qie4 +箨=tuo4 +箩=luo2 +箩筐=luo2,kuang1 +箪=dan1 +箪食壶浆=dan1,si4,hu2,jiang1 +箪食壶酒=dan1,si4,hu2,jiu3 +箪食瓢饮=dan1,si4,piao2,yin3 +箫=xiao1 +箬=ruo4 +箭=jian4 +箮=xuan1 +箯=bian1 +箰=sun3 +箱=xiang1 +箱子=xiang1,zi5 +箱笼=xiang1,long3 +箲=xian3 +箳=ping2 +箴=zhen1 +箵=xing1 +箶=hu2 +箷=shi1,yi2 +箸=zhu4 +箸长碗短=zhu4,chang4,wan3,duan3 +箹=yue1,yao4,chuo4 
+箺=chun1 +箻=lv4 +箼=wu1 +箽=dong3 +箾=shuo4,xiao1,qiao4 +箿=ji2 +節=jie2 +篁=huang2 +篂=xing1 +篃=mei4 +範=fan4 +篅=chuan2 +篆=zhuan4 +篇=pian1 +篇什=pian1,shi2 +篈=feng1 +築=zhu4,zhu2 +篊=hong2 +篋=qie4 +篌=hou2 +篍=qiu1 +篎=miao3 +篏=qian4 +篐=gu1 +篑=kui4 +篒=yi4 +篓=lou3 +篓子=lou3,zi5 +篔=yun2 +篕=he2 +篖=tang2 +篗=yue4 +篘=chou1 +篙=gao1 +篙头=gao1,tou5 +篚=fei3 +篛=ruo4 +篜=zheng1 +篝=gou1 +篞=nie4 +篟=qian4 +篠=xiao3 +篡=cuan4 +篢=gong1,gan3,long3 +篣=peng2,pang2 +篤=du3 +篥=li4 +篦=bi4 +篧=zhuo2,huo4 +篨=chu2 +篩=shai1 +篪=chi2 +篫=zhu4 +篬=qiang1,cang1 +篭=long2,long3 +篮=lan2 +篮子=lan2,zi5 +篯=jian3,jian1 +篰=bu4 +篱=li2 +篲=hui4 +篳=bi4 +篴=zhu2,di2 +篵=cong1 +篶=yan1 +篷=peng2 +篸=cen1,zan1,can3 +篹=zhuan4,zuan4,suan3 +篺=pi2 +篻=piao3,biao1 +篼=dou1 +篽=yu4 +篾=mie4 +篿=tuan2,zhuan1 +簀=ze2 +簁=shai1 +簂=guo2,gui4 +簃=yi2 +簄=hu4 +簅=chan3 +簆=kou4 +簇=cu4 +簈=ping2 +簉=zao4 +簊=ji1 +簋=gui3 +簌=su4 +簍=lou3 +簎=ce4,ji2 +簏=lu4 +簐=nian3 +簑=suo1 +簒=cuan4 +簓=diao1 +簔=suo1 +簕=le4 +簖=duan4 +簗=zhu4 +簘=xiao1 +簙=bo2 +簚=mi4,mie4 +簛=shai1 +簜=dang4 +簝=liao2 +簞=dan1 +簟=dian4 +簠=fu3 +簡=jian3 +簢=min3 +簣=kui4 +簤=dai4 +簥=jiao1 +簦=deng1 +簧=huang2 +簨=sun3,zhuan4 +簩=lao2 +簪=zan1 +簪子=zan1,zi5 +簫=xiao1 +簬=lu4 +簭=shi4 +簮=zan1 +簯=qi2 +簰=pai2 +簱=qi2 +簲=pai2 +簳=gan3,gan4 +簴=ju4 +簵=lu4 +簶=lu4 +簷=yan2 +簸=bo4,bo3 +簸动=bo3,dong4 +簸土扬沙=bo3,tu3,yang2,sha1 +簸弄=bo3,nong4 +簸箕=bo4,ji5 +簸箩=bo3,luo2 +簸荡=bo3,dang4 +簹=dang1 +簺=sai4 +簻=zhua1 +簼=gou1 +簽=qian1 +簾=lian2 +簿=bu4,bo2 +籀=zhou4 +籁=lai4 +籂=shi5 +籃=lan2 +籄=kui4 +籅=yu2 +籆=yue4 +籇=hao2 +籈=zhen1,jian1 +籉=tai2 +籊=ti4 +籋=nie4 +籌=chou2 +籍=ji2 +籍没=ji2,mo4 +籍茅=jie4,mao2 +籎=yi2 +籏=qi2 +籐=teng2 +籑=zhuan4 +籒=zhou4 +籓=fan1,pan1,bian1 +籔=sou3,shu3 +籕=zhou4 +籖=qian1 +籗=zhuo2 +籘=teng2 +籙=lu4 +籚=lu2 +籛=jian3,jian1 +籜=tuo4 +籝=ying2 +籞=yu4 +籟=lai4 +籠=long2,long3 +籡=qie4 +籢=lian2 +籣=lan2 +籤=qian1 +籥=yue4 +籦=zhong1 +籧=qu2 +籨=lian2 +籩=bian1 +籪=duan4 +籫=zuan3 +籬=li2 +籭=shai1 +籮=luo2 +籯=ying2 +籰=yue4 +籱=zhuo2 +籲=yu4 +米=mi3 +籴=di2 +籵=fan2 +籶=shen1 +籷=zhe2 +籸=shen1 +籹=nv3 +籺=he2 +类=lei4 +籼=xian1 +籽=zi3 +籾=ni2 +籿=cun4 +粀=zhang4 +粁=qian1 +粂=zhai1 
+粃=bi3 +粄=ban3 +粅=wu4 +粆=sha1,chao3 +粇=kang1,jing1 +粈=rou2 +粉=fen3 +粉坊=fen3,fang2 +粊=bi4 +粋=cui4 +粌=yin3 +粍=zhe2 +粎=mi3 +粏=ta4 +粐=hu4 +粑=ba1 +粒=li4 +粓=gan1 +粔=ju4 +粕=po4 +粖=yu4 +粗=cu1 +粘=nian2,zhan1 +粘信封=zhan1,xin4,feng1 +粘土=nian2,tu3 +粘液=nian2,ye4 +粘牙=zhan1,ya2 +粘皮带骨=zhan1,pi2,dai4,gu3 +粘稠=nian2,chou2 +粘贴=zhan1,tie1 +粘连=zhan1,lian2 +粙=zhou4 +粚=chi1 +粛=su4 +粜=tiao4 +粝=li4 +粞=xi1 +粟=su4 +粠=hong2 +粡=tong2 +粢=zi1,ci2 +粣=ce4,se4 +粤=yue4 +粥=zhou1,yu4 +粦=lin2 +粧=zhuang1 +粨=bai3 +粩=lao1 +粪=fen4 +粪土不如=fen4,tu2,bu4,ru2 +粫=er2 +粬=qu1 +粭=he2 +粮=liang2 +粮囤=liang2,dun4 +粮行=liang2,hang2 +粯=xian4 +粰=fu1,fu2 +粱=liang2 +粲=can4 +粳=jing1 +粴=li3 +粵=yue4 +粶=lu4 +粷=ju2 +粸=qi2 +粹=cui4 +粺=bai4 +粻=zhang1 +粼=lin2 +粽=zong4 +粽子=zong4,zi5 +精=jing1 +精兵强将=jing1,bing1,qiang2,jiang4 +精干=jing1,gan4 +精当=jing1,dang4 +精明强干=jing1,ming2,qiang2,gan4 +精明能干=jing1,ming2,neng2,gan4 +精神不振=jing1,shen2,bu2,zhen4 +精血=jing1,xue4 +精馏=jing1,liu2 +粿=guo3 +糀=hua1 +糁=san3,shen1 +糂=shen1 +糃=tang2 +糄=bian1,bian3 +糅=rou2 +糆=mian4 +糇=hou2 +糈=xu3 +糉=zong4 +糊=hu1,hu2,hu4 +糊口=hu2,kou3 +糊口度日=hu2,kou3,du4,ri4 +糊弄=hu4,nong4 +糊涂=hu2,tu2 +糊精=hu2,jing1 +糊糊涂涂=hu1,hu1,tu2,tu2 +糊里糊涂=hu2,li3,hu2,tu2 +糋=jian4 +糌=zan1 +糌粑=zan1,ba5 +糍=ci2 +糎=li2 +糏=xie4 +糐=fu1 +糑=nuo4 +糒=bei4 +糓=gu3,gou4 +糔=xiu3 +糕=gao1 +糖=tang2 +糖弹=tang2,dan4 +糗=qiu3 +糘=jia1 +糙=cao1 +糚=zhuang1 +糛=tang2 +糜=mi2,mei2 +糜子=mei2,zi3 +糝=san3,shen1 +糞=fen4 +糟=zao1 +糠=kang1 +糠豆不赡=kang4,dou4,bu4,shan4 +糡=jiang4 +糢=mo2 +糣=san3,shen1 +糤=san3 +糥=nuo4 +糦=xi1 +糧=liang2 +糨=jiang4 +糨糊=jiang4,hu4 +糩=kuai4 +糪=bo2 +糫=huan2 +糬=shu3 +糭=zong4 +糮=xian4 +糯=nuo4 +糰=tuan2 +糱=nie4 +糲=li4 +糳=zuo4 +糴=di2 +糵=nie4 +糶=tiao4 +糷=lan4 +糸=mi4,si1 +糹=si1 +糺=jiu1,jiu3 +系=xi4,ji4 +系好=ji4,hao3 +系带=ji4,dai4 +系数=xi4,shu4 +系泊=ji4,bo2 +系紧=ji4,jin3 +系绳=ji4,sheng2 +系绳子=ji4,sheng2,zi3 +系鞋带=ji4,xie2,dai4 +糼=gong1 +糽=zheng1,zheng3 +糾=jiu1 +糿=gong1 +紀=ji4 +紁=cha4,cha3 +紂=zhou4 +紃=xun2 +約=yue1,yao1 +紅=hong2,gong1 +紆=yu1 +紇=he2,ge1 +紈=wan2 +紉=ren4 +紊=wen3 +紋=wen2,wen4 +紌=qiu2 +納=na4 +紎=zi1 +紏=tou3 +紐=niu3 +紑=fou2 
+紒=ji4,jie2,jie4 +紓=shu1 +純=chun2 +紕=pi1,pi2,bi3 +紖=zhen4 +紗=sha1 +紘=hong2 +紙=zhi3 +級=ji2 +紛=fen1 +紜=yun2 +紝=ren4 +紞=dan3 +紟=jin1,jin4 +素=su4 +素数=su4,shu4 +素朴=su4,piao2 +紡=fang3 +索=suo3 +紣=cui4 +紤=jiu3 +紥=zha1,za1 +紦=ha1 +紧=jin3 +紧着=jin3,zhe5 +紧跟不舍=jin3,gen1,bu2,she4 +紨=fu1,fu4 +紩=zhi4 +紪=qi1 +紫=zi3 +紫禁城=zi3,jin4,cheng2 +紫菀=zi3,wan3 +紫衫=zi3,shan1 +紫阳观=zi3,yang2,guan4 +紬=chou1,chou2 +紭=hong2 +紮=zha1,za1 +累=lei2,lei3,lei4 +累世不仕=lei3,shi4,bu2,shi4 +累乏=lei4,fa2 +累了=lei4,le5 +累人=lei4,ren2 +累卵=lei3,luan3 +累及=lei3,ji2 +累土聚沙=lei3,tu3,ju4,sha1 +累土至山=lei3,tu3,zhi4,shan1 +累块积苏=lei3,kuai4,ji1,su1 +累垮了=lei4,kua3,le5 +累屋重架=lei3,wu1,chong2,jia4 +累建奇功=lei3,jian4,qi2,gong1 +累手=lei4,shou3 +累教不改=lei3,jiao4,bu4,gai3 +累月经年=lei3,yue4,jing1,nian2 +累次=lei3,ci4 +累死累活=lei4,si3,lei4,huo2 +累活=lei4,huo2 +累牍连篇=lei3,du2,lian2,pian1 +累瓦结绳=lei3,wa3,jie2,sheng2 +累着=lei4,zhe5 +累积=lei3,ji1 +累累=lei3,lei3 +累累作案=lei3,lei3,zuo4,an4 +累累失误=lei3,lei3,shi1,wu4 +累苏积块=lei3,su1,ji1,kuai4 +累计=lei3,ji4 +累赘=lei2,zhui4 +累足成步=lei3,zu2,cheng2,bu4 +細=xi4 +紱=fu2 +紲=xie4 +紳=shen1 +紴=bo1,bi4 +紵=zhu4 +紶=qu1,qu3 +紷=ling2 +紸=zhu4 +紹=shao4 +紺=gan4 +紻=yang3 +紼=fu2 +紽=tuo2 +紾=zhen3,tian3 +紿=dai4 +絀=chu4 +絁=shi1 +終=zhong1 +絃=xian2 +組=zu3 +絅=jiong1,jiong3 +絆=ban4 +絇=qu2 +絈=mo4 +絉=shu4 +絊=zui4 +絋=kuang4 +経=jing1 +絍=ren4 +絎=hang2 +絏=xie4 +結=jie2,jie1 +絑=zhu1 +絒=chou2 +絓=gua4,kua1 +絔=bai3,mo4 +絕=jue2 +絖=kuang4 +絗=hu2 +絘=ci4 +絙=huan2,geng1 +絚=geng1 +絛=tao1 +絜=xie2,jie2 +絝=ku4 +絞=jiao3 +絟=quan2,shuan1 +絠=gai3,ai3 +絡=luo4,lao4 +絢=xuan4 +絣=beng1,bing1,peng1 +絤=xian4 +絥=fu2 +給=gei3,ji3 +絧=tong1,tong2,dong4 +絨=rong2 +絩=tiao4,diao4,dao4 +絪=yin1 +絫=lei3,lei4,lei2 +絬=xie4 +絭=juan4 +絮=xu4 +絮叨=xu4,dao2 +絮絮叨叨=xu4,xu4,dao1,dao1 +絯=gai1,hai4 +絰=die2 +統=tong3 +絲=si1 +絳=jiang4 +絴=xiang2 +絵=hui4 +絶=jue2 +絷=zhi2 +絸=jian3 +絹=juan4 +絺=chi1,zhi3 +絻=mian3,wen4,man2,wan4 +絼=zhen4 +絽=lv3 +絾=cheng2 +絿=qiu2 +綀=shu1 +綁=bang3 +綂=tong3 +綃=xiao1 +綄=huan2,huan4,wan4 +綅=qin1,xian1 +綆=geng3 +綇=xu1 +綈=ti2,ti4 +綉=xiu4 +綊=xie2 +綋=hong2 +綌=xi4 +綍=fu2 +綎=ting1 +綏=sui2 
+綐=dui4 +綑=kun3 +綒=fu1 +經=jing1 +綔=hu4 +綕=zhi1 +綖=yan2,xian4 +綗=jiong3 +綘=feng2 +継=ji4 +続=xu4 +綛=ren3 +綜=zong1,zeng4 +綝=lin2,chen1 +綞=duo3 +綟=li4,lie4 +綠=lv4 +綡=jing1 +綢=chou2 +綣=quan3 +綤=shao4 +綥=qi2 +綦=qi2 +綦溪利跂=qi2,xi1,li4,gui4 +綧=zhun3,zhun4 +綨=ji1,qi2 +綩=wan3 +綪=qian4,qing1,zheng1 +綫=xian4 +綬=shou4 +維=wei2 +綮=qing4,qi3 +綯=tao2 +綰=wan3 +綱=gang1 +網=wang3 +綳=beng1,beng3,beng4 +綴=zhui4 +綵=cai3 +綶=guo3 +綷=cui4 +綸=lun2,guan1 +綹=liu3 +綺=qi3 +綻=zhan4 +綼=bi4 +綽=chuo4,chao1 +綾=ling2 +綿=mian2 +緀=qi1 +緁=ji1 +緂=tian2,tan3,chan1 +緃=zong1 +緄=gun3 +緅=zou1 +緆=xi1 +緇=zi1 +緈=xing4 +緉=liang3 +緊=jin3 +緋=fei1 +緌=rui2 +緍=min2 +緎=yu4 +総=zong3 +緐=fan2 +緑=lv4,lu4 +緒=xu4 +緓=ying1 +緔=shang4 +緕=zi1 +緖=xu4 +緗=xiang1 +緘=jian1 +緙=ke4 +線=xian4 +緛=ruan3,ruan4 +緜=mian2 +緝=ji1,qi1 +緞=duan4 +緟=chong2,zhong4 +締=di4 +緡=min2 +緢=miao2,mao2 +緣=yuan2 +緤=xie4,ye4 +緥=bao3 +緦=si1 +緧=qiu1 +編=bian1 +緩=huan3 +緪=geng1,geng4 +緫=zong3 +緬=mian3 +緭=wei4 +緮=fu4 +緯=wei3 +緰=tou1,xu1,shu1 +緱=gou1 +緲=miao3 +緳=xie2 +練=lian4 +緵=zong1,zong4 +緶=bian4,pian2 +緷=gun3,yun4 +緸=yin1 +緹=ti2 +緺=gua1,wo1 +緻=zhi4 +緼=yun4,yun1,wen1 +緽=cheng1 +緾=chan2 +緿=dai4 +縀=xie2 +縁=yuan2 +縂=zong3 +縃=xu1 +縄=sheng2 +縅=wei1 +縆=geng1,geng4 +縈=ying2 +縉=jin4 +縊=yi4 +縋=zhui4 +縌=ni4 +縍=bang1,bang4 +縎=gu3,hu2 +縏=pan2 +縐=zhou4 +縑=jian1 +縒=ci1,cuo4,suo3 +縓=quan2 +縔=shuang3 +縕=yun4,yun1,wen1 +縖=xia2 +縗=cui1,sui1,shuai1 +縘=xi4 +縙=rong2,rong3,rong4 +縚=tao1 +縛=fu4 +縜=yun2 +縝=zhen3 +縞=gao3 +縟=ru4 +縠=hu2 +縡=zai4,zeng1 +縢=teng2 +縣=xian4,xuan2 +縤=su4 +縥=zhen3 +縦=zong4 +縧=tao1 +縨=huang3 +縩=cai4 +縪=bi4 +縫=feng2,feng4 +縬=cu4 +縭=li2 +縮=suo1,su4 +縯=yan3,yin3 +縰=xi3 +縱=zong4,zong3 +縲=lei2 +縳=zhuan4,juan4 +縴=qian4 +縵=man4 +縶=zhi2 +縷=lv3 +縸=mu4,mo4 +縹=piao3,piao1 +縺=lian2 +縻=mi2 +縼=xuan4 +總=zong3 +績=ji4 +縿=shan1 +繀=sui4 +繁=fan2,po2 +繂=lv4 +繃=beng1,beng3,beng4 +繄=yi1 +繅=sao1 +繆=mou2,miu4,miao4,mu4,liao3 +繇=yao2,you2,zhou4 +繈=qiang3 +繉=sheng2 +繊=xian1 +繋=ji4 +繌=zong1,zong4 +繍=xiu4 +繎=ran2 +繏=xuan4 +繐=sui4 +繑=qiao1 +繒=zeng1,zeng4 +繓=zuo3 +織=zhi1,zhi4 +繕=shan4 +繖=san3 +繗=lin2 
+繘=ju2,jue2 +繙=fan1 +繚=liao2 +繛=chuo1,chuo4 +繜=zun1,zun3 +繝=jian4 +繞=rao4 +繟=chan3,chan2 +繠=rui3 +繡=xiu4 +繢=hui4,hui2 +繣=hua4 +繤=zuan3 +繥=xi1 +繦=qiang3 +繧=wen2 +繨=da5 +繩=sheng2 +繪=hui4 +繫=xi4,ji4 +繬=se4 +繭=jian3 +繮=jiang1 +繯=huan2 +繰=qiao1,sao1 +繱=cong1 +繲=xie4 +繳=jiao3,zhuo2 +繴=bi4 +繵=dan4,tan2,chan2 +繶=yi4 +繷=nong3 +繸=sui4 +繹=yi4 +繺=sha1 +繻=ru2 +繼=ji4 +繽=bin1 +繾=qian3 +繿=lan2 +纀=pu2,fu2 +纁=xun1 +纂=zuan3 +纃=zi1 +纄=peng2 +纅=yao4,li4 +纆=mo4 +纇=lei4 +纈=xie4 +纉=zuan3 +纊=kuang4 +纋=you1 +續=xu4 +纍=lei2 +纎=xian1 +纏=chan2 +纐=jiao3 +纑=lu2 +纒=chan2 +纓=ying1 +纔=cai2 +纕=xiang1,rang3 +纖=xian1 +纗=zui1 +纘=zuan3 +纙=luo4 +纚=li2,xi3,li3,sa3 +纛=dao4 +纜=lan3 +纝=lei2 +纞=lian4 +纟=si1 +纠=jiu1 +纡=yu1 +红=hong2,gong1 +红不棱登=hong2,bu4,leng1,deng1 +红彤彤=hong2,tong1,tong1 +红得发紫=hong2,de5,fa1,zi3 +红晕=hong2,yun4 +红曲=hong2,qu1 +红杉=hong2,shan1 +红殷殷=hong2,yin1,yin1 +红澄澄=hong2,deng4,deng4 +红绳系足=hong2,sheng2,ji4,zu2 +红苕=hong2,shao2 +红颜薄命=hong2,yan2,bo2,ming4 +纣=zhou4 +纤=xian1,qian4 +纤夫=qian4,fu1 +纤手=qian4,shou3 +纤绳=qian4,sheng2 +纥=he2,ge1 +纥繨=ge1,da5 +约=yue1,yao1 +约数=yue1,shu4 +约法三章=yue1,fa3,san1,zhang1 +级=ji2 +级数=ji2,shu4 +纨=wan2 +纨袴子弟=wan2,ku4,zi3,di4 +纩=kuang4 +纪=ji4,ji3 +纪纲人论=ji4,gang1,ren2,lun2 +纫=ren4 +纬=wei3 +纭=yun2 +纮=hong2 +纯=chun2 +纯属骗局=chun2,shu3,pian4,ju2 +纯朴=chun2,pu3 +纰=pi1,pi2,bi3 +纰缪=pi1,miu4 +纱=sha1 +纲=gang1 +纳=na4 +纳降=na4,xiang2 +纴=ren4 +纵=zong4,zong3 +纶=lun2,guan1 +纶巾=guan1,jin1 +纷=fen1 +纸=zhi3 +纸包不住火=zhi3,bao1,bu4,zhu4,huo3 +纸夹=zhi3,jia1 +纹=wen2,wen4 +纺=fang3 +纻=zhu4 +纼=zhen4 +纽=niu3 +纾=shu1 +线=xian4 +线坯子=xian4,pi1,zi5 +绀=gan4 +绁=xie4 +绂=fu2 +练=lian4 +组=zu3 +组分=zu3,fen4 +组织得当=zu3,zhi1,de2,dang4 +组长=zu3,zhang3 +绅=shen1 +细=xi4 +细发=xi4,fa4 +细嚼慢咽=xi4,jiao2,man4,yan4 +细屑子=xi4,xie4,zi5 +细挑=xi4,tiao1 +细菌=xi4,jun1 +细菌域=xi4,jun1,yu4 +细菌炸弹=xi4,jun1,zha4,dan4 +细菌界=xi4,jun1,jie4 +细高挑儿=xi4,gao1,tiao3,er2 +织=zhi1,zhi4 +终=zhong1 +终了=zhong1,liao3 +绉=zhou4 +绊=ban4 +绋=fu2 +绌=chu4 +绍=shao4 +绍兴=shao4,xing1 +绍兴戏=shao4,xing1,xi4 +绍兴酒=shao4,xing1,jiu3 +绎=yi4 +经=jing1 +经传=jing1,zhuan4 +经卷=jing1,juan4 
+经幢=jing1,chuang2 +经年累月=jing1,nian2,lei3,yue4 +经纶济世=jing1,lun2,ji4,shi4 +经血=jing1,xue4 +绐=dai4 +绑=bang3 +绑扎=bang3,za1 +绒=rong2 +结=jie2,jie1 +结发=jie2,fa4 +结子=jie1,zi3 +结实=jie1,shi5 +结巴=jie1,ba1 +结果=jie1,guo3 +结核杆菌=jie2,he2,gan3,jun1 +结结巴巴=jie1,jie1,ba1,ba1 +绔=ku4 +绕=rao4 +绕圈子=rao4,quan1,zi5 +绖=die2 +绗=hang2 +绘=hui4 +绘划=hui4,hua4 +给=gei3,ji3 +给予=ji3,yu3 +给付=ji3,fu4 +给体=ji3,ti3 +给养=ji3,yang3 +给水=ji3,shui3 +给水器=ji3,shui3,qi4 +给水工程=ji3,shui3,gong1,cheng2 +给水站=ji3,shui3,zhan4 +给水管=ji3,shui3,guan3 +给水箱=ji3,shui3,xiang1 +给脸不要脸=gei3,lian3,bu2,yao4,lian3 +给面子=gei3,mian4,zi5 +绚=xuan4 +绛=jiang4 +络=luo4,lao4 +络子=lao4,zi5 +络腮=luo4,sai1 +绝=jue2 +绝着=jue2,zhao1 +绞=jiao3 +统=tong3 +统率=tong3,shuai4 +绠=geng3 +绡=xiao1 +绢=juan4 +绣=xiu4 +绣花枕头=xiu4,hua1,zhen3,tou5 +绤=xi4 +绥=sui2 +绦=tao1 +继=ji4 +绨=ti2,ti4 +绨袍之义=ti2,pao2,zhi1,yi4 +绩=ji4 +绪=xu4 +绫=ling2 +绬=ying1 +续=xu4 +绮=qi3 +绯=fei1 +绰=chuo4,chao1 +绰绰有余=chuo4,chuo4,you3,yu2 +绱=shang4 +绲=gun3 +绳=sheng2 +绳子=sheng2,zi5 +绳愆纠缪=sheng2,qian1,jiu1,miu4 +维=wei2 +维妙维肖=wei2,miao4,wei2,xiao4 +绵=mian2 +绵力薄材=mian2,li4,bo2,cai2 +绵里薄材=mian2,li3,bo2,cai2 +绶=shou4 +绷=beng1,beng3,beng4 +绷脸=beng3,lian3 +绸=chou2 +绹=tao2 +绺=liu3 +绻=quan3 +综=zong1,zeng4 +绽=zhan4 +绾=wan3 +绿=lv4,lu4 +绿女红男=lv4,nv3,hong2,nan2 +绿帽子=lv4,mao4,zi5 +绿林=lu4,lin2 +缀=zhui4 +缁=zi1 +缂=ke4 +缃=xiang1 +缄=jian1 +缅=mian3 +缆=lan3 +缇=ti2 +缈=miao3 +缉=ji1,qi1 +缉查=ji1,zha1 +缊=yun4,yun1,wen1 +缋=hui4,hui2 +缌=si1 +缍=duo3 +缎=duan4 +缏=bian4,pian2 +缐=xian4 +缑=gou1 +缒=zhui4 +缓=huan3 +缓破=huan3,po4 +缔=di4 +缕=lv3 +编=bian1 +缗=min2 +缘=yuan2 +缙=jin4 +缚=fu4 +缛=ru4 +缜=zhen3 +缝=feng2,feng4 +缝儿=feng4,er5 +缝子=feng4,zi5 +缝工=feng4,gong1 +缝扣子=feng2,kou4,zi5 +缝穷=feng4,qiong2 +缝线=feng4,xian4 +缝缀破衣服=feng2,zhui4,po4,yi1,fu5 +缝缝连连=feng4,feng4,lian2,lian2 +缝补衣服=feng2,bu3,yi1,fu5 +缝被子=feng2,bei4,zi5 +缝际=feng4,ji4 +缝隙=feng4,xi4 +缞=cui1,sui1,shuai1 +缟=gao3 +缠=chan2 +缠夹不清=chan2,jia1,bu4,qing1 +缡=li2 +缢=yi4 +缣=jian1 +缤=bin1 +缥=piao3,piao1 +缥缈=piao1,miao3 +缦=man4 +缧=lei2 +缨=ying1 +缨子=ying1,zi5 +缩=suo1,su4 +缩砂=su4,sha1 +缩砂密=su4,sha1,mi4 
+缪=mou2,miu4,miao4,mu4,liao3 +缪种流传=miu4,zhong3,liu2,chuan2 +缫=sao1 +缬=xie2 +缭=liao2 +缮=shan4 +缯=zeng1,zeng4 +缰=jiang1 +缱=qian3 +缲=qiao1,sao1 +缳=huan2 +缴=jiao3,zhuo2 +缵=zuan3 +缶=fou3 +缷=xie4 +缸=gang1 +缹=fou3 +缺=que1 +缻=fou3 +缼=que1 +缽=bo1 +缾=ping2 +缿=xiang4 +罀=zhao4 +罁=gang1 +罂=ying1 +罃=ying1 +罄=qing4 +罅=xia4 +罆=guan4 +罇=zun1 +罈=tan2 +罉=cheng1 +罊=qi4 +罋=weng4 +罌=ying1 +罍=lei2 +罎=tan2 +罏=lu2 +罐=guan4 +罐头=guan4,tou5 +网=wang3 +罒=wang3 +罓=wang3 +罔=wang3 +罕=han3 +罖=wang3 +罗=luo2 +罗织构陷=luo4,zhi1,gou4,xian4 +罘=fu2 +罙=shen1 +罚=fa2 +罚不当罪=fa2,bu4,dang1,zui4 +罛=gu1 +罜=zhu3 +罝=ju1 +罞=mao2 +罟=gu3 +罠=min2 +罡=gang1 +罢=ba4,ba5,pi2 +罢黜百家=ba1,chu4,bai3,jia1 +罣=gua4 +罤=ti2 +罥=juan4 +罦=fu2 +罧=shen1 +罨=yan3 +罩=zhao4 +罪=zui4 +罪不胜诛=zui4,bu4,sheng4,zhu1 +罪有应得=zui4,you3,ying1,de2 +罪行累累=zui4,xing2,lei3,lei4 +罪责难逃=zui4,ze2,nan2,tao2 +罫=guai3,gua4 +罬=zhuo2 +罭=yu4 +置=zhi4 +罯=an3 +罰=fa2 +罱=lan3 +署=shu3 +罳=si1 +罴=pi2 +罵=ma4 +罶=liu3 +罷=ba4,ba5,pi2 +罸=fa2 +罹=li2 +罹难=li2,nan4 +罺=chao2 +罻=wei4 +罼=bi4 +罽=ji4 +罾=zeng1 +罿=chong1 +羀=liu3 +羁=ji1 +羂=juan4 +羃=mi4 +羄=zhao4 +羅=luo2 +羆=pi2 +羇=ji1 +羈=ji1 +羉=luan2 +羊=yang2,xiang2 +羊圈=yang2,juan4 +羊羔美酒=yan2,gao1,mei3,jiu3 +羊肚=yang2,du3 +羋=mi3 +羌=qiang1 +羍=da2 +美=mei3 +美不胜收=mei3,bu4,sheng4,shou1 +美如冠玉=mei3,ru2,guan1,yu4 +美差=mei3,chai1 +羏=yang2,xiang2 +羐=ling2 +羑=you3 +羒=fen2 +羓=ba1 +羔=gao1 +羕=yang4 +羖=gu3 +羗=qiang1 +羘=zang1 +羙=mei3,gao1 +羚=ling2 +羛=yi4,xi1 +羜=zhu4 +羝=di1 +羞=xiu1 +羞与为伍=xiu1,yu3,wei2,wu3 +羞人答答=xiu1,ren2,da1,da1 +羞恶=xiu1,wu4 +羞答答=xiu1,da1,da1 +羞羞答答=xiu1,xiu1,da1,da1 +羞臊=xiu1,sao4 +羟=qiang3 +羠=yi2 +羡=xian4 +羢=rong2 +羣=qun2 +群=qun2 +群居穴处=qun2,ju1,xue2,chu3 +群雌粥粥=qun2,ci2,yu4,yu4 +羥=qiang3 +羦=huan2 +羧=suo1 +羨=xian4 +義=yi4 +羪=you1 +羫=qiang1,kong4 +羬=qian2,xian2,yan2 +羭=yu2 +羮=geng1 +羯=jie2 +羰=tang1 +羱=yuan2 +羲=xi1 +羳=fan2 +羴=shan1 +羵=fen2 +羶=shan1 +羷=lian3 +羸=lei2 +羹=geng1 +羺=nou2 +羻=qiang4 +羼=chan4 +羽=yu3 +羽冠=yu3,guan1 +羽扇纶巾=yu3,shan4,guan1,jin1 +羾=hong2,gong4 +羿=yi4 +翀=chong1 +翁=weng1 +翂=fen1 +翃=hong2 +翄=chi4 +翅=chi4 +翅膀=chi4,bang3 +翆=cui4 +翇=fu2 
+翈=xia2 +翉=ben3 +翊=yi4 +翋=la4 +翌=yi4 +翍=pi1,bi4,po1 +翎=ling2 +翏=liu4 +翐=zhi4 +翑=qu2,yu4 +習=xi2 +翓=xie2 +翔=xiang2 +翕=xi1 +翖=xi1 +翗=ke2 +翘=qiao2,qiao4 +翘尾巴=qiao4,wei3,ba1 +翘棱=qiao2,leng1 +翘楚=qiao2,chu3 +翘翘=qiao4,qiao4 +翘舌音=qiao4,she2,yin1 +翘起=qiao4,qi3 +翘辫子=qiao4,bian4,zi5 +翘首=qiao2,shou3 +翙=hui4 +翚=hui1 +翛=xiao1 +翜=sha4 +翝=hong2 +翞=jiang1 +翟=di2,zhai2 +翠=cui4 +翡=fei3 +翢=dao4,zhou1 +翣=sha4 +翤=chi4 +翥=zhu4 +翦=jian3 +翧=xuan1 +翨=chi4 +翩=pian1 +翩翩年少=pian1,pian1,nian2,shao3 +翪=zong1 +翫=wan2 +翬=hui1 +翭=hou2 +翮=he2 +翯=he4 +翰=han4 +翱=ao2 +翲=piao1 +翳=yi4 +翴=lian2 +翵=hou2,qu2 +翶=ao2 +翷=lin2 +翸=pen3 +翹=qiao2,qiao4 +翺=ao2 +翻=fan1 +翻供=fan1,gong4 +翻斗=fan1,dou3 +翻查=fan1,zha1 +翻空出奇=fan1,kong1,chu1,qi2 +翻肠倒肚=fan1,chang2,dao3,du4 +翻跟头=fan1,gen1,tou5 +翻跟斗=fan1,gen1,dou3 +翻黄倒皁=fan1,huang2,dao3,yi2 +翼=yi4 +翽=hui4 +翾=xuan1 +翿=dao4 +耀=yao4 +老=lao3 +老伯=lao3,bo2 +老伯伯=lao3,bo2,bo5 +老区=lao3,ou1 +老大难=lao3,da4,nan4 +老头儿=lao3,tou5,er5 +老头子=lao3,tou2,zi5 +老婆婆=lao3,po2,po5 +老子=lao3,zi5 +老将=lao3,jiang4 +老少=lao3,shao4 +老师宿儒=lao3,shi1,xiu3,ru2 +老来少=lao3,lai2,shao4 +老疙瘩=lao3,ge1,da1 +老白干儿=lao3,bai2,gan1,er2 +老着脸皮=lao3,zhe5,lian3,pi2 +老老少少=lao3,lao3,shao4,shao4 +老而不死是为贼=lao3,er2,bu4,si3,shi4,wei2,zei2 +老调=lao3,diao4 +老调重弹=lao3,diao4,chong2,tan2 +老调重谈=lao3,diao4,chong2,tan2 +老骨头=lao3,gu2,tou5 +老鼠夹=lao3,shu3,jia1 +耂=lao3 +考=kao3 +考卷=kao3,juan4 +考查=kao3,zha1 +考量=kao3,liang2 +耄=mao4 +者=zhe3 +耆=qi2,shi4 +耇=gou3 +耈=gou3 +耉=gou3 +耊=die2 +耋=die2 +而=er2 +耍=shua3 +耍横=shua3,heng4 +耎=ruan3,nuo4 +耏=er2,nai4 +耐=nai4 +耑=duan1,zhuan1 +耒=lei3 +耓=ting1 +耔=zi3 +耕=geng1 +耕种=geng1,zhong4 +耖=chao4 +耗=hao4 +耗子=hao4,zi5 +耘=yun2 +耙=ba4,pa2 +耙地=pa2,di4 +耙子=pa2,zi5 +耚=pi1 +耛=si4,chi2 +耜=si4 +耝=qu4,chu2 +耞=jia1 +耟=ju4 +耠=huo1 +耡=chu2 +耢=lao4 +耣=lun2,lun3 +耤=ji2,jie4 +耥=tang3 +耦=ou3 +耧=lou2 +耨=nou4 +耩=jiang3 +耪=pang3 +耫=zha2,ze2 +耬=lou2 +耭=ji1 +耮=lao4 +耯=huo4 +耰=you1 +耱=mo4 +耲=huai2 +耳=er3 +耳掴子=er3,guai1,zi5 +耳目闭塞=er3,mu4,bi4,sai1 +耳背=er3,bei4 +耴=yi4 +耵=ding1 +耶=ye2,ye1 +耶和华=ye1,he2,hua2 +耶教=ye1,jiao4 +耶稣=ye1,su1 +耶稣会=ye1,su1,hui4 
+耶稣教=ye1,su1,jiao4 +耷=da1 +耷拉=da1,la5 +耸=song3 +耸肩曲背=song3,jian1,qu1,bei4 +耸肩缩背=song3,jian1,suo1,bei4 +耹=qin2 +耺=yun2,ying2 +耻=chi3 +耻与哙伍=chi3,yu2,kuai4,wu3 +耻骨=chi3,gu3 +耼=dan1 +耽=dan1 +耾=hong2 +耿=geng3 +聀=zhi2 +聁=pan4 +聂=nie4 +聃=dan1 +聄=zhen3 +聅=che4 +聆=ling2 +聇=zheng1 +聈=you3 +聉=wa4,tui3,zhuo2 +聊=liao2 +聊以塞责=liao2,yi3,se4,ze2 +聋=long2 +聋子=long2,zi5 +职=zhi2 +聍=ning2 +聎=tiao1 +聏=er2,nv4 +聐=ya4 +聑=tie1,zhe2 +聒=guo1 +聓=xu4 +联=lian2 +联篇累牍=lian2,pian1,lei3,du2 +聕=hao4 +聖=sheng4 +聗=lie4 +聘=pin4 +聙=jing1 +聚=ju4 +聚变反应=ju4,bian4,fan3,ying4 +聚米为山=ju4,mi3,wei2,shan1 +聚米为谷=ju4,mi3,wei2,gu3 +聚酰亚胺=ju4,xian1,ya4,an1 +聛=bi3 +聜=di3,zhi4 +聝=guo2 +聞=wen2 +聟=xu4 +聠=ping1 +聡=cong1 +聢=ding4 +聣=ni2 +聤=ting2 +聥=ju3 +聦=cong1 +聧=kui1 +聨=lian2 +聩=kui4 +聪=cong1 +聫=lian2 +聬=weng1 +聭=kui4 +聮=lian2 +聯=lian2 +聰=cong1 +聱=ao2 +聱牙诘屈=ao2,ya2,jie2,qu1 +聱牙诘曲=ao2,ya2,jie2,qu1 +聲=sheng1 +聳=song3 +聴=ting1 +聵=kui4 +聶=nie4 +職=zhi2 +聸=dan1 +聹=ning2 +聺=qie2 +聻=ni3,jian4 +聼=ting1 +聽=ting1 +聾=long2 +聿=yu4 +肀=yu4 +肁=zhao4 +肂=si4 +肃=su4 +肄=yi4 +肅=su4 +肆=si4 +肆应=si4,ying4 +肆意妄为=si4,yi4,wang4,wei2 +肇=zhao4 +肈=zhao4 +肉=rou4 +肉冠=rou4,guan1 +肉包子打狗=rou4,bao1,zi5,da3,gou3 +肉薄骨并=rou4,bo2,gu3,bing4 +肉铺=rou4,pu4 +肊=yi4 +肋=lei4,le1 +肌=ji1 +肍=qiu2 +肎=ken3 +肏=cao4 +肐=ge1 +肑=bo2,di2 +肒=huan4 +肓=huang1 +肔=chi3 +肕=ren4 +肖=xiao1,xiao4 +肖像=xiao4,xiang4 +肗=ru3 +肘=zhou3 +肙=yuan1 +肚=du4,du3 +肚子=du3,zi5 +肚子痛=du4,zi5,tong4 +肚子饿=du4,zi5,e4 +肚片=du3,pian4 +肛=gang1 +肜=rong2,chen1 +肝=gan1 +肞=chai1 +肟=wo4 +肠=chang2 +肠子=chang2,zi5 +股=gu3 +股长=gu3,zhang3 +肢=zhi1 +肣=qin2,han2,han4 +肤=fu1 +肤受之愬=fu1,shou4,zhi1,xiang1 +肤皮潦草=fu1,pi3,liao3,cao3 +肤见謭识=fu1,jian4,guang3,shi2 +肥=fei2 +肦=ban1 +肧=pei1 +肨=pang4,pan2,pan4 +肩=jian1 +肩摩毂接=jian1,mo2,gu1,jie1 +肩背=jian1,bei4 +肩背相望=jian1,bei4,xiang1,wang4 +肩背难望=jian1,bei4,nan2,wang4 +肪=fang2 +肫=zhun1,chun2 +肬=you2 +肭=na4 +肮=ang1 +肮脏=ang1,zang1 +肯=ken3 +肰=ran2 +肱=gong1 +育=yu4 +肳=wen3 +肴=yao2 +肵=qi2 +肶=pi2,bi3,bi4 +肷=qian3 +肸=xi1 +肹=xi1 +肺=fei4 +肻=ken3 +肼=jing3 +肽=tai4 +肾=shen4 +肿=zhong3 +胀=zhang4 +胁=xie2 
+胁肩累足=xie2,jian1,lei3,zu2 +胁肩絫足=xie2,jian1,lei4,zu2 +胂=shen4 +胃=wei4 +胄=zhou4 +胅=die2 +胆=dan3 +胆大如斗=dan3,da4,ru2,dou3 +胆大妄为=dan3,da4,wang4,wei2 +胆大心粗=dan3,da1,xin1,cu1 +胆子=dan3,zi5 +胇=fei4,bi4 +胈=ba2 +胉=bo2 +胊=qu2 +胋=tian2 +背=bei1,bei4 +背不住=bei4,bu2,zhu4 +背义忘恩=bei4,yu4,wang4,en1 +背义负信=bei4,yu4,fu4,xin4 +背义负恩=bei4,yu4,fu4,en1 +背乡离井=bei4,xiang1,li2,jing3 +背书=bei4,shu1 +背井离乡=bei4,jing3,li2,xiang1 +背人=bei4,ren2 +背信=bei4,xin4 +背信弃义=bei4,xin4,qi4,yi4 +背光=bei4,guang1 +背光性=bei4,guang1,xing4 +背公向私=bei4,gong1,xiang4,si1 +背公营私=bei4,gong1,ying2,si1 +背前面后=bei4,qian2,mian4,hou4 +背包=bei1,bao1 +背叛=bei4,pan4 +背后=bei4,hou4 +背向=bei4,xiang4 +背囊=bei4,nang2 +背地=bei4,di4 +背地里=bei4,di4,li3 +背城一战=bei4,cheng2,yi1,zhan4 +背城借一=bei4,cheng2,jie4,yi1 +背子=bei1,zi5 +背山起楼=bei4,shan1,qi3,lou2 +背弃=bei4,qi4 +背影=bei4,ying3 +背心=bei4,xin1 +背恩弃义=bei4,en1,qi4,yi4 +背恩忘义=bei4,en1,wang4,yi4 +背恩负义=bei4,en1,fu4,yi4 +背惠食言=bei4,hui4,shi2,yan2 +背搭子=bei4,da1,zi3 +背日性=bei4,ri4,xing4 +背时=bei4,shi2 +背景=bei4,jing3 +背暗投明=bei4,an4,tou2,ming2 +背曲腰弯=bei4,qu3,yao1,wan1 +背曲腰躬=bei4,qu3,yao1,gong1 +背本就末=bei4,ben3,jiu4,mo4 +背本趋末=bei4,ben3,qu1,mo4 +背气=bei4,qi4 +背水一战=bei4,shui3,yi1,zhan4 +背水阵=bei4,shui3,zhen4 +背熟=bei4,shu2 +背生芒刺=bei4,sheng1,mang2,ci4 +背盟败约=bei4,meng2,bai4,yue1 +背着手=bei4,zhe5,shou3 +背碑覆局=bei4,bei1,fu4,ju2 +背离=bei4,li2 +背篼=bei1,dou1 +背约=bei4,yue1 +背脊=bei4,ji3 +背腹受敌=bei4,fu4,shou4,di2 +背若芒刺=bei4,ruo4,mang2,ci4 +背街=bei4,jie1 +背袋=bei4,dai4 +背诵=bei4,song4 +背谬=bei4,miu4 +背负=bei1,fu4 +背运=bei4,yun4 +背道而驰=bei4,dao4,er2,chi2 +背部=bei4,bu4 +背阴=bei4,yin1 +背静=bei4,jing4 +背靠=bei4,kao4 +背靠背=bei4,kao4,bei4 +背面=bei4,mian4 +背风=bei4,feng1 +背风面=bei4,feng1,mian4 +背鳍=bei4,qi2 +背黑锅=bei1,hei1,guo1 +胍=gua1 +胎=tai1 +胎发=tai1,fa4 +胏=zi3,fei4 +胐=fei3,ku1 +胑=zhi1 +胒=ni4 +胓=ping2,peng1 +胔=zi4 +胕=fu1,fu2,zhou3 +胖=pang4,pan2,pan4 +胖子=pang4,zi5 +胗=zhen1 +胘=xian2 +胙=zuo4 +胚=pei1 +胛=jia3 +胜=sheng4 +胝=zhi1 +胞=bao1 +胟=mu3 +胠=qu1 +胡=hu2 +胡作乱为=hu2,zuo4,luan4,wei2 +胡作胡为=hu2,zuo4,hu2,wei2 +胡作非为=hu2,zuo4,fei1,wei2 +胡同=hu2,tong4 +胡子拉碴=hu2,zi5,la1,cha1 
+胡行乱为=hu2,xing2,luan4,wei2 +胢=qia4 +胣=chi3 +胤=yin4 +胥=xu1 +胦=yang1 +胧=long2 +胨=dong4 +胩=ka3 +胪=lu2 +胫=jing4 +胬=nu3 +胭=yan1 +胮=pang1 +胯=kua4 +胰=yi2 +胱=guang1 +胲=hai3 +胳=ge1,ge2 +胳膊=ge1,bo5 +胳臂=ge1,bei5 +胴=dong4 +胵=chi1 +胶=jiao1 +胶着=jiao1,zhe5 +胷=xiong1 +胸=xiong1 +胸中有数=xiong1,zhong1,you3,shu4 +胸脯=xiong1,pu2 +胸闷=xiong1,men1 +胸闷难受=xiong1,men1,nan2,shou4 +胹=er2 +胺=an4 +胻=heng2 +胼=pian2 +能=neng2,nai4 +能不称官=neng2,bu4,chen4,guan1 +能干=neng2,gan4 +能者为师=neng2,zhe3,wei2,shi1 +能量不灭定律=neng2,liang4,bu2,mie4,ding4,lv4 +胾=zi4 +胿=gui1,kui4 +脀=zheng1 +脁=tiao3 +脂=zhi1 +脃=cui4 +脄=mei2 +脅=xie2 +脆=cui4 +脆弱=cui4,ruo4 +脇=xie2 +脈=mai4 +脉=mai4,mo4 +脉脉=mo4,mo4 +脉脉相通=mai4,mai4,xiang1,tong1 +脊=ji3 +脊背=ji3,bei4 +脋=xie2 +脌=nin2 +脍=kuai4 +脎=sa4 +脏=zang4 +脏乱差=zang1,luan4,cha4 +脏了=zang1,le5 +脏污狼藉=zang1,wu1,lang2,ji2 +脏稀稀=zang1,xi1,xi1 +脏衣服=zang1,yi1,fu2 +脏话=zang1,hua4 +脐=qi2 +脑=nao3 +脑壳=nao3,ke2 +脑子=nao3,zi5 +脑涨=nao3,zhang4 +脑溢血=nao3,yi4,xue4 +脑血栓形成=nao3,xue4,shuan1,xing2,cheng2 +脒=mi3 +脓=nong2 +脓血=nong2,xue4 +脔=luan2 +脕=wan4 +脖=bo2 +脖子=bo2,zi5 +脖颈=bo2,geng3 +脖颈子=bo2,geng3,zi5 +脗=wen3 +脘=wan3 +脙=xiu1 +脚=jiao3 +脚踏两只船=jiao3,ta4,liang3,zhi1,chuan2 +脛=jing4 +脜=rou2 +脝=heng1 +脞=cuo3 +脟=lie4 +脠=shan1 +脡=ting3 +脢=mei2 +脣=chun2 +脤=shen4 +脥=jia2 +脦=te4 +脧=juan1 +脨=cu4 +脩=xiu1 +脪=xin4 +脫=tuo1 +脬=pao1 +脭=cheng2 +脮=nei3 +脯=fu3,pu2 +脰=dou4 +脱=tuo1 +脱发=tuo1,fa4 +脱壳=tuo1,qiao4 +脱壳金蝉=tuo1,ke2,jin1,chan2 +脱衣服=tuo1,yi1,fu5 +脲=niao4 +脳=nao3 +脴=pi3 +脵=gu3 +脶=luo2 +脷=li4 +脸=lian3 +脸不红心不跳=lian3,bu4,hong2,xin1,bu2,tiao4 +脹=zhang4 +脺=cui1 +脻=jie1 +脼=liang3 +脽=shui2 +脾=pi2 +脾气很拗=pi2,qi4,hen3,niu4 +脿=biao1 +腀=lun2 +腁=pian2 +腂=guo4 +腃=juan4 +腄=chui2 +腅=dan4 +腆=tian3 +腇=nei3 +腈=jing1 +腉=nai2 +腊=la4,xi1 +腋=ye4 +腌=a1,yan1 +腌制=yan1,zhi4 +腌渍=yan1,zi4 +腌肉=yan1,rou4 +腌菜=yan1,cai4 +腍=ren4 +腎=shen4 +腏=zhui4 +腐=fu3 +腑=fu3 +腒=ju1 +腓=fei2 +腔=qiang1 +腔调=qiang1,diao4 +腕=wan4 +腖=dong4 +腗=pi2 +腘=guo2 +腙=zong1 +腚=ding4 +腛=wo4 +腜=mei2 +腝=ruan3 +腞=zhuan4 +腟=chi4 +腠=cou4 +腡=luo2 +腢=ou3 +腣=di4 +腤=an1 +腥=xing1 +腥闻在上=xing2,wen2,zai4,shang4 
+腥风血雨=xing1,feng1,xue4,yu3 +腦=nao3 +腧=shu4 +腨=shuan4 +腩=nan3 +腪=yun4 +腫=zhong3 +腬=rou2 +腭=e4 +腮=sai1 +腮帮子=sai1,bang1,zi5 +腯=tu2 +腰=yao1 +腰杆=yao1,gan3 +腰杆子=yao1,gan1,zi5 +腰背=yao1,bei4 +腰酸背痛=yao1,suan1,bei4,tong4 +腱=jian4 +腲=wei3 +腳=jiao3 +腴=yu2 +腵=jia1 +腶=duan4 +腷=bi4 +腸=chang2 +腹=fu4 +腹背之毛=fu4,bei4,zhi1,mao2 +腹背受敌=fu4,bei4,shou4,di2 +腺=xian4 +腻=ni4 +腼=mian3 +腽=wa4 +腾=teng2 +腿=tui3 +腿肚子=tui3,du4,zi5 +膀=bang3 +膀胱=pang2,guang1 +膁=qian3 +膂=lv3 +膃=wa4 +膄=shou4 +膅=tang2 +膆=su4 +膇=zhui4 +膈=ge2 +膉=yi4 +膊=bo2 +膋=liao2 +膌=ji2 +膍=pi2 +膎=xie2 +膏=gao1,gao4 +膏唇岐舌=gao4,chun2,qi2,she2 +膏唇贩舌=gao4,chun2,fan4,she2 +膏场绣浍=gao1,chang2,xiu4,kuai4 +膏泽=gao4,ze2 +膏泽脂香=gao1,ze2,zhi1,xiang1 +膏粱年少=gao1,liang2,nian2,shao4 +膏血=gao1,xue4 +膏车秣马=gao4,che1,mo4,ma3 +膐=lv3 +膑=bin4 +膒=ou1 +膓=chang2 +膔=lu4,biao1 +膕=guo2 +膖=pang1 +膗=chuai2 +膘=biao1 +膙=jiang3 +膚=fu1 +膛=tang2 +膜=mo2 +膝=xi1 +膝痒搔背=xi1,yang3,sao1,bei4 +膞=zhuan1,chuan2,chun2,zhuan3 +膟=lv4 +膠=jiao1 +膡=ying4 +膢=lv2 +膣=zhi4 +膤=xue3 +膥=cun1 +膦=lin4 +膧=tong2 +膨=peng2 +膨胀系数=peng2,zhang4,xi4,shu4 +膩=ni4 +膪=chuai4 +膫=liao2 +膬=cui4 +膭=kui4 +膮=xiao1 +膯=teng1 +膰=fan2,pan2 +膱=zhi2 +膲=jiao1 +膳=shan4 +膴=hu1,wu3 +膵=cui4 +膶=run4 +膷=xiang1 +膸=sui3 +膹=fen4 +膺=ying1 +膻=shan1,dan4 +膻中=dan4,zhong1 +膼=zhua1 +膽=dan3 +膾=kuai4 +膿=nong2 +臀=tun2 +臁=lian2 +臂=bi4,bei5 +臂膀=bi4,bang3 +臃=yong1 +臄=jue2 +臅=chu4 +臆=yi4 +臆度=yi4,duo2 +臇=juan3 +臈=la4,ge2 +臉=lian3 +臊=sao1,sao4 +臊子=sao4,zi5 +臋=tun2 +臌=gu3 +臍=qi2 +臎=cui4 +臏=bin4 +臐=xun1 +臑=nao4 +臒=wo4,yue4 +臓=zang4 +臔=xian4 +臕=biao1 +臖=xing4 +臗=kuan1 +臘=la4 +臙=yan1 +臚=lu2 +臛=huo4 +臜=za1 +臝=luo3 +臞=qu2 +臟=zang4 +臠=luan2 +臡=ni2,luan2 +臢=za1 +臣=chen2 +臣仆=chen2,pu2 +臤=qian1,xian2 +臥=wo4 +臦=guang4,jiong3 +臧=zang1,zang4,cang2 +臧否=zang1,pi3 +臧否人物=zang1,pi3,ren2,wu4 +臨=lin2 +臩=guang3,jiong3 +自=zi4 +自个儿=zi4,ge3,er2 +自以为是=zi4,yi3,wei2,shi4 +自传=zi4,zhuan4 +自动倒卷=zi4,dong4,dao4,juan3 +自动倒带=zi4,dong4,dao4,dai4 +自各儿=zi4,ge3,er2 +自怨自艾=zi4,yuan4,zi4,yi4 +自我吹嘘=zi4,wo3,chui1,xu1 +自相惊忧=zi4,xiang1,jing1,you1 +自省=zi4,xing3 +自筹给养=zi4,chou2,ji3,yang3 
+自给=zi4,ji3 +自给自足=zi4,ji3,zi4,zu2 +自背黑锅=zi4,bei4,hei1,guo1 +自转=zi4,zhuan4 +臫=jiao3 +臬=nie4 +臭=chou4,xiu4 +臭味相投=xiu4,wei4,xiang1,tou2 +臭揍一顿=chou4,zou4,yi2,dun4 +臭豆腐=chou4,dou4,fu3 +臮=ji4 +臯=gao1 +臰=chou4 +臱=mian2,bian1 +臲=nie4 +至=zhi4 +至为=zhi4,wei2 +至当不易=zhi4,dang4,bu4,yi4 +致=zhi4 +致远任重=zhi4,yuan3,ren4,zhang4 +臵=ge2 +臶=jian4 +臷=die2,zhi2 +臸=zhi1,jin4 +臹=xiu1 +臺=tai2 +臻=zhen1 +臼=jiu4 +臽=xian4 +臾=yu2 +臿=cha1 +舀=yao3 +舁=yu2 +舂=chong1 +舂容大雅=chong1,rong2,da4,ya2 +舃=xi4 +舄=xi4 +舅=jiu4 +舆=yu2 +與=yu3 +興=xing1 +舉=ju3 +舊=jiu4 +舋=xin4 +舌=she2 +舌头=she2,tou5 +舌苔=she2,tai1 +舍=she3,she4 +舍亲=she4,qin1 +舍去=she3,qu4 +舍己=she3,ji3 +舍己为人=she3,ji3,wei4,ren2 +舍弃=she3,qi4 +舍得=she3,de2 +舍监=she4,jian1 +舍身为国=she3,shen1,wei2,guo2 +舍车保帅=she3,ju1,bao3,shuai4 +舎=she4 +舏=jiu3 +舐=shi4 +舑=tan1 +舒=shu1 +舒卷=shu1,juan4 +舒服=shu1,fu5 +舓=shi4 +舔=tian3 +舕=tan4 +舖=pu4 +舗=pu4 +舘=guan3 +舙=hua4 +舚=tian4 +舛=chuan3 +舜=shun4 +舝=xia2 +舞=wu3 +舟=zhou1 +舠=dao1 +舡=chuan2 +舢=shan1 +舣=yi3 +舤=fan2 +舥=pa1 +舦=tai4 +舧=fan2 +舨=ban3 +舩=chuan2 +航=hang2 +舫=fang3 +般=ban1 +般桓=pan2,huan2 +般若=bo1,re3 +舭=bi3 +舮=lu2 +舯=zhong1 +舰=jian4 +舰只=jian4,zhi1 +舱=cang1 +舲=ling2 +舳=zhu2 +舴=ze2 +舵=duo4 +舶=bo2 +舷=xian2 +舸=ge3 +船=chuan2 +船只=chuan2,zhi1 +船长=chuan2,zhang3 +舺=xia2 +舻=lu2 +舼=qiong2 +舽=pang2 +舾=xi1 +舿=kua1 +艀=fu2 +艁=zao4 +艂=feng2 +艃=li2 +艄=shao1 +艅=yu2 +艆=lang2 +艇=ting3 +艈=yu4 +艉=wei3 +艊=bo2 +艋=meng3 +艌=nian4 +艍=ju1 +艎=huang2 +艏=shou3 +艐=ke4 +艑=bian4 +艒=mu4 +艓=die2 +艔=dao4 +艕=bang4 +艖=cha1 +艗=yi4 +艘=sou1 +艙=cang1 +艚=cao2 +艛=lou2 +艜=dai4 +艝=xue3 +艞=yao4 +艟=chong1 +艠=deng1 +艡=dang1 +艢=qiang2 +艣=lu3 +艤=yi3 +艥=ji2 +艦=jian4 +艧=huo4 +艨=meng2 +艩=qi2 +艪=lu3 +艫=lu2 +艬=chan2 +艭=shuang1 +艮=gen4 +良=liang2 +良将=liang2,jiang4 +艰=jian1 +艱=jian1 +色=se4 +色厉胆薄=se4,li4,dan3,bo2 +色子=shai3,zi3 +色差=se4,cha1 +色晕=se4,yun4 +色相=se4,xiang4 +色调=se4,diao4 +艳=yan4 +艴=fu2 +艴然不悦=fu2,ran2,bu4,yue4 +艵=ping1 +艶=yan4 +艷=yan4 +艸=cao3 +艹=ao3 +艺=yi4 +艻=le4 +艼=ding3 +艽=jiao1,qiu2 +艾=ai4,yi4 +艿=nai3 +芀=tiao2 +芁=qiu2 +节=jie2,jie1 +节子=jie1,zi3 +节骨眼=jie1,gu3,yan3 +芃=peng2 +芄=wan2 
+芅=yi4 +芆=chai1,cha1 +芇=mian2 +芈=mi3 +芉=gan3 +芊=qian1 +芋=yu4 +芋头=yu4,tou5 +芌=yu4 +芍=shao2 +芎=xiong1 +芏=du4 +芐=hu4,xia4 +芑=qi3 +芒=mang2 +芒刺在背=mang2,ci4,zai4,bei4 +芒种=mang2,zhong4 +芓=zi4,zi3 +芔=hui4,hu1 +芕=sui1 +芖=zhi4 +芗=xiang1 +芘=bi4,pi2 +芙=fu2 +芚=tun2,chun1 +芛=wei3 +芜=wu2 +芝=zhi1 +芞=qi4 +芟=shan1 +芠=wen2 +芡=qian4 +芢=ren2 +芣=fu2 +芤=kou1 +芥=jie4,gai4 +芥末=jie4,mo4 +芥菜=jie4,cai4 +芥蓝=gai4,lan2 +芥蓝菜=gai4,lan2,cai4 +芦=lu2 +芧=xu4,zhu4 +芨=ji1 +芩=qin2 +芪=qi2 +芫=yuan2,yan2 +芫荽=yan2,sui1 +芬=fen1 +芭=ba1 +芮=rui4 +芯=xin1,xin4 +芯子=xin4,zi5 +芰=ji4 +花=hua1 +花不棱登=hua1,bu4,leng1,deng1 +花冠=hua1,guan1 +花呢=hua1,ni2 +花攒锦簇=hua1,cuan2,jin3,cu4 +花攒锦聚=hua1,cuan2,jin3,ju4 +花朝月夕=hua1,zhao1,yue4,xi1 +花朝月夜=hua1,zhao1,yue4,ye4 +花簇锦攒=hua1,cu4,jin3,cuan2 +花露=hua1,lu4 +芲=lun2,hua1 +芳=fang1 +芴=wu4,hu1 +芵=jue2 +芶=gou1,gou3 +芷=zhi3 +芸=yun2 +芹=qin2 +芺=ao3 +芻=chu2 +芼=mao2,mao4 +芽=ya2 +芾=fei4,fu2 +芿=reng2 +苀=hang2 +苁=cong1 +苂=chan2,yin2 +苃=you3 +苄=bian4 +苅=yi4 +苆=qie1 +苇=wei3 +苈=li4 +苉=pi3 +苊=e4 +苋=xian4 +苌=chang2 +苌弘碧血=chang2,hong2,bi4,xue4 +苍=cang1 +苍劲=cang1,jing4 +苍术=cang1,zhu2 +苍蝇见血=cang1,ying2,jian4,xue4 +苍颜白发=cang1,yan2,bai2,fa4 +苎=zhu4 +苏=su1,su4 +苏打=su1,da2 +苐=di4,ti2 +苑=yuan4 +苒=ran3 +苓=ling2 +苔=tai2,tai1 +苕=tiao2,shao2 +苖=di2 +苗=miao2 +苘=qing3 +苙=li4,ji1 +苚=yong4 +苛=ke1,he1 +苛政猛于虎=ke1,zheng4,meng3,yu2,hu3 +苜=mu4 +苜蓿=mu4,xu5 +苝=bei4 +苞=bao1 +苟=gou3 +苟合取容=gou3,he2,qu3,rong2 +苠=min2 +苡=yi3 +苢=yi3 +苣=ju4,qu3 +苣卖菜=qu3,mai4,cai4 +苣荬菜=qu3,mai3,cai4 +苤=pie3 +苤蓝=pie3,lan5 +若=ruo4,re3 +若夫=ruo4,fu2 +苦=ku3 +苦中作乐=ku3,zhong1,zuo4,le4 +苦参=ku3,shen1 +苦处=ku3,chu3 +苦差=ku3,chai1 +苦干=ku3,gan4 +苦熬=ku3,ao2 +苦难=ku3,nan4 +苦难深重=ku3,nan4,shen1,zhong4 +苧=zhu4,ning2 +苨=ni3 +苩=pa1,bo2 +苪=bing3 +苫=shan1,shan4 +苫布=shan4,bu4 +苫盖=shan4,gai4 +苬=xiu2 +苭=yao3 +苮=xian1 +苯=ben3 +苰=hong2 +英=ying1 +苲=zuo2,zha3 +苳=dong1 +苴=ju1,cha2 +苵=die2 +苶=nie2 +苷=gan1 +苸=hu1 +苹=ping2,peng1 +苹果=ping2,guo3 +苺=mei2 +苻=fu2 +苼=sheng1,rui2 +苽=gu1 +苾=bi4 +苿=wei4 +茀=fu2 +茁=zhuo2 +茁长=zhuo2,zhang3 +茂=mao4 +范=fan4 +范蠡=fan4,li3 +茄=qie2 +茄子=qie2,zi5 +茅=mao2
+茅塞顿开=mao2,se4,dun4,kai1 +茅舍=mao2,she4 +茆=mao2 +茇=ba2 +茈=zi3 +茉=mo4 +茊=zi1 +茋=zhi3 +茌=chi2 +茍=ji4 +茎=jing1 +茎干=jing1,gan4 +茏=long2 +茐=cong1 +茑=niao3 +茒=yuan2 +茓=xue2 +茔=ying2 +茕=qiong2 +茖=ge4 +茗=ming2 +茘=li4 +茙=rong2 +茚=yin4 +茛=gen4 +茜=qian4 +茜茜公主=xi1,xi1,gong1,zhu3 +茜草=xi1,cao3 +茝=chai3 +茞=chen2 +茟=yu4 +茠=hao1 +茡=zi4 +茢=lie4 +茣=wu2 +茤=ji4 +茥=gui1 +茦=ci4 +茧=jian3 +茨=ci2 +茩=hou4 +茪=guang1 +茫=mang2 +茬=cha2 +茭=jiao1 +茮=jiao1 +茯=fu2 +茰=yu2 +茱=zhu1 +茱萸=zhu1,yu2 +茲=zi1 +茳=jiang1 +茴=hui2 +茵=yin1 +茶=cha2 +茶几=cha2,ji1 +茶瓯=cha2,ou1 +茶铛=cha2,cheng1 +茷=fa2 +茸=rong2 +茹=ru2 +茹毛饮血=ru2,mao2,yin3,xue4 +茺=chong1 +茻=mang3 +茼=tong2 +茽=zhong4 +茾=qian1 +茿=zhu2 +荀=xun2 +荁=huan2 +荂=fu1 +荃=quan2 +荄=gai1 +荅=da2 +荆=jing1 +荆棘塞途=jing1,ji2,se4,tu2 +荇=xing4 +荈=chuan3 +草=cao3 +草创=cao3,chuang4 +草率=cao3,shuai4 +草率将事=cao3,shuai4,jiang1,shi4 +草苫=cao3,shan1 +草长莺飞=cao3,zhang3,ying1,fei1 +荊=jing1 +荋=er2 +荌=an4 +荍=qiao2 +荎=chi2 +荏=ren3 +荐=jian4 +荑=yi2,ti2 +荒=huang1 +荓=ping2 +荔=li4 +荕=jin1 +荖=lao3 +荗=shu4 +荘=zhuang1 +荙=da2 +荚=jia2 +荛=rao2 +荜=bi4 +荝=ce4 +荞=qiao2 +荞面饸饹=qiao2,mian4,he2,le5 +荟=hui4 +荠=ji4,qi2 +荡=dang4 +荢=zi4 +荣=rong2 +荤=hun1 +荤粥=xun1,yu4 +荥=xing2,ying1 +荦=luo4 +荧=ying2 +荨=qian2,xun2 +荨麻=xun2,ma2 +荨麻疹=xun2,ma2,zhen3 +荩=jin4 +荪=sun1 +荫=yin1,yin4 +荫凉=yin4,liang2 +荫子封妻=yin4,zi3,feng1,qi1 +荫庇=yin4,bi4 +荬=mai3 +荭=hong2 +荮=zhou4 +药=yao4 +药铺=yao4,pu4 +荰=du4 +荱=wei3 +荲=li2 +荳=dou4 +荴=fu1 +荵=ren3 +荶=yin2 +荷=he2,he4 +荷尔蒙=he2,er3,meng2 +荷枪实弹=he4,qiang1,shi2,dan4 +荷载=he4,zai4 +荷重=he4,zhong4 +荷锄=he4,chu2 +荸=bi2 +荸荠=bi2,qi2 +荹=bu4 +荺=yun3 +荻=di2 +荼=tu2 +荽=sui1 +荾=sui1 +荿=cheng2 +莀=chen2 +莁=wu2 +莂=bie2 +莃=xi1 +莄=geng3 +莅=li4 +莆=pu2 +莇=zhu4 +莈=mo4 +莉=li4 +莊=zhuang1 +莋=zuo2 +莌=tuo1 +莍=qiu2 +莎=sha1,suo1 +莎草=suo1,cao3 +莏=suo1 +莐=chen2 +莑=peng2,feng1 +莒=ju3 +莓=mei2 +莔=meng2 +莕=xing4 +莖=jing1 +莗=che1 +莘=shen1,xin1 +莘庄=xin1,zhuang1 +莙=jun1 +莚=yan2 +莛=ting2 +莜=you2 +莝=cuo4 +莞=wan3,guan3,guan1 +莞尔=wan3,er3 +莞尔一笑=wan3,er3,yi1,xiao4 +莞莞=wan3,wan3 +莟=han4 +莠=you3 +莡=cuo4 +莢=jia2 +莣=wang2 +莤=su4,you2 +莥=niu3
+莦=shao1,xiao1 +莧=xian4 +莨=lang4,liang2 +莨绸=liang2,chou2 +莩=fu2,piao3 +莪=e2 +莫=mo4,mu4 +莫不是=mo4,bu2,shi4 +莫为已甚=mo4,wei2,yi3,shen4 +莫予毒也=mo4,yu4,du2,ye3 +莫此为甚=mo4,ci3,wei2,shen4 +莫知所为=mo4,zhi1,suo3,wei2 +莫衷一是=mo4,zhong1,yi1,shi4 +莬=wen4,wan3,mian3 +莭=jie2 +莮=nan2 +莯=mu4 +莰=kan3 +莱=lai2 +莲=lian2 +莲花落=lian2,hua1,lao4 +莳=shi4,shi2 +莳罗=shi2,luo2 +莳萝=shi2,luo2 +莴=wo1 +莵=tu4,tu2 +莶=xian1,lian3 +获=huo4 +获隽公车=huo4,jun1,gong1,che1 +莸=you2 +莹=ying2 +莺=ying1 +莺吟燕儛=ying1,yin2,yan4,sai1 +莺飞草长=ying1,fei1,cao3,zhang3 +莼=chun2 +莽=mang3 +莾=mang3 +莿=ci4 +菀=wan3,yun4 +菁=jing1 +菂=di4 +菃=qu2 +菄=dong1 +菅=jian1 +菆=zou1,chu4 +菇=gu1 +菈=la1 +菉=lu4 +菊=ju2 +菋=wei4 +菌=jun1,jun4 +菍=nie4,ren3 +菎=kun1 +菏=he2 +菐=pu2 +菑=zi1,zi4,zai1 +菒=gao3 +菓=guo3 +菔=fu2 +菕=lun2 +菖=chang1 +菗=chou2 +菘=song1 +菙=chui2 +菚=zhan4 +菛=men2 +菜=cai4 +菝=ba2 +菞=li2 +菟=tu4,tu2 +菠=bo1 +菡=han4 +菢=bao4 +菣=qin4 +菤=juan3 +菥=xi1 +菦=qin2 +菧=di3 +菨=jie1,sha4 +菩=pu2 +菪=dang4 +菫=jin3 +菬=qiao2,zhao3 +菭=tai2,zhi1,chi2 +菮=geng1 +華=hua2,hua4,hua1 +菰=gu1 +菱=ling2 +菲=fei1,fei3 +菲仪=fei3,yi2 +菲材=fei3,cai2 +菲礼=fei3,li3 +菲薄=fei3,bo2 +菲食薄衣=fei3,shi2,bo2,yi1 +菳=qin2,qin1,jin1 +菴=an1 +菵=wang3 +菶=beng3 +菷=zhou3 +菸=yan1 +菹=zu1 +菺=jian1 +菻=lin3,ma2 +菼=tan3 +菽=shu1 +菾=tian2,tian4 +菿=dao4 +萀=hu3 +萁=qi2 +萂=he2 +萃=cui4 +萄=tao2 +萅=chun1 +萆=bi4 +萇=chang2 +萈=huan2 +萉=fei4 +萊=lai2 +萋=qi1 +萌=meng2 +萍=ping2 +萍飘蓬转=ping2,piao1,peng2,zhuan4 +萎=wei3 +萎靡=wei3,mi3 +萏=dan4 +萐=sha4 +萑=huan2 +萒=yan3 +萓=yi2 +萔=tiao2 +萕=qi2 +萖=wan3 +萗=ce4 +萘=nai4 +萙=zhen3 +萚=tuo4 +萛=jiu1 +萜=tie1 +萝=luo2 +萝卜=luo2,bo5 +萞=bi4 +萟=yi4 +萠=pan1 +萡=bo2 +萢=pao1 +萣=ding4 +萤=ying2 +营=ying2 +营蝇斐锦=ying2,ying2,fei1,jin3 +萦=ying2 +萧=xiao1 +萨=sa4 +萩=qiu1 +萪=ke1 +萫=xiang1 +萬=wan4 +萭=yu3 +萮=yu2 +萯=fu4 +萰=lian4 +萱=xuan1 +萲=xuan1 +萳=nan3 +萴=ce4 +萵=wo1 +萶=chun3 +萷=shao1 +萸=yu2 +萹=bian1 +萺=mao4 +萻=an1 +萼=e4 +落=luo4,la4,lao4 +落下=la4,xia4 +落不是=luo4,bu2,shi4 +落了=la4,le5 +落价=lao4,jia4 +落发=luo4,fa4 +落子=lao4,zi3 +落得=luo4,de5 +落枕=lao4,zhen3 +落炕=lao4,kang4 +落膘=luo4,biao1 +落色=lao4,shai3 +落花生=luo4,hua1,sheng1 
+落草为寇=luo4,cao3,wei2,kou4 +落落难合=luo4,luo4,nan2,he2 +落难=luo4,nan4 +落魄=luo4,tuo4 +落魄不偶=luo4,po4,bu4,ou3 +落魄不羁=luo4,po4,bu4,ji1 +落魄江湖=luo4,po4,jing1,hu2 +萾=ying2 +萿=kuo4 +葀=kuo4 +葁=jiang1 +葂=mian3 +葃=zuo4 +葄=zuo4 +葅=zu1 +葆=bao3 +葇=rou2 +葈=xi3 +葉=ye4 +葊=an1 +葋=qu2 +葌=jian1 +葍=fu2 +葎=lv4 +葏=jing1 +葐=pen2 +葑=feng1 +葒=hong2 +葓=hong2 +葔=hou2 +葕=xing4 +葖=tu1 +著=zhu4,zhuo2,zhe5 +葘=zi1 +葙=xiang1 +葚=shen4 +葛=ge2,ge3 +葛屦履霜=ge3,ju4,lv3,shuang1 +葛布=ge2,bu4 +葛藤=ge2,teng2 +葜=qia1 +葝=qing2 +葞=mi3 +葟=huang2 +葠=shen1 +葡=pu2 +葢=gai4 +董=dong3 +董事长=dong3,shi4,zhang3 +葤=zhou4 +葥=qian2 +葦=wei3 +葧=bo2 +葨=wei1 +葩=pa1 +葪=ji4 +葫=hu2 +葫芦=hu2,lu5 +葬=zang4 +葭=jia1 +葮=duan4 +葯=yao4 +葰=jun4 +葱=cong1 +葱头=cong1,tou2 +葲=quan2 +葳=wei1 +葴=zhen1 +葵=kui2 +葶=ting2 +葷=hun1 +葸=xi3 +葹=shi1 +葺=qi4 +葻=lan2 +葼=zong1 +葽=yao1 +葾=yuan1 +葿=mei2 +蒀=yun1 +蒁=shu4 +蒂=di4 +蒃=zhuan4 +蒄=guan1 +蒅=ran3 +蒆=xue1 +蒇=chan3 +蒈=kai3 +蒉=kui4,kuai4 +蒊=hua1 +蒋=jiang3 +蒌=lou2 +蒍=wei3 +蒎=pai4 +蒏=yong4 +蒐=sou1 +蒑=yin1 +蒒=shi1 +蒓=chun2 +蒔=shi4,shi2 +蒕=yun1 +蒖=zhen1 +蒗=lang4 +蒘=ru2,na2 +蒙=meng1,meng2,meng3 +蒙事=meng1,shi4 +蒙以养正=meng2,yi3,yang3,zheng4 +蒙冤=meng2,yuan1 +蒙受=meng2,shou4 +蒙古人种=meng3,gu3,ren2,zhong3 +蒙古包=meng3,gu3,bao1 +蒙古大夫=meng3,gu3,dai4,fu1 +蒙古文=meng3,gu3,wen2 +蒙古族=meng3,gu3,zu2 +蒙古语=meng3,gu3,yu3 +蒙哄=meng1,hong1 +蒙在鼓里=meng2,zai4,gu3,li3 +蒙垢=meng2,gou4 +蒙太奇=meng2,tai4,qi2 +蒙头转向=meng1,tou2,zhuan4,xiang4 +蒙尘=meng2,chen2 +蒙师=meng2,shi1 +蒙得维的亚=meng2,de5,wei2,de5,ya4 +蒙恩=meng2,en1 +蒙昧=meng2,mei4 +蒙汗药=meng2,han4,yao4 +蒙混=meng2,hun4 +蒙皮=meng2,pi2 +蒙眬=meng2,long2 +蒙着=meng1,zhao2 +蒙着锅儿=meng1,zhe5,guo1,er5 +蒙羞=meng2,xiu1 +蒙胧=meng2,long2 +蒙药=meng2,yao4 +蒙蒙=meng2,meng2 +蒙蒙亮=meng2,meng1,liang4 +蒙蒙细雨=meng2,meng2,xi4,yu3 +蒙蒙黑=meng1,meng1,hei1 +蒙蔽=meng2,bi4 +蒙袂辑屦=meng2,mei4,ji2,ju4 +蒙难=meng2,nan4 +蒙面=meng2,mian4 +蒙骗=meng1,pian4 +蒚=li4 +蒛=que1 +蒜=suan4 +蒜头=suan4,tou5 +蒝=yuan2,huan2 +蒞=li4 +蒟=ju3 +蒠=xi1 +蒡=bang4 +蒢=chu2 +蒣=xu2,shu2 +蒤=tu2 +蒥=liu2 +蒦=huo4 +蒧=dian3 +蒨=qian4 +蒩=zu1,ju4 +蒪=po4 +蒫=cuo2 +蒬=yuan1 +蒭=chu2 +蒮=yu4 +蒯=kuai3 +蒰=pan2 +蒱=pu2 
+蒲=pu2 +蒳=na4 +蒴=shuo4 +蒵=xi2,xi4 +蒶=fen2 +蒷=yun2 +蒸=zheng1 +蒸沙为饭=zheng1,sha1,wei2,fan4 +蒸馏=zheng1,liu2 +蒸馏水=zheng1,liu2,shui3 +蒹=jian1 +蒺=ji2 +蒻=ruo4 +蒼=cang1 +蒽=en1 +蒾=mi2 +蒿=hao1 +蓀=sun1 +蓁=zhen1 +蓂=ming2 +蓃=sou1,sou3 +蓄=xu4 +蓅=liu2 +蓆=xi2 +蓇=gu1 +蓈=lang2 +蓉=rong2 +蓊=weng3 +蓋=gai4,ge3,he2 +蓌=cuo4 +蓍=shi1 +蓍占=shi1,zhan1 +蓎=tang2 +蓏=luo3 +蓐=ru4 +蓑=suo1 +蓒=xuan1 +蓓=bei4 +蓔=yao3,zhuo2 +蓕=gui4 +蓖=bi4 +蓗=zong3 +蓘=gun3 +蓙=zuo4 +蓚=tiao2 +蓛=ce4 +蓜=pei4 +蓝=lan2 +蓝调=lan2,diao4 +蓞=dan4 +蓟=ji4 +蓠=li2 +蓡=shen1 +蓢=lang3 +蓣=yu4 +蓤=ling2 +蓥=ying2 +蓦=mo4 +蓧=diao4,tiao2,di2 +蓨=tiao2 +蓩=mao3 +蓪=tong1 +蓫=zhu2 +蓬=peng2 +蓭=an1 +蓮=lian2 +蓯=cong1 +蓰=xi3 +蓱=ping2 +蓲=qiu1,xu1,fu1 +蓳=jin3 +蓴=chun2 +蓵=jie2 +蓶=wei2 +蓷=tui1 +蓸=cao2 +蓹=yu4 +蓺=yi4 +蓻=zi2,ju2 +蓼=liao3,lu4 +蓽=bi4 +蓾=lu3 +蓿=xu4 +蔀=bu4 +蔁=zhang1 +蔂=lei2 +蔃=qiang2 +蔄=man4 +蔅=yan2 +蔆=ling2 +蔇=ji4 +蔈=biao1 +蔉=gun3 +蔊=han4 +蔋=di2 +蔌=su4 +蔍=lu4 +蔎=she4 +蔏=shang1 +蔐=di2 +蔑=mie4 +蔒=hun1 +蔓=man4,wan4 +蔓延=man4,yan2 +蔔=bo5 +蔕=di4 +蔖=cuo2 +蔗=zhe4 +蔘=shen1 +蔙=xuan4 +蔚=wei4 +蔚为大观=wei4,wei2,da4,guan1 +蔚县=yu4,xian4 +蔛=hu2 +蔜=ao2 +蔝=mi3 +蔞=lou2 +蔟=cu4 +蔠=zhong1 +蔡=cai4 +蔢=po2 +蔣=jiang3 +蔤=mi4 +蔥=cong1 +蔦=niao3 +蔧=hui4 +蔨=juan4 +蔩=yin2 +蔪=jian1 +蔫=nian1 +蔬=shu1 +蔭=yin1 +蔮=guo2 +蔯=chen2 +蔰=hu4 +蔱=sha1 +蔲=kou4 +蔳=qian4 +蔴=ma2 +蔵=zang4 +蔶=ze2 +蔷=qiang2 +蔸=dou1 +蔹=lian3 +蔺=lin4 +蔻=kou4 +蔼=ai3 +蔽=bi4 +蔽明塞聪=bi4,ming2,se4,cong1 +蔾=li2 +蔿=wei3 +蕀=ji2 +蕁=qian2,xun2 +蕂=sheng4 +蕃=fan2 +蕃王=fan1,wang2 +蕄=meng2 +蕅=ou3 +蕆=chan3 +蕇=dian3 +蕈=xun4 +蕉=jiao1 +蕊=rui3 +蕋=rui3 +蕌=lei3 +蕍=yu2 +蕎=qiao2 +蕏=zhu1 +蕐=hua2 +蕑=jian1 +蕒=mai3 +蕓=yun2 +蕔=bao1 +蕕=you2 +蕖=qu2 +蕗=lu4 +蕘=rao2 +蕙=hui4 +蕚=e4 +蕛=ti2 +蕜=fei3 +蕝=jue2 +蕞=zui4 +蕟=fa4 +蕠=ru2 +蕡=fen2 +蕢=kui4 +蕣=shun4 +蕤=rui2 +蕥=ya3 +蕦=xu1 +蕧=fu4 +蕨=jue2 +蕩=dang4 +蕪=wu2 +蕫=dong3 +蕬=si1 +蕭=xiao1 +蕮=xi4 +蕯=sa4 +蕰=yun4 +蕱=shao1 +蕲=qi2 +蕳=jian1 +蕴=yun4 +蕵=sun1 +蕶=ling2 +蕷=yu4 +蕸=xia2 +蕹=weng4 +蕺=ji2 +蕻=hong4 +蕼=si4 +蕽=nong2 +蕾=lei3 +蕿=xuan1 +薀=yun4 +薁=yu4 +薂=xi2,xiao4 +薃=hao4 +薄=bao2,bo2,bo4 +薄产=bo2,chan3 +薄利=bo2,li4 
+薄利多销=bo2,li4,duo1,xiao1 +薄命=bo2,ming4 +薄命佳人=bo2,ming4,jia1,ren2 +薄寒中人=bo2,han2,zhong4,ren2 +薄幸=bo2,xing4 +薄弱=bo2,ruo4 +薄待=bao2,dai4 +薄情=bo2,qing2 +薄情无义=bao2,qing2,wu2,yi4 +薄技=bo2,ji4 +薄技在身=bo2,ji4,zai4,shen1 +薄晓=bo2,xiao3 +薄暮=bo2,mu4 +薄暮冥冥=bo2,mu4,ming2,ming2 +薄片=bao2,pian4 +薄物细故=bo2,wu4,xi4,gu4 +薄田=bao2,tian2 +薄荷=bo4,he2 +薄荷脑=bo4,he4,nao3 +薄酒=bao2,jiu3 +薄酬=bo2,chou2 +薄雾=bo2,wu4 +薄饼=bao2,bing3 +薅=hao1 +薆=ai4 +薇=wei1 +薈=hui4 +薉=hui4 +薊=ji4 +薋=ci2,zi1 +薌=xiang1 +薍=wan4,luan4 +薎=mie4 +薏=yi4 +薏苡蒙谤=yi4,yi3,meng2,bang4 +薐=leng2 +薑=jiang1 +薒=can4 +薓=shen1 +薔=qiang2,se4 +薕=lian2 +薖=ke1 +薗=yuan2 +薘=da2 +薙=ti4 +薚=tang1 +薛=xue1 +薜=bi4 +薝=zhan1 +薞=sun1 +薟=xian1,lian3 +薠=fan2 +薡=ding3 +薢=xie4 +薣=gu3 +薤=xie4 +薥=shu3 +薦=jian4 +薧=hao1,kao3 +薨=hong1 +薩=sa4 +薪=xin1 +薪给=xin1,ji3 +薫=xun1 +薬=yao4 +薭=bai4 +薮=sou3 +薮中荆曲=sou3,zhong1,ji2,qu3 +薯=shu3 +薯莨=shu3,liang2 +薰=xun1 +薱=dui4 +薲=pin2 +薳=yuan3,wei3 +薴=ning2 +薵=chou2,zhou4 +薶=mai2,wo1 +薷=ru2 +薸=piao2 +薹=tai2 +薺=ji4,qi2 +薻=zao3 +薼=chen2 +薽=zhen1 +薾=er3 +薿=ni3 +藀=ying2 +藁=gao3 +藂=cong2 +藃=xiao1,hao4 +藄=qi2 +藅=fa2 +藆=jian3 +藇=xu4,yu4,xu1 +藈=kui2 +藉=jie4,ji2 +藊=bian3 +藋=diao4,zhuo2 +藌=mi2 +藍=lan2 +藎=jin4 +藏=cang2,zang4 +藏人=zang4,ren2 +藏历=zang4,li4 +藏头露尾=cang2,tou2,lu4,wei3 +藏府=zang4,fu3 +藏戏=zang4,xi4 +藏掖=cang2,ye1 +藏掖躲闪=cang2,ye1,duo3,shan3 +藏族=zang4,zu2 +藏獒=zang4,ao2 +藏红花=zang4,hong2,hua1 +藏药=zang4,yao4 +藏蓝=zang4,lan2 +藏踪蹑迹=cang2,zong1,nie4,ji1 +藏青=zang4,qing1 +藐=miao3 +藑=qiong2 +藒=qi4 +藓=xian3 +藔=liao2 +藕=ou3 +藖=xian2 +藗=su4 +藘=lv2 +藙=yi4 +藚=xu4 +藛=xie3 +藜=li2 +藝=yi4 +藞=la3 +藟=lei3 +藠=jiao4 +藡=di2 +藢=zhi3 +藣=bei1 +藤=teng2 +藤蔓=teng2,wan4 +藥=yao4 +藦=mo4 +藧=huan4 +藨=biao1,pao1 +藩=fan1 +藪=sou3 +藫=tan2 +藬=tui1 +藭=qiong2 +藮=qiao2 +藯=wei4 +藰=liu2,liu3 +藱=hui4,hui2 +藲=ou1 +藳=gao3 +藴=yun4 +藵=bao3 +藶=li4 +藷=shu3 +藸=zhu1,chu2 +藹=ai3 +藺=lin4 +藻=zao3 +藼=xuan1 +藽=qin4 +藾=lai4 +藿=huo4 +蘀=tuo4 +蘁=wu4 +蘂=rui3 +蘃=rui3 +蘄=qi2 +蘅=heng2 +蘆=lu2 +蘇=su1 +蘈=tui2 +蘉=mang2 +蘊=yun4 +蘋=pin2,ping2 +蘌=yu4 +蘍=xun1 +蘎=ji4 +蘏=jiong1 +蘐=xuan1 +蘑=mo2 +蘒=qiu1 +蘓=su1 
+蘔=jiong1 +蘕=peng2 +蘖=nie4 +蘗=bo4 +蘘=rang2 +蘙=yi4 +蘚=xian3 +蘛=yu2 +蘜=ju2 +蘝=lian3 +蘞=lian3 +蘟=yin3 +蘠=qiang2 +蘡=ying1 +蘢=long2 +蘣=tou3 +蘤=hua1 +蘥=yue4 +蘦=ling4 +蘧=qu2 +蘨=yao2 +蘩=fan2 +蘪=mi2 +蘫=lan2 +蘬=gui1 +蘭=lan2 +蘮=ji4 +蘯=dang4 +蘰=man4 +蘱=lei4 +蘲=lei2 +蘳=hui1 +蘴=feng1 +蘵=zhi1 +蘶=wei4 +蘷=kui2 +蘸=zhan4 +蘹=huai2 +蘺=li2 +蘻=ji4 +蘼=mi2 +蘽=lei3 +蘾=huai4 +蘿=luo2 +虀=ji1 +虁=kui2 +虂=lu4 +虃=jian1 +虅=teng2 +虆=lei2 +虇=quan3 +虈=xiao1 +虉=yi4 +虊=luan2 +虋=men2 +虌=bie1 +虍=hu1 +虎=hu3 +虎将=hu3,jiang4 +虎背熊腰=hu3,bei4,xiong2,yao1 +虎贲=hu3,ben1 +虎跑=hu3,pao2 +虏=lu3 +虐=nve4 +虑=lv4 +虒=si1 +虓=xiao1 +虔=qian2 +處=chu3 +虖=hu1 +虗=xu1 +虘=cuo2 +虙=fu2 +虚=xu1 +虚与委蛇=xu1,yu3,wei1,yi2 +虚应故事=xu1,ying4,gu4,shi4 +虚晃一枪=xiu4,huang4,yi1,qiang1 +虛=xu1 +虜=lu3 +虝=hu3 +虞=yu2 +號=hao4,hao2 +虠=jiao1 +虡=ju4 +虢=guo2 +虣=bao4 +虤=yan2 +虥=zhan4 +虦=zhan4 +虧=kui1 +虨=bin1 +虩=xi4 +虪=shu4 +虫=chong2 +虬=qiu2 +虭=diao1 +虮=ji3 +虯=qiu2 +虰=ding1 +虱=shi1 +虱处裈中=shi1,chu3,kun1,zhong1 +虱子=shi1,zi5 +虲=xia1 +虳=jue2 +虴=zhe2 +虵=she2 +虶=yu2 +虷=han2 +虸=zi3 +虹=hong2 +虺=hui3,hui1 +虻=meng2 +虼=ge4 +虽=sui1 +虾=xia1,ha1 +虾兵蟹将=xia1,bing1,xie4,jiang4 +虾子=xia1,zi5 +虾蟆=ha2,ma2 +虿=chai4 +蚀=shi2 +蚁=yi3 +蚁拥蜂攒=yi3,yong1,feng1,cuan2 +蚁聚蜂攒=yi3,ju4,feng1,cuan2 +蚁集蜂攒=yi3,ji2,feng1,cuan2 +蚂=ma3,ma1,ma4 +蚂蚁啃骨头=ma3,yi3,ken3,gu2,tou5 +蚂蚱=ma4,zha4 +蚂螂=ma1,lang2 +蚃=xiang3 +蚄=fang1,bang4 +蚅=e4 +蚆=ba1 +蚇=chi3 +蚈=qian1 +蚉=wen2 +蚊=wen2 +蚊子=wen2,zi5 +蚋=rui4 +蚌=bang4,beng4 +蚌埠=beng4,bu4 +蚍=pi2 +蚎=yue4 +蚏=yue4 +蚐=jun1 +蚑=qi2 +蚒=tong2 +蚓=yin3 +蚔=qi2,zhi3 +蚕=can2 +蚖=yuan2,wan2 +蚗=jue2,que1 +蚘=hui2 +蚙=qin2,qian2 +蚚=qi2 +蚛=zhong4 +蚜=ya2 +蚝=hao2 +蚞=mu4 +蚟=wang2 +蚠=fen2 +蚡=fen2 +蚢=hang2 +蚣=gong1,zhong1 +蚤=zao3 +蚥=fu4,fu3 +蚦=ran2 +蚧=jie4 +蚨=fu2 +蚩=chi1 +蚪=dou3 +蚫=bao4 +蚬=xian3 +蚭=ni2 +蚮=dai4,de2 +蚯=qiu1 +蚰=you2 +蚱=zha4 +蚲=ping2 +蚳=chi2 +蚴=you4 +蚵=ke1 +蚶=han1 +蚶子=han1,zi5 +蚷=ju4 +蚸=li4 +蚹=fu4 +蚺=ran2 +蚻=zha2 +蚼=gou3,qu2,xu4 +蚽=pi2 +蚾=pi2,bo3 +蚿=xian2 +蛀=zhu4 +蛁=diao1 +蛂=bie2 +蛃=bing1 +蛄=gu1 +蛅=zhan1 +蛆=qu1 +蛇=she2,yi2 +蛇蝎为心=she2,xie1,wei2,xin1 +蛈=tie3 +蛉=ling2 +蛊=gu3 +蛋=dan4 +蛌=tun2 
+蛍=ying2 +蛎=li4 +蛏=cheng1 +蛏子=cheng1,zi5 +蛐=qu1 +蛑=mou2 +蛒=ge2,luo4 +蛓=ci4 +蛔=hui2 +蛕=hui2 +蛖=mang2,bang4 +蛗=fu4 +蛘=yang2 +蛙=wa1 +蛚=lie4 +蛛=zhu1 +蛜=yi1 +蛝=xian2 +蛞=kuo4 +蛟=jiao1 +蛠=li4 +蛡=yi4,xu3 +蛢=ping2 +蛣=jie2 +蛤=ge2,ha2 +蛤蚌=ge2,bang4 +蛤蚧=ge2,jie4 +蛤蜊=ge2,li2 +蛤蟆=ha2,ma5 +蛤蟆镜=ha2,ma2,jing4 +蛥=she2 +蛦=yi2 +蛧=wang3 +蛨=mo4 +蛩=qiong2 +蛪=qie4,ni2 +蛫=gui3 +蛬=qiong2 +蛭=zhi4 +蛮=man2 +蛮干=man2,gan4 +蛮横=man2,heng4 +蛯=lao3 +蛰=zhe2 +蛱=jia2 +蛲=nao2 +蛳=si1 +蛴=qi2 +蛵=xing2 +蛶=jie4 +蛷=qiu2 +蛸=xiao1 +蛹=yong3 +蛺=jia2 +蛻=tui4 +蛼=che1 +蛽=bei4 +蛾=e2,yi3 +蛾子=e2,zi5 +蛿=han4 +蜀=shu3 +蜁=xuan2 +蜂=feng1 +蜂扇蚁聚=feng1,shan1,yi3,ju4 +蜂攒蚁聚=feng1,cuan2,yi3,ju4 +蜂攒蚁集=feng1,cuan2,yi3,ji2 +蜂腰削背=feng1,yao1,xue1,bei4 +蜂腰猿背=feng1,yao1,yuan2,bei4 +蜃=shen4 +蜄=shen4 +蜅=fu3 +蜆=xian3 +蜇=zhe2 +蜈=wu2 +蜉=fu2 +蜊=li4 +蜋=lang2 +蜌=bi4 +蜍=chu2 +蜎=yuan1 +蜏=you3 +蜐=jie2 +蜑=dan4 +蜒=yan2 +蜓=ting2 +蜔=dian4 +蜕=tui4 +蜖=hui2 +蜗=wo1 +蜘=zhi1 +蜙=zhong1 +蜚=fei1 +蜛=ju1 +蜜=mi4 +蜜露=mi4,lu4 +蜝=qi2 +蜞=qi2 +蜟=yu4 +蜠=jun4 +蜡=la4 +蜢=meng3 +蜣=qiang1 +蜤=si1 +蜥=xi1 +蜦=lun2 +蜧=li4 +蜨=die2 +蜩=tiao2 +蜪=tao2 +蜫=kun1 +蜬=han2 +蜭=han4 +蜮=yu4 +蜯=bang4 +蜰=fei2 +蜱=pi2 +蜲=wei1 +蜳=dun1 +蜴=yi4 +蜵=yuan1 +蜶=suo4 +蜷=quan2 +蜷曲=quan2,qu1 +蜸=qian3 +蜹=rui4 +蜺=ni2 +蜻=qing1 +蜼=wei4 +蜽=liang3 +蜾=guo3 +蜿=wan1 +蝀=dong1 +蝁=e4 +蝂=ban3 +蝃=di4 +蝄=wang3 +蝅=can2 +蝆=yang3 +蝇=ying2 +蝇攒蚁聚=ying2,cuan2,yi3,ju4 +蝇攒蚁附=ying2,cuan2,yi3,fu4 +蝈=guo1 +蝉=chan2 +蝊=ding4 +蝋=la4 +蝌=ke1 +蝍=ji2 +蝍蛆=ji2,qu1 +蝎=xie1 +蝎子=xie1,zi5 +蝎蝎螫螫=xie1,xie1,zhe1,zhe1 +蝏=ting2 +蝐=mao4 +蝑=xu1 +蝒=mian2 +蝓=yu2 +蝔=jie1 +蝕=shi2 +蝖=xuan1 +蝗=huang2 +蝘=yan3 +蝙=bian1 +蝚=rou2 +蝛=wei1 +蝜=fu4 +蝝=yuan2 +蝞=mei4 +蝟=wei4 +蝠=fu2 +蝡=ru2 +蝢=xie2 +蝣=you2 +蝤=qiu2 +蝥=mao2 +蝦=xia1,ha1 +蝧=ying1 +蝨=shi1 +蝩=chong2 +蝪=tang1 +蝫=zhu1 +蝬=zong1 +蝭=di4 +蝮=fu4 +蝯=yuan2 +蝰=kui2 +蝱=meng2 +蝲=la4 +蝳=dai4 +蝴=hu2 +蝵=qiu1 +蝶=die2 +蝷=li4 +蝸=wo1 +蝹=yun1 +蝺=qu3 +蝻=nan3 +蝼=lou2 +蝽=chun1 +蝾=rong2 +蝿=ying2 +螀=jiang1 +螁=tui4 +螂=lang2 +螃=pang2 +螄=si1 +螅=xi1 +螆=ci4 +螇=xi1,qi1 +螈=yuan2 +螉=weng1 +螊=lian2 +螋=sou1 +螌=ban1 +融=rong2 
+融为一体=rong2,wei2,yi1,ti3 +融洽无间=rong2,qia4,wu2,jian4 +螎=rong2 +螏=ji2 +螐=wu1 +螑=xiu4 +螒=han4 +螓=qin2 +螓首蛾眉=qin2,shou3,er2,mei2 +螔=yi2 +螕=bi1,pi2 +螖=hua2 +螗=tang2 +螘=yi3 +螙=du4 +螚=nai4,neng3 +螛=he2,xia2 +螜=hu2 +螝=gui4,hui3 +螞=ma3,ma1,ma4 +螟=ming2 +螠=yi4 +螡=wen2 +螢=ying2 +螣=teng2 +螤=zhong1 +螥=cang1 +螦=sao1 +螧=qi2 +螨=man3 +螩=dao1 +螪=shang1 +螫=shi4,zhe1 +螬=cao2 +螭=chi1 +螮=di4 +螯=ao2 +螰=lu4 +螱=wei4 +螲=die2,zhi4 +螳=tang2 +螳臂当车=tang2,bi4,dang1,che1 +螴=chen2 +螵=piao1 +螶=qu2,ju4 +螷=pi2 +螸=yu2 +螹=chan2,jian4 +螺=luo2 +螺丝起子=luo2,si1,qi3,zi3 +螺杆=luo2,gan3 +螻=lou2 +螼=qin3 +螽=zhong1 +螾=yin3 +螿=jiang1 +蟀=shuai4 +蟁=wen2 +蟂=xiao1 +蟃=wan4 +蟄=zhe2 +蟅=zhe4 +蟆=ma2,mo4 +蟇=ma2 +蟈=guo1 +蟉=liu2 +蟊=mao2 +蟋=xi1 +蟌=cong1 +蟍=li2 +蟎=man3 +蟏=xiao1 +蟐=chan2 +蟑=zhang1 +蟒=mang3,meng3 +蟓=xiang4 +蟔=mo4 +蟕=zui1 +蟖=si1 +蟗=qiu1 +蟘=te4 +蟙=zhi2 +蟚=peng2 +蟛=peng2 +蟜=jiao3 +蟝=qu2 +蟞=bie1,bie2 +蟟=liao2 +蟠=pan2 +蟠曲=pan2,qu1 +蟡=gui3 +蟢=xi3 +蟣=ji3 +蟤=zhuan1 +蟥=huang2 +蟦=fei4,ben1 +蟧=lao2,liao2 +蟨=jue2 +蟩=jue2 +蟪=hui4 +蟫=yin2,xun2 +蟬=chan2 +蟭=jiao1 +蟮=shan4 +蟯=nao2 +蟰=xiao1 +蟱=wu2 +蟲=chong2 +蟳=xun2 +蟴=si1 +蟵=chu2 +蟶=cheng1 +蟷=dang1 +蟸=li2 +蟹=xie4 +蟺=shan4 +蟻=yi3 +蟼=jing3 +蟽=da2 +蟾=chan2 +蟿=qi4 +蠀=ci1 +蠁=xiang3 +蠂=she4 +蠃=luo3 +蠄=qin2 +蠅=ying2 +蠆=chai4 +蠇=li4 +蠈=zei2 +蠉=xuan1 +蠊=lian2 +蠋=zhu2 +蠌=ze2 +蠍=xie1 +蠎=mang3 +蠏=xie4 +蠐=qi2 +蠑=rong2 +蠒=jian3 +蠓=meng3 +蠔=hao2 +蠕=ru2 +蠖=huo4 +蠗=zhuo2 +蠘=jie2 +蠙=pin2 +蠚=he1 +蠛=mie4 +蠜=fan2 +蠝=lei3 +蠞=jie2 +蠟=la4 +蠠=min3 +蠡=li3 +蠡县=li3,xian4 +蠢=chun3 +蠣=li4 +蠤=qiu1 +蠥=nie4 +蠦=lu2 +蠧=du4 +蠨=xiao1 +蠩=zhu1 +蠪=long2 +蠫=li2 +蠬=long2 +蠭=feng1 +蠮=ye1 +蠯=pi2 +蠰=nang2 +蠱=gu3 +蠲=juan1 +蠳=ying1 +蠴=shu3 +蠵=xi1 +蠶=can2 +蠷=qu2 +蠸=quan2 +蠹=du4 +蠹居棊处=du4,ju1,que4,chu3 +蠹居棋处=du4,ju1,qi2,chu3 +蠺=can2 +蠻=man2 +蠼=qu2 +蠽=jie2 +蠾=zhu2 +蠿=zhuo2 +血=xie3,xue4 +血丝=xue4,si1 +血中毒=xue4,zhong4,du2 +血书=xue4,shu1 +血亏=xue4,kui1 +血亲=xue4,qin1 +血仇=xue4,chou2 +血债=xue4,zhai4 +血债累累=xue4,zhai4,lei3,lei3 +血光之灾=xue4,guang1,zhi1,zai1 +血凝固=xue4,ning2,gu4 +血刃=xue4,ren4 +血勇之人=xue4,yong3,zhi1,ren2 +血印=xue4,yin4 +血压=xue4,ya1 
+血压正常=xue4,ya1,zheng4,chang2 +血压计=xue4,ya1,ji4 +血压高升=xue4,ya1,gao1,sheng1 +血友病=xue4,you3,bing4 +血友病患者=xue4,you3,bing4,huan4,zhe3 +血口=xue4,kou3 +血口喷人=xue4,kou3,pen1,ren2 +血吸虫=xue4,xi1,chong2 +血吸虫病=xue4,xi1,chong2,bing4 +血嗣=xue4,si4 +血块=xue4,kuai4 +血型=xue4,xing2 +血型分类=xue4,xing2,fen1,lei4 +血小板=xue4,xiao3,ban3 +血小板病=xue4,xiao3,ban3,bing4 +血尿=xue4,niao4 +血尿素=xue4,niao4,su4 +血崩=xue4,beng1 +血崩症=xue4,beng1,zheng4 +血库=xue4,ku4 +血循环=xue4,xun2,huan2 +血性=xue4,xing4 +血性男儿=xue4,xing4,nan2,er2 +血战=xue4,zhan4 +血斑=xue4,ban1 +血晕=xie3,yun4 +血本=xue4,ben3 +血本无归=xue4,ben3,wu2,gui1 +血枯病=xue4,ku1,bing4 +血染沙场=xue4,ran3,sha1,chang3 +血栓=xue4,shuan1 +血栓形成=xue4,shuan1,xing2,cheng2 +血栓心内膜炎=xue4,shuan1,xin1,nei4,mo2,yan2 +血栓栓塞=xue4,shuan1,shuan1,se4 +血栓脉管炎=xue4,shuan1,mai4,guan3,yan2 +血样=xue4,yang4 +血案=xue4,an4 +血气=xue4,qi4 +血气之勇=xue4,qi4,zhi1,yong3 +血气方刚=xue4,qi4,fang1,gang1 +血水=xue4,shui3 +血汗=xue4,han4 +血汗工厂=xue4,han4,gong1,chang3 +血污=xue4,wu1 +血沉=xue4,chen2 +血沉实验=xue4,chen2,shi2,yan4 +血沉淀素=xue4,chen2,dian4,su4 +血沉素=xue4,chen2,su4 +血泊=xue4,po1 +血泡=xue4,pao4 +血泪=xue4,lei4 +血洗=xue4,xi3 +血流=xue4,liu2 +血流动力学=xue4,liu2,dong4,li4,xue2 +血流如注=xue4,liu2,ru2,zhu4 +血流成河=xue4,liu2,cheng2,he2 +血流漂杵=xue4,liu2,piao1,chu3 +血浆=xue4,jiang1 +血浆蛋白=xue4,jiang1,dan4,bai2 +血浓于水=xue4,nong2,yu2,shui3 +血海=xue4,hai3 +血海深仇=xue4,hai3,shen1,chou2 +血液=xue4,ye4 +血液学=xue4,ye4,xue2 +血液循环=xue4,ye4,xun2,huan2 +血液病=xue4,ye4,bing4 +血液透析=xue4,ye4,tou4,xi1 +血液透析器=xue4,ye4,tou4,xi1,qi4 +血淋淋=xie3,lin1,lin1 +血清=xue4,qing1 +血清型=xue4,qing1,xing2 +血清性肝炎=xue4,qing1,xing4,gan1,yan2 +血清病=xue4,qing1,bing4 +血清白蛋白=xue4,qing1,bai2,dan4,bai2 +血清素=xue4,qing1,su4 +血渍=xue4,zi4 +血球=xue4,qiu2 +血球计数器=xue4,qiu2,ji4,shu4,qi4 +血痂=xue4,jia1 +血痕=xue4,hen2 +血痢=xue4,li4 +血痨=xue4,lao2 +血瘕=xue4,jia3 +血癌=xue4,ai2 +血盆大口=xue4,pen2,da4,kou3 +血祭=xue4,ji4 +血竭=xue4,jie2 +血管=xue4,guan3 +血管再造=xue4,guan3,zai4,zao4 +血管切除术=xue4,guan3,qie1,chu2,shu4 +血管扩张=xue4,guan3,kuo4,zhang1 +血管收缩=xue4,guan3,shou1,suo1 +血管梗阻=xue4,guan3,geng3,zu3 +血管炎=xue4,guan3,yan2 
+血管病=xue4,guan3,bing4 +血管痉挛=xue4,guan3,jing4,luan2 +血管瘤=xue4,guan3,liu2 +血管破裂=xue4,guan3,po4,lie4 +血管硬化=xue4,guan3,ying4,hua4 +血管紧张=xue4,guan3,jin3,zhang1 +血管膜=xue4,guan3,mo2 +血管舒张=xue4,guan3,shu1,zhang1 +血管造影=xue4,guan3,zao4,ying3 +血粉=xue4,fen3 +血糖=xue4,tang2 +血糖低=xue4,tang2,di1 +血糖高=xue4,tang2,gao1 +血红=xue4,hong2 +血红素=xue4,hong2,su4 +血红色=xue4,hong2,se4 +血红蛋白=xue4,hong2,dan4,bai2 +血细胞=xue4,xi4,bao1 +血统=xue4,tong3 +血缘=xue4,yuan2 +血肉=xue4,rou4 +血肉之躯=xue4,rou4,zhi1,qu1 +血肉模糊=xue4,rou4,mo2,hu2 +血肉横飞=xue4,rou4,heng2,fei1 +血肉相连=xue4,rou4,xiang1,lian2 +血肿=xue4,zhong3 +血脂=xue4,zhi1 +血脉=xue4,mai4 +血脉不畅=xue4,mai4,bu2,chang4 +血脉相通=xue4,mai4,xiang1,tong1 +血腥=xue4,xing1 +血腥味=xue4,xing1,wei4 +血腥屠杀=xue4,xing1,tu2,sha1 +血腥统治=xue4,xing1,tong3,zhi4 +血腥镇压=xue4,xing1,zhen4,ya1 +血色=xue4,se4 +血色素=xue4,se4,su4 +血花=xue4,hua1 +血虚=xue4,xu1 +血衣=xue4,yi1 +血证=xue4,zheng4 +血豆腐=xie3,dou4,fu5 +血象=xue4,xiang4 +血账=xue4,zhang4 +血路=xue4,lu4 +血迹=xue4,ji4 +血迹斑斑=xue4,ji4,ban1,ban1 +血道子=xie3,dao4,zi5 +血量=xue4,liang4 +血防=xue4,fang2 +血雨腥风=xue4,yu3,xing1,feng1 +血风肉雨=xue4,feng1,rou4,yu3 +衁=huang1 +衂=nv4 +衃=pei1 +衄=nv4 +衅=xin4 +衆=zhong4 +衇=mai4 +衈=er3 +衉=ke4 +衊=mie4 +衋=xi4 +行=xing2,hang2 +行不胜衣=xing2,bu4,sheng4,yi1 +行业=hang2,ye4 +行为=xing2,wei2 +行伍=hang2,wu3 +行会=hang2,hui4 +行侠好义=xing2,xia2,hao4,yi4 +行列=hang2,lie4 +行号=hang2,hao2 +行号卧泣=xing2,hao2,wo4,qi4 +行号巷哭=xing2,hao2,xiang4,ku1 +行商=hang2,shang1 +行头=xing2,tou5 +行家=hang2,jia1 +行家里手=hang2,jia1,li3,shou3 +行市=hang2,shi4 +行帮=hang2,bang1 +行当=hang2,dang4 +行情=hang2,qing2 +行成于思=xing2,cheng2,yu2,si1 +行行出状元=hang2,hang2,chu1,zhuang4,yuan2 +行行蛇蚓=hang2,hang2,she2,yin3 +行规=hang2,gui1 +行话=hang2,hua4 +行货=hang2,huo4 +行贾=xing2,gu3 +行距=hang2,ju4 +行辈=hang2,bei4 +行道=hang2,dao4 +行道树=hang2,dao4,shu4 +行都=xing2,du1 +行间=hang2,jian1 +行间字里=hang2,jian1,zi4,li3 +衍=yan3 +衎=kan4 +衏=yuan4 +衐=qu2 +衑=ling2 +衒=xuan4 +衒玉贾石=xuan4,yu4,gu3,shi2 +術=shu4 +衔=xian2 +衕=tong4 +衖=xiang4 +街=jie1 +衘=xian2 +衙=ya2 +衚=hu2 +衛=wei4 +衜=dao4 +衝=chong1 +衞=wei4 +衟=dao4 +衠=zhun1 +衡=heng2 +衡量=heng2,liang2 +衢=qu2 +衣=yi1
+衣冠=yi4,guan1 +衣冠冢=yi4,guan1,zhong3 +衣冠土枭=yi1,guan1,tu3,xiao1 +衣冠枭獍=yi1,guan1,xiao1,jing4 +衣冠楚楚=yi1,guan1,chu3,chu3 +衣冠礼乐=yi1,guan4,li3,le4 +衣冠禽兽=yi1,guan1,qin2,shou4 +衣单食薄=yi1,dan1,shi2,bo2 +衣服=yi1,fu5 +衣租食税=yi4,zu1,shi2,shui4 +衣绣昼行=yi4,xiu4,zhou4,xing2 +衣被群生=yi4,bei4,qun2,sheng1 +衣裳=yi1,shang5 +衣裳之会=yi1,shang1,zhi1,hui4 +衣轻乘肥=yi4,qing1,cheng2,fei2 +衣锦夜行=yi4,jin3,ye4,xing2 +衣锦昼行=yi4,jin3,zhou4,xing2 +衣锦过乡=yi4,jin3,guo4,xiang1 +衣锦还乡=yi1,jin3,huan2,xiang1 +衣锦食肉=yi4,jin3,shi2,rou4 +衤=yi1 +补=bu3 +补假=bu3,jia4 +补种=bu3,zhong4 +补给=bu3,ji3 +补血=bu3,xue4 +衦=gan3 +衧=yu2 +表=biao3 +表率=biao3,shuai4 +表瓤子=biao3,rang2,zi5 +表蒙子=biao3,meng2,zi3 +表里为奸=biao3,li3,wei2,jian1 +表里相应=biao3,li3,xiang1,ying4 +表露=biao3,lu4 +衩=cha4 +衪=yi4 +衫=shan1 +衬=chen4 +衭=fu1 +衮=gun3 +衯=fen1 +衰=shuai1,cui1 +衰乏=cui1,fa2 +衰少=cui1,shao3 +衰衣=cui1,yi1 +衱=jie2 +衲=na4 +衳=zhong1 +衴=dan3 +衵=ri4 +衶=zhong4 +衷=zhong1 +衸=jie4 +衹=zhi3 +衺=xie2 +衻=ran2 +衼=zhi1 +衽=ren4 +衾=qin1 +衿=jin1 +袀=jun1 +袁=yuan2 +袂=mei4 +袃=chai4 +袄=ao3 +袅=niao3 +袅娜=niao3,nuo2 +袆=hui1 +袇=ran2 +袈=jia1 +袉=tuo2,tuo1 +袊=ling3,ling2 +袋=dai4 +袋子=dai4,zi5 +袌=bao4,pao2,pao4 +袍=pao2 +袍子=pao2,zi5 +袎=yao4 +袏=zuo4 +袐=bi4 +袑=shao4 +袒=tan3 +袒胸露背=tan3,xiong1,lu4,bei4 +袒胸露臂=tan3,xiong1,lu4,bi4 +袒露=tan3,lu4 +袓=ju4,jie1 +袔=he4,ke4 +袕=xue2 +袖=xiu4 +袗=zhen3 +袘=yi2,yi4 +袙=pa4 +袚=fu2 +袛=di1 +袜=wa4 +袜子=wa4,zi5 +袝=fu4 +袞=gun3 +袟=zhi4 +袠=zhi4 +袡=ran2 +袢=pan4 +袣=yi4 +袤=mao4 +袥=tuo1 +袦=na4,jue2 +袧=gou1 +袨=xuan4 +袩=zhe2 +袪=qu1 +被=bei4,pi1 +被发左衽=pi1,fa4,zuo3,ren4 +被发文身=pi1,fa4,wen2,shen1 +被发缨冠=pi1,fa4,ying1,guan4 +被子=bei4,zi5 +被山带河=pi1,shan1,dai4,he2 +被泽蒙庥=bei4,ze2,meng2,xiu1 +被甲执兵=pi1,jia3,zhi2,bing1 +被甲持兵=pi1,jia3,chi2,bing1 +被甲据鞍=pi1,jia3,ju4,an1 +被甲枕戈=pi1,jia3,zhen3,ge1 +被称为=bei4,cheng1,wei2 +被褐怀玉=pi1,he4,huai2,yu4 +被褐怀珠=pi1,he4,huai2,zhu1 +被视为=bei4,shi4,wei2 +被评为=bei4,ping2,wei2 +被难=bei4,nan4 +袬=yu4 +袭=xi2 +袮=mi2 +袯=bo2 +袰=bo1 +袱=fu2 +袲=chi3,nuo3 +袳=chi3,qi3,duo3,nuo3 +袴=ku4 +袵=ren4 +袶=peng2 +袷=jia2,jie2,qia1 +袷袢=qia1,pan4 +袸=jian4,zun4 +袹=bo2,mo4 +袺=jie2 +袻=er2 +袼=ge1 +袽=ru2 
+袾=zhu1 +袿=gui1,gua4 +裀=yin1 +裁=cai2 +裁处=cai2,chu3 +裁度=cai2,duo2 +裂=lie4,lie3 +裂缝=lie4,feng4 +裂裳衣疮=lie4,shang2,yi1,chuang1 +裃=ka3 +裄=hang2 +装=zhuang1 +装在闷葫芦里=zhuang1,zai4,men4,hu2,lu5,li3 +装孙子=zhuang1,sun1,zi5 +装模作样=zhuang1,mu2,zuo4,yang4 +装载=zhuang1,zai4 +裆=dang1 +裇=xu1 +裈=kun1 +裉=ken4 +裊=niao3 +裋=shu4 +裌=jia2 +裍=kun3 +裎=cheng2,cheng3 +裏=li3 +裐=juan1 +裑=shen1 +裒=pou2 +裓=ge2,jie1 +裔=yi4 +裕=yu4 +裖=zhen3 +裗=liu2 +裘=qiu2 +裙=qun2 +裙子=qun2,zi5 +裚=ji4 +裛=yi4 +補=bu3 +裝=zhuang1 +裞=shui4 +裟=sha1 +裠=qun2 +裡=li3 +裢=lian2 +裣=lian3 +裤=ku4 +裤子=ku4,zi5 +裤缝=ku4,feng4 +裤衩=ku4,cha3 +裥=jian3 +裦=bao1 +裧=chan1 +裨=pi2,bi4 +裨将=pi2,jiang4 +裨益=bi4,yi4 +裨补=bi4,bu3 +裩=kun1 +裪=tao2 +裫=yuan4 +裬=ling2 +裭=chi3 +裮=chang1 +裯=chou2,dao1 +裰=duo1 +裱=biao3 +裱糊=biao3,hu2 +裲=liang3 +裳=chang2,shang5 +裳裳=chang2,chang2 +裴=pei2 +裵=pei2 +裶=fei1 +裷=yuan1,gun3 +裸=luo3 +裸露=luo3,lu4 +裹=guo3 +裹血力战=guo3,xue4,li4,zhan4 +裺=yan3,an1 +裻=du2 +裼=xi1,ti4 +製=zhi4 +裾=ju1 +裿=yi3 +褀=qi2 +褁=guo3 +褂=gua4 +褃=ken4 +褄=qi1 +褅=ti4 +褆=ti2 +複=fu4 +褈=chong2 +褉=xie4 +褊=bian3 +褋=die2 +褌=kun1 +褍=duan1 +褎=xiu4 +褎然冠首=you4,ran2,guan4,shou3 +褏=xiu4 +褐=he4 +褑=yuan4 +褒=bao1 +褒衣危冠=bao1,yi1,wei1,guan1 +褒贬与夺=bao3,bian3,yu3,duo2 +褓=bao3 +褔=fu4,fu2 +褕=yu2 +褖=tuan4 +褗=yan3 +褘=hui1 +褙=bei4 +褚=zhu3,chu3 +褛=lv3 +褜=pao2 +褝=dan1 +褞=yun4 +褟=ta1 +褠=gou1 +褡=da1 +褢=huai2 +褣=rong2 +褤=yuan2 +褥=ru4 +褥子=ru4,zi5 +褦=nai4 +褧=jiong3 +褨=suo3 +褩=ban1 +褪=tui4,tun4 +褪去=tun4,qu4 +褪色=tui4,shai3 +褫=chi3 +褬=sang3 +褭=niao3 +褮=ying1 +褯=jie4 +褰=qian1 +褱=huai2 +褲=ku4 +褳=lian2 +褴=lan2 +褵=li2 +褶=zhe3 +褶子=zhe3,zi5 +褷=shi1 +褸=lv3 +褹=yi4 +褺=die1 +褻=xie4 +褼=xian1 +褽=wei4 +褾=biao3 +褿=cao2 +襀=ji4 +襁=qiang3 +襂=sen1 +襃=bao1 +襄=xiang1 +襅=bi4 +襆=fu2 +襇=jian3 +襈=zhuan4 +襉=jian3 +襊=cui4 +襋=ji2 +襌=dan1 +襍=za2 +襎=fan2 +襏=bo2 +襐=xiang4 +襑=xin2 +襒=bie2 +襓=rao2 +襔=man3 +襕=lan2 +襖=ao3 +襗=ze2 +襘=gui4 +襙=cao4 +襚=sui4 +襛=nong2 +襜=chan1 +襝=lian3 +襞=bi4 +襟=jin1 +襠=dang1 +襡=shu3 +襢=tan3 +襣=bi4 +襤=lan2 +襥=fu2 +襦=ru2 +襧=zhi3 +襩=shu3 +襪=wa4 +襫=shi4 +襬=bai3 +襭=xie2 +襮=bo2 +襯=chen4 +襰=lai3 +襱=long2 +襲=xi2 
+襳=xian1 +襴=lan2 +襵=zhe3 +襶=dai4 +襷=ju3 +襸=zan4 +襹=shi1 +襺=jian3 +襻=pan4 +襼=yi4 +襽=lan2 +襾=ya4 +西=xi1 +西乐=xi1,yue4 +西洋参=xi1,yang2,shen1 +西藏=xi1,zang4 +覀=ya4 +要=yao4,yao1 +要不是=yao4,bu2,shi4 +要价还价=yao4,jia4,huan2,jia4 +要击=yao1,ji1 +要塞=yao4,sai4 +要好=yao1,hao3 +要得=yao4,de5 +要挟=yao1,xie2 +要求=yao1,qiu2 +要面子=yao4,mian4,zi5 +覂=feng3 +覃=tan2,qin2 +覄=fu4 +覅=fiao4 +覆=fu4 +覆没=fu4,mo4 +覆盆子=fu4,pen2,zi3 +覇=ba4 +覈=he2 +覉=ji1 +覊=ji1 +見=jian4,xian4 +覌=guan1,guan4 +覍=bian4 +覎=yan4 +規=gui1 +覐=jue2,jiao4 +覑=pian3 +覒=mao4 +覓=mi4 +覔=mi4 +覕=pie1,mie4 +視=shi4 +覗=si4 +覘=chan1 +覙=zhen3 +覚=jue2,jiao4 +覛=mi4 +覜=tiao4 +覝=lian2 +覞=yao4 +覟=zhi4 +覠=jun1 +覡=xi1 +覢=shan3 +覣=wei1 +覤=xi4 +覥=tian3 +覦=yu2 +覧=lan3 +覨=e4 +覩=du3 +親=qin1,qing4 +覫=pang3 +覬=ji4 +覭=ming2 +覮=ying2,ying3 +覯=gou4 +覰=qu1,qu4 +覱=zhan4,zhan1 +覲=jin4 +観=guan1,guan4 +覴=deng4 +覵=jian4,bian3 +覶=luo2,luan3 +覷=qu4,qu1 +覸=jian4 +覹=wei2 +覺=jue2,jiao4 +覻=qu4,qu1 +覼=luo2 +覽=lan3 +覾=shen3 +覿=di2 +觀=guan1,guan4 +见=jian4,xian4 +见世面=xian4,shi4,mian4 +见义勇为=jian4,yi4,yong3,wei2 +见义当为=jian4,yi4,dang1,wei2 +见义必为=jian4,yi4,bi4,wei2 +见义敢为=jian4,yi4,gan3,wei2 +见几而作=jian4,ji1,er2,zuo4 +见哭兴悲=jian4,ku1,xing1,bei1 +见幾而作=jian4,ji1,er2,zuo4 +见弃于人=jian4,qi4,yu3,ren2 +见弹求鸮=jian4,dan4,qiu2,hao2 +见弹求鹗=jian4,dan4,qiu2,e4 +见得=jian4,de5 +见所不见=jian4,suo3,bu4,jian4 +见死不救=jian4,si3,bu4,jiu4 +见物不见人=jian4,wu4,bu4,jian4,ren2 +见素抱朴=xian4,su4,bao4,pu3 +见缝就钻=jian4,feng4,jiu4,zuan1 +见缝插针=jian4,feng4,cha1,zhen1 +见背=jian4,bei4 +见长=jian4,zhang3 +观=guan1,guan4 +观今宜鉴古=guan1,jin1,yi4,jian4,gu3 +觃=yan4 +规=gui1 +规划=gui1,hua4 +规旋矩折=gui1,xuan2,ju3,she2 +规矩=gui1,ju5 +规重矩叠=gui1,chong2,ju3,die2 +觅=mi4 +视=shi4 +视为儿戏=shi4,wei2,er2,xi4 +视为寇雠=shi4,wei2,kou4,chou2 +视为畏途=shi4,wei2,wei4,tu2 +视为知己=shi4,wei2,zhi1,ji3 +视差=shi4,cha1 +视微知著=shi4,wei1,zhi1,zhuo2 +视而不见=shi4,er2,bu4,jian4 +觇=chan1 +览=lan3 +觉=jue2,jiao4 +觉得=jue2,de5 +觊=ji4 +觋=xi2 +觌=di2 +觍=tian3 +觎=yu2 +觏=gou4 +觐=jin4 +觑=qu4,qu1 +角=jiao3,jue2 +角力=jue2,li4 +角斗=jue2,dou4 +角立杰出=jiao3,li4,jie2,chu1 +角色=jue2,se4 +角逐=jue2,zhu2 +觓=qiu2 +觔=jin1 
+觕=cu1 +觖=jue2 +觗=zhi4 +觘=chao4 +觙=ji2 +觚=gu1 +觛=dan4 +觜=zi1,zui3 +觝=di3 +觞=shang1 +觟=hua4,xie4 +觠=quan2 +觡=ge2 +觢=shi4 +解=jie3,jie4,xie4 +解人难得=jie3,ren2,nan2,de2 +解元=jie4,yuan2 +解发佯狂=jie3,fa4,yang2,kuang2 +解差=jie4,chai1 +解弦更张=jie3,xian2,geng1,zhang1 +解数=xie4,shu4 +解款=jie4,kuan3 +解法=xie4,fa3 +解禁=jie3,jin4 +解衣槃磅=jie3,yi1,pan2,pang2 +解衣盘磅=jie3,yi1,pan2,pang2 +解衣衣人=jie4,yi1,yi1,ren2 +解送=jie4,song4 +解铃系铃=jie3,ling2,ji4,ling2 +解铃还需系铃人=jie3,ling2,hai2,xu1,ji4,ling2,ren2 +解难=jie3,nan4 +觤=gui3 +觥=gong1 +触=chu4 +触处机来=chu4,chu3,ji1,lai2 +触物兴怀=chu4,wu4,xing1,huai2 +触目兴叹=chu4,mu4,xing1,tan4 +觧=jie3,jie4,xie4 +觨=hun4 +觩=qiu2 +觪=xing1 +觫=su4 +觬=ni2 +觭=ji1,qi2 +觮=jue2 +觯=zhi4 +觰=zha1 +觱=bi4 +觲=xing1 +觳=hu2 +觴=shang1 +觵=gong1 +觶=zhi4 +觷=xue2,hu4 +觸=chu4 +觹=xi1 +觺=yi2 +觻=li4,lu4 +觼=jue2 +觽=xi1 +觾=yan4 +觿=xi1 +言=yan2 +言不由衷=yan2,bu4,you2,zhong1 +言不逮意=yan2,bu4,dai3,yi4 +言为心声=yan2,wei2,xin1,sheng1 +言之过早=yan2,zhi1,guo4,zao3 +言差语错=yan2,cha1,yu3,cuo4 +言必有中=yan2,bi4,you3,zhong4 +言行一致=yan2,xing2,yi1,zhi4 +言词恳切=yan2,ci2,ken3,qie4 +言重九鼎=yan2,zhong4,jiu3,ding3 +訁=yan2 +訂=ding4 +訃=fu4 +訄=qiu2 +訅=qiu2 +訆=jiao4 +訇=hong1 +計=ji4 +訉=fan4 +訊=xun4 +訋=diao4 +訌=hong4 +訍=chai4 +討=tao3 +訏=xu1 +訐=jie2 +訑=dan4 +訒=ren4 +訓=xun4 +訔=yin2 +訕=shan4 +訖=qi4 +託=tuo1 +記=ji4 +訙=xun4 +訚=yin2 +訛=e2 +訜=fen1 +訝=ya4 +訞=yao1 +訟=song4 +訠=shen3 +訡=yin2 +訢=xin1 +訣=jue2 +訤=xiao2 +訥=ne4 +訦=chen2 +訧=you2 +訨=zhi3 +訩=xiong1 +訪=fang3 +訫=xin4 +訬=chao1 +設=she4 +訮=yan2 +訯=sa3 +訰=zhun4 +許=xu3 +訲=yi4 +訳=yi4 +訴=su4 +訵=chi1 +訶=he1 +訷=shen1 +訸=he2 +訹=xu4 +診=zhen3 +註=zhu4 +証=zheng4 +訽=gou4 +訾=zi1 +訿=zi3 +詀=zhan1 +詁=gu3 +詂=fu4 +詃=jian3 +詄=die2 +詅=ling2 +詆=di3 +詇=yang4 +詈=li4 +詈夷为跖=li4,yi2,wei2,zhi2 +詉=nao2 +詊=pan4 +詋=zhou4 +詌=gan4 +詍=yi4 +詎=ju4 +詏=yao4 +詐=zha4 +詑=tuo2 +詒=yi2,dai4 +詓=qu3 +詔=zhao4 +評=ping2 +詖=bi4 +詗=xiong4 +詘=qu1 +詙=ba2 +詚=da2 +詛=zu3 +詜=tao1 +詝=zhu3 +詞=ci2 +詟=zhe2 +詠=yong3 +詡=xu3 +詢=xun2 +詣=yi4 +詤=huang3 +詥=he2 +試=shi4 +詧=cha2 +詨=xiao4 +詩=shi1 +詪=hen3 +詫=cha4 +詬=gou4 +詭=gui3 +詮=quan2 +詯=hui4 +詰=jie2 +話=hua4 +該=gai1 +詳=xiang2
+詴=wei1 +詵=shen1 +詶=chou2 +詷=tong2 +詸=mi2 +詹=zhan1 +詺=ming2 +詻=luo4 +詼=hui1 +詽=yan2 +詾=xiong1 +詿=gua4 +誀=er4 +誁=bing4 +誂=tiao3,diao4 +誃=yi2,chi3,chi4 +誄=lei3 +誅=zhu1 +誆=kuang1 +誇=kua1,kua4 +誈=wu1 +誉=yu4 +誊=teng2 +誋=ji4 +誌=zhi4 +認=ren4 +誎=cu4 +誏=lang3,lang4 +誐=e2 +誑=kuang2 +誒=ei1,ei2,ei3,ei4,xi1 +誓=shi4 +誔=ting3 +誕=dan4 +誖=bei4,bo2 +誗=chan2 +誘=you4 +誙=keng1 +誚=qiao4 +誛=qin1 +誜=shua4 +誝=an1 +語=yu3,yu4 +誟=xiao4 +誠=cheng2 +誡=jie4 +誢=xian4 +誣=wu1 +誤=wu4 +誥=gao4 +誦=song4 +誧=bu1 +誨=hui4 +誩=jing4 +說=shuo1,shui4,yue4 +誫=zhen4 +説=shuo1,shui4,yue4 +読=du2 +誮=hua1 +誯=chang4 +誰=shui2,shei2 +誱=jie2 +課=ke4 +誳=qu1,jue4 +誴=cong2 +誵=xiao2 +誶=sui4 +誷=wang3 +誸=xian2 +誹=fei3 +誺=chi1,lai4 +誻=ta4 +誼=yi4 +誽=ni4,na2 +誾=yin2 +調=diao4,tiao2 +諀=pi3,bei1 +諁=zhuo2 +諂=chan3 +諃=chen1 +諄=zhun1 +諅=ji4,ji1 +諆=qi1 +談=tan2 +諈=zhui4 +諉=wei3 +諊=ju1 +請=qing3 +諌=dong3 +諍=zheng4 +諎=ze2,zuo4,zha3,cuo4 +諏=zou1 +諐=qian1 +諑=zhuo2 +諒=liang4 +諓=jian4 +諔=chu4,ji2 +諕=xia4,hao2 +論=lun4,lun2 +諗=shen3 +諘=biao3 +諙=hua4 +諚=bian4 +諛=yu2 +諜=die2 +諝=xu1 +諞=pian3 +諟=shi4,di4 +諠=xuan1 +諡=shi4 +諢=hun4 +諣=hua4,gua1 +諤=e4 +諥=zhong4 +諦=di4 +諧=xie2 +諨=fu2 +諩=pu3 +諪=ting2 +諫=jian4 +諬=qi3 +諭=yu4 +諮=zi1 +諯=zhuan1 +諰=xi3,shai1,ai1 +諱=hui4 +諲=yin1 +諳=an1 +諴=xian2 +諵=nan2,nan4 +諶=chen2 +諷=feng3 +諸=zhu1 +諹=yang2 +諺=yan4 +諻=huang2 +諼=xuan1 +諽=ge2 +諾=nuo4 +諿=xu3 +謀=mou2 +謁=ye4 +謂=wei4 +謃=xing1 +謄=teng2 +謅=zhou1 +謆=shan4 +謇=jian3 +謈=bo2 +謉=kui4 +謊=huang3 +謋=huo4 +謌=ge1 +謍=ying2 +謎=mi2 +謏=xiao3 +謐=mi4 +謑=xi3 +謒=qiang1 +謓=chen1 +謔=xue4 +謕=ti2 +謖=su4 +謗=bang4 +謘=chi2 +謙=qian1 +謚=shi4 +講=jiang3 +謜=yuan2 +謝=xie4 +謞=he4 +謟=tao1 +謠=yao2 +謡=yao2 +謢=lu1 +謣=yu2 +謤=biao1 +謥=cong4 +謦=qing3 +謧=li2 +謨=mo2 +謩=mo2 +謪=shang1 +謫=zhe2 +謬=miu4 +謭=jian3 +謮=ze2 +謯=jie1 +謰=lian2 +謱=lou2 +謲=can4 +謳=ou1 +謴=gun4 +謵=xi2 +謶=zhuo2 +謷=ao2 +謷牙诘屈=ao2,ya2,jie2,qu1 +謸=ao2 +謹=jin3 +謺=zhe2 +謻=yi2 +謼=hu1 +謽=jiang4 +謾=man2 +謿=chao2 +譀=han4 +譁=hua2 +譂=chan3 +譃=xu1 +譄=zeng1 +譅=se4 +譆=xi1 +譇=zha1 +譈=dui4 +證=zheng4 +譊=nao2 +譋=lan2 +譌=e2 +譍=ying1 +譎=jue2 +譏=ji1 +譐=zun3 +譑=jiao3 
+譒=bo4 +譓=hui4 +譔=zhuan4 +譕=wu2 +譖=zen4 +譗=zha2 +識=shi2 +譙=qiao2 +譚=tan2 +譛=jian4 +譜=pu3 +譝=sheng2 +譞=xuan1 +譟=zao4 +譠=tan2 +譡=dang3 +譢=sui4 +譣=xian3 +譤=ji1 +譥=jiao4 +警=jing3 +譧=zhan4 +譨=nong2 +譩=yi1 +譪=ai3 +譫=zhan1 +譬=pi4 +譭=hui3 +譮=hua4 +譯=yi4 +議=yi4 +譱=shan4 +譲=rang4 +譳=rou4 +譴=qian3 +譵=dui4 +譶=ta4 +護=hu4 +譸=zhou1 +譹=hao2 +譺=ai4 +譻=ying1 +譼=jian1 +譽=yu4 +譾=jian3 +譿=hui4 +讀=du2 +讁=zhe2 +讂=juan4,xuan1 +讃=zan4 +讄=lei3 +讅=shen3 +讆=wei4 +讇=chan3 +讈=li4 +讉=yi2,tui1 +變=bian4 +讋=zhe2 +讌=yan4 +讍=e4 +讎=chou2 +讏=wei4 +讐=chou2 +讑=yao4 +讒=chan2 +讓=rang4 +讔=yin3 +讕=lan2 +讖=chen4 +讗=xie2 +讘=nie4 +讙=huan1 +讚=zan4 +讛=yi4 +讜=dang3 +讝=zhan2 +讞=yan4 +讟=du2 +讠=yan2 +计=ji4 +计划=ji4,hua4 +计尽力穷=ji4,jin4,li4,qiong2 +计深虑远=ji4,sheng1,lv4,yuan3 +计量=ji4,liang2 +订=ding4 +讣=fu4 +认=ren4 +认为=ren4,wei2 +认影为头=ren4,ying3,wei2,tou2 +认得=ren4,de5 +认识=ren4,shi5 +认识论=ren4,shi5,lun4 +认贼为子=ren4,zei2,wei2,zi3 +认贼为父=ren4,zei2,wei2,fu4 +讥=ji1 +讦=jie2 +讧=hong4 +讨=tao3 +讨价还价=tao3,jia4,huan2,jia4 +讨便宜=tao3,bian4,yi2 +讨还=tao3,huan2 +让=rang4 +讪=shan4 +讫=qi4 +讬=tuo1 +训=xun4 +议=yi4 +议长=yi4,zhang3 +讯=xun4 +记=ji4 +记得=ji4,de5 +记载=ji4,zai3 +讱=ren4 +讲=jiang3 +讳=hui4 +讳树数马=hui4,shu4,shu4,ma3 +讴=ou1 +讵=ju4 +讶=ya4 +讷=ne4 +许=xu3,hu3 +讹=e2 +讹误=e2,wu4 +论=lun4,lun2 +论处=lun4,chu3 +论语=lun2,yu3 +论调=lun4,diao4 +论难=lun4,nan4 +论黄数白=lun4,huang2,shu4,bai2 +讻=xiong1 +讼=song4 +讽=feng3 +设=she4 +设心处虑=she4,xin1,chu3,lv4 +设身处地=she4,shen1,chu3,di4 +访=fang3 +访查=fang3,zha1 +诀=jue2 +证=zheng4 +诂=gu3 +诃=he1 +诃子=he1,zi3 +评=ping2 +评传=ping2,zhuan4 +评卷=ping2,juan4 +诅=zu3 +识=shi2,zhi4 +识微知著=shi2,wei1,zhi1,zhuo2 +识时务者为俊杰=shi2,shi2,wu4,zhe3,wei2,jun4,jie2 +识相=shi2,xiang4 +识记=zhi4,ji4 +诇=xiong4 +诈=zha4 +诈降=zha4,xiang2 +诉=su4 +诊=zhen3 +诋=di3 +诌=zhou1 +词=ci2 +词甚切激=ci2,shen4,qie4,ji1 +词调=ci2,diao4 +诎=qu1 +诎要桡腘=qu1,yao4,rao2,yu4 +诏=zhao4 +诐=bi4 +译=yi4 +诒=yi2,dai4 +诓=kuang1 +诔=lei3 +试=shi4 +试卷=shi4,juan4 +试种=shi4,zhong4 +诖=gua4 +诗=shi1 +诗书发冢=shi1,shu1,fa4,zhong3 +诗行=shi1,hang2 +诘=jie2,ji2 +诘屈=ji2,qu1 +诘屈磝碻=jie2,qu1,bing4,zhou4 +诘屈聱牙=ji2,qu1,ao2,ya2 
+诘屈謷牙=jie2,qu1,ao2,ya2 +诘朝=ji2,chao2 +诙=hui1 +诚=cheng2 +诚朴=cheng2,pu3 +诛=zhu1 +诜=shen1 +话=hua4 +话匣子=hua4,xia2,zi5 +话把儿=hua4,ba4,er5 +诞=dan4 +诞谩不经=dan4,man4,bu4,jing1 +诟=gou4 +诠=quan2 +诡=gui3 +询=xun2 +询查=xun2,cha2 +诣=yi4 +诤=zheng4 +该=gai1 +该着=gai1,zhao2 +详=xiang2,yang2 +详星拜斗=xiang2,xing1,bai4,dou3 +诧=cha4 +诨=hun4 +诩=xu3 +诪=zhou1,chou2 +诪张为幻=zhou1,zhang1,wei2,huan4 +诫=jie4 +诬=wu1 +诬良为盗=wu1,liang2,wei2,dao4 +语=yu3,yu4 +语不惊人=yu3,bu4,jing1,ren2 +语塞=yu3,se4 +语笑喧阗=yu3,xiao4,xuan1,tian2 +语调=yu3,diao4 +诮=qiao4 +误=wu4 +误以为=wu4,yi3,wei2 +误作非为=wu4,zuo4,fei1,wei2 +误差=wu4,cha1 +诰=gao4 +诱=you4 +诱降=you4,xiang2 +诲=hui4 +诲人不倦=hui4,ren2,bu4,juan4 +诲人不惓=hui4,ren2,bu4,quan2 +诳=kuang2 +说=shuo1,shui4,yue4 +说不着=shuo1,bu4,zhao2 +说头儿=shuo1,tou5,er5 +说客=shui4,ke4 +说得来=shuo1,de5,lai2 +说服=shui4,fu2 +说漂亮话=shuo1,piao4,liang4,hua4 +说项=shui4,xiang4 +诵=song4 +诶=ei1,ei2,ei3,ei4,xi1 +请=qing3 +请假=qing3,jia4 +请将不如激将=qing3,jiang4,bu4,ru2,ji1,jiang4 +请帖=qing3,tie3 +请调=qing3,diao4 +请降=qing3,xiang2 +诸=zhu1 +诸宫调=zhu1,gong1,diao4 +诸葛=zhu1,ge3 +诹=zou1 +诺=nuo4 +读=du2,dou4 +读书君子=du2,shu1,jun1,zi3 +读书种子=du2,shu1,zhong3,zi3 +读数=du2,shu4 +诼=zhuo2 +诽=fei3 +课=ke4 +课卷=ke4,juan4 +课嘴撩牙=ke4,zui3,liao2,ya2 +诿=wei3 +谀=yu2 +谁=shui2 +谁的=shei2,de5 +谁都=shei2,dou1 +谂=shen3 +调=tiao2,diao4,zhou1 +调人=diao4,ren2 +调令=diao4,ling4 +调任=diao4,ren4 +调值=diao4,zhi2 +调停=tiao2,ting2 +调兵遣将=diao4,bing1,qian3,jiang4 +调养=tiao2,yang3 +调准=tiao2,zhun3 +调函=diao4,han2 +调制=tiao2,zhi4 +调动=diao4,dong4 +调匀=tiao2,yun2 +调包=diao4,bao1 +调卷=diao4,juan4 +调取=diao4,qu3 +调号=diao4,hao4 +调味=tiao2,wei4 +调嘴调舌=tiao2,zui3,diao4,she2 +调回=diao4,hui2 +调处=tiao2,chu3 +调头=diao4,tou2 +调子=diao4,zi5 +调干=diao4,gan4 +调度=diao4,du4 +调式=diao4,shi4 +调弦品竹=diao4,xian2,pin3,zhu2 +调情=tiao2,qing2 +调戏=tiao2,xi4 +调换=diao4,huan4 +调教=tiao2,jiao4 +调整=tiao2,zheng3 +调查=diao4,cha2 +调查结果=diao4,cha2,jie2,guo3 +调正=diao4,zheng4 +调派=diao4,pai4 +调温=diao4,wen1 +调演=diao4,yan3 +调用=diao4,yong4 +调皮=tiao2,pi2 +调研=diao4,yan2 +调离=diao4,li2 +调笑=tiao2,xiao4 +调经=diao4,jing1 +调职=diao4,zhi2
+调虎离山=diao4,hu3,li2,shan1 +调试=tiao2,shi4 +调调=tiao2,diao4 +调转=diao4,zhuan3 +调运=diao4,yun4 +调遣=diao4,qian3 +调配=diao4,pei4 +调门=diao4,men2 +调门儿=diao4,men2,er2 +调阅=diao4,yue4 +调防=diao4,fang2 +调集=diao4,ji2 +调音=tiao2,yin1 +谄=chan3 +谄上抑下=chan3,shang4,yi5,xia4 +谅=liang4 +谆=zhun1 +谇=sui4 +谈=tan2 +谈得来=tan2,de5,lai2 +谈言微中=tan2,yan2,wei1,zhong4 +谉=shen3 +谊=yi4 +谊切苔岑=yi4,qie4,tai2,cen2 +谋=mou2 +谋为不轨=mou2,wei2,bu4,gui3 +谋划=mou2,hua4 +谋臣猛将=mou2,chen2,meng3,jiang1 +谌=chen2,shen4 +谍=die2 +谎=huang3 +谎称=huang1,cheng1 +谎言=huang3,yan2 +谏=jian4 +谐=xie2 +谑=xue4 +谒=ye4 +谓=wei4 +谔=e4 +谕=yu4 +谖=xuan1 +谗=chan2 +谘=zi1 +谙=an1 +谚=yan4 +谛=di4 +谜=mi2 +谜一样=mi2,yi2,yang4 +谝=pian3 +谞=xu1 +谟=mo2 +谠=dang3 +谡=su4 +谢=xie4 +谣=yao2 +谤=bang4 +谥=shi4 +谦=qian1 +谦受益=qian1,shou4,yi4 +谧=mi4 +谨=jin3 +谩=man2 +谩不经意=man4,bu4,jing1,yi4 +谩天谩地=man4,tian1,man4,di4 +谩藏诲盗=man4,cang2,hui4,dao4 +谩辞哗说=man4,ci2,hua2,shuo1 +谩骂=man4,ma4 +谪=zhe2 +谫=jian3 +谬=miu4 +谭=tan2 +谭言微中=tan2,yan2,wei1,zhong4 +谮=zen4 +谯=qiao2 +谰=lan2 +谱=pu3 +谲=jue2 +谳=yan4 +谴=qian3 +谵=zhan1 +谶=chen4 +谷=gu3 +谷坊=gu3,fang2 +谸=qian1 +谹=hong2 +谺=xia1 +谻=ji2 +谼=hong2 +谽=han1 +谾=hong1 +谿=xi1 +豀=xi1 +豁=huo1,huo4,hua2 +豁亮=huo4,liang4 +豁人耳目=huo4,ren2,er3,mu4 +豁免=huo4,mian3 +豁出=huo4,chu1 +豁出去=huo4,chu1,qu4 +豁口=huo1,kou3 +豁拳=huo2,quan2 +豁朗=huo4,lang3 +豁然=huo4,ran2 +豁达=huo4,da2 +豂=liao2 +豃=han3 +豄=du2 +豅=long2 +豆=dou4 +豆泡=dou4,pao1 +豆腐=dou4,fu5 +豆腐干=dou4,fu3,gan4 +豆角儿=dou4,jue2,er2 +豆重榆瞑=dou4,chong2,yu2,ming2 +豇=jiang1 +豈=qi3,kai3 +豉=chi3 +豊=li3 +豋=deng1 +豌=wan1 +豍=bi1 +豎=shu4 +豏=xian4 +豐=feng1 +豑=zhi4 +豒=zhi4 +豓=yan4 +豔=yan4 +豕=shi3 +豖=chu4 +豗=hui1 +豘=tun2 +豙=yi4 +豚=tun2 +豛=yi4 +豜=jian1 +豝=ba1 +豞=hou4 +豟=e4 +豠=chu2 +象=xiang4 +象煞有介事=xiang4,sha4,you3,jie4,shi4 +豢=huan4 +豣=jian1,yan4 +豤=ken3 +豥=gai1 +豦=ju4 +豧=fu2 +豨=xi1 +豩=bin1 +豪=hao2 +豪兴=hao2,xing4 +豪商巨贾=hao2,shang1,ju4,jia3 +豪干暴取=hao2,gan4,bao4,qu3 +豪横=hao2,heng4 +豪横跋扈=hao2,heng2,ba2,hu4 +豫=yu4 +豬=zhu1 +豭=jia1 +豮=fen2 +豯=xi1 +豰=hu4 +豱=wen1 +豲=huan2 +豳=bin1 +豴=di2 +豵=zong1 +豶=fen2 +豷=yi4 +豸=zhi4 +豹=bao4 +豺=chai2 +豻=an4 
+豼=pi2 +豽=na4 +豾=pi1 +豿=gou3 +貀=na4 +貁=you4 +貂=diao1 +貃=mo4 +貄=si4 +貅=xiu1 +貆=huan2,huan1 +貇=ken3,kun1 +貈=he2,mo4 +貉=he2,hao2,mo4 +貉子=hao2,zi5 +貉绒=hao2,rong2 +貊=mo4 +貋=an4 +貌=mao4 +貍=li2 +貎=ni2 +貏=bi3 +貐=yu3 +貑=jia1 +貒=tuan1,tuan4 +貓=mao1,mao2 +貔=pi2 +貕=xi1 +貖=yi4 +貗=ju4,lou2 +貘=mo4 +貙=chu1 +貚=tan2 +貛=huan1 +貜=jue2 +貝=bei4 +貞=zhen1 +貟=yuan2,yun2,yun4 +負=fu4 +財=cai2 +貢=gong4 +貣=dai4 +貤=yi4,yi2 +貥=hang2 +貦=wan2 +貧=pin2 +貨=huo4 +販=fan4 +貪=tan1 +貫=guan4 +責=ze2,zhai4 +貭=zhi4 +貮=er4 +貯=zhu4 +貰=shi4 +貱=bi4 +貲=zi1 +貳=er4 +貴=gui4 +貵=pian3 +貶=bian3 +買=mai3 +貸=dai4 +貹=sheng4 +貺=kuang4 +費=fei4 +貼=tie1 +貽=yi2 +貾=chi2 +貿=mao4 +賀=he4 +賁=bi4,ben1 +賂=lu4 +賃=lin4 +賄=hui4 +賅=gai1 +賆=pian2 +資=zi1 +賈=jia3,gu3,jia4 +賉=xu4 +賊=zei2 +賋=jiao3 +賌=gai1 +賍=zang1 +賎=jian4 +賏=ying1 +賐=jun4 +賑=zhen4 +賒=she1 +賓=bin1 +賔=bin1 +賕=qiu2 +賖=she1 +賗=chuan4 +賘=zang1 +賙=zhou1 +賚=lai4 +賛=zan4 +賜=ci4 +賝=chen1 +賞=shang3 +賟=tian3 +賠=pei2 +賡=geng1 +賢=xian2 +賣=mai4 +賤=jian4 +賥=sui4 +賦=fu4 +賧=dan3 +賨=cong2 +賩=cong2 +質=zhi4 +賫=ji1 +賬=zhang4 +賭=du3 +賮=jin4 +賯=xiong1,min2 +賰=chun3 +賱=yun3 +賲=bao3 +賳=zai1 +賴=lai4 +賵=feng4 +賶=cang4 +賷=ji1 +賸=sheng4 +賹=ai4 +賺=zhuan4,zuan4 +賻=fu4 +購=gou4 +賽=sai4 +賾=ze2 +賿=liao2 +贀=yi4 +贁=bai4 +贂=chen3 +贃=wan4,zhuan4 +贄=zhi4 +贅=zhui4 +贆=biao1 +贇=yun1 +贈=zeng4 +贉=dan4 +贊=zan4 +贋=yan4 +贌=pu2 +贍=shan4 +贎=wan4 +贏=ying2 +贐=jin4 +贑=gan4 +贒=xian2 +贓=zang1 +贔=bi4 +贕=du2 +贖=shu2 +贗=yan4 +贘=shang3 +贙=xuan4 +贚=long4 +贛=gan4 +贜=zang1 +贝=bei4 +贝壳=bei4,ke2 +贝阙珠宫=bei4,que4,zhu1,gong1 +贞=zhen1 +贞松劲柏=zhen1,song1,jing4,bai3 +负=fu4 +负任蒙劳=fu4,ren4,meng2,lao2 +负俗之累=fu4,su2,zhi1,lei4 +负债累累=fu4,zhai4,lei4,lei4 +负德背义=fu4,de2,bei4,yi4 +负恩背义=fu4,en1,bei4,yi4 +负数=fu4,shu4 +负气斗狠=fu4,qi4,dou3,hen3 +负累=fu4,lei4 +负荷=fu4,he4 +贠=yuan2,yun4 +贡=gong4 +贡禹弹冠=gong4,yu3,tan2,guan1 +财=cai2 +财不露白=cai2,bu4,lu4,bai2 +财会=cai2,kuai4 +财大气粗=cai2,da4,qi4,cu4 +财长=cai2,zhang3 +责=ze2,zhai4 +责难=ze2,nan4 +贤=xian2 +败=bai4 +败兴=bai4,xing4 +败军之将=bai4,jun1,zhi1,jiang4 +败国丧家=bai4,guo2,sang4,jia1 +败将=bai4,jiang4 +败血=bai4,xue4 +败血病=bai4,xue4,bing4 
+败血症=bai4,xue4,zheng4 +败露=bai4,lu4 +账=zhang4 +货=huo4 +货而不售=huo4,er2,bu4,shou4 +质=zhi4 +质伛影曲=zhi4,yu3,ying3,qu1 +质因数=zhi4,yin1,shu4 +质当=zhi4,dang4 +质数=zhi4,shu4 +质朴=zhi4,pu3 +质疑问难=zhi4,yi2,wen4,nan4 +质的=zhi4,di4 +贩=fan4 +贩夫皁隶=fan4,fu1,zao4,li4 +贩夫走卒=fan4,fu1,zou3,zu2 +贪=tan1 +贪便宜=tan1,bian4,yi2 +贪夫狥利=tan1,fu1,xun4,li4 +贪夫狥财=tan1,fu1,xun4,cai2 +贪惏无餍=tan1,lin2,wu2,yan4 +贫=pin2 +贫嘴薄舌=pin2,zui3,bo2,she2 +贫富差距=pin2,fu4,cha1,ju4 +贫血=pin2,xue4 +贬=bian3 +购=gou4 +购得=gou4,de5 +贮=zhu4 +贯=guan4 +贰=er4 +贱=jian4 +贱骨头=jian4,gu3,tou5 +贲=bi4,ben1 +贲星=ben1,xing1 +贲溃=ben1,kui4 +贲门=ben1,men2 +贳=shi4 +贴=tie1 +贴切=tie1,qie4 +贴着=tie1,zhe5 +贵=gui4 +贵冠履轻头足=gui4,guan1,lv3,qing1,tou2,zu2 +贵处=gui4,chu3 +贵干=gui4,gan4 +贶=kuang4 +贷=dai4 +贸=mao4 +贸易逆差=mao4,yi4,ni4,cha1 +贸易顺差=mao4,yi4,shun4,cha1 +费=fei4 +费劲=fei4,jin4 +贺=he4 +贻=yi2 +贻累=yi2,lei4 +贼=zei2 +贼骨头=zei2,gu2,tou5 +贽=zhi4 +贾=jia3,gu3 +贾人=gu3,ren2 +贾勇=gu3,yong3 +贾官=gu3,guan1 +贾欺=gu3,qi1 +贾用=gu3,yong4 +贾田=gu3,tian2 +贾祸=gu3,huo4 +贾贷=gu3,dai4 +贾贸=gu3,mao4 +贾资=gu3,zi1 +贾道=gu3,dao4 +贾马=gu3,ma3 +贿=hui4 +赀=zi1 +赁=lin4 +赂=lu4 +赃=zang1 +资=zi1 +资深望重=zi1,shen1,wang4,zhong4 +赅=gai1 +赆=jin4 +赇=qiu2 +赈=zhen4 +赉=lai4 +赊=she1 +赋=fu4 +赋予=fu4,yu3 +赌=du3 +赍=ji1 +赍志而没=ji1,zhi4,er2,mo4 +赎=shu2 +赎当=shu2,dang4 +赏=shang3 +赏不当功=shang3,bu4,dang1,gong1 +赐=ci4 +赐予=ci4,yu3 +赑=bi4 +赒=zhou1 +赓=geng1 +赔=pei2 +赔不是=pei2,bu2,shi4 +赔还=pei2,huan2 +赕=dan3 +赖=lai4 +赗=feng4 +赘=zhui4 +赙=fu4 +赚=zhuan4 +赚头=zhuan4,tou5 +赚得=zuan4,de5 +赛=sai4 +赛璐玢=sai4,lu4,fen1 +赜=ze2 +赝=yan4 +赞=zan4 +赟=yun1 +赠=zeng4 +赠予=zeng4,yu3 +赡=shan4 +赢=ying2 +赢得=ying2,de5 +赣=gan4 +赤=chi4 +赤崁楼=chi4,kan3,lou2 +赤绳系足=chi4,sheng2,ji4,zu2 +赤背=chi4,bei4 +赤身露体=chi4,shen1,lu4,ti3 +赤露=chi4,lu4 +赥=xi1 +赦=she4 +赧=nan3 +赨=tong2 +赩=xi4 +赪=cheng1 +赫=he4 +赬=cheng1 +赭=zhe3 +赮=xia2 +赯=tang2 +走=zou3 +走为上策=zou3,wei2,shang4,ce4 +走为上计=zou3,wei2,shang4,ce4 +走吧=zou3,ba5 +走着瞧=zou3,zhe5,qiao2 +走花溜水=zou3,hua1,liu1,shui3 +走调=zou3,diao4 +走调儿=zou3,diao4,er2 +赱=zou3 +赲=li4 +赳=jiu1 +赴=fu4 +赴汤蹈火=fu4,tang1,dao3,huo3 +赴难=fu4,nan4 +赵=zhao4 +赶=gan3
+赶得上=gan3,de5,shang4 +赶得及=gan3,de5,ji2 +赶活儿=gan3,huo2,er5 +赶浪头=gan3,lang4,tou2 +赶着=gan3,zhe5 +赶驴=gan3,lv3 +赶鸭子=gan3,ya1,zi5 +赶鸭子上架=gan3,ya1,zi5,shang4,jia4 +起=qi3 +起偃为竖=qi3,yan3,wei2,shu4 +起劲=qi3,jin4 +起哄=qi3,hong4 +起解=qi3,jie4 +赸=shan4 +赹=qiong2 +赺=yin3 +赻=xian3 +赼=zi1 +赽=jue2 +赾=qin3 +赿=chi2 +趀=ci1 +趁=chen4 +趁水和泥=chen4,shui3,huo4,ni2 +趁空=chen4,kong4 +趁风使柁=chen4,feng1,shi3,duo4 +趂=chen4 +趃=die2,tu2 +趄=qie4,ju1 +超=chao1 +超市=chao1,shi4 +超然独处=chao1,ran2,du2,chu3 +趆=di1 +趇=xi4 +趈=zhan1 +趉=jue2 +越=yue4 +趋=qu1,cu4 +趌=ji2,jie2 +趍=qu1 +趎=chu2 +趏=gua1,huo2 +趐=xue4 +趑=zi1 +趑趄=zi1,ju1 +趑趄不前=zi1,ju1,bu4,qian2 +趒=tiao4 +趓=duo3 +趔=lie4 +趔趄=lie4,qie4 +趕=gan3 +趖=suo1 +趗=cu4 +趘=xi2 +趙=zhao4 +趚=su4 +趛=yin3 +趜=ju2 +趝=jian4 +趞=que4,qi4,ji2 +趟=tang4,tang1 +趟地=tang1,di4 +趟水=tang1,shui3 +趠=chuo1,zhuo2 +趡=cui3 +趢=lu4 +趣=qu4,cu4 +趤=dang4 +趥=qiu1 +趦=zi1 +趧=ti2 +趨=qu1,cu4 +趩=chi4 +趪=huang2 +趫=qiao2 +趬=qiao1 +趭=jiao4 +趮=zao4 +趯=ti4,yue4 +趰=er3 +趱=zan3 +趲=zan3 +足=zu2 +趴=pa1 +趵=bao4,bo1 +趵突泉=bao4,tu1,quan2 +趶=kua4,wu4 +趷=ke1 +趸=dun3 +趹=jue2,gui4 +趺=fu1 +趻=chen3 +趼=jian3 +趽=fang1,fang4,pang2 +趾=zhi3 +趿=ta1 +趿拉=ta1,la5 +趿拉着鞋=ta1,la5,zhe5,xie2 +跀=yue4 +跁=ba4,pao2 +跂=qi2,qi3 +跃=yue4 +跄=qiang1,qiang4 +跄踉=qiang4,liang4 +跅=tuo4 +跆=tai2 +跇=yi4 +跈=jian4,chen2 +跉=ling2 +跊=mei4 +跋=ba2 +跌=die1 +跌宕不羁=die2,dang4,bu4,ji1 +跌弹斑鸠=die1,dan4,ban1,jiu1 +跍=ku1 +跎=tuo2 +跏=jia1 +跐=ci1,ci3 +跐着门槛=ci3,zhe5,men2,kan3 +跑=pao3,pao2 +跑了和尚跑不了寺=pao3,le5,he2,shang4,pao3,bu4,le5,si4 +跑了和尚跑不了庙=pao3,le5,he2,shang4,pao3,bu4,le5,miao4 +跑码头=pao3,ma3,tou2 +跑调=pao3,diao4 +跑马卖解=pao3,ma3,mai4,xie4 +跒=qia3 +跓=zhu4 +跔=ju1 +跕=dian3,tie1,die2 +跖=zhi2 +跗=fu1 +跗萼载韡=fu1,e4,zai3,wei3 +跘=pan2,ban4 +跙=ju1,ju4,qie4 +跚=shan1 +跛=bo3 +跛子=bo3,zi5 +跜=ni2 +距=ju4 +跞=li4,luo4 +跟=gen1 +跟头=gen1,tou5 +跟差=gen1,chai1 +跟斗=gen1,dou3 +跟着=gen1,zhe5 +跠=yi2 +跡=ji4 +跢=dai4,duo4,duo1,chi2 +跣=xian3 +跤=jiao1 +跥=duo4 +跦=zhu1 +跧=quan2 +跨=kua4 +跩=zhuai3 +跪=gui4 +跫=qiong2 +跬=kui3 +跭=xiang2 +跮=die2 +路=lu4 +路卡=lu4,qia3 +路子=lu4,zi5 +路数=lu4,shu4 +跰=pian2,beng4 +跱=zhi4 +跲=jie2 
+跳=tiao4,tao2 +跳行=tiao4,hang2 +跳踉=tiao4,liang2 +跴=cai3 +践=jian4 +跶=da2 +跷=qiao1 +跸=bi4 +跹=xian1 +跺=duo4 +跻=ji1 +跼=ju2 +跽=ji4 +跾=shu1,chou1 +跿=tu2 +踀=chuo4 +踁=jing4 +踂=nie4 +踃=xiao1 +踄=bu4 +踅=xue2 +踆=qun1 +踇=mu3 +踈=shu1 +踉=liang2,liang4 +踉跄=liang4,qiang4 +踉踉跄跄=liang4,liang4,qiang4,qiang4 +踊=yong3 +踋=jiao3 +踌=chou2 +踍=qiao1 +踎=mou2 +踏=ta4 +踐=jian4 +踑=ji1 +踒=wo1 +踓=wei3 +踔=chuo1 +踕=jie2 +踖=ji2 +踗=nie4 +踘=ju1 +踙=nie4 +踚=lun2 +踛=lu4 +踜=leng4 +踝=huai2 +踞=ju4 +踟=chi2 +踠=wan3 +踡=quan2 +踢=ti1 +踣=bo2 +踤=zu2 +踥=qie4 +踦=qi1 +踧=cu4 +踨=zong1 +踩=cai3 +踪=zong1 +踫=peng4 +踬=zhi4 +踭=zheng1 +踮=dian3 +踯=zhi2 +踰=yu2 +踱=duo2 +踲=dun4 +踳=chuan3 +踴=yong3 +踵=zhong3 +踶=di4 +踷=zhe3 +踸=chen3 +踹=chuai4 +踹一脚=chuai4,yi4,jiao3 +踺=jian4 +踻=gua1 +踼=tang2 +踽=ju3 +踾=fu2 +踿=cu4 +蹀=die2 +蹁=pian2 +蹂=rou2 +蹃=nuo4 +蹄=ti2 +蹄閒三寻=ti2,jian4,san1,xun2 +蹅=cha3 +蹆=tui3 +蹇=jian3 +蹈=dao3 +蹈其覆辙=dao3,qi4,fu4,zhe2 +蹈锋饮血=dao3,feng1,yin3,xue4 +蹉=cuo1 +蹊=qi1,xi1 +蹊跷=qi1,qiao1 +蹋=ta4 +蹌=qiang1 +蹍=nian3 +蹎=dian1 +蹏=ti2 +蹐=ji2 +蹑=nie4 +蹑蹻担簦=nie4,jue1,dan1,deng1 +蹒=pan2 +蹓=liu1 +蹓跶=liu1,da5 +蹔=zan4 +蹕=bi4 +蹖=chong1 +蹗=lu4 +蹘=liao2 +蹙=cu4 +蹚=tang1 +蹛=dai4 +蹜=su4 +蹝=xi3 +蹞=kui3 +蹟=ji4 +蹠=zhi2 +蹡=qiang1 +蹢=di2 +蹣=pan2 +蹤=zong1 +蹥=lian2 +蹦=beng4 +蹦跶=beng4,da5 +蹧=zao1 +蹨=nian3 +蹩=bie2 +蹪=tui2 +蹫=ju2 +蹬=deng1 +蹭=ceng4 +蹭蹬=ceng4,deng4 +蹮=xian1 +蹯=fan2 +蹰=chu2 +蹱=zhong1 +蹲=dun1 +蹳=bo1 +蹴=cu4 +蹵=cu4 +蹶=jue2,jue3 +蹷=jue2 +蹸=lin4 +蹹=ta4 +蹺=qiao1 +蹻=qiao1 +蹼=pu3 +蹽=liao1 +蹾=dun1 +蹿=cuan1 +躀=guan4 +躁=zao4 +躂=ta4 +躃=bi4 +躄=bi4 +躅=zhu2 +躆=ju4 +躇=chu2 +躈=qiao4 +躉=dun3 +躊=chou2 +躋=ji1 +躌=wu3 +躍=yue4 +躎=nian3 +躏=lin4 +躐=lie4 +躑=zhi2 +躒=li4,luo4 +躓=zhi4 +躔=chan2 +躕=chu2 +躖=duan4 +躗=wei4 +躘=long2,long3 +躙=lin4 +躚=xian1 +躛=wei4 +躜=zuan1 +躝=lan2 +躞=xie4 +躟=rang2 +躠=sa3,xie4 +躡=nie4 +躢=ta4 +躣=qu2 +躤=ji2 +躥=cuan1 +躦=zuan1 +躧=xi3 +躨=kui2 +躩=jue2 +躪=lin4 +身=shen1 +身体发肤=shen1,ti3,fa4,fu1 +身先朝露=shen1,xian1,zhao1,lu4 +身兼数职=shen1,jian1,shu4,zhi2 +身分=shen1,fen4 +身单力薄=shen1,dan1,li4,bo2 +身处险境=shen1,chu3,xian3,jing4 +身子=shen1,zi5 +身子骨=shen1,zi5,gu3 
+身无长物=shen1,wu2,chang2,wu4 +躬=gong1 +躭=dan1 +躮=fen1 +躯=qu1 +躯壳=qu1,qiao4 +躯干=qu1,gan4 +躰=ti3 +躱=duo3 +躲=duo3 +躳=gong1 +躴=lang2 +躵=ren3 +躶=luo3 +躷=ai3 +躸=ji1 +躹=ju1 +躺=tang3 +躻=kong1 +躼=lao4 +躽=yan3 +躾=mei3 +躿=kang1 +軀=qu1 +軁=lou2 +軂=lao4 +軃=duo3 +軄=zhi2 +軅=yan4 +軆=ti3 +軇=dao4 +軈=ying1 +軉=yu4 +車=che1,ju1 +軋=ya4,zha2,ga2 +軌=gui3 +軍=jun1 +軎=wei4 +軏=yue4 +軐=xin4,xian4 +軑=dai4 +軒=xuan1 +軓=fan4,gui3 +軔=ren4 +軕=shan1 +軖=kuang2 +軗=shu1 +軘=tun2 +軙=chen2 +軚=dai4 +軛=e4 +軜=na4 +軝=qi2 +軞=mao2 +軟=ruan3 +軠=kuang2 +軡=qian2 +転=zhuan4,zhuan3 +軣=hong1 +軤=hu1 +軥=qu2 +軦=kuang4 +軧=di3 +軨=ling2 +軩=dai4 +軪=ao1,ao4 +軫=zhen3 +軬=fan4 +軭=kuang1 +軮=yang3 +軯=peng1 +軰=bei4 +軱=gu1 +軲=gu1 +軳=pao2 +軴=zhu4 +軵=rong3 +軶=e4 +軷=ba2 +軸=zhou2,zhou4 +軹=zhi3 +軺=yao2 +軻=ke1,ke3 +軼=yi4,die2 +軽=qing1 +軾=shi4 +軿=ping2 +輀=er2 +輁=gong3 +輂=ju2 +較=jiao4 +輄=guang1 +輅=lu4 +輆=kai3 +輇=quan2 +輈=zhou1 +載=zai4 +輊=zhi4 +輋=she1 +輌=liang4 +輍=yu4 +輎=shao1 +輏=you2 +輐=wan4 +輑=yin3 +輒=zhe2 +輓=wan3 +輔=fu3 +輕=qing1 +輖=zhou1 +輗=ni2 +輘=ling2 +輙=zhe2 +輚=han4 +輛=liang4 +輜=zi1 +輝=hui1 +輞=wang3 +輟=chuo4 +輠=guo3 +輡=kan3 +輢=yi3 +輣=peng2 +輤=qian4 +輥=gun3 +輦=nian3 +輧=ping2 +輨=guan3 +輩=bei4 +輪=lun2 +輫=pai2 +輬=liang2 +輭=ruan3 +輮=rou2 +輯=ji2 +輰=yang2 +輱=xian2 +輲=chuan2 +輳=cou4 +輴=chun1 +輵=ge2 +輶=you2 +輷=hong1 +輸=shu1 +輹=fu4 +輺=zi1 +輻=fu2 +輼=wen1 +輽=fan4 +輾=zhan3 +輿=yu2 +轀=wen1 +轁=tao1 +轂=gu3 +轃=zhen1 +轄=xia2 +轅=yuan2 +轆=lu4 +轇=jiao1 +轈=chao2 +轉=zhuan3 +轊=wei4 +轋=hun1 +轌=xue3 +轍=zhe2 +轎=jiao4 +轏=zhan4 +轐=bu2 +轑=lao3 +轒=fen2 +轓=fan1 +轔=lin2 +轕=ge2 +轖=se4 +轗=kan3 +轘=huan4 +轙=yi3 +轚=ji2 +轛=dui4 +轜=er2 +轝=yu2 +轞=jian4 +轟=hong1 +轠=lei2 +轡=pei4 +轢=li4 +轣=li4 +轤=lu2 +轥=lin4 +车=che1,ju1 +车子=che1,zi5 +车行道=che1,xing2,dao4 +车载斗量=che1,zai4,dou3,liang2 +车载船装=che1,zai3,chuan2,zhuang1 +车马炮=ju1,ma3,pao4 +轧=ya4,zha2 +轧机=zha2,ji1 +轧账=ga2,zhang4 +轧辊=zha2,gun3 +轧钢=zha2,gang1 +轨=gui3 +轩=xuan1 +轪=dai4 +轫=ren4 +转=zhuan3,zhuan4,zhuai3 +转动=zhuan4,dong4 +转危为安=zhuan3,wei1,wei2,an1 +转台=zhuan4,tai2 +转嗔为喜=zhuan3,chen1,wei2,xi3 +转圈=zhuan4,quan1 +转子=zhuan4,zi3 +转干=zhuan3,gan4 
+转弯抹角=zhuan3,wan1,mo4,jiao3 +转忧为喜=zhuan3,you1,wei2,xi3 +转悠=zhuan4,you1 +转悲为喜=zhuan3,bei1,wei2,xi3 +转愁为喜=zhuan3,chou2,wei2,xi3 +转文=zhuai3,wen2 +转来转去=zhuan4,lai2,zhuan4,qu4 +转椅=zhuan4,yi3 +转湾抹角=zhuan3,wan1,mo4,jiao3 +转灾为福=zhuan3,zai1,wei2,fu2 +转炉=zhuan4,lu2 +转矩=zhuan4,ju3 +转磨=zhuan4,mo4 +转祸为福=zhuan3,huo4,wei2,fu2 +转筋=zhuan4,jin1 +转行=zhuan3,hang2 +转调=zhuan3,diao4 +转败为功=zhuan3,bai4,wei2,gong1 +转败为成=zhuan3,bai4,wei2,cheng2 +转败为胜=zhuan3,bai4,wei2,sheng4 +转身=zhuan3,shen1 +转轮=zhuan4,lun2 +转轮手枪=zhuan4,lun2,shou3,qiang1 +转载=zhuan3,zai3 +转速=zhuan4,su4 +转门=zhuan4,men2 +轭=e4 +轮=lun2 +轮机长=lun2,ji1,zhang3 +轮种=lun2,zhong4 +轮转=lun2,zhuan4 +软=ruan3 +软和=ruan3,huo5 +软禁=ruan3,jin4 +软红香土=ruan3,hong2,xiang1,yu4 +软骨头=ruan3,gu2,tou5 +轰=hong1 +轰隆=hong1,long1 +轱=gu1 +轲=ke1 +轳=lu2 +轴=zhou2,zhou4 +轵=zhi3 +轶=yi4 +轷=hu1 +轸=zhen3 +轸宿=zhen3,xiu4 +轹=li4 +轺=yao2 +轻=qing1 +轻傜薄赋=qing1,yao1,bao2,fu4 +轻嘴薄舌=qing1,zui3,bo2,she2 +轻才好施=qing1,cai2,hao4,shi1 +轻率=qing1,shuai4 +轻薄=qing1,bo2 +轻薄无知=qing1,bao2,wu2,zhi1 +轻薄无礼=qing1,bao2,wu2,li3 +轻薄无行=qing1,bao2,wu2,xing2 +轻财好义=qing1,cai2,hao4,yi4 +轻车简从=qing1,che1,jian3,cong2 +轻骑简从=qing1,ji4,jian3,cong2 +轼=shi4 +载=zai4,zai3 +载体=zai4,ti3 +载入史册=zai3,ru4,shi3,ce4 +载客=zai4,ke4 +载歌载舞=zai4,ge1,zai4,wu3 +载荷=zai4,he4 +载货=zai4,huo4 +载运=zai4,yun4 +载重=zai4,zhong4 +载驰载驱=zai3,chi2,zai3,qu1 +轾=zhi4 +轿=jiao4 +轿子=jiao4,zi5 +辀=zhou1 +辁=quan2 +辂=lu4 +较=jiao4 +较为=jiao4,wei2 +较短量长=jiao4,duan3,liang2,chang2 +辄=zhe2 +辅=fu3 +辅世长民=fu3,shi4,zhang3,min2 +辅相=fu3,xiang4 +辆=liang4 +辇=nian3 +辈=bei4 +辈子=bei4,zi5 +辈数=bei4,shu4 +辈数儿=bei4,shu4,er2 +辉=hui1 +辊=gun3 +辋=wang3 +辌=liang2 +辍=chuo4 +辎=zi1 +辏=cou4 +辐=fu2 +辑=ji2 +辒=wen1 +输=shu1 +输血=shu1,xue4 +辔=pei4 +辕=yuan2 +辖=xia2 +辗=zhan3,nian3 +辘=lu4 +辙=zhe2 +辚=lin2 +辛=xin1 +辜=gu1 +辜恩背义=gu1,en1,bei4,yi4 +辝=ci2 +辞=ci2 +辟=pi4,bi4 +辟举=bi4,ju3 +辟书=bi4,shu1 +辟召=bi4,zhao4 +辟引=bi4,yin3 +辟谷=bi4,gu3 +辟邪=bi4,xie2 +辟除=bi4,chu2 +辠=zui4 +辡=bian4 +辢=la4 +辣=la4 +辤=ci2 +辥=xue1 +辦=ban4 +辧=bian4 +辨=bian4 +辩=bian4 +辩难=bian4,nan4 +辪=xue1 +辫=bian4 +辫子=bian4,zi5 +辬=ban1 +辭=ci2 
+辮=bian4 +辯=bian4 +辰=chen2 +辱=ru3 +辱国丧师=ru3,guo2,sang4,shi1 +辱没=ru3,mo4 +農=nong2 +辳=nong2 +辴=zhen3 +辵=chuo4 +辶=chuo4 +辷=yi1 +辸=reng2 +边=bian1 +边卡=bian1,qia3 +边塞=bian1,sai4 +辺=dao4,bian1 +辻=shi5 +込=yu1 +辽=liao2 +达=da2 +达姆弹=da2,mu3,dan4 +辿=chan1 +迀=gan1 +迁=qian1 +迁都=qian1,du1 +迂=yu1 +迂曲=yu1,qu1 +迃=yu1 +迄=qi4 +迅=xun4 +迆=yi3,yi2 +过=guo4,guo5,guo1 +过为已甚=guo4,wei2,yi3,shen4 +过五关斩六将=guo4,wu3,guan1,zhan3,liu4,jiang4 +过关斩将=guo4,guan1,zhan3,jiang4 +过分=guo4,fen4 +过半数=guo4,ban4,shu4 +过家家=guo1,jia1,jia1 +过屠大嚼=guo4,tu2,da4,jue2 +过屠门而大嚼=guo4,tu2,men2,er2,da4,jiao2 +过得去=guo4,dei3,qu4 +过得硬=guo4,de5,ying4 +过敏反应=guo4,min3,fan3,ying4 +过日子=guo4,ri4,zi5 +过犹不及=guo4,you2,bu4,ji2 +过都历块=guo4,du1,li4,kuai4 +过隙白驹=guo4,xi1,bai2,ju1 +迈=mai4 +迉=qi1 +迊=za1 +迋=wang4,kuang1 +迌=tu4 +迍=zhun1 +迎=ying2 +迏=da2 +运=yun4 +运数=yun4,shu4 +运智铺谋=yun4,zhi4,pu4,mou2 +运计铺谋=yun4,ji4,pu4,mou2 +运转=yun4,zhuan3 +近=jin4 +迒=hang2 +迓=ya4 +返=fan3 +返本还元=fan3,ben3,huan2,yuan2 +返本还源=fan3,ben3,huan2,yuan2 +返朴还淳=fan3,pu3,huan2,chun2 +返朴还真=fan3,pu3,huan2,zhen1 +返还=fan3,huan2 +迕=wu3 +迖=da2 +迗=e2 +还=hai2,huan2 +还不错=hai2,bu2,cuo4 +还乡=huan2,xiang1 +还书=huan2,shu1 +还价=huan2,jia4 +还俗=huan2,su2 +还债=huan2,zhai4 +还元返本=huan2,yuan2,fan3,ben3 +还击=huan2,ji1 +还原=huan2,yuan2 +还口=huan2,kou3 +还嘴=huan2,zui3 +还席=huan2,xi2 +还年却老=huan2,nian2,que4,lao3 +还年卻老=huan2,nian2,que4,lao3 +还年驻色=huan2,nian2,zhu4,se4 +还愿=huan2,yuan4 +还我河山=huan2,wo3,he2,shan1 +还手=huan2,shou3 +还本=huan2,ben3 +还朴反古=huan2,pu3,fan3,gu3 +还淳反古=huan2,chun2,fan3,gu3 +还淳反朴=huan2,chun2,fan3,pu3 +还淳反素=huan2,chun2,fan3,su4 +还淳返朴=huan2,chun2,fan3,pu3 +还清=huan2,qing1 +还珠买椟=huan2,zhu1,mai3,du2 +还珠合浦=huan2,zhu1,he2,pu3 +还珠返璧=huan2,zhu1,fan3,bi4 +还礼=huan2,li3 +还童=huan2,tong2 +还算不错=hai2,suan4,bu2,cuo4 +还给=huan2,gei3 +还账=huan2,zhang4 +还醇返朴=huan2,chun2,fan3,pu3 +还钱=huan2,qian2 +还阳=huan2,yang2 +还魂=huan2,hun2 +这=zhe4,zhei4 +这么些=zhe4,mo3,xie1 +这么点儿=zhe4,me5,dian3,er5 +这么着=zhe4,me5,zhao1 +这些=zhei4,xie1 +这会儿=zhe4,hui4,er5 +这山望着那山高=zhe4,shan1,wang4,zhe5,na4,shan1,gao1 +迚=da2 +进=jin4 +进一步=jin4,yi1,bu4 
+进寸退尺=jin4,cun4,tui4,chi3 +进给=jin4,ji3 +进给量=jin4,ji3,liang4 +进贤兴功=jin4,xian2,xing1,gong1 +进退两难=jin4,tui4,liang3,nan2 +进退中度=jin4,tui4,zhong4,du4 +进退出处=jin4,tui4,chu1,chu3 +进退消长=jin4,tui4,xiao1,chang2 +进退触籓=jin4,tui4,chu4,fan1 +进退跋疐=jin4,tui4,ba2,zhi4 +远=yuan3,yuan4 +远不间亲=yuan3,bu4,jian4,qin1 +远涉重洋=yuan3,she4,chong2,yang2 +违=wei2 +违信背约=wei2,xin4,bei4,yue1 +违禁=wei2,jin4 +违禁品=wei2,jin4,pin3 +违背=wei2,bei4 +连=lian2 +连中三元=lian2,zhong4,san1,yuan2 +连帙累牍=lian2,zhi4,lei4,du2 +连杆=lian2,gan3 +连章累牍=lian2,zhang1,lei4,du2 +连篇累册=lian2,pian1,lei4,ce4 +连篇累帙=lian2,pian1,lei4,zhi4 +连篇累帧=lian2,pian1,lei4,zhen1 +连篇累幅=lian2,pian1,lei4,fu2 +连篇累牍=lian2,pian1,lei3,du2 +连篇絫幅=lian2,pian1,lei4,fu2 +连篇絫牍=lian2,pian1,lei4,du2 +连累=lian2,lei3 +连编累牍=lian2,bian1,lei3,du2 +连翘=lian2,qiao2 +连车平斗=lian2,che1,ping2,dou3 +连载=lian2,zai3 +连载小说=lian2,zai3,xiao3,shuo1 +连锁反应=lian2,suo3,fan3,ying4 +连阡累陌=lian2,qian1,lei4,mo4 +连阶累任=lian2,jie1,lei4,ren4 +迟=chi2 +迠=che4 +迡=chi2 +迢=tiao2 +迣=zhi4,li4 +迤=yi3,yi2 +迥=jiong3 +迦=jia1 +迧=chen2 +迨=dai4 +迩=er3 +迪=di2 +迫=po4,pai3 +迫击=pai3,ji1 +迫击炮=pai3,ji1,pao4 +迫击炮弹=pai3,ji1,pao4,dan4 +迫切=po4,qie4 +迬=zhu4,wang3 +迭=die2 +迭矩重规=die2,ju3,chong2,gui1 +迮=ze2 +迯=tao2 +述=shu4 +迱=yi3,yi2 +迳=jing4 +迴=hui2 +迵=dong4 +迶=you4 +迷=mi2 +迷恋骸骨=mi2,lian4,hai2,gu3 +迷糊=mi2,hu4 +迷蒙=mi2,meng2 +迷迷糊糊=mi2,mi2,hu4,hu1 +迸=beng4 +迹=ji4 +迺=nai3 +迻=yi2 +迼=jie2 +追=zhui1,dui1 +追亡逐北=zhui1,wang2,zhu2,bei3 +追捧=zhui1,peng3 +追查=zhui1,cha2 +追欢作乐=zhui1,huan1,zuo4,le4 +追趋逐耆=zhui1,qu1,zhu2,shi4 +追还=zhui1,huan2 +追风摄景=zhui1,feng1,nie4,jing3 +迾=lie4 +迿=xun4 +退=tui4 +退还=tui4,huan2 +退避三舍=tui4,bi4,san1,she4 +送=song4 +适=shi4 +适切=shi4,qie4 +适如其分=shi4,ru2,qi2,fen4 +适居其反=shi4,ju1,qi2,fan3 +适应=shi4,ying4 +适当=shi4,dang4 +适当其冲=shi4,dang1,qi2,chong1 +适当其时=shi4,dang1,qi2,shi2 +适情率意=shi4,qing2,shuai4,yi4 +逃=tao2 +逃奔=tao2,ben4 +逃难=tao2,nan4 +逄=pang2 +逅=hou4 +逆=ni4 +逆差=ni4,cha1 +逆行倒施=ni4,xing2,dao4,shi1 +逇=dun4 +逈=jiong3 +选=xuan3 +选择=xuan3,ze2 +选调=xuan3,diao4 +逊=xun4 +逋=bu1 +逌=you1 +逍=xiao1 +逎=qiu2 +透=tou4 +透露=tou4,lu4 +逐=zhu2
+逐物不还=zhu2,wu4,bu4,huan2 +逑=qiu2 +递=di4 +递兴递废=di4,xing1,di4,fei4 +递条子=di4,tiao2,zi5 +递解=di4,jie4 +逓=di4 +途=tu2 +逕=jing4 +逖=ti4 +逗=dou4 +逘=yi3 +這=zhe4 +通=tong1 +通同一气=tong1,tong2,yi1,yi4 +通宿=tong1,xiu3 +通文调武=tong1,wen2,diao4,wu3 +通红=tong4,hong2 +通缉=tong1,ji1 +通邑大都=tong1,yi4,da4,dou1 +通都大邑=tong1,du1,da4,yi4 +通铺=tong1,pu4 +逛=guang4 +逜=wu3 +逝=shi4 +逞=cheng3 +逞性妄为=cheng3,xing4,wang4,wei2 +速=su4 +造=zao4 +造因结果=zao4,yin1,jie2,guo3 +逡=qun1 +逢=feng2 +逢人说项=feng2,ren2,shuo1,xiang4 +逢场作乐=feng2,chang3,zuo4,le4 +連=lian2 +逤=suo4 +逥=hui2 +逦=li3 +逧=gu3 +逨=lai2 +逩=ben4 +逪=cuo4 +逫=zhu2 +逬=beng4 +逭=huan4 +逮=dai4 +逮小偷=dai3,xiao3,tou1 +逮捕=dai4,bu3 +逮特务=dai3,te4,wu4 +逮老鼠=dai3,lao3,shu3 +逮蚊子=dai3,wen2,zi5 +逯=lu4 +逰=you2 +週=zhou1 +進=jin4 +逳=yu4 +逴=chuo1 +逵=kui2 +逶=wei1 +逶迤=wei1,yi2 +逷=ti4 +逸=yi4 +逹=da2 +逺=yuan3 +逻=luo2 +逼=bi1 +逼肖=bi1,xiao4 +逼良为娼=bi1,liang2,wei2,chang1 +逽=nuo4 +逾=yu2 +逾分=yu2,fen4 +逾墙钻穴=yu2,qiang2,zuan4,xue2 +逾墙钻蠙=yu2,qiang2,zuan4,pin2 +逿=dang4 +遀=sui2 +遁=dun4 +遂=sui4 +遂心应手=sui4,xin1,ying1,shou3 +遂迷不寤=sui2,mi2,bu4,wu4 +遂迷不窹=sui2,mei2,bu4,wu4 +遂非文过=sui2,fei1,wen2,guo4 +遃=yan3 +遄=chuan2 +遅=chi2 +遆=di4,ti2 +遇=yu4 +遇难=yu4,nan4 +遈=shi2 +遉=zhen1 +遊=you2 +運=yun4 +遌=e4 +遍=bian4 +過=guo4 +遏=e4 +遏恶扬善=e4,e4,yan2,shan4 +遐=xia2 +遑=huang2 +遒=qiu2 +遒劲=qiu2,jing4 +道=dao4 +道不同不相为谋=dao4,bu4,tong2,bu4,xiang1,wei2,mou2 +道在人为=dao4,zai4,ren2,wei2 +道德行为=dao4,de2,xing2,wei2 +道藏=dao4,zang4 +道行=dao4,heng2 +道观=dao4,guan4 +道远日暮=dao4,yuan4,ri4,mu4 +達=da2 +違=wei2 +遖=nan2 +遗=yi2 +遗世忘累=yi2,shi4,wang4,lei3 +遗书=wei4,shu1 +遗使=wei4,shi3 +遗劳=wei4,lao2 +遗少=yi2,shao4 +遗老遗少=yi2,lao3,yi2,shao4 +遗臭万年=yi2,chou4,wan4,nian2 +遗臭万载=yi2,chou4,wan4,zai3 +遘=gou4 +遙=yao2 +遚=chou4 +遛=liu4 +遜=xun4 +遝=ta4 +遞=di4 +遟=chi2 +遠=yuan3 +遡=su4 +遢=ta4 +遣=qian3 +遣兵调将=qian3,bing1,diao4,jiang4 +遣将调兵=qian3,jiang1,diao4,bing1 +遤=ma3 +遥=yao2 +遥呼相应=yao2,hu1,xiang1,ying4 +遥相呼应=yao2,xiang1,hu1,ying4 +遦=guan4 +遧=zhang1 +遨=ao2 +適=shi4 +遪=ca4 +遫=chi4 +遬=su4 +遭=zao1 +遭劫在数=zao1,jie2,zai4,shu4 +遭难=zao1,nan4 +遮=zhe1 +遯=dun4 +遰=di4 +遱=lou2 +遲=chi2 +遳=cuo1 
+遴=lin2 +遵=zun1 +遶=rao4 +遷=qian1 +選=xuan3 +遹=yu4 +遺=yi2 +遻=e4 +遼=liao2 +遽=ju4 +遾=shi4 +避=bi4 +避难=bi4,nan4 +避难就易=bi4,nan2,jiu4,yi4 +避难趋易=bi4,nan2,qiu4,yi4 +避风头=bi4,feng1,tou5 +邀=yao1 +邁=mai4 +邂=xie4 +邃=sui4 +還=huan2,hai2 +邅=zhan1 +邆=teng2 +邇=er3 +邈=miao3 +邈处欿视=miao3,chu3,ji1,shi4 +邉=bian1 +邊=bian1 +邋=la1 +邋遢=la1,ta1 +邌=li2,chi2 +邍=yuan2 +邎=yao2 +邏=luo2 +邐=li3 +邑=yi4 +邒=ting2 +邓=deng4 +邔=qi3 +邕=yong1 +邖=shan1 +邗=han2 +邘=yu2 +邙=mang2 +邚=ru2 +邛=qiong2 +邜=xi1 +邝=kuang4 +邞=fu1 +邟=kang4,hang2 +邠=bin1 +邡=fang1 +邢=xing2 +那=na4,na3,nei4,na1 +那么些=na4,mo3,xie1 +那么点儿=na4,me5,dian3,er5 +那么着=na4,me5,zhao1 +那些=nei4,xie1 +那会儿=na4,hui4,er5 +邤=xin1 +邥=shen3 +邦=bang1 +邦以民为本=bang1,yi3,min2,wei2,ben3 +邧=yuan2 +邨=cun1 +邩=huo3 +邪=xie2,ya2,ye2,yu2,xu2 +邪不干正=xie2,bu4,gan1,zheng4 +邪不胜正=xie2,bu4,sheng4,zheng4 +邫=bang1 +邬=wu1 +邭=ju4 +邮=you2 +邮差=you2,chai1 +邯=han2 +邰=tai2 +邱=qiu1 +邲=bi4 +邳=pi1 +邴=bing3 +邵=shao4 +邶=bei4 +邷=wa3 +邸=di3 +邹=zou1 +邺=ye4 +邻=lin2 +邻舍=lin2,she4 +邼=kuang1 +邽=gui1 +邾=zhu1 +邿=shi1 +郀=ku1 +郁=yu4 +郂=gai1,hai2 +郃=he2 +郄=qie4,xi4 +郅=zhi4 +郆=ji2 +郇=xun2,huan2 +郈=hou4 +郉=xing2 +郊=jiao1 +郋=xi2 +郌=gui1 +郍=na4 +郎=lang2,lang4 +郏=jia2 +郐=kuai4 +郑=zheng4 +郒=lang2 +郓=yun4 +郔=yan2 +郕=cheng2 +郖=dou4 +郗=xi1 +郘=lv3 +郙=fu3 +郚=wu2 +郛=fu2 +郜=gao4 +郝=hao3 +郞=lang2 +郟=jia2 +郠=geng3 +郡=jun4 +郢=ying3 +郢书燕说=ying3,shu1,yan1,shuo1 +郣=bo2 +郤=xi4 +郥=bei4 +郦=li4,zhi2 +郦食其=li4,yi4,ji1 +郧=yun2 +部=bu4 +部分=bu4,fen4 +部分切除=bu4,fen4,qie1,chu2 +部将=bu4,jiang4 +部长=bu4,zhang3 +郩=xiao2,ao3 +郪=qi1 +郫=pi2 +郬=qing1 +郭=guo1 +郮=zhou1 +郯=tan2 +郰=zou1 +郱=ping2 +郲=lai2 +郳=ni2 +郴=chen1 +郵=you2 +郶=bu4 +郷=xiang1 +郸=dan1 +郹=ju2 +郺=yong1 +郻=qiao1 +郼=yi1 +都=dou1,du1 +都中纸贵=du1,zhong1,zhi3,gui4 +都城=du1,cheng2 +都头异姓=du1,tou2,yi4,xing4 +都察院=du1,cha2,yuan4 +都尉=du1,wei4 +都市=du1,shi4 +都江堰=du1,jiang1,yan4 +都督=du1,du1 +都统=du1,tong3 +郾=yan3 +郿=mei2 +鄀=ruo4 +鄁=bei4 +鄂=e4 +鄃=shu1 +鄄=juan4 +鄅=yu3 +鄆=yun4 +鄇=hou2 +鄈=kui2 +鄉=xiang1 +鄊=xiang1 +鄋=sou1 +鄌=tang2 +鄍=ming2 +鄎=xi1 +鄏=ru3 +鄐=chu4 +鄑=zi1 +鄒=zou1 +鄓=yi4 +鄔=wu1 +鄕=xiang1 +鄖=yun2 +鄗=hao4 
+鄘=yong1 +鄙=bi3 +鄙于不屑=bi3,yu2,bu4,xie4 +鄙夷不屑=bi3,yi2,bu4,xie4 +鄙薄=bi3,bo2 +鄚=mao4 +鄛=chao2 +鄜=fu1 +鄝=liao3 +鄞=yin2 +鄟=zhuan1 +鄠=hu4 +鄡=qiao1 +鄢=yan1 +鄣=zhang1 +鄤=man4 +鄥=qiao1 +鄦=xu3 +鄧=deng4 +鄨=bi4 +鄩=xun2 +鄪=bi4 +鄫=zeng1 +鄬=wei2 +鄭=zheng4 +鄮=mao4 +鄯=shan4 +鄰=lin2 +鄱=po2 +鄲=dan1 +鄳=meng2 +鄴=ye4 +鄵=cao4 +鄶=kuai4 +鄷=feng1 +鄸=meng2 +鄹=zou1 +鄺=kuang4 +鄻=lian3 +鄼=zan4 +鄽=chan2 +鄾=you1 +鄿=qi2 +酀=yan4 +酁=chan2 +酂=cuo2,zan4 +酃=ling2 +酄=huan1 +酅=xi1 +酆=feng1 +酇=cuo2,zan4 +酈=li4 +酉=you3 +酊=ding1,ding3 +酋=qiu2 +酋长=qiu2,zhang3 +酌=zhuo2 +酌处=zhuo2,chu3 +配=pei4 +配乐=pei4,yue4 +配称=pei4,chen4 +配给=pei4,ji3 +配角=pei4,jue2 +配调=pei4,diao4 +配载=pei4,zai3 +酎=zhou4 +酏=yi3 +酐=gan1 +酑=yu2 +酒=jiu3 +酒晕=jiu3,yun4 +酒曲=jiu3,qu1 +酒铛=jiu3,cheng1 +酒铺=jiu3,pu4 +酓=yan3 +酔=zui4 +酕=mao2 +酖=dan1 +酗=xu4 +酘=dou4 +酙=zhen1 +酚=fen1 +酛=yuan2 +酜=fu1 +酝=yun4 +酞=tai4 +酟=tian1 +酠=qia3 +酡=tuo2 +酢=zuo4 +酢浆草=cu4,jiang1,cao3 +酣=han1 +酤=gu1 +酥=su1 +酦=po1 +酧=chou2 +酨=zai4 +酩=ming3 +酩酊=ming3,ding3 +酩酊大醉=ming3,ding3,da4,zui4 +酩酊烂醉=ming3,ding3,lan4,zui4 +酪=lao4 +酫=chuo4 +酬=chou2 +酬和=chou2,he4 +酬应=chou2,ying4 +酭=you4 +酮=tong2 +酯=zhi3 +酰=xian1 +酱=jiang4 +酲=cheng2 +酳=yin4 +酴=tu2 +酵=jiao4 +酶=mei2 +酷=ku4 +酸=suan1 +酹=lei4 +酺=pu2 +酻=zui4 +酼=hai3 +酽=yan4 +酾=shi1 +酿=niang4 +醀=wei2 +醁=lu4 +醂=lan3 +醃=yan1 +醄=tao2 +醅=pei1 +醆=zhan3 +醇=chun2 +醇厚=chun2,hou4 +醇朴=chun2,piao2 +醈=tan2,dan4 +醉=zui4 +醉翁之意不在酒=zui4,weng1,zhi1,yi4,bu4,zai4,jiu3 +醊=zhui4 +醋=cu4 +醋劲儿=cu4,jin4,er5 +醌=kun1 +醍=ti2,ti3 +醎=xian2 +醏=du1 +醐=hu2 +醑=xu3 +醒=xing3 +醒豁=xing3,huo4 +醓=tan3 +醔=qiu2,chou1 +醕=chun2 +醖=yun4 +醗=po1,fa1 +醘=ke1 +醙=sou1 +醚=mi2 +醛=quan2 +醜=chou3 +醝=cuo1 +醞=yun4 +醟=yong4 +醠=ang4 +醡=zha4 +醢=hai3 +醣=tang2 +醤=jiang4 +醥=piao3 +醦=chan3,chen3 +醧=yu4 +醨=li2 +醩=zao1 +醪=lao2 +醫=yi1 +醬=jiang4 +醭=bu2 +醮=jiao4 +醯=xi1 +醰=tan2 +醱=po1,fa1 +醲=nong2 +醳=yi4,shi4 +醴=li3 +醵=ju4 +醶=yan4,lian3,xian1 +醷=yi4 +醸=niang4 +醹=ru2 +醺=xun1 +醻=chou2 +醼=yan4 +醽=ling2 +醾=mi2 +醿=mi2 +釀=niang4,niang2 +釁=xin4 +釂=jiao4 +釃=shi1 +釄=mi2 +釅=yan4 +釆=bian4 +采=cai3,cai4 +采地=cai4,di4 +采邑=cai4,yi4 
+釈=shi4 +釉=you4 +釉子=you4,zi5 +释=shi4 +释卷=shi4,juan4 +释知遗形=shi4,shi4,yi2,xing2 +释迦牟尼=shi4,jia1,mu4,ni2 +釋=shi4 +里=li3 +里外夹攻=li3,wai4,jia1,gong1 +里头=li3,tou5 +里子=li3,zi5 +里应外合=li3,ying4,wai4,he2 +里弄=li3,long4 +里挑外撅=li3,tiao3,wai4,jue1 +里长=li3,zhang3 +重=zhong4,chong2 +重三叠四=chong2,san1,die2,si4 +重三迭四=chong2,san1,die2,si4 +重九=chong2,jiu3 +重九登高=chong2,jiu3,deng1,gao1 +重修=chong2,xiu1 +重修旧好=chong2,xiu1,jiu4,hao3 +重光=chong2,guang1 +重光累洽=chong2,guang1,lei4,qia4 +重关击柝=chong2,guan1,ji1,tuo4 +重兴旗鼓=chong2,xing1,qi2,gu3 +重出=chong2,chu1 +重创=zhong4,chuang1 +重印=chong2,yin4 +重叠=chong2,die2 +重合=chong2,he2 +重唱=chong2,chang4 +重围=chong2,wei2 +重圆=chong2,yuan2 +重圭叠组=chong2,gui1,die2,zu3 +重垣叠锁=chong2,yuan2,die2,suo3 +重垣迭锁=chong2,yuan2,die2,suo3 +重复=chong2,fu4 +重头=chong2,tou2 +重头戏=chong2,tou2,xi4 +重奏=chong2,zou4 +重婚=chong2,hun1 +重孙=chong2,sun1 +重孙女=chong2,sun1,nv3 +重审=chong2,shen3 +重山复岭=chong2,shan1,fu4,ling3 +重山复水=chong2,shan1,fu4,shui3 +重山峻岭=chong2,shan1,jun4,ling3 +重岩叠嶂=chong2,yan2,die2,zhang4 +重岩迭障=chong2,yan2,die2,zhang4 +重峦叠嶂=chong2,luan2,die2,zhang4 +重峦叠巘=chong2,luan2,die2,yan3 +重峦复嶂=chong2,luan2,fu4,zhang4 +重峦迭嶂=chong2,luan2,die2,zhang4 +重峦迭巘=chong2,luan2,die2,yan3 +重庆=chong2,qing4 +重床叠屋=chong2,chuang2,die2,wu1 +重床叠架=chong2,chuang2,die2,jia4 +重床迭屋=chong2,chuang2,die2,wu1 +重床迭架=chong2,chuang2,die2,jia4 +重建=chong2,jian4 +重弹=chong2,tan2 +重影=chong2,ying3 +重手累足=chong2,shou3,lei3,zu2 +重担=zhong4,dan4 +重拍=chong2,pai1 +重振旗鼓=chong2,zhen4,qi2,gu3 +重提=chong2,ti2 +重播=chong2,bo1 +重操旧业=chong2,cao1,jiu4,ye4 +重数=chong2,shu4 +重整旗鼓=chong2,zheng3,qi2,gu3 +重新=chong2,xin1 +重明继焰=chong2,ming2,ji4,yan4 +重映=chong2,ying4 +重气徇命=zhong4,qi4,xun4,ming2 +重沓=chong2,ta4 +重洋=chong2,yang2 +重温=chong2,wen1 +重温旧业=chong2,wen1,jiu4,ye4 +重温旧梦=chong2,wen1,jiu4,meng4 +重演=chong2,yan3 +重熙累叶=chong2,xi1,lei3,ye4 +重熙累洽=chong2,xi1,lei3,qia4 +重熙累盛=chong2,xi1,lei3,sheng4 +重熙累绩=chong2,xi1,lei3,ji4 +重版=chong2,ban3 +重犯=chong2,fan4 +重现=chong2,xian4 +重珪叠组=chong2,gui1,die2,zu3 +重珪迭组=chong2,gui1,die2,zu3 +重理旧业=chong2,li3,jiu4,ye4 
+重生父母=chong2,sheng1,fu4,mu3 +重生爷娘=chong2,sheng1,ye2,niang2 +重申=chong2,shen1 +重睹天日=chong2,du3,tian1,ri4 +重纰貤缪=chong2,pi1,yi2,miu4 +重纸累札=chong2,zhi3,lei4,zha2 +重组=chong2,zu3 +重聚=chong2,ju4 +重荷=zhong4,he4 +重获=chong2,huo4 +重葩累藻=chong2,pa1,lei4,zao3 +重行=chong2,xing2 +重裀列鼎=chong2,yin1,lie4,ding3 +重见天日=chong2,jian4,tian1,ri4 +重规叠矩=chong2,gui1,die2,ju3 +重规沓矩=chong2,gui1,ta4,ju3 +重规累矩=chong2,gui1,lei4,ju3 +重规袭矩=chong2,gui1,xi2,ju3 +重规迭矩=chong2,gui1,die2,ju3 +重译=chong2,yi4 +重起炉灶=chong2,qi3,lu2,zao4 +重足一迹=chong2,zu2,yi1,ji4 +重足屏息=chong2,zu2,bing3,xi1 +重足屏气=chong2,zu2,bing3,qi4 +重足累息=chong2,zu2,lei4,xi1 +重足而立=chong2,zu2,er2,li4 +重蹈覆辙=chong2,dao3,fu4,zhe2 +重载=chong2,zai3 +重返=chong2,fan3 +重述=chong2,shu4 +重迹屏气=chong2,ji4,bing3,qi4 +重逢=chong2,feng2 +重重=chong2,chong2 +重金兼紫=chong2,jin1,jian1,zi3 +重金袭汤=chong2,jin1,xi2,tang1 +重铬酸钾=chong2,ge4,suan1,jia3 +重门击柝=chong2,men2,ji1,tuo4 +重阳=chong2,yang2 +重霄=chong2,xiao1 +野=ye3 +野乘=ye3,sheng4 +野调无腔=ye3,diao4,wu2,qiang1 +量=liang4,liang2,liang5 +量体温=liang2,ti3,wen1 +量体裁衣=liang4,ti3,cai2,yi1 +量入为出=liang4,ru4,wei2,chu1 +量具=liang2,ju4 +量力度德=liang4,li4,duo2,de2 +量力而为=liang4,li4,er2,wei2 +量器=liang2,qi4 +量度=liang2,du4 +量才而为=liang4,cai2,er2,wei2 +量杯=liang2,bei1 +量瓶=liang2,ping2 +量程=liang2,cheng2 +量筒=liang2,tong3 +量血压=liang2,xue4,ya1 +量规=liang2,gui1 +量角器=liang2,jiao3,qi4 +量计=liang2,ji4 +釐=li2,xi3,xi1 +金=jin1 +金冠=jin1,guan1 +金刚钻=jin1,gang1,zuan4 +金匮石室=jin1,gui4,shi2,shi4 +金发=jin1,fa4 +金吾不禁=jin1,wu2,bu4,jin4 +金弹=jin1,dan4 +金晃晃=jin1,huang4,huang3 +金相=jin1,xiang4 +金相玉式=jin1,xiang1,yu4,shi4 +金相玉振=jin1,xiang1,yu4,zhen4 +金相玉映=jin1,xiang1,yu4,ying4 +金相玉质=jin1,xiang4,yu4,zhi4 +金翅擘海=jin1,chi4,bai1,hai3 +金苹果=jin1,ping2,guo3 +金蝉脱壳=jin1,chan2,tuo1,qiao4 +金谷酒数=jin1,gu3,jiu3,shu4 +金针见血=jin1,zhen1,jian4,xue4 +金阙=jin1,que4 +金风玉露=jin1,feng1,yu4,lu4 +釒=jin1 +釓=ga2 +釔=yi3 +釕=liao3,liao4 +釖=dao1 +釗=zhao1 +釘=ding1,ding4 +釙=po1 +釚=qiu2 +釛=he2 +釜=fu3 +針=zhen1 +釞=zhi2 +釟=ba1 +釠=luan4 +釡=fu3 +釢=nai3 +釣=diao4 +釤=shan1,shan4 +釥=qiao3,jiao3 +釦=kou4 +釧=chuan4 +釨=zi3 +釩=fan2 
+釪=hua2,yu2 +釫=hua2,wu1 +釬=han4 +釭=gang1 +釮=qi2 +釯=mang2 +釰=ri4,ren4,jian4 +釱=di4,dai4 +釲=si4 +釳=xi4 +釴=yi4 +釵=chai1 +釶=shi1,yi2 +釷=tu3 +釸=xi1 +釹=nv3 +釺=qian1 +釻=qiu2 +釼=ri4,ren4,jian4 +釽=pi1,zhao1 +釾=ye2,ya2 +釿=jin1 +鈀=ba3 +鈁=fang1 +鈂=chen2 +鈃=xing2 +鈄=dou3 +鈅=yue4 +鈆=qian1 +鈇=fu1 +鈈=bu4 +鈉=na4 +鈊=xin1 +鈋=e2 +鈌=jue2 +鈍=dun4 +鈎=gou1 +鈏=yin3 +鈐=qian2 +鈑=ban3 +鈒=sa4 +鈓=ren4 +鈔=chao1 +鈕=niu3 +鈖=fen1 +鈗=yun3 +鈘=yi3 +鈙=qin2 +鈚=pi1 +鈛=guo1 +鈜=hong2 +鈝=yin2 +鈞=jun1 +鈟=diao4 +鈠=yi4 +鈡=zhong1 +鈢=xi3 +鈣=gai4 +鈤=ri4 +鈥=huo3 +鈦=tai4 +鈧=kang4 +鈨=yuan2 +鈩=lu2 +鈪=e4 +鈫=qin2 +鈬=duo2 +鈭=zi1 +鈮=ni2 +鈯=tu2 +鈰=shi4 +鈱=min2 +鈲=gu1 +鈳=ke1 +鈴=ling2 +鈵=bing3 +鈶=si4 +鈷=gu3 +鈸=bo2 +鈹=pi2 +鈺=yu4 +鈻=si4 +鈼=zuo2 +鈽=bu1 +鈾=you2 +鈿=dian4 +鉀=jia3 +鉁=zhen1 +鉂=shi3 +鉃=shi4 +鉄=tie3 +鉅=ju4 +鉆=zuan1 +鉇=shi1 +鉈=ta1,tuo2 +鉉=xuan4 +鉊=zhao1 +鉋=bao4,pao2 +鉌=he2 +鉍=bi4 +鉎=sheng1 +鉏=chu2 +鉐=shi2 +鉑=bo2 +鉒=zhu4 +鉓=chi4 +鉔=za1 +鉕=po3 +鉖=tong2 +鉗=qian2 +鉘=fu2 +鉙=zhai3 +鉚=mao3 +鉛=qian1 +鉜=fu2 +鉝=li4 +鉞=yue4 +鉟=pi1 +鉠=yang1 +鉡=ban4 +鉢=bo1 +鉣=jie2 +鉤=gou1 +鉥=shu4 +鉦=zheng1 +鉧=mu3 +鉨=xi3 +鉩=xi3 +鉪=di4 +鉫=jia1 +鉬=mu4 +鉭=tan3 +鉮=shen2 +鉯=yi3 +鉰=si1 +鉱=kuang4 +鉲=ka3 +鉳=bei3 +鉴=jian4 +鉴影度形=jian4,ying3,duo2,xing2 +鉵=tong2 +鉶=xing2 +鉷=hong2 +鉸=jiao3 +鉹=chi3 +鉺=er3 +鉻=ge4 +鉼=bing3,ping2 +鉽=shi4 +鉾=mao2 +鉿=ha1,ke1 +銀=yin2 +銁=jun1 +銂=zhou1 +銃=chong4 +銄=xiang3,jiong1 +銅=tong2 +銆=mo4 +銇=lei4 +銈=ji1 +銉=yu4,si4 +銊=xu4,hui4 +銋=ren2,ren3 +銌=zun4 +銍=zhi4 +銎=qiong2 +銏=shan4,shuo4 +銐=chi4,li4 +銑=xian3,xi3 +銒=xing2 +銓=quan2 +銔=pi1 +銕=tie3 +銖=zhu1 +銗=hou2,xiang4 +銘=ming2 +銙=kua3 +銚=diao4,tiao2,yao2 +銛=xian1,kuo4,tian3,gua1 +銜=xian2 +銝=xiu1 +銞=jun1 +銟=cha1 +銠=lao3 +銡=ji2 +銢=pi3 +銣=ru2 +銤=mi3 +銥=yi1 +銦=yin1 +銧=guang1 +銨=an3 +銩=diu1 +銪=you3 +銫=se4 +銬=kao4 +銭=qian2 +銮=luan2 +銯=si1 +銰=ai1 +銱=diao4 +銲=han4 +銳=rui4 +銴=shi4,zhi4 +銵=keng1 +銶=qiu2 +銷=xiao1 +銸=zhe2,nie4 +銹=xiu4 +銺=zang4 +銻=ti1 +銼=cuo4 +銽=xian1,kuo4,tian3,gua1 +銾=hong4,gong3 +銿=zhong1,yong1 +鋀=tou1,tu4,dou4 +鋁=lv3 +鋂=mei2,meng2 +鋃=lang2 +鋄=wan4,jian3 +鋅=xin1 +鋆=yun2 +鋇=bei4 
+鋈=wu4 +鋉=su4 +鋊=yu4 +鋋=chan2 +鋌=ting3,ding4 +鋍=bo2 +鋎=han4 +鋏=jia2 +鋐=hong2 +鋑=juan1,jian1,cuan1 +鋒=feng1 +鋓=chan1 +鋔=wan3 +鋕=zhi4 +鋖=si1,tuo2 +鋗=xuan1,juan1,juan4 +鋘=hua2,wu2,wu1 +鋙=wu2 +鋚=tiao2 +鋛=kuang4 +鋜=zhuo2,chuo4 +鋝=lve4 +鋞=xing2,xing4,jing1 +鋟=qin3 +鋠=shen4 +鋡=han2 +鋢=lve4 +鋣=ye2 +鋤=chu2 +鋥=zeng4 +鋦=ju1,ju2 +鋧=xian4 +鋨=e2 +鋩=mang2 +鋪=pu1,pu4 +鋫=li2 +鋬=pan4 +鋭=rui4 +鋮=cheng2 +鋯=gao4 +鋰=li3 +鋱=te4 +鋲=bing1 +鋳=zhu4 +鋴=zhen4 +鋵=tu1 +鋶=liu3 +鋷=zui4,nie4 +鋸=ju4,ju1 +鋹=chang3 +鋺=yuan3,yuan1,wan3,wan1 +鋻=jian1,jian4 +鋼=gang1,gang4 +鋽=diao4 +鋾=tao2 +鋿=shang3 +錀=lun2 +錁=ke4 +錂=ling2 +錃=pi1 +錄=lu4 +錅=li2 +錆=qing1 +錇=pei2 +錈=juan3 +錉=min2 +錊=zui4 +錋=peng2 +錌=an4 +錍=pi1 +錎=xian4 +錏=ya1 +錐=zhui1 +錑=lei4 +錒=a1 +錓=kong1 +錔=ta4 +錕=kun1 +錖=du2 +錗=nei4 +錘=chui2 +錙=zi1 +錚=zheng1 +錛=ben1 +錜=nie4 +錝=cong2 +錞=chun2 +錟=tan2 +錠=ding4 +錡=qi2 +錢=qian2 +錣=zhui4 +錤=ji1 +錥=yu4 +錦=jin3 +錧=guan3 +錨=mao2 +錩=chang1 +錪=tian3 +錫=xi1 +錬=lian4 +錭=diao1 +錮=gu4 +錯=cuo4 +錰=shu4 +錱=zhen1 +録=lu4 +錳=meng3 +錴=lu4 +錵=hua1 +錶=biao3 +錷=ga2 +錸=lai2 +錹=ken3 +錺=fang1 +錻=bu1 +錼=nai4 +錽=wan4 +錾=zan4 +錾子=zan4,zi5 +錿=hu3 +鍀=de2 +鍁=xian1 +鍂=pian1 +鍃=huo4 +鍄=liang4 +鍅=fa3 +鍆=men2 +鍇=kai3 +鍈=yang1 +鍉=chi2 +鍊=lian4 +鍋=guo1 +鍌=xian3 +鍍=du4 +鍎=tu2 +鍏=wei2 +鍐=zong1 +鍑=fu4 +鍒=rou2 +鍓=ji2 +鍔=e4 +鍕=jun1 +鍖=chen3 +鍗=ti2 +鍘=zha2 +鍙=hu4 +鍚=yang2 +鍛=duan4 +鍜=xia2 +鍝=yu2 +鍞=keng1 +鍟=sheng1 +鍠=huang2 +鍡=wei3 +鍢=fu4 +鍣=zhao1 +鍤=cha1 +鍥=qie4 +鍦=shi1 +鍧=hong1 +鍨=kui2 +鍩=nuo4 +鍪=mou2 +鍫=qiao1 +鍬=qiao1 +鍭=hou2 +鍮=tou1 +鍯=cong1 +鍰=huan2 +鍱=ye4 +鍲=min2 +鍳=jian4 +鍴=duan1 +鍵=jian4 +鍶=si1 +鍷=kui2 +鍸=hu2 +鍹=xuan1 +鍺=zhe3 +鍻=jie2 +鍼=zhen1 +鍽=bian1 +鍾=zhong1 +鍿=zi1 +鎀=xiu1 +鎁=ye2 +鎂=mei3 +鎃=pai4 +鎄=ai1 +鎅=jie4 +鎆=qian2 +鎇=mei2 +鎈=cuo1,cha1 +鎉=da1,ta4 +鎊=bang4 +鎋=xia2 +鎌=lian2 +鎍=suo3,se4 +鎎=kai4 +鎏=liu2 +鎐=yao2,zu2 +鎑=ye4,ta4,ge2 +鎒=nou4 +鎓=weng1 +鎔=rong2 +鎕=tang2 +鎖=suo3 +鎗=qiang1,cheng1 +鎘=ge2,li4 +鎙=shuo4 +鎚=chui2 +鎛=bo2 +鎜=pan2 +鎝=da1 +鎞=bi1,bi4,pi1 +鎟=sang3 +鎠=gang1 +鎡=zi1 +鎢=wu1 +鎣=ying2 +鎤=huang4 +鎥=tiao2 +鎦=liu2,liu4 +鎧=kai3 
+鎨=sun3 +鎩=sha1 +鎪=sou1 +鎫=wan4,jian3 +鎬=gao3,hao4 +鎭=zhen4 +鎮=zhen4 +鎯=lang2 +鎰=yi4 +鎱=yuan2 +鎲=tang3 +鎳=nie4 +鎴=xi2 +鎵=jia1 +鎶=ge1 +鎷=ma3 +鎸=juan1 +鎹=song4 +鎺=zu3 +鎻=suo3 +鎼=xia4 +鎽=feng1 +鎾=wen1 +鎿=na2 +鏀=lu3 +鏁=suo3 +鏂=ou1 +鏃=zu2,chuo4 +鏄=tuan2 +鏅=xiu1,xiu4 +鏆=guan4 +鏇=xuan4 +鏈=lian4 +鏉=shou4,sou1 +鏊=ao4 +鏋=man3 +鏌=mo4 +鏍=luo2 +鏎=bi4 +鏏=wei4 +鏐=liu2 +鏑=di2,di1 +鏒=san3,qiao1,can4 +鏓=cong1 +鏔=yi2 +鏕=lu4,ao2 +鏖=ao2 +鏗=keng1 +鏘=qiang1 +鏙=cui1 +鏚=qi1 +鏛=shang3 +鏜=tang1,tang2 +鏝=man4 +鏞=yong1 +鏟=chan3 +鏠=feng1 +鏡=jing4 +鏢=biao1 +鏣=shu4 +鏤=lou4 +鏥=xiu4 +鏦=cong1 +鏧=long2 +鏨=zan4 +鏩=jian4,zan4 +鏪=cao2 +鏫=li2 +鏬=xia4 +鏭=xi1 +鏮=kang1 +鏯=shuang3 +鏰=beng4 +鏱=zhang1 +鏲=qian1 +鏳=zheng1 +鏴=lu4 +鏵=hua2 +鏶=ji2 +鏷=pu2 +鏸=hui4,sui4,rui4 +鏹=qiang3,qiang1 +鏺=po1 +鏻=lin2 +鏼=se4 +鏽=xiu4 +鏾=san3,xian4,sa4 +鏿=cheng1 +鐀=gui4 +鐁=si1 +鐂=liu2 +鐃=nao2 +鐄=huang2 +鐅=pie3 +鐆=sui4 +鐇=fan2 +鐈=qiao2 +鐉=quan1 +鐊=xi1 +鐋=tang4 +鐌=xiang4 +鐍=jue2 +鐎=jiao1 +鐏=zun1 +鐐=liao4 +鐑=qi4 +鐒=lao2 +鐓=dui1 +鐔=xin2 +鐕=zan1 +鐖=ji1 +鐗=jian3 +鐘=zhong1 +鐙=deng4 +鐚=ya1 +鐛=ying3 +鐜=dui1 +鐝=jue2 +鐞=nou4 +鐟=zan1 +鐠=pu3 +鐡=tie3 +鐢=fan2 +鐣=cheng1 +鐤=ding3 +鐥=shan4 +鐦=kai1 +鐧=jian3 +鐨=fei4 +鐩=sui4 +鐪=lu3 +鐫=juan1 +鐬=hui4 +鐭=yu4 +鐮=lian2 +鐯=zhuo1 +鐰=qiao1 +鐱=jian4 +鐲=zhuo2 +鐳=lei2 +鐴=bi4 +鐵=tie3 +鐶=huan2 +鐷=ye4 +鐸=duo2 +鐹=guo4 +鐺=dang1,cheng1 +鐻=ju4 +鐼=fen2 +鐽=da2 +鐾=bei4 +鐿=yi4 +鑀=ai4 +鑁=zong1 +鑂=xun4 +鑃=diao4 +鑄=zhu4 +鑅=heng2 +鑆=zhui4 +鑇=ji1 +鑈=nie4 +鑉=he2 +鑊=huo4 +鑋=qing1 +鑌=bin1 +鑍=ying1 +鑎=gui4 +鑏=ning2 +鑐=xu1 +鑑=jian4 +鑒=jian4 +鑓=qian3 +鑔=cha3 +鑕=zhi4 +鑖=mie4 +鑗=li2 +鑘=lei2 +鑙=ji1 +鑚=zuan1 +鑛=kuang4 +鑜=shang3 +鑝=peng2 +鑞=la4 +鑟=du2 +鑠=shuo4 +鑡=chuo4 +鑢=lv4 +鑣=biao1 +鑤=bao4 +鑥=lu3 +鑦=xian2 +鑧=kuan1 +鑨=long2 +鑩=e4 +鑪=lu2 +鑫=xin1 +鑬=jian4 +鑭=lan2 +鑮=bo2 +鑯=jian1 +鑰=yue4 +鑱=chan2 +鑲=xiang1 +鑳=jian4 +鑴=xi1 +鑵=guan4 +鑶=cang2 +鑷=nie4 +鑸=lei3 +鑹=cuan1 +鑺=qu2 +鑻=pan4 +鑼=luo2 +鑽=zuan1 +鑾=luan2 +鑿=zao2 +钀=nie4 +钁=jue2 +钂=tang3 +钃=zhu2 +钄=lan4 +钅=jin1 +钆=ga2 +钇=yi3 +针=zhen1 +针头削铁=zhen1,tou2,xue1,tie3 +钉=ding1,ding4 
+钉书机=ding4,shu1,ji1 +钉书针=ding4,shu1,zhen1 +钉住=ding4,zhu4 +钉头磷磷=ding1,tou2,lin2,lin2 +钉子=ding1,zi5 +钉子户=ding1,zi5,hu4 +钉扣子=ding4,kou4,zi5 +钉箱子=ding4,xiang1,zi5 +钉耙=ding1,pa2 +钉钉子=ding4,ding1,zi5 +钉钮扣=ding4,niu3,kou4 +钉马掌=ding4,ma3,zhang3 +钊=zhao1 +钋=po1 +钌=liao3,liao4 +钍=tu3 +钎=qian1 +钏=chuan4 +钐=shan1,shan4 +钑=sa4,xi4 +钒=fan2 +钓=diao4 +钓钩=diao4,gou1 +钓鱼钩=diao4,yu2,gou1 +钔=men2 +钕=nv3 +钖=yang2 +钗=chai1 +钘=xing2 +钙=gai4 +钚=bu4 +钛=tai4 +钜=ju4 +钜细靡遗=ju4,xi4,mi3,yi2 +钝=dun4 +钞=chao1 +钟=zhong1 +钟鼎人家=zhong1,ding3,ren2,jia5 +钠=na4 +钡=bei4 +钢=gang1,gang4 +钢筋铁骨=gang1,jin1,tie3,gu3 +钢镚儿=gang1,beng4,er5 +钣=ban3 +钤=qian2 +钥=yue4,yao4 +钥匙=yao4,shi5 +钦=qin1 +钦天监=qin1,tian1,jian4 +钦差=qin1,chai1 +钧=jun1 +钧天广乐=jun1,tian1,guang3,yue4 +钨=wu1 +钩=gou1 +钩子=gou1,zi5 +钪=kang4 +钫=fang1 +钬=huo3 +钭=dou3 +钮=niu3 +钯=ba3,pa2 +钰=yu4 +钱=qian2 +钱夹=qian2,jia1 +钲=zheng1,zheng4 +钳=qian2 +钴=gu3 +钵=bo1 +钶=ke1 +钷=po3 +钸=bu1 +钹=bo2 +钺=yue4 +钻=zuan1,zuan4 +钻井=zuan4,jing3 +钻井船=zuan4,jing3,chuan2 +钻具=zuan4,ju4 +钻冰取火=zuan4,bing1,qu3,huo3 +钻台=zuan4,tai2 +钻坚仰高=zuan4,jian1,yang3,gao1 +钻坚研微=zuan4,jian1,yan2,wei1 +钻塔=zuan4,ta3 +钻天觅缝=zuan4,tian1,mi4,feng2 +钻头=zuan4,tou2 +钻头就锁=zuan4,tou2,jiu4,suo3 +钻头觅缝=zuan1,tou2,mi4,feng4 +钻孔=zuan1,kong3 +钻山塞海=zuan4,shan1,sai1,hai3 +钻床=zuan4,chuang2 +钻心=zuan1,xin1 +钻心刺骨=zuan4,xin1,ci4,gu3 +钻戒=zuan4,jie4 +钻探=zuan1,tan4 +钻故纸堆=zuan4,gu4,zhi3,dui1 +钻木取火=zuan1,mu4,qu3,huo3 +钻机=zuan4,ji1 +钻杆=zuan4,gan3 +钻洞觅缝=zuan4,dong4,mi4,feng2 +钻牛角=zuan4,niu2,jiao3 +钻牛角尖=zuan1,niu2,jiao3,jian1 +钻石=zuan4,shi2 +钻穴逾垣=zuan4,xue2,yu2,yuan2 +钻空=zuan1,kong4 +钻空子=zuan1,kong4,zi5 +钼=mu4 +钽=tan3 +钾=jia3 +钿=dian4,tian2 +铀=you2 +铁=tie3 +铁叉子=tie3,cha1,zi5 +铁杆=tie3,gan3 +铁板钉钉=tie3,ban3,ding4,ding1 +铁树开华=tie3,shu4,kai1,hua1 +铁椎=tie3,chui2 +铁绰铜琶=tie3,chuo1,tong2,pa2 +铁耙=tie3,pa2 +铁血=tie3,xue4 +铁血政策=tie3,xue4,zheng4,ce4 +铂=bo2 +铃=ling2 +铃铛=ling2,dang1 +铄=shuo4 +铅=qian1,yan2 +铅山=yan2,shan1 +铅弹=qian1,dan4 +铆=mao3 +铇=bao4,pao2 +铈=shi4 +铉=xuan4 +铊=ta1,tuo2 +铋=bi4 +铌=ni2 +铍=pi2,pi1 +铎=duo2 +铏=xing2 +铐=kao4 +铑=lao3 +铒=er3
+铓=mang2 +铔=ya1,ya4 +铕=you3 +铖=cheng2 +铗=jia2 +铘=ye2 +铙=nao2 +铚=zhi4 +铛=dang1,cheng1 +铜=tong2 +铜模=tong2,mu2 +铜臭=tong2,xiu4 +铜臭熏天=tong2,chou4,xun1,tian1 +铝=lv3 +铞=diao4 +铟=yin1 +铠=kai3 +铡=zha2 +铢=zhu1 +铢两悉称=zhu1,liang3,xi1,chen4 +铢寸累积=zhu1,cun4,lei4,ji1 +铢积丝累=zhu1,ji1,si1,lei4 +铢积寸累=zhu1,ji1,cun4,lei3 +铢积锱累=zhu1,ji1,zi1,lei4 +铢称寸量=zhu1,cheng1,cun4,liang2 +铢累寸积=zhu1,lei4,cun4,ji1 +铢量寸度=zhu1,liang2,cun4,duo2 +铣=xian3,xi3 +铤=ting3,ding4 +铤鹿走险=ting3,lu4,zou3,xian3 +铥=diu1 +铦=xian1,kuo4,tian3,gua1 +铧=hua2 +铨=quan2 +铩=sha1 +铪=ha1,ke1 +铫=diao4,tiao2,yao2 +铬=ge4 +铭=ming2 +铮=zheng1 +铯=se4 +铰=jiao3 +铱=yi1 +铲=chan3 +铲子=chan3,zi5 +铳=chong4 +铴=tang4,tang1 +铵=an3 +银=yin2 +银子=yin2,zi5 +银河倒泻=yin2,he2,dao4,xie4 +银行=yin2,hang2 +银行行员=yin2,hang2,hang2,yuan2 +银镪=yin2,qiang3 +铷=ru2 +铸=zhu4 +铸剑为犁=zhu4,jian4,wei2,li2 +铹=lao2 +铺=pu1,pu4 +铺位=pu4,wei4 +铺保=pu4,bao3 +铺子=pu4,zi5 +铺底=pu4,di3 +铺户=pu4,hu4 +铺板=pu4,ban3 +铺盖=pu4,gai4 +铺盖卷=pu4,gai4,juan3 +铺盖卷儿=pu1,gai4,juan3,er2 +铺眉苫眼=pu1,mei2,shan4,yan3 +铺眉蒙眼=pu1,mei2,meng2,yan3 +铺谋定计=pu4,mou2,ding4,ji4 +铺采摛文=pu4,cai3,chi1,wen2 +铺铺=pu4,pu4 +铻=wu2 +铼=lai2 +铽=te4 +链=lian4 +链子=lian4,zi5 +链式反应=lian4,shi4,fan3,ying4 +铿=keng1 +销=xiao1 +销假=xiao1,jia4 +锁=suo3 +锁匙=suo3,shi5 +锁头=suo3,tou5 +锁钥=suo3,yue4 +锂=li3 +锃=zeng4 +锄=chu2 +锄头=chu2,tou5 +锅=guo1 +锅子=guo1,zi5 +锅炉给水=guo1,lu2,ji3,shui3 +锆=gao4 +锇=e2 +锈=xiu4 +锉=cuo4 +锊=lve4 +锋=feng1 +锋芒不露=feng1,mang2,bu4,lu4 +锋芒毕露=feng1,mang2,bi4,lu4 +锋铓毕露=feng1,mang2,bi4,lu4 +锌=xin1 +锍=liu3 +锎=kai1 +锏=jian3 +锐=rui4 +锑=ti1 +锒=lang2 +锓=qin3 +锔=ju1,ju2 +锕=a1 +锖=qiang1 +锗=zhe3 +锘=nuo4 +错=cuo4 +错处=cuo4,chu3 +锚=mao2 +锛=ben1 +锜=qi2 +锝=de2 +锞=ke4 +锟=kun1 +锠=chang1 +锡=xi1 +锢=gu4 +锣=luo2 +锤=chui2 +锤子=chui2,zi5 +锥=zhui1 +锥处囊中=zhui1,chu3,nang2,zhong1 +锦=jin3 +锦囊还矢=jin3,nang2,huan2,shi3 +锧=zhi4 +锨=xian1 +锩=juan3 +锪=huo4 +锫=pei2 +锬=tan2 +锭=ding4 +锭子=ding4,zi5 +键=jian4 +键盘乐器=jian4,pan2,yue4,qi4 +锯=ju4 +锰=meng3 +锱=zi1 +锲=qie4 +锲而不舍=qie4,er2,bu4,she3 +锳=ying1 +锴=kai3 +锵=qiang1 +锶=si1 +锷=e4 +锸=cha1 +锹=qiao1 +锺=zhong1 +锻=duan4 +锻模=duan4,mu2 +锼=sou1
+锽=huang2 +锾=huan2 +锿=ai1 +镀=du4 +镁=mei3 +镂=lou4 +镂金铺翠=lou4,jin1,pu1,cui4 +镃=zi1 +镄=fei4 +镅=mei2 +镆=mo4 +镇=zhen4 +镈=bo2 +镉=ge2,li4 +镊=nie4 +镊子=nie4,zi5 +镋=tang3 +镌=juan1 +镍=nie4 +镎=na2 +镏=liu2 +镐=gao3,hao4 +镐京=hao4,jing1 +镐头=gao3,tou5 +镑=bang4 +镒=yi4 +镓=jia1 +镔=bin1 +镕=rong2 +镖=biao1 +镗=tang1 +镘=man4 +镙=luo2 +镚=beng4 +镚子=beng4,zi5 +镛=yong1 +镜=jing4 +镜子=jing4,zi5 +镝=di2 +镞=zu2 +镟=xuan4 +镠=liu2 +镡=xin2 +镢=jue2 +镣=liao4 +镤=pu2 +镥=lu3 +镦=dui1 +镧=lan2 +镨=pu3 +镩=cuan1 +镪=qiang3 +镫=deng4 +镫子=deng4,zi5 +镬=huo4 +镭=lei2 +镮=huan2 +镯=zhuo2 +镯子=zhuo2,zi5 +镰=lian2 +镱=yi4 +镲=cha3 +镳=biao1 +镴=la4 +镵=chan2 +镶=xiang1 +長=chang2,zhang3 +镸=chang2 +镹=jiu3 +镺=ao3 +镻=die2 +镼=jie2 +镽=liao3 +镾=mi2 +长=chang2,zhang3 +长上=zhang3,shang4 +长亲=zhang3,qin1 +长假=chang2,jia4 +长傲饰非=zhang3,ao4,shi4,fei1 +长兄=zhang3,xiong1 +长兴=chang2,xing1 +长出=zhang3,chu1 +长势=zhang3,shi4 +长卷=chang2,juan4 +长发=chang2,fa4 +长吁短叹=chang2,xu1,duan3,tan4 +长吏=zhang3,li4 +长处=chang2,chu4 +长大=zhang3,da4 +长子=zhang3,zi3 +长孙=zhang3,sun1 +长官=zhang3,guan1 +长年三老=zhang3,nian2,san1,lao3 +长年累月=chang2,nian2,lei3,yue4 +长幼=zhang3,you4 +长恶靡悛=chang2,e4,mi3,quan1 +长房=zhang3,fang2 +长春不老=chang2,chun1,bu4,lao3 +长歌当哭=chang2,ge1,dang4,ku1 +长毛=zhang3,mao2 +长毛绒=chang2,mao2,rong2 +长满荒草=zhang3,man3,huang1,cao3 +长牙=zhang3,ya2 +长物=zhang4,wu4 +长生不死=chang2,sheng1,bu4,si3 +长相=zhang3,xiang4 +长相寒碜=zhang3,xiang4,han2,chen5 +长篇累牍=chang2,pian1,lei3,du2 +长绳系日=chang2,sheng2,ji4,ri4 +长老=zhang3,lao3 +长者=zhang3,zhe3 +长膘=zhang3,biao1 +长虺成蛇=zhang3,hui3,cheng2,she2 +长调=chang2,diao4 +长辈=zhang3,bei4 +长进=zhang3,jin4 +长鸣都尉=chang2,ming2,du1,wei4 +門=men2 +閁=ma4 +閂=shuan1 +閃=shan3 +閄=huo4,shan3 +閅=men2 +閆=yan2 +閇=bi4 +閈=han4,bi4 +閉=bi4 +閊=shan1 +開=kai1 +閌=kang1,kang4 +閍=beng1 +閎=hong2 +閏=run4 +閐=san4 +閑=xian2 +閒=xian2,jian1,jian4 +間=jian1,jian4 +閔=min3 +閕=xia1,xia3 +閖=shui3 +閗=dou4 +閘=zha2 +閙=nao4 +閚=zhan1 +閛=peng1,peng4 +閜=xia3,ke3 +閝=ling2 +閞=bian4,guan1 +閟=bi4 +閠=run4 +閡=he2 +関=guan1 +閣=ge2 +閤=he2,ge2 +閥=fa2 +閦=chu4 +閧=hong4,xiang4 +閨=gui1 +閩=min3 +閪=se1,xi1 +閫=kun3 +閬=lang4
+閭=lv2 +閮=ting2,ting3 +閯=sha4 +閰=ju2 +閱=yue4 +閲=yue4 +閳=chan3 +閴=qu4 +閵=lin4 +閶=chang1 +閷=sha1 +閸=kun3 +閹=yan1 +閺=wen2 +閻=yan2 +閼=e4,yan1 +閽=hun1 +閾=yu4 +閿=wen2 +闀=hong4 +闁=bao1 +闂=hong4,juan3,xiang4 +闃=qu4 +闄=yao3 +闅=wen2 +闆=ban3,pan4 +闇=an4 +闈=wei2 +闉=yin1 +闊=kuo4 +闋=que4 +闌=lan2 +闍=du1,she2 +闎=quan2 +闐=tian2 +闑=nie4 +闒=ta4 +闓=kai3 +闔=he2 +闕=que4,que1 +闖=chuang3 +闗=guan1 +闘=dou4 +闙=qi3 +闚=kui1 +闛=tang2,tang1,chang1 +關=guan1 +闝=piao2 +闞=kan4,han3 +闟=xi4,se4,ta4 +闠=hui4 +闡=chan3 +闢=pi4 +闣=dang1,dang4 +闤=huan2 +闥=ta4 +闦=wen2 +闧=ta1 +门=men2 +门单户薄=men2,dan1,hu4,bo2 +门把=men2,ba4 +门斗=men2,dou3 +门槛=men2,kan3 +门禁=men2,jin4 +门禁森严=men2,jin4,sen1,yan2 +门缝=men2,feng4 +闩=shuan1 +闪=shan3 +闫=yan2 +闬=han4,bi4 +闭=bi4 +闭卷=bi4,juan4 +闭合思过=bi4,he2,si1,guo4 +闭合自责=bi4,he2,zi4,ze2 +闭塞=bi4,se4 +闭塞眼睛捉麻雀=bi4,se4,yan3,jing1,zhuo1,ma2,que4 +闭明塞聪=bi4,ming2,se4,cong1 +闭目塞听=bi4,mu4,se4,ting1 +闭目塞耳=bi4,mu4,se4,er3 +闭门合辙=bi4,men2,he2,zhe2 +闭门塞户=bi4,men2,se4,hu4 +闭门塞窦=bi4,men2,se4,dou4 +闭门思愆=bi4,men2,si1,qian1 +闭门扫迹=bi4,men2,sao3,ji4 +闭门造车=bi4,men2,zao4,che1 +问=wen4 +问难=wen4,nan4 +闯=chuang3 +闯将=chuang3,jiang4 +闰=run4 +闱=wei2 +闲=xian2 +闲散=xian2,san3 +闲空=xian2,kong4 +闳=hong2 +间=jian1,jian4 +间不容发=jian1,bu4,rong2,fa4 +间不容息=jian4,bu4,rong2,xi1 +间不容瞚=jian1,bu4,rong2,shun4 +间作=jian4,zuo4 +间奏=jian4,zou4 +间奏曲=jian4,zou4,qu3 +间或=jian4,huo4 +间接=jian4,jie1 +间断=jian4,duan4 +间日=jian4,ri4 +间歇=jian4,xie1 +间种=jian4,zhong4 +间色=jian4,se4 +间苗=jian4,miao2 +间见层出=jian4,xian4,ceng2,chu1 +间谍=jian4,die2 +间距=jian4,ju4 +间道=jian4,dao4 +间隔=jian4,ge2 +间隙=jian4,xi4 +闵=min3 +闶=kang4,kang1 +闷=men4,men1 +闷光玻璃=men4,guang1,bo1,li5 +闷在家里=men1,zai4,jia1,li3 +闷在心里=men1,zai4,xin1,li3 +闷声=men1,sheng1 +闷声不吭=men4,sheng1,bu4,keng1 +闷声不响=men1,sheng1,bu4,xiang3 +闷声闷气=men1,sheng1,men1,qi4 +闷头=men1,tou2 +闷头儿=men1,tou2,er5 +闷头儿干=men1,tou2,er5,gan4 +闷头儿干活=men1,tou2,er5,gan4,huo2 +闷子车=men4,zi5,che1 +闷得发慌=men4,de5,fa1,huang1 +闷气=men1,qi4 +闷沉沉=men1,chen2,chen2 +闷热=men1,re4 +闷葫芦=men4,hu2,lu5 +闷葫芦罐=men4,hu2,lu5,guan4 +闷锄=men1,chu2 +闷闷不乐=men4,men4,bu4,le4
+闷雷=men1,lei2 +闸=zha2 +闹=nao4 +闹乱子=nao4,luan4,zi5 +闹哄哄=nao4,hong1,hong1 +闹嚷=nao4,rang1 +闹嚷嚷=nao4,rang1,rang1 +闹着玩=nao4,zhe5,wan2 +闹着玩儿=nao4,zhe5,wan2,er5 +闹肚子=nao4,du4,zi5 +闺=gui1 +闻=wen2 +闻风响应=wen2,feng1,xiang3,ying4 +闻风而兴=wen2,feng1,er2,xing1 +闼=ta4 +闽=min3 +闽侯=min3,hou4 +闾=lv2 +闿=kai3 +阀=fa2 +阁=ge2 +阂=he2 +阃=kun3 +阄=jiu1 +阅=yue4 +阅卷=yue4,juan4 +阆=lang4 +阇=du1,she2 +阈=yu4 +阉=yan1 +阊=chang1 +阋=xi4 +阌=wen2 +阍=hun1 +阎=yan2 +阏=e4 +阐=chan3 +阑=lan2 +阑风长雨=lan2,feng1,zhang4,yu3 +阒=qu4 +阓=hui4 +阔=kuo4 +阔少=kuo4,shao4 +阕=que4 +阖=he2 +阗=tian2 +阘=ta4 +阙=que1,que4 +阙一不可=que1,yi1,bu4,ke3 +阙下=que4,xia4 +阚=kan4 +阛=huan2 +阜=fu4 +阝=fu3 +阞=le4 +队=dui4 +队长=dui4,zhang3 +阠=xin4 +阡=qian1 +阢=wu4 +阣=yi4 +阤=tuo2 +阥=yin1 +阦=yang2 +阧=dou3 +阨=e4 +阩=sheng1 +阪=ban3 +阫=pei2 +阬=keng1 +阭=yun3 +阮=ruan3 +阯=zhi3 +阰=pi2 +阱=jing3 +防=fang2 +防不胜防=fang2,bu4,sheng4,fang2 +防弹=fang2,dan4 +阳=yang2 +阴=yin1 +阴差阳错=yin1,cha1,yang2,cuo4 +阴干=yin1,gan1 +阴着儿=yin1,zhao1,er2 +阴错阳差=yin1,cuo4,yang2,cha1 +阴魂不散=yin1,hun2,bu4,san4 +阵=zhen4 +阵子=zhen4,zi5 +阶=jie1 +阷=cheng1 +阸=e4 +阹=qu1 +阺=di3 +阻=zu3 +阻塞=zu3,se4 +阻难=zu3,nan4 +阼=zuo4 +阽=dian4 +阾=lin2 +阿=a1,e1 +阿世取容=e1,shi4,qu3,rong2 +阿世媚俗=e1,shi4,mei4,su2 +阿世盗名=e1,shi4,dao4,ming2 +阿党比周=e1,dang3,bi3,zhou1 +阿党相为=e1,dang3,xiang1,wei2 +阿其所好=e1,qi2,suo3,hao4 +阿姨=a1,yi2 +阿家阿翁=a1,gu1,a1,weng1 +阿弥陀佛=e1,mi2,tuo2,fo2 +阿意取容=e1,yi4,qu3,rong2 +阿房宫=e1,pang2,gong1 +阿斗=a1,dou3 +阿时趋俗=e1,shi2,qu1,su2 +阿木林=a1,mu4,lin2 +阿的平=a1,di4,ping2 +阿罗汉=a1,luo2,han4 +阿胶=e1,jiao1 +阿蒙=a1,meng2 +阿訇=a1,hong1 +阿谀=e1,yu2 +阿门=a1,men2 +阿附=e1,fu4 +陀=tuo2 +陀螺=tuo2,luo2 +陁=tuo2 +陂=bei1,pi2,po1 +陂塘=bei1,tang2 +陂池=bei1,chi2 +陂湖禀量=bei1,hu2,bing3,liang2 +陂陀=po1,tuo2 +陃=bing3 +附=fu4 +附和=fu4,he4 +附着=fu4,zhuo2 +附识=fu4,zhi4 +附载=fu4,zai3 +际=ji4 +陆=lu4 +陆万=liu4,wan4 +陆仟=liu4,qian1 +陆佰=liu4,bai3 +陆圆=liu4,yuan2 +陆拾=liu4,shi2 +陇=long3 +陈=chen2 +陈词滥调=chen2,ci2,lan4,diao4 +陈辞滥调=chen2,ci2,lan4,diao4 +陉=xing2 +陊=duo4 +陋=lou4 +陌=mo4 +降=jiang4,xiang2 +降伏=xiang2,fu2 +降妖捉怪=xiang2,yao1,zhuo1,guai4 +降服=xiang2,fu2 +降调=jiang4,diao4
+降龙=xiang2,long2 +降龙伏虎=xiang2,long2,fu2,hu3 +陎=shu1 +陏=duo4 +限=xian4 +陑=er2 +陒=gui3 +陓=yu1 +陔=gai1 +陕=shan3 +陖=jun4 +陗=qiao4 +陘=xing2 +陙=chun2 +陚=wu3 +陛=bi4 +陜=xia2 +陝=shan3 +陞=sheng1 +陟=zhi4 +陟罚臧否=zhi4,fa2,zang1,pi3 +陠=pu1 +陡=dou3 +院=yuan4 +院子=yuan4,zi5 +院长=yuan4,zhang3 +陣=zhen4 +除=chu2 +除害兴利=chu2,hai4,xing1,li4 +除患兴利=chu2,huan4,xing1,li4 +除数=chu2,shu4 +陥=xian4 +陦=dao3 +陧=nie4 +陨=yun3 +险=xian3 +陪=pei2 +陪都=pei2,du1 +陫=fei4 +陬=zou1 +陭=qi2 +陮=dui4 +陯=lun2 +陰=yin1 +陱=ju1 +陲=chui2 +陳=chen2 +陴=pi1 +陵=ling2 +陵劲淬砺=ling2,jing4,cui4,li4 +陶=tao2 +陷=xian4 +陸=lu4 +陹=sheng1 +険=xian3 +陻=yin1 +陼=zhu3 +陽=yang2 +陾=reng2 +陿=xia2 +隀=chong2 +隁=yan4,yan3 +隂=yin1 +隃=yu2,yao2,shu4 +隄=di1 +隅=yu2 +隆=long2 +隆重庆祝=long2,zhong4,qing4,zhu4 +隇=wei1 +隈=wei1 +隉=nie4 +隊=dui4,zhui4 +隋=sui2,duo4 +隌=an4 +隍=huang2 +階=jie1 +随=sui2 +随声附和=sui2,sheng1,fu4,he4 +随大溜=sui2,da4,liu4 +随机应变=sui2,ji1,ying4,bian4 +随行就市=sui2,hang2,jiu4,shi4 +随风而靡=sui2,feng1,er2,mi3 +隐=yin3,yin4 +隐占身体=yin3,zhan4,shen1,ti3 +隑=qi2,gai1,ai2 +隒=yan3 +隓=hui1,duo4 +隔=ge2 +隔扇=ge2,shan4 +隔行=ge2,hang2 +隔行如隔山=ge2,hang2,ru2,ge2,shan1 +隕=yun3 +隖=wu4 +隗=wei3,kui2 +隘=ai4 +隙=xi4 +隙缝=xi4,feng4 +隚=tang2 +際=ji4 +障=zhang4 +隝=dao3 +隞=ao2 +隟=xi4 +隠=yin3,yin4 +隡=sa4 +隢=rao3 +隣=lin2 +隤=tui2 +隥=deng4 +隦=pi2 +隧=sui4 +隨=sui2 +隩=ao4,yu4 +險=xian3 +隫=fen2 +隬=ni3 +隭=er2 +隮=ji1 +隯=dao3 +隰=xi2 +隱=yin3,yin4 +隲=zhi4 +隳=hui1,duo4 +隴=long3 +隵=xi1 +隶=li4 +隷=li4 +隸=li4 +隹=zhui1,cui1,wei2 +隺=hu2,he4 +隻=zhi1 +隼=sun3 +隽=jun4,juan4 +隽永=juan4,yong3 +隽语=juan4,yu3 +难=nan2,nan4,nuo2 +难为=nan2,wei2 +难为情=nan2,wei2,qing2 +难乎为情=nan2,hu1,wei2,qing2 +难乎为继=nan2,hu1,wei2,ji4 +难侨=nan4,qiao2 +难兄难弟=nan4,xiong1,nan4,di4 +难割难舍=nan2,ge1,nan2,she3 +难友=nan4,you3 +难处=nan2,chu3 +难属=nan4,shu3 +难得=nan2,de5 +难得糊涂=nan2,de2,hu2,tu2 +难更仆数=nan2,geng1,pu2,shu3 +难民=nan4,min2 +难熬=nan2,ao2 +难胞=nan4,bao1 +难解难分=nan2,jie3,nan2,fen1 +难进易退=nan2,jin4,yi4,tui4 +隿=yi4 +雀=que4,qiao1,qiao3 +雀子=qiao1,zi5 +雀屏中选=que4,ping2,zhong4,xuan3 +雁=yan4 +雁行=yan4,hang2 +雂=qin2 +雃=jian1 +雄=xiong2 +雄劲=xiong2,jing4 +雅=ya3 +雅乐=ya3,yue4
+雅兴=ya3,xing4 +雅典娜=ya3,dian3,na4 +集=ji2 +集腋为裘=ji2,ye4,wei2,qiu2 +雇=gu4 +雈=huan2 +雉=zhi4 +雊=gou4 +雋=jun4,juan4 +雌=ci2 +雍=yong1 +雍容大雅=yong1,rong2,da4,ya3 +雎=ju1 +雏=chu2 +雐=hu1 +雑=za2 +雒=luo4 +雓=yu2 +雔=chou2 +雕=diao1 +雕楹碧槛=diao1,ying2,bi4,kan3 +雕虫薄技=diao1,chong2,bo2,ji4 +雖=sui1 +雗=han4 +雘=huo4 +雙=shuang1 +雚=guan4,huan2 +雛=chu2 +雜=za2 +雝=yong1 +雞=ji1 +雟=gui1,xi1 +雠=chou2 +雡=liu4 +離=li2 +難=nan2,nan4,nuo2 +雤=yu4 +雥=za2 +雦=chou2 +雧=ji2 +雨=yu3,yu4 +雨僝云僽=yu3,chan2,yun2,zhou4 +雨僝风僽=yu3,chan2,feng1,zhou4 +雨僽风僝=yu3,zhou4,feng1,chan2 +雨夹雪=yu3,jia1,xue3 +雨露=yu3,lu4 +雨露之恩=yu3,lu4,zhi1,en1 +雩=yu2 +雪=xue3 +雪耻=xue3,chi3 +雪耻报仇=xue3,chi3,bao4,chou2 +雪茄=xue3,jia1 +雫=na3 +雬=fou3 +雭=se4,xi2 +雮=mu4 +雯=wen2 +雰=fen1 +雱=pang1 +雲=yun2 +雳=li4 +雴=chi4 +雵=yang1 +零=ling2 +零散=ling2,san3 +零数=ling2,shu4 +零零散散=ling2,ling2,san3,san3 +零露瀼瀼=ling2,lu4,rang2,rang2 +雷=lei2 +雸=an2 +雹=bao2 +雺=wu4,meng2 +電=dian4 +雼=dang4 +雽=hu1,hu4 +雾=wu4 +雾兴云涌=wu4,xing1,yun2,yong3 +雿=diao4 +需=xu1 +霁=ji4 +霁月光风=ji4,yue4,guang1,feng1 +霁风朗月=ji4,feng1,lang3,yue4 +霂=mu4 +霃=chen2 +霄=xiao1 +霅=zha2 +霆=ting2 +震=zhen4 +霈=pei4 +霉=mei2 +霊=ling2 +霋=qi1 +霌=zhou1 +霍=huo4 +霎=sha4 +霏=fei1 +霐=hong2 +霑=zhan1 +霒=yin1 +霓=ni2 +霔=shu4 +霕=tun2 +霖=lin2 +霗=ling2 +霘=dong4 +霙=ying1 +霚=wu4 +霛=ling2 +霜=shuang1 +霜行草宿=shuang1,xing2,cao3,xiu3 +霜露之思=shuang1,lu4,zhi1,si1 +霜露之悲=shuang1,lu4,zhi1,bei1 +霜露之感=shuang1,lu4,zhi1,gan3 +霜露之病=shuang1,lu4,zhi1,bing4 +霜露之辰=shuang1,lu4,zhi1,chen2 +霝=ling2 +霞=xia2 +霟=hong2 +霠=yin1 +霡=mai4 +霢=mai4 +霣=yun3 +霤=liu4 +霥=meng4 +霦=bin1 +霧=wu4 +霨=wei4 +霩=kuo4 +霪=yin2 +霫=xi2 +霬=yi4 +霭=ai3 +霮=dan4 +霯=teng4 +霰=xian4 +霰弹=xian4,dan4 +霰彈槍=xian4,dan4,qiang1 +霱=yu4 +露=lou4,lu4 +露光=lou4,guang1 +露刃=lou4,ren4 +露台=lu4,tai2 +露天=lu4,tian1 +露头=lou4,tou2 +露头角=lu4,tou2,jiao3 +露宿=lu4,su4 +露富=lou4,fu4 +露尾藏头=lu4,wei3,cang2,tou2 +露己扬才=lu4,ji3,yang2,cai2 +露布=lu4,bu4 +露影藏形=lu4,ying3,cang2,xing2 +露往霜来=lu4,wang3,shuang1,lai2 +露才扬己=lu4,cai2,yang2,ji3 +露水=lu4,shui3 +露点=lu4,dian3 +露珠=lu4,zhu1 +露白=lou4,bai2 +露相=lou4,xiang4 +露纂雪钞=lu4,zuan3,xue3,chao1
+露红烟紫=lu4,hong2,yan1,zi3 +露红烟绿=lu4,hong2,yan1,lv4 +露胆披肝=lu4,dan3,pi1,gan1 +露胆披诚=lu4,dan3,pi1,cheng2 +露苗=lou4,miao2 +露营=lu4,ying2 +露酒=lu4,jiu3 +露钞雪纂=lu4,chao1,xue3,zuan3 +露面抛头=lu4,mian4,pao1,tou2 +露餐风宿=lu4,can1,feng1,su4 +露马脚=lou4,ma3,jiao3 +露骨=lu4,gu3 +露齿=lu4,chi3 +霳=long2 +霴=dai4 +霵=ji2 +霶=pang1 +霷=yang2 +霸=ba4 +霹=pi1 +霹雷=pi1,lei2 +霺=wei1 +霻=feng1 +霼=xi4 +霽=ji4 +霾=mai2 +霿=meng2 +靀=meng2 +靁=lei2 +靂=li4 +靃=huo4 +靄=ai3 +靅=fei4 +靆=dai4 +靇=long2 +靈=ling2 +靉=ai4 +靊=feng1 +靋=li4 +靌=bao3 +靍=he4 +靎=he4 +靏=he4 +靐=bing4 +靑=qing1 +青=qing1 +青堂瓦舍=qing1,tang2,wa3,she4 +青山一发=qing1,shan1,yi1,fa4 +青林黑塞=qing1,lin2,hei1,sai4 +青灯黄卷=qing1,deng1,huang2,juan4 +青紫被体=qing1,zi3,pi1,ti3 +青红皁白=qing1,hong2,zao4,bai2 +青肝碧血=qing1,gan1,bi4,xue4 +青苔=qing1,tai2 +青藏高原=qing1,zang4,gao1,yuan2 +靓=jing4,liang4 +靓丽=liang4,li4 +靓仔=liang4,zai3 +靓女=liang4,nv3 +靓衣=liang4,yi1 +靓饰=liang4,shi4 +靔=tian1 +靕=zheng4 +靖=jing4 +靗=cheng1 +靘=qing4 +静=jing4 +静电反应=jing4,dian4,fan3,ying4 +靚=jing4 +靛=dian4 +靜=jing4 +靝=tian1 +非=fei1 +非分=fei1,fen4 +非分之念=fei1,fen4,zhi1,nian4 +非分之想=fei1,fen4,zhi1,xiang3 +非分之财=fei1,fen4,zhi1,cai2 +非得=fei1,dei3 +非难=fei1,nan4 +靟=fei1 +靠=kao4 +靠不住=kao4,bu2,zhu4 +靠得住=kao4,de5,zhu4 +靠背=kao4,bei4 +靡=mi2 +靡丽=mi3,li4 +靡有孑遗=mi3,you3,jie2,yi2 +靡然从风=mi3,ran2,cong2,feng1 +靡衣偷食=mi3,yi1,tou1,shi2 +靡衣媮食=mi3,yi1,tou1,shi2 +靡费=mi2,fei4 +靡靡之乐=mi3,mi3,zhi1,yue4 +靡靡之音=mi3,mi3,zhi1,yin1 +面=mian4 +面如冠玉=mian4,ru2,guan1,yu4 +面子=mian4,zi5 +面折庭争=mian4,zhe2,ting2,zheng1 +面折廷诤=mian4,zhe2,ting2,zheng4 +面是背非=mian4,shi4,bei4,fei1 +面有难色=mian4,you3,nan2,se4 +面片儿=mian4,pian1,er5 +面糊=mian4,hu4 +面誉背毁=mian4,yu4,bei4,hui3 +面誉背非=mian4,yu4,bei4,fei1 +面谀背毁=mian4,yu2,bei4,hui3 +靣=mian4 +靤=pao4 +靥=ye4 +靦=mian3 +靦颜事仇=tian3,yan2,shi4,chou2 +靧=hui4 +靨=ye4 +革=ge2 +靪=ding1 +靫=cha2 +靬=jian1 +靭=ren4 +靮=di2 +靯=du4 +靰=wu4 +靱=ren4 +靲=qin2 +靳=jin4 +靴=xue1 +靴子=xue1,zi5 +靵=niu3 +靶=ba3 +靶子=ba3,zi5 +靷=yin3 +靸=sa3 +靹=na4 +靺=mo4 +靻=zu3 +靼=da2 +靽=ban4 +靾=xie4 +靿=yao4 +鞀=tao2 +鞁=bei4 +鞂=jie1 +鞃=hong2 +鞄=pao2 +鞅=yang1,yang4 +鞆=bing3 +鞇=yin1 +鞈=ge2,ta4,sa3
+鞉=tao2 +鞊=jie2,ji2 +鞋=xie2 +鞋子=xie2,zi5 +鞌=an1 +鞍=an1 +鞍子=an1,zi5 +鞎=hen2 +鞏=gong3 +鞐=qia3 +鞑=da2 +鞒=qiao2 +鞓=ting1 +鞔=man2,men4 +鞕=bian1,ying4 +鞖=sui1 +鞗=tiao2 +鞘=qiao4,shao1 +鞙=xuan1,juan1 +鞚=kong4 +鞛=beng3 +鞜=ta4 +鞝=shang4,zhang3 +鞞=bing3,pi2,bi4,bei1 +鞟=kuo4 +鞠=ju1 +鞠为茂草=ju1,wei2,mao4,cao3 +鞠躬尽力=ju1,gong1,jin4,li4 +鞡=la5 +鞢=xie4,die2 +鞣=rou2 +鞤=bang1 +鞥=yi4,eng1 +鞦=qiu1 +鞧=qiu1 +鞨=he2 +鞩=qiao4 +鞪=mu4,mou2 +鞫=ju1 +鞫为茂草=ju1,wei2,mao4,cao3 +鞬=jian4,jian1 +鞭=bian1 +鞭子=bian1,zi5 +鞭擗向里=bian1,pi4,xiang4,li3 +鞭约近里=bian1,yue1,jin4,li3 +鞭辟向里=bian1,pi4,xiang4,li3 +鞭辟着里=bian1,pi4,zhuo2,li3 +鞮=di1 +鞯=jian1 +鞰=wen1,yun4 +鞱=tao1 +鞲=gou1 +鞳=ta4 +鞴=bei4 +鞵=xie2 +鞶=pan2 +鞷=ge2 +鞸=bi4,bing3 +鞹=kuo4 +鞺=tang1 +鞻=lou2 +鞼=gui4 +鞽=qiao2 +鞾=xue1 +鞿=ji1 +韀=jian1 +韁=jiang1 +韂=chan4 +韃=da2 +韄=huo4 +韅=xian3 +韆=qian1 +韇=du2 +韈=wa1 +韉=jian1 +韊=lan2 +韋=wei2 +韌=ren4 +韍=fu2 +韎=mei4,wa4 +韏=quan4 +韐=ge2 +韑=wei3 +韒=qiao4 +韓=han2 +韔=chang4 +韕=kuo4 +韖=rou3 +韗=yun4 +韘=she4,xie4 +韙=wei3 +韚=ge2 +韛=bai4 +韜=tao1 +韝=gou1 +韞=yun4 +韟=gao1 +韠=bi4 +韡=wei3 +韢=sui4 +韣=du2 +韤=wa4 +韥=du2 +韦=wei2 +韧=ren4 +韨=fu2 +韩=han2 +韪=wei3 +韫=yun4,wen1 +韬=tao1 +韬戈卷甲=tao1,ge1,juan3,jia3 +韭=jiu3 +韮=jiu3 +韯=xian1 +韰=xie4 +韱=xian1 +韲=ji1 +音=yin1 +音乐=yin1,yue4 +音调=yin1,diao4 +韴=za2 +韵=yun4 +韵调=yun4,diao4 +韶=shao2 +韷=le4 +韸=peng2 +韹=huang2 +韺=ying1 +韻=yun4 +韼=peng2 +韽=an1 +韾=yin1 +響=xiang3 +頀=hu4 +頁=ye4 +頂=ding3 +頃=qing3 +頄=qiu2 +項=xiang4 +順=shun4 +頇=han1 +須=xu1 +頉=yi2 +頊=xu4 +頋=e3 +頌=song4 +頍=kui3 +頎=qi2 +頏=hang2 +預=yu4 +頑=wan2 +頒=ban1 +頓=dun4 +頔=di2 +頕=dan1 +頖=pan4 +頗=po1 +領=ling3 +頙=che4 +頚=jing3 +頛=lei4 +頜=he2 +頝=qiao1 +頞=e4 +頟=e2 +頠=wei3 +頡=jie2 +頢=kuo4 +頣=shen3 +頤=yi2 +頥=yi2 +頦=ke1 +頧=dui3 +頨=yu3 +頩=ping1 +頪=lei4 +頫=fu3 +頬=jia2 +頭=tou2 +頮=hui4 +頯=kui2 +頰=jia2 +頱=luo1 +頲=ting3 +頳=cheng1 +頴=ying3 +頵=jun1 +頶=hu2 +頷=han4 +頸=jing3 +頹=tui2 +頺=tui2 +頻=bin1 +頼=lai4 +頽=tui2 +頾=zi1 +頿=zi1 +顀=chui2 +顁=ding4 +顂=lai4 +顃=tan2 +顄=han4 +顅=qian1 +顆=ke1 +顇=cui4 +顈=jiong3 +顉=qin1 +顊=yi2 +顋=sai1 +題=ti2 +額=e2 +顎=e4 +顏=yan2 +顐=wen4 +顑=kan3
+顒=yong2 +顓=zhuan1 +顔=yan2 +顕=xian3 +顖=xin4 +顗=yi3 +願=yuan4 +顙=sang3 +顚=dian1 +顛=dian1 +顜=jiang3 +顝=kui1 +類=lei4 +顟=lao2 +顠=piao3 +顡=wai4 +顢=man1 +顣=cu4 +顤=yao2 +顥=hao4 +顦=qiao2 +顧=gu4 +顨=xun4 +顩=yan3 +顪=hui4 +顫=chan4 +顬=ru2 +顭=meng2 +顮=bin1 +顯=xian3 +顰=pin2 +顱=lu2 +顲=lan3 +顳=nie4 +顴=quan2 +页=ye4 +页数=ye4,shu4 +顶=ding3 +顶数=ding3,shu4 +顷=qing3 +顸=han1 +项=xiang4 +项背=xiang4,bei4 +项背相望=xiang4,bei4,xiang1,wang4 +顺=shun4 +顺人应天=shun4,ren2,ying4,tian1 +顺便一提=shun4,bian4,yi4,ti2 +顺天应人=shun4,tian1,ying4,ren2 +顺天应时=shun4,tian1,ying4,shi2 +顺差=shun4,cha1 +顺应=shun4,ying4 +顺当=shun4,dang4 +顺风使船=shun4,feng1,shi3,chuan2 +顺风吹火=shun4,feng1,chui1,huo3 +顺风而呼=shun4,feng1,er2,hu1 +顺风转舵=shun4,feng1,zhuan3,duo4 +顺风驶船=shun4,feng1,shi3,chuan2 +须=xu1 +须发=xu1,fa4 +须发皆白=xu1,fa4,jie1,bai2 +须子=xu1,zi5 +顼=xu1 +顽=wan2 +顾=gu4 +顾前不顾后=gu4,qian2,bu4,gu4,hou4 +顾景惭形=gu4,ying3,can2,xing2 +顾虑重重=gu4,lv4,chong2,chong2 +顿=dun4 +顿学累功=dun4,xue2,lei3,gong1 +顿开茅塞=dun4,kai1,mao2,se4 +颀=qi2 +颁=ban1 +颂=song4 +颂声载道=song4,sheng1,zai4,dao4 +颃=hang2 +预=yu4 +预应力=yu4,ying4,li4 +预应力混凝土=yu4,ying4,li4,hun4,ning2,tu3 +预防接种=yu4,fang2,jie1,zhong4 +颅=lu2 +领=ling3 +领导得力=ling3,dao3,de2,li4 +颇=po1 +颈=jing3,geng3 +颉=jie2,xie2,jia2 +颉颃=xie2,hang2 +颊=jia2 +颋=ting3 +颌=he2,ge2 +颍=ying3 +颎=jiong3 +颏=ke1 +颐=yi2 +频=pin2,bin1 +频数=pin2,shuo4 +颒=hui4 +颓=tui2 +颓丧=tui2,sang4 +颓垣断堑=tui2,yuan2,duan4,qian4 +颔=han4 +颕=ying3 +颖=ying3 +颗=ke1 +题=ti2 +颙=yong2 +颚=e4 +颛=zhuan1 +颜=yan2 +额=e2 +额手相庆=e2,shou3,xiang1,qing4 +额数=e2,shu4 +颞=nie4 +颟=man1 +颠=dian1 +颠三倒四=dian1,san1,dao3,si4 +颠乾倒坤=dian1,qian2,dao3,kun1 +颠仆流离=dian1,pu2,liu2,li2 +颠倒=dian1,dao3 +颠倒干坤=dian1,dao3,gan4,kun1 +颠倒衣裳=dian1,dao3,yi1,chang2 +颠来簸去=dian1,lai2,bo3,qu4 +颠簸=dian1,bo3 +颠衣到裳=dian1,yi1,dao4,shang5 +颡=sang3 +颢=hao4 +颣=lei4 +颤=chan4,zhan4 +颤动=chan4,dong4 +颤抖=chan4,dou3 +颤栗=zhan4,li4 +颤音=chan4,yin1 +颥=ru2 +颦=pin2 +颧=quan2 +風=feng1,feng3 +颩=biao1,diu1 +颪=gua1 +颫=fu2 +颬=xia1 +颭=zhan3 +颮=biao1 +颯=sa4 +颰=ba2,fu2 +颱=tai2 +颲=lie4 +颳=gua1 +颴=xuan4 +颵=xiao1 +颶=ju4 +颷=biao1 +颸=si1 +颹=wei3 +颺=yang2
+颻=yao2 +颼=sou1 +颽=kai3 +颾=sao1,sou1 +颿=fan1 +飀=liu2 +飁=xi2 +飂=liu4,liao2 +飃=piao1 +飄=piao1 +飅=liu2 +飆=biao1 +飇=biao1 +飈=biao1 +飉=liao2 +飊=biao1 +飋=se4 +飌=feng1 +飍=xiu1 +风=feng1,feng3 +风云月露=feng1,yun2,yue4,lu4 +风土人情=feng1,tu3,ren2,qing2 +风头=feng1,tou5 +风尘仆仆=feng1,chen2,pu2,pu2 +风影敷衍=feng1,ying3,fu1,yan3 +风斗=feng1,dou3 +风清月皎=feng1,qing1,yue4,jiao3 +风烛草露=feng1,zhu2,cao3,lu4 +风行一时=feng1,xing2,yi1,shi2 +风钻=feng1,zuan4 +风镐=feng1,gao3 +风靡=feng1,mi3 +风餐露宿=feng1,can1,lu4,su4 +风驰草靡=feng1,chi2,cao3,mi3 +飏=yang2 +飐=zhan3 +飑=biao1 +飒=sa4 +飓=ju4 +飔=si1 +飕=sou1 +飖=yao2 +飗=liu2 +飘=piao1 +飙=biao1 +飙发电举=biao1,fa1,dian4,ju4 +飚=biao1 +飛=fei1 +飜=fan1 +飝=fei1 +飞=fei1 +飞将军=fei1,jiang4,jun1 +飞弹=fei1,dan4 +飞来横祸=fei1,lai2,heng4,huo4 +飞沙走砾=fei1,sha1,zou3,li4 +飞转=fei1,zhuan4 +食=shi2,si4,yi4 +食不累味=shi2,bu4,lei4,wei4 +食不重味=shi2,bu4,chong2,wei4 +食为民天=shi2,wei2,min2,tian1 +食母=si4,mu3 +飠=shi2 +飡=can1 +飢=ji1 +飣=ding4 +飤=si4 +飥=tuo1 +飦=zhan1 +飧=sun1 +飨=xiang3 +飩=tun2 +飪=ren4 +飫=yu4 +飬=yang3,juan4 +飭=chi4 +飮=yin3,yin4 +飯=fan4 +飰=fan4 +飱=sun1 +飲=yin3,yin4 +飳=zhu4,tou3 +飴=yi2,si4 +飵=zuo4,ze2,zha1 +飶=bi4 +飷=jie3 +飸=tao1 +飹=bao3 +飺=ci2 +飻=tie4 +飼=si4 +飽=bao3 +飾=shi4 +飿=duo4 +餀=hai4 +餁=ren4 +餂=tian3 +餃=jiao3 +餄=he2 +餅=bing3 +餆=yao2 +餇=tong2 +餈=ci2 +餉=xiang3 +養=yang3 +餋=juan4 +餌=er3 +餍=yan4 +餎=le4 +餏=xi1 +餐=can1 +餐葩饮露=can1,pa1,yin3,lu4 +餐霞吸露=can1,xia2,xi1,lu4 +餐风咽露=can1,feng1,yan4,lu4 +餐风宿草=can1,feng1,su4,cao3 +餐风宿露=can1,feng1,su4,lu4 +餐风露宿=can1,feng1,lu4,su4 +餐风饮露=can1,feng1,yin3,lu4 +餑=bo1 +餒=nei3 +餓=e4 +餔=bu1 +餔糟啜漓=bu3,zao1,chuo4,li2 +餕=jun4 +餖=dou4 +餗=su4 +餘=yu2 +餙=shi4 +餚=yao2 +餛=hun2 +餜=guo3 +餝=shi4 +餞=jian4 +餟=chuo4 +餠=bing3 +餡=xian4 +餢=bu4 +餣=ye4 +餤=dan4 +餥=fei1 +餦=zhang1 +餧=wei4 +館=guan3 +餩=e4 +餪=nuan3 +餫=yun4 +餬=hu2 +餭=huang2 +餮=tie4 +餯=hui4 +餰=jian1 +餱=hou2 +餲=ai4 +餳=xing2 +餴=fen1 +餵=wei4 +餶=gu3 +餷=cha1 +餸=song4 +餹=tang2 +餺=bo2 +餻=gao1 +餼=xi4 +餽=kui4 +餾=liu4 +餿=sou1 +饀=tao2 +饁=ye4 +饂=wen1 +饃=mo2 +饄=tang2 +饅=man2 +饆=bi4 +饇=yu4 +饈=xiu1 +饉=jin3 +饊=san3 +饋=kui4 +饌=zhuan4 +饍=shan4 +饎=xi1 +饏=dan4
+饐=yi4 +饑=ji1 +饒=rao2 +饓=cheng1 +饔=yong1 +饔飧不给=yong1,sun1,bu4,ji3 +饔飧不继=yong1,sun1,bu4,ji4 +饕=tao1 +饖=wei4 +饗=xiang3 +饘=zhan1 +饙=fen1 +饚=hai4 +饛=meng2 +饜=yan4 +饝=mo2 +饞=chan2 +饟=xiang3,nang2 +饠=luo2 +饡=zan4 +饢=nang2 +饣=shi2 +饤=ding4 +饥=ji1 +饥荒=ji1,huang1 +饦=tuo1 +饧=xing2 +饨=tun2 +饩=xi4 +饪=ren4 +饫=yu4 +饬=chi4 +饭=fan4 +饭铺=fan4,pu4 +饮=yin3 +饮冰食蘖=yin3,bing1,shi2,bo4 +饮弹=yin3,dan4 +饮弹自尽=yin3,dan4,zi4,jin4 +饮弹身亡=yin3,dan4,shen1,wang2 +饮水曲肱=yin3,shui3,qu1,gong1 +饮犊上流=yin4,du2,shang4,liu2 +饮胆尝血=yin3,dan3,chang2,xue4 +饮血茹毛=yin3,xue4,ru2,mao2 +饮露餐风=yin3,lu4,can1,feng1 +饮风餐露=yin3,feng1,can1,lu4 +饯=jian4 +饰=shi4 +饰非遂过=shi4,fei1,sui4,guo4 +饱=bao3 +饱和点=bao3,he2,dian3 +饲=si4 +饳=duo4 +饴=yi2 +饵=er3 +饶=rao2 +饷=xiang3 +饸=he2 +饸饹=he2,le5 +饹=ge1,le5 +饺=jiao3 +饺子=jiao3,zi5 +饻=xi1 +饼=bing3 +饼屑子=bing3,xie4,zi5 +饼铛=bing3,cheng1 +饽=bo1 +饾=dou4 +饿=e4 +饿殍枕藉=e4,piao3,zhen3,jie4 +饿莩载道=e4,piao3,zai4,dao4 +饿莩遍野=e4,piao3,bian4,ye3 +馀=yu2 +馁=nei3 +馂=jun4 +馃=guo3 +馄=hun2 +馄饨=hun2,tun5 +馄饨皮儿=hun2,tun5,pi2,er5 +馅=xian4 +馆=guan3 +馆子=guan3,zi5 +馆长=guan3,zhang3 +馇=cha1 +馈=kui4 +馉=gu3 +馊=sou1 +馋=chan2 +馌=ye4 +馍=mo2 +馎=bo2 +馏=liu4,liu2 +馏分=liu2,fen4 +馐=xiu1 +馑=jin3 +馒=man2 +馒头=man2,tou5 +馓=san3 +馔=zhuan4 +馕=nang2,nang3 +首=shou3 +首创=shou3,chuang4 +首尾相应=shou3,wei3,xiang1,ying4 +首相=shou3,xiang4 +首足异处=shou3,zu2,yi4,chu4 +首身分离=shou3,shen1,fen1,li2 +首都=shou3,du1 +首长=shou3,zhang3 +馗=kui2 +馘=guo2 +香=xiang1 +香培玉琢=xiang1,pei2,yu4,zhuo2 +馚=fen1 +馛=bo2 +馜=ni2 +馝=bi4 +馞=bo2 +馟=tu2 +馠=han1 +馡=fei1 +馢=jian1 +馣=an1 +馤=ai4 +馥=fu4 +馦=xian1 +馧=yun1,wo4 +馨=xin1 +馩=fen2 +馪=pin1 +馫=xin1 +馬=ma3 +馭=yu4 +馮=feng2,ping2 +馯=han4,han2 +馰=di2 +馱=tuo2,duo4 +馲=tuo1,zhe2 +馳=chi2 +馴=xun4 +馵=zhu4 +馶=zhi1,shi4 +馷=pei4 +馸=xin4,jin4 +馹=ri4 +馺=sa4 +馻=yun3 +馼=wen2 +馽=zhi2 +馾=dan3,dan4 +馿=lu2 +駀=you2 +駁=bo2 +駂=bao3 +駃=jue2,kuai4 +駄=tuo2,duo4 +駅=yi4 +駆=qu1 +駇=wen2 +駈=qu1 +駉=jiong1 +駊=po3 +駋=zhao1 +駌=yuan1 +駍=peng1 +駎=zhou4 +駏=ju4 +駐=zhu4 +駑=nu2 +駒=ju1 +駓=pi1 +駔=zang3 +駕=jia4 +駖=ling2 +駗=zhen3 +駘=tai2,dai4 +駙=fu4 +駚=yang3 +駛=shi3 +駜=bi4 +駝=tuo2
+駞=tuo2 +駟=si4 +駠=liu2 +駡=ma4 +駢=pian2 +駣=tao2 +駤=zhi4 +駥=rong2 +駦=teng2 +駧=dong4 +駨=xun2,xuan1 +駩=quan2 +駪=shen1 +駫=jiong1 +駬=er3 +駭=hai4 +駮=bo2 +駯=zhu1 +駰=yin1 +駱=luo4 +駲=zhou1 +駳=dan4 +駴=hai4 +駵=liu2 +駶=ju2 +駷=song3 +駸=qin1 +駹=mang2 +駺=liang2,lang2 +駻=han4 +駼=tu2 +駽=xuan1 +駾=tui4 +駿=jun4 +騀=e3 +騁=cheng3 +騂=xing1 +騃=si4 +騃女痴男=ai2,nv3,chi1,nan2 +騃童钝夫=ai2,tong2,dun4,fu1 +騄=lu4 +騅=zhui1 +騆=zhou1 +騇=she4 +騈=pian2 +騉=kun1 +騊=tao2 +騋=lai2 +騌=zong1 +騍=ke4 +騎=qi2 +騏=qi2 +騐=yan4 +騑=fei1 +騒=sao1 +験=yan4 +騔=ge2 +騕=yao3 +騖=wu4 +騗=pian4 +騘=cong1 +騙=pian4 +騚=qian2 +騛=fei1 +騜=huang2 +騝=qian2 +騞=huo1 +騟=yu2 +騠=ti2 +騡=quan2 +騢=xia2 +騣=zong1 +騤=kui2 +騥=rou2 +騦=si1 +騧=gua1 +騨=tuo2 +騩=gui1 +騪=sou1 +騫=qian1 +騬=cheng2 +騭=zhi4 +騮=liu2 +騯=peng2 +騰=teng2 +騱=xi2 +騲=cao3 +騳=du2 +騴=yan4 +騵=yuan2 +騶=zou1 +騷=sao1 +騸=shan4 +騹=qi2 +騺=zhi4 +騻=shuang1 +騼=lu4 +騽=xi2 +騾=luo2 +騿=zhang1 +驀=mo4 +驁=ao4 +驂=can1 +驃=piao4 +驄=cong1 +驅=qu1 +驆=bi4 +驇=zhi4 +驈=yu4 +驉=xu1 +驊=hua2 +驋=bo1 +驌=su4 +驍=xiao1 +驎=lin2 +驏=zhan4 +驐=dun1 +驑=liu2 +驒=tuo2 +驓=ceng2 +驔=dian4 +驕=jiao1 +驖=tie3 +驗=yan4 +驘=luo2 +驙=zhan1 +驚=jing1 +驛=yi4 +驜=ye4 +驝=tuo2 +驞=pin1 +驟=zhou4 +驠=yan4 +驡=long2 +驢=lv2 +驣=teng2 +驤=xiang1 +驥=ji4 +驦=shuang1 +驧=ju2 +驨=xi2 +驩=huan1 +驪=li2 +驫=biao1 +马=ma3 +马仔=ma3,zai3 +马入华山=ma3,ru4,hua2,shan1 +马咽车阗=ma3,yan1,che1,tian2 +马圈=ma3,juan4 +马子=ma3,zi5 +马尾=ma3,yi3 +马尾巴=ma3,yi3,ba1 +马尾松=ma3,wei3,song1 +马尾藻=ma3,yi3,zao3 +马尾辫=ma3,yi3,bian4 +马扎=ma3,zha2 +马赫数=ma3,he4,shu4 +马齿徒长=ma3,chi3,tu2,zhang3 +驭=yu4 +驮=tuo2 +驯=xun4 +驰=chi2 +驱=qu1 +驱虫剂=qu1,chong2,ji4 +驲=ri4 +驳=bo2 +驳壳枪=bo2,ke2,qiang1 +驴=lv2 +驴唇不对马嘴=lv2,chun2,bu4,dui4,ma3,zui3 +驴头不对马嘴=lv2,tou2,bu4,dui4,ma3,zui3 +驴子=lv2,zi5 +驵=zang3 +驶=shi3 +驷=si4 +驸=fu4 +驹=ju1 +驹留空谷=ju1,liu2,kong1,gu3 +驺=zou1 +驻=zhu4 +驻扎=zhu4,zha1 +驼=tuo2 +驼背=tuo2,bei4 +驽=nu2 +驾=jia4 +驿=yi4 +骀=tai2 +骀背鹤发=dai4,bei4,he4,fa4 +骁=xiao1 +骁将=xiao1,jiang4 +骂=ma4 +骃=yin1 +骄=jiao1 +骄儿騃女=jiao1,er2,ai2,nv3 +骄奢淫泆=jiao1,she1,yin2,yi4 +骄横=jiao1,heng4 +骄泰淫泆=jiao1,tai4,yin2,yi4 +骅=hua2 +骆=luo4 +骇=hai4 +骈=pian2
+骈兴错出=pian2,xing1,cuo4,chu1 +骈肩累足=pian2,jian1,lei3,zu2 +骈肩累踵=pian2,jian1,lei4,zhong3 +骈肩累迹=pian2,jian1,lei3,ji4 +骉=biao1 +骊=li2 +骋=cheng3 +验=yan4 +验查=yan4,cha2 +验血=yan4,xue4 +骍=xing1 +骎=qin1 +骏=jun4 +骐=qi2 +骑=qi2 +骑缝=qi2,feng4 +骑缝印=qi2,feng4,yin4 +骑缝章=qi2,feng4,zhang1 +骒=ke4 +骓=zhui1 +骔=zong1 +骕=su4 +骖=can1 +骖风驷霞=can1,feng1,si4,xia2 +骗=pian4 +骗子=pian4,zi5 +骘=zhi4 +骙=kui2 +骚=sao1,sao3 +骛=wu4 +骜=ao2 +骝=liu2 +骞=qian1 +骟=shan4 +骠=biao1,piao4 +骠勇=piao4,yong3 +骠骑=piao4,qi2 +骡=luo2 +骡子=luo2,zi5 +骢=cong1 +骣=chan3 +骤=zhou4 +骥=ji4 +骦=shuang1 +骧=xiang1 +骨=gu3,gu1 +骨头=gu2,tou5 +骨头架子=gu2,tou5,jia4,zi5 +骨子=gu3,zi5 +骨子里=gu3,zi5,li3 +骨干=gu3,gan4 +骨朵=gu1,duo3 +骨朵儿=gu1,duo3,er5 +骨殖=gu3,shi5 +骨碌=gu1,lu5 +骨碌碌=gu1,lu4,lu4 +骨血=gu3,xue4 +骩=wei3 +骪=wei3 +骫=wei3 +骬=yu2 +骭=gan4 +骮=yi4 +骯=ang1 +骰=tou2 +骰子=shai3,zi5 +骱=jie4 +骲=bao4 +骳=bei4,mo2 +骴=ci1 +骵=ti3 +骶=di3 +骷=ku1 +骸=hai2 +骹=qiao1,xiao1 +骺=hou2 +骻=kua4 +骼=ge2 +骽=tui3 +骾=geng3 +骿=pian2 +髀=bi4 +髁=ke1 +髂=qia4 +髃=yu2 +髄=sui2 +髅=lou2 +髆=bo2 +髇=xiao1 +髈=bang3 +髉=bo2,jue2 +髊=ci1 +髋=kuan1 +髌=bin4 +髍=mo2 +髎=liao2 +髏=lou2 +髐=xiao1 +髑=du2 +髒=zang1 +髓=sui3 +體=ti3,ti1 +髕=bin4 +髖=kuan1 +髗=lu2 +高=gao1 +高不成低不就=gao1,bu4,cheng2,di1,bu4,jiu4 +高丽=gao1,li2 +高丽参=gao1,li2,shen1 +高丽纸=gao1,li2,zhi3 +高义薄云=gao1,yi4,bo2,yun2 +高义薄云天=gao1,yi4,bo2,yun2,tian1 +高兴=gao1,xing4 +高冠博带=gao1,guan1,bo2,dai4 +高分子=gao1,fen1,zi3 +高分子化合物=gao1,fen1,zi3,hua4,he2,wu4 +高句骊=gao1,gou1,li2 +高山反应=gao1,shan1,fan3,ying4 +高帽子=gao1,mao4,zi5 +高干=gao1,gan4 +高情逸兴=gao1,qing2,yi4,xing4 +高挑=gao1,tiao3 +高挑儿=gao1,tiao3,er2 +高着=gao1,zhao1 +高知=gao1,zhi4 +高血压=gao1,xue4,ya1 +高调=gao1,diao4 +高风劲节=gao1,feng1,jing4,jie2 +高高兴兴=gao1,gao1,xing4,xing4 +髙=gao1 +髚=qiao4 +髛=kao1 +髜=qiao3 +髝=lao2 +髞=sao4 +髟=biao1 +髠=kun1 +髡=kun1 +髢=di2 +髣=fang3 +髤=xiu1 +髥=ran2 +髦=mao2 +髧=dan4 +髨=kun1 +髩=bin4 +髪=fa4 +髫=tiao2 +髬=pi1 +髭=zi1 +髮=fa4 +髯=ran2 +髰=ti4 +髱=bao4 +髲=bi4,po3 +髳=mao2,meng2 +髴=fu2 +髵=er2 +髶=er4 +髷=qu1 +髸=gong1 +髹=xiu1 +髺=kuo4,yue4 +髻=ji4 +髼=peng2 +髽=zhua1 +髾=shao1 +髿=sha1 +鬀=ti4 +鬁=li4 +鬂=bin4 +鬃=zong1 +鬄=ti4
+鬅=peng2 +鬆=song1 +鬇=zheng1 +鬈=quan2 +鬈发=quan2,fa4 +鬈曲=quan2,qu1 +鬉=zong1 +鬊=shun4 +鬋=jian3 +鬌=duo3 +鬍=hu2 +鬎=la4 +鬏=jiu1 +鬐=qi2 +鬑=lian2 +鬒=zhen3 +鬓=bin4 +鬓发=bin4,fa4 +鬔=peng2 +鬕=ma4 +鬖=san1 +鬗=man2 +鬘=man2 +鬙=seng1 +鬚=xu1 +鬛=lie4 +鬜=qian1 +鬝=qian1 +鬞=nong2 +鬟=huan2 +鬠=kuo4 +鬡=ning2 +鬢=bin4 +鬣=lie4 +鬤=rang2 +鬥=dou4 +鬦=dou4 +鬧=nao4 +鬨=hong4 +鬩=xi4 +鬪=dou4 +鬫=kan4 +鬬=dou4 +鬭=dou4 +鬮=jiu1 +鬯=chang4 +鬰=yu4 +鬱=yu4 +鬲=ge2,li4 +鬳=yan4 +鬴=fu3 +鬵=zeng4 +鬶=gui1 +鬷=zong1 +鬸=liu4 +鬹=gui1 +鬺=shang1 +鬻=yu4 +鬻文为生=yu4,wen2,wei2,sheng1 +鬻矛誉楯=yu4,mao2,yu4,dun4 +鬻驽窃价=yu4,nu2,qie4,jia4 +鬻鸡为凤=yu4,ji1,wei2,feng4 +鬼=gui3 +鬼使神差=gui3,shi3,shen2,chai1 +鬼头=gui3,tou5 +鬼头滑脑=gui3,tou2,hua2,nao3 +鬼头鬼脑=gui3,tou2,gui3,nao3 +鬽=mei4 +鬾=ji4 +鬿=qi2 +魀=ga4 +魁=kui2 +魁梧奇伟=kui2,wu3,qi2,wei3 +魂=hun2 +魂不守舍=hun2,bu4,shou3,she4 +魂不著体=hun2,bu4,zhuo2,ti3 +魂不附体=hun2,bu4,fu4,ti3 +魂飞魄丧=hun2,fei1,po4,sang4 +魃=ba2 +魄=po4 +魅=mei4 +魆=xu1 +魆风骤雨=xu1,feng1,zhou4,yu3 +魇=yan3 +魈=xiao1 +魉=liang3 +魊=yu4 +魋=tui2 +魌=qi1 +魍=wang3 +魎=liang3 +魏=wei4 +魐=gan1 +魑=chi1 +魒=piao1 +魓=bi4 +魔=mo2 +魔高一丈=mo2,gao1,yi1,zhang4 +魕=ji1 +魖=xu1 +魗=chou3 +魘=yan3 +魙=zhan1 +魚=yu2 +魛=dao1 +魜=ren2 +魝=ji4 +魞=ba1,ba4 +魟=hong2 +魠=tuo1 +魡=diao4 +魢=ji3 +魣=yu2 +魤=e2 +魥=ji4 +魦=sha1 +魧=hang2 +魨=tun2 +魩=mo4 +魪=jie4 +魫=shen3 +魬=ban3 +魭=yuan2 +魮=pi2 +魯=lu3 +魰=wen2 +魱=hu2 +魲=lu2 +魳=za1 +魴=fang2 +魵=fen2 +魶=na4 +魷=you2 +魸=pian4 +魹=mo2 +魺=he2 +魻=xia2 +魼=qu1 +魽=han1 +魾=pi1 +魿=ling2 +鮀=tuo2 +鮁=ba4 +鮂=qiu2 +鮃=ping2 +鮄=fu2 +鮅=bi4 +鮆=ci3,ji4 +鮇=wei4 +鮈=ju1 +鮉=diao1 +鮊=bo2,ba4 +鮋=you2 +鮌=gun3 +鮍=pi2 +鮎=nian2 +鮏=xing1 +鮐=tai2 +鮑=bao4 +鮒=fu4 +鮓=zha3,zha4 +鮔=ju4 +鮕=gu1 +鮖=shi2 +鮗=dong1 +鮘=chou5,dai4 +鮙=ta3 +鮚=jie2 +鮛=shu1 +鮜=hou4 +鮝=xiang3 +鮞=er2 +鮟=an1 +鮠=wei2 +鮡=zhao4 +鮢=zhu1 +鮣=yin4 +鮤=lie4 +鮥=luo4,ge2 +鮦=tong2 +鮧=yi2 +鮨=yi4 +鮩=bing4 +鮪=wei3 +鮫=jiao1 +鮬=ku1 +鮭=gui1,xie2 +鮮=xian1,xian3 +鮯=ge2 +鮰=hui2 +鮱=lao3 +鮲=fu2 +鮳=kao4 +鮴=xiu1 +鮵=tuo1 +鮶=jun1 +鮷=ti2 +鮸=mian3 +鮹=shao1 +鮺=zha3 +鮻=suo1 +鮼=qin1 +鮽=yu2 +鮾=nei3 +鮿=zhe2 +鯀=gun3 +鯁=geng3 +鯂=su1 +鯃=wu2 +鯄=qiu2 +鯅=shan1
+鯆=pu1,bu1 +鯇=huan4 +鯈=tiao2 +鯉=li3 +鯊=sha1 +鯋=sha1 +鯌=kao4 +鯍=meng2 +鯎=cheng2 +鯏=li2 +鯐=zou3 +鯑=xi1 +鯒=yong3 +鯓=shen1 +鯔=zi1 +鯕=qi2 +鯖=qing1 +鯗=xiang3 +鯘=nei3 +鯙=chun2 +鯚=ji4 +鯛=diao1 +鯜=qie4 +鯝=gu4 +鯞=zhou3 +鯟=dong1 +鯠=lai2 +鯡=fei1 +鯢=ni2 +鯣=yi4,si1 +鯤=kun1 +鯥=lu4 +鯦=jiu4 +鯧=chang1 +鯨=jing1 +鯩=lun2 +鯪=ling2 +鯫=zou1 +鯬=li2 +鯭=meng3 +鯮=zong1 +鯯=zhi4 +鯰=nian2 +鯱=hu3 +鯲=yu2 +鯳=di3 +鯴=shi1 +鯵=shen1 +鯶=huan4 +鯷=ti2 +鯸=hou2 +鯹=xing1 +鯺=zhu1 +鯻=la4 +鯼=zong1 +鯽=ji4 +鯾=bian1 +鯿=bian1 +鰀=huan4 +鰁=quan2 +鰂=zei2 +鰃=wei1 +鰄=wei1 +鰅=yu2 +鰆=chun1 +鰇=rou2 +鰈=die2 +鰉=huang2 +鰊=lian4 +鰋=yan3 +鰌=qiu1 +鰍=qiu1 +鰎=jian3 +鰏=bi1 +鰐=e4 +鰑=yang2 +鰒=fu4 +鰓=sai1,xi3 +鰔=jian1 +鰕=xia1 +鰖=tuo3 +鰗=hu2 +鰘=shi4 +鰙=ruo4 +鰚=xuan1 +鰛=wen1 +鰜=jian1 +鰝=hao4 +鰞=wu1 +鰟=pang2 +鰠=sao1 +鰡=liu2 +鰢=ma3 +鰣=shi2 +鰤=shi1 +鰥=guan1 +鰦=zi1 +鰧=teng2 +鰨=ta3 +鰩=yao2 +鰪=e4 +鰫=yong2 +鰬=qian2 +鰭=qi2 +鰮=wen1 +鰯=ruo4 +鰰=shen2 +鰱=lian2 +鰲=ao2 +鰳=le4 +鰴=hui1 +鰵=min3 +鰶=ji4 +鰷=tiao2 +鰸=qu1 +鰹=jian1 +鰺=shen1 +鰻=man2 +鰼=xi2 +鰽=qiu2 +鰾=piao4 +鰿=ji4 +鱀=ji4 +鱁=zhu2 +鱂=jiang1 +鱃=xiu1 +鱄=zhuan1 +鱅=yong1 +鱆=zhang1 +鱇=kang1 +鱈=xue3 +鱉=bie1 +鱊=yu4 +鱋=qu1 +鱌=xiang4 +鱍=bo1 +鱎=jiao3 +鱏=xun2 +鱐=su4 +鱑=huang2 +鱒=zun1 +鱓=shan4 +鱔=shan4 +鱕=fan1 +鱖=gui4 +鱗=lin2 +鱘=xun2 +鱙=yao2 +鱚=xi3 +鱛=zeng1 +鱜=xiang1 +鱝=fen4 +鱞=guan1 +鱟=hou4 +鱠=kuai4 +鱡=zei2 +鱢=sao1 +鱣=zhan1 +鱤=gan3 +鱥=gui4 +鱦=ying4 +鱧=li3 +鱨=chang2 +鱩=lei2 +鱪=shu3 +鱫=ai4 +鱬=ru2 +鱭=ji4 +鱮=xu4 +鱯=hu4 +鱰=shu3 +鱱=li3 +鱲=lie4 +鱳=le4 +鱴=mie4 +鱵=zhen1 +鱶=xiang3 +鱷=e4 +鱸=lu2 +鱹=guan4 +鱺=li2 +鱻=xian1 +鱼=yu2 +鱼丽于罶=yu2,li2,yu2,liu3 +鱼封雁帖=yu2,feng1,yan4,tie1 +鱼尾雁行=yu2,wei3,yan4,xing2 +鱼游燋釜=yu2,you2,zhuo2,fu3 +鱼目混珎=yu2,mu4,hun4,zhu1 +鱼肚=yu2,du3 +鱼肚白=yu2,du3,bai2 +鱼贯雁行=yu2,guan4,yan4,xing2 +鱽=dao1 +鱾=ji3 +鱿=you2 +鲀=tun2 +鲁=lu3 +鲁斤燕削=lu3,jin1,yan4,xue1 +鲂=fang2 +鲃=ba1,ba4 +鲄=he2,ge3 +鲅=ba4 +鲆=ping2 +鲇=nian2 +鲈=lu2 +鲉=you2 +鲊=zha3,zha4 +鲋=fu4 +鲌=bo2,ba4 +鲍=bao4 +鲎=hou4 +鲏=pi2 +鲐=tai2 +鲑=gui1,xie2 +鲒=jie2 +鲓=kao4 +鲔=wei3 +鲕=er2 +鲖=tong2 +鲗=zei2 +鲘=hou4 +鲙=kuai4 +鲚=ji4 +鲛=jiao1 +鲜=xian1,xian3 
+鲜为人知=xian3,wei2,ren2,zhi1 +鲜有=xian3,you3 +鲜血=xian1,xue4 +鲜血淋漓=xian1,xue4,lin2,li2 +鲜见=xian3,jian4 +鲝=zha3 +鲞=xiang3 +鲟=xun2 +鲠=geng3 +鲡=li2 +鲢=lian2 +鲣=jian1 +鲤=li3 +鲥=shi2 +鲦=tiao2 +鲧=gun3 +鲨=sha1 +鲩=huan4 +鲪=jun1 +鲫=ji4 +鲬=yong3 +鲭=qing1 +鲮=ling2 +鲯=qi2 +鲰=zou1 +鲱=fei1 +鲲=kun1 +鲳=chang1 +鲴=gu4 +鲵=ni2 +鲶=nian2 +鲷=diao1 +鲸=jing1 +鲹=shen1 +鲺=shi1 +鲻=zi1 +鲼=fen4 +鲽=die2 +鲽离鹣背=die2,li2,jian1,bei4 +鲾=bi1 +鲿=chang2 +鳀=ti2 +鳁=wen1 +鳂=wei1 +鳃=sai1,xi3 +鳄=e4 +鳅=qiu1 +鳆=fu4 +鳇=huang2 +鳈=quan2 +鳉=jiang1 +鳊=bian1 +鳋=sao1 +鳌=ao2 +鳍=qi2 +鳎=ta3 +鳏=guan1 +鳐=yao2 +鳑=pang2 +鳒=jian1 +鳓=le4 +鳔=biao4 +鳕=xue3 +鳖=bie1 +鳗=man2 +鳘=min3 +鳙=yong1 +鳚=wei4 +鳛=xi2 +鳜=gui4,jue2 +鳝=shan4 +鳞=lin2 +鳟=zun1 +鳠=hu4 +鳡=gan3 +鳢=li3 +鳣=zhan1,shan4 +鳤=guan3 +鳥=niao3,diao3 +鳦=yi3 +鳧=fu2 +鳨=li4 +鳩=jiu1 +鳪=bu2 +鳫=yan4 +鳬=fu2 +鳭=diao1,zhao1 +鳮=ji1 +鳯=feng4 +鳰=ru4 +鳱=gan1,han4,yan4 +鳲=shi1 +鳳=feng4 +鳴=ming2 +鳵=bao3 +鳶=yuan1 +鳷=zhi1 +鳸=hu4 +鳹=qin2 +鳺=fu1,gui1 +鳻=ban1,fen2 +鳼=wen2 +鳽=jian1,qian1,zhan1 +鳾=shi1 +鳿=yu4 +鴀=fou3 +鴁=yao1 +鴂=jue2 +鴃=jue2 +鴄=pi3 +鴅=huan1 +鴆=zhen4 +鴇=bao3 +鴈=yan4 +鴉=ya1 +鴊=zheng4 +鴋=fang1 +鴌=feng4 +鴍=wen2 +鴎=ou1 +鴏=dai4 +鴐=jia1 +鴑=ru2 +鴒=ling2 +鴓=mie4 +鴔=fu2 +鴕=tuo2 +鴖=min2 +鴗=li4 +鴘=bian3 +鴙=zhi4 +鴚=ge1 +鴛=yuan1 +鴜=ci2 +鴝=qu2 +鴞=xiao1 +鴟=chi1 +鴠=dan4 +鴡=ju1 +鴢=yao1 +鴣=gu1 +鴤=zhong1 +鴥=yu4 +鴦=yang1 +鴧=yu4 +鴨=ya1 +鴩=die2 +鴪=yu4 +鴫=tian2 +鴬=ying1 +鴭=dui1 +鴮=wu1 +鴯=er2 +鴰=gua1 +鴱=ai4 +鴲=zhi1 +鴳=yan4 +鴴=heng2 +鴵=xiao1 +鴶=jia2 +鴷=lie4 +鴸=zhu1 +鴹=yang2 +鴺=yi2 +鴻=hong2 +鴼=lu4 +鴽=ru2 +鴾=mou2 +鴿=ge1 +鵀=ren2 +鵁=jiao1 +鵂=xiu1 +鵃=zhou1 +鵄=chi1 +鵅=luo4 +鵆=heng2 +鵇=nian2 +鵈=e3 +鵉=luan2 +鵊=jia2 +鵋=ji4 +鵌=tu2 +鵍=huan1 +鵎=tuo3 +鵏=bu1 +鵐=wu2 +鵑=jian1 +鵒=yu4 +鵓=bo2 +鵔=jun4 +鵕=jun4 +鵖=bi1 +鵗=xi1 +鵘=jun4 +鵙=ju2 +鵚=tu1 +鵛=jing4 +鵜=ti2 +鵝=e2 +鵞=e2 +鵟=kuang2 +鵠=hu2 +鵡=wu3 +鵢=shen1 +鵣=lai4 +鵤=zan1 +鵥=pan4 +鵦=lu4 +鵧=pi2 +鵨=shu1 +鵩=fu2 +鵪=an1 +鵫=zhuo2 +鵬=peng2 +鵭=qin2 +鵮=qian1 +鵯=bei1 +鵰=diao1 +鵱=lu4 +鵲=que4 +鵳=jian1 +鵴=ju2 +鵵=tu4 +鵶=ya1 +鵷=yuan1 +鵸=qi2 +鵹=li2 +鵺=ye4 +鵻=zhui1 +鵼=kong1 +鵽=duo4 +鵾=kun1 +鵿=sheng1 
+鶀=qi2 +鶁=jing1 +鶂=yi4 +鶃=yi4 +鶄=jing1 +鶅=zi1 +鶆=lai2 +鶇=dong1 +鶈=qi1 +鶉=chun2 +鶊=geng1 +鶋=ju1 +鶌=qu1 +鶍=yi4 +鶎=zun1 +鶏=ji1 +鶐=shu4 +鶑=ying1 +鶒=chi4 +鶓=miao2 +鶔=rou2 +鶕=an1 +鶖=qiu1 +鶗=ti2,chi2 +鶘=hu2 +鶙=ti2,chi2 +鶚=e4 +鶛=jie1 +鶜=mao2 +鶝=fu2,bi4 +鶞=chun1 +鶟=tu2 +鶠=yan3 +鶡=he2,jie4 +鶢=yuan2 +鶣=pian1,bian3 +鶤=kun1 +鶥=mei2 +鶦=hu2 +鶧=ying1 +鶨=chuan4,zhi4 +鶩=wu4 +鶪=ju2 +鶫=dong1 +鶬=cang1,qiang1 +鶭=fang3 +鶮=he4,hu2 +鶯=ying1 +鶰=yuan2 +鶱=xian1 +鶲=weng1 +鶳=shi1 +鶴=he4 +鶵=chu2 +鶶=tang2 +鶷=xia2 +鶸=ruo4 +鶹=liu2 +鶺=ji1 +鶻=gu3,hu2 +鶼=jian1 +鶽=sun3,xun4 +鶾=han4 +鶿=ci2 +鷀=ci2 +鷁=yi4 +鷂=yao4 +鷃=yan4 +鷄=ji1 +鷅=li4 +鷆=tian2 +鷇=kou4 +鷈=ti1 +鷉=ti1 +鷊=yi4 +鷋=tu2 +鷌=ma3 +鷍=xiao1 +鷎=gao1 +鷏=tian2 +鷐=chen2 +鷑=ji4 +鷒=tuan2 +鷓=zhe4 +鷔=ao2 +鷕=yao3 +鷖=yi1 +鷗=ou1 +鷘=chi4 +鷙=zhi4 +鷚=liu4 +鷛=yong1 +鷜=lou2,lv3 +鷝=bi4 +鷞=shuang1 +鷟=zhuo2 +鷠=yu2 +鷡=wu2 +鷢=jue2 +鷣=yin2 +鷤=ti2 +鷥=si1 +鷦=jiao1 +鷧=yi4 +鷨=hua2 +鷩=bi4 +鷪=ying1 +鷫=su4 +鷬=huang2 +鷭=fan2 +鷮=jiao1 +鷯=liao2 +鷰=yan4 +鷱=gao1 +鷲=jiu4 +鷳=xian2 +鷴=xian2 +鷵=tu2 +鷶=mai3 +鷷=zun1 +鷸=yu4 +鷹=ying1 +鷺=lu4 +鷻=tuan2 +鷼=xian2 +鷽=xue2 +鷾=yi4 +鷿=pi4 +鸀=zhu3 +鸁=luo2 +鸂=xi1 +鸃=yi4 +鸄=ji1 +鸅=ze2 +鸆=yu2 +鸇=zhan1 +鸈=ye4 +鸉=yang2 +鸊=pi4 +鸋=ning2 +鸌=hu4 +鸍=mi2 +鸎=ying1 +鸏=meng2 +鸐=di2 +鸑=yue4 +鸒=yu4 +鸓=lei3 +鸔=bu3 +鸕=lu2 +鸖=he4 +鸗=long2 +鸘=shuang1 +鸙=yue4 +鸚=ying1 +鸛=guan4 +鸜=qu2 +鸝=li2 +鸞=luan2 +鸟=niao3 +鸟乱=diao3,luan4 +鸟事=diao3,shi4 +鸟闹=diao3,nao4 +鸠=jiu1 +鸡=ji1 +鸡内金=ji1,na4,jin1 +鸡冠=ji1,guan1 +鸡冠子=ji1,guan1,zi5 +鸡毛掸子=ji1,mao2,dan3,zi5 +鸡爪子=ji1,zhua3,zi5 +鸡皮疙瘩=ji1,pi2,ge1,da5 +鸡皮鹤发=ji1,pi2,he4,fa4 +鸡胸龟背=ji1,xiong1,gui1,bei4 +鸡蛋里找骨头=ji1,dan4,li3,zhao3,gu3,tou2 +鸢=yuan1 +鸣=ming2 +鸤=shi1 +鸥=ou1 +鸦=ya1 +鸧=cang1 +鸨=bao3 +鸩=zhen4 +鸪=gu1 +鸫=dong1 +鸬=lu2 +鸭=ya1 +鸭子=ya1,zi5 +鸭绿江=ya1,lu4,jiang1 +鸮=xiao1 +鸮鸣鼠暴=zhang1,ming2,shu3,bao4 +鸯=yang1 +鸰=ling2 +鸱=chi1 +鸲=qu2 +鸳=yuan1 +鸴=xue2 +鸵=tuo2 +鸶=si1 +鸷=zhi4 +鸸=er2 +鸹=gua1 +鸺=xiu1 +鸻=heng2 +鸼=zhou1 +鸽=ge1 +鸽子=ge1,zi5 +鸾=luan2 +鸾凤和鸣=luan2,feng4,he4,ming2 +鸾只凤单=luan2,zhi1,feng4,dan1 +鸾孤凤只=luan2,gu1,feng4,zhi1 +鸿=hong2 
+鸿渐于干=hong2,jian4,yu2,gan4 +鸿篇巨着=hong2,pian1,ju4,zhe5 +鸿蒙=hong2,meng2 +鸿蒙初辟=hong2,meng2,chu1,pi4 +鸿衣羽裳=hong2,yi1,yu3,shang5 +鹀=wu2 +鹁=bo2 +鹂=li2 +鹃=juan1 +鹄=hu2 +鹄的=gu3,di4 +鹅=e2 +鹆=yu4 +鹇=xian2 +鹈=ti2 +鹉=wu3 +鹊=que4 +鹋=miao2 +鹌=an1 +鹍=kun1 +鹎=bei1 +鹏=peng2 +鹐=qian1 +鹑=chun2 +鹒=geng1 +鹓=yuan1 +鹔=su4 +鹕=hu2 +鹖=he2 +鹗=e4 +鹘=gu3 +鹙=qiu1 +鹚=ci2 +鹛=mei2 +鹜=wu4 +鹝=yi4 +鹞=yao4 +鹟=weng1 +鹠=liu2 +鹡=ji1 +鹢=yi4 +鹣=jian1 +鹤=he4 +鹤发松姿=he4,fa4,song1,zi1 +鹤发童颜=he4,fa4,tong2,yan2 +鹤发鸡皮=he4,fa4,ji1,pi2 +鹤处鸡群=he4,chu3,ji1,qun2 +鹤背扬州=he4,bei4,yang2,zhou1 +鹥=yi1 +鹦=ying1 +鹧=zhe4 +鹨=liu4 +鹩=liao2 +鹪=jiao1 +鹫=jiu4 +鹬=yu4 +鹭=lu4 +鹮=huan2 +鹯=zhan1 +鹰=ying1 +鹰爪=ying1,zhao3 +鹰爪子=ying1,zhua3,zi5 +鹰觑鹘望=ying1,qu4,hu2,wang4 +鹱=hu4 +鹲=meng2 +鹳=guan4 +鹴=shuang1 +鹵=lu3 +鹶=jin1 +鹷=ling2 +鹸=jian3 +鹹=xian2 +鹺=cuo2 +鹻=jian3 +鹼=jian3 +鹽=yan2 +鹾=cuo2 +鹿=lu4 +鹿死谁手=lu4,si3,shei2,shou3 +麀=you1 +麁=cu1 +麂=ji3 +麃=pao2,biao1 +麄=cu1 +麅=pao2 +麆=zhu4,cu1 +麇=jun1,qun2 +麇至沓来=qun2,zhi4,ta4,lai2 +麈=zhu3 +麉=jian1 +麊=mi2 +麋=mi2 +麌=yu3 +麍=liu2 +麎=chen2 +麏=jun1 +麐=lin2 +麑=ni2 +麒=qi2 +麓=lu4 +麔=jiu4 +麕=jun1 +麖=jing1 +麗=li4,li2 +麘=xiang1 +麙=xian2 +麚=jia1 +麛=mi2 +麜=li4 +麝=she4 +麞=zhang1 +麟=lin2 +麟角凤觜=lin2,jiao3,feng4,zui3 +麠=jing1 +麡=qi2 +麢=ling2 +麣=yan2 +麤=cu1 +麥=mai4 +麦=mai4 +麧=he2 +麨=chao3 +麩=fu1 +麪=mian4 +麫=mian4 +麬=fu1 +麭=pao4 +麮=qu4 +麯=qu1 +麰=mou2 +麱=fu1 +麲=xian4 +麳=lai2 +麴=qu1 +麵=mian4 +麶=chi5 +麷=feng1 +麸=fu1 +麸子=fu1,zi5 +麹=qu1 +麺=mian4 +麻=ma2 +麼=me5,mo2 +麽=me5,mo2 +麾=hui1 +麿=mi2 +黀=zou1 +黁=nun2 +黂=fen2 +黃=huang2 +黄=huang2 +黄卷幼妇=huang2,juan4,you4,fu4 +黄卷青灯=huang2,juan4,qing1,deng1 +黄发儿齿=huang2,fa4,er2,chi3 +黄发台背=huang2,fa1,tai2,bei4 +黄发垂髫=huang2,fa4,chui2,tiao2 +黄发骀背=huang2,fa1,tai2,bei4 +黄发鲐背=huang2,fa1,tai2,bei4 +黄埔=huang2,pu3 +黄梁一梦=huang2,liang2,yi1,meng4 +黄毛丫头=huang2,mao2,ya1,tou5 +黄陂=huang2,po2 +黄雀伺蝉=huang2,que4,si4,chan2 +黄骠马=huang2,biao1,ma3 +黅=jin1 +黆=guang1 +黇=tian1 +黈=tou3 +黉=hong2 +黊=hua4 +黋=kuang4 +黌=hong2 +黍=shu3 +黎=li2 +黏=nian2 +黏皮着骨=nian2,pi2,zhe5,gu3 +黏着=nian2,zhe5 +黏糊=nian2,hu2 
+黏黏糊糊=nian2,nian2,hu1,hu1 +黐=chi1 +黑=hei1 +黑匣子=hei1,xia2,zi5 +黑发=hei1,fa4 +黑咕隆咚=hei1,gu1,long1,dong1 +黑更半夜=hei1,geng1,ban4,ye4 +黑白相间=hei1,bai2,xiang1,jian4 +黒=hei1 +黓=yi4 +黔=qian2 +黕=dan3 +黖=xi4 +黗=tun2 +默=mo4 +黙=mo4 +黚=qian2 +黛=dai4 +黜=chu4 +黝=you3 +點=dian3 +黟=yi1 +黠=xia2 +黡=yan3 +黢=qu1 +黣=mei3 +黤=yan3 +黥=qing2 +黦=yue4 +黧=li2 +黨=dang3 +黩=du2 +黪=can3 +黫=yan1 +黬=yan3 +黭=yan3 +黮=dan4,shen4 +黯=an4 +黰=zhen3,yan1 +黱=dai4 +黲=can3 +黳=yi1 +黴=mei2 +黵=dan3,zhan3 +黶=yan3 +黷=du2 +黸=lu2 +黹=zhi3 +黺=fen3 +黻=fu2 +黼=fu3 +黽=min3,mian3,meng3 +黾=min3,mian3,meng3 +黾穴鸲巢=meng3,xue2,qu2,chao2 +黿=yuan2 +鼀=cu4 +鼁=qu4 +鼂=chao2 +鼃=wa1 +鼄=zhu1 +鼅=zhi1 +鼆=meng3 +鼇=ao2 +鼈=bie1 +鼉=tuo2 +鼊=bi4 +鼋=yuan2 +鼋鸣鳖应=yuan2,ming2,bie1,ying4 +鼌=chao2 +鼍=tuo2 +鼎=ding3 +鼎折覆餗=ding3,she2,fu4,su4 +鼎折餗覆=ding3,she2,su4,fu4 +鼎铛有耳=ding3,cheng1,you3,er3 +鼎铛玉石=ding3,cheng1,yu4,shi2 +鼏=mi4 +鼐=nai4 +鼑=ding3 +鼒=zi1 +鼓=gu3 +鼓乐=gu3,yue4 +鼓乐喧天=gu3,yue4,xuan1,tian1 +鼓乐齐鸣=gu3,yue4,qi2,ming2 +鼓唇咋舌=gu3,chun2,za3,she2 +鼓鼓囊囊=gu3,gu3,nang1,nang1 +鼔=gu3 +鼕=dong1 +鼖=fen2 +鼗=tao2 +鼘=yuan1 +鼙=pi2 +鼚=chang1 +鼛=gao1 +鼜=cao4 +鼝=yuan1 +鼞=tang1 +鼟=teng1 +鼠=shu3 +鼡=shu3 +鼢=fen2 +鼣=fei4 +鼤=wen2 +鼥=ba2 +鼦=diao1 +鼧=tuo2 +鼨=zhong1 +鼩=qu2 +鼪=sheng1 +鼫=shi2 +鼬=you4 +鼭=shi2 +鼮=ting2 +鼯=wu2 +鼰=ju2 +鼱=jing1 +鼲=hun2 +鼳=ju2 +鼴=yan3 +鼵=tu1 +鼶=si1 +鼷=xi1 +鼸=xian4 +鼹=yan3 +鼺=lei2 +鼻=bi2 +鼻咽癌=bi2,yan1,ai2 +鼻子=bi2,zi5 +鼻孔撩天=bi2,kong3,liao2,tian1 +鼻涕虫=bi2,ti4,chong2 +鼼=yao4 +鼽=qiu2 +鼾=han1 +鼿=wu4 +齀=wu4 +齁=hou1 +齂=xie4 +齃=e4 +齄=zha1 +齅=xiu4 +齆=weng4 +齇=zha1 +齈=nong4 +齉=nang4 +齊=qi2 +齋=zhai1 +齌=ji4 +齍=zi1 +齎=ji2 +齏=ji1 +齐=qi2 +齐明=qi2,ming2 +齑=ji1 +齒=chi3 +齓=chen4 +齔=chen4 +齕=he2 +齖=ya2 +齗=yin1 +齘=xie4 +齙=bao1 +齚=ze2 +齛=xie4 +齜=zi1 +齝=chi1 +齞=yan4 +齟=ju3 +齠=tiao2 +齡=ling2 +齢=ling2 +齣=chu1 +齤=quan2 +齥=xie4 +齦=yin2 +齧=nie4 +齨=jiu4 +齩=yao3 +齪=chuo4 +齫=yun3 +齬=yu3 +齭=chu3 +齮=yi3 +齯=ni2 +齰=ze2 +齱=zou1 +齲=qu3 +齳=yun3 +齴=yan3 +齵=yu2 +齶=e4 +齷=wo4 +齸=yi4 +齹=ci1 +齺=zou1 +齻=dian1 +齼=chu3 +齽=jin4 +齾=ya4 +齿=chi3 +齿冠=chi3,guan1 +齿牙为猾=chi3,ya2,wei2,hua2 
+齿牙为祸=chi3,ya2,wei2,huo4 +龀=chen4 +龁=he2 +龂=yin2,ken3 +龃=ju3 +龄=ling2 +龅=bao1 +龆=tiao2 +龇=zi1 +龈=yin2,ken3 +龈齿弹舌=yin2,chi3,dan4,she2 +龉=yu3 +龊=chuo4 +龋=qu3 +龌=wo4 +龍=long2,long3 +龎=pang2 +龏=gong1,wo4 +龐=pang2 +龑=yan3 +龒=long2 +龓=long2,long3 +龔=gong1 +龕=kan1 +龖=da2 +龗=ling2 +龘=da2 +龙=long2 +龙举云兴=long2,ju3,yun2,xing1 +龙兴云属=long2,xing1,yun2,shu3 +龙兴凤举=long2,xing1,feng4,ju3 +龙楼凤阙=long2,lou2,feng4,que4 +龙游曲沼=long4,you4,qu4,zhao4 +龙血凤髓=long2,xue4,feng4,sui3 +龙血玄黄=long2,xue3,xuan2,huang2 +龙门刨=long2,men2,bao4 +龚=gong1 +龛=kan1 +龜=gui1,jun1,qiu1 +龝=qiu1 +龞=bie1 +龟=gui1,jun1,qiu1 +龟兹=qiu1,ci2 +龟甲=gui1,jia2 +龟背=gui1,bei4 +龟裂=jun1,lie4 +龠=yue4 +龡=chui1 +龢=he2 +龣=jiao3 +龤=xie2 +龥=yue4 +重启=chong2,qi3 +还款=huan2,kuan3 +侠传=xia2,zhuan4 +𩽾𩾌=an1,kang1 \ No newline at end of file diff --git a/pom.xml b/pom.xml index e94e94a7a..87c55ec56 100644 --- a/pom.xml +++ b/pom.xml @@ -4,7 +4,7 @@ com.hankcs hanlp - portable-1.5.4 + portable-1.8.6 HanLP https://github.com/hankcs/HanLP @@ -25,7 +25,7 @@ hankcs - me@hankcs.com + hankcshe@gmail.com http://www.hankcs.com @@ -118,6 +118,13 @@ sign + + + + --pinentry-mode + loopback + + diff --git a/src/main/java/com/hankcs/hanlp/HanLP.java b/src/main/java/com/hankcs/hanlp/HanLP.java index 8f364f0ce..bc4cbbac7 100644 --- a/src/main/java/com/hankcs/hanlp/HanLP.java +++ b/src/main/java/com/hankcs/hanlp/HanLP.java @@ -15,6 +15,7 @@ import com.hankcs.hanlp.corpus.io.IIOAdapter; import com.hankcs.hanlp.corpus.io.ResourceIOAdapter; import com.hankcs.hanlp.dependency.nnparser.NeuralNetworkDependencyParser; +import com.hankcs.hanlp.dependency.perceptron.parser.KBeamArcEagerDependencyParser; import com.hankcs.hanlp.dictionary.py.Pinyin; import com.hankcs.hanlp.dictionary.py.PinyinDictionary; import com.hankcs.hanlp.dictionary.ts.*; @@ -22,6 +23,12 @@ import com.hankcs.hanlp.mining.phrase.MutualInformationEntropyPhraseExtractor; import com.hankcs.hanlp.mining.word.NewWordDiscover; import com.hankcs.hanlp.mining.word.WordInfo; +import 
com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.seg.CRF.CRFSegment; +import com.hankcs.hanlp.seg.HMM.HMMSegment; +import com.hankcs.hanlp.seg.NShort.NShortSegment; +import com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment; import com.hankcs.hanlp.seg.Segment; import com.hankcs.hanlp.seg.Viterbi.ViterbiSegment; import com.hankcs.hanlp.seg.common.Term; @@ -69,6 +76,10 @@ public static final class Config * 用户自定义词典路径 */ public static String CustomDictionaryPath[] = new String[]{"data/dictionary/custom/CustomDictionary.txt"}; + /** + * 用户自定义词典是否自动重新生成缓存(根据词典文件的最后修改时间是否大于缓存文件的时间判断) + */ + public static boolean CustomDictionaryAutoRefreshCache = true; /** * 2元语法词典路径 */ @@ -110,10 +121,6 @@ public static final class Config * 简繁转换词典根目录 */ public static String tcDictionaryRoot = "data/dictionary/tc/"; - /** - * 声母韵母语调词典 - */ - public static String SYTDictionaryPath = "data/dictionary/pinyin/SYTDictionary.txt"; /** * 拼音词典路径 @@ -140,6 +147,11 @@ public static final class Config */ public static String CharTablePath = "data/dictionary/other/CharTable.txt"; + /** + * 词性标注集描述表,用来进行中英映射(对于Nature词性,可直接参考Nature.java中的注释) + */ + public static String PartOfSpeechTagDictionary = "data/dictionary/other/TagPKU98.csv"; + /** * 词-词性-依存关系模型 */ @@ -147,24 +159,53 @@ public static final class Config /** * 最大熵-依存关系模型 + * @deprecated 已废弃,请使用{@link KBeamArcEagerDependencyParser}。未来版本将不再发布该模型,并删除配置项 */ public static String MaxEntModelPath = "data/model/dependency/MaxEntModel.txt"; /** * 神经网络依存模型路径 */ public static String NNParserModelPath = "data/model/dependency/NNParserModel.txt"; + /** + * 感知机ArcEager依存模型路径 + */ + public static String PerceptronParserModelPath = "data/model/dependency/perceptron.bin"; /** * CRF分词模型 + * + * @deprecated 已废弃,请使用{@link com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer}。未来版本将不再发布该模型,并删除配置项 */ public static String CRFSegmentModelPath = "data/model/segment/CRFSegmentModel.txt"; 
/** * HMM分词模型 + * + * @deprecated 已废弃,请使用{@link PerceptronLexicalAnalyzer} */ public static String HMMSegmentModelPath = "data/model/segment/HMMSegmentModel.bin"; /** - * CRF依存模型 + * CRF分词模型 + */ + public static String CRFCWSModelPath = "data/model/crf/pku199801/cws.txt"; + /** + * CRF词性标注模型 + */ + public static String CRFPOSModelPath = "data/model/crf/pku199801/pos.txt"; + /** + * CRF命名实体识别模型 */ - public static String CRFDependencyModelPath = "data/model/dependency/CRFDependencyModelMini.txt"; + public static String CRFNERModelPath = "data/model/crf/pku199801/ner.txt"; + /** + * 感知机分词模型 + */ + public static String PerceptronCWSModelPath = "data/model/perceptron/large/cws.bin"; + /** + * 感知机词性标注模型 + */ + public static String PerceptronPOSModelPath = "data/model/perceptron/pku1998/pos.bin"; + /** + * 感知机命名实体识别模型 + */ + public static String PerceptronNERModelPath = "data/model/perceptron/pku1998/ner.bin"; /** * 分词结果是否展示词性 */ @@ -190,10 +231,26 @@ public static final class Config { // IKVM (v.0.44.0.5) doesn't set context classloader loader = HanLP.Config.class.getClassLoader(); } - p.load(new InputStreamReader(Predefine.HANLP_PROPERTIES_PATH == null ? - loader.getResourceAsStream("hanlp.properties") : - new FileInputStream(Predefine.HANLP_PROPERTIES_PATH) - , "UTF-8")); + try + { + p.load(new InputStreamReader(Predefine.HANLP_PROPERTIES_PATH == null ? 
+ loader.getResourceAsStream("hanlp.properties") : + new FileInputStream(Predefine.HANLP_PROPERTIES_PATH) + , "UTF-8")); + } + catch (Exception e) + { + String HANLP_ROOT = System.getProperty("HANLP_ROOT"); + if (HANLP_ROOT == null) HANLP_ROOT = System.getenv("HANLP_ROOT"); + if (HANLP_ROOT != null) + { + HANLP_ROOT = HANLP_ROOT.trim(); + p = new Properties(); + p.setProperty("root", HANLP_ROOT); + logger.info("使用环境变量 HANLP_ROOT=" + HANLP_ROOT); + } + else throw e; + } String root = p.getProperty("root", "").replaceAll("\\\\", "/"); if (root.length() > 0 && !root.endsWith("/")) root += "/"; CoreDictionaryPath = root + p.getProperty("CoreDictionaryPath", CoreDictionaryPath); @@ -222,9 +279,9 @@ public static final class Config } } CustomDictionaryPath = pathArray; + CustomDictionaryAutoRefreshCache = "true".equals(p.getProperty("CustomDictionaryAutoRefreshCache", "true")); tcDictionaryRoot = root + p.getProperty("tcDictionaryRoot", tcDictionaryRoot); if (!tcDictionaryRoot.endsWith("/")) tcDictionaryRoot += '/'; - SYTDictionaryPath = root + p.getProperty("SYTDictionaryPath", SYTDictionaryPath); PinyinDictionaryPath = root + p.getProperty("PinyinDictionaryPath", PinyinDictionaryPath); TranslatedPersonDictionaryPath = root + p.getProperty("TranslatedPersonDictionaryPath", TranslatedPersonDictionaryPath); JapanesePersonDictionaryPath = root + p.getProperty("JapanesePersonDictionaryPath", JapanesePersonDictionaryPath); @@ -234,12 +291,19 @@ public static final class Config OrganizationDictionaryTrPath = root + p.getProperty("OrganizationDictionaryTrPath", OrganizationDictionaryTrPath); CharTypePath = root + p.getProperty("CharTypePath", CharTypePath); CharTablePath = root + p.getProperty("CharTablePath", CharTablePath); + PartOfSpeechTagDictionary = root + p.getProperty("PartOfSpeechTagDictionary", PartOfSpeechTagDictionary); WordNatureModelPath = root + p.getProperty("WordNatureModelPath", WordNatureModelPath); MaxEntModelPath = root + p.getProperty("MaxEntModelPath", 
MaxEntModelPath); NNParserModelPath = root + p.getProperty("NNParserModelPath", NNParserModelPath); + PerceptronParserModelPath = root + p.getProperty("PerceptronParserModelPath", PerceptronParserModelPath); CRFSegmentModelPath = root + p.getProperty("CRFSegmentModelPath", CRFSegmentModelPath); - CRFDependencyModelPath = root + p.getProperty("CRFDependencyModelPath", CRFDependencyModelPath); HMMSegmentModelPath = root + p.getProperty("HMMSegmentModelPath", HMMSegmentModelPath); + CRFCWSModelPath = root + p.getProperty("CRFCWSModelPath", CRFCWSModelPath); + CRFPOSModelPath = root + p.getProperty("CRFPOSModelPath", CRFPOSModelPath); + CRFNERModelPath = root + p.getProperty("CRFNERModelPath", CRFNERModelPath); + PerceptronCWSModelPath = root + p.getProperty("PerceptronCWSModelPath", PerceptronCWSModelPath); + PerceptronPOSModelPath = root + p.getProperty("PerceptronPOSModelPath", PerceptronPOSModelPath); + PerceptronNERModelPath = root + p.getProperty("PerceptronNERModelPath", PerceptronNERModelPath); ShowTermNature = "true".equals(p.getProperty("ShowTermNature", "true")); Normalization = "true".equals(p.getProperty("Normalization", "false")); IOAdapter = null; // 在有配置文件的情况下,无论有无IOAdapter配置项,都先将IOAdapter置为null @@ -273,26 +337,40 @@ public static final class Config } catch (Exception e) { - StringBuilder sbInfo = new StringBuilder("========Tips========\n请将hanlp.properties放在下列目录:\n"); // 打印一些友好的tips - String classPath = (String) System.getProperties().get("java.class.path"); - if (classPath != null) + if (new File("data/dictionary/CoreNatureDictionary.tr.txt").isFile()) { - for (String path : classPath.split(File.pathSeparator)) + logger.info("使用当前目录下的data"); + } + else + { + StringBuilder sbInfo = new StringBuilder("========Tips========\n请将hanlp.properties放在下列目录:\n"); // 打印一些友好的tips + if (new File("src/main/java").isDirectory()) + { + sbInfo.append("src/main/resources"); + } + else { - if (new File(path).isDirectory()) + String classPath = (String) 
System.getProperties().get("java.class.path"); + if (classPath != null) { - sbInfo.append(path).append('\n'); + for (String path : classPath.split(File.pathSeparator)) + { + if (new File(path).isDirectory()) + { + sbInfo.append(path).append('\n'); + } + } } + sbInfo.append("Web项目则请放到下列目录:\n" + + "Webapp/WEB-INF/lib\n" + + "Webapp/WEB-INF/classes\n" + + "Appserver/lib\n" + + "JRE/lib\n"); + sbInfo.append("并且编辑root=PARENT/path/to/your/data\n"); + sbInfo.append("现在HanLP将尝试从jar包内部resource读取data……"); } + logger.info("没有找到hanlp.properties,进入portable模式。若需要自定义,请按下列提示操作:\n" + sbInfo); } - sbInfo.append("Web项目则请放到下列目录:\n" + - "Webapp/WEB-INF/lib\n" + - "Webapp/WEB-INF/classes\n" + - "Appserver/lib\n" + - "JRE/lib\n"); - sbInfo.append("并且编辑root=PARENT/path/to/your/data\n"); - sbInfo.append("现在HanLP将尝试从jar包内部resource读取data……"); - logger.info("hanlp.properties,进入portable模式。若需要自定义HanLP,请按下列提示操作:\n" + sbInfo); } } @@ -575,6 +653,58 @@ public static Segment newSegment() return new ViterbiSegment(); // Viterbi分词器是目前效率和效果的最佳平衡 } + /** + * 创建一个分词器, + * 这是一个工厂方法
+ *
+ * @param algorithm 分词算法,传入算法的中英文名都可以,可选列表:<br>
+ *                  <ul>
+ *                  <li>维特比 (viterbi):效率和效果的最佳平衡</li>
+ *                  <li>双数组trie树 (dat):极速词典分词,千万字符每秒</li>
+ *                  <li>条件随机场 (crf):分词、词性标注与命名实体识别精度都较高,适合要求较高的NLP任务</li>
+ *                  <li>感知机 (perceptron):分词、词性标注与命名实体识别,支持在线学习</li>
+ *                  <li>N最短路 (nshort):命名实体识别稍微好一些,牺牲了速度</li>
+ *                  </ul>
+ * @return 一个分词器 + */ + public static Segment newSegment(String algorithm) + { + if (algorithm == null) + { + throw new IllegalArgumentException(String.format("非法参数 algorithm == %s", algorithm)); + } + algorithm = algorithm.toLowerCase(); + if ("viterbi".equals(algorithm) || "维特比".equals(algorithm)) + return new ViterbiSegment(); // Viterbi分词器是目前效率和效果的最佳平衡 + else if ("dat".equals(algorithm) || "双数组trie树".equals(algorithm)) + return new DoubleArrayTrieSegment(); + else if ("nshort".equals(algorithm) || "n最短路".equals(algorithm)) + return new NShortSegment(); + else if ("crf".equals(algorithm) || "条件随机场".equals(algorithm)) + try + { + return new CRFLexicalAnalyzer(); + } + catch (IOException e) + { + logger.warning("CRF模型加载失败"); + throw new RuntimeException(e); + } + else if ("perceptron".equals(algorithm) || "感知机".equals(algorithm)) + { + try + { + return new PerceptronLexicalAnalyzer(); + } + catch (IOException e) + { + logger.warning("感知机模型加载失败"); + throw new RuntimeException(e); + } + } + throw new IllegalArgumentException(String.format("非法参数 algorithm == %s", algorithm)); + } + /** * 依存文法分析 * @@ -651,6 +781,24 @@ public static List extractWords(BufferedReader reader, int size, boole return discover.discover(reader, size); } + /** + * 提取词语(新词发现) + * + * @param reader 从reader获取文本 + * @param size 需要提取词语的数量 + * @param newWordsOnly 是否只提取词典中没有的词语 + * @param max_word_len 词语最长长度 + * @param min_freq 词语最低频率 + * @param min_entropy 词语最低熵 + * @param min_aggregation 词语最低互信息 + * @return 一个词语列表 + */ + public static List extractWords(BufferedReader reader, int size, boolean newWordsOnly, int max_word_len, float min_freq, float min_entropy, float min_aggregation) throws IOException + { + NewWordDiscover discover = new NewWordDiscover(max_word_len, min_freq, min_entropy, min_aggregation, newWordsOnly); + return discover.discover(reader, size); + } + /** * 提取关键词 * diff --git a/src/main/java/com/hankcs/hanlp/algorithm/MaxHeap.java 
b/src/main/java/com/hankcs/hanlp/algorithm/MaxHeap.java index 195ee0297..24bfc525c 100644 --- a/src/main/java/com/hankcs/hanlp/algorithm/MaxHeap.java +++ b/src/main/java/com/hankcs/hanlp/algorithm/MaxHeap.java @@ -18,7 +18,7 @@ * * @author hankcs */ -public class MaxHeap +public class MaxHeap implements Iterable { /** * 优先队列 @@ -95,4 +95,15 @@ public List toList() return list; } + + @Override + public Iterator iterator() + { + return queue.iterator(); + } + + public int size() + { + return queue.size(); + } } diff --git a/src/main/java/com/hankcs/hanlp/algorithm/VectorDistance.java b/src/main/java/com/hankcs/hanlp/algorithm/VectorDistance.java deleted file mode 100644 index dffea5398..000000000 --- a/src/main/java/com/hankcs/hanlp/algorithm/VectorDistance.java +++ /dev/null @@ -1,70 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/9/14 0:04 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
- * - */ -package com.hankcs.hanlp.algorithm; - -import com.hankcs.hanlp.corpus.synonym.Synonym; -import com.hankcs.hanlp.dictionary.common.CommonSynonymDictionary; - -import java.util.List; - -/** - * 词向量距离计算 - * @author hankcs - */ -public class VectorDistance -{ - public static long compute(long[] arrayA, long[] arrayB) - { - final int m = arrayA.length; - final int n = arrayB.length; - if (m == 0 || n == 0) return 1; - - long total = 0; - for (long va : arrayA) - { - long min_distance = Long.MAX_VALUE; - for (long vb : arrayB) - { - min_distance = Math.min(min_distance, Math.abs(va - vb)); - } - total += min_distance; - } - - return total / m; - } - - public static double compute(List synonymItemListA, List synonymItemListB) - { - double total = 0; - for (CommonSynonymDictionary.SynonymItem itemA : synonymItemListA) - { - long min_distance = Long.MAX_VALUE; - for (CommonSynonymDictionary.SynonymItem itemB : synonymItemListB) - { - long distance; - if (itemA.type != Synonym.Type.UNDEFINED && itemB.type != Synonym.Type.UNDEFINED) - { - distance = Math.abs(itemA.entry.id - itemB.entry.id); - } - else - { - // 用编辑距离凑合一个 - distance = EditDistance.ed(itemA.entry.realWord, itemB.entry.realWord) * 1000000; - } - min_distance = Math.min(min_distance, distance); - } - total += min_distance; - } - - return total; - } -} diff --git a/src/main/java/com/hankcs/hanlp/algorithm/Viterbi.java b/src/main/java/com/hankcs/hanlp/algorithm/Viterbi.java index 2b6b1cd51..93acc2861 100644 --- a/src/main/java/com/hankcs/hanlp/algorithm/Viterbi.java +++ b/src/main/java/com/hankcs/hanlp/algorithm/Viterbi.java @@ -13,6 +13,7 @@ import com.hankcs.hanlp.corpus.dictionary.item.EnumItem; import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.dictionary.TransformMatrix; import com.hankcs.hanlp.dictionary.TransformMatrixDictionary; import com.hankcs.hanlp.seg.common.Vertex; @@ -99,8 +100,10 @@ public static int[] compute(int[] obs, int[] states, double[] start_p, double[][ * @param 
vertexList 包含Vertex.B节点的路径 * @param transformMatrixDictionary 词典对应的转移矩阵 */ - public static void compute(List vertexList, TransformMatrixDictionary transformMatrixDictionary) + public static void compute(List vertexList, TransformMatrix transformMatrixDictionary) { + if (Nature.values().length != transformMatrixDictionary.states.length) + transformMatrixDictionary.extend(Nature.values().length); int length = vertexList.size() - 1; double[][] cost = new double[2][]; // 滚动数组 Iterator iterator = vertexList.iterator(); @@ -118,7 +121,7 @@ public static void compute(List vertexList, TransformMatrixDictionary vertexList, TransformMatrixDictionary keywords) + { + this(); + addAllKeyword(keywords); + } public Trie removeOverlaps() { diff --git a/src/main/java/com/hankcs/hanlp/classification/classifiers/AbstractClassifier.java b/src/main/java/com/hankcs/hanlp/classification/classifiers/AbstractClassifier.java index 9c2390da6..d09f8dd2b 100644 --- a/src/main/java/com/hankcs/hanlp/classification/classifiers/AbstractClassifier.java +++ b/src/main/java/com/hankcs/hanlp/classification/classifiers/AbstractClassifier.java @@ -16,13 +16,13 @@ import com.hankcs.hanlp.classification.corpus.MemoryDataSet; import com.hankcs.hanlp.classification.models.AbstractModel; import com.hankcs.hanlp.classification.utilities.CollectionUtility; -import com.hankcs.hanlp.classification.utilities.MathUtility; +import com.hankcs.hanlp.utility.MathUtility; import java.io.IOException; import java.util.Map; import java.util.TreeMap; -import static com.hankcs.hanlp.classification.utilities.Predefine.logger; +import static com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; /** * @author hankcs @@ -32,6 +32,7 @@ public abstract class AbstractClassifier implements IClassifier @Override public IClassifier enableProbability(boolean enable) { + configProbabilityEnabled = enable; return this; } diff --git a/src/main/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifier.java 
b/src/main/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifier.java index 9d7a6a6d9..50f0589d0 100644 --- a/src/main/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifier.java +++ b/src/main/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifier.java @@ -1,6 +1,6 @@ package com.hankcs.hanlp.classification.classifiers; -import com.hankcs.hanlp.classification.utilities.MathUtility; +import com.hankcs.hanlp.utility.MathUtility; import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; import com.hankcs.hanlp.classification.corpus.*; import com.hankcs.hanlp.classification.features.ChiSquareFeatureExtractor; @@ -8,7 +8,7 @@ import com.hankcs.hanlp.classification.models.AbstractModel; import com.hankcs.hanlp.classification.models.NaiveBayesModel; -import static com.hankcs.hanlp.classification.utilities.Predefine.logger; +import static com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; import java.util.*; diff --git a/src/main/java/com/hankcs/hanlp/classification/corpus/AbstractDataSet.java b/src/main/java/com/hankcs/hanlp/classification/corpus/AbstractDataSet.java index bc0fdb2f1..04876710e 100644 --- a/src/main/java/com/hankcs/hanlp/classification/corpus/AbstractDataSet.java +++ b/src/main/java/com/hankcs/hanlp/classification/corpus/AbstractDataSet.java @@ -12,17 +12,16 @@ package com.hankcs.hanlp.classification.corpus; import com.hankcs.hanlp.classification.models.AbstractModel; -import com.hankcs.hanlp.classification.tokenizers.BigramTokenizer; import com.hankcs.hanlp.classification.tokenizers.HanLPTokenizer; import com.hankcs.hanlp.classification.tokenizers.ITokenizer; -import com.hankcs.hanlp.classification.utilities.MathUtility; +import com.hankcs.hanlp.utility.MathUtility; import com.hankcs.hanlp.classification.utilities.TextProcessUtility; import java.io.File; import java.io.IOException; import java.util.Map; -import static com.hankcs.hanlp.classification.utilities.Predefine.logger; +import static 
com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; /** * @author hankcs diff --git a/src/main/java/com/hankcs/hanlp/classification/corpus/Catalog.java b/src/main/java/com/hankcs/hanlp/classification/corpus/Catalog.java index 6a0115ca5..dcab05b71 100644 --- a/src/main/java/com/hankcs/hanlp/classification/corpus/Catalog.java +++ b/src/main/java/com/hankcs/hanlp/classification/corpus/Catalog.java @@ -69,6 +69,11 @@ public String getCategory(int id) return idCategory.get(id); } + public List getCategories() + { + return idCategory; + } + public int size() { return idCategory.size(); @@ -81,4 +86,10 @@ public String[] toArray() return catalog; } + + @Override + public String toString() + { + return idCategory.toString(); + } } diff --git a/src/main/java/com/hankcs/hanlp/classification/features/ChiSquareFeatureExtractor.java b/src/main/java/com/hankcs/hanlp/classification/features/ChiSquareFeatureExtractor.java index f2bdca827..fa1c0d491 100644 --- a/src/main/java/com/hankcs/hanlp/classification/features/ChiSquareFeatureExtractor.java +++ b/src/main/java/com/hankcs/hanlp/classification/features/ChiSquareFeatureExtractor.java @@ -34,7 +34,8 @@ public static BaseFeatureData extractBasicFeatureData(IDataSet dataSet) } /** - * 使用卡方非参数校验来执行特征选择 + * 使用卡方非参数校验来执行特征选择
+ * https://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html * * @param stats * @return @@ -43,7 +44,7 @@ public Map chi_square(BaseFeatureData stats) { Map selectedFeatures = new HashMap(); - int N1dot, N0dot, N00, N01, N10, N11; + double N1dot, N0dot, N00, N01, N10, N11; double chisquareScore; Double previousScore; for (int feature = 0; feature < stats.featureCategoryJointCount.length; feature++) @@ -83,6 +84,13 @@ public Map chi_square(BaseFeatureData stats) } } } + if (selectedFeatures.size() == 0) // 当特征全部无法通过卡方检测时,取全集作为特征 + { + for (int feature = 0; feature < stats.featureCategoryJointCount.length; feature++) + { + selectedFeatures.put(feature, 0.); + } + } if (selectedFeatures.size() > maxSize) { MaxHeap> maxHeap = new MaxHeap>(maxSize, new Comparator>() @@ -98,7 +106,7 @@ public int compare(Map.Entry o1, Map.Entry o2) maxHeap.add(entry); } selectedFeatures.clear(); - for (Map.Entry entry : maxHeap.toList()) + for (Map.Entry entry : maxHeap) { selectedFeatures.put(entry.getKey(), entry.getValue()); } diff --git a/src/main/java/com/hankcs/hanlp/classification/statistics/evaluations/Evaluator.java b/src/main/java/com/hankcs/hanlp/classification/statistics/evaluations/Evaluator.java index 32ccb22f5..175c45856 100644 --- a/src/main/java/com/hankcs/hanlp/classification/statistics/evaluations/Evaluator.java +++ b/src/main/java/com/hankcs/hanlp/classification/statistics/evaluations/Evaluator.java @@ -15,7 +15,7 @@ import com.hankcs.hanlp.classification.corpus.Document; import com.hankcs.hanlp.classification.corpus.IDataSet; import com.hankcs.hanlp.classification.corpus.MemoryDataSet; -import com.hankcs.hanlp.classification.utilities.MathUtility; +import com.hankcs.hanlp.utility.MathUtility; import java.util.Map; diff --git a/src/main/java/com/hankcs/hanlp/classification/tokenizers/BigramTokenizer.java b/src/main/java/com/hankcs/hanlp/classification/tokenizers/BigramTokenizer.java index e655f0010..b611ffc22 100644 --- 
a/src/main/java/com/hankcs/hanlp/classification/tokenizers/BigramTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/classification/tokenizers/BigramTokenizer.java @@ -1,5 +1,8 @@ package com.hankcs.hanlp.classification.tokenizers; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.dictionary.other.CharType; + import java.util.Iterator; import java.util.LinkedList; import java.util.List; diff --git a/src/main/java/com/hankcs/hanlp/classification/tokenizers/CharTable.java b/src/main/java/com/hankcs/hanlp/classification/tokenizers/CharTable.java deleted file mode 100644 index efb5b06f1..000000000 --- a/src/main/java/com/hankcs/hanlp/classification/tokenizers/CharTable.java +++ /dev/null @@ -1,9 +0,0 @@ -package com.hankcs.hanlp.classification.tokenizers; - -/** - * 字符正规化表 - * @author hankcs - */ -public class CharTable extends com.hankcs.hanlp.dictionary.other.CharTable -{ -} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/classification/tokenizers/CharType.java b/src/main/java/com/hankcs/hanlp/classification/tokenizers/CharType.java deleted file mode 100644 index 46d11b976..000000000 --- a/src/main/java/com/hankcs/hanlp/classification/tokenizers/CharType.java +++ /dev/null @@ -1,53 +0,0 @@ -package com.hankcs.hanlp.classification.tokenizers; - -/** - * @author hankcs - */ -public class CharType -{ - - /** - * 中文字符 - */ - public static final byte CT_CHINESE = 1; - - /** - * 字母 - */ - public static final byte CT_LETTER = 2; - - /** - * 数字 - */ - public static final byte CT_NUM = 3; - - - static byte[] type; - - static - { - type = new byte[65536]; - for (int i = 19968; i < 40870; ++i) - { - type[i] = CT_CHINESE; - } - for (char c : "0123456789".toCharArray()) - { - type[c] = CT_NUM; - } - for (char c : "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray()) - { - type[c] = CT_LETTER; - } - } - - /** - * 获取字符的类型 - * @param c - * @return - */ - public static byte get(char c) - { - return type[(int)c]; - } 
-} diff --git a/src/main/java/com/hankcs/hanlp/classification/utilities/CollectionUtility.java b/src/main/java/com/hankcs/hanlp/classification/utilities/CollectionUtility.java index c731ebc04..f0ebbef60 100644 --- a/src/main/java/com/hankcs/hanlp/classification/utilities/CollectionUtility.java +++ b/src/main/java/com/hankcs/hanlp/classification/utilities/CollectionUtility.java @@ -21,7 +21,7 @@ public class CollectionUtility public static > Map sortMapByValue(Map input, final boolean desc) { LinkedHashMap output = new LinkedHashMap(input.size()); - ArrayList> entryList = new ArrayList>(input.size()); + ArrayList> entryList = new ArrayList>(input.entrySet()); Collections.sort(entryList, new Comparator>() { public int compare(Map.Entry o1, Map.Entry o2) diff --git a/src/main/java/com/hankcs/hanlp/classification/utilities/Predefine.java b/src/main/java/com/hankcs/hanlp/classification/utilities/Predefine.java deleted file mode 100644 index 958619ff6..000000000 --- a/src/main/java/com/hankcs/hanlp/classification/utilities/Predefine.java +++ /dev/null @@ -1,27 +0,0 @@ -/* - * - * He Han - * me@hankcs.com - * 16/2/16 AM11:11 - * - * - * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ - * This source is subject to Hankcs. Please contact Hankcs to get more information. 
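The one-line `CollectionUtility` fix above is easy to miss: `new ArrayList(input.size())` calls the *capacity* constructor and yields an empty list, so `sortMapByValue` sorted nothing and returned an empty map; passing `input.entrySet()` invokes the copy constructor and actually fills the list. A minimal demonstration of the two constructors:

```java
import java.util.*;

public class CapacityVsCopy
{
    // Capacity constructor: reserves room, but the list stays empty
    public static <K, V> List<Map.Entry<K, V>> byCapacity(Map<K, V> input)
    {
        return new ArrayList<Map.Entry<K, V>>(input.size());
    }

    // Copy constructor: the list receives every entry of the map
    public static <K, V> List<Map.Entry<K, V>> byCopy(Map<K, V> input)
    {
        return new ArrayList<Map.Entry<K, V>>(input.entrySet());
    }

    public static void main(String[] args)
    {
        Map<String, Integer> m = new HashMap<String, Integer>();
        m.put("a", 1);
        m.put("b", 2);
        System.out.println(byCapacity(m).size()); // 0 -- the bug
        System.out.println(byCopy(m).size());     // 2 -- the fix
    }
}
```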
- * - */ -package com.hankcs.hanlp.classification.utilities; - -import com.hankcs.hanlp.classification.utilities.io.ConsoleLogger; -import com.hankcs.hanlp.classification.utilities.io.ILogger; - -/** - * 一些全局的常量 - * @author hankcs - */ -public class Predefine -{ - /** - * 日志 - */ - public static ILogger logger = new ConsoleLogger(); -} diff --git a/src/main/java/com/hankcs/hanlp/classification/utilities/io/ConsoleLogger.java b/src/main/java/com/hankcs/hanlp/classification/utilities/io/ConsoleLogger.java index 4e952cd91..81038ec9c 100644 --- a/src/main/java/com/hankcs/hanlp/classification/utilities/io/ConsoleLogger.java +++ b/src/main/java/com/hankcs/hanlp/classification/utilities/io/ConsoleLogger.java @@ -13,11 +13,17 @@ /** * 输出到stdout和stderr的日志系统 + * * @author hankcs */ public class ConsoleLogger implements ILogger { + /** + * 默认日志 + */ + public static ILogger logger = new ConsoleLogger(); long start; + public void out(String format, Object... args) { System.out.printf(format, args); diff --git a/src/main/java/com/hankcs/hanlp/collection/AhoCorasick/AhoCorasickDoubleArrayTrie.java b/src/main/java/com/hankcs/hanlp/collection/AhoCorasick/AhoCorasickDoubleArrayTrie.java index 55c981161..9ba84a81f 100644 --- a/src/main/java/com/hankcs/hanlp/collection/AhoCorasick/AhoCorasickDoubleArrayTrie.java +++ b/src/main/java/com/hankcs/hanlp/collection/AhoCorasick/AhoCorasickDoubleArrayTrie.java @@ -59,10 +59,25 @@ public class AhoCorasickDoubleArrayTrie */ protected int size; + /** + * 是否开启快速构建 + */ + private boolean enableFastBuild; + public AhoCorasickDoubleArrayTrie() { } + /** + * 开启快速构建,相比普通构建速度更快但内存占用微增,原理详见 https://github.com/hankcs/HanLP/issues/1801 + * + * @param enableFastBuild 是否开启快速构建 + */ + public AhoCorasickDoubleArrayTrie(boolean enableFastBuild) + { + this.enableFastBuild = enableFastBuild; + } + /** * 由一个词典创建 * @@ -448,19 +463,19 @@ protected int transition(int current, char c) /** * c转移,如果是根节点则返回自己 * - * @param nodePos + * @param from * @param c 
* @return */ - protected int transitionWithRoot(int nodePos, char c) + protected int transitionWithRoot(int from, char c) { - int b = base[nodePos]; + int b = base[from]; int p; p = b + c + 1; if (b != check[p]) { - if (nodePos == 0) return 0; + if (from == 0) return 0; return -1; } @@ -817,7 +832,6 @@ private void addAllKeyword(Collection keywordSet) private void constructFailureStates() { fail = new int[size + 1]; - fail[1] = base[0]; output = new int[size + 1][]; Queue queue = new LinkedBlockingDeque(); @@ -881,7 +895,10 @@ private void buildDoubleArrayTrie(Set keySet) List> siblings = new ArrayList>(root_node.getSuccess().entrySet().size()); fetch(root_node, siblings); - insert(siblings); + if (siblings.isEmpty()) + Arrays.fill(check, -1); // fill -1 such that no transition is allowed + else + insert(siblings); } /** @@ -918,7 +935,7 @@ private int resize(int newSize) private int insert(List> siblings) { int begin = 0; - int pos = Math.max(siblings.get(0).getKey() + 1, nextCheckPos) - 1; + int pos = Math.max(siblings.get(0).getKey() + 1, enableFastBuild ? 
(nextCheckPos + 1) : nextCheckPos) - 1; int nonzero_num = 0; int first = 0; @@ -1009,7 +1026,7 @@ private void loseWeight() base = nbase; int ncheck[] = new int[size + 65535]; - System.arraycopy(check, 0, ncheck, 0, size); + System.arraycopy(check, 0, ncheck, 0, Math.min(check.length, ncheck.length)); check = ncheck; } } diff --git a/src/main/java/com/hankcs/hanlp/collection/MDAG/MDAGSet.java b/src/main/java/com/hankcs/hanlp/collection/MDAG/MDAGSet.java index 38de4c9c6..f4a7641e1 100644 --- a/src/main/java/com/hankcs/hanlp/collection/MDAG/MDAGSet.java +++ b/src/main/java/com/hankcs/hanlp/collection/MDAG/MDAGSet.java @@ -156,7 +156,8 @@ public void clear() { sourceNode = new MDAGNode(false); simplifiedSourceNode = null; - equivalenceClassMDAGNodeHashMap.clear(); + if (equivalenceClassMDAGNodeHashMap != null) + equivalenceClassMDAGNodeHashMap.clear(); mdagDataArray = null; charTreeSet.clear(); transitionCount = 0; diff --git a/src/main/java/com/hankcs/hanlp/collection/dartsclone/DoubleArray.java b/src/main/java/com/hankcs/hanlp/collection/dartsclone/DoubleArray.java index 1f1cd8701..e58f9e975 100644 --- a/src/main/java/com/hankcs/hanlp/collection/dartsclone/DoubleArray.java +++ b/src/main/java/com/hankcs/hanlp/collection/dartsclone/DoubleArray.java @@ -19,7 +19,7 @@ * * @author manabe */ -public class DoubleArray +public class DoubleArray implements Serializable { static Charset utf8 = Charset.forName("UTF-8"); @@ -109,6 +109,16 @@ public void save(OutputStream stream) throws IOException } } + private void writeObject(ObjectOutputStream out) throws IOException + { + out.writeObject(_array); + } + + private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException + { + _array = (int[]) in.readObject(); + } + /** * Returns the corresponding value if the key is found. Otherwise returns -1. * This method converts the key into UTF-8. 
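`DoubleArray` above becomes `Serializable` via private `writeObject`/`readObject` hooks: because both hooks are defined and neither calls the default serialization, only the backing `int[]` is written, keeping the serialized form minimal. A stand-alone sketch of the same idiom (`IntArrayHolder` and `roundTrip` are illustrative names, not HanLP's API):

```java
import java.io.*;

/** Custom serialization that persists only the backing int[]. */
public class IntArrayHolder implements Serializable
{
    private int[] array;

    public IntArrayHolder(int[] array)
    {
        this.array = array;
    }

    public int[] getArray()
    {
        return array;
    }

    // Invoked reflectively by ObjectOutputStream; skips default field encoding
    private void writeObject(ObjectOutputStream out) throws IOException
    {
        out.writeObject(array);
    }

    // Must read exactly what writeObject wrote, in the same order
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException
    {
        array = (int[]) in.readObject();
    }

    /** Serialize to memory and back, to show the hooks round-trip the data. */
    public static IntArrayHolder roundTrip(IntArrayHolder h) throws IOException, ClassNotFoundException
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(h);
        oos.flush();
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        return (IntArrayHolder) in.readObject();
    }
}
```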
diff --git a/src/main/java/com/hankcs/hanlp/collection/sequence/SString.java b/src/main/java/com/hankcs/hanlp/collection/sequence/SString.java index 8a664f265..cfa7c68d2 100644 --- a/src/main/java/com/hankcs/hanlp/collection/sequence/SString.java +++ b/src/main/java/com/hankcs/hanlp/collection/sequence/SString.java @@ -14,7 +14,7 @@ import java.util.Arrays; /** - * (SimpleString)字符串,为了公用内存,避免值传递,优化运行效率而设置的String的替代品 + * (SimpleString)字符串,因为String内部的char[]无法访问,而许多任务经常操作char[],所以封装了这个结构。 * * @author hankcs */ diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrie.java b/src/main/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrie.java index bd00c7595..c31b9dd38 100644 --- a/src/main/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrie.java +++ b/src/main/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrie.java @@ -15,6 +15,7 @@ */ package com.hankcs.hanlp.collection.trie; +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; import com.hankcs.hanlp.corpus.io.ByteArray; import com.hankcs.hanlp.corpus.io.ByteArrayStream; import com.hankcs.hanlp.corpus.io.IOUtil; @@ -58,7 +59,6 @@ public String toString() protected int check[]; protected int base[]; - private BitSet used; /** * base 和 check 的大小 */ @@ -78,6 +78,11 @@ public String toString() // inline _resize expanded + /** + * 是否开启快速构建 + */ + private boolean enableFastBuild; + /** * 拓展数组 * @@ -158,13 +163,13 @@ private int fetch(Node parent, List siblings) * @param siblings 等待插入的兄弟节点 * @return 插入位置 */ - private int insert(List siblings) + private int insert(List siblings, BitSet used) { if (error_ < 0) return 0; int begin = 0; - int pos = Math.max(siblings.get(0).code + 1, nextCheckPos) - 1; + int pos = Math.max(siblings.get(0).code + 1, enableFastBuild ? 
(nextCheckPos + 1) : nextCheckPos) - 1; int nonzero_num = 0; int first = 0; @@ -172,7 +177,7 @@ private int insert(List siblings) resize(pos + 1); outer: - // 此循环体的目标是找出满足base[begin + a1...an] == 0的n个空闲空间,a1...an是siblings中的n个节点 + // 此循环体的目标是找出满足check[begin + a1...an] == 0的n个空闲空间,a1...an是siblings中的n个节点 while (true) { pos++; @@ -221,7 +226,7 @@ else if (first == 0) //used[begin] = true; used.set(begin); - + size = (size > begin + siblings.get(siblings.size() - 1).code + 1) ? size : begin + siblings.get(siblings.size() - 1).code + 1; @@ -253,7 +258,7 @@ else if (first == 0) } else { - int h = insert(new_siblings); // dfs + int h = insert(new_siblings, used); // dfs base[begin + siblings.get(i).code] = h; // System.out.println(this); } @@ -265,13 +270,23 @@ public DoubleArrayTrie() { check = null; base = null; - used = new BitSet(); size = 0; allocSize = 0; // no_delete_ = false; error_ = 0; } + /** + * 开启快速构建,相比普通构建速度更快但内存占用微增,原理详见 https://github.com/hankcs/HanLP/issues/1801 + * + * @param enableFastBuild 是否开启快速构建 + */ + public DoubleArrayTrie(boolean enableFastBuild) + { + this(); + this.enableFastBuild = enableFastBuild; + } + /** * 从TreeMap构造 * @param buildFrom @@ -299,7 +314,6 @@ void clear() // if (! 
no_delete_) check = null; base = null; - used = null; allocSize = 0; size = 0; // no_delete_ = false; @@ -357,7 +371,10 @@ public int build(Set> entrySet) List valueList = new ArrayList(entrySet.size()); for (Map.Entry entry : entrySet) { - keyList.add(entry.getKey()); + String key = entry.getKey(); + if (key.isEmpty()) + continue; + keyList.add(key); valueList.add(entry.getValue()); } @@ -389,7 +406,7 @@ public int build(TreeMap keyValueMap) public int build(List _key, int _length[], int _value[], int _keySize) { - if (_keySize > _key.size() || _key == null) + if (_key == null || _keySize > _key.size()) return 0; // progress_func_ = progress_func; @@ -398,6 +415,7 @@ public int build(List _key, int _length[], int _value[], keySize = _keySize; value = _value; progress = 0; + allocSize = 0; resize(65536 * 32); // 32个双字节 @@ -411,12 +429,12 @@ public int build(List _key, int _length[], int _value[], List siblings = new ArrayList(); fetch(root_node, siblings); - insert(siblings); + insert(siblings, new BitSet()); + shrink(); // size += (1 << 8 * 2) + 1; // ??? 
// if (size >= allocSize) resize (size); - used = null; key = null; length = null; @@ -543,7 +561,6 @@ public boolean load(ByteArray byteArray, V[] value) check[i] = byteArray.nextInt(); } v = value; - used = null; // 无用的对象,释放掉 return true; } @@ -569,7 +586,6 @@ public boolean load(byte[] bytes, int offset, V[] value) offset += 4; } v = value; - used = null; // 无用的对象,释放掉 return true; } @@ -1282,7 +1298,7 @@ public LongestSearcher(int offset, char[] charArray) */ public boolean next() { - value = null; + length = 0; begin = i; int b = base[0]; int n; @@ -1292,7 +1308,7 @@ public boolean next() { if (i >= arrayLength) // 指针到头了,将起点往前挪一个,重新开始,状态归零 { - return value != null; + return length > 0; } p = b + (int) (charArray[i]) + 1; // 状态转移 p = base[char[i-1]] + char[i] + 1 if (b == check[p]) // base[char[i-1]] == check[base[char[i-1]] + char[i] + 1] @@ -1300,12 +1316,14 @@ public boolean next() else { if (begin == arrayLength) break; - if (value != null) + if (length > 0) { + i = begin + length; // 输出最长词后,从该词语的下一个位置恢复扫描 return true; } - begin = i + 1; // 转移失败,重新开始,状态归零 + i = begin; // 转移失败,也将起点往前挪一个,重新开始,状态归零 + ++begin; b = base[0]; } p = b; @@ -1322,6 +1340,21 @@ public boolean next() } } + /** + * 全切分 + * + * @param text 文本 + * @param processor 处理器 + */ + public void parseText(String text, AhoCorasickDoubleArrayTrie.IHit processor) + { + Searcher searcher = getSearcher(text, 0); + while (searcher.next()) + { + processor.hit(searcher.begin, searcher.begin + searcher.length, searcher.value); + } + } + public LongestSearcher getLongestSearcher(String text, int offset) { return getLongestSearcher(text.toCharArray(), offset); @@ -1332,6 +1365,21 @@ public LongestSearcher getLongestSearcher(char[] text, int offset) return new LongestSearcher(offset, text); } + /** + * 最长匹配 + * + * @param text 文本 + * @param processor 处理器 + */ + public void parseLongestText(String text, AhoCorasickDoubleArrayTrie.IHit processor) + { + LongestSearcher searcher = getLongestSearcher(text, 0); + 
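The `LongestSearcher` fix above changes two things: the hit flag becomes `length > 0` instead of `value != null` (so a trie entry whose value happens to be null no longer breaks matching), and after emitting a hit the scan resumes at `begin + length` rather than re-entering the matched word. The control flow can be sketched without the double-array machinery, with a naive dictionary lookup standing in for the trie:

```java
import java.util.*;

public class LongestMatchSketch
{
    /**
     * Greedy longest-match over a dictionary. After each hit the scan
     * resumes right after the matched word (the i = begin + length fix);
     * on a miss the window slides forward by one character.
     */
    public static List<String> segment(String text, Set<String> dict)
    {
        List<String> out = new ArrayList<String>();
        int begin = 0;
        while (begin < text.length())
        {
            int length = 0;
            for (int end = begin + 1; end <= text.length(); ++end)
            {
                if (dict.contains(text.substring(begin, end)))
                    length = end - begin; // remember the longest hit so far
            }
            if (length > 0)
            {
                out.add(text.substring(begin, begin + length));
                begin += length; // resume after the matched word
            }
            else
            {
                ++begin; // no match at this start: slide by one
            }
        }
        return out;
    }
}
```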
while (searcher.next()) + { + processor.hit(searcher.begin, searcher.begin + searcher.length, searcher.value); + } + } + /** * 转移状态 * @@ -1385,6 +1433,24 @@ public V get(int index) return v[index]; } + /** + * 释放空闲的内存 + */ + private void shrink() + { +// if (HanLP.Config.DEBUG) +// { +// System.err.printf("释放内存 %d bytes\n", base.length - size - 65535); +// } + int nbase[] = new int[size + 65535]; + System.arraycopy(base, 0, nbase, 0, size); + base = nbase; + + int ncheck[] = new int[size + 65535]; + System.arraycopy(check, 0, ncheck, 0, size); + check = ncheck; + } + /** * 打印统计信息 @@ -1405,4 +1471,4 @@ public V get(int index) // } // System.out.println("CheckUsed: " + nonZeroIndex); // } -} \ No newline at end of file +} diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BaseNode.java b/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BaseNode.java index c975218bc..fb0ee4fb9 100644 --- a/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BaseNode.java +++ b/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BaseNode.java @@ -50,6 +50,17 @@ public abstract class BaseNode implements Comparable */ protected V value; + public BaseNode transition(String path, int begin) + { + BaseNode cur = this; + for (int i = begin; i < path.length(); ++i) + { + cur = cur.getChild(path.charAt(i)); + if (cur == null || cur.status == Status.UNDEFINED_0) return null; + } + return cur; + } + public BaseNode transition(char[] path, int begin) { BaseNode cur = this; diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrie.java b/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrie.java index f4183adec..7a152c8ba 100644 --- a/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrie.java +++ b/src/main/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrie.java @@ -38,6 +38,15 @@ public BinTrie() status = Status.NOT_WORD_1; } + public BinTrie(Map map) + { + this(); + for (Map.Entry entry : map.entrySet()) + { + 
put(entry.getKey(), entry.getValue()); + } + } + /** * 插入一个词 * @@ -599,6 +608,14 @@ public void parseText(String text, AhoCorasickDoubleArrayTrie.IHit processor) { processor.hit(begin, i + 1, value); } + + /*如果是最后一位,这里不能直接跳出循环, 要继续从下一个字符开始判断*/ + if (i == length - 1) + { + i = begin; + ++begin; + state = this; + } } else { @@ -631,6 +648,14 @@ public void parseText(char[] text, AhoCorasickDoubleArrayTrie.IHit processor) { processor.hit(begin, i + 1, value); } + + /*如果是最后一位,这里不能直接跳出循环, 要继续从下一个字符开始判断*/ + if (i == length - 1) + { + i = begin; + ++begin; + state = this; + } } else { diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/datrie/CharacterMapping.java b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/CharacterMapping.java new file mode 100644 index 000000000..87ef03208 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/CharacterMapping.java @@ -0,0 +1,19 @@ +package com.hankcs.hanlp.collection.trie.datrie; + +/** + * 字符映射接口 + */ +public interface CharacterMapping +{ + int getInitSize(); + + int getCharsetSize(); + + int zeroId(); + + int[] toIdList(String key); + + int[] toIdList(int codePoint); + + String toString(int[] ids); +} diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/datrie/IntArrayList.java b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/IntArrayList.java new file mode 100644 index 000000000..599952638 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/IntArrayList.java @@ -0,0 +1,220 @@ +package com.hankcs.hanlp.collection.trie.datrie; + +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.ICacheAble; + +import java.io.*; +import java.util.ArrayList; + +/** + * 动态数组 + */ +public class IntArrayList implements Serializable, ICacheAble +{ + private static final long serialVersionUID = 1908530358259070518L; + private int[] data; + /** + * 实际size + */ + private int size; + /** + * 线性递增 + */ + private int linearExpandFactor; + + public void 
setLinearExpandFactor(int linearExpandFactor) + { + this.linearExpandFactor = linearExpandFactor; + } + + /** + * 是否指数递增 + */ + private boolean exponentialExpanding = false; + + public boolean isExponentialExpanding() + { + return exponentialExpanding; + } + + public void setExponentialExpanding(boolean multiplyExpanding) + { + this.exponentialExpanding = multiplyExpanding; + } + + private double exponentialExpandFactor = 1.5; + + public double getExponentialExpandFactor() + { + return exponentialExpandFactor; + } + + public void setExponentialExpandFactor(double exponentialExpandFactor) + { + this.exponentialExpandFactor = exponentialExpandFactor; + } + + public IntArrayList() + { + this(1024); + } + + public IntArrayList(int capacity) + { + this(capacity, 10240); + } + + public IntArrayList(int capacity, int linearExpandFactor) + { + this.data = new int[capacity]; + this.size = 0; + this.linearExpandFactor = linearExpandFactor; + } + + private void expand() + { + if (!exponentialExpanding) + { + int[] newData = new int[this.data.length + this.linearExpandFactor]; + System.arraycopy(this.data, 0, newData, 0, this.data.length); + this.data = newData; + } + else + { + int[] newData = new int[(int) (this.data.length * exponentialExpandFactor)]; + System.arraycopy(this.data, 0, newData, 0, this.data.length); + this.data = newData; + } + } + + /** + * 在数组尾部新增一个元素 + * + * @param element + */ + public void append(int element) + { + if (this.size == this.data.length) + { + expand(); + } + this.data[this.size] = element; + this.size += 1; + } + + /** + * 去掉多余的buffer + */ + public void loseWeight() + { + if (size == data.length) + { + return; + } + int[] newData = new int[size]; + System.arraycopy(this.data, 0, newData, 0, size); + this.data = newData; + } + + public int size() + { + return this.size; + } + + public int getLinearExpandFactor() + { + return this.linearExpandFactor; + } + + public void set(int index, int value) + { + this.data[index] = value; + } + + public 
int get(int index) + { + return this.data[index]; + } + + public void removeLast() + { + --size; + } + + public int getLast() + { + return data[size - 1]; + } + + public void setLast(int value) + { + data[size - 1] = value; + } + + public int pop() + { + return data[--size]; + } + + @Override + public void save(DataOutputStream out) throws IOException + { + out.writeInt(size); + for (int i = 0; i < size; i++) + { + out.writeInt(data[i]); + } + out.writeInt(linearExpandFactor); + out.writeBoolean(exponentialExpanding); + out.writeDouble(exponentialExpandFactor); + } + + @Override + public boolean load(ByteArray byteArray) + { + if (byteArray == null) + { + return false; + } + size = byteArray.nextInt(); + data = new int[size]; + for (int i = 0; i < size; i++) + { + data[i] = byteArray.nextInt(); + } + linearExpandFactor = byteArray.nextInt(); + exponentialExpanding = byteArray.nextBoolean(); + exponentialExpandFactor = byteArray.nextDouble(); + return true; + } + + private void writeObject(ObjectOutputStream out) throws IOException + { + loseWeight(); + out.writeInt(size); + out.writeObject(data); + out.writeInt(linearExpandFactor); + out.writeBoolean(exponentialExpanding); + out.writeDouble(exponentialExpandFactor); + } + + private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException + { + size = in.readInt(); + data = (int[]) in.readObject(); + linearExpandFactor = in.readInt(); + exponentialExpanding = in.readBoolean(); + exponentialExpandFactor = in.readDouble(); + } + + @Override + public String toString() + { + ArrayList head = new ArrayList(20); + for (int i = 0; i < Math.min(size, 20); ++i) + { + head.add(data[i]); + } + return head.toString(); + } +} diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrie.java b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrie.java new file mode 100644 index 000000000..e6d93c9e5 --- /dev/null +++ 
b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrie.java @@ -0,0 +1,428 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-11-17 下午1:48 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.collection.trie.datrie; + +import java.util.*; + +/** + * 泛型可变双数组trie树 + * + * @author hankcs + */ +public class MutableDoubleArrayTrie implements SortedMap, Iterable> +{ + MutableDoubleArrayTrieInteger trie; + ArrayList values; + + public MutableDoubleArrayTrie() + { + trie = new MutableDoubleArrayTrieInteger(); + values = new ArrayList(); + } + + public MutableDoubleArrayTrie(Map map) + { + this(); + putAll(map); + } + + /** + * 去掉多余的buffer + */ + public void loseWeight() + { + trie.loseWeight(); + } + + @Override + public String toString() + { + final StringBuilder sb = new StringBuilder("MutableDoubleArrayTrie{"); + sb.append("size=").append(size()).append(','); + sb.append("allocated=").append(trie.getBaseArraySize()).append(','); + sb.append('}'); + return sb.toString(); + } + + @Override + public Comparator comparator() + { + return new Comparator() + { + @Override + public int compare(String o1, String o2) + { + return o1.compareTo(o2); + } + }; + } + + @Override + public SortedMap subMap(String fromKey, String toKey) + { + throw new UnsupportedOperationException(); + } + + @Override + public SortedMap headMap(String toKey) + { + throw new UnsupportedOperationException(); + } + + @Override + public SortedMap tailMap(String fromKey) + { + throw new UnsupportedOperationException(); + } + + @Override + public String firstKey() + { + return trie.iterator().key(); + } + + @Override + public String lastKey() + { + MutableDoubleArrayTrieInteger.KeyValuePair iterator = trie.iterator(); + while (iterator.hasNext()) + { + iterator.next(); + } + return iterator.key(); + } + + @Override + public int size() + { + return 
trie.size(); + } + + @Override + public boolean isEmpty() + { + return trie.isEmpty(); + } + + @Override + public boolean containsKey(Object key) + { + if (key == null || !(key instanceof String)) + return false; + return trie.containsKey((String) key); + } + + @Override + public boolean containsValue(Object value) + { + return values.contains(value); + } + + @Override + public V get(Object key) + { + if (key == null) + return null; + int id; + if (key instanceof String) + { + id = trie.get((String) key); + } + else + { + id = trie.get(key.toString()); + } + if (id == -1) + return null; + return values.get(id); + } + + @Override + public V put(String key, V value) + { + int id = trie.get(key); + if (id == -1) + { + trie.set(key, values.size()); + values.add(value); + return null; + } + else + { + V v = values.get(id); + values.set(id, value); + return v; + } + } + + @Override + public V remove(Object key) + { + if (key == null) return null; + int id = trie.remove(key instanceof String ? 
(String) key : key.toString()); + if (id == -1) + return null; + trie.decreaseValues(id); + return values.remove(id); + } + + @Override + public void putAll(Map m) + { + for (Entry entry : m.entrySet()) + { + put(entry.getKey(), entry.getValue()); + } + } + + @Override + public void clear() + { + trie.clear(); + values.clear(); + } + + @Override + public Set keySet() + { + return new Set() + { + MutableDoubleArrayTrieInteger.KeyValuePair iterator = trie.iterator(); + + @Override + public int size() + { + return trie.size(); + } + + @Override + public boolean isEmpty() + { + return trie.isEmpty(); + } + + @Override + public boolean contains(Object o) + { + throw new UnsupportedOperationException(); + } + + @Override + public Iterator iterator() + { + return new Iterator() + { + @Override + public boolean hasNext() + { + return iterator.hasNext(); + } + + @Override + public String next() + { + return iterator.next().key(); + } + + @Override + public void remove() + { + throw new UnsupportedOperationException(); + } + }; + } + + @Override + public Object[] toArray() + { + return values.toArray(); + } + + @Override + public T[] toArray(T[] a) + { + return values.toArray(a); + } + + @Override + public boolean add(String s) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean remove(Object o) + { + return trie.remove((String) o) != -1; + } + + @Override + public boolean containsAll(Collection c) + { + for (Object o : c) + { + if (!trie.containsKey((String) o)) + return false; + } + return true; + } + + @Override + public boolean addAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean retainAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean removeAll(Collection c) + { + boolean changed = false; + for (Object o : c) + { + if (!changed) + changed = MutableDoubleArrayTrie.this.remove(o) != null; + } + return changed; + } + + @Override + public 
void clear() + { + MutableDoubleArrayTrie.this.clear(); + } + }; + } + + @Override + public Collection values() + { + return values; + } + + @Override + public Set> entrySet() + { + return new Set>() + { + @Override + public int size() + { + return trie.size(); + } + + @Override + public boolean isEmpty() + { + return trie.isEmpty(); + } + + @Override + public boolean contains(Object o) + { + throw new UnsupportedOperationException(); + } + + @Override + public Iterator> iterator() + { + return new Iterator>() + { + MutableDoubleArrayTrieInteger.KeyValuePair iterator = trie.iterator(); + + @Override + public boolean hasNext() + { + return iterator.hasNext(); + } + + @Override + public Entry next() + { + iterator.next(); + return new AbstractMap.SimpleEntry(iterator.key(), values.get(iterator.value())); + } + + @Override + public void remove() + { + throw new UnsupportedOperationException(); + } + }; + } + + @Override + public Object[] toArray() + { + throw new UnsupportedOperationException(); + } + + @Override + public T[] toArray(T[] a) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean add(Entry stringVEntry) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean remove(Object o) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean containsAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean addAll(Collection> c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean retainAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean removeAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public void clear() + { + MutableDoubleArrayTrie.this.clear(); + } + }; + } + + @Override + public Iterator> iterator() + { + return entrySet().iterator(); + } +} diff --git 
a/src/main/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrieInteger.java b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrieInteger.java new file mode 100644 index 000000000..8c1e42f76 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrieInteger.java @@ -0,0 +1,1385 @@ +package com.hankcs.hanlp.collection.trie.datrie; + +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.ICacheAble; + +import java.io.*; +import java.util.*; + +import static com.hankcs.hanlp.utility.Predefine.logger; + +/** + * 可变双数组trie树,重构自:https://github.com/fancyerii/DoubleArrayTrie + */ +public class MutableDoubleArrayTrieInteger implements Serializable, Iterable, ICacheAble +{ + private static final long serialVersionUID = 5586394930559218802L; + /** + * 0x40000000 + */ + private static final int LEAF_BIT = 1073741824; + private static final int[] EMPTY_WALK_STATE = {-1, -1}; + CharacterMapping charMap; + /** + * 字符串的终止字符(会在传入的字符串末尾添加该字符) + */ + private static final char UNUSED_CHAR = '\000'; + /** + * 终止字符的codePoint,这个字符作为叶节点的标识 + */ + private static final int UNUSED_CHAR_VALUE = UNUSED_CHAR; + private IntArrayList check; + private IntArrayList base; + /** + * 键值对数量 + */ + private int size; + + public MutableDoubleArrayTrieInteger(Map stringIntegerMap) + { + this(stringIntegerMap.entrySet()); + } + + public MutableDoubleArrayTrieInteger(Set> entrySet) + { + this(); + for (Map.Entry entry : entrySet) + { + put(entry.getKey(), entry.getValue()); + } + } + + /** + * 激活指数膨胀 + * + * @param exponentialExpanding + */ + public void setExponentialExpanding(boolean exponentialExpanding) + { + check.setExponentialExpanding(exponentialExpanding); + base.setExponentialExpanding(exponentialExpanding); + } + + /** + * 指数膨胀的底数 + * + * @param exponentialExpandFactor + */ + public void setExponentialExpandFactor(double exponentialExpandFactor) + { + 
check.setExponentialExpandFactor(exponentialExpandFactor); + base.setExponentialExpandFactor(exponentialExpandFactor); + } + + /** + * 设置线性膨胀 + * + * @param linearExpandFactor + */ + public void setLinearExpandFactor(int linearExpandFactor) + { + check.setLinearExpandFactor(linearExpandFactor); + base.setLinearExpandFactor(linearExpandFactor); + } + + public MutableDoubleArrayTrieInteger() + { + this(new Utf8CharacterMapping()); + } + + public MutableDoubleArrayTrieInteger(CharacterMapping charMap) + { + this.charMap = charMap; + clear(); + } + + public void clear() + { + this.base = new IntArrayList(this.charMap.getInitSize()); + this.check = new IntArrayList(this.charMap.getInitSize()); + + this.base.append(0); + this.check.append(0); + + this.base.append(1); + this.check.append(0); + expandArray(this.charMap.getInitSize()); + } + + public int getCheckArraySize() + { + return check.size(); + } + + public int getFreeSize() + { + int count = 0; + int chk = this.check.get(0); + while (chk != 0) + { + count++; + chk = this.check.get(-chk); + } + + return count; + } + + private boolean isLeafValue(int value) + { + return (value > 0) && ((value & LEAF_BIT) != 0); + } + + /** + * 最高4位置1 + * + * @param value + * @return + */ + private int setLeafValue(int value) + { + return value | LEAF_BIT; + } + + /** + * 最高4位置0 + * + * @param value + * @return + */ + private int getLeafValue(int value) + { + return value ^ LEAF_BIT; + } + + public int getBaseArraySize() + { + return this.base.size(); + } + + private int getBase(int index) + { + return this.base.get(index); + } + + private int getCheck(int index) + { + return this.check.get(index); + } + + private void setBase(int index, int value) + { + this.base.set(index, value); + } + + private void setCheck(int index, int value) + { + this.check.set(index, value); + } + + protected boolean isEmpty(int index) + { + return getCheck(index) <= 0; + } + + private int getNextFreeBase(int nextChar) + { + int index = -getCheck(0); + 
while (index != 0) + { + if (index > nextChar + 1) // 因为ROOT的index从1开始,所以至少要大于1 + { + return index - nextChar; + } + index = -getCheck(index); + } + int oldSize = getBaseArraySize(); + expandArray(oldSize + this.base.getLinearExpandFactor()); + return oldSize; + } + + private void addFreeLink(int index) + { + this.check.set(index, this.check.get(-this.base.get(0))); + this.check.set(-this.base.get(0), -index); + this.base.set(index, this.base.get(0)); + this.base.set(0, -index); + } + + /** + * 将index从空闲循环链表中删除 + * + * @param index + */ + private void deleteFreeLink(int index) + { + this.base.set(-this.check.get(index), this.base.get(index)); + this.check.set(-this.base.get(index), this.check.get(index)); + } + + /** + * 动态数组扩容 + * + * @param maxSize 需要的容量 + */ + private void expandArray(int maxSize) + { + int curSize = getBaseArraySize(); + if (curSize > maxSize) + { + return; + } + if (maxSize >= LEAF_BIT) + { + throw new RuntimeException("Double Array Trie size exceeds absolute threshold"); + } + for (int i = curSize; i <= maxSize; ++i) + { + this.base.append(0); + this.check.append(0); + addFreeLink(i); + } + } + + /** + * 插入条目 + * + * @param key 键 + * @param value 值 + * @param overwrite 是否覆盖 + * @return + */ + public boolean insert(String key, int value, boolean overwrite) + { + if ((null == key) || key.length() == 0 || (key.indexOf(UNUSED_CHAR) != -1)) + { + return false; + } + if ((value < 0) || ((value & LEAF_BIT) != 0)) + { + return false; + } + + value = setLeafValue(value); + + int[] ids = this.charMap.toIdList(key + UNUSED_CHAR); + + int fromState = 1; // 根节点的index为1 + int toState = 1; + int index = 0; + while (index < ids.length) + { + int c = ids[index]; + toState = getBase(fromState) + c; // to = base[from] + c + expandArray(toState); + + if (isEmpty(toState)) + { + deleteFreeLink(toState); + + setCheck(toState, fromState); // check[to] = from + if (index == ids.length - 1) // Leaf + { + ++this.size; + setBase(toState, value); // base[to] = value + } 
+ else + { + int nextChar = ids[(index + 1)]; + setBase(toState, getNextFreeBase(nextChar)); // base[to] = free_state - c + } + } + else if (getCheck(toState) != fromState) // 冲突 + { + solveConflict(fromState, c); + continue; + } + fromState = toState; + ++index; + } + if (overwrite) + { + setBase(toState, value); + } + return true; + } + + /** + * 寻找可以放下子节点集合的“连续”空闲区间 + * + * @param children 子节点集合 + * @return base值 + */ + private int searchFreeBase(SortedSet children) + { + int minChild = children.first(); + int maxChild = children.last(); + int current = 0; + while (getCheck(current) != 0) // 循环链表回到了头,说明没有符合要求的“连续”区间 + { + if (current > minChild + 1) + { + int base = current - minChild; + boolean ok = true; + for (Iterator it = children.iterator(); it.hasNext(); ) // 检查是否每个子节点的位置都空闲(“连续”区间) + { + int to = base + it.next(); + if (to >= getBaseArraySize()) + { + ok = false; + break; + } + if (!isEmpty(to)) + { + ok = false; + break; + } + } + if (ok) + { + return base; + } + } + current = -getCheck(current); // 从链表中取出下一个空闲位置 + } + int oldSize = getBaseArraySize(); // 没有足够长的“连续”空闲区间,所以在双数组尾部额外分配一块 + expandArray(oldSize + maxChild); + return oldSize; + } + + /** + * 解决冲突 + * + * @param parent 父节点 + * @param newChild 子节点的char值 + */ + private void solveConflict(int parent, int newChild) + { + // 找出parent的所有子节点 + TreeSet children = new TreeSet(); + children.add(newChild); + final int charsetSize = this.charMap.getCharsetSize(); + for (int c = 0; c < charsetSize; ++c) + { + int next = getBase(parent) + c; + if (next >= getBaseArraySize()) + { + break; + } + if (getCheck(next) == parent) + { + children.add(c); + } + } + // 移动旧子节点到新的位置 + int newBase = searchFreeBase(children); + children.remove(newChild); + for (Integer c : children) + { + int child = newBase + c; + deleteFreeLink(child); + + setCheck(child, parent); + int childBase = getBase(getBase(parent) + c); + setBase(child, childBase); + + if (!isLeafValue(childBase)) + { + for (int d = 0; d < charsetSize; ++d) + { 
+ int to = childBase + d; + if (to >= getBaseArraySize()) + { + break; + } + if (getCheck(to) == getBase(parent) + c) + { + setCheck(to, child); + } + } + } + addFreeLink(getBase(parent) + c); + } + // 更新新base值 + setBase(parent, newBase); + } + + /** + * 键值对个数 + * + * @return + */ + public int size() + { + return this.size; + } + + public boolean isEmpty() + { + return size == 0; + } + + /** + * 覆盖模式添加 + * + * @param key + * @param value + * @return + */ + public boolean insert(String key, int value) + { + return insert(key, value, true); + } + + /** + * 非覆盖模式添加 + * + * @param key + * @param value + * @return + */ + public boolean add(String key, int value) + { + return insert(key, value, false); + } + + /** + * 非覆盖模式添加,值默认为当前集合大小 + * + * @param key + * @return + */ + public boolean add(String key) + { + return add(key, size); + } + + /** + * 查询以prefix开头的所有键 + * + * @param prefix + * @return + */ + public List prefixMatch(String prefix) + { + int curState = 1; + IntArrayList bytes = new IntArrayList(prefix.length() * 4); + for (int i = 0; i < prefix.length(); i++) + { + int codePoint = prefix.charAt(i); + if (curState < 1) + { + return Collections.emptyList(); + } + if ((curState != 1) && (isEmpty(curState))) + { + return Collections.emptyList(); + } + int[] ids = this.charMap.toIdList(codePoint); + if (ids.length == 0) + { + return Collections.emptyList(); + } + for (int j = 0; j < ids.length; j++) + { + int c = ids[j]; + if ((getBase(curState) + c < getBaseArraySize()) + && (getCheck(getBase(curState) + c) == curState)) + { + bytes.append(c); + curState = getBase(curState) + c; + } + else + { + return Collections.emptyList(); + } + } + + } + List result = new ArrayList(); + recursiveAddSubTree(curState, result, bytes); + + return result; + } + + private void recursiveAddSubTree(int curState, List result, IntArrayList bytes) + { + if (getCheck(getBase(curState) + UNUSED_CHAR_VALUE) == curState) + { + byte[] array = new byte[bytes.size()]; + for (int i = 0; i < 
bytes.size(); i++) + { + array[i] = (byte) bytes.get(i); + } + result.add(new String(array, Utf8CharacterMapping.UTF_8)); + } + int base = getBase(curState); + for (int c = 0; c < charMap.getCharsetSize(); c++) + { + if (c == UNUSED_CHAR_VALUE) continue; + if (base + c < getBaseArraySize() && getCheck(base + c) == curState) // 先检查越界,再读取check,避免数组越界 + { + bytes.append(c); + recursiveAddSubTree(base + c, result, bytes); + bytes.removeLast(); + } + } + } + + /** + * 最长查询 + * + * @param query + * @param start + * @return (最长长度,对应的值) + */ + public int[] findLongest(CharSequence query, int start) + { + if ((query == null) || (start >= query.length())) + { + return new int[]{0, -1}; + } + int state = 1; + int maxLength = 0; + int lastVal = -1; + for (int i = start; i < query.length(); i++) + { + int[] res = transferValues(state, query.charAt(i)); + if (res[0] == -1) + { + break; + } + state = res[0]; + if (res[1] != -1) + { + maxLength = i - start + 1; + lastVal = res[1]; + } + } + return new int[]{maxLength, lastVal}; + } + + public int[] findWithSupplementary(String query, int start) + { + if ((query == null) || (start >= query.length())) + { + return new int[]{0, -1}; + } + int curState = 1; + int maxLength = 0; + int lastVal = -1; + int charCount = 1; + for (int i = start; i < query.length(); i += charCount) + { + int codePoint = query.codePointAt(i); + charCount = Character.charCount(codePoint); + int[] res = transferValues(curState, codePoint); + if (res[0] == -1) + { + break; + } + curState = res[0]; + if (res[1] != -1) + { + maxLength = i - start + 1; + lastVal = res[1]; + } + } + return new int[]{maxLength, lastVal}; + + } + + public List<int[]> findAllWithSupplementary(String query, int start) + { + List<int[]> ret = new ArrayList<int[]>(5); + if ((query == null) || (start >= query.length())) + { + return ret; + } + int curState = 1; + int charCount = 1; + for (int i = start; i < query.length(); i += charCount) + { + int codePoint = query.codePointAt(i); + charCount = 
Character.charCount(codePoint); + int[] res = transferValues(curState, codePoint); + if (res[0] == -1) + { + break; + } + curState = res[0]; + if (res[1] != -1) + { + ret.add(new int[]{i - start + 1, res[1]}); + } + } + return ret; + } + + /** + * 查询与query的前缀重合的所有词语 + * + * @param query + * @param start + * @return + */ + public List commonPrefixSearch(String query, int start) + { + List ret = new ArrayList(5); + if ((query == null) || (start >= query.length())) + { + return ret; + } + int curState = 1; + for (int i = start; i < query.length(); i++) + { + int[] res = transferValues(curState, query.charAt(i)); + if (res[0] == -1) + { + break; + } + curState = res[0]; + if (res[1] != -1) + { + ret.add(new int[]{i - start + 1, res[1]}); + } + } + return ret; + } + + /** + * 转移状态并输出值 + * + * @param state + * @param codePoint char + * @return + */ + public int[] transferValues(int state, int codePoint) + { + if (state < 1) + { + return EMPTY_WALK_STATE; + } + if ((state != 1) && (isEmpty(state))) + { + return EMPTY_WALK_STATE; + } + int[] ids = this.charMap.toIdList(codePoint); + if (ids.length == 0) + { + return EMPTY_WALK_STATE; + } + for (int i = 0; i < ids.length; i++) + { + int c = ids[i]; + if ((getBase(state) + c < getBaseArraySize()) + && (getCheck(getBase(state) + c) == state)) + { + state = getBase(state) + c; + } + else + { + return EMPTY_WALK_STATE; + } + } + if (getCheck(getBase(state) + UNUSED_CHAR_VALUE) == state) + { + int value = getLeafValue(getBase(getBase(state) + + UNUSED_CHAR_VALUE)); + return new int[]{state, value}; + } + return new int[]{state, -1}; + } + + /** + * 转移状态 + * + * @param state + * @param codePoint + * @return + */ + public int transfer(int state, int codePoint) + { + if (state < 1) + { + return -1; + } + if ((state != 1) && (isEmpty(state))) + { + return -1; + } + int[] ids = this.charMap.toIdList(codePoint); + if (ids.length == 0) + { + return -1; + } + return transfer(state, ids); + } + + /** + * 转移状态 + * + * @param state + * 
@param ids + * @return + */ + private int transfer(int state, int[] ids) + { + for (int c : ids) + { + if ((getBase(state) + c < getBaseArraySize()) + && (getCheck(getBase(state) + c) == state)) + { + state = getBase(state) + c; + } + else + { + return -1; + } + } + return state; + } + + public int stateValue(int state) + { + int leaf = getBase(state) + UNUSED_CHAR_VALUE; + if (getCheck(leaf) == state) + { + return getLeafValue(getBase(leaf)); + } + return -1; + } + + /** + * 去掉多余的buffer + */ + public void loseWeight() + { + base.loseWeight(); + check.loseWeight(); + } + + /** + * 将值大于等于from的统一递减1
+ * + * @param from + */ + public void decreaseValues(int from) + { + for (int state = 1; state < getBaseArraySize(); ++state) + { + int leaf = getBase(state) + UNUSED_CHAR_VALUE; + if (1 < leaf && leaf < getCheckArraySize() && getCheck(leaf) == state) + { + int value = getLeafValue(getBase(leaf)); + if (value >= from) + { + setBase(leaf, setLeafValue(--value)); + } + } + } + } + + /** + * 精确查询 + * + * @param key + * @param start + * @return -1表示不存在 + */ + public int get(String key, int start) + { + assert key != null; + assert 0 <= start && start <= key.length(); + int state = 1; + int[] ids = charMap.toIdList(key.substring(start)); + state = transfer(state, ids); + if (state < 0) + { + return -1; + } + return stateValue(state); + } + + /** + * 精确查询 + * + * @param key + * @return -1表示不存在 + */ + public int get(String key) + { + return get(key, 0); + } + + /** + * 设置键值 (同put) + * + * @param key + * @param value + * @return 是否设置成功(失败的原因是键值不合法) + */ + public boolean set(String key, int value) + { + return insert(key, value, true); + } + + /** + * 设置键值 (同set) + * + * @param key + * @param value + * @return 是否设置成功(失败的原因是键值不合法) + */ + public boolean put(String key, int value) + { + return insert(key, value, true); + } + + /** + * 删除键 + * + * @param key + * @return 值 + */ + public int remove(String key) + { + return delete(key); + } + + /** + * 删除键 + * + * @param key + * @return 值 + */ + public int delete(String key) + { + if (key == null) + { + return -1; + } + int curState = 1; + int[] ids = this.charMap.toIdList(key); + + int[] path = new int[ids.length + 1]; + int i = 0; + for (; i < ids.length; i++) + { + int c = ids[i]; + if ((getBase(curState) + c >= getBaseArraySize()) + || (getCheck(getBase(curState) + c) != curState)) + { + break; + } + curState = getBase(curState) + c; + path[i] = curState; + } + int ret = -1; + if (i == ids.length) + { + if (getCheck(getBase(curState) + UNUSED_CHAR_VALUE) == curState) + { + --this.size; + ret = 
getLeafValue(getBase(getBase(curState) + UNUSED_CHAR_VALUE)); + path[(path.length - 1)] = (getBase(curState) + UNUSED_CHAR_VALUE); + for (int j = path.length - 1; j >= 0; --j) + { + boolean isLeaf = true; + int state = path[j]; + for (int k = 0; k < this.charMap.getCharsetSize(); k++) + { + if (isLeafValue(getBase(state))) + { + break; + } + if ((getBase(state) + k < getBaseArraySize()) + && (getCheck(getBase(state) + k) == state)) + { + isLeaf = false; + break; + } + } + if (!isLeaf) + { + break; + } + addFreeLink(state); + } + } + } + return ret; + } + + /** + * 获取空闲的数组元素个数 + * + * @return + */ + public int getEmptySize() + { + int size = 0; + for (int i = 0; i < getBaseArraySize(); i++) + { + if (isEmpty(i)) + { + ++size; + } + } + return size; + } + + /** + * 可以设置的最大值 + * + * @return + */ + public int getMaximumValue() + { + return LEAF_BIT - 1; + } + + public Set> entrySet() + { + return new Set>() + { + @Override + public int size() + { + return MutableDoubleArrayTrieInteger.this.size; + } + + @Override + public boolean isEmpty() + { + return MutableDoubleArrayTrieInteger.this.isEmpty(); + } + + @Override + public boolean contains(Object o) + { + throw new UnsupportedOperationException(); + } + + @Override + public Iterator> iterator() + { + return new Iterator>() + { + KeyValuePair iterator = MutableDoubleArrayTrieInteger.this.iterator(); + + @Override + public boolean hasNext() + { + return iterator.hasNext(); + } + + @Override + public void remove() + { + throw new UnsupportedOperationException(); + } + + @Override + public Map.Entry next() + { + iterator.next(); + return new AbstractMap.SimpleEntry(iterator.key, iterator.value); + } + }; + } + + @Override + public Object[] toArray() + { + ArrayList> entries = new ArrayList>(size); + for (Map.Entry entry : this) + { + entries.add(entry); + } + return entries.toArray(); + } + + @Override + public T[] toArray(T[] a) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean 
add(Map.Entry stringIntegerEntry) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean remove(Object o) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean containsAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean addAll(Collection> c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean retainAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean removeAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public void clear() + { + throw new UnsupportedOperationException(); + } + }; + } + + @Override + public KeyValuePair iterator() + { + return new KeyValuePair(); + } + + public boolean containsKey(String key) + { + return get(key) != -1; + } + + public Set keySet() + { + return new Set() + { + @Override + public int size() + { + return MutableDoubleArrayTrieInteger.this.size; + } + + @Override + public boolean isEmpty() + { + return MutableDoubleArrayTrieInteger.this.isEmpty(); + } + + @Override + public boolean contains(Object o) + { + return MutableDoubleArrayTrieInteger.this.containsKey((String) o); + } + + @Override + public Iterator iterator() + { + return new Iterator() + { + KeyValuePair iterator = MutableDoubleArrayTrieInteger.this.iterator(); + + @Override + public void remove() + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean hasNext() + { + return iterator.hasNext(); + } + + @Override + public String next() + { + return iterator.next().key; + } + }; + } + + @Override + public Object[] toArray() + { + throw new UnsupportedOperationException(); + } + + @Override + public T[] toArray(T[] a) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean add(String s) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean remove(Object o) + { + throw 
new UnsupportedOperationException(); + } + + @Override + public boolean containsAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean addAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean retainAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public boolean removeAll(Collection c) + { + throw new UnsupportedOperationException(); + } + + @Override + public void clear() + { + throw new UnsupportedOperationException(); + } + }; + } + + @Override + public void save(DataOutputStream out) throws IOException + { + if (!(charMap instanceof Utf8CharacterMapping)) + { + logger.warning("将来需要在构造的时候传入 " + charMap.getClass()); + } + out.writeInt(size); + base.save(out); + check.save(out); + } + + @Override + public boolean load(ByteArray byteArray) + { + size = byteArray.nextInt(); + if (!base.load(byteArray)) return false; + if (!check.load(byteArray)) return false; + return true; + } + + private void writeObject(ObjectOutputStream out) throws IOException + { + out.writeInt(size); + out.writeObject(base); + out.writeObject(check); + } + + private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException + { + size = in.readInt(); + base = (IntArrayList) in.readObject(); + check = (IntArrayList) in.readObject(); + charMap = new Utf8CharacterMapping(); + } + +// /** +// * 遍历时无法删除 +// * +// * @return +// */ +// public DATIterator iterator() +// { +// return new KeyValuePair(); +// } + + public class KeyValuePair implements Iterator + { + /** + * 储存(index, charPoint) + */ + private IntArrayList path; + /** + * 当前所处的键值的索引 + */ + private int index; + private int value = -1; + private String key = null; + private int currentBase; + + public KeyValuePair() + { + path = new IntArrayList(20); + path.append(1); // ROOT + int from = 1; + int b = base.get(from); + if (size > 0) + { + while (true) + { + for (int i = 0; i < 
charMap.getCharsetSize(); i++) + { + int c = check.get(b + i); + if (c == from) + { + path.append(i); + from = b + i; + path.append(from); + b = base.get(from); + i = 0; + if (getCheck(b + UNUSED_CHAR_VALUE) == from) + { + value = getLeafValue(getBase(b + UNUSED_CHAR_VALUE)); + int[] ids = new int[path.size() / 2]; + for (int k = 0, j = 1; j < path.size(); k++, j += 2) + { + ids[k] = path.get(j); + } + key = charMap.toString(ids); + path.append(UNUSED_CHAR_VALUE); + currentBase = b; + return; + } + } + } + } + } + } + + public String key() + { + return key; + } + + public int value() + { + return value; + } + + public String getKey() + { + return key; + } + + public int getValue() + { + return value; + } + + public int setValue(int v) + { + int value = getLeafValue(v); + setBase(currentBase + UNUSED_CHAR_VALUE, value); + this.value = v; + return v; + } + + @Override + public boolean hasNext() + { + return index < size; + } + + @Override + public KeyValuePair next() + { + if (index >= size) + { + throw new NoSuchElementException(); + } + else if (index == 0) + { + } + else + { + while (path.size() > 0) + { + int charPoint = path.pop(); + int base = path.getLast(); + int n = getNext(base, charPoint); + if (n != -1) break; + path.removeLast(); + } + } + + ++index; + return this; + } + + @Override + public void remove() + { + throw new UnsupportedOperationException(); + } + + /** + * 遍历下一个终止路径 + * + * @param parent 父节点 + * @param charPoint 子节点的char + * @return + */ + private int getNext(int parent, int charPoint) + { + int startChar = charPoint + 1; + int baseParent = getBase(parent); + int from = parent; + + for (int i = startChar; i < charMap.getCharsetSize(); i++) + { + int to = baseParent + i; + if (check.size() > to && check.get(to) == from) + { + path.append(i); + from = to; + path.append(from); + baseParent = base.get(from); + if (getCheck(baseParent + UNUSED_CHAR_VALUE) == from) + { + value = getLeafValue(getBase(baseParent + UNUSED_CHAR_VALUE)); + int[] ids = 
new int[path.size() / 2]; + for (int k = 0, j = 1; j < path.size(); ++k, j += 2) + { + ids[k] = path.get(j); + } + key = charMap.toString(ids); + path.append(UNUSED_CHAR_VALUE); + currentBase = baseParent; + return from; + } + else + { + return getNext(from, 0); + } + } + } + return -1; + } + + @Override + public String toString() + { + return key + '=' + value; + } + } + +} diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/datrie/Utf8CharacterMapping.java b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/Utf8CharacterMapping.java new file mode 100644 index 000000000..3d3e973df --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/Utf8CharacterMapping.java @@ -0,0 +1,119 @@ +package com.hankcs.hanlp.collection.trie.datrie; + +import java.io.Serializable; +import java.io.UnsupportedEncodingException; +import java.nio.charset.Charset; + +/** + * UTF-8编码到int的映射 + */ +public class Utf8CharacterMapping implements CharacterMapping, Serializable +{ + private static final long serialVersionUID = -6529481088518753872L; + private static final int N = 256; + private static final int[] EMPTYLIST = new int[0]; + public static final Charset UTF_8 = Charset.forName("UTF-8"); + + @Override + public int getInitSize() + { + return N; + } + + @Override + public int getCharsetSize() + { + return N; + } + + @Override + public int zeroId() + { + return 0; + } + + @Override + public int[] toIdList(String key) + { + + byte[] bytes = key.getBytes(UTF_8); + int[] res = new int[bytes.length]; + for (int i = 0; i < res.length; i++) + { + res[i] = bytes[i] & 0xFF; // unsigned byte + } + if ((res.length == 1) && (res[0] == 0)) + { + return EMPTYLIST; + } + return res; + } + + /** + * codes ported from iconv lib in utf8.h utf8_codepointtomb + */ + @Override + public int[] toIdList(int codePoint) + { + int count; + if (codePoint < 0x80) + count = 1; + else if (codePoint < 0x800) + count = 2; + else if (codePoint < 0x10000) + count = 3; + else if (codePoint < 
0x200000) + count = 4; + else if (codePoint < 0x4000000) + count = 5; + else if (codePoint <= 0x7fffffff) + count = 6; + else + return EMPTYLIST; + int[] r = new int[count]; + switch (count) + { /* note: code falls through cases! */ + case 6: + r[5] = (char) (0x80 | (codePoint & 0x3f)); + codePoint = codePoint >> 6; + codePoint |= 0x4000000; + case 5: + r[4] = (char) (0x80 | (codePoint & 0x3f)); + codePoint = codePoint >> 6; + codePoint |= 0x200000; + case 4: + r[3] = (char) (0x80 | (codePoint & 0x3f)); + codePoint = codePoint >> 6; + codePoint |= 0x10000; + case 3: + r[2] = (char) (0x80 | (codePoint & 0x3f)); + codePoint = codePoint >> 6; + codePoint |= 0x800; + case 2: + r[1] = (char) (0x80 | (codePoint & 0x3f)); + codePoint = codePoint >> 6; + codePoint |= 0xc0; + case 1: + r[0] = (char) codePoint; + } + return r; + } + + @Override + public String toString(int[] ids) + { + byte[] bytes = new byte[ids.length]; + for (int i = 0; i < ids.length; i++) + { + bytes[i] = (byte) ids[i]; + } + try + { + return new String(bytes, "UTF-8"); + } + catch (UnsupportedEncodingException e) + { + return null; + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/collection/trie/datrie/package-info.java b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/package-info.java new file mode 100644 index 000000000..03d462bba --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/collection/trie/datrie/package-info.java @@ -0,0 +1,14 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-02-28 下午9:17 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +/** + * 可变双数组trie树,可以当做Map来用。如果V是int,可以直接用MutableDoubleArrayTrieInteger + */ +package com.hankcs.hanlp.collection.trie.datrie; \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/corpus/dependency/CoNll/CoNLLSentence.java b/src/main/java/com/hankcs/hanlp/corpus/dependency/CoNll/CoNLLSentence.java index fdeb3c254..3a0525b7f 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dependency/CoNll/CoNLLSentence.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dependency/CoNll/CoNLLSentence.java @@ -12,6 +12,7 @@ package com.hankcs.hanlp.corpus.dependency.CoNll; import java.util.Iterator; +import java.util.LinkedList; import java.util.List; /** @@ -127,4 +128,37 @@ public void remove() } }; } + + /** + * 找出所有子节点 + * @param word + * @return + */ + public List findChildren(CoNLLWord word) + { + List result = new LinkedList(); + for (CoNLLWord other : this) + { + if (other.HEAD == word) + result.add(other); + } + return result; + } + + /** + * 找出特定依存关系的子节点 + * @param word + * @param relation + * @return + */ + public List findChildren(CoNLLWord word, String relation) + { + List result = new LinkedList(); + for (CoNLLWord other : this) + { + if (other.HEAD == word && other.DEPREL.equals(relation)) + result.add(other); + } + return result; + } } diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/CommonDictionaryMaker.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/CommonDictionaryMaker.java index 2f8db63ce..7dd7852ab 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dictionary/CommonDictionaryMaker.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/CommonDictionaryMaker.java @@ -11,16 +11,22 @@ */ package com.hankcs.hanlp.corpus.dictionary; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; import com.hankcs.hanlp.corpus.document.sentence.word.IWord; import 
com.hankcs.hanlp.corpus.document.sentence.word.Word; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.LinkedList; import java.util.List; -import static com.hankcs.hanlp.utility.Predefine.logger; /** * @author hankcs */ public abstract class CommonDictionaryMaker implements ISaveAble { - static boolean verbose = false; + public boolean verbose = false; /** * 语料库中的单词 */ @@ -64,6 +70,51 @@ public void compute(List> sentenceList) addToDictionary(sentenceList); } + /** + * 同compute + * @param sentenceList + */ + public void learn(List sentenceList) + { + List> s = new ArrayList>(sentenceList.size()); + for (Sentence sentence : sentenceList) + { + s.add(sentence.wordList); + } + compute(s); + } + + /** + * 同compute + * @param sentences + */ + public void learn(Sentence ... sentences) + { + learn(Arrays.asList(sentences)); + } + + /** + * 训练 + * @param corpus 语料库路径 + */ + public void train(String corpus) + { + CorpusLoader.walk(corpus, new CorpusLoader.Handler() + { + @Override + public void handle(Document document) + { + List> simpleSentenceList = document.getSimpleSentenceList(); + List> compatibleList = new LinkedList>(); + for (List wordList : simpleSentenceList) + { + compatibleList.add(new LinkedList(wordList)); + } + CommonDictionaryMaker.this.compute(compatibleList); + } + }); + } + /** * 加入到词典中,允许子类自定义过滤等等,这样比较灵活 * @param sentenceList diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/EasyDictionary.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/EasyDictionary.java index a26fd4e43..80a7807c1 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dictionary/EasyDictionary.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/EasyDictionary.java @@ -48,8 +48,7 @@ public static EasyDictionary create(String path) private boolean load(String path) { logger.info("通用词典开始加载:" + path); - List wordList = new ArrayList(); - List attributeList = new ArrayList(); + TreeMap map = new TreeMap(); BufferedReader br = null; try { @@ 
-57,19 +56,18 @@ private boolean load(String path) String line; while ((line = br.readLine()) != null) { - String param[] = line.split(" "); - wordList.add(param[0]); + String param[] = line.split("\\s+"); int natureCount = (param.length - 1) / 2; Attribute attribute = new Attribute(natureCount); for (int i = 0; i < natureCount; ++i) { - attribute.nature[i] = Enum.valueOf(Nature.class, param[1 + 2 * i]); + attribute.nature[i] = Nature.create(param[1 + 2 * i]); attribute.frequency[i] = Integer.parseInt(param[2 + 2 * i]); attribute.totalFrequency += attribute.frequency[i]; } - attributeList.add(attribute); + map.put(param[0], attribute); } - logger.info("通用词典读入词条" + wordList.size() + " 属性" + attributeList.size()); + logger.info("通用词典读入词条" + map.size()); br.close(); } catch (FileNotFoundException e) @@ -83,7 +81,7 @@ private boolean load(String path) return false; } - logger.info("通用词典DAT构建结果:" + trie.build(wordList, attributeList)); + logger.info("通用词典DAT构建结果:" + trie.build(map)); logger.info("通用词典加载成功:" + trie.size() +"个词条" ); return true; } @@ -206,7 +204,7 @@ public int getNatureFrequency(String nature) { try { - Nature pos = Enum.valueOf(Nature.class, nature); + Nature pos = Nature.create(nature); return getNatureFrequency(pos); } catch (IllegalArgumentException e) diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NGramDictionaryMaker.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NGramDictionaryMaker.java index d7d1f8794..602812911 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NGramDictionaryMaker.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NGramDictionaryMaker.java @@ -74,7 +74,7 @@ public boolean saveNGramToTxt(String path) { try { - BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path))); + BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path), "UTF-8")); for (Map.Entry entry : trie.entrySet()) { bw.write(entry.getKey() + " " + 
entry.getValue()); diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NRDictionaryMaker.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NRDictionaryMaker.java index c7920ad51..53d893539 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NRDictionaryMaker.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NRDictionaryMaker.java @@ -11,21 +11,16 @@ */ package com.hankcs.hanlp.corpus.dictionary; -import com.hankcs.hanlp.corpus.document.CorpusLoader; -import com.hankcs.hanlp.corpus.document.Document; import com.hankcs.hanlp.corpus.document.sentence.word.IWord; import com.hankcs.hanlp.corpus.document.sentence.word.Word; import com.hankcs.hanlp.corpus.tag.NR; import com.hankcs.hanlp.corpus.tag.Nature; -import com.hankcs.hanlp.corpus.util.Precompiler; import com.hankcs.hanlp.utility.Predefine; import java.util.LinkedList; import java.util.List; import java.util.ListIterator; -import static com.hankcs.hanlp.utility.Predefine.logger; - /** * nr词典(词典+ngram转移+词性转移矩阵)制作工具 * @author hankcs @@ -41,7 +36,8 @@ public NRDictionaryMaker(EasyDictionary dictionary) @Override protected void addToDictionary(List> sentenceList) { - logger.warning("开始制作词典"); + if (verbose) + System.out.println("开始制作词典"); // 将非A的词语保存下来 for (List wordList : sentenceList) { @@ -71,12 +67,16 @@ protected void addToDictionary(List> sentenceList) @Override protected void roleTag(List> sentenceList) { - logger.info("开始标注角色"); + if (verbose) + System.out.println("开始标注角色"); int i = 0; for (List wordList : sentenceList) { - logger.info(++i + " / " + sentenceList.size()); - if (verbose) System.out.println("原始语料 " + wordList); + if (verbose) + { + System.out.println(++i + " / " + sentenceList.size()); + System.out.println("原始语料 " + wordList); + } // 先标注A和K IWord pre = new Word("##始##", "begin"); ListIterator listIterator = wordList.listIterator(); @@ -89,7 +89,7 @@ protected void roleTag(List> sentenceList) } else { - if (!pre.getLabel().equals(Nature.nr.toString())) + if 
(!pre.getLabel().equals(Nature.nr.toString()) && !pre.getValue().equals(Predefine.TAG_BIGIN)) { pre.setLabel(NR.K.toString()); } @@ -158,6 +158,8 @@ else if (word.getValue().endsWith("哥") word.setValue(word.getValue().substring(0, 1)); word.setLabel(NR.B.toString()); break; + default: + word.setLabel(NR.A.toString()); // 非中国人名 } } } diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NSDictionaryMaker.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NSDictionaryMaker.java index e404bd82d..3f7be8dd9 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NSDictionaryMaker.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NSDictionaryMaker.java @@ -91,7 +91,7 @@ protected void roleTag(List> sentenceList) while (iterator.hasNext()) { IWord current = iterator.next(); - if (current.getLabel().startsWith("ns") && !pre.getLabel().startsWith("ns")) + if (current.getLabel().startsWith("ns") && !pre.getLabel().startsWith("ns") && !pre.getValue().equals(Predefine.TAG_BIGIN)) { pre.setLabel(NS.A.toString()); } diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NTDictionaryMaker.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NTDictionaryMaker.java index c0eb52e21..ce814b2f7 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dictionary/NTDictionaryMaker.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/NTDictionaryMaker.java @@ -89,7 +89,7 @@ protected void roleTag(List> sentenceList) while (iterator.hasNext()) { IWord current = iterator.next(); - if (current.getLabel().startsWith("nt") && !pre.getLabel().startsWith("nt")) + if (current.getLabel().startsWith("nt") && !pre.getLabel().startsWith("nt") && !pre.getValue().equals(Predefine.TAG_BIGIN)) { pre.setLabel(NT.A.toString()); } diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/StringDictionary.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/StringDictionary.java index 5cf629676..a91abbac8 100644 --- 
a/src/main/java/com/hankcs/hanlp/corpus/dictionary/StringDictionary.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/StringDictionary.java @@ -62,7 +62,7 @@ public boolean save(String path) { try { - BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path))); + BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path), "UTF-8")); for (Map.Entry entry : trie.entrySet()) { bw.write(entry.getKey()); diff --git a/src/main/java/com/hankcs/hanlp/corpus/dictionary/TMDictionaryMaker.java b/src/main/java/com/hankcs/hanlp/corpus/dictionary/TMDictionaryMaker.java index fed0c6121..dce7fa059 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/dictionary/TMDictionaryMaker.java +++ b/src/main/java/com/hankcs/hanlp/corpus/dictionary/TMDictionaryMaker.java @@ -92,7 +92,7 @@ public boolean saveTxtTo(String path) { try { - BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path))); + BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path), "UTF-8")); bw.write(toString()); bw.close(); } diff --git a/src/main/java/com/hankcs/hanlp/corpus/document/CorpusLoader.java b/src/main/java/com/hankcs/hanlp/corpus/document/CorpusLoader.java index 2d946df9a..7cc03de3f 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/document/CorpusLoader.java +++ b/src/main/java/com/hankcs/hanlp/corpus/document/CorpusLoader.java @@ -14,7 +14,6 @@ import com.hankcs.hanlp.corpus.document.sentence.Sentence; import com.hankcs.hanlp.corpus.document.sentence.word.IWord; import com.hankcs.hanlp.corpus.document.sentence.word.Word; -import com.hankcs.hanlp.corpus.io.FolderWalker; import com.hankcs.hanlp.corpus.io.IOUtil; import java.io.File; @@ -31,7 +30,7 @@ public class CorpusLoader public static void walk(String folderPath, Handler handler) { long start = System.currentTimeMillis(); - List fileList = FolderWalker.open(folderPath); + List fileList = 
IOUtil.fileList(folderPath); int i = 0; for (File file : fileList) { @@ -46,7 +45,7 @@ public static void walk(String folderPath, Handler handler) public static void walk(String folderPath, HandlerThread[] threadArray) { long start = System.currentTimeMillis(); - List fileList = FolderWalker.open(folderPath); + List fileList = IOUtil.fileList(folderPath); for (int i = 0; i < threadArray.length - 1; ++i) { threadArray[i].fileList = fileList.subList(fileList.size() / threadArray.length * i, fileList.size() / threadArray.length * (i + 1)); @@ -69,20 +68,35 @@ public static void walk(String folderPath, HandlerThread[] threadArray) } public static List convert2DocumentList(String folderPath) + { + return convert2DocumentList(folderPath, false); + } + + /** + * 读取整个目录中的人民日报格式语料 + * + * @param folderPath 路径 + * @param verbose + * @return + */ + public static List convert2DocumentList(String folderPath, boolean verbose) { long start = System.currentTimeMillis(); - List fileList = FolderWalker.open(folderPath); + List fileList = IOUtil.fileList(folderPath); List documentList = new LinkedList(); int i = 0; for (File file : fileList) { - System.out.print(file); + if (verbose) System.out.print(file); Document document = convert2Document(file); documentList.add(document); - System.out.println(" " + ++i + " / " + fileList.size()); + if (verbose) System.out.println(" " + ++i + " / " + fileList.size()); + } + if (verbose) + { + System.out.println(documentList.size()); + System.out.printf("花费时间%d ms\n", System.currentTimeMillis() - start); } - System.out.println(documentList.size()); - System.out.printf("花费时间%d ms\n", System.currentTimeMillis() - start); return documentList; } @@ -137,7 +151,7 @@ public static Document convert2Document(File file) { // try // { - Document document = Document.create(IOUtil.readTxt(file.getPath())); + Document document = Document.create(file); if (document != null) { return document; diff --git 
a/src/main/java/com/hankcs/hanlp/corpus/document/Document.java b/src/main/java/com/hankcs/hanlp/corpus/document/Document.java index add6a6a6f..309bcfc8c 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/document/Document.java +++ b/src/main/java/com/hankcs/hanlp/corpus/document/Document.java @@ -15,14 +15,18 @@ import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; import com.hankcs.hanlp.corpus.document.sentence.word.IWord; import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.corpus.io.IOUtil; +import java.io.File; import java.io.Serializable; import java.util.LinkedList; import java.util.List; import java.util.Set; import java.util.regex.Matcher; import java.util.regex.Pattern; + import static com.hankcs.hanlp.utility.Predefine.logger; + /** * @author hankcs */ @@ -90,6 +94,7 @@ public List getSimpleWordList() /** * 获取简单的句子列表,其中复合词会被拆分为简单词 + * * @return */ public List> getSimpleSentenceList() @@ -120,6 +125,7 @@ public List> getSimpleSentenceList() /** * 获取复杂句子列表,句子中的每个单词有可能是复合词,有可能是简单词 + * * @return */ public List> getComplexSentenceList() @@ -135,6 +141,7 @@ public List> getComplexSentenceList() /** * 获取简单的句子列表 + * * @param spilt 如果为真,其中复合词会被拆分为简单词 * @return */ @@ -173,6 +180,7 @@ public List> getSimpleSentenceList(boolean spilt) /** * 获取简单的句子列表,其中复合词的标签如果是set中指定的话会被拆分为简单词 + * * @param labelSet * @return */ @@ -221,4 +229,23 @@ public String toString() if (sb.length() > 0) sb.deleteCharAt(sb.length() - 1); return sb.toString(); } + + public static Document create(File file) + { + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(file.getAbsolutePath()); + List sentenceList = new LinkedList(); + for (String line : lineIterator) + { + line = line.trim(); + if (line.isEmpty()) continue; + Sentence sentence = Sentence.create(line); + if (sentence == null) + { + logger.warning("使用 " + line + " 创建句子失败"); + return null; + } + sentenceList.add(sentence); + } + return new Document(sentenceList); + } } diff --git 
a/src/main/java/com/hankcs/hanlp/corpus/document/sentence/Sentence.java b/src/main/java/com/hankcs/hanlp/corpus/document/sentence/Sentence.java index ede096858..744a6cb50 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/document/sentence/Sentence.java +++ b/src/main/java/com/hankcs/hanlp/corpus/document/sentence/Sentence.java @@ -11,18 +11,27 @@ */ package com.hankcs.hanlp.corpus.document.sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; import com.hankcs.hanlp.corpus.document.sentence.word.WordFactory; +import com.hankcs.hanlp.dictionary.other.PartOfSpeechTagDictionary; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; import java.io.Serializable; import java.util.Iterator; import java.util.LinkedList; import java.util.List; +import java.util.ListIterator; import java.util.regex.Matcher; import java.util.regex.Pattern; + import static com.hankcs.hanlp.utility.Predefine.logger; + /** * 句子,指的是以。!等标点结尾的句子 + * * @author hankcs */ public class Sentence implements Serializable, Iterable @@ -44,26 +53,175 @@ public String toString() int i = 1; for (IWord word : wordList) { - sb.append(word.getValue()); - String label = word.getLabel(); - if (label != null) + sb.append(word); + if (i != wordList.size()) sb.append(' '); + ++i; + } + return sb.toString(); + } + + /** + * 转换为空格分割无标签的String + * + * @return + */ + public String toStringWithoutLabels() + { + StringBuilder sb = new StringBuilder(size() * 4); + int i = 1; + for (IWord word : wordList) + { + if (word instanceof CompoundWord) { - sb.append('/').append(label); + int j = 0; + for (Word w : ((CompoundWord) word).innerList) + { + sb.append(w.getValue()); + if (++j != ((CompoundWord) word).innerList.size()) + sb.append(' '); + } } + else + sb.append(word.getValue()); if (i != wordList.size()) 
sb.append(' '); ++i; } return sb.toString(); } + /** + * brat standoff format
+ * http://brat.nlplab.org/standoff.html + * + * @return + */ + public String toStandoff() + { + return toStandoff(false); + } + + /** + * brat standoff format
+ * http://brat.nlplab.org/standoff.html + * + * @param withComment + * @return + */ + public String toStandoff(boolean withComment) + { + StringBuilder sb = new StringBuilder(size() * 4); + String delimiter = " "; + String text = text(delimiter); + sb.append(text).append('\n'); + int i = 1; + int offset = 0; + for (IWord word : wordList) + { + assert text.charAt(offset) == word.getValue().charAt(0); + printWord(word, sb, i, offset, withComment); + ++i; + if (word instanceof CompoundWord) + { + int offsetChild = offset; + for (Word child : ((CompoundWord) word).innerList) + { + printWord(child, sb, i, offsetChild, withComment); + offsetChild += child.length(); + offsetChild += delimiter.length(); + ++i; + } + offset += delimiter.length() * ((CompoundWord) word).innerList.size(); + } + else + { + offset += delimiter.length(); + } + offset += word.length(); + } + return sb.toString(); + } + + /** + * 按照 PartOfSpeechTagDictionary 指定的映射表将词语词性翻译过去 + * + * @return + */ + public Sentence translateLabels() + { + for (IWord word : wordList) + { + word.setLabel(PartOfSpeechTagDictionary.translate(word.getLabel())); + if (word instanceof CompoundWord) + { + for (Word child : ((CompoundWord) word).innerList) + { + child.setLabel(PartOfSpeechTagDictionary.translate(child.getLabel())); + } + } + } + return this; + } + + /** + * 按照 PartOfSpeechTagDictionary 指定的映射表将复合词词语词性翻译过去 + * + * @return + */ + public Sentence translateCompoundWordLabels() + { + for (IWord word : wordList) + { + if (word instanceof CompoundWord) + word.setLabel(PartOfSpeechTagDictionary.translate(word.getLabel())); + } + return this; + } + + private void printWord(IWord word, StringBuilder sb, int id, int offset) + { + printWord(word, sb, id, offset, false); + } + + private void printWord(IWord word, StringBuilder sb, int id, int offset, boolean withComment) + { + char delimiter = '\t'; + char endLine = '\n'; + sb.append('T').append(id).append(delimiter); + sb.append(word.getLabel()).append(delimiter); + int 
length = word.length(); + if (word instanceof CompoundWord) + { + length += ((CompoundWord) word).innerList.size() - 1; + } + sb.append(offset).append(delimiter).append(offset + length).append(delimiter); + sb.append(word.getValue()).append(endLine); + String translated = PartOfSpeechTagDictionary.translate(word.getLabel()); + if (withComment && !word.getLabel().equals(translated)) + { + sb.append('#').append(id).append(delimiter).append("AnnotatorNotes").append(delimiter) + .append('T').append(id).append(delimiter).append(translated) + .append(endLine); + } + } + /** * 以人民日报2014语料格式的字符串创建一个结构化句子 + * * @param param * @return */ public static Sentence create(String param) { - Pattern pattern = Pattern.compile("(\\[(([^\\s]+/[0-9a-zA-Z]+)\\s+)+?([^\\s]+/[0-9a-zA-Z]+)]/?[0-9a-zA-Z]+)|([^\\s]+/[0-9a-zA-Z]+)"); + if (param == null) + { + return null; + } + param = param.trim(); + if (param.isEmpty()) + { + return null; + } + Pattern pattern = Pattern.compile("(\\[(([^\\s\\]]+/[0-9a-zA-Z]+)\\s+)+?([^\\s\\]]+/[0-9a-zA-Z]+)]/?[0-9a-zA-Z]+)|([^\\s]+/[0-9a-zA-Z]+)"); Matcher matcher = pattern.matcher(param); List wordList = new LinkedList(); while (matcher.find()) @@ -72,17 +230,25 @@ public static Sentence create(String param) IWord word = WordFactory.create(single); if (word == null) { - logger.warning("在用" + single + "构造单词时失败"); + logger.warning("在用 " + single + " 构造单词时失败,句子构造参数为 " + param); return null; } wordList.add(word); } + if (wordList.isEmpty()) // 按照无词性来解析 + { + for (String w : param.split("\\s+")) + { + wordList.add(new Word(w, null)); + } + } return new Sentence(wordList); } /** * 句子中单词(复合词或简单词)的数量 + * * @return */ public int size() @@ -92,6 +258,7 @@ public int size() /** * 句子文本长度 + * * @return */ public int length() @@ -107,15 +274,39 @@ public int length() /** * 原始文本形式(无标注,raw text) + * * @return */ public String text() { + return text(null); + } + + /** + * 原始文本形式(无标注,raw text) + * + * @param delimiter 词语之间的分隔符 + * @return + */ + public String text(String 
delimiter) + { + if (delimiter == null) delimiter = ""; StringBuilder sb = new StringBuilder(size() * 3); for (IWord word : this) { - sb.append(word.getValue()); + if (word instanceof CompoundWord) + { + for (Word child : ((CompoundWord) word).innerList) + { + sb.append(child.getValue()).append(delimiter); + } + } + else + { + sb.append(word.getValue()).append(delimiter); + } } + sb.setLength(sb.length() - delimiter.length()); return sb.toString(); } @@ -125,4 +316,185 @@ public Iterator iterator() { return wordList.iterator(); } + + /** + * 找出所有词性为label的单词(不检查复合词内部的简单词) + * + * @param label + * @return + */ + public List findWordsByLabel(String label) + { + List wordList = new LinkedList(); + for (IWord word : this) + { + if (label.equals(word.getLabel())) + { + wordList.add(word); + } + } + return wordList; + } + + /** + * 找出第一个词性为label的单词(不检查复合词内部的简单词) + * + * @param label + * @return + */ + public IWord findFirstWordByLabel(String label) + { + for (IWord word : this) + { + if (label.equals(word.getLabel())) + { + return word; + } + } + return null; + } + + /** + * 找出第一个词性为label的单词的指针(不检查复合词内部的简单词)
+ * 若要查看该单词,请调用 previous
+ * 若要删除该单词,请调用 remove
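The javadoc above documents a contract worth spelling out: `findFirstWordIteratorByLabel` returns a `ListIterator` whose cursor sits just past the match, so `previous()` re-reads the word and `remove()` deletes it. A minimal sketch of that pattern on plain strings (class and method names here are illustrative, not part of HanLP):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

// Sketch of the find-then-mutate iterator pattern: return the iterator
// positioned after the first match so the caller can inspect or delete it.
public class IteratorDemo
{
    public static ListIterator<String> findFirst(List<String> words, String target)
    {
        ListIterator<String> it = words.listIterator();
        while (it.hasNext())
        {
            if (it.next().equals(target)) return it; // cursor now sits after the match
        }
        return null; // no such word
    }

    public static void main(String[] args)
    {
        List<String> words = new LinkedList<String>(Arrays.asList("我", "爱", "北京"));
        ListIterator<String> it = findFirst(words, "爱");
        it.remove(); // last call was next(), so remove() deletes the matched word
        System.out.println(words); // [我, 北京]
    }
}
```

Returning the live iterator rather than the word itself is what makes in-place deletion possible without a `ConcurrentModificationException`.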
+ * + * @param label + * @return + */ + public ListIterator findFirstWordIteratorByLabel(String label) + { + ListIterator listIterator = this.wordList.listIterator(); + while (listIterator.hasNext()) + { + IWord word = listIterator.next(); + if (label.equals(word.getLabel())) + { + return listIterator; + } + } + return null; + } + + /** + * 是否含有词性为label的单词 + * + * @param label + * @return + */ + public boolean containsWordWithLabel(String label) + { + return findFirstWordByLabel(label) != null; + } + + /** + * 转换为简单单词列表 + * + * @return + */ + public List toSimpleWordList() + { + List wordList = new LinkedList(); + for (IWord word : this.wordList) + { + if (word instanceof CompoundWord) + { + wordList.addAll(((CompoundWord) word).innerList); + } + else + { + wordList.add((Word) word); + } + } + + return wordList; + } + + /** + * 获取所有单词构成的数组 + * + * @return + */ + public String[] toWordArray() + { + List wordList = toSimpleWordList(); + String[] wordArray = new String[wordList.size()]; + Iterator iterator = wordList.iterator(); + for (int i = 0; i < wordArray.length; i++) + { + wordArray[i] = iterator.next().value; + } + return wordArray; + } + + /** + * word pos + * + * @return + */ + public String[][] toWordTagArray() + { + List wordList = toSimpleWordList(); + String[][] pair = new String[2][wordList.size()]; + Iterator iterator = wordList.iterator(); + for (int i = 0; i < pair[0].length; i++) + { + Word word = iterator.next(); + pair[0][i] = word.value; + pair[1][i] = word.label; + } + return pair; + } + + /** + * word pos ner + * + * @param tagSet + * @return + */ + public String[][] toWordTagNerArray(NERTagSet tagSet) + { + List tupleList = Utility.convertSentenceToNER(this, tagSet); + String[][] result = new String[3][tupleList.size()]; + Iterator iterator = tupleList.iterator(); + for (int i = 0; i < result[0].length; i++) + { + String[] tuple = iterator.next(); + for (int j = 0; j < 3; ++j) + { + result[j][i] = tuple[j]; + } + } + return result; + } + + 
public Sentence mergeCompoundWords() + { + ListIterator listIterator = wordList.listIterator(); + while (listIterator.hasNext()) + { + IWord word = listIterator.next(); + if (word instanceof CompoundWord) + { + listIterator.set(new Word(word.getValue(), word.getLabel())); + } + } + return this; + } + + @Override + public boolean equals(Object o) + { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + + Sentence sentence = (Sentence) o; + return toString().equals(sentence.toString()); + } + + @Override + public int hashCode() + { + return toString().hashCode(); + } } diff --git a/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/CompoundWord.java b/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/CompoundWord.java index 942e6b8ca..567191250 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/CompoundWord.java +++ b/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/CompoundWord.java @@ -76,7 +76,12 @@ public String toString() int i = 1; for (Word word : innerList) { - sb.append(word.toString()); + sb.append(word.getValue()); + String label = word.getLabel(); + if (label != null) + { + sb.append('/').append(label); + } if (i != innerList.size()) { sb.append(' '); diff --git a/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/Word.java b/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/Word.java index df7d22257..5d3072821 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/Word.java +++ b/src/main/java/com/hankcs/hanlp/corpus/document/sentence/word/Word.java @@ -30,6 +30,8 @@ public class Word implements IWord @Override public String toString() { + if (label == null) + return value; return value + '/' + label; } diff --git a/src/main/java/com/hankcs/hanlp/corpus/io/ByteArray.java b/src/main/java/com/hankcs/hanlp/corpus/io/ByteArray.java index 52d331998..fe2e91f22 100644 --- 
a/src/main/java/com/hankcs/hanlp/corpus/io/ByteArray.java +++ b/src/main/java/com/hankcs/hanlp/corpus/io/ByteArray.java @@ -99,6 +99,15 @@ public byte nextByte() return bytes[offset++]; } + /** + * 读取一个布尔值 + * @return + */ + public boolean nextBoolean() + { + return nextByte() == 1; + } + public boolean hasMore() { return offset < bytes.length; diff --git a/src/main/java/com/hankcs/hanlp/corpus/io/ByteArrayFileStream.java b/src/main/java/com/hankcs/hanlp/corpus/io/ByteArrayFileStream.java index cb9735194..eddc19f68 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/io/ByteArrayFileStream.java +++ b/src/main/java/com/hankcs/hanlp/corpus/io/ByteArrayFileStream.java @@ -107,6 +107,7 @@ public void close() super.close(); try { + if (fileChannel == null) return; fileChannel.close(); } catch (IOException e) diff --git a/src/main/java/com/hankcs/hanlp/corpus/io/FolderWalker.java b/src/main/java/com/hankcs/hanlp/corpus/io/FolderWalker.java deleted file mode 100644 index f10932976..000000000 --- a/src/main/java/com/hankcs/hanlp/corpus/io/FolderWalker.java +++ /dev/null @@ -1,66 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/9/8 17:14 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
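The `FolderWalker` class deleted here is superseded by `IOUtil.fileList`, added further down in `IOUtil.java`. The replacement keeps the recursive walk and hidden-file filter but also tolerates the path pointing at a single file. A self-contained sketch of that behavior (standalone class name is mine, not HanLP's):

```java
import java.io.File;
import java.util.LinkedList;
import java.util.List;

// Sketch of the IOUtil.fileList pattern: recursively collect regular files,
// skip dot-files, and fall back to a one-element list when the path is a file.
public class FileLister
{
    public static List<File> fileList(String path)
    {
        List<File> fileList = new LinkedList<File>();
        File folder = new File(path);
        if (folder.isDirectory())
            enumerate(folder, fileList);
        else
            fileList.add(folder); // a plain (or missing) file path yields a single entry
        return fileList;
    }

    private static void enumerate(File folder, List<File> fileList)
    {
        File[] fileArray = folder.listFiles();
        if (fileArray == null) return; // unreadable directory
        for (File file : fileArray)
        {
            if (file.isFile() && !file.getName().startsWith(".")) // skip hidden files
                fileList.add(file);
            else if (file.isDirectory())
                enumerate(file, fileList);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(fileList(System.getProperty("java.io.tmpdir")).size() + " files found");
    }
}
```

The single-file fallback matters for callers like `CorpusLoader.walk`, which previously assumed the argument was always a directory.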
- * - */ -package com.hankcs.hanlp.corpus.io; - - -import java.io.File; -import java.util.LinkedList; -import java.util.List; -import static com.hankcs.hanlp.utility.Predefine.logger; -/** - * 遍历目录工具类 - * @author hankcs - */ -public class FolderWalker -{ - /** - * 打开一个目录,获取全部的文件名 - * @param path - * @return - */ - public static List open(String path) - { - List fileList = new LinkedList(); - File folder = new File(path); - handleFolder(folder, fileList); - return fileList; - } - - private static void handleFolder(File folder, List fileList) - { - File[] fileArray = folder.listFiles(); - if (fileArray != null) - { - for (File file : fileArray) - { - if (file.isFile() && !file.getName().startsWith(".")) // 过滤隐藏文件 - { - fileList.add(file); - } - else - { - handleFolder(file, fileList); - } - } - } - } - -// public static void main(String[] args) -// { -// List fileList = FolderWalker.open("D:\\Doc\\语料库\\2014"); -// for (File file : fileList) -// { -// System.out.println(file); -// } -// } - -} diff --git a/src/main/java/com/hankcs/hanlp/corpus/io/ICacheAble.java b/src/main/java/com/hankcs/hanlp/corpus/io/ICacheAble.java index ee1996ec5..5b24b7ab1 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/io/ICacheAble.java +++ b/src/main/java/com/hankcs/hanlp/corpus/io/ICacheAble.java @@ -31,5 +31,5 @@ public interface ICacheAble * @param byteArray * @return */ - boolean load(ByteArray byteArray); + boolean load(ByteArray byteArray); // 目前的设计并不好,应该抛异常而不是返回布尔值 } diff --git a/src/main/java/com/hankcs/hanlp/corpus/io/IOUtil.java b/src/main/java/com/hankcs/hanlp/corpus/io/IOUtil.java index ed895c597..1296a3f99 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/io/IOUtil.java +++ b/src/main/java/com/hankcs/hanlp/corpus/io/IOUtil.java @@ -95,10 +95,10 @@ public static String readTxt(String path) InputStream in = IOAdapter == null ? 
new FileInputStream(path) : IOAdapter.open(path); byte[] fileContent = new byte[in.available()]; - readBytesFromOtherInputStream(in, fileContent); + int read = readBytesFromOtherInputStream(in, fileContent); in.close(); // 处理 UTF-8 BOM - if (fileContent[0] == -17 && fileContent[1] == -69 && fileContent[2] == -65) + if (read >= 3 && fileContent[0] == -17 && fileContent[1] == -69 && fileContent[2] == -65) return new String(fileContent, 3, fileContent.length - 3, Charset.forName("UTF-8")); return new String(fileContent, Charset.forName("UTF-8")); } @@ -292,7 +292,7 @@ public static byte[] readBytesFromOtherInputStream(InputStream is) throws IOExce public static int readBytesFromOtherInputStream(InputStream is, byte[] targetArray) throws IOException { assert targetArray != null; - assert targetArray.length > 0; + if (targetArray.length == 0) return 0; int len; int off = 0; while (off < targetArray.length && (len = is.read(targetArray, off, targetArray.length - off)) != -1) @@ -437,10 +437,67 @@ public static boolean deleteFile(String path) return new File(path).delete(); } + /** + * 去除文件第一行中的UTF8 BOM
+ * 这是Java的bug,且官方不会修复。参考 https://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker + * @param line 文件第一行 + * @return 去除BOM的部分 + */ + public static String removeUTF8BOM(String line) + { + if (line != null && line.startsWith("\uFEFF")) // UTF-8 byte order mark (EF BB BF) + { + line = line.substring(1); + } + return line; + } + + /** + * 递归遍历获取目录下的所有文件 + * + * @param path 根目录 + * @return 文件列表 + */ + public static List fileList(String path) + { + List fileList = new LinkedList(); + File folder = new File(path); + if (folder.isDirectory()) + enumerate(folder, fileList); + else + fileList.add(folder); // 兼容路径为文件的情况 + return fileList; + } + + /** + * 递归遍历目录 + * + * @param folder 目录 + * @param fileList 储存文件 + */ + private static void enumerate(File folder, List fileList) + { + File[] fileArray = folder.listFiles(); + if (fileArray != null) + { + for (File file : fileArray) + { + if (file.isFile() && !file.getName().startsWith(".")) // 过滤隐藏文件 + { + fileList.add(file); + } + else + { + enumerate(file, fileList); + } + } + } + } + /** * 方便读取按行读取大文件 */ - public static class LineIterator implements Iterator + public static class LineIterator implements Iterator, Iterable { BufferedReader bw; String line; @@ -451,6 +508,7 @@ public LineIterator(BufferedReader bw) try { line = bw.readLine(); + line = IOUtil.removeUTF8BOM(line); } catch (IOException e) { @@ -465,6 +523,7 @@ public LineIterator(String path) { bw = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); line = bw.readLine(); + line = IOUtil.removeUTF8BOM(line); } catch (FileNotFoundException e) { @@ -553,6 +612,12 @@ public void remove() { throw new UnsupportedOperationException("只读,不可写!"); } + + @Override + public Iterator iterator() + { + return this; + } } /** @@ -661,7 +726,7 @@ public static void writeLine(BufferedWriter bw, String... 
params) throws IOExcep /** * 加载词典,词典必须遵守HanLP核心词典格式 - * @param pathArray 词典路径,可以有任意个 + * @param pathArray 词典路径,可以有任意个。每个路径支持用空格表示默认词性,比如“全国地名大全.txt ns” * @return 一个储存了词条的map * @throws IOException 异常表示加载失败 */ @@ -670,8 +735,21 @@ public static TreeMap loadDictionary(String... TreeMap map = new TreeMap(); for (String path : pathArray) { + File file = new File(path); + String fileName = file.getName(); + int natureIndex = fileName.lastIndexOf(' '); + Nature defaultNature = Nature.n; + if (natureIndex > 0) + { + String natureString = fileName.substring(natureIndex + 1); + path = file.getParent() + File.separator + fileName.substring(0, natureIndex); + if (natureString.length() > 0 && !natureString.endsWith(".txt") && !natureString.endsWith(".csv")) + { + defaultNature = Nature.create(natureString); + } + } BufferedReader br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); - loadDictionary(br, map); + loadDictionary(br, map, path.endsWith(".csv"), defaultNature); } return map; @@ -683,19 +761,39 @@ public static TreeMap loadDictionary(String... 
* @param storage 储存位置 * @throws IOException 异常表示加载失败 */ - public static void loadDictionary(BufferedReader br, TreeMap storage) throws IOException + public static void loadDictionary(BufferedReader br, TreeMap storage, boolean isCSV, Nature defaultNature) throws IOException { + String splitter = "\\s"; + if (isCSV) + { + splitter = ","; + } String line; + boolean firstLine = true; while ((line = br.readLine()) != null) { - String param[] = line.split("\\s"); + if (firstLine) + { + line = IOUtil.removeUTF8BOM(line); + firstLine = false; + } + String param[] = line.split(splitter); + int natureCount = (param.length - 1) / 2; - CoreDictionary.Attribute attribute = new CoreDictionary.Attribute(natureCount); - for (int i = 0; i < natureCount; ++i) + CoreDictionary.Attribute attribute; + if (natureCount == 0) { - attribute.nature[i] = LexiconUtility.convertStringToNature(param[1 + 2 * i]); - attribute.frequency[i] = Integer.parseInt(param[2 + 2 * i]); - attribute.totalFrequency += attribute.frequency[i]; + attribute = new CoreDictionary.Attribute(defaultNature); + } + else + { + attribute = new CoreDictionary.Attribute(natureCount); + for (int i = 0; i < natureCount; ++i) + { + attribute.nature[i] = LexiconUtility.convertStringToNature(param[1 + 2 * i]); + attribute.frequency[i] = Integer.parseInt(param[2 + 2 * i]); + attribute.totalFrequency += attribute.frequency[i]; + } } storage.put(param[0], attribute); } diff --git a/src/main/java/com/hankcs/hanlp/corpus/nr/FamilyName.java b/src/main/java/com/hankcs/hanlp/corpus/nr/FamilyName.java deleted file mode 100644 index 2b2d5eaf7..000000000 --- a/src/main/java/com/hankcs/hanlp/corpus/nr/FamilyName.java +++ /dev/null @@ -1,55 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/9/11 16:26 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
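The `removeUTF8BOM` helper introduced above, and called on the first line inside `loadDictionary`, works around a long-standing Java behavior: the UTF-8 decoder does not strip a byte order mark (EF BB BF), so it survives as `\uFEFF` at the start of the first line and corrupts the first dictionary entry when the line is split. A minimal reproduction of the fix:

```java
// Sketch of the BOM stripping added to IOUtil: a UTF-8 BOM decodes to the
// character \uFEFF, which must be removed before parsing the first line.
public class BomDemo
{
    public static String removeUTF8BOM(String line)
    {
        if (line != null && line.startsWith("\uFEFF")) // UTF-8 byte order mark
        {
            line = line.substring(1);
        }
        return line;
    }

    public static void main(String[] args)
    {
        // Without stripping, the first dictionary key would be "\uFEFF商品".
        System.out.println(removeUTF8BOM("\uFEFF商品 n 100")); // 商品 n 100
    }
}
```

Applying it only to the first line is sufficient, since a BOM can only legitimately occur at the very start of a file.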
- * - */ -package com.hankcs.hanlp.corpus.nr; - -import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; -import com.hankcs.hanlp.corpus.dictionary.item.Item; - -import java.io.*; -import java.util.List; - -/** - * @author hankcs - */ -public class FamilyName -{ - static boolean fn[]; - static - { - fn = new boolean[65535]; - try - { - BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("data/dictionary/person/familyname.txt"))); - String line; - while ((line = br.readLine()) != null) - { - fn[line.charAt(0)] = true; - } - br.close(); - } - catch (Exception e) - { - e.printStackTrace(); - } - } - - public static boolean contains(char c) - { - return fn[c]; - } - - public static boolean contains(String c) - { - if (c.length() != 1) return false; - return fn[c.charAt(0)]; - } -} diff --git a/src/main/java/com/hankcs/hanlp/corpus/nr/NRCorpusLoader.java b/src/main/java/com/hankcs/hanlp/corpus/nr/NRCorpusLoader.java deleted file mode 100644 index 936040a16..000000000 --- a/src/main/java/com/hankcs/hanlp/corpus/nr/NRCorpusLoader.java +++ /dev/null @@ -1,92 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/9/11 12:58 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
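Several hunks in this diff add an explicit `"UTF-8"` argument to `OutputStreamWriter` (in `NGramDictionaryMaker`, `StringDictionary`, `TMDictionaryMaker`), matching the `InputStreamReader(..., "UTF-8")` already used on the read side. Without it, Java falls back to the platform default charset (for example GBK on Chinese Windows), so a dictionary written on one machine may be unreadable on another. A round-trip sketch with the charset pinned (the demo class is mine):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

// Demonstrates why writers and readers must agree on an explicit charset:
// both ends here pin "UTF-8", so non-ASCII entries survive the round trip.
public class CharsetDemo
{
    public static byte[] write(String entry)
    {
        try
        {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(out, "UTF-8"));
            bw.write(entry);
            bw.close();
            return out.toByteArray();
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }

    public static String read(byte[] bytes)
    {
        try
        {
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
            String line = br.readLine();
            br.close();
            return line;
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(read(write("希望 v 7685 vn 616"))); // round-trips intact
    }
}
```

In-memory streams stand in for the files so the sketch stays self-contained; the charset-pinning pattern is the same as in the diff.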
- * - */ -package com.hankcs.hanlp.corpus.nr; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; -import com.hankcs.hanlp.corpus.dictionary.item.Item; -import com.hankcs.hanlp.corpus.document.sentence.word.Word; -import com.hankcs.hanlp.corpus.tag.NR; - -import java.io.BufferedReader; -import java.io.FileInputStream; -import java.io.InputStreamReader; -import static com.hankcs.hanlp.utility.Predefine.logger; -/** - * 对人名语料的解析,并且生成词典 - * @author hankcs - */ -public class NRCorpusLoader -{ - public static boolean load(String path) - { - try - { - BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8")); - String line; - DictionaryMaker dictionaryMaker = new DictionaryMaker(); - while ((line = br.readLine()) != null) - { - if (line.matches(".*[\\p{P}+~$`^=|<>~`$^+=|<>¥×|\\s|a-z0-9A-Z]+.*")) continue; - // 只载入两字和三字的名字 - Integer length = line.length(); - switch (length) - { - case 2: - { - Word wordB = new Word(line.substring(0, 1), NR.B.toString()); - Word wordE = new Word(line.substring(1), NR.E.toString()); - dictionaryMaker.add(wordB); - dictionaryMaker.add(wordE); - break; - } - case 3: - { - Word wordB = new Word(line.substring(0, 1), NR.B.toString()); - Word wordC = new Word(line.substring(1, 2), NR.C.toString()); - Word wordD = new Word(line.substring(2, 3), NR.D.toString()); - dictionaryMaker.add(wordB); - dictionaryMaker.add(wordC); - dictionaryMaker.add(wordD); - break; - } - default: -// L.trace("放弃【{}】", line); - break; - } - } - br.close(); - logger.info(dictionaryMaker.toString()); - dictionaryMaker.saveTxtTo("data/dictionary/person/name.txt", new DictionaryMaker.Filter() - { - @Override - public boolean onSave(Item item) - { - return false; - } - }); - } - catch (Exception e) - { - logger.warning("读取" + path + "发生错误"); - return false; - } - - return true; - } - - public static void combine() - { - DictionaryMaker dictionaryMaker = 
DictionaryMaker.combine(HanLP.Config.CoreDictionaryPath, "XXXDictionary.txt"); - dictionaryMaker.saveTxtTo(HanLP.Config.CoreDictionaryPath); - } -} diff --git a/src/main/java/com/hankcs/hanlp/corpus/nr/NameDictionaryMaker.java b/src/main/java/com/hankcs/hanlp/corpus/nr/NameDictionaryMaker.java deleted file mode 100644 index d8aff150e..000000000 --- a/src/main/java/com/hankcs/hanlp/corpus/nr/NameDictionaryMaker.java +++ /dev/null @@ -1,80 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/9/11 18:04 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.corpus.nr; - -import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; -import com.hankcs.hanlp.corpus.document.sentence.word.Word; -import com.hankcs.hanlp.corpus.tag.NR; - -import java.io.BufferedReader; -import java.io.FileInputStream; -import java.io.InputStreamReader; -import static com.hankcs.hanlp.utility.Predefine.logger; - -/** - * @author hankcs - */ -public class NameDictionaryMaker -{ - public static DictionaryMaker create(String path) - { - DictionaryMaker dictionaryMaker = new DictionaryMaker(); - try - { - BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8")); - String line; - while ((line = br.readLine()) != null) - { - if (line.matches(".*[\\p{P}+~$`^=|<>~`$^+=|<>¥×|\\s|a-z0-9A-Z]+.*")) continue; - // 只载入两字和三字的名字 - Integer length = line.length(); - switch (length) - { - case 2: - { - Word wordB = new Word(line.substring(0, 1), NR.B.toString()); - if (!FamilyName.contains(wordB.value)) break; - Word wordE = new Word(line.substring(1), NR.E.toString()); - dictionaryMaker.add(wordB); - dictionaryMaker.add(wordE); - break; - } - case 3: - { - Word wordB = new Word(line.substring(0, 1), NR.B.toString()); - if (!FamilyName.contains(wordB.value)) break; - Word wordC = 
new Word(line.substring(1, 2), NR.C.toString()); - Word wordD = new Word(line.substring(2, 3), NR.D.toString()); -// Word wordC = new Word(line.substring(1, 2), NR.E.toString()); -// Word wordD = new Word(line.substring(2, 3), NR.E.toString()); - dictionaryMaker.add(wordB); - dictionaryMaker.add(wordC); - dictionaryMaker.add(wordD); - break; - } - default: -// L.trace("放弃【{}】", line); - break; - } - } - br.close(); - logger.info(dictionaryMaker.toString()); - } - catch (Exception e) - { - logger.warning("读取" + path + "发生错误"); - return null; - } - - return dictionaryMaker; - } -} diff --git a/src/main/java/com/hankcs/hanlp/corpus/occurrence/Occurrence.java b/src/main/java/com/hankcs/hanlp/corpus/occurrence/Occurrence.java index 0761afa87..2c955500f 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/occurrence/Occurrence.java +++ b/src/main/java/com/hankcs/hanlp/corpus/occurrence/Occurrence.java @@ -298,7 +298,7 @@ public double computeMutualInformation(String first, String second) public double computeMutualInformation(PairFrequency pair) { - return Math.log(Math.max(Predefine.MIN_PROBABILITY, pair.getValue() / totalPair) / Math.max(Predefine.MIN_PROBABILITY, (CoreDictionary.getTermFrequency(pair.first) / (double) CoreDictionary.totalFrequency * CoreDictionary.getTermFrequency(pair.second) / (double) CoreDictionary.totalFrequency))); + return Math.log(Math.max(Predefine.MIN_PROBABILITY, pair.getValue() / totalPair) / Math.max(Predefine.MIN_PROBABILITY, (CoreDictionary.getTermFrequency(pair.first) / (double) Predefine.TOTAL_FREQUENCY * CoreDictionary.getTermFrequency(pair.second) / (double) Predefine.TOTAL_FREQUENCY))); } /** @@ -364,11 +364,18 @@ public void compute() for (Map.Entry entry : entrySetPair) { PairFrequency value = entry.getValue(); - value.score = value.mi / total_mi + value.le / total_le+ value.re / total_re; // 归一化 + value.score = safeDivide(value.mi, total_mi) + safeDivide(value.le, total_le) + safeDivide(value.re, total_re); // 归一化 value.score *= 
entrySetPair.size(); } } + private static double safeDivide(double x, double y) + { + if (y == 0) + return 0; + return x / y; + } + /** * 获取一阶共现,其实就是词频统计 * @return diff --git a/src/main/java/com/hankcs/hanlp/corpus/tag/Nature.java b/src/main/java/com/hankcs/hanlp/corpus/tag/Nature.java index 7a3f5fb05..4eeee318a 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/tag/Nature.java +++ b/src/main/java/com/hankcs/hanlp/corpus/tag/Nature.java @@ -11,831 +11,856 @@ */ package com.hankcs.hanlp.corpus.tag; -import com.hankcs.hanlp.corpus.util.CustomNatureUtility; +import java.util.TreeMap; +import java.util.concurrent.ConcurrentHashMap; /** * 词性 * * @author hankcs */ -public enum Nature +public class Nature { /** * 区别语素 */ - bg, + public static final Nature bg = new Nature("bg"); /** * 数语素 */ - mg, + public static final Nature mg = new Nature("mg"); /** * 名词性惯用语 */ - nl, + public static final Nature nl = new Nature("nl"); /** * 字母专名 */ - nx, + public static final Nature nx = new Nature("nx"); /** * 量词语素 */ - qg, + public static final Nature qg = new Nature("qg"); /** * 助词 */ - ud, + public static final Nature ud = new Nature("ud"); /** * 助词 */ - uj, + public static final Nature uj = new Nature("uj"); /** * 着 */ - uz, + public static final Nature uz = new Nature("uz"); /** * 过 */ - ug, + public static final Nature ug = new Nature("ug"); /** * 连词 */ - ul, + public static final Nature ul = new Nature("ul"); /** * 连词 */ - uv, + public static final Nature uv = new Nature("uv"); /** * 语气语素 */ - yg, + public static final Nature yg = new Nature("yg"); /** * 状态词 */ - zg, + public static final Nature zg = new Nature("zg"); // 以上标签来自ICT,以下标签来自北大 /** * 名词 */ - n, + public static final Nature n = new Nature("n"); /** * 人名 */ - nr, + public static final Nature nr = new Nature("nr"); /** * 日语人名 */ - nrj, + public static final Nature nrj = new Nature("nrj"); /** * 音译人名 */ - nrf, + public static final Nature nrf = new Nature("nrf"); /** * 复姓 */ - nr1, + public static final Nature nr1 = new 
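The `Occurrence` changes above floor each probability at `Predefine.MIN_PROBABILITY` before taking the logarithm, and guard the score normalization with `safeDivide` so an all-zero total degrades to 0 instead of NaN. A standalone sketch of both ideas, with made-up frequencies for illustration (only `MIN_PROBABILITY` mirrors the source; the class and method names are not HanLP's):

```java
public class PmiSketch
{
    // Floor mirroring Predefine.MIN_PROBABILITY, so log() stays finite
    static final double MIN_PROBABILITY = 1e-10;

    // Pointwise mutual information: log( P(a,b) / (P(a) * P(b)) ),
    // with joint and independent probabilities both floored.
    static double mutualInformation(double pairFreq, double totalPairs,
                                    double freqA, double freqB, double totalFreq)
    {
        double joint = Math.max(MIN_PROBABILITY, pairFreq / totalPairs);
        double independent = Math.max(MIN_PROBABILITY,
                (freqA / totalFreq) * (freqB / totalFreq));
        return Math.log(joint / independent);
    }

    // Division that degrades to 0 instead of NaN/Infinity,
    // as in the normalization inside Occurrence.compute()
    static double safeDivide(double x, double y)
    {
        return y == 0 ? 0 : x / y;
    }

    public static void main(String[] args)
    {
        System.out.println(safeDivide(3, 0));                             // 0.0
        System.out.println(mutualInformation(50, 100, 10, 10, 100) > 0);  // true: the pair co-occurs more than chance
    }
}
```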
Nature("nr1"); /** * 蒙古姓名 */ - nr2, + public static final Nature nr2 = new Nature("nr2"); /** * 地名 */ - ns, + public static final Nature ns = new Nature("ns"); /** * 音译地名 */ - nsf, + public static final Nature nsf = new Nature("nsf"); /** * 机构团体名 */ - nt, + public static final Nature nt = new Nature("nt"); /** * 公司名 */ - ntc, + public static final Nature ntc = new Nature("ntc"); /** * 工厂 */ - ntcf, + public static final Nature ntcf = new Nature("ntcf"); /** * 银行 */ - ntcb, + public static final Nature ntcb = new Nature("ntcb"); /** * 酒店宾馆 */ - ntch, + public static final Nature ntch = new Nature("ntch"); /** * 政府机构 */ - nto, + public static final Nature nto = new Nature("nto"); /** * 大学 */ - ntu, + public static final Nature ntu = new Nature("ntu"); /** * 中小学 */ - nts, + public static final Nature nts = new Nature("nts"); /** * 医院 */ - nth, + public static final Nature nth = new Nature("nth"); /** * 医药疾病等健康相关名词 */ - nh, + public static final Nature nh = new Nature("nh"); /** * 药品 */ - nhm, + public static final Nature nhm = new Nature("nhm"); /** * 疾病 */ - nhd, + public static final Nature nhd = new Nature("nhd"); /** * 工作相关名词 */ - nn, + public static final Nature nn = new Nature("nn"); /** * 职务职称 */ - nnt, + public static final Nature nnt = new Nature("nnt"); /** * 职业 */ - nnd, + public static final Nature nnd = new Nature("nnd"); /** * 名词性语素 */ - ng, + public static final Nature ng = new Nature("ng"); /** * 食品,比如“薯片” */ - nf, + public static final Nature nf = new Nature("nf"); /** * 机构相关(不是独立机构名) */ - ni, + public static final Nature ni = new Nature("ni"); /** * 教育相关机构 */ - nit, + public static final Nature nit = new Nature("nit"); /** * 下属机构 */ - nic, + public static final Nature nic = new Nature("nic"); /** * 机构后缀 */ - nis, + public static final Nature nis = new Nature("nis"); /** * 物品名 */ - nm, + public static final Nature nm = new Nature("nm"); /** * 化学品名 */ - nmc, + public static final Nature nmc = new Nature("nmc"); /** * 生物名 */ - nb, + public static final 
Nature nb = new Nature("nb"); /** * 动物名 */ - nba, + public static final Nature nba = new Nature("nba"); /** * 动物纲目 */ - nbc, + public static final Nature nbc = new Nature("nbc"); /** * 植物名 */ - nbp, + public static final Nature nbp = new Nature("nbp"); /** * 其他专名 */ - nz, + public static final Nature nz = new Nature("nz"); /** * 学术词汇 */ - g, + public static final Nature g = new Nature("g"); /** * 数学相关词汇 */ - gm, + public static final Nature gm = new Nature("gm"); /** * 物理相关词汇 */ - gp, + public static final Nature gp = new Nature("gp"); /** * 化学相关词汇 */ - gc, + public static final Nature gc = new Nature("gc"); /** * 生物相关词汇 */ - gb, + public static final Nature gb = new Nature("gb"); /** * 生物类别 */ - gbc, + public static final Nature gbc = new Nature("gbc"); /** * 地理地质相关词汇 */ - gg, + public static final Nature gg = new Nature("gg"); /** * 计算机相关词汇 */ - gi, + public static final Nature gi = new Nature("gi"); /** * 简称略语 */ - j, + public static final Nature j = new Nature("j"); /** * 成语 */ - i, + public static final Nature i = new Nature("i"); /** * 习用语 */ - l, + public static final Nature l = new Nature("l"); /** * 时间词 */ - t, + public static final Nature t = new Nature("t"); /** * 时间词性语素 */ - tg, + public static final Nature tg = new Nature("tg"); /** * 处所词 */ - s, + public static final Nature s = new Nature("s"); /** * 方位词 */ - f, + public static final Nature f = new Nature("f"); /** * 动词 */ - v, + public static final Nature v = new Nature("v"); /** * 副动词 */ - vd, + public static final Nature vd = new Nature("vd"); /** * 名动词 */ - vn, + public static final Nature vn = new Nature("vn"); /** * 动词“是” */ - vshi, + public static final Nature vshi = new Nature("vshi"); /** * 动词“有” */ - vyou, + public static final Nature vyou = new Nature("vyou"); /** * 趋向动词 */ - vf, + public static final Nature vf = new Nature("vf"); /** * 形式动词 */ - vx, + public static final Nature vx = new Nature("vx"); /** * 不及物动词(内动词) */ - vi, + public static final Nature vi = new Nature("vi"); /** * 动词性惯用语 
*/ - vl, + public static final Nature vl = new Nature("vl"); /** * 动词性语素 */ - vg, + public static final Nature vg = new Nature("vg"); /** * 形容词 */ - a, + public static final Nature a = new Nature("a"); /** * 副形词 */ - ad, + public static final Nature ad = new Nature("ad"); /** * 名形词 */ - an, + public static final Nature an = new Nature("an"); /** * 形容词性语素 */ - ag, + public static final Nature ag = new Nature("ag"); /** * 形容词性惯用语 */ - al, + public static final Nature al = new Nature("al"); /** * 区别词 */ - b, + public static final Nature b = new Nature("b"); /** * 区别词性惯用语 */ - bl, + public static final Nature bl = new Nature("bl"); /** * 状态词 */ - z, + public static final Nature z = new Nature("z"); /** * 代词 */ - r, + public static final Nature r = new Nature("r"); /** * 人称代词 */ - rr, + public static final Nature rr = new Nature("rr"); /** * 指示代词 */ - rz, + public static final Nature rz = new Nature("rz"); /** * 时间指示代词 */ - rzt, + public static final Nature rzt = new Nature("rzt"); /** * 处所指示代词 */ - rzs, + public static final Nature rzs = new Nature("rzs"); /** * 谓词性指示代词 */ - rzv, + public static final Nature rzv = new Nature("rzv"); /** * 疑问代词 */ - ry, + public static final Nature ry = new Nature("ry"); /** * 时间疑问代词 */ - ryt, + public static final Nature ryt = new Nature("ryt"); /** * 处所疑问代词 */ - rys, + public static final Nature rys = new Nature("rys"); /** * 谓词性疑问代词 */ - ryv, + public static final Nature ryv = new Nature("ryv"); /** * 代词性语素 */ - rg, + public static final Nature rg = new Nature("rg"); /** * 古汉语代词性语素 */ - Rg, + public static final Nature Rg = new Nature("Rg"); /** * 数词 */ - m, + public static final Nature m = new Nature("m"); /** * 数量词 */ - mq, + public static final Nature mq = new Nature("mq"); /** * 甲乙丙丁之类的数词 */ - Mg, + public static final Nature Mg = new Nature("Mg"); /** * 量词 */ - q, + public static final Nature q = new Nature("q"); /** * 动量词 */ - qv, + public static final Nature qv = new Nature("qv"); /** * 时量词 */ - qt, + public static final 
Nature qt = new Nature("qt"); /** * 副词 */ - d, + public static final Nature d = new Nature("d"); /** * 辄,俱,复之类的副词 */ - dg, + public static final Nature dg = new Nature("dg"); /** * 连语 */ - dl, + public static final Nature dl = new Nature("dl"); /** * 介词 */ - p, + public static final Nature p = new Nature("p"); /** * 介词“把” */ - pba, + public static final Nature pba = new Nature("pba"); /** * 介词“被” */ - pbei, + public static final Nature pbei = new Nature("pbei"); /** * 连词 */ - c, + public static final Nature c = new Nature("c"); /** * 并列连词 */ - cc, + public static final Nature cc = new Nature("cc"); /** * 助词 */ - u, + public static final Nature u = new Nature("u"); /** * 着 */ - uzhe, + public static final Nature uzhe = new Nature("uzhe"); /** * 了 喽 */ - ule, + public static final Nature ule = new Nature("ule"); /** * 过 */ - uguo, + public static final Nature uguo = new Nature("uguo"); /** * 的 底 */ - ude1, + public static final Nature ude1 = new Nature("ude1"); /** * 地 */ - ude2, + public static final Nature ude2 = new Nature("ude2"); /** * 得 */ - ude3, + public static final Nature ude3 = new Nature("ude3"); /** * 所 */ - usuo, + public static final Nature usuo = new Nature("usuo"); /** * 等 等等 云云 */ - udeng, + public static final Nature udeng = new Nature("udeng"); /** * 一样 一般 似的 般 */ - uyy, + public static final Nature uyy = new Nature("uyy"); /** * 的话 */ - udh, + public static final Nature udh = new Nature("udh"); /** * 来讲 来说 而言 说来 */ - uls, + public static final Nature uls = new Nature("uls"); /** * 之 */ - uzhi, + public static final Nature uzhi = new Nature("uzhi"); /** * 连 (“连小学生都会”) */ - ulian, + public static final Nature ulian = new Nature("ulian"); /** * 叹词 */ - e, + public static final Nature e = new Nature("e"); /** * 语气词(delete yg) */ - y, + public static final Nature y = new Nature("y"); /** * 拟声词 */ - o, + public static final Nature o = new Nature("o"); /** * 前缀 */ - h, + public static final Nature h = new Nature("h"); /** * 后缀 */ - k, + public static 
final Nature k = new Nature("k"); /** * 字符串 */ - x, + public static final Nature x = new Nature("x"); /** * 非语素字 */ - xx, + public static final Nature xx = new Nature("xx"); /** * 网址URL */ - xu, + public static final Nature xu = new Nature("xu"); /** * 标点符号 */ - w, + public static final Nature w = new Nature("w"); /** * 左括号,全角:( 〔 [ { 《 【 〖 〈 半角:( [ { < */ - wkz, + public static final Nature wkz = new Nature("wkz"); /** * 右括号,全角:) 〕 ] } 》 】 〗 〉 半角: ) ] { > */ - wky, + public static final Nature wky = new Nature("wky"); /** * 左引号,全角:“ ‘ 『 */ - wyz, + public static final Nature wyz = new Nature("wyz"); /** * 右引号,全角:” ’ 』 */ - wyy, + public static final Nature wyy = new Nature("wyy"); /** * 句号,全角:。 */ - wj, + public static final Nature wj = new Nature("wj"); /** * 问号,全角:? 半角:? */ - ww, + public static final Nature ww = new Nature("ww"); /** * 叹号,全角:! 半角:! */ - wt, + public static final Nature wt = new Nature("wt"); /** * 逗号,全角:, 半角:, */ - wd, + public static final Nature wd = new Nature("wd"); /** * 分号,全角:; 半角: ; */ - wf, + public static final Nature wf = new Nature("wf"); /** * 顿号,全角:、 */ - wn, + public static final Nature wn = new Nature("wn"); /** * 冒号,全角:: 半角: : */ - wm, + public static final Nature wm = new Nature("wm"); /** * 省略号,全角:…… … */ - ws, + public static final Nature ws = new Nature("ws"); /** * 破折号,全角:—— -- ——- 半角:--- ---- */ - wp, + public static final Nature wp = new Nature("wp"); /** * 百分号千分号,全角:% ‰ 半角:% */ - wb, + public static final Nature wb = new Nature("wb"); /** * 单位符号,全角:¥ $ £ ° ℃ 半角:$ */ - wh, + public static final Nature wh = new Nature("wh"); /** * 仅用于终##终,不会出现在分词结果中 */ - end, + public static final Nature end = new Nature("end"); /** * 仅用于始##始,不会出现在分词结果中 */ - begin, + public static final Nature begin = new Nature("begin"); - ; + private static ConcurrentHashMap idMap; + private static Nature[] values; + private int ordinal; + private final String name; + + private Nature(String name) + { + if (idMap == null) { + idMap = new 
ConcurrentHashMap(); + } + assert !idMap.containsKey(name); + this.name = name; + ordinal = idMap.size(); + idMap.put(name, ordinal); + Nature[] extended = new Nature[idMap.size()]; + if (values != null){ + System.arraycopy(values, 0, extended, 0, values.length); + } + extended[ordinal] = this; + values = extended; + } /** * 词性是否以该前缀开头
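The rewrite above turns `Nature` from an `enum` into a plain class backed by a name-to-ordinal map, so new part-of-speech tags can be minted at runtime without the `EnumBuster` reflection tricks. A minimal sketch of the same interned-registry pattern; the `Tag` class and its members are illustrative stand-ins, not HanLP's API (HanLP keeps a separate `values` array alongside the map):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Extensible "enum": instances are interned in a registry keyed by name.
// create() returns the existing instance for a known tag and mints a new
// one with the next ordinal for an unknown tag.
public final class Tag
{
    private static final Map<String, Tag> REGISTRY = new ConcurrentHashMap<String, Tag>();

    public static final Tag n = create("n");   // noun
    public static final Tag v = create("v");   // verb

    private final String name;
    private final int ordinal;

    private Tag(String name, int ordinal)
    {
        this.name = name;
        this.ordinal = ordinal;
    }

    public static Tag create(String name)
    {
        Tag tag = REGISTRY.get(name);
        if (tag != null) return tag;
        synchronized (REGISTRY)
        {
            // re-check under the lock so concurrent creators agree on ordinals
            tag = REGISTRY.get(name);
            if (tag == null)
            {
                tag = new Tag(name, REGISTRY.size());
                REGISTRY.put(name, tag);
            }
            return tag;
        }
    }

    public static Tag fromString(String name) { return REGISTRY.get(name); }
    public int ordinal() { return ordinal; }
    @Override public String toString() { return name; }
}
```

Unlike `Enum.valueOf`, an unknown name simply returns `null` from `fromString`, and `create("nx2")` registers a custom tag once; every later lookup returns the same interned instance.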
- * 词性根据开头的几个字母可以判断大的类别 + * 词性根据开头的几个字母可以判断大的类别 + * * @param prefix 前缀 * @return 是否以该前缀开头 */ public boolean startsWith(String prefix) { - return toString().startsWith(prefix); + return name.startsWith(prefix); } /** * 词性是否以该前缀开头
- * 词性根据开头的几个字母可以判断大的类别 + * 词性根据开头的几个字母可以判断大的类别 + * * @param prefix 前缀 * @return 是否以该前缀开头 */ public boolean startsWith(char prefix) { - return toString().charAt(0) == prefix; + return name.charAt(0) == prefix; } /** * 词性的首字母
- * 词性根据开头的几个字母可以判断大的类别 + * 词性根据开头的几个字母可以判断大的类别 + * * @return */ public char firstChar() { - return toString().charAt(0); + return name.charAt(0); } /** * 安全地将字符串类型的词性转为Enum类型,如果未定义该词性,则返回null + * * @param name 字符串词性 * @return Enum词性 */ - public static Nature fromString(String name) + public static final Nature fromString(String name) { - try - { - return Nature.valueOf(name); - } - catch (Exception e) - { - // 动态添加的词语有可能无法通过valueOf获取,所以遍历搜索 - for (Nature nature : Nature.values()) - { - if (nature.toString().equals(name)) - { - return nature; - } - } - } - - return null; + Integer id = idMap.get(name); + if (id == null) + return null; + return values[id]; } /** * 创建自定义词性,如果已有该对应词性,则直接返回已有的词性 + * * @param name 字符串词性 * @return Enum词性 */ - public static Nature create(String name) + public static final Nature create(String name) { - try - { - return Nature.valueOf(name); - } - catch (Exception e) - { - return CustomNatureUtility.addNature(name); - } + Nature nature = fromString(name); + if (nature == null) + return new Nature(name); + return nature; + } + + @Override + public String toString() + { + return name; + } + + public int ordinal() + { + return ordinal; + } + + public static Nature[] values() + { + return values; } -} \ No newline at end of file +} diff --git a/src/main/java/com/hankcs/hanlp/corpus/util/CustomNatureUtility.java b/src/main/java/com/hankcs/hanlp/corpus/util/CustomNatureUtility.java deleted file mode 100644 index 2b2441ed6..000000000 --- a/src/main/java/com/hankcs/hanlp/corpus/util/CustomNatureUtility.java +++ /dev/null @@ -1,98 +0,0 @@ -/* - * - * He Han - * me@hankcs.com - * 2016/1/4 16:02 - * - * - * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ - * This source is subject to Hankcs. Please contact Hankcs to get more information. 
- * - */ -package com.hankcs.hanlp.corpus.util; - -import com.hankcs.hanlp.corpus.tag.Nature; -import com.hankcs.hanlp.dictionary.CoreDictionaryTransformMatrixDictionary; -import com.hankcs.hanlp.dictionary.CustomDictionary; -import com.hankcs.hanlp.recognition.nr.PersonRecognition; -import com.hankcs.hanlp.recognition.nt.OrganizationRecognition; -import com.hankcs.hanlp.seg.common.Vertex; - -import java.util.Map; -import java.util.TreeMap; - -import static com.hankcs.hanlp.utility.Predefine.logger; - -/** - * 运行时动态增加词性工具 - * - * @author hankcs - */ -public class CustomNatureUtility -{ - static - { - logger.warning("已激活自定义词性功能,由于采用了反射技术,用户需对本地环境的兼容性和稳定性负责!\n" + - "如果用户代码X.java中有switch(nature)语句,需要调用CustomNatureUtility.registerSwitchClass(X.class)注册X这个类"); - } - private static Map extraValueMap = new TreeMap(); - - /** - * 动态增加词性工具 - */ - private static EnumBuster enumBuster = new EnumBuster(Nature.class, - CustomDictionary.class, - Vertex.class, - PersonRecognition.class, - OrganizationRecognition.class); - - /** - * 增加词性 - * @param name 词性名称 - * @return 词性 - */ - public static Nature addNature(String name) - { - Nature customNature = extraValueMap.get(name); - if (customNature != null) return customNature; - customNature = enumBuster.make(name); - enumBuster.addByValue(customNature); - extraValueMap.put(name, customNature); - // 必须对词性标注HMM模型中的元组做出调整 - CoreDictionaryTransformMatrixDictionary.transformMatrixDictionary.extendSize(); - - return customNature; - } - - /** - * 注册switch(nature)语句类 - * @param switchUsers 任何使用了switch(nature)语句的类 - */ - public static void registerSwitchClass(Class... 
switchUsers) - { - enumBuster.registerSwitchClass(switchUsers); - } - - /** - * 还原对词性的全部修改 - */ - public static void restore() - { - enumBuster.restore(); - extraValueMap.clear(); - } - - public static Nature getNature(String name) - { - - try - { - return Nature.valueOf(name); - } - catch (Exception e) - { - // 动态添加的词语有可能无法通过valueOf获取 - return extraValueMap.get(name); - } - } -} diff --git a/src/main/java/com/hankcs/hanlp/corpus/util/DictionaryUtil.java b/src/main/java/com/hankcs/hanlp/corpus/util/DictionaryUtil.java index e8af9f09e..d050706ee 100644 --- a/src/main/java/com/hankcs/hanlp/corpus/util/DictionaryUtil.java +++ b/src/main/java/com/hankcs/hanlp/corpus/util/DictionaryUtil.java @@ -42,7 +42,7 @@ public static boolean sortDictionary(String path) } br.close(); - BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path))); + BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path), "UTF-8")); for (Map.Entry entry : map.entrySet()) { bw.write(entry.getValue()); diff --git a/src/main/java/com/hankcs/hanlp/corpus/util/EnumBuster.java b/src/main/java/com/hankcs/hanlp/corpus/util/EnumBuster.java deleted file mode 100644 index 1843fc1b8..000000000 --- a/src/main/java/com/hankcs/hanlp/corpus/util/EnumBuster.java +++ /dev/null @@ -1,518 +0,0 @@ -/* - * - * Hankcs - * me@hankcs.com - * 2016-03-26 PM5:35 - * - * - * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ - * This source is subject to Hankcs. Please contact Hankcs to get more information. 
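The `DictionaryUtil` hunk above adds an explicit `"UTF-8"` to the `OutputStreamWriter`; without it Java falls back to the platform default charset, which corrupts Chinese dictionary entries on, for example, a GBK-configured Windows machine. A small round-trip sketch of the fix (the `roundTrip` helper is illustrative, not part of HanLP):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class CharsetRoundTrip
{
    // Write and re-read one dictionary line, naming the charset on BOTH
    // sides; new OutputStreamWriter(out) alone would silently use the
    // platform default and break the round trip on non-UTF-8 systems.
    static String roundTrip(String content) throws IOException
    {
        File file = File.createTempFile("dict", ".txt");
        file.deleteOnExit();

        Writer writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
        writer.write(content);
        writer.close();

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        String line = reader.readLine();
        reader.close();
        return line;
    }

    public static void main(String[] args) throws IOException
    {
        // the line comes back unchanged regardless of the platform default charset
        System.out.println(roundTrip("希望 v 7685"));
    }
}
```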
- * - */ -package com.hankcs.hanlp.corpus.util; - -import sun.reflect.*; - -import java.lang.reflect.*; -import java.util.*; - -/** - * 动态修改Enum的对象 - * @param - */ -public class EnumBuster> -{ - private static final Class[] EMPTY_CLASS_ARRAY = - new Class[0]; - private static final Object[] EMPTY_OBJECT_ARRAY = - new Object[0]; - - private static final String VALUES_FIELD = "$VALUES"; - private static final String ORDINAL_FIELD = "ordinal"; - - private final ReflectionFactory reflection = - ReflectionFactory.getReflectionFactory(); - - private final Class clazz; - - private final Collection switchFields; - - private final Deque undoStack = - new LinkedList(); - - /** - * Construct an EnumBuster for the given enum class and keep - * the switch statements of the classes specified in - * switchUsers in sync with the enum values. - */ - public EnumBuster(Class clazz, Class... switchUsers) - { - try - { - this.clazz = clazz; - switchFields = findRelatedSwitchFields(switchUsers); - } - catch (Exception e) - { - throw new IllegalArgumentException( - "Could not create the class", e); - } - } - - /** - * Make a new enum instance, without adding it to the values - * array and using the default ordinal of 0. - */ - public E make(String value) - { - return make(value, 0, - EMPTY_CLASS_ARRAY, EMPTY_OBJECT_ARRAY); - } - - /** - * Make a new enum instance with the given ordinal. - */ - public E make(String value, int ordinal) - { - return make(value, ordinal, - EMPTY_CLASS_ARRAY, EMPTY_OBJECT_ARRAY); - } - - /** - * Make a new enum instance with the given value, ordinal and - * additional parameters. The additionalTypes is used to match - * the constructor accurately. 
- */ - public E make(String value, int ordinal, - Class[] additionalTypes, Object[] additional) - { - try - { - undoStack.push(new Memento()); - ConstructorAccessor ca = findConstructorAccessor( - additionalTypes, clazz); - return constructEnum(clazz, ca, value, - ordinal, additional); - } - catch (Exception e) - { - throw new IllegalArgumentException( - "Could not create enum", e); - } - } - - /** - * This method adds the given enum into the array - * inside the enum class. If the enum already - * contains that particular value, then the value - * is overwritten with our enum. Otherwise it is - * added at the end of the array. - *

- * In addition, if there is a constant field in the - * enum class pointing to an enum with our value, - * then we replace that with our enum instance. - *

- * The ordinal is either set to the existing position - * or to the last value. - *

- * Warning: This should probably never be called, - * since it can cause permanent changes to the enum - * values. Use only in extreme conditions. - * - * @param e the enum to add - */ - public void addByValue(E e) - { - try - { - undoStack.push(new Memento()); - Field valuesField = findValuesField(); - - // we get the current Enum[] - E[] values = values(); - for (int i = 0; i < values.length; i++) - { - E value = values[i]; - if (value.name().equals(e.name())) - { - setOrdinal(e, value.ordinal()); - values[i] = e; - replaceConstant(e); - return; - } - } - - // we did not find it in the existing array, thus - // append it to the array - E[] newValues = - Arrays.copyOf(values, values.length + 1); - newValues[newValues.length - 1] = e; - ReflectionHelper.setStaticFinalField( - valuesField, newValues); - - int ordinal = newValues.length - 1; - setOrdinal(e, ordinal); - addSwitchCase(); - } - catch (Exception ex) - { - throw new IllegalArgumentException( - "Could not set the enum", ex); - } - } - - /** - * We delete the enum from the values array and set the - * constant pointer to null. - * - * @param e the enum to delete from the type. 
- * @return true if the enum was found and deleted; - * false otherwise - */ - public boolean deleteByValue(E e) - { - if (e == null) throw new NullPointerException(); - try - { - undoStack.push(new Memento()); - // we get the current E[] - E[] values = values(); - for (int i = 0; i < values.length; i++) - { - E value = values[i]; - if (value.name().equals(e.name())) - { - E[] newValues = - Arrays.copyOf(values, values.length - 1); - System.arraycopy(values, i + 1, newValues, i, - values.length - i - 1); - for (int j = i; j < newValues.length; j++) - { - setOrdinal(newValues[j], j); - } - Field valuesField = findValuesField(); - ReflectionHelper.setStaticFinalField( - valuesField, newValues); - removeSwitchCase(i); - blankOutConstant(e); - return true; - } - } - } - catch (Exception ex) - { - throw new IllegalArgumentException( - "Could not set the enum", ex); - } - return false; - } - - /** - * Undo the state right back to the beginning when the - * EnumBuster was created. - */ - public void restore() - { - while (undo()) - { - // - } - } - - /** - * Undo the previous operation. 
- */ - public boolean undo() - { - try - { - Memento memento = undoStack.poll(); - if (memento == null) return false; - memento.undo(); - return true; - } - catch (Exception e) - { - throw new IllegalStateException("Could not undo", e); - } - } - - private ConstructorAccessor findConstructorAccessor( - Class[] additionalParameterTypes, - Class clazz) throws NoSuchMethodException - { - Class[] parameterTypes = - new Class[additionalParameterTypes.length + 2]; - parameterTypes[0] = String.class; - parameterTypes[1] = int.class; - System.arraycopy( - additionalParameterTypes, 0, - parameterTypes, 2, - additionalParameterTypes.length); - Constructor cstr = clazz.getDeclaredConstructor( - parameterTypes - ); - return reflection.newConstructorAccessor(cstr); - } - - private E constructEnum(Class clazz, - ConstructorAccessor ca, - String value, int ordinal, - Object[] additional) - throws Exception - { - Object[] parms = new Object[additional.length + 2]; - parms[0] = value; - parms[1] = ordinal; - System.arraycopy( - additional, 0, parms, 2, additional.length); - return clazz.cast(ca.newInstance(parms)); - } - - /** - * The only time we ever add a new enum is at the end. - * Thus all we need to do is expand the switch map arrays - * by one empty slot. 
- */ - private void addSwitchCase() - { - try - { - for (Field switchField : switchFields) - { - int[] switches = (int[]) switchField.get(null); - switches = Arrays.copyOf(switches, switches.length + 1); - ReflectionHelper.setStaticFinalField( - switchField, switches - ); - } - } - catch (Exception e) - { - throw new IllegalArgumentException( - "Could not fix switch", e); - } - } - - private void replaceConstant(E e) - throws IllegalAccessException, NoSuchFieldException - { - Field[] fields = clazz.getDeclaredFields(); - for (Field field : fields) - { - if (field.getName().equals(e.name())) - { - ReflectionHelper.setStaticFinalField( - field, e - ); - } - } - } - - - private void blankOutConstant(E e) - throws IllegalAccessException, NoSuchFieldException - { - Field[] fields = clazz.getDeclaredFields(); - for (Field field : fields) - { - if (field.getName().equals(e.name())) - { - ReflectionHelper.setStaticFinalField( - field, null - ); - } - } - } - - private void setOrdinal(E e, int ordinal) - throws NoSuchFieldException, IllegalAccessException - { - Field ordinalField = Enum.class.getDeclaredField( - ORDINAL_FIELD); - ordinalField.setAccessible(true); - ordinalField.set(e, ordinal); - } - - /** - * Method to find the values field, set it to be accessible, - * and return it. - * - * @return the values array field for the enum. 
- * @throws NoSuchFieldException if the field could not be found - */ - private Field findValuesField() - throws NoSuchFieldException - { - // first we find the static final array that holds - // the values in the enum class - Field valuesField = null; - try - { - valuesField = clazz.getDeclaredField( - VALUES_FIELD); - } - catch (NoSuchFieldException e) - { - Field[] fields = clazz.getDeclaredFields(); - for (Field field : fields) - { - if (field.getName().contains(VALUES_FIELD)) - { - valuesField = field; - break; - } - } - } - if (valuesField == null) - { - throw new RuntimeException("本地JVM不支持自定义词性"); - } - - // we mark it to be public - valuesField.setAccessible(true); - return valuesField; - } - - public void registerSwitchClass(Class[] switchUsers) - { - switchFields.addAll(findRelatedSwitchFields(switchUsers)); - } - - private Collection findRelatedSwitchFields( - Class[] switchUsers) - { - Collection result = new LinkedList(); - try - { - for (Class switchUser : switchUsers) - { - String name = switchUser.getName(); - int i = 0; - while (true) - { - try - { - Class suspect = Class.forName(String.format("%s$%d", name, ++i)); - Field[] fields = suspect.getDeclaredFields(); - for (Field field : fields) - { - String fieldName = field.getName(); - if (fieldName.startsWith("$SwitchMap$") && fieldName.endsWith(clazz.getSimpleName())) - { - field.setAccessible(true); - result.add(field); - } - } - } - catch (ClassNotFoundException e) - { - break; - } - } - } - } - catch (Exception e) - { - throw new IllegalArgumentException( - "Could not fix switch", e); - } - return result; - } - - private void removeSwitchCase(int ordinal) - { - try - { - for (Field switchField : switchFields) - { - int[] switches = (int[]) switchField.get(null); - int[] newSwitches = Arrays.copyOf( - switches, switches.length - 1); - System.arraycopy(switches, ordinal + 1, newSwitches, - ordinal, switches.length - ordinal - 1); - ReflectionHelper.setStaticFinalField( - switchField, newSwitches - 
); - } - } - catch (Exception e) - { - throw new IllegalArgumentException( - "Could not fix switch", e); - } - } - - @SuppressWarnings("unchecked") - private E[] values() - throws NoSuchFieldException, IllegalAccessException - { - Field valuesField = findValuesField(); - return (E[]) valuesField.get(null); - } - - private class Memento - { - private final E[] values; - private final Map savedSwitchFieldValues = - new HashMap(); - - private Memento() throws IllegalAccessException - { - try - { - values = values().clone(); - for (Field switchField : switchFields) - { - int[] switchArray = (int[]) switchField.get(null); - savedSwitchFieldValues.put(switchField, - switchArray.clone()); - } - } - catch (Exception e) - { - throw new IllegalArgumentException( - "Could not create the class", e); - } - } - - private void undo() throws - NoSuchFieldException, IllegalAccessException - { - Field valuesField = findValuesField(); - ReflectionHelper.setStaticFinalField(valuesField, values); - - for (int i = 0; i < values.length; i++) - { - setOrdinal(values[i], i); - } - - // reset all of the constants defined inside the enum - Map valuesMap = - new HashMap(); - for (E e : values) - { - valuesMap.put(e.name(), e); - } - Field[] constantEnumFields = clazz.getDeclaredFields(); - for (Field constantEnumField : constantEnumFields) - { - E en = valuesMap.get(constantEnumField.getName()); - if (en != null) - { - ReflectionHelper.setStaticFinalField( - constantEnumField, en - ); - } - } - - for (Map.Entry entry : - savedSwitchFieldValues.entrySet()) - { - Field field = entry.getKey(); - int[] mappings = entry.getValue(); - ReflectionHelper.setStaticFinalField(field, mappings); - } - } - } -} diff --git a/src/main/java/com/hankcs/hanlp/corpus/util/ReflectionHelper.java b/src/main/java/com/hankcs/hanlp/corpus/util/ReflectionHelper.java deleted file mode 100644 index be5228dcd..000000000 --- a/src/main/java/com/hankcs/hanlp/corpus/util/ReflectionHelper.java +++ /dev/null @@ -1,50 +0,0 @@ 
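Before each mutation, the deleted `EnumBuster` pushes a `Memento` snapshot of the enum's `$VALUES` array and switch maps, and `restore()` pops the stack until the original state is back. That snapshot-and-replay idea is independent of the reflection machinery; a hedged sketch of the same undo pattern over a plain list (all names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Memento pattern: push a copy of the state before every mutation;
// undo() pops one snapshot, restore() unwinds to the initial state.
public class UndoableList<T>
{
    private List<T> values = new ArrayList<T>();
    private final Deque<List<T>> undoStack = new ArrayDeque<List<T>>();

    public void add(T value)
    {
        undoStack.push(new ArrayList<T>(values)); // snapshot first
        values.add(value);
    }

    public boolean undo()
    {
        List<T> memento = undoStack.poll();       // head of the LIFO stack
        if (memento == null) return false;        // nothing left to undo
        values = memento;
        return true;
    }

    public void restore()
    {
        while (undo()) { /* keep unwinding until the stack is empty */ }
    }

    public List<T> values() { return new ArrayList<T>(values); }
}
```

Like `EnumBuster.restore()`, calling `restore()` is safe at any point: it is a no-op on an empty stack and otherwise replays every snapshot in reverse order.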
-/* - *

- * Hankcs - * me@hankcs.com - * 2016-03-26 PM5:36 - * - * - * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ - * This source is subject to Hankcs. Please contact Hankcs to get more information. - * - */ -package com.hankcs.hanlp.corpus.util; - -import sun.reflect.FieldAccessor; -import sun.reflect.ReflectionFactory; - -import java.lang.reflect.Field; -import java.lang.reflect.Modifier; - -/** - * 修改final static域的反射工具 - * @author hankcs - */ -public class ReflectionHelper -{ - private static final String MODIFIERS_FIELD = "modifiers"; - - private static final ReflectionFactory reflection = - ReflectionFactory.getReflectionFactory(); - - public static void setStaticFinalField( - Field field, Object value) - throws NoSuchFieldException, IllegalAccessException - { - // 获得 public 权限 - field.setAccessible(true); - // 将modifiers域设为非final,这样就可以修改了 - Field modifiersField = - Field.class.getDeclaredField(MODIFIERS_FIELD); - modifiersField.setAccessible(true); - int modifiers = modifiersField.getInt(field); - // 去掉 final 标志位 - modifiers &= ~Modifier.FINAL; - modifiersField.setInt(field, modifiers); - FieldAccessor fa = reflection.newFieldAccessor( - field, false - ); - fa.set(null, value); - } -} diff --git a/src/main/java/com/hankcs/hanlp/dependency/AbstractDependencyParser.java b/src/main/java/com/hankcs/hanlp/dependency/AbstractDependencyParser.java index b1aaef3d2..388ec1d1a 100644 --- a/src/main/java/com/hankcs/hanlp/dependency/AbstractDependencyParser.java +++ b/src/main/java/com/hankcs/hanlp/dependency/AbstractDependencyParser.java @@ -11,7 +11,6 @@ */ package com.hankcs.hanlp.dependency; -import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord; import com.hankcs.hanlp.corpus.io.IOUtil; @@ -30,7 +29,7 @@ public abstract class AbstractDependencyParser implements IDependencyParser /** * 本Parser使用的分词器,可以自由替换 */ - private Segment segment = 
HanLP.newSegment().enablePartOfSpeechTagging(true); + private Segment segment; /** * 依存关系映射表(可以将英文标签映射为中文) */ @@ -40,6 +39,16 @@ public abstract class AbstractDependencyParser implements IDependencyParser */ private boolean enableDeprelTranslater; + public AbstractDependencyParser(Segment segment) + { + this.segment = segment; + } + + public AbstractDependencyParser() + { + this(NLPTokenizer.ANALYZER); + } + @Override public CoNLLSentence parse(String sentence) { diff --git a/src/main/java/com/hankcs/hanlp/dependency/CRFDependencyParser.java b/src/main/java/com/hankcs/hanlp/dependency/CRFDependencyParser.java deleted file mode 100644 index 90a874cbd..000000000 --- a/src/main/java/com/hankcs/hanlp/dependency/CRFDependencyParser.java +++ /dev/null @@ -1,347 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/12/11 21:09 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
- * - */ -package com.hankcs.hanlp.dependency; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; -import com.hankcs.hanlp.collection.trie.ITrie; -import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; -import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord; -import com.hankcs.hanlp.corpus.io.ByteArray; -import com.hankcs.hanlp.corpus.io.IOUtil; -import com.hankcs.hanlp.dependency.common.POSUtil; -import com.hankcs.hanlp.model.bigram.BigramDependencyModel; -import com.hankcs.hanlp.model.crf.CRFModel; -import com.hankcs.hanlp.model.crf.FeatureFunction; -import com.hankcs.hanlp.model.crf.Table; -import com.hankcs.hanlp.seg.common.Term; -import com.hankcs.hanlp.tokenizer.NLPTokenizer; -import com.hankcs.hanlp.utility.GlobalObjectPool; -import com.hankcs.hanlp.utility.Predefine; -import com.hankcs.hanlp.utility.TextUtility; - -import java.io.DataOutputStream; -import java.io.FileOutputStream; -import java.util.Iterator; -import java.util.LinkedList; -import java.util.List; - -import static com.hankcs.hanlp.utility.Predefine.logger; - -/** - * 基于随机条件场的依存句法分析器 - * - * @deprecated 关于将线性CRF序列标注应用于句法分析,我持反对意见。CRF的链式结构决定它的视野只有当前位置的前后n个单词构成的特征, - * 如果依存节点恰好落在这n个范围内还好理解,如果超出该范围,利用这个n个单词的特征推测它是不合理的。 - * 也就是说,我认为利用链式CRF预测长依存是不科学的。线性链CRF做句法分析的理论基础非常薄弱,一阶CRF这个标注模型根本无法阻止环的产生, - * 这份实现也没有复现论文的结果,所以不再维护,其模型文件也不再打包到新data里面。请使用在理论和工程上更稳定的 - * {@link com.hankcs.hanlp.dependency.nnparser.NeuralNetworkDependencyParser}。 - * - * @author hankcs - */ -public class CRFDependencyParser extends AbstractDependencyParser -{ - CRFModel crfModel; - - public CRFDependencyParser(String modelPath) - { - crfModel = GlobalObjectPool.get(modelPath); - if (crfModel != null) return; - long start = System.currentTimeMillis(); - if (load(modelPath)) - { - logger.info("加载随机条件场依存句法分析器模型" + modelPath + "成功,耗时 " + (System.currentTimeMillis() - start) + " ms"); - GlobalObjectPool.put(modelPath, crfModel); - } - else - { - logger.info("加载随机条件场依存句法分析器模型" + 
modelPath + "失败,耗时 " + (System.currentTimeMillis() - start) + " ms"); - } - } - - public CRFDependencyParser() - { - this(HanLP.Config.CRFDependencyModelPath); - } - - /** - * 分析句子的依存句法 - * - * @param termList 句子,可以是任何具有词性标注功能的分词器的分词结果 - * @return CoNLL格式的依存句法树 - */ - public static CoNLLSentence compute(List termList) - { - return new CRFDependencyParser().parse(termList); - } - - /** - * 分析句子的依存句法 - * - * @param sentence 句子 - * @return CoNLL格式的依存句法树 - */ - public static CoNLLSentence compute(String sentence) - { - return new CRFDependencyParser().parse(sentence); - } - - boolean load(String path) - { - if (loadDat(path + Predefine.BIN_EXT)) return true; - crfModel = CRFModel.loadTxt(path, new CRFModelForDependency(new DoubleArrayTrie())); // 使用特化版的CRF - return crfModel != null; - } - - boolean loadDat(String path) - { - ByteArray byteArray = ByteArray.createByteArray(path); - if (byteArray == null) return false; - crfModel = new CRFModelForDependency(new DoubleArrayTrie()); - return crfModel.load(byteArray); - } - - boolean saveDat(String path) - { - try - { - DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(path)); - crfModel.save(out); - out.close(); - } - catch (Exception e) - { - logger.warning("在缓存" + path + "时发生错误" + TextUtility.exceptionToString(e)); - return false; - } - - return true; - } - - @Override - public CoNLLSentence parse(List termList) - { - Table table = new Table(); - table.v = new String[termList.size()][4]; - Iterator iterator = termList.iterator(); - for (String[] line : table.v) - { - Term term = iterator.next(); - line[0] = term.word; - line[2] = POSUtil.compilePOS(term.nature); - line[1] = line[2].substring(0, 1); - } - crfModel.tag(table); - if (HanLP.Config.DEBUG) - { - System.out.println(table); - } - CoNLLWord[] coNLLWordArray = new CoNLLWord[table.size()]; - for (int i = 0; i < coNLLWordArray.length; i++) - { - coNLLWordArray[i] = new CoNLLWord(i + 1, table.v[i][0], table.v[i][2], table.v[i][1]); - } - int i = 0; - 
for (String[] line : table.v) - { - CRFModelForDependency.DTag dTag = new CRFModelForDependency.DTag(line[3]); - if (dTag.pos.endsWith("ROOT")) - { - coNLLWordArray[i].HEAD = CoNLLWord.ROOT; - } - else - { - int index = convertOffset2Index(dTag, table, i); - if (index == -1) - coNLLWordArray[i].HEAD = CoNLLWord.NULL; - else coNLLWordArray[i].HEAD = coNLLWordArray[index]; - } - ++i; - } - - for (i = 0; i < coNLLWordArray.length; i++) - { - coNLLWordArray[i].DEPREL = BigramDependencyModel.get(coNLLWordArray[i].NAME, coNLLWordArray[i].POSTAG, coNLLWordArray[i].HEAD.NAME, coNLLWordArray[i].HEAD.POSTAG); - } - return new CoNLLSentence(coNLLWordArray); - } - - static int convertOffset2Index(CRFModelForDependency.DTag dTag, Table table, int current) - { - int posCount = 0; - if (dTag.offset > 0) - { - for (int i = current + 1; i < table.size(); ++i) - { - if (table.v[i][1].equals(dTag.pos)) ++posCount; - if (posCount == dTag.offset) return i; - } - } - else - { - for (int i = current - 1; i >= 0; --i) - { - if (table.v[i][1].equals(dTag.pos)) ++posCount; - if (posCount == -dTag.offset) return i; - } - } - - return -1; - } - - /** - * 必须对维特比算法做一些特化修改 - */ - static class CRFModelForDependency extends CRFModel - { - - public CRFModelForDependency(ITrie featureFunctionTrie) - { - super(featureFunctionTrie); - } - - /** - * 每个tag的分解。内部类的内部类你到底累不累 - */ - static class DTag - { - int offset; - String pos; - - public DTag(String tag) - { - String[] args = tag.split("_", 2); - if (args[0].charAt(0) == '+') args[0] = args[0].substring(1); - offset = Integer.parseInt(args[0]); - pos = args[1]; - } - - @Override - public String toString() - { - return (offset > 0 ? 
"+" : "") + offset + "_" + pos; - } - } - - DTag[] id2dtag; - - @Override - public boolean load(ByteArray byteArray) - { - if (!super.load(byteArray)) return false; - initId2dtagArray(); - return true; - } - - private void initId2dtagArray() - { - id2dtag = new DTag[id2tag.length]; - for (int i = 0; i < id2tag.length; i++) - { - id2dtag[i] = new DTag(id2tag[i]); - } - } - - @Override - protected void onLoadTxtFinished() - { - super.onLoadTxtFinished(); - initId2dtagArray(); - } - - boolean isLegal(int tagId, int current, Table table) - { - DTag tag = id2dtag[tagId]; - if ("ROOT".equals(tag.pos)) - { - for (int i = 0; i < current; ++i) - { - if (table.v[i][3].endsWith("ROOT")) return false; - } - return true; - } - else - { - int posCount = 0; - if (tag.offset > 0) - { - for (int i = current + 1; i < table.size(); ++i) - { - if (table.v[i][1].equals(tag.pos)) ++posCount; - if (posCount == tag.offset) return true; - } - return false; - } - else - { - for (int i = current - 1; i >= 0; --i) - { - if (table.v[i][1].equals(tag.pos)) ++posCount; - if (posCount == -tag.offset) return true; - } - return false; - } - } - } - - @Override - public void tag(Table table) - { - int size = table.size(); - double bestScore = Double.MIN_VALUE; - int bestTag = 0; - int tagSize = id2tag.length; - LinkedList scoreList = computeScoreList(table, 0); // 0位置命中的特征函数 - for (int i = 0; i < tagSize; ++i) // -1位置的标签遍历 - { - for (int j = 0; j < tagSize; ++j) // 0位置的标签遍历 - { - if (!isLegal(j, 0, table)) continue; - double curScore = computeScore(scoreList, j); - if (matrix != null) - { - curScore += matrix[i][j]; - } - if (curScore > bestScore) - { - bestScore = curScore; - bestTag = j; - } - } - } - table.setLast(0, id2tag[bestTag]); - int preTag = bestTag; - // 0位置打分完毕,接下来打剩下的 - for (int i = 1; i < size; ++i) - { - scoreList = computeScoreList(table, i); // i位置命中的特征函数 - bestScore = Double.MIN_VALUE; - for (int j = 0; j < tagSize; ++j) // i位置的标签遍历 - { - if (!isLegal(j, i, table)) continue; - 
double curScore = computeScore(scoreList, j); - if (matrix != null) - { - curScore += matrix[preTag][j]; - } - if (curScore > bestScore) - { - bestScore = curScore; - bestTag = j; - } - } - table.setLast(i, id2tag[bestTag]); - preTag = bestTag; - } - } - } -} diff --git a/src/main/java/com/hankcs/hanlp/dependency/MaxEntDependencyParser.java b/src/main/java/com/hankcs/hanlp/dependency/MaxEntDependencyParser.java index bd6933254..7bbe39fa4 100644 --- a/src/main/java/com/hankcs/hanlp/dependency/MaxEntDependencyParser.java +++ b/src/main/java/com/hankcs/hanlp/dependency/MaxEntDependencyParser.java @@ -18,6 +18,7 @@ import com.hankcs.hanlp.corpus.io.ByteArrayFileStream; import com.hankcs.hanlp.dependency.common.Edge; import com.hankcs.hanlp.dependency.common.Node; +import com.hankcs.hanlp.dependency.perceptron.parser.KBeamArcEagerDependencyParser; import com.hankcs.hanlp.model.maxent.MaxEntModel; import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.utility.GlobalObjectPool; @@ -30,6 +31,7 @@ /** * 最大熵句法分析器 * + * @deprecated 已废弃,请使用{@link KBeamArcEagerDependencyParser}。未来版本将不再发布该模型,并删除配置项 * @author hankcs */ public class MaxEntDependencyParser extends MinimumSpanningTreeParser diff --git a/src/main/java/com/hankcs/hanlp/dependency/common/Node.java b/src/main/java/com/hankcs/hanlp/dependency/common/Node.java index adc8003e8..fb93e035e 100644 --- a/src/main/java/com/hankcs/hanlp/dependency/common/Node.java +++ b/src/main/java/com/hankcs/hanlp/dependency/common/Node.java @@ -16,12 +16,89 @@ import com.hankcs.hanlp.corpus.tag.Nature; import com.hankcs.hanlp.seg.common.Term; +import java.util.Map; +import java.util.TreeMap; + /** * 节点 * @author hankcs */ public class Node { + private final static Map natureConverter = new TreeMap(); + static + { + natureConverter.put("begin", "root"); + natureConverter.put("bg", "b"); + natureConverter.put("e", "y"); + natureConverter.put("g", "nz"); + natureConverter.put("gb", "nz"); + natureConverter.put("gbc", "nz"); + 
natureConverter.put("gc", "nz"); + natureConverter.put("gg", "nz"); + natureConverter.put("gi", "nz"); + natureConverter.put("gm", "nz"); + natureConverter.put("gp", "nz"); + natureConverter.put("i", "nz"); + natureConverter.put("j", "nz"); + natureConverter.put("l", "nz"); + natureConverter.put("mg", "Mg"); + natureConverter.put("nb", "nz"); + natureConverter.put("nba", "nz"); + natureConverter.put("nbc", "nz"); + natureConverter.put("nbp", "nz"); + natureConverter.put("nf", "n"); + natureConverter.put("nh", "nz"); + natureConverter.put("nhd", "nz"); + natureConverter.put("nhm", "nz"); + natureConverter.put("ni", "nt"); + natureConverter.put("nic", "nt"); + natureConverter.put("nis", "n"); + natureConverter.put("nit", "nt"); + natureConverter.put("nm", "n"); + natureConverter.put("nmc", "nz"); + natureConverter.put("nn", "n"); + natureConverter.put("nnd", "n"); + natureConverter.put("nnt", "n"); + natureConverter.put("ntc", "nt"); + natureConverter.put("ntcb", "nt"); + natureConverter.put("ntcf", "nt"); + natureConverter.put("ntch", "nt"); + natureConverter.put("nth", "nt"); + natureConverter.put("nto", "nt"); + natureConverter.put("nts", "nt"); + natureConverter.put("ntu", "nt"); + natureConverter.put("nx", "x"); + natureConverter.put("qg", "q"); + natureConverter.put("rg", "Rg"); + natureConverter.put("ud", "u"); + natureConverter.put("udh", "u"); + natureConverter.put("ug", "uguo"); + natureConverter.put("uj", "u"); + natureConverter.put("ul", "ulian"); + natureConverter.put("uv", "u"); + natureConverter.put("uz", "uzhe"); + natureConverter.put("w", "x"); + natureConverter.put("wb", "x"); + natureConverter.put("wd", "x"); + natureConverter.put("wf", "x"); + natureConverter.put("wh", "x"); + natureConverter.put("wj", "x"); + natureConverter.put("wky", "x"); + natureConverter.put("wkz", "x"); + natureConverter.put("wm", "x"); + natureConverter.put("wn", "x"); + natureConverter.put("wp", "x"); + natureConverter.put("ws", "x"); + natureConverter.put("wt", "x"); + 
natureConverter.put("ww", "x"); + natureConverter.put("wyy", "x"); + natureConverter.put("wyz", "x"); + natureConverter.put("xu", "x"); + natureConverter.put("xx", "x"); + natureConverter.put("yg", "y"); + natureConverter.put("zg", "z"); + } public final static Node NULL = new Node(new Term(CoNLLWord.NULL.NAME, Nature.n), -1); static { @@ -35,160 +112,10 @@ public class Node public Node(Term term, int id) { this.id = id; - switch (term.nature) - { - - case bg: - label = "b"; - break; - case mg: - label = "Mg"; - break; - case nx: - label = "x"; - break; - case qg: - label = "q"; - break; - case ud: - label = "u"; - break; - case uj: - label = "u"; - break; - case uz: - label = "uzhe"; - break; - case ug: - label = "uguo"; - break; - case ul: - label = "ulian"; - break; - case uv: - label = "u"; - break; - case yg: - label = "y"; - break; - case zg: - label = "z"; - break; - case ntc: - case ntcf: - case ntcb: - case ntch: - case nto: - case ntu: - case nts: - case nth: - label = "nt"; - break; - case nh: - case nhm: - case nhd: - label = "nz"; - break; - case nn: - label = "n"; - break; - case nnt: - label = "n"; - break; - case nnd: - label = "n"; - break; - case nf: - label = "n"; - break; - case ni: - case nit: - case nic: - label = "nt"; - break; - case nis: - label = "n"; - break; - case nm: - label = "n"; - break; - case nmc: - label = "nz"; - break; - case nb: - label = "nz"; - break; - case nba: - label = "nz"; - break; - case nbc: - case nbp: - case nz: - label = "nz"; - break; - case g: - label = "nz"; - break; - case gm: - case gp: - case gc: - case gb: - case gbc: - case gg: - case gi: - label = "nz"; - break; - case j: - label = "nz"; - break; - case i: - label = "nz"; - break; - case l: - label = "nz"; - break; - case rg: - case Rg: - label = "Rg"; - break; - case udh: - label = "u"; - break; - case e: - label = "y"; - break; - case xx: - label = "x"; - break; - case xu: - label = "x"; - break; - case w: - case wkz: - case wky: - case wyz: - case wyy: 
- case wj: - case ww: - case wt: - case wd: - case wf: - case wn: - case wm: - case ws: - case wp: - case wb: - case wh: - label = "x"; - break; - case begin: - label = "root"; - break; - default: - label = term.nature.toString(); - break; - } word = term.word; + label = natureConverter.get(term.nature.toString()); + if (label == null) + label = term.nature.toString(); compiledWord = PosTagCompiler.compile(label, word); } diff --git a/src/main/java/com/hankcs/hanlp/dependency/common/POSUtil.java b/src/main/java/com/hankcs/hanlp/dependency/common/POSUtil.java deleted file mode 100644 index 89ae239af..000000000 --- a/src/main/java/com/hankcs/hanlp/dependency/common/POSUtil.java +++ /dev/null @@ -1,185 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/12/11 21:14 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.dependency.common; - -import com.hankcs.hanlp.corpus.tag.Nature; - -/** - * 词性操作工具类 - * @author hankcs - */ -public class POSUtil -{ - /** - * - * @param nature - * @return - */ - public static String compilePOS(Nature nature) - { - String label = nature.toString(); - switch (nature) - { - - case bg: - label = "b"; - break; - case mg: - label = "Mg"; - break; - case nx: - label = "x"; - break; - case qg: - label = "q"; - break; - case ud: - label = "u"; - break; - case uj: - label = "u"; - break; - case uz: - label = "uzhe"; - break; - case ug: - label = "uguo"; - break; - case ul: - label = "ulian"; - break; - case uv: - label = "u"; - break; - case yg: - label = "y"; - break; - case zg: - label = "z"; - break; - case ntc: - case ntcf: - case ntcb: - case ntch: - case nto: - case ntu: - case nts: - case nth: - label = "nt"; - break; - case nh: - case nhm: - case nhd: - label = "nz"; - break; - case nn: - label = "n"; - break; - case nnt: - label = "n"; - break; 
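The Node.java hunk above replaces a long `switch` over `Nature` with a `TreeMap` lookup that falls back to the tag's own name when no entry exists. A minimal, self-contained sketch of that table-driven pattern (the tag names below are just a small illustrative subset, not HanLP's full table):

```java
import java.util.Map;
import java.util.TreeMap;

public class TagConverter
{
    // Conversion table; tags without an entry pass through unchanged.
    private static final Map<String, String> TABLE = new TreeMap<String, String>();

    static
    {
        TABLE.put("begin", "root");
        TABLE.put("nx", "x");
        TABLE.put("uj", "u");
    }

    static String convert(String nature)
    {
        String label = TABLE.get(nature);
        // Fallback mirrors the diff: if (label == null) label = term.nature.toString();
        return label == null ? nature : label;
    }

    public static void main(String[] args)
    {
        System.out.println(convert("nx")); // mapped entry
        System.out.println(convert("v"));  // no entry: passes through unchanged
    }
}
```

Compared with the deleted `switch`, the map keeps the conversion data in one declarative block, so adding a tag is a one-line `put` rather than a new `case`/`break` pair.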
- case nnd: - label = "n"; - break; - case nf: - label = "n"; - break; - case ni: - case nit: - case nic: - label = "nt"; - break; - case nis: - label = "n"; - break; - case nm: - label = "n"; - break; - case nmc: - label = "nz"; - break; - case nb: - label = "nz"; - break; - case nba: - label = "nz"; - break; - case nbc: - case nbp: - case nz: - label = "nz"; - break; - case g: - label = "nz"; - break; - case gm: - case gp: - case gc: - case gb: - case gbc: - case gg: - case gi: - label = "nz"; - break; - case j: - label = "nz"; - break; - case i: - label = "nz"; - break; - case l: - label = "nz"; - break; - case rg: - case Rg: - label = "Rg"; - break; - case udh: - label = "u"; - break; - case e: - label = "y"; - break; - case xx: - label = "x"; - break; - case xu: - label = "x"; - break; - case w: - case wkz: - case wky: - case wyz: - case wyy: - case wj: - case ww: - case wt: - case wd: - case wf: - case wn: - case wm: - case ws: - case wp: - case wb: - case wh: - label = "x"; - break; - case begin: - label = "root"; - break; - default: - break; - } - - return label; - } -} diff --git a/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkClassifier.java b/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkClassifier.java index ef0b7412f..5c0874bc6 100644 --- a/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkClassifier.java +++ b/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkClassifier.java @@ -382,7 +382,7 @@ void score(final List attributes, // final List classes = sample -> classes; // // Matrix Y = Matrix.Map( classes[0], classes.size()); -// Matrix _ = (Eigen.ArrayXd.Random (hidden_layer_size) > mask_prob).select( +// Matrix buffer = (Eigen.ArrayXd.Random (hidden_layer_size) > mask_prob).select( // Matrix.Ones (hidden_layer_size), // Matrix.zero(hidden_layer_size)); // Matrix hidden_layer = Matrix.zero(hidden_layer_size); diff --git 
a/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkDependencyParser.java b/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkDependencyParser.java index 7a4711562..76221b4bf 100644 --- a/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkDependencyParser.java +++ b/src/main/java/com/hankcs/hanlp/dependency/nnparser/NeuralNetworkDependencyParser.java @@ -34,12 +34,18 @@ public class NeuralNetworkDependencyParser extends AbstractDependencyParser { private parser_dll parser_dll; - public NeuralNetworkDependencyParser() + public NeuralNetworkDependencyParser(Segment segment) { + super(segment); parser_dll = new parser_dll(); setDeprelTranslater(ConfigOption.DEPRL_DESCRIPTION_PATH).enableDeprelTranslator(true); } + public NeuralNetworkDependencyParser() + { + this(NLPTokenizer.ANALYZER); + } + @Override public CoNLLSentence parse(List termList) { diff --git a/src/main/java/com/hankcs/hanlp/dependency/nnparser/util/PosTagUtil.java b/src/main/java/com/hankcs/hanlp/dependency/nnparser/util/PosTagUtil.java index 70b3689c2..268dc04eb 100644 --- a/src/main/java/com/hankcs/hanlp/dependency/nnparser/util/PosTagUtil.java +++ b/src/main/java/com/hankcs/hanlp/dependency/nnparser/util/PosTagUtil.java @@ -11,35 +11,176 @@ */ package com.hankcs.hanlp.dependency.nnparser.util; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.perceptron.PerceptronTrainer; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.instance.InstanceHandler; +import com.hankcs.hanlp.model.perceptron.utility.IOUtility; +import com.hankcs.hanlp.model.perceptron.utility.Utility; import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.tokenizer.lexical.POSTagger; -import java.util.ArrayList; -import java.util.List; +import java.util.*; /** * @author hankcs */ public class PosTagUtil { + private static Map posConverter = new 
TreeMap(); + + static + { + posConverter.put("Mg", "m"); + posConverter.put("Rg", "r"); + posConverter.put("ad", "a"); + posConverter.put("ag", "a"); + posConverter.put("al", "a"); + posConverter.put("an", "a"); + posConverter.put("begin", "x"); + posConverter.put("bg", "b"); + posConverter.put("bl", "b"); + posConverter.put("cc", "c"); + posConverter.put("dg", "d"); + posConverter.put("dl", "d"); + posConverter.put("end", "x"); + posConverter.put("f", "nd"); + posConverter.put("g", "nz"); + posConverter.put("gb", "nz"); + posConverter.put("gbc", "nz"); + posConverter.put("gc", "nz"); + posConverter.put("gg", "nz"); + posConverter.put("gi", "nz"); + posConverter.put("gm", "nz"); + posConverter.put("gp", "nz"); + posConverter.put("l", "i"); + posConverter.put("mg", "m"); + posConverter.put("mq", "m"); + posConverter.put("nb", "nz"); + posConverter.put("nba", "nz"); + posConverter.put("nbc", "nz"); + posConverter.put("nbp", "nz"); + posConverter.put("nf", "n"); + posConverter.put("ng", "n"); + posConverter.put("nh", "nz"); + posConverter.put("nhd", "nz"); + posConverter.put("nhm", "nz"); + posConverter.put("ni", "n"); + posConverter.put("nic", "nt"); + posConverter.put("nis", "nt"); + posConverter.put("nit", "nt"); + posConverter.put("nl", "n"); + posConverter.put("nm", "nz"); + posConverter.put("nmc", "nz"); + posConverter.put("nn", "nz"); + posConverter.put("nnd", "nz"); + posConverter.put("nnt", "nz"); + posConverter.put("nr", "nh"); + posConverter.put("nr1", "nh"); + posConverter.put("nr2", "nh"); + posConverter.put("nrf", "nh"); + posConverter.put("nrj", "nh"); + posConverter.put("nsf", "ns"); + posConverter.put("nt", "ni"); + posConverter.put("ntc", "ni"); + posConverter.put("ntcb", "ni"); + posConverter.put("ntcf", "ni"); + posConverter.put("ntch", "ni"); + posConverter.put("nth", "ni"); + posConverter.put("nto", "ni"); + posConverter.put("nts", "ni"); + posConverter.put("ntu", "ni"); + posConverter.put("nx", "ws"); + posConverter.put("pba", "p"); + 
posConverter.put("pbei", "p"); + posConverter.put("qg", "q"); + posConverter.put("qt", "q"); + posConverter.put("qv", "q"); + posConverter.put("rg", "r"); + posConverter.put("rr", "r"); + posConverter.put("ry", "r"); + posConverter.put("rys", "r"); + posConverter.put("ryt", "r"); + posConverter.put("ryv", "r"); + posConverter.put("rz", "r"); + posConverter.put("rzs", "r"); + posConverter.put("rzt", "r"); + posConverter.put("rzv", "r"); + posConverter.put("s", "nl"); + posConverter.put("t", "nt"); + posConverter.put("tg", "nt"); + posConverter.put("ud", "u"); + posConverter.put("ude1", "u"); + posConverter.put("ude2", "u"); + posConverter.put("ude3", "u"); + posConverter.put("udeng", "u"); + posConverter.put("udh", "u"); + posConverter.put("ug", "u"); + posConverter.put("uguo", "u"); + posConverter.put("uj", "u"); + posConverter.put("ul", "u"); + posConverter.put("ule", "u"); + posConverter.put("ulian", "u"); + posConverter.put("uls", "u"); + posConverter.put("usuo", "u"); + posConverter.put("uv", "u"); + posConverter.put("uyy", "u"); + posConverter.put("uz", "u"); + posConverter.put("uzhe", "u"); + posConverter.put("uzhi", "u"); + posConverter.put("vd", "v"); + posConverter.put("vf", "v"); + posConverter.put("vg", "v"); + posConverter.put("vi", "v"); + posConverter.put("vl", "v"); + posConverter.put("vn", "v"); + posConverter.put("vshi", "v"); + posConverter.put("vx", "v"); + posConverter.put("vyou", "v"); + posConverter.put("w", "wp"); + posConverter.put("wb", "wp"); + posConverter.put("wd", "wp"); + posConverter.put("wf", "wp"); + posConverter.put("wh", "wp"); + posConverter.put("wj", "wp"); + posConverter.put("wky", "wp"); + posConverter.put("wkz", "wp"); + posConverter.put("wm", "wp"); + posConverter.put("wn", "wp"); + posConverter.put("ws", "wp"); + posConverter.put("wt", "wp"); + posConverter.put("ww", "wp"); + posConverter.put("wyy", "wp"); + posConverter.put("wyz", "wp"); + posConverter.put("xu", "x"); + posConverter.put("xx", "x"); + posConverter.put("y", 
"e"); + posConverter.put("yg", "u"); + posConverter.put("z", "u"); + posConverter.put("zg", "u"); + } + /** * 转为863标注集
* 863词性标注集,其各个词性含义如下表: - - Tag Description Example Tag Description Example - a adjective 美丽 ni organization name 保险公司 - b other noun-modifier 大型, 西式 nl location noun 城郊 - c conjunction 和, 虽然 ns geographical name 北京 - d adverb 很 nt temporal noun 近日, 明代 - e exclamation 哎 nz other proper noun 诺贝尔奖 - g morpheme 茨, 甥 o onomatopoeia 哗啦 - h prefix 阿, 伪 p preposition 在, 把 - i idiom 百花齐放 q quantity 个 - j abbreviation 公检法 r pronoun 我们 - k suffix 界, 率 u auxiliary 的, 地 - m number 一, 第一 v verb 跑, 学习 - n general noun 苹果 wp punctuation ,。! - nd direction noun 右侧 ws foreign words CPU - nh person name 杜甫, 汤姆 x non-lexeme 萄, 翱 + *

+ * Tag Description Example Tag Description Example + * a adjective 美丽 ni organization name 保险公司 + * b other noun-modifier 大型, 西式 nl location noun 城郊 + * c conjunction 和, 虽然 ns geographical name 北京 + * d adverb 很 nt temporal noun 近日, 明代 + * e exclamation 哎 nz other proper noun 诺贝尔奖 + * g morpheme 茨, 甥 o onomatopoeia 哗啦 + * h prefix 阿, 伪 p preposition 在, 把 + * i idiom 百花齐放 q quantity 个 + * j abbreviation 公检法 r pronoun 我们 + * k suffix 界, 率 u auxiliary 的, 地 + * m number 一, 第一 v verb 跑, 学习 + * n general noun 苹果 wp punctuation ,。! + * nd direction noun 右侧 ws foreign words CPU + * nh person name 杜甫, 汤姆 x non-lexeme 萄, 翱 + * * @param termList * @return */ @@ -48,248 +189,41 @@ public static List to863(List termList) List posTagList = new ArrayList(termList.size()); for (Term term : termList) { - String posTag = "x"; - switch (term.nature) - { - case bg: - posTag = "b"; - break; - case mg: - posTag = "m"; - break; - case nl: - posTag = "n"; - break; - case nx: - posTag = "ws"; - break; - case qg: - posTag = "q"; - break; - case ud: - case uj: - case uz: - case ug: - case ul: - case uv: - posTag = "u"; - break; - case yg: - posTag = "u"; - break; - case zg: - posTag = "u"; - break; - case n: - posTag = "n"; - break; - case nr: - case nrj: - case nrf: - case nr1: - case nr2: - posTag = "nh"; - break; - case ns: - case nsf: - posTag = "ns"; - break; - case nt: - case ntc: - case ntcf: - case ntcb: - case ntch: - case nto: - case ntu: - case nts: - case nth: - posTag = "ni"; - break; - case nh: - case nhm: - case nhd: - case nn: - case nnt: - case nnd: - posTag = "nz"; - break; - case ng: - posTag = "n"; - break; - case nf: - posTag = "n"; - break; - case ni: - posTag = "n"; - break; - case nit: - case nic: - case nis: - posTag = "nt"; - break; - case nm: - case nmc: - case nb: - case nba: - case nbc: - case nbp: - case nz: - posTag = "nz"; - break; - case g: - case gm: - case gp: - case gc: - case gb: - case gbc: - case gg: - case gi: - posTag = "nz"; - break; - case j: - 
posTag = "j"; - break; - case i: - posTag = "i"; - break; - case l: - posTag = "i"; - break; - case t: - posTag = "nt"; - break; - case tg: - posTag = "nt"; - break; - case s: - posTag = "nl"; - break; - case f: - posTag = "nd"; - break; - case v: - case vd: - case vn: - case vshi: - case vyou: - case vf: - case vx: - case vi: - case vl: - case vg: - posTag = "v"; - break; - case a: - case ad: - case an: - case ag: - case al: - posTag = "a"; - break; - case b: - case bl: - posTag = "b"; - break; - case z: - posTag = "u"; - break; - case r: - case rr: - case rz: - case rzt: - case rzs: - case rzv: - case ry: - case ryt: - case rys: - case ryv: - case rg: - case Rg: - posTag = "r"; - break; - case m: - case mq: - case Mg: - posTag = "m"; - break; - case q: - case qv: - case qt: - posTag = "q"; - break; - case d: - case dg: - case dl: - posTag = "d"; - break; - case p: - case pba: - case pbei: - posTag = "p"; - break; - case c: - case cc: - posTag = "c"; - break; - case u: - case uzhe: - case ule: - case uguo: - case ude1: - case ude2: - case ude3: - case usuo: - case udeng: - case uyy: - case udh: - case uls: - case uzhi: - case ulian: - posTag = "u"; - break; - case e: - posTag = "e"; - break; - case y: - posTag = "e"; - break; - case o: - posTag = "o"; - break; - case h: - posTag = "h"; - break; - case k: - posTag = "k"; - break; - case x: - case xx: - case xu: - posTag = "x"; - break; - case w: - case wkz: - case wky: - case wyz: - case wyy: - case wj: - case ww: - case wt: - case wd: - case wf: - case wn: - case wm: - case ws: - case wp: - case wb: - case wh: - posTag = "wp"; - break; - } - + String posTag = posConverter.get(term.nature.toString()); + if (posTag == null) + posTag = term.nature.toString(); posTagList.add(posTag); } return posTagList; } + + /** + * 评估词性标注器的准确率 + * + * @param tagger 词性标注器 + * @param corpus 测试集 + * @return Accuracy百分比 + */ + public static float evaluate(POSTagger tagger, String corpus) + { + int correct = 0, total = 0; + 
IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(corpus); + for (String line : lineIterator) + { + Sentence sentence = Sentence.create(line); + if (sentence == null) continue; + String[][] wordTagArray = sentence.toWordTagArray(); + String[] prediction = tagger.tag(wordTagArray[0]); + assert prediction.length == wordTagArray[1].length; + total += prediction.length; + for (int i = 0; i < prediction.length; i++) + { + if (prediction[i].equals(wordTagArray[1][i])) + ++correct; + } + } + if (total == 0) return 0; + return correct / (float) total * 100; + } } diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/CoNLLReader.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/CoNLLReader.java new file mode 100644 index 000000000..f27214244 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/CoNLLReader.java @@ -0,0 +1,372 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. 
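The new `evaluate` helper above scores a tagger as token-level accuracy, `100 * correct / total`, with a guard for an empty corpus. The same metric in a standalone form, using plain arrays in place of HanLP's `POSTagger` and `Sentence` (those types are assumed only as context here):

```java
public class TagAccuracy
{
    /**
     * Token-level accuracy in percent: 100 * correct / total.
     * Returns 0 when there is nothing to score, mirroring the guard in evaluate().
     * Assumes gold and predicted have the same shape.
     */
    static float accuracy(String[][] gold, String[][] predicted)
    {
        int correct = 0, total = 0;
        for (int s = 0; s < gold.length; s++)
        {
            total += gold[s].length;
            for (int i = 0; i < gold[s].length; i++)
            {
                if (gold[s][i].equals(predicted[s][i])) ++correct;
            }
        }
        if (total == 0) return 0;
        return correct / (float) total * 100;
    }

    public static void main(String[] args)
    {
        String[][] gold = {{"n", "v", "u"}, {"n"}};
        String[][] pred = {{"n", "v", "x"}, {"n"}};
        System.out.println(accuracy(gold, pred)); // 3 of 4 tokens correct -> 75.0
    }
}
```

Note the `(float)` cast before the division, as in the diff: integer division would silently truncate any accuracy below 100% to 0.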
+ */ + +package com.hankcs.hanlp.dependency.perceptron.accessories; + +import com.hankcs.hanlp.dependency.perceptron.structures.IndexMaps; +import com.hankcs.hanlp.dependency.perceptron.structures.Sentence; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.CompactTree; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Instance; + +import java.io.BufferedReader; +import java.io.FileNotFoundException; +import java.io.FileReader; +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; + +public class CoNLLReader +{ + /** + * An object for reading the CoNLL file + */ + BufferedReader fileReader; + + /** + * Initializes the file reader + * + * @param filePath Path to the file + * @throws Exception If the file path is not correct or there are not enough permission to read the file + */ + public CoNLLReader(String filePath) throws FileNotFoundException + { + fileReader = new BufferedReader(new FileReader(filePath)); + } + + /** + * 读取CoNLL文件,创建索引 + * + * @param conllPath + * @param labeled + * @param lowercased + * @param clusterFile + * @return + * @throws Exception + */ + public static IndexMaps createIndices(String conllPath, boolean labeled, boolean lowercased, String clusterFile) throws IOException + { + HashMap wordMap = new HashMap(); + HashMap labels = new HashMap(); + HashMap clusterMap = new HashMap(); + HashMap cluster4Map = new HashMap(); + HashMap cluster6Map = new HashMap(); + + String rootString = "ROOT"; + + wordMap.put("ROOT", 0); + labels.put(0, 0); + + // 所有label的id必须从零开始并且连续 + BufferedReader reader = new BufferedReader(new FileReader(conllPath)); + String line; + while ((line = reader.readLine()) != null) + { + String[] args = line.trim().split("\t"); + if (args.length > 7) + { + String label = args[7]; + int head = Integer.parseInt(args[6]); + if (head == 0) + rootString = label; + + if (!labeled) + label = "~"; + else if (label.equals("_")) + label = "-"; + + if 
(!wordMap.containsKey(label))
+                {
+                    labels.put(wordMap.size(), labels.size());
+                    wordMap.put(label, wordMap.size());
+                }
+            }
+        }
+
+        reader = new BufferedReader(new FileReader(conllPath));
+        while ((line = reader.readLine()) != null)
+        {
+            String[] cells = line.trim().split("\t");
+            if (cells.length > 7)
+            {
+                String pos = cells[3];
+                if (!wordMap.containsKey(pos))
+                {
+                    wordMap.put(pos, wordMap.size());
+                }
+            }
+        }
+
+        if (clusterFile.length() > 0)
+        {
+            reader = new BufferedReader(new FileReader(clusterFile));
+            while ((line = reader.readLine()) != null)
+            {
+                String[] cells = line.trim().split("\t");
+                if (cells.length > 2)
+                {
+                    String cluster = cells[0];
+                    String word = cells[1];
+                    String prefix4 = cluster.substring(0, Math.min(4, cluster.length()));
+                    String prefix6 = cluster.substring(0, Math.min(6, cluster.length()));
+                    int clusterId = wordMap.size();
+
+                    if (!wordMap.containsKey(cluster))
+                    {
+                        clusterMap.put(word, wordMap.size());
+                        wordMap.put(cluster, wordMap.size());
+                    }
+                    else
+                    {
+                        clusterId = wordMap.get(cluster);
+                        clusterMap.put(word, clusterId);
+                    }
+
+                    int pref4Id = wordMap.size();
+                    if (!wordMap.containsKey(prefix4))
+                    {
+                        wordMap.put(prefix4, wordMap.size());
+                    }
+                    else
+                    {
+                        pref4Id = wordMap.get(prefix4);
+                    }
+
+                    int pref6Id = wordMap.size();
+                    if (!wordMap.containsKey(prefix6))
+                    {
+                        wordMap.put(prefix6, wordMap.size());
+                    }
+                    else
+                    {
+                        pref6Id = wordMap.get(prefix6);
+                    }
+
+                    cluster4Map.put(clusterId, pref4Id);
+                    cluster6Map.put(clusterId, pref6Id);
+                }
+            }
+        }
+
+        reader = new BufferedReader(new FileReader(conllPath));
+        while ((line = reader.readLine()) != null)
+        {
+            String[] cells = line.trim().split("\t");
+            if (cells.length > 7)
+            {
+                String word = cells[1];
+                if (lowercased)
+                    word = word.toLowerCase();
+                if (!wordMap.containsKey(word))
+                {
+                    wordMap.put(word, wordMap.size());
+                }
+            }
+        }
+
+        return new IndexMaps(wordMap, labels, rootString, cluster4Map, cluster6Map, clusterMap);
+    }
+
+    /**
+     * Reads sentences
+     *
+     * @param limit maximum number of sentences to read
+     *
@param keepNonProjective whether to keep non-projective trees
+     * @param labeled           whether dependency labels are used
+     * @param rootFirst         whether to put the ROOT node first
+     * @param lowerCased        whether words are lowercased
+     * @param maps              feature id map
+     * @return the list of instances
+     * @throws IOException
+     */
+    public ArrayList<Instance> readData(int limit, boolean keepNonProjective, boolean labeled, boolean rootFirst, boolean lowerCased, IndexMaps maps) throws IOException
+    {
+        HashMap<String, Integer> wordMap = maps.getWordId();
+        ArrayList<Instance> instanceList = new ArrayList<Instance>();
+
+        String line;
+        ArrayList<Integer> tokens = new ArrayList<Integer>();
+        ArrayList<Integer> tags = new ArrayList<Integer>();
+        ArrayList<Integer> cluster4Ids = new ArrayList<Integer>();
+        ArrayList<Integer> cluster6Ids = new ArrayList<Integer>();
+        ArrayList<Integer> clusterIds = new ArrayList<Integer>();
+
+        HashMap<Integer, Edge> goldDependencies = new HashMap<Integer, Edge>();
+        int sentenceCounter = 0;
+        while ((line = fileReader.readLine()) != null)
+        {
+            line = line.trim();
+            if (line.length() == 0) // blank line separating sentences
+            {
+                if (tokens.size() > 0)
+                {
+                    sentenceCounter++;
+                    if (!rootFirst)
+                    {
+                        for (Edge edge : goldDependencies.values())
+                        {
+                            if (edge.headIndex == 0)
+                                edge.headIndex = tokens.size() + 1;
+                        }
+                        tokens.add(0);
+                        tags.add(0);
+                        cluster4Ids.add(0);
+                        cluster6Ids.add(0);
+                        clusterIds.add(0);
+                    }
+                    Sentence currentSentence = new Sentence(tokens, tags, cluster4Ids, cluster6Ids, clusterIds);
+                    Instance instance = new Instance(currentSentence, goldDependencies);
+                    if (keepNonProjective || !instance.isNonprojective())
+                        instanceList.add(instance);
+                }
+                goldDependencies = new HashMap<Integer, Edge>();
+                tokens = new ArrayList<Integer>();
+                tags = new ArrayList<Integer>();
+                cluster4Ids = new ArrayList<Integer>();
+                cluster6Ids = new ArrayList<Integer>();
+                clusterIds = new ArrayList<Integer>();
+                if (sentenceCounter >= limit)
+                {
+                    System.out.println("buffer full..."
+ instanceList.size());
+                    break;
+                }
+            }
+            else
+            {
+                String[] cells = line.split("\t");
+                if (cells.length < 8)
+                    throw new IllegalArgumentException("invalid conll format");
+                int wordIndex = Integer.parseInt(cells[0]);
+                String word = cells[1].trim();
+                if (lowerCased)
+                    word = word.toLowerCase();
+                String pos = cells[3].trim();
+
+                int wi = getId(word, wordMap);
+                int pi = getId(pos, wordMap);
+
+                tags.add(pi);
+                tokens.add(wi);
+
+                int headIndex = Integer.parseInt(cells[6]);
+                String relation = cells[7];
+                if (!labeled)
+                    relation = "~";
+                else if (relation.equals("_"))
+                    relation = "-";
+
+                if (headIndex == 0)
+                    relation = "ROOT";
+
+                int ri = getId(relation, wordMap);
+                if (headIndex == -1)
+                    ri = -1;
+
+                int[] ids = maps.clusterId(word);
+                clusterIds.add(ids[0]);
+                cluster4Ids.add(ids[1]);
+                cluster6Ids.add(ids[2]);
+
+                if (headIndex >= 0)
+                    goldDependencies.put(wordIndex, new Edge(headIndex, ri));
+            }
+        }
+        if (tokens.size() > 0)
+        {
+            if (!rootFirst)
+            {
+                for (int gold : goldDependencies.keySet())
+                {
+                    if (goldDependencies.get(gold).headIndex == 0)
+                        goldDependencies.get(gold).headIndex = goldDependencies.size() + 1;
+                }
+                tokens.add(0);
+                tags.add(0);
+                cluster4Ids.add(0);
+                cluster6Ids.add(0);
+                clusterIds.add(0);
+            }
+            sentenceCounter++;
+            Sentence currentSentence = new Sentence(tokens, tags, cluster4Ids, cluster6Ids, clusterIds);
+            instanceList.add(new Instance(currentSentence, goldDependencies));
+        }
+
+        return instanceList;
+    }
+
+    private static int getId(String word, HashMap<String, Integer> wordMap)
+    {
+        return getId(word, wordMap, -1);
+    }
+
+    private static int getId(String word, HashMap<String, Integer> wordMap, int defaultValue)
+    {
+        Integer id = wordMap.get(word);
+        if (id == null) return defaultValue;
+        return id;
+    }
+
+    public ArrayList<CompactTree> readStringData() throws IOException
+    {
+        ArrayList<CompactTree> treeSet = new ArrayList<CompactTree>();
+
+        String line;
+        ArrayList<String> tags = new ArrayList<String>();
+
+        HashMap<Integer, Pair<Integer, String>> goldDependencies = new HashMap<Integer, Pair<Integer, String>>();
+        while ((line = fileReader.readLine()) != null)
+        {
+            line = line.trim();
+            if (line.length() == 0)
+            {
+                if (tags.size() >= 1)
+                {
+                    CompactTree goldConfiguration = new CompactTree(goldDependencies, tags);
+                    treeSet.add(goldConfiguration);
+                }
+                tags = new ArrayList<String>();
+                goldDependencies = new HashMap<Integer, Pair<Integer, String>>();
+            }
+            else
+            {
+                String[] splitLine = line.split("\t");
+                if (splitLine.length < 8)
+                    throw new IllegalArgumentException("wrong file format");
+                int wordIndex = Integer.parseInt(splitLine[0]);
+                String pos = splitLine[3].trim();
+
+                tags.add(pos);
+
+                int headIndex = Integer.parseInt(splitLine[6]);
+                String relation = splitLine[7];
+
+                if (headIndex == 0)
+                {
+                    relation = "ROOT";
+                }
+
+                if (pos.length() > 0)
+                    goldDependencies.put(wordIndex, new Pair<Integer, String>(headIndex, relation));
+            }
+        }
+
+        if (tags.size() > 0)
+        {
+            treeSet.add(new CompactTree(goldDependencies, tags));
+        }
+
+        return treeSet;
+    }
+
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Edge.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Edge.java
new file mode 100644
index 000000000..26a82b515
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Edge.java
@@ -0,0 +1,39 @@
+/*
+ * Han He
+ * me@hankcs.com
+ * 2018-04-04 2:40 PM
+ *
+ *
+ * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
+ * This source is subject to Han He. Please contact Han He to get more information.
+ *
+ */
+package com.hankcs.hanlp.dependency.perceptron.accessories;
+
+/**
+ * An edge in a dependency tree
+ * @author hankcs
+ */
+public class Edge
+{
+    /**
+     * index of the head word
+     */
+    public int headIndex;
+    /**
+     * id of the dependency label
+     */
+    public int relationId;
+
+    public Edge(int headIndex, int relationId)
+    {
+        this.headIndex = headIndex;
+        this.relationId = relationId;
+    }
+
+    @Override
+    public Edge clone()
+    {
+        return new Edge(headIndex, relationId);
+    }
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Evaluator.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Evaluator.java
new file mode 100644
index 000000000..6c3ed09e8
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Evaluator.java
@@ -0,0 +1,90 @@
+/**
+ * Copyright 2014, Yahoo! Inc.
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */
+
+package com.hankcs.hanlp.dependency.perceptron.accessories;
+
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.CompactTree;
+
+import java.io.IOException;
+import java.text.DecimalFormat;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+
+public class Evaluator
+{
+    public static double[] evaluate(String testPath, String predictedPath, HashSet<String> puncTags) throws IOException
+    {
+        CoNLLReader goldReader = new CoNLLReader(testPath);
+        CoNLLReader predictedReader = new CoNLLReader(predictedPath);
+
+        ArrayList<CompactTree> goldConfiguration = goldReader.readStringData();
+        ArrayList<CompactTree> predConfiguration = predictedReader.readStringData();
+
+        float unlabMatch = 0f;
+        float labMatch = 0f;
+        int all = 0;
+
+        float fullULabMatch = 0f;
+        float fullLabMatch = 0f;
+        int numTree = 0;
+
+        for (int i = 0; i < predConfiguration.size(); i++)
+        {
+            HashMap<Integer, Pair<Integer, String>> goldDeps = goldConfiguration.get(i).goldDependencies;
+            HashMap<Integer, Pair<Integer, String>> predDeps = predConfiguration.get(i).goldDependencies;
+
+            ArrayList<String> goldTags = goldConfiguration.get(i).posTags;
+
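The per-dependent comparison that follows counts, for every non-punctuation token, whether the predicted head matches the gold head (unlabeled attachment, UAS) and additionally whether the label matches (labeled attachment, LAS). The metric in isolation can be sketched as follows; the class and method names are illustrative, not part of HanLP:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the UAS/LAS computation (illustrative, not HanLP API).
public class AttachmentScore {
    // gold/pred map a dependent's index to {head index, label}
    public static double[] score(Map<Integer, String[]> gold, Map<Integer, String[]> pred) {
        int all = 0, headMatch = 0, labelMatch = 0;
        for (Map.Entry<Integer, String[]> e : gold.entrySet()) {
            String[] g = e.getValue();
            String[] p = pred.get(e.getKey());
            all++;
            if (p != null && p[0].equals(g[0])) {
                headMatch++;              // head correct: counts toward UAS
                if (p[1].equals(g[1]))
                    labelMatch++;         // head and label correct: counts toward LAS
            }
        }
        return new double[]{100.0 * headMatch / all, 100.0 * labelMatch / all};
    }

    // toy sentence: 3 dependents, one arc has the right head but the wrong label
    public static double[] demo() {
        Map<Integer, String[]> gold = new HashMap<Integer, String[]>();
        gold.put(1, new String[]{"2", "nsubj"});
        gold.put(2, new String[]{"0", "ROOT"});
        gold.put(3, new String[]{"2", "dobj"});
        Map<Integer, String[]> pred = new HashMap<Integer, String[]>();
        pred.put(1, new String[]{"2", "nsubj"});
        pred.put(2, new String[]{"0", "ROOT"});
        pred.put(3, new String[]{"2", "amod"});
        return score(gold, pred);
    }

    public static void main(String[] args) {
        double[] s = demo();
        System.out.println("UAS=" + s[0] + " LAS=" + s[1]);
    }
}
```

On the toy sentence, UAS is 100.0 (every head is right) while LAS drops to about 66.7 because one arc carries the wrong label, mirroring how `unlabMatch` and `labMatch` diverge above.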
+            numTree++;
+            boolean fullMatch = true;
+            boolean fullUnlabMatch = true;
+            for (int dep : goldDeps.keySet())
+            {
+                if (!puncTags.contains(goldTags.get(dep - 1).trim()))
+                {
+                    all++;
+                    int gh = goldDeps.get(dep).first;
+                    int ph = predDeps.get(dep).first;
+                    String gl = goldDeps.get(dep).second.trim();
+                    String pl = predDeps.get(dep).second.trim();
+
+                    if (ph == gh)
+                    {
+                        unlabMatch++;
+
+                        if (pl.equals(gl))
+                            labMatch++;
+                        else
+                        {
+                            fullMatch = false;
+                        }
+                    }
+                    else
+                    {
+                        fullMatch = false;
+                        fullUnlabMatch = false;
+                    }
+                }
+            }
+
+            if (fullMatch)
+                fullLabMatch++;
+            if (fullUnlabMatch)
+                fullULabMatch++;
+        }
+
+//        DecimalFormat format = new DecimalFormat("##.00");
+        double labeledAccuracy = 100.0 * labMatch / all;
+        double unlabeledAccuracy = 100.0 * unlabMatch / all;
+//        System.err.println("Labeled accuracy: " + format.format(labeledAccuracy));
+//        System.err.println("Unlabeled accuracy: " + format.format(unlabeledAccuracy));
+        double labExact = 100.0 * fullLabMatch / numTree;
+        double ulabExact = 100.0 * fullULabMatch / numTree;
+//        System.err.println("Labeled exact match: " + format.format(labExact));
+//        System.err.println("Unlabeled exact match: " + format.format(ulabExact) + " \n");
+        return new double[]{unlabeledAccuracy, labeledAccuracy};
+    }
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Options.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Options.java
new file mode 100644
index 000000000..155eb3b36
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Options.java
@@ -0,0 +1,455 @@
+/**
+ * Copyright 2014, Yahoo! Inc.
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */ + +package com.hankcs.hanlp.dependency.perceptron.accessories; + +import java.io.BufferedReader; +import java.io.FileReader; +import java.io.Serializable; +import java.util.ArrayList; +import java.util.HashSet; + +public class Options implements Serializable +{ + public boolean train; + public boolean parseTaggedFile; + public boolean parseConllFile; + public int beamWidth; + public boolean rootFirst; + public boolean showHelp; + public boolean labeled; + public String inputFile; + public String outputFile; + public String devPath; + public int trainingIter; + public boolean evaluate; + public boolean parsePartialConll; + public String scorePath; + public String clusterFile; + + public String modelFile; + public boolean lowercase; + public boolean useExtendedFeatures; + public boolean useExtendedWithBrownClusterFeatures; + public boolean useMaxViol; + public boolean useDynamicOracle; + public boolean useRandomOracleSelection; + public String separator; + public int numOfThreads; + + public String goldFile; + + public HashSet punctuations; + public String predFile; + + public int partialTrainingStartingIteration; + + public Options() + { + showHelp = false; + train = false; + parseConllFile = false; + parseTaggedFile = false; + beamWidth = 64; + rootFirst = false; + modelFile = ""; + outputFile = ""; + inputFile = ""; + devPath = ""; + scorePath = ""; + separator = "_"; + clusterFile = ""; + labeled = true; + lowercase = false; + useExtendedFeatures = true; + useMaxViol = true; + useDynamicOracle = true; + useRandomOracleSelection = false; + trainingIter = 20; + evaluate = false; + numOfThreads = Runtime.getRuntime().availableProcessors(); + useExtendedWithBrownClusterFeatures = false; + parsePartialConll = false; + + partialTrainingStartingIteration = 3; + + punctuations = new HashSet(); + punctuations.add("#"); + punctuations.add("''"); + punctuations.add("("); + punctuations.add(")"); + punctuations.add("["); + punctuations.add("]"); + punctuations.add("{"); 
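Several of the defaults set above are later overridden by `name:value` command-line flags (`beam:64`, `iter:20`, `nt:8`, `pt:3`), which `processArgs` handles by taking the substring after the last colon. A minimal, self-contained sketch of that convention (the helper class is illustrative; unlike `processArgs` it falls back to a default on malformed input rather than throwing):

```java
// Illustrative sketch of the "name:value" flag convention used by Options.processArgs.
public class FlagParser {
    // parses the trailing integer of a "name:value" flag, e.g. "beam:64" -> 64
    public static int intFlag(String arg, int defaultValue) {
        int colon = arg.lastIndexOf(':');
        if (colon < 0 || colon == arg.length() - 1)
            return defaultValue;          // no value part present
        try {
            return Integer.parseInt(arg.substring(colon + 1));
        } catch (NumberFormatException e) {
            return defaultValue;          // value part is not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(intFlag("beam:128", 64)); // 128
        System.out.println(intFlag("beam:", 64));    // 64 (missing value)
    }
}
```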
+        punctuations.add("}");
+        punctuations.add("\"");
+        punctuations.add(",");
+        punctuations.add(".");
+        punctuations.add(":");
+        punctuations.add("``");
+        punctuations.add("-LRB-");
+        punctuations.add("-RRB-");
+        punctuations.add("-LSB-");
+        punctuations.add("-RSB-");
+        punctuations.add("-LCB-");
+        punctuations.add("-RCB-");
+        punctuations.add("!");
+        punctuations.add("$");
+        punctuations.add("?");
+    }
+
+    public static void showHelp()
+    {
+        StringBuilder output = new StringBuilder();
+        output.append("\u00a9 Yara Parser\n");
+        output.append("\u00a9 Copyright 2014, Yahoo! Inc.\n");
+        output.append("\u00a9 Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.\n");
+        output.append("http://www.apache.org/licenses/LICENSE-2.0\n");
+        output.append("With modifications by HanLP project.\n");
+        output.append("\n");
+
+        output.append("Usage:\n");
+
+        output.append("* Train a parser:\n");
+        output.append("\tjava -jar hanlp.jar com.hankcs.hanlp.dependency.perceptron.parser.Main train -train-file [train-file] -dev [dev-file] -model [model-file] -punc [punc-file]\n");
+        output.append("\t** The model for each iteration is saved with the pattern [model-file]_iter[iter#]; e.g.
 model_iter2\n");
+        output.append("\t** [punc-file]: file containing the list of pos tags for punctuations in the treebank, one per line\n");
+        output.append("\t** Other options\n");
+        output.append("\t \t -cluster [cluster-file] Brown cluster file: at most 4096 clusters are supported by the parser (default: empty)\n\t\t\t the format should be the same as https://github.com/percyliang/brown-cluster/blob/master/output.txt \n");
+        output.append("\t \t beam:[beam-width] (default:64)\n");
+        output.append("\t \t iter:[training-iterations] (default:20)\n");
+        output.append("\t \t unlabeled (default: labeled parsing, unless explicitly put `unlabeled')\n");
+        output.append("\t \t lowercase (default: case-sensitive words, unless explicitly put 'lowercase')\n");
+        output.append("\t \t basic (default: use extended feature set, unless explicitly put 'basic')\n");
+        output.append("\t \t early (default: use max violation update, unless explicitly put `early' for early update)\n");
+        output.append("\t \t static (default: use dynamic oracles, unless explicitly put `static' for static oracles)\n");
+        output.append("\t \t random (default: choose maximum scoring oracle, unless explicitly put `random' for randomly choosing an oracle)\n");
+        output.append("\t \t nt:[#_of_threads] (default:8)\n");
+        output.append("\t \t pt:[#partial_training_starting_iteration] (default:3; shows the starting iteration for considering partial trees)\n");
+        output.append("\t \t root_first (default: put ROOT in the last position, unless explicitly put 'root_first')\n\n");
+
+        output.append("* Parse a CoNLL'2006 file:\n");
+        output.append("\tjava -jar hanlp.jar com.hankcs.hanlp.dependency.perceptron.parser.Main parse_conll -input [test-file] -out [output-file] -model [model-file] nt:[#_of_threads (optional -- default:8)] \n");
+        output.append("\t** The test file should have the conll 2006 format\n");
+        output.append("\t** Optional: -score [score file] averaged score of each output parse tree in a
file\n\n"); + + output.append("* Parse a tagged file:\n"); + output.append("\tjava -jar hanlp.jar com.hankcs.hanlp.dependency.perceptron.parser.Main parse_tagged -input [test-file] -out [output-file] -model [model-file] nt:[#_of_threads (optional -- default:8)] \n"); + output.append("\t** The test file should have each sentence in line and word_tag pairs are space-delimited\n"); + output.append("\t** Optional: -delim [delim] (default is _)\n"); + output.append("\t \t Example: He_PRP is_VBZ nice_AJ ._.\n\n"); + + output.append("* Parse a CoNLL'2006 file with partial gold trees:\n"); + output.append("\tjava -jar hanlp.jar com.hankcs.hanlp.dependency.perceptron.parser.Main parse_partial -input [test-file] -out [output-file] -model [model-file] nt:[#_of_threads (optional -- default:8)] \n"); + output.append("\t** The test file should have the conll 2006 format; each word that does not have a parent, should have a -1 parent-index"); + output.append("\t** Optional: -score [score file] averaged score of each output parse tree in a file\n\n"); + + output.append("* Evaluate a Conll file:\n"); + output.append("\tjava -jar hanlp.jar com.hankcs.hanlp.dependency.perceptron.parser.Main eval -gold [gold-file] -parse [parsed-file] -punc [punc-file]\n"); + output.append("\t** [punc-file]: File contains list of pos tags for punctuations in the treebank, each in one line\n"); + output.append("\t** Both files should have conll 2006 format\n"); + System.out.println(output.toString()); + } + + public static Options processArgs(String[] args) throws Exception + { + Options options = new Options(); + + for (int i = 0; i < args.length; i++) + { + if (args[i].equals("--help") || args[i].equals("-h") || args[i].equals("-help")) + options.showHelp = true; + else if (args[i].equals("train")) + options.train = true; + else if (args[i].equals("parse_conll")) + options.parseConllFile = true; + else if (args[i].equals("parse_partial")) + options.parsePartialConll = true; + else if 
(args[i].equals("eval"))
+                options.evaluate = true;
+            else if (args[i].equals("parse_tagged"))
+                options.parseTaggedFile = true;
+            else if (args[i].equals("-train-file") || args[i].equals("-input"))
+                options.inputFile = args[i + 1];
+            else if (args[i].equals("-punc"))
+                options.changePunc(args[i + 1]);
+            else if (args[i].equals("-model"))
+                options.modelFile = args[i + 1];
+            else if (args[i].startsWith("-dev"))
+                options.devPath = args[i + 1];
+            else if (args[i].equals("-gold"))
+                options.goldFile = args[i + 1];
+            else if (args[i].startsWith("-parse"))
+                options.predFile = args[i + 1];
+            else if (args[i].startsWith("-cluster"))
+            {
+                options.clusterFile = args[i + 1];
+                options.useExtendedWithBrownClusterFeatures = true;
+            }
+            else if (args[i].startsWith("-out"))
+                options.outputFile = args[i + 1];
+            else if (args[i].startsWith("-delim"))
+                options.separator = args[i + 1];
+            else if (args[i].startsWith("beam:"))
+                options.beamWidth = Integer.parseInt(args[i].substring(args[i].lastIndexOf(":") + 1));
+            else if (args[i].startsWith("nt:"))
+                options.numOfThreads = Integer.parseInt(args[i].substring(args[i].lastIndexOf(":") + 1));
+            else if (args[i].startsWith("pt:"))
+                options.partialTrainingStartingIteration = Integer.parseInt(args[i].substring(args[i].lastIndexOf(":") + 1));
+            else if (args[i].equals("unlabeled"))
+                options.labeled = false;
+            else if (args[i].equals("lowercase"))
+                options.lowercase = true;
+            else if (args[i].startsWith("-score"))
+                options.scorePath = args[i + 1];
+            else if (args[i].equals("basic"))
+                options.useExtendedFeatures = false;
+            else if (args[i].equals("early"))
+                options.useMaxViol = false;
+            else if (args[i].equals("static"))
+                options.useDynamicOracle = false;
+            else if (args[i].equals("random"))
+                options.useRandomOracleSelection = true;
+            else if (args[i].equals("root_first"))
+                options.rootFirst = true;
+            else if (args[i].startsWith("iter:"))
+                options.trainingIter =
Integer.parseInt(args[i].substring(args[i].lastIndexOf(":") + 1));
+        }
+
+        if (options.train || options.parseTaggedFile || options.parseConllFile)
+            options.showHelp = false;
+
+        return options;
+    }
+
+    public static ArrayList<Options> getAllPossibleOptions(Options option)
+    {
+        ArrayList<Options> options = new ArrayList<Options>();
+        options.add(option);
+
+        ArrayList<Options> tmp = new ArrayList<Options>();
+
+        for (Options opt : options)
+        {
+            Options o1 = opt.clone();
+            o1.labeled = true;
+
+            Options o2 = opt.clone();
+            o2.labeled = false;
+            tmp.add(o1);
+            tmp.add(o2);
+        }
+
+        options = tmp;
+        tmp = new ArrayList<Options>();
+
+        for (Options opt : options)
+        {
+            Options o1 = opt.clone();
+            o1.lowercase = true;
+
+            Options o2 = opt.clone();
+            o2.lowercase = false;
+            tmp.add(o1);
+            tmp.add(o2);
+        }
+
+        options = tmp;
+        tmp = new ArrayList<Options>();
+
+        for (Options opt : options)
+        {
+            Options o1 = opt.clone();
+            o1.useExtendedFeatures = true;
+
+            Options o2 = opt.clone();
+            o2.useExtendedFeatures = false;
+            tmp.add(o1);
+            tmp.add(o2);
+        }
+
+        options = tmp;
+        tmp = new ArrayList<Options>();
+
+        for (Options opt : options)
+        {
+            Options o1 = opt.clone();
+            o1.useDynamicOracle = true;
+
+            Options o2 = opt.clone();
+            o2.useDynamicOracle = false;
+            tmp.add(o1);
+            tmp.add(o2);
+        }
+
+        options = tmp;
+        tmp = new ArrayList<Options>();
+
+        for (Options opt : options)
+        {
+            Options o1 = opt.clone();
+            o1.useMaxViol = true;
+
+            Options o2 = opt.clone();
+            o2.useMaxViol = false;
+            tmp.add(o1);
+            tmp.add(o2);
+        }
+
+        options = tmp;
+        tmp = new ArrayList<Options>();
+
+        for (Options opt : options)
+        {
+            Options o1 = opt.clone();
+            o1.useRandomOracleSelection = true;
+
+            Options o2 = opt.clone();
+            o2.useRandomOracleSelection = false;
+            tmp.add(o1);
+            tmp.add(o2);
+        }
+
+        options = tmp;
+        tmp = new ArrayList<Options>();
+
+        for (Options opt : options)
+        {
+            Options o1 = opt.clone();
+            o1.rootFirst = true;
+
+            Options o2 = opt.clone();
+            o2.rootFirst = false;
+            tmp.add(o1);
+            tmp.add(o2);
+        }
+
+        options = tmp;
+        return options;
+    }
+
+    public void changePunc(String
puncPath) throws Exception + { + BufferedReader reader = new BufferedReader(new FileReader(puncPath)); + + punctuations = new HashSet(); + String line; + while ((line = reader.readLine()) != null) + { + line = line.trim(); + if (line.length() > 0) + punctuations.add(line.split(" ")[0].trim()); + } + } + + public String toString() + { + if (train) + { + StringBuilder builder = new StringBuilder(); + builder.append("train file: " + inputFile + "\n"); + builder.append("dev file: " + devPath + "\n"); + builder.append("cluster file: " + clusterFile + "\n"); + builder.append("beam width: " + beamWidth + "\n"); + builder.append("rootFirst: " + rootFirst + "\n"); + builder.append("labeled: " + labeled + "\n"); + builder.append("lower-case: " + lowercase + "\n"); + builder.append("extended features: " + useExtendedFeatures + "\n"); + builder.append("extended with brown cluster features: " + useExtendedWithBrownClusterFeatures + "\n"); + builder.append("updateModel: " + (useMaxViol ? "max violation" : "early") + "\n"); + builder.append("oracle: " + (useDynamicOracle ? "dynamic" : "static") + "\n"); + if (useDynamicOracle) + builder.append("oracle selection: " + (!useRandomOracleSelection ? 
"latent max" : "random") + "\n");
+
+            builder.append("training-iterations: " + trainingIter + "\n");
+            builder.append("number of threads: " + numOfThreads + "\n");
+            builder.append("partial training starting iteration: " + partialTrainingStartingIteration + "\n");
+            return builder.toString();
+        }
+        else if (parseConllFile)
+        {
+            StringBuilder builder = new StringBuilder();
+            builder.append("parse conll" + "\n");
+            builder.append("input file: " + inputFile + "\n");
+            builder.append("output file: " + outputFile + "\n");
+            builder.append("model file: " + modelFile + "\n");
+            builder.append("score file: " + scorePath + "\n");
+            builder.append("number of threads: " + numOfThreads + "\n");
+            return builder.toString();
+        }
+        else if (parseTaggedFile)
+        {
+            StringBuilder builder = new StringBuilder();
+            builder.append("parse tag file" + "\n");
+            builder.append("input file: " + inputFile + "\n");
+            builder.append("output file: " + outputFile + "\n");
+            builder.append("model file: " + modelFile + "\n");
+            builder.append("score file: " + scorePath + "\n");
+            builder.append("number of threads: " + numOfThreads + "\n");
+            return builder.toString();
+        }
+        else if (parsePartialConll)
+        {
+            StringBuilder builder = new StringBuilder();
+            builder.append("parse partial conll" + "\n");
+            builder.append("input file: " + inputFile + "\n");
+            builder.append("output file: " + outputFile + "\n");
+            builder.append("score file: " + scorePath + "\n");
+            builder.append("model file: " + modelFile + "\n");
+            builder.append("labeled: " + labeled + "\n");
+            builder.append("number of threads: " + numOfThreads + "\n");
+            return builder.toString();
+        }
+        else if (evaluate)
+        {
+            StringBuilder builder = new StringBuilder();
+            builder.append("Evaluate" + "\n");
+            builder.append("gold file: " + goldFile + "\n");
+            builder.append("parsed file: " + predFile + "\n");
+            return builder.toString();
+        }
+        return "";
+    }
+
+    public Options clone()
+    {
+        Options options = new Options();
+        options.train = train;
options.labeled = labeled; + options.trainingIter = trainingIter; + options.useMaxViol = useMaxViol; + options.beamWidth = beamWidth; + options.devPath = devPath; + options.evaluate = evaluate; + options.goldFile = goldFile; + options.inputFile = inputFile; + options.lowercase = lowercase; + options.numOfThreads = numOfThreads; + options.outputFile = outputFile; + options.useDynamicOracle = useDynamicOracle; + options.modelFile = modelFile; + options.rootFirst = rootFirst; + options.parseConllFile = parseConllFile; + options.parseTaggedFile = parseTaggedFile; + options.predFile = predFile; + options.showHelp = showHelp; + options.separator = separator; + options.useExtendedFeatures = useExtendedFeatures; + options.parsePartialConll = parsePartialConll; + options.partialTrainingStartingIteration = partialTrainingStartingIteration; + return options; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Pair.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Pair.java new file mode 100644 index 000000000..399d21f9e --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/accessories/Pair.java @@ -0,0 +1,69 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. 
+ */
+
+package com.hankcs.hanlp.dependency.perceptron.accessories;
+
+import java.io.Serializable;
+
+public class Pair<T1, T2> implements Comparable, Cloneable, Serializable
+{
+
+    public T1 first;
+    public T2 second;
+
+    public Pair(T1 first, T2 second)
+    {
+        this.first = first;
+        this.second = second;
+    }
+
+    public void setFirst(T1 first)
+    {
+        this.first = first;
+    }
+
+    @Override
+    public Pair<T1, T2> clone()
+    {
+        return new Pair<T1, T2>(first, second);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+        if (!(o instanceof Pair))
+            return false;
+        Pair pair = (Pair) o;
+
+        if (pair.second == null)
+            if (second == null)
+                return pair.first.equals(first);
+            else
+                return false;
+        if (second == null)
+            return false;
+        return pair.first.equals(first) && pair.second.equals(second);
+    }
+
+    @Override
+    public int hashCode()
+    {
+        int firstHash = 0;
+        int secondHash = 0;
+        if (first != null)
+            firstHash = first.hashCode();
+        if (second != null)
+            secondHash = second.hashCode();
+        return firstHash + secondHash;
+    }
+
+    @Override
+    public int compareTo(Object o)
+    {
+        if (equals(o))
+            return 0;
+        // compare hash codes without the overflow risk of plain subtraction
+        int h1 = hashCode();
+        int h2 = o.hashCode();
+        return h1 < h2 ? -1 : h1 == h2 ? 0 : 1;
+    }
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/learning/AveragedPerceptron.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/learning/AveragedPerceptron.java
new file mode 100644
index 000000000..ef691e927
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/learning/AveragedPerceptron.java
@@ -0,0 +1,315 @@
+/**
+ * Copyright 2014, Yahoo! Inc.
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */ + +package com.hankcs.hanlp.dependency.perceptron.learning; + +import com.hankcs.hanlp.dependency.perceptron.structures.ParserModel; +import com.hankcs.hanlp.dependency.perceptron.transition.parser.Action; +import com.hankcs.hanlp.dependency.perceptron.structures.CompactArray; + +import java.util.HashMap; + +public class AveragedPerceptron +{ + /** + * This class tries to implement averaged Perceptron algorithm + * Collins, Michael. "Discriminative training methods for hidden Markov models: Theory and experiments with Perceptron algorithms." + * In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pp. 1-8. + * Association for Computational Linguistics, 2002. + *
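The averaging optimization used by this class keeps, next to each raw weight, an accumulator that receives `iteration * change` on every update (see `changeWeight` below); the averaged weight is then recoverable in O(1) per feature instead of summing the whole history. A minimal scalar sketch of the trick (names are illustrative, and it is simplified in that the counter here advances on every update, whereas the parser advances `iteration` once per training instance):

```java
// Scalar sketch of the averaged-perceptron trick: keep the raw weight plus an
// accumulator of iteration-weighted changes; the mean over all timesteps is
// then recoverable without storing the full update history.
public class AveragedWeight {
    private double w;     // current weight
    private double u;     // accumulator: sum of (step * change)
    private int step = 1; // update counter, starting at 1 like AveragedPerceptron.iteration

    public void update(double change) {
        w += change;
        u += step * change;
        step++;
    }

    // mean of the weight over the (step - 1) timesteps seen so far:
    // sum_k w_k / T = w + (w - u) / T
    public double averaged() {
        int T = step - 1;
        return T == 0 ? 0 : w + (w - u) / T;
    }

    public static void main(String[] args) {
        AveragedWeight aw = new AveragedWeight();
        aw.update(1);  // weight is 1 at step 1
        aw.update(1);  // weight is 2 at step 2
        System.out.println(aw.averaged()); // mean of {1, 2} = 1.5
    }
}
```

Averaging damps late, noisy updates: a large change applied on the last step barely moves the averaged weight, which is why decoding uses the averaged tables while training updates the raw ones.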

+ * The averaging update is also optimized by using the trick introduced in Hal Daume's dissertation.
+ * For more information see the second chapter of his thesis:
+ * Harold Charles Daume' III. "Practical Structured Learning Techniques for Natural Language Processing", PhD thesis, ISI USC, 2006.
+ * http://www.umiacs.umd.edu/~hal/docs/daume06thesis.pdf
+ */
+    /**
+     * Weights for all features
+     */
+    public HashMap<Object, Float>[] shiftFeatureWeights;
+    public HashMap<Object, Float>[] reduceFeatureWeights;
+    public HashMap<Object, CompactArray>[] leftArcFeatureWeights;
+    public HashMap<Object, CompactArray>[] rightArcFeatureWeights;
+
+    public int iteration;
+    public int dependencySize;
+    /**
+     * The main extension to the original perceptron algorithm: averaging over the whole update history
+     */
+    public HashMap<Object, Float>[] shiftFeatureAveragedWeights;
+    public HashMap<Object, Float>[] reduceFeatureAveragedWeights;
+    public HashMap<Object, CompactArray>[] leftArcFeatureAveragedWeights;
+    public HashMap<Object, CompactArray>[] rightArcFeatureAveragedWeights;
+
+    public AveragedPerceptron(int featSize, int dependencySize)
+    {
+        shiftFeatureWeights = new HashMap[featSize];
+        reduceFeatureWeights = new HashMap[featSize];
+        leftArcFeatureWeights = new HashMap[featSize];
+        rightArcFeatureWeights = new HashMap[featSize];
+
+        shiftFeatureAveragedWeights = new HashMap[featSize];
+        reduceFeatureAveragedWeights = new HashMap[featSize];
+        leftArcFeatureAveragedWeights = new HashMap[featSize];
+        rightArcFeatureAveragedWeights = new HashMap[featSize];
+        for (int i = 0; i < featSize; i++)
+        {
+            shiftFeatureWeights[i] = new HashMap<Object, Float>();
+            reduceFeatureWeights[i] = new HashMap<Object, Float>();
+            leftArcFeatureWeights[i] = new HashMap<Object, CompactArray>();
+            rightArcFeatureWeights[i] = new HashMap<Object, CompactArray>();
+
+            shiftFeatureAveragedWeights[i] = new HashMap<Object, Float>();
+            reduceFeatureAveragedWeights[i] = new HashMap<Object, Float>();
+            leftArcFeatureAveragedWeights[i] = new HashMap<Object, CompactArray>();
+            rightArcFeatureAveragedWeights[i] = new HashMap<Object, CompactArray>();
+        }
+
+        iteration = 1;
+        this.dependencySize = dependencySize;
+    }
+
+    private AveragedPerceptron(HashMap<Object, Float>[]
shiftFeatureAveragedWeights, HashMap<Object, Float>[] reduceFeatureAveragedWeights,
+                               HashMap<Object, CompactArray>[] leftArcFeatureAveragedWeights, HashMap<Object, CompactArray>[] rightArcFeatureAveragedWeights,
+                               int dependencySize)
+    {
+        this.shiftFeatureAveragedWeights = shiftFeatureAveragedWeights;
+        this.reduceFeatureAveragedWeights = reduceFeatureAveragedWeights;
+        this.leftArcFeatureAveragedWeights = leftArcFeatureAveragedWeights;
+        this.rightArcFeatureAveragedWeights = rightArcFeatureAveragedWeights;
+        this.dependencySize = dependencySize;
+    }
+
+    public AveragedPerceptron(ParserModel parserModel)
+    {
+        this(parserModel.shiftFeatureAveragedWeights, parserModel.reduceFeatureAveragedWeights, parserModel.leftArcFeatureAveragedWeights, parserModel.rightArcFeatureAveragedWeights, parserModel.dependencySize);
+    }
+
+    public float changeWeight(Action actionType, int slotNum, Object featureName, int labelIndex, float change)
+    {
+        if (featureName == null)
+            return 0;
+        if (actionType == Action.Shift)
+        {
+            if (!shiftFeatureWeights[slotNum].containsKey(featureName))
+                shiftFeatureWeights[slotNum].put(featureName, change);
+            else
+                shiftFeatureWeights[slotNum].put(featureName, shiftFeatureWeights[slotNum].get(featureName) + change);
+
+            if (!shiftFeatureAveragedWeights[slotNum].containsKey(featureName))
+                shiftFeatureAveragedWeights[slotNum].put(featureName, iteration * change);
+            else
+                shiftFeatureAveragedWeights[slotNum].put(featureName, shiftFeatureAveragedWeights[slotNum].get(featureName) + iteration * change);
+        }
+        else if (actionType == Action.Reduce)
+        {
+            if (!reduceFeatureWeights[slotNum].containsKey(featureName))
+                reduceFeatureWeights[slotNum].put(featureName, change);
+            else
+                reduceFeatureWeights[slotNum].put(featureName, reduceFeatureWeights[slotNum].get(featureName) + change);
+
+            if (!reduceFeatureAveragedWeights[slotNum].containsKey(featureName))
+                reduceFeatureAveragedWeights[slotNum].put(featureName, iteration * change);
+            else
+                reduceFeatureAveragedWeights[slotNum].put(featureName,
reduceFeatureAveragedWeights[slotNum].get(featureName) + iteration * change); + } + else if (actionType == Action.RightArc) + { + changeFeatureWeight(rightArcFeatureWeights[slotNum], rightArcFeatureAveragedWeights[slotNum], featureName, labelIndex, change, dependencySize); + } + else if (actionType == Action.LeftArc) + { + changeFeatureWeight(leftArcFeatureWeights[slotNum], leftArcFeatureAveragedWeights[slotNum], featureName, labelIndex, change, dependencySize); + } + + return change; + } + + public void changeFeatureWeight(HashMap map, HashMap aMap, Object featureName, int labelIndex, float change, int size) + { + CompactArray values = map.get(featureName); + CompactArray aValues; + if (values != null) + { + values.set(labelIndex, change); + aValues = aMap.get(featureName); + aValues.set(labelIndex, iteration * change); + } + else + { + float[] val = new float[]{change}; + values = new CompactArray(labelIndex, val); + map.put(featureName, values); + + float[] aVal = new float[]{iteration * change}; + aValues = new CompactArray(labelIndex, aVal); + aMap.put(featureName, aValues); + } + } + + + /** + * Adds to the iterations + */ + public void incrementIteration() + { + iteration++; + } + + public float shiftScore(final Object[] features, boolean decode) + { + float score = 0.0f; + + HashMap[] map = decode ? shiftFeatureAveragedWeights : shiftFeatureWeights; + + for (int i = 0; i < features.length; i++) + { + if (features[i] == null || (i >= 26 && i < 32)) // [26, 32) is distance feature + continue; + Float weight = map[i].get(features[i]); + if (weight != null) + { + score += weight; + } + } + + return score; + } + + public float reduceScore(final Object[] features, boolean decode) + { + float score = 0.0f; + + HashMap[] map = decode ? 
reduceFeatureAveragedWeights : reduceFeatureWeights; + + for (int i = 0; i < features.length; i++) + { + if (features[i] == null || (i >= 26 && i < 32)) + continue; + Float values = map[i].get(features[i]); + if (values != null) + { + score += values; + } + } + + return score; + } + + public float[] leftArcScores(final Object[] features, boolean decode) + { + float scores[] = new float[dependencySize]; + + HashMap[] map = decode ? leftArcFeatureAveragedWeights : leftArcFeatureWeights; + + for (int i = 0; i < features.length; i++) + { + if (features[i] == null) + continue; + CompactArray values = map[i].get(features[i]); + if (values != null) + { + int offset = values.getOffset(); + float[] weightVector = values.getArray(); + + for (int d = offset; d < offset + weightVector.length; d++) + { + scores[d] += weightVector[d - offset]; + } + } + } + + return scores; + } + + public float[] rightArcScores(final Object[] features, boolean decode) + { + float scores[] = new float[dependencySize]; + + HashMap[] map = decode ? 
rightArcFeatureAveragedWeights : rightArcFeatureWeights; + + for (int i = 0; i < features.length; i++) + { + if (features[i] == null) + continue; + CompactArray values = map[i].get(features[i]); + if (values != null) + { + int offset = values.getOffset(); + float[] weightVector = values.getArray(); + + for (int d = offset; d < offset + weightVector.length; d++) + { + scores[d] += weightVector[d - offset]; + } + } + } + + return scores; + } + + public int featureSize() + { + return shiftFeatureAveragedWeights.length; + } + + public int raSize() + { + int size = 0; + for (int i = 0; i < rightArcFeatureAveragedWeights.length; i++) + { + for (Object feat : rightArcFeatureAveragedWeights[i].keySet()) + { + size += rightArcFeatureAveragedWeights[i].get(feat).length(); + } + } + return size; + } + + public int effectiveRaSize() + { + int size = 0; + for (int i = 0; i < rightArcFeatureAveragedWeights.length; i++) + { + for (Object feat : rightArcFeatureAveragedWeights[i].keySet()) + { + for (float f : rightArcFeatureAveragedWeights[i].get(feat).getArray()) + if (f != 0f) + size++; + } + } + return size; + } + + + public int laSize() + { + int size = 0; + for (int i = 0; i < leftArcFeatureAveragedWeights.length; i++) + { + for (Object feat : leftArcFeatureAveragedWeights[i].keySet()) + { + size += leftArcFeatureAveragedWeights[i].get(feat).length(); + } + } + return size; + } + + public int effectiveLaSize() + { + int size = 0; + for (int i = 0; i < leftArcFeatureAveragedWeights.length; i++) + { + for (Object feat : leftArcFeatureAveragedWeights[i].keySet()) + { + for (float f : leftArcFeatureAveragedWeights[i].get(feat).getArray()) + if (f != 0f) + size++; + } + } + return size; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/package-info.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/package-info.java new file mode 100644 index 000000000..79646cd5b --- /dev/null +++
b/src/main/java/com/hankcs/hanlp/dependency/perceptron/package-info.java @@ -0,0 +1,12 @@ +/** + * This package wraps and optimizes Yara Parser, with the following main improvements: + * - Code refactoring for better reuse (the dynamic oracle needs to create features on the fly during training, + * so HanLP's perceptron framework cannot be reused; this is one of the reasons for wrapping this module instead of reimplementing it.) + * - Interface adjustments for integration with the lexical analyzer + * - Bug fixes + * - Documentation comments + * The copyright and license information of Yara Parser is as follows: + * © Copyright 2014-2015, Yahoo! Inc. + * © Licensed under the terms of the Apache License 2.0. + */ +package com.hankcs.hanlp.dependency.perceptron; \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/parser/KBeamArcEagerDependencyParser.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/parser/KBeamArcEagerDependencyParser.java new file mode 100644 index 000000000..3b2ab94df --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/parser/KBeamArcEagerDependencyParser.java @@ -0,0 +1,169 @@ +/* + * Han He + * me@hankcs.com + * 2019-01-08 12:35 PM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information.
+ * + */ +package com.hankcs.hanlp.dependency.perceptron.parser; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord; +import com.hankcs.hanlp.dependency.AbstractDependencyParser; +import com.hankcs.hanlp.dependency.perceptron.accessories.Evaluator; +import com.hankcs.hanlp.dependency.perceptron.accessories.Options; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration; +import com.hankcs.hanlp.dependency.perceptron.transition.parser.KBeamArcEagerParser; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; + +import java.io.File; +import java.io.IOException; +import java.util.Date; +import java.util.List; +import java.util.concurrent.ExecutionException; + +/** + * Beam-search dependency parser based on the ArcEager transition system, with an averaged perceptron as the classifier + * + * @author hankcs + */ +public class KBeamArcEagerDependencyParser extends AbstractDependencyParser +{ + KBeamArcEagerParser parser; + + public KBeamArcEagerDependencyParser() throws IOException, ClassNotFoundException + { + this(HanLP.Config.PerceptronParserModelPath); + } + + public KBeamArcEagerDependencyParser(Segment segment, KBeamArcEagerParser parser) + { + super(segment); + this.parser = parser; + } + + public KBeamArcEagerDependencyParser(KBeamArcEagerParser parser) + { + this.parser = parser; + } + + public KBeamArcEagerDependencyParser(String modelPath) throws IOException, ClassNotFoundException + { + this(modelPath, HanLP.Config.PerceptronCWSModelPath, HanLP.Config.PerceptronPOSModelPath.replaceFirst("data/model/.*?.bin", "data/model/perceptron/ctb/pos.bin")); + } + + public KBeamArcEagerDependencyParser(String modelPath, String cwsModelPath, String posModelPath) throws IOException, ClassNotFoundException + { + this(new PerceptronLexicalAnalyzer(cwsModelPath, posModelPath).enableCustomDictionary(false), new
KBeamArcEagerParser(modelPath)); + } + + /** + * Train a dependency parser + * + * @param trainCorpus training set + * @param devCorpus development set + * @param clusterPath Brown word cluster file + * @param modelPath path to save the model + * @throws InterruptedException + * @throws ExecutionException + * @throws IOException + * @throws ClassNotFoundException + */ + public static KBeamArcEagerDependencyParser train(String trainCorpus, String devCorpus, String clusterPath, String modelPath) throws InterruptedException, ExecutionException, IOException, ClassNotFoundException + { + Options options = new Options(); + options.train = true; + options.inputFile = trainCorpus; + options.devPath = devCorpus; + options.clusterFile = clusterPath; + options.modelFile = modelPath; + Main.train(options); + return new KBeamArcEagerDependencyParser(modelPath); + } + + /** + * Standard evaluation + * + * @param testCorpus test corpus + * @return an array containing UF and LF + * @throws IOException + * @throws ExecutionException + * @throws InterruptedException + */ + public double[] evaluate(String testCorpus) throws IOException, ExecutionException, InterruptedException + { + Options options = parser.options; + options.goldFile = testCorpus; + File tmpTemplate = File.createTempFile("pred-" + new Date().getTime(), ".conll"); + tmpTemplate.deleteOnExit(); + options.predFile = tmpTemplate.getAbsolutePath(); + options.outputFile = options.predFile; + File scoreFile = File.createTempFile("score-" + new Date().getTime(), ".txt"); + scoreFile.deleteOnExit(); + parser.parseConllFile(testCorpus, options.outputFile, options.rootFirst, options.beamWidth, true, + options.lowercase, 1, false, scoreFile.getAbsolutePath()); + return Evaluator.evaluate(options.goldFile, options.predFile, options.punctuations); + } + + @Override + public CoNLLSentence parse(List termList) + { + return parse(termList, 64, 1); + } + + /** + * Perform dependency parsing + * + * @param termList segmentation result + * @param beamWidth beam width + * @param numOfThreads number of threads + * @return dependency tree + */ + public CoNLLSentence parse(List termList, int beamWidth, int
numOfThreads) + { + String[] words = new String[termList.size()]; + String[] tags = new String[termList.size()]; + int k = 0; + for (Term term : termList) + { + words[k] = term.word; + tags[k] = term.nature.toString(); + ++k; + } + + Configuration bestParse; + try + { + bestParse = parser.parse(words, tags, false, beamWidth, numOfThreads); + } + catch (Exception e) + { + throw new RuntimeException(e); + } + CoNLLWord[] wordArray = new CoNLLWord[termList.size()]; + for (int i = 0; i < words.length; i++) + { + wordArray[i] = new CoNLLWord(i + 1, words[i], tags[i]); + } + for (int i = 0; i < words.length; i++) + { + wordArray[i].DEPREL = parser.idWord(bestParse.state.getDependent(i + 1)); + int index = bestParse.state.getHead(i + 1) - 1; + if (index < 0 || index >= wordArray.length) + { + wordArray[i].HEAD = CoNLLWord.ROOT; + } + else + { + wordArray[i].HEAD = wordArray[index]; + } + } + return new CoNLLSentence(wordArray); + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/parser/Main.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/parser/Main.java new file mode 100644 index 000000000..9b668b3de --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/parser/Main.java @@ -0,0 +1,147 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. 
+ */ + +package com.hankcs.hanlp.dependency.perceptron.parser; + +import com.hankcs.hanlp.dependency.perceptron.structures.IndexMaps; +import com.hankcs.hanlp.dependency.perceptron.accessories.CoNLLReader; +import com.hankcs.hanlp.dependency.perceptron.accessories.Evaluator; +import com.hankcs.hanlp.dependency.perceptron.accessories.Options; +import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron; +import com.hankcs.hanlp.dependency.perceptron.structures.ParserModel; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Instance; +import com.hankcs.hanlp.dependency.perceptron.transition.parser.KBeamArcEagerParser; +import com.hankcs.hanlp.dependency.perceptron.transition.trainer.ArcEagerBeamTrainer; + +import java.io.FileNotFoundException; +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.concurrent.ExecutionException; + +public class Main +{ + public static void main(String[] args) throws Exception + { + Options options = Options.processArgs(args); + + if (options.showHelp) + { + Options.showHelp(); + } + else + { + System.out.println(options); + if (options.train) + { + train(options); + } + else if (options.parseTaggedFile || options.parseConllFile || options.parsePartialConll) + { + parse(options); + } + else if (options.evaluate) + { + evaluate(options); + } + else + { + Options.showHelp(); + } + } + System.exit(0); + } + + private static void evaluate(Options options) throws Exception + { + if (options.goldFile.equals("") || options.predFile.equals("")) + Options.showHelp(); + else + { + Evaluator.evaluate(options.goldFile, options.predFile, options.punctuations); + } + } + + private static void parse(Options options) throws Exception + { + if (options.outputFile.equals("") || options.inputFile.equals("") + || options.modelFile.equals("")) + { + Options.showHelp(); + + } + else + { + ParserModel parserModel = new ParserModel(options.modelFile); + ArrayList 
dependencyLabels = parserModel.dependencyLabels; + IndexMaps maps = parserModel.maps; + + + Options inf_options = parserModel.options; + AveragedPerceptron averagedPerceptron = new AveragedPerceptron(parserModel); + + int featureSize = averagedPerceptron.featureSize(); + KBeamArcEagerParser parser = new KBeamArcEagerParser(averagedPerceptron, dependencyLabels, featureSize, maps, options.numOfThreads, options); + + if (options.parseTaggedFile) + parser.parseTaggedFile(options.inputFile, + options.outputFile, inf_options.rootFirst, inf_options.beamWidth, inf_options.lowercase, options.separator, options.numOfThreads); + else if (options.parseConllFile) + parser.parseConllFile(options.inputFile, + options.outputFile, inf_options.rootFirst, inf_options.beamWidth, true, inf_options.lowercase, options.numOfThreads, false, options.scorePath); + else if (options.parsePartialConll) + parser.parseConllFile(options.inputFile, + options.outputFile, inf_options.rootFirst, inf_options.beamWidth, options.labeled, inf_options.lowercase, options.numOfThreads, true, options.scorePath); + parser.shutDownLiveThreads(); + } + } + + public static void train(Options options) throws IOException, ExecutionException, InterruptedException + { + if (options.inputFile.equals("") || options.modelFile.equals("")) + { + Options.showHelp(); + } + else + { + IndexMaps maps = CoNLLReader.createIndices(options.inputFile, options.labeled, options.lowercase, options.clusterFile); + CoNLLReader reader = new CoNLLReader(options.inputFile); + ArrayList dataSet = reader.readData(Integer.MAX_VALUE, false, options.labeled, options.rootFirst, options.lowercase, maps); +// System.out.println("Finished reading the CoNLL file."); + + ArrayList dependencyLabels = new ArrayList(); + dependencyLabels.addAll(maps.getLabels().keySet()); + + int featureLength = options.useExtendedFeatures ?
72 : 26; + if (options.useExtendedWithBrownClusterFeatures || maps.hasClusters()) + featureLength = 153; + + System.out.println("Number of training sentences: " + dataSet.size()); + + HashMap labels = new HashMap(); + labels.put("sh", labels.size()); + labels.put("rd", labels.size()); + labels.put("us", labels.size()); + for (int label : dependencyLabels) + { + if (options.labeled) + { + labels.put("ra_" + label, 3 + label); + labels.put("la_" + label, 3 + dependencyLabels.size() + label); + } + else + { + labels.put("ra_" + label, 3); + labels.put("la_" + label, 4); + } + } + + ArcEagerBeamTrainer trainer = new ArcEagerBeamTrainer(options.useMaxViol ? "max_violation" : "early", + new AveragedPerceptron(featureLength, dependencyLabels.size()), + options, dependencyLabels, featureLength, maps); + trainer.train(dataSet, options.devPath, options.trainingIter, options.modelFile, options.lowercase, options.punctuations, options.partialTrainingStartingIteration); + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/CompactArray.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/CompactArray.java new file mode 100644 index 000000000..04bb36f29 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/CompactArray.java @@ -0,0 +1,80 @@ +package com.hankcs.hanlp.dependency.perceptron.structures; + +import java.io.Serializable; + +/** + * Created by Mohammad Sadegh Rasooli.
+ * ML-NLP Lab, Department of Computer Science, Columbia University + * Date Created: 2/5/15 + * Time: 10:27 PM + * To report any bugs or problems contact rasooli@cs.columbia.edu + */ + +/** + * A sparse array in which only one contiguous interval is actually allocated + */ +public class CompactArray implements Serializable +{ + float[] array; + int offset; + + public CompactArray(int offset, float[] array) + { + this.offset = offset; + this.array = array; + } + + /** + * Adds value to the element at index (note: inside the allocated interval this accumulates rather than overwrites) + * + * @param index the index in the virtual (unallocated) array + * @param value the value to accumulate + */ + public void set(int index, float value) + { + if (index < offset + array.length && index >= offset) + { + array[index - offset] += value; + } + else if (index < offset) + { //expand from left + int gap = offset - index; + int newSize = gap + array.length; + float[] newArray = new float[newSize]; + newArray[0] = value; + for (int i = 0; i < array.length; i++) + { + newArray[gap + i] = array[i]; + } + this.offset = index; + this.array = newArray; + } + else + { + int gap = index - (array.length + offset - 1); + int newSize = array.length + gap; + float[] newArray = new float[newSize]; + newArray[newSize - 1] = value; + for (int i = 0; i < array.length; i++) + { + newArray[i] = array[i]; + } + this.array = newArray; + } + } + + public float[] getArray() + { + return array; + } + + public int getOffset() + { + return offset; + } + + public int length() + { + return array.length; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/IndexMaps.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/IndexMaps.java new file mode 100644 index 000000000..8a8f26ae7 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/IndexMaps.java @@ -0,0 +1,166 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */ + +package com.hankcs.hanlp.dependency.perceptron.structures; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.HashMap; + +/** + * A structure that pools all strings together and assigns them ids + */ +public class IndexMaps implements Serializable +{ + /** + * ROOT + */ + public final String rootString; + /** + * uid to word + */ + public String[] idWord; + /** + * word (including pos, label and cluster) to uid + */ + private HashMap wordId; + /** + * label id to uid; all label ids must start from zero and be contiguous + */ + private HashMap labels; + /** + * cluster id to prefix 4 id + */ + private HashMap brown4Clusters; + private HashMap brown6Clusters; + /** + * word to cluster id + */ + private HashMap brownFullClusters; + + public IndexMaps(HashMap wordId, HashMap labels, String rootString, + HashMap brown4Clusters, HashMap brown6Clusters, HashMap brownFullClusters) + { + this.wordId = wordId; + this.labels = labels; + + idWord = new String[wordId.size() + 1]; + idWord[0] = "ROOT"; + + for (String word : wordId.keySet()) + { + idWord[wordId.get(word)] = word; + } + + this.brown4Clusters = brown4Clusters; + this.brown6Clusters = brown6Clusters; + this.brownFullClusters = brownFullClusters; + this.rootString = rootString; + } + + /** + * Converts the strings in a sentence to ids + * + * @param words + * @param posTags + * @param rootFirst + * @param lowerCased + * @return + */ + public Sentence makeSentence(String[] words, String[] posTags, boolean rootFirst, boolean lowerCased) + { + ArrayList tokens = new ArrayList(); + ArrayList tags = new ArrayList(); + ArrayList bc4 = new ArrayList(); + ArrayList bc6 = new ArrayList(); + ArrayList bcf = new ArrayList(); + + int i = 0; + for (String word : words) + { + if (word.length() == 0) + continue; + String lowerCaseWord = word.toLowerCase(); + if (lowerCased) + word = lowerCaseWord; + + int[] clusterIDs = clusterId(word); + bcf.add(clusterIDs[0]); + bc4.add(clusterIDs[1]); + bc6.add(clusterIDs[2]); + + String pos = posTags[i]; + + int wi = -1; + if (wordId.containsKey(word)) + wi = wordId.get(word);
+ + int pi = -1; + if (wordId.containsKey(pos)) + pi = wordId.get(pos); + + tokens.add(wi); + tags.add(pi); + + i++; + } + + if (!rootFirst) + { + tokens.add(0); + tags.add(0); + bcf.add(0); + bc6.add(0); + bc4.add(0); + } + + return new Sentence(tokens, tags, bc4, bc6, bcf); + } + + public HashMap getWordId() + { + return wordId; + } + + /** + * Dependency relation labels + * + * @return + */ + public HashMap getLabels() + { + return labels; + } + + /** + * Gets the cluster ids + * + * @param word + * @return + */ + public int[] clusterId(String word) + { + int[] ids = new int[3]; + ids[0] = -100; + ids[1] = -100; + ids[2] = -100; + if (brownFullClusters.containsKey(word)) + ids[0] = brownFullClusters.get(word); + + if (ids[0] > 0) + { + ids[1] = brown4Clusters.get(ids[0]); + ids[2] = brown6Clusters.get(ids[0]); + } + return ids; + } + + public boolean hasClusters() + { + if (brownFullClusters != null && brownFullClusters.size() > 0) + return true; + return false; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/ParserModel.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/ParserModel.java new file mode 100644 index 000000000..fdf290bbd --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/ParserModel.java @@ -0,0 +1,161 @@ +package com.hankcs.hanlp.dependency.perceptron.structures; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dependency.perceptron.accessories.Options; +import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron; + +import java.io.*; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.zip.GZIPInputStream; +import java.util.zip.GZIPOutputStream; + +/** + * Created by Mohammad Sadegh Rasooli.
+ * ML-NLP Lab, Department of Computer Science, Columbia University + * Date Created: 1/8/15 + * Time: 11:41 AM + * To report any bugs or problems contact rasooli@cs.columbia.edu + */ + +/** + * Dependency parsing model (parameters, hyperparameters, vocabulary, etc.) + */ +public class ParserModel +{ + public HashMap[] shiftFeatureAveragedWeights; + public HashMap[] reduceFeatureAveragedWeights; + public HashMap[] leftArcFeatureAveragedWeights; + public HashMap[] rightArcFeatureAveragedWeights; + public int dependencySize; + + public IndexMaps maps; + public ArrayList dependencyLabels; + public Options options; + + public ParserModel(HashMap[] shiftFeatureAveragedWeights, HashMap[] reduceFeatureAveragedWeights, HashMap[] leftArcFeatureAveragedWeights, HashMap[] rightArcFeatureAveragedWeights, + IndexMaps maps, ArrayList dependencyLabels, Options options, int dependencySize) + { + this.shiftFeatureAveragedWeights = shiftFeatureAveragedWeights; + this.reduceFeatureAveragedWeights = reduceFeatureAveragedWeights; + this.leftArcFeatureAveragedWeights = leftArcFeatureAveragedWeights; + this.rightArcFeatureAveragedWeights = rightArcFeatureAveragedWeights; + this.maps = maps; + this.dependencyLabels = dependencyLabels; + this.options = options; + this.dependencySize = dependencySize; + } + + public ParserModel(AveragedPerceptron perceptron, IndexMaps maps, ArrayList dependencyLabels, Options options) + { + shiftFeatureAveragedWeights = new HashMap[perceptron.shiftFeatureAveragedWeights.length]; + reduceFeatureAveragedWeights = new HashMap[perceptron.reduceFeatureAveragedWeights.length]; + + HashMap[] map = perceptron.shiftFeatureWeights; + HashMap[] avgMap = perceptron.shiftFeatureAveragedWeights; + this.dependencySize = perceptron.dependencySize; + + for (int i = 0; i < shiftFeatureAveragedWeights.length; i++) + { + shiftFeatureAveragedWeights[i] = new HashMap(); + for (Object feat : map[i].keySet()) + { + float vals = map[i].get(feat); + float avgVals = avgMap[i].get(feat); + float newVals = vals - (avgVals /
perceptron.iteration); + shiftFeatureAveragedWeights[i].put(feat, newVals); + } + } + + HashMap[] map4 = perceptron.reduceFeatureWeights; + HashMap[] avgMap4 = perceptron.reduceFeatureAveragedWeights; + this.dependencySize = perceptron.dependencySize; + + for (int i = 0; i < reduceFeatureAveragedWeights.length; i++) + { + reduceFeatureAveragedWeights[i] = new HashMap(); + for (Object feat : map4[i].keySet()) + { + float vals = map4[i].get(feat); + float avgVals = avgMap4[i].get(feat); + float newVals = vals - (avgVals / perceptron.iteration); + reduceFeatureAveragedWeights[i].put(feat, newVals); + } + } + + leftArcFeatureAveragedWeights = new HashMap[perceptron.leftArcFeatureAveragedWeights.length]; + HashMap[] map2 = perceptron.leftArcFeatureWeights; + HashMap[] avgMap2 = perceptron.leftArcFeatureAveragedWeights; + + for (int i = 0; i < leftArcFeatureAveragedWeights.length; i++) + { + leftArcFeatureAveragedWeights[i] = new HashMap(); + for (Object feat : map2[i].keySet()) + { + CompactArray vals = map2[i].get(feat); + CompactArray avgVals = avgMap2[i].get(feat); + leftArcFeatureAveragedWeights[i].put(feat, getAveragedCompactArray(vals, avgVals, perceptron.iteration)); + } + } + + rightArcFeatureAveragedWeights = new HashMap[perceptron.rightArcFeatureAveragedWeights.length]; + HashMap[] map3 = perceptron.rightArcFeatureWeights; + HashMap[] avgMap3 = perceptron.rightArcFeatureAveragedWeights; + + for (int i = 0; i < rightArcFeatureAveragedWeights.length; i++) + { + rightArcFeatureAveragedWeights[i] = new HashMap(); + for (Object feat : map3[i].keySet()) + { + CompactArray vals = map3[i].get(feat); + CompactArray avgVals = avgMap3[i].get(feat); + rightArcFeatureAveragedWeights[i].put(feat, getAveragedCompactArray(vals, avgVals, perceptron.iteration)); + } + } + + this.maps = maps; + this.dependencyLabels = dependencyLabels; + this.options = options; + } + + public ParserModel(String modelPath) throws IOException, ClassNotFoundException + { + ObjectInputStream reader 
= new ObjectInputStream(new GZIPInputStream(IOUtil.newInputStream(modelPath))); + dependencyLabels = (ArrayList) reader.readObject(); + maps = (IndexMaps) reader.readObject(); + options = (Options) reader.readObject(); + shiftFeatureAveragedWeights = (HashMap[]) reader.readObject(); + reduceFeatureAveragedWeights = (HashMap[]) reader.readObject(); + leftArcFeatureAveragedWeights = (HashMap[]) reader.readObject(); + rightArcFeatureAveragedWeights = (HashMap[]) reader.readObject(); + dependencySize = reader.readInt(); + reader.close(); + } + + public void saveModel(String modelPath) throws IOException + { + ObjectOutput writer = new ObjectOutputStream(new GZIPOutputStream(IOUtil.newOutputStream(modelPath))); + writer.writeObject(dependencyLabels); + writer.writeObject(maps); + writer.writeObject(options); + writer.writeObject(shiftFeatureAveragedWeights); + writer.writeObject(reduceFeatureAveragedWeights); + writer.writeObject(leftArcFeatureAveragedWeights); + writer.writeObject(rightArcFeatureAveragedWeights); + writer.writeInt(dependencySize); + writer.close(); + } + + private CompactArray getAveragedCompactArray(CompactArray ca, CompactArray aca, int iteration) + { + int offset = ca.getOffset(); + float[] a = ca.getArray(); + float[] aa = aca.getArray(); + float[] aNew = new float[a.length]; + for (int i = 0; i < a.length; i++) + { + aNew[i] = a[i] - (aa[i] / iteration); + } + return new CompactArray(offset, aNew); + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/Sentence.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/Sentence.java new file mode 100644 index 000000000..67091d174 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/structures/Sentence.java @@ -0,0 +1,126 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. 
+ */ + +package com.hankcs.hanlp.dependency.perceptron.structures; + + +import java.util.ArrayList; + +/** + * A sentence in a CoNLL file + */ +public class Sentence implements Comparable +{ + /** + * word ids + */ + private int[] words; + /** + * POS tags + */ + private int[] tags; + + private int[] brownCluster4thPrefix; + private int[] brownCluster6thPrefix; + private int[] brownClusterFullString; + + + public Sentence(ArrayList tokens, ArrayList pos, ArrayList brownCluster4thPrefix, ArrayList brownCluster6thPrefix, ArrayList brownClusterFullString) + { + words = new int[tokens.size()]; + tags = new int[tokens.size()]; + this.brownCluster4thPrefix = new int[tokens.size()]; + this.brownCluster6thPrefix = new int[tokens.size()]; + this.brownClusterFullString = new int[tokens.size()]; + for (int i = 0; i < tokens.size(); i++) + { + words[i] = tokens.get(i); + tags[i] = pos.get(i); + this.brownCluster4thPrefix[i] = brownCluster4thPrefix.get(i); + this.brownCluster6thPrefix[i] = brownCluster6thPrefix.get(i); + this.brownClusterFullString[i] = brownClusterFullString.get(i); + } + } + + public int size() + { + return words.length; + } + + public int posAt(int position) + { + if (position == 0) + return 0; + + return tags[position - 1]; + } + + public int[] getWords() + { + return words; + } + + public int[] getTags() + { + return tags; + } + + + public int[] getBrownCluster4thPrefix() + { + return brownCluster4thPrefix; + } + + + public int[] getBrownCluster6thPrefix() + { + return brownCluster6thPrefix; + } + + public int[] getBrownClusterFullString() + { + return brownClusterFullString; + } + + @Override + public boolean equals(Object obj) + { + if (obj instanceof Sentence) + { + Sentence sentence = (Sentence) obj; + if (sentence.words.length != words.length) + return false; + for (int i = 0; i < sentence.words.length; i++) + { + if (sentence.words[i] != words[i]) + return false; + if (sentence.tags[i] != tags[i]) + return false; + } + return true; + } + return false; + } + + @Override +
public int compareTo(Object o) + { + if (equals(o)) + return 0; + return hashCode() - o.hashCode(); + } + + @Override + public int hashCode() + { + int hash = 0; + for (int tokenId = 0; tokenId < words.length; tokenId++) + { + hash ^= (words[tokenId] * tags[tokenId]); + } + return hash; + } + +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/BeamElement.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/BeamElement.java new file mode 100644 index 000000000..90b61dedd --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/BeamElement.java @@ -0,0 +1,41 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. + */ + +package com.hankcs.hanlp.dependency.perceptron.transition.configuration; + +public class BeamElement implements Comparable +{ + public float score; + public int index; + public int action; + public int label; + + public BeamElement(float score, int index, int action, int label) + { + this.score = score; + this.index = index; + this.action = action; + this.label = label; + } + + @Override + public int compareTo(BeamElement beamElement) + { + float diff = score - beamElement.score; + if (diff > 0) + return 2; + if (diff < 0) + return -2; + if (index != beamElement.index) + return beamElement.index - index; + return beamElement.action - action; + } + + @Override + public boolean equals(Object o) + { + return false; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/CompactTree.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/CompactTree.java new file mode 100644 index 000000000..2ec523bcf --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/CompactTree.java @@ -0,0 +1,23 @@ +/** + * Copyright 2014, Yahoo! Inc. 
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. + */ + +package com.hankcs.hanlp.dependency.perceptron.transition.configuration; + +import com.hankcs.hanlp.dependency.perceptron.accessories.Pair; + +import java.util.ArrayList; +import java.util.HashMap; + +public class CompactTree +{ + public HashMap goldDependencies; + public ArrayList posTags; + + public CompactTree(HashMap goldDependencies, ArrayList posTags) + { + this.goldDependencies = goldDependencies; + this.posTags = posTags; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/Configuration.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/Configuration.java new file mode 100644 index 000000000..af3d3ee75 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/Configuration.java @@ -0,0 +1,127 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */
+
+package com.hankcs.hanlp.dependency.perceptron.transition.configuration;
+
+import com.hankcs.hanlp.dependency.perceptron.structures.Sentence;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+
+/**
+ * A parser configuration composed of a stack, a buffer and arcs; it additionally records the action history that led to this state, together with its score
+ */
+public class Configuration implements Comparable, Cloneable, Serializable
+{
+    public Sentence sentence;
+
+    public State state;
+
+    public ArrayList<Integer> actionHistory;
+
+    public float score;
+
+    public Configuration(Sentence sentence, boolean rootFirst)
+    {
+        this.sentence = sentence;
+        state = new State(sentence.size(), rootFirst);
+        score = 0.0f;
+        actionHistory = new ArrayList<Integer>(2 * (sentence.size() + 1));
+    }
+
+    public Configuration(Sentence sentence)
+    {
+        this.sentence = sentence;
+        state = new State(sentence.size());
+        score = 0.0f;
+        actionHistory = new ArrayList<Integer>(2 * (sentence.size() + 1));
+    }
+
+    /**
+     * Returns the current score of the configuration
+     *
+     * @param normalized if true, the score will be normalized by the number of completed actions
+     * @return
+     */
+    public float getScore(boolean normalized)
+    {
+        // if (normalized && actionHistory.size() > 0)
+        //     return score / actionHistory.size();
+        return score;
+    }
+
+    public void addScore(float score)
+    {
+        this.score += score;
+    }
+
+    public void setScore(float score)
+    {
+        this.score = score;
+    }
+
+    public void addAction(int action)
+    {
+        actionHistory.add(action);
+    }
+
+    @Override
+    public int compareTo(Object o)
+    {
+        if (!(o instanceof Configuration))
+            return hashCode() - o.hashCode();
+
+        // may be unsafe
+        Configuration configuration = (Configuration) o;
+        float diff = getScore(true) - configuration.getScore(true);
+
+        if (diff > 0)
+            return (int) Math.ceil(diff);
+        else if (diff < 0)
+            return (int) Math.floor(diff);
+        else
+            return 0;
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+        if (o instanceof Configuration)
+        {
+            Configuration configuration = (Configuration) o;
+            if (configuration.score != score)
+                return false;
+            if (configuration.actionHistory.size() != actionHistory.size())
+                return false;
+            for (int i = 0; i < actionHistory.size(); i++)
+                if (!actionHistory.get(i).equals(configuration.actionHistory.get(i)))
+                    return false;
+            return true;
+        }
+        return false;
+    }
+
+    @Override
+    public Configuration clone()
+    {
+        Configuration configuration = new Configuration(sentence);
+        configuration.actionHistory = new ArrayList<Integer>(actionHistory);
+        configuration.score = score;
+        configuration.state = state.clone();
+
+        return configuration;
+    }
+
+    @Override
+    public int hashCode()
+    {
+        int hashCode = 0;
+        int i = 0;
+        for (int action : actionHistory)
+            hashCode += action << i++;
+        hashCode += score;
+        return hashCode;
+    }
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/Instance.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/Instance.java
new file mode 100644
index 000000000..89793db2c
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/Instance.java
@@ -0,0 +1,217 @@
+/**
+ * Copyright 2014, Yahoo! Inc.
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */
+
+
+package com.hankcs.hanlp.dependency.perceptron.transition.configuration;
+
+import com.hankcs.hanlp.dependency.perceptron.accessories.Edge;
+import com.hankcs.hanlp.dependency.perceptron.transition.parser.Action;
+import com.hankcs.hanlp.dependency.perceptron.structures.Sentence;
+import com.hankcs.hanlp.dependency.perceptron.transition.parser.ArcEager;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+
+/**
+ * Training instance
+ */
+public class Instance
+{
+    /**
+     * dependent -> head
+     */
+    protected HashMap<Integer, Edge> goldDependencies;
+    /**
+     * head -> dependents
+     */
+    protected HashMap<Integer, HashSet<Integer>> reversedDependencies;
+    protected Sentence sentence;
+
+    public Instance(Sentence sentence, HashMap<Integer, Edge> goldDependencies)
+    {
+        this.goldDependencies = new HashMap<Integer, Edge>();
+        reversedDependencies = new HashMap<Integer, HashSet<Integer>>();
+        for (Map.Entry<Integer, Edge> entry : goldDependencies.entrySet())
+        {
+            Integer dependent = entry.getKey();
+            Edge edge = entry.getValue();
+            int head = edge.headIndex;
+            this.goldDependencies.put(dependent, edge.clone());
+            HashSet<Integer> dependents = reversedDependencies.get(head);
+            if (dependents == null)
+            {
+                dependents = new HashSet<Integer>();
+                reversedDependencies.put(head, dependents);
+            }
+            dependents.add(dependent);
+        }
+        this.sentence = sentence;
+    }
+
+
+    public Sentence getSentence()
+    {
+        return sentence;
+    }
+
+    public int head(int dependent)
+    {
+        if (!goldDependencies.containsKey(dependent))
+            return -1;
+        return goldDependencies.get(dependent).headIndex;
+    }
+
+    public String relation(int dependent)
+    {
+        if (!goldDependencies.containsKey(dependent))
+            return "_";
+        return goldDependencies.get(dependent).relationId + "";
+    }
+
+    public HashMap<Integer, Edge> getGoldDependencies()
+    {
+        return goldDependencies;
+    }
+
+    /**
+     * Checks whether the tree to train on is projective or not
+     *
+     * @return true if the tree is non-projective
+     */
+    public boolean isNonprojective()
+    {
+        for (int dep1 : goldDependencies.keySet())
+        {
+            int head1 = goldDependencies.get(dep1).headIndex;
+            for (int dep2 : goldDependencies.keySet())
+            {
+                int head2 = goldDependencies.get(dep2).headIndex;
+                if (head1 < 0 || head2 < 0)
+                    continue;
+                if (dep1 > head1 && head1 != head2)
+                    if ((dep1 > head2 && dep1 < dep2 && head1 < head2) || (dep1 < head2 && dep1 > dep2 && head1 < dep2))
+                        return true;
+                if (dep1 < head1 && head1 != head2)
+                    if ((head1 > head2 && head1 < dep2 && dep1 < head2) || (head1 < head2 && head1 > dep2 && dep1 < dep2))
+                        return true;
+            }
+        }
+        return false;
+    }
+
+    public boolean isPartial(boolean rootFirst)
+    {
+        for (int i = 0; i < sentence.size(); i++)
+        {
+            if (rootFirst || i < sentence.size() - 1)
+            {
+                if (!goldDependencies.containsKey(i + 1))
+                    return true;
+            }
+        }
+        return false;
+    }
+
+    public HashMap<Integer, HashSet<Integer>> getReversedDependencies()
+    {
+        return reversedDependencies;
+    }
+
+    /**
+     * Computes the cost of an action given the gold dependencies
+     * For more information see:
+     * Yoav Goldberg and Joakim Nivre. "Training Deterministic Parsers with Non-Deterministic Oracles."
+     * TACL 1 (2013): 403-414.
+ * + * @param action + * @param dependency + * @param state + * @return oracle cost of the action + * @throws Exception + */ + public int actionCost(Action action, int dependency, State state) + { + if (!ArcEager.canDo(action, state)) + return Integer.MAX_VALUE; + int cost = 0; + + // added by me to take care of labels + if (action == Action.LeftArc) + { // left arc + int bufferHead = state.bufferHead(); + int stackHead = state.stackTop(); + + if (goldDependencies.containsKey(stackHead) && goldDependencies.get(stackHead).headIndex == bufferHead + && goldDependencies.get(stackHead).relationId != (dependency)) + cost += 1; + } + else if (action == Action.RightArc) + { //right arc + int bufferHead = state.bufferHead(); + int stackHead = state.stackTop(); + if (goldDependencies.containsKey(bufferHead) && goldDependencies.get(bufferHead).headIndex == stackHead + && goldDependencies.get(bufferHead).relationId != (dependency)) + cost += 1; + } + + if (action == Action.Shift) + { //shift + int bufferHead = state.bufferHead(); + for (int stackItem : state.getStack()) + { + if (goldDependencies.containsKey(stackItem) && goldDependencies.get(stackItem).headIndex == (bufferHead)) + cost += 1; + if (goldDependencies.containsKey(bufferHead) && goldDependencies.get(bufferHead).headIndex == (stackItem)) + cost += 1; + } + + } + else if (action == Action.Reduce) + { //reduce + int stackHead = state.stackTop(); + if (!state.bufferEmpty()) + for (int bufferItem = state.bufferHead(); bufferItem <= state.maxSentenceSize; bufferItem++) + { + if (goldDependencies.containsKey(bufferItem) && goldDependencies.get(bufferItem).headIndex == (stackHead)) + cost += 1; + } + } + else if (action == Action.LeftArc && cost == 0) + { //left arc + int stackHead = state.stackTop(); + if (!state.bufferEmpty()) + for (int bufferItem = state.bufferHead(); bufferItem <= state.maxSentenceSize; bufferItem++) + { + if (goldDependencies.containsKey(bufferItem) && goldDependencies.get(bufferItem).headIndex == 
(stackHead))
+                        cost += 1;
+                    if (goldDependencies.containsKey(stackHead) && goldDependencies.get(stackHead).headIndex == (bufferItem))
+                        if (bufferItem != state.bufferHead())
+                            cost += 1;
+                }
+        }
+        else if (action == Action.RightArc && cost == 0)
+        { //right arc
+            int stackHead = state.stackTop();
+            int bufferHead = state.bufferHead();
+            for (int stackItem : state.getStack())
+            {
+                if (goldDependencies.containsKey(bufferHead) && goldDependencies.get(bufferHead).headIndex == (stackItem))
+                    if (stackItem != stackHead)
+                        cost += 1;
+
+                if (goldDependencies.containsKey(stackItem) && goldDependencies.get(stackItem).headIndex == (bufferHead))
+                    cost += 1;
+            }
+            if (!state.bufferEmpty())
+                for (int bufferItem = state.bufferHead(); bufferItem <= state.maxSentenceSize; bufferItem++)
+                {
+                    if (goldDependencies.containsKey(bufferHead) && goldDependencies.get(bufferHead).headIndex == (bufferItem))
+                        cost += 1;
+                }
+        }
+        return cost;
+    }
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/State.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/State.java
new file mode 100644
index 000000000..69cb1137b
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/configuration/State.java
@@ -0,0 +1,305 @@
+/**
+ * Copyright 2014, Yahoo! Inc.
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */
+
+package com.hankcs.hanlp.dependency.perceptron.transition.configuration;
+
+import com.hankcs.hanlp.dependency.perceptron.accessories.Edge;
+
+import java.util.ArrayDeque;
+
+/**
+ * A parser state composed of a buffer, a stack and arcs
+ */
+public class State implements Cloneable
+{
+    public int rootIndex;
+    public int maxSentenceSize;
+
+    /**
+     * This is the additional information for the case of parsing with tree constraint
+     * For more information see:
+     * Joakim Nivre and Daniel Fernández-González. "Arc-Eager Parsing with the Tree Constraint."
+     * Computational Linguistics(2014).
+     */
+    protected boolean emptyFlag;
+
+    /**
+     * Keeps dependent->head information
+     */
+    protected Edge[] arcs;
+    protected int[] leftMostArcs;
+    protected int[] rightMostArcs;
+    /**
+     * left modifiers
+     */
+    protected int[] leftValency;
+    protected int[] rightValency;
+    protected long[] rightDepLabels;
+    protected long[] leftDepLabels;
+    protected ArrayDeque<Integer> stack;
+    int bufferHead;
+
+    public State(int size)
+    {
+        emptyFlag = false;
+        stack = new ArrayDeque<Integer>();
+        arcs = new Edge[size + 1];
+
+        leftMostArcs = new int[size + 1];
+        rightMostArcs = new int[size + 1];
+        leftValency = new int[size + 1];
+        rightValency = new int[size + 1];
+        rightDepLabels = new long[size + 1];
+        leftDepLabels = new long[size + 1];
+
+        rootIndex = 0;
+        bufferHead = 1;
+        maxSentenceSize = 0;
+    }
+
+    /**
+     * @param sentenceSize sentence length (excluding ROOT)
+     * @param rootFirst    whether ROOT is the word at index 0; otherwise it is the last word
+     */
+    public State(int sentenceSize, boolean rootFirst)
+    {
+        this(sentenceSize);
+        if (rootFirst)
+        {
+            stack.push(0);
+            rootIndex = 0;
+            maxSentenceSize = sentenceSize;
+        }
+        else
+        {
+            rootIndex = sentenceSize;
+            maxSentenceSize = sentenceSize;
+        }
+    }
+
+    public ArrayDeque<Integer> getStack()
+    {
+        return stack;
+    }
+
+    public int pop()
+    {
+        return stack.pop();
+    }
+
+    public void push(int index)
+    {
+        stack.push(index);
+    }
+
+    public void addArc(int dependent, int head, int dependency)
+    {
+        arcs[dependent] = new Edge(head, dependency);
+        long value = 1L << (dependency);
+
+        assert dependency < 64;
+
+        if (dependent > head)
+        { //right dep
+            if (rightMostArcs[head] == 0 || dependent > rightMostArcs[head])
+                rightMostArcs[head] = dependent;
+            rightValency[head] += 1;
+            rightDepLabels[head] = rightDepLabels[head] | value;
+
+        }
+        else
+        { //left dependency
+            if (leftMostArcs[head] == 0 || dependent < leftMostArcs[head])
+                leftMostArcs[head] = dependent;
+            leftDepLabels[head] = leftDepLabels[head] | value;
+            leftValency[head] += 1;
+        }
+    }
+
+    public long rightDependentLabels(int position)
+    {
+        return rightDepLabels[position];
+    }
+
+    public long leftDependentLabels(int position)
+    {
+        return leftDepLabels[position];
+    }
+
+    public boolean isEmptyFlag()
+    {
+        return emptyFlag;
+    }
+
+    public void setEmptyFlag(boolean emptyFlag)
+    {
+        this.emptyFlag = emptyFlag;
+    }
+
+    public int bufferHead()
+    {
+        return bufferHead;
+    }
+
+    /**
+     * View top element of stack
+     * @return
+     */
+    public int stackTop()
+    {
+        if (stack.size() > 0)
+            return stack.peek();
+        return -1;
+    }
+
+    public int getBufferItem(int position)
+    {
+        return bufferHead + position;
+    }
+
+    public boolean isTerminalState()
+    {
+        if (stackEmpty())
+        {
+            if (bufferEmpty() || bufferHead == rootIndex)
+            {
+                return true;
+            }
+        }
+        return false;
+    }
+
+    public boolean hasHead(int dependent)
+    {
+        return arcs[dependent] != null;
+    }
+
+    public boolean bufferEmpty()
+    {
+        return bufferHead == -1;
+    }
+
+    public boolean stackEmpty()
+    {
+        return stack.size() == 0;
+    }
+
+    public int bufferSize()
+    {
+        if (bufferHead < 0)
+            return 0;
+        return (maxSentenceSize - bufferHead + 1);
+    }
+
+    public int stackSize()
+    {
+        return stack.size();
+    }
+
+    public int rightMostModifier(int index)
+    {
+        return (rightMostArcs[index] == 0 ? -1 : rightMostArcs[index]);
+    }
+
+    public int leftMostModifier(int index)
+    {
+        return (leftMostArcs[index] == 0 ? -1 : leftMostArcs[index]);
+    }
+
+    /**
+     * @param head
+     * @return the current number of dependents
+     */
+    public int valence(int head)
+    {
+        return rightValency(head) + leftValency(head);
+    }
+
+    /**
+     * @param head
+     * @return the current number of right modifiers
+     */
+    public int rightValency(int head)
+    {
+        return rightValency[head];
+    }
+
+    /**
+     * @param head
+     * @return the current number of left modifiers
+     */
+    public int leftValency(int head)
+    {
+        return leftValency[head];
+    }
+
+    public int getHead(int index)
+    {
+        if (arcs[index] != null)
+            return arcs[index].headIndex;
+        return -1;
+    }
+
+    public int getDependent(int index)
+    {
+        if (arcs[index] != null)
+            return arcs[index].relationId;
+        return -1;
+    }
+
+    public void setMaxSentenceSize(int maxSentenceSize)
+    {
+        this.maxSentenceSize = maxSentenceSize;
+    }
+
+    public void incrementBufferHead()
+    {
+        if (bufferHead == maxSentenceSize)
+            bufferHead = -1;
+        else
+            bufferHead++;
+    }
+
+    public void setBufferHead(int bufferHead)
+    {
+        this.bufferHead = bufferHead;
+    }
+
+    @Override
+    public State clone()
+    {
+        State state = new State(arcs.length - 1);
+        state.stack = new ArrayDeque<Integer>(stack);
+
+        for (int dependent = 0; dependent < arcs.length; dependent++)
+        {
+            if (arcs[dependent] != null)
+            {
+                Edge head = arcs[dependent];
+                state.arcs[dependent] = head;
+                int h = head.headIndex;
+
+                if (rightMostArcs[h] != 0)
+                {
+                    state.rightMostArcs[h] = rightMostArcs[h];
+                    state.rightValency[h] = rightValency[h];
+                    state.rightDepLabels[h] = rightDepLabels[h];
+                }
+
+                if (leftMostArcs[h] != 0)
+                {
+                    state.leftMostArcs[h] = leftMostArcs[h];
+                    state.leftValency[h] = leftValency[h];
+                    state.leftDepLabels[h] = leftDepLabels[h];
+                }
+            }
+        }
+        state.rootIndex = rootIndex;
+        state.bufferHead = bufferHead;
+        state.maxSentenceSize = maxSentenceSize;
+        state.emptyFlag = emptyFlag;
+        return state;
+    }
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/features/FeatureExtractor.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/features/FeatureExtractor.java
new file mode 100644
index 000000000..af0b4a5e2
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/features/FeatureExtractor.java
@@ -0,0 +1,1847 @@
+/**
+ * Copyright 2014, Yahoo! Inc.
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */
+
+package com.hankcs.hanlp.dependency.perceptron.transition.features;
+
+import com.hankcs.hanlp.dependency.perceptron.structures.Sentence;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.State;
+
+public class FeatureExtractor
+{
+    /**
+     * Given a list of templates, extracts all features for the given state
+     *
+     * @param configuration
+     * @return
+     */
+    public static Object[] extractAllParseFeatures(Configuration configuration, int length)
+    {
+        if (length == 26)
+            return extractBasicFeatures(configuration, length);
+        else if (length == 72)
+            return extractExtendedFeatures(configuration, length);
+        else
+            return extractExtendedFeaturesWithBrownClusters(configuration, length);
+    }
+
+
+    /**
+     * Extracts features for a state according to the feature templates
+     *
+     * @param configuration
+     * @return
+     */
+    private static Object[] extractExtendedFeatures(Configuration configuration, int length)
+    {
+        Object[] featureMap = new Object[length];
+
+        State state = configuration.state;
+        Sentence sentence = configuration.sentence;
+
+        int b0Position = 0;
+        int b1Position = 0;
+        int b2Position = 0;
+        int s0Position = 0;
+
+        long svr = 0; // stack right valency
+        long svl = 0; // stack left valency
+        long bvl = 0; // buffer left valency
+
+        long b0w = 0;
+        long b0p = 0;
+
+        long b1w = 0;
+        long b1p = 0;
+
+        long b2w = 0;
+        long b2p = 0;
+
+        long s0w = 0;
+        long s0p = 0;
+        long s0l = 0;
+
+        long bl0p = 0;
+        long bl0w = 0;
+        long bl0l = 0;
+
+        long bl1w = 0;
+ long bl1p = 0; + long bl1l = 0; + + long sr0p = 0; + long sr0w = 0; + long sr0l = 0; + + long sh0w = 0; + long sh0p = 0; + long sh0l = 0; + + long sl0p = 0; + long sl0w = 0; + long sl0l = 0; + + long sr1w = 0; + long sr1p = 0; + long sr1l = 0; + + long sh1w = 0; + long sh1p = 0; + + long sl1w = 0; + long sl1p = 0; + long sl1l = 0; + + long sdl = 0; + long sdr = 0; + long bdl = 0; + + int[] words = sentence.getWords(); + int[] tags = sentence.getTags(); + + if (0 < state.bufferSize()) + { + b0Position = state.bufferHead(); + b0w = b0Position == 0 ? 0 : words[b0Position - 1]; + b0w += 2; + b0p = b0Position == 0 ? 0 : tags[b0Position - 1]; + b0p += 2; + bvl = state.leftValency(b0Position); + + int leftMost = state.leftMostModifier(state.getBufferItem(0)); + if (leftMost >= 0) + { + bl0p = leftMost == 0 ? 0 : tags[leftMost - 1]; + bl0p += 2; + bl0w = leftMost == 0 ? 0 : words[leftMost - 1]; + bl0w += 2; + bl0l = state.getDependent(leftMost); + bl0l += 2; + + int l2 = state.leftMostModifier(leftMost); + if (l2 >= 0) + { + bl1w = l2 == 0 ? 0 : words[l2 - 1]; + bl1w += 2; + bl1p = l2 == 0 ? 0 : tags[l2 - 1]; + bl1p += 2; + bl1l = state.getDependent(l2); + bl1l += 2; + } + } + + if (1 < state.bufferSize()) + { + b1Position = state.getBufferItem(1); + b1w = b1Position == 0 ? 0 : words[b1Position - 1]; + b1w += 2; + b1p = b1Position == 0 ? 0 : tags[b1Position - 1]; + b1p += 2; + + if (2 < state.bufferSize()) + { + b2Position = state.getBufferItem(2); + + b2w = b2Position == 0 ? 0 : words[b2Position - 1]; + b2w += 2; + b2p = b2Position == 0 ? 0 : tags[b2Position - 1]; + b2p += 2; + } + } + } + + if (0 < state.stackSize()) + { + s0Position = state.stackTop(); + s0w = s0Position == 0 ? 0 : words[s0Position - 1]; + s0w += 2; + s0p = s0Position == 0 ? 
0 : tags[s0Position - 1]; + s0p += 2; + s0l = state.getDependent(s0Position); + s0l += 2; + + svl = state.leftValency(s0Position); + svr = state.rightValency(s0Position); + + int leftMost = state.leftMostModifier(s0Position); + if (leftMost >= 0) + { + sl0p = leftMost == 0 ? 0 : tags[leftMost - 1]; + sl0p += 2; + sl0w = leftMost == 0 ? 0 : words[leftMost - 1]; + sl0w += 2; + sl0l = state.getDependent(leftMost); + sl0l += 2; + } + + int rightMost = state.rightMostModifier(s0Position); + if (rightMost >= 0) + { + sr0p = rightMost == 0 ? 0 : tags[rightMost - 1]; + sr0p += 2; + sr0w = rightMost == 0 ? 0 : words[rightMost - 1]; + sr0w += 2; + sr0l = state.getDependent(rightMost); + sr0l += 2; + } + + int headIndex = state.getHead(s0Position); + if (headIndex >= 0) + { + sh0w = headIndex == 0 ? 0 : words[headIndex - 1]; + sh0w += 2; + sh0p = headIndex == 0 ? 0 : tags[headIndex - 1]; + sh0p += 2; + sh0l = state.getDependent(headIndex); + sh0l += 2; + } + + if (leftMost >= 0) + { + int l2 = state.leftMostModifier(leftMost); + if (l2 >= 0) + { + sl1w = l2 == 0 ? 0 : words[l2 - 1]; + sl1w += 2; + sl1p = l2 == 0 ? 0 : tags[l2 - 1]; + sl1p += 2; + sl1l = state.getDependent(l2); + sl1l += 2; + } + } + if (headIndex >= 0) + { + if (state.hasHead(headIndex)) + { + int h2 = state.getHead(headIndex); + sh1w = h2 == 0 ? 0 : words[h2 - 1]; + sh1w += 2; + sh1p = h2 == 0 ? 0 : tags[h2 - 1]; + sh1p += 2; + } + } + if (rightMost >= 0) + { + int r2 = state.rightMostModifier(rightMost); + if (r2 >= 0) + { + sr1w = r2 == 0 ? 0 : words[r2 - 1]; + sr1w += 2; + sr1p = r2 == 0 ? 
0 : tags[r2 - 1]; + sr1p += 2; + sr1l = state.getDependent(r2); + sr1l += 2; + } + } + } + int index = 0; + + long b0wp = b0p; + b0wp |= (b0w << 8); + long b1wp = b1p; + b1wp |= (b1w << 8); + long s0wp = s0p; + s0wp |= (s0w << 8); + long b2wp = b2p; + b2wp |= (b2w << 8); + + /** + * From single words + */ + if (s0w != 1) + { + featureMap[index++] = s0wp; + featureMap[index++] = s0w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = s0p; + + if (b0w != 1) + { + featureMap[index++] = b0wp; + featureMap[index++] = b0w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = b0p; + + if (b1w != 1) + { + featureMap[index++] = b1wp; + featureMap[index++] = b1w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = b1p; + + if (b2w != 1) + { + featureMap[index++] = b2wp; + featureMap[index++] = b2w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = b2p; + + /** + * from word pairs + */ + if (s0w != 1 && b0w != 1) + { + featureMap[index++] = (s0wp << 28) | b0wp; + featureMap[index++] = (s0wp << 20) | b0w; + featureMap[index++] = (s0w << 28) | b0wp; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + featureMap[index++] = null; + } + + if (s0w != 1) + { + featureMap[index++] = (s0wp << 8) | b0p; + } + else + { + featureMap[index++] = null; + } + + if (b0w != 1) + { + featureMap[index++] = (s0p << 28) | b0wp; + } + else + { + featureMap[index++] = null; + } + + if (s0w != 1 && b0w != 1) + { + featureMap[index++] = (s0w << 20) | b0w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = (s0p << 8) | b0p; + featureMap[index++] = (b0p << 8) | b1p; + + /** + * from three words + */ + featureMap[index++] = (b0p << 16) | (b1p << 8) | b2p; + featureMap[index++] = (s0p << 16) | (b0p << 8) | b1p; + featureMap[index++] = (sh0p << 16) | (s0p << 
8) | b0p; + featureMap[index++] = (s0p << 16) | (sl0p << 8) | b0p; + featureMap[index++] = (s0p << 16) | (sr0p << 8) | b0p; + featureMap[index++] = (s0p << 16) | (b0p << 8) | bl0p; + + /** + * distance + */ + long distance = 0; + if (s0Position > 0 && b0Position > 0) + distance = Math.abs(b0Position - s0Position); + if (s0w != 1) + { + featureMap[index++] = s0w | (distance << 20); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = s0p | (distance << 8); + if (b0w != 1) + { + featureMap[index++] = b0w | (distance << 20); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = b0p | (distance << 8); + if (s0w != 1 && b0w != 1) + { + featureMap[index++] = s0w | (b0w << 20) | (distance << 40); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = s0p | (b0p << 8) | (distance << 28); + + /** + * Valency information + */ + if (s0w != 1) + { + featureMap[index++] = s0w | (svr << 20); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = s0p | (svr << 8); + if (s0w != 1) + { + featureMap[index++] = s0w | (svl << 20); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = s0p | (svl << 8); + if (b0w != 1) + { + featureMap[index++] = b0w | (bvl << 20); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = b0p | (bvl << 8); + + /** + * Unigrams + */ + if (sh0w != 1) + { + featureMap[index++] = sh0w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = sh0p; + featureMap[index++] = s0l; + if (sl0w != 1) + { + featureMap[index++] = sl0w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = sl0p; + featureMap[index++] = sl0l; + if (sr0w != 1) + { + featureMap[index++] = sr0w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = sr0p; + featureMap[index++] = sr0l; + if (bl0w != 1) + { + featureMap[index++] = bl0w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = bl0p; + 
featureMap[index++] = bl0l; + + /** + * From third order features + */ + if (sh1w != 1) + { + featureMap[index++] = sh1w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = sh1p; + featureMap[index++] = sh0l; + if (sl1w != 1) + { + featureMap[index++] = sl1w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = sl1p; + featureMap[index++] = sl1l; + if (sr1w != 1) + { + featureMap[index++] = sr1w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = sr1p; + featureMap[index++] = sr1l; + if (bl1w != 1) + { + featureMap[index++] = bl1w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = bl1p; + featureMap[index++] = bl1l; + featureMap[index++] = s0p | (sl0p << 8) | (sl1p << 16); + featureMap[index++] = s0p | (sr0p << 8) | (sr1p << 16); + featureMap[index++] = s0p | (sh0p << 8) | (sh1p << 16); + featureMap[index++] = b0p | (bl0p << 8) | (bl1p << 16); + + /** + * label set + */ + if (s0Position >= 0) + { + sdl = state.leftDependentLabels(s0Position); + sdr = state.rightDependentLabels(s0Position); + } + + if (b0Position >= 0) + { + bdl = state.leftDependentLabels(b0Position); + } + + if (s0w != 1) + { + featureMap[index++] = (s0w + "|" + sdr); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = (s0p + "|" + sdr); + if (s0w != 1) + { + featureMap[index++] = s0w + "|" + sdl; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = (s0p + "|" + sdl); + if (b0w != 1) + { + featureMap[index++] = (b0w + "|" + bdl); + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = (b0p + "|" + bdl); + return featureMap; + } + + /** + * Given a list of templates, extracts all features for the given state + * + * @param configuration + * @return + * @throws Exception + */ + private static Long[] extractBasicFeatures(Configuration configuration, int length) + { + Long[] featureMap = new Long[length]; + + State state = configuration.state; + Sentence 
sentence = configuration.sentence; + + int b0Position = 0; + int b1Position = 0; + int b2Position = 0; + int s0Position = 0; + + long b0w = 0; + long b0p = 0; + + long b1w = 0; + long b1p = 0; + + long b2w = 0; + long b2p = 0; + + long s0w = 0; + long s0p = 0; + long bl0p = 0; + long sr0p = 0; + long sh0p = 0; + + long sl0p = 0; + + int[] words = sentence.getWords(); + int[] tags = sentence.getTags(); + + if (0 < state.bufferSize()) + { + b0Position = state.bufferHead(); + b0w = b0Position == 0 ? 0 : words[b0Position - 1]; + b0w += 2; + b0p = b0Position == 0 ? 0 : tags[b0Position - 1]; + b0p += 2; + + int leftMost = state.leftMostModifier(state.getBufferItem(0)); + if (leftMost >= 0) + { + bl0p = leftMost == 0 ? 0 : tags[leftMost - 1]; + bl0p += 2; + } + + if (1 < state.bufferSize()) + { + b1Position = state.getBufferItem(1); + b1w = b1Position == 0 ? 0 : words[b1Position - 1]; + b1w += 2; + b1p = b1Position == 0 ? 0 : tags[b1Position - 1]; + b1p += 2; + + if (2 < state.bufferSize()) + { + b2Position = state.getBufferItem(2); + + b2w = b2Position == 0 ? 0 : words[b2Position - 1]; + b2w += 2; + b2p = b2Position == 0 ? 0 : tags[b2Position - 1]; + b2p += 2; + } + } + } + + + if (0 < state.stackSize()) + { + s0Position = state.stackTop(); + s0w = s0Position == 0 ? 0 : words[s0Position - 1]; + s0w += 2; + s0p = s0Position == 0 ? 0 : tags[s0Position - 1]; + s0p += 2; + + int leftMost = state.leftMostModifier(s0Position); + if (leftMost >= 0) + { + sl0p = leftMost == 0 ? 0 : tags[leftMost - 1]; + sl0p += 2; + } + + int rightMost = state.rightMostModifier(s0Position); + if (rightMost >= 0) + { + sr0p = rightMost == 0 ? 0 : tags[rightMost - 1]; + sr0p += 2; + } + + int headIndex = state.getHead(s0Position); + if (headIndex >= 0) + { + sh0p = headIndex == 0 ? 
0 : tags[headIndex - 1]; + sh0p += 2; + } + + } + int index = 0; + + long b0wp = b0p; + b0wp |= (b0w << 8); + long b1wp = b1p; + b1wp |= (b1w << 8); + long s0wp = s0p; + s0wp |= (s0w << 8); + long b2wp = b2p; + b2wp |= (b2w << 8); + + /** + * From single words + */ + if (s0w != 1) + { + featureMap[index++] = s0wp; + featureMap[index++] = s0w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = s0p; + + if (b0w != 1) + { + featureMap[index++] = b0wp; + featureMap[index++] = b0w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = b0p; + + if (b1w != 1) + { + featureMap[index++] = b1wp; + featureMap[index++] = b1w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = b1p; + + if (b2w != 1) + { + featureMap[index++] = b2wp; + featureMap[index++] = b2w; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + } + featureMap[index++] = b2p; + + /** + * from word pairs + */ + if (s0w != 1 && b0w != 1) + { + featureMap[index++] = (s0wp << 28) | b0wp; + featureMap[index++] = (s0wp << 20) | b0w; + featureMap[index++] = (s0w << 28) | b0wp; + } + else + { + featureMap[index++] = null; + featureMap[index++] = null; + featureMap[index++] = null; + } + + if (s0w != 1) + { + featureMap[index++] = (s0wp << 8) | b0p; + } + else + { + featureMap[index++] = null; + } + + if (b0w != 1) + { + featureMap[index++] = (s0p << 28) | b0wp; + } + else + { + featureMap[index++] = null; + } + + if (s0w != 1 && b0w != 1) + { + featureMap[index++] = (s0w << 20) | b0w; + } + else + { + featureMap[index++] = null; + } + featureMap[index++] = (s0p << 8) | b0p; + featureMap[index++] = (b0p << 8) | b1p; + + /** + * from three words + */ + featureMap[index++] = (b0p << 16) | (b1p << 8) | b2p; + featureMap[index++] = (s0p << 16) | (b0p << 8) | b1p; + featureMap[index++] = (sh0p << 16) | (s0p << 8) | b0p; + featureMap[index++] = (s0p 
<< 16) | (sl0p << 8) | b0p; + featureMap[index++] = (s0p << 16) | (sr0p << 8) | b0p; + featureMap[index++] = (s0p << 16) | (b0p << 8) | bl0p; + return featureMap; + } + + private static Object[] extractExtendedFeaturesWithBrownClusters(Configuration configuration, int length) + { + Object[] featureVector = new Object[length]; + + State state = configuration.state; + Sentence sentence = configuration.sentence; + + int b0Position = 0; + int b1Position = 0; + int b2Position = 0; + int s0Position = 0; + + int svr = 0; // stack right valency + int svl = 0; // stack left valency + int bvl = 0; // buffer left valency + + long b0w = 0; + long b0p = 0; + long b0bc4 = 0; + long b0bc6 = 0; + long b0bcf = 0; + + long b1w = 0; + long b1p = 0; + + long b2w = 0; + long b2p = 0; + + long s0w = 0; + long s0p = 0; + long s0bc4 = 0; + long s0bc6 = 0; + long s0bcf = 0; + + long s0l = 0; + + long bl0p = 0; + long bl0w = 0; + long bl0l = 0; + + long bl1w = 0; + long bl1p = 0; + long bl1l = 0; + + long sr0p = 0; + long sr0w = 0; + long sr0l = 0; + + long sh0w = 0; + long sh0p = 0; + long sh0l = 0; + + long sl0p = 0; + long sl0w = 0; + long sl0l = 0; + + long sr1w = 0; + long sr1p = 0; + long sr1l = 0; + + long sh1w = 0; + long sh1p = 0; + + long sl1w = 0; + long sl1p = 0; + long sl1l = 0; + + long sdl = 0; + long sdr = 0; + long bdl = 0; + + int[] words = sentence.getWords(); + int[] tags = sentence.getTags(); + int[] bc4 = sentence.getBrownCluster4thPrefix(); + int[] bc6 = sentence.getBrownCluster6thPrefix(); + int[] bcf = sentence.getBrownClusterFullString(); + + if (0 < state.bufferSize()) + { + b0Position = state.bufferHead(); + b0w = b0Position == 0 ? 0 : words[b0Position - 1]; + b0w += 2; + b0p = b0Position == 0 ? 0 : tags[b0Position - 1]; + b0p += 2; + b0bc4 = b0Position == 0 ? 0 : bc4[b0Position - 1]; + b0bc4 += 2; + b0bc6 = b0Position == 0 ? 0 : bc6[b0Position - 1]; + b0bc6 += 2; + b0bcf = b0Position == 0 ? 
0 : bcf[b0Position - 1]; + b0bcf += 2; + + bvl = state.leftValency(b0Position); + + int leftMost = state.leftMostModifier(state.bufferHead()); + if (leftMost >= 0) + { + bl0p = leftMost == 0 ? 0 : tags[leftMost - 1]; + bl0p += 2; + bl0w = leftMost == 0 ? 0 : words[leftMost - 1]; + bl0w += 2; + bl0l = state.getDependent(leftMost); + bl0l += 2; + + int l2 = state.leftMostModifier(leftMost); + if (l2 >= 0) + { + bl1w = l2 == 0 ? 0 : words[l2 - 1]; + bl1w += 2; + bl1p = l2 == 0 ? 0 : tags[l2 - 1]; + bl1p += 2; + bl1l = state.getDependent(l2); + bl1l += 2; + } + } + + if (1 < state.bufferSize()) + { + b1Position = state.getBufferItem(1); + b1w = b1Position == 0 ? 0 : words[b1Position - 1]; + b1w += 2; + b1p = b1Position == 0 ? 0 : tags[b1Position - 1]; + b1p += 2; + + if (2 < state.bufferSize()) + { + b2Position = state.getBufferItem(2); + + b2w = b2Position == 0 ? 0 : words[b2Position - 1]; + b2w += 2; + b2p = b2Position == 0 ? 0 : tags[b2Position - 1]; + b2p += 2; + } + } + } + + if (0 < state.stackSize()) + { + s0Position = state.stackTop(); + s0w = s0Position == 0 ? 0 : words[s0Position - 1]; + s0w += 2; + s0p = s0Position == 0 ? 0 : tags[s0Position - 1]; + s0p += 2; + s0bc4 = s0Position == 0 ? 0 : bc4[s0Position - 1]; + s0bc4 += 2; + s0bc6 = s0Position == 0 ? 0 : bc6[s0Position - 1]; + s0bc6 += 2; + s0bcf = s0Position == 0 ? 0 : bcf[s0Position - 1]; + s0bcf += 2; + + s0l = state.getDependent(s0Position); + s0l += 2; + + svl = state.leftValency(s0Position); + svr = state.rightValency(s0Position); + + int leftMost = state.leftMostModifier(s0Position); + if (leftMost >= 0) + { + sl0p = leftMost == 0 ? 0 : tags[leftMost - 1]; + sl0p += 2; + sl0w = leftMost == 0 ? 0 : words[leftMost - 1]; + sl0w += 2; + sl0l = state.getDependent(leftMost); + sl0l += 2; + } + + int rightMost = state.rightMostModifier(s0Position); + if (rightMost >= 0) + { + sr0p = rightMost == 0 ? 0 : tags[rightMost - 1]; + sr0p += 2; + sr0w = rightMost == 0 ? 
0 : words[rightMost - 1]; + sr0w += 2; + sr0l = state.getDependent(rightMost); + sr0l += 2; + } + + int headIndex = state.getHead(s0Position); + if (headIndex >= 0) + { + sh0w = headIndex == 0 ? 0 : words[headIndex - 1]; + sh0w += 2; + sh0p = headIndex == 0 ? 0 : tags[headIndex - 1]; + sh0p += 2; + sh0l = state.getDependent(headIndex); + sh0l += 2; + } + + if (leftMost >= 0) + { + int l2 = state.leftMostModifier(leftMost); + if (l2 >= 0) + { + sl1w = l2 == 0 ? 0 : words[l2 - 1]; + sl1w += 2; + sl1p = l2 == 0 ? 0 : tags[l2 - 1]; + sl1p += 2; + sl1l = state.getDependent(l2); + sl1l += 2; + } + } + if (headIndex >= 0) + { + if (state.hasHead(headIndex)) + { + int h2 = state.getHead(headIndex); + sh1w = h2 == 0 ? 0 : words[h2 - 1]; + sh1w += 2; + sh1p = h2 == 0 ? 0 : tags[h2 - 1]; + sh1p += 2; + } + } + if (rightMost >= 0) + { + int r2 = state.rightMostModifier(rightMost); + if (r2 >= 0) + { + sr1w = r2 == 0 ? 0 : words[r2 - 1]; + sr1w += 2; + sr1p = r2 == 0 ? 0 : tags[r2 - 1]; + sr1p += 2; + sr1l = state.getDependent(r2); + sr1l += 2; + } + } + } + int index = 0; + + long b0wp = b0p; + b0wp |= (b0w << 8); // 最多256种pos + long b1wp = b1p; + b1wp |= (b1w << 8); + long s0wp = s0p; + s0wp |= (s0w << 8); + long b2wp = b2p; + b2wp |= (b2w << 8); + + + /** + * From single words + */ + if (s0w != 1) // -1 + 2 = 1, means unk + { + featureVector[index++] = s0wp; + featureVector[index++] = s0w; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + featureVector[index++] = s0p; + + if (b0w != 1) + { + featureVector[index++] = b0wp; + featureVector[index++] = b0w; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + featureVector[index++] = b0p; + + if (b1w != 1) + { + featureVector[index++] = b1wp; + featureVector[index++] = b1w; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + featureVector[index++] = b1p; + + if (b2w != 1) + { + featureVector[index++] = b2wp; + 
featureVector[index++] = b2w; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + featureVector[index++] = b2p; + + /** + * from word pairs + */ + if (s0w != 1 && b0w != 1) + { + featureVector[index++] = (s0wp << 28) | b0wp; + featureVector[index++] = (s0wp << 20) | b0w; + featureVector[index++] = (s0w << 28) | b0wp; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0w != 1) + { + featureVector[index++] = (s0wp << 8) | b0p; + } + else + { + featureVector[index++] = null; + } + + if (b0w != 1) + { + featureVector[index++] = (s0p << 28) | b0wp; + } + else + { + featureVector[index++] = null; + } + + if (s0w != 1 && b0w != 1) + { + featureVector[index++] = (s0w << 20) | b0w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = (s0p << 8) | b0p; + featureVector[index++] = (b0p << 8) | b1p; + + /** + * from three words + */ + featureVector[index++] = (b0p << 16) | (b1p << 8) | b2p; + featureVector[index++] = (s0p << 16) | (b0p << 8) | b1p; + featureVector[index++] = (sh0p << 16) | (s0p << 8) | b0p; + featureVector[index++] = (s0p << 16) | (sl0p << 8) | b0p; + featureVector[index++] = (s0p << 16) | (sr0p << 8) | b0p; + featureVector[index++] = (s0p << 16) | (b0p << 8) | bl0p; + + /** + * distance + */ + long distance = 0; + if (s0Position > 0 && b0Position > 0) + distance = Math.abs(b0Position - s0Position); + if (s0w != 1) + { + featureVector[index++] = s0w | (distance << 20); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = s0p | (distance << 8); + if (b0w != 1) + { + featureVector[index++] = b0w | (distance << 20); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = b0p | (distance << 8); + if (s0w != 1 && b0w != 1) + { + featureVector[index++] = s0w | (b0w << 20) | (distance << 40); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = s0p | (b0p << 
8) | (distance << 28); + + /** + * Valency information + */ + if (s0w != 1) + { + featureVector[index++] = s0w | (svr << 20); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = s0p | (svr << 8); + if (s0w != 1) + { + featureVector[index++] = s0w | (svl << 20); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = s0p | (svl << 8); + if (b0w != 1) + { + featureVector[index++] = b0w | (bvl << 20); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = b0p | (bvl << 8); + + /** + * Unigrams + */ + if (sh0w != 1) + { + featureVector[index++] = sh0w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = sh0p; + featureVector[index++] = s0l; + if (sl0w != 1) + { + featureVector[index++] = sl0w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = sl0p; + featureVector[index++] = sl0l; + if (sr0w != 1) + { + featureVector[index++] = sr0w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = sr0p; + featureVector[index++] = sr0l; + if (bl0w != 1) + { + featureVector[index++] = bl0w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = bl0p; + featureVector[index++] = bl0l; + + /** + * From third order features + */ + if (sh1w != 1) + { + featureVector[index++] = sh1w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = sh1p; + featureVector[index++] = sh0l; + if (sl1w != 1) + { + featureVector[index++] = sl1w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = sl1p; + featureVector[index++] = sl1l; + if (sr1w != 1) + { + featureVector[index++] = sr1w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = sr1p; + featureVector[index++] = sr1l; + if (bl1w != 1) + { + featureVector[index++] = bl1w; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = bl1p; + featureVector[index++] = bl1l; + 
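The word/POS features above are packed into single `long` keys by bit-shifting: the POS id occupies the low 8 bits (hence the "at most 256 POS tags" comment) and the word id is shifted above it; pair features then shift one packed key over another. A minimal sketch of the scheme, with hypothetical ids (`42`, `7`, etc. are made up for illustration):

```java
public class FeaturePacking {
    // Pack a word id and a POS id into one long key: POS in the low 8 bits
    // (at most 256 tags), word id shifted above it -- the s0wp/b0wp scheme.
    static long packWordPos(long word, long pos) {
        return pos | (word << 8);
    }

    // Combine two packed word/POS keys into a single pair feature,
    // mirroring (s0wp << 28) | b0wp in the extractor above.
    static long packPair(long left, long right) {
        return (left << 28) | right;
    }

    public static void main(String[] args) {
        long s0wp = packWordPos(42, 7);  // hypothetical word/POS ids
        long b0wp = packWordPos(13, 3);
        System.out.println(s0wp);        // 10759  (42*256 + 7)
        System.out.println(b0wp);        // 3331   (13*256 + 3)
        System.out.println(packPair(s0wp, b0wp));
    }
}
```

Because the keys stay inside one `long`, feature lookup is a single hash of a primitive rather than a string concatenation, which is the point of the encoding.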
featureVector[index++] = s0p | (sl0p << 8) | (sl1p << 16); + featureVector[index++] = s0p | (sr0p << 8) | (sr1p << 16); + featureVector[index++] = s0p | (sh0p << 8) | (sh1p << 16); + featureVector[index++] = b0p | (bl0p << 8) | (bl1p << 16); + + /** + * label set + */ + if (s0Position >= 0) + { + sdl = state.leftDependentLabels(s0Position); + sdr = state.rightDependentLabels(s0Position); + } + + if (b0Position >= 0) + { + bdl = state.leftDependentLabels(b0Position); + } + + if (s0w != 1) + { + featureVector[index++] = (s0w + "|" + sdr); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = (s0p + "|" + sdr); + if (s0w != 1) + { + featureVector[index++] = s0w + "|" + sdl; + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = (s0p + "|" + sdl); + if (b0w != 1) + { + featureVector[index++] = (b0w + "|" + bdl); + } + else + { + featureVector[index++] = null; + } + featureVector[index++] = (b0p + "|" + bdl); + + /** + * Brown cluster features + * full string for b0w and s0w + * 4 and 6 prefix string for s0p and b0p + */ + long b0wbc4 = b0bc4; + b0wbc4 |= (b0w << 12); + if (b0w == 1) + b0wbc4 = 0; + long b0wbc6 = b0bc6; + b0wbc6 |= (b0w << 12); + if (b0w == 1) + b0wbc6 = 0; + long b0bcfP = b0p; + b0bcfP |= (b0bcf << 8); + long s0wbc4 = s0bc4; + s0wbc4 |= (s0w << 12); + if (s0w == 0) + s0wbc4 = 0; + long s0wbc6 = s0bc6; + s0wbc6 |= (s0w << 12); + if (s0w == 0) + s0wbc6 = 0; + long s0bcfP = s0p; + s0bcfP |= (s0bcf << 8); + + + /** + * From single words + */ + if (s0bcf > 0) + { + if (s0w != 1) + { + featureVector[index++] = s0wbc4; + featureVector[index++] = s0wbc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + featureVector[index++] = s0bcfP; + + featureVector[index++] = s0bcf; + + featureVector[index++] = s0bc4; + featureVector[index++] = s0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] 
= null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (b0bcf > 0) + { + if (b0w != 1) + { + featureVector[index++] = b0wbc4; + featureVector[index++] = b0wbc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + featureVector[index++] = b0bcfP; + + featureVector[index++] = b0bcf; + + featureVector[index++] = b0bc4; + featureVector[index++] = b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + + /** + * from word pairs + */ + if (s0bcf > 0 && s0w != 1) + { + if (b0bcf > 0 && b0w != 1) + { + featureVector[index++] = (s0wbc4 << 32) | b0wbc4; + featureVector[index++] = (s0wbc6 << 32) | b0wbc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + if (b0w != 1) + { + featureVector[index++] = (s0wbc4 << 28) | b0wp; + featureVector[index++] = (s0wbc6 << 28) | b0wp; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + if (b0bcf > 0 && s0w != 1 & b0w != 1) + { + featureVector[index++] = (s0wp << 32) | b0wbc4; + featureVector[index++] = (s0wp << 32) | b0wbc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0 && s0w != 1) + { + if (b0w != 1) + { + featureVector[index++] = (s0wbc4 << 20) | b0w; + featureVector[index++] = (s0wbc6 << 20) | b0w; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + if (b0bcf > 0) + { + featureVector[index++] = (s0wbc4 << 12) | b0bcf; + featureVector[index++] = (s0wbc6 << 12) | b0bcf; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = 
null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (b0bcf > 0 && s0w != 1) + { + featureVector[index++] = (s0wp << 12) | b0bcf; + } + else + { + featureVector[index++] = null; + } + + if (s0bcf > 0 && b0w != 1) + { + featureVector[index++] = (s0bcf << 28) | b0wp; + } + else + { + featureVector[index++] = null; + } + + if (b0bcf > 0) + { + if (s0w != 1 && b0w != 1) + { + featureVector[index++] = (s0w << 32) | b0wbc4; + featureVector[index++] = (s0w << 32) | b0wbc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + if (s0bcf > 0 && b0w != 1) + { + featureVector[index++] = (s0bcf << 32) | b0wbc4; + featureVector[index++] = (s0bcf << 32) | b0wbc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0 && s0w != 1) + { + featureVector[index++] = (s0wbc4 << 8) | b0p; + featureVector[index++] = (s0wbc6 << 8) | b0p; + if (b0bcf > 0) + { + featureVector[index++] = (s0wbc4 << 8) | b0bc4; + featureVector[index++] = (s0wbc6 << 8) | b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0 && b0w != 1) + { + featureVector[index++] = (s0bc4 << 28) | b0wp; + featureVector[index++] = (s0bc6 << 28) | b0wp; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (b0bcf > 0 && b0w != 1) + { + featureVector[index++] = (s0p << 32) | b0wbc4; + featureVector[index++] = (s0p << 32) | b0wbc6; + + if (s0bcf > 0) + { + featureVector[index++] = (s0bc4 << 32) | b0wbc4; + featureVector[index++] = (s0bc6 << 32) | b0wbc6; + } + else + { + featureVector[index++] = 
null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (b0bcf > 0 && s0w != 1) + { + featureVector[index++] = (s0w << 12) | b0bcf; + } + else + { + featureVector[index++] = null; + } + + if (s0bcf > 0) + { + if (b0w != 1) + { + featureVector[index++] = (s0bcf << 20) | b0w; + } + else + { + featureVector[index++] = null; + } + if (b0bcf > 0) + { + featureVector[index++] = (s0bcf << 12) | b0bcf; + } + else + { + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0) + { + featureVector[index++] = (s0bc4 << 8) | b0p; + featureVector[index++] = (s0bc6 << 8) | b0p; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (b0bcf > 0) + { + featureVector[index++] = (s0p << 12) | b0bc4; + featureVector[index++] = (s0p << 12) | b0bc6; + + if (s0bcf > 0) + { + featureVector[index++] = (s0bc4 << 12) | b0bc4; + featureVector[index++] = (s0bc6 << 12) | b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + featureVector[index++] = (b0bc4 << 8) | b1p; + featureVector[index++] = (b0bc6 << 8) | b1p; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + /** + * from three words + */ + if (b0bcf > 0) + { + featureVector[index++] = (b0bc4 << 16) | (b1p << 8) | b2p; + featureVector[index++] = (b0bc6 << 16) | (b1p << 8) | b2p; + + featureVector[index++] = (s0p << 20) | (b0bc4 << 8) | b1p; + featureVector[index++] = (s0p << 20) | (b0bc6 << 8) | b1p; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0) + { + 
featureVector[index++] = (s0bc4 << 16) | (b2p << 8) | b1p; + featureVector[index++] = (s0bc6 << 16) | (b2p << 8) | b1p; + if (b0bcf > 0) + { + featureVector[index++] = (s0bc4 << 20) | (b0bc4 << 8) | b1p; + featureVector[index++] = (s0bc6 << 20) | (b0bc6 << 8) | b1p; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + featureVector[index++] = (sh0p << 20) | (s0bc4 << 8) | b0p; + featureVector[index++] = (sh0p << 20) | (s0bc6 << 8) | b0p; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (b0bcf > 0) + { + featureVector[index++] = (sh0p << 20) | (s0p << 12) | b0bc4; + featureVector[index++] = (sh0p << 20) | (s0p << 12) | b0bc6; + if (s0bcf > 0) + { + featureVector[index++] = (sh0p << 24) | (s0bc4 << 12) | b0bc4; + featureVector[index++] = (sh0p << 24) | (s0bc6 << 12) | b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + + if (b0bcf > 0) + { + featureVector[index++] = (s0p << 20) | (sl0p << 12) | b0bc4; + featureVector[index++] = (s0p << 20) | (sl0p << 12) | b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0) + { + featureVector[index++] = (s0bc4 << 16) | (sl0p << 8) | b0p; + featureVector[index++] = (s0bc6 << 16) | (sl0p << 8) | b0p; + if (b0bcf > 0) + { + featureVector[index++] = (s0bc4 << 20) | (sl0p << 12) | b0bc4; + featureVector[index++] = (s0bc6 << 20) | (sl0p << 12) | b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + 
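The `bc4`/`bc6`/`bcf` values used throughout this section come from Brown clustering: each word maps to a binary path in a cluster hierarchy, and fixed-length prefixes of that path give progressively coarser clusters. A hedged sketch of the prefix idea (the cluster bit string here is hypothetical, not taken from any real cluster file):

```java
public class BrownPrefix {
    // Take a fixed-length prefix of a Brown-cluster bit string; paths shorter
    // than the prefix length are kept whole. The 4th and 6th prefixes
    // correspond to the bc4/bc6 features above, the full string to bcf.
    static String prefix(String clusterPath, int len) {
        return clusterPath.length() <= len ? clusterPath : clusterPath.substring(0, len);
    }

    public static void main(String[] args) {
        String path = "10110100101"; // hypothetical cluster path for a word
        System.out.println(prefix(path, 4)); // 1011
        System.out.println(prefix(path, 6)); // 101101
        System.out.println(prefix("101", 4)); // 101 (kept whole)
    }
}
```

Coarse prefixes group rare words with frequent ones sharing a cluster ancestor, which is why the extractor pairs them with POS tags as back-off features.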
} + + if (b0bcf > 0) + { + featureVector[index++] = (s0p << 20) | (sr0p << 12) | b0bc4; + featureVector[index++] = (s0p << 20) | (sr0p << 12) | b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0) + { + featureVector[index++] = (s0bc4 << 16) | (sr0p << 8) | b0p; + featureVector[index++] = (s0bc6 << 16) | (sr0p << 8) | b0p; + if (b0bcf > 0) + { + featureVector[index++] = (s0bc4 << 20) | (sr0p << 12) | b0bc4; + featureVector[index++] = (s0bc6 << 20) | (sr0p << 12) | b0bc6; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (b0bcf > 0) + { + featureVector[index++] = (s0p << 20) | (b0bc4 << 8) | bl0p; + featureVector[index++] = (s0p << 20) | (b0bc6 << 8) | bl0p; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + + if (s0bcf > 0) + { + featureVector[index++] = (s0bc4 << 16) | (b0p << 8) | bl0p; + featureVector[index++] = (s0bc6 << 16) | (b0p << 8) | bl0p; + if (b0bcf > 0) + { + featureVector[index++] = (s0bc4 << 20) | (b0bc4 << 8) | bl0p; + featureVector[index++] = (s0bc6 << 20) | (b0bc6 << 8) | bl0p; + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + } + } + else + { + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + featureVector[index++] = null; + } + + return featureVector; + } + +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/Action.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/Action.java new file mode 100644 index 000000000..0dd07bb91 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/Action.java @@ -0,0 +1,65 @@ +package com.hankcs.hanlp.dependency.perceptron.transition.parser; + +import 
com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration; + +/** + * Created by Mohammad Sadegh Rasooli. + * ML-NLP Lab, Department of Computer Science, Columbia University + * Date Created: 12/23/14 + * Time: 11:08 AM + * To report any bugs or problems contact rasooli@cs.columbia.edu + */ + +public enum Action implements IAction +{ + Shift + { + @Override + public void commit(int relation, float score, int relationSize, Configuration config) + { + ArcEager.shift(config.state); + config.addAction(ordinal()); + config.setScore(score); + } + }, + Reduce + { + @Override + public void commit(int relation, float score, int relationSize, Configuration config) + { + ArcEager.reduce(config.state); + config.addAction(ordinal()); + config.setScore(score); + } + }, + Unshift + { + @Override + public void commit(int relation, float score, int relationSize, Configuration config) + { + ArcEager.unShift(config.state); + config.addAction(ordinal()); + config.setScore(score); + } + }, + RightArc + { + @Override + public void commit(int relation, float score, int relationSize, Configuration config) + { + ArcEager.rightArc(config.state, relation); + config.addAction(ordinal() + relation); + config.setScore(score); + } + }, + LeftArc + { + @Override + public void commit(int relation, float score, int relationSize, Configuration config) + { + ArcEager.leftArc(config.state, relation); + config.addAction(ordinal() + relationSize + relation); + config.setScore(score); + } + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ArcEager.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ArcEager.java new file mode 100644 index 000000000..b25757088 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ArcEager.java @@ -0,0 +1,140 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. 
See LICENSE file at the project root for terms. + */ + +package com.hankcs.hanlp.dependency.perceptron.transition.parser; + +import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron; +import com.hankcs.hanlp.dependency.perceptron.structures.IndexMaps; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.State; + +import java.util.ArrayList; + +public class ArcEager extends TransitionBasedParser +{ + private ArcEager(AveragedPerceptron classifier, ArrayList dependencyRelations, int featureLength, IndexMaps maps) + { + super(classifier, dependencyRelations, featureLength, maps); + } + + public static void shift(State state) + { + state.push(state.bufferHead()); + state.incrementBufferHead(); + + // changing the constraint + if (state.bufferEmpty()) + state.setEmptyFlag(true); + } + + public static void unShift(State state) + { + if (!state.stackEmpty()) + state.setBufferHead(state.pop()); + // to make sure + state.setEmptyFlag(true); + state.setMaxSentenceSize(state.bufferHead()); + } + + public static void reduce(State state) + { + state.pop(); + if (state.stackEmpty() && state.bufferEmpty()) + state.setEmptyFlag(true); + } + + public static void leftArc(State state, int dependency) + { + state.addArc(state.pop(), state.bufferHead(), dependency); + } + + public static void rightArc(State state, int dependency) + { + state.addArc(state.bufferHead(), state.stackTop(), dependency); + state.push(state.bufferHead()); + state.incrementBufferHead(); + if (!state.isEmptyFlag() && state.bufferEmpty()) + state.setEmptyFlag(true); + } + + public static boolean canDo(Action action, State state) + { + if (action == Action.Shift) + { //shift + return !(!state.bufferEmpty() && state.bufferHead() == state.rootIndex && !state.stackEmpty()) && !state.bufferEmpty() && !state.isEmptyFlag(); + } + else if (action == Action.RightArc) + { //right arc + if 
(state.stackEmpty()) + return false; + return !(!state.bufferEmpty() && state.bufferHead() == state.rootIndex) && !state.bufferEmpty() && !state.stackEmpty(); + + } + else if (action == Action.LeftArc) + { //left arc + if (state.stackEmpty() || state.bufferEmpty()) + return false; + + if (!state.stackEmpty() && state.stackTop() == state.rootIndex) + return false; + + return state.stackTop() != state.rootIndex && !state.hasHead(state.stackTop()) && !state.stackEmpty(); + } + else if (action == Action.Reduce) + { //reduce + return !state.stackEmpty() && state.hasHead(state.stackTop()) || !state.stackEmpty() && state.stackSize() == 1 && state.bufferSize() == 0 && state.stackTop() == state.rootIndex; + } + else if (action == Action.Unshift) + { //unshift + return !state.stackEmpty() && !state.hasHead(state.stackTop()) && state.isEmptyFlag(); + } + return false; + } + + /** + * Shows true if all of the configurations in the beam are in the terminal state + * + * @param beam the current beam + * @return true if all of the configurations in the beam are in the terminal state + */ + public static boolean isTerminal(ArrayList beam) + { + for (Configuration configuration : beam) + if (!configuration.state.isTerminalState()) + return false; + return true; + } + + + public static void commitAction(int action, int label, float score, ArrayList dependencyRelations, Configuration newConfig) + { + if (action == 0) + { + shift(newConfig.state); + newConfig.addAction(0); + } + else if (action == 1) + { + reduce(newConfig.state); + newConfig.addAction(1); + } + else if (action == 2) + { + rightArc(newConfig.state, label); + newConfig.addAction(3 + label); + } + else if (action == 3) + { + leftArc(newConfig.state, label); + newConfig.addAction(3 + dependencyRelations.size() + label); + } + else if (action == 4) + { + unShift(newConfig.state); + newConfig.addAction(2); + } + newConfig.setScore(score); + } +} diff --git 
a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/BeamScorerThread.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/BeamScorerThread.java new file mode 100644 index 000000000..f6d9e9da5 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/BeamScorerThread.java @@ -0,0 +1,59 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. + */ + +package com.hankcs.hanlp.dependency.perceptron.transition.parser; + +import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron; +import com.hankcs.hanlp.dependency.perceptron.transition.features.FeatureExtractor; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.BeamElement; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.State; + +import java.util.ArrayList; +import java.util.concurrent.Callable; + +import static com.hankcs.hanlp.dependency.perceptron.transition.parser.PartialTreeBeamScorerThread.addAvailableBeamElements; + + +public class BeamScorerThread implements Callable> +{ + + boolean isDecode; + AveragedPerceptron classifier; + Configuration configuration; + ArrayList dependencyRelations; + int featureLength; + int b; + boolean rootFirst; + + public BeamScorerThread(boolean isDecode, AveragedPerceptron classifier, Configuration configuration, ArrayList dependencyRelations, int featureLength, int b, boolean rootFirst) + { + this.isDecode = isDecode; + this.classifier = classifier; + this.configuration = configuration; + this.dependencyRelations = dependencyRelations; + this.featureLength = featureLength; + this.b = b; + this.rootFirst = rootFirst; + } + + + public ArrayList call() + { + ArrayList elements = new ArrayList(dependencyRelations.size() * 2 + 3); + + State currentState = configuration.state; + 
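`BeamScorerThread` scores the candidate transitions for one configuration; the caller then collects all scored `BeamElement`s in a `TreeSet` and keeps only the best `beamWidth` of them (see `commitActionInBeam` above). The selection step can be sketched as follows, with made-up scores:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class BeamTopK {
    // Keep the top-k scores by walking a TreeSet in descending order and
    // stopping at the beam width, as commitActionInBeam does with BeamElement.
    static List<Float> topK(TreeSet<Float> scored, int beamWidth) {
        List<Float> kept = new ArrayList<Float>();
        for (float s : scored.descendingSet()) {
            if (kept.size() >= beamWidth) break;
            kept.add(s);
        }
        return kept;
    }

    public static void main(String[] args) {
        TreeSet<Float> scores = new TreeSet<Float>();
        scores.add(0.5f);
        scores.add(1.2f);
        scores.add(-0.3f);
        scores.add(0.9f);
        System.out.println(topK(scores, 2)); // [1.2, 0.9]
    }
}
```

Using an ordered set and truncating at the beam width keeps insertion O(log n) per candidate, which matters because every beam item spawns up to `2 * |relations| + 3` candidates per step.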
float prevScore = configuration.score; + + boolean canShift = ArcEager.canDo(Action.Shift, currentState); + boolean canReduce = ArcEager.canDo(Action.Reduce, currentState); + boolean canRightArc = ArcEager.canDo(Action.RightArc, currentState); + boolean canLeftArc = ArcEager.canDo(Action.LeftArc, currentState); + Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength); + + addAvailableBeamElements(elements, prevScore, canShift, canReduce, canRightArc, canLeftArc, features, classifier, isDecode, b, dependencyRelations); + return elements; + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/IAction.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/IAction.java new file mode 100644 index 000000000..7e6c72dfa --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/IAction.java @@ -0,0 +1,21 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-04 上午10:23 + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.dependency.perceptron.transition.parser; + +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration; + +/** + * @author hankcs + */ +public interface IAction +{ + void commit(int relation, float score, int relationSize, Configuration config); +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/KBeamArcEagerParser.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/KBeamArcEagerParser.java new file mode 100644 index 000000000..4404158e8 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/KBeamArcEagerParser.java @@ -0,0 +1,628 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. 
See LICENSE file at the project root for terms. + */ + +package com.hankcs.hanlp.dependency.perceptron.transition.parser; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dependency.perceptron.accessories.Edge; +import com.hankcs.hanlp.dependency.perceptron.accessories.Options; +import com.hankcs.hanlp.dependency.perceptron.structures.IndexMaps; +import com.hankcs.hanlp.dependency.perceptron.structures.ParserModel; +import com.hankcs.hanlp.dependency.perceptron.transition.features.FeatureExtractor; +import com.hankcs.hanlp.dependency.perceptron.accessories.CoNLLReader; +import com.hankcs.hanlp.dependency.perceptron.accessories.Pair; +import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron; +import com.hankcs.hanlp.dependency.perceptron.structures.Sentence; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.BeamElement; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Instance; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.State; + +import java.io.*; +import java.text.DecimalFormat; +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.TreeSet; +import java.util.concurrent.*; + +public class KBeamArcEagerParser extends TransitionBasedParser +{ + ExecutorService executor; + CompletionService> pool; + public Options options; + + public KBeamArcEagerParser(String modelPath) throws IOException, ClassNotFoundException + { + this(modelPath, Runtime.getRuntime().availableProcessors()); + } + + public KBeamArcEagerParser(String modelPath, int numOfThreads) throws IOException, ClassNotFoundException + { + this(new ParserModel(modelPath), numOfThreads); + } + + public KBeamArcEagerParser(ParserModel parserModel, int numOfThreads) + { + this(new AveragedPerceptron(parserModel), parserModel.dependencyLabels, 
parserModel.shiftFeatureAveragedWeights.length, parserModel.maps, numOfThreads, parserModel.options); + } + + public KBeamArcEagerParser(AveragedPerceptron classifier, ArrayList dependencyRelations, + int featureLength, IndexMaps maps, int numOfThreads, Options options) + { + super(classifier, dependencyRelations, featureLength, maps); + executor = Executors.newFixedThreadPool(numOfThreads); + pool = new ExecutorCompletionService>(executor); + this.options = options; + } + + public Configuration parse(String[] words, String[] tags) throws ExecutionException, InterruptedException + { + return parse(maps.makeSentence(words, tags, options.rootFirst, options.lowercase), options.rootFirst, options.beamWidth, 1); + } + + public Configuration parse(Sentence sentence) throws ExecutionException, InterruptedException + { + return parse(sentence, options.rootFirst, options.beamWidth, options.numOfThreads); + } + + public Configuration parse(String[] words, String[] tags, boolean rootFirst, int beamWidth, int numOfThreads) throws ExecutionException, InterruptedException + { + return parse(maps.makeSentence(words, tags, options.rootFirst, options.lowercase), rootFirst, beamWidth, numOfThreads); + } + + public Configuration parse(Sentence sentence, boolean rootFirst, int beamWidth, int numOfThreads) throws ExecutionException, InterruptedException + { + Configuration initialConfiguration = new Configuration(sentence, rootFirst); + + ArrayList beam = new ArrayList(beamWidth); + beam.add(initialConfiguration); + + while (!ArcEager.isTerminal(beam)) + { + TreeSet beamPreserver = new TreeSet(); + + if (numOfThreads == 1) + { + sortBeam(beam, beamPreserver, false, new Instance(sentence, new HashMap()), beamWidth, rootFirst, featureLength, classifier, dependencyRelations); + } + else + { + for (int b = 0; b < beam.size(); b++) + { + pool.submit(new BeamScorerThread(true, classifier, beam.get(b), + dependencyRelations, featureLength, b, rootFirst)); + } + fetchBeamFromPool(beamWidth, 
beam, beamPreserver);
+            }
+
+
+            beam = commitActionInBeam(beamWidth, beam, beamPreserver);
+        }
+
+        Configuration bestConfiguration = null;
+        float bestScore = Float.NEGATIVE_INFINITY;
+        for (Configuration configuration : beam)
+        {
+            if (configuration.getScore(true) > bestScore)
+            {
+                bestScore = configuration.getScore(true);
+                bestConfiguration = configuration;
+            }
+        }
+        return bestConfiguration;
+    }
+
+    private ArrayList<Configuration> commitActionInBeam(int beamWidth, ArrayList<Configuration> beam, TreeSet<BeamElement> beamPreserver)
+    {
+        ArrayList<Configuration> repBeam = new ArrayList<Configuration>(beamWidth);
+        for (BeamElement beamElement : beamPreserver.descendingSet())
+        {
+            if (repBeam.size() >= beamWidth)
+                break;
+            int b = beamElement.index;
+            int action = beamElement.action;
+            int label = beamElement.label;
+            float score = beamElement.score;
+
+            Configuration newConfig = beam.get(b).clone();
+
+            ArcEager.commitAction(action, label, score, dependencyRelations, newConfig);
+            repBeam.add(newConfig);
+        }
+        beam = repBeam;
+        return beam;
+    }
+
+    private void parsePartialWithOneThread(ArrayList<Configuration> beam, TreeSet<BeamElement> beamPreserver, Boolean isNonProjective, Instance instance, int beamWidth, boolean rootFirst)
+    {
+        sortBeam(beam, beamPreserver, isNonProjective, instance, beamWidth, rootFirst, featureLength, classifier, dependencyRelations);
+
+        //todo
+        if (beamPreserver.size() == 0)
+        {
+            ParseThread.sortBeam(beam, beamPreserver, false, null, beamWidth, rootFirst, featureLength, classifier, dependencyRelations);
+        }
+    }
+
+    private static void sortBeam(ArrayList<Configuration> beam, TreeSet<BeamElement> beamPreserver, Boolean isNonProjective, Instance instance, int beamWidth, boolean rootFirst, int featureLength, AveragedPerceptron classifier, Collection<Integer> dependencyRelations)
+    {
+        for (int b = 0; b < beam.size(); b++)
+        {
+            Configuration configuration = beam.get(b);
+            State currentState = configuration.state;
+            float prevScore = configuration.score;
+            boolean canShift = ArcEager.canDo(Action.Shift, currentState);
+            boolean canReduce = ArcEager.canDo(Action.Reduce,
currentState); + boolean canRightArc = ArcEager.canDo(Action.RightArc, currentState); + boolean canLeftArc = ArcEager.canDo(Action.LeftArc, currentState); + Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength); + if (!canShift + && !canReduce + && !canRightArc + && !canLeftArc && rootFirst) + { + beamPreserver.add(new BeamElement(prevScore, b, 4, -1)); + + if (beamPreserver.size() > beamWidth) + beamPreserver.pollFirst(); + } + + if (canShift) + { + if (isNonProjective || instance.actionCost(Action.Shift, -1, currentState) == 0) + { + float score = classifier.shiftScore(features, true); + float addedScore = score + prevScore; + beamPreserver.add(new BeamElement(addedScore, b, 0, -1)); + + if (beamPreserver.size() > beamWidth) + beamPreserver.pollFirst(); + } + } + + if (canReduce) + { + if (isNonProjective || instance.actionCost(Action.Reduce, -1, currentState) == 0) + { + float score = classifier.reduceScore(features, true); + float addedScore = score + prevScore; + beamPreserver.add(new BeamElement(addedScore, b, 1, -1)); + + if (beamPreserver.size() > beamWidth) + beamPreserver.pollFirst(); + } + } + + if (canRightArc) + { + float[] rightArcScores = classifier.rightArcScores(features, true); + for (int dependency : dependencyRelations) + { + if (isNonProjective || instance.actionCost(Action.RightArc, dependency, currentState) == 0) + { + float score = rightArcScores[dependency]; + float addedScore = score + prevScore; + beamPreserver.add(new BeamElement(addedScore, b, 2, dependency)); + + if (beamPreserver.size() > beamWidth) + beamPreserver.pollFirst(); + } + } + } + + if (canLeftArc) + { + float[] leftArcScores = classifier.leftArcScores(features, true); + for (int dependency : dependencyRelations) + { + if (isNonProjective || instance.actionCost(Action.LeftArc, dependency, currentState) == 0) + { + float score = leftArcScores[dependency]; + float addedScore = score + prevScore; + beamPreserver.add(new 
BeamElement(addedScore, b, 3, dependency));
+
+                        if (beamPreserver.size() > beamWidth)
+                            beamPreserver.pollFirst();
+                    }
+                }
+            }
+        }
+    }
+
+    public Configuration parsePartial(Instance instance, Sentence sentence, boolean rootFirst, int beamWidth, int numOfThreads) throws ExecutionException, InterruptedException
+    {
+        Configuration initialConfiguration = new Configuration(sentence, rootFirst);
+        boolean isNonProjective = false;
+        if (instance.isNonprojective())
+        {
+            isNonProjective = true;
+        }
+
+        ArrayList<Configuration> beam = new ArrayList<Configuration>(beamWidth);
+        beam.add(initialConfiguration);
+
+        while (!ArcEager.isTerminal(beam))
+        {
+            TreeSet<BeamElement> beamPreserver = new TreeSet<BeamElement>();
+
+            if (numOfThreads == 1)
+            {
+                parsePartialWithOneThread(beam, beamPreserver, isNonProjective, instance, beamWidth, rootFirst);
+            }
+            else
+            {
+                for (int b = 0; b < beam.size(); b++)
+                {
+                    pool.submit(new PartialTreeBeamScorerThread(true, classifier, instance, beam.get(b),
+                            dependencyRelations, featureLength, b));
+                }
+                fetchBeamFromPool(beamWidth, beam, beamPreserver);
+            }
+
+            beam = commitActionInBeam(beamWidth, beam, beamPreserver);
+        }
+
+        Configuration bestConfiguration = null;
+        float bestScore = Float.NEGATIVE_INFINITY;
+        for (Configuration configuration : beam)
+        {
+            if (configuration.getScore(true) > bestScore)
+            {
+                bestScore = configuration.getScore(true);
+                bestConfiguration = configuration;
+            }
+        }
+        return bestConfiguration;
+    }
+
+    private void fetchBeamFromPool(int beamWidth, ArrayList<Configuration> beam, TreeSet<BeamElement> beamPreserver) throws InterruptedException, ExecutionException
+    {
+        for (int b = 0; b < beam.size(); b++)
+        {
+            for (BeamElement element : pool.take().get())
+            {
+                beamPreserver.add(element);
+                if (beamPreserver.size() > beamWidth)
+                    beamPreserver.pollFirst();
+            }
+        }
+    }
+
+    public void parseConllFile(String inputFile, String outputFile, boolean rootFirst, int beamWidth, boolean labeled, boolean lowerCased, int numThreads, boolean partial, String scorePath) throws IOException, ExecutionException,
InterruptedException
+    {
+        if (numThreads == 1)
+            parseConllFileNoParallel(inputFile, outputFile, rootFirst, beamWidth, labeled, lowerCased, numThreads, partial, scorePath);
+        else
+            parseConllFileParallel(inputFile, outputFile, rootFirst, beamWidth, lowerCased, numThreads, partial, scorePath);
+    }
+
+    /**
+     * Parses a treebank with a single thread. The input must be in CoNLL 2006 format.
+     *
+     * @param inputFile  path to the input treebank in CoNLL 2006 format
+     * @param outputFile path the parsed output is written to
+     * @param rootFirst  whether the root token is placed at the beginning of each sentence
+     * @param beamWidth  width of the beam used during search
+     * @throws IOException
+     * @throws ExecutionException
+     * @throws InterruptedException
+     */
+    public void parseConllFileNoParallel(String inputFile, String outputFile, boolean rootFirst, int beamWidth, boolean labeled, boolean lowerCased, int numOfThreads, boolean partial, String scorePath) throws IOException, ExecutionException, InterruptedException
+    {
+        CoNLLReader reader = new CoNLLReader(inputFile);
+        boolean addScore = false;
+        if (scorePath.trim().length() > 0)
+            addScore = true;
+        ArrayList<Float> scoreList = new ArrayList<Float>();
+
+        long start = System.currentTimeMillis();
+        int allArcs = 0;
+        int size = 0;
+        BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile + ".tmp"));
+        int dataCount = 0;
+
+        while (true)
+        {
+            ArrayList<Instance> data = reader.readData(15000, true, labeled, rootFirst, lowerCased, maps);
+            size += data.size();
+            if (data.size() == 0)
+                break;
+
+            for (Instance instance : data)
+            {
+                dataCount++;
+                if (dataCount % 100 == 0)
+                    System.err.print(dataCount + " ... 
"); + Configuration bestParse; + if (partial) + bestParse = parsePartial(instance, instance.getSentence(), rootFirst, beamWidth, numOfThreads); + else bestParse = parse(instance.getSentence(), rootFirst, beamWidth, numOfThreads); + + int[] words = instance.getSentence().getWords(); + allArcs += words.length - 1; + if (addScore) + scoreList.add(bestParse.score / bestParse.sentence.size()); + + writeParsedSentence(writer, rootFirst, bestParse, words); + } + } + +// System.err.print("\n"); + long end = System.currentTimeMillis(); + float each = (1.0f * (end - start)) / size; + float eacharc = (1.0f * (end - start)) / allArcs; + + writer.flush(); + writer.close(); + +// DecimalFormat format = new DecimalFormat("##.00"); +// +// System.err.print(format.format(eacharc) + " ms for each arc!\n"); +// System.err.print(format.format(each) + " ms for each sentence!\n\n"); + + BufferedReader gReader = new BufferedReader(new FileReader(inputFile)); + BufferedReader pReader = new BufferedReader(new FileReader(outputFile + ".tmp")); + BufferedWriter pwriter = new BufferedWriter(new FileWriter(outputFile)); + + String line; + + while ((line = pReader.readLine()) != null) + { + String gLine = gReader.readLine(); + if (line.trim().length() > 0) + { + while (gLine.trim().length() == 0) + gLine = gReader.readLine(); + String[] ps = line.split("\t"); + String[] gs = gLine.split("\t"); + gs[6] = ps[0]; + gs[7] = ps[1]; + StringBuilder output = new StringBuilder(); + for (int i = 0; i < gs.length; i++) + { + output.append(gs[i]).append("\t"); + } + pwriter.write(output.toString().trim() + "\n"); + } + else + { + pwriter.write("\n"); + } + } + pwriter.flush(); + pwriter.close(); + + if (addScore) + { + BufferedWriter scoreWriter = new BufferedWriter(new FileWriter(scorePath)); + + for (int i = 0; i < scoreList.size(); i++) + scoreWriter.write(scoreList.get(i) + "\n"); + scoreWriter.flush(); + scoreWriter.close(); + } + IOUtil.deleteFile(outputFile + ".tmp"); + } + + private void 
writeParsedSentence(BufferedWriter writer, boolean rootFirst, Configuration bestParse, int[] words) throws IOException
+    {
+        StringBuilder finalOutput = new StringBuilder();
+        for (int i = 0; i < words.length; i++)
+        {
+            int w = i + 1;
+            int head = bestParse.state.getHead(w);
+            int dep = bestParse.state.getDependent(w);
+
+            if (w == bestParse.state.rootIndex && !rootFirst)
+                continue;
+
+            if (head == bestParse.state.rootIndex)
+                head = 0;
+
+            String label = head == 0 ? maps.rootString : maps.idWord[dep];
+            String output = head + "\t" + label + "\n";
+            finalOutput.append(output);
+        }
+        finalOutput.append("\n");
+        writer.write(finalOutput.toString());
+    }
+
+    public void parseTaggedFile(String inputFile, String outputFile, boolean rootFirst, int beamWidth, boolean lowerCased, String separator, int numOfThreads) throws Exception
+    {
+        BufferedReader reader = new BufferedReader(new FileReader(inputFile));
+        BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile));
+        long start = System.currentTimeMillis();
+
+        ExecutorService executor = Executors.newFixedThreadPool(numOfThreads);
+        CompletionService<Pair<String, Integer>> pool = new ExecutorCompletionService<Pair<String, Integer>>(executor);
+
+
+        String line;
+        int count = 0;
+        int lineNum = 0;
+        while ((line = reader.readLine()) != null)
+        {
+            pool.submit(new ParseTaggedThread(lineNum++, line, separator, rootFirst, lowerCased, maps, beamWidth, this));
+
+            if (lineNum % 1000 == 0)
+            {
+                String[] outs = new String[lineNum];
+                for (int i = 0; i < lineNum; i++)
+                {
+                    count++;
+                    if (count % 100 == 0)
+                        System.err.print(count + "...");
+                    Pair<String, Integer> result = pool.take().get();
+                    outs[result.second] = result.first;
+                }
+
+                for (int i = 0; i < lineNum; i++)
+                {
+                    if (outs[i].length() > 0)
+                    {
+                        writer.write(outs[i]);
+                    }
+                }
+
+                lineNum = 0;
+            }
+        }
+
+        if (lineNum > 0)
+        {
+            String[] outs = new String[lineNum];
+            for (int i = 0; i < lineNum; i++)
+            {
+                count++;
+                if (count % 100 == 0)
+                    System.err.print(count + "...");
+                Pair<String, Integer> result = pool.take().get();
+ 
outs[result.second] = result.first;
+            }
+
+            for (int i = 0; i < lineNum; i++)
+            {
+
+                if (outs[i].length() > 0)
+                {
+                    writer.write(outs[i]);
+                }
+            }
+        }
+
+        long end = System.currentTimeMillis();
+        System.out.println("\n" + (end - start) + " ms");
+        writer.flush();
+        writer.close();
+
+        System.out.println("done!");
+    }
+
+    public void parseConllFileParallel(String inputFile, String outputFile, boolean rootFirst, int beamWidth, boolean lowerCased, int numThreads, boolean partial, String scorePath) throws IOException, InterruptedException, ExecutionException
+    {
+        CoNLLReader reader = new CoNLLReader(inputFile);
+
+        boolean addScore = false;
+        if (scorePath.trim().length() > 0)
+            addScore = true;
+        ArrayList<Float> scoreList = new ArrayList<Float>();
+
+        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
+        CompletionService<Pair<Configuration, Integer>> pool = new ExecutorCompletionService<Pair<Configuration, Integer>>(executor);
+
+        long start = System.currentTimeMillis();
+        int allArcs = 0;
+        int size = 0;
+        BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile + ".tmp"));
+        int dataCount = 0;
+
+        while (true)
+        {
+            ArrayList<Instance> data = reader.readData(15000, true, true, rootFirst, lowerCased, maps);
+            size += data.size();
+            if (data.size() == 0)
+                break;
+
+            int index = 0;
+            Configuration[] confs = new Configuration[data.size()];
+
+            for (Instance instance : data)
+            {
+                ParseThread thread = new ParseThread(index, classifier, dependencyRelations, featureLength, instance.getSentence(), rootFirst, beamWidth, instance, partial);
+                pool.submit(thread);
+                index++;
+            }
+
+            for (int i = 0; i < confs.length; i++)
+            {
+                dataCount++;
+//                if (dataCount % 100 == 0)
+//                    System.err.print(dataCount + " ... 
");
+
+                Pair<Configuration, Integer> configurationIntegerPair = pool.take().get();
+                confs[configurationIntegerPair.second] = configurationIntegerPair.first;
+            }
+
+            for (int j = 0; j < confs.length; j++)
+            {
+                Configuration bestParse = confs[j];
+                if (addScore)
+                {
+                    scoreList.add(bestParse.score / bestParse.sentence.size());
+                }
+                int[] words = data.get(j).getSentence().getWords();
+
+                allArcs += words.length - 1;
+
+                writeParsedSentence(writer, rootFirst, bestParse, words);
+            }
+        }
+
+//        System.err.print("\n");
+        long end = System.currentTimeMillis();
+        float each = (1.0f * (end - start)) / size;
+        float eacharc = (1.0f * (end - start)) / allArcs;
+
+        writer.flush();
+        writer.close();
+
+//        DecimalFormat format = new DecimalFormat("##.00");
+//
+//        System.err.print(format.format(eacharc) + " ms for each arc!\n");
+//        System.err.print(format.format(each) + " ms for each sentence!\n\n");
+
+        BufferedReader gReader = new BufferedReader(new FileReader(inputFile));
+        BufferedReader pReader = new BufferedReader(new FileReader(outputFile + ".tmp"));
+        BufferedWriter pwriter = new BufferedWriter(new FileWriter(outputFile));
+
+        String line;
+
+        while ((line = pReader.readLine()) != null)
+        {
+            String gLine = gReader.readLine();
+            if (line.trim().length() > 0)
+            {
+                while (gLine.trim().length() == 0)
+                    gLine = gReader.readLine();
+                String[] ps = line.split("\t");
+                String[] gs = gLine.split("\t");
+                gs[6] = ps[0];
+                gs[7] = ps[1];
+                StringBuilder output = new StringBuilder();
+                for (int i = 0; i < gs.length; i++)
+                {
+                    output.append(gs[i]).append("\t");
+                }
+                pwriter.write(output.toString().trim() + "\n");
+            }
+            else
+            {
+                pwriter.write("\n");
+            }
+        }
+        pwriter.flush();
+        pwriter.close();
+
+        if (addScore)
+        {
+            BufferedWriter scoreWriter = new BufferedWriter(new FileWriter(scorePath));
+
+            for (int i = 0; i < scoreList.size(); i++)
+                scoreWriter.write(scoreList.get(i) + "\n");
+            scoreWriter.flush();
+            scoreWriter.close();
+        }
+        IOUtil.deleteFile(outputFile + ".tmp");
+    }
+
+    public void
shutDownLiveThreads() + { + boolean isTerminated = executor.isTerminated(); + while (!isTerminated) + { + executor.shutdownNow(); + isTerminated = executor.isTerminated(); + } + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/LabeledAction.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/LabeledAction.java new file mode 100644 index 000000000..8cdd1a916 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/LabeledAction.java @@ -0,0 +1,52 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-04 上午11:10 + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.dependency.perceptron.transition.parser; + +/** + * @author hankcs + */ +public class LabeledAction +{ + public Action action; + public int label; + + public LabeledAction(Action action, int label) + { + this.action = action; + this.label = label; + } + + public LabeledAction(final int actionCode, final int labelSize) + { + if (actionCode == Action.Shift.ordinal()) + { + action = Action.Shift; + } + else if (actionCode == Action.Reduce.ordinal()) + { + action = Action.Reduce; + } + else if (actionCode >= Action.RightArc.ordinal() + labelSize) + { + label = actionCode - (Action.RightArc.ordinal() + labelSize); + action = Action.LeftArc; + } + else if (actionCode >= Action.RightArc.ordinal()) + { + label = actionCode - Action.RightArc.ordinal(); + action = Action.RightArc; + } + else if (actionCode == Action.Unshift.ordinal()) + { + action = Action.Unshift; + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ParseTaggedThread.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ParseTaggedThread.java new file mode 100644 index 000000000..aaac101cb --- /dev/null +++ 
b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ParseTaggedThread.java
@@ -0,0 +1,132 @@
+package com.hankcs.hanlp.dependency.perceptron.transition.parser;
+
+import com.hankcs.hanlp.dependency.perceptron.structures.IndexMaps;
+import com.hankcs.hanlp.dependency.perceptron.accessories.Pair;
+import com.hankcs.hanlp.dependency.perceptron.structures.Sentence;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.concurrent.Callable;
+
+/**
+ * Created by Mohammad Sadegh Rasooli.
+ * ML-NLP Lab, Department of Computer Science, Columbia University
+ * Date Created: 1/6/15
+ * Time: 11:12 AM
+ * To report any bugs or problems contact rasooli@cs.columbia.edu
+ */
+
+public class ParseTaggedThread implements Callable<Pair<String, Integer>>
+{
+    int lineNum;
+    String line;
+    String delim;
+    boolean rootFirst;
+    boolean lowerCased;
+    IndexMaps maps;
+    int beamWidth;
+    KBeamArcEagerParser parser;
+
+    public ParseTaggedThread(int lineNum, String line, String delim, boolean rootFirst, boolean lowerCased, IndexMaps maps, int beamWidth, KBeamArcEagerParser parser)
+    {
+        this.lineNum = lineNum;
+        this.line = line;
+        this.delim = delim;
+        this.rootFirst = rootFirst;
+        this.lowerCased = lowerCased;
+        this.maps = maps;
+        this.beamWidth = beamWidth;
+        this.parser = parser;
+    }
+
+    @Override
+    public Pair<String, Integer> call() throws Exception
+    {
+        HashMap<String, Integer> wordMap = maps.getWordId();
+
+        line = line.trim();
+        String[] wrds = line.split(" ");
+        String[] words = new String[wrds.length];
+        String[] posTags = new String[wrds.length];
+
+        ArrayList<Integer> tokens = new ArrayList<Integer>();
+        ArrayList<Integer> tags = new ArrayList<Integer>();
+        ArrayList<Integer> brownCluster4thPrefix = new ArrayList<Integer>();
+        ArrayList<Integer> brownCluster6thPrefix = new ArrayList<Integer>();
+        ArrayList<Integer> brownClusterFullString = new ArrayList<Integer>();
+
+        int i = 0;
+        for (String w : wrds)
+        {
+            if (w.length() == 0)
+                continue;
+            int index = w.lastIndexOf(delim);
+            String word = 
w.substring(0, index); + if (lowerCased) + word = word.toLowerCase(); + String pos = w.substring(index + 1); + words[i] = word; + posTags[i++] = pos; + + int wi = -1; + if (wordMap.containsKey(word)) + wi = wordMap.get(word); + + int pi = -1; + if (wordMap.containsKey(pos)) + pi = wordMap.get(pos); + int[] clusters = maps.clusterId(word); + brownClusterFullString.add(clusters[0]); + brownCluster4thPrefix.add(clusters[1]); + brownCluster6thPrefix.add(clusters[2]); + + tokens.add(wi); + tags.add(pi); + } + + if (tokens.size() > 0) + { + if (!rootFirst) + { + tokens.add(0); + tags.add(0); + brownClusterFullString.add(0); + brownCluster4thPrefix.add(0); + brownCluster6thPrefix.add(0); + } + + Sentence sentence = new Sentence(tokens, tags, brownCluster4thPrefix, brownCluster6thPrefix, brownClusterFullString); + Configuration bestParse = parser.parse(sentence, rootFirst, beamWidth, 1); + + StringBuilder finalOutput = new StringBuilder(); + for (i = 0; i < words.length; i++) + { + + String word = words[i]; + String pos = posTags[i]; + + int w = i + 1; + int head = bestParse.state.getHead(w); + int dep = bestParse.state.getDependent(w); + + + String lemma = "_"; + + String fpos = "_"; + + if (head == bestParse.state.rootIndex) + head = 0; + String label = head == 0 ? 
maps.rootString : maps.idWord[dep];
+
+                String output = w + "\t" + word + "\t" + lemma + "\t" + pos + "\t" + fpos + "\t_\t" + head + "\t" + label + "\t_\t_\n";
+                finalOutput.append(output);
+            }
+            if (words.length > 0)
+                finalOutput.append("\n");
+
+            return new Pair<String, Integer>(finalOutput.toString(), lineNum);
+        }
+        return new Pair<String, Integer>("", lineNum);
+    }
+}
diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ParseThread.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ParseThread.java
new file mode 100644
index 000000000..9f5de8416
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/ParseThread.java
@@ -0,0 +1,387 @@
+/**
+ * Copyright 2014, Yahoo! Inc.
+ * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms.
+ */
+
+
+package com.hankcs.hanlp.dependency.perceptron.transition.parser;
+
+import com.hankcs.hanlp.dependency.perceptron.accessories.Pair;
+import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Instance;
+import com.hankcs.hanlp.dependency.perceptron.transition.features.FeatureExtractor;
+import com.hankcs.hanlp.dependency.perceptron.structures.Sentence;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.BeamElement;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.State;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.TreeSet;
+import java.util.concurrent.Callable;
+
+public class ParseThread implements Callable<Pair<Configuration, Integer>>
+{
+    AveragedPerceptron classifier;
+
+    ArrayList<Integer> dependencyRelations;
+
+    int featureLength;
+
+    Sentence sentence;
+    boolean rootFirst;
+    int beamWidth;
+    Instance instance;
+    boolean partial;
+
+    int id;
+
+    public ParseThread(int id, AveragedPerceptron classifier,
ArrayList<Integer> dependencyRelations, int featureLength,
+                       Sentence sentence,
+                       boolean rootFirst, int beamWidth, Instance instance, boolean partial)
+    {
+        this.id = id;
+        this.classifier = classifier;
+        this.dependencyRelations = dependencyRelations;
+        this.featureLength = featureLength;
+        this.sentence = sentence;
+        this.rootFirst = rootFirst;
+        this.beamWidth = beamWidth;
+        this.instance = instance;
+        this.partial = partial;
+    }
+
+    @Override
+    public Pair<Configuration, Integer> call() throws Exception
+    {
+        if (!partial)
+            return parse();
+        else return new Pair<Configuration, Integer>(parsePartial(), id);
+    }
+
+    Pair<Configuration, Integer> parse() throws Exception
+    {
+        Configuration initialConfiguration = new Configuration(sentence, rootFirst);
+
+        ArrayList<Configuration> beam = new ArrayList<Configuration>(beamWidth);
+        beam.add(initialConfiguration);
+
+        while (!ArcEager.isTerminal(beam))
+        {
+            if (beamWidth != 1)
+            {
+                TreeSet<BeamElement> beamPreserver = new TreeSet<BeamElement>();
+                sortBeam(beam, beamPreserver, false, null, beamWidth, rootFirst, featureLength, classifier, dependencyRelations);
+
+                ArrayList<Configuration> repBeam = new ArrayList<Configuration>(beamWidth);
+                for (BeamElement beamElement : beamPreserver.descendingSet())
+                {
+                    if (repBeam.size() >= beamWidth)
+                        break;
+                    int b = beamElement.index;
+                    int action = beamElement.action;
+                    int label = beamElement.label;
+                    float score = beamElement.score;
+
+                    Configuration newConfig = beam.get(b).clone();
+
+                    if (action == 0)
+                    {
+                        ArcEager.shift(newConfig.state);
+                        newConfig.addAction(0);
+                    }
+                    else if (action == 1)
+                    {
+                        ArcEager.reduce(newConfig.state);
+                        newConfig.addAction(1);
+                    }
+                    else if (action == 2)
+                    {
+                        ArcEager.rightArc(newConfig.state, label);
+                        newConfig.addAction(3 + label);
+                    }
+                    else if (action == 3)
+                    {
+                        ArcEager.leftArc(newConfig.state, label);
+                        newConfig.addAction(3 + dependencyRelations.size() + label);
+                    }
+                    else if (action == 4)
+                    {
+                        ArcEager.unShift(newConfig.state);
+                        newConfig.addAction(2);
+                    }
+                    newConfig.setScore(score);
+                    repBeam.add(newConfig);
+                }
+                beam = repBeam;
+            }
+            else
+            {
+                Configuration configuration = 
beam.get(0); + State currentState = configuration.state; + Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength); + float bestScore = Float.NEGATIVE_INFINITY; + int bestAction = -1; + + boolean canShift = ArcEager.canDo(Action.Shift, currentState); + boolean canReduce = ArcEager.canDo(Action.Reduce, currentState); + boolean canRightArc = ArcEager.canDo(Action.RightArc, currentState); + boolean canLeftArc = ArcEager.canDo(Action.LeftArc, currentState); + + if (!canShift + && !canReduce + && !canRightArc + && !canLeftArc) + { + + if (!currentState.stackEmpty()) + { + ArcEager.unShift(currentState); + configuration.addAction(2); + } + else if (!currentState.bufferEmpty() && currentState.stackEmpty()) + { + ArcEager.shift(currentState); + configuration.addAction(0); + } + } + + if (canShift) + { + float score = classifier.shiftScore(features, true); + if (score > bestScore) + { + bestScore = score; + bestAction = 0; + } + } + if (canReduce) + { + float score = classifier.reduceScore(features, true); + if (score > bestScore) + { + bestScore = score; + bestAction = 1; + } + } + if (canRightArc) + { + float[] rightArcScores = classifier.rightArcScores(features, true); + for (int dependency : dependencyRelations) + { + float score = rightArcScores[dependency]; + if (score > bestScore) + { + bestScore = score; + bestAction = 3 + dependency; + } + } + } + if (ArcEager.canDo(Action.LeftArc, currentState)) + { + float[] leftArcScores = classifier.leftArcScores(features, true); + for (int dependency : dependencyRelations) + { + float score = leftArcScores[dependency]; + if (score > bestScore) + { + bestScore = score; + bestAction = 3 + dependencyRelations.size() + dependency; + } + } + } + + if (bestAction != -1) + { + if (bestAction == 0) + { + ArcEager.shift(configuration.state); + } + else if (bestAction == (1)) + { + ArcEager.reduce(configuration.state); + } + else + { + + if (bestAction >= 3 + dependencyRelations.size()) + { + int label 
= bestAction - (3 + dependencyRelations.size());
+                        ArcEager.leftArc(configuration.state, label);
+                    }
+                    else
+                    {
+                        int label = bestAction - 3;
+                        ArcEager.rightArc(configuration.state, label);
+                    }
+                }
+                configuration.addScore(bestScore);
+                configuration.addAction(bestAction);
+            }
+                if (beam.size() == 0)
+                {
+                    System.out.println("WHY BEAM SIZE ZERO?");
+                }
+            }
+        }
+
+        Configuration bestConfiguration = null;
+        float bestScore = Float.NEGATIVE_INFINITY;
+        for (Configuration configuration : beam)
+        {
+            if (configuration.getScore(true) > bestScore)
+            {
+                bestScore = configuration.getScore(true);
+                bestConfiguration = configuration;
+            }
+        }
+        return new Pair<Configuration, Integer>(bestConfiguration, id);
+    }
+
+    public static void sortBeam(ArrayList<Configuration> beam, TreeSet<BeamElement> beamPreserver, Boolean isNonProjective, Instance instance, int beamWidth, boolean rootFirst, int featureLength, AveragedPerceptron classifier, Collection<Integer> dependencyRelations)
+    {
+        for (int b = 0; b < beam.size(); b++)
+        {
+            Configuration configuration = beam.get(b);
+            State currentState = configuration.state;
+            float prevScore = configuration.score;
+            boolean canShift = ArcEager.canDo(Action.Shift, currentState);
+            boolean canReduce = ArcEager.canDo(Action.Reduce, currentState);
+            boolean canRightArc = ArcEager.canDo(Action.RightArc, currentState);
+            boolean canLeftArc = ArcEager.canDo(Action.LeftArc, currentState);
+            Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength);
+            if (!canShift
+                    && !canReduce
+                    && !canRightArc
+                    && !canLeftArc)
+            {
+                beamPreserver.add(new BeamElement(prevScore, b, 4, -1));
+
+                if (beamPreserver.size() > beamWidth)
+                    beamPreserver.pollFirst();
+            }
+
+            if (canShift)
+            {
+                float score = classifier.shiftScore(features, true);
+                float addedScore = score + prevScore;
+                beamPreserver.add(new BeamElement(addedScore, b, 0, -1));
+
+                if (beamPreserver.size() > beamWidth)
+                    beamPreserver.pollFirst();
+            }
+
+            if (canReduce)
+            {
+                float score = classifier.reduceScore(features, true);
+                float addedScore = score + prevScore;
+                beamPreserver.add(new BeamElement(addedScore, b, 1, -1));
+
+                if (beamPreserver.size() > beamWidth)
+                    beamPreserver.pollFirst();
+            }
+
+            if (canRightArc)
+            {
+                float[] rightArcScores = classifier.rightArcScores(features, true);
+                for (int dependency : dependencyRelations)
+                {
+                    float score = rightArcScores[dependency];
+                    float addedScore = score + prevScore;
+                    beamPreserver.add(new BeamElement(addedScore, b, 2, dependency));
+
+                    if (beamPreserver.size() > beamWidth)
+                        beamPreserver.pollFirst();
+                }
+            }
+
+            if (canLeftArc)
+            {
+                float[] leftArcScores = classifier.leftArcScores(features, true);
+                for (int dependency : dependencyRelations)
+                {
+                    float score = leftArcScores[dependency];
+                    float addedScore = score + prevScore;
+                    beamPreserver.add(new BeamElement(addedScore, b, 3, dependency));
+
+                    if (beamPreserver.size() > beamWidth)
+                        beamPreserver.pollFirst();
+                }
+            }
+        }
+    }
+
+    public Configuration parsePartial() throws Exception
+    {
+        Configuration initialConfiguration = new Configuration(sentence, rootFirst);
+        boolean isNonProjective = false;
+        if (instance.isNonprojective())
+        {
+            isNonProjective = true;
+        }
+
+        ArrayList<Configuration> beam = new ArrayList<Configuration>(beamWidth);
+        beam.add(initialConfiguration);
+
+        while (!ArcEager.isTerminal(beam))
+        {
+            TreeSet<BeamElement> beamPreserver = new TreeSet<BeamElement>();
+
+            sortBeam(beam, beamPreserver, isNonProjective, instance, beamWidth, rootFirst, featureLength, classifier, dependencyRelations);
+
+            ArrayList<Configuration> repBeam = new ArrayList<Configuration>(beamWidth);
+            for (BeamElement beamElement : beamPreserver.descendingSet())
+            {
+                if (repBeam.size() >= beamWidth)
+                    break;
+                int b = beamElement.index;
+                int action = beamElement.action;
+                int label = beamElement.label;
+                float score = beamElement.score;
+
+                Configuration newConfig = beam.get(b).clone();
+
+                if (action == 0)
+                {
+                    ArcEager.shift(newConfig.state);
+                    newConfig.addAction(0);
+                }
+                else if (action == 1)
+                {
+                    ArcEager.reduce(newConfig.state);
+                    newConfig.addAction(1);
+                }
+ 
else if (action == 2) + { + ArcEager.rightArc(newConfig.state, label); + newConfig.addAction(3 + label); + } + else if (action == 3) + { + ArcEager.leftArc(newConfig.state, label); + newConfig.addAction(3 + dependencyRelations.size() + label); + } + else if (action == 4) + { + ArcEager.unShift(newConfig.state); + newConfig.addAction(2); + } + newConfig.setScore(score); + repBeam.add(newConfig); + } + beam = repBeam; + } + + Configuration bestConfiguration = null; + float bestScore = Float.NEGATIVE_INFINITY; + for (Configuration configuration : beam) + { + if (configuration.getScore(true) > bestScore) + { + bestScore = configuration.getScore(true); + bestConfiguration = configuration; + } + } + return bestConfiguration; + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/PartialTreeBeamScorerThread.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/PartialTreeBeamScorerThread.java new file mode 100644 index 000000000..90524e41a --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/PartialTreeBeamScorerThread.java @@ -0,0 +1,153 @@ +/** + * Copyright 2014, Yahoo! Inc. and Mohammad Sadegh Rasooli + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. 
+ */
+
+package com.hankcs.hanlp.dependency.perceptron.transition.parser;
+
+import com.hankcs.hanlp.dependency.perceptron.transition.features.FeatureExtractor;
+import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.BeamElement;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Instance;
+import com.hankcs.hanlp.dependency.perceptron.transition.configuration.State;
+
+import java.util.ArrayList;
+import java.util.concurrent.Callable;
+
+
+public class PartialTreeBeamScorerThread implements Callable<ArrayList<BeamElement>>
+{
+
+    boolean isDecode;
+    AveragedPerceptron classifier;
+    Configuration configuration;
+    Instance instance;
+    ArrayList<Integer> dependencyRelations;
+    int featureLength;
+    int b;
+
+    public PartialTreeBeamScorerThread(boolean isDecode, AveragedPerceptron classifier, Instance instance, Configuration configuration, ArrayList<Integer> dependencyRelations, int featureLength, int b)
+    {
+        this.isDecode = isDecode;
+        this.classifier = classifier;
+        this.configuration = configuration;
+        this.instance = instance;
+        this.dependencyRelations = dependencyRelations;
+        this.featureLength = featureLength;
+        this.b = b;
+    }
+
+
+    public ArrayList<BeamElement> call() throws Exception
+    {
+        ArrayList<BeamElement> elements = new ArrayList<BeamElement>(dependencyRelations.size() * 2 + 3);
+
+        boolean isNonProjective = false;
+        if (instance.isNonprojective())
+        {
+            isNonProjective = true;
+        }
+
+        State currentState = configuration.state;
+        float prevScore = configuration.score;
+
+        boolean canShift = ArcEager.canDo(Action.Shift, currentState);
+        boolean canReduce = ArcEager.canDo(Action.Reduce, currentState);
+        boolean canRightArc = ArcEager.canDo(Action.RightArc, currentState);
+        boolean canLeftArc = ArcEager.canDo(Action.LeftArc, currentState);
+        Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength);
+
+ 
if (canShift) + { + if (isNonProjective || instance.actionCost(Action.Shift, -1, currentState) == 0) + { + float score = classifier.shiftScore(features, isDecode); + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 0, -1)); + } + } + if (canReduce) + { + if (isNonProjective || instance.actionCost(Action.Reduce, -1, currentState) == 0) + { + float score = classifier.reduceScore(features, isDecode); + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 1, -1)); + } + + } + + if (canRightArc) + { + float[] rightArcScores = classifier.rightArcScores(features, isDecode); + for (int dependency : dependencyRelations) + { + if (isNonProjective || instance.actionCost(Action.RightArc, dependency, currentState) == 0) + { + float score = rightArcScores[dependency]; + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 2, dependency)); + } + } + } + if (canLeftArc) + { + float[] leftArcScores = classifier.leftArcScores(features, isDecode); + for (int dependency : dependencyRelations) + { + if (isNonProjective || instance.actionCost(Action.LeftArc, dependency, currentState) == 0) + { + float score = leftArcScores[dependency]; + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 3, dependency)); + + } + } + } + + if (elements.size() == 0) + { + addAvailableBeamElements(elements, prevScore, canShift, canReduce, canRightArc, canLeftArc, features, classifier, isDecode, b, dependencyRelations); + } + + return elements; + } + + public static void addAvailableBeamElements(ArrayList elements, float prevScore, boolean canShift, boolean canReduce, boolean canRightArc, boolean canLeftArc, Object[] features, AveragedPerceptron classifier, boolean isDecode, int b, ArrayList dependencyRelations) + { + if (canShift) + { + float score = classifier.shiftScore(features, isDecode); + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 
0, -1)); + } + if (canReduce) + { + float score = classifier.reduceScore(features, isDecode); + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 1, -1)); + } + + if (canRightArc) + { + float[] rightArcScores = classifier.rightArcScores(features, isDecode); + for (int dependency : dependencyRelations) + { + float score = rightArcScores[dependency]; + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 2, dependency)); + } + } + if (canLeftArc) + { + float[] leftArcScores = classifier.leftArcScores(features, isDecode); + for (int dependency : dependencyRelations) + { + float score = leftArcScores[dependency]; + float addedScore = score + prevScore; + elements.add(new BeamElement(addedScore, b, 3, dependency)); + } + } + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/TransitionBasedParser.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/TransitionBasedParser.java new file mode 100644 index 000000000..fb1af0bd4 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/parser/TransitionBasedParser.java @@ -0,0 +1,39 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. 
+ */ + +package com.hankcs.hanlp.dependency.perceptron.transition.parser; + +import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron; +import com.hankcs.hanlp.dependency.perceptron.structures.IndexMaps; + +import java.util.ArrayList; + +/** + * This class is just for making connection between different types of transition-based parsers + */ +public abstract class TransitionBasedParser +{ + + /** + * Any kind of classifier that can give us scores + */ + protected AveragedPerceptron classifier; + protected ArrayList dependencyRelations; + protected int featureLength; + protected IndexMaps maps; + + public TransitionBasedParser(AveragedPerceptron classifier, ArrayList dependencyRelations, int featureLength, IndexMaps maps) + { + this.classifier = classifier; + this.dependencyRelations = dependencyRelations; + this.featureLength = featureLength; + this.maps = maps; + } + + public String idWord(int id) + { + return maps.idWord[id]; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/trainer/ArcEagerBeamTrainer.java b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/trainer/ArcEagerBeamTrainer.java new file mode 100644 index 000000000..7c25d769b --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dependency/perceptron/transition/trainer/ArcEagerBeamTrainer.java @@ -0,0 +1,755 @@ +/** + * Copyright 2014, Yahoo! Inc. + * Licensed under the terms of the Apache License 2.0. See LICENSE file at the project root for terms. 
+ */ + +package com.hankcs.hanlp.dependency.perceptron.transition.trainer; + +import com.hankcs.hanlp.classification.utilities.io.ConsoleLogger; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dependency.perceptron.accessories.Edge; +import com.hankcs.hanlp.dependency.perceptron.accessories.Evaluator; +import com.hankcs.hanlp.dependency.perceptron.accessories.Pair; +import com.hankcs.hanlp.dependency.perceptron.learning.AveragedPerceptron; +import com.hankcs.hanlp.dependency.perceptron.structures.IndexMaps; +import com.hankcs.hanlp.dependency.perceptron.structures.ParserModel; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.BeamElement; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Instance; +import com.hankcs.hanlp.dependency.perceptron.transition.features.FeatureExtractor; +import com.hankcs.hanlp.dependency.perceptron.transition.parser.*; +import com.hankcs.hanlp.dependency.perceptron.accessories.Options; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.Configuration; +import com.hankcs.hanlp.dependency.perceptron.transition.configuration.State; +import com.hankcs.hanlp.model.perceptron.feature.FeatureSortItem; +import com.hankcs.hanlp.utility.MathUtility; + +import java.io.IOException; +import java.text.DecimalFormat; +import java.util.*; +import java.util.concurrent.*; + +import static com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; + +public class ArcEagerBeamTrainer extends TransitionBasedParser +{ + Options options; + /** + * Can be either "early" or "max_violation" + * For more information read: + * Liang Huang, Suphan Fayong and Yang Guo. "Structured perceptron with inexact search." + * In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, + * pp. 142-151. Association for Computational Linguistics, 2012. 
+ */ + private String updateMode; + private Random randGen; + + public ArcEagerBeamTrainer(String updateMode, AveragedPerceptron classifier, Options options, + ArrayList<Integer> dependencyRelations, int featureLength, IndexMaps maps) + { + super(classifier, dependencyRelations, featureLength, maps); + this.updateMode = updateMode; + this.options = options; + randGen = new Random(); + } + + public void train(ArrayList<Instance> trainData, String devPath, int maxIteration, String modelPath, boolean lowerCased, HashSet<String> punctuations, int partialTreeIter) throws IOException, ExecutionException, InterruptedException + { + /** + * Actions: 0=shift, 1=reduce, 2=unshift, ra_dep=3+dep, la_dep=3+dependencyRelations.size()+dep + */ + ExecutorService executor = Executors.newFixedThreadPool(options.numOfThreads); + CompletionService<ArrayList<BeamElement>> pool = new ExecutorCompletionService<ArrayList<BeamElement>>(executor); + + double bestUAS = -1.; + for (int i = 1; i <= maxIteration; i++) + { + long start = System.currentTimeMillis(); + + int dataCount = 0; + + int logEvery = (int) Math.ceil(trainData.size() / 10000f); + for (Instance instance : trainData) + { + dataCount++; + if (dataCount % logEvery == 0 || dataCount == trainData.size()) + { + System.out.printf("\r迭代 " + i + "/" + maxIteration + " %.2f%% ", MathUtility.percentage(dataCount, trainData.size())); + } + trainOnOneSample(instance, partialTreeIter, i, dataCount, pool); + + classifier.incrementIteration(); + } +// System.out.print("\n"); + long end = System.currentTimeMillis(); + long timeSec = (end - start) / 1000; + System.out.print(" 耗时 " + timeSec + " 秒。"); + +// System.out.print("saving the model..."); + ParserModel parserModel = new ParserModel(classifier, maps, dependencyRelations, options); +// infStruct.saveModel(modelPath + "_iter" + i); + +// System.out.println("done\n"); + + if (!devPath.equals("")) + { + AveragedPerceptron averagedPerceptron = new AveragedPerceptron(parserModel); + +// int raSize = averagedPerceptron.raSize(); +// int effectiveRaSize = 
averagedPerceptron.effectiveRaSize(); +// float raRatio = 100.0f * effectiveRaSize / raSize; +// +// int laSize = averagedPerceptron.laSize(); +// int effectiveLaSize = averagedPerceptron.effectiveLaSize(); +// float laRatio = 100.0f * effectiveLaSize / laSize; + +// DecimalFormat format = new DecimalFormat("##.00"); +// System.out.println("size of RA features in memory:" + effectiveRaSize + "/" + raSize + "=" + format.format(raRatio) + "%"); +// System.out.println("size of LA features in memory:" + effectiveLaSize + "/" + laSize + "=" + format.format(laRatio) + "%"); + KBeamArcEagerParser parser = new KBeamArcEagerParser(averagedPerceptron, dependencyRelations, featureLength, maps, options.numOfThreads, options); + + String outputFile = modelPath + ".__tmp__"; + parser.parseConllFile(devPath, outputFile, + options.rootFirst, options.beamWidth, true, lowerCased, options.numOfThreads, false, ""); + double[] score = Evaluator.evaluate(devPath, outputFile, punctuations); + System.out.printf("UAS=%.2f LAS=%.2f", score[0], score[1]); + IOUtil.deleteFile(outputFile); + parser.shutDownLiveThreads(); + if (score[0] > bestUAS) + { + bestUAS = score[0]; + System.out.println(" 最高分!保存中..."); + parserModel.saveModel(modelPath); + } + else + { + System.out.println(); + } + } + else + { + parserModel.saveModel(modelPath); + System.out.println(); + } + } + boolean isTerminated = executor.isTerminated(); + while (!isTerminated) + { + executor.shutdownNow(); + isTerminated = executor.isTerminated(); + } + } + + /** + * Online learning on one training sample + * + * @param instance the training instance + * @param partialTreeIter number of iterations reserved for partially annotated trees + * @param i current iteration + * @param dataCount index of the current sample (for logging) + * @param pool thread pool used for beam scoring + * @throws InterruptedException + * @throws ExecutionException + */ + private void trainOnOneSample(Instance instance, int partialTreeIter, int i, int dataCount, CompletionService<ArrayList<BeamElement>> pool) throws InterruptedException, ExecutionException + { + boolean isPartial = instance.isPartial(options.rootFirst); + + if (partialTreeIter > i && isPartial) + return; + + Configuration 
initialConfiguration = new Configuration(instance.getSentence(), options.rootFirst); + Configuration firstOracle = initialConfiguration.clone(); + ArrayList<Configuration> beam = new ArrayList<Configuration>(options.beamWidth); + beam.add(initialConfiguration); + + /** + * The float is the oracle's cost + * For more information see: + * Yoav Goldberg and Joakim Nivre. "Training Deterministic Parsers with Non-Deterministic Oracles." + * TACL 1 (2013): 403-414. + * for the time being we just use zero-cost oracles + */ + Collection<Configuration> oracles = new HashSet<Configuration>(); + + oracles.add(firstOracle); + + /** + * For keeping track of the violations + * For more information see: + * Liang Huang, Suphan Fayong and Yang Guo. "Structured perceptron with inexact search." + * In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, + * pp. 142-151. Association for Computational Linguistics, 2012. + */ + float maxViol = Float.NEGATIVE_INFINITY; + Pair<Configuration, Configuration> maxViolPair = null; + + Configuration bestScoringOracle = null; + boolean oracleInBeam = false; + + while (!ArcEager.isTerminal(beam) && beam.size() > 0) + { + /** + * Generate new oracles; oracles already in a terminal state are kept as they are + */ + Collection<Configuration> newOracles = new HashSet<Configuration>(); + + if (options.useDynamicOracle || isPartial) + { + bestScoringOracle = zeroCostDynamicOracle(instance, oracles, newOracles); + } + else + { + bestScoringOracle = staticOracle(instance, oracles, newOracles); + } + // try to explore non-optimal transitions + + if (newOracles.size() == 0) + { +// System.err.println("...no oracle(" + dataCount + ")..."); + bestScoringOracle = staticOracle(instance, oracles, newOracles); + } + oracles = newOracles; + + TreeSet<BeamElement> beamPreserver = new TreeSet<BeamElement>(); + + if (options.numOfThreads == 1 || beam.size() == 1) + { + beamSortOneThread(beam, beamPreserver); + } + else + { + for (int b = 0; b < beam.size(); b++) + { + pool.submit(new BeamScorerThread(false, classifier, 
beam.get(b), + dependencyRelations, featureLength, b, options.rootFirst)); + } + for (int b = 0; b < beam.size(); b++) + { + for (BeamElement element : pool.take().get()) + { + beamPreserver.add(element); + if (beamPreserver.size() > options.beamWidth) + beamPreserver.pollFirst(); + } + } + } + + if (beamPreserver.size() == 0 || beam.size() == 0) + { + break; + } + else + { + oracleInBeam = false; + + ArrayList<Configuration> repBeam = new ArrayList<Configuration>(options.beamWidth); + for (BeamElement beamElement : beamPreserver.descendingSet()) + { +// if (repBeam.size() >= options.beamWidth) // keep only beamWidth configurations (this check is redundant) +// break; + int b = beamElement.index; + int action = beamElement.action; + int label = beamElement.label; + float score = beamElement.score; + + Configuration newConfig = beam.get(b).clone(); + + ArcEager.commitAction(action, label, score, dependencyRelations, newConfig); + repBeam.add(newConfig); + + if (!oracleInBeam && oracles.contains(newConfig)) + oracleInBeam = true; + } + beam = repBeam; + + if (beam.size() > 0 && oracles.size() > 0) + { + Configuration bestConfig = beam.get(0); + if (oracles.contains(bestConfig)) // the configuration the model scores highest is zero-cost + { + oracles = new HashSet<Configuration>(); + oracles.add(bestConfig); + } + else // otherwise + { + if (options.useRandomOracleSelection) // randomly pick one oracle + { // choosing randomly, otherwise using latent structured Perceptron + List<Configuration> keys = new ArrayList<Configuration>(oracles); + Configuration randomKey = keys.get(randGen.nextInt(keys.size())); + oracles = new HashSet<Configuration>(); + oracles.add(randomKey); + bestScoringOracle = randomKey; + } + else // pick the oracle the model scores highest + { + oracles = new HashSet<Configuration>(); + oracles.add(bestScoringOracle); + } + } + + // do early update + if (!oracleInBeam && updateMode.equals("early")) + break; + + // keep violations + if (!oracleInBeam && updateMode.equals("max_violation")) + { + float violation = bestConfig.getScore(true) - bestScoringOracle.getScore(true);//Math.abs(beam.get(0).getScore(true) - 
bestScoringOracle.getScore(true)); + if (violation > maxViol) + { + maxViol = violation; + maxViolPair = new Pair<Configuration, Configuration>(bestConfig, bestScoringOracle); + } + } + } + else + break; + } + } + + // updating weights + if (!oracleInBeam || + !bestScoringOracle.equals(beam.get(0)) // the oracle is in the beam, but at the final step it is not the highest-scoring configuration + ) + { + updateWeights(initialConfiguration, maxViol, isPartial, bestScoringOracle, maxViolPair, beam); + } + } + + private Configuration staticOracle(Instance instance, Collection<Configuration> oracles, Collection<Configuration> newOracles) + { + Configuration bestScoringOracle = null; + int top = -1; + int first = -1; + HashMap<Integer, Edge> goldDependencies = instance.getGoldDependencies(); + HashMap<Integer, ArrayList<Integer>> reversedDependencies = instance.getReversedDependencies(); + + for (Configuration configuration : oracles) + { + State state = configuration.state; + Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength); + + if (!state.stackEmpty()) + top = state.stackTop(); + if (!state.bufferEmpty()) + first = state.bufferHead(); + + if (!configuration.state.isTerminalState()) + { + Configuration newConfig = configuration.clone(); + + if (first > 0 && goldDependencies.containsKey(first) && goldDependencies.get(first).headIndex == top) + { + int dependency = goldDependencies.get(first).relationId; + float[] scores = classifier.rightArcScores(features, false); + float score = scores[dependency]; + ArcEager.rightArc(newConfig.state, dependency); + newConfig.addAction(3 + dependency); + newConfig.addScore(score); + } + else if (top > 0 && goldDependencies.containsKey(top) && goldDependencies.get(top).headIndex == first) + { + int dependency = goldDependencies.get(top).relationId; + float[] scores = classifier.leftArcScores(features, false); + float score = scores[dependency]; + ArcEager.leftArc(newConfig.state, dependency); + newConfig.addAction(3 + dependencyRelations.size() + dependency); + newConfig.addScore(score); + } + else if (top >= 0 && state.hasHead(top)) + { + + if 
(reversedDependencies.containsKey(top)) + { + if (reversedDependencies.get(top).size() == state.valence(top)) + { + float score = classifier.reduceScore(features, false); + ArcEager.reduce(newConfig.state); + newConfig.addAction(1); + newConfig.addScore(score); + } + else + { + float score = classifier.shiftScore(features, false); + ArcEager.shift(newConfig.state); + newConfig.addAction(0); + newConfig.addScore(score); + } + } + else + { + float score = classifier.reduceScore(features, false); + ArcEager.reduce(newConfig.state); + newConfig.addAction(1); + newConfig.addScore(score); + } + + } + else if (state.bufferEmpty() && state.stackSize() == 1 && state.stackTop() == state.rootIndex) + { + float score = classifier.reduceScore(features, false); + ArcEager.reduce(newConfig.state); + newConfig.addAction(1); + newConfig.addScore(score); + } + else + { + float score = classifier.shiftScore(features, false); + ArcEager.shift(newConfig.state); + newConfig.addAction(0); + newConfig.addScore(score); + } + bestScoringOracle = newConfig; + newOracles.add(newConfig); + } + else + { + newOracles.add(configuration); + } + } + return bestScoringOracle; + } + + /** + * Get the zero-cost oracles + * + * @param instance the training instance + * @param oracles the current oracles + * @param newOracles collects the newly generated oracles + * @return the oracle that the model scores highest + */ + private Configuration zeroCostDynamicOracle(Instance instance, Collection<Configuration> oracles, Collection<Configuration> newOracles) + { + float bestScore = Float.NEGATIVE_INFINITY; + Configuration bestScoringOracle = null; + + for (Configuration configuration : oracles) + { + if (!configuration.state.isTerminalState()) + { + State currentState = configuration.state; + Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength); + // only zero-cost actions are considered here + if (instance.actionCost(Action.Shift, -1, currentState) == 0) + { + Configuration newConfig = configuration.clone(); + float score = 
classifier.shiftScore(features, false); + ArcEager.shift(newConfig.state); + newConfig.addAction(0); + newConfig.addScore(score); + newOracles.add(newConfig); + + if (newConfig.getScore(true) > bestScore) + { + bestScore = newConfig.getScore(true); + bestScoringOracle = newConfig; + } + } + if (ArcEager.canDo(Action.RightArc, currentState)) + { + float[] rightArcScores = classifier.rightArcScores(features, false); + for (int dependency : dependencyRelations) + { + if (instance.actionCost(Action.RightArc, dependency, currentState) == 0) + { + Configuration newConfig = configuration.clone(); + float score = rightArcScores[dependency]; + ArcEager.rightArc(newConfig.state, dependency); + newConfig.addAction(3 + dependency); + newConfig.addScore(score); + newOracles.add(newConfig); + + if (newConfig.getScore(true) > bestScore) + { + bestScore = newConfig.getScore(true); + bestScoringOracle = newConfig; + } + } + } + } + if (ArcEager.canDo(Action.LeftArc, currentState)) + { + float[] leftArcScores = classifier.leftArcScores(features, false); + + for (int dependency : dependencyRelations) + { + if (instance.actionCost(Action.LeftArc, dependency, currentState) == 0) + { + Configuration newConfig = configuration.clone(); + float score = leftArcScores[dependency]; + ArcEager.leftArc(newConfig.state, dependency); + newConfig.addAction(3 + dependencyRelations.size() + dependency); + newConfig.addScore(score); + newOracles.add(newConfig); + + if (newConfig.getScore(true) > bestScore) + { + bestScore = newConfig.getScore(true); + bestScoringOracle = newConfig; + } + } + } + } + if (instance.actionCost(Action.Reduce, -1, currentState) == 0) + { + Configuration newConfig = configuration.clone(); + float score = classifier.reduceScore(features, false); + ArcEager.reduce(newConfig.state); + newConfig.addAction(1); + newConfig.addScore(score); + newOracles.add(newConfig); + + if (newConfig.getScore(true) > bestScore) + { + bestScore = newConfig.getScore(true); + bestScoringOracle = 
newConfig; + } + } + } + else + { + newOracles.add(configuration); + } + } + + return bestScoringOracle; + } + + /** + * Apply every possible action once to each beam element + * + * @param beam the current beam + * @param beamPreserver collects the top-scoring extensions + */ + private void beamSortOneThread(ArrayList<Configuration> beam, TreeSet<BeamElement> beamPreserver) + { + for (int b = 0; b < beam.size(); b++) + { + Configuration configuration = beam.get(b); + State currentState = configuration.state; + float prevScore = configuration.score; + boolean canShift = ArcEager.canDo(Action.Shift, currentState); + boolean canReduce = ArcEager.canDo(Action.Reduce, currentState); + boolean canRightArc = ArcEager.canDo(Action.RightArc, currentState); + boolean canLeftArc = ArcEager.canDo(Action.LeftArc, currentState); + Object[] features = FeatureExtractor.extractAllParseFeatures(configuration, featureLength); + + if (canShift) + { + float score = classifier.shiftScore(features, false); + float addedScore = score + prevScore; + addToBeam(beamPreserver, b, addedScore, 0, -1, options.beamWidth); + } + if (canReduce) + { + float score = classifier.reduceScore(features, false); + float addedScore = score + prevScore; + addToBeam(beamPreserver, b, addedScore, 1, -1, options.beamWidth); + } + + if (canRightArc) + { + float[] rightArcScores = classifier.rightArcScores(features, false); + for (int dependency : dependencyRelations) + { + float score = rightArcScores[dependency]; + float addedScore = score + prevScore; + addToBeam(beamPreserver, b, addedScore, 2, dependency, options.beamWidth); + } + } + if (canLeftArc) + { + float[] leftArcScores = classifier.leftArcScores(features, false); + for (int dependency : dependencyRelations) + { + float score = leftArcScores[dependency]; + float addedScore = score + prevScore; + addToBeam(beamPreserver, b, addedScore, 3, dependency, options.beamWidth); + } + } + } + } + + private void addToBeam(TreeSet<BeamElement> beamPreserver, int b, float addedScore, int action, int label, int beamWidth) + { + beamPreserver.add(new 
BeamElement(addedScore, b, action, label)); + + if (beamPreserver.size() > beamWidth) + beamPreserver.pollFirst(); + } + + private void updateWeights(Configuration initialConfiguration, float maxViol, boolean isPartial, Configuration bestScoringOracle, Pair<Configuration, Configuration> maxViolPair, ArrayList<Configuration> beam) + { + Configuration predicted; + Configuration finalOracle; + if (!updateMode.equals("max_violation")) + { + finalOracle = bestScoringOracle; + predicted = beam.get(0); + } + else + { + float violation = beam.get(0).getScore(true) - bestScoringOracle.getScore(true); //Math.abs(beam.get(0).getScore(true) - bestScoringOracle.getScore(true)); + if (violation > maxViol) + { + maxViolPair = new Pair<Configuration, Configuration>(beam.get(0), bestScoringOracle); + } + predicted = maxViolPair.first; + finalOracle = maxViolPair.second; + } + + Object[] predictedFeatures = new Object[featureLength]; + Object[] oracleFeatures = new Object[featureLength]; + for (int f = 0; f < predictedFeatures.length; f++) + { + oracleFeatures[f] = new HashMap<Pair<Integer, Object>, Float>(); + predictedFeatures[f] = new HashMap<Pair<Integer, Object>, Float>(); + } + + Configuration predictedConfiguration = initialConfiguration.clone(); + Configuration oracleConfiguration = initialConfiguration.clone(); + + for (int action : finalOracle.actionHistory) + { + boolean isTrueFeature = isTrueFeature(isPartial, oracleConfiguration, action); + + if (isTrueFeature) + { // if the made dependency is truly for the word + Object[] feats = FeatureExtractor.extractAllParseFeatures(oracleConfiguration, featureLength); + for (int f = 0; f < feats.length; f++) + { + Pair<Integer, Object> featName = new Pair<Integer, Object>(action, feats[f]); + HashMap<Pair<Integer, Object>, Float> map = (HashMap<Pair<Integer, Object>, Float>) oracleFeatures[f]; + Float value = map.get(featName); + if (value == null) + map.put(featName, 1.0f); + else + map.put(featName, value + 1); + } + } + + if (action == 0) + { + ArcEager.shift(oracleConfiguration.state); + } + else if (action == 1) + { + ArcEager.reduce(oracleConfiguration.state); + } + else if (action >= (3 + dependencyRelations.size())) 
+ { + int dependency = action - (3 + dependencyRelations.size()); + ArcEager.leftArc(oracleConfiguration.state, dependency); + } + else if (action >= 3) + { + int dependency = action - 3; + ArcEager.rightArc(oracleConfiguration.state, dependency); + } + } + + for (int action : predicted.actionHistory) + { + boolean isTrueFeature = isTrueFeature(isPartial, predictedConfiguration, action); + + if (isTrueFeature) + { // if the made dependency is truly for the word + Object[] feats = FeatureExtractor.extractAllParseFeatures(predictedConfiguration, featureLength); + if (action != 2) // do not take unshift into account + for (int f = 0; f < feats.length; f++) + { + Pair<Integer, Object> featName = new Pair<Integer, Object>(action, feats[f]); + HashMap<Pair<Integer, Object>, Float> map = (HashMap<Pair<Integer, Object>, Float>) predictedFeatures[f]; + Float value = map.get(featName); + if (value == null) + map.put(featName, 1.f); + else + map.put(featName, map.get(featName) + 1); + } + } + + State state = predictedConfiguration.state; + if (action == 0) + { + ArcEager.shift(state); + } + else if (action == 1) + { + ArcEager.reduce(state); + } + else if (action >= 3 + dependencyRelations.size()) + { + int dependency = action - (3 + dependencyRelations.size()); + ArcEager.leftArc(state, dependency); + } + else if (action >= 3) + { + int dependency = action - 3; + ArcEager.rightArc(state, dependency); + } + else if (action == 2) + { + ArcEager.unShift(state); + } + } + + for (int f = 0; f < predictedFeatures.length; f++) + { + HashMap<Pair<Integer, Object>, Float> map = (HashMap<Pair<Integer, Object>, Float>) predictedFeatures[f]; + HashMap<Pair<Integer, Object>, Float> map2 = (HashMap<Pair<Integer, Object>, Float>) oracleFeatures[f]; + for (Pair<Integer, Object> feat : map.keySet()) + { + int action = feat.first; + LabeledAction labeledAction = new LabeledAction(action, dependencyRelations.size()); + Action actionType = labeledAction.action; + int dependency = labeledAction.label; + + if (feat.second != null) + { + Object feature = feat.second; + if (!(map2.containsKey(feat) && map2.get(feat).equals(map.get(feat)))) + 
classifier.changeWeight(actionType, f, feature, dependency, -map.get(feat)); + } + } + + for (Pair<Integer, Object> feat : map2.keySet()) + { + int action = feat.first; + LabeledAction labeledAction = new LabeledAction(action, dependencyRelations.size()); + Action actionType = labeledAction.action; + int dependency = labeledAction.label; + + if (feat.second != null) + { + Object feature = feat.second; + if (!(map.containsKey(feat) && map.get(feat).equals(map2.get(feat)))) + classifier.changeWeight(actionType, f, feature, dependency, map2.get(feat)); + } + } + } + } + + private static boolean isTrueFeature(boolean isPartial, Configuration oracleConfiguration, int action) + { + boolean isTrueFeature = true; + if (isPartial && action >= 3) + { + if (!oracleConfiguration.state.hasHead(oracleConfiguration.state.stackTop()) || !oracleConfiguration.state.hasHead(oracleConfiguration.state.bufferHead())) + isTrueFeature = false; + } + else if (isPartial && action == 0) + { + if (!oracleConfiguration.state.hasHead(oracleConfiguration.state.bufferHead())) + isTrueFeature = false; + } + else if (isPartial && action == 1) + { + if (!oracleConfiguration.state.hasHead(oracleConfiguration.state.stackTop())) + isTrueFeature = false; + } + return isTrueFeature; + } + +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/dictionary/BiGramDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/BiGramDictionary.java deleted file mode 100644 index 9c1f9453e..000000000 --- a/src/main/java/com/hankcs/hanlp/dictionary/BiGramDictionary.java +++ /dev/null @@ -1,265 +0,0 @@ -/* - *

- * He Han - * hankcs.cn@gmail.com - * 2014/05/2014/5/16 20:55 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.dictionary; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; -import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; -import com.hankcs.hanlp.corpus.io.ByteArray; -import com.hankcs.hanlp.corpus.io.IOUtil; -import com.hankcs.hanlp.utility.Predefine; -import com.hankcs.hanlp.utility.TextUtility; - -import java.io.*; -import java.nio.ByteBuffer; -import java.nio.channels.FileChannel; -import java.util.*; - -import static com.hankcs.hanlp.utility.Predefine.logger; - -/** - * 2元语法词典 - * - * @deprecated 现在基于DoubleArrayTrie的BiGramDictionary已经由CoreBiGramTableDictionary替代,可以显著降低内存 - * @author hankcs - */ -public class BiGramDictionary -{ - static DoubleArrayTrie trie; - - public final static String path = HanLP.Config.BiGramDictionaryPath; - public static final int totalFrequency = 37545990; - - // 自动加载词典 - static - { - long start = System.currentTimeMillis(); - if (!load(path)) - { - throw new IllegalArgumentException("二元词典加载失败"); - } - else - { - logger.info(path + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); - } - } - - public static boolean load(String path) - { - logger.info("二元词典开始加载:" + path); - trie = new DoubleArrayTrie(); - boolean create = !loadDat(path); - if (!create) return true; - TreeMap map = new TreeMap(); - BufferedReader br; - try - { - br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); - String line; - while ((line = br.readLine()) != null) - { - String[] params = line.split("\\s"); - String twoWord = params[0]; - int freq = Integer.parseInt(params[1]); - map.put(twoWord, freq); - } - br.close(); - logger.info("二元词典读取完毕:" + path + 
",开始构建双数组Trie树(DoubleArrayTrie)……"); - } - catch (FileNotFoundException e) - { - logger.severe("二元词典" + path + "不存在!" + e); - return false; - } - catch (IOException e) - { - logger.severe("二元词典" + path + "读取错误!" + e); - return false; - } - - int resultCode = trie.build(map); - logger.info("二元词典DAT构建结果:{}" + resultCode); -// reSaveDictionary(map, path); - logger.info("二元词典加载成功:" + trie.size() + "个词条"); - if (create) - { - try - { - DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT))); - Collection freqList = map.values(); - out.writeInt(freqList.size()); - for (int freq : freqList) - { - out.writeInt(freq); - } - trie.save(out); - out.close(); - } - catch (Exception e) - { - logger.warning("在缓存" + path + Predefine.BIN_EXT + "时发生异常" + TextUtility.exceptionToString(e)); - return false; - } - } - return true; - } - - /** - * 从dat文件中加载排好的trie - * - * @param path - * @return - */ - private static boolean loadDat(String path) - { - try - { - ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT); - if (byteArray == null) return false; - - int size = byteArray.nextInt(); - Integer[] value = new Integer[size]; - for (int i = 0; i < size; i++) - { - value[i] = byteArray.nextInt(); - } - if (!trie.load(byteArray, value)) return false; - } - catch (Exception e) - { - return false; - } - - return true; - } - - /** - * 找寻特殊字串,如未##串 - * - * @return 一个包含特殊词串的set - * @deprecated 没事就不要用了 - */ - public static Set _findSpecialString() - { - Set stringSet = new HashSet(); - BufferedReader br; - try - { - br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); - String line; - while ((line = br.readLine()) != null) - { - String[] params = line.split("\t"); - String twoWord = params[0]; - params = twoWord.split("@"); - for (String w : params) - { - if (w.contains("##")) - { - stringSet.add(w); - } - } - } - br.close(); - } - catch (FileNotFoundException e) - { - 
e.printStackTrace(); - } - catch (IOException e) - { - e.printStackTrace(); - } - - return stringSet; - } - - /** - * 获取共现频次 - * - * @param from 第一个词 - * @param to 第二个词 - * @return 第一个词@第二个词出现的频次 - */ - public static int getBiFrequency(String from, String to) - { - return getBiFrequency(from + '@' + to); - } - - /** - * 获取共现频次 - * - * @param twoWord 用@隔开的两个词 - * @return 共现频次 - */ - public static int getBiFrequency(String twoWord) - { - Integer result = trie.get(twoWord); - return (result == null ? 0 : result); - } - - /** - * 将NGram词典重新写回去 - * - * @param map - * @param path - * @return - */ - private static boolean reSaveDictionary(TreeMap map, String path) - { - StringBuilder sbOut = new StringBuilder(); - for (Map.Entry entry : map.entrySet()) - { - sbOut.append(entry.getKey()); - sbOut.append(' '); - sbOut.append(entry.getValue()); - sbOut.append('\n'); - } - - return IOUtil.saveTxt(path, sbOut.toString()); - } - - /** - * 接受键数组与值数组,排序以供建立trie树 - * - * @param wordList - * @param freqList - */ - private static void sortListForBuildTrie(List wordList, List freqList, String path) - { - BinTrie binTrie = new BinTrie(); - for (int i = 0; i < wordList.size(); ++i) - { - binTrie.put(wordList.get(i), freqList.get(i)); - } - Collections.sort(wordList); - try - { - BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path))); -// BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path + "_sort.txt"))); - for (String w : wordList) - { - bw.write(w + '\t' + binTrie.get(w)); - bw.newLine(); - } - bw.close(); - } - catch (FileNotFoundException e) - { - e.printStackTrace(); - } - catch (IOException e) - { - e.printStackTrace(); - } - } -} diff --git a/src/main/java/com/hankcs/hanlp/dictionary/CoreBiGramMixDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/CoreBiGramMixDictionary.java deleted file mode 100644 index 17b81d992..000000000 --- 
a/src/main/java/com/hankcs/hanlp/dictionary/CoreBiGramMixDictionary.java +++ /dev/null @@ -1,225 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/12/24 12:46 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.dictionary; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; -import com.hankcs.hanlp.corpus.io.ByteArray; -import com.hankcs.hanlp.corpus.io.IOUtil; -import com.hankcs.hanlp.seg.common.Vertex; -import com.hankcs.hanlp.utility.ByteUtil; -import com.hankcs.hanlp.utility.Predefine; - -import java.io.*; -import java.util.Collection; -import java.util.TreeMap; -import java.util.logging.Level; - -import static com.hankcs.hanlp.utility.Predefine.logger; - -/** - * 核心词典的二元接续词典,混合采用词ID和词本身储存 - * - * @author hankcs - */ -public class CoreBiGramMixDictionary -{ - static DoubleArrayTrie trie; - public final static String path = HanLP.Config.BiGramDictionaryPath; - final static String datPath = HanLP.Config.BiGramDictionaryPath + ".mix" + Predefine.BIN_EXT; - - static - { - logger.info("开始加载二元词典" + path + ".mix"); - long start = System.currentTimeMillis(); - if (!load(path)) - { - throw new IllegalArgumentException("二元词典加载失败"); - } - else - { - logger.info(path + ".mix" + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); - } - } - - static boolean load(String path) - { - trie = new DoubleArrayTrie(); - if (loadDat(datPath)) return true; - TreeMap map = new TreeMap(); - BufferedReader br; - try - { - br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); - String line; - StringBuilder sb = new StringBuilder(); - while ((line = br.readLine()) != null) - { - String[] params = line.split("\\s"); - String[] twoWord = params[0].split("@", 2); - buildID(twoWord[0], sb); - sb.append('@'); - 
buildID(twoWord[1], sb); - int freq = Integer.parseInt(params[1]); - map.put(sb.toString(), freq); - sb.setLength(0); - } - br.close(); - logger.info("二元词典读取完毕:" + path + ",开始构建双数组Trie树(DoubleArrayTrie)……"); - trie.build(map); - } - catch (FileNotFoundException e) - { - logger.severe("二元词典" + path + "不存在!" + e); - return false; - } - catch (IOException e) - { - logger.severe("二元词典" + path + "读取错误!" + e); - return false; - } - logger.info("开始缓存二元词典到" + datPath); - if (!saveDat(datPath, map)) - { - logger.warning("缓存二元词典到" + datPath + "失败"); - } - return true; - } - - static boolean saveDat(String path, TreeMap map) - { - try - { - DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path))); - Collection freqList = map.values(); - out.writeInt(freqList.size()); - for (int freq : freqList) - { - out.writeInt(freq); - } - trie.save(out); - out.close(); - } - catch (Exception e) - { - logger.log(Level.WARNING, "在缓存" + path + "时发生异常", e); - return false; - } - - return true; - } - - static boolean loadDat(String path) - { - try - { - ByteArray byteArray = ByteArray.createByteArray(path); - if (byteArray == null) return false; - - int size = byteArray.nextInt(); - Integer[] value = new Integer[size]; - for (int i = 0; i < size; i++) - { - value[i] = byteArray.nextInt(); - } - if (!trie.load(byteArray, value)) return false; - } - catch (Exception e) - { - return false; - } - - return true; - } - - /** - * 二分搜索 - * - * @param a - * @param key - * @return - */ - static int binarySearch(int[][] a, int key) - { - int low = 0; - int high = a.length - 1; - - while (low <= high) - { - int mid = (low + high) >>> 1; - int midVal = a[mid][0]; - - if (midVal < key) - low = mid + 1; - else if (midVal > key) - high = mid - 1; - else - return mid; // key found - } - return -(low + 1); // key not found. 
- } - - - -// public static int getBiFrequency(Vertex from, Vertex to) -// { -// StringBuilder key = new StringBuilder(); -// int idA = from.wordID; -// if (idA == -1) -// { -// key.append(from.word); -// } -// else -// { -// key.append(ByteUtil.convertIntToTwoChar(idA)); -// } -// key.append('@'); -// int idB = to.wordID; -// if (idB == -1) -// { -// key.append(to.word); -// } -// else -// { -// key.append(ByteUtil.convertIntToTwoChar(idB)); -// } -// -// Integer freq = trie.get(key.toString()); -// if (freq == null) return 0; -// return freq; -// } - - static void buildID(String word, StringBuilder sbStorage) - { - int id = CoreDictionary.trie.exactMatchSearch(word); - if (id == -1) - { - sbStorage.append(word); - } - else - { - char[] twoChar = ByteUtil.convertIntToTwoChar(id); - sbStorage.append(twoChar); - } - } - - /** - * 获取词语的ID - * - * @param a - * @return - */ - public static int getWordID(String a) - { - return CoreDictionary.trie.exactMatchSearch(a); - } -} diff --git a/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionary.java index dc23fc191..37f4cb011 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionary.java @@ -16,10 +16,8 @@ import com.hankcs.hanlp.corpus.io.ByteArray; import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.corpus.tag.Nature; -import com.hankcs.hanlp.utility.LexiconUtility; import com.hankcs.hanlp.utility.Predefine; import com.hankcs.hanlp.utility.TextUtility; - import java.io.*; import java.util.*; @@ -27,13 +25,13 @@ /** * 使用DoubleArrayTrie实现的核心词典 + * * @author hankcs */ public class CoreDictionary { public static DoubleArrayTrie trie = new DoubleArrayTrie(); public final static String path = HanLP.Config.CoreDictionaryPath; - public static final int totalFrequency = 221894; // 自动加载词典 static @@ -50,6 +48,8 @@ public class CoreDictionary } // 一些特殊的WORD_ID + public static 
final int BEGIN_WORD_ID = getWordID(Predefine.TAG_BIGIN); + public static final int END_WORD_ID = getWordID(Predefine.TAG_END); public static final int NR_WORD_ID = getWordID(Predefine.TAG_PEOPLE); public static final int NS_WORD_ID = getWordID(Predefine.TAG_PLACE); public static final int NT_WORD_ID = getWordID(Predefine.TAG_GROUP); @@ -68,7 +68,7 @@ private static boolean load(String path) { br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); String line; - int MAX_FREQUENCY = 0; + int totalFrequency = 0; long start = System.currentTimeMillis(); while ((line = br.readLine()) != null) { @@ -77,20 +77,20 @@ private static boolean load(String path) CoreDictionary.Attribute attribute = new CoreDictionary.Attribute(natureCount); for (int i = 0; i < natureCount; ++i) { - attribute.nature[i] = Enum.valueOf(Nature.class, param[1 + 2 * i]); + attribute.nature[i] = Nature.create(param[1 + 2 * i]); attribute.frequency[i] = Integer.parseInt(param[2 + 2 * i]); attribute.totalFrequency += attribute.frequency[i]; } map.put(param[0], attribute); - MAX_FREQUENCY += attribute.totalFrequency; + totalFrequency += attribute.totalFrequency; } - logger.info("核心词典读入词条" + map.size() + " 全部频次" + MAX_FREQUENCY + ",耗时" + (System.currentTimeMillis() - start) + "ms"); + logger.info("核心词典读入词条" + map.size() + " 全部频次" + totalFrequency + ",耗时" + (System.currentTimeMillis() - start) + "ms"); br.close(); trie.build(map); logger.info("核心词典加载成功:" + trie.size() + "个词条,下面将写入缓存……"); try { - DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT)); + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT))); Collection attributeList = map.values(); out.writeInt(attributeList.size()); for (CoreDictionary.Attribute attribute : attributeList) @@ -104,6 +104,8 @@ private static boolean load(String path) } } trie.save(out); + out.writeInt(totalFrequency); + 
Predefine.setTotalFrequency(totalFrequency); out.close(); } catch (Exception e) @@ -154,7 +156,20 @@ static boolean loadDat(String path) attributes[i].frequency[j] = byteArray.nextInt(); } } - if (!trie.load(byteArray, attributes) || byteArray.hasMore()) return false; + if (!trie.load(byteArray, attributes)) return false; + int totalFrequency = 0; + if (byteArray.hasMore()) // 自从1.8.2起,ngram模型最后一个整型为总词频 + { + totalFrequency = byteArray.nextInt(); + } + else + { + for (Attribute attribute : attributes) + { + totalFrequency += attribute.totalFrequency; + } + } + Predefine.setTotalFrequency(totalFrequency); } catch (Exception e) { @@ -166,6 +181,7 @@ static boolean loadDat(String path) /** * 获取条目 + * * @param key * @return */ @@ -176,6 +192,7 @@ public static Attribute get(String key) /** * 获取条目 + * * @param wordID * @return */ @@ -199,6 +216,7 @@ public static int getTermFrequency(String term) /** * 是否包含词语 + * * @param key * @return */ @@ -269,11 +287,15 @@ public static Attribute create(String natureWithFrequency) try { String param[] = natureWithFrequency.split(" "); + if (param.length % 2 != 0) + { + return new Attribute(Nature.create(natureWithFrequency.trim()), 1); // 儿童锁 + } int natureCount = param.length / 2; Attribute attribute = new Attribute(natureCount); for (int i = 0; i < natureCount; ++i) { - attribute.nature[i] = LexiconUtility.convertStringToNature(param[2 * i], null); + attribute.nature[i] = Nature.create(param[2 * i]); attribute.frequency[i] = Integer.parseInt(param[1 + 2 * i]); attribute.totalFrequency += attribute.frequency[i]; } @@ -288,6 +310,7 @@ public static Attribute create(String natureWithFrequency) /** * 从字节流中加载 + * * @param byteArray * @param natureIndexArray * @return @@ -318,7 +341,7 @@ public int getNatureFrequency(String nature) { try { - Nature pos = Enum.valueOf(Nature.class, nature); + Nature pos = Nature.create(nature); return getNatureFrequency(pos); } catch (IllegalArgumentException e) @@ -349,6 +372,7 @@ public int 
getNatureFrequency(final Nature nature) /** * 是否有某个词性 + * * @param nature * @return */ @@ -359,6 +383,7 @@ public boolean hasNature(Nature nature) /** * 是否有以某个前缀开头的词性 + * * @param prefix 词性前缀,比如u会查询是否有ude, uzhe等等 * @return */ @@ -396,11 +421,26 @@ public void save(DataOutputStream out) throws IOException /** * 获取词语的ID + * * @param a 词语 - * @return ID,如果不存在,则返回-1 + * @return ID, 如果不存在, 则返回-1 */ public static int getWordID(String a) { return CoreDictionary.trie.exactMatchSearch(a); } + + /** + * 热更新核心词典
+ * 集群环境(或其他IOAdapter)需要自行删除缓存文件 + * + * @return 是否成功 + */ + public static boolean reload() + { + String path = CoreDictionary.path; + IOUtil.deleteFile(path + Predefine.BIN_EXT); + + return load(path); + } } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionaryTransformMatrixDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionaryTransformMatrixDictionary.java index ffb9796ef..7ef61d7ab 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionaryTransformMatrixDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/CoreDictionaryTransformMatrixDictionary.java @@ -21,10 +21,17 @@ */ public class CoreDictionaryTransformMatrixDictionary { - public static TransformMatrixDictionary transformMatrixDictionary; + public static TransformMatrix transformMatrixDictionary; static { - transformMatrixDictionary = new TransformMatrixDictionary(Nature.class); + transformMatrixDictionary = new TransformMatrix(){ + + @Override + public int ordinal(String tag) + { + return Nature.create(tag).ordinal(); + } + }; long start = System.currentTimeMillis(); if (!transformMatrixDictionary.load(HanLP.Config.CoreDictionaryTransformMatrixDictionaryPath)) { diff --git a/src/main/java/com/hankcs/hanlp/dictionary/CoreSynonymDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/CoreSynonymDictionary.java index 73a7f5cb9..8695850ea 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/CoreSynonymDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/CoreSynonymDictionary.java @@ -115,7 +115,7 @@ public static double similarity(String A, String B) * @param withUndefinedItem 是否保留词典中没有的词语 * @return */ - public static List convert(List sentence, boolean withUndefinedItem) + public static List createSynonymList(List sentence, boolean withUndefinedItem) { List synonymItemList = new ArrayList(sentence.size()); for (Term term : sentence) diff --git a/src/main/java/com/hankcs/hanlp/dictionary/CustomDictionary.java 
b/src/main/java/com/hankcs/hanlp/dictionary/CustomDictionary.java index 151272657..450a2e5ff 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/CustomDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/CustomDictionary.java @@ -16,219 +16,56 @@ import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; -import com.hankcs.hanlp.corpus.io.ByteArray; -import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.corpus.tag.Nature; -import com.hankcs.hanlp.dictionary.other.CharTable; -import com.hankcs.hanlp.utility.LexiconUtility; -import com.hankcs.hanlp.utility.Predefine; -import com.hankcs.hanlp.utility.TextUtility; -import java.io.*; -import java.util.*; - -import static com.hankcs.hanlp.utility.Predefine.logger; +import java.util.LinkedHashSet; +import java.util.LinkedList; +import java.util.Map; +import java.util.TreeMap; /** - * 用户自定义词典 + * 用户自定义词典
+ * 注意自定义词典的动态增删改不是线程安全的。 * * @author He Han */ public class CustomDictionary { /** - * 用于储存用户动态插入词条的二分trie树 + * 默认实例 */ - public static BinTrie trie; - public static DoubleArrayTrie dat = new DoubleArrayTrie(); - - // 自动加载词典 - static - { - String path[] = HanLP.Config.CustomDictionaryPath; - long start = System.currentTimeMillis(); - if (!loadMainDictionary(path[0])) - { - logger.warning("自定义词典" + Arrays.toString(path) + "加载失败"); - } - else - { - logger.info("自定义词典加载成功:" + dat.size() + "个词条,耗时" + (System.currentTimeMillis() - start) + "ms"); - } - } + public static DynamicCustomDictionary DEFAULT = new DynamicCustomDictionary(HanLP.Config.CustomDictionaryPath); - private static boolean loadMainDictionary(String mainPath) + /** + * 加载词典 + * + * @param mainPath 缓存文件文件名 + * @param path 自定义词典 + * @param isCache 是否缓存结果 + */ + public static boolean loadMainDictionary(String mainPath, String path[], DoubleArrayTrie dat, boolean isCache) { - logger.info("自定义词典开始加载:" + mainPath); - if (loadDat(mainPath)) return true; - dat = new DoubleArrayTrie(); - TreeMap map = new TreeMap(); - LinkedHashSet customNatureCollector = new LinkedHashSet(); - try - { - String path[] = HanLP.Config.CustomDictionaryPath; - for (String p : path) - { - Nature defaultNature = Nature.n; - int cut = p.indexOf(' '); - if (cut > 0) - { - // 有默认词性 - String nature = p.substring(cut + 1); - p = p.substring(0, cut); - try - { - defaultNature = LexiconUtility.convertStringToNature(nature, customNatureCollector); - } - catch (Exception e) - { - logger.severe("配置文件【" + p + "】写错了!" 
+ e); - continue; - } - } - logger.info("以默认词性[" + defaultNature + "]加载自定义词典" + p + "中……"); - boolean success = load(p, defaultNature, map, customNatureCollector); - if (!success) logger.warning("失败:" + p); - } - if (map.size() == 0) - { - logger.warning("没有加载到任何词条"); - map.put(Predefine.TAG_OTHER, null); // 当作空白占位符 - } - logger.info("正在构建DoubleArrayTrie……"); - dat.build(map); - // 缓存成dat文件,下次加载会快很多 - logger.info("正在缓存词典为dat文件……"); - // 缓存值文件 - List attributeList = new LinkedList(); - for (Map.Entry entry : map.entrySet()) - { - attributeList.add(entry.getValue()); - } - DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(mainPath + Predefine.BIN_EXT)); - // 缓存用户词性 - IOUtil.writeCustomNature(out, customNatureCollector); - // 缓存正文 - out.writeInt(attributeList.size()); - for (CoreDictionary.Attribute attribute : attributeList) - { - attribute.save(out); - } - dat.save(out); - out.close(); - } - catch (FileNotFoundException e) - { - logger.severe("自定义词典" + mainPath + "不存在!" + e); - return false; - } - catch (IOException e) - { - logger.severe("自定义词典" + mainPath + "读取错误!" 
+ e); - return false; - } - catch (Exception e) - { - logger.warning("自定义词典" + mainPath + "缓存失败!\n" + TextUtility.exceptionToString(e)); - } - return true; + return DynamicCustomDictionary.loadMainDictionary(mainPath, path, dat, isCache, HanLP.Config.Normalization); } /** * 加载用户词典(追加) * - * @param path 词典路径 - * @param defaultNature 默认词性 + * @param path 词典路径 + * @param defaultNature 默认词性 * @param customNatureCollector 收集用户词性 * @return */ public static boolean load(String path, Nature defaultNature, TreeMap map, LinkedHashSet customNatureCollector) { - try - { - String splitter = "\\s"; - if (path.endsWith(".csv")) - { - splitter = ","; - } - BufferedReader br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); - String line; - while ((line = br.readLine()) != null) - { - String[] param = line.split(splitter); - if (param[0].length() == 0) continue; // 排除空行 - if (HanLP.Config.Normalization) param[0] = CharTable.convert(param[0]); // 正规化 - - int natureCount = (param.length - 1) / 2; - CoreDictionary.Attribute attribute; - if (natureCount == 0) - { - attribute = new CoreDictionary.Attribute(defaultNature); - } - else - { - attribute = new CoreDictionary.Attribute(natureCount); - for (int i = 0; i < natureCount; ++i) - { - attribute.nature[i] = LexiconUtility.convertStringToNature(param[1 + 2 * i], customNatureCollector); - attribute.frequency[i] = Integer.parseInt(param[2 + 2 * i]); - attribute.totalFrequency += attribute.frequency[i]; - } - } -// if (updateAttributeIfExist(param[0], attribute, map, rewriteTable)) continue; - map.put(param[0], attribute); - } - br.close(); - } - catch (Exception e) - { - if (!path.startsWith(".")) - logger.severe("自定义词典" + path + "读取错误!" 
+ e); - return false; - } - - return true; + return DynamicCustomDictionary.load(path, defaultNature, map, customNatureCollector, HanLP.Config.Normalization); } - /** - * 如果已经存在该词条,直接更新该词条的属性 - * @param key 词语 - * @param attribute 词语的属性 - * @param map 加载期间的map - * @param rewriteTable - * @return 是否更新了 - */ - private static boolean updateAttributeIfExist(String key, CoreDictionary.Attribute attribute, TreeMap map, TreeMap rewriteTable) - { - int wordID = CoreDictionary.getWordID(key); - CoreDictionary.Attribute attributeExisted; - if (wordID != -1) - { - attributeExisted = CoreDictionary.get(wordID); - attributeExisted.nature = attribute.nature; - attributeExisted.frequency = attribute.frequency; - attributeExisted.totalFrequency = attribute.totalFrequency; - // 收集该覆写 - rewriteTable.put(wordID, attribute); - return true; - } - - attributeExisted = map.get(key); - if (attributeExisted != null) - { - attributeExisted.nature = attribute.nature; - attributeExisted.frequency = attribute.frequency; - attributeExisted.totalFrequency = attribute.totalFrequency; - return true; - } - - return false; - } /** * 往自定义词典中插入一个新词(非覆盖模式)
- * 动态增删不会持久化到词典文件 + * 动态增删不会持久化到词典文件 * * @param word 新词 如“裸婚” * @param natureWithFrequency 词性和其对应的频次,比如“nz 1 v 2”,null时表示“nz 1” @@ -236,27 +73,24 @@ private static boolean updateAttributeIfExist(String key, CoreDictionary.Attribu */ public static boolean add(String word, String natureWithFrequency) { - if (contains(word)) return false; - return insert(word, natureWithFrequency); + return DEFAULT.add(word, natureWithFrequency); } /** * 往自定义词典中插入一个新词(非覆盖模式)
- * 动态增删不会持久化到词典文件 + * 动态增删不会持久化到词典文件 * - * @param word 新词 如“裸婚” + * @param word 新词 如“裸婚” * @return 是否插入成功(失败的原因可能是不覆盖等,可以通过调试模式了解原因) */ public static boolean add(String word) { - if (HanLP.Config.Normalization) word = CharTable.convert(word); - if (contains(word)) return false; - return insert(word, null); + return DEFAULT.add(word); } /** * 往自定义词典中插入一个新词(覆盖模式)
- * 动态增删不会持久化到词典文件 + * 动态增删不会持久化到词典文件 * * @param word 新词 如“裸婚” * @param natureWithFrequency 词性和其对应的频次,比如“nz 1 v 2”,null时表示“nz 1”。 @@ -264,72 +98,36 @@ public static boolean add(String word) */ public static boolean insert(String word, String natureWithFrequency) { - if (word == null) return false; - if (HanLP.Config.Normalization) word = CharTable.convert(word); - CoreDictionary.Attribute att = natureWithFrequency == null ? new CoreDictionary.Attribute(Nature.nz, 1) : CoreDictionary.Attribute.create(natureWithFrequency); - if (att == null) return false; - if (dat != null && dat.set(word, att)) return true; - if (trie == null) trie = new BinTrie(); - trie.put(word, att); - return true; + return DEFAULT.insert(word, natureWithFrequency); } /** * 以覆盖模式增加新词
- * 动态增删不会持久化到词典文件 + * 动态增删不会持久化到词典文件 * * @param word * @return */ public static boolean insert(String word) { - return insert(word, null); + return DEFAULT.insert(word); + } + + public static boolean loadDat(String path, DoubleArrayTrie dat) + { + return DynamicCustomDictionary.loadDat(path, HanLP.Config.CustomDictionaryPath, dat); } /** * 从磁盘加载双数组 * - * @param path + * @param path 主词典路径 + * @param customDicPath 用户词典路径 * @return */ - static boolean loadDat(String path) + public static boolean loadDat(String path, String customDicPath[], DoubleArrayTrie dat) { - try - { - ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT); - if (byteArray == null) return false; - int size = byteArray.nextInt(); - if (size < 0) // 一种兼容措施,当size小于零表示文件头部储存了-size个用户词性 - { - while (++size <= 0) - { - Nature.create(byteArray.nextString()); - } - size = byteArray.nextInt(); - } - CoreDictionary.Attribute[] attributes = new CoreDictionary.Attribute[size]; - final Nature[] natureIndexArray = Nature.values(); - for (int i = 0; i < size; ++i) - { - // 第一个是全部频次,第二个是词性个数 - int currentTotalFrequency = byteArray.nextInt(); - int length = byteArray.nextInt(); - attributes[i] = new CoreDictionary.Attribute(length); - attributes[i].totalFrequency = currentTotalFrequency; - for (int j = 0; j < length; ++j) - { - attributes[i].nature[j] = natureIndexArray[byteArray.nextInt()]; - attributes[i].frequency[j] = byteArray.nextInt(); - } - } - if (!dat.load(byteArray, attributes)) return false; - } - catch (Exception e) - { - logger.warning("读取失败,问题发生在" + TextUtility.exceptionToString(e)); - return false; - } - return true; + return DynamicCustomDictionary.loadDat(path, customDicPath, dat); } /** @@ -340,24 +138,18 @@ static boolean loadDat(String path) */ public static CoreDictionary.Attribute get(String key) { - if (HanLP.Config.Normalization) key = CharTable.convert(key); - CoreDictionary.Attribute attribute = dat == null ? 
null : dat.get(key); - if (attribute != null) return attribute; - if (trie == null) return null; - return trie.get(key); + return DEFAULT.get(key); } /** * 删除单词
- * 动态增删不会持久化到词典文件 + * 动态增删不会持久化到词典文件 * * @param key */ public static void remove(String key) { - if (HanLP.Config.Normalization) key = CharTable.convert(key); - if (trie == null) return; - trie.remove(key); + DEFAULT.remove(key); } /** @@ -368,7 +160,7 @@ public static void remove(String key) */ public static LinkedList> commonPrefixSearch(String key) { - return trie.commonPrefixSearchWithValue(key); + return DEFAULT.commonPrefixSearch(key); } /** @@ -380,7 +172,7 @@ public static LinkedList> commonPref */ public static LinkedList> commonPrefixSearch(char[] chars, int begin) { - return trie.commonPrefixSearchWithValue(chars, begin); + return DEFAULT.commonPrefixSearch(chars, begin); } public static BaseSearcher getSearcher(String text) @@ -392,23 +184,24 @@ public static BaseSearcher getSearcher(String text) public String toString() { return "CustomDictionary{" + - "trie=" + trie + - '}'; + "trie=" + DEFAULT.trie + + '}'; } /** * 词典中是否含有词语 + * * @param key 词语 * @return 是否包含 */ public static boolean contains(String key) { - if (dat != null && dat.exactMatchSearch(key) >= 0) return true; - return trie != null && trie.containsKey(key); + return DEFAULT.contains(key); } /** * 获取一个BinTrie的查询工具 + * * @param charArray 文本 * @return 查询者 */ @@ -444,13 +237,13 @@ public Map.Entry next() // 保证首次调用找到一个词语 while (entryList.size() == 0 && begin < c.length) { - entryList = trie.commonPrefixSearchWithValue(c, begin); + entryList = DEFAULT.trie.commonPrefixSearchWithValue(c, begin); ++begin; } // 之后调用仅在缓存用完的时候调用一次 if (entryList.size() == 0 && begin < c.length) { - entryList = trie.commonPrefixSearchWithValue(c, begin); + entryList = DEFAULT.trie.commonPrefixSearchWithValue(c, begin); ++begin; } if (entryList.size() == 0) @@ -472,62 +265,50 @@ public Map.Entry next() */ public static BinTrie getTrie() { - return trie; + return DEFAULT.getTrie(); } /** * 解析一段文本(目前采用了BinTrie+DAT的混合储存形式,此方法可以统一两个数据结构) - * @param text 文本 - * @param processor 处理器 + * + * @param text 文本 + * @param 
processor 处理器 */ public static void parseText(char[] text, AhoCorasickDoubleArrayTrie.IHit processor) { - if (trie != null) - { - trie.parseText(text, processor); - } - DoubleArrayTrie.Searcher searcher = dat.getSearcher(text, 0); - while (searcher.next()) - { - processor.hit(searcher.begin, searcher.begin + searcher.length, searcher.value); - } + DEFAULT.parseText(text, processor); } /** * 解析一段文本(目前采用了BinTrie+DAT的混合储存形式,此方法可以统一两个数据结构) - * @param text 文本 - * @param processor 处理器 + * + * @param text 文本 + * @param processor 处理器 */ public static void parseText(String text, AhoCorasickDoubleArrayTrie.IHit processor) { - if (trie != null) - { - BaseSearcher searcher = CustomDictionary.getSearcher(text); - int offset; - Map.Entry entry; - while ((entry = searcher.next()) != null) - { - offset = searcher.getOffset(); - processor.hit(offset, offset + entry.getKey().length(), entry.getValue()); - } - } - DoubleArrayTrie.Searcher searcher = dat.getSearcher(text, 0); - while (searcher.next()) - { - processor.hit(searcher.begin, searcher.begin + searcher.length, searcher.value); - } + DEFAULT.parseText(text, processor); + } + + /** + * 最长匹配 + * + * @param text 文本 + * @param processor 处理器 + */ + public static void parseLongestText(String text, AhoCorasickDoubleArrayTrie.IHit processor) + { + DEFAULT.parseLongestText(text, processor); } /** * 热更新(重新加载)
* 集群环境(或其他IOAdapter)需要自行删除缓存文件(路径 = HanLP.Config.CustomDictionaryPath[0] + Predefine.BIN_EXT) + * * @return 是否加载成功 */ public static boolean reload() { - String path[] = HanLP.Config.CustomDictionaryPath; - if (path == null || path.length == 0) return false; - IOUtil.deleteFile(path[0] + Predefine.BIN_EXT); // 删掉缓存 - return loadMainDictionary(path[0]); + return DEFAULT.reload(); } } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/DynamicCustomDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/DynamicCustomDictionary.java new file mode 100644 index 000000000..8df34ba7f --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dictionary/DynamicCustomDictionary.java @@ -0,0 +1,728 @@ +/* + * Han He + * me@hankcs.com + * 2021-01-30 11:12 PM + * + * + * Copyright (c) 2021, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.hanlp.dictionary; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.utility.LexiconUtility; +import com.hankcs.hanlp.utility.Predefine; +import com.hankcs.hanlp.utility.TextUtility; + +import java.io.*; +import java.util.*; + +import static com.hankcs.hanlp.utility.Predefine.logger; + +/** + * 用户自定义词典
+ * 注意自定义词典的动态增删改不是线程安全的。 + * + * @author hankcs + */ +public class DynamicCustomDictionary +{ + /** + * 用于储存用户动态插入词条的二分trie树 + */ + public BinTrie trie; + /** + * 用于储存文件中的词条 + */ + public DoubleArrayTrie dat; + /** + * 本词典是从哪些路径加载得到的 + */ + public String path[]; + + /** + * 是否执行字符正规化(繁体->简体,全角->半角,大写->小写),切换配置后必须删CustomDictionary.txt.bin缓存 + */ + public boolean normalization = HanLP.Config.Normalization; + + /** + * 构造一份词典对象,并加载{@code com.hankcs.hanlp.HanLP.Config#CustomDictionaryPath} + */ + public DynamicCustomDictionary() + { + this(HanLP.Config.CustomDictionaryPath); + } + + /** + * 构造一份词典对象,并加载指定路径的词典 + * + * @param path 词典路径 + */ + public DynamicCustomDictionary(String... path) + { + this(new DoubleArrayTrie(), new BinTrie(), path); + } + + /** + * 使用高级数据结构构造词典对象,并加载指定路径的词典 + * + * @param dat 双数组trie树 + * @param trie trie树 + * @param path 词典路径 + */ + public DynamicCustomDictionary(DoubleArrayTrie dat, BinTrie trie, String[] path) + { + this.dat = dat; + this.trie = trie; + if (path != null) + { + load(path); + } + } + + /** + * 加载指定路径的词典 + * + * @param path 词典路径 + * @return 是否加载成功 + */ + public boolean load(String... 
path) + { + long start = System.currentTimeMillis(); + if (!loadMainDictionary(path[0], path, this.dat, true, normalization)) + { + logger.warning("自定义词典" + Arrays.toString(path) + "加载失败"); + return false; + } + else + { + logger.info("自定义词典加载成功:" + dat.size() + "个词条,耗时" + (System.currentTimeMillis() - start) + "ms"); + this.path = path; + return true; + } + } + + /** + * 加载词典 + * + * @param mainPath 缓存文件文件名 + * @param path 自定义词典 + * @param isCache 是否缓存结果 + */ + public static boolean loadMainDictionary(String mainPath, String path[], DoubleArrayTrie dat, boolean isCache, boolean normalization) + { + logger.info("自定义词典开始加载:" + mainPath); + if (loadDat(mainPath, dat)) return true; + TreeMap map = new TreeMap(); + LinkedHashSet customNatureCollector = new LinkedHashSet(); + try + { + //String path[] = HanLP.Config.CustomDictionaryPath; + for (String p : path) + { + Nature defaultNature = Nature.n; + File file = new File(p); + String fileName = file.getName(); + int cut = fileName.lastIndexOf(' '); + if (cut > 0) + { + // 有默认词性 + String nature = fileName.substring(cut + 1); + p = file.getParent() + File.separator + fileName.substring(0, cut); + try + { + defaultNature = LexiconUtility.convertStringToNature(nature, customNatureCollector); + } + catch (Exception e) + { + logger.severe("配置文件【" + p + "】写错了!" 
+ e); + continue; + } + } + logger.info("以默认词性[" + defaultNature + "]加载自定义词典" + p + "中……"); + boolean success = load(p, defaultNature, map, customNatureCollector, normalization); + if (!success) logger.warning("失败:" + p); + } + if (map.size() == 0) + { + logger.warning("没有加载到任何词条"); + map.put(Predefine.TAG_OTHER, null); // 当作空白占位符 + } + logger.info("正在构建DoubleArrayTrie……"); + dat.build(map); + if (isCache) + { + // 缓存成dat文件,下次加载会快很多 + logger.info("正在缓存词典为dat文件……"); + // 缓存值文件 + List attributeList = new LinkedList(); + for (Map.Entry entry : map.entrySet()) + { + attributeList.add(entry.getValue()); + } + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(mainPath + Predefine.BIN_EXT))); + // 缓存用户词性 + if (customNatureCollector.isEmpty()) // 热更新 + { + for (int i = Nature.begin.ordinal() + 1; i < Nature.values().length; ++i) + { + customNatureCollector.add(Nature.values()[i]); + } + } + IOUtil.writeCustomNature(out, customNatureCollector); + // 缓存正文 + out.writeInt(attributeList.size()); + for (CoreDictionary.Attribute attribute : attributeList) + { + attribute.save(out); + } + dat.save(out); + out.close(); + } + } + catch (FileNotFoundException e) + { + logger.severe("自定义词典" + mainPath + "不存在!" + e); + return false; + } + catch (IOException e) + { + logger.severe("自定义词典" + mainPath + "读取错误!" 
+ e); + return false; + } + catch (Exception e) + { + logger.warning("自定义词典" + mainPath + "缓存失败!\n" + TextUtility.exceptionToString(e)); + } + return true; + } + + /** + * 使用词典路径为缓存路径,加载指定词典 + * + * @param mainPath 词典路径(+.bin等于缓存路径) + * @return + */ + public boolean loadMainDictionary(String mainPath, boolean normalization) + { + return loadMainDictionary(mainPath, this.path, this.dat, true, normalization); + } + + + /** + * 加载用户词典(追加) + * + * @param path 词典路径 + * @param defaultNature 默认词性 + * @param customNatureCollector 收集用户词性 + * @return + */ + public static boolean load(String path, Nature defaultNature, TreeMap map, LinkedHashSet customNatureCollector, boolean normalization) + { + try + { + String splitter = "\\s"; + if (path.endsWith(".csv")) + { + splitter = ","; + } + else if (path.endsWith(".tsv")) + { + splitter = "\t"; + } + BufferedReader br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); + String line; + boolean firstLine = true; + while ((line = br.readLine()) != null) + { + if (firstLine) + { + line = IOUtil.removeUTF8BOM(line); + firstLine = false; + } + String[] param = line.split(splitter); + if (param[0].length() == 0) continue; // 排除空行 + if (normalization) param[0] = CharTable.convert(param[0]); // 正规化 + + int natureCount = (param.length - 1) / 2; + CoreDictionary.Attribute attribute; + if (natureCount == 0) + { + attribute = new CoreDictionary.Attribute(defaultNature); + } + else + { + attribute = new CoreDictionary.Attribute(natureCount); + for (int i = 0; i < natureCount; ++i) + { + attribute.nature[i] = LexiconUtility.convertStringToNature(param[1 + 2 * i], customNatureCollector); + attribute.frequency[i] = Integer.parseInt(param[2 + 2 * i]); + attribute.totalFrequency += attribute.frequency[i]; + } + } +// if (updateAttributeIfExist(param[0], attribute, map, rewriteTable)) continue; + map.put(param[0], attribute); + } + br.close(); + } + catch (Exception e) + { + logger.severe("自定义词典" + path + "读取错误!" 
+ e); + return false; + } + + return true; + } + + /** + * 如果已经存在该词条,直接更新该词条的属性 + * + * @param key 词语 + * @param attribute 词语的属性 + * @param map 加载期间的map + * @param rewriteTable + * @return 是否更新了 + */ + private boolean updateAttributeIfExist(String key, CoreDictionary.Attribute attribute, TreeMap<String, CoreDictionary.Attribute> map, TreeMap<Integer, CoreDictionary.Attribute> rewriteTable) + { + int wordID = CoreDictionary.getWordID(key); + CoreDictionary.Attribute attributeExisted; + if (wordID != -1) + { + attributeExisted = CoreDictionary.get(wordID); + attributeExisted.nature = attribute.nature; + attributeExisted.frequency = attribute.frequency; + attributeExisted.totalFrequency = attribute.totalFrequency; + // 收集该覆写 + rewriteTable.put(wordID, attribute); + return true; + } + + attributeExisted = map.get(key); + if (attributeExisted != null) + { + attributeExisted.nature = attribute.nature; + attributeExisted.frequency = attribute.frequency; + attributeExisted.totalFrequency = attribute.totalFrequency; + return true; + } + + return false; + } + + /** + * 往自定义词典中插入一个新词(非覆盖模式)<br>
+ * 动态增删不会持久化到词典文件 + * + * @param word 新词 如“裸婚” + * @param natureWithFrequency 词性和其对应的频次,比如“nz 1 v 2”,null时表示“nz 1” + * @return 是否插入成功(失败的原因可能是不覆盖、natureWithFrequency有问题等,后者可以通过调试模式了解原因) + */ + public boolean add(String word, String natureWithFrequency) + { + if (normalization) word = CharTable.convert(word); + if (contains(word)) return false; + return insert(word, natureWithFrequency); + } + + /** + * 往自定义词典中插入一个新词(非覆盖模式)<br>
+ * 动态增删不会持久化到词典文件 + * + * @param word 新词 如“裸婚” + * @return 是否插入成功(失败的原因可能是不覆盖等,可以通过调试模式了解原因) + */ + public boolean add(String word) + { + if (normalization) word = CharTable.convert(word); + if (contains(word)) return false; + return insert(word, null); + } + + /** + * 往自定义词典中插入一个新词(覆盖模式)<br>
+ * 动态增删不会持久化到词典文件 + * + * @param word 新词 如“裸婚” + * @param natureWithFrequency 词性和其对应的频次,比如“nz 1 v 2”,null时表示“nz 1”。 + * @return 是否插入成功(失败的原因可能是natureWithFrequency问题,可以通过调试模式了解原因) + */ + public boolean insert(String word, String natureWithFrequency) + { + if (word == null) return false; + if (normalization) word = CharTable.convert(word); + CoreDictionary.Attribute att = natureWithFrequency == null ? new CoreDictionary.Attribute(Nature.nz, 1) : CoreDictionary.Attribute.create(natureWithFrequency); + if (att == null) return false; + if (dat.set(word, att)) return true; + if (trie == null) trie = new BinTrie<CoreDictionary.Attribute>(); + trie.put(word, att); + return true; + } + + /** + * 以覆盖模式增加新词<br>
+ * 动态增删不会持久化到词典文件 + * + * @param word + * @return + */ + public boolean insert(String word) + { + return insert(word, null); + } + + public static boolean loadDat(String path, DoubleArrayTrie dat) + { + return loadDat(path, HanLP.Config.CustomDictionaryPath, dat); + } + + /** + * 从磁盘加载双数组 + * + * @param path 主词典路径 + * @param customDicPath 用户词典路径 + * @return + */ + public static boolean loadDat(String path, String customDicPath[], DoubleArrayTrie dat) + { + try + { + if (HanLP.Config.CustomDictionaryAutoRefreshCache && isDicNeedUpdate(path, customDicPath)) + { + return false; + } + ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT); + if (byteArray == null) return false; + int size = byteArray.nextInt(); + if (size < 0) // 一种兼容措施,当size小于零表示文件头部储存了-size个用户词性 + { + while (++size <= 0) + { + Nature.create(byteArray.nextString()); + } + size = byteArray.nextInt(); + } + CoreDictionary.Attribute[] attributes = new CoreDictionary.Attribute[size]; + final Nature[] natureIndexArray = Nature.values(); + for (int i = 0; i < size; ++i) + { + // 第一个是全部频次,第二个是词性个数 + int currentTotalFrequency = byteArray.nextInt(); + int length = byteArray.nextInt(); + attributes[i] = new CoreDictionary.Attribute(length); + attributes[i].totalFrequency = currentTotalFrequency; + for (int j = 0; j < length; ++j) + { + attributes[i].nature[j] = natureIndexArray[byteArray.nextInt()]; + attributes[i].frequency[j] = byteArray.nextInt(); + } + } + if (!dat.load(byteArray, attributes)) return false; + } + catch (Exception e) + { + logger.warning("读取失败,问题发生在" + TextUtility.exceptionToString(e)); + return false; + } + return true; + } + + /** + * 获取本地词典更新状态 + * + * @return true 表示本地词典比缓存文件新,需要删除缓存 + */ + public static boolean isDicNeedUpdate(String mainPath, String path[]) + { + if (HanLP.Config.IOAdapter != null && + !HanLP.Config.IOAdapter.getClass().getName().contains("com.hankcs.hanlp.corpus.io.FileIOAdapter")) + { + return false; + } + String binPath = mainPath + 
Predefine.BIN_EXT; + File binFile = new File(binPath); + if (!binFile.exists()) + { + return true; + } + long lastModified = binFile.lastModified(); + //String path[] = HanLP.Config.CustomDictionaryPath; + for (String p : path) + { + File f = new File(p); + String fileName = f.getName(); + int cut = fileName.lastIndexOf(' '); + if (cut > 0) + { + p = f.getParent() + File.separator + fileName.substring(0, cut); + } + f = new File(p); + if (f.exists() && f.lastModified() > lastModified) + { + IOUtil.deleteFile(binPath); // 删掉缓存 + logger.info("已清除自定义词典缓存文件!"); + return true; + } + } + return false; + } + + /** + * 查单词 + * + * @param key + * @return + */ + public CoreDictionary.Attribute get(String key) + { + if (normalization) key = CharTable.convert(key); + CoreDictionary.Attribute attribute = dat.get(key); + if (attribute != null) return attribute; + if (trie == null) return null; + return trie.get(key); + } + + /** + * 删除单词<br>
+ * 动态增删不会持久化到词典文件 + * + * @param key + */ + public void remove(String key) + { + if (normalization) key = CharTable.convert(key); + if (trie == null) return; + trie.remove(key); + } + + /** + * 前缀查询 + * + * @param key + * @return + */ + public LinkedList> commonPrefixSearch(String key) + { + return trie.commonPrefixSearchWithValue(key); + } + + /** + * 前缀查询 + * + * @param chars + * @param begin + * @return + */ + public LinkedList> commonPrefixSearch(char[] chars, int begin) + { + return trie.commonPrefixSearchWithValue(chars, begin); + } + + public BaseSearcher getSearcher(String text) + { + return new DynamicCustomDictionary.Searcher(text); + } + + @Override + public String toString() + { + return "DynamicCustomDictionary{" + + "trie=" + trie + + '}'; + } + + /** + * 词典中是否含有词语 + * + * @param key 词语 + * @return 是否包含 + */ + public boolean contains(String key) + { + if (dat.exactMatchSearch(key) >= 0) return true; + return trie != null && trie.containsKey(key); + } + + /** + * 获取一个BinTrie的查询工具 + * + * @param charArray 文本 + * @return 查询者 + */ + public BaseSearcher getSearcher(char[] charArray) + { + return new DynamicCustomDictionary.Searcher(charArray); + } + + class Searcher extends BaseSearcher + { + /** + * 分词从何处开始,这是一个状态 + */ + int begin; + + private LinkedList> entryList; + + protected Searcher(char[] c) + { + super(c); + entryList = new LinkedList>(); + } + + protected Searcher(String text) + { + super(text); + entryList = new LinkedList>(); + } + + @Override + public Map.Entry next() + { + // 保证首次调用找到一个词语 + while (entryList.size() == 0 && begin < c.length) + { + entryList = trie.commonPrefixSearchWithValue(c, begin); + ++begin; + } + // 之后调用仅在缓存用完的时候调用一次 + if (entryList.size() == 0 && begin < c.length) + { + entryList = trie.commonPrefixSearchWithValue(c, begin); + ++begin; + } + if (entryList.size() == 0) + { + return null; + } + Map.Entry result = entryList.getFirst(); + entryList.removeFirst(); + offset = begin - 1; + return result; + } + } + + /** + * 
获取词典对应的trie树 + * + * @return + * @deprecated 谨慎操作,有可能废弃此接口 + */ + public BinTrie getTrie() + { + return trie; + } + + /** + * 解析一段文本(目前采用了BinTrie+DAT的混合储存形式,此方法可以统一两个数据结构) + * + * @param text 文本 + * @param processor 处理器 + */ + public void parseText(char[] text, AhoCorasickDoubleArrayTrie.IHit processor) + { + if (trie != null) + { + trie.parseText(text, processor); + } + DoubleArrayTrie.Searcher searcher = dat.getSearcher(text, 0); + while (searcher.next()) + { + processor.hit(searcher.begin, searcher.begin + searcher.length, searcher.value); + } + } + + /** + * 解析一段文本(目前采用了BinTrie+DAT的混合储存形式,此方法可以统一两个数据结构) + * + * @param text 文本 + * @param processor 处理器 + */ + public void parseText(String text, AhoCorasickDoubleArrayTrie.IHit processor) + { + if (trie != null) + { + BaseSearcher searcher = this.getSearcher(text); + int offset; + Map.Entry entry; + while ((entry = searcher.next()) != null) + { + offset = searcher.getOffset(); + processor.hit(offset, offset + entry.getKey().length(), entry.getValue()); + } + } + DoubleArrayTrie.Searcher searcher = dat.getSearcher(text, 0); + while (searcher.next()) + { + processor.hit(searcher.begin, searcher.begin + searcher.length, searcher.value); + } + } + + /** + * 最长匹配 + * + * @param text 文本 + * @param processor 处理器 + */ + public void parseLongestText(String text, AhoCorasickDoubleArrayTrie.IHit processor) + { + if (trie != null) + { + final int[] lengthArray = new int[text.length()]; + final CoreDictionary.Attribute[] attributeArray = new CoreDictionary.Attribute[text.length()]; + char[] charArray = text.toCharArray(); + DoubleArrayTrie.Searcher searcher = dat.getSearcher(charArray, 0); + while (searcher.next()) + { + lengthArray[searcher.begin] = searcher.length; + attributeArray[searcher.begin] = searcher.value; + } + trie.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + int length = end - begin; + if (length > 
lengthArray[begin]) + { + lengthArray[begin] = length; + attributeArray[begin] = value; + } + } + }); + for (int i = 0; i < charArray.length; ) + { + if (lengthArray[i] == 0) + { + ++i; + } + else + { + processor.hit(i, i + lengthArray[i], attributeArray[i]); + i += lengthArray[i]; + } + } + } + else + dat.parseLongestText(text, processor); + } + + /** + * 热更新(重新加载)<br>
+ * 集群环境(或其他IOAdapter)需要自行删除缓存文件(路径 = HanLP.Config.CustomDictionaryPath[0] + Predefine.BIN_EXT) + * + * @return 是否加载成功 + */ + public boolean reload() + { + if (path == null || path.length == 0) return false; + IOUtil.deleteFile(path[0] + Predefine.BIN_EXT); // 删掉缓存 + return loadMainDictionary(path[0], normalization); + } +} diff --git a/src/main/java/com/hankcs/hanlp/dictionary/TransformMatrix.java b/src/main/java/com/hankcs/hanlp/dictionary/TransformMatrix.java new file mode 100644 index 000000000..bf1fe1248 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dictionary/TransformMatrix.java @@ -0,0 +1,181 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-23 8:30 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.dictionary; + +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.io.BufferedReader; +import java.io.InputStreamReader; + +import static com.hankcs.hanlp.utility.Predefine.logger; + +/** + * @author hankcs + */ +public abstract class TransformMatrix +{ + // HMM的五元组 + //int[] observations; + /** + * 隐状态 + */ + public int[] states; + /** + * 初始概率 + */ + public double[] start_probability; + /** + * 转移概率 + */ + public double[][] transititon_probability; + /** + * 内部标签下标最大值不超过这个值,用于矩阵创建 + */ + protected int ordinaryMax; + /** + * 储存转移矩阵 + */ + int[][] matrix; + /** + * 储存每个标签出现的次数 + */ + int[] total; + /** + * 所有标签出现的总次数 + */ + int totalFrequency; + + public boolean load(String path) + { + try + { + BufferedReader br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); + // 第一行是矩阵的各个类型 + String line = br.readLine(); + String[] _param = line.split(","); + // 为了制表方便,第一个label是废物,所以要抹掉它 + String[] labels = new String[_param.length - 1]; + System.arraycopy(_param, 1, labels, 0, labels.length); + int[] ordinaryArray = new int[labels.length]; + ordinaryMax = 0; + for (int i = 0; i < 
ordinaryArray.length; ++i) + { + ordinaryArray[i] = ordinal(labels[i]); + ordinaryMax = Math.max(ordinaryMax, ordinaryArray[i]); + } + ++ordinaryMax; + matrix = new int[ordinaryMax][ordinaryMax]; + for (int i = 0; i < ordinaryMax; ++i) + { + for (int j = 0; j < ordinaryMax; ++j) + { + matrix[i][j] = 0; + } + } + // 之后就描述了矩阵 + while ((line = br.readLine()) != null) + { + String[] paramArray = line.split(","); + int currentOrdinary = ordinal(paramArray[0]); + for (int i = 0; i < ordinaryArray.length; ++i) + { + matrix[currentOrdinary][ordinaryArray[i]] = Integer.valueOf(paramArray[1 + i]); + } + } + br.close(); + // 需要统计一下每个标签出现的次数 + total = new int[ordinaryMax]; + for (int j = 0; j < ordinaryMax; ++j) + { + total[j] = 0; + for (int i = 0; i < ordinaryMax; ++i) + { + total[j] += matrix[j][i]; // 按行累加 + } + } + for (int j = 0; j < ordinaryMax; ++j) + { + if (total[j] == 0) + { + for (int i = 0; i < ordinaryMax; ++i) + { + total[j] += matrix[i][j]; // 按列累加 + } + } + } + for (int j = 0; j < ordinaryMax; ++j) + { + totalFrequency += total[j]; + } + // 下面计算HMM四元组 + states = ordinaryArray; + start_probability = new double[ordinaryMax]; + for (int s : states) + { + double frequency = total[s] + 1e-8; + start_probability[s] = -Math.log(frequency / totalFrequency); + } + transititon_probability = new double[ordinaryMax][ordinaryMax]; + for (int from : states) + { + for (int to : states) + { + double frequency = matrix[from][to] + 1e-8; + transititon_probability[from][to] = -Math.log(frequency / total[from]); +// System.out.println("from" + NR.values()[from] + " to" + NR.values()[to] + " = " + transititon_probability[from][to]); + } + } + } + catch (Exception e) + { + logger.warning("读取" + path + "失败" + e); + return false; + } + + return true; + } + + /** + * 拓展内部矩阵,仅用于通过反射新增了枚举实例之后的兼容措施 + */ + public void extend(int ordinaryMax) + { + this.ordinaryMax = ordinaryMax; + double[][] n_transititon_probability = new double[ordinaryMax][ordinaryMax]; + for (int i = 0; i < 
transititon_probability.length; i++) + { + System.arraycopy(transititon_probability[i], 0, n_transititon_probability[i], 0, transititon_probability.length); + } + transititon_probability = n_transititon_probability; + + int[] n_total = new int[ordinaryMax]; + System.arraycopy(total, 0, n_total, 0, total.length); + total = n_total; + + double[] n_start_probability = new double[ordinaryMax]; + System.arraycopy(start_probability, 0, n_start_probability, 0, start_probability.length); + start_probability = n_start_probability; + + int[][] n_matrix = new int[ordinaryMax][ordinaryMax]; + for (int i = 0; i < matrix.length; i++) + { + System.arraycopy(matrix[i], 0, n_matrix[i], 0, matrix.length); + } + matrix = n_matrix; + } + + public abstract int ordinal(String tag); + + public int getTotalFrequency(int ordinal) + { + return total[ordinal]; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dictionary/TransformMatrixDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/TransformMatrixDictionary.java index f98445751..f3b9647c6 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/TransformMatrixDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/TransformMatrixDictionary.java @@ -11,152 +11,26 @@ */ package com.hankcs.hanlp.dictionary; -import com.hankcs.hanlp.corpus.io.IOUtil; - -import java.io.BufferedReader; -import java.io.FileInputStream; -import java.io.InputStreamReader; import java.util.Arrays; -import static com.hankcs.hanlp.utility.Predefine.logger; - /** * 转移矩阵词典 * * @param 标签的枚举类型 * @author hankcs */ -public class TransformMatrixDictionary> +public class TransformMatrixDictionary> extends TransformMatrix { Class enumType; - /** - * 内部标签下标最大值不超过这个值,用于矩阵创建 - */ - private int ordinaryMax; public TransformMatrixDictionary(Class enumType) { this.enumType = enumType; } - /** - * 储存转移矩阵 - */ - int matrix[][]; - - /** - * 储存每个标签出现的次数 - */ - int total[]; - - /** - * 所有标签出现的总次数 - */ - int totalFrequency; - - // HMM的五元组 - /** - * 隐状态 - */ - public int[] 
states; - //int[] observations; - /** - * 初始概率 - */ - public double[] start_probability; - /** - * 转移概率 - */ - public double[][] transititon_probability; - - public boolean load(String path) + public TransformMatrixDictionary() { - try - { - BufferedReader br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8")); - // 第一行是矩阵的各个类型 - String line = br.readLine(); - String[] _param = line.split(","); - // 为了制表方便,第一个label是废物,所以要抹掉它 - String[] labels = new String[_param.length - 1]; - System.arraycopy(_param, 1, labels, 0, labels.length); - int[] ordinaryArray = new int[labels.length]; - ordinaryMax = 0; - for (int i = 0; i < ordinaryArray.length; ++i) - { - ordinaryArray[i] = convert(labels[i]).ordinal(); - ordinaryMax = Math.max(ordinaryMax, ordinaryArray[i]); - } - ++ordinaryMax; - matrix = new int[ordinaryMax][ordinaryMax]; - for (int i = 0; i < ordinaryMax; ++i) - { - for (int j = 0; j < ordinaryMax; ++j) - { - matrix[i][j] = 0; - } - } - // 之后就描述了矩阵 - while ((line = br.readLine()) != null) - { - String[] paramArray = line.split(","); - int currentOrdinary = convert(paramArray[0]).ordinal(); - for (int i = 0; i < ordinaryArray.length; ++i) - { - matrix[currentOrdinary][ordinaryArray[i]] = Integer.valueOf(paramArray[1 + i]); - } - } - br.close(); - // 需要统计一下每个标签出现的次数 - total = new int[ordinaryMax]; - for (int j = 0; j < ordinaryMax; ++j) - { - total[j] = 0; - for (int i = 0; i < ordinaryMax; ++i) - { - total[j] += matrix[j][i]; // 按行累加 - } - } - for (int j = 0; j < ordinaryMax; ++j) - { - if (total[j] == 0) - { - for (int i = 0; i < ordinaryMax; ++i) - { - total[j] += matrix[i][j]; // 按列累加 - } - } - } - for (int j = 0; j < ordinaryMax; ++j) - { - totalFrequency += total[j]; - } - // 下面计算HMM四元组 - states = ordinaryArray; - start_probability = new double[ordinaryMax]; - for (int s : states) - { - double frequency = total[s] + 1e-8; - start_probability[s] = -Math.log(frequency / totalFrequency); - } - transititon_probability = new 
double[ordinaryMax][ordinaryMax]; - for (int from : states) - { - for (int to : states) - { - double frequency = matrix[from][to] + 1e-8; - transititon_probability[from][to] = -Math.log(frequency / total[from]); -// System.out.println("from" + NR.values()[from] + " to" + NR.values()[to] + " = " + transititon_probability[from][to]); - } - } - } - catch (Exception e) - { - logger.warning("读取" + path + "失败" + e); - return false; - } - return true; } /** @@ -209,35 +83,6 @@ protected E convert(String label) return Enum.valueOf(enumType, label); } - /** - * 拓展内部矩阵,仅用于通过反射新增了枚举实例之后的兼容措施 - */ - public void extendSize() - { - ++ordinaryMax; - double[][] n_transititon_probability = new double[ordinaryMax][ordinaryMax]; - for (int i = 0; i < transititon_probability.length; i++) - { - System.arraycopy(transititon_probability[i], 0, n_transititon_probability[i], 0, transititon_probability.length); - } - transititon_probability = n_transititon_probability; - - int[] n_total = new int[ordinaryMax]; - System.arraycopy(total, 0, n_total, 0, total.length); - total = n_total; - - double[] n_start_probability = new double[ordinaryMax]; - System.arraycopy(start_probability, 0, n_start_probability, 0, start_probability.length); - start_probability = n_start_probability; - - int[][] n_matrix = new int[ordinaryMax][ordinaryMax]; - for (int i = 0; i < matrix.length; i++) - { - System.arraycopy(matrix[i], 0, n_matrix[i], 0, matrix.length); - } - matrix = n_matrix; - } - @Override public String toString() { @@ -250,4 +95,10 @@ public String toString() sb.append('}'); return sb.toString(); } + + @Override + public int ordinal(String tag) + { + return Enum.valueOf(enumType, tag).ordinal(); + } } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/common/CommonDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/common/CommonDictionary.java index c9b7b24a5..4e9151140 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/common/CommonDictionary.java +++ 
b/src/main/java/com/hankcs/hanlp/dictionary/common/CommonDictionary.java @@ -16,10 +16,7 @@ import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.utility.TextUtility; -import java.io.BufferedReader; -import java.io.DataOutputStream; -import java.io.IOException; -import java.io.InputStreamReader; +import java.io.*; import java.util.*; import static com.hankcs.hanlp.utility.Predefine.BIN_EXT; @@ -71,6 +68,7 @@ public boolean load(String path) catch (Exception e) { logger.warning("读取" + path + "失败" + e); + return false; } onLoaded(map); Set> entrySet = map.entrySet(); @@ -119,7 +117,7 @@ protected boolean saveDat(String path, List valueArray) { try { - DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(path)); + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path))); out.writeInt(valueArray.size()); for (V item : valueArray) { diff --git a/src/main/java/com/hankcs/hanlp/dictionary/nr/JapanesePersonDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/nr/JapanesePersonDictionary.java index 4fe8256ae..d08b444d8 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/nr/JapanesePersonDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/nr/JapanesePersonDictionary.java @@ -16,7 +16,6 @@ import com.hankcs.hanlp.corpus.io.ByteArray; import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.dictionary.BaseSearcher; -import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.utility.Predefine; import java.io.*; @@ -54,7 +53,7 @@ public class JapanesePersonDictionary throw new IllegalArgumentException("日本人名词典" + path + "加载失败"); } - logger.info("日本人名词典" + HanLP.Config.PinyinDictionaryPath + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); + logger.info("日本人名词典" + HanLP.Config.JapanesePersonDictionaryPath + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); } static boolean load() @@ -95,7 +94,7 @@ static boolean saveDat(TreeMap map) { try { - 
DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(path + Predefine.VALUE_EXT)); + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.VALUE_EXT))); out.writeInt(map.size()); for (Character character : map.values()) { @@ -151,9 +150,9 @@ public static Character get(String key) return trie.get(key); } - public static BaseSearcher getSearcher(char[] charArray) + public static DoubleArrayTrie.LongestSearcher getSearcher(char[] charArray) { - return new Searcher(charArray, trie); + return trie.getLongestSearcher(charArray, 0); } /** diff --git a/src/main/java/com/hankcs/hanlp/dictionary/nr/NRPattern.java b/src/main/java/com/hankcs/hanlp/dictionary/nr/NRPattern.java index db19e2b64..f1056a908 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/nr/NRPattern.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/nr/NRPattern.java @@ -29,7 +29,7 @@ public enum NRPattern BG, DG, EG, - BXD, +// BXD, BZ, // CD, EE, diff --git a/src/main/java/com/hankcs/hanlp/dictionary/nr/PersonDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/nr/PersonDictionary.java index 4774c3344..611b3d1a2 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/nr/PersonDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/nr/PersonDictionary.java @@ -128,11 +128,11 @@ public static void parsePattern(List nrList, List vertexList, final sbPattern.append(NR.L.toString()); // 对串也做一些修改 listIterator.previous(); - String nowED = current.realWord.substring(current.realWord.length() - 1); - String nowL = current.realWord.substring(0, current.realWord.length() - 1); - listIterator.set(new Vertex(nowED)); - listIterator.add(new Vertex(nowL)); + String EorD = current.realWord.substring(0, 1); + String L = current.realWord.substring(1, current.realWord.length()); + listIterator.set(new Vertex(EorD)); listIterator.next(); + listIterator.add(new Vertex(L)); continue; default: sbPattern.append(nr.toString()); diff --git 
a/src/main/java/com/hankcs/hanlp/dictionary/nr/TranslatedPersonDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/nr/TranslatedPersonDictionary.java index a9255acac..a95c157c3 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/nr/TranslatedPersonDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/nr/TranslatedPersonDictionary.java @@ -17,7 +17,6 @@ import com.hankcs.hanlp.utility.Predefine; import java.io.BufferedReader; -import java.io.FileInputStream; import java.io.InputStreamReader; import java.util.Map; import java.util.TreeMap; @@ -80,7 +79,7 @@ static boolean load() logger.info("音译人名词典" + path + "开始构建双数组……"); trie.build(map); logger.info("音译人名词典" + path + "开始编译DAT文件……"); - logger.info("音译人名词典" + path + "编译结果:" + saveDat(map)); + logger.info("音译人名词典" + path + "编译结果:" + saveDat()); } catch (Exception e) { @@ -93,10 +92,9 @@ static boolean load() /** * 保存dat到磁盘 - * @param map * @return */ - static boolean saveDat(TreeMap map) + static boolean saveDat() { return trie.save(path + Predefine.TRIE_EXT); } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/ns/PlaceDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/ns/PlaceDictionary.java index 3e22bdd4b..db569223a 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/ns/PlaceDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/ns/PlaceDictionary.java @@ -58,8 +58,10 @@ public class PlaceDictionary { long start = System.currentTimeMillis(); dictionary = new NSDictionary(); - dictionary.load(HanLP.Config.PlaceDictionaryPath); - logger.info(HanLP.Config.PlaceDictionaryPath + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); + if (dictionary.load(HanLP.Config.PlaceDictionaryPath)) + logger.info(HanLP.Config.PlaceDictionaryPath + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); + else + throw new IllegalArgumentException(HanLP.Config.PlaceDictionaryPath + "加载失败"); transformMatrixDictionary = new TransformMatrixDictionary(NS.class); 
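The `TransformMatrixDictionary` constructed here inherits `TransformMatrix.load`, shown earlier in this diff, which smooths each raw transition count by `1e-8` and stores it as a negative log-probability (`-log(freq / total)`), so a lower value means a more likely transition. A minimal self-contained sketch of that computation — class and method names below are illustrative, not HanLP API:

```java
// Illustrative sketch of TransformMatrix's cost computation: counts are
// smoothed by 1e-8 and converted to -log probabilities (lower = more likely).
public class TransitionCostSketch
{
    /** -log((count + 1e-8) / total), mirroring TransformMatrix.load */
    public static double cost(int count, int total)
    {
        double frequency = count + 1e-8;
        return -Math.log(frequency / total);
    }

    public static void main(String[] args)
    {
        int[][] matrix = {{6, 2}, {1, 3}}; // toy transition counts
        int[] total = {8, 4};              // row sums, as tallied in load()
        double[][] p = new double[2][2];
        for (int from = 0; from < 2; from++)
            for (int to = 0; to < 2; to++)
                p[from][to] = cost(matrix[from][to], total[from]);
        // A frequent transition costs less than a rare one
        System.out.println(p[0][0] < p[0][1]); // true
    }
}
```

The smoothing term keeps `Math.log` finite for zero counts, which is why the real code never special-cases an empty cell.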
transformMatrixDictionary.load(HanLP.Config.PlaceDictionaryTrPath); trie = new AhoCorasickDoubleArrayTrie(); diff --git a/src/main/java/com/hankcs/hanlp/dictionary/nt/OrganizationDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/nt/OrganizationDictionary.java index 45948ff76..a5a2bf460 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/nt/OrganizationDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/nt/OrganizationDictionary.java @@ -62,8 +62,10 @@ private static void addKeyword(TreeMap patternMap, String keywor { long start = System.currentTimeMillis(); dictionary = new NTDictionary(); - dictionary.load(HanLP.Config.OrganizationDictionaryPath); - logger.info(HanLP.Config.OrganizationDictionaryPath + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); + if (dictionary.load(HanLP.Config.OrganizationDictionaryPath)) + logger.info(HanLP.Config.OrganizationDictionaryPath + "加载成功,耗时" + (System.currentTimeMillis() - start) + "ms"); + else + throw new IllegalArgumentException(HanLP.Config.OrganizationDictionaryPath + "加载失败"); transformMatrixDictionary = new TransformMatrixDictionary(NT.class); transformMatrixDictionary.load(HanLP.Config.OrganizationDictionaryTrPath); trie = new AhoCorasickDoubleArrayTrie(); diff --git a/src/main/java/com/hankcs/hanlp/dictionary/other/CharTable.java b/src/main/java/com/hankcs/hanlp/dictionary/other/CharTable.java index 00e917d96..c33b3d14f 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/other/CharTable.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/other/CharTable.java @@ -12,6 +12,10 @@ package com.hankcs.hanlp.dictionary.other; import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.utility.Predefine; @@ 
-58,11 +62,20 @@ private static boolean load(String path) if (line.length() != 3) continue; CONVERT[line.charAt(0)] = CONVERT[line.charAt(2)]; } + loadSpace(); logger.info("正在缓存字符正规化表到" + binPath); IOUtil.saveObjectTo(CONVERT, binPath); return true; } + + private static void loadSpace() { + for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) { + if (Character.isWhitespace(i) || Character.isSpaceChar(i)) { + CONVERT[i] = ' '; + } + } + } private static boolean loadBin(String path) { @@ -102,16 +115,21 @@ public static char[] convert(char[] charArray) return result; } - public static String convert(String charArray) + public static String convert(String sentence) + { + assert sentence != null; + char[] result = new char[sentence.length()]; + convert(sentence, result); + + return new String(result); + } + + public static void convert(String charArray, char[] result) { - assert charArray != null; - char[] result = new char[charArray.length()]; for (int i = 0; i < charArray.length(); i++) { result[i] = CONVERT[charArray.charAt(i)]; } - - return new String(result); } /** @@ -126,4 +144,20 @@ public static void normalization(char[] charArray) charArray[i] = CONVERT[charArray[i]]; } } + + public static void normalize(Sentence sentence) + { + for (IWord word : sentence) + { + if (word instanceof CompoundWord) + { + for (Word w : ((CompoundWord) word).innerList) + { + w.value = convert(w.value); + } + } + else + word.setValue(word.getValue()); + } + } } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/other/CharType.java b/src/main/java/com/hankcs/hanlp/dictionary/other/CharType.java index f8439a51d..d4f382ae0 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/other/CharType.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/other/CharType.java @@ -13,10 +13,10 @@ import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.utility.TextUtility; import 
java.io.DataOutputStream; -import java.io.FileOutputStream; import java.io.IOException; import java.util.LinkedList; import java.util.List; @@ -131,7 +131,7 @@ private static ByteArray generate() throws IOException typeList.add(array); } // System.out.print("int[" + typeList.size() + "][3] array = \n"); - DataOutputStream out = new DataOutputStream(new FileOutputStream(HanLP.Config.CharTypePath)); + DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(HanLP.Config.CharTypePath)); for (int[] array : typeList) { // System.out.printf("%d %d %d\n", array[0], array[1], array[2]); @@ -154,4 +154,15 @@ public static byte get(char c) { return type[(int) c]; } + + /** + * 设置字符类型 + * + * @param c 字符 + * @param t 类型 + */ + public static void set(char c, byte t) + { + type[c] = t; + } } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/other/PartOfSpeechTagDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/other/PartOfSpeechTagDictionary.java new file mode 100644 index 000000000..7b4064d60 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/dictionary/other/PartOfSpeechTagDictionary.java @@ -0,0 +1,60 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-24 下午6:46 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.dictionary.other; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.util.Map; +import java.util.TreeMap; + +/** + * 词性标注集中英映射表 + * + * @author hankcs + */ +public class PartOfSpeechTagDictionary +{ + /** + * 词性映射表 + */ + public static Map translator = new TreeMap(); + + static + { + load(HanLP.Config.PartOfSpeechTagDictionary); + } + + public static void load(String path) + { + IOUtil.LineIterator iterator = new IOUtil.LineIterator(path); + iterator.next(); // header + while (iterator.hasNext()) + { + String[] args = iterator.next().split(","); + if (args.length < 3) continue; + translator.put(args[1], args[2]); + } + } + + /** + * 翻译词性 + * + * @param tag + * @return + */ + public static String translate(String tag) + { + String cn = translator.get(tag); + if (cn == null) return tag; + return cn; + } +} diff --git a/src/main/java/com/hankcs/hanlp/dictionary/py/Pinyin.java b/src/main/java/com/hankcs/hanlp/dictionary/py/Pinyin.java index ac8c864d6..f89f675c9 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/py/Pinyin.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/py/Pinyin.java @@ -699,7 +699,7 @@ public enum Pinyin lv3(Shengmu.l, Yunmu.v, 3, "lǚ", "lv", Head.l, 'l'), lv4(Shengmu.l, Yunmu.v, 4, "lǜ", "lv", Head.l, 'l'), lve3(Shengmu.l, Yunmu.ve, 3, "lüě", "lve", Head.l, 'l'), - lve4(Shengmu.l, Yunmu.ue, 4, "lüè", "lve", Head.l, 'l'), + lve4(Shengmu.l, Yunmu.ve, 4, "lüè", "lve", Head.l, 'l'), ma1(Shengmu.m, Yunmu.a, 1, "mā", "ma", Head.m, 'm'), ma2(Shengmu.m, Yunmu.a, 2, "má", "ma", Head.m, 'm'), ma3(Shengmu.m, Yunmu.a, 3, "mǎ", "ma", Head.m, 'm'), diff --git a/src/main/java/com/hankcs/hanlp/dictionary/py/PinyinDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/py/PinyinDictionary.java index 60b86e9cd..4dcd6f489 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/py/PinyinDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/py/PinyinDictionary.java @@ -22,6 
+22,7 @@ import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.utility.Predefine; +import java.io.BufferedOutputStream; import java.io.DataOutputStream; import java.io.FileOutputStream; import java.util.*; @@ -107,7 +108,7 @@ static boolean saveDat(String path, AhoCorasickDoubleArrayTrie trie, S { try { - DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT)); + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT))); out.writeInt(entrySet.size()); for (Map.Entry entry : entrySet) { @@ -179,15 +180,21 @@ protected static List segLongest(char[] charArray, AhoCorasickDoubleArra protected static List segLongest(char[] charArray, AhoCorasickDoubleArrayTrie trie, boolean remainNone) { final Pinyin[][] wordNet = new Pinyin[charArray.length][]; + final int[] lengths = new int[charArray.length]; trie.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit() { @Override public void hit(int begin, int end, Pinyin[] value) { int length = end - begin; - if (wordNet[begin] == null || length > wordNet[begin].length) + if (length == 1 && value.length > 1) { - wordNet[begin] = length == 1 ? new Pinyin[]{value[0]} : value; + value = new Pinyin[]{value[0]}; + } + if (length > lengths[begin]) + { + wordNet[begin] = value; + lengths[begin] = length; } } }); @@ -207,7 +214,7 @@ public void hit(int begin, int end, Pinyin[] value) { pinyinList.add(pinyin); } - offset += wordNet[offset].length; + offset += lengths[offset]; } return pinyinList; } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/py/SYTDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/py/SYTDictionary.java deleted file mode 100644 index 65b02af33..000000000 --- a/src/main/java/com/hankcs/hanlp/dictionary/py/SYTDictionary.java +++ /dev/null @@ -1,99 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/11/2 11:41 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. 
All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.dictionary.py; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.collection.set.UnEmptyStringSet; -import com.hankcs.hanlp.corpus.dictionary.StringDictionary; -import com.hankcs.hanlp.corpus.io.IOUtil; - -import java.util.*; - -import static com.hankcs.hanlp.utility.Predefine.logger; - -/** - * 声母韵母音调词典 - * - * @author hankcs - */ -public class SYTDictionary -{ - static Set smSet = new UnEmptyStringSet(); - static Set ymSet = new UnEmptyStringSet(); - static Set ydSet = new UnEmptyStringSet(); - static Map map = new TreeMap(); - - static - { - StringDictionary dictionary = new StringDictionary(); - if (dictionary.load(HanLP.Config.SYTDictionaryPath)) - { - logger.info("载入声母韵母音调词典" + HanLP.Config.SYTDictionaryPath + "成功"); - for (Map.Entry entry : dictionary.entrySet()) - { - // 0 1 2 - // bai1=b,ai,1 - String[] args = entry.getValue().split(","); - if (args[0].length() == 0) args[0] = "none"; - smSet.add(args[0]); - ymSet.add(args[1]); - ydSet.add(args[2]); - String[] valueArray = new String[4]; - System.arraycopy(args, 0, valueArray, 0, args.length); - valueArray[3] = PinyinUtil.convertToneNumber2ToneMark(entry.getKey()); - map.put(entry.getKey(), valueArray); - } - } - else - { - logger.warning("载入声母韵母音调词典" + HanLP.Config.SYTDictionaryPath + "失败"); - } - } - - /** - * 导出声母表等等 - * - * @param path - */ - public static void dumpEnum(String path) - { - dumpEnum(smSet, path + "sm.txt"); - dumpEnum(ymSet, path + "ym.txt"); - dumpEnum(ydSet, path + "yd.txt"); - Set hdSet = new TreeSet(); - for (Pinyin pinyin : PinyinDictionary.pinyins) - { - hdSet.add(pinyin.getHeadString()); - } - dumpEnum(hdSet, path + "head.txt"); - StringBuilder sb = new StringBuilder(); - for (Map.Entry entry : map.entrySet()) - { - // 0声母 1韵母 2音调 3带音标 - String[] value = entry.getValue(); - Pinyin 
pinyin = Pinyin.valueOf(entry.getKey()); - sb.append(entry.getKey() + "(" + Shengmu.class.getSimpleName() + "." + value[0] + ", " + Yunmu.class.getSimpleName() + "." + value[1] + ", " + value[2] + ", \"" + value[3] + "\", \"" + entry.getKey().substring(0, entry.getKey().length() - 1) + "\"" + ", " + Head.class.getSimpleName() + "." + pinyin.getHeadString() + ", '" + pinyin.getFirstChar() + "'" + "),\n"); - } - IOUtil.saveTxt(path + "py.txt", sb.toString()); - } - - private static boolean dumpEnum(Set set, String path) - { - StringBuilder sb = new StringBuilder(); - for (String s : set) - { - sb.append(s); - sb.append(",\n"); - } - return IOUtil.saveTxt(path, sb.toString()); - } -} diff --git a/src/main/java/com/hankcs/hanlp/dictionary/stopword/CoreStopWordDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/stopword/CoreStopWordDictionary.java index 1241a3c18..334a96cb9 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/stopword/CoreStopWordDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/stopword/CoreStopWordDictionary.java @@ -18,8 +18,8 @@ import com.hankcs.hanlp.utility.Predefine; import com.hankcs.hanlp.utility.TextUtility; +import java.io.BufferedOutputStream; import java.io.DataOutputStream; -import java.io.File; import java.util.List; import java.util.ListIterator; import static com.hankcs.hanlp.utility.Predefine.logger; @@ -31,22 +31,44 @@ */ public class CoreStopWordDictionary { - static StopWordDictionary dictionary; + /** + * 储存词条的结构 + */ + public static StopWordDictionary dictionary; static { - ByteArray byteArray = ByteArray.createByteArray(HanLP.Config.CoreStopWordDictionaryPath + Predefine.BIN_EXT); + load(HanLP.Config.CoreStopWordDictionaryPath, true); + } + + /** + * 重新加载{@link HanLP.Config#CoreStopWordDictionaryPath}所指定的停用词词典,并且生成新缓存。 + */ + public static void reload() + { + load(HanLP.Config.CoreStopWordDictionaryPath, false); + } + + /** + * 加载另一部停用词词典 + * @param coreStopWordDictionaryPath 词典路径 + * @param 
loadCacheIfPossible 是否优先加载缓存(速度更快) + */ + public static void load(String coreStopWordDictionaryPath, boolean loadCacheIfPossible) + { + ByteArray byteArray = loadCacheIfPossible ? ByteArray.createByteArray(coreStopWordDictionaryPath + Predefine.BIN_EXT) : null; if (byteArray == null) { try { - dictionary = new StopWordDictionary(HanLP.Config.CoreStopWordDictionaryPath); - DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(HanLP.Config.CoreStopWordDictionaryPath + Predefine.BIN_EXT)); + dictionary = new StopWordDictionary(coreStopWordDictionaryPath); + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(coreStopWordDictionaryPath + Predefine.BIN_EXT))); dictionary.save(out); out.close(); } catch (Exception e) { - logger.severe("载入停用词词典" + HanLP.Config.CoreStopWordDictionaryPath + "失败" + TextUtility.exceptionToString(e)); + logger.severe("载入停用词词典" + coreStopWordDictionaryPath + "失败" + TextUtility.exceptionToString(e)); + throw new RuntimeException("载入停用词词典" + coreStopWordDictionaryPath + "失败"); } } else @@ -148,12 +170,13 @@ public static boolean remove(String stopWord) * 对分词结果应用过滤 * @param termList */ - public static void apply(List termList) + public static List apply(List termList) { ListIterator listIterator = termList.listIterator(); while (listIterator.hasNext()) { if (shouldRemove(listIterator.next())) listIterator.remove(); } + return termList; } } diff --git a/src/main/java/com/hankcs/hanlp/dictionary/ts/BaseChineseDictionary.java b/src/main/java/com/hankcs/hanlp/dictionary/ts/BaseChineseDictionary.java index 9b00786ee..e417e5d2e 100644 --- a/src/main/java/com/hankcs/hanlp/dictionary/ts/BaseChineseDictionary.java +++ b/src/main/java/com/hankcs/hanlp/dictionary/ts/BaseChineseDictionary.java @@ -21,6 +21,7 @@ import com.hankcs.hanlp.dictionary.py.Pinyin; import com.hankcs.hanlp.utility.Predefine; +import java.io.BufferedOutputStream; import java.io.DataOutputStream; import java.io.FileOutputStream; import 
java.util.*; @@ -150,7 +151,7 @@ static boolean saveDat(String path, AhoCorasickDoubleArrayTrie trie, Set } try { - DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(path)); + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path))); out.writeInt(entrySet.size()); for (Map.Entry entry : entrySet) { diff --git a/src/main/java/com/hankcs/hanlp/mining/cluster/Cluster.java b/src/main/java/com/hankcs/hanlp/mining/cluster/Cluster.java new file mode 100644 index 000000000..548b815f1 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/mining/cluster/Cluster.java @@ -0,0 +1,308 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-12 7:11 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.mining.cluster; + +import java.util.*; + +/** + * @author hankcs + */ +public class Cluster implements Comparable> +{ + List> documents_; ///< documents + SparseVector composite_; ///< a composite SparseVector + SparseVector centroid_; ///< a centroid SparseVector + List> sectioned_clusters_; ///< sectioned clusters + double sectioned_gain_; ///< a sectioned gain + Random random; + + public Cluster() + { + this(new ArrayList>()); + } + + public Cluster(List> documents) + { + this.documents_ = documents; + composite_ = new SparseVector(); + random = new Random(); + } + + /** + * Add the vectors of all documents to a composite vector. + */ + void set_composite_vector() + { + composite_.clear(); + for (Document document : documents_) + { + composite_.add_vector(document.feature()); + } + } + + /** + * Clear status. + */ + void clear() + { + documents_.clear(); + composite_.clear(); + if (centroid_ != null) + centroid_.clear(); + if (sectioned_clusters_ != null) + sectioned_clusters_.clear(); + sectioned_gain_ = 0.0; + } + + + /** + * Get the size. 
+ * + * @return the size of this cluster + */ + int size() + { + return documents_.size(); + } + + /** + * Get the pointer of a centroid vector. + * + * @return the pointer of a centroid vector + */ + SparseVector centroid_vector() + { + if (documents_.size() > 0 && composite_.size() == 0) + set_composite_vector(); + centroid_ = (SparseVector) composite_vector().clone(); + centroid_.normalize(); + return centroid_; + } + + /** + * Get the pointer of a composite vector. + * + * @return the pointer of a composite vector + */ + SparseVector composite_vector() + { + return composite_; + } + + /** + * Get documents in this cluster. + * + * @return documents in this cluster + */ + List> documents() + { + return documents_; + } + + /** + * Add a document. + * + * @param doc the pointer of a document object + */ + void add_document(Document doc) + { + doc.feature().normalize(); + documents_.add(doc); + composite_.add_vector(doc.feature()); + } + + /** + * Remove a document from this cluster. + * + * @param index the index of vector container of documents + */ + void remove_document(int index) + { + ListIterator> listIterator = documents_.listIterator(index); + Document document = listIterator.next(); + listIterator.set(null); + composite_.sub_vector(document.feature()); + } + + /** + * Delete removed documents from the internal container. + */ + void refresh() + { + ListIterator> listIterator = documents_.listIterator(); + while (listIterator.hasNext()) + { + if (listIterator.next() == null) + listIterator.remove(); + } + } + + /** + * Get a gain when this cluster sectioned. + * + * @return a gain + */ + double sectioned_gain() + { + return sectioned_gain_; + } + + /** + * Set a gain when the cluster sectioned. 
+ */ + void set_sectioned_gain() + { + double gain = 0.0f; + if (sectioned_gain_ == 0 && sectioned_clusters_.size() > 1) + { + for (Cluster cluster : sectioned_clusters_) + { + gain += cluster.composite_vector().norm(); + } + gain -= composite_.norm(); + } + sectioned_gain_ = gain; + } + + /** + * Get sectioned clusters. + * + * @return sectioned clusters + */ + List> sectioned_clusters() + { + return sectioned_clusters_; + } + +// /** +// * Choose documents randomly. +// */ +// void choose_randomly(int ndocs, List docs) +//{ +// HashMap.type choosed; +// int siz = size(); +// init_hash_map(siz, choosed, ndocs); +// if (siz < ndocs) +// ndocs = siz; +// int count = 0; +// while (count < ndocs) +// { +// int index = myrand(seed_) % siz; +// if (choosed.find(index) == choosed.end()) +// { +// choosed.insert(std.pair(index, true)); +// docs.push_back(documents_[index]); +// ++count; +// } +// } +//} + + /** + * 选取初始质心 + * + * @param ndocs 质心数量 + * @param docs 输出到该列表中 + */ + void choose_smartly(int ndocs, List docs) + { + int siz = size(); + double[] closest = new double[siz]; + if (siz < ndocs) + ndocs = siz; + int index, count = 0; + + index = random.nextInt(siz); // initial center + docs.add(documents_.get(index)); + ++count; + double potential = 0.0; + for (int i = 0; i < documents_.size(); i++) + { + double dist = 1.0 - SparseVector.inner_product(documents_.get(i).feature(), documents_.get(index).feature()); + potential += dist; + closest[i] = dist; + } + + // choose each center + while (count < ndocs) + { + double randval = random.nextDouble() * potential; + + for (index = 0; index < documents_.size(); index++) + { + double dist = closest[index]; + if (randval <= dist) + break; + randval -= dist; + } + if (index == documents_.size()) + index--; + docs.add(documents_.get(index)); + ++count; + + double new_potential = 0.0; + for (int i = 0; i < documents_.size(); i++) + { + double dist = 1.0 - SparseVector.inner_product(documents_.get(i).feature(), 
documents_.get(index).feature()); + double min = closest[i]; + if (dist < min) + { + closest[i] = dist; + min = dist; + } + new_potential += min; + } + potential = new_potential; + } + } + + /** + * 将本簇划分为nclusters个簇 + * + * @param nclusters + */ + void section(int nclusters) + { + if (size() < nclusters) + throw new IllegalArgumentException("簇数目小于文档数目"); + + sectioned_clusters_ = new ArrayList>(nclusters); + List centroids = new ArrayList(nclusters); + // choose_randomly(nclusters, centroids); + choose_smartly(nclusters, centroids); + for (int i = 0; i < centroids.size(); i++) + { + Cluster cluster = new Cluster(); + sectioned_clusters_.add(cluster); + } + + for (Document d : documents_) + { + double max_similarity = -1.0; + int max_index = 0; + for (int j = 0; j < centroids.size(); j++) + { + double similarity = SparseVector.inner_product(d.feature(), centroids.get(j).feature()); + if (max_similarity < similarity) + { + max_similarity = similarity; + max_index = j; + } + } + sectioned_clusters_.get(max_index).add_document(d); + } + } + + @Override + public int compareTo(Cluster o) + { + return Double.compare(o.sectioned_gain(), sectioned_gain()); + } +} diff --git a/src/main/java/com/hankcs/hanlp/mining/cluster/ClusterAnalyzer.java b/src/main/java/com/hankcs/hanlp/mining/cluster/ClusterAnalyzer.java new file mode 100644 index 000000000..0f961885c --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/mining/cluster/ClusterAnalyzer.java @@ -0,0 +1,460 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-12 6:37 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
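`Cluster.choose_smartly` above seeds the centroids k-means++-style: after a uniformly random first center, each subsequent center is drawn with probability proportional to its distance from the nearest already-chosen center. A self-contained sketch of just the weighted draw (method and class names are hypothetical, not part of the diff):

```java
import java.util.Random;

public class KMeansPlusPlusSketch {
    // Given each point's distance to its nearest chosen center, draw the next
    // center index with probability proportional to that distance, mirroring
    // the randval loop in Cluster.choose_smartly above.
    static int drawWeighted(double[] closest, Random random) {
        double potential = 0.0;
        for (double d : closest) potential += d;
        double randval = random.nextDouble() * potential;
        int index;
        for (index = 0; index < closest.length; index++) {
            if (randval <= closest[index]) break;
            randval -= closest[index];
        }
        // guard against falling off the end, as the diff does with index--
        return Math.min(index, closest.length - 1);
    }

    public static void main(String[] args) {
        // points 0 and 1 coincide with existing centers (distance 0),
        // so point 2 carries essentially all of the selection weight
        double[] closest = {0.0, 0.0, 5.0};
        System.out.println(drawWeighted(closest, new Random(42)));
    }
}
```

Weighting by distance spreads the initial centers apart, which is what makes the subsequent refinement loop converge to better sections than purely random seeding.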
+ * + */ +package com.hankcs.hanlp.mining.cluster; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.classification.utilities.TextProcessUtility; +import com.hankcs.hanlp.collection.trie.datrie.MutableDoubleArrayTrieInteger; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.utility.MathUtility; + +import java.io.File; +import java.io.IOException; +import java.util.*; + +import static com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; + +/** + * 文本聚类 + * + * @param 文档的id类型 + * @author hankcs + */ +public class ClusterAnalyzer +{ + protected HashMap> documents_; + protected Segment segment; + protected MutableDoubleArrayTrieInteger vocabulary; + static final int NUM_REFINE_LOOP = 30; + + public ClusterAnalyzer() + { + documents_ = new HashMap>(); + segment = HanLP.newSegment(); + vocabulary = new MutableDoubleArrayTrieInteger(); + } + + protected int id(String word) + { + int id = vocabulary.get(word); + if (id == -1) + { + id = vocabulary.size(); + vocabulary.put(word, id); + } + return id; + } + + /** + * 重载此方法实现自己的预处理逻辑(预处理、分词、去除停用词) + * + * @param document 文档 + * @return 单词列表 + */ + protected List preprocess(String document) + { + List termList = segment.seg(document); + ListIterator listIterator = termList.listIterator(); + while (listIterator.hasNext()) + { + Term term = listIterator.next(); + if (CoreStopWordDictionary.contains(term.word) || + term.nature.startsWith("w") + ) + { + listIterator.remove(); + } + } + List wordList = new ArrayList(termList.size()); + for (Term term : termList) + { + wordList.add(term.word); + } + return wordList; + } + + protected SparseVector toVector(List wordList) + { + SparseVector vector = new SparseVector(); + for (String word : wordList) + { + int id = id(word); + Double f = vector.get(id); + if (f == null) + { + f = 1.; + 
vector.put(id, f); + } + else + { + vector.put(id, ++f); + } + } + return vector; + } + + /** + * 添加文档 + * + * @param id 文档id + * @param document 文档内容 + * @return 文档对象 + */ + public Document addDocument(K id, String document) + { + return addDocument(id, preprocess(document)); + } + + /** + * 添加文档 + * + * @param id 文档id + * @param document 文档内容 + * @return 文档对象 + */ + public Document addDocument(K id, List document) + { + SparseVector vector = toVector(document); + Document d = new Document(id, vector); + return documents_.put(id, d); + } + + /** + * k-means聚类 + * + * @param nclusters 簇的数量 + * @return 指定数量的簇(Set)构成的集合 + */ + public List> kmeans(int nclusters) + { + if (nclusters > size()) + { + logger.err("传入聚类数目%d大于文档数量%d,已纠正为文档数量\n", nclusters, size()); + nclusters = size(); + } + Cluster cluster = new Cluster(); + for (Document document : documents_.values()) + { + cluster.add_document(document); + } + cluster.section(nclusters); + refine_clusters(cluster.sectioned_clusters()); + List> clusters_ = new ArrayList>(nclusters); + for (Cluster s : cluster.sectioned_clusters()) + { + s.refresh(); + clusters_.add(s); + } + return toResult(clusters_); + } + + /** + * 已向聚类分析器添加的文档数量 + * + * @return 文档数量 + */ + public int size() + { + return this.documents_.size(); + } + + private List> toResult(List> clusters_) + { + List> result = new ArrayList>(clusters_.size()); + for (Cluster c : clusters_) + { + Set s = new HashSet(); + for (Document d : c.documents_) + { + s.add(d.id_); + } + result.add(s); + } + return result; + } + + /** + * repeated bisection 聚类 + * + * @param nclusters 簇的数量 + * @return 指定数量的簇(Set)构成的集合 + */ + public List> repeatedBisection(int nclusters) + { + return repeatedBisection(nclusters, 0); + } + + /** + * repeated bisection 聚类 + * + * @param limit_eval 准则函数增幅阈值 + * @return 指定数量的簇(Set)构成的集合 + */ + public List> repeatedBisection(double limit_eval) + { + return repeatedBisection(0, limit_eval); + } + + /** + * repeated bisection 聚类 + * + * @param 
nclusters 簇的数量 + * @param limit_eval 准则函数增幅阈值 + * @return 指定数量的簇(Set)构成的集合 + */ + public List> repeatedBisection(int nclusters, double limit_eval) + { + if (nclusters > size()) + { + logger.err("传入聚类数目%d大于文档数量%d,已纠正为文档数量\n", nclusters, size()); + nclusters = size(); + } + Cluster cluster = new Cluster(); + List> clusters_ = new ArrayList>(nclusters > 0 ? nclusters : 16); + for (Document document : documents_.values()) + { + cluster.add_document(document); + } + + PriorityQueue> que = new PriorityQueue>(); + cluster.section(2); + refine_clusters(cluster.sectioned_clusters()); + cluster.set_sectioned_gain(); + cluster.composite_vector().clear(); + que.add(cluster); + + while (!que.isEmpty()) + { + if (nclusters > 0 && que.size() >= nclusters) + break; + cluster = que.peek(); + if (cluster.sectioned_clusters().size() < 1) + break; + if (limit_eval > 0 && cluster.sectioned_gain() < limit_eval) + break; + que.poll(); + List> sectioned = cluster.sectioned_clusters(); + + for (Cluster c : sectioned) + { + if (c.size() >= 2) + { + c.section(2); + refine_clusters(c.sectioned_clusters()); + c.set_sectioned_gain(); + if (c.sectioned_gain() < limit_eval) + { + for (Cluster sub : c.sectioned_clusters()) + { + sub.clear(); + } + } + c.composite_vector().clear(); + } + que.add(c); + } + } + while (!que.isEmpty()) + { + clusters_.add(0, que.poll()); + } + return toResult(clusters_); + } + + /** + * 根据k-means算法迭代优化聚类 + * + * @param clusters 簇 + * @return 准则函数的值 + */ + double refine_clusters(List> clusters) + { + double[] norms = new double[clusters.size()]; + int offset = 0; + for (Cluster cluster : clusters) + { + norms[offset++] = cluster.composite_vector().norm(); + } + + double eval_cluster = 0.0; + int loop_count = 0; + while (loop_count++ < NUM_REFINE_LOOP) + { + List items = new ArrayList(size()); + for (int i = 0; i < clusters.size(); i++) + { + for (int j = 0; j < clusters.get(i).documents().size(); j++) + { + items.add(new int[]{i, j}); + } + } + 
Collections.shuffle(items); + + boolean changed = false; + for (int[] item : items) + { + int cluster_id = item[0]; + int item_id = item[1]; + Cluster cluster = clusters.get(cluster_id); + Document doc = cluster.documents().get(item_id); + double value_base = refined_vector_value(cluster.composite_vector(), doc.feature(), -1); + double norm_base_moved = Math.pow(norms[cluster_id], 2) + value_base; + norm_base_moved = norm_base_moved > 0 ? Math.sqrt(norm_base_moved) : 0.0; + + double eval_max = -1.0; + double norm_max = 0.0; + int max_index = 0; + for (int j = 0; j < clusters.size(); j++) + { + if (cluster_id == j) + continue; + Cluster other = clusters.get(j); + double value_target = refined_vector_value(other.composite_vector(), doc.feature(), 1); + double norm_target_moved = Math.pow(norms[j], 2) + value_target; + norm_target_moved = norm_target_moved > 0 ? Math.sqrt(norm_target_moved) : 0.0; + double eval_moved = norm_base_moved + norm_target_moved - norms[cluster_id] - norms[j]; + if (eval_max < eval_moved) + { + eval_max = eval_moved; + norm_max = norm_target_moved; + max_index = j; + } + } + if (eval_max > 0) + { + eval_cluster += eval_max; + clusters.get(max_index).add_document(doc); + clusters.get(cluster_id).remove_document(item_id); + norms[cluster_id] = norm_base_moved; + norms[max_index] = norm_max; + changed = true; + } + } + if (!changed) + break; + for (Cluster cluster : clusters) + { + cluster.refresh(); + } + } + return eval_cluster; + } + + /** + * c^2 - 2c(a + c) + d^2 - 2d(b + d) + * + * @param composite (a+c,b+d) + * @param vec (c,d) + * @param sign + * @return + */ + double refined_vector_value(SparseVector composite, SparseVector vec, int sign) + { + double sum = 0.0; + for (Map.Entry entry : vec.entrySet()) + { + sum += Math.pow(entry.getValue(), 2) + sign * 2 * composite.get(entry.getKey()) * entry.getValue(); + } + return sum; + } + + /** + * 训练模型 + * + * @param folderPath 分类语料的根目录.目录必须满足如下结构:
+ *                   根目录
+ *                   ├── 分类A
+ *                   │   └── 1.txt
+ *                   │   └── 2.txt
+ *                   │   └── 3.txt
+ *                   ├── 分类B
+ *                   │   └── 1.txt
+ *                   │   └── ...
+ *                   └── ...
+ * 文件不一定需要用数字命名,也不需要以txt作为后缀名,但一定需要是文本文件. + * @param algorithm kmeans 或 repeated bisection + * @throws IOException 任何可能的IO异常 + */ + public static double evaluate(String folderPath, String algorithm) + { + if (folderPath == null) throw new IllegalArgumentException("参数 folderPath == null"); + File root = new File(folderPath); + if (!root.exists()) throw new IllegalArgumentException(String.format("目录 %s 不存在", root.getAbsolutePath())); + if (!root.isDirectory()) + throw new IllegalArgumentException(String.format("目录 %s 不是一个目录", root.getAbsolutePath())); + + ClusterAnalyzer analyzer = new ClusterAnalyzer(); + File[] folders = root.listFiles(); + if (folders == null) return 1.; + logger.start("根目录:%s\n加载中...\n", folderPath); + int docSize = 0; + int[] ni = new int[folders.length]; + String[] cat = new String[folders.length]; + int offset = 0; + for (File folder : folders) + { + if (folder.isFile()) continue; + File[] files = folder.listFiles(); + if (files == null) continue; + String category = folder.getName(); + cat[offset] = category; + logger.out("[%s]...", category); + int b = 0; + int e = files.length; + + int logEvery = (int) Math.ceil((e - b) / 10000f); + for (int i = b; i < e; i++) + { + analyzer.addDocument(folder.getName() + " " + files[i].getName(), IOUtil.readTxt(files[i].getAbsolutePath())); + if (i % logEvery == 0) + { + logger.out("%c[%s]...%.2f%%", 13, category, MathUtility.percentage(i - b + 1, e - b)); + } + ++docSize; + ++ni[offset]; + } + logger.out(" %d 篇文档\n", e - b); + ++offset; + } + logger.finish(" 加载了 %d 个类目,共 %d 篇文档\n", folders.length, docSize); + logger.start(algorithm + "聚类中..."); + List> clusterList = algorithm.replaceAll("[-\\s]", "").toLowerCase().equals("kmeans") ? 
+ analyzer.kmeans(ni.length) : analyzer.repeatedBisection(ni.length); + logger.finish(" 完毕。\n"); + double[] fi = new double[ni.length]; + for (int i = 0; i < ni.length; i++) + { + for (Set j : clusterList) + { + int nij = 0; + for (String d : j) + { + if (d.startsWith(cat[i])) + ++nij; + } + if (nij == 0) continue; + double p = nij / (double) (j.size()); + double r = nij / (double) (ni[i]); + double f = 2 * p * r / (p + r); + fi[i] = Math.max(fi[i], f); + } + } + double f = 0; + for (int i = 0; i < fi.length; i++) + { + f += fi[i] * ni[i] / docSize; + } + return f; + } +} diff --git a/src/main/java/com/hankcs/hanlp/mining/cluster/Document.java b/src/main/java/com/hankcs/hanlp/mining/cluster/Document.java new file mode 100644 index 000000000..43d5f9ae9 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/mining/cluster/Document.java @@ -0,0 +1,117 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-12 7:15 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.mining.cluster; + +import java.util.HashMap; +import java.util.Map; + +/** + * @author hankcs + */ +public class Document +{ + K id_; /// the identifier of a document + SparseVector feature_; /// feature vector of a document + + public Document(K id_, SparseVector feature_) + { + this.id_ = id_; + this.feature_ = feature_; + } + + public Document(K id_) + { + this(id_, new SparseVector()); + } + + /** + * Get an identifier. + * + * @return an identifier + */ + K id() + { + return id_; + } + + /** + * Get the pointer of a feature vector + * + * @return the pointer of a feature vector + */ + SparseVector feature() + { + return feature_; + } + + + /** + * Add a feature. + * + * @param key the key of a feature + * @param value the value of a feature + */ + void add_feature(int key, double value) + { + feature_.put(key, value); + } + + /** + * Set features. 
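The tail of `evaluate()` above scores a clustering against the gold categories with the usual set-matching F-measure: for category *i* and cluster *j* with overlap `nij`, precision is `nij / |j|`, recall is `nij / ni`, and `F = 2pr/(p+r)`; each category keeps its best-matching cluster's F, weighted by category size. A self-contained sketch of the per-pair score (the helper name is hypothetical):

```java
public class ClusterFMeasure {
    // F1 between one gold category (size ni) and one predicted cluster
    // (size clusterSize) sharing nij documents, as computed at the end
    // of ClusterAnalyzer.evaluate above.
    static double f1(int nij, int clusterSize, int ni) {
        if (nij == 0) return 0.0;       // no overlap, no credit
        double p = nij / (double) clusterSize;  // precision
        double r = nij / (double) ni;           // recall
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // 8 of the cluster's 10 documents belong to a category of 16 docs:
        // p = 0.8, r = 0.5, F = 0.8/1.3
        System.out.println(f1(8, 10, 16));
    }
}
```

Taking the maximum over clusters per category means a perfect clustering scores 1.0, while merging or splitting categories drives precision or recall (and hence F) down.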
+ * + * @param feature a feature vector + */ + void set_features(SparseVector feature) + { + feature_ = feature; + } + + /** + * Clear features. + */ + void clear() + { + feature_.clear(); + } + + /** + * Apply IDF(inverse document frequency) weighting. + * + * @param df document frequencies + * @param ndocs the number of documents + */ + void idf(HashMap df, int ndocs) + { + for (Map.Entry entry : feature_.entrySet()) + { + Integer denom = df.get(entry.getKey()); + if (denom == null) denom = 1; + entry.setValue((double) (entry.getValue() * Math.log(ndocs / denom))); + } + } + + @Override + public boolean equals(Object o) + { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + + Document document = (Document) o; + + return id_ != null ? id_.equals(document.id_) : document.id_ == null; + } + + @Override + public int hashCode() + { + return id_ != null ? id_.hashCode() : 0; + } +} diff --git a/src/main/java/com/hankcs/hanlp/mining/cluster/SparseVector.java b/src/main/java/com/hankcs/hanlp/mining/cluster/SparseVector.java new file mode 100644 index 000000000..c24616bb3 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/mining/cluster/SparseVector.java @@ -0,0 +1,204 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-12 6:40 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.mining.cluster; + +import java.util.Iterator; +import java.util.Map; +import java.util.TreeMap; + +/** + * @author hankcs + */ +public class SparseVector extends TreeMap +{ + @Override + public Double get(Object key) + { + Double v = super.get(key); + if (v == null) return 0.; + return v; + } + + /** + * Normalize a vector. + */ + void normalize() + { + double nrm = norm(); + for (Map.Entry d : entrySet()) + { + d.setValue(d.getValue() / nrm); + } + } + + /** + * Calculate a squared norm. 
+ */ + double norm_squared() + { + double sum = 0; + for (Double point : values()) + { + sum += point * point; + } + return sum; + } + + /** + * Calculate a norm. + */ + double norm() + { + return (double) Math.sqrt(norm_squared()); + } + + /** + * Multiply each value of avector by a constant value. + */ + void multiply_constant(double x) + { + for (Map.Entry entry : entrySet()) + { + entry.setValue(entry.getValue() * x); + } + } + + /** + * Add other vector. + */ + void add_vector(SparseVector vec) + { + + for (Map.Entry entry : vec.entrySet()) + { + Double v = get(entry.getKey()); + if (v == null) + v = 0.; + put(entry.getKey(), v + entry.getValue()); + } + } + + /** + * Subtract other vector. + */ + void sub_vector(SparseVector vec) + { + + for (Map.Entry entry : vec.entrySet()) + { + Double v = get(entry.getKey()); + if (v == null) + v = 0.; + put(entry.getKey(), v - entry.getValue()); + } + } + +// /** +// * Calculate the squared euclid distance between vectors. +// */ +// double euclid_distance_squared(const Vector &vec1, const Vector &vec2) +//{ +// HashMap::type done; +// init_hash_map(VECTOR_EMPTY_KEY, done, vec1.size()); +// VecHashMap::const_iterator it1, it2; +// double dist = 0; +// for (it1 = vec1.hash_map()->begin(); it1 != vec1.hash_map()->end(); ++it1) +// { +// double val = vec2.get(it1->first); +// dist += (it1->second - val) * (it1->second - val); +// done[it1->first] = true; +// } +// for (it2 = vec2.hash_map()->begin(); it2 != vec2.hash_map()->end(); ++it2) +// { +// if (done.find(it2->first) == done.end()) +// { +// double val = vec1.get(it2->first); +// dist += (it2->second - val) * (it2->second - val); +// } +// } +// return dist; +//} +// +// /** +// * Calculate the euclid distance between vectors. +// */ +// double euclid_distance(const Vector &vec1, const Vector &vec2) +//{ +// return sqrt(euclid_distance_squared(vec1, vec2)); +//} + + /** + * Calculate the inner product value between vectors. 
+     */
+    static double inner_product(SparseVector vec1, SparseVector vec2)
+    {
+        Iterator<Map.Entry<Integer, Double>> it;
+        SparseVector other;
+        if (vec1.size() < vec2.size())
+        {
+            it = vec1.entrySet().iterator();
+            other = vec2;
+        }
+        else
+        {
+            it = vec2.entrySet().iterator();
+            other = vec1;
+        }
+        double prod = 0;
+        while (it.hasNext())
+        {
+            Map.Entry<Integer, Double> entry = it.next();
+            prod += entry.getValue() * other.get(entry.getKey());
+        }
+        return prod;
+    }
+
+    /**
+     * Calculate the cosine value between vectors.
+     */
+    double cosine(SparseVector vec1, SparseVector vec2)
+    {
+        double norm1 = vec1.norm();
+        double norm2 = vec2.norm();
+        if (norm1 == 0 || norm2 == 0)
+        {
+            return 0.0;
+        }
+        double prod = inner_product(vec1, vec2);
+        double result = prod / (norm1 * norm2);
+        return Double.isNaN(result) ? 0.0 : result;
+    }
+
+//    /**
+//     * Calculate the Jaccard coefficient value between vectors.
+//     */
+//    double jaccard(const Vector &vec1, const Vector &vec2)
+//{
+//    double norm1 = vec1.norm();
+//    double norm2 = vec2.norm();
+//    double prod = inner_product(vec1, vec2);
+//    double denom = norm1 + norm2 - prod;
+//    double result = 0.0;
+//    if (!denom)
+//    {
+//        return result;
+//    }
+//    else
+//    {
+//        result = prod / denom;
+//        return isnan(result) ? 0.0 : result;
+//    }
+//}
+}
diff --git a/src/main/java/com/hankcs/hanlp/mining/cluster/package-info.java b/src/main/java/com/hankcs/hanlp/mining/cluster/package-info.java
new file mode 100644
index 000000000..0d175f435
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/mining/cluster/package-info.java
@@ -0,0 +1,16 @@
+/*
+ * Han He
+ * me@hankcs.com
+ * 2018-08-19 8:19 PM
+ *
+ *
+ * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
+ * This source is subject to Han He. Please contact Han He for more information.
+ *
+ */
+/**
+ * Text clustering module (k-means and repeated bisection).
+ * Reference: Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques[C]//KDD workshop on text mining.
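SparseVector's inner_product iterates over the smaller of the two maps and looks each key up in the larger one, so the dot product costs only as much as the smaller vector. A self-contained sketch of that idea over a plain TreeMap<Integer, Double> — the class and method names here are illustrative, not part of HanLP's API:

```java
import java.util.Map;
import java.util.TreeMap;

public class SparseCosineDemo
{
    // Dot product of two sparse vectors: iterate the smaller map and look
    // each key up in the larger one; absent keys contribute zero.
    static double innerProduct(TreeMap<Integer, Double> a, TreeMap<Integer, Double> b)
    {
        TreeMap<Integer, Double> small = a.size() < b.size() ? a : b;
        TreeMap<Integer, Double> large = (small == a) ? b : a;
        double prod = 0;
        for (Map.Entry<Integer, Double> e : small.entrySet())
        {
            Double v = large.get(e.getKey());
            if (v != null) prod += e.getValue() * v;
        }
        return prod;
    }

    static double norm(TreeMap<Integer, Double> v)
    {
        double sum = 0;
        for (double x : v.values()) sum += x * x;
        return Math.sqrt(sum);
    }

    static double cosine(TreeMap<Integer, Double> a, TreeMap<Integer, Double> b)
    {
        double n1 = norm(a), n2 = norm(b);
        if (n1 == 0 || n2 == 0) return 0; // an empty vector is similar to nothing
        double r = innerProduct(a, b) / (n1 * n2);
        return Double.isNaN(r) ? 0 : r;
    }

    public static void main(String[] args)
    {
        TreeMap<Integer, Double> v1 = new TreeMap<Integer, Double>();
        TreeMap<Integer, Double> v2 = new TreeMap<Integer, Double>();
        v1.put(1, 1.0); v1.put(2, 2.0);
        v2.put(2, 2.0); v2.put(3, 1.0);
        System.out.println(cosine(v1, v2)); // overlap only on feature 2
    }
}
```

The asymmetric iteration matters for document clustering, where a short query vector is compared against much denser centroid vectors.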
2000, 400(1): 525-526. + * 实现上参考了 https://github.com/fujimizu/bayon 的C++代码。 + */ +package com.hankcs.hanlp.mining.cluster; \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/mining/phrase/MutualInformationEntropyPhraseExtractor.java b/src/main/java/com/hankcs/hanlp/mining/phrase/MutualInformationEntropyPhraseExtractor.java index aa10c78c0..c9ff7af13 100644 --- a/src/main/java/com/hankcs/hanlp/mining/phrase/MutualInformationEntropyPhraseExtractor.java +++ b/src/main/java/com/hankcs/hanlp/mining/phrase/MutualInformationEntropyPhraseExtractor.java @@ -22,6 +22,9 @@ import java.util.LinkedList; import java.util.List; +import static com.hankcs.hanlp.corpus.tag.Nature.nx; +import static com.hankcs.hanlp.corpus.tag.Nature.t; + /** * 利用互信息和左右熵的短语提取器 * @author hankcs @@ -41,12 +44,8 @@ public List extractPhrase(String text, int size) @Override public boolean shouldInclude(Term term) { - switch (term.nature) - { - case t: - case nx: - return false; - } + if (term.nature == t || term.nature == nx) + return false; return true; } } diff --git a/src/main/java/com/hankcs/hanlp/mining/word/NewWordDiscover.java b/src/main/java/com/hankcs/hanlp/mining/word/NewWordDiscover.java index 3e9665c55..247cabbcc 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word/NewWordDiscover.java +++ b/src/main/java/com/hankcs/hanlp/mining/word/NewWordDiscover.java @@ -58,7 +58,7 @@ public List discover(BufferedReader reader, int size) throws IOExcepti String doc; Map word_cands = new TreeMap(); int totalLength = 0; - Pattern delimiter = Pattern.compile("[\\s\\d,.<>/?:;'\"\\[\\]{}()\\|~!@#$%^&*\\-_=+a-zA-Z,。《》、?:;“”‘’{}【】()…¥!—┄-]+"); + Pattern delimiter = Pattern.compile("[\\s\\d,.<>/?:;'\"\\[\\]{}()\\|~!@#$%^&*\\-_=+,。《》、?:;“”‘’{}【】()…¥!—┄-]+"); while ((doc = reader.readLine()) != null) { doc = delimiter.matcher(doc).replaceAll("\0"); @@ -69,6 +69,8 @@ public List discover(BufferedReader reader, int size) throws IOExcepti for (int j = i + 1; j < end; ++j) { String 
word = doc.substring(i, j); + if (word.indexOf('\0') >= 0) + continue; // 含有分隔符的不认为是词语 WordInfo info = word_cands.get(word); if (info == null) { diff --git a/src/main/java/com/hankcs/hanlp/mining/word/TermFrequencyCounter.java b/src/main/java/com/hankcs/hanlp/mining/word/TermFrequencyCounter.java new file mode 100644 index 000000000..d0330d51d --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/mining/word/TermFrequencyCounter.java @@ -0,0 +1,251 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-31 9:16 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.mining.word; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.algorithm.MaxHeap; +import com.hankcs.hanlp.corpus.occurrence.TermFrequency; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.summary.KeywordExtractor; + +import java.util.*; + +/** + * 词频统计工具 + * + * @author hankcs + */ +public class TermFrequencyCounter extends KeywordExtractor implements Collection +{ + boolean filterStopWord; + Map termFrequencyMap; + + /** + * 构造 + * + * @param filterStopWord 是否过滤停用词 + * @param segment 分词器 + */ + public TermFrequencyCounter(Segment segment, boolean filterStopWord) + { + this.filterStopWord = filterStopWord; + this.defaultSegment = segment; + termFrequencyMap = new TreeMap(); + } + + public TermFrequencyCounter() + { + this(HanLP.newSegment(), true); + } + + public void add(String document) + { + if (document == null || document.isEmpty()) return; + List termList = defaultSegment.seg(document); + add(termList); + } + + public void add(List termList) + { + if (filterStopWord) + { + filter(termList); + } + for (Term term : termList) + { + String word = term.word; + TermFrequency frequency = termFrequencyMap.get(word); + if (frequency == null) + { + frequency = new TermFrequency(word); + 
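The NewWordDiscover change above first collapses delimiters to '\0', then enumerates every substring up to a maximum length as a word candidate and rejects any candidate containing '\0', so candidates can never span a delimiter. A standalone sketch of that enumeration, with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

public class NgramCandidateDemo
{
    // Count every substring of length 1..maxLen as a word candidate,
    // skipping candidates that cross a delimiter (pre-replaced by '\0').
    static Map<String, Integer> candidates(String doc, int maxLen)
    {
        Map<String, Integer> count = new HashMap<String, Integer>();
        for (int i = 0; i < doc.length(); ++i)
        {
            int end = Math.min(i + 1 + maxLen, doc.length() + 1);
            for (int j = i + 1; j < end; ++j)
            {
                String word = doc.substring(i, j);
                if (word.indexOf('\0') >= 0) continue; // spans a delimiter
                Integer c = count.get(word);
                count.put(word, c == null ? 1 : c + 1);
            }
        }
        return count;
    }

    public static void main(String[] args)
    {
        // The punctuation between the two clauses has been replaced by '\0'.
        Map<String, Integer> count = candidates("新华社记者\0新华社报道", 3);
        System.out.println(count.get("新华社")); // prints 2
    }
}
```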
termFrequencyMap.put(word, frequency); + } + else + { + frequency.increase(); + } + } + } + + /** + * 取前N个高频词 + * + * @param N + * @return + */ + public Collection top(int N) + { + MaxHeap heap = new MaxHeap(N, new Comparator() + { + @Override + public int compare(TermFrequency o1, TermFrequency o2) + { + return o1.compareTo(o2); + } + }); + heap.addAll(termFrequencyMap.values()); + return heap.toList(); + } + + /** + * 所有词汇的频次 + * + * @return + */ + public Collection all() + { + return termFrequencyMap.values(); + } + + @Override + public int size() + { + return termFrequencyMap.size(); + } + + @Override + public boolean isEmpty() + { + return termFrequencyMap.isEmpty(); + } + + @Override + public boolean contains(Object o) + { + if (o instanceof String) + return termFrequencyMap.containsKey(o); + else if (o instanceof TermFrequency) + return termFrequencyMap.containsValue(o); + return false; + } + + @Override + public Iterator iterator() + { + return termFrequencyMap.values().iterator(); + } + + @Override + public Object[] toArray() + { + return termFrequencyMap.values().toArray(); + } + + @Override + public T[] toArray(T[] a) + { + return termFrequencyMap.values().toArray(a); + } + + @Override + public boolean add(TermFrequency termFrequency) + { + TermFrequency tf = termFrequencyMap.get(termFrequency.getTerm()); + if (tf == null) + { + termFrequencyMap.put(termFrequency.getKey(), termFrequency); + return true; + } + tf.increase(termFrequency.getFrequency()); + return false; + } + + @Override + public boolean remove(Object o) + { + return termFrequencyMap.remove(o) != null; + } + + @Override + public boolean containsAll(Collection c) + { + for (Object o : c) + { + if (!contains(o)) + return false; + } + return true; + } + + @Override + public boolean addAll(Collection c) + { + for (TermFrequency termFrequency : c) + { + add(termFrequency); + } + return !c.isEmpty(); + } + + @Override + public boolean removeAll(Collection c) + { + for (Object o : c) + { + if 
(!remove(o)) + return false; + } + return true; + } + + @Override + public boolean retainAll(Collection c) + { + return termFrequencyMap.values().retainAll(c); + } + + @Override + public void clear() + { + termFrequencyMap.clear(); + } + + /** + * 提取关键词(非线程安全) + * + * @param termList + * @param size + * @return + */ + @Override + public List getKeywords(List termList, int size) + { + clear(); + add(termList); + Collection topN = top(size); + List r = new ArrayList(topN.size()); + for (TermFrequency termFrequency : topN) + { + r.add(termFrequency.getTerm()); + } + return r; + } + + /** + * 提取关键词(线程安全) + * + * @param document 文档内容 + * @param size 希望提取几个关键词 + * @return 一个列表 + */ + public static List getKeywordList(String document, int size) + { + return new TermFrequencyCounter().getKeywords(document, size); + } + + @Override + public String toString() + { + final int max = 100; + return top(Math.min(max, size())).toString(); + } +} diff --git a/src/main/java/com/hankcs/hanlp/mining/word/TfIdf.java b/src/main/java/com/hankcs/hanlp/mining/word/TfIdf.java new file mode 100644 index 000000000..eadf22a27 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/mining/word/TfIdf.java @@ -0,0 +1,288 @@ +package com.hankcs.hanlp.mining.word; + + +import java.util.*; + +/** + * 词频-倒排文档词频统计 + */ +public class TfIdf +{ + + /** + * 词频统计方式 + */ + public enum TfType + { + /** + * 普通词频 + */ + NATURAL, + /** + * 词频的对数并加1 + */ + LOGARITHM, + /** + * 01词频 + */ + BOOLEAN + } + + /** + * tf-idf 向量的正规化算法 + */ + public enum Normalization + { + /** + * 不正规化 + */ + NONE, + /** + * cosine正规化 + */ + COSINE + } + + /** + * 单文档词频 + * + * @param document 词袋 + * @param type 词频计算方式 + * @param 词语类型 + * @return 一个包含词频的Map + */ + public static Map tf(Collection document, TfType type) + { + Map tf = new HashMap(); + for (TERM term : document) + { + Double f = tf.get(term); + if (f == null) f = 0.0; + tf.put(term, f + 1); + } + if (type != TfType.NATURAL) + { + for (TERM term : tf.keySet()) + { + switch 
(type) + { + case LOGARITHM: + tf.put(term, 1 + Math.log(tf.get(term))); + break; + case BOOLEAN: + tf.put(term, tf.get(term) == 0.0 ? 0.0 : 1.0); + break; + } + } + } + return tf; + } + + /** + * 单文档词频 + * + * @param document 词袋 + * @param 词语类型 + * @return 一个包含词频的Map + */ + public static Map tf(Collection document) + { + return tf(document, TfType.NATURAL); + } + + /** + * 多文档词频 + * + * @param documents 多个文档,每个文档都是一个词袋 + * @param type 词频计算方式 + * @param 词语类型 + * @return 一个包含词频的Map的列表 + */ + public static Iterable> tfs(Iterable> documents, TfType type) + { + List> tfs = new ArrayList>(); + for (Collection document : documents) + { + tfs.add(tf(document, type)); + } + return tfs; + } + + /** + * 多文档词频 + * + * @param documents 多个文档,每个文档都是一个词袋 + * @param 词语类型 + * @return 一个包含词频的Map的列表 + */ + public static Iterable> tfs(Iterable> documents) + { + return tfs(documents, TfType.NATURAL); + } + + /** + * 一系列文档的倒排词频 + * + * @param documentVocabularies 词表 + * @param smooth 平滑参数,视作额外有一个文档,该文档含有smooth个每个词语 + * @param addOne tf-idf加一平滑 + * @param 词语类型 + * @return 一个词语->倒排文档的Map + */ + public static Map idf(Iterable> documentVocabularies, + boolean smooth, boolean addOne) + { + Map df = new HashMap(); + int d = smooth ? 1 : 0; + int a = addOne ? 
1 : 0; + int n = d; + for (Iterable documentVocabulary : documentVocabularies) + { + n += 1; + for (TERM term : documentVocabulary) + { + Integer t = df.get(term); + if (t == null) t = d; + df.put(term, t + 1); + } + } + Map idf = new HashMap(); + for (Map.Entry e : df.entrySet()) + { + TERM term = e.getKey(); + double f = e.getValue(); + idf.put(term, Math.log(n / f) + a); + } + return idf; + } + + /** + * 平滑处理后的一系列文档的倒排词频 + * + * @param documentVocabularies 词表 + * @param 词语类型 + * @return 一个词语->倒排文档的Map + */ + public static Map idf(Iterable> documentVocabularies) + { + return idf(documentVocabularies, true, true); + } + + /** + * 计算文档的tf-idf + * + * @param tf 词频 + * @param idf 倒排频率 + * @param normalization 正规化 + * @param 词语类型 + * @return 一个词语->tf-idf的Map + */ + public static Map tfIdf(Map tf, Map idf, + Normalization normalization) + { + Map tfIdf = new HashMap(); + for (TERM term : tf.keySet()) + { + Double TF = tf.get(term); + if (TF == null) TF = 1.; + Double IDF = idf.get(term); + if (IDF == null) IDF = 1.; + tfIdf.put(term, TF * IDF); + } + if (normalization == Normalization.COSINE) + { + double n = 0.0; + for (double x : tfIdf.values()) + { + n += x * x; + } + n = Math.sqrt(n); + + for (TERM term : tfIdf.keySet()) + { + tfIdf.put(term, tfIdf.get(term) / n); + } + } + return tfIdf; + } + + /** + * 计算文档的tf-idf(不正规化) + * + * @param tf 词频 + * @param idf 倒排频率 + * @param 词语类型 + * @return 一个词语->tf-idf的Map + */ + public static Map tfIdf(Map tf, Map idf) + { + return tfIdf(tf, idf, Normalization.NONE); + } + + /** + * 从词频集合建立倒排频率 + * + * @param tfs 次品集合 + * @param smooth 平滑参数,视作额外有一个文档,该文档含有smooth个每个词语 + * @param addOne tf-idf加一平滑 + * @param 词语类型 + * @return 一个词语->倒排文档的Map + */ + public static Map idfFromTfs(Iterable> tfs, boolean smooth, boolean addOne) + { + return idf(new KeySetIterable(tfs), smooth, addOne); + } + + /** + * 从词频集合建立倒排频率(默认平滑词频,且加一平滑tf-idf) + * + * @param tfs 次品集合 + * @param 词语类型 + * @return 一个词语->倒排文档的Map + */ + public static Map 
idfFromTfs(Iterable> tfs) + { + return idfFromTfs(tfs, true, true); + } + + /** + * Map的迭代器 + * + * @param map 键类型 + * @param map 值类型 + */ + static private class KeySetIterable implements Iterable> + { + final private Iterator> maps; + + public KeySetIterable(Iterable> maps) + { + this.maps = maps.iterator(); + } + + @Override + public Iterator> iterator() + { + return new Iterator>() + { + @Override + public boolean hasNext() + { + return maps.hasNext(); + } + + @Override + public Iterable next() + { + return maps.next().keySet(); + } + + @Override + public void remove() + { + + } + }; + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/mining/word/TfIdfCounter.java b/src/main/java/com/hankcs/hanlp/mining/word/TfIdfCounter.java new file mode 100644 index 000000000..c071103ee --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/mining/word/TfIdfCounter.java @@ -0,0 +1,289 @@ +/* + * Hankcs + * me@hankcs.com + * 2016-09-12 PM4:22 + * + * + * Copyright (c) 2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
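TfIdf's smoothing scheme treats `smooth` as one extra pseudo-document containing every term once (so log(n/df) never sees a zero denominator for unseen terms), while `addOne` shifts every idf up by 1 so that terms occurring in all documents keep a nonzero weight. A minimal sketch of that computation over plain string sets — the names here are illustrative, not the library's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TfIdfSketch
{
    // Inverse document frequency: `smooth` pretends one extra document
    // contains every term, `addOne` shifts each idf by 1.
    static Map<String, Double> idf(List<Set<String>> docs, boolean smooth, boolean addOne)
    {
        int d = smooth ? 1 : 0;
        int a = addOne ? 1 : 0;
        int n = d + docs.size();
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (Set<String> doc : docs)
        {
            for (String term : doc)
            {
                Integer t = df.get(term);
                if (t == null) t = d; // the pseudo-document's occurrence
                df.put(term, t + 1);
            }
        }
        Map<String, Double> idf = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : df.entrySet())
        {
            idf.put(e.getKey(), Math.log((double) n / e.getValue()) + a);
        }
        return idf;
    }

    public static void main(String[] args)
    {
        List<Set<String>> docs = new ArrayList<Set<String>>();
        docs.add(new HashSet<String>(Arrays.asList("cat", "sat")));
        docs.add(new HashSet<String>(Arrays.asList("cat", "mat")));
        Map<String, Double> weights = idf(docs, true, true);
        // "cat" occurs in every document, so its idf bottoms out at log(1)+1 = 1.0,
        // below the idf of the rarer "sat".
        System.out.println(weights.get("cat") < weights.get("sat")); // prints true
    }
}
```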
+ * + */ +package com.hankcs.hanlp.mining.word; + +import com.hankcs.hanlp.algorithm.MaxHeap; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.summary.KeywordExtractor; +import com.hankcs.hanlp.tokenizer.StandardTokenizer; + +import java.io.BufferedReader; +import java.io.InputStreamReader; +import java.util.*; + +import static com.hankcs.hanlp.utility.Predefine.logger; + +/** + * TF-IDF统计工具兼关键词提取工具 + * + * @author hankcs + */ +public class TfIdfCounter extends KeywordExtractor +{ + private boolean filterStopWord; + private Map> tfMap; + private Map> tfidfMap; + private Map idf; + + public TfIdfCounter() + { + this(true); + } + + public TfIdfCounter(boolean filterStopWord) + { + this(StandardTokenizer.SEGMENT, filterStopWord); + } + + public TfIdfCounter(Segment defaultSegment, boolean filterStopWord) + { + super(defaultSegment); + this.filterStopWord = filterStopWord; + tfMap = new HashMap>(); + } + + public TfIdfCounter(Segment defaultSegment) + { + this(defaultSegment, true); + } + + @Override + public List getKeywords(List termList, int size) + { + List> entryList = getKeywordsWithTfIdf(termList, size); + List r = new ArrayList(entryList.size()); + for (Map.Entry entry : entryList) + { + r.add(entry.getKey()); + } + + return r; + } + + public List> getKeywordsWithTfIdf(String document, int size) + { + return getKeywordsWithTfIdf(preprocess(document), size); + } + + + public List> getKeywordsWithTfIdf(List termList, int size) + { + if (idf == null) + compute(); + + Map tfIdf = TfIdf.tfIdf(TfIdf.tf(convert(termList)), idf); + return topN(tfIdf, size); + } + + public void add(Object id, List termList) + { + List words = convert(termList); + Map tf = TfIdf.tf(words); + tfMap.put(id, tf); + idf = null; + } + + private static List convert(List termList) + { + List words = new ArrayList(termList.size()); + for (Term term : termList) + { + words.add(term.word); + } + 
return words; + } + + public void add(List termList) + { + add(tfMap.size(), termList); + } + + /** + * 添加文档 + * + * @param id 文档id + * @param text 文档内容 + */ + public void add(Object id, String text) + { + List termList = preprocess(text); + add(id, termList); + } + + private List preprocess(String text) + { + List termList = defaultSegment.seg(text); + if (filterStopWord) + { + filter(termList); + } + return termList; + } + + /** + * 添加文档,自动分配id + * + * @param text + */ + public int add(String text) + { + int id = tfMap.size(); + add(id, text); + return id; + } + + /** + * 加载自定义idf文件 + * + * @param idfPath + */ + public void loadIdfFile(String idfPath){ + String line = null; + boolean first = true; + try + { + idf = new HashMap(); + BufferedReader bw = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(idfPath), "UTF-8")); + while ((line = bw.readLine()) != null) + { + if (first) + { + first = false; + if (!line.isEmpty() && line.charAt(0) == '\uFEFF') + line = line.substring(1); + } + String lineValue[] = line.split(" "); + idf.put(lineValue[0],Double.valueOf( lineValue[1])); + } + bw.close(); + } + catch (Exception e) + { + logger.warning("加载" + idfPath + "失败," + e); + throw new RuntimeException("载入反文档词频文件" + idfPath + "失败"); + } + + } + + public Map> compute() + { + // 如果没有加载idf文件,则通过tf计算 + if(idf==null) { + idf = TfIdf.idfFromTfs(tfMap.values()); + } + tfidfMap = new HashMap>(idf.size()); + for (Map.Entry> entry : tfMap.entrySet()) + { + Map tfidf = TfIdf.tfIdf(entry.getValue(), idf); + tfidfMap.put(entry.getKey(), tfidf); + } + return tfidfMap; + } + + public List> getKeywordsOf(Object id) + { + return getKeywordsOf(id, 10); + } + + + public List> getKeywordsOf(Object id, int size) + { + Map tfidfs = tfidfMap.get(id); + if (tfidfs == null) return null; + + return topN(tfidfs, size); + } + + private List> topN(Map tfidfs, int size) + { + MaxHeap> heap = new MaxHeap>(size, new Comparator>() + { + @Override + public int compare(Map.Entry o1, 
Map.Entry o2) + { + return o1.getValue().compareTo(o2.getValue()); + } + }); + + heap.addAll(tfidfs.entrySet()); + return heap.toList(); + } + + public Set documents() + { + return tfMap.keySet(); + } + + public Map> getTfMap() + { + return tfMap; + } + + public List> sortedAllTf() + { + return sort(allTf()); + } + + public List> sortedAllTfInt() + { + return doubleToInteger(sortedAllTf()); + } + + public Map allTf() + { + Map result = new HashMap(); + for (Map d : tfMap.values()) + { + for (Map.Entry tf : d.entrySet()) + { + Double f = result.get(tf.getKey()); + if (f == null) + { + result.put(tf.getKey(), tf.getValue()); + } + else + { + result.put(tf.getKey(), f + tf.getValue()); + } + } + } + + return result; + } + + private static List> sort(Map map) + { + List> list = new ArrayList>(map.entrySet()); + Collections.sort(list, new Comparator>() + { + @Override + public int compare(Map.Entry o1, Map.Entry o2) + { + return o2.getValue().compareTo(o1.getValue()); + } + }); + + return list; + } + + private static List> doubleToInteger(List> list) + { + List> result = new ArrayList>(list.size()); + for (Map.Entry entry : list) + { + result.add(new AbstractMap.SimpleEntry(entry.getKey(), entry.getValue().intValue())); + } + + return result; + } +} diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/AbstractTrainer.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/AbstractTrainer.java index 3fe59d9cd..f684c67d7 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/AbstractTrainer.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/AbstractTrainer.java @@ -22,13 +22,13 @@ protected void usage() paramDesc("-window ", "Set max skip length between words; default is 5"); paramDesc("-sample ", "Set threshold for occurrence of words. 
Those that appear with higher frequency in the training data" + " will be randomly down-sampled; default is 0.001, useful range is (0, 0.00001)"); - paramDesc("-hs", "Use Hierarchical Softmax; default is not used"); + paramDesc("-hs ", "Use Hierarchical Softmax; default is 0 (not used)"); paramDesc("-negative ", "Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)"); paramDesc("-threads ", "Use threads (default is the cores of local machine)"); paramDesc("-iter ", "Run more training iterations (default 5)"); paramDesc("-min-count ", "This will discard words that appear less than times; default is 5"); paramDesc("-alpha ", "Set the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW"); - paramDesc("-cbow", "Use the continuous bag of words model; default is skip-gram model"); + paramDesc("-cbow ", "Use the continuous bag of words model; default is 1 (use 0 for skip-gram model)"); localUsage(); diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/DocVectorModel.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/DocVectorModel.java index 873bf25e2..4c918928c 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/DocVectorModel.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/DocVectorModel.java @@ -11,10 +11,13 @@ package com.hankcs.hanlp.mining.word2vec; +import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary; +import com.hankcs.hanlp.seg.Segment; import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.tokenizer.NotionalTokenizer; -import java.util.*; +import java.util.List; +import java.util.Map; /** * 文档向量模型 @@ -24,11 +27,26 @@ public class DocVectorModel extends AbstractVectorModel { private final WordVectorModel wordVectorModel; + /** + * 分词器 + */ + private Segment segment; + /** + * 是否使用CoreStopwordDictionary的过滤器 + */ + private boolean filter; public DocVectorModel(WordVectorModel wordVectorModel) + { + this(wordVectorModel, NotionalTokenizer.SEGMENT, true); + } + + 
public DocVectorModel(WordVectorModel wordVectorModel, Segment segment, boolean filter) { super(); this.wordVectorModel = wordVectorModel; + this.segment = segment; + this.filter = filter; } /** @@ -58,6 +76,17 @@ public List> nearest(String query) return queryNearest(query, 10); } + /** + * 查询最相似的前n个文档 + * + * @param query 查询语句(或者说一个文档的内容) + * @return + */ + public List> nearest(String query, int n) + { + return queryNearest(query, n); + } + /** * 将一个文档转为向量 @@ -68,7 +97,11 @@ public List> nearest(String query) public Vector query(String content) { if (content == null || content.length() == 0) return null; - List termList = NotionalTokenizer.segment(content); + List termList = segment.seg(content); + if (filter) + { + CoreStopWordDictionary.apply(termList); + } Vector result = new Vector(dimension()); int n = 0; for (Term term : termList) @@ -97,6 +130,7 @@ public int dimension() /** * 文档相似度计算 + * * @param what * @param with * @return @@ -109,4 +143,34 @@ public float similarity(String what, String with) if (B == null) return -1f; return A.cosineForUnitVector(B); } + + public Segment getSegment() + { + return segment; + } + + public void setSegment(Segment segment) + { + this.segment = segment; + } + + /** + * 是否激活了停用词过滤器 + * + * @return + */ + public boolean isFilterEnabled() + { + return filter; + } + + /** + * 激活/关闭停用词过滤器 + * + * @param filter + */ + public void enableFilter(boolean filter) + { + this.filter = filter; + } } diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/KMeansClustering.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/KMeansClustering.java index d314fc98a..aa5f9e033 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/KMeansClustering.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/KMeansClustering.java @@ -114,9 +114,9 @@ public void clustering() throws IOException } finally { - Utils.closeQuietly(pw); - Utils.closeQuietly(w); - Utils.closeQuietly(os); + Utility.closeQuietly(pw); + Utility.closeQuietly(w); + 
Utility.closeQuietly(os); } } } diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/TextFileCorpus.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/TextFileCorpus.java index ec1e60fe2..a5b0bda50 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/TextFileCorpus.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/TextFileCorpus.java @@ -20,7 +20,7 @@ public TextFileCorpus(Config config) throws IOException @Override public void shutdown() throws IOException { - Utils.closeQuietly(raf); + Utility.closeQuietly(raf); wordsBuffer = null; } @@ -153,9 +153,9 @@ public void learnVocab() throws IOException } finally { - Utils.closeQuietly(fileInputStream); - Utils.closeQuietly(raf); - Utils.closeQuietly(cache); + Utility.closeQuietly(fileInputStream); + Utility.closeQuietly(raf); + Utility.closeQuietly(cache); System.err.println(); } diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/Train.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/Train.java index a894a5f9d..fcabae3c4 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/Train.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/Train.java @@ -11,7 +11,7 @@ protected void localUsage() { paramDesc("-input ", "Use text data from to train the model"); System.err.printf("\nExamples:\n"); - System.err.printf("java %s -input corpus.txt -output vectors.txt -size 200 -window 5 -sample 0.0001 -negative 5 -hs 0 -binary -cbow -iter 3\n\n", + System.err.printf("java %s -input corpus.txt -output vectors.txt -size 200 -window 5 -sample 0.0001 -negative 5 -hs 0 -binary -cbow 1 -iter 3\n\n", Train.class.getName()); } diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/Utility.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/Utility.java index 54c1cf26b..e56212674 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/Utility.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/Utility.java @@ -1,12 +1,13 @@ package com.hankcs.hanlp.mining.word2vec; +import 
java.io.*; + /** * 一些工具方法 */ final class Utility { - private static final int SECOND = 1000; private static final int MINUTE = 60 * SECOND; private static final int HOUR = 60 * MINUTE; @@ -44,4 +45,91 @@ static String humanTime(long ms) return text.toString(); } + + /** + * @param c + */ + public static void closeQuietly(Closeable c) + { + try + { + if (c != null) c.close(); + } + catch (IOException ignored) + { + } + } + + /** + * @param raf + */ + public static void closeQuietly(RandomAccessFile raf) + { + try + { + if (raf != null) raf.close(); + } + catch (IOException ignored) + { + } + } + + public static void closeQuietly(InputStream is) + { + try + { + if (is != null) is.close(); + } + catch (IOException ignored) + { + } + } + + public static void closeQuietly(Reader r) + { + try + { + if (r != null) r.close(); + } + catch (IOException ignored) + { + } + } + + public static void closeQuietly(OutputStream os) + { + try + { + if (os != null) os.close(); + } + catch (IOException ignored) + { + } + } + + public static void closeQuietly(Writer w) + { + try + { + if (w != null) w.close(); + } + catch (IOException ignored) + { + } + } + + /** + * 数组分割 + * + * @param from 源 + * @param to 目标 + * @param 类型 + * @return 目标 + */ + public static T[] shrink(T[] from, T[] to) + { + assert to.length <= from.length; + System.arraycopy(from, 0, to, 0, to.length); + return to; + } } diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/Utils.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/Utils.java deleted file mode 100644 index 9c538855b..000000000 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/Utils.java +++ /dev/null @@ -1,121 +0,0 @@ - -package com.hankcs.hanlp.mining.word2vec; - -import java.io.*; - -/** - * some utils - */ -public final class Utils -{ - - private static final int SECOND = 1000; - private static final int MINUTE = 60 * SECOND; - private static final int HOUR = 60 * MINUTE; - private static final int DAY = 24 * HOUR; - - /** - * @param c 
- */ - public static void closeQuietly(Closeable c) - { - try - { - if (c != null) c.close(); - } - catch (IOException ignored) - { - } - } - - /** - * @param raf - */ - public static void closeQuietly(RandomAccessFile raf) - { - try - { - if (raf != null) raf.close(); - } - catch (IOException ignored) - { - } - } - - public static void closeQuietly(InputStream is) - { - try - { - if (is != null) is.close(); - } - catch (IOException ignored) - { - } - } - - public static void closeQuietly(Reader r) - { - try - { - if (r != null) r.close(); - } - catch (IOException ignored) - { - } - } - - public static void closeQuietly(OutputStream os) - { - try - { - if (os != null) os.close(); - } - catch (IOException ignored) - { - } - } - - public static void closeQuietly(Writer w) - { - try - { - if (w != null) w.close(); - } - catch (IOException ignored) - { - } - } - - static String humanTime(long ms) - { - StringBuffer text = new StringBuffer(""); - if (ms > DAY) - { - text.append(ms / DAY).append(" d "); - ms %= DAY; - } - if (ms > HOUR) - { - text.append(ms / HOUR).append(" h "); - ms %= HOUR; - } - if (ms > MINUTE) - { - text.append(ms / MINUTE).append(" m "); - ms %= MINUTE; - } - if (ms > SECOND) - { - long s = ms / SECOND; - if (s < 10) - { - text.append('0'); - } - text.append(s).append(" s "); -// ms %= SECOND; - } -// text.append(ms + " ms"); - - return text.toString(); - } -} diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/Vector.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/Vector.java index 473922eb0..f14ec1b6f 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/Vector.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/Vector.java @@ -135,4 +135,14 @@ public Vector normalize() divideToSelf(norm()); return this; } + + public float[] getElementArray() + { + return elementArray; + } + + public void setElementArray(float[] elementArray) + { + this.elementArray = elementArray; + } } diff --git 
a/src/main/java/com/hankcs/hanlp/mining/word2vec/VectorsReader.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/VectorsReader.java index 047bba826..ed774c2fe 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/VectorsReader.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/VectorsReader.java @@ -1,5 +1,7 @@ package com.hankcs.hanlp.mining.word2vec; +import com.hankcs.hanlp.corpus.io.IOUtil; + import java.io.*; import java.nio.charset.Charset; import static com.hankcs.hanlp.utility.Predefine.logger; @@ -27,7 +29,7 @@ public void readVectorFile() throws IOException BufferedReader br = null; try { - is = new FileInputStream(file); + is = IOUtil.newInputStream(file); r = new InputStreamReader(is, ENCODING); br = new BufferedReader(r); @@ -42,6 +44,13 @@ public void readVectorFile() throws IOException { line = br.readLine().trim(); String[] params = line.split("\\s+"); + if (params.length != size + 1) + { + logger.info("词向量有一行格式不规范(可能是单词含有空格):" + line); + --words; + --i; + continue; + } vocab[i] = params[0]; matrix[i] = new float[size]; double len = 0; @@ -56,12 +65,17 @@ public void readVectorFile() throws IOException matrix[i][j] /= len; } } + if (words != vocab.length) + { + vocab = Utility.shrink(vocab, new String[words]); + matrix = Utility.shrink(matrix, new float[words][]); + } } - catch (IOException e) + finally { - Utils.closeQuietly(br); - Utils.closeQuietly(r); - Utils.closeQuietly(is); + Utility.closeQuietly(br); + Utility.closeQuietly(r); + Utility.closeQuietly(is); } } diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/Word2VecTraining.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/Word2VecTraining.java index d24cdc7c2..e5fcb866e 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/Word2VecTraining.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/Word2VecTraining.java @@ -110,7 +110,7 @@ public void run() System.err.printf("%cAlpha: %f iter: %d Progress: %.2f%% Words/thread/sec: %.2fk", 13, alpha, 
local_iter, percent * 100, wordCountActual / (float) (cost_time)); - String etd = Utils.humanTime((long) (cost_time / percent * (1.f - percent))); + String etd = Utility.humanTime((long) (cost_time / percent * (1.f - percent))); if (etd.length() > 0) System.err.printf(" ETD: %s", etd); System.err.flush(); } @@ -360,7 +360,7 @@ public void trainModel() throws IOException } System.err.println(); - logger.info(String.format("finished training in %s", Utils.humanTime(System.currentTimeMillis() - timeStart))); + logger.info(String.format("finished training in %s", Utility.humanTime(System.currentTimeMillis() - timeStart))); // lose weight syn1 = null; table = null; @@ -391,9 +391,9 @@ public void trainModel() throws IOException finally { corpus.close(); - Utils.closeQuietly(pw); - Utils.closeQuietly(w); - Utils.closeQuietly(os); + Utility.closeQuietly(pw); + Utility.closeQuietly(w); + Utility.closeQuietly(os); } } diff --git a/src/main/java/com/hankcs/hanlp/mining/word2vec/WordVectorModel.java b/src/main/java/com/hankcs/hanlp/mining/word2vec/WordVectorModel.java index 7c1925f5b..e15da6182 100644 --- a/src/main/java/com/hankcs/hanlp/mining/word2vec/WordVectorModel.java +++ b/src/main/java/com/hankcs/hanlp/mining/word2vec/WordVectorModel.java @@ -29,19 +29,30 @@ public class WordVectorModel extends AbstractVectorModel */ public WordVectorModel(String modelFileName) throws IOException { - super(loadVectorMap(modelFileName)); + this(modelFileName, new TreeMap()); } - private static TreeMap loadVectorMap(String modelFileName) throws IOException + /** + * 加载模型 + * + * @param modelFileName 模型路径 + * @param storage 一个空白的Map(HashMap等) + * @throws IOException 加载错误 + */ + public WordVectorModel(String modelFileName, Map storage) throws IOException + { + super(loadVectorMap(modelFileName, storage)); + } + + private static Map loadVectorMap(String modelFileName, Map storage) throws IOException { VectorsReader reader = new VectorsReader(modelFileName); reader.readVectorFile(); - 
TreeMap map = new TreeMap(); for (int i = 0; i < reader.vocab.length; i++) { - map.put(reader.vocab[i], new Vector(reader.matrix[i])); + storage.put(reader.vocab[i], new Vector(reader.matrix[i])); } - return map; + return storage; } /** diff --git a/src/main/java/com/hankcs/hanlp/model/CRFSegmentModel.java b/src/main/java/com/hankcs/hanlp/model/CRFSegmentModel.java index fe7cf45e2..4d86cdb57 100644 --- a/src/main/java/com/hankcs/hanlp/model/CRFSegmentModel.java +++ b/src/main/java/com/hankcs/hanlp/model/CRFSegmentModel.java @@ -24,6 +24,7 @@ * * @author hankcs */ +// * @deprecated 已废弃,请使用功能更丰富、设计更优雅的{@link com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer}。 public final class CRFSegmentModel extends CRFModel { private int idM; diff --git a/src/main/java/com/hankcs/hanlp/model/crf/CRFLexicalAnalyzer.java b/src/main/java/com/hankcs/hanlp/model/crf/CRFLexicalAnalyzer.java new file mode 100644 index 000000000..04b2b786d --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/CRFLexicalAnalyzer.java @@ -0,0 +1,106 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 下午7:29 + * + * + * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. 
+ * + */ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; + +import java.io.IOException; + +/** + * CRF词法分析器(中文分词、词性标注和命名实体识别) + * + * @author hankcs + * @since 1.6.2 + */ +public class CRFLexicalAnalyzer extends AbstractLexicalAnalyzer +{ + /** + * 构造CRF词法分析器 + * + * @param segmenter CRF分词器 + */ + public CRFLexicalAnalyzer(CRFSegmenter segmenter) + { + this.segmenter = segmenter; + } + + /** + * 构造CRF词法分析器 + * + * @param segmenter CRF分词器 + * @param posTagger CRF词性标注器 + */ + public CRFLexicalAnalyzer(CRFSegmenter segmenter, CRFPOSTagger posTagger) + { + this.segmenter = segmenter; + this.posTagger = posTagger; + config.speechTagging = true; + } + + /** + * 构造CRF词法分析器 + * + * @param segmenter CRF分词器 + * @param posTagger CRF词性标注器 + * @param neRecognizer CRF命名实体识别器 + */ + public CRFLexicalAnalyzer(CRFSegmenter segmenter, CRFPOSTagger posTagger, CRFNERecognizer neRecognizer) + { + this.segmenter = segmenter; + this.posTagger = posTagger; + this.neRecognizer = neRecognizer; + config.speechTagging = true; + config.nameRecognize = true; + } + + /** + * 构造CRF词法分析器 + * + * @param cwsModelPath CRF分词器模型路径 + */ + public CRFLexicalAnalyzer(String cwsModelPath) throws IOException + { + this(new CRFSegmenter(cwsModelPath)); + } + + /** + * 构造CRF词法分析器 + * + * @param cwsModelPath CRF分词器模型路径 + * @param posModelPath CRF词性标注器模型路径 + */ + public CRFLexicalAnalyzer(String cwsModelPath, String posModelPath) throws IOException + { + this(new CRFSegmenter(cwsModelPath), new CRFPOSTagger(posModelPath)); + } + + /** + * 构造CRF词法分析器 + * + * @param cwsModelPath CRF分词器模型路径 + * @param posModelPath CRF词性标注器模型路径 + * @param nerModelPath CRF命名实体识别器模型路径 + */ + public CRFLexicalAnalyzer(String cwsModelPath, String posModelPath, String nerModelPath) throws IOException + { + this(new CRFSegmenter(cwsModelPath), new CRFPOSTagger(posModelPath), new CRFNERecognizer(nerModelPath)); + } + + /** + * 加载配置文件指定的模型 + * + * @throws IOException + */ + public 
CRFLexicalAnalyzer() throws IOException + { + this(new CRFSegmenter(), new CRFPOSTagger(), new CRFNERecognizer()); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/CRFModel.java b/src/main/java/com/hankcs/hanlp/model/crf/CRFModel.java index 8796c876e..2db7d6889 100644 --- a/src/main/java/com/hankcs/hanlp/model/crf/CRFModel.java +++ b/src/main/java/com/hankcs/hanlp/model/crf/CRFModel.java @@ -16,6 +16,7 @@ import com.hankcs.hanlp.corpus.io.ByteArray; import com.hankcs.hanlp.corpus.io.ICacheAble; import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.crf.crfpp.Model; import com.hankcs.hanlp.utility.Predefine; import com.hankcs.hanlp.utility.TextUtility; @@ -27,6 +28,8 @@ import static com.hankcs.hanlp.utility.Predefine.logger; /** + * 这份代码目前做到了与CRF++解码结果完全一致。也可以直接使用移植版的CRF++ {@link CRFLexicalAnalyzer} + * * @author hankcs */ public class CRFModel implements ICacheAble @@ -59,6 +62,7 @@ public CRFModel() /** * 以指定的trie树结构储存内部特征函数 + * * @param featureFunctionTrie */ public CRFModel(ITrie featureFunctionTrie) @@ -73,7 +77,8 @@ protected void onLoadTxtFinished() /** * 加载Txt形式的CRF++模型 - * @param path 模型路径 + * + * @param path 模型路径 * @param instance 模型的实例(这里允许用户构造不同的CRFModel来储存最终读取的结果) * @return 该模型 */ @@ -104,7 +109,7 @@ public static CRFModel loadTxt(String path, CRFModel instance) CRFModel.id2tag[entry.getValue()] = entry.getKey(); } TreeMap featureFunctionMap = new TreeMap(); // 构建trie树的时候用 - List featureFunctionList = new LinkedList(); // 读取权值的时候用 + TreeMap featureFunctionList = new TreeMap(); // 读取权值的时候用 CRFModel.featureTemplateList = new LinkedList(); while ((line = lineIterator.next()).length() != 0) { @@ -119,9 +124,12 @@ public static CRFModel loadTxt(String path, CRFModel instance) } } + int b = -1;// 转换矩阵的权重位置 if (CRFModel.matrix != null) { - lineIterator.next(); // 0 B + String[] args = lineIterator.next().split(" ", 2); // 0 B + b = Integer.valueOf(args[0]); + featureFunctionList.put(b, null); } while ((line = 
lineIterator.next()).length() != 0) @@ -130,25 +138,29 @@ public static CRFModel loadTxt(String path, CRFModel instance) char[] charArray = args[1].toCharArray(); FeatureFunction featureFunction = new FeatureFunction(charArray, size); featureFunctionMap.put(args[1], featureFunction); - featureFunctionList.add(featureFunction); + featureFunctionList.put(Integer.parseInt(args[0]), featureFunction); } - if (CRFModel.matrix != null) + for (Map.Entry entry : featureFunctionList.entrySet()) { - for (int i = 0; i < size; i++) + int fid = entry.getKey(); + FeatureFunction featureFunction = entry.getValue(); + if (fid == b) { - for (int j = 0; j < size; j++) + for (int i = 0; i < size; i++) { - CRFModel.matrix[i][j] = Double.parseDouble(lineIterator.next()); + for (int j = 0; j < size; j++) + { + CRFModel.matrix[i][j] = Double.parseDouble(lineIterator.next()); + } } } - } - - for (FeatureFunction featureFunction : featureFunctionList) - { - for (int i = 0; i < size; i++) + else { - featureFunction.w[i] = Double.parseDouble(lineIterator.next()); + for (int i = 0; i < size; i++) + { + featureFunction.w[i] = Double.parseDouble(lineIterator.next()); + } } } if (lineIterator.hasNext()) @@ -258,6 +270,7 @@ public void tag(Table table) /** * 根据特征函数计算输出 + * * @param table * @param current * @return @@ -381,7 +394,8 @@ public boolean load(ByteArray byteArray) /** * 加载Txt形式的CRF++模型
- * 同时生成path.bin模型缓存 + * 同时生成path.bin模型缓存 + * * @param path 模型路径 * @return 该模型 */ @@ -392,7 +406,8 @@ public static CRFModel loadTxt(String path) /** * 加载CRF++模型
- * 如果存在缓存的话,优先读取缓存,否则读取txt,并且建立缓存 + * 如果存在缓存的话,优先读取缓存,否则读取txt,并且建立缓存 + * * @param path txt的路径,即使不存在.txt,只存在.bin,也应传入txt的路径,方法内部会自动加.bin后缀 * @return */ @@ -405,7 +420,8 @@ public static CRFModel load(String path) /** * 加载Bin形式的CRF++模型
- * 注意该Bin形式不是CRF++的二进制模型,而是HanLP由CRF++的文本模型转换过来的私有格式 + * 注意该Bin形式不是CRF++的二进制模型,而是HanLP由CRF++的文本模型转换过来的私有格式 + * * @param path * @return */ @@ -420,6 +436,7 @@ public static CRFModel loadBin(String path) /** * 获取某个tag的ID + * * @param tag * @return */ @@ -427,4 +444,8 @@ public Integer getTagId(String tag) { return tag2id.get(tag); } + + public Map getTag2id() { + return tag2id; + } } diff --git a/src/main/java/com/hankcs/hanlp/model/crf/CRFNERecognizer.java b/src/main/java/com/hankcs/hanlp/model/crf/CRFNERecognizer.java new file mode 100644 index 000000000..183f765ce --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/CRFNERecognizer.java @@ -0,0 +1,176 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 上午3:45 + * + * + * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. + * + */ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.model.crf.crfpp.FeatureIndex; +import com.hankcs.hanlp.model.crf.crfpp.TaggerImpl; +import com.hankcs.hanlp.model.perceptron.PerceptronNERecognizer; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.NERInstance; +import com.hankcs.hanlp.model.perceptron.instance.POSInstance; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.io.BufferedWriter; +import java.io.IOException; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; + +/** + * @author hankcs + */ +public class CRFNERecognizer extends CRFTagger implements NERecognizer +{ + public NERTagSet tagSet; + /** + * 复用感知机的解码模块 + */ + private PerceptronNERecognizer perceptronNERecognizer; + + public CRFNERecognizer() throws IOException + { + 
this(HanLP.Config.CRFNERModelPath); + } + + public CRFNERecognizer(String modelPath) throws IOException + { + this(modelPath, null); + } + + public CRFNERecognizer(String modelPath, String[] customNERTags) throws IOException + { + super(modelPath); + if (model == null) + { + tagSet = new NERTagSet(); + addDefaultNERLabels(); + if (customNERTags != null) + { + for (String nerTags : customNERTags) + { + addNERLabels(nerTags); + } + } + } + else + { + perceptronNERecognizer = new PerceptronNERecognizer(this.model); + tagSet = perceptronNERecognizer.getNERTagSet(); + } + } + + protected void addDefaultNERLabels() + { + tagSet.nerLabels.add("nr"); + tagSet.nerLabels.add("ns"); + tagSet.nerLabels.add("nt"); + } + + public void addNERLabels(String newNerTag) + { + tagSet.nerLabels.add(newNerTag); + } + + @Override + protected void convertCorpus(Sentence sentence, BufferedWriter bw) throws IOException + { + List collector = Utility.convertSentenceToNER(sentence, tagSet); + for (String[] tuple : collector) + { + bw.write(tuple[0]); + bw.write('\t'); + bw.write(tuple[1]); + bw.write('\t'); + bw.write(tuple[2]); + bw.newLine(); + } + } + + @Override + public String[] recognize(String[] wordArray, String[] posArray) + { + return perceptronNERecognizer.recognize(createInstance(wordArray, posArray)); + } + + @Override + public NERTagSet getNERTagSet() + { + return tagSet; + } + + private NERInstance createInstance(String[] wordArray, String[] posArray) + { + final FeatureTemplate[] featureTemplateArray = model.getFeatureTemplateArray(); + return new NERInstance(wordArray, posArray, model.featureMap) + { + @Override + protected int[] extractFeature(String[] wordArray, String[] posArray, FeatureMap featureMap, int position) + { + StringBuilder sbFeature = new StringBuilder(); + List featureVec = new LinkedList(); + for (int i = 0; i < featureTemplateArray.length; i++) + { + Iterator offsetIterator = featureTemplateArray[i].offsetList.iterator(); + Iterator delimiterIterator = 
featureTemplateArray[i].delimiterList.iterator(); + delimiterIterator.next(); // ignore U0 之类的id + while (offsetIterator.hasNext()) + { + int[] offset = offsetIterator.next(); + int t = offset[0] + position; + boolean first = offset[1] == 0; + if (t < 0) + sbFeature.append(FeatureIndex.BOS[-(t + 1)]); + else if (t >= wordArray.length) + sbFeature.append(FeatureIndex.EOS[t - wordArray.length]); + else + sbFeature.append(first ? wordArray[t] : posArray[t]); + if (delimiterIterator.hasNext()) + sbFeature.append(delimiterIterator.next()); + else + sbFeature.append(i); + } + addFeatureThenClear(sbFeature, featureVec, featureMap); + } + return toFeatureArray(featureVec); + } + }; + } + + @Override + protected String getDefaultFeatureTemplate() + { + return "# Unigram\n" + + // form + "U0:%x[-2,0]\n" + + "U1:%x[-1,0]\n" + + "U2:%x[0,0]\n" + + "U3:%x[1,0]\n" + + "U4:%x[2,0]\n" + + // pos + "U5:%x[-2,1]\n" + + "U6:%x[-1,1]\n" + + "U7:%x[0,1]\n" + + "U8:%x[1,1]\n" + + "U9:%x[2,1]\n" + + // pos 2-gram + "UA:%x[-2,1]%x[-1,1]\n" + + "UB:%x[-1,1]%x[0,1]\n" + + "UC:%x[0,1]%x[1,1]\n" + + "UD:%x[1,1]%x[2,1]\n" + + "UE:%x[2,1]%x[3,1]\n" + + "\n" + + "# Bigram\n" + + "B"; + } + +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/CRFPOSTagger.java b/src/main/java/com/hankcs/hanlp/model/crf/CRFPOSTagger.java new file mode 100644 index 000000000..4de7636e6 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/CRFPOSTagger.java @@ -0,0 +1,178 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-30 上午3:04 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.model.crf.crfpp.Encoder; +import com.hankcs.hanlp.model.crf.crfpp.FeatureIndex; +import com.hankcs.hanlp.model.crf.crfpp.crf_learn; +import com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.POSInstance; +import com.hankcs.hanlp.tokenizer.lexical.POSTagger; + +import java.io.BufferedWriter; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; + +/** + * @author hankcs + */ +public class CRFPOSTagger extends CRFTagger implements POSTagger +{ + private PerceptronPOSTagger perceptronPOSTagger; + + public CRFPOSTagger() throws IOException + { + this(HanLP.Config.CRFPOSModelPath); + } + + public CRFPOSTagger(String modelPath) throws IOException + { + super(modelPath); + if (modelPath != null) + { + perceptronPOSTagger = new PerceptronPOSTagger(this.model); + } + } + + @Override + public void train(String trainCorpusPath, String modelPath) throws IOException + { + crf_learn.Option option = new crf_learn.Option(); + train(trainCorpusPath, modelPath, option.maxiter, 10, option.eta, option.cost, + option.thread, option.shrinking_size, Encoder.Algorithm.fromString(option.algorithm)); + } + + @Override + protected void convertCorpus(Sentence sentence, BufferedWriter bw) throws IOException + { + List simpleWordList = sentence.toSimpleWordList(); + List wordList = new ArrayList(simpleWordList.size()); + for (Word word : simpleWordList) + { + wordList.add(word.value); + } + String[] words = wordList.toArray(new String[0]); + Iterator iterator = simpleWordList.iterator(); + for (int i = 0; i < words.length; i++) + { + String curWord = words[i]; + String[] 
cells = createCells(true); + extractFeature(curWord, cells); + cells[5] = iterator.next().label; + for (int j = 0; j < cells.length; j++) + { + bw.write(cells[j]); + if (j != cells.length - 1) + bw.write('\t'); + } + bw.newLine(); + } + } + + private String[] createCells(boolean withTag) + { + return withTag ? new String[6] : new String[5]; + } + + private void extractFeature(String curWord, String[] cells) + { + int length = curWord.length(); + cells[0] = curWord; + cells[1] = curWord.substring(0, 1); + cells[2] = length > 1 ? curWord.substring(0, 2) : "_"; + // length > 2 ? curWord.substring(0, 3) : "<>" + cells[3] = curWord.substring(length - 1); + cells[4] = length > 1 ? curWord.substring(length - 2) : "_"; + // length > 2 ? curWord.substring(length - 3) : "<>" + } + + @Override + protected String getDefaultFeatureTemplate() + { + return "# Unigram\n" + + "U0:%x[-1,0]\n" + + "U1:%x[0,0]\n" + + "U2:%x[1,0]\n" + + "U3:%x[0,1]\n" + + "U4:%x[0,2]\n" + + "U5:%x[0,3]\n" + + "U6:%x[0,4]\n" + +// "U7:%x[0,5]\n" + +// "U8:%x[0,6]\n" + + "\n" + + "# Bigram\n" + + "B"; + } + + public String[] tag(List wordList) + { + String[] words = new String[wordList.size()]; + wordList.toArray(words); + return tag(words); + } + + @Override + public String[] tag(String... 
words) + { + return perceptronPOSTagger.tag(createInstance(words)); + } + + private POSInstance createInstance(String[] words) + { + final FeatureTemplate[] featureTemplateArray = model.getFeatureTemplateArray(); + final String[][] table = new String[words.length][5]; + for (int i = 0; i < words.length; i++) + { + extractFeature(words[i], table[i]); + } + + return new POSInstance(words, model.featureMap) + { + @Override + protected int[] extractFeature(String[] words, FeatureMap featureMap, int position) + { + StringBuilder sbFeature = new StringBuilder(); + List featureVec = new LinkedList(); + for (int i = 0; i < featureTemplateArray.length; i++) + { + Iterator offsetIterator = featureTemplateArray[i].offsetList.iterator(); + Iterator delimiterIterator = featureTemplateArray[i].delimiterList.iterator(); + delimiterIterator.next(); // ignore U0 之类的id + while (offsetIterator.hasNext()) + { + int[] offset = offsetIterator.next(); + int t = offset[0] + position; + int j = offset[1]; + if (t < 0) + sbFeature.append(FeatureIndex.BOS[-(t + 1)]); + else if (t >= words.length) + sbFeature.append(FeatureIndex.EOS[t - words.length]); + else + sbFeature.append(table[t][j]); + if (delimiterIterator.hasNext()) + sbFeature.append(delimiterIterator.next()); + else + sbFeature.append(i); + } + addFeatureThenClear(sbFeature, featureVec, featureMap); + } + return toFeatureArray(featureVec); + } + }; + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/crf/CRFSegmenter.java b/src/main/java/com/hankcs/hanlp/model/crf/CRFSegmenter.java new file mode 100644 index 000000000..353ac0c56 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/CRFSegmenter.java @@ -0,0 +1,153 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-30 上午1:07 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.model.crf.crfpp.FeatureIndex; +import com.hankcs.hanlp.model.crf.crfpp.TaggerImpl; +import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.CWSInstance; +import com.hankcs.hanlp.tokenizer.lexical.Segmenter; + +import java.io.BufferedWriter; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +//import static com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; + +/** + * @author hankcs + */ +public class CRFSegmenter extends CRFTagger implements Segmenter +{ + + private PerceptronSegmenter perceptronSegmenter; + + public CRFSegmenter() throws IOException + { + this(HanLP.Config.CRFCWSModelPath); + } + + public CRFSegmenter(String modelPath) throws IOException + { + super(modelPath); + if (modelPath != null) + { + perceptronSegmenter = new PerceptronSegmenter(this.model); + } + } + + @Override + protected void convertCorpus(Sentence sentence, BufferedWriter bw) throws IOException + { + for (Word w : sentence.toSimpleWordList()) + { + String word = CharTable.convert(w.value); + if (word.length() == 1) + { + bw.write(word); + bw.write('\t'); + bw.write('S'); + bw.write('\n'); + } + else + { + bw.write(word.charAt(0)); + bw.write('\t'); + bw.write('B'); + bw.write('\n'); + for (int i = 1; i < word.length() - 1; ++i) + { + bw.write(word.charAt(i)); + bw.write('\t'); + bw.write('M'); + bw.write('\n'); + } + bw.write(word.charAt(word.length() - 1)); + bw.write('\t'); + bw.write('E'); + bw.write('\n'); + } + } + } + + public List segment(String text) + { + List wordList = new 
LinkedList(); + segment(text, CharTable.convert(text), wordList); + + return wordList; + } + + @Override + public void segment(String text, String normalized, List wordList) + { + perceptronSegmenter.segment(text, createInstance(normalized), wordList); + } + + private CWSInstance createInstance(String text) + { + final FeatureTemplate[] featureTemplateArray = model.getFeatureTemplateArray(); + return new CWSInstance(text, model.featureMap) + { + @Override + protected int[] extractFeature(String sentence, FeatureMap featureMap, int position) + { + StringBuilder sbFeature = new StringBuilder(); + List featureVec = new LinkedList(); + for (int i = 0; i < featureTemplateArray.length; i++) + { + Iterator offsetIterator = featureTemplateArray[i].offsetList.iterator(); + Iterator delimiterIterator = featureTemplateArray[i].delimiterList.iterator(); + delimiterIterator.next(); // ignore U0 之类的id + while (offsetIterator.hasNext()) + { + int offset = offsetIterator.next()[0] + position; + if (offset < 0) + sbFeature.append(FeatureIndex.BOS[-(offset + 1)]); + else if (offset >= sentence.length()) + sbFeature.append(FeatureIndex.EOS[offset - sentence.length()]); + else + sbFeature.append(sentence.charAt(offset)); + if (delimiterIterator.hasNext()) + sbFeature.append(delimiterIterator.next()); + else + sbFeature.append(i); + } + addFeatureThenClear(sbFeature, featureVec, featureMap); + } + return toFeatureArray(featureVec); + } + }; + } + + @Override + protected String getDefaultFeatureTemplate() + { + return "# Unigram\n" + + "U0:%x[-1,0]\n" + + "U1:%x[0,0]\n" + + "U2:%x[1,0]\n" + + "U3:%x[-2,0]%x[-1,0]\n" + + "U4:%x[-1,0]%x[0,0]\n" + + "U5:%x[0,0]%x[1,0]\n" + + "U6:%x[1,0]%x[2,0]\n" + + "\n" + + "# Bigram\n" + + "B"; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/CRFTagger.java b/src/main/java/com/hankcs/hanlp/model/crf/CRFTagger.java new file mode 100644 index 000000000..a5506caf7 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/CRFTagger.java @@ 
-0,0 +1,178 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-30 上午2:51 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.crf.crfpp.Encoder; +import com.hankcs.hanlp.model.crf.crfpp.crf_learn; +import com.hankcs.hanlp.model.perceptron.instance.InstanceHandler; +import com.hankcs.hanlp.model.perceptron.utility.IOUtility; +import com.hankcs.hanlp.model.perceptron.utility.Utility; + +import java.io.BufferedWriter; +import java.io.File; +import java.io.IOException; +import java.util.Date; + +/** + * @author hankcs + */ +public abstract class CRFTagger +{ + protected LogLinearModel model; + + public CRFTagger() + { + } + + public CRFTagger(String modelPath) throws IOException + { + if (modelPath == null) return; // 训练模式 + model = new LogLinearModel(modelPath); + } + + /** + * 训练 + * + * @param templFile 模板文件 + * @param trainFile 训练文件 + * @param modelFile 模型文件 + * @param maxitr 最大迭代次数 + * @param freq 特征最低频次 + * @param eta 收敛阈值 + * @param C cost-factor + * @param threadNum 线程数 + * @param shrinkingSize + * @param algorithm 训练算法 + * @return + */ + public void train(String templFile, String trainFile, String modelFile, + int maxitr, int freq, double eta, double C, int threadNum, int shrinkingSize, + Encoder.Algorithm algorithm) throws IOException + { + Encoder encoder = new Encoder(); + if (!encoder.learn(templFile, trainFile, modelFile, + true, maxitr, freq, eta, C, threadNum, shrinkingSize, algorithm)) + { + throw new IOException("fail to learn model"); + } + convert(modelFile); + } + + /** + * 将CRF++格式转为HanLP格式 + * + * @param modelFile + * @throws IOException + */ + private void convert(String modelFile) throws IOException + { + this.model = new 
LogLinearModel(modelFile + ".txt", modelFile); + } + + public void train(String trainCorpusPath, String modelPath) throws IOException + { + crf_learn.Option option = new crf_learn.Option(); + train(trainCorpusPath, modelPath, option.maxiter, option.freq, option.eta, option.cost, + option.thread, option.shrinking_size, Encoder.Algorithm.fromString(option.algorithm)); + } + + public void train(String trainFile, String modelFile, + int maxitr, int freq, double eta, double C, int threadNum, int shrinkingSize, + Encoder.Algorithm algorithm) throws IOException + { + String templFile = null; + File tmpTemplate = File.createTempFile("crfpp-template-" + new Date().getTime(), ".txt"); + tmpTemplate.deleteOnExit(); + templFile = tmpTemplate.getAbsolutePath(); + String template = getDefaultFeatureTemplate(); + IOUtil.saveTxt(templFile, template); + + File tmpTrain = File.createTempFile("crfpp-train-" + new Date().getTime(), ".txt"); + tmpTrain.deleteOnExit(); + convertCorpus(trainFile, tmpTrain.getAbsolutePath()); + trainFile = tmpTrain.getAbsolutePath(); + System.out.printf("Java效率低,建议安装CRF++,执行下列等价训练命令(不要终止本进程,否则临时语料库和特征模板将被清除):\n" + + "crf_learn -m %d -f %d -e %f -c %f -p %d -H %d -a %s -t %s %s %s\n", maxitr, freq, eta, + C, threadNum, shrinkingSize, algorithm.toString().replace('_', '-'), + templFile, trainFile, modelFile); + Encoder encoder = new Encoder(); + if (!encoder.learn(templFile, trainFile, modelFile, + true, maxitr, freq, eta, C, threadNum, shrinkingSize, algorithm)) + { + throw new IOException("fail to learn model"); + } + convert(modelFile); + } + + protected abstract void convertCorpus(Sentence sentence, BufferedWriter bw) throws IOException; + + protected abstract String getDefaultFeatureTemplate(); + + public void convertCorpus(String pkuPath, String tsvPath) throws IOException + { + final BufferedWriter bw = IOUtil.newBufferedWriter(tsvPath); + IOUtility.loadInstance(pkuPath, new InstanceHandler() + { + @Override + public boolean process(Sentence 
sentence) + { + Utility.normalize(sentence); + try + { + convertCorpus(sentence, bw); + bw.newLine(); + } + catch (IOException e) + { + throw new RuntimeException(e); + } + return false; + } + }); + bw.close(); + } + + /** + * 导出特征模板 + * + * @param templatePath + * @throws IOException + */ + public void dumpTemplate(String templatePath) throws IOException + { + BufferedWriter bw = IOUtil.newBufferedWriter(templatePath); + String template = getTemplate(); + bw.write(template); + bw.close(); + } + + /** + * 获取特征模板 + * + * @return + */ + public String getTemplate() + { + String template = getDefaultFeatureTemplate(); + if (model != null && model.getFeatureTemplateArray() != null) + { + StringBuilder sbTemplate = new StringBuilder(); + for (FeatureTemplate featureTemplate : model.getFeatureTemplateArray()) + { + sbTemplate.append(featureTemplate.getTemplate()).append('\n'); + } + template = sbTemplate.toString(); + } + return template; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/FeatureFunction.java b/src/main/java/com/hankcs/hanlp/model/crf/FeatureFunction.java index ebb2aa2fa..f4f8ecb09 100644 --- a/src/main/java/com/hankcs/hanlp/model/crf/FeatureFunction.java +++ b/src/main/java/com/hankcs/hanlp/model/crf/FeatureFunction.java @@ -46,6 +46,11 @@ public FeatureFunction() { } + public FeatureFunction(String o, int tagSize) + { + this(o.toCharArray(), tagSize); + } + @Override public void save(DataOutputStream out) throws Exception { diff --git a/src/main/java/com/hankcs/hanlp/model/crf/FeatureTemplate.java b/src/main/java/com/hankcs/hanlp/model/crf/FeatureTemplate.java index 5c358c3f4..bc7629192 100644 --- a/src/main/java/com/hankcs/hanlp/model/crf/FeatureTemplate.java +++ b/src/main/java/com/hankcs/hanlp/model/crf/FeatureTemplate.java @@ -16,6 +16,7 @@ import java.io.DataInputStream; import java.io.DataOutputStream; +import java.io.IOException; import java.util.ArrayList; import java.util.LinkedList; import java.util.List; @@ -78,7 +79,7 @@ public char[] generateParameter(Table table, 
int current) } @Override - public void save(DataOutputStream out) throws Exception + public void save(DataOutputStream out) throws IOException { out.writeUTF(template); out.writeInt(offsetList.size()); @@ -122,4 +123,9 @@ public String toString() sb.append('}'); return sb.toString(); } + + public String getTemplate() + { + return template; + } } diff --git a/src/main/java/com/hankcs/hanlp/model/crf/LogLinearModel.java b/src/main/java/com/hankcs/hanlp/model/crf/LogLinearModel.java new file mode 100644 index 000000000..bcca2ce7e --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/LogLinearModel.java @@ -0,0 +1,285 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-28 7:37 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.FileIOAdapter; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.feature.MutableFeatureMap; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.tagset.CWSTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.utility.Predefine; + +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.*; + +import static com.hankcs.hanlp.utility.Predefine.BIN_EXT; +import static com.hankcs.hanlp.utility.Predefine.logger; + +/** + * 对数线性模型形式的CRF模型 + * + * @author hankcs + */ +public class LogLinearModel extends LinearModel +{ + /** + * 特征模板 + */ + private FeatureTemplate[] featureTemplateArray; + + private LogLinearModel(FeatureMap featureMap, float[] parameter) + { 
+ super(featureMap, parameter); + } + + private LogLinearModel(FeatureMap featureMap) + { + super(featureMap); + } + + @Override + public boolean load(ByteArray byteArray) + { + if (!super.load(byteArray)) return false; + int size = byteArray.nextInt(); + featureTemplateArray = new FeatureTemplate[size]; + for (int i = 0; i < size; ++i) + { + FeatureTemplate featureTemplate = new FeatureTemplate(); + featureTemplate.load(byteArray); + featureTemplateArray[i] = featureTemplate; + } + if (!byteArray.hasMore()) + byteArray.close(); + return true; + } + + /** + * 加载CRF模型 + * + * @param modelFile HanLP的.bin格式,或CRF++的.txt格式(将会自动转换为model.txt.bin,下次会直接加载.txt.bin) + * @throws IOException + */ + public LogLinearModel(String modelFile) throws IOException + { + super(null, null); + if (modelFile.endsWith(BIN_EXT)) + { + load(modelFile); // model.bin + return; + } + String binPath = modelFile + Predefine.BIN_EXT; + + if (!((HanLP.Config.IOAdapter == null || HanLP.Config.IOAdapter instanceof FileIOAdapter) && !IOUtil.isFileExisted(binPath))) + { + try + { + load(binPath); // model.txt -> model.bin + return; + } + catch (Exception e) + { + // ignore + } + } + + convert(modelFile, binPath); + } + + /** + * 加载txt,转换为bin + * + * @param txtFile txt + * @param binFile bin + * @throws IOException + */ + public LogLinearModel(String txtFile, String binFile) throws IOException + { + super(null, null); + convert(txtFile, binFile); + } + + private void convert(String txtFile, String binFile) throws IOException + { + TagSet tagSet = new TagSet(TaskType.CLASSIFICATION); + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(txtFile); + if (!lineIterator.hasNext()) throw new IOException("空白文件"); + logger.info(lineIterator.next()); // version + logger.info(lineIterator.next()); // cost-factor + int maxid = Integer.parseInt(lineIterator.next().substring("maxid:".length()).trim()); + logger.info(lineIterator.next()); // xsize + lineIterator.next(); // blank + String line; + while ((line = 
lineIterator.next()).length() != 0) + { + tagSet.add(line); + } + tagSet.type = guessModelType(tagSet); + switch (tagSet.type) + { + case CWS: + tagSet = new CWSTagSet(tagSet.idOf("B"), tagSet.idOf("M"), tagSet.idOf("E"), tagSet.idOf("S")); + break; + case NER: + tagSet = new NERTagSet(tagSet.idOf("O"), tagSet.tags()); + break; + } + tagSet.lock(); + this.featureMap = new MutableFeatureMap(tagSet); + FeatureMap featureMap = this.featureMap; + final int sizeOfTagSet = tagSet.size(); + TreeMap featureFunctionMap = new TreeMap(); // 构建trie树的时候用 + TreeMap featureFunctionList = new TreeMap(); // 读取权值的时候用 + ArrayList featureTemplateList = new ArrayList(); + float[][] matrix = null; + while ((line = lineIterator.next()).length() != 0) + { + if (!"B".equals(line)) + { + FeatureTemplate featureTemplate = FeatureTemplate.create(line); + featureTemplateList.add(featureTemplate); + } + else + { + matrix = new float[sizeOfTagSet][sizeOfTagSet]; + } + } + this.featureTemplateArray = featureTemplateList.toArray(new FeatureTemplate[0]); + + int b = -1;// 转换矩阵的权重位置 + if (matrix != null) + { + String[] args = lineIterator.next().split(" ", 2); // 0 B + b = Integer.valueOf(args[0]); + featureFunctionList.put(b, null); + } + + while ((line = lineIterator.next()).length() != 0) + { + String[] args = line.split(" ", 2); + char[] charArray = args[1].toCharArray(); + FeatureFunction featureFunction = new FeatureFunction(charArray, sizeOfTagSet); + featureFunctionMap.put(args[1], featureFunction); + featureFunctionList.put(Integer.parseInt(args[0]), featureFunction); + } + + for (Map.Entry entry : featureFunctionList.entrySet()) + { + int fid = entry.getKey(); + FeatureFunction featureFunction = entry.getValue(); + if (fid == b) + { + for (int i = 0; i < sizeOfTagSet; i++) + { + for (int j = 0; j < sizeOfTagSet; j++) + { + matrix[i][j] = Float.parseFloat(lineIterator.next()); + } + } + } + else + { + for (int i = 0; i < sizeOfTagSet; i++) + { + featureFunction.w[i] = 
Double.parseDouble(lineIterator.next()); + } + } + } + if (lineIterator.hasNext()) + { + logger.warning("文本读取有残留,可能会出问题!" + txtFile); + } + lineIterator.close(); + logger.info("文本读取结束,开始转换模型"); + int transitionFeatureOffset = (sizeOfTagSet + 1) * sizeOfTagSet; + parameter = new float[transitionFeatureOffset + featureFunctionMap.size() * sizeOfTagSet]; + if (matrix != null) + { + for (int i = 0; i < sizeOfTagSet; ++i) + { + for (int j = 0; j < sizeOfTagSet; ++j) + { + parameter[i * sizeOfTagSet + j] = matrix[i][j]; + } + } + } + for (Map.Entry entry : featureFunctionList.entrySet()) + { + int id = entry.getKey(); + FeatureFunction f = entry.getValue(); + if (f == null) continue; + String feature = new String(f.o); + for (int tid = 0; tid < featureTemplateList.size(); tid++) + { + FeatureTemplate template = featureTemplateList.get(tid); + Iterator iterator = template.delimiterList.iterator(); + String header = iterator.next(); + if (feature.startsWith(header)) + { + int fid = featureMap.idOf(feature.substring(header.length()) + tid); +// assert id == sizeOfTagSet * sizeOfTagSet + (fid - sizeOfTagSet - 1) * sizeOfTagSet; + for (int i = 0; i < sizeOfTagSet; ++i) + { + parameter[fid * sizeOfTagSet + i] = (float) f.w[i]; + } + break; + } + } + } + DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(binFile)); + save(out); + out.writeInt(featureTemplateList.size()); + for (FeatureTemplate template : featureTemplateList) + { + template.save(out); + } + out.close(); + } + + + private TaskType guessModelType(TagSet tagSet) + { + if (tagSet.size() == 4 && + tagSet.idOf("B") != -1 && + tagSet.idOf("M") != -1 && + tagSet.idOf("E") != -1 && + tagSet.idOf("S") != -1 + ) + { + return TaskType.CWS; + } + if (tagSet.idOf("O") != -1) + { + for (String tag : tagSet.tags()) + { + String[] parts = tag.split("-"); + if (parts.length > 1) + { + if (parts[0].length() == 1 && "BMES".contains(parts[0])) + return TaskType.NER; + } + } + } + return TaskType.POS; + } + + public 
FeatureTemplate[] getFeatureTemplateArray() + { + return featureTemplateArray; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/CRFEncoderThread.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/CRFEncoderThread.java new file mode 100644 index 000000000..732944d1d --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/CRFEncoderThread.java @@ -0,0 +1,54 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import java.util.Arrays; +import java.util.List; +import java.util.concurrent.Callable; + +/** + * @author zhifac + */ +public class CRFEncoderThread implements Callable +{ + public List x; + public int start_i; + public int wSize; + public int threadNum; + public int zeroone; + public int err; + public int size; + public double obj; + public double[] expected; + + public CRFEncoderThread(int wsize) + { + if (wsize > 0) + { + this.wSize = wsize; + expected = new double[wsize]; + Arrays.fill(expected, 0.0); + } + } + + public Integer call() + { + obj = 0.0; + err = zeroone = 0; + if (expected == null) + { + expected = new double[wSize]; + } + Arrays.fill(expected, 0.0); + for (int i = start_i; i < size; i = i + threadNum) + { + obj += x.get(i).gradient(expected); + int errorNum = x.get(i).eval(); + x.get(i).clearNodes(); + err += errorNum; + if (errorNum != 0) + { + ++zeroone; + } + } + return err; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/DecoderFeatureIndex.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/DecoderFeatureIndex.java new file mode 100644 index 000000000..ad014d9a1 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/DecoderFeatureIndex.java @@ -0,0 +1,220 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import com.hankcs.hanlp.collection.dartsclone.DoubleArray; +import com.hankcs.hanlp.collection.trie.datrie.MutableDoubleArrayTrieInteger; +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.io.*; +import java.text.DecimalFormat; +import java.util.ArrayList; +import 
java.util.List; + +/** + * @author zhifac + */ +public class DecoderFeatureIndex extends FeatureIndex +{ + private MutableDoubleArrayTrieInteger dat; + + public DecoderFeatureIndex() + { + dat = new MutableDoubleArrayTrieInteger(); + } + + public int getID(String key) + { + return dat.get(key); + } + + public boolean open(InputStream stream) + { + try + { + ObjectInputStream ois = new ObjectInputStream(stream); + int version = (Integer) ois.readObject(); + costFactor_ = (Double) ois.readObject(); + maxid_ = (Integer) ois.readObject(); + xsize_ = (Integer) ois.readObject(); + y_ = (List) ois.readObject(); + unigramTempls_ = (List) ois.readObject(); + bigramTempls_ = (List) ois.readObject(); + dat = (MutableDoubleArrayTrieInteger) ois.readObject(); + alpha_ = (double[]) ois.readObject(); + ois.close(); + return true; + } + catch (Exception e) + { + e.printStackTrace(); + return false; + } + } + + public boolean convert(String binarymodel, String textmodel) + { + try + { + if (!open(IOUtil.newInputStream(binarymodel))) + { + System.err.println("Fail to read binary model " + binarymodel); + return false; + } + OutputStreamWriter osw = new OutputStreamWriter(IOUtil.newOutputStream(textmodel), "UTF-8"); + osw.write("version: " + Encoder.MODEL_VERSION + "\n"); + osw.write("cost-factor: " + costFactor_ + "\n"); + osw.write("maxid: " + maxid_ + "\n"); + osw.write("xsize: " + xsize_ + "\n"); + osw.write("\n"); + for (String y : y_) + { + osw.write(y + "\n"); + } + osw.write("\n"); + for (String utempl : unigramTempls_) + { + osw.write(utempl + "\n"); + } + for (String bitempl : bigramTempls_) + { + osw.write(bitempl + "\n"); + } + osw.write("\n"); + + for (MutableDoubleArrayTrieInteger.KeyValuePair pair : dat) + { + osw.write(pair.getValue() + " " + pair.getKey() + "\n"); + } + + osw.write("\n"); + + for (int k = 0; k < maxid_; k++) + { + String val = new DecimalFormat("0.0000000000000000").format(alpha_[k]); + osw.write(val + "\n"); + } + osw.close(); + return true; + } + 
catch (Exception e) + { + System.err.println(binarymodel + " does not exist"); + return false; + } + } + + public boolean openTextModel(String filename1, boolean cacheBinModel) + { + InputStreamReader isr = null; + try + { + String binFileName = filename1 + ".bin"; + try + { + if (open(IOUtil.newInputStream(binFileName))) + { + System.out.println("Found binary model " + binFileName); + return true; + } + } + catch (IOException e) + { + // load text model + } + + isr = new InputStreamReader(IOUtil.newInputStream(filename1), "UTF-8"); + BufferedReader br = new BufferedReader(isr); + String line; + + int version = Integer.valueOf(br.readLine().substring("version: ".length())); + costFactor_ = Double.valueOf(br.readLine().substring("cost-factor: ".length())); + maxid_ = Integer.valueOf(br.readLine().substring("maxid: ".length())); + xsize_ = Integer.valueOf(br.readLine().substring("xsize: ".length())); + System.out.println("Done reading meta-info"); + br.readLine(); + + while ((line = br.readLine()) != null && line.length() > 0) + { + y_.add(line); + } + System.out.println("Done reading labels"); + while ((line = br.readLine()) != null && line.length() > 0) + { + if (line.startsWith("U")) + { + unigramTempls_.add(line); + } + else if (line.startsWith("B")) + { + bigramTempls_.add(line); + } + } + System.out.println("Done reading templates"); + while ((line = br.readLine()) != null && line.length() > 0) + { + String[] content = line.trim().split(" "); + dat.put(content[1], Integer.valueOf(content[0])); + } + List alpha = new ArrayList(); + while ((line = br.readLine()) != null && line.length() > 0) + { + alpha.add(Double.valueOf(line)); + } + System.out.println("Done reading weights"); + alpha_ = new double[alpha.size()]; + for (int i = 0; i < alpha.size(); i++) + { + alpha_[i] = alpha.get(i); + } + br.close(); + + if (cacheBinModel) + { + System.out.println("Writing binary model to " + binFileName); + ObjectOutputStream oos = new 
ObjectOutputStream(IOUtil.newOutputStream(binFileName)); + oos.writeObject(version); + oos.writeObject(costFactor_); + oos.writeObject(maxid_); + oos.writeObject(xsize_); + oos.writeObject(y_); + oos.writeObject(unigramTempls_); + oos.writeObject(bigramTempls_); + oos.writeObject(dat); + oos.writeObject(alpha_); + oos.close(); + } + } + catch (Exception e) + { + if (isr != null) + { + try + { + isr.close(); + } + catch (Exception e2) + { + } + } + e.printStackTrace(); + System.err.println("Error reading " + filename1); + return false; + } + return true; + } + + public static void main(String[] args) + { + if (args.length < 2) + { + return; + } + else + { + DecoderFeatureIndex featureIndex = new DecoderFeatureIndex(); + if (!featureIndex.convert(args[0], args[1])) + { + System.err.println("fail to convert binary model to text model"); + } + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/DoubleArrayTrieInteger.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/DoubleArrayTrieInteger.java new file mode 100644 index 000000000..e4663c675 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/DoubleArrayTrieInteger.java @@ -0,0 +1,586 @@ +/** + * DoubleArrayTrie: Java implementation of Darts (Double-ARray Trie System) + *

+ * <p/>
+ * Copyright(C) 2001-2007 Taku Kudo <taku@chasen.org>
+ * Copyright(C) 2009 MURAWAKI Yugo <murawaki@nlp.kuee.kyoto-u.ac.jp>
+ * Copyright(C) 2012 KOMIYA Atsushi <komiya.atsushi@gmail.com>
+ * <p/>
+ * <p/>
+ * <p/>
+ * The contents of this file may be used under the terms of either of the GNU
+ * Lesser General Public License Version 2.1 or later (the "LGPL"), or the BSD
+ * License (the "BSD").
+ * <p/>
+ */ +package com.hankcs.hanlp.model.crf.crfpp; + +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.io.*; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Stack; + +/** + * 储存{@code Integer}的{@code DoubleArrayTrie},相当于{@code DoubleArrayTrie},但比后者省内存,所以保留两份代码 + */ +public class DoubleArrayTrieInteger implements Serializable +{ + + private final static int BUF_SIZE = 16384; + private final static int UNIT_SIZE = 8; // size of int + int + private static final long serialVersionUID = -4908582458604586299L; + + private static class Node + { + int code; + int depth; + int left; + int right; + } + + ; + + private int check[]; + private int base[]; + + private boolean used[]; + private int size; + private int allocSize; + private List key; + private int keySize; + private int length[]; + private int value[]; + private int progress; + private int nextCheckPos; + // boolean no_delete_; + int error_; + + private int resize(int newSize) + { + int[] base2 = new int[newSize]; + int[] check2 = new int[newSize]; + boolean used2[] = new boolean[newSize]; + if (allocSize > 0) + { + System.arraycopy(base, 0, base2, 0, allocSize); + System.arraycopy(check, 0, check2, 0, allocSize); + System.arraycopy(used, 0, used2, 0, allocSize); + } + + base = base2; + check = check2; + used = used2; + + return allocSize = newSize; + } + + private int fetch(Node parent, List siblings) + { + if (error_ < 0) + return 0; + + int prev = 0; + + for (int i = parent.left; i < parent.right; i++) + { + if ((length != null ? length[i] : key.get(i).length()) < parent.depth) + continue; + + String tmp = key.get(i); + + int cur = 0; + if ((length != null ? 
length[i] : tmp.length()) != parent.depth) + cur = (int) tmp.charAt(parent.depth) + 1; + + if (prev > cur) + { + error_ = -3; + return 0; + } + + if (cur != prev || siblings.size() == 0) + { + Node tmp_node = new Node(); + tmp_node.depth = parent.depth + 1; + tmp_node.code = cur; + tmp_node.left = i; + if (siblings.size() != 0) + siblings.get(siblings.size() - 1).right = i; + + siblings.add(tmp_node); + } + + prev = cur; + } + + if (siblings.size() != 0) + siblings.get(siblings.size() - 1).right = parent.right; + + return siblings.size(); + } + + private int insert(List siblings) + { + if (error_ < 0) + return 0; + + int begin = 0; + int pos = ((siblings.get(0).code + 1 > nextCheckPos) ? siblings.get(0).code + 1 + : nextCheckPos) - 1; + int nonzero_num = 0; + int first = 0; + + if (allocSize <= pos) + resize(pos + 1); + + outer: + while (true) + { + pos++; + + if (allocSize <= pos) + resize(pos + 1); + + if (check[pos] != 0) + { + nonzero_num++; + continue; + } + else if (first == 0) + { + nextCheckPos = pos; + first = 1; + } + + begin = pos - siblings.get(0).code; + if (allocSize <= (begin + siblings.get(siblings.size() - 1).code)) + { + // progress can be zero + double l = (1.05 > 1.0 * keySize / (progress + 1)) ? 1.05 : 1.0 + * keySize / (progress + 1); + resize((int) (allocSize * l)); + } + + if (used[begin]) + continue; + + for (int i = 1; i < siblings.size(); i++) + if (check[begin + siblings.get(i).code] != 0) + continue outer; + + break; + } + + // -- Simple heuristics -- + // if the percentage of non-empty contents in check between the + // index + // 'next_check_pos' and 'check' is greater than some constant value + // (e.g. 0.9), + // new 'next_check_pos' index is written by 'check'. + if (1.0 * nonzero_num / (pos - nextCheckPos + 1) >= 0.95) + nextCheckPos = pos; + + used[begin] = true; + size = (size > begin + siblings.get(siblings.size() - 1).code + 1) ? 
size + : begin + siblings.get(siblings.size() - 1).code + 1; + + for (int i = 0; i < siblings.size(); i++) + check[begin + siblings.get(i).code] = begin; + + for (int i = 0; i < siblings.size(); i++) + { + List new_siblings = new ArrayList(); + + if (fetch(siblings.get(i), new_siblings) == 0) + { + base[begin + siblings.get(i).code] = (value != null) ? (-value[siblings + .get(i).left] - 1) : (-siblings.get(i).left - 1); + + if (value != null && (-value[siblings.get(i).left] - 1) >= 0) + { + error_ = -2; + return 0; + } + + progress++; + } + else + { + int h = insert(new_siblings); + base[begin + siblings.get(i).code] = h; + } + } + return begin; + } + + public DoubleArrayTrieInteger() + { + check = null; + base = null; + used = null; + size = 0; + allocSize = 0; + // no_delete_ = false; + error_ = 0; + } + + // no deconstructor + + // set_result omitted + // the search methods returns (the list of) the value(s) instead + // of (the list of) the pair(s) of value(s) and length(s) + + // set_array omitted + // array omitted + + void clear() + { + // if (! 
no_delete_) + check = null; + base = null; + used = null; + allocSize = 0; + size = 0; + // no_delete_ = false; + } + + public int getUnitSize() + { + return UNIT_SIZE; + } + + public int getSize() + { + return size; + } + + public int getTotalSize() + { + return size * UNIT_SIZE; + } + + public int getNonzeroSize() + { + int result = 0; + for (int i = 0; i < size; i++) + if (check[i] != 0) + result++; + return result; + } + + public int build(List key) + { + return build(key, null, null, key.size()); + } + + public int build(List _key, int _length[], int _value[], + int _keySize) + { + if (_keySize > _key.size() || _key == null) + return 0; + + key = _key; + length = _length; + keySize = _keySize; + value = _value; + progress = 0; + + resize(65536 * 32); + + base[0] = 1; + nextCheckPos = 0; + + Node root_node = new Node(); + root_node.left = 0; + root_node.right = keySize; + root_node.depth = 0; + + List siblings = new ArrayList(); + fetch(root_node, siblings); + insert(siblings); + + used = null; + key = null; + + return error_; + } + + /* + * recover original key list and value array from a DoubleArrayTrie object + */ + public void recoverKeyValue() + { + key = new ArrayList(); + List val1 = new ArrayList(); + HashMap> childIdxMap = new HashMap>(); + for (int i = 0; i < check.length; i++) + { + if (check[i] <= 0) continue; + if (!childIdxMap.containsKey(check[i])) + { + List childList = new ArrayList(); + childIdxMap.put(check[i], childList); + } + childIdxMap.get(check[i]).add(i); + } + Stack s = new Stack(); + s.add(new Integer[]{1, -1}); + + List charBuf = new ArrayList(); + while (true) + { + Integer[] pair = s.peek(); + List childList = childIdxMap.get(pair[0]); + if (childList == null || (childList.size() - 1) == pair[1]) + { + s.pop(); + if (s.empty()) + { + break; + } + else + { + if (!charBuf.isEmpty()) + { + charBuf.remove(charBuf.size() - 1); + } + continue; + } + } + else + { + pair[1]++; + } + int c = (int) childList.get(pair[1]); + int code = (c - 
1 - pair[0]); + if (base[c] > 0) + { + s.add(new Integer[]{base[c], -1}); + charBuf.add(code); + continue; + } + else if (base[c] < 0) + { + if (check[c] == c) + { + char[] chars = new char[charBuf.size()]; + for (int i = 0; i < charBuf.size(); i++) + { + chars[i] = (char) (int) charBuf.get(i); + } + key.add(new String(chars)); + val1.add(-base[c] - 1); + } + continue; + } + } + if (!val1.isEmpty()) + { + value = new int[val1.size()]; + for (int i = 0; i < val1.size(); i++) + { + value[i] = val1.get(i); + } + } + } + + public void open(String fileName) throws IOException + { + File file = new File(fileName); + size = (int) file.length() / UNIT_SIZE; + check = new int[size]; + base = new int[size]; + + DataInputStream is = null; + try + { + is = new DataInputStream(new BufferedInputStream( + new FileInputStream(file), BUF_SIZE)); + for (int i = 0; i < size; i++) + { + base[i] = is.readInt(); + check[i] = is.readInt(); + } + } + finally + { + if (is != null) + is.close(); + } + } + + public void save(String fileName) throws IOException + { + DataOutputStream out = null; + try + { + out = new DataOutputStream(new BufferedOutputStream( + IOUtil.newOutputStream(fileName))); + for (int i = 0; i < size; i++) + { + out.writeInt(base[i]); + out.writeInt(check[i]); + } + out.close(); + } + finally + { + if (out != null) + out.close(); + } + } + + public int exactMatchSearch(String key) + { + return exactMatchSearch(key, 0, 0, 0); + } + + public int exactMatchSearch(String key, int pos, int len, int nodePos) + { + if (len <= 0) + len = key.length(); + if (nodePos <= 0) + nodePos = 0; + + int result = -1; + + char[] keyChars = key.toCharArray(); + + int b = base[nodePos]; + int p; + + for (int i = pos; i < len; i++) + { + p = b + (int) (keyChars[i]) + 1; + if (b == check[p]) + b = base[p]; + else + return result; + } + + p = b; + int n = base[p]; + if (b == check[p] && n < 0) + { + result = -n - 1; + } + return result; + } + + public List commonPrefixSearch(String key) + { + 
return commonPrefixSearch(key, 0, 0, 0); + } + + public List commonPrefixSearch(String key, int pos, int len, + int nodePos) + { + if (len <= 0) + len = key.length(); + if (nodePos <= 0) + nodePos = 0; + + List result = new ArrayList(); + + char[] keyChars = key.toCharArray(); + + int b = base[nodePos]; + int n; + int p; + + for (int i = pos; i < len; i++) + { + p = b; + n = base[p]; + + if (b == check[p] && n < 0) + { + result.add(-n - 1); + } + + p = b + (int) (keyChars[i]) + 1; + if (b == check[p]) + b = base[p]; + else + return result; + } + + p = b; + n = base[p]; + + if (b == check[p] && n < 0) + { + result.add(-n - 1); + } + + return result; + } + + // debug + public void dump() + { + for (int i = 0; i < size; i++) + { + System.err.println("i: " + i + " [" + base[i] + ", " + check[i] + + "]"); + } + } + + public List getKey() + { + return key; + } + + public int[] getValue() + { + return value; + } + + public void setValue(int[] value) + { + this.value = value; + } + + public void setKey(List key) + { + this.key = key; + } + + public int[] getCheck() + { + return check; + } + + public void setCheck(int[] check) + { + this.check = check; + } + + public int[] getBase() + { + return base; + } + + public void setBase(int[] base) + { + this.base = base; + } + + public int[] getLength() + { + return length; + } + + public void setLength(int[] length) + { + this.length = length; + } + + public void setSize(int size) + { + this.size = size; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Encoder.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Encoder.java new file mode 100644 index 000000000..1746f5fbb --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Encoder.java @@ -0,0 +1,489 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStreamReader; +import java.util.ArrayList; +import java.util.Arrays; 
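As an aside on the serialization handled earlier in this diff: the text model that `DecoderFeatureIndex.convert` writes begins with a fixed four-line header that `openTextModel` later reads back by stripping the `key: ` prefix and parsing the remainder. A minimal self-contained sketch of that round trip (the `100` matches `Encoder.MODEL_VERSION` in this diff; the class name, method names, and sample values here are made up for illustration):

```java
// Sketch of the CRF++-style text-model header written by
// DecoderFeatureIndex.convert and parsed by openTextModel (this diff).
public class CrfppTextHeader
{
    public static String header(double costFactor, int maxid, int xsize)
    {
        return "version: 100\n"
                + "cost-factor: " + costFactor + "\n"
                + "maxid: " + maxid + "\n"
                + "xsize: " + xsize + "\n";
    }

    // openTextModel strips the "key: " prefix and converts the remainder
    public static int parseIntField(String line, String key)
    {
        return Integer.valueOf(line.substring((key + ": ").length()));
    }

    public static double parseDoubleField(String line, String key)
    {
        return Double.valueOf(line.substring((key + ": ").length()));
    }

    public static void main(String[] args)
    {
        System.out.print(header(1.0, 42, 2));
    }
}
```

Keeping the writer and parser symmetric like this is what lets the `.bin` cache be regenerated from the text model at any time.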
+import java.util.Date; +import java.util.List; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.TimeUnit; + +/** + * 训练入口 + * + * @author zhifac + */ +public class Encoder +{ + public static int MODEL_VERSION = 100; + + public enum Algorithm + { + CRF_L2, CRF_L1, MIRA; + + public static Algorithm fromString(String algorithm) + { + algorithm = algorithm.toLowerCase(); + if (algorithm.equals("crf") || algorithm.equals("crf-l2")) + { + return Encoder.Algorithm.CRF_L2; + } + else if (algorithm.equals("crf-l1")) + { + return Encoder.Algorithm.CRF_L1; + } + else if (algorithm.equals("mira")) + { + return Encoder.Algorithm.MIRA; + } + throw new IllegalArgumentException("invalid algorithm: " + algorithm); + } + } + + public Encoder() + { + } + + /** + * 训练 + * + * @param templFile 模板文件 + * @param trainFile 训练文件 + * @param modelFile 模型文件 + * @param textModelFile 是否输出文本形式的模型文件 + * @param maxitr 最大迭代次数 + * @param freq 特征最低频次 + * @param eta 收敛阈值 + * @param C cost-factor + * @param threadNum 线程数 + * @param shrinkingSize + * @param algorithm 训练算法 + * @return + */ + public boolean learn(String templFile, String trainFile, String modelFile, boolean textModelFile, + int maxitr, int freq, double eta, double C, int threadNum, int shrinkingSize, + Algorithm algorithm) + { + if (eta <= 0) + { + System.err.println("eta must be > 0.0"); + return false; + } + if (C < 0.0) + { + System.err.println("C must be >= 0.0"); + return false; + } + if (shrinkingSize < 1) + { + System.err.println("shrinkingSize must be >= 1"); + return false; + } + if (threadNum <= 0) + { + System.err.println("thread must be > 0"); + return false; + } + EncoderFeatureIndex featureIndex = new EncoderFeatureIndex(threadNum); + List x = new ArrayList(); + if (!featureIndex.open(templFile, trainFile)) + { + System.err.println("Fail to open " + templFile + " " + trainFile); + } +// File file = new File(trainFile); +// if (!file.exists()) +// { +// 
System.err.println("train file " + trainFile + " does not exist."); +// return false; +// } + BufferedReader br = null; + try + { + InputStreamReader isr = new InputStreamReader(IOUtil.newInputStream(trainFile), "UTF-8"); + br = new BufferedReader(isr); + int lineNo = 0; + while (true) + { + TaggerImpl tagger = new TaggerImpl(TaggerImpl.Mode.LEARN); + tagger.open(featureIndex); + TaggerImpl.ReadStatus status = tagger.read(br); + if (status == TaggerImpl.ReadStatus.ERROR) + { + System.err.println("error when reading " + trainFile); + return false; + } + if (!tagger.empty()) + { + if (!tagger.shrink()) + { + System.err.println("fail to build feature index "); + return false; + } + tagger.setThread_id_(lineNo % threadNum); + x.add(tagger); + } + else if (status == TaggerImpl.ReadStatus.EOF) + { + break; + } + else + { + continue; + } + if (++lineNo % 100 == 0) + { + System.out.print(lineNo + ".. "); + } + } + br.close(); + } + catch (IOException e) + { + System.err.println("train file " + trainFile + " does not exist."); + return false; + } + featureIndex.shrink(freq, x); + + double[] alpha = new double[featureIndex.size()]; + Arrays.fill(alpha, 0.0); + featureIndex.setAlpha_(alpha); + + System.out.println("Number of sentences: " + x.size()); + System.out.println("Number of features: " + featureIndex.size()); + System.out.println("Number of thread(s): " + threadNum); + System.out.println("Freq: " + freq); + System.out.println("eta: " + eta); + System.out.println("C: " + C); + System.out.println("shrinking size: " + shrinkingSize); + + switch (algorithm) + { + case CRF_L1: + if (!runCRF(x, featureIndex, alpha, maxitr, C, eta, shrinkingSize, threadNum, true)) + { + System.err.println("CRF_L1 execute error"); + return false; + } + break; + case CRF_L2: + if (!runCRF(x, featureIndex, alpha, maxitr, C, eta, shrinkingSize, threadNum, false)) + { + System.err.println("CRF_L2 execute error"); + return false; + } + break; + case MIRA: + if (!runMIRA(x, featureIndex, alpha, 
maxitr, C, eta, shrinkingSize, threadNum)) + { + System.err.println("MIRA execute error"); + return false; + } + break; + default: + break; + } + + if (!featureIndex.save(modelFile, textModelFile)) + { + System.err.println("Failed to save model"); + } + System.out.println("Done!"); + return true; + } + + /** + * CRF训练 + * + * @param x 句子列表 + * @param featureIndex 特征编号表 + * @param alpha 特征函数的代价 + * @param maxItr 最大迭代次数 + * @param C cost factor + * @param eta 收敛阈值 + * @param shrinkingSize 未使用 + * @param threadNum 线程数 + * @param orthant 是否使用L1范数 + * @return 是否成功 + */ + private boolean runCRF(List x, + EncoderFeatureIndex featureIndex, + double[] alpha, + int maxItr, + double C, + double eta, + int shrinkingSize, + int threadNum, + boolean orthant) + { + double oldObj = 1e+37; + int converge = 0; + LbfgsOptimizer lbfgs = new LbfgsOptimizer(); + List threads = new ArrayList(); + + for (int i = 0; i < threadNum; i++) + { + CRFEncoderThread thread = new CRFEncoderThread(alpha.length); + thread.start_i = i; + thread.size = x.size(); + thread.threadNum = threadNum; + thread.x = x; + threads.add(thread); + } + + int all = 0; + for (int i = 0; i < x.size(); i++) + { + all += x.get(i).size(); + } + + ExecutorService executor = Executors.newFixedThreadPool(threadNum); + for (int itr = 0; itr < maxItr; itr++) + { + featureIndex.clear(); + + try + { + executor.invokeAll(threads); + } + catch (Exception e) + { + e.printStackTrace(); + return false; + } + + for (int i = 1; i < threadNum; i++) + { + threads.get(0).obj += threads.get(i).obj; + threads.get(0).err += threads.get(i).err; + threads.get(0).zeroone += threads.get(i).zeroone; + } + for (int i = 1; i < threadNum; i++) + { + for (int k = 0; k < featureIndex.size(); k++) + { + threads.get(0).expected[k] += threads.get(i).expected[k]; + } + } + int numNonZero = 0; + if (orthant) + { + for (int k = 0; k < featureIndex.size(); k++) + { + threads.get(0).obj += Math.abs(alpha[k] / C); + if (alpha[k] != 0.0) + { + numNonZero++; + } 
+ } + } + else + { + numNonZero = featureIndex.size(); + for (int k = 0; k < featureIndex.size(); k++) + { + threads.get(0).obj += (alpha[k] * alpha[k] / (2.0 * C)); + threads.get(0).expected[k] += alpha[k] / C; + } + } + for (int i = 1; i < threadNum; i++) + { + // try to free some memory + threads.get(i).expected = null; + } + + double diff = (itr == 0 ? 1.0 : Math.abs(oldObj - threads.get(0).obj) / oldObj); + StringBuilder b = new StringBuilder(); + b.append("iter=").append(itr); + b.append(" terr=").append(1.0 * threads.get(0).err / all); + b.append(" serr=").append(1.0 * threads.get(0).zeroone / x.size()); + b.append(" act=").append(numNonZero); + b.append(" obj=").append(threads.get(0).obj); + b.append(" diff=").append(diff); + System.out.println(b.toString()); + + oldObj = threads.get(0).obj; + + if (diff < eta) + { + converge++; + } + else + { + converge = 0; + } + + if (itr > maxItr || converge == 3) + { + break; + } + + int ret = lbfgs.optimize(featureIndex.size(), alpha, threads.get(0).obj, threads.get(0).expected, orthant, C); + if (ret <= 0) + { + return false; + } + } + executor.shutdown(); + try + { + executor.awaitTermination(-1, TimeUnit.SECONDS); + } + catch (Exception e) + { + e.printStackTrace(); + System.err.println("fail waiting executor to shutdown"); + } + return true; + } + + public boolean runMIRA(List x, + EncoderFeatureIndex featureIndex, + double[] alpha, + int maxItr, + double C, + double eta, + int shrinkingSize, + int threadNum) + { + Integer[] shrinkArr = new Integer[x.size()]; + Arrays.fill(shrinkArr, 0); + List shrink = Arrays.asList(shrinkArr); + Double[] upperArr = new Double[x.size()]; + Arrays.fill(upperArr, 0.0); + List upperBound = Arrays.asList(upperArr); + Double[] expectArr = new Double[featureIndex.size()]; + List expected = Arrays.asList(expectArr); + + if (threadNum > 1) + { + System.err.println("WARN: MIRA does not support multi-threading"); + } + int converge = 0; + int all = 0; + for (int i = 0; i < x.size(); i++) + 
{ + all += x.get(i).size(); + } + + for (int itr = 0; itr < maxItr; itr++) + { + int zeroone = 0; + int err = 0; + int activeSet = 0; + int upperActiveSet = 0; + double maxKktViolation = 0.0; + + for (int i = 0; i < x.size(); i++) + { + if (shrink.get(i) >= shrinkingSize) + { + continue; + } + ++activeSet; + for (int t = 0; t < expected.size(); t++) + { + expected.set(t, 0.0); + } + double costDiff = x.get(i).collins(expected); + int errorNum = x.get(i).eval(); + err += errorNum; + if (errorNum != 0) + { + ++zeroone; + } + if (errorNum == 0) + { + shrink.set(i, shrink.get(i) + 1); + } + else + { + shrink.set(i, 0); + double s = 0.0; + for (int k = 0; k < expected.size(); k++) + { + s += expected.get(k) * expected.get(k); + } + double mu = Math.max(0.0, (errorNum - costDiff) / s); + + if (upperBound.get(i) + mu > C) + { + mu = C - upperBound.get(i); + upperActiveSet++; + } + else + { + maxKktViolation = Math.max(errorNum - costDiff, maxKktViolation); + } + + if (mu > 1e-10) + { + upperBound.set(i, upperBound.get(i) + mu); + upperBound.set(i, Math.min(C, upperBound.get(i))); + for (int k = 0; k < expected.size(); k++) + { + alpha[k] += mu * expected.get(k); + } + } + } + } + double obj = 0.0; + for (int i = 0; i < featureIndex.size(); i++) + { + obj += alpha[i] * alpha[i]; + } + + StringBuilder b = new StringBuilder(); + b.append("iter=").append(itr); + b.append(" terr=").append(1.0 * err / all); + b.append(" serr=").append(1.0 * zeroone / x.size()); + b.append(" act=").append(activeSet); + b.append(" uact=").append(upperActiveSet); + b.append(" obj=").append(obj); + b.append(" kkt=").append(maxKktViolation); + System.out.println(b.toString()); + + if (maxKktViolation <= 0.0) + { + for (int i = 0; i < shrink.size(); i++) + { + shrink.set(i, 0); + } + converge++; + } + else + { + converge = 0; + } + if (itr > maxItr || converge == 2) + { + break; + } + } + return true; + } + + public static void main(String[] args) + { + if (args.length < 3) + { + 
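The MIRA branch above solves a tiny one-constraint QP per sentence: the step size is `mu = max(0, (loss - margin) / ||expected||^2)`, clipped so the per-sentence dual variable never leaves `[0, C]`. A standalone sketch of just that arithmetic (class, method, and parameter names are mine, not from the diff):

```java
public class MiraStep
{
    // Step size as computed in Encoder.runMIRA above:
    // mu = max(0, (errorNum - costDiff) / ||expected||^2),
    // then clipped so upperBound + mu never exceeds the cost factor C.
    public static double stepSize(double errorNum, double costDiff,
                                  double[] expected, double upperBound, double C)
    {
        double s = 0.0;
        for (double e : expected)
        {
            s += e * e; // squared norm of the feature-expectation difference
        }
        double mu = Math.max(0.0, (errorNum - costDiff) / s);
        if (upperBound + mu > C)
        {
            mu = C - upperBound; // keep the dual variable inside [0, C]
        }
        return mu;
    }

    public static void main(String[] args)
    {
        // 2 tagging errors, zero margin, unit feature difference in 2 dims
        System.out.println(stepSize(2.0, 0.0, new double[]{1.0, 1.0}, 0.0, 10.0));
    }
}
```

When the clip fires (`upperBound + mu > C`), `runMIRA` counts the sentence in `upperActiveSet` instead of letting it contribute to `maxKktViolation`, which is what eventually drives convergence.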
System.err.println("incorrect No. of args"); + return; + } + String templFile = args[0]; + String trainFile = args[1]; + String modelFile = args[2]; + Encoder enc = new Encoder(); + long time1 = new Date().getTime(); + if (!enc.learn(templFile, trainFile, modelFile, false, 100000, 1, 0.0001, 1.0, 1, 20, Algorithm.CRF_L2)) + { + System.err.println("error training model"); + return; + } + System.out.println(new Date().getTime() - time1); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/EncoderFeatureIndex.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/EncoderFeatureIndex.java new file mode 100644 index 000000000..7f92c628f --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/EncoderFeatureIndex.java @@ -0,0 +1,416 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import com.hankcs.hanlp.collection.dartsclone.DoubleArray; +import com.hankcs.hanlp.collection.trie.datrie.IntArrayList; +import com.hankcs.hanlp.collection.trie.datrie.MutableDoubleArrayTrieInteger; +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.io.*; +import java.text.DecimalFormat; +import java.util.*; + +/** + * @author zhifac + */ +public class EncoderFeatureIndex extends FeatureIndex +{ + private MutableDoubleArrayTrieInteger dic_; + private IntArrayList frequency; + private int bId = Integer.MAX_VALUE; + + public EncoderFeatureIndex(int n) + { + threadNum_ = n; + dic_ = new MutableDoubleArrayTrieInteger(); + frequency = new IntArrayList(); + } + + public int getID(String key) + { + int k = dic_.get(key); + if (k == -1) + { + dic_.put(key, maxid_); + frequency.append(1); + int n = maxid_; + if (key.charAt(0) == 'U') + { + maxid_ += y_.size(); + } + else + { + bId = n; + maxid_ += y_.size() * y_.size(); + } + return n; + } + else + { + int cid = continuousId(k); + int oldVal = frequency.get(cid); + frequency.set(cid, oldVal + 1); + return k; + } + } + + private int continuousId(int id) + { + if (id <= bId) + { + return id / y_.size(); + } + else + { + 
return id / y_.size() - y_.size() + 1;
+        }
+    }
+
+    /**
+     * Loads the feature template file.
+     *
+     * @param filename template file path
+     * @return true on success
+     */
+    private boolean openTemplate(String filename)
+    {
+        InputStreamReader isr = null;
+        try
+        {
+            isr = new InputStreamReader(IOUtil.newInputStream(filename), "UTF-8");
+            BufferedReader br = new BufferedReader(isr);
+            String line;
+            while ((line = br.readLine()) != null)
+            {
+                if (line.length() == 0 || line.charAt(0) == ' ' || line.charAt(0) == '#')
+                {
+                    continue;
+                }
+                else if (line.charAt(0) == 'U')
+                {
+                    unigramTempls_.add(line.trim());
+                }
+                else if (line.charAt(0) == 'B')
+                {
+                    bigramTempls_.add(line.trim());
+                }
+                else
+                {
+                    System.err.println("unknown type: " + line);
+                }
+            }
+            br.close();
+            templs_ = makeTempls(unigramTempls_, bigramTempls_);
+        }
+        catch (Exception e)
+        {
+            if (isr != null)
+            {
+                try
+                {
+                    isr.close();
+                }
+                catch (Exception e2)
+                {
+                }
+            }
+            e.printStackTrace();
+            System.err.println("Error reading " + filename);
+            return false;
+        }
+        return true;
+    }
+
+    /**
+     * Reads the tag set from the training file.
+     *
+     * @param filename training file path
+     * @return true on success
+     */
+    private boolean openTagSet(String filename)
+    {
+        int max_size = 0;
+        InputStreamReader isr = null;
+        y_.clear();
+        try
+        {
+            isr = new InputStreamReader(IOUtil.newInputStream(filename), "UTF-8");
+            BufferedReader br = new BufferedReader(isr);
+            String line;
+            while ((line = br.readLine()) != null)
+            {
+                if (line.length() == 0)
+                {
+                    continue;
+                }
+                char firstChar = line.charAt(0);
+                if (firstChar == '\0' || firstChar == ' ' || firstChar == '\t')
+                {
+                    continue;
+                }
+                String[] cols = line.split("[\t ]", -1);
+                if (max_size == 0)
+                {
+                    max_size = cols.length;
+                }
+                if (max_size != cols.length)
+                {
+                    String msg = "inconsistent column size: " + max_size +
+                            " " + cols.length + " " + filename;
+                    throw new RuntimeException(msg);
+                }
+                xsize_ = cols.length - 1;
+                if (y_.indexOf(cols[max_size - 1]) == -1)
+                {
+                    y_.add(cols[max_size - 1]);
+                }
+            }
+            Collections.sort(y_);
+            br.close();
+        }
+        catch (Exception e)
+        {
+            if (isr !=
null) + { + try + { + isr.close(); + } + catch (Exception e2) + { + } + } + e.printStackTrace(); + System.err.println("Error reading " + filename); + return false; + } + return true; + } + + public boolean open(String filename1, String filename2) + { + checkMaxXsize_ = true; + return openTemplate(filename1) && openTagSet(filename2); + } + + public boolean save(String filename, boolean textModelFile) + { + try + { + ObjectOutputStream oos = new ObjectOutputStream(IOUtil.newOutputStream(filename)); + oos.writeObject(Encoder.MODEL_VERSION); + oos.writeObject(costFactor_); + oos.writeObject(maxid_); + if (max_xsize_ > 0) + { + xsize_ = Math.min(xsize_, max_xsize_); + } + oos.writeObject(xsize_); + oos.writeObject(y_); + oos.writeObject(unigramTempls_); + oos.writeObject(bigramTempls_); + oos.writeObject(dic_); +// List keyList = new ArrayList(dic_.size()); +// int[] values = new int[dic_.size()]; +// int i = 0; +// for (MutableDoubleArrayTrieInteger.KeyValuePair pair : dic_) +// { +// keyList.add(pair.key()); +// values[i++] = pair.value(); +// } +// DoubleArray doubleArray = new DoubleArray(); +// doubleArray.build(keyList, values); +// oos.writeObject(doubleArray); + oos.writeObject(alpha_); + oos.close(); + + if (textModelFile) + { + OutputStreamWriter osw = new OutputStreamWriter(IOUtil.newOutputStream(filename + ".txt"), "UTF-8"); + osw.write("version: " + Encoder.MODEL_VERSION + "\n"); + osw.write("cost-factor: " + costFactor_ + "\n"); + osw.write("maxid: " + maxid_ + "\n"); + osw.write("xsize: " + xsize_ + "\n"); + osw.write("\n"); + for (String y : y_) + { + osw.write(y + "\n"); + } + osw.write("\n"); + for (String utempl : unigramTempls_) + { + osw.write(utempl + "\n"); + } + for (String bitempl : bigramTempls_) + { + osw.write(bitempl + "\n"); + } + osw.write("\n"); + for (MutableDoubleArrayTrieInteger.KeyValuePair pair : dic_) + { + osw.write(pair.getValue() + " " + pair.getKey() + "\n"); + } + osw.write("\n"); + + for (int k = 0; k < maxid_; k++) + { + 
String val = new DecimalFormat("0.0000000000000000").format(alpha_[k]); + osw.write(val + "\n"); + } + osw.close(); + } + } + catch (Exception e) + { + e.printStackTrace(); + System.err.println("Error saving model to " + filename); + return false; + } + return true; + } + + public void clear() + { + + } + + public void shrink(int freq, List taggers) + { + if (freq <= 1) + { + return; + } + int newMaxId = 0; + Map old2new = new TreeMap(); + List deletedKeys = new ArrayList(dic_.size() / 8); + List> l = new LinkedList>(dic_.entrySet()); + // update dictionary in key order, to make result compatible with crfpp + for (MutableDoubleArrayTrieInteger.KeyValuePair pair : dic_) + { + String key = pair.key(); + int id = pair.value(); + int cid = continuousId(id); + int f = frequency.get(cid); + if (f >= freq) + { + old2new.put(id, newMaxId); + pair.setValue(newMaxId); + newMaxId += (key.charAt(0) == 'U' ? y_.size() : y_.size() * y_.size()); + } + else + { + deletedKeys.add(key); + } + } + for (String key : deletedKeys) + { + dic_.remove(key); + } + + for (TaggerImpl tagger : taggers) + { + List> featureCache = tagger.getFeatureCache_(); + for (int k = 0; k < featureCache.size(); k++) + { + List featureCacheItem = featureCache.get(k); + List newCache = new ArrayList(); + for (Integer it : featureCacheItem) + { + if (it == -1) + { + continue; + } + Integer nid = old2new.get(it); + if (nid != null) + { + newCache.add(nid); + } + } + newCache.add(-1); + featureCache.set(k, newCache); + } + } + maxid_ = newMaxId; + } + + public boolean convert(String textmodel, String binarymodel) + { + try + { + InputStreamReader isr = new InputStreamReader(IOUtil.newInputStream(textmodel), "UTF-8"); + BufferedReader br = new BufferedReader(isr); + String line; + + int version = Integer.valueOf(br.readLine().substring("version: ".length())); + costFactor_ = Double.valueOf(br.readLine().substring("cost-factor: ".length())); + maxid_ = Integer.valueOf(br.readLine().substring("maxid: ".length())); 
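The `convert()` routine above decodes each header line of the text model by stripping a fixed `key: value` prefix and parsing the remainder. A minimal standalone sketch of that idiom (the class name and sample header are illustrative, not part of HanLP's API):

```java
import java.io.BufferedReader;
import java.io.StringReader;

public class HeaderParseDemo
{
    // Strip a known prefix and parse the rest as an int, mirroring
    // br.readLine().substring("maxid: ".length()) in convert().
    static int intField(String line, String prefix)
    {
        return Integer.valueOf(line.substring(prefix.length()));
    }

    public static void main(String[] args) throws Exception
    {
        // Illustrative header in the same "key: value" layout save() writes.
        String textModel = "version: 100\ncost-factor: 1.0\nmaxid: 12\n";
        BufferedReader br = new BufferedReader(new StringReader(textModel));
        int version = intField(br.readLine(), "version: ");
        double costFactor = Double.valueOf(br.readLine().substring("cost-factor: ".length()));
        int maxid = intField(br.readLine(), "maxid: ");
        System.out.println(version + " " + costFactor + " " + maxid); // prints 100 1.0 12
    }
}
```

Note that, like `convert()`, this assumes the lines really start with the expected prefixes in the expected order; a malformed header fails with a `NumberFormatException` rather than a descriptive error.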
+ xsize_ = Integer.valueOf(br.readLine().substring("xsize: ".length())); + System.out.println("Done reading meta-info"); + br.readLine(); + + while ((line = br.readLine()) != null && line.length() > 0) + { + y_.add(line); + } + System.out.println("Done reading labels"); + while ((line = br.readLine()) != null && line.length() > 0) + { + if (line.startsWith("U")) + { + unigramTempls_.add(line); + } + else if (line.startsWith("B")) + { + bigramTempls_.add(line); + } + } + System.out.println("Done reading templates"); + dic_ = new MutableDoubleArrayTrieInteger(); + while ((line = br.readLine()) != null && line.length() > 0) + { + String[] content = line.trim().split(" "); + if (content.length != 2) + { + System.err.println("feature indices format error"); + return false; + } + dic_.put(content[1], Integer.valueOf(content[0])); + } + System.out.println("Done reading feature indices"); + List alpha = new ArrayList(); + while ((line = br.readLine()) != null && line.length() > 0) + { + alpha.add(Double.valueOf(line)); + } + System.out.println("Done reading weights"); + alpha_ = new double[alpha.size()]; + for (int i = 0; i < alpha.size(); i++) + { + alpha_[i] = alpha.get(i); + } + br.close(); + System.out.println("Writing binary model to " + binarymodel); + return save(binarymodel, false); + } + catch (Exception e) + { + e.printStackTrace(); + return false; + } + } + + public static void main(String[] args) + { + if (args.length < 2) + { + return; + } + else + { + EncoderFeatureIndex featureIndex = new EncoderFeatureIndex(1); + if (!featureIndex.convert(args[0], args[1])) + { + System.err.println("Fail to convert text model"); + } + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/FeatureIndex.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/FeatureIndex.java new file mode 100644 index 000000000..3bda24719 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/FeatureIndex.java @@ -0,0 +1,403 @@ +package 
com.hankcs.hanlp.model.crf.crfpp;
+
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * @author zhifac
+ */
+public abstract class FeatureIndex
+{
+    public static String[] BOS = {"_B-1", "_B-2", "_B-3", "_B-4", "_B-5", "_B-6", "_B-7", "_B-8"};
+    public static String[] EOS = {"_B+1", "_B+2", "_B+3", "_B+4", "_B+5", "_B+6", "_B+7", "_B+8"};
+    protected int maxid_;
+    protected double[] alpha_;
+    protected float[] alphaFloat_;
+    protected double costFactor_;
+    protected int xsize_;
+    protected boolean checkMaxXsize_;
+    protected int max_xsize_;
+    protected int threadNum_;
+    protected List<String> unigramTempls_;
+    protected List<String> bigramTempls_;
+    protected String templs_;
+    protected List<String> y_;
+    protected List<List<Path>> pathList_;
+    protected List<List<Node>> nodeList_;
+
+    public FeatureIndex()
+    {
+        maxid_ = 0;
+        alpha_ = null;
+        alphaFloat_ = null;
+        costFactor_ = 1.0;
+        xsize_ = 0;
+        checkMaxXsize_ = false;
+        max_xsize_ = 0;
+        threadNum_ = 1;
+        unigramTempls_ = new ArrayList<String>();
+        bigramTempls_ = new ArrayList<String>();
+        y_ = new ArrayList<String>();
+    }
+
+    protected abstract int getID(String s);
+
+    /**
+     * Computes the cost of the state (emission) feature functions at a node.
+     *
+     * @param node the node
+     */
+    public void calcCost(Node node)
+    {
+        node.cost = 0.0;
+        if (alphaFloat_ != null)
+        {
+            float c = 0.0f;
+            for (int i = 0; node.fVector.get(i) != -1; i++)
+            {
+                c += alphaFloat_[node.fVector.get(i) + node.y];
+            }
+            node.cost = costFactor_ * c;
+        }
+        else
+        {
+            double c = 0.0;
+            for (int i = 0; node.fVector.get(i) != -1; i++)
+            {
+                c += alpha_[node.fVector.get(i) + node.y];
+            }
+            node.cost = costFactor_ * c;
+        }
+    }
+
+    /**
+     * Computes the cost of the transition feature functions on an edge.
+     *
+     * @param path the edge
+     */
+    public void calcCost(Path path)
+    {
+        path.cost = 0.0;
+        if (alphaFloat_ != null)
+        {
+            float c = 0.0f;
+            for (int i = 0; path.fvector.get(i) != -1; i++)
+            {
+                c += alphaFloat_[path.fvector.get(i) + path.lnode.y * y_.size() + path.rnode.y];
+            }
+            path.cost = costFactor_ * c;
+        }
+        else
+        {
+            double c = 0.0;
+            for (int i = 0; path.fvector.get(i) != -1; i++)
+            {
+                c
+= alpha_[path.fvector.get(i) + path.lnode.y * y_.size() + path.rnode.y]; + } + path.cost = costFactor_ * c; + } + } + + public String makeTempls(List unigramTempls, List bigramTempls) + { + StringBuilder sb = new StringBuilder(); + for (String temp : unigramTempls) + { + sb.append(temp).append("\n"); + } + for (String temp : bigramTempls) + { + sb.append(temp).append("\n"); + } + return sb.toString(); + } + + public String getTemplate() + { + return templs_; + } + + public String getIndex(String[] idxStr, int cur, TaggerImpl tagger) + { + int row = Integer.valueOf(idxStr[0]); + int col = Integer.valueOf(idxStr[1]); + int pos = row + cur; + if (row < -EOS.length || row > EOS.length || col < 0 || col >= tagger.xsize()) + { + return null; + } + + //TODO(taku): very dirty workaround + if (checkMaxXsize_) + { + max_xsize_ = Math.max(max_xsize_, col + 1); + } + if (pos < 0) + { + return BOS[-pos - 1]; + } + else if (pos >= tagger.size()) + { + return EOS[pos - tagger.size()]; + } + else + { + return tagger.x(pos, col); + } + } + + public String applyRule(String str, int cur, TaggerImpl tagger) + { + StringBuilder sb = new StringBuilder(); + for (String tmp : str.split("%x", -1)) + { + if (tmp.startsWith("U") || tmp.startsWith("B")) + { + sb.append(tmp); + } + else if (tmp.length() > 0) + { + String[] tuple = tmp.split("]"); + String[] idx = tuple[0].replace("[", "").split(","); + String r = getIndex(idx, cur, tagger); + if (r != null) + { + sb.append(r); + } + if (tuple.length > 1) + { + sb.append(tuple[1]); + } + } + } + + return sb.toString(); + } + + private boolean buildFeatureFromTempl(List feature, List templs, int curPos, TaggerImpl tagger) + { + for (String tmpl : templs) + { + String featureID = applyRule(tmpl, curPos, tagger); + if (featureID == null || featureID.length() == 0) + { + System.err.println("format error"); + return false; + } + int id = getID(featureID); + if (id != -1) + { + feature.add(id); + } + } + return true; + } + + public boolean 
buildFeatures(TaggerImpl tagger) + { + List feature = new ArrayList(); + List> featureCache = tagger.getFeatureCache_(); + tagger.setFeature_id_(featureCache.size()); + + for (int cur = 0; cur < tagger.size(); cur++) + { + if (!buildFeatureFromTempl(feature, unigramTempls_, cur, tagger)) + { + return false; + } + feature.add(-1); + featureCache.add(feature); + feature = new ArrayList(); + } + for (int cur = 1; cur < tagger.size(); cur++) + { + if (!buildFeatureFromTempl(feature, bigramTempls_, cur, tagger)) + { + return false; + } + feature.add(-1); + featureCache.add(feature); + feature = new ArrayList(); + } + return true; + } + + public void rebuildFeatures(TaggerImpl tagger) + { + int fid = tagger.getFeature_id_(); + List> featureCache = tagger.getFeatureCache_(); + for (int cur = 0; cur < tagger.size(); cur++) + { + List f = featureCache.get(fid++); + for (int i = 0; i < y_.size(); i++) + { + Node n = new Node(); + n.clear(); + n.x = cur; + n.y = i; + n.fVector = f; + tagger.set_node(n, cur, i); + } + } + for (int cur = 1; cur < tagger.size(); cur++) + { + List f = featureCache.get(fid++); + for (int j = 0; j < y_.size(); j++) + { + for (int i = 0; i < y_.size(); i++) + { + Path p = new Path(); + p.clear(); + p.add(tagger.node(cur - 1, j), tagger.node(cur, i)); + p.fvector = f; + } + } + } + } + + public boolean open(String file) + { + return true; + } + + public boolean open(InputStream stream) + { + return true; + } + + public void clear() + { + + } + + public int size() + { + return getMaxid_(); + } + + public int ysize() + { + return y_.size(); + } + + public int getMaxid_() + { + return maxid_; + } + + public void setMaxid_(int maxid_) + { + this.maxid_ = maxid_; + } + + public double[] getAlpha_() + { + return alpha_; + } + + public void setAlpha_(double[] alpha_) + { + this.alpha_ = alpha_; + } + + public float[] getAlphaFloat_() + { + return alphaFloat_; + } + + public void setAlphaFloat_(float[] alphaFloat_) + { + this.alphaFloat_ = alphaFloat_; + } + 
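The scoring convention used by `calcCost()` above is easy to miss: a feature vector stores base feature ids terminated by `-1`, and the weight of feature `f` for label `y` sits at `alpha[f + y]`, because `getID()` reserves one slot per label for each unigram feature. A small self-contained sketch of that convention (names are illustrative, not HanLP's API):

```java
import java.util.Arrays;
import java.util.List;

public class NodeCostDemo
{
    // Sum the weights of a -1-terminated feature vector for label y,
    // scaled by the cost factor, exactly as calcCost(Node) does.
    static double nodeCost(double[] alpha, List<Integer> fVector, int y, double costFactor)
    {
        double c = 0.0;
        for (int i = 0; fVector.get(i) != -1; i++)
        {
            c += alpha[fVector.get(i) + y]; // per-label slot within each feature block
        }
        return costFactor * c;
    }

    public static void main(String[] args)
    {
        // Two unigram features with base ids 0 and 4, a tag set of size 2, label y = 1:
        // the relevant weights are alpha[1] and alpha[5].
        double[] alpha = {0.5, -0.25, 0.0, 0.0, 1.0, 2.0};
        List<Integer> fVector = Arrays.asList(0, 4, -1);
        System.out.println(nodeCost(alpha, fVector, 1, 1.0)); // -0.25 + 2.0 = 1.75
    }
}
```

Bigram features follow the same pattern with `y_.size() * y_.size()` slots per feature, indexed by `lnode.y * y_.size() + rnode.y`.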
+ public double getCostFactor_() + { + return costFactor_; + } + + public void setCostFactor_(double costFactor_) + { + this.costFactor_ = costFactor_; + } + + public int getXsize_() + { + return xsize_; + } + + public void setXsize_(int xsize_) + { + this.xsize_ = xsize_; + } + + public int getMax_xsize_() + { + return max_xsize_; + } + + public void setMax_xsize_(int max_xsize_) + { + this.max_xsize_ = max_xsize_; + } + + public int getThreadNum_() + { + return threadNum_; + } + + public void setThreadNum_(int threadNum_) + { + this.threadNum_ = threadNum_; + } + + public List getUnigramTempls_() + { + return unigramTempls_; + } + + public void setUnigramTempls_(List unigramTempls_) + { + this.unigramTempls_ = unigramTempls_; + } + + public List getBigramTempls_() + { + return bigramTempls_; + } + + public void setBigramTempls_(List bigramTempls_) + { + this.bigramTempls_ = bigramTempls_; + } + + public List getY_() + { + return y_; + } + + public void setY_(List y_) + { + this.y_ = y_; + } + + public List> getPathList_() + { + return pathList_; + } + + public void setPathList_(List> pathList_) + { + this.pathList_ = pathList_; + } + + public List> getNodeList_() + { + return nodeList_; + } + + public void setNodeList_(List> nodeList_) + { + this.nodeList_ = nodeList_; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/LbfgsOptimizer.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/LbfgsOptimizer.java new file mode 100644 index 000000000..d6a908543 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/LbfgsOptimizer.java @@ -0,0 +1,339 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import java.util.Arrays; + +/** + * @author zhifac + */ +public class LbfgsOptimizer +{ + int iflag_, iscn, nfev, iycn, point, npt, iter, info, ispt, isyt, iypt, maxfev; + double stp, stp1; + double[] diag_ = null; + double[] w_ = null; + double[] v_ = null; + double[] xi_ = null; + Mcsrch mcsrch_ = null; + + public void pseudo_gradient(int size, + 
double[] v, + double[] x, + double[] g, + double C) + { + for (int i = 0; i < size; ++i) + { + if (x[i] == 0) + { + if (g[i] + C < 0) + { + v[i] = g[i] + C; + } + else if (g[i] - C > 0) + { + v[i] = g[i] - C; + } + else + { + v[i] = 0; + } + } + else + { + v[i] = g[i] + C * Mcsrch.sigma(x[i]); + } + } + } + + int lbfgs_optimize(int size, + int msize, + double[] x, + double f, + double[] g, + double[] diag, + double[] w, boolean orthant, double C, + double[] v, double[] xi, int iflag) + { + double yy = 0.0; + double ys = 0.0; + int bound = 0; + int cp = 0; + + if (orthant) + { + pseudo_gradient(size, v, x, g, C); + } + + if (mcsrch_ == null) + { + mcsrch_ = new Mcsrch(); + } + + boolean firstLoop = true; + + // initialization + if (iflag == 0) + { + point = 0; + for (int i = 0; i < size; ++i) + { + diag[i] = 1.0; + } + ispt = size + (msize << 1); + iypt = ispt + size * msize; + for (int i = 0; i < size; ++i) + { + w[ispt + i] = -v[i] * diag[i]; + } + stp1 = 1.0 / Math.sqrt(Mcsrch.ddot_(size, v, 0, v, 0)); + } + + // MAIN ITERATION LOOP + while (true) + { + if (!firstLoop || (firstLoop && iflag != 1 && iflag != 2)) + { + ++iter; + info = 0; + if (orthant) + { + for (int i = 0; i < size; ++i) + { + xi[i] = (x[i] != 0 ? Mcsrch.sigma(x[i]) : Mcsrch.sigma(-v[i])); + } + } + if (iter != 1) + { + if (iter > size) bound = size; + + // COMPUTE -H*G USING THE FORMULA GIVEN IN: Nocedal, J. 1980, + // "Updating quasi-Newton matrices with limited storage", + // Mathematics of Computation, Vol.24, No.151, pp. 773-782. 
+ ys = Mcsrch.ddot_(size, w, iypt + npt, w, ispt + npt); + yy = Mcsrch.ddot_(size, w, iypt + npt, w, iypt + npt); + for (int i = 0; i < size; ++i) + { + diag[i] = ys / yy; + } + } + } + if (iter != 1 && (!firstLoop || (iflag != 1 && firstLoop))) + { + cp = point; + if (point == 0) + { + cp = msize; + } + w[size + cp - 1] = 1.0 / ys; + + for (int i = 0; i < size; ++i) + { + w[i] = -v[i]; + } + + bound = Math.min(iter - 1, msize); + + cp = point; + for (int i = 0; i < bound; ++i) + { + --cp; + if (cp == -1) + { + cp = msize - 1; + } + double sq = Mcsrch.ddot_(size, w, ispt + cp * size, w, 0); + int inmc = size + msize + cp; + iycn = iypt + cp * size; + w[inmc] = w[size + cp] * sq; + double d = -w[inmc]; + Mcsrch.daxpy_(size, d, w, iycn, w, 0); + } + + for (int i = 0; i < size; ++i) + { + w[i] = diag[i] * w[i]; + } + + for (int i = 0; i < bound; ++i) + { + double yr = Mcsrch.ddot_(size, w, iypt + cp * size, w, 0); + double beta = w[size + cp] * yr; + int inmc = size + msize + cp; + beta = w[inmc] - beta; + iscn = ispt + cp * size; + Mcsrch.daxpy_(size, beta, w, iscn, w, 0); + ++cp; + if (cp == msize) + { + cp = 0; + } + } + + if (orthant) + { + for (int i = 0; i < size; ++i) + { + w[i] = (Mcsrch.sigma(w[i]) == Mcsrch.sigma(-v[i]) ? 
w[i] : 0); + } + } + // STORE THE NEW SEARCH DIRECTION + for (int i = 0; i < size; ++i) + { + w[ispt + point * size + i] = w[i]; + } + } + // OBTAIN THE ONE-DIMENSIONAL MINIMIZER OF THE FUNCTION + // BY USING THE LINE SEARCH ROUTINE MCSRCH + if (!firstLoop || (firstLoop && iflag != 1)) + { + nfev = 0; + stp = 1.0; + if (iter == 1) + { + stp = stp1; + } + for (int i = 0; i < size; ++i) + { + w[i] = g[i]; + } + } + double[] stpArr = {stp}; + int[] infoArr = {info}; + int[] nfevArr = {nfev}; + + mcsrch_.mcsrch(size, x, f, v, w, ispt + point * size, + stpArr, infoArr, nfevArr, diag); + stp = stpArr[0]; + info = infoArr[0]; + nfev = nfevArr[0]; + + if (info == -1) + { + if (orthant) + { + for (int i = 0; i < size; ++i) + { + x[i] = (Mcsrch.sigma(x[i]) == Mcsrch.sigma(xi[i]) ? x[i] : 0); + } + } + return 1; // next value + } + if (info != 1) + { + System.err.println("The line search routine mcsrch failed: error code:" + info); + return -1; + } + + // COMPUTE THE NEW STEP AND GRADIENT CHANGE + npt = point * size; + for (int i = 0; i < size; ++i) + { + w[ispt + npt + i] = stp * w[ispt + npt + i]; + w[iypt + npt + i] = g[i] - w[i]; + } + ++point; + if (point == msize) point = 0; + + double gnorm = Math.sqrt(Mcsrch.ddot_(size, v, 0, v, 0)); + double xnorm = Math.max(1.0, Math.sqrt(Mcsrch.ddot_(size, x, 0, x, 0))); + if (gnorm / xnorm <= Mcsrch.eps) + { + return 0; // OK terminated + } + + firstLoop = false; + } + } + + + public LbfgsOptimizer() + { + iflag_ = iscn = nfev = 0; + iycn = point = npt = iter = info = ispt = isyt = iypt = maxfev = 0; + mcsrch_ = null; + } + + public void clear() + { + iflag_ = iscn = nfev = iycn = point = npt = + iter = info = ispt = isyt = iypt = 0; + stp = stp1 = 0.0; + diag_ = null; + w_ = null; + v_ = null; + mcsrch_ = null; + } + + public int init(int n, int m) + { + //This is old interface for backword compatibility + final int msize = 5; + final int size = n; + iflag_ = 0; + w_ = new double[size * (2 * msize + 1) + 2 * msize]; + 
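The workspace size `size * (2 * msize + 1) + 2 * msize` allocated here matches the offsets `ispt` and `iypt` computed inside `lbfgs_optimize()`: the buffer packs the search direction, the rho and alpha scratch values of the two-loop recursion, and the `s`/`y` vector histories end to end. A sketch of that layout (class and method names are illustrative, not HanLP's API):

```java
public class LbfgsLayoutDemo
{
    // Same workspace formula as init()/optimize():
    //   [0, size)                       search direction / scratch vector
    //   [size, size + msize)            rho values (1 / y·s)
    //   [size + msize, size + 2*msize)  alpha coefficients of the two-loop recursion
    //   [ispt, ispt + size*msize)       s-vector history
    //   [iypt, iypt + size*msize)       y-vector history
    static int workspaceLength(int size, int msize)
    {
        return size * (2 * msize + 1) + 2 * msize;
    }

    public static void main(String[] args)
    {
        int size = 1000; // number of weights, illustrative
        int msize = 5;   // history size fixed by optimize()
        int ispt = size + (msize << 1); // start of the s history, as in lbfgs_optimize()
        int iypt = ispt + size * msize; // start of the y history
        // The y history is the final block, so the buffer ends right after it.
        System.out.println(workspaceLength(size, msize) == iypt + size * msize); // prints true
    }
}
```

This is why `optimize()` can detect a size mismatch simply by comparing `diag_.length` against the current `size`: every block of the packed buffer scales with it.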
Arrays.fill(w_, 0.0); + diag_ = new double[size]; + v_ = new double[size]; + return 0; + } + + public int optimize(double[] x, double f, double[] g) + { + return optimize(diag_.length, x, f, g, false, 1.0); + } + + public int optimize(int size, double[] x, double f, double[] g, boolean orthant, double C) + { + int msize = 5; + if (w_ == null) + { + iflag_ = 0; + w_ = new double[size * (2 * msize + 1) + 2 * msize]; + Arrays.fill(w_, 0.0); + diag_ = new double[size]; + v_ = new double[size]; + if (orthant) + { + xi_ = new double[size]; + } + } + else if (diag_.length != size || v_.length != size) + { + System.err.println("size of array is different"); + return -1; + } + else if (orthant && v_.length != size) + { + System.err.println("size of array is different"); + return -1; + } + int iflag = 0; + if (orthant) + { + + iflag = lbfgs_optimize(size, + msize, x, f, g, diag_, w_, orthant, C, v_, xi_, iflag_); + iflag_ = iflag; + } + else + { + iflag = lbfgs_optimize(size, + msize, x, f, g, diag_, w_, orthant, C, g, xi_, iflag_); + iflag_ = iflag; + } + + if (iflag < 0) + { + System.err.println("routine stops with unexpected error"); + return -1; + } + + if (iflag == 0) + { + clear(); + return 0; // terminate + } + + return 1; // evaluate next f and g + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Mcsrch.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Mcsrch.java new file mode 100644 index 000000000..98cc6b251 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Mcsrch.java @@ -0,0 +1,464 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +/** + * @author zhifac + */ +public class Mcsrch +{ + public static final double ftol = 1e-4; + public static final double xtol = 1e-16; + public static final double eps = 1e-7; + public static final double lb3_1_gtol = 0.9; + public static final double lb3_1_stpmin = 1e-20; + public static final double lb3_1_stpmax = 1e20; + public static final int lb3_1_mp = 6; + public static final int lb3_1_lp = 
6; + + int infoc; + boolean stage1, brackt; + double finit, dginit, dgtest, width, width1; + double stx, fx, dgx, sty, fy, dgy, stmin, stmax; + + public Mcsrch() + { + infoc = 0; + stage1 = brackt = false; + finit = dginit = dgtest = width = width1 = 0.0; + stx = fx = dgx = sty = fy = dgy = stmin = stmax = 0.0; + } + + public static double sigma(double x) + { + if (x > 0) return 1.0; + else if (x < 0) return -1.0; + return 0.0; + } + + public double pi(double x, double y) + { + return sigma(x) == sigma(y) ? x : 0.0; + } + + public static void daxpy_(int n, double da, double[] dx, int offsetX, double[] dy, int offsetY) + { + for (int i = 0; i < n; ++i) + dy[i + offsetY] += da * dx[i + offsetX]; + } + + public static double ddot_(int size, double[] dx, int offsetX, double[] dy, int offsetY) + { + double res = 0.0; + for (int i = 0; i < size; i++) + { + res += dx[i + offsetX] * dy[i + offsetY]; + } + return res; + } + + public static void mcstep(double[] stx, double[] fx, double[] dx, + double[] sty, double[] fy, double[] dy, + double[] stp, double fp, double dp, + boolean[] brackt, + double stpmin, double stpmax, + int[] info) + { + boolean bound = true; + double p, q, s, d1, d2, d3, r, gamma, theta, stpq, stpc, stpf; + info[0] = 0; + + if (brackt[0] && ((stp[0] <= Math.min(stx[0], sty[0]) || stp[0] >= Math.max(stx[0], sty[0])) || + dx[0] * (stp[0] - stx[0]) >= 0.0 || stpmax < stpmin)) + { + return; + } + + double sgnd = dp * (dx[0] / Math.abs(dx[0])); + + if (fp > fx[0]) + { + info[0] = 1; + bound = true; + theta = (fx[0] - fp) * 3 / (stp[0] - stx[0]) + dx[0] + dp; + d1 = Math.abs(theta); + d2 = Math.abs(dx[0]); + d1 = Math.max(d1, d2); + d2 = Math.abs(dp); + s = Math.max(d1, d2); + d1 = theta / s; + gamma = s * Math.sqrt(d1 * d1 - dx[0] / s * (dp / s)); + if (stp[0] < stx[0]) + { + gamma = -gamma; + } + p = gamma - dx[0] + theta; + q = gamma - dx[0] + gamma + dp; + r = p / q; + stpc = stx[0] + r * (stp[0] - stx[0]); + stpq = stx[0] + dx[0] / ((fx[0] - fp) / + 
(stp[0] - stx[0]) + dx[0]) / 2 * (stp[0] - stx[0]); + d1 = stpc - stx[0]; + d2 = stpq - stx[0]; + if (Math.abs(d1) < Math.abs(d2)) + { + stpf = stpc; + } + else + { + stpf = stpc + (stpq - stpc) / 2; + } + brackt[0] = true; + } + else if (sgnd < 0.0) + { + info[0] = 2; + bound = false; + theta = (fx[0] - fp) * 3 / (stp[0] - stx[0]) + dx[0] + dp; + d1 = Math.abs(theta); + d2 = Math.abs(dx[0]); + d1 = Math.max(d1, d2); + d2 = Math.abs(dp); + s = Math.max(d1, d2); + d1 = theta / s; + gamma = s * Math.sqrt(d1 * d1 - dx[0] / s * (dp / s)); + if (stp[0] > stx[0]) + { + gamma = -gamma; + } + p = gamma - dp + theta; + q = gamma - dp + gamma + dx[0]; + r = p / q; + stpc = stp[0] + r * (stx[0] - stp[0]); + stpq = stp[0] + dp / (dp - dx[0]) * (stx[0] - stp[0]); + d1 = stpc - stp[0]; + d2 = stpq - stp[0]; + if (Math.abs(d1) > Math.abs(d2)) + { + stpf = stpc; + } + else + { + stpf = stpq; + } + brackt[0] = true; + } + else if (Math.abs(dp) < Math.abs(dx[0])) + { + info[0] = 3; + bound = true; + theta = (fx[0] - fp) * 3 / (stp[0] - stx[0]) + dx[0] + dp; + d1 = Math.abs(theta); + d2 = Math.abs(dx[0]); + d1 = Math.max(d1, d2); + d2 = Math.abs(dp); + s = Math.max(d1, d2); + d3 = theta / s; + d1 = 0.0; + d2 = d3 * d3 - dx[0] / s * (dp / s); + gamma = s * Math.sqrt((Math.max(d1, d2))); + if (stp[0] > stx[0]) + { + gamma = -gamma; + } + p = gamma - dp + theta; + q = gamma + (dx[0] - dp) + gamma; + r = p / q; + if (r < 0.0 && gamma != 0.0) + { + stpc = stp[0] + r * (stx[0] - stp[0]); + } + else if (stp[0] > stx[0]) + { + stpc = stpmax; + } + else + { + stpc = stpmin; + } + stpq = stp[0] + dp / (dp - dx[0]) * (stx[0] - stp[0]); + if (brackt[0]) + { + d1 = stp[0] - stpc; + d2 = stp[0] - stpq; + if (Math.abs(d1) < Math.abs(d2)) + { + stpf = stpc; + } + else + { + stpf = stpq; + } + } + else + { + d1 = stp[0] - stpc; + d2 = stp[0] - stpq; + if (Math.abs(d1) > Math.abs(d2)) + { + stpf = stpc; + } + else + { + stpf = stpq; + } + } + } + else + { + info[0] = 4; + bound = false; + if 
(brackt[0]) + { + theta = (fp - fy[0]) * 3 / (sty[0] - stp[0]) + dy[0] + dp; + d1 = Math.abs(theta); + d2 = Math.abs(dy[0]); + d1 = Math.max(d1, d2); + d2 = Math.abs(dp); + s = Math.max(d1, d2); + d1 = theta / s; + gamma = s * Math.sqrt(d1 * d1 - dy[0] / s * (dp / s)); + if (stp[0] > sty[0]) + { + gamma = -gamma; + } + p = gamma - dp + theta; + q = gamma - dp + gamma + dy[0]; + r = p / q; + stpc = stp[0] + r * (sty[0] - stp[0]); + stpf = stpc; + } + else if (stp[0] > stx[0]) + { + stpf = stpmax; + } + else + { + stpf = stpmin; + } + } + + if (fp > fx[0]) + { + sty[0] = stp[0]; + fy[0] = fp; + dy[0] = dp; + } + else + { + if (sgnd < 0.0) + { + sty[0] = stx[0]; + fy[0] = fx[0]; + dy[0] = dx[0]; + } + stx[0] = stp[0]; + fx[0] = fp; + dx[0] = dp; + } + + stpf = Math.min(stpmax, stpf); + stpf = Math.max(stpmin, stpf); + stp[0] = stpf; + if (brackt[0] && bound) + { + if (sty[0] > stx[0]) + { + d1 = stx[0] + (sty[0] - stx[0]) * 0.66; + stp[0] = Math.min(d1, stp[0]); + } + else + { + d1 = stx[0] + (sty[0] - stx[0]) * 0.66; + stp[0] = Math.max(d1, stp[0]); + } + } + + return; + } + + + void mcsrch(int size, + double[] x, + double f, double[] g, double[] s, int startOffset, + double[] stp, + int[] info, int[] nfev, double[] wa) + { + double p5 = 0.5; + double p66 = 0.66; + double xtrapf = 4.0; + int maxfev = 20; + + if (info[0] != -1) + { + infoc = 1; + + if (size <= 0 || stp[0] <= 0.0) + { + return; + } + + dginit = ddot_(size, g, 0, s, startOffset); + if (dginit >= 0.0) return; + + brackt = false; + stage1 = true; + nfev[0] = 0; + finit = f; + dgtest = ftol * dginit; + width = lb3_1_stpmax - lb3_1_stpmin; + width1 = width / p5; + for (int j = 0; j < size; ++j) + { + wa[j] = x[j]; + } + + stx = 0.0; + fx = finit; + dgx = dginit; + sty = 0.0; + fy = finit; + dgy = dginit; + } + + boolean firstLoop = true; + while (true) + { + if (!firstLoop || (firstLoop && info[0] != -1)) + { + if (brackt) + { + stmin = Math.min(stx, sty); + stmax = Math.max(stx, sty); + } + else + { + 
stmin = stx; + stmax = stp[0] + xtrapf * (stp[0] - stx); + } + + stp[0] = Math.max(stp[0], lb3_1_stpmin); + stp[0] = Math.min(stp[0], lb3_1_stpmax); + + if ((brackt && ((stp[0] <= stmin || stp[0] >= stmax) || + nfev[0] >= maxfev - 1 || infoc == 0)) || + (brackt && (stmax - stmin <= xtol * stmax))) + { + stp[0] = stx; + } + + for (int j = 0; j < size; ++j) + { + x[j] = wa[j] + stp[0] * s[startOffset + j]; + } + info[0] = -1; + return; + } + + info[0] = 0; + ++(nfev[0]); + double dg = ddot_(size, g, 0, s, startOffset); + double ftest1 = finit + stp[0] * dgtest; + + if (brackt && ((stp[0] <= stmin || stp[0] >= stmax) || infoc == 0)) + { + info[0] = 6; + } + if (stp[0] == lb3_1_stpmax && f <= ftest1 && dg <= dgtest) + { + info[0] = 5; + } + if (stp[0] == lb3_1_stpmin && (f > ftest1 || dg >= dgtest)) + { + info[0] = 4; + } + if (nfev[0] >= maxfev) + { + info[0] = 3; + } + if (brackt && stmax - stmin <= xtol * stmax) + { + info[0] = 2; + } + if (f <= ftest1 && Math.abs(dg) <= lb3_1_gtol * (-dginit)) + { + info[0] = 1; + } + if (info[0] != 0) + { + return; + } + + if (stage1 && f <= ftest1 && dg >= Math.min(ftol, lb3_1_gtol) * dginit) + { + stage1 = false; + } + + if (stage1 && f <= fx && f > ftest1) + { + double fm = f - stp[0] * dgtest; + double fxm = fx - stx * dgtest; + double fym = fy - sty * dgtest; + double dgm = dg - dgtest; + double dgxm = dgx - dgtest; + double dgym = dgy - dgtest; + + double[] stxArr = {stx}; + double[] fxmArr = {fxm}; + double[] dgxmArr = {dgxm}; + double[] styArr = {sty}; + double[] fymArr = {fym}; + double[] dgymArr = {dgym}; + boolean[] bracktArr = {brackt}; + int[] infocArr = {infoc}; + mcstep(stxArr, fxmArr, dgxmArr, styArr, fymArr, dgymArr, stp, fm, dgm, bracktArr, + stmin, stmax, infocArr); + stx = stxArr[0]; + fxm = fxmArr[0]; + dgxm = dgxmArr[0]; + sty = styArr[0]; + fym = fymArr[0]; + dgym = dgymArr[0]; + brackt = bracktArr[0]; + infoc = infocArr[0]; + + fx = fxm + stx * dgtest; + fy = fym + sty * dgtest; + dgx = dgxm + dgtest; + dgy 
= dgym + dgtest; + } + else + { + double[] stxArr = {stx}; + double[] fxArr = {fx}; + double[] dgxArr = {dgx}; + double[] styArr = {sty}; + double[] fyArr = {fy}; + double[] dgyArr = {dgy}; + boolean[] bracktArr = {brackt}; + int[] infocArr = {infoc}; + mcstep(stxArr, fxArr, dgxArr, styArr, fyArr, dgyArr, stp, f, dg, bracktArr, + stmin, stmax, infocArr); + stx = stxArr[0]; + fx = fxArr[0]; + dgx = dgxArr[0]; + sty = styArr[0]; + fy = fyArr[0]; + dgy = dgyArr[0]; + brackt = bracktArr[0]; + infoc = infocArr[0]; + } + + if (brackt) + { + double d1 = sty - stx; + if (Math.abs(d1) >= p66 * width1) + { + stp[0] = stx + p5 * (sty - stx); + } + width1 = width; + d1 = sty - stx; + width = Math.abs(d1); + } + firstLoop = false; + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Model.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Model.java new file mode 100644 index 000000000..6b0bf8066 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Model.java @@ -0,0 +1,28 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +/** + * @author zhifac + */ +public abstract class Model +{ + + public boolean open(String[] args) + { + return true; + } + + public boolean open(String arg) + { + return true; + } + + public boolean close() + { + return true; + } + + public Tagger createTagger() + { + return null; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/ModelImpl.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/ModelImpl.java new file mode 100644 index 000000000..413a5c6aa --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/ModelImpl.java @@ -0,0 +1,138 @@ +package com.hankcs.hanlp.model.crf.crfpp; + + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.perceptron.cli.Args; +import com.hankcs.hanlp.model.perceptron.cli.Argument; + +import java.io.InputStream; + +/** + * @author zhifac + */ +public class ModelImpl extends Model +{ + private int nbest_; + private int vlevel_; + private 
DecoderFeatureIndex featureIndex_; + + public ModelImpl() + { + nbest_ = vlevel_ = 0; + featureIndex_ = null; + } + + public Tagger createTagger() + { + if (featureIndex_ == null) + { + return null; + } + TaggerImpl tagger = new TaggerImpl(TaggerImpl.Mode.TEST); + tagger.open(featureIndex_, nbest_, vlevel_); + return tagger; + } + + public boolean open(String arg) + { + return open(arg.split(" ", -1)); + } + + private static class Option + { + @Argument(description = "set FILE for model file", alias = "m", required = true) + String model; + @Argument(description = "output n-best results", alias = "n") + Integer nbest = 0; + @Argument(description = "set INT for verbose level", alias = "v") + Integer verbose = 0; + @Argument(description = "set cost factor", alias = "c") + Double cost_factor = 1.0; + } + + public boolean open(String[] args) + { + Option cmd = new Option(); + try + { + Args.parse(cmd, args); + } + catch (IllegalArgumentException e) + { + System.err.println("invalid arguments"); + return false; + } + String model = cmd.model; + int nbest = cmd.nbest; + int vlevel = cmd.verbose; + double costFactor = cmd.cost_factor; + return open(model, nbest, vlevel, costFactor); + } + + public boolean open(InputStream stream, int nbest, int vlevel, double costFactor) + { + featureIndex_ = new DecoderFeatureIndex(); + nbest_ = nbest; + vlevel_ = vlevel; + if (costFactor > 0) + { + featureIndex_.setCostFactor_(costFactor); + } + return featureIndex_.open(stream); + } + + public boolean open(String model, int nbest, int vlevel, double costFactor) + { + try + { + InputStream stream = IOUtil.newInputStream(model); + return open(stream, nbest, vlevel, costFactor); + } + catch (Exception e) + { + return false; + } + } + + public String getTemplate() + { + if (featureIndex_ != null) + { + return featureIndex_.getTemplate(); + } + else + { + return null; + } + } + + public int getNbest_() + { + return nbest_; + } + + public void setNbest_(int nbest_) + { + this.nbest_ = 
nbest_; + } + + public int getVlevel_() + { + return vlevel_; + } + + public void setVlevel_(int vlevel_) + { + this.vlevel_ = vlevel_; + } + + public DecoderFeatureIndex getFeatureIndex_() + { + return featureIndex_; + } + + public void setFeatureIndex_(DecoderFeatureIndex featureIndex_) + { + this.featureIndex_ = featureIndex_; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Node.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Node.java new file mode 100644 index 000000000..4081468e6 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Node.java @@ -0,0 +1,103 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import java.util.ArrayList; +import java.util.List; + +/** + * 图模型中的节点 + * + * @author zhifac + */ +public class Node +{ + public int x; + public int y; + public double alpha; + public double beta; + public double cost; + public double bestCost; + public Node prev; + public List fVector; + public List lpath; + public List rpath; + public static double LOG2 = 0.69314718055; + public static int MINUS_LOG_EPSILON = 50; + + public Node() + { + lpath = new ArrayList(); + rpath = new ArrayList(); + clear(); + bestCost = 0.0; + prev = null; + } + + public static double logsumexp(double x, double y, boolean flg) + { + if (flg) + { + return y; + } + double vmin = Math.min(x, y); + double vmax = Math.max(x, y); + if (vmax > vmin + MINUS_LOG_EPSILON) + { + return vmax; + } + else + { + return vmax + Math.log(Math.exp(vmin - vmax) + 1.0); + } + } + + public void calcAlpha() + { + alpha = 0.0; + for (Path p : lpath) + { + alpha = logsumexp(alpha, p.cost + p.lnode.alpha, p == lpath.get(0)); + } + alpha += cost; + } + + public void calcBeta() + { + beta = 0.0; + for (Path p : rpath) + { + beta = logsumexp(beta, p.cost + p.rnode.beta, p == rpath.get(0)); + } + beta += cost; + } + + /** + * 计算节点期望 + * + * @param expected 输出期望 + * @param Z 规范化因子 + * @param size 标签个数 + */ + public void calcExpectation(double[] expected, double Z, int size) 
+ { + double c = Math.exp(alpha + beta - cost - Z); + for (int i = 0; fVector.get(i) != -1; i++) + { + int idx = fVector.get(i) + y; + expected[idx] += c; + } + for (Path p : lpath) + { + p.calcExpectation(expected, Z, size); + } + } + + public void clear() + { + x = y = 0; + alpha = beta = cost = 0; + prev = null; + fVector = null; + lpath.clear(); + rpath.clear(); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Pair.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Pair.java new file mode 100644 index 000000000..02d3f3201 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Pair.java @@ -0,0 +1,100 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import java.io.Serializable; + +/** + * @author zhifac + */ +public class Pair implements Serializable { + + /** + * Key of this Pair. + */ + private K key; + + /** + * Gets the key for this pair. + * @return key for this pair + */ + public K getKey() { return key; } + + /** + * Value of this this Pair. + */ + private V value; + + /** + * Gets the value for this pair. + * @return value for this pair + */ + public V getValue() { return value; } + + /** + * Creates a new pair + * @param key The key for this pair + * @param value The value to use for this pair + */ + public Pair(K key, V value) { + this.key = key; + this.value = value; + } + + /** + *
String representation of this + * Pair. + * + * The default name/value delimiter '=' is always used.
+ * + * @return String representation of this Pair + */ + @Override + public String toString() { + return key + "=" + value; + } + + /** + *
Generate a hash code for this Pair. + * + * The hash code is calculated using both the name and + * the value of the Pair.
+ * + * @return hash code for this Pair + */ + @Override + public int hashCode() { + // name's hashCode is multiplied by an arbitrary prime number (13) + // in order to make sure there is a difference in the hashCode between + // these two parameters: + // name: a value: aa + // name: aa value: a + return key.hashCode() * 13 + (value == null ? 0 : value.hashCode()); + } + + /** + *
Test this Pair for equality with another + * Object. + * + * If the Object to be tested is not a + * Pair or is null, then this method + * returns false. + * + * Two Pairs are considered equal if and only if + * both the names and values are equal.
+ * + * @param o the Object to test for + * equality with this Pair + * @return true if the given Object is + * equal to this Pair else false + */ + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o instanceof Pair) { + Pair pair = (Pair) o; + if (key != null ? !key.equals(pair.key) : pair.key != null) return false; + if (value != null ? !value.equals(pair.value) : pair.value != null) return false; + return true; + } + return false; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Path.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Path.java new file mode 100644 index 000000000..ce0e6e5d2 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Path.java @@ -0,0 +1,53 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import java.util.List; + +/** + * 边 + * + * @author zhifac + */ +public class Path +{ + public Node rnode; + public Node lnode; + public List fvector; + public double cost; + + public Path() + { + clear(); + } + + public void clear() + { + rnode = lnode = null; + fvector = null; + cost = 0.0; + } + + /** + * 计算边的期望 + * + * @param expected 输出期望 + * @param Z 规范化因子 + * @param size 标签个数 + */ + public void calcExpectation(double[] expected, double Z, int size) + { + double c = Math.exp(lnode.alpha + cost + rnode.beta - Z); + for (int i = 0; fvector.get(i) != -1; i++) + { + int idx = fvector.get(i) + lnode.y * size + rnode.y; + expected[idx] += c; + } + } + + public void add(Node _lnode, Node _rnode) + { + lnode = _lnode; + rnode = _rnode; + lnode.rpath.add(this); + rnode.lpath.add(this); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Tagger.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Tagger.java new file mode 100644 index 000000000..64f424e43 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/Tagger.java @@ -0,0 +1,222 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import java.util.List; + +/** + * @author zhifac + */ +public abstract class 
Tagger +{ + public boolean open(String[] args) + { + return true; + } + + public boolean open(FeatureIndex featureIndex, int nbest, int vlevel, double costFactor) + { + return true; + } + + public boolean open(FeatureIndex featureIndex, int nbest, int vlevel) + { + return true; + } + + public boolean open(String arg) + { + return true; + } + + public boolean add(String[] strArr) + { + return true; + } + + public void close() + { + } + + public float[] weightVector() + { + return null; + } + + public boolean add(String str) + { + return true; + } + + public int size() + { + return 0; + } + + public int xsize() + { + return 0; + } + + public int dsize() + { + return 0; + } + + public int result(int i) + { + return 0; + } + + public int answer(int i) + { + return 0; + } + + public int y(int i) + { + return result(i); + } + + public String y2(int i) + { + return ""; + } + + public String yname(int i) + { + return ""; + } + + public String x(int i, int j) + { + return ""; + } + + public int ysize() + { + return 0; + } + + public double prob(int i, int j) + { + return 0.0; + } + + public double prob(int i) + { + return 0.0; + } + + public double prob() + { + return 0.0; + } + + public double alpha(int i, int j) + { + return 0.0; + } + + public double beta(int i, int j) + { + return 0.0; + } + + public double emissionCost(int i, int j) + { + return 0.0; + } + + public double nextTransitionCost(int i, int j, int k) + { + return 0.0; + } + + public double prevTransitionCost(int i, int j, int k) + { + return 0.0; + } + + public double bestCost(int i, int j) + { + return 0.0; + } + + public List emissionVector(int i, int j) + { + return null; + } + + public List nextTransitionVector(int i, int j, int k) + { + return null; + } + + public List prevTransitionVector(int i, int j, int k) + { + return null; + } + + public double Z() + { + return 0.0; + } + + public boolean parse() + { + return true; + } + + public boolean empty() + { + return true; + } + + public boolean clear() + 
{ + return true; + } + + public boolean next() + { + return true; + } + + public String parse(String str) + { + return ""; + } + + public String toString() + { + return ""; + } + + public String toString(String result, int size) + { + return ""; + } + + public String parse(String str, int size) + { + return ""; + } + + public String parse(String str, int size1, String result, int size2) + { + return ""; + } + + // set token-level penalty. It would be useful for implementing + // Dual decompositon decoding. + // e.g. + // "Dual Decomposition for Parsing with Non-Projective Head Automata" + // Terry Koo Alexander M. Rush Michael Collins Tommi Jaakkola David Sontag + public void setPenalty(int i, int j, double penalty) + { + } + + public double penalty(int i, int j) + { + return 0.0; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/TaggerImpl.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/TaggerImpl.java new file mode 100644 index 000000000..cfc7f7c5d --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/TaggerImpl.java @@ -0,0 +1,1032 @@ +package com.hankcs.hanlp.model.crf.crfpp; + +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.io.*; +import java.util.*; + +/** + * @author zhifac + */ +public class TaggerImpl extends Tagger +{ + class QueueElement + { + Node node; + QueueElement next; + double fx; + double gx; + } + + public enum Mode + { + TEST, LEARN + } + + public enum ReadStatus + { + SUCCESS, EOF, ERROR + } + + Mode mode_ = Mode.TEST; + int vlevel_ = 0; + int nbest_ = 0; + int ysize_; + double cost_; + double Z_; + int feature_id_; + int thread_id_; + FeatureIndex feature_index_; + List> x_; + List> node_; + List answer_; + List result_; + String lastError; + PriorityQueue agenda_; + List> penalty_; + List> featureCache_; + + public TaggerImpl(Mode mode) + { + mode_ = mode; + vlevel_ = 0; + nbest_ = 0; + ysize_ = 0; + Z_ = 0; + feature_id_ = 0; + thread_id_ = 0; + lastError = null; + feature_index_ = null; + x_ = 
new ArrayList>(); + node_ = new ArrayList>(); + answer_ = new ArrayList(); + result_ = new ArrayList(); + agenda_ = null; + penalty_ = new ArrayList>(); + featureCache_ = new ArrayList>(); + } + + public void clearNodes() + { + if (node_ != null && !node_.isEmpty()) + { + for (List n : node_) + { + for (int i = 0; i < n.size(); i++) + { + if (n.get(i) != null) + { + n.get(i).clear(); + n.set(i, null); + } + } + } + } + } + + public void setPenalty(int i, int j, double penalty) + { + if (penalty_.isEmpty()) + { + for (int s = 0; s < node_.size(); s++) + { + List penaltys = Arrays.asList(new Double[ysize_]); + penalty_.add(penaltys); + } + } + penalty_.get(i).set(j, penalty); + } + + public double penalty(int i, int j) + { + return penalty_.isEmpty() ? 0.0 : penalty_.get(i).get(j); + } + + /** + * 前向后向算法 + */ + public void forwardbackward() + { + if (!x_.isEmpty()) + { + for (int i = 0; i < x_.size(); i++) + { + for (int j = 0; j < ysize_; j++) + { + node_.get(i).get(j).calcAlpha(); + } + } + for (int i = x_.size() - 1; i >= 0; i--) + { + for (int j = 0; j < ysize_; j++) + { + node_.get(i).get(j).calcBeta(); + } + } + Z_ = 0.0; + for (int j = 0; j < ysize_; j++) + { + Z_ = Node.logsumexp(Z_, node_.get(0).get(j).beta, j == 0); + } + } + } + + public void viterbi() + { + for (int i = 0; i < x_.size(); i++) + { + for (int j = 0; j < ysize_; j++) + { + double bestc = -1e37; + Node best = null; + List lpath = node_.get(i).get(j).lpath; + for (Path p : lpath) + { + double cost = p.lnode.bestCost + p.cost + node_.get(i).get(j).cost; + if (cost > bestc) + { + bestc = cost; + best = p.lnode; + } + } + node_.get(i).get(j).prev = best; + node_.get(i).get(j).bestCost = best != null ? 
bestc : node_.get(i).get(j).cost; + } + } + double bestc = -1e37; + Node best = null; + int s = x_.size() - 1; + for (int j = 0; j < ysize_; j++) + { + if (bestc < node_.get(s).get(j).bestCost) + { + best = node_.get(s).get(j); + bestc = node_.get(s).get(j).bestCost; + } + } + for (Node n = best; n != null; n = n.prev) + { + result_.set(n.x, n.y); + } + cost_ = -node_.get(x_.size() - 1).get(result_.get(x_.size() - 1)).bestCost; + } + + public void buildLattice() + { + if (!x_.isEmpty()) + { + feature_index_.rebuildFeatures(this); + for (int i = 0; i < x_.size(); i++) + { + for (int j = 0; j < ysize_; j++) + { + feature_index_.calcCost(node_.get(i).get(j)); + List lpath = node_.get(i).get(j).lpath; + for (Path p : lpath) + { + feature_index_.calcCost(p); + } + } + } + + // Add penalty for Dual decomposition. + if (!penalty_.isEmpty()) + { + for (int i = 0; i < x_.size(); i++) + { + for (int j = 0; j < ysize_; j++) + { + node_.get(i).get(j).cost += penalty_.get(i).get(j); + } + } + } + } + } + + public boolean initNbest() + { + if (agenda_ == null) + { + agenda_ = new PriorityQueue(10, new Comparator() + { + public int compare(QueueElement o1, QueueElement o2) + { + return (int) (o1.fx - o2.fx); + } + }); + } + agenda_.clear(); + int k = x_.size() - 1; + for (int i = 0; i < ysize_; i++) + { + QueueElement eos = new QueueElement(); + eos.node = node_.get(k).get(i); + eos.fx = -node_.get(k).get(i).bestCost; + eos.gx = -node_.get(k).get(i).cost; + eos.next = null; + agenda_.add(eos); + } + return true; + } + + public Node node(int i, int j) + { + return node_.get(i).get(j); + } + + public void set_node(Node n, int i, int j) + { + node_.get(i).set(j, n); + } + + public int eval() + { + int err = 0; + for (int i = 0; i < x_.size(); i++) + { + if (!answer_.get(i).equals(result_.get(i))) + { + err++; + } + } + return err; + } + + /** + * 计算梯度 + * + * @param expected 梯度向量 + * @return 损失函数的值 + */ + public double gradient(double[] expected) + { + if (x_.isEmpty()) + { + return 
0.0; + } + buildLattice(); + forwardbackward(); + double s = 0.0; + + for (int i = 0; i < x_.size(); i++) + { + for (int j = 0; j < ysize_; j++) + { + node_.get(i).get(j).calcExpectation(expected, Z_, ysize_); + } + } + for (int i = 0; i < x_.size(); i++) + { + List fvector = node_.get(i).get(answer_.get(i)).fVector; + for (int j = 0; fvector.get(j) != -1; j++) + { + int idx = fvector.get(j) + answer_.get(i); + expected[idx]--; + } + s += node_.get(i).get(answer_.get(i)).cost; //UNIGRAM COST + List lpath = node_.get(i).get(answer_.get(i)).lpath; + for (Path p : lpath) + { + if (p.lnode.y == answer_.get(p.lnode.x)) + { + for (int k = 0; p.fvector.get(k) != -1; k++) + { + int idx = p.fvector.get(k) + p.lnode.y * ysize_ + p.rnode.y; + expected[idx]--; + } + s += p.cost; // BIGRAM COST + break; + } + } + } + + viterbi(); + return Z_ - s; + } + + public double collins(List collins) + { + if (x_.isEmpty()) + { + return 0.0; + } + buildLattice(); + viterbi(); // call for finding argmax y + double s = 0.0; + + int num = 0; + for (int i = 0; i < x_.size(); i++) + { + if (answer_.get(i).equals(result_.get(i))) + { + num++; + } + } + if (num == x_.size()) + { + // if correct parse, do not run forward + backward + return 0.0; + } + + for (int i = 0; i < x_.size(); i++) + { + // answer + s += node_.get(i).get(answer_.get(i)).cost; + List fvector = node_.get(i).get(answer_.get(i)).fVector; + for (int k = 0; fvector.get(k) != -1; k++) + { + int idx = fvector.get(k) + answer_.get(i); + collins.set(idx, collins.get(idx) + 1); + } + List lpath = node_.get(i).get(answer_.get(i)).lpath; + for (Path p : lpath) + { + if (p.lnode.y == answer_.get(p.lnode.x)) + { + for (int j = 0; p.fvector.get(j) != -1; j++) + { + int idx = p.fvector.get(j) + p.lnode.y * ysize_ + p.rnode.y; + collins.set(idx, collins.get(i) + 1); + } + s += p.cost; + break; + } + } + + // result + s -= node_.get(i).get(result_.get(i)).cost; + List fvectorR = node_.get(i).get(result_.get(i)).fVector; + for (int k = 0; 
fvectorR.get(k) != -1; k++) + { + int idx = fvector.get(k) + result_.get(i); + collins.set(idx, collins.get(idx) - 1); + } + List lpathR = node_.get(i).get(result_.get(i)).lpath; + for (Path p : lpathR) + { + if (p.lnode.y == result_.get(p.lnode.x)) + { + for (int j = 0; p.fvector.get(j) != -1; j++) + { + int idx = p.fvector.get(j) + p.lnode.y * ysize_ + p.rnode.y; + collins.set(idx, collins.get(i) - 1); + } + s -= p.cost; + break; + } + } + } + + return -s; + } + + public boolean shrink() + { + if (!feature_index_.buildFeatures(this)) + { + System.err.println("build features failed"); + return false; + } + return true; + } + + public ReadStatus read(BufferedReader br) + { + clear(); + ReadStatus status = ReadStatus.SUCCESS; + try + { + String line; + while (true) + { + if ((line = br.readLine()) == null) + { + return ReadStatus.EOF; + } + else if (line.length() == 0) + { + break; + } + if (!add(line)) + { + System.err.println("fail to add line: " + line); + return ReadStatus.ERROR; + } + } + } + catch (Exception e) + { + e.printStackTrace(); + System.err.println("Error reading stream"); + return ReadStatus.ERROR; + } + return status; + } + + public String toString() + { + StringBuilder sb = new StringBuilder(); + if (nbest_ < 1) + { + if (vlevel_ >= 1) + { + sb.append("# "); + sb.append(prob()); + sb.append("\n"); + } + for (int i = 0; i < x_.size(); i++) + { + for (String s : x_.get(i)) + { + sb.append(s); + sb.append("\t"); + } + sb.append(yname(y(i))); + if (vlevel_ >= 1) + { + sb.append("/"); + sb.append(prob(i)); + } + if (vlevel_ >= 2) + { + for (int j = 0; j < ysize_; j++) + { + sb.append("\t"); + sb.append(yname(j)); + sb.append("/"); + sb.append(prob(i, j)); + } + } + sb.append("\n"); + } + sb.append("\n"); + } + else + { + for (int n = 0; n < nbest_; n++) + { + if (!next()) + { + break; + } + sb.append("# ").append(n).append(" ").append(prob()).append("\n"); + for (int i = 0; i < x_.size(); ++i) + { + for (String s : x_.get(i)) + { + 
sb.append(s).append('\t'); + } + sb.append(yname(y(i))); + if (vlevel_ >= 1) + { + sb.append('/').append(prob(i)); + } + if (vlevel_ >= 2) + { + for (int j = 0; j < ysize_; ++j) + { + sb.append('\t').append(yname(j)).append('/').append(prob(i, j)); + } + } + sb.append('\n'); + } + sb.append('\n'); + } + } + return sb.toString(); + } + + public boolean open(FeatureIndex featureIndex) + { + mode_ = Mode.LEARN; + feature_index_ = featureIndex; + ysize_ = feature_index_.ysize(); + return true; + } + + public boolean open(String filename) + { + return true; + } + + public boolean setModel(ModelImpl model) + { + mode_ = Mode.TEST; + feature_index_ = model.getFeatureIndex_(); + nbest_ = model.getNbest_(); + vlevel_ = model.getVlevel_(); + ysize_ = feature_index_.ysize(); + return true; + } + + public void close() + { + } + + public boolean add(String line) + { + String[] cols = line.split("[\t ]", -1); + return add(cols); + } + + @Override + public boolean add(String[] cols) + { + int xsize = feature_index_.getXsize_(); + if ((mode_ == Mode.LEARN && cols.length < xsize + 1) || + (mode_ == Mode.TEST && cols.length < xsize)) + { + System.err.println("# x is small: size=" + cols.length + " xsize=" + xsize); + return false; + } + x_.add(Arrays.asList(cols)); + result_.add(0); + int tmpAnswer = 0; + if (mode_ == Mode.LEARN) + { + int r = ysize_; + for (int i = 0; i < ysize_; i++) + { + if (cols[xsize].equals(yname(i))) + { + r = i; + } + } + if (r == ysize_) + { + System.err.println("cannot find answer"); + return false; + } + tmpAnswer = r; + } + answer_.add(tmpAnswer); + List l = Arrays.asList(new Node[ysize_]); + node_.add(l); + return true; + } + + public List> getFeatureCache_() + { + return featureCache_; + } + + public void setFeatureCache_(List> featureCache_) + { + this.featureCache_ = featureCache_; + } + + public int size() + { + return x_.size(); + } + + public int xsize() + { + return feature_index_.getXsize_(); + } + + public int dsize() + { + return 
feature_index_.size(); + } + + public float[] weightVector() + { + return feature_index_.getAlphaFloat_(); + } + + public boolean empty() + { + return x_.isEmpty(); + } + + public double prob() + { + return Math.exp(-cost_ - Z_); + } + + public double prob(int i, int j) + { + return toProb(node_.get(i).get(j), Z_); + } + + public double prob(int i) + { + return toProb(node_.get(i).get(result_.get(i)), Z_); + } + + public double alpha(int i, int j) + { + return node_.get(i).get(j).alpha; + } + + public double beta(int i, int j) + { + return node_.get(i).get(j).beta; + } + + public double emissionCost(int i, int j) + { + return node_.get(i).get(j).cost; + } + + public double nextTransitionCost(int i, int j, int k) + { + return node_.get(i).get(j).rpath.get(k).cost; + } + + public double prevTransitionCost(int i, int j, int k) + { + return node_.get(i).get(j).lpath.get(k).cost; + } + + public double bestCost(int i, int j) + { + return node_.get(i).get(j).bestCost; + } + + public List emissionVector(int i, int j) + { + return node_.get(i).get(j).fVector; + } + + public List nextTransitionVector(int i, int j, int k) + { + return node_.get(i).get(j).rpath.get(k).fvector; + } + + public List prevTransitionVector(int i, int j, int k) + { + return node_.get(i).get(j).lpath.get(k).fvector; + } + + public int answer(int i) + { + return answer_.get(i); + } + + public int result(int i) + { + return result_.get(i); + } + + public int y(int i) + { + return result_.get(i); + } + + public String yname(int i) + { + return feature_index_.getY_().get(i); + } + + public String y2(int i) + { + return yname(result_.get(i)); + } + + public String x(int i, int j) + { + return x_.get(i).get(j); + } + + public List x(int i) + { + return x_.get(i); + } + + public String parse(String s) + { + return ""; + } + + public String parse(String s, int i) + { + return ""; + } + + public String parse(String s, int i, String s2, int j) + { + return ""; + } + + public boolean parse() + { + if 
(!feature_index_.buildFeatures(this)) + { + System.err.println("fail to build featureIndex"); + return false; + } + if (x_.isEmpty()) + { + return true; + } + buildLattice(); + if (nbest_ != 0 || vlevel_ >= 1) + { + forwardbackward(); + } + viterbi(); + if (nbest_ != 0) + { + initNbest(); + } + return true; + } + + + public boolean clear() + { + if (mode_ == Mode.TEST) + { + feature_index_.clear(); + } + lastError = null; + x_.clear(); + node_.clear(); + answer_.clear(); + result_.clear(); + featureCache_.clear(); + Z_ = cost_ = 0.0; + return true; + } + + public boolean next() + { + while (!agenda_.isEmpty()) + { + QueueElement top = agenda_.peek(); + Node rnode = top.node; + agenda_.remove(top); + if (rnode.x == 0) + { + for (QueueElement n = top; n != null; n = n.next) + { + result_.set(n.node.x, n.node.y); + } + cost_ = top.gx; + return true; + } + for (Path p : rnode.lpath) + { + QueueElement n = new QueueElement(); + n.node = p.lnode; + n.gx = -p.lnode.cost - p.cost + top.gx; + n.fx = -p.lnode.bestCost - p.cost + top.gx; + n.next = top; + agenda_.add(n); + } + } + return false; + } + + + public float costFactor() + { + return (float) feature_index_.getCostFactor_(); + } + + void setCostFactor(float cost_factor) + { + if (cost_factor > 0) + feature_index_.setCostFactor_(cost_factor); + } + + void setNbest(int nbest) + { + nbest_ = nbest; + } + + private static double toProb(Node n, double Z) + { + return Math.exp(n.alpha + n.beta - n.cost - Z); + } + + public boolean open(FeatureIndex featureIndex, int nbest, int vlevel) + { + return open(featureIndex, nbest, vlevel, 1.0); + } + + public boolean open(FeatureIndex featureIndex, int nbest, int vlevel, double costFactor) + { + if (costFactor <= 0.0) + { + System.err.println("cost factor must be positive"); + return false; + } + nbest_ = nbest; + vlevel_ = vlevel; + feature_index_ = featureIndex; + feature_index_.setCostFactor_(costFactor); + ysize_ = feature_index_.ysize(); + return true; + } + + public boolean 
open(InputStream stream, int nbest, int vlevel, double costFactor) + { + if (costFactor <= 0.0) + { + System.err.println("cost factor must be positive"); + return false; + } + feature_index_ = new DecoderFeatureIndex(); + if (!feature_index_.open(stream)) + { + System.err.println("Failed to open model file "); + return false; + } + nbest_ = nbest; + vlevel_ = vlevel; + feature_index_.setCostFactor_(costFactor); + ysize_ = feature_index_.ysize(); + return true; + } + + public Mode getMode_() + { + return mode_; + } + + public void setMode_(Mode mode_) + { + this.mode_ = mode_; + } + + public int getVlevel_() + { + return vlevel_; + } + + public void setVlevel_(int vlevel_) + { + this.vlevel_ = vlevel_; + } + + public int getNbest_() + { + return nbest_; + } + + public void setNbest_(int nbest_) + { + this.nbest_ = nbest_; + } + + public int getYsize_() + { + return ysize_; + } + + public void setYsize_(int ysize_) + { + this.ysize_ = ysize_; + } + + public double getCost_() + { + return cost_; + } + + public void setCost_(double cost_) + { + this.cost_ = cost_; + } + + public double getZ_() + { + return Z_; + } + + public void setZ_(double z_) + { + Z_ = z_; + } + + public int getFeature_id_() + { + return feature_id_; + } + + public void setFeature_id_(int feature_id_) + { + this.feature_id_ = feature_id_; + } + + public int getThread_id_() + { + return thread_id_; + } + + public void setThread_id_(int thread_id_) + { + this.thread_id_ = thread_id_; + } + + public FeatureIndex getFeature_index_() + { + return feature_index_; + } + + public void setFeature_index_(FeatureIndex feature_index_) + { + this.feature_index_ = feature_index_; + } + + public List> getX_() + { + return x_; + } + + public void setX_(List> x_) + { + this.x_ = x_; + } + + public List> getNode_() + { + return node_; + } + + public void setNode_(List> node_) + { + this.node_ = node_; + } + + public List getAnswer_() + { + return answer_; + } + + public void setAnswer_(List answer_) + { + 
this.answer_ = answer_; + } + + public List getResult_() + { + return result_; + } + + public void setResult_(List result_) + { + this.result_ = result_; + } + + public static void main(String[] args) throws Exception + { + if (args.length < 1) + { + return; + } + + TaggerImpl tagger = new TaggerImpl(Mode.TEST); + InputStream stream = null; + try + { + stream = IOUtil.newInputStream(args[0]); + } + catch (IOException e) + { + System.err.printf("model not exits for %s", args[0]); + return; + } + if (stream != null && !tagger.open(stream, 2, 0, 1.0)) + { + System.err.println("open error"); + return; + } + System.out.println("Done reading model"); + + if (args.length >= 2) + { + InputStream fis = IOUtil.newInputStream(args[1]); + InputStreamReader isr = new InputStreamReader(fis, "UTF-8"); + BufferedReader br = new BufferedReader(isr); + + while (true) + { + ReadStatus status = tagger.read(br); + if (ReadStatus.ERROR == status) + { + System.err.println("read error"); + return; + } + else if (ReadStatus.EOF == status) + { + break; + } + if (tagger.getX_().isEmpty()) + { + break; + } + if (!tagger.parse()) + { + System.err.println("parse error"); + return; + } + System.out.print(tagger.toString()); + } + br.close(); + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/crf_learn.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/crf_learn.java new file mode 100644 index 000000000..62689d0ba --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/crf_learn.java @@ -0,0 +1,138 @@ +package com.hankcs.hanlp.model.crf.crfpp; + + +import com.hankcs.hanlp.model.perceptron.cli.Args; +import com.hankcs.hanlp.model.perceptron.cli.Argument; + +import java.util.List; + +/** + * 对应crf_learn + * + * @author zhifac + */ +public class crf_learn +{ + public static class Option + { + @Argument(description = "use features that occur no less than INT(default 1)", alias = "f") + public Integer freq = 1; + @Argument(description = "set INT for max iterations in 
LBFGS routine(default 10k)", alias = "m") + public Integer maxiter = 10000; + @Argument(description = "set FLOAT for cost parameter(default 1.0)", alias = "c") + public Double cost = 1.0; + @Argument(description = "set FLOAT for termination criterion(default 0.0001)", alias = "e") + public Double eta = 0.0001; + @Argument(description = "convert text model to binary model", alias = "C") + public Boolean convert = false; + @Argument(description = "convert binary model to text model", alias = "T") + public Boolean convert_to_text = false; + @Argument(description = "build also text model file for debugging", alias = "t") + public Boolean textmodel = false; + @Argument(description = "(CRF|CRF-L1|CRF-L2|MIRA) select training algorithm", alias = "a") + public String algorithm = "CRF-L2"; + @Argument(description = "set INT for number of iterations variable needs to be optimal before considered for shrinking. (default 20)", alias = "H") + public Integer shrinking_size = 20; + @Argument(description = "show this help and exit", alias = "h") + public Boolean help = false; + @Argument(description = "number of threads(default auto detect)") + public Integer thread = Runtime.getRuntime().availableProcessors(); + } + + public static boolean run(String args) + { + return run(args.split("\\s")); + } + + public static boolean run(String[] args) + { + Option option = new Option(); + List<String> unknownArgs = null; + try + { + unknownArgs = Args.parse(option, args, false); + } + catch (IllegalArgumentException e) + { + System.err.println(e.getMessage()); + Args.usage(option); + return false; + } + + boolean convert = option.convert; + boolean convertToText = option.convert_to_text; + String[] restArgs = unknownArgs.toArray(new String[0]); + if (option.help || ((convertToText || convert) && restArgs.length != 2) || + (!convert && !convertToText && restArgs.length != 3)) + { + Args.usage(option); + return option.help; + } + int freq = option.freq; + int maxiter = option.maxiter; + double C = 
option.cost; + double eta = option.eta; + boolean textmodel = option.textmodel; + int threadNum = option.thread; + if (threadNum <= 0) + { + threadNum = Runtime.getRuntime().availableProcessors(); + } + int shrinkingSize = option.shrinking_size; + + String algorithm = option.algorithm; + algorithm = algorithm.toLowerCase(); + Encoder.Algorithm algo = Encoder.Algorithm.CRF_L2; + if (algorithm.equals("crf") || algorithm.equals("crf-l2")) + { + algo = Encoder.Algorithm.CRF_L2; + } + else if (algorithm.equals("crf-l1")) + { + algo = Encoder.Algorithm.CRF_L1; + } + else if (algorithm.equals("mira")) + { + algo = Encoder.Algorithm.MIRA; + } + else + { + System.err.println("unknown algorithm: " + algorithm); + return false; + } + if (convert) + { + EncoderFeatureIndex featureIndex = new EncoderFeatureIndex(1); + if (!featureIndex.convert(restArgs[0], restArgs[1])) + { + System.err.println("fail to convert text model"); + return false; + } + } + else if (convertToText) + { + DecoderFeatureIndex featureIndex = new DecoderFeatureIndex(); + if (!featureIndex.convert(restArgs[0], restArgs[1])) + { + System.err.println("fail to convert binary model"); + return false; + } + } + else + { + Encoder encoder = new Encoder(); + if (!encoder.learn(restArgs[0], restArgs[1], restArgs[2], + textmodel, maxiter, freq, eta, C, threadNum, shrinkingSize, algo)) + { + System.err.println("fail to learn model"); + return false; + } + } + return true; + } + + public static void main(String[] args) + { + crf_learn.run(args); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/crf_test.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/crf_test.java new file mode 100644 index 000000000..5b4cbf0f0 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/crf_test.java @@ -0,0 +1,133 @@ +package com.hankcs.hanlp.model.crf.crfpp; + + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.perceptron.cli.Args; +import 
com.hankcs.hanlp.model.perceptron.cli.Argument; + +import java.io.*; +import java.util.List; + +/** + * 对应crf_test + * + * @author zhifac + */ +public class crf_test +{ + private static class Option + { + @Argument(description = "set FILE for model file", alias = "m", required = true) + String model; + @Argument(description = "output n-best results", alias = "n") + Integer nbest = 0; + @Argument(description = "set INT for verbose level", alias = "v") + Integer verbose = 0; + @Argument(description = "set cost factor", alias = "c") + Double cost_factor = 1.0; + @Argument(description = "output file path", alias = "o") + String output; + @Argument(description = "show this help and exit", alias = "h") + Boolean help = false; + } + + public static boolean run(String[] args) + { + Option cmd = new Option(); + List<String> unknownArgs = null; + try + { + unknownArgs = Args.parse(cmd, args, false); + } + catch (IllegalArgumentException e) + { + Args.usage(cmd); + return false; + } + if (cmd.help) + { + Args.usage(cmd); + return true; + } + int nbest = cmd.nbest; + int vlevel = cmd.verbose; + double costFactor = cmd.cost_factor; + String model = cmd.model; + String outputFile = cmd.output; + + TaggerImpl tagger = new TaggerImpl(TaggerImpl.Mode.TEST); + try + { + InputStream stream = IOUtil.newInputStream(model); + if (!tagger.open(stream, nbest, vlevel, costFactor)) + { + System.err.println("open error"); + return false; + } + String[] restArgs = unknownArgs.toArray(new String[0]); + if (restArgs.length == 0) + { + return false; + } + + OutputStreamWriter osw = null; + if (outputFile != null) + { + osw = new OutputStreamWriter(IOUtil.newOutputStream(outputFile)); + } + for (String inputFile : restArgs) + { + InputStream fis = IOUtil.newInputStream(inputFile); + InputStreamReader isr = new InputStreamReader(fis, "UTF-8"); + BufferedReader br = new BufferedReader(isr); + + while (true) + { + TaggerImpl.ReadStatus status = tagger.read(br); + if (TaggerImpl.ReadStatus.ERROR == status) + { 
+ System.err.println("read error"); + return false; + } + else if (TaggerImpl.ReadStatus.EOF == status && tagger.empty()) + { + break; + } + if (!tagger.parse()) + { + System.err.println("parse error"); + return false; + } + if (osw == null) + { + System.out.print(tagger.toString()); + } + else + { + osw.write(tagger.toString()); + } + } + if (osw != null) + { + osw.flush(); + } + br.close(); + } + if (osw != null) + { + osw.close(); + } + } + catch (Exception e) + { + e.printStackTrace(); + return false; + } + return true; + } + + public static void main(String[] args) + { + crf_test.run(args); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/crf/crfpp/package-info.java b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/package-info.java new file mode 100644 index 000000000..57f6116a0 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/crf/crfpp/package-info.java @@ -0,0 +1,13 @@ +/** + * 这个包下面是由Zhifa Chen移植的CRF++。
+ * Some annotation, modification and debugging was done on top of it. I had planned to port it myself, but found this existing port, so there was no need to reinvent the wheel. + * For the underlying theory, see 《CRF++代码分析》. + * This code (including CRF++ and darts-java) is licensed under LGPL & Modified BSD, and the following copyright notice must be retained: + *

+ * Copyright (c) 2001-2008, Taku Kudo + * Copyright(C) 2009 MURAWAKI Yugo + * Copyright(C) 2012 KOMIYA Atsushi + * Copyright(C) 2017 Zhifa Chen + * All rights reserved. + */ +package com.hankcs.hanlp.model.crf.crfpp; \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/FirstOrderHiddenMarkovModel.java b/src/main/java/com/hankcs/hanlp/model/hmm/FirstOrderHiddenMarkovModel.java new file mode 100644 index 000000000..0b6d0272b --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/FirstOrderHiddenMarkovModel.java @@ -0,0 +1,111 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-08 5:34 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.hmm; + +/** + * 一阶隐马尔可夫模型 + * + * @author hankcs + */ +public class FirstOrderHiddenMarkovModel extends HiddenMarkovModel +{ + + /** + * 创建空白的隐马尔可夫模型以供训练 + */ + public FirstOrderHiddenMarkovModel() + { + this(null, null, null); + } + + public FirstOrderHiddenMarkovModel(float[] start_probability, float[][] transition_probability, float[][] emission_probability) + { + super(start_probability, transition_probability, emission_probability); + toLog(); + } + + @Override + public int[][] generate(int length) + { + double[] pi = logToCdf(start_probability); + double[][] A = logToCdf(transition_probability); + double[][] B = logToCdf(emission_probability); + int xy[][] = new int[2][length]; + xy[1][0] = drawFrom(pi); // 采样首个隐状态 + xy[0][0] = drawFrom(B[xy[1][0]]); // 根据首个隐状态采样它的显状态 + for (int t = 1; t < length; t++) + { + xy[1][t] = drawFrom(A[xy[1][t - 1]]); + xy[0][t] = drawFrom(B[xy[1][t]]); + } + return xy; + } + + @Override + public float predict(int[] observation, int[] state) + { + final int time = observation.length; // 序列长度 + final int max_s = start_probability.length; // 状态种数 + + float[] score = new float[max_s]; + + // link[t][s] := 
第t个时刻在当前状态是s时,前1个状态是什么 + int[][] link = new int[time][max_s]; + // 第一个时刻,使用初始概率向量乘以发射概率矩阵 + for (int cur_s = 0; cur_s < max_s; ++cur_s) + { + score[cur_s] = start_probability[cur_s] + emission_probability[cur_s][observation[0]]; + } + + // 第二个时刻,使用前一个时刻的概率向量乘以一阶转移矩阵乘以发射概率矩阵 + float[] pre = new float[max_s]; + for (int t = 1; t < observation.length; t++) + { + // swap(now, pre) + float[] buffer = pre; + pre = score; + score = buffer; + // end of swap + for (int s = 0; s < max_s; ++s) + { + score[s] = Integer.MIN_VALUE; + for (int f = 0; f < max_s; ++f) + { + float p = pre[f] + transition_probability[f][s] + emission_probability[s][observation[t]]; + if (p > score[s]) + { + score[s] = p; + link[t][s] = f; + } + } + } + } + + float max_score = Integer.MIN_VALUE; + int best_s = 0; + for (int s = 0; s < max_s; s++) + { + if (score[s] > max_score) + { + max_score = score[s]; + best_s = s; + } + } + + for (int t = link.length - 1; t >= 0; --t) + { + state[t] = best_s; + best_s = link[t][best_s]; + } + + return max_score; + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/HMMLexicalAnalyzer.java b/src/main/java/com/hankcs/hanlp/model/hmm/HMMLexicalAnalyzer.java new file mode 100644 index 000000000..34f425a07 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/HMMLexicalAnalyzer.java @@ -0,0 +1,40 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-03 12:44 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; + +/** + * 基于隐马尔可夫模型的词法分析器 + * + * @author hankcs + */ +public class HMMLexicalAnalyzer extends AbstractLexicalAnalyzer +{ + private HMMLexicalAnalyzer() + { + } + + public HMMLexicalAnalyzer(HMMSegmenter segmenter) + { + super(segmenter); + } + + public HMMLexicalAnalyzer(HMMSegmenter segmenter, HMMPOSTagger posTagger) + { + super(segmenter, posTagger); + } + + public HMMLexicalAnalyzer(HMMSegmenter segmenter, HMMPOSTagger posTagger, HMMNERecognizer neRecognizer) + { + super(segmenter, posTagger, neRecognizer); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/HMMNERecognizer.java b/src/main/java/com/hankcs/hanlp/model/hmm/HMMNERecognizer.java new file mode 100644 index 000000000..5913ddbca --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/HMMNERecognizer.java @@ -0,0 +1,84 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-02 9:15 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.util.List; + +/** + * @author hankcs + */ +public class HMMNERecognizer extends HMMTrainer implements NERecognizer +{ + NERTagSet tagSet; + + public HMMNERecognizer(HiddenMarkovModel model) + { + super(model); + tagSet = new NERTagSet(); + tagSet.nerLabels.add("nr"); + tagSet.nerLabels.add("ns"); + tagSet.nerLabels.add("nt"); + } + + public HMMNERecognizer() + { + this(new FirstOrderHiddenMarkovModel()); + } + + @Override + protected List<String[]> convertToSequence(Sentence sentence) + { + List<String[]> collector = Utility.convertSentenceToNER(sentence, tagSet); + for (String[] pair : collector) + { + pair[1] = pair[2]; + } + + return collector; + } + + @Override + protected TagSet getTagSet() + { + return tagSet; + } + + @Override + public String[] recognize(String[] wordArray, String[] posArray) + { + int[] obsArray = new int[wordArray.length]; + for (int i = 0; i < obsArray.length; i++) + { + obsArray[i] = vocabulary.idOf(wordArray[i]); + } + int[] tagArray = new int[obsArray.length]; + model.predict(obsArray, tagArray); + String[] tags = new String[obsArray.length]; + for (int i = 0; i < tagArray.length; i++) + { + tags[i] = tagSet.stringOf(tagArray[i]); + } + + return tags; + } + + @Override + public NERTagSet getNERTagSet() + { + return tagSet; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/HMMPOSTagger.java b/src/main/java/com/hankcs/hanlp/model/hmm/HMMPOSTagger.java new file mode 100644 index 000000000..fddf0fab1 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/HMMPOSTagger.java @@ -0,0 +1,83 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-02 8:49 PM + * + * + * Copyright (c) 2018, Han He. 
All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.model.perceptron.tagset.POSTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.tokenizer.lexical.POSTagger; + +import java.util.ArrayList; +import java.util.List; + +/** + * @author hankcs + */ +public class HMMPOSTagger extends HMMTrainer implements POSTagger +{ + POSTagSet tagSet; + + public HMMPOSTagger(HiddenMarkovModel model) + { + super(model); + tagSet = new POSTagSet(); + } + + public HMMPOSTagger() + { + super(); + tagSet = new POSTagSet(); + } + + @Override + protected List<String[]> convertToSequence(Sentence sentence) + { + List<Word> wordList = sentence.toSimpleWordList(); + List<String[]> xyList = new ArrayList<String[]>(wordList.size()); + for (Word word : wordList) + { + xyList.add(new String[]{word.getValue(), word.getLabel()}); + } + return xyList; + } + + @Override + protected TagSet getTagSet() + { + return tagSet; + } + + @Override + public String[] tag(String... words) + { + int[] obsArray = new int[words.length]; + for (int i = 0; i < obsArray.length; i++) + { + obsArray[i] = vocabulary.idOf(words[i]); + } + int[] tagArray = new int[obsArray.length]; + model.predict(obsArray, tagArray); + String[] tags = new String[obsArray.length]; + for (int i = 0; i < tagArray.length; i++) + { + tags[i] = tagSet.stringOf(tagArray[i]); + } + + return tags; + } + + @Override + public String[] tag(List<String> wordList) + { + return tag(wordList.toArray(new String[0])); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/HMMSegmenter.java b/src/main/java/com/hankcs/hanlp/model/hmm/HMMSegmenter.java new file mode 100644 index 000000000..0ae153e0e --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/HMMSegmenter.java @@ -0,0 +1,131 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-13 2:05 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.model.perceptron.tagset.CWSTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.tokenizer.lexical.Segmenter; + +import java.util.LinkedList; +import java.util.List; + +/** + * @author hankcs + */ +public class HMMSegmenter extends HMMTrainer implements Segmenter +{ + CWSTagSet tagSet; + + public HMMSegmenter(HiddenMarkovModel model) + { + super(model); + tagSet = new CWSTagSet(); + } + + public HMMSegmenter() + { + tagSet = new CWSTagSet(); + } + + @Override + public List<String> segment(String text) + { + List<String> wordList = new LinkedList<String>(); + segment(text, CharTable.convert(text), wordList); + return wordList; + } + + @Override + 
public void segment(String text, String normalized, List<String> output) + { + int[] obsArray = new int[text.length()]; + for (int i = 0; i < obsArray.length; i++) + { + obsArray[i] = vocabulary.idOf(normalized.substring(i, i + 1)); + } + int[] tagArray = new int[text.length()]; + model.predict(obsArray, tagArray); + StringBuilder result = new StringBuilder(); + result.append(text.charAt(0)); + + for (int i = 1; i < tagArray.length; i++) + { + if (tagArray[i] == tagSet.B || tagArray[i] == tagSet.S) + { + output.add(result.toString()); + result.setLength(0); + } + result.append(text.charAt(i)); + } + if (result.length() != 0) + { + output.add(result.toString()); + } + } + + @Override + protected List<String[]> convertToSequence(Sentence sentence) + { + List<String[]> charList = new LinkedList<String[]>(); + for (Word w : sentence.toSimpleWordList()) + { + String word = CharTable.convert(w.value); + if (word.length() == 1) + { + charList.add(new String[]{word, "S"}); + } + else + { + charList.add(new String[]{word.substring(0, 1), "B"}); + for (int i = 1; i < word.length() - 1; ++i) + { + charList.add(new String[]{word.substring(i, i + 1), "M"}); + } + charList.add(new String[]{word.substring(word.length() - 1), "E"}); + } + } + return charList; + } + + @Override + protected TagSet getTagSet() + { + return tagSet; + } + + /** + * 获取兼容旧的Segment接口 + * + * @return + */ + public Segment toSegment() + { + return new Segment() + { + @Override + protected List<Term> segSentence(char[] sentence) + { + List<String> wordList = segment(new String(sentence)); + List<Term> termList = new LinkedList<Term>(); + for (String word : wordList) + { + termList.add(new Term(word, null)); + } + return termList; + } + }.enableCustomDictionary(false); + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/HMMTrainer.java b/src/main/java/com/hankcs/hanlp/model/hmm/HMMTrainer.java new file mode 100644 index 000000000..685664931 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/HMMTrainer.java @@ -0,0 +1,85 @@ 
+/* + * Han He + * me@hankcs.com + * 2018-06-13 2:17 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.model.perceptron.instance.InstanceHandler; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.model.perceptron.utility.IOUtility; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.LinkedList; +import java.util.List; + +/** + * @author hankcs + */ +public abstract class HMMTrainer +{ + HiddenMarkovModel model; + Vocabulary vocabulary; + + public HMMTrainer(HiddenMarkovModel model, Vocabulary vocabulary) + { + this.model = model; + this.vocabulary = vocabulary; + } + + public HMMTrainer(HiddenMarkovModel model) + { + this(model, new Vocabulary()); + } + + public HMMTrainer() + { + this(new FirstOrderHiddenMarkovModel()); + } + + public void train(String corpus) throws IOException + { + final List<List<String[]>> sequenceList = new LinkedList<List<String[]>>(); + IOUtility.loadInstance(corpus, new InstanceHandler() + { + @Override + public boolean process(Sentence sentence) + { + sequenceList.add(convertToSequence(sentence)); + return false; + } + }); + + TagSet tagSet = getTagSet(); + + List<int[][]> sampleList = new ArrayList<int[][]>(sequenceList.size()); + for (List<String[]> sequence : sequenceList) + { + int[][] sample = new int[2][sequence.size()]; + int i = 0; + for (String[] os : sequence) + { + sample[0][i] = vocabulary.idOf(os[0]); + assert sample[0][i] != -1; + sample[1][i] = tagSet.add(os[1]); + assert sample[1][i] != -1; + ++i; + } + sampleList.add(sample); + } + + model.train(sampleList); + vocabulary.mutable = false; + } + + protected abstract List<String[]> convertToSequence(Sentence sentence); + protected abstract TagSet getTagSet(); +} diff --git 
a/src/main/java/com/hankcs/hanlp/model/hmm/HiddenMarkovModel.java b/src/main/java/com/hankcs/hanlp/model/hmm/HiddenMarkovModel.java new file mode 100644 index 000000000..a9f91f315 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/HiddenMarkovModel.java @@ -0,0 +1,335 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-09 7:47 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.utility.MathUtility; + +import java.io.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.List; + +/** + * @author hankcs + */ +public abstract class HiddenMarkovModel +{ + /** + * 初始状态概率向量 + */ + public float[] start_probability; + /** + * 观测概率矩阵 + */ + public float[][] emission_probability; + /** + * 状态转移概率矩阵 + */ + public float[][] transition_probability; + + /** + * 构造隐马模型 + * + * @param start_probability 初始状态概率向量 + * @param transition_probability 状态转移概率矩阵 + * @param emission_probability 观测概率矩阵 + */ + public HiddenMarkovModel(float[] start_probability, float[][] transition_probability, float[][] emission_probability) + { + this.start_probability = (float[]) deepCopy(start_probability); + this.transition_probability = (float[][]) deepCopy(transition_probability); + this.emission_probability = (float[][]) deepCopy(emission_probability); + } + + /** + * 对数概率转为累积分布函数 + * + * @param log + * @return + */ + protected static double[] logToCdf(float[] log) + { + double[] cdf = new double[log.length]; + cdf[0] = Math.exp(log[0]); + for (int i = 1; i < cdf.length - 1; i++) + { + cdf[i] = cdf[i - 1] + Math.exp(log[i]); + } + cdf[cdf.length - 1] = 1.0; + return cdf; + } + + /** + * 对数概率转化为累积分布函数 + * + * @param log + * @return + */ + protected static double[][] logToCdf(float[][] log) + { + double[][] cdf = new double[log.length][log[0].length]; + for 
(int i = 0; i < log.length; i++) + cdf[i] = logToCdf(log[i]); + return cdf; + } + + /** + * 采样 + * + * @param cdf 累积分布函数 + * @return + */ + protected static int drawFrom(double[] cdf) + { + int index = Arrays.binarySearch(cdf, Math.random()); + if (index >= 0) + { + return index; + } + else + { + return -index - 1; + } + } + + /** + * 频次向量归一化为概率分布 + * + * @param freq + */ + protected void normalize(float[] freq) + { + float sum = MathUtility.sum(freq); + for (int i = 0; i < freq.length; i++) + freq[i] /= sum; + } + + public void unLog() + { + for (int i = 0; i < start_probability.length; i++) + { + start_probability[i] = (float) Math.exp(start_probability[i]); + } + for (int i = 0; i < emission_probability.length; i++) + { + for (int j = 0; j < emission_probability[i].length; j++) + { + emission_probability[i][j] = (float) Math.exp(emission_probability[i][j]); + } + } + for (int i = 0; i < transition_probability.length; i++) + { + for (int j = 0; j < transition_probability[i].length; j++) + { + transition_probability[i][j] = (float) Math.exp(transition_probability[i][j]); + } + } + } + + protected void toLog() + { + if (start_probability == null || transition_probability == null || emission_probability == null) return; + for (int i = 0; i < start_probability.length; i++) + { + start_probability[i] = (float) Math.log(start_probability[i]); + for (int j = 0; j < start_probability.length; j++) + transition_probability[i][j] = (float) Math.log(transition_probability[i][j]); + for (int j = 0; j < emission_probability[0].length; j++) + emission_probability[i][j] = (float) Math.log(emission_probability[i][j]); + } + } + + /** + * 训练 + * + * @param samples 数据集 int[i][j] i=0为观测,i=1为状态,j为时序轴 + */ + public void train(Collection<int[][]> samples) + { + if (samples.isEmpty()) return; + int max_state = 0; + int max_obser = 0; + for (int[][] sample : samples) + { + if (sample.length != 2 || sample[0].length != sample[1].length) throw new IllegalArgumentException("非法样本"); + for (int o : sample[0]) + max_obser = Math.max(max_obser, o); + for (int s : sample[1]) + max_state = Math.max(max_state, s); + } + estimateStartProbability(samples, max_state); + estimateTransitionProbability(samples, max_state); + estimateEmissionProbability(samples, max_state, max_obser); + toLog(); + } + + /** + * 估计状态发射概率 + * + * @param samples 训练样本集 + * @param max_state 状态的最大下标 + * @param max_obser 观测的最大下标 + */ + protected void estimateEmissionProbability(Collection<int[][]> samples, int max_state, int max_obser) + { + emission_probability = new float[max_state + 1][max_obser + 1]; + for (int[][] sample : samples) + { + for (int i = 0; i < sample[0].length; i++) + { + int o = sample[0][i]; + int s = sample[1][i]; + ++emission_probability[s][o]; + } + } + for (int i = 0; i < emission_probability.length; i++) + normalize(emission_probability[i]); + } + + /** + * 利用极大似然估计转移概率 + * + * @param samples 训练样本集 + * @param max_state 状态的最大下标,等于N-1 + */ + protected void estimateTransitionProbability(Collection<int[][]> samples, int max_state) + { + transition_probability = new float[max_state + 1][max_state + 1]; + for (int[][] sample : samples) + { + int prev_s = sample[1][0]; + for (int i = 1; i < sample[0].length; i++) + { + int s = sample[1][i]; + ++transition_probability[prev_s][s]; + prev_s = s; + } + } + for (int i = 0; i < transition_probability.length; i++) + normalize(transition_probability[i]); + } + + /** + * 估计初始状态概率向量 + * + * @param samples 训练样本集 + * @param max_state 状态的最大下标 + */ + protected void estimateStartProbability(Collection<int[][]> samples, int max_state) + { + start_probability = new float[max_state + 1]; + for (int[][] sample : samples) + { + int s = sample[1][0]; + ++start_probability[s]; + } + normalize(start_probability); + } + + /** + * 生成样本序列 + * + * @param length 序列长度 + * @return 序列 + */ + public abstract int[][] generate(int length); + + + /** + * 生成样本序列 + * + * @param minLength 序列最低长度 + * @param maxLength 序列最高长度 + * @param size 需要生成多少个 + * @return 样本序列集合 + */ + public List<int[][]> generate(int minLength, int maxLength, int size) + { + List<int[][]> samples = new ArrayList<int[][]>(size); + for (int i = 0; i < size; i++) + { + samples.add(generate((int) (Math.floor(Math.random() * (maxLength - minLength)) + minLength))); + } + return samples; + } + + /** + * 预测(维特比算法) + * + * @param o 观测序列 + * @param s 预测状态序列(需预先分配内存) + * @return 概率的对数,可利用 (float) Math.exp(maxScore) 还原 + */ + public abstract float predict(int[] o, int[] s); + + /** + * 预测(维特比算法) + * + * @param o 观测序列 + * @param s 预测状态序列(需预先分配内存) + * @return 概率的对数,可利用 (float) Math.exp(maxScore) 还原 + */ + public float predict(int[] o, Integer[] s) + { + int[] states = new int[s.length]; + float p = predict(o, states); + for (int i = 0; i < states.length; i++) + { + s[i] = states[i]; + } + return p; + } + + public boolean similar(HiddenMarkovModel model) + { + if (!similar(start_probability, model.start_probability)) return false; + for (int i = 0; i < transition_probability.length; i++) + { + if (!similar(transition_probability[i], model.transition_probability[i])) return false; + if (!similar(emission_probability[i], model.emission_probability[i])) return false; + } + return true; + } + + protected static boolean similar(float[] A, float[] B) + { + final float eta = 1e-2f; + for (int i = 0; i < A.length; i++) + if (Math.abs(A[i] - B[i]) > eta) return false; + return true; + } + + protected static Object deepCopy(Object object) + { + if (object == null) + { + return null; + } + try + { + ByteArrayOutputStream bos = new ByteArrayOutputStream(); + ObjectOutputStream oos = new ObjectOutputStream(bos); + oos.writeObject(object); + oos.flush(); + oos.close(); + bos.close(); + + byte[] byteData = bos.toByteArray(); + ByteArrayInputStream bais = new ByteArrayInputStream(byteData); + return new ObjectInputStream(bais).readObject(); + } + catch (Exception e) + { + throw new RuntimeException(e); + } + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/SecondOrderHiddenMarkovModel.java b/src/main/java/com/hankcs/hanlp/model/hmm/SecondOrderHiddenMarkovModel.java new file mode 100644 index 000000000..032f46af4 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/SecondOrderHiddenMarkovModel.java @@ -0,0 +1,256 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-09 7:47 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.hmm; + +import java.util.Collection; + +/** + * @author hankcs + */ +public class SecondOrderHiddenMarkovModel extends HiddenMarkovModel +{ + /** + * 状态转移概率矩阵 + */ + public float[][][] transition_probability2; + + /** + * 构造隐马模型 + * + * @param start_probability 初始状态概率向量 + * @param transition_probability 状态转移概率矩阵 + * @param emission_probability 观测概率矩阵 + */ + private SecondOrderHiddenMarkovModel(float[] start_probability, float[][] transition_probability, float[][] emission_probability) + { + super(start_probability, transition_probability, emission_probability); + } + + public SecondOrderHiddenMarkovModel(float[] start_probability, float[][] transition_probability, float[][] emission_probability, float[][][] transition_probability2) + { + this(start_probability, transition_probability, emission_probability); + this.transition_probability2 = transition_probability2; + toLog(); + } + + public SecondOrderHiddenMarkovModel() + { + this(null, null, null, null); + } + + @Override + protected void estimateTransitionProbability(Collection<int[][]> samples, int max_state) + { + transition_probability = new float[max_state + 1][max_state + 1]; + transition_probability2 = new float[max_state + 1][max_state + 1][max_state + 1]; + for (int[][] sample : samples) + { + int prev_s = sample[1][0]; + int prev_prev_s = -1; + for (int i = 1; i < sample[0].length; i++) + { + int s = sample[1][i]; + if (i == 1) + ++transition_probability[prev_s][s]; + else + 
++transition_probability2[prev_prev_s][prev_s][s]; + prev_prev_s = prev_s; + prev_s = s; + } + } + for (float[] p : transition_probability) + normalize(p); + for (float[][] pp : transition_probability2) + for (float[] p : pp) + normalize(p); + } + + @Override + public int[][] generate(int length) + { + double[] pi = logToCdf(start_probability); + double[][] A = logToCdf(transition_probability); + double[][][] A2 = logToCdf(transition_probability2); + double[][] B = logToCdf(emission_probability); + int os[][] = new int[2][length]; + os[1][0] = drawFrom(pi); // 采样首个隐状态 + os[0][0] = drawFrom(B[os[1][0]]); // 根据首个隐状态采样它的显状态 + + os[1][1] = drawFrom(A[os[1][0]]); + os[0][1] = drawFrom(B[os[1][1]]); + + for (int t = 2; t < length; t++) + { + os[1][t] = drawFrom(A2[os[1][t - 2]][os[1][t - 1]]); + os[0][t] = drawFrom(B[os[1][t]]); + } + + return os; + } + + private double[][][] logToCdf(float[][][] log) + { + double[][][] cdf = new double[log.length][log[0].length][log[0][0].length]; + for (int i = 0; i < log.length; i++) + { + cdf[i] = logToCdf(log[i]); + } + return cdf; + } + + @Override + protected void toLog() + { + super.toLog(); + if (transition_probability2 != null) + { + for (float[][] m : transition_probability2) + { + for (float[] v : m) + { + for (int i = 0; i < v.length; i++) + { + v[i] = (float) Math.log(v[i]); + } + } + } + } + } + + @Override + public float predict(int[] observation, int[] state) + { + final int time = observation.length; // 序列长度 + final int max_s = start_probability.length; // 状态种数 + + float[][] score = new float[max_s][max_s]; + float[] first = new float[max_s]; + + // link[i][s][t] := 第i个时刻在前一个状态是s,当前状态是t时,前2个状态是什么 + int[][][] link = new int[time][max_s][max_s]; + // 第一个时刻,使用初始概率向量乘以发射概率矩阵 + for (int cur_s = 0; cur_s < max_s; ++cur_s) + { + first[cur_s] = start_probability[cur_s] + emission_probability[cur_s][observation[0]]; + } + + if (time == 1) + { + int best_s = 0; + float max_score = Integer.MIN_VALUE; + for (int cur_s = 0; cur_s < 
max_s; ++cur_s) + { + if (first[cur_s] > max_score) + { + best_s = cur_s; + max_score = first[cur_s]; + } + } + state[0] = best_s; + return max_score; + } + + // 第二个时刻,使用前一个时刻的概率向量乘以一阶转移矩阵乘以发射概率矩阵 + for (int f = 0; f < max_s; ++f) + { + for (int s = 0; s < max_s; ++s) + { + float p = first[f] + transition_probability[f][s] + emission_probability[s][observation[1]]; + score[f][s] = p; + link[1][f][s] = f; + } + } + + // 从第三个时刻开始,使用前一个时刻的概率矩阵乘以二阶转移张量乘以发射概率矩阵 + float[][] pre = new float[max_s][max_s]; + for (int i = 2; i < observation.length; i++) + { + // swap(now, pre) + float[][] buffer = pre; + pre = score; + score = buffer; + // end of swap + for (int s = 0; s < max_s; ++s) + { + for (int t = 0; t < max_s; ++t) + { + score[s][t] = Integer.MIN_VALUE; + for (int f = 0; f < max_s; ++f) + { + float p = pre[f][s] + transition_probability2[f][s][t] + emission_probability[t][observation[i]]; + if (p > score[s][t]) + { + score[s][t] = p; + link[i][s][t] = f; + } + } + } + } + } + + float max_score = Integer.MIN_VALUE; + int best_s = 0, best_t = 0; + for (int s = 0; s < max_s; s++) + { + for (int t = 0; t < max_s; t++) + { + if (score[s][t] > max_score) + { + max_score = score[s][t]; + best_s = s; + best_t = t; + } + } + } + + for (int i = link.length - 1; i >= 0; --i) + { + state[i] = best_t; + int best_f = link[i][best_s][best_t]; + best_t = best_s; + best_s = best_f; + } + + return max_score; + } + + @Override + public void unLog() + { + super.unLog(); + for (float[][] m : transition_probability2) + { + for (float[] v : m) + { + for (int i = 0; i < v.length; i++) + { + v[i] = (float) Math.exp(v[i]); + } + } + } + } + + @Override + public boolean similar(HiddenMarkovModel model) + { + if (!(model instanceof SecondOrderHiddenMarkovModel)) return false; + SecondOrderHiddenMarkovModel hmm2 = (SecondOrderHiddenMarkovModel) model; + for (int i = 0; i < transition_probability.length; i++) + { + for (int j = 0; j < transition_probability.length; j++) + { + if 
(!similar(transition_probability2[i][j], hmm2.transition_probability2[i][j])) + return false; + } + } + return super.similar(model); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/hmm/Vocabulary.java b/src/main/java/com/hankcs/hanlp/model/hmm/Vocabulary.java new file mode 100644 index 000000000..2c3341935 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/hmm/Vocabulary.java @@ -0,0 +1,53 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-13 2:26 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; +import com.hankcs.hanlp.model.perceptron.common.IStringIdMap; + +/** + * @author hankcs + */ +public class Vocabulary implements IStringIdMap +{ + private BinTrie trie; + boolean mutable; + private static final int UNK = 0; + + public Vocabulary(BinTrie trie, boolean mutable) + { + this.trie = trie; + this.mutable = mutable; + } + + public Vocabulary() + { + this(new BinTrie(), true); + trie.put("\t", UNK); + } + + @Override + public int idOf(String string) + { + Integer id = trie.get(string); + if (id == null) + { + if (mutable) + { + id = trie.size(); + trie.put(string, id); + } + else + id = UNK; + } + return id; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/CWSTrainer.java b/src/main/java/com/hankcs/hanlp/model/perceptron/CWSTrainer.java new file mode 100644 index 000000000..980461778 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/CWSTrainer.java @@ -0,0 +1,57 @@ +/* + *
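The `predict` method above is a second-order Viterbi search: scores are kept per state pair `(s, t)`, `link[i][s][t]` remembers the best state two steps back, and decoding backtracks through pairs rather than single states. A minimal self-contained sketch of the same recurrence, using toy log-probabilities (assumed values, not HanLP's model format; the second-order term is simplified to the first-order one, and at least two observations are required):

```java
// Toy second-order Viterbi in log space, mirroring SecondOrderHiddenMarkovModel.predict().
class TrigramViterbi {
    // 2 states, 2 observation symbols; all values are assumed toy probabilities
    static final double[] PI = {Math.log(0.6), Math.log(0.4)};
    static final double[][] A = {
            {Math.log(0.7), Math.log(0.3)},
            {Math.log(0.4), Math.log(0.6)}
    };
    static final double[][] B = {
            {Math.log(0.9), Math.log(0.1)},
            {Math.log(0.2), Math.log(0.8)}
    };

    // Assumed toy second-order term: P(t | f, s) := P(t | s)
    static double a2(int f, int s, int t) { return A[s][t]; }

    /** Decodes observations (length >= 2) into the most likely state sequence. */
    static int[] decode(int[] o) {
        final int n = o.length, S = PI.length;
        double[][] score = new double[S][S];
        int[][][] link = new int[n][S][S];
        // first two positions initialize the pair scores
        for (int s = 0; s < S; s++)
            for (int t = 0; t < S; t++)
                score[s][t] = PI[s] + B[s][o[0]] + A[s][t] + B[t][o[1]];
        for (int i = 2; i < n; i++) {
            double[][] pre = score;
            score = new double[S][S];
            for (int s = 0; s < S; s++)
                for (int t = 0; t < S; t++) {
                    score[s][t] = Double.NEGATIVE_INFINITY;
                    for (int f = 0; f < S; f++) {
                        double p = pre[f][s] + a2(f, s, t) + B[t][o[i]];
                        if (p > score[s][t]) { score[s][t] = p; link[i][s][t] = f; }
                    }
                }
        }
        // best final pair, then backtrack exactly like the loop in predict()
        int bestS = 0, bestT = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < S; s++)
            for (int t = 0; t < S; t++)
                if (score[s][t] > best) { best = score[s][t]; bestS = s; bestT = t; }
        int[] state = new int[n];
        for (int i = n - 1; i >= 0; --i) {
            state[i] = bestT;
            int f = link[i][bestS][bestT];
            bestT = bestS;
            bestS = f;
        }
        return state;
    }
}
```

With the emission matrix strongly diagonal, the decoder recovers the state that best explains each symbol while the transition terms smooth the path.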
+ * Hankcs + * me@hankcs.com + * 2016-09-04 PM4:48 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.CWSInstance; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.model.perceptron.tagset.CWSTagSet; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; + +import java.io.IOException; +import java.util.List; + +/** + * 感知机分词器训练工具 + * + * @author hankcs + */ +public class CWSTrainer extends PerceptronTrainer +{ + @Override + protected TagSet createTagSet() + { + return new CWSTagSet(); + } + + @Override + protected Instance createInstance(Sentence sentence, FeatureMap mutableFeatureMap) + { + List wordList = sentence.toSimpleWordList(); + String[] termArray = Utility.toWordArray(wordList); + Instance instance = new CWSInstance(termArray, mutableFeatureMap); + return instance; + } + + @Override + public double[] evaluate(String developFile, LinearModel model) throws IOException + { + PerceptronSegmenter segmenter = new PerceptronSegmenter(model); + double[] prf = Utility.prf(Utility.evaluateCWS(developFile, segmenter)); + return prf; + } + +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/InstanceConsumer.java b/src/main/java/com/hankcs/hanlp/model/perceptron/InstanceConsumer.java new file mode 100644 index 000000000..71b5bdf70 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/InstanceConsumer.java @@ -0,0 +1,77 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-15 下午7:34 + * + * + * 
Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.utility.IOUtility; +import com.hankcs.hanlp.model.perceptron.instance.InstanceHandler; +import com.hankcs.hanlp.model.perceptron.utility.Utility; + +import java.io.IOException; + +/** + * 需要处理实例的消费者 + * + * @author hankcs + */ +public abstract class InstanceConsumer +{ + private static char[] tableChar; + + static + { + tableChar = new char[CharTable.CONVERT.length]; + System.arraycopy(CharTable.CONVERT, 0, tableChar, 0, tableChar.length); + for (int c = 0; c <= 32; ++c) + { + tableChar[c] = '&'; // 也可以考虑用 '。' + } + } + + protected abstract Instance createInstance(Sentence sentence, final FeatureMap featureMap); + + protected double[] evaluate(String developFile, String modelFile) throws IOException + { + return evaluate(developFile, new LinearModel(modelFile)); + } + + protected double[] evaluate(String developFile, final LinearModel model) throws IOException + { + final int[] stat = new int[2]; + IOUtility.loadInstance(developFile, new InstanceHandler() + { + @Override + public boolean process(Sentence sentence) + { + Utility.normalize(sentence); + Instance instance = createInstance(sentence, model.featureMap); + IOUtility.evaluate(instance, model, stat); + return false; + } + }); + + return new double[]{stat[1] / (double) stat[0] * 100}; + } + + protected String normalize(String text) + { + char[] result = new char[text.length()]; + for (int i = 0; i < result.length; i++) + { + result[i] = tableChar[text.charAt(i)]; + } + 
return new String(result); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/Main.java b/src/main/java/com/hankcs/hanlp/model/perceptron/Main.java new file mode 100644 index 000000000..c79fc9133 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/Main.java @@ -0,0 +1,172 @@ +/* + * Hankcs + * me@hankcs.com + * 2016-09-11 PM3:53 + * + * + * Copyright (c) 2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.model.perceptron.cli.Args; +import com.hankcs.hanlp.model.perceptron.cli.Argument; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; + +import java.io.File; +import java.io.IOException; +import java.io.PrintWriter; +import java.util.Arrays; +import java.util.Scanner; + +import static java.lang.System.out; + +/** + * @author hankcs + */ +public class Main +{ + private static class Option + { + @Argument(description = "任务类型:CWS|POS|NER") + TaskType task = TaskType.CWS; + + @Argument(description = "执行训练任务") + boolean train; + + @Argument(description = "执行预测任务") + boolean test; + + @Argument(description = "执行评估任务") + boolean evaluate; + + @Argument(description = "模型文件路径") + String[] model = new String[]{HanLP.Config.PerceptronCWSModelPath, HanLP.Config.PerceptronPOSModelPath, HanLP.Config.PerceptronNERModelPath}; + + @Argument(description = "输入文本路径") + String input; + + @Argument(description = "结果保存路径") + String result; + + @Argument(description = "标准分词语料") + String gold; + + @Argument(description = "训练集") + String reference; + + @Argument(description = "开发集") + String development; + + @Argument(description = "迭代次数") + Integer iter = 5; + + @Argument(description = "模型压缩比率") + Double compressRatio = 0.0; + + @Argument(description = "线程数") + Integer thread = 
Runtime.getRuntime().availableProcessors(); + } + + public static void main(String[] args) + { + // nohup time java -jar averaged-perceptron-segment-1.0.jar -train -model 2014_2w.bin -reference 2014_blank.txt -development 2014_1k.txt > log.txt + Option option = new Option(); + try + { + Args.parse(option, args); + PerceptronTrainer trainer = null; + switch (option.task) + { + case CWS: + trainer = new CWSTrainer(); + break; + case POS: + trainer = new POSTrainer(); + break; + case NER: + trainer = new NERTrainer(); + break; + } + if (option.train) + { + trainer.train(option.reference, option.development, option.model[0], option.compressRatio, + option.iter, option.thread); + } + else if (option.evaluate) + { + double[] prf = trainer.evaluate(option.gold, option.model[0]); + out.printf("Performance - P:%.2f R:%.2f F:%.2f\n", prf[0], prf[1], prf[2]); + } + else + { + PerceptronLexicalAnalyzer analyzer; + String[] models = option.model; + switch (models.length) + { + case 1: + analyzer = new PerceptronLexicalAnalyzer(models[0]); + break; + case 2: + analyzer = new PerceptronLexicalAnalyzer(models[0], models[1]); + break; + case 3: + analyzer = new PerceptronLexicalAnalyzer(models[0], models[1], models[2]); + break; + default: + System.err.printf("最多支持载入3个模型,然而传入了多于3个: %s", Arrays.toString(models)); + return; + } + + PrintWriter printer; + if (option.result == null) + { + printer = new PrintWriter(System.out); + } + else + { + printer = new PrintWriter(new File(option.result), "utf-8"); + } + Scanner scanner; + if (option.input == null) + { + scanner = new Scanner(System.in); +// System.err.println("请输入文本:"); + } + else + { + scanner = new Scanner(new File(option.input), "utf-8"); + } + String line; + String lineSeparator = System.getProperty("line.separator"); + while (scanner.hasNext() && (line = scanner.nextLine()) != null) + { + line = line.trim(); + if (line.length() == 0) continue; + Sentence sentence = analyzer.analyze(line); + 
printer.write(sentence.toString()); + printer.write(lineSeparator); + if (option.result == null) + { + printer.flush(); + } + } + printer.close(); + scanner.close(); + } + } + catch (IllegalArgumentException e) + { + System.err.println(e.getMessage()); + Args.usage(option); + } + catch (IOException e) + { + System.err.println("发生了IO异常,请检查文件路径"); + e.printStackTrace(); + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/NERTrainer.java b/src/main/java/com/hankcs/hanlp/model/perceptron/NERTrainer.java new file mode 100644 index 000000000..b23fa74ae --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/NERTrainer.java @@ -0,0 +1,67 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-28 11:39 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.instance.NERInstance; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; + +/** + * @author hankcs + */ +public class NERTrainer extends PerceptronTrainer +{ + /** + * 支持任意自定义NER类型,例如:
+ * tagSet.nerLabels.clear();
+ * tagSet.nerLabels.add("nr");
+ * tagSet.nerLabels.add("ns");
+ * tagSet.nerLabels.add("nt");
+ */ + public NERTagSet tagSet; + + public NERTrainer(NERTagSet tagSet) + { + this.tagSet = tagSet; + } + + public NERTrainer() + { + tagSet = new NERTagSet(); + tagSet.nerLabels.add("nr"); + tagSet.nerLabels.add("ns"); + tagSet.nerLabels.add("nt"); + } + + /** + * 重载此方法以支持任意自定义NER类型,例如:
+ * NERTagSet tagSet = new NERTagSet();
+ * tagSet.nerLabels.add("nr");
+ * tagSet.nerLabels.add("ns");
+ * tagSet.nerLabels.add("nt");
+ * return tagSet;
+ * @return + */ + @Override + protected TagSet createTagSet() + { + return tagSet; + } + + @Override + protected Instance createInstance(Sentence sentence, FeatureMap featureMap) + { + return new NERInstance(sentence, featureMap); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/POSTrainer.java b/src/main/java/com/hankcs/hanlp/model/perceptron/POSTrainer.java new file mode 100644 index 000000000..914b822cf --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/POSTrainer.java @@ -0,0 +1,45 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-26 下午9:12 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.instance.POSInstance; +import com.hankcs.hanlp.model.perceptron.tagset.POSTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; + +import java.io.IOException; + +/** + * @author hankcs + */ +public class POSTrainer extends PerceptronTrainer +{ + @Override + protected TagSet createTagSet() + { + return new POSTagSet(); + } + + @Override + protected Instance createInstance(Sentence sentence, FeatureMap featureMap) + { + return POSInstance.create(sentence, featureMap); + } + + @Override + public Result train(String trainingFile, String developFile, String modelFile) throws IOException + { + // 词性标注模型压缩会显著降低效果 + return train(trainingFile, developFile, modelFile, 0, 10, Runtime.getRuntime().availableProcessors()); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronClassifier.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronClassifier.java new file mode 100644 index 000000000..7fbc5ffd9 --- /dev/null +++ 
b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronClassifier.java @@ -0,0 +1,281 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-21 11:30 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.feature.LockableFeatureMap; +import com.hankcs.hanlp.model.perceptron.model.AveragedPerceptron; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; + +import java.io.IOException; +import java.util.LinkedList; +import java.util.List; + +/** + * 感知机二分类器 + * + * @author hankcs + */ +public abstract class PerceptronClassifier +{ + LinearModel model; + + public PerceptronClassifier() + { + } + + public PerceptronClassifier(LinearModel model) + { + if (model != null && model.taskType() != TaskType.CLASSIFICATION) + throw new IllegalArgumentException("传入的模型并非分类模型"); + this.model = model; + } + + public PerceptronClassifier(String modelPath) throws IOException + { + this(new LinearModel(modelPath)); + } + + /** + * 朴素感知机训练算法 + * @param instanceList 训练实例 + * @param featureMap 特征函数 + * @param maxIteration 训练迭代次数 + */ + private static LinearModel trainNaivePerceptron(Instance[] instanceList, FeatureMap featureMap, int maxIteration) + { + LinearModel model = new LinearModel(featureMap, new float[featureMap.size()]); + for (int it = 0; it < maxIteration; ++it) + { + Utility.shuffleArray(instanceList); + for (Instance instance : instanceList) + { + int y = model.decode(instance.x); + if (y != instance.y) // 误差反馈 + model.update(instance.x, instance.y); + } + } + return model; + } + + /** + 
* 平均感知机训练算法 + * @param instanceList 训练实例 + * @param featureMap 特征函数 + * @param maxIteration 训练迭代次数 + */ + private static LinearModel trainAveragedPerceptron(Instance[] instanceList, FeatureMap featureMap, int maxIteration) + { + float[] parameter = new float[featureMap.size()]; + double[] sum = new double[featureMap.size()]; + int[] time = new int[featureMap.size()]; + + AveragedPerceptron model = new AveragedPerceptron(featureMap, parameter); + int t = 0; + for (int it = 0; it < maxIteration; ++it) + { + Utility.shuffleArray(instanceList); + for (Instance instance : instanceList) + { + ++t; + int y = model.decode(instance.x); + if (y != instance.y) // 误差反馈 + model.update(instance.x, instance.y, sum, time, t); + } + } + model.average(sum, time, t); + return model; + } + + /** + * 训练 + * + * @param corpus 语料库 + * @param maxIteration 最大迭代次数 + * @return 模型在训练集上的准确率 + */ + public BinaryClassificationFMeasure train(String corpus, int maxIteration) + { + return train(corpus, maxIteration, true); + } + + /** + * 训练 + * + * @param corpus 语料库 + * @param maxIteration 最大迭代次数 + * @param averagePerceptron 是否使用平均感知机算法 + * @return 模型在训练集上的准确率 + */ + public BinaryClassificationFMeasure train(String corpus, int maxIteration, boolean averagePerceptron) + { + FeatureMap featureMap = new LockableFeatureMap(new TagSet(TaskType.CLASSIFICATION)); + featureMap.mutable = true; // 训练时特征映射可拓充 + Instance[] instanceList = readInstance(corpus, featureMap); + model = averagePerceptron ? 
trainAveragedPerceptron(instanceList, featureMap, maxIteration) + : trainNaivePerceptron(instanceList, featureMap, maxIteration); + featureMap.mutable = false; // 训练结束后特征不可写 + return evaluate(instanceList); + } + + /** + * 预测 + * + * @param text + * @return + */ + public String predict(String text) + { + int y = model.decode(extractFeature(text, model.featureMap)); + if (y == -1) + y = 0; + return model.tagSet().stringOf(y); + } + + /** + * 评估 + * + * @param corpus + * @return + */ + public BinaryClassificationFMeasure evaluate(String corpus) + { + Instance[] instanceList = readInstance(corpus, model.featureMap); + return evaluate(instanceList); + } + + /** + * 评估 + * + * @param instanceList + * @return + */ + public BinaryClassificationFMeasure evaluate(Instance[] instanceList) + { + int TP = 0, FP = 0, FN = 0; + for (Instance instance : instanceList) + { + int y = model.decode(instance.x); + if (y == 1) + { + if (instance.y == 1) + ++TP; + else + ++FP; + } + else if (instance.y == 1) + ++FN; + } + float p = TP / (float) (TP + FP) * 100; + float r = TP / (float) (TP + FN) * 100; + return new BinaryClassificationFMeasure(p, r, 2 * p * r / (p + r)); + } + + /** + * 从语料库读取实例 + * + * @param corpus 语料库 + * @param featureMap 特征映射 + * @return 数据集 + */ + private Instance[] readInstance(String corpus, FeatureMap featureMap) + { + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(corpus); + List instanceList = new LinkedList(); + for (String line : lineIterator) + { + String[] cells = line.split(","); + String text = cells[0], label = cells[1]; + List x = extractFeature(text, featureMap); + int y = featureMap.tagSet.add(label); + if (y == 0) + y = -1; // 感知机标签约定为±1 + else if (y > 1) + throw new IllegalArgumentException("类别数大于2,目前只支持二分类。"); + instanceList.add(new Instance(x, y)); + } + return instanceList.toArray(new Instance[0]); + } + + /** + * 特征提取 + * + * @param text 文本 + * @param featureMap 特征映射 + * @return 特征向量 + */ + protected abstract List 
extractFeature(String text, FeatureMap featureMap); + + /** + * 向特征向量插入特征 + * + * @param feature 特征 + * @param featureMap 特征映射 + * @param featureList 特征向量 + */ + protected static void addFeature(String feature, FeatureMap featureMap, List featureList) + { + int featureId = featureMap.idOf(feature); + if (featureId != -1) + featureList.add(featureId); + } + + /** + * 样本 + */ + static class Instance + { + /** + * 特征向量 + */ + List x; + /** + * 标签 + */ + int y; + + public Instance(List x, int y) + { + this.x = x; + this.y = y; + } + } + + /** + * 准确率度量 + */ + static class BinaryClassificationFMeasure + { + float P, R, F1; + + public BinaryClassificationFMeasure(float p, float r, float f1) + { + P = p; + R = r; + F1 = f1; + } + + @Override + public String toString() + { + return String.format("P=%.2f R=%.2f F1=%.2f", P, R, F1); + } + } + + public LinearModel getModel() + { + return model; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronLexicalAnalyzer.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronLexicalAnalyzer.java new file mode 100644 index 000000000..7a4106ebe --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronLexicalAnalyzer.java @@ -0,0 +1,198 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-05 PM7:56 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
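`trainAveragedPerceptron` above keeps, besides the weight vector, a running `sum` and a last-update `time` per feature, so averaging is done lazily: a weight's old value is folded into `sum` only when the weight actually changes, making the bookkeeping proportional to the number of updates rather than steps times features. A dense toy sketch of that scheme (class name, data, and dimensions are illustrative, not HanLP's sparse `FeatureMap`/`AveragedPerceptron` API):

```java
// Averaged perceptron for labels y in {-1, +1}, with the lazy sum/time
// bookkeeping used by trainAveragedPerceptron (dense toy version).
class ToyAveragedPerceptron {
    double[] w, sum;
    int[] time;
    int t;  // global update counter

    ToyAveragedPerceptron(int dim) {
        w = new double[dim];
        sum = new double[dim];
        time = new int[dim];
    }

    int decode(double[] x) {
        double s = 0;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return s >= 0 ? 1 : -1;
    }

    void update(double[] x, int y) {
        ++t;
        if (decode(x) == y) return;       // error-driven: only touch weights on mistakes
        for (int i = 0; i < w.length; i++) {
            if (x[i] == 0) continue;
            sum[i] += (t - time[i]) * w[i]; // lazily credit the unchanged span
            time[i] = t;
            w[i] += y * x[i];
        }
    }

    void average() {                       // mirrors model.average(sum, time, t)
        for (int i = 0; i < w.length; i++) {
            sum[i] += (t - time[i]) * w[i];
            w[i] = sum[i] / t;
        }
    }
}
```

On a linearly separable toy set the averaged weights classify like the final weights but are far less sensitive to the order of the last few updates.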
+ * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; + +import java.io.IOException; +import java.util.List; + +/** + * 感知机词法分析器,支持简繁全半角和大小写 + * + * @author hankcs + */ +public class PerceptronLexicalAnalyzer extends AbstractLexicalAnalyzer +{ + public PerceptronLexicalAnalyzer(PerceptronSegmenter segmenter) + { + super(segmenter); + } + + public PerceptronLexicalAnalyzer(PerceptronSegmenter segmenter, PerceptronPOSTagger posTagger) + { + super(segmenter, posTagger); + } + + public PerceptronLexicalAnalyzer(PerceptronSegmenter segmenter, PerceptronPOSTagger posTagger, PerceptronNERecognizer neRecognizer) + { + super(segmenter, posTagger, neRecognizer); + } + + public PerceptronLexicalAnalyzer(LinearModel cwsModel, LinearModel posModel, LinearModel nerModel) + { + segmenter = new PerceptronSegmenter(cwsModel); + if (posModel != null) + { + this.posTagger = new PerceptronPOSTagger(posModel); + config.speechTagging = true; + } + else + { + this.posTagger = null; + } + if (nerModel != null) + { + neRecognizer = new PerceptronNERecognizer(nerModel); + config.ner = true; + } + else + { + neRecognizer = null; + } + } + + public PerceptronLexicalAnalyzer(String cwsModelFile, String posModelFile, String nerModelFile) throws IOException + { + this(new LinearModel(cwsModelFile), posModelFile == null ? null : new LinearModel(posModelFile), nerModelFile == null ? null : new LinearModel(nerModelFile)); + } + + public PerceptronLexicalAnalyzer(String cwsModelFile, String posModelFile) throws IOException + { + this(new LinearModel(cwsModelFile), posModelFile == null ? 
null : new LinearModel(posModelFile), null); + } + + public PerceptronLexicalAnalyzer(String cwsModelFile) throws IOException + { + this(new LinearModel(cwsModelFile), null, null); + } + + public PerceptronLexicalAnalyzer(LinearModel CWSModel) + { + this(CWSModel, null, null); + } + + /** + * 加载配置文件指定的模型构造词法分析器 + * + * @throws IOException + */ + public PerceptronLexicalAnalyzer() throws IOException + { + this(HanLP.Config.PerceptronCWSModelPath, HanLP.Config.PerceptronPOSModelPath, HanLP.Config.PerceptronNERModelPath); + } + + /** + * 中文分词 + * + * @param text + * @param output + */ + public void segment(String text, List output) + { + String normalized = CharTable.convert(text); + segment(text, normalized, output); + } + + /** + * 词性标注 + * + * @param wordList + * @return + */ + public String[] partOfSpeechTag(List wordList) + { + if (posTagger == null) + { + throw new IllegalStateException("未提供词性标注模型"); + } + return tag(wordList); + } + + /** + * 命名实体识别 + * + * @param wordArray + * @param posArray + * @return + */ + public String[] namedEntityRecognize(String[] wordArray, String[] posArray) + { + if (neRecognizer == null) + { + throw new IllegalStateException("未提供命名实体识别模型"); + } + return recognize(wordArray, posArray); + } + + /** + * 在线学习 + * + * @param segmentedTaggedSentence 已分词、标好词性和命名实体的人民日报2014格式的句子 + * @return 是否学习成果(失败的原因是句子格式不合法) + */ + public boolean learn(String segmentedTaggedSentence) + { + Sentence sentence = Sentence.create(segmentedTaggedSentence); + return learn(sentence); + } + + /** + * 在线学习 + * + * @param sentence 已分词、标好词性和命名实体的人民日报2014格式的句子 + * @return 是否学习成果(失败的原因是句子格式不合法) + */ + public boolean learn(Sentence sentence) + { + CharTable.normalize(sentence); + if (!getPerceptronSegmenter().learn(sentence)) return false; + if (posTagger != null && !getPerceptronPOSTagger().learn(sentence)) return false; + if (neRecognizer != null && !getPerceptionNERecognizer().learn(sentence)) return false; + return true; + } + + /** + * 获取分词器 + * + * @return + 
*/ + public PerceptronSegmenter getPerceptronSegmenter() + { + return (PerceptronSegmenter) segmenter; + } + + /** + * 获取词性标注器 + * + * @return + */ + public PerceptronPOSTagger getPerceptronPOSTagger() + { + return (PerceptronPOSTagger) posTagger; + } + + /** + * 获取命名实体识别器 + * + * @return + */ + public PerceptronNERecognizer getPerceptionNERecognizer() + { + return (PerceptronNERecognizer) neRecognizer; + } + +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronNERecognizer.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronNERecognizer.java new file mode 100644 index 000000000..ccee29a34 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronNERecognizer.java @@ -0,0 +1,104 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-28 15:53 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.model.perceptron.instance.NERInstance; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.io.IOException; + +import static com.hankcs.hanlp.utility.Predefine.logger; + +/** + * 命名实体识别 + * + * @author hankcs + */ +public class PerceptronNERecognizer extends PerceptronTagger implements NERecognizer +{ + final NERTagSet tagSet; + + public PerceptronNERecognizer(LinearModel 
nerModel) + { + super(nerModel); + if (nerModel.tagSet().type != TaskType.NER) + { + throw new IllegalArgumentException(String.format("错误的模型类型: 传入的不是命名实体识别模型,而是 %s 模型", nerModel.featureMap.tagSet.type)); + } + this.tagSet = (NERTagSet) model.tagSet(); + } + + public PerceptronNERecognizer(String nerModelPath) throws IOException + { + this(new LinearModel(nerModelPath)); + } + + /** + * 加载配置文件指定的模型 + * + * @throws IOException + */ + public PerceptronNERecognizer() throws IOException + { + this(HanLP.Config.PerceptronNERModelPath); + } + + public String[] recognize(String[] wordArray, String[] posArray) + { + NERInstance instance = new NERInstance(wordArray, posArray, model.featureMap); + return recognize(instance); + } + + public String[] recognize(NERInstance instance) + { + instance.tagArray = new int[instance.size()]; + model.viterbiDecode(instance); + + return instance.tags(tagSet); + } + + @Override + public NERTagSet getNERTagSet() + { + return tagSet; + } + + /** + * 在线学习 + * + * @param segmentedTaggedNERSentence 人民日报2014格式的句子 + * @return 是否学习成功(失败的原因是参数错误) + */ + public boolean learn(String segmentedTaggedNERSentence) + { + return learn(new NERInstance(segmentedTaggedNERSentence, model.featureMap)); + } + + @Override + protected Instance createInstance(Sentence sentence, FeatureMap featureMap) + { + for (IWord word : sentence) + { + if (word instanceof CompoundWord && !tagSet.nerLabels.contains(word.getLabel())) + logger.warning("在线学习不可能学习新的标签: " + word + " ;请标注语料库后重新全量训练。"); + } + return new NERInstance(sentence, featureMap); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronNameGenderClassifier.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronNameGenderClassifier.java new file mode 100644 index 000000000..98388d646 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronNameGenderClassifier.java @@ -0,0 +1,71 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-21 9:08 AM + * + * + * Copyright 
(c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; + +import java.io.IOException; +import java.util.LinkedList; +import java.util.List; + +/** + * 基于感知机的人名性别分类器,预测人名的性别 + * + * @author hankcs + */ +public class PerceptronNameGenderClassifier extends PerceptronClassifier +{ + public PerceptronNameGenderClassifier() + { + } + + public PerceptronNameGenderClassifier(LinearModel model) + { + super(model); + } + + public PerceptronNameGenderClassifier(String modelPath) throws IOException + { + super(modelPath); + } + + @Override + protected List extractFeature(String text, FeatureMap featureMap) + { + List featureList = new LinkedList(); + String givenName = extractGivenName(text); + // 特征模板1:g[0] + addFeature("1" + givenName.substring(0, 1), featureMap, featureList); + // 特征模板2:g[1] + addFeature("2" + givenName.substring(1), featureMap, featureList); + // 特征模板3:g +// addFeature("3" + givenName, featureMap, featureList); + // 偏置特征(代表标签的先验分布,当样本不均衡时有用,但此处的男女预测无用) +// addFeature("b", featureMap, featureList); + return featureList; + } + + /** + * 去掉姓氏,截取中国人名中的名字 + * + * @param name 姓名 + * @return 名 + */ + public static String extractGivenName(String name) + { + if (name.length() <= 2) + return "_" + name.substring(name.length() - 1); + else + return name.substring(name.length() - 2); + + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronPOSTagger.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronPOSTagger.java new file mode 100644 index 000000000..79468f607 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronPOSTagger.java @@ -0,0 +1,132 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-27 下午5:06 + * + * + * Copyright (c) 2017, 码农场. 
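`extractGivenName` above keeps only the last one or two characters of a name, padding two-character names with `_` so both positional feature templates stay applicable, and `extractFeature` prefixes each feature string with its template id so features from different templates cannot collide. A sketch of that extraction outside HanLP's classes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of PerceptronNameGenderClassifier's feature extraction: strip the
// surname, then emit one string feature per template, prefixed by template id.
class NameFeatures {
    /** Last 1-2 chars are the given name; "_" pads one-char given names. */
    static String givenName(String name) {
        if (name.length() <= 2)
            return "_" + name.substring(name.length() - 1);
        return name.substring(name.length() - 2);
    }

    static List<String> features(String name) {
        String g = givenName(name);
        List<String> f = new ArrayList<>();
        f.add("1" + g.substring(0, 1)); // template 1: first char of given name
        f.add("2" + g.substring(1));    // template 2: remainder of given name
        return f;
    }
}
```

For a three-character name the surname is dropped; for a two-character name the placeholder occupies the first-character slot.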
All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.instance.POSInstance; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.tokenizer.lexical.POSTagger; + +import java.io.IOException; +import java.util.List; + +/** + * 词性标注器 + * + * @author hankcs + */ +public class PerceptronPOSTagger extends PerceptronTagger implements POSTagger +{ + public PerceptronPOSTagger(LinearModel model) + { + super(model); + if (model.featureMap.tagSet.type != TaskType.POS) + { + throw new IllegalArgumentException(String.format("错误的模型类型: 传入的不是词性标注模型,而是 %s 模型", model.featureMap.tagSet.type)); + } + } + + public PerceptronPOSTagger(String modelPath) throws IOException + { + this(new LinearModel(modelPath)); + } + + /** + * 加载配置文件指定的模型 + * + * @throws IOException + */ + public PerceptronPOSTagger() throws IOException + { + this(HanLP.Config.PerceptronPOSModelPath); + } + + /** + * 标注 + * + * @param words + * @return + */ + @Override + public String[] tag(String... 
words) + { + POSInstance instance = new POSInstance(words, model.featureMap); + return tag(instance); + } + + public String[] tag(POSInstance instance) + { + instance.tagArray = new int[instance.featureMatrix.length]; + + model.viterbiDecode(instance, instance.tagArray); + return instance.tags(model.tagSet()); + } + + /** + * 标注 + * + * @param wordList + * @return + */ + @Override + public String[] tag(List wordList) + { + String[] termArray = new String[wordList.size()]; + wordList.toArray(termArray); + return tag(termArray); + } + + /** + * 在线学习 + * + * @param segmentedTaggedSentence 人民日报2014格式的句子 + * @return 是否学习成功(失败的原因是参数错误) + */ + public boolean learn(String segmentedTaggedSentence) + { + return learn(POSInstance.create(segmentedTaggedSentence, model.featureMap)); + } + + /** + * 在线学习 + * + * @param wordTags [单词]/[词性]数组 + * @return 是否学习成功(失败的原因是参数错误) + */ + public boolean learn(String... wordTags) + { + String[] words = new String[wordTags.length]; + String[] tags = new String[wordTags.length]; + for (int i = 0; i < wordTags.length; i++) + { + String[] wordTag = wordTags[i].split("//"); + words[i] = wordTag[0]; + tags[i] = wordTag[1]; + } + return learn(new POSInstance(words, tags, model.featureMap)); + } + + @Override + protected Instance createInstance(Sentence sentence, FeatureMap featureMap) + { + for (Word word : sentence.toSimpleWordList()) + { + if (!model.featureMap.tagSet.contains(word.getLabel())) + throw new IllegalArgumentException("在线学习不可能学习新的标签: " + word + " ;请标注语料库后重新全量训练。"); + } + return POSInstance.create(sentence, featureMap); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronSegmenter.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronSegmenter.java new file mode 100644 index 000000000..5f5b721de --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronSegmenter.java @@ -0,0 +1,148 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-05 PM7:56 + * + * + * Copyright (c) 2008-2016, 
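Note that `learn(String... wordTags)` above splits each pair on the two-character separator `"//"`, even though the javadoc reads `[单词]/[词性]`; the double slash presumably lets words that themselves contain `/` survive. A sketch of that parsing under the assumed `word//tag` input format:

```java
// Split "word//tag" pairs the way PerceptronPOSTagger.learn(String...) does.
class WordTagParser {
    /** Returns {words, tags}; assumes every item contains exactly one "//". */
    static String[][] parse(String... wordTags) {
        String[] words = new String[wordTags.length];
        String[] tags = new String[wordTags.length];
        for (int i = 0; i < wordTags.length; i++) {
            String[] wt = wordTags[i].split("//"); // literal "//", regex-safe
            words[i] = wt[0];
            tags[i] = wt[1];
        }
        return new String[][]{words, tags};
    }
}
```

A word such as `1/2` keeps its internal slash because only the doubled separator splits.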
码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.CWSInstance; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.tagset.CWSTagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.tokenizer.lexical.Segmenter; + +import java.io.IOException; +import java.util.LinkedList; +import java.util.List; + +/** + * 中文分词 + * + * @author hankcs + */ +public class PerceptronSegmenter extends PerceptronTagger implements Segmenter +{ + private final CWSTagSet CWSTagSet; + + public PerceptronSegmenter(LinearModel cwsModel) + { + super(cwsModel); + if (cwsModel.featureMap.tagSet.type != TaskType.CWS) + { + throw new IllegalArgumentException(String.format("错误的模型类型: 传入的不是分词模型,而是 %s 模型", cwsModel.featureMap.tagSet.type)); + } + CWSTagSet = (CWSTagSet) cwsModel.featureMap.tagSet; + } + + public PerceptronSegmenter(String cwsModelFile) throws IOException + { + this(new LinearModel(cwsModelFile)); + } + + /** + * 加载配置文件指定的模型 + * @throws IOException + */ + public PerceptronSegmenter() throws IOException + { + this(HanLP.Config.PerceptronCWSModelPath); + } + + public void segment(String text, List output) + { + String normalized = normalize(text); + segment(text, normalized, output); + } + + public void segment(String text, String normalized, List output) + { + if (text.isEmpty()) return; + Instance instance = new CWSInstance(normalized, model.featureMap); + segment(text, instance, output); + } + + public void segment(String text, Instance instance, 
List output) + { + int[] tagArray = instance.tagArray; + model.viterbiDecode(instance, tagArray); + + StringBuilder result = new StringBuilder(); + result.append(text.charAt(0)); + + for (int i = 1; i < tagArray.length; i++) + { + if (tagArray[i] == CWSTagSet.B || tagArray[i] == CWSTagSet.S) + { + output.add(result.toString()); + result.setLength(0); + } + result.append(text.charAt(i)); + } + if (result.length() != 0) + { + output.add(result.toString()); + } + } + + public List segment(String sentence) + { + List result = new LinkedList(); + segment(sentence, result); + return result; + } + + /** + * 在线学习 + * + * @param segmentedSentence 分好词的句子,空格或tab分割,不含词性 + * @return 是否学习成功(失败的原因是参数错误) + */ + public boolean learn(String segmentedSentence) + { + return learn(segmentedSentence.split("\\s+")); + } + + /** + * 在线学习 + * + * @param words 分好词的句子 + * @return 是否学习成功(失败的原因是参数错误) + */ + public boolean learn(String... words) + { +// for (int i = 0; i < words.length; i++) // 防止传入带词性的词语 +// { +// int index = words[i].indexOf('/'); +// if (index > 0) +// { +// words[i] = words[i].substring(0, index); +// } +// } + return learn(new CWSInstance(words, model.featureMap)); + } + + @Override + protected Instance createInstance(Sentence sentence, FeatureMap featureMap) + { + return CWSInstance.create(sentence, featureMap); + } + + @Override + public double[] evaluate(String corpora) throws IOException + { + // 这里用CWS的F1 + double[] prf = Utility.prf(Utility.evaluateCWS(corpora, this)); + return prf; + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronTagger.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronTagger.java new file mode 100644 index 000000000..1f3acb8d0 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronTagger.java @@ -0,0 +1,84 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-11-18 下午10:18 + * + * + * Copyright (c) 2017, 码农场. 
All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.model.StructuredPerceptron; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; + +import java.io.IOException; + +/** + * 抽象的感知机标注器 + * + * @author hankcs + */ +public abstract class PerceptronTagger extends InstanceConsumer +{ + /** + * 用StructurePerceptron实现在线学习 + */ + protected final StructuredPerceptron model; + + public PerceptronTagger(LinearModel model) + { + assert model != null; + this.model = model instanceof StructuredPerceptron ? (StructuredPerceptron) model : new StructuredPerceptron(model.featureMap, model.parameter); + } + + public PerceptronTagger(StructuredPerceptron model) + { + assert model != null; + this.model = model; + } + + public LinearModel getModel() + { + return model; + } + + /** + * 在线学习 + * + * @param instance + * @return + */ + public boolean learn(Instance instance) + { + if (instance == null) return false; + model.update(instance); + return true; + } + + /** + * 在线学习 + * + * @param sentence + * @return + */ + public boolean learn(Sentence sentence) + { + return learn(createInstance(sentence, model.featureMap)); + } + + /** + * 性能测试 + * + * @param corpora 数据集 + * @return 默认返回accuracy,有些子类可能返回P,R,F1 + * @throws IOException + */ + public double[] evaluate(String corpora) throws IOException + { + return evaluate(corpora, this.getModel()); + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronTrainer.java b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronTrainer.java new file mode 100644 index 000000000..5edfd8157 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/PerceptronTrainer.java @@ -0,0 +1,352 @@ +/* + 
* Hankcs + * me@hankcs.com + * 2017-10-26 下午5:51 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.model.perceptron.common.FrequencyMap; +import com.hankcs.hanlp.model.perceptron.feature.ImmutableFeatureMap; +import com.hankcs.hanlp.model.perceptron.feature.MutableFeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.model.AveragedPerceptron; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.model.StructuredPerceptron; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.model.perceptron.utility.IOUtility; +import com.hankcs.hanlp.model.perceptron.instance.InstanceHandler; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.classification.utilities.io.ConsoleLogger; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; + +import java.io.*; +import java.util.LinkedList; +import java.util.List; + +import static java.lang.System.err; +import static java.lang.System.out; + +/** + * 感知机训练基类 + * + * @author hankcs + */ +public abstract class PerceptronTrainer extends InstanceConsumer +{ + + /** + * 训练结果 + */ + public static class Result + { + /** + * 模型 + */ + LinearModel model; + /** + * 精确率(Precision), 召回率(Recall)和F1-Measure
+ * 中文参考:https://blog.argcv.com/articles/1036.c + */ + double prf[]; + + public Result(LinearModel model, double[] prf) + { + this.model = model; + this.prf = prf; + } + + /** + * 获取准确率 + * + * @return + */ + public double getAccuracy() + { + if (prf.length == 3) + { + return prf[2]; + } + return prf[0]; + } + + /** + * 获取模型 + * + * @return + */ + public LinearModel getModel() + { + return model; + } + } + + /** + * 创建标注集 + * + * @return + */ + protected abstract TagSet createTagSet(); + + /** + * 训练 + * + * @param trainingFile 训练集 + * @param developFile 开发集 + * @param modelFile 模型保存路径 + * @param compressRatio 压缩比 + * @param maxIteration 最大迭代次数 + * @param threadNum 线程数 + * @return 一个包含模型和精度的结构 + * @throws IOException + */ + public Result train(String trainingFile, String developFile, + String modelFile, final double compressRatio, + final int maxIteration, final int threadNum) throws IOException + { + if (developFile == null) + { + developFile = trainingFile; + } + // 加载训练语料 + TagSet tagSet = createTagSet(); + MutableFeatureMap mutableFeatureMap = new MutableFeatureMap(tagSet); + ConsoleLogger logger = new ConsoleLogger(); + logger.start("开始加载训练集...\n"); + Instance[] instances = loadTrainInstances(trainingFile, mutableFeatureMap); + tagSet.lock(); + logger.finish("\n加载完毕,实例一共%d句,特征总数%d\n", instances.length, mutableFeatureMap.size() * tagSet.size()); + + // 开始训练 + ImmutableFeatureMap immutableFeatureMap = new ImmutableFeatureMap(mutableFeatureMap.featureIdMap, tagSet); + mutableFeatureMap = null; + double[] accuracy = null; + + if (threadNum == 1) + { + AveragedPerceptron model; + model = new AveragedPerceptron(immutableFeatureMap); + final double[] total = new double[model.parameter.length]; + final int[] timestamp = new int[model.parameter.length]; + int current = 0; + for (int iter = 1; iter <= maxIteration; iter++) + { + Utility.shuffleArray(instances); + for (Instance instance : instances) + { + ++current; + int[] guessLabel = new int[instance.length()]; + 
model.viterbiDecode(instance, guessLabel); + for (int i = 0; i < instance.length(); i++) + { + int[] featureVector = instance.getFeatureAt(i); + int[] goldFeature = new int[featureVector.length]; + int[] predFeature = new int[featureVector.length]; + for (int j = 0; j < featureVector.length - 1; j++) + { + goldFeature[j] = featureVector[j] * tagSet.size() + instance.tagArray[i]; + predFeature[j] = featureVector[j] * tagSet.size() + guessLabel[i]; + } + goldFeature[featureVector.length - 1] = (i == 0 ? tagSet.bosId() : instance.tagArray[i - 1]) * tagSet.size() + instance.tagArray[i]; + predFeature[featureVector.length - 1] = (i == 0 ? tagSet.bosId() : guessLabel[i - 1]) * tagSet.size() + guessLabel[i]; + model.update(goldFeature, predFeature, total, timestamp, current); + } + } + + // 在开发集上校验 + accuracy = trainingFile.equals(developFile) ? IOUtility.evaluate(instances, model) : evaluate(developFile, model); + out.printf("Iter#%d - ", iter); + printAccuracy(accuracy); + } + // 平均 + model.average(total, timestamp, current); + accuracy = trainingFile.equals(developFile) ? IOUtility.evaluate(instances, model) : evaluate(developFile, model); + out.print("AP - "); + printAccuracy(accuracy); + logger.start("以压缩比 %.2f 保存模型到 %s ... 
", compressRatio, modelFile); + model.save(modelFile, immutableFeatureMap.featureIdMap.entrySet(), compressRatio); + logger.finish(" 保存完毕\n"); + if (compressRatio == 0) return new Result(model, accuracy); + } + else + { + // 多线程用Structure Perceptron + StructuredPerceptron[] models = new StructuredPerceptron[threadNum]; + for (int i = 0; i < models.length; i++) + { + models[i] = new StructuredPerceptron(immutableFeatureMap); + } + + TrainingWorker[] workers = new TrainingWorker[threadNum]; + int job = instances.length / threadNum; + for (int iter = 1; iter <= maxIteration; iter++) + { + Utility.shuffleArray(instances); + try + { + for (int i = 0; i < workers.length; i++) + { + workers[i] = new TrainingWorker(instances, i * job, + i == workers.length - 1 ? instances.length : (i + 1) * job, + models[i]); + workers[i].start(); + } + for (TrainingWorker worker : workers) + { + worker.join(); + } + for (int j = 0; j < models[0].parameter.length; j++) + { + for (int i = 1; i < models.length; i++) + { + models[0].parameter[j] += models[i].parameter[j]; + } + models[0].parameter[j] /= threadNum; + } + accuracy = trainingFile.equals(developFile) ? IOUtility.evaluate(instances, models[0]) : evaluate(developFile, models[0]); + out.printf("Iter#%d - ", iter); + printAccuracy(accuracy); + } + catch (InterruptedException e) + { + err.printf("线程同步异常,训练失败\n"); + e.printStackTrace(); + return null; + } + } + logger.start("以压缩比 %.2f 保存模型到 %s ... 
", compressRatio, modelFile); + models[0].save(modelFile, immutableFeatureMap.featureIdMap.entrySet(), compressRatio, HanLP.Config.DEBUG); + logger.finish(" 保存完毕\n"); + if (compressRatio == 0) return new Result(models[0], accuracy); + } + + LinearModel model = new LinearModel(modelFile); + if (compressRatio > 0) + { + accuracy = evaluate(developFile, model); + out.printf("\n%.2f compressed model - ", compressRatio); + printAccuracy(accuracy); + } + + return new Result(model, accuracy); + } + + private void printAccuracy(double[] accuracy) + { + if (accuracy.length == 3) + { + out.printf("P:%.2f R:%.2f F:%.2f\n", accuracy[0], accuracy[1], accuracy[2]); + } + else + { + out.printf("P:%.2f\n", accuracy[0]); + } + } + + private static class TrainingWorker extends Thread + { + private Instance[] instances; + private int start; + private int end; + private StructuredPerceptron model; + + public TrainingWorker(Instance[] instances, int start, int end, StructuredPerceptron model) + { + this.instances = instances; + this.start = start; + this.end = end; + this.model = model; + } + + @Override + public void run() + { + for (int s = start; s < end; ++s) + { + Instance instance = instances[s]; + model.update(instance); + } +// out.printf("Finished [%d,%d)\n", start, end); + } + } + + protected Instance[] loadTrainInstances(String trainingFile, final MutableFeatureMap mutableFeatureMap) throws IOException + { + final List instanceList = new LinkedList(); + IOUtility.loadInstance(trainingFile, new InstanceHandler() + { + @Override + public boolean process(Sentence sentence) + { + Utility.normalize(sentence); + instanceList.add(PerceptronTrainer.this.createInstance(sentence, mutableFeatureMap)); + return false; + } + }); + Instance[] instances = new Instance[instanceList.size()]; + instanceList.toArray(instances); + return instances; + } + + + private static DoubleArrayTrie loadDictionary(String trainingFile, String dictionaryFile) throws IOException + { + FrequencyMap 
dictionaryMap = new FrequencyMap(); + if (dictionaryFile == null) + { + out.printf("从训练文件%s中统计词库...\n", trainingFile); + loadWordFromFile(trainingFile, dictionaryMap, true); + } + else + { + out.printf("从外部词典%s中加载词库...\n", dictionaryFile); + loadWordFromFile(dictionaryFile, dictionaryMap, false); + } + DoubleArrayTrie dat = new DoubleArrayTrie(); + dat.build(dictionaryMap); + out.printf("加载完毕,词库总词数:%d,总词频:%d\n", dictionaryMap.size(), dictionaryMap.totalFrequency); + + return dat; + } + + public Result train(String trainingFile, String modelFile) throws IOException + { + return train(trainingFile, trainingFile, modelFile); + } + + public Result train(String trainingFile, String developFile, String modelFile) throws IOException + { + return train(trainingFile, developFile, modelFile, 0.1, 50, Runtime.getRuntime().availableProcessors()); + } + + private static void loadWordFromFile(String path, FrequencyMap storage, boolean segmented) throws IOException + { + BufferedReader br = IOUtility.newBufferedReader(path); + String line; + while ((line = br.readLine()) != null) + { + if (segmented) + { + for (String word : IOUtility.readLineToArray(line)) + { + storage.add(word); + } + } + else + { + line = line.trim(); + if (line.length() != 0) + { + storage.add(line); + } + } + } + br.close(); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/cli/Args.java b/src/main/java/com/hankcs/hanlp/model/perceptron/cli/Args.java new file mode 100644 index 000000000..68655bb50 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/cli/Args.java @@ -0,0 +1,686 @@ +/* + * Copyright (c) 2005, Sam Pullara. All Rights Reserved. + * You may modify and redistribute as long as this attribution remains. 
+ */ + +package com.hankcs.hanlp.model.perceptron.cli; + + +import java.beans.BeanInfo; +import java.beans.IntrospectionException; +import java.beans.Introspector; +import java.beans.PropertyDescriptor; +import java.io.PrintStream; +import java.lang.reflect.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; + +public class Args +{ + + /** + * A convenience method for parsing and automatically producing error messages. + * + * @param target Either an instance or a class + * @param args The arguments you want to parse and populate + * @return The list of arguments that were not consumed + */ + public static List parseOrExit(Object target, String[] args) + { + try + { + return parse(target, args); + } + catch (IllegalArgumentException e) + { + System.err.println(e.getMessage()); + Args.usage(target); + System.exit(1); + throw e; + } + } + + public static List parse(Object target, String[] args) + { + return parse(target, args, true); + } + + /** + * Parse a set of arguments and populate the target with the appropriate values. 
* + * @param target Either an instance or a class + * @param args The arguments you want to parse and populate + * @param failOnExtraFlags Throw an IllegalArgumentException if extra flags are present + * @return The list of arguments that were not consumed + */ + public static List parse(Object target, String[] args, boolean failOnExtraFlags) + { + List arguments = new ArrayList(); + arguments.addAll(Arrays.asList(args)); + Class clazz; + if (target instanceof Class) + { + clazz = (Class) target; + } + else + { + clazz = target.getClass(); + try + { + BeanInfo info = Introspector.getBeanInfo(clazz); + for (PropertyDescriptor pd : info.getPropertyDescriptors()) + { + processProperty(target, pd, arguments); + } + } + catch (IntrospectionException e) + { + // If it's not a JavaBean we ignore it + } + } + + // Check fields of 'target' class and its superclasses + for (Class currentClazz = clazz; currentClazz != null; currentClazz = currentClazz.getSuperclass()) + { + for (Field field : currentClazz.getDeclaredFields()) + { + processField(target, field, arguments); + } + } + + if (failOnExtraFlags) + { + for (String argument : arguments) + { + if (argument.startsWith("-")) + { + throw new IllegalArgumentException("无效参数: " + argument); + } + } + } + return arguments; + } + + private static void processField(Object target, Field field, List arguments) + { + Argument argument = field.getAnnotation(Argument.class); + if (argument != null) + { + boolean set = false; + for (Iterator i = arguments.iterator(); i.hasNext(); ) + { + String arg = i.next(); + String prefix = argument.prefix(); + String delimiter = argument.delimiter(); + if (arg.startsWith(prefix)) + { + Object value; + String name = getName(argument, field); + String alias = getAlias(argument); + arg = arg.substring(prefix.length()); + Class type = field.getType(); + if (arg.equals(name) || (alias != null && arg.equals(alias))) + { + i.remove(); + value = consumeArgumentValue(name, type, argument, i); + if (!set) 
+ { + setField(type, field, target, value, delimiter); + } + else + { + addArgument(type, field, target, value, delimiter); + } + set = true; + } + if (set && !type.isArray()) break; + } + } + if (!set && argument.required()) + { + String name = getName(argument, field); + throw new IllegalArgumentException("缺少必需参数: " + argument.prefix() + name); + } + } + } + + private static void addArgument(Class type, Field field, Object target, Object value, String delimiter) + { + try + { + Object[] os = (Object[]) field.get(target); + Object[] vs = (Object[]) getValue(type, value, delimiter); + Object[] s = (Object[]) Array.newInstance(type.getComponentType(), os.length + vs.length); + System.arraycopy(os, 0, s, 0, os.length); + System.arraycopy(vs, 0, s, os.length, vs.length); + field.set(target, s); + } + catch (IllegalAccessException iae) + { + throw new IllegalArgumentException("Could not set field " + field, iae); + } + catch (NoSuchMethodException e) + { + throw new IllegalArgumentException("Could not find constructor in class " + type.getName() + " that takes a string", e); + } + } + + private static void addPropertyArgument(Class type, PropertyDescriptor property, Object target, Object value, String delimiter) + { + try + { + Object[] os = (Object[]) property.getReadMethod().invoke(target); + Object[] vs = (Object[]) getValue(type, value, delimiter); + Object[] s = (Object[]) Array.newInstance(type.getComponentType(), os.length + vs.length); + System.arraycopy(os, 0, s, 0, os.length); + System.arraycopy(vs, 0, s, os.length, vs.length); + property.getWriteMethod().invoke(target, (Object) s); + } + catch (IllegalAccessException iae) + { + throw new IllegalArgumentException("Could not set property " + property, iae); + } + catch (NoSuchMethodException e) + { + throw new IllegalArgumentException("Could not find constructor in class " + type.getName() + " that takes a string", e); + } + catch (InvocationTargetException e) + { + throw new IllegalArgumentException("Failed 
to validate argument " + value + " for " + property); + } + } + + private static void processProperty(Object target, PropertyDescriptor property, List arguments) + { + Method writeMethod = property.getWriteMethod(); + if (writeMethod != null) + { + Argument argument = writeMethod.getAnnotation(Argument.class); + if (argument != null) + { + boolean set = false; + for (Iterator i = arguments.iterator(); i.hasNext(); ) + { + String arg = i.next(); + String prefix = argument.prefix(); + String delimiter = argument.delimiter(); + if (arg.startsWith(prefix)) + { + Object value; + String name = getName(argument, property); + String alias = getAlias(argument); + arg = arg.substring(prefix.length()); + Class type = property.getPropertyType(); + if (arg.equals(name) || (alias != null && arg.equals(alias))) + { + i.remove(); + value = consumeArgumentValue(name, type, argument, i); + if (!set) + { + setProperty(type, property, target, value, delimiter); + } + else + { + addPropertyArgument(type, property, target, value, delimiter); + } + set = true; + } + if (set && !type.isArray()) break; + } + } + if (!set && argument.required()) + { + String name = getName(argument, property); + throw new IllegalArgumentException("You must set argument " + name); + } + } + } + } + + /** + * Generate usage information based on the target annotations. + * + * @param target An instance or class. + */ + public static void usage(Object target) + { + usage(System.err, target); + } + + /** + * Generate usage information based on the target annotations. + * + * @param errStream A {@link PrintStream} to print the usage information to. + * @param target An instance or class. 
+ */ + public static void usage(PrintStream errStream, Object target) + { + Class clazz; + if (target instanceof Class) + { + clazz = (Class) target; + } + else + { + clazz = target.getClass(); + } + String clazzName = clazz.getName(); + { + int index = clazzName.lastIndexOf('$'); + if (index > 0) + { + clazzName = clazzName.substring(0, index); + } + } + errStream.println("Usage: " + clazzName); + for (Class currentClazz = clazz; currentClazz != null; currentClazz = currentClazz.getSuperclass()) + { + for (Field field : currentClazz.getDeclaredFields()) + { + fieldUsage(errStream, target, field); + } + } + try + { + BeanInfo info = Introspector.getBeanInfo(clazz); + for (PropertyDescriptor pd : info.getPropertyDescriptors()) + { + propertyUsage(errStream, target, pd); + } + } + catch (IntrospectionException e) + { + // If it's not a JavaBean we ignore it + } + } + + private static void fieldUsage(PrintStream errStream, Object target, Field field) + { + Argument argument = field.getAnnotation(Argument.class); + if (argument != null) + { + String name = getName(argument, field); + String alias = getAlias(argument); + String prefix = argument.prefix(); + String delimiter = argument.delimiter(); + String description = argument.description(); + makeAccessible(field); + try + { + Object defaultValue = field.get(target); + Class type = field.getType(); + propertyUsage(errStream, prefix, name, alias, type, delimiter, description, defaultValue); + } + catch (IllegalAccessException e) + { + throw new IllegalArgumentException("Could not use this field " + field + " as an argument field", e); + } + } + } + + private static void propertyUsage(PrintStream errStream, Object target, PropertyDescriptor field) + { + Method writeMethod = field.getWriteMethod(); + if (writeMethod != null) + { + Argument argument = writeMethod.getAnnotation(Argument.class); + if (argument != null) + { + String name = getName(argument, field); + String alias = getAlias(argument); + String prefix = 
argument.prefix(); + String delimiter = argument.delimiter(); + String description = argument.description(); + try + { + Method readMethod = field.getReadMethod(); + Object defaultValue; + if (readMethod == null) + { + defaultValue = null; + } + else + { + defaultValue = readMethod.invoke(target, (Object[]) null); + } + Class type = field.getPropertyType(); + propertyUsage(errStream, prefix, name, alias, type, delimiter, description, defaultValue); + } + catch (IllegalAccessException e) + { + throw new IllegalArgumentException("Could not use this field " + field + " as an argument field", e); + } + catch (InvocationTargetException e) + { + throw new IllegalArgumentException("Could not get default value for " + field, e); + } + } + } + + } + + private static void propertyUsage(PrintStream errStream, String prefix, String name, String alias, Class type, String delimiter, String description, Object defaultValue) + { + StringBuilder sb = new StringBuilder(" "); + sb.append(prefix); + sb.append(name); + if (alias != null) + { + sb.append(" ("); + sb.append(prefix); + sb.append(alias); + sb.append(")"); + } + if (type == Boolean.TYPE || type == Boolean.class) + { + sb.append("\t[flag]\t"); + sb.append(description); + } + else + { + sb.append("\t["); + if (type.isArray()) + { + String typeName = getTypeName(type.getComponentType()); + sb.append(typeName); + sb.append("["); + sb.append(delimiter); + sb.append("]"); + } + else + { + String typeName = getTypeName(type); + sb.append(typeName); + } + sb.append("]\t"); + sb.append(description); + if (defaultValue != null) + { + sb.append(" ("); + if (type.isArray()) + { + List list = new ArrayList(); + int len = Array.getLength(defaultValue); + for (int i = 0; i < len; i++) + { + list.add(Array.get(defaultValue, i)); + } + sb.append(list); + } + else + { + sb.append(defaultValue); + } + sb.append(")"); + } + + } + errStream.println(sb); + } + + private static String getTypeName(Class type) + { + String typeName = 
type.getName(); + int beginIndex = typeName.lastIndexOf("."); + typeName = typeName.substring(beginIndex + 1); + return typeName; + } + + static String getName(Argument argument, PropertyDescriptor property) + { + String name = argument.value(); + if (name.equals("")) + { + name = property.getName(); + } + return name; + + } + + private static Object consumeArgumentValue(String name, Class type, Argument argument, Iterator i) + { + Object value; + if (type == Boolean.TYPE || type == Boolean.class) + { + value = true; + } + else + { + if (i.hasNext()) + { + value = i.next(); + i.remove(); + } + else + { + throw new IllegalArgumentException("非flag参数必须指定值: " + argument.prefix() + name); + } + } + return value; + } + + static void setProperty(Class type, PropertyDescriptor property, Object target, Object value, String delimiter) + { + try + { + value = getValue(type, value, delimiter); + property.getWriteMethod().invoke(target, value); + } + catch (IllegalAccessException iae) + { + throw new IllegalArgumentException("Could not set property " + property, iae); + } + catch (NoSuchMethodException e) + { + throw new IllegalArgumentException("Could not find constructor in class " + type.getName() + " that takes a string", e); + } + catch (InvocationTargetException e) + { + throw new IllegalArgumentException("Failed to validate argument " + value + " for " + property); + } + } + + static String getAlias(Argument argument) + { + String alias = argument.alias(); + if (alias.equals("")) + { + alias = null; + } + return alias; + } + + static String getName(Argument argument, Field field) + { + String name = argument.value(); + if (name.equals("")) + { + name = field.getName(); + } + return name; + } + + static void setField(Class type, Field field, Object target, Object value, String delimiter) + { + makeAccessible(field); + try + { + value = getValue(type, value, delimiter); + field.set(target, value); + } + catch (IllegalAccessException iae) + { + throw new 
IllegalArgumentException("Could not set field " + field, iae); + } + catch (NoSuchMethodException e) + { + throw new IllegalArgumentException("Could not find constructor in class " + type.getName() + " that takes a string", e); + } + } + + private static Object getValue(Class type, Object value, String delimiter) throws NoSuchMethodException + { + if (type != String.class && type != Boolean.class && type != Boolean.TYPE) + { + String string = (String) value; + if (type.isArray()) + { + String[] strings = string.split(delimiter); + type = type.getComponentType(); + if (type == String.class) + { + value = strings; + } + else + { + Object[] array = (Object[]) Array.newInstance(type, strings.length); + for (int i = 0; i < array.length; i++) + { + array[i] = createValue(type, strings[i]); + } + value = array; + } + } + else + { + value = createValue(type, string); + } + } + return value; + } + + private static Object createValue(Class type, String valueAsString) throws NoSuchMethodException + { + for (ValueCreator valueCreator : valueCreators) + { + Object createdValue = valueCreator.createValue(type, valueAsString); + if (createdValue != null) + { + return createdValue; + } + } + throw new IllegalArgumentException(String.format("cannot instantiate any %s object using %s value", type.toString(), valueAsString)); + } + + private static void makeAccessible(AccessibleObject ao) + { + if (ao instanceof Member) + { + Member member = (Member) ao; + if (!Modifier.isPublic(member.getModifiers())) + { + ao.setAccessible(true); + } + } + } + + public static interface ValueCreator + { + /** + * Creates a value object of the given type using the given string value representation. + * + * @param type the type to create an instance of + * @param value the string represented value of the object to create + * @return null if the object could not be created, the value otherwise + */ + public Object createValue(Class type, String value); + } + + /** + * Creates a {@link ValueCreator} 
object able to create objects assignable from the given type, + * using a static one-arg method whose name is the given one, taking a String object as parameter + * + * @param compatibleType the base assignable for which this object will try to invoke the given method + * @param methodName the name of the one-arg method taking a String as parameter that will be used to build a new value + * @return null if the object could not be created, the value otherwise + */ + public static ValueCreator byStaticMethodInvocation(final Class compatibleType, final String methodName) + { + return new ValueCreator() + { + public Object createValue(Class type, String value) + { + Object v = null; + if (compatibleType.isAssignableFrom(type)) + { + try + { + Method m = type.getMethod(methodName, String.class); + return m.invoke(null, value); + } + catch (NoSuchMethodException e) + { + // ignore + } + catch (Exception e) + { + throw new IllegalArgumentException(String.format("could not invoke %s#%s to create an object from %s", type.toString(), methodName, value)); + } + } + return v; + } + }; + } + + /** + * {@link ValueCreator} building objects using a one-arg constructor taking a {@link String} object as parameter + */ + public static final ValueCreator FROM_STRING_CONSTRUCTOR = new ValueCreator() + { + public Object createValue(Class type, String value) + { + Object v = null; + try + { + Constructor init = type.getDeclaredConstructor(String.class); + v = init.newInstance(value); + } + catch (NoSuchMethodException e) + { + // ignore + } + catch (Exception e) + { + throw new IllegalArgumentException("Failed to convert " + value + " to type " + type.getName(), e); + } + return v; + } + }; + + public static final ValueCreator ENUM_CREATOR = new ValueCreator() + { + @SuppressWarnings({"unchecked", "rawtypes"}) + public Object createValue(Class type, String value) + { + if (Enum.class.isAssignableFrom(type)) + { + return Enum.valueOf(type, value); + } + return null; + } + }; + + 
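The `ValueCreator` extension point above can be exercised on its own: given a target type and a string, it either produces a typed value or returns `null` so the next creator gets a turn. A minimal self-contained sketch of the static-factory variant (the class name `ValueCreatorSketch` and method `byStaticMethod` are illustrative, not part of the library):

```java
import java.lang.reflect.Method;

// Minimal sketch of the ValueCreator pattern: convert a String argument into
// a typed value by invoking a static factory method on the target type,
// the same idea byStaticMethodInvocation implements for Args.
public class ValueCreatorSketch
{
    interface ValueCreator
    {
        Object createValue(Class type, String value);
    }

    static ValueCreator byStaticMethod(final String methodName)
    {
        return new ValueCreator()
        {
            public Object createValue(Class type, String value)
            {
                try
                {
                    // e.g. Integer.valueOf(String) when type == Integer.class
                    Method m = type.getMethod(methodName, String.class);
                    return m.invoke(null, value);
                }
                catch (Exception e)
                {
                    return null; // this creator cannot handle the type
                }
            }
        };
    }

    public static void main(String[] args)
    {
        ValueCreator vc = byStaticMethod("valueOf");
        System.out.println(vc.createValue(Integer.class, "42"));  // 42
        System.out.println(vc.createValue(Double.class, "3.14")); // 3.14
    }
}
```

Returning `null` instead of throwing on an unsupported type is what lets `createValue` in `Args` walk the registered creators in order until one succeeds.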
private static final List<ValueCreator> DEFAULT_VALUE_CREATORS = Arrays.asList(Args.FROM_STRING_CONSTRUCTOR, Args.ENUM_CREATOR); + private static List<ValueCreator> valueCreators = new ArrayList<ValueCreator>(DEFAULT_VALUE_CREATORS); + + /** + * Allows external extension of the value creators. + * + * @param vc another value creator to take into account for trying to create values + */ + public static void registerValueCreator(ValueCreator vc) + { + valueCreators.add(vc); + } + + /** + * Cleanup of registered ValueCreators (mainly for tests) + */ + public static void resetValueCreators() + { + valueCreators.clear(); + valueCreators.addAll(DEFAULT_VALUE_CREATORS); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/cli/Argument.java b/src/main/java/com/hankcs/hanlp/model/perceptron/cli/Argument.java new file mode 100644 index 000000000..13468f0b2 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/cli/Argument.java @@ -0,0 +1,45 @@ +/* + * Copyright (c) 2005, Sam Pullara. All Rights Reserved. + * You may modify and redistribute as long as this attribution remains. + */ + +package com.hankcs.hanlp.model.perceptron.cli; + +import java.lang.annotation.Documented; +import java.lang.annotation.Retention; +import java.lang.annotation.RetentionPolicy; + +@Documented +@Retention(RetentionPolicy.RUNTIME) +public @interface Argument +{ + /** + * This is the actual command line argument itself + */ + String value() default ""; + + /** + * If this is true, then the argument must be set or the parse will fail + */ + boolean required() default false; + + /** + * This is the prefix expected for the argument + */ + String prefix() default "-"; + + /** + * Each argument can have an alias + */ + String alias() default ""; + + /** + * A description of the argument that will appear in the usage method + */ + String description() default ""; + + /** + * A delimiter for arguments that are multi-valued. 
+ */ + String delimiter() default ","; +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/cli/PropertiesArgs.java b/src/main/java/com/hankcs/hanlp/model/perceptron/cli/PropertiesArgs.java new file mode 100644 index 000000000..5dfcfe6b7 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/cli/PropertiesArgs.java @@ -0,0 +1,118 @@ +package com.hankcs.hanlp.model.perceptron.cli; + +import java.beans.BeanInfo; +import java.beans.IntrospectionException; +import java.beans.Introspector; +import java.beans.PropertyDescriptor; +import java.lang.reflect.Field; +import java.lang.reflect.Method; +import java.util.Properties; + +/** + * 解析命令行 + */ +public class PropertiesArgs +{ + /** + * Parse properties instead of String arguments. Any additional arguments need to be passed some other way. + * This is often used in a second pass when the property filename is passed on the command line. Because of + * required properties you must be careful to set them all in the property file. 
+ * + * @param target Either an instance or a class + * @param arguments The properties that contain the arguments + */ + public static void parse(Object target, Properties arguments) + { + Class clazz; + if (target instanceof Class) + { + clazz = (Class) target; + } + else + { + clazz = target.getClass(); + } + for (Field field : clazz.getDeclaredFields()) + { + processField(target, field, arguments); + } + try + { + BeanInfo info = Introspector.getBeanInfo(clazz); + for (PropertyDescriptor pd : info.getPropertyDescriptors()) + { + processProperty(target, pd, arguments); + } + } + catch (IntrospectionException e) + { + // If it's not a JavaBean we ignore it + } + } + + private static void processField(Object target, Field field, Properties arguments) + { + Argument argument = field.getAnnotation(Argument.class); + if (argument != null) + { + String name = Args.getName(argument, field); + String alias = Args.getAlias(argument); + Class type = field.getType(); + Object value = arguments.get(name); + if (value == null && alias != null) + { + value = arguments.get(alias); + } + if (value != null) + { + if (type == Boolean.TYPE || type == Boolean.class) + { + value = true; + } + Args.setField(type, field, target, value, argument.delimiter()); + } + else + { + if (argument.required()) + { + throw new IllegalArgumentException("You must set argument " + name); + } + } + } + } + + private static void processProperty(Object target, PropertyDescriptor property, Properties arguments) + { + Method writeMethod = property.getWriteMethod(); + if (writeMethod != null) + { + Argument argument = writeMethod.getAnnotation(Argument.class); + if (argument != null) + { + String name = Args.getName(argument, property); + String alias = Args.getAlias(argument); + Object value = arguments.get(name); + if (value == null && alias != null) + { + value = arguments.get(alias); + } + if (value != null) + { + Class type = property.getPropertyType(); + if (type == Boolean.TYPE || type == 
Boolean.class) + { + value = true; + } + Args.setProperty(type, property, target, value, argument.delimiter()); + } + else + { + if (argument.required()) + { + throw new IllegalArgumentException("You must set argument " + name); + } + } + } + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/common/FrequencyMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/common/FrequencyMap.java new file mode 100644 index 000000000..7aa47645b --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/common/FrequencyMap.java @@ -0,0 +1,38 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM7:41 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.common; + +import java.util.TreeMap; + +/** + * @author hankcs + */ +public class FrequencyMap extends TreeMap<String, Integer> +{ + public int totalFrequency; + + public int add(String word) + { + ++totalFrequency; + Integer frequency = get(word); + if (frequency == null) + { + put(word, 1); + return 1; + } + else + { + put(word, ++frequency); + return frequency; + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/common/IIdStringMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/common/IIdStringMap.java new file mode 100644 index 000000000..6d1432802 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/common/IIdStringMap.java @@ -0,0 +1,21 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM4:36 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.common; + +/** + * 从id到label的映射 + * @author hankcs + */ +public interface IIdStringMap +{ + String stringOf(int id); +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/common/IStringIdMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/common/IStringIdMap.java new file mode 100644 index 000000000..d5c3492ba --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/common/IStringIdMap.java @@ -0,0 +1,18 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM4:39 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ + +package com.hankcs.hanlp.model.perceptron.common; + +public interface IStringIdMap +{ + int idOf(String string); +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/common/TaskType.java b/src/main/java/com/hankcs/hanlp/model/perceptron/common/TaskType.java new file mode 100644 index 000000000..990f65d24 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/common/TaskType.java @@ -0,0 +1,19 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-26 下午5:22 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.common; + +/** + * @author hankcs + */ +public enum TaskType +{ + CWS, POS, NER, CLASSIFICATION +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/feature/FeatureMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/FeatureMap.java new file mode 100644 index 000000000..7c1bbba07 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/FeatureMap.java @@ -0,0 +1,117 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM5:23 + * + * + * Copyright (c) 2008-2016, 码农场. 
All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.feature; + +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.ICacheAble; +import com.hankcs.hanlp.model.perceptron.common.IStringIdMap; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.model.perceptron.tagset.CWSTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.POSTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; + +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Map; +import java.util.Set; + +/** + * @author hankcs + */ +public abstract class FeatureMap implements IStringIdMap, ICacheAble +{ + public abstract int size(); + + public int[] allLabels() + { + return tagSet.allTags(); + } + + public int bosTag() + { + return tagSet.size(); + } + + public TagSet tagSet; + /** + * 是否允许新增特征 + */ + public boolean mutable; + + public FeatureMap(TagSet tagSet) + { + this(tagSet, false); + } + + public FeatureMap(TagSet tagSet, boolean mutable) + { + this.tagSet = tagSet; + this.mutable = mutable; + } + + public abstract Set<Map.Entry<String, Integer>> entrySet(); + + public FeatureMap(boolean mutable) + { + this.mutable = mutable; + } + + public FeatureMap() + { + this(false); + } + + @Override + public void save(DataOutputStream out) throws IOException + { + tagSet.save(out); + out.writeInt(size()); + for (Map.Entry<String, Integer> entry : entrySet()) + { + out.writeUTF(entry.getKey()); + } + } + + @Override + public boolean load(ByteArray byteArray) + { + loadTagSet(byteArray); + int size = byteArray.nextInt(); + for (int i = 0; i < size; i++) + { + idOf(byteArray.nextUTF()); + } + return true; + } + + protected final void loadTagSet(ByteArray byteArray) + { + TaskType type = TaskType.values()[byteArray.nextInt()]; + switch (type) + { + case CWS: + tagSet = new 
CWSTagSet(); + break; + case POS: + tagSet = new POSTagSet(); + break; + case NER: + tagSet = new NERTagSet(); + break; + case CLASSIFICATION: + tagSet = new TagSet(TaskType.CLASSIFICATION); + break; + } + tagSet.load(byteArray); + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/feature/FeatureSortItem.java b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/FeatureSortItem.java new file mode 100644 index 000000000..ed43be455 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/FeatureSortItem.java @@ -0,0 +1,34 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-10 PM7:30 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.feature; + +import java.util.Map; + +/** + * @author hankcs + */ +public class FeatureSortItem +{ + public String key; + public Integer id; + public float total; + + public FeatureSortItem(Map.Entry<String, Integer> entry, float[] parameter, int tagSetSize) + { + key = entry.getKey(); + id = entry.getValue(); + for (int i = 0; i < tagSetSize; ++i) + { + total += Math.abs(parameter[id * tagSetSize + i]); + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureDatMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureDatMap.java new file mode 100644 index 000000000..7e3ffb261 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureDatMap.java @@ -0,0 +1,52 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-05 PM8:19 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.feature; + +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; + +import java.util.Map; +import java.util.Set; +import java.util.TreeMap; + +/** + * @author hankcs + */ +public class ImmutableFeatureDatMap extends FeatureMap +{ + DoubleArrayTrie<Integer> dat; + + public ImmutableFeatureDatMap(TreeMap<String, Integer> featureIdMap, TagSet tagSet) + { + super(tagSet); + dat = new DoubleArrayTrie<Integer>(); + dat.build(featureIdMap); + } + + @Override + public int idOf(String string) + { + return dat.exactMatchSearch(string); + } + + @Override + public int size() + { + return dat.size(); + } + + @Override + public Set<Map.Entry<String, Integer>> entrySet() + { + throw new UnsupportedOperationException("这份DAT实现不支持遍历"); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMDatMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMDatMap.java new file mode 100644 index 000000000..627789f8c --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMDatMap.java @@ -0,0 +1,95 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-11-18 下午8:57 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.feature; + +import com.hankcs.hanlp.collection.trie.datrie.MutableDoubleArrayTrieInteger; +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; + +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.Map; +import java.util.Set; + +/** + * 用MutableDoubleArrayTrie实现的ImmutableFeatureMap + * @author hankcs + */ +public class ImmutableFeatureMDatMap extends FeatureMap +{ + MutableDoubleArrayTrieInteger dat; + + public ImmutableFeatureMDatMap() + { + super(); + dat = new MutableDoubleArrayTrieInteger(); + } + + public ImmutableFeatureMDatMap(TagSet tagSet) + { + super(tagSet); + dat = new MutableDoubleArrayTrieInteger(); + } + + public ImmutableFeatureMDatMap(MutableDoubleArrayTrieInteger dat, TagSet tagSet) + { + super(tagSet); + this.dat = dat; + } + + public ImmutableFeatureMDatMap(Map<String, Integer> featureIdMap, TagSet tagSet) + { + super(tagSet); + dat = new MutableDoubleArrayTrieInteger(featureIdMap); + } + + public ImmutableFeatureMDatMap(Set<Map.Entry<String, Integer>> featureIdSet, TagSet tagSet) + { + super(tagSet); + dat = new MutableDoubleArrayTrieInteger(); + for (Map.Entry<String, Integer> entry : featureIdSet) + { + dat.put(entry.getKey(), entry.getValue()); + } + } + + @Override + public int idOf(String string) + { + return dat.get(string); + } + + @Override + public int size() + { + return dat.size(); + } + + @Override + public Set<Map.Entry<String, Integer>> entrySet() + { + return dat.entrySet(); + } + + @Override + public void save(DataOutputStream out) throws IOException + { + tagSet.save(out); + dat.save(out); + } + + @Override + public boolean load(ByteArray byteArray) + { + loadTagSet(byteArray); + return dat.load(byteArray); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMap.java new file mode 100644 index 000000000..1c4d050a2 --- /dev/null +++ 
b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMap.java @@ -0,0 +1,62 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-05 PM8:39 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.feature; + +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; + +import java.util.HashMap; +import java.util.Map; +import java.util.Set; + +/** + * @author hankcs + */ +public class ImmutableFeatureMap extends FeatureMap +{ + public Map<String, Integer> featureIdMap; + + public ImmutableFeatureMap(Map<String, Integer> featureIdMap, TagSet tagSet) + { + super(tagSet); + this.featureIdMap = featureIdMap; + } + + public ImmutableFeatureMap(Set<Map.Entry<String, Integer>> entrySet, TagSet tagSet) + { + super(tagSet); + this.featureIdMap = new HashMap<String, Integer>(); + for (Map.Entry<String, Integer> entry : entrySet) + { + featureIdMap.put(entry.getKey(), entry.getValue()); + } + } + + @Override + public int idOf(String string) + { + Integer id = featureIdMap.get(string); + if (id == null) return -1; + return id; + } + + @Override + public int size() + { + return featureIdMap.size(); + } + + @Override + public Set<Map.Entry<String, Integer>> entrySet() + { + return featureIdMap.entrySet(); + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/feature/LockableFeatureMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/LockableFeatureMap.java new file mode 100644 index 000000000..4037fa048 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/LockableFeatureMap.java @@ -0,0 +1,38 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-21 9:04 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.feature; + +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; + +/** + * 可切换锁定状态的特征id映射 + * + * @author hankcs + */ +public class LockableFeatureMap extends ImmutableFeatureMDatMap +{ + public LockableFeatureMap(TagSet tagSet) + { + super(tagSet); + } + + @Override + public int idOf(String string) + { + int id = super.idOf(string); // 查询id + if (id == -1 && mutable) // 如果不存在该key且处于可写状态 + { + id = dat.size(); + dat.put(string, id); // 则为key分配新id + } + return id; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/feature/MutableFeatureMap.java b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/MutableFeatureMap.java new file mode 100644 index 000000000..496093692 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/feature/MutableFeatureMap.java @@ -0,0 +1,94 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM5:24 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.feature; + +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; + +import java.util.Map; +import java.util.Set; +import java.util.TreeMap; + +/** + * @author hankcs + */ +public class MutableFeatureMap extends FeatureMap +{ + public Map<String, Integer> featureIdMap; + // TreeMap 5136 + // Bin 2712 + // DAT minutes + // trie4j 3411 + + public MutableFeatureMap(TagSet tagSet) + { + super(tagSet, true); + featureIdMap = new TreeMap<String, Integer>(); + addTransitionFeatures(tagSet); + } + + private void addTransitionFeatures(TagSet tagSet) + { + for (int i = 0; i < tagSet.size(); i++) + { + idOf("BL=" + tagSet.stringOf(i)); + } + idOf("BL=_BL_"); + } + + public MutableFeatureMap(TagSet tagSet, Map<String, Integer> featureIdMap) + { + super(tagSet); + this.featureIdMap = featureIdMap; + addTransitionFeatures(tagSet); + } + + @Override + public Set<Map.Entry<String, Integer>> entrySet() + { + return featureIdMap.entrySet(); + } + + @Override + public int idOf(String string) + { + Integer id = featureIdMap.get(string); + if (id == null) + { + id = featureIdMap.size(); + featureIdMap.put(string, id); + } + + return id; + } + + public int size() + { + return featureIdMap.size(); + } + + public Set<String> featureSet() + { + return featureIdMap.keySet(); + } + + @Override + public int[] allLabels() + { + return tagSet.allTags(); + } + + @Override + public int bosTag() + { + return tagSet.size(); + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/instance/CWSInstance.java b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/CWSInstance.java new file mode 100644 index 000000000..2d18f43e5 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/CWSInstance.java @@ -0,0 +1,229 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-26 下午9:21 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.instance; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.feature.MutableFeatureMap; +import com.hankcs.hanlp.model.perceptron.tagset.CWSTagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; + +import java.util.LinkedList; +import java.util.List; + +/** + * @author hankcs + */ +public class CWSInstance extends Instance +{ + private static final char CHAR_BEGIN = '\u0001'; + private static final char CHAR_END = '\u0002'; + + /** + * 生成分词实例 + * + * @param termArray 分词序列 + * @param featureMap 特征收集 + */ + public CWSInstance(String[] termArray, FeatureMap featureMap) + { + String sentence = com.hankcs.hanlp.utility.TextUtility.combine(termArray); + CWSTagSet tagSet = (CWSTagSet) featureMap.tagSet; + + tagArray = new int[sentence.length()]; + for (int i = 0, j = 0; i < termArray.length; i++) + { + assert termArray[i].length() > 0 : "句子中出现了长度为0的单词,不合法:" + sentence; + if (termArray[i].length() == 1) + tagArray[j++] = tagSet.S; + else + { + tagArray[j++] = tagSet.B; + for (int k = 1; k < termArray[i].length() - 1; k++) + tagArray[j++] = tagSet.M; + tagArray[j++] = tagSet.E; + } + } + + initFeatureMatrix(sentence, featureMap); + } + + public CWSInstance(String sentence, FeatureMap featureMap) + { + initFeatureMatrix(sentence, featureMap); + tagArray = new int[sentence.length()]; + } + + protected int[] extractFeature(String sentence, FeatureMap featureMap, int position) + { + List<Integer> featureVec = new LinkedList<Integer>(); + + char pre2Char = position >= 2 ? sentence.charAt(position - 2) : CHAR_BEGIN; + char preChar = position >= 1 ? sentence.charAt(position - 1) : CHAR_BEGIN; + char curChar = sentence.charAt(position); + char nextChar = position < sentence.length() - 1 ? 
sentence.charAt(position + 1) : CHAR_END; + char next2Char = position < sentence.length() - 2 ? sentence.charAt(position + 2) : CHAR_END; + + StringBuilder sbFeature = new StringBuilder(); + //char unigram feature +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("U[-2,0]=").append(pre2Char); +// addFeature(sbFeature, featureVec, featureMap); + + sbFeature.delete(0, sbFeature.length()); + sbFeature.append(preChar).append('1'); + addFeature(sbFeature, featureVec, featureMap); + + sbFeature.delete(0, sbFeature.length()); + sbFeature.append(curChar).append('2'); + addFeature(sbFeature, featureVec, featureMap); + + sbFeature.delete(0, sbFeature.length()); + sbFeature.append(nextChar).append('3'); + addFeature(sbFeature, featureVec, featureMap); + +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("U[2,0]=").append(next2Char); +// addFeature(sbFeature, featureVec, featureMap); + + //char bigram feature + sbFeature.delete(0, sbFeature.length()); + sbFeature.append(pre2Char).append("/").append(preChar).append('4'); + addFeature(sbFeature, featureVec, featureMap); + + sbFeature.delete(0, sbFeature.length()); + sbFeature.append(preChar).append("/").append(curChar).append('5'); + addFeature(sbFeature, featureVec, featureMap); + + sbFeature.delete(0, sbFeature.length()); + sbFeature.append(curChar).append("/").append(nextChar).append('6'); + addFeature(sbFeature, featureVec, featureMap); + + sbFeature.delete(0, sbFeature.length()); + sbFeature.append(nextChar).append("/").append(next2Char).append('7'); + addFeature(sbFeature, featureVec, featureMap); + +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("B[-2,0]=").append(pre2Char).append("/").append(curChar); +// addFeature(sbFeature, featureVec, featureMap); +// +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("B[-1,1]=").append(preChar).append("/").append(nextChar); +// addFeature(sbFeature, featureVec, featureMap); +// +// sbFeature.delete(0, 
sbFeature.length()); +// sbFeature.append("B[0,2]=").append(curChar).append("/").append(next2Char); +// addFeature(sbFeature, featureVec, featureMap); + + //char trigram feature +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("T[-1,0]=").append(preChar).append("/").append(curChar).append("/").append(nextChar); +// addFeature(sbFeature, featureVec, featureMap); + sbFeature = null; + +// if (preChar == curChar) +// addFeature("-1AABBT", featureVec, featureMap); +// if (curChar == nextChar) +// addFeature("0AABBT", featureVec, featureMap); +// +// if (pre2Char == curChar) +// addFeature("-2ABABT", featureVec, featureMap); +// if (preChar == nextChar) +// addFeature("-1ABABT", featureVec, featureMap); +// if (curChar == next2Char) +// addFeature("0ABABT", featureVec, featureMap); + + //char type unigram feature +// addFeature("cT=" + CharType.get(sentence.charAt(position)), featureVec, featureMap); +// +// //char type trigram feature +// StringBuffer trigram = new StringBuffer(); +// +// if (position > 0) +// trigram.append(CharType.get(sentence.charAt(position - 1))); +// else +// trigram.append("_BT_"); +// +// trigram.append("/" + CharType.get(sentence.charAt(position))); +// +// if (position < sentence.length() - 1) +// trigram.append("/" + CharType.get(sentence.charAt(position + 1))); +// else +// trigram.append("/_EL_"); +// +// addFeature("cTT=" + trigram, featureVec, featureMap); + + //dictionary feature +// int[] begin = new int[sentence.length()]; +// int[] middle = new int[sentence.length()]; +// int[] end = new int[sentence.length()]; +// // 查词典 +// for (int i = 0; i < sentence.length(); i++) +// { +// int maxPre = 0; +// int offset = -1; +// int state = 1; +// while (state > 0 && i + ++offset < sentence.length()) +// { +// state = dat.transition(sentence.charAt(i + offset), state); +// if (dat.output(state) != null) +// { +// maxPre = offset + 1; +// } +// } +// +// begin[i] = maxPre; +// +// if (maxPre > 0 && end[i + maxPre - 1] < 
maxPre) +// end[i + maxPre - 1] = maxPre; +// for (int k = i + 1; k < i + maxPre - 1; k++) +// if (middle[k] < maxPre) +// middle[k] = maxPre; +// } +// addFeature("b=" + begin[position], featureVec, featureMap); +// addFeature("m=" + middle[position], featureVec, featureMap); +// addFeature("e=" + end[position], featureVec, featureMap); + + //label bigram feature +// char preLabel = position > 0 ? tagArray[position - 1].toChar() : CHAR_BEGIN; +// +// addFeature("BL=" + preLabel, featureVec, featureMap); // 虽然有preLabel,但并没有加上当前label,当前label是由调用者自行加的 + + return toFeatureArray(featureVec); + } + + protected void initFeatureMatrix(String sentence, FeatureMap featureMap) + { + featureMatrix = new int[sentence.length()][]; + for (int i = 0; i < sentence.length(); i++) + { + featureMatrix[i] = extractFeature(sentence, featureMap, i); + } + } + + public static CWSInstance create(Sentence sentence, FeatureMap featureMap) + { + if (sentence == null || featureMap == null) + { + return null; + } + List wordList = sentence.toSimpleWordList(); + String[] termArray = new String[wordList.size()]; + int i = 0; + for (Word word : wordList) + { + termArray[i] = word.getValue(); + ++i; + } + return new CWSInstance(termArray, featureMap); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/instance/Instance.java b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/Instance.java new file mode 100644 index 000000000..97473bca7 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/Instance.java @@ -0,0 +1,107 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM5:16 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.instance; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; + +import java.util.List; + +/** + * @author hankcs + */ +public class Instance +{ + public int[][] featureMatrix; + public int[] tagArray; + + protected Instance() + { + } + + protected static int[] toFeatureArray(List<Integer> featureVector) + { + int[] featureArray = new int[featureVector.size() + 1]; // 最后一列留给转移特征 + int index = -1; + for (Integer feature : featureVector) + { + featureArray[++index] = feature; + } + + return featureArray; + } + + public int[] getFeatureAt(int position) + { + return featureMatrix[position]; + } + + public int length() + { + return tagArray.length; + } + + protected static void addFeature(CharSequence rawFeature, List<Integer> featureVector, FeatureMap featureMap) + { + int id = featureMap.idOf(rawFeature.toString()); + if (id != -1) + { + featureVector.add(id); + } + } + + /** + * 添加特征,同时清空缓存 + * + * @param rawFeature + * @param featureVector + * @param featureMap + */ + protected static void addFeatureThenClear(StringBuilder rawFeature, List<Integer> featureVector, FeatureMap featureMap) + { + int id = featureMap.idOf(rawFeature.toString()); + if (id != -1) + { + featureVector.add(id); + } + rawFeature.setLength(0); + } + + /** + * 根据标注集还原字符形式的标签 + * + * @param tagSet + * @return + */ + public String[] tags(TagSet tagSet) + { + assert tagArray != null; + + String[] tags = new String[tagArray.length]; + for (int i = 0; i < tags.length; i++) + { + tags[i] = tagSet.stringOf(tagArray[i]); + } + + return tags; + } + + /** + * 实例大小(有多少个要预测的元素) + * + * @return + */ + public int size() + { + return featureMatrix.length; + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/instance/InstanceHandler.java b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/InstanceHandler.java new file mode 100644 index 000000000..22edccf1b --- /dev/null 
+++ b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/InstanceHandler.java @@ -0,0 +1,21 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-15 下午7:21 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.instance; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; + +/** + * @author hankcs + */ +public interface InstanceHandler +{ + boolean process(Sentence instance); +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/instance/NERInstance.java b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/NERInstance.java new file mode 100644 index 000000000..77970eae0 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/NERInstance.java @@ -0,0 +1,126 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-28 14:35 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.instance; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.model.perceptron.utility.Utility; + +import java.util.ArrayList; +import java.util.List; + +/** + * @author hankcs + */ +public class NERInstance extends Instance +{ + public NERInstance(String[] wordArray, String[] posArray, String[] nerArray, NERTagSet tagSet, FeatureMap featureMap) + { + this(wordArray, posArray, featureMap); + + tagArray = new int[wordArray.length]; + for (int i = 0; i < wordArray.length; i++) + { + tagArray[i] = tagSet.add(nerArray[i]); + } + } + + public NERInstance(String[][] tuples, NERTagSet tagSet, FeatureMap featureMap) + { + this(tuples[0], tuples[1], tuples[2], tagSet, featureMap); + } + + public NERInstance(String[] wordArray, String[] posArray, FeatureMap featureMap) + { + initFeatureMatrix(wordArray, posArray, featureMap); + } + + private void initFeatureMatrix(String[] wordArray, String[] posArray, FeatureMap featureMap) + { + featureMatrix = new int[wordArray.length][]; + for (int i = 0; i < featureMatrix.length; i++) + { + featureMatrix[i] = extractFeature(wordArray, posArray, featureMap, i); + } + } + + /** + * 提取特征,override此方法来拓展自己的特征模板 + * + * @param wordArray 词语 + * @param posArray 词性 + * @param featureMap 储存特征的结构 + * @param position 当前提取的词语所在的位置 + * @return 特征向量 + */ + protected int[] extractFeature(String[] wordArray, String[] posArray, FeatureMap featureMap, int position) + { + List<Integer> featVec = new ArrayList<Integer>(); + + String pre2Word = position >= 2 ? wordArray[position - 2] : "_B_"; + String preWord = position >= 1 ? wordArray[position - 1] : "_B_"; + String curWord = wordArray[position]; + String nextWord = position <= wordArray.length - 2 ? wordArray[position + 1] : "_E_"; + String next2Word = position <= wordArray.length - 3 ? 
wordArray[position + 2] : "_E_"; + + String pre2Pos = position >= 2 ? posArray[position - 2] : "_B_"; + String prePos = position >= 1 ? posArray[position - 1] : "_B_"; + String curPos = posArray[position]; + String nextPos = position <= posArray.length - 2 ? posArray[position + 1] : "_E_"; + String next2Pos = position <= posArray.length - 3 ? posArray[position + 2] : "_E_"; + + StringBuilder sb = new StringBuilder(); + addFeatureThenClear(sb.append(pre2Word).append('1'), featVec, featureMap); + addFeatureThenClear(sb.append(preWord).append('2'), featVec, featureMap); + addFeatureThenClear(sb.append(curWord).append('3'), featVec, featureMap); + addFeatureThenClear(sb.append(nextWord).append('4'), featVec, featureMap); + addFeatureThenClear(sb.append(next2Word).append('5'), featVec, featureMap); +// addFeatureThenClear(sb.append(pre2Word).append(preWord).append('6'), featVec, featureMap); +// addFeatureThenClear(sb.append(preWord).append(curWord).append('7'), featVec, featureMap); +// addFeatureThenClear(sb.append(curWord).append(nextWord).append('8'), featVec, featureMap); +// addFeatureThenClear(sb.append(nextWord).append(next2Word).append('9'), featVec, featureMap); + + addFeatureThenClear(sb.append(pre2Pos).append('A'), featVec, featureMap); + addFeatureThenClear(sb.append(prePos).append('B'), featVec, featureMap); + addFeatureThenClear(sb.append(curPos).append('C'), featVec, featureMap); + addFeatureThenClear(sb.append(nextPos).append('D'), featVec, featureMap); + addFeatureThenClear(sb.append(next2Pos).append('E'), featVec, featureMap); + addFeatureThenClear(sb.append(pre2Pos).append(prePos).append('F'), featVec, featureMap); + addFeatureThenClear(sb.append(prePos).append(curPos).append('G'), featVec, featureMap); + addFeatureThenClear(sb.append(curPos).append(nextPos).append('H'), featVec, featureMap); + addFeatureThenClear(sb.append(nextPos).append(next2Pos).append('I'), featVec, featureMap); + + return toFeatureArray(featVec); + } + + public 
NERInstance(String segmentedTaggedNERSentence, FeatureMap featureMap) + { + this(Sentence.create(segmentedTaggedNERSentence), featureMap); + } + + public NERInstance(Sentence sentence, FeatureMap featureMap) + { + this(convertSentenceToArray(sentence, featureMap), (NERTagSet) featureMap.tagSet, featureMap); + } + + private static String[][] convertSentenceToArray(Sentence sentence, FeatureMap featureMap) + { + NERTagSet tagSet = (NERTagSet) featureMap.tagSet; + List collector = Utility.convertSentenceToNER(sentence, tagSet); + String[][] tuples = new String[3][collector.size()]; + String[] wordArray = tuples[0]; + String[] posArray = tuples[1]; + String[] tagArray = tuples[2]; + Utility.reshapeNER(collector, wordArray, posArray, tagArray); + return tuples; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/instance/POSInstance.java b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/POSInstance.java new file mode 100644 index 000000000..85e257496 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/instance/POSInstance.java @@ -0,0 +1,264 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-26 下午9:26 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.instance; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.feature.MutableFeatureMap; +import com.hankcs.hanlp.model.perceptron.tagset.POSTagSet; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; + +import java.util.ArrayList; +import java.util.List; + +/** + * @author hankcs + */ +public class POSInstance extends Instance +{ + /** + * 构建词性标注实例 + * + * @param termArray 词语 + * @param posArray 词性 + */ + public POSInstance(String[] termArray, String[] posArray, FeatureMap featureMap) + { +// String sentence = TextUtility.combine(termArray); + this(termArray, featureMap); + + POSTagSet tagSet = (POSTagSet) featureMap.tagSet; + tagArray = new int[termArray.length]; + for (int i = 0; i < termArray.length; i++) + { + tagArray[i] = tagSet.add(posArray[i]); + } + } + + public POSInstance(String[] termArray, FeatureMap featureMap) + { + initFeatureMatrix(termArray, featureMap); + } + + protected int[] extractFeature(String[] words, FeatureMap featureMap, int position) + { + List<Integer> featVec = new ArrayList<Integer>(); + +// String pre2Word = position >= 2 ? words[position - 2] : "_B_"; + String preWord = position >= 1 ? words[position - 1] : "_B_"; + String curWord = words[position]; + + // System.out.println("cur: " + curWord); + String nextWord = position <= words.length - 2 ? words[position + 1] : "_E_"; +// String next2Word = position <= words.length - 3 ? 
words[position + 2] : "_E_"; + + StringBuilder sbFeature = new StringBuilder(); +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("U[-2,0]=").append(pre2Word); +// addFeature(sbFeature, featVec, featureMap); + + sbFeature.append(preWord).append('1'); + addFeatureThenClear(sbFeature, featVec, featureMap); + + sbFeature.append(curWord).append('2'); + addFeatureThenClear(sbFeature, featVec, featureMap); + + sbFeature.append(nextWord).append('3'); + addFeatureThenClear(sbFeature, featVec, featureMap); + +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("U[2,0]=").append(next2Word); +// addFeature(sbFeature, featVec, featureMap); + + // wiwi+1(i = − 1, 0) +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("B[-1,0]=").append(preWord).append("/").append(curWord); +// addFeature(sbFeature, featVec, featureMap); +// +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("B[0,1]=").append(curWord).append("/").append(nextWord); +// addFeature(sbFeature, featVec, featureMap); +// +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("B[-1,1]=").append(preWord).append("/").append(nextWord); +// addFeature(sbFeature, featVec, featureMap); + + // last char(w−1)w0 +// String lastChar = position >= 1 ? "" + words[position - 1].charAt(words[position - 1].length() - 1) : "_BC_"; +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("CW[-1,0]=").append(lastChar).append("/").append(curWord); +// addFeature(sbFeature, featVec, featureMap); +// +// // w0 first_char(w1) +// String nextChar = position <= words.length - 2 ? 
"" + words[position + 1].charAt(0) : "_EC_"; +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("CW[1,0]=").append(curWord).append("/").append(nextChar); +// addFeature(sbFeature, featVec, featureMap); +// + int length = curWord.length(); +// +// // firstchar(w0)lastchar(w0) +// sbFeature.delete(0, sbFeature.length()); +// sbFeature.append("BE=").append(curWord.charAt(0)).append("/").append(curWord.charAt(length - 1)); +// addFeature(sbFeature, featVec, featureMap); + + // prefix + sbFeature.append(curWord.substring(0, 1)).append('4'); + addFeatureThenClear(sbFeature, featVec, featureMap); + + if (length > 1) + { + sbFeature.append(curWord.substring(0, 2)).append('4'); + addFeatureThenClear(sbFeature, featVec, featureMap); + } + + if (length > 2) + { + sbFeature.append(curWord.substring(0, 3)).append('4'); + addFeatureThenClear(sbFeature, featVec, featureMap); + } + + // suffix(w0, i)(i = 1, 2, 3) + sbFeature.append(curWord.charAt(length - 1)).append('5'); + addFeatureThenClear(sbFeature, featVec, featureMap); + + if (length > 1) + { + sbFeature.append(curWord.substring(length - 2)).append('5'); + addFeatureThenClear(sbFeature, featVec, featureMap); + } + + if (length > 2) + { + sbFeature.append(curWord.substring(length - 3)).append('5'); + addFeatureThenClear(sbFeature, featVec, featureMap); + } + + // length +// if (length >= 5) +// { +// addFeature("le=" + 5, featVec, featureMap); +// } +// else +// { +// addFeature("le=" + length, featVec, featureMap); +// } + + // label feature +// String preLabel; +// if (position >= 1) +// { +// preLabel = label[position - 1]; +// } +// else +// { +// preLabel = "_BL_"; +// } +// +// addFeature("BL=" + preLabel, featVec, featureMap); + +// for (int i = 0; i < curWord.length(); i++) +// { +// String prefix = curWord.substring(0, 1) + curWord.charAt(i) + ""; +// addFeature("p2f=" + prefix, featVec, featureMap); +// String suffix = curWord.substring(curWord.length() - 1) + curWord.charAt(i) + ""; +// 
addFeature("s2f=" + suffix, featVec, featureMap); + +// if ((i < curWord.length() - 1) && (curWord.charAt(i) == curWord.charAt(i + 1))) +// { +// addFeature("dulC=" + curWord.substring(i, i + 1), featVec, featureMap); +// } +// if ((i < curWord.length() - 2) && (curWord.charAt(i) == curWord.charAt(i + 2))) +// { +// addFeature("dul2C=" + curWord.substring(i, i + 1), featVec, featureMap); +// } +// } + +// boolean isDigit = true; +// for (int i = 0; i < curWord.length(); i++) +// { +// if (CharType.get(curWord.charAt(i)) != CharType.CT_NUM) +// { +// isDigit = false; +// break; +// } +// } +// if (isDigit) +// { +// addFeature("wT=d", featVec, featureMap); +// } + +// boolean isPunt = true; +// for (int i = 0; i < curWord.length(); i++) +// { +// if (!CharType.punctSet.contains(curWord.charAt(i) + "")) +// { +// isPunt = false; +// break; +// } +// } +// if (isPunt) +// { +// featVec.add("wT=p"); +// } + +// boolean isLetter = true; +// for (int i = 0; i < curWord.length(); i++) +// { +// if (CharType.get(curWord.charAt(i)) != CharType.CT_LETTER) +// { +// isLetter = false; +// break; +// } +// } +// if (isLetter) +// { +// addFeature("wT=l", featVec, featureMap); +// } +// sbFeature = null; + + return toFeatureArray(featVec); + } + + private void initFeatureMatrix(String[] termArray, FeatureMap featureMap) + { + featureMatrix = new int[termArray.length][]; + for (int i = 0; i < featureMatrix.length; i++) + { + featureMatrix[i] = extractFeature(termArray, featureMap, i); + } + } + + public static POSInstance create(String segmentedTaggedSentence, FeatureMap featureMap) + { + return create(Sentence.create(segmentedTaggedSentence), featureMap); + } + + public static POSInstance create(Sentence sentence, FeatureMap featureMap) + { + if (sentence == null || featureMap == null) + { + return null; + } + List<Word> wordList = sentence.toSimpleWordList(); + String[] termArray = new String[wordList.size()]; + String[] posArray = new String[wordList.size()]; + int i = 0; + for 
(Word word : wordList) + { + termArray[i] = word.getValue(); + posArray[i] = word.getLabel(); + ++i; + } + return new POSInstance(termArray, posArray, featureMap); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/model/AveragedPerceptron.java b/src/main/java/com/hankcs/hanlp/model/perceptron/model/AveragedPerceptron.java new file mode 100644 index 000000000..b223f0ea6 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/model/AveragedPerceptron.java @@ -0,0 +1,99 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM4:45 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.model; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; + +import java.util.Collection; + +/** + * 平均感知机算法学习的线性模型 + * + * @author hankcs + */ +public class AveragedPerceptron extends LinearModel +{ + public AveragedPerceptron(FeatureMap featureMap, float[] parameter) + { + super(featureMap, parameter); + } + + public AveragedPerceptron(FeatureMap featureMap) + { + super(featureMap); + } + + /** + * 根据答案和预测更新参数 + * + * @param goldIndex 答案的特征函数(非压缩形式) + * @param predictIndex 预测的特征函数(非压缩形式) + */ + public void update(int[] goldIndex, int[] predictIndex, double[] total, int[] timestamp, int current) + { + for (int i = 0; i < goldIndex.length; ++i) + { + if (goldIndex[i] == predictIndex[i]) + continue; + else + { + update(goldIndex[i], 1, total, timestamp, current); + if (predictIndex[i] >= 0 && predictIndex[i] < parameter.length) + update(predictIndex[i], -1, total, timestamp, current); + else + { + throw new IllegalArgumentException("更新参数时传入了非法的下标"); + } + } + } + } + + /** + * 根据特征向量更新参数 + * + * @param featureVector 特征向量 + * @param value 更新量 + * @param total 权值向量总和 + * @param timestamp 每个权值上次更新的时间戳 + * @param current 当前时间戳 + */ + public void update(Collection<Integer> featureVector, float value, 
double[] total, int[] timestamp, int current) + { + for (Integer i : featureVector) + update(i, value, total, timestamp, current); + } + + /** + * 根据答案和预测更新参数 + * + * @param index 特征向量的下标 + * @param value 更新量 + * @param total 权值向量总和 + * @param timestamp 每个权值上次更新的时间戳 + * @param current 当前时间戳 + */ + private void update(int index, float value, double[] total, int[] timestamp, int current) + { + int passed = current - timestamp[index]; + total[index] += passed * parameter[index]; + parameter[index] += value; + timestamp[index] = current; + } + + public void average(double[] total, int[] timestamp, int current) + { + for (int i = 0; i < parameter.length; i++) + { + parameter[i] = (float) ((total[i] + (current - timestamp[i]) * parameter[i]) / current); + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/model/LinearModel.java b/src/main/java/com/hankcs/hanlp/model/perceptron/model/LinearModel.java new file mode 100644 index 000000000..7a0356f35 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/model/LinearModel.java @@ -0,0 +1,454 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM10:29 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.model; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.algorithm.MaxHeap; +import com.hankcs.hanlp.collection.trie.datrie.MutableDoubleArrayTrieInteger; +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.ByteArrayStream; +import com.hankcs.hanlp.corpus.io.ICacheAble; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.perceptron.common.TaskType; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.feature.FeatureSortItem; +import com.hankcs.hanlp.model.perceptron.feature.ImmutableFeatureMDatMap; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.utility.MathUtility; + +import java.io.*; +import java.util.*; + +import static com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; + +/** + * 在线学习标注模型 + * + * @author hankcs + */ +public class LinearModel implements ICacheAble +{ + /** + * 特征函数 + */ + public FeatureMap featureMap; + /** + * 特征权重 + */ + public float[] parameter; + + + public LinearModel(FeatureMap featureMap, float[] parameter) + { + this.featureMap = featureMap; + this.parameter = parameter; + } + + public LinearModel(FeatureMap featureMap) + { + this.featureMap = featureMap; + parameter = new float[featureMap.size() * featureMap.tagSet.size()]; + } + + public LinearModel(String modelFile) throws IOException + { + load(modelFile); + } + + /** + * 模型压缩 + * @param ratio 压缩比c(压缩掉的体积,压缩后体积变为1-c) + * @return + */ + public LinearModel compress(final double ratio) + { + return compress(ratio, 1e-3f); + } + + /** + * @param ratio 压缩比c(压缩掉的体积,压缩后体积变为1-c) + * @param threshold 特征权重绝对值之和最低阈值 + * @return + */ + public LinearModel compress(final double ratio, final double threshold) + { + if (ratio < 0 || ratio >= 1) + { + throw new IllegalArgumentException("压缩比必须介于 0 和 1 之间"); + } + if (ratio == 0) 
return this; + Set<Map.Entry<String, Integer>> featureIdSet = featureMap.entrySet(); + TagSet tagSet = featureMap.tagSet; + MaxHeap<FeatureSortItem> heap = new MaxHeap<FeatureSortItem>((int) ((featureIdSet.size() - tagSet.sizeIncludingBos()) * (1.0f - ratio)), new Comparator<FeatureSortItem>() + { + @Override + public int compare(FeatureSortItem o1, FeatureSortItem o2) + { + return Float.compare(o1.total, o2.total); + } + }); + + logger.start("裁剪特征...\n"); + int logEvery = (int) Math.ceil(featureMap.size() / 10000f); + int n = 0; + for (Map.Entry<String, Integer> entry : featureIdSet) + { + if (++n % logEvery == 0 || n == featureMap.size()) + { + logger.out("\r%.2f%% ", MathUtility.percentage(n, featureMap.size())); + } + if (entry.getValue() < tagSet.sizeIncludingBos()) + { + continue; + } + FeatureSortItem item = new FeatureSortItem(entry, this.parameter, tagSet.size()); + if (item.total < threshold) continue; + heap.add(item); + } + logger.finish("\n裁剪完毕\n"); + + int size = heap.size() + tagSet.sizeIncludingBos(); + float[] parameter = new float[size * tagSet.size()]; + MutableDoubleArrayTrieInteger mdat = new MutableDoubleArrayTrieInteger(); + for (Map.Entry<String, Integer> tag : tagSet) + { + mdat.add("BL=" + tag.getKey()); + } + mdat.add("BL=_BL_"); + for (int i = 0; i < tagSet.size() * tagSet.sizeIncludingBos(); i++) + { + parameter[i] = this.parameter[i]; + } + logger.start("构建双数组trie树...\n"); + logEvery = (int) Math.ceil(heap.size() / 10000f); + n = 0; + for (FeatureSortItem item : heap) + { + if (++n % logEvery == 0 || n == heap.size()) + { + logger.out("\r%.2f%% ", MathUtility.percentage(n, heap.size())); + } + int id = mdat.size(); + mdat.put(item.key, id); + for (int i = 0; i < tagSet.size(); ++i) + { + parameter[id * tagSet.size() + i] = this.parameter[item.id * tagSet.size() + i]; + } + } + logger.finish("\n构建完毕\n"); + this.featureMap = new ImmutableFeatureMDatMap(mdat, tagSet); + this.parameter = parameter; + return this; + } + + /** + * 保存到路径 + * + * @param modelFile + * @throws IOException + */ + public void save(String modelFile) throws IOException + { + 
DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(modelFile))); + save(out); + out.close(); + } + + /** + * 压缩并保存 + * + * @param modelFile 路径 + * @param ratio 压缩比c(压缩掉的体积,压缩后体积变为1-c) + * @throws IOException + */ + public void save(String modelFile, final double ratio) throws IOException + { + save(modelFile, featureMap.entrySet(), ratio); + } + + public void save(String modelFile, Set<Map.Entry<String, Integer>> featureIdSet, final double ratio) throws IOException + { + save(modelFile, featureIdSet, ratio, false); + } + + /** + * 保存 + * + * @param modelFile 路径 + * @param featureIdSet 特征集(有些数据结构不支持遍历,可以提供构造时用到的特征集来规避这个缺陷) + * @param ratio 压缩比 + * @param text 是否输出文本以供调试 + * @throws IOException + */ + public void save(String modelFile, Set<Map.Entry<String, Integer>> featureIdSet, final double ratio, boolean text) throws IOException + { + float[] parameter = this.parameter; + this.compress(ratio, 1e-3f); + + DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(modelFile))); + save(out); + out.close(); + + if (text) + { + BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(modelFile + ".txt"), "UTF-8")); + TagSet tagSet = featureMap.tagSet; + for (Map.Entry<String, Integer> entry : featureIdSet) + { + bw.write(entry.getKey()); + if (featureIdSet.size() == parameter.length) + { + bw.write("\t"); + bw.write(String.valueOf(parameter[entry.getValue()])); + } + else + { + for (int i = 0; i < tagSet.size(); ++i) + { + bw.write("\t"); + bw.write(String.valueOf(parameter[entry.getValue() * tagSet.size() + i])); + } + } + bw.newLine(); + } + bw.close(); + } + } + + /** + * 参数更新 + * + * @param x 特征向量 + * @param y 正确答案 + */ + public void update(Collection<Integer> x, int y) + { + assert y == 1 || y == -1 : "感知机的标签y必须是±1"; + for (Integer f : x) + parameter[f] += y; + } + + /** + * 分离超平面解码 + * + * @param x 特征向量 + * @return sign(wx) + */ + public int decode(Collection<Integer> x) + { + float y = 0; + for (Integer f : x) + y += parameter[f]; + return y < 0 ? 
-1 : 1; + } + + /** + * 维特比解码 + * + * @param instance 实例 + * @return + */ + public double viterbiDecode(Instance instance) + { + return viterbiDecode(instance, instance.tagArray); + } + + /** + * 维特比解码 + * + * @param instance 实例 + * @param guessLabel 输出标签 + * @return + */ + public double viterbiDecode(Instance instance, int[] guessLabel) + { + final int[] allLabel = featureMap.allLabels(); + final int bos = featureMap.bosTag(); + final int sentenceLength = instance.tagArray.length; + final int labelSize = allLabel.length; + + int[][] preMatrix = new int[sentenceLength][labelSize]; + double[][] scoreMatrix = new double[2][labelSize]; + + for (int i = 0; i < sentenceLength; i++) + { + int _i = i & 1; + int _i_1 = 1 - _i; + int[] allFeature = instance.getFeatureAt(i); + final int transitionFeatureIndex = allFeature.length - 1; + if (0 == i) + { + allFeature[transitionFeatureIndex] = bos; + for (int j = 0; j < allLabel.length; j++) + { + preMatrix[0][j] = j; + + double score = score(allFeature, j); + + scoreMatrix[0][j] = score; + } + } + else + { + for (int curLabel = 0; curLabel < allLabel.length; curLabel++) + { + + double maxScore = Integer.MIN_VALUE; + + for (int preLabel = 0; preLabel < allLabel.length; preLabel++) + { + + allFeature[transitionFeatureIndex] = preLabel; + double score = score(allFeature, curLabel); + + double curScore = scoreMatrix[_i_1][preLabel] + score; + + if (maxScore < curScore) + { + maxScore = curScore; + preMatrix[i][curLabel] = preLabel; + scoreMatrix[_i][curLabel] = maxScore; + } + } + } + + } + } + + int maxIndex = 0; + double maxScore = scoreMatrix[(sentenceLength - 1) & 1][0]; + + for (int index = 1; index < allLabel.length; index++) + { + if (maxScore < scoreMatrix[(sentenceLength - 1) & 1][index]) + { + maxIndex = index; + maxScore = scoreMatrix[(sentenceLength - 1) & 1][index]; + } + } + + for (int i = sentenceLength - 1; i >= 0; --i) + { + guessLabel[i] = allLabel[maxIndex]; + maxIndex = preMatrix[i][maxIndex]; + } + + return 
maxScore; + } + + /** + * 通过命中的特征函数计算得分 + * + * @param featureVector 压缩形式的特征id构成的特征向量 + * @return + */ + public double score(int[] featureVector, int currentTag) + { + double score = 0; + for (int index : featureVector) + { + if (index == -1) + { + continue; + } + else if (index < -1 || index >= featureMap.size()) + { + throw new IllegalArgumentException("在打分时传入了非法的下标"); + } + else + { + index = index * featureMap.tagSet.size() + currentTag; + score += parameter[index]; // 其实就是特征权重的累加 + } + } + return score; + } + + /** + * 加载模型 + * + * @param modelFile + * @throws IOException + */ + public void load(String modelFile) throws IOException + { + if (HanLP.Config.DEBUG) + logger.start("加载 %s ... ", modelFile); + ByteArrayStream byteArray = ByteArrayStream.createByteArrayStream(modelFile); + if (!load(byteArray)) + { + throw new IOException(String.format("%s 加载失败", modelFile)); + } + if (HanLP.Config.DEBUG) + logger.finish(" 加载完毕\n"); + } + + public TagSet tagSet() + { + return featureMap.tagSet; + } + + @Override + public void save(DataOutputStream out) throws IOException + { + if (!(featureMap instanceof ImmutableFeatureMDatMap)) + { + featureMap = new ImmutableFeatureMDatMap(featureMap.entrySet(), tagSet()); + } + featureMap.save(out); + for (float aParameter : this.parameter) + { + out.writeFloat(aParameter); + } + } + + @Override + public boolean load(ByteArray byteArray) + { + if (byteArray == null) + return false; + featureMap = new ImmutableFeatureMDatMap(); + featureMap.load(byteArray); + int size = featureMap.size(); + TagSet tagSet = featureMap.tagSet; + if (tagSet.type == TaskType.CLASSIFICATION) + { + parameter = new float[size]; + for (int i = 0; i < size; i++) + { + parameter[i] = byteArray.nextFloat(); + } + } + else + { + parameter = new float[size * tagSet.size()]; + for (int i = 0; i < size; i++) + { + for (int j = 0; j < tagSet.size(); ++j) + { + parameter[i * tagSet.size() + j] = byteArray.nextFloat(); + } + } + } +// assert !byteArray.hasMore(); 
+// byteArray.close(); + if (!byteArray.hasMore()) + byteArray.close(); + return true; + } + + public TaskType taskType() + { + return featureMap.tagSet.type; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/model/StructuredPerceptron.java b/src/main/java/com/hankcs/hanlp/model/perceptron/model/StructuredPerceptron.java new file mode 100644 index 000000000..f00333ab8 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/model/StructuredPerceptron.java @@ -0,0 +1,85 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-05 PM11:07 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.model; + +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; +import com.hankcs.hanlp.model.perceptron.instance.Instance; + +/** + * 结构化感知机算法学习的线性模型 + * + * @author hankcs + */ +public class StructuredPerceptron extends LinearModel +{ + public StructuredPerceptron(FeatureMap featureMap, float[] parameter) + { + super(featureMap, parameter); + } + + public StructuredPerceptron(FeatureMap featureMap) + { + super(featureMap); + } + + /** + * 根据答案和预测更新参数 + * + * @param goldIndex 答案的特征函数(非压缩形式) + * @param predictIndex 预测的特征函数(非压缩形式) + */ + public void update(int[] goldIndex, int[] predictIndex) + { + for (int i = 0; i < goldIndex.length; ++i) + { + if (goldIndex[i] == predictIndex[i]) + continue; + else // 预测与答案不一致 + { + parameter[goldIndex[i]]++; // 奖励正确的特征函数(将它的权值加一) + if (predictIndex[i] >= 0 && predictIndex[i] < parameter.length) + parameter[predictIndex[i]]--; // 惩罚招致错误的特征函数(将它的权值减一) + else + { + throw new IllegalArgumentException("更新参数时传入了非法的下标"); + } + } + } + } + + /** + * 在线学习 + * + * @param instance 样本 + */ + public void update(Instance instance) + { + int[] guessLabel = new int[instance.length()]; + 
viterbiDecode(instance, guessLabel); + TagSet tagSet = featureMap.tagSet; + for (int i = 0; i < instance.length(); i++) + { + int[] featureVector = instance.getFeatureAt(i); + int[] goldFeature = new int[featureVector.length]; // 根据答案应当被激活的特征 + int[] predFeature = new int[featureVector.length]; // 实际预测时激活的特征 + for (int j = 0; j < featureVector.length - 1; j++) + { + goldFeature[j] = featureVector[j] * tagSet.size() + instance.tagArray[i]; + predFeature[j] = featureVector[j] * tagSet.size() + guessLabel[i]; + } + goldFeature[featureVector.length - 1] = (i == 0 ? tagSet.bosId() : instance.tagArray[i - 1]) * tagSet.size() + instance.tagArray[i]; + predFeature[featureVector.length - 1] = (i == 0 ? tagSet.bosId() : guessLabel[i - 1]) * tagSet.size() + guessLabel[i]; + update(goldFeature, predFeature); + } + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/package-info.java b/src/main/java/com/hankcs/hanlp/model/perceptron/package-info.java new file mode 100644 index 000000000..5b170429e --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/package-info.java @@ -0,0 +1,16 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-02-28 下午9:44 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +/** + * 感知机在线学习算法的线性序列标注模型。基于这套框架实现了一整套分词、词性标注和命名实体识别功能。 + * 理论参考邓知龙 《基于感知器算法的高效中文分词与词性标注系统设计与实现》, + * 简介:http://www.hankcs.com/nlp/segment/implementation-of-word-segmentation-device-java-based-on-structured-average-perceptron.html + */ +package com.hankcs.hanlp.model.perceptron; \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/CWSTagSet.java b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/CWSTagSet.java new file mode 100644 index 000000000..45e361948 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/CWSTagSet.java @@ -0,0 +1,54 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM5:28 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.tagset; + +import com.hankcs.hanlp.model.perceptron.common.TaskType; + +/** + * @author hankcs + */ +public class CWSTagSet extends TagSet +{ + public final int B; + public final int M; + public final int E; + public final int S; + + public CWSTagSet(int b, int m, int e, int s) + { + super(TaskType.CWS); + B = b; + M = m; + E = e; + S = s; + String[] id2tag = new String[4]; + id2tag[b] = "B"; + id2tag[m] = "M"; + id2tag[e] = "E"; + id2tag[s] = "S"; + for (String tag : id2tag) + { + add(tag); + } + lock(); + } + + public CWSTagSet() + { + super(TaskType.CWS); + B = add("B"); + M = add("M"); + E = add("E"); + S = add("S"); + lock(); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/NERTagSet.java b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/NERTagSet.java new file mode 100644 index 000000000..da1695393 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/NERTagSet.java @@ -0,0 +1,88 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-28 11:40 + * + * + * Copyright (c) 2017, 码农场. 
All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.tagset; + +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.model.perceptron.common.TaskType; + +import java.util.Collection; +import java.util.HashSet; +import java.util.Map; +import java.util.Set; + +/** + * @author hankcs + */ +public class NERTagSet extends TagSet +{ + public final String O_TAG = "O"; + public final char O_TAG_CHAR = 'O'; + public final String B_TAG_PREFIX = "B-"; + public final char B_TAG_CHAR = 'B'; + public final String M_TAG_PREFIX = "M-"; + public final String E_TAG_PREFIX = "E-"; + public final String S_TAG = "S"; + public final char S_TAG_CHAR = 'S'; + public final Set<String> nerLabels = new HashSet<String>(); + + /** + * 非NER + */ + public final int O; + + public NERTagSet() + { + super(TaskType.NER); + O = add(O_TAG); + } + + public NERTagSet(int o, Collection<String> tags) + { + super(TaskType.NER); + O = o; + for (String tag : tags) + { + add(tag); + String label = NERTagSet.posOf(tag); + if (label.length() != tag.length()) + nerLabels.add(label); + } + } + + public static String posOf(String tag) + { + int index = tag.indexOf('-'); + if (index == -1) + { + return tag; + } + + return tag.substring(index + 1); + } + + @Override + public boolean load(ByteArray byteArray) + { + super.load(byteArray); + nerLabels.clear(); + for (Map.Entry<String, Integer> entry : this) + { + String tag = entry.getKey(); + int index = tag.indexOf('-'); + if (index != -1) + { + nerLabels.add(tag.substring(index + 1)); + } + } + + return true; + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/POSTagSet.java b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/POSTagSet.java new file mode 100644 index 000000000..d7fc43234 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/POSTagSet.java @@ -0,0 +1,25 @@ +/* + * Hankcs + * me@hankcs.com + * 
2017-10-26 下午5:49 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.tagset; + +import com.hankcs.hanlp.model.perceptron.common.TaskType; + +/** + * 词性标注集 + * @author hankcs + */ +public class POSTagSet extends TagSet +{ + public POSTagSet() + { + super(TaskType.POS); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/TagSet.java b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/TagSet.java new file mode 100644 index 000000000..3c10e1578 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/tagset/TagSet.java @@ -0,0 +1,167 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-26 下午4:40 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.tagset; + +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.ICacheAble; +import com.hankcs.hanlp.model.perceptron.common.IIdStringMap; +import com.hankcs.hanlp.model.perceptron.common.IStringIdMap; +import com.hankcs.hanlp.model.perceptron.common.TaskType; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.util.*; + +/** + * @author hankcs + */ +public class TagSet implements IIdStringMap, IStringIdMap, Iterable<Map.Entry<String, Integer>>, ICacheAble +{ + private Map<String, Integer> stringIdMap; + private ArrayList<String> idStringMap; + private int[] allTags; + public TaskType type; + + public TagSet(TaskType type) + { + stringIdMap = new TreeMap<String, Integer>(); + idStringMap = new ArrayList<String>(); + this.type = type; + } + + public int add(String tag) + { +// assertUnlock(); + Integer id = stringIdMap.get(tag); + if (id == null) + { + id = stringIdMap.size(); + stringIdMap.put(tag, id); + idStringMap.add(tag); + } + + return id; + } + + public 
int size() + { + return stringIdMap.size(); + } + + public int sizeIncludingBos() + { + return size() + 1; + } + + public int bosId() + { + return size(); + } + + public void lock() + { +// assertUnlock(); + allTags = new int[size()]; + for (int i = 0; i < size(); i++) + { + allTags[i] = i; + } + } + +// private void assertUnlock() +// { +// if (allTags != null) +// { +// throw new IllegalStateException("标注集已锁定,无法修改"); +// } +// } + + @Override + public String stringOf(int id) + { + return idStringMap.get(id); + } + + @Override + public int idOf(String string) + { + Integer id = stringIdMap.get(string); + if (id == null) id = -1; + return id; + } + + @Override + public Iterator<Map.Entry<String, Integer>> iterator() + { + return stringIdMap.entrySet().iterator(); + } + + /** + * 获取所有标签及其下标 + * + * @return + */ + public int[] allTags() + { + return allTags; + } + + public void save(DataOutputStream out) throws IOException + { + out.writeInt(type.ordinal()); + out.writeInt(size()); + for (String tag : idStringMap) + { + out.writeUTF(tag); + } + } + + @Override + public boolean load(ByteArray byteArray) + { + idStringMap.clear(); + stringIdMap.clear(); + int size = byteArray.nextInt(); + for (int i = 0; i < size; i++) + { + String tag = byteArray.nextUTF(); + idStringMap.add(tag); + stringIdMap.put(tag, i); + } + lock(); + return true; + } + + public void load(DataInputStream in) throws IOException + { + idStringMap.clear(); + stringIdMap.clear(); + int size = in.readInt(); + for (int i = 0; i < size; i++) + { + String tag = in.readUTF(); + idStringMap.add(tag); + stringIdMap.put(tag, i); + } + lock(); + } + + public Collection<String> tags() + { + return idStringMap; + } + + public boolean contains(String tag) + { + return idStringMap.contains(tag); + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/utility/IOUtility.java b/src/main/java/com/hankcs/hanlp/model/perceptron/utility/IOUtility.java new file mode 100644 index 000000000..46ea375b4 --- /dev/null +++ 
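The TagSet above is, at heart, a string-interning table: tags receive dense integer ids in first-seen order, lookups work in both directions, and one extra id past the largest real tag (bosId) is reserved for the beginning-of-sentence pseudo-tag. A minimal standalone sketch of the same pattern (illustrative class and method names, not HanLP's actual API):

```java
import java.util.ArrayList;
import java.util.TreeMap;

// Bidirectional tag <-> id table, mirroring the structure of TagSet above.
public class TagSetSketch
{
    private final TreeMap<String, Integer> stringIdMap = new TreeMap<String, Integer>();
    private final ArrayList<String> idStringMap = new ArrayList<String>();

    // Returns the existing id, or assigns the next free id on first sight.
    public int add(String tag)
    {
        Integer id = stringIdMap.get(tag);
        if (id == null)
        {
            id = stringIdMap.size();
            stringIdMap.put(tag, id);
            idStringMap.add(tag);
        }
        return id;
    }

    public int idOf(String tag)
    {
        Integer id = stringIdMap.get(tag);
        return id == null ? -1 : id;
    }

    public String stringOf(int id)
    {
        return idStringMap.get(id);
    }

    public int size()
    {
        return stringIdMap.size();
    }

    // One id past the real tags is reserved for the BOS pseudo-tag.
    public int bosId()
    {
        return size();
    }

    public static void main(String[] args)
    {
        TagSetSketch ts = new TagSetSketch();
        System.out.println(ts.add("B"));      // 0
        System.out.println(ts.add("M"));      // 1
        System.out.println(ts.add("B"));      // 0 again: ids are stable
        System.out.println(ts.stringOf(1));   // M
        System.out.println(ts.bosId());       // 2
    }
}
```

Because ids are assigned densely from zero, they can index arrays directly (as `allTags` and the decoder's score tables do), which is the point of interning rather than hashing strings at decode time.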
b/src/main/java/com/hankcs/hanlp/model/perceptron/utility/IOUtility.java @@ -0,0 +1,116 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM7:29 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.hanlp.model.perceptron.utility; + +import com.hankcs.hanlp.classification.utilities.io.ConsoleLogger; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.instance.InstanceHandler; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; + +import java.io.*; +import java.util.regex.Pattern; + + +/** + * @author hankcs + */ +public class IOUtility extends IOUtil +{ + private static Pattern PATTERN_SPACE = Pattern.compile("\\s+"); + + public static String[] readLineToArray(String line) + { + line = line.trim(); + if (line.length() == 0) return new String[0]; + return PATTERN_SPACE.split(line); + } + + public static int loadInstance(final String path, InstanceHandler handler) throws IOException + { + ConsoleLogger logger = new ConsoleLogger(); + int size = 0; + File root = new File(path); + File allFiles[]; + if (root.isDirectory()) + { + allFiles = root.listFiles(new FileFilter() + { + @Override + public boolean accept(File pathname) + { + return pathname.isFile() && pathname.getName().endsWith(".txt"); + } + }); + } + else + { + allFiles = new File[]{root}; + } + + for (File file : allFiles) + { + BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8")); + String line; + while ((line = br.readLine()) != null) + { + line = line.trim(); + if (line.length() == 0) + { + continue; + } + Sentence sentence = Sentence.create(line); + if (sentence.wordList.size() == 0) continue; + ++size; + if (size % 1000 == 0) + { + logger.err("%c语料: 
%dk...", 13, size / 1000); + } + // debug +// if (size == 100) break; + if (handler.process(sentence)) break; + } + } + + return size; + } + + public static double[] evaluate(Instance[] instances, LinearModel model) + { + int[] stat = new int[2]; + for (int i = 0; i < instances.length; i++) + { + evaluate(instances[i], model, stat); + if (i % 100 == 0 || i == instances.length - 1) + { + System.err.printf("%c进度: %.2f%%", 13, (i + 1) / (float) instances.length * 100); + System.err.flush(); + } + } + return new double[]{stat[1] / (double) stat[0] * 100}; + } + + public static void evaluate(Instance instance, LinearModel model, int[] stat) + { + int[] predLabel = new int[instance.length()]; + model.viterbiDecode(instance, predLabel); + stat[0] += instance.tagArray.length; + for (int i = 0; i < predLabel.length; i++) + { + if (predLabel[i] == instance.tagArray[i]) + { + ++stat[1]; + } + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/perceptron/utility/Utility.java b/src/main/java/com/hankcs/hanlp/model/perceptron/utility/Utility.java new file mode 100644 index 000000000..783fd7794 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/model/perceptron/utility/Utility.java @@ -0,0 +1,493 @@ +/* + * + * Hankcs + * me@hankcs.com + * 2016-09-04 PM7:40 + * + * + * Copyright (c) 2008-2016, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
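IOUtility.evaluate above reduces tagging quality to a single figure: the percentage of positions whose predicted tag id matches the gold tag id, accumulated in `stat[0]` (total) and `stat[1]` (correct). A self-contained sketch of that metric, using plain int arrays in place of HanLP's Instance and LinearModel types:

```java
// Per-token tag accuracy over a batch of sequences, as a percentage.
public class TagAccuracy
{
    static double accuracy(int[][] goldSeqs, int[][] predSeqs)
    {
        int total = 0, correct = 0;
        for (int s = 0; s < goldSeqs.length; s++)
        {
            total += goldSeqs[s].length;
            for (int i = 0; i < goldSeqs[s].length; i++)
            {
                // Count positions where the predicted tag id equals the gold tag id.
                if (goldSeqs[s][i] == predSeqs[s][i]) correct++;
            }
        }
        return correct / (double) total * 100;
    }

    public static void main(String[] args)
    {
        int[][] gold = {{0, 1, 2}, {3, 3}};
        int[][] pred = {{0, 1, 1}, {3, 3}};
        System.out.println(accuracy(gold, pred)); // 80.0: 4 of 5 tags correct
    }
}
```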
+ * + */ +package com.hankcs.hanlp.model.perceptron.utility; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.model.perceptron.instance.InstanceHandler; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.io.BufferedWriter; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.OutputStreamWriter; +import java.util.*; + +/** + * @author hankcs + */ +public class Utility +{ + public static double[] prf(int[] stat) + { + return prf(stat[0], stat[1], stat[2]); + } + + public static double[] prf(int goldTotal, int predTotal, int correct) + { + double precision = (correct * 100.0) / predTotal; + double recall = (correct * 100.0) / goldTotal; + double[] performance = new double[3]; + performance[0] = precision; + performance[1] = recall; + performance[2] = (2 * precision * recall) / (precision + recall); + return performance; + } + + /** + * Fisher–Yates shuffle + * + * @param ar + */ + public static void shuffleArray(int[] ar) + { + Random rnd = new Random(); + for (int i = ar.length - 1; i > 0; i--) + { + int index = rnd.nextInt(i + 1); + // Simple swap + int a = ar[index]; + ar[index] = ar[i]; + ar[i] = a; + } + } + + public static <T> void shuffleArray(T[] ar) + { + Random rnd = new Random(); + for (int i = ar.length - 1; i > 0; i--) + { + int index = rnd.nextInt(i + 1); + // Simple swap + T a = ar[index]; + ar[index] = ar[i]; + ar[i] = a; + } + } + + public static String 
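Utility.prf above converts three raw counts (gold total, predicted total, correct) into precision, recall and F1, all expressed as percentages. A small worked example under the same convention:

```java
// P/R/F1 from (goldTotal, predTotal, correct) counts, as percentages,
// mirroring the arithmetic of Utility.prf above.
public class PrfDemo
{
    static double[] prf(int goldTotal, int predTotal, int correct)
    {
        double precision = (correct * 100.0) / predTotal;
        double recall = (correct * 100.0) / goldTotal;
        return new double[]{precision, recall, 2 * precision * recall / (precision + recall)};
    }

    public static void main(String[] args)
    {
        // 100 gold words, 80 predicted, 60 of them correct:
        double[] p = prf(100, 80, 60);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", p[0], p[1], p[2]); // P=75.00 R=60.00 F1=66.67
    }
}
```

F1 is the harmonic mean of P and R, so it only rewards systems that balance over- and under-segmentation; note the code will produce NaN when both counts are zero, which callers are expected to avoid.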
normalize(String text) + { + return CharTable.convert(text); + } + + /** + * 将人民日报格式的分词语料转化为空格分割的语料 + * + * @param inputFolder 输入人民日报语料的上级目录(该目录下的所有文件都是一篇人民日报分词文章) + * @param outputFile 输出一整个CRF训练格式的语料 + * @param begin 取多少个文档之后 + * @param end + * @throws IOException 转换过程中的IO异常 + */ + public static void convertPKUtoCWS(String inputFolder, String outputFile, final int begin, final int end) throws IOException + { + final BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8")); + CorpusLoader.walk(inputFolder, new CorpusLoader.Handler() + { + int doc = 0; + + @Override + public void handle(Document document) + { + ++doc; + if (doc < begin || doc > end) return; + try + { + List> sentenceList = convertComplexWordToSimpleWord(document.getComplexSentenceList()); + if (sentenceList.size() == 0) return; + for (List sentence : sentenceList) + { + if (sentence.size() == 0) continue; + int index = 0; + for (IWord iWord : sentence) + { + bw.write(iWord.getValue()); + if (++index != sentence.size()) + { + bw.write(' '); + } + } + bw.newLine(); + } + } + catch (IOException e) + { + e.printStackTrace(); + } + } + } + + ); + bw.close(); + } + + + /** + * 将人民日报格式的分词语料转化为空格分割的语料 + * + * @param inputFolder 输入人民日报语料的上级目录(该目录下的所有文件都是一篇人民日报分词文章) + * @param outputFile 输出一整个CRF训练格式的语料 + * @param begin 取多少个文档之后 + * @param end + * @throws IOException 转换过程中的IO异常 + */ + public static void convertPKUtoPOS(String inputFolder, String outputFile, final int begin, final int end) throws IOException + { + final BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8")); + CorpusLoader.walk(inputFolder, new CorpusLoader.Handler() + { + int doc = 0; + + @Override + public void handle(Document document) + { + ++doc; + if (doc < begin || doc > end) return; + try + { + List> sentenceList = document.getSimpleSentenceList(); + if (sentenceList.size() == 0) return; + for (List sentence : sentenceList) + { + if 
(sentence.size() == 0) continue; + int index = 0; + for (IWord iWord : sentence) + { + bw.write(iWord.toString()); + if (++index != sentence.size()) + { + bw.write(' '); + } + } + bw.newLine(); + } + } + catch (IOException e) + { + e.printStackTrace(); + } + } + } + + ); + bw.close(); + } + + private static List> convertComplexWordToSimpleWord(List> document) + { + String nerTag[] = new String[]{"nr", "ns", "nt"}; + List> output = new ArrayList>(document.size()); + for (List sentence : document) + { + List s = new ArrayList(sentence.size()); + for (IWord iWord : sentence) + { + if (iWord instanceof Word) + { + s.add((Word) iWord); + } + else if (isNer(iWord, nerTag)) + { + s.add(new Word(iWord.getValue(), iWord.getLabel())); + } + else + { + for (Word word : ((CompoundWord) iWord).innerList) + { + isNer(word, nerTag); + s.add(word); + } + } + } + output.add(s); + } + + return output; + } + + private static boolean isNer(IWord word, String nerTag[]) + { + for (String tag : nerTag) + { + if (word.getLabel().startsWith(tag)) + { + word.setLabel(tag); + return true; + } + } + + return false; + } + + public static String[] toWordArray(List wordList) + { + String[] wordArray = new String[wordList.size()]; + int i = -1; + for (Word word : wordList) + { + wordArray[++i] = word.getValue(); + } + + return wordArray; + } + + public static int[] evaluateCWS(String developFile, final PerceptronSegmenter segmenter) throws IOException + { + // int goldTotal = 0, predTotal = 0, correct = 0; + final int[] stat = new int[3]; + Arrays.fill(stat, 0); + IOUtility.loadInstance(developFile, new InstanceHandler() + { + @Override + public boolean process(Sentence sentence) + { + List wordList = sentence.toSimpleWordList(); + String[] wordArray = toWordArray(wordList); + stat[0] += wordArray.length; + String text = com.hankcs.hanlp.utility.TextUtility.combine(wordArray); + String[] predArray = segmenter.segment(text).toArray(new String[0]); + stat[1] += predArray.length; + + int goldIndex = 
0, predIndex = 0; + int goldLen = 0, predLen = 0; + + while (goldIndex < wordArray.length && predIndex < predArray.length) + { + if (goldLen == predLen) + { + if (wordArray[goldIndex].equals(predArray[predIndex])) + { + stat[2]++; + goldLen += wordArray[goldIndex].length(); + predLen += wordArray[goldIndex].length(); + goldIndex++; + predIndex++; + } + else + { + goldLen += wordArray[goldIndex].length(); + predLen += predArray[predIndex].length(); + goldIndex++; + predIndex++; + } + } + else if (goldLen < predLen) + { + goldLen += wordArray[goldIndex].length(); + goldIndex++; + } + else + { + predLen += predArray[predIndex].length(); + predIndex++; + } + } + + return false; + } + }); + return stat; + } + + /** + * 将句子转换为 (单词,词性,NER标签)三元组 + * + * @param sentence + * @param tagSet + * @return + */ + public static List convertSentenceToNER(Sentence sentence, NERTagSet tagSet) + { + List collector = new LinkedList(); + Set nerLabels = tagSet.nerLabels; + for (IWord word : sentence.wordList) + { + if (word instanceof CompoundWord) + { + List wordList = ((CompoundWord) word).innerList; + Word[] words = wordList.toArray(new Word[0]); + + if (nerLabels.contains(word.getLabel())) + { + collector.add(new String[]{words[0].value, words[0].label, tagSet.B_TAG_PREFIX + word.getLabel()}); + for (int i = 1; i < words.length - 1; i++) + { + collector.add(new String[]{words[i].value, words[i].label, tagSet.M_TAG_PREFIX + word.getLabel()}); + } + collector.add(new String[]{words[words.length - 1].value, words[words.length - 1].label, + tagSet.E_TAG_PREFIX + word.getLabel()}); + } + else + { + for (Word w : words) + { + collector.add(new String[]{w.value, w.label, tagSet.O_TAG}); + } + } + } + else + { + if (nerLabels.contains(word.getLabel())) + { + // 单个实体 + collector.add(new String[]{word.getValue(), word.getLabel(), tagSet.S_TAG}); + } + else + { + collector.add(new String[]{word.getValue(), word.getLabel(), tagSet.O_TAG}); + } + } + } + return collector; + } + + public static 
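The alignment loop in evaluateCWS above credits a predicted word only when it occupies exactly the same character span as a gold word; advancing whichever side has consumed fewer characters keeps the two segmentations in sync. The same statistic can be stated as a span-set intersection, sketched here with hypothetical helper names (not HanLP code):

```java
import java.util.HashSet;
import java.util.Set;

// Word-segmentation scoring by exact character-span match.
public class CwsScore
{
    // Encode each word as "start:end" over the raw character sequence.
    static Set<String> spans(String[] words)
    {
        Set<String> result = new HashSet<String>();
        int offset = 0;
        for (String w : words)
        {
            result.add(offset + ":" + (offset + w.length()));
            offset += w.length();
        }
        return result;
    }

    // Returns {goldTotal, predTotal, correct}, same shape as the stat array above.
    static int[] score(String[] gold, String[] pred)
    {
        Set<String> g = spans(gold);
        Set<String> p = spans(pred);
        p.retainAll(g); // keep only words with identical character spans
        return new int[]{gold.length, pred.length, p.size()};
    }

    public static void main(String[] args)
    {
        String[] gold = {"商品", "和", "服务"};
        String[] pred = {"商品", "和服", "务"};
        int[] stat = score(gold, pred);
        System.out.println(stat[0] + " " + stat[1] + " " + stat[2]); // 3 3 1
    }
}
```

Feeding the resulting triple to a prf-style computation yields the usual segmentation P/R/F1; only "商品" is counted correct here because "和服"/"务" cover different spans than "和"/"服务".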
void normalize(Sentence sentence) + { + for (IWord word : sentence.wordList) + { + if (word instanceof CompoundWord) + { + for (Word child : ((CompoundWord) word).innerList) + { + child.setValue(CharTable.convert(child.getValue())); + } + } + else + { + word.setValue(CharTable.convert(word.getValue())); + } + } + } + + public static Map evaluateNER(NERecognizer recognizer, String goldFile) + { + Map scores = new TreeMap(); + double[] avg = new double[]{0, 0, 0}; + scores.put("avg.", avg); + NERTagSet tagSet = recognizer.getNERTagSet(); + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(goldFile); + for (String line : lineIterator) + { + line = line.trim(); + if (line.isEmpty()) continue; + Sentence sentence = Sentence.create(line); + if (sentence == null) continue; + String[][] table = reshapeNER(convertSentenceToNER(sentence, tagSet)); + Set pred = combineNER(recognizer.recognize(table[0], table[1]), tagSet); + Set gold = combineNER(table[2], tagSet); + for (String p : pred) + { + String type = p.split("\t")[2]; + double[] s = scores.get(type); + if (s == null) + { + s = new double[]{0, 0, 0}; + scores.put(type, s); + } + if (gold.contains(p)) + { + ++s[2]; // 正确识别该类命名实体数 + ++avg[2]; + } + ++s[0]; // 识别出该类命名实体总数 + ++avg[0]; + } + for (String g : gold) + { + String type = g.split("\t")[2]; + double[] s = scores.get(type); + if (s == null) + { + s = new double[]{0, 0, 0}; + scores.put(type, s); + } + ++s[1]; // 该类命名实体总数 + ++avg[1]; + } + } + for (double[] s : scores.values()) + { + if (s[2] == 0) + { + s[0] = 0; + s[1] = 0; + continue; + } + s[1] = s[2] / s[1] * 100; // R=正确识别该类命名实体数/该类命名实体总数×100% + s[0] = s[2] / s[0] * 100; // P=正确识别该类命名实体数/识别出该类命名实体总数×100% + s[2] = 2 * s[0] * s[1] / (s[0] + s[1]); + } + return scores; + } + + public static Set combineNER(String[] nerArray, NERTagSet tagSet) + { + Set result = new LinkedHashSet(); + int begin = 0; + String prePos = NERTagSet.posOf(nerArray[0]); + for (int i = 1; i < nerArray.length; i++) + { + if 
(nerArray[i].charAt(0) == tagSet.B_TAG_CHAR || nerArray[i].charAt(0) == tagSet.S_TAG_CHAR || nerArray[i].charAt(0) == tagSet.O_TAG_CHAR) + { + if (i - begin > 1) + result.add(String.format("%d\t%d\t%s", begin, i, prePos)); + begin = i; + } + prePos = NERTagSet.posOf(nerArray[i]); + } + if (nerArray.length - 1 - begin >= 1) + { + result.add(String.format("%d\t%d\t%s", begin, nerArray.length, prePos)); + } + return result; + } + + public static String[][] reshapeNER(List ner) + { + String[] wordArray = new String[ner.size()]; + String[] posArray = new String[ner.size()]; + String[] nerArray = new String[ner.size()]; + reshapeNER(ner, wordArray, posArray, nerArray); + return new String[][]{wordArray, posArray, nerArray}; + } + + public static void reshapeNER(List collector, String[] wordArray, String[] posArray, String[] tagArray) + { + int i = 0; + for (String[] tuple : collector) + { + wordArray[i] = tuple[0]; + posArray[i] = tuple[1]; + tagArray[i] = tuple[2]; + ++i; + } + } + + public static void printNERScore(Map scores) + { + System.out.printf("%4s\t%6s\t%6s\t%6s\n", "NER", "P", "R", "F1"); + for (Map.Entry entry : scores.entrySet()) + { + String type = entry.getKey(); + double[] s = entry.getValue(); + System.out.printf("%4s\t%6.2f\t%6.2f\t%6.2f\n", type, s[0], s[1], s[2]); + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/model/trigram/CharacterBasedGenerativeModel.java b/src/main/java/com/hankcs/hanlp/model/trigram/CharacterBasedGenerativeModel.java index 1958449ec..d4f6c702c 100644 --- a/src/main/java/com/hankcs/hanlp/model/trigram/CharacterBasedGenerativeModel.java +++ b/src/main/java/com/hankcs/hanlp/model/trigram/CharacterBasedGenerativeModel.java @@ -207,9 +207,9 @@ public char[] tag(char[] charArray) for (int i = 2; i < charArray.length; i++) { // swap(now, pre) - double[][] _ = pre; + double[][] buffer = pre; pre = now; - now = _; + now = buffer; // end of swap for (int s = 0; s < 4; ++s) { diff --git 
a/src/main/java/com/hankcs/hanlp/recognition/nr/JapanesePersonRecognition.java b/src/main/java/com/hankcs/hanlp/recognition/nr/JapanesePersonRecognition.java index f6adf7b8f..e92f17e8f 100644 --- a/src/main/java/com/hankcs/hanlp/recognition/nr/JapanesePersonRecognition.java +++ b/src/main/java/com/hankcs/hanlp/recognition/nr/JapanesePersonRecognition.java @@ -11,6 +11,7 @@ */ package com.hankcs.hanlp.recognition.nr; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; import com.hankcs.hanlp.corpus.tag.Nature; import com.hankcs.hanlp.dictionary.BaseSearcher; import com.hankcs.hanlp.dictionary.CoreDictionary; @@ -39,20 +40,19 @@ public class JapanesePersonRecognition * @param wordNetOptimum 粗分结果对应的词图 * @param wordNetAll 全词图 */ - public static void Recognition(List segResult, WordNet wordNetOptimum, WordNet wordNetAll) + public static void recognition(List segResult, WordNet wordNetOptimum, WordNet wordNetAll) { StringBuilder sbName = new StringBuilder(); int appendTimes = 0; char[] charArray = wordNetAll.charArray; - BaseSearcher searcher = JapanesePersonDictionary.getSearcher(charArray); - Map.Entry entry; + DoubleArrayTrie.LongestSearcher searcher = JapanesePersonDictionary.getSearcher(charArray); int activeLine = 1; int preOffset = 0; - while ((entry = searcher.next()) != null) + while (searcher.next()) { - Character label = entry.getValue(); - String key = entry.getKey(); - int offset = searcher.getOffset(); + Character label = searcher.value; + int offset = searcher.begin; + String key = new String(charArray, offset, searcher.length); if (preOffset != offset) { if (appendTimes > 1 && sbName.length() > 2) // 日本人名最短为3字 diff --git a/src/main/java/com/hankcs/hanlp/recognition/nr/PersonRecognition.java b/src/main/java/com/hankcs/hanlp/recognition/nr/PersonRecognition.java index 697fc57d0..6c22d03be 100644 --- a/src/main/java/com/hankcs/hanlp/recognition/nr/PersonRecognition.java +++ b/src/main/java/com/hankcs/hanlp/recognition/nr/PersonRecognition.java @@ -15,6 
+15,7 @@ import com.hankcs.hanlp.algorithm.Viterbi; import com.hankcs.hanlp.corpus.dictionary.item.EnumItem; import com.hankcs.hanlp.corpus.tag.NR; +import com.hankcs.hanlp.corpus.tag.Nature; import com.hankcs.hanlp.dictionary.nr.PersonDictionary; import com.hankcs.hanlp.seg.common.Vertex; import com.hankcs.hanlp.seg.common.WordNet; @@ -23,13 +24,16 @@ import java.util.LinkedList; import java.util.List; +import static com.hankcs.hanlp.corpus.tag.Nature.nnt; +import static com.hankcs.hanlp.corpus.tag.Nature.nr; + /** * 人名识别 * @author hankcs */ public class PersonRecognition { - public static boolean Recognition(List pWordSegResult, WordNet wordNetOptimum, WordNet wordNetAll) + public static boolean recognition(List pWordSegResult, WordNet wordNetOptimum, WordNet wordNetAll) { List> roleTagList = roleObserve(pWordSegResult); if (HanLP.Config.DEBUG) @@ -85,26 +89,27 @@ public static List> roleObserve(List wordSegResult) EnumItem nrEnumItem = PersonDictionary.dictionary.get(vertex.realWord); if (nrEnumItem == null) { - switch (vertex.guessNature()) + Nature nature = vertex.guessNature(); + if (nature == nr) { - case nr: - { - // 有些双名实际上可以构成更长的三名 - if (vertex.getAttribute().totalFrequency <= 1000 && vertex.realWord.length() == 2) - { - nrEnumItem = new EnumItem(NR.X, NR.G); - } - else nrEnumItem = new EnumItem(NR.A, PersonDictionary.transformMatrixDictionary.getTotalFrequency(NR.A)); - }break; - case nnt: - { - // 姓+职位 - nrEnumItem = new EnumItem(NR.G, NR.K); - }break; - default: + // 有些双名实际上可以构成更长的三名 + if (vertex.getAttribute().totalFrequency <= 1000 && vertex.realWord.length() == 2) { + nrEnumItem = new EnumItem(); + nrEnumItem.labelMap.put(NR.X, 2); // 认为是三字人名前2个字=双字人名的可能性更高 + nrEnumItem.labelMap.put(NR.G, 1); + } + else nrEnumItem = new EnumItem(NR.A, PersonDictionary.transformMatrixDictionary.getTotalFrequency(NR.A)); - }break; + } + else if (nature == nnt) + { + // 姓+职位 + nrEnumItem = new EnumItem(NR.G, NR.K); + } + else + { + nrEnumItem = new EnumItem(NR.A, 
PersonDictionary.transformMatrixDictionary.getTotalFrequency(NR.A)); } } tagList.add(nrEnumItem); diff --git a/src/main/java/com/hankcs/hanlp/recognition/nr/TranslatedPersonRecognition.java b/src/main/java/com/hankcs/hanlp/recognition/nr/TranslatedPersonRecognition.java index b7a17d806..5b12a36a0 100644 --- a/src/main/java/com/hankcs/hanlp/recognition/nr/TranslatedPersonRecognition.java +++ b/src/main/java/com/hankcs/hanlp/recognition/nr/TranslatedPersonRecognition.java @@ -36,7 +36,7 @@ public class TranslatedPersonRecognition * @param wordNetOptimum 粗分结果对应的词图 * @param wordNetAll 全词图 */ - public static void Recognition(List segResult, WordNet wordNetOptimum, WordNet wordNetAll) + public static void recognition(List segResult, WordNet wordNetOptimum, WordNet wordNetAll) { StringBuilder sbName = new StringBuilder(); int appendTimes = 0; diff --git a/src/main/java/com/hankcs/hanlp/recognition/ns/PlaceRecognition.java b/src/main/java/com/hankcs/hanlp/recognition/ns/PlaceRecognition.java index a0fdb62e1..ab97f593f 100644 --- a/src/main/java/com/hankcs/hanlp/recognition/ns/PlaceRecognition.java +++ b/src/main/java/com/hankcs/hanlp/recognition/ns/PlaceRecognition.java @@ -31,7 +31,7 @@ */ public class PlaceRecognition { - public static boolean Recognition(List pWordSegResult, WordNet wordNetOptimum, WordNet wordNetAll) + public static boolean recognition(List pWordSegResult, WordNet wordNetOptimum, WordNet wordNetAll) { List> roleTagList = roleTag(pWordSegResult, wordNetAll); if (HanLP.Config.DEBUG) @@ -48,7 +48,7 @@ public static boolean Recognition(List pWordSegResult, WordNet wordNetOp } System.out.printf("地名角色观察:%s\n", sbLog.toString()); } - List NSList = viterbiExCompute(roleTagList); + List NSList = viterbiCompute(roleTagList); if (HanLP.Config.DEBUG) { StringBuilder sbLog = new StringBuilder(); @@ -117,20 +117,12 @@ public static List> roleTag(List vertexList, WordNet wordNe return tagList; } - private static void insert(ListIterator listIterator, List> tagList, 
WordNet wordNetAll, int line, NS ns) - { - Vertex vertex = wordNetAll.getFirst(line); - assert vertex != null : "全词网居然有空白行!"; - listIterator.add(vertex); - tagList.add(new EnumItem(ns, 1000)); - } - /** * 维特比算法求解最优标签 * @param roleTagList * @return */ - public static List viterbiExCompute(List> roleTagList) + public static List viterbiCompute(List> roleTagList) { return Viterbi.computeEnum(roleTagList, PlaceDictionary.transformMatrixDictionary); } diff --git a/src/main/java/com/hankcs/hanlp/recognition/nt/OrganizationRecognition.java b/src/main/java/com/hankcs/hanlp/recognition/nt/OrganizationRecognition.java index 0ffc1370d..e0fea4c09 100644 --- a/src/main/java/com/hankcs/hanlp/recognition/nt/OrganizationRecognition.java +++ b/src/main/java/com/hankcs/hanlp/recognition/nt/OrganizationRecognition.java @@ -24,6 +24,8 @@ import java.util.LinkedList; import java.util.List; +import static com.hankcs.hanlp.corpus.tag.Nature.*; + /** * 地址识别 * @@ -31,7 +33,7 @@ */ public class OrganizationRecognition { - public static boolean Recognition(List pWordSegResult, WordNet wordNetOptimum, WordNet wordNetAll) + public static boolean recognition(List pWordSegResult, WordNet wordNetOptimum, WordNet wordNetAll) { List> roleTagList = roleTag(pWordSegResult, wordNetAll); if (HanLP.Config.DEBUG) @@ -48,7 +50,7 @@ public static boolean Recognition(List pWordSegResult, WordNet wordNetOp } System.out.printf("机构名角色观察:%s\n", sbLog.toString()); } - List NTList = viterbiExCompute(roleTagList); + List NTList = viterbiCompute(roleTagList); if (HanLP.Config.DEBUG) { StringBuilder sbLog = new StringBuilder(); @@ -78,32 +80,25 @@ public static List> roleTag(List vertexList, WordNet wordNe { // 构成更长的 Nature nature = vertex.guessNature(); - switch (nature) + if (nature == nrf) { - case nrf: + if (vertex.getAttribute().totalFrequency <= 1000) { - if (vertex.getAttribute().totalFrequency <= 1000) - { - tagList.add(new EnumItem(NT.F, 1000)); - } - else break; - } - continue; - case ni: - case nic: - 
case nis: - case nit: - { - EnumItem ntEnumItem = new EnumItem(NT.K, 1000); - ntEnumItem.addLabel(NT.D, 1000); - tagList.add(ntEnumItem); + tagList.add(new EnumItem(NT.F, 1000)); + continue; } + } + else if (nature == ni || nature == nic || nature == nis || nature == nit) + { + EnumItem ntEnumItem = new EnumItem(NT.K, 1000); + ntEnumItem.addLabel(NT.D, 1000); + tagList.add(ntEnumItem); continue; - case m: - { - EnumItem ntEnumItem = new EnumItem(NT.M, 1000); - tagList.add(ntEnumItem); - } + } + else if (nature == m) + { + EnumItem ntEnumItem = new EnumItem(NT.M, 1000); + tagList.add(ntEnumItem); continue; } @@ -124,7 +119,7 @@ public static List> roleTag(List vertexList, WordNet wordNe * @param roleTagList * @return */ - public static List viterbiExCompute(List> roleTagList) + public static List viterbiCompute(List> roleTagList) { return Viterbi.computeEnum(roleTagList, OrganizationDictionary.transformMatrixDictionary); } diff --git a/src/main/java/com/hankcs/hanlp/seg/CRF/CRFSegment.java b/src/main/java/com/hankcs/hanlp/seg/CRF/CRFSegment.java index ffdb37707..490b082a3 100644 --- a/src/main/java/com/hankcs/hanlp/seg/CRF/CRFSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/CRF/CRFSegment.java @@ -19,10 +19,9 @@ import com.hankcs.hanlp.model.crf.CRFModel; import com.hankcs.hanlp.model.crf.FeatureFunction; import com.hankcs.hanlp.model.crf.Table; -import com.hankcs.hanlp.seg.CharacterBasedGenerativeModelSegment; +import com.hankcs.hanlp.seg.CharacterBasedSegment; import com.hankcs.hanlp.seg.Segment; import com.hankcs.hanlp.seg.common.Term; -import com.hankcs.hanlp.seg.common.Vertex; import com.hankcs.hanlp.utility.CharacterHelper; import com.hankcs.hanlp.utility.GlobalObjectPool; @@ -35,8 +34,9 @@ * 基于CRF的分词器 * * @author hankcs + * @deprecated 已废弃,请使用{@link com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer} */ -public class CRFSegment extends CharacterBasedGenerativeModelSegment +public class CRFSegment extends CharacterBasedSegment { private CRFModel crfModel; @@ 
-47,6 +47,7 @@ public CRFSegment(CRFSegmentModel crfModel) public CRFSegment(String modelPath) { + logger.warning("已废弃CRFSegment,请使用功能更丰富、设计更优雅的CRFLexicalAnalyzer"); crfModel = GlobalObjectPool.get(modelPath); if (crfModel != null) { @@ -66,6 +67,7 @@ public CRFSegment(String modelPath) GlobalObjectPool.put(modelPath, crfModel); } + // 已废弃,请使用功能更丰富、设计更优雅的{@link com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer}。 public CRFSegment() { this(HanLP.Config.CRFSegmentModelPath); @@ -106,29 +108,30 @@ protected List roughSegSentence(char[] sentence) } if (i == table.v.length) { - termList.add(new Term(new String(sentence, begin, offset - begin), toDefaultNature(table.v[i][0]) )); + termList.add(new Term(new String(sentence, begin, offset - begin), toDefaultNature(table.v[i][0]))); break OUTER; } else - termList.add(new Term(new String(sentence, begin, offset - begin + table.v[i][1].length()), toDefaultNature(table.v[i][0]) )); + termList.add(new Term(new String(sentence, begin, offset - begin + table.v[i][1].length()), toDefaultNature(table.v[i][0]))); } break; default: { - termList.add(new Term(new String(sentence, offset, table.v[i][1].length()), toDefaultNature(table.v[i][0]) )); + termList.add(new Term(new String(sentence, offset, table.v[i][1].length()), toDefaultNature(table.v[i][0]))); } break; } } return termList; } - - protected static Nature toDefaultNature(String compiledChar) { - if (compiledChar.equals("M")) - return Nature.m; - if (compiledChar.equals("W")) - return Nature.nx; - return null; + + protected static Nature toDefaultNature(String compiledChar) + { + if (compiledChar.equals("M")) + return Nature.m; + if (compiledChar.equals("W")) + return Nature.nx; + return null; } public static List atomSegment(char[] sentence) @@ -288,6 +291,7 @@ else if (CharacterHelper.isEnglishLetter(sentence[i]) || sentence[i] == ' ') */ private static String[][] resizeArray(String[][] array, int size) { + if (array.length == size) return array; String[][] nArray = new 
String[size][]; System.arraycopy(array, 0, nArray, 0, size); return nArray; diff --git a/src/main/java/com/hankcs/hanlp/seg/CharacterBasedGenerativeModelSegment.java b/src/main/java/com/hankcs/hanlp/seg/CharacterBasedSegment.java similarity index 69% rename from src/main/java/com/hankcs/hanlp/seg/CharacterBasedGenerativeModelSegment.java rename to src/main/java/com/hankcs/hanlp/seg/CharacterBasedSegment.java index 03ea54a8f..a1171f171 100644 --- a/src/main/java/com/hankcs/hanlp/seg/CharacterBasedGenerativeModelSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/CharacterBasedSegment.java @@ -12,8 +12,6 @@ import java.util.ArrayList; import java.util.Collections; -import java.util.Iterator; -import java.util.LinkedList; import java.util.List; import com.hankcs.hanlp.algorithm.Viterbi; @@ -25,10 +23,10 @@ import com.hankcs.hanlp.seg.common.Vertex; /** - * 基于字构词的生成式模型分词器基类 + * 基于“由字构词”方法分词器基类 * @author hankcs */ -public abstract class CharacterBasedGenerativeModelSegment extends Segment +public abstract class CharacterBasedSegment extends Segment { /** @@ -37,7 +35,7 @@ public abstract class CharacterBasedGenerativeModelSegment extends Segment * @param term * @return */ - public static CoreDictionary.Attribute guessAttribute(Term term) + public static CoreDictionary.Attribute guessAttribute(Term term) { CoreDictionary.Attribute attribute = CoreDictionary.get(term.word); if (attribute == null) @@ -60,14 +58,12 @@ else if (term.word.trim().length() == 0) else term.nature = attribute.nature[0]; return attribute; } - - - /* + + + /** * 以下方法用于纯分词模型 * 分词、词性标注联合模型则直接重载segSentence */ - - @Override protected List segSentence(char[] sentence) { @@ -100,9 +96,9 @@ protected List segSentence(char[] sentence) * @return */ protected abstract List roughSegSentence(char[] sentence); - + /** - * 将中间结果转换为词网顶点, + * 将中间结果转换为词网顶点, * 这样就可以利用基于Vertex开发的功能, 如词性标注、NER等 * @param wordList * @param appendStart @@ -122,53 +118,4 @@ protected List toVertexList(List wordList, boolean appendStart) 
return vertexList; } - /** - * 将一条路径转为最终结果 - * - * @param vertexList - * @param offsetEnabled 是否计算offset - * @return - */ - protected static List convert(List vertexList, boolean offsetEnabled) - { - assert vertexList != null; - assert vertexList.size() >= 2 : "这条路径不应当短于2" + vertexList.toString(); - int length = vertexList.size() - 2; - List resultList = new ArrayList(length); - Iterator iterator = vertexList.iterator(); - iterator.next(); - if (offsetEnabled) - { - int offset = 0; - for (int i = 0; i < length; ++i) - { - Vertex vertex = iterator.next(); - Term term = convert(vertex); - term.offset = offset; - offset += term.length(); - resultList.add(term); - } - } - else - { - for (int i = 0; i < length; ++i) - { - Vertex vertex = iterator.next(); - Term term = convert(vertex); - resultList.add(term); - } - } - return resultList; - } - - /** - * 将节点转为term - * - * @param vertex - * @return - */ - private static Term convert(Vertex vertex) - { - return new Term(vertex.realWord, vertex.guessNature()); - } } diff --git a/src/main/java/com/hankcs/hanlp/seg/Config.java b/src/main/java/com/hankcs/hanlp/seg/Config.java index 2f5db20c2..28a198d0a 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Config.java +++ b/src/main/java/com/hankcs/hanlp/seg/Config.java @@ -11,13 +11,15 @@ */ package com.hankcs.hanlp.seg; +import com.hankcs.hanlp.HanLP; + /** * 分词器配置项 */ public class Config { /** - * 是否是索引分词(合理地最小分割) + * 是否是索引分词(合理地最小分割),indexMode代表全切分词语的最小长度(包含) */ public int indexMode = 0; /** @@ -76,4 +78,19 @@ public void updateNerConfig() { ner = nameRecognize || translatedNameRecognize || japaneseNameRecognize || placeRecognize || organizationRecognize; } + + /** + * 是否是索引模式 + * + * @return + */ + public boolean isIndexMode() + { + return indexMode > 0; + } + + /** + * 是否执行字符正规化(繁体->简体,全角->半角,大写->小写),切换配置后必须删CustomDictionary.txt.bin缓存 + */ + public boolean normalization = HanLP.Config.Normalization; } diff --git a/src/main/java/com/hankcs/hanlp/seg/DictionaryBasedSegment.java 
b/src/main/java/com/hankcs/hanlp/seg/DictionaryBasedSegment.java index f1d278c27..3b3e27671 100644 --- a/src/main/java/com/hankcs/hanlp/seg/DictionaryBasedSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/DictionaryBasedSegment.java @@ -10,8 +10,16 @@ */ package com.hankcs.hanlp.seg; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.seg.NShort.Path.AtomNode; + +import java.util.List; + +import static com.hankcs.hanlp.utility.Predefine.logger; + /** * 基于词典的机械分词器基类 + * * @author hankcs */ public abstract class DictionaryBasedSegment extends Segment @@ -19,6 +27,7 @@ public abstract class DictionaryBasedSegment extends Segment /** * 开启数词和英文识别(与标准意义上的词性标注不同,只是借用这个配置方法,不是真的开启了词性标注。 * 一般用词典分词的用户不太可能是NLP专业人士,对词性准确率要求不高,所以干脆不为词典分词实现词性标注。) + * * @param enable * @return */ @@ -26,4 +35,54 @@ public Segment enablePartOfSpeechTagging(boolean enable) { return super.enablePartOfSpeechTagging(enable); } + + /** + * 词性标注 + * + * @param charArray 字符数组 + * @param wordNet 词语长度 + * @param natureArray 输出词性 + */ + protected void posTag(char[] charArray, int[] wordNet, Nature[] natureArray) + { + if (config.speechTagging) + { + for (int i = 0; i < natureArray.length; ) + { + if (natureArray[i] == null) + { + int j = i + 1; + for (; j < natureArray.length; ++j) + { + if (natureArray[j] != null) break; + } + List atomNodeList = quickAtomSegment(charArray, i, j); + for (AtomNode atomNode : atomNodeList) + { + if (atomNode.sWord.length() >= wordNet[i]) + { + wordNet[i] = atomNode.sWord.length(); + natureArray[i] = atomNode.getNature(); + i += wordNet[i]; + } + } + i = j; + } + else + { + ++i; + } + } + } + } + + @Override + public Segment enableCustomDictionary(boolean enable) + { + if (enable) + { + logger.warning("为基于词典的分词器开启用户词典太浪费了,建议直接将所有词典的路径传入构造函数,这样速度更快、内存更省"); + } + return super.enableCustomDictionary(enable); + } } diff --git a/src/main/java/com/hankcs/hanlp/seg/Dijkstra/DijkstraSegment.java b/src/main/java/com/hankcs/hanlp/seg/Dijkstra/DijkstraSegment.java 
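For readers skimming the refactor: the new `posTag` helper on `DictionaryBasedSegment` centralizes the gap-filling loop that `AhoCorasickDoubleArrayTrieSegment` and `DoubleArrayTrieSegment` each duplicated before this patch (their copies are deleted further down in the diff). A minimal, self-contained sketch of that loop follows — plain `String` tags and a toy atom segmenter stand in for HanLP's `Nature` and `quickAtomSegment`, and the class name `PosTagSketch` is illustrative, not part of the API:

```java
import java.util.*;

public class PosTagSketch
{
    /**
     * Toy stand-in for HanLP's quickAtomSegment: a digit run becomes one "m"
     * (number) atom, a letter run one "nx" atom, anything else a single "w"
     * atom. Returns {word, tag} pairs.
     */
    static List<String[]> atomSegment(char[] c, int begin, int end)
    {
        List<String[]> atoms = new ArrayList<String[]>();
        int i = begin;
        while (i < end)
        {
            int j = i + 1;
            String tag;
            if (Character.isDigit(c[i]))
            {
                while (j < end && Character.isDigit(c[j])) ++j;
                tag = "m";
            }
            else if (Character.isLetter(c[i]))
            {
                while (j < end && Character.isLetter(c[j])) ++j;
                tag = "nx";
            }
            else
            {
                tag = "w";
            }
            atoms.add(new String[]{new String(c, i, j - i), tag});
            i = j;
        }
        return atoms;
    }

    /**
     * Mirrors the gap-filling loop of DictionaryBasedSegment.posTag: every
     * run of positions the dictionary left untagged (null) is re-segmented
     * into atoms, whose lengths and tags are written back into wordNet/tags.
     */
    static void posTag(char[] c, int[] wordNet, String[] tags)
    {
        for (int i = 0; i < tags.length; )
        {
            if (tags[i] == null)
            {
                int j = i + 1;
                while (j < tags.length && tags[j] == null) ++j;
                for (String[] atom : atomSegment(c, i, j))
                {
                    if (atom[0].length() >= wordNet[i])
                    {
                        wordNet[i] = atom[0].length();   // atom覆盖当前位置
                        tags[i] = atom[1];
                        i += wordNet[i];
                    }
                }
                i = j;
            }
            else
            {
                ++i;                                     // 该位置已被词典标注
            }
        }
    }
}
```

A caller then emits one term per `i += wordNet[i]` step, exactly as the two `segSentence` implementations below do after their duplicated loops are replaced by `posTag(...)`.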
index 3adbbcca0..ffa8f30f3 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Dijkstra/DijkstraSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/Dijkstra/DijkstraSegment.java @@ -18,7 +18,7 @@ import com.hankcs.hanlp.recognition.ns.PlaceRecognition; import com.hankcs.hanlp.recognition.nt.OrganizationRecognition; import com.hankcs.hanlp.seg.Dijkstra.Path.State; -import com.hankcs.hanlp.seg.WordBasedGenerativeModelSegment; +import com.hankcs.hanlp.seg.WordBasedSegment; import com.hankcs.hanlp.seg.common.*; import java.util.*; @@ -27,7 +27,7 @@ * 最短路径分词 * @author hankcs */ -public class DijkstraSegment extends WordBasedGenerativeModelSegment +public class DijkstraSegment extends WordBasedSegment { @Override public List segSentence(char[] sentence) @@ -35,9 +35,9 @@ public List segSentence(char[] sentence) WordNet wordNetOptimum = new WordNet(sentence); WordNet wordNetAll = new WordNet(wordNetOptimum.charArray); ////////////////生成词网//////////////////// - GenerateWordNet(wordNetAll); + generateWordNet(wordNetAll); ///////////////生成词图//////////////////// - Graph graph = GenerateBiGraph(wordNetAll); + Graph graph = generateBiGraph(wordNetAll); if (HanLP.Config.DEBUG) { System.out.printf("粗分词图:%s\n", graph.printByTo()); @@ -70,33 +70,33 @@ public List segSentence(char[] sentence) int preSize = wordNetOptimum.size(); if (config.nameRecognize) { - PersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + PersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.translatedNameRecognize) { - TranslatedPersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + TranslatedPersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.japaneseNameRecognize) { - JapanesePersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + JapanesePersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.placeRecognize) { - PlaceRecognition.Recognition(vertexList, 
wordNetOptimum, wordNetAll); + PlaceRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.organizationRecognize) { // 层叠隐马模型——生成输出作为下一级隐马输入 - graph = GenerateBiGraph(wordNetOptimum); + graph = generateBiGraph(wordNetOptimum); vertexList = dijkstra(graph); wordNetOptimum.clear(); wordNetOptimum.addAll(vertexList); preSize = wordNetOptimum.size(); - OrganizationRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + OrganizationRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (wordNetOptimum.size() != preSize) { - graph = GenerateBiGraph(wordNetOptimum); + graph = generateBiGraph(wordNetOptimum); vertexList = dijkstra(graph); if (HanLP.Config.DEBUG) { diff --git a/src/main/java/com/hankcs/hanlp/seg/Dijkstra/Path/State.java b/src/main/java/com/hankcs/hanlp/seg/Dijkstra/Path/State.java index 0bfd4e343..16a4f1138 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Dijkstra/Path/State.java +++ b/src/main/java/com/hankcs/hanlp/seg/Dijkstra/Path/State.java @@ -11,8 +11,6 @@ */ package com.hankcs.hanlp.seg.Dijkstra.Path; -import com.hankcs.hanlp.seg.common.Vertex; - /** * @author hankcs */ diff --git a/src/main/java/com/hankcs/hanlp/seg/HMM/HMMSegment.java b/src/main/java/com/hankcs/hanlp/seg/HMM/HMMSegment.java index 14d680b5c..c53d9f156 100644 --- a/src/main/java/com/hankcs/hanlp/seg/HMM/HMMSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/HMM/HMMSegment.java @@ -13,7 +13,7 @@ import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.corpus.io.ByteArray; import com.hankcs.hanlp.model.trigram.CharacterBasedGenerativeModel; -import com.hankcs.hanlp.seg.CharacterBasedGenerativeModelSegment; +import com.hankcs.hanlp.seg.CharacterBasedSegment; import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.utility.GlobalObjectPool; import com.hankcs.hanlp.utility.TextUtility; @@ -27,7 +27,7 @@ * * @author hankcs */ -public class HMMSegment extends CharacterBasedGenerativeModelSegment +public class HMMSegment extends 
CharacterBasedSegment { CharacterBasedGenerativeModel model; diff --git a/src/main/java/com/hankcs/hanlp/seg/NShort/NShortSegment.java b/src/main/java/com/hankcs/hanlp/seg/NShort/NShortSegment.java index e405ecb61..bd35118c5 100644 --- a/src/main/java/com/hankcs/hanlp/seg/NShort/NShortSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/NShort/NShortSegment.java @@ -18,7 +18,7 @@ import com.hankcs.hanlp.recognition.nr.TranslatedPersonRecognition; import com.hankcs.hanlp.recognition.ns.PlaceRecognition; import com.hankcs.hanlp.recognition.nt.OrganizationRecognition; -import com.hankcs.hanlp.seg.WordBasedGenerativeModelSegment; +import com.hankcs.hanlp.seg.WordBasedSegment; import com.hankcs.hanlp.seg.NShort.Path.*; import com.hankcs.hanlp.seg.common.Graph; import com.hankcs.hanlp.seg.common.Term; @@ -32,21 +32,8 @@ * * @author hankcs */ -public class NShortSegment extends WordBasedGenerativeModelSegment +public class NShortSegment extends WordBasedSegment { - List BiOptimumSegment(WordNet wordNetOptimum) - { -// logger.trace("细分词网:\n{}", wordNetOptimum); - Graph graph = GenerateBiGraph(wordNetOptimum); - if (HanLP.Config.DEBUG) - { - System.out.printf("细分词图:%s\n", graph.printByTo()); - } - NShortPath nShortPath = new NShortPath(graph, 1); - List spResult = nShortPath.getNPaths(1); - assert spResult.size() > 0 : "最短路径求解失败,请检查下图是否有悬孤节点或负圈\n" + graph.printByTo(); - return graph.parsePath(spResult.get(0)); - } @Override public List segSentence(char[] sentence) @@ -55,7 +42,7 @@ public List segSentence(char[] sentence) WordNet wordNetAll = new WordNet(sentence); // char[] charArray = text.toCharArray(); // 粗分 - List> coarseResult = BiSegment(sentence, 2, wordNetOptimum, wordNetAll); + List> coarseResult = biSegment(sentence, 2, wordNetOptimum, wordNetAll); boolean NERexists = false; for (List vertexList : coarseResult) { @@ -70,26 +57,26 @@ public List segSentence(char[] sentence) int preSize = wordNetOptimum.size(); if (config.nameRecognize) { - 
PersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + PersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.translatedNameRecognize) { - TranslatedPersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + TranslatedPersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.japaneseNameRecognize) { - JapanesePersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + JapanesePersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.placeRecognize) { - PlaceRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + PlaceRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.organizationRecognize) { // 层叠隐马模型——生成输出作为下一级隐马输入 - vertexList = Dijkstra.compute(GenerateBiGraph(wordNetOptimum)); + vertexList = Dijkstra.compute(generateBiGraph(wordNetOptimum)); wordNetOptimum.addAll(vertexList); - OrganizationRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + OrganizationRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (!NERexists && preSize != wordNetOptimum.size()) { @@ -101,7 +88,7 @@ public List segSentence(char[] sentence) List vertexList = coarseResult.get(0); if (NERexists) { - Graph graph = GenerateBiGraph(wordNetOptimum); + Graph graph = generateBiGraph(wordNetOptimum); vertexList = Dijkstra.compute(graph); if (HanLP.Config.DEBUG) { @@ -146,15 +133,15 @@ public List segSentence(char[] sentence) * @param wordNetAll * @return 一系列粗分结果 */ - public List> BiSegment(char[] sSentence, int nKind, WordNet wordNetOptimum, WordNet wordNetAll) + public List> biSegment(char[] sSentence, int nKind, WordNet wordNetOptimum, WordNet wordNetAll) { List> coarseResult = new LinkedList>(); ////////////////生成词网//////////////////// - GenerateWordNet(wordNetAll); + generateWordNet(wordNetAll); // logger.trace("词网大小:" + wordNetAll.size()); // logger.trace("打印词网:\n" + wordNetAll); 
///////////////生成词图//////////////////// - Graph graph = GenerateBiGraph(wordNetAll); + Graph graph = generateBiGraph(wordNetAll); // logger.trace(graph.toString()); if (HanLP.Config.DEBUG) { @@ -176,7 +163,7 @@ public List> BiSegment(char[] sSentence, int nKind, WordNet wordNet for (int[] path : spResult) { List vertexes = graph.parsePath(path); - GenerateWord(vertexes, wordNetOptimum); + generateWord(vertexes, wordNetOptimum); coarseResult.add(vertexes); } return coarseResult; diff --git a/src/main/java/com/hankcs/hanlp/seg/Other/AhoCorasickDoubleArrayTrieSegment.java b/src/main/java/com/hankcs/hanlp/seg/Other/AhoCorasickDoubleArrayTrieSegment.java index 6b9e30a33..995f84637 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Other/AhoCorasickDoubleArrayTrieSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/Other/AhoCorasickDoubleArrayTrieSegment.java @@ -10,21 +10,21 @@ */ package com.hankcs.hanlp.seg.Other; +import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.corpus.tag.Nature; import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.seg.DictionaryBasedSegment; -import com.hankcs.hanlp.seg.NShort.Path.AtomNode; import com.hankcs.hanlp.seg.Segment; import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.utility.TextUtility; -import static com.hankcs.hanlp.utility.Predefine.logger; - import java.io.IOException; import java.util.*; +import static com.hankcs.hanlp.utility.Predefine.logger; + /** * 使用AhoCorasickDoubleArrayTrie实现的最长分词器
* 需要用户调用setTrie()提供一个AhoCorasickDoubleArrayTrie @@ -35,6 +35,34 @@ public class AhoCorasickDoubleArrayTrieSegment extends DictionaryBasedSegment { AhoCorasickDoubleArrayTrie trie; + public AhoCorasickDoubleArrayTrieSegment() throws IOException + { + this(HanLP.Config.CoreDictionaryPath); + } + + public AhoCorasickDoubleArrayTrieSegment(TreeMap dictionary) + { + this(new AhoCorasickDoubleArrayTrie(dictionary)); + } + + public AhoCorasickDoubleArrayTrieSegment(AhoCorasickDoubleArrayTrie trie) + { + this.trie = trie; + config.useCustomDictionary = false; + config.speechTagging = false; + } + + /** + * 加载自己的词典,构造分词器 + * @param dictionaryPaths 任意数量个词典 + * + * @throws IOException 加载过程中的IO异常 + */ + public AhoCorasickDoubleArrayTrieSegment(String... dictionaryPaths) throws IOException + { + this(new AhoCorasickDoubleArrayTrie(IOUtil.loadDictionary(dictionaryPaths))); + } + @Override protected List segSentence(char[] sentence) { @@ -63,35 +91,7 @@ public void hit(int begin, int end, CoreDictionary.Attribute value) } }); LinkedList termList = new LinkedList(); - if (config.speechTagging) - { - for (int i = 0; i < natureArray.length; ) - { - if (natureArray[i] == null) - { - int j = i + 1; - for (; j < natureArray.length; ++j) - { - if (natureArray[j] != null) break; - } - List atomNodeList = quickAtomSegment(sentence, i, j); - for (AtomNode atomNode : atomNodeList) - { - if (atomNode.sWord.length() >= wordNet[i]) - { - wordNet[i] = atomNode.sWord.length(); - natureArray[i] = atomNode.getNature(); - i += wordNet[i]; - } - } - i = j; - } - else - { - ++i; - } - } - } + posTag(sentence, wordNet, natureArray); for (int i = 0; i < wordNet.length; ) { Term term = new Term(new String(sentence, i, wordNet[i]), config.speechTagging ? (natureArray[i] == null ? 
Nature.nz : natureArray[i]) : null); @@ -102,21 +102,6 @@ public void hit(int begin, int end, CoreDictionary.Attribute value) return termList; } - public AhoCorasickDoubleArrayTrieSegment() - { - super(); - config.useCustomDictionary = false; - config.speechTagging = true; - } - - public AhoCorasickDoubleArrayTrieSegment(TreeMap dictionary) - { - this(); - trie = new AhoCorasickDoubleArrayTrie(); - trie.build(dictionary); - setTrie(trie); - } - @Override public Segment enableCustomDictionary(boolean enable) { diff --git a/src/main/java/com/hankcs/hanlp/seg/Other/DoubleArrayTrieSegment.java b/src/main/java/com/hankcs/hanlp/seg/Other/DoubleArrayTrieSegment.java index 95d4ebf57..a15381be1 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Other/DoubleArrayTrieSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/Other/DoubleArrayTrieSegment.java @@ -13,13 +13,14 @@ import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.corpus.tag.Nature; import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.dictionary.CustomDictionary; import com.hankcs.hanlp.seg.DictionaryBasedSegment; -import com.hankcs.hanlp.seg.NShort.Path.AtomNode; import com.hankcs.hanlp.seg.common.Term; +import java.io.IOException; import java.util.Arrays; import java.util.LinkedList; import java.util.List; @@ -56,6 +57,17 @@ public DoubleArrayTrieSegment(DoubleArrayTrie trie) config.useCustomDictionary = false; } + /** + * 加载自己的词典,构造分词器 + * @param dictionaryPaths 任意数量个词典 + * + * @throws IOException 加载过程中的IO异常 + */ + public DoubleArrayTrieSegment(String... 
dictionaryPaths) throws IOException + { + this(new DoubleArrayTrie(IOUtil.loadDictionary(dictionaryPaths))); + } + @Override protected List segSentence(char[] sentence) { @@ -66,10 +78,10 @@ protected List segSentence(char[] sentence) matchLongest(sentence, wordNet, natureArray, trie); if (config.useCustomDictionary) { - matchLongest(sentence, wordNet, natureArray, CustomDictionary.dat); - if (CustomDictionary.trie != null) + matchLongest(sentence, wordNet, natureArray, customDictionary.dat); + if (customDictionary.trie != null) { - CustomDictionary.trie.parseLongestText(charArray, new AhoCorasickDoubleArrayTrie.IHit() + customDictionary.trie.parseLongestText(charArray, new AhoCorasickDoubleArrayTrie.IHit() { @Override public void hit(int begin, int end, CoreDictionary.Attribute value) @@ -88,35 +100,7 @@ public void hit(int begin, int end, CoreDictionary.Attribute value) } } LinkedList termList = new LinkedList(); - if (config.speechTagging) - { - for (int i = 0; i < natureArray.length; ) - { - if (natureArray[i] == null) - { - int j = i + 1; - for (; j < natureArray.length; ++j) - { - if (natureArray[j] != null) break; - } - List atomNodeList = quickAtomSegment(charArray, i, j); - for (AtomNode atomNode : atomNodeList) - { - if (atomNode.sWord.length() >= wordNet[i]) - { - wordNet[i] = atomNode.sWord.length(); - natureArray[i] = atomNode.getNature(); - i += wordNet[i]; - } - } - i = j; - } - else - { - ++i; - } - } - } + posTag(charArray, wordNet, natureArray); for (int i = 0; i < wordNet.length; ) { Term term = new Term(new String(charArray, i, wordNet[i]), config.speechTagging ? (natureArray[i] == null ? 
Nature.nz : natureArray[i]) : null); diff --git a/src/main/java/com/hankcs/hanlp/seg/Other/LongestBinSegmentToy.java b/src/main/java/com/hankcs/hanlp/seg/Other/LongestBinSegmentToy.java deleted file mode 100644 index 6a857dce4..000000000 --- a/src/main/java/com/hankcs/hanlp/seg/Other/LongestBinSegmentToy.java +++ /dev/null @@ -1,118 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/5/3 14:12 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.seg.Other; - - -import com.hankcs.hanlp.collection.trie.bintrie.BaseNode; -import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; - -import java.util.AbstractMap; -import java.util.ArrayList; -import java.util.List; -import java.util.Map; - -/** - * 最长分词玩具 - * @author hankcs - */ -public class LongestBinSegmentToy -{ - private BinTrie trie; - /** - * 待分词转化的char - */ - private char c[]; - /** - * 指向当前处理字串的开始位置(前面的已经分词分完了) - */ - private int offset; - - public LongestBinSegmentToy(BinTrie trie) - { - this.trie = trie; - } - - public List> seg(String text) - { - reset(text); - List> allWords = new ArrayList>(); - Map.Entry entry; - while ((entry = next()) != null) - { - allWords.add(entry); - } - c = null; - return allWords; - } - - /** - * 将分词器复原或置为准备工作的状态 - * @param text 待分词的字串 - */ - public void reset(String text) - { - offset = 0; - c = text.toCharArray(); - } - - public Map.Entry next() - { - StringBuffer key = new StringBuffer(); // 构造key - BaseNode branch = trie; - BaseNode possibleBranch = null; - while (offset < c.length) - { - if (possibleBranch != null) - { - branch = possibleBranch; - possibleBranch = null; - } - else - { - branch = branch.getChild(c[offset]); - if (branch == null) - { - branch = trie; - ++offset; - continue; - } - } - key.append(c[offset]); - ++offset; - if (branch.getStatus() == 
BaseNode.Status.WORD_END_3 -// || branch.getStatus() == BaseNode.Status.WORD_MIDDLE_2 - ) - { - return new AbstractMap.SimpleEntry(key.toString(), (V) branch.getValue()); - } - else if (branch.getStatus() == BaseNode.Status.WORD_MIDDLE_2) // 最长分词的关键 - { - possibleBranch = offset < c.length ? branch.getChild(c[offset]) : null; - if (possibleBranch == null) - { - return new AbstractMap.SimpleEntry(key.toString(), (V) branch.getValue()); - } - } - } - - return null; - } - - /** - * 获取当前偏移,如果想要知道next分出的词string的起始偏移,那么用 getOffset() - string.length 就行了。 - * @return - */ - public int getOffset() - { - return offset; - } -} diff --git a/src/main/java/com/hankcs/hanlp/seg/Segment.java b/src/main/java/com/hankcs/hanlp/seg/Segment.java index 98da1ae1b..93fa37e25 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Segment.java +++ b/src/main/java/com/hankcs/hanlp/seg/Segment.java @@ -18,6 +18,7 @@ import com.hankcs.hanlp.corpus.tag.Nature; import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.dictionary.DynamicCustomDictionary; import com.hankcs.hanlp.dictionary.other.CharTable; import com.hankcs.hanlp.dictionary.other.CharType; import com.hankcs.hanlp.seg.NShort.Path.AtomNode; @@ -54,6 +55,11 @@ public Segment() config = new Config(); } + /** + * 本分词器专用的词典,默认公用 CustomDictionary.DEFAULT + */ + public DynamicCustomDictionary customDictionary = CustomDictionary.DEFAULT; + /** * 原子分词 * @@ -91,7 +97,7 @@ else if (charTypeArray[i] == CharType.CT_LETTER) nCurType = charTypeArray[pCur - start]; if (nCurType == CharType.CT_CHINESE || nCurType == CharType.CT_INDEX || - nCurType == CharType.CT_DELIMITER || nCurType == CharType.CT_OTHER) + nCurType == CharType.CT_DELIMITER || nCurType == CharType.CT_OTHER) { String single = String.valueOf(charArray[pCur]); if (single.length() != 0) @@ -169,9 +175,9 @@ protected static List quickAtomSegment(char[] charArray, int start, in // 浮点数识别 if (preType == CharType.CT_NUM && 
",,..".indexOf(charArray[offsetAtom]) != -1) { - if (offsetAtom+1 < end) + if (offsetAtom + 1 < end) { - int nextType = CharType.get(charArray[offsetAtom+1]); + int nextType = CharType.get(charArray[offsetAtom + 1]); if (nextType == CharType.CT_NUM) { continue; @@ -191,16 +197,28 @@ protected static List quickAtomSegment(char[] charArray, int start, in /** * 使用用户词典合并粗分结果 + * + * @param vertexList 粗分结果 + * @return 合并后的结果 + */ + protected List combineByCustomDictionary(List vertexList) + { + return combineByCustomDictionary(vertexList, customDictionary.dat); + } + + /** + * 使用用户词典合并粗分结果 + * * @param vertexList 粗分结果 + * @param dat 用户自定义词典 * @return 合并后的结果 */ - protected static List combineByCustomDictionary(List vertexList) + protected List combineByCustomDictionary(List vertexList, DoubleArrayTrie dat) { - assert vertexList.size() > 2 : "vertexList至少包含 始##始 和 末##末"; + assert vertexList.size() >= 2 : "vertexList至少包含 始##始 和 末##末"; Vertex[] wordNet = new Vertex[vertexList.size()]; vertexList.toArray(wordNet); // DAT合并 - DoubleArrayTrie dat = CustomDictionary.dat; int length = wordNet.length - 1; // 跳过首尾 for (int i = 1; i < length; ++i) { @@ -230,12 +248,12 @@ protected static List combineByCustomDictionary(List vertexList) } } // BinTrie合并 - if (CustomDictionary.trie != null) + if (customDictionary.trie != null) { for (int i = 1; i < length; ++i) { if (wordNet[i] == null) continue; - BaseNode state = CustomDictionary.trie.transition(wordNet[i].realWord.toCharArray(), 0); + BaseNode state = customDictionary.trie.transition(wordNet[i].realWord.toCharArray(), 0); if (state != null) { int to = i + 1; @@ -270,13 +288,27 @@ protected static List combineByCustomDictionary(List vertexList) /** * 使用用户词典合并粗分结果,并将用户词语收集到全词图中 + * + * @param vertexList 粗分结果 + * @param wordNetAll 收集用户词语到全词图中 + * @return 合并后的结果 + */ + protected List combineByCustomDictionary(List vertexList, final WordNet wordNetAll) + { + return combineByCustomDictionary(vertexList, customDictionary.dat, wordNetAll); 
+ } + + /** + * 使用用户词典合并粗分结果,并将用户词语收集到全词图中 + * * @param vertexList 粗分结果 + * @param dat 用户自定义词典 * @param wordNetAll 收集用户词语到全词图中 * @return 合并后的结果 */ - protected static List combineByCustomDictionary(List vertexList, final WordNet wordNetAll) + protected List combineByCustomDictionary(List vertexList, DoubleArrayTrie dat, final WordNet wordNetAll) { - List outputList = combineByCustomDictionary(vertexList); + List outputList = combineByCustomDictionary(vertexList, dat); int line = 0; for (final Vertex vertex : outputList) { @@ -284,7 +316,7 @@ protected static List combineByCustomDictionary(List vertexList, final int currentLine = line; if (parentLength >= 3) { - CustomDictionary.parseText(vertex.realWord, new AhoCorasickDoubleArrayTrie.IHit() + customDictionary.parseText(vertex.realWord, new AhoCorasickDoubleArrayTrie.IHit() { @Override public void hit(int begin, int end, CoreDictionary.Attribute value) @@ -301,10 +333,11 @@ public void hit(int begin, int end, CoreDictionary.Attribute value) /** * 将连续的词语合并为一个 + * * @param wordNet 词图 - * @param start 起始下标(包含) - * @param end 结束下标(不包含) - * @param value 新的属性 + * @param start 起始下标(包含) + * @param end 结束下标(不包含) + * @param value 新的属性 */ private static void combineWords(Vertex[] wordNet, int start, int end, CoreDictionary.Attribute value) { @@ -322,12 +355,64 @@ private static void combineWords(Vertex[] wordNet, int start, int end, CoreDicti sbTerm.append(realWord); wordNet[j] = null; } - wordNet[start] = new Vertex(sbTerm.toString(), value); + String realWord = sbTerm.toString(); + wordNet[start] = new Vertex(realWord, realWord, value); } } + /** + * 将一条路径转为最终结果 + * + * @param vertexList + * @param offsetEnabled 是否计算offset + * @return + */ + protected static List convert(List vertexList, boolean offsetEnabled) + { + assert vertexList != null; + assert vertexList.size() >= 2 : "这条路径不应当短于2" + vertexList.toString(); + int length = vertexList.size() - 2; + List resultList = new ArrayList(length); + Iterator iterator = 
vertexList.iterator(); + iterator.next(); + if (offsetEnabled) + { + int offset = 0; + for (int i = 0; i < length; ++i) + { + Vertex vertex = iterator.next(); + Term term = convert(vertex); + term.offset = offset; + offset += term.length(); + resultList.add(term); + } + } + else + { + for (int i = 0; i < length; ++i) + { + Vertex vertex = iterator.next(); + Term term = convert(vertex); + resultList.add(term); + } + } + return resultList; + } + + /** + * 将节点转为term + * + * @param vertex + * @return + */ + static Term convert(Vertex vertex) + { + return new Term(vertex.realWord, vertex.guessNature()); + } + /** * 合并数字 + * * @param termList */ protected void mergeNumberQuantifier(List termList, WordNet wordNetAll, Config config) @@ -388,10 +473,11 @@ protected void mergeNumberQuantifier(List termList, WordNet wordNetAll, /** * 将一个词语从词网中彻底抹除 - * @param cur 词语 + * + * @param cur 词语 * @param wordNetAll 词网 - * @param line 当前扫描的行数 - * @param length 当前缓冲区的长度 + * @param line 当前扫描的行数 + * @param length 当前缓冲区的长度 */ private static void removeFromWordNet(Vertex cur, WordNet wordNetAll, int line, int length) { @@ -420,7 +506,7 @@ private static void removeFromWordNet(Vertex cur, WordNet wordNetAll, int line, public List seg(String text) { char[] charArray = text.toCharArray(); - if (HanLP.Config.Normalization) + if (config.normalization) { CharTable.normalization(charArray); } @@ -516,7 +602,7 @@ public List seg(String text) public List seg(char[] text) { assert text != null; - if (HanLP.Config.Normalization) + if (config.normalization) { CharTable.normalization(text); } @@ -530,10 +616,22 @@ public List seg(char[] text) * @return 句子列表,每个句子由一个单词列表组成 */ public List> seg2sentence(String text) + { + return seg2sentence(text, true); + } + + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public List> seg2sentence(String text, boolean shortest) { List> resultList = new LinkedList>(); { - for (String sentence : 
SentencesUtil.toSentenceList(text)) + for (String sentence : SentencesUtil.toSentenceList(text, shortest)) { resultList.add(segSentence(sentence.toCharArray())); } @@ -637,13 +735,25 @@ public Segment enableCustomDictionary(boolean enable) return this; } + /** + * 启用新的用户词典 + * + * @param customDictionary 新的自定义词典 + */ + public Segment enableCustomDictionary(DynamicCustomDictionary customDictionary) + { + config.useCustomDictionary = true; + this.customDictionary = customDictionary; + return this; + } + /** * 是否尽可能强制使用用户词典(使用户词典的优先级尽可能高)
- * 警告:具体实现由各子类决定,可能会破坏分词器的统计特性(例如,如果用户词典 - * 含有“和服”,则“商品和服务”的分词结果可能会被用户词典的高优先级影响)。 + * 警告:具体实现由各子类决定,可能会破坏分词器的统计特性(例如,如果用户词典 + * 含有“和服”,则“商品和服务”的分词结果可能会被用户词典的高优先级影响)。 + * * @param enable * @return 分词器本身 - * * @since 1.3.5 */ public Segment enableCustomDictionaryForcing(boolean enable) @@ -694,7 +804,8 @@ public Segment enableOffset(boolean enable) /** * 是否启用数词和数量词识别
- * 即[二, 十, 一] => [二十一],[十, 九, 元] => [十九元] + * 即[二, 十, 一] => [二十一],[十, 九, 元] => [十九元] + * * @param enable * @return */ @@ -748,6 +859,7 @@ public void run() /** * 开启多线程 + * * @param enable true表示开启[系统CPU核心数]个线程,false表示单线程 * @return */ @@ -760,6 +872,7 @@ public Segment enableMultithreading(boolean enable) /** * 开启多线程 + * * @param threadNumber 线程数量 * @return */ @@ -768,4 +881,14 @@ public Segment enableMultithreading(int threadNumber) config.threadNumber = threadNumber; return this; } + + /** + * 是否执行字符正规化(繁体->简体,全角->半角,大写->小写),切换配置后必须删CustomDictionary.txt.bin缓存 + */ + public Segment enableNormalization(boolean normalization) + { + + config.normalization = normalization; + return this; + } } diff --git a/src/main/java/com/hankcs/hanlp/seg/SegmentPipeline.java b/src/main/java/com/hankcs/hanlp/seg/SegmentPipeline.java new file mode 100644 index 000000000..338324007 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/seg/SegmentPipeline.java @@ -0,0 +1,245 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-29 5:05 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.hanlp.seg; + +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.tokenizer.pipe.Pipe; + +import java.util.*; + +/** + * @author hankcs + */ +public class SegmentPipeline extends Segment implements Pipe>, List, List>> +{ + Pipe> first; + Pipe, List> last; + List, List>> pipeList; + + private SegmentPipeline(Pipe> first, Pipe, List> last) + { + this.first = first; + this.last = last; + pipeList = new ArrayList, List>>(); + } + + public SegmentPipeline(final Segment delegate) + { + this(new Pipe>() + { + @Override + public List flow(String input) + { + List task = new LinkedList(); + task.add(new Word(input, null)); + return task; + } + }, + new Pipe, List>() + { + @Override + public List flow(List input) + { + List output = new ArrayList(input.size()); + for (IWord word : input) + { + if (word.getLabel() == null) + { + output.addAll(delegate.seg(word.getValue())); + } + else + { + output.add(new Term(word.getValue(), Nature.create(word.getLabel()))); + } + } + return output; + } + }); + config = delegate.config; + } + + + @Override + protected List segSentence(char[] sentence) + { + return seg(new String(sentence)); + } + + @Override + public List seg(String text) + { + return flow(text); + } + + @Override + public List flow(String input) + { + List i = first.flow(input); + for (Pipe, List> pipe : pipeList) + { + i = pipe.flow(i); + } + return last.flow(i); + } + + @Override + public int size() + { + return pipeList.size(); + } + + @Override + public boolean isEmpty() + { + return pipeList.isEmpty(); + } + + @Override + public boolean contains(Object o) + { + return pipeList.contains(o); + } + + @Override + public Iterator, List>> iterator() + { + return pipeList.iterator(); + } + + @Override + public Object[] toArray() + { + return pipeList.toArray(); + } + + 
@Override + public T[] toArray(T[] a) + { + return pipeList.toArray(a); + } + + @Override + public boolean add(Pipe, List> pipe) + { + return pipeList.add(pipe); + } + + @Override + public boolean remove(Object o) + { + return pipeList.remove(o); + } + + @Override + public boolean containsAll(Collection c) + { + return pipeList.containsAll(c); + } + + @Override + public boolean addAll(Collection, List>> c) + { + return pipeList.addAll(c); + } + + @Override + public boolean addAll(int index, Collection, List>> c) + { + return pipeList.addAll(c); + } + + @Override + public boolean removeAll(Collection c) + { + return pipeList.removeAll(c); + } + + @Override + public boolean retainAll(Collection c) + { + return pipeList.retainAll(c); + } + + @Override + public void clear() + { + pipeList.clear(); + } + + @Override + public boolean equals(Object o) + { + return pipeList.equals(o); + } + + @Override + public int hashCode() + { + return pipeList.hashCode(); + } + + @Override + public Pipe, List> get(int index) + { + return pipeList.get(index); + } + + @Override + public Pipe, List> set(int index, Pipe, List> element) + { + return pipeList.set(index, element); + } + + @Override + public void add(int index, Pipe, List> element) + { + pipeList.add(index, element); + } + + @Override + public Pipe, List> remove(int index) + { + return pipeList.remove(index); + } + + @Override + public int indexOf(Object o) + { + return pipeList.indexOf(o); + } + + @Override + public int lastIndexOf(Object o) + { + return pipeList.lastIndexOf(o); + } + + @Override + public ListIterator, List>> listIterator() + { + return pipeList.listIterator(); + } + + @Override + public ListIterator, List>> listIterator(int index) + { + return pipeList.listIterator(index); + } + + @Override + public List, List>> subList(int fromIndex, int toIndex) + { + return pipeList.subList(fromIndex, toIndex); + } +} diff --git a/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/Graph.java 
b/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/Graph.java deleted file mode 100644 index 53738f99c..000000000 --- a/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/Graph.java +++ /dev/null @@ -1,73 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2015/1/19 21:05 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.seg.Viterbi.Path; - -import com.hankcs.hanlp.seg.common.Vertex; - -import java.util.LinkedList; -import java.util.List; - -/** - * @author hankcs - */ -public class Graph -{ - Node nodes[][]; - - public Graph(List vertexes[]) - { - nodes = new Node[vertexes.length][]; - int i = 0; - for (List vertexList : vertexes) - { - if (vertexList == null) continue; - nodes[i] = new Node[vertexList.size()]; - int j = 0; - for (Vertex vertex : vertexList) - { - nodes[i][j] = new Node(vertex); - ++j; - } - ++i; - } - } - - public List viterbi() - { - LinkedList vertexList = new LinkedList(); - for (Node node : nodes[1]) - { - node.updateFrom(nodes[0][0]); - } - for (int i = 1; i < nodes.length - 1; ++i) - { - Node[] nodeArray = nodes[i]; - if (nodeArray == null) continue; - for (Node node : nodeArray) - { - if (node.from == null) continue; - for (Node to : nodes[i + node.vertex.realWord.length()]) - { - to.updateFrom(node); - } - } - } - Node from = nodes[nodes.length - 1][0]; - while (from != null) - { - vertexList.addFirst(from.vertex); - from = from.from; - } - return vertexList; - } - -} diff --git a/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/Node.java b/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/Node.java index c5f1f8930..864f099d0 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/Node.java +++ b/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/Node.java @@ -11,8 +11,8 @@ */ package com.hankcs.hanlp.seg.Viterbi.Path; +import 
com.hankcs.hanlp.utility.MathUtility; import com.hankcs.hanlp.seg.common.Vertex; -import com.hankcs.hanlp.utility.MathTools; /** * @author hankcs @@ -39,7 +39,7 @@ public Node(Vertex vertex) public void updateFrom(Node from) { - double weight = from.weight + MathTools.calculateWeight(from.vertex, this.vertex); + double weight = from.weight + MathUtility.calculateWeight(from.vertex, this.vertex); if (this.from == null || this.weight > weight) { this.from = from; diff --git a/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/SimpleGraph.java b/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/SimpleGraph.java deleted file mode 100644 index be9e951c4..000000000 --- a/src/main/java/com/hankcs/hanlp/seg/Viterbi/Path/SimpleGraph.java +++ /dev/null @@ -1,58 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2015/4/24 22:17 - * - * - * Copyright (c) 2003-2015, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
- * - */ -package com.hankcs.hanlp.seg.Viterbi.Path; - -import com.hankcs.hanlp.seg.common.Vertex; - -import java.util.LinkedList; -import java.util.List; - -/** - * @author hankcs - */ -public class SimpleGraph -{ - LinkedList nodes[]; - public SimpleGraph(LinkedList vertexes[]) - { - nodes = vertexes; - } - - public List viterbi() - { - LinkedList vertexList = new LinkedList(); - for (Vertex node : nodes[1]) - { - node.updateFrom(nodes[0].getFirst()); - } - for (int i = 1; i < nodes.length - 1; ++i) - { - LinkedList nodeArray = nodes[i]; - if (nodeArray == null) continue; - for (Vertex node : nodeArray) - { - if (node.from == null) continue; - for (Vertex to : nodes[i + node.realWord.length()]) - { - to.updateFrom(node); - } - } - } - Vertex from = nodes[nodes.length - 1].getFirst(); - while (from != null) - { - vertexList.addFirst(from); - from = from.from; - } - return vertexList; - } -} diff --git a/src/main/java/com/hankcs/hanlp/seg/Viterbi/ViterbiSegment.java b/src/main/java/com/hankcs/hanlp/seg/Viterbi/ViterbiSegment.java index a0fee61fa..95eeefb80 100644 --- a/src/main/java/com/hankcs/hanlp/seg/Viterbi/ViterbiSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/Viterbi/ViterbiSegment.java @@ -12,34 +12,72 @@ package com.hankcs.hanlp.seg.Viterbi; import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.dictionary.DynamicCustomDictionary; import com.hankcs.hanlp.recognition.nr.JapanesePersonRecognition; import com.hankcs.hanlp.recognition.nr.PersonRecognition; import com.hankcs.hanlp.recognition.nr.TranslatedPersonRecognition; import com.hankcs.hanlp.recognition.ns.PlaceRecognition; import com.hankcs.hanlp.recognition.nt.OrganizationRecognition; -import com.hankcs.hanlp.seg.WordBasedGenerativeModelSegment; +import com.hankcs.hanlp.seg.WordBasedSegment; import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.seg.common.Vertex; import 
com.hankcs.hanlp.seg.common.WordNet; +import com.hankcs.hanlp.utility.TextUtility; +import java.io.File; import java.util.LinkedList; import java.util.List; +import static com.hankcs.hanlp.utility.Predefine.logger; + /** * Viterbi分词器
* 也是最短路分词,最短路求解采用Viterbi算法 * * @author hankcs */ -public class ViterbiSegment extends WordBasedGenerativeModelSegment +public class ViterbiSegment extends WordBasedSegment { + public ViterbiSegment() + { + } + + /** + * @param customPath 自定义字典路径(绝对路径,多词典使用英文分号隔开) + */ + public ViterbiSegment(String customPath) + { + loadCustomDic(customPath, false); + } + + /** + * @param customPath customPath 自定义字典路径(绝对路径,多词典使用英文分号隔开) + * @param cache 是否缓存词典 + */ + public ViterbiSegment(String customPath, boolean cache) + { + loadCustomDic(customPath, cache); + } + + public DoubleArrayTrie getDat() + { + return customDictionary.dat; + } + + public void setDat(DoubleArrayTrie dat) + { + this.customDictionary.dat = dat; + } + @Override protected List segSentence(char[] sentence) { // long start = System.currentTimeMillis(); WordNet wordNetAll = new WordNet(sentence); ////////////////生成词网//////////////////// - GenerateWordNet(wordNetAll); + generateWordNet(wordNetAll); ///////////////生成词图//////////////////// // System.out.println("构图:" + (System.currentTimeMillis() - start)); if (HanLP.Config.DEBUG) @@ -53,8 +91,8 @@ protected List segSentence(char[] sentence) if (config.useCustomDictionary) { if (config.indexMode > 0) - combineByCustomDictionary(vertexList, wordNetAll); - else combineByCustomDictionary(vertexList); + combineByCustomDictionary(vertexList, customDictionary.dat, wordNetAll); + else combineByCustomDictionary(vertexList, customDictionary.dat); } if (HanLP.Config.DEBUG) @@ -75,28 +113,29 @@ protected List segSentence(char[] sentence) int preSize = wordNetOptimum.size(); if (config.nameRecognize) { - PersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + PersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.translatedNameRecognize) { - TranslatedPersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + TranslatedPersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if 
(config.japaneseNameRecognize) { - JapanesePersonRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + JapanesePersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.placeRecognize) { - PlaceRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + PlaceRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (config.organizationRecognize) { // 层叠隐马模型——生成输出作为下一级隐马输入 + wordNetOptimum.clean(); vertexList = viterbi(wordNetOptimum); wordNetOptimum.clear(); wordNetOptimum.addAll(vertexList); preSize = wordNetOptimum.size(); - OrganizationRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll); + OrganizationRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); } if (wordNetOptimum.size() != preSize) { @@ -154,6 +193,29 @@ private static List viterbi(WordNet wordNet) return vertexList; } + private void loadCustomDic(String customPath, boolean isCache) + { + if (TextUtility.isBlank(customPath)) + { + return; + } + logger.info("开始加载自定义词典:" + customPath); + DoubleArrayTrie dat = new DoubleArrayTrie(); + String path[] = customPath.split(";"); + String mainPath = path[0]; + StringBuilder combinePath = new StringBuilder(); + for (String aPath : path) + { + combinePath.append(aPath.trim()); + } + File file = new File(mainPath); + mainPath = file.getParent() + "/" + Math.abs(combinePath.toString().hashCode()); + mainPath = mainPath.replace("\\", "/"); + if (DynamicCustomDictionary.loadMainDictionary(mainPath, path, dat, isCache, config.normalization)) { + this.customDictionary = new DynamicCustomDictionary(dat, null, null); + } + } + /** * 第二次维特比,可以利用前一次的结果,降低复杂度 * diff --git a/src/main/java/com/hankcs/hanlp/seg/WordBasedGenerativeModelSegment.java b/src/main/java/com/hankcs/hanlp/seg/WordBasedSegment.java similarity index 82% rename from src/main/java/com/hankcs/hanlp/seg/WordBasedGenerativeModelSegment.java rename to src/main/java/com/hankcs/hanlp/seg/WordBasedSegment.java index 
9fb32f8ee..1dcf5e650 100644 --- a/src/main/java/com/hankcs/hanlp/seg/WordBasedGenerativeModelSegment.java +++ b/src/main/java/com/hankcs/hanlp/seg/WordBasedSegment.java @@ -23,7 +23,6 @@ import com.hankcs.hanlp.seg.common.Vertex; import com.hankcs.hanlp.seg.common.WordNet; import com.hankcs.hanlp.utility.TextUtility; -import com.hankcs.hanlp.utility.Predefine; import java.util.*; @@ -32,10 +31,10 @@ * * @author hankcs */ -public abstract class WordBasedGenerativeModelSegment extends Segment +public abstract class WordBasedSegment extends Segment { - public WordBasedGenerativeModelSegment() + public WordBasedSegment() { super(); } @@ -46,7 +45,7 @@ public WordBasedGenerativeModelSegment() * @param linkedArray 粗分结果 * @param wordNetOptimum 合并了所有粗分结果的词网 */ - protected static void GenerateWord(List linkedArray, WordNet wordNetOptimum) + protected static void generateWord(List linkedArray, WordNet wordNetOptimum) { fixResultByRule(linkedArray); @@ -69,13 +68,13 @@ protected static void fixResultByRule(List linkedArray) //-------------------------------------------------------------------- //The delimiter "--" - ChangeDelimiterPOS(linkedArray); + changeDelimiterPOS(linkedArray); //-------------------------------------------------------------------- //如果前一个词是数字,当前词以“-”或“-”开始,并且不止这一个字符, //那么将此“-”符号从当前词中分离出来。 //例如 “3 / -4 / 月”需要拆分成“3 / - / 4 / 月” - SplitMiddleSlashFromDigitalWords(linkedArray); + splitMiddleSlashFromDigitalWords(linkedArray); //-------------------------------------------------------------------- //1、如果当前词是数字,下一个词是“月、日、时、分、秒、月份”中的一个,则合并,且当前词词性是时间 @@ -83,11 +82,10 @@ protected static void fixResultByRule(List linkedArray) //3、如果最后一个汉字是"点" ,则认为当前数字是时间 //4、如果当前串最后一个汉字不是"∶·./"和半角的'.''/',那么是数 //5、当前串最后一个汉字是"∶·./"和半角的'.''/',且长度大于1,那么去掉最后一个字符。例如"1." 
- CheckDateElements(linkedArray); - + checkDateElements(linkedArray); } - static void ChangeDelimiterPOS(List linkedArray) + static void changeDelimiterPOS(List linkedArray) { for (Vertex vertex : linkedArray) { @@ -103,7 +101,7 @@ static void ChangeDelimiterPOS(List linkedArray) //那么将此“-”符号从当前词中分离出来。 //例如 “3-4 / 月”需要拆分成“3 / - / 4 / 月” //==================================================================== - private static void SplitMiddleSlashFromDigitalWords(List linkedArray) + private static void splitMiddleSlashFromDigitalWords(List linkedArray) { if (linkedArray.size() < 2) return; @@ -148,7 +146,7 @@ private static void SplitMiddleSlashFromDigitalWords(List linkedArray) //4、如果当前串最后一个汉字不是"∶·./"和半角的'.''/',那么是数 //5、当前串最后一个汉字是"∶·./"和半角的'.''/',且长度大于1,那么去掉最后一个字符。例如"1." //==================================================================== - private static void CheckDateElements(List linkedArray) + private static void checkDateElements(List linkedArray) { if (linkedArray.size() < 2) return; @@ -164,26 +162,14 @@ private static void CheckDateElements(List linkedArray) String nextWord = next.realWord; if ((nextWord.length() == 1 && "月日时分秒".contains(nextWord)) || (nextWord.length() == 2 && nextWord.equals("月份"))) { - current = Vertex.newTimeInstance(current.realWord + next.realWord); - listIterator.previous(); - listIterator.previous(); - listIterator.set(current); - listIterator.next(); - listIterator.next(); - listIterator.remove(); + mergeDate(listIterator, next, current); } //===== 2、如果当前词是可以作为年份的数字,下一个词是“年”,则合并,词性为时间,否则为数字。 else if (nextWord.equals("年")) { if (TextUtility.isYearTime(current.realWord)) { - current = Vertex.newTimeInstance(current.realWord + next.realWord); - listIterator.previous(); - listIterator.previous(); - listIterator.set(current); - listIterator.next(); - listIterator.next(); - listIterator.remove(); + mergeDate(listIterator, next, current); } //===== 否则当前词就是数字了 ===== else @@ -226,43 +212,15 @@ else if (current.realWord.length() > 1) // 
logger.trace("日期识别后:" + Graph.parseResult(linkedArray)); } - /** - * 将一条路径转为最终结果 - * - * @param vertexList - * @param offsetEnabled 是否计算offset - * @return - */ - protected static List convert(List vertexList, boolean offsetEnabled) + private static void mergeDate(ListIterator listIterator, Vertex next, Vertex current) { - assert vertexList != null; - assert vertexList.size() >= 2 : "这条路径不应当短于2" + vertexList.toString(); - int length = vertexList.size() - 2; - List resultList = new ArrayList(length); - Iterator iterator = vertexList.iterator(); - iterator.next(); - if (offsetEnabled) - { - int offset = 0; - for (int i = 0; i < length; ++i) - { - Vertex vertex = iterator.next(); - Term term = convert(vertex); - term.offset = offset; - offset += term.length(); - resultList.add(term); - } - } - else - { - for (int i = 0; i < length; ++i) - { - Vertex vertex = iterator.next(); - Term term = convert(vertex); - resultList.add(term); - } - } - return resultList; + current = Vertex.newTimeInstance(current.realWord + next.realWord); + listIterator.previous(); + listIterator.previous(); + listIterator.set(current); + listIterator.next(); + listIterator.next(); + listIterator.remove(); } /** @@ -282,7 +240,7 @@ protected static List convert(List vertexList) * @param wordNet * @return */ - protected static Graph GenerateBiGraph(WordNet wordNet) + protected static Graph generateBiGraph(WordNet wordNet) { return wordNet.toGraph(); } @@ -296,7 +254,7 @@ protected static Graph GenerateBiGraph(WordNet wordNet) * @return * @deprecated 应该使用字符数组的版本 */ - private static List AtomSegment(String sSentence, int start, int end) + private static List atomSegment(String sSentence, int start, int end) { if (end < start) { @@ -337,7 +295,7 @@ else if (charTypeArray[i] == CharType.CT_LETTER) nCurType = charTypeArray[pCur]; if (nCurType == CharType.CT_CHINESE || nCurType == CharType.CT_INDEX || - nCurType == CharType.CT_DELIMITER || nCurType == CharType.CT_OTHER) + nCurType == CharType.CT_DELIMITER 
|| nCurType == CharType.CT_OTHER) { String single = String.valueOf(charArray[pCur]); if (single.length() != 0) @@ -425,7 +383,7 @@ private static void mergeContinueNumIntoOne(List linkedArray) * * @param wordNetStorage */ - protected void GenerateWordNet(final WordNet wordNetStorage) + protected void generateWordNet(final WordNet wordNetStorage) { final char[] charArray = wordNetStorage.charArray; @@ -438,7 +396,7 @@ protected void GenerateWordNet(final WordNet wordNetStorage) // 强制用户词典查询 if (config.forceCustomDictionary) { - CustomDictionary.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit() + this.customDictionary.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit() { @Override public void hit(int begin, int end, CoreDictionary.Attribute value) @@ -456,7 +414,7 @@ public void hit(int begin, int end, CoreDictionary.Attribute value) int j = i + 1; for (; j < vertexes.length - 1; ++j) { - if (!vertexes[j].isEmpty()) break; + if (!vertexes[j].isEmpty() && CharType.get(charArray[j - 1]) != CharType.CT_CNUM) break; } wordNetStorage.add(i, quickAtomSegment(charArray, i - 1, j - 1)); i = j; @@ -495,10 +453,10 @@ protected List decorateResultForIndexMode(List vertexList, WordNet { Vertex smallVertex = iterator.next(); if ( - ((termMain.nature == Nature.mq && smallVertex.hasNature(Nature.q)) || - smallVertex.realWord.length() >= config.indexMode) - && smallVertex != vertex // 防止重复添加 - && currentLine + smallVertex.realWord.length() <= line + vertex.realWord.length() // 防止超出边界 + ((termMain.nature == Nature.mq && smallVertex.hasNature(Nature.q)) || + smallVertex.realWord.length() >= config.indexMode) + && smallVertex != vertex // 防止重复添加 + && currentLine + smallVertex.realWord.length() <= line + vertex.realWord.length() // 防止超出边界 ) { listIterator.add(smallVertex); @@ -516,17 +474,6 @@ protected List decorateResultForIndexMode(List vertexList, WordNet return termList; } - /** - * 将节点转为term - * - * @param vertex - * @return - */ - private static Term convert(Vertex 
vertex)
-    {
-        return new Term(vertex.realWord, vertex.guessNature());
-    }
-
     /**
      * 词性标注
      *
@@ -536,8 +483,4 @@ protected static void speechTagging(List<Vertex> vertexList)
     {
         Viterbi.compute(vertexList, CoreDictionaryTransformMatrixDictionary.transformMatrixDictionary);
     }
-
-//    protected static void performNamedEntityRecognize(List<Vertex> vertexList, WordNet wordNetOptimum, WordNet wordNetAll)
-//    {
-//    }
 }
diff --git a/src/main/java/com/hankcs/hanlp/seg/common/CWSEvaluator.java b/src/main/java/com/hankcs/hanlp/seg/common/CWSEvaluator.java
new file mode 100644
index 000000000..096efe53b
--- /dev/null
+++ b/src/main/java/com/hankcs/hanlp/seg/common/CWSEvaluator.java
@@ -0,0 +1,266 @@
+/*
+ * Han He
+ * me@hankcs.com
+ * 2018-06-03 上午10:23
+ *
+ *
+ * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/
+ * This source is subject to Han He. Please contact Han He for more information.
+ *
+ */
+package com.hankcs.hanlp.seg.common;
+
+import com.hankcs.hanlp.corpus.io.IOUtil;
+import com.hankcs.hanlp.seg.Segment;
+
+import java.io.BufferedWriter;
+import java.io.IOException;
+import java.util.*;
+
+/**
+ * 中文分词评测工具
+ *
+ * @author hankcs
+ */
+public class CWSEvaluator
+{
+    private int A_size, B_size, A_cap_B_size, OOV, OOV_R, IV, IV_R;
+    private Set<String> dic;
+
+    public CWSEvaluator()
+    {
+    }
+
+    public CWSEvaluator(Set<String> dic)
+    {
+        this.dic = dic;
+    }
+
+    public CWSEvaluator(String dictPath) throws IOException
+    {
+        this(new TreeSet<String>());
+        if (dictPath == null) return;
+        try
+        {
+            IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(dictPath);
+            for (String word : lineIterator)
+            {
+                word = word.trim();
+                if (word.isEmpty()) continue;
+                dic.add(word);
+            }
+        }
+        catch (Exception e)
+        {
+            throw new IOException(e);
+        }
+    }
+
+    /**
+     * 获取PRF
+     *
+     * @param percentage 百分制
+     * @return
+     */
+    public Result getResult(boolean percentage)
+    {
+        float p = A_cap_B_size / (float) B_size;
+        float r = A_cap_B_size / (float) A_size;
+        if (percentage)
+        {
+            p *= 100;
+            r *= 100;
+        }
+
float oov_r = Float.NaN; + if (OOV > 0) + { + oov_r = OOV_R / (float) OOV; + if (percentage) + oov_r *= 100; + } + float iv_r = Float.NaN; + if (IV > 0) + { + iv_r = IV_R / (float) IV; + if (percentage) + iv_r *= 100; + } + return new Result(p, r, 2 * p * r / (p + r), oov_r, iv_r); + } + + + /** + * 获取PRF + * + * @return + */ + public Result getResult() + { + return getResult(true); + } + + /** + * 比较标准答案与分词结果 + * + * @param gold + * @param pred + */ + public void compare(String gold, String pred) + { + String[] wordArray = gold.split("\\s+"); + A_size += wordArray.length; + String[] predArray = pred.split("\\s+"); + B_size += predArray.length; + + int goldIndex = 0, predIndex = 0; + int goldLen = 0, predLen = 0; + + while (goldIndex < wordArray.length && predIndex < predArray.length) + { + if (goldLen == predLen) + { + if (wordArray[goldIndex].equals(predArray[predIndex])) + { + if (dic != null) + { + if (dic.contains(wordArray[goldIndex])) + IV_R += 1; + else + OOV_R += 1; + } + A_cap_B_size++; + goldLen += wordArray[goldIndex].length(); + predLen += predArray[predIndex].length(); + goldIndex++; + predIndex++; + } + else + { + goldLen += wordArray[goldIndex].length(); + predLen += predArray[predIndex].length(); + goldIndex++; + predIndex++; + } + } + else if (goldLen < predLen) + { + goldLen += wordArray[goldIndex].length(); + goldIndex++; + } + else + { + predLen += predArray[predIndex].length(); + predIndex++; + } + } + + if (dic != null) + { + for (String word : wordArray) + { + if (dic.contains(word)) + IV += 1; + else + OOV += 1; + } + } + } + + /** + * 在标准答案与分词结果上执行评测 + * + * @param goldFile + * @param predFile + * @return + */ + public static Result evaluate(String goldFile, String predFile) throws IOException + { + return evaluate(goldFile, predFile, null); + } + + /** + * 标准化评测分词器 + * + * @param segment 分词器 + * @param outputPath 分词预测输出文件 + * @param goldFile 测试集segmented file + * @param dictPath 训练集单词列表 + * @return 一个储存准确率的结构 + * @throws IOException + */ 
+ public static CWSEvaluator.Result evaluate(Segment segment, String outputPath, String goldFile, String dictPath) throws IOException + { + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(goldFile); + BufferedWriter bw = IOUtil.newBufferedWriter(outputPath); + for (String line : lineIterator) + { + List termList = segment.seg(line.replaceAll("\\s+", "")); // 一些testFile与goldFile根本不匹配,比如MSR的testFile有些行缺少单词,所以用goldFile去掉空格代替 + int i = 0; + for (Term term : termList) + { + bw.write(term.word); + if (++i != termList.size()) + bw.write(" "); + } + bw.newLine(); + } + bw.close(); + CWSEvaluator.Result result = CWSEvaluator.evaluate(goldFile, outputPath, dictPath); + return result; + } + + /** + * 标准化评测分词器 + * + * @param segment 分词器 + * @param testFile 测试集raw text + * @param outputPath 分词预测输出文件 + * @param goldFile 测试集segmented file + * @param dictPath 训练集单词列表 + * @return 一个储存准确率的结构 + * @throws IOException + */ + public static CWSEvaluator.Result evaluate(Segment segment, String testFile, String outputPath, String goldFile, String dictPath) throws IOException + { + return evaluate(segment, outputPath, goldFile, dictPath); + } + + /** + * 在标准答案与分词结果上执行评测 + * + * @param goldFile + * @param predFile + * @return + */ + public static Result evaluate(String goldFile, String predFile, String dictPath) throws IOException + { + IOUtil.LineIterator goldIter = new IOUtil.LineIterator(goldFile); + IOUtil.LineIterator predIter = new IOUtil.LineIterator(predFile); + CWSEvaluator evaluator = new CWSEvaluator(dictPath); + while (goldIter.hasNext() && predIter.hasNext()) + { + evaluator.compare(goldIter.next(), predIter.next()); + } + return evaluator.getResult(); + } + + public static class Result + { + public float P, R, F1, OOV_R, IV_R; + + public Result(float p, float r, float f1, float OOV_R, float IV_R) + { + P = p; + R = r; + F1 = f1; + this.OOV_R = OOV_R; + this.IV_R = IV_R; + } + + @Override + public String toString() + { + return String.format("P:%.2f R:%.2f F1:%.2f 
OOV-R:%.2f IV-R:%.2f", P, R, F1, OOV_R, IV_R); + } + } +} diff --git a/src/main/java/com/hankcs/hanlp/seg/common/Vertex.java b/src/main/java/com/hankcs/hanlp/seg/common/Vertex.java index deed6204d..ff3354e4e 100644 --- a/src/main/java/com/hankcs/hanlp/seg/common/Vertex.java +++ b/src/main/java/com/hankcs/hanlp/seg/common/Vertex.java @@ -11,9 +11,9 @@ */ package com.hankcs.hanlp.seg.common; +import com.hankcs.hanlp.utility.MathUtility; import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.corpus.tag.Nature; -import com.hankcs.hanlp.utility.MathTools; import com.hankcs.hanlp.utility.Predefine; import java.util.Map; @@ -62,7 +62,7 @@ public class Vertex public void updateFrom(Vertex from) { - double weight = from.weight + MathTools.calculateWeight(from, this); + double weight = from.weight + MathUtility.calculateWeight(from, this); if (this.from == null || this.weight > weight) { this.from = from; @@ -103,75 +103,58 @@ private String compileRealWord(String realWord, CoreDictionary.Attribute attribu { if (attribute.nature.length == 1) { - switch (attribute.nature[0]) + Nature nature = attribute.nature[0]; + if (nature.startsWith("nr")) { - case nr: - case nr1: - case nr2: - case nrf: - case nrj: - { - wordID = CoreDictionary.NR_WORD_ID; + wordID = CoreDictionary.NR_WORD_ID; // this.attribute = CoreDictionary.get(CoreDictionary.NR_WORD_ID); - return Predefine.TAG_PEOPLE; - } - case ns: - case nsf: - { - wordID = CoreDictionary.NS_WORD_ID; - // 在地名识别的时候,希望类似"河镇"的词语保持自己的词性,而不是未##地的词性 + return Predefine.TAG_PEOPLE; + } + else if (nature.startsWith("ns")) + { + wordID = CoreDictionary.NS_WORD_ID; + // 在地名识别的时候,希望类似"河镇"的词语保持自己的词性,而不是未##地的词性 // this.attribute = CoreDictionary.get(CoreDictionary.NS_WORD_ID); - return Predefine.TAG_PLACE; - } + return Predefine.TAG_PLACE; + } // case nz: - case nx: - { - wordID = CoreDictionary.NX_WORD_ID; - if (wordID == -1) - wordID = CoreDictionary.X_WORD_ID; + else if (nature == Nature.nx) + { + wordID = 
CoreDictionary.NX_WORD_ID; + if (wordID == -1) + wordID = CoreDictionary.X_WORD_ID; // this.attribute = CoreDictionary.get(wordID); - return Predefine.TAG_PROPER; - } - case nt: - case ntc: - case ntcf: - case ntcb: - case ntch: - case nto: - case ntu: - case nts: - case nth: - case nit: - { - wordID = CoreDictionary.NT_WORD_ID; + return Predefine.TAG_PROPER; + } + else if (nature.startsWith("nt") || nature == Nature.nit) + { + wordID = CoreDictionary.NT_WORD_ID; // this.attribute = CoreDictionary.get(CoreDictionary.NT_WORD_ID); - return Predefine.TAG_GROUP; - } - case m: - case mq: - { - wordID = CoreDictionary.M_WORD_ID; - this.attribute = CoreDictionary.get(CoreDictionary.M_WORD_ID); - return Predefine.TAG_NUMBER; - } - case x: - { - wordID = CoreDictionary.X_WORD_ID; - this.attribute = CoreDictionary.get(CoreDictionary.X_WORD_ID); - return Predefine.TAG_CLUSTER; - } + return Predefine.TAG_GROUP; + } + else if (nature.startsWith('m')) + { + wordID = CoreDictionary.M_WORD_ID; + this.attribute = CoreDictionary.get(CoreDictionary.M_WORD_ID); + return Predefine.TAG_NUMBER; + } + else if (nature.startsWith('x')) + { + wordID = CoreDictionary.X_WORD_ID; + this.attribute = CoreDictionary.get(CoreDictionary.X_WORD_ID); + return Predefine.TAG_CLUSTER; + } // case xx: // case w: // { // word= Predefine.TAG_OTHER; // } // break; - case t: - { - wordID = CoreDictionary.T_WORD_ID; - this.attribute = CoreDictionary.get(CoreDictionary.T_WORD_ID); - return Predefine.TAG_TIME; - } + else if (nature == Nature.t) + { + wordID = CoreDictionary.T_WORD_ID; + this.attribute = CoreDictionary.get(CoreDictionary.T_WORD_ID); + return Predefine.TAG_TIME; } } @@ -229,6 +212,16 @@ public String getRealWord() return realWord; } + public Vertex getFrom() + { + return from; + } + + public void setFrom(Vertex from) + { + this.from = from; + } + /** * 获取词的属性 * @@ -271,13 +264,13 @@ public boolean confirmNature(Nature nature) */ public boolean confirmNature(Nature nature, boolean updateWord) { - 
switch (nature) + switch (nature.firstChar()) { - case m: + case 'm': word = Predefine.TAG_NUMBER; break; - case t: + case 't': word = Predefine.TAG_TIME; break; default: @@ -459,7 +452,8 @@ public static Vertex newTimeInstance(String realWord) */ public static Vertex newB() { - return new Vertex(Predefine.TAG_BIGIN, " ", new CoreDictionary.Attribute(Nature.begin, Predefine.MAX_FREQUENCY / 10), CoreDictionary.getWordID(Predefine.TAG_BIGIN)); + int wordId = CoreDictionary.BEGIN_WORD_ID; + return new Vertex(Predefine.TAG_BIGIN, " ", new CoreDictionary.Attribute(Nature.begin, Predefine.TOTAL_FREQUENCY / 10), wordId); } /** @@ -468,7 +462,13 @@ public static Vertex newB() */ public static Vertex newE() { - return new Vertex(Predefine.TAG_END, " ", new CoreDictionary.Attribute(Nature.end, Predefine.MAX_FREQUENCY / 10), CoreDictionary.getWordID(Predefine.TAG_END)); + int wordId = CoreDictionary.END_WORD_ID; + return new Vertex(Predefine.TAG_END, " ", new CoreDictionary.Attribute(Nature.end, Predefine.TOTAL_FREQUENCY / 10), wordId); + } + + public int length() + { + return realWord.length(); } @Override diff --git a/src/main/java/com/hankcs/hanlp/seg/common/WordNet.java b/src/main/java/com/hankcs/hanlp/seg/common/WordNet.java index 8f6fa0360..717978a5a 100644 --- a/src/main/java/com/hankcs/hanlp/seg/common/WordNet.java +++ b/src/main/java/com/hankcs/hanlp/seg/common/WordNet.java @@ -11,11 +11,11 @@ */ package com.hankcs.hanlp.seg.common; +import com.hankcs.hanlp.utility.MathUtility; import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.dictionary.other.CharType; import com.hankcs.hanlp.corpus.tag.Nature; import com.hankcs.hanlp.seg.NShort.Path.AtomNode; -import com.hankcs.hanlp.utility.MathTools; import com.hankcs.hanlp.utility.Predefine; import java.util.Iterator; @@ -147,46 +147,29 @@ public void insert(int line, Vertex vertex, WordNet wordNetAll) } vertexes[line].add(vertex); ++size; - // 保证连接 - for (int l = line - 1; l > 1; --l) + // 保证这个词语前面直连 + 
final int start = Math.max(0, line - 5); // 效率起见,只扫描前4行 + for (int l = line - 1; l > start; --l) { - if (get(l, 1) == null) + LinkedList all = wordNetAll.get(l); + if (all.size() <= vertexes[l].size()) + continue; + for (Vertex pre : all) { - Vertex first = wordNetAll.getFirst(l); - if (first == null) break; - vertexes[l].add(first); - ++size; - if (vertexes[l].size() > 1) break; - } - else - { - break; + if (pre.length() + l == line) + { + vertexes[l].add(pre); + ++size; + } } } - // 首先保证这个词语可直达 + // 保证这个词语后面直连 int l = line + vertex.realWord.length(); - if (get(l).size() == 0) + LinkedList targetLine = wordNetAll.get(l); + if (vertexes[l].size() == 0 && targetLine.size() != 0) // 有时候vertexes里面的词语已经经过用户词典合并,造成数量更少 { - List targetLine = wordNetAll.get(l); - if (targetLine == null || targetLine.size() == 0) return; - vertexes[l].addAll(targetLine); size += targetLine.size(); - } - // 直达之后一直往后 - for (++l; l < vertexes.length; ++l) - { - if (get(l).size() == 0) - { - Vertex first = wordNetAll.getFirst(l); - if (first == null) break; - vertexes[l].add(first); - ++size; - if (vertexes[l].size() > 1) break; - } - else - { - break; - } + vertexes[l] = targetLine; } } @@ -211,13 +194,14 @@ public void addAll(List vertexList) * @param line 行号 * @return 一个数组 */ - public List get(int line) + public LinkedList get(int line) { return vertexes[line]; } /** * 获取某一行的逆序迭代器 + * * @param line 行号 * @return 逆序迭代器 */ @@ -299,7 +283,7 @@ public void add(int line, List atomSegment) break; } // 这些通用符的量级都在10万左右 - add(line + offset, new Vertex(sWord, atomNode.sWord, new CoreDictionary.Attribute(nature, 10000), id)); + add(line + offset, new Vertex(sWord, atomNode.sWord, new CoreDictionary.Attribute(nature, Predefine.OOV_DEFAULT_FREQUENCY), id)); offset += atomNode.sWord.length(); } } @@ -348,7 +332,7 @@ public Graph toGraph() int toIndex = row + from.realWord.length(); for (Vertex to : vertexes[toIndex]) { - graph.connect(from.index, to.index, MathTools.calculateWeight(from, to)); + 
graph.connect(from.index, to.index, MathUtility.calculateWeight(from, to)); } } } @@ -416,6 +400,20 @@ public void clear() size = 0; } + /** + * 清理from属性 + */ + public void clean() + { + for (List vertexList : vertexes) + { + for (Vertex vertex : vertexList) + { + vertex.from = null; + } + } + } + /** * 获取内部顶点表格,谨慎操作! * diff --git a/src/main/java/com/hankcs/hanlp/suggest/Suggester.java b/src/main/java/com/hankcs/hanlp/suggest/Suggester.java index 6c17c0a5c..9a8871eaf 100644 --- a/src/main/java/com/hankcs/hanlp/suggest/Suggester.java +++ b/src/main/java/com/hankcs/hanlp/suggest/Suggester.java @@ -86,7 +86,7 @@ public List suggest(String key, int size) { Double score = scoreMap.get(entry.getKey()); if (score == null) score = 0.0; - scoreMap.put(entry.getKey(), score / max + entry.getValue() * scorer.boost); + scoreMap.put(entry.getKey(), score + entry.getValue() * scorer.boost / max); } } for (Map.Entry> entry : sortScoreMap(scoreMap).entrySet()) diff --git a/src/main/java/com/hankcs/hanlp/summary/BM25.java b/src/main/java/com/hankcs/hanlp/summary/BM25.java index b9b12dbb3..e16e16630 100644 --- a/src/main/java/com/hankcs/hanlp/summary/BM25.java +++ b/src/main/java/com/hankcs/hanlp/summary/BM25.java @@ -109,6 +109,13 @@ private void init() } } + /** + * 计算一个句子与一个文档的BM25相似度 + * + * @param sentence 句子(查询语句) + * @param index 文档(用语料库中的下标表示) + * @return BM25 score + */ public double sim(List sentence, int index) { double score = 0; @@ -116,9 +123,9 @@ public double sim(List sentence, int index) { if (!f[index].containsKey(word)) continue; int d = docs.get(index).size(); - Integer wf = f[index].get(word); - score += (idf.get(word) * wf * (k1 + 1) - / (wf + k1 * (1 - b + b * d + Integer tf = f[index].get(word); + score += (idf.get(word) * tf * (k1 + 1) + / (tf + k1 * (1 - b + b * d / avgdl))); } diff --git a/src/main/java/com/hankcs/hanlp/summary/KeywordExtractor.java b/src/main/java/com/hankcs/hanlp/summary/KeywordExtractor.java index eeeac3aba..d63b39e24 100644 --- 
a/src/main/java/com/hankcs/hanlp/summary/KeywordExtractor.java +++ b/src/main/java/com/hankcs/hanlp/summary/KeywordExtractor.java @@ -16,16 +16,31 @@ import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.tokenizer.StandardTokenizer; +import java.util.Collection; +import java.util.List; +import java.util.ListIterator; + /** * 提取关键词的基类 + * * @author hankcs */ -public class KeywordExtractor +public abstract class KeywordExtractor { /** * 默认分词器 */ - Segment defaultSegment = StandardTokenizer.SEGMENT; + protected Segment defaultSegment; + + public KeywordExtractor(Segment defaultSegment) + { + this.defaultSegment = defaultSegment; + } + + public KeywordExtractor() + { + this(StandardTokenizer.SEGMENT); + } /** * 是否应当将这个term纳入计算,词性属于名词、动词、副词、形容词 @@ -33,44 +48,15 @@ public class KeywordExtractor * @param term * @return 是否应当 */ - public boolean shouldInclude(Term term) + protected boolean shouldInclude(Term term) { // 除掉停用词 - if (term.nature == null) return false; - String nature = term.nature.toString(); - char firstChar = nature.charAt(0); - switch (firstChar) - { - case 'm': - case 'b': - case 'c': - case 'e': - case 'o': - case 'p': - case 'q': - case 'u': - case 'y': - case 'z': - case 'r': - case 'w': - { - return false; - } - default: - { - if (term.word.trim().length() > 1 && !CoreStopWordDictionary.contains(term.word)) - { - return true; - } - } - break; - } - - return false; + return CoreStopWordDictionary.shouldInclude(term); } /** * 设置关键词提取器使用的分词器 + * * @param segment 任何开启了词性标注的分词器 * @return 自己 */ @@ -79,4 +65,44 @@ public KeywordExtractor setSegment(Segment segment) defaultSegment = segment; return this; } + + public Segment getSegment() + { + return defaultSegment; + } + + /** + * 提取关键词 + * + * @param document 文章 + * @param size 需要几个关键词 + * @return 关键词列表 + */ + public List getKeywords(String document, int size) + { + return getKeywords(defaultSegment.seg(document), size); + } + + /** + * 提取关键词(top 10) + * + * @param document 文章 + * @return 关键词列表 + */ + public 
List getKeywords(String document) + { + return getKeywords(defaultSegment.seg(document), 10); + } + + protected void filter(List termList) + { + ListIterator listIterator = termList.listIterator(); + while (listIterator.hasNext()) + { + if (!shouldInclude(listIterator.next())) + listIterator.remove(); + } + } + + abstract public List getKeywords(List termList, int size); } diff --git a/src/main/java/com/hankcs/hanlp/summary/TextRankKeyword.java b/src/main/java/com/hankcs/hanlp/summary/TextRankKeyword.java index 21e83c85d..aab9c2c01 100644 --- a/src/main/java/com/hankcs/hanlp/summary/TextRankKeyword.java +++ b/src/main/java/com/hankcs/hanlp/summary/TextRankKeyword.java @@ -2,6 +2,7 @@ import com.hankcs.hanlp.algorithm.MaxHeap; +import com.hankcs.hanlp.seg.Segment; import com.hankcs.hanlp.seg.common.Term; import java.util.*; @@ -13,10 +14,6 @@ */ public class TextRankKeyword extends KeywordExtractor { - /** - * 提取多少个关键字 - */ - int nKeyword = 10; /** * 阻尼系数(DampingFactor),一般取值为0.85 */ @@ -27,6 +24,15 @@ public class TextRankKeyword extends KeywordExtractor public static int max_iter = 200; final static float min_diff = 0.001f; + public TextRankKeyword(Segment defaultSegment) + { + super(defaultSegment); + } + + public TextRankKeyword() + { + } + /** * 提取关键词 * @@ -37,9 +43,8 @@ public class TextRankKeyword extends KeywordExtractor public static List getKeywordList(String document, int size) { TextRankKeyword textRankKeyword = new TextRankKeyword(); - textRankKeyword.nKeyword = size; - return textRankKeyword.getKeyword(document); + return textRankKeyword.getKeywords(document, size); } /** @@ -47,16 +52,11 @@ public static List getKeywordList(String document, int size) * * @param content * @return + * @deprecated 请使用 {@link KeywordExtractor#getKeywords(java.lang.String)} */ public List getKeyword(String content) { - Set> entrySet = getTermAndRank(content, nKeyword).entrySet(); - List result = new ArrayList(entrySet.size()); - for (Map.Entry entry : entrySet) - { - 
result.add(entry.getKey()); - } - return result; + return getKeywords(content); } /** @@ -69,7 +69,7 @@ public Map getTermAndRank(String content) { assert content != null; List termList = defaultSegment.seg(content); - return getRank(termList); + return getTermAndRank(termList); } /** @@ -79,9 +79,16 @@ public Map getTermAndRank(String content) * @param size * @return */ - public Map getTermAndRank(String content, Integer size) + public Map getTermAndRank(String content, int size) { Map map = getTermAndRank(content); + Map result = top(size, map); + + return result; + } + + private Map top(int size, Map map) + { Map result = new LinkedHashMap(); for (Map.Entry entry : new MaxHeap>(size, new Comparator>() { @@ -94,7 +101,6 @@ public int compare(Map.Entry o1, Map.Entry o2) { result.put(entry.getKey(), entry.getValue()); } - return result; } @@ -104,7 +110,7 @@ public int compare(Map.Entry o1, Map.Entry o2) * @param termList * @return */ - public Map getRank(List termList) + public Map getTermAndRank(List termList) { List wordList = new ArrayList(termList.size()); for (Term t : termList) @@ -142,6 +148,11 @@ public Map getRank(List termList) } // System.out.println(words); Map score = new HashMap(); + //依据TF来设置初值 + for (Map.Entry> entry : words.entrySet()) + { + score.put(entry.getKey(), sigMoid(entry.getValue().size())); + } for (int i = 0; i < max_iter; ++i) { Map m = new HashMap(); @@ -165,4 +176,27 @@ public Map getRank(List termList) return score; } + + /** + * sigmoid函数 + * + * @param value + * @return + */ + public static float sigMoid(float value) + { + return (float) (1d / (1d + Math.exp(-value))); + } + + @Override + public List getKeywords(List termList, int size) + { + Set> entrySet = top(size, getTermAndRank(termList)).entrySet(); + List result = new ArrayList(entrySet.size()); + for (Map.Entry entry : entrySet) + { + result.add(entry.getKey()); + } + return result; + } } diff --git a/src/main/java/com/hankcs/hanlp/summary/TextRankSentence.java 
b/src/main/java/com/hankcs/hanlp/summary/TextRankSentence.java index f7217fbf9..2cc21c522 100644 --- a/src/main/java/com/hankcs/hanlp/summary/TextRankSentence.java +++ b/src/main/java/com/hankcs/hanlp/summary/TextRankSentence.java @@ -92,7 +92,7 @@ private void solve() vertex[cnt] = 1.0; ++cnt; } - for (int _ = 0; _ < max_iter; ++_) + for (int iter = 0; iter < max_iter; ++iter) { double[] m = new double[D]; double max_diff = 0; diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/BasicTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/BasicTokenizer.java index 793db781c..4ec136c48 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/BasicTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/tokenizer/BasicTokenizer.java @@ -56,4 +56,16 @@ public static List> seg2sentence(String text) { return SEGMENT.seg2sentence(text); } + + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public static List> seg2sentence(String text, boolean shortest) + { + return SEGMENT.seg2sentence(text, shortest); + } } diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/IndexTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/IndexTokenizer.java index c36e3509d..e1b8b3e68 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/IndexTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/tokenizer/IndexTokenizer.java @@ -52,4 +52,16 @@ public static List> seg2sentence(String text) { return SEGMENT.seg2sentence(text); } + + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public static List> seg2sentence(String text, boolean shortest) + { + return SEGMENT.seg2sentence(text, shortest); + } } diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/NLPTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/NLPTokenizer.java index d9b4275d5..f70fe8b7b 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/NLPTokenizer.java +++ 
b/src/main/java/com/hankcs/hanlp/tokenizer/NLPTokenizer.java @@ -11,15 +11,16 @@ */ package com.hankcs.hanlp.tokenizer; -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.seg.Segment; -import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; +import java.io.IOException; import java.util.List; /** - * 可供自然语言处理用的分词器 + * 可供自然语言处理用的分词器,更重视准确率。 * * @author hankcs */ @@ -28,32 +29,68 @@ public class NLPTokenizer /** * 预置分词器 */ - public static final Segment SEGMENT = HanLP.newSegment().enableNameRecognize(true).enableTranslatedNameRecognize(true) - .enableJapaneseNameRecognize(true).enablePlaceRecognize(true).enableOrganizationRecognize(true) - .enablePartOfSpeechTagging(true); + public static AbstractLexicalAnalyzer ANALYZER; + + static + { + try + { + // 目前感知机的效果相当不错,如果能在更大的语料库上训练就更好了 + ANALYZER = new PerceptronLexicalAnalyzer(); + } + catch (IOException e) + { + throw new RuntimeException(e); + } + } public static List segment(String text) { - return SEGMENT.seg(text); + return ANALYZER.seg(text); } /** * 分词 + * * @param text 文本 * @return 分词结果 */ public static List segment(char[] text) { - return SEGMENT.seg(text); + return ANALYZER.seg(text); } /** * 切分为句子形式 + * * @param text 文本 * @return 句子列表 */ public static List> seg2sentence(String text) { - return SEGMENT.seg2sentence(text); + return ANALYZER.seg2sentence(text); + } + + /** + * 词法分析 + * + * @param sentence + * @return 结构化句子 + */ + public static Sentence analyze(final String sentence) + { + return ANALYZER.analyze(sentence); + } + + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public static List> seg2sentence(String text, boolean shortest) + { + return ANALYZER.seg2sentence(text, shortest); } } 
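Reviewer note: the `wf` → `tf` rename in the `summary/BM25.java` hunk above is purely cosmetic; the score it computes is the standard BM25 term weight `idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * d / avgdl))`. The sketch below checks that expression standalone, without HanLP's classes. The tiny corpus, the `k1`/`b` values, and the log-based idf are illustrative assumptions, not HanLP's actual fields or defaults; only the score expression mirrors the patched `sim()`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * Standalone sketch of the BM25 term score patched in summary/BM25.java.
 * Corpus and parameters are illustrative; only the score formula is shared.
 */
public class Bm25Sketch
{
    static final double k1 = 1.5; // term-frequency saturation (assumed value)
    static final double b = 0.75; // document-length normalization (assumed value)

    static double sim(List<String> query, List<List<String>> docs, int index)
    {
        double avgdl = 0; // average document length over the corpus
        for (List<String> doc : docs) avgdl += doc.size();
        avgdl /= docs.size();

        List<String> doc = docs.get(index);
        double score = 0;
        for (String word : query)
        {
            int tf = Collections.frequency(doc, word); // plays the role of f[index].get(word)
            if (tf == 0) continue;
            int df = 0; // number of documents containing the word
            for (List<String> d : docs) if (d.contains(word)) ++df;
            double idf = Math.log(docs.size() - df + 0.5) - Math.log(df + 0.5);
            double d = doc.size();
            // the expression fixed in the patch: idf * tf * (k1+1) / (tf + k1*(1 - b + b*d/avgdl))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * d / avgdl));
        }
        return score;
    }

    public static void main(String[] args)
    {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("自然", "语言", "处理"),
                Arrays.asList("计算", "视觉"),
                Arrays.asList("深度", "学习", "模型", "训练"));
        System.out.println(sim(Arrays.asList("语言", "处理"), docs, 0)); // positive: doc 0 shares terms
        System.out.println(sim(Arrays.asList("语言", "处理"), docs, 1)); // 0.0: no shared terms
    }
}
```

A rarer term (higher idf) or a shorter document (smaller `d/avgdl`) both raise the score, which is the behavior the `1 - b + b * d / avgdl` denominator encodes.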
diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/NotionalTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/NotionalTokenizer.java index 1aea17f89..038aadfcc 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/NotionalTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/tokenizer/NotionalTokenizer.java @@ -82,6 +82,18 @@ public static List> seg2sentence(String text) return sentenceList; } + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public static List> seg2sentence(String text, boolean shortest) + { + return SEGMENT.seg2sentence(text, shortest); + } + /** * 切分为句子形式 * diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/SpeedTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/SpeedTokenizer.java index c957af7bc..82fcb5550 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/SpeedTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/tokenizer/SpeedTokenizer.java @@ -51,4 +51,16 @@ public static List> seg2sentence(String text) { return SEGMENT.seg2sentence(text); } + + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public static List> seg2sentence(String text, boolean shortest) + { + return SEGMENT.seg2sentence(text, shortest); + } } diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/StandardTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/StandardTokenizer.java index baedccbd7..5bc89c9d6 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/StandardTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/tokenizer/StandardTokenizer.java @@ -59,4 +59,16 @@ public static List> seg2sentence(String text) { return SEGMENT.seg2sentence(text); } + + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public static List> seg2sentence(String text, boolean shortest) + { + return SEGMENT.seg2sentence(text, shortest); + } 
} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/TraditionalChineseTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/TraditionalChineseTokenizer.java index c5840a482..818077188 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/TraditionalChineseTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/tokenizer/TraditionalChineseTokenizer.java @@ -40,18 +40,9 @@ private static List segSentence(String text) int offset = 0; for (Term term : termList) { - String tText; term.offset = offset; - if (term.length() == 1 || (tText = SimplifiedChineseDictionary.getTraditionalChinese(term.word)) == null) - { - term.word = text.substring(offset, offset + term.length()); - offset += term.length(); - } - else - { - offset += term.length(); - term.word = tText; - } + term.word = text.substring(offset, offset + term.length()); + offset += term.length(); } return termList; @@ -97,4 +88,16 @@ public static List> seg2sentence(String text) return resultList; } + + /** + * 分词断句 输出句子形式 + * + * @param text 待分词句子 + * @param shortest 是否断句为最细的子句(将逗号也视作分隔符) + * @return 句子列表,每个句子由一个单词列表组成 + */ + public static List> seg2sentence(String text, boolean shortest) + { + return SEGMENT.seg2sentence(text, shortest); + } } diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/URLTokenizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/URLTokenizer.java index d1c7b68af..b0a775aaf 100644 --- a/src/main/java/com/hankcs/hanlp/tokenizer/URLTokenizer.java +++ b/src/main/java/com/hankcs/hanlp/tokenizer/URLTokenizer.java @@ -31,7 +31,7 @@ public class URLTokenizer * 预置分词器 */ public static final Segment SEGMENT = HanLP.newSegment(); - private static final Pattern WEB_URL = 
Pattern.compile("((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?(?:(((([a-zA-Z0-9][a-zA-Z0-9\\-]*)*[a-zA-Z0-9]\\.)+((aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(biz|b[abdefghijmnorstvwyz])|(cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(edu|e[cegrstu])|f[ijkmor]|(gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(info|int|i[delmnoqrst])|(jobs|j[emop])|k[eghimnprwyz]|l[abcikrstuvy]|(mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])|(name|net|n[acefgilopruz])|(org|om)|(pro|p[aefghklmnrstwy])|qa|r[eosuw]|s[abcdeghijklmnortuvyz]|(tel|travel|t[cdfghjklmnoprtvwz])|u[agksyz]|v[aceginu]|w[fs]|(δοκιμή|испытание|рф|срб|טעסט|آزمایشی|إختبار|الاردن|الجزائر|السعودية|المغرب|امارات|بھارت|تونس|سورية|فلسطين|قطر|مصر|परीक्षा|भारत|ভারত|ਭਾਰਤ|ભારત|இந்தியா|இலங்கை|சிங்கப்பூர்|பரிட்சை|భారత్|ලංකා|ไทย|テスト|中国|中國|台湾|台灣|新加坡|测试|測試|香港|테스트|한국|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)|y[et]|z[amw]))|((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9]))))(?:\\:\\d{1,5})?)(\\/(?:(?:[a-zA-Z0-
9\\;\\/\\?\\:\\@\\&\\=\\#\\~\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"); + private static final Pattern WEB_URL = Pattern.compile("((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?((?:(?:[a-zA-Z0-9\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF][a-zA-Z0-9\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF\\-]{0,64}\\.)+(?:(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(?:biz|b[abdefghijmnorstvwyz])|(?:cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(?:edu|e[cegrstu])|f[ijkmor]|(?:gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(?:info|int|i[delmnoqrst])|(?:jobs|j[emop])|k[eghimnprwyz]|l[abcikrstuvy]|(?:mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])|(?:name|net|n[acefgilopruz])|(?:org|om)|(?:pro|p[aefghklmnrstwy])|qa|r[eosuw]|s[abcdeghijklmnortuvyz]|(?:tel|travel|t[cdfghjklmnoprtvwz])|u[agksyz]|v[aceginu]|w[fs]|(?:xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-80akhbyknj4f|xn\\-\\-9t4b11yi5a|xn\\-\\-deba0ad|xn\\-\\-g6w251d|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-zckzah)|y[etu]|z[amw]))|(?:(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])))(?:\\:\\d{1,5})?)(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\@\\&\\=\\#\\~\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?(?:\\b|$)"); /** * 分词 diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/lexical/AbstractLexicalAnalyzer.java b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/AbstractLexicalAnalyzer.java new file mode 100644 index 000000000..ca9414678 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/AbstractLexicalAnalyzer.java @@ -0,0 +1,737 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 下午7:42 + * + * 
+ * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. + * + */ +package com.hankcs.hanlp.tokenizer.lexical; + +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.bintrie.BaseNode; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.dictionary.other.CharType; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.recognition.nr.JapanesePersonRecognition; +import com.hankcs.hanlp.recognition.nr.TranslatedPersonRecognition; +import com.hankcs.hanlp.seg.CharacterBasedSegment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.seg.common.Vertex; +import com.hankcs.hanlp.seg.common.WordNet; +import com.hankcs.hanlp.utility.Predefine; + +import java.util.*; + +/** + * 词法分析器基类(中文分词、词性标注和命名实体识别) + * + * @author hankcs + */ +public class AbstractLexicalAnalyzer extends CharacterBasedSegment implements LexicalAnalyzer +{ + protected Segmenter segmenter; + protected POSTagger posTagger; + protected NERecognizer neRecognizer; + /** + * 字符类型表 + */ + protected static byte[] typeTable; + /** + * 是否执行规则分词(英文数字标点等的规则预处理)。规则永远是丑陋的,默认关闭。 + */ + protected boolean enableRuleBasedSegment = false; + + static + { + typeTable = new byte[CharType.type.length]; + System.arraycopy(CharType.type, 0, typeTable, 0, typeTable.length); + for (char c : Predefine.CHINESE_NUMBERS.toCharArray()) + { + 
typeTable[c] = CharType.CT_CHINESE; + } + typeTable[CharTable.convert('·')] = CharType.CT_CHINESE; + } + + protected AbstractLexicalAnalyzer() + { + config.translatedNameRecognize = false; + config.japaneseNameRecognize = false; + } + + public AbstractLexicalAnalyzer(Segmenter segmenter) + { + this(); + this.segmenter = segmenter; + } + + public AbstractLexicalAnalyzer(Segmenter segmenter, POSTagger posTagger) + { + this(); + this.segmenter = segmenter; + this.posTagger = posTagger; + } + + public AbstractLexicalAnalyzer(Segmenter segmenter, POSTagger posTagger, NERecognizer neRecognizer) + { + this(); + this.segmenter = segmenter; + this.posTagger = posTagger; + this.neRecognizer = neRecognizer; + if (posTagger != null) + { + config.speechTagging = true; + if (neRecognizer != null) + { + config.ner = true; + } + } + } + + /** + * 分词 + * + * @param sentence 文本 + * @param normalized 正规化后的文本 + * @param wordList 储存单词列表 + * @param attributeList 储存用户词典中的词性,设为null表示不查询用户词典 + */ + protected void segment(final String sentence, final String normalized, final List wordList, final List attributeList) + { + if (attributeList != null) + { + final int[] offset = new int[]{0}; + CustomDictionary.parseLongestText(sentence, new AhoCorasickDoubleArrayTrie.IHit() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + if (begin != offset[0]) + { + segmentAfterRule(sentence.substring(offset[0], begin), normalized.substring(offset[0], begin), wordList); + } + while (attributeList.size() < wordList.size()) + attributeList.add(null); + wordList.add(sentence.substring(begin, end)); + attributeList.add(value); + assert wordList.size() == attributeList.size() : "词语列表与属性列表不等长"; + offset[0] = end; + } + }); + if (offset[0] != sentence.length()) + { + segmentAfterRule(sentence.substring(offset[0]), normalized.substring(offset[0]), wordList); + } + } + else + { + segmentAfterRule(sentence, normalized, wordList); + } + } + + @Override + public void 
segment(final String sentence, final String normalized, final List wordList) + { + if (config.useCustomDictionary) + { + final int[] offset = new int[]{0}; + CustomDictionary.parseLongestText(sentence, new AhoCorasickDoubleArrayTrie.IHit() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + if (begin != offset[0]) + { + segmentAfterRule(sentence.substring(offset[0], begin), normalized.substring(offset[0], begin), wordList); + } + wordList.add(sentence.substring(begin, end)); + offset[0] = end; + } + }); + if (offset[0] != sentence.length()) + { + segmentAfterRule(sentence.substring(offset[0]), normalized.substring(offset[0]), wordList); + } + } + else + { + segmentAfterRule(sentence, normalized, wordList); + } + } + + /** + * 中文分词 + * + * @param sentence + * @return + */ + public List segment(String sentence) + { + return segment(sentence, CharTable.convert(sentence)); + } + + @Override + public String[] recognize(String[] wordArray, String[] posArray) + { + return neRecognizer.recognize(wordArray, posArray); + } + + @Override + public String[] tag(String... 
words) + { + return posTagger.tag(words); + } + + @Override + public String[] tag(List wordList) + { + return posTagger.tag(wordList); + } + + @Override + public NERTagSet getNERTagSet() + { + return neRecognizer.getNERTagSet(); + } + + @Override + public Sentence analyze(final String sentence) + { + if (sentence.isEmpty()) + { + return new Sentence(Collections.emptyList()); + } + final String normalized = CharTable.convert(sentence); + List wordList = new LinkedList(); + List attributeList = segmentWithAttribute(sentence, normalized, wordList); + + String[] wordArray = new String[wordList.size()]; + int offset = 0; + int id = 0; + for (String word : wordList) + { + wordArray[id] = normalized.substring(offset, offset + word.length()); + ++id; + offset += word.length(); + } + + List termList = new ArrayList(wordList.size()); + if (posTagger != null) + { + String[] posArray = tag(wordArray); + if (neRecognizer != null) + { + String[] nerArray = neRecognizer.recognize(wordArray, posArray); + overwriteTag(attributeList, posArray); + wordList.toArray(wordArray); + + List result = new LinkedList(); + result.add(new Word(wordArray[0], posArray[0])); + String prePos = posArray[0]; + + NERTagSet tagSet = getNERTagSet(); + for (int i = 1; i < nerArray.length; i++) + { + if (nerArray[i].charAt(0) == tagSet.B_TAG_CHAR || nerArray[i].charAt(0) == tagSet.S_TAG_CHAR || nerArray[i].charAt(0) == tagSet.O_TAG_CHAR) + { + termList.add(result.size() > 1 ? new CompoundWord(result, prePos) : result.get(0)); + result = new ArrayList(); + } + result.add(new Word(wordArray[i], posArray[i])); + if (nerArray[i].charAt(0) == tagSet.O_TAG_CHAR || nerArray[i].charAt(0) == tagSet.S_TAG_CHAR) + { + prePos = posArray[i]; + } + else + { + prePos = NERTagSet.posOf(nerArray[i]); + } + } + if (result.size() != 0) + { + termList.add(result.size() > 1 ? 
new CompoundWord(result, prePos) : result.get(0)); + } + } + else + { + overwriteTag(attributeList, posArray); + wordList.toArray(wordArray); + for (int i = 0; i < wordArray.length; i++) + { + termList.add(new Word(wordArray[i], posArray[i])); + } + } + } + else + { + wordList.toArray(wordArray); + for (String word : wordArray) + { + termList.add(new Word(word, null)); + } + } + + return new Sentence(termList); + } + + private void overwriteTag(List attributeList, String[] posArray) + { + int id; + if (attributeList != null) + { + id = 0; + for (CoreDictionary.Attribute attribute : attributeList) + { + if (attribute != null) + posArray[id] = attribute.nature[0].toString(); + ++id; + } + } + } + + /** + * 这个方法会查询用户词典 + * + * @param sentence + * @param normalized + * @return + */ + public List segment(final String sentence, final String normalized) + { + final List wordList = new LinkedList(); + segment(sentence, normalized, wordList); + return wordList; + } + + /** + * 分词时查询到一个用户词典中的词语,此处控制是否接受它 + * + * @param begin 起始位置 + * @param end 终止位置 + * @param value 词性 + * @return true 表示接受 + * @deprecated 自1.6.7起废弃,强制模式下为最长匹配,否则按分词结果合并 + */ + protected boolean acceptCustomWord(int begin, int end, CoreDictionary.Attribute value) + { + return config.forceCustomDictionary || (end - begin >= 4 && !value.hasNatureStartsWith("nr") && !value.hasNatureStartsWith("ns") && !value.hasNatureStartsWith("nt")); + } + + @Override + protected List roughSegSentence(char[] sentence) + { + return null; + } + + @Override + protected List segSentence(char[] sentence) + { + if (sentence.length == 0) + { + return Collections.emptyList(); + } + String original = new String(sentence); + CharTable.normalization(sentence); + String normalized = new String(sentence); + List wordList = new LinkedList(); + List attributeList; + attributeList = segmentWithAttribute(original, normalized, wordList); + List termList = new ArrayList(wordList.size()); + int offset = 0; + for (String word : wordList) + { + 
Term term = new Term(word, null); + term.offset = offset; + offset += term.length(); + termList.add(term); + } + if (config.speechTagging) + { + if (posTagger != null) + { + String[] wordArray = new String[wordList.size()]; + offset = 0; + int id = 0; + for (String word : wordList) + { + wordArray[id] = normalized.substring(offset, offset + word.length()); + ++id; + offset += word.length(); + } + String[] posArray = tag(wordArray); + Iterator iterator = termList.iterator(); + Iterator attributeIterator = attributeList == null ? null : attributeList.iterator(); + for (int i = 0; i < posArray.length; i++) + { + if (attributeIterator != null && attributeIterator.hasNext()) + { + CoreDictionary.Attribute attribute = attributeIterator.next(); + if (attribute != null) + { + iterator.next().nature = attribute.nature[0]; // 使用词典中的词性覆盖词性标注器的结果 + continue; + } + } + iterator.next().nature = Nature.create(posArray[i]); + } + + if (config.ner && neRecognizer != null) + { + List childrenList = null; + if (config.isIndexMode()) + { + childrenList = new LinkedList(); + iterator = termList.iterator(); + } + termList = new ArrayList(termList.size()); + String[] nerArray = recognize(wordArray, posArray); + wordList.toArray(wordArray); + StringBuilder result = new StringBuilder(); + result.append(wordArray[0]); + if (childrenList != null) + { + childrenList.add(iterator.next()); + } + if (attributeList != null) + { + attributeIterator = attributeList.iterator(); + for (int i = 0; i < wordArray.length && attributeIterator.hasNext(); i++) + { + CoreDictionary.Attribute attribute = attributeIterator.next(); + if (attribute != null) + posArray[i] = attribute.nature[0].toString(); + } + } + String prePos = posArray[0]; + offset = 0; + + for (int i = 1; i < nerArray.length; i++) + { + NERTagSet tagSet = getNERTagSet(); + if (nerArray[i].charAt(0) == tagSet.B_TAG_CHAR || nerArray[i].charAt(0) == tagSet.S_TAG_CHAR || nerArray[i].charAt(0) == tagSet.O_TAG_CHAR) + { + Term term = new 
Term(result.toString(), Nature.create(prePos)); + term.offset = offset; + offset += term.length(); + termList.add(term); + if (childrenList != null) + { + if (childrenList.size() > 1) + { + for (Term shortTerm : childrenList) + { + if (shortTerm.length() >= config.indexMode) + { + termList.add(shortTerm); + } + } + } + childrenList.clear(); + } + result.setLength(0); + } + result.append(wordArray[i]); + if (childrenList != null) + { + childrenList.add(iterator.next()); + } + if (nerArray[i].charAt(0) == tagSet.O_TAG_CHAR || nerArray[i].charAt(0) == tagSet.S_TAG_CHAR) + { + prePos = posArray[i]; + } + else + { + prePos = NERTagSet.posOf(nerArray[i]); + } + } + if (result.length() != 0) + { + Term term = new Term(result.toString(), Nature.create(prePos)); + term.offset = offset; + termList.add(term); + if (childrenList != null) + { + if (childrenList.size() > 1) + { + for (Term shortTerm : childrenList) + { + if (shortTerm.length() >= config.indexMode) + { + termList.add(shortTerm); + } + } + } + } + } + } + } + else + { + for (Term term : termList) + { + CoreDictionary.Attribute attribute = CoreDictionary.get(term.word); + if (attribute != null) + { + term.nature = attribute.nature[0]; + } + else + { + term.nature = Nature.n; + } + } + } + } + if (config.translatedNameRecognize || config.japaneseNameRecognize) + { + List vertexList = toVertexList(termList, true); + WordNet wordNetOptimum = new WordNet(sentence, vertexList); + WordNet wordNetAll = wordNetOptimum; + if (config.translatedNameRecognize) + { + TranslatedPersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); + } + if (config.japaneseNameRecognize) + { + JapanesePersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll); + } + termList = convert(vertexList, config.offset); + } + return termList; + } + + /** + * CT_CHINESE区间交给统计分词,否则视作整个单位 + * + * @param sentence + * @param normalized + * @param start + * @param end + * @param preType + * @param wordList + */ + private void 
pushPiece(String sentence, String normalized, int start, int end, byte preType, List wordList) + { + if (preType == CharType.CT_CHINESE) + { + segmenter.segment(sentence.substring(start, end), normalized.substring(start, end), wordList); + } + else + { + wordList.add(sentence.substring(start, end)); + } + } + + /** + * 丑陋的规则系统 + * + * @param sentence + * @param normalized + * @param wordList + */ + protected void segmentAfterRule(String sentence, String normalized, List wordList) + { + if (!enableRuleBasedSegment) + { + segmenter.segment(sentence, normalized, wordList); + return; + } + int start = 0; + int end = start; + byte preType = typeTable[normalized.charAt(end)]; + byte curType; + while (++end < normalized.length()) + { + curType = typeTable[normalized.charAt(end)]; + if (curType != preType) + { + if (preType == CharType.CT_NUM) + { + // 浮点数识别 + if (",,..".indexOf(normalized.charAt(end)) != -1) + { + if (end + 1 < normalized.length()) + { + if (typeTable[normalized.charAt(end + 1)] == CharType.CT_NUM) + { + continue; + } + } + } + else if ("年月日时分秒".indexOf(normalized.charAt(end)) != -1) + { + preType = curType; // 交给统计分词 + continue; + } + } + pushPiece(sentence, normalized, start, end, preType, wordList); + start = end; + } + preType = curType; + } + if (end == normalized.length()) + pushPiece(sentence, normalized, start, end, preType, wordList); + } + + /** + * 返回用户词典中的attribute的分词 + * + * @param original + * @param normalized + * @param wordList + * @return + */ + private List segmentWithAttribute(String original, String normalized, List wordList) + { + List attributeList; + if (config.useCustomDictionary) + { + if (config.forceCustomDictionary) + { + attributeList = new LinkedList(); + segment(original, normalized, wordList, attributeList); + } + else + { + segmentAfterRule(original, normalized, wordList); + attributeList = combineWithCustomDictionary(wordList); + } + } + else + { + segmentAfterRule(original, normalized, wordList); + attributeList = null; 
+ } + return attributeList; + } + + /** + * 使用用户词典合并粗分结果 + * + * @param vertexList 粗分结果 + * @return 合并后的结果 + */ + protected List combineWithCustomDictionary(List vertexList) + { + String[] wordNet = new String[vertexList.size()]; + vertexList.toArray(wordNet); + CoreDictionary.Attribute[] attributeArray = new CoreDictionary.Attribute[wordNet.length]; + // DAT合并 + DoubleArrayTrie dat = customDictionary.dat; + int length = wordNet.length; + for (int i = 0; i < length; ++i) + { + int state = 1; + state = dat.transition(wordNet[i], state); + if (state > 0) + { + int to = i + 1; + int end = to; + CoreDictionary.Attribute value = dat.output(state); + for (; to < length; ++to) + { + state = dat.transition(wordNet[to], state); + if (state < 0) break; + CoreDictionary.Attribute output = dat.output(state); + if (output != null) + { + value = output; + end = to + 1; + } + } + if (value != null) + { + combineWords(wordNet, i, end, attributeArray, value); + i = end - 1; + } + } + } + // BinTrie合并 + if (customDictionary.trie != null) + { + for (int i = 0; i < length; ++i) + { + if (wordNet[i] == null) continue; + BaseNode state = customDictionary.trie.transition(wordNet[i], 0); + if (state != null) + { + int to = i + 1; + int end = to; + CoreDictionary.Attribute value = state.getValue(); + for (; to < length; ++to) + { + if (wordNet[to] == null) continue; + state = state.transition(wordNet[to], 0); + if (state == null) break; + if (state.getValue() != null) + { + value = state.getValue(); + end = to + 1; + } + } + if (value != null) + { + combineWords(wordNet, i, end, attributeArray, value); + i = end - 1; + } + } + } + } + vertexList.clear(); + List attributeList = new LinkedList(); + for (int i = 0; i < wordNet.length; i++) + { + if (wordNet[i] != null) + { + vertexList.add(wordNet[i]); + attributeList.add(attributeArray[i]); + } + } + return attributeList; + } + + /** + * 将连续的词语合并为一个 + * + * @param wordNet 词图 + * @param start 起始下标(包含) + * @param end 结束下标(不包含) + * @param value 
新的属性 + */ + private static void combineWords(String[] wordNet, int start, int end, CoreDictionary.Attribute[] attributeArray, CoreDictionary.Attribute value) + { + if (start + 1 != end) // 小优化,如果只有一个词,那就不需要合并,直接应用新属性 + { + StringBuilder sbTerm = new StringBuilder(); + for (int j = start; j < end; ++j) + { + if (wordNet[j] == null) continue; + sbTerm.append(wordNet[j]); + wordNet[j] = null; + } + wordNet[start] = sbTerm.toString(); + } + attributeArray[start] = value; + } + + /** + * 是否执行规则分词(英文数字标点等的规则预处理)。规则永远是丑陋的,默认关闭。 + * + * @param enableRuleBasedSegment 是否激活 + * @return 词法分析器对象 + */ + public AbstractLexicalAnalyzer enableRuleBasedSegment(boolean enableRuleBasedSegment) + { + this.enableRuleBasedSegment = enableRuleBasedSegment; + return this; + } +} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/lexical/LexicalAnalyzer.java b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/LexicalAnalyzer.java new file mode 100644 index 000000000..467155207 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/LexicalAnalyzer.java @@ -0,0 +1,27 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 下午7:41 + * + * + * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. 
+ * + */ +package com.hankcs.hanlp.tokenizer.lexical; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; + +/** + * @author hankcs + */ +public interface LexicalAnalyzer extends Segmenter, POSTagger, NERecognizer +{ + /** + * 对句子进行词法分析 + * + * @param sentence 纯文本句子 + * @return HanLP定义的结构化句子 + */ + Sentence analyze(final String sentence); +} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/lexical/NERecognizer.java b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/NERecognizer.java new file mode 100644 index 000000000..5f3b0e51c --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/NERecognizer.java @@ -0,0 +1,32 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 下午7:33 + * + * + * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. + * + */ +package com.hankcs.hanlp.tokenizer.lexical; + +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; + +/** + * 命名实体识别接口 + * + * @author hankcs + */ +public interface NERecognizer +{ + /** + * 命名实体识别 + * + * @param wordArray 单词 + * @param posArray 词性 + * @return BMES-NER标签 + */ + String[] recognize(String[] wordArray, String[] posArray); + + NERTagSet getNERTagSet(); +} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/lexical/POSTagger.java b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/POSTagger.java new file mode 100644 index 000000000..11c0ca12d --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/POSTagger.java @@ -0,0 +1,37 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 下午7:36 + * + * + * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. 
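`NERecognizer.recognize` above returns one BMES-style tag per word (B/M/E marking the begin, middle, and end of a multi-word entity, S a single-word entity). A small, self-contained sketch of how such tags decode back into entity spans; the literal tag strings (`B-nt`, `S-ns`, `O`) are illustrative assumptions, since the real labels come from the model's `NERTagSet`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: fold parallel word/tag arrays into "entity/type" strings.
public class BmesDecodeSketch
{
    public static List<String> decode(String[] words, String[] tags)
    {
        List<String> entities = new ArrayList<String>();
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < tags.length; ++i)
        {
            char pos = tags[i].charAt(0);                    // B / M / E / S / O
            if (pos == 'B') buf.setLength(0);                // start a new span
            if (pos == 'B' || pos == 'M' || pos == 'E') buf.append(words[i]);
            if (pos == 'S') entities.add(words[i] + "/" + tags[i].substring(2));
            if (pos == 'E') entities.add(buf.toString() + "/" + tags[i].substring(2));
        }
        return entities;
    }

    public static void main(String[] args)
    {
        String[] words = {"北京", "大学", "在", "北京"};
        String[] tags = {"B-nt", "E-nt", "O", "S-ns"};
        System.out.println(decode(words, tags)); // prints [北京大学/nt, 北京/ns]
    }
}
```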
+ * + */ +package com.hankcs.hanlp.tokenizer.lexical; + +import java.util.List; + +/** + * 词性标注接口 + * + * @author hankcs + */ +public interface POSTagger +{ + /** + * 词性标注 + * + * @param words 单词 + * @return 词性数组 + */ + String[] tag(String... words); + + /** + * 词性标注 + * + * @param wordList 单词 + * @return 词性数组 + */ + String[] tag(List wordList); +} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/lexical/Segmenter.java b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/Segmenter.java new file mode 100644 index 000000000..3e41dae57 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/lexical/Segmenter.java @@ -0,0 +1,30 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 下午7:30 + * + * + * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. + * + */ +package com.hankcs.hanlp.tokenizer.lexical; + +import java.util.List; + +/** + * 分词器接口 + * + * @author hankcs + */ +public interface Segmenter +{ + /** + * 中文分词 + * + * @param text 文本 + * @return 词语 + */ + List segment(String text); + void segment(String text, String normalized, List output); +} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/pipe/LexicalAnalyzerPipe.java b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/LexicalAnalyzerPipe.java new file mode 100644 index 000000000..e3f64a252 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/LexicalAnalyzerPipe.java @@ -0,0 +1,54 @@ +/* + * Han He + * me@hankcs.com + * 2018-11-10 10:36 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.hanlp.tokenizer.pipe; + +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.tokenizer.lexical.LexicalAnalyzer; + +import java.util.List; +import java.util.ListIterator; + +/** + * 词法分析器管道。约定将IWord的label设为非null表示本级管道已经处理 + * + * @author hankcs + */ +public class LexicalAnalyzerPipe implements Pipe, List> +{ + /** + * 代理的词法分析器 + */ + protected LexicalAnalyzer analyzer; + + public LexicalAnalyzerPipe(LexicalAnalyzer analyzer) + { + this.analyzer = analyzer; + } + + @Override + public List flow(List input) + { + ListIterator listIterator = input.listIterator(); + while (listIterator.hasNext()) + { + IWord wordOrSentence = listIterator.next(); + if (wordOrSentence.getLabel() != null) + continue; // 这是别的管道已经处理过的单词,跳过 + listIterator.remove(); // 否则是句子 + String sentence = wordOrSentence.getValue(); + for (IWord word : analyzer.analyze(sentence)) + { + listIterator.add(word); + } + } + return input; + } +} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/pipe/LexicalAnalyzerPipeline.java b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/LexicalAnalyzerPipeline.java new file mode 100644 index 000000000..4f8e06778 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/LexicalAnalyzerPipeline.java @@ -0,0 +1,138 @@ +/* + * Han He + * me@hankcs.com + * 2018-11-10 10:23 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.hanlp.tokenizer.pipe; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.tokenizer.lexical.LexicalAnalyzer; + +import java.util.LinkedList; +import java.util.List; + +/** + * 流水线式词法分析器 + * @author hankcs + */ +public class LexicalAnalyzerPipeline extends Pipeline, List> implements LexicalAnalyzer +{ + public LexicalAnalyzerPipeline(Pipe> first, Pipe, List> last) + { + super(first, last); + } + + public LexicalAnalyzerPipeline(LexicalAnalyzer analyzer) + { + this(new LexicalAnalyzerPipe(analyzer)); + } + + public LexicalAnalyzerPipeline(LexicalAnalyzerPipe analyzer) + { + this(new Pipe>() + { + @Override + public List flow(String input) + { + List output = new LinkedList(); + output.add(new Word(input, null)); + return output; + } + }, + new Pipe, List>() + { + @Override + public List flow(List input) + { + return input; + } + } + ); + add(analyzer); + } + + /** + * 获取代理的词法分析器 + * + * @return + */ + public LexicalAnalyzer getAnalyzer() + { + for (Pipe, List> pipe : this) + { + if (pipe instanceof LexicalAnalyzerPipe) + { + return ((LexicalAnalyzerPipe) pipe).analyzer; + } + } + return null; + } + + @Override + public void segment(String sentence, String normalized, List wordList) + { + LexicalAnalyzer analyzer = getAnalyzer(); + if (analyzer == null) + throw new IllegalStateException("流水线中没有LexicalAnalyzerPipe"); + analyzer.segment(sentence, normalized, wordList); + } + + @Override + public List segment(String sentence) + { + LexicalAnalyzer analyzer = getAnalyzer(); + if (analyzer == null) + throw new IllegalStateException("流水线中没有LexicalAnalyzerPipe"); + return analyzer.segment(sentence); + } + + @Override + public String[] recognize(String[] wordArray, String[] posArray) + { + LexicalAnalyzer analyzer = getAnalyzer(); + 
if (analyzer == null) + throw new IllegalStateException("流水线中没有LexicalAnalyzerPipe"); + return analyzer.recognize(wordArray, posArray); + } + + @Override + public String[] tag(String... words) + { + LexicalAnalyzer analyzer = getAnalyzer(); + if (analyzer == null) + throw new IllegalStateException("流水线中没有LexicalAnalyzerPipe"); + return analyzer.tag(words); + } + + @Override + public String[] tag(List wordList) + { + LexicalAnalyzer analyzer = getAnalyzer(); + if (analyzer == null) + throw new IllegalStateException("流水线中没有LexicalAnalyzerPipe"); + return analyzer.tag(wordList); + } + + @Override + public NERTagSet getNERTagSet() + { + LexicalAnalyzer analyzer = getAnalyzer(); + if (analyzer == null) + throw new IllegalStateException("流水线中没有LexicalAnalyzerPipe"); + return analyzer.getNERTagSet(); + } + + @Override + public Sentence analyze(String sentence) + { + return new Sentence(flow(sentence)); + } +} diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/pipe/Pipe.java b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/Pipe.java new file mode 100644 index 000000000..dc61dea08 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/Pipe.java @@ -0,0 +1,29 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-29 4:49 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.hanlp.tokenizer.pipe; + +/** + * 一截管道 + * + * @param 输入类型 + * @param 输出类型 + * @author hankcs + */ +public interface Pipe +{ + /** + * 流经管道 + * + * @param input 输入 + * @return 输出 + */ + O flow(I input); +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/pipe/Pipeline.java b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/Pipeline.java new file mode 100644 index 000000000..da411eafd --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/Pipeline.java @@ -0,0 +1,222 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-29 4:51 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.hanlp.tokenizer.pipe; + +import java.util.*; + +/** + * 流水线 + * + * @author hankcs + */ +public class Pipeline implements Pipe, List> +{ + /** + * 入口 + */ + protected Pipe first; + /** + * 出口 + */ + protected Pipe last; + /** + * 中间部分 + */ + protected LinkedList> pipeList; + + public Pipeline(Pipe first, Pipe last) + { + this.first = first; + this.last = last; + pipeList = new LinkedList>(); + } + + @Override + public O flow(I input) + { + M i = first.flow(input); + for (Pipe pipe : pipeList) + { + i = pipe.flow(i); + } + return last.flow(i); + } + + @Override + public int size() + { + return pipeList.size(); + } + + @Override + public boolean isEmpty() + { + return pipeList.isEmpty(); + } + + @Override + public boolean contains(Object o) + { + return pipeList.contains(o); + } + + @Override + public Iterator> iterator() + { + return pipeList.iterator(); + } + + @Override + public Object[] toArray() + { + return pipeList.toArray(); + } + + @Override + public T[] toArray(T[] a) + { + return pipeList.toArray(a); + } + + @Override + public boolean add(Pipe pipe) + { + return pipeList.add(pipe); + } + + @Override + public boolean remove(Object o) + { + return pipeList.remove(o); + } + + @Override + 
public boolean containsAll(Collection c) + { + return pipeList.containsAll(c); + } + + @Override + public boolean addAll(Collection> c) + { + return pipeList.addAll(c); + } + + @Override + public boolean addAll(int index, Collection> c) + { + return pipeList.addAll(index, c); + } + + @Override + public boolean removeAll(Collection c) + { + return pipeList.removeAll(c); + } + + @Override + public boolean retainAll(Collection c) + { + return pipeList.retainAll(c); + } + + @Override + public void clear() + { + pipeList.clear(); + } + + @Override + public boolean equals(Object o) + { + return pipeList.equals(o); + } + + @Override + public int hashCode() + { + return pipeList.hashCode(); + } + + @Override + public Pipe get(int index) + { + return pipeList.get(index); + } + + @Override + public Pipe set(int index, Pipe element) + { + return pipeList.set(index, element); + } + + @Override + public void add(int index, Pipe element) + { + pipeList.add(index, element); + } + + /** + * 以最高优先级加入管道 + * + * @param pipe + */ + public void addFirst(Pipe pipe) + { + pipeList.addFirst(pipe); + } + + /** + * 以最低优先级加入管道 + * + * @param pipe + */ + public void addLast(Pipe pipe) + { + pipeList.addLast(pipe); + } + + @Override + public Pipe remove(int index) + { + return pipeList.remove(index); + } + + @Override + public int indexOf(Object o) + { + return pipeList.indexOf(o); + } + + @Override + public int lastIndexOf(Object o) + { + return pipeList.lastIndexOf(o); + } + + @Override + public ListIterator> listIterator() + { + return pipeList.listIterator(); + } + + @Override + public ListIterator> listIterator(int index) + { + return pipeList.listIterator(index); + } + + @Override + public List> subList(int fromIndex, int toIndex) + { + return pipeList.subList(fromIndex, toIndex); + } +} \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/tokenizer/pipe/RegexRecognizePipe.java b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/RegexRecognizePipe.java new file mode 100644 index
000000000..a1543bf89 --- /dev/null +++ b/src/main/java/com/hankcs/hanlp/tokenizer/pipe/RegexRecognizePipe.java @@ -0,0 +1,69 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-29 4:55 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.hanlp.tokenizer.pipe; + +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; + +import java.util.List; +import java.util.ListIterator; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * 正则匹配管道 + * + * @author hankcs + */ +public class RegexRecognizePipe implements Pipe, List> +{ + /** + * 正则表达式 + */ + protected Pattern pattern; + /** + * 所属标签 + */ + protected String label; + + public RegexRecognizePipe(Pattern pattern, String label) + { + this.pattern = pattern; + this.label = label; + } + + + @Override + public List flow(List input) + { + ListIterator listIterator = input.listIterator(); + while (listIterator.hasNext()) + { + IWord wordOrSentence = listIterator.next(); + if (wordOrSentence.getLabel() != null) + continue; // 这是别的管道已经处理过的单词,跳过 + listIterator.remove(); // 否则是句子 + String sentence = wordOrSentence.getValue(); + Matcher matcher = pattern.matcher(sentence); + int begin = 0; + int end; + while (matcher.find()) + { + end = matcher.start(); + listIterator.add(new Word(sentence.substring(begin, end), null)); // 未拦截的部分 + listIterator.add(new Word(matcher.group(), label)); // 拦截到的部分 + begin = matcher.end(); + } + if (begin < sentence.length()) listIterator.add(new Word(sentence.substring(begin), null)); + } + return input; + } +} diff --git a/src/main/java/com/hankcs/hanlp/utility/LexiconUtility.java b/src/main/java/com/hankcs/hanlp/utility/LexiconUtility.java index 075b5d144..2d8f080b8 100644 --- a/src/main/java/com/hankcs/hanlp/utility/LexiconUtility.java +++ 
b/src/main/java/com/hankcs/hanlp/utility/LexiconUtility.java @@ -12,7 +12,6 @@ package com.hankcs.hanlp.utility; import com.hankcs.hanlp.corpus.tag.Nature; -import com.hankcs.hanlp.corpus.util.CustomNatureUtility; import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.dictionary.CustomDictionary; import com.hankcs.hanlp.seg.common.Term; @@ -40,6 +39,16 @@ public static CoreDictionary.Attribute getAttribute(String word) return CustomDictionary.get(word); } + /** + * 词库是否收录了词语(查询核心词典和用户词典) + * @param word + * @return + */ + public static boolean contains(String word) + { + return getAttribute(word) != null; + } + /** * 从HanLP的词库中提取某个单词的属性(包括核心词典和用户词典) * @@ -74,12 +83,12 @@ public static boolean setAttribute(String word, CoreDictionary.Attribute attribu if (attribute == null) return false; if (CoreDictionary.trie.set(word, attribute)) return true; - if (CustomDictionary.dat.set(word, attribute)) return true; - if (CustomDictionary.trie == null) + if (CustomDictionary.DEFAULT.dat.set(word, attribute)) return true; + if (CustomDictionary.DEFAULT.trie == null) { CustomDictionary.add(word); } - CustomDictionary.trie.put(word, attribute); + CustomDictionary.DEFAULT.trie.put(word, attribute); return true; } @@ -139,16 +148,13 @@ public static boolean setAttribute(String word, String natureWithFrequency) */ public static Nature convertStringToNature(String name, LinkedHashSet customNatureCollector) { - try - { - return Nature.valueOf(name); - } - catch (Exception e) + Nature nature = Nature.fromString(name); + if (nature == null) { - Nature nature = CustomNatureUtility.addNature(name); + nature = Nature.create(name); if (customNatureCollector != null) customNatureCollector.add(nature); - return nature; } + return nature; } /** diff --git a/src/main/java/com/hankcs/hanlp/utility/MathTools.java b/src/main/java/com/hankcs/hanlp/utility/MathTools.java deleted file mode 100644 index 7977ab35b..000000000 --- 
a/src/main/java/com/hankcs/hanlp/utility/MathTools.java +++ /dev/null @@ -1,48 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/05/23 17:09 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. - * - */ -package com.hankcs.hanlp.utility; - -import com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary; -import com.hankcs.hanlp.seg.common.Vertex; - -import static com.hankcs.hanlp.utility.Predefine.*; - -/** - * @author hankcs - */ -public class MathTools -{ - /** - * 从一个词到另一个词的词的花费 - * - * @param from 前面的词 - * @param to 后面的词 - * @return 分数 - */ - public static double calculateWeight(Vertex from, Vertex to) - { - int frequency = from.getAttribute().totalFrequency; - if (frequency == 0) - { - frequency = 1; // 防止发生除零错误 - } -// int nTwoWordsFreq = BiGramDictionary.getBiFrequency(from.word, to.word); - int nTwoWordsFreq = CoreBiGramTableDictionary.getBiFrequency(from.wordID, to.wordID); - double value = -Math.log(dSmoothingPara * frequency / (MAX_FREQUENCY) + (1 - dSmoothingPara) * ((1 - dTemp) * nTwoWordsFreq / frequency + dTemp)); - if (value < 0.0) - { - value = -value; - } -// logger.info(String.format("%5s frequency:%6d, %s nTwoWordsFreq:%3d, weight:%.2f", from.word, frequency, from.word + "@" + to.word, nTwoWordsFreq, value)); - return value; - } -} diff --git a/src/main/java/com/hankcs/hanlp/classification/utilities/MathUtility.java b/src/main/java/com/hankcs/hanlp/utility/MathUtility.java similarity index 67% rename from src/main/java/com/hankcs/hanlp/classification/utilities/MathUtility.java rename to src/main/java/com/hankcs/hanlp/utility/MathUtility.java index 2d5eb1db3..7fe246918 100644 --- a/src/main/java/com/hankcs/hanlp/classification/utilities/MathUtility.java +++ b/src/main/java/com/hankcs/hanlp/utility/MathUtility.java @@ -9,18 +9,26 @@ * This source is subject to Hankcs. 
Please contact Hankcs to get more information. * */ -package com.hankcs.hanlp.classification.utilities; +package com.hankcs.hanlp.utility; + +import com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary; +import com.hankcs.hanlp.seg.common.Vertex; import java.util.Map; import java.util.Set; +import static com.hankcs.hanlp.utility.Predefine.TOTAL_FREQUENCY; +import static com.hankcs.hanlp.utility.Predefine.lambda; +import static com.hankcs.hanlp.utility.Predefine.myu; + /** * 一些数学小工具 + * * @author hankcs */ public class MathUtility { - public static int sum(int ... var) + public static int sum(int... var) { int sum = 0; for (int x : var) @@ -31,6 +39,17 @@ public static int sum(int ... var) return sum; } + public static float sum(float... var) + { + float sum = 0; + for (float x : var) + { + sum += x; + } + + return sum; + } + public static double percentage(double current, double total) { return current / total * 100.; @@ -101,4 +120,21 @@ public static void normalizeExp(double[] predictionScores) } } } + + /** + * 从一个词到另一个词的词的花费 + * + * @param from 前面的词 + * @param to 后面的词 + * @return 分数 + */ + public static double calculateWeight(Vertex from, Vertex to) + { + int fFrom = from.getAttribute().totalFrequency; + int fBigram = CoreBiGramTableDictionary.getBiFrequency(from.wordID, to.wordID); + int fTo = to.getAttribute().totalFrequency; + // logger.info(String.format("%5s frequency:%6d, %s fBigram:%3d, weight:%.2f", from.word, frequency, from.word + "@" + to.word, fBigram, value)); + return -Math.log(lambda * (myu * fBigram / (fFrom + 1) + 1 - myu) + (1 - lambda) * fTo / TOTAL_FREQUENCY); + } + } \ No newline at end of file diff --git a/src/main/java/com/hankcs/hanlp/utility/Predefine.java b/src/main/java/com/hankcs/hanlp/utility/Predefine.java index a78313f0f..ecbab06d2 100644 --- a/src/main/java/com/hankcs/hanlp/utility/Predefine.java +++ b/src/main/java/com/hankcs/hanlp/utility/Predefine.java @@ -20,28 +20,13 @@ */ public class Predefine { + public static final 
String CHINESE_NUMBERS = "零○〇一二两三四五六七八九十廿百千万亿壹贰叁肆伍陆柒捌玖拾佰仟"; /** * hanlp.properties的路径,一般情况下位于classpath目录中。 * 但在某些极端情况下(不标准的Java虚拟机,用户缺乏相关知识等),允许将其设为绝对路径 */ public static String HANLP_PROPERTIES_PATH; public final static double MIN_PROBABILITY = 1e-10; - public final static int CT_SENTENCE_BEGIN = 1; //Sentence begin - public final static int CT_SENTENCE_END = 4; //Sentence ending - /** @deprecated 使用CharType中的相应常量 */ - public final static int CT_SINGLE = 5; //SINGLE byte - /** @deprecated 使用CharType中的相应常量 */ - public final static int CT_DELIMITER = CT_SINGLE + 1; //delimiter - /** @deprecated 使用CharType中的相应常量 */ - public final static int CT_CHINESE = CT_SINGLE + 2; //Chinese Char - /** @deprecated 使用CharType中的相应常量 */ - public final static int CT_LETTER = CT_SINGLE + 3; //HanYu Pinyin - /** @deprecated 使用CharType中的相应常量 */ - public final static int CT_NUM = CT_SINGLE + 4; //HanYu Pinyin - /** @deprecated 使用CharType中的相应常量 */ - public final static int CT_INDEX = CT_SINGLE + 5; //HanYu Pinyin - /** @deprecated */ - public final static int CT_OTHER = CT_SINGLE + 12; //Other /** * 浮点数正则 */ @@ -57,42 +42,21 @@ public class Predefine "水库","隧道","特区","铁路","新村","雪峰","盐场","盐湖","渔场","直辖市", "自治区","自治县","自治州"}; - //Translation type - public static int TT_ENGLISH = 0; - public static int TT_RUSSIAN = 1; - public static int TT_JAPANESE = 2; - - //Seperator type - public static String SEPERATOR_C_SENTENCE = "。!?:;…"; - public static String SEPERATOR_C_SUB_SENTENCE = "、,()“”‘’"; - public static String SEPERATOR_E_SENTENCE = "!?:;"; - public static String SEPERATOR_E_SUB_SENTENCE = ",()*'"; - //注释:原来程序为",()\042'","\042"为10进制42好ASC字符,为* - public static String SEPERATOR_LINK = "\n\r  "; - - //Seperator between two words - public static String WORD_SEGMENTER = "@"; - - public static int CC_NUM = 6768; - - //The number of Chinese Char,including 5 empty position between 3756-3761 - public static int WORD_MAXLENGTH = 100; - public static int WT_DELIMITER = 0; - public static int WT_CHINESE = 
1; - public static int WT_OTHER = 2; - - public static int MAX_WORDS = 650; public static int MAX_SEGMENT_NUM = 10; - public static final int MAX_FREQUENCY = 25146057; // 现在总词频25146057 + public static int TOTAL_FREQUENCY = 25146057; // 现在总词频25146057 /** - * Smoothing 平滑因子 + * 未登录词的默认词频 */ - public static final double dTemp = (double) 1 / MAX_FREQUENCY + 0.00001; + public static int OOV_DEFAULT_FREQUENCY = 10000; /** - * 平滑参数 + * Bigram 平滑因子 */ - public static final double dSmoothingPara = 0.1; + public static double myu = 1 - (double) 1 / TOTAL_FREQUENCY + 0.00001; + /** + * Unigram 平滑因子 + */ + public static final double lambda = 0.9; /** * 地址 ns */ @@ -165,4 +129,11 @@ public class Predefine * 二进制文件后缀 */ public final static String BIN_EXT = ".bin"; + + public static void setTotalFrequency(int totalFrequency) + { + TOTAL_FREQUENCY = totalFrequency; + myu = 1 - ((double) 1 / TOTAL_FREQUENCY + 0.00001); + OOV_DEFAULT_FREQUENCY = Math.max(1, Math.min(OOV_DEFAULT_FREQUENCY / 100, TOTAL_FREQUENCY)); // 默认百分之一 + } } diff --git a/src/main/java/com/hankcs/hanlp/utility/SentencesUtil.java b/src/main/java/com/hankcs/hanlp/utility/SentencesUtil.java index 9d5dbf15e..44d808aae 100644 --- a/src/main/java/com/hankcs/hanlp/utility/SentencesUtil.java +++ b/src/main/java/com/hankcs/hanlp/utility/SentencesUtil.java @@ -8,21 +8,38 @@ /** * 文本断句 - * */ public class SentencesUtil { /** - * 将文本切割为句子 + * 将文本切割为最细小的句子(逗号也视作分隔符) + * * @param content * @return */ public static List toSentenceList(String content) { - return toSentenceList(content.toCharArray()); + return toSentenceList(content.toCharArray(), true); + } + + /** + * 文本分句 + * + * @param content 文本 + * @param shortest 是否切割为最细的单位(将逗号也视作分隔符) + * @return + */ + public static List toSentenceList(String content, boolean shortest) + { + return toSentenceList(content.toCharArray(), shortest); } public static List toSentenceList(char[] chars) + { + return toSentenceList(chars, true); + } + + public static List toSentenceList(char[] 
chars, boolean shortest) { StringBuilder sb = new StringBuilder(); @@ -55,31 +72,24 @@ public static List toSentenceList(char[] chars) insertIntoList(sb, sentences); sb = new StringBuilder(); } - }break; - case ' ': - case ' ': - case ' ': - case '。': + } + break; case ',': case ',': - insertIntoList(sb, sentences); - sb = new StringBuilder(); - break; case ';': case ';': - insertIntoList(sb, sentences); - sb = new StringBuilder(); - break; + if (!shortest) + { + continue; + } + case ' ': + case ' ': + case ' ': + case '。': case '!': case '!': - insertIntoList(sb, sentences); - sb = new StringBuilder(); - break; case '?': case '?': - insertIntoList(sb, sentences); - sb = new StringBuilder(); - break; case '\n': case '\r': insertIntoList(sb, sentences); @@ -107,6 +117,7 @@ private static void insertIntoList(StringBuilder sb, List sentences) /** * 句子中是否含有词性 + * * @param sentence * @param nature * @return diff --git a/src/main/java/com/hankcs/hanlp/utility/TextUtility.java b/src/main/java/com/hankcs/hanlp/utility/TextUtility.java index 276f09160..41c4c3de7 100644 --- a/src/main/java/com/hankcs/hanlp/utility/TextUtility.java +++ b/src/main/java/com/hankcs/hanlp/utility/TextUtility.java @@ -1,8 +1,16 @@ package com.hankcs.hanlp.utility; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; + import java.io.*; import java.util.Collection; +import java.util.Iterator; +import java.util.List; + +import static com.hankcs.hanlp.dictionary.other.CharType.*; /** * 文本工具类 @@ -10,46 +18,6 @@ public class TextUtility { - /** - * 单字节 - */ - public static final int CT_SINGLE = 5;// SINGLE byte - - /** - * 分隔符"!,.?()[]{}+= - */ - public static final int CT_DELIMITER = CT_SINGLE + 1;// delimiter - - /** - * 中文字符 - */ - public static final int CT_CHINESE = CT_SINGLE + 2;// Chinese Char - - /** - * 字母 - */ - public static final int CT_LETTER = CT_SINGLE + 3;// 
HanYu Pinyin - - /** - * 数字 - */ - public static final int CT_NUM = CT_SINGLE + 4;// HanYu Pinyin - - /** - * 序号 - */ - public static final int CT_INDEX = CT_SINGLE + 5;// HanYu Pinyin - - /** - * 中文数字 - */ - public static final int CT_CNUM = CT_SINGLE + 6; - - /** - * 其他 - */ - public static final int CT_OTHER = CT_SINGLE + 12;// Other - public static int charType(char c) { return charType(String.valueOf(c)); @@ -64,7 +32,7 @@ public static int charType(String str) { if (str != null && str.length() > 0) { - if ("零○〇一二两三四五六七八九十廿百千万亿壹贰叁肆伍陆柒捌玖拾佰仟".contains(str)) return CT_CNUM; + if (Predefine.CHINESE_NUMBERS.contains(str)) return CT_CNUM; byte[] b; try { @@ -81,10 +49,8 @@ public static int charType(String str) int ub2 = getUnsigned(b2); if (ub1 < 128) { - if (ub1 < 32) return CT_DELIMITER; // NON PRINTABLE CHARACTERS - if (' ' == b1) return CT_OTHER; - if ('\n' == b1) return CT_DELIMITER; - if ("*\"!,.?()[]{}+=/\\;:|".indexOf((char) b1) != -1) + if (ub1 <= 32) return CT_OTHER; // NON PRINTABLE CHARACTERS + if ("*\"!,.?()<>[]{}+=/\\;:|".indexOf((char) b1) != -1) return CT_DELIMITER; if ("0123456789".indexOf((char)b1) != -1) return CT_NUM; @@ -703,4 +669,45 @@ public static String join(String delimiter, Collection stringCollection) return sb.toString(); } + + public static String combine(String... 
termArray) + { + StringBuilder sbSentence = new StringBuilder(); + for (String word : termArray) + { + sbSentence.append(word); + } + return sbSentence.toString(); + } + + public static String join(Iterable s, String delimiter) + { + Iterator iter = s.iterator(); + if (!iter.hasNext()) return ""; + StringBuilder buffer = new StringBuilder(iter.next()); + while (iter.hasNext()) buffer.append(delimiter).append(iter.next()); + return buffer.toString(); + } + + public static String combine(Sentence sentence) + { + StringBuilder sb = new StringBuilder(sentence.wordList.size() * 3); + for (IWord word : sentence.wordList) + { + sb.append(word.getValue()); + } + + return sb.toString(); + } + + public static String combine(List wordList) + { + StringBuilder sb = new StringBuilder(wordList.size() * 3); + for (IWord word : wordList) + { + sb.append(word.getValue()); + } + + return sb.toString(); + } } diff --git a/src/test/java/com/hankcs/book/ch01/HelloWord.java b/src/test/java/com/hankcs/book/ch01/HelloWord.java new file mode 100644 index 000000000..4b0eb8052 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch01/HelloWord.java @@ -0,0 +1,31 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-18 下午5:38 + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch01; + +import com.hankcs.hanlp.HanLP; + +/** + * 《自然语言处理入门》1.6 Open-Source Tools + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class HelloWord +{ + public static void main(String[] args) + { + HanLP.Config.enableDebug(); // the first run builds the model cache automatically; debug mode is enabled so there is something to read while you wait :-) + System.out.println(HanLP.segment("王国维和服务员")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/book/ch02/AhoCorasickDoubleArrayTrieSegmentation.java b/src/test/java/com/hankcs/book/ch02/AhoCorasickDoubleArrayTrieSegmentation.java new file mode 100644 index 000000000..375321add --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/AhoCorasickDoubleArrayTrieSegmentation.java @@ -0,0 +1,152 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-28 5:59 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreDictionary; + +import java.io.IOException; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.TreeMap; + +/** + * 《自然语言处理入门》2.7 The Aho-Corasick Automaton Based on the Double-Array Trie + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class AhoCorasickDoubleArrayTrieSegmentation +{ + public static void main(String[] args) throws IOException + { + classicDemo(); + for (int i = 1; i <= 10; ++i) + { + evaluateSpeed(i); + System.gc(); + } + } + + private static void classicDemo() + { + String[] keyArray = new String[]{"hers", "his", "she", "he"}; + TreeMap<String, String> map = new TreeMap<String, String>(); + for (String key : keyArray) + map.put(key, key.toUpperCase()); + AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<String>(map); + for (AhoCorasickDoubleArrayTrie.Hit<String> hit : acdat.parseText("ushers")) // collect all hits at once + { + System.out.printf("[%d:%d]=%s\n", hit.begin, hit.end, hit.value); + } + System.out.println(); + acdat.parseText("ushers", new AhoCorasickDoubleArrayTrie.IHit<String>() // process each hit as soon as it is found + { + @Override + public void hit(int begin, int end, String value) + { + System.out.printf("[%d:%d]=%s\n", begin, end, value); + } + }); + } + + private static void evaluateSpeed(int wordLength) throws IOException + { + TreeMap<String, CoreDictionary.Attribute> dictionary = loadDictionary(wordLength); + + AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat = new AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute>(dictionary); + DoubleArrayTrie<CoreDictionary.Attribute> dat = new DoubleArrayTrie<CoreDictionary.Attribute>(dictionary); + + String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原"; + long start; + double costTime; + final int pressure = 1000000; + System.out.printf("长度%d:\n", wordLength); + + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + 
acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + + } + }); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("ACDAT: %.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + dat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + + } + }); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("DAT: %.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + } + + /** + * Load the dictionary, keeping only words of a minimum length + * + * @param minLength minimum word length + * @return the dictionary as a TreeMap + * @throws IOException + */ + public static TreeMap<String, CoreDictionary.Attribute> loadDictionary(int minLength) throws IOException + { + TreeMap<String, CoreDictionary.Attribute> dictionary = + IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt"); + + Iterator<String> iterator = dictionary.keySet().iterator(); + while (iterator.hasNext()) + { + if (iterator.next().length() < minLength) + iterator.remove(); + } + return dictionary; + } + + /** + * Fully-segmenting Chinese word segmentation based on an ACDAT + * + * @param text the text to segment + * @param acdat the dictionary + * @return the word list + */ + public static List<String> segmentFully(final String text, AhoCorasickDoubleArrayTrie<CoreDictionary.Attribute> acdat) + { + final List<String> wordList = new LinkedList<String>(); + acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + wordList.add(text.substring(begin, end)); + } + }); + return wordList; + } +} diff --git a/src/test/java/com/hankcs/book/ch02/AhoCorasickSegmentation.java b/src/test/java/com/hankcs/book/ch02/AhoCorasickSegmentation.java new file mode 100644 index 000000000..5e21f1fba --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/AhoCorasickSegmentation.java @@ -0,0 +1,89 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-28 
11:00 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.algorithm.ahocorasick.trie.Emit; +import com.hankcs.hanlp.algorithm.ahocorasick.trie.Trie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreDictionary; + +import java.io.IOException; +import java.util.LinkedList; +import java.util.List; +import java.util.TreeMap; + +/** + * 《自然语言处理入门》2.6 The Aho-Corasick Automaton + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class AhoCorasickSegmentation +{ + public static void main(String[] args) throws IOException + { + classicDemo(); + evaluateSpeed(); + } + + private static void classicDemo() + { + String[] keyArray = new String[]{"hers", "his", "she", "he"}; + Trie trie = new Trie(); + for (String key : keyArray) + trie.addKeyword(key); + for (Emit emit : trie.parseText("ushers")) + System.out.printf("[%d:%d]=%s\n", emit.getStart(), emit.getEnd(), emit.getKeyword()); + } + + private static void evaluateSpeed() throws IOException + { + // load the dictionary + TreeMap<String, CoreDictionary.Attribute> dictionary = + IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt"); + Trie trie = new Trie(dictionary.keySet()); + + String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原"; + long start; + double costTime; + final int pressure = 1000000; + + System.out.println("===AC自动机接口==="); + System.out.println("完全切分"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + segmentFully(text, trie); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + } + + /** + * Fully-segmenting Chinese word segmentation based on an Aho-Corasick automaton + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentFully(final String text, Trie 
dictionary) + { + final List<String> wordList = new LinkedList<String>(); + for (Emit emit : dictionary.parseText(text)) + { + wordList.add(emit.getKeyword()); + } + return wordList; + } +} diff --git a/src/test/java/com/hankcs/book/ch02/BinTrieBasedSegmentation.java b/src/test/java/com/hankcs/book/ch02/BinTrieBasedSegmentation.java new file mode 100644 index 000000000..ae154fe4d --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/BinTrieBasedSegmentation.java @@ -0,0 +1,194 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-26 6:58 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreDictionary; + +import java.io.IOException; +import java.util.*; + +import static com.hankcs.book.ch02.NaiveDictionaryBasedSegmentation.evaluateSpeed; + +/** + * 《自然语言处理入门》2.4 Tries + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class BinTrieBasedSegmentation +{ + public static void main(String[] args) throws IOException + { + // load the dictionary + TreeMap<String, CoreDictionary.Attribute> dictionary = + IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt"); + final BinTrie<CoreDictionary.Attribute> binTrie = new BinTrie<CoreDictionary.Attribute>(dictionary); + Map<String, CoreDictionary.Attribute> binTrieMap = new Map<String, CoreDictionary.Attribute>() + { + @Override + public int size() + { + return binTrie.size(); + } + + @Override + public boolean isEmpty() + { + return binTrie.size() == 0; + } + + @Override + public boolean containsKey(Object key) + { + return binTrie.containsKey((String) key); + } + + @Override + public boolean containsValue(Object value) + { + throw new UnsupportedOperationException(); + } + + @Override + public CoreDictionary.Attribute get(Object key) + { + 
return binTrie.get((String) key); + } + + @Override + public CoreDictionary.Attribute put(String key, CoreDictionary.Attribute value) + { + throw new UnsupportedOperationException(); + } + + @Override + public CoreDictionary.Attribute remove(Object key) + { + throw new UnsupportedOperationException(); + } + + @Override + public void putAll(Map<? extends String, ? extends CoreDictionary.Attribute> m) + { + throw new UnsupportedOperationException(); + } + + @Override + public void clear() + { + throw new UnsupportedOperationException(); + } + + @Override + public Set<String> keySet() + { + throw new UnsupportedOperationException(); + } + + @Override + public Collection<CoreDictionary.Attribute> values() + { + throw new UnsupportedOperationException(); + } + + @Override + public Set<Entry<String, CoreDictionary.Attribute>> entrySet() + { + throw new UnsupportedOperationException(); + + } + }; + + String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原"; + long start; + double costTime; + final int pressure = 10000; + + System.out.println("===朴素接口==="); + + System.out.println("完全切分"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + com.hankcs.book.ch02.NaiveDictionaryBasedSegmentation.segmentFully(text, binTrieMap); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + evaluateSpeed(binTrieMap); + + System.out.println("===BinTrie接口==="); + System.out.println("完全切分"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + segmentFully(text, binTrie); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + + System.out.println("正向最长"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + segmentForwardLongest(text, binTrie); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + } + + /** + * Fully-segmenting Chinese word segmentation based on a BinTrie + * + * @param text the text to segment 
+ * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentFully(final String text, BinTrie<CoreDictionary.Attribute> dictionary) + { + final List<String> wordList = new LinkedList<String>(); + dictionary.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + wordList.add(text.substring(begin, end)); + } + }); + return wordList; + } + + /** + * Forward longest-matching Chinese word segmentation based on a BinTrie + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentForwardLongest(final String text, BinTrie<CoreDictionary.Attribute> dictionary) + { + final List<String> wordList = new LinkedList<String>(); + dictionary.parseLongestText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + wordList.add(text.substring(begin, end)); + } + }); + return wordList; + } +} diff --git a/src/test/java/com/hankcs/book/ch02/DemoAhoCorasickDoubleArrayTrieSegment.java b/src/test/java/com/hankcs/book/ch02/DemoAhoCorasickDoubleArrayTrieSegment.java new file mode 100644 index 000000000..541cae73e --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/DemoAhoCorasickDoubleArrayTrieSegment.java @@ -0,0 +1,35 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-29 12:19 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.seg.Other.AhoCorasickDoubleArrayTrieSegment; + +import java.io.IOException; + +/** + * 《自然语言处理入门》2.8 Dictionary-Based Segmentation in HanLP + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class DemoAhoCorasickDoubleArrayTrieSegment +{ + public static void main(String[] args) throws IOException + { + HanLP.Config.ShowTermNature = false; + AhoCorasickDoubleArrayTrieSegment segment = new AhoCorasickDoubleArrayTrieSegment(); + System.out.println(segment.seg("江西鄱阳湖干枯,中国最大淡水湖变成大草原")); + } +} diff --git a/src/test/java/com/hankcs/book/ch02/DemoDoubleArrayTrieSegment.java b/src/test/java/com/hankcs/book/ch02/DemoDoubleArrayTrieSegment.java new file mode 100644 index 000000000..655697a7f --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/DemoDoubleArrayTrieSegment.java @@ -0,0 +1,51 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-29 9:35 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment; +import com.hankcs.hanlp.seg.common.Term; + +import java.io.IOException; + +/** + * 《自然语言处理入门》2.8 Dictionary-Based Segmentation in HanLP + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class DemoDoubleArrayTrieSegment +{ + public static void main(String[] args) throws IOException + { + HanLP.Config.ShowTermNature = false; // hide part-of-speech tags in the output + // by default, loads the CoreDictionaryPath specified in the configuration file + DoubleArrayTrieSegment segment = new DoubleArrayTrieSegment(); + System.out.println(segment.seg("江西鄱阳湖干枯,中国最大淡水湖变成大草原")); + // custom dictionaries can be loaded as well + String dict1 = "data/dictionary/CoreNatureDictionary.mini.txt"; + String dict2 = "data/dictionary/custom/上海地名.txt ns"; + segment = new DoubleArrayTrieSegment(dict1, dict2); + System.out.println(segment.seg("上海市虹口区大连西路550号SISU")); + + segment.enablePartOfSpeechTagging(true); // enable recognition of numerals and English tokens + HanLP.Config.ShowTermNature = true; // also show the part-of-speech tags + System.out.println(segment.seg("上海市虹口区大连西路550号SISU")); + + for (Term term : segment.seg("上海市虹口区大连西路550号SISU")) + { + System.out.printf("单词:%s 词性:%s\n", term.word, term.nature); + } + } +} diff --git a/src/test/java/com/hankcs/book/ch02/DemoStopwords.java b/src/test/java/com/hankcs/book/ch02/DemoStopwords.java new file mode 100644 index 000000000..33ca15bb4 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/DemoStopwords.java @@ -0,0 +1,129 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-04 10:40 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; + +import java.io.IOException; +import java.util.List; +import java.util.ListIterator; +import java.util.TreeMap; + +/** + * 《自然语言处理入门》2.10 Other Applications of Tries + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class DemoStopwords +{ + /** + * Load stopwords from a dictionary file + * + * @param path dictionary path + * @return a double-array trie + * @throws IOException + */ + static DoubleArrayTrie<String> loadStopwordFromFile(String path) throws IOException + { + TreeMap<String, String> map = new TreeMap<String, String>(); + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(path); + for (String word : lineIterator) + { + map.put(word, word); + } + + return new DoubleArrayTrie<String>(map); + } + + /** + * Build a stopword trie from the given words + * + * @param words stopword array + * @return a double-array trie + * @throws IOException + */ + static DoubleArrayTrie<String> loadStopwordFromWords(String... 
words) throws IOException + { + TreeMap<String, String> map = new TreeMap<String, String>(); + for (String word : words) + { + map.put(word, word); + } + + return new DoubleArrayTrie<String>(map); + } + + public static void main(String[] args) throws IOException + { + DoubleArrayTrie<String> trie = loadStopwordFromFile(HanLP.Config.CoreStopWordDictionaryPath); + final String text = "停用词的意义相对而言无关紧要吧。"; + HanLP.Config.ShowTermNature = false; + Segment segment = new DoubleArrayTrieSegment(); + List<Term> termList = segment.seg(text); + System.out.println("分词结果:" + termList); + removeStopwords(termList, trie); + System.out.println("分词结果去掉停用词:" + termList); + trie = loadStopwordFromWords("的", "相对而言", "吧"); + System.out.println("不分词去掉停用词:" + replaceStopwords(text, "**", trie)); + } + + /** + * Remove stopwords from a segmentation result + * + * @param termList segmentation result + * @param trie stopword dictionary + */ + public static void removeStopwords(List<Term> termList, DoubleArrayTrie<String> trie) + { + ListIterator<Term> listIterator = termList.listIterator(); + while (listIterator.hasNext()) + if (trie.containsKey(listIterator.next().word)) + listIterator.remove(); + } + + /** + * Stopword filtering + * + * @param text the source text + * @param replacement the string every stopword is replaced with + * @param trie stopword dictionary + * @return the filtered text + */ + public static String replaceStopwords(final String text, final String replacement, DoubleArrayTrie<String> trie) + { + final StringBuilder sbOut = new StringBuilder(text.length()); + final int[] offset = new int[]{0}; + trie.parseLongestText(text, new AhoCorasickDoubleArrayTrie.IHit<String>() + { + @Override + public void hit(int begin, int end, String value) + { + if (begin > offset[0]) + sbOut.append(text.substring(offset[0], begin)); + sbOut.append(replacement); + offset[0] = end; + } + }); + if (offset[0] < text.length()) + sbOut.append(text.substring(offset[0])); + return sbOut.toString(); + } +} diff --git a/src/test/java/com/hankcs/book/ch02/DoubleArrayTrieBasedSegmentation.java b/src/test/java/com/hankcs/book/ch02/DoubleArrayTrieBasedSegmentation.java new file mode 100644 index 000000000..4ef109cd9 --- /dev/null +++ 
b/src/test/java/com/hankcs/book/ch02/DoubleArrayTrieBasedSegmentation.java @@ -0,0 +1,131 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-27 10:56 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import junit.framework.TestCase; + +import java.io.IOException; +import java.util.LinkedList; +import java.util.List; +import java.util.TreeMap; + +/** + * 《自然语言处理入门》2.5 The Double-Array Trie + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class DoubleArrayTrieBasedSegmentation +{ + public static void main(String[] args) throws IOException + { + testTinyDAT(); + testSpeed(); + } + + public static void testTinyDAT() + { + TreeMap<String, String> tinyDictionary = createTinyTreeMap(); + DoubleArrayTrie<String> dat = new DoubleArrayTrie<String>(tinyDictionary); + } + + public static TreeMap<String, String> createTinyTreeMap() + { + TreeMap<String, String> tinyDictionary = new TreeMap<String, String>(); + tinyDictionary.put("自然", "nature"); + tinyDictionary.put("自然人", "human"); + tinyDictionary.put("自然语言", "language"); + tinyDictionary.put("自语", "talk to oneself"); + tinyDictionary.put("入门", "introduction"); + return tinyDictionary; + } + + public static void testSpeed() throws IOException + { + // load the dictionary + TreeMap<String, CoreDictionary.Attribute> dictionary = + IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt"); + DoubleArrayTrie<CoreDictionary.Attribute> dat = new DoubleArrayTrie<CoreDictionary.Attribute>(dictionary); + + String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原"; + long start; + double costTime; + final int pressure = 1000000; + + System.out.println("===DoubleArrayTrie接口==="); + System.out.println("完全切分"); + start = System.currentTimeMillis(); + for (int i = 
0; i < pressure; ++i) + { + segmentFully(text, dat); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + + System.out.println("正向最长"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + segmentForwardLongest(text, dat); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + } + + /** + * Fully-segmenting Chinese word segmentation based on a DoubleArrayTrie + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentFully(final String text, DoubleArrayTrie<CoreDictionary.Attribute> dictionary) + { + final List<String> wordList = new LinkedList<String>(); + dictionary.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + wordList.add(text.substring(begin, end)); + } + }); + return wordList; + } + + /** + * Forward longest-matching Chinese word segmentation based on a DoubleArrayTrie + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentForwardLongest(final String text, DoubleArrayTrie<CoreDictionary.Attribute> dictionary) + { + final List<String> wordList = new LinkedList<String>(); + dictionary.parseLongestText(text, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>() + { + @Override + public void hit(int begin, int end, CoreDictionary.Attribute value) + { + wordList.add(text.substring(begin, end)); + } + }); + return wordList; + } +} diff --git a/src/test/java/com/hankcs/book/ch02/EvaluateCWS.java b/src/test/java/com/hankcs/book/ch02/EvaluateCWS.java new file mode 100644 index 000000000..7b5260667 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/EvaluateCWS.java @@ -0,0 +1,39 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-03 5:17 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.CWSEvaluator; + +import java.io.IOException; + +/** + * 《自然语言处理入门》2.9 Accuracy Evaluation + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class EvaluateCWS +{ + public static void main(String[] args) throws IOException + { + String trainWords = MSR.TRAIN_WORDS; + Segment segment = new DoubleArrayTrieSegment(trainWords) + .enablePartOfSpeechTagging(true); + CWSEvaluator.Result result = CWSEvaluator.evaluate(segment, MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, trainWords); + System.out.println(result); + } +} diff --git a/src/test/java/com/hankcs/book/ch02/NaiveDictionaryBasedSegmentation.java b/src/test/java/com/hankcs/book/ch02/NaiveDictionaryBasedSegmentation.java new file mode 100644 index 000000000..80a5b96be --- /dev/null +++ b/src/test/java/com/hankcs/book/ch02/NaiveDictionaryBasedSegmentation.java @@ -0,0 +1,236 @@ +/* + * Han He + * me@hankcs.com + * 2018-05-22 8:46 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch02; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreDictionary; + +import java.io.IOException; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import java.util.TreeMap; + +/** + * 《自然语言处理入门》2.3 Segmentation Algorithms + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class NaiveDictionaryBasedSegmentation +{ + public static void main(String[] args) throws IOException + { + // load the dictionary + TreeMap<String, CoreDictionary.Attribute> dictionary = + IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt"); + System.out.printf("词典大小:%d个词条\n", dictionary.size()); + System.out.println(dictionary.keySet().iterator().next()); + // full segmentation + System.out.println(segmentFully("就读北京大学", dictionary)); + // forward longest matching + System.out.println(segmentForwardLongest("就读北京大学", dictionary)); + System.out.println(segmentForwardLongest("研究生命起源", dictionary)); + System.out.println(segmentForwardLongest("项目的研究", dictionary)); + // backward longest matching + System.out.println(segmentBackwardLongest("研究生命起源", dictionary)); + System.out.println(segmentBackwardLongest("项目的研究", dictionary)); + // bidirectional longest matching + String[] text = new String[]{ + "项目的研究", + "商品和服务", + "研究生命起源", + "当下雨天地面积水", + "结婚的和尚未结婚的", + "欢迎新老师生前来就餐", + }; + for (int i = 0; i < text.length; i++) + { + System.out.printf("| %d | %s | %s | %s | %s |\n", i + 1, text[i], + segmentForwardLongest(text[i], dictionary), + segmentBackwardLongest(text[i], dictionary), + segmentBidirectional(text[i], dictionary) + ); + } + + evaluateSpeed(dictionary); + } + + /** + * Evaluate speed + * + * @param dictionary the dictionary + */ + public static void evaluateSpeed(Map<String, CoreDictionary.Attribute> dictionary) + { + String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原"; + long start; + double costTime; + final int pressure = 10000; + + System.out.println("正向最长"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + segmentForwardLongest(text, dictionary); + } + 
costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + + System.out.println("逆向最长"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + segmentBackwardLongest(text, dictionary); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + + System.out.println("双向最长"); + start = System.currentTimeMillis(); + for (int i = 0; i < pressure; ++i) + { + segmentBidirectional(text, dictionary); + } + costTime = (System.currentTimeMillis() - start) / (double) 1000; + System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime); + } + + /** + * Fully-segmenting Chinese word segmentation + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentFully(String text, Map<String, CoreDictionary.Attribute> dictionary) + { + List<String> wordList = new LinkedList<String>(); + for (int i = 0; i < text.length(); ++i) + { + for (int j = i + 1; j <= text.length(); ++j) + { + String word = text.substring(i, j); + if (dictionary.containsKey(word)) + { + wordList.add(word); + } + } + } + return wordList; + } + + /** + * Forward longest-matching Chinese word segmentation + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentForwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary) + { + List<String> wordList = new LinkedList<String>(); + for (int i = 0; i < text.length(); ) + { + String longestWord = text.substring(i, i + 1); + for (int j = i + 1; j <= text.length(); ++j) + { + String word = text.substring(i, j); + if (dictionary.containsKey(word)) + { + if (word.length() > longestWord.length()) + { + longestWord = word; + } + } + } + wordList.add(longestWord); + i += longestWord.length(); + } + return wordList; + } + + /** + * Backward longest-matching Chinese word segmentation + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentBackwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary) + { + List<String> wordList = new 
LinkedList<String>(); + for (int i = text.length() - 1; i >= 0; ) + { + String longestWord = text.substring(i, i + 1); + for (int j = 0; j <= i; ++j) + { + String word = text.substring(j, i + 1); + if (dictionary.containsKey(word)) + { + if (word.length() > longestWord.length()) + { + longestWord = word; + break; + } + } + } + wordList.add(0, longestWord); + i -= longestWord.length(); + } + return wordList; + } + + /** + * Count the single-character words in a segmentation result + * + * @param wordList segmentation result + * @return the number of single-character words + */ + public static int countSingleChar(List<String> wordList) + { + int size = 0; + for (String word : wordList) + { + if (word.length() == 1) + ++size; + } + return size; + } + + /** + * Bidirectional longest-matching Chinese word segmentation + * + * @param text the text to segment + * @param dictionary the dictionary + * @return the word list + */ + public static List<String> segmentBidirectional(String text, Map<String, CoreDictionary.Attribute> dictionary) + { + List<String> forwardLongest = segmentForwardLongest(text, dictionary); + List<String> backwardLongest = segmentBackwardLongest(text, dictionary); + if (forwardLongest.size() < backwardLongest.size()) + return forwardLongest; + else if (forwardLongest.size() > backwardLongest.size()) + return backwardLongest; + else + { + if (countSingleChar(forwardLongest) < countSingleChar(backwardLongest)) + return forwardLongest; + else + return backwardLongest; + } + } + +} diff --git a/src/test/java/com/hankcs/book/ch03/DemoAdjustModel.java b/src/test/java/com/hankcs/book/ch03/DemoAdjustModel.java new file mode 100644 index 000000000..a9b2c543b --- /dev/null +++ b/src/test/java/com/hankcs/book/ch03/DemoAdjustModel.java @@ -0,0 +1,39 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-07 11:38 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch03; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.seg.Segment; + +import static com.hankcs.book.ch03.DemoNgramSegment.MSR_MODEL_PATH; +import static com.hankcs.book.ch03.DemoNgramSegment.loadBigram; + +/** + * 《自然语言处理入门》3.5 Evaluation + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class DemoAdjustModel +{ + public static void main(String[] args) + { + Segment segment = loadBigram(MSR_MODEL_PATH, false, false); + assert CoreDictionary.contains("管道"); + String text = "北京输气管道工程"; + HanLP.Config.enableDebug(); + System.out.println(segment.seg(text)); + } +} diff --git a/src/test/java/com/hankcs/book/ch03/DemoCorpusLoader.java b/src/test/java/com/hankcs/book/ch03/DemoCorpusLoader.java new file mode 100644 index 000000000..98267bbaf --- /dev/null +++ b/src/test/java/com/hankcs/book/ch03/DemoCorpusLoader.java @@ -0,0 +1,53 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-06 9:39 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch03; + +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.io.IOUtil; + +import java.util.List; + +/** + * 《自然语言处理入门》3.3 Training + * Companion book: http://nlp.hankcs.com/book.php + * Q&A forum: https://bbs.hankcs.com/ + * @author hankcs + * @see 《自然语言处理入门》 + * @see Q&A forum + */ +public class DemoCorpusLoader +{ + public static String MY_CWS_CORPUS_PATH = "data/test/my_cws_corpus.txt"; + public static void main(String[] args) + { + List<List<IWord>> sentenceList = CorpusLoader.convert2SentenceList(MY_CWS_CORPUS_PATH); + for (List<IWord> sentence : sentenceList) + { + System.out.println(sentence); +// for (IWord word : sentence) +// { +// System.out.println(word); +// } +// System.out.println(); + } + } + + static + { + if (!IOUtil.isFileExisted(MY_CWS_CORPUS_PATH)) + { + IOUtil.saveTxt(MY_CWS_CORPUS_PATH, "商品 和 服务\n" + + "商品 和服 物美价廉\n" + + "服务 和 货币"); + } + } +} diff --git a/src/test/java/com/hankcs/book/ch03/DemoCustomDictionary.java b/src/test/java/com/hankcs/book/ch03/DemoCustomDictionary.java new file mode 100644 index 000000000..277dc326c --- /dev/null +++ b/src/test/java/com/hankcs/book/ch03/DemoCustomDictionary.java @@ -0,0 +1,40 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-07 2:24 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch03; + +import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.Viterbi.ViterbiSegment; + +/** + * 《自然语言处理入门》3.4 预测 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoCustomDictionary +{ + public static void main(String[] args) + { + Segment segment = new ViterbiSegment(); + final String sentence = "社会摇摆简称社会摇"; + segment.enableCustomDictionary(false); + System.out.println("不挂载词典:" + segment.seg(sentence)); + CustomDictionary.insert("社会摇", "nz 100"); + segment.enableCustomDictionary(true); + System.out.println("低优先级词典:" + segment.seg(sentence)); + segment.enableCustomDictionaryForcing(true); + System.out.println("高优先级词典:" + segment.seg(sentence)); + } +} diff --git a/src/test/java/com/hankcs/book/ch03/DemoJapaneseSegment.java b/src/test/java/com/hankcs/book/ch03/DemoJapaneseSegment.java new file mode 100644 index 000000000..f640a02aa --- /dev/null +++ b/src/test/java/com/hankcs/book/ch03/DemoJapaneseSegment.java @@ -0,0 +1,39 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-07 10:50 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch03; + +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.utility.TestUtility; + +import static com.hankcs.book.ch03.DemoNgramSegment.loadBigram; +import static com.hankcs.book.ch03.DemoNgramSegment.trainBigram; + +/** + * 《自然语言处理入门》3.6 日语分词 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoJapaneseSegment +{ + static final String CORPUS_PATH = TestUtility.ensureTestData("jpcorpus", "http://file.hankcs.com/corpus/jpcorpus.zip") + "/ja_gsd-ud-train.txt"; + static final String MODEL_PATH = "data/test/jpcorpus/jp_bigram"; + + public static void main(String[] args) + { + trainBigram(CORPUS_PATH, MODEL_PATH); + Segment segment = loadBigram(MODEL_PATH, false, true); // data/test/jpcorpus/jp_bigram + System.out.println(segment.seg("自然言語処理入門という本が面白いぞ!")); + } +} diff --git a/src/test/java/com/hankcs/book/ch03/DemoNgramSegment.java b/src/test/java/com/hankcs/book/ch03/DemoNgramSegment.java new file mode 100644 index 000000000..229487ebb --- /dev/null +++ b/src/test/java/com/hankcs/book/ch03/DemoNgramSegment.java @@ -0,0 +1,123 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-06 11:11 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch03; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.corpus.dictionary.NatureDictionaryMaker; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.Viterbi.ViterbiSegment; + +import java.util.List; + +import static com.hankcs.book.ch03.DemoCorpusLoader.MY_CWS_CORPUS_PATH; + + +/** + * 《自然语言处理入门》3.3 训练 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoNgramSegment +{ + public static final String MY_MODEL_PATH = "data/test/my_cws_model"; + public static final String MSR_MODEL_PATH = MSR.MODEL_PATH + "_ngram"; + + public static void main(String[] args) + { + trainBigram(MY_CWS_CORPUS_PATH, MY_MODEL_PATH); + loadBigram(MY_MODEL_PATH); + trainBigram(MSR.TRAIN_PATH, MSR_MODEL_PATH); + loadBigram(MSR_MODEL_PATH); + } + + /** + * 训练bigram模型 + * + * @param corpusPath 语料库路径 + * @param modelPath 模型保存路径 + */ + public static void trainBigram(String corpusPath, String modelPath) + { + List<List<IWord>> sentenceList = CorpusLoader.convert2SentenceList(corpusPath); + for (List<IWord> sentence : sentenceList) + for (IWord word : sentence) + if (word.getLabel() == null) word.setLabel("n"); // 赋予每个单词一个虚拟的名词词性 + final NatureDictionaryMaker dictionaryMaker = new NatureDictionaryMaker(); + dictionaryMaker.compute(sentenceList); + dictionaryMaker.saveTxtTo(modelPath); + } + + public static Segment loadBigram(String modelPath) + { + return loadBigram(modelPath, true, true); + } + + /** + * 加载bigram模型 + * + * @param modelPath 模型路径 + * @param
verbose 输出调试信息 + * @param viterbi 是否创建viterbi分词器 + * @return 分词器 + */ + public static Segment loadBigram(String modelPath, boolean verbose, boolean viterbi) + { +// HanLP.Config.enableDebug(); + HanLP.Config.CoreDictionaryPath = modelPath + ".txt"; + HanLP.Config.BiGramDictionaryPath = modelPath + ".ngram.txt"; + CoreDictionary.reload(); + CoreBiGramTableDictionary.reload(); + // 以下部分为兼容新标注集,不感兴趣可以跳过 + HanLP.Config.CoreDictionaryTransformMatrixDictionaryPath = modelPath + ".tr.txt"; + if (!modelPath.equals(MSR_MODEL_PATH)) + { + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(HanLP.Config.CoreDictionaryTransformMatrixDictionaryPath); + if (lineIterator.hasNext()) + { + for (String tag : lineIterator.next().split(",")) + { + if (!tag.trim().isEmpty()) + { + Nature.create(tag); + } + } + } + } + CoreDictionary.getTermFrequency("商品"); + CoreBiGramTableDictionary.getBiFrequency("商品", "和"); + // 兼容代码结束 + if (verbose) + { + HanLP.Config.ShowTermNature = false; + System.out.println("【商品】的词频:" + CoreDictionary.getTermFrequency("商品")); + System.out.println("【商品@和】的频次:" + CoreBiGramTableDictionary.getBiFrequency("商品", "和")); + Segment segment = new DijkstraSegment() + .enableAllNamedEntityRecognize(false)// 禁用命名实体识别 + .enableCustomDictionary(false); // 禁用用户词典 + System.out.println(segment.seg("商品和服务")); +// System.out.println(segment.seg("货币和服务")); + } + return viterbi ? new ViterbiSegment().enableAllNamedEntityRecognize(false).enableCustomDictionary(false) : + new DijkstraSegment().enableAllNamedEntityRecognize(false).enableCustomDictionary(false); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/book/ch03/EvaluateBigram.java b/src/test/java/com/hankcs/book/ch03/EvaluateBigram.java new file mode 100644 index 000000000..82ea1df9b --- /dev/null +++ b/src/test/java/com/hankcs/book/ch03/EvaluateBigram.java @@ -0,0 +1,41 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-07 3:41 PM + * + * + * Copyright (c) 2018, Han He. 
All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch03; + +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.CWSEvaluator; + +import java.io.IOException; + +import static com.hankcs.book.ch03.DemoNgramSegment.*; +import static com.hankcs.hanlp.seg.common.CWSEvaluator.evaluate; + +/** + * 《自然语言处理入门》3.5 评测 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class EvaluateBigram +{ + + public static void main(String[] args) throws IOException + { + trainBigram(MSR.TRAIN_PATH, MSR_MODEL_PATH); + Segment segment = loadBigram(MSR_MODEL_PATH); + CWSEvaluator.Result result = evaluate(segment, MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, MSR.TRAIN_WORDS); + System.out.println(result); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/book/ch04/CWS_HMM.java b/src/test/java/com/hankcs/book/ch04/CWS_HMM.java new file mode 100644 index 000000000..25c0667c9 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch04/CWS_HMM.java @@ -0,0 +1,55 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-13 3:11 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch04; + +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModel; +import com.hankcs.hanlp.model.hmm.HMMSegmenter; +import com.hankcs.hanlp.model.hmm.HiddenMarkovModel; +import com.hankcs.hanlp.model.hmm.SecondOrderHiddenMarkovModel; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.CWSEvaluator; + +import java.io.IOException; + +/** + * 《自然语言处理入门》4.6 隐马尔可夫模型应用于中文分词 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * 演示一阶和二阶隐马尔可夫模型用于序列标注问题之中文分词 + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class CWS_HMM +{ + public static void main(String[] args) throws IOException + { + trainAndEvaluate(new FirstOrderHiddenMarkovModel()); + trainAndEvaluate(new SecondOrderHiddenMarkovModel()); + } + + public static void trainAndEvaluate(HiddenMarkovModel model) throws IOException + { + Segment hmm = trainHMM(model); + CWSEvaluator.Result result = CWSEvaluator.evaluate(hmm, MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, MSR.TRAIN_WORDS); + System.out.println(result); + } + + private static Segment trainHMM(HiddenMarkovModel model) throws IOException + { + HMMSegmenter segmenter = new HMMSegmenter(model); + segmenter.train(MSR.TRAIN_PATH); + System.out.println(segmenter.segment("商品和服务")); + return segmenter.toSegment(); + } +} diff --git a/src/test/java/com/hankcs/book/ch05/CheapFeatureClassifier.java b/src/test/java/com/hankcs/book/ch05/CheapFeatureClassifier.java new file mode 100644 index 000000000..f192a6361 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch05/CheapFeatureClassifier.java @@ -0,0 +1,41 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-21 7:30 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch05; + +import com.hankcs.hanlp.model.perceptron.PerceptronNameGenderClassifier; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; + +import java.util.LinkedList; +import java.util.List; + +/** + * 《自然语言处理入门》5.3 基于感知机的人名性别分类 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class CheapFeatureClassifier extends PerceptronNameGenderClassifier +{ + @Override + protected List<Integer> extractFeature(String text, FeatureMap featureMap) + { + List<Integer> featureList = new LinkedList<Integer>(); + String givenName = extractGivenName(text); + // 特征模板1:g[0],与位置无关 + addFeature(givenName.substring(0, 1), featureMap, featureList); + // 特征模板2:g[1],与位置无关 + addFeature(givenName.substring(1), featureMap, featureList); + return featureList; + } +} diff --git a/src/test/java/com/hankcs/book/ch05/DemoPerceptronCWS.java b/src/test/java/com/hankcs/book/ch05/DemoPerceptronCWS.java new file mode 100644 index 000000000..878349c6e --- /dev/null +++ b/src/test/java/com/hankcs/book/ch05/DemoPerceptronCWS.java @@ -0,0 +1,70 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-22 3:15 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information.
+ * + */ +package com.hankcs.book.ch05; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.model.perceptron.CWSTrainer; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.CWSEvaluator; + +import java.io.IOException; + +/** + * 《自然语言处理入门》5.6 基于结构化感知机的中文分词 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoPerceptronCWS +{ + public static void main(String[] args) throws IOException + { + Segment segment = train(); + String[] sents = new String[]{ + "王思斌,男,1949年10月生。", + "山东桓台县起凤镇穆寨村妇女穆玲英", + "现为中国艺术研究院中国文化研究所研究员。", + "我们的父母重男轻女", + "北京输气管道工程", + }; + for (String sent : sents) + { + System.out.println(segment.seg(sent)); + } +// trainUncompressedModel(); + } + + public static Segment train() throws IOException + { + LinearModel model = new CWSTrainer().train(MSR.TRAIN_PATH, MSR.MODEL_PATH).getModel(); // 训练模型 + Segment segment = new PerceptronLexicalAnalyzer(model).enableCustomDictionary(false); // 创建分词器 + System.out.println(CWSEvaluator.evaluate(segment, MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, MSR.TRAIN_WORDS)); // 标准化评测 + return segment; + } + + private static Segment trainUncompressedModel() throws IOException + { + LinearModel model = new CWSTrainer().train(MSR.TRAIN_PATH, MSR.TRAIN_PATH, MSR.MODEL_PATH, 0., 10, 8).getModel(); + model.save(MSR.MODEL_PATH, model.featureMap.entrySet(), 0, true); // 最后一个参数指定导出txt + return new PerceptronLexicalAnalyzer(model).enableCustomDictionary(false); + } + + static + { + HanLP.Config.ShowTermNature = false; + } +} diff --git a/src/test/java/com/hankcs/book/ch05/EvaluatePerceptronCWS.java b/src/test/java/com/hankcs/book/ch05/EvaluatePerceptronCWS.java new file mode 100644 index 000000000..5c1ae397f --- /dev/null +++ 
b/src/test/java/com/hankcs/book/ch05/EvaluatePerceptronCWS.java @@ -0,0 +1,53 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-22 3:15 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch05; + +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.model.perceptron.CWSTrainer; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.CWSEvaluator; + +import java.io.IOException; + +/** + * 《自然语言处理入门》5.6 基于结构化感知机的中文分词 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class EvaluatePerceptronCWS +{ + public static Segment trainStructuredPerceptron() throws IOException + { + LinearModel model = new CWSTrainer().train(MSR.TRAIN_PATH, MSR.TRAIN_PATH, MSR.MODEL_PATH, 0., 10, 8).getModel(); + return new PerceptronLexicalAnalyzer(model).enableCustomDictionary(false); + } + + public static Segment trainAveragedPerceptron() throws IOException + { + // 线程数为1时自动用平均感知机算法 + LinearModel model = new CWSTrainer().train(MSR.TRAIN_PATH, MSR.TRAIN_PATH, MSR.MODEL_PATH, 0., 10, 1).getModel(); + return new PerceptronLexicalAnalyzer(model).enableCustomDictionary(false); + } + + public static void main(String[] args) throws IOException + { + System.out.println("结构化感知机"); + System.out.println(CWSEvaluator.evaluate(trainStructuredPerceptron(), MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, MSR.TRAIN_WORDS)); + System.out.println("平均感知机"); + System.out.println(CWSEvaluator.evaluate(trainAveragedPerceptron(), MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, MSR.TRAIN_WORDS)); + } +} diff --git a/src/test/java/com/hankcs/book/ch05/FeatureEngineering.java 
b/src/test/java/com/hankcs/book/ch05/FeatureEngineering.java new file mode 100644 index 000000000..7ddb7a16c --- /dev/null +++ b/src/test/java/com/hankcs/book/ch05/FeatureEngineering.java @@ -0,0 +1,101 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-25 3:44 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch05; + +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.model.perceptron.CWSTrainer; +import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; +import com.hankcs.hanlp.model.perceptron.instance.CWSInstance; +import com.hankcs.hanlp.model.perceptron.instance.Instance; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.model.perceptron.utility.Utility; + +import java.io.IOException; +import java.util.List; + +/** + * 《自然语言处理入门》5.6.7 中文分词特征工程 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class FeatureEngineering +{ + public static void main(String[] args) throws IOException + { + CWSTrainer trainer = new CWSTrainer() + { + @Override + protected Instance createInstance(Sentence sentence, FeatureMap featureMap) + { + return createMyCWSInstance(sentence, featureMap); + } + }; + LinearModel model = trainer.train(MSR.TRAIN_PATH, MSR.MODEL_PATH).getModel(); +// LinearModel model = new LinearModel(MSR.MODEL_PATH); + PerceptronSegmenter segmenter = new PerceptronSegmenter(model) + { + @Override + protected Instance createInstance(Sentence sentence, FeatureMap featureMap) + { + return createMyCWSInstance(sentence, featureMap); + } + }; + System.out.println(segmenter.segment("叠字特征帮助识别张文文李冰冰")); + 
} + + private static Instance createMyCWSInstance(Sentence sentence, FeatureMap mutableFeatureMap) + { + List<Word> wordList = sentence.toSimpleWordList(); + String[] termArray = Utility.toWordArray(wordList); + Instance instance = new MyCWSInstance(termArray, mutableFeatureMap); + return instance; + } + + /** + * @author hankcs + */ + public static class MyCWSInstance extends CWSInstance + { + @Override + protected int[] extractFeature(String sentence, FeatureMap featureMap, int position) + { + int[] defaultFeatures = super.extractFeature(sentence, featureMap, position); + char preChar = position >= 1 ? sentence.charAt(position - 1) : '_'; + String myFeature = preChar == sentence.charAt(position) ? "Y" : "N"; // 叠字特征 + int id = featureMap.idOf(myFeature); + if (id != -1) + {// 将叠字特征放到默认特征向量的尾部 + int[] newFeatures = new int[defaultFeatures.length + 1]; + System.arraycopy(defaultFeatures, 0, newFeatures, 0, defaultFeatures.length); + newFeatures[defaultFeatures.length] = id; + return newFeatures; + } + return defaultFeatures; + } + + public MyCWSInstance(String[] termArray, FeatureMap featureMap) + { + super(termArray, featureMap); + } + + public MyCWSInstance(String sentence, FeatureMap featureMap) + { + super(sentence, featureMap); + } + } +} diff --git a/src/test/java/com/hankcs/book/ch05/NameGenderClassification.java b/src/test/java/com/hankcs/book/ch05/NameGenderClassification.java new file mode 100644 index 000000000..41f4b8193 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch05/NameGenderClassification.java @@ -0,0 +1,51 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-21 5:38 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information.
+ * + */ +package com.hankcs.book.ch05; + +import com.hankcs.hanlp.model.perceptron.PerceptronNameGenderClassifier; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; + +import java.io.IOException; + +import static com.hankcs.hanlp.model.perceptron.PerceptronNameGenderClassifierTest.*; + +/** + * 《自然语言处理入门》5.3 基于感知机的人名性别分类 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class NameGenderClassification +{ + public static void main(String[] args) throws IOException + { + trainAndEvaluate("简单特征模板", new CheapFeatureClassifier(), false); + trainAndEvaluate("简单特征模板", new CheapFeatureClassifier(), true); + + trainAndEvaluate("标准特征模板", new PerceptronNameGenderClassifier(), false); + trainAndEvaluate("标准特征模板", new PerceptronNameGenderClassifier(), true); + + trainAndEvaluate("复杂特征模板", new RichFeatureClassifier(), false); + trainAndEvaluate("复杂特征模板", new RichFeatureClassifier(), true); + } + + private static void trainAndEvaluate(String template, PerceptronNameGenderClassifier classifier, boolean averagePerceptron) throws IOException + { + String algorithm = averagePerceptron ? "平均感知机算法" : "朴素感知机算法"; + System.out.println("训练集准确率:" + classifier.train(TRAINING_SET, 10, averagePerceptron)); + LinearModel model = classifier.getModel(); + System.out.printf("特征数量:%d\n", model.parameter.length); + System.out.printf("%s+%s 测试集准确率:%s\n", algorithm, template, classifier.evaluate(TESTING_SET)); + } +} diff --git a/src/test/java/com/hankcs/book/ch05/OnlineLearning.java b/src/test/java/com/hankcs/book/ch05/OnlineLearning.java new file mode 100644 index 000000000..d29b3616e --- /dev/null +++ b/src/test/java/com/hankcs/book/ch05/OnlineLearning.java @@ -0,0 +1,52 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-25 1:47 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch05; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.seg.Segment; + +import java.io.IOException; + +/** + * 《自然语言处理入门》5.6 基于结构化感知机的中文分词 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class OnlineLearning +{ + public static void main(String[] args) throws IOException + { + HanLP.Config.ShowTermNature = false; + PerceptronLexicalAnalyzer segment = new PerceptronLexicalAnalyzer(MSR.MODEL_PATH); + segment.enableCustomDictionary(false); + String text = "与川普通电话"; + System.out.println(segment.seg(text)); + + CustomDictionary.insert("川普", "nrf 1"); + segment.enableCustomDictionaryForcing(true); + System.out.println(segment.seg(text)); + + System.out.println(segment.seg("银川普通人与川普通电话讲四川普通话")); + + segment.enableCustomDictionary(false); + for (int i = 0; i < 3; ++i) + segment.learn("人 与 川普 通电话"); + System.out.println(segment.seg("银川普通人与川普通电话讲四川普通话")); + } + +} diff --git a/src/test/java/com/hankcs/book/ch05/RichFeatureClassifier.java b/src/test/java/com/hankcs/book/ch05/RichFeatureClassifier.java new file mode 100644 index 000000000..e29bc11f0 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch05/RichFeatureClassifier.java @@ -0,0 +1,45 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-21 7:30 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch05; + +import com.hankcs.hanlp.model.perceptron.PerceptronNameGenderClassifier; +import com.hankcs.hanlp.model.perceptron.feature.FeatureMap; + +import java.util.LinkedList; +import java.util.List; + +/** + * 《自然语言处理入门》5.3 基于感知机的人名性别分类 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class RichFeatureClassifier extends PerceptronNameGenderClassifier +{ + @Override + protected List<Integer> extractFeature(String text, FeatureMap featureMap) + { + List<Integer> featureList = new LinkedList<Integer>(); + String givenName = extractGivenName(text); + // 特征模板1:g[0] + addFeature("1" + givenName.substring(0, 1), featureMap, featureList); + // 特征模板2:g[1] + addFeature("2" + givenName.substring(1), featureMap, featureList); + // 特征模板3:g + addFeature("3" + givenName, featureMap, featureList); + // 偏置特征(代表标签的先验分布,当样本不均衡时有用,但此处的男女预测无用) +// addFeature("b", featureMap, featureList); + return featureList; + } +} diff --git a/src/test/java/com/hankcs/book/ch06/CrfppTrainHanLPLoad.java b/src/test/java/com/hankcs/book/ch06/CrfppTrainHanLPLoad.java new file mode 100644 index 000000000..d01e6700e --- /dev/null +++ b/src/test/java/com/hankcs/book/ch06/CrfppTrainHanLPLoad.java @@ -0,0 +1,52 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-26 2:27 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information.
+ * + */ +package com.hankcs.book.ch06; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.crf.CRFSegmenter; + +import java.io.IOException; + +/** + * 《自然语言处理入门》6.3 条件随机场工具包 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class CrfppTrainHanLPLoad +{ + public static final String TXT_CORPUS_PATH = "data/test/my_cws_corpus.txt"; + public static final String TSV_CORPUS_PATH = TXT_CORPUS_PATH + ".tsv"; + public static final String TEMPLATE_PATH = "data/test/cws-template.txt"; + public static final String CRF_MODEL_PATH = "data/test/crf-cws-model"; + public static final String CRF_MODEL_TXT_PATH = "data/test/crf-cws-model.txt"; + + public static void main(String[] args) throws IOException + { + if (IOUtil.isFileExisted(CRF_MODEL_TXT_PATH)) + { + CRFSegmenter segmenter = new CRFSegmenter(CRF_MODEL_TXT_PATH); + System.out.println(segmenter.segment("商品和服务")); + } + else + { + CRFSegmenter segmenter = new CRFSegmenter(null); // 创建空白分词器 + segmenter.convertCorpus(TXT_CORPUS_PATH, TSV_CORPUS_PATH); // 执行转换 + segmenter.dumpTemplate(TEMPLATE_PATH); + System.out.printf("语料已转换为 %s ,特征模板已导出为 %s\n", TSV_CORPUS_PATH, TEMPLATE_PATH); + System.out.printf("请安装CRF++后执行 crf_learn -f 3 -c 4.0 %s %s %s -t\n", TEMPLATE_PATH, TSV_CORPUS_PATH, CRF_MODEL_PATH); + System.out.printf("或者执行移植版 java -cp hanlp.jar com.hankcs.hanlp.model.crf.crfpp.crf_learn -f 3 -c 4.0 %s %s %s -t\n", TEMPLATE_PATH, TSV_CORPUS_PATH, CRF_MODEL_PATH); + } + } +} diff --git a/src/test/java/com/hankcs/book/ch06/EvaluateCRFCWS.java b/src/test/java/com/hankcs/book/ch06/EvaluateCRFCWS.java new file mode 100644 index 000000000..664e646cb --- /dev/null +++ b/src/test/java/com/hankcs/book/ch06/EvaluateCRFCWS.java @@ -0,0 +1,49 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-26 3:14 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. 
Please contact Han He for more information. + * + */ +package com.hankcs.book.ch06; + +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer; +import com.hankcs.hanlp.model.crf.CRFSegmenter; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.CWSEvaluator; + +import java.io.IOException; + +import static com.hankcs.book.ch06.CrfppTrainHanLPLoad.CRF_MODEL_PATH; +import static com.hankcs.book.ch06.CrfppTrainHanLPLoad.CRF_MODEL_TXT_PATH; + +/** + * 《自然语言处理入门》6.4 HanLP 中的 CRF++ API + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class EvaluateCRFCWS +{ + public static Segment train(String corpus) throws IOException + { + CRFSegmenter segmenter = new CRFSegmenter(null); + segmenter.train(corpus, CRF_MODEL_PATH); + return new CRFLexicalAnalyzer(segmenter); + // 训练完毕时,可传入txt格式的模型(不可传入CRF++的二进制模型,不兼容!) +// return new CRFLexicalAnalyzer(CRF_MODEL_TXT_PATH).enableCustomDictionary(false); + } + + public static void main(String[] args) throws IOException + { + Segment segment = train(MSR.TRAIN_PATH); + System.out.println(CWSEvaluator.evaluate(segment, MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, MSR.TRAIN_WORDS)); // 标准化评测 + } +} diff --git a/src/test/java/com/hankcs/book/ch07/CustomCorpusPOS.java b/src/test/java/com/hankcs/book/ch07/CustomCorpusPOS.java new file mode 100644 index 000000000..f027bfc2b --- /dev/null +++ b/src/test/java/com/hankcs/book/ch07/CustomCorpusPOS.java @@ -0,0 +1,46 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-06 1:36 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch07; + +import com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger; +import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.IOException; + +import static com.hankcs.book.ch07.EvaluatePOS.trainPerceptronPOS; + +/** + * 《自然语言处理入门》7.4 自定义词性 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class CustomCorpusPOS +{ + /** + * 诛仙语料库 + * Zhang, Meishan and Zhang, Yue and Che, Wanxiang and Liu, Ting + * Type-Supervised Domain Adaptation for Joint Segmentation and POS-Tagging + */ + public static final String ZHUXIAN = TestUtility.ensureTestData("zhuxian", "http://file.hankcs.com/corpus/zhuxian.zip") + "/train.txt"; + + public static void main(String[] args) throws IOException + { + PerceptronPOSTagger posTagger = trainPerceptronPOS(ZHUXIAN); // 训练 + AbstractLexicalAnalyzer analyzer = new AbstractLexicalAnalyzer(new PerceptronSegmenter(), posTagger); // 包装 + System.out.println(analyzer.analyze("陆雪琪的天琊神剑不做丝毫退避,直冲而上,瞬间,这两道奇光异宝撞到了一起。")); // 分词+标注 + } +} diff --git a/src/test/java/com/hankcs/book/ch07/CustomPOS.java b/src/test/java/com/hankcs/book/ch07/CustomPOS.java new file mode 100644 index 000000000..da9b657bb --- /dev/null +++ b/src/test/java/com/hankcs/book/ch07/CustomPOS.java @@ -0,0 +1,38 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-05 4:03 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch07; + +import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; + +import java.io.IOException; + +/** + * 《自然语言处理入门》7.4 自定义词性 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class CustomPOS +{ + public static void main(String[] args) throws IOException + { + CustomDictionary.insert("苹果", "手机品牌 1"); + CustomDictionary.insert("iPhone X", "手机型号 1"); + PerceptronLexicalAnalyzer analyzer = new PerceptronLexicalAnalyzer(); + analyzer.enableCustomDictionaryForcing(true); + System.out.println(analyzer.analyze("你们苹果iPhone X保修吗?")); + System.out.println(analyzer.analyze("多吃苹果有益健康")); + } +} diff --git a/src/test/java/com/hankcs/book/ch07/EvaluatePOS.java b/src/test/java/com/hankcs/book/ch07/EvaluatePOS.java new file mode 100644 index 000000000..7b2fe412e --- /dev/null +++ b/src/test/java/com/hankcs/book/ch07/EvaluatePOS.java @@ -0,0 +1,70 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-05 1:43 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch07; + +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.dependency.nnparser.util.PosTagUtil; +import com.hankcs.hanlp.model.crf.CRFPOSTagger; +import com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModel; +import com.hankcs.hanlp.model.hmm.HMMPOSTagger; +import com.hankcs.hanlp.model.hmm.HiddenMarkovModel; +import com.hankcs.hanlp.model.hmm.SecondOrderHiddenMarkovModel; +import com.hankcs.hanlp.model.perceptron.POSTrainer; +import com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger; +import com.hankcs.hanlp.model.perceptron.PerceptronTrainer; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; + +import java.io.File; +import java.io.IOException; + +/** + * 《自然语言处理入门》7.3 序列标注模型应用于词性标注 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class EvaluatePOS +{ + static HMMPOSTagger trainHMM(String corpus, HiddenMarkovModel model) throws IOException + { + HMMPOSTagger tagger = new HMMPOSTagger(model); + tagger.train(corpus); + return tagger; + } + + static PerceptronPOSTagger trainPerceptronPOS(String corpus) throws IOException + { + PerceptronTrainer trainer = new POSTrainer(); + LinearModel model = trainer.train(corpus, File.createTempFile("hanlp", "pos.bin").getAbsolutePath()).getModel(); + return new PerceptronPOSTagger(model); + } + + static CRFPOSTagger trainCRFPOS(String corpus) throws IOException + { + CRFPOSTagger tagger = new CRFPOSTagger(null); + String modelPath = "data/test/pku98/pos.bin"; + tagger.train(corpus, modelPath); + // 或者加载CRF++训练得到的pos.bin.txt +// return new CRFPOSTagger(modelPath + ".txt"); + return new CRFPOSTagger(modelPath); + } + + public static void main(String[] args) throws IOException + { + System.out.printf("一阶HMM\t%.2f%%\n", PosTagUtil.evaluate(trainHMM(PKU.PKU199801_TRAIN, new FirstOrderHiddenMarkovModel()), PKU.PKU199801_TEST)); + System.out.printf("二阶HMM\t%.2f%%\n", 
PosTagUtil.evaluate(trainHMM(PKU.PKU199801_TRAIN, new SecondOrderHiddenMarkovModel()), PKU.PKU199801_TEST)); + System.out.printf("感知机\t%.2f%%\n", PosTagUtil.evaluate(trainPerceptronPOS(PKU.PKU199801_TRAIN), PKU.PKU199801_TEST)); + System.out.printf("CRF\t%.2f%%\n", PosTagUtil.evaluate(trainCRFPOS(PKU.PKU199801_TRAIN), PKU.PKU199801_TEST)); + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoCRFNER.java b/src/test/java/com/hankcs/book/ch08/DemoCRFNER.java new file mode 100644 index 000000000..94c5d6184 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoCRFNER.java @@ -0,0 +1,49 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-29 4:18 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.crf.CRFNERecognizer; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.io.IOException; + +import static com.hankcs.book.ch08.DemoHMMNER.test; + + +/** + * 《自然语言处理入门》8.5.4 基于条件随机场序列标注的命名实体识别 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoCRFNER +{ + public static void main(String[] args) throws IOException + { + NERecognizer recognizer = train(PKU.PKU199801_TRAIN, PKU.NER_MODEL); + test(recognizer); + } + + public static NERecognizer train(String corpus, String model) throws IOException + { + if (IOUtil.isFileExisted(model + ".txt")) // 若存在CRF++训练结果,则直接加载 + return new CRFNERecognizer(model + ".txt"); + CRFNERecognizer recognizer = new CRFNERecognizer(null); // 空白 + recognizer.train(corpus, model); + recognizer = new CRFNERecognizer(model + ".txt"); + return recognizer; + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoCRFNERPlane.java 
b/src/test/java/com/hankcs/book/ch08/DemoCRFNERPlane.java new file mode 100644 index 000000000..9c04340f9 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoCRFNERPlane.java @@ -0,0 +1,55 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-29 4:18 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.crf.CRFNERecognizer; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.io.IOException; + +import static com.hankcs.book.ch08.DemoPlane.PLANE_CORPUS; +import static com.hankcs.book.ch08.DemoPlane.PLANE_MODEL; + + +/** + * 《自然语言处理入门》8.6.2 训练领域模型 (书本之外的补充试验) + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoCRFNERPlane +{ + public static void main(String[] args) throws IOException + { + NERecognizer recognizer = train(PLANE_CORPUS, PLANE_MODEL); + String[] wordArray = {"歼", "-", "7", "战斗机", "正是", "仿照", "米格", "-", "21", "而", "制", "。"}; // 构造单词序列 + String[] posArray = {"v", "w", "w", "n", "d", "v", "nr", "w", "m", "c", "v", "w"}; // 构造词性序列 + String[] nerTagArray = recognizer.recognize(wordArray, posArray); // 序列标注 + for (int i = 0; i < wordArray.length; i++) + System.out.printf("%-4s\t%s\t%s\t\n", wordArray[i], posArray[i], nerTagArray[i]); + } + + public static NERecognizer train(String corpus, String model) throws IOException + { + if (IOUtil.isFileExisted(model + ".txt")) // 若存在CRF++训练结果,则直接加载 + return new CRFNERecognizer(model + ".txt"); + CRFNERecognizer recognizer = new CRFNERecognizer(null); // 空白 + recognizer.tagSet.nerLabels.clear(); // 不识别nr、ns、nt + recognizer.tagSet.nerLabels.add("np"); // 目标是识别np + recognizer.train(corpus, model); + recognizer = new CRFNERecognizer(model + ".txt"); + return recognizer; + } +} 
diff --git a/src/test/java/com/hankcs/book/ch08/DemoHMMNER.java b/src/test/java/com/hankcs/book/ch08/DemoHMMNER.java new file mode 100644 index 000000000..1cfcc3f2b --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoHMMNER.java @@ -0,0 +1,62 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-27 8:52 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.model.hmm.HMMNERecognizer; +import com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger; +import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter; +import com.hankcs.hanlp.model.perceptron.utility.Utility; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; +import com.hankcs.hanlp.tokenizer.lexical.LexicalAnalyzer; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.io.IOException; +import java.util.Map; + +/** + * 《自然语言处理入门》8.5.2 基于隐马尔可夫模型序列标注的命名实体识别 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoHMMNER +{ + public static void main(String[] args) throws IOException + { + NERecognizer recognizer = train(PKU.PKU199801_TRAIN); + test(recognizer); + } + + public static NERecognizer train(String corpus) throws IOException + { + HMMNERecognizer recognizer = new HMMNERecognizer(); + recognizer.train(corpus); // data/test/pku98/199801-train.txt + return recognizer; + } + + public static void test(NERecognizer recognizer) throws IOException + { + String[] wordArray = {"华北", "电力", "公司"}; // 构造单词序列 + String[] posArray = {"ns", "n", "n"}; // 构造词性序列 + String[] nerTagArray = recognizer.recognize(wordArray, posArray); // 序列标注 + for (int i = 0; i < wordArray.length; i++) + System.out.printf("%s\t%s\t%s\t\n", wordArray[i], posArray[i], nerTagArray[i]); + 
AbstractLexicalAnalyzer analyzer = new AbstractLexicalAnalyzer(new PerceptronSegmenter(), new PerceptronPOSTagger(), recognizer); + analyzer.enableCustomDictionary(false); + System.out.println(analyzer.analyze("华北电力公司董事长谭旭光和秘书胡花蕊来到美国纽约现代艺术博物馆参观")); + Map<String, double[]> scores = Utility.evaluateNER(recognizer, PKU.PKU199801_TEST); + Utility.printNERScore(scores); + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoNRF.java b/src/test/java/com/hankcs/book/ch08/DemoNRF.java new file mode 100644 index 000000000..5c7e1c1c5 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoNRF.java @@ -0,0 +1,43 @@ +/* + * Han He + * me@hankcs.com + * 2023-04-04 1:06 PM + * + * + * Copyright (c) 2023, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.seg.Segment; + +import java.io.BufferedWriter; +import java.io.IOException; + +/** + * @author hankcs + */ +public class DemoNRF +{ + public static void main(String[] args) throws IOException + { + HanLP.Config.enableDebug(); + String sentence = "我知道卡利斯勒出生于英格兰"; + Segment segment = new DijkstraSegment().enableTranslatedNameRecognize(true); + System.out.println(segment.seg(sentence)); + + if (CoreBiGramTableDictionary.getBiFrequency("未##人", "出生于") == 0) + { + BufferedWriter bw = IOUtil.newBufferedWriter(HanLP.Config.BiGramDictionaryPath, true); + bw.write("\n未##人@出生于 1\n"); + bw.close(); + CoreBiGramTableDictionary.reload(); + System.out.println(segment.seg(sentence)); + } + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoNumEng.java b/src/test/java/com/hankcs/book/ch08/DemoNumEng.java new file mode 100644 index 000000000..53bcf30bb --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoNumEng.java @@
-0,0 +1,40 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-24 4:41 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.dictionary.other.CharType; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.Viterbi.ViterbiSegment; + +/** + * 《自然语言处理入门》8.2.3 基于规则的数词英文识别 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoNumEng +{ + public static void main(String[] args) + { + Segment segment = new ViterbiSegment(); + System.out.println(segment.seg("牛奶三〇〇克壹佰块")); + System.out.println(segment.seg("牛奶300克100块")); + System.out.println(segment.seg("牛奶300g100rmb")); + // 演示自定义字符类型 + String text = "牛奶300~400g100rmb"; + System.out.println(segment.seg(text)); + CharType.type['~'] = CharType.CT_NUM; + System.out.println(segment.seg(text)); + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoPlane.java b/src/test/java/com/hankcs/book/ch08/DemoPlane.java new file mode 100644 index 000000000..167599ad3 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoPlane.java @@ -0,0 +1,46 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-29 8:49 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.model.perceptron.*; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.IOException; + +/** + * 《自然语言处理入门》8.6.2 训练领域模型 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoPlane +{ + static String PLANE_CORPUS = TestUtility.ensureTestData("plane-re", "http://file.hankcs.com/corpus/plane-re.zip") + "/train.txt"; + static String PLANE_MODEL = PLANE_CORPUS.replace("train.txt", "model.bin"); + + public static void main(String[] args) throws IOException + { + NERTrainer trainer = new NERTrainer(); + trainer.tagSet.nerLabels.clear(); // 不识别nr、ns、nt + trainer.tagSet.nerLabels.add("np"); // 目标是识别np + PerceptronNERecognizer recognizer = new PerceptronNERecognizer(trainer.train(PLANE_CORPUS, PLANE_MODEL).getModel()); + // 在NER预测前,需要一个分词器,最好训练自同源语料库 + LinearModel cwsModel = new CWSTrainer().train(PLANE_CORPUS, PLANE_MODEL.replace("model.bin", "cws.bin")).getModel(); + PerceptronSegmenter segmenter = new PerceptronSegmenter(cwsModel); + PerceptronLexicalAnalyzer analyzer = new PerceptronLexicalAnalyzer(segmenter, new PerceptronPOSTagger(), recognizer); + analyzer.enableTranslatedNameRecognize(false).enableCustomDictionary(false); + System.out.println(analyzer.analyze("米高扬设计米格-17PF:米格-17PF型战斗机比米格-17P性能更好。")); + System.out.println(analyzer.analyze("米格-阿帕奇-666S横空出世。")); } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoRoleTagNR.java b/src/test/java/com/hankcs/book/ch08/DemoRoleTagNR.java new file mode 100644 index 000000000..73881d0ab --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoRoleTagNR.java @@ -0,0 +1,82 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-24 8:07 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. 
Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; +import com.hankcs.hanlp.corpus.dictionary.NRDictionaryMaker; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.seg.Segment; + +import java.io.IOException; + +/** + * 《自然语言处理入门》8.4.1 基于角色标注的中国人名识别 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoRoleTagNR +{ + public static final String MODEL = "data/test/nr"; + + public static void main(String[] args) + { + demoNR(); + trainOneSentence(); + train(PKU.PKU199801, MODEL); + test(); + } + + private static void trainOneSentence() + { + EasyDictionary dictionary = EasyDictionary.create(HanLP.Config.CoreDictionaryPath); // 核心词典 + NRDictionaryMaker maker = new NRDictionaryMaker(dictionary); // 训练模块 + maker.verbose = true; // 调试输出 + maker.learn(Sentence.create("这里/r 有/v 关天培/nr 的/u 有关/vn 事迹/n 。/w")); // 学习一个句子 + maker.saveTxtTo(MODEL); // 输出HMM到txt + } + + private static void train(String corpus, String model) + { + EasyDictionary dictionary = EasyDictionary.create(HanLP.Config.CoreDictionaryPath); // 核心词典 + NRDictionaryMaker maker = new NRDictionaryMaker(dictionary); // 训练模块 + maker.train(corpus); // 在语料库上训练 + maker.saveTxtTo(model); // 输出HMM到txt + } + + private static Segment load(String model) + { + HanLP.Config.PersonDictionaryPath = model + ".txt"; // data/test/nr.txt + HanLP.Config.PersonDictionaryTrPath = model + ".tr.txt"; // data/test/nr.tr.txt + Segment segment = new DijkstraSegment(); // 该分词器便于调试 + return segment; + } + + private static void test() + { + Segment segment = load(MODEL); + HanLP.Config.enableDebug(); + System.out.println(segment.seg("龚学平等领导")); + } + + private static void demoNR() + { + 
HanLP.Config.enableDebug(); + Segment segment = new DijkstraSegment(); + System.out.println(segment.seg("王国维和服务员")); + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoRoleTagNS.java b/src/test/java/com/hankcs/book/ch08/DemoRoleTagNS.java new file mode 100644 index 000000000..8923d9baf --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoRoleTagNS.java @@ -0,0 +1,63 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-26 10:18 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; +import com.hankcs.hanlp.corpus.dictionary.NSDictionaryMaker; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.seg.Segment; + +import java.io.IOException; + +/** + * 《自然语言处理入门》8.4.2 基于角色标注的地名识别 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoRoleTagNS +{ + public static final String MODEL = "data/test/ns"; + + public static void main(String[] args) + { + train(PKU.PKU199801, MODEL); + test(MODEL); + } + + private static void train(String corpus, String model) + { + EasyDictionary dictionary = EasyDictionary.create(HanLP.Config.CoreDictionaryPath); // 核心词典 + NSDictionaryMaker maker = new NSDictionaryMaker(dictionary); // 训练模块 + maker.train(corpus); // 在语料库上训练 + maker.saveTxtTo(model); // 输出HMM到txt + } + + private static Segment load(String model) + { + HanLP.Config.PlaceDictionaryPath = model + ".txt"; // data/test/ns.txt + HanLP.Config.PlaceDictionaryTrPath = model + ".tr.txt"; // data/test/ns.tr.txt + Segment segment = new DijkstraSegment(); // 该分词器便于调试 + return segment.enablePlaceRecognize(true).enableCustomDictionary(false); + } + + private static void test(String 
model) + { + Segment segment = load(model); + HanLP.Config.enableDebug(); + System.out.println(segment.seg("生于黑牛沟村")); + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoRoleTagNT.java b/src/test/java/com/hankcs/book/ch08/DemoRoleTagNT.java new file mode 100644 index 000000000..367f52738 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoRoleTagNT.java @@ -0,0 +1,64 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-27 11:46 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; +import com.hankcs.hanlp.corpus.dictionary.NSDictionaryMaker; +import com.hankcs.hanlp.corpus.dictionary.NTDictionaryMaker; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.seg.Segment; + +import java.io.IOException; + +/** + * 《自然语言处理入门》8.4.3 基于角色标注的机构名识别 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoRoleTagNT +{ + public static final String MODEL = "data/test/nt"; + + public static void main(String[] args) + { + train(PKU.PKU199801, MODEL); + test(MODEL); + } + + private static void train(String corpus, String model) + { + EasyDictionary dictionary = EasyDictionary.create(HanLP.Config.CoreDictionaryPath); // 核心词典 + NTDictionaryMaker maker = new NTDictionaryMaker(dictionary); // 训练模块 + maker.train(corpus); // 在语料库上训练 + maker.saveTxtTo(model); // 输出HMM到txt + } + + private static Segment load(String model) + { + HanLP.Config.OrganizationDictionaryPath = model + ".txt"; // data/test/nt.txt + HanLP.Config.OrganizationDictionaryTrPath = model + ".tr.txt"; // data/test/nt.tr.txt + Segment segment = new DijkstraSegment(); // 该分词器便于调试 + return
segment.enableOrganizationRecognize(true).enableCustomDictionary(false); + } + + private static void test(String model) + { + Segment segment = load(model); + HanLP.Config.enableDebug(); + System.out.println(segment.seg("温州黄鹤皮革制造有限公司是由黄先生创办的企业")); + } +} diff --git a/src/test/java/com/hankcs/book/ch08/DemoSPNER.java b/src/test/java/com/hankcs/book/ch08/DemoSPNER.java new file mode 100644 index 000000000..a46bc623c --- /dev/null +++ b/src/test/java/com/hankcs/book/ch08/DemoSPNER.java @@ -0,0 +1,56 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-28 11:36 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch08; + +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.model.hmm.HMMNERecognizer; +import com.hankcs.hanlp.model.perceptron.*; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; +import com.hankcs.hanlp.tokenizer.lexical.LexicalAnalyzer; +import com.hankcs.hanlp.tokenizer.lexical.NERecognizer; + +import java.io.IOException; + +import static com.hankcs.book.ch08.DemoHMMNER.test; + +/** + * 《自然语言处理入门》8.5.3 基于感知机序列标注的命名实体识别 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoSPNER +{ + + public static void main(String[] args) throws IOException + { + NERecognizer recognizer = train(PKU.PKU199801_TRAIN, PKU.NER_MODEL); + test(recognizer); + // 在线学习 + PerceptronLexicalAnalyzer analyzer = new PerceptronLexicalAnalyzer(new PerceptronSegmenter(), new PerceptronPOSTagger(), (PerceptronNERecognizer) recognizer);//① + Sentence sentence = Sentence.create("与/c 特朗普/nr 通/v 电话/n 讨论/v [太空/s 探索/vn 技术/n 公司/n]/nt");//② + while (!analyzer.analyze(sentence.text()).equals(sentence))//③ + 
analyzer.learn(sentence); + } + + public static NERecognizer train(String corpus, String model) throws IOException + { + if (IOUtil.isFileExisted(model)) + return new PerceptronNERecognizer(model); + PerceptronTrainer trainer = new NERTrainer(); + return new PerceptronNERecognizer(trainer.train(corpus, corpus, model, 0, 50, 8).getModel()); + } +} diff --git a/src/test/java/com/hankcs/book/ch09/DemoExtractNewWord.java b/src/test/java/com/hankcs/book/ch09/DemoExtractNewWord.java new file mode 100644 index 000000000..36d7c67e8 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch09/DemoExtractNewWord.java @@ -0,0 +1,77 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-30 8:36 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.book.ch09; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.mining.word.WordInfo; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.File; +import java.io.IOException; +import java.util.List; + +/** + * 《自然语言处理入门》9.1 新词提取 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoExtractNewWord +{ + // 文本长度越大越好,试试四大名著? 
+ static final String HLM_PATH = TestUtility.ensureTestData("红楼梦.txt", "http://file.hankcs.com/corpus/红楼梦.zip"); + static final String XYJ_PATH = TestUtility.ensureTestData("西游记.txt", "http://file.hankcs.com/corpus/西游记.zip"); + static final String SHZ_PATH = TestUtility.ensureTestData("水浒传.txt", "http://file.hankcs.com/corpus/水浒传.zip"); + static final String SAN_PATH = TestUtility.ensureTestData("三国演义.txt", "http://file.hankcs.com/corpus/三国演义.zip"); + static final String WEIBO_PATH = TestUtility.ensureTestData("weibo-classification", "http://file.hankcs.com/corpus/weibo-classification.zip"); + + public static void main(String[] args) throws IOException + { + extract(HLM_PATH); + extract(XYJ_PATH); + extract(SHZ_PATH); + extract(SAN_PATH); + testWeibo(); + + // 更多参数 + List<WordInfo> wordInfoList = HanLP.extractWords(IOUtil.newBufferedReader(HLM_PATH), 100, true, 4, 0.0f, .5f, 100f); + System.out.println(wordInfoList); + } + + public static void testWeibo() + { + for (File folder : new File(WEIBO_PATH).listFiles()) + { + System.out.println(folder.getName()); + StringBuilder sbText = new StringBuilder(); + for (File file : folder.listFiles()) + { + sbText.append(IOUtil.readTxt(file.getPath())); + } + List<WordInfo> wordInfoList = HanLP.extractWords(sbText.toString(), 100); + System.out.println(wordInfoList); + } + } + + private static void extract(String corpus) throws IOException + { + System.out.printf("%s 热词\n", corpus); + List<WordInfo> wordInfoList = HanLP.extractWords(IOUtil.newBufferedReader(corpus), 100); + System.out.println(wordInfoList); +// System.out.printf("%s 新词\n", corpus); +// wordInfoList = HanLP.extractWords(IOUtil.newBufferedReader(corpus), 100, true); +// System.out.println(wordInfoList); + } +} diff --git a/src/test/java/com/hankcs/book/ch09/DemoTFIDF.java b/src/test/java/com/hankcs/book/ch09/DemoTFIDF.java new file mode 100644 index 000000000..8fe35c709 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch09/DemoTFIDF.java @@ -0,0 +1,42 @@ +/* + * Han He + * me@hankcs.com + *
2019-09-17 12:22 AM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.book.ch09; + +import com.hankcs.hanlp.mining.word.TfIdfCounter; + +/** + * 《自然语言处理入门》9.2 关键词提取 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTFIDF +{ + public static void main(String[] args) + { + TfIdfCounter counter = new TfIdfCounter(); + counter.add("《女排夺冠》", "女排北京奥运会夺冠"); // 输入多篇文档 + counter.add("《羽毛球男单》", "北京奥运会的羽毛球男单决赛"); + counter.add("《女排》", "中国队女排夺北京奥运会金牌重返巅峰,观众欢呼女排女排女排!"); + +// // 加载idf文件 +// counter.loadIdfFile("data/idf.txt"); + + counter.compute(); // 输入完毕 + for (Object id : counter.documents()) // 根据每篇文档的TF-IDF提取关键词 + { + System.out.println(id + " : " + counter.getKeywordsOf(id, 3)); + } + } +} diff --git a/src/test/java/com/hankcs/book/ch09/DemoTermFrequency.java b/src/test/java/com/hankcs/book/ch09/DemoTermFrequency.java new file mode 100644 index 000000000..fd6301812 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch09/DemoTermFrequency.java @@ -0,0 +1,41 @@ +/* + * Han He + * me@hankcs.com + * 2019-09-16 11:59 PM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.book.ch09; + +import com.hankcs.hanlp.corpus.occurrence.TermFrequency; +import com.hankcs.hanlp.mining.word.TermFrequencyCounter; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; + +import java.io.IOException; + +/** + * 《自然语言处理入门》9.2 关键词提取 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTermFrequency +{ + public static void main(String[] args) throws IOException + { + TermFrequencyCounter counter = new TermFrequencyCounter(); +// counter.getSegment().enableIndexMode(true); +// counter.setSegment(new PerceptronLexicalAnalyzer().enableIndexMode(true)); + counter.add("加油加油中国队!"); // 第一个文档 + counter.add("中国观众高呼加油中国"); // 第二个文档 + for (TermFrequency termFrequency : counter) // 遍历每个词与词频 + System.out.printf("%s=%d\n", termFrequency.getTerm(), termFrequency.getFrequency()); + System.out.println(counter.top(2)); // 取top N + } +} diff --git a/src/test/java/com/hankcs/book/ch10/DemoTextClustering.java b/src/test/java/com/hankcs/book/ch10/DemoTextClustering.java new file mode 100644 index 000000000..df96d91f5 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch10/DemoTextClustering.java @@ -0,0 +1,25 @@ +/* + * Han He + * me@hankcs.com + * 2019-01-03 6:54 PM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.book.ch10; + +/** + * 《自然语言处理入门》10 文本聚类 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * 请参考{@link com.hankcs.demo.DemoTextClustering} + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTextClustering extends com.hankcs.demo.DemoTextClustering +{ +} diff --git a/src/test/java/com/hankcs/book/ch10/DemoTextClusteringFMeasure.java b/src/test/java/com/hankcs/book/ch10/DemoTextClusteringFMeasure.java new file mode 100644 index 000000000..cebcc875b --- /dev/null +++ b/src/test/java/com/hankcs/book/ch10/DemoTextClusteringFMeasure.java @@ -0,0 +1,25 @@ +/* + * Han He + * me@hankcs.com + * 2019-01-03 6:53 PM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.book.ch10; + +/** + * 《自然语言处理入门》10 文本聚类 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * 请参考{@link com.hankcs.demo.DemoTextClusteringFMeasure} + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTextClusteringFMeasure extends com.hankcs.demo.DemoTextClusteringFMeasure +{ +} diff --git a/src/test/java/com/hankcs/book/ch11/DemoLoadTextClassificationCorpus.java b/src/test/java/com/hankcs/book/ch11/DemoLoadTextClassificationCorpus.java new file mode 100644 index 000000000..5ddf25c59 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch11/DemoLoadTextClassificationCorpus.java @@ -0,0 +1,49 @@ +/* + * Han He + * me@hankcs.com + * 2019-01-03 6:56 PM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.book.ch11; + +import com.hankcs.hanlp.classification.corpus.AbstractDataSet; +import com.hankcs.hanlp.classification.corpus.Document; +import com.hankcs.hanlp.classification.corpus.FileDataSet; +import com.hankcs.hanlp.classification.corpus.MemoryDataSet; + +import java.io.IOException; +import java.util.List; + +import static com.hankcs.demo.DemoTextClassification.CORPUS_FOLDER; + + +/** + * 《自然语言处理入门》11.2 文本分类语料库 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * 演示加载文本分类语料库 + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoLoadTextClassificationCorpus +{ + public static void main(String[] args) throws IOException + { + AbstractDataSet dataSet = new MemoryDataSet(); // ①将数据集加载到内存中 + dataSet.load(CORPUS_FOLDER); // ②加载data/test/搜狗文本分类语料库迷你版 + dataSet.add("自然语言处理", "自然语言处理很有趣"); // ③新增样本 + List<String> allClasses = dataSet.getCatalog().getCategories(); // ④获取标注集 + System.out.printf("标注集:%s\n", allClasses); + for (Document document : dataSet) + { + System.out.println("第一篇文档的类别:" + allClasses.get(document.category)); + break; + } + } +} diff --git a/src/test/java/com/hankcs/book/ch11/DemoTextClassification.java b/src/test/java/com/hankcs/book/ch11/DemoTextClassification.java new file mode 100644 index 000000000..0e1f5084a --- /dev/null +++ b/src/test/java/com/hankcs/book/ch11/DemoTextClassification.java @@ -0,0 +1,25 @@ +/* + * Han He + * me@hankcs.com + * 2019-01-04 8:22 PM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information.
+ * + */ +package com.hankcs.book.ch11; + +/** + * 《自然语言处理入门》11.4 朴素贝叶斯分类器 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * 请参考 {@link com.hankcs.demo.DemoTextClassification} + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTextClassification extends com.hankcs.demo.DemoTextClassification +{ +} diff --git a/src/test/java/com/hankcs/book/ch11/DemoTextClassificationFMeasure.java b/src/test/java/com/hankcs/book/ch11/DemoTextClassificationFMeasure.java new file mode 100644 index 000000000..0b8eb494c --- /dev/null +++ b/src/test/java/com/hankcs/book/ch11/DemoTextClassificationFMeasure.java @@ -0,0 +1,61 @@ +/* + * Han He + * me@hankcs.com + * 2019-01-06 10:15 PM + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.book.ch11; + +import com.hankcs.hanlp.classification.classifiers.IClassifier; +import com.hankcs.hanlp.classification.classifiers.NaiveBayesClassifier; +import com.hankcs.hanlp.classification.corpus.FileDataSet; +import com.hankcs.hanlp.classification.corpus.IDataSet; +import com.hankcs.hanlp.classification.corpus.MemoryDataSet; +import com.hankcs.hanlp.classification.statistics.evaluations.Evaluator; +import com.hankcs.hanlp.classification.statistics.evaluations.FMeasure; +import com.hankcs.hanlp.classification.tokenizers.BigramTokenizer; +import com.hankcs.hanlp.classification.tokenizers.HanLPTokenizer; +import com.hankcs.hanlp.classification.tokenizers.ITokenizer; + +import java.io.IOException; + +import static com.hankcs.demo.DemoTextClassification.CORPUS_FOLDER; + +/** + * 《自然语言处理入门》11.6 标准化评测 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTextClassificationFMeasure +{ + public static void main(String[] args) throws IOException + { + evaluate(new 
NaiveBayesClassifier(), new HanLPTokenizer()); + evaluate(new NaiveBayesClassifier(), new BigramTokenizer()); + // 需要引入 https://github.com/hankcs/text-classification-svm ,或者将下列代码复制到该项目运行 + // evaluate(new LinearSVMClassifier(), new HanLPTokenizer()); + // evaluate(new LinearSVMClassifier(), new BigramTokenizer()); + } + + public static void evaluate(IClassifier classifier, ITokenizer tokenizer) throws IOException + { + IDataSet trainingCorpus = new FileDataSet(). // FileDataSet省内存,可加载大规模数据集 + setTokenizer(tokenizer). // 支持不同的ITokenizer,详见源码中的文档 + load(CORPUS_FOLDER, "UTF-8", 0.9); // 前90%作为训练集 + classifier.train(trainingCorpus); + IDataSet testingCorpus = new MemoryDataSet(classifier.getModel()). + load(CORPUS_FOLDER, "UTF-8", -0.1); // 后10%作为测试集 + // 计算准确率 + FMeasure result = Evaluator.evaluate(classifier, testingCorpus); + System.out.println(classifier.getClass().getSimpleName() + "+" + tokenizer.getClass().getSimpleName()); + System.out.println(result); + } +} diff --git a/src/test/java/com/hankcs/book/ch12/DebugKBeamArcEagerDependencyParser.java b/src/test/java/com/hankcs/book/ch12/DebugKBeamArcEagerDependencyParser.java new file mode 100644 index 000000000..8dbfca5bc --- /dev/null +++ b/src/test/java/com/hankcs/book/ch12/DebugKBeamArcEagerDependencyParser.java @@ -0,0 +1,38 @@ +/* + * Han He + * me@hankcs.com + * 2019-02-06 21:15 + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information.
+ * + */ +package com.hankcs.book.ch12; + +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; +import com.hankcs.hanlp.dependency.IDependencyParser; +import com.hankcs.hanlp.dependency.perceptron.parser.KBeamArcEagerDependencyParser; +import com.hankcs.hanlp.dependency.perceptron.transition.parser.ArcEager; + +import java.io.IOException; + +/** + * 《自然语言处理入门》12.4 基于转移的依存句法分析 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * 请在{@link ArcEager#commitAction}中下一个断点,观察ArcEager转移系统 + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DebugKBeamArcEagerDependencyParser +{ + public static void main(String[] args) throws IOException, ClassNotFoundException + { + IDependencyParser parser = new KBeamArcEagerDependencyParser(); + CoNLLSentence sentence = parser.parse("人吃鱼"); + System.out.println(sentence); + } +} diff --git a/src/test/java/com/hankcs/book/ch12/DemoTrainParser.java b/src/test/java/com/hankcs/book/ch12/DemoTrainParser.java new file mode 100644 index 000000000..37c2dc558 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch12/DemoTrainParser.java @@ -0,0 +1,44 @@ +/* + * Han He + * me@hankcs.com + * 2019-02-08 01:57 + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.book.ch12; + +import com.hankcs.hanlp.dependency.perceptron.parser.KBeamArcEagerDependencyParser; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.IOException; +import java.util.concurrent.ExecutionException; + +/** + * 《自然语言处理入门》12.5 依存句法分析 API + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTrainParser +{ + public static String CTB_ROOT = TestUtility.ensureTestData("ctb8.0-dep", "http://file.hankcs.com/corpus/ctb8.0-dep.zip"); + public static String CTB_TRAIN = CTB_ROOT + "/train.conll"; + public static String CTB_DEV = CTB_ROOT + "/dev.conll"; + public static String CTB_TEST = CTB_ROOT + "/test.conll"; + public static String CTB_MODEL = CTB_ROOT + "/ctb.bin"; + public static String BROWN_CLUSTER = TestUtility.ensureTestData("wiki-cn-cluster.txt", "http://file.hankcs.com/corpus/wiki-cn-cluster.zip"); + + public static void main(String[] args) throws IOException, ClassNotFoundException, ExecutionException, InterruptedException + { + KBeamArcEagerDependencyParser parser = KBeamArcEagerDependencyParser.train(CTB_TRAIN, CTB_DEV, BROWN_CLUSTER, CTB_MODEL); + System.out.println(parser.parse("人吃鱼")); + double[] score = parser.evaluate(CTB_TEST); + System.out.printf("UAS=%.1f LAS=%.1f\n", score[0], score[1]); + } +} diff --git a/src/test/java/com/hankcs/book/ch12/OpinionMining.java b/src/test/java/com/hankcs/book/ch12/OpinionMining.java new file mode 100644 index 000000000..cf2f70ea7 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch12/OpinionMining.java @@ -0,0 +1,88 @@ +/* + * Han He + * me@hankcs.com + * 2019-02-12 00:37 + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. 
+ * + */ +package com.hankcs.book.ch12; + +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord; +import com.hankcs.hanlp.dependency.IDependencyParser; +import com.hankcs.hanlp.dependency.perceptron.parser.KBeamArcEagerDependencyParser; + +import java.io.IOException; +import java.util.List; + +/** + * 《自然语言处理入门》12.6 案例:基于依存句法树的意见抽取 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class OpinionMining +{ + public static void main(String[] args) throws IOException, ClassNotFoundException + { + IDependencyParser parser = new KBeamArcEagerDependencyParser(); + CoNLLSentence tree = parser.parse("电池非常棒,机身不长,长的是待机,但是屏幕分辨率不高。"); + System.out.println(tree); + System.out.println("第一版"); + extractOpinion1(tree); + System.out.println("第二版"); + extractOpinion2(tree); + System.out.println("第三版"); + extractOpinion3(tree); + } + + static void extractOpinion1(CoNLLSentence tree) + { + for (CoNLLWord word : tree) + if (word.POSTAG.equals("NN") && word.DEPREL.equals("nsubj")) + System.out.printf("%s = %s\n", word.LEMMA, word.HEAD.LEMMA); + } + + static void extractOpinion2(CoNLLSentence tree) + { + for (CoNLLWord word : tree) + { + if (word.POSTAG.equals("NN") && word.DEPREL.equals("nsubj")) + { + if (tree.findChildren(word.HEAD, "neg").isEmpty()) + System.out.printf("%s = %s\n", word.LEMMA, word.HEAD.LEMMA); + else + System.out.printf("%s = 不%s\n", word.LEMMA, word.HEAD.LEMMA); + } + } + } + + static void extractOpinion3(CoNLLSentence tree) + { + for (CoNLLWord word : tree) + { + if (word.POSTAG.equals("NN")) + { + if (word.DEPREL.equals("nsubj")) + { + if (tree.findChildren(word.HEAD, "neg").isEmpty()) + System.out.printf("%s = %s\n", word.LEMMA, word.HEAD.LEMMA); + else + System.out.printf("%s = 不%s\n", word.LEMMA, word.HEAD.LEMMA); + } + else if (word.DEPREL.equals("attr")) // ①属性 + { + List<CoNLLWord> top = tree.findChildren(word.HEAD,
"top"); // ②主题 + if (!top.isEmpty()) + System.out.printf("%s = %s\n", word.LEMMA, top.get(0).LEMMA); + } + } + } + } +} diff --git a/src/test/java/com/hankcs/book/ch13/DemoNeuralParser.java b/src/test/java/com/hankcs/book/ch13/DemoNeuralParser.java new file mode 100644 index 000000000..e98c3d5d9 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch13/DemoNeuralParser.java @@ -0,0 +1,54 @@ +/* + * Han He + * me@hankcs.com + * 2019-02-26 22:54 + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.book.ch13; + +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord; +import com.hankcs.hanlp.dependency.IDependencyParser; +import com.hankcs.hanlp.dependency.nnparser.NeuralNetworkDependencyParser; + +/** + * 《自然语言处理入门》13.4 基于神经网络的高性能依存句法分析器 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoNeuralParser +{ + public static void main(String[] args) + { + IDependencyParser parser = new NeuralNetworkDependencyParser(); + CoNLLSentence sentence = parser.parse("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。"); + System.out.println(sentence); + // 可以方便地遍历它 + for (CoNLLWord word : sentence) + { + System.out.printf("%s --(%s)--> %s\n", word.LEMMA, word.DEPREL, word.HEAD.LEMMA); + } + // 也可以直接拿到数组,任意顺序或逆序遍历 + CoNLLWord[] wordArray = sentence.getWordArray(); + for (int i = wordArray.length - 1; i >= 0; i--) + { + CoNLLWord word = wordArray[i]; + System.out.printf("%s --(%s)--> %s\n", word.LEMMA, word.DEPREL, word.HEAD.LEMMA); + } + // 还可以直接遍历子树,从某棵子树的某个节点一路遍历到虚根 + CoNLLWord head = wordArray[12]; + while ((head = head.HEAD) != null) + { + if (head == CoNLLWord.ROOT) System.out.println(head.LEMMA); + else System.out.printf("%s --(%s)--> ", head.LEMMA, head.DEPREL); + } + } +} diff --git 
a/src/test/java/com/hankcs/book/ch13/DemoTrainWord2Vec.java b/src/test/java/com/hankcs/book/ch13/DemoTrainWord2Vec.java new file mode 100644 index 000000000..71e23e288 --- /dev/null +++ b/src/test/java/com/hankcs/book/ch13/DemoTrainWord2Vec.java @@ -0,0 +1,27 @@ +/* + * Han He + * me@hankcs.com + * 2019-02-26 12:29 + * + * + * Copyright (c) 2019, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.book.ch13; + +import com.hankcs.demo.DemoWord2Vec; + +/** + * 《自然语言处理入门》13.3 word2vec + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * 请参考 {@link com.hankcs.demo.DemoWord2Vec} + * + * @author hankcs + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +public class DemoTrainWord2Vec extends DemoWord2Vec +{ +} diff --git a/src/test/java/com/hankcs/book/package-info.java b/src/test/java/com/hankcs/book/package-info.java new file mode 100644 index 000000000..21a5f83c9 --- /dev/null +++ b/src/test/java/com/hankcs/book/package-info.java @@ -0,0 +1,9 @@ +/** + * 这个包是《自然语言处理入门》的配套代码,与HanLP1.x同步更新,保持兼容。 + * 配套书籍:http://nlp.hankcs.com/book.php + * 讨论答疑:https://bbs.hankcs.com/ + * + * @see 《自然语言处理入门》 + * @see 讨论答疑 + */ +package com.hankcs.book; \ No newline at end of file diff --git a/src/test/java/com/hankcs/demo/DemoCRFLexicalAnalyzer.java b/src/test/java/com/hankcs/demo/DemoCRFLexicalAnalyzer.java new file mode 100644 index 000000000..e264f96b5 --- /dev/null +++ b/src/test/java/com/hankcs/demo/DemoCRFLexicalAnalyzer.java @@ -0,0 +1,41 @@ +/* + * Han He + * me@hankcs.com + * 2018-03-30 下午10:01 + * + * + * Copyright (c) 2018, Han He. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He to get more information. 
+ * + */ +package com.hankcs.demo; + +import com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.IOException; + +/** + * CRF词法分析器 + * 自1.6.6版起模型格式不兼容旧版:CRF模型为对数线性模型{@link com.hankcs.hanlp.model.crf.LogLinearModel}, + * 通过复用结构化感知机的维特比解码算法,效率提高10倍。 + * + * @author hankcs + */ +public class DemoCRFLexicalAnalyzer extends TestUtility +{ + public static void main(String[] args) throws IOException + { + CRFLexicalAnalyzer analyzer = new CRFLexicalAnalyzer(); + String[] tests = new String[]{ + "商品和服务", + "上海华安工业(集团)公司董事长谭旭光和秘书胡花蕊来到美国纽约现代艺术博物馆参观", + "微软公司於1975年由比爾·蓋茲和保羅·艾倫創立,18年啟動以智慧雲端、前端為導向的大改組。" // 支持繁体中文 + }; + for (String sentence : tests) + { + System.out.println(analyzer.analyze(sentence)); +// System.out.println(analyzer.seg(sentence)); + } + } +} diff --git a/src/test/java/com/hankcs/demo/DemoCRFSegment.java b/src/test/java/com/hankcs/demo/DemoCRFSegment.java deleted file mode 100644 index 80cd962b2..000000000 --- a/src/test/java/com/hankcs/demo/DemoCRFSegment.java +++ /dev/null @@ -1,64 +0,0 @@ -/* - * - * He Han - * hankcs.cn@gmail.com - * 2014/12/10 22:02 - * - * - * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ - * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
- * - */ -package com.hankcs.demo; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.seg.CRF.CRFSegment; -import com.hankcs.hanlp.seg.Segment; -import com.hankcs.hanlp.seg.common.Term; - -import java.util.List; - -/** - * CRF分词(在最新训练的未压缩100MB模型下,能够取得较好的效果,可以投入生产环境) - * - * @author hankcs - */ -public class DemoCRFSegment -{ - public static void main(String[] args) - { - HanLP.Config.ShowTermNature = false; // 关闭词性显示 - Segment segment = new CRFSegment().enableCustomDictionary(false); - String[] sentenceArray = new String[] - { - "HanLP是由一系列模型与算法组成的Java工具包,目标是普及自然语言处理在生产环境中的应用。", - "鐵桿部隊憤怒情緒集結 馬英九腹背受敵", // 繁体无压力 - "馬英九回應連勝文“丐幫說”:稱黨內同志談話應謹慎", - "高锰酸钾,强氧化剂,紫红色晶体,可溶于水,遇乙醇即被还原。常用作消毒剂、水净化剂、氧化剂、漂白剂、毒气吸收剂、二氧化碳精制剂等。", // 专业名词有一定辨识能力 - "《夜晚的骰子》通过描述浅草的舞女在暗夜中扔骰子的情景,寄托了作者对庶民生活区的情感", // 非新闻语料 - "这个像是真的[委屈]前面那个打扮太江户了,一点不上品...@hankcs", // 微博 - "鼎泰丰的小笼一点味道也没有...每样都淡淡的...淡淡的,哪有食堂2A的好次", - "克里斯蒂娜·克罗尔说:不,我不是虎妈。我全家都热爱音乐,我也鼓励他们这么做。", - "今日APPS:Sago Mini Toolbox培养孩子动手能力", - "财政部副部长王保安调任国家统计局党组书记", - "2.34米男子娶1.53米女粉丝 称夫妻生活没问题", - "你看过穆赫兰道吗", - "国办发布网络提速降费十四条指导意见 鼓励流量不清零", - "乐视超级手机能否承载贾布斯的生态梦" - }; - for (String sentence : sentenceArray) - { - List termList = segment.seg(sentence); - System.out.println(termList); - } - - /** - * 内存CookBook: - * HanLP内部有智能的内存池,对于同一个CRF模型(模型文件路径作为id区分),只要它没被释放或者内存充足,就不会重新加载。 - */ - for (int i = 0; i < 5; ++i) - { - segment = new CRFSegment(); - } - } -} diff --git a/src/test/java/com/hankcs/demo/DemoChineseNameRecognition.java b/src/test/java/com/hankcs/demo/DemoChineseNameRecognition.java index 65d846a55..861694e4e 100644 --- a/src/test/java/com/hankcs/demo/DemoChineseNameRecognition.java +++ b/src/test/java/com/hankcs/demo/DemoChineseNameRecognition.java @@ -27,6 +27,7 @@ public static void main(String[] args) { String[] testCase = new String[]{ "签约仪式前,秦光荣、李纪恒、仇和等一同会见了参加签约的企业家。", + "武大靖创世界纪录夺冠,中国代表团平昌首金", "区长庄木弟新年致辞", "朱立伦:两岸都希望共创双赢 习朱历史会晤在即", "陕西首富吴一坚被带走 与令计划妻子有交集", diff --git a/src/test/java/com/hankcs/demo/DemoCustomDictionary.java 
b/src/test/java/com/hankcs/demo/DemoCustomDictionary.java index 243e1056b..0613acea7 100644 --- a/src/test/java/com/hankcs/demo/DemoCustomDictionary.java +++ b/src/test/java/com/hankcs/demo/DemoCustomDictionary.java @@ -16,6 +16,8 @@ import com.hankcs.hanlp.dictionary.BaseSearcher; import com.hankcs.hanlp.dictionary.CoreDictionary; import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.dictionary.DynamicCustomDictionary; +import com.hankcs.hanlp.tokenizer.StandardTokenizer; import java.util.Map; @@ -63,5 +65,12 @@ public void hit(int begin, int end, CoreDictionary.Attribute value) // Note:动态增删不会影响词典文件 // 目前CustomDictionary使用DAT储存词典文件中的词语,用BinTrie储存动态加入的词语,前者性能高,后者性能低 // 之所以保留动态增删功能,一方面是历史遗留特性,另一方面是调试用;未来可能会去掉动态增删特性。 + + // 系统默认的词典 + DynamicCustomDictionary dictionary = CustomDictionary.DEFAULT; + // 每个分词器都有一份词典,默认公用 CustomDictionary.DEFAULT,你可以为任何分词器指定一份不同的词典 + DynamicCustomDictionary myDictionary = new DynamicCustomDictionary("data/dictionary/custom/CustomDictionary.txt", "data/dictionary/custom/机构名词典.txt"); + StandardTokenizer.SEGMENT.enableCustomDictionary(myDictionary); + StandardTokenizer.SEGMENT.customDictionary.insert("插入到该分词器专用的词典中"); } } diff --git a/src/test/java/com/hankcs/demo/DemoCustomNature.java b/src/test/java/com/hankcs/demo/DemoCustomNature.java index 92d6d6e44..86581e7f2 100644 --- a/src/test/java/com/hankcs/demo/DemoCustomNature.java +++ b/src/test/java/com/hankcs/demo/DemoCustomNature.java @@ -13,7 +13,6 @@ import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.corpus.tag.Nature; -import com.hankcs.hanlp.corpus.util.CustomNatureUtility; import com.hankcs.hanlp.dictionary.CustomDictionary; import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.tokenizer.StandardTokenizer; @@ -21,9 +20,10 @@ import java.util.List; +import static com.hankcs.hanlp.corpus.tag.Nature.n; + /** * 演示自定义词性,以及往词典中插入自定义词性的词语 - * !!!由于采用了反射技术,用户需对本地环境的兼容性和稳定性负责!!! 
* * @author hankcs */ @@ -57,14 +57,12 @@ public static void main(String[] args) StandardTokenizer.SEGMENT.enablePartOfSpeechTagging(true); // 依然支持隐马词性标注 termList = HanLP.segment("苹果电脑可以运行开源阿尔法狗代码吗"); System.out.println(termList); - // 如果使用了动态词性之后任何类使用了switch(nature)语句,必须注册每个类: - CustomNatureUtility.registerSwitchClass(DemoCustomNature.class); + // 1.6.5之后Nature不再是枚举类型,无法switch。但终于不再涉及反射了,在各种JRE环境下都更稳定。 for (Term term : termList) { - switch (term.nature) + if (term.nature == n) { - case n: - System.out.printf("找到了 [%s] : %s\n", "名词", term.word); + System.out.printf("找到了 [%s] : %s\n", "名词", term.word); } } } diff --git a/src/test/java/com/hankcs/demo/DemoDependencyParser.java b/src/test/java/com/hankcs/demo/DemoDependencyParser.java index de9887e12..229c25e06 100644 --- a/src/test/java/com/hankcs/demo/DemoDependencyParser.java +++ b/src/test/java/com/hankcs/demo/DemoDependencyParser.java @@ -14,16 +14,25 @@ import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord; +import com.hankcs.hanlp.dependency.IDependencyParser; +import com.hankcs.hanlp.dependency.perceptron.parser.KBeamArcEagerDependencyParser; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.IOException; /** - * 依存句法分析(CRF句法模型需要-Xms512m -Xmx512m -Xmn256m,MaxEnt和神经网络句法模型需要-Xms1g -Xmx1g -Xmn512m) + * 依存句法分析(神经网络句法模型需要-Xms1g -Xmx1g -Xmn512m) + * * @author hankcs */ -public class DemoDependencyParser +public class DemoDependencyParser extends TestUtility { - public static void main(String[] args) + public static void main(String[] args) throws IOException, ClassNotFoundException { CoNLLSentence sentence = HanLP.parseDependency("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。"); + // 也可以用基于ArcEager转移系统的依存句法分析器 +// IDependencyParser parser = new KBeamArcEagerDependencyParser(); +// CoNLLSentence sentence = parser.parse("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。"); System.out.println(sentence); // 可以方便地遍历它 for (CoNLLWord word : 
sentence) diff --git a/src/test/java/com/hankcs/demo/DemoEvaluateCWS.java b/src/test/java/com/hankcs/demo/DemoEvaluateCWS.java new file mode 100644 index 000000000..f50df6dd2 --- /dev/null +++ b/src/test/java/com/hankcs/demo/DemoEvaluateCWS.java @@ -0,0 +1,43 @@ +package com.hankcs.demo; + +import com.hankcs.hanlp.model.perceptron.CWSTrainer; +import com.hankcs.hanlp.corpus.MSR; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.model.perceptron.PerceptronTrainer; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.CWSEvaluator; + +import java.io.IOException; + +import static com.hankcs.hanlp.classification.utilities.io.ConsoleLogger.logger; + +/** + * 演示如何正确规范地评测中文分词的准确率
+ * 1、公平公正。训练模块、分词模块、语料库、评测程序全部开源。 + * 2、禁止使用语料库之外的词典及其等价物(词向量等)。 + * 3、试验结果可复现,可通过其他评分脚本校验。 + */ +public class DemoEvaluateCWS +{ + public static void main(String[] args) throws IOException + { + logger.start("开始训练...\n"); + PerceptronTrainer trainer = new CWSTrainer(); + PerceptronTrainer.Result result = trainer.train(MSR.TRAIN_PATH, MSR.TRAIN_PATH, MSR.MODEL_PATH, + 0.0, // 压缩比对准确率的影响很小 + 50, // 一般50个迭代就差不多收敛了 + 8 + ); + logger.finish(" 训练完毕\n"); + + Segment segment = new PerceptronLexicalAnalyzer(result.getModel()).enableCustomDictionary(false); // 重要!必须禁用词典 + System.out.println(CWSEvaluator.evaluate(segment, MSR.TEST_PATH, MSR.OUTPUT_PATH, MSR.GOLD_PATH, MSR.TRAIN_WORDS)); // 标准化评测 + // P:96.80 R:96.55 F1:96.68 OOV-R:70.91 IV-R:97.25 + // 受随机数影响,可能在96.60%左右波动 + System.out.printf("上述结果可通过sighan05官方脚本校验:perl %s %s %s %s\n", + MSR.SIGHAN05_ROOT + "/scripts/score", + MSR.TRAIN_WORDS, + MSR.GOLD_PATH, + MSR.OUTPUT_PATH); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/demo/DemoHMMSegment.java b/src/test/java/com/hankcs/demo/DemoHMMSegment.java deleted file mode 100644 index cb095769c..000000000 --- a/src/test/java/com/hankcs/demo/DemoHMMSegment.java +++ /dev/null @@ -1,65 +0,0 @@ -/* - * - * hankcs - * me@hankcs.com - * 2015/5/7 19:01 - * - * - * Copyright (c) 2003-2015, hankcs. 
All Right Reserved, http://www.hankcs.com/ - * - */ -package com.hankcs.demo; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.seg.HMM.HMMSegment; -import com.hankcs.hanlp.seg.Segment; -import com.hankcs.hanlp.seg.common.Term; - -import java.util.List; - -/** - * 演示二阶隐马分词,这是一种基于字标注的分词方法,对未登录词支持较好,对已登录词的分词速度慢。综合性能不如CRF分词。 - * 还未稳定,请不要用于生产环境。二阶隐马标注分词效果尚且不好,许多开源分词器使用甚至使用一阶隐马(BiGram二元文法), - * 效果可想而知。对基于字符的序列标注分词方法,我只推荐CRF。 - * - * @author hankcs - */ -public class DemoHMMSegment -{ - public static void main(String[] args) - { - HanLP.Config.ShowTermNature = false; // 关闭词性显示 - Segment segment = new HMMSegment(); - String[] sentenceArray = new String[] - { - "HanLP是由一系列模型与算法组成的Java工具包,目标是普及自然语言处理在生产环境中的应用。", - "高锰酸钾,强氧化剂,紫红色晶体,可溶于水,遇乙醇即被还原。常用作消毒剂、水净化剂、氧化剂、漂白剂、毒气吸收剂、二氧化碳精制剂等。", // 专业名词有一定辨识能力 - "《夜晚的骰子》通过描述浅草的舞女在暗夜中扔骰子的情景,寄托了作者对庶民生活区的情感", // 非新闻语料 - "这个像是真的[委屈]前面那个打扮太江户了,一点不上品...@hankcs", // 微博 - "鼎泰丰的小笼一点味道也没有...每样都淡淡的...淡淡的,哪有食堂2A的好次", - "克里斯蒂娜·克罗尔说:不,我不是虎妈。我全家都热爱音乐,我也鼓励他们这么做。", - "今日APPS:Sago Mini Toolbox培养孩子动手能力", - "财政部副部长王保安调任国家统计局党组书记", - "2.34米男子娶1.53米女粉丝 称夫妻生活没问题", - "你看过穆赫兰道吗", - "乐视超级手机能否承载贾布斯的生态梦" - }; - for (String sentence : sentenceArray) - { - List termList = segment.seg(sentence); - System.out.println(termList); - } - - // 测个速度 - String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原"; - System.out.println(segment.seg(text)); - long start = System.currentTimeMillis(); - int pressure = 1000; - for (int i = 0; i < pressure; ++i) - { - segment.seg(text); - } - double costTime = (System.currentTimeMillis() - start) / (double)1000; - System.out.printf("HMM2分词速度:%.2f字每秒\n", text.length() * pressure / costTime); - } -} diff --git a/src/test/java/com/hankcs/demo/DemoMultithreadingSegment.java b/src/test/java/com/hankcs/demo/DemoMultithreadingSegment.java index c1923d0fe..4c78662a9 100644 --- a/src/test/java/com/hankcs/demo/DemoMultithreadingSegment.java +++ b/src/test/java/com/hankcs/demo/DemoMultithreadingSegment.java @@ -11,9 +11,12 @@ package com.hankcs.demo; import 
com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer; import com.hankcs.hanlp.seg.CRF.CRFSegment; import com.hankcs.hanlp.seg.Segment; +import java.io.IOException; + /** * 演示多线程并行分词 * 由于HanLP的任何分词器都是线程安全的,所以用户只需调用一个配置接口就可以启用任何分词器的并行化 @@ -22,9 +25,9 @@ */ public class DemoMultithreadingSegment { - public static void main(String[] args) + public static void main(String[] args) throws IOException { - Segment segment = new CRFSegment(); // CRF分词器效果好,速度慢,并行化之后可以提高一些速度 + Segment segment = new CRFLexicalAnalyzer(HanLP.Config.CRFCWSModelPath).enableCustomDictionary(false); // CRF分词器效果好,速度慢,并行化之后可以提高一些速度 String text = "举办纪念活动铭记二战历史,不忘战争带给人类的深重灾难,是为了防止悲剧重演,确保和平永驻;" + "铭记二战历史,更是为了提醒国际社会,需要共同捍卫二战胜利成果和国际公平正义," + diff --git a/src/test/java/com/hankcs/demo/DemoNLPSegment.java b/src/test/java/com/hankcs/demo/DemoNLPSegment.java index 3b8c93e9a..1b4be587d 100644 --- a/src/test/java/com/hankcs/demo/DemoNLPSegment.java +++ b/src/test/java/com/hankcs/demo/DemoNLPSegment.java @@ -11,22 +11,25 @@ */ package com.hankcs.demo; -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.seg.common.Term; import com.hankcs.hanlp.tokenizer.NLPTokenizer; - -import java.util.List; +import com.hankcs.hanlp.utility.TestUtility; /** - * NLP分词 + * NLP分词,更精准的中文分词、词性标注与命名实体识别。 + * 语料库规模决定实际效果,面向生产环境的语料库应当在千万字量级。欢迎用户在自己的语料上训练新模型以适应新领域、识别新的命名实体。 + * 标注集请查阅 https://github.com/hankcs/HanLP/blob/master/data/dictionary/other/TagPKU98.csv + * 或者干脆调用 Sentence#translateLabels() 转为中文 + * * @author hankcs */ -public class DemoNLPSegment +public class DemoNLPSegment extends TestUtility { public static void main(String[] args) { - HanLP.Config.enableDebug(); - List termList = NLPTokenizer.segment("上外日本文化经济学院的陆晚霞教授正在教授泛读课程"); - System.out.println(termList); + NLPTokenizer.ANALYZER.enableCustomDictionary(false); // 中文分词≠词典,不用词典照样分词。 + System.out.println(NLPTokenizer.segment("我新造一个词叫幻想乡你能识别并正确标注词性吗?")); // “正确”是副形词。 + // 注意观察下面两个“希望”的词性、两个“晚霞”的词性 + 
System.out.println(NLPTokenizer.analyze("我的希望是希望张晚霞的背影被晚霞映红").translateLabels()); + System.out.println(NLPTokenizer.analyze("支援臺灣正體香港繁體:微软公司於1975年由比爾·蓋茲和保羅·艾倫創立。")); } } diff --git a/src/test/java/com/hankcs/demo/DemoNewWordDiscover.java b/src/test/java/com/hankcs/demo/DemoNewWordDiscover.java index fb8310926..8b068b900 100644 --- a/src/test/java/com/hankcs/demo/DemoNewWordDiscover.java +++ b/src/test/java/com/hankcs/demo/DemoNewWordDiscover.java @@ -13,6 +13,7 @@ import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.mining.word.WordInfo; +import com.hankcs.hanlp.utility.TestUtility; import java.io.IOException; import java.util.List; @@ -24,10 +25,12 @@ */ public class DemoNewWordDiscover { + static final String CORPUS_PATH = TestUtility.ensureTestData("红楼梦.txt", "http://file.hankcs.com/corpus/红楼梦.zip"); + public static void main(String[] args) throws IOException { // 文本长度越大越好,试试红楼梦? - List<WordInfo> wordInfoList = HanLP.extractWords(IOUtil.newBufferedReader("data/test/红楼梦.txt"), 100); + List<WordInfo> wordInfoList = HanLP.extractWords(IOUtil.newBufferedReader(CORPUS_PATH), 100); System.out.println(wordInfoList); } } diff --git a/src/test/java/com/hankcs/demo/DemoOrganizationRecognition.java b/src/test/java/com/hankcs/demo/DemoOrganizationRecognition.java index f672d89e6..ab870780c 100644 --- a/src/test/java/com/hankcs/demo/DemoOrganizationRecognition.java +++ b/src/test/java/com/hankcs/demo/DemoOrganizationRecognition.java @@ -29,8 +29,9 @@ public static void main(String[] args) "我在上海林原科技有限公司兼职工作,", "我经常在台川喜宴餐厅吃饭,", "偶尔去开元地中海影城看电影。", + "不用词典,福哈生态工程有限公司是动态识别的结果。", }; - Segment segment = HanLP.newSegment().enableOrganizationRecognize(true); + Segment segment = HanLP.newSegment().enableCustomDictionary(false).enableOrganizationRecognize(true); for (String sentence : testCase) { List<Term> termList = segment.seg(sentence); diff --git a/src/test/java/com/hankcs/demo/DemoPerceptronLexicalAnalyzer.java
b/src/test/java/com/hankcs/demo/DemoPerceptronLexicalAnalyzer.java new file mode 100644 index 000000000..29ef01e80 --- /dev/null +++ b/src/test/java/com/hankcs/demo/DemoPerceptronLexicalAnalyzer.java @@ -0,0 +1,61 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-15 下午5:39 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. + * + */ +package com.hankcs.demo; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.IOException; + +/** + * 基于感知机序列标注的词法分析器,可选多个模型。 + * - large训练自一亿字的大型综合语料库,是已知范围内全世界最大的中文分词语料库。 + * - pku199801训练自个人修订版1998人民日报语料1月份,仅有183万字。 + * 语料库规模决定实际效果,面向生产环境的语料库应当在千万字量级。欢迎用户在自己的语料上训练新模型以适应新领域、识别新的命名实体。 + * 无论在何种语料上训练,都完全支持简繁全半角和大小写。 + * + * @author hankcs + */ +public class DemoPerceptronLexicalAnalyzer extends TestUtility +{ + public static void main(String[] args) throws IOException + { + PerceptronLexicalAnalyzer analyzer = new PerceptronLexicalAnalyzer("data/model/perceptron/pku199801/cws.bin", + HanLP.Config.PerceptronPOSModelPath, + HanLP.Config.PerceptronNERModelPath); + System.out.println(analyzer.analyze("上海华安工业(集团)公司董事长谭旭光和秘书胡花蕊来到美国纽约现代艺术博物馆参观")); + System.out.println(analyzer.analyze("微软公司於1975年由比爾·蓋茲和保羅·艾倫創立,18年啟動以智慧雲端、前端為導向的大改組。")); + + // 任何模型总会有失误,特别是98年这种陈旧的语料库 + System.out.println(analyzer.analyze("总统普京与特朗普通电话讨论太空探索技术公司")); + // 支持在线学习 + analyzer.learn("与/c 特朗普/nr 通/v 电话/n 讨论/v [太空/s 探索/vn 技术/n 公司/n]/nt"); + // 学习到新知识 + System.out.println(analyzer.analyze("总统普京与特朗普通电话讨论太空探索技术公司")); + // 还可以举一反三 + System.out.println(analyzer.analyze("主席和特朗普通电话")); + + // 知识的泛化不是死板的规则,而是比较灵活的统计信息 + System.out.println(analyzer.analyze("我在浙江金华出生")); + analyzer.learn("在/p 浙江/ns 金华/ns 出生/v"); + System.out.println(analyzer.analyze("我在四川金华出生,我的名字叫金华")); + + // 在线学习后的模型支持序列化,以分词模型为例: +// 
analyzer.getPerceptronSegmenter().getModel().save(HanLP.Config.PerceptronCWSModelPath); + + // 请用户按需执行对空格制表符等的预处理,只有你最清楚自己的文本中都有些什么奇怪的东西 + System.out.println(analyzer.analyze("空格 \t\n\r\f&nbsp;统统都不要" + .replaceAll("\\s+", "") // 去除所有空白符 + .replaceAll("&nbsp;", "") // 如果一些文本中含有html控制符 + )); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/demo/DemoPipeline.java b/src/test/java/com/hankcs/demo/DemoPipeline.java new file mode 100644 index 000000000..87e250fd6 --- /dev/null +++ b/src/test/java/com/hankcs/demo/DemoPipeline.java @@ -0,0 +1,58 @@ +/* + * Han He + * me@hankcs.com + * 2018-11-10 10:51 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * See LICENSE file in the project root for full license information. + * + */ +package com.hankcs.demo; + +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.tokenizer.pipe.LexicalAnalyzerPipeline; +import com.hankcs.hanlp.tokenizer.pipe.Pipe; +import com.hankcs.hanlp.tokenizer.pipe.RegexRecognizePipe; + +import java.io.IOException; +import java.util.List; +import java.util.regex.Pattern; + +/** + * 演示流水线模式,几个概念: + * - pipe:流水线的一节管道,执行统计分词或规则逻辑 + * - flow:管道的数据流,在同名方法中执行本节管道的业务 + * - pipeline:流水线,由至少一节管道(统计分词管道)构成,可自由调整管道的拼装方式 + * + * @author hankcs + */ +public class DemoPipeline +{ + private static final Pattern WEB_URL =
Pattern.compile("((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?(?:(((([a-zA-Z0-9][a-zA-Z0-9\\-]*)*[a-zA-Z0-9]\\.)+((aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(biz|b[abdefghijmnorstvwyz])|(cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(edu|e[cegrstu])|f[ijkmor]|(gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(info|int|i[delmnoqrst])|(jobs|j[emop])|k[eghimnprwyz]|l[abcikrstuvy]|(mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])|(name|net|n[acefgilopruz])|(org|om)|(pro|p[aefghklmnrstwy])|qa|r[eosuw]|s[abcdeghijklmnortuvyz]|(tel|travel|t[cdfghjklmnoprtvwz])|u[agksyz]|v[aceginu]|w[fs]|(δοκιμή|испытание|рф|срб|טעסט|آزمایشی|إختبار|الاردن|الجزائر|السعودية|المغرب|امارات|بھارت|تونس|سورية|فلسطين|قطر|مصر|परीक्षा|भारत|ভারত|ਭਾਰਤ|ભારત|இந்தியா|இலங்கை|சிங்கப்பூர்|பரிட்சை|భారత్|ලංකා|ไทย|テスト|中国|中國|台湾|台灣|新加坡|测试|測試|香港|테스트|한국|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)|y[et]|z[amw]))|((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9]))))(?:\\:\\d{1,5})?)(\\/(?:(?:[a-zA-Z0-
9\\;\\/\\?\\:\\@\\&\\=\\#\\~\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"); + private static final Pattern EMAIL = Pattern.compile("(\\w+(?:[-+.]\\w+)*)@(\\w+(?:[-.]\\w+)*\\.\\w+(?:[-.]\\w+)*)"); + + public static void main(String[] args) throws IOException + { + LexicalAnalyzerPipeline analyzer = new LexicalAnalyzerPipeline(new PerceptronLexicalAnalyzer()); + // 管道顺序=优先级,自行调整管道顺序以控制优先级 + analyzer.addFirst(new RegexRecognizePipe(WEB_URL, "【网址】")); + analyzer.addFirst(new RegexRecognizePipe(EMAIL, "【邮件】")); + analyzer.addLast(new Pipe<List<IWord>, List<IWord>>() // 自己写个管道也并非难事 + { + @Override + public List<IWord> flow(List<IWord> input) + { + for (IWord word : input) + { + if ("nx".equals(word.getLabel())) + word.setLabel("字母"); + } + return input; + } + }); + String text = "HanLP的项目地址是https://github.com/hankcs/HanLP,联系邮箱abc@def.com"; + System.out.println(analyzer.analyze(text)); + } +} diff --git a/src/test/java/com/hankcs/demo/DemoSentimentAnalysis.java b/src/test/java/com/hankcs/demo/DemoSentimentAnalysis.java index 578f5d985..9b2e8ed03 100644 --- a/src/test/java/com/hankcs/demo/DemoSentimentAnalysis.java +++ b/src/test/java/com/hankcs/demo/DemoSentimentAnalysis.java @@ -14,6 +14,7 @@ import com.hankcs.hanlp.classification.classifiers.IClassifier; import com.hankcs.hanlp.classification.classifiers.NaiveBayesClassifier; +import com.hankcs.hanlp.utility.TestUtility; import java.io.File; import java.io.IOException; @@ -28,7 +29,7 @@ public class DemoSentimentAnalysis /** * 中文情感挖掘语料-ChnSentiCorp 谭松波 */ - public static final String CORPUS_FOLDER = "data/test/ChnSentiCorp情感分析酒店评论"; + public static final String CORPUS_FOLDER = TestUtility.ensureTestData("ChnSentiCorp情感分析酒店评论", "http://file.hankcs.com/corpus/ChnSentiCorp.zip"); public static void main(String[] args) throws IOException { diff --git a/src/test/java/com/hankcs/demo/DemoStopWord.java b/src/test/java/com/hankcs/demo/DemoStopWord.java index 5d92f1034..91ce97c7c 100644 ---
a/src/test/java/com/hankcs/demo/DemoStopWord.java +++ b/src/test/java/com/hankcs/demo/DemoStopWord.java @@ -19,6 +19,8 @@ import java.util.List; +import static com.hankcs.hanlp.corpus.tag.Nature.nz; + /** * 演示如何去除停用词 * @@ -45,9 +47,8 @@ public static void main(String[] args) @Override public boolean shouldInclude(Term term) { - switch (term.nature) + if (term.nature == nz) { - case nz: return !CoreStopWordDictionary.contains(term.word); } return false; diff --git a/src/test/java/com/hankcs/demo/DemoTextClassification.java b/src/test/java/com/hankcs/demo/DemoTextClassification.java index 42e733c01..282dccf71 100644 --- a/src/test/java/com/hankcs/demo/DemoTextClassification.java +++ b/src/test/java/com/hankcs/demo/DemoTextClassification.java @@ -16,6 +16,7 @@ import com.hankcs.hanlp.classification.classifiers.NaiveBayesClassifier; import com.hankcs.hanlp.classification.models.NaiveBayesModel; import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.utility.TestUtility; import java.io.File; import java.io.IOException; @@ -30,16 +31,17 @@ public class DemoTextClassification /** * 搜狗文本分类语料库5个类目,每个类目下1000篇文章,共计5000篇文章 */ - public static final String CORPUS_FOLDER = "data/test/搜狗文本分类语料库迷你版"; + public static final String CORPUS_FOLDER = TestUtility.ensureTestData("搜狗文本分类语料库迷你版", "http://file.hankcs.com/corpus/sogou-text-classification-corpus-mini.zip"); /** * 模型保存路径 */ public static final String MODEL_PATH = "data/test/classification-model.ser"; + public static void main(String[] args) throws IOException { IClassifier classifier = new NaiveBayesClassifier(trainOrLoadModel()); - predict(classifier, "C罗压梅西内马尔蝉联金球奖 2017=C罗年"); + predict(classifier, "C罗获2018环球足球奖最佳球员 德尚荣膺最佳教练"); predict(classifier, "英国造航母耗时8年仍未服役 被中国速度远远甩在身后"); predict(classifier, "研究生考录模式亟待进一步专业化"); predict(classifier, "如果真想用食物解压,建议可以食用燕麦"); diff --git a/src/test/java/com/hankcs/demo/DemoTextClustering.java b/src/test/java/com/hankcs/demo/DemoTextClustering.java new file mode 100644 index 
000000000..c7d7f25f4 --- /dev/null +++ b/src/test/java/com/hankcs/demo/DemoTextClustering.java @@ -0,0 +1,33 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-18 11:11 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.demo; + +import com.hankcs.hanlp.mining.cluster.ClusterAnalyzer; + +/** + * @author hankcs + */ +public class DemoTextClustering +{ + public static void main(String[] args) + { + ClusterAnalyzer<String> analyzer = new ClusterAnalyzer<String>(); + analyzer.addDocument("赵一", "流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 摇滚, 摇滚, 摇滚, 摇滚"); + analyzer.addDocument("钱二", "爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲"); + analyzer.addDocument("张三", "古典, 古典, 古典, 古典, 民谣, 民谣, 民谣, 民谣"); + analyzer.addDocument("李四", "爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 金属, 金属, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲"); + analyzer.addDocument("王五", "流行, 流行, 流行, 流行, 摇滚, 摇滚, 摇滚, 嘻哈, 嘻哈, 嘻哈"); + analyzer.addDocument("马六", "古典, 古典, 古典, 古典, 古典, 古典, 古典, 古典, 摇滚"); + System.out.println(analyzer.kmeans(3)); + System.out.println(analyzer.repeatedBisection(3)); + System.out.println(analyzer.repeatedBisection(1.0)); // 自动判断聚类数量k + } +} diff --git a/src/test/java/com/hankcs/demo/DemoTextClusteringFMeasure.java b/src/test/java/com/hankcs/demo/DemoTextClusteringFMeasure.java new file mode 100644 index 000000000..c3da6da38 --- /dev/null +++ b/src/test/java/com/hankcs/demo/DemoTextClusteringFMeasure.java @@ -0,0 +1,29 @@ +/* + * Han He + * me@hankcs.com + * 2018-08-18 11:11 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.demo; + +import com.hankcs.hanlp.mining.cluster.ClusterAnalyzer; + +import static com.hankcs.demo.DemoTextClassification.CORPUS_FOLDER; + +/** + * @author hankcs + */ +public class DemoTextClusteringFMeasure +{ + public static void main(String[] args) + { + for (String algorithm : new String[]{"kmeans", "repeated bisection"}) + { + System.out.printf("%s F1=%.2f\n", algorithm, ClusterAnalyzer.evaluate(CORPUS_FOLDER, algorithm) * 100); + } + } +} diff --git a/src/test/java/com/hankcs/demo/DemoUseAhoCorasickDoubleArrayTrieSegment.java b/src/test/java/com/hankcs/demo/DemoUseAhoCorasickDoubleArrayTrieSegment.java index b302792e8..d3dacba75 100644 --- a/src/test/java/com/hankcs/demo/DemoUseAhoCorasickDoubleArrayTrieSegment.java +++ b/src/test/java/com/hankcs/demo/DemoUseAhoCorasickDoubleArrayTrieSegment.java @@ -12,23 +12,22 @@ package com.hankcs.demo; import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.seg.Other.AhoCorasickDoubleArrayTrieSegment; +import java.io.IOException; + /** * 基于AhoCorasickDoubleArrayTrie的分词器,该分词器允许用户跳过核心词典,直接使用自己的词典。 * 需要注意的是,自己的词典必须遵守HanLP词典格式。 + * * @author hankcs */ public class DemoUseAhoCorasickDoubleArrayTrieSegment { - public static void main(String[] args) + public static void main(String[] args) throws IOException { - String dictionaryPath = HanLP.Config.CustomDictionaryPath[0]; - if (!IOUtil.isFileExisted(dictionaryPath)) return; // AhoCorasickDoubleArrayTrieSegment要求用户必须提供自己的词典路径 - AhoCorasickDoubleArrayTrieSegment segment = new AhoCorasickDoubleArrayTrieSegment() - .loadDictionary(dictionaryPath); + AhoCorasickDoubleArrayTrieSegment segment = new AhoCorasickDoubleArrayTrieSegment(HanLP.Config.CustomDictionaryPath[0]); System.out.println(segment.seg("微观经济学继续教育循环经济")); } } diff --git a/src/test/java/com/hankcs/demo/DemoWord2Vec.java b/src/test/java/com/hankcs/demo/DemoWord2Vec.java index 19f506130..182dd8584 100644 --- 
a/src/test/java/com/hankcs/demo/DemoWord2Vec.java +++ b/src/test/java/com/hankcs/demo/DemoWord2Vec.java @@ -10,10 +10,12 @@ */ package com.hankcs.demo; +import com.hankcs.hanlp.corpus.MSR; import com.hankcs.hanlp.corpus.io.IOUtil; import com.hankcs.hanlp.mining.word2vec.DocVectorModel; import com.hankcs.hanlp.mining.word2vec.Word2VecTrainer; import com.hankcs.hanlp.mining.word2vec.WordVectorModel; +import com.hankcs.hanlp.utility.TestUtility; import java.io.IOException; import java.util.Map; @@ -25,15 +27,17 @@ */ public class DemoWord2Vec { - private static final String TRAIN_FILE_NAME = "data/test/搜狗文本分类语料库已分词.txt"; + private static final String TRAIN_FILE_NAME = MSR.TRAIN_PATH; private static final String MODEL_FILE_NAME = "data/test/word2vec.txt"; public static void main(String[] args) throws IOException { WordVectorModel wordVectorModel = trainOrLoadModel(); - printNearest("中国", wordVectorModel); + printNearest("上海", wordVectorModel); printNearest("美丽", wordVectorModel); printNearest("购买", wordVectorModel); + System.out.println(wordVectorModel.similarity("上海", "广州")); + System.out.println(wordVectorModel.analogy("日本", "自民党", "共和党")); // 文档向量 DocVectorModel docVectorModel = new DocVectorModel(wordVectorModel); diff --git a/src/test/java/com/hankcs/hanlp/HanLPTest.java b/src/test/java/com/hankcs/hanlp/HanLPTest.java new file mode 100644 index 000000000..f8c137d33 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/HanLPTest.java @@ -0,0 +1,24 @@ +package com.hankcs.hanlp; + +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.seg.Viterbi.ViterbiSegment; +import junit.framework.TestCase; + +public class HanLPTest extends TestCase +{ + public void testNewSegment() throws Exception + { + assertTrue(HanLP.newSegment("维特比") instanceof ViterbiSegment); + assertTrue(HanLP.newSegment("感知机") instanceof PerceptronLexicalAnalyzer); + } + + public void testDicUpdate() + { + System.out.println(HanLP.segment("大数据是一个新词汇!")); + } + + 
public void testConvertToPinyinList() + { + System.out.println(HanLP.convertToPinyinString("你好", " ", false)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/algorithm/EditDistanceTest.java b/src/test/java/com/hankcs/hanlp/algorithm/EditDistanceTest.java new file mode 100644 index 000000000..e7112c257 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/algorithm/EditDistanceTest.java @@ -0,0 +1,53 @@ +package com.hankcs.hanlp.algorithm; + +import com.hankcs.hanlp.corpus.synonym.Synonym; +import com.hankcs.hanlp.dictionary.common.CommonSynonymDictionary.SynonymItem; +import org.junit.Assert; +import org.junit.Test; +import java.util.ArrayList; + +public class EditDistanceTest { + + @Test + public void testComputeCharArray() { + Assert.assertEquals(2, + EditDistance.compute("foo".toCharArray(), "oof".toCharArray())); + } + + @Test + public void testComputeString() { + Assert.assertEquals(2, EditDistance.compute("foo", "oof")); + } + + @Test + public void testComputeList() { + ArrayList<SynonymItem> synonymItems1 = new ArrayList<SynonymItem>(); + synonymItems1.add(new SynonymItem(new Synonym("", 32L), null, '=')); + + ArrayList<SynonymItem> synonymItems2 = new ArrayList<SynonymItem>(); + synonymItems2.add(new SynonymItem(new Synonym("", 64L), null, '=')); + + Assert.assertEquals(32L, + EditDistance.compute(synonymItems1, synonymItems2)); + } + + @Test + public void testComputeLong() { + Assert.assertEquals(3074457345618258602L, + EditDistance.compute(new long[]{}, new long[]{4L, 0L})); + Assert.assertEquals(-15, + EditDistance.compute(new long[]{-16}, new long[]{32})); + Assert.assertEquals(0, + EditDistance.compute(new long[]{16}, new long[]{16})); + } + + @Test + public void testComputeInt() { + Assert.assertEquals(715827882, + EditDistance.compute(new int[]{4, 0}, new int[]{})); + Assert.assertEquals(6, + EditDistance.compute(new int[]{4, 0}, new int[]{8, 16})); + Assert.assertEquals(0, + EditDistance.compute(new int[]{16}, new int[]{16})); + } +} diff --git 
a/src/test/java/com/hankcs/hanlp/algorithm/ahocorasick/trie/TrieTest.java b/src/test/java/com/hankcs/hanlp/algorithm/ahocorasick/trie/TrieTest.java new file mode 100644 index 000000000..daf2fb94d --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/algorithm/ahocorasick/trie/TrieTest.java @@ -0,0 +1,66 @@ +package com.hankcs.hanlp.algorithm.ahocorasick.trie; + +import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import junit.framework.TestCase; + +import java.util.TreeMap; + +public class TrieTest extends TestCase +{ + public void testHasKeyword() throws Exception + { + TreeMap<String, String> map = new TreeMap<String, String>(); + String[] keyArray = new String[] + { + "hers", + "his", + "she", + "he" + }; + for (String key : keyArray) + { + map.put(key, key); + } + Trie trie = new Trie(); + trie.addAllKeyword(map.keySet()); + for (String key : keyArray) + { + assertTrue(trie.hasKeyword(key)); + } + assertTrue(trie.hasKeyword("ushers")); + assertFalse(trie.hasKeyword("构建耗时")); + } + + public void testParseText() throws Exception + { + TreeMap<String, String> map = new TreeMap<String, String>(); + String[] keyArray = new String[] + { + "hers", + "his", + "she", + "he" + }; + for (String key : keyArray) + { + map.put(key, key); + } + AhoCorasickDoubleArrayTrie<String> act = new AhoCorasickDoubleArrayTrie<String>(); + act.build(map); +// act.debug(); + final String text = "uhers"; + act.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<String>() + { + @Override + public void hit(int begin, int end, String value) + { +// System.out.printf("[%d:%d]=%s\n", begin, end, value); + assertEquals(value, text.substring(begin, end)); + } + }); + } + + +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifierTest.java b/src/test/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifierTest.java index e47490880..4d8ff509c 100644 --- 
a/src/test/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifierTest.java +++ b/src/test/java/com/hankcs/hanlp/classification/classifiers/NaiveBayesClassifierTest.java @@ -15,6 +15,11 @@ public class NaiveBayesClassifierTest extends TestCase private static final String MODEL_PATH = "data/test/classification.ser"; private Map<String, String[]> trainingDataSet; + @Override + public void setUp() throws Exception + { + super.setUp(); + } private void loadDataSet() { @@ -50,7 +55,7 @@ public void testPredictAndAccuracy() throws Exception } NaiveBayesClassifier naiveBayesClassifier = new NaiveBayesClassifier(model); // 预测单个文档 - String path = CORPUS_FOLDER + "/财经/12.txt"; + String path = CORPUS_FOLDER + "/体育/0004.txt"; String text = IOUtil.readTxt(path); String label = naiveBayesClassifier.classify(text); String title = text.split("\\n")[0].replaceAll("\\s", ""); diff --git a/src/test/java/com/hankcs/hanlp/collection/AhoCorasick/AhoCorasickDoubleArrayTrieTest.java b/src/test/java/com/hankcs/hanlp/collection/AhoCorasick/AhoCorasickDoubleArrayTrieTest.java new file mode 100644 index 000000000..2640da76d --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/collection/AhoCorasick/AhoCorasickDoubleArrayTrieTest.java @@ -0,0 +1,122 @@ +package com.hankcs.hanlp.collection.AhoCorasick; + +import com.hankcs.hanlp.algorithm.ahocorasick.trie.Emit; +import com.hankcs.hanlp.algorithm.ahocorasick.trie.Trie; +import com.hankcs.hanlp.collection.trie.DoubleArrayTrie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import junit.framework.TestCase; +import org.junit.Assert; + +import java.util.*; + +public class AhoCorasickDoubleArrayTrieTest extends TestCase +{ + + public void testTwoAC() throws Exception + { + TreeMap<String, String> map = new TreeMap<String, String>(); + IOUtil.LineIterator iterator = new IOUtil.LineIterator("data/dictionary/CoreNatureDictionary.mini.txt"); + while (iterator.hasNext()) + { + String line = iterator.next().split("\\s")[0]; + map.put(line, line); + } + + Trie trie = new Trie(); + 
trie.addAllKeyword(map.keySet()); + AhoCorasickDoubleArrayTrie<String> act = new AhoCorasickDoubleArrayTrie<String>(); + act.build(map); + + for (String key : map.keySet()) + { + Collection<Emit> emits = trie.parseText(key); + Set<String> otherSet = new HashSet<String>(); + for (Emit emit : emits) + { + otherSet.add(emit.getKeyword() + emit.getEnd()); + } + + List<AhoCorasickDoubleArrayTrie.Hit<String>> entries = act.parseText(key); + Set<String> mySet = new HashSet<String>(); + for (AhoCorasickDoubleArrayTrie.Hit<String> entry : entries) + { + mySet.add(entry.value + (entry.end - 1)); + } + + assertEquals(otherSet, mySet); + } + } + + public void testBuildEmptyTrie() + { + AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<String>(); + TreeMap<String, String> map = new TreeMap<String, String>(); + acdat.build(map); + assertEquals(0, acdat.size()); + assertEquals(0, acdat.parseText("uhers").size()); + } + + /** + * 测试构建和匹配,使用《我的团长我的团》.txt作为测试数据,并且判断匹配是否正确 + * @throws Exception + */ +// public void testSegment() throws Exception +// { +// TreeMap<String, String> map = new TreeMap<String, String>(); +// IOUtil.LineIterator iterator = new IOUtil.LineIterator("data/dictionary/CoreNatureDictionary.txt"); +// while (iterator.hasNext()) +// { +// String line = iterator.next().split("\\s")[0]; +// map.put(line, line); +// } +// +// Trie trie = new Trie(); +// trie.addAllKeyword(map.keySet()); +// AhoCorasickDoubleArrayTrie<String> act = new AhoCorasickDoubleArrayTrie<String>(); +// long timeMillis = System.currentTimeMillis(); +// act.build(map); +// System.out.println("构建耗时:" + (System.currentTimeMillis() - timeMillis) + " ms"); +// +// LinkedList<String> lineList = IOUtil.readLineList("D:\\Doc\\语料库\\《我的团长我的团》.txt"); +// timeMillis = System.currentTimeMillis(); +// for (String sentence : lineList) +// { +//// System.out.println(sentence); +// List<AhoCorasickDoubleArrayTrie.Hit<String>> entryList = act.parseText(sentence); +// for (AhoCorasickDoubleArrayTrie.Hit<String> entry : entryList) +// { +// int end = entry.end; +// int start = entry.begin; +//// System.out.printf("[%d:%d]=%s\n", start, end, entry.value); +// +// assertEquals(sentence.substring(start, end), entry.value); +// } +// 
} +// System.out.printf("%d ms\n", System.currentTimeMillis() - timeMillis); +// } + + public void testEnableFastBuild() { + TreeMap<String, String> map = new TreeMap<String, String>(); + IOUtil.LineIterator iterator = new IOUtil.LineIterator("data/dictionary/CoreNatureDictionary.txt"); + while (iterator.hasNext()) + { + String line = iterator.next(); + map.put(line, line); + } + + long startTimeMillis1 = System.currentTimeMillis(); + AhoCorasickDoubleArrayTrie<String> trie1 = new AhoCorasickDoubleArrayTrie<String>(); + trie1.build(map); + long costTimeMillis1 = System.currentTimeMillis() - startTimeMillis1; + + long startTimeMillis2 = System.currentTimeMillis(); + AhoCorasickDoubleArrayTrie<String> trie2 = new AhoCorasickDoubleArrayTrie<String>(true); + trie2.build(map); + long costTimeMillis2 = System.currentTimeMillis() - startTimeMillis2; + + System.out.printf("[trie1]size=%s,costTimeMillis=%s\n", trie1.size, costTimeMillis1); + System.out.printf("[trie2]size=%s,costTimeMillis=%s\n", trie2.size, costTimeMillis2); + Assert.assertTrue(trie1.size < trie2.size && trie2.size < trie1.size * 1.5); + Assert.assertTrue(costTimeMillis2 < costTimeMillis1 / 1.5); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrieTest.java b/src/test/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrieTest.java index 11d841c71..1c05f7c51 100644 --- a/src/test/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrieTest.java +++ b/src/test/java/com/hankcs/hanlp/collection/trie/DoubleArrayTrieTest.java @@ -1,11 +1,42 @@ package com.hankcs.hanlp.collection.trie; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.dictionary.CustomDictionary; import junit.framework.TestCase; +import org.junit.Assert; import java.util.TreeMap; public class DoubleArrayTrieTest extends TestCase { + public void testDatFromFile() throws Exception + { + TreeMap<String, String> map = new TreeMap<String, String>(); + IOUtil.LineIterator iterator = new 
IOUtil.LineIterator("data/dictionary/CoreNatureDictionary.mini.txt"); + while (iterator.hasNext()) + { + String line = iterator.next(); + map.put(line, line); + } + DoubleArrayTrie<String> trie = new DoubleArrayTrie<String>(); + trie.build(map); + for (String key : map.keySet()) + { + assertEquals(key, trie.get(key)); + } + + trie.build(map); + for (String key : map.keySet()) + { + assertEquals(key, trie.get(key)); + } + } + + public void testGet() throws Exception + { + } + public void testLongestSearcher() throws Exception { TreeMap<String, String> buildFrom = new TreeMap<String, String>(); @@ -15,11 +46,154 @@ public void testLongestSearcher() throws Exception buildFrom.put(key, key); } DoubleArrayTrie<String> trie = new DoubleArrayTrie<String>(buildFrom); - String text = "her3he6his! "; + String text = "her3he6his-hers! "; DoubleArrayTrie<String>.LongestSearcher searcher = trie.getLongestSearcher(text.toCharArray(), 0); while (searcher.next()) { - System.out.printf("[%d, %d)=%s\n", searcher.begin, searcher.begin + searcher.length, searcher.value); +// System.out.printf("[%d, %d)=%s\n", searcher.begin, searcher.begin + searcher.length, searcher.value); + assertEquals(searcher.value, text.substring(searcher.begin, searcher.begin + searcher.length)); + } + } + + public void testLongestSearcherWithNullValue() { + TreeMap<String, String> buildFrom = new TreeMap<String, String>(); + TreeMap<String, String> buildFromValueNull = new TreeMap<String, String>(); + String[] keys = new String[]{"he", "her", "his"}; + for (String key : keys) { + buildFrom.put(key, key); + buildFromValueNull.put(key, null); + } + DoubleArrayTrie<String> trie = new DoubleArrayTrie<String>(buildFrom); + DoubleArrayTrie<String> trieValueNull = new DoubleArrayTrie<String>(buildFromValueNull); + + String text = "her3he6his-hers! 
"; + + DoubleArrayTrie<String>.LongestSearcher searcher = trie.getLongestSearcher(text.toCharArray(), 0); + DoubleArrayTrie<String>.LongestSearcher searcherValueNull = trieValueNull.getLongestSearcher(text.toCharArray(), 0); + + while (true) { + boolean next = searcher.next(); + boolean nextValueNull = searcherValueNull.next(); + + if (next && nextValueNull) { + assertTrue(searcher.begin == searcherValueNull.begin && searcher.length == searcherValueNull.length); + } else if (next || nextValueNull) { + assert false; + break; + } else { + break; + } } } + + public void testTransmit() throws Exception + { + DoubleArrayTrie<CoreDictionary.Attribute> dat = CustomDictionary.DEFAULT.dat; + int index = dat.transition("钱", 1); + assertNull(dat.output(index)); + index = dat.transition("龙", index); + assertEquals("n 1 ", dat.output(index).toString()); + } + +// public void testCombine() throws Exception +// { +// DoubleArrayTrie<CoreDictionary.Attribute> dat = CustomDictionary.dat; +// String[] wordNet = new String[] +// { +// "他", +// "一", +// "丁", +// "不", +// "识", +// "一", +// "丁", +// "呀", +// }; +// for (int i = 0; i < wordNet.length; ++i) +// { +// int state = 1; +// state = dat.transition(wordNet[i], state); +// if (state > 0) +// { +// int start = i; +// int to = i + 1; +// int end = - 1; +// CoreDictionary.Attribute value = null; +// for (; to < wordNet.length; ++to) +// { +// state = dat.transition(wordNet[to], state); +// if (state < 0) break; +// CoreDictionary.Attribute output = dat.output(state); +// if (output != null) +// { +// value = output; +// end = to + 1; +// } +// } +// if (value != null) +// { +// StringBuilder sbTerm = new StringBuilder(); +// for (int j = start; j < end; ++j) +// { +// sbTerm.append(wordNet[j]); +// } +// System.out.println(sbTerm.toString() + "/" + value); +// i = end - 1; +// } +// } +// } +// } + + public void testHandleEmptyString() throws Exception + { + String emptyString = ""; + DoubleArrayTrie<String> dat = new DoubleArrayTrie<String>(); + TreeMap<String, String> dictionary = new TreeMap<String, String>(); + dictionary.put("bug", "问题"); + 
dat.build(dictionary); + DoubleArrayTrie<String>.Searcher searcher = dat.getSearcher(emptyString, 0); + while (searcher.next()) + { + } + } + + public void testIssue966() throws Exception + { + TreeMap<String, String> map = new TreeMap<String, String>(); + for (String word : "001乡道, 北京, 北京市通信公司, 来广营乡, 通州区".split(", ")) + { + map.put(word, word); + } + DoubleArrayTrie<String> trie = new DoubleArrayTrie<String>(map); + DoubleArrayTrie<String>.LongestSearcher searcher = trie.getLongestSearcher("北京市通州区001乡道发生了一件有意思的事情,来广营乡歌舞队正在跳舞", 0); + while (searcher.next()) + { + System.out.printf("%d %s\n", searcher.begin, searcher.value); + } + } + + public void testEnableFastBuild() { + TreeMap<String, String> map = new TreeMap<String, String>(); + IOUtil.LineIterator iterator = new IOUtil.LineIterator("data/dictionary/CoreNatureDictionary.mini.txt"); + while (iterator.hasNext()) + { + String line = iterator.next(); + map.put(line, line); + } + + long startTimeMillis1 = System.currentTimeMillis(); + DoubleArrayTrie<String> trie1 = new DoubleArrayTrie<String>(); + trie1.build(map); + long costTimeMillis1 = System.currentTimeMillis() - startTimeMillis1; + + long startTimeMillis2 = System.currentTimeMillis(); + DoubleArrayTrie<String> trie2 = new DoubleArrayTrie<String>(true); + trie2.build(map); + long costTimeMillis2 = System.currentTimeMillis() - startTimeMillis2; + + System.out.printf("[trie1]size=%s,costTimeMillis=%s\n", trie1.size, costTimeMillis1); + System.out.printf("[trie2]size=%s,costTimeMillis=%s\n", trie2.size, costTimeMillis2); + Assert.assertTrue(trie1.size < trie2.size && trie2.size < trie1.size * 1.5); + Assert.assertTrue(costTimeMillis2 < costTimeMillis1 / 1.5); + } } \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrieParseTextTest.java b/src/test/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrieParseTextTest.java new file mode 100644 index 000000000..7f48d8a7a --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrieParseTextTest.java @@ -0,0 +1,56 @@ +package com.hankcs.hanlp.collection.trie.bintrie; + 
+import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.util.HashSet; +import java.util.Set; + +public class BinTrieParseTextTest { + + + private final String[] words = new String[]{"溜", "儿", "溜儿", "一溜儿", "一溜"}; + private BinTrie<Integer> trie; + + @Before + public void setup() { + this.trie = new BinTrie<Integer>(); + /*构建一个简单的词典, 从 core dict 文件中扣出的一部分*/ + for (int i = 0; i < words.length; i++) { + this.trie.put(words[i], i); + } + } + + + @Test + public void testFullParse() { + assertFullParse("一溜儿"); + assertFullParse("一溜儿 "); + assertFullParse("一溜儿 "); + } + + private void assertFullParse(String text) { + Set<String> result = parseText(text); + /*确保每个词都被分出来了*/ + for (String word : words) { + Assert.assertTrue(result.contains(word)); + } + } + + + private Set<String> parseText(final String text) { + final Set<String> result = new HashSet<String>(words.length); + trie.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<Integer>() { + @Override + public void hit(int begin, int end, Integer value) { + result.add(text.substring(begin, end)); + } + }); + + return result; + } + + +} diff --git a/src/test/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrieTest.java b/src/test/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrieTest.java index ae1456ae5..a3a27c4cc 100644 --- a/src/test/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrieTest.java +++ b/src/test/java/com/hankcs/hanlp/collection/trie/bintrie/BinTrieTest.java @@ -1,10 +1,27 @@ package com.hankcs.hanlp.collection.trie.bintrie; +import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie; +import com.hankcs.hanlp.corpus.util.DictionaryUtil; +import com.hankcs.hanlp.dictionary.CustomDictionary; import junit.framework.TestCase; +import java.io.File; +import java.util.Map; +import java.util.Set; + public class BinTrieTest extends TestCase { + static String DATA_TEST_OUT_BIN; + private File tempFile; + + 
@Override + public void setUp() throws Exception + { + tempFile = File.createTempFile("hanlp-", ".dat"); + DATA_TEST_OUT_BIN = tempFile.getAbsolutePath(); + } + public void testParseText() throws Exception { BinTrie<String> trie = new BinTrie<String>(); @@ -19,11 +36,58 @@ public void testParseText() throws Exception @Override public void hit(int begin, int end, String value) { - System.out.printf("[%d, %d)=%s\n", begin, end, value); +// System.out.printf("[%d, %d)=%s\n", begin, end, value); assertEquals(value, text.substring(begin, end)); } }; // trie.parseLongestText(text, processor); trie.parseText(text, processor); } + + public void testPut() throws Exception + { + BinTrie<Boolean> trie = new BinTrie<Boolean>(); + trie.put("加入", true); + trie.put("加入", false); + + assertEquals(new Boolean(false), trie.get("加入")); + } + + public void testArrayIndexOutOfBoundsException() throws Exception + { + BinTrie<Boolean> trie = new BinTrie<Boolean>(); + trie.put(new char[]{'\uffff'}, true); + } + + public void testSaveAndLoad() throws Exception + { + BinTrie<Integer> trie = new BinTrie<Integer>(); + trie.put("haha", 0); + trie.put("hankcs", 1); + trie.put("hello", 2); + trie.put("za", 3); + trie.put("zb", 4); + trie.put("zzz", 5); + assertTrue(trie.save(DATA_TEST_OUT_BIN)); + trie = new BinTrie<Integer>(); + Integer[] value = new Integer[100]; + for (int i = 0; i < value.length; ++i) + { + value[i] = i; + } + assertTrue(trie.load(DATA_TEST_OUT_BIN, value)); + Set<Map.Entry<String, Integer>> entrySet = trie.entrySet(); + assertEquals("[haha=0, hankcs=1, hello=2, za=3, zb=4, zzz=5]", entrySet.toString()); + } + +// public void testCustomDictionary() throws Exception +// { +// HanLP.Config.enableDebug(true); +// System.out.println(CustomDictionary.get("龟兔赛跑")); +// } +// +// public void testSortCustomDictionary() throws Exception +// { +// DictionaryUtil.sortDictionary(HanLP.Config.CustomDictionaryPath[0]); +// } } \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/collection/trie/datrie/IntArrayListTest.java 
b/src/test/java/com/hankcs/hanlp/collection/trie/datrie/IntArrayListTest.java new file mode 100644 index 000000000..27b0450d2 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/collection/trie/datrie/IntArrayListTest.java @@ -0,0 +1,33 @@ +package com.hankcs.hanlp.collection.trie.datrie; + +import com.hankcs.hanlp.corpus.io.ByteArray; +import junit.framework.TestCase; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.FileOutputStream; + +public class IntArrayListTest extends TestCase +{ + IntArrayList array = new IntArrayList(); + + @Override + public void setUp() throws Exception + { + for (int i = 0; i < 64; ++i) + { + array.append(i); + } + } + + public void testSaveLoad() throws Exception + { + File tempFile = File.createTempFile("hanlp", ".intarray"); + array.save(new DataOutputStream(new FileOutputStream(tempFile.getAbsolutePath()))); + array.load(ByteArray.createByteArray(tempFile.getAbsolutePath())); + for (int i = 0; i < 64; ++i) + { + assertEquals(i, array.get(i)); + } + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrieIntegerTest.java b/src/test/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrieIntegerTest.java new file mode 100644 index 000000000..88c2ed478 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/collection/trie/datrie/MutableDoubleArrayTrieIntegerTest.java @@ -0,0 +1,43 @@ +package com.hankcs.hanlp.collection.trie.datrie; + +import com.hankcs.hanlp.corpus.io.ByteArray; +import junit.framework.TestCase; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.FileOutputStream; + +public class MutableDoubleArrayTrieIntegerTest extends TestCase +{ + MutableDoubleArrayTrieInteger mdat; + private int size; + + @Override + public void setUp() throws Exception + { + mdat = new MutableDoubleArrayTrieInteger(); + size = 64; + for (int i = 0; i < size; ++i) + { + mdat.put(String.valueOf(i), i); + } + } + + public void 
testSaveLoad() throws Exception + { + File tempFile = File.createTempFile("hanlp", ".mdat"); + mdat.save(new DataOutputStream(new FileOutputStream(tempFile))); + mdat = new MutableDoubleArrayTrieInteger(); + mdat.load(ByteArray.createByteArray(tempFile.getAbsolutePath())); + assertEquals(size, mdat.size()); + for (int i = 0; i < size; ++i) + { + assertEquals(i, mdat.get(String.valueOf(i))); + } + + for (int i = size; i < 2 * size; ++i) + { + assertEquals(-1, mdat.get(String.valueOf(i))); + } + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/collection/trie/datrie/Utf8CharacterMappingTest.java b/src/test/java/com/hankcs/hanlp/collection/trie/datrie/Utf8CharacterMappingTest.java new file mode 100644 index 000000000..bf142b118 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/collection/trie/datrie/Utf8CharacterMappingTest.java @@ -0,0 +1,37 @@ +package com.hankcs.hanlp.collection.trie.datrie; + +import junit.framework.TestCase; + +public class Utf8CharacterMappingTest extends TestCase +{ + public void testToIdList() throws Exception + { + Utf8CharacterMapping ucm = new Utf8CharacterMapping(); + String s = "汉字\uD801\uDC00\uD801\uDC00ab\uD801\uDC00\uD801\uDC00cd"; + int[] bytes1 = ucm.toIdList(s); + System.out.println("UTF-8: " + bytes1.length); + { + int charCount = 1; + int start = 0; + for (int i = 0; i < s.length(); i += charCount) + { + int codePoint = s.codePointAt(i); + charCount = Character.charCount(codePoint); + + int[] arr = ucm.toIdList(codePoint); + for (int j = 0; j < arr.length; j++, start++) + { + if (bytes1[start] != arr[j]) + { + System.out.println("error: " + start + "," + j); + System.exit(-1); + } + } + } + if (start != bytes1.length) + { + System.out.println("error: " + start + "," + bytes1.length); + } + } + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/MSR.java b/src/test/java/com/hankcs/hanlp/corpus/MSR.java new file mode 100644 index 000000000..5d19c4e42 --- /dev/null +++ 
b/src/test/java/com/hankcs/hanlp/corpus/MSR.java @@ -0,0 +1,38 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-24 10:34 AM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.utility.TestUtility; + +/** + * @author hankcs + */ +public class MSR +{ + public static String TRAIN_PATH = "data/test/icwb2-data/training/msr_training.utf8"; + public static String TEST_PATH = "data/test/icwb2-data/testing/msr_test.utf8"; + public static String GOLD_PATH = "data/test/icwb2-data/gold/msr_test_gold.utf8"; + public static String MODEL_PATH = "data/test/msr_cws"; + public static String OUTPUT_PATH = "data/test/msr_output.txt"; + public static String TRAIN_WORDS = "data/test/icwb2-data/gold/msr_training_words.utf8"; + public static String SIGHAN05_ROOT; + + static + { + SIGHAN05_ROOT = TestUtility.ensureTestData("icwb2-data", "http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip"); + if (!IOUtil.isFileExisted(TRAIN_PATH)) + { + System.err.println("请下载 http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip 并解压为 data/test/icwb2-data"); + System.exit(1); + } + } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/PKU.java b/src/test/java/com/hankcs/hanlp/corpus/PKU.java new file mode 100644 index 000000000..6b39f799c --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/PKU.java @@ -0,0 +1,69 @@ +/* + * Han He + * me@hankcs.com + * 2018-07-04 5:36 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. 
+ * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.utility.TestUtility; + +import java.io.BufferedWriter; +import java.io.IOException; +import java.util.ArrayList; + +/** + * @author hankcs + */ +public class PKU +{ + public static String PKU199801; + public static String PKU199801_TRAIN = "data/test/pku98/199801-train.txt"; + public static String PKU199801_TEST = "data/test/pku98/199801-test.txt"; + public static String POS_MODEL = "/pos.bin"; + public static String NER_MODEL = "/ner.bin"; + public static final String PKU_98 = TestUtility.ensureTestData("pku98", "http://file.hankcs.com/corpus/pku98.zip"); + + static + { + PKU199801 = PKU_98 + "/199801.txt"; + POS_MODEL = PKU_98 + POS_MODEL; + NER_MODEL = PKU_98 + NER_MODEL; + if (!IOUtil.isFileExisted(PKU199801_TRAIN)) + { + ArrayList<String> all = new ArrayList<String>(); + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(PKU199801); + while (lineIterator.hasNext()) + { + all.add(lineIterator.next()); + } + try + { + BufferedWriter bw = IOUtil.newBufferedWriter(PKU199801_TRAIN); + for (String line : all.subList(0, (int) (all.size() * 0.9))) + { + bw.write(line); + bw.newLine(); + } + bw.close(); + + bw = IOUtil.newBufferedWriter(PKU199801_TEST); + for (String line : all.subList((int) (all.size() * 0.9), all.size())) + { + bw.write(line); + bw.newLine(); + } + bw.close(); + } + catch (IOException e) + { + e.printStackTrace(); + } + } + } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/TestAdjustCoreDictionary.java b/src/test/java/com/hankcs/hanlp/corpus/TestAdjustCoreDictionary.java new file mode 100644 index 000000000..c1a201305 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/TestAdjustCoreDictionary.java @@ -0,0 +1,116 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/12/24 12:11 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. 
Please contact 上海林原信息科技有限公司 to get more information. + * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; +import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; +import com.hankcs.hanlp.corpus.dictionary.TFDictionary; +import com.hankcs.hanlp.corpus.dictionary.item.Item; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.occurrence.TermFrequency; +import com.hankcs.hanlp.corpus.util.CorpusUtil; +import junit.framework.TestCase; + +import java.util.List; +import java.util.Map; + +/** + * 往核心词典里补充等效词串 + * @author hankcs + */ +public class TestAdjustCoreDictionary extends TestCase +{ + +// public static final String DATA_DICTIONARY_CORE_NATURE_DICTIONARY_TXT = HanLP.Config.CoreDictionaryPath; +// +// public void testGetCompiledWordFromDictionary() throws Exception +// { +// DictionaryMaker dictionaryMaker = DictionaryMaker.load("data/test/CoreNatureDictionary.txt"); +// for (Map.Entry entry : dictionaryMaker.entrySet()) +// { +// String word = entry.getKey(); +// Item item = entry.getValue(); +// if (word.matches(".##.")) +// { +// System.out.println(item); +// } +// } +// } +// +// public void testViewNGramDictionary() throws Exception +// { +// TFDictionary tfDictionary = new TFDictionary(); +// tfDictionary.load("data/dictionary/CoreNatureDictionary.ngram.txt"); +// for (Map.Entry entry : tfDictionary.entrySet()) +// { +// String word = entry.getKey(); +// TermFrequency frequency = entry.getValue(); +// if (word.contains("##")) +// { +// System.out.println(frequency); +// } +// } +// } +// +// public void testSortCoreNatureDictionary() throws Exception +// { +// DictionaryMaker dictionaryMaker = DictionaryMaker.load(DATA_DICTIONARY_CORE_NATURE_DICTIONARY_TXT); +// 
dictionaryMaker.saveTxtTo(DATA_DICTIONARY_CORE_NATURE_DICTIONARY_TXT); +// } +// +// public void testSimplifyNZ() throws Exception +// { +// final DictionaryMaker nzDictionary = new DictionaryMaker(); +// CorpusLoader.walk("D:\\Doc\\语料库\\2014", new CorpusLoader.Handler() +// { +// @Override +// public void handle(Document document) +// { +// for (List sentence : document.getComplexSentenceList()) +// { +// for (IWord word : sentence) +// { +// if (word instanceof CompoundWord && "nz".equals(word.getLabel())) +// { +// nzDictionary.add(word); +// } +// } +// } +// } +// }); +// nzDictionary.saveTxtTo("data/test/nz.txt"); +// } +// +// public void testRemoveNumber() throws Exception +// { +// // 一些汉字数词留着没用,除掉它们 +// DictionaryMaker dictionaryMaker = DictionaryMaker.load(DATA_DICTIONARY_CORE_NATURE_DICTIONARY_TXT); +// dictionaryMaker.saveTxtTo(DATA_DICTIONARY_CORE_NATURE_DICTIONARY_TXT, new DictionaryMaker.Filter() +// { +// @Override +// public boolean onSave(Item item) +// { +// if (item.key.length() == 1 && "0123456789零○〇一二两三四五六七八九十廿百千万亿壹贰叁肆伍陆柒捌玖拾佰仟".indexOf(item.key.charAt(0)) >= 0) +// { +// System.out.println(item); +// return false; +// } +// +// return true; +// } +// }); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/TestICWB.java b/src/test/java/com/hankcs/hanlp/corpus/TestICWB.java new file mode 100644 index 000000000..be18667b1 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/TestICWB.java @@ -0,0 +1,96 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/11/29 21:11 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
+ * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.corpus.io.IOUtil; +import junit.framework.TestCase; + +import java.io.BufferedWriter; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.OutputStreamWriter; +import java.util.LinkedList; +import java.util.List; + +/** + * 玩玩ICWB的数据 + * + * @author hankcs + */ +public class TestICWB extends TestCase +{ + +// public static final String PATH = "D:\\Doc\\语料库\\icwb2-data\\training\\msr_training.utf8"; +// +// public void testGenerateBMES() throws Exception +// { +// BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(PATH + ".bmes.txt"))); +// for (String line : IOUtil.readLineListWithLessMemory(PATH)) +// { +// String[] wordArray = line.split("\\s"); +// for (String word : wordArray) +// { +// if (word.length() == 1) +// { +// bw.write(word + "\tS\n"); +// } +// else if (word.length() > 1) +// { +// bw.write(word.charAt(0) + "\tB\n"); +// for (int i = 1; i < word.length() - 1; ++i) +// { +// bw.write(word.charAt(i) + "\tM\n"); +// } +// bw.write(word.charAt(word.length() - 1) + "\tE\n"); +// } +// } +// bw.newLine(); +// } +// bw.close(); +// } +// +// public void testDumpPeople2014ToBEMS() throws Exception +// { +// final BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("D:\\Tools\\CRF++-0.58\\example\\seg_cn\\2014.txt"))); +// CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014", new CorpusLoader.Handler() +// { +// @Override +// public void handle(Document document) +// { +// List> simpleSentenceList = document.getSimpleSentenceList(); +// for (List wordList : simpleSentenceList) +// { +// try +// { +// for (Word word : wordList) +// { +// +// bw.write(word.value); +// 
bw.write(' '); +// +// } +// bw.newLine(); +// } +// catch (IOException e) +// { +// e.printStackTrace(); +// } +// } +// } +// }); +// bw.close(); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/TestJianFanDictionaryMaker.java b/src/test/java/com/hankcs/hanlp/corpus/TestJianFanDictionaryMaker.java new file mode 100644 index 000000000..40ef47d4b --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/TestJianFanDictionaryMaker.java @@ -0,0 +1,188 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/11/1 19:46 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. + * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dictionary.StringDictionary; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.other.CharTable; +import junit.framework.TestCase; + +import java.io.*; +import java.util.*; + +/** + * @author hankcs + */ +public class TestJianFanDictionaryMaker extends TestCase +{ + + private String cc = "/Users/hankcs/CppProjects/OpenCC/data/dictionary/"; + private String root = "data/dictionary/tc/"; + + public void testCombine() throws Exception + { +// StringDictionary dictionaryHanLP = new StringDictionary("="); +// dictionaryHanLP.load(HanLP.Config.t2sDictionaryPath); +// +// StringDictionary dictionaryOuter = new StringDictionary("="); +// dictionaryOuter.load("D:\\Doc\\语料库\\简繁分歧词表.txt"); +// +// for (Map.Entry entry : dictionaryOuter.entrySet()) +// { +// String t = entry.getKey(); +// String s = entry.getValue(); +// if (t.length() == 1) continue; +// if (HanLP.convertToTraditionalChinese(s).equals(t)) continue; +// dictionaryHanLP.add(t, s); +// } +// +// dictionaryHanLP.save(HanLP.Config.t2sDictionaryPath); + } + +// public void testConvertSingle() throws Exception +// { +// 
System.out.println(HanLP.convertToTraditionalChinese("一个劲")); +// } +// +// public void testIssue() throws Exception +// { +// System.out.println(HanLP.convertToSimplifiedChinese("缐")); +// System.out.println(CharTable.convert("缐")); +// } +// +// public void testImportOpenCC() throws Exception +// { +// // 转换OpenCC的词库 +// Map s2t = new TreeMap(); +// combine("\t", s2t, cc + "STCharacters.txt", +// cc + "STPhrases.txt" +// ); +// save(s2t, "data/dictionary/tc/s2t.txt"); +// Map t2s = new TreeMap(); +// combine("=", t2s, "data/dictionary/tc/TraditionalChinese.txt"); +// combine("\t", t2s, cc + "TSCharacters.txt", +// cc + "TSPhrases.txt" +// ); +// save(t2s, "data/dictionary/tc/t2s.txt"); +// } +// +// public void testMakeHK() throws Exception +// { +// Map t2hk = new TreeMap(); +// combine("\t", t2hk, +// cc + "HKVariantsPhrases.txt", +// cc + "HKVariants.txt" +// ); +// save(t2hk, "data/dictionary/tc/t2hk.txt"); +// } +// +// public void testMakeTW() throws Exception +// { +// Map t2tw = new TreeMap(); +// combine("\t", t2tw, +// cc + "TWPhrasesIT.txt", +// cc + "TWPhrasesName.txt", +// cc + "TWPhrasesOther.txt", +// cc + "TWVariants.txt" +// ); +// save(t2tw, "data/dictionary/tc/t2tw.txt"); +// } +// +// private void save(Map storage, String path) throws IOException +// { +// BufferedWriter bw = IOUtil.newBufferedWriter(path); +// for (Map.Entry entry : storage.entrySet()) +// { +// String line = entry.toString(); +// int firstBlank = line.indexOf(' '); +// if (firstBlank != -1) +// { +// line = line.substring(0, firstBlank); +// } +// bw.write(line); +// bw.newLine(); +// } +// bw.close(); +// } +// +// private Map> combine(Map s2t, Map t2s) +// { +// Map> all = new TreeMap>(); +// for (Map.Entry entry : s2t.entrySet()) +// { +// String key = entry.getKey(); +// Set value = all.get(key); +// if (value == null) +// { +// value = new TreeSet(); +// all.put(key, value); +// } +// for (String v : entry.getValue().split(" ")) +// { +// if (key.length() == 1 && 
key.equals(v)) +// { +// continue; +// } +// value.add(v); +// } +// } +// +// for (Map.Entry entry : t2s.entrySet()) +// { +// for (String key : entry.getValue().split(" ")) +// { +// if (key.length() == 1 && key.equals(entry.getKey())) +// { +// continue; +// } +// Set value = all.get(key); +// if (value == null) +// { +// value = new TreeSet(); +// all.put(key, value); +// } +// +// value.add(entry.getKey()); +// } +// } +// +// return all; +// } +// +// private Map combine(String delimiter, Map storage, String... pathArray) +// { +// for (String path : pathArray) +// { +// IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(path); +// while (lineIterator.hasNext()) +// { +// String line = lineIterator.next(); +// String[] args = line.split(delimiter); +// if (args.length != 2) +// { +// System.err.println(line); +// System.exit(-1); +// } +// storage.put(args[0], args[1]); +// } +// } +// +// return storage; +// } +// +// public void testChar() throws Exception +// { +// String line = "㐹\t㑶 㐹"; +// System.out.println('㐹' == '㐹'); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/TestMakeCompanyCorpus.java b/src/test/java/com/hankcs/hanlp/corpus/TestMakeCompanyCorpus.java new file mode 100644 index 000000000..598cfda2b --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/TestMakeCompanyCorpus.java @@ -0,0 +1,118 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/11/18 19:48 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
+ * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; +import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; +import com.hankcs.hanlp.corpus.dictionary.NTDictionaryMaker; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.seg.common.Term; +import junit.framework.TestCase; + +import java.io.*; +import java.util.List; + + +/** + * @author hankcs + */ +public class TestMakeCompanyCorpus extends TestCase +{ +// public void testMake() throws Exception +// { +// DijkstraSegment segment = new DijkstraSegment(); +// String line = null; +// BufferedReader bw = new BufferedReader(new InputStreamReader(new FileInputStream("D:\\Doc\\语料库\\company.dic"))); +// BufferedWriter br = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("data/test/nt/company.txt"))); +// int limit = Integer.MAX_VALUE; +// while ((line = bw.readLine()) != null && limit-- > 0) +// { +// if (line.endsWith(")")) continue; +// if (line.length() < 4) continue; +// if (line.contains("个体") || line.contains("个人")) +// { +// continue; +// } +// List termList = segment.seg(line); +// if (termList.size() == 0) continue; +// Term last = termList.get(termList.size() - 1); +// last.nature = Nature.nis; +// br.write("["); +// for (Term term : termList) +// { +// br.write(term.toString()); +// if (term != last) br.write(" "); +// } +// br.write("]/ntc"); +// br.newLine(); +// br.flush(); +// } +// bw.close(); +// br.close(); +// } +// +// public void testParse() throws Exception +// { +// EasyDictionary dictionary = EasyDictionary.create("data/dictionary/2014_dictionary.txt"); +// final NTDictionaryMaker nsDictionaryMaker = new NTDictionaryMaker(dictionary); +// // CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014\\", new 
CorpusLoader.Handler() +// CorpusLoader.walk("data/test/nt/part/", new CorpusLoader.Handler() +// { +// @Override +// public void handle(Document document) +// { +// nsDictionaryMaker.compute(document.getComplexSentenceList()); +// } +// }); +// nsDictionaryMaker.saveTxtTo("D:\\JavaProjects\\HanLP\\data\\dictionary\\organization\\outerNT"); +// } +// +// public void testSplitLargeFile() throws Exception +// { +// String line = null; +// BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("data/test/nt/company.txt"))); +// int id = 1; +// BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("data/test/nt/part/" + id + ".txt"))); +// int count = 1; +// while ((line = br.readLine()) != null) +// { +// if (count == 1000) +// { +// bw.close(); +// bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("data/test/nt/part/" + id + ".txt"))); +// ++id; +// count = 0; +// } +// bw.write(line); +// bw.newLine(); +// ++count; +// } +// br.close(); +// } +// +// public void testCase() throws Exception +// { +// HanLP.Config.enableDebug(); +// DijkstraSegment segment = new DijkstraSegment(); +// segment.enableOrganizationRecognize(true); +// System.out.println(segment.seg("黑龙江建筑职业技术学院近百学生发生冲突")); +// } +// +// public void testCombine() throws Exception +// { +// DictionaryMaker.combine("data/dictionary/organization/nt.txt", "data/dictionary/organization/outerNT.txt").saveTxtTo("data/dictionary/organization/nt.txt"); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/TestMakePinYinDictionary.java b/src/test/java/com/hankcs/hanlp/corpus/TestMakePinYinDictionary.java new file mode 100644 index 000000000..7f74dbabd --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/TestMakePinYinDictionary.java @@ -0,0 +1,328 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/11/1 23:51 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. 
All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. + * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dictionary.SimpleDictionary; +import com.hankcs.hanlp.corpus.dictionary.StringDictionary; +import com.hankcs.hanlp.corpus.dictionary.StringDictionaryMaker; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.py.*; +import com.hankcs.hanlp.utility.TextUtility; +import junit.framework.TestCase; + +import java.util.*; + +/** + * @author hankcs + */ +public class TestMakePinYinDictionary extends TestCase +{ +// public void testCombine() throws Exception +// { +// HanLP.Config.enableDebug(); +// StringDictionary dictionaryPY = new StringDictionary(); +// dictionaryPY.load("D:\\JavaProjects\\jpinyin\\data\\pinyinTable.standard.txt"); +// +//// StringDictionary dictionaryAnsj = new StringDictionary(); +//// dictionaryAnsj.load("D:\\JavaProjects\\jpinyin\\data\\ansj.txt"); +//// System.out.println(dictionaryAnsj.remove(new SimpleDictionary.Filter() +//// { +//// @Override +//// public boolean remove(Map.Entry entry) +//// { +//// return entry.getValue().toString().endsWith("0"); +//// } +//// })); +// +// StringDictionary dictionaryPolyphone = new StringDictionary(); +// dictionaryPolyphone.load("D:\\JavaProjects\\jpinyin\\data\\polyphone.txt"); +// +// StringDictionary dictionarySingle = new StringDictionary(); +// dictionarySingle.load("data/dictionary/pinyin/single.txt"); +// +// StringDictionary main = StringDictionaryMaker.combine(dictionaryPY, dictionaryPolyphone, dictionarySingle); +// main.save("data/dictionary/pinyin/pinyin.txt"); +// } +// +// public void testCombineSingle() throws Exception +// { +// HanLP.Config.enableDebug(); +// StringDictionary main = StringDictionaryMaker.combine("data/dictionary/pinyin/pinyin.txt", "data/dictionary/pinyin/single.txt"); +// 
main.save("data/dictionary/pinyin/pinyin.txt"); +// } +// +// public void testSpeed() throws Exception +// { +// +// } +// +// +// public void testMakeSingle() throws Exception +// { +// LinkedList csv = IOUtil.readCsv("D:\\JavaProjects\\jpinyin\\data\\words.csv"); +// StringDictionary dictionarySingle = new StringDictionary(); +// for (String[] args : csv) +// { +// // 0 1 2 3 4 5 6 7 +// // 6895,中,zhong,zh,ong,1,\u4E2D,中 zhong \u4E2D +// String word = args[1]; +// String py = args[2]; +// String sm = args[3]; +// String ym = args[4]; +// String yd = args[5]; +// String pyyd = py + yd; +// // 过滤 +// if (!TextUtility.isAllChinese(word)) continue; +// dictionarySingle.add(word, pyyd); +// } +// dictionarySingle.save("data/dictionary/pinyin/single.txt"); +// } +// +// public void testMakeTable() throws Exception +// { +// LinkedList csv = IOUtil.readCsv("D:\\JavaProjects\\jpinyin\\data\\words.csv"); +// StringDictionary dictionarySingle = new StringDictionary(); +// for (String[] args : csv) +// { +// // 0 1 2 3 4 5 6 7 +// // 6895,中,zhong,zh,ong,1,\u4E2D,中 zhong \u4E2D +// String word = args[1]; +// String py = args[2]; +// String sm = args[3]; +// String ym = args[4]; +// String yd = args[5]; +// String pyyd = py + yd; +// // 过滤 +// if (!TextUtility.isAllChinese(word)) continue; +// dictionarySingle.add(pyyd, sm + "," + ym + "," + yd); +// } +// dictionarySingle.save("data/dictionary/pinyin/sm-ym-table.txt"); +// } +// +// public void testConvert() throws Exception +// { +// String text = "重载不是重担," + HanLP.convertToTraditionalChinese("以后爱皇后"); +// List pinyinList = PinyinDictionary.convertToPinyin(text); +// System.out.print("原文,"); +// for (char c : text.toCharArray()) +// { +// System.out.printf("%c,", c); +// } +// System.out.println(); +// +// System.out.print("拼音(数字音调),"); +// for (Pinyin pinyin : pinyinList) +// { +// System.out.printf("%s,", pinyin); +// } +// System.out.println(); +// +// System.out.print("拼音(符号音调),"); +// for (Pinyin pinyin : pinyinList) 
+// { +// System.out.printf("%s,", pinyin.getPinyinWithToneMark()); +// } +// System.out.println(); +// +// System.out.print("拼音(无音调),"); +// for (Pinyin pinyin : pinyinList) +// { +// System.out.printf("%s,", pinyin.getPinyinWithoutTone()); +// } +// System.out.println(); +// +// System.out.print("声调,"); +// for (Pinyin pinyin : pinyinList) +// { +// System.out.printf("%s,", pinyin.getTone()); +// } +// System.out.println(); +// +// System.out.print("声母,"); +// for (Pinyin pinyin : pinyinList) +// { +// System.out.printf("%s,", pinyin.getShengmu()); +// } +// System.out.println(); +// +// System.out.print("韵母,"); +// for (Pinyin pinyin : pinyinList) +// { +// System.out.printf("%s,", pinyin.getYunmu()); +// } +// System.out.println(); +// +// System.out.print("输入法头,"); +// for (Pinyin pinyin : pinyinList) +// { +// System.out.printf("%s,", pinyin.getHeadString()); +// } +// System.out.println(); +// } +// +// public void testMakePinyinEnum() throws Exception +// { +// StringDictionary dictionary = new StringDictionary(); +// dictionary.load("data/dictionary/pinyin/pinyin.txt"); +// +// StringDictionary pyEnumDictionary = new StringDictionary(); +// for (Map.Entry entry : dictionary.entrySet()) +// { +// String[] args = entry.getValue().split(","); +// for (String arg : args) +// { +// pyEnumDictionary.add(arg, arg); +// } +// } +// +// StringDictionary table = new StringDictionary(); +// table.combine(pyEnumDictionary); +// +// StringBuilder sb = new StringBuilder(); +// for (Map.Entry entry : table.entrySet()) +// { +// sb.append(entry.getKey()); +// sb.append('\n'); +// } +// IOUtil.saveTxt("data/dictionary/pinyin/py.enum.txt", sb.toString()); +// } +// +// /** +// * 有些拼音没有声母和韵母,尝试根据上文拓展它们 +// * @throws Exception +// */ +// public void testExtendTable() throws Exception +// { +// StringDictionary dictionary = new StringDictionary(); +// dictionary.load("data/dictionary/pinyin/pinyin.txt"); +// +// StringDictionary pyEnumDictionary = new StringDictionary(); +// 
for (Map.Entry entry : dictionary.entrySet()) +// { +// String[] args = entry.getValue().split(","); +// for (String arg : args) +// { +// pyEnumDictionary.add(arg, arg); +// } +// } +// +// StringDictionary table = new StringDictionary(); +// table.load("data/dictionary/pinyin/sm-ym-table.txt"); +// table.combine(pyEnumDictionary); +// +// Iterator> iterator = table.entrySet().iterator(); +// Map.Entry pre = iterator.next(); +// String prePy = pre.getKey().substring(0, pre.getKey().length() - 1); +// String preYd = pre.getKey().substring(pre.getKey().length() - 1); +// while (iterator.hasNext()) +// { +// Map.Entry current = iterator.next(); +// String currentPy = current.getKey().substring(0, current.getKey().length() - 1); +// String currentYd = current.getKey().substring(current.getKey().length() - 1); +// // handle it +// if (!current.getValue().contains(",")) +// { +// if (currentPy.equals(prePy)) +// { +// table.add(current.getKey(), pre.getValue().replace(preYd, currentYd)); +// } +// else +// { +// System.out.println(currentPy + currentYd); +// } +// } +// // end +// pre = current; +// prePy = currentPy; +// preYd = currentYd; +// } +// table.save("data/dictionary/pinyin/sm-ym-yd-table.txt"); +// } +// +// public void testDumpSMT() throws Exception +// { +// HanLP.Config.enableDebug(); +// SYTDictionary.dumpEnum("data/dictionary/pinyin/"); +// } +// +// public void testPinyinDictionary() throws Exception +// { +// HanLP.Config.enableDebug(); +// Pinyin[] pinyins = PinyinDictionary.get("中"); +// System.out.println(Arrays.toString(pinyins)); +// } +// +// public void testCombineAnsjWithPinyinTxt() throws Exception +// { +// StringDictionary dictionaryAnsj = new StringDictionary(); +// dictionaryAnsj.load("D:\\JavaProjects\\jpinyin\\data\\ansj.txt"); +// System.out.println(dictionaryAnsj.remove(new SimpleDictionary.Filter() +// { +// @Override +// public boolean remove(Map.Entry entry) +// { +// String word = entry.getKey(); +// String pinyin = 
entry.getValue(); +// String[] pinyinStringArray = entry.getValue().split("[,\\s ]"); +// if (word.length() != pinyinStringArray.length || !TonePinyinString2PinyinConverter.valid(pinyinStringArray)) +// { +// System.out.println(entry); +// return false; +// } +// +// return true; +// } +// })); +// +// } +// +// public void testMakePinyinJavaCode() throws Exception +// { +// StringBuilder sb = new StringBuilder(); +// for (Pinyin pinyin : PinyinDictionary.pinyins) +// { +// // 0声母 1韵母 2音调 3带音标 +// sb.append(pinyin + "(" + Shengmu.class.getSimpleName() + "." + pinyin.getShengmu() + ", " + Yunmu.class.getSimpleName() + "." + pinyin.getYunmu() + ", " + pinyin.getTone() + ", \"" + pinyin.getPinyinWithToneMark() + "\", \"" + pinyin.getPinyinWithoutTone() + "\"" + ", " + Head.class.getSimpleName() + "." + pinyin.getHeadString() + ", '" + pinyin.getFirstChar() + "'" + "),\n"); +// } +// IOUtil.saveTxt("data/dictionary/pinyin/py.txt", sb.toString()); +// } +// +// public void testConvertUnicodeTable() throws Exception +// { +// StringDictionary dictionary = new StringDictionary("="); +// for (String line : IOUtil.readLineList("D:\\Doc\\语料库\\Uni2Pinyin.txt")) +// { +// if (line.startsWith("#")) continue; +// String[] argArray = line.split("\\s"); +// if (argArray.length == 1) continue; +// String py = argArray[1]; +// for (int i = 2; i < argArray.length; ++i) +// { +// py += ','; +// py += argArray[i]; +// } +// dictionary.add(String.valueOf((char)(Integer.parseInt(argArray[0], 16))), py); +// } +// dictionary.save("D:\\Doc\\语料库\\Hanzi2Pinyin.txt"); +// } +// +// public void testCombineUnicodeTableWithMainDictionary() throws Exception +// { +// StringDictionary mainDictionary = new StringDictionary("="); +// mainDictionary.load("data/dictionary/pinyin/pinyin.txt"); +// StringDictionary subDictionary = new StringDictionary("="); +// subDictionary.load("D:\\Doc\\语料库\\Hanzi2Pinyin.txt"); +// mainDictionary.combine(subDictionary); +// 
mainDictionary.save("data/dictionary/pinyin/pinyin.txt"); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/TestMakeTranslateName.java b/src/test/java/com/hankcs/hanlp/corpus/TestMakeTranslateName.java new file mode 100644 index 000000000..3dc1d1d3c --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/TestMakeTranslateName.java @@ -0,0 +1,169 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/11/12 14:03 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. + * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.dictionary.nr.TranslatedPersonDictionary; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.tokenizer.StandardTokenizer; +import junit.framework.TestCase; + +import java.util.LinkedList; +import java.util.List; +import java.util.Set; +import java.util.TreeSet; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * @author hankcs + */ +public class TestMakeTranslateName extends TestCase +{ +// public void testCombineOuterDictionary() throws Exception +// { +// String root = "D:\\JavaProjects\\SougouDownload\\data\\"; +// String[] pathArray = new String[]{"常用外国人名.txt", "外国人名", "外国姓名大全.txt", "外国诗人名.txt", "英语姓名词典.txt", "俄罗斯人名.txt"}; +// Set wordSet = new TreeSet(); +// for (String path : pathArray) +// { +// path = root + path; +// for (String word : IOUtil.readLineList(path)) +// { +// word = word.replaceAll("[a-z]", ""); +// if (CoreDictionary.contains(word) || CustomDictionary.contains(word)) continue; +// wordSet.add(word); +// } +// } +// 
IOUtil.saveCollectionToTxt(wordSet, "data/dictionary/person/nrf.txt"); +// } +// +// public void testSpiltToChar() throws Exception +// { +// String commonChar = "·-—阿埃艾爱安昂敖奥澳笆芭巴白拜班邦保堡鲍北贝本比毕彼别波玻博勃伯泊卜布才采仓查差柴彻川茨慈次达大戴代丹旦但当道德得登迪狄蒂帝丁东杜敦多额俄厄鄂恩尔伐法范菲芬费佛夫福弗甫噶盖干冈哥戈革葛格各根古瓜哈海罕翰汗汉豪合河赫亨侯呼胡华霍基吉及加贾坚简杰金京久居君喀卡凯坎康考柯科可克肯库奎拉喇莱来兰郎朗劳勒雷累楞黎理李里莉丽历利立力连廉良列烈林隆卢虏鲁路伦仑罗洛玛马买麦迈曼茅茂梅门蒙盟米蜜密敏明摩莫墨默姆木穆那娜纳乃奈南内尼年涅宁纽努诺欧帕潘畔庞培佩彭皮平泼普其契恰强乔切钦沁泉让热荣肉儒瑞若萨塞赛桑瑟森莎沙山善绍舍圣施诗石什史士守斯司丝苏素索塔泰坦汤唐陶特提汀图土吐托陀瓦万王旺威韦维魏温文翁沃乌吾武伍西锡希喜夏相香歇谢辛新牙雅亚彦尧叶依伊衣宜义因音英雍尤于约宰泽增詹珍治中仲朱诸卓孜祖佐伽娅尕腓滕济嘉津赖莲琳律略慕妮聂裴浦奇齐琴茹珊卫欣逊札哲智兹芙汶迦珀琪梵斐胥黛" + +// "·-阿安奥巴比彼波布察茨大德得丁杜尔法夫伏甫盖格哈基加坚捷金卡科可克库拉莱兰勒雷里历利连列卢鲁罗洛马梅蒙米姆娜涅宁诺帕泼普奇齐乔切日萨色山申什斯索塔坦特托娃维文乌西希谢亚耶叶依伊以扎佐柴达登蒂戈果海赫华霍吉季津柯理琳玛曼穆纳尼契钦丘桑沙舍泰图瓦万雅卓兹" + +// "-·—丁万丘东丝中丹丽乃久义乌乔买于亚亨京什仑仓代以仲伊伍伏伐伦伯伽但佐佛佩依侯俄保儒克兰其兹内冈凯切列利别力加努劳勃勒北华卓南博卜卡卢卫厄历及古可史叶司各合吉吐君吾呼哈哥哲唐喀善喇喜嘉噶因图土圣坎坚坦埃培基堡塔塞增墨士夏多大夫奇奈奎契奥妮姆威娃娅娜孜季宁守安宜宰密察尔尕尤尧尼居山川差巴布希帕帝干平年库庞康廉弗强当彦彭彻彼律得德恩恰慈慕戈戴才扎托拉拜捷提摩敏敖敦文斐斯新施日旦旺昂明普智曼朗木本札朱李杜来杰林果查柯柴根格桑梅梵森楞次欣欧歇武比毕汀汉汗汤汶沁沃沙河治泉泊法波泰泼泽洛津济浦海涅温滕潘澳烈热爱牙特狄王玛玻珀珊珍班理琪琳琴瑞瑟瓜瓦甫申畔略登白皮盖盟相石祖福科穆立笆简米素索累约纳纽绍维罕罗翁翰考耶聂肉肯胡胥腓舍良色艾芙芬芭苏若英茂范茅茨茹荣莉莎莫莱莲菲萨葛蒂蒙虏蜜衣裴西詹让诗诸诺谢豪贝费贾赖赛赫路辛达迈连迦迪逊道那邦郎鄂采里金钦锡门阿陀陶隆雅雍雷霍革韦音额香马魏鲁鲍麦黎默黛齐" + +// "·—阿埃艾爱安昂敖奥澳笆芭巴白拜班邦保堡鲍北贝本比毕彼别波玻博勃伯泊卜布才采仓查差柴彻川茨慈次达大戴代丹旦但当道德得的登迪狄蒂帝丁东杜敦多额俄厄鄂恩尔伐法范菲芬费佛夫福弗甫噶盖干冈哥戈革葛格各根古瓜哈海罕翰汗汉豪合河赫亨侯呼胡华霍基吉及加贾坚简杰金京久居君喀卡凯坎康考柯科可克肯库奎拉喇莱来兰郎朗劳勒雷累楞黎理李里莉丽历利立力连廉良列烈林隆卢虏鲁路伦仑罗洛玛马买麦迈曼茅茂梅门蒙盟米蜜密敏明摩莫墨默姆木穆那娜纳乃奈南内尼年涅宁纽努诺欧帕潘畔庞培佩彭皮平泼普其契恰强乔切钦沁泉让热荣肉儒瑞若萨塞赛桑瑟森莎沙山善绍舍圣施诗石什史士守斯司丝苏素索塔泰坦汤唐陶特提汀图土吐托陀瓦万王旺威韦维魏温文翁沃乌吾武伍西锡希喜夏相香歇谢辛新牙雅亚彦尧叶依伊衣宜义因音英雍尤于约宰泽增詹珍治中仲朱诸卓孜祖佐伽娅尕腓滕济嘉津赖莲琳律略慕妮聂裴浦奇齐琴茹珊卫欣逊札哲智兹芙汶迦珀琪梵斐胥黛" + +// "·阿安奥巴比彼波布察茨大德得丁杜尔法夫伏甫盖格哈基加坚捷金卡科可克库拉莱兰勒雷里历利连列卢鲁罗洛马梅蒙米姆娜涅宁诺帕泼普奇齐乔切日萨色山申什斯索塔坦特托娃维文乌西希谢亚耶叶依伊以扎佐柴达登蒂戈果海赫华霍吉季津柯理琳玛曼穆纳尼契钦丘桑沙舍泰图瓦万雅卓兹"; +// Set wordSet = new TreeSet(); +// LinkedList wordList = IOUtil.readLineList("data/dictionary/person/nrf.txt"); +// wordList.add(commonChar); +// for (String word : wordList) +// { +// word = word.replaceAll("\\s", ""); +// for (char c : word.toCharArray()) +// { +// wordSet.add(String.valueOf(c)); +// } +// } +// IOUtil.saveCollectionToTxt(wordSet, 
"data/dictionary/person/音译用字.txt"); +// } +// + public void testQuery() throws Exception + { + assertTrue(TranslatedPersonDictionary.containsKey("汤姆")); +// HanLP.Config.enableDebug(); +// System.out.println(TranslatedPersonDictionary.containsKey("汤姆")); +// System.out.println(TranslatedPersonDictionary.containsKey("汤")); +// System.out.println(TranslatedPersonDictionary.containsKey("姆")); +// System.out.println(TranslatedPersonDictionary.containsKey("点")); +// System.out.println(TranslatedPersonDictionary.containsKey("·")); + } +// +// public void testSeg() throws Exception +// { +// HanLP.Config.enableDebug(); +// System.out.println(StandardTokenizer.segment("齐格林斯基")); +// } +// +// public void testNonRec() throws Exception +// { +// HanLP.Config.enableDebug(); +// DijkstraSegment segment = new DijkstraSegment(); +// segment.enableTranslatedNameRecognize(true); +// System.out.println(segment.seg("汤姆和杰克逊")); +// } +// +// public void testHeadNRF() throws Exception +// { +// DijkstraSegment segment = new DijkstraSegment(); +// segment.enableTranslatedNameRecognize(false); +// for (String name : IOUtil.readLineList("data/dictionary/person/nrf.txt")) +// { +// List termList = segment.seg(name); +// if (termList.get(0).nature != Nature.nrf) +// { +// System.out.println(name + " : " + termList); +// } +// } +// } +// +// public void testDot() throws Exception +// { +// char c1 = '·'; +// char c2 = '·'; +// System.out.println(c1 == c2); +// } +// +// public void testMakeDictionary() throws Exception +// { +// Set wordSet = new TreeSet(); +// Pattern pattern = Pattern.compile("^[a-zA-Z]+ *(\\[.*?])? 
*([\\u4E00-\\u9FA5]+) ?[::。]"); +// int found = 0; +// for (String line : IOUtil.readLineList("D:\\Doc\\语料库\\英语姓名词典.txt")) +// { +// Matcher matcher = pattern.matcher(line); +// if (matcher.find()) +// { +// wordSet.add(matcher.group(2)); +// ++found; +// } +// } +// System.out.println("一共找到" + found + "条"); +// IOUtil.saveCollectionToTxt(wordSet, "data/dictionary/person/英语姓名词典.txt"); +// } +// +// public void testRegex() throws Exception +// { +// Pattern pattern = Pattern.compile("^[a-zA-Z]+ (\\[.*?])? ?([\\u4E00-\\u9FA5]+) ?[::。]"); +// String text = "Adey 阿迪:Adam的昵称,英格兰人姓氏 \n" + +// "Adkin 阿德金:Adarn的昵称,英格兰人姓氏。 \n" + +// "Adkins 阿德金斯:取自父名,源自Adkin,含义“阿德金之子”(son of Adkin),英格兰人姓氏 \n" + +// "Adlam [英格兰人姓氏] 阿德拉姆。来源于日耳曼语人名,含义是“高贵的+保护,头盔”(noble+protection,helmet) \n" + +// "Zena [女子名] 齐娜。来源于波斯语,含义是“女人”(woman)。 \n" + +// "Zenas [男子名] 泽纳斯。来源于希腊语,含义是“希腊主神宙斯的礼物”(gift of Zeus,the chief Greek god)。 \n" + +// "Zenia [女子名]齐尼娅:Xeniq的变体。 \n" + +// "Zenobia [女子名] 泽诺比娅。来源于希腊语,含义是“希腊主神宙斯+生命”(the chief Greek god Zeus+life)。 \n" + +// "Zillah [女子名] 齐拉。来源于希伯来语,含义是“荫”(shade)。 \n" + +// "Zoe [女子名]佐伊:来源于希腊语,含义是“生命”(life)。 \n" + +// "Zouch [英格兰人姓氏] 朱什。Such的变体。 "; +// +// Matcher matcher = pattern.matcher(text); +// if (matcher.find()) +// { +// System.out.println(matcher.group(2)); +// } +// } +// +// public void testCombineCharAndName() throws Exception +// { +// TreeSet wordSet = new TreeSet(); +// wordSet.addAll(IOUtil.readLineList("data/dictionary/person/音译用字.txt")); +// wordSet.addAll(IOUtil.readLineList("data/dictionary/person/nrf.txt")); +// IOUtil.saveCollectionToTxt(wordSet, "data/dictionary/person/nrf.txt"); +// } +} diff --git a/src/test/java/com/hankcs/test/corpus/TestNRDcitionaryMaker.java b/src/test/java/com/hankcs/hanlp/corpus/TestNRDcitionaryMaker.java similarity index 95% rename from src/test/java/com/hankcs/test/corpus/TestNRDcitionaryMaker.java rename to src/test/java/com/hankcs/hanlp/corpus/TestNRDcitionaryMaker.java index 5d2ca9ea7..aedf8c1c9 100644 --- 
a/src/test/java/com/hankcs/test/corpus/TestNRDcitionaryMaker.java +++ b/src/test/java/com/hankcs/hanlp/corpus/TestNRDcitionaryMaker.java @@ -1,37 +1,37 @@ -package com.hankcs.test.corpus; - -import java.util.LinkedList; -import java.util.List; - -import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; -import com.hankcs.hanlp.corpus.dictionary.NRDictionaryMaker; -import com.hankcs.hanlp.corpus.document.CorpusLoader; -import com.hankcs.hanlp.corpus.document.Document; -import com.hankcs.hanlp.corpus.document.sentence.word.IWord; -import com.hankcs.hanlp.corpus.document.sentence.word.Word; - -public class TestNRDcitionaryMaker -{ - - public static void main(String[] args) - { - EasyDictionary dictionary = EasyDictionary.create("data/dictionary/2014_dictionary.txt"); - final NRDictionaryMaker nrDictionaryMaker = new NRDictionaryMaker(dictionary); - CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014\\", new CorpusLoader.Handler() - { - @Override - public void handle(Document document) - { - List> simpleSentenceList = document.getSimpleSentenceList(); - List> compatibleList = new LinkedList>(); - for (List wordList : simpleSentenceList) - { - compatibleList.add(new LinkedList(wordList)); - } - nrDictionaryMaker.compute(compatibleList); - } - }); - nrDictionaryMaker.saveTxtTo("D:\\JavaProjects\\HanLP\\data\\test\\person\\nr1"); - } - -} +package com.hankcs.hanlp.corpus; + +import java.util.LinkedList; +import java.util.List; + +import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; +import com.hankcs.hanlp.corpus.dictionary.NRDictionaryMaker; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; + +public class TestNRDcitionaryMaker +{ + + public static void main(String[] args) + { + EasyDictionary dictionary = EasyDictionary.create("data/dictionary/2014_dictionary.txt"); + final 
NRDictionaryMaker nrDictionaryMaker = new NRDictionaryMaker(dictionary); + CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014\\", new CorpusLoader.Handler() + { + @Override + public void handle(Document document) + { + List> simpleSentenceList = document.getSimpleSentenceList(); + List> compatibleList = new LinkedList>(); + for (List wordList : simpleSentenceList) + { + compatibleList.add(new LinkedList(wordList)); + } + nrDictionaryMaker.compute(compatibleList); + } + }); + nrDictionaryMaker.saveTxtTo("D:\\JavaProjects\\HanLP\\data\\test\\person\\nr1"); + } + +} diff --git a/src/test/java/com/hankcs/test/corpus/TestNSDictionaryMaker.java b/src/test/java/com/hankcs/hanlp/corpus/TestNSDictionaryMaker.java similarity index 94% rename from src/test/java/com/hankcs/test/corpus/TestNSDictionaryMaker.java rename to src/test/java/com/hankcs/hanlp/corpus/TestNSDictionaryMaker.java index 76d9f7ad4..f49a642d6 100644 --- a/src/test/java/com/hankcs/test/corpus/TestNSDictionaryMaker.java +++ b/src/test/java/com/hankcs/hanlp/corpus/TestNSDictionaryMaker.java @@ -1,24 +1,24 @@ -package com.hankcs.test.corpus; - -import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; -import com.hankcs.hanlp.corpus.dictionary.NSDictionaryMaker; -import com.hankcs.hanlp.corpus.document.CorpusLoader; -import com.hankcs.hanlp.corpus.document.Document; - -public class TestNSDictionaryMaker { - - public static void main(String[] args) - { - EasyDictionary dictionary = EasyDictionary.create("data/dictionary/2014_dictionary.txt"); - final NSDictionaryMaker nsDictionaryMaker = new NSDictionaryMaker(dictionary); - CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014\\", new CorpusLoader.Handler() - { - @Override - public void handle(Document document) - { - nsDictionaryMaker.compute(document.getComplexSentenceList()); - } - }); - nsDictionaryMaker.saveTxtTo("D:\\JavaProjects\\HanLP\\data\\test\\place\\ns"); - } -} +package com.hankcs.hanlp.corpus; + +import 
com.hankcs.hanlp.corpus.dictionary.EasyDictionary; +import com.hankcs.hanlp.corpus.dictionary.NSDictionaryMaker; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; + +public class TestNSDictionaryMaker { + + public static void main(String[] args) + { + EasyDictionary dictionary = EasyDictionary.create("data/dictionary/2014_dictionary.txt"); + final NSDictionaryMaker nsDictionaryMaker = new NSDictionaryMaker(dictionary); + CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014\\", new CorpusLoader.Handler() + { + @Override + public void handle(Document document) + { + nsDictionaryMaker.compute(document.getComplexSentenceList()); + } + }); + nsDictionaryMaker.saveTxtTo("D:\\JavaProjects\\HanLP\\data\\test\\place\\ns"); + } +} diff --git a/src/test/java/com/hankcs/test/corpus/TestNTDcitionaryMaker.java b/src/test/java/com/hankcs/hanlp/corpus/TestNTDcitionaryMaker.java similarity index 96% rename from src/test/java/com/hankcs/test/corpus/TestNTDcitionaryMaker.java rename to src/test/java/com/hankcs/hanlp/corpus/TestNTDcitionaryMaker.java index dfc2adbf0..fb5187b83 100644 --- a/src/test/java/com/hankcs/test/corpus/TestNTDcitionaryMaker.java +++ b/src/test/java/com/hankcs/hanlp/corpus/TestNTDcitionaryMaker.java @@ -1,4 +1,4 @@ -package com.hankcs.test.corpus; +package com.hankcs.hanlp.corpus; import com.hankcs.hanlp.corpus.dictionary.EasyDictionary; import com.hankcs.hanlp.corpus.dictionary.NTDictionaryMaker; diff --git a/src/test/java/com/hankcs/test/corpus/TestNatureDictionaryMaker.java b/src/test/java/com/hankcs/hanlp/corpus/TestNatureDictionaryMaker.java similarity index 97% rename from src/test/java/com/hankcs/test/corpus/TestNatureDictionaryMaker.java rename to src/test/java/com/hankcs/hanlp/corpus/TestNatureDictionaryMaker.java index 29ff93d5e..09b66f2d8 100644 --- a/src/test/java/com/hankcs/test/corpus/TestNatureDictionaryMaker.java +++ 
b/src/test/java/com/hankcs/hanlp/corpus/TestNatureDictionaryMaker.java @@ -1,4 +1,4 @@ -package com.hankcs.test.corpus; +package com.hankcs.hanlp.corpus; import com.hankcs.hanlp.corpus.dictionary.NatureDictionaryMaker; import com.hankcs.hanlp.corpus.document.CorpusLoader; diff --git a/src/test/java/com/hankcs/hanlp/corpus/TestPinyinGuesser.java b/src/test/java/com/hankcs/hanlp/corpus/TestPinyinGuesser.java new file mode 100644 index 000000000..750e4a7e0 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/TestPinyinGuesser.java @@ -0,0 +1,75 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/11/5 21:26 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. + * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.dictionary.py.Pinyin; +import com.hankcs.hanlp.dictionary.py.PinyinDictionary; +import com.hankcs.hanlp.dictionary.py.PinyinUtil; +import com.hankcs.hanlp.dictionary.py.String2PinyinConverter; +import com.hankcs.hanlp.utility.TextUtility; +import junit.framework.TestCase; + +/** + * @author hankcs + */ +public class TestPinyinGuesser extends TestCase +{ +// public void testGuess() throws Exception +// { +// System.out.println(String2PinyinConverter.convert2Pair("飞流zh下sqianch!", true)); +// } +// +// public void testTextUtil() throws Exception +// { +// System.out.println(TextUtility.isAllLetter(Pinyin.ai1.toString())); +// System.out.println(TextUtility.isAllLetterOrNum(Pinyin.ai1.toString())); +// System.out.println(TextUtility.isAllLetter(Pinyin.ai1.getPinyinWithoutTone())); +// } +// +// public void testGenerateJavaCode() throws Exception +// { +// //case ni2: +// //return Pinyin.ni1; +// Pinyin[] tone2tone1 = new 
Pinyin[]{Pinyin.a5,Pinyin.a5,Pinyin.a5,Pinyin.a5,Pinyin.a5,Pinyin.ai4,Pinyin.ai4,Pinyin.ai4,Pinyin.ai4,Pinyin.an4,Pinyin.an4,Pinyin.an4,Pinyin.an4,Pinyin.ang4,Pinyin.ang4,Pinyin.ang4,Pinyin.ang4,Pinyin.ao4,Pinyin.ao4,Pinyin.ao4,Pinyin.ao4,Pinyin.ba5,Pinyin.ba5,Pinyin.ba5,Pinyin.ba5,Pinyin.ba5,Pinyin.bai4,Pinyin.bai4,Pinyin.bai4,Pinyin.bai4,Pinyin.ban4,Pinyin.ban4,Pinyin.ban4,Pinyin.bang4,Pinyin.bang4,Pinyin.bang4,Pinyin.bao4,Pinyin.bao4,Pinyin.bao4,Pinyin.bao4,Pinyin.bei5,Pinyin.bei5,Pinyin.bei5,Pinyin.bei5,Pinyin.ben4,Pinyin.ben4,Pinyin.ben4,Pinyin.beng4,Pinyin.beng4,Pinyin.beng4,Pinyin.beng4,Pinyin.bi4,Pinyin.bi4,Pinyin.bi4,Pinyin.bi4,Pinyin.bian5,Pinyin.bian5,Pinyin.bian5,Pinyin.bian5,Pinyin.biao4,Pinyin.biao4,Pinyin.biao4,Pinyin.biao4,Pinyin.bie4,Pinyin.bie4,Pinyin.bie4,Pinyin.bie4,Pinyin.bin4,Pinyin.bin4,Pinyin.bin4,Pinyin.bing4,Pinyin.bing4,Pinyin.bing4,Pinyin.bo5,Pinyin.bo5,Pinyin.bo5,Pinyin.bo5,Pinyin.bo5,Pinyin.bu4,Pinyin.bu4,Pinyin.bu4,Pinyin.bu4,Pinyin.ca4,Pinyin.ca4,Pinyin.ca4,Pinyin.cai4,Pinyin.cai4,Pinyin.cai4,Pinyin.cai4,Pinyin.can4,Pinyin.can4,Pinyin.can4,Pinyin.can4,Pinyin.cang4,Pinyin.cang4,Pinyin.cang4,Pinyin.cang4,Pinyin.cao4,Pinyin.cao4,Pinyin.cao4,Pinyin.cao4,Pinyin.ce4,Pinyin.cen2,Pinyin.cen2,Pinyin.ceng4,Pinyin.ceng4,Pinyin.ceng4,Pinyin.cha5,Pinyin.cha5,Pinyin.cha5,Pinyin.cha5,Pinyin.cha5,Pinyin.chai4,Pinyin.chai4,Pinyin.chai4,Pinyin.chai4,Pinyin.chan4,Pinyin.chan4,Pinyin.chan4,Pinyin.chan4,Pinyin.chang5,Pinyin.chang5,Pinyin.chang5,Pinyin.chang5,Pinyin.chang5,Pinyin.chao4,Pinyin.chao4,Pinyin.chao4,Pinyin.chao4,Pinyin.che4,Pinyin.che4,Pinyin.che4,Pinyin.chen5,Pinyin.chen5,Pinyin.chen5,Pinyin.chen5,Pinyin.chen5,Pinyin.cheng4,Pinyin.cheng4,Pinyin.cheng4,Pinyin.cheng4,Pinyin.chi5,Pinyin.chi5,Pinyin.chi5,Pinyin.chi5,Pinyin.chi5,Pinyin.chong4,Pinyin.chong4,Pinyin.chong4,Pinyin.chong4,Pinyin.chou5,Pinyin.chou5,Pinyin.chou5,Pinyin.chou5,Pinyin.chou5,Pinyin.chu5,Pinyin.chu5,Pinyin.chu5,Pinyin.chu5,Pinyin.chu5,Pinyin.chua1,Pinyin.chuai4,Pinyin.chuai4,P
inyin.chuai4,Pinyin.chuai4,Pinyin.chuan4,Pinyin.chuan4,Pinyin.chuan4,Pinyin.chuan4,Pinyin.chuang4,Pinyin.chuang4,Pinyin.chuang4,Pinyin.chuang4,Pinyin.chui4,Pinyin.chui4,Pinyin.chui4,Pinyin.chun3,Pinyin.chun3,Pinyin.chun3,Pinyin.chuo5,Pinyin.chuo5,Pinyin.chuo5,Pinyin.chuo5,Pinyin.ci4,Pinyin.ci4,Pinyin.ci4,Pinyin.ci4,Pinyin.cong4,Pinyin.cong4,Pinyin.cong4,Pinyin.cou4,Pinyin.cou4,Pinyin.cu4,Pinyin.cu4,Pinyin.cu4,Pinyin.cu4,Pinyin.cuan4,Pinyin.cuan4,Pinyin.cuan4,Pinyin.cui4,Pinyin.cui4,Pinyin.cui4,Pinyin.cui4,Pinyin.cun4,Pinyin.cun4,Pinyin.cun4,Pinyin.cun4,Pinyin.cuo4,Pinyin.cuo4,Pinyin.cuo4,Pinyin.cuo4,Pinyin.da5,Pinyin.da5,Pinyin.da5,Pinyin.da5,Pinyin.da5,Pinyin.dai4,Pinyin.dai4,Pinyin.dai4,Pinyin.dan4,Pinyin.dan4,Pinyin.dan4,Pinyin.dan4,Pinyin.dang4,Pinyin.dang4,Pinyin.dang4,Pinyin.dao4,Pinyin.dao4,Pinyin.dao4,Pinyin.dao4,Pinyin.de5,Pinyin.de5,Pinyin.de5,Pinyin.dei3,Pinyin.dei3,Pinyin.den4,Pinyin.den4,Pinyin.deng4,Pinyin.deng4,Pinyin.deng4,Pinyin.di4,Pinyin.di4,Pinyin.di4,Pinyin.di4,Pinyin.dia3,Pinyin.dian4,Pinyin.dian4,Pinyin.dian4,Pinyin.diao4,Pinyin.diao4,Pinyin.diao4,Pinyin.die4,Pinyin.die4,Pinyin.die4,Pinyin.ding4,Pinyin.ding4,Pinyin.ding4,Pinyin.ding4,Pinyin.diu1,Pinyin.dong4,Pinyin.dong4,Pinyin.dong4,Pinyin.dou4,Pinyin.dou4,Pinyin.dou4,Pinyin.dou4,Pinyin.du4,Pinyin.du4,Pinyin.du4,Pinyin.du4,Pinyin.duan4,Pinyin.duan4,Pinyin.duan4,Pinyin.dui4,Pinyin.dui4,Pinyin.dui4,Pinyin.dun4,Pinyin.dun4,Pinyin.dun4,Pinyin.dun4,Pinyin.duo5,Pinyin.duo5,Pinyin.duo5,Pinyin.duo5,Pinyin.duo5,Pinyin.e4,Pinyin.e4,Pinyin.e4,Pinyin.e4,Pinyin.ei4,Pinyin.ei4,Pinyin.ei4,Pinyin.ei4,Pinyin.en4,Pinyin.en4,Pinyin.en4,Pinyin.eng1,Pinyin.er5,Pinyin.er5,Pinyin.er5,Pinyin.er5,Pinyin.fa4,Pinyin.fa4,Pinyin.fa4,Pinyin.fa4,Pinyin.fan4,Pinyin.fan4,Pinyin.fan4,Pinyin.fan4,Pinyin.fang5,Pinyin.fang5,Pinyin.fang5,Pinyin.fang5,Pinyin.fang5,Pinyin.fei4,Pinyin.fei4,Pinyin.fei4,Pinyin.fei4,Pinyin.fen4,Pinyin.fen4,Pinyin.fen4,Pinyin.fen4,Pinyin.feng4,Pinyin.feng4,Pinyin.feng4,Pinyin.feng4,Pinyin.fiao4,Pinyin.f
o2,Pinyin.fou4,Pinyin.fou4,Pinyin.fou4,Pinyin.fou4,Pinyin.fu5,Pinyin.fu5,Pinyin.fu5,Pinyin.fu5,Pinyin.fu5,Pinyin.ga4,Pinyin.ga4,Pinyin.ga4,Pinyin.ga4,Pinyin.gai4,Pinyin.gai4,Pinyin.gai4,Pinyin.gan4,Pinyin.gan4,Pinyin.gan4,Pinyin.gan4,Pinyin.gang4,Pinyin.gang4,Pinyin.gang4,Pinyin.gao4,Pinyin.gao4,Pinyin.gao4,Pinyin.ge4,Pinyin.ge4,Pinyin.ge4,Pinyin.ge4,Pinyin.gei3,Pinyin.gen4,Pinyin.gen4,Pinyin.gen4,Pinyin.gen4,Pinyin.geng4,Pinyin.geng4,Pinyin.geng4,Pinyin.gong4,Pinyin.gong4,Pinyin.gong4,Pinyin.gou4,Pinyin.gou4,Pinyin.gou4,Pinyin.gu4,Pinyin.gu4,Pinyin.gu4,Pinyin.gu4,Pinyin.gua4,Pinyin.gua4,Pinyin.gua4,Pinyin.guai4,Pinyin.guai4,Pinyin.guai4,Pinyin.guai4,Pinyin.guan4,Pinyin.guan4,Pinyin.guan4,Pinyin.guang4,Pinyin.guang4,Pinyin.guang4,Pinyin.gui4,Pinyin.gui4,Pinyin.gui4,Pinyin.gui4,Pinyin.gun4,Pinyin.gun4,Pinyin.gun4,Pinyin.guo5,Pinyin.guo5,Pinyin.guo5,Pinyin.guo5,Pinyin.guo5,Pinyin.ha4,Pinyin.ha4,Pinyin.ha4,Pinyin.ha4,Pinyin.hai4,Pinyin.hai4,Pinyin.hai4,Pinyin.hai4,Pinyin.han5,Pinyin.han5,Pinyin.han5,Pinyin.han5,Pinyin.han5,Pinyin.hang4,Pinyin.hang4,Pinyin.hang4,Pinyin.hang4,Pinyin.hao4,Pinyin.hao4,Pinyin.hao4,Pinyin.hao4,Pinyin.he4,Pinyin.he4,Pinyin.he4,Pinyin.hei1,Pinyin.hen4,Pinyin.hen4,Pinyin.hen4,Pinyin.hen4,Pinyin.heng4,Pinyin.heng4,Pinyin.heng4,Pinyin.hong4,Pinyin.hong4,Pinyin.hong4,Pinyin.hong4,Pinyin.hou4,Pinyin.hou4,Pinyin.hou4,Pinyin.hou4,Pinyin.hu4,Pinyin.hu4,Pinyin.hu4,Pinyin.hu4,Pinyin.hua4,Pinyin.hua4,Pinyin.hua4,Pinyin.hua4,Pinyin.huai4,Pinyin.huai4,Pinyin.huai4,Pinyin.huan4,Pinyin.huan4,Pinyin.huan4,Pinyin.huan4,Pinyin.huang4,Pinyin.huang4,Pinyin.huang4,Pinyin.huang4,Pinyin.hui4,Pinyin.hui4,Pinyin.hui4,Pinyin.hui4,Pinyin.hun4,Pinyin.hun4,Pinyin.hun4,Pinyin.hun4,Pinyin.huo5,Pinyin.huo5,Pinyin.huo5,Pinyin.huo5,Pinyin.huo5,Pinyin.ja4,Pinyin.ji5,Pinyin.ji5,Pinyin.ji5,Pinyin.ji5,Pinyin.ji5,Pinyin.jia5,Pinyin.jia5,Pinyin.jia5,Pinyin.jia5,Pinyin.jia5,Pinyin.jian4,Pinyin.jian4,Pinyin.jian4,Pinyin.jiang4,Pinyin.jiang4,Pinyin.jiang4,Pinyin.jiao4,Pinyin.jiao4,Piny
in.jiao4,Pinyin.jiao4,Pinyin.jie5,Pinyin.jie5,Pinyin.jie5,Pinyin.jie5,Pinyin.jie5,Pinyin.jin4,Pinyin.jin4,Pinyin.jin4,Pinyin.jing4,Pinyin.jing4,Pinyin.jing4,Pinyin.jiong4,Pinyin.jiong4,Pinyin.jiong4,Pinyin.jiu4,Pinyin.jiu4,Pinyin.jiu4,Pinyin.ju5,Pinyin.ju5,Pinyin.ju5,Pinyin.ju5,Pinyin.ju5,Pinyin.juan4,Pinyin.juan4,Pinyin.juan4,Pinyin.juan4,Pinyin.jue4,Pinyin.jue4,Pinyin.jue4,Pinyin.jue4,Pinyin.jun4,Pinyin.jun4,Pinyin.jun4,Pinyin.ka4,Pinyin.ka4,Pinyin.ka4,Pinyin.kai4,Pinyin.kai4,Pinyin.kai4,Pinyin.kan4,Pinyin.kan4,Pinyin.kan4,Pinyin.kang4,Pinyin.kang4,Pinyin.kang4,Pinyin.kang4,Pinyin.kao4,Pinyin.kao4,Pinyin.kao4,Pinyin.kao4,Pinyin.ke5,Pinyin.ke5,Pinyin.ke5,Pinyin.ke5,Pinyin.ke5,Pinyin.kei1,Pinyin.ken4,Pinyin.ken4,Pinyin.keng3,Pinyin.keng3,Pinyin.kong4,Pinyin.kong4,Pinyin.kong4,Pinyin.kou4,Pinyin.kou4,Pinyin.kou4,Pinyin.ku4,Pinyin.ku4,Pinyin.ku4,Pinyin.kua4,Pinyin.kua4,Pinyin.kua4,Pinyin.kuai4,Pinyin.kuai4,Pinyin.kuai4,Pinyin.kuan3,Pinyin.kuan3,Pinyin.kuang4,Pinyin.kuang4,Pinyin.kuang4,Pinyin.kuang4,Pinyin.kui4,Pinyin.kui4,Pinyin.kui4,Pinyin.kui4,Pinyin.kun4,Pinyin.kun4,Pinyin.kun4,Pinyin.kuo4,Pinyin.kuo4,Pinyin.la5,Pinyin.la5,Pinyin.la5,Pinyin.la5,Pinyin.la5,Pinyin.lai4,Pinyin.lai4,Pinyin.lai4,Pinyin.lan5,Pinyin.lan5,Pinyin.lan5,Pinyin.lan5,Pinyin.lang4,Pinyin.lang4,Pinyin.lang4,Pinyin.lang4,Pinyin.lao4,Pinyin.lao4,Pinyin.lao4,Pinyin.lao4,Pinyin.le5,Pinyin.le5,Pinyin.le5,Pinyin.lei5,Pinyin.lei5,Pinyin.lei5,Pinyin.lei5,Pinyin.lei5,Pinyin.leng4,Pinyin.leng4,Pinyin.leng4,Pinyin.leng4,Pinyin.li5,Pinyin.li5,Pinyin.li5,Pinyin.li5,Pinyin.li5,Pinyin.lia3,Pinyin.lian4,Pinyin.lian4,Pinyin.lian4,Pinyin.lian4,Pinyin.liang5,Pinyin.liang5,Pinyin.liang5,Pinyin.liang5,Pinyin.liao4,Pinyin.liao4,Pinyin.liao4,Pinyin.liao4,Pinyin.lie5,Pinyin.lie5,Pinyin.lie5,Pinyin.lie5,Pinyin.lie5,Pinyin.lin4,Pinyin.lin4,Pinyin.lin4,Pinyin.lin4,Pinyin.ling4,Pinyin.ling4,Pinyin.ling4,Pinyin.ling4,Pinyin.liu4,Pinyin.liu4,Pinyin.liu4,Pinyin.liu4,Pinyin.lo5,Pinyin.long4,Pinyin.long4,Pinyin.long4,Pinyin.lon
g4,Pinyin.lou5,Pinyin.lou5,Pinyin.lou5,Pinyin.lou5,Pinyin.lou5,Pinyin.lu5,Pinyin.lu5,Pinyin.lu5,Pinyin.lu5,Pinyin.lu5,Pinyin.luan4,Pinyin.luan4,Pinyin.luan4,Pinyin.lun4,Pinyin.lun4,Pinyin.lun4,Pinyin.lun4,Pinyin.luo5,Pinyin.luo5,Pinyin.luo5,Pinyin.luo5,Pinyin.luo5,Pinyin.lv4,Pinyin.lv4,Pinyin.lv4,Pinyin.lve4,Pinyin.lve4,Pinyin.ma5,Pinyin.ma5,Pinyin.ma5,Pinyin.ma5,Pinyin.ma5,Pinyin.mai4,Pinyin.mai4,Pinyin.mai4,Pinyin.man4,Pinyin.man4,Pinyin.man4,Pinyin.man4,Pinyin.mang3,Pinyin.mang3,Pinyin.mang3,Pinyin.mao4,Pinyin.mao4,Pinyin.mao4,Pinyin.mao4,Pinyin.me5,Pinyin.me5,Pinyin.me5,Pinyin.mei4,Pinyin.mei4,Pinyin.mei4,Pinyin.men5,Pinyin.men5,Pinyin.men5,Pinyin.men5,Pinyin.men5,Pinyin.meng4,Pinyin.meng4,Pinyin.meng4,Pinyin.meng4,Pinyin.mi4,Pinyin.mi4,Pinyin.mi4,Pinyin.mi4,Pinyin.mian4,Pinyin.mian4,Pinyin.mian4,Pinyin.miao4,Pinyin.miao4,Pinyin.miao4,Pinyin.miao4,Pinyin.mie4,Pinyin.mie4,Pinyin.min3,Pinyin.min3,Pinyin.ming4,Pinyin.ming4,Pinyin.ming4,Pinyin.miu4,Pinyin.mo5,Pinyin.mo5,Pinyin.mo5,Pinyin.mo5,Pinyin.mo5,Pinyin.mou4,Pinyin.mou4,Pinyin.mou4,Pinyin.mou4,Pinyin.mu4,Pinyin.mu4,Pinyin.mu4,Pinyin.na5,Pinyin.na5,Pinyin.na5,Pinyin.na5,Pinyin.na5,Pinyin.nai4,Pinyin.nai4,Pinyin.nai4,Pinyin.nan4,Pinyin.nan4,Pinyin.nan4,Pinyin.nan4,Pinyin.nang4,Pinyin.nang4,Pinyin.nang4,Pinyin.nang4,Pinyin.nao4,Pinyin.nao4,Pinyin.nao4,Pinyin.nao4,Pinyin.ne5,Pinyin.ne5,Pinyin.ne5,Pinyin.nei4,Pinyin.nei4,Pinyin.nen4,Pinyin.nen4,Pinyin.nen4,Pinyin.neng3,Pinyin.neng3,Pinyin.ni4,Pinyin.ni4,Pinyin.ni4,Pinyin.ni4,Pinyin.nian4,Pinyin.nian4,Pinyin.nian4,Pinyin.nian4,Pinyin.niang4,Pinyin.niang4,Pinyin.niao4,Pinyin.niao4,Pinyin.nie4,Pinyin.nie4,Pinyin.nie4,Pinyin.nie4,Pinyin.nin3,Pinyin.nin3,Pinyin.ning4,Pinyin.ning4,Pinyin.ning4,Pinyin.niu4,Pinyin.niu4,Pinyin.niu4,Pinyin.niu4,Pinyin.nong4,Pinyin.nong4,Pinyin.nong4,Pinyin.nou4,Pinyin.nou4,Pinyin.nu4,Pinyin.nu4,Pinyin.nu4,Pinyin.nuan4,Pinyin.nuan4,Pinyin.nuan4,Pinyin.nun4,Pinyin.nun4,Pinyin.nuo4,Pinyin.nuo4,Pinyin.nuo4,Pinyin.nv4,Pinyin.nv4,Pinyin.nve4,Pinyi
n.o5,Pinyin.o5,Pinyin.o5,Pinyin.o5,Pinyin.o5,Pinyin.ou5,Pinyin.ou5,Pinyin.ou5,Pinyin.ou5,Pinyin.ou5,Pinyin.pa4,Pinyin.pa4,Pinyin.pa4,Pinyin.pai4,Pinyin.pai4,Pinyin.pai4,Pinyin.pai4,Pinyin.pan4,Pinyin.pan4,Pinyin.pan4,Pinyin.pan4,Pinyin.pang5,Pinyin.pang5,Pinyin.pang5,Pinyin.pang5,Pinyin.pang5,Pinyin.pao4,Pinyin.pao4,Pinyin.pao4,Pinyin.pao4,Pinyin.pei4,Pinyin.pei4,Pinyin.pei4,Pinyin.pei4,Pinyin.pen5,Pinyin.pen5,Pinyin.pen5,Pinyin.pen5,Pinyin.pen5,Pinyin.peng4,Pinyin.peng4,Pinyin.peng4,Pinyin.peng4,Pinyin.pi5,Pinyin.pi5,Pinyin.pi5,Pinyin.pi5,Pinyin.pi5,Pinyin.pian4,Pinyin.pian4,Pinyin.pian4,Pinyin.pian4,Pinyin.piao4,Pinyin.piao4,Pinyin.piao4,Pinyin.piao4,Pinyin.pie4,Pinyin.pie4,Pinyin.pie4,Pinyin.pin4,Pinyin.pin4,Pinyin.pin4,Pinyin.pin4,Pinyin.ping4,Pinyin.ping4,Pinyin.ping4,Pinyin.ping4,Pinyin.po5,Pinyin.po5,Pinyin.po5,Pinyin.po5,Pinyin.po5,Pinyin.pou4,Pinyin.pou4,Pinyin.pou4,Pinyin.pou4,Pinyin.pu4,Pinyin.pu4,Pinyin.pu4,Pinyin.pu4,Pinyin.qi5,Pinyin.qi5,Pinyin.qi5,Pinyin.qi5,Pinyin.qi5,Pinyin.qia4,Pinyin.qia4,Pinyin.qia4,Pinyin.qian5,Pinyin.qian5,Pinyin.qian5,Pinyin.qian5,Pinyin.qian5,Pinyin.qiang4,Pinyin.qiang4,Pinyin.qiang4,Pinyin.qiang4,Pinyin.qiao4,Pinyin.qiao4,Pinyin.qiao4,Pinyin.qiao4,Pinyin.qie4,Pinyin.qie4,Pinyin.qie4,Pinyin.qie4,Pinyin.qin4,Pinyin.qin4,Pinyin.qin4,Pinyin.qin4,Pinyin.qing4,Pinyin.qing4,Pinyin.qing4,Pinyin.qing4,Pinyin.qiong3,Pinyin.qiong3,Pinyin.qiong3,Pinyin.qiu4,Pinyin.qiu4,Pinyin.qiu4,Pinyin.qiu4,Pinyin.qu4,Pinyin.qu4,Pinyin.qu4,Pinyin.qu4,Pinyin.quan4,Pinyin.quan4,Pinyin.quan4,Pinyin.quan4,Pinyin.que4,Pinyin.que4,Pinyin.que4,Pinyin.qun3,Pinyin.qun3,Pinyin.qun3,Pinyin.ran3,Pinyin.ran3,Pinyin.rang4,Pinyin.rang4,Pinyin.rang4,Pinyin.rang4,Pinyin.rao4,Pinyin.rao4,Pinyin.rao4,Pinyin.re4,Pinyin.re4,Pinyin.re4,Pinyin.ren4,Pinyin.ren4,Pinyin.ren4,Pinyin.reng4,Pinyin.reng4,Pinyin.reng4,Pinyin.ri4,Pinyin.rong4,Pinyin.rong4,Pinyin.rong4,Pinyin.rou4,Pinyin.rou4,Pinyin.rou4,Pinyin.ru4,Pinyin.ru4,Pinyin.ru4,Pinyin.ru4,Pinyin.ruan4,Pinyin.ruan4,Pinyin.rua
n4,Pinyin.rui4,Pinyin.rui4,Pinyin.rui4,Pinyin.run4,Pinyin.run4,Pinyin.ruo4,Pinyin.ruo4,Pinyin.sa4,Pinyin.sa4,Pinyin.sa4,Pinyin.sai5,Pinyin.sai5,Pinyin.sai5,Pinyin.sai5,Pinyin.san5,Pinyin.san5,Pinyin.san5,Pinyin.san5,Pinyin.sang5,Pinyin.sang5,Pinyin.sang5,Pinyin.sang5,Pinyin.sao4,Pinyin.sao4,Pinyin.sao4,Pinyin.se4,Pinyin.se4,Pinyin.sen3,Pinyin.sen3,Pinyin.seng1,Pinyin.sha4,Pinyin.sha4,Pinyin.sha4,Pinyin.sha4,Pinyin.shai4,Pinyin.shai4,Pinyin.shai4,Pinyin.shan4,Pinyin.shan4,Pinyin.shan4,Pinyin.shan4,Pinyin.shang5,Pinyin.shang5,Pinyin.shang5,Pinyin.shang5,Pinyin.shang5,Pinyin.shao4,Pinyin.shao4,Pinyin.shao4,Pinyin.shao4,Pinyin.she4,Pinyin.she4,Pinyin.she4,Pinyin.she4,Pinyin.shei2,Pinyin.shen4,Pinyin.shen4,Pinyin.shen4,Pinyin.shen4,Pinyin.sheng4,Pinyin.sheng4,Pinyin.sheng4,Pinyin.sheng4,Pinyin.shi5,Pinyin.shi5,Pinyin.shi5,Pinyin.shi5,Pinyin.shi5,Pinyin.shou4,Pinyin.shou4,Pinyin.shou4,Pinyin.shou4,Pinyin.shu4,Pinyin.shu4,Pinyin.shu4,Pinyin.shu4,Pinyin.shua4,Pinyin.shua4,Pinyin.shua4,Pinyin.shuai4,Pinyin.shuai4,Pinyin.shuai4,Pinyin.shuan4,Pinyin.shuan4,Pinyin.shuang4,Pinyin.shuang4,Pinyin.shuang4,Pinyin.shui4,Pinyin.shui4,Pinyin.shui4,Pinyin.shun4,Pinyin.shun4,Pinyin.shun4,Pinyin.shuo4,Pinyin.shuo4,Pinyin.shuo4,Pinyin.si4,Pinyin.si4,Pinyin.si4,Pinyin.song4,Pinyin.song4,Pinyin.song4,Pinyin.sou4,Pinyin.sou4,Pinyin.sou4,Pinyin.su4,Pinyin.su4,Pinyin.su4,Pinyin.suan4,Pinyin.suan4,Pinyin.suan4,Pinyin.sui4,Pinyin.sui4,Pinyin.sui4,Pinyin.sui4,Pinyin.sun4,Pinyin.sun4,Pinyin.sun4,Pinyin.suo4,Pinyin.suo4,Pinyin.suo4,Pinyin.ta5,Pinyin.ta5,Pinyin.ta5,Pinyin.ta5,Pinyin.tai4,Pinyin.tai4,Pinyin.tai4,Pinyin.tai4,Pinyin.tan4,Pinyin.tan4,Pinyin.tan4,Pinyin.tan4,Pinyin.tang4,Pinyin.tang4,Pinyin.tang4,Pinyin.tang4,Pinyin.tao4,Pinyin.tao4,Pinyin.tao4,Pinyin.tao4,Pinyin.te4,Pinyin.teng4,Pinyin.teng4,Pinyin.teng4,Pinyin.ti4,Pinyin.ti4,Pinyin.ti4,Pinyin.ti4,Pinyin.tian5,Pinyin.tian5,Pinyin.tian5,Pinyin.tian5,Pinyin.tian5,Pinyin.tiao4,Pinyin.tiao4,Pinyin.tiao4,Pinyin.tiao4,Pinyin.tie4,Pinyin.tie4,P
inyin.tie4,Pinyin.tie4,Pinyin.ting4,Pinyin.ting4,Pinyin.ting4,Pinyin.ting4,Pinyin.tong4,Pinyin.tong4,Pinyin.tong4,Pinyin.tong4,Pinyin.tou5,Pinyin.tou5,Pinyin.tou5,Pinyin.tou5,Pinyin.tou5,Pinyin.tu5,Pinyin.tu5,Pinyin.tu5,Pinyin.tu5,Pinyin.tu5,Pinyin.tuan4,Pinyin.tuan4,Pinyin.tuan4,Pinyin.tuan4,Pinyin.tui4,Pinyin.tui4,Pinyin.tui4,Pinyin.tui4,Pinyin.tun5,Pinyin.tun5,Pinyin.tun5,Pinyin.tun5,Pinyin.tun5,Pinyin.tuo4,Pinyin.tuo4,Pinyin.tuo4,Pinyin.tuo4,Pinyin.wa5,Pinyin.wa5,Pinyin.wa5,Pinyin.wa5,Pinyin.wa5,Pinyin.wai4,Pinyin.wai4,Pinyin.wai4,Pinyin.wan4,Pinyin.wan4,Pinyin.wan4,Pinyin.wan4,Pinyin.wang4,Pinyin.wang4,Pinyin.wang4,Pinyin.wang4,Pinyin.wei4,Pinyin.wei4,Pinyin.wei4,Pinyin.wei4,Pinyin.wen4,Pinyin.wen4,Pinyin.wen4,Pinyin.wen4,Pinyin.weng4,Pinyin.weng4,Pinyin.weng4,Pinyin.wo4,Pinyin.wo4,Pinyin.wo4,Pinyin.wu4,Pinyin.wu4,Pinyin.wu4,Pinyin.wu4,Pinyin.xi4,Pinyin.xi4,Pinyin.xi4,Pinyin.xi4,Pinyin.xia4,Pinyin.xia4,Pinyin.xia4,Pinyin.xia4,Pinyin.xian4,Pinyin.xian4,Pinyin.xian4,Pinyin.xian4,Pinyin.xiang4,Pinyin.xiang4,Pinyin.xiang4,Pinyin.xiang4,Pinyin.xiao4,Pinyin.xiao4,Pinyin.xiao4,Pinyin.xiao4,Pinyin.xie4,Pinyin.xie4,Pinyin.xie4,Pinyin.xie4,Pinyin.xin4,Pinyin.xin4,Pinyin.xin4,Pinyin.xin4,Pinyin.xing4,Pinyin.xing4,Pinyin.xing4,Pinyin.xing4,Pinyin.xiong4,Pinyin.xiong4,Pinyin.xiong4,Pinyin.xiong4,Pinyin.xiu4,Pinyin.xiu4,Pinyin.xiu4,Pinyin.xiu4,Pinyin.xu5,Pinyin.xu5,Pinyin.xu5,Pinyin.xu5,Pinyin.xu5,Pinyin.xuan4,Pinyin.xuan4,Pinyin.xuan4,Pinyin.xuan4,Pinyin.xue4,Pinyin.xue4,Pinyin.xue4,Pinyin.xue4,Pinyin.xun4,Pinyin.xun4,Pinyin.xun4,Pinyin.ya5,Pinyin.ya5,Pinyin.ya5,Pinyin.ya5,Pinyin.ya5,Pinyin.yai2,Pinyin.yan4,Pinyin.yan4,Pinyin.yan4,Pinyin.yan4,Pinyin.yang4,Pinyin.yang4,Pinyin.yang4,Pinyin.yang4,Pinyin.yao4,Pinyin.yao4,Pinyin.yao4,Pinyin.yao4,Pinyin.ye5,Pinyin.ye5,Pinyin.ye5,Pinyin.ye5,Pinyin.ye5,Pinyin.yi5,Pinyin.yi5,Pinyin.yi5,Pinyin.yi5,Pinyin.yi5,Pinyin.yin4,Pinyin.yin4,Pinyin.yin4,Pinyin.yin4,Pinyin.ying4,Pinyin.ying4,Pinyin.ying4,Pinyin.ying4,Pinyin.yo5,Pinyin.yo5,Pinyi
n.yong4,Pinyin.yong4,Pinyin.yong4,Pinyin.yong4,Pinyin.you4,Pinyin.you4,Pinyin.you4,Pinyin.you4,Pinyin.yu4,Pinyin.yu4,Pinyin.yu4,Pinyin.yu4,Pinyin.yuan4,Pinyin.yuan4,Pinyin.yuan4,Pinyin.yuan4,Pinyin.yue4,Pinyin.yue4,Pinyin.yue4,Pinyin.yun4,Pinyin.yun4,Pinyin.yun4,Pinyin.yun4,Pinyin.za3,Pinyin.za3,Pinyin.za3,Pinyin.zai4,Pinyin.zai4,Pinyin.zai4,Pinyin.zan4,Pinyin.zan4,Pinyin.zan4,Pinyin.zan4,Pinyin.zang4,Pinyin.zang4,Pinyin.zang4,Pinyin.zang4,Pinyin.zao4,Pinyin.zao4,Pinyin.zao4,Pinyin.zao4,Pinyin.ze4,Pinyin.ze4,Pinyin.zei2,Pinyin.zen4,Pinyin.zen4,Pinyin.zen4,Pinyin.zeng4,Pinyin.zeng4,Pinyin.zha4,Pinyin.zha4,Pinyin.zha4,Pinyin.zha4,Pinyin.zhai4,Pinyin.zhai4,Pinyin.zhai4,Pinyin.zhai4,Pinyin.zhan4,Pinyin.zhan4,Pinyin.zhan4,Pinyin.zhan4,Pinyin.zhang4,Pinyin.zhang4,Pinyin.zhang4,Pinyin.zhao4,Pinyin.zhao4,Pinyin.zhao4,Pinyin.zhao4,Pinyin.zhe5,Pinyin.zhe5,Pinyin.zhe5,Pinyin.zhe5,Pinyin.zhe5,Pinyin.zhei4,Pinyin.zhen4,Pinyin.zhen4,Pinyin.zhen4,Pinyin.zheng4,Pinyin.zheng4,Pinyin.zheng4,Pinyin.zhi4,Pinyin.zhi4,Pinyin.zhi4,Pinyin.zhi4,Pinyin.zhong4,Pinyin.zhong4,Pinyin.zhong4,Pinyin.zhou4,Pinyin.zhou4,Pinyin.zhou4,Pinyin.zhou4,Pinyin.zhu4,Pinyin.zhu4,Pinyin.zhu4,Pinyin.zhu4,Pinyin.zhua3,Pinyin.zhua3,Pinyin.zhuai4,Pinyin.zhuai4,Pinyin.zhuai4,Pinyin.zhuan4,Pinyin.zhuan4,Pinyin.zhuan4,Pinyin.zhuang4,Pinyin.zhuang4,Pinyin.zhuang4,Pinyin.zhui4,Pinyin.zhui4,Pinyin.zhui4,Pinyin.zhun4,Pinyin.zhun4,Pinyin.zhun4,Pinyin.zhuo4,Pinyin.zhuo4,Pinyin.zhuo4,Pinyin.zhuo4,Pinyin.zi5,Pinyin.zi5,Pinyin.zi5,Pinyin.zi5,Pinyin.zi5,Pinyin.zong4,Pinyin.zong4,Pinyin.zong4,Pinyin.zou4,Pinyin.zou4,Pinyin.zou4,Pinyin.zu4,Pinyin.zu4,Pinyin.zu4,Pinyin.zu4,Pinyin.zuan4,Pinyin.zuan4,Pinyin.zuan4,Pinyin.zui4,Pinyin.zui4,Pinyin.zui4,Pinyin.zun4,Pinyin.zun4,Pinyin.zun4,Pinyin.zuo5,Pinyin.zuo5,Pinyin.zuo5,Pinyin.zuo5,Pinyin.zuo5,Pinyin.none5,}; +// +// for (Pinyin pinyin : PinyinDictionary.pinyins) +// { +// System.out.printf("Pinyin.%s,", convert(pinyin)); +// assertEquals(convert(pinyin), 
tone2tone1[pinyin.ordinal()]); +// } +// } +// +// public void testPinyin() throws Exception +// { +// System.out.println(HanLP.convertToPinyinString("截至2012年,", " ", true)); +// System.out.println(HanLP.convertToPinyinString("截至2012年,", " ", false)); +// } +// +// private Pinyin convert(Pinyin p) +// { +// String withoutTone = p.getPinyinWithoutTone(); +// for (int i = 5; i >= 0; --i) +// { +// try +// { +// return Pinyin.valueOf(withoutTone + i); +// } +// catch (Exception e) +// { +// // do nothing +// } +// } +// +// return null; +// } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/ZZGenerateNature.java b/src/test/java/com/hankcs/hanlp/corpus/ZZGenerateNature.java new file mode 100644 index 000000000..b5f63ba68 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/ZZGenerateNature.java @@ -0,0 +1,171 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/9/9 3:35 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. 
+ * + */ +package com.hankcs.hanlp.corpus; + +import com.hankcs.hanlp.corpus.tag.Nature; +import junit.framework.TestCase; + +/** + * @author hankcs + */ +public class ZZGenerateNature extends TestCase +{ +// public void testGenerate() throws Exception +// { +// String text = "n 名词\n" + +// "nr 人名\n" + +// "nrj 日语人名\n" + +// "nrf 音译人名\n" + +// "ns 地名\n" + +// "nsf 音译地名\n" + +// "nt 机构团体名\n" + +// "\t\tntc 公司名\n" + +// "\t\t\tntcf 工厂\n" + +// "\t\t\tntcb 银行\n" + +// "\t\t\tntch 酒店宾馆\n" + +// "\t\tnto 政府机构\n" + +// "\t\tntu 大学\n" + +// "\t\tnts 中小学\n" + +// "\t\tnth 医院\n" + +// "nh 医药疾病等健康相关名词\n" + +// "\t\tnhm 药品\n" + +// "\t\tnhd 疾病\n" + +// "nn 工作相关名词\n" + +// "nnt 职务职称\n" + +// "nnd 职业\n" + +// "ng 名词性语素\n" + +// "ni 机构相关(不是独立机构名)\n" + +// "\tnic 下属机构\n" + +// "\tnis 机构后缀\n" + +// "nm 物品名\n" + +// "\tnmc 化学品名\n" + +// "nb 生物名\n" + +// "\tnba 动物名\n" + +// "\tnbp 植物名\n" + +// "nz 其他专名\n" + +// "g 学术词汇\n" + +// "\tgm 数学相关词汇\n" + +// "\tgp 物理相关词汇\n" + +// "\tgc 化学相关词汇\n" + +// "\tgb 生物相关词汇\n" + +// "\t\tgbc 生物类别\n" + +// "\tgg 地理地质相关词汇\n" + +// "\tgi 计算机相关词汇\n" + +// "j 简称略语\n" + +// "i 成语\n" + +// "l 习用语\n" + +// "t 时间词\n" + +// "tg 时间词性语素\n" + +// "s 处所词\n" + +// "f 方位词\n" + +// "v 动词\n" + +// "vd 副动词\n" + +// "vn 名动词\n" + +// "vshi 动词“是”\n" + +// "vyou 动词“有”\n" + +// "vf 趋向动词\n" + +// "vx 形式动词\n" + +// "vi 不及物动词(内动词)\n" + +// "vl 动词性惯用语\n" + +// "vg 动词性语素\n" + +// "a 形容词\n" + +// "ad 副形词\n" + +// "an 名形词\n" + +// "ag 形容词性语素\n" + +// "al 形容词性惯用语\n" + +// "b 区别词\n" + +// "bl 区别词性惯用语\n" + +// "z 状态词\n" + +// "r 代词\n" + +// "rr 人称代词\n" + +// "rz 指示代词\n" + +// "rzt 时间指示代词\n" + +// "rzs 处所指示代词\n" + +// "rzv 谓词性指示代词\n" + +// "ry 疑问代词\n" + +// "ryt 时间疑问代词\n" + +// "rys 处所疑问代词\n" + +// "ryv 谓词性疑问代词\n" + +// "rg 代词性语素\n" + +// "m 数词\n" + +// "mq 数量词\n" + +// "q 量词\n" + +// "qv 动量词\n" + +// "qt 时量词\n" + +// "d 副词\n" + +// "p 介词\n" + +// "pba 介词“把”\n" + +// "pbei 介词“被”\n" + +// "c 连词\n" + +// "\tcc 并列连词\n" + +// "u 助词\n" + +// "uzhe 着\n" + +// "ule 了 喽\n" + +// "uguo 过\n" + 
+// "ude1 的 底\n" + +// "ude2 地\n" + +// "ude3 得\n" + +// "usuo 所\n" + +// "udeng 等 等等 云云\n" + +// "uyy 一样 一般 似的 般\n" + +// "udh 的话\n" + +// "uls 来讲 来说 而言 说来\n" + +// "\n" + +// "uzhi 之\n" + +// "ulian 连 (“连小学生都会”)\n" + +// "\n" + +// "e 叹词\n" + +// "y 语气词(delete yg)\n" + +// "o 拟声词\n" + +// "h 前缀\n" + +// "k 后缀\n" + +// "x 字符串\n" + +// "\txx 非语素字\n" + +// "\txu 网址URL\n" + +// "w 标点符号\n" + +// "wkz 左括号,全角:( 〔 [ { 《 【 〖 〈 半角:( [ { <\n" + +// "wky 右括号,全角:) 〕 ] } 》 】 〗 〉 半角: ) ] { >\n" + +// "wyz 左引号,全角:“ ‘ 『 \n" + +// "wyy 右引号,全角:” ’ 』 \n" + +// "wj 句号,全角:。\n" + +// "ww 问号,全角:? 半角:?\n" + +// "wt 叹号,全角:! 半角:!\n" + +// "wd 逗号,全角:, 半角:,\n" + +// "wf 分号,全角:; 半角: ;\n" + +// "wn 顿号,全角:、\n" + +// "wm 冒号,全角:: 半角: :\n" + +// "ws 省略号,全角:…… …\n" + +// "wp 破折号,全角:—— -- ——- 半角:--- ----\n" + +// "wb 百分号千分号,全角:% ‰ 半角:%\n" + +// "wh 单位符号,全角:¥ $ £ ° ℃ 半角:$\n" + +// "\n"; +// String[] params = text.split("\\n"); +// int i = 0; +// for (String p : params) +// { +// p = p.trim(); +// if (p.length() == 0) continue; +// System.out.print(++i + " "); +// System.out.println(p); +//// int cut = p.indexOf(' '); +//// System.out.println("/**\n" + +//// "* " + p.substring(cut + 1) + "\n" + +//// "*/\n" + +//// p.substring(0, cut) +",\n"); +// } +// } +// +// public void testSize() throws Exception +// { +// System.out.println(Nature.values().length); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/corpus/dependency/CoNll/CoNLLLoaderTest.java b/src/test/java/com/hankcs/hanlp/corpus/dependency/CoNll/CoNLLLoaderTest.java new file mode 100644 index 000000000..f7e8b13b0 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/dependency/CoNll/CoNLLLoaderTest.java @@ -0,0 +1,217 @@ +package com.hankcs.hanlp.corpus.dependency.CoNll; + +import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; +import com.hankcs.hanlp.corpus.dictionary.item.Item; +import com.hankcs.hanlp.corpus.io.IOUtil; +import junit.framework.TestCase; + +import java.io.BufferedWriter; +import java.io.FileOutputStream; +import 
java.io.OutputStreamWriter;
+import java.util.LinkedHashSet;
+import java.util.LinkedList;
+import java.util.Set;
+
+public class CoNLLLoaderTest extends TestCase
+{
+//    public void testConvert() throws Exception
+//    {
+//        LinkedList coNLLSentences = CoNLLLoader.loadSentenceList("D:\\Doc\\语料库\\依存分析训练数据\\THU\\dev.conll.fixed.txt");
+//    }
+//
+//    /**
+//     * Convert fine-grained POS tags to coarse-grained ones
+//     *
+//     * @throws Exception
+//     */
+//    public void testPosTag() throws Exception
+//    {
+//        DictionaryMaker dictionaryMaker = new DictionaryMaker();
+//        LinkedList coNLLSentences = CoNLLLoader.loadSentenceList("D:\\Doc\\语料库\\依存分析训练数据\\THU\\dev.conll.fixed.txt");
+//        for (CoNLLSentence coNLLSentence : coNLLSentences)
+//        {
+//            for (CoNLLWord coNLLWord : coNLLSentence.word)
+//            {
+//                dictionaryMaker.add(new Item(coNLLWord.POSTAG, coNLLWord.CPOSTAG));
+//            }
+//        }
+//        System.out.println(dictionaryMaker.entrySet());
+//    }
+//
+//    /**
+//     * Export a CRF training corpus
+//     *
+//     * @throws Exception
+//     */
+//    public void testMakeCRF() throws Exception
+//    {
+//        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("D:\\Tools\\CRF++-0.58\\example\\dependency\\dev.txt")));
+//        LinkedList coNLLSentences = CoNLLLoader.loadSentenceList("D:\\Doc\\语料库\\依存分析训练数据\\THU\\dev.conll.fixed.txt");
+//        for (CoNLLSentence coNLLSentence : coNLLSentences)
+//        {
+//            for (CoNLLWord coNLLWord : coNLLSentence.word)
+//            {
+//                bw.write(coNLLWord.NAME);
+//                bw.write('\t');
+//                bw.write(coNLLWord.CPOSTAG);
+//                bw.write('\t');
+//                bw.write(coNLLWord.POSTAG);
+//                bw.write('\t');
+//                int d = coNLLWord.HEAD.ID - coNLLWord.ID;
+//                int posDistance = 1;
+//                if (d > 0) // the head comes after this word
+//                {
+//                    for (int i = 1; i < d; ++i)
+//                    {
+//                        if (coNLLSentence.word[coNLLWord.ID - 1 + i].CPOSTAG.equals(coNLLWord.HEAD.CPOSTAG))
+//                        {
+//                            ++posDistance;
+//                        }
+//                    }
+//                }
+//                else
+//                {
+//                    for (int i = 1; i < -d; ++i) // the head comes before this word
+//                    {
+//                        if (coNLLSentence.word[coNLLWord.ID - 1 - i].CPOSTAG.equals(coNLLWord.HEAD.CPOSTAG))
+//                        {
+//                            ++posDistance;
+//                        }
+//                    }
+//                }
+//                bw.write((d > 0 ? "+" : "-") + posDistance + "_" + coNLLWord.HEAD.CPOSTAG
+////                        + "_" + coNLLWord.DEPREL
+//                );
+//                bw.newLine();
+//            }
+//            bw.newLine();
+//        }
+//        bw.close();
+//    }
+//
+//    /**
+//     * Generate the CRF feature template
+//     *
+//     * @throws Exception
+//     */
+//    public void testMakeCRFTemplate() throws Exception
+//    {
+//        Set templateList = new LinkedHashSet();
+//        int maxDistance = 4;
+//        // word features (column 0 is NAME)
+//        for (int i = -maxDistance; i <= maxDistance; ++i)
+//        {
+//            templateList.add("%x[" + i + ",0]");
+//        }
+//        // coarse-grained POS features (column 1 is CPOSTAG)
+//        for (int i = -maxDistance; i <= maxDistance; ++i)
+//        {
+//            templateList.add("%x[" + i + ",1]");
+//        }
+//        // fine-grained POS features (column 2 is POSTAG)
+//        for (int i = -maxDistance; i <= maxDistance; ++i)
+//        {
+//            templateList.add("%x[" + i + ",2]");
+//        }
+//        // combined word features
+//        String[] before = new String[maxDistance + 1];
+//        String[] after = new String[maxDistance + 1];
+//        before[0] = "%x[0,0]";
+//        after[0] = "";
+//        for (int i = 1; i <= maxDistance; ++i)
+//        {
+//            before[i] = "%x[-" + i + ",0]/" + before[i - 1];
+//            after[i] = after[i - 1] + "/%x[" + i + ",0]";
+//        }
+//        for (int i = 0; i <= maxDistance; ++i)
+//        {
+//            for (int j = 0; j <= maxDistance; ++j)
+//            {
+//                templateList.add(before[i] + after[j]);
+//            }
+//        }
+//        // combined coarse-grained POS features
+//        before[0] = "%x[0,1]";
+//        after[0] = "";
+//        for (int i = 1; i <= maxDistance; ++i)
+//        {
+//            before[i] = "%x[-" + i + ",1]/" + before[i - 1];
+//            after[i] = after[i - 1] + "/%x[" + i + ",1]";
+//        }
+//        for (int i = 0; i <= maxDistance; ++i)
+//        {
+//            for (int j = 0; j <= maxDistance; ++j)
+//            {
+//                templateList.add(before[i] + after[j]);
+//            }
+//        }
+//        // combined fine-grained POS features
+//        before[0] = "%x[0,2]";
+//        after[0] = "";
+//        for (int i = 1; i <= maxDistance; ++i)
+//        {
+//            before[i] = "%x[-" + i + ",2]/" + before[i - 1];
+//            after[i] = after[i - 1] + "/%x[" + i + ",2]";
+//        }
+//        for (int i = 0; i <= maxDistance; ++i)
+//        {
+//            for (int j = 0; j <= maxDistance; ++j)
+//            {
+//                templateList.add(before[i] + after[j]);
+//            }
+//        }
+//
+//        int id = 0;
+//        StringBuilder sb = new StringBuilder();
+//        for (String template : templateList)
+//        {
+//            sb.append(String.format("U%d:%s\n", id, template));
+//            ++id;
+//        }
+//        System.out.println(sb.toString());
+//        IOUtil.saveTxt("D:\\Tools\\CRF++-0.58\\example\\dependency\\template.txt", sb);
+//    }
+//
+//    public void testMakeSimpleCRFTemplate() throws Exception
+//    {
+//        Set templateList = new LinkedHashSet();
+//        int maxDistance = 4;
+//        // word features
+//        for (int i = -maxDistance; i <= maxDistance; ++i)
+//        {
+//            templateList.add("%x[" + i + ",0]");
+//        }
+//        // coarse-grained POS features
+//        for (int i = -maxDistance; i <= maxDistance; ++i)
+//        {
+//            templateList.add("%x[" + i + ",1]");
+//        }
+//        // fine-grained POS features
+//        for (int i = -maxDistance; i <= maxDistance; ++i)
+//        {
+//            templateList.add("%x[" + i + ",2]");
+//        }
+//        // combined features
+//        for (int i = 1; i <= maxDistance; ++i)
+//        {
+//            templateList.add("%x[-" + i + ",0]/" + "%x[0,0]");
+//            templateList.add("%x[0,0]/" + "%x[" + i + ",0]");
+//
+//            templateList.add("%x[-" + i + ",1]/" + "%x[0,1]");
+//            templateList.add("%x[0,1]/" + "%x[" + i + ",1]");
+//
+//            templateList.add("%x[-" + i + ",2]/" + "%x[0,2]");
+//            templateList.add("%x[0,2]/" + "%x[" + i + ",2]");
+//        }
+//
+//        int id = 0;
+//        StringBuilder sb = new StringBuilder();
+//        for (String template : templateList)
+//        {
+//            sb.append(String.format("U%d:%s\n", id, template));
+//            ++id;
+//        }
+//        System.out.println(sb.toString());
+//        IOUtil.saveTxt("D:\\Tools\\CRF++-0.58\\example\\dependency\\template.txt", sb);
+//    }
+}
\ No newline at end of file
diff --git a/src/test/java/com/hankcs/hanlp/corpus/dictionary/DictionaryMakerTest.java b/src/test/java/com/hankcs/hanlp/corpus/dictionary/DictionaryMakerTest.java
new file mode 100644
index 000000000..79d9aecec
--- /dev/null
+++ b/src/test/java/com/hankcs/hanlp/corpus/dictionary/DictionaryMakerTest.java
@@ -0,0 +1,219 @@
+package com.hankcs.hanlp.corpus.dictionary;
+
+import junit.framework.TestCase;
+
+public class DictionaryMakerTest extends TestCase
+{
+    // Some annotations in the corpus are wrong, e.g. commas missing their /w tags; try to fix them
+//    public void testAdjust() throws Exception
+//    {
+//        List fileList = FolderWalker.open("D:\\JavaProjects\\CorpusToolBox\\data\\2014\\");
+//        for (File file : fileList)
+//        {
+//            handle(file);
+//        }
+//    }
+//
+//    private static void handle(File file)
+//    {
+//        try
+//        {
+//            String text = IOUtil.readTxt(file.getPath());
+//            int length = text.length();
+//            text = addW(text, ":");
+//            text = addW(text, "?");
+//            text = addW(text, ",");
+//            text = addW(text, ")");
+//            text = addW(text, "(");
+//            text = addW(text, "!");
+//            text = addW(text, "(");
+//            text = addW(text, ")");
+//            text = addW(text, ",");
+//            text = addW(text, "‘");
+//            text = addW(text, "’");
+//            text = addW(text, "“");
+//            text = addW(text, "”");
+//            text = addW(text, ";");
+//            text = addW(text, "……");
+//            text = addW(text, "。");
+//            text = addW(text, "、");
+//            text = addW(text, "《");
+//            text = addW(text, "》");
+//            if (text.length() != length)
+//            {
+//                BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)));
+//                bw.write(text);
+//                bw.close();
+//                System.out.println("Fixed " + file);
+//            }
+//        }
+//        catch (Exception e)
+//        {
+//            e.printStackTrace();
+//        }
+//    }
+//
+//    private static String addW(String text, String c)
+//    {
+//        text = text.replaceAll("\\" + c + "/w ", c);
+//        return text.replaceAll("\\" + c, c + "/w ");
+//    }
+//
+//    public void testPlay() throws Exception
+//    {
+//        final TFDictionary tfDictionary = new TFDictionary();
+//        CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014", new CorpusLoader.Handler()
+//        {
+//            @Override
+//            public void handle(Document document)
+//            {
+//                for (List wordList : document.getComplexSentenceList())
+//                {
+//                    for (IWord word : wordList)
+//                    {
+//                        if (word instanceof CompoundWord && word.getLabel().equals("ns"))
+//                        {
+//                            tfDictionary.add(word.toString());
+//                        }
+//                    }
+//                }
+//            }
+//        });
+//        tfDictionary.saveTxtTo("data/test/complex_ns.txt");
+//    }
+//
+//    public void testAdjustNGram() throws Exception
+//    {
+//        IOUtil.LineIterator iterator = new IOUtil.LineIterator(HanLP.Config.BiGramDictionaryPath);
+//        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(HanLP.Config.BiGramDictionaryPath + "adjust.txt"), "UTF-8"));
+//        while (iterator.hasNext())
+//        {
+//            String line = iterator.next();
+//            String[] params = line.split(" ");
+//            String first = params[0].split("@", 2)[0];
+//            String second = params[0].split("@", 2)[1];
+////            if (params.length != 2)
+////                System.err.println(line);
+//            int biFrequency = Integer.parseInt(params[1]);
+//            CoreDictionary.Attribute attribute = CoreDictionary.get(first + second);
+//            if (attribute != null && (first.length() == 1 || second.length() == 1))
+//            {
+//                System.out.println(line);
+//                continue;
+//            }
+//            bw.write(line);
+//            bw.newLine();
+//        }
+//        bw.close();
+//    }
+//
+//    public void testRemoveLabelD() throws Exception
+//    {
+//        Set nameFollowers = new TreeSet();
+//        IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(HanLP.Config.BiGramDictionaryPath);
+//        while (lineIterator.hasNext())
+//        {
+//            String line = lineIterator.next();
+//            String[] words = line.split("\\s")[0].split("@");
+//            if (words[0].equals(Predefine.TAG_PEOPLE))
+//            {
+//                nameFollowers.add(words[1]);
+//            }
+//        }
+//        DictionaryMaker dictionary = DictionaryMaker.load(HanLP.Config.PersonDictionaryPath);
+//        for (Map.Entry entry : dictionary.entrySet())
+//        {
+//            String key = entry.getKey();
+//            int dF = entry.getValue().getFrequency("D");
+//            if (key.length() == 1 && 0 < dF && dF < 100)
+//            {
+//                CoreDictionary.Attribute attribute = CoreDictionary.get(key);
+//                if (nameFollowers.contains(key)
+//                        || (attribute != null && attribute.hasNatureStartsWith("v") && attribute.totalFrequency > 1000)
+//                )
+//                {
+//                    System.out.println(key);
+//                    entry.getValue().removeLabel("D");
+//                }
+//            }
+//        }
+//
+//        dictionary.saveTxtTo(HanLP.Config.PersonDictionaryPath);
+//    }
+
+//    public void testSingleDocument() throws Exception
+//    {
+//        Document document = CorpusLoader.convert2Document(new File("data/2014/0101/c1002-23996898.txt"));
+//        DictionaryMaker dictionaryMaker = new DictionaryMaker();
+//        System.out.println(document);
+//        addToDictionary(document, dictionaryMaker);
+//        dictionaryMaker.saveTxtTo("data/dictionaryTest.txt");
+//    }
+//
+//    private void addToDictionary(Document document, DictionaryMaker dictionaryMaker)
+//    {
+//        for (IWord word : document.getWordList())
+//        {
+//            if (word instanceof CompoundWord)
+//            {
+//                for (Word inner : ((CompoundWord)word).innerList)
+//                {
+//                    // skip person names for now
+//                    if (inner.getLabel().equals("nr"))
+//                    {
+//                        continue;
+//                    }
+//                    // to include person names, comment out the statement above
+//                    dictionaryMaker.add(inner);
+//                }
+//            }
+//            // skip person names for now
+//            if (word.getLabel().equals("nr"))
+//            {
+//                continue;
+//            }
+//            // to include person names, comment out the statement above
+//            dictionaryMaker.add(word);
+//        }
+//    }
+//
+//    public void testMakeDictionary() throws Exception
+//    {
+//        final DictionaryMaker dictionaryMaker = new DictionaryMaker();
+//        CorpusLoader.walk("data/2014", new CorpusLoader.Handler()
+//        {
+//            @Override
+//            public void handle(Document document)
+//            {
+//                addToDictionary(document, dictionaryMaker);
+//            }
+//        });
+//        dictionaryMaker.saveTxtTo("data/2014_dictionary.txt");
+//    }
+//
+//    public void testLoadItemList() throws Exception
+//    {
+//        List itemList = DictionaryMaker.loadAsItemList("data/2014_dictionary.txt");
+//        Map labelMap = new TreeMap();
+//        for (Item item : itemList)
+//        {
+//            for (Map.Entry entry : item.labelMap.entrySet())
+//            {
+//                Integer frequency = labelMap.get(entry.getKey());
+//                if (frequency == null) frequency = 0;
+//                labelMap.put(entry.getKey(), frequency + entry.getValue());
+//            }
+//        }
+//        for (String label : labelMap.keySet())
+//        {
+//            System.out.println(label);
+//        }
+//        System.out.println(labelMap.size());
+//    }
+//
+//    public void testLoadEasyDictionary() throws Exception
+//    {
+//        EasyDictionary dictionary = EasyDictionary.create("data/2014_dictionary.txt");
+//        
System.out.println(dictionary.GetWordInfo("高峰")); +// } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/dictionary/TMDictionaryMakerTest.java b/src/test/java/com/hankcs/hanlp/corpus/dictionary/TMDictionaryMakerTest.java new file mode 100644 index 000000000..e0d1f9a18 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/dictionary/TMDictionaryMakerTest.java @@ -0,0 +1,20 @@ +package com.hankcs.hanlp.corpus.dictionary; + +import junit.framework.TestCase; + +public class TMDictionaryMakerTest extends TestCase +{ + public void testCreate() throws Exception + { + TMDictionaryMaker tmDictionaryMaker = new TMDictionaryMaker(); + tmDictionaryMaker.addPair("ab", "cd"); + tmDictionaryMaker.addPair("ab", "cd"); + tmDictionaryMaker.addPair("ab", "Y"); + tmDictionaryMaker.addPair("ef", "gh"); + tmDictionaryMaker.addPair("ij", "kl"); + tmDictionaryMaker.addPair("ij", "kl"); + tmDictionaryMaker.addPair("ij", "kl"); + tmDictionaryMaker.addPair("X", "Y"); +// System.out.println(tmDictionaryMaker); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/dictionary/item/ItemTest.java b/src/test/java/com/hankcs/hanlp/corpus/dictionary/item/ItemTest.java new file mode 100644 index 000000000..f46896c4a --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/dictionary/item/ItemTest.java @@ -0,0 +1,19 @@ +package com.hankcs.hanlp.corpus.dictionary.item; + +import junit.framework.TestCase; + +public class ItemTest extends TestCase +{ + public void testCreate() throws Exception + { + assertEquals("希望 v 7685 vn 616", Item.create("希望 v 7685 vn 616").toString()); + } + + public void testCombine() throws Exception + { + SimpleItem itemA = SimpleItem.create("A 1 B 2"); + SimpleItem itemB = SimpleItem.create("B 1 C 2 D 3"); + itemA.combine(itemB); + assertEquals("B 3 D 3 C 2 A 1 ", itemA.toString()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/document/CorpusLoaderTest.java 
b/src/test/java/com/hankcs/hanlp/corpus/document/CorpusLoaderTest.java new file mode 100644 index 000000000..a82118554 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/document/CorpusLoaderTest.java @@ -0,0 +1,219 @@ +package com.hankcs.hanlp.corpus.document; + +import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import junit.framework.TestCase; + +import java.io.*; +import java.util.List; + +public class CorpusLoaderTest extends TestCase +{ +// public void testMultiThread() throws Exception +// { +// CorpusLoader.HandlerThread[] handlerThreadArray = new CorpusLoader.HandlerThread[4]; +// for (int i = 0; i < handlerThreadArray.length; ++i) +// { +// handlerThreadArray[i] = new CorpusLoader.HandlerThread(String.valueOf(i)) +// { +// @Override +// public void handle(Document document) +// { +// +// } +// }; +// } +// CorpusLoader.walk("data/2014", handlerThreadArray); +// } +// +// public void testSingleThread() throws Exception +// { +// CorpusLoader.walk("data/2014", new CorpusLoader.Handler() +// { +// @Override +// public void handle(Document document) +// { +// +// } +// }); +// } +// +// public void testCombineToTxt() throws Exception +// { +// final BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("D:\\Doc\\语料库\\2014_cn.txt"), "UTF-8")); +// CorpusLoader.walk("D:\\Doc\\语料库\\2014_hankcs", new CorpusLoader.Handler() +// { +// @Override +// public void handle(Document document) +// { +// try +// { +// for (List sentence : document.getSimpleSentenceList()) +// { +// for (IWord word : sentence) +// { +// bw.write(word.getValue()); +// bw.write(' '); +// } +// bw.newLine(); +// } +// bw.newLine(); +// } +// catch (Exception e) +// { +// e.printStackTrace(); +// } +// } +// }); +// bw.close(); +// } +// +// public void testConvert2SimpleSentenceList() throws Exception +// { +// List> 
simpleSentenceList = CorpusLoader.convert2SimpleSentenceList("data/2014");
+//        System.out.println(simpleSentenceList.get(0));
+//    }
+//
+//    public void testMakePersonCustomDictionary() throws Exception
+//    {
+//        final DictionaryMaker dictionaryMaker = new DictionaryMaker();
+//        CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014", new CorpusLoader.Handler()
+//        {
+//            @Override
+//            public void handle(Document document)
+//            {
+//                List> complexSentenceList = document.getComplexSentenceList();
+//                for (List wordList : complexSentenceList)
+//                {
+//                    for (IWord word : wordList)
+//                    {
+//                        if (word.getLabel().startsWith("nr"))
+//                        {
+//                            dictionaryMaker.add(word);
+//                        }
+//                    }
+//                }
+//            }
+//        });
+//        dictionaryMaker.saveTxtTo("data/dictionary/custom/人名词典.txt");
+//    }
+//
+//    public void testMakeOrganizationCustomDictionary() throws Exception
+//    {
+//        final DictionaryMaker dictionaryMaker = new DictionaryMaker();
+//        CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014", new CorpusLoader.Handler()
+//        {
+//            @Override
+//            public void handle(Document document)
+//            {
+//                List> complexSentenceList = document.getComplexSentenceList();
+//                for (List wordList : complexSentenceList)
+//                {
+//                    for (IWord word : wordList)
+//                    {
+//                        if (word.getLabel().startsWith("nt"))
+//                        {
+//                            dictionaryMaker.add(word);
+//                        }
+//                    }
+//                }
+//            }
+//        });
+//        dictionaryMaker.saveTxtTo("data/dictionary/custom/机构名词典.txt");
+//    }
+//
+//    /**
+//     * Many full stops in the corpus are annotated incorrectly; try to fix them.
+//     * E.g. "方言/n 版/n [新年/t 祝福/vn]/nz 。你/rr 的/ude1 一段/mq 话/n "
+//     * @throws Exception
+//     */
+//    public void testAdjustDot() throws Exception
+//    {
+//        CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014", new CorpusLoader.Handler()
+//        {
+//            int id = 0;
+//            @Override
+//            public void handle(Document document)
+//            {
+//                try
+//                {
+//                    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("D:\\Doc\\语料库\\2014_hankcs\\" + (++id) + ".txt"), "UTF-8"));
+//                    for (List wordList : document.getComplexSentenceList())
+//                    {
+//                        if (wordList.size() == 0) continue;
+//                        for (IWord word : wordList)
+//                        {
+//                            if (word.getValue().length() > 1 && word.getValue().charAt(0) == '。')
+//                            {
+//                                bw.write("。/w");
+//                                bw.write(word.getValue().substring(1));
+//                                bw.write('/');
+//                                bw.write(word.getLabel());
+//                                bw.write(' ');
+//                                continue;
+//                            }
+//                            bw.write(word.toString());
+//                            bw.write(' ');
+//                        }
+//                        bw.newLine();
+//                    }
+//                    bw.close();
+//                }
+//                catch (FileNotFoundException e)
+//                {
+//                    e.printStackTrace();
+//                }
+//                catch (UnsupportedEncodingException e)
+//                {
+//                    e.printStackTrace();
+//                }
+//                catch (IOException e)
+//                {
+//                    e.printStackTrace();
+//                }
+//            }
+//        });
+//    }
+//
+//    public void testLoadMyCorpus() throws Exception
+//    {
+//        CorpusLoader.walk("D:\\Doc\\语料库\\2014_hankcs\\", new CorpusLoader.Handler()
+//        {
+//            @Override
+//            public void handle(Document document)
+//            {
+//                for (List wordList : document.getComplexSentenceList())
+//                {
+//                    System.out.println(wordList);
+//                }
+//            }
+//        });
+//
+//    }
+//
+//    /**
+//     * Some quotation marks are annotated incorrectly
+//     * @throws Exception
+//     */
+//    public void testFindQuote() throws Exception
+//    {
+//        CorpusLoader.walk("D:\\Doc\\语料库\\2014_hankcs\\", new CorpusLoader.Handler()
+//        {
+//            @Override
+//            public void handle(Document document)
+//            {
+//                for (List wordList : document.getSimpleSentenceList())
+//                {
+//                    for (Word word : wordList)
+//                    {
+//                        if(word.value.length() > 1 && word.value.endsWith("\""))
+//                        {
+//                            System.out.println(word);
+//                        }
+//                    }
+//                }
+//            }
+//        });
+//    }
+}
\ No newline at end of file
diff --git a/src/test/java/com/hankcs/hanlp/corpus/document/DocumentTest.java b/src/test/java/com/hankcs/hanlp/corpus/document/DocumentTest.java
new file mode 100644
index 000000000..d22183338
--- /dev/null
+++ b/src/test/java/com/hankcs/hanlp/corpus/document/DocumentTest.java
@@ -0,0 +1,12 @@
+package com.hankcs.hanlp.corpus.document;
+
+import junit.framework.TestCase;
+
+public class DocumentTest extends TestCase
+{
+    public void 
testCreate() throws Exception + { + Document document = Document.create("[上海/ns 华安/nz 工业/n (/w 集团/n )/w 公司/n]/nt 董事长/n 谭旭光/nr 和/c 秘书/n 胡花蕊/nr 来到/v [美国/ns 纽约/ns 现代/t 艺术/n 博物馆/n]/ns 参观/v"); + assertNotNull(document); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/document/sentence/SentenceTest.java b/src/test/java/com/hankcs/hanlp/corpus/document/sentence/SentenceTest.java index da8ec0c69..5ec0a88b5 100644 --- a/src/test/java/com/hankcs/hanlp/corpus/document/sentence/SentenceTest.java +++ b/src/test/java/com/hankcs/hanlp/corpus/document/sentence/SentenceTest.java @@ -1,11 +1,80 @@ package com.hankcs.hanlp.corpus.document.sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.WordFactory; import junit.framework.TestCase; +import java.util.ListIterator; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + public class SentenceTest extends TestCase { + public void testFindFirstWordIteratorByLabel() throws Exception + { + Sentence sentence = Sentence.create("[上海/ns 华安/nz 工业/n (/w 集团/n )/w 公司/n]/nt 董事长/n 谭旭光/nr 和/c 秘书/n 胡花蕊/nr 来到/v [美国/ns 纽约/ns 现代/t 艺术/n 博物馆/n]/ns 参观/v"); + ListIterator nt = sentence.findFirstWordIteratorByLabel("nt"); + assertNotNull(nt); + assertEquals("[上海/ns 华安/nz 工业/n (/w 集团/n )/w 公司/n]/nt", nt.previous().toString()); + CompoundWord apple = CompoundWord.create("[苹果/n 公司/n]/nt"); + nt.set(apple); + assertEquals(sentence.findFirstWordByLabel("nt"), apple); + nt.remove(); + assertEquals("董事长/n 谭旭光/nr 和/c 秘书/n 胡花蕊/nr 来到/v [美国/ns 纽约/ns 现代/t 艺术/n 博物馆/n]/ns 参观/v", sentence.toString()); + ListIterator ns = sentence.findFirstWordIteratorByLabel("ns"); + assertEquals("参观/v", ns.next().toString()); + } + + public void testToStandoff() throws Exception + { + Sentence sentence = Sentence.create("[上海/ns 华安/nz 工业/n (/w 集团/n )/w 公司/n]/nt 董事长/n 谭旭光/nr 和/c 秘书/n 胡花蕊/nr 来到/v 
[美国/ns 纽约/ns 现代/t 艺术/n 博物馆/n]/ns 参观/v"); + System.out.println(sentence.toStandoff(true)); + } + public void testText() throws Exception { assertEquals("人民网纽约时报", Sentence.create("人民网/nz [纽约/nsf 时报/n]/nz").text()); } + + public void testCreate() throws Exception + { + String text = "人民网/nz 1月1日/t 讯/ng 据/p 《/w [纽约/nsf 时报/n]/nz 》/w 报道/v ,/w"; + Pattern pattern = Pattern.compile("(\\[(.+/[a-z]+)]/[a-z]+)|([^\\s]+/[a-z]+)"); + Matcher matcher = pattern.matcher(text); + while (matcher.find()) + { + String param = matcher.group(); + assertEquals(param, WordFactory.create(param).toString()); + } + assertEquals(text, Sentence.create(text).toString()); + } + + public void testCreateNoTag() throws Exception + { + String text = "商品 和 服务"; + Sentence sentence = Sentence.create(text); + System.out.println(sentence); + } + + public void testMerge() throws Exception + { + Sentence sentence = Sentence.create("晚9时40分/TIME ,/v 鸟/n 迷/v 、/v 专家/n 托尼/PERSON 率领/v 的/u [英国/ns “/w 野翅膀/nz ”/w 观/Vg 鸟/n 团/n]/ORGANIZATION 一行/n 29/INTEGER 人/n ,/v 才/d 吃/v 完/v 晚饭/n 回到/v [金山/nz 宾馆/n]/ORGANIZATION 的/u 大/a 酒吧间/n ,/v 他们/r 一边/d 喝/v 着/u 青岛/LOCATION 啤酒/n ,/v 一边/d 兴致勃勃/i 地/u 回答/v 记者/n 的/u 提问/vn 。/w"); + System.out.println(sentence.mergeCompoundWords()); + } + + public void testRemoveBracket() + { + Sentence sentence = Sentence.create("[关塔那摩/ns]/ns 问题/n 上/f ,/w 美国/nsf 的/ude1 [双重/b 标准/n]/nz"); + for (IWord word : sentence) + { + String text = word.getValue(); + if (text.contains("/")) + { + System.out.println(word); + fail(); + } + } + assertNotNull(Sentence.create("各/rz [2/m //w 3/m 以上/f]/mq 议员/nnt 赞成/v")); + } } \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/document/sentence/word/WordTest.java b/src/test/java/com/hankcs/hanlp/corpus/document/sentence/word/WordTest.java new file mode 100644 index 000000000..130a7c1c6 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/document/sentence/word/WordTest.java @@ -0,0 +1,19 @@ +package 
com.hankcs.hanlp.corpus.document.sentence.word; + +import junit.framework.TestCase; + +public class WordTest extends TestCase +{ + public void testCreate() throws Exception + { + assertEquals("人民网/nz", Word.create("人民网/nz").toString()); + assertEquals("[纽约/nsf 时报/n]/nz", CompoundWord.create("[纽约/nsf 时报/n]/nz").toString()); + assertEquals("[中央/n 人民/n 广播/vn 电台/n]/nt", CompoundWord.create("[中央/n 人民/n 广播/vn 电台/n]nt").toString()); + } + + public void testSpace() throws Exception + { + CompoundWord compoundWord = CompoundWord.create("[9/m 11/m 后/f]/mq"); + assertEquals(3, compoundWord.innerList.size()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/io/ByteArrayTest.java b/src/test/java/com/hankcs/hanlp/corpus/io/ByteArrayTest.java new file mode 100644 index 000000000..a2fec68b9 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/io/ByteArrayTest.java @@ -0,0 +1,159 @@ +package com.hankcs.hanlp.corpus.io; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.model.maxent.MaxEntModel; +import com.hankcs.hanlp.utility.ByteUtil; +import com.hankcs.hanlp.utility.Predefine; +import junit.framework.TestCase; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.FileOutputStream; + +public class ByteArrayTest extends TestCase +{ + static String DATA_TEST_OUT_BIN; + private File tempFile; + + @Override + public void setUp() throws Exception + { + tempFile = File.createTempFile("hanlp-", ".dat"); + DATA_TEST_OUT_BIN = tempFile.getAbsolutePath(); + } + + public void testReadDouble() throws Exception + { + DataOutputStream out = new DataOutputStream(new FileOutputStream(DATA_TEST_OUT_BIN)); + double d = 0.123456789; + out.writeDouble(d); + int i = 3389; + out.writeInt(i); + ByteArray byteArray = ByteArray.createByteArray(DATA_TEST_OUT_BIN); + assertEquals(d, byteArray.nextDouble()); + assertEquals(i, byteArray.nextInt()); + } + + public void testReadUTF() throws Exception + { + DataOutputStream out = new 
DataOutputStream(new FileOutputStream(DATA_TEST_OUT_BIN)); + String utf = "hankcs你好123"; + out.writeUTF(utf); + ByteArray byteArray = ByteArray.createByteArray(DATA_TEST_OUT_BIN); + assertEquals(utf, byteArray.nextUTF()); + } + + public void testReadUnsignedShort() throws Exception + { + DataOutputStream out = new DataOutputStream(new FileOutputStream(DATA_TEST_OUT_BIN)); + int utflen = 123; + out.writeByte((byte) ((utflen >>> 8) & 0xFF)); + out.writeByte((byte) ((utflen >>> 0) & 0xFF)); + ByteArray byteArray = ByteArray.createByteArray(DATA_TEST_OUT_BIN); + assertEquals(utflen, byteArray.nextUnsignedShort()); + } + + public void testConvertCharToInt() throws Exception + { +// for (int i = 0; i < Integer.MAX_VALUE; ++i) + for (int i = 0; i < 1024; ++i) + { + int n = i; + char[] twoChar = ByteUtil.convertIntToTwoChar(n); + assertEquals(n, ByteUtil.convertTwoCharToInt(twoChar[0], twoChar[1])); + } + } + + public void testNextBoolean() throws Exception + { + DataOutputStream out = new DataOutputStream(new FileOutputStream(tempFile)); + out.writeBoolean(true); + out.writeBoolean(false); + ByteArray byteArray = ByteArray.createByteArray(tempFile.getAbsolutePath()); + assertNotNull(byteArray); + assertEquals(byteArray.nextBoolean(), true); + assertEquals(byteArray.nextBoolean(), false); + tempFile.deleteOnExit(); + } + + public void testWriteAndRead() throws Exception + { + DataOutputStream out = new DataOutputStream(new FileOutputStream(DATA_TEST_OUT_BIN)); + out.writeChar('H'); + out.writeChar('e'); + out.writeChar('l'); + out.writeChar('l'); + out.writeChar('o'); + out.close(); + ByteArray byteArray = ByteArray.createByteArray(DATA_TEST_OUT_BIN); + while (byteArray.hasMore()) + { + byteArray.nextChar(); +// System.out.println(byteArray.nextChar()); + } + } + + public void testWriteBigFile() throws Exception + { + DataOutputStream out = new DataOutputStream(new FileOutputStream(DATA_TEST_OUT_BIN)); + for (int i = 0; i < 10000; i++) + { + out.writeInt(i); + } + 
out.close();
+    }
+
+    public void testStream() throws Exception
+    {
+        ByteArray byteArray = ByteArrayFileStream.createByteArrayFileStream(DATA_TEST_OUT_BIN);
+        while (byteArray.hasMore())
+        {
+            System.out.println(byteArray.nextInt());
+        }
+    }
+
+//    /**
+//     * Cannot run under -Xms512m -Xmx512m -Xmn256m:
+//     * java.lang.OutOfMemoryError: GC overhead limit exceeded
+//     * @throws Exception
+//     */
+//    public void testLoadByteArray() throws Exception
+//    {
+//        ByteArray byteArray = ByteArray.createByteArray(HanLP.Config.MaxEntModelPath + Predefine.BIN_EXT);
+//        MaxEntModel.create(byteArray);
+//    }
+//
+//    /**
+//     * Can run under -Xms512m -Xmx512m -Xmn256m
+//     * @throws Exception
+//     */
+//    public void testLoadByteArrayStream() throws Exception
+//    {
+//        ByteArray byteArray = ByteArrayFileStream.createByteArrayFileStream(HanLP.Config.MaxEntModelPath + Predefine.BIN_EXT);
+//        MaxEntModel.create(byteArray);
+//    }
+//
+//    public void testBenchmark() throws Exception
+//    {
+//        long start;
+//
+//        ByteArray byteArray = ByteArray.createByteArray(HanLP.Config.MaxEntModelPath + Predefine.BIN_EXT);
+//        MaxEntModel.create(byteArray);
+//
+//        byteArray = ByteArrayFileStream.createByteArrayFileStream(HanLP.Config.MaxEntModelPath + Predefine.BIN_EXT);
+//        MaxEntModel.create(byteArray);
+//
+//        start = System.currentTimeMillis();
+//        byteArray = ByteArray.createByteArray(HanLP.Config.MaxEntModelPath + Predefine.BIN_EXT);
+//        MaxEntModel.create(byteArray);
+//        System.out.printf("ByteArray: %d ms\n", (System.currentTimeMillis() - start));
+//
+//        start = System.currentTimeMillis();
+//        byteArray = ByteArrayFileStream.createByteArrayFileStream(HanLP.Config.MaxEntModelPath + Predefine.BIN_EXT);
+//        MaxEntModel.create(byteArray);
+//        System.out.printf("ByteArrayStream: %d ms\n", (System.currentTimeMillis() - start));
+//
+////        ByteArray: 2626 ms
+////        ByteArrayStream: 4165 ms
+//    }
+}
\ No newline at end of file
diff --git a/src/test/java/com/hankcs/hanlp/corpus/io/IOUtilTest.java b/src/test/java/com/hankcs/hanlp/corpus/io/IOUtilTest.java
index a0e2b96d9..67b4a9d96 100644
--- a/src/test/java/com/hankcs/hanlp/corpus/io/IOUtilTest.java
+++ b/src/test/java/com/hankcs/hanlp/corpus/io/IOUtilTest.java
@@ -3,6 +3,7 @@
 import junit.framework.TestCase;
 
 import java.io.ByteArrayInputStream;
+import 
java.io.File; import java.util.Random; public class IOUtilTest extends TestCase @@ -31,4 +32,17 @@ public synchronized int available() assertEquals(originalData[i], readData[i]); } } + + public void testUTF8BOM() throws Exception + { + File tempFile = File.createTempFile("hanlp-", ".txt"); + tempFile.deleteOnExit(); + IOUtil.saveTxt(tempFile.getAbsolutePath(), "\uFEFF第1行\n第2行"); + IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(tempFile.getAbsolutePath()); + int i = 1; + for (String line : lineIterator) + { + assertEquals(String.format("第%d行", i++), line); + } + } } \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/synonym/SynonymTest.java b/src/test/java/com/hankcs/hanlp/corpus/synonym/SynonymTest.java new file mode 100644 index 000000000..b4e47ce68 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/corpus/synonym/SynonymTest.java @@ -0,0 +1,94 @@ +package com.hankcs.hanlp.corpus.synonym; + +import com.hankcs.hanlp.dictionary.CoreSynonymDictionary; +import com.hankcs.hanlp.dictionary.common.CommonSynonymDictionary; +import com.hankcs.hanlp.dictionary.common.CommonSynonymDictionaryEx; +import junit.framework.TestCase; + +import java.io.FileInputStream; +import java.util.List; + +public class SynonymTest extends TestCase +{ +// public void testCreate() throws Exception +// { +// String[] testCaseArray = new String[] +// { +// "Bh06A32= 番茄 西红柿", +// "Ad02B05# 白人 白种人 黑人", +// "Bo21A05@ 摩托车" +// }; +// for (String tc : testCaseArray) +// { +// runCase(tc); +// } +// } +// +// public void testSingle() throws Exception +// { +// runCase("Aa01A01= 人 士 人物 人士 人氏 人选"); +// } +// +// public void testDictionary() throws Exception +// { +// String apple = "苹果"; +// String banana = "香蕉"; +// String bike = "自行车"; +// CommonSynonymDictionary.SynonymItem synonymApple = CoreSynonymDictionary.get(apple); +// CommonSynonymDictionary.SynonymItem synonymBanana = CoreSynonymDictionary.get(banana); +// CommonSynonymDictionary.SynonymItem synonymBike 
= CoreSynonymDictionary.get(bike); +// System.out.println(apple + " " + banana + "之间的距离是" + synonymApple.distance(synonymBanana)); +// System.out.println(apple + " " + bike + "之间的距离是" + synonymApple.distance(synonymBike)); +// } +// +// void runCase(String param) +// { +// List synonymList = Synonym.create(param); +// System.out.println(synonymList); +// } +// +// public void testDictionaryEx() throws Exception +// { +// CommonSynonymDictionaryEx dictionaryEx = CommonSynonymDictionaryEx.create(new FileInputStream("data/dictionary/synonym/CoreSynonym.txt")); +// String[] array = new String[] +// { +// "香蕉", +// "苹果", +// "白菜", +// "水果", +// "蔬菜", +// "自行车", +// "公交车", +// "飞机", +// "买", +// "卖", +// "购入", +// "新年", +// "春节", +// "丢失", +// "补办", +// "办理", +// "太阳", +// "送给", +// "寻找", +// "放飞", +// "孩", +// "孩子", +// "教室", +// "教师", +// "会计", +// }; +// runCase(array, dictionaryEx); +// } +// +// public void runCase(String[] stringArray, CommonSynonymDictionaryEx dictionaryEx) +// { +// for (String a : stringArray) +// { +// for (String b : stringArray) +// { +// System.out.println(a + "\t" + b + "\t之间的距离是\t" + dictionaryEx.distance(a, b)); +// } +// } +// } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/corpus/tag/NatureTest.java b/src/test/java/com/hankcs/hanlp/corpus/tag/NatureTest.java index 520399457..ed9082f98 100644 --- a/src/test/java/com/hankcs/hanlp/corpus/tag/NatureTest.java +++ b/src/test/java/com/hankcs/hanlp/corpus/tag/NatureTest.java @@ -1,14 +1,13 @@ package com.hankcs.hanlp.corpus.tag; -import com.hankcs.hanlp.corpus.util.CustomNatureUtility; import junit.framework.TestCase; public class NatureTest extends TestCase { public void testFromString() throws Exception { - Nature one = CustomNatureUtility.addNature("新词性1"); - Nature two = CustomNatureUtility.addNature("新词性2"); + Nature one = Nature.create("新词性1"); + Nature two = Nature.create("新词性2"); assertEquals(one, Nature.fromString("新词性1")); assertEquals(two, 
Nature.fromString("新词性2")); diff --git a/src/test/java/com/hankcs/hanlp/dependency/MaxEntDependencyParserTest.java b/src/test/java/com/hankcs/hanlp/dependency/MaxEntDependencyParserTest.java new file mode 100644 index 000000000..2397be391 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dependency/MaxEntDependencyParserTest.java @@ -0,0 +1,43 @@ +package com.hankcs.hanlp.dependency; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLLoader; +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence; +import com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord; +import com.hankcs.hanlp.corpus.dependency.CoNll.Evaluator; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.seg.common.Term; +import junit.framework.TestCase; + +import java.util.LinkedList; +import java.util.List; + +public class MaxEntDependencyParserTest extends TestCase +{ + public void testMaxEntParser() throws Exception + { +// HanLP.Config.enableDebug(); +// System.out.println(MaxEntDependencyParser.compute("我每天骑车上学")); + } + +// public void testEvaluate() throws Exception +// { +// LinkedList sentenceList = CoNLLLoader.loadSentenceList("D:\\Doc\\语料库\\依存分析训练数据\\THU\\dev.conll"); +// Evaluator evaluator = new Evaluator(); +// int id = 1; +// for (CoNLLSentence sentence : sentenceList) +// { +// System.out.printf("%d / %d...", id++, sentenceList.size()); +// long start = System.currentTimeMillis(); +// List termList = new LinkedList(); +// for (CoNLLWord word : sentence.word) +// { +// termList.add(new Term(word.LEMMA, Nature.valueOf(word.POSTAG))); +// } +// CoNLLSentence out = CRFDependencyParser.compute(termList); +// evaluator.e(sentence, out); +// System.out.println("done in " + (System.currentTimeMillis() - start) + " ms."); +// } +// System.out.println(evaluator); +// } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/dependency/perceptron/parser/KBeamArcEagerDependencyParserTest.java 
b/src/test/java/com/hankcs/hanlp/dependency/perceptron/parser/KBeamArcEagerDependencyParserTest.java new file mode 100644 index 000000000..6d605b981 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dependency/perceptron/parser/KBeamArcEagerDependencyParserTest.java @@ -0,0 +1,13 @@ +package com.hankcs.hanlp.dependency.perceptron.parser; + +import junit.framework.TestCase; + +import java.io.IOException; + +public class KBeamArcEagerDependencyParserTest extends TestCase +{ + public void testLoad() throws IOException, ClassNotFoundException + { + KBeamArcEagerDependencyParser parser = new KBeamArcEagerDependencyParser(); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/dictionary/CustomDictionaryTest.java b/src/test/java/com/hankcs/hanlp/dictionary/CustomDictionaryTest.java index d2677b188..884e8bbaa 100644 --- a/src/test/java/com/hankcs/hanlp/dictionary/CustomDictionaryTest.java +++ b/src/test/java/com/hankcs/hanlp/dictionary/CustomDictionaryTest.java @@ -1,12 +1,178 @@ package com.hankcs.hanlp.dictionary; +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; +import com.hankcs.hanlp.corpus.dictionary.item.Item; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.utility.Predefine; import junit.framework.TestCase; +import java.io.*; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.TreeSet; + public class CustomDictionaryTest extends TestCase { - public void testReload() throws Exception +// public void testReload() throws Exception +// { +// assertEquals(true, CustomDictionary.reload()); +// assertEquals(true, CustomDictionary.contains("中华白海豚")); +// } + + public void testGet() throws Exception + { + assertEquals("nz 1 ", CustomDictionary.get("一个心眼儿").toString()); + } + + /** + * 删除一个字的词语 + * @throws Exception + */ +// public void testRemoveShortWord() throws 
Exception +// { +// BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("data/dictionary/CustomDictionary.txt"))); +// String line; +// Set fixedDictionary = new TreeSet(); +// while ((line = br.readLine()) != null) +// { +// String[] param = line.split("\\s"); +// if (param[0].length() == 1 || CoreDictionary.contains(param[0])) continue; +// fixedDictionary.add(line); +// } +// br.close(); +// BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("data/dictionary/CustomDictionary.txt"))); +// for (String word : fixedDictionary) +// { +// bw.write(word); +// bw.newLine(); +// } +// bw.close(); +// } + + /** + * 这里面很多nr不合理,干脆都删掉 + * @throws Exception + */ +// public void testRemoveNR() throws Exception +// { +// BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("data/dictionary/CustomDictionary.txt"))); +// String line; +// Set fixedDictionary = new TreeSet(); +// while ((line = br.readLine()) != null) +// { +// String[] param = line.split("\\s"); +// if (param[1].equals("nr")) continue; +// fixedDictionary.add(line); +// } +// br.close(); +// BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("data/dictionary/CustomDictionary.txt"))); +// for (String word : fixedDictionary) +// { +// bw.write(word); +// bw.newLine(); +// } +// bw.close(); +// } + +// public void testNext() throws Exception +// { +// BaseSearcher searcher = CustomDictionary.getSearcher("都要亲口"); +// Map.Entry entry; +// while ((entry = searcher.next()) != null) +// { +// int offset = searcher.getOffset(); +// System.out.println(offset + 1 + " " + entry); +// } +// } + +// public void testRemoveJunkWord() throws Exception +// { +// DictionaryMaker dictionaryMaker = DictionaryMaker.load("data/dictionary/custom/CustomDictionary.txt"); +// dictionaryMaker.saveTxtTo("data/dictionary/custom/CustomDictionary.txt", new DictionaryMaker.Filter() +// { +// @Override +// public boolean 
onSave(Item item) +// { +// if (item.containsLabel("mq") || item.containsLabel("m") || item.containsLabel("t")) +// { +// return false; +// } +// return true; +// } +// }); +// } + + /** + * data/dictionary/custom/全国地名大全.txt中有很多人名,删掉它们 + * @throws Exception + */ +// public void testRemoveNotNS() throws Exception +// { +// String path = "data/dictionary/custom/全国地名大全.txt"; +// final Set suffixSet = new TreeSet(); +// for (char c : Predefine.POSTFIX_SINGLE.toCharArray()) +// { +// suffixSet.add(c); +// } +// DictionaryMaker.load(path).saveTxtTo(path, new DictionaryMaker.Filter() +// { +// Segment segment = HanLP.newSegment().enableCustomDictionary(false); +// @Override +// public boolean onSave(Item item) +// { +// if (suffixSet.contains(item.key.charAt(item.key.length() - 1))) return true; +// List termList = segment.seg(item.key); +// if (termList.size() == 1 && termList.get(0).nature == Nature.nr) +// { +// System.out.println(item); +// return false; +// } +// return true; +// } +// }); +// } + + public void testCustomNature() throws Exception + { + Nature pcNature1 = Nature.create("电脑品牌"); + Nature pcNature2 = Nature.create("电脑品牌"); + assertEquals(pcNature1, pcNature2); + } + +// public void testIssue234() throws Exception +// { +// String customTerm = "攻城狮"; +// String text = "攻城狮逆袭单身狗,迎娶白富美,走上人生巅峰"; +// System.out.println("原始分词结果"); +// System.out.println("CustomDictionary.get(customTerm)=" + CustomDictionary.get(customTerm)); +// System.out.println(HanLP.segment(text)); +// // 动态增加 +// CustomDictionary.add(customTerm); +// System.out.println("添加自定义词组分词结果"); +// System.out.println("CustomDictionary.get(customTerm)=" + CustomDictionary.get(customTerm)); +// System.out.println(HanLP.segment(text)); +// // 删除词语 +// CustomDictionary.remove(customTerm); +// System.out.println("删除自定义词组分词结果"); +// System.out.println("CustomDictionary.get(customTerm)=" + CustomDictionary.get(customTerm)); +// System.out.println(HanLP.segment(text)); +// } + + public void testIssue540() 
throws Exception + { + CustomDictionary.add("123"); + CustomDictionary.add("摩根"); + CustomDictionary.remove("123"); + CustomDictionary.remove("摩根"); + } + + public void testReload() { - assertEquals(true, CustomDictionary.reload()); - assertEquals(true, CustomDictionary.contains("中华白海豚")); + CustomDictionary.reload(); + System.out.println(HanLP.segment("自然语言处理")); } } \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/dictionary/SimplifyNGramDictionary.java b/src/test/java/com/hankcs/hanlp/dictionary/SimplifyNGramDictionary.java new file mode 100644 index 000000000..6a655f11b --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dictionary/SimplifyNGramDictionary.java @@ -0,0 +1,150 @@ +/* + * + * He Han + * hankcs.cn@gmail.com + * 2014/10/7 21:06 + * + * + * Copyright (c) 2003-2014, 上海林原信息科技有限公司. All Right Reserved, http://www.linrunsoft.com/ + * This source is subject to the LinrunSpace License. Please contact 上海林原信息科技有限公司 to get more information. + * + */ +package com.hankcs.hanlp.dictionary; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.dictionary.TFDictionary; +import com.hankcs.hanlp.corpus.occurrence.TermFrequency; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import junit.framework.TestCase; + +import java.io.*; +import java.util.*; + +/** + * 有一些类似于 工程@学 1 的条目会干扰 工程学家 的识别,这类@后接短字符的可以过滤掉 + * @author hankcs + */ +public class SimplifyNGramDictionary extends TestCase +{ +// String path = "data/dictionary/CoreNatureDictionary.ngram.txt"; +// public void testSimplify() throws Exception +// { +// BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path))); +// TreeMap map = new TreeMap(); +// String line; +// while ((line = br.readLine()) != null) +// { +// String[] param = line.split("\\s"); +// map.put(param[0], Integer.valueOf(param[1])); +// } +// br.close(); +// Set> entrySet = map.descendingMap().entrySet(); +// Iterator> iterator = entrySet.iterator(); +// // 第一步去包含 +//// Map.Entry 
pre = new AbstractMap.SimpleEntry<>(" @ ", 1); +//// while (iterator.hasNext()) +//// { +//// Map.Entry current = iterator.next(); +//// if (current.getKey().length() - current.getKey().indexOf('@') == 2 && pre.getKey().indexOf(current.getKey()) == 0 && current.getValue() <= 2) +//// { +//// System.out.println("应当删除 " + current + " 保留 " + pre); +//// iterator.remove(); +//// } +//// pre = current; +//// } +// // 第二步,尝试移除“学@家”这样的短共现 +//// iterator = entrySet.iterator(); +//// while (iterator.hasNext()) +//// { +//// Map.Entry current = iterator.next(); +//// if (current.getKey().length() == 3) +//// { +//// System.out.println("应当删除 " + current); +//// } +//// } +// // 第三步,对某些@后面的词语太短了,也移除 +//// iterator = entrySet.iterator(); +//// while (iterator.hasNext()) +//// { +//// Map.Entry current = iterator.next(); +//// String[] termArray = current.getKey().split("@", 2); +//// if (termArray[0].equals("未##人") && termArray[1].length() < 2) +//// { +//// System.out.println("删除 " + current.getKey()); +//// iterator.remove(); +//// } +//// } +// // 第四步,人名接续对识别产生太多误命中影响,也删除 +//// iterator = entrySet.iterator(); +//// while (iterator.hasNext()) +//// { +//// Map.Entry current = iterator.next(); +//// if (current.getKey().contains("未##人") && current.getValue() < 10) +//// { +//// System.out.println("删除 " + current.getKey()); +//// iterator.remove(); +//// } +//// } +// // 对人名的终极调优 +// TFDictionary dictionary = new TFDictionary(); +// dictionary.load("D:\\JavaProjects\\HanLP\\data\\dictionary\\CoreNatureDictionary.ngram.mini.txt"); +// iterator = entrySet.iterator(); +// while (iterator.hasNext()) +// { +// Map.Entry current = iterator.next(); +// if (current.getKey().contains("未##人") && dictionary.getFrequency(current.getKey()) < 10) +// { +// System.out.println("删除 " + current.getKey()); +// iterator.remove(); +// } +// } +// // 输出 +// BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path))); +// for (Map.Entry entry : map.entrySet()) +// { +// 
bw.write(entry.getKey()); +// bw.write(' '); +// bw.write(String.valueOf(entry.getValue())); +// bw.newLine(); +// } +// bw.close(); +// } +// +// /** +// * 有些词条不在CoreDictionary里面,那就把它们删掉 +// * @throws Exception +// */ +// public void testLoseWeight() throws Exception +// { +// BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8")); +// TreeMap map = new TreeMap(); +// String line; +// while ((line = br.readLine()) != null) +// { +// String[] param = line.split(" "); +// map.put(param[0], Integer.valueOf(param[1])); +// } +// br.close(); +// Iterator iterator = map.keySet().iterator(); +// while (iterator.hasNext()) +// { +// line = iterator.next(); +// String[] params = line.split("@", 2); +// String one = params[0]; +// String two = params[1]; +// if (!CoreDictionary.contains(one) || !CoreDictionary.contains(two)) +// iterator.remove(); +// } +// +// // 输出 +// BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path), "UTF-8")); +// for (Map.Entry entry : map.entrySet()) +// { +// bw.write(entry.getKey()); +// bw.write(' '); +// bw.write(String.valueOf(entry.getValue())); +// bw.newLine(); +// } +// bw.close(); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/dictionary/other/CharTableTest.java b/src/test/java/com/hankcs/hanlp/dictionary/other/CharTableTest.java new file mode 100644 index 000000000..59823448e --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dictionary/other/CharTableTest.java @@ -0,0 +1,187 @@ +package com.hankcs.hanlp.dictionary.other; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.io.IOUtil; +import junit.framework.TestCase; + +import java.io.BufferedWriter; +import java.io.File; +import java.io.FileOutputStream; +import java.io.ObjectOutputStream; +import java.util.HashMap; +import java.util.Map; + +public class CharTableTest extends TestCase +{ + public void testNormalization() throws Exception + { + System.out.println(CharTable.convert('?')); + 
assertEquals('(', CharTable.convert('(')); + } + + public void testNormalizeSpace() throws Exception + { + assertEquals(CharTable.convert('\t'), ' '); + assertEquals(CharTable.convert('\n'), ' '); + assertEquals(CharTable.convert('\f'), ' '); + } + + public void testIssue1615() + { + new File("data/dictionary/other/CharTable.txt.bin").delete(); + Map normalizationBadCase = new HashMap(); + normalizationBadCase.put("猛", "猛"); + normalizationBadCase.put("蜺", "蜺"); + normalizationBadCase.put("脊", "脊"); + normalizationBadCase.put("骼", "骼"); + normalizationBadCase.put("拾", "拾"); + normalizationBadCase.put("劈", "劈"); + normalizationBadCase.put("溜", "溜"); + normalizationBadCase.put("呱", "呱"); + normalizationBadCase.put("怵", "怵"); + normalizationBadCase.put("糸", "丝"); + normalizationBadCase.put("乾", "乾"); + normalizationBadCase.put("艸", "草"); + for (Map.Entry entry : normalizationBadCase.entrySet()) + { + String input = entry.getKey(); + String result = CharTable.convert(input); + String expected = entry.getValue(); + System.out.println(input + "=" + expected); + assert result.equals(expected); + } + } +// public void testConvert() throws Exception +// { +// System.out.println(CharTable.CONVERT['關']); +// System.out.println(CharTable.CONVERT['A']); +// System.out.println(CharTable.CONVERT['“']); +// System.out.println(CharTable.CONVERT['.']); +// } +// +// public void testEnd() throws Exception +// { +// System.out.println(CharTable.CONVERT[',']); +// System.out.println(CharTable.CONVERT['。']); +// System.out.println(CharTable.CONVERT['!']); +// System.out.println(CharTable.CONVERT['…']); +// } +// +// public void testFix() throws Exception +// { +// char[] CONVERT = CharTable.CONVERT; +// CONVERT['.'] = '.'; +// CONVERT['.'] = '.'; +// CONVERT['。'] = '.'; +// CONVERT['!'] = '!'; +// CONVERT[','] = ','; +// CONVERT['!'] = '!'; +// CONVERT['#'] = '#'; +// CONVERT['&'] = '&'; +// CONVERT['*'] = '*'; +// CONVERT[','] = ','; +// CONVERT['/'] = '/'; +// CONVERT[';'] = ';'; +// 
CONVERT['?'] = '?'; +// CONVERT['\\'] = '\\'; +// CONVERT['^'] = '^'; +// CONVERT['_'] = '_'; +// CONVERT['`'] = '`'; +// CONVERT['|'] = '|'; +// CONVERT['~'] = '~'; +// CONVERT['¡'] = '¡'; +// CONVERT['¦'] = '¦'; +// CONVERT['´'] = '´'; +// CONVERT['¸'] = '¸'; +// CONVERT['¿'] = '¿'; +// CONVERT['ˇ'] = 'ˇ'; +// CONVERT['ˉ'] = 'ˉ'; +// CONVERT['ˊ'] = 'ˊ'; +// CONVERT['ˋ'] = 'ˋ'; +// CONVERT['˜'] = '˜'; +// CONVERT['—'] = '—'; +// CONVERT['―'] = '―'; +// CONVERT['‖'] = '‖'; +// CONVERT['…'] = '…'; +// CONVERT['∕'] = '∕'; +// CONVERT['︳'] = '︳'; +// CONVERT['︴'] = '︴'; +// CONVERT['﹉'] = '﹉'; +// CONVERT['﹊'] = '﹊'; +// CONVERT['﹋'] = '﹋'; +// CONVERT['﹌'] = '﹌'; +// CONVERT['﹍'] = '﹍'; +// CONVERT['﹎'] = '﹎'; +// CONVERT['﹏'] = '﹏'; +// CONVERT['﹐'] = '﹐'; +// CONVERT['﹑'] = '﹑'; +// CONVERT['﹔'] = '﹔'; +// CONVERT['﹖'] = '﹖'; +// CONVERT['﹟'] = '﹟'; +// CONVERT['﹠'] = '﹠'; +// CONVERT['﹡'] = '﹡'; +// CONVERT['﹨'] = '﹨'; +// CONVERT['''] = '''; +// CONVERT[';'] = ';'; +// CONVERT['?'] = '?'; +// CONVERT['幣'] = '币'; +// CONVERT['繫'] = '系'; +// CONVERT['眾'] = '众'; +// CONVERT['龕'] = '龛'; +// CONVERT['製'] = '制'; +// for (int i = 0; i < CONVERT.length; i++) +// { +// if (CONVERT[i] == '\u0000') +// { +// if (i != '\u0000') CONVERT[i] = (char) i; +// else CONVERT[i] = ' '; +// } +// } +// ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(HanLP.Config.CharTablePath)); +// out.writeObject(CONVERT); +// out.close(); +// } +// +// public void testImportSingleCharFromTraditionalChineseDictionary() throws Exception +// { +//// char[] CONVERT = CharTable.CONVERT; +//// StringDictionary dictionary = new StringDictionary("="); +//// dictionary.load(HanLP.Config.t2sDictionaryPath); +//// for (Map.Entry entry : dictionary.entrySet()) +//// { +//// String key = entry.getKey(); +//// if (key.length() != 1) continue; +//// String value = entry.getValue(); +//// char t = key.charAt(0); +//// char s = value.charAt(0); +////// if (CONVERT[t] != s) +////// { +////// 
System.out.printf("%s\t%c=%c\n", entry, t, CONVERT[t]); +////// } +//// CONVERT[t] = s; +//// } +//// +//// ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(HanLP.Config.CharTablePath)); +//// out.writeObject(CONVERT); +//// out.close(); +// } +// +// public void testDumpCharTable() throws Exception +// { +// BufferedWriter bw = IOUtil.newBufferedWriter(HanLP.Config.CharTablePath.replace(".bin.yes", ".txt")); +// char[] CONVERT = CharTable.CONVERT; +// for (int i = 0; i < CONVERT.length; i++) +// { +// if (i != CONVERT[i]) +// { +// bw.write(String.format("%c=%c\n", i, CONVERT[i])); +// } +// } +// bw.close(); +// } +// +// public void testLoadCharTableFromTxt() throws Exception +// { +//// CharTable.load(HanLP.Config.CharTablePath.replace(".bin.yes", ".txt")); +// } +} diff --git a/src/test/java/com/hankcs/hanlp/dictionary/other/CharTypeTest.java b/src/test/java/com/hankcs/hanlp/dictionary/other/CharTypeTest.java new file mode 100644 index 000000000..05302572e --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dictionary/other/CharTypeTest.java @@ -0,0 +1,44 @@ +package com.hankcs.hanlp.dictionary.other; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.utility.TextUtility; +import junit.framework.TestCase; + +import java.util.List; + +public class CharTypeTest extends TestCase +{ + public void testNumber() throws Exception + { +// for (int i = 0; i <= Character.MAX_VALUE; ++i) +// { +// if (CharType.get((char) i) == CharType.CT_NUM) +// System.out.println((char) i); +// } + assertEquals(CharType.CT_NUM, CharType.get('1')); + + } + + public void testWhiteSpace() throws Exception + { +// CharType.type[' '] = CharType.CT_OTHER; + String text = "1 + 2 = 3; a+b= a + b"; + assertEquals("[1/m, /w, +/w, /w, 2/m, /w, =/w, /w, 3/m, ;/w, /w, a/nx, +/w, b/nx, =/w, /w, a/nx, /w, +/w, /w, b/nx]", HanLP.segment(text).toString()); + } + + public void testTab() throws Exception + { + 
assertTrue(TextUtility.charType('\t') == CharType.CT_DELIMITER); + assertTrue(TextUtility.charType('\r') == CharType.CT_DELIMITER); + assertTrue(TextUtility.charType('\0') == CharType.CT_DELIMITER); + +// System.out.println(HanLP.segment("\t")); + } + + public void testNonPrintable() + { + List termList = HanLP.segment(")\r\n "); + assertEquals(2, termList.size()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/dictionary/other/PartOfSpeechTagDictionaryTest.java b/src/test/java/com/hankcs/hanlp/dictionary/other/PartOfSpeechTagDictionaryTest.java new file mode 100644 index 000000000..1e4079c0f --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dictionary/other/PartOfSpeechTagDictionaryTest.java @@ -0,0 +1,11 @@ +package com.hankcs.hanlp.dictionary.other; + +import junit.framework.TestCase; + +public class PartOfSpeechTagDictionaryTest extends TestCase +{ + public void testTranslate() throws Exception + { + assertEquals("名词", PartOfSpeechTagDictionary.translate("n")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/dictionary/py/PinyinDictionaryTest.java b/src/test/java/com/hankcs/hanlp/dictionary/py/PinyinDictionaryTest.java new file mode 100644 index 000000000..b9d7870d7 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dictionary/py/PinyinDictionaryTest.java @@ -0,0 +1,17 @@ +package com.hankcs.hanlp.dictionary.py; + +import com.hankcs.hanlp.HanLP; +import junit.framework.TestCase; + +import java.util.Arrays; + +public class PinyinDictionaryTest extends TestCase +{ + + public void testGet() + { + System.out.println(Arrays.toString(PinyinDictionary.get("鼖"))); + System.out.println(PinyinDictionary.convertToPinyin("\uD867\uDF7E\uD867\uDF8C")); + System.out.println(HanLP.convertToPinyinList("\uD867\uDF7E\uD867\uDF8C")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/dictionary/stopword/CoreStopWordDictionaryTest.java 
b/src/test/java/com/hankcs/hanlp/dictionary/stopword/CoreStopWordDictionaryTest.java new file mode 100644 index 000000000..25f4430b4 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dictionary/stopword/CoreStopWordDictionaryTest.java @@ -0,0 +1,65 @@ +package com.hankcs.hanlp.dictionary.stopword; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.collection.MDAG.MDAGSet; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.tokenizer.NotionalTokenizer; +import junit.framework.TestCase; + +import java.io.BufferedWriter; +import java.io.File; +import java.util.LinkedList; +import java.util.List; + +public class CoreStopWordDictionaryTest extends TestCase +{ + public void testContains() throws Exception + { + assertTrue(CoreStopWordDictionary.contains("这就是说")); + } + + public void testContainsSomeWords() throws Exception + { + assertEquals(true, CoreStopWordDictionary.contains("可以")); + } + + public void testMDAG() throws Exception + { + List wordList = new LinkedList(); + wordList.add("zoo"); + wordList.add("hello"); + wordList.add("world"); + MDAGSet set = new MDAGSet(wordList); + set.add("bee"); + assertEquals(true, set.contains("bee")); + set.remove("bee"); + assertEquals(false, set.contains("bee")); + } + +// public void testRemoveDuplicateEntries() throws Exception +// { +// StopWordDictionary dictionary = new StopWordDictionary(new File(HanLP.Config.CoreStopWordDictionaryPath)); +// BufferedWriter bw = IOUtil.newBufferedWriter(HanLP.Config.CoreStopWordDictionaryPath); +// for (String word : dictionary) +// { +// bw.write(word); +// bw.newLine(); +// } +// bw.close(); +// } + + + public void testAdd() + { + CoreStopWordDictionary.add("加入"); + System.out.println(NotionalTokenizer.segment("加入单词")); + } + + public void testReload() + { + CoreStopWordDictionary.reload(); + assertTrue(CoreStopWordDictionary.contains("这里")); + CoreStopWordDictionary.dictionary.clear(); + assertFalse(CoreStopWordDictionary.contains("这里")); + } +} \ No newline at 
end of file diff --git a/src/test/java/com/hankcs/hanlp/dictionary/ts/TraditionalChineseDictionaryTest.java b/src/test/java/com/hankcs/hanlp/dictionary/ts/TraditionalChineseDictionaryTest.java new file mode 100644 index 000000000..77853c4d0 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/dictionary/ts/TraditionalChineseDictionaryTest.java @@ -0,0 +1,54 @@ +package com.hankcs.hanlp.dictionary.ts; + +import com.hankcs.hanlp.HanLP; +import junit.framework.TestCase; + +public class TraditionalChineseDictionaryTest extends TestCase +{ + public void testF2J() throws Exception + { + assertEquals("草莓是红色的", TraditionalChineseDictionary.convertToSimplifiedChinese("士多啤梨是紅色的")); + } + + public void testJ2F() throws Exception + { + assertEquals("草莓是紅色的", SimplifiedChineseDictionary.convertToTraditionalChinese("草莓是红色的")); + } + + public void testInterface() throws Exception + { + assertEquals("“以后等你当上皇后,就能买草莓庆祝了”", HanLP.convertToSimplifiedChinese("「以後等妳當上皇后,就能買士多啤梨慶祝了」")); + assertEquals("「以後等你當上皇后,就能買草莓慶祝了」", HanLP.convertToTraditionalChinese("“以后等你当上皇后,就能买草莓庆祝了”")); + } + + public void testIssue1182() throws Exception + { + String content = "直面现实,直面人生,手工捏面人"; + System.out.println(HanLP.s2hk(content)); + System.out.println(HanLP.s2tw(content)); + } + + public void testIssue1184() + { + String table = ",斟酒的人翻过大金斗【猛】击代君,一下就砸死 |\t猛 | 勐\n" + + "校及科研单位挂钩,并【建】立了长期的协作关系 |\t建 |\t创\n" + + "寇夫人 他自拣一搭金【堦】死。”亦省作“ \u2064|\t堦|\t階\n" + + "综合兼容性   二、【大】众娱乐性   三、|\t大|\t福\n" + + "进行有效的传播控制和【整】合管理。2007年|\t整|\t集\n" + + "行有效的传播控制和整【合】管理。2007年,|\t合|\t成\n" + + "有物饮碧水,高林挂青【蜺】。”\",\"ts\":|\t蜺|\t霓\n" + + "西安市莲湖城内,共计【房】屋231户。\",\"|\t房|\t住\n" + + ";行程万里的“世界屋【脊】汽车挑战赛”等成功|\t脊|\t嵴\n" + + "成“全国性”、“全程【式】”的技术创新公共服|\t式|\t序"; + for (String line : table.split("\n")) + { + String[] cells = line.split("\\|"); + String text = cells[0].trim().replaceAll("[【】]", ""); + String right = cells[1].trim(); + String wrong = cells[2].trim(); + String hanlpOutput = HanLP.convertToTraditionalChinese(text); + 
assertTrue(hanlpOutput.contains(right)); + assertFalse(hanlpOutput.contains(wrong)); + } + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/mining/cluster/ClusterAnalyzerTest.java b/src/test/java/com/hankcs/hanlp/mining/cluster/ClusterAnalyzerTest.java new file mode 100644 index 000000000..940571154 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/mining/cluster/ClusterAnalyzerTest.java @@ -0,0 +1,33 @@ +package com.hankcs.hanlp.mining.cluster; + +import com.hankcs.demo.DemoTextClustering; +import junit.framework.TestCase; + +public class ClusterAnalyzerTest extends TestCase +{ + public void testAddDocument() throws Exception + { + DemoTextClustering.main(null); + } + + public void testRepeatedBisection() + { + ClusterAnalyzer analyzer = new ClusterAnalyzer(); + analyzer.addDocument("赵一", "流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 摇滚, 摇滚, 摇滚, 摇滚"); + analyzer.addDocument("钱二", "爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲"); + analyzer.addDocument("张三", "古典, 古典, 古典, 古典, 民谣, 民谣, 民谣, 民谣"); + analyzer.addDocument("李四", "爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 金属, 金属, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲"); + analyzer.addDocument("王五", "流行, 流行, 流行, 流行, 摇滚, 摇滚, 摇滚, 嘻哈, 嘻哈, 嘻哈"); + analyzer.addDocument("马六", "古典, 古典, 古典, 古典, 古典, 古典, 古典, 古典, 摇滚"); + System.out.println(analyzer.repeatedBisection(0.12)); // 自动判断聚类数量k + } + + public void testKmeans() + { + ClusterAnalyzer analyzer = new ClusterAnalyzer(); + analyzer.addDocument("赵一", "流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 摇滚, 摇滚, 摇滚, 摇滚"); + analyzer.addDocument("钱二", "爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲"); + System.out.println(analyzer.kmeans(3)); + System.out.println(analyzer.repeatedBisection(3)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/mining/word/TermFrequencyCounterTest.java b/src/test/java/com/hankcs/hanlp/mining/word/TermFrequencyCounterTest.java new file mode 100644 
index 000000000..04b1997e9 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/mining/word/TermFrequencyCounterTest.java @@ -0,0 +1,14 @@ +package com.hankcs.hanlp.mining.word; + +import junit.framework.TestCase; + +public class TermFrequencyCounterTest extends TestCase +{ + public void testGetKeywords() throws Exception + { + TermFrequencyCounter counter = new TermFrequencyCounter(); + counter.add("加油加油中国队!"); + System.out.println(counter); + System.out.println(counter.getKeywords("女排夺冠,观众欢呼女排女排女排!")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/mining/word/TfIdfCounterTest.java b/src/test/java/com/hankcs/hanlp/mining/word/TfIdfCounterTest.java new file mode 100644 index 000000000..df1aea6a0 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/mining/word/TfIdfCounterTest.java @@ -0,0 +1,22 @@ +package com.hankcs.hanlp.mining.word; + +import junit.framework.TestCase; + +public class TfIdfCounterTest extends TestCase +{ + public void testGetKeywords() throws Exception + { + TfIdfCounter counter = new TfIdfCounter(); + counter.add("《女排夺冠》", "女排北京奥运会夺冠"); + counter.add("《羽毛球男单》", "北京奥运会的羽毛球男单决赛"); + counter.add("《女排》", "中国队女排夺北京奥运会金牌重返巅峰,观众欢呼女排女排女排!"); + counter.compute(); + + for (Object id : counter.documents()) + { + System.out.println(id + " : " + counter.getKeywordsOf(id, 3)); + } + + System.out.println(counter.getKeywords("奥运会反兴奋剂", 2)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/mining/word2vec/VectorsReaderTest.java b/src/test/java/com/hankcs/hanlp/mining/word2vec/VectorsReaderTest.java new file mode 100644 index 000000000..a1aa93354 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/mining/word2vec/VectorsReaderTest.java @@ -0,0 +1,30 @@ +package com.hankcs.hanlp.mining.word2vec; + +import com.hankcs.hanlp.corpus.io.IOUtil; +import junit.framework.TestCase; + +import java.io.BufferedWriter; +import java.io.File; + +public class VectorsReaderTest extends TestCase +{ + public void 
testReadVectorFile() throws Exception + { + File tempFile = File.createTempFile("hanlp-vector", ".txt"); + tempFile.deleteOnExit(); + BufferedWriter bw = IOUtil.newBufferedWriter(tempFile.getAbsolutePath()); + bw.write("3 1\n" + + "cat 1.1\n" + + " 2.2\n" + + "dog 3.3\n" + ); + bw.close(); + + VectorsReader reader = new VectorsReader(tempFile.getAbsolutePath()); + reader.readVectorFile(); + assertEquals(2, reader.words); + assertEquals(2, reader.vocab.length); + assertEquals(2, reader.matrix.length); + assertEquals(1f, reader.matrix[1][0]); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/bigram/BigramDependencyModelTest.java b/src/test/java/com/hankcs/hanlp/model/bigram/BigramDependencyModelTest.java new file mode 100644 index 000000000..b3a3a0040 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/bigram/BigramDependencyModelTest.java @@ -0,0 +1,11 @@ +package com.hankcs.hanlp.model.bigram; + +import junit.framework.TestCase; + +public class BigramDependencyModelTest extends TestCase +{ + public void testLoad() throws Exception + { + assertEquals("限定", BigramDependencyModel.get("传", "v", "角落", "n")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/crf/CRFLexicalAnalyzerTest.java b/src/test/java/com/hankcs/hanlp/model/crf/CRFLexicalAnalyzerTest.java new file mode 100644 index 000000000..99b69a29e --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/crf/CRFLexicalAnalyzerTest.java @@ -0,0 +1,30 @@ +package com.hankcs.hanlp.model.crf; + +import junit.framework.TestCase; + +import java.io.IOException; + +public class CRFLexicalAnalyzerTest extends TestCase +{ + public void testLoad() throws Exception + { + CRFLexicalAnalyzer analyzer = new CRFLexicalAnalyzer(); + String[] tests = new String[]{ + "商品和服务", + "总统普京与特朗普通电话讨论太空探索技术公司", + "微软公司於1975年由比爾·蓋茲和保羅·艾倫創立,18年啟動以智慧雲端、前端為導向的大改組。" + }; +// for (String sentence : tests) +// { +// System.out.println(analyzer.analyze(sentence)); +// 
System.out.println(analyzer.seg(sentence)); +// } + } + + public void testIssue1221() throws IOException + { + CRFLexicalAnalyzer analyzer = new CRFLexicalAnalyzer(); + analyzer.enableCustomDictionaryForcing(true); + System.out.println(analyzer.seg("商品和服务")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/crf/CRFModelTest.java b/src/test/java/com/hankcs/hanlp/model/crf/CRFModelTest.java new file mode 100644 index 000000000..ec2d7343c --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/crf/CRFModelTest.java @@ -0,0 +1,186 @@ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.collection.trie.bintrie.BinTrie; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.seg.CRF.CRFSegment; +import com.hankcs.hanlp.utility.Predefine; +import junit.framework.TestCase; + +import java.io.*; +import java.util.List; + +public class CRFModelTest extends TestCase +{ +// public void testTemplate() throws Exception +// { +// FeatureTemplate featureTemplate = FeatureTemplate.create("U05:%x[-2,0]/%x[-1,0]/%x[0,0]"); +// Table table = new Table(); +// table.v = new String[][]{ +// {"那", "S"}, +// {"音", "B"}, +// {"韵", "E"},}; +// char[] parameter = featureTemplate.generateParameter(table, 0); +// System.out.println(parameter); +// } + +// public void testTestLoadTemplate() throws Exception +// { +// DataOutputStream out = new DataOutputStream(new FileOutputStream("data/test/out.bin")); +// FeatureTemplate featureTemplate = FeatureTemplate.create("U05:%x[-2,0]/%x[-1,0]/%x[0,0]"); +// featureTemplate.save(out); +// featureTemplate = new FeatureTemplate(); +// 
featureTemplate.load(ByteArray.createByteArray("data/test/out.bin")); +// System.out.println(featureTemplate); +// } + +// public void testLoadFromTxt() throws Exception +// { +// CRFModel model = CRFModel.loadTxt("D:\\Tools\\CRF++-0.58\\example\\seg_cn\\model.txt"); +// Table table = new Table(); +// table.v = new String[][]{ +// {"商", "?"}, +// {"品", "?"}, +// {"和", "?"}, +// {"服", "?"}, +// {"务", "?"}, +// }; +// model.tag(table); +// System.out.println(table); +// } + +// public void testSegment() throws Exception +// { +// HanLP.Config.enableDebug(); +// CRFSegment segment = new CRFSegment(); +//// segment.enablePartOfSpeechTagging(true); +// System.out.println(segment.seg("乐视超级手机能否承载贾布斯的生态梦")); +// } + + /** + * The existing CRF does not perform well enough; rebuild a corpus for retraining + * + * @throws Exception + */ +// public void testPrepareCRFTrainingCorpus() throws Exception +// { +// final BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("e:\\2014.txt"), "UTF-8")); +// CorpusLoader.walk("D:\\Doc\\语料库\\2014_hankcs", new CorpusLoader.Handler() +// { +// @Override +// public void handle(Document document) +// { +// try +// { +// List<List<Word>> sentenceList = document.getSimpleSentenceList(); +// if (sentenceList.size() == 0) return; +// for (List<Word> sentence : sentenceList) +// { +// if (sentence.size() == 0) continue; +// for (IWord iWord : sentence) +// { +// String word = iWord.getValue(); +// String tag = iWord.getLabel(); +// String compiledString = compile(tag); +// if (compiledString != null) +// { +// word = compiledString; +// } +// if (word.length() == 1 || compiledString != null) +// { +// bw.write(word); +// bw.write('\t'); +// bw.write('S'); +// bw.write('\n'); +// } +// else +// { +// bw.write(word.charAt(0)); +// bw.write('\t'); +// bw.write('B'); +// bw.write('\n'); +// for (int i = 1; i < word.length() - 1; ++i) +// { +// bw.write(word.charAt(i)); +// bw.write('\t'); +// bw.write('M'); +// bw.write('\n'); +// } +// bw.write(word.charAt(word.length() - 1)); +//
bw.write('\t'); +// bw.write('E'); +// bw.write('\n'); +// } +// } +// bw.write('\n'); +// } +// } +// catch (IOException e) +// { +// e.printStackTrace(); +// } +// } +// } +// +// ); +// bw.close(); +// } + +// public void testEnglishAndNumber() throws Exception +// { +// String text = "2.34米"; +//// System.out.println(CRFSegment.atomSegment(text.toCharArray())); +// HanLP.Config.enableDebug(); +// CRFSegment segment = new CRFSegment(); +// System.out.println(segment.seg(text)); +// } + + public static String compile(String tag) + { + if (tag.startsWith("m")) return "M"; + else if (tag.equals("x")) return "W"; + else if (tag.equals("nx")) return "W"; + return null; + } + + public void testLoadModelWithBiGramFeature() throws Exception + { + String path = HanLP.Config.CRFSegmentModelPath + Predefine.BIN_EXT; + CRFModel model = new CRFModel(new BinTrie<FeatureFunction>()); + model.load(ByteArray.createByteArray(path)); + + Table table = new Table(); + String text = "人民生活进一步改善了"; + table.v = new String[text.length()][2]; + for (int i = 0; i < text.length(); i++) + { + table.v[i][0] = String.valueOf(text.charAt(i)); + } + + model.tag(table); +// System.out.println(table); + } + +// public void testRemoveSpace() throws Exception +// { +// String inputPath = "E:\\2014.txt"; +// String outputPath = "E:\\2014f.txt"; +// BufferedReader br = IOUtil.newBufferedReader(inputPath); +// BufferedWriter bw = IOUtil.newBufferedWriter(outputPath); +// String line = ""; +// int preLength = 0; +// while ((line = br.readLine()) != null) +// { +// if (preLength == 0 && line.length() == 0) continue; +// bw.write(line); +// bw.newLine(); +// preLength = line.length(); +// } +// bw.close(); +// } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/crf/CRFNERecognizerTest.java b/src/test/java/com/hankcs/hanlp/model/crf/CRFNERecognizerTest.java new file mode 100644 index 000000000..4f49e7759 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/crf/CRFNERecognizerTest.java @@
-0,0 +1,32 @@ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import junit.framework.TestCase; + +public class CRFNERecognizerTest extends TestCase +{ + public static final String CORPUS = "data/test/pku98/199801.txt"; + public static String NER_MODEL_PATH = "data/model/crf/pku199801/ner.txt"; + public void testTrain() throws Exception + { + CRFTagger tagger = new CRFNERecognizer(null); + tagger.train(CORPUS, NER_MODEL_PATH); + } + + public void testLoad() throws Exception + { + CRFTagger tagger = new CRFNERecognizer(NER_MODEL_PATH); + } + + public void testConvert() throws Exception + { + CRFTagger tagger = new CRFNERecognizer(null); + tagger.convertCorpus(CORPUS, "data/test/crf/ner-corpus.tsv"); + } + + public void testDumpTemplate() throws Exception + { + CRFTagger tagger = new CRFNERecognizer(null); + tagger.dumpTemplate("data/test/crf/ner-template.txt"); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/crf/CRFPOSTaggerTest.java b/src/test/java/com/hankcs/hanlp/model/crf/CRFPOSTaggerTest.java new file mode 100644 index 000000000..2ca4fe020 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/crf/CRFPOSTaggerTest.java @@ -0,0 +1,43 @@ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; +import junit.framework.TestCase; + +import java.util.Arrays; + +public class CRFPOSTaggerTest extends TestCase +{ + public static final String CORPUS = "data/test/pku98/199801.txt"; + public static String POS_MODEL_PATH = HanLP.Config.CRFPOSModelPath; + + public void testTrain() throws Exception + { + CRFPOSTagger tagger = new CRFPOSTagger(null); // create a blank tagger + tagger.train(PKU.PKU199801_TRAIN, PKU.POS_MODEL); // train + tagger = new CRFPOSTagger(PKU.POS_MODEL); // load + System.out.println(Arrays.toString(tagger.tag("他", "的", "希望", "是",
"希望", "上学"))); // predict + AbstractLexicalAnalyzer analyzer = new AbstractLexicalAnalyzer(new PerceptronSegmenter(), tagger); // construct a lexical analyzer + System.out.println(analyzer.analyze("李狗蛋的希望是希望上学")); // segmentation + POS tagging + } + + public void testLoad() throws Exception + { + CRFPOSTagger tagger = new CRFPOSTagger("data/model/crf/pku199801/pos.txt"); + System.out.println(Arrays.toString(tagger.tag("我", "的", "希望", "是", "希望", "和平"))); + } + + public void testConvert() throws Exception + { + CRFTagger tagger = new CRFPOSTagger(null); + tagger.convertCorpus(CORPUS, "data/test/crf/pos-corpus.tsv"); + } + + public void testDumpTemplate() throws Exception + { + CRFTagger tagger = new CRFPOSTagger(null); + tagger.dumpTemplate("data/test/crf/pos-template.txt"); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/crf/CRFSegmenterTest.java b/src/test/java/com/hankcs/hanlp/model/crf/CRFSegmenterTest.java new file mode 100644 index 000000000..711827ff5 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/crf/CRFSegmenterTest.java @@ -0,0 +1,71 @@ +package com.hankcs.hanlp.model.crf; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.model.crf.crfpp.crf_learn; +import junit.framework.TestCase; + +import java.util.List; + +public class CRFSegmenterTest extends TestCase +{ + + public static final String CWS_MODEL_PATH = HanLP.Config.CRFCWSModelPath; + + public void testTrain() throws Exception + { + CRFSegmenter segmenter = new CRFSegmenter(null); + segmenter.train(PKU.PKU199801, CWS_MODEL_PATH); + } + + public void testConvert() throws Exception + { + crf_learn.run("-T " + CWS_MODEL_PATH + " " + CWS_MODEL_PATH + ".txt"); + } + + public void testConvertCorpus() throws Exception + { + CRFSegmenter segmenter = new CRFSegmenter(null); + segmenter.convertCorpus(PKU.PKU199801, "data/test/crf/cws-corpus.tsv"); + segmenter.dumpTemplate("data/test/crf/cws-template.txt"); + } + + public void testLoad() throws Exception + { +
CRFSegmenter segmenter = new CRFSegmenter("data/test/converted.txt"); + List<String> wordList = segmenter.segment("商品和服务"); + System.out.println(wordList); + } + + public void testOutput() throws Exception + { +// final CRFSegmenter segmenter = new CRFSegmenter(CWS_MODEL_PATH); +// +// final BufferedWriter bw = IOUtil.newBufferedWriter("data/test/crf/cws/mdat.txt"); +// IOUtility.loadInstance(PKU.PKU199801, new InstanceHandler() +// { +// @Override +// public boolean process(Sentence instance) +// { +// String text = instance.text().replace("0", "").replace("X", ""); +// try +// { +// for (String term : segmenter.segment(text)) +// { +// +// bw.write(term); +// bw.write(" "); +// } +// bw.newLine(); +// } +// catch (IOException e) +// { +// e.printStackTrace(); +// } +// return false; +// } +// }); +// bw.close(); + } + +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/crf/LogLinearModelTest.java b/src/test/java/com/hankcs/hanlp/model/crf/LogLinearModelTest.java new file mode 100644 index 000000000..458e81577 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/crf/LogLinearModelTest.java @@ -0,0 +1,11 @@ +package com.hankcs.hanlp.model.crf; + +import junit.framework.TestCase; + +public class LogLinearModelTest extends TestCase +{ + public void testLoad() throws Exception + { + LogLinearModel model = new LogLinearModel("/Users/hankcs/Downloads/crfpp-msr-cws-model.txt"); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/hmm/FirstOrderHiddenMarkovModelTest.java b/src/test/java/com/hankcs/hanlp/model/hmm/FirstOrderHiddenMarkovModelTest.java new file mode 100644 index 000000000..2216e5700 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/hmm/FirstOrderHiddenMarkovModelTest.java @@ -0,0 +1,104 @@ +package com.hankcs.hanlp.model.hmm; + +import junit.framework.TestCase; + +import java.util.Arrays; + +import static com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModelTest.Feel.cold; +import static
com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModelTest.Feel.dizzy; +import static com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModelTest.Feel.normal; +import static com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModelTest.Status.Fever; +import static com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModelTest.Status.Healthy; + +public class FirstOrderHiddenMarkovModelTest extends TestCase +{ + + /** + * Hidden states + */ + enum Status + { + Healthy, + Fever, + } + + /** + * Observed states + */ + enum Feel + { + normal, + cold, + dizzy, + } + /** + * Initial state probabilities + */ + static float[] start_probability = new float[]{0.6f, 0.4f}; + /** + * State transition probability matrix + */ + static float[][] transition_probability = new float[][]{ + {0.7f, 0.3f}, + {0.4f, 0.6f}, + }; + /** + * Emission probability matrix + */ + static float[][] emission_probability = new float[][]{ + {0.5f, 0.4f, 0.1f}, + {0.1f, 0.3f, 0.6f}, + }; + /** + * A patient's observation sequence + */ + static int[] observations = new int[]{normal.ordinal(), cold.ordinal(), dizzy.ordinal()}; + + public void testGenerate() throws Exception + { + FirstOrderHiddenMarkovModel givenModel = new FirstOrderHiddenMarkovModel(start_probability, transition_probability, emission_probability); + for (int[][] sample : givenModel.generate(3, 5, 2)) + { + for (int t = 0; t < sample[0].length; t++) + System.out.printf("%s/%s ", Feel.values()[sample[0][t]], Status.values()[sample[1][t]]); + System.out.println(); + } + } + + public void testTrain() throws Exception + { + FirstOrderHiddenMarkovModel givenModel = new FirstOrderHiddenMarkovModel(start_probability, transition_probability, emission_probability); + FirstOrderHiddenMarkovModel trainedModel = new FirstOrderHiddenMarkovModel(); + trainedModel.train(givenModel.generate(3, 10, 100000)); + assertTrue(trainedModel.similar(givenModel)); + } + + public void testPredict() throws Exception + { + FirstOrderHiddenMarkovModel model = new FirstOrderHiddenMarkovModel(start_probability, transition_probability, emission_probability); + evaluateModel(model); + }
+ + public void evaluateModel(FirstOrderHiddenMarkovModel model) + { + int[] pred = new int[observations.length]; + float prob = (float) Math.exp(model.predict(observations, pred)); + int[] answer = {Healthy.ordinal(), Healthy.ordinal(), Fever.ordinal()}; + assertEquals(Arrays.toString(answer), Arrays.toString(pred)); +// assertEquals("0.01512", String.format("%.5f", prob)); + assertEquals("0.015", String.format("%.3f", prob)); + + pred = new int[]{pred[0], pred[1]}; + answer = new int[]{answer[0], answer[1]}; + assertEquals(Arrays.toString(answer), Arrays.toString(pred)); + + pred = new int[]{pred[0]}; + answer = new int[]{answer[0]}; + assertEquals(Arrays.toString(answer), Arrays.toString(pred)); +// for (int s : pred) +// { +// System.out.print(Status.values()[s] + " "); +// } +// System.out.printf(" with highest probability of %.5f\n", prob); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/hmm/HMMLexicalAnalyzerTest.java b/src/test/java/com/hankcs/hanlp/model/hmm/HMMLexicalAnalyzerTest.java new file mode 100644 index 000000000..b28b09e15 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/hmm/HMMLexicalAnalyzerTest.java @@ -0,0 +1,22 @@ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.corpus.PKU; +import junit.framework.TestCase; + +public class HMMLexicalAnalyzerTest extends TestCase +{ + + public static final String CORPUS_PATH = PKU.PKU199801_TRAIN; + + public void testTrain() throws Exception + { + HMMSegmenter segmenter = new HMMSegmenter(); + segmenter.train(CORPUS_PATH); + HMMPOSTagger tagger = new HMMPOSTagger(); + tagger.train(CORPUS_PATH); + HMMNERecognizer recognizer = new HMMNERecognizer(); + recognizer.train(CORPUS_PATH); + HMMLexicalAnalyzer analyzer = new HMMLexicalAnalyzer(segmenter, tagger, recognizer); + System.out.println(analyzer.analyze("我的希望是希望人们幸福")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/hmm/HMMPOSTaggerTest.java 
b/src/test/java/com/hankcs/hanlp/model/hmm/HMMPOSTaggerTest.java new file mode 100644 index 000000000..bd61f89f1 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/hmm/HMMPOSTaggerTest.java @@ -0,0 +1,22 @@ +package com.hankcs.hanlp.model.hmm; + +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.model.perceptron.PerceptronSegmenter; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; +import com.hankcs.hanlp.utility.TestUtility; +import junit.framework.TestCase; + +import java.util.Arrays; + +public class HMMPOSTaggerTest extends TestCase +{ + public void testTrain() throws Exception + { + HMMPOSTagger tagger = new HMMPOSTagger(); // create a POS tagger +// HMMPOSTagger tagger = new HMMPOSTagger(new SecondOrderHiddenMarkovModel()); // or a second-order HMM + tagger.train(PKU.PKU199801); // train + System.out.println(Arrays.toString(tagger.tag("他", "的", "希望", "是", "希望", "上学"))); // predict + AbstractLexicalAnalyzer analyzer = new AbstractLexicalAnalyzer(new PerceptronSegmenter(), tagger); // construct a lexical analyzer + System.out.println(analyzer.analyze("他的希望是希望上学").translateLabels()); // segmentation + POS tagging + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/hmm/HMMSegmenterTest.java b/src/test/java/com/hankcs/hanlp/model/hmm/HMMSegmenterTest.java new file mode 100644 index 000000000..01d6172ac --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/hmm/HMMSegmenterTest.java @@ -0,0 +1,13 @@ +package com.hankcs.hanlp.model.hmm; + +import junit.framework.TestCase; + +public class HMMSegmenterTest extends TestCase +{ + public void testTrain() throws Exception + { + HMMSegmenter segmenter = new HMMSegmenter(); + segmenter.train("data/test/my_cws_corpus.txt"); + System.out.println(segmenter.segment("商品和服务")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/hmm/SecondOrderHiddenMarkovModelTest.java b/src/test/java/com/hankcs/hanlp/model/hmm/SecondOrderHiddenMarkovModelTest.java new file mode 100644 index 000000000..509b2c887 ---
/dev/null +++ b/src/test/java/com/hankcs/hanlp/model/hmm/SecondOrderHiddenMarkovModelTest.java @@ -0,0 +1,20 @@ +package com.hankcs.hanlp.model.hmm; + + +public class SecondOrderHiddenMarkovModelTest extends FirstOrderHiddenMarkovModelTest +{ + static float[][][] transition_probability2 = new float[][][]{ + {{0.7f, 0.3f}, {0.4f, 0.6f}}, + {{0.7f, 0.3f}, {0.4f, 0.6f}}, + }; + + public void testPredict() throws Exception + { + SecondOrderHiddenMarkovModel hmm2 = new SecondOrderHiddenMarkovModel(start_probability, transition_probability, emission_probability, transition_probability2); + SecondOrderHiddenMarkovModel trainedModel = new SecondOrderHiddenMarkovModel(); + trainedModel.train(hmm2.generate(3, 10, 100000)); + hmm2.unLog(); + trainedModel.unLog(); + assertTrue(hmm2.similar(trainedModel)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/CWSTrainerTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/CWSTrainerTest.java new file mode 100644 index 000000000..b7d83dedf --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/CWSTrainerTest.java @@ -0,0 +1,61 @@ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.tokenizer.NLPTokenizer; +import junit.framework.TestCase; + +public class CWSTrainerTest extends TestCase +{ + + public static final String SENTENCE = "香港特别行政区的张朝阳说商品和服务是三原县鲁桥食品厂的主营业务"; + + public void testTrain() throws Exception + { + HanLP.Config.enableDebug(); + PerceptronTrainer trainer = new CWSTrainer(); + PerceptronTrainer.Result result = trainer.train( + "data/test/pku98/199801.txt", + Config.CWS_MODEL_FILE + ); +// System.out.printf("准确率F1:%.2f\n", result.prf[2]); + PerceptronSegmenter segmenter = new 
PerceptronSegmenter(result.model); + // alternatively: +// Segment segmenter = new AveragedPerceptronSegment(POS_MODEL_FILE); + System.out.println(segmenter.segment("商品和服务?")); + } + + public void testCWS() throws Exception + { + PerceptronSegmenter segmenter = new PerceptronSegmenter(Config.CWS_MODEL_FILE); + segmenter.learn("下雨天 地面 积水"); + System.out.println(segmenter.segment("下雨天地面积水分外严重")); + } + + public void testCWSandPOS() throws Exception + { + Segment segmenter = new PerceptronLexicalAnalyzer(Config.CWS_MODEL_FILE, Config.POS_MODEL_FILE); + System.out.println(segmenter.seg(SENTENCE)); + } + + public void testCWSandPOSandNER() throws Exception + { + PerceptronLexicalAnalyzer segmenter = new PerceptronLexicalAnalyzer(Config.CWS_MODEL_FILE, Config.POS_MODEL_FILE, Config.NER_MODEL_FILE); + Sentence sentence = segmenter.analyze(SENTENCE); + System.out.println(sentence); + System.out.println(segmenter.seg(SENTENCE)); + for (IWord word : sentence) + { + if (word instanceof CompoundWord) + System.out.println(((CompoundWord) word).innerList); + } + } + + public void testCompareWithHanLP() throws Exception + { + System.out.println(NLPTokenizer.segment(SENTENCE)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/Config.java b/src/test/java/com/hankcs/hanlp/model/perceptron/Config.java new file mode 100644 index 000000000..01389be31 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/Config.java @@ -0,0 +1,23 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-27 下午5:46 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information.
+ * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; + +/** + * @author hankcs + */ +public class Config +{ + public static final String CWS_MODEL_FILE = HanLP.Config.PerceptronCWSModelPath; + public static final String POS_MODEL_FILE = HanLP.Config.PerceptronPOSModelPath; + public static final String NER_MODEL_FILE = HanLP.Config.PerceptronNERModelPath; +} diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainCWS.java b/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainCWS.java new file mode 100644 index 000000000..24fec2aab --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainCWS.java @@ -0,0 +1,20 @@ +package com.hankcs.hanlp.model.perceptron; + +import java.io.IOException; + +public class DemoTrainCWS +{ + public static void main(String[] args) throws IOException + { + PerceptronTrainer trainer = new CWSTrainer(); + PerceptronTrainer.Result result = trainer.train( + "data/test/pku98/199801.txt", + Config.CWS_MODEL_FILE + ); + System.out.printf("准确率F1:%.2f\n", result.getAccuracy()); + PerceptronSegmenter segment = new PerceptronSegmenter(result.getModel()); + // alternatively: +// Segment segment = new AveragedPerceptronSegment(POS_MODEL_FILE); + System.out.println(segment.segment("商品与服务")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainNER.java b/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainNER.java new file mode 100644 index 000000000..7d8575f92 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainNER.java @@ -0,0 +1,44 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-28 15:46 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information.
+ * + */ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import com.hankcs.hanlp.model.perceptron.tagset.TagSet; + +import java.io.IOException; + +/** + * @author hankcs + */ +public class DemoTrainNER +{ + public static void main(String[] args) throws IOException + { + PerceptronTrainer trainer = new NERTrainer(); + trainer.train("data/test/pku98/199801.txt", Config.NER_MODEL_FILE); + } + + public static void trainYourNER() + { + PerceptronTrainer trainer = new NERTrainer() + { + @Override + protected TagSet createTagSet() + { + NERTagSet tagSet = new NERTagSet(); + tagSet.nerLabels.add("YourNER1"); + tagSet.nerLabels.add("YourNER2"); + tagSet.nerLabels.add("YourNER3"); + return tagSet; + } + }; + } +} diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainPOS.java b/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainPOS.java new file mode 100644 index 000000000..94ff02b92 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/DemoTrainPOS.java @@ -0,0 +1,25 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-10-27 下午4:28 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron; + +import java.io.IOException; + +/** + * @author hankcs + */ +public class DemoTrainPOS +{ + public static void main(String[] args) throws IOException + { + PerceptronTrainer trainer = new POSTrainer(); + trainer.train("data/test/pku98/199801.txt", Config.POS_MODEL_FILE); + } +} diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/NERTrainerTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/NERTrainerTest.java new file mode 100644 index 000000000..5fbf2ce27 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/NERTrainerTest.java @@ -0,0 +1,20 @@ +package com.hankcs.hanlp.model.perceptron; + +import junit.framework.TestCase; + +import java.util.Arrays; + +public class NERTrainerTest extends TestCase +{ + public void testTrain() throws Exception + { + PerceptronTrainer trainer = new NERTrainer(); + trainer.train("data/test/pku98/199801.txt", Config.NER_MODEL_FILE); + } + + public void testTag() throws Exception + { + PerceptronNERecognizer recognizer = new PerceptronNERecognizer(Config.NER_MODEL_FILE); + System.out.println(Arrays.toString(recognizer.recognize("吴忠市 乳制品 公司 谭利华 来到 布达拉宫 广场".split(" "), "ns n n nr p ns n".split(" ")))); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/POSTrainerTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/POSTrainerTest.java new file mode 100644 index 000000000..7f163276b --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/POSTrainerTest.java @@ -0,0 +1,21 @@ +package com.hankcs.hanlp.model.perceptron; + +import junit.framework.TestCase; + +import java.util.Arrays; + +public class POSTrainerTest extends TestCase +{ + + public void testTrain() throws Exception + { + PerceptronTrainer trainer = new POSTrainer(); + trainer.train("data/test/pku98/199801.txt", Config.POS_MODEL_FILE); + } + + public void testLoad() throws Exception + { + PerceptronPOSTagger tagger = new 
PerceptronPOSTagger(Config.POS_MODEL_FILE); + System.out.println(Arrays.toString(tagger.tag("中国 交响乐团 谭利华 在 布达拉宫 广场 演出".split(" ")))); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronLexicalAnalyzerTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronLexicalAnalyzerTest.java new file mode 100644 index 000000000..9a7e0d0e4 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronLexicalAnalyzerTest.java @@ -0,0 +1,130 @@ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.dictionary.CustomDictionary; +import com.hankcs.hanlp.dictionary.other.CharTable; +import com.hankcs.hanlp.seg.common.Term; +import junit.framework.TestCase; + +import java.util.List; + +public class PerceptronLexicalAnalyzerTest extends TestCase +{ + PerceptronLexicalAnalyzer analyzer; + + @Override + public void setUp() throws Exception + { + analyzer = new PerceptronLexicalAnalyzer(Config.CWS_MODEL_FILE, Config.POS_MODEL_FILE, Config.NER_MODEL_FILE); + } + + public void testIssue() throws Exception + { +// System.out.println(analyzer.seg("")); + for (Term term : analyzer.seg("张三丰,刘五郎,黄三元,张一楠,王三强,丁一楠,李四光,闻一多,赵一楠,李四")) + { + if (term.nature == Nature.w) continue; + assertEquals(Nature.nr, term.nature); + } + } + + public void testLearn() throws Exception + { + analyzer.learn("我/r 在/p 浙江/ns 金华/ns 出生/v"); + assertTrue(analyzer.analyze("我在浙江金华出生").toString().contains("金华/ns")); + assertTrue(analyzer.analyze("我的名字叫金华").toString().contains("金华/nr")); + } + + public void testEmptyInput() throws Exception + { + analyzer.segment(""); + analyzer.seg(""); + } + + public void testCustomDictionary() throws Exception + { + analyzer.enableCustomDictionary(true); + assertTrue(CustomDictionary.contains("一字长蛇阵")); + final String text = 
"张飞摆出一字长蛇阵如入无人之境,孙权惊呆了"; +// System.out.println(analyzer.analyze(text)); + assertTrue(analyzer.analyze(text).toString().contains(" 一字长蛇阵/")); + } + + public void testCustomNature() throws Exception + { + assertTrue(CustomDictionary.insert("饿了么", "ntc 1")); + analyzer.enableCustomDictionaryForcing(true); + assertEquals("美团/n 与/p 饿了么/ntc 争夺/v 外卖/v 市场/n", analyzer.analyze("美团与饿了么争夺外卖市场").toString()); + } + + public void testIndexMode() throws Exception + { + analyzer.enableIndexMode(true); + String text = "来到美国纽约现代艺术博物馆参观"; + List<Term> termList = analyzer.seg(text); + assertEquals("[来到/v, 美国纽约现代艺术博物馆/ns, 美国/ns, 纽约/ns, 现代/t, 艺术/n, 博物馆/n, 参观/v]", termList.toString()); + for (Term term : termList) + { + assertEquals(term.word, text.substring(term.offset, term.offset + term.length())); + } + analyzer.enableIndexMode(false); + } + + public void testOffset() throws Exception + { + analyzer.enableIndexMode(false); + String text = "来到美国纽约现代艺术博物馆参观"; + List<Term> termList = analyzer.seg(text); + for (Term term : termList) + { + assertEquals(term.word, text.substring(term.offset, term.offset + term.length())); + } + } + + public void testNormalization() throws Exception + { + analyzer.enableCustomDictionary(false); + String text = "來到美國紐約現代藝術博物館參觀?"; + Sentence sentence = analyzer.analyze(text); +// System.out.println(sentence); + assertEquals("來到/v [美國/ns 紐約/ns 現代/t 藝術/n 博物館/n]/ns 參觀/v ?/w", sentence.toString()); + List<Term> termList = analyzer.seg(text); +// System.out.println(termList); + assertEquals("[來到/v, 美國紐約現代藝術博物館/ns, 參觀/v, ?/w]", termList.toString()); + } + + public void testWhiteSpace() throws Exception + { + CharTable.CONVERT[' '] = '!'; + CharTable.CONVERT['\t'] = '!'; + Sentence sentence = analyzer.analyze("\"你好, 我想知道: 风是从哪里来; \t雷是从哪里来; 雨是从哪里来?\""); + for (IWord word : sentence) + { + if (!word.getLabel().equals("w")) + { + assertFalse(word.getValue().contains(" ")); + assertFalse(word.getValue().contains("\t")); + } + } + } + + public void testCustomDictionaryForcing() throws
Exception + { + String text = "银川普通人与川普通电话讲四川普通话"; + CustomDictionary.insert("川普", "NRF 1"); + + analyzer.enableCustomDictionaryForcing(false); + System.out.println(analyzer.analyze(text)); + + analyzer.enableCustomDictionaryForcing(true); + System.out.println(analyzer.analyze(text)); + } + + public void testRules() throws Exception + { + analyzer.enableRuleBasedSegment(true); + System.out.println(analyzer.analyze("これは微软公司於1975年由比爾·蓋茲和保羅·艾倫創立,18年啟動以智慧雲端、前端為導向的大改組。")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronNERecognizerTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronNERecognizerTest.java new file mode 100644 index 000000000..841626848 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronNERecognizerTest.java @@ -0,0 +1,12 @@ +package com.hankcs.hanlp.model.perceptron; + +import junit.framework.TestCase; + +public class PerceptronNERecognizerTest extends TestCase +{ + public void testEmptyInput() throws Exception + { + PerceptronNERecognizer recognizer = new PerceptronNERecognizer(); + recognizer.recognize(new String[0], new String[0]); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronNameGenderClassifierTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronNameGenderClassifierTest.java new file mode 100644 index 000000000..c0f588a57 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronNameGenderClassifierTest.java @@ -0,0 +1,49 @@ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.utility.TestUtility; +import junit.framework.TestCase; + +public class PerceptronNameGenderClassifierTest extends TestCase +{ + public static String CNNAME = TestUtility.ensureTestData("cnname", "http://file.hankcs.com/corpus/cnname.zip"); + public static String TRAINING_SET = "data/test/cnname/train.csv"; + public static String TESTING_SET = 
"data/test/cnname/test.csv"; + public static String MODEL = "data/test/cnname.bin"; + + @Override + public void setUp() throws Exception + { + super.setUp(); + TestUtility.ensureTestData("cnname", "http://file.hankcs.com/corpus/cnname.zip"); + } + + public void testTrain() throws Exception + { + PerceptronNameGenderClassifier classifier = new PerceptronNameGenderClassifier(); + System.out.println(classifier.train(TRAINING_SET, 10, false)); + classifier.model.save(MODEL, classifier.model.featureMap.entrySet(), 0, true); + predictNames(classifier); + } + + public static void predictNames(PerceptronNameGenderClassifier classifier) + { + String[] names = new String[]{"赵建军", "沈雁冰", "陆雪琪", "李冰冰"}; + for (String name : names) + { + System.out.printf("%s=%s\n", name, classifier.predict(name)); + } + } + + + public void testEvaluate() throws Exception + { + PerceptronNameGenderClassifier classifier = new PerceptronNameGenderClassifier(MODEL); + System.out.println(classifier.evaluate(TESTING_SET)); + } + + public void testPrediction() throws Exception + { + PerceptronNameGenderClassifier classifier = new PerceptronNameGenderClassifier(MODEL); + predictNames(classifier); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronPOSTaggerTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronPOSTaggerTest.java new file mode 100644 index 000000000..1380ba477 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronPOSTaggerTest.java @@ -0,0 +1,30 @@ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer; +import junit.framework.TestCase; + +import java.util.Arrays; + +public class PerceptronPOSTaggerTest extends TestCase +{ + public void testTrain() throws Exception + { + PerceptronTrainer trainer = new POSTrainer(); + trainer.train(PKU.PKU199801_TRAIN, PKU.POS_MODEL); 
// train + PerceptronPOSTagger tagger = new PerceptronPOSTagger(PKU.POS_MODEL); // load + System.out.println(Arrays.toString(tagger.tag("他", "的", "希望", "是", "希望", "上学"))); // predict + AbstractLexicalAnalyzer analyzer = new AbstractLexicalAnalyzer(new PerceptronSegmenter(), tagger); // build a lexical analyzer + System.out.println(analyzer.analyze("李狗蛋的希望是希望上学")); // segmentation and POS tagging + } + + public void testCompress() throws Exception + { + PerceptronPOSTagger tagger = new PerceptronPOSTagger(); + tagger.getModel().compress(0.01); + double[] scores = tagger.evaluate("data/test/pku98/199801.txt"); + System.out.println(scores[0]); + tagger.getModel().save(HanLP.Config.PerceptronPOSModelPath + ".small"); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronSegmenterTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronSegmenterTest.java new file mode 100644 index 000000000..e349e939f --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronSegmenterTest.java @@ -0,0 +1,51 @@ +package com.hankcs.hanlp.model.perceptron; + +import com.hankcs.hanlp.dictionary.CustomDictionary; +import junit.framework.TestCase; + +import java.util.List; + +public class PerceptronSegmenterTest extends TestCase +{ + + private PerceptronSegmenter segmenter; + + @Override + public void setUp() throws Exception + { + segmenter = new PerceptronSegmenter(); + } + + public void testEmptyString() throws Exception + { + segmenter.segment(""); + } + + public void testNRF() throws Exception + { + String text = "他们确保了唐纳德·特朗普在总统大选中获胜。"; + List<String> wordList = segmenter.segment(text); + assertTrue(wordList.contains("唐纳德·特朗普")); + } + + public void testNoCustomDictionary() throws Exception + { + PerceptronLexicalAnalyzer analyzer = new PerceptronLexicalAnalyzer(); + analyzer.enableCustomDictionary(false); + CustomDictionary.insert("禁用用户词典"); + assertEquals("[禁用/v, 用户/n, 词典/n]", analyzer.seg("禁用用户词典").toString()); + } + + public void testLearnAndSeg() throws 
Exception + { + PerceptronLexicalAnalyzer analyzer = new PerceptronLexicalAnalyzer(); + analyzer.learn("与/c 特朗普/nr 通/v 电话/n 讨论/v [太空/s 探索/vn 技术公司/n]/nt"); + assertEquals("[与/c, 特朗普/k, 通/v, 电话/n, 讨论/v, 太空探索技术公司/nt]", analyzer.seg("与特朗普通电话讨论太空探索技术公司").toString()); + } + + public void testBlanks() + { + System.out.println(segmenter.segment("建议自己处理空格 这就是你们要的效果?")); + System.out.println(segmenter.segment("我买了iPhone X 12 G和爱疯 8")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronTaggerTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronTaggerTest.java new file mode 100644 index 000000000..e5129b4ae --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/PerceptronTaggerTest.java @@ -0,0 +1,14 @@ +package com.hankcs.hanlp.model.perceptron; + +import junit.framework.TestCase; + +import java.util.ArrayList; + +public class PerceptronTaggerTest extends TestCase +{ + public void testEmptyInput() throws Exception + { + PerceptronPOSTagger tagger = new PerceptronPOSTagger(); + tagger.tag(new ArrayList()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/corpus/ConvertPKU.java b/src/test/java/com/hankcs/hanlp/model/perceptron/corpus/ConvertPKU.java new file mode 100644 index 000000000..ff2130460 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/corpus/ConvertPKU.java @@ -0,0 +1,30 @@ +/* + * Hankcs + * me@hankcs.com + * 2017-09-18 上午10:43 + * + * + * Copyright (c) 2017, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.model.perceptron.corpus; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.word.IWord; +import com.hankcs.hanlp.corpus.io.IOUtil; +import junit.framework.TestCase; + +import java.io.BufferedWriter; +import java.io.File; +import java.io.IOException; +import java.util.List; + +/** + * @author hankcs + */ +public class ConvertPKU extends TestCase +{ +} diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMDatMapTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMDatMapTest.java new file mode 100644 index 000000000..0f1a3d31e --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/feature/ImmutableFeatureMDatMapTest.java @@ -0,0 +1,36 @@ +package com.hankcs.hanlp.model.perceptron.feature; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.collection.trie.datrie.MutableDoubleArrayTrieInteger; +import com.hankcs.hanlp.model.perceptron.model.LinearModel; +import junit.framework.TestCase; + +import java.util.Map; +import java.util.TreeMap; + +public class ImmutableFeatureMDatMapTest extends TestCase +{ + public void testCompress() throws Exception + { + LinearModel model = new LinearModel(HanLP.Config.PerceptronCWSModelPath); + model.compress(0.1); + } + + public void testFeatureMap() throws Exception + { + LinearModel model = new LinearModel(HanLP.Config.PerceptronCWSModelPath); + ImmutableFeatureMDatMap featureMap = (ImmutableFeatureMDatMap) model.featureMap; + MutableDoubleArrayTrieInteger dat = featureMap.dat; + System.out.println(featureMap.size()); + System.out.println(featureMap.entrySet().size()); + System.out.println(featureMap.idOf("\u0001/\u00014")); + TreeMap<String, Integer> map = new TreeMap<String, Integer>(); + for (Map.Entry<String, Integer> entry : dat.entrySet()) + { + map.put(entry.getKey(), entry.getValue()); + assertEquals(entry.getValue().intValue(), dat.get(entry.getKey())); + } + System.out.println(map.size()); + assertEquals(dat.size(), map.size()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/model/LinearModelTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/model/LinearModelTest.java new file mode 100644 index 000000000..4513b8ab2 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/model/LinearModelTest.java @@ -0,0 +1,22 @@ +package com.hankcs.hanlp.model.perceptron.model; + +import com.hankcs.hanlp.model.perceptron.CWSTrainer; +import com.hankcs.hanlp.model.perceptron.PerceptronTrainer; +import junit.framework.TestCase; + +import static java.lang.System.out; + +public class LinearModelTest extends TestCase +{ + public static final String MODEL_FILE = "data/pku_mini.bin"; + +// public void testLoad() throws Exception +// { +// LinearModel model = new LinearModel(MODEL_FILE); +// PerceptronTrainer trainer = new CWSTrainer(); +// double[] prf = trainer.evaluate("icwb2-data/mini/pku_development.txt", +// model +// ); +// out.printf("Performance - P:%.2f R:%.2f F:%.2f\n", prf[0], prf[1], prf[2]); +// } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/utility/IOUtilityTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/utility/IOUtilityTest.java new file mode 100644 index 000000000..8af202d57 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/utility/IOUtilityTest.java @@ -0,0 +1,15 @@ +package com.hankcs.hanlp.model.perceptron.utility; + +import junit.framework.TestCase; + +import java.util.Arrays; + +public class IOUtilityTest extends TestCase +{ + public void testReadLineToArray() throws Exception + { + String line = " 你好 世界 ! 
"; + String[] array = IOUtility.readLineToArray(line); + System.out.println(Arrays.toString(array)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/perceptron/utility/UtilityTest.java b/src/test/java/com/hankcs/hanlp/model/perceptron/utility/UtilityTest.java new file mode 100644 index 000000000..eb7119c2f --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/perceptron/utility/UtilityTest.java @@ -0,0 +1,28 @@ +package com.hankcs.hanlp.model.perceptron.utility; + +import com.hankcs.hanlp.corpus.PKU; +import com.hankcs.hanlp.corpus.document.sentence.Sentence; +import com.hankcs.hanlp.model.hmm.HMMNERecognizer; +import com.hankcs.hanlp.model.perceptron.PerceptronNERecognizer; +import com.hankcs.hanlp.model.perceptron.tagset.NERTagSet; +import junit.framework.TestCase; + +import java.util.Arrays; +import java.util.Map; + +public class UtilityTest extends TestCase +{ + public void testCombineNER() throws Exception + { + NERTagSet nerTagSet = new HMMNERecognizer().getNERTagSet(); + String[] nerArray = Utility.reshapeNER(Utility.convertSentenceToNER(Sentence.create("萨哈夫/nr 说/v ,/w 伊拉克/ns 将/d 同/p [联合国/nt 销毁/v 伊拉克/ns 大规模/b 杀伤性/n 武器/n 特别/a 委员会/n]/nt 继续/v 保持/v 合作/v 。/w"), nerTagSet))[2]; + System.out.println(Arrays.toString(nerArray)); + System.out.println(Utility.combineNER(nerArray, nerTagSet)); + } + + public void testEvaluateNER() throws Exception + { + Map scores = Utility.evaluateNER(new PerceptronNERecognizer(), PKU.PKU199801_TEST); + Utility.printNERScore(scores); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/model/trigram/CharacterBasedGenerativeModelTest.java b/src/test/java/com/hankcs/hanlp/model/trigram/CharacterBasedGenerativeModelTest.java new file mode 100644 index 000000000..c2320179f --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/model/trigram/CharacterBasedGenerativeModelTest.java @@ -0,0 +1,96 @@ +package com.hankcs.hanlp.model.trigram; + +import com.hankcs.hanlp.HanLP; +import 
com.hankcs.hanlp.corpus.document.CorpusLoader; +import com.hankcs.hanlp.corpus.document.Document; +import com.hankcs.hanlp.corpus.document.sentence.word.Word; +import com.hankcs.hanlp.corpus.io.ByteArray; +import com.hankcs.hanlp.seg.HMM.HMMSegment; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import junit.framework.TestCase; + +import java.util.LinkedList; +import java.util.List; + +public class CharacterBasedGenerativeModelTest extends TestCase +{ +// public void testTrainAndSegment() throws Exception +// { +// final CharacterBasedGenerativeModel model = new CharacterBasedGenerativeModel(); +// CorpusLoader.walk("D:\\JavaProjects\\HanLP\\data\\test\\cbgm", new CorpusLoader.Handler() +// { +// @Override +// public void handle(Document document) +// { +// for (List sentence : document.getSimpleSentenceList()) +// { +// model.learn(sentence); +// } +// } +// }); +// model.train(); +//// DataOutputStream out = new DataOutputStream(new FileOutputStream(HanLP.Config.HMMSegmentModelPath)); +//// model.save(out); +//// out.close(); +//// model.load(ByteArray.createByteArray(HanLP.Config.HMMSegmentModelPath)); +// String text = "中国领土"; +// char[] charArray = text.toCharArray(); +// char[] tag = model.tag(charArray); +// System.out.println(tag); +// } +// +// public void testLoad() throws Exception +// { +// CharacterBasedGenerativeModel model = new CharacterBasedGenerativeModel(); +// model.load(ByteArray.createByteArray(HanLP.Config.HMMSegmentModelPath)); +// String text = "我实现了一个基于Character Based TriGram的分词器"; +// char[] sentence = text.toCharArray(); +// char[] tag = model.tag(sentence); +// +// List termList = new LinkedList(); +// int offset = 0; +// for (int i = 0; i < tag.length; offset += 1, ++i) +// { +// switch (tag[i]) +// { +// case 'b': +// { +// int begin = offset; +// while (tag[i] != 'e') +// { +// offset += 1; +// ++i; +// if (i == tag.length) +// { +// break; +// } +// } +// if (i == tag.length) +// { +// 
termList.add(new String(sentence, begin, offset - begin)); +// } +// else +// termList.add(new String(sentence, begin, offset - begin + 1)); +// } +// break; +// default: +// { +// termList.add(new String(sentence, offset, 1)); +// } +// break; +// } +// } +// System.out.println(tag); +// System.out.println(termList); +// } +// +// public void testSegment() throws Exception +// { +// HanLP.Config.ShowTermNature = false; +// String text = "我实现了一个基于Character Based TriGram的分词器"; +// Segment segment = new HMMSegment(); +// List termList = segment.seg(text); +// System.out.println(termList); +// } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/recognition/ns/PlaceRecognitionTest.java b/src/test/java/com/hankcs/hanlp/recognition/ns/PlaceRecognitionTest.java new file mode 100644 index 000000000..df5eac2df --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/recognition/ns/PlaceRecognitionTest.java @@ -0,0 +1,20 @@ +package com.hankcs.hanlp.recognition.ns; + +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import junit.framework.TestCase; + +public class PlaceRecognitionTest extends TestCase +{ + public void testSeg() throws Exception + { +// HanLP.Config.enableDebug(); + DijkstraSegment segment = new DijkstraSegment(); + segment.enableJapaneseNameRecognize(false); + segment.enableTranslatedNameRecognize(false); + segment.enableNameRecognize(false); + segment.enableCustomDictionary(false); + + segment.enablePlaceRecognize(true); +// System.out.println(segment.seg("南翔向宁夏固原市彭阳县红河镇黑牛沟村捐赠了挖掘机")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/recognition/nt/OrganizationRecognitionTest.java b/src/test/java/com/hankcs/hanlp/recognition/nt/OrganizationRecognitionTest.java new file mode 100644 index 000000000..51f331357 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/recognition/nt/OrganizationRecognitionTest.java @@ -0,0 +1,61 @@ +package com.hankcs.hanlp.recognition.nt; + +import com.hankcs.hanlp.HanLP; +import 
com.hankcs.hanlp.corpus.dictionary.DictionaryMaker; +import com.hankcs.hanlp.corpus.dictionary.item.Item; +import com.hankcs.hanlp.corpus.io.IOUtil; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.dictionary.common.CommonStringDictionary; +import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment; +import com.hankcs.hanlp.utility.LexiconUtility; +import junit.framework.TestCase; + +import java.util.Map; +import java.util.Set; + +public class OrganizationRecognitionTest extends TestCase +{ +// public void testSeg() throws Exception +// { +// HanLP.Config.enableDebug(); +// DijkstraSegment segment = new DijkstraSegment(); +// segment.enableCustomDictionary(false); +// +// segment.enableOrganizationRecognize(true); +// System.out.println(segment.seg("东欧的球队")); +// } +// +// public void testGeneratePatternJavaCode() throws Exception +// { +// CommonStringDictionary commonStringDictionary = new CommonStringDictionary(); +// commonStringDictionary.load("data/dictionary/organization/nt.pattern.txt"); +// StringBuilder sb = new StringBuilder(); +// Set keySet = commonStringDictionary.keySet(); +// CommonStringDictionary secondDictionary = new CommonStringDictionary(); +// secondDictionary.load("data/dictionary/organization/outerNT.pattern.txt"); +// keySet.addAll(secondDictionary.keySet()); +// for (String pattern : keySet) +// { +// sb.append("trie.addKeyword(\"" + pattern + "\");\n"); +// } +// IOUtil.saveTxt("data/dictionary/organization/code.txt", sb.toString()); +// } +// +// public void testRemoveP() throws Exception +// { +// DictionaryMaker maker = DictionaryMaker.load(HanLP.Config.OrganizationDictionaryPath); +// for (Map.Entry entry : maker.entrySet()) +// { +// String word = entry.getKey(); +// Item item = entry.getValue(); +// CoreDictionary.Attribute attribute = LexiconUtility.getAttribute(word); +// if (attribute == null) continue; +// if (item.containsLabel("P") && attribute.hasNatureStartsWith("u")) +// { +// System.out.println(item 
+ "\t" + attribute); +// item.removeLabel("P"); +// } +// } +// maker.saveTxtTo(HanLP.Config.OrganizationDictionaryPath); +// } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/seg/Dijkstra/DijkstraSegmentTest.java b/src/test/java/com/hankcs/hanlp/seg/Dijkstra/DijkstraSegmentTest.java new file mode 100644 index 000000000..4e206020c --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/seg/Dijkstra/DijkstraSegmentTest.java @@ -0,0 +1,28 @@ +package com.hankcs.hanlp.seg.Dijkstra; + +import com.hankcs.hanlp.seg.SegmentTestCase; +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; + +import java.util.List; + +public class DijkstraSegmentTest extends SegmentTestCase +{ + public void testWrongName() throws Exception + { + Segment segment = new DijkstraSegment(); + List<Term> termList = segment.seg("好像向你借钱的人跑了"); + assertNoNature(termList, Nature.nr); +// System.out.println(termList); + } + + public void testIssue770() throws Exception + { +// HanLP.Config.enableDebug(); + Segment segment = new DijkstraSegment(); + List<Term> termList = segment.seg("为什么我扔出的瓶子没有人回复?"); +// System.out.println(termList); + assertSegmentationHas(termList, "瓶子 没有"); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/seg/NShort/NShortSegmentTest.java b/src/test/java/com/hankcs/hanlp/seg/NShort/NShortSegmentTest.java new file mode 100644 index 000000000..0122cd14f --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/seg/NShort/NShortSegmentTest.java @@ -0,0 +1,41 @@ +package com.hankcs.hanlp.seg.NShort; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import com.hankcs.hanlp.tokenizer.StandardTokenizer; +import junit.framework.TestCase; + +import java.util.LinkedList; +import java.util.List; + +public class NShortSegmentTest extends TestCase
{ + public void testParse() throws Exception + { + List<List<Term>> wordResults = new LinkedList<List<Term>>(); + wordResults.add(NShortSegment.parse("3-4月")); + wordResults.add(NShortSegment.parse("3-4月份")); + wordResults.add(NShortSegment.parse("3-4季")); + wordResults.add(NShortSegment.parse("3-4年")); + wordResults.add(NShortSegment.parse("3-4人")); + wordResults.add(NShortSegment.parse("2014年")); + wordResults.add(NShortSegment.parse("04年")); + wordResults.add(NShortSegment.parse("12点半")); + wordResults.add(NShortSegment.parse("1.abc")); + +// for (List<Term> result : wordResults) +// { +// System.out.println(result); +// } + } + + public void testIssue691() throws Exception + { +// HanLP.Config.enableDebug(); + StandardTokenizer.SEGMENT.enableCustomDictionary(false); + Segment nShortSegment = new NShortSegment().enableCustomDictionary(false).enablePlaceRecognize(true).enableOrganizationRecognize(true); +// System.out.println(nShortSegment.seg("今天,刘志军案的关键人物,山西女商人丁书苗在市二中院出庭受审。")); +// System.out.println(nShortSegment.seg("今日消费5,513.58元")); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/seg/Other/AhoCorasickDoubleArrayTrieSegmentTest.java b/src/test/java/com/hankcs/hanlp/seg/Other/AhoCorasickDoubleArrayTrieSegmentTest.java new file mode 100644 index 000000000..4b7a0b102 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/seg/Other/AhoCorasickDoubleArrayTrieSegmentTest.java @@ -0,0 +1,15 @@ +package com.hankcs.hanlp.seg.Other; + +import com.hankcs.hanlp.HanLP; +import junit.framework.TestCase; + +public class AhoCorasickDoubleArrayTrieSegmentTest extends TestCase +{ + public void testLoadMyDictionary() throws Exception + { + AhoCorasickDoubleArrayTrieSegment segment + = new AhoCorasickDoubleArrayTrieSegment("data/dictionary/CoreNatureDictionary.mini.txt"); + HanLP.Config.ShowTermNature = false; + assertEquals("[江西, 鄱阳湖, 干枯]", segment.seg("江西鄱阳湖干枯").toString()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/seg/Other/DoubleArrayTrieSegmentTest.java 
b/src/test/java/com/hankcs/hanlp/seg/Other/DoubleArrayTrieSegmentTest.java new file mode 100644 index 000000000..ed6d7ecdc --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/seg/Other/DoubleArrayTrieSegmentTest.java @@ -0,0 +1,22 @@ +package com.hankcs.hanlp.seg.Other; + +import com.hankcs.hanlp.HanLP; +import junit.framework.TestCase; + +public class DoubleArrayTrieSegmentTest extends TestCase +{ + public void testLoadMyDictionary() throws Exception + { + DoubleArrayTrieSegment segment = new DoubleArrayTrieSegment("data/dictionary/CoreNatureDictionary.mini.txt"); + HanLP.Config.ShowTermNature = false; + assertEquals("[江西, 鄱阳湖, 干枯]", segment.seg("江西鄱阳湖干枯").toString()); + } + + public void testLoadMyDictionaryWithNature() throws Exception + { + DoubleArrayTrieSegment segment = new DoubleArrayTrieSegment("data/dictionary/CoreNatureDictionary.mini.txt", + "data/dictionary/custom/上海地名.txt ns"); + segment.enablePartOfSpeechTagging(true); + assertEquals("[上海市/ns, 虹口区/ns, 大连西路/ns, 550/m, 号/q]", segment.seg("上海市虹口区大连西路550号").toString()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/seg/SegmentTestCase.java b/src/test/java/com/hankcs/hanlp/seg/SegmentTestCase.java new file mode 100644 index 000000000..a1c41105f --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/seg/SegmentTestCase.java @@ -0,0 +1,45 @@ +/* + * Hankcs + * me@hankcs.com + * 2018-03-20 下午12:19 + * + * + * Copyright (c) 2018, 码农场. All Right Reserved, http://www.hankcs.com/ + * This source is subject to Hankcs. Please contact Hankcs to get more information. 
+ * + */ +package com.hankcs.hanlp.seg; + +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.seg.common.Term; +import junit.framework.Assert; +import junit.framework.TestCase; + +import java.util.List; + +/** + * @author hankcs + */ +public class SegmentTestCase extends TestCase +{ + @SuppressWarnings("deprecation") + public static void assertNoNature(List<Term> termList, Nature nature) + { + for (Term term : termList) + { + Assert.assertNotSame(nature, term.nature); + } + } + + @SuppressWarnings("deprecation") + public static void assertSegmentationHas(List<Term> termList, String part) + { + StringBuilder sbSentence = new StringBuilder(); + for (Term term : termList) + { + sbSentence.append(term.word).append(' '); + } + assertTrue(sbSentence.toString().contains(part)); + } + +} diff --git a/src/test/java/com/hankcs/hanlp/seg/common/CWSEvaluatorTest.java b/src/test/java/com/hankcs/hanlp/seg/common/CWSEvaluatorTest.java new file mode 100644 index 000000000..c404481e5 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/seg/common/CWSEvaluatorTest.java @@ -0,0 +1,15 @@ +package com.hankcs.hanlp.seg.common; + +import junit.framework.TestCase; + +public class CWSEvaluatorTest extends TestCase +{ + public void testGetPRF() throws Exception + { + CWSEvaluator evaluator = new CWSEvaluator(); + evaluator.compare("结婚 的 和 尚未 结婚 的", "结婚 的 和尚 未结婚 的"); + CWSEvaluator.Result prf = evaluator.getResult(false); + assertEquals(0.6f, prf.P); + assertEquals(0.5f, prf.R); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/suggest/ISuggesterTest.java b/src/test/java/com/hankcs/hanlp/suggest/ISuggesterTest.java new file mode 100644 index 000000000..0cd5f9092 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/suggest/ISuggesterTest.java @@ -0,0 +1,29 @@ +package com.hankcs.hanlp.suggest; + +import junit.framework.TestCase; + +public class ISuggesterTest extends TestCase +{ + public void testRemoveAllSentences() throws Exception + { + ISuggester suggester = new 
Suggester(); + String[] titleArray = + ( + "威廉王子发表演说 呼吁保护野生动物\n" + + "《时代》年度人物最终入围名单出炉 普京马云入选\n" + + "“黑格比”横扫菲:菲吸取“海燕”经验及早疏散\n" + + "日本保密法将正式生效 日媒指其损害国民知情权\n" + + "英报告说空气污染带来“公共健康危机”" + ).split("\\n"); + for (String title : titleArray) + { + suggester.addSentence(title); + } + + assertEquals(true, suggester.suggest("mayun", 1).size() > 0); + + suggester.removeAllSentences(); + + assertEquals(0, suggester.suggest("mayun", 1).size()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/suggest/scorer/pinyin/PinyinKeyTest.java b/src/test/java/com/hankcs/hanlp/suggest/scorer/pinyin/PinyinKeyTest.java new file mode 100644 index 000000000..746ed05e5 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/suggest/scorer/pinyin/PinyinKeyTest.java @@ -0,0 +1,16 @@ +package com.hankcs.hanlp.suggest.scorer.pinyin; + +import com.hankcs.hanlp.algorithm.LongestCommonSubstring; +import junit.framework.TestCase; + +public class PinyinKeyTest extends TestCase +{ + public void testConstruct() throws Exception + { + PinyinKey pinyinKeyA = new PinyinKey("专题分析"); + PinyinKey pinyinKeyB = new PinyinKey("教室资格"); +// System.out.println(pinyinKeyA); +// System.out.println(pinyinKeyB); + assertEquals(1, LongestCommonSubstring.compute(pinyinKeyA.getFirstCharArray(), pinyinKeyB.getFirstCharArray())); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/test/other/TestExtractSummary.java b/src/test/java/com/hankcs/hanlp/summary/TextRankSentenceTest.java similarity index 82% rename from src/test/java/com/hankcs/test/other/TestExtractSummary.java rename to src/test/java/com/hankcs/hanlp/summary/TextRankSentenceTest.java index 563fb2cfb..98f07ca24 100644 --- a/src/test/java/com/hankcs/test/other/TestExtractSummary.java +++ b/src/test/java/com/hankcs/hanlp/summary/TextRankSentenceTest.java @@ -1,55 +1,55 @@ -/** - * - */ -package com.hankcs.test.other; - -import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertTrue; - -import 
java.util.List; - -import org.junit.Test; - -import com.hankcs.hanlp.HanLP; - -/** - * @author gonggawang - * - */ -public class TestExtractSummary -{ - private static final String str = "7月21日,渤海海况恶劣,至少发生3起沉船事故,10余名船员危在旦夕。危急时刻,中国海油渤海油田再次行动起来,紧急调配救援力量救起10名遇险人员。" - + "21日一早,一阵急促的铃声,在渤海石油管理局总值班室骤然响起。这是天津海上搜救中心打来的电话。正在值班的作业协调部主管邬礼凯心里“咯噔”一下——天津海上搜救中心称," - + "在“海洋石油932”平台西南方7海里处,一艘货轮遇险、处于倾覆边缘,4名船员命悬一线。时间就是生命!邬礼凯立即组织海上救援力量,立即驰奔事故发生地点。" - + "“滨海264”船接到任务单后,仅一个小时便抵达事故现场。此时,货船已完全倾覆。“滨海264”立刻开展救援工作,仅25分钟便将4人全部救出。"; - - private static final String separator = "[。??!!]"; - - @Test - public void testExctractSummay() - { - List oldSum = HanLP.extractSummary(str, 2); - List newSum = HanLP.extractSummary(str, 2, separator); - System.out.println("exctractSummay old:" + oldSum); - System.out.println("exctractSummay new:" + newSum); - - assertTrue(oldSum.toString().length() < newSum.toString().length()); - assertFalse(oldSum.toString().contains(",")); - assertTrue(newSum.toString().contains(",")); - } - - @Test - public void testGetSummary() - { - - String oldSum = HanLP.getSummary(str, 100); - String newSum = HanLP.getSummary(str, 100, separator); - - System.out.println("getSummay old:" + oldSum); - System.out.println("getSummay new:" + newSum); - - assertFalse(oldSum.contains(",")); - assertTrue(newSum.contains(",")); - } - -} +/** + * + */ +package com.hankcs.hanlp.summary; + +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +import java.util.List; + +import org.junit.Test; + +import com.hankcs.hanlp.HanLP; + +/** + * @author gonggawang + * + */ +public class TextRankSentenceTest +{ + private static final String str = "7月21日,渤海海况恶劣,至少发生3起沉船事故,10余名船员危在旦夕。危急时刻,中国海油渤海油田再次行动起来,紧急调配救援力量救起10名遇险人员。" + + "21日一早,一阵急促的铃声,在渤海石油管理局总值班室骤然响起。这是天津海上搜救中心打来的电话。正在值班的作业协调部主管邬礼凯心里“咯噔”一下——天津海上搜救中心称," + + "在“海洋石油932”平台西南方7海里处,一艘货轮遇险、处于倾覆边缘,4名船员命悬一线。时间就是生命!邬礼凯立即组织海上救援力量,立即驰奔事故发生地点。" + + "“滨海264”船接到任务单后,仅一个小时便抵达事故现场。此时,货船已完全倾覆。“滨海264”立刻开展救援工作,仅25分钟便将4人全部救出。"; + + 
private static final String separator = "[。??!!]"; + + @Test + public void testExtractSummary() + { + List oldSum = HanLP.extractSummary(str, 2); + List newSum = HanLP.extractSummary(str, 2, separator); +// System.out.println("exctractSummay old:" + oldSum); +// System.out.println("exctractSummay new:" + newSum); + + assertTrue(oldSum.toString().length() < newSum.toString().length()); + assertFalse(oldSum.toString().contains(",")); + assertTrue(newSum.toString().contains(",")); + } + + @Test + public void testGetSummary() + { + + String oldSum = HanLP.getSummary(str, 100); + String newSum = HanLP.getSummary(str, 100, separator); + +// System.out.println("getSummay old:" + oldSum); +// System.out.println("getSummay new:" + newSum); + + assertFalse(oldSum.contains(",")); + assertTrue(newSum.contains(",")); + } + +} diff --git a/src/test/java/com/hankcs/hanlp/tokenizer/URLTokenizerTest.java b/src/test/java/com/hankcs/hanlp/tokenizer/URLTokenizerTest.java new file mode 100644 index 000000000..ed6105aab --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/tokenizer/URLTokenizerTest.java @@ -0,0 +1,16 @@ +package com.hankcs.hanlp.tokenizer; + +import com.hankcs.hanlp.seg.common.Term; +import junit.framework.TestCase; + +import java.util.List; + +public class URLTokenizerTest extends TestCase +{ + public void testSegment() + { + String text = "随便写点啥吧?abNfxbGRIAUQfGGgvesskbrhEfvCdOHyxfWBq"; + List terms = URLTokenizer.segment(text); + System.out.println(terms); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/tokenizer/lexical/AbstractLexicalAnalyzerTest.java b/src/test/java/com/hankcs/hanlp/tokenizer/lexical/AbstractLexicalAnalyzerTest.java new file mode 100644 index 000000000..54280c35a --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/tokenizer/lexical/AbstractLexicalAnalyzerTest.java @@ -0,0 +1,48 @@ +package com.hankcs.hanlp.tokenizer.lexical; + +import com.hankcs.hanlp.HanLP; +import com.hankcs.hanlp.dictionary.CustomDictionary; +import 
com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer; +import com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer; +import com.hankcs.hanlp.seg.Segment; +import com.hankcs.hanlp.seg.common.Term; +import junit.framework.TestCase; + +import java.io.IOException; +import java.util.List; + +public class AbstractLexicalAnalyzerTest extends TestCase +{ + public void testSegment() throws Exception + { + String[] testCase = new String[]{ + "北川景子参演了林诣彬导演的《速度与激情3》", + "林志玲亮相网友:确定不是波多野结衣?", + "龟山千广和近藤公园在龟山公园里喝酒赏花", + }; + Segment segment = HanLP.newSegment("crf").enableJapaneseNameRecognize(true); + for (String sentence : testCase) + { + List<Term> termList = segment.seg(sentence); + System.out.println(termList); + } + } + + public void testCustomDictionary() throws Exception + { + LexicalAnalyzer analyzer = new PerceptronLexicalAnalyzer(); + String text = "攻城狮逆袭单身狗,迎娶白富美,走上人生巅峰"; + System.out.println(analyzer.segment(text)); + CustomDictionary.add("攻城狮"); + System.out.println(analyzer.segment(text)); + } + + public void testOverwriteTag() throws IOException + { + CRFLexicalAnalyzer analyzer = new CRFLexicalAnalyzer(); + String text = "强行修改词性"; + System.out.println(analyzer.seg(text)); + CustomDictionary.add("修改", "自定义词性"); + System.out.println(analyzer.seg(text)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/tokenizer/pipe/SegmentPipelineTest.java b/src/test/java/com/hankcs/hanlp/tokenizer/pipe/SegmentPipelineTest.java new file mode 100644 index 000000000..8472a8a63 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/tokenizer/pipe/SegmentPipelineTest.java @@ -0,0 +1,23 @@ +package com.hankcs.hanlp.tokenizer.pipe; + +import com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer; +import com.hankcs.hanlp.seg.SegmentPipeline; +import junit.framework.TestCase; + +import java.util.regex.Pattern; + +public class SegmentPipelineTest extends TestCase +{ + private static final Pattern WEB_URL = 
Pattern.compile("((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?(?:(((([a-zA-Z0-9][a-zA-Z0-9\\-]*)*[a-zA-Z0-9]\\.)+((aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(biz|b[abdefghijmnorstvwyz])|(cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(edu|e[cegrstu])|f[ijkmor]|(gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(info|int|i[delmnoqrst])|(jobs|j[emop])|k[eghimnprwyz]|l[abcikrstuvy]|(mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])|(name|net|n[acefgilopruz])|(org|om)|(pro|p[aefghklmnrstwy])|qa|r[eosuw]|s[abcdeghijklmnortuvyz]|(tel|travel|t[cdfghjklmnoprtvwz])|u[agksyz]|v[aceginu]|w[fs]|(δοκιμή|испытание|рф|срб|טעסט|آزمایشی|إختبار|الاردن|الجزائر|السعودية|المغرب|امارات|بھارت|تونس|سورية|فلسطين|قطر|مصر|परीक्षा|भारत|ভারত|ਭਾਰਤ|ભારત|இந்தியா|இலங்கை|சிங்கப்பூர்|பரிட்சை|భారత్|ලංකා|ไทย|テスト|中国|中國|台湾|台灣|新加坡|测试|測試|香港|테스트|한국|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)|y[et]|z[amw]))|((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9]))))(?:\\:\\d{1,5})?)(\\/(?:(?:[a-zA-Z0-
9\\;\\/\\?\\:\\@\\&\\=\\#\\~\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"); + private static final Pattern EMAIL = Pattern.compile("(\\w+(?:[-+.]\\w+)*)@(\\w+(?:[-.]\\w+)*\\.\\w+(?:[-.]\\w+)*)"); + + public void testSegment() throws Exception + { + SegmentPipeline pipeline = new SegmentPipeline(new CRFLexicalAnalyzer()); + pipeline.add(new RegexRecognizePipe(EMAIL, "【邮件】")); + pipeline.add(new RegexRecognizePipe(WEB_URL, "【网址】")); + String text = "HanLP的项目地址是https://github.com/hankcs/HanLP," + + "联系邮箱abc@def.com"; + System.out.println(pipeline.seg(text)); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/utility/MathUtilityTest.java b/src/test/java/com/hankcs/hanlp/utility/MathUtilityTest.java new file mode 100644 index 000000000..2ab1ef0c1 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/utility/MathUtilityTest.java @@ -0,0 +1,98 @@ +package com.hankcs.hanlp.utility; + +import com.hankcs.hanlp.corpus.tag.Nature; +import com.hankcs.hanlp.dictionary.CoreDictionary; +import com.hankcs.hanlp.seg.common.Vertex; +import java.util.HashMap; +import org.junit.Assert; +import org.junit.Test; + +public class MathUtilityTest { + static final double DELTA = 0.0; + + @Test + public void testSumInt() { + Assert.assertEquals(0, MathUtility.sum(new int[0])); + Assert.assertEquals(106, MathUtility.sum(new int[]{1, 32, 73})); + } + + @Test + public void testSumFloat() { + Assert.assertEquals(0.0f, MathUtility.sum(new float[0]), DELTA); + Assert.assertEquals( + 22.5f, + MathUtility.sum(new float[]{1.0f, 5.5f, 16.0f}), + DELTA + ); + } + + @Test + public void testPercentage() { + Assert.assertEquals(75.0, MathUtility.percentage(96.0, 128.0), DELTA); + Assert.assertEquals(302.5, MathUtility.percentage(15.125, 5.0), DELTA); + } + + @Test + public void testAverage() { + Assert.assertEquals( + 2.0, + MathUtility.average(new double[]{1.0, 2.0, 3.0}), + DELTA + ); + } + + @Test + public void testNormalizeExpHashMap() { + HashMap 
predictionScores = + new HashMap<String, Double>(); + predictionScores.put("foo", 1.0); + predictionScores.put("Bar", 2.0); + predictionScores.put("test", 0.5); + + HashMap<String, Double> expected = new HashMap<String, Double>(); + expected.put("foo", 0.23122389762214907); + expected.put("Bar", 0.6285317192117624); + expected.put("test", 0.14024438316608848); + + MathUtility.normalizeExp(predictionScores); + + Assert.assertEquals(expected, predictionScores); + } + + @Test + public void testNormalizeExpDoubleArray() { + double[] predictionScores = {0, 1, 2, 3}; + double[] expected = { + 0.03205860328008499, 0.08714431874203257, + 0.23688281808991013, 0.6439142598879724 + }; + MathUtility.normalizeExp(predictionScores); + + Assert.assertArrayEquals(expected, predictionScores, DELTA); + } + + @Test + public void testCalculateWeight() { + Nature[] natures = new Nature[]{Nature.begin}; + Vertex vertex1 = new Vertex( + "Bar", + new CoreDictionary.Attribute(10), + 55 + ); + Vertex vertex2 = new Vertex( + "foo", + new CoreDictionary.Attribute(natures, new int[]{-1}), + 65678 + ); + + Assert.assertEquals(2.1972155419463637, + MathUtility.calculateWeight(vertex1, vertex2), + DELTA + ); + + Assert.assertEquals(0.10536051123919675, + MathUtility.calculateWeight(new Vertex("foo"), new Vertex("Bar")), + DELTA + ); + } +} diff --git a/src/test/java/com/hankcs/hanlp/utility/SentencesUtilTest.java b/src/test/java/com/hankcs/hanlp/utility/SentencesUtilTest.java new file mode 100644 index 000000000..365b9dfc8 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/utility/SentencesUtilTest.java @@ -0,0 +1,22 @@ +package com.hankcs.hanlp.utility; + +import junit.framework.TestCase; + +public class SentencesUtilTest extends TestCase +{ + public void testToSentenceList() throws Exception + { +// for (String sentence : SentencesUtil.toSentenceList("逗号把句子切分为意群,表示小于分号大于顿号的停顿。", false)) +// { +// System.out.println(sentence); +// } + assertEquals(1, SentencesUtil.toSentenceList("逗号把句子切分为意群,表示小于分号大于顿号的停顿。", false).size()); + assertEquals(2,
SentencesUtil.toSentenceList("逗号把句子切分为意群,表示小于分号大于顿号的停顿。", true).size()); + } + + public void testSplitSentence() throws Exception + { + String content = "我白天是一名语言学习者,晚上是一名初级码农。空的时候喜欢看算法和应用数学书,也喜欢悬疑推理小说,ACG方面喜欢型月、轨迹。喜欢有思想深度的事物,讨厌急躁、拜金与安逸的人\r\n目前在魔都某女校学习,这是我的个人博客。闻道有先后,术业有专攻,请多多关照。"; + assertEquals(12, SentencesUtil.toSentenceList(content).size()); + } +} \ No newline at end of file diff --git a/src/test/java/com/hankcs/hanlp/utility/TestUtility.java b/src/test/java/com/hankcs/hanlp/utility/TestUtility.java new file mode 100644 index 000000000..82fc50900 --- /dev/null +++ b/src/test/java/com/hankcs/hanlp/utility/TestUtility.java @@ -0,0 +1,252 @@ +/* + * Han He + * me@hankcs.com + * 2018-06-23 11:05 PM + * + * + * Copyright (c) 2018, Han He. All Rights Reserved, http://www.hankcs.com/ + * This source is subject to Han He. Please contact Han He for more information. + * + */ +package com.hankcs.hanlp.utility; + +import com.hankcs.hanlp.HanLP; + +import java.io.*; +import java.net.HttpURLConnection; +import java.net.URL; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; + +/** + * @author hankcs + */ +public class TestUtility +{ + static + { + ensureFullData(); + } + + public static void ensureFullData() + { + ensureData(HanLP.Config.PerceptronCWSModelPath, "http://nlp.hankcs.com/download.php?file=data", HanLP.Config.PerceptronCWSModelPath.split("data")[0], false); + } + + /** + * Ensures that name exists, downloading and unzipping it automatically if missing + * + * @param name path + * @param url download URL + * @return the absolute path of name + */ + public static String ensureData(String name, String url) + { + return ensureData(name, url, null, true); + } + + /** + * Ensures that name exists, downloading and unzipping it automatically if missing + * + * @param name path + * @param url download URL + * @return the absolute path of name + */ + public static String ensureData(String name, String url, String parentPath, boolean overwrite) + { + File target = new File(name); + if (target.exists()) return target.getAbsolutePath(); + try + { + File parentFile = parentPath == null ?
new File(name).getParentFile() : new File(parentPath); + if (!parentFile.exists()) parentFile.mkdirs(); + String filePath = downloadFile(url, parentFile.getAbsolutePath()); + if (filePath.endsWith(".zip")) + { + unzip(filePath, parentFile.getAbsolutePath(), overwrite); + } + return target.getAbsolutePath(); + } + catch (Exception e) + { + System.err.printf("Download failed; please try downloading %s to %s manually. Cause:\n", url, target.getAbsolutePath()); + e.printStackTrace(); + System.exit(1); + return null; + } + } + + /** + * Ensures that data/test/name exists + * + * @param name + * @param url + * @return + */ + public static String ensureTestData(String name, String url) + { + return ensureData(String.format("data/test/%s", name), url); + } + + /** + * Downloads a file from a URL + * + * @param fileURL HTTP URL of the file to be downloaded + * @param savePath path of the directory to save the file + * @throws IOException + * @author www.codejava.net + */ + public static String downloadFile(String fileURL, String savePath) + throws IOException + { + System.err.printf("Downloading %s to %s\n", fileURL, savePath); + HttpURLConnection httpConn = request(fileURL); + while (httpConn.getResponseCode() == HttpURLConnection.HTTP_MOVED_PERM || httpConn.getResponseCode() == HttpURLConnection.HTTP_MOVED_TEMP) + { + httpConn = request(httpConn.getHeaderField("Location")); + } + + // always check HTTP response code first + if (httpConn.getResponseCode() == HttpURLConnection.HTTP_OK) + { + String fileName = ""; + String disposition = httpConn.getHeaderField("Content-Disposition"); + String contentType = httpConn.getContentType(); + int contentLength = httpConn.getContentLength(); + + if (disposition != null) + { + // extracts file name from header field + int index = disposition.indexOf("filename="); + if (index > 0) + { + fileName = disposition.substring(index + 10, + disposition.length() - 1); + } + } + else + { + // extracts file name from URL + fileName = new File(httpConn.getURL().getPath()).getName(); + } + +//
System.out.println("Content-Type = " + contentType); +// System.out.println("Content-Disposition = " + disposition); +// System.out.println("Content-Length = " + contentLength); +// System.out.println("fileName = " + fileName); + + // opens input stream from the HTTP connection + InputStream inputStream = httpConn.getInputStream(); + String saveFilePath = savePath; + if (new File(savePath).isDirectory()) + saveFilePath = savePath + File.separator + fileName; + String realPath; + if (new File(saveFilePath).isFile()) + { + System.err.printf("Use cached %s instead.\n", fileName); + realPath = saveFilePath; + } + else + { + saveFilePath += ".downloading"; + + // opens an output stream to save into file + FileOutputStream outputStream = new FileOutputStream(saveFilePath); + + int bytesRead; + byte[] buffer = new byte[4096]; + long start = System.currentTimeMillis(); + int progress_size = 0; + while ((bytesRead = inputStream.read(buffer)) != -1) + { + outputStream.write(buffer, 0, bytesRead); + long duration = (System.currentTimeMillis() - start) / 1000; + duration = Math.max(duration, 1); + progress_size += bytesRead; + int speed = (int) (progress_size / (1024 * duration)); + float ratio = progress_size / (float) contentLength; + float percent = ratio * 100; + int eta = (int) (duration / ratio * (1 - ratio)); + int minutes = eta / 60; + int seconds = eta % 60; + + System.err.printf("\r%.2f%%, %d MB, %d KB/s, ETA %d min %d s", percent, progress_size / (1024 * 1024), speed, minutes, seconds); + } + System.err.println(); + outputStream.close(); + realPath = saveFilePath.substring(0, saveFilePath.length() - ".downloading".length()); + if (!new File(saveFilePath).renameTo(new File(realPath))) + throw new IOException("Failed to move file"); + } + inputStream.close(); + httpConn.disconnect(); + + return realPath; + } + else + { + httpConn.disconnect(); + throw new IOException("No file to download. 
Server replied HTTP code: " + httpConn.getResponseCode()); + } + } + + private static HttpURLConnection request(String url) throws IOException + { + HttpURLConnection httpConn = (HttpURLConnection) new URL(url).openConnection(); + httpConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2"); + return httpConn; + } + + private static void unzip(String zipFilePath, String destDir, boolean overwrite) + { + System.err.println("Unzipping to " + destDir); + File dir = new File(destDir); + // create output directory if it doesn't exist + if (!dir.exists()) dir.mkdirs(); + FileInputStream fis; + //buffer for read and write data to file + byte[] buffer = new byte[4096]; + try + { + fis = new FileInputStream(zipFilePath); + ZipInputStream zis = new ZipInputStream(fis); + ZipEntry ze = zis.getNextEntry(); + while (ze != null) + { + String fileName = ze.getName(); + File newFile = new File(destDir + File.separator + fileName); + if (overwrite || !newFile.exists()) + { + if (ze.isDirectory()) + { + //create directories for sub directories in zip + newFile.mkdirs(); + } + else + { + new File(newFile.getParent()).mkdirs(); + FileOutputStream fos = new FileOutputStream(newFile); + int len; + while ((len = zis.read(buffer)) > 0) + { + fos.write(buffer, 0, len); + } + fos.close(); + //close this ZipEntry + zis.closeEntry(); + } + } + ze = zis.getNextEntry(); + } + //close last ZipEntry + zis.closeEntry(); + zis.close(); + fis.close(); + new File(zipFilePath).delete(); + } + catch (IOException e) + { + e.printStackTrace(); + } + } +} diff --git a/src/test/java/com/hankcs/hanlp/utility/TextUtilityTest.java b/src/test/java/com/hankcs/hanlp/utility/TextUtilityTest.java index 950dae7a3..bd4af4725 100644 --- a/src/test/java/com/hankcs/hanlp/utility/TextUtilityTest.java +++ b/src/test/java/com/hankcs/hanlp/utility/TextUtilityTest.java @@ -1,6 +1,10 @@ package com.hankcs.hanlp.utility; +import 
com.hankcs.hanlp.dictionary.other.CharType; import junit.framework.TestCase; +import org.junit.Test; + +import static org.junit.Assert.assertEquals; public class TextUtilityTest extends TestCase { @@ -9,4 +13,27 @@ public void testIsAllSingleByte() throws Exception assertEquals(false, TextUtility.isAllSingleByte("中文a")); assertEquals(true, TextUtility.isAllSingleByte("abcABC!@#")); } + + @Test + public void testChineseNum() + { + assertEquals(true, TextUtility.isAllChineseNum("两千五百万")); + assertEquals(true, TextUtility.isAllChineseNum("两千分之一")); + assertEquals(true, TextUtility.isAllChineseNum("几十")); + assertEquals(true, TextUtility.isAllChineseNum("十几")); + assertEquals(false,TextUtility.isAllChineseNum("上来")); + } + + @Test + public void testArabicNum() + { + assertEquals(true, TextUtility.isAllNum("2.5")); + assertEquals(true, TextUtility.isAllNum("3600")); + assertEquals(true, TextUtility.isAllNum("500万")); + assertEquals(true, TextUtility.isAllNum("87.53%")); + assertEquals(true, TextUtility.isAllNum("550")); + assertEquals(true, TextUtility.isAllNum("10%")); + assertEquals(true, TextUtility.isAllNum("98.1%")); + assertEquals(false, TextUtility.isAllNum(",")); + } } \ No newline at end of file diff --git a/src/test/java/com/hankcs/test/other/TestTextUtility.java b/src/test/java/com/hankcs/test/other/TestTextUtility.java deleted file mode 100644 index afa8eff7e..000000000 --- a/src/test/java/com/hankcs/test/other/TestTextUtility.java +++ /dev/null @@ -1,35 +0,0 @@ -package com.hankcs.test.other; - -import static org.junit.Assert.*; - -import org.junit.Test; - -import com.hankcs.hanlp.utility.TextUtility; - -public class TestTextUtility -{ - - @Test - public void testChineseNum() - { - assertEquals(Boolean.TRUE, TextUtility.isAllChineseNum("两千五百万")); - assertEquals(Boolean.TRUE, TextUtility.isAllChineseNum("两千分之一")); - assertEquals(Boolean.TRUE, TextUtility.isAllChineseNum("几十")); - assertEquals(Boolean.TRUE, TextUtility.isAllChineseNum("十几")); - 
assertEquals(Boolean.FALSE,TextUtility.isAllChineseNum("上来")); - } - - @Test - public void testArabicNum() - { - assertEquals(Boolean.TRUE, TextUtility.isAllNum("2.5")); - assertEquals(Boolean.TRUE, TextUtility.isAllNum("3600")); - assertEquals(Boolean.TRUE, TextUtility.isAllNum("500万")); - assertEquals(Boolean.TRUE, TextUtility.isAllNum("87.53%")); - assertEquals(Boolean.TRUE, TextUtility.isAllNum("550")); - assertEquals(Boolean.TRUE, TextUtility.isAllNum("10%")); - assertEquals(Boolean.TRUE, TextUtility.isAllNum("98.1%")); - assertEquals(Boolean.FALSE, TextUtility.isAllNum(",")); - } - -} diff --git a/src/test/java/com/hankcs/test/seg/TestTerm.java b/src/test/java/com/hankcs/test/seg/TestTerm.java deleted file mode 100644 index e75cd7511..000000000 --- a/src/test/java/com/hankcs/test/seg/TestTerm.java +++ /dev/null @@ -1,23 +0,0 @@ -package com.hankcs.test.seg; - -import com.hankcs.hanlp.HanLP; -import com.hankcs.hanlp.seg.common.Term; -import junit.framework.TestCase; - -import java.util.List; - -public class TestTerm extends TestCase { - - public void testContains(){ - List<Term> t1 = HanLP.segment("我在天安门广场吃炸鸡"); - List<Term> t2 = HanLP.segment("我在天安门广场喝啤酒"); - for (Term term:t2) - { - if (!t1.contains(term)) - { - t1.add(term); - } - } - System.out.println(t1); - } -}
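
Reviewer note: the expected values in `MathUtilityTest.testNormalizeExpDoubleArray` above are exactly a standard softmax of `{0, 1, 2, 3}`. A minimal, HanLP-independent sketch that reproduces them (the class name `SoftmaxSketch` is hypothetical, for illustration only; it is not claimed to be `MathUtility`'s actual implementation):

```java
import java.util.Arrays;

public class SoftmaxSketch
{
    // softmax: y_i = exp(x_i) / sum_j exp(x_j)
    public static double[] softmax(double[] x)
    {
        double[] y = new double[x.length];
        double sum = 0;
        for (int i = 0; i < x.length; i++)
        {
            y[i] = Math.exp(x[i]);
            sum += y[i];
        }
        for (int i = 0; i < y.length; i++)
        {
            y[i] /= sum;
        }
        return y;
    }

    public static void main(String[] args)
    {
        // For {0, 1, 2, 3} this yields the expected array in
        // testNormalizeExpDoubleArray: {0.032058..., 0.087144..., 0.236882..., 0.643914...}
        System.out.println(Arrays.toString(softmax(new double[]{0, 1, 2, 3})));
    }
}
```

The HashMap variant in `testNormalizeExpHashMap` is consistent with the same formula applied to the map's values (e.g. exp(1.0) / (exp(1.0) + exp(2.0) + exp(0.5)) ≈ 0.231223... for "foo").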