Java: looking up the pinyin of Chinese characters with sourceforge.pinyin4j

2014-12-15 14:07
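In our systems we often need to sort information by first letter (for example, the alphabetical brand list on Taobao Mall). To do that we need a way to look up the pinyin of a Chinese character and then take the first letter of that pinyin.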
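To make the goal concrete, here is a minimal sketch of grouping names by pinyin initial. The class `InitialIndex` and its methods are hypothetical names of mine, and the `lookup` parameter is a stand-in for whatever maps a character to its pinyin readings. With pinyin4j on the classpath, `PinyinHelper::toHanyuPinyinStringArray` should fit that parameter, since it takes a `char` and returns a `String[]` (or null for a non-Chinese character).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Function;

class InitialIndex {
    /**
     * Returns the upper-case initial used to file a name, or '#' when none can be found.
     * The lookup is assumed to return null for characters that have no pinyin.
     */
    static char initialOf(String name, Function<Character, String[]> lookup) {
        if (name == null || name.isEmpty()) {
            return '#';
        }
        char first = name.charAt(0);
        String[] readings = lookup.apply(first);
        // "none0" is the marker the PinyinHelper Javadoc describes for "no pronunciation".
        if (readings != null && readings.length > 0
                && !readings[0].isEmpty() && !"none0".equals(readings[0])) {
            // Only the first reading is used, which may be wrong for polyphonic characters.
            return Character.toUpperCase(readings[0].charAt(0));
        }
        // Plain ASCII letters file under themselves; everything else goes to '#'.
        if (first < 128 && Character.isLetter(first)) {
            return Character.toUpperCase(first);
        }
        return '#';
    }

    /** Groups names into buckets keyed by initial, with the keys in sorted order. */
    static SortedMap<Character, List<String>> group(List<String> names,
                                                    Function<Character, String[]> lookup) {
        SortedMap<Character, List<String>> buckets = new TreeMap<>();
        for (String name : names) {
            buckets.computeIfAbsent(initialOf(name, lookup), k -> new ArrayList<>()).add(name);
        }
        return buckets;
    }
}
```

Taking only the first reading is a simplification: for a polyphonic character the first reading is not necessarily the right one in context.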

We use the open-source sourceforge.pinyin4j package to do this.

It is simple to use:

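The utility class the package provides is the helper class PinyinHelper.java, shown below. It contains the entire public API, and several of its methods convert to different romanization systems. For background on those systems, see http://wenku.baidu.com/view/28dda445b307e87101f696f9.html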

/**
 * This file is part of pinyin4j (http://sourceforge.net/projects/pinyin4j/)
 * and distributed under GNU GENERAL PUBLIC LICENSE (GPL).
 *
 * pinyin4j is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * pinyin4j is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with pinyin4j.
 */

package net.sourceforge.pinyin4j;

import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

/**
 * A class provides several utility functions to convert Chinese characters
 * (both Simplified and Traditional) into various Chinese Romanization
 * representations
 *
 * @author Li Min (xmlerlimin@gmail.com)
 */
public class PinyinHelper
{

    /**
     * Get all unformatted Hanyu Pinyin presentations of a single Chinese
     * character (both Simplified and Traditional)
     *
     * <p>
     * For example, <br/> If the input is '间', the return will be an array with
     * two Hanyu Pinyin strings: <br/> "jian1" <br/> "jian4" <br/> <br/> If the
     * input is '李', the return will be an array with single Hanyu Pinyin
     * string: <br/> "li3"
     *
     * <p>
     * <b>Special Note</b>: If the return is "none0", that means the input
     * Chinese character exists in Unicode CJK table, however, it has no
     * pronunciation in Chinese
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformatted Hanyu Pinyin
     *         presentations with tone numbers; null for non-Chinese character
     *
     */
    static public String[] toHanyuPinyinStringArray(char ch)
    {
        return getUnformattedHanyuPinyinStringArray(ch);
    }

    /**
     * Get all Hanyu Pinyin presentations of a single Chinese character (both
     * Simplified and Traditional)
     *
     * <p>
     * For example, <br/> If the input is '间', the return will be an array with
     * two Hanyu Pinyin strings: <br/> "jian1" <br/> "jian4" <br/> <br/> If the
     * input is '李', the return will be an array with single Hanyu Pinyin
     * string: <br/> "li3"
     *
     * <p>
     * <b>Special Note</b>: If the return is "none0", that means the input
     * Chinese character is in Unicode CJK table, however, it has no
     * pronunciation in Chinese
     *
     * @param ch
     *            the given Chinese character
     * @param outputFormat
     *            describes the desired format of returned Hanyu Pinyin String
     *
     * @return a String array contains all Hanyu Pinyin presentations with tone
     *         numbers; return null for non-Chinese character
     *
     * @throws BadHanyuPinyinOutputFormatCombination
     *             if certain combination of output formats happens
     *
     * @see HanyuPinyinOutputFormat
     * @see BadHanyuPinyinOutputFormatCombination
     *
     */
    static public String[] toHanyuPinyinStringArray(char ch,
            HanyuPinyinOutputFormat outputFormat)
            throws BadHanyuPinyinOutputFormatCombination
    {
        return getFormattedHanyuPinyinStringArray(ch, outputFormat);
    }

    /**
     * Return the formatted Hanyu Pinyin representations of the given Chinese
     * character (both in Simplified and Traditional) in array format.
     *
     * @param ch
     *            the given Chinese character
     * @param outputFormat
     *            Describes the desired format of returned Hanyu Pinyin string
     * @return The formatted Hanyu Pinyin representations of the given codepoint
     *         in array format; null if no record is found in the hashtable.
     */
    static private String[] getFormattedHanyuPinyinStringArray(char ch,
            HanyuPinyinOutputFormat outputFormat)
            throws BadHanyuPinyinOutputFormatCombination
    {
        String[] pinyinStrArray = getUnformattedHanyuPinyinStringArray(ch);

        if (null != pinyinStrArray)
        {
            for (int i = 0; i < pinyinStrArray.length; i++)
            {
                pinyinStrArray[i] = PinyinFormatter.formatHanyuPinyin(pinyinStrArray[i], outputFormat);
            }

            return pinyinStrArray;
        } else
            return null;
    }

    /**
     * Delegate function
     *
     * @param ch
     *            the given Chinese character
     * @return unformatted Hanyu Pinyin strings; null if the record is not found
     */
    private static String[] getUnformattedHanyuPinyinStringArray(char ch)
    {
        return ChineseToPinyinResource.getInstance().getHanyuPinyinStringArray(ch);
    }

    /**
     * Get all unformatted Tongyong Pinyin presentations of a single Chinese
     * character (both Simplified and Traditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformatted Tongyong Pinyin
     *         presentations with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toTongyongPinyinStringArray(char ch)
    {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.TONGYONG_PINYIN);
    }

    /**
     * Get all unformatted Wade-Giles presentations of a single Chinese
     * character (both Simplified and Traditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformatted Wade-Giles presentations
     *         with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toWadeGilesPinyinStringArray(char ch)
    {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.WADEGILES_PINYIN);
    }

    /**
     * Get all unformatted MPS2 (Mandarin Phonetic Symbols 2) presentations of
     * a single Chinese character (both Simplified and Traditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformatted MPS2 (Mandarin Phonetic
     *         Symbols 2) presentations with tone numbers; null for non-Chinese
     *         character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toMPS2PinyinStringArray(char ch)
    {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.MPS2_PINYIN);
    }

    /**
     * Get all unformatted Yale Pinyin presentations of a single Chinese
     * character (both Simplified and Traditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformatted Yale Pinyin
     *         presentations with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toYalePinyinStringArray(char ch)
    {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.YALE_PINYIN);
    }

    /**
     * @param ch
     *            the given Chinese character
     * @param targetPinyinSystem
     *            indicates target Chinese Romanization system should be
     *            converted to
     * @return string representations of target Chinese Romanization system
     *         corresponding to the given Chinese character in array format;
     *         null if error happens
     *
     * @see PinyinRomanizationType
     */
    private static String[] convertToTargetPinyinStringArray(char ch,
            PinyinRomanizationType targetPinyinSystem)
    {
        String[] hanyuPinyinStringArray = getUnformattedHanyuPinyinStringArray(ch);

        if (null != hanyuPinyinStringArray)
        {
            String[] targetPinyinStringArray = new String[hanyuPinyinStringArray.length];

            for (int i = 0; i < hanyuPinyinStringArray.length; i++)
            {
                targetPinyinStringArray[i] = PinyinRomanizationTranslator.convertRomanizationSystem(hanyuPinyinStringArray[i], PinyinRomanizationType.HANYU_PINYIN, targetPinyinSystem);
            }

            return targetPinyinStringArray;
        } else
            return null;
    }

    /**
     * Get all unformatted Gwoyeu Romatzyh presentations of a single Chinese
     * character (both Simplified and Traditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformatted Gwoyeu Romatzyh
     *         presentations with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toGwoyeuRomatzyhStringArray(char ch)
    {
        return convertToGwoyeuRomatzyhStringArray(ch);
    }

    /**
     * @param ch
     *            the given Chinese character
     *
     * @return Gwoyeu Romatzyh string representations corresponding to the given
     *         Chinese character in array format; null if error happens
     *
     * @see PinyinRomanizationType
     */
    private static String[] convertToGwoyeuRomatzyhStringArray(char ch)
    {
        String[] hanyuPinyinStringArray = getUnformattedHanyuPinyinStringArray(ch);

        if (null != hanyuPinyinStringArray)
        {
            String[] targetPinyinStringArray = new String[hanyuPinyinStringArray.length];

            for (int i = 0; i < hanyuPinyinStringArray.length; i++)
            {
                targetPinyinStringArray[i] = GwoyeuRomatzyhTranslator.convertHanyuPinyinToGwoyeuRomatzyh(hanyuPinyinStringArray[i]);
            }

            return targetPinyinStringArray;
        } else
            return null;
    }

    /**
     * Get a string which all Chinese characters are replaced by corresponding
     * main (first) Hanyu Pinyin representation.
     *
     * <p>
     * <b>Special Note</b>: If the return contains "none0", that means that
     * Chinese character is in Unicode CJK table, however, it has no
     * pronunciation in Chinese. <b> This interface will be removed in next
     * release. </b>
     *
     * @param str
     *            A given string contains Chinese characters
     * @param outputFormat
     *            Describes the desired format of returned Hanyu Pinyin string
     * @param seperater
     *            The string is appended after a Chinese character (excluding
     *            the last Chinese character at the end of sentence). <b>Note!
     *            The separator will not appear after a non-Chinese character</b>
     * @return a String identical to the original one but all recognizable
     *         Chinese characters are converted into main (first) Hanyu Pinyin
     *         representation
     *
     * @deprecated DO NOT use it again because the first retrieved pinyin string
     *             may be a wrong pronunciation in a certain sentence context.
     *             <b> This interface will be removed in next release. </b>
     */
    static public String toHanyuPinyinString(String str,
            HanyuPinyinOutputFormat outputFormat, String seperater)
            throws BadHanyuPinyinOutputFormatCombination
    {
        StringBuffer resultPinyinStrBuf = new StringBuffer();

        for (int i = 0; i < str.length(); i++)
        {
            String mainPinyinStrOfChar = getFirstHanyuPinyinString(str.charAt(i), outputFormat);

            if (null != mainPinyinStrOfChar)
            {
                resultPinyinStrBuf.append(mainPinyinStrOfChar);
                if (i != str.length() - 1)
                { // avoid appending at the end
                    resultPinyinStrBuf.append(seperater);
                }
            } else
            {
                resultPinyinStrBuf.append(str.charAt(i));
            }
        }

        return resultPinyinStrBuf.toString();
    }

    /**
     * Get the first Hanyu Pinyin of a Chinese character <b> This function will
     * be removed in next release. </b>
     *
     * @param ch
     *            The given Unicode character
     * @param outputFormat
     *            Describes the desired format of returned Hanyu Pinyin string
     * @return Return the first Hanyu Pinyin of given Chinese character; return
     *         null if the input is not a Chinese character
     *
     * @deprecated DO NOT use it again because the first retrieved pinyin string
     *             may be a wrong pronunciation in a certain sentence context.
     *             <b> This function will be removed in next release. </b>
     */
    static private String getFirstHanyuPinyinString(char ch,
            HanyuPinyinOutputFormat outputFormat)
            throws BadHanyuPinyinOutputFormatCombination
    {
        String[] pinyinStrArray = getFormattedHanyuPinyinStringArray(ch, outputFormat);

        if ((null != pinyinStrArray) && (pinyinStrArray.length > 0))
        {
            return pinyinStrArray[0];
        } else
        {
            return null;
        }
    }

    // ! Hidden constructor
    private PinyinHelper()
    {
    }
}
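The `@deprecated` note on `toHanyuPinyinString` is worth taking seriously: the first reading of a polyphonic character can be wrong in context. One alternative, sketched below under my own assumptions, is to enumerate every combination of readings and let the caller choose or index all of them. `ReadingCombinations` and `allReadings` are hypothetical names, and `lookup` stands in for a per-character lookup such as `PinyinHelper::toHanyuPinyinStringArray`. Unlike the library method, this sketch puts the separator between every pair of characters, Chinese or not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class ReadingCombinations {
    /**
     * Expands a string into every combination of per-character readings.
     * Characters with no reading (lookup returns null or an empty array) are kept as-is.
     * The result grows multiplicatively with the number of polyphonic characters.
     */
    static List<String> allReadings(String text,
                                    Function<Character, String[]> lookup,
                                    String separator) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            String[] readings = lookup.apply(ch);
            if (readings == null || readings.length == 0) {
                readings = new String[] {String.valueOf(ch)};
            }
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String reading : readings) {
                    // No separator before the very first element.
                    next.add(i == 0 ? reading : prefix + separator + reading);
                }
            }
            results = next;
        }
        return results;
    }
}
```

For the two-character input used in the demo further down, a lookup that knows two readings for the first character and one for the second would yield two candidate strings instead of one.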

The romanization systems are listed below:

/**
 * This file is part of pinyin4j (http://sourceforge.net/projects/pinyin4j/)
 * and distributed under GNU GENERAL PUBLIC LICENSE (GPL).
 *
 * pinyin4j is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * pinyin4j is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with pinyin4j.
 */

/**
 *
 */
package net.sourceforge.pinyin4j;

/**
 * The class describes variable Chinese Pinyin Romanization System
 *
 * @author Li Min (xmlerlimin@gmail.com)
 *
 */
class PinyinRomanizationType
{
    /**
     * Hanyu Pinyin system
     */
    static final PinyinRomanizationType HANYU_PINYIN = new PinyinRomanizationType("Hanyu");

    /**
     * Wade-Giles Pinyin system
     */
    static final PinyinRomanizationType WADEGILES_PINYIN = new PinyinRomanizationType("Wade");

    /**
     * Mandarin Phonetic Symbols 2 (MPS2) Pinyin system
     */
    static final PinyinRomanizationType MPS2_PINYIN = new PinyinRomanizationType("MPSII");

    /**
     * Yale Pinyin system
     */
    static final PinyinRomanizationType YALE_PINYIN = new PinyinRomanizationType("Yale");

    /**
     * Tongyong Pinyin system
     */
    static final PinyinRomanizationType TONGYONG_PINYIN = new PinyinRomanizationType("Tongyong");

    /**
     * Gwoyeu Romatzyh system
     */
    static final PinyinRomanizationType GWOYEU_ROMATZYH = new PinyinRomanizationType("Gwoyeu");

    /**
     * Constructor
     */
    protected PinyinRomanizationType(String tagName)
    {
        setTagName(tagName);
    }

    /**
     * @return Returns the tagName.
     */
    String getTagName()
    {
        return tagName;
    }

    /**
     * @param tagName
     *            The tagName to set.
     */
    protected void setTagName(String tagName)
    {
        this.tagName = tagName;
    }

    protected String tagName;
}
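`PinyinRomanizationType` is written in the pre-Java 5 "type-safe enum" style: a fixed set of static instances, each carrying a tag name. Notice that the tag names (Hanyu, Wade, MPSII, Yale, Tongyong) match the element names in the pinyin_mapping.xml excerpt at the end of this post. Purely as an illustration, and not as the library's code, the same idea could be expressed with a modern `enum`. `RomanizationSystem` and `fromTagName` are names I made up.

```java
enum RomanizationSystem {
    HANYU_PINYIN("Hanyu"),
    WADEGILES_PINYIN("Wade"),
    MPS2_PINYIN("MPSII"),
    YALE_PINYIN("Yale"),
    TONGYONG_PINYIN("Tongyong"),
    GWOYEU_ROMATZYH("Gwoyeu");

    // Tag name used to find this system's column in a mapping table.
    private final String tagName;

    RomanizationSystem(String tagName) {
        this.tagName = tagName;
    }

    String getTagName() {
        return tagName;
    }

    /** Finds the constant for a tag name, or returns null if none matches. */
    static RomanizationSystem fromTagName(String tagName) {
        for (RomanizationSystem system : values()) {
            if (system.tagName.equals(tagName)) {
                return system;
            }
        }
        return null;
    }
}
```

An enum makes the tag name immutable, whereas the original class exposes a protected setter.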

Our demo of the API is as follows:

package demo;

import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.HanyuPinyinVCharType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

public class MyPinyinDemo {

    /**
     * @param args
     * @throws BadHanyuPinyinOutputFormatCombination
     */
    public static void main(String[] args) throws BadHanyuPinyinOutputFormatCombination {
        char chineseCharacter = "绿".charAt(0);

        HanyuPinyinOutputFormat outputFormat = new HanyuPinyinOutputFormat();

        outputFormat.setToneType(HanyuPinyinToneType.WITH_TONE_NUMBER); // tone as a digit: 1 for first tone, 2 for second, 3 for third, 4 for fourth, e.g. lu:4
        // outputFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE); // pinyin without tone, e.g. lu:
        // outputFormat.setToneType(HanyuPinyinToneType.WITH_TONE_MARK); // tone mark on the pinyin letter, e.g. lǜ

        outputFormat.setVCharType(HanyuPinyinVCharType.WITH_U_AND_COLON); // how 'ü' is written: output as "u:"
        // outputFormat.setVCharType(HanyuPinyinVCharType.WITH_U_UNICODE); // how 'ü' is written: output as "ü" in Unicode form
        // outputFormat.setVCharType(HanyuPinyinVCharType.WITH_V); // how 'ü' is written: output as "v"

        outputFormat.setCaseType(HanyuPinyinCaseType.UPPERCASE); // upper-case pinyin
        // outputFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE); // lower-case pinyin

        String[] pinyinArray = PinyinHelper.toHanyuPinyinStringArray(chineseCharacter, outputFormat); // pinyin of the character
        for (String str : pinyinArray) { // a polyphonic character yields one entry per reading
            System.out.println(str);
        }

        String pinyinstr = PinyinHelper.toHanyuPinyinString("绿色", outputFormat, "|");
        System.out.println(pinyinstr);

        // output in the other romanization systems
        String[] GwoyeuRomatzyhStringArray = PinyinHelper.toGwoyeuRomatzyhStringArray(chineseCharacter);
        for (String str : GwoyeuRomatzyhStringArray) { // one entry per reading
            System.out.println(str);
        }

        String[] MPS2PinyinStringArray = PinyinHelper.toMPS2PinyinStringArray(chineseCharacter);
        for (String str : MPS2PinyinStringArray) { // one entry per reading
            System.out.println(str);
        }

        String[] TongyongPinyinStringArray = PinyinHelper.toTongyongPinyinStringArray(chineseCharacter);
        for (String str : TongyongPinyinStringArray) { // one entry per reading
            System.out.println(str);
        }

        String[] WadeGilesPinyinStringArray = PinyinHelper.toWadeGilesPinyinStringArray(chineseCharacter);
        for (String str : WadeGilesPinyinStringArray) { // one entry per reading
            System.out.println(str);
        }

        String[] YalePinyinStringArray = PinyinHelper.toYalePinyinStringArray(chineseCharacter);
        for (String str : YalePinyinStringArray) { // one entry per reading
            System.out.println(str);
        }
    }
}

Output:

LU:4
LU4
LU:4|SE4
liuh
luh
liu4
lu4
lyu4
lu4
lu:4
lu4
lyu4
lu4
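The demo only exercises WITH_TONE_NUMBER. To show what the WITH_TONE_MARK option means, here is a rough, library-independent sketch that turns a numbered syllable such as lu:4 into its tone-marked form. `ToneMarkSketch` is a hypothetical class of mine and the placement rule is the usual textbook one (a or e takes the mark, then the o of ou, otherwise the last vowel). I am not claiming this is how pinyin4j implements it. If I remember the library correctly, requesting WITH_TONE_MARK together with WITH_U_AND_COLON or WITH_V throws BadHanyuPinyinOutputFormatCombination, so the commented-out WITH_TONE_MARK line in the demo would probably also need WITH_U_UNICODE; treat that as something to verify.

```java
import java.text.Normalizer;
import java.util.Locale;

class ToneMarkSketch {
    // Combining diacritics for tones 1 to 4: macron (U+0304), acute (U+0301),
    // caron (U+030C), grave (U+0300).
    private static final char[] COMBINING = {'\u0304', '\u0301', '\u030C', '\u0300'};
    // The six pinyin vowels; the last one is u with diaeresis (U+00FC).
    private static final String VOWELS = "aeiou\u00FC";

    /** Converts e.g. "lu:4" or "jian1" to tone-marked pinyin; tone 5 or 0 just drops the digit. */
    static String toToneMark(String numbered) {
        String s = numbered.toLowerCase(Locale.ROOT).replace("u:", "\u00FC");
        if (s.isEmpty()) {
            return s;
        }
        char last = s.charAt(s.length() - 1);
        if (last < '0' || last > '9') {
            return s; // no tone digit, nothing to do
        }
        String syllable = s.substring(0, s.length() - 1);
        int tone = last - '0';
        int pos = markPosition(syllable);
        if (tone < 1 || tone > 4 || pos < 0) {
            return syllable; // neutral tone, or no vowel to carry a mark
        }
        String marked = syllable.substring(0, pos + 1) + COMBINING[tone - 1]
                + syllable.substring(pos + 1);
        // NFC folds "vowel + combining mark" into a single precomposed character.
        return Normalizer.normalize(marked, Normalizer.Form.NFC);
    }

    /** Index of the vowel that carries the tone mark, or -1 if there is none. */
    static int markPosition(String syllable) {
        int pos = syllable.indexOf('a');
        if (pos < 0) pos = syllable.indexOf('e');
        if (pos < 0) pos = syllable.indexOf("ou");
        for (int i = syllable.length() - 1; pos < 0 && i >= 0; i--) {
            if (VOWELS.indexOf(syllable.charAt(i)) >= 0) {
                pos = i;
            }
        }
        return pos;
    }
}
```

Building the marked letter from a base vowel plus a combining diacritic avoids a hand-written table of all 24 accented vowels.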

The package also ships with a demo of its own, Pinyin4jAppletDemo.java.

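As for the implementation, it is quite simple: everything is driven by dictionary files that map Chinese characters to pinyin. unicode_to_hanyu_pinyin.txt maps the Unicode code of each Chinese character to its pinyin, pinyin_mapping.xml maps Hanyu Pinyin to the other romanization systems, and pinyin_Gwoyeu_mapping.xml maps Hanyu Pinyin to Gwoyeu Romatzyh. The formats are shown below; once these tables have been compiled, the rest is easy to implement.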

<?xml version="1.0"?>
<pinyin_mapping>
    <item>
        <Hanyu>a</Hanyu>
        <Wade>a</Wade>
        <MPSII>a</MPSII>
        <Yale>a</Yale>
        <Tongyong>a</Tongyong>
    </item>
    <item>
        <Hanyu>ai</Hanyu>
        <Wade>ai</Wade>
        <MPSII>ai</MPSII>
        <Yale>ai</Yale>
        <Tongyong>ai</Tongyong>
    </item>
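To illustrate how a table shaped like this excerpt can drive the conversion, here is a small sketch that loads the items with the JDK's DOM parser and converts a numbered Hanyu Pinyin syllable. `MappingTableSketch`, `load`, and `convert` are hypothetical names, and this is not pinyin4j's actual loader. The `lu:` item in the sample is my own assumption, inferred from the demo output above (lu:4 became liu4 in MPS2 and lyu4 in Yale and Tongyong); it is not copied from the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MappingTableSketch {
    // Sample data: the first item is from the excerpt, the second is an assumption
    // inferred from the demo output.
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>"
            + "<pinyin_mapping>"
            + "<item><Hanyu>a</Hanyu><Wade>a</Wade><MPSII>a</MPSII>"
            + "<Yale>a</Yale><Tongyong>a</Tongyong></item>"
            + "<item><Hanyu>lu:</Hanyu><Wade>lu:</Wade><MPSII>liu</MPSII>"
            + "<Yale>lyu</Yale><Tongyong>lyu</Tongyong></item>"
            + "</pinyin_mapping>";

    /** Builds a Hanyu-to-target lookup table; targetTag is an element name such as "Yale". */
    static Map<String, String> load(String xml, String targetTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> table = new HashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList from = item.getElementsByTagName("Hanyu");
            NodeList to = item.getElementsByTagName(targetTag);
            if (from.getLength() > 0 && to.getLength() > 0) {
                table.put(from.item(0).getTextContent().trim(),
                        to.item(0).getTextContent().trim());
            }
        }
        return table;
    }

    /** Converts e.g. "lu:4" by mapping the syllable and re-attaching the tone digit. */
    static String convert(String hanyuWithTone, Map<String, String> table) {
        if (hanyuWithTone == null || hanyuWithTone.isEmpty()) {
            return null;
        }
        int last = hanyuWithTone.length() - 1;
        char tone = hanyuWithTone.charAt(last);
        boolean hasTone = tone >= '0' && tone <= '9';
        String mapped = table.get(hasTone ? hanyuWithTone.substring(0, last) : hanyuWithTone);
        if (mapped == null) {
            return null; // syllable not in the table
        }
        return hasTone ? mapped + tone : mapped;
    }
}
```

For example, `convert("lu:4", load(SAMPLE, "Yale"))` should produce lyu4, matching the Yale line of the demo output.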

<pinyin_gwoyeu_mapping>
    <item>
        <Hanyu>a</Hanyu>
        <Gwoyeu_I>a</Gwoyeu_I>
        <Gwoyeu_II>ar</Gwoyeu_II>
        <Gwoyeu_III>aa</Gwoyeu_III>
        <Gwoyeu_IV>ah</Gwoyeu_IV>
        <Gwoyeu_V>.a</Gwoyeu_V>
    </item>
    <item>
        <Hanyu>ai</Hanyu>
        <Gwoyeu_I>ai</Gwoyeu_I>
        <Gwoyeu_II>air</Gwoyeu_II>
        <Gwoyeu_III>ae</Gwoyeu_III>
        <Gwoyeu_IV>ay</Gwoyeu_IV>
        <Gwoyeu_V>.ai</Gwoyeu_V>
    </item>
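The Gwoyeu table differs from the previous one in that each syllable has one spelling per tone (Gwoyeu_I through Gwoyeu_V), which is why the Gwoyeu lines in the demo output (liuh, luh) carry no tone digit. A minimal sketch of that lookup, using only the two items from the excerpt and hypothetical names (`GwoyeuSketch`, `convert`), might look like this. It uses an in-memory table for brevity and is not the library's implementation.

```java
import java.util.HashMap;
import java.util.Map;

class GwoyeuSketch {
    // Hanyu syllable -> spellings for tones 1 to 5, copied from the excerpt above.
    private static final Map<String, String[]> TABLE = new HashMap<>();
    static {
        TABLE.put("a", new String[] {"a", "ar", "aa", "ah", ".a"});
        TABLE.put("ai", new String[] {"ai", "air", "ae", "ay", ".ai"});
    }

    /**
     * Converts a numbered Hanyu Pinyin syllable such as "ai4" to Gwoyeu Romatzyh.
     * Returns null if the tone digit is missing or the syllable is not in the table.
     */
    static String convert(String hanyuWithTone) {
        if (hanyuWithTone == null || hanyuWithTone.length() < 2) {
            return null;
        }
        int last = hanyuWithTone.length() - 1;
        int tone = hanyuWithTone.charAt(last) - '0';
        if (tone < 1 || tone > 5) {
            return null;
        }
        String[] forms = TABLE.get(hanyuWithTone.substring(0, last));
        // The tone selects the column; the digit itself does not appear in the result.
        return forms == null ? null : forms[tone - 1];
    }
}
```

With the excerpt's data, `convert("a2")` gives ar and `convert("ai4")` gives ay.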