Java Chinese Text Processing Study Notes (3)

Source: Internet. Published: 2011-09-06 14:26:19. Views: 13518.


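In the third experiment, the character stream is encoded as UTF-8 and written to the third test file, hello.utf8.html. UTF-8 leaves the English text unchanged (one byte per character), but each Chinese character takes 3 bytes instead of the 2 bytes it takes in GB2312, so the Chinese portion needs 50% more storage than under GB2312.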
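The size difference can be checked directly by encoding the same text both ways. The following is a minimal sketch, not the code used for the original experiment: the class name EncodingSizeDemo and the helper encodedLength are made up for illustration, and it assumes the running JDK provides the GB2312 charset (it belongs to the extended charsets, which a full JDK normally ships).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    // Number of bytes needed to store the text in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: GB2312 is available in this JDK (extended charsets).
        Charset gb2312 = Charset.forName("GB2312");
        String english = "Hello world ";
        // The four Chinese characters of the test string, written as
        // Unicode escapes so the source file encoding does not matter.
        String chinese = "\u4e16\u754c\u4f60\u597d";

        System.out.println("English GB2312 bytes: " + encodedLength(english, gb2312));
        System.out.println("English UTF-8 bytes:  " + encodedLength(english, StandardCharsets.UTF_8));
        System.out.println("Chinese GB2312 bytes: " + encodedLength(chinese, gb2312));
        System.out.println("Chinese UTF-8 bytes:  " + encodedLength(chinese, StandardCharsets.UTF_8));
    }
}
```

For the four Chinese characters this gives 8 bytes in GB2312 and 12 bytes in UTF-8, while the English part is 12 bytes in both.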
========Testing2: reading and decoding from files========
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好 length=20
char[0]=’H’ byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]=’e’ byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]=’o’ byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=’ ’ byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]=’w’ byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]=’o’ byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]=’r’ byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]=’d’ byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=’ ’ byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]=’Ê’ byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT
char[13]=’À’ byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT
char[14]=’½’ byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[15]=’ç’ byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]=’Ä’ byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
char[17]=’ã’ byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT
char[18]=’º’ byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT
char[19]=’Ã’ byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT
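Here the intermediate file hello.orig.html is read back using the system default encoding. Although the text is read byte by byte (each char holds only half of a Chinese character, which is why the length is 20 rather than 16), the bytes are restored intact on output, so the displayed string shows no error. This is in fact why applications such as PHP rarely run into character-set problems: they handle text as a byte stream from end to end and reproduce the input faithfully. The price of working this way is losing control over the individual characters.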
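The byte-transparent behaviour described above can be reproduced with a single-byte charset. This is a minimal sketch under one assumption: that the "system default encoding" in the test behaved like ISO-8859-1, which is what the LATIN_1_SUPPLEMENT lines in the dump suggest. The class and method names are hypothetical; the eight negative byte values are copied from the [test 2-1] dump.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteTransparentDemo {
    // The eight non-ASCII byte values shown in the [test 2-1] dump (chars 12..19).
    static final byte[] TAIL = {-54, -64, -67, -25, -60, -29, -70, -61};

    // "Hello world " in ASCII followed by the eight bytes above: 20 bytes in total.
    static byte[] sampleBytes() {
        byte[] head = "Hello world ".getBytes(StandardCharsets.US_ASCII);
        byte[] all = Arrays.copyOf(head, head.length + TAIL.length);
        System.arraycopy(TAIL, 0, all, head.length, TAIL.length);
        return all;
    }

    // Decode as ISO-8859-1, encode again, and compare with the input bytes.
    static boolean survivesRoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        return Arrays.equals(raw, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] raw = sampleBytes();
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        // One char per byte, so each Chinese character becomes two chars.
        System.out.println("length=" + decoded.length());
        System.out.println("char[12] code=" + (int) decoded.charAt(12));
        System.out.println("round trip intact=" + survivesRoundTrip(raw));
    }
}
```

The decoded string has length 20 and char[12] has code 202, matching the dump, and the round trip returns the original bytes. The string is only usable as an opaque container, though: length, indexing, and substring all operate on half characters.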
[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ???? length=16
char[0]=’H’ byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]=’e’ byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]=’o’ byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=’ ’ byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]=’w’ byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]=’o’ byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]=’r’ byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]=’d’ byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=’ ’ byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]=’?’ byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]=’?’ byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]=’?’ byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]=’?’ byte=63 \u3F short=63 \u3F BASIC_LATIN
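The worst part is that on output these ’?’ really are question marks, char(63). Once the data has ended up in this state, it is truly beyond recovery.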
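One common way such literal question marks get into data is encoding text with a charset that cannot represent it: Java's String.getBytes substitutes the charset's replacement byte without complaint. The sketch below illustrates that mechanism only; it is an assumption that the file in this experiment was damaged this way, since the step that wrote it is not shown here. The class and method names are hypothetical.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    // String.getBytes silently replaces unmappable characters;
    // for ISO-8859-1 the replacement is '?' (byte value 63).
    static byte[] encodeLossy(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // A strict encoder reports the problem instead of hiding it.
    static boolean canEncodeStrictly(String text) {
        try {
            StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The four Chinese characters of the test string, as Unicode escapes.
        String chinese = "\u4e16\u754c\u4f60\u597d";
        for (byte b : encodeLossy(chinese)) {
            System.out.println("byte=" + b);
        }
        System.out.println("strict encoding possible=" + canEncodeStrictly(chinese));
    }
}
```

Every byte printed is 63, and no decoding step can tell these bytes apart from question marks that were typed on purpose. Using a strict encoder at write time turns the silent loss into an error that can be handled.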
[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ???? length=16
char[0]=’H’ byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]=’e’ byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]=’l’ byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]=’o’ byte=111 \u6F short=111 \u6F BASIC_LATIN
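For readers who want to reproduce dumps like the ones above, the per-character lines can be generated along the following lines. This is a reconstruction guessed from the log format, not the original test program: the class CharDumpDemo and the method describe are hypothetical, and straight apostrophes are used around the character.

```java
class CharDumpDemo {
    // Formats one character in the style of the log lines:
    // the char itself, its low byte (signed), its 16-bit value, and its Unicode block.
    static String describe(int index, char c) {
        byte b = (byte) c;   // low 8 bits; values above 127 print as negative
        short s = (short) c; // all 16 bits of the char
        return "char[" + index + "]='" + c + "'"
                + " byte=" + b + " \\u" + Integer.toHexString(b).toUpperCase()
                + " short=" + s + " \\u" + Integer.toHexString(s).toUpperCase()
                + " " + Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // 'H' plus the Latin-1 character with code 202 seen in [test 2-1].
        String text = "H\u00CA";
        for (int i = 0; i < text.length(); i++) {
            System.out.println(describe(i, text.charAt(i)));
        }
    }
}
```

Integer.toHexString widens a negative byte to a 32-bit int first, which explains the eight-digit FFFFFFCA values in the byte column of the dump.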
