Posted 2011-12-17. Last updated 2011-12-17.
The Chinese Character Web API provides a programmatic way to get information about Chinese characters through a live interface on the Web. The data is from the Unihan Database as provided by The Unicode Consortium.
It's intended to be simple. It doesn't use SOAP or XML. It's very much inspired by REST and uses JSON for data. Responses have Cross-Origin Resource Sharing (CORS) enabled, allowing unfettered access from any browser that supports CORS (almost all do).
I did this with a leaning towards Mandarin Chinese (my language of interest), though the database includes information about Cantonese, Japanese, Korean, and Vietnamese.
If the API is not RESTful, then it is at least RESTlike. One apparent disqualifier is that the API is not hypertext-driven, whatever that means. And the data are read-only: GET is the method to use (POST works too, but only as a synonym for GET), while PUT and DELETE are not supported.
The Unihan Database covers several ranges of characters, but I found only one of them to be useful: CJK Unified Ideographs.
Actual counts are a bit smaller because of holes in the ranges, but the grand total is upwards of 93,696 characters. The Web service I created offers only the 20,902 characters in that one useful range (CJK Unified Ideographs). Any tool geared towards characters people know, use, and have font support for will need to support at most 9,000 of those.
I deployed this on a LAMP server (Linux, Apache, MySQL, PHP). Essentially, I imported the Unihan Database into a MySQL database and then wrote the API in PHP.
In order to have a single PHP file answer to RESTlike URLs, I used URL rewriting. My primary development environment is Windows, and I found a very handy page in the Zend Framework Programmer's Reference Guide that described exactly how to do equivalent URL rewriting on both Apache and Microsoft IIS.
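For reference, here's the sort of Apache setup that accomplishes this. It's a sketch of the common front-controller pattern, not this site's actual configuration, and the file name index.php is just an assumption:

    # Any request that isn't an existing file or directory is handed to a
    # single PHP front controller, which inspects the requested URL itself.
    RewriteEngine On
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^.*$ index.php [QSA,L]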
A request will generally consist of a URL and parameters. The URL specifies the resource, and the parameters specify (e.g.) the database fields and the filtering you'd like. For example, if you wanted all characters that use Kangxi radical 85, the basic URL is:
http://ccdb.hemiola.com/characters/radicals/85
If you were interested in GB2312 characters in the above collection,
you'd simply tack on a filter=gb parameter:
http://ccdb.hemiola.com/characters/radicals/85?filter=gb
If you wanted the definition and Pīnyīn for each character, you'd add a
fields=kDefinition,kMandarin parameter:
http://ccdb.hemiola.com/characters/radicals/85?filter=gb&fields=kDefinition,kMandarin
If all you wanted to do was count the characters, rather than return
them, just use the count parameter. Examples:
http://ccdb.hemiola.com/characters/radicals/85?filter=gb&count
http://ccdb.hemiola.com/characters/radicals/85?count
http://ccdb.hemiola.com/characters?count
By the way, CCDB stands for Chinese Character Database.
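To show what this looks like from a browser (which is where CORS matters), here's a minimal sketch that requests one of the URLs above and logs the parsed JSON. It doesn't assume anything about the shape of the response; it just logs whatever comes back so you can inspect the field names:

    // Sketch: request the GB2312 characters under Kangxi radical 85, with
    // their definitions and Pīnyīn, and log the parsed JSON response.
    var url = "http://ccdb.hemiola.com/characters/radicals/85" +
              "?filter=gb&fields=kDefinition,kMandarin";

    var xhr = new XMLHttpRequest();
    xhr.open("GET", url, true);
    xhr.onload = function () {
        if (xhr.status === 200) {
            var data = JSON.parse(xhr.responseText);
            console.log(data); // examine this to see how the data are named
        }
    };
    xhr.send();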
The filtering available is what seemed practical and useful to me. It's by no means exhaustive. Still, it's a little complicated because it supports AND, OR, and NOT.
A basic filter is added to the URL as a parameter. For example: filter=gb. Note that if you leave off the filter, you get all 20,902 of the CJK Unified Ideographs.
The filter names, with some notes:
gb is GB2312 (6,763 characters)
Why focus on a character set devised in 1980, when there are newer ones? The newer ones, GBK and GB18030 (the two being almost identical), include the full set of 20,902 CJK Unified Ideographs, plus a few more. Examining GB2312 provides insight into the language that bigger character sets do not. Sticking with GB2312 helps avoid overload when trying to get a handle on the language.
big5 (13,061 characters) = big5a (5,401 characters) + big5b (7,660 characters)
Big5 is split into two parts, though this does not seem to be readily acknowledged. Each part makes a complete pass through "the" dictionary. The first pass has 5,401 characters; the second pass has 7,660 characters. The first set of characters (what I call big5a) seems to be the only useful part of Big5, at least for a learner of Chinese. I'm skeptical that even a language native has much use for, or knowledge of, the second set of characters.
sjis is Shift JIS (6,356 characters)
I threw this in because it was easy, though I know very little about the Japanese language.
simplified (2,549 characters)
simplifiable (2,621 characters)
Characters that are simplified have a traditional variant. Characters that are simplifiable have a simplified variant.
You might wonder where the filter for traditional is. traditional is !simplified (i.e., not simplified; see below for more on this syntax). A bit more detail: the filter for traditional characters that don't have simplified variants is !simplified+!simplifiable.
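For example, to count those characters, reusing the URL pattern shown earlier (an illustrative request, not a separately documented one):
http://ccdb.hemiola.com/characters?filter=!simplified+!simplifiable&count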
simplified[] (77 characters)
simplifiable[] (6 characters)
Filters can be combined using AND (+), OR (|), and NOT (!). Here are a few examples:
filter=gb+big5a (characters that are both gb and big5a)
filter=gb|big5a (characters that are either gb or big5a, or both)
filter=gb+simplified (gb characters that are simplified)
filter=gb+!simplified (gb characters that are not simplified)
filter=gb|simplified (characters that are either gb or simplified, or both)
Notes
Here's an example of precedence. Let's say you wanted this filter:
gb + (simplified | simplifiable)
The wrong way to do it is gb+simplified|simplifiable: there's no parenthesis syntax, and AND binds tighter than OR, so that reads as (gb+simplified)|simplifiable. You need to use a little Boolean arithmetic, distributing the AND over the OR, to arrive at this:
gb+simplified|gb+simplifiable
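Dropped into a URL, again just reusing the earlier pattern, that filter would look like this:
http://ccdb.hemiola.com/characters?filter=gb+simplified|gb+simplifiable&count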
Fields come from the Unihan Database, with a few extra ones.
I've been using the Unihan Database for over 15 years, and I tend to
assume familiarity with its field names. The Unihan field names are of
key importance to accessing information. They are documented here: http://www.unicode.org/reports/tr38/.
All fields begin with a 'k'. Examples of common fields are kDefinition, kMandarin, kRSKangXi, and kRSUnicode.
I added a few fields:
uvalue
This is the Unicode value of the character, in the U+4E00 style.
string
This is the character as a display string. In JSON, this ends up
being a Unicode escape code, such as \u4e00.
altMandarin
altDefinition
These are alternatives for the kMandarin and kDefinition fields just
for the radicals. I wanted a way to display single, unambiguous
Pīnyīn and a very short definition (often a single word) for the
radicals. I built these fields from http://en.wikipedia.org/wiki/Kangxi_radicals.
When requesting character counts or stroke count information, a few fields carry that information: radical, strokes, and count. Recommendation: examine the JSON and see how the data are named.
If you request characters but don't specify any fields, you'll get one
field: the string field.
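As noted above, the string field travels through the JSON as a Unicode escape, and parsing the JSON turns it back into the actual character. A tiny illustration, using a made-up one-entry response purely to show the escaping (inspect a real response for its exact structure):

    // Hypothetical one-entry response, just to illustrate the \u escape.
    var json = '[{"string": "\\u4e00"}]';
    var data = JSON.parse(json);
    console.log(data[0].string);              // 一 (the actual character)
    console.log(data[0].string === "\u4e00"); // true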
Use the count parameter to return a character count rather than the characters themselves.
By default, all radical and stroke information comes from the kRSKangXi field. To instruct the backend to instead use the kRSUnicode field, use the kRSUnicode parameter.
Of the 20,902 characters, kRSKangXi and kRSUnicode differ for 333 of them (1.6%). Of the 6,763 GB2312 characters, there are 154 differences (2.3%). Of the 5,401 big5a characters, there are 47 differences (0.9%).
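For example, assuming the parameter is simply appended the way count is (the exact form isn't spelled out here), counting the radical-85 characters by their kRSUnicode radical would look something like:
http://ccdb.hemiola.com/characters/radicals/85?kRSUnicode&count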
Here's a list of what's available. In the cases that show radical 85,
one example is being used to demonstrate a pattern. Details follow.
/version
/fields
/characters
In general, filters apply to the rest of the URLs (/characters, /strokes, and /sounds), but it only makes sense to request fields for the /characters URL.
Notes
I've created a JavaScript library of utility functions, mainly for help dealing with Unihan data, such as parsing and comparing radical/stroke information, and converting ASCII Mandarin and Cantonese pronunciations to use the writing rules and diacritics of Pīnyīn and the Yale system, respectively.
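To give a flavor of the kind of conversion involved, here's a minimal sketch, not the library's actual code, that turns a tone-numbered Mandarin syllable such as zhong1 or lv4 into Pīnyīn with diacritics:

    // Minimal sketch: convert an ASCII Mandarin syllable with a trailing
    // tone number (e.g. "zhong1", "hao3", "lv4") into Pīnyīn with diacritics.
    var TONE_MARKS = {
        a: ["ā", "á", "ǎ", "à"],
        e: ["ē", "é", "ě", "è"],
        i: ["ī", "í", "ǐ", "ì"],
        o: ["ō", "ó", "ǒ", "ò"],
        u: ["ū", "ú", "ǔ", "ù"],
        "ü": ["ǖ", "ǘ", "ǚ", "ǜ"]
    };

    function asciiToPinyin(syllable) {
        var match = /^([a-z:ü]+)([1-5])$/.exec(syllable.toLowerCase());
        if (!match) return syllable;                  // not in the expected form
        var letters = match[1].replace(/u:|v/g, "ü"); // common ASCII spellings of ü
        var tone = parseInt(match[2], 10);
        if (tone === 5) return letters;               // neutral tone: no mark

        // Placement rules: 'a' or 'e' takes the mark if present; in "ou" the
        // 'o' takes it; otherwise the last vowel takes it.
        var index = -1;
        if (letters.indexOf("a") >= 0) {
            index = letters.indexOf("a");
        } else if (letters.indexOf("e") >= 0) {
            index = letters.indexOf("e");
        } else if (letters.indexOf("ou") >= 0) {
            index = letters.indexOf("ou");
        } else {
            for (var i = 0; i < letters.length; i++) {
                if ("iouü".indexOf(letters.charAt(i)) >= 0) index = i;
            }
        }
        if (index < 0) return letters;                // no vowel to mark

        var marked = TONE_MARKS[letters.charAt(index)][tone - 1];
        return letters.slice(0, index) + marked + letters.slice(index + 1);
    }

    // asciiToPinyin("zhong1") -> "zhōng"
    // asciiToPinyin("hao3")   -> "hǎo"
    // asciiToPinyin("lv4")    -> "lǜ"

This only handles single Mandarin syllables; the library also deals with Cantonese and the Yale system, which follow different rules.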
You can leave comments or questions on my blog.