This page is for notes about Taka from an AcchaNi perspective.

Text Format

Jim Breen's kanjidic and edict files are primarily text-based. This makes them easy for non-technical people to look at, read, and use—even if they don't make use of all the data.

For Taka, I'd like a similar text-based format that is easy to read, understand, and edit, and that is as expressive as other representations of the same data. Here's one example of what that format might look like:

E0060 串
        E0998
        E0648 口 T(10,13) S(80,37)
        E0648 口 T(0,51) S(100,37)
E0060 串 V2
        E0998
        E0648 口 T(0,13) S(100,37)
        E0648 口 T(0,51) S(100,37)
E0061 輩
        E1552 非 T(0,3) S(100,52)
        E1503 車 T(0,55) S(100,42)
E0062 S(100,77)
        M(34.227085,8.35962) L(34.227085,35.1735)
        M(34.227085,8.35962) L(73.974785,8.35962) L(73.974785,35.1735)
        M(34.227085,35.1735) L(73.974785,35.1735)
        M(18.45421,15.6151) L(18.45421,50.6309) L(78.39121,50.6309) L(78.39121,69.8738)
        M(6.78233,69.8738) L(93.2177,69.8738)

A basic description of the format (fields in square brackets are optional):

Eelement_id [kanji_character] [Vglyph_variant] [S(width,height)]

Each non-indented line represents a single "glyph". An "element" is an etymological entity that may represent a particular kanji character, may be a subelement that appears in another kanji character, or in some cases both. An element may have multiple "glyph variants" that are each written differently, despite retaining a common meaning.

The "element id", a unique integer identifier, is the required first token in the line, preceded by the capital letter 'E'.

The next token is the kanji character that this element represents, if any. Some elements are only present as subelements in other kanji characters, so they will not have this token. (See, for example, E0062 above.)

The third token is the "glyph variant", a single integer preceded by the capital letter 'V'. If the variant is '1' (as most are), this token can be omitted.

The fourth token is the "size" or "scale" attribute. Most elements are normalized to "S(100,100)" (a width of 100 and a height of 100), but there are a few elements (typically subelements) that are drawn to a different initial scale. Clients that wish to graphically display this character may want to normalize to "S(100,100)" themselves, or they may wish to display the character as-is. Here is an example of E2385, both non-normalized (as it might appear with another element), and normalized to S(100,100).
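To make the header tokens and the normalization step concrete, here is a minimal Python sketch. The names (GlyphHeader, parse_header, normalization_scale) are hypothetical, and the sketch assumes that a missing 'S' token means S(100,100), which the notes above imply but do not state outright.

```python
import re
from typing import NamedTuple, Optional, Tuple


class GlyphHeader(NamedTuple):
    element_id: int
    kanji: Optional[str]          # None for subelement-only entries
    variant: int                  # 1 when the 'V' token is omitted
    size: Tuple[float, float]     # (width, height)


def parse_header(line: str) -> GlyphHeader:
    """Parse 'Eelement_id [kanji_character] [Vglyph_variant] [S(width,height)]'."""
    tokens = line.split()
    if not tokens or not re.fullmatch(r"E\d+", tokens[0]):
        raise ValueError(f"not a glyph header: {line!r}")
    element_id = int(tokens[0][1:])
    # Assumption: an omitted S token means the glyph is already 100 x 100.
    kanji, variant, size = None, 1, (100.0, 100.0)
    for tok in tokens[1:]:
        m = re.fullmatch(r"S\(([\d.]+),([\d.]+)\)", tok)
        if m:
            size = (float(m.group(1)), float(m.group(2)))
        elif re.fullmatch(r"V\d+", tok):
            variant = int(tok[1:])
        else:
            kanji = tok  # anything else is taken to be the kanji character
    return GlyphHeader(element_id, kanji, variant, size)


def normalization_scale(size: Tuple[float, float]) -> Tuple[float, float]:
    """Per-axis factors that stretch a glyph of the given size to S(100,100)."""
    width, height = size
    return (100.0 / width, 100.0 / height)


header = parse_header("E0062 S(100,77)")
scale_x, scale_y = normalization_scale(header.size)
```

A client that prefers to display the glyph as-is would simply skip the scaling step and use header.size directly.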

Subelements and Strokes

Beneath the non-indented line is a series of lines that start with '\t' (the TAB character). These lines can be either "subelement" lines, or "stroke" lines.
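The header/body split described above could be read with something like the following sketch. The function name split_glyph_blocks is hypothetical, and the sketch accepts leading spaces as well as TAB, on the assumption that tabs are sometimes expanded when the text is copied around.

```python
from typing import Iterable, List, Tuple


def split_glyph_blocks(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Group a file into (header_line, [indented_lines]) pairs."""
    blocks: List[Tuple[str, List[str]]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line[0] in "\t ":
            # Indented line: belongs to the most recent glyph header.
            if not blocks:
                raise ValueError("indented line before any glyph header")
            blocks[-1][1].append(line.strip())
        else:
            # Non-indented line: starts a new glyph.
            blocks.append((line.strip(), []))
    return blocks


sample = (
    "E0061 輩\n"
    "\tE1552 非 T(0,3) S(100,52)\n"
    "\tE1503 車 T(0,55) S(100,42)\n"
    "E0062 S(100,77)\n"
    "\tM(6.78233,69.8738) L(93.2177,69.8738)\n"
)
blocks = split_glyph_blocks(sample.split("\n"))
```

Each indented line can then be classified by its first character: 'E' for a subelement line, anything else for a stroke line.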

Subelement lines always begin with the capital letter 'E', and reference another element. The kanji character here is for readability purposes only. Clients of this data are free to ignore it—the definitive kanji character for an element is given on its own non-indented line.

\tEelement_id [kanji_character] [Vglyph_variant] [T(x,y)] [S(width,height)]

The "element id" and "glyph variant", taken together, form a reference to a glyph definition elsewhere in the file. (Again, "V1" may be omitted.) 'T' ("translate", or "position") is the point at which this subglyph should be placed, relative to its parent glyph. The default is "T(0,0)", which can be omitted. 'S' is again a size: it gives the width and height to which this element should be scaled before it is inserted at point 'T'.
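As an illustration of how a client might apply 'T' and 'S', here is a small Python sketch. The names are hypothetical. It reads 'S' as the target width and height in the parent's coordinate space, and it assumes that an omitted 'S' means "keep the referenced glyph's own size"; neither point is spelled out above, so treat both as one possible interpretation.

```python
import re
from typing import NamedTuple, Optional, Tuple

Point = Tuple[float, float]


class SubelementRef(NamedTuple):
    element_id: int
    variant: int              # 1 when the 'V' token is omitted
    translate: Point          # (0, 0) when the 'T' token is omitted
    size: Optional[Point]     # None when the 'S' token is omitted


def parse_subelement(line: str) -> SubelementRef:
    """Parse 'Eelement_id [kanji] [Vvariant] [T(x,y)] [S(width,height)]'."""
    tokens = line.split()
    if not tokens or not re.fullmatch(r"E\d+", tokens[0]):
        raise ValueError(f"not a subelement line: {line!r}")
    element_id = int(tokens[0][1:])
    variant, translate, size = 1, (0.0, 0.0), None
    for tok in tokens[1:]:
        m = re.fullmatch(r"([TS])\((-?[\d.]+),(-?[\d.]+)\)", tok)
        if m:
            pair = (float(m.group(2)), float(m.group(3)))
            if m.group(1) == "T":
                translate = pair
            else:
                size = pair
        elif re.fullmatch(r"V\d+", tok):
            variant = int(tok[1:])
        # Any other token is the readability-only kanji character; ignore it.
    return SubelementRef(element_id, variant, translate, size)


def place_point(pt: Point, ref: SubelementRef,
                glyph_size: Point = (100.0, 100.0)) -> Point:
    """Map a point from the subglyph's own space into its parent's space."""
    target = ref.size if ref.size is not None else glyph_size
    x, y = pt
    # Scale first, then translate.
    return (ref.translate[0] + x * target[0] / glyph_size[0],
            ref.translate[1] + y * target[1] / glyph_size[1])


ref = parse_subelement("\tE0648 口 T(0,51) S(100,37)")
center = place_point((50.0, 50.0), ref)
```

With this reading, the center of a 100 x 100 口 lands at (50, 69.5) in 串, i.e. in the middle of the lower box.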

Stroke lines begin with 'M', 'L', or 'C' and are made up of multiple, space-delimited segments. These segments describe how to draw the particular element on the screen. From a drawing perspective, the representation of the element is simply a list of all the segments in each stroke. The separation into different strokes is solely to represent the human concepts of stroke count and stroke order.

'M' represents a "MoveTo" point, i.e. lifting the pen and placing it at the given point. 'L' represents a "LineTo" point, i.e. drawing a line from the current pen position to the given point. (The endpoint then becomes the new starting pen position for the next segment.) 'C' represents a "CurveTo" segment, made up of three points. Taken together with the current pen position, these four points define a cubic parametric (Bézier) curve.
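A stroke line can be broken into segments with a short parser, sketched below in Python with hypothetical names. The examples on this page only show 'M' and 'L', so the spelling of a curve as C(x1,y1,x2,y2,x3,y3) is purely an assumption here. The cubic_bezier helper evaluates the standard cubic Bézier formula for the pen position plus the three 'C' points.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[str, List[Point]]

_SEGMENT_RE = re.compile(r"([MLC])\(([^)]*)\)")
# Assumption: 'C' lists its three points as six comma-separated numbers.
_POINTS_PER_COMMAND = {"M": 1, "L": 1, "C": 3}


def parse_stroke(line: str) -> List[Segment]:
    """Split one stroke line into (command, [points]) segments."""
    segments: List[Segment] = []
    for cmd, body in _SEGMENT_RE.findall(line):
        nums = [float(n) for n in body.split(",")]
        if len(nums) != 2 * _POINTS_PER_COMMAND[cmd]:
            raise ValueError(f"wrong number of coordinates in {cmd}({body})")
        segments.append((cmd, list(zip(nums[0::2], nums[1::2]))))
    return segments


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Point at parameter t (0..1) on the curve from p0 to p3."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u * u * t * p1[0] + 3 * u * t * t * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u * u * t * p1[1] + 3 * u * t * t * p2[1] + t**3 * p3[1]
    return (x, y)


stroke = parse_stroke("\tM(6.78233,69.8738) L(93.2177,69.8738)")
midpoint = cubic_bezier((0, 0), (0, 10), (10, 10), (10, 0), 0.5)
```

The stroke count of a glyph is then just the number of stroke lines, and the stroke order is the order in which they appear.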

SQL Format

Although the text format is fully expressive, it is not indexed for search. A SQL database is a natural way to provide this indexing. Here is a suggested schema for representing the above data:

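CREATE TABLE element (
  id SERIAL PRIMARY KEY,
  unicode_hex_value CHAR(4)  -- NULL for elements with no kanji character
);

CREATE TABLE glyph (
  id SERIAL PRIMARY KEY,
  element_id INT NOT NULL REFERENCES element (id),
  variant INT NOT NULL,      -- the 'V' token; 1 when omitted
  width INT NOT NULL,        -- the 'S' token
  height INT NOT NULL
);

CREATE TABLE subglyph (
  id SERIAL PRIMARY KEY,
  element_id INT NOT NULL REFERENCES element (id),
  parent_glyph_id INT NOT NULL REFERENCES glyph (id),
  variant INT NOT NULL,      -- with element_id, identifies the referenced glyph
  x INT NOT NULL,            -- the 'T' token
  y INT NOT NULL,
  width INT NOT NULL,        -- the 'S' token
  height INT NOT NULL
);

CREATE TABLE stroke (
  id SERIAL PRIMARY KEY,
  glyph_id INT NOT NULL REFERENCES glyph (id),
  stroke_order_number INT NOT NULL,
  segments TEXT              -- one stroke line, e.g. 'M(...) L(...)'
);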
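
To show the kind of indexed search this schema allows, here is a small sketch using Python's built-in sqlite3 module. It is an illustration only: SERIAL is swapped for SQLite's INTEGER PRIMARY KEY, the index name and the glyphs_containing helper are hypothetical, and the rows are the E0061 entry from the text example above (assuming its glyph is 100 x 100 and that element ids are the 'E' numbers).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element (id INTEGER PRIMARY KEY, unicode_hex_value CHAR(4));
CREATE TABLE glyph (
  id INTEGER PRIMARY KEY, element_id INT NOT NULL, variant INT NOT NULL,
  width INT NOT NULL, height INT NOT NULL);
CREATE TABLE subglyph (
  id INTEGER PRIMARY KEY, element_id INT NOT NULL, parent_glyph_id INT NOT NULL,
  variant INT NOT NULL, x INT NOT NULL, y INT NOT NULL,
  width INT NOT NULL, height INT NOT NULL);
-- Hypothetical index: makes "which glyphs use this element?" a lookup.
CREATE INDEX subglyph_element_idx ON subglyph (element_id, variant);
""")


def hex_of(ch: str) -> str:
    """Code point of a character as upper-case hex, at least four digits."""
    return format(ord(ch), "04X")


conn.executemany(
    "INSERT INTO element (id, unicode_hex_value) VALUES (?, ?)",
    [(61, hex_of("輩")), (1552, hex_of("非")), (1503, hex_of("車"))],
)
conn.execute(
    "INSERT INTO glyph (id, element_id, variant, width, height) "
    "VALUES (1, 61, 1, 100, 100)"
)
conn.executemany(
    "INSERT INTO subglyph (element_id, parent_glyph_id, variant, x, y, width, height) "
    "VALUES (?, 1, 1, ?, ?, ?, ?)",
    [(1552, 0, 3, 100, 52), (1503, 0, 55, 100, 42)],
)


def glyphs_containing(db: sqlite3.Connection, element_id: int):
    """(element_id, variant) of every glyph that directly uses the element."""
    return db.execute(
        "SELECT DISTINCT g.element_id, g.variant "
        "FROM subglyph s JOIN glyph g ON g.id = s.parent_glyph_id "
        "WHERE s.element_id = ? ORDER BY g.element_id, g.variant",
        (element_id,),
    ).fetchall()


parents_of_kuruma = glyphs_containing(conn, 1503)
```

This only finds direct parents; following subelements of subelements would need a recursive query or repeated lookups.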