I've set my website up with automatic word hyphenation. A plugin that runs between generating the response and sending it takes the contents of every XHTML element and, where they are character data, inserts soft hyphen characters at the places appropriate for a line break. I have not tested it in every popular browser yet; hopefully it will not cause problems nowadays. I remember that several years ago there were a bunch of problems, for example, with the contents of the title element (i.e. …)

Update: unfortunately, the problems were still there, so I've removed the automatic hyphenation from my website after all.

As for the hyphenation algorithm, I considered the Knuth-Liang algorithm, but discarded the idea. It seems overcomplicated for such a simple task. It is difficult to understand, it is difficult to turn into a working program, and it requires a lot of prepared data. Yes, it produces fine results and may be considered an industry standard. Still, I do not share Liang's super-thorough opinion on what correct and incorrect hyphenation is. Times have changed, and life shows that almost any hyphenation is comfortable for a reader. Rules tend to become simpler, and that's fine.

There is nothing wrong with breaking “selfadjoint” after “l”, or “Reagan” after “e”, or “homeowners” after “ho”, all of which are Liang's examples of erroneous hyphenation in the introduction to his PhD thesis. “Exacting” with the hyphen after “ex”, “coincidence” with the hyphen after “coin”, “legends” with the hyphen after “leg” (samples from another article on the subject) might seem odd, but only when written within a single line of text. If a line break occurs in such a place, no one will read these words as “ex acting”, “coin cidence” or “leg ends”. No, a human reader will understand the word correctly, if only because of the context.

To make a long story short, I prefer the simple rule-based approach.

“The original TeX hyphenation algorithm was designed by Prof. Knuth and the author in the summer of 1977. It is essentially a rule-based algorithm, with three main types of rules: (1) suffix removal, (2) prefix removal, and (3) vowel-consonant-consonant-vowel (vccv) breaking. The latter rule states that when the pattern ‘vowel-consonant-consonant-vowel’ appears in a word, we can in most cases split between the consonants. There are also many special case rules; for example, “break vowel-q” or “break after ck”. Finally a small exception dictionary (about 300 words) is used to handle particularly objectionable errors made by the above rules, and to hyphenate certain common words (e.g. pro-gram) that are not split by the rules.”

With that approach we need to carry a dictionary of suffixes, a dictionary of prefixes, a list of special case rules, and a list of exceptions. And there is no real need for all of this. I have worked out the following set of rules:

1. A soft hyphen is inserted between two vowels (v-v).
2. If a vowel is followed by a consonant followed by a vowel, a soft hyphen is inserted after the first vowel (v-cv).
3. In a vowel-consonant-consonant-vowel pattern, a soft hyphen is inserted between the consonants (vc-cv).
4. A soft hyphen is never inserted after the first letter of the word or before the last letter of the word.

Basically, hyphenation deals with syllables, and a vowel constitutes a syllable nucleus. So, between two vowels we may insert a hyphen. This immediately leads us to the 1st rule. As for the situation when more than two consonants appear between two vowels, it is complicated, and for the sake of simplicity we skip it, not inserting any hyphens at all. The 4th rule is intended for better readability. We do not need any dictionary data except for which letters are considered vowels (and every other letter may quite safely be considered a consonant, though it is not always one).
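The four rules above are small enough to sketch in a few lines. What follows is only an illustration and not the plugin's actual code. It assumes the vowel set is "aeiouy" (the post does not list its vowels), that a word is any run of letters, and that the soft hyphen is U+00AD. The names `break_points`, `hyphenate_word` and `hyphenate_text` are hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"
# Assumption: the post does not say which letters count as vowels.
VOWELS = frozenset("aeiouy")
# Assumption: a "word" is any run of letters (no digits, no underscore).
WORD = re.compile(r"[^\W\d_]+")


def break_points(word):
    """Return indices p such that a hyphen may be placed before word[p]."""
    points = []
    prev = None  # index of the previous vowel, if any
    for i, ch in enumerate(word):
        if ch.lower() not in VOWELS:
            continue
        if prev is not None:
            gap = i - prev - 1  # consonants between the two vowels
            if gap == 0:
                points.append(i)         # rule 1: v-v
            elif gap == 1:
                points.append(prev + 1)  # rule 2: v-cv
            elif gap == 2:
                points.append(prev + 2)  # rule 3: vc-cv
            # more than two consonants: skipped, no hyphen at all
        prev = i
    # rule 4: never after the first letter or before the last letter
    return [p for p in points if 1 < p < len(word) - 1]


def hyphenate_word(word, hyphen=SOFT_HYPHEN):
    """Insert `hyphen` at every break point of a single word."""
    pieces = []
    start = 0
    for p in break_points(word):
        pieces.append(word[start:p])
        start = p
    pieces.append(word[start:])
    return hyphen.join(pieces)


def hyphenate_text(text, hyphen=SOFT_HYPHEN):
    """Hyphenate every run of letters in `text`, leaving the rest untouched."""
    return WORD.sub(lambda m: hyphenate_word(m.group(0), hyphen), text)


if __name__ == "__main__":
    # A visible "-" is used here instead of the soft hyphen for readability.
    print(hyphenate_word("hyphenation", "-"))  # hyp-he-na-ti-on
    print(hyphenate_word("idea", "-"))         # idea (rule 4 blocks both breaks)
```

Passing a visible "-" instead of the default soft hyphen makes the break points easy to inspect. With the assumed vowel set, “Reagan” becomes “Re-a-gan”, which includes the break after “e” discussed earlier.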