Asian language recognition

Four languages with Asian scripts are supported: Japanese, Korean, Traditional Chinese and Simplified Chinese. The ideal font size for body text is 12 points, scanned at 300 dpi, which gives characters of around 48 x 48 pixels. The minimum is 30 x 30 pixels, that is, 10.5 points at 300 dpi. For smaller characters, scan at 400 dpi.
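The size guidance above can be written as a small pre-scan check. This is only a sketch of the rule as stated in this topic; the function and constant names are hypothetical and are not part of OmniPage.

```python
# Thresholds taken from the guidance above; all names here are illustrative.
MIN_POINTS_AT_300_DPI = 10.5  # smallest body-text size accepted at 300 dpi
IDEAL_BOX_PX = 48             # approximate ideal character box (pixels)
MIN_BOX_PX = 30               # minimum character box (pixels)


def recommend_dpi(font_points: float) -> int:
    """Return the scan resolution suggested for a given body-text size."""
    if font_points <= 0:
        raise ValueError("font size must be positive")
    return 300 if font_points >= MIN_POINTS_AT_300_DPI else 400


def character_box_ok(width_px: int, height_px: int) -> bool:
    """Check a measured character box against the 30 x 30 pixel minimum."""
    return width_px >= MIN_BOX_PX and height_px >= MIN_BOX_PX


if __name__ == "__main__":
    for size in (12, 10.5, 9):
        print(f"{size} pt body text: scan at {recommend_dpi(size)} dpi")
```

If characters measured in an existing scan fail `character_box_ok`, rescanning at the higher resolution is the remedy this topic suggests.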

Japanese and Chinese texts can be horizontal (left-to-right) or vertical (top-to-bottom, right-to-left); Korean text is always horizontal.

Here is an example of Chinese text: [Image: Chinese sample]

and of Korean text: [Image: Korean sample]

Japanese text is shown below.

Operating systems supported by OmniPage 18 can handle Asian languages, but if East Asian language support was not selected during system install, it must be added from Control Panel / Regional and Language Settings / Languages / Supplemental language support / Install files for East Asian languages. You may be required to insert a Windows system disk.

The four Asian languages are listed alphabetically with the others in the Options/OCR panel. Select only one of these languages at a time, and do not combine it with other languages. Asian OCR can handle short embedded English texts without English being explicitly set; it is not designed for longer English texts or for texts in other Western languages.

Vertical text in Japanese and Chinese may have English embedded in different orientations:

  • Neon: [Image: Asian vertical text, neon]

  • Right: [Image: Asian vertical text, right]

  • Side-by-side: [Image: Asian vertical text, side-by-side]

Output

The program can handle all these; in the output they appear right-rotated.

Language verification

Beside the language list, the option Verify language choices invokes automatic language detection, which warns when the detected language differs from the language setting. It works at page level and identifies four categories: Japanese, Chinese, Korean and non-Asian. The last category means that no Japanese, Chinese or Korean characters were detected. Detection cannot distinguish between Traditional and Simplified Chinese, or between non-Asian languages. Verification takes place during image pre-processing, so the required recognition language must be set before image loading. Detection is more robust with at least several lines of text and a minimum of embedded English text.
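To make the four categories concrete, here is a rough sketch that classifies already-decoded text by Unicode block and compares the result with a language setting. It is not how OmniPage works: the program's detection runs on page images during pre-processing, whereas this heuristic runs on characters. All names are hypothetical.

```python
from collections import Counter
from typing import Optional

# Hypothetical mapping from a recognition-language setting to a category.
SETTING_TO_CATEGORY = {
    "Japanese": "Japanese",
    "Korean": "Korean",
    "Traditional Chinese": "Chinese",
    "Simplified Chinese": "Chinese",
}


def script_of(ch: str) -> str:
    """Map one character to a coarse script label using Unicode blocks."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana
        return "kana"
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "han"
    return "other"


def detect_category(page_text: str) -> str:
    """Return Japanese, Chinese, Korean or non-Asian for a whole page."""
    counts = Counter(script_of(ch) for ch in page_text)
    if counts["hangul"] == 0 and counts["kana"] == 0:
        # Ideographs alone cannot separate Traditional from Simplified Chinese.
        return "Chinese" if counts["han"] else "non-Asian"
    return "Korean" if counts["hangul"] >= counts["kana"] else "Japanese"


def verify_language(setting: str, page_text: str) -> Optional[str]:
    """Return a warning when the detected category differs from the setting."""
    expected = SETTING_TO_CATEGORY.get(setting, "non-Asian")
    detected = detect_category(page_text)
    if detected == expected:
        return None
    return f"Page looks {detected}, but the language setting is {setting}"
```

Because both Chinese settings map to the same category, a Traditional Chinese page checked against a Simplified Chinese setting produces no warning, which mirrors the limitation described above.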

Single language detection

The Asian languages can be processed with the option Detect single language automatically. This is useful for unattended processing where input documents may be in different languages. See OCR Options. Choose Asian languages or Latin-alphabet and Asian in the drop-down list to have these languages considered during the detection. Verify language choices cannot be used when this option is set, nor can individual language choices be made.

Layout and zoning

Auto-layout and auto-zoning are recommended for Asian pages. This places all detected texts into text zones. Choosing an Asian recognition language sets Asian OCR to run in these zones; it can automatically detect and transmit the text direction, coping with mixed areas of horizontal and vertical text on a page.

The vertical Asian zoning tool [Image: vertical Asian zone tool] can, however, be used to force vertical Asian recognition by manual zoning. Draw rectangular zones with this tool. To manually zone horizontal Asian text, use the usual text zone type. Do not use the two other vertical-text tools on Asian texts. Drawing a vertical Asian zone does not automatically enable an Asian language, nor does it influence the language auto-detection.

Digital camera images

These are accepted for Asian languages. However, the automatic 3D deskew algorithm is unlikely to be useful – certainly not for vertical texts. Preferably use the standard image loading command and perform manual 3D deskewing with the relevant SET tool if required. In general, SET tools can be used on Asian images.

Asian texts in the Text Editor

Recognized Asian pages appear in the Text Editor, always with horizontal text direction, provided your system has support for East Asian languages. There is no need to specify Asian fonts under Options/OCR; a default font is applied automatically, typically Arial Unicode MS. Other Asian-capable fonts on your system can be chosen in the Text Editor. If a font without Asian support is selected, the Asian characters are replaced by rectangles.

Editor support allows text viewing and verifying – avoid True Page for vertical texts. Large-scale editing and spell-checking are better done in the target application. Proofing, training and dictionary support are not available for Asian texts. Therefore, prior to performing Asian OCR, go to the Proofing panel under Options and disable dictionary word marking, automatic proofreading and IntelliTrain and ensure that no training file is loaded. Redaction can be applied to Asian texts, either by selection or searching.

Asian output

Typical output converters for Asian texts are RTF, Microsoft Word, Searchable PDF or XPS. The text direction that was detected during pre-processing is applied to the output file, provided that True Page or Flowing Page is set for the export. Changes made in the Text Editor, where text is always horizontal, are exported as well, including into vertical text. Plain Text converters are available (Unicode TXT, Notepad), but here the text direction is always horizontal.

  • The workflow step Form Data Extraction should not be applied to Asian pages.

  • When handling vertical Asian text, note that the Formatted Text setting is best for viewing results in the Text Editor, but the True Page or Flowing Page formatting levels should be used for export.
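The export rules above can be summarized as a small decision function. This is an illustrative sketch only: the names are hypothetical, and the fallback to horizontal text for other formatting levels is an assumption inferred from the wording above, not documented behavior.

```python
# Formatting levels and converters named in this topic; names are illustrative.
DIRECTION_PRESERVING_LEVELS = {"True Page", "Flowing Page"}
PLAIN_TEXT_CONVERTERS = {"Unicode TXT", "Notepad"}


def exported_direction(detected: str, formatting_level: str, converter: str) -> str:
    """Return the text direction expected in the exported file."""
    if detected not in {"horizontal", "vertical"}:
        raise ValueError("detected must be 'horizontal' or 'vertical'")
    if converter in PLAIN_TEXT_CONVERTERS:
        return "horizontal"  # plain text output is always horizontal
    if formatting_level in DIRECTION_PRESERVING_LEVELS:
        return detected      # detected direction is applied to the output
    # Assumption: any other level (for example Formatted Text) exports horizontally.
    return "horizontal"


if __name__ == "__main__":
    print(exported_direction("vertical", "True Page", "Microsoft Word"))
    print(exported_direction("vertical", "True Page", "Unicode TXT"))
```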
