Read documents in different languages

About this guide

This guide explains how to process documents that contain text in languages other than English.

Introduction

When reading text from image-based documents, Aluma assumes by default that the text is English. If your documents contain text in other languages, the reads performed as part of read, classify or extract operations will be less accurate. For languages close to English the results may still be good enough for some classification and extraction use cases, but in general you should specify the language of the text to get the best possible results.

To control the language used during reads, you can create a read-profile that specifies one or more languages and then use the read-profile as part of your read, classify or extract operations.

Create a read-profile

You can create a read-profile using the CLI or the API. Here we'll explain how to do it through the CLI.

The create read-profile command takes one or more languages, each given by its language code. Here we create a read-profile called german that specifies only German text, using the deu language code.

aluma create read-profile german --language deu

You can use -l as shorthand for --language if you prefer.

If your documents might contain text in different languages then you can specify more languages by using the --language or -l switches multiple times:

aluma create read-profile french_or_german -l fra -l deu
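
If you create read-profiles from a script rather than by hand, the repeated -l switches can be generated from a list of language codes. The following is a minimal Python sketch, not part of the Aluma tooling: the helper name create_read_profile_command is hypothetical, and it only builds the same command line shown above without running it.

```python
import shlex


def create_read_profile_command(name, language_codes):
    """Build the argument list for `aluma create read-profile`.

    Hypothetical helper for illustration; it mirrors the CLI syntax
    shown in this guide and does not call Aluma itself.
    """
    if not language_codes:
        raise ValueError("at least one language code is required")
    args = ["aluma", "create", "read-profile", name]
    for code in language_codes:
        # One -l switch per language, as in the CLI examples in this guide.
        args += ["-l", code]
    return args


command = create_read_profile_command("french_or_german", ["fra", "deu"])
print(shlex.join(command))
# prints: aluma create read-profile french_or_german -l fra -l deu
```

To actually run the command you could pass the list to subprocess.run(command, check=True), assuming the aluma CLI is installed and on your PATH.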

You can find the list of available languages and their language codes at the bottom of this page.

Use a read-profile

To use a read-profile in your CLI read, classify or extract commands, add a --read-profile or -r switch with the name of the profile:

aluma read *.tif -r french_or_german

or

aluma extract my-extractor *.pdf -r french_or_german

If your integration uses the API directly (without one of our client libraries), you can specify the read-profile with the read-profile query parameter. You can find more details in the API reference.
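
As a rough illustration of the query parameter, the Python sketch below appends read-profile to a request URL. Only the parameter name comes from this guide; the endpoint URL is a placeholder and the helper name with_read_profile is hypothetical, so take the real endpoints and request format from the API reference.

```python
from urllib.parse import urlencode

# Placeholder only, NOT a real Aluma endpoint: use the URL from the API reference.
EXAMPLE_ENDPOINT = "https://api.example.com/some-operation"


def with_read_profile(url, profile_name):
    """Append the read-profile query parameter to a request URL.

    Hypothetical helper for illustration; only the parameter name
    `read-profile` is taken from this guide.
    """
    # Use "&" if the URL already carries query parameters, "?" otherwise.
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode({"read-profile": profile_name})


print(with_read_profile(EXAMPLE_ENDPOINT, "french_or_german"))
# prints: https://api.example.com/some-operation?read-profile=french_or_german
```

Using urlencode rather than plain string concatenation keeps the URL valid if a profile name ever contains characters that need escaping.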

Language codes

Language                Language code
Afrikaans               afr
Arabic                  ara
Azerbaijani             aze
Belarusian              bel
Bengali                 ben
Bulgarian               bul
Catalan                 cat
Czech                   ces
Chinese (Simplified)    chi_sim
Chinese (Traditional)   chi_tra
Cherokee                chr
Danish                  dan
German                  deu
Greek                   ell
English                 eng
Middle English          enm
Esperanto               epo
Estonian                est
Basque                  eus
Finnish                 fin
French                  fra
Frankish                frk
Middle French           frm
Galician                glg
Greek (Ancient)         grc
Hebrew                  heb
Hindi                   hin
Croatian                hrv
Hungarian               hun
Indonesian              ind
Icelandic               isl
Italian                 ita
Japanese                jpn
Kannada                 kan
Korean                  kor
Latvian                 lav
Lithuanian              lit
Malayalam               mal
Macedonian              mkd
Maltese                 mlt
Malay                   msa
Dutch, Flemish          nld
Norwegian               nor
Polish                  pol
Portuguese              por
Romanian                ron
Russian                 rus
Slovak                  slk
Slovenian               slv
Spanish                 spa
Albanian                sqi
Serbian                 srp
Swahili                 swa
Swedish                 swe
Tamil                   tam
Telugu                  tel
Thai                    tha
Turkish                 tur
Ukrainian               ukr
Vietnamese              vie

We can support a wide variety of additional languages, so if you're looking for something not listed here, please contact us.