Read documents with different languages
About this guide
This guide explains how to process documents that contain text in languages other than English.
Introduction
When reading text from image-based documents, Aluma assumes by default that the text is English. If your documents contain text in other languages, reads performed as part of read, classify or extract operations will be less accurate. For languages close to English the results may still be good enough for some classification and extraction use cases, but in general you should specify the language of the text to get the best possible results.
To control the language used during reads, you can create a read-profile that specifies one or more languages and then use the read-profile as part of your read, classify or extract operations.
Create a read-profile
You can create a read-profile using the CLI or the API. Here we'll explain how to do it through the CLI.
The create read-profile command specifies one or more languages using the appropriate language code(s). Here we're creating a read-profile called german, specifying just German text, using the appropriate deu language code.
aluma create read-profile german --language deu
You can use -l as shorthand for --language if you prefer.
If your documents might contain text in different languages then you can specify more languages by using the --language or -l switch multiple times:
aluma create read-profile french_or_german -l fra -l deu
You can find the list of available languages and their language codes at the bottom of this page.
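If you create profiles from a script, the repeated -l switches can be generated from a list of language codes. The following is a minimal sketch, not part of the Aluma CLI: build_read_profile_cmd is a hypothetical helper that only prints the command so you can inspect it before running it, and it assumes the profile name and codes contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Aluma CLI): print the
# `aluma create read-profile` command for a profile name followed by
# one or more language codes. It prints the command; it does not run it.
build_read_profile_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: build_read_profile_cmd NAME CODE [CODE...]" >&2
        return 1
    fi
    name=$1
    shift
    cmd="aluma create read-profile $name"
    # Add one -l switch per remaining argument (language code).
    for code in "$@"; do
        cmd="$cmd -l $code"
    done
    printf '%s\n' "$cmd"
}
```

Calling build_read_profile_cmd french_or_german fra deu prints the same french_or_german command shown above.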
Use a read-profile
To use a read-profile in your CLI read, classify or extract commands, add a --read-profile or -r switch with the name of the profile:
aluma read *.tif -r french_or_german
or
aluma extract my-extractor *.pdf -r french_or_german
If you are using the API directly (without one of our client libraries) for your integration then you can specify the read-profile with the read-profile query parameter. You can find more details in the API reference.
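As a rough illustration of the query parameter, the sketch below appends read-profile to a request URL. Only the parameter name comes from this guide. The helper with_read_profile is hypothetical, the URL in the usage note is a placeholder rather than a real Aluma endpoint, and the sketch assumes the profile name needs no URL-encoding. Take the actual endpoints from the API reference.

```shell
#!/bin/sh
# Hypothetical helper: append the read-profile query parameter to a URL.
# The URL passed in is the caller's responsibility; this sketch does not
# know any Aluma endpoints and does not URL-encode the profile name.
with_read_profile() {
    url=$1
    profile=$2
    case $url in
        *\?*) sep='&' ;;   # URL already has a query string
        *)    sep='?' ;;   # first query parameter
    esac
    printf '%s%sread-profile=%s\n' "$url" "$sep" "$profile"
}
```

For example, with_read_profile https://api.example.invalid/some-endpoint french_or_german prints the placeholder URL followed by ?read-profile=french_or_german.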
Language codes
Language | Language code |
---|---|
Afrikaans | afr |
Arabic | ara |
Azerbaijani | aze |
Belarusian | bel |
Bengali | ben |
Bulgarian | bul |
Catalan | cat |
Czech | ces |
Chinese (Simplified) | chi_sim |
Chinese (Traditional) | chi_tra |
Cherokee | chr |
Danish | dan |
German | deu |
Greek | ell |
English | eng |
Middle English | enm |
Esperanto | epo |
Estonian | est |
Basque | eus |
Finnish | fin |
French | fra |
Frankish | frk |
Middle French | frm |
Galician | glg |
Greek (Ancient) | grc |
Hebrew | heb |
Hindi | hin |
Croatian | hrv |
Hungarian | hun |
Indonesian | ind |
Icelandic | isl |
Italian | ita |
Japanese | jpn |
Kannada | kan |
Korean | kor |
Latvian | lav |
Lithuanian | lit |
Malayalam | mal |
Macedonian | mkd |
Maltese | mlt |
Malay | msa |
Dutch, Flemish | nld |
Norwegian | nor |
Polish | pol |
Portuguese | por |
Romanian | ron |
Russian | rus |
Slovak | slk |
Slovenian | slv |
Spanish | spa |
Albanian | sqi |
Serbian | srp |
Swahili | swa |
Swedish | swe |
Tamil | tam |
Telugu | tel |
Thai | tha |
Turkish | tur |
Ukrainian | ukr |
Vietnamese | vie |
We can support a wide variety of additional languages, so if you're looking for something not listed here, please contact us.