Creates a new document from the file supplied in the request body. The document can then be read or classified, or have data extracted from it.
The request body should contain the binary contents of the document's file.
The newly created document resource is returned, along with a 201 Created status. The document resource includes the document's ID, which can then be used with the Get, Read, Classify, Extract Document Data, Get Redacted PDF and Delete endpoints.
The Supported File Types article contains details of all file types supported by Aluma, and the maximum file size.
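As a rough sketch of the request described above, the following Python builds (without sending) a POST whose body is the raw file bytes. The `base_url` parameter, the `/documents` path and the `Bearer` authorization scheme are assumptions made for illustration and are not taken from this page; check them against your own API details before use.

```python
import urllib.request


def build_create_document_request(
    base_url: str,       # hypothetical: root URL of the API
    access_token: str,
    file_bytes: bytes,
    content_type: str,   # MIME type of the file, e.g. "application/pdf"
) -> urllib.request.Request:
    """Build, but do not send, a create-document request."""
    if not file_bytes:
        # An empty body corresponds to the documented 400 response.
        raise ValueError("no file contents supplied")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/documents",  # assumed path
        data=file_bytes,  # the body is the file itself, not a form upload
        method="POST",
        headers={
            "Authorization": "Bearer " + access_token,  # assumed scheme
            # Set explicitly: urllib otherwise defaults to a form-encoded type.
            "Content-Type": content_type,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; on success the API is documented to answer 201 with the document resource in the response body.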
Files embedded resource
The document resource contains an embedded files resource, which includes details of the file that the document was created from.
    "files": [
        {
            "id": "p3g-T4kf4EeNQ8baNLA8Uw",
            "file_type": "PDF:ImagePlusText",
            "size": 73136,
            "sha256": "f3ee28bbc30e789202e0f84bcbb187c5abc88d54e081bb3fa8abfa8f1a4603ea"
        }
    ]
The properties are as follows:
id
: A unique identifier for this file.
file_type
: The type of the file, as determined by the API by examining the contents of the file. This will have one of the values listed in the table below.
size
: The size of the file in bytes.
sha256
: The SHA-256 hash of the file contents.
It is best practice to calculate your own values for size, sha256 and file_type (which in most cases will be a static value) of the file you are submitting and compare these to the values in the response in order to ensure that the file was not corrupted during transmission.
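The comparison recommended above might look like the following minimal sketch. It assumes the entry from the `files` array has already been parsed into a Python dict; the function name `verify_uploaded_file` and its return convention are hypothetical, not part of the API.

```python
import hashlib
from typing import List, Optional


def verify_uploaded_file(
    file_bytes: bytes,
    file_resource: dict,
    expected_file_type: Optional[str] = None,
) -> List[str]:
    """Return a list of mismatches between the local file and the
    file resource from the response; an empty list means all matched."""
    problems = []

    # size is reported in bytes, so compare against the raw byte length.
    local_size = len(file_bytes)
    if file_resource.get("size") != local_size:
        problems.append(
            f"size: sent {local_size}, API reported {file_resource.get('size')}"
        )

    # sha256 is a hex digest of the file contents.
    local_hash = hashlib.sha256(file_bytes).hexdigest()
    if str(file_resource.get("sha256", "")).lower() != local_hash:
        problems.append("sha256: digest does not match the submitted file")

    # file_type is only checked when the caller knows what to expect.
    if expected_file_type is not None:
        if file_resource.get("file_type") != expected_file_type:
            problems.append(
                f"file_type: expected {expected_file_type}, "
                f"API reported {file_resource.get('file_type')}"
            )
    return problems
```

Returning a list rather than raising on the first mismatch lets the caller log every discrepancy before deciding whether to resubmit the file.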
Value of file_type | Description |
---|---|
PDF:ImageOnly | PDF format file comprised only of full-page images, typically indicating a scanned document |
PDF:ImagePlusText | PDF format file that has full-page images with 'hidden' text, typically indicating a scanned document that has had OCR used on it |
PDF:Misc | PDF format file that has content other than full-page images, typically indicating a PDF generated from electronic content |
Image:TIFF | An image in TIFF format |
Image:JPEG | An image in JPEG format |
Image:JPEG2000 | An image in JPEG-2000 format |
OpenXML:Word | Microsoft Office Word (.docx) documents |
OpenXML:Spreadsheet | Microsoft Office Excel (.xlsx) documents |
OpenXML:Presentation | Microsoft Office PowerPoint (.pptx) documents |
Text:ANSI | Plain text file with text in 8-bit ANSI format |
Text:UTF8 | Plain text file with text in UTF-8 format |
Text:UTF16 | Plain text file with text in UTF-16 format |
Text:UTF16_BigEndian | Plain text file with text in UTF-16 (big-endian) format |
Email:MIME | An email in MIME (.eml) format |
Email:MSG | An email in Microsoft Outlook (.msg) format |
HTML:ANSI | HTML file encoded in 8-bit ANSI format |
HTML:UTF8 | HTML file encoded in UTF-8 format |
HTML:UTF16 | HTML file with text in UTF-16 format |
HTML:UTF16_BigEndian | HTML file with text in UTF-16 (big-endian) format |
If you know the type of the file and want to validate that the API concurs, you can set the Content-Type header to the MIME type of the file, as shown in the table below. If the file type does not match, the request will be rejected with a 415 response.
File Type | Content-Type header |
---|---|
PDFs | application/pdf |
Microsoft Office Word (.docx) documents | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
Microsoft Office Excel (.xlsx) documents | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
Microsoft Office PowerPoint (.pptx) documents | application/vnd.openxmlformats-officedocument.presentationml.presentation |
TIFF image | image/tiff |
JPEG image | image/jpeg |
JPEG 2000 image | image/jp2 |
Text document | text/plain |
Email message (.eml) | message/rfc822 |
Outlook email message (.msg) | application/vnd.ms-outlook |
HTML document | text/html |
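One way to pick the header value is a lookup keyed on the file extension, sketched below. The MIME types come from the table above; the extensions `.pdf`, `.tif`, `.tiff`, `.jpg`, `.jpeg`, `.jp2`, `.txt`, `.html` and `.htm` are conventional assumptions rather than values stated on this page, and the helper name `content_type_for` is hypothetical.

```python
from pathlib import PurePath
from typing import Optional

OPENXML = "application/vnd.openxmlformats-officedocument."

# Extension -> Content-Type header value, following the table above.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": OPENXML + "wordprocessingml.document",
    ".xlsx": OPENXML + "spreadsheetml.sheet",
    ".pptx": OPENXML + "presentationml.presentation",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".jp2": "image/jp2",
    ".txt": "text/plain",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".html": "text/html",
    ".htm": "text/html",
}


def content_type_for(filename: str) -> Optional[str]:
    """Return the Content-Type for a filename, or None if the
    extension is not in the mapping."""
    return CONTENT_TYPES.get(PurePath(filename).suffix.lower())
```

For example, `content_type_for("scan.PDF")` returns `"application/pdf"`. An extension only suggests the type; since the API checks the header against the actual contents, a mislabelled file would still be rejected with a 415.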
RESPONSES
201 The document was created
400 There is no file supplied in the body
401 There is no Authorization header or the access token is invalid
403 You have reached your maximum number of simultaneous documents
413 The file supplied in the body is too large
415 The Content-Type contains an unsupported type or does not match the actual contents of the file
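A client could translate these status codes into readable errors along the following lines. This is only a sketch: the descriptions are copied from the list above, while the exception class `CreateDocumentError` and the function `check_create_response` are hypothetical names.

```python
# Documented non-success responses for the create-document request.
CREATE_DOCUMENT_ERRORS = {
    400: "There is no file supplied in the body",
    401: "There is no Authorization header or the access token is invalid",
    403: "You have reached your maximum number of simultaneous documents",
    413: "The file supplied in the body is too large",
    415: "The Content-Type contains an unsupported type or does not "
         "match the actual contents of the file",
}


class CreateDocumentError(Exception):
    """Raised when the create-document request does not return 201."""

    def __init__(self, status: int, reason: str):
        super().__init__(f"{status}: {reason}")
        self.status = status
        self.reason = reason


def check_create_response(status: int) -> None:
    """Return quietly on 201; otherwise raise with the documented reason."""
    if status == 201:
        return
    reason = CREATE_DOCUMENT_ERRORS.get(status, "Unexpected status code")
    raise CreateDocumentError(status, reason)
```

Keeping the status on the exception lets calling code branch on it, for instance treating a 403 differently from a 415.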