Create a new document and add a file available at a specified URL to it. The document can then be read, classified or have data extracted from it.

The request body must specify the URL from which Aluma can download the contents of the document's file. The Content-Type header must be set to application/json; if it is omitted, the request is treated as an upload request rather than an import.

Only HTTP and HTTPS schemes are allowed (HTTPS is strongly recommended).
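
As a rough illustration of the requirements above, the following Python sketch builds (but does not send) an import request. The endpoint URL, the "url" property name in the JSON body and the bearer-token Authorization scheme are assumptions made for this example and are not stated in this article; check them against the API reference before use.

```python
import json
import urllib.request
from urllib.parse import urlsplit

# Hypothetical endpoint, used only for illustration.
DOCUMENTS_ENDPOINT = "https://api.example.com/documents"


def build_import_request(file_url: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a request that imports a document from a URL."""
    # Only HTTP and HTTPS schemes are allowed.
    scheme = urlsplit(file_url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {scheme!r}")

    # Assumption: the JSON body carries the file location in a "url" property.
    body = json.dumps({"url": file_url}).encode("utf-8")
    return urllib.request.Request(
        DOCUMENTS_ENDPOINT,
        data=body,
        method="POST",
        headers={
            # Without this header the request would be treated as an upload.
            "Content-Type": "application/json",
            # Assumption: the access token is sent as a bearer token.
            "Authorization": f"Bearer {access_token}",
        },
    )
```

The returned object can be passed to urllib.request.urlopen to send the request. Checking the scheme on the client side simply avoids a round trip for a URL that would be rejected anyway.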

The download of the file must complete within 10 seconds; otherwise a 422 Unprocessable Entity response is returned. A 422 response is also returned when the download fails for another reason or when the JSON body does not match the required schema. The reason for the 422 response is provided in the response body.

The newly created document resource is returned, along with a 201 Created status. The document resource includes the document's ID, which can then be used with the Get, Read, Classify, Extract Document Data, Get Redacted PDF and Delete endpoints.

The Supported File Types article contains details of all file types supported by Aluma, and the maximum file size.

Files embedded resource

The document resource contains an embedded files resource which includes details of the file that the document was created from.

"files": [
  {
    "id": "p3g-T4kf4EeNQ8baNLA8Uw",
    "file_type": "PDF:ImagePlusText",
    "size": 73136,
    "sha256": "f3ee28bbc30e789202e0f84bcbb187c5abc88d54e081bb3fa8abfa8f1a4603ea"
  }
]

The properties are as follows:

  • id: A unique identifier for this file.
  • file_type: The type of the file as determined by the API by examining the contents of the file. This will have one of the values listed in the table below.
  • size: The size of the file in bytes.
  • sha256: The SHA-256 hash of the file contents.

It is best practice to calculate the size and SHA-256 hash of the file you are submitting, and to know its expected file_type (which in most cases will be a static value), then compare these with the values in the response to confirm that the file was not corrupted during transmission.
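
The comparison described above could be sketched in Python as follows. The function name and its parameters are hypothetical; the only inputs taken from this article are the size, sha256 and file_type properties of an entry in the files array.

```python
import hashlib


def file_matches_resource(content: bytes, file_resource: dict, expected_type: str) -> bool:
    """Compare locally computed values with one entry of the "files" array."""
    local_sha256 = hashlib.sha256(content).hexdigest()
    return (
        file_resource.get("size") == len(content)
        # Hex digests are compared case-insensitively to be safe.
        and str(file_resource.get("sha256", "")).lower() == local_sha256
        and file_resource.get("file_type") == expected_type
    )
```

A False result suggests that the file Aluma downloaded differs from the file you hold, or that it was detected as a different type than you expected.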

| Value of file_type | Description |
|---|---|
| PDF:ImageOnly | PDF format file comprised only of full-page images, typically indicating a scanned document |
| PDF:ImagePlusText | PDF format file that has full-page images with 'hidden' text, typically indicating a scanned document that has had OCR used on it |
| PDF:Misc | PDF format file that has content other than full-page images, typically indicating a PDF generated from electronic content |
| Image:TIFF | An image in TIFF format |
| OpenXML:Word | Microsoft Office Word (.docx) documents |
| OpenXML:Spreadsheet | Microsoft Office Excel (.xlsx) documents |
| OpenXML:Presentation | Microsoft Office PowerPoint (.pptx) documents |
| Text:ANSI | Plain text file with text in 8-bit ANSI format |
| Text:UTF8 | Plain text file with text in UTF-8 format |
| Text:UTF16, Text:UTF16_BigEndian | Plain text file with text in UTF-16 format or UTF-16 (big-endian) format |
| Email:MIME | An email in MIME (.eml) format |
| Email:MSG | An email in Microsoft Outlook (.msg) format |
| HTML:ANSI | HTML file encoded in 8-bit ANSI format |
| HTML:UTF8 | HTML file encoded in UTF-8 format |
| HTML:UTF16, HTML:UTF16_BigEndian | HTML file with text in UTF-16 format or UTF-16 (big-endian) format |

If a Content-Type header is returned in the response when downloading the specified file, Aluma will analyse the contents of the file and validate that the file type matches the MIME type in the header. The valid Content-Type and file type combinations are specified in the table below.

If the file type does not match the header value then the request will be rejected with a 415 response.

| File Type | Content-Type header |
|---|---|
| PDFs | application/pdf |
| Microsoft Office Word (.docx) documents | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| Microsoft Office Excel (.xlsx) documents | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
| Microsoft Office PowerPoint (.pptx) documents | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| TIFF image | image/tiff |
| Text document | text/plain |
| Email message (.eml) | message/rfc822 |
| Outlook email message (.msg) | application/vnd.ms-outlook |
| HTML document | text/html |
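
If you control the server that hosts the file, a check along the following lines may help catch a mismatch before submitting the URL. This is only a sketch: pairing file_type prefixes with the Content-Type values in the table above is an inference made for this example, and the way Aluma treats header parameters such as charset is not described here, so ignoring them is an assumption.

```python
from typing import Optional

# Assumed pairing of file_type prefixes with the Content-Type values listed above.
EXPECTED_CONTENT_TYPES = {
    "PDF:": "application/pdf",
    "OpenXML:Word": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "OpenXML:Spreadsheet": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "OpenXML:Presentation": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "Image:TIFF": "image/tiff",
    "Text:": "text/plain",
    "Email:MIME": "message/rfc822",
    "Email:MSG": "application/vnd.ms-outlook",
    "HTML:": "text/html",
}


def expected_content_type(file_type: str) -> Optional[str]:
    """Return the Content-Type paired with a file_type value, or None if unknown."""
    for prefix, media_type in EXPECTED_CONTENT_TYPES.items():
        if file_type.startswith(prefix):
            return media_type
    return None


def header_matches(file_type: str, content_type_header: str) -> bool:
    """Check a Content-Type header value against the expected media type."""
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == expected_content_type(file_type)
```

For example, header_matches("PDF:Misc", "text/html") returns False, which corresponds to a combination that would be rejected.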

RESPONSES

201 The Document was created
400 The request is badly formed or invalid
401 There is no Authorization header or the access token is invalid
403 You have reached your maximum number of simultaneous documents
415 The Content-Type is specified and not set to application/json, or the type of the downloaded file does not match the Content-Type header returned with it
422 There was a problem downloading the specified file, or the JSON body does not match the required schema (see the error in the response body for details of the specific error)
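
A client might translate these status codes into results and errors roughly as shown in the Python sketch below. The ImportFailed exception and the parse_import_response function are hypothetical names introduced for this example. The structure of error bodies is not specified in this article, so the sketch keeps the body as raw text rather than assuming particular fields.

```python
import json

# Short summaries of the non-success responses listed above.
STATUS_MEANINGS = {
    400: "The request is badly formed or invalid",
    401: "Missing Authorization header or invalid access token",
    403: "Maximum number of simultaneous documents reached",
    415: "Unsupported or mismatched Content-Type",
    422: "The file could not be downloaded or the request body was invalid",
}


class ImportFailed(Exception):
    """Raised when the import request does not return 201 Created."""

    def __init__(self, status: int, reason: str, detail: str):
        super().__init__(f"{status}: {reason}")
        self.status = status
        self.detail = detail  # raw response body, which may explain the error


def parse_import_response(status: int, body: bytes) -> dict:
    """Return the document resource on 201, otherwise raise ImportFailed."""
    if status == 201:
        return json.loads(body)
    reason = STATUS_MEANINGS.get(status, "Unexpected response")
    raise ImportFailed(status, reason, body.decode("utf-8", errors="replace"))
```

Keeping the raw body on the exception is useful for 422 responses in particular, because the reason for the failure is reported there.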
