DetectDocumentText Processor

Part of the AWS Textract processor family

The DetectDocumentText processor extracts text from a document, which can be either an image or a PDF.

Properties

All of our Textract processors also include these common properties.

This processor does not have any unique properties outside of the common ones.

Data Output

If the Destination property is set to flowfile-attribute, the processor writes its output to the FlowFile's ocr.DetectedText attribute, creating the attribute if it does not already exist.

Output Structure

| Field Name | Data Type | Description |
| --- | --- | --- |
| blocks | array of Block | The list of blocks returned from the API |

Relevant Data Structures

Block

| Field Name | Data Type | Description |
| --- | --- | --- |
| text | string | The text in this block |
| confidence | float | How confident the API is in this block, as a percentage from 0 to 100 |
| id | string | The UUID of this block. Can be used to cross-reference relationships between blocks |
| page | int | The page of the document on which this block resides |
| columnIndex | int | |
| columnSpan | int | |
| rowIndex | int | |
| rowSpan | int | |
| type | string (BlockType) | The kind of block |
| geometry | Geometry | The position and size of the block |
| relationships | array of Relationship | The relationships this block has to others |
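As a rough illustration of how the Block fields might be consumed, the sketch below keeps only the text of LINE blocks at or above a confidence threshold. The function name `extract_lines`, the 90.0 default threshold, and the low-confidence WORD entry in the sample data are assumptions made for this example, not part of the processor.

```python
from typing import Dict, Iterable, List


def extract_lines(blocks: Iterable[Dict], min_confidence: float = 90.0) -> List[str]:
    """Return the text of LINE blocks whose confidence meets the threshold.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    return [
        block["text"]
        for block in blocks
        if block.get("type") == "LINE"
        and block.get("confidence", 0.0) >= min_confidence
    ]


# The first two entries are trimmed from the Example Output;
# the third is a hypothetical low-confidence block added for illustration.
sample_blocks = [
    {"type": "LINE", "text": "Spirit Game Script", "confidence": 99.35694, "page": 1},
    {"type": "LINE", "text": "Courageous, pleading and clear.", "confidence": 99.51574, "page": 1},
    {"type": "WORD", "text": "Spirit", "confidence": 42.0, "page": 1},
]

for line in extract_lines(sample_blocks):
    print(line)
```

Filtering on both type and confidence keeps the example small; a real flow might instead keep low-confidence blocks and flag them for review.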

Geometry

| Field Name | Data Type | Description |
| --- | --- | --- |
| x | float | The X position of the block on the page |
| y | float | The Y position of the block on the page |
| width | float | The width of the block |
| height | float | The height of the block |
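The values in the Example Output all fall between 0 and 1, which suggests that the geometry is expressed as a fraction of the page dimensions. Under that assumption, and the further assumption that x and y mark the top-left corner of the block, a pixel bounding box could be derived as in this sketch. The function name `to_pixel_box` and the page size used below are hypothetical.

```python
from typing import Dict, Tuple


def to_pixel_box(geometry: Dict[str, float], page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Convert a Geometry dict to (left, top, right, bottom) in pixels.

    Assumes x, y, width and height are fractions of the page size and that
    (x, y) is the top-left corner; neither is confirmed by this page.
    """
    left = round(geometry["x"] * page_width)
    top = round(geometry["y"] * page_height)
    right = round((geometry["x"] + geometry["width"]) * page_width)
    bottom = round((geometry["y"] + geometry["height"]) * page_height)
    return left, top, right, bottom


# Geometry of the first block in the Example Output,
# on a hypothetical 1000 x 2000 pixel page image.
geometry = {"width": 0.2716, "height": 0.02702, "x": 0.36377, "y": 0.07574}
print(to_pixel_box(geometry, page_width=1000, page_height=2000))
```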

Relationship

| Field Name | Data Type | Description |
| --- | --- | --- |
| type | string (either VALUE or CHILD) | The kind of relationship |
| ids | array of strings | The list of block UUIDs that are connected via this relationship |
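Because relationships refer to other blocks only by UUID, resolving them usually means building a lookup table first. The sketch below shows one way to do that for CHILD relationships. The helper names and the short ids in the sample data are hypothetical, and the key is read as `relationships`, matching the Example Output.

```python
from typing import Dict, List


def index_by_id(blocks: List[Dict]) -> Dict[str, Dict]:
    """Map each block's id to the block itself."""
    return {block["id"]: block for block in blocks}


def children_of(block: Dict, blocks_by_id: Dict[str, Dict]) -> List[Dict]:
    """Return the blocks referenced by this block's CHILD relationships.

    Ids that are not present in the index are skipped rather than raising.
    """
    child_ids = [
        block_id
        for relationship in block.get("relationships", [])
        if relationship.get("type") == "CHILD"
        for block_id in relationship.get("ids", [])
    ]
    return [blocks_by_id[block_id] for block_id in child_ids if block_id in blocks_by_id]


# Hypothetical data: short ids stand in for the UUIDs in real output.
blocks = [
    {
        "id": "line-1",
        "text": "Courageous, pleading",
        "relationships": [{"type": "CHILD", "ids": ["word-1", "word-2", "word-3"]}],
    },
    {"id": "word-1", "text": "Courageous,", "relationships": []},
    {"id": "word-2", "text": "pleading", "relationships": []},
]

lookup = index_by_id(blocks)
child_texts = [child["text"] for child in children_of(blocks[0], lookup)]
print(child_texts)
```

Skipping unknown ids is a design choice for this sketch; raising an error instead may be preferable when the output is expected to be complete.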

Example Output
{
  "output": {
    "blocks": [
      {
        "relationships": [],
        "confidence": 99.35694,
        "geometry": {
          "width": 0.2716,
          "height": 0.02702,
          "x": 0.36377,
          "y": 0.07574
        },
        "text": "Spirit Game Script",
        "id": "e63c08cf-0bef-4c6d-ac04-0e250e254229",
        "page": 1,
        "type": "LINE"
      },
      {
        "relationships": [
          {
            "ids": [
              "a5ade6e3-a368-49fe-b338-7a8a4fbe9058",
              "10e1b516-02c3-4eaf-a216-65efc12af6a7",
              "2f44eedd-651b-4ac9-b42a-744bd0dfcbe1",
              "fbb0d176-f9fb-4c48-8a5a-4d09252f6d17"
            ],
            "type": "CHILD"
          }
        ],
        "confidence": 99.51574,
        "geometry": {
          "width": 0.37119,
          "height": 0.18922,
          "x": 0.321415,
          "y": 0.251234
        },
        "text": "Courageous, pleading and clear.",
        "id": "7396c015-e0aa-49e5-8d05-3cf9a7430539",
        "page": 1,
        "type": "LINE"
      },
      // ... plus potentially many more entries!
    ]
  }
}
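A minimal sketch of consuming output shaped like the example above: it parses the JSON, then sorts the LINE blocks into an approximate reading order by page, then y, then x. The sample string is trimmed from the example and deliberately lists the blocks out of order; the `// ...` placeholder is left out because `json.loads` does not accept comments. The function name `reading_order` is an assumption, as is the idea that sorting by y and x approximates reading order for a given layout.

```python
import json
from typing import Dict, List

# Trimmed from the Example Output, with the two blocks swapped on purpose.
raw = """
{
  "output": {
    "blocks": [
      {
        "confidence": 99.51574,
        "geometry": {"width": 0.37119, "height": 0.18922, "x": 0.321415, "y": 0.251234},
        "text": "Courageous, pleading and clear.",
        "page": 1,
        "type": "LINE"
      },
      {
        "confidence": 99.35694,
        "geometry": {"width": 0.2716, "height": 0.02702, "x": 0.36377, "y": 0.07574},
        "text": "Spirit Game Script",
        "page": 1,
        "type": "LINE"
      }
    ]
  }
}
"""


def reading_order(document: Dict) -> List[Dict]:
    """Return LINE blocks sorted by page, then top-to-bottom, then left-to-right."""
    lines = [b for b in document["output"]["blocks"] if b.get("type") == "LINE"]
    return sorted(lines, key=lambda b: (b["page"], b["geometry"]["y"], b["geometry"]["x"]))


document = json.loads(raw)
ordered_text = [block["text"] for block in reading_order(document)]
print("\n".join(ordered_text))
```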