
Working with real-world documents is still a pain: PDFs, invoices, random exports from legacy tools. Half the work is just getting them into a clean, structured format your models can use.
This post is about that first step. The one that usually gets ignored in demos and tutorials. Parsing and structuring the documents.
The tools here handle OCR, layout, tables, forms and file formats so you can focus on the logic around them.
I am walking through a few I actually like using, with short code snippets you can drop straight into your own projects.
So, let's begin.

Document Ingestion API plus a serverless runtime for agentic data workflows

Tensorlake gives you two big things in one place:
You can send PDFs, Office files, images or raw text and get back well structured content with preserved layout. Long story short, you can treat it as a Document Ingestion API that handles PDFs, Office files, scans and images, then add agent style applications on top using their serverless runtime.
So, instead of handling OCR and background jobs with retry logic yourself, you get a single platform that parses, chunks, classifies and then feeds the results into your agents or tools.
Is it for you?
If you are building invoice extractors, contract analyzers, or any complex data ingestion pipeline or agent that needs to actually read documents, Tensorlake sits right in the middle of your stack as the ingestion and workflow layer.

And many more...
Now, let's go through a quick code example of some common use cases.
First, install the SDK and use the DocumentAI client to upload a PDF, start a parse job and stream the markdown chunks once parsing is done.
pip install tensorlake

Now, to extract the text from a PDF, you can do something like:

from tensorlake.documentai import DocumentAI, ParseStatus

doc_ai = DocumentAI(api_key="your-api-key")

# Upload and parse document
file_id = doc_ai.upload("/path/to/document.pdf")

# Start parsing
parse_id = doc_ai.parse(file_id)

# Wait until parsing is complete
result = doc_ai.wait_for_completion(parse_id)

if result.status == ParseStatus.SUCCESSFUL:
    # Each chunk is a piece of clean markdown
    for chunk in result.chunks:
        print(chunk.content)

This is the basic flow you would use in a backend job that takes uploaded PDFs and turns them into LLM friendly text for something like RAG or search.
Once you have the chunks, you can push them straight into a vector store or a database.
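Here's a rough sketch of the "push them into a database" part, using SQLite from the standard library so it runs anywhere. The table layout, the store_chunks helper and the sample strings are my own assumptions for illustration, not part of the Tensorlake SDK; in a real job you would pass in the chunk.content values from the parse result.

```python
import sqlite3
from typing import Iterable


def store_chunks(conn: sqlite3.Connection, doc_id: str, chunks: Iterable[str]) -> int:
    """Insert markdown chunks for one document and return how many were stored."""
    # Hypothetical schema: one row per chunk, ordered by position in the document.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS chunks (
            doc_id   TEXT NOT NULL,
            position INTEGER NOT NULL,
            content  TEXT NOT NULL,
            PRIMARY KEY (doc_id, position)
        )
        """
    )
    # Skip chunks that are empty or whitespace only.
    rows = [(doc_id, i, text) for i, text in enumerate(chunks) if text.strip()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO chunks (doc_id, position, content) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


# Made-up sample data standing in for parsed chunks.
conn = sqlite3.connect(":memory:")
sample_chunks = ["# Invoice 001", "Total due: 120.00 EUR", "   "]
stored = store_chunks(conn, "invoice-001", sample_chunks)
```

Swapping SQLite for a vector store mostly means replacing the INSERT with an embed-and-upsert call; the loop over chunks stays the same.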
You can get more control over parsing, for example with structured parsing, which you can find here: Structured Extraction. I'll leave it up to you to explore that further.
To run a small agentic app on top of Tensorlake, it's as simple as:
import os

from agents import Agent, Runner
from agents.tool import WebSearchTool, function_tool
from tensorlake.applications import application, function, run_local_application, Image

# Container image with the dependencies the function needs
FUNCTION_CONTAINER_IMAGE = Image(
    base_image="python:3.11-slim",
    name="city_guide_image",
).run("pip install openai openai-agents")


@function_tool
@function(
    description="Gets the weather for a city",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def get_weather_tool(city: str) -> str:
    agent = Agent(
        name="Weather Reporter",
        instructions="Use web search to find current weather in the city",
        tools=[WebSearchTool()],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()


@application(tags={"type": "example", "use_case": "city_guide"})
@function(
    description="Creates a simple city guide",
    secrets=["OPENAI_API_KEY"],
    image=FUNCTION_CONTAINER_IMAGE,
)
def city_guide_app(city: str) -> str:
    agent = Agent(
        name="Guide Creator",
        instructions="Make a friendly city guide that includes the current temperature",
        tools=[get_weather_tool],
    )
    result = Runner.run_sync(agent, f"City: {city}")
    return result.final_output.strip()


if __name__ == "__main__":
    city = "Paris"
    if not os.environ.get("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY is not set")
        raise SystemExit(1)

    request = run_local_application("city_guide_app", city)
    response = request.output()
    print(response)

The code above creates a city guide application using OpenAI Agents with tool calls. I'm not going to explain the code here, as that would make this post unnecessarily long.
You can find the explanation for this code in their GitHub README.
To run the application on Tensorlake Cloud, it first needs to be deployed.
Set TENSORLAKE_API_KEY in your shell session:

export TENSORLAKE_API_KEY="Paste your API key here"

Set OPENAI_API_KEY in your Tensorlake Secrets so that your application can make calls to OpenAI:

tensorlake secrets set OPENAI_API_KEY "Paste your API key here"

Deploy the application:

tensorlake deploy examples/readme_example/city_guide.py

Then call the deployed application from a script such as examples/readme_example/test_remote_app.py:

from tensorlake.applications import run_remote_application

city = "San Francisco"

# Run the application remotely
request = run_remote_application("city_guide_app", city)
print(f"Request ID: {request.id}")

# Get the output
response = request.output()
print(response)

In short, Tensorlake takes care of spinning up containers, injecting secrets and keeping the function durable so it can retry tool calls without you building your own queue system.
Here's a quick Tensorlake document ingestion demo to see it in action working with a complex document.

Docling is from the IBM Research Team, licensed under MIT (free and open to commercial use), and turns PDFs, Office docs, images, audio and more into a unified DoclingDocument format. You can then export that into markdown, HTML, DocTags or lossless JSON and plug it straight into RAG, agents or search.
It runs locally and comes with strong layout and table understanding plus OCR and vision models for scanned or complex documents.
Multi format parsing - PDF, DOCX, PPTX, XLSX, HTML, images, audio and more into one structured representation.
Advanced PDF understanding - Page layout, reading order, tables, code, formulas and images handled out of the box.
Multiple export targets - Export a single DoclingDocument to markdown, HTML, DocTags or structured JSON.
Local and privacy friendly - Designed to run completely locally.
Gen AI integrations - Hooks into LangChain, LlamaIndex, Haystack and others out of the box.
And many more...
The basic flow is intentionally simple: create a converter, give it a source and then decide how you want to export the result.
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # can also be a local Path(...)
converter = DocumentConverter()
result = converter.convert(source)

markdown = result.document.export_to_markdown()
print(markdown)

This example shows the "one document in, one markdown document out" path that you would usually add to your indexing step. You get a single markdown document that you can split into chunks and feed into a vector database.
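To make the "split into chunks" step concrete, here's a minimal sketch that cuts a markdown string at headings and falls back to fixed-size windows for oversized sections. The split_markdown_by_heading helper and the 1000 character default are my own assumptions, not something Docling ships; it is deliberately naive (for example, it does not special-case "#" lines inside code fences).

```python
import re
from typing import List


def split_markdown_by_heading(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split markdown at headings, then cap each section at max_chars."""
    # Split right before every line that starts with 1-6 '#' followed by a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut into fixed-size windows as a fallback.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


# Made-up sample standing in for the exported markdown string.
sample = "# Title\nIntro text.\n\n## Section A\nBody A.\n\n## Section B\nBody B."
chunks = split_markdown_by_heading(sample)
```

In the indexing step you would call it on the markdown string from the export above and embed each returned chunk.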
Docling also comes with a CLI. You can install it with the following command:
pip install docling

Now, you can run it using the following command:

# Convert a PDF at a URL to markdown
docling https://arxiv.org/pdf/2206.01062

# Use the GraniteDocling vision language model in the pipeline
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062

Obviously, there are more complex use cases with a lot more flags you can add. For those, visit their documentation.
Here's a quick video by Red Hat to see it in action.

Unstructured gives you an open source library plus a managed platform to turn unstructured content into structured data for LLM apps. It partitions PDFs, slides, HTML, Office files and images into a standard set of elements that downstream tools can easily consume.
On top of that, the ingest layer adds connectors, chunking and embeddings so you can build full ETL style pipelines around your document sources.

Calling partition is the core pattern you will see in most examples, and it is enough to plug into a RAG pipeline.

from unstructured.partition.auto import partition

# Read and partition a document
elements = partition("example-docs/layout-parser-paper.pdf")

# Inspect a few elements
for el in elements[:5]:
    print(repr(el.category), "->", str(el)[:80], "...")

You end up with a list of elements that know their category, which makes it easy to filter for titles, paragraphs or tables before you use them further.
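Filtering by category can look something like the sketch below. To keep it runnable without installing anything, it uses a small FakeElement class as a hypothetical stand-in for Unstructured's elements; the only things it relies on are a category attribute and str(el), both of which appear in the snippet above. The group_by_category helper is my own.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class FakeElement:
    """Hypothetical stand-in for an Unstructured element (category + text)."""
    category: str
    text: str

    def __str__(self) -> str:
        return self.text


def group_by_category(elements: Iterable, keep: Iterable[str]) -> Dict[str, List[str]]:
    """Collect element text per category, keeping only the wanted categories."""
    wanted = set(keep)
    grouped: Dict[str, List[str]] = {name: [] for name in wanted}
    for el in elements:
        if el.category in wanted:
            grouped[el.category].append(str(el))
    return grouped


# Made-up elements; with the real library, pass the list returned by partition().
sample_elements = [
    FakeElement("Title", "LayoutParser"),
    FakeElement("NarrativeText", "Recent advances in document image analysis..."),
    FakeElement("Table", "Dataset | Model | Score"),
    FakeElement("UncategorizedText", "1"),
]
grouped = group_by_category(sample_elements, keep=["Title", "Table"])
```

Dropping page numbers and other uncategorized bits before chunking tends to keep the index cleaner.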
For real projects you usually need to process many files at once and save the outputs somewhere. Unstructured comes with an ingest CLI that is built for exactly that.
# Chunk and partition an entire folder of files
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --chunking-strategy by_title \
    --chunk-max-characters 1024 \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res

This runs a full pipeline that reads documents from LOCAL_FILE_INPUT_DIR, partitions them with the hi_res strategy, chunks them by title and writes the structured outputs into your output directory. From there, you can index or analyze them however you like.
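Once the run finishes, reading the results back is a small loop. The sketch below assumes each output file is a JSON array of element objects with at least "type" and "text" keys; check a file from your own run before relying on that. The iter_elements helper and the demo file are made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def iter_elements(output_dir: Path) -> Iterator[Dict]:
    """Yield element dicts from every JSON file under output_dir."""
    for json_path in sorted(output_dir.rglob("*.json")):
        with json_path.open(encoding="utf-8") as f:
            for element in json.load(f):
                # Remember which output file each element came from.
                yield {**element, "source_file": json_path.name}


# Demo with a made-up output file; real files come from the ingest run.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "report.pdf.json").write_text(
        json.dumps([
            {"type": "Title", "text": "Quarterly report"},
            {"type": "NarrativeText", "text": "Revenue grew."},
        ]),
        encoding="utf-8",
    )
    loaded = list(iter_elements(demo_dir))
    texts = [el["text"] for el in loaded if el["type"] == "NarrativeText"]
```

For a real run, point iter_elements at the directory you passed as --output-dir.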
Here's a quick API quickstart to get an idea.

Amazon Textract is AWS's managed OCR and document analysis service that pulls text, handwriting, layout and structured data out of scanned documents and PDFs.
It runs inside your AWS account, plugs into services like S3, Lambda, SNS and SQS, and is used at scale by companies like PayTM for document workflows.
This is the basic pattern if you just want the text out of a document. You read the file as bytes, call detect_document_text and print the lines Textract finds.
import boto3

textract = boto3.client("textract")  # uses your AWS credentials

file_path = "sample-doc.png"  # JPEG, PNG, TIFF or a single-page PDF

with open(file_path, "rb") as f:
    image_bytes = f.read()

response = textract.detect_document_text(
    Document={"Bytes": image_bytes}
)

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])

Textract returns a flat list of blocks, and keeping only the LINE blocks gives you the text line by line.
To pull structured data from forms and tables, you use analyze_document with the FORMS and TABLES feature types and point Textract at a document in S3.
import boto3

textract = boto3.client("textract")

bucket_name = "my-doc-bucket"
object_key = "invoices/invoice-001.png"

response = textract.analyze_document(
    Document={
        "S3Object": {
            "Bucket": bucket_name,
            "Name": object_key,
        }
    },
    FeatureTypes=["FORMS", "TABLES"],
)

print(f"Found {len(response['Blocks'])} blocks")

# Quick peek at found tables
for block in response["Blocks"]:
    if block["BlockType"] == "TABLE":
        print("Detected a table with Id:", block["Id"])

There is a lot of other complex stuff that you can do with Textract. For more details, check out the Textract documentation.
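For the FORMS side, the key-value pairs are spread across linked blocks, so you need a bit of glue to turn them into a plain dict. Here's a minimal sketch of that glue. It walks the KEY_VALUE_SET blocks and their VALUE and CHILD relationships as they appear in an analyze_document response; the helper names and the tiny sample response are my own, and checkbox style selection elements are ignored to keep it short.

```python
from typing import Dict, List


def _text_of(block: Dict, blocks_by_id: Dict[str, Dict]) -> str:
    """Join the WORD children of a block into one string."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response: Dict) -> Dict[str, str]:
    """Pair each KEY block with the text of the VALUE block it points to."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs: Dict[str, str] = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_of(blocks_by_id[v_id], blocks_by_id) for v_id in rel["Ids"]
                )
        pairs[_text_of(block, blocks_by_id)] = value_text
    return pairs


# Hand-written sample in the shape of a Textract response, not real output.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "INV-001"},
    ]
}
pairs = extract_key_values(sample_response)
```

With a live call you would pass the response from analyze_document straight into extract_key_values.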
In production you usually wire this up with S3 triggers and Lambda so new documents are picked up and processed by themselves.
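A bare-bones version of that wiring might look like this. The handler reads the bucket and key from a standard S3 event notification and calls detect_document_text on the object. The function names, the fake client and the sample event are assumptions of mine for a local dry run; error handling, multi-page PDFs and writing the results somewhere are left out.

```python
import urllib.parse
from typing import Any, Dict, List


def extract_lines_for_event(event: Dict[str, Any], textract_client: Any) -> Dict[str, List[str]]:
    """Run text detection for every S3 object named in the event."""
    results: Dict[str, List[str]] = {}
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = textract_client.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        results[key] = [
            b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"
        ]
    return results


def lambda_handler(event, context):
    """Lambda entry point; boto3 ships with the Lambda Python runtime."""
    import boto3

    return extract_lines_for_event(event, boto3.client("textract"))


class FakeTextract:
    """Stand-in client for a local dry run without AWS credentials."""

    def detect_document_text(self, Document):
        return {"Blocks": [
            {"BlockType": "LINE", "Text": "Hello"},
            {"BlockType": "WORD", "Text": "Hello"},
            {"BlockType": "LINE", "Text": "World"},
        ]}


sample_event = {"Records": [{"s3": {
    "bucket": {"name": "my-doc-bucket"},
    "object": {"key": "invoices/invoice+001.png"},
}}]}
lines = extract_lines_for_event(sample_event, FakeTextract())
```

Keeping the Textract client as a parameter is a small design choice that lets you test the event parsing without touching AWS.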
Here's a quick intro to Amazon Textract.

Document AI is Google Cloud's document stack that gives you ready made processors for invoices, receipts, forms, IDs and general OCR. You pick a processor, send it a file and get back a Document object with text, structure, entities and layout info, not just raw strings.
The nice part is how it fits into the rest of GCP (Google Cloud Platform). You can drop files into Cloud Storage, trigger processing with Pub/Sub or Cloud Functions or Cloud Run, then push clean data into BigQuery or your app.
This is the usual Python flow. You create a processor in the console, grab its ID, then call it from your code.
from google.cloud import documentai

project_id = "your-project-id"
location = "us"
processor_id = "your-processor-id"
file_path = "path/to/document.pdf"

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(project_id, location, processor_id)

with open(file_path, "rb") as f:
    file_bytes = f.read()

raw_document = documentai.RawDocument(
    content=file_bytes,
    mime_type="application/pdf",
)

request = documentai.ProcessRequest(
    name=name,
    raw_document=raw_document,
)

result = client.process_document(request=request)
doc = result.document

print(doc.text[:1000])

You send raw bytes plus the MIME type, Document AI runs the selected processor and you get back a Document object. For quick use cases, grabbing doc.text is enough.
If you use a form style processor, Document AI already marks fields as key value pairs, which you can loop over and map into your own schema.
def clean(text: str) -> str:
    return text.replace("\n", " ").strip()


form_doc = doc  # the Document from the previous example

fields = []
for page in form_doc.pages:
    for field in page.form_fields:
        name = clean(field.field_name.text_anchor.content)
        value = clean(field.field_value.text_anchor.content)
        conf = field.field_value.confidence
        fields.append((name, value, conf))

for name, value, conf in fields:
    print(f"{name}: {value} (conf {conf:.2f})")

This is the point where a scanned form basically becomes plain Python data: a list of (name, value, confidence) tuples. From here, you can push the data into BigQuery, Firestore or any service you use on GCP.
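Mapping those fields into your own schema could look something like this. It's a minimal sketch that takes (name, value, confidence) tuples like the ones collected above. The label mapping, the field names and the 0.8 confidence threshold are all made up for illustration; pick values that fit your forms.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical mapping from labels printed on the form to our own field names.
LABEL_TO_FIELD = {
    "invoice number": "invoice_id",
    "total": "total_amount",
    "date": "issued_on",
}


def normalize_label(label: str) -> str:
    """Lowercase a form label and drop trailing colons and whitespace."""
    return label.strip().rstrip(":").strip().lower()


def map_to_schema(
    fields: Iterable[Tuple[str, str, float]], min_confidence: float = 0.8
) -> Dict[str, Optional[str]]:
    """Build one record, leaving fields as None when missing or low confidence."""
    record: Dict[str, Optional[str]] = {name: None for name in LABEL_TO_FIELD.values()}
    for label, value, conf in fields:
        target = LABEL_TO_FIELD.get(normalize_label(label))
        if target is not None and conf >= min_confidence:
            record[target] = value
    return record


# Made-up sample in the same (name, value, confidence) shape as above.
sample_fields = [
    ("Invoice Number:", "INV-001", 0.97),
    ("Total", "120.00", 0.91),
    ("Date:", "2024-01-31", 0.42),  # below the threshold, so it is skipped
    ("Notes", "n/a", 0.99),         # not in the schema, so it is ignored
]
record = map_to_schema(sample_fields)
```

Leaving low confidence fields as None makes it easy to route those records to a human review step instead of silently storing a bad value.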
This is just a start, and there's a lot more to it. Visit the documentation to learn more.
Here's a quick introduction to Google Cloud Document AI.
If you think of any other handy AI tools that I haven't covered in this article, do share them in the comments section below.
So, that is it for this article. Thank you so much for reading!
