Unlocking Data Stored in Images, PDFs and Documents

Azure Search

“An estimated 80% of corporate data is unstructured”

Because I work with SQL Server and BI tools, when I think of data I immediately think of tables and records stored in databases. But when we look at an organization’s entire data store, most of it likely isn’t nicely organized and normalized like that.

No, it’s likely tied up on file servers, in emails, and in PDFs and JPEGs. But there are still insights hiding there. What if we could unlock the data contained in these files?

Platforms and solutions exist that can read handwriting and printed text stored in images and PDFs. We can also index these files, so users can search for every place a word or phrase is mentioned. Because I work in the automotive financial services industry, I’ll give an example that could greatly improve our fraud detection and customer service capabilities:

Imagine a claims or service representative being able to search through our entire document and photo store to find all files where a specific customer, invoice, VIN, or part number is mentioned.
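
To make the idea a bit more concrete, here is a minimal sketch in Python of what “indexing” files for search means. It is a toy inverted index, not how any particular search service works internally, and it assumes the text has already been pulled out of each file (for example by OCR). The file names, the sample text, and the helper names (tokenize, build_index, search) are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical text already extracted (e.g. by OCR) from files in a document store.
# The file names and contents are invented for this illustration.
extracted_text = {
    "claim_0412.pdf": "Claim filed for VIN 1HGCM82633A004352, replaced part number BP-2291.",
    "invoice_scan.jpg": "Invoice 88731: brake pads BP-2291, labour 2 hours.",
    "service_note.docx": "Customer called about VIN 1HGCM82633A004352 warranty.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (inner hyphens kept)."""
    return re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())

def build_index(docs):
    """Build an inverted index: token -> set of file names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index

def search(index, term):
    """Return a sorted list of the files that mention the term (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))

index = build_index(extracted_text)
print(search(index, "1HGCM82633A004352"))  # files mentioning this VIN
print(search(index, "BP-2291"))            # files mentioning this part number
```

A real service adds a lot on top of this (ranking, fuzzy matching, scaling to terabytes), but the core idea is the same: do the expensive reading once, then answer lookups from the index.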

I recently found out about a Microsoft service called Azure Search, a search and indexing service you can add to your existing applications (i.e. without having to build your own search engine). To be clear, other cloud providers offer similar products, such as Amazon CloudSearch and Google Cloud Search, but since I have Azure developer credits, I started by exploring Microsoft’s option.

Anyway, the use cases for Azure Search are wide-ranging. Combined with the machine learning power of Azure Cognitive Services, it lets you apply image processing and OCR to recognize text, both handwritten and printed, in images, documents and PDFs. You can also index structured data sources, like SQL databases. So imagine a service that enables your business users to search through your databases along with terabytes of [newly] searchable images, documents and PDFs.

Pretty powerful stuff.

Microsoft set up an example website and data set from the JFK assassination files using the service. You can play around with it yourself here. There’s another example done with NYC job postings here.

Now that I’ve found out about the CAD $200/month Azure credit I get with my Visual Studio subscription, I would love to build a Proof of Concept (PoC), either on my own or with a team at my organization’s next Hackathon.

Azure Search JFK

