
Extracting Unstructured Data from 1000s of PDFs using Automation and OCR

A U.S.-based company that markets and underwrites specialty insurance products and programs for a variety of niche markets needed a solution to extract unstructured data from thousands of policies in various file formats, such as PDF and Word documents.

Client Challenges and Requirements

  • Reduce the manual effort of reading and extracting information from various file formats such as PDF, Excel, email and image files.
  • Identify whether each document is a scanned PDF with unstructured data or a digital PDF, and apply the appropriate extraction method.
  • Upload the extracted data to the data system in a usable format.

Bitwise Solution

An end-to-end solution to address key pain areas and show value quickly. The Bitwise solution covered 3 phases:
  • Strategy and Assessment – identify and prioritize file types and pain areas
  • Solution Development – develop the best extraction option using Bitwise reusable modular utilities and third-party tools to provide the maximum level of automation, and configure scripts to extract the data
  • Validation – ensure accuracy on highly critical files and provide a search feature for searching the original document
Reusable ‘modular’ utilities used:
  • Email extraction
  • Reading the contents of a PDF to identify whether it is digital or scanned (requiring OCR)
  • Routing utility to direct documents to automatic or manual processing
  • Script to automatically extract the identified data points
  • Script that pushes JSON, CSV or another preferred file type to the data system
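
As a rough illustration of how utilities like these could fit together, here is a minimal Python sketch. It is not the Bitwise implementation: every function name, field name and threshold below is hypothetical. It assumes the text of each page has already been pulled from the PDF by some library, so the functions work on plain strings. A document counts as digital when most pages carry a usable text layer, a record is routed to manual handling when a required field is missing, and the results are serialized to JSON or CSV.

```python
import csv
import io
import json

# Hypothetical threshold: a page with fewer extractable characters than this
# is treated as having no usable text layer.
MIN_CHARS_PER_PAGE = 25


def classify_pdf(page_texts):
    """Return 'digital' if more than half the pages have text, else 'scanned'."""
    if not page_texts:
        return "scanned"
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= MIN_CHARS_PER_PAGE
    )
    return "digital" if pages_with_text * 2 > len(page_texts) else "scanned"


def route(record, required_fields):
    """Return 'auto' only when every required field was extracted."""
    missing = [name for name in required_fields if not record.get(name)]
    return "manual" if missing else "auto"


def to_json(records):
    """Serialize extracted records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records, fieldnames):
    """Serialize extracted records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Illustrative data only; the field names are assumptions.
    pages = ["Policy number P-1 issued to Acme Corp for the 2020 term.", ""]
    record = {"policy_number": "P-1", "insured": "Acme Corp"}
    print(classify_pdf(pages))
    print(route(record, ["policy_number", "insured"]))
    print(to_csv([record], ["policy_number", "insured"]))
```

In a real pipeline the page text would come from a PDF library for digital files and from an OCR engine for scanned ones, and the routing rule would likely weigh more than missing fields, for example OCR confidence or how critical the file is.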

Tools & Technologies We Used

Open source tools

  • Tesseract for OCR of scanned PDFs
  • iText for digital PDFs

Key Results

Reduced data entry work by over 60%, resulting in more efficient use of resources

Ability to achieve 100% accuracy on highly critical files

Modular application allows for easy re-use

