Automating Extraction and Conversion of Unstructured Data

The Gold Mine For Data Analysis To Help Professionals Improve Outcomes

Natural Language Processing (NLP), Artificial Intelligence (AI) and, most recently, Generative Pre-trained Transformer (GPT) models are analytic techniques rapidly changing the ability of professionals to use data to make better, faster decisions, ultimately improving results. In addition, technical engineers are regularly improving input controls, navigational components, informational components and containers to create new, visually appealing and interactive user interfaces. Some have described these developments as revolutionary. While these advancements create powerful analytics displayed in a user-friendly format, one change above all others is the key to the next generation of data analytics. The automated extraction of unstructured data found in primary source documents, reports, records and other media, and its conversion to structured data that can be used for analysis, is the most significant step in the evolution of data analytics.

The Challenges

Technology provides rapid, efficient, consistent and accurate data analysis, but the predictions, conclusions and decisions derived from that analysis are only as good as the data used to power it. If the data is not timely, accurate and complete, the analysis is limited and may be inaccurate. Most analytic models utilize only structured data because it is accessible. Typically, structured data is produced through a multi-step process that most frequently involves human intervention: primary sources are reviewed, specific pre-identified information is gleaned from those sources, and the information is entered into a structured data field created to store that piece of information. The information in the structured field is then connected to the analytics engine, which assumes the structured data provides an answer to a specific question. In the context of liability claims, the question might be: does this claim involve a potential closed head injury? Extraction from unstructured primary sources and conversion to usable structured data is essential to analytics, but it involves several opportunities for “Leakage.”

In this context, “Leakage” is an inefficiency or error that reduces the value of the end result. The first area of Leakage is the delay inherent in the review of the primary source information. A professional required to manually review the primary source to identify key data often needs to prioritize multiple tasks. Other tasks may take priority, resulting in a delay in extracting the data from the primary source. Moreover, Leakage can occur if a professional is not able to devote sufficient time to accurately extract all relevant data from a primary source document.

After information is located in a primary source document, it must be accurately transferred to a structured data field. This transfer creates opportunities for additional Leakage. Professionals are not always excited about tasks they consider data entry. This lack of enthusiasm for the task can result in deferral and errors. To allow professionals more time to focus on tasks requiring judgment, data entry is frequently handed off to operations teams. The hand-off creates the opportunity for further delay and errors in the communication of information from the professional to the operations team. Operations teams may use a human-only, a fully automated or an interactive human-automated process to convert the extracted unstructured data to structured fields. If humans are involved, the potential for Leakage during the conversion stage increases.

Significantly, the accuracy of structured data depends upon consistent definitions of the information entered in the structured field, clear and frequent communication of those definitions, and execution utilizing those definitions rather than some personal interpretation. While this sounds basic, in reality it is not always achieved, and the result can be undetected Leakage. Staff turnover, changes in use cases for the data, infrequent training and limited quality assurance reviews can exacerbate Leakage. For example, Venue and Jurisdiction are standard structured fields frequently used for analytics in liability insurance claims. The court in which litigation, if any, is likely to proceed is a key risk factor. What questions have been used to identify the data that is entered into these fields? What is the Venue or Jurisdiction if a claim is not yet in suit? What is the Venue or Jurisdiction if the claimant is not represented by counsel? Are Venue and Jurisdiction interchangeable terms or is there a difference? Have the same definitions always been utilized for these terms? If litigation is pending, what naming conventions are utilized to accurately identify the name of the court? Does everyone involved in extracting information know the applicable rules? What do regular quality assurance reviews reveal about the consistency and accuracy of the data entered in these structured fields?
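To make the naming-convention question concrete, the following is a minimal sketch of how one agreed definition of a court name could be enforced in code rather than left to individual interpretation. The alias table, the canonical name format and the function names are all hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical alias table: each canonical court name maps to variants that
# might have been typed into a structured field. Entries are illustrative only.
COURT_ALIASES = {
    "Cook County Circuit Court, IL": [
        "cook county circuit court",
        "circuit court of cook county",
        "cook cty cir ct",
    ],
}


def _key(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(cleaned.split())


# Reverse lookup built once: normalized variant -> canonical name.
_LOOKUP = {
    _key(variant): canonical
    for canonical, variants in COURT_ALIASES.items()
    for variant in variants
}


def canonical_court(raw: str) -> Optional[str]:
    """Return the canonical court name, or None when the entry is not recognized."""
    return _LOOKUP.get(_key(raw))
```

Under these assumptions, `canonical_court("Cook Cty. Cir. Ct.")` and `canonical_court("Circuit Court of Cook County")` resolve to the same stored value, while an unrecognized entry returns `None` and can be routed to a quality assurance review instead of being silently accepted.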

Regardless of the cause, Leakage in the process of extracting information from primary sources and converting it into structured fields diminishes the value of even the best analytics and most advanced user interface.


Technology can now efficiently, consistently and accurately locate answers to specific questions in unstructured information, extract them, and convert them into structured data that can be used for analytics. For example, technology can search unstructured documents to confirm the name and location of the court in which litigation is pending, or to gather the information necessary to predict where litigation, if commenced, is likely to proceed. This automated extraction not only reduces Leakage related to new information; it can also be applied to previously obtained unstructured data to validate or correct structured data that is not consistent with the current definitions. Thus, automated extraction and conversion significantly improves the timeliness, completeness and accuracy of the data that drives the analysis, providing professionals with the information necessary to achieve the best possible outcome.
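As a rough illustration of the court example, the sketch below pulls the court, place and state from a pleading caption and compares them with what is already stored in structured fields. It assumes captions of the form "IN THE CIRCUIT COURT OF COOK COUNTY, ILLINOIS" and uses a single regular expression, which is far simpler than the NLP or GPT-based extraction discussed above. The function names and field names are hypothetical.

```python
import re
from typing import Optional

# Illustrative pattern for one caption style only. Real pleadings vary widely,
# so this is an assumption made for the sketch, not a general solution.
CAPTION = re.compile(
    r"IN THE (?P<court>[A-Z ]+? COURT) (?:OF|FOR) "
    r"(?P<place>[A-Z ]+?), (?P<state>[A-Z ]+)"
)


def extract_venue(document_text: str) -> Optional[dict]:
    """Return court, place, and state from the first caption found, else None."""
    match = CAPTION.search(document_text.upper())
    if match is None:
        return None
    # Collapse stray whitespace and restore title case for storage.
    return {k: " ".join(v.split()).title() for k, v in match.groupdict().items()}


def disagreements(record: dict, document_text: str) -> dict:
    """Map each field whose stored value differs from the source to the extracted value.

    An empty result means no disagreement was detected; that includes the
    case where no caption was found in the document at all.
    """
    found = extract_venue(document_text) or {}
    return {
        field: value
        for field, value in found.items()
        if record.get(field, "").strip().lower() != value.lower()
    }
```

Run against historical documents, a check like `disagreements` would flag a stored place of "Lake County" when the complaint itself names Cook County, which is the kind of validation of existing structured data described above.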

Moreover, by simply changing and refining the question, automated extraction and conversion capabilities can be adapted to find and provide data that is meaningful to any analytics model. The automated ability to mine and use information in unstructured data is the most significant advancement in technology designed to help professionals.
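One way to picture "changing the question" is to treat each question as configuration rather than code. The sketch below is a deliberately simple keyword-based stand-in: the field names and search terms are hypothetical, they are not a clinical or legal standard, and plain keyword matching ignores negation (a note reading "no loss of consciousness" would still match), which is one reason real systems rely on more capable language models.

```python
import re

# Hypothetical question definitions: each structured field name maps to a
# pattern that signals a "yes" answer. The terms are illustrative assumptions.
QUESTIONS = {
    "possible_closed_head_injury": (
        r"\b(concussion|closed head injury|loss of consciousness)\b"
    ),
    "claimant_represented": (
        r"\b(attorney for (the )?(plaintiff|claimant)|letter of representation)\b"
    ),
}


def answer_questions(document_text: str, questions: dict = QUESTIONS) -> dict:
    """Return one boolean per question, ready to store as structured fields."""
    return {
        field: re.search(pattern, document_text, flags=re.IGNORECASE) is not None
        for field, pattern in questions.items()
    }
```

Adding a new question for a different analytics model then means adding one entry to the dictionary, or passing a different `questions` mapping, without touching the extraction function itself.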




