Much of our text data isn’t in simple plain text format, but instead in images or other formats that don’t necessarily translate well into plain text for language models to process. Luckily, OCR (Optical Character Recognition) has come a long way and has become a powerful ally when it comes to turning images into text. OCR tools like PyTesseract not only provide good out-of-the-box solutions but nowadays also tie in well with other tools. By extracting not only the text itself but also where each word appears on the image, its size, and additional info, we get more insightful ways to extract data and a safe route to map it back to the original image. This allows us to keep the original file format while modifying it as we wish using these pieces of information, for example, highlighting or blurring names, or structuring our data into a database automatically.
By using Natural Language Processing (NLP) on top of the OCR output, we can apply more intelligent forms of processing to tackle more complex tasks or get better results on tasks we’ve already implemented. Named Entity Recognition (NER), the process of recognizing entities such as “PERSON”, “DATE”, and others inside a text, is one example that can be used by itself or alongside other methods.
Imagine you have many scanned files containing sensitive information that you want to blur out to guarantee they’re compliant with data protection laws (such as Brazil’s LGPD). Let’s use PyTesseract and SpaCy to process them automatically. First, as is standard with OCR, pre-processing the image is in our best interest, as it helps the OCR engine improve its predictions. For this step, we could use any image processing library, such as OpenCV, to apply binarization and other minor tweaks. An excellent overall processing pipeline for scanned documents follows below:
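As a minimal sketch of such a pipeline, assuming OpenCV with a simple Otsu binarization (real scans may also call for deskewing, dewarping, or contrast adjustments):

```python
import cv2

def preprocess(path):
    """Grayscale, light denoise, and Otsu binarization for a scanned page."""
    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Median blur removes the salt-and-pepper noise common in scans
    denoised = cv2.medianBlur(gray, 3)
    # Otsu's method picks the binarization threshold automatically
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```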
Below we can see the scanned file after pre-processing, which is exactly what gets sent to the OCR engine once it has gone through the above pipeline.
Using PyTesseract, we can get the extracted text both as a plain string and as a pandas DataFrame containing each word, its position on the image, and other info about the extraction, such as which block of text, paragraph, and line it belongs to:
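A sketch of this extraction step, reusing the preprocess helper from above on a hypothetical file name:

```python
import pytesseract
from pytesseract import Output

image = preprocess("scanned_page.png")  # hypothetical file name

# Full text as a single string
text = pytesseract.image_to_string(image)

# Word-level data: bounding boxes, confidence, and block/paragraph/line numbers
ocr_df = pytesseract.image_to_data(image, output_type=Output.DATAFRAME)

# Tesseract marks non-word rows with a confidence of -1; drop them so that
# each remaining row is exactly one word
ocr_df = ocr_df[ocr_df["conf"] != -1].reset_index(drop=True)
# pandas may parse purely numeric words as numbers; keep everything as text
ocr_df["text"] = ocr_df["text"].astype(str)
```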
With the OCR’s output in this format, we can feed it to SpaCy, our chosen NLP framework, in a smart way. When SpaCy runs a text through a language model, it outputs a spacy.Doc object that contains all the input and output data generated by the model. Done naively, though, we have no guarantee that SpaCy will tokenize our text the same way as in the DataFrame. The catch is that we can feed a list of tokens to a Doc object when initializing it, and SpaCy will take this list as the tokenization. By using the DataFrame’s text column as that list, the DataFrame’s and the Doc’s indexes stay perfectly aligned, discarding the need for complex alignment systems between the two or for custom attributes inside the Doc object. If this pre-built Doc object is fed through a language model, it is processed like any other text and still generates the outputs we expect. It should be noted that a slight performance reduction might happen at the language model level, since it’s unable to do its own tokenization, but it’s not a big concern.
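A sketch of the idea, assuming a Portuguese pipeline (pt_core_news_lg, since the LGPD example implies Portuguese documents) and the ocr_df DataFrame from the previous step:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("pt_core_news_lg")  # assumption: Portuguese model

# Use the OCR words as the tokenization, so DataFrame rows and Doc tokens
# share the same indexes
words = ocr_df["text"].tolist()
doc = Doc(nlp.vocab, words=words)

# Since spaCy v3, a pre-built Doc can be passed straight through the pipeline
doc = nlp(doc)
```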
Having the Doc object with all its useful info and the DataFrame aligned, we can now easily map every extraction from the NER model back to where it lies on the image. Not only that, we can also change the labels we’re interested in very easily, or add new labels to be anonymized, as long as the language model we’re using supports them. Below we have used this pipeline to automatically blur out everything the model caught as an entity:
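A sketch of this mapping step, assuming OpenCV for the blurring and that the OCR ran on an image with the same dimensions as the one we draw on:

```python
import cv2

image = cv2.imread("scanned_page.png")  # hypothetical file name

for ent in doc.ents:
    # ent.start and ent.end are token indexes, which match the DataFrame rows
    for word in ocr_df.iloc[ent.start:ent.end].itertuples():
        x, y = int(word.left), int(word.top)
        w, h = int(word.width), int(word.height)
        # Blur this word's bounding box in place (kernel size must be odd)
        image[y:y + h, x:x + w] = cv2.GaussianBlur(
            image[y:y + h, x:x + w], (51, 51), 0
        )

cv2.imwrite("anonymized_page.png", image)
```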
Both inside and outside SpaCy’s framework, we can change how we process the extracted text to tailor it to our needs. For example, we could easily filter the entities we’re interested in, as in the example below, where only the PERSON entities are blurred out. Within SpaCy, we could also use tools like the EntityRuler to build a more robust and complex extraction system on top of the model. And both inside and outside of SpaCy, we could implement RegEx rules as a powerful and safe way of extracting and/or generating valuable data for anchoring, smart search, and many other uses.
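For instance, the blurring loop above could keep only person entities (labeled “PER” in SpaCy’s Portuguese models), and a hypothetical EntityRuler pattern could catch Brazilian CPF numbers that the statistical model misses:

```python
# Keep only person entities; this list would replace doc.ents in the blur loop
person_ents = [ent for ent in doc.ents if ent.label_ in ("PER", "PERSON")]

# Assumption: an illustrative RegEx pattern for CPF numbers (e.g. 123.456.789-09)
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "CPF",
     "pattern": [{"TEXT": {"REGEX": r"^\d{3}\.\d{3}\.\d{3}-\d{2}$"}}]},
])
```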
We have many ways to combine the spatial information from PyTesseract’s extraction with NLP beyond just extracting data. The spatial information we have, such as paragraph and line numbers as well as text-block locations, can help us set up a system that distinguishes headers, titles, and forms in a smart way, as sketched below. This can be used to speed up a searching system or enable it to filter out parts of the image that it shouldn’t pay attention to. These extractions can also be used to create relational databases and build a graph of files that relate to a single person, organization, or anything else of interest. Using all the extracted info as inputs, we can even classify the files inside this relational database to enrich it!
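As one small sketch of that layout reconstruction, the block, paragraph, and line numbers in the OCR DataFrame are enough to rebuild line-level text, and word height is a simple (assumed) heuristic for spotting titles:

```python
# Rebuild each visual line from the word-level OCR output
lines = (
    ocr_df.groupby(["block_num", "par_num", "line_num"])["text"]
    .apply(" ".join)
    .reset_index()
)

# Assumed heuristic: lines whose words are much taller than the page average
# are likely titles or headers
line_heights = ocr_df.groupby(["block_num", "par_num", "line_num"])["height"].mean()
headers = lines[line_heights.values > 1.5 * ocr_df["height"].mean()]
```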
To sum it up, by combining spatial information with all the tools NLP models provide us, we can do information retrieval in a very insightful and powerful way. Not only can we automate tasks, but we can also use the byproducts of this automation to generate even richer data and structure our pieces of information directly into relational databases!