Some PDF documents have a lot of pages, and processing those files is correspondingly more costly. Fortunately, for long documents we can take advantage of the fact that indexing even a few pages is likely to make the document much more accessible from searches. Once we had decided on the file types and developed an estimate of how much OCR-able content lived on each page, we wanted to be strategic about how we approached each file.

Overall, our target users have roughly twice as many JPEGs as PDFs, but each PDF averages 8.8 pages, and PDFs are much more likely to contain text images, so in terms of overall load on our system, PDFs would contribute over 10x as much as JPEGs! It turns out, however, that we could reduce this number significantly through a simple analysis: the distribution of pages across the three buckets is 69%, 28%, and 3%, respectively. We can therefore skip pages in categories 1 and 3 and focus only on category 2, since this is where we can provide a benefit.
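The back-of-envelope load comparison above can be sketched as follows. Only the ratios stated in the text (roughly 2x as many JPEGs as PDFs, 8.8 pages per PDF, and the 69/28/3 page split) come from our analysis; the absolute file counts and the per-page text-image likelihoods below are hypothetical placeholders chosen for illustration.

```python
# Back-of-envelope OCR load estimate per user.
jpegs_per_user = 200                  # hypothetical absolute count
pdfs_per_user = jpegs_per_user / 2    # "roughly twice as many JPEGs as PDFs"
pages_per_pdf = 8.8                   # average from our data

# Hypothetical per-page probability that a page contains OCR-able text;
# the text only says PDFs have a "much higher likelihood".
p_text_jpeg = 0.3
p_text_pdf = 0.7

jpeg_load = jpegs_per_user * p_text_jpeg
pdf_load = pdfs_per_user * pages_per_pdf * p_text_pdf
print(f"PDF/JPEG load ratio: {pdf_load / jpeg_load:.1f}x")

# Page distribution across the three buckets; we process only bucket 2.
bucket_share = {1: 0.69, 2: 0.28, 3: 0.03}
filtered_pdf_load = pdf_load * bucket_share[2]
print(f"PDF load after skipping buckets 1 and 3: "
      f"{filtered_pdf_load / pdf_load:.0%} of original")
```

With these illustrative numbers the PDF-to-JPEG load ratio comes out just above 10x, and restricting OCR to bucket-2 pages cuts the PDF load to 28% of its original value.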