Some PDF documents have a lot of pages, and processing those files is correspondingly more costly. Fortunately, for long documents we can take advantage of the fact that indexing even a few pages is likely to make the document much more accessible from searches. Once we had decided on the file types and developed an estimate of how much OCR-able content lived on each page, we wanted to be strategic about how we approached each file.

Overall, our target users have roughly twice as many JPEGs as PDFs, but each PDF averages 8.8 pages, and PDFs are much more likely to contain text images, so in terms of overall load on our system, PDFs would contribute over 10x as much as JPEGs! It turns out, however, that we could reduce this number significantly through a simple analysis: the distribution of pages across the three buckets is 69%, 28%, and 3%, respectively. We can therefore skip pages in categories 1 and 3 and focus only on category 2, since this is where we can provide a benefit.
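The back-of-envelope load comparison above can be sketched as follows. Only the ratios stated in the text (roughly 2x as many JPEGs as PDFs, 8.8 pages per PDF, and the 69/28/3 page split) come from our analysis; the absolute file counts and the per-page text-image likelihoods below are hypothetical placeholders chosen for illustration.

```python
# Back-of-envelope OCR load estimate per user.
jpegs_per_user = 200                  # hypothetical absolute count
pdfs_per_user = jpegs_per_user / 2    # "roughly twice as many JPEGs as PDFs"
pages_per_pdf = 8.8                   # average from our data

# Hypothetical per-page probability that a page contains OCR-able text;
# the text only says PDFs have a "much higher likelihood".
p_text_jpeg = 0.3
p_text_pdf = 0.7

jpeg_load = jpegs_per_user * p_text_jpeg
pdf_load = pdfs_per_user * pages_per_pdf * p_text_pdf
print(f"PDF/JPEG load ratio: {pdf_load / jpeg_load:.1f}x")

# Page distribution across the three buckets; we process only bucket 2.
bucket_share = {1: 0.69, 2: 0.28, 3: 0.03}
filtered_pdf_load = pdf_load * bucket_share[2]
print(f"PDF load after skipping buckets 1 and 3: "
      f"{filtered_pdf_load / pdf_load:.0%} of original")
```

With these illustrative numbers the PDF-to-JPEG load ratio comes out just above 10x, and restricting OCR to bucket-2 pages cuts the PDF load to 28% of its original value.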