Optical Character Recognition
Uses, Advantages and Disadvantages
Challenges in the digital era are continuously being overcome with new technological procedures and techniques.
One such application is Optical Character Recognition (OCR), which has improved over the years to become an efficient mechanism for converting printed or written text into machine-readable input. Although the process is not always 100% accurate, many developments have aimed at total precision.
There are numerous applications for Optical Character Recognition and Optical Mark Reading (OMR). The latter is related to the former but concentrates on detecting marks, such as ticked boxes on forms, rather than on reading characters. Since it overlaps with OCR in many ways, for the purpose of this essay both methods are combined in the acronym OCR.
The argument herein is that although OCR is efficient, proofreading by humans is still necessary. Artificial intelligence has not yet overtaken the cognitive powers of the human being, who can decipher characters and images in books, manuscripts, and other types of media. OCR normally works well with clean, clear and easy-to-read text. Tables, for example, are often difficult to process with OCR.
Applications of OCR/OMR
- Data Entry
- Text Entry
- Process Automation
- Assistive Technology for the Visually Impaired
- Automatic Number-Plate Readers
- Automatic Cartography
- Form Readers
- Signature Verification and Identification
- Handwriting Recognition
- Receipt Imaging
- Legal Documentation
- Banking Applications
- Captcha Technology
Factors of OCR
Quality control is paramount. Native speakers with third-level education are best at proofreading scanned documents. Tables, as mentioned, are troublesome and will often need human intervention to decipher the data. Images can be scanned using OCR software, but this is not its proper purpose, and image editing software such as GIMP or Photoshop is preferable. Specialised material containing formulae, special characters, and diacritical marks is also problematic, and manual retyping will sometimes be the only option. Questionnaires can be read by OCR, but usually only if tick-box answers are used, since handwriting on surveys may be difficult to decipher.
The main advantage of OCR is that it converts printed pages into machine-readable text, which can then be read aloud by screen readers using speech synthesisers designed for visually impaired readers (Woodford, 2014). In the mid-1970s, Raymond Kurzweil developed the Kurzweil Reading Machine (KRM), which uses a flatbed scanner and a speech synthesiser to read printed pages out loud to blind people.
Research in OCR
Numerous surveys and studies of Optical Character Recognition have been conducted. For example, Hamad and Kaya (2016) set out the challenges for OCR quite distinctly by discussing problems such as lighting, rotation, and aspect ratios, amongst others. They clearly demonstrate the different phases involved in pre-processing documents for scanning. Although not alone in discussing ‘neural networks’, they define the term succinctly as “an ascertaining architecture that includes enormously parallel interconnection of flexible node processors” (Hamad and Kaya, 2016). Naturally, they are discussing the science behind OCR.
Bruno Menard discusses the “OCR component of an identification or inspection system being an important element in boosting overall quality” (Menard, 2008). He defines four evaluation criteria for OCR: flexibility, reliability, ease of use, and efficiency. He concludes that “an efficient tool will provide a steady and deterministic identification rate, and consequently increased productivity” (Menard, 2008).
An ‘Encompassing Review’ by Pai and Kolkure (2015) is another journal article, which gives a short summary of the conditions that need to be met for optimal OCR. They conclude that “each technique has its own uniqueness and level of accuracy, but still some modifications have to be done for characters of different size and fonts” (Pai and Kolkure, 2015). Gupta et al. (2014), by contrast, discuss OpenCV (Open Source Computer Vision Library), an open-source set of libraries originally developed by Intel. They emphasise three performance measures: (a) recognition rate; (b) rejection rate; and (c) error rate (Gupta et al., 2014).
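The three performance measures can be illustrated with a short calculation. The following is a minimal sketch, not taken from Gupta et al. (2014): it assumes that each rate is the share of all characters presented to the engine that were read correctly, rejected as unreadable, or read wrongly. The function name and the sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrRates:
    recognition: float  # share of characters read correctly
    rejection: float    # share the engine declined to classify
    error: float        # share classified wrongly


def ocr_rates(correct: int, rejected: int, wrong: int) -> OcrRates:
    """Return each outcome as a fraction of all characters presented."""
    total = correct + rejected + wrong
    if total == 0:
        raise ValueError("at least one character is required")
    return OcrRates(correct / total, rejected / total, wrong / total)


# Hypothetical counts for a 1,000-character test page.
rates = ocr_rates(correct=940, rejected=25, wrong=35)
print(f"recognition {rates.recognition:.1%}, "
      f"rejection {rates.rejection:.1%}, error {rates.error:.1%}")
```

Under this assumption the three rates always sum to one, so a system can lower its error rate by rejecting more doubtful characters, at the cost of more manual work.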
Another open-source OCR engine is Tesseract, originally developed by Hewlett-Packard and released as open source in 2005. Its main advantage is that it can handle both black-on-white and white-on-black text. Mithe, Indalkar, and Divekar (2013) discuss this, as well as text-to-speech conversion. Their article includes some clear flowcharts and diagrams to explain the process behind OCR (Mithe et al., 2013).
Pandey et al. (2017) provide a similar review of OCR, although they concentrate more on mobile applications. Unfortunately, they leave much of this area uncovered. Islam et al. (2016) surveyed the latest research in the field of OCR. They also mention the difficulties of recognising characters in languages such as Arabic, Sindhi and Urdu. A multi-lingual character recognition system needs further research (Islam et al., 2016).
Pansare and Joshi (2014) provide another overview of character recognition, stating that “selection of relevant technique plays an important role in performance of character recognition rate” (Pansare and Joshi, 2014). As do other researchers, they recognise that OCR is not totally accurate, and therefore it can be surmised that a system of 100% accuracy has yet to be developed.
The print publishing industry uses optical character recognition for archiving news stories and research material. Libraries also need to digitise books, magazines and journals. An important digital tool for OCR has been ABBYY FineReader. Another is OCRopus, an open-source system whose development was sponsored by Google. However, these still cannot achieve what the human eye is able to, because intricacies in print cannot always be recognised by OCR software, as already mentioned. Techniques such as ‘de-skewing’ and ‘de-speckling’ are used in the pre-processing of newspaper articles. Imagine the difficulty of scanning a newspaper page with columns, tables and images, as well as recognising a wide variety of fonts.
With newspaper scanning in particular, factors such as the quality of the original are crucial for accurate OCR. Contrast between black and white is important, although many newspapers now print in colour. Pages need to be scanned perfectly horizontally to avoid skewing. Training personnel to use the OCR engine effectively is time-consuming and costly, and it is often difficult to find suitably qualified people.
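To make the idea of ‘de-speckling’ concrete, here is a minimal sketch under stated assumptions. It treats the page as a grid of 0 (paper) and 1 (ink) and removes any ink pixel that has no ink among its eight neighbours. This single-pixel rule and the function name are illustrative assumptions; production OCR tools generally use more elaborate filters.

```python
def despeckle(image: list[list[int]]) -> list[list[int]]:
    """Clear every ink pixel (1) that has no ink pixel among its 8 neighbours."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # work on a copy, read from the original
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical 4 x 6 scan: one isolated speck and one 2 x 2 block of ink.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
]
for row in despeckle(page):
    print("".join("#" if pixel else "." for pixel in row))
```

In this example the lone pixel in the second row is cleared while the block of ink is kept. A rule this simple would also delete genuine single-pixel marks such as small full stops in a low-resolution scan, which is one reason human checking remains necessary.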
Rose Holley, in her 2009 article on the Australian Newspaper Digitisation Program, provided a comprehensive checklist for digitising and archiving newspaper articles. She suggests over a dozen techniques for improving OCR accuracy. She notes that “manual text correction has always been an effective method of increasing accuracy, but on a large scale it is not viable or cost effective” (Holley, 2009).
A noteworthy milestone in the history of OCR has been the IMPACT Project (Improving Access to Text) in 2009. This was an EU-funded project involving several libraries and research institutions and a couple of private-sector companies. Hildelies Balk and Lieke Ploeger (2009) listed several factors to take into consideration when digitising material:
- Bad paper quality and warping;
- Distortion due to the binding;
- Bleed through;
- Shining through;
- Poor inking and bad printing;
- Gothic print types;
- Obsolete characters such as the long ‘s’;
- Small text particularly for newspapers or large images;
- Irregular spacing between columns, letters or lack of headlines;
- Annotations by users.
The main aim of IMPACT has been to digitise historical documentation throughout Europe, and this may take some time. Not everything can be archived in just a few weeks; it will take years, even decades.
The Centre for e-Research at King’s College London has also been a leading proponent of open-source OCR in the area of historical research. Blanke et al. (2012) produced a comprehensive article in the Journal of Documentation outlining digitisation for humanities research. They state that “historical documents, with their immanent constraints of low quality and unusual character sets, remain a challenge to OCR” (Blanke et al., 2012). They stress that commercial OCR software may be limited in its capacity to deal with historical archives and that open-source is the best way forward. They state that “the best results will be achieved by combining the flexibility of open-source tools with the mature character models of commercial products” (Blanke et al., 2012).
They developed an application called OCRopodium Web Processing (OWP), which provides an abstraction layer above regular OCR tools while also offering a user-friendly interface. Blanke et al. (2012) discuss the OCR workflow in four principal stages: pre-processing, layout analysis, recognition, and post-processing and correction. An objective of their OCR application was to provide tools for the complete process “from scanned images to ingestion in a digital repository” (Blanke et al., 2012).
A common theme across research into optical character recognition and optical mark reading is that manual correction of OCR output is the most labour-intensive of all the processes involved. Even when manual correction is unavoidable, it is difficult for a touch-typist to work with the output file from OCR. In addition, the transcriber needs to be familiar with the language of the original text, which could be a historical one such as Latin or Old English (Anglo-Saxon).
Blanke et al. (2012) discussed two important case studies. One was ‘The Stormont Papers’, which comprised over 50 years of parliamentary debates of the devolved government of Northern Ireland, amounting to over 100,000 printed pages. It involved an OWP workflow that incorporated binarization nodes for cleaning up the original documents. A final step used a Tesseract recognition node to perform the text recognition.
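Binarization, the clean-up step mentioned above, reduces a greyscale scan to ink and paper. The following is a minimal sketch only: it assumes a single global threshold set at the mean grey level, which is a deliberately simple stand-in and not the method used in the OWP workflow. The function name and the sample pixel values are hypothetical.

```python
def binarize(gray, threshold=None):
    """Map greyscale pixels (0 = black, 255 = white) to 1 (ink) or 0 (paper)."""
    pixels = [value for row in gray for value in row]
    if not pixels:
        return []
    if threshold is None:
        # Assumption: use the mean grey level as a global threshold.
        threshold = sum(pixels) / len(pixels)
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Hypothetical 3 x 4 scan: bright paper with a few dark strokes.
scan = [
    [230, 225, 40, 228],
    [232, 35, 38, 226],
    [229, 231, 42, 90],
]
for row in binarize(scan):
    print(row)
```

A global threshold works when the paper is evenly lit. Stained or unevenly faded historical pages usually need a threshold that adapts to each region of the page, which is one reason a workflow may offer several binarization options.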
The second case study was the ‘European Holocaust Research Infrastructure’ (EHRI), which involved dispersed archives regarding the Holocaust of the Second World War. Twenty partner organisations in 13 countries combined relevant material including film and art, objects, photos, documents, and personal memoirs. The final results using Tesseract OCR engines provided 90% accuracy, which is to be considered quite an achievement in itself.
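A figure such as 90% accuracy depends on how accuracy is measured. The source does not state the measure used, so the following sketch assumes one common convention: character accuracy, computed as one minus the edit (Levenshtein) distance between the OCR output and a hand-checked reference, divided by the length of the reference. The function names and the sample words are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of the reference text recovered, floored at zero."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# One wrong character in a ten-character word gives 90% character accuracy.
print(f"{character_accuracy('parliament', 'parlianent'):.0%}")
```

On this measure, 90% accuracy still means roughly one character in ten is wrong, so most words of average length would contain an error somewhere. Word-level accuracy is therefore usually lower than character-level accuracy for the same page.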
Blanke et al. (2012) conclude that combining open-source and commercial OCR is the way forward, subject to limitations on cost and quality of output. They state, “highly variable material requires a wide range of approaches: character recognition tools vary greatly in their effectiveness depending on the input material and a one-size-fits-all approach will usually be unsuitable in the context of a typical historical archive” (Blanke et al., 2012). This effectively summarises OCR in its present form.
A recent study of OCR involved several contributors from the IMPACT Project. Christy et al. (2017) researched special collections from libraries around the world dating from 1476 to 1800. More than 300,000 documents were digitised for the relevant libraries. Approximately 45 million pages were originally imaged in the late 1970s, transferred onto microfilm in the 1980s, and then digitised in the 1990s. After such a chain of transformations, OCR was bound to be a herculean task. The quality of many of the historical documents was poor, even for a human reader, and was therefore a challenge for OCR engines.
A challenge for humanities research was that, with such low quality, it was necessary to preserve the documents not just as images but as keyed text; otherwise they would become part of a ‘dark archive’ (Christy et al., 2017). The European Commission funded the IMPACT Project for four years to ensure that OCR standards were capable of digitising cultural heritage materials. Without such work, Europe is said to be in danger of ‘entering a new Dark Age’ (Christy et al., 2017). Their comprehensive article describes in detail the processes involved, including ‘de-noising’ and corrections such as recognising that a letter read as ‘f’ is actually an ‘s’ (e.g. ‘feason’ instead of ‘season’). Christy et al. (2017) mention the importance of a ‘human-in-the-loop’, a form of the ‘distributed proofreading’ used in Project Gutenberg. As already discussed, OCR is still not 100% effective, since artificial intelligence alone is not enough and human intervention is still needed.
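The ‘feason’ example can be sketched as a simple dictionary check. This is an illustrative assumption, not the procedure described by Christy et al. (2017): a word that is not in a word list is retried with some of its ‘f’ characters swapped for ‘s’, and the first variant found in the list is accepted. The function name and the tiny word list are hypothetical, and the sketch handles lower-case words only.

```python
from itertools import product


def correct_long_s(word: str, lexicon: set[str]) -> str:
    """Return a lexicon word reachable by swapping some 'f' for 's', else the word unchanged."""
    if word in lexicon:
        return word  # already a known word, e.g. 'feast'
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    # Try every combination of swapped and unswapped 'f' characters.
    for choice in product((False, True), repeat=len(positions)):
        candidate = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                candidate[pos] = "s"
        candidate_word = "".join(candidate)
        if candidate_word in lexicon:
            return candidate_word
    return word  # no known variant: leave it for a human proofreader


# Hypothetical word list; a real system would use a period dictionary.
lexicon = {"season", "feast", "first", "possession"}
for token in ["feason", "feast", "firft", "poffeffion", "fome"]:
    print(token, "->", correct_long_s(token, lexicon))
```

Words with no match in the word list are returned unchanged, which is where the ‘human-in-the-loop’ comes in: the software can propose corrections, but unfamiliar or ambiguous words still need a person to decide.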
Optical character recognition and optical mark reading are extremely useful tools for digitising a wide range of printed texts and transforming them into machine-readable media. OCR has many varied uses, but this essay has concentrated on the area of historical archiving. Whether the accuracy of OCR will improve in the future is unknown, since it does not always maintain full levels of accuracy, and it is uncertain whether artificial intelligence will ever be able to match the detail and precision of the human eye.
References
Balk, Hildelies and Ploeger, Lieke, “IMPACT: working together to address the challenges involving mass digitization of historical printed text”, OCLC Systems & Services: International digital library perspectives, Vol. 25 Issue: 4, pp.233-248, 2009 https://doi.org/10.1108/10650750911001824
Blanke, Tobias, Bryant, Michael, Hedges, Mark, “Open source optical character recognition for historical research”, Journal of Documentation, Vol. 68 Issue: 5, pp.659-683, 2012 https://doi.org/10.1108/00220411211256021
Christy, Matthew, et al. “Mass Digitization of Early Modern Texts with Optical Character Recognition”, Journal on Computing and Cultural Heritage (JOCCH), Vol. 11, No. 1, pp. 1-25, 2017. http://delivery.acm.org.ucc.idm.oclc.org/10.1145/3080000/3075645/a6-christy.pdf
Gupta, T., Ahuja, C., Aich, S. “Optical Character Recognition”, International Journal of Emerging Technology and Advanced Engineering, Vol. 4, Issue 9, September 2014. http://www.ijetae.com/files/Volume4Issue9/IJETAE_0914_58.pdf
Hamad, Karez and Kaya, Mehmet. “A Detailed Analysis of Optical Character Recognition Technology”. International Journal of Applied Mathematics, Electronics and Computers, Vol. 4, p. 244, 2016. doi:10.18100/ijamec.270374. https://www.researchgate.net/publication/311851325_A_Detailed_Analysis_of_Optical_Character_Recognition_Technology
Holley, Rose, “How Good Can It Get, Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs”. D-Lib Magazine, March/April 2009. http://www.dlib.org/dlib/march09/holley/03holley.html
Islam, Noman, Islam, Zeeshan and Noor, Nazia. “A Survey on Optical Character Recognition System”. ITB Journal of Information and Communication Technology, 2016. https://www.researchgate.net/journal/1978-3086_ITB_Journal_of_Information_and_Communication_Technology
Menard, Bruno. “Optical character recognition: the OCR component of an identification or inspection system is an important element in boosting overall quality.” Quality, May 2008, p. S18+. Academic OneFile, https://link.galegroup.com/apps/doc/A179534991/AONE?u=googlescholar&sid=AONE&xid=3bce000f.
Mithe, R., Indalkar, S., Divekar, N. “Optical Character Recognition”. International Journal of Recent Technology and Engineering, Vol. 2, Issue 1, March 2013. http://www.ijrte.org/download/volume-2-issue-1/
Pai, N. and Kolkure, V. “Optical Character Recognition: An Encompassing Review”. International Journal of Research in Engineering and Technology, Vol. 04, January 2015. https://ijret.org/volumes/2015v04/i01/IJRET20150401062.pdf
Pandey, A., Sharma, V., Paanchbhai, S., Hedaoo, N., Zade, S.D. “Optical Character Recognition”. International Journal of Engineering and Management Research, Vol. 7, Issue 2, March-April 2017. ISSN 2250-0758. http://www.ijemr.net/DOC/OpticalCharacterRecognitionOCR.pdf
Pansare, S. and Joshi, D. “A Survey on Optical Character Recognition Techniques”. International Journal of Science and Research. ISSN 2319-7064. Vol. 3, Issue 12, December 2014. https://pdfs.semanticscholar.org/e050/88deeb0cb24a407edf1eff938b633ca83781.pdf
Woodford, Chris. “OCR (optical character recognition)”. https://www.explainthatstuff.com/how-ocr-works.html 2010/2014.