OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
OCRopus is a free document analysis and optical character recognition system released under the Apache License, Version 2.0. Its design is highly modular: components are implemented as plugins, so they can be swapped out easily.
OCRopus development is currently led by Thomas Breuel of the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and is sponsored by Google.
OCRopus is developed for Linux; however, users have reported success running it on Mac OS X. An application called TakOCR[1] installs OCRopus on Mac OS X and provides a simple droplet interface.
Releases
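# Create a working directory for the source checkouts
mkdir -p ~/build
cd ~/build
# Fetch, build, and install iulib (required by OCRopus)
hg clone https://iulib.googlecode.com/hg/ iulib
cd iulib
hg update -r ocropus-0.4.3
scons
sudo scons install
# Fetch, build, and install OCRopus itself
cd ~/build
hg clone https://ocropus.googlecode.com/hg/ ocropus
cd ocropus
hg update -r ocropus-0.4.3
scons
sudo scons install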
These steps should work on Ubuntu 9.04 if all the necessary packages are installed. If the build fails, have a look at the DevInstall page or the Google Group Pages.
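Before running the build steps above, it may help to confirm that the tools they call are actually on your PATH. The following is only a minimal sketch: the helper name `missing_tools` is hypothetical, `hg` and `scons` are taken from the commands above, and `g++` is an assumption (a C++ compiler is needed, but the exact one is not stated here). It does not check for the library packages the build depends on.

```shell
#!/bin/sh
# Pre-flight check: report which build tools are missing from PATH.
# hg and scons come from the build steps; g++ is an ASSUMPTION.

# Hypothetical helper: print the name of every argument that is
# not found as a command on PATH, one name per line.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

missing=$(missing_tools hg scons g++)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names join onto one line.
    echo "Missing tools:" $missing >&2
    echo "Install them before running the build steps." >&2
else
    echo "All required tools found."
fi
```

If the script reports missing tools, install them with your distribution's package manager and run it again before starting the build.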
Resources
- OCRopus Mailing List (subscribe / contribute)
- OCRopus Group Pages (add your contributions here)
- User-contributed links and resources (add links here)
- iulib Library (you need to install this)
- hOCR Tools -- tools for manipulating OCR output
- DECAPOD -- camera-based document capture and tagged PDF generation
- PyOpenFST -- Python bindings for OpenFST (for language modeling)
The following is the most important documentation:
- Release Notes -- summary information about releases
- Development Install -- how to install the development version of OCRopus
- Using -- some information about how to use OCRopus
- Training -- how to train OCRopus
- Publications -- information about algorithms
- C++ Programming -- extending OCRopus in C++
- C++ Coding Conventions -- memory management, pointers, naming, formatting
- File Formats -- file formats used by OCRopus
- Book-Level Representation -- directory layout for whole book recognition
- hOCR Output Format -- (X)HTML-compatible OCR output format