[humanities-dev] OCRing text
pedro at esfera.mobi
Sun Feb 19 22:11:50 GMT 2012
I've tried OCRopus a bit, and at least for Brazilian Portuguese I would
rather stick with Tesseract.
The first results are quite nice if you have a good-resolution picture in
the right TIFF format. For specific sets of documents I was hoping to
build a web app which makes it easier to build the training sets: it
auto-generates the text with the correct spacing and font, so people can
print it at home, scan it and upload it back through a web interface.
About mobile, the idea behind the 3D-printed book scanner is exactly that:
creating a lightweight system which can be assembled quickly (the first
sketches look like a War of the Worlds tripod with two LED lamps to
illuminate the text) and can be carried around.
After the images are captured, they will be fed through a script
which will convert them to the proper 2bit TIFF format, (ideally) adjust
brightness and contrast, and then OCR them.
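
A minimal sketch of what such a script could look like, assuming the
ImageMagick `convert` and `tesseract` command-line tools are installed and
that "2bit" means a black-and-white (bilevel) TIFF. The function names, the
default brightness/contrast/threshold values and the file naming are
illustrative assumptions, not part of any existing tool.

```python
import subprocess
from pathlib import Path


def build_commands(image, lang="por", brightness=0, contrast=20, threshold=50):
    """Return the (convert, tesseract) command lines for one captured page.

    All default values are illustrative guesses and need tuning per scanner.
    """
    src = Path(image)
    tiff = src.with_suffix(".tif")   # cleaned-up bilevel image
    base = src.with_suffix("")       # tesseract appends ".txt" to this base
    convert_cmd = [
        "convert", str(src),
        "-colorspace", "Gray",                                # drop colour
        "-brightness-contrast", f"{brightness}x{contrast}",   # adjust levels
        "-threshold", f"{threshold}%",                        # black and white
        "-compress", "Group4",                                # bilevel TIFF
        str(tiff),
    ]
    ocr_cmd = ["tesseract", str(tiff), str(base), "-l", lang]
    return convert_cmd, ocr_cmd


def process(image, **options):
    """Run both steps for one page and return the path of the OCR text file."""
    for cmd in build_commands(image, **options):
        subprocess.run(cmd, check=True)  # raises if a tool fails or is missing
    return Path(image).with_suffix(".txt")
```

Calling `process("page001.jpg")` would write `page001.tif` and
`page001.txt` next to the original capture. Building the command lines
separately from running them makes it easy to check the settings before
processing a whole batch.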
Then it will expose both the scan and the OCR text online, so people can
improve the text. Ideally using some sort of overlap layer (at least for
On Sun, Feb 19, 2012 at 10:29 AM, iain emsley <iain_emsley at austgate.co.uk> wrote:
> I've stayed away from ocropus so far because the build process just
> seems unnecessarily tortuous. Time to dive in!
> My sense is that it uses Tesseract as underlying engine so it copes with
> some of the language issues. This version appears to be under some heavy
> development to make it more Python based and less reliant on C++ so
> perhaps this will make it easier in future releases.
> I'll probably dive into it soon enough and give it a go.
> On Sat, 2012-02-18 at 14:38 -0800, todd.d.robbins at gmail.com wrote:
> > What's the general sense of tesseract vs. ocropus? Which is better?
> > I've been trying to get ocropus to play nice with OS X and it's not
> > pretty.
> > Tod
> > _______________________________________________
> > humanities-dev mailing list
> > humanities-dev at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/humanities-dev