[humanities-dev] OCRing text
rufus.pollock at okfn.org
Mon Feb 27 14:33:45 GMT 2012
Just wanted to jump in here as OCR'ing stuff is something I (and
others at the OKF) have long been interested in e.g.
I'm wondering if there is a need (and interest) in building a simple
Integrated with PyBossa / TEXTUS it could provide a nice scan / ocr /
On 19 February 2012 22:11, Pedro Markun <pedro at esfera.mobi> wrote:
> I've tried OCRopus a bit and at least for Brazilian Portuguese I would
> rather stick with tesseract.
That's a nice piece of info as I wondered whether OCRopus was useful.
I last used tesseract a few years ago and it looks like it has come on
quite a bit.
> The first results are quite nice if you have a good-resolution picture in the
> right TIFF format. For specific sets of documents I was hoping to build a
> web app which makes it easier to build the training sets: it auto-generates the
> text with the correct spacing and font so people can print it at home, scan it
> and upload it back through a web interface.
> About mobile, the idea behind the 3D-printed book scanner is exactly that:
> creating a lightweight system which can be assembled quickly (the first
> sketches look like a War of the Worlds tripod with two LED lamps to
> illuminate the text) and can be carried around.
> After the images are captured, they will be passed through a script which
> will convert them to the proper 2-bit TIFF format, (ideally) adjust brightness
> and contrast and then OCR them.
> Then it will expose online both the scan and the OCR text, so people can
> improve the text, ideally using some sort of overlay layer (at least for
> proper positioning).
> Pedro Markun
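For what it's worth, the capture-to-OCR script described above could be
sketched roughly as below. This is only a sketch under assumptions: it assumes
ImageMagick's `convert` and the `tesseract` command-line tool are both on the
PATH, the 50% threshold is a placeholder to tune per scan, and the function
names (`preprocess_command`, `ocr_command`, `process_page`) are made up for
illustration. It produces a plain black-and-white TIFF, which may not be
exactly the format Pedro has in mind.

```python
import subprocess
from pathlib import Path


def preprocess_command(src, dst):
    """Build an ImageMagick command that converts a captured image to
    grayscale, stretches its contrast and thresholds it to black and white."""
    return [
        "convert", str(src),
        "-colorspace", "Gray",  # drop colour information
        "-normalize",           # stretch contrast across the full range
        "-threshold", "50%",    # assumed cut-off; tune for each scanner setup
        str(dst),
    ]


def ocr_command(image, output_base, lang="por"):
    """Build a Tesseract command. Tesseract appends ".txt" to output_base."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def process_page(src, workdir, lang="por"):
    """Run both steps for one captured image; return the expected text path."""
    src, workdir = Path(src), Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    tiff = workdir / (src.stem + ".tif")
    base = workdir / src.stem
    # check=True raises CalledProcessError if either tool exits with an error
    subprocess.run(preprocess_command(src, tiff), check=True)
    subprocess.run(ocr_command(tiff, base, lang), check=True)
    return workdir / (src.stem + ".txt")
```

Calling `process_page("page001.jpg", "out")` would then leave `out/page001.tif`
and `out/page001.txt` side by side, which is the scan/text pair the web
interface would need to expose.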
> On Sun, Feb 19, 2012 at 10:29 AM, iain emsley <iain_emsley at austgate.co.uk>
>> I've stayed away from ocropus so far because the build process just
>> seems unnecessarily tortuous. Time to dive in!
>> My sense is that it uses Tesseract as underlying engine so it copes with
>> some of the language issues. This version appears to be under some heavy
>> development to make it more Python based and less reliant on C++ so
>> perhaps this will make it easier in future releases.
>> I'll probably dive into it soon enough and give it a go.
>> On Sat, 2012-02-18 at 14:38 -0800, todd.d.robbins at gmail.com wrote:
>> > What's the general sense of tesseract vs. ocropus? Which is better?
>> > I've been trying to get ocropus to play nice with OS X and it's not
>> > pretty.
>> > Tod
>> > _______________________________________________
>> > humanities-dev mailing list
>> > humanities-dev at lists.okfn.org
>> > http://lists.okfn.org/mailman/listinfo/humanities-dev
Co-Founder, Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/