From: "Charlie Gibbs" <cgibbs@kltpzyxm.invalid>
Newsgroups: alt.folklore.computers
Subject: Scanning old manuals
Date: 10 Nov 04 10:39:27 -0800
Organization: http://newsguy.com
Lines: 14
Message-ID: <668.810T1381T6394818@kltpzyxm.invalid>
NNTP-Posting-Host: p-472.newsdawg.com
X-Newsreader: THOR 2.5a (Amiga;TCP/IP)
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsmi-eu.news.garr.it!newsmi-us.news.garr.it!NewsITBone-GARR!hammer.uoregon.edu!logbridge.uoregon.edu!pln-w!spln!rex!extra.newsguy.com!newsp.newsguy.com!news2
Xref: nightfall.franklin.ch alt.folklore.computers:189767

Like many of us, I'm sure, I have a collection of old computer
manuals that's taking up a lot of shelf space.  In the name of
home renovations (and marital bliss) I would be willing to let
go of some of the newer (i.e. later than 1980) ones if I could
scan them.  Like most such manuals, they're 95% text (in a couple
of typefaces and sizes) and line drawings.  I presume the ideal
solution is to use some sort of scanning/OCR software to turn them
into PDF files.  Is there readily available software (preferably
for Linux) to do this?  Have any of you embarked on such a project,
and do you have any words of wisdom to share?

--
cgibbs@surfnaked.ca (Charlie Gibbs)

######

NNTP-Posting-Date: Wed, 10 Nov 2004 16:20:25 -0600
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Reply-To: Anne & Lynn Wheeler <lynn@garlic.com>
References: <668.810T1381T6394818@kltpzyxm.invalid>
From: Anne & Lynn Wheeler <lynn@garlic.com>
Date: Wed, 10 Nov 2004 15:19:59 -0700
Message-ID: <u4qjxibxs.fsf@mail.comcast.net>
Organization: Wheeler&Wheeler
User-Agent: Gnus/5.090024 (Oort Gnus v0.24) Emacs/21.3 (windows-nt)
Cancel-Lock: sha1:dBjIrtsEeyseUJzP/W3OJ08BqJ4=
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Lines: 27
NNTP-Posting-Host: 67.176.12.113
X-Trace: sv3-gg4K3LDybG75fi9i/uoKnLFEaWrSuBxVGGi0SNSOCc3QKVeVYHQj7mmSJLmkY/bZ/ZF51izXtmNzTgW!Pof1zWIyymtM4AW5p+QZQIUgZVemiUJdwawZMnXIMIBQLKPprmfe6gKo
X-Complaints-To: abuse@comcast.net
X-DMCA-Complaints-To: dmca@comcast.net
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.20
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!news.glorb.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local1.nntp.dca.giganews.com!nntp.comcast.com!news.comcast.com.POSTED!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189770


"Charlie Gibbs" <cgibbs@kltpzyxm.invalid> writes:
> Like many of us, I'm sure, I have a collection of old computer
> manuals that's taking up a lot of shelf space.  In the name of home
> renovations (and marital bliss) I would be willing to let go of some
> of the newer (i.e. later than 1980) ones if I could scan them.  Like
> most such manuals, they're 95% text (in a couple of typefaces and
> sizes) and line drawings.  I presume the ideal solution is to use
> some sort of scanning/OCR software to turn them into PDF files.  Is
> there readily available software (preferably for Linux) to do this?
> Have any of you embarked on such a project, and do you have any
> words of wisdom to share?

i was just looking at asking the same question ... all sorts of odds
and ends stuff from the 60s and 70s.

however we just unearthed a bunch of old handwritten letters from the
40s ... that i would also be interested in scanning(?).

when i looked at some of this stuff nearly 10 years ago ... it all
seemed to be scaffolded off fax scanning, tiff format and ocr of
tiff/fax softcopy (current scanners appear to have much higher
resolution as well as color capability compared to the older fax
oriented stuff).

-- 
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/

######

From: "David Wade" <g8mqw@yahoo.com>
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: Wed, 10 Nov 2004 23:47:33 -0000
Lines: 33
Message-ID: <2vfnodF2l23bvU1@uni-berlin.de>
References: <668.810T1381T6394818@kltpzyxm.invalid>
X-Trace: news.uni-berlin.de FaY62ECG61DjNsIsFP9YBAMv1ephLyoPXjpKfL2oCRdZ4EB1Q=
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2800.1437
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!fu-berlin.de!uni-berlin.de!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189774

"Charlie Gibbs" <cgibbs@kltpzyxm.invalid> wrote in message
news:668.810T1381T6394818@kltpzyxm.invalid...
> Like many of us, I'm sure, I have a collection of old computer
> manuals that's taking up a lot of shelf space.  In the name of
> home renovations (and marital bliss) I would be willing to let
> go of some of the newer (i.e. later than 1980) ones if I could
> scan them.  Like most such manuals, they're 95% text (in a couple
> of typefaces and sizes) and line drawings.  I presume the ideal
> solution is to use some sort of scanning/OCR software to turn them
> into PDF files.  Is there readily available software (preferably
> for Linux) to do this?  Have any of you embarked on such a project,
> and do you have any words of wisdom to share?
>

Charlie,
  I have done a little bit of work on this. From speaking to Al at
www.bitsavers.org I discovered that his manuals are basically 400 DPI
MONO TIFF files formatted as PDF. I have tried extracting the images
and OCRing them with ScanSoft OmniPage Pro. I know this is Windows
software, but the Pro version can be "taught" to recognise a pattern
as a letter. I tried it on the CMS Commands manual from bitsavers, and
it seems quite good. I think it took me a couple of hours (i.e. most of
an evening) to get the manual into Word/HTML/PDF.
  Dave.

> --
> cgibbs@surfnaked.ca (Charlie Gibbs)
>

######

Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
References: <668.810T1381T6394818@kltpzyxm.invalid>
From: Bernd Felsche <bernie@innovative.iinet.net.au>
User-Agent: nn/6.6.5
Date: Thu, 11 Nov 2004 09:37:04 +0800
Message-ID: <0bsa62xf8h.ln2@innovative.iinet.net.au>
Lines: 40
NNTP-Posting-Host: 203.59.144.24
X-Trace: 1100137080 per-qv1-newsreader-01.iinet.net.au 6565 203.59.144.24
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!HSNX.atgi.net!newsfeed.iinet.net.au!newsfeed.iinet.net.au!per-qv1-newsreader-01.iinet.net.au!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189775

"Charlie Gibbs" <cgibbs@kltpzyxm.invalid> writes:

>Like many of us, I'm sure, I have a collection of old computer
>manuals that's taking up a lot of shelf space.  In the name of
>home renovations (and marital bliss) I would be willing to let
>go of some of the newer (i.e. later than 1980) ones if I could
>scan them.  Like most such manuals, they're 95% text (in a couple
>of typefaces and sizes) and line drawings.  I presume the ideal
>solution is to use some sort of scanning/OCR software to turn them
>into PDF files.  Is there readily available software (preferably
>for Linux) to do this?  Have any of you embarked on such a project,
>and do you have any words of wisdom to share?

Wisdom? Is that like learning from the mistakes of others?

Three years ago I embarked on a similar project of scanning
photographs.

Scanning "by hand" and cataloguing so that you are still able to find
the relevant items is time-consuming.

If one could scan 2 pages a minute by turning pages, etc., then a
thousand-page tome will take a week of evenings.

For loose-bound stuff, there are document scanners loitering in many
a corner that'll process 30+ pages a minute. If you cannot find a
company who officially makes that part of its business, then a box
of chocolates/carton of beer may entice some owners of such
facilities to allow you to use their scanner at an idle time.
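
To put rough numbers on the two scanning rates mentioned above, here is
a back-of-envelope sketch. The 75-minute "evening" is purely an assumed
figure for illustration; only the page count and the two rates come
from the post.

```python
import math


def scan_minutes(pages: int, pages_per_minute: float) -> float:
    """Minutes needed to scan `pages` at a steady rate."""
    return pages / pages_per_minute


def evenings_needed(pages: int, pages_per_minute: float,
                    minutes_per_evening: float) -> int:
    """Whole evenings required, rounding any partial evening up."""
    return math.ceil(scan_minutes(pages, pages_per_minute) / minutes_per_evening)


EVENING_MINUTES = 75  # assumption: how long one scanning session lasts

# Flatbed, turning pages by hand: 2 pages a minute.
print(scan_minutes(1000, 2), "minutes by hand")
print(evenings_needed(1000, 2, EVENING_MINUTES), "evenings by hand")

# Sheet-fed document scanner: 30 pages a minute.
print(round(scan_minutes(1000, 30), 1), "minutes with a feeder")
print(evenings_needed(1000, 30, EVENING_MINUTES), "evening(s) with a feeder")
```

Under that assumption the hand-fed job works out to a week of evenings,
while the sheet-fed scanner gets through the same tome in a single
session, which is why borrowing one is worth a carton of beer.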

BTW: scan most text pages as "line-art" as that tends to reduce the
"noise" in the image and aids compressibility. Subsequent OCR'ing
can be used to generate "indexes". Store the original; OCR text can
be unreliable and sometimes produces pure gibberish, even after
extensive "training".
-- 
/"\ Bernd Felsche - Innovative Reckoning, Perth, Western Australia
\ /  ASCII ribbon campaign | I'm a .signature virus!
 X   against HTML mail     | Copy me into your ~/.signature
/ \  and postings          | to help me spread!

######

From: Howard Shubs <howard@shubs.net>
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: Wed, 10 Nov 2004 22:12:09 -0500
Organization: ='SEQUENTIAL'
Lines: 27
Message-ID: <howard-36E8FF.22120910112004@news.newsguy.com>
References: <668.810T1381T6394818@kltpzyxm.invalid>
NNTP-Posting-Host: p-585.newsdawg.com
Mail-Copies-To: nobody
User-Agent: MT-NewsWatcher/3.4 (PPC Mac OS X)
X-Face: "S"r{U%bs].&Ud}Pc~<qT-k!*znQD|gj[,+)#DaY~AsjVL[UlRXi&y]]1#_De
        VoZ=w,7@[GOQ'x"{;Ee<>~~0a]M:t5l>>EN\1Faw10M9NK1Xq59wo7-"s0S+[{etQorO
        /Nf-Ci"i9v'MT!R8)J]N[4|2&x1r^Iq&{SB"6dknr0=+6UFb.>+{zMn_1=rw&/V+"d@*
        ZS5\LoW_
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news.moat.net!pln-e!spln!rex!extra.newsguy.com!newsp.newsguy.com!howard
Xref: nightfall.franklin.ch alt.folklore.computers:189776

In article <668.810T1381T6394818@kltpzyxm.invalid>,
 "Charlie Gibbs" <cgibbs@kltpzyxm.invalid> wrote:

> Is there readily available software (preferably
> for Linux) to do this?  Have any of you embarked on such a project,
> and do you have any words of wisdom to share?

Yes, I've done it.  I started in 1998 to work on two manuals for the IBM 
1130.  Some day, I might finish!

I'm not converting into PDF but into HTML.  See 
<http://www.ibm1130.net/functional> and <http://www.ibm1130.net/DM2> for 
my efforts so far.

I've been using MacOS and OmniPage Pro.  It goes like this:

1) scan a page.
2) OCR the page
3) move the results into a text-based HTML editor called BBEdit.
4) clean up the text, proof it, possibly rebuild the graphics.
5) upload the new version of the page.

It's tedious, but once it's done, no one else will have to do it again.

-- 
"...crying in your drink is bad enough; crying into a hot fudge sundae
is disgusting." - Jerry Farnsworth in _Job: A Comedy of Justice_

######

From: toby@telegraphics.com.au (Toby Thain)
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: 10 Nov 2004 20:58:56 -0800
Organization: http://groups.google.com
Lines: 11
Message-ID: <d6ce4a6c.0411102058.407a80c5@posting.google.com>
References: <668.810T1381T6394818@kltpzyxm.invalid>
NNTP-Posting-Host: 65.92.124.111
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: posting.google.com 1100149137 1100 127.0.0.1 (11 Nov 2004 04:58:57 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Thu, 11 Nov 2004 04:58:57 +0000 (UTC)
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news2.euro.net!fr.ip.ndsoftware.net!proxad.net!216.239.36.134.MISMATCH!postnews.google.com!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189777

"Charlie Gibbs" <cgibbs@kltpzyxm.invalid> wrote in message news:<668.810T1381T6394818@kltpzyxm.invalid>...
> I presume the ideal
> solution is to use some sort of scanning/OCR software to turn them
> into PDF files.  Is there readily available software (preferably
> for Linux) to do this?  Have any of you embarked on such a project,
> and do you have any words of wisdom to share?

My advice: Whether you OCR or not, *please* make sure the original
scanned images remain available in your archive.

--Toby

######

Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
From: Tris Orendorff <triso@remove-me.cogeco.ca>
References: <668.810T1381T6394818@kltpzyxm.invalid>
Organization: Torensoft Consulting
Message-ID: <Xns959ECC6B5E051RepublicPicturesLtd@216.221.81.119>
User-Agent: Xnews/06.08.25
X-Face: IshsEs\2&-CRNbQ'E]S8OU{A_:p:*[<45&]7:[QnpNNZHm''7(M+Do^,?tfk!JK@i~x(3PR"?ZH=bE!coJ$9J1_Gs""|xs20v&1g%;QcA@HE?UJ&|9-(g&gkBPz{ZKVI^Yp8(c)T{I:BDk{EN,b"~/VV7OL46b64c3fp-TO7?)CvvaFSqW*73$s*]j99a-<ifgw.FkFrA_\!#/fzt%e0?Ic~8P~Vzp6'L|ZYmLM(sdmc:};7\?m8.0F^En~NT%>#zry-{86LtBDLi2TB=ArfN^.;4a'$[}/!1Pe2o8*],I(<x|A[:.wv`?Q9#=@iJm3
Lines: 28
Date: Fri, 12 Nov 2004 01:04:16 GMT
NNTP-Posting-Host: 24.226.72.210
X-Complaints-To: abuse@cogeco.ca
X-Trace: read1.cgocable.net 1100221456 24.226.72.210 (Thu, 11 Nov 2004 20:04:16 EST)
NNTP-Posting-Date: Thu, 11 Nov 2004 20:04:16 EST
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news.glorb.com!border1.nntp.dca.giganews.com!nntp.giganews.com!meganewsservers.com!feeder2.on.meganewsservers.com!feed.cgocable.net!read1.cgocable.net.POSTED!53ab2750!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189800

"Charlie Gibbs" <cgibbs@kltpzyxm.invalid> wrote in news:668.810T1381T6394818@kltpzyxm.invalid:

> Like many of us, I'm sure, I have a collection of old computer
> manuals that's taking up a lot of shelf space.  In the name of
> home renovations (and marital bliss) I would be willing to let
> go of some of the newer (i.e. later than 1980) ones if I could
> scan them.  Like most such manuals, they're 95% text (in a couple
> of typefaces and sizes) and line drawings.  I presume the ideal
> solution is to use some sort of scanning/OCR software to turn them
> into PDF files.  Is there readily available software (preferably
> for Linux) to do this?  Have any of you embarked on such a project,
> and do you have any words of wisdom to share?

Perhaps Distributed Proofreaders <www.pgdp.net> could help. They are adding at least 250 books and 
magazines to Project Gutenberg every month.  

-- 
Sincerely,

Tris Orendorff

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS d++ s+:- a+ C+ UL++++ P+ L+ E- W+ N++ o- K++ w+ O+ M !V PS+ PE Y+ PGP t+ !5 X- R- tv--- b++ 
DI++ D+ G++ e++ h---- r+++ y+++
------END GEEK CODE BLOCK------

######

From: ace join_to ware@iinet.net.au (Tony Epton)
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: Fri, 12 Nov 2004 01:27:23 GMT
Organization: Aceware Programming Pty Ltd
Message-ID: <419410e1.9176015@news.m.iinet.net.au>
References: <668.810T1381T6394818@kltpzyxm.invalid>
X-Newsreader: Forte Free Agent 1.21/32.243
Lines: 26
NNTP-Posting-Host: 203.59.54.231
X-Trace: 1100222412 per-qv1-newsreader-01.iinet.net.au 6545 203.59.54.231
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!newscore.univie.ac.at!newsfeed01.sul.t-online.de!newsfeed00.sul.t-online.de!t-online.de!npeer.de.kpn-eurorings.net!news.tele.dk!news.tele.dk!small.news.tele.dk!newsfeed1.swip.net!swipnet!News-Peer-Europe!news.stupi.se!news.net.hanse.com!nntp.gblx.net!nntp3.phx1!nntp1.phx1.gblx.net!nntp.gblx.net!nntp.gblx.net!nntp4.phx1!newsfeed.iinet.net.au!newsfeed.iinet.net.au!per-qv1-newsreader-01.iinet.net.au!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189801

On 10 Nov 04 10:39:27 -0800, "Charlie Gibbs" <cgibbs@kltpzyxm.invalid>
wrote:

Has anyone come up with a good technique for shaving off the "perfect
binding" spine in order to use an automatic document feeder?

What's everyone's opinion - should we destroy a book in order to save
it?

Tony

>Like many of us, I'm sure, I have a collection of old computer
>manuals that's taking up a lot of shelf space.  In the name of
>home renovations (and marital bliss) I would be willing to let
>go of some of the newer (i.e. later than 1980) ones if I could
>scan them.  Like most such manuals, they're 95% text (in a couple
>of typefaces and sizes) and line drawings.  I presume the ideal
>solution is to use some sort of scanning/OCR software to turn them
>into PDF files.  Is there readily available software (preferably
>for Linux) to do this?  Have any of you embarked on such a project,
>and do you have any words of wisdom to share?
>
>--
>cgibbs@surfnaked.ca (Charlie Gibbs)
>

######

Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au>
Reply-To: jhaynes@alumni.uark.edu
Organization: University of Arkansas Alumni
X-Newsreader: trn 4.0-test72 (19 April 1999)
From: haynes@alumni.uark.edu (Jim Haynes)
Lines: 19
Message-ID: <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>
Date: Fri, 12 Nov 2004 02:46:21 GMT
NNTP-Posting-Host: 63.190.113.169
X-Complaints-To: abuse@earthlink.net
X-Trace: newsread2.news.atl.earthlink.net 1100227581 63.190.113.169 (Thu, 11 Nov 2004 18:46:21 PST)
NNTP-Posting-Date: Thu, 11 Nov 2004 18:46:21 PST
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news-fra1.dfn.de!news0.de.colt.net!colt.net!easynet-monga!easynet.net!news-out.cwix.com!newsfeed.cwix.com!border1.nntp.dca.giganews.com!border2.nntp.dca.giganews.com!nntp.giganews.com!elnk-atl-nf1!newsfeed.earthlink.net!stamper.news.atl.earthlink.net!newsread2.news.atl.earthlink.net.POSTED!c4a26796!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189803

I've done a lot of manual scanning of old tech manuals, under Linux.
Can't say I have any wisdom to pass along, and it's tedious.  I scan
them in gray scale and store as jpeg files.  When I tried line art there
were too many dropouts.  With gray scale I can use a color editor and
clean up the dirty whites and gray blacks.  Doing that adds to the
tediousity of it all.  I don't have software to convert them to pdf
files, but it appears that converting them from jpeg to pdf makes them
grow by a factor of five or so, or maybe that means I'm not doing it
right.  On one set of documents I went so far as to put html wrappers
around them all, as a reader complained it was tedious to read them
under Windows with whatever he was using for a jpeg viewer.  But I have
not done that for the rest of them, as that also was an awful lot of work.

You can see some of my stuff at
massis.lcs.mit.edu/archives/technical/western-union-tech-review/
-- 

jhhaynes at earthlink dot net

######

From: Howard Shubs <howard@shubs.net>
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: Thu, 11 Nov 2004 23:18:33 -0500
Organization: ='SEQUENTIAL'
Lines: 11
Message-ID: <howard-809D34.23183311112004@news.newsguy.com>
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au>
NNTP-Posting-Host: p-908.newsdawg.com
Mail-Copies-To: nobody
User-Agent: MT-NewsWatcher/3.4 (PPC Mac OS X)
X-Face: "S"r{U%bs].&Ud}Pc~<qT-k!*znQD|gj[,+)#DaY~AsjVL[UlRXi&y]]1#_De
        VoZ=w,7@[GOQ'x"{;Ee<>~~0a]M:t5l>>EN\1Faw10M9NK1Xq59wo7-"s0S+[{etQorO
        /Nf-Ci"i9v'MT!R8)J]N[4|2&x1r^Iq&{SB"6dknr0=+6UFb.>+{zMn_1=rw&/V+"d@*
        ZS5\LoW_
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsmi-eu.news.garr.it!newsmi-us.news.garr.it!NewsITBone-GARR!hammer.uoregon.edu!logbridge.uoregon.edu!pln-w!spln!dex!extra.newsguy.com!newsp.newsguy.com!howard
Xref: nightfall.franklin.ch alt.folklore.computers:189805

In article <419410e1.9176015@news.m.iinet.net.au>,
 ace join_to ware@iinet.net.au (Tony Epton) wrote:

> What's everyone's opinion - should we destroy a book in order to save
> it?

The manuals I've been working on have been ring-bound, so no problem.

-- 
"...crying in your drink is bad enough; crying into a hot fudge sundae
is disgusting." - Jerry Farnsworth in _Job: A Comedy of Justice_

######

From: aek@spies.com (Al Kossow)
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: Thu, 11 Nov 2004 20:44:40 -0800
Organization: Apple Computer, Inc.
Lines: 6
Message-ID: <aek-1111042044400001@haxrus.apple.com>
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>
NNTP-Posting-Host: haxrus.apple.com
X-Trace: news.apple.com 1100234680 21848 17.205.21.66 (12 Nov 2004 04:44:40 GMT)
X-Complaints-To: abuse@news.apple.com
NNTP-Posting-Date: Fri, 12 Nov 2004 04:44:40 +0000 (UTC)
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!news.glorb.com!newsfeed-east.nntpserver.com!nntpserver.com!newsfeed1.sea.pnap.net!newsfeed2.sea.pnap.net!newsfeed.pnap.net!forum.apple.com!news.apple.com!haxrus.apple.com!user
Xref: nightfall.franklin.ch alt.folklore.computers:189808

In article <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>,
jhaynes@alumni.uark.edu wrote:

>I don't have software to convert them to pdf

tumble.brouhaha.com

######

From: dnichols@d-and-d.com (DoN. Nichols)
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: 12 Nov 2004 01:08:48 -0500
Organization: D & D Data, Vienna, VA
Lines: 37
Message-ID: <cn1k1g$a8v$1@fuego.d-and-d.com>
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>
Reply-To: dnichols@d-and-d.com (Donald Nichols)
NNTP-Posting-Host: fuego.d-and-d.com
X-Trace: ceilidh.d-and-d.com 1100239728 14864 10.0.0.26 (12 Nov 2004 06:08:48 GMT)
X-Complaints-To: abuse@d-and-d.com
NNTP-Posting-Date: 12 Nov 2004 06:08:48 GMT
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news.tele.dk!news.tele.dk!small.news.tele.dk!newspeer.monmouth.com!chiapp19.algx.com!dca1-feed1.news.algx.net!allegiance!ceilidh.d-and-d.com!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189809

In article <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>,
Jim Haynes <jhaynes@alumni.uark.edu> wrote:
>I've done a lot of manual scanning of old tech manuals, under Linux.
>Can't say I have any wisdom to pass along, and it's tedious.  I scan
>them in gray scale and store as jpeg files.  When I tried line art there

	Hmm ... jpeg is a potential problem, as it tends to corrupt
information close to intersections of lines -- especially if the lines
form an acute angle.

	My preference for scanning B&W documents is TIFF format, good
compression, with *no* loss of detail.  And with a scanner which allows
you to shift the crossover point from black to white, you can usually
get very good scans into pure B&W.  I use an old HP ScanJet (with its
driver from HP in Windows) to do the scan, and then move to a unix box
for the rest.

	The TIFF libs include a program to convert TIFF to PS, and
GhostScript includes one to convert PS to PDF, so I scan the pages with
sequential names (which will sort properly on a unix filesystem -- that
is leading zeros in chapter and page numbers), and then use a script
which will convert it all to a single PDF file.
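
	A minimal sketch of what such a script might look like, not
the actual one described above. The `chNN_pNNNN.tif` naming scheme and
the output file names are hypothetical. It assumes `tiff2ps` accepts
several input files and writes PostScript to standard output, and that
`ps2pdf` takes an input and an output path; check the man pages on your
system before relying on either.

```python
import subprocess


def page_name(chapter: int, page: int) -> str:
    """Zero-padded name so a plain lexical sort gives true page order."""
    return f"ch{chapter:02d}_p{page:04d}.tif"


def build_commands(tiffs, ps_path="manual.ps", pdf_path="manual.pdf"):
    """Return the (tiff2ps argv, ps2pdf argv) pair for the given pages."""
    ordered = sorted(tiffs)  # safe only because the names are zero-padded
    return ["tiff2ps", *ordered], ["ps2pdf", ps_path, pdf_path]


def convert(tiffs, ps_path="manual.ps", pdf_path="manual.pdf"):
    """Run both steps: TIFF pages -> one PostScript file -> one PDF."""
    to_ps, to_pdf = build_commands(tiffs, ps_path, pdf_path)
    with open(ps_path, "wb") as ps_file:
        subprocess.run(to_ps, stdout=ps_file, check=True)
    subprocess.run(to_pdf, check=True)


# Why the leading zeros matter: unpadded names sort in the wrong order.
print(sorted(["p10.tif", "p2.tif"]))
print(sorted([page_name(1, 10), page_name(1, 2)]))
```

Calling `convert(...)` with a list of page files would run the two
external programs; the last two lines only show the sorting problem
that the leading zeros avoid.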

	What I tend to scan is manuals for no-longer manufactured
machine tools, but the principle is the same (with perhaps greater need
for the fine detail in the exploded drawings so you know what the part
number is, and how it goes back with the other parts).

	Some of these days I will get around to trying some of the unix
drivers for scanners to see how well they work.

	Enjoy,
		DoN.
-- 
 Email:   <dnichols@d-and-d.com>   | Voice (all times): (703) 938-4564
	(too) near Washington D.C. | http://www.d-and-d.com/dnichols/DoN.html
           --- Black Holes are where God is dividing by zero ---

######

From: Squidster <squidster3128@yahoo.com>
Subject: Re: Scanning old manuals
User-Agent: Pan/0.14.2 (This is not a psychotic episode. It's a cleansing moment of clarity.)
Message-ID: <pan.2004.11.12.08.53.08.358870@yahoo.com>
Newsgroups: alt.folklore.computers
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Lines: 22
Date: Fri, 12 Nov 2004 08:53:07 GMT
NNTP-Posting-Host: 16.151.176.254
X-Complaints-To: abuse@HP.com
X-Trace: news.cpqcorp.net 1100249587 16.151.176.254 (Fri, 12 Nov 2004 00:53:07 PST)
NNTP-Posting-Date: Fri, 12 Nov 2004 00:53:07 PST
Organization: Hewlett-Packard Company
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!newsfeed.wirehub.nl!newsfeed.icl.net!newsfeed.frii.net!newsfeed.frii.net!news.compaq.com!news.cpqcorp.net!53ab2750!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189812

I've done a lot of tiff to PDF conversion on Linux.
You should NEVER use jpegs on old and faded docs, much too lossy.

The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs!
With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a
combined pdf. It is great! Software is reliable, fast and robust.

Both utilities are available in libtiff-3.6.1-2 or later. Earlier versions do
not have the critical tiff2pdf.
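
For illustration, the two steps could be driven from a small script
along these lines. This is a sketch under assumptions: the file names
are hypothetical, and it relies on `tiffcp` taking its sources followed
by a destination and on `tiff2pdf` accepting `-o` for the output file.

```python
import subprocess


def combine_cmd(pages, combined="combined.tif"):
    """argv for tiffcp: all source pages, then the multi-page destination."""
    return ["tiffcp", *sorted(pages), combined]


def pdf_cmd(combined="combined.tif", pdf="manual.pdf"):
    """argv for tiff2pdf: write the combined TIFF out as a single PDF."""
    return ["tiff2pdf", "-o", pdf, combined]


def tiffs_to_pdf(pages, combined="combined.tif", pdf="manual.pdf"):
    """Run tiffcp, then tiff2pdf; raises if either tool reports failure."""
    subprocess.run(combine_cmd(pages, combined), check=True)
    subprocess.run(pdf_cmd(combined, pdf), check=True)


# Show the commands without running them.
print(" ".join(combine_cmd(["page002.tif", "page001.tif"])))
print(" ".join(pdf_cmd()))
```

The pages are sorted by name before combining, so this only gives the
right page order if the scans were saved with zero-padded numbers.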

HTH.

 On Fri, 12 Nov 2004 02:46:21 +0000, Jim Haynes wrote:

> I've done a lot of manual scanning of old tech manuals, under Linux.
> Can't say I have any wisdom to pass along, and it's tedious.  I scan
> them in gray scale and store as jpeg files.  When I tried line art there
> were too many dropouts.  With gray scale I can use a color editor and
> clean up the dirty whites and gray blacks.  Doing that adds to the
> tediousity of it all.  I don't have software to convert them to pdf
> files, but it appears that converting them from jpeg to pdf makes them
<snip>

######

From: mwojcik@newsguy.com (Michael Wojcik)
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: 12 Nov 2004 16:47:36 GMT
Organization: Micro Focus International Ltd
Lines: 56
Message-ID: <cn2pf80fg3@news2.newsguy.com>
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au>
Reply-To: mwojcik@newsguy.com
NNTP-Posting-Host: p-116.newsdawg.com
X-Newsreader: xrn 9.00
Originator: mww@sen
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news.maxwell.syr.edu!logbridge.uoregon.edu!pln-w!spln!dex!extra.newsguy.com!newsp.newsguy.com!mww
Xref: nightfall.franklin.ch alt.folklore.computers:189838


In article <419410e1.9176015@news.m.iinet.net.au>, ace join_to ware@iinet.net.au (Tony Epton) writes:
> 
> Has anyone come up with a good technique for shaving off the "perfect
> binding" spine in order to use an automatic document feeder?

ISTM that I've read at least one report of good results from clamping
the book in a home-made jig that left the binding exposed, then
running it through a table saw with a finish-quality thin-kerf blade.

(By the way, a good technique for shaving down your Usenet posts is
to avoid top-posting and snip quoted text you're not referring to
directly.)

> What's everyone's opinion - should we destroy a book in order to save
> it?

Mine is that, in the general case, the most important features to
preserve are, in order:

- The main text.
- Illustrations directly relevant to the main text.
- Authoring and editing information.
- Peripheral text (introductions and the like).
- Formatting of the text, including page numbers and so forth (which
are important for locating citations, and for textual studies, such
as determining edition).
- Indicia, such as copyright information (also important for textual
studies, as well as questions of intellectual property).
- Images of the actual pages.
- Description of the cover, spine, binding, etc (for textual studies).
- Images of the cover, spine, binding, etc.
- The actual physical book.

In short, ripping the spine to scan the book can be justified in
most circumstances.  For books which are extremely rare or of
special interest as physical artifacts, it's preferable to use
some other method, such as carefully turning the pages and photo-
graphing them.

It's quite common, in textual studies, to create a catalog of
descriptions of the editions and imprints of a book; they usually
suffice for most of the purposes that you'd want an actual copy for,
except of course for the psychological one of owning the thing itself.
(Walter Benjamin famously theorized that there's an essential
difference, what he called "aura", between an original work of art
and "mechanical reproductions" of it.  Of course, books themselves
are nearly always mechanical reproductions, but they're a step closer
to the original than a scanned version.  Personally, I've never found
his argument persuasive - Benjamin is far too much the mystic for my
tastes - but some people do feel that way.)

-- 
Michael Wojcik                  michael.wojcik@microfocus.com

Any average educated person can turn out competent verse.  -- W. H. Auden

######

From: Roland Hutchinson <my.spamtrap@verizon.net>
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: Fri, 12 Nov 2004 14:14:05 -0500
Lines: 21
Message-ID: <2vkgbuF2mu9d6U1@uni-berlin.de>
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <cn2pf80fg3@news2.newsguy.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8Bit
X-Trace: news.uni-berlin.de 4EBC5diIxh3MKfwCJtA92wggIvtXfD9dNCxj61In7bBvJ6psk=
User-Agent: KNode/0.7.7
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news-fra1.dfn.de!fu-berlin.de!uni-berlin.de!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189843

Michael Wojcik wrote:

> It's quite common, in textual studies, to create a catalog of
> descriptions of the editions and imprints of a book; they usually
> suffice for most of the purposes that you'd want an actual copy for,
> except of course for the psychological one of owning the thing itself.

And then every so often, something comes around that you absolutely,
positively, have to have the original for to solve a bibliographic or
textual question, and historians end up doing something like
bombarding the ink in the marginalia of J. S. Bach's personal copy of the
Bible with subatomic particles and watching the pretty x-rays that result
to get a handle on what inks were used.

(Well, it makes a change from tossing CDs in the microwave oven, anyway.)

-- 
Roland Hutchinson              Will play viola da gamba for food.

NB mail to my.spamtrap [at] verizon.net is heavily filtered to
remove spam.  If your message looks like spam I may not see it.

######

From: toby@telegraphics.com.au (Toby Thain)
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: 12 Nov 2004 11:45:19 -0800
Organization: http://groups.google.com
Lines: 56
Message-ID: <d6ce4a6c.0411121145.3ac120f2@posting.google.com>
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> <pan.2004.11.12.08.53.08.358870@yahoo.com>
NNTP-Posting-Host: 65.92.124.111
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: posting.google.com 1100288719 10379 127.0.0.1 (12 Nov 2004 19:45:19 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Fri, 12 Nov 2004 19:45:19 +0000 (UTC)
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!tiscali!newsfeed1.ip.tiscali.net!proxad.net!216.239.36.134.MISMATCH!postnews.google.com!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189845

Squidster <squidster3128@yahoo.com> wrote in message news:<pan.2004.11.12.08.53.08.358870@yahoo.com>...
> I've done a lot of tiff to PDF conversion on Linux.
> You should NEVER use jpegs on old and faded docs, much too lossy.
> 
> The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs!
> With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a
> combined pdf. It is great! Software is reliable, fast and robust.

Not only that, but you can ask tiffcp to compress with Fax/G4, which is
very compact.

But first make sure you haven't fallen into the common pitfall of a
high res dithered bitmap on a grey-level scan. I've seen dozens of DEC
documents posted (mainly PDFs) that are unnecessarily large and
unclear due to this. (I've been fixing these and recompressing for
improved clarity, browsing speed and *huge* reductions, say 5-10:1 or
more, in file size.)

> 
> Both utilities are available in libtiff-3.6.1-2 or later. Earlier versions do
> not have the critical tiff2pdf.

My cleanup recipe is approximately: 
1. pdfimages to extract to PBM (from xpdf)
2. Photoshop action to clean up each page (uses my Netpbm file
formats, http://www.telegraphics.com.au/sw/#netpbmformats ).
The action typically 
  2a. converts to Grey Scale, 
  2b. reduces resolution (on the 600dpi scans I've typically
encountered with the problem, this does not lose any significant
info),
  2c. thresholds, and 
  2d. resaves as bitmap TIFF.
3. recompress using tiffcp -c g4
4. optionally reconvert to PDF (it's much faster to browse TIFFs, so I
usually leave them as TIFF nowadays).
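
To make steps 2b and 2c concrete, here is a minimal sketch in Python,
assuming the page is already in memory as rows of 0..255 grey values.
It is not the Photoshop action itself; the function names, the
downsampling factor and the cutoff are illustrative assumptions.

```python
# Sketch of recipe steps 2b-2c on an in-memory greyscale page.
# A page is a list of rows; each row is a list of 0..255 grey values.
# The default factor and cutoff are illustrative, not measured settings.

def downsample(page, factor):
    """Reduce resolution by averaging each factor x factor block of pixels."""
    height = len(page) - len(page) % factor
    width = len(page[0]) - len(page[0]) % factor
    out = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [page[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

def threshold(page, cutoff=128):
    """Map grey values to bilevel: 1 = black ink, 0 = white paper."""
    return [[1 if value < cutoff else 0 for value in row] for row in page]

def clean_page(page, factor=2, cutoff=128):
    """Downsample first, then threshold, so dither dots are averaged away."""
    return threshold(downsample(page, factor), cutoff)
```

Averaging before thresholding is the point of the ordering: a dither
pattern turns back into a grey level first, and only then is each
pixel forced to black or white.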

The other faux pas I've seen is scanning 600dpi colour cover pages and
including them in the PDF. Removing these, or converting to some sane
representation, typically halves the size of the PDF right there.

--Toby

> 
> HTH.
> 
>  On Fri, 12 Nov 2004 02:46:21 +0000, Jim Haynes wrote:
> 
> > I've done a lot of manual scanning of old tech manuals, under Linux.
> > Can't say I have any wisdom to pass along, and it's tedious.  I scan
> > them in gray scale and store as jpeg files.  When I tried line art there
> > were too many dropouts.  With gray scale I can use a color editor and
> > clean up the dirty whites and gray blacks.  Doing that adds to the
> > tediousity of it all.  I don't have software to convert them to pdf
> > files, but it appears that converting them from jpeg to pdf makes them
>  <snip>

######

NNTP-Posting-Date: Sat, 13 Nov 2004 22:23:29 -0600
Date: Sat, 13 Nov 2004 23:23:27 -0500
From: Olin Sibert <news8304@siliconkeep.omitthispart.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113
X-Accept-Language: en-us, en
MIME-Version: 1.0
Newsgroups: alt.folklore.computers,alt.os.multics
Subject: Re: Scanning old (Multics) manuals
References: <668.810T1381T6394818@kltpzyxm.invalid>
In-Reply-To: <668.810T1381T6394818@kltpzyxm.invalid>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID: <wKadnXY4ptJcQAvcRVn-3A@rcn.net>
Lines: 117
NNTP-Posting-Host: 207.172.220.126
X-Trace: sv3-zVfdm0SZM8vCJLACi8mGfkDV0q0vCzm0vQdL8X+qSKTqdzdTvtAjuT1ZusUKLDVVxvFdmf9rgVjf9AZ!Q29rjMsbMzZEDoWygUdgfQr8dwNihx+Y0Hb5dJleBCE6Krk59XCYmEVE/5RgfQzHO6s=
X-Complaints-To: abuse@rcn.net
X-DMCA-Complaints-To: abuse@rcn.net
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.20
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news.glorb.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local1.nntp.dca.giganews.com!nntp.rcn.net!news.rcn.net.POSTED!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189889

Charlie Gibbs wrote:

 > Like many of us, I'm sure, I have a collection of old computer
 > manuals that's taking up a lot of shelf space.  In the name of
 > home renovations (and marital bliss) I would be willing to let
 > go of some of the newer (i.e. later than 1980) ones if I could
 > scan them.  Like most such manuals, they're 95% text (in a couple
 > of typefaces and sizes) and line drawings.  I presume the ideal
 > solution is to use some sort of scanning/OCR software to turn them
 > into PDF files.  Is there readily available software (preferably
 > for Linux) to do this?  Have any of you embarked on such a project,
 > and do you have any words of wisdom to share?
 >
 > --
 > cgibbs@surfnaked.ca (Charlie Gibbs)

Recently, I've been working on the Multics History Project, an
effort to scan a large collection of manuals, technical memos,
etc.  The primary goal is to preserve the information, so I've
been concentrating on making TIFF images, but eventually I hope
to OCR at least the more interesting ones and perhaps convert
them to HTML.

This is a summary of my experience so far.

I'm using several small scanners: Fujitsu ScanSnap fi-5110EOX,
Xerox Documate 252, Canon DR-2080C.  These little machines are
great: cost between $400 and $850, automatic double-side scanning,
and 6-12 sheets/minute (or 12-24 page images/minute) at
archival-quality resolution of 600 DPI.  The paper paths are very
simple (straight through) and they are slow enough that a paper
jam isn't a catastrophe.  I've only had a single serious page
misfeed so far (out of 10,000 or so).

None of those scanners has emerged as ideal for all purposes: the
ScanSnap is by far the least expensive, and easy to use, but slower.
The Documate 252 is fast, but its "feature" of discarding all its
work when it encounters an error is mighty annoying.  The Canon
is really slow at 600 DPI (only 3.5 pages/minute).  It does 7 ppm
at 400 DPI, and a respectable 10 at 300, but I've also had
considerable trouble (multiple feeds) from its auto-feeder.

The bundled scanner GUI software tends to be very rich and
"helpful" for scanning individual pages, but not so good at
keeping an organized catalog.  Instead, I ginned up a little
application that watches for scanned files to appear in a pool
directory, prompts for metadata, and creates a catalog entry in a
small database.  This sort of thing is critical: entering catalog
data has been the most time-consuming part of the project, even
though I've put a lot of effort into simplifying that process.
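
A minimal sketch of that kind of pool-directory cataloguer, in
Python with SQLite.  This is not the actual application: the table
layout, the function names and the file extensions are hypothetical,
and the metadata prompt is left to the caller.

```python
import os
import sqlite3

def open_catalog(db_path):
    """Open (or create) the catalog database. The schema is hypothetical."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        " filename TEXT PRIMARY KEY, title TEXT, pages INTEGER)"
    )
    return db

def new_scans(db, pool_dir):
    """Return scan files in pool_dir that have no catalog entry yet."""
    known = {row[0] for row in db.execute("SELECT filename FROM catalog")}
    return sorted(
        name for name in os.listdir(pool_dir)
        if name.lower().endswith((".tif", ".tiff", ".pdf"))
        and name not in known
    )

def catalog_scan(db, filename, title, pages):
    """Record the metadata entered for one scanned file."""
    db.execute(
        "INSERT INTO catalog (filename, title, pages) VALUES (?, ?, ?)",
        (filename, title, pages),
    )
    db.commit()
```

Calling new_scans() in a loop with a short sleep gives the "watch
the pool" behaviour; each name it returns is a file still waiting
for its metadata.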

I've found that I can keep up--barely--with the ScanSnap while
scanning technical memos (3 to 30 pages each).  Entering metadata
and physical document handling is time-consuming.  On the other
hand, when scanning big items like manuals, it's practical for
one person to keep two scanners busy.  A big--even if
temporary--workspace is very helpful, because one needs lots of
room to keep stacks of work in progress in order to keep things
running at full speed.

I keep the files as multi-page TIFFs, one file per document,
because that's what the OCR software seems to expect, and because
it's a well-documented, lossless, and relatively simple format.
For online usage, TIFFs are easily converted to PDF, either by
tiff2pdf or by Adobe Acrobat.  I'm looking forward to trying
Tumble for semi-automated creation of PDF bookmarks
(http://tumble.brouhaha.com/).  The ScanSnap scanner only
produces PDFs, so I have to convert those first with PDF2TFF.

Because manuals tend to be large, and thus need to be scanned in
chunks, the application uses tiffcp to combine multiple scan
files.  It is possible to keep feeding these scanners as they run
to scan more than the 50-page capacity of the ADF in a single
pass, but at least for the Xerox, that's inadvisable: when it
encounters an error, instead of writing out what it's scanned so
far, it discards everything.
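
For anyone scripting the same step, a hedged sketch of how the
chunks might be combined with the libtiff tools from Python.  The
file names are placeholders, the helper names are made up for this
example, and the optional compression scheme is an assumption
(Group 4 only applies to 1-bit images).

```python
import subprocess

def tiffcp_command(chunks, combined, compression=None):
    """Build a tiffcp argv: tiffcp [-c scheme] in1.tif in2.tif ... out.tif"""
    if not chunks:
        raise ValueError("need at least one scan chunk")
    argv = ["tiffcp"]
    if compression:
        # e.g. "g4" for bilevel pages; greyscale needs another scheme
        argv += ["-c", compression]
    return argv + list(chunks) + [combined]

def tiff2pdf_command(tiff_path, pdf_path):
    """Build a tiff2pdf argv that writes pdf_path from a (multi-page) TIFF."""
    return ["tiff2pdf", "-o", pdf_path, tiff_path]

def run(argv):
    """Run one command and raise if the tool reports an error."""
    subprocess.run(argv, check=True)
```

Typical use would be run(tiffcp_command(["part1.tif", "part2.tif"],
"manual.tif")) after each batch, so a scanner error costs at most
one 50-page chunk.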

The documents have been a mix of 3-hole looseleaf and stapled.
Even after removing the staples (with wire clippers and pliers,
NOT staple removers), the pages sometimes like to stick together
at the staple holes, so they must be riffled thoroughly before
scanning.  The same is true, to a lesser extent, of the looseleaf
papers: if they were printed, then punched, the pages can stick
at the holes.  This kind of sticking can result in missed pages
and misfeeds.  Binder and staple holes also interfere with output
stacking: I've had to cobble together some little cardboard
kludges to make sure pages end up neatly stacked.

For OCR, I've had reasonable success with OmniPage Pro 12,
although I haven't tried many documents yet: it's clearly a
labor-intensive process.  It works best on crisply-printed stuff,
but it's not bad even for pale photocopies.  One thing I haven't
figured out is a good way to deal with headers and footers:
OmniPage normally treats them as part of the text, and can be
told to ignore them entirely, but I haven't found any automated
way to extract them so that they are preserved but aren't part of
the normal text flow.

Acrobat's "Paper Capture" OCR feature seems pretty slick, because
it generates searchable text from a PDF and maintains
correspondence between the scanned words and their locations on
the page.  However, it's not as flexible as OmniPage for getting
the text back out or for adjusting scanning parameters.  I
haven't experimented much with Acrobat OCR yet.

A last suggestion is that if you have children, enlist them:
feeding the scanners can be a fun game, and it leaves you free to
enter the metadata and talk to them about the history.

My conclusion is that massive scanning is a highly practical
project, thanks largely to these inexpensive duplex scanners.
I'm looking forward to turning this technology loose on my
personal paper collection, too.  Feel free to e-mail me with
questions.

Olin Sibert
news8304 (at) siliconkeep.com

######

Message-ID: <41953a34$0$27543$db0fefd9@news.zen.co.uk>
From: Rupert Pigott <darkboong@try-removing-hotmail-this.com>
Subject: Re: Scanning old manuals
Newsgroups: alt.folklore.computers
Reply-To: darkboong@try-removing-hotmail-this.com
Date: Fri, 12 Nov 2004 22:33:22 +0000
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> <pan.2004.11.12.08.53.08.358870@yahoo.com> <d6ce4a6c.0411121145.3ac120f2@posting.google.com>
User-Agent: KNode/0.7.7
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7Bit
Lines: 32
Organization: Zen Internet
NNTP-Posting-Host: 82.69.129.217
X-Trace: 1100298804 cockburn.zen.co.uk 27543 82.69.129.217:38540
X-Complaints-To: abuse@zen.co.uk
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news2.euro.net!216.196.110.149.MISMATCH!border2.nntp.ams.giganews.com!nntp.giganews.com!news-hub.cableinet.net!blueyonder!easynet-monga!easynet.net!news.clara.net!wagner.news.clara.net!hamilton.zen.co.uk!zen.net.uk!fuller.zen.co.uk!cockburn.zen.co.uk.POSTED!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189847

Toby Thain wrote:

> Squidster <squidster3128@yahoo.com> wrote in message
> news:<pan.2004.11.12.08.53.08.358870@yahoo.com>...
>> I've done a lot of tiff to PDF conversion on Linux.
>> You should NEVER use jpegs on old and faded docs, much too lossy.
>> 
>> The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs!
>> With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a
>> combined pdf. It is great! Software is reliable, fast and robust.
> 
> Not only that, but you can ask tiffcp to compress Fax/G4, which is
> very compact.

Someone just walked over my grave... :)

When I worked on newspaper bitmap bashing code (100Mbyte->1Gbyte bitmaps
in the mid-late 90s), I *ALWAYS* used G3 or G4 compression. I found that
the (minimal) CPU time expended on compressing and decompressing it was
more than made up for by the time saved doing I/O. This was in the days of
the Pentium Pro 200 when CPU cycles were considerably more precious than
they are now. ;)
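
For anyone wondering why the fax codecs do so well on bitmaps of
text: they code runs of white and black rather than individual
pixels.  The sketch below shows only that run-length idea in Python.
It is not G3 or G4 (which go on to Huffman-code the run lengths and,
for G4, code each line relative to the one above), and the function
names are made up for the example.

```python
def run_lengths(row):
    """Encode a bilevel row (0 = white, 1 = black) as alternating run
    lengths, starting with a white run that may be 0 pixels long."""
    runs = []
    current = 0   # rows are treated as starting with white
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current = pixel
            count = 1
    runs.append(count)
    return runs

def expand(runs):
    """Rebuild the row of pixels from its alternating run lengths."""
    row = []
    colour = 0
    for length in runs:
        row.extend([colour] * length)
        colour = 1 - colour
    return row
```

A mostly white scan line collapses to a handful of numbers, which is
why the I/O saved can outweigh the CPU spent.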

Kinda nice seeing the same pearl of wisdom being applied to another
practical problem. Kinda weird seeing the same library being used too. ;)


Cheers,
Rupert
-- 
Threading sequential code through the eye of a parallel needle
makes little sense. ;)

######

Sender: gregm@europa.pienet
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> <pan.2004.11.12.08.53.08.358870@yahoo.com> <d6ce4a6c.0411121145.3ac120f2@posting.google.com> <41953a34$0$27543$db0fefd9@news.zen.co.uk>
From: Greg Menke <gregm-news@toadmail.com>
Message-ID: <m33bzdq377.fsf@europa.pienet>
Lines: 41
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: 13 Nov 2004 08:38:20 -0500
NNTP-Posting-Host: 66.159.78.167
X-Complaints-To: abuse@toad.net
X-Trace: news.abs.net 1100353100 66.159.78.167 (Sat, 13 Nov 2004 08:38:20 EST)
NNTP-Posting-Date: Sat, 13 Nov 2004 08:38:20 EST
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!newscore.univie.ac.at!news-fra1.dfn.de!news-lei1.dfn.de!newsfeed.freenet.de!216.196.110.149.MISMATCH!border2.nntp.ams.giganews.com!border1.nntp.ams.giganews.com!nntp.giganews.com!news.cs.univ-paris8.fr!news.glorb.com!newsread.com!news-xfer.newsread.com!nntp.abs.net!news.abs.net!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189857

Rupert Pigott <darkboong@try-removing-hotmail-this.com> writes:

> Toby Thain wrote:
> 
> > Squidster <squidster3128@yahoo.com> wrote in message
> > news:<pan.2004.11.12.08.53.08.358870@yahoo.com>...
> >> I've done a lot of tiff to PDF conversion on Linux.
> >> You should NEVER use jpegs on old and faded docs, much too lossy.
> >> 
> >> The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs!
> >> With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a
> >> combined pdf. It is great! Software is reliable, fast and robust.
> > 
> > Not only that, but you can ask tiffcp to compress Fax/G4, which is
> > very compact.
> 
> Someone just walked over my grave... :)
> 
> When I worked on newspaper bitmap bashing code (100Mbyte->1Gbyte bitmaps
> in the mid-late 90s), I *ALWAYS* used G3 or G4 compression. I found that
> the (minimal) CPU time expended on compressing and decompressing it was
> more than made up for by the time saved doing I/O. This was in the days of
> the Pentium Pro 200 when CPU cycles were considerably more precious than
> they are now. ;)
> 
> Kinda nice seeing the same pearl of wisdom being applied to another
> practical problem. Kinda weird seeing the same library being used too. ;)

The inflection point of the trade-off varies.  I ported the gzip
library into a bootloader on a 12 MHz R3000 so we could gzip the
firmware loads, which come in over a serial port.  The firmware
load format is S3 records: 7-bit ASCII text descriptions of the hex
data to be loaded.

Because the target cpu was on the slow side, the greatest throughput
benefit came with minimum gzip compression.  Using stronger
compression cost more cpu-time to decompress than the savings in
download time.  In general, we got a six-fold improvement over
plain-text uploads.
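
The trade-off can be written down as a small model: load time is
time on the wire plus time spent decompressing on the target.  A
sketch in Python follows; zlib stands in for the gzip library (same
deflate algorithm), the function names are made up, and the
decompress times have to be measured on the real target, since
nothing here can predict them.

```python
import zlib

def compressed_size(payload, level):
    """Bytes on the wire if payload is deflated at the given level (0-9)."""
    return len(zlib.compress(payload, level))

def load_time(size_on_wire, link_bytes_per_s, decompress_s):
    """Total seconds: serial transfer plus decompression on the target."""
    return size_on_wire / link_bytes_per_s + decompress_s

def best_option(options, link_bytes_per_s):
    """Pick the label with the lowest total load time.

    options maps a label to (bytes on the wire, decompress seconds),
    where the seconds are measurements taken on the target CPU.
    """
    return min(
        options,
        key=lambda label: load_time(
            options[label][0], link_bytes_per_s, options[label][1]),
    )
```

On a slow link nearly any compression wins; the stronger levels only
lose when the extra decompress time on the target exceeds what the
smaller file saves on the wire.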

Gregm

######

From: greymaus@yahoo.com
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: 13 Nov 2004 14:26:17 GMT
Lines: 23
Message-ID: <slrncpbk8p.1ch.greymaus@darkstar.example.net>
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <cn2pf80fg3@news2.newsguy.com>
X-Trace: news.uni-berlin.de 5JUjUsdcTf/tuYZM03CeOwadIVN9yCRIA1ARJU6trmqAmLkrtf
User-Agent: slrn/0.9.8.0 (Linux)
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!fu-berlin.de!uni-berlin.de!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189863

On 12 Nov 2004 16:47:36 GMT, Michael Wojcik  wrote:
> In short, ripping the spine to scan the book can be justified in
> most circumstances.  For books which are extremely rare or of
> special interest as physical artifacts, it's preferable to use
> some other method, such as carefully turning the pages and photo-
> graphing them.

I have some books from the '50s; most are pulp format, but they stood
there for most of the intervening years, only sometimes being removed
to check on things. I gave my nephew one of them when he started to go to horse
races with his father, "Education of a Poker Player", by Herbert
Yardley, which I consider to be THE gambling book, even in some
respects THE life book, even if there is little in it about horse
racing. I got it back falling apart, the pages were disintegrating.
Just too old for its crappy paper. I am thinking of getting a scanner
and scanning two or three of the rest. (One of the others was a book
about careers in computing, I looked just now and it is gone. If
someone is consulting it now, they may be puzzled about the section on
cards).

-- 
greymaus
97.025% of statistics are wrong

######

Message-ID: <41961a91$0$27542$db0fefd9@news.zen.co.uk>
From: Rupert Pigott <darkboong@try-removing-hotmail-this.com>
Subject: Re: Scanning old manuals
Newsgroups: alt.folklore.computers
Reply-To: darkboong@try-removing-hotmail-this.com
Date: Sat, 13 Nov 2004 14:30:27 +0000
References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> <pan.2004.11.12.08.53.08.358870@yahoo.com> <d6ce4a6c.0411121145.3ac120f2@posting.google.com> <41953a34$0$27543$db0fefd9@news.zen.co.uk> <m33bzdq377.fsf@europa.pienet>
User-Agent: KNode/0.7.7
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7Bit
Lines: 64
Organization: Zen Internet
NNTP-Posting-Host: 82.69.129.217
X-Trace: 1100356241 cockburn.zen.co.uk 27542 82.69.129.217:20458
X-Complaints-To: abuse@zen.co.uk
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news2.euro.net!216.196.110.149.MISMATCH!border2.nntp.ams.giganews.com!border1.nntp.ams.giganews.com!nntp.giganews.com!dedekind.zen.co.uk!zen.net.uk!fuller.zen.co.uk!cockburn.zen.co.uk.POSTED!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189864

Greg Menke wrote:

> Rupert Pigott <darkboong@try-removing-hotmail-this.com> writes:
> 
>> Toby Thain wrote:
>> 
>> > Squidster <squidster3128@yahoo.com> wrote in message
>> > news:<pan.2004.11.12.08.53.08.358870@yahoo.com>...
>> >> I've done a lot of tiff to PDF conversion on Linux.
>> >> You should NEVER use jpegs on old and faded docs, much too lossy.
>> >> 
>> >> The solution is to use "tiff2pdf" which DIRECTLY converts tiff to
>> >> PDFs! With "tiffcp" you can combine multiple tiffs and use tiff2pdf to
>> >> get a combined pdf. It is great! Software is reliable, fast and
>> >> robust.
>> > 
>> > Not only that, but you can ask tiffcp to compress Fax/G4, which is
>> > very compact.
>> 
>> Someone just walked over my grave... :)
>> 
>> When I worked on newspaper bitmap bashing code (100Mbyte->1Gbyte bitmaps
>> in the mid-late 90s), I *ALWAYS* used G3 or G4 compression. I found that
>> the (minimal) CPU time expended on compressing and decompressing it was
>> more than made up for by the time saved doing I/O. This was in the days of
>> the Pentium Pro 200 when CPU cycles were considerably more precious than
>> they are now. ;)
>> 
>> Kinda nice seeing the same pearl of wisdom being applied to another
>> practical problem. Kinda weird seeing the same library being used too. ;)
> 
> The inflection point of the trade-off varies.  I ported the gzip
> library into a bootloader on a 12 MHz R3000 so we could gzip the
> firmware loads, which come in over a serial port.  The firmware
> load format is S3 records: 7-bit ASCII text descriptions of the hex
> data to be loaded.
> 
> Because the target cpu was on the slow side, the greatest throughput
> benefit came with minimum gzip compression.  Using stronger
> compression cost more cpu-time to decompress than the savings in
> download time.  In general, we got a six-fold improvement over
> plain-text uploads.

I got roughly 6-10x increases in I/O throughput, seems to be the sweet
spot on general purpose compression. There were some beneficial side
effects for my apps under NT though. NT's whacky file caching tried to
evict your program's working set onto disk and cram the (huge) *strictly*
sequentially read file into memory. Reducing the size of the file NT was
trying to store in RAM drastically reduced VM thrashing.

That seemed so painfully brain dead that I swore blind it was something
stupid my code was doing, but I observed the same behaviour in every NT
app we had at the time. It was just the extreme size of the files that
brought the insanity to light. I tried all kinds of tricks... Opening &
closing files, memory mapped (supposedly) unbuffered I/O, nothing worked.

Hopefully they've fixed that insanity by now. :)


Cheers,
Rupert
-- 
Threading sequential code through the eye of a parallel needle
makes little sense. ;)

######

From: Jorgen Grahn <jgrahn-nntq@algonet.se>
Newsgroups: alt.folklore.computers
Subject: Re: Scanning old manuals
Date: 14 Nov 2004 16:44:35 GMT
Organization: Glocalnet AB
Lines: 31
Message-ID: <slrncpf2nf.36f.jgrahn-nntq@frailea.sa.invalid>
References: <668.810T1381T6394818@kltpzyxm.invalid>
NNTP-Posting-Host: 213.163.130.197
X-Trace: yggdrasil.glocalnet.net 1100450675 9616 213.163.130.197 (14 Nov 2004 16:44:35 GMT)
X-Complaints-To: abuse@glocalnet.net
NNTP-Posting-Date: 14 Nov 2004 16:44:35 GMT
User-Agent: slrn/0.9.7.4 (Linux)
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!newscore.univie.ac.at!newsfeed.sunet.se!news01.sunet.se!news2.glocalnet.net!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:189908

On 10 Nov 04 10:39:27 -0800, Charlie Gibbs <cgibbs@kltpzyxm.invalid> wrote:
...
> of typefaces and sizes) and line drawings.  I presume the ideal
> solution is to use some sort of scanning/OCR software to turn them
> into PDF files.  Is there readily available software (preferably
> for Linux) to do this?  Have any of you embarked on such a project,
> and do you have any words of wisdom to share?

For OCR on Linux ... let's see, the guys who restored the "Unix Text
Processing" book from paper mentioned some software.

http://www.oreilly.com/openbook/utp/
http://home.alltel.net/kollar/utp/

Yes, it's in the addendum to the preface of the published text:

    Another essential element in the UTP Revival was the emergence of free
    OCR software. While not quite up to the commercial offerings, programs
    like gocr gave more people the opportunity to contribute.

http://jocr.sourceforge.net/

(And all that is something you probably have googled your way to already,
but the above shows that gocr isn't just another unusable toy SourceForge
project ...)

/Jorgen

-- 
  // Jorgen Grahn <jgrahn@       Ph'nglui mglw'nafh Cthulhu
\X/                algonet.se>   R'lyeh wgah'nagl fhtagn!

######

From: toby@telegraphics.com.au (Toby Thain)
Newsgroups: alt.folklore.computers,alt.os.multics
Subject: Re: Scanning old (Multics) manuals
Date: 19 Nov 2004 21:50:23 -0800
Organization: http://groups.google.com
Lines: 52
Message-ID: <d6ce4a6c.0411192150.1ec949@posting.google.com>
References: <668.810T1381T6394818@kltpzyxm.invalid> <wKadnXY4ptJcQAvcRVn-3A@rcn.net>
NNTP-Posting-Host: 65.92.124.111
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: posting.google.com 1100929824 15754 127.0.0.1 (20 Nov 2004 05:50:24 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Sat, 20 Nov 2004 05:50:24 +0000 (UTC)
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!news.glorb.com!postnews.google.com!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:190233

Olin Sibert <news8304@siliconkeep.omitthispart.com> wrote in message news:<wKadnXY4ptJcQAvcRVn-3A@rcn.net>...
> Charlie Gibbs wrote:
> 
>  > Like many of us, I'm sure, I have a collection of old computer
>  > manuals ...Have any of you embarked on such a project,
>  > and do you have any words of wisdom to share?
>  >
>  > --
>  > cgibbs@surfnaked.ca (Charlie Gibbs)
> 
> Recently, I've been working on the Multics History Project, an
> effort to scan a large collection of manuals, technical memos,
> etc.  The primary goal is to preserve the information, so I've
> been concentrating on making TIFF images, but eventually I hope
> to OCR at least the more interesting ones and perhaps convert
> them to HTML.

It is very important that the bitmaps are also preserved and
accessible, whatever subsequent postprocessing is done.

> 
> This is a summary of my experience so far....
> 
> I keep the files as multi-page TIFFs, one file per document,
> because that's what the OCR software seems to expect, and because
> it's a well-documented, lossless, and relatively simple format.

It is a good choice. Note also that the libtiff tools (tiffcp) can
compress with the G4 fax codec, which is the most compact for this
application. I see you're already using tiffcp so maybe this is not
news :)

> For online usage, TIFFs are easily converted to PDF, either by

I recommend leaving as TIFFs, they are often faster and more
convenient to browse that way (and more compact, unless you compress
G4 before converting to PDF with a compression passthru).

> ...

My 2c.
--Toby

> 
> My conclusion is that massive scanning is a highly practical
> project, thanks largely to these inexpensive duplex scanners.
> I'm looking forward to turning this technology loose on my
> personal paper collection, too.  Feel free to e-mail me with
> questions.
> 
> Olin Sibert
> news8304 (at) siliconkeep.com

######

From: "Douglas H. Quebbeman" <DougQSPAMHATER@NOSPAM.iglou.com>
Newsgroups: alt.folklore.computers,alt.os.multics
References: <668.810T1381T6394818@kltpzyxm.invalid> <wKadnXY4ptJcQAvcRVn-3A@rcn.net> <d6ce4a6c.0411192150.1ec949@posting.google.com>
Subject: Re: Scanning old (Multics) manuals
Date: Sat, 20 Nov 2004 11:02:47 -0500
Lines: 15
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.2180
X-RFC2646: Format=Flowed; Original
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180
NNTP-Posting-Host: c005766.customers.cinergycom.net
X-Original-NNTP-Posting-Host: c005766.customers.cinergycom.net
Message-ID: <419f6aa7$1_1@news.iglou.com>
X-Trace: news.iglou.com 1100966567 c005766.customers.cinergycom.net (20 Nov 2004 11:02:47 -0500)
X-Authenticated-User: dougq
X-Original-NNTP-Posting-Host: 216.135.61.138
Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news.glorb.com!wns14feed!worldnet.att.net!198.6.0.86!ash.uu.net!news.iglou.com!not-for-mail
Xref: nightfall.franklin.ch alt.folklore.computers:190241

In news:d6ce4a6c.0411192150.1ec949@posting.google.com,
Toby Thain <toby@telegraphics.com.au> screib:
>
> It is a good choice. Note also that the libtiff tools (tiffcp) can
> compress with the G4 fax codec, which is the most compact for this
> application. I see you're already using tiffcp so maybe this is not
> news :)

I also prefer to use the G4 codec, but I've found many older programs
that otherwise handle TIF files don't handle those with G4 encoding. It's
not a deal-killer, but just something to bear in mind...

-dq