From: "Charlie Gibbs" Newsgroups: alt.folklore.computers Subject: Scanning old manuals Date: 10 Nov 04 10:39:27 -0800 Organization: http://newsguy.com Lines: 14 Message-ID: <668.810T1381T6394818@kltpzyxm.invalid> NNTP-Posting-Host: p-472.newsdawg.com X-Newsreader: THOR 2.5a (Amiga;TCP/IP) Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsmi-eu.news.garr.it!newsmi-us.news.garr.it!NewsITBone-GARR!hammer.uoregon.edu!logbridge.uoregon.edu!pln-w!spln!rex!extra.newsguy.com!newsp.newsguy.com!news2 Xref: nightfall.franklin.ch alt.folklore.computers:189767 Like many of us, I'm sure, I have a collection of old computer manuals that's taking up a lot of shelf space. In the name of home renovations (and marital bliss) I would be willing to let go of some of the newer (i.e. later than 1980) ones if I could scan them. Like most such manuals, they're 95% text (in a couple of typefaces and sizes) and line drawings. I presume the ideal solution is to use some sort of scanning/OCR software to turn them into PDF files. Is there readily available software (preferably for Linux) to do this? Have any of you embarked on such a project, and do you have any words of wisdom to share? 
-- cgibbs@surfnaked.ca (Charlie Gibbs) ###### NNTP-Posting-Date: Wed, 10 Nov 2004 16:20:25 -0600 Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Reply-To: Anne & Lynn Wheeler References: <668.810T1381T6394818@kltpzyxm.invalid> From: Anne & Lynn Wheeler Date: Wed, 10 Nov 2004 15:19:59 -0700 Message-ID: Organization: Wheeler&Wheeler User-Agent: Gnus/5.090024 (Oort Gnus v0.24) Emacs/21.3 (windows-nt) Cancel-Lock: sha1:dBjIrtsEeyseUJzP/W3OJ08BqJ4= MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Lines: 27 NNTP-Posting-Host: 67.176.12.113 X-Trace: sv3-gg4K3LDybG75fi9i/uoKnLFEaWrSuBxVGGi0SNSOCc3QKVeVYHQj7mmSJLmkY/bZ/ZF51izXtmNzTgW!Pof1zWIyymtM4AW5p+QZQIUgZVemiUJdwawZMnXIMIBQLKPprmfe6gKo X-Complaints-To: abuse@comcast.net X-DMCA-Complaints-To: dmca@comcast.net X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly X-Postfilter: 1.3.20 Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!news.glorb.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local1.nntp.dca.giganews.com!nntp.comcast.com!news.comcast.com.POSTED!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189770 "Charlie Gibbs" writes: > Like many of us, I'm sure, I have a collection of old computer > manuals that's taking up a lot of shelf space. In the name of home > renovations (and marital bliss) I would be willing to let go of some > of the newer (i.e. later than 1980) ones if I could scan them. Like > most such manuals, they're 95% text (in a couple of typefaces and > sizes) and line drawings. I presume the ideal solution is to use > some sort of scanning/OCR software to turn them into PDF files. Is > there readily available software (preferably for Linux) to do this? > Have any of you embarked on such a project, and do you have any > words of wisdom to share? 
i was just looking at asking the same question ... all sorts of odds and ends stuff from the 60s and 70s. however we just unearthed a bunch of old handwritten letters from the 40s ... that i would also be interested in scanning(?). when i looked at some of this stuff nearly 10 years ago ... it all seemed to be scaffolded off fax scanning, tiff format and ocr of tiff/fax softcopy (current scanners appear to have much higher resolution as well as color capability compared to the older fax oriented stuff). -- Anne & Lynn Wheeler | http://www.garlic.com/~lynn/ ###### From: "David Wade" Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: Wed, 10 Nov 2004 23:47:33 -0000 Lines: 33 Message-ID: <2vfnodF2l23bvU1@uni-berlin.de> References: <668.810T1381T6394818@kltpzyxm.invalid> X-Trace: news.uni-berlin.de FaY62ECG61DjNsIsFP9YBAMv1ephLyoPXjpKfL2oCRdZ4EB1Q= X-Priority: 3 X-MSMail-Priority: Normal X-Newsreader: Microsoft Outlook Express 6.00.2800.1437 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441 Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!fu-berlin.de!uni-berlin.de!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189774 "Charlie Gibbs" wrote in message news:668.810T1381T6394818@kltpzyxm.invalid... > Like many of us, I'm sure, I have a collection of old computer > manuals that's taking up a lot of shelf space. In the name of > home renovations (and marital bliss) I would be willing to let > go of some of the newer (i.e. later than 1980) ones if I could > scan them. Like most such manuals, they're 95% text (in a couple > of typefaces and sizes) and line drawings. I presume the ideal > solution is to use some sort of scanning/OCR software to turn them > into PDF files. Is there readily available software (preferably > for Linux) to do this? Have any of you embarked on such a project, > and do you have any words of wisdom to share? > Charlie, I have done a little bit of work on this. 
From speaking to Al at www.bitsavers.org I discovered that his manuals are basically 400 DPI MONO TIFF files formatted as PDF. I have tried extracting the images and OCRing them with ScanSoft OmniPage PRO. I know this is Windows software but the PRO version can be "taught" to recognise a pattern as a letter. I tried it on the CMS Commands manual from bitsavers, and it seems quite good. I think it took me a couple of hours (i.e. most of an evening) to get the manual into Word/HTML/PDF. Dave. > -- > cgibbs@surfnaked.ca (Charlie Gibbs) > ###### Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals References: <668.810T1381T6394818@kltpzyxm.invalid> From: Bernd Felsche User-Agent: nn/6.6.5 Date: Thu, 11 Nov 2004 09:37:04 +0800 Message-ID: <0bsa62xf8h.ln2@innovative.iinet.net.au> Lines: 40 NNTP-Posting-Host: 203.59.144.24 X-Trace: 1100137080 per-qv1-newsreader-01.iinet.net.au 6565 203.59.144.24 Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!HSNX.atgi.net!newsfeed.iinet.net.au!newsfeed.iinet.net.au!per-qv1-newsreader-01.iinet.net.au!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189775 "Charlie Gibbs" writes: >Like many of us, I'm sure, I have a collection of old computer >manuals that's taking up a lot of shelf space. In the name of >home renovations (and marital bliss) I would be willing to let >go of some of the newer (i.e. later than 1980) ones if I could >scan them. Like most such manuals, they're 95% text (in a couple >of typefaces and sizes) and line drawings. I presume the ideal >solution is to use some sort of scanning/OCR software to turn them >into PDF files. Is there readily available software (preferably >for Linux) to do this? Have any of you embarked on such a project, >and do you have any words of wisdom to share? Wisdom? Is that like learning from the mistakes of others? Three years ago I embarked on a similar project of scanning photographs.
Scanning "by hand" and cataloguing so that you are still able to find the relevant items is time-consuming. If one could scan 2 pages a minute by turning pages, etc., then a thousand-page tome would take a week of evenings. For loose-bound stuff, there are document scanners loitering in many a corner that'll process 30+ pages a minute. If you cannot find a company who officially makes that part of its business, then a box of chocolates/carton of beer may entice some owners of such facilities to allow you to use their scanner at an idle time. BTW: scan most text pages as "line-art" as that tends to reduce the "noise" in the image and aids compressibility. Subsequent OCR'ing can be used to generate "indexes". Store the original; OCR text can be unreliable and sometimes produces pure gibberish, even after extensive "training". -- /"\ Bernd Felsche - Innovative Reckoning, Perth, Western Australia \ / ASCII ribbon campaign | I'm a .signature virus! X against HTML mail | Copy me into your ~/.signature / \ and postings | to help me spread! ###### From: Howard Shubs Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: Wed, 10 Nov 2004 22:12:09 -0500 Organization: ='SEQUENTIAL' Lines: 27 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> NNTP-Posting-Host: p-585.newsdawg.com Mail-Copies-To: nobody User-Agent: MT-NewsWatcher/3.4 (PPC Mac OS X) X-Face: "S"r{U%bs].&Ud}Pc~~~0a]M:t5l>>EN\1Faw10M9NK1Xq59wo7-"s0S+[{etQorO /Nf-Ci"i9v'MT!R8)J]N[4|2&x1r^Iq&{SB"6dknr0=+6UFb.>+{zMn_1=rw&/V+"d@* ZS5\LoW_ Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news.moat.net!pln-e!spln!rex!extra.newsguy.com!newsp.newsguy.com!howard Xref: nightfall.franklin.ch alt.folklore.computers:189776 In article <668.810T1381T6394818@kltpzyxm.invalid>, "Charlie Gibbs" wrote: > Is there readily available software (preferably > for Linux) to do this? Have any of you embarked on such a project, > and do you have any words of wisdom to share? 
Yes, I've done it. I started in 1998 to work on two manuals for the IBM 1130. Some day, I might finish! I'm not converting into PDF but into HTML. See and for my efforts so far. I've been using MacOS and OmniPage Pro. It goes like this: 1) scan a page. 2) OCR the page 3) move the results into a text-based HTML editor called BBEdit. 4) clean up the text, proof it, possibly rebuild the graphics. 5) upload the new version of the page. It's tedious, but once it's done, no one else will have to do it again. -- "...crying in your drink is bad enough; crying into a hot fudge sundae is disgusting." - Jerry Farnsworth in _Job: A Comedy of Justice_ ###### From: toby@telegraphics.com.au (Toby Thain) Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: 10 Nov 2004 20:58:56 -0800 Organization: http://groups.google.com Lines: 11 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> NNTP-Posting-Host: 65.92.124.111 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Trace: posting.google.com 1100149137 1100 127.0.0.1 (11 Nov 2004 04:58:57 GMT) X-Complaints-To: groups-abuse@google.com NNTP-Posting-Date: Thu, 11 Nov 2004 04:58:57 +0000 (UTC) Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news2.euro.net!fr.ip.ndsoftware.net!proxad.net!216.239.36.134.MISMATCH!postnews.google.com!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189777 "Charlie Gibbs" wrote in message news:<668.810T1381T6394818@kltpzyxm.invalid>... > I presume the ideal > solution is to use some sort of scanning/OCR software to turn them > into PDF files. Is there readily available software (preferably > for Linux) to do this? Have any of you embarked on such a project, > and do you have any words of wisdom to share? My advice: Whether you OCR or not, *please* make sure the original scanned images remain available in your archive. 
--Toby ###### Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals From: Tris Orendorff References: <668.810T1381T6394818@kltpzyxm.invalid> Organization: Torensoft Consulting Message-ID: User-Agent: Xnews/06.08.25 X-Face: IshsEs\2&-CRNbQ'E]S8OU{A_:p:*[<45&]7:[QnpNNZHm''7(M+Do^,?tfk!JK@i~x(3PR"?ZH=bE!coJ$9J1_Gs""|xs20v&1g%;QcA@HE?UJ&|9-(g&gkBPz{ZKVI^Yp8(c)T{I:BDk{EN,b"~/VV7OL46b64c3fp-TO7?)CvvaFSqW*73$s*]j99a-#zry-{86LtBDLi2TB=ArfN^.;4a'$[}/!1Pe2o8*],I( wrote in news:668.810T1381T6394818@kltpzyxm.invalid: > Like many of us, I'm sure, I have a collection of old computer > manuals that's taking up a lot of shelf space. In the name of > home renovations (and marital bliss) I would be willing to let > go of some of the newer (i.e. later than 1980) ones if I could > scan them. Like most such manuals, they're 95% text (in a couple > of typefaces and sizes) and line drawings. I presume the ideal > solution is to use some sort of scanning/OCR software to turn them > into PDF files. Is there readily available software (preferably > for Linux) to do this? Have any of you embarked on such a project, > and do you have any words of wisdom to share? Perhaps Distributed Proofreaders could help. They are adding at least 250 books and magazines to Project Gutenberg every month. 
-- Sincerely, Tris Orendorff -----BEGIN GEEK CODE BLOCK----- Version: 3.12 GCS d++ s+:- a+ C+ UL++++ P+ L+ E- W+ N++ o- K++ w+ O+ M !V PS+ PE Y+ PGP t+ !5 X- R- tv--- b++ DI++ D+ G++ e++ h---- r+++ y+++ ------END GEEK CODE BLOCK------ ###### From: ace join_to ware@iinet.net.au (Tony Epton) Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: Fri, 12 Nov 2004 01:27:23 GMT Organization: Aceware Programming Pty Ltd Message-ID: <419410e1.9176015@news.m.iinet.net.au> References: <668.810T1381T6394818@kltpzyxm.invalid> X-Newsreader: Forte Free Agent 1.21/32.243 Lines: 26 NNTP-Posting-Host: 203.59.54.231 X-Trace: 1100222412 per-qv1-newsreader-01.iinet.net.au 6545 203.59.54.231 Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!newscore.univie.ac.at!newsfeed01.sul.t-online.de!newsfeed00.sul.t-online.de!t-online.de!npeer.de.kpn-eurorings.net!news.tele.dk!news.tele.dk!small.news.tele.dk!newsfeed1.swip.net!swipnet!News-Peer-Europe!news.stupi.se!news.net.hanse.com!nntp.gblx.net!nntp3.phx1!nntp1.phx1.gblx.net!nntp.gblx.net!nntp.gblx.net!nntp4.phx1!newsfeed.iinet.net.au!newsfeed.iinet.net.au!per-qv1-newsreader-01.iinet.net.au!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189801 On 10 Nov 04 10:39:27 -0800, "Charlie Gibbs" wrote: Has anyone come up with a good technique for shaving off the "perfect binding" spine in order to use an automatic document feeder? What's everyone's opinion - should we destroy a book in order to save it? Tony >Like many of us, I'm sure, I have a collection of old computer >manuals that's taking up a lot of shelf space. In the name of >home renovations (and marital bliss) I would be willing to let >go of some of the newer (i.e. later than 1980) ones if I could >scan them. Like most such manuals, they're 95% text (in a couple >of typefaces and sizes) and line drawings. I presume the ideal >solution is to use some sort of scanning/OCR software to turn them >into PDF files. 
Is there readily available software (preferably >for Linux) to do this? Have any of you embarked on such a project, >and do you have any words of wisdom to share? > >-- >cgibbs@surfnaked.ca (Charlie Gibbs) > ###### Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> Reply-To: jhaynes@alumni.uark.edu Organization: University of Arkansas Alumni X-Newsreader: trn 4.0-test72 (19 April 1999) From: haynes@alumni.uark.edu (Jim Haynes) Lines: 19 Message-ID: <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> Date: Fri, 12 Nov 2004 02:46:21 GMT NNTP-Posting-Host: 63.190.113.169 X-Complaints-To: abuse@earthlink.net X-Trace: newsread2.news.atl.earthlink.net 1100227581 63.190.113.169 (Thu, 11 Nov 2004 18:46:21 PST) NNTP-Posting-Date: Thu, 11 Nov 2004 18:46:21 PST Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news-fra1.dfn.de!news0.de.colt.net!colt.net!easynet-monga!easynet.net!news-out.cwix.com!newsfeed.cwix.com!border1.nntp.dca.giganews.com!border2.nntp.dca.giganews.com!nntp.giganews.com!elnk-atl-nf1!newsfeed.earthlink.net!stamper.news.atl.earthlink.net!newsread2.news.atl.earthlink.net.POSTED!c4a26796!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189803 I've done a lot of manual scanning of old tech manuals, under Linux. Can't say I have any wisdom to pass along, and it's tedious. I scan them in gray scale and store as jpeg files. When I tried line art there were too many dropouts. With gray scale I can use a color editor and clean up the dirty whites and gray blacks. Doing that adds to the tediousity of it all. I don't have software to convert them to pdf files, but it appears that converting them from jpeg to pdf makes them grow by a factor of five or so, or maybe that means I'm not doing it right. 
On one set of documents I went so far as to put html wrappers around them all, as a reader complained it was tedious to read them under Windows with whatever he was using for a jpeg viewer. But the rest of them I have not done that, as that also was an awful lot of work. You can see some of my stuff at massis.lcs.mit.edu/archives/technical/western-union-tech-review/ -- jhhaynes at earthlink dot net ###### From: Howard Shubs Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: Thu, 11 Nov 2004 23:18:33 -0500 Organization: ='SEQUENTIAL' Lines: 11 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> NNTP-Posting-Host: p-908.newsdawg.com Mail-Copies-To: nobody User-Agent: MT-NewsWatcher/3.4 (PPC Mac OS X) X-Face: "S"r{U%bs].&Ud}Pc~~~0a]M:t5l>>EN\1Faw10M9NK1Xq59wo7-"s0S+[{etQorO /Nf-Ci"i9v'MT!R8)J]N[4|2&x1r^Iq&{SB"6dknr0=+6UFb.>+{zMn_1=rw&/V+"d@* ZS5\LoW_ Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsmi-eu.news.garr.it!newsmi-us.news.garr.it!NewsITBone-GARR!hammer.uoregon.edu!logbridge.uoregon.edu!pln-w!spln!dex!extra.newsguy.com!newsp.newsguy.com!howard Xref: nightfall.franklin.ch alt.folklore.computers:189805 In article <419410e1.9176015@news.m.iinet.net.au>, ace join_to ware@iinet.net.au (Tony Epton) wrote: > What's everyones' opinion - should we destroy a book in order to save > it ? The manuals I've been working on have been ring-bound, so no problem. -- "...crying in your drink is bad enough; crying into a hot fudge sundae is disgusting." - Jerry Farnsworth in _Job: A Comedy of Justice_ ###### From: aek@spies.com (Al Kossow) Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: Thu, 11 Nov 2004 20:44:40 -0800 Organization: Apple Computer, Inc. 
Lines: 6 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> NNTP-Posting-Host: haxrus.apple.com X-Trace: news.apple.com 1100234680 21848 17.205.21.66 (12 Nov 2004 04:44:40 GMT) X-Complaints-To: abuse@news.apple.com NNTP-Posting-Date: Fri, 12 Nov 2004 04:44:40 +0000 (UTC) Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!news.glorb.com!newsfeed-east.nntpserver.com!nntpserver.com!newsfeed1.sea.pnap.net!newsfeed2.sea.pnap.net!newsfeed.pnap.net!forum.apple.com!news.apple.com!haxrus.apple.com!user Xref: nightfall.franklin.ch alt.folklore.computers:189808 In article <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>, jhaynes@alumni.uark.edu wrote: >I don't have software to convert them to pdf tumble.brouhaha.com ###### From: dnichols@d-and-d.com (DoN. Nichols) Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: 12 Nov 2004 01:08:48 -0500 Organization: D & D Data, Vienna, VA Lines: 37 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> Reply-To: dnichols@d-and-d.com (Donald Nichols) NNTP-Posting-Host: fuego.d-and-d.com X-Trace: ceilidh.d-and-d.com 1100239728 14864 10.0.0.26 (12 Nov 2004 06:08:48 GMT) X-Complaints-To: abuse@d-and-d.com NNTP-Posting-Date: 12 Nov 2004 06:08:48 GMT Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news.tele.dk!news.tele.dk!small.news.tele.dk!newspeer.monmouth.com!chiapp19.algx.com!dca1-feed1.news.algx.net!allegiance!ceilidh.d-and-d.com!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189809 In article <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net>, Jim Haynes wrote: >I've done a lot of manual scanning of old tech manuals, under Linux. 
>Can't say I have any wisdom to pass along, and it's tedious. I scan >them in gray scale and store as jpeg files. When I tried line art there Hmm ... jpeg is a potential problem, as it tends to corrupt information close to intersections of lines -- especially if the lines form acute angle. My preference for scanning B&W documents is TIFF format, good compression, with *no* loss of detail. And with a scanner which allows you to shift the crossover point from black to white, you can usually get very good scans into pure B&W. I use an old HP ScanJet (with its driver from HP in Windows) to do the scan, and then move to a unix box for the rest. The TIFF libs include a program to convert TIFF to PS, and GhostScript includes one to convert PS to PDF, so I scan the pages with sequential names (which will sort properly on a unix filesystem -- that is leading zeros in chapter and page numbers), and then use a script which will convert it all to a single PDF file. What I tend to scan is manuals for no-longer manufactured machine tools, but the principle is the same (with perhaps greater need for the fine detail in the exploded drawings so you know what the part number is, and how it goes back with the other parts). Some of these days I will get around to trying some of the unix drivers for scanners to see how well they work. Enjoy, DoN. -- Email: | Voice (all times): (703) 938-4564 (too) near Washington D.C. | http://www.d-and-d.com/dnichols/DoN.html --- Black Holes are where God is dividing by zero --- ###### From: Squidster Subject: Re: Scanning old manuals User-Agent: Pan/0.14.2 (This is not a psychotic episode. It's a cleansing moment of clarity.) 
Message-ID: Newsgroups: alt.folklore.computers References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Lines: 22 Date: Fri, 12 Nov 2004 08:53:07 GMT NNTP-Posting-Host: 16.151.176.254 X-Complaints-To: abuse@HP.com X-Trace: news.cpqcorp.net 1100249587 16.151.176.254 (Fri, 12 Nov 2004 00:53:07 PST) NNTP-Posting-Date: Fri, 12 Nov 2004 00:53:07 PST Organization: Hewlett-Packard Company Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!newsfeed.wirehub.nl!newsfeed.icl.net!newsfeed.frii.net!newsfeed.frii.net!news.compaq.com!news.cpqcorp.net!53ab2750!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189812 I've done a lot of tiff to PDF conversion on Linux. You should NEVER use jpegs on old and faded docs, much too lossy. The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs! With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a combined pdf. It is great! Software is reliable, fast and robust. Both utilities are available in libtiff-3.6.1-2 or later. Earlier versions do not have the critical tiff2pdf. HTH. On Fri, 12 Nov 2004 02:46:21 +0000, Jim Haynes wrote: > I've done a lot of manual scanning of old tech manuals, under Linux. > Can't say I have any wisdom to pass along, and it's tedious. I scan > them in gray scale and store as jpeg files. When I tried line art there > were too many dropouts. With gray scale I can use a color editor and > clean up the dirty whites and gray blacks. Doing that adds to the > tediousity of it all. 
I don't have software to convert them to pdf > files, but it appears that converting them from jpeg to pdf makes them ###### From: mwojcik@newsguy.com (Michael Wojcik) Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: 12 Nov 2004 16:47:36 GMT Organization: Micro Focus International Ltd Lines: 56 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> Reply-To: mwojcik@newsguy.com NNTP-Posting-Host: p-116.newsdawg.com X-Newsreader: xrn 9.00 Originator: mww@sen Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news.maxwell.syr.edu!logbridge.uoregon.edu!pln-w!spln!dex!extra.newsguy.com!newsp.newsguy.com!mww Xref: nightfall.franklin.ch alt.folklore.computers:189838 In article <419410e1.9176015@news.m.iinet.net.au>, ace join_to ware@iinet.net.au (Tony Epton) writes: > > Has anyone come up with a good technique for shaving of the "perfect > binding" spine in order to use an automatic document feeeder. ISTM that I've read at least one report of good results from clamping the book in a home-made jig that left the binding exposed, then running it through a table saw with a finish-quality thin-kerf blade. (By the way, a good technique for shaving down your Usenet posts is to avoid top-posting and snip quoted text you're not referring to directly.) > What's everyones' opinion - should we destroy a book in order to save > it ? Mine is that, in the general case, the most important features to preserve are, in order:
- The main text.
- Illustrations directly relevant to the main text.
- Authoring and editing information.
- Peripheral text (introductions and the like).
- Formatting of the text, including page numbers and so forth (which are important for locating citations, and for textual studies, such as determining edition).
- Indicia, such as copyright information (also important for textual studies, as well as questions of intellectual property).
- Images of the actual pages.
- Description of the cover, spine, binding, etc (for textual studies).
- Images of the cover, spine, binding, etc.
- The actual physical book.
In short, ripping the spine to scan the book can be justified in most circumstances. For books which are extremely rare or of special interest as physical artifacts, it's preferable to use some other method, such as carefully turning the pages and photographing them. It's quite common, in textual studies, to create a catalog of descriptions of the editions and imprints of a book; they usually suffice for most of the purposes that you'd want an actual copy for, except of course for the psychological one of owning the thing itself. (Walter Benjamin famously theorized that there's an essential difference, what he called "aura", between an original work of art and "mechanical reproductions" of it. Of course, books themselves are nearly always mechanical reproductions, but they're a step closer to the original than a scanned version. Personally, I've never found his argument persuasive - Benjamin is far too much the mystic for my tastes - but some people do feel that way.) -- Michael Wojcik michael.wojcik@microfocus.com Any average educated person can turn out competent verse. -- W. H. 
Auden ###### From: Roland Hutchinson Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: Fri, 12 Nov 2004 14:14:05 -0500 Lines: 21 Message-ID: <2vkgbuF2mu9d6U1@uni-berlin.de> References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8Bit X-Trace: news.uni-berlin.de 4EBC5diIxh3MKfwCJtA92wggIvtXfD9dNCxj61In7bBvJ6psk= User-Agent: KNode/0.7.7 Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!news-fra1.dfn.de!fu-berlin.de!uni-berlin.de!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189843 Michael Wojcik wrote: > It's quite common, in textual studies, to create a catalog of > descriptions of the editions and imprints of a book; they usually > suffice for most of the purposes that you'd want an actual copy for, > except of course for the psychological one of owning the thing itself. And then every so often, something comes around that you absolutely, positively, have to have the original for to solve a bibliographic or textual question, and historians end up doing something like bombarding the ink in the marginalia of J. S. Bach's personal copy of the Bible with subatomic particles and watching the pretty x-rays that result to get a handle on what inks were used. (Well, it makes a change from tossing CDs in the microwave oven, anyway.) -- Roland Hutchinson Will play viola da gamba for food. NB mail to my.spamtrap [at] verizon.net is heavily filtered to remove spam. If your message looks like spam I may not see it. 
###### From: toby@telegraphics.com.au (Toby Thain) Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: 12 Nov 2004 11:45:19 -0800 Organization: http://groups.google.com Lines: 56 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> NNTP-Posting-Host: 65.92.124.111 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Trace: posting.google.com 1100288719 10379 127.0.0.1 (12 Nov 2004 19:45:19 GMT) X-Complaints-To: groups-abuse@google.com NNTP-Posting-Date: Fri, 12 Nov 2004 19:45:19 +0000 (UTC) Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!tiscali!newsfeed1.ip.tiscali.net!proxad.net!216.239.36.134.MISMATCH!postnews.google.com!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189845 Squidster wrote in message news:... > I've done a lot of tiff to PDF conversion on Linux. > You should NEVER use jpegs on old and faded docs, much too lossy. > > The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs! > With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a > combined pdf. It is great! Software is reliable, fast and robust. Not only that, but you can ask tiffcp to compress Fax/G4, which is very compact. But first make sure you haven't fallen into the common pitfall of a high res dithered bitmap on a grey-level scan. I've seen dozens of DEC documents posted (mainly PDFs) that are unnecessarily large and unclear due to this. (I've been fixing these and recompressing for improved clarity, browsing speed and *huge* reductions, say 5-10:1 or more, in file size.) > > Both utilities are available libtiff-3.6.1-2 or later. Earlier versions do > not have the critical tiff2pdf. My cleanup recipe is approximately:
1. pdfimages to extract to PBM (from xpdf)
2. Photoshop action to clean up each page (uses my Netpbm file formats, http://www.telegraphics.com.au/sw/#netpbmformats ). The action typically
   2a. converts to Grey Scale,
   2b. reduces resolution (on the 600dpi scans I've typically encountered with the problem, this does not lose any significant info),
   2c. thresholds, and
   2d. resaves as bitmap TIFF.
3. recompress using tiffcp -c g4
4. optionally reconvert to PDF (it's much faster to browse TIFFs, so I usually leave them as TIFF nowadays).
The other faux pas I've seen is scanning 600dpi colour cover pages and including them in the PDF. Removing these, or converting to some sane representation, typically halves the size of the PDF right there. --Toby > > HTH. > > On Fri, 12 Nov 2004 02:46:21 +0000, Jim Haynes wrote: > > > I've done a lot of manual scanning of old tech manuals, under Linux. > > Can't say I have any wisdom to pass along, and it's tedious. I scan > > them in gray scale and store as jpeg files. When I tried line art there > > were too many dropouts. With gray scale I can use a color editor and > > clean up the dirty whites and gray blacks. Doing that adds to the > > tediousity of it all. 
I don't have software to convert them to pdf > > files, but it appears that converting them from jpeg to pdf makes them > ###### NNTP-Posting-Date: Sat, 13 Nov 2004 22:23:29 -0600 Date: Sat, 13 Nov 2004 23:23:27 -0500 From: Olin Sibert User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113 X-Accept-Language: en-us, en MIME-Version: 1.0 Newsgroups: alt.folklore.computers,alt.os.multics Subject: Re: Scanning old (Multics) manuals References: <668.810T1381T6394818@kltpzyxm.invalid> In-Reply-To: <668.810T1381T6394818@kltpzyxm.invalid> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Message-ID: Lines: 117 NNTP-Posting-Host: 207.172.220.126 X-Trace: sv3-zVfdm0SZM8vCJLACi8mGfkDV0q0vCzm0vQdL8X+qSKTqdzdTvtAjuT1ZusUKLDVVxvFdmf9rgVjf9AZ!Q29rjMsbMzZEDoWygUdgfQr8dwNihx+Y0Hb5dJleBCE6Krk59XCYmEVE/5RgfQzHO6s= X-Complaints-To: abuse@rcn.net X-DMCA-Complaints-To: abuse@rcn.net X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly X-Postfilter: 1.3.20 Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news.glorb.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local1.nntp.dca.giganews.com!nntp.rcn.net!news.rcn.net.POSTED!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189889 Charlie Gibbs wrote: > Like many of us, I'm sure, I have a collection of old computer > manuals that's taking up a lot of shelf space. In the name of > home renovations (and marital bliss) I would be willing to let > go of some of the newer (i.e. later than 1980) ones if I could > scan them. Like most such manuals, they're 95% text (in a couple > of typefaces and sizes) and line drawings. I presume the ideal > solution is to use some sort of scanning/OCR software to turn them > into PDF files. Is there readily available software (preferably > for Linux) to do this? 
Have any of you embarked on such a project, > and do you have any words of wisdom to share? > > -- > cgibbs@surfnaked.ca (Charlie Gibbs) Recently, I've been working on the Multics History Project, an effort to scan a large collection of manuals, technical memos, etc. The primary goal is to preserve the information, so I've been concentrating on making TIFF images, but eventually I hope to OCR at least the more interesting ones and perhaps convert them to HTML. This is a summary of my experience so far. I'm using several small scanners: Fujitsu ScanSnap fi-5110EOX, Xerox Documate 252, Canon DR-2080C. These little machines are great: they cost between $400 and $850, do automatic double-sided scanning, and manage 6-12 sheets/minute (or 12-24 page images/minute) at an archival-quality resolution of 600 DPI. The paper paths are very simple (straight through) and they are slow enough that a paper jam isn't a catastrophe. I've only had a single serious page misfeed so far (out of 10,000 or so). None of those scanners has emerged as ideal for all purposes: the ScanSnap is by far the least expensive, and easy to use, but slower. The Documate 252 is fast, but its "feature" of discarding all its work when it encounters an error is mighty annoying. The Canon is really slow at 600 DPI (only 3.5 pages/minute). It does 7 ppm at 400 DPI, and a respectable 10 at 300, but I've also had considerable trouble (multiple feeds) from its auto-feeder. The bundled scanner GUI software tends to be very rich and "helpful" for scanning individual pages, but not so good at keeping an organized catalog. Instead, I ginned up a little application that watches for scanned files to appear in a pool directory, prompts for metadata, and creates a catalog entry in a small database. This sort of thing is critical: entering catalog data has been the most time-consuming part of the project, even though I've put a lot of effort into simplifying that process.
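[Editorial sketch, not part of the original post.] The pool-directory application described above is not shown in the thread. As a rough illustration of the idea only, and assuming Python with SQLite (neither is named by the poster; the table layout, the function names, and the `ask` callback are all hypothetical), one polling pass might look like this:

```python
import sqlite3
from pathlib import Path


def open_catalog(db_path):
    """Open (or create) the catalog database. The schema is hypothetical."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        " filename TEXT PRIMARY KEY, title TEXT, pages INTEGER)"
    )
    return db


def new_scans(pool_dir, db, suffixes=(".tif", ".tiff", ".pdf")):
    """Return scan files in the pool directory that have no catalog entry yet."""
    known = {row[0] for row in db.execute("SELECT filename FROM catalog")}
    return sorted(
        p for p in Path(pool_dir).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes and p.name not in known
    )


def catalog_pass(pool_dir, db, ask):
    """One polling pass: ask for metadata on each new file, then record it.

    `ask` is any callable that takes a Path and returns (title, pages);
    an interactive version could wrap input().
    """
    added = 0
    for path in new_scans(pool_dir, db):
        title, pages = ask(path)
        db.execute(
            "INSERT INTO catalog (filename, title, pages) VALUES (?, ?, ?)",
            (path.name, title, pages),
        )
        added += 1
    db.commit()
    return added
```

Calling `catalog_pass` in a loop with a short sleep between passes would give the "watching" behaviour; keeping the prompt behind a callback makes it easy to swap a console prompt for a GUI form later.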
I've found that I can keep up--barely--with the ScanSnap while scanning technical memos (3 to 30 pages each). Entering metadata and handling the physical documents are both time-consuming. On the other hand, when scanning big items like manuals, it's practical for one person to keep two scanners busy. A big--even if temporary--workspace is very helpful, because one needs lots of room to keep stacks of work in progress in order to keep things running at full speed. I keep the files as multi-page TIFFs, one file per document, because that's what the OCR software seems to expect, and because it's a well-documented, lossless, and relatively simple format. For online usage, TIFFs are easily converted to PDF, either by tiff2pdf or by Adobe Acrobat. I'm looking forward to trying Tumble for semi-automated creation of PDF bookmarks (http://tumble.brouhaha.com/). The ScanSnap scanner only produces PDFs, so I have to convert those first with PDF2TFF. Because manuals tend to be large, and thus need to be scanned in chunks, the application uses tiffcp to combine multiple scan files. It is possible to keep feeding these scanners as they run to scan more than the 50-page capacity of the ADF in a single pass, but at least for the Xerox, that's inadvisable: when it encounters an error, instead of writing out what it's scanned so far, it discards everything. The documents have been a mix of 3-hole looseleaf and stapled. Even after removing the staples (with wire clippers and pliers, NOT staple removers), the pages sometimes like to stick together at the staple holes, so they must be riffled thoroughly before scanning. The same is true, to a lesser extent, of the looseleaf papers: if they were printed, then punched, the pages can stick at the holes. This kind of sticking can result in missed pages and misfeeds. Binder and staple holes also interfere with output stacking: I've had to cobble together some little cardboard kludges to make sure pages end up neatly stacked.
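[Editorial sketch, not part of the original post.] The chunk-merging and PDF steps mentioned above map onto two libtiff command-line tools. The snippet below only builds the argument lists, so it can be inspected without libtiff installed. The file names are made up, and the `-c g4` option is the Group 4 fax compression recommended elsewhere in this thread; it applies only to bilevel (1-bit) scans, so pass a different scheme for grey-scale pages.

```python
import shlex


def merge_command(chunks, merged, compression="g4"):
    """argv for tiffcp: concatenate chunk files into one multi-page TIFF."""
    if not chunks:
        raise ValueError("need at least one input file")
    return ["tiffcp", "-c", compression, *chunks, merged]


def pdf_command(tiff, pdf):
    """argv for tiff2pdf: wrap a TIFF in a PDF for online viewing."""
    return ["tiff2pdf", "-o", pdf, tiff]


if __name__ == "__main__":
    # Hypothetical file names, printed as copy-and-paste shell commands.
    for argv in (merge_command(["part1.tif", "part2.tif"], "manual.tif"),
                 pdf_command("manual.tif", "manual.pdf")):
        print(shlex.join(argv))
```

To actually run a step, hand the list to `subprocess.run(argv, check=True)`. Building argv lists rather than shell strings avoids quoting problems with file names that contain spaces.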
For OCR, I've had reasonable success with OmniPage Pro 12, although I haven't tried many documents yet: it's clearly a labor-intensive process. It works best on crisply-printed stuff, but it's not bad even for pale photocopies. One thing I haven't figured out is a good way to deal with headers and footers: OmniPage normally treats them as part of the text, and can be told to ignore them entirely, but I haven't found any automated way to extract them so that they are preserved but aren't part of the normal text flow. Acrobat's "Paper Capture" OCR feature seems pretty slick, because it generates searchable text from a PDF and maintains correspondence between the scanned words and their locations on the page. However, it's not as flexible as OmniPage for getting the text back out or for adjusting scanning parameters. I haven't experimented much with Acrobat OCR yet. A last suggestion is that if you have children, enlist them: feeding the scanners can be a fun game, and it leaves you free to enter the metadata and talk to them about the history. My conclusion is that massive scanning is a highly practical project, thanks largely to these inexpensive duplex scanners. I'm looking forward to turning this technology loose on my personal paper collection, too. Feel free to e-mail me with questions.
Olin Sibert news8304 (at) siliconkeep.com ###### Message-ID: <41953a34$0$27543$db0fefd9@news.zen.co.uk> From: Rupert Pigott Subject: Re: Scanning old manuals Newsgroups: alt.folklore.computers Reply-To: darkboong@try-removing-hotmail-this.com Date: Fri, 12 Nov 2004 22:33:22 +0000 References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> User-Agent: KNode/0.7.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7Bit Lines: 32 Organization: Zen Internet NNTP-Posting-Host: 82.69.129.217 X-Trace: 1100298804 cockburn.zen.co.uk 27543 82.69.129.217:38540 X-Complaints-To: abuse@zen.co.uk Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news2.euro.net!216.196.110.149.MISMATCH!border2.nntp.ams.giganews.com!nntp.giganews.com!news-hub.cableinet.net!blueyonder!easynet-monga!easynet.net!news.clara.net!wagner.news.clara.net!hamilton.zen.co.uk!zen.net.uk!fuller.zen.co.uk!cockburn.zen.co.uk.POSTED!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189847 Toby Thain wrote: > Squidster wrote in message > news:... >> I've done a lot of tiff to PDF conversion on Linux. >> You should NEVER use jpegs on old and faded docs, much too lossy. >> >> The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs! >> With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a >> combined pdf. It is great! Software is reliable, fast and robust. > > Not only that, but you can ask tiffcp to compress Fax/G4, which is > very compact. Someone just walked over my grave... :) When I worked on newspaper bitmap bashing code (100Mbyte->1Gbyte bitmaps in the mid-late 90s), I *ALWAYS* used G3 or G4 compression. I found that the (minimal) CPU time expended on compressing and decompressing it was more than made up for the time saved doing I/O. 
This was in the days of the Pentium Pro 200 when CPU cycles were considerably more precious than they are now. ;) Kinda nice seeing the same pearl of wisdom being applied to another practical problem. Kinda weird seeing the same library being used too. ;) Cheers, Rupert -- Threading sequential code through the eye of a parallel needle makes little sense. ;) ###### Sender: gregm@europa.pienet Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> <41953a34$0$27543$db0fefd9@news.zen.co.uk> From: Greg Menke Message-ID: Lines: 41 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: 13 Nov 2004 08:38:20 -0500 NNTP-Posting-Host: 66.159.78.167 X-Complaints-To: abuse@toad.net X-Trace: news.abs.net 1100353100 66.159.78.167 (Sat, 13 Nov 2004 08:38:20 EST) NNTP-Posting-Date: Sat, 13 Nov 2004 08:38:20 EST Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!newscore.univie.ac.at!news-fra1.dfn.de!news-lei1.dfn.de!newsfeed.freenet.de!216.196.110.149.MISMATCH!border2.nntp.ams.giganews.com!border1.nntp.ams.giganews.com!nntp.giganews.com!news.cs.univ-paris8.fr!news.glorb.com!newsread.com!news-xfer.newsread.com!nntp.abs.net!news.abs.net!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189857 Rupert Pigott writes: > Toby Thain wrote: > > > Squidster wrote in message > > news:... > >> I've done a lot of tiff to PDF conversion on Linux. > >> You should NEVER use jpegs on old and faded docs, much too lossy. > >> > >> The solution is to use "tiff2pdf" which DIRECTLY converts tiff to PDFs! > >> With "tiffcp" you can combine multiple tiffs and use tiff2pdf to get a > >> combined pdf. It is great! Software is reliable, fast and robust. > > > > Not only that, but you can ask tiffcp to compress Fax/G4, which is > > very compact. 
> > Someone just walked over my grave... :) > > When I worked on newspaper bitmap bashing code (100Mbyte->1Gbyte bitmaps > in the mid-late 90s), I *ALWAYS* used G3 or G4 compression. I found that > the (minimal) CPU time expended on compressing and decompressing it was > more than made up for the time saved doing I/O. This was in the days of > the Pentium Pro 200 when CPU cycles were considerably more precious than > they are now. ;) > > Kinda nice seeing the same pearl of wisdom being applied to another > practical problem. Kinda weird seeing the same library being used too. ;) The inflection point of the trade-off varies. I ported the gzip library into a bootloader on a 12 MHz R3000 so we could gzip the firmware loads which come in over a serial port. The firmware load format is S3 records: 7-bit ASCII text descriptions of the hex data to be loaded. Because the target CPU was on the slow side, the greatest throughput benefit came with minimum gzip compression. Using stronger compression cost more CPU time to decompress than it saved in download time. In general, we got a six-fold improvement over plain-text uploads. Gregm ###### From: greymaus@yahoo.com Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: 13 Nov 2004 14:26:17 GMT Lines: 23 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> X-Trace: news.uni-berlin.de 5JUjUsdcTf/tuYZM03CeOwadIVN9yCRIA1ARJU6trmqAmLkrtf User-Agent: slrn/0.9.8.0 (Linux) Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!fu-berlin.de!uni-berlin.de!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189863 On 12 Nov 2004 16:47:36 GMT, Michael Wojcik wrote: > In short, ripping the spine to scan the book can be justified in > most circumstances.
For books which are extremely rare or of > special interest as physical artifacts, it's preferable to use > some other method, such as carefully turning the pages and > photographing them. I have some books from the 50's; most are in pulp format, but they stood there for most of the intervening years, sometimes being removed to check on things. I gave my nephew one of them when he started to go to horse races with his father, "Education of a Poker Player", by Herbert Yardley, which I consider to be THE gambling book, even in some respects THE life book, even if there is little in it about horse racing. I got it back falling apart; the pages were disintegrating. Just too old for its crappy paper. I am thinking of getting a scanner and scanning two or three of the rest. (One of the others was a book about careers in computing; I looked just now and it is gone. If someone is consulting it now, they may be puzzled about the section on cards). -- greymaus 97.025% of statistics are wrong ###### Message-ID: <41961a91$0$27542$db0fefd9@news.zen.co.uk> From: Rupert Pigott Subject: Re: Scanning old manuals Newsgroups: alt.folklore.computers Reply-To: darkboong@try-removing-hotmail-this.com Date: Sat, 13 Nov 2004 14:30:27 +0000 References: <668.810T1381T6394818@kltpzyxm.invalid> <419410e1.9176015@news.m.iinet.net.au> <1uVkd.8421$_J2.5833@newsread2.news.atl.earthlink.net> <41953a34$0$27543$db0fefd9@news.zen.co.uk> User-Agent: KNode/0.7.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7Bit Lines: 64 Organization: Zen Internet NNTP-Posting-Host: 82.69.129.217 X-Trace: 1100356241 cockburn.zen.co.uk 27542 82.69.129.217:20458 X-Complaints-To: abuse@zen.co.uk Path:
nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news2.euro.net!216.196.110.149.MISMATCH!border2.nntp.ams.giganews.com!border1.nntp.ams.giganews.com!nntp.giganews.com!dedekind.zen.co.uk!zen.net.uk!fuller.zen.co.uk!cockburn.zen.co.uk.POSTED!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189864 Greg Menke wrote: > Rupert Pigott writes: > >> Toby Thain wrote: >> >> > Squidster wrote in message >> > news:... >> >> I've done a lot of tiff to PDF conversion on Linux. >> >> You should NEVER use jpegs on old and faded docs, much too lossy. >> >> >> >> The solution is to use "tiff2pdf" which DIRECTLY converts tiff to >> >> PDFs! With "tiffcp" you can combine multiple tiffs and use tiff2pdf to >> >> get a combined pdf. It is great! Software is reliable, fast and >> >> robust. >> > >> > Not only that, but you can ask tiffcp to compress Fax/G4, which is >> > very compact. >> >> Someone just walked over my grave... :) >> >> When I worked on newspaper bitmap bashing code (100Mbyte->1Gbyte bitmaps >> in the mid-late 90s), I *ALWAYS* used G3 or G4 compression. I found that >> the (minimal) CPU time expended on compressing and decompressing it was >> more than made up for the time saved doing I/O. This was in the days of >> the Pentium Pro 200 when CPU cycles were considerably more precious than >> they are now. ;) >> >> Kinda nice seeing the same pearl of wisdom being applied to another >> practical problem. Kinda weird seeing the same library being used too. ;) > > The inflection point of the trade off varies, I ported the gzip > library into a bootloader on a 12mhz R3000 bootloader so we could gzip > the firmware loads which come in over a serial port. The firmware > load format is S3 records; 7 bit ascii text descriptions of the hex > data to be loaded. > > Because the target cpu was on the slow side, the greatest throughput > benefit came with minimum gzip compression. 
Using stronger > compression cost more cpu-time to decompress than the savings in > download time. In general, we got a six-fold improvement over > plain-text uploads. I got roughly 6-10x increases in I/O throughput, seems to be the sweet spot on general purpose compression. There were some beneficial side effects for my apps under NT though. NT's whacky file caching tried to evict your program's working set onto disk and cram the (huge) *strictly* sequentially read file into memory. Reducing the size of the file NT was trying to store in RAM drastically reduced VM thrashing. That seemed so painfully brain dead that I swore blind it was something stupid my code was doing, but I observed the same behaviour in every NT app we had at the time. It was just the extreme size of the files that brought the insanity to light. I tried all kinds of tricks... Opening & closing files, memory mapped (supposedly) unbuffered I/O, nothing worked. Hopefully they've fixed that insanity by now. :) Cheers, Rupert -- Threading sequential code through the eye of a parallel needle makes little sense. ;) ###### From: Jorgen Grahn Newsgroups: alt.folklore.computers Subject: Re: Scanning old manuals Date: 14 Nov 2004 16:44:35 GMT Organization: Glocalnet AB Lines: 31 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> NNTP-Posting-Host: 213.163.130.197 X-Trace: yggdrasil.glocalnet.net 1100450675 9616 213.163.130.197 (14 Nov 2004 16:44:35 GMT) X-Complaints-To: abuse@glocalnet.net NNTP-Posting-Date: 14 Nov 2004 16:44:35 GMT User-Agent: slrn/0.9.7.4 (Linux) Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!newscore.univie.ac.at!newsfeed.sunet.se!news01.sunet.se!news2.glocalnet.net!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:189908 On 10 Nov 04 10:39:27 -0800, Charlie Gibbs wrote: ... > of typefaces and sizes) and line drawings. I presume the ideal > solution is to use some sort of scanning/OCR software to turn them > into PDF files. 
Is there readily available software (preferably > for Linux) to do this? Have any of you embarked on such a project, > and do you have any words of wisdom to share? For OCR on Linux ... let's see, the guys who restored the "Unix Text Processing" book from paper mentioned some software. http://www.oreilly.com/openbook/utp/ http://home.alltel.net/kollar/utp/ Yes, it's in the addendum to the preface of the published text: Another essential element in the UTP Revival was the emergence of free OCR software. While not quite up to the commercial offerings, programs like gocr gave more people the opportunity to contribute. http://jocr.sourceforge.net/ (And all that is something you probably have googled your way to already, but the above shows that gocr isn't just another unusable toy SourceForge project ...) /Jorgen -- // Jorgen Grahn R'lyeh wgah'nagl fhtagn! ###### From: toby@telegraphics.com.au (Toby Thain) Newsgroups: alt.folklore.computers,alt.os.multics Subject: Re: Scanning old (Multics) manuals Date: 19 Nov 2004 21:50:23 -0800 Organization: http://groups.google.com Lines: 52 Message-ID: References: <668.810T1381T6394818@kltpzyxm.invalid> NNTP-Posting-Host: 65.92.124.111 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Trace: posting.google.com 1100929824 15754 127.0.0.1 (20 Nov 2004 05:50:24 GMT) X-Complaints-To: groups-abuse@google.com NNTP-Posting-Date: Sat, 20 Nov 2004 05:50:24 +0000 (UTC) Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!irazu.switch.ch!switch.ch!newsfeed00.sul.t-online.de!t-online.de!news.glorb.com!postnews.google.com!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:190233 Olin Sibert wrote in message news:... > Charlie Gibbs wrote: > > > Like many of us, I'm sure, I have a collection of old computer > > manuals ...Have any of you embarked on such a project, > > and do you have any words of wisdom to share? 
> > > > -- > > cgibbs@surfnaked.ca (Charlie Gibbs) > > Recently, I've been working on the Multics History Project, an > effort to scan a large collection of manuals, technical memos, > etc. The primary goal is to preserve the information, so I've > been concentrating on making TIFF images, but eventually I hope > to OCR at least the more interesting ones and perhaps convert > them to HTML. It is very important that the bitmaps are also preserved and accessible, whatever subsequent postprocessing is done. > > This is a summary of my experience so far.... > > I keep the files as multi-page TIFFs, one file per document, > because that's what the OCR software seems to expect, and because > it's a well-documented,lossless, and relatively simple format. It is a good choice. Note also that the libtiff tools (tiffcp) can compress with the G4 fax codec, which is the most compact for this application. I see you're already using tiffcp so maybe this is not news :) > For online usage, TIFFs are easily converted to PDF, either by I recommend leaving as TIFFs, they are often faster and more convenient to browse that way (and more compact, unless you compress G4 before converting to PDF with a compression passthru). > ... My 2c. --Toby > > My conclusion is that massive scanning is a highly practical > project, thanks largely to these inexpensive duplex scanners. > I'm looking forward to turning this technology loose on my > personal paper collection, too. Feel free to e-mail me with > questions. > > Olin Sibert > news8304 (at) siliconkeep.com ###### From: "Douglas H. 
Quebbeman" Newsgroups: alt.folklore.computers,alt.os.multics References: <668.810T1381T6394818@kltpzyxm.invalid> Subject: Re: Scanning old (Multics) manuals Date: Sat, 20 Nov 2004 11:02:47 -0500 Lines: 15 X-Priority: 3 X-MSMail-Priority: Normal X-Newsreader: Microsoft Outlook Express 6.00.2900.2180 X-RFC2646: Format=Flowed; Original X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180 NNTP-Posting-Host: c005766.customers.cinergycom.net X-Original-NNTP-Posting-Host: c005766.customers.cinergycom.net Message-ID: <419f6aa7$1_1@news.iglou.com> X-Trace: news.iglou.com 1100966567 c005766.customers.cinergycom.net (20 Nov 2004 11:02:47 -0500) X-Authenticated-User: dougq X-Original-NNTP-Posting-Host: 216.135.61.138 Path: nightfall.franklin.ch!pfaff2.ethz.ch!news-zh.switch.ch!switch.ch!solnet.ch!solnet.ch!news.glorb.com!wns14feed!worldnet.att.net!198.6.0.86!ash.uu.net!news.iglou.com!not-for-mail Xref: nightfall.franklin.ch alt.folklore.computers:190241 In news:d6ce4a6c.0411192150.1ec949@posting.google.com, Toby Thain screib: > > It is a good choice. Note also that the libtiff tools (tiffcp) can > compress with the G4 fax codec, which is the most compact for this > application. I see you're already using tiffcp so maybe this is not > news :) I also prefer to use the G4 codec, but I've found that many older programs that otherwise handle TIF files don't handle those with G4 encoding. It's not a deal-killer, but just something to bear in mind... -dq
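[Editorial sketch, not part of the original post.] Since some older programs cannot read G4-encoded files, it can help to check how a TIFF is compressed before handing it to one of them. The function below is a minimal illustration, not a full TIFF parser: it reads the Compression tag (259) from the first image directory of a classic TIFF, where the value 4 means CCITT Group 4. The function name and the name table are made up for this example, and BigTIFF and later pages are ignored.

```python
import struct

# Common values of the TIFF Compression tag (tag 259).
COMPRESSION_NAMES = {1: "none", 2: "CCITT RLE", 3: "CCITT G3", 4: "CCITT G4",
                     5: "LZW", 7: "JPEG", 32773: "PackBits"}


def first_page_compression(data):
    """Return the Compression tag value of the first IFD in TIFF bytes."""
    if data[:2] == b"II":
        end = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        end = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(end + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack(end + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i   # each IFD entry is 12 bytes
        entry = data[start:start + 12]
        tag, _field_type, _n = struct.unpack(end + "HHI", entry[:8])
        if tag == 259:
            # Compression is a SHORT stored left-justified in the value field.
            (value,) = struct.unpack(end + "H", entry[8:10])
            return value
    return 1  # the TIFF default when the tag is absent: no compression
```

With the file contents read via `pathlib.Path(name).read_bytes()`, a result of 4 flags a file that a G4-unaware program is likely to reject; `COMPRESSION_NAMES.get(value, "other")` gives a readable label.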