UltraScrape: A Windows Console/DOS Screen Scraper

If you can see it, you can convert it.

Is your database locked up behind proprietary DOS software and a secret file format? UltraScrape may be able to help you free your data by converting it to standard CSV format, which can then be imported into any database. UltraScrape is designed to handle thousands of records (or more), and using it can be much easier and more cost effective than attempting to reverse-engineer a non-standard file format. It captures any textual data visible on the screen and dumps it to a file that can be converted later. UltraScrape is not limited to databases, either: it can be used to convert any data visible inside a textual DOS box.

Note: UltraScrape is released with the hope that it will be useful. It will need to be heavily tailored to your specific needs and database. I developed it to convert a specific database, and did so with great success. Your results may vary.

Download UltraScrape v0.0

Using UltraScrape

UltraScrape is split into two programs: driver.pl and rp.pl. The driver handles copying and pasting each record, so it's important that it is set up correctly. It basically works by capturing the screen, pressing keys to go to the next record, and repeating.

The Driver

First, you need to install ActiveState Perl and Win32::GuiTest. I have tried many other APIs to send keys to a DOS application in Windows, but none have worked as well as Win32::GuiTest. You also need Win32::Clipboard (it comes with ActivePerl, but you may need to install it separately if you are using another Perl distribution).

Methods

Now open driver.pl in your favorite text editor and edit it to suit your needs. The default file to save to is raw.txt, and the default method is 2, "Windowed Drag-and-Enter", which works best in my experience. In this mode, you must load the database in a DOS console window of standard size, move it to the upper-left corner of the screen (at (0,0)), and enable QuickEdit mode under Properties->Options. QuickEdit mode allows selecting text by simply clicking and dragging over it, and copying by pressing Enter.

Method 1, "Full Screen PrintScreen", seems like a simple solution but has complications. You load the program, press Alt+Enter to make it full-screen, and then let the driver press the PrintScreen key. When a DOS prompt is windowed, PrintScreen (or even Alt+PrintScreen, which copies only the current window) copies the window graphically. This is undesirable because OCR would be needed to extract the text. In full-screen mode, however, screenshots are captured as ordinary readable text. Unfortunately, after 700 or 800 screenshots, my Windows XP machine began constantly running out of memory. I tried Windows 2000 and Me and had the same problem. Windows 98 had a more severe problem: the down key, used to go to the next record, did not function. Even worse, under all operating systems, the database program took longer to write to the screen in full-screen mode, so a delay of a few seconds had to be inserted or else the lower portion of the screen would be cut off.

The third method is "Windowed Select-All Enter", a variation of method 2. Instead of taking the circuitous route of dragging the mouse, it uses the simple-minded Edit->Select All option from the console system menu. This does not work well: "Select All" often fails to do as advertised, in my experience. It is for this reason that I had to develop the now-preferred method 2. As you can see, scraping DOS programs has its pitfalls, but now that UltraScrape has been developed and working methods discovered, you can begin to scrape your programs with impunity.

Clipboard

In each of the methods, the driver receives records from the program through the clipboard. This is not as easy as it sounds. Sometimes after a copy the clipboard will be empty, sometimes it will not change at all, and sometimes it will be truncated. The driver tries to cope with all these cases. You may need to modify the allowed clipboard data lengths of 1976 and 1871 for displays that are not 25x80.
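The three failure cases above can be told apart with a small check. What follows is only a sketch, not the code in driver.pl: the function name classify_capture and the status strings are made up for illustration, and treating 1976 and 1871 as a set of exact accepted lengths is an assumption (the real driver may use those numbers differently).

```perl
use strict;
use warnings;

# Lengths accepted for a full 25x80 capture, taken from the text above.
# Treating them as exact accepted lengths is an assumption of this sketch.
my %ACCEPTED_LENGTH = map { $_ => 1 } (1976, 1871);

# Classify one clipboard read (hypothetical helper).
#   $text     - what was just read from the clipboard (may be undef)
#   $previous - the last capture that was accepted (undef before the first)
sub classify_capture {
    my ($text, $previous) = @_;
    return 'empty'      if !defined $text || $text eq '';
    return 'unchanged'  if defined $previous && $text eq $previous;
    return 'bad_length' if !$ACCEPTED_LENGTH{ length($text) };
    return 'ok';
}
```

A driver loop could repeat the copy step until the result is 'ok', and give up after some number of attempts so that a genuinely stuck screen does not loop forever.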

Next Record

Finally, you may need to modify SendKeys("{DOWN}") if your program does not go to the next record using the down key. If your program has other screens in each record that you wish to capture, also specify them here. For example, if you need to press F5 to access extended information on a record, use SendKeys("{F5}") in the appropriate place.

You can try the driver out now by running it with your program in position. The pointer should (assuming you are using the recommended method 2) drag across the screen, copying it quickly. If not, something is wrong with driver.pl and needs to be fixed. Now that the driver is set up, the postprocessor needs to be coded.

The Postprocessor (rp.pl)

You will need to write some Perl code to parse the database screenshots. Fortunately, this is the easy part, since Perl excels at text processing. The best way to do it is to follow the end of raw.txt, reading between SCRAPE_START and SCRAPE_END as it grows. After each record is read, use regular expressions to extract the fields and output them in your desired format (CSV, or Comma Separated Values, is easiest, but Perl modules are available for directly writing more advanced document and database formats). An example of how to do this is available in rp-example.pl, although it is of limited use to those without access to the proprietary database which it converts from.
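To give a feel for the shape of such a postprocessor, here is a minimal sketch. It is not rp-example.pl. It assumes that SCRAPE_START and SCRAPE_END each sit on a line of their own, and the screen layout ("Name:" and "Phone:" labels), the sample data, and the helper names split_records, extract_fields and csv_line are all invented for illustration. It also processes a complete string rather than following raw.txt as it grows.

```perl
use strict;
use warnings;

# Split raw capture text into records, one per SCRAPE_START..SCRAPE_END pair.
# Assumes each marker starts at the beginning of a line.
sub split_records {
    my ($raw) = @_;
    my @records;
    while ($raw =~ /^SCRAPE_START[^\n]*\n(.*?)^SCRAPE_END/msg) {
        push @records, $1;
    }
    return @records;
}

# Pull fields out of one screen. Hypothetical layout: "Label: value",
# with the value running to the end of the line.
sub extract_fields {
    my ($screen) = @_;
    my %field;
    for my $label (qw(Name Phone)) {
        my ($value) = $screen =~ /^[ \t]*\Q$label\E:[ \t]*(.*?)[ \t\r]*$/m;
        $field{$label} = defined $value ? $value : '';
    }
    return \%field;
}

# Quote every field and double any embedded quotes, as CSV expects.
sub csv_line {
    return join ',', map { my $v = $_; $v =~ s/"/""/g; qq("$v") } @_;
}

# Invented sample of what raw.txt might contain.
my $raw = <<'END';
SCRAPE_START
 Name: Ann "AJ" Lee
 Phone: 555-0100
SCRAPE_END
SCRAPE_START
 Name: Bob Ray
 Phone:
SCRAPE_END
END

my @csv = map {
    my $f = extract_fields($_);
    csv_line(@{$f}{qw(Name Phone)});
} split_records($raw);

print "$_\n" for @csv;
```

Quoting every field keeps the output safe when a value contains a comma or a quote. For anything more involved, a CPAN module such as Text::CSV is probably a better choice than hand-rolled quoting.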

Metering Conversion Progress

A large database can take a while to convert, so if you have a "wc" command-line utility (as on Unix), you can use progress.pl to display the number of lines in your converted file (assuming your rp.pl outputs to "csv"), and the change in lines every five seconds.
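If no "wc" utility is available, the same kind of report can be produced in Perl alone. The sketch below is not progress.pl. The helper names count_lines, progress_report and watch are made up for illustration, and the file name "csv" and the five-second interval are simply taken from the description above.

```perl
use strict;
use warnings;

# Count the lines in a file without relying on an external wc.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or return 0;   # the file may not exist yet
    my $n = 0;
    $n++ while <$fh>;
    close $fh;
    return $n;
}

# Format one report from the previous and current line counts.
sub progress_report {
    my ($previous, $current) = @_;
    return sprintf '%d lines (%+d)', $current, $current - $previous;
}

# Poll $path every $interval seconds, printing $rounds reports.
sub watch {
    my ($path, $interval, $rounds) = @_;
    my $previous = count_lines($path);
    for (1 .. $rounds) {
        sleep $interval;
        my $current = count_lines($path);
        print progress_report($previous, $current), "\n";
        $previous = $current;
    }
}

# Example: watch('csv', 5, 12);   # one minute of five-second reports
```

Re-reading the whole file on every poll is simple but gets slower as the output grows; for very large conversions, remembering the file offset and counting only the new lines would be cheaper.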

Conclusion

Although the code is short, I've put a lot of effort into finding a workable way to screen scrape a large database. I hope that by releasing this information and source code to the public under the GNU GPL v2, companies will be able to migrate their old DOS databases to newer open databases, such as PostgreSQL. Success stories (if any) and feedback are welcome. Enjoy.

-Jeff Connelly jeffconnelly [AT] users.sourceforge.net