WWWGrab is a configurable, highly flexible data transformation utility that can:
* scrape web pages
* parse emails
* convert files
* perform other, similar transformations
It lets the user specify arbitrary combinations of patterns, along with the actions to perform when those patterns are recognized. It accepts input from web pages (via HTTP), stored emails (via MAPI), and files, and it can write its output to database tables (via ODBC) and to files. For emails, it can parse both header fields and text bodies.
WWWGrab uses Set Machine, which can perform a wide variety of tasks because its design recognizes that many transformation tasks (parsing, extraction, conversion, searches, etc.) involve the same basic, repeated process:
- recognition of patterns in the input,
- transition to another "state" based on recognition of the next pattern in the input.
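The recognize-then-transition loop described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not WWWGrab's configuration format or Set Machine's implementation; the rule table layout and the names `run_machine`, `collect_title`, and `RULES` are hypothetical.

```python
import re

def run_machine(rules, text, start_state):
    """Scan text, firing actions and changing state as patterns are recognized.

    rules maps a state name to a list of (compiled_pattern, action, next_state).
    This layout is an assumption made for the sketch, not WWWGrab's own format.
    """
    state = start_state
    pos = 0
    output = []
    while pos < len(text):
        # Among the patterns valid in the current state, pick the earliest match.
        best = None
        for pattern, action, next_state in rules.get(state, []):
            m = pattern.search(text, pos)
            if m and (best is None or m.start() < best[0].start()):
                best = (m, action, next_state)
        if best is None:
            break  # nothing more to recognize in this state
        m, action, next_state = best
        if action is not None:
            action(m, output)
        # Advance past the match (at least one character, to avoid looping forever).
        pos = max(m.end(), pos + 1)
        state = next_state
    return output

def collect_title(match, output):
    """Hypothetical action: store the text captured before the closing tag."""
    output.append(match.group(1).strip())

RULES = {
    "outside": [(re.compile(r"<title>", re.I), None, "in_title")],
    "in_title": [(re.compile(r"(.*?)</title>", re.I | re.S), collect_title, "outside")],
}

titles = run_machine(RULES, "<html><title> Hello </title><body>x</body></html>", "outside")
```

Here the task-specific details (which patterns to look for, what to do with a match, and which state comes next) live entirely in the rule table, while the loop itself stays generic.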
Internally, WWWGrab/Set Machine is very general and abstract. The user defines the details of the transformation task. As a result, WWWGrab/Set Machine is very flexible, but it can be challenging to use.
Features:
* Recursive capabilities (enabling parsing of nested HTML/XML tags, comments, etc.)
* Wide-string (Unicode) input / output capability
* Stored email (MAPI) interface
* ODBC interface making database layout info (table and field names) available to the configuration developer
* ODBC interface allowing generation of arbitrary SQL statements built with a combination of user-defined data and parsed data
* User-defined function interface allowing execution of custom DLL code
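As a rough illustration of building SQL from a combination of user-defined data (table and field names) and parsed data (the values), the sketch below uses Python's built-in `sqlite3` module as a stand-in for an ODBC connection. The function name `insert_rows` and the `pages` table are hypothetical and are not part of WWWGrab.

```python
import sqlite3

def insert_rows(conn, table, columns, rows):
    """Build one INSERT statement from user-supplied names and run it per parsed row.

    Table and column names cannot be bound as SQL parameters, so they are
    checked here before being placed in the statement; the parsed values are
    passed as bound parameters.
    """
    for name in (table, *columns):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.executemany(sql, rows)
    return sql

# Usage: an in-memory database stands in for a real ODBC data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT)")
parsed = [("http://example.com", "Example")]  # values that parsing might yield
statement = insert_rows(conn, "pages", ["url", "title"], parsed)
```

Binding the parsed values as parameters, rather than pasting them into the statement text, keeps unexpected characters in scraped input from changing the meaning of the SQL.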
WWWGrab/Set Machine can be configured to:
* Scrape (parse) web pages / HTML
* Parse emails
* Search for (and replace) text
* Repair data
* Generate C/C++ code, HTML, XML, and other formats from various sources (emails, C/C++ code, HTML, XML, etc.)
* Parse C/C++ source code
* Generate and execute SQL
* Count words/keywords
* Count lines
These are only examples; WWWGrab/Set Machine can be configured to perform a wide variety of other tasks.