Learn How To
Learn how to perform common Web Miner tasks
How does it work?
In a nutshell, Web Miner comes with a built-in web crawler that crawls whatever sources you give it, while the miners extract data from the pages being crawled. This data can then be viewed, exported and even sent by HTTP POST to any destination.
How to start a mining session
Follow these steps to start a successful mining session:
- Add one or more sources
- Select one or more miners (configure them to your specific needs if necessary)
- Set the output artifact folder or POST URL
- Click Run from the Mining menu or toolbar
How to save mining settings
Select Save or Save As from the File Menu, or click the save icon on the toolbar, to create a document with the settings for the current mining session. A Web Miner document (WMR) file will be saved.
How to load saved settings
Select Open from the File Menu or click the open icon from the toolbar to open a Web Miner document (WMR).
Go to the Help Menu and choose the Activate Software... option.
Check for updates
Go to the Help Menu and choose the Check for Updates... option.
Submit a feature request
Go to the Help Menu and choose the Provide Feedback option.
User Interface Explained
Learn all about the user interface and how to navigate around Web Miner to extract data from web pages.
Shows, in real time, the number of mined artefacts along with the mining, computer and network settings.
Search the entire web
Automatically configures the crawler to search the entirety of each website and its linked pages, starting from www.wikipedia.org. (This search takes a very long time, so use it with care.)
Represents the pages that will be mined. This can be one or more valid Uniform Resource Locators (URLs). These pages will be mined according to your configured settings.
Represents the pages that will be excluded from the search. This can be one or more valid URLs.
Load sources from text file
This allows the loading of URL sources for included or excluded pages from a text file. The text file must have one URL per row
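As a rough illustration of the one-URL-per-row format, the Python sketch below filters a sequence of text lines down to those that parse as absolute HTTP or HTTPS URLs. The helper name `load_sources` is hypothetical, and Web Miner's own validation rules may differ from this check.

```python
from urllib.parse import urlparse


def load_sources(lines):
    """Return the well-formed http(s) URLs from an iterable of text lines."""
    urls = []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # skip blank rows
        parsed = urlparse(candidate)
        # Keep only absolute URLs with an http(s) scheme and a host name.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


sample = ["http://example.com\n", "\n", "not a url\n", " https://example.org/page \n"]
print(load_sources(sample))
```

Because an open file is itself an iterable of lines, the same function can be given a file handle to check a source file before loading it into Web Miner.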
Load sources from CSV file
This allows the loading of URL sources for included or excluded pages from a CSV file. The CSV file must have one URL per row.
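A CSV source file can be produced by a short script. The sketch below uses Python's `csv` module to write one URL per row and read it back. Placing the URL in the first column with no header row is an assumption, since the exact column layout Web Miner expects is not specified here; the function names are hypothetical.

```python
import csv
import io


def write_sources_csv(urls, handle):
    """Write one URL per row (single column, no header) to a file-like object."""
    writer = csv.writer(handle, lineterminator="\n")
    for url in urls:
        writer.writerow([url])


def read_sources_csv(handle):
    """Read the first column of each non-empty row back as a URL."""
    return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_sources_csv(["http://example.com", "https://example.org/docs"], buffer)
buffer.seek(0)
print(read_sources_csv(buffer))
```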
Clear all sources
Removes all sources for either the excluded or included sources
Represents the built-in miners that extract specific or custom data from the web. Miners extract a Uniform Resource Locator to the mined data.
To select a miner simply check the selection box at the top of the miner. Miners can be turned on or off during an active mining session, and will start mining immediately. The pages that have been mined already will not be revisited.
- ATOM 1.0+ – Extracts ATOM 1.0 feeds from the pages being crawled
- RSS 2.0+ – Extracts RSS 2.0 feeds from the pages being crawled
- Keyword Density – Computes keyword density for pages crawled
- Meta tags – Extracts meta tags from the pages being crawled
- Images – Extracts images from the pages being crawled
- Fax Numbers – Extracts FAX numbers from the pages being crawled
- Uniform Resource Locator (URL) – Extracts URLs from the pages being crawled
- IP Address – Extracts IP addresses from the pages being crawled
- Phone Numbers – Extracts phone numbers from the pages being crawled
- Extensible Stylesheet Language Transformations (XSLT) – Executes custom XSLT scripts to extract data from the pages being crawled. This allows the creation of a script that can pull data in any shape you choose. XSLT is very powerful and a welcome addition to the Web Miner family of miners.
- Regular Expressions – Uses custom regular expressions to extract data from the pages being crawled. Regular expressions can be constructed (e.g. with BuildRegex) to extract any data. Like XSLT, this is a powerful miner, geared towards those who prefer regular expressions.
- Microformats – Extracts various microformats from the pages being crawled.
- Email Address - Extracts email addresses from the pages being crawled
- Files / Documents - Extracts all documents found in the pages crawled with the extensions/types specified by the user
- Activity Stream 1.0+ - Extracts activity stream feeds from the pages being crawled.
- Traverse external pages - Whether pages external to the root uri should be crawled.
- Traverse the links in external pages - Whether pages external to the root uri should have their links crawled. NOTE: IsExternalPageCrawlEnabled (the Traverse external pages option) must be true for this setting to have any effect.
- Respect Robots Text File - Whether the crawler should retrieve and respect the robots.txt file.
- Re-Crawl Pages - Whether Uris should be crawled more than once. This is not common and should be false for most scenarios.
- Respect Anchor REL No Follow - Whether the crawler should ignore links that have a <a href="whatever" rel="nofollow">.
- Respect META Robots No Follow - Whether the crawler should ignore links on pages that have a <meta name="robots" content="nofollow" /> tag.
- Maximum Pages to crawl - Stops the crawler once the number of pages crawled reaches this value. This effectively ends the mining session. This value is required.
- Maximum pages to crawl per domain - Specifies the maximum number of pages to crawl for a single domain, regardless of how often that domain appears during the crawl. If zero, this setting has no effect.
- Crawl duration - Specifies the maximum time that the mining session should last. The crawler stops crawling when the time limit is reached. If zero, this setting has no effect.
- Minimum Crawl delay per domain (ms) - The number of milliseconds to wait in between http requests to the same domain.
- Maximum Crawl Depth - Maximum number of levels below the root page to crawl. If the value is 0, the homepage will be crawled but none of its links will be crawled. If the value is 1, the homepage and its links will be crawled but none of the links' links will be crawled.
- User Agent - The user agent string that identifies the crawler (Abot), e.g. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; abot v1.5 http://code.google.com/p/abot)".
- Robots.txt User Agent - The user agent string to use when checking robots.txt file for specific directives. Some examples of other crawler's user agent values are "googlebot", "slurp" etc.
- Maximum Robots.txt Crawl Delay in Seconds - The maximum number of seconds to respect in the robots.txt "Crawl-delay: X" directive. IsRespectRobotsDotTextEnabled (the Respect Robots Text File option) must be true for this value to be used. If zero, the crawler uses whatever crawl delay robots.txt requests, no matter how high the value is.
- Maximum redirects for HTTP requests - The maximum number of redirects that the request follows. If zero, this setting has no effect.
- Force link parsing - Indicates whether the crawler should parse the page's links even if the crawl decision determines that those links will not be crawled.
- HTTP requests auto redirect - Indicates whether HTTP requests should automatically follow redirection responses.
- HTTP requests automatic decompression - Indicates whether web pages that are compressed with gzip or deflate will be automatically accepted and decompressed.
- Enable crawler logging - Enables all crawler feedback to be displayed in the application Event Log area. Checking this option causes memory consumption to grow over time, which will in turn affect application performance.
- Minimize to system tray - When checked the application is minimized to the system tray when the minimize button is clicked.
- Maximum concurrent crawl threads - Determines the maximum number of Threads that will be used during crawling of web pages.
- Maximum memory usage (MB) - The maximum amount of memory to allow the process to use. If this limit is exceeded the crawler will stop prematurely. If zero, this setting has no effect.
- Memory usage check threshold (s)
- Maximum memory usage cache time in seconds - The maximum amount of time before refreshing the value used to determine the amount of memory being used by the process that hosts the crawler instance. This value has no effect if set to zero.
- HTTP requests timeout in seconds - Represents the time-out value in seconds for the System.Net.HttpWebRequest.GetResponse() and System.Net.HttpWebRequest.GetRequestStream() methods. If zero, this setting has no effect.
- Minimum available memory required (MB) - The minimum amount of memory that must be available before a crawl starts. If less than this amount is available, the crawl will not run. If zero, this setting has no effect.
- HTTP ServicePoint connection limit - Represents the maximum number of concurrent connections allowed by a System.Net.ServicePoint. The system default is 2. This means that only 2 concurrent http connections can be open to the same host. If zero, this setting has no effect.
- Maximum page size in bytes - Maximum size of page. If the page size is above this value, it will not be downloaded or processed. If zero, this setting has no effect.
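The depth rule described for Maximum Crawl Depth can be pictured as a breadth-first traversal that stops following links beyond a given level. The Python sketch below runs on a small hand-written link graph rather than real HTTP requests. The graph, the `crawl` function and its `max_pages` parameter are illustrative assumptions, not Web Miner's or Abot's actual code.

```python
from collections import deque


def crawl(links, root, max_depth, max_pages):
    """Breadth-first walk of `links` starting at `root`.

    Pages deeper than `max_depth` are not visited, and the walk stops
    once `max_pages` pages have been visited.
    """
    visited = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # do not follow links found at the deepest allowed level
        for target in links.get(page, []):
            if target not in seen:  # each page is crawled at most once
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# A toy site: the home page links to two pages, one of which links further.
SITE = {
    "/": ["/about", "/blog"],
    "/blog": ["/blog/post-1"],
}

print(crawl(SITE, "/", max_depth=0, max_pages=100))  # ['/']
print(crawl(SITE, "/", max_depth=1, max_pages=100))  # ['/', '/about', '/blog']
```

With a depth of 0 only the home page is visited, and with a depth of 1 the home page and its direct links are visited, which matches the behavior described for the setting.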
These are the settings for the mail server used for any emailing purposes, including sending emails with the E-mail miner.
- User name - SMTP server user name
- Password - SMTP server password
- Require logon using Secure Password Authentication (SPA) - Indicates that SPA should be used when authenticating/signing in to the SMTP server.
- SMTP Server - The SMTP Server which is used to send emails.
- Port number - The SMTP server communications port.
- TLS/SSL required - Indicates that the SMTP server should use TLS or SSL.
These are the settings for using proxy servers when crawling web pages.
- Use proxy server - Indicates that the specified proxy servers should be used when crawling the web pages.
- Cycle proxy servers - Indicates that the proxy servers should be rotated if there is more than one.
- Proxy cycle interval (s) - Specifies the time interval in seconds before a proxy server is changed.
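One way to picture proxy cycling is a round-robin selector that moves to the next proxy once the interval has elapsed. The Python sketch below is an assumption about the general behavior, not Web Miner's implementation. The `ProxyCycler` class and the proxy addresses are hypothetical, and the clock is injectable so the rotation can be shown without waiting.

```python
import time


class ProxyCycler:
    """Round-robin over proxies, advancing after `interval_s` seconds."""

    def __init__(self, proxies, interval_s, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._interval_s = interval_s
        self._clock = clock
        self._index = 0
        self._last_switch = clock()

    def current(self):
        """Return the proxy to use now, rotating if the interval has passed."""
        now = self._clock()
        if len(self._proxies) > 1 and now - self._last_switch >= self._interval_s:
            self._index = (self._index + 1) % len(self._proxies)
            self._last_switch = now
        return self._proxies[self._index]


# Demonstrate with a fake clock instead of real waiting.
fake_time = [0.0]
cycler = ProxyCycler(["10.0.0.1:8080", "10.0.0.2:8080"], interval_s=30,
                     clock=lambda: fake_time[0])
print(cycler.current())  # 10.0.0.1:8080
fake_time[0] = 31.0
print(cycler.current())  # 10.0.0.2:8080
```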
These are the settings for saving the artifacts that are found during mining sessions.
- Artefacts destination folder - The output folder where all the found artifacts are stored.
- URI for JSON artefact HTTP Post - Represents a destination that accepts HTTP POSTs of artifacts represented in JSON format.
- Post artefacts to URI - This enables the HTTP POSTing of artifacts to the specified destination.
This view shows the user information about the current mining session plus specifications about the device running Web Miner that are relevant to the performance of a mining session.
These are web galleries that can be created after 1 or more mining sessions to display either the images or documents/files found during these sessions.
This represents the pages that will be used as starting points for the mining session.
These are the units that do most of the work, and especially the key work in extracting the data from the pages being crawled.
Manages the settings for the entire application including the mining session, proxy servers, email settings and logging.
Frequently Asked Questions
What is Web Miner?
Web Miner is a complete web data extraction tool that has a built-in web crawler to mine an unlimited number of web pages.
Why would I use Web Miner?
If you are interested in extracting any form of data from the web then Web Miner is a tool for you.
Does Web Miner have a built in Web Crawler?
Yes. Web Miner uses a very fast built-in web crawler.
What are Miners?
Miners are the heart of Web Miner. They extract the data from the pages being crawled.
What are crawlers?
A crawler is the engine that goes from page to page and allows the miners to extract the data.
How do I save the extracted data?
Choose the output destination folder from the output settings and all extracted data will be saved to this location. It is important that this location is writable.
How can I integrate Web Miner into my applications?
Use the HTTP POST output setting to post all found artifacts in JSON format to any destination. This allows data to be received by any web application which can then be saved into the database or sent directly to your application.
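On the receiving side, any endpoint that accepts a JSON body will do. The Python sketch below is a minimal receiver built on the standard `http.server` module. The port, the `ArtifactHandler` name and the in-memory `RECEIVED` list are illustrative assumptions, and because the JSON schema that Web Miner sends is not described here, the payload is stored as-is without interpreting its fields.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []  # in-memory store; a real application would use a database


def parse_artifact(body):
    """Decode a JSON request body; return None if it is not valid JSON."""
    try:
        return json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


class ArtifactHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        artifact = parse_artifact(self.rfile.read(length))
        if artifact is None:
            self.send_response(400)  # reject bodies that are not JSON
        else:
            RECEIVED.append(artifact)
            self.send_response(204)  # accepted, nothing to return
        self.end_headers()


def serve(port=8000):
    """Run the receiver until interrupted (port 8000 is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), ArtifactHandler).serve_forever()
```

Calling `serve()` and pointing the POST URI setting at `http://127.0.0.1:8000/` should, under these assumptions, collect each posted artifact in `RECEIVED`.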
Can I export data in CSV format?
What can I extract with Web Miner?
Web Miner extracts any type of data, including custom data defined with regular expressions and XSLT. There are built-in miners that extract specific data, including:
- E-mail addresses
- RSS Feeds
- Atom Feeds
- Activity Stream Feeds
- IP Addresses
- Phone Numbers
- Keyword Density
- Meta Tags
- Fax Numbers
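To give a feel for the kind of pattern a regular expression based extraction might use, the Python sketch below pulls e-mail addresses and IPv4 addresses out of a fragment of HTML. The patterns are deliberately simplified assumptions and are not the expressions used by Web Miner's built-in miners. Regular expression syntax also differs slightly between engines, so a pattern may need adjusting before use in Web Miner.

```python
import re

# Simplified patterns for illustration; real-world addresses can be more varied.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")


def extract(pattern, text):
    """Return unique matches in the order they first appear."""
    return list(dict.fromkeys(pattern.findall(text)))


html = """
<p>Contact <a href="mailto:sales@example.com">sales@example.com</a>
or support@example.org. Server: 192.168.0.10 (backup 10.0.0.300).</p>
"""

print(extract(EMAIL, html))
print(extract(IPV4, html))
```

The duplicate e-mail address is reported once, and `10.0.0.300` is skipped because its last octet is out of range.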
Best practices of web crawling.
So you want to crawl the web for data?
Before you start, there are a few simple rules to follow to make sure you are a good web-citizen.
- Be nice to the sites you are crawling. You don't want to cause an accidental denial of service or anything else that looks malicious. In your Advanced settings, keep the number of simultaneous page requests low and the pause between page requests at 1 second or more.
- Crawls can take a long time. Save time by crawling only the parts of the website that you really need: use the “Where to crawl” and “Where not to crawl” options.
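The first rule above can be sketched with Python's standard `urllib.robotparser` plus a per-domain delay. This is an illustrative outline under assumptions, not Web Miner's crawler: the `PoliteScheduler` class, the user agent name `examplebot` and the sample robots.txt rules are all hypothetical.

```python
import urllib.robotparser
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks when each domain may next be requested."""

    def __init__(self, min_delay_s=1.0):
        self._min_delay_s = min_delay_s
        self._next_allowed = {}  # domain -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds to wait before requesting `url` at time `now`."""
        domain = urlparse(url).netloc
        return max(0.0, self._next_allowed.get(domain, now) - now)

    def record_request(self, url, now):
        """Note that `url` was requested at time `now`."""
        self._next_allowed[urlparse(url).netloc] = now + self._min_delay_s


# Parse robots.txt rules from text here; a crawler would download the file.
robots = urllib.robotparser.RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

scheduler = PoliteScheduler(min_delay_s=1.0)
scheduler.record_request("http://example.com/", now=0.0)
print(scheduler.wait_time("http://example.com/page", now=0.25))  # 0.75
print(robots.can_fetch("examplebot", "http://example.com/private/x"))  # False
```

A crawler built this way would sleep for `wait_time(...)` seconds before each request and skip any URL for which `can_fetch` returns false.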