Discussion Thread
Data Extractor
Extract any data, including email addresses and URLs from your files and webpages.
Posted in the Data Extractor Forum.
Why Does The Software Just Stop Scraping?
I've been using Data Extractor to scrape Craigslist with great results, but the software will sometimes just stop and not continue harvesting from its URL list for no apparent reason. I'm mystified as to why this is happening. I've tried randomizing the order of the URLs to be harvested (all of them are real, existing pages) so that there aren't any identifiable patterns in my scraping that the site might pick up on and block, e.g. 500 consecutive page requests to a single classifieds area (e.g. Antiques For Sale) in a single city (e.g. Detroit). But this doesn't seem to help. Could they be monitoring my IP address? None of the IP address changing programs I've tried seem to work.
I'd also like to know if there's a way for the software to automatically resume scraping when it hits an error page (e.g. page not found). These pages stop the software dead in its tracks, and you have to manually prompt it to get it going again. Many hours of productivity get lost this way, and I don't want to have to babysit the thing. If the software had a user-specifiable maximum time to wait on a page request (e.g. 60 seconds) before continuing to the next URL to be harvested, this problem would be solved.
Any insight? Thanks.
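For anyone who wants to see what that timeout-and-continue behavior looks like outside the software, here is a minimal Python sketch. It is not how Data Extractor works internally; the names `fetch_page` and `harvest` are made up for illustration, and the 60-second default is just the figure suggested above. Note that the `timeout` passed to `urlopen` limits each blocking network operation, not the total download time.

```python
import http.client
import urllib.error
import urllib.request


def fetch_page(url, timeout=60):
    """Return the page body as text, or raise on an error page or timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def harvest(urls, fetch=fetch_page, timeout=60):
    """Fetch each URL in turn; record failures and move on to the next URL."""
    pages, skipped = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url, timeout=timeout)
        except (OSError, ValueError, http.client.HTTPException) as exc:
            # URLError, HTTPError (e.g. 404) and socket timeouts are all
            # OSError subclasses; ValueError covers malformed URLs.
            skipped.append((url, str(exc)))
    return pages, skipped
```

Calling `harvest(url_list)` returns the pages that loaded plus a list of the URLs that were skipped and why, so a bad page costs at most one timeout instead of stalling the whole run.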
Why Does The Software Just Stop Scraping?
It might be stopping because the page does not finish loading. On the 3rd tab you can resume scraping by clicking a link in the bottom-right preview window.
Why Does The Software Just Stop Scraping?
That's what I do: I go into that little window at the bottom right and click any link there, and it resumes scraping at the next URL in the list. It doesn't get any emails from the current page, though; that page just gets skipped. You may be right that the page is not fully loading. For the three ads below, none of the pictures in the ad get loaded in that little preview window, which is probably why it's stopping. So how do I get the software to load pages with pics?
http://oregoncoast.craigs...85245.html
http://newjersey.craigsli...41930.html
http://charlotte.craigsli...62574.html
There used to be an option on Craigslist to view a text-only version of any ad, but this is no longer available. What to do?
Why Does The Software Just Stop Scraping?
Randy, you can try disabling images in the Internet Explorer settings. That's my only other advice.
Why Does The Software Just Stop Scraping?
Disabling the images is a good idea. In fact, it should be a default setting in the software, because no data you'd want to harvest is ever in an image. With enough use, the software also fills your temporary internet files cache with all the stored images, and unless you delete them periodically you end up with gigs of useless data there. A good free app for removing these files is Crap Cleaner.
I'm still using Data Extractor to harvest ad URLs, but I'm now using an old program called Email Spider to get the emails. Like DE, it will sometimes stall on certain URLs (I still don't understand why), but within a couple of minutes it will continue on to the next one. It also lets you filter results so that, e.g., no @craigslist.org emails are kept. The harvested data is also saved to your TXT file periodically as it harvests, rather than being kept strictly in memory. Because Data Extractor keeps everything in memory, I've found it to be a real memory hog (I only have 512MB): it can quickly gobble up most of it, making my PC inordinately slow and necessitating a reboot. So anyway, just a few suggestions for making the software better. Great product in any case.
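To show the two features mentioned above (filtering out unwanted domains and saving results to disk as you go), here is a rough Python sketch. It is not Email Spider's or Data Extractor's actual code; `extract_emails`, `append_results`, and the simplified email pattern are assumptions made for illustration, and the pattern will miss some unusual but valid addresses.

```python
import re

# Simplified pattern; an assumption, not a full RFC-compliant matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text, blocked_domains=("craigslist.org",)):
    """Return unique addresses in the order found, minus blocked domains."""
    seen, kept = set(), []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        domain = address.rsplit("@", 1)[1]
        # Block the domain itself and any of its subdomains.
        blocked = any(
            domain == d or domain.endswith("." + d) for d in blocked_domains
        )
        if not blocked and address not in seen:
            seen.add(address)
            kept.append(address)
    return kept


def append_results(path, addresses):
    """Append addresses to a text file instead of holding them in memory."""
    with open(path, "a", encoding="utf-8") as handle:
        for address in addresses:
            handle.write(address + "\n")
```

Calling `append_results` after each page (or every few pages) keeps memory use flat and means a crash or reboot only loses the page being processed.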
Why Does The Software Just Stop Scraping?
I've had the same problem and never thought of disabling images. However, I've just tried disabling images in both IE and Firefox, and the preview window in Data Extractor still shows images.