Data Extractor

Extract any data, including email addresses and URLs, from your files and webpages.
Posted in the Data Extractor Forum.
title and meta tags from HTML page
How can I extract title and meta tags from an HTML page?
title and meta tags from HTML page
We've added the script here:
http://www.iconico.com/Da...taTags.txt
You may install it by doing the following:
1. Create a new rule in Data Extractor: select the second tab and click 'New'.
2. Type a descriptive name for your rule and click OK.
3. Then click 'Edit Rule details'.
4. Select the 'HTML Webpage script' option on the right.
5. Copy and paste the rule into the 'Rule Details' text area.
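For readers who cannot reach the linked rule, here is a minimal sketch of the general technique in plain JavaScript. It is not the Iconico rule and does not use Data Extractor's scripting interface; `extractTitleAndMeta` and `readAttribute` are hypothetical names, and the regular expressions assume reasonably well-formed HTML with quoted attribute values.

```javascript
// Hypothetical helper, not the Iconico rule: pulls the <title> text and the
// name/content pairs of <meta> tags out of an HTML string.
function extractTitleAndMeta(html) {
  var result = { title: null, meta: [] };

  // First <title>...</title> only; [\s\S] lets the match span line breaks.
  var titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  if (titleMatch) {
    result.title = titleMatch[1].replace(/\s+/g, ' ').trim();
  }

  // Every <meta ...> tag; attributes are read one at a time so that their
  // order inside the tag does not matter.
  var metaTag = /<meta\b[^>]*>/gi;
  var tag;
  while ((tag = metaTag.exec(html)) !== null) {
    var name = readAttribute(tag[0], 'name');
    var content = readAttribute(tag[0], 'content');
    if (name !== null && content !== null) {
      result.meta.push({ name: name, content: content });
    }
  }
  return result;
}

// Returns the quoted value of one attribute inside a tag, or null if absent.
function readAttribute(tag, attr) {
  var re = new RegExp('\\b' + attr + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\')', 'i');
  var m = re.exec(tag);
  if (!m) return null;
  return m[1] !== undefined ? m[1] : m[2];
}
```

Calling `extractTitleAndMeta(pageSource)` returns an object with a `title` string (or `null`) and a `meta` array, which could then be written out one row per page.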
title and meta tags from HTML page
Thanks.
Now how can I extract the body and title at the same time?
title and meta tags from HTML page
I tried inserting this script, and it did not recover any meta tags or title tags. When I make my own rule to find titles using the search by text option and simply put in the title tags with the * wildcard, it does find the title tags. What is wrong?
title and meta tags from HTML page
Hard to say what's wrong, Eric; the rule should work fine. When you copy the rule, please make sure it keeps its line breaks and is not all on one line. You may need to use 'View Source', depending on your browser.
title and meta tags from HTML page
The code has the appropriate line breaks. It will scan all the files in my website but does not extract anything at all, even though most pages have title tags and meta tags. I am scanning californialung.org to do a content audit. Is there another script you offer that does the same thing?
title and meta tags from HTML page
Eric,
I just realized there was a missing '}' at the end of the rule. I've updated the rule, please try again and it should work for you. Sorry about that!
title and meta tags from HTML page
Thanks, that helped. It now finds all the meta tags, but it also now tries to open every file, including .pdf and .mov files. Now either the application stalls when opening a .mov, or I get this error:
-begin error-
Data Extractor Script Error:
TypeError: -2147024891
Access is Denied
Please check that your script is correct
-end error-
Can I restrict which files it will actually open? Also, does the error above mean it's running into password-protected directories? Normally the program just asks for my username and password and continues gathering metadata.
Thanks for your help!
Eric
title and meta tags from HTML page
The error is occurring not because of password-protected directories or files but because Data Extractor is trying to open something that's not a webpage.
How are you setting up Data Extractor on the first tab? Are you giving it a list of files, or are you extracting from an entire site? If you could send your exact settings, I can see if I can reproduce the error here and get you an answer.
title and meta tags from HTML page
On the first tab, I enter the home page URL and check the box to scan all linked web files. On tab 2, I have the metadata rule you kindly posted here. Then I click the extract now button. The site does have video files, and that seems to be where it most often stops and generates the error.
Eric
title and meta tags from HTML page
What's the URL? I need to try duplicating it here.
title and meta tags from HTML page
Eric,
Looks like you're right that the application is stopping when it finds a video. There's a quick fix for this, which is to edit the file 'DEscript.txt', located in the C:\Program Files\Data Extractor folder.
You can add different file extensions to be ignored by changing the line that's 9 lines from the bottom. Please change it to the following and the extraction should run all the way through, although you will need to click 'Cancel' on the security popups as they appear.
if ((ext != '.doc') && (ext != '.xls') && (ext != '.xml') && (ext != '.pdf') && (ext != '.txt') && (ext != '.csv') && (ext != '.mp3') && (ext != '.mov') && (ext != '.wma') && (ext != '.wmv')) {
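As the list of ignored extensions grows, the chained comparison gets hard to maintain. The sketch below shows the same check driven by an array, in plain JavaScript. It is only an illustration: how DEscript.txt actually computes `ext` is not shown in this thread, so `getExtension` and `shouldOpen` are hypothetical helpers, and the lower-casing and query-string handling are assumptions.

```javascript
// Extensions to skip; the same set as in the DEscript.txt condition above.
var SKIPPED_EXTENSIONS = ['.doc', '.xls', '.xml', '.pdf', '.txt', '.csv',
                          '.mp3', '.mov', '.wma', '.wmv'];

// Hypothetical helper: returns the lower-cased extension of a URL's last
// path segment, ignoring any query string or fragment, or '' if none.
function getExtension(url) {
  var path = url.split(/[?#]/)[0];
  var lastSegment = path.substring(path.lastIndexOf('/') + 1);
  var dot = lastSegment.lastIndexOf('.');
  return dot === -1 ? '' : lastSegment.substring(dot).toLowerCase();
}

// True when the URL does not end in one of the skipped extensions.
function shouldOpen(url) {
  return SKIPPED_EXTENSIONS.indexOf(getExtension(url)) === -1;
}
```

With this shape, ignoring another file type means adding one string to `SKIPPED_EXTENSIONS` rather than extending the condition.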
title and meta tags from HTML page
For me this script returns multiple identical results per page. Why does it do that? Is there a way to make it stop searching the page after it's found 1 title already? I am just using the part with the URL and title tags, as that's all I need. Thanks!
title and meta tags from HTML page
It shouldn't! You can always click the 'Remove Duplicates' button.
title and meta tags from HTML page
Interesting. I wonder why mine is doing that, then. For example, if I put in your site here, I get four identical results which all say "Iconico.com Software."
Remove duplicates is fine, but the processing time is long. I'm wondering if it's because it keeps looking for more results or if it's just because it has to load the page.
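The two ideas raised here, stopping at the first title and dropping repeated rows, can be sketched in plain JavaScript as follows. This does not explain why Data Extractor returns four results and is not how its 'Remove Duplicates' button is implemented; `firstTitle` and `removeDuplicates` are hypothetical helpers, and the row shape (`url`, `title`) is an assumption.

```javascript
// Hypothetical helper: returns the text of the first <title> element only.
// A regular expression without the 'g' flag stops at its first match.
function firstTitle(html) {
  var m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return m ? m[1].replace(/\s+/g, ' ').trim() : null;
}

// Keeps the first occurrence of each url/title pair, preserving order.
function removeDuplicates(rows) {
  var seen = {};
  var unique = [];
  for (var i = 0; i < rows.length; i++) {
    var key = rows[i].url + '\n' + rows[i].title;
    if (!Object.prototype.hasOwnProperty.call(seen, key)) {
      seen[key] = true;
      unique.push(rows[i]);
    }
  }
  return unique;
}
```

Neither step avoids the cost of downloading the page, so if most of the time goes into loading pages, stopping at the first match would not make the run noticeably faster.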