Data Extractor for Windows

Extract any data, including email addresses and URLs, from your files and webpages.

Posted in the Data Extractor Forum.




Question

Hi!

I urgently need to spider some data from webpages like http://www.zumi.pl/1997247,Kancelaria_Radcy_Prawnego_Maciej_Puk_-_Profit-n_Sp._z_o.o.,Poznan,firma.html and I know nothing about JavaScript :(

Could you help me with a simple script that will identify and list e-mail addresses? The addresses are "hidden": the program would have to find a string such as "lg0 = "maciej_puk";at = "@";lg1 = "op.pl";" and output "maciej_puk@op.pl". This is too difficult for me...

Best regards!
Matt from Poland
by Matt Koscielniak on Jan 27 2010 10:27am

Question

We have a script creation service; however, we have to charge for it:

http://www.iconico.com/DataExtractor/CustomRule.aspx
by Nico Westerdale on Jan 27 2010 12:13pm

Question

Although the whole thing can probably be done in JavaScript, I know little about it either, so I would split the task up.
First, run the following script:

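// Collect every <script> element on the page.
var obj=document.getElementsByTagName('script');

DataExtractor.SetColumns(1);
for (var t=0; t<obj.length; t++)
{
      // "lto:" is part of "mailto:", so this picks out the script holding the email.
      var match=/lto:/;
      var matchpos=obj[t].innerHTML.search(match);
      if (matchpos != -1)
      {
            // Output the whole raw script text as a single result.
            DataExtractor.StartNewResult();
            DataExtractor.AddResult(1,obj[t].innerHTML);
      }
}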

This will get you raw, messy data that can easily be cleaned up in Notepad, Word, Excel, etc. using search and replace. E.g. load the extracted results into Notepad and run Replace (Ctrl+H) with each of the following, in this order:

Find what: pr_1 = "   (leave the Replace with: field blank)
Find what: ";pr_2 = "   (again, leave the Replace with: field blank)
Find what: ";lg0 = "   (put a single space in the Replace with: field)
Find what: ";at = "&#64;";lg1 = "   (put @ in the Replace with: field)
Finally, Find what: ";   (put a single comma , in the Replace with: field)

This assumes that the Email addresses are "hidden" in the same way on every page.

Save the file as ExtractionResults.csv and open it in Excel. All your emails should be in the first column, and you can simply delete any other columns. If you want, you can do another search/replace to get rid of the "mailto: " bit.
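
The same clean-up could also be done in script rather than by hand. What follows is only a rough sketch: the helper name cleanRaw is made up for illustration, and the sample string is an assumption pieced together from the Find what: strings above, not copied from the real page. It applies the five replacements in the same order.

```javascript
// Hypothetical helper (not part of Data Extractor): applies the same
// find/replace steps as the Notepad instructions, in the same order.
function cleanRaw(raw) {
  var steps = [
    ['pr_1 = "', ''],                 // remove
    ['";pr_2 = "', ''],               // remove
    ['";lg0 = "', ' '],               // replace with a single space
    ['";at = "&#64;";lg1 = "', '@'],  // replace with @
    ['";', ',']                       // replace with a comma
  ];
  var out = raw;
  for (var i = 0; i < steps.length; i++) {
    // split/join replaces every literal occurrence, so no regex escaping is needed.
    out = out.split(steps[i][0]).join(steps[i][1]);
  }
  return out;
}

// Assumed sample of the raw script text (the real page may differ).
var sample = 'pr_1 = "mai";pr_2 = "lto:";lg0 = "maciej_puk";at = "&#64;";lg1 = "op.pl";';
console.log(cleanRaw(sample)); // mailto: maciej_puk@op.pl,
```

If the pages hide the address differently, the strings in the steps list would need changing to match.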
by Martin King on Jan 28 2010 10:19am

Question

I've done a bit more playing, and the following script will do the whole job for the example page you posted, but it is unlikely to work if the pages in question are formatted significantly differently.

First it finds the bit of HTML that contains the email. This is done by looking at every script tag until it finds one containing lto: (part of the string "mailto:"). Then it finds the leftmost part of the email. This can be tweaked by changing the lines:
var match=/lg0 =/;     ("lg0" is the text to find) and
left=left+7;        (7 is how far from the start of the found text the email address begins)

It then finds the rightmost part of the email address and the innerleft and innerright parts (i.e. the positions either side of where we need to put an @). Again, these can be tweaked as above.

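// Collect every <script> element on the page.
var obj=document.getElementsByTagName('script');

DataExtractor.SetColumns(1);
for (var t=0; t<obj.length; t++)
{
      // "lto:" is part of "mailto:", so this picks out the script holding the email.
      var match=/lto:/;
      var matchpos=obj[t].innerHTML.search(match);
      if (matchpos != -1)
      {
            // Start of the name part: 7 characters past the start of 'lg0 = "'.
            var match=/lg0 = /;
            var left=obj[t].innerHTML.search(match);
            left=left+7;

            // End of the domain part: 2 characters before ' document.write'.
            var match=/ document.write/;
            var right=obj[t].innerHTML.search(match);
            right=right-2;

            // End of the name part: 2 characters before 'at ='.
            var match=/at =/;
            var innerleft=obj[t].innerHTML.search(match);
            innerleft=innerleft-2;

            // Start of the domain part: 7 characters past the start of 'lg1 = "'.
            var match=/lg1 = /;
            var innerright=obj[t].innerHTML.search(match);
            innerright=innerright+7;

            // Cut out the two halves and join them with an @.
            var st=obj[t].innerHTML;
            var op=st.slice(left,innerleft);
            var op2=st.slice(innerright,right);

            DataExtractor.StartNewResult();
            DataExtractor.AddResult(1,op+"@"+op2);
      }
}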


And to any experts out there yes I know it's messy :-)
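
For anyone who would rather not count character offsets, a different technique is to use a single regular expression with capture groups to pull out the lg0 and lg1 values. This is only a sketch under assumptions: extractEmail is a made-up helper name, not part of Data Extractor, and it assumes both values are always wrapped in double quotes as in the example string from the first post.

```javascript
// Hypothetical helper (not part of Data Extractor): returns the email
// address hidden in a script's text, or null if the pattern is not found.
function extractEmail(scriptText) {
  // Group 1 captures the quoted value of lg0, group 2 the quoted value of lg1.
  // [\s\S]*? lazily skips whatever sits between them (the "at" assignment).
  var m = /lg0\s*=\s*"([^"]*)";[\s\S]*?lg1\s*=\s*"([^"]*)"/.exec(scriptText);
  return m ? m[1] + "@" + m[2] : null;
}

// Example string taken from the original question.
console.log(extractEmail('lg0 = "maciej_puk";at = "@";lg1 = "op.pl";')); // maciej_puk@op.pl
```

Inside the loop above, the idea would be to pass obj[t].innerHTML to the helper and only call DataExtractor.AddResult when it returns something other than null.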
by Martin King on Jan 28 2010 11:11am

Question

Oops:-
var match=/lg0 =/;     ("lg0" is the text to find) and

Should read

var match=/lg0 =/;     ("lg0 =" is the text to find) and
by Martin King on Jan 28 2010 3:05pm

Question

Hm! A little thanks/feedback from the original poster would have been nice. Not sure I'll be posting any more freebies :-(
by Martin King on Oct 1 2010 4:27am
