imred

Here is some code to scrape... Now maybe somebody can help me out and figure out how to scrape aspx, asp, and others which this script doesn't want to scrape. I can scrape with a webcontrol without issue; however, I want this to run as a service, so that really isn't a good way to do it.

        ' Needs these at the top of the file:
        '   Imports System.IO
        '   Imports System.Net
        '   Imports System.Text
        Public Function ScrapeURL(ByVal MyURL As String) As String
            Dim ReturnScrape As String = ""
            Dim myUri As New Uri(MyURL)
            Dim MyRequest As HttpWebRequest = DirectCast(WebRequest.Create(myUri.AbsoluteUri), HttpWebRequest)
            MyRequest.AllowAutoRedirect = True
            MyRequest.MaximumAutomaticRedirections = 10
            MyRequest.UserAgent = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
            MyRequest.KeepAlive = True
            MyRequest.Timeout = 30000

            Dim MyResponse As HttpWebResponse = Nothing
            Try
                MyResponse = DirectCast(MyRequest.GetResponse(), HttpWebResponse)
            Catch exception1 As WebException
                ' Request failed (timeout, HTTP error, DNS, ...): return the empty string
                Return ReturnScrape
            End Try

            Dim MyReader As StreamReader = Nothing
            Try
                ' Assumes the page is UTF-8 encoded
                MyReader = New StreamReader(MyResponse.GetResponseStream(), New UTF8Encoding())
                ReturnScrape = MyReader.ReadToEnd()
            Catch exception As Exception
                Console.WriteLine(exception.Message)
            Finally
                ' MyReader is still Nothing if creating the StreamReader threw
                If MyReader IsNot Nothing Then MyReader.Close()
                MyResponse.Close()
            End Try

            ' Might want to use Regex to strip out the HTML here
            Return ReturnScrape
        End Function




Well, I hope someone can help me figure out how to scrape ALL web pages instead of just some of them.  I did find a web control over at example-code.com, the Chilkat spider... However, I would have to do A LOT of processing of the returned text with that thing.

I'd prefer something that I can bring in (just like a web browser would) and then just READ the text off the resultant page.  Not sure how easy that would be.

Any help?
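
For the "just READ the text off the page" part, here is a minimal sketch using a real HTML parser instead of a regex. It is in Python (standard library only) rather than the VB.NET above, and the names `TextExtractor` and `page_text` are made up for illustration. It only pulls text out of HTML you already have; it does not run JavaScript the way a browser would.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def page_text(html):
    """Return the text content of an HTML string, pieces joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Feeding it the string returned by a scrape function gives back the text with tags removed and entities like `&amp;` already decoded.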

imred

I ended up using the Chilkat Spider and then re-writing my regex StripHTML function to strip a bit differently.

Now - Does anyone know how to keep from inserting question marks into the database when inserting text??

I have the “ ” characters in the original string, and even when I try to regex them out (using character codes 147 & 148), changing them to '' (two single quotes), the insert STILL changes them to ?

It's freakin' maddening.

nutballs

Try making sure that those actually are those codes (147 & 148); they might not be, they might be some Unicode version.
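
A quick way to check that, sketched in Python (the helper names `describe_chars` and `straighten_quotes` are just for illustration). Codes 147 and 148 are curly quotes only as Windows-1252 bytes; once the text is a Unicode string, the same characters have code points 8220 and 8221.

```python
def describe_chars(text):
    """Return (character, decimal code point) pairs for every non-ASCII character."""
    return [(ch, ord(ch)) for ch in text if ord(ch) > 127]


def straighten_quotes(text):
    """Replace curly double quotes by Unicode code point, not by code-page byte."""
    return text.replace("\u201c", '"').replace("\u201d", '"')


# Bytes 147 and 148 decoded as Windows-1252 become U+201C and U+201D.
from_cp1252 = bytes([147, 148]).decode("cp1252")
print(describe_chars(from_cp1252))
print(straighten_quotes("\u201cquoted\u201d"))
```

If `describe_chars` on the scraped text shows 8220/8221 (or something else entirely), a replace written against 147/148 may not be matching what you think it is.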

imred

For a testing phase, I'm bringing back the stripped html (including the regex against chr(147)/(148)) and I can see that the characters change.... So, I can only assume that I have replaced them correctly.  That's what is so maddening about it.

nutballs

hmm. only thing i can think beyond that is a logic error. an order of execution problem. try breaking it down to the smallest chunk of code you can, eliminating any other things you are doing to the string.

Really there should be no reason for it not to work.

the only other suggestion is to use .replace (or whatever it is)
if that works, in the same exact code that the regex won't, then there must be something slightly wrong with the regex.

imred

Yeah, I have tried all that.

It has literally come down to the insert statement.. When inserting it changes the characters to a ?.  Everything else works pretty smoothly because I have single-stepped through the program (written about 200 different ways now) to figure it out.

I even changed the code to use parameters and ExecuteNonQuery.  Not sure what else I can do, but I HAVE to get this to work!  Especially 'cause I can't - it makes me want to try even harder.

nutballs

oh AHHHH

it's on the insert. got it. It's a Unicode issue then, I'm pretty sure.
Is this to MS SQL? If so, change the column datatype to NVARCHAR, or N-whatever-you-are-using. I'm guessing you have it set to a non-N type.
If MySQL, same problem I think, but it's the 'collation type'. Though I am not actually sure what that should be set to.

imred

MySql (running on Windows). I set the column character set to utf8 and the column collation to utf8_general_ci.  The datatype is LONGTEXT.

Anything seem unusual there?

nutballs

the mysql collations are like voodoo to me, so I really don't know specifically, but I am pretty sure that may be where the problem is. hopefully someone else knows.

arms

it definitely seems like a utf thing.
in python when creating a connection you can specify the character set.
if your tables are set for utf then look at what options you have when making a connection.
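
A rough illustration of why the connection character set can matter even when the column is utf8. This is only a simulation in plain Python, not a real database call, and `simulate_insert` is a made-up name. Which charset the connection in this thread was actually using is not known; ISO-8859-1 is picked here purely as an example of a charset that has no curly quotes.

```python
def simulate_insert(text, connection_charset):
    """Encode text roughly the way a client might before sending it,
    substituting '?' for anything the charset cannot represent."""
    return text.encode(connection_charset, errors="replace")


quoted = "\u201cquoted\u201d"

# Curly quotes do not exist in ISO-8859-1, so they come out as '?'.
print(simulate_insert(quoted, "iso-8859-1"))

# UTF-8 can represent them, so nothing is lost.
print(simulate_insert(quoted, "utf-8"))
```

Python MySQL drivers such as PyMySQL take a `charset` argument to `connect` for this; other connectors usually have a similar setting in the connection options, so that is the place to look.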

