The Cache: Technology Expert's Forum
 
Author Topic: VB.Net and Scraping  (Read 7378 times)
imred
Rookie
« on: October 27, 2007, 08:24:13 AM »

Here is some code to scrape. Now maybe somebody can help me figure out how to scrape .aspx, .asp, and other pages that this script doesn't want to scrape :(  I can scrape with a web control without issue; however, I want this to run as a service, so that really isn't a good way to do it.

        ' Requires at the top of the file:
        '   Imports System.IO
        '   Imports System.Net
        '   Imports System.Text

        Public Function ScrapeURL(ByVal MyURL As String) As String
            Dim ReturnScrape As String = ""
            Dim myUri As New Uri(MyURL)
            Dim MyRequest As HttpWebRequest = DirectCast(WebRequest.Create(myUri.AbsoluteUri), HttpWebRequest)
            MyRequest.AllowAutoRedirect = True
            MyRequest.MaximumAutomaticRedirections = 10
            MyRequest.UserAgent = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
            MyRequest.KeepAlive = True
            MyRequest.Timeout = 30000 ' milliseconds

            Try
                ' Using blocks close the response and the reader even if reading fails,
                ' so Close() is never called on a reader that was never created.
                Using MyResponse As HttpWebResponse = DirectCast(MyRequest.GetResponse(), HttpWebResponse)
                    Using MyReader As New StreamReader(MyResponse.GetResponseStream(), New UTF8Encoding())
                        ReturnScrape = MyReader.ReadToEnd()
                    End Using
                End Using
            Catch ex As WebException
                ' The request itself failed (timeout, DNS error, 4xx/5xx status).
                Console.WriteLine(ex.Message)
            Catch ex As Exception
                Console.WriteLine(ex.Message)
            End Try

            ' Might want to use Regex to strip out the HTML here
            Return ReturnScrape
        End Function
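
One thing worth checking in the function above: it decodes every response as UTF-8, whatever the server declares. Below is a minimal sketch, assuming the server reports its charset in the Content-Type header, of choosing the decoder from HttpWebResponse.CharacterSet instead. PickEncoding and ReadBody are hypothetical helper names, not part of the original code, and this may or may not be what keeps some pages from scraping.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module CharsetAwareScrape
    ' Hypothetical helper: map a charset name to an Encoding,
    ' falling back to UTF-8 when the name is missing or unknown.
    Function PickEncoding(ByVal charsetName As String) As Encoding
        If String.IsNullOrEmpty(charsetName) Then Return Encoding.UTF8
        Try
            Return Encoding.GetEncoding(charsetName)
        Catch ex As ArgumentException
            ' GetEncoding throws ArgumentException for names it does not recognise.
            Return Encoding.UTF8
        End Try
    End Function

    ' Hypothetical helper: read the whole body using the declared charset.
    Function ReadBody(ByVal response As HttpWebResponse) As String
        Using reader As New StreamReader(response.GetResponseStream(), PickEncoding(response.CharacterSet))
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

If a Windows-1252 page is decoded as UTF-8, bytes such as 147 and 148 are not valid UTF-8 on their own and come out as replacement characters, so the declared charset is worth a look before blaming later steps.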




Well, I hope someone can help me figure out how to scrape ALL web pages instead of just some of them.  I did find a web control over at example-code.com, the Chilkat spider... However, I would have to do A LOT of processing of the returned text with that thing.

I'd prefer something that I can bring in (just like a web browser would) and then just READ the text off the resultant page.  Not sure how easy that would be.

Any help?

imred
Rookie
« Reply #1 on: October 28, 2007, 07:39:55 AM »

I ended up using the Chilkat Spider and then rewriting my regex StripHTML function to strip a bit differently.

Now, does anyone know how to keep question marks from being inserted into the database when inserting text?

I have the “ and ” characters in the original string, and even when I try to regex them out (using character codes 147 & 148), changing them to '' (two single quotes), the insert STILL changes them to ?

It's freakin' maddening.

nutballs
Administrator
Lifer
Back in my day we had 9 planets
« Reply #2 on: October 28, 2007, 08:29:43 AM »

Try making sure that those characters actually are codes 147 and 148. They might not be; they might be the Unicode versions instead.
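
A rough sketch of how that check could look, assuming the scraped text is already in a .NET String. DescribeNonAscii and StraightenQuotes are hypothetical helper names. In Windows-1252, bytes 147 and 148 are the curly double quotes, which are U+201C and U+201D in Unicode. Chr(147) depends on the machine's ANSI code page, while ChrW takes an explicit Unicode code point, so the sketch uses ChrW.

```vbnet
Imports System
Imports System.Text

Module QuoteCheck
    ' Hypothetical helper: list the code point of every non-ASCII character,
    ' so you can see what is really in the string.
    Function DescribeNonAscii(ByVal text As String) As String
        Dim sb As New StringBuilder()
        For Each c As Char In text
            If AscW(c) > 127 Then
                sb.AppendFormat("U+{0:X4} ", AscW(c))
            End If
        Next
        Return sb.ToString().Trim()
    End Function

    ' Hypothetical helper: swap curly quotes for their plain ASCII equivalents.
    Function StraightenQuotes(ByVal text As String) As String
        Return text.Replace(ChrW(&H201C), """"c) _
                   .Replace(ChrW(&H201D), """"c) _
                   .Replace(ChrW(&H2018), "'"c) _
                   .Replace(ChrW(&H2019), "'"c)
    End Function

    Sub Main()
        Dim sample As String = ChrW(&H201C) & "quoted" & ChrW(&H201D)
        Console.WriteLine(DescribeNonAscii(sample))   ' code points of the two curly quotes
        Console.WriteLine(StraightenQuotes(sample))   ' same text with straight quotes
    End Sub
End Module
```

If DescribeNonAscii reports code points other than the ones the regex targets, the replacement is matching the wrong characters.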


I could eat a bowl of Alphabet Soup and shit a better argument than that.
imred
Rookie
« Reply #3 on: October 28, 2007, 08:38:55 AM »

For a testing phase, I'm bringing back the stripped HTML (including the regex against Chr(147)/Chr(148)) and I can see that the characters change. So I can only assume that I have replaced them correctly. That's what is so maddening about it.

nutballs
Administrator
Lifer
« Reply #4 on: October 28, 2007, 10:23:02 AM »

Hmm. The only thing I can think of beyond that is a logic error, an order-of-execution problem. Try breaking it down to the smallest chunk of code you can, eliminating anything else you are doing to the string.

Really, there should be no reason for it not to work.

The only other suggestion is to use .Replace (or whatever it is) instead of the regex.
If that works in the exact same code where the regex won't, then there must be something slightly wrong with the regex.

imred
Rookie
« Reply #5 on: October 28, 2007, 02:25:45 PM »

Yeah, I have tried all that :)

It has literally come down to the insert statement. When inserting, it changes the characters to a ?. Everything else works pretty smoothly; I have single-stepped through the program (written about 200 different ways now) to figure it out.

I even changed the code to use parameters and ExecuteNonQuery. Not sure what else I can do, but I HAVE to get this to work! Especially because I can't, it makes me want to try even harder :)

nutballs
Administrator
Lifer
« Reply #6 on: October 28, 2007, 03:40:19 PM »

Oh, AHHHH.

It's on the insert. Got it. It's a Unicode issue then, I'm pretty sure.
Is this going to MS SQL? If so, change the column datatype to NVARCHAR, or N-whatever-you-are-using. I'm guessing you have it set to a non-N type.
If MySQL, it's the same problem I think, but there it's the 'collation type'. Though I'm not actually sure what that should be set to.

imred
Rookie
« Reply #7 on: October 28, 2007, 04:32:42 PM »

MySQL (running on Windows). I set the column character set to utf8 and the column collation to utf8_general_ci. The datatype is LONGTEXT.

Anything seem unusual there?

nutballs
Administrator
Lifer
« Reply #8 on: October 28, 2007, 06:50:27 PM »

the mysql collations are like voodoo to me, so I really don't know specifically, but I am pretty sure that may be where the problem is. hopefully someone else knows.

arms
Expert
« Reply #9 on: October 28, 2007, 07:52:41 PM »

It definitely seems like a UTF thing.
In Python, when creating a connection you can specify the character set.
If your tables are set for UTF, then look at what options you have when making a connection.
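
For VB.Net, a sketch of the same idea, assuming the MySQL Connector/NET provider (MySql.Data.MySqlClient); the thread does not say which provider is actually in use. Connector/NET accepts a CharSet option in the connection string. The server, database, credentials, table name (pages) and column name (body) below are placeholders, and SaveScrape is a hypothetical helper name.

```vbnet
Imports MySql.Data.MySqlClient

Module Utf8Insert
    ' Hypothetical helper: insert scraped text over a connection that talks utf8.
    Sub SaveScrape(ByVal pageText As String)
        ' CharSet=utf8 asks the connector to use utf8 for this connection.
        ' All other values in this string are placeholders.
        Dim connStr As String = "Server=localhost;Database=scrapes;Uid=user;Pwd=secret;CharSet=utf8;"

        Using conn As New MySqlConnection(connStr)
            conn.Open()
            ' A parameterised insert: the text is sent as a value, not spliced into the SQL.
            Using cmd As New MySqlCommand("INSERT INTO pages (body) VALUES (@body)", conn)
                cmd.Parameters.AddWithValue("@body", pageText)
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub
End Module
```

The point of the sketch is that a utf8 column is not enough on its own. If the connection uses a different character set, characters it cannot represent can be turned into ? before they ever reach the table.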