
imred
Here is some code to scrape... Now maybe somebody can help me out and figure out how to scrape .aspx, .asp, and other pages which this script doesn't want to scrape.

    Imports System.IO
    Imports System.Net
    Imports System.Text

    Public Function ScrapeURL(ByVal MyURL As String) As String
        Dim ReturnScrape As String = ""
        Dim myUri As New Uri(MyURL)
        Dim MyRequest As HttpWebRequest = DirectCast(WebRequest.Create(myUri.AbsoluteUri), HttpWebRequest)
        MyRequest.AllowAutoRedirect = True
        MyRequest.MaximumAutomaticRedirections = 10
        MyRequest.UserAgent = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
        MyRequest.KeepAlive = True
        MyRequest.Timeout = 30000

        Dim MyResponse As HttpWebResponse = Nothing
        Try
            MyResponse = DirectCast(MyRequest.GetResponse(), HttpWebResponse)
        Catch exception1 As WebException
            ' Request failed: return the empty string
            Return ReturnScrape
        End Try

        Dim MyReader As StreamReader = Nothing
        Try
            Dim MyEncoding As New UTF8Encoding
            MyReader = New StreamReader(MyResponse.GetResponseStream(), MyEncoding)
            ReturnScrape = MyReader.ReadToEnd()
        Catch exception As Exception
            Console.WriteLine(exception.Message)
        Finally
            ' Close only what was actually opened
            If MyReader IsNot Nothing Then MyReader.Close()
            MyResponse.Close()
        End Try

        ' Might want to use Regex to strip out the HTML here
        Return ReturnScrape
    End Function

Well, I hope someone can help me figure out how to scrape ALL web pages instead of just some of them. I did find a web control over at example-code.com, the Chilkat spider... However, I would have to do A LOT of processing of the returned text with that thing. I'd prefer something that I can bring in (just like a web browser would) and then just READ the text off the resultant page. Not sure how easy that would be. Any help?
imred
I ended up using the Chilkat Spider and then rewriting my regex StripHTML function to strip a bit differently. Now, does anyone know how to keep question marks from being inserted into the database when inserting text? I have the “ ” characters in the original string, and even when I try to regex them out (using character codes 147 & 148). It's freakin' maddening.
nutballs
Try making sure that those actually are those codes (147 & 148).
imred
For a testing phase, I'm bringing back the stripped HTML (including the regex against chr(147)/(148))
nutballs
hmm. the only thing I can think of beyond that is a logic error, an order-of-execution problem. try breaking it down to the smallest chunk of code you can, eliminating any other things you are doing to the string.
Really, there should be no reason for it not to work. the only other suggestion is to use .replace (or whatever it is). if that works in the exact same code where the regex won't, then there must be something slightly wrong with the regex.
imred
Yeah, I have tried all that. It has literally come down to the insert statement. When inserting, it changes the characters to a ?. Everything else works pretty smoothly; I have single-stepped through the program (written about 200 different ways now) to figure it out. I even changed the code to use parameters and ExecuteNonQuery. Not sure what else I can do, but I HAVE to get this to work! Especially 'cause I can't - it makes me want to try even harder.
nutballs
oh AHHHH
it's on the insert. got it. it's a unicode issue then, I'm pretty sure. Is this to MS SQL? if so, change the column datatype to nvarchar, or N-whatever-you-are-using. I'm guessing you have it set to a non-N type. if MySQL, same problem I think, but it's the 'collation type'. Though I am not actually sure what that should be set to.
imred
MySQL (running on Windows). I set the column character set to utf8 and the column collation to utf8_general_ci. The datatype is LONGTEXT.
Anything seem unusual there?
nutballs
the mysql collations are like voodoo to me, so I really don't know specifically, but I am pretty sure that may be where the problem is. hopefully someone else knows.
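
A small illustration of the diagnosis in this thread, sketched in Python rather than the thread's VB.NET. The “ and ” characters are bytes 147 and 148 only in Windows-1252; as Unicode code points they are U+201C and U+201D. Encoding them into a character set that has no slot for them is one way a ? can end up in stored text. The function names are made up for the example, and ISO-8859-1 is only a stand-in for whatever non-UTF-8 charset a connection or column might be using; this does not claim to reproduce exactly what MySQL or the .NET connector does.

```python
# Left and right curly double quotes, written as Unicode escapes.
CURLY_QUOTES = "\u201c\u201d"


def cp1252_codes(text):
    """Return the Windows-1252 byte values for each character in text."""
    return list(text.encode("cp1252"))


def lossy_store(text, charset="latin-1"):
    """Simulate storing text in a charset that may lack some characters.

    Characters the charset cannot represent are replaced with '?'.
    The default charset is an assumption chosen for illustration.
    """
    return text.encode(charset, errors="replace").decode(charset)


if __name__ == "__main__":
    sample = "\u201cquoted\u201d"
    print(cp1252_codes(CURLY_QUOTES))      # byte values in Windows-1252
    print([ord(c) for c in CURLY_QUOTES])  # Unicode code points
    print(lossy_store(sample))             # stand-in non-UTF-8 charset
    print(lossy_store(sample, "utf-8"))    # UTF-8 keeps the quotes
```

If the two printed lists differ, a pattern built from the numbers 147 and 148 may be looking for different characters than the ones actually in the string, depending on how the language maps numbers to characters.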
arms
it definitely seems like a utf thing.
in python, when creating a connection you can specify the character set. if your tables are set for utf then look at what options you have when making a connection.
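
A rough sketch of the suggestion above, assuming the third-party PyMySQL driver (the thread never says which Python library is meant) and placeholder credentials. The `charset` argument to `pymysql.connect` sets the character set used for the connection. The helper names `connection_settings` and `open_connection` are hypothetical.

```python
def connection_settings(host, user, password, database, charset="utf8mb4"):
    """Build keyword arguments for a MySQL connection with an explicit charset.

    "utf8mb4" is MySQL's full UTF-8; pass charset="utf8" instead to match
    the column character set mentioned in this thread.
    """
    return {
        "host": host,
        "user": user,
        "password": password,
        "database": database,
        "charset": charset,
    }


def open_connection(settings):
    """Open a connection using PyMySQL (assumed to be installed)."""
    import pymysql  # imported here so the helper above works without it

    return pymysql.connect(**settings)


if __name__ == "__main__":
    # Placeholder values; nothing here comes from the thread.
    settings = connection_settings("localhost", "scraper", "secret", "pages")
    print(settings["charset"])
```

The point is only that the connection carries its own character set, separate from the one set on the table or column, so both are worth checking.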
