The Cache: Technology Expert's Forum
 
Author Topic: Regex breaking for Third-level domains  (Read 13857 times)
netmktg
Rookie
« on: January 13, 2009, 02:54:53 PM »

I use a custom regex-based function to parse URLs, which is a better alternative to parse_url(). The function outputs the domain as well as a few other useful things. It works great for second-level domains but breaks on third-level domains such as those registered under .co.uk

'http://somedomain.com/someurl.html'  -> domain correctly identified as 'somedomain.com'
'http://sub1.somedomain.com/someurl.html'  -> domain correctly identified as 'somedomain.com'

'http://somedomain.co.uk/someurl.html'  -> domain identified as '.co.uk'
'http://sub1.somedomain.co.uk/someurl.html'  -> domain identified as '.co.uk'


This is obviously because the function treats the rightmost entity containing a single dot (the last two dot-separated labels) as the domain.
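To make the failure concrete, here is a minimal sketch of the "last two labels" rule on its own. It is not the actual function from this thread: naive_domain() is a made-up name, and it works on a bare host instead of a full URL.

```php
<?php
// Hypothetical helper (not from the original function): keeps only the
// last two dot-separated labels of a host, which is the rule described above.
function naive_domain($host)
{
    $labels = explode('.', strtolower($host));
    // array_slice with a negative offset returns the last two labels.
    return implode('.', array_slice($labels, -2));
}

echo naive_domain('sub1.somedomain.com'), "\n";   // somedomain.com
echo naive_domain('sub1.somedomain.co.uk'), "\n"; // co.uk
```

The second call shows the problem: nothing in the string itself says that "co.uk" is a suffix and not a registered name.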

The only solution I can think of is to add a list of two-part suffixes (co.uk, com.au and so on) to the function and modify the code to cross-check the domain against this list.
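A rough sketch of that cross-check, assuming the host has already been extracted from the URL. The function name registrable_domain() and the short $suffixes array are illustrative assumptions only; a real list would be much longer.

```php
<?php
// Hypothetical helper: finds the longest listed suffix at the end of the
// host and returns that suffix plus the one label to its left.
function registrable_domain($host, array $suffixes)
{
    $labels = explode('.', strtolower($host));
    $count  = count($labels);

    // Start at index 1 so at least one label remains in front of the suffix.
    // Longer candidate suffixes are tried first ('co.uk' before 'uk').
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            return $labels[$i - 1] . '.' . $suffix;
        }
    }
    return null; // suffix is not in the list
}

// Assumed sample list, deliberately incomplete.
$suffixes = array('com', 'net', 'org', 'co.uk', 'org.uk', 'com.au');

echo registrable_domain('sub1.somedomain.co.uk', $suffixes), "\n"; // somedomain.co.uk
echo registrable_domain('sub1.somedomain.com', $suffixes), "\n";   // somedomain.com
```

Returning null for an unlisted suffix keeps unknown cases visible instead of silently guessing.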

But if you guys would parse domains from URLs in a different way, I would really like to know. Any help is appreciated.

perkiset
Olde World Hacker
Administrator
Lifer
« Reply #1 on: January 13, 2009, 04:54:59 PM »

If all you're looking for is the complete domain, then you could do it something like this (off the cuff, not checked)

~[^/]+//([^/]+)/(.*$)~

which says: match everything up to the first / (http:, https:, etc.), then skip over the two slashes (//), then capture everything until the next slash (the host), then skip that slash, then capture everything to the end of the URL.
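For anyone who wants to try the pattern in PHP, a minimal sketch follows. The wrapper name split_url() and the array keys are made up for illustration. Since ~ is the delimiter, the slashes inside the pattern should not need escaping.

```php
<?php
// Hypothetical wrapper around the pattern from the post above.
function split_url($url)
{
    if (preg_match('~[^/]+//([^/]+)/(.*$)~', $url, $m)) {
        // $m[1] is the host, $m[2] is everything after the slash that follows it.
        return array('host' => $m[1], 'rest' => $m[2]);
    }
    return null; // no match, e.g. there is no slash after the host
}

$parts = split_url('http://sub1.somedomain.co.uk/someurl.html');
echo $parts['host'], "\n"; // sub1.somedomain.co.uk
echo $parts['rest'], "\n"; // someurl.html

var_dump(split_url('http://somedomain.com')); // NULL
```

Note that the first capture is the full host, subdomain included, and that a URL with nothing after the host does not match at all because the pattern insists on a slash there.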

Quick

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
Bompa
Administrator
Lifer
« Reply #2 on: January 13, 2009, 05:01:11 PM »

What I do is similar to what perk said.

Remove 'http://'.
Add a trailing slash.
Find the first slash: $firstslash = index($string, '/');
Then take the substring from index 0 to the first slash.

"The most beautiful and profound emotion we can experience is the sensation of the mystical..." - Albert Einstein
perkiset
Olde World Hacker
Administrator
Lifer
« Reply #3 on: January 13, 2009, 05:45:51 PM »

That's programmatically much stronger, Bomps, because if you can avoid pulling the trigger on the regex engine the code may be faster.

So let's see, if we PHP Bompa's example (again, just off the cuff, not checked)

Code:
<?php

// Skip past the scheme separator ('//') if the URL has one.
$pos = strpos($inURL, '//');
if ($pos !== false) {
    $inURL = substr($inURL, $pos + 2);
}

// Find the first slash after the host.
$target = strpos($inURL, '/');
if ($target === false)
{
    // The entire remains is the domain.
    $domain = $inURL;
    $uri = '/';
} else {
    $domain = substr($inURL, 0, $target);
    $uri = substr($inURL, $target);
}

// At this point, $domain and $uri contain what you are looking for.

?>

netmktg
Rookie
« Reply #4 on: January 14, 2009, 01:14:31 AM »

Quote from: perkiset
If all you're looking for is the complete domain, then you could do it something like this (off the cuff, not checked)

~[^/]+//([^/]+)/(.*$)~

which says exclude everything up till the first / (http:, https: etc), then skip over two slashes (//) then collect everything until the next slash, then skip the next slash, then take everything till the end of the URL.

Quick

I am a long-standing fan/user of EditPad Pro, which has regex search... I constantly check my regexes in EditPad before putting them in my code.

So, to cut to the point... your regex is broken (Roll Eyes) ... even after escaping the forward slashes
netmktg
Rookie
« Reply #5 on: January 14, 2009, 01:23:53 AM »

Quote from: Bompa
What I do is similar to what perk said.

removed 'http://'
add a trailing slash.
Determine first slash: $firstslash = index($string, '/');
Then I take the substring from index 0 to the first slash


Nope, Bompa & Perk... you are getting the host and NOT the domain. If you just wanted the host, then why write code for it? Just use PHP's built-in parse_url() function.
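For reference, a quick sketch of what the built-in gives you. The URL is just the example from earlier in the thread.

```php
<?php
$url = 'http://sub1.somedomain.co.uk/someurl.html';

// With no second argument, parse_url() returns an array of components.
$parts = parse_url($url);
echo $parts['scheme'], "\n"; // http
echo $parts['host'], "\n";   // sub1.somedomain.co.uk
echo $parts['path'], "\n";   // /someurl.html

// A single component can also be requested directly.
$host = parse_url($url, PHP_URL_HOST);
echo $host, "\n";            // sub1.somedomain.co.uk
```

The host comes back with the subdomain attached, so parse_url() alone does not answer the domain question.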

I use a custom parsing function that does a lot more (the original was posted by a contributor at us.php.net/parse_url)... let me post it here to show you how powerful/useful the function is...


Code:
function myparseurl($url)
{
  /*
####    Author : Anand (netmktg) ####
####    Date   : Sep, 2008 ####
####    Inspired by : Contributions at http://php.net//parse_url ####

Example URL http://me:you@sub.site.org:29000/pear/validate.html?happy=me&sad=you#gobottom
Output Array...

[0] => http://me:you@sub.site.org:29000/pear/validate.html?happy=me&sad=you#gobottom
[scheme] => http
[login] => me
[pass] => you
[host] => sub.site.org
[subdomain] => sub
[domain] => site.org
[extension] => org
[port] => 29000
[path] => /pear/validate.html
[file] => validate.html
[fileext] => html
[arg] => happy=me&sad=you
[anchor] => gobottom
  */

  $r  = "^(?:(?P<scheme>\w+)://)?";
  $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
  $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
  $r .= "(?::(?P<port>\d+))?";
  $r .= "(?P<path>[-_~\w/]*/(?P<file>[-_\w]+(?:\.(?P<fileext>\w+))?)?)?";
  $r .= "(?:\?(?P<arg>[\w=&\+ ]+))?";
  $r .= "(?:#(?P<anchor>\w+))?";
  // Delimiters
  $r = "!$r!";

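  $url = rtrim($url, '/');
  preg_match ($r,$url,$m);

  // Trailing optional groups that did not match are left out of $m
  // (and $m is empty if nothing matched), so default the keys used below
  $m += array('domain' => '', 'host' => '', 'path' => '', 'file' => '');

  $m[0] = $url;
  $m['domain'] = strtolower($m['domain']);
  $m['host'] = strtolower($m['host']);

  //If [path] is only a Trailing '/', set [path] to empty string
  $m['path'] = rtrim($m['path'], '/');
  //[file] without its extension
  $m['filenoext'] =  preg_replace('/(.+)\.(.*)/','\1', $m['file']);
 
  return $m;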
}

« Last Edit: January 14, 2009, 01:26:05 AM by netmktg » Logged
Bompa
Administrator
Lifer
« Reply #6 on: January 14, 2009, 03:08:30 AM »

Ok, I misunderstood you.  It can be confusing sometimes.  For example, I have never
registered "a host", but I have registered many domains (domain names).  And your talk of 2nd level
and 3rd level domains, went WAY over my pea-sized brain.

Why I write code to get the host rather than use a built-in php function?  First, I don't code
much php. Second, I like writing my own stuff, it's fun.  And yah, I like re-inventing the wheel,
it's fun for me and someday I might come up with a better wheel.

Anyways, good luck with parsing the TLDs, I don't dare post another suggestion in this thread.

Cheesy

netmktg
Rookie
« Reply #7 on: January 14, 2009, 03:29:30 AM »

Quote from: Bompa
Anyways, good luck with parsing the TLDs, I don't dare post another suggestion in this thread.

Cheesy

Well, then I ain't gonna sign up for BlackhatBootcamp


Quote from: Bompa
Ok, I misunderstood you.  It can be confusing sometimes.  For example, I have never
registered "a host", but I have registered many domains (domain names).  And your talk of 2nd level
and 3rd level domains, went WAY over my pea-sized brain.


.com .net .us .org  etc. are Top-Level Domains (TLDs)

BlackhatBootcamp.com is a Second-Level Domain... we all register 2nd-level domains


.org.uk .co.uk .com.au  etc. are second-level suffixes under a country-code TLD, so a name registered under them (somedomain.co.uk) is a Third-Level Domain


To the problem at hand: I need the domain and not the host, as I need to store this in MySQL, where only one URL is stored per DOMAIN. For example, wordpress.com has over 90 million subdomains and Blogspot.com has 353 million (as seen from Google), but in my DB they each get just ONE row, as subdomains don't count. And why don't subdomains count? Because those are what BH spammers create. And that includes me (ROFLMAO), and BH spam is the purpose of my (now oversized) PHP system. The thing has become a monster in the 5 months since I started work on it (Mobster)
« Last Edit: January 14, 2009, 03:32:10 AM by netmktg » Logged
perkiset
Olde World Hacker
Administrator
Lifer
« Reply #8 on: January 14, 2009, 09:04:34 AM »

Quote from: netmktg
So, to cut to the point... your regex is broken  Roll Eyes ... even after escaping the frontslashes

LOL - did you not even read the "off the cuff part" dude? Just trying to spin your gears. Take a red.

And re-reading your initial post, it's clear why I misunderstood it. My bad. But you could definitely have posted your function first and then mentioned what you were having trouble with, rather than taking this tack and, worse, slamming Bompa, who clearly was just helping and probably misunderstood because of my take on your post. It would be a mistake to confuse someone's misunderstanding of your post about a technical issue with a lack of strength in another discipline. So for that, fuck off.

But in answer to your question, parse_url will do what you are doing pretty admirably, except that it returns the host rather than just the domain. So by simply working from the host element of the array parse_url returns, you could do this much more quickly. You are correct: there is no machine-readable marker in a host string that would differentiate the reasons for 2 dots in a string, i.e. www.google.com and google.co.uk, both of which are obviously valid hosts.

If what you wanted in those two cases was to get the valid domain irrespective of an A/CNAME reference (www and such), then I'd do something akin to what you have suggested, but more so: I'd have an array of the known suffixes and either regex for them or check programmatically. For example (com|net|org|biz|co\.uk|me|tv)$ etc. If something did not match my regex I'd fail and have the 'bot email me so that I could update the code. Or, again, you could do it programmatically, because the regex would be really freaking long. A starter list of valid top levels is here: http://data.iana.org/TLD/tlds-alpha-by-domain.txt, and I'd imagine that you can find valid thirds easily enough.
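A minimal sketch of the regex variant, built from an array so the pattern does not have to be maintained by hand. The name domain_by_regex() is made up, the suffix array is just the short sample from the example pattern, and what to do on a miss (log it, email it) is left to the caller.

```php
<?php
// Hypothetical helper: one label, a dot, then a known suffix at the end of the host.
function domain_by_regex($host, array $suffixes)
{
    // preg_quote escapes the dots, so 'co.uk' becomes 'co\.uk'.
    $alternatives = implode('|', array_map('preg_quote', $suffixes));
    $pattern = '~([-\w]+\.(?:' . $alternatives . '))$~i';

    if (preg_match($pattern, $host, $m)) {
        return strtolower($m[1]);
    }
    return null; // unknown suffix: flag it for review instead of guessing
}

// Assumed sample list taken from the example pattern, far from complete.
$suffixes = array('com', 'net', 'org', 'biz', 'co.uk', 'me', 'tv');

echo domain_by_regex('www.google.com', $suffixes), "\n"; // google.com
echo domain_by_regex('google.co.uk', $suffixes), "\n";   // google.co.uk
var_dump(domain_by_regex('example.de', $suffixes));      // NULL
```

Feeding it the host element from parse_url would combine both suggestions: the built-in does the URL splitting and the suffix list decides where the domain starts.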

Good luck.
« Last Edit: January 14, 2009, 09:08:30 AM by perkiset » Logged
