kurdt
« on: September 15, 2009, 03:39:57 PM »
The more I study this Hadoop (MapReduce) database, the more it seems like The Database for large-scale data processing. I mean I almost got a boner when I read that Google was handling over 20 petabytes per DAY back in Jan 2008 with MapReduce. I'm not actually sure whether Google uses Hadoop itself; they have said "they support it", and Hadoop is an open-source implementation of the same MapReduce model, so the approach is the same either way. Is anybody else using Hadoop in their processing? It seems it can do A LOT of the pre/post-processing amazingly fast that I used to do in my own code when using MySQL. I have to admit that until I studied Hadoop I hadn't realized how much you can actually do with database queries. I used to do all that mixing & shaking "manually" in the code. I'm still a little confused about a lot of the stuff surrounding Hadoop & MapReduce, since it's low-level compared to my previous experience with high-level coding only. But I'm getting there. If anybody wants to learn more about Hadoop, www.cloudera.com offers great video training that explains the basics. Now I just need to get a few of those Backblaze boxes and I'm good to go.
« Last Edit: September 15, 2009, 03:41:31 PM by kurdt »
I met god and he had nothing to say to me.

kurdt
« Reply #1 on: September 16, 2009, 09:57:35 AM »
No replies? I thought you guys were into heavy processing? I'm still a little torn between Pig and Hive. HBase seems kind of blah, so it's either Pig or Hive. Based on Cloudera's demonstration I'm starting to lean toward Hive.

perkiset
« Reply #2 on: September 16, 2009, 10:04:26 AM »
Actually I should have, because I know Hadoop and MapReduce, but I don't do any mining at that kind of scale... Still, just about any project that the Apache Foundation has running is worthy of my attention in a big way.
It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.

kurdt
« Reply #3 on: September 16, 2009, 10:13:58 AM »
Quote from perkiset: Actually I should have, because I know Hadoop and MapReduce, but I don't do any mining at that kind of scale... Still, just about any project that the Apache Foundation has running is worthy of my attention in a big way.
My next research item in the queue is how to do local DNS lookup tables. As you know, I only know the basics of HTTP and the protocols behind the scenes, so it's going to be a challenge for me. This project is going to be so cool & exciting. When I'm ready to talk about it, I think a few people here will piss themselves or get a weak boner at least.

perkiset
« Reply #4 on: September 16, 2009, 10:18:58 AM »
You mean files to configure named? Or something more intricate? BTW: try the blue pill if a weaky is the problem.

kurdt
« Reply #5 on: September 16, 2009, 10:26:39 AM »
Quote from perkiset: you mean files to configure named? or something more intricate?
Naah, I mean skipping the whole step of first fetching the IP for the domain. This is something that I know will become a problem at some point, but at least while prototyping I'm not going to pay too much attention to it. But if you know any good page to read about the basics of how you would set up a server that has DNS info stored locally, please do share.

perkiset
« Reply #6 on: September 16, 2009, 10:31:19 AM »
Well, I've done a bunch of it and it really helped my email blaster for a while (until direct delivery was perceived as a spammer's technique... but there are ways around that, so I am considering going back). A simple start would be to install PEAR's Net_DNS, which will let you do queries, then simply drop the results in a local database. It's not really that tough at all. Can be lots of fun as well.
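The "do the query, then drop the results in a local database" idea can be sketched roughly like this. It is a minimal Python sketch, not the PEAR Net_DNS approach from the post: it assumes the standard library's socket.gethostbyname for the lookup and an SQLite table for storage. The names open_cache, lookup and dns_cache, and the fixed one-hour ttl, are made up for illustration; a real cache would honor each record's own TTL, which gethostbyname does not expose.

```python
import socket
import sqlite3
import time


def open_cache(path=":memory:"):
    """Open (or create) the hypothetical local DNS cache table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS dns_cache ("
        "host TEXT PRIMARY KEY, ip TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return db


def lookup(db, host, ttl=3600, resolver=socket.gethostbyname, now=time.time):
    """Return the IP for host, asking the resolver only on a miss or stale entry."""
    row = db.execute(
        "SELECT ip, fetched_at FROM dns_cache WHERE host = ?", (host,)
    ).fetchone()
    if row is not None and now() - row[1] < ttl:
        return row[0]  # fresh cache hit: no DNS round trip

    ip = resolver(host)  # cache miss or expired entry: do the real lookup
    db.execute(
        "INSERT OR REPLACE INTO dns_cache (host, ip, fetched_at) VALUES (?, ?, ?)",
        (host, ip, now()),
    )
    db.commit()
    return ip
```

Passing a file path to open_cache instead of ":memory:" would keep the table between runs, so a crawler or mailer only pays for each lookup once per ttl window.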

kurdt
« Reply #7 on: September 16, 2009, 10:36:40 AM »
Quote from perkiset: Well, I've done a bunch of it and it really helped my email blaster for a while (until direct delivery was perceived as a spammer's technique... but there are ways around that, so I am considering going back). A simple start would be to install PEAR's Net_DNS, which will let you do queries, then simply drop the results in a local database. It's not really that tough at all. Can be lots of fun as well.
Thanks man. I'll play with it and try to find a ready-made C++ library. I'm trying to transition to C++ now that I have realized how doomed SEO actually is as an industry. Btw, do you happen to know which is the fastest database at the moment for small data storage? For example, for caching frequently used small stuff. I mean, Hadoop with Hive/Pig is The Database, but the nature of Hadoop and its distributed file system is that the data chunks are huge, so it doesn't really fit working with net apps directly.

nutballs
« Reply #8 on: September 16, 2009, 10:41:55 AM »
I don't know either of those, because I have always just made the database do what I want one way or another.
I guess I should check those out.
I could eat a bowl of Alphabet Soup and shit a better argument than that.

kurdt
« Reply #9 on: September 16, 2009, 10:44:14 AM »
Quote from nutballs: I don't know either of those, because I have always just made the database do what I want one way or another.
Well, I know you get stuff done, but I challenge you to make a common database like MySQL "just work" with a dataflow like 20PB/day in Google's case... If you can beat my challenge, you'll be a rich man. And I might add that performance isn't the only issue with the traditional model. It's all the issues that scaling brings, like reliability and so on.

nutballs
« Reply #10 on: September 16, 2009, 11:26:40 AM »
Of course not. But I will never deal in PBs. Barely TBs.
And I just looked up Hadoop and saw the one thing that always makes me close my browser: JAVA. lol. Though I did just read this, which is interesting of course: "It's possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)[14]. As an example The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[15]"
I am intrigued though, just because I like the idea of a true cloud architecture. A box fails, you just pull it and put in a new one, no biggy. But am I missing something? How is Hadoop a database server? Or is it just a model to build or run a DB server on top of?

perkiset
« Reply #11 on: September 16, 2009, 11:28:55 AM »
Quote from kurdt: Btw, do you happen to know which is the fastest database at the moment for small data storage? For example, for caching frequently used small stuff.
MySQL with MyISAM is both easy and in the top tier, although I think each of the DBs has a particular test to show why it is the fastest. If you install InnoDB you will suffer, but that will be required if you want to handle rollbackable transactions. Using MySQL on a local box with MyISAM tables is as fast as poop through a goose. You'll be quite pleased IMO.

isthisthingon
« Reply #12 on: September 16, 2009, 12:04:11 PM »
Quote from kurdt: Is anybody else using Hadoop in their processing?
Thanks kurdt - best tip I've had in a while. Never used Hadoop, but I bookmarked hadoop.apache.org and love what I'm reading. As for small storage, I'd agree with perk and go MySQL. If possible I never subject myself to built-in transaction handling.
I would love to change the world, but they won't give me the source code.

nutballs
« Reply #13 on: September 16, 2009, 12:32:21 PM »
Quote from perkiset: If you install InnoDB you will suffer, but that will be required if you want to handle rollbackable transactions.
Statement? Meet Blanket! Although a lot of the time this is the case. Indexes are generally faster at larger scale in InnoDB because of key clustering, but the flipside is you need to go buy more RAM. InnoDB has foreign keys, so you can cascade delete. The big reason for InnoDB is writes though. InnoDB is row locking, MyISAM is table locking. Lots of writes = lots of waiting around for a free slot. I HAD to go with InnoDB for a few of my tables for my turd generator because writes are about 30% of the traffic on those tables. InnoDB also handles tables over 2GB regardless of host filesystem limits, by splitting across files. That's also why some of my tables are InnoDB. 100GB MyISAM table, anyone? The downside to InnoDB is the lack of fulltext indexing.
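The table-locking vs row-locking point can be illustrated with a toy model. This is only a back-of-the-envelope Python sketch, assuming every write arrives at the same moment, each write holds its lock for one time unit, and nothing else (reads, index maintenance, disk) costs anything. The function names are hypothetical, and it does not measure MySQL itself.

```python
from collections import Counter


def table_lock_time(row_ids, write_cost=1):
    """Table-level locking: every write waits for every other write."""
    return len(row_ids) * write_cost


def row_lock_time(row_ids, write_cost=1):
    """Row-level locking: only writes that hit the same row wait on each other."""
    if not row_ids:
        return 0
    busiest_row_writes = max(Counter(row_ids).values())
    return busiest_row_writes * write_cost


# Each entry is the row a concurrent write wants to touch.
spread_out = [1, 2, 3, 4]   # four writes, four different rows
hot_row = [7, 7, 7, 7]      # four writes, all to the same row

for name, writes in (("spread out", spread_out), ("hot row", hot_row)):
    print(name, table_lock_time(writes), row_lock_time(writes))
```

Under these assumptions, writes spread over different rows finish far sooner with row locks than with a table lock, while writes that all hit one row cost the same either way, which is why the write share of the workload matters so much for the choice.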

kurdt
« Reply #14 on: September 16, 2009, 12:37:46 PM »
Quote from nutballs: I am intrigued though, just because I like the idea of a true cloud architecture. A box fails, you just pull it and put in a new one, no biggy. But am I missing something? How is Hadoop a database server? Or is it just a model to build or run a DB server on top of?
I'm not sure if you meant this, but it's not just about boxes, it's also about individual hard drives. You can pull those out very easily too and nothing is lost. Hadoop itself isn't "a database server"; it's more like a distributed, cloud-type file system for saving and reading data very efficiently and fast. At its core MapReduce is kind of reminiscent of a normal database, but it's Pig and Hive that make it actually behave more like a traditional database.
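For anyone wondering what MapReduce does at its core, here is a single-process Python sketch of the three steps (map, shuffle/group by key, reduce) using the classic word count. It assumes nothing about Hadoop's actual API: the function names are invented for illustration, and a real Hadoop job would run the map and reduce steps on many machines over data in the distributed file system.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper over every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    # Add up the 1s emitted for this word.
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Hive and Pig sit one level above this: you write a query or a data-flow script, and they turn it into jobs of this map/shuffle/reduce shape, which is where the database-like feel comes from.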