The Cache: Technology Expert's Forum
 
Author Topic: Hadoop  (Read 4011 times)
kurdt
« on: September 15, 2009, 03:39:57 PM »

The more I study this Hadoop (MapReduce) database, the more it seems like The Database for large-scale data processing. I mean, I almost got a boner when I read that Google was handling over 20 petabytes per DAY back in January 2008 using MapReduce. I'm not actually sure whether Google uses Hadoop itself (they have said they "support it"), but Hadoop is an open-source implementation of Google's MapReduce model, so the approach is basically the same.

Is anybody else using Hadoop in their processing? It seems it can do A LOT of the pre/post-processing amazingly fast that I used to do in my own code when using MySQL. I have to admit that until I studied Hadoop, I hadn't realized how much you can actually do with database queries. I used to do all that mixing & shaking "manually" in code. I'm still a little confused about a lot of the stuff surrounding Hadoop & MapReduce, since it's low-level compared to my previous experience with high-level coding only. But I'm getting there.
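For anyone who hasn't seen the model before: the core idea is just a map phase that emits key/value pairs and a reduce phase that aggregates them per key. Here's a toy word count in plain Python to illustrate the shape of it — not real Hadoop code (which is Java), and the function names are made up for the example:

```python
from collections import defaultdict

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for every word in the document.
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group values by key; Reducer: sum the counts per word.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(pairs)
# counts["the"] == 2
```

The whole point is that map and reduce are independent per-record/per-key, so the framework can spread them over thousands of machines for you.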

If anybody wants to learn more about Hadoop, www.cloudera.com offers great video training that explains the basics. Now I just need to get a few of those Backblaze boxes and I'm good to go :)
« Last Edit: September 15, 2009, 03:41:31 PM by kurdt »

I met god and he had nothing to say to me.
kurdt
« Reply #1 on: September 16, 2009, 09:57:35 AM »

No replies? I thought you guys were into heavy processing? :D

I'm still a little torn between choosing Pig or Hive. HBase seems kind of blah, so it's either Pig or Hive. Based on Cloudera's demonstration I'm starting to lean toward Hive.
perkiset
« Reply #2 on: September 16, 2009, 10:04:26 AM »

Actually I should just grab popcorn here, because I know Hadoop and MapReduce but don't do any mining at that kind of scale... but just about any project the Apache Foundation has running is worthy of my attention in a big way.

It is now believed, that after having lived in one compound with 3 wives and never leaving the house for 5 years, Bin Laden called the U.S. Navy Seals himself.
kurdt
« Reply #3 on: September 16, 2009, 10:13:58 AM »

Quote from: perkiset
Actually I should have Popcorn because I know hadoop and MapReduce but don't do any mining of that kind of size... but just about any project that the Apache Foundation has running is worthy of my attention in a big way.
My next research in the queue is how to do local DNS lookup tables. As you know, I only know the basics of HTTP and the protocols behind the scenes, so it's going to be a challenge for me. This project is going to be so cool & exciting. When I'm ready to talk about it, I think a few people here will piss themselves, or at least get a weak boner.
perkiset
« Reply #4 on: September 16, 2009, 10:18:58 AM »

You mean files to configure named, or something more intricate?

BTW: try the blue pill if a weaky is the problem  ROFLMAO
kurdt
« Reply #5 on: September 16, 2009, 10:26:39 AM »

Quote from: perkiset
you mean files to configure named? or something more intricate?
Naah, I mean skipping the whole step of first fetching the IP for the domain. This is something I know will become a problem at some point, but at least while prototyping I'm not going to pay too much attention to it. But if you know any good page about the basics of how you would set up a server with DNS info stored locally, please do share :)
perkiset
« Reply #6 on: September 16, 2009, 10:31:19 AM »

Well, I've done a bunch of it and it really helped my email blaster for a while (until direct delivery was perceived as a spammer's technique... but there are other ways around it, so I'm considering going back...)

A simple start would be to install PEAR::Net_DNS, which will let you do queries, then simply drop the results in a local database. It's not really that tough at all. Can be lots of fun as well.
kurdt
« Reply #7 on: September 16, 2009, 10:36:40 AM »

Quote from: perkiset
a simple start would be to install PEAR::Net_DNS which will let you do queries, then simply drop the results in a local database. It's not really that tough at all.
Thanks man. I'll play with it and try to find a ready-made C++ library :)

I'm trying to transition to C++ now that I've realized how doomed SEO actually is as an industry.

Btw, do you happen to know which is the fastest database at the moment for small data storage? For example, caching frequently used small stuff. I mean, Hadoop with Hive/Pig is The Database, but the nature of Hadoop and the whole distributed file system is that the data chunks are huge, so it doesn't really fit working with web apps directly.
nutballs
« Reply #8 on: September 16, 2009, 10:41:55 AM »

I don't know either of those, because I've always just made the database do what I want one way or another.

I guess I should check them out.

I could eat a bowl of Alphabet Soup and shit a better argument than that.
kurdt
« Reply #9 on: September 16, 2009, 10:44:14 AM »

Quote from: nutballs
i dont know either of those because I always have just made the database do what i want one way or another.
Well, I know you get stuff done, but I challenge you to make a common database like MySQL "just work" with a dataflow like 20 PB/day, as in Google's case... if you can beat my challenge, you'll be a rich man :)

And I might add that performance isn't the only issue with the traditional model. It's all the issues scaling brings, like reliability and so on.
nutballs
« Reply #10 on: September 16, 2009, 11:26:40 AM »

Of course not. But I will never deal in PBs. Barely TBs.

And I just looked up Hadoop and saw the one thing that always makes me close my browser: JAVA. lol
Though I did just read this, which is interesting of course: "It's possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)[14]. As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[15]"

I am intrigued though, just because I like the idea of a true cloud architecture: a box fails, you just pull it and put in a new one, no biggie.
But am I missing something? How is Hadoop a database server? Or is it just a model to build or run a DB server on top of?
perkiset
« Reply #11 on: September 16, 2009, 11:28:55 AM »

Quote from: kurdt
Btw, you happen to know which is the fastest database at the moment for small data storage? Like for example caching frequently used small stuff.

MySQL with MyISAM is both easy and in the top tier, although I think each of the DBs has a particular benchmark to show why it's the fastest. If you install InnoDB you will suffer - but it will be required if you want to handle rollbackable transactions.

Using MySQL on a local box with MyISAM tables is as fast as poop through a goose. You'll be quite pleased, IMO.
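The "small local cache" pattern perk is describing is just one small keyed table on the local box. A sketch of the idea in Python, using the built-in sqlite3 as a stand-in for a local MySQL/MyISAM table (table and column names are made up for the example):

```python
import sqlite3

# In-memory DB stands in for a local MySQL instance with a MyISAM cache table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (k TEXT PRIMARY KEY, v TEXT)")

def cache_put(key, value):
    # INSERT OR REPLACE keeps exactly one row per key, like a keyed cache table.
    conn.execute("INSERT OR REPLACE INTO cache (k, v) VALUES (?, ?)", (key, value))

def cache_get(key):
    # Primary-key lookup: this is the fast path a cache table lives on.
    row = conn.execute("SELECT v FROM cache WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None

cache_put("frontpage", "<html>...</html>")
```

Because every read is a primary-key hit on a tiny table, the engine's locking behavior barely matters here — which is why MyISAM's table locks are fine for this use even though they hurt under heavy writes.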
isthisthingon
« Reply #12 on: September 16, 2009, 12:04:11 PM »

Quote from: kurdt
Is anybody else using Hadoop in their processing?

:o

Thanks kurdt - best tip I've had in a while. I've never used Hadoop, but I bookmarked hadoop.apache.org and love what I'm reading. As for small storage, I'd agree with perk and go MySQL. If possible, I never subject myself to built-in transaction handling.

I would love to change the world, but they won't give me the source code.
nutballs
« Reply #13 on: September 16, 2009, 12:32:21 PM »

Quote from: perkiset
If you install innoDB you will suffer - but that will be required if you want to handle rollbackable transactions.

Statement? Meet blanket!

Although a lot of the time that is the case:
Indexes are generally faster at larger scale in InnoDB because of key clustering, but the flip side is you need to go buy more RAM.
InnoDB has foreign keys, so you can cascade deletes.

The big reason for InnoDB is writes, though.
InnoDB does row locking; MyISAM does table locking.
Lots of writes = lots of waiting around for a free slot. I HAD to go with InnoDB for a few of the tables in my turd generator because writes are about 30% of the load on those tables.
InnoDB also handles tables over 2 GB regardless of host filesystem limits, by file splitting. That's also why some of my tables are InnoDB. A 100 GB MyISAM table, anyone?

The downside to InnoDB is the lack of fulltext indexing.
kurdt
« Reply #14 on: September 16, 2009, 12:37:46 PM »

Quote from: nutballs
I am intrigued though, just because I like the idea of a true cloud architecture. a box fails, you just pull it and put in a new one, no biggy. But am I missing something? how is hadoop a database server? or just a model to build or run a DB server on top of?
I'm not sure if that's what you meant, but it's not just about boxes; it's also about individual hard drives. You can pull them out very easily too, and nothing is lost.

Hadoop itself isn't "a database server"; it's more like a distributed, cloud-type file system for saving/reading data very efficiently and fast. At its core, MapReduce kind of resembles a normal database, but it's Pig and Hive that actually make it behave more like a traditional one.