The Cache: Technology Expert's Forum
 
Author Topic: suggestion on efficiency  (Read 6533 times)

nutballs (Administrator)
« on: January 01, 2009, 07:44:16 PM »

My large network of autogenerated sites has lots of pages... so many that on some of my servers I am running into inode limits. Not space: inodes. What I currently do is store each page as a serialized string of information in its own text file, and have a template file that gets parsed, with the content from the serialized file replacing the tokens. The same idea would be a bunch of XML files with an XSL template that transforms the XML, but my way is faster and has fewer dependencies, as covered in another thread.
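For illustration, the render step boils down to something like this (the paths and the {{token}} format here are invented, not my actual layout):

<?php
// One serialized record per page, one shared template per site.
// Paths and the {{token}} format are invented for illustration.
$record   = unserialize(file_get_contents('/var/sites/example.com/pages/page-1234.txt'));
$template = file_get_contents('/var/sites/example.com/template.html');

// Swap each {{token}} in the template for the matching record field.
$html = $template;
foreach ($record as $token => $content) {
    $html = str_replace('{{' . $token . '}}', $content, $html);
}
echo $html;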

The problem is, the average site has 20k pages, so there are 20k files.
I have removed some redundancies I had, which shrunk the problem, but it's still huge.
So I was trying to think whether there is a way to reduce the inode/file count without increasing the RAM requirements or load times too badly.

If I cram each site into one big-ass file, some of the bigger sites could end up with a 250MB content file, and it would be a serialized array with 50k root elements to scan through. Technically it would work; I'm just not sure it will realistically.

thoughts?

perkiset (Administrator)
« Reply #1 on: January 03, 2009, 12:16:31 AM »

Why are you not MySQLing them? Seems like the quickest way to get past the inode problem, and you won't incur THAT much more overhead. That's what I'd do.
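Something like this hypothetical one-table layout would do it (all the names are my guesses at your setup, using the old mysql_* API):

<?php
// Hypothetical single-table page store; table and column names invented.
mysql_connect('localhost', 'user', 'pass');
mysql_select_db('network');

mysql_query("
    CREATE TABLE IF NOT EXISTS pages (
        site_id INT UNSIGNED NOT NULL,
        slug    VARCHAR(255) NOT NULL,
        content MEDIUMTEXT   NOT NULL,   -- the serialized page record
        PRIMARY KEY (site_id, slug)
    )
");

// Fetch one page's serialized record by site + slug.
$res = mysql_query(sprintf(
    "SELECT content FROM pages WHERE site_id = %d AND slug = '%s'",
    1, mysql_real_escape_string('about-us')
));
$record = unserialize(mysql_result($res, 0));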

nutballs (Administrator)
« Reply #2 on: January 03, 2009, 08:15:25 AM »

Portability. I don't know where this code will get put, and often there is no SQL access. Though it seems I may have to change that thought process.

perkiset (Administrator)
« Reply #3 on: January 03, 2009, 09:51:56 PM »

One big file with a sidecar index file and fseek, and you're still fast as anything. Deleting a page will be nasty, though.
Or perhaps something like this: http://www.c-worker.ch/txtdbapi/index_eng.php

Dunno. But you may want to consider MySQL a bottom-line requirement; otherwise you're going to be reinventing the wheel.
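A rough sketch of that big-file-plus-index idea (file names and index format invented; deletes are nasty because they mean rewriting and compacting the big file):

<?php
// Sidecar index: serialized array of page-key => array(offset, length)
// into the one big content file. Names and format are illustrative only.
$index = unserialize(file_get_contents('/var/sites/example.com/pages.idx'));

function fetch_page($key, array $index)
{
    if (!isset($index[$key])) return false;
    list($offset, $length) = $index[$key];

    $fp = fopen('/var/sites/example.com/pages.dat', 'rb');
    fseek($fp, $offset);          // jump straight to the record
    $raw = fread($fp, $length);   // read only that record, not all 250MB
    fclose($fp);

    return unserialize($raw);
}

$record = fetch_page('about-us', $index);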

nutballs (Administrator)
« Reply #4 on: January 03, 2009, 11:21:48 PM »

Yeah, I'm giving more thought to the MySQL requirement. I guess, because of the direction I have been heading with this, I can start requiring it.

I have thought of another option, which is the way I did my prior network: one database on the core machine, with all the nodes pulling from it in real time. It "was" fast enough back then, but I'm not sure about now. I guess I can do it that way to begin with and then, if it's too slow or bandwidth intensive, move to a local DB. Or maybe something in the middle, like caching the data as files the way I currently do, but only keeping about 3 days' worth: if the file exists, use it and "touch" it, else pull from the core server. Hmm, that might work, and that way I keep my current methodology anyway.

DangerMouse (Expert)
« Reply #5 on: January 04, 2009, 12:15:16 PM »

I appreciate that there may be a free requirement here, so this may not be an option, but how about using Amazon's S3 or their elastic cloud thing as a central storage medium, so your drone sites can just pull from there? They have the grid... might as well leverage it.

Just a thought.

DM

PS - There may be similar things you can do with Google's App Engine, but using Google services may not be wise ;)

nutballs (Administrator)
« Reply #6 on: January 04, 2009, 02:02:19 PM »

I had considered S3 for another project, in conjunction with Amazon EC2, but it was cost prohibitive, and frankly, EC2 is ridiculously confusing.

By my calcs it would be about $60/month just for data storage. Actually not too bad, especially since it should be fast as sin.
I guess that is worth reconsidering. It would mean my core server is now way overkill...

I guess I could just pull in real time from S3, which would make my remote sites even less intensive than they already are. Hmmmmm...

nutballs (Administrator)
« Reply #7 on: January 04, 2009, 02:15:00 PM »

Errr, wait. I just got totally confused.

S3 is just storage, so I guess I could store the pages in S3, but I would still need a database. I guess technically I could just have a root page that pulls the "correct page" from S3 and passes it through, template and all. Hmm, interesting...
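The pass-through page could be as dumb as this (bucket name and URL scheme are invented, and it assumes the objects are publicly readable):

<?php
// Root page that fetches the rendered page from S3 and echoes it through.
// Bucket and key layout invented; objects must be public for a plain GET.
$slug = basename($_GET['p']); // e.g. index.php?p=about-us

$page = file_get_contents(
    'http://my-bucket.s3.amazonaws.com/example.com/' . rawurlencode($slug)
);

if ($page === false) {
    header('HTTP/1.0 404 Not Found');
    exit;
}
echo $page;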

The other side is SimpleDB, apparently.
I can't figure out how the hell to calculate the possible pricing there, though.
It looks like it could cost over $100 a month, but I'm not really sure.

DangerMouse (Expert)
« Reply #8 on: January 04, 2009, 04:06:33 PM »

I really don't know what it costs, tbh; I've just read that it's cheap and good value. The idea was to solve the nodes problem: just import XML from S3 and render via a small dynamic script on each site/host, etc.

I'm pretty sure SimpleDB is a little bit like CouchDB's model, but again I know little about it, apart from liking the whole webservice/cloud thing and therefore thinking it's cool ;)

DM

nutballs (Administrator)
« Reply #9 on: January 04, 2009, 04:24:05 PM »

Yeah, I finished actually reading the info and overviews instead of just skimming.

Basically S3 is like a remote directory, if you want to think of it that way. So for me, that will be perfect. I will store my serialized page content at S3 and keep my template local to the site. For me, this means almost nothing changes methodology-wise; I store the content remotely instead of locally.

I will let you guys know how it goes. I think I am going to give it a whirl.

perkiset (Administrator)
« Reply #10 on: January 04, 2009, 11:30:03 PM »

Sounds like a pretty damn interesting model... looking forward to hearing whether bandwidth and response time booger you at all.

nutballs (Administrator)
« Reply #11 on: January 04, 2009, 11:34:09 PM »

Right as I was posting, you did, perk.

And... it's converted!

LOL, that was simple. Mostly because of the way I am doing things.

In case anyone wants to try it out, this PHP class made things uber simple:
http://undesigned.org.za/2007/10/22/amazon-s3-php-class
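Basic usage looks about like this (method names as I read them in the class docs, so double-check against the class itself; the keys and bucket are placeholders):

<?php
require_once 'S3.php';

// Placeholder credentials; see the class docs for specifics.
$s3 = new S3('ACCESS_KEY', 'SECRET_KEY');

// PUT one serialized page record up as bucket/sitename/filename.
S3::putObject(
    S3::inputFile('/var/sites/example.com/pages/about-us.txt'),
    'my-bucket',
    'example.com/about-us.txt',
    S3::ACL_PRIVATE
);

// GET it back down; ->body holds the raw file contents.
$object = S3::getObject('my-bucket', 'example.com/about-us.txt');
$record = unserialize($object->body);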

So I have been live for 5 minutes and.......
  $0.100 per GB - all data transfer in                    0.015 GB          $0.01
  $0.170 per GB - first 10 TB/month data transfer out     0.033 GB          $0.01
  $0.01 per 1,000 PUT, COPY, POST, or LIST requests       6,856 requests    $0.07
  $0.01 per 10,000 GET and all other requests             4,679 requests    $0.01

Currently it is costing me 2 cents a minute. LOL
I need to implement a caching solution, but for now, good enough. Woot!

Thanks DM. If anything, at least now I can add S3 to my list of crap I can charge clients for. Heheh.

I am gonna have to keep an eye on bandwidth at my nodes, since this technically doubles my usage: pull the page down from S3, then send it back out to the client browser.

@perk: speed so far seems to be no issue. Really, the only bottleneck would be the network.

nutballs (Administrator)
« Reply #12 on: January 05, 2009, 02:03:10 PM »

This exercise has become a reminder of why you should read the fscking manual...

So...

Here is what I have learned so far. The short version of how S3 works is this:
account: this is you and all your stuff.
bucket: this is like a drive, I guess. You can have up to 100 buckets.
object: files, basically. An object can have keys to make finding the file easier in the pile that forms in the bucket.

Up to 100 buckets. It would have helped if I had known that ahead of time... LOL
I designed my schema like this: 1 bucket per site, with each bucket containing the content files for that site.
Er... let's just say I have a few more than 100 sites... so that schema blew up fast. I ran out of buckets in about 30 seconds but didn't know it until this morning, when I got all my "your shit is broken" alerts from my system.

So it is literally a bucket concept: the bucket is infinitely big and can fit an infinite number of objects.
An object can have keys. They are assigned like this:
key1/subkey2/anotherkey3/object
Looks suspiciously like directories...
Good enough. I will now call it a DIRECTORY.
So...

My new methodology is:
1 big-ass bucket.
Each object is stored as sitename/filename.
That's it...

oh no... its not...

Reading other stuff and the comments related to the "why the hell can't I create any more buckets" threads, I came across this little gem:
when you list the contents of a bucket, it will only return up to 1000 results.
Errrr. I have well over 1000 results in my bucket... and even within a single key/directory. DAMMIT.
So, more backpedaling and doing things right...
Now I store a list of files for each domain locally, so I can check for a file without involving S3 at all. Considering the "list" commands cost more, this is a good move anyway.
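Roughly like this (the manifest path and one-key-per-line format are my own invention):

<?php
// Per-domain manifest: one S3 key per line, kept locally so existence
// checks never hit S3 (and never need a pricier LIST request).
$manifestFile = '/var/sites/example.com/manifest.txt';
$manifest = is_file($manifestFile)
    ? array_flip(file($manifestFile, FILE_IGNORE_NEW_LINES))
    : array();

// Local lookup, no S3 round trip.
$exists = isset($manifest['example.com/about-us.txt']);

// After a successful PUT, append the new key so the manifest stays current.
file_put_contents($manifestFile, "example.com/new-page.txt\n", FILE_APPEND);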

All in all, though, it is ridiculously simple to use, and it seems to be pretty damn fast.
PUTs are maybe a little slow, but GETs are damn fast. And since I think they use smart location stuff to put your files physically closer to the machines requesting them, it may even get faster. Not sure though, since I didn't really read... of course.

And considering getting a file is just a matter of GET: bucket/keys/filename,
it's pretty damn easy to understand.

nutballs (Administrator)
« Reply #13 on: January 09, 2009, 01:35:14 PM »

So, to further this exercise...

I might just implement this myself. S3 is cheap enough if all you ever do is PUT a file once; the problem is that I update my files every so often, and PUT requests cost 10x as much as GETs, which seems to be burning up the cash.

So I may just go ahead and implement my own remote storage methodology, since I do have a core server. But then, if I am going to do that, I might as well just go with the database model that I avoided in the first place. At least the nodes would not be dependent on a local database, just a single remote one. Which I am still not thrilled about.

perkiset (Administrator)
« Reply #14 on: January 10, 2009, 12:29:57 PM »

A cage, plus a VPN so you can talk to the machines in a native format, plus RPC-style pulls from your other boxes. Clean, easy.

It's an investment at first, but once you get there, it's something you'll never live without. I built my cage for this very purpose around 2000 and have never looked back. I can help with the details quite extensively if you'd like to go that way.

I'm not kidding, it's the deal once you get there. And then, whether you have MySQL on the remote boxes or not doesn't matter: your cage is the mother ship. And I use "cage" loosely; a couple machines with a firewall in front and you're golden.
« Last Edit: January 10, 2009, 12:31:29 PM by perkiset »
