#archiveteam-bs 2013-05-03,Fri


Time Nickname Message
00:23 🔗 Smiley awww hell wtf
00:23 🔗 * Smiley ponders what to tell the doctor tomorrow
00:23 🔗 Smiley "HI, most of the time i feel completely normal, apart from those times I wish I didn't exist."
00:23 🔗 omf_ You have moments that make you anxious for no reason
00:24 🔗 Smiley yes
00:25 🔗 Smiley Dr Cope << hahaha
00:25 🔗 omf_ That is how you should explain it and then give a few examples.
00:25 🔗 Smiley yes.
00:25 🔗 Smiley yes I should
00:27 🔗 dashcloud from twitter: Allow me to donate this slogan to any evanescent startup with no business model beyond acquisition. "We'll make you the star in rm -rf *"
00:27 🔗 omf_ yes dashcloud
00:49 🔗 omf_ Wikis must drive OCD people nuts
01:10 🔗 DFJustin yeah I put in like 7000 spelling and formatting edits to wikipedia before basically giving up
01:18 🔗 omf_ I think our wiki is a lot better than it was last time this year
01:19 🔗 omf_ More content and much less spam
01:19 🔗 Aranje so I was just handed a printout of someone's geocities site that archiveteam saved. He was very happy.
01:23 🔗 SketchCow awww
02:29 🔗 omf_ Looking at the 60-day history, there was a fuck ton of backing off http://archive.org/stats/s3.php#60d
03:19 🔗 omf_ 753 of the top 1 million sites are google
03:20 🔗 omf_ #2, #12, #23, #25, etc....
04:35 🔗 omf_ On a 2gb 2core butt I can grab 7 screenshots at a time
04:35 🔗 omf_ CPU usage being the primary limiter
04:36 🔗 omf_ I am slowly scaling this up to see if any bottlenecks arise
04:36 🔗 omf_ The end goal being able to just fire this up and capture any amount of pages in a reasonable amount of time
04:37 🔗 omf_ I still cannot figure out why images were not showing up from amazon on the posterous test
04:57 🔗 instence what is the best way to resolve a large batch of URL's that are being redirected? aka have been "moved"
04:58 🔗 instence is there a better way than wget spider?
04:58 🔗 instence or cleaner I should say
05:07 🔗 omf_ are you just trying to get the resolve urls?
05:09 🔗 instence yep, i am using an app to check the status of a large URL list, which kicks back 404 not found, "ok", timed out, no connection, no such host, etc
05:10 🔗 instence and some are 301 object permanently moved, or 302 object temporarily moved
05:10 🔗 instence and I want to just get back where they were moved to
05:13 🔗 omf_ how big a url list
05:19 🔗 instence Hmm anywhere from 50 to 500,000
05:19 🔗 instence i think on average it would be 100-1000 url's that I would be trying to resolve
05:22 🔗 instence I am basically using this app to chew through sites I have archived for any externally linked URLs, and I run a check on those, looking for more sites to archive, and I do this on each site.
05:22 🔗 instence The list of potentials grows and grows
05:22 🔗 omf_ Yeah I am dealing with that kind of problem right now
05:22 🔗 omf_ I am testing out 7 urls at a time
05:24 🔗 instence I use a windows app called XENU to do URL checking
05:24 🔗 omf_ Scaling up the server size will help but I am looking into dns caching as well
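Working through a list "7 urls at a time" can be sketched with a bounded worker pool. This is only an illustration, not omf_'s actual setup: `run_batch` and `fetch` are hypothetical names, and `fetch` stands in for whatever does the real work (a status check, a screenshot).

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(urls, fetch, workers=7):
    """Apply fetch() to every URL with at most `workers` running at once.

    Results come back in the same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    # Stand-in fetch function; a real one would do network I/O.
    print(run_batch(["http://example.com/a", "http://example.com/b"], len))
```

Threads suit I/O-bound work such as status checks; CPU-bound work like page rendering would more likely need separate processes or a bigger machine.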
05:27 🔗 instence I could use wget to spider, and set a level depth limit, causing it to resolve, but the verbose output is too much.
05:27 🔗 instence I would love to just get a straight list somehow
05:28 🔗 instence i might have to do that though, and pump the output through SED or something
05:28 🔗 instence just that I know how to do that in theory, not necessarily in practice off the top of my head lol
05:28 🔗 instence as i am not a wizard in regex yet
06:09 🔗 brayden Ah yeah Xenu is really good.
06:15 🔗 brayden instence, so if they give a 3xx HTTP result, the script needs to just say where the new location is?
06:17 🔗 instence yea
06:18 🔗 instence I export what is in xenu to tab list and then into spreadsheet
06:19 🔗 instence basically i want to just take the list of url's that are 3xx and resolve them, to find out their new location
06:20 🔗 brayden Well whilst I was waiting for you to respond I made a Python script using some code I stole off of stackexchange, like usual, which kind of does that
06:20 🔗 brayden http://brayden.ur.cx/redirect.py
06:20 🔗 brayden it'll output 1 on the first line if a redirect then 2nd line it'll include the new location
06:20 🔗 brayden otherwise 0
06:20 🔗 brayden using sed it would be something like sed -n1p or whatever it was to get the result
06:21 🔗 brayden ah I was close
06:21 🔗 brayden sed -n 1p
06:21 🔗 brayden for line one, apparently
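A minimal sketch of a check like the one described above, using only the standard library. This is not the contents of redirect.py, which are not shown in the log; the names `interpret` and `check_redirect`, the choice of a HEAD request, and the set of status codes are all assumptions.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes treated as redirects here (an assumption, not from the log).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def interpret(url, status, location):
    """Return (1, new_url) for a redirect response, otherwise (0, None)."""
    if status in REDIRECT_CODES and location:
        # The Location header may be relative, so resolve it against the URL.
        return 1, urljoin(url, location)
    return 0, None


def check_redirect(url, timeout=10):
    """Send one HEAD request without following the redirect."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return interpret(url, resp.status, resp.getheader("Location"))
    finally:
        conn.close()
```

Calling `check_redirect("http://example.com/")` returns a flag and, for a 3xx response, the new location. Only the first hop is reported; chains of redirects would need a loop.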
06:21 🔗 brayden Is the spreadsheet just a csv or is it a proper excel one?
06:22 🔗 instence well, it can be anything really, as what gets dumped from xenu is just a tab delimited file
06:23 🔗 instence i usually drop it into a spreadsheet just to do sorting and contains searches, categorizing what is there
06:23 🔗 instence i would filter by 3xx and just select the urls, paste to a txt file and want to run that list through something to get the resolved URLs
06:25 🔗 instence thanks for the python example
06:25 🔗 brayden parsing tab delimited files should be pretty easy.
06:25 🔗 brayden just making a thing for that now
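Pulling the 3xx rows out of a tab-delimited export could look roughly like this. The column layout is an assumption (URL in the first column, numeric status code in the second), since no sample line from the Xenu export appears in the log; `redirected_urls` is a hypothetical name.

```python
import csv


def redirected_urls(lines, url_col=0, status_col=1):
    """Return the URLs whose status column holds a 3xx code.

    The column positions are assumptions; adjust them to the real export.
    """
    urls = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= max(url_col, status_col):
            continue  # blank or short row
        status = row[status_col].strip()
        if len(status) == 3 and status.isdigit() and status[0] == "3":
            urls.append(row[url_col].strip())
    return urls
```

`lines` can be an open file object or a list of strings; a header row is skipped automatically because its status field is not numeric.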
06:26 🔗 instence hmm almost wonder if I could do something like that in php as well, as I have written scraper scripts in php that work great
06:26 🔗 instence this is cool cause it will get me into learning a bit of python, which I haven't worked with before ;)
06:26 🔗 brayden Can you provide an example of a line from the tab delimited file?
06:27 🔗 instence I don't even know if I would process the tab delimited file. As I would do my sorting first and just select the range of URL's that I want to resolve so honestly whatever the script targets it could be a txt file with just a single list of URL's
06:28 🔗 instence so urls.txt, and just 100 URL's or whatever
06:30 🔗 brayden C:\Users\brayden\PycharmProjects\PythonXMLTest>python show-redirect.py urls.txt
06:30 🔗 brayden 1 http://www.google.com/
06:30 🔗 brayden 1 http://archiveteam.org/index.php?title=Main_Page
06:30 🔗 brayden 1 http://www.google.com.au/
06:30 🔗 brayden seems to be working
06:31 🔗 instence awesome
06:31 🔗 brayden http://brayden.ur.cx/redirect.py just updated this one
06:31 🔗 brayden it has no error checking though or anything like that
06:31 🔗 brayden if a site goes down or times out for whatever reason it will crash
06:31 🔗 brayden really easy to fix though if that becomes a problem
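One way to keep a dead or slow site from crashing the whole run is to catch the network errors per URL. A sketch under assumptions: `resolve_all` is a hypothetical name, `check` is any function returning a (flag, location) pair, and the "E" line for failures is an invented convention, not something the original script prints.

```python
import http.client


def resolve_all(urls, check):
    """Yield one output line per URL.

    `check(url)` must return (flag, location); network failures are
    reported on an "E" line instead of stopping the loop.
    """
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines in urls.txt
        try:
            flag, location = check(url)
        except (OSError, http.client.HTTPException) as exc:
            # OSError covers timeouts, refused connections and DNS failures.
            yield f"E {url} {exc.__class__.__name__}"
            continue
        yield f"1 {location}" if flag else "0"
```

With a file, usage would be along the lines of `for line in resolve_all(open("urls.txt"), check): print(line)`.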
06:32 🔗 instence cool, at least if something does bork I will know why
06:33 🔗 instence this is great though
06:33 🔗 instence thanks a bunch for the script
06:33 🔗 brayden no worries. Python is a really easy language to learn so you should try
06:34 🔗 instence quick question: does it treat tabs/spaces as contextual? since there are no curly braces? like coffee script?
06:35 🔗 brayden for every line in the text file it'll just send the text off to the testUrl function
06:35 🔗 instence oh i mean for the python code itself
06:35 🔗 instence like the syntax
06:35 🔗 brayden oh
06:35 🔗 brayden it uses indenting
06:36 🔗 brayden after every : you raise the indenting level
06:36 🔗 instence ok cool
06:36 🔗 brayden It doesn't use braces or anything
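A tiny illustration of the indenting rule: each line ending in `:` opens a block, and the block lasts as long as the deeper indentation does. The function name `describe` is made up for the example.

```python
def describe(status):
    # The colon above opens the function body, indented one level.
    if 300 <= status < 400:
        # The colon after the condition opens a second level.
        return "redirect"
    # Back at the first level: this runs when the condition was false.
    return "other"
```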
06:37 🔗 instence this is the perfect thing to get me into experimenting with python, as I can already clearly see what this script is doing and it relates to stuff I need to do
06:37 🔗 instence so I can probably work from this, expand on it, and branch out into other tasks I need done
06:37 🔗 instence so looks like this weekend I will be getting into python a bit ;)
06:37 🔗 brayden I'll be here around this time for the weekend if you need to ask anything
06:38 🔗 instence ok cool, i will most likely be up Fri/Sat night all night working on various archiving tasks
06:39 🔗 instence so I will ping you if I have any questions
06:39 🔗 instence I am going to get python up and running locally tomorrow night and test out this script for myself
06:39 🔗 brayden yeah well python is really easy to setup on windows/linux so you shouldn't have any trouble
06:40 🔗 brayden this just uses standard libraries
06:41 🔗 instence I have set up python before, for some application dependency in the past. It was pretty straightforward.
06:44 🔗 brayden Do you just use notepad++ or something like that to write it?
06:48 🔗 instence like as far as preferred editor?
06:49 🔗 brayden well what are you planning on using?
06:50 🔗 instence well, I am sort of IDE agnostic at the moment. At work I use a combination of eclipse, notepad++, and Dreamweaver (coder mode only, mainly because you can custom collapse any selected region of code)
06:50 🔗 instence though I have ambitions to learn VIM and migrate to that
06:50 🔗 instence I am designer/front end developer
06:52 🔗 instence if I write python it might be in notepad++ for now
06:54 🔗 brayden Well PHPstorm can collapse bits of code and I bet it has a hell of a lot better completion than Dreamweaver as well as being way cheaper.
06:55 🔗 brayden and having Linux/Mac/Windows support
06:55 🔗 brayden Personally I have a lot of trouble remembering things so I use a fully fledged IDE
06:55 🔗 brayden PyCharm is the one I use. It is pretty expensive and I'm a student so can't really afford it.
06:55 🔗 brayden Had to go via other means :(
06:55 🔗 brayden but they have a 30 day trial
06:56 🔗 brayden A lot of nice features are shared between editions, for instance, I'm doing a tornado template and it has detected it is HTML syntax.
06:56 🔗 brayden and providing me with proper completion etc.
06:56 🔗 brayden it even was nice enough to download the twitter bootstrap js and provide completion with that!
06:57 🔗 brayden https://www.jetbrains.com/ anyway it is this lot. They do some seriously good software!
07:00 🔗 instence cool I will have to check it out
07:01 🔗 instence I actually have been shying away from hinting/completion aside from auto-closing html tags and auto-generating blocks of preformatted code
07:01 🔗 brayden Why?
07:01 🔗 instence basically because I found myself getting too reliant on the auto completion, and i didn't really know the languages as well
07:01 🔗 brayden Well yeah there is that.
07:02 🔗 brayden But that's not really a problem of using auto completion, I reckon at least.
07:02 🔗 instence so to force myself to remember and learn the languages better I have been just referring to api documentation and writing as much as I can
07:03 🔗 instence yea it's not that auto completion is the problem per se
07:03 🔗 brayden and few auto completions are powerful enough to let you get away with that anyway
07:03 🔗 brayden I still have to go through module docs all the time.
07:05 🔗 instence I am going to checkout PyCharm
07:05 🔗 instence looks interesting
07:06 🔗 brayden They made a nice theme for it too which is pretty easy on the eyes
07:06 🔗 brayden this whole "darcula" thing they're integrating into their stuff
07:06 🔗 instence lol
07:06 🔗 instence nice name
09:30 🔗 godane so i found another episode of gamespot tv
09:30 🔗 godane i really wish these rips were at 165kbs
09:31 🔗 godane maybe not that great but still a lot better than dialup
10:53 🔗 godane yes
10:53 🔗 godane found another episode of call for help
11:13 🔗 Smiley psychiatric nurse :o
13:53 🔗 Smiley raaawr
14:06 🔗 ersi Raring rawrtail
14:09 🔗 DFJustin another gaming gem rescued from the dustbin of history http://archive.org/download/Nextys_Archive/OW__B.ISO/Butt_Slam%2FBUTTSLAM.ZIP
14:10 🔗 SketchCow Whew, back
14:10 🔗 ersi Butt slam :D Bwahaha
14:10 🔗 Cameron_D http://www.theverge.com/2013/5/3/4294548/tears-in-rain-how-snapchat-showed-me-the-glory-of-fading-data
15:03 🔗 sep332 Hackers for Charity documentary has 2 days left http://www.kickstarter.com/projects/1456247168/hackers-in-uganda-a-documentary?ref=live
15:39 🔗 SketchCow I won't support it
15:39 🔗 SketchCow They basically dumped in 5k of their own money to guarantee investment
17:57 🔗 DFJustin http://www.forbes.com/sites/andygreenberg/2013/05/03/this-is-the-worlds-first-entirely-3d-printed-gun-photos/
18:28 🔗 SketchCow Thingiverse dropped the funs
18:28 🔗 SketchCow guns
18:28 🔗 SketchCow Talked to Bre about it
18:28 🔗 SketchCow Told him I expected it, I said I expect Guniverse within seconds and he says there's already one
18:33 🔗 chronomex haha
18:33 🔗 SketchCow http://i.imgur.com/KBCsbVi.gif
19:14 🔗 omf_ I am hitting the final stretch of having refreshed all my backups. It is a huge relief. Sometimes you delete a few old things, most of the time you add more.
19:22 🔗 omf_ only 21gb left to go through
19:23 🔗 omf_ I really should get more drive trays but they are $17 each.
19:26 🔗 omf_ I mention this because a friend just had his 3rd hard drive failure since I have known him and he lost everything again
20:04 🔗 Smiley Ok so another week off work :/
20:05 🔗 Smiley however that means I may be crunching again late at night, as that seems to be the time I become active here.
20:05 🔗 Smiley However right now my only concern is this vodka.
20:46 🔗 ersi soultcer: Haha, thanks for the chocolate! Awesome packaging
20:49 🔗 ersi Super tasty :3
20:51 🔗 Smiley :O
21:00 🔗 ersi http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/100000/80000/3000/300/183359/183359.strip.gif
21:11 🔗 Coderjoe there already was defcad.org
21:48 🔗 DopefishJ woo got all of simtelnet on my laptop
21:48 🔗 DFJustin 1997 me would be so jealous
21:49 🔗 DFJustin seems to be missing a lot of games compared to what I remember though
21:49 🔗 DFJustin the mirror at ftp.riken.go.jp is the same so maybe they cleared them out at some point
