[10:34] we have archived all ~140000 mp3s of earbits now
[11:04] 131,224 songs according to the frontpage
[11:06] details ;)
[11:29] WikiTeam vs. bitrot: 249 : 189 https://wikiapiary.com/w/index.php?title=WikiTeam_websites
[11:30] (I hope the actual number is better, not all dumps are mapped by wikiapiary yet.)
[13:15] schbirid: very cool
[14:22] this is all the elite-tnk stuff i could find on the internet, some mirrored from before various (gtoal, for instance) takedowns: https://dl.dropboxusercontent.com/u/79094972/elite-tnk.7z
[14:23] i have no idea how that should be archived, some probably should be blacked out
[14:23] gtoal stuff used to live at http://www.gtoal.com/athome/tailgunner/java/elite/elite-the_new_kind-1.0/
[14:24] elitegl still exists at http://web.archive.org/web/20060515000000*/http://homepage.ntlworld.com/paul.dunn4/elitegl.zip
[14:25] we have david braben to thank for all this stuff disappearing
[14:29] ooh one more file, just found it: ftp://faime.demon.co.uk/pub/files/newkind.tar.bz2
[14:35] wow that ftp is unreliable as hell, i may have just taken it down :(
[14:43] weird, it shouldn't be
[14:43] though, actually
[14:43] never mind
[14:43] that's a home-hosted box
[14:44] Lord_Nigh: demon is (was?) an ISP that allows every customer to get their own DNS subdomain for their home IP
[14:44] similar to what XS4ALL (and previously Demon NL before it got absorbed into XS4ALL) do
[14:57] that's sweet
[14:57] wish US ISPs did that
[14:58] I mean, I suppose I'd rather have properly priced symmetric broadband first
[14:58] lol
[15:26] there's something weird going on with my angelfire grab- it's now taking between 5 and 6 minutes per page downloaded. Started off very quickly, but now it's really, really slow
[15:27] give the tape feeding robot its time ;)
[15:37] heh
[16:29] yay now automatic links to inside tars! 1.0 GB [contents] https://archive.org/details/wiki.urbandead.com
[16:36] so, instead of the OOM killer ending my grab, wget just ran out of memory and killed the grab
[16:37] it looks like I got bogged down in a lot of blog links from the pages like this: http://midlandlocks.angelfire.com/blog/index.blog?start=1340286959 should I use --reject-regex (what one?) or reject domain?
[17:04] yay infinite calendars
[17:07] will a *blog* regex work like I want it to? (don't go to any page with blog in the url), or do I need a more targetted regex?
[17:08] '/blog/' should work, maybe even 'midlandlocks.angelfire.com/blog/' (never tried that and ignoring that the . = any character)
[17:09] maybe use '/blog/index.blog\?start=' rather?
[17:10] I imagine it's on a bunch of pages, so I guess /blog/index.blog\? would be a better choice
[17:10] it'd be nice if I had a good list of old-style angelfire pages, instead of the new ones
[17:12] anyone tried crawling a webring before? this page: http://leocentaur.angelfire.com/theremin.html mentions it is part of a webring
[17:12] http://www.angelfire.com/ga/quake4/
[17:12] http://www.angelfire.com/ult/td/
[17:12] are two old ones i know
[17:13] but wasnt there a list of them all available?
[17:13] dashcloud: http://www.angelfire.com/robots.txt
[17:14] aw yiss
[17:14] sitemaps <3
[17:14] thanks for archiving angelfire btw, great target
[17:15] now that makes things easier
[17:15] might not be complete but a great start
[17:15] is there a way to make wget multi-threaded, or do I need to just run multiple wgets?
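
A minimal sketch of the --reject-regex idea discussed above, assuming wget 1.14 or newer (which added --reject-regex); the target URL is just the example page mentioned in the log, and the other flags are placeholders for whatever the real grab command already uses:

    # Skip the infinite /blog/index.blog?start=... calendar pages.
    # How '?' must be escaped depends on --regex-type (posix by default,
    # pcre if wget was built with libpcre), so matching only up to
    # "index.blog" keeps the pattern portable.
    wget --mirror --page-requisites --no-parent \
         --reject-regex='/blog/index\.blog' \
         http://midlandlocks.angelfire.com/

Because --reject-regex is matched against the complete URL (query string included), this catches every calendar permutation without blocking the rest of the site.
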
[17:17] wpull might do what you want
[17:17] also might be a better choice for a large site like angelfire, because it uses sqlite instead of a list in ram
[17:20] I am definitely open to trying different things
[17:21] here's the current command I'm using for wget: http://paste.archivingyoursh.it/fucominiha.mel
[17:21] how do I get a copy of wpull, and what would an equivalent command look like?
[17:27] instructions are here https://github.com/chfoo/wpull
[17:43] what's that recommended upload script again? i forgot that curl cannot do multiple files at once and i dont want to use s3cmd again
[17:46] nvm, i guess i can use multiple --upload-file args for curl
[17:49] internetarchive-python?
[17:49] this is probably what you want: https://pypi.python.org/pypi/internetarchive
[17:50] yeah, that
[17:51] [18:29] yay now automatic links to inside tars! 1.0 GB [contents] https://archive.org/details/wiki.urbandead.com
[17:51] whoo, so the "add that to the item page" thing happened :D
[17:52] dashcloud: http://dir.yahoo.com/computers_and_internet/internet/world_wide_web/searching_the_web/webrings/
[17:52] but it's Yahoo
[17:52] so be fast
[17:52] lol
[17:54] if my wget grab fails for a fourth time, I'll switch to that instead
[17:55] joepie91: it did!
[18:04] greap, that python thing fails with "OverflowError: long int too large to convert to int" on my warc.gz
[18:05] and a similar issue has not been replied to in 2 months https://github.com/jjjake/ia-wrapper/issues/57
[18:09] schbirid, stop being a peasant and use 64-bit Python. :p
[18:09] SN4T14: the vm is 32 bit, so no
[18:10] Switch to a 64-bit one, then
[18:10] It's a Python problem, not a script one, it seems
[18:10] not my problem... >:(
[18:11] >.>
[18:11] Alternatively, split the files into <2GB chunks
[18:11] I'm guessing it's just taking the size in bytes and crapping it's pants.
[18:13] alternatively i will just use s3cmd again...
[18:48] hmm
[21:41] https://twitter.com/textfiles/status/478278683551993856
[21:42] "We will be purging the media files over the next couple of weeks. Please make sure to retain your original copies in case you decide to license them directly in the future."
[21:48] ooh fun: http://rawporter.s3.amazonaws.com/
[21:50] WE ARE PLEASED TO ANNOUNCE RAWPORTER HAS ENTERED INTO AN EXCLUSIVE BUSINESS PARTNERSHIP AND WILL BE JOINING A MUCH LARGER ORGANIZATION.
[21:50] oh fuck you yahoo.
[21:51] wtf is it with companies and misconfigured S3 buckets
[21:51] lol
[21:52] joepie91: in a good or bad way?
[21:52] good way
[21:52] https://rawporter.s3.amazonaws.com/?marker=photos/4r0um476v3qmjp.jpg
[21:52] here
[21:52] marker works
[21:52] problem solved
[21:52] cc garyrh
[21:52] so we can just iterate over their shit, basically
[21:53] yeah
[21:53] i'm looking to see how they store videos
[21:53] yey someone figured out markers.
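
Back to the warc.gz upload question earlier in the log: the internetarchive package linked there also ships an `ia` command-line tool, which is one way to push several files into one item without s3cmd or hand-built curl calls. The identifier, filenames and metadata below are made up for the example, and the exact flags may differ between versions:

    # one-time setup: stores your archive.org credentials
    # (config file location varies by version of the tool)
    ia configure
    # upload multiple files into a single (hypothetical) item
    ia upload example-warc-item example-site.warc.gz example-site.cdx \
        --metadata="mediatype:web" --metadata="title:Example WARC upload"

That sidesteps curl's one-file-per-URL limitation, though it obviously doesn't help with the 32-bit Python overflow mentioned a few lines later.
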
[21:53] plz be video/something.bla
[21:54] Smiley: markers are well-documented, they just didn't work with earbits because that wasn't direct S3
[21:54] but cloudfront-proxied
[21:54] which stripped GET params
[21:54] (which is why the params didn't appear to do anything)
[21:54] anwyay
[21:54] anyway *
[21:54] marker: just set the last Key you encountered as marker
[21:54] and it'll be the first Key of the next page
[21:54] all alphabetically sorted
[21:55] should be 10 minutes of writing Python, tops
[21:57] :p
[21:57] bright side: they gave us pictures of the founders, so that SketchCow can put them on the slide-of-shame for his next presentation
[21:57] indeed :p
[21:57] prominent figure: CTO Michael Robinson, thanks for not securing your S3 bucket
[21:58] hm, no CFO? perfect nobody to blame when they get the S3 bill
[21:58] hahaha
[21:59] I wonder what earbits guys will think
[21:59] "... the fuck is that spike"
[21:59] lol
[22:00] confirmed: videos are on s3
[22:00] I wonder how many people you guys have fucked over with multi-thousand dollar AWS bills. :p
[22:00] http://rawporter.s3.amazonaws.com/uploads/tmd57xoakjyy64.flv
[22:00] and the hand written letter from Jeff Bezos "Thanks for the bandwidth usage this week!"
[22:00] hehe
[22:01] garyrh: awesome, it'll be in the list then
[22:01] so yeah
[22:01] somebody write something that iterates
[22:01] over the /
[22:01] and you're done
[22:01] I can't do that now, I need sleep
[22:01] while true; do echo penis; done
[22:01] and I'll probably end up accidentally blowing up an S3 datacenter or something
[22:01] That's something that iterates
[22:01] :p
[22:01] Kind of
[22:01] Well, that looks
[22:01] hang on
[22:01] loops
[22:02] i'll give it a shot
[22:02] iteration=("iteration" "is" "silly"); for i in "${iteration[@]}"; do echo $i; done
[22:02] That's something that iterates! :D
[22:03] [00:01] while true; do echo penis; done
[22:03] I'm not sure I'd object to this
[22:03] also, ew, bash
[22:03] lol
[22:03] alternatively
[22:03] yes penis
[22:03] also, -bs
[22:03] does the same. :p
[22:04] ...I thought I was in -bs. :p
[22:04] bs indeed!
[22:04] as did I, actually
[22:04] lol
[22:04] ;-)
[22:04] They both just say #ArchiveTe in HexChat before getting cut off. :p
[22:04] lol
[22:11] I have too many ssh connections
[22:11] and too many tmux sessions
[22:17] db48x, put them all inside a tmux session! :D
[22:19] I put a tmux in your tmux so you can tmux while you tmux...
[22:19] Oh yes, tmux me harder. :p
[22:19] * SN4T14 moans
[22:21] * joepie91 raises eyebrow
[22:21] Queue the porn music!
[22:21] (Crap, this isn't -bs, why do we keep moving over? >.>)
[22:32] the difference between a window manager and tmux is pretty slim
[23:17] so what's the channel name for rawporter?
[23:18] db48x, i got a list of files from s3 and am currently downloading them
[23:19] estimated size?
[23:19] probably a few gigs
[23:20] that's pretty small
[23:20] yeah, i think the site was small
[23:20] do we know what their url structure was like?
[23:21] or were there any public videos at all?
[23:22] files were stored on http://rawporter.s3.amazonaws.com/
[23:22] pictures and videos
[23:22] sure, but if we're going to put up an archive of the site, it will help to know what was visible at what url
[23:24] profiles: http://rawporter.com/pm/NUMBER, media:http://rawporter.com/m/NUMBER
[23:24] but now the site is down, so it's hard to tell what else there is/was
[23:24] is the s3 down too now?
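
The marker loop described above, as a rough shell sketch (the channel suggests Python; curl plus grep does the same job for a quick pass). It assumes the bucket listing stays public and that no keys contain characters that would need URL-escaping:

    bucket="http://rawporter.s3.amazonaws.com"
    marker=""
    : > keys.txt
    while :; do
        page=$(curl -s "$bucket/?marker=$marker")
        # each listing page returns up to 1000 <Key> entries, alphabetically sorted
        echo "$page" | grep -o '<Key>[^<]*</Key>' | sed 's/<Key>//;s/<\/Key>//' >> keys.txt
        # S3 reports <IsTruncated>true</IsTruncated> while more pages remain
        echo "$page" | grep -q '<IsTruncated>true</IsTruncated>' || break
        # the last key seen becomes the marker for the next page
        marker=$(tail -n 1 keys.txt)
    done

From there, keys.txt can be turned into URLs by prefixing the bucket address and fed to wget -i (or whatever downloader is already in use).
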
[23:25] no s3 is still up, but you can't access the profiles/media on their website anymore
[23:26] https://webcache.googleusercontent.com/search?q=cache:V4yT2PCIjlkJ:rawporter.com/m+&cd=4&hl=en&ct=clnk&gl=us&client=firefox-nightly
[23:26] not very helpful
[23:27] yep, that's how I found out the s3 link.
[23:27] too bad there was no warning
[23:28] http://blog.rawporter.com/ is still there
[23:29] there's some cached pages of it on bing, might help
[23:33] multiple blog posts angry at how google does not disable right clicking to save images
[23:36] yea, hilarious actually