[00:01] Google Reader contains feeds for thousands of deleted blogs
[00:06] akkuhn,
[00:06] apparently
[00:06] you cannot run rm -rf /
[00:06] successfully in that emu
[00:06] it's not a good emo
[00:06] *u
[00:07] lol
[00:12] I think we should start collecting everyone's .opml files so that the obscure dead blogs can get backed up
[00:13] is there a convenient HTTP server for collecting uploads?
[00:47] ivan`: It's true, all those entries I starred years ago on blogs that no longer exist are still available in Google Reader.
[00:48] well remember, if we put a ton of pressure on google, they might just backpedal
[00:48] they did for google video (though they eventually killed it anyway)
[00:51] viseratop: I have some ideas for getting as many feed URLs as possible
[00:51] the best is to search some giant web crawl for *.blogspot.com *.tumblr.com etc and infer the RSS URL
[00:52] another thing is a bookmarklet or userscript that uploads your feed list somewhere
[00:52] should be more convenient than Google Takeout
[01:00] I'd like to run the Posterous backup
[01:00] Anything special I need to do?
[01:01] run warrior, click posterous
[01:01] "NOTE: Posterous will ban you. Ask on EFnet IRC, #archiveteam, before running this."
[01:02] yeah, they will ban you after a while for a bit
[01:02] ok
[01:02] unless you browse posterous regularly you shouldn't have to worry
[01:02] I don't
[01:02] Just thought there might be a special workaround like routing your requests through tor or something
[01:03] the bans aren't really slowing us down; rather the posterous infrastructure is very fragile
[01:04] ok thanks
[01:06] if you need further help, please join #preposterus
[01:14] so uh
[01:14] do we save google reader?
[01:19] robbiet4-: get every single feed URL first
[01:20] heh
[01:20] heh
[01:20] uh
[01:20] * robbiet4- /parts
[01:20] grabbing the text content from those is probably easy
[01:20] * ivan` misread; thought there was a "how" in there
[01:23] Wow, I love running around punching the living shit out of people about this RSS thing.
[01:23] That's how I want to spend the time
[01:33] what's this place about, is there a website?
[01:34] I know it's for archiving information, maybe databases, open source media?
[01:34] http://www.archiveteam.org/index.php?title=Main_Page
[01:42] thanks
[01:58] SketchCow: I sense the thud of thousands of twitter insects against glass this evening.
[02:05] yeah
[02:05] Well, just a few.
[02:06] So, now that I've been living the Mascot of Archive Team lifestyle for a few years, and thanks for that, by the way....
[02:06] ... there's a lot of "oh look, THAT again" horseshit that gets in my face.
[02:06] The question that comes to mind is whether I want to go ahead and engage, or let it roll off like rain. It's a debate.
[02:06] So we get the "who cares about this old shit" people.
[02:07] We get the "well, of course I won't PAY for this service, fuck that" people.
[02:07] You know, the people who use their shopping bags six times and have a drawer of mustard packets
[02:08] requesting permission to run posterous project on my warrior client
[02:09] i used to be one of those "well of course i won't PAY for this service" people. then i realized i work for a software company and people paying for services is what lets me eat.
[02:15] Well, look.
[02:15] Outside of the "why pay for it" situation where I realize you don't want to get wallet-raped, i.e. pay $60 for a game you pwn in 5 hours...
[02:16] ....at some point, you realize if you're spending days, endless days, using an item that has a back infrastructure you benefit from, a few bucks makes sense.
[02:16] Then at least you can complain.
[02:16] agreed. there's a pervasive sense of entitlement. motherfuckers took away my free service!
[02:16] So I have, for example, Flickr, GMail for Business, Namecheap/EasyDNS, Newsblur, and a few others I pay for.
[02:16] HOW DARE YOU take away my free service. i doth protest!
[02:17] Outside of the DNS ones that are notably more expensive, I bet I probably spend $200 a year.
[02:17] With DNS and remembering a few other things I pay for, I can't be paying more than $500 in total.
[02:17] $500!
[02:17] somehow $200 of network infrastructure seems absurd, but folks will gladly piss that away on 40 coffees.
[02:18] and then there's Pinboard, which is definitely worth every dollar I paid for it
[02:18] For shit I use and depend on 24/7. And as you ALL know I am one outlier motherfucker, I use these services until the disks actually make Paranormal Activity Quality wailing sounds
[02:19] partly a perception of value at scale too. it's easier for someone to process $12 a year for an RSS feeder when it's all a company does, much less for them to see the value in Google Toiletpaper because "dude, that costs them like nothing to run"
[02:20] coincidentally, i will laugh if google reader becomes a google apps-paid only feature. oh the uproar.
[02:21] "@textfiles I've got 4TB of 356K of github repos downloaded, I'm quickly running out of harddrive space. Whats the best way to get you a copy"
[02:21] * SketchCow flashes the horns
[02:21] ROCK. ON.
[02:22] NICE
[02:22] Told him to come here, he's officially a member.
[02:24] reposting a request for info on the posterous project, since the warrior prompts people to ask for info here
[02:24] permission granted, go ahead
[02:24] ty
[02:24] perhaps we should remove the warning
[02:25] helllloooooooooo nurse
[02:26] the hero github deserves, but not the one it needs right now.
[02:26] greetings.
[02:26] (actually i don't know what the hell they need right now)
[02:27] They need a hug
[02:27] so i've got about 4TB of github repos (369734) downloaded for a project i'm working on
[02:27] I should point out their offices are great.
[02:27] What's the archive format, WiK.
[02:28] right now it's all just git clones, none compressed
[02:28] i was gonna use bitcasa to dump them all too, but it keeps crashing when i try to robocopy
[02:28] 'git bundle' is a good storage format
[02:28] WiK: I hope you are using git --mirror (implies --bare) to save space
[02:28] ya, but i don't want them bundled, otherwise it's more of a pita for me to process
[02:28] ivan`: i'm not, as i want all the branches
[02:29] https://github.com/wick2o/gitDIgger
[02:29] WiK: you still get all of the branches
[02:29] ivan: then what's the difference
[02:29] you don't waste a ton of space on a checkout of HEAD
[02:30] ahhh, well i'll modify my script to use that now..brb
[02:30] you can convert existing repos to bare repos but try not to accidentally rm 4 TB ;)
[02:30] :P
[02:31] well, gitpython won't let me use --mirror with their .clone()
[02:32] ivan`: that's where find / -iname HEAD -exec would come in handy
[02:35] I just use os.system
[02:35] I just submitted a CFP to defcon for this little project
[02:35] actually often subprocess.call et al
[02:36] had someone today fulfill a wishlist 4TB harddrive since i was almost outta space
[02:36] nice.
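(A minimal sketch of the bare-mirror setup ivan` is suggesting here, done with subprocess instead of GitPython's .clone(), since the latter reportedly wouldn't pass --mirror for WiK at the time. The example repo URL and destination path are placeholders, not part of anyone's actual pipeline.)

    import subprocess

    def mirror_clone(clone_url, dest_dir):
        # --mirror implies --bare: every ref is copied verbatim and nothing is
        # checked out, so you skip the working-tree copy of HEAD that wastes
        # space in a plain clone while still keeping all branches and tags.
        subprocess.check_call(['git', 'clone', '--mirror', clone_url, dest_dir])

    def update_mirror(dest_dir):
        # refresh an existing mirror in place; no checkout is ever created
        subprocess.check_call(['git', '--git-dir', dest_dir, 'remote', 'update'])

    if __name__ == '__main__':
        # placeholder repo; swap in whatever the crawler hands you
        mirror_clone('https://github.com/wick2o/gitDigger.git', 'gitDigger.git')

An existing non-bare clone can be converted after the fact, as discussed above, but re-cloning with --mirror is usually less error-prone than doing surgery on hundreds of thousands of working directories.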
[02:36] ya, i've got 5 external drives connected to my machine right now :)
[02:37] how much sense would it make to bundle and upload?
[02:37] since bundles seem to be good for archiving...
[02:38] idk, i posted to jscott on twitter and he told me to join here (didn't know there was an irc channel)
[02:38] also you're probably aware that we have a list of GH users from late last year...
[02:38] SketchCow is jscott
[02:38] ahhhh
[02:38] didn't know that, i'm just using their api and started with repo id # 1 and going from there
[02:39] nice
[02:39] aaah :P
[02:39] how does their API deal with private repos? just throws a 403?
[02:39] i had the anonymous api limit lowered from 1k requests an hour to 60 :)
[02:39] oh you were using the anonymous API?
[02:39] i was at first :)
[02:40] heh it probably could be done without the API
[02:40] when i was just testing my spider
[02:40] ah :D
[02:40] People, talk to him, and based on what the result is, I can either have a drive that goes to IA, or some other solution.
[02:40] well you get 5k requests an hour, and with a threaded cloner using 10 threads at a time, i RARELY hit anything close to that hourly limit
[02:41] I'd suggest an item per user, with each repo being a single-file bundle in that item
[02:41] remember our userlist is out of date :(
[02:41] it's on IA though
[02:41] I still feel like a dummy for not backing up bugs.sun.com
[02:42] SketchCow: i was trying to get a sponsor to build a 20TB nas and then hand the thing over to you after my talk at defcon (if the CFP gets accepted) or firetalk
[02:42] ivan`: :(
[02:42] did you back up any of sun?
[02:42] i have a database that dumps the user/project name
[02:43] I'm quite annoyed that Oracle paywalled the whole thing
[02:43] so i can spit out a user list for all users i've seen so far
[02:43] have old hardware that needs bios updates or restore data? screw you unless you have an expensive contract
[02:43] WiK: how are you crawling right now?
[02:44] via api and a threaded python script i wrote
[02:44] still via API? :P
[02:44] yep, it's the best/fastest way to make sure i get everything without getting banned
[02:45] i was using a custom web crawler to get users and projects, but kept getting 'blocked'
[02:45] before i knew they had an api
[02:45] balrog_: funny thing is that I would have noticed they were about to JIRA up the thing if I kept my Reader feeds better organized
[02:45] in my db i keep track of username, project name, harddrive, and if i've grepped or processed it yet
[02:46] of course they neglected details like "we'll rm all the user comments" but that could have been inferred
[02:46] The short form is that we can certainly take this item.
[02:46] I don't think I have any of sun, no
[02:47] anyone have a suggestion for simplest way to get wget 1.14 onto an ec2 instance?
[02:47] i recall fighting with the SSL compile options under arch a few weeks back, wondering if i'll get the same ...
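(For reference, a rough sketch of the id-ordered API crawl WiK describes above, walking GitHub's v3 /repositories listing with an OAuth token to get the 5k/hour authenticated limit. The token is a placeholder, and error handling and the threaded cloning are left out; this only shows the enumeration step.)

    import requests

    TOKEN = 'your-oauth-token'  # placeholder; authenticated requests get 5k/hour

    def iter_public_repos(since=0):
        # GET /repositories walks every public repo in ascending id order;
        # private and deleted repos simply don't show up in this listing.
        # The last id in each page becomes the next 'since' value.
        headers = {'Authorization': 'token ' + TOKEN}
        while True:
            resp = requests.get('https://api.github.com/repositories',
                                params={'since': since}, headers=headers)
            resp.raise_for_status()
            page = resp.json()
            if not page:
                break
            for repo in page:
                yield repo['id'], repo['full_name']
            since = page[-1]['id']

    for repo_id, full_name in iter_public_repos(since=0):
        print(repo_id, 'https://github.com/%s.git' % full_name)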
[02:48] SketchCow: https://github.com/wick2o/gitDigger those are my resulting wordlists if you're interested (so far)
[02:48] -rw-r--r-- 1 root root 24420816 Mar 14 02:22 Electronic Entertainment (December 1995).iso03.cdr
[02:48] -rw-r--r-- 1 root root 1321824 Mar 14 02:22 Electronic Entertainment (December 1995).iso04.cdr
[02:48] -rw-r--r-- 1 root root 43180032 Mar 11 2011 Explore the World of Software - Kids Graphics for Windows (1995).iso
[02:48] -rw-r--r-- 1 root root 676304896 Sep 26 2009 Interactive Entertainment - Episode 13 (1995).iso
[02:48] -rw-r--r-- 1 root root 656586752 Mar 12 2011 bootDisc 14 (October 1997).iso
[02:48] -rw-r--r-- 1 root root 683087872 Mar 12 2011 bootDisc 20 (April 1998).iso
[02:50] fyi: i'm in no hurry to offload any of this data. but do want to put it in your hands, and then when you're bored you can run github updates and keep em updated ;)
[02:53] Well, I'll be at DEFCON, and you can always bring a drive. :)
[02:55] 7zip them up per username and see how small of a drive i can get them onto
[02:56] git packs are already compressed
[02:57] so you're telling me i can 7zip it up and get no better?
[02:57] you might save 1% or so
[02:58] with over 400k repos, that 1% could add up
[02:58] I wouldn't bother
[02:58] an ideal github mirror would have bare repos in a form that can be easily updated with git pull --rebase
[02:59] ivan so you want me to remove the HEADs from all of them?
[02:59] there's no reason to have a checkout of the HEAD
[03:00] that's easy enough to fix
[03:00] (did I say git pull --rebase? I meant git fetch)
[03:01] also, `git` will annoyingly prompt you for a username and password when you `git fetch` a repo that has been deleted
[03:02] you can work around this by changing the remote to https://dummyuser:dummypass@github.com/user/repo.git
[03:02] you can work around this by changing the remote to https://dummyuser:dummypass@github.com/user/repo.git
[03:02] you can work around this by changing the remote to https://dummyuser:dummypass@github.com/user/repo.git
[03:02] oops
[03:05] so what do you guys use for storage of all this archival stuff?
[03:06] massive drive arrays
[03:07] you know, same way anyone does ;)
[03:08] We use depressed arrays
[03:08] http://www.flickr.com/photos/textfiles/8273584676/
[03:09] hahaha
[03:09] ya, that's a bit more funds than i have for my project :)
[03:10] :P
[03:10] I have 5T of ~idle space, happy to store a copy of something
[03:20] that's one sad petabox
[03:25] Request denied: source address is sending an excessive volume of requests ... ok then, i'm doing something right :)
[03:44] SketchCow: is there an Internet Archive Archive?
[03:46] Only somewhat
[03:46] We've tried to make export utilities for the archive, we should do more
[03:48] how much does 10 PB (?) cost to build, anyways?
[03:52] http://blog.backblaze.com/2013/02/20/180tb-of-good-vibrations-storage-pod-3-0/
[03:54] about $600K plus humans to put everything together
[03:55] that's pretty doable for some agency like a national library / archives
[06:04] hdevalenc: we had SO MANY wasted PBs at San Diego Supercomputer Center, it's never stopped bothering me--there are people who use small grants well and there are people who use large grants terribly--full on bureaucratic hell, and so much barely used storage.
[06:06] viseratop: do you know of any efforts to get large archives to mirror archive.org?
universities, national archives, etc
[06:08] hdevalenc: I'm not all that up to date, but I'm now having the urge to check in with friends still at the SC centers. I also have some friends who do digi-preservation at Library of Congress. Worth investigating, though I'm sure SketchCow already has a pretty good feel on this.
[06:08] hdevalenc: I'll ask my network just for kicks, never know what stirring up some dust will do.
[06:10] make a petition, lol
[06:10] that will be very effective
[06:10] yeah petition the shit out of that
[06:11] dead-tree petitions are actually fun
[06:11] you can get your MP to read them, they go in the records, etc
[06:12] online petitions really piss me off; they manage to eliminate the one thing that petitions are actually useful for
[06:15] SketchCow: I'm fairly strapped in with UCSD, Calit2.net in particular. Let me know if it's helpful to stir up some dust on this, always happy to--also have won a few NSF grants, but those are bristly as hell (as I'm sure you know). More dog-and-pony shows and less actual achievements.
[06:16] hdevalenc: Interesting socially though how whitehouse.gov has to keep upping their threshold due to pranksters.
[06:17] indeed
[06:18] see, the thing about deadtree petitions is that in order to make them you have to go up to people and convince them to sign on
[06:18] best case, they're now on-side since you persuaded them
[06:19] worst case, they know that it's an issue, because someone cares enough to run about with a clipboard
[06:19] I don't really think that online petitions do that as well
[06:25] woop woop woop off-topic siren
[06:25] is there a convenient way to set the bind ip address of the seesaw script?
[06:26] on the wiki it says I should use --bind-address
[06:27] but that option doesn't seem to exist in any of the scripts -- should I add that as a param passed to wget-lua in pipeline.py?
[06:39] I think it was changed to --address
[06:40] afaik that changes the ip of the web interface
[06:40] (also)
[06:40] does it do both?
[06:41] er, not sure
[06:41] also, is posterous accessible over ipv6?
[06:41] sadly, no
[07:00] Dude, posterous is barely available over ipv4
[07:04] I just got tapped by someone with 900 radio shows going 20 years back, 3-4 hours a show, to host an archive of them.
[07:04] Going to happen.
[07:04] Very exciting.
[07:04] He's like "it's half a terabyte"
[07:04] I'm like PFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFT
[07:04] PFFFFFFf indeed
[07:05] &awat away
[07:05] I know a radio station that doesn't keep more than 3 months of recordings due to cost
[07:05] it's really depressing
[07:15] hey all! This is Ismail from World Backup Day...
[07:16] err...I mean that I'm working on that.
[07:16] not that I'm from the future.
[07:18] I was hope to ask if it'll be alright to have a callout to this Posterous project on the webpage on ~ March 31st?
[07:19] Sure, but the fact is that I don't know if that's the source of the problem.
[07:20] Check #preposterus and ask, and bear in mind it's 12-3am in the US
[07:20] Yeah, it 3 am here as well. live in ohio. :)
[07:21] pardon the typos
[07:24] I was also looking for comments on an effort to create some sort of manifesto for startups to have some sort of end-of-life procedures.
[07:25] but it's late and probably better to bring this up tomorrow.
[09:45] omgomgomg google reader
[09:45] little known thing, they're a huge archiver of rss feeds
[09:46] "Google Reader is more than a feed reader: it's also a platform for feed caching and archiving.
That means Google Reader stores all the posts from the subscribed feeds and they're available if you keep scrolling down in the interface."
[09:46] that's from http://googlesystem.blogspot.com/2007/06/reconstruct-feeds-history-using-google.html
[09:46] shutdown blogposts are here: http://googlereader.blogspot.com/2013/03/powering-down-google-reader.html
[09:46] and here: http://googleblog.blogspot.com/2013/03/a-second-spring-of-cleaning.html
[12:09] need a google reader wikipage
[12:11] i'm thinking have a simple webpage that allows people to upload their subscriptions (i think the files have the extension opml), then a thing for people to submit throwaway/dummy google accounts (since google reader doesn't display anything if one isn't logged in), then use the accounts to grab the archived copies of the blogs submitted
[12:11] o_O
[12:11] hmmmmmm that doesn't really make sense to me.
[12:11] then periodically check the blog pages, until it goes down
[12:12] Oh wait I see
[12:12] you want to grab the archives.
[12:12] pretty sure archiveteam warrior would do fine for the downloading
[12:13] Smiley: yep!
[12:13] ""
[12:13] "Google Reader is more than a feed reader: it's also a platform for feed caching and archiving. That means Google Reader stores all the posts from the subscribed feeds and they're available if you keep scrolling down in the interface.
[12:13] "
[12:14] if there's not one already i'll start on a wikipage tomorrow and i guess look into how to make that simple webpage
[12:14] the effort needs a name though also
[12:54] arrith: Create a wikipage with the info you got
[12:55] is there not already a section for reader on the google page?
[12:55] That sentence makes no sense
[12:55] Oh, you mean this? http://www.archiveteam.org/index.php?title=Google#Google_Reader
[12:55] ersi: append "on the wiki"
[12:56] That's not a project page though.
[12:56] it's called "backup tools"… the irony
[14:45] concord and county sheriffs each spent over in years 150 TRILLION. This is jut trying to find a way to put people in jail as dispatchers are listening to government fres. they open mail and refuiser to topand wring return letters in a negative way. this is YOU TAX DOLALRS AT WORK. many things stolen outof mail distribution every large ticket and pesent gifts and military and government
[14:45] back. theydispatchers used, given way and sold the dispatcdher go ina stolen police get up to get citizens mail 3x a day. dispatchers want to be the 1st to stop anyone they dont like fora sucessful life. also if no black mail paid they label you like they did the our friend as a child molesfer and regtisterede ass the victim was in his opwn home.
[14:45] orders. they also f0000 over my friens life as since dispatchdr signed a quiet contract with government the decidded to keep going against p3eoeple Concord Ca believes to get anything they waqnt out of us is to make uip worlds first over 1 million chargds and assignemt of jail time for not showing resopece then taledabout withohut illegal we are askinb begging for help ingetting as much
[14:47] Did we just get spammed?
[14:54] yup
[14:58] ┌∩┐(◣_◢)┌∩┐
[14:59] This is what we do to spammers
[14:59] (╯°□°)╯︵ ɹǝʇʎq
[15:18] When dispatch been ask as to why no hlep for ;peole and a new person" we wnat to be the 1st to get a eprson out of government that we dont like. this is in reverse of the letter form white hose. theywant to retre with heads high adn if jew then goes to jail. dispatrcher admtiited making juphargdes and on comomputer.
disaptc h admitte that studdents using stolencomps weregivden jobs paind
[15:18] in cash orocaine to reun atarget. THIS GJUY IS DYING PLESE FOR GOD AND OUR SAVOR HELP
[15:26] What's the point of running a gibberish spam bot?
[15:27] where are ze ops?
[15:27] FIRE THE CANNONS
[15:27] we could fire warrior pings at him
[15:27] :D
[15:27] with messages in the UA :D
[16:51] OH MY GOD GUYS DID YOU HEAR ABOUT WHAT THE CONCORD COUNTY SHERIFFS DID
[16:52] WHAT DID THEY DO
[16:52] SOMETHING SOMETHING 150 TRILLION
[16:52] MY TAX DOLALRS AT WORK
[16:53] IRC spam is always best spam - it's like hobos who blow other hobos for crack money.
[16:53] It's like, way to go downmarket
[16:54] hahahaha
[16:56] So, I don't know if anyone else wants to help write capsule summaries of computer platforms, but I could really use more help with more of them.
[16:56] I want to spiff up TOSEC.
[16:57] SketchCow: What platforms do you need?
[16:58] A lot. Come to #iatosec
[17:15] arrith: the best way to get more feed URLs is a web crawl and also inferring based on blogspot, tumblr, etc domains
[17:39] sup gents
[17:52] arrith: check out https://github.com/wick2o/stPepper
[17:52] it's distributed software where you can divide up the ipv4 address space and spider the internet for links
[17:53] i started doing it to generate a good seed file for a spider, got a lot of the space done
[17:53] but then the ppl who were helping me kinda lost interest
[18:07] i also pulled all the urls outta the wikipedia data which was a good start as well
[19:56] btw. a friend reports he had a CVS project on sourceforge a long time ago, and it has kind of disappeared
[20:09] SketchCow, Do you want a pull down of what is left of jmanga?
[20:12] Not sure
[20:17] Nothing to be saved from here but I thought I would mention the fuck you Adobe just handed out. http://developers.slashdot.org/story/13/03/14/189204/adobe-shuts-down-browser-testing-service-browserlab
[21:16] I lay good money MS is going to kill that IE shit with no warning like Adobe
[21:19] ie shit?
[21:19] The online multiple IE tester
[21:19] and those self-expiring VM images too
[21:49] didn't another browser testing service die a few weeks ago too?
[21:49] getting serious deja vu with that adobe announcement
[22:51] ersi: getting on that today
[22:52] ivan`, WiK: hm, spidering might be something to look into but i'm at least hoping to get the top 90% from various 'top blog' lists, and ideally some stuff like the recommendations internal to google reader itself
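(To make the feed-collection side concrete: a small sketch that pulls feed URLs out of an exported OPML subscription list and guesses feed URLs for the common blog hosts mentioned above. The input filename and the example blog are placeholders, and the per-host feed paths are just the conventional ones for those platforms, nothing Reader-specific.)

    import xml.etree.ElementTree as ET

    def feeds_from_opml(path):
        # Google Takeout (and most readers) export subscriptions as OPML;
        # each subscribed feed is an <outline> element with an xmlUrl attribute.
        tree = ET.parse(path)
        return [node.get('xmlUrl') for node in tree.iter('outline') if node.get('xmlUrl')]

    def guess_feed_url(blog_url):
        # infer a feed URL from the hostname alone, for hosts with a fixed feed path
        host = blog_url.split('//', 1)[-1].split('/', 1)[0].lower()
        if host.endswith('.blogspot.com'):
            return 'http://%s/feeds/posts/default' % host
        if host.endswith('.tumblr.com'):
            return 'http://%s/rss' % host
        if host.endswith('.wordpress.com'):
            return 'http://%s/feed/' % host
        return None

    if __name__ == '__main__':
        urls = set(feeds_from_opml('subscriptions.xml'))   # placeholder filename
        urls.add(guess_feed_url('http://example.blogspot.com/'))
        for url in sorted(u for u in urls if u):
            print(url)

Deduplicating the collected URLs before handing them to any Reader-scraping step keeps the request count down, which matters if the archived entries have to be fetched before the shutdown date.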