[03:12] RELEASE THE KRAKEN [03:20] I've seen this twice now, thought I'd let someone know http://imgur.com/kzd5KWh [03:25] and to be safe, it's a screen shot of the warrior showing code and not the grey screen I'm used to seeing [03:54] Has anyone made a generalized retargetable warrior EC2 AMI? [03:55] I'd love to have an image I can spin up on demand that'd pull the project du jour down and churn [03:55] donate some AWS credits. [03:55] that actually sounds like something I would be interested in doing [03:56] I just recently built an AMI and code to spin up/down spot instances to compile and execute third-party addons to the codebase I maintain [03:58] no way in hell would I let that shit execute on the campus network [04:05] SketchCow, I got a 24gb backup of ftp.ea.com from last year. Is 7z a supported format or should I just change it to a tar.gz or zip [04:18] It's easiest if it's a .zip for the purposes of making it browsable [04:19] https://twitter.com/drnormal/status/326904028590116864 [04:21] can do [04:23] Mornin' guys. [04:25] Morninggggg [04:32] http://archive.org/details/bitsavers_disktrend_removed_files [04:32] I'm going to leave it there [04:32] Like that [04:53] anyone got any ideas about the messed up Archive Warrior display? Don't even know if it's working or not. [04:55] dragondon: I've seen errors like that before... Usually when the host has high load... [04:55] What does the web interface say? [05:02] TeeCee, seems to show things just fine. And given that I've got an AMD 6-core, 8GB of RAM, and the process monitor shows all lines nearly dead (save for the networking part, which is due to the warrior itself), that answer seems off. [05:04] which project are you running dragondon [05:05] omf_, formspring. I have the 'choice' setting turned on.
[05:15] dragondon: looks like the VM is running out of memory [05:16] +1 [05:16] I periodically get that on my desktop running the seesaw scripts outside of the VM, so we should probably investigate ways to trim and split crawls in the middle [05:18] http://archive.org/details/2008.ftp.bundle.collection oh fuck yes ftp.uu.net I missed you [05:31] Yeah, that thing is a true goldmine. [05:37] chronomex, should I stop it, give it more memory for now? [05:38] All I'm trying to do is jam as many goldmines up as possible, but we definitely need a round of metadata on these poor things [05:40] SketchCow, not really knowing what I am talking about, is the metadata an automated or manual process? [05:41] It'll be a combination. [05:42] here's generally good metadata: http://collections.si.edu/search/results.htm?q=record_ID%3Anmah_834010&repo=DPLA [05:43] hmm, I don't mind running another VM for automated stuff (thinking of running a second Warrior VM) but don't want to take much time away from my newest endeavour to learn Python via a course [05:43] this isn't something that would be running in a warrior. [05:43] This is a process. [05:44] hmm, are there any scripts for scraping details from places like wikipedia to help make this a little quicker? [05:54] No. [05:54] No, this is a thing. [05:54] This is a thing you can stop thinking about. [05:54] It's up there with "How do I find the right person for me." [05:55] The answer to that is not "I bet if we do a search on the itunes store for "find me the right person" we can punch right through this."
[05:55] Also if you use Wikipedia for metadata you are actually the antichrist [05:55] Like, people should just wheel the babies right to you for eating [05:56] the "I need more metadata" crowd is mostly librarians and their ilk [05:56] Me too [05:56] SketchCow: well, not all of us can be members of the clean plate club with respect to babies [05:56] I just don't think you need to hold stuff offline until the metadata's perfect [05:56] hell to the no [05:56] that's what makes me r a d i c al [05:56] ha [05:56] * dragondon <--- not much of a librarian [05:59] I'll just keep on doing what I am till I learn more :) [06:02] umm, I was going to download another copy of the archive warrior, but archiveteam.org doesn't seem to tell me how to get it. I recall seeing a 'how to help' page at one point in the past. [06:04] ah, found it. Shouldn't have to dig for this. Should be on the main page. http://archiveteam.org/index.php?title=ArchiveTeam_Warrior [06:04] yeah, the information on the warrior is tucked far and away on the warrior page. [06:04] http://archiveteam.org/index.php?title=Warrior [06:04] Only you are digging [06:04] I mean, if you call clicking a link 'digging' [06:05] * SketchCow chronomex Going to go back to working on the movie. Don't need this guy. :) [06:05] Oh, look at that. [06:05] No spam on the wiki. [06:06] Thanks again, BlueMax Smiley soultcer [06:11] none? All gone? [06:11] Well, maybe we have a couple half-eaten broken spam accounts lurking here and there. [06:11] But before they were bringing up some live zombies every night, we got it from 30-40 new edits down to 4-5, now we're at zero. [06:12] Two nights in a row.
[06:12] That's a big deal, someone's sad [06:12] Just as we get press and attention [06:13] yay [06:37] so microsoft e3 2012 press conf is getting uploaded [06:37] https://twitter.com/textfiles/status/326947951484211200 [07:11] Here is my life [07:11] My life is that I just had to double check I could actually speak at the Library of Congress and still have enough time to make it over to the five day Apple II conference. [07:11] * SketchCow haunted look [07:28] SketchCow: We all envy you, no need to emphasize how cool you are ;-) [07:28] And I am serious, I really envy you [07:50] doesn't make it any less stressful though :-) [07:52] and doesn't attract the ladies either... It's like: "Hey, I just saved 100,000 animated gifs, want to date me?" ... So digital archeology needs its time to develop its charm :) [07:52] digital preservation [07:53] anyway :3 [07:54] well, both actually, but right [07:54] :) [10:02] norbert79: I believe SketchCow already has a lady in his life [10:02] in case you were worried [10:27] chronomex: Still it's nice being able to show charm [10:27] and being appreciated for it :) [10:32] woop woop woop off-topic siren [16:35] the corporate youtube channels are worth backing up continuously, they remove old ads and such [16:35] e.g. a terrible old windows phone ad from microsoft is too embarrassing for them https://www.youtube.com/watch?v=ewk8zWx9lqE [16:35] apple removed their lame genius ads [18:20] Jamming in a pile of CD-ROMs [18:24] < [18:24] ups [18:28] "Total Annihilation"? [18:32] No, sadly [18:32] One day! One day [18:33] the guy who runs http://www.marksfriggin.com (a site with logs of each daily howard stern show that's been up since 1995) has been threatening to shut the site for a while and i'm interested in keeping a pre-emptive archive just in case.
anyone able to point me in the right direction on doing a simple site archive (whether heritrix or other tools—i'm primarily a python developer if that influences the tooling at all)? [18:35] lukeman: I'd start with wget -r -l 0 -m -p --warc-file www_marksfriggin_com http://www.marksfriggin.com/ [18:35] lukeman: I'd say using wget (version 1.14) would probably be the best if it's a simple site (i.e. content not hidden by javascript) [18:35] yes [18:35] thanks guys [18:36] http://www.archiveteam.org/index.php?title=Wget_with_WARC_output [18:36] that works for me. wasn't sure if i needed something more complex. [18:37] Feel free to hack on one of any of our current repositories by the way, a lot of it is in python - available at https://github.com/ArchiveTeam/ ;) (seesaw-kit is an important one for example) [18:37] usually wget is fine, sometimes you need more complex tools though [18:38] http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem has a bunch of tools as well [18:38] yeah, i was reading that before [18:38] [yahoo-upcoming-grab] alard pushed 1 new commit to master: https://github.com/ArchiveTeam/yahoo-upcoming-grab/commit/21070051b5534a340379a07acae6f0475ffeacc6 [18:38] yahoo-upcoming-grab/master 2107005 Alard: Ignore Wget error 4 (dns resolution). [20:25] Posterous attention achieved [20:27] yeah I just saw the douchebag tweet from a guy at twitter [20:33] o.O some ppl still use posterous for stuff in May? [20:35] welp, think i may have gotten my 2nd ip ban from github using their api :) [20:37] Is there any legal means by which Posterous could 'donate' their entire database to the Internet Archive, or similar? (Noticed the great news about Cuil earlier.) [20:38] no. [20:38] No, no, this is going to explode very quickly. [20:38] It's already exploding. [20:39] what has exploded [20:42] WiK: full ban, or API rate throttle? [20:44] I'm interested to see what happens.
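The wget-with-WARC recipe suggested above for marksfriggin.com generalizes to any simple site. A minimal sketch of wrapping it from Python (relevant since lukeman is a python developer); it assumes wget 1.14+ is on PATH, and the helper name and output prefix are illustrative, not from any Archive Team script:

```python
import subprocess

def warc_wget_command(url, warc_prefix):
    """Build a wget argv mirroring the incantation from the channel:
    recursive mirror with page requisites, plus WARC output."""
    return [
        "wget",
        "-r",             # recursive
        "-l", "0",        # no depth limit
        "-m",             # mirror mode (timestamping, infinite recursion)
        "-p",             # fetch page requisites (images, CSS, etc.)
        "--warc-file", warc_prefix,  # writes <prefix>.warc.gz next to the mirror
        url,
    ]

cmd = warc_wget_command("http://www.marksfriggin.com/", "www_marksfriggin_com")
# subprocess.run(cmd, check=True)  # uncomment to actually run the crawl
```

Note that `-m` already implies `-r` and infinite recursion; the redundant flags are kept only to match the command quoted in the channel.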
[20:55] hmm we had dedicated boxes and with them we would not have been able to make the deadline? what a joke... [21:00] is some mass "alert the media" scramble in order? [21:01] edoc: even german news had the posterous shutdown covered [21:02] http://www.spiegel.de/netzwelt/web/blogdienst-posterous-wird-abgeschaltet-a-883986.html [21:03] i'll write them up to do an article about archiveteam, let's hope they'll do it :) [21:03] BBC failed to mention. [21:03] this does not surprise me. [21:03] if they're going to let their users download their data after the 30th, why do they care if we are hammering them now? [21:03] I've been on the BBC sites a few times... [21:04] paulv: lies and damned lies. [21:12] SketchCow: can i direct the spiegel guys to you? (twitter) [21:12] i just hope they do a follow up [21:20] SketchCow: any plans on moving my tekzilla videos to my tekzilla collection soon? [21:20] godane: I think now is a bad time to ask bud. [21:20] i'm going to upload more directly to the tekzilla collection [21:20] ok [21:21] can underscor do that then? [21:22] Not sure, but Jason has.... a number of things going on atm, including the pending destruction that is Posterous. [21:22] got that [21:23] I'm off to bed as I'm rather ill and not getting any better : [21:23] i only ask cause the collection is going to look funny [21:23] Get well soon, Smiley. [21:23] thanks [21:23] jumping from episode 42 to 105 [21:45] Ne1 around happy to answer a n00b's question? [21:45] What's up [21:46] Just started running a linux server and have left it on all day working on upcoming. [21:46] Thing is despite a 76mb connection it's only got a gig done so far [21:47] Watching the little graph looks like it gets very small amount of data then hangs for a second [21:47] hola [21:47] Is this normal or have I set something up wrong [21:48] I do not speak much English [21:49] isn't upcoming complete now?
[21:49] I think that's normal, abards - the client is pretty considerate and only downloads a little bit of data at a time, it doesn't hammer the server or download constantly. [21:49] If you only have a few threads running (default is 2) then what you describe sounds about right. I'm new myself but I don't think it's a problem. [21:49] plus upcoming is small [21:50] Ok cool, how do I up the threads? Can I ssh into the virtual machine or is there a config page? [21:50] 143 items out, 0 to do: http://tracker.archiveteam.org/upcoming/ [21:51] And also, quite a lot of Upcoming was finished by last night, so there have been only a fairly small number of work items today anyway. [21:51] Ok I'll switch project, it was just the one that alerted me [21:52] Config page is on the left hand side of your web browser - "Your Settings" [21:52] Mine just has name :) [21:52] Tick the 'Show Advanced Settings' at the top [21:52] ahhh [21:53] hola a todos me interesa ayudar en su proyecto me esta ayudando el traductor de ingles-español de google :) [21:53] Sorry, spent so much time playing around with getting it running headless behind a firewall I didn't really look very closely [21:53] All good fun. :) [21:54] Yeah, been without internet aside from work for a few years, nice to get back on and start using it for something [21:55] Hola n00b406 - Yo no hablo español, pero espero que alguien aquí puede ayudar. [21:56] :) [21:56] no entiendo espanol [21:57] well I'm interested to discuss your project, its purpose, philosophy and ... [22:00] No hablo espano, pero Soy Awesome! [22:01] espanol* [22:01] lol [22:01] Hay algunos muy buenos discursos y conferencias sobre YouTube dadas por Jason Scott, que puede encontrar interesante, pero están en Inglés. [22:02] [Sidenote: It'd be great to get those speeches transcribed, so they could be easily translated... no?]
[22:03] antomatic, I have some info about that [22:03] give me a minute [22:04] antomatic: I haven't spoken spanish in 4 years, but I can understand what you're saying. Amazing. [22:05] Wow. I don't even speak Spanish, this is purely Google Translate. [22:06] good news on revision3 now [22:06] it autoloaded older episode links on the episodes page now [22:06] I'm not sure it's grammatically correct, but I can understand it. I could never keep por/para straight for example. [22:07] wait, what was going on with revision3? [22:13] El Historia de Soy Sauce es muy interesante: http://www.youtube.com/watch?v=-2ZTmuX3cog [22:14] Fantastic automatic subtitles on that video. [22:14] "My name is Jason Scott, biamby mascot of our country." [22:18] Actually that IS something I can help with, if it's useful. I can rip that and clean it up into clean, translatable (and YT-viewable) subtitles. If it helps, obviously. Not if not. :) [22:19] antomatic, I myself would love it and I know others would as well. If you want to do it, do it [22:19] Happy to. OK, leave it with me. :) [22:20] closure: i can manually get there, but when i run my code it gets a 403 Forbidden [22:20] i've changed ip address, same, changed username/password, same [22:21] WiK, how many repos till that happens [22:21] anon access in my script still works [22:22] 622336 [22:22] not bad [22:23] have an estimate on how many repos there are total? [22:23] 1660285 [22:24] that was the last id ive seen, and im thinking... [22:24] https://api.github.com/repositories?since=1660285 [22:24] you could mess with that to figure out how many there are in total, i just never bothered to look [22:24] Do you want help downloading blocks of it?
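The `since` parameter WiK mentions is how GitHub's `GET /repositories` endpoint pages through every public repo by ascending id: each request returns a batch, and the next request starts after the highest id seen. A sketch of that walk; the `fetch_page` callable stands in for a real authenticated HTTP GET (which is what actually trips the rate limit):

```python
def walk_repositories(fetch_page, since=0):
    """Yield repo dicts from GET /repositories?since=<id>, following
    ascending-id pagination until an empty page comes back."""
    while True:
        page = fetch_page(since)  # e.g. GET https://api.github.com/repositories?since=<id>
        if not page:
            return
        for repo in page:
            yield repo
        since = page[-1]["id"]    # next page starts after the highest id seen

# Usage with a canned fetcher standing in for the real API:
def fake_fetch(since):
    repos = [{"id": i} for i in range(1, 6)]
    return [r for r in repos if r["id"] > since][:2]  # two repos per "page"

ids = [r["id"] for r in walk_repositories(fake_fetch)]
# ids == [1, 2, 3, 4, 5]
```

This is also why "that was the last id ive seen" is the natural progress marker: resuming is just passing the saved id back in as `since`.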
I can throw some butts at it [22:26] i'm gonna stop for a bit and give my modem a break, i wanna keep this 'my' project at least until after defcon [22:26] my modem can use a break anyway [22:26] AND my current 4tb hard drive is full... so i had to stop the downloading for a bit anyway [22:27] i think they just blocked the user/pass i was using to authenticate [22:27] as it's not being blocked per ip/user-agent [22:27] I understand that WiK. You want to finish what you started. [22:28] nope, i just submitted a cfp to defcon on it, and if they reject i'll submit it to firetalks [22:29] after that, i'll most likely talk to the peeps here about handing off the project, recoding my stuff so it will work inside of a 'worker' [22:29] I can help you with that [22:30] http://github.com/wick2o/githunt [22:30] I have so many ideas of things I want to try on that data [22:30] if you look at api_uname_harvest2.py you can see what im doing [22:31] http://i.imgur.com/gZ8Fc83.gif [22:31] im pretty much just running that in different modes [22:31] Today in a nutshell [22:31] download = downloads the rips 10 at a time [22:32] processor loops through each folder and updates the database [22:32] so that i can get folder/dir counts [22:32] then i have a custom bash script that runs in cygwin that does grepping on all the results before i remove the HD for another [22:33] and then i have to manually process those :( [22:35] fucking mother fucking. I just got this back after uploading 33gb [22:36] I am surprised this is not checked before I upload a file http://paste.archivingyoursh.it/wesepegeba.xml [22:37] hmm - my Warrior's jumped onto Upcoming at some point today - thought that was done with... is there any use in me telling it to work on Posterous or is that still subject to getting blocked? (or am I just losing my marbles?) [22:38] Baljem, you can point it back at posterous [22:38] cool. will do! [22:39] Is formspring worth doing?
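The "processor" mode WiK describes, looping through each downloaded repo folder and updating a database with folder/dir counts, could be sketched like this. Everything here is an assumption for illustration: the `repo_stats` table name and schema are invented, not githunt's actual layout:

```python
import os
import sqlite3

def record_counts(root, db_path=":memory:"):
    """Walk each repo directory under `root` and store its file and
    directory counts, so later grepping passes can be prioritized."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS repo_stats "
        "(repo TEXT PRIMARY KEY, files INT, dirs INT)"
    )
    for repo in sorted(os.listdir(root)):
        path = os.path.join(root, repo)
        if not os.path.isdir(path):
            continue
        files = dirs = 0
        for _, dirnames, filenames in os.walk(path):
            dirs += len(dirnames)
            files += len(filenames)
        db.execute(
            "INSERT OR REPLACE INTO repo_stats VALUES (?, ?, ?)",
            (repo, files, dirs),
        )
    db.commit()
    return db
```

Keeping the counts in sqlite rather than flat files means the "which drive has what" bookkeeping survives swapping out a full 4tb disk.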
I notice they say they could be rescued, but there is a ton of stuff left. [22:40] https://twitter.com/mrox64/status/327189622591475712 [22:40] Formspring is low priority to me [22:40] I'm worried about posterous, but they're fucking us [22:41] I'll switch over to Posterous then. [22:42] What's the deal with Revision3? Are they shutting down? [22:43] (panics) Whaat?! [22:43] that was my thought [22:43] now I got to upload 33gb again [22:43] Discovery only just bought them! [22:43] "good news on revision3 now [22:43] it autoloaded older episodes links on the episodes page now" [22:46] They wouldn't shut it, surely.. haven't heard anything. [22:50] Welcome to the fun, jfranusic [22:52] i'm doing rev3 just in case [22:52] a lot of the episodes are from like before 2010 right now [22:53] Hi. [22:53] Would you happen to know if anybody has made an effort to archive [22:53] nwnet.co.uk ? [22:53] This was a really early ISP based in Manchester, England. A series of [22:53] mergers and takeovers means that the domain basically got forgotten [22:53] about, and it's full of late 1990s gems such as [22:53] http://www.nwnet.co.uk/worsley/ - spot the MS Front Page template, for [22:53] an NHS website! Lots of personal pages and very amateur feeling [22:53] company homepages. googling site:nwnet.co.uk seems to bring up [22:53] loads.. [22:53] I was a subscriber back in the 1990s. My own website is still there, [22:53] and my email still works. And it's through that I've had notification [22:53] that they are switching it all off on 1st May ... I'd run a wget [22:53] across it, but haven't a tool to grab a starting list of urls from [22:53] google, and don't have enough time to write one - is there anything [22:53] about already? [22:53] Cheers [22:53] Rob [22:53] -------------- [22:53] OK, let's do it. [22:53] It's a real thing, a website set being shut down in....
sex days [22:54] six [22:54] well sex days too, if you play it right [22:54] nwnet.co.uk doesn't work for me [22:55] Here it redirects to www.telinco.net, which has a general 'shutting down, deleting may 1st, get lost, your fault if you didn't back it up' message on it [22:55] disgraceful. [22:55] yeah the www redirects [22:56] godane: so, you're just doing a pro-active backup? [22:57] 1490 google results for http://www.google.co.uk/search?num=100&newwindow=1&site=&source=hp&q=site%3Anwnet.co.uk&oq=site%3Anwnet.co.uk&gs_l=hp.3...1279.3709.0.3900.18.16.1.0.0.0.128.1097.14j2.16.0...0.0...1c.1.11.hp.8MVadSwsvjI [22:57] eh, I mean, for "site:nwnet.co.uk" [22:57] oh, 306 by the time it gets to page 4. [22:57] yes [22:58] doh. (no good at this.) :) [22:58] there is some cool shit on there [22:59] godane: gotcha, okay, good show. [23:00] nwnet.co.uk/worsley doesn't redirect here (east coast, US) [23:01] Here we go http://urlsearch.commoncrawl.org/?q=nwnet.co.uk [23:01] 320 urls there [23:01] holy crap [23:03] must be something about British ISPs - a few years ago UK Online was closed down by Murdoch's lot, taking a whole bunch of subscribers' sites with it (such as Cliff Lawson's resources for Amstrad PCs) [23:04] alas I didn't know about Archive Team when I got that e-mail or I'd have suggested doing something about it :-/ [23:08] so, I'm going to do a crawl of the nwnet.co.uk/worsley one first using this wget command: wget -r -l 0 -m -p --warc-file nwnet-worsley http://www.nwnet.co.uk/worsley (and then move on, one by one through the other sites listed in the commoncrawl list) [23:15] sounds good [23:18] check the wayback machine. With all that cuil data there are probably more urls in there [23:19] while I think of it... [23:19] http://i.imgur.com/8udKTNj.png [23:20] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD [23:20] antomatic: THY MAGIC WORD IS yahoosucks [23:20] dashcloud: Before you do that, I think this is a job for the warrior [23:20] It seems so obvious now. 
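The commoncrawl URL search above can be turned into a seed list for the per-site wget crawls. A sketch assuming the download is one JSON object per line with a `url` field; the exact schema of the urlsearch.commoncrawl.org export is an assumption here, so the field name may need adjusting:

```python
import json
from urllib.parse import urlparse

def seed_list(json_lines, domain="nwnet.co.uk"):
    """Dedupe crawl-index records down to a sorted URL list for the
    target domain, suitable for `wget -i seeds.txt --warc-file ...`."""
    seen = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        url = json.loads(line).get("url", "")
        host = urlparse(url).hostname or ""
        # match the bare domain and any subdomain (www., etc.)
        if host == domain or host.endswith("." + domain):
            seen.add(url)
    return sorted(seen)
```

The same filter can fold in the Google and Wayback results as they come in, keeping one growing "massive textfile" for the downloader as requested later in the channel.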
:) - Thanks! [23:21] looks like twitter sucks more. [23:21] ;[ [23:21] that's fine- if you go here: http://urlsearch.commoncrawl.org/?q=nwnet.co.uk there's a .json download of all the urls [23:21] I could write a ruby script that brute forces urls, would that be useful ? [23:22] Have it check a new one every 4 seconds or something. [23:23] Dictionary attack isn't unworthwhile [23:24] there's not that many listed there in that list [23:24] maybe 20 actual sites- haven't checked google's listings yet [23:26] We'll want to hit google [23:28] any sense on if they allow numbers in usernames? [23:31] haven't seen any yet - but they do seem to be case-sensitive [23:31] Yuk! [23:31] e.g. nwnet.co.uk/BFG/ [23:32] Yes - numbers. [23:32] www.nwnet.co.uk/i2i/ [23:38] Please just keep building a massive textfile we can use for the downloader [23:38] It's OK if we spend a night with a couple of you tracking possible filenames [23:38] Dictionary attacks against google work well too [23:38] What project is this for [23:38] so, which paste site? archivingyoursh.it or elsewhere? [23:44] Hi, everybody. [23:44] (waves) [23:45] I have a Posterous task which has stalled - 2hrs since the last entry on its wget.log. [23:45] I'm not running the Warrior. [23:46] It's downloaded over 4k URLs, over the past 3 days. [23:47] Is this normal, or is this broken? [23:48] Er, to be clear: should it wait for hours between requests? [23:50] i'm getting the pat and stu show for april 24 2013 [23:50] *the video [23:50] of it [23:51] cause the mp3 is not on their site yet [23:58] I'm doing a dictionary attack against the website. We'll see how it goes. [23:59] It looks like it was a spammer, and it's now full of 404s.
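The dictionary attack being discussed, probing candidate user directories case-sensitively with one request every few seconds, might look like this sketch. The `probe` callable is a placeholder for a real HTTP HEAD request, and nothing here is from an actual Archive Team script; the 4-second default matches the pacing suggested above:

```python
import time

def dictionary_scan(base_url, words, probe, delay=4.0):
    """Try each candidate path under base_url, pausing `delay` seconds
    between requests so the dying server isn't hammered.  `probe` should
    return True when the URL exists (e.g. HTTP 200 on a HEAD request).
    Candidates are tried as-is: paths were observed to be case-sensitive
    (e.g. /BFG/), and numbers are allowed (e.g. /i2i/)."""
    found = []
    for word in words:
        url = "{}/{}/".format(base_url.rstrip("/"), word)
        if probe(url):
            found.append(url)
        time.sleep(delay)
    return found

# Usage sketch with a canned probe instead of real HTTP:
hits = dictionary_scan("http://www.nwnet.co.uk",
                       ["BFG", "i2i", "nosuch"],
                       probe=lambda u: u.endswith(("/BFG/", "/i2i/")),
                       delay=0)
# hits == ["http://www.nwnet.co.uk/BFG/", "http://www.nwnet.co.uk/i2i/"]
```

Confirmed hits would feed straight into the same "massive textfile" the downloader consumes.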