[01:25] SketchCow: I'm archiving the Atari section from ftp.inf.tu-dresden.de, and some of the folders have a nice ASCII art greeting when you enter the folder. How should that be preserved for presentation on IA? (right now, I just copied it to a text file, but no idea how it should be handled on IA)
[02:02] i think the #rawdogster grab scripts are ready for grabbing profiles
[02:08] chfoo: Good
[02:08] want to do some tests?
[02:11] dashcloud: on many servers, that information is in the .message file, which the server loads and sends when you enter that directory. check to see if that's the case and make sure to grab it if it is.
[02:14] that looks like the case for that site
[02:15] and it might be that the client requests it upon changing directories. I don't remember anymore.
[02:16] example: ftp://ftp.inf.tu-dresden.de/software/atari/Checkpoint/.message
[02:32] SketchCow: sorry, i'm a bit too tired right now. but i plan to do more testing tomorrow
[02:32] unless someone here wants to test the scripts out, more than welcome to
[02:49] No problem
[03:04] thanks Coderjoe !
[04:18] hi folks, this is a pretty amazing FTP site: ftp.cs.tu-berlin.de tons of content, many mirrors of older content, and a pre-made index file to browse through all the stuff that's there- download INDEX
[04:21] Want me to grab the copy, or are you
[04:22] can you?
[04:23] now this is funny
[04:23] the index alone is 78 MB (that's pure text)
[04:23] i found a digg dialogg with timothy geithner in my wall street journal video dumps
[06:35] holy shit they have a pirated copy of matlab dated 1992 with an instruction manual from 1981 on that ftp
[06:35] Awesome.
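The [02:11] to [02:16] exchange suggests the directory greeting lives in a `.message` file next to the directory's contents. A minimal sketch of grabbing it alongside a mirror, assuming the server exposes the file under that exact name (the helper names `message_url` and `fetch_message` are made up for illustration, and not every FTP server follows this convention):

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def message_url(dir_url: str, name: str = ".message") -> str:
    """Build the URL of the banner file for an FTP directory URL.

    Assumption: the banner is stored as a plain file called `name`
    inside the directory, as in the example URL from the log.
    """
    parts = urlsplit(dir_url)
    path = parts.path.rstrip("/") + "/" + name
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def fetch_message(dir_url: str, timeout: float = 30.0):
    """Return the banner bytes, or None if the file is missing or unreachable."""
    try:
        with urlopen(message_url(dir_url), timeout=timeout) as resp:
            return resp.read()
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return None


if __name__ == "__main__":
    # Only builds the URL; call fetch_message() to actually hit the server.
    print(message_url("ftp://ftp.inf.tu-dresden.de/software/atari/Checkpoint/"))
```

Saving the raw bytes rather than decoded text keeps the ASCII art intact regardless of the server's character encoding.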
[06:36] http://www.martinreddy.net/ukvrsig/
[07:42] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[07:43] Yes I take things literally
[07:52] yahoosucks
[07:53] haha thanks
[07:54] apparently efnet hates me
[07:55] Either that or my ident is banned
[07:56] so guess the french hate me, i was trying to use irc.efnet.fr
[07:57] Doubt ArKiver is around ?
[08:28] It's a little late for him.
[08:28] He's probably tucked into bed with his teddy bear and pudding cup.
[08:29] SketchCow: would it be an idea to put who is grabbing which FTP in the wiki?
[08:30] I was just going to go along grabbing things from lists that people were going to give me.
[08:30] But who knows where we are now.
[08:30] If you want to start a wiki page, sure.
[08:31] But right now I'm just slamming down lists.
[08:31] im grabbing 5 FTP's right now but they are +12TB in size...
[08:31] Sweet.
[08:32] * SketchCow is currently, this second, ingesting the talks of a January 2014 hacker conference.
[08:32] how much space do you got in fos anyway? seems that that box cant be filled :p
[08:33] Well, it has 17tb
[08:33] Which is enough for me to get things off it pretty easily.
[08:39] https://archive.org/details/ShmooCon2014_Attacker_Ghost_Stories (Typical talk)
[08:42] midas: even if there isn't any list, or there is one but you can't find it, noting your 5 sites on [[FTP]] (or [[Talk:FTP]]) can't harm
[08:43] true, will start that in a minute, first have to get some users off my back ;-)
[08:48] very funny: http://abcnews.go.com/WNT/video?id=1831172
[08:48] video will not play at all
[08:48] oh god slashdot
[08:48] * joepie91 waves at midas
[08:49] Hooray for grabbing slashdot.
[08:49] SCHOOL DONE FOR THE DAY check notifications OOH NEW EMAIL whats this SLASHDOT
[08:49] * midas waves at joepie91
[08:51] hey joepie91
[08:52] ohai Konata
[08:52] ohai Konata_
[08:52] and hai midas :P
[08:52] oh, yeah
[08:52] I forgot that those are online
[08:52] anyway, https://github.com/ArchiveTeam/slashdot-grab
[08:53] I have outlined my preferred strategy in the STRATEGY file
[08:53] please PR, etc.
[08:53] we can grab user accounts and stuff later, but the real value is the discussion and stories
[08:53] IMO
[08:54] grabbing slashdot again?
[08:54] so it turns out i may be able to get some webcasts from april 2006
[08:54] midas: if there was another grab, I don't know about it
[08:55] oh, somehow i thought it was done already. no worries
[08:55] I need to zzz for now
[08:55] need to keep those archive up 2 date :p
[08:55] that said, part (1) of STRATEGY is really "URL discovery"
[08:56] and parts (2) and (3) are "content fetch"
[08:56] so if that helps to think about it :P
[08:59] so added a small list, doing a du -sh now to see how big the folders are again, will take about 4 hours tho :p
[09:07] good news everyone
[09:07] i found the real server that abcnews hosts things on
[09:08] real path with variables to get the idea: http://cdn.ctnhd.com/storage/naeast1/abcnews.origin.cdn.level3.net/published/${year:2:4}/$month/$filename
[09:09] nice
[09:29] so it looks like april of 2006 for the most part will get saved
[09:29] most of the world news webcast still exist and the full episodes of world news tonight still exist
[10:09] Also, I propose #slashdocs for slashdo
[10:09] t
[10:22] so, uh
[10:22] Hi Viddlers,
[10:22] In 2006, Viddler’s founding business model was based on the creation of a community site for video enthusiasts and personal sharing. At the time, our business revenue model was driven through advertising. As a Viddler community user, you were a part of this model. As time has passed Viddler is no longer able to support this offering and business model.
[10:22] Therefore we’ve made the decision to close our free site and community effective March 11th, 2014.
[10:22] yesterday
[10:22] it looks like nobody on the interwebs has caught wind of this yet?
[10:23] (also, GLaDOS, have you renewed archivingyoursh.it yet :P)
[10:24] Talked about it here
[10:45] ah, right, archivingyoursh.it
[10:49] Kenshin: ot
[10:49] Kenshin: whoops, sorry.
[10:50] joepie91: it should be up in a few hours, just waiting on ns changes
[10:51] GLaDOS: :D
[16:22] It's a little late for him.
[16:22] He's probably tucked into bed with his teddy bear and pudding cup.
[16:22] SketchCow why did you say that?
[16:37] arkiver: because http://radio.notacon.org/2011/shows/Fuck%20Jason%20Scott.mp3 ;)
[16:53] lol
[17:54] right
[18:50] arkiver: that was a joke if you didn't catch it. ;)
[20:59] Viddler is now talking to me.
[20:59] I'm being the strong face
[22:23] Hey guys, writing a bit of an archive tool for a site I like
[22:23] Just wondering, is it good practice to grab a warc file for everything I download (images, html pages, epub files, etc)?
[22:24] I think it is, shouldn't add any data to the request
[22:24] and if I'm worried about space, I can just buy another hard drive or something :P
[22:26] yes, WARC everything
[22:27] yep, figured it was the best thing to do
[22:27] easy importing into wayback later, if we want to
[22:31] danneh_: if the site isn't huge, we also have a bot you can use to grab the site
[22:32] Are sites like ebay and amazon archived often?
[22:33] yipdw: it's a bit big, about 170k user stories. It uses a whole bunch of JS for comments and AJAX and all that, so figured writing a custom Py script was probably best
[22:33] and possibly later on, look into warrior scripts and all that jazz
[22:37] try #archivebot
[22:37] danneh_: we've shoved bigger things (are shoving bigger things) at it
[22:37] unless the whole site is utterly unusable with Javascript, it's easier than coding something custom and verifying the result
[22:39] yipdw: fair enough. I'll hop in there after work and have a bit of a look
[22:40] thanks for the info!
[23:02] https://twitter.com/textfiles/status/430974554219888640
[23:02] bwah ha ha
[23:02] WARC is good for a lot of things, especially static data (even if generated on the server)
[23:03] But if it's heavy javascript or weird constructions, there's sometimes going to be problems.
[23:08] aha, makes sense. most of it's static files (including static js, it uses ?random to force redownload every time, but I just strip it to the bare url), so I'm not too worried for now
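The "strip it to the bare url" step from the last message could look roughly like the sketch below. It assumes the query string carries nothing but the cache-busting value, so dropping it whole is safe; the function names and example URLs are invented for illustration and are not from the site being discussed.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_cache_buster(url: str) -> str:
    """Drop the query string and fragment, leaving the bare URL.

    Assumption: the query is only a random cache-buster. If a site also
    uses the query for real parameters, this would merge distinct pages.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def dedupe(urls):
    """Return bare URLs in first-seen order, skipping repeats."""
    seen = set()
    result = []
    for url in urls:
        bare = strip_cache_buster(url)
        if bare not in seen:
            seen.add(bare)
            result.append(bare)
    return result


if __name__ == "__main__":
    queue = [
        "http://example.com/static/app.js?0.4211",
        "http://example.com/static/app.js?0.9930",
        "http://example.com/static/site.css",
    ]
    for bare in dedupe(queue):
        print(bare)
```

Deduplicating on the bare URL keeps the same static file from being fetched and stored once per random suffix.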