[00:07] uploaded: http://archive.org/details/dltv_068_episode
[00:07] :-D
[03:04] Does Efnet have a good browser client, a la mibbit?
[03:04] Might need to log on from work to catch up with Schbirid; we're on opposite shifts :(
[03:32] shaqfu: http://chat.efnet.org/
[03:34] chronomex: Thanks; even has a snazzy new interface
[03:34] Hopefully it doesn't get caught in the NetNanny :(
[04:05] SketchCow: what the hell happened here: http://archive.org/details/20040229-bbs-lcrumbling
[04:05] the ogg video and 512kb mpeg4 are ~12mb
[04:05] when the source mpeg2 is 1.3gb
[04:06] i think those files need to be recreated
[04:18] looks like
[04:31] maybe a fundraiser for jason scott to get all the bbs interviews up
[04:31] think maybe it will get him to do it
[05:00] wow
[05:00] that interview item has quite a long task history
[05:14] first 70 episodes of dl.tv are up
[05:33] the video derivatives can be a lot smaller if the source is HD because it downscales
[05:34] that said, it doesn't seek properly in this case, so something probably went wrong
[06:30] awesome, I just managed to use _Creating GeoCities Websites_ to round out an Amazon order and get free shipping
[06:32] yes, something went horribly wrong in the last derive.php process 3.6 years ago. it deleted a lot of thumbnails and the other derivatives and then, for some reason (ffmpeg bug? I don't see an error message in the log file), only created 4 thumbnail images and the short, incomplete video files.
[06:32] yeah, I didn't see anything indicative of badness either
[06:35] Coderjoe: may explain why most mp3s for bbs docs are under 1kbyte too
[06:35] the ogg files look fine
[06:35] for the size
[06:39] SketchCow will have to tap that item. I don't have permissions for that collection to trigger a rederive
[06:40] I'd tap that
[06:41] there is more that needs tapping
[06:42] Actually.
[06:42] I just got it through the secret backdoor way
[06:42] http://archive.org/catalog.php?history=1&identifier=20040229-bbs-lcrumbling
[06:42] Didn't know that worked. Neat.
[06:42] the internet archive glory hole system
[06:43] bad mp3: http://archive.org/details/1993-bbs-tbbstape
[06:44] mp3 file is 816.0b
[06:44] i want that compression if it worked
[06:44] haha
[06:44] ok, let me make sure this derive runs properly first
[06:44] we all cause have the internet archive audio version on a hard drive
[06:45] I don't want to fuck more than one item if the method I queued it with is fucked
[06:45] godane: what does that mean? I can't understand you
[06:45] if mp3 really compressed that small
[06:45] we could fit all of ia's audio on a HD
[06:45] maybe a usb stick
[06:45] lol
[06:46] even though i think IA will still need a big fat usb stick even at that compression
[06:46] yay, 12 tasks away from running
[06:47] almost there
[06:47] same problem: http://archive.org/details/20040229-bbs-schinnell
[06:47] http://archive.org/details/20040128-bbs-willing
[06:47] Running
[06:47] http://www.us.archive.org/log_show.php?task_id=110466442
[06:48] Cross your fingers
[06:48] i was tempted to try mashing a button I can see for the item
[06:48] http://archive.org/details/22020525-bbs-milkyliz3
[06:49] it looks like the source file may be corrupt
[06:49] but we'll see
[06:50] does the files.xml info match the file?
[06:50] even if it is, our deriver is much better at recovering from shit than it was 5 years ago
[06:50] ffmpeg improvements, etc
[06:50] i hope not
[06:50] hm?
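The suspiciously tiny derivatives above (a ~12 MB ogg/mp4 pair from a 1.3 GB MPEG-2 source, and later an 816-byte mp3) are easy to spot programmatically. Below is a rough Python sketch that flags them via the public metadata endpoint (https://archive.org/metadata/<identifier>); the identifiers are the ones mentioned in the chat, and the 1% threshold is an arbitrary illustration, not an official rule.

```python
#!/usr/bin/env python3
"""Flag suspiciously small derivatives in an archive.org item.

A rough sketch, assuming only the public metadata endpoint
(https://archive.org/metadata/<identifier>); the 1% threshold is an
arbitrary choice for illustration.
"""
import json
import urllib.request

def check_item(identifier, ratio=0.01):
    with urllib.request.urlopen("https://archive.org/metadata/%s" % identifier) as r:
        files = json.load(r).get("files", [])
    # Use the largest original file as a rough size baseline.
    originals = [int(f["size"]) for f in files
                 if f.get("source") == "original" and "size" in f]
    if not originals:
        print("%s: no original files with sizes listed" % identifier)
        return
    baseline = max(originals)
    for f in files:
        if f.get("source") == "derivative" and "size" in f:
            size = int(f["size"])
            if size < baseline * ratio:
                print("%s: %s is only %d bytes (source ~%d bytes)"
                      % (identifier, f["name"], size, baseline))

for ident in ["20040229-bbs-lcrumbling", "1993-bbs-tbbstape"]:
    check_item(ident)
```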
[06:51] jason scott has not updated anything in the bbs docs for 5 years
[06:51] it would be one thing if all the interviews were there
[06:51] but they're not
[06:51] i just fear the full interviews will be lost
[06:51] that's all
[06:53] https://internetarchive.etherpad.mozilla.org/5
[06:53] Please list bad identifiers there
[06:53] IDENTIFIERS ONLY
[06:54] I will trigger a rederive
[06:54] this derive seems to be taking longer than the last one initiated by tracey, which is almost certainly a good thing.
[06:54] absolutely
[06:54] (for the thumbnail generation step alone)
[06:55] It's also much bigger than the TV she's deriving
[06:55] so it makes sense
[06:55] no, I meant the last derive on this interview item
[06:56] back on one of the (now apparently defunct) ia3* nodes
[06:56] oic
[06:56] yes, ia3* is (was?) the thumper farm
[06:57] thumper?
[06:57] http://archive.org/images/petabox-via.jpg
[06:57] ^ thumpers
[06:57] they still exist/are in service, but repurposed
[06:57] godane: Identifiers only please in that list
[06:57] so not archive.org/details/blah
[06:57] just blah
[06:57] I know of the old via red boxes
[06:58] oh. thumper.
[06:58] thump thump
[06:59] thumbnail stage complete. the thumbnail on the details page looks a heck of a lot better
[06:59] I actually have our "display" rack right next to me
[07:00] guarded by sharkive
[07:00] http://i.imgur.com/FizaK.jpg
[07:00] (which is our remote-controlled mylar balloon)
[07:01] nice photo
[07:01] haha
[07:01] it's my shit camera phone
[07:02] it could do with some aiming next time
[07:02] I was taking it blind
[07:02] hard to get a good angle
[07:03] i believe i got all of them
[07:03] ok
[07:03] these only have mp3 problems
[07:03] kicking it off shortly
[07:03] All of them only need the mp3s rederived?
[07:03] (cause I can set that)
[07:04] also i've got this i want to add to the shareware cds: http://archive.org/details/cdrom-3d-world-119
[07:04] I can't do that
[07:04] and this: http://archive.org/details/cdrom-3d-world-150
[07:04] jason will be mad
[07:04] ok
[07:04] queuing the derives is something within my scope
[07:04] ok
[07:04] sorry
[07:05] hacking the permissions system, while possible, is not in my job description
[07:05] and the last thing I want is more talkings-to from SketchCow
[07:05] np :)
[07:05] just want you to know why
[07:05] oh really, I bet I could come up with stuff that you're less interested in
[07:06] i may be getting a bigger hard drive soon
[07:06] The derive thing isn't really a circumvention. Moving collections is me going in and changing items and manually updating mysql tables. Lots of room for error and fuckery, so better to let a superadmin take care of it
[07:06] i hope
[07:06] ok
[07:06] (where they have access to the easy web page that just works (tm)
[07:06] )
[07:06] plus, circumventing perms would not create the correct audit trail
[07:06] ^
[07:06] all kinds of wrong
[07:07] which is why I'm not doing it. I'm slowly learning.
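Since the etherpad is supposed to contain identifiers only ("so not archive.org/details/blah, just blah"), here is a small sketch for normalizing a pasted list; the example input lines are URLs and identifiers that appear earlier in the log, and the regex is just one reasonable way to do it.

```python
#!/usr/bin/env python3
"""Turn pasted archive.org links into bare identifiers.

A minimal sketch for cleaning up a pasted list like the etherpad above;
lines that are already bare identifiers pass through unchanged.
"""
import re

def to_identifier(line):
    line = line.strip()
    # Strip a details/download URL down to its identifier segment.
    m = re.search(r"archive\.org/(?:details|download)/([^/?#\s]+)", line)
    return m.group(1) if m else line

pasted = [
    "http://archive.org/details/20040229-bbs-schinnell",
    "http://archive.org/details/20040128-bbs-willing",
    "1993-bbs-tbbstape",
]
for line in pasted:
    print(to_identifier(line))
```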
[07:07] me too
[07:07] SketchCow's lessons aren't all going to waste
[07:07] that's reassuring
[07:07] of course i'm backing up, like, everything
[07:07] fuck it, doing a full rederive of all those items
[07:07] The other derivatives may be fucked
[07:07] and I don't feel like dealing with it
[07:07] agreed
[07:08] sounds reasonable
[07:08] but i played with most of them and most of the videos are fine
[07:08] So I have a script that's supposed to list the statuses of torrents in deluge
[07:08] and this is what it just gave me
[07:08] 1 Jeremy
[07:08] 325 Seeding
[07:08] 63 Downloading
[07:08] hahahahaha
[07:09] something's obviously borked there
[07:10] ok, rederives queued
[07:10] Priority 0 tasks
[07:10] So they should start immediately.
[07:10] All running.
[07:11] holy fuck
[07:11] http://archive.org/details/diggnation
[07:11] that should be a collection with each video in an item
[07:11] goddamn, Famicoman hehehe
[07:11] http://archive.org/catalog.php?history=1&identifier=diggnation
[07:11] 21-day derive hahahahaha
[07:12] that poor worker
[07:16] underscor: So you got one torrent which is Jeremying?
[07:16] Pretty awesome if you ask me
[07:16] :D
[07:17] and on the other host
[07:17] 1 Holy
[07:17] 51 Seeding
[07:17] 701 Downloading
[07:17] must be a copy of the bible or something
[07:21] underscor: That's what Famicoman does
[07:21] it's meant to make it a one-stop shop
[07:22] yeah, but it breaks our system
[07:22] i was told to do one item per video for the stage6 collection
[07:22] most likely for this (and related) reason(s)
[07:23] plus, darking one video is easier when each is a separate item
[07:23] for some of the rev3 stuff it was ok i think
[07:23] I dunno
[07:23] Not my call
[07:23] like unboxing porn
[07:23] http://archive.org/details/unboxingporn
[07:23] But I'll bring it up tomorrow when alexis is back.
[07:24] i would have preferred diggnation by year
[07:24] just so it would be easier on archive.org
[07:26] sweet jesus that item is wrong on so many levels. on the plus side, it shouldn't need new files added to it (and thus possibly trigger another couple-month-long derive)
[07:27] (the diggnation one is the one I am talking about)
[07:28] over 400 episodes in one item. I don't think the system was intended to work like this.
[07:29] and I don't feel like writing the xml parsing crap I would need to do in order to sum up the original file sizes. I suspect this item is much larger than desired, even before derivatives
[07:30] woopwoop?
[07:30] size: 236,398,442 KB
[07:30] yowch
[07:32] i'm not doing it with dl.tv and crankygeeks, luckily
[07:32] oh yeah. I forgot about that method
[07:32] dl.tv is like 50gb of video
[07:32] crankygeeks is like 30gb
[07:37] I am off to bed
[07:38] please PM me if any of those derives explode
[07:38] rather
[07:38] email me
[07:38] abuie@archive.org
[07:38] oh god! a derive exploded. I have pieces of mpeg2 file stuck in my leg!
[07:38] xD
[07:38] s/pieces/bits/ ?
[07:39] Yeah, better.
[07:39] sleep sounds like an excellent idea
[07:40] have to get up early tomorrow
[07:40] going to redwood city with mario to work on the ia7* datacenter
[07:40] replace disks, rack new hardware, install switches, etc
[07:40] \o/
[07:40] sounds fun
[07:41] For me it is.
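The "xml parsing crap" mentioned above for summing an item's original file sizes is only a few lines. A sketch, assuming the usual <identifier>_files.xml layout with <file source="original"> entries that carry a <size> element (not every entry is guaranteed to have one); diggnation is just the item from the chat.

```python
#!/usr/bin/env python3
"""Sum the original file sizes of an archive.org item from its _files.xml.

A sketch, assuming the file list is served at
https://archive.org/download/<id>/<id>_files.xml with <file source="original">
entries containing a <size> child element.
"""
import urllib.request
import xml.etree.ElementTree as ET

def original_bytes(identifier):
    url = "https://archive.org/download/%s/%s_files.xml" % (identifier, identifier)
    with urllib.request.urlopen(url) as r:
        root = ET.parse(r).getroot()
    total = 0
    for f in root.findall("file"):
        if f.get("source") == "original":
            size = f.findtext("size")
            if size is not None:
                total += int(size)
    return total

print("%.1f GB of originals" % (original_bytes("diggnation") / 1e9))
```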
[07:41] Probably not for a lot of people
[07:41] haha
[07:43] wtf, rsync, why aren't you happy
[07:43] it found and skipped the already-transferred gigabytes, but seems to have hung
[07:44] (this is for yet another multigigabyte memac user)
[07:44] It should continue at some point, I don't know what it's doing there
[07:44] It's happened to me, too; eventually they finished
[07:45] eventually = ?
[07:46] By the next day? :-P Dunno, didn't watch them, just noticed them getting stuck
[07:47] well, shit, okay, I'll just have to deal with my connection not getting fucked
[07:47] hah
[07:47] It might only be some minutes, I really didn't pay attention
[07:47] ah, there it goes
[07:53] I am interested in helping out with the fanfiction.net archiving. The other channel is dead right now.
[07:56] omf_: we're pretty much done with that for now. there's presumably some project to do a continuous scrape to stay on top of it, but we've probably got everything from before the purge hit
[07:56] stupid admins finally decided to enforce a long-ignored content rating rule and started deleting stories
[07:57] aah
[07:57] to clarify, THE SITE ITSELF IS NOT GOING ANYWHERE!!! anytime soon, that I know of. I just thought it would be a good idea to grab it, cause it's the biggest
[07:58] at the time i wasn't even aware of the upcoming purge
[07:58] yeah, I was interested in it to do some nlp work
[07:58] nlp?
[07:58] natural language processing
[07:58] it is hard to find large chunks of modern text
[07:58] ooooooh, Google would love that!
[07:58] project gutenberg and wikipedia work well?
[07:59] pg is not really modern
[07:59] project gutenberg has a format consistency problem
[07:59] and wikipedia is fine for non-fiction writing
[08:00] Well, onto my next idea. I am going to update the deathwatch page for berlios.de and delicious
[08:00] i've got dozens of stories in a consistent format downloaded off fanfiction.net for my own library, via a tool i got from google code
[08:01] also you might want to add dead dying damned to the dead column :)
[08:01] berlios got saved and the new delicious does not have any of the old content
[08:03] anyway i have 1705 stories saved in text format (_italics_ *bold*) with metadata at the beginning of each file, 281mb if you want it
[08:03] the archive we saved will be up soon, presumably; ask SketchCow
[08:08] I'm unable to edit wiki pages :/ "Call to undefined method Article::getSection()"
[08:08] That's a feature, not a bug
[08:09] nah, but SketchCow has said the wiki needs some techlove ^
[08:10] Well I suppose the feature stops the bots
[08:45] Are there any criteria for adding sites to "dead as a doornail" other than it being dead?
[08:47] uploading screen savers from may 2004
[09:01] So what projects need help? The apple one sounds like it is almost complete
[09:09] omf_: well, it's probably best to stick to sites that archiveteam has or would have archived.
[09:09] re: the wiki
[09:10] chronomex, that is what I mean. I only have one example: gameart.org. It had almost 10 years' worth of gaming art
[09:10] then it went down
[09:11] ok
[09:11] /me zzzz
[09:13] I have been getting more and more into big data projects, so I am excited about what is out there.
[09:29] I saw a note on the site about doing comparisons to remove duplicate images from the geocities data. Anyone know what the status of that is?
[09:37] Also the tracker for FortuneCity is offline, so I cannot find out the status of the archiving that took place before it closed earlier this year.
[09:44] omf_: What information were you looking for? At the FoCity tracker, I mean
[10:22] ersi, just to see if the project got completed or things were missed. I had a few sets of pages on there I wouldn't mind seeing again
[10:29] omf_: AFAIK all of the users/pages that we were able to crawl were downloaded. That doesn't say anything about the completeness in general, though, I guess.
[10:30] alard: Know if we've got the dataset of all users/urls left anywhere?
[10:36] http://archive.org/details/archiveteam-fortunecity-list
[10:37] and http://archive.org/details/archiveteam-fortunecity
[10:38] omf_/ersi: and http://archive.org/download/test-memac-index-test/fortunecity.html
[10:40] that is cool
[10:42] Ah, yeah - that was what I was looking for.
[10:42] Good things come to those who search. :)
[10:42] :D
[11:24] Is there a mailing list to follow, or is an ear to the irc channel and an eye on the wiki the way to go?
[11:26] underscor, you can kill the derive if that would help.
[11:33] omf_: No mailing list, this is where the magic happens
[11:34] and in the sub-channels of course (like #wikiteam, #urlteam) and project-specific channels
[12:34] Is there a sub-group for maintaining the wiki?
[13:14] hi all
[13:14] just found your project and i love it
[13:14] :)
[13:21] is there a tool available for windows too?
[13:22] so i can integrate some workstations into the process?
[13:45] fLoo: Yeah, there's the "Archiveteam Warrior" - which is a Virtual Machine you boot up to help out with projects
[13:45] Link to the "AT-warrior" is in the /topic
[13:46] already found it, thanks
[13:46] contributing 12 gbit now
[13:46] hope that helps
[13:48] Whoa man
[13:48] You at some university or something?
[13:49] nope
[13:49] datacenter
[13:49] got some nice machines here
[13:52] is anyone working on the usenet history dump?
[13:52] currently
[13:55] I just remembered about this: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
[13:56] which is 28M posts collected from 2005-2009
[13:56] is there a way to check how many packages + gigabytes i've uploaded in total?
[13:57] it is 40gb
[14:06] mmmh
[14:06] i can't find anything to see how much i contributed
[14:06] :(
[14:08] gogo guys
[14:17] ersi: http://memac.heroku.com/ looks good
[14:17] :)
[14:21] hmm, looks like they stripped a lot of stuff, such that it's useful as a corpus but not so much for history
[14:21] DFJustin: What's that?
[14:22] I just remembered about this: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
[14:22] It is a starting point for that time
[14:22] Thanks
[14:22] the old archives were easier to find
[14:23] it is going to be the more recent posts that are going to be the problem
[14:23] why did google have to buy out dejanews
[14:33] over 700 million posts in google groups. Subtract 1981-1991 and the last 5 years
[14:33] and then consider that not every post is a thread.
[14:34] it is still going to take some clever work
[14:46] https://twitter.com/archiveteam/status/212618778968731649 So hey, what's this about?
[14:54] I found someone else trying to do a usenet archive like google. That at least eases the field a little
[15:18] http://www.fortunecity.ws/
[15:32] http://www.heise.de/newsticker/meldung/Freiwillige-legen-Archiv-oeffentlicher-MobileMe-Daten-an-1626844.html
[15:33] cool
[15:33] the comments are full of angry trolling about how private data should be forbidden to be used "like this"
[15:33] :D
[15:34] You keep using that word. I do not think it means what you think it means.
[15:35] Yeah yeah, people can whine as much as they want - what matters is that the data is not totally gone though :)
[15:35] mistym: me? what word?
[15:36] it's not trolling if they genuinely believe it
[15:36] Schbirid: It's a movie quote. :V Was talking about "private"
[15:36] right
[15:37] DFJustin: What about fortunecity.ws? >_>
[15:38] Schbirid
[15:38] i came here due to this article
[15:38] and now i'm contributing ;)
[15:38] sweet
[15:38] welcome
[15:38] thanks
[15:39] just noting its existence
[15:40] DFJustin: ah, alright
[15:44] guys
[15:44] i just saw that there is a seesaw-s3 script
[15:44] which is perfect for my bandwidth
[15:44] but it requires access tokens
[15:44] how do i acquire them?
[15:45] fLoo: i asked about that a while ago and was told it was not worth using at the moment
[15:45] ok
[15:45] currently i run 20 instances of seesaw per machine
[15:45] still only 25% bw used
[15:45] :(
[15:46] well, stick around. the next bandwidth-eating project will come
[15:46] fLoo: The seesaw-s3 script doesn't download any faster, it's just the upload that's different.
[15:46] ok, then it's np
[15:46] The problem with the normal seesaw script is that everything ends up on one machine, which fills up.
[15:46] uploading is fine here with ~60-70 mbit per connection
[15:46] The seesaw-s3 script uploads directly to archive.org, so that was very helpful for the bulk downloaders.
[15:47] i understand, thanks for the information
[15:47] Thanks for helping!
[15:47] it's a cool project
[15:47] :)
[15:47] is there a way to get my own statistics?
[15:48] i mean my complete contribution stats?
[15:48] http://memac.heroku.com/
[15:48] yeah, but how do i query for a single user?
[15:48] AHHH
[15:49] there is a freaking '+' button
[15:49] seriously... couldn't see it
[15:52] fortunecity.ws is a bit strange: pages full of disclaimers and pseudo-legal texts, but nothing at all about the nature of the site, the source of the data, etc.
[15:54] Indeed
[16:13] schbirid: what do you use to crawl?
[16:13] i don't think it's your hansenet account ;)
[16:14] ovh server(s)
[16:14] just saw it
[16:15] another question: why does archiveteam announce that we're finished with the project
[16:15] but there are still 20k missing
[16:15] where was that announced?
[16:17] heise etc
[16:17] twitter
[16:17] I think the deal is we've been through them all once now and this is the stuff that didn't work right on the first pass
[16:17] oh, SketchCow tweeted that indeed https://twitter.com/archiveteam/status/217665111895191552
[16:18] fLoo: do you have old pc/game mag cover discs? rip them!
[16:19] for example, stuff was added to the script to handle infinitely recursive folders
[16:19] Schbirid: lol why?
[16:20] fLoo: to archive them at archive.org
[16:20] gonna see what i still got
[16:20] So hey, speaking of tweets, who was this / what is this about? https://twitter.com/archiveteam/status/212618778968731649
[16:20] playstation 1 games too?
[16:21] fLoo: one of our side projects is http://archive.org/details/cdbbsarchive
[16:22] old magazine and shareware discs are a great source of stuff that isn't available online anymore
[16:22] yeah
[16:23] playstation i'm not sure about, because I think they need to be ripped using special procedures to be correct
[16:24] Playstation is not exactly endangered material either. Shareware is much more ephemeral.
[16:26] playstation demo discs had weird demos on them
[16:29] Oh yeah, demo discs, that's true.
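On "rip them!" and the note above that some discs need special procedures to be ripped correctly: a raw image plus a cue sheet is the usual starting point. Below is a minimal Python sketch driving cdrdao, assuming cdrdao and toc2cue are installed and the drive is /dev/sr0; whether a plain raw read is sufficient for a given PlayStation title is not something this log settles.

```python
#!/usr/bin/env python3
"""Rip an optical disc to a raw bin+toc image with cdrdao.

A sketch, assuming cdrdao and toc2cue are installed and the drive is
/dev/sr0 (a typical Linux device name).
"""
import subprocess

DEVICE = "/dev/sr0"              # assumption: adjust for your drive
BASENAME = "cover_disc"          # placeholder output name

# Raw read of the disc into a .bin plus a .toc description.
subprocess.run(
    ["cdrdao", "read-cd", "--read-raw",
     "--device", DEVICE,
     "--datafile", BASENAME + ".bin",
     BASENAME + ".toc"],
    check=True,
)

# Convert the .toc to a .cue, which most emulators and burning tools accept.
subprocess.run(["toc2cue", BASENAME + ".toc", BASENAME + ".cue"], check=True)
```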
[16:30] I know a guy who spent years tracking down a Japanese demo disc with the only playable copy of an obscure cancelled game he was obsessed with.
[16:30] did he find it?
[16:30] He did!
[16:30] so it's already on archive.org, isn't it? :)
[16:30] Turns out: pretty good game. Even the demo is very incomplete though.
[16:31] Amazingly - no. It is not.
[16:31] Maybe it should be?
[16:31] very very bad
[16:31] of course it must
[16:32] I put a whole bunch of sega saturn demo discs in there but I haven't looked at other systems yet
[16:37] saturn, yay!
[16:39] http://archive.org/search.php?query=sega%20saturn%20AND%20collection%3Acdbbsarchive
[16:39] there's apparently a couple hundred that were made though
[16:40] I wonder if the Panzer Azel is like the UK magazine demo, which was just the whole first disc
[17:46] I should learn how to burn saturn and sega cd discs
[18:18] I don't think they have any copy protection
[18:23] Famicoman, DFJustin: Saturn has copy protection, you need a swap trick or a modded system to play. Sega CD has no protection
[18:52] may 2004 screen savers episodes are uploaded: http://archive.org/details/TechTV_TSS_2004_05_Full_Episodes
[19:41] mmh
[19:41] why isn't the list updated?
[21:58] Morning.
[21:58] Or afternoon.
[21:58] Who rocked the opening keynote? This guy.
[21:58] What keynote video becomes available this evening? That one.
[21:58] The guy recording the keynote videos is amazing.
[21:58] fuqyea, you talked at google IO!
[22:06] https://www.youtube.com/watch?v=_OczqFEcUTA#t=6m1s
[22:09] fLoo, what list?
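Coming back to the seesaw-s3 discussion earlier in the log: alard notes that the seesaw-s3 script uploads directly to archive.org. Below is a minimal sketch of that kind of direct upload over the S3-like (IAS3) API, not the seesaw-s3 script itself, assuming keys from https://archive.org/account/s3.php; the identifier, filename, and metadata values are placeholders, and the whole file is read into memory, which is fine for a sketch but not for multi-gigabyte chunks.

```python
#!/usr/bin/env python3
"""Upload a file straight to archive.org over the S3-like (IAS3) API.

A minimal sketch of a direct upload, assuming IAS3 keys from
https://archive.org/account/s3.php; identifier, filename, and metadata
are placeholders.
"""
import urllib.request

ACCESS, SECRET = "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"   # placeholders
identifier = "example-upload-item"                       # placeholder item name
filename = "chunk-0001.tar"                              # placeholder file

with open(filename, "rb") as fh:
    body = fh.read()

req = urllib.request.Request(
    "https://s3.us.archive.org/%s/%s" % (identifier, filename),
    data=body,
    method="PUT",
    headers={
        "authorization": "LOW %s:%s" % (ACCESS, SECRET),
        "x-archive-auto-make-bucket": "1",    # create the item if it does not exist
        "x-archive-meta-mediatype": "web",
        "x-archive-meta-title": "Example upload",
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.reason)
```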