[03:57] hi folks, travis goodspeed would like it if anyone could send him paper copies or high-res scans of the Japanese, Brazilian, German, or Arabic (Jordan) variants of Byte Magazine
[04:40] Good luck on THAT, Travis
[05:48] y
[05:49] oops
[06:24] http://devslovebacon.com/conferences/bacon-2014/talks/from-colo-to-yolo-confessions-of-the-angriest-archivist
[07:31] SketchCow: that reminds me -- is audio or video of your Build talk available anywhere?
[07:31] seems like the official videos are not up yet
[08:06] No.
[08:06] Not up yet.
[08:06] No idea why they're taking so long, EXCEPT.
[08:06] This is the last year, so he might be working on a deluxe version of the talk, with the new year as the chaser.
[08:07] is there some place I can go to read historical b threads?
[08:07] asdfsadf: /b/?
[08:08] good one
[08:08] asdfsadf: It was a serious question.
[08:08] What's a b thread?
[08:08] i mean like previous years
[08:08] oh yea
[08:08] Oh okay you do mean /b/.
[08:09] Uh.
[08:09] never mind
[08:09] i thought you meant the joke was all threads are historical haha
[08:09] asdfsadf: :P
[08:09] asdfsadf: Nah. I think there is but I don't remember where it is.
[08:44] myopera.com is about to die and I have the list of all non-banned users.
[08:44] 16 457 047 of them
[08:45] https://docs.google.com/uc?id=0B8aRlPij6kNrTTdRNHdOdDcxWDA&export=download
[08:45] I don't know how to take this information from the userlist to a project
[08:46] Atluxity: You would need a way for us to use that list to grab the content with wget.
[08:46] I don't know much about the process beyond that.
[08:46] I think there was an engineering channel for archive team.
[08:54] I think there was a grab already up
[08:54] https://archive.org/details/files.myopera.com-initialgrab
[09:12] midas: So then we just have to use the method we used last time to grab new data?
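One way to turn the user list into a wget grab, as a rough sketch. Assumptions not confirmed in the log: a plain-text list with one username per line, and profile URLs of the form http://my.opera.com/&lt;user&gt;/; the filenames `users.txt` and `commands.txt` are made up for illustration. The loop only prints the wget invocations (a dry run), so the commands can be reviewed or handed to a job runner before anything is fetched:

```shell
#!/bin/sh
# Tiny sample list standing in for the real ~16.5M-name dump
printf 'alice\nbob\n' > users.txt

# Emit one wget command per username (printed, not executed).
# Assumed URL layout: http://my.opera.com/<user>/
while IFS= read -r user; do
    printf 'wget --mirror --page-requisites --warc-file=myopera-%s %s\n' \
        "$user" "http://my.opera.com/$user/"
done < users.txt > commands.txt
cat commands.txt
```

Once the layout is verified against a few real accounts, the generated file can be piped to `sh` or split across workers.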
[09:17] probably yeah
[10:38] Atluxity: I will start grabbing some users now and see how fast I can go
[10:44] cool
[10:56] looks like there are no limits!!!! :D
[10:56] going very fast
[10:56] 100 links per second or something like that
[10:57] Atluxity: does my opera also host videos?
[10:59] I was told there was no limits, the guy just said "please be gentle". I don't know how gentle we can be if we are to get it all before March 1st
[10:59] I don't know if myopera hosts videos
[11:00] it would surprise me
[11:00] did the list of usernames help?
[11:00] yep
[11:00] running a test on chooseopera
[11:00] it looks like it's quite big
[11:00] so a good test
[11:05] arkiver: what's the status on wallbase?
[11:05] joepie91: well they have limits
[11:06] :(?
[11:06] so the crawl is limited but it's going fine
[11:06] around 40000 items per 12 hours now
[11:06] how long do you expect it to take to grab everything including metadata?
[11:06] and they say they don't have plans to shut it down
[11:06] mhmm
[11:06] I'm first doing all the wallpaper pages with the wallpapers and stuff
[11:07] and then I'll start doing the other things like search terms and stuff
[11:07] (zoom works!!!!)
[11:08] doing around 63 pages per minute
[11:09] when is my opera going to shut down?
[11:09] ^ Atluxity
[11:11] will upload the first warc of chooseopera when finished
[11:11] it is a lot bigger than the average blog
[11:11] but a good test to start with
[11:12] arkiver: March 1st
[11:12] hmm march 1st
[11:12] thank you
[11:14] and do you know when my opera was created?
[11:15] I don't have that information easily available, do you want me to research it?
[11:17] got it
[11:17] http://en.wikipedia.org/wiki/My_Opera
[11:17] 2001
[11:17] well I need that since there are urls for a calendar
[11:17] that are going forever
[11:17] so even to the Middle Ages
[11:17] going to limit it to a time
[11:17] when it was created
[11:19] a wise move
[11:20] is this just something you have put on a server you have access to or can it be made into a warrior project?
[11:23] I don't know how to create a warrior project
[11:23] you should ask chfoo about that
[11:24] I'm just excluding everything with /archive/monthly/ now
[11:27] PANIC
[11:27] http://www.heise.de/open/meldung/BerliOS-Entwicklerplattform-macht-zu-2104211.html?wt_mc=rss.open.beitrag.atom
[11:27] BerliOS shutting down April 30
[11:36] http://developer.berlios.de/forum/forum.php?forum_id=39220
[11:40] Haven't we grabbed berlios before? :o
[11:41] ersi: the response I got in another channel was "oh, shutting down again?" so quite possibly
[11:42] but figure a grab would be important regardless
[11:56] chfoo: can we create a warrior project from my opera?
[12:11] joepie91: Indeed
[12:33] for some sites i almost start to think "shouldn't you be dead yet?"
[12:37] Yes, that's what the Yahoo! CEO thinks of every website loaded in their browser.
[12:38] true story
[12:38] midas: haha
[12:38] "euhm.. why is this still around?"
[12:39] yeah, the last time berliOS said they would shut down was like 2 years ago?
[12:40] 2011
[13:49] hmm
[13:49] I can try to get a script running here to download most of the channels
[13:49] chooseopera is downloaded
[13:49] 1.4 GB
[13:50] took only a few minutes
[13:50] and i think that's one of the biggest accounts...
[13:52] and 45639 urls
[14:01] working on batch script for WarcMiddleware
[14:12] working for me...
:D
[14:12] testing with the first 45 accounts from the list
[14:18] 45 accounts done
[14:18] going to do a crawl of the first 1 million accounts now
[14:21] 100.000 accounts*
[14:26] okay godane2
[14:34] doing a test on the first 100.000 accounts
[14:34] if that works, I'll put all of the millions of accounts in it
[14:34] and it should be going then
[14:42] but I'm not sure if I can make it before the deadline
[14:42] even if it's going this fast
[14:42] so
[14:42] the best thing is to split it up between people I think
[14:49] I can try to make it go even faster...
[14:51] will try that tonight or tomorrow
[14:51] shall I upload some warc.gz examples to show that the warcs work?
[14:53] according to what I see I should be able to run multiple sessions
[14:53] need 17 sessions then
[14:53] will try that
[15:14] going to start 30 sessions tomorrow
[15:14] to even have some days left before the deadline to be sure everything is there
[15:14] will keep you guys informed every day about the progress
[15:14] I'm also still doing wallbase.cc BTW
[15:18] nice arkiver
[15:18] if you need any help, let me know
[15:18] thank you midas
[15:18] I'll keep that in mind!! :D
[15:18] :p
[15:18] tonight I will upload some warcs for people to see
[15:18] buuuuuut
[15:19] someone does need to do the forums
[15:19] and the things beside the accounts
[15:20] my FTP grab just passed the 9TB...
[15:22] midas, wow that's a lot
[15:22] which ftp server?
[15:23] 5 ftp's
[15:23] ftp.tu-chemnitz.de ftp.uni-muenster.de gatekeeper.dec.com
[15:23] ftp.uni-erlangen.de ftp.warwick.ac.uk
[15:23] haha good job man!
[15:23] she's still going ;-)
[15:23] it's going to take me weeks to upload this
[15:27] lol
[15:27] know that
[15:27] download faster than upload...
[15:27] :/
[15:27] horrible...
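The 17-to-30-session plan above amounts to partitioning the user list into chunks, one per crawler. A sketch assuming GNU coreutils `split` (the `-n l/N` mode divides a file into N pieces without breaking lines); `users.txt` and the `users.part.` prefix are illustrative names:

```shell
#!/bin/sh
# Sample list of 100 names standing in for the full user dump
seq 1 100 | sed 's/^/user/' > users.txt

# -n l/17: 17 roughly equal pieces, never splitting a line;
# -d: numeric suffixes, producing users.part.00 .. users.part.16
split -n l/17 -d users.txt users.part.

# One chunk per crawl session
ls users.part.* | wc -l
```

Concatenating the pieces in suffix order reproduces the original list, so no account is lost or duplicated by the partitioning.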
[15:30] yeah
[15:30] 180Mbit down, 100Mbit up
[15:31] still, 32TB/mnd should be doable at max speed, im guessing ill hit 20TB/mnd
[15:39] FYI: I need to exclude all accounts that have |, &, <, >, (, ), ^ and @
[15:39] so if someone can search for the accounts with those things in it and download them
[15:39] would be great
[15:39] since I can't do them here with this script
[15:50] /mnd? you german, midas? :)
[15:50] dutch :p i mean month, /mo?
[15:55] midas: also dutch?? :D
[15:58] jup :p
[15:58] haha me too! :)
[16:05] lots of dutchies here
[16:06] spoorwaggen
[16:06] and shit yo
[16:13] spoortwaggen?
[16:13] spoorwaggen*
[16:13] you mean spoorwagen? :)
[16:14] downloading 1680 accounts per second
[16:14] per hour*
[16:14] yeah, I did
[16:14] going to get that up tomorrow to around 50000-60000
[17:35] anyone can create a project. i'm only writing the recent grab scripts, but someone else adds it to the master list of warrior projects.
[18:02] if there is no BerliOS channel, join #honeynutberlios
[18:08] https://archive.org/details/businesscase now in 1.0
[18:37] switch back to browser, ctrl-t and start typing url, realize inputs are going into atari 800 visicalc #archiveteamproblems
[19:12] hello
[19:15] DFJustin: hahaha
[19:54] Pretty much.
[20:21] chfoo: I'm already going quite well with that website. :D
[20:21] today just testing
[20:22] tomorrow I'll try to do up to 50000-60000 accounts per hour.
[20:30] sounds good
[20:33] :)
[20:33] well
[20:33] chfoo, this one has no limits, so it isn't too hard... :)
[20:33] hehe, most accounts are just empty
[20:33] created and nothing ever done with them
[20:33] but
[20:33] chfoo
[20:34] I downloaded many thousands of accounts now as a test
[20:34] and here it looks like they are working
[20:34] would you mind if I upload some warcs so that you can also test them?
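The problem accounts arkiver mentions are ones whose usernames contain shell metacharacters (|, &, <, >, (, ), ^, @). A grep character class can split those out of the list so the unsafe names get handled separately; the sample names and the `users.txt` / `users.safe.txt` / `users.manual.txt` filenames are made up for illustration:

```shell
#!/bin/sh
# Sample names, two of them containing shell metacharacters
printf 'plainuser\nweird|user\nat@user\nnormal2\n' > users.txt

# Inside a bracket expression all of these characters are literal,
# so no escaping is needed.
grep -v '[|&<>()^@]' users.txt > users.safe.txt    # fine for the batch script
grep    '[|&<>()^@]' users.txt > users.manual.txt  # needs separate handling

wc -l users.safe.txt users.manual.txt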
[20:34] (just in case)
[20:35] arkiver: sure, maybe a few others can take a look at them too
[20:35] yes that sounds good
[20:36] just to be 100% sure they work
[20:36] imagine: downloaded millions of accounts and they don't work...
[20:36] O.o
[20:36] going to pack up and upload some
[20:40] chfoo hang on, uploading...
[20:47] chfoo: https://www.filepicker.io/api/file/ZizgWffKT1e9PGaCpiLP
[20:47] some warcs
[20:47] I added two big warcs
[20:47] and a lot of small warcs
[20:47] you'll see most of them are just empty accounts
[20:56] D:
[20:57] oh right
[20:57] most are empty because there's nothing to download D:
[20:59] no no
[20:59] they are not empty
[20:59] just the account has been created
[20:59] and the creator has done nothing
[21:00] Smiley: like this one:
[21:00] http://my.opera.com/4bass8/
[21:00] going to stop the test now
[21:01] and start testing with more multiple crawlers tomorrow
[21:07] chfoo: and? do they work for you?
[21:08] arkiver: i'm a bit concerned that you are requesting gzip compression "Accept-Encoding: x-gzip,gzip,deflate"
[21:09] hmm
[21:09] I could view them well with warc proxy
[21:09] and I uploaded some to the IA
[21:09] https://archive.org/details/arkiver20140131-1
[21:10] my new packs
[21:10] they are indexed well
[21:10] https://archive.org/download/arkiver20140131-1/arkiver20140131-1.cdx.gz
[21:10] but because my item is not in the web collection I can't view them in the wayback machine
[21:10] would it be possible for you to quickly upload those items to the wayback machine and see if they work there?
[21:11] arkiver: i can't. i'm not affiliated in any way.
[21:11] :/
[21:11] ah
[21:11] but they do index
[21:12] so they should work right?
[21:12] and they work in the warc proxy
[21:13] arkiver: it depends on how the wayback machine handles it. i have no idea actually.
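chfoo's gzip concern: because the crawler advertised "Accept-Encoding: x-gzip,gzip,deflate", servers may have returned compressed bodies, and those get stored compressed inside the WARC, which not every replay tool decodes. A quick way to check whether an (uncompressed) WARC contains such records is to grep for the response header; the sample record below is fabricated purely for illustration:

```shell
#!/bin/sh
# Fabricated WARC-ish snippet standing in for a real .warc file
cat > sample.warc <<'EOF'
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:0001>

HTTP/1.1 200 OK
Content-Encoding: gzip
EOF

# Nonzero count means compressed payloads are stored in the archive;
# re-crawling with "Accept-Encoding: identity" would avoid this.
grep -ac '^Content-Encoding: gzip' sample.warc
```

For a `.warc.gz` the file would first be piped through `zcat`.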
[21:14] hmm
[21:17] and i noticed that "WARC-Payload-Digest" is missing as well (but that's optional)
[21:18] yep
[21:18] maybe SketchCow wants to move the items (temporarily) to the web section of the IA just to see if they work??
[21:22] So using wget/warc is the best way to archive a site?
[21:22] arkiver: hold on, the warc file isn't valid. there's a duplicate WARC-Record-ID
[21:23] hmm
[21:27] testing....
[21:27] inserting in older uploaded pack
[21:27] gosh
[21:28] I hope it works
[21:30] Nobody can shove them into wayback but employees.
[21:30] Why would I move them in temporarily?
[21:31] Dud1: or wpull. or use archivebot.
[21:43] SketchCow: to test if they work
[21:43] the warcs from my opera
[21:45] Yeah, but if they work, they're in.
[21:45] Why not have them in.
[21:46] no reason to pull them out if they work
[21:53] SketchCow: ah, yes, of course the warcs I'm producing now seem to be working in warc proxy
[21:54] so if you want to do that
[21:54] would be nice!!
[21:58] Give me your internet archive account e-mail.
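The invalid-WARC diagnosis above was a duplicated WARC-Record-ID, which the spec requires to be globally unique per record. Duplicates in an uncompressed WARC can be spotted with a sort/uniq pipeline; the sample file here is fabricated, with two records deliberately sharing an ID:

```shell
#!/bin/sh
# Fabricated headers: records 1 and 3 share the same WARC-Record-ID
cat > sample.warc <<'EOF'
WARC/1.0
WARC-Record-ID: <urn:uuid:aaaa>
WARC/1.0
WARC-Record-ID: <urn:uuid:bbbb>
WARC/1.0
WARC-Record-ID: <urn:uuid:aaaa>
EOF

# Print any record ID that occurs more than once
# (for a .warc.gz: zcat file.warc.gz | grep -a ... | sort | uniq -d)
grep -a '^WARC-Record-ID:' sample.warc | sort | uniq -d
```

Empty output means the IDs are unique; any printed line is a duplicated record ID that needs fixing before upload.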