[00:00] i have to keep the server until july anyway
[00:00] will just downgrade ram, cpu, ... to make it a little bit cheaper
[00:00] ok
[00:00] I probably have everything finished in a few weeks
[00:04] *** DFJustin has quit IRC (Ping timeout: 740 seconds)
[00:07] *** DFJustin has joined #archiveteam
[00:07] *** swebb sets mode: +o DFJustin
[00:16] *** maltris has quit IRC (Ping timeout: 246 seconds)
[00:22] *** maltris has joined #archiveteam
[00:46] *** aaaaaaaa_ has joined #archiveteam
[00:53] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[00:54] *** aaaaaaaa_ is now known as aaaaaaaaa
[01:05] *** Ravenloft has quit IRC (Remote host closed the connection)
[01:17] *** ivan`_ has quit IRC (Ping timeout: 250 seconds)
[01:19] *** ivan` has joined #archiveteam
[01:22] *** Sellyme_ has joined #archiveteam
[01:23] *** fenn_ has joined #archiveteam
[01:26] *** danneh_ has quit IRC (hub.se efnet.port80.se)
[01:26] *** fenn has quit IRC (hub.se efnet.port80.se)
[01:26] *** thechip has quit IRC (hub.se efnet.port80.se)
[01:26] *** Nemo_bis has quit IRC (hub.se efnet.port80.se)
[01:26] *** GLaDOS has quit IRC (hub.se efnet.port80.se)
[01:26] *** Sellyme has quit IRC (hub.se efnet.port80.se)
[01:26] *** Kazzy has quit IRC (hub.se efnet.port80.se)
[01:26] *** Stary2001 has quit IRC (hub.se efnet.port80.se)
[01:26] *** fresco___ has quit IRC (hub.se efnet.port80.se)
[01:26] *** fluff_ has quit IRC (hub.se efnet.port80.se)
[01:26] *** RainbowCo has quit IRC (hub.se efnet.port80.se)
[01:26] *** Muad-Dib has quit IRC (hub.se efnet.port80.se)
[01:26] *** Kniffy has quit IRC (hub.se efnet.port80.se)
[01:26] *** lhobas has quit IRC (hub.se efnet.port80.se)
[01:26] *** parsons has quit IRC (hub.se efnet.port80.se)
[01:26] *** Shank___ has quit IRC (hub.se efnet.port80.se)
[01:26] *** deathy has quit IRC (hub.se efnet.port80.se)
[01:26] *** VonScoot has quit IRC (hub.se efnet.port80.se)
[01:26] *** Riviera has quit IRC (hub.se efnet.port80.se)
[01:27] *** Kazzy_ has joined #archiveteam
[01:41] *** Kazzy_ is now known as Kazzy
[01:50] *** godane has quit IRC (Ping timeout: 615 seconds)
[01:51] *** godane has joined #archiveteam
[02:03] *** sep332 has quit IRC (Ping timeout: 615 seconds)
[02:03] *** sep332 has joined #archiveteam
[02:06] *** Nertsy has quit IRC (Read error: Operation timed out)
[02:06] *** pft has quit IRC (Read error: Operation timed out)
[02:06] *** Jogie_ has quit IRC (Read error: Operation timed out)
[02:08] *** Nertsy has joined #archiveteam
[02:10] *** khaoohs has quit IRC (Read error: Operation timed out)
[02:11] *** mistym has quit IRC (hub.efnet.us irc.paraphysics.net)
[02:11] *** nertzy has quit IRC (hub.efnet.us irc.paraphysics.net)
[02:11] *** phuzion has quit IRC (hub.efnet.us irc.paraphysics.net)
[02:11] *** Sue_ has quit IRC (hub.efnet.us irc.paraphysics.net)
[02:14] *** khaoohs_ has joined #archiveteam
[02:14] *** mistym has joined #archiveteam
[02:14] *** nertzy has joined #archiveteam
[02:14] *** phuzion has joined #archiveteam
[02:14] *** Sue_ has joined #archiveteam
[02:16] *** Jogie has joined #archiveteam
[02:17] *** primus104 has quit IRC (Read error: Operation timed out)
[02:17] *** primus has quit IRC (Read error: Operation timed out)
[02:17] *** wp494 has quit IRC ()
[02:17] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:17] *** dashcloud has joined #archiveteam
[02:17] *** primus has joined #archiveteam
[02:22] *** primus104 has joined #archiveteam
[02:24] *** khaoohs__ has joined #archiveteam
[02:24] *** khaoohs_ has quit IRC (Read error: Connection reset by peer)
[02:29] *** nertzy2 has joined #archiveteam
[02:30] *** phuzion_ has joined #archiveteam
[02:31] *** Sue_ has quit IRC (Ping timeout: 246 seconds)
[02:31] *** Sue_ has joined #archiveteam
[02:33] *** phuzion has quit IRC (Read error: Connection reset by peer)
[02:35] *** Sue_ has quit IRC (Ping timeout: 246 seconds)
[02:36] *** ruukasu has quit IRC (Quit: WeeChat 1.0.1)
[02:36] *** nertzy has quit IRC (Ping timeout: 615 seconds)
[02:36] *** Froggypwn has joined #archiveteam
[02:39] *** ruukasu has joined #archiveteam
[02:39] *** Start has joined #archiveteam
[02:45] *** wp494 has joined #archiveteam
[02:50] *** Sue_ has joined #archiveteam
[02:50] *** human39 has quit IRC (Leaving)
[03:05] *** Start has quit IRC (Ping timeout: 480 seconds)
[03:07] *** Start has joined #archiveteam
[03:13] *** rejon has joined #archiveteam
[03:19] *** Start has quit IRC (Quit: Leaving)
[03:54] *** primus104 has quit IRC (Leaving.)
[04:16] *** pft has joined #archiveteam
[04:48] *** Ymgve has quit IRC ()
[05:03] *** mistym has quit IRC (Remote host closed the connection)
[05:05] *** aaaaaaaaa has quit IRC (Leaving)
[05:06] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:10] *** dashcloud has joined #archiveteam
[05:15] *** rejon has quit IRC (Ping timeout: 480 seconds)
[05:17] *** pft has quit IRC (ny.us.hub irc.paraphysics.net)
[05:17] *** Sue_ has quit IRC (ny.us.hub irc.paraphysics.net)
[05:20] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:23] *** dashcloud has joined #archiveteam
[05:27] *** pft has joined #archiveteam
[05:27] *** Sue_ has joined #archiveteam
[05:29] *** brayden has quit IRC (Read error: Operation timed out)
[05:29] *** mistym has joined #archiveteam
[05:42] *** GLaDOS has joined #archiveteam
[05:42] *** swebb sets mode: +o GLaDOS
[05:42] *** brayden has joined #archiveteam
[05:43] *** fluff_ has joined #archiveteam
[05:51] *** APerti has joined #archiveteam
[06:27] *** trs80 has quit IRC (hub.efnet.us irc.umich.edu)
[06:41] *** trs80 has joined #archiveteam
[06:47] *** ruukasu has quit IRC (Ping timeout: 265 seconds)
[07:01] *** thefox has joined #archiveteam
[07:05] *** the_fox has quit IRC (Read error: Operation timed out)
[07:08] *** ruukasu has joined #archiveteam
[07:55] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:58] *** dashcloud has joined #archiveteam
[08:14] *** ruukasu has quit IRC (Ping timeout: 265 seconds)
[08:40] *** brayden has quit IRC (Ping timeout: 606 seconds)
[08:43] *** brayden has joined #archiveteam
[08:45] *** primus104 has joined #archiveteam
[08:49] *** signius has joined #archiveteam
[09:04] *** ruukasu has joined #archiveteam
[09:05] *** schbirid has joined #archiveteam
[09:26] *** brayden has quit IRC (Read error: Operation timed out)
[09:33] *** mistym has quit IRC (Remote host closed the connection)
[09:48] *** brayden has joined #archiveteam
[10:15] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:17] *** APerti has quit IRC (Ping timeout: 370 seconds)
[10:19] *** APerti has joined #archiveteam
[10:27] *** okeuday has quit IRC (Ping timeout: 480 seconds)
[10:38] *** okeuday has joined #archiveteam
[11:11] *** Riviera has joined #archiveteam
[11:21] *** APerti has quit IRC (Ping timeout: 370 seconds)
[11:34] *** danneh_ has joined #archiveteam
[11:44] *** Ymgve has joined #archiveteam
[12:12] *** deathy has joined #archiveteam
[12:39] *** primus104 has quit IRC (Leaving.)
[13:08] *** signius has quit IRC (Read error: Operation timed out)
[13:20] *** signius has joined #archiveteam
[13:47] *** eprillios has quit IRC (Read error: Operation timed out)
[14:16] *** APerti has joined #archiveteam
[14:25] *** APerti has quit IRC (Ping timeout: 370 seconds)
[14:36] *** SadDM has quit IRC (leaving)
[14:36] *** SadDM has joined #archiveteam
[14:38] *** sankin has joined #archiveteam
[14:40] *** Start has joined #archiveteam
[14:53] *** thechip_ has joined #archiveteam
[14:53] *** SadDM has quit IRC (leaving)
[14:55] *** SadDM has joined #archiveteam
[14:55] *** swebb sets mode: +o SadDM
[14:58] http://computing.vt.edu/kb/entry/3997
[14:59] *** SadDM has left
[14:59] The Filebox service will be shut down on December 31, 2014. After December 31, 2014, you will no longer be able to access files located on Filebox, so download a copy of your files and Web sites now.
[15:08] *** SadDM has joined #archiveteam
[15:08] *** swebb sets mode: +o SadDM
[15:09] *** SadDM has left
[15:13] *** SadDM has joined #archiveteam
[15:13] *** swebb sets mode: +o SadDM
[15:14] *** Start has quit IRC (Ping timeout: 606 seconds)
[15:14] *** SadDM has left
[15:18] *** SadDM has joined #archiveteam
[15:18] *** swebb sets mode: +o SadDM
[15:29] *** primus104 has joined #archiveteam
[15:30] *** Start has joined #archiveteam
[15:31] *** Start has quit IRC (Client Quit)
[15:54] *** db48x has joined #archiveteam
[16:29] *** aaaaaaaaa has joined #archiveteam
[16:38] well, at least it was almost 2 months ahead of time...
[16:38] does filebox have any public data?
[16:40] seems to have public sites, under http://filebox.vt.edu/users/
[16:40] http://filebox.vt.edu/users/cdgorman/
[16:43] *** SadDM has left
[16:43] oh god
[16:43] https://www.google.nl/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=site%3Afilebox.vt.edu%20inurl%3Ausers
[16:43] that does not look good...
[16:44] Kazzy: wat do?
[16:44] well, 21.3k results.. that feels like a warrior project
[16:45] *** SadDM has joined #archiveteam
[16:45] *** swebb sets mode: +o SadDM
[16:48] censorship? :) https://filebox.vt.edu/robots.txt
[16:48] *** SadDM has left
[16:48] *** SadDM has joined #archiveteam
[16:48] *** swebb sets mode: +o SadDM
[16:49] *** SadDM has quit IRC (Client Quit)
[16:49] I'd noticed that yeah, they all return 404's though..
[16:49] *** SadDM has joined #archiveteam
[16:49] *** swebb sets mode: +o SadDM
[16:49] "The default quota (space limitation) for Filebox is currently 30 MB."
[16:50] https://computing.vt.edu/content/filebox-documentation#space
[16:56] *** SadDM has quit IRC (leaving)
[16:57] *** SadDM has joined #archiveteam
[16:57] *** swebb sets mode: +o SadDM
[17:01] *** db48x has quit IRC (Ping timeout: 258 seconds)
[17:02] we would have like a week for a warrior project
[17:02] and I'll be going to 31c3, so I have no time for it
[17:02] :/
[17:04] oh noes, does that mean we will have to meet? D: D: D:
[17:04] :)
[17:04] nico: you coming again too?
[17:05] schbirid: yes, obviously! :P
[17:05] archiveteam meeting required
[17:05] *** eprillios has joined #archiveteam
[17:05] :)
[17:05] also whoop whoop whoop offtopic siren
[17:05] absolutely
[17:05] oops
[17:09] *** Froggypwn has quit IRC (Quit: ~ Trillian Astra - www.trillian.im ~)
[17:09] if someone can get a google/bing/whatever scrape for filebox.vt.edu URLs, we could possibly get somewhere
[17:20] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:24] *** dashcloud has joined #archiveteam
[17:58] SketchCow: 500.000 items will be added to the clip art project. FOS is the target
[18:26] *** primus104 has quit IRC (Leaving.)
[18:28] *** Elegance has quit IRC (Quit: :(){ :|:& };:)
[18:41] *** asd_ has joined #archiveteam
[18:41] *** asd_ has quit IRC (Client Quit)
[18:45] *** Emcy_ has joined #archiveteam
[18:47] *** Emcy has quit IRC (Read error: Operation timed out)
[18:50] *** bzc6p__ has joined #archiveteam
[18:50] *** bzc6p__ is now known as bzc6p
[18:51] I've started scraping Google
[18:51] for filebox.vt.edu
[18:51] *** mistym has joined #archiveteam
[18:52] bzc6p: do you think we need a project for it?
[18:52] arkiver: I don't think anything, Kazzy just said we need a scrape and I started to scrape.
[18:53] "I've no idea what it is, but I'm helping in archiving it"
[18:53] arkiver: we're looking at 21.3k results from google
[18:53] maximum filesize 30MB
[18:53] bzc6p: thanks for the scrape, was busy earlier :)
[18:54] I'll just spit out an url list soon.
[18:55] *** Elegance has joined #archiveteam
[18:56] so filebox profiles are made like http://filebox.vt.edu/users/*user*/
[18:57] might be possible to do with archivebot
[18:57] provide a !a < list
[18:57] from the results i've seen, yet
[18:57] it's possible with archivebot, we have a week
[18:58] yeah
[19:03] microsoft clip art is running
[19:03] next up: Nokia Memories
[19:11] *** db48x has joined #archiveteam
[19:33] arkiver: regarding filebox, a user's site can be reached at least three ways.
[19:34] hmm?
[19:34] e.g. for "rmtaylor" all of the following work:
[19:34] /users/rmtaylor
[19:34] /r/rmtaylor
[19:34] /~rmtaylor
[19:35] I guess we reach the same files each way, but how many broken links do we leave behind?
[19:36] However, grabbing everything in each of the 3 ways would be practically doing the same thing 3 times.
[19:37] If the whole thing is not too big in size and the website is fast enough, we'll do all three ways
[19:37] I guess a warrior project would be good for this then
[19:37] And no, it doesn't seem to be redirecting (the url in my browser remains the same path)
[19:38] can't we just process that afterwards, once we've grabbed one copy?
[19:38] then it could be done with archivebot too
[19:38] Kazzy: can do that, but if we have enough time and site can handle it, it's better to do the three ways at the start
[19:41] arkiver: you may have misunderstood me. So the situation is NOT that two redirect to a third one. It appears as if they all existed separately.
[19:41] Yes I know
[19:42] Okay, I wasn't sure.
[19:42] so that can be grabbed with archivebot, if that is fast enough
[19:42] just three different urls for each found url
[19:42] Hopefully just three.
[19:44] well, archivebot had the deduplicates, although I'm not sure how that works in the warc
[19:45] err, archivebot has a spotter for duplicates, but I'm not sure how that looks in the warc.
[19:46] arkiver: do we suppose that all of a user's files are reachable from their site, or shall I also create a list of individual files listed by Google?
[19:47] bzc6p: from what I have seen, not all files are linked to from the main page
[19:47] so it would be best to have the full list of individual urls
[19:52] So I've got a list of URLs from Google and Common Crawl. (I don't speak Bing.) Now I'll create a list of user main pages and the found individual files in all three ways.
[19:53] *** sankin has quit IRC (Leaving.)
[20:00] *** BlueMaxim has joined #archiveteam
[20:02] looks like archiveteam doesn't display sub-collection anymore
[20:02] SketchCow: i would like to know why
[20:03] *** primus104 has joined #archiveteam
[20:03] ha ha
[20:03] So, something is broken.
[20:03] I've been raising it with the devs
[20:07] i think if you edit archiveteam collection you can fix it
[20:08] No.
[20:08] It's endemic throughout the system. Code changed.
[20:08] oh
[20:08] ok
[20:08] I'd been bringing it up elsewhere.
[20:08] i only really noticed it in archiveteam collection
[20:08] The lead dev really wants us moving to the v2 version of archive.org, and she doesn't always care if v1 stuff is affected.
[20:09] Also, we have had a number of more aggressive "cleanup" routines running, which I've been contributing to.
[20:09] Sometimes they get a tad saucy.
[20:09] fyi my bug in cbsnews.com videos collection happened to an item in archiveteam-fire
[20:09] from 2011
[20:09] and it's not ximm's fault this time
[20:09] based on history
[20:10] SketchCow: https://archive.org/details/forum.nos.org-2007
[20:10] it has ubuntu arc.gz files in it
[20:12] good news is they are released to the domain
[20:12] forums.nos.org domain
[20:13] at least that one is a web archive and someone else's problem
[20:14] but also shows there is bad code in archive.org upload script it looks like
[20:14] like it will upload to any item with its domain in it
[20:32] *** ruukasu has quit IRC (Ping timeout: 265 seconds)
[20:35] *** ruukasu has joined #archiveteam
[20:35] *** amerrykan has quit IRC (Quit: Quitting)
[21:06] *** amerrykan has joined #archiveteam
[21:11] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:14] *** dashcloud has joined #archiveteam
[21:17] *** Start has joined #archiveteam
[21:17] arkiver: one last question. Does it matter if there is "www." at the beginning of the URL, or does the Wayback Machine not distinguish that from the one without www? Because if they count as different, I should make two versions.
[21:17] I guess the answer is no, but I want to make sure.
[21:18] www. and no www are different things and i am sure the WM differentiates
[21:18] if one of them is canonical, use that
[21:24] *** thechip_ has quit IRC (Ping timeout: 265 seconds)
[21:26] I think wayback has some smarts to roll them together
[21:28] Most of them are without www. But I think most of us omit www, although it may matter.
[21:30] *** db48x` has joined #archiveteam
[21:33] *** db48x has quit IRC (Ping timeout: 258 seconds)
[21:34] Problem has been fixed
[21:38] a quick test seems to indicate that it treats them the same, but not other subdomains
[21:39] *** db48x` has quit IRC (Ping timeout: 258 seconds)
[21:40] *** Start has quit IRC (Ping timeout: 265 seconds)
[21:41] So I've discovered ~1050 users, and have direct links to ~5600 files.
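Editor's sketch of the list expansion discussed above: each discovered user page or file is emitted under all three path forms seen in the channel (/users/name, /r/name, /~name). This is only an illustration, not the script that was actually used. The function name `filebox_variants` is hypothetical, and it assumes the single-letter directory in /r/rmtaylor is the first letter of the user name, which the log suggests but does not confirm.

```python
HOST = "filebox.vt.edu"  # host named in the log; no "www." prefix, as in the posted list


def filebox_variants(user, rest=""):
    """Return the three URL forms for one user, optionally with a file path.

    Assumption: the one-letter directory (/r/rmtaylor) is the first
    letter of the user name. The log only shows a single example.
    """
    if not user:
        raise ValueError("user name must not be empty")
    prefixes = (f"/users/{user}", f"/{user[0]}/{user}", f"/~{user}")
    # A user main page ends in "/", a file keeps its relative path.
    suffix = "/" + rest.lstrip("/") if rest else "/"
    return [f"http://{HOST}{prefix}{suffix}" for prefix in prefixes]


if __name__ == "__main__":
    for url in filebox_variants("rmtaylor"):
        print(url)
```

Running the function once per discovered user and once per discovered file would give a flat URL list of the kind that can be handed to a crawler, at roughly three times the number of discovered entries.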
[21:46] *** Start has joined #archiveteam
[21:50] *** thechip has joined #archiveteam
[21:53] *** Start has quit IRC (Ping timeout: 265 seconds)
[21:53] *** Start has joined #archiveteam
[21:55] *** ruukasu has quit IRC (Quit: WeeChat 1.0.1)
[21:56] *** ruukasu has joined #archiveteam
[21:56] *** Start has quit IRC (Remote host closed the connection)
[22:10] *** ruukasu has quit IRC (Quit: WeeChat 1.0.1)
[22:10] *** ruukasu has joined #archiveteam
[22:16] arkiver: list with filebox user main pages and with every discovered file, in all the three versions, is ready. Sites work with and without www. prefix without redirecting; but in my list they are without www., extend if you wish..
[22:17] http://paste.archivingyoursh.it/raw/xajobaqogo
[22:29] *** bzc6p_ has joined #archiveteam
[22:36] *** schbirid has quit IRC (Read error: Operation timed out)
[22:37] *** bzc6p has quit IRC (Read error: Operation timed out)
[22:40] *** APerti has joined #archiveteam
[23:00] *** kyan has quit IRC (Quit: This computer has gone to sleep)
[23:11] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:13] *** mhazinsk has joined #archiveteam
[23:14] *** dashcloud has joined #archiveteam
[23:17] arkiver: I'm prepending http:// to the list bzc6p gave.
[23:18] aaaaaaaaa: I doubt that wpull needs that, but feel free
[23:20] wayback makes no difference between www. and no www.
[23:20] Can you think of a floppy disk game that has a proprietary file system on it to prevent copying?
[23:23] *** db48x has joined #archiveteam
[23:24] *** Ravenloft has joined #archiveteam
[23:27] *** db48x has quit IRC (Remote host closed the connection)
[23:29] *** useretail has quit IRC (Read error: Operation timed out)
[23:29] *** garyrh has quit IRC (Read error: Operation timed out)
[23:29] *** will__ has quit IRC (Read error: Operation timed out)
[23:29] *** Void_ has quit IRC (Read error: Operation timed out)
[23:29] APerti: didn't some Amiga programs do that?
[23:31] *** will__ has joined #archiveteam
[23:31] *** Void_ has joined #archiveteam
[23:32] arkiver bzc6p_: I put it here http://paste.archivingyoursh.it/raw/kaviligefe
[23:32] maybe it should be fed into archivebot with an a <
[23:32] yep
[23:34] *** garyrh has joined #archiveteam
[23:37] Love the site name!
[23:38] Awww...
[23:38] Domain: sh.it
[23:38] Status: UNASSIGNABLE
[23:44] *** db48x has joined #archiveteam
[23:45] *** APerti has quit IRC ()
[23:53] *** mistym has quit IRC (Remote host closed the connection)