[00:09] *** Stiletto has joined #archiveteam [00:54] https://archive.org/details/TV-ALJAZAM [01:09] I'd like to inform you that there is a massive archive called EarthStation1 that has an amazing amount of historical content that apparently refuses to be archived by anyone else, with an extremely obstructive design that makes it difficult to download and a robots.txt that rivals any other I've seen before. This angers me, because this site looks straight out of 1996, so its stability is questionable at best. Please consider attempting to save this incredible site; I don't want to imagine the prolific amount of content being lost forever due to the ignorance of an old webmaster. I love your work, by the way. Keep doing what you're doing! [01:09] http://www.earthstation1.com/ [01:33] *** Gfy has quit IRC (Ping timeout: 250 seconds) [01:34] *** BlueMaxim has quit IRC (Read error: Operation timed out) [01:35] *** JesseW has joined #archiveteam [01:37] *** Gfy has joined #archiveteam [01:41] *** slpeeds has joined #archiveteam [01:48] *** fdo54ss has quit IRC (Ping timeout: 633 seconds) [01:51] archivebot seems to be making short work of it [02:00] *** Honno has joined #archiveteam [02:12] *** bwn_ has joined #archiveteam [02:13] *** Coderjoe_ has quit IRC (Read error: Connection reset by peer) [02:18] *** gibigian1 has quit IRC (Remote host closed the connection) [02:22] *** JesseW has quit IRC (Ping timeout: 370 seconds) [02:26] *** bwn has quit IRC (Read error: Operation timed out) [02:33] *** Coderjoe has joined #archiveteam [02:36] *** Honno has quit IRC (Read error: Operation timed out) [02:50] Well, Archivebot is the honey badger of crawlers. 
[03:15] *** JesseW has joined #archiveteam [03:19] *** brayden has joined #archiveteam [03:19] *** swebb sets mode: +o brayden [03:24] *** bwn has joined #archiveteam [03:24] *** bwn_ has quit IRC (Quit: Quit) [03:29] *** jspiros has quit IRC (Read error: Operation timed out) [03:29] *** arkhive1 has joined #archiveteam [03:29] *** wyatt8740 has quit IRC (Read error: Operation timed out) [03:30] *** SadDM has quit IRC (Read error: Operation timed out) [03:30] *** Gfy has quit IRC (Read error: Operation timed out) [03:30] *** SN4T14 has quit IRC (Read error: Operation timed out) [03:30] *** mr-b has quit IRC (Read error: Operation timed out) [03:30] *** chfoo- has quit IRC (Read error: Operation timed out) [03:30] *** remsen has quit IRC (Ping timeout: 246 seconds) [03:30] *** matthusby has quit IRC (Ping timeout: 246 seconds) [03:31] *** ErkDog has quit IRC (Ping timeout: 246 seconds) [03:31] *** Emcy has joined #archiveteam [03:31] *** Atom-- has quit IRC (Read error: Operation timed out) [03:32] *** wyatt8740 has joined #archiveteam [03:32] *** yakfish has quit IRC (Ping timeout: 246 seconds) [03:33] *** bwn_ has joined #archiveteam [03:35] *** bwn has quit IRC (Ping timeout: 492 seconds) [03:35] *** arkhive has quit IRC (Ping timeout: 492 seconds) [03:35] *** Gfy has joined #archiveteam [03:36] *** Emcy_ has quit IRC (Read error: Operation timed out) [03:37] *** remsen has joined #archiveteam [03:38] *** chfoo- has joined #archiveteam [03:38] *** SN4T14 has joined #archiveteam [03:38] *** ErkDog has joined #archiveteam [03:52] *** mr-b has joined #archiveteam [03:55] *** bwn_ is now known as bwn [04:11] From Vinay, our in-house researcher on the Wayback machine: [04:11] "about de-duplication, here's a little fun stat: I ran a quick job the other day to find that for the period 1995-Sept 2015, had we been de-duping Archiveteam's WARC data as it came in, we would have written: [04:11] 11,628,333,091 Revisit records (11.62 Billion) & saved 136.16 TB of disk space. 
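To put the quoted de-duplication figures in proportion, the average saving per revisit record can be worked out directly. This is only a back-of-the-envelope sketch: it assumes "TB" means decimal terabytes (10^12 bytes), which the log does not state, and the function name is made up for illustration.

```python
# Figures quoted in the log above.
REVISIT_RECORDS = 11_628_333_091
SAVED_TB = 136.16


def avg_saved_bytes(records: int, saved_tb: float, tb_bytes: int = 10**12) -> float:
    """Average number of bytes saved per revisit record.

    tb_bytes is an assumption: 10**12 for decimal TB, 2**40 for TiB.
    """
    return saved_tb * tb_bytes / records


avg = avg_saved_bytes(REVISIT_RECORDS, SAVED_TB)
print(f"{avg / 1000:.1f} kB saved per revisit record")
```

Under the decimal-TB assumption this comes out at roughly 11.7 kB per revisit record; reading the figure as TiB instead would raise it by about 10%.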
[04:22] ZOMG. *Only* 136 TB? In 20 years? OK, that *is* trivial. Wow. I would not have expected that. [04:23] wait, is that just Archiveteam's stuff? OK, that makes more sense. [04:23] If that was overall Wayback, I'd be amazed. [04:26] SketchCow, what was the size before deduplication [04:28] ~100kb/page sounds about right [04:43] *** bwn_ has joined #archiveteam [04:57] *** bwn has quit IRC (Read error: Operation timed out) [05:00] *** Sk1d has quit IRC (Ping timeout: 194 seconds) [05:05] *** Sk1d has joined #archiveteam [05:11] *** Honno has joined #archiveteam [05:27] *** fie__ has quit IRC (Read error: Connection reset by peer) [05:28] *** fie__ has joined #archiveteam [05:41] *** vitzli has joined #archiveteam [06:02] *** BlueMaxim has joined #archiveteam [06:18] *** godane has quit IRC (Quit: Leaving.) [06:21] *** godane has joined #archiveteam [06:28] *** WinterFox has joined #archiveteam [06:49] *** bwn_ has quit IRC (Quit: Quit) [06:50] *** bwn has joined #archiveteam [06:50] *** Honno has quit IRC (Read error: Operation timed out) [06:51] *** JesseW has quit IRC (Ping timeout: 370 seconds) [07:01] *** scyther has joined #archiveteam [07:08] *** MMovie2 has joined #archiveteam [07:09] *** MMovie has quit IRC (Read error: Operation timed out) [07:18] Whoever told me about that site with all the bootleg recordings wins: https://archive.org/details/bottle_rockets_1995-03-18_Austin_TX [07:19] SketchCow: http://www.guitars101.com/forums/f145/ [07:19] tons of bootlegs there [07:20] Adorable. Too much work. [07:20] very true [07:20] I'll let these collections co-agulate somewhere; they always do. 
[07:26] *** ariscop has quit IRC (Quit: Leaving) [07:27] i'm getting some Jimi Hendrix [07:29] looks like there are some Jimi Hendrix KPFA tapes [07:31] I've just gotten a load of UK Public Information Films from 1945 - 2006 [07:33] HCross: from here: http://www.nationalarchives.gov.uk/films/1945to1951/filmindex.htm [07:33] Yeah [07:33] YouTube-DL takes them like a dream [07:34] also they have fixed urls in html: http://www.nationalarchives.gov.uk/films/large-files/public-info-films/Your-Very-Good-Health.flv [07:35] *** schbirid has joined #archiveteam [08:44] *** vitzli has quit IRC (Quit: Leaving) [09:02] *** ariscop has joined #archiveteam [09:27] *** d_rebel_ has quit IRC (Read error: Connection reset by peer) [09:28] *** filippo__ has quit IRC (Read error: Connection reset by peer) [09:41] *** d_rebel_ has joined #archiveteam [09:43] *** arkhive1 has quit IRC (Read error: Connection reset by peer) [09:44] *** vitzli has joined #archiveteam [09:51] *** godane has quit IRC (Leaving.) [09:53] *** godane has joined #archiveteam [10:15] *** bwn has quit IRC (Read error: Operation timed out) [10:21] *** ohhdemgir has joined #archiveteam [10:31] *** bwn has joined #archiveteam [11:08] *** filippo__ has joined #archiveteam [11:28] *** BlueMaxim has quit IRC (Quit: Leaving) [12:10] SketchCow: Thoughts on archiving ninlive.com? It's basically the largest collection of bootlegs of Nine Inch Nails concerts in existence. 
[12:14] If I had to guess, the site is probably between 500GB and 1TB, so it's not like it's some stupidly huge thing [13:03] *** Medowar has joined #archiveteam [13:24] *** WinterFox has quit IRC (Remote host closed the connection) [13:28] *** balrog has quit IRC (Ping timeout: 260 seconds) [13:29] *** balrog has joined #archiveteam [13:29] *** swebb sets mode: +o balrog [13:32] GameFront has announced they are officially closing on April 30: http://www.gamefront.com/gamefront-is-closing-down-april-30-2016/ [13:37] We had a warrior project working on it, what's the status on that? [13:41] *** pfallenop has quit IRC (Remote host closed the connection) [13:42] *** Ungstein has joined #archiveteam [13:42] oh awesome [13:42] we got most of it :D [13:42] I'll make sure we also have the newest files [13:46] *** gibigiana has joined #archiveteam [13:59] *** Honno has joined #archiveteam [14:07] Warrior Project is still running, only a few files missing. http://tracker.archiveteam.org/gamefront/ Sucks that they ban IPs so aggressively, so I can't really grab them fast with 2 servers. [14:08] Looks like they're shutting down the FileFront forums as well: http://forums.filefront.com/announcements/461333-filefront-forums-closing-down-more-information-here.html [14:10] gamefront going away will be a huge loss for the modding communities of many older games I fear. [14:10] 5.4 million posts, it might be doable with archivebot, but I'm not sure. [14:10] 400K threads [14:10] It looks like the admin of the forums is trying to buy their database and re-launch it as an independent site [14:11] And plans to post a full backup of the DB in public even if that can't be done [14:11] Nice. [14:11] Is the forum code something that's publicly available? vBulletin or something? [14:11] Yeah, vB [14:11] Err, sorry, not a backup of the DB but a static rendered version of the site [14:11] Ah, ok, that makes better sense. 
[14:12] I'll have the forums saved also [14:12] arkiver: with the gamefront warrior grab or a different project? [14:12] Yes, there's also a long-running Archivebot scrape of them that's going [14:12] probably with the gamefront project [14:12] since they're both gamefront [14:12] ok [14:14] example of a saved file https://web.archive.org/web/20151030203630/http://www.gamefront.com/files/20888016/Grand_Theft_Auto_IV_Mod___GTA_Ultimate_Vehicle_Pack_v5 [14:14] all downloading works [14:14] woot [14:22] Apparently GameFront was owned by the same parent company as GameTrailers [14:23] They also own The Escapist, should we look at archiving them next? [14:29] *** Ungstein1 has joined #archiveteam [14:30] *** Ungstein has quit IRC (Ping timeout: 260 seconds) [14:40] http://fos.textfiles.com/ARCHIVETEAM/ just had an archivebot go by, so with that, the automatic pipeline is working. [14:41] I haven't had to set off a packaging/upload of a project in archive team's set for two days. [14:42] Awesome. [14:43] It's a hair slow, but that's because only one item is being done at any given time. I used to do two or three at once because I wouldn't do it for a few days and it'd bunch up - hopefully a relentless, no-pause pipeline will keep it relatively clear. The disk's at 21% full right now, and some of that is just because there's lingering rsync junk that will go away after we all verify all the stuff is [14:43] done. [14:46] *** ohhdemgir has quit IRC (Read error: Operation timed out) [15:08] *** Honno has quit IRC (Read error: Operation timed out) [15:31] *** scyther has quit IRC (Quit: Leaving) [15:38] Does anyone have a binary newsgroup? [15:38] Sorry, binary newsgroup access/downloads [15:38] Looking for something in particular? [15:38] Twilight CDs not already up on archive.org. [15:39] Twilight the movie? And the soundtrack for it? [15:39] Ask yourself an important question. [15:39] I know, that's probably not what you were looking for. 
[15:40] https://en.wikipedia.org/wiki/Twilight_%28CD-ROM%29 This? [15:40] *** metalcamp has joined #archiveteam [15:46] SketchCow: which ones do you need? [15:47] i can get 1-89 [16:03] *** Yoshimura has joined #archiveteam [16:20] *** mismatch has joined #archiveteam [16:22] SketchCow: I have paid personal access to binary newsgroups. So I can look stuff up if you want, but it would not be a working way to get large datasets. [16:22] There's a download pot involved. [16:23] * zino goes looking for those Twilight things. [16:27] Why did I do this. So much vampires and porn... [16:27] LOL [16:30] *** Honno has joined #archiveteam [16:38] Doesn't look like there are any CD releases in the binary archive I have access to. Just a bunch of the DVDs. [16:38] *** JesseW has joined #archiveteam [16:55] *** Ungstein has joined #archiveteam [16:57] *** Ungstein1 has quit IRC (Ping timeout: 260 seconds) [17:09] *** Ungstein has quit IRC (Quit: Leaving.) [17:11] *** JesseW has quit IRC (Ping timeout: 370 seconds) [17:24] I'm trying to run the GameFront grab scripts, but it's getting 404 errors on files that should be downloadable :s [17:24] yeah [17:24] this is currently a requeue of problematic files [17:25] I'll do a batch of the newest files [17:25] and when those are finished, I'll update the scripts to make them not abort on a problematic item [17:27] 22=404 http://media1.gamefront.com/worthplaying/nascarkartracing/NASCARKartRacing_Trailer.zip?b17f4b620c6cf1393ffa644e1feea1514087226f4d774dadf9bf09d9ca2a22062b861d319cd0784faf211762412aa108cda94c55d50ed9a5a7536bf117a2d30a9777d31ed3f03ba62be64053a04bb70d9c998edaffb0126c66473acd449b704571eaf8fa220bcd19801084f1d333b461bff8e6fefbce58debe5d21be7730da [17:27] that file does actually work [17:30] Yes [17:30] that's why the scripts currently abort [17:30] When their servers are busy we have to wait a bit longer before the download URL is active [17:30] if it's not active yet, it's a 404 [17:31] in a normal case wget would 
continue the grab, but for this ^ reason I let the grab abort when it happens [17:31] *** Ungstein has joined #archiveteam [17:34] *** schbirid has quit IRC (Quit: Leaving) [17:35] arkiver: does that result in them being downloaded later though? [17:36] if the item is requeued and the site is less busy, then it will get the file [17:36] I'll add a higher waiting time before trying the URL [17:46] hey, has anyone else seen wpull hang on epoll (probably waiting for a task from a queue, or waiting for a bunch of closed sockets)? [17:53] * yipdw_ has [17:53] well, wait a second [17:53] wpull without any scripts? [17:55] the command line is long and noisy but let me check [17:55] I don't see any good way of forcing it to continue but I'd rather not lose whatever it got [17:55] or leave the stuck unclosed job forever [17:56] connect with gdb and force it to return? :P [17:56] yeah, that can help [18:00] *** mek_ has quit IRC (Ping timeout: 250 seconds) [18:07] *** mek_ has joined #archiveteam [18:08] yeah, how the hell do you force it to return from a system call? [18:09] hm [18:10] connect with gdb, kill it with some signal to interrupt the syscall, catch the signal with gdb and don't pass it to the process's default handler? [18:10] yeah it'd be something like that [18:12] *** alfie has quit IRC (Read error: Operation timed out) [18:13] *** Stiletto has quit IRC () [18:14] by the way, yes, no script [18:15] oh ok [18:15] I've seen wpull hangs in archivebot, but those ended up being the archivebot script's fault [18:16] hangs on epoll that is [18:21] yeah I wasn't running this in a debug interpreter so I have no idea how it arose [18:22] *** alfie has joined #archiveteam [18:23] *** Froggypwn has quit IRC (Read error: Operation timed out) [18:28] *** Medowar has quit IRC (Quit: Connection closed for inactivity) [18:34] *** Ungstein has quit IRC (Quit: Leaving.) 
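The 404 behaviour described above (the download URL only becomes active after a delay, so an early request fails and the item has to be requeued) can be sketched as a wait-and-retry loop. This is a hypothetical illustration, not the actual GameFront grab script: `fetch_when_active`, its parameters, and the wait values are all assumptions, and the `fetch` callable stands in for whatever performs the real HTTP request.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def fetch_when_active(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    waits: Iterable[float] = (5, 15, 45),  # assumed back-off schedule, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Request url; on a 404, wait and try again with a longer pause each time.

    Returns the body on a 200 response, or None so the caller can requeue
    the item instead of recording the file as missing.
    """
    status, body = fetch(url)
    for wait in waits:
        if status != 404:
            break
        sleep(wait)  # give the server time to activate the download URL
        status, body = fetch(url)
    return body if status == 200 else None
```

Returning None rather than raising keeps the "abort and requeue" behaviour mentioned in the log: a URL that is still inactive after the last wait is simply handed back for a later attempt.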
[18:37] *** Honno has quit IRC (Read error: Operation timed out) [18:45] *** Stiletto has joined #archiveteam [18:46] *** Stiletto has quit IRC (Client Quit) [18:50] *** VADemon has joined #archiveteam [18:57] hmm, so it turned out the right solution was just to murder wpull with kill -9 [18:57] job completed; uploading (looks like, anyway) [19:01] *** Stiletto has joined #archiveteam [19:02] *** Stiletto has quit IRC (Client Quit) [19:04] *** scyther has joined #archiveteam [19:04] *** vitzli has quit IRC (Read error: Operation timed out) [19:09] *** zshen has joined #archiveteam [19:11] *** Stiletto has joined #archiveteam [19:17] *** atomotic has joined #archiveteam [19:20] *** bwn has quit IRC (Ping timeout: 250 seconds) [19:28] arkiver: Last upload for GameTrailers is currently running. Time to figure out what to do with the 30 remaining unchunked archives. [19:31] *** Honno has joined #archiveteam [19:32] *** zshen has quit IRC (Quit: leaving) [19:44] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [19:46] *** mek_ has quit IRC (Read error: Operation timed out) [19:50] *** bwn has joined #archiveteam [19:52] zino: nice! [19:53] yipdw: how do you normally handle the remaining items that are too small in total to create a megaWARC? [19:53] create a dir by hand? [19:53] why is there a *minimum* size of a megaWARC? [19:54] or are all megaWARCs supposed to be the same size? [19:54] For example we set a megaWARC to be 40 GB, it will then move WARCs to a folder until that dir is more than 40 GB [19:54] then that dir will become a megaWARC [19:55] ah ok, so they are all intended to be the same size. Makes sense. 
[19:55] that size is chosen because it's currently near-optimal for archive.org's infrastructure and the speed of internet connections today [20:02] *** Stiletto has quit IRC () [20:04] *** mek_ has joined #archiveteam [20:04] arkiver: I move them to the packing queue [20:04] at which point the packer picks them up and goes forward [20:05] it would be good to have a script that does that; AFAIK that's something the megawarc admin needs to run, since we have no other end-of-job signal [20:11] *** Stiletto has joined #archiveteam [20:19] yipdw_: Thanks, I'll get on that for the last files then. [20:38] *** Stiletto has quit IRC (Read error: Operation timed out) [20:39] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [20:46] *** MMovie has joined #archiveteam [20:47] *** MMovie2 has quit IRC (Read error: Operation timed out) [20:58] *** ariscop has quit IRC (Leaving) [21:00] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) [21:02] *** dashcloud has joined #archiveteam [21:08] *** VADemon has quit IRC (Quit: left4dead) [21:14] *** Honno has quit IRC (Read error: Operation timed out) [21:18] *** mek_ has quit IRC (Ping timeout: 250 seconds) [21:18] *** Stiletto has joined #archiveteam [21:26] *** scyther has quit IRC (Read error: Connection reset by peer) [21:36] *** ariscop has joined #archiveteam [22:33] *** BlueMaxim has joined #archiveteam [22:34] *** szalwia has quit IRC (Ping timeout: 260 seconds) [22:35] *** SirCmpwn has quit IRC (Ping timeout: 260 seconds) [22:39] *** SirCmpwn has joined #archiveteam [23:03] *** szalwia has joined #archiveteam [23:41] *** RichardG has quit IRC (Read error: Connection reset by peer) [23:43] *** RichardG has joined #archiveteam
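The megaWARC batching rule described above (keep moving WARCs into a folder until it exceeds the target size, then pack that folder) also explains why a tail of small files is left over at the end of a job. A minimal sketch follows, assuming each WARC is represented as a (name, size) pair; `pack_batches` is a made-up name and this is not the real megawarc packer.

```python
from typing import Iterable, List, Tuple


def pack_batches(
    warcs: Iterable[Tuple[str, int]], target_bytes: int
) -> Tuple[List[List[str]], List[str]]:
    """Group WARCs into batches that each exceed target_bytes.

    Returns (full_batches, leftover). The leftover list never crossed the
    threshold, so it would have to be pushed to the packing queue by hand.
    """
    full: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in warcs:
        current.append(name)
        current_size += size
        if current_size > target_bytes:  # "until that dir is more than 40 GB"
            full.append(current)
            current, current_size = [], 0
    return full, current


# Toy sizes in GB rather than bytes, with the 40 GB target from the log.
batches, leftover = pack_batches(
    [("a.warc.gz", 30), ("b.warc.gz", 15), ("c.warc.gz", 10), ("d.warc.gz", 40), ("e.warc.gz", 5)],
    target_bytes=40,
)
print(batches)
print(leftover)
```

In this toy run the first four files form two full batches and `e.warc.gz` is left over, which is the situation the "move them to the packing queue" step handles.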