#archiveteam 2016-04-14,Thu


Time Nickname Message
00:09 🔗 Stiletto has joined #archiveteam
00:54 🔗 SketchCow https://archive.org/details/TV-ALJAZAM
01:09 🔗 SketchCow I'd like to inform you that there is a massive archive called EarthStation1 that has an amazing amount of historical content that apparently refuses to be archived by anyone else, with an extremely obstructive design that makes it difficult to download and a robots.txt that rivals any other I've seen before. This angers me, because this site looks straight out of 1996, so its stability is questionable at best. Please consider attempting to save this incredible site; I don't want to imagine the prolific amount of content being lost forever due to the ignorance of an old webmaster. I love your work, by the way. Keep doing what you're doing!
01:09 🔗 SketchCow http://www.earthstation1.com/
01:33 🔗 Gfy has quit IRC (Ping timeout: 250 seconds)
01:34 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
01:35 🔗 JesseW has joined #archiveteam
01:37 🔗 Gfy has joined #archiveteam
01:41 🔗 slpeeds has joined #archiveteam
01:48 🔗 fdo54ss has quit IRC (Ping timeout: 633 seconds)
01:51 🔗 DFJustin archivebot seems to be making short work of it
02:00 🔗 Honno has joined #archiveteam
02:12 🔗 bwn_ has joined #archiveteam
02:13 🔗 Coderjoe_ has quit IRC (Read error: Connection reset by peer)
02:18 🔗 gibigian1 has quit IRC (Remote host closed the connection)
02:22 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
02:26 🔗 bwn has quit IRC (Read error: Operation timed out)
02:33 🔗 Coderjoe has joined #archiveteam
02:36 🔗 Honno has quit IRC (Read error: Operation timed out)
02:50 🔗 SketchCow Well, Archivebot is the honey badger of crawlers.
03:15 🔗 JesseW has joined #archiveteam
03:19 🔗 brayden has joined #archiveteam
03:19 🔗 swebb sets mode: +o brayden
03:24 🔗 bwn has joined #archiveteam
03:24 🔗 bwn_ has quit IRC (Quit: Quit)
03:29 🔗 jspiros has quit IRC (Read error: Operation timed out)
03:29 🔗 arkhive1 has joined #archiveteam
03:29 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
03:30 🔗 SadDM has quit IRC (Read error: Operation timed out)
03:30 🔗 Gfy has quit IRC (Read error: Operation timed out)
03:30 🔗 SN4T14 has quit IRC (Read error: Operation timed out)
03:30 🔗 mr-b has quit IRC (Read error: Operation timed out)
03:30 🔗 chfoo- has quit IRC (Read error: Operation timed out)
03:30 🔗 remsen has quit IRC (Ping timeout: 246 seconds)
03:30 🔗 matthusby has quit IRC (Ping timeout: 246 seconds)
03:31 🔗 ErkDog has quit IRC (Ping timeout: 246 seconds)
03:31 🔗 Emcy has joined #archiveteam
03:31 🔗 Atom-- has quit IRC (Read error: Operation timed out)
03:32 🔗 wyatt8740 has joined #archiveteam
03:32 🔗 yakfish has quit IRC (Ping timeout: 246 seconds)
03:33 🔗 bwn_ has joined #archiveteam
03:35 🔗 bwn has quit IRC (Ping timeout: 492 seconds)
03:35 🔗 arkhive has quit IRC (Ping timeout: 492 seconds)
03:35 🔗 Gfy has joined #archiveteam
03:36 🔗 Emcy_ has quit IRC (Read error: Operation timed out)
03:37 🔗 remsen has joined #archiveteam
03:38 🔗 chfoo- has joined #archiveteam
03:38 🔗 SN4T14 has joined #archiveteam
03:38 🔗 ErkDog has joined #archiveteam
03:52 🔗 mr-b has joined #archiveteam
03:55 🔗 bwn_ is now known as bwn
04:11 🔗 SketchCow From Vinay, our in-house researcher on the Wayback machine:
04:11 🔗 SketchCow "about de-duplication, here's a little fun stat: I ran a quick job the other day to find that for the period 1995-Sept 2015, had we been de-duping Archiveteam's WARC data as it came in, we would have written:
04:11 🔗 SketchCow 11,628,333,091 Revisit records (11.62 Billion) & saved 136.16 TB of disk space.
04:22 🔗 JesseW ZOMG. *Only* 136 TB? In 20 years? OK, that *is* trivial. Wow. I would not have expected that.
04:23 🔗 JesseW wait, is that just Archiveteam's stuff? OK, that makes more sense.
04:23 🔗 JesseW If that was overall Wayback, I'd be amazed.
04:26 🔗 ariscop SketchCow, what was the size before deduplication
04:28 🔗 ariscop ~100kb/page sounds about right
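
A quick check of the figures quoted at 04:11: dividing the 136.16 TB saved by the 11,628,333,091 revisit records gives roughly 12 kB saved per record. If the "~100kb" at 04:28 is read as kilobits (an assumption; the unit is not stated in the log), that comes out close to the same number. The sketch below does the division for both decimal and binary terabytes, since the log does not say which was meant.

```python
# Figures quoted in the log at 04:11. Whether "TB" is decimal (10**12 bytes)
# or binary (2**40 bytes) is not stated, so both are computed.
REVISIT_RECORDS = 11_628_333_091
SAVED_TB = 136.16


def bytes_saved_per_record(saved_tb: float, records: int, binary: bool = False) -> float:
    """Average number of bytes saved per revisit record."""
    bytes_per_tb = 2**40 if binary else 10**12
    return saved_tb * bytes_per_tb / records


decimal_bytes = bytes_saved_per_record(SAVED_TB, REVISIT_RECORDS)
binary_bytes = bytes_saved_per_record(SAVED_TB, REVISIT_RECORDS, binary=True)

# Report each result in kilobytes and in kilobits (1 kilobit = 1000 bits).
for label, value in (("decimal TB", decimal_bytes), ("binary TB", binary_bytes)):
    print(f"{label}: {value / 1000:.1f} kB = {value * 8 / 1000:.0f} kilobits per record")
```
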
04:43 🔗 bwn_ has joined #archiveteam
04:57 🔗 bwn has quit IRC (Read error: Operation timed out)
05:00 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
05:05 🔗 Sk1d has joined #archiveteam
05:11 🔗 Honno has joined #archiveteam
05:27 🔗 fie__ has quit IRC (Read error: Connection reset by peer)
05:28 🔗 fie__ has joined #archiveteam
05:41 🔗 vitzli has joined #archiveteam
06:02 🔗 BlueMaxim has joined #archiveteam
06:18 🔗 godane has quit IRC (Quit: Leaving.)
06:21 🔗 godane has joined #archiveteam
06:28 🔗 WinterFox has joined #archiveteam
06:49 🔗 bwn_ has quit IRC (Quit: Quit)
06:50 🔗 bwn has joined #archiveteam
06:50 🔗 Honno has quit IRC (Read error: Operation timed out)
06:51 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
07:01 🔗 scyther has joined #archiveteam
07:08 🔗 MMovie2 has joined #archiveteam
07:09 🔗 MMovie has quit IRC (Read error: Operation timed out)
07:18 🔗 SketchCow Whoever told me about that site with all the bootleg recordings wins: https://archive.org/details/bottle_rockets_1995-03-18_Austin_TX
07:19 🔗 godane SketchCow: http://www.guitars101.com/forums/f145/
07:19 🔗 godane tons of bootlegs there
07:20 🔗 SketchCow Adorable. Too much work.
07:20 🔗 godane very true
07:20 🔗 SketchCow I'll let these collections co-agulate somewhere; they always do.
07:26 🔗 ariscop has quit IRC (Quit: Leaving)
07:27 🔗 godane i'm getting some Jimi Hendrix
07:29 🔗 godane looks like there are some Jimi Hendrix KPFA tapes
07:31 🔗 HCross I've just gotten a load of UK Public Information Films from 1945 - 2006
07:33 🔗 godane HCross: from here: http://www.nationalarchives.gov.uk/films/1945to1951/filmindex.htm
07:33 🔗 HCross Yeah
07:33 🔗 HCross YouTube-DL takes them like a dream
07:34 🔗 godane also they have fixed URLs in the HTML: http://www.nationalarchives.gov.uk/films/large-files/public-info-films/Your-Very-Good-Health.flv
07:35 🔗 schbirid has joined #archiveteam
08:44 🔗 vitzli has quit IRC (Quit: Leaving)
09:02 🔗 ariscop has joined #archiveteam
09:27 🔗 d_rebel_ has quit IRC (Read error: Connection reset by peer)
09:28 🔗 filippo__ has quit IRC (Read error: Connection reset by peer)
09:41 🔗 d_rebel_ has joined #archiveteam
09:43 🔗 arkhive1 has quit IRC (Read error: Connection reset by peer)
09:44 🔗 vitzli has joined #archiveteam
09:51 🔗 godane has quit IRC (Leaving.)
09:53 🔗 godane has joined #archiveteam
10:15 🔗 bwn has quit IRC (Read error: Operation timed out)
10:21 🔗 ohhdemgir has joined #archiveteam
10:31 🔗 bwn has joined #archiveteam
11:08 🔗 filippo__ has joined #archiveteam
11:28 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:10 🔗 phuzion SketchCow: Thoughts on archiving ninlive.com? It's basically the largest collection of bootlegs of Nine Inch Nails concerts in existence.
12:14 🔗 phuzion If I had to guess, the site is probably between 500GB and 1TB, so it's not like it's some stupidly huge thing
13:03 🔗 Medowar has joined #archiveteam
13:24 🔗 WinterFox has quit IRC (Remote host closed the connection)
13:28 🔗 balrog has quit IRC (Ping timeout: 260 seconds)
13:29 🔗 balrog has joined #archiveteam
13:29 🔗 swebb sets mode: +o balrog
13:32 🔗 MrRadar GameFront has announced they are officially closing on April 30: http://www.gamefront.com/gamefront-is-closing-down-april-30-2016/
13:37 🔗 MrRadar We had a warrior project working on it, what's the status on that?
13:41 🔗 pfallenop has quit IRC (Remote host closed the connection)
13:42 🔗 Ungstein has joined #archiveteam
13:42 🔗 arkiver oh awesome
13:42 🔗 arkiver we got most of it :D
13:42 🔗 arkiver I'll make sure we also have the newest files
13:46 🔗 gibigiana has joined #archiveteam
13:59 🔗 Honno has joined #archiveteam
14:07 🔗 Medowar Warrior Project is still running, only a few files missing. http://tracker.archiveteam.org/gamefront/ Sucks that they ban IPs so aggressively, so I can't really grab them fast with 2 servers.
14:08 🔗 MrRadar Looks like they're shutting down the FileFront forums as well: http://forums.filefront.com/announcements/461333-filefront-forums-closing-down-more-information-here.html
14:10 🔗 Jonimus gamefront going away will be a huge loss for the modding communities of many older games I fear.
14:10 🔗 phuzion 5.4 million posts, it might be doable with archivebot, but I'm not sure.
14:10 🔗 phuzion 400K threads
14:10 🔗 MrRadar It looks like the admin of the forums is trying to buy their database and re-launch it as an independent site
14:11 🔗 MrRadar And plans to post a full backup of the DB in public even if that can't be done
14:11 🔗 phuzion Nice.
14:11 🔗 phuzion Is the forum code something that's publicly available? vBulletin or something?
14:11 🔗 phuzion Yeah, vB
14:11 🔗 MrRadar Err, sorry, not a backup of the DB but a static rendered version of the site
14:11 🔗 phuzion Ah, ok, that makes better sense.
14:12 🔗 arkiver I'll have the forums saved also
14:12 🔗 phuzion arkiver: with the gamefront warrior grab or a different project?
14:12 🔗 MrRadar Yes, there's also a long-running Archivebot scrape of them that's going
14:12 🔗 arkiver probably with the gamefront project
14:12 🔗 arkiver since they're both gamefront
14:12 🔗 phuzion ok
14:14 🔗 arkiver example of a saved file https://web.archive.org/web/20151030203630/http://www.gamefront.com/files/20888016/Grand_Theft_Auto_IV_Mod___GTA_Ultimate_Vehicle_Pack_v5
14:14 🔗 arkiver all downloading works
14:14 🔗 Jonimus woot
14:22 🔗 MrRadar Apparently GameFront was owned by the same parent company as GameTrailers
14:23 🔗 MrRadar They also own The Escapist, should we look at archiving them next?
14:29 🔗 Ungstein1 has joined #archiveteam
14:30 🔗 Ungstein has quit IRC (Ping timeout: 260 seconds)
14:40 🔗 SketchCow http://fos.textfiles.com/ARCHIVETEAM/ just had an archivebot go by, so with that, the automatic pipeline is working.
14:41 🔗 SketchCow I haven't had to set off a packaging/upload of a project in archive team's set for two days.
14:42 🔗 phuzion Awesome.
14:43 🔗 SketchCow It's a hair slow, but that's because only one item is being done at any given time. I used to do two or three at once because I wouldn't do it for a few days and it'd bunch up - hopefully a relentless, no-pause pipeline will keep it relatively clear. The disk's at 21% full right now, and some of that is just because there's lingering rsync junk that will go away after we all verify all the stuff is
14:43 🔗 SketchCow done.
14:46 🔗 ohhdemgir has quit IRC (Read error: Operation timed out)
15:08 🔗 Honno has quit IRC (Read error: Operation timed out)
15:31 🔗 scyther has quit IRC (Quit: Leaving)
15:38 🔗 SketchCow Does anyone have a binary newsgroup?
15:38 🔗 SketchCow Sorry, binary newsgroup access/downloads
15:38 🔗 phuzion Looking for something in particular?
15:38 🔗 SketchCow Twilight CDs not already up on archive.org.
15:39 🔗 phuzion Twilight the movie? And the soundtrack for it?
15:39 🔗 SketchCow Ask yourself an important question.
15:39 🔗 phuzion I know, that's probably not what you were looking for.
15:40 🔗 phuzion https://en.wikipedia.org/wiki/Twilight_%28CD-ROM%29 This?
15:40 🔗 metalcamp has joined #archiveteam
15:46 🔗 schbirid SketchCow: which ones do you need?
15:47 🔗 schbirid i can get 1-89
16:03 🔗 Yoshimura has joined #archiveteam
16:20 🔗 mismatch has joined #archiveteam
16:22 🔗 zino SketchCow: I have paid personal access to binary newsgroups. So I can look stuff up if you want, but it would not be a working way to get large datasets.
16:22 🔗 zino There's a download pot involved.
16:23 🔗 * zino goes looking for those Twilight things.
16:27 🔗 zino Why did I do this. So much vampires and porn...
16:27 🔗 MrRadar LOL
16:30 🔗 Honno has joined #archiveteam
16:38 🔗 zino Doesn't look like there are any CD releases in the binary archive I have access to. Just a bunch of the DVDs.
16:38 🔗 JesseW has joined #archiveteam
16:55 🔗 Ungstein has joined #archiveteam
16:57 🔗 Ungstein1 has quit IRC (Ping timeout: 260 seconds)
17:09 🔗 Ungstein has quit IRC (Quit: Leaving.)
17:11 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
17:24 🔗 Frogging I'm trying to run the GameFront grab scripts, but it's getting 404 errors on files that should be downloadable :s
17:24 🔗 arkiver yeah
17:24 🔗 arkiver this is currently a requeue of problematic files
17:25 🔗 arkiver I'll do a batch of the newest files
17:25 🔗 arkiver and when those are finished, I'll update the scripts to make them not abort on a problematic item
17:27 🔗 Frogging 22=404 http://media1.gamefront.com/worthplaying/nascarkartracing/NASCARKartRacing_Trailer.zip?b17f4b620c6cf1393ffa644e1feea1514087226f4d774dadf9bf09d9ca2a22062b861d319cd0784faf211762412aa108cda94c55d50ed9a5a7536bf117a2d30a9777d31ed3f03ba62be64053a04bb70d9c998edaffb0126c66473acd449b704571eaf8fa220bcd19801084f1d333b461bff8e6fefbce58debe5d21be7730da
17:27 🔗 Frogging that file does actually work
17:30 🔗 arkiver Yes
17:30 🔗 arkiver that's why the scripts currently abort
17:30 🔗 arkiver When their servers are busy we have to wait a bit longer before the download URL is active
17:30 🔗 arkiver if it's not active yet, it's a 404
17:31 🔗 arkiver in a normal case wget would continue the grab, but for this ^ reason I let the grab abort when it happens
17:31 🔗 Ungstein has joined #archiveteam
17:34 🔗 schbirid has quit IRC (Quit: Leaving)
17:35 🔗 Frogging arkiver: does that result in them being downloaded later though?
17:36 🔗 arkiver if the item is requeued and the site is less busy, then it will get the file
17:36 🔗 arkiver I'll add a higher waiting time before trying the URL
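
The idea arkiver describes between 17:30 and 17:36 (a download URL that returns 404 until the server has activated it, so the client should wait longer before trying) can be sketched as follows. This is not the GameFront grab script, which is not shown in the log; `fetch_with_wait`, its parameters, and the doubling of the wait after each 404 are all assumptions made for illustration.

```python
import time
from typing import Callable


def fetch_with_wait(
    fetch: Callable[[str], int],
    url: str,
    initial_wait: float = 5.0,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Wait, then request the URL; on a 404, double the wait and try again.

    `fetch` is any callable that returns an HTTP status code for a URL.
    Returns the last status seen (still 404 if every attempt failed, in
    which case the caller would requeue the item).
    """
    wait = initial_wait
    status = 404
    for _ in range(max_attempts):
        sleep(wait)  # give the server time to activate the download URL
        status = fetch(url)
        if status != 404:
            break
        wait *= 2  # hypothetical policy: back off further after each 404
    return status


# Demonstration with a fake server that needs two tries before the URL works.
# The waits are recorded instead of actually sleeping.
responses = iter([404, 404, 200])
waits = []
result = fetch_with_wait(
    lambda url: next(responses),
    "http://example.invalid/file.zip",
    sleep=waits.append,
)
print(result, waits)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a real grab would use the default `time.sleep`.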
17:46 🔗 FalconK hey, has anyone else seen wpull hang on epoll (probably waiting for a task from a queue, or waiting for a bunch of closed sockets)?
17:53 🔗 * yipdw_ has
17:53 🔗 yipdw_ well, wait a second
17:53 🔗 yipdw_ wpull without any scripts?
17:55 🔗 FalconK the command line is long and noisy but let me check
17:55 🔗 FalconK I don't see any good way of forcing it to continue but I'd rather not lose whatever it got
17:55 🔗 FalconK or leave the stuck unclosed job forever
17:56 🔗 xmc connect with gdb and force it to return? :P
17:56 🔗 yipdw_ yeah, that can help
18:00 🔗 mek_ has quit IRC (Ping timeout: 250 seconds)
18:07 🔗 mek_ has joined #archiveteam
18:08 🔗 FalconK yeah, how the hell do you force it to return from a system call?
18:09 🔗 xmc hm
18:10 🔗 xmc connect with gdb, kill it with some signal to interrupt the syscall, catch the signal with gdb and don't pass it to the process's default handler?
18:10 🔗 FalconK yeah it'd be something like that
18:12 🔗 alfie has quit IRC (Read error: Operation timed out)
18:13 🔗 Stiletto has quit IRC ()
18:14 🔗 FalconK by the way, yes, no script
18:15 🔗 yipdw_ oh ok
18:15 🔗 yipdw_ I've seen wpull hangs in archivebot, but those ended up being the archivebot script's fault
18:16 🔗 yipdw_ hangs on epoll that is
18:21 🔗 FalconK yeah I wasn't running this in a debug interpreter so I have no idea how it arose
18:22 🔗 alfie has joined #archiveteam
18:23 🔗 Froggypwn has quit IRC (Read error: Operation timed out)
18:28 🔗 Medowar has quit IRC (Quit: Connection closed for inactivity)
18:34 🔗 Ungstein has quit IRC (Quit: Leaving.)
18:37 🔗 Honno has quit IRC (Read error: Operation timed out)
18:45 🔗 Stiletto has joined #archiveteam
18:46 🔗 Stiletto has quit IRC (Client Quit)
18:50 🔗 VADemon has joined #archiveteam
18:57 🔗 FalconK hmm, so it turned out the right solution was just to murder wpull with kill -9
18:57 🔗 FalconK job completed; uploading (looks like, anyway)
19:01 🔗 Stiletto has joined #archiveteam
19:02 🔗 Stiletto has quit IRC (Client Quit)
19:04 🔗 scyther has joined #archiveteam
19:04 🔗 vitzli has quit IRC (Read error: Operation timed out)
19:09 🔗 zshen has joined #archiveteam
19:11 🔗 Stiletto has joined #archiveteam
19:17 🔗 atomotic has joined #archiveteam
19:20 🔗 bwn has quit IRC (Ping timeout: 250 seconds)
19:28 🔗 zino arkiver: Last upload for GameTrailers is currently running. Time to figure out what to do with the 30 remaining unchunked archives.
19:31 🔗 Honno has joined #archiveteam
19:32 🔗 zshen has quit IRC (Quit: leaving)
19:44 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
19:46 🔗 mek_ has quit IRC (Read error: Operation timed out)
19:50 🔗 bwn has joined #archiveteam
19:52 🔗 arkiver zino: nice!
19:53 🔗 arkiver yipdw: how do you normally handle the remaining items that are too small in total to create a megaWARC?
19:53 🔗 arkiver create a dir by hand?
19:53 🔗 JW_work why is there a *minimum* size of a megaWARC?
19:54 🔗 JW_work or are all megaWARCs supposed to be the same size?
19:54 🔗 arkiver For example we set a megaWARC to be 40 GB, it will then move WARCs to a folder until that dir is more than 40 GB
19:54 🔗 arkiver then that dir will become a megaWARC
19:55 🔗 JW_work ah ok — so they are all intended to be the same size. Makes sense.
19:55 🔗 xmc that size is chosen because it's currently near-optimal for archive.org's infrastructure and the speed of internet connections today
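
The packing rule described at 19:54 (move WARCs into a directory until it exceeds the target size, then turn that directory into a megaWARC) also explains the leftover problem raised at 19:53. A minimal sketch, assuming sizes are given in GB and using the 40 GB figure from the example; `pack_batches` is a hypothetical helper, not part of the actual megawarc tooling.

```python
from typing import List, Tuple


def pack_batches(
    sizes_gb: List[float], target_gb: float = 40.0
) -> Tuple[List[List[float]], List[float]]:
    """Group WARC sizes into batches that each exceed `target_gb`.

    Returns (complete_batches, leftover). The leftover list holds the WARCs
    that never reached the threshold and so would need to be packed by hand.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    total = 0.0
    for size in sizes_gb:
        current.append(size)
        total += size
        if total > target_gb:  # directory is now over the target: close it
            batches.append(current)
            current, total = [], 0.0
    return batches, current


full, leftover = pack_batches([15.0, 15.0, 15.0, 30.0, 12.0, 5.0])
print("megaWARC batches:", full)
print("left over, below threshold:", leftover)
```

With no end-of-job signal, the final partial batch stays in `leftover` indefinitely, which is the situation the remaining unchunked archives are in.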
20:02 🔗 Stiletto has quit IRC ()
20:04 🔗 mek_ has joined #archiveteam
20:04 🔗 yipdw_ arkiver: I move them to the packing queue
20:04 🔗 yipdw_ at which point the packer picks them up and goes forward
20:05 🔗 yipdw_ it would be good to have a script that does that; AFAIK that's something the megawarc admin needs to run, since we have no other end-of-job signal
20:11 🔗 Stiletto has joined #archiveteam
20:19 🔗 zino yipdw_: Thanks, I'll get on that for the last files then.
20:38 🔗 Stiletto has quit IRC (Read error: Operation timed out)
20:39 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
20:46 🔗 MMovie has joined #archiveteam
20:47 🔗 MMovie2 has quit IRC (Read error: Operation timed out)
20:58 🔗 ariscop has quit IRC (Leaving)
21:00 🔗 dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
21:02 🔗 dashcloud has joined #archiveteam
21:08 🔗 VADemon has quit IRC (Quit: left4dead)
21:14 🔗 Honno has quit IRC (Read error: Operation timed out)
21:18 🔗 mek_ has quit IRC (Ping timeout: 250 seconds)
21:18 🔗 Stiletto has joined #archiveteam
21:26 🔗 scyther has quit IRC (Read error: Connection reset by peer)
21:36 🔗 ariscop has joined #archiveteam
22:33 🔗 BlueMaxim has joined #archiveteam
22:34 🔗 szalwia has quit IRC (Ping timeout: 260 seconds)
22:35 🔗 SirCmpwn has quit IRC (Ping timeout: 260 seconds)
22:39 🔗 SirCmpwn has joined #archiveteam
23:03 🔗 szalwia has joined #archiveteam
23:41 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
23:43 🔗 RichardG has joined #archiveteam
