[00:00] *** laufwerkf has joined #archiveteam
[00:17] *** tfgbd_znc has quit IRC (Ping timeout: 633 seconds)
[00:29] *** kristian_ has quit IRC (Leaving)
[00:49] *** JesseW has joined #archiveteam
[00:58] *** DoomTay has joined #archiveteam
[01:26] *** ZeoNet has quit IRC (Read error: Operation timed out)
[01:52] *** BlueMaxim has joined #archiveteam
[02:24] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[02:27] *** Zialus is now known as RMF|away
[02:31] so...
[02:32] how do I help with archiving stuff? the Warrior?
[02:52] that's probably the easiest way
[02:55] I just ran it as a docker container in a VPS
[02:56] took me a while to get the web UI to work (port forwarding and all)
[02:56] I set it to "archiveteam's choice" and it seems its choice was urlteam, which is frequently giving "no tasks available" :o
[02:58] the web UI is super slick
[03:07] *** laufwerkf has quit IRC ()
[03:09] what project does have tasks?
[03:09] I have a *lot* of bandwidth and I'd like to use it :P
[03:15] *** JesseW has joined #archiveteam
[03:28] Doesn't the warrior interface display all possible tasks?
[03:32] it shows all possible projects, and most of the ones I try say they have no tasks from the tracker
[03:32] nicolas17: How about Orkut? That's going kaput next month.
[03:33] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[03:34] the throttling and project switching is pretty annoying... like if I switch to orkut, it doesn't start any new orkut task because there are too many concurrent tasks from another project running already... but all those tasks are sleeping! ("No items available currently. Trying again in 120 seconds")
[03:35] nicolas17: they will eventually time out, and it will load new tasks from orkut
[03:36] there certainly are various things that could be improved about the warrior, though
[03:36] more like keep sleeping because of heavy tracker rate limiting on the orkut project :P
[03:36] when I looked into it, I got stuck trying to set up a testing environment
[03:36] nicolas17: sure, but at least they'll be waiting on orkut
[03:37] also, if you can, URLteam can always use people investigating shorteners -- then I can add more to the tracker, and there will be more work to do
[03:38] I'm on a gigabit pipe doing pretty brief 50KB/s bursts and then sleeping, it's a bit frustrating
[03:40] *** tomwsmf has quit IRC (Read error: Operation timed out)
[03:42] What's your ISP?
[03:42] I'm running it on a VPS
[03:44] maybe I should run an archivebot node instead? :P
[03:44] nicolas17: Putting another call out. We really could do with a few more newsbuddy grabbers. If anyone has a fast, stable connection and is willing to help, just come into #newsgrabber and let myself or arkiver know please
[03:44] * nicolas17 still reading the wiki
[03:44] nicolas17: no new archivebot nodes for now
[03:45] oki
[03:53] will the orkut grab finish in time with the current rate limiting? it seems like adding more warriors would make no difference
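[Editor's note: the Warrior-in-Docker setup described at 02:55-02:56 mostly comes down to publishing the web UI port when starting the container. A minimal sketch follows; the image name "archiveteam/warrior-dockerfile" and web UI port 8001 are how the Warrior was commonly documented at the time, but verify both on the current wiki before relying on them.]

```python
# A minimal sketch, not official setup instructions. Assumptions: the Docker image
# name "archiveteam/warrior-dockerfile" and web UI port 8001 are taken from the
# Warrior documentation as commonly cited; verify both before use.
import subprocess

subprocess.run(
    [
        "docker", "run",
        "--detach",
        "--name", "archiveteam-warrior",
        "--publish", "8001:8001",   # expose the web UI so it is reachable from outside the VPS
        "--restart", "on-failure",
        "archiveteam/warrior-dockerfile",
    ],
    check=True,
)
# The web UI should then be reachable at http://<vps-ip>:8001/ (open that port in any firewall).
```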
[03:54] *** RichardG has joined #archiveteam
[04:13] *** ravetcofx has quit IRC (Read error: Connection reset by peer)
[04:16] *** ravetcofx has joined #archiveteam
[04:20] *** DoomTay has quit IRC (DoomTay)
[04:23] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:30] *** Sk1d has joined #archiveteam
[05:35] *** barblfish has joined #archiveteam
[05:36] According to the wiki article on DNS History, "the site is a zombie"
[05:36] I just took a peek and the site looks to be working normally, including search
[05:37] *** barblfish has quit IRC (Client Quit)
[05:38] *** barblfish has joined #archiveteam
[05:39] barblfish: good! maybe the interest prompted the site owner to keep it running
[05:40] barblfish: feel free to update the wiki page, mentioning that the site seems to be working (but mention exactly what you did and didn't try, as other pieces may still be broken).
[05:40] Probably should TRY to archive it at a "leisurely" pace just in case. For whatever reason, the closure notice is still there
[05:40] the magic word is yahoosucks
[05:40] K
[05:41] barblfish: yeah, having an individual run a grab-site instance at, say, 1 request per couple of minutes (in a random order, with a random delay) is probably worth doing
[05:46] *** barblfish has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 48.0/20160726073904])
[06:02] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:07] *** nicolas17 has quit IRC (Read error: Operation timed out)
[07:44] *** Selavi has joined #archiveteam
[07:48] *** JesseW has joined #archiveteam
[08:25] *** MMovie1 has joined #archiveteam
[08:27] *** MMovie has quit IRC (Read error: Operation timed out)
[08:42] *** Morbus has quit IRC (Ping timeout: 255 seconds)
[08:45] *** Morbus has joined #archiveteam
[09:08] *** Honno has joined #archiveteam
[10:07] *** WinterFox has joined #archiveteam
[10:10] *** JesseW has quit IRC (Read error: Operation timed out)
[10:25] *** SketchCow has quit IRC (Read error: Operation timed out)
[10:28] *** SketchCow has joined #archiveteam
[10:28] *** swebb sets mode: +o SketchCow
[11:03] *** ats has quit IRC (Quit: Lost terminal)
[11:05] *** ats has joined #archiveteam
[12:59] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:00] *** RMF|away is now known as Zialus
[13:05] *** WinterFox has quit IRC (Read error: Operation timed out)
[13:21] *** DoomTay has joined #archiveteam
[13:30] *** ats has quit IRC (Quit: leaving)
[13:36] *** ats has joined #archiveteam
[13:36] Reddit thread regarding Google Code deleting tarballs in 5 months: https://www.reddit.com/r/programming/comments/4y4epv/about_5_months_from_now_the_tarballs_from_google/
[13:45] joepie91: you're talking about the Google Code Archive shutting down too?
[13:46] seems so
[13:46] have not read the thread carefully
[13:46] just passing it on
[13:46] yeah, it looks like it
[13:47] When we're done with the 'original' google code we'll do the google code archive too
[13:48] *** bauruine has quit IRC (Ping timeout: 260 seconds)
[13:53] *** bauruine has joined #archiveteam
[13:58] Since ArchiveBot seems to not handle LEGO.com videos properly, I'm going to try my hand at archiving videos at http://web.archive.org/web/20160616230429/http://www.lego.com/en-us/chima/videos "manually". And I just figured out how to do that
[14:07] how are you going to do that?
[14:07] let's move this to #archiveteam-bs also
[14:08] Yeaah....
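[Editor's note: to illustrate the "leisurely pace" suggested at 05:41 (roughly one request per couple of minutes, in a random order, with a random delay), here is a minimal Python sketch. The URL list file and output naming are hypothetical placeholders, and grab-site remains the better tool in practice since it writes proper WARCs; this only shows the pacing idea.]

```python
# Sketch of a deliberately slow, randomized fetch loop (not a WARC writer).
# "urls.txt" and the output directory are hypothetical placeholders.
import pathlib
import random
import time
import urllib.request

urls = [u.strip() for u in open("urls.txt") if u.strip()]
random.shuffle(urls)                      # random order, as suggested
out = pathlib.Path("dnshistory-grab")
out.mkdir(exist_ok=True)

for i, url in enumerate(urls):
    try:
        data = urllib.request.urlopen(url, timeout=60).read()
        (out / f"{i:06d}.html").write_bytes(data)
    except Exception as exc:              # keep going; note failures for a later retry pass
        print(f"failed {url}: {exc}")
    time.sleep(random.uniform(60, 180))   # roughly one request per couple of minutes
```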
[14:09] Can't
[14:31] *** Sneakyimp has joined #archiveteam
[14:53] DoomTay: come over to -bs, also, youtube-dl should save those for you.
[14:53] arkiver, joepie91: just emailed Chris DiBona about getting in touch with ArchiveTeam re: Google Code, we'll see how that goes.
[14:54] the #googlecodeblue wiki needs some TLC.
[14:54] tracker seems to be down, also.
[14:54] It looks like I'm banned from -bs. Also, I already tried youtube-dl with ArchiveBot. no luck.
[14:56] no, you'd only be able to use it on the live site
[14:56] if the videos ain't in archive.org, they ain't in archive.org.
[14:56] I wonder what you did to get banned from bs
[14:57] Apparently a history of "saying galactically dumb shit"
[14:59] oh well, live and learn
[14:59] what are you trying to save exactly?
[14:59] if it's missing files in archive.org you'd have to go back to the source
[14:59] if they're gone there, YouTube or you're too late.
[15:01] I'm trying to save videos off of http://www.lego.com/en-us/chima/videos . Problem is their video player is powered by AngularJS, and the player is set up "on the fly"
[15:02] And using youtube-dl is also a no go: "unsupported URL"
[15:03] correct
[15:04] post a correctly formatted issue / request on https://github.com/rg3/youtube-dl/issues
[15:06] the only other hint I'll give you is look at the network traffic for manifest.f4m
[15:22] *** JesseW has joined #archiveteam
[15:33] Hey channel. I have Internet2 at my disposal and a huge stash of unused disk space; can I be of assistance for google code or some other project? Ideally your answer is something like "Yes, please run aria2 on each URL in the list at $URL." ;)
[15:36] nwf: join #newsgrabber and ask about being a grabber
[15:37] nwf: also, check out iabackup.archiveteam.org for a use for your disk space
[15:37] and THANK YOU!
[15:37] feel free to ask here if you have questions
[15:38] Thanks. :)
[15:39] you can also run a #warrior, but we don't have any project ATM that needs help, I think. But that could change anytime.
[15:40] you can run an archivebot pipeline
[15:40] reliable long term ones are always wanted
[15:41] Whazzat?
[15:41] on-demand archiver of small-to-medium or at-risk websites
[15:42] see #archivebot, http://archiveteam.org/index.php?title=ArchiveBot
[15:42] Sounds neat. Who has authority to push to the queue? (I don't want there to be risk to my hosting organization.)
[15:43] trusted users from here, though the bar is set pretty low
[15:43] and sometimes questionable stuff is archived
[15:43] so if that's of concern, it's fine
[15:43] Well, it just means I need to ask the admins for permission / give them a heads up that network security might come after them for a particular IP address.
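[Editor's note: on the manifest.f4m hint at 15:06. Since the LEGO player is assembled client-side by AngularJS, the manifest URL may only be visible in the browser's network traffic rather than in the raw HTML. A speculative sketch that simply scans saved HTML or XHR responses for HDS (f4m) manifest URLs; nothing here reflects LEGO.com's actual, undocumented player API, and a headless browser or the devtools network tab may be needed to capture the responses in the first place.]

```python
# Speculative helper: scan saved HTML/JSON blobs for HDS manifest URLs.
# The input files are whatever was captured from the page or its XHR calls.
import re
import sys

F4M_RE = re.compile(r'https?://[^\s"\'<>]+manifest\.f4m[^\s"\'<>]*')

seen = set()
for path in sys.argv[1:]:
    with open(path, errors="replace") as fh:
        for match in F4M_RE.finditer(fh.read()):
            url = match.group(0)
            if url not in seen:
                seen.add(url)
                print(url)
```

[Any hits could then be handed to a tool that understands f4m manifests.]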
[15:44] we're also not accepting new #archivebot pipelines right now, according to yipdw (who maintains the list)
[15:44] oh
[15:44] I didn't know that, alright
[15:44] *** nicolas17 has joined #archiveteam
[15:45] yeah, new archivebot pipelines are blocked by various code changes (i'm not certain exactly what)
[15:45] well archivebot needs to be rewritten, I know that, but it's trudging along anyway :P
[15:45] but AFAIK, #newsgrabber is actively looking for new pipelines, and #iabackup, while inactive, is still accepting new storage
[15:47] Sanqui: http://archiveteam.org/index.php?title=ArchiveBot#Volunteer_a_Node see the note at the top
[15:47] got it
[16:03] *** DoomTay has quit IRC (Quit: Page closed)
[16:08] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[16:21] *** DoomTay has joined #archiveteam
[17:04] *** AlexLehm has joined #archiveteam
[17:33] *** kristian_ has joined #archiveteam
[18:06] *** JW_work has quit IRC (Quit: Leaving.)
[18:07] *** JW_work has joined #archiveteam
[18:10] http://www.npr.org/sections/ombudsman/2016/08/17/489516952/npr-website-to-get-rid-of-comments NPR is removing comments from articles, but the comments will still be alive through Disqus. Are we planning on addressing this? (i.e. dumping threads from Disqus for each article, or perhaps something else..?)
[18:11] Quote from the article: "All existing comments on the site will disappear. That is because while comments look as though they exist on the NPR.org pages, they actually live within Disqus, an outside moderation platform used by NPR. So when the commenting software is removed, the archival comments go with it, Montgomery said, adding that it is not possible to remove the comment system but leave the old comments. Individual users will still be able
[18:11] to see an archive of their own comments in their Disqus accounts."
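[Editor's note: on the "dumping threads from Disqus" idea at 18:10. Disqus does expose a public API that can list posts per forum or per thread. A rough sketch of paging through the posts/list endpoint follows; the API key and NPR's forum shortname are placeholders (both assumptions), and real use would need to respect the API's rate limits.]

```python
# Rough sketch of paging through Disqus comments for one forum via the public API.
# DISQUS_API_KEY and the forum shortname "npr" are assumptions/placeholders.
import json
import time
import requests

API = "https://disqus.com/api/3.0/posts/list.json"
DISQUS_API_KEY = "YOUR_API_KEY"   # register an application at disqus.com/api/ to obtain one
FORUM = "npr"                     # placeholder: confirm NPR's actual forum shortname

cursor = None
with open("npr-disqus-posts.jsonl", "w") as out:
    while True:
        params = {"api_key": DISQUS_API_KEY, "forum": FORUM, "limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(API, params=params, timeout=60).json()
        for post in resp.get("response", []):
            out.write(json.dumps(post) + "\n")
        cur = resp.get("cursor", {})
        if not cur.get("hasNext"):
            break
        cursor = cur["next"]
        time.sleep(1)             # be gentle; the API enforces hourly rate limits
```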
[18:30] *** JW_work has quit IRC (Read error: Connection reset by peer)
[18:31] *** JW_work has joined #archiveteam
[18:55] *** GLaDOS has quit IRC (Read error: Operation timed out)
[18:56] *** GLaDOS has joined #archiveteam
[19:03] *** SmileyG has quit IRC (Remote host closed the connection)
[19:13] *** JW_work1 has joined #archiveteam
[19:15] *** JW_work has quit IRC (Read error: Operation timed out)
[19:16] *** Smiley has joined #archiveteam
[19:28] *** ats has quit IRC (reeeeboooooooot)
[19:59] *** AlexLehm has quit IRC (Ping timeout: 260 seconds)
[20:06] *** tomwsmf has joined #archiveteam
[20:07] *** SirCmpwn has quit IRC (Read error: Operation timed out)
[20:10] *** ats has joined #archiveteam
[20:24] *** SirCmpwn has joined #archiveteam
[20:25] *** kristian_ has quit IRC (Leaving)
[20:31] *** pfallenop has quit IRC (Read error: Operation timed out)
[20:37] *** mr-b has quit IRC (Read error: Operation timed out)
[20:40] *** mr-b has joined #archiveteam
[20:40] *** pfallenop has joined #archiveteam
[20:42] *** DoomTay has quit IRC (Quit: Page closed)
[20:45] *** mr-b has quit IRC (Ping timeout: 246 seconds)
[21:02] *** kristian_ has joined #archiveteam
[21:02] *** mr-b has joined #archiveteam
[21:06] *** Honno has quit IRC (Read error: Operation timed out)
[21:35] *** robink has quit IRC (Ping timeout: 501 seconds)
[22:08] *** robink has joined #archiveteam
[22:38] *** pfallenop has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** nicolas17 has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** SketchCow has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** Morbus has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** zenguy has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** superkuh has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** dashcloud has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** chazchaz has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** winr5r has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** MrRadar has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** RedType_ has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** zino has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** arkiver has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** Peetz0r_ has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** Infreq has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** aschmitz has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** gibigiana has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** w0rp has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** HCross has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** indrora has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** dxrt has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** Zebranky has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** ranma has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** antomatic has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** hook54321 has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** luckcolor has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** ErkDog has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** Cameron_D has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** dcmorton has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** is- has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** Jogie has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** mistym- has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** swebb has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** atlogbot has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** dserodio has quit IRC (ny.us.hub irc.servercentral.net)
[22:38] *** filippo__ has quit IRC (ny.us.hub irc.servercentral.net)
[22:40] *** andromed1 has quit IRC (Read error: Connection reset by peer)
[22:55] *** JW_work has joined #archiveteam
[23:04] *** JW_work1 has quit IRC (Read error: Operation timed out)
[23:05] *** DoomTay has joined #archiveteam
[23:05] *** pfallenop has joined #archiveteam
[23:05] *** nicolas17 has joined #archiveteam
[23:05] *** SketchCow has joined #archiveteam
[23:05] *** Morbus has joined #archiveteam
[23:05] *** zenguy has joined #archiveteam
[23:05] *** superkuh has joined #archiveteam
[23:05] *** dashcloud has joined #archiveteam
[23:05] *** chazchaz has joined #archiveteam
[23:05] *** winr5r has joined #archiveteam
[23:05] *** MrRadar has joined #archiveteam
[23:05] *** RedType_ has joined #archiveteam
[23:05] *** zino has joined #archiveteam
[23:05] *** arkiver has joined #archiveteam
[23:05] *** Infreq has joined #archiveteam
[23:05] *** Peetz0r_ has joined #archiveteam
[23:05] *** indrora has joined #archiveteam
[23:05] *** aschmitz has joined #archiveteam
[23:05] *** gibigiana has joined #archiveteam
[23:05] *** w0rp has joined #archiveteam
[23:05] *** HCross has joined #archiveteam
[23:05] *** irc.servercentral.net sets mode: +oooo SketchCow chazchaz arkiver HCross
[23:05] *** dxrt has joined #archiveteam
[23:05] *** Zebranky has joined #archiveteam
[23:05] *** ranma has joined #archiveteam
[23:05] *** antomatic has joined #archiveteam
[23:05] *** hook54321 has joined #archiveteam
[23:05] *** luckcolor has joined #archiveteam
[23:05] *** ErkDog has joined #archiveteam
[23:05] *** Cameron_D has joined #archiveteam
[23:05] *** dcmorton has joined #archiveteam
[23:05] *** irc.servercentral.net sets mode: +oooo dxrt antomatic luckcolor dcmorton
[23:05] *** is- has joined #archiveteam
[23:05] *** mistym- has joined #archiveteam
[23:05] *** swebb has joined #archiveteam
[23:05] *** atlogbot has joined #archiveteam
[23:05] *** dserodio has joined #archiveteam
[23:05] *** filippo__ has joined #archiveteam
[23:05] *** irc.servercentral.net sets mode: +oo mistym- swebb
[23:05] *** swebb sets mode: +o brayden_
[23:05] *** swebb sets mode: +o Atluxity
[23:05] *** swebb sets mode: +o DFJustin
[23:05] *** swebb sets mode: +o beardicus
[23:05] *** swebb sets mode: +o midas
[23:05] *** swebb sets mode: +o SadDM
[23:05] *** swebb sets mode: +o balrog
[23:05] *** swebb sets mode: +o edsu
[23:05] *** swebb sets mode: +o joepie91
[23:05] *** swebb sets mode: +o altlabel
[23:05] *** swebb sets mode: +o Jonimoose
[23:05] *** swebb sets mode: +o xmc
[23:08] *** dxrt has quit IRC (Ping timeout: 370 seconds)
[23:10] *** dxrt has joined #archiveteam
[23:13] *** max has joined #archiveteam
[23:14] i have a site that may have historical significance and i am thinking of shutting it down. who should i talk to about potentially getting it archived efficiently?
[23:15] What's the site?
[23:15] www.ytmnd.com
[23:16] o my
[23:16] ...okay yes that has historical / internet culture significance o.O
[23:16] o.o
[23:16] it isn't really cost-effective to host anymore
[23:16] yea we can hold it
[23:16] max: thank you for considering how best to archive it
[23:16] <3
[23:16] i could spend the time to try to get it all virtualized, but i think it would only prolong the inevitable death
[23:17] max: how much bandwidth is it eating?
[23:17] the best way would be to make a copy of the whole site database, and ship/upload that to archive.org as an item
[23:17] (we can help if you have questions)
[23:18] if that's not feasible (and maybe as an alternative), we can make a scrape of it before it goes down, which will get copied into the Wayback Machine
[23:18] nicolas17: probably less than 10mbps on average, mainly the costs are colocation fees at the moment since the hardware is aging
[23:18] imo a scrape would be best in any case
[23:18] *** howdoicom has joined #archiveteam
[23:18] guided by a list of valid sites
[23:18] warc it up
[23:18] it'd just be nice to have the raw database, too, in case someone else wants to host it again later
[23:18] *** BlueMaxim has joined #archiveteam
[23:18] yeah
[23:19] but yeah, both — both would be best
[23:19] max: I meant in GB/month (a constant 10mbps would mean 3TB/mo)
[23:19] nicolas17: i haven't looked and i get billed at 95th percentile
[23:19] JW_work: bothisgood.gif
[23:19] exactly
[23:19] the content drive is currently 1.7T, i think i'd probably need to anonymize the db at the very least, remove private messages and stuff
[23:20] at the very least, i could write a script to create a list of every unique URL on the entire site
[23:20] JW_work: someone should probably write some scripts
[23:21] well, if you're willing, I'm pretty certain archive.org would be delighted to get a non-anonymized version of the drive and keep it private for a couple of decades or so
[23:21] to be fair, there is probably a ton of dmca violations, and horrific nsfw stuff
[23:21] 1.7T is not a particularly painful size for us
[23:21] i figured
[23:22] JW_work: I heard you guys wanted a copy of Mapillary in case they go under...
[23:22] the database is pretty large
[23:22] yep, it'd be great to have that, too
[23:22] Mapillary staff told me they have 200TB of photos, so yeah, 1.7TB is small XD
[23:23] yeah, 200TB is in the range where we'd need to discuss with IA staff before dropping it on them :-)
[23:23] but 1.5TB of our content is probably homemade drawings of sonic the hedgehog having sex with tails
[23:24] that's fine
[23:24] eh, we're still glad to have it
[23:24] db is mysql and around 180gb. it has historical view data for every site dating back to 2004 i think
[23:24] that would be awesome to have
[23:25] you mean like access logs? data scientists are drooling right now
[23:25] :-)
[23:25] it's more like date, site_id, view_counter
[23:25] but yeah some neat stuff could be done with it
[23:27] i wonder if a warc would be able to faithfully encapsulate/play back a ytmnd
[23:27] it uses a flash loader because at the time it was the only way you could gaplessly loop WAV files
[23:28] warc as a format should support it, a naive scraper trying to create the warc would have trouble with the Flash though
[23:28] max: How long can you keep it online for?
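[Editor's note: on the "list of every unique URL on the entire site" idea at 23:20. Since each YTMND site lives on its own subdomain, a seed list could probably be generated straight from the database. The table and column names below are invented for illustration; only the site owner knows the real schema.]

```python
# Illustrative only: emit one URL per YTMND site from the MySQL database.
# The table "sites" and column "domain_prefix" are hypothetical names;
# substitute whatever the real schema uses.
import pymysql

conn = pymysql.connect(host="localhost", user="ytmnd", password="...", database="ytmnd")
with conn.cursor() as cur, open("ytmnd-seed-urls.txt", "w") as out:
    cur.execute("SELECT DISTINCT domain_prefix FROM sites")
    for (prefix,) in cur:
        out.write(f"http://{prefix}.ytmnd.com/\n")
conn.close()
```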
[23:28] if you want to make a html5 version that plays nicely in the archive, people from the future would appreciate it
[23:28] if you don't want to, that's fine
[23:29] Frogging: indefinitely
[23:29] thanks
[23:29] this is pretty preliminary, but if i don't give it to someone it will just sit on a hard drive in my closet forever which seems pretty lame
[23:30] I think Google made a Flash-to-HTML5 converter (mainly for Flash ads to work on mobile), it would be interesting to see if it can handle ytmnd .swf's
[23:30] I once tried to make a script to convert the things to HTML5, but I got absolutely nowhere with it
[23:30] (actually kind of Flash-to-JSON which is then interpreted by an HTML5/Javascript player)
[23:30] ytmnd just has 1 swf for the player and everything else is standard image/audio formats
[23:31] i made a prelim html5 version in 2011 but audio support wasn't very good back then
[23:31] oh :o
[23:31] and that was the last time i really worked on the site
[23:31] I thought you had vector swf animations and stuff
[23:34] it's a glorified flash intro and then is just used to play sound
[23:34] i.e. waits until the gif and audio are loaded before playing either
[23:37] We should probably give max the secret phrase so he can make a page about this
[23:37] xmc: ok, there is no way you can throw a generic warc grabber at this; there is a swf loader that gets a json to know what wav and jpg to load
[23:38] so if you want to scrape, custom script it is
[23:38] yea
[23:38] that was my gut feeling
[23:38] http://picard.ytmnd.com/info/508/json
[23:38] this sounds like a good job for the warrior
[23:42] *** kristian_ has quit IRC (Leaving)
[23:43] turns out if i just change the default from flash to html5, it seems to work fine now
[23:44] less flashy since there's no status or anything, but it lets you see the site at least
[23:47] Ha ha, flashy
[23:51] omg I grew up on YTMND
[23:51] ErkDog: what were the consequences? :P
[23:52] max, is it PHP / MySQL?
[23:53] guess I could read, lol
[23:54] yeah common lamp stack
[23:54] lol I mirrored this from YTMND 1,000 years ago
[23:54] http://erkdog.netho.tk/picard/
[23:55] YTMND was basically audio memes before memes were even a thing
[23:55] well is*
[23:55] it's like a meme w/ audio / animation except memes didn't exist back then
[23:57] we just called them fads, very few of them had the staying power of something like dickbutt
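[Editor's note: following up on the loader JSON mentioned at 23:38 (e.g. http://picard.ytmnd.com/info/508/json), a custom scraper could fetch that JSON per site and then pull whatever asset URLs it references. The JSON structure is not documented in this log, so the sketch below just walks the parsed object for anything that looks like a media URL rather than assuming particular field names.]

```python
# Speculative per-site asset grabber. It assumes only that <site>.ytmnd.com/info/<id>/json
# returns JSON containing absolute URLs to the image/sound assets somewhere inside it;
# the actual field layout is unknown here, so we just walk the structure for URLs.
import json
import os
import re
import urllib.request

URL_RE = re.compile(r"^https?://.+\.(gif|jpg|jpeg|png|wav|mp3|swf)$", re.I)

def iter_urls(obj):
    """Yield every string in a nested JSON structure that looks like a media URL."""
    if isinstance(obj, dict):
        for v in obj.values():
            yield from iter_urls(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from iter_urls(v)
    elif isinstance(obj, str) and URL_RE.match(obj):
        yield obj

def grab_site(info_json_url, dest="ytmnd-assets"):
    os.makedirs(dest, exist_ok=True)
    info = json.load(urllib.request.urlopen(info_json_url, timeout=60))
    for url in set(iter_urls(info)):
        filename = os.path.join(dest, url.split("/")[-1])
        urllib.request.urlretrieve(url, filename)
        print("saved", filename)

grab_site("http://picard.ytmnd.com/info/508/json")
```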