[03:10] the tracker disk usage is at 94% [06:05] #justouttv, the justin.tv "grab the videos that have some views" project, is now online and driving up bandwidth bills [06:10] yay [06:20] aaaand [06:20] 500 GB in 4 hours [06:20] goddamn [06:22] actually, that's around 34.7 MB/s [06:22] Yay for the cloud. 10 bucks for 2 TB of transfer over at DO. [06:23] Really only less than half of that (given the whole push pull thing that seesaw does) but still :D [06:26] Folks here say they're crawling 300 .fi URLs per second http://helsinginyliopisto.etapahtuma.fi/Default.aspx?tabid=304&id=9538 [06:27] (with heritrix) [07:22] whoever has access to @at_warrior, you're wanted in #justouttv [07:23] mostly to tweet about justouttv [08:16] holy shitfuck, this will distort the graph scale for the next year http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/project_bytes.html [08:16] perfect [08:20] holy crap, is that going to bankrupt you? [08:20] one does not simply bankrupt archiveteam [08:24] I'm finally running warrior now [09:45] wow wow wow! [09:45] Dear customers, [09:45] After receiving many complaints regarding the stability of our OneCloud infrastructure, and exploring many options for a better, stronger system, we have finally come to the decision to start off fresh. [09:45] This unfortunately means that we are terminating the OneCloud system in its entirety, and that all VPS plans will be terminated on June 16th 2014. [09:45] There will be no further warnings after this message, so please make sure to complete any migration or backup process to preserve your data and services. [09:48] which provider is that [09:49] Oneprovider [12:09] I have a question regarding this search: https://ia801604.us.archive.org/11/items/dailybooth-freeze-frame-index/ [12:10] each time i search i get a 'Sorry, your browser is not smart enough. 
(It does not support HTTP Range requests.)' message. I've tried this with Mozilla, Chrome, and IE, on a few computers, and on a computer with DMZ; all gave the same error. [12:21] ?? [12:23] Sorry, your browser is not smart enough. (It does not support HTTP Range requests.)... [12:23] whhat do? [12:24] * DBArch :S [12:42] Change browser? Seems pretty obvious. [14:18] Slamming in 1300+ more Diplomacy zines. Stuff from the 60s to the mid-2000s. (https://archive.org/details/tetracuspid_27-1978-04-03) [14:18] It boggles my mind that people used to play board games by mail. [14:19] SketchCow: If you want to make a collection for them and make me an admin, I can start filing them. [14:20] Also, for my next batch, I can upload them there directly. [15:13] So.... many zines [15:13] Where are you getting these from? [15:13] What should the name be, by the way? [15:21] The last few rounds (totalling 5000 or so) have so far come from a single site (http://www.whiningkentpigs.com/DW/zines.htm), but there are a couple of other sites with the same type of zine that will be in my crosshairs shortly. [15:36] As for a name, I'd go with "Diplomacy Zines" [15:39] diplomacyzines created [15:46] SadDM, I'll swap a bunch over to you. [15:50] OK cool... thanks. [15:52] Are all 4000 items in opensource (except that mailing list) good to put into this collection? [15:54] oh... no. Just the stuff in https://archive.org/search.php?query=uploader%3A%22aeakett%40gmail.com%22%20AND%20subject%3A%22dipzine%22 [15:57] 3,680 of them it is. [15:57] sounds about right [16:01] https://archive.org/details/diplomacyzines is coming along and will populate. [16:13] SketchCow: sounds like we might be having a space issue with justin... [16:32] What wwherehflksdfjldfjsdf [16:33] SketchCow: seems you dropped out of -bs...
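An editor's aside on the "not smart enough" error above: the archive.org index viewer apparently streams pieces of a large file, so it needs the server (and every proxy in between) to answer a Range request with 206 Partial Content plus a Content-Range header; getting a plain 200 with the full body is what triggers that message. A minimal sketch of the header mechanics, with illustrative function names:

```python
# Sketch of HTTP Range request header handling (function names invented).

def build_range_header(start, end=None):
    """Build a Range header value for bytes [start, end] (end inclusive)."""
    return f"bytes={start}-{'' if end is None else end}"

def parse_content_range(value):
    """Parse a 'Content-Range: bytes start-end/total' response header."""
    unit, _, rest = value.partition(" ")   # "bytes", "0-1023/4096"
    span, _, total = rest.partition("/")
    start, _, end = span.partition("-")
    return int(start), int(end), int(total)

# A Range-capable server replies 206 with a Content-Range like this;
# one that ignores Range replies 200 with the whole file instead.
print(build_range_header(0, 1023))               # bytes=0-1023
print(parse_content_range("bytes 0-1023/4096"))  # (0, 1023, 4096)
```

Since every browser tried showed the error, the culprit was likely something between the browser and archive.org (a proxy or filter stripping the Range header) rather than the browsers themselves.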
[17:34] SketchCow: I tried to make an edit to the new collection's description and got the following message: "You are not allowed to submit items into collection(s): magazine_rack" [17:34] I found that a little odd since I was able to add an image to that collection item. [17:34] If it's a "no go", it's not a big deal... just thought I'd add a picture to the page. [17:40] hmmm [18:21] Welp, I just discovered my job for the next X weeks/months: "Ancestry.com Announces Retirement of Several Websites" [18:21] "Ancestry.com announced this morning at 10:00 MT that it is retiring several of its websites. The websites are" [18:21] MyFamily.com MyCanvas.com Genealogy.com Mundia.com [18:22] This is LOTS of data from small companies they've acquired over the years. Mostly it's the message board data that needs saving; the underlying databases are already posted on other sites. [18:22] wait what [18:22] "Users will be told the retirement timeline and how to export their data." [18:22] ancestry.com is going down? [18:22] So, no timeline yet, and yes they have an export function. [18:23] :| [18:23] No, not ancestry -- sites they've acquired over the years. [18:23] A great archive [18:23] Unlimited storage space and SiteSafeSM technology keep all of your family memories safe and secure. No matter what. [18:23] right [18:23] Ha. From MyFamily.com [18:23] lol [18:23] reminds me of... *searches* [18:24] (had to dig through my jason scott stalkings) [18:24] http://bit-chest.com/ [18:24] lots of handwavium [18:24] THE GREATEST ELEMENT OF ALL [18:24] Stealing handwavium for speech [18:25] :D [18:25] found it on tvtropes some time ago [18:25] the magic ingredient that makes everything right without explanation [18:25] seems to fit perfectly for these "safe permanent storage" clowns [18:25] wow, a family history site destroying data? [18:25] really? [18:25] talk about irony [18:26] It's really the message boards that need the ArchiveTeam love, so I'll start there. 
They go back at least 15 years in some cases. For example, 67,000+ posts just for the SMITH family: http://genforum.genealogy.com/smith/ [18:26] these are my people [18:26] SMITHs [18:27] Asparagir: very 1995 messageboards, that should be easy to archive [18:27] Yeah, I just hope the shutdown timeline isn't too compressed. [18:28] Otherwise, I will have to finally teach myself how to write seesaw scripts and build my own Warrior project. :-) [18:28] looks like something we could hit with archivebot [18:29] easy peasy [18:29] exmic: yup, it's big, but it's simply structured [18:29] archivebot should do fine on this one [18:30] Does ArchiveBot have enough space at the moment to take on a project like that? And remember, these are four separate sites, some with separate message board or forum sub-domains. [18:31] well, genforum at least should be fine [18:31] I'd guess ten, fifteen gigs before compression [18:41] This is a misuse of archivebot [18:42] It does strike me that archivebot is very, very successful. Maybe we need to make it so its work can be handed over to a larger pool of volunteer machines? [18:43] Machines much less likely to flit in and out like warriors. [18:43] Ultrawarriors, if you will. [18:43] So, a warrior project, then? Or should I grab it myself and do a standard upload to IA to the archiveteam_antecedents collection? [18:43] SPARTANS [18:44] I'm going to say "It's a misuse of archivebot but should be used as a sign we should upgrade archivebot's abilities and flexibility" [18:44] So go ahead [18:44] *Ultimatewarriors [18:44] We've done a few other whoppers before. [18:44] but those whoppers can really choke things up [18:44] http://www.kayfabenews.com/wp-content/uploads/2014/04/warrior.jpg [18:45] And as discussed, we should use our awareness of this to rethink some of the bot's abilities.
[18:45] For example, being able to say "and this is a big one" so it goes into a different torpedo tube [18:45] SketchCow: once RAM requirements are solved (that is, can run on <512MB boxes without swap), I can plug a few boxes into the archivebot architecture [18:45] insofar that helps [18:46] and yes, that would be a useful distinction [18:46] though I'd opt for saying "this is a small one" rather than saying "this is a big one", so that if somebody forgets to specify it won't accidentally block everything [18:47] or maybe I misjudge the size of a site and send a huge job to the smart-car lane [18:47] either way, there will be issues [18:48] Well, the whole point of archivebot upon inception was for small sites. [18:49] oh yeah, but the definition of "small" has been sliding quite a bit lately [18:49] So maybe having it notice we've gone past a certain limit, be it 1gb of material, or x amount of URLs, and go "uh, this needs to go to the ultimatewarriors" [18:49] My concern is mostly we lose timeliness [18:50] If it takes hours to get to a classic fuckup or craigslist ad, it'll be gone [18:55] if I had written archivebot using mongodb it'd clearly be webscale [18:55] SketchCow: this is moving into -bs territory, but... I'm not sure of the current archivebot architecture, but is it capable yet of freezing/pausing jobs, sending them over to another box wholesale, and resuming them there? [18:55] no [18:55] This isn't -bs territory [18:55] The channel is the bot working and talking about the bot [18:55] lengthy discussion :) [18:56] anyway [18:56] and giving yipdw kudos for the fucking thing [18:56] it will at some point, but there are bigger issues to deal with first, like the reporting process fucking itself up periodically [18:56] I asked for a swiss army knife and he made the iron giant [18:56] yipdw: is archivebot still using wget, or does it use wpull now? [18:56] wpull [18:56] hmmm [18:56] on all nodes?
[18:56] yes [18:56] there are however memory issues remaining, and I think those have to do with the reporter threads [18:57] I might be able to dick around with it in the near future then, and see if I can duct-tape together a resume function [18:57] in MOST cases you will not see a memory blowup [18:57] mm [18:57] Another line of thought for Asparagir's question [18:57] The point of the bot is to make basic things easy and not constantly have to ramp people up on the "right" ways and missing mistakes. [18:57] yipdw: if you had to describe the current architecture of archivebot in a single line, including technologies/architectures used, what would it be? (so I can get a vague idea of what to expect) [18:57] But Asparagir has been in this place forever, she gets what it needs. [18:57] So her doing it the old-fashioned way seems quite legit [18:58] joepie91: 45 cats, a blender and underscor's mom [18:58] joepie91: Python, Ruby on the backend, CoffeeScript/Ember.js on the frontend, CouchDB and Redis as datastores [18:58] Oh sure, use the layman's terms [18:59] underscor's mom will be present in release 6 [18:59] joepie91: also the fetch pipeline is seesaw, though a much more complicated seesaw pipeline than any other I am aware of [19:00] what about a beefy box [19:00] ivan` runs one [19:00] for archivebot [19:00] it's the reason why we're doing 30 concurrent jobs vs. 5 [19:00] :P [19:00] yipdw: main communication protocol(s) between components? [19:00] redis pubsub [19:00] :P [19:00] oh dear [19:00] right [19:00] it works [19:01] I know Python, I can learn Ruby, I know CoffeeScript, I can learn Ember, I know CouchDB a little bit, I know Redis a little bit, I have nfi how its pubsub works [19:01] not too bad a score [19:01] it's pretty easy [19:01] I was going to use e.g. 
ZeroMQ and then went "I do not need that" [19:01] Kenshin: beefy boxes are always useful :P [19:01] yipdw: see, if it were zeromq, I would've known how it worked :D [19:01] also, ArchiveBot is very much a product of "get shit online" [19:02] the fact that it has done what it is doing surprises the hell out of me [19:02] it is the fastest way to get a small site archived though [19:02] I mean, its processes are running in a tmux [19:02] * nico hide his screen process [19:03] in any case, there is plenty to do [19:03] archivebot is also running with a lot of different version of the pipeline/wpull [19:03] http://i.imgur.com/SfNlIEA.png [19:03] but are we even maxing out archivebot's resources [19:03] Kenshin: no [19:03] yipdw: hehe [19:04] my drone is sleeping [19:04] I think yipdw knows the current maxing or not maxing. [19:04] but if we threw the sites mentioned on it [19:04] it's doing 32 jobs right now [19:04] I think we can do 35 [19:04] it'll probably flip? [19:04] yipdw: Maybe work with exmic to add graphs? [19:04] they're already being done [19:04] SketchCow: yeah [19:04] yipdw: how receptive are you to (well-tested) architecture changes for archivebot? in case I have a bored weekend [19:04] (doing a grab of http://tcrf.net/ and http://geekbeat.tv/) [19:04] joepie91: I'm fine with them if they fix things [19:04] such as not needing tmux? :P [19:04] I don't need tmux [19:05] joepie91: put it under supervisord :) [19:05] I just haven't written e.g. start scripts or gotten it in daemontools [19:05] I can point you to current deficiencies in ArchiveBot [19:06] for example, there is currently no way to know what the max load is [19:06] there is also no way right now to signal when a job starts [19:06] (but we do know when it finishes) [19:06] stuff like that [19:06] yipdw: are there issue tickets for this? 
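Since Redis pubsub comes up above as the glue between ArchiveBot's components: it is fire-and-forget channel messaging with no persistence. A minimal in-process sketch of the pattern (the channel name and payload here are invented; the real ones live in the ArchiveBot repo):

```python
# In-process sketch of the publish/subscribe pattern ArchiveBot runs over
# Redis. Channel name and payload are illustrative, not ArchiveBot's own.
from collections import defaultdict

class PubSub:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        # Fire-and-forget, like Redis PUBLISH: if nobody is subscribed,
        # the message is simply dropped. Returns the receiver count.
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])

bus = PubSub()
events = []
bus.subscribe("updates", events.append)  # e.g. the dashboard listening
bus.publish("updates", {"job": "tcrf.net", "downloaded": 1024})
```

With a live Redis the same shape is `redis.publish(channel, json.dumps(payload))` on one side and a subscriber loop on the other. A subscriber that is down when a message is published never sees it, which is one reason freezing a job and resuming it on another box is nontrivial.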
[19:06] I do not think that fixing those requires significant architecture changes [19:06] joepie91: yeah [19:06] I can say that archivebot is pulling roughly 200gb a day. [19:06] because then I can just look at those [19:08] yipdw: correct repo link? [19:08] https://github.com/ArchiveTeam/ArchiveBot [19:08] thanks, bookmarked [19:08] now brb, just got a bugticket for pythonwhois, work to do :D [19:09] the most confusing part of it is probably the cogs program [19:09] do we have a page for the genealogy site(s)? need to link [19:09] yipdw: anyway if you reach a point you need resources for archivebot, just ping me [19:09] Kenshin: np [19:09] thanks [19:17] Nemo_bis: No, not yet -- I got the news through one of my e-mail mailing lists. But I'm sure something will be up soon. [19:18] Nemo_bis: Wait, it looks like one of the four sites, genealogy.com, now has a page with info about the shutdown: http://www.ancestry.com/cs/faq/genealogy-faq [19:18] How did today become Verify All Archiveteam Architecture Day [19:18] Probably the fact that justin tv is a fucking nightmare planet of websuck crashing into the IA building [19:18] Props to Ancestry.com for not deleting the message boards, just putting them into read-only mode. [19:19] Kaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaan [19:19] SketchCow: I guess it's been quiet for a while? [19:19] and suddenly we have justin.tv [19:19] so suddenly everyone is interested in how we're able to cope with projects? [19:19] Yeah [19:19] We'll likely be overengineered as fuck after this. [19:19] Which is good [19:20] For when facebook goes down [19:20] It never rains, but it pours. [19:20] 21:20 @SketchCow> For when facebook goes down [19:21] a nightmare [19:23] [14:31:37] <@jrra> rip crazynation.org [19:23] strange, we did some small projects not that long ago [19:24] So, on another subject...random thing that I discovered the other day: since when does the Washington Post block the IA bot in their robots.txt?
And therefore is not available AT ALL in the Wayback Machine? [19:24] ah fuck, SketchCow, which project was the one we had months to do? I lost the channel [19:24] What assholes. [19:25] or someone who remembers; I think we had until September or something [19:25] Asparagir: it still exists in IA [19:25] Yeah, but not visible to the public. [19:26] * joepie91 back [19:26] [21:18] <@SketchCow> How did today become Verify All Archiveteam Architecture Day [19:26] has to happen every once in a while [19:27] Asparagir: stub at http://archiveteam.org/index.php?title=Ancestry.com [19:27] balrog: crazynation is suspended; might just be a host issue [19:27] ah, #totheyard [19:27] seems it's been down for some time [19:28] midas: helium ? mlkshk ? verizon customer pages ? [19:28] http://archiveteam.org/index.php?title=Current_Projects#Upcoming_projects [19:28] wikis, how wonderful ;) [19:29] "We're pleased to announce that GenForum message boards, Family Tree Maker homepages, and the most popular articles will continue to be available in a read-only format on the Genealogy.com site." [19:29] well that's something, I suppose [19:30] minimum bar of shutting down almost reached [19:38] Nemo_bis: Thanks. Minor edits made; will update as needed. [19:39] https://github.com/FlatRockSoft/Hovertank3D [19:42] Asparagir: http://familytreemaker.genealogy.com/users/ [19:42] wow, those pages look straight out of 1995 [19:42] They are! [19:43] joepie91: True. But am I going to trust 39,000,000+ web pages of family history to those guys' whims? https://www.google.com/search?client=safari&rls=en&q=site:familytreemaker.genealogy.com&ie=UTF-8&oe=UTF-8 Hahahahaha, no. [19:43] Okay, 38.2 million. But still! [19:43] (And SketchCow chastises me in another channel that these numbers are very very very sketchy.) [19:55] Put everything on wikidata.org!
[19:58] then you can make https://toolserver.org/~magnus/ts2/geneawiki/?q=Q508848 (http://ultimategerardm.blogspot.it/2013/08/what-to-do-when-wikipedia-does-not_16.html ) [20:01] re: archivebot discussion, how about a mechanism to cut jobs at 1000 fetches and send the queue back to the tracker/hub for re-issuance [20:06] Any plans to put blip.tv back in the warrior? [20:18] ------------------------------------- [20:18] IMPORTANT NEWS [20:18] archive.org now shows .iso files in file listing links in items [20:18] okay not so important [20:18] ------------------------------------- [20:18] But Brewster was cranky it was the old ways. [20:19] I don't see a change, can you explain in more detail? [20:27] SketchCow: does it now also link to the ISO browsing feature? [20:27] * joepie91 has been waiting for that to be a UI thing for a while [20:30] No! [20:31] Good point! [20:34] SketchCow: I like it :) it's like a Kahle stamp of approval on all the ripping [20:38] Kahle's happy we've brought so much to bear [20:38] We've caught up for a decade of neglect [20:38] handily [20:44] RELEASE 6 [20:45] I'll add a changelog to archivebot and just arbitrarily start at RELEASE 6 [20:46] sounds good, ship it [20:46] :D [20:57] SketchCow: I stalked you in PM, btw [20:58] oh, hai underscor, it's been a while [23:50] Does anyone remember who did the ValleyWag crawl in ArchiveBot a few months ago? And why? Just curious... [23:54] Nevermind, got it. [23:54] No one did a full crawl; it was just a few key articles grabbed. [23:55] looks like ivan` did [23:55] #archivebot.EFnet.20131207.log:[19:44:10] !a http://valleywag.gawker.com/ [23:55] No, the ones I saw were articles about Brendan Eich and related news. No full crawls yet. [23:56] And were initiated by yipdw. [23:56] Point being, I am living in San Francisco (well, almost) during a latter-day Gilded Age and I want the stories about this place preserved, robots.txt or no.
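Context for the robots.txt thread above: at the time, the Wayback Machine applied a site's *current* robots.txt retroactively, so an ia_archiver disallow hid even years-old captures from public playback. What a given robots.txt actually blocks can be checked with Python's stdlib parser; the rules below are an invented stand-in, not the Post's actual file:

```python
# Check robots.txt rules with the stdlib; the file contents here are
# illustrative, not the Washington Post's real robots.txt.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The IA crawler is shut out entirely; everyone else only loses /private/.
print(rp.can_fetch("ia_archiver", "https://example.com/2014/story.html"))   # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/2014/story.html"))   # True
print(rp.can_fetch("Mozilla/5.0", "https://example.com/private/page.html")) # False
```

In a live check you would point `rp.set_url(...)` at the site's real robots.txt and call `rp.read()` instead of parsing a string.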
[23:57] I think this might be a good test project for me to break out the wpull + phantomjs, instead of wget. [23:58] http://archivebot.at.ninjawedding.org:4567/#/histories/http://valleywag.gawker.com/ [23:59] Six months ago, half these crazy startups didn't even exist yet. :-P
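The wpull-plus-PhantomJS idea in that last plan would look roughly like the invocation below, sketched as an argv list. The flags mirror wget's, and `--phantomjs` existed in the wpull releases of this era, but verify everything against `wpull --help` on the installed version:

```python
# Hypothetical wpull invocation for a JS-heavy site like Valleywag;
# every flag should be checked against the installed wpull version.
argv = [
    "wpull", "http://valleywag.gawker.com/",
    "--recursive", "--page-requisites",  # follow links, fetch CSS/JS/images
    "--phantomjs",                       # render pages through PhantomJS
    "--warc-file", "valleywag",          # write output as a WARC
    "--wait", "1", "--tries", "3",       # be polite, retry transient errors
]
print(" ".join(argv))
```

The PhantomJS pass is what distinguishes this from a plain wget grab: script-driven page content gets rendered and captured instead of being lost.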