#archiveteam 2014-06-05,Thu


Time Nickname Message
03:10 🔗 chfoo the tracker disk usage is at 94%
06:05 🔗 yipdw #justouttv, the justin.tv "grab the videos that have some views" project, is now online and driving up bandwidth bills
06:10 🔗 exmic yay
06:20 🔗 yipdw aaaand
06:20 🔗 yipdw 500 GB in 4 hours
06:20 🔗 yipdw goddamn
06:22 🔗 yipdw actually, that's around 34.7 MB/s
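The figure yipdw quotes can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 GB = 1000 MB) and an even transfer rate over the whole four hours; the function name is made up for illustration.

```python
def throughput_mb_per_s(total_gb: float, hours: float) -> float:
    """Average transfer rate in decimal megabytes per second.

    Assumes 1 GB = 1000 MB and a constant rate over the whole period.
    """
    seconds = hours * 3600
    return total_gb * 1000 / seconds


if __name__ == "__main__":
    # 500 GB in 4 hours, the numbers quoted in the channel
    print(f"{throughput_mb_per_s(500, 4):.1f} MB/s")
```

With binary units (GiB and MiB) the ratio is unchanged, since both sides scale by the same factor of 1024.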
06:22 🔗 aggrosk Yay for the cloud. 10 bucks for 2 TB of transfer over at DO.
06:23 🔗 aggrosk Really only less than half of that (given the whole push pull thing that seesaw does) but still :D
06:26 🔗 Nemo_bis Folks here say they're crawling 300 .fi URLs per second http://helsinginyliopisto.etapahtuma.fi/Default.aspx?tabid=304&id=9538
06:27 🔗 Nemo_bis (with heritrix)
07:22 🔗 yipdw whoever has access to @at_warrior, you're wanted in #justouttv
07:23 🔗 trs80 mostly to tweet about justouttv
08:16 🔗 exmic holy shitfuck, this will distort the graph scale for the next year http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/project_bytes.html
08:16 🔗 exmic perfect
08:20 🔗 voltagex holy crap, is that going to bankrupt you?
08:20 🔗 exmic one does not simply bankrupt archiveteam
08:24 🔗 voltagex I'm finally running warrior now
09:45 🔗 midas wow wow wow!
09:45 🔗 midas Dear customers,
09:45 🔗 midas After receiving many complaints regarding the stability of our OneCloud infrastructure, and exploring many options for a better, stronger system, we have finally come to the decision to start off fresh.
09:45 🔗 midas This unfortunately means that we are terminating the OneCloud system in its entirety, and that all VPS plans will be terminated on June 16th 2014.
09:45 🔗 midas There will be no further warnings after this message, so please make sure to complete any migration or backup process to preserve your data and services.
09:48 🔗 Kenshin which provider is that
09:49 🔗 midas Oneprovider
12:09 🔗 DBArchive I have a question regarding this search: https://ia801604.us.archive.org/11/items/dailybooth-freeze-frame-index/
12:10 🔗 DBArchive Each time I search I get a 'Sorry, your browser is not smart enough. (It does not support HTTP Range requests.)' message. I've tried this with Mozilla, Chrome, and IE, on a few computers, and on a computer with DMZ; all gave the same error.
12:21 🔗 DBArch ??
12:23 🔗 DBArch Sorry, your browser is not smart enough. (It does not support HTTP Range requests.)...
12:23 🔗 DBArch whhat do?
12:24 🔗 * DBArch :S
12:42 🔗 Nemo_bis Change browser? Seems pretty obvious.
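For context on the error message: an HTTP Range request asks for part of a resource with a header such as `Range: bytes=0-99`, and a server that honours it answers with status 206 and a `Content-Range` header. How the index page in question actually detects support is not stated in the log, so the following is only a sketch of what "the range was honoured" looks like on the response side; both function names are hypothetical.

```python
import re
from typing import Mapping, Optional, Tuple


def parse_content_range(value: str) -> Optional[Tuple[int, int, Optional[int]]]:
    """Parse 'bytes first-last/total' into (first, last, total).

    total is None when the server sends '*' for an unknown length.
    Returns None if the value does not have this shape.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        return None
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return first, last, total


def range_was_honored(status: int, headers: Mapping[str, str]) -> bool:
    """True if a response looks like a valid answer to a Range request."""
    if status != 206:  # 206 Partial Content is the success code for ranges
        return False
    lowered = {name.lower(): value for name, value in headers.items()}
    return parse_content_range(lowered.get("content-range", "")) is not None
```

A plain 200 response to a request that carried a `Range` header means the range was ignored and the full body was returned.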
14:18 🔗 SadDM Slamming in 1300+ more Diplomacy zines. Stuff from the 60s to the mid-2000s. (https://archive.org/details/tetracuspid_27-1978-04-03)
14:18 🔗 SadDM It boggles my mind that people used to play board games by mail.
14:19 🔗 SadDM SketchCow: If you want to make a collection for them and make me an admin, I can start filing them.
14:20 🔗 SadDM Also, for my next batch, I can upload them to there directly.
15:13 🔗 SketchCow So.... many zines
15:13 🔗 SketchCow Where are you getting these from?
15:13 🔗 SketchCow what should the name be, by the way.
15:21 🔗 SadDM The last few rounds (totalling 5000 or so) have so far come from a single site (http://www.whiningkentpigs.com/DW/zines.htm), but there are a couple of other sites with the same type of zine that will be in my crosshairs shortly.
15:36 🔗 SadDM As for a name, I'd go with "Diplomacy Zines"
15:39 🔗 SketchCow diplomacyzines created
15:46 🔗 SketchCow SadDM, I'll swap a bunch over to you.
15:50 🔗 SadDM OK cool... thanks.
15:52 🔗 SketchCow Are all 4000 items in opensource (except that mailing list) good to put into this collection?
15:54 🔗 SadDM oh... no. Just the stuff in https://archive.org/search.php?query=uploader%3A%22aeakett%40gmail.com%22%20AND%20subject%3A%22dipzine%22
15:57 🔗 SketchCow 3,680 of them it is.
15:57 🔗 SadDM sounds about right
16:01 🔗 SketchCow https://archive.org/details/diplomacyzines is coming along and will populate.
16:13 🔗 balrog SketchCow: sounds like we might be having a space issue with justin...
16:32 🔗 SketchCow What wwherehflksdfjldfjsdf
16:33 🔗 balrog SketchCow: seems you dropped out of -bs...
17:34 🔗 SadDM SketchCow: I tried to make an edit to the new collection's description and got the following message: "You are not allowed to submit items into collection(s): magazine_rack"
17:34 🔗 SadDM I found that a little odd since I was able to add an image to that collection item.
17:34 🔗 SadDM If it's a "no go", it's not a big deal... just thought I'd add a picture to the page.
17:40 🔗 SketchCow hmmm
18:21 🔗 Asparagir Welp, I just discovered my job for the next X weeks/months: "Ancestry.com Announces Retirement of Several Websites"
18:21 🔗 Asparagir "Ancestry.com announced this morning at 10:00 MT that it is retiring several of its websites. The websites are"
18:21 🔗 Asparagir MyFamily.com MyCanvas.com Genealogy.com Mundia.com
18:22 🔗 Asparagir This is LOTS of data from small companies they've acquired over the years. Mostly it's the message board data that needs saving; the underlying databases are already posted on other sites.
18:22 🔗 joepie91 wait what
18:22 🔗 Asparagir "Users will be told the retirement timeline and how to export their data."
18:22 🔗 joepie91 ancestry.com is going down?
18:22 🔗 Asparagir So, no timeline yet, and yes they have an export function.
18:23 🔗 joepie91 :|
18:23 🔗 Asparagir No, not ancestry -- sites they've acquired over the years.
18:23 🔗 antomatic A great archive
18:23 🔗 antomatic Unlimited storage space and SiteSafeSM technology keep all of your family memories safe and secure. No matter what.
18:23 🔗 joepie91 right
18:23 🔗 antomatic Ha. From MyFamily.com
18:23 🔗 joepie91 lol
18:23 🔗 joepie91 reminds me of... *searches*
18:24 🔗 joepie91 (had to dig through my jason scott stalkings)
18:24 🔗 joepie91 http://bit-chest.com/
18:24 🔗 joepie91 lots of handwavium
18:24 🔗 SketchCow THE GREATEST ELEMENT OF ALL
18:24 🔗 SketchCow Stealing handwavium for speech
18:25 🔗 joepie91 :D
18:25 🔗 joepie91 found it on tvtropes some time ago
18:25 🔗 joepie91 the magic ingredient that makes everything right without explanation
18:25 🔗 joepie91 seems to fit perfectly for these "safe permanent storage" clowns
18:25 🔗 balrog wow, a family history site destroying data?
18:25 🔗 balrog really?
18:25 🔗 balrog talk about irony
18:26 🔗 Asparagir It's really the message boards that need the ArchiveTeam love, so I'll start there. They go back at least 15 years in some cases. For example, 67,000+ posts just for the SMITH family: http://genforum.genealogy.com/smith/
18:26 🔗 exmic these are my people
18:26 🔗 exmic SMITHs
18:27 🔗 joepie91 Asparagir: very 1995 messageboards, that should be easy to archive
18:27 🔗 Asparagir Yeah, I just hope the shutdown timeline isn't too compressed.
18:28 🔗 Asparagir Otherwise, I will have to finally teach myself how to write seesaw scripts and build my own Warrior project. :-)
18:28 🔗 exmic looks like something we could hit with archivebot
18:29 🔗 exmic easy peasy
18:29 🔗 joepie91 exmic: yup, it's big, but it's simply structured
18:29 🔗 joepie91 archivebot should do fine on this one
18:30 🔗 Asparagir Does ArchiveBot have enough space at the moment to take on a project like that? And remember, these are four separate sites, some with separate message board or forum sub-domains.
18:31 🔗 exmic well, genforum at least should be fine
18:31 🔗 exmic I'd guess ten, fifteen gigs before compression
18:41 🔗 SketchCow This is a misuse of archivebot
18:42 🔗 SketchCow It does strike me that archivebot is very, very successful. Maybe we need to make it so its work can be handed over to a larger pool of volunteer machines?
18:43 🔗 SketchCow Machines much less likely to flit in and out like warriors.
18:43 🔗 SketchCow Ultrawarriors, if you will.
18:43 🔗 Asparagir So, a warrior project, then? Or should I grab it myself and do a standard upload to IA to the archiveteam_antecedents collection?
18:43 🔗 Asparagir SPARTANS
18:44 🔗 SketchCow I'm going to say "It's a misuse of archivebot but should be used as a sign we should upgrade archivebot's abilities and flexibility"
18:44 🔗 SketchCow So go ahead
18:44 🔗 SadDM *Ultimatewarriors
18:44 🔗 SketchCow We've done a few other whoppers before.
18:44 🔗 SadDM but those whoppers can really choke things up
18:44 🔗 SketchCow http://www.kayfabenews.com/wp-content/uploads/2014/04/warrior.jpg
18:45 🔗 SketchCow And as discussed, we should use our awareness of this to rethink some of the bot's abilities.
18:45 🔗 SketchCow For example, being able to say "and this is a big one" so it goes into a different torpedo tube
18:45 🔗 joepie91 SketchCow: once RAM requirements are solved (that is, can run on <512MB boxes without swap), I can plug a few boxes into the archivebot architecture
18:45 🔗 joepie91 insofar that helps
18:46 🔗 joepie91 and yes, that would be a useful distinction
18:46 🔗 joepie91 though I'd opt for saying "this is a small one" rather than saying "this is a big one", so that if somebody forgets to specify it won't accidentally block everything
18:47 🔗 SadDM or maybe I mis-judge the size of a site and send a huge job to the smart-car lane
18:47 🔗 SadDM either way, there will be issues
18:48 🔗 SketchCow Well, the whole point of archivebot upon inception was for small sites.
18:49 🔗 SadDM oh yeah, but the definition of "small" has been sliding quite a bit lately
18:49 🔗 SketchCow So maybe having it notice we've gone past a certain limit, be it 1gb of material, or x amount of URLs, and go "uh, this needs to go to the ultimatewarriors"
18:49 🔗 SketchCow My concern is mostly we lose timeliness
18:50 🔗 SketchCow If it takes hours to get to a classic fuckup or craigslist ad, it'll be gone
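The threshold idea SketchCow describes (hand a job off once it passes 1 GB of material or some number of URLs) could be sketched as below. This is not ArchiveBot code: the `JobStats` type, the lane names, and the URL limit of 50,000 are all assumptions for illustration, since the log only says "x amount of URLs".

```python
from dataclasses import dataclass

GB = 10**9  # decimal gigabyte; the "1gb" limit from the discussion


@dataclass
class JobStats:
    """Hypothetical running totals for one crawl job."""
    bytes_downloaded: int
    urls_fetched: int


def pick_lane(stats: JobStats, max_bytes: int = 1 * GB, max_urls: int = 50_000) -> str:
    """Return "large" once either limit is exceeded, otherwise "small".

    max_urls is a placeholder value; the log does not give a number.
    """
    if stats.bytes_downloaded > max_bytes or stats.urls_fetched > max_urls:
        return "large"
    return "small"
```

Measuring the job as it runs, rather than relying on a flag set at submission time, sidesteps the mis-labelling problem joepie91 and SadDM raise, at the cost of needing a way to move a running job between lanes.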
18:55 🔗 yipdw if I had written archivebot using mongodb it'd clearly be webscale
18:55 🔗 joepie91 SketchCow: this is moving into -bs territory, but... I'm not sure of the current archivebot architecture, but is it capable yet of freezing/pausing jobs, sending them over to another box wholesale, and resuming them there?
18:55 🔗 yipdw no
18:55 🔗 SketchCow This isn't -bs territory
18:55 🔗 SketchCow The channel is the bot working and talking about the bot
18:55 🔗 joepie91 lengthy discussion :)
18:56 🔗 joepie91 anyway
18:56 🔗 SketchCow and giving yipdw kudos for the fucking thing
18:56 🔗 yipdw it will at some point, but there are bigger issues to deal with first, like the reporting process fucking itself up periodically
18:56 🔗 SketchCow I asked for a swiss army knife and he made the iron giant
18:56 🔗 joepie91 yipdw: is archivebot still using wget, or does it use wpull now?
18:56 🔗 yipdw wpull
18:56 🔗 joepie91 hmmm
18:56 🔗 joepie91 on all nodes?
18:56 🔗 yipdw yes
18:56 🔗 yipdw there are however memory issues remaining, and I think those have to do with the reporter threads
18:57 🔗 joepie91 I might be able to dick around with it in the near future then, and see if I can duct-tape together a resume function
18:57 🔗 yipdw in MOST cases you will not see a memory blowup
18:57 🔗 joepie91 mm
18:57 🔗 SketchCow Another line of thought for Asparagir's question
18:57 🔗 SketchCow The point of the bot is to make basic things easy and not constantly have to ramp people up on the "right" ways and missing mistakes.
18:57 🔗 joepie91 yipdw: if you had to describe the current architecture of archivebot in a single line, including technologies/architectures used, what would it be? (so I can get a vague idea of what to expect)
18:57 🔗 SketchCow But Asparagir has been in this place forever, she gets what it needs.
18:57 🔗 SketchCow So her doing it the old-fashioned way seems quite legit
18:58 🔗 SketchCow joepie91: 45 cats, a blender and underscor's mom
18:58 🔗 yipdw joepie91: Python, Ruby on the backend, CoffeeScript/Ember.js on the frontend, CouchDB and Redis as datastores
18:58 🔗 SketchCow Oh sure, use the layman's terms
18:59 🔗 yipdw underscor's mom will be present in release 6
18:59 🔗 yipdw joepie91: also the fetch pipeline is seesaw, though a much more complicated seesaw pipeline than any other I am aware of
19:00 🔗 Kenshin what about a beefy box
19:00 🔗 yipdw ivan` runs one
19:00 🔗 Kenshin for archivebot
19:00 🔗 yipdw it's the reason why we're doing 30 concurrent jobs vs. 5
19:00 🔗 yipdw :P
19:00 🔗 joepie91 yipdw: main communication protocol(s) between components?
19:00 🔗 yipdw redis pubsub
19:00 🔗 yipdw :P
19:00 🔗 joepie91 oh dear
19:00 🔗 joepie91 right
19:00 🔗 yipdw it works
19:01 🔗 joepie91 I know Python, I can learn Ruby, I know CoffeeScript, I can learn Ember, I know CouchDB a little bit, I know Redis a little bit, I have nfi how its pubsub works
19:01 🔗 joepie91 not too bad a score
19:01 🔗 yipdw it's pretty easy
19:01 🔗 yipdw I was going to use e.g. ZeroMQ and then went "I do not need that"
19:01 🔗 joepie91 Kenshin: beefy boxes are always useful :P
19:01 🔗 joepie91 yipdw: see, if it were zeromq, I would've known how it worked :D
19:01 🔗 yipdw also, ArchiveBot is very much a product of "get shit online"
19:02 🔗 yipdw the fact that it has done what it is doing surprises the hell out of me
19:02 🔗 Kenshin it is the fastest way to get a small site archived though
19:02 🔗 yipdw I mean, its processes are running in a tmux
19:02 🔗 * nico hides his screen process
19:03 🔗 yipdw in any case, there is plenty to do
19:03 🔗 nico archivebot is also running with a lot of different versions of the pipeline/wpull
19:03 🔗 SketchCow http://i.imgur.com/SfNlIEA.png
19:03 🔗 Kenshin but are we even maxing out archivebot's resources
19:03 🔗 nico Kenshin: no
19:03 🔗 joepie91 yipdw: hehe
19:04 🔗 nico my drone is sleeping
19:04 🔗 SketchCow I think yipdw knows the current maxing or not maxing.
19:04 🔗 Kenshin but if we threw the sites mentioned on it
19:04 🔗 yipdw it's doing 32 jobs right now
19:04 🔗 yipdw I think we can do 35
19:04 🔗 Kenshin it'll probably flip?
19:04 🔗 SketchCow yipdw: Maybe work with exmic to add graphs?
19:04 🔗 yipdw they're already being done
19:04 🔗 yipdw SketchCow: yeah
19:04 🔗 joepie91 yipdw: how receptive are you to (well-tested) architecture changes for archivebot? in case I have a bored weekend
19:04 🔗 nico (doing a grab of http://tcrf.net/ and http://geekbeat.tv/)
19:04 🔗 yipdw joepie91: I'm fine with them if they fix things
19:04 🔗 joepie91 such as not needing tmux? :P
19:04 🔗 yipdw I don't need tmux
19:05 🔗 nico joepie91: put it under supervisord :)
19:05 🔗 yipdw I just haven't written e.g. start scripts or gotten it in daemontools
19:05 🔗 yipdw I can point you to current deficiencies in ArchiveBot
19:06 🔗 yipdw for example, there is currently no way to know what the max load is
19:06 🔗 yipdw there is also no way right now to signal when a job starts
19:06 🔗 yipdw (but we do know when it finishes)
19:06 🔗 yipdw stuff like that
19:06 🔗 joepie91 yipdw: are there issue tickets for this?
19:06 🔗 yipdw I do not think that fixing those requires significant architecture changes
19:06 🔗 yipdw joepie91: yeah
19:06 🔗 SketchCow I can say that archivebot is pulling roughly 200gb a day.
19:06 🔗 joepie91 because then I can just look at those
19:08 🔗 joepie91 yipdw: correct repo link?
19:08 🔗 yipdw https://github.com/ArchiveTeam/ArchiveBot
19:08 🔗 joepie91 thanks, bookmarked
19:08 🔗 joepie91 now brb, just got a bugticket for pythonwhois, work to do :D
19:09 🔗 yipdw the most confusing part of it is probably the cogs program
19:09 🔗 Nemo_bis do we have a page for the genealogy site(s)? need to link
19:09 🔗 Kenshin yipdw: anyway if you reach a point you need resources for archivebot, just ping me
19:09 🔗 yipdw Kenshin: np
19:09 🔗 yipdw thanks
19:17 🔗 Asparagir Nemo_bis: No, not yet -- I got the news through one of my e-mail mailing lists. But I'm sure something will be up soon.
19:18 🔗 Asparagir Nemo_bis: Wait, it looks like one of the four sites, genealogy.com, now has a page with info about the shutdown: http://www.ancestry.com/cs/faq/genealogy-faq
19:18 🔗 SketchCow How did today become Verify All Archiveteam Architecture Day
19:18 🔗 SketchCow Probably the fact that justin tv is a fucking nightmare planet of websuck crashing into the IA building
19:18 🔗 Asparagir Props to Ancestry.com for not deleting the message boards, just putting them into read-only mode.
19:19 🔗 SketchCow Kaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaan
19:19 🔗 Kenshin SketchCow: i guess it's been quiet for a while?
19:19 🔗 Kenshin and suddenly we have justin.tv
19:19 🔗 Kenshin so suddenly everyone is interested in how we're able to cope with projects?
19:19 🔗 SketchCow Yeah
19:19 🔗 SketchCow We'll likely be overengineered as fuck after this.
19:19 🔗 SketchCow Which is good
19:20 🔗 SketchCow For when facebook goes down
19:20 🔗 Asparagir It never rains, but it pours.
19:20 🔗 nico 21:20 @SketchCow> For when facebook goes down
19:21 🔗 nico a nightmare
19:23 🔗 balrog [14:31:37] <@jrra> rip crazynation.org
19:23 🔗 midas strange, we did some small projects not that long ago
19:24 🔗 Asparagir So, on another subject...random thing that I discovered the other day: since when does the Washington Post block the IA bot in their robots.txt? And therefore is not available AT ALL in the Wayback Machine?
19:24 🔗 midas ah fuck, SketchCow, which project was the one we had months to do? i lost the channel
19:24 🔗 Asparagir What assholes.
19:25 🔗 midas or someone who remembers, i think we had to september or something
19:25 🔗 nico Asparagir: it still exists in IA
19:25 🔗 Asparagir Yeah, but not visible to the public.
19:26 🔗 * joepie91 back
19:26 🔗 joepie91 [21:18] <@SketchCow> How did today become Verify All Archiveteam Architecture Day
19:26 🔗 joepie91 has to happen every once in a while
19:27 🔗 Nemo_bis Asparagir: stub at http://archiveteam.org/index.php?title=Ancestry.com
19:27 🔗 joepie91 balrog: crazynation is suspended; might just be a host issue
19:27 🔗 midas ah, #totheyard
19:27 🔗 balrog seems it's been down for some time
19:28 🔗 BiggieJon midas: helium ? mlkshk ? verizon customer pages ?
19:28 🔗 Nemo_bis http://archiveteam.org/index.php?title=Current_Projects#Upcoming_projects
19:28 🔗 Nemo_bis wikis, how wonderful ;)
19:29 🔗 joepie91 "We're pleased to announce that GenForum message boards, Family Tree Maker homepages, and the most popular articles will continue to be available in a read-only format on the Genealogy.com site."
19:29 🔗 joepie91 well that's something, I suppose
19:30 🔗 joepie91 minimum bar of shutdowning almost reached
19:38 🔗 Asparagir Nemo_bis: Thanks. Minor edits made; will update as needed.
19:39 🔗 schbirid https://github.com/FlatRockSoft/Hovertank3D
19:42 🔗 balrog Asparagir: http://familytreemaker.genealogy.com/users/
19:42 🔗 balrog wow, those pages look straight out of 1995
19:42 🔗 Asparagir They are!
19:43 🔗 Asparagir joepie91: True. But am I going to trust 39,000,000+ web pages of family history to those guys' whims? https://www.google.com/search?client=safari&rls=en&q=site:familytreemaker.genealogy.com&ie=UTF-8&oe=UTF-8 Hahahahaha, no.
19:43 🔗 Asparagir Okay, 38.2 million. But still!
19:43 🔗 Asparagir (And SketchCow chastises me in another channel that these numbers are very very very sketchy.)
19:55 🔗 Nemo_bis Put everything on wikidata.org!
19:58 🔗 Nemo_bis then you can make https://toolserver.org/~magnus/ts2/geneawiki/?q=Q508848 (http://ultimategerardm.blogspot.it/2013/08/what-to-do-when-wikipedia-does-not_16.html )
20:01 🔗 exmic re: archivebot discussion, how about a mechanism to cut jobs at 1000 fetches and send the queue back to the tracker/hub for re-issuance
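exmic's proposal (stop a job after 1000 fetches and hand the remaining queue back for re-issuance) might look roughly like this. It is a sketch under stated assumptions, not how ArchiveBot or the tracker works: `fetch` is a hypothetical callable that downloads one URL and returns any newly discovered URLs, and deduplication is left out.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def run_slice(
    queue: Iterable[str],
    fetch: Callable[[str], Iterable[str]],
    max_fetches: int = 1000,
) -> Tuple[List[str], List[str]]:
    """Process at most max_fetches URLs from queue.

    Returns (done, remaining): the URLs fetched in this slice and the
    queue that would be sent back to the tracker/hub for re-issuance.
    """
    pending = deque(queue)
    done: List[str] = []
    while pending and len(done) < max_fetches:
        url = pending.popleft()
        discovered = fetch(url)     # hypothetical: returns newly found URLs
        done.append(url)
        pending.extend(discovered)  # no deduplication in this sketch
    return done, list(pending)
```

Whoever receives `remaining` can pass it to `run_slice` again on any machine, which is what would let a long crawl move between workers in fixed-size pieces.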
20:06 🔗 Famicoman Any plans to put blip.tv back in the warrior?
20:18 🔗 SketchCow -------------------------------------
20:18 🔗 SketchCow IMPORTANT NEWS
20:18 🔗 SketchCow archive.org now shows .iso files in file listing links in items
20:18 🔗 SketchCow okay not so important
20:18 🔗 SketchCow -------------------------------------
20:18 🔗 SketchCow But Brewster was cranky it was the old ways.
20:19 🔗 exmic I don't see a change, can you explain in more detail?
20:27 🔗 joepie91 SketchCow: does it now also link to the ISO browsing feature?
20:27 🔗 * joepie91 has been waiting for that to be a UI thing for a whil
20:27 🔗 joepie91 while *
20:30 🔗 SketchCow No!
20:31 🔗 SketchCow Good point!
20:34 🔗 Nemo_bis SketchCow: I like it :) it's like a Kahle stamp of approval on all the ripping
20:38 🔗 SketchCow Kahle's happy we've brought so much to bear
20:38 🔗 SketchCow We've caught up for a decade of neglect
20:38 🔗 SketchCow handily
20:44 🔗 underscor RELEASE 6
20:45 🔗 yipdw I'll add a changelog to archivebot and just arbitrarily start at RELEASE 6
20:46 🔗 exmic sounds good, ship it
20:46 🔗 underscor :D
20:57 🔗 joepie91 SketchCow: I stalked you in PM, btw
20:58 🔗 joepie91 oh, hai underscor, it's been a while
23:50 🔗 Asparagir Does anyone remember who did the ValleyWag crawl in ArchiveBot a few months ago? And why? Just curious...
23:54 🔗 Asparagir Nevermind, got it.
23:54 🔗 Asparagir No one did a full crawl; it was just a few key articles grabbed.
23:55 🔗 DFJustin looks like ivan` did
23:55 🔗 DFJustin #archivebot.EFnet.20131207.log:[19:44:10] <ivan`> !a http://valleywag.gawker.com/
23:55 🔗 Asparagir No, the ones I saw were articles about Brendan Eich and related news. No full crawls yet.
23:56 🔗 Asparagir And were initiated by yipdw .
23:56 🔗 Asparagir Point being, I am living in San Francisco (well, almost) during a latter-day Gilded Age and I want the stories about this place preserved, robots.txt or no.
23:57 🔗 Asparagir I think this might be a good test project for me to break out the wpull + phantomjs, instead of wget.
23:58 🔗 DFJustin http://archivebot.at.ninjawedding.org:4567/#/histories/http://valleywag.gawker.com/
23:59 🔗 Asparagir Six months ago, half these crazy startups didn't even exist yet. :-P
