[00:03] SketchCow: Could I get this moved to web pretty please https://archive.org/details/www.tswf.com.au-2016-01-29-4b9f1792-00000.warc.gz
[00:05] Done
[00:06] Thanks!
[00:09] *** Start has joined #archiveteam-bs
[00:27] *** ndizzle has joined #archiveteam-bs
[00:28] *** Smiley has joined #archiveteam-bs
[00:29] *** SketchCo1 has joined #archiveteam-bs
[00:29] *** swebb sets mode: +o SketchCo1
[00:29] *** RichardG has joined #archiveteam-bs
[00:30] *** Start has quit IRC (Read error: Connection reset by peer)
[00:30] *** SketchCow has quit IRC (Read error: Connection reset by peer)
[00:30] *** Start has joined #archiveteam-bs
[00:30] *** RichardG_ has quit IRC (Ping timeout: 249 seconds)
[00:30] *** jk[SVP] has quit IRC (Ping timeout: 249 seconds)
[00:30] *** dxrt- has quit IRC (Ping timeout: 260 seconds)
[00:30] *** sigkell has quit IRC (Ping timeout: 260 seconds)
[00:30] *** SketchCo1 is now known as SketchCow
[00:31] *** jk[SVP] has joined #archiveteam-bs
[00:32] *** SmileyG has quit IRC (Ping timeout: 260 seconds)
[00:32] *** ndiddy has joined #archiveteam-bs
[00:33] *** bauruine has quit IRC (Ping timeout: 260 seconds)
[00:33] *** joepie91 has quit IRC (Ping timeout: 260 seconds)
[00:33] *** bauruine has joined #archiveteam-bs
[00:34] *** sigkell has joined #archiveteam-bs
[00:35] *** joepie91 has joined #archiveteam-bs
[00:39] *** xXx_ndidd has quit IRC (Read error: Operation timed out)
[00:40] *** xXx_ndidd has joined #archiveteam-bs
[00:43] *** ndizzle has quit IRC (Read error: Operation timed out)
[00:48] *** beardicus has quit IRC (Read error: Operation timed out)
[00:48] *** xXx_ndidd has quit IRC (Ping timeout: 492 seconds)
[00:50] *** botpie91 has quit IRC (Read error: Operation timed out)
[00:51] *** w0rp has quit IRC (Ping timeout: 252 seconds)
[00:52] *** ndiddy has quit IRC (Read error: Operation timed out)
[00:55] *** w0rp has joined #archiveteam-bs
[01:00] *** closure has quit IRC (Ping timeout: 633 seconds)
[01:05] You moſt certainly can uſe a long S at the beginning of a word; it was only preſcribed at the end ther.
[01:05] Thereof, I mean. Damn but my client doesn't like unicode, lol.
[01:05] proscribed, perhaps
[01:06] Anyway, the initial long S gives us the humorously-mispronounced ſuccor :)
[01:06] :)
[01:10] *** beardicus has joined #archiveteam-bs
[01:10] *** botpie91 has joined #archiveteam-bs
[01:11] *** closure has joined #archiveteam-bs
[01:13] god tumblr's notes are worthless
[01:14] a separate line for every single repost with basically 0 information content
[01:14] and if someone adds words they get clipped at like 80 characters
[01:14] hahaha
[01:15] wow
[01:15] or 40?
[01:15] not many
[01:15] 477681tidwyxkioafgeg1gnk1 looks to be mired in note scrolling
[01:15] http://mulder-are-you-suggesting.tumblr.com/notes/91675990871/aJgM9HqqN?from_c=1445509362
[01:15] etc
[01:15] then add notumblrnotes
[01:15] yeah.
[01:15] !igset 477681tidwyxkioafgeg1gnk1 notumblrnotes
[01:15] er oops
[01:15] w/e
[01:15] lol
[01:16] though the notes link to other posts that reply to it
[01:16] so
[01:16] * xmc shrug
[01:20] as much as one can meaningfully reply in tumblr
[01:20] but nobody does
[01:25] *** JW_work has joined #archiveteam-bs
[01:25] Anyone is always welcome to add notumblrnotes to any tumblr jobs that I (JesseW) started.
[01:26] I don't like to put it on initially, as it's fine to get the notes for posts with only a hundred or so — but it's a waste of time to get all the notes for ones with tens of thousands
[01:26] I like to see notumblrnotes as a last resort
[01:26] grabbing all the notes should be feasible and it would be if not for technical problems
[01:27] fair point
[01:27] I just have a plenty big list of tumblrs I'm interested in archiving, and don't want to overwhelm archivebot
[01:27] grab-site might work better honestly
[01:27] it seems like the crawling algorithm will tend to get all the reasonable things more toward the beginning with the billion-long notes forming a long tail
[01:27] grab site won't get them into the wayback machine
[01:28] Not with that kind of an attitude it won't...
[01:28] :-P
[01:28] so policing them and adding notumblrnotes once they have only a few things left to fetch is probably a good strategy
[01:28] if it's at-risk my impulse is to just add notumblrnotes straight up
[01:28] since tumblrs can be long enough without
[01:28] also those avatars...
[01:29] the ones I add generally aren't at risk — just containing valuable material (or at least one valuable post someone linked to)
[01:29] yipdw: yo check out job 7me1xyijiosr1fg3a8vhb2oep
[01:30] it's having wild trouble RsyncUploading
[01:30] in any case, this should probably go in #archivebot
[01:30] yeah
[01:30] I dunno what's wrong with that one
[01:31] well 23 is "partial transfer due to error" so usually something like full disk on the remote end
[01:31] (unlikely...)
[01:31] For stuff that's really at risk, there's a tumblr archiving script on github I use sometimes. Can snag *the contents of* a thousand-post site in a minute or two. No WARC, sadly, but as an option of last resort, it's okay...
[01:32] it's weird
[01:33] for this tumblr I care about there are only ~16000 items in the queue so it should be fine
[01:33] I'm surprised it's up this long
[01:34] snape: please add a link to that to the wiki
[01:34] Almost certain that's where I found it. Lemme go check...
[01:34] I kind of wish one could add a filter like an ignore but "hang onto these URIs but get everything else before bothering with them"
[01:35] Yeah, it's on the wiki already - https://github.com/bbolli/tumblr-utils/
[01:36] to be used with, e.g., ^http.*\.media\.tumblr\.com/
[01:36] ah, good
[01:36] Yeah, it'd be nice if there was a !defer regex thing to push stuff to the back of the queue for a job...
[01:37] wpull goes breadth-first so you usually get that behavior anyway
[01:44] with tumblr I see it pull, by breadth, 50-100 profile thumbnails, then one post, then another 50-100 profile thumbnails from that post...
[02:01] *** plog99 has joined #archiveteam-bs
[02:23] *** JesseW has joined #archiveteam-bs
[02:34] *** ndiddy has joined #archiveteam-bs
[02:55] *** plog99 has quit IRC (Ping timeout: 250 seconds)
[02:57] *** Chorca has quit IRC (Ping timeout: 252 seconds)
[02:58] *** plog99 has joined #archiveteam-bs
[02:59] *** Chorca has joined #archiveteam-bs
[04:09] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[04:27] chfoo -- If you were willing/able to run a copy of ilya's https://github.com/ikreymer/pywb-ia on http://archive.fart.website -- I could make a PR to add links from pages in the archivebot viewer (e.g. http://archive.fart.website/archivebot/viewer/job/7yy80 ) to it, so people could verify that archivebot jobs worked out correctly:
[04:30] that's a lot of disk space
[04:31] AFAIK pywb requires the WARCs be local
[04:31] has that changed?
[04:31] *** dxrt- has joined #archiveteam-bs
[04:31] pywb-ia's /item/ method *doesn't* require the WARCs to be local.
[04:32] https://github.com/ikreymer/pywb-ia#single-item-replay-item
[04:32] "This will download the item .idx file locally on first use, and access the .cdx.gz and WARC remotely"
[04:32] yipdw:
[04:32] oh ok
[04:44] https://twitter.com/isotrumpp is an amazing twitter account and I can't explain why
[04:44] there's something about C++ with Trump that makes me laugh
[04:44] *** bwn has quit IRC (Read error: Operation timed out)
[04:47] If someone (Sketchcow?) gets the chance, I'd love it if https://archive.org/details/29JanWARCs and https://archive.org/details/miscwarcs1 were set to "web" for eventual ingestion into the wayback machine
[05:20] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:24] *** Chorca has quit IRC (Quit: leaving)
[05:31] *** Sk1d has joined #archiveteam-bs
[05:35] *** Sk1d has quit IRC (Ping timeout: 200 seconds)
[05:35] *** Sk1d has joined #archiveteam-bs
[06:16] JesseW: i'm not sure what you mean exactly
[06:18] Let me try to clarify.
[06:18] In the viewer (running on your server) each job has a page like this: http://archive.fart.website/archivebot/viewer/job/7yy80
[06:21] If you (or someone else) ran a copy of pywb-ia, it could show the contents of the WARC for that job via a URL like this: http://archive.fart.website/archivebot/pywb-ia/item/archiveteam_archivebot_go_20150810080001/*/http://www.echoschildren.org/
[06:22] with the archiveteam_archivebot_go_20150810080001 part populated from the item name and the http://www.echoschildren.org/ part populated from the job initial URL (both already displayed on the job page).
[06:22] It would only download the CDX idx files locally (which are generally small) and load the rest on demand from IA.
[06:23] oh, i understand now
[06:23] cool
[06:24] if you didn't want to run a copy yourself, someone could still write a bookmarklet to add links from the viewer to a copy running locally. I may do that.
[06:24] yeah, i might be able to set that up
[06:25] awesome!
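The link construction described at [06:21] and [06:22] can be sketched roughly as follows. This is only an illustration under assumptions: the base URL is the proposed (not confirmed) location of a pywb-ia copy, and `replay_url` is a hypothetical helper name, not part of pywb-ia or the archivebot viewer. The `/item/<item name>/*/<initial URL>` layout is taken from the example URL in the discussion.

```python
from urllib.parse import quote

# Assumption: where a pywb-ia copy *might* be mounted, per the proposal above.
# No such deployment is confirmed by the log.
PYWB_IA_BASE = "http://archive.fart.website/archivebot/pywb-ia"


def replay_url(item_name, start_url, base=PYWB_IA_BASE):
    """Build <base>/item/<item_name>/*/<start_url> (hypothetical helper).

    item_name is the archive.org item holding the job's WARC; start_url is
    the job's initial URL. Both are already shown on the viewer's job page.
    """
    # Percent-encode the item name so it stays a single path segment.
    item = quote(item_name.strip(), safe="")
    # The start URL is appended unmodified, as in the example in the log.
    return f"{base.rstrip('/')}/item/{item}/*/{start_url.strip()}"


if __name__ == "__main__":
    print(replay_url("archiveteam_archivebot_go_20150810080001",
                     "http://www.echoschildren.org/"))
```

A bookmarklet or viewer patch would do the same string assembly in the page itself; pointing `base` at a locally running copy covers the "copy running locally" case mentioned at [06:24].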
[06:26] The only problem I had setting it up was that on debian wheezy I had a slightly too old version of greenlet that wasn't required by the requirements.txt -- otherwise it installed like a charm.
[06:29] Michael Karpeles (mek) from IA (ArchiveLab) is also glad to set up a copy of pywb-ia on archivelab.org -- so in that case there will be two. :-)
[06:41] *** Spring has joined #archiveteam-bs
[06:42] whoever scanned this mustn't have been looking at what they were doing
[06:42] https://archive.org/stream/Another_Journal_by_Pett_02_Rylee/Another%20Journal%20by%20Pett%20%2802%29%20%28Rylee%29#page/n1/mode/2up
[06:43] slight 1940s-ish nudity warning. All the pages are cut off, it's odd.
[06:44] edit, I tell a lie! It was only the archive's viewer that was cropping them
[06:44] now I can see all the lettering (which is what I was there for believe it or not)
[06:46] Spring: yes, that's a wart in the bookreader -- probably worth opening a bug for it if there isn't one already (IDK where the bugreporter is, though, which is why I haven't done so myself yet)
[06:47] why were you looking for the locations of a 1940s cheesecake publisher?
[06:47] :p
[06:47] for the lettering examples
[06:48] that is, the hand-drawn headings
[06:50] Ah, "Pett's Corner" on that page? That is nice lettering, yeah.
[06:53] *** snape has quit IRC (ircII EPIC4-2.10.5 -- Are we there yet?)
[07:09] *** JesseW has quit IRC (Ping timeout: 252 seconds)
[07:36] *** robink has joined #archiveteam-bs
[08:24] *** schbirid has joined #archiveteam-bs
[08:44] *** bwn has joined #archiveteam-bs
[08:45] *** metalcamp has joined #archiveteam-bs
[10:27] *** metalcamp has quit IRC (Ping timeout: 252 seconds)
[10:34] *** vitzli has joined #archiveteam-bs
[10:49] *** metalcamp has joined #archiveteam-bs
[11:07] *** Spring has quit IRC (Ping timeout: 362 seconds)
[11:13] *** Spring has joined #archiveteam-bs
[11:26] *** metalcamp has quit IRC (Ping timeout: 252 seconds)
[11:37] *** ohhdemgir has joined #archiveteam-bs
[11:51] *** Spring has quit IRC (Ping timeout: 362 seconds)
[11:52] *** Spring has joined #archiveteam-bs
[12:15] *** Spring has quit IRC (Ping timeout: 362 seconds)
[12:15] *** Spring has joined #archiveteam-bs
[12:21] *** Spring has quit IRC (Quit: Leaving)
[13:56] *** RichardG has quit IRC (Ping timeout: 633 seconds)
[15:03] *** Start has quit IRC (Quit: Disconnected.)
[15:31] *** mismatch_ has quit IRC (Ping timeout: 633 seconds)
[15:42] *** SketchCo1 has joined #archiveteam-bs
[15:42] *** swebb sets mode: +o SketchCo1
[15:44] *** SketchCow has quit IRC (Read error: Operation timed out)
[15:53] *** Start has joined #archiveteam-bs
[15:55] *** ndiddy has quit IRC (Read error: Operation timed out)
[16:11] *** godane has quit IRC (Ping timeout: 252 seconds)
[16:21] *** mismatch_ has joined #archiveteam-bs
[16:25] *** godane has joined #archiveteam-bs
[16:28] *** RichardG has joined #archiveteam-bs
[16:37] *** godane has quit IRC (Quit: Leaving.)
[16:40] *** metalcamp has joined #archiveteam-bs
[16:42] So many new tv channels are being recorded by IA!
[16:43] *sigh*, another day another SSL/TLS vulnerability: https://www.drownattack.com/
[16:45] *** godane has joined #archiveteam-bs
[16:46] sslv2 vuln
[16:46] that's been on the shit list for, what, 2 years now?
[16:47] SSLv2 was known to be broken even in the 90s
[16:47] no one that is keeping up to date should be affected, and the people that haven't were already wildly vulnerable
[16:47] What's new is that this is an attack that uses specially crafted SSLv2 connection attempts to figure out the key
[16:47] we were discussing it in the office this morning
[16:47] Even for clients using TLS
[16:48] OK, but who that isn't asleep at the wheel hasn't had sslv2 globally disabled for years now??
[16:49] It seems like a non issue
[16:49] "house that is on fire now flooded"
[16:49] a non-news story
[16:49] https://www.drownattack.com/#faq-ssllabs
[16:50] Basically, if the key is used on any SSLv2 server (including mail servers which may be overlooked when updating TLS configuration) you are vulnerable
[16:51] And there was a recently-discovered OpenSSL bug where it would accept SSLv2 connections even if all SSLv2 ciphersuites were disabled
[17:05] *** JesseW has joined #archiveteam-bs
[17:06] *** Start has quit IRC (Quit: Disconnected.)
[17:16] *** JesseW has quit IRC (Quit: Leaving.)
[17:17] Just sent mail to IA about recording the BVN TV channel
[17:18] Where is the list of channels they record?
[17:19] There's no list
[17:20] or every channel there's a collection
[17:20] https://archive.org/details/tvarchive?and%5B%5D=mediatype%3A%22collection%22&sort=-publicdate
[17:20] That are all collections in tvarchive
[17:20] Ah, thanks
[17:20] scroll down and you'll see a lot of TV channels that started recording this year
[17:20] around january 2/7/14
[17:24] "january 2/7/14" ??? what does that mean
[17:25] well, scroll down and you'll see the newly recorded channels on january 2, 7 and 14
[17:26] ah
[17:26] it looked like you were saying 2/7/14 as a date, which was very confusing
[17:26] I see
[17:26] should have written it differently
[17:27] * xmc shrug
[18:03] *** vitzli has quit IRC (Leaving)
[18:04] I like the reasonable archive team members.
[18:05] *** SketchCo1 is now known as SketchCow
[18:05] Today, I am so blisteringly angry again. Might be the diet.
[18:05] *** tomwsmf-a has joined #archiveteam-bs
[18:05] Might be someone pissed me off (not here)
[18:06] dieting is a bitch
[18:38] *** Sk1d has quit IRC (hub.se efnet.portlane.se)
[18:38] *** altlabel has quit IRC (hub.se efnet.portlane.se)
[18:38] *** Fletcher_ has quit IRC (hub.se efnet.portlane.se)
[18:39] *** Sk2d has joined #archiveteam-bs
[18:41] *** Start has joined #archiveteam-bs
[18:53] *** Sk2d is now known as Sk1d
[19:02] *** bsmith093 has quit IRC (Ping timeout: 190 seconds)
[19:10] *** Start has quit IRC (Quit: Disconnected.)
[19:21] *** Start has joined #archiveteam-bs
[19:27] *** altlabel has joined #archiveteam-bs
[19:29] *** bwn has quit IRC (Read error: Operation timed out)
[19:43] *** Sk1d has quit IRC (Killed (hub.se (Nick collision (new))))
[19:43] *** Sk1d has joined #archiveteam-bs
[19:51] *** bwn has joined #archiveteam-bs
[20:36] *** Boppen has joined #archiveteam-bs
[20:45] *** Start has quit IRC (Quit: Disconnected.)
[20:56] *** ndiddy has joined #archiveteam-bs
[21:02] *** robink has quit IRC (Ping timeout: 260 seconds)
[21:14] *** Boppen has quit IRC (hub.se irc.du.se)
[21:14] *** robink has joined #archiveteam-bs
[21:21] *** metalcamp has quit IRC (Ping timeout: 252 seconds)
[21:39] *** schbirid has quit IRC (Quit: Leaving)
[21:59] SketchCow: 2003 pages of telegraph.co.uk are archived
[21:59] mostly anyways
[22:00] looks like telegraph gives out random /404/404.html redirect pages
[22:00] so i will have to go through that and give you a retry on them
[22:06] *** Stilett0 has joined #archiveteam-bs
[22:06] *** BnA-Rob1n has joined #archiveteam-bs
[22:06] *** w0rp has quit IRC (Ping timeout: 252 seconds)
[22:06] *** dashcloud has quit IRC (Ping timeout: 252 seconds)
[22:06] *** Lord_Nigh has quit IRC (Ping timeout: 252 seconds)
[22:06] *** dashcloud has joined #archiveteam-bs
[22:07] *** Stiletto has quit IRC (Ping timeout: 252 seconds)
[22:08] *** Lord_Nigh has joined #archiveteam-bs
[22:09] *** w0rp has joined #archiveteam-bs
[22:11] *** BnA-Rob1n has quit IRC (Quit: Bye!)
[22:12] *** BnA-Rob1n has joined #archiveteam-bs
[22:18] *** acridAxid has quit IRC (marauder)
[22:19] *** acridAxid has joined #archiveteam-bs
[22:22] *** winr4r has quit IRC (Ping timeout: 252 seconds)
[22:22] *** MrRadar has quit IRC (Ping timeout: 252 seconds)
[22:22] *** Baljem_ has quit IRC (Ping timeout: 252 seconds)
[22:22] *** Baljem has joined #archiveteam-bs
[22:22] *** Simpbrai_ has quit IRC (Ping timeout: 252 seconds)
[22:22] *** Fusl has quit IRC (Ping timeout: 252 seconds)
[22:22] *** Simpbrai_ has joined #archiveteam-bs
[22:23] *** espes__ has quit IRC (Ping timeout: 252 seconds)
[22:24] *** Zebranky has quit IRC (Ping timeout: 252 seconds)
[22:24] *** Zebranky has joined #archiveteam-bs
[22:24] *** MrRadar has joined #archiveteam-bs
[22:24] *** SketchCow has quit IRC (Ping timeout: 252 seconds)
[22:24] *** BnA-Robin has quit IRC (Ping timeout: 378 seconds)
[22:24] *** BnA-Robin has joined #archiveteam-bs
[22:25] *** SketchCow has joined #archiveteam-bs
[22:25] *** swebb sets mode: +o SketchCow
[22:29] *** winr4r has joined #archiveteam-bs
[22:30] *** will has quit IRC (hub.dk irc.underworld.no)
[22:30] *** useretail has quit IRC (hub.dk irc.underworld.no)
[22:30] *** Rye has quit IRC (hub.dk irc.underworld.no)
[22:30] *** is- has quit IRC (hub.dk irc.underworld.no)
[22:30] *** ersi has quit IRC (hub.dk irc.underworld.no)
[22:35] *** is-_ has joined #archiveteam-bs
[22:36] *** espes__ has joined #archiveteam-bs
[22:38] *** Fusl has joined #archiveteam-bs
[22:39] *** Stilett0 is now known as Stiletto
[22:42] *** ersi_ has joined #archiveteam-bs
[22:48] *** will has joined #archiveteam-bs
[22:57] *** robink has quit IRC (Read error: Operation timed out)
[23:08] Would it be possible to give FOS a bit of a kick? most of the AB pipelines are very backed up with WARCs due to slow rsync - and running out of disk space.. things will start to get a bit funky soon.
[23:32] yipdw: did you find what you needed to track that corruption bug on Intel?
[23:33] *** Start has joined #archiveteam-bs