[00:00] *** Stiletto has quit IRC (Read error: Operation timed out) [00:14] I wonder if some of that bleeding data made its way into the Wayback Machine... [00:14] (Or into other Archive Team dumps) [00:16] *** Stiletto has joined #archiveteam-bs [00:20] TobiX: unfortunately, likely so [00:42] *** icedice has quit IRC (Ping timeout: 245 seconds) [00:46] Yay! The guy whose $10k package of PAL SNES games was lost finally received them: https://byuu.org/emulation/preservation/found-package/ [00:50] CancerFlare [00:52] *** spiko has quit IRC (Read error: Operation timed out) [01:00] *** Ravenlow has joined #archiveteam-bs [01:00] http://collections.museumvictoria.com.au/items/1223859 [01:10] MrRadar: I guess we gained 100 SNES games, and lost millions of bits of PII [01:10] lol [01:11] and a hash function [01:13] *** Nyx has quit IRC (Ping timeout: 260 seconds) [01:14] joepie91: centralization amirite [01:15] I had occasionally thought about how so much of the Web seems to be behind cloud flare and what the implications could be.. [01:16] yes, someone named joepie91 wrote a few essays about that i think [01:16] *** Nyx has joined #archiveteam-bs [01:17] oh really? 
:p I was just recalling some related conversations about cloudflare in here [01:17] lol [01:18] that was more about the so called "browser check" however [01:18] one of those essays was an article: http://cryto.net/~joepie91/blog/2016/07/14/cloudflare-we-have-a-problem/ [01:18] the others were IRC rants I think [01:18] oh, and a few comments in various places [01:19] Lol the TPB thing, that's hilarious [01:56] *** brayden has joined #archiveteam-bs [01:56] *** swebb sets mode: +o brayden [02:02] *** Ctrl-S___ has quit IRC (Ping timeout: 260 seconds) [02:11] *** username1 has joined #archiveteam-bs [02:15] *** schbirid2 has quit IRC (Read error: Operation timed out) [02:37] *** Stiletto has quit IRC (Read error: Operation timed out) [02:40] *** pizzaiolo has left [02:54] *** brayden has quit IRC (Read error: Operation timed out) [02:56] fun project: check recent archiveteam WARCs for sites behind cloudflare to see if they contain cloudbleed data [02:56] *** brayden has joined #archiveteam-bs [02:56] *** swebb sets mode: +o brayden [02:56] I assume WARCs would include all headers etc and so archive it [02:58] yes [02:58] *** _desu___ has quit IRC (Ping timeout: 260 seconds) [02:58] *** HCross2 has quit IRC (Ping timeout: 260 seconds) [02:58] damn [02:59] just read about this... [02:59] IA has probably archived a lot of private data there [02:59] does anyone know if the cloudflare "reverse proxy" that runs on the user's web servers is open source? [03:00] *** voltagex has quit IRC (Ping timeout: 260 seconds) [03:00] *** sigkell has quit IRC (Ping timeout: 260 seconds) [03:00] *** jiphex has quit IRC (Ping timeout: 260 seconds) [03:04] CF-Host-Origin-IP: followed by eg, "authorization" or "password" seems to be the header to look for [03:06] [03:59] does anyone know if the cloudflare "reverse proxy" that runs on the user's web servers is open source? 
[03:06] there is no such software [03:06] the reverse proxy runs on cloudflare servers [03:06] not user's servers [03:06] and no, it's not open-source, but it's a modified nginx [03:08] huh, I thought cloudflare was using their own protocol between user servers and their servers [03:08] seem to remember blog posts by them about it [03:09] *** brayden has quit IRC (Read error: Operation timed out) [03:12] *** Stiletto has joined #archiveteam-bs [03:12] hmm. always been http/s afaik [03:13] perhaps it was a in-cloud thing [03:14] closure: maybe they upgraded to SPDY and posted about it? [03:14] closure: you're probably thinking of their origin CA thing [03:14] *** _desu___ has joined #archiveteam-bs [03:15] *** HCross2 has joined #archiveteam-bs [03:15] but this doesn't involve custom software [03:15] alternatively, mod_cloudflare or whatever it's called, for the IP handling [03:15] but no, CF <-> origin comms are always HTTP(S) [03:15] just read the google report [03:15] pretty bad [03:15] arkiver: s/pretty/very/ [03:15] considerably worse than heartbleed [03:16] extremely* [03:16] completely unknown impact [03:16] yes [03:16] possibly the biggest breach in recent history [03:16] afaik NSA has been at least keeping headers of requests and responses [03:16] the NSA is the least of your worries here [03:16] :P [03:16] worry about the shady people running scrapers [03:16] lol [03:16] blackhat SEO people and such [03:16] who have scrape logs going back years [03:16] yes [03:17] and can arbitrarily fish out PII [03:17] at least the NSA won't drain your CC :) [03:17] seems less worse than heartbleed; doesn't affect my servers for example :) [03:17] let's have a good look at our WARCs [03:17] see what's in there [03:17] closure: it doesn't matter whether it affects your servers or not [03:17] not planning on editing though [03:17] closure: everybody using the web is affected [03:17] "worry about the shady people running scrapers" -- new archiveteam motto [03:17] haha 
[03:17] closure: any data you've ever sent to cloudflare in any way is potentially leaked [03:18] which, given how big cloudflare is and how you don't know whether sites have used CF in the past, basically means [03:18] "everything you've ever sent to a website is potentially leaked" [03:18] ug [03:19] I'm off to bed [03:19] :/ [03:19] dreaming about cloudbleed [03:19] nightmares [03:19] good night/day all [03:19] heh [03:19] night [03:56] who called it cloudbleed and not Red Rain [03:56] COME ON [04:05] How did this go on for months and nobody noticed? [04:05] (How long was it happening for again?) [04:09] Yeah wow, the archiveteam dumps must have a treasure trove of crap in them. [04:10] inb4 everything gets darker [04:10] darked* [04:11] Darked? [04:11] I can't even imagine the scale of the effort it would take to scrub all the pages saved in the IA [04:12] namespace: it means disabling the viewing of an item on IA [04:12] MrRadar: it would be impossible [04:22] *** Ctrl-S___ has joined #archiveteam-bs [04:34] *** Lord_Nigh has quit IRC (Read error: Operation timed out) [04:35] so i'm uploading more NPR Talk of the Nation [04:35] good news is we have everything from before 2007-12-31 [04:36] i just have to grab at least another ~5 and half years more [04:36] before its complete [04:37] i'm also close to having medium.com urls from 2016-09 [04:38] just for scale the 2016-09-26 dump has 17883 urls in it [04:39] *** ndiddy has quit IRC (Read error: Connection reset by peer) [04:39] only 17797 urls were downloading [04:40] so they maybe not delete as much as before [04:41] *** Ctrl-S___ has quit IRC (Ping timeout: 260 seconds) [04:42] *** _desu___ has quit IRC (Ping timeout: 260 seconds) [04:44] *** Lord_Nigh has joined #archiveteam-bs [04:53] *** icedice has joined #archiveteam-bs [04:53] *** icedice has quit IRC (Remote host closed the connection) [05:02] *** wp494 has quit IRC (Ping timeout: 244 seconds) [05:10] *** brayden has joined #archiveteam-bs [05:10] *** 
swebb sets mode: +o brayden [05:12] *** Sk1d has joined #archiveteam-bs [05:24] *** jiphex has joined #archiveteam-bs [05:35] *** DopefishJ has joined #archiveteam-bs [05:35] *** swebb sets mode: +o DopefishJ [05:37] *** DFJustin has quit IRC (Ping timeout: 260 seconds) [05:41] *** BlueMaxim has quit IRC (Quit: Leaving) [05:45] *** brayden_ has joined #archiveteam-bs [05:45] *** swebb sets mode: +o brayden_ [05:50] *** brayden has quit IRC (Read error: Operation timed out) [05:52] *** Stiletto has quit IRC (Read error: Operation timed out) [05:55] *** Stiletto has joined #archiveteam-bs [06:12] *** Lord_Nigh has quit IRC (Ping timeout: 250 seconds) [06:17] *** Lord_Nigh has joined #archiveteam-bs [06:23] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds) [06:28] *** Lord_Nigh has joined #archiveteam-bs [06:43] So. Who wants to bring up the awkward fact we probably have passwords, oauth tokens, and secrets in our warc files? I am honestly scared to grep right now. [06:43] *** DopefishJ is now known as DFJustin [06:45] bleh [06:45] There's nothing to be done [06:46] It is not as though *only* we would. [06:46] Practically any crawling operation, or indeed *random people's browser caches*, would have the same. [06:53] *** _desu___ has joined #archiveteam-bs [06:53] *** Ctrl-S___ has joined #archiveteam-bs [07:09] *** fie__ has quit IRC (Ping timeout: 633 seconds) [07:15] *** Oddy has joined #archiveteam-bs [07:33] *** BlueMaxim has joined #archiveteam-bs [07:34] *** fie has joined #archiveteam-bs [07:39] *** fie has quit IRC (Read error: Connection reset by peer) [07:54] *** fie has joined #archiveteam-bs [08:12] "Change all your passwords" [08:12] oh yes thanks [08:13] *** phuzion has quit IRC (Read error: Operation timed out) [08:17] how to grep for them? 
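(Editor's sketch, not something posted in the channel.) The question above, how to search the WARCs, can be approached with a plain byte search. The Python below is a minimal sketch, assuming the `CF-Host-Origin-IP` string mentioned at [03:04] is a usable marker for leaked memory; that is an assumption taken from the discussion, not a confirmed signature. The names `find_marker` and `scan_warc` are hypothetical, and nothing here edits a WARC: it only reports offsets and the surrounding bytes.

```python
import gzip
from pathlib import Path

# Marker string taken from the channel discussion. Treating it as a reliable
# indicator of leaked memory is an assumption, not a confirmed signature.
MARKER = b"CF-Host-Origin-IP"


def find_marker(data: bytes, marker: bytes = MARKER, context: int = 40):
    """Return (offset, surrounding bytes) for every occurrence of marker."""
    hits = []
    start = data.find(marker)
    while start != -1:
        lo = max(0, start - context)
        hi = min(len(data), start + len(marker) + context)
        hits.append((start, data[lo:hi]))
        start = data.find(marker, start + 1)
    return hits


def scan_warc(path: Path, marker: bytes = MARKER):
    """Read a .warc or .warc.gz file fully into memory and search it.

    Hypothetical helper: acceptable for small files only. Large WARCs
    would need a streaming reader instead of a single read().
    """
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as fh:
        return find_marker(fh.read(), marker)
```

A hit only says the marker bytes are present in a record; whether the surrounding bytes are actually someone else's request data still needs a human look.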
[08:17] *** username1 is now known as schbirid [08:18] *** phuzion has joined #archiveteam-bs [08:18] *** brayden_ has quit IRC (Read error: Operation timed out) [08:21] *** brayden has joined #archiveteam-bs [08:21] *** swebb sets mode: +o brayden [08:22] schbirid: grep for '{"scheme":"http"} CF-Host-Origin-IP' [08:30] www.cloudflarestagingformobilereddit.com oO :D [08:41] *** phuzion has quit IRC (Read error: Operation timed out) [08:50] *** brayden_ has joined #archiveteam-bs [08:50] *** swebb sets mode: +o brayden_ [08:51] i've found some ArchiveBot grabs done in February 2017 of https websites fronted by CloudFlare [08:51] let's grep [08:54] *** phuzion has joined #archiveteam-bs [08:55] *** brayden has quit IRC (Read error: Operation timed out) [09:00] *** brayden_ has quit IRC (Read error: Operation timed out) [09:10] Oh god. [09:19] *** Jonison has joined #archiveteam-bs [09:30] *** odemg has quit IRC (Remote host closed the connection) [09:50] *** GE has joined #archiveteam-bs [10:03] *** odemg has joined #archiveteam-bs [10:20] *** odemg has quit IRC (Remote host closed the connection) [10:23] *** brayden has joined #archiveteam-bs [10:23] *** swebb sets mode: +o brayden [10:24] *** odemg has joined #archiveteam-bs [10:32] *** odemg has quit IRC (Remote host closed the connection) [10:34] *** odemg has joined #archiveteam-bs [10:40] *** Stilett0 has joined #archiveteam-bs [10:40] *** Stilett0 has quit IRC (Client Quit) [10:46] *** godane has quit IRC (Ping timeout: 492 seconds) [10:51] *** Jonison has quit IRC (Read error: Connection reset by peer) [10:56] *** godane has joined #archiveteam-bs [10:58] *** pizzaiolo has joined #archiveteam-bs [11:38] *** GE has quit IRC (Quit: zzz) [11:51] *** vitzli has joined #archiveteam-bs [11:54] *** Oddy has quit IRC (Read error: Operation timed out) [12:04] going to be interesting [12:04] no WARCs should be edited though [12:17] *** dashcloud has quit IRC (Read error: Operation timed out) [12:21] *** dashcloud has 
joined #archiveteam-bs [12:28] *** BlueMaxim has quit IRC (Read error: Operation timed out) [13:04] *** GE has joined #archiveteam-bs [13:06] *** Aranje has quit IRC (Read error: Operation timed out) [13:12] no point removing the data anyway [13:12] you can't un-leak things [13:12] :P [13:12] no [13:13] but there will be people who want this stuff removed [13:13] best that can be done is keep it out during any rewrite on the data when it's served [13:17] right [13:28] *jesus* at sketchcow's blog post [13:28] closure: link pls [13:29] http://ascii.textfiles.com/archives/5139 [13:32] *** Oddy has joined #archiveteam-bs [13:35] yeah, that's some hardcore shit o.O [13:35] my (recently deceased for other reasons) dad had a similar story...went to his doctor for a routine exam, casually mentioned that he felt like he had water in his thighs at the end [13:36] so....do we need to back up textfiles.com? [13:37] and went to leave, and the doc went "you're not going anywhere, EMTs will be here in two minutes, we're putting you in a hospital right now" [13:38] christ [13:41] ...Wow. 
[13:46] *** dashcloud has quit IRC (Read error: Operation timed out) [13:50] *** dashcloud has joined #archiveteam-bs [14:06] *** odemg has quit IRC (Remote host closed the connection) [14:22] *** wp494 has joined #archiveteam-bs [14:23] *** odemg has joined #archiveteam-bs [14:35] *** pizzaiol1 has joined #archiveteam-bs [14:37] *** pizzaiolo has quit IRC (Read error: Operation timed out) [15:01] *** SadDM has quit IRC (Read error: Operation timed out) [15:05] *** SadDM has joined #archiveteam-bs [15:05] *** swebb sets mode: +o SadDM [15:12] *** schbirid2 has joined #archiveteam-bs [15:15] *** schbirid has quit IRC (Read error: Operation timed out) [15:30] *** nickware has joined #archiveteam-bs [15:32] *** pizzaiol1 is now known as pizzaiolo [15:36] *** nickware has quit IRC (Ping timeout: 370 seconds) [15:46] That was a great post [16:18] *** odemg has quit IRC (Remote host closed the connection) [16:23] Also I'd heard that heart attacks can be subtle buggers but having one for a week without knowing it, o.o [16:24] That's scary [16:28] *** jspiros has quit IRC (Read error: Operation timed out) [16:31] *** Aranje has joined #archiveteam-bs [16:41] *** joepie91 has quit IRC (Read error: Operation timed out) [16:41] *** arkiver has quit IRC (Read error: Operation timed out) [16:42] *** odemg has joined #archiveteam-bs [16:46] *** Oddy has quit IRC (Read error: Operation timed out) [17:34] *** jspiros has joined #archiveteam-bs [17:36] *** godane has quit IRC (Quit: Leaving.) 
[17:36] *** godane has joined #archiveteam-bs [17:46] I was gonna add this to his wikipedia entry but there's nowhere it could fit right now [17:47] so looks like the BadContent Error is telling me to use pdftk to repair the pdf [17:47] what's funny is pdftk makes its output.pdf broken in epdfview [17:50] *** jspiros has quit IRC (leaving) [17:56] SketchCow: i'm going to email my problem to info@archive.org and your address [17:59] now email sent [18:02] *** icedice has joined #archiveteam-bs [18:03] *** Panasonic has joined #archiveteam-bs [18:04] *** Ravenlow has quit IRC (Ping timeout: 244 seconds) [18:16] *** Aranje has quit IRC (Read error: Operation timed out) [18:17] *** vitzli has quit IRC (Leaving) [18:18] *** Aranje has joined #archiveteam-bs [18:44] *** odemg has quit IRC (Remote host closed the connection) [18:46] *** odemg has joined #archiveteam-bs [18:49] *** jspiros has joined #archiveteam-bs [19:06] *** nickware has joined #archiveteam-bs [19:13] *** nickware has quit IRC (Ping timeout: 370 seconds) [19:22] xmc, the issue is not only grabbing svns, but the mix of VCS backends. [19:22] aye [19:23] We would need to scrape the source code link, identify its VCS type, and then pass it off to a worker. [19:23] oh hm, it's bigger than i assumed [19:24] This comes up often enough that we should probably make it generalized so we can apply it to any source hosting site. [19:24] yeah [19:25] *** VADemon has joined #archiveteam-bs [19:26] Useful to know about svnrdump though, I didn't know that [19:27] Sounds like if I'm interested in saving this stuff in the absence of archiveteam tools (and lack effort to add them), I should do a bunch of svnrdumps off my own bat? [19:28] (gna provide rsync access to svn to project admins, like me) [19:28] jtn2: does that cover other projects? [19:28] or just your own? 
[19:29] nightpool: I haven't looked to see if I can see other projects [19:30] http://gna.org/svn/?group=freeciv describes the rsync access (near the bottom) [19:31] Looks like it's anonymous access, so no need to ask admins [19:31] jtn2, let's start multiple approaches. Our project goals are to back up Gna! source code repos and to make it easier to do so in the future. I will begin looking into creating a simple script to handle VCS discovery and grabbing, you contact the Gna! admins and see what you can do with your rsync access. [19:32] It might be polite to ask / warn them before hitting their servers hard (I'm sure you folks have opinions on that). [19:32] Gna admins notionally hang out on #gna on freenode, but often don't seem to be paying attention. (why it's closing I think) [19:33] I seem to be finding stuff out. Should I create a page on archiveteam wiki? [19:33] I am in the process of. Channel #git-r-done [19:34] (Larry the Cable Guy would be proud) [19:34] not gnarm? :P [19:34] gna.org doesn't do Git... [19:35] * jtn2 <- may be missing the joke [19:36] #gnarm is fine for Gna!, but a generalized system for grabbing source code repos from multiple VCSs is under #git-r-done. ;) [19:39] OIC [19:43] *** odemg has quit IRC (Remote host closed the connection) [19:56] Thanks! [20:03] *** mkram has joined #archiveteam-bs [20:07] *** bill-auge has joined #archiveteam-bs [20:07] hello - pizzaiolo: asked me to come in here [20:08] bill-auge was trying to contact gna webmasters to see if it would be possible to move gna repos to notabug.org [20:08] some emails worked, some didn't [20:08] bill-auge: which emails worked? [20:08] Gna admins are quite hard to get hold of IME. [20:09] (BTW: #gnarm exists. But I'll carry on here for now.) 
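(Editor's sketch, not something posted in the channel.) The "scrape the source code link, identify its VCS type, and then pass it off to a worker" idea from [19:23] could start out roughly like the Python below. This is a minimal sketch under stated assumptions: the URL heuristics in `guess_vcs` are guesses and will misclassify some hosts, `guess_vcs` and `build_job` are hypothetical names, and the example URLs are placeholders. The commands themselves (`svnrdump dump`, `git clone --mirror`, `hg clone --noupdate`) are real, but the sketch only builds the argument list and never runs anything.

```python
from urllib.parse import urlparse

# Real commands from the respective tools; the mapping itself is illustrative.
DUMP_COMMANDS = {
    "svn": ["svnrdump", "dump"],
    "git": ["git", "clone", "--mirror"],
    "hg": ["hg", "clone", "--noupdate"],
}


def guess_vcs(url):
    """Guess the VCS behind a repository URL, or return None if unsure.

    Heuristics only (an assumption of this sketch): scheme and path hints.
    """
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    path = parsed.path.lower()
    if scheme.startswith("svn") or "/svn/" in path:
        return "svn"
    if scheme == "git" or path.endswith(".git"):
        return "git"
    if "/hg/" in path:
        return "hg"
    return None


def build_job(url):
    """Return the argument list a worker would run, or None if unknown."""
    vcs = guess_vcs(url)
    if vcs is None:
        return None
    return DUMP_COMMANDS[vcs] + [url]
```

For example, `build_job("svn://example.org/svn/project")` yields `["svnrdump", "dump", "svn://example.org/svn/project"]`. A real worker would also need to redirect the `svnrdump` output to a file and rate-limit requests, in line with the point about not hitting the servers hard.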
[20:10] yea we will see - the ones that seem to get through i only sent about 20 minutes ago [20:10] When I asked them "how should I best interact with you about shutdown stuff" (I have a project) one of them said to raise service request tickets (http://gna.org/support/?group=admin) [20:10] I haven't tried this, but responses to those have been slow in the past. [20:11] that was the two mentioned on this page https://gna.org/contact.php [20:11] Gna admins notionally hang out on #gna on freenode, but often don't seem to be paying attention. (why it's closing I think) [20:11] (zerodeux and beuc IIRC) [20:12] why is #picasso only inhabited by me? The wiki says it is upcoming, and lists that irc channel... [20:12] is this the project by the peeps at inria.fr ? [20:13] yea i tried the IRC channel there are no ops [20:13] bill-auge: I have a project on Gna, do you too [20:13] ? [20:14] well i am there... Anyone I should op in that channel, now that i reside in it? [20:14] no i help out with notabug so i keep all my wares there [20:14] * jtn2 should investigate notabug [20:15] general rule for archiveteam is anyone opped in the main channels might should get op in project channels, also main contributors, people who are not being disruptive, etc [20:15] but the future of my project is OT for here... [20:15] anyway afk [20:15] but if you're running it, you decide [20:16] also mirrors on shithub o/c +gitlab +bitbucket - but i tend to treat notabug as the upstream for most [20:17] I thought I would want to lurk it due to the potentially big size, in order to get information on potential use a read-once valhalla storage. To influence my priorities. 
[20:30] *** Panasonic has quit IRC (Ping timeout: 246 seconds) [20:40] *** icedice has quit IRC (Quit: Leaving) [20:48] *** Gfy has joined #archiveteam-bs [21:32] *** Aranje has quit IRC (Read error: Operation timed out) [21:36] *** odemg has joined #archiveteam-bs [21:59] *** yeoldetoa has quit IRC (Read error: Operation timed out) [22:00] *** wabu has quit IRC (Ping timeout: 246 seconds) [22:00] *** yeoldetoa has joined #archiveteam-bs [22:00] *** chazchaz has quit IRC (Read error: Operation timed out) [22:00] *** wabu has joined #archiveteam-bs [22:00] *** chazchaz has joined #archiveteam-bs [22:01] *** dxrt- has quit IRC (Quit: Ping timeout (120 seconds)) [22:06] *** espes__ has quit IRC (Ping timeout: 633 seconds) [22:08] *** wabu has quit IRC (Read error: Operation timed out) [22:08] *** chazchaz has quit IRC (Read error: Operation timed out) [22:09] *** wabu has joined #archiveteam-bs [22:10] *** ndiddy has joined #archiveteam-bs [22:12] *** chazchaz has joined #archiveteam-bs [22:14] *** espes__ has joined #archiveteam-bs [22:21] *** GE has quit IRC (Remote host closed the connection) [22:34] *** odemg has quit IRC (Remote host closed the connection) [22:40] *** odemg has joined #archiveteam-bs [22:41] *** schbirid2 has quit IRC (Quit: Leaving) [22:45] *** Aranje has joined #archiveteam-bs [22:46] *** BlueMaxim has joined #archiveteam-bs [23:03] *** dashcloud has quit IRC (Read error: Operation timed out) [23:06] *** schbirid has joined #archiveteam-bs [23:06] *** dashcloud has joined #archiveteam-bs [23:21] http://oldrobot.org/?url=http%3A%2F%2Fwww.spiegel.de%2Frobots.txt :D [23:33] *** bill-auge has quit IRC (Quit: Page closed) [23:55] *** dan- has quit IRC (Ping timeout: 260 seconds)