[00:44] i'm trying to spider archiveofourown.org for urls to grab, and i can't seem to get past the index page. i've tried every user agent i can think of, nothing works!!
[00:45] here's what i'm using
[00:45] wget --spider -U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36" -m www.archiveofourown.org 2>&1 > urls.txt
[00:58] *** JesseW has joined #archiveteam-bs
[01:11] i'm at 872k items now
[01:14] bsmith093: browsers send more request headers that they might be checking for
[01:15] *** robink has quit IRC (Ping timeout: 506 seconds)
[01:15] chrome and firefox developer tools have a "copy as curl" feature in the network tab that you can use to construct an identical request
[01:18] ivan: i have no idea where that option is, i'm in the dev tab
[01:19] right-click a network request
[01:19] you have to reload the page to see the request for the page itself
[01:20] i've done that. 18 requests 18 kb, right click is doing nothing special
[01:21] Copy -> Copy as curl
[01:22] (that's in Chrome)
[01:25] i've only used the dev console once. i have very little idea what i'm doing. i see elements console sources network and timeline
[01:26] network tab
[01:26] i reloaded the page, there's html everywhere, now what?
[01:26] reload the page
[01:26] stay on the network tab, right-click a request
[01:27] k got it now, there's a massive cookie file, should i tell wget to use that?
[01:27] *** Cameron_D has quit IRC (Ping timeout: 370 seconds)
[01:28] actually here
[01:29] that was a pm with the curl blob in it.
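Editor's sketch of the advice above, not something posted in the channel: a site may check request headers beyond the user agent, so the idea is to send the same set a browser would. The `Accept` and `Accept-Language` values below are assumed placeholders, since the real values from the "Copy as curl" output are not in the log. The helper names `build_request` and `wget_command` are hypothetical. The generated command uses wget's `-o` option to write the log to a file, because in the original command `2>&1 > urls.txt` redirects stderr before stdout is pointed at the file, and wget writes its log to stderr, so `urls.txt` would likely stay empty. Nothing here sends a request.

```python
import shlex
import urllib.request

# User-Agent string quoted in the log above.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
)

# Assumed header set, typical of a desktop browser. The values a real
# "Copy as curl" would give are not in the log, so these are placeholders.
BROWSER_HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return an unsent urllib Request carrying the browser-like headers."""
    return urllib.request.Request(url, headers=dict(headers))


def wget_command(url, headers=BROWSER_HEADERS, log_file="urls.txt"):
    """Build a comparable wget spider command as a shell-quoted string."""
    # -o writes wget's log (where the visited URLs appear) to log_file.
    args = ["wget", "--spider", "-m", "-o", log_file]
    for name, value in headers.items():
        args.append("--header=" + name + ": " + value)
    args.append(url)
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    request = build_request("https://archiveofourown.org/")
    print(request.header_items())
    print(wget_command("https://archiveofourown.org/"))
```

If the site also depends on cookies, as the "massive cookie file" remark suggests it might, the same pattern would extend to a `Cookie` header copied from the browser, though whether that is needed here is not stated in the log.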
[01:41] *** Cameron_D has joined #archiveteam-bs
[01:45] *** robink has joined #archiveteam-bs
[02:29] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[02:29] *** Sk1d has joined #archiveteam-bs
[02:29] *** Sk1d has quit IRC (Connection closed)
[02:32] *** Sk1d has joined #archiveteam-bs
[02:39] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[02:56] *** Sk1d has joined #archiveteam-bs
[03:09] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[03:15] *** Sk1d has joined #archiveteam-bs
[03:16] *** Swizzle has joined #archiveteam-bs
[03:29] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[03:30] *** tomwsmf_ has quit IRC (Ping timeout: 255 seconds)
[03:31] *** dashcloud has joined #archiveteam-bs
[03:51] *** mutoso has quit IRC (Read error: Connection reset by peer)
[03:51] *** Atros has joined #archiveteam-bs
[03:51] *** mutoso_ has joined #archiveteam-bs
[03:52] *** robink has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** Frogging has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** balrog has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** Mayonaise has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** acridAxid has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** jspiros has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** coretx has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** remsen1 has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** ranma has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** ivan has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** chfoo has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** SadDM has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** yakfish has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** Stiletto has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** trs80 has quit IRC (ircd.choopa.net hub.efnet.us)
[03:52] *** superkuh has quit IRC (Excess Flood)
[03:53] *** superkuh has joined #archiveteam-bs
[03:54] *** atrocity has quit IRC (Ping timeout: 633
seconds)
[03:58] *** robink has joined #archiveteam-bs
[03:58] *** Stiletto has joined #archiveteam-bs
[03:58] *** Frogging has joined #archiveteam-bs
[03:58] *** balrog has joined #archiveteam-bs
[03:58] *** Mayonaise has joined #archiveteam-bs
[03:58] *** acridAxid has joined #archiveteam-bs
[03:58] *** jspiros has joined #archiveteam-bs
[03:58] *** coretx has joined #archiveteam-bs
[03:58] *** remsen1 has joined #archiveteam-bs
[03:58] *** ranma has joined #archiveteam-bs
[03:58] *** ivan has joined #archiveteam-bs
[03:58] *** chfoo has joined #archiveteam-bs
[03:58] *** SadDM has joined #archiveteam-bs
[03:58] *** yakfish has joined #archiveteam-bs
[03:58] *** trs80 has joined #archiveteam-bs
[03:58] *** hub.efnet.us sets mode: +ooo balrog chfoo SadDM
[03:58] *** swebb sets mode: +o balrog
[03:58] *** swebb sets mode: +o SadDM
[04:07] *** ndiddy has quit IRC (Read error: Operation timed out)
[04:36] *** Meroje has quit IRC (Quit: bye!)
[04:36] *** Meroje has joined #archiveteam-bs
[04:50] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:55] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[04:57] *** Sk1d has joined #archiveteam-bs
[05:08] *** superkuh has quit IRC (Read error: Operation timed out)
[05:08] *** Petri152 has quit IRC (Ping timeout: 633 seconds)
[05:09] *** superkuh has joined #archiveteam-bs
[05:14] *** fie_ has joined #archiveteam-bs
[05:15] *** phuzion has quit IRC (Ping timeout: 633 seconds)
[05:15] *** phuzion has joined #archiveteam-bs
[05:20] *** Petri152 has joined #archiveteam-bs
[05:22] *** phuzion has quit IRC (Read error: Connection reset by peer)
[05:22] *** fie has quit IRC (Read error: Operation timed out)
[05:28] *** phuzion has joined #archiveteam-bs
[05:33] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:37] *** dashcloud has joined #archiveteam-bs
[05:44] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[05:57] *** RichardG_ has joined #archiveteam-bs
[05:57] *** RichardG has quit IRC (Read
error: Connection reset by peer)
[06:07] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:07] *** JesseW has joined #archiveteam-bs
[06:11] *** dashcloud has joined #archiveteam-bs
[06:28] *** JesseW has quit IRC (Read error: Operation timed out)
[06:45] *** BlueMaxim has joined #archiveteam-bs
[07:46] *** logchfoo2 starts logging #archiveteam-bs at Tue Sep 20 07:46:55 2016
[07:46] *** logchfoo2 has joined #archiveteam-bs
[07:52] *** Petri152 has quit IRC (ny.us.hub ircd.choopa.net)
[07:52] *** bwn has quit IRC (ny.us.hub ircd.choopa.net)
[07:52] *** fusl has quit IRC (ny.us.hub ircd.choopa.net)
[07:58] *** bwn_ has joined #archiveteam-bs
[08:04] *** GE has joined #archiveteam-bs
[08:08] *** bwn_ is now known as bwn
[08:14] *** SmileyG has quit IRC (Read error: Operation timed out)
[09:01] *** Petri152 has joined #archiveteam-bs
[09:01] *** fusl has joined #archiveteam-bs
[09:05] *** Smiley has joined #archiveteam-bs
[10:20] *** GE has quit IRC (Quit: zzz)
[10:25] *** godane has quit IRC (Read error: Operation timed out)
[10:28] *** godane has joined #archiveteam-bs
[11:51] *** GE has joined #archiveteam-bs
[12:23] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:42] *** dashcloud has joined #archiveteam-bs
[13:16] *** phuzion has joined #archiveteam-bs
[13:20] *** RichardG_ is now known as RichardG
[13:23] *** VADemon has joined #archiveteam-bs
[13:36] *** useretail has quit IRC (Ping timeout: 244 seconds)
[13:37] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:37] *** dashcloud has joined #archiveteam-bs
[14:02] *** BlueMaxim has quit IRC (Quit: Leaving)
[14:22] *** Start has quit IRC (Quit: Disconnected.)
[14:32] *** useretail has joined #archiveteam-bs
[14:51] *** JesseW has joined #archiveteam-bs
[14:57] i'm up to 873k items
[14:58] i'm also at 9400ish items in my godaneinbox
[15:15] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[15:15] http://dailycaller.com/2016/09/19/computer-tech-who-asked-how-to-strip-out-email-addresses-may-have-worked-for-hillary/
[15:15] apparently, Hillary email server tech asked for help on Reddit
[15:16] in what seems like falsifying evidence?
[15:16] *** BartoCH has joined #archiveteam-bs
[15:16] perfect, they've found a scapegoat
[15:19] Kaz: not necessarily. post claimed that he was asked to do so
[15:19] by $employer
[15:19] so... :P
[15:24] yeah, true
[15:25] wonder how this one will be spun
[15:29] *** sep332 has joined #archiveteam-bs
[15:31] a full index of every North Korean domain in existence: https://github.com/mandatoryprogrammer/NorthKoreaDNSLeak
[15:31] (source: zone transfer misconfiguration)
[15:32] hm, maybe not
[15:32] oh yeah no, it is, it's just a very small zone
[15:32] :p
[15:40] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[15:42] *** metalcamp has joined #archiveteam-bs
[15:44] *** JesseW has joined #archiveteam-bs
[15:49] *** JesseW has quit IRC (Read error: Operation timed out)
[16:04] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[16:05] *** GLaDOS has joined #archiveteam-bs
[16:10] *** VADemon has quit IRC (Ping timeout: 255 seconds)
[16:32] I expected nothing less from a country that doesn't think the cold war is over
[16:45] *** schbirid has joined #archiveteam-bs
[16:55] *** dashcloud has quit IRC (Read error: Operation timed out)
[16:59] *** dashcloud has joined #archiveteam-bs
[18:10] HEY HI HELLO
[18:10] So, I'm in Australia for a couple weeks starting this weekend.
[18:11] I assume none of you live there (I'll be in Melbourne this time) but I'll also be semi-spotty during that time. (Lots of walking)
[18:15] don't you mean sketchy?
[18:34] *** Aranje has joined #archiveteam-bs
[18:47] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[18:51] *** Aranje has joined #archiveteam-bs
[19:47] *** ndiddy has joined #archiveteam-bs
[19:58] *** metalcamp has quit IRC (Quit: Bye)
[20:02] *** JW_work1 has joined #archiveteam-bs
[20:05] *** JW_work has quit IRC (Read error: Operation timed out)
[20:19] *** Stiletto has quit IRC (Ping timeout: 190 seconds)
[20:20] *** Stiletto has joined #archiveteam-bs
[20:25] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[20:26] *** dashcloud has joined #archiveteam-bs
[20:26] have we ever considered trying to scrape logs.omegle.com ?
[20:37] plenty of three letter agencies already have a copy
[20:38] *** schbirid has quit IRC (Quit: Leaving)
[20:46] *** tomwsmf_ has joined #archiveteam-bs
[20:46] hook54321: seriously, dude, that's gross
[20:47] I *know* the URLs have no access control, that is not the point
[20:47] there is a tremendous difference between saving homepages and saving omegle logs, and the difference is intent to publish
[20:48] ^ seconded
[20:48] no, there is no HTTP header or browser extension to delineate this
[20:48] that doesn't fucking mean it doesn't exist
[20:50] it is also quite possible that Omegle is used by private citizens and we treat their correspondence differently than (say) public figures
[20:51] archiveteam isn't just saving. in essence, we republish. so if something wasn't public and we've made it public, that's on us.
[20:52] publishing new things can ruin lives completely by accident. so we have *always* drawn the line at things that were chosen to be published by a person, and then another person decided to delete.
[20:53] likewise, if someone decides to unpublish their work, that is a thing which we must err on the side of respecting.
[20:54] well
[20:54] that's a tricky one but yes
[20:54] it is tricky.
[20:54] we should skew towards respecting the author's wishes
[20:55] we try to not make editorial decisions
[20:55] this means a few things
[20:55] ?
[20:55] 1. decision to archive is made based on risk and impact, not approval-of-content or perceived value
[20:56] uh, i forget what i was going to say for 2 and 3
[20:56] My attitude is that if there's stuff that makes people quake in their dishwashing soap about saving as an Archiveteam thing, someone who disagrees can use the tools and do the work.
[20:56] yes
[20:56] Doesn't mean we need to be doing it. Especially something minor in terms of processing power.
[20:56] but under the archiveteam banner we shouldn't be sucking out semi-private things that weren't publicly displayed
[20:57] There's no reason to
[20:57] like, should we archive the list of imeis that weev sucked out of the at&t website hole
[20:57] if he posts it, maybe. but we don't need to go do the same shit.
[20:57] No reason to.
[20:57] right
[20:57] Someone else can hand over a blackbox "crap hackerz got" thing to the archive, it'll go dark, or not, or whatever
[20:57] exactly what i'm saying
[20:58] we have a reputation, which we've earned by not being dickheads
[20:58] and also being effective
[20:58] *** ndizzle has joined #archiveteam-bs
[20:58] ...we're kind of dickheads
[20:58] But we're not ridiculously sociopathic dickheads
[20:59] RIGHT
[20:59] Talking like whatever that exiled nutter is who talks in third person
[20:59] *** JW_work has joined #archiveteam-bs
[20:59] you and me and yipdw all are in 100% concordance here
[21:01] Always good to revisit premises
[21:01] Remember why we got into it
[21:01] FOR A DECADE
[21:01] holy shit you're right
[21:01] but yeah
[21:02] all this makes sense to me, FWIW
[21:02] original purpose of archiveteam: individual humans decide to publish something, corporations decide to delete it.
[21:02] *** ndiddy has quit IRC (Read error: Operation timed out)
[21:02] and there's an important difference between *privately* making a copy of something (and even passing it on to IA as a dark archive) and publishing it for free download
[21:06] It technically is public, but I agree that we should try to not damage our reputation.
[21:07] *** JW_work1 has quit IRC (Ping timeout: 633 seconds)
[21:07] It's public but not *published*
[21:08] it's like someone leaving their house unlocked
[21:08] I think the thing is, we naturally have acquired/encouraged a general fog of nerds really into saving shit.
[21:08] Some of that saving and working isn't what we're into, but it's out there and people have skills, etc.
[21:09] hook54321: if you want to save it, sure. but if you publish it you're being really rude. and don't do it under the archiveteam name.
[21:10] make sense?
[21:10] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:11] It's technically public
[21:13] and you can see the tax return on my desk if you have a drone with a camera
[21:13] *** dashcloud has joined #archiveteam-bs
[21:14] sometimes the line between published and unintentionally public is vague. but i think this example is fairly straightforward.
[21:15] searching for "site:logs.omegle.com" gives you last of logs, google isn't exactly a drone. I see what you mean though, I think the logs themselves are images for some reason, so they shouldn't be OCR-ed.
[21:16] I see how this could potentially damage our reputation though.
[21:16] surprised there's no robots.txt
[21:17] Me too.
[21:17] never ascribe to malice what can be explained by incompetence or indifference
[21:18] are you really that dense. it's not about our reputation, it's about being decent members of society.
[21:22] I know, but there are uses for the logs other than what some assume they are used for. For example, Yik Yak has given researchers access to users' posts.
[21:23] did omegle do that?
[21:24] yeah, YikYak did that. it's also an action that's hard to reconcile with YikYak initially being sold as a geographically isolated, anonymous messenger
[21:26] idk if omegle has done it or not. There are tons of things about Yik Yak that are hard to reconcile.
[21:29] *** JW_work has quit IRC (Quit: Leaving.)
[21:29] anyway, whatever yikyak did or didn't do is tangential. the point I wanted to make is that Omegle isn't a publishing platform, and collating those logs and making them more conveniently available isn't just "I did a bunch of GET requests and shoved them all into this file"
[21:31] two people using (text-mode) Omegle are deidentified. this doesn't, however, mean that they suddenly have perfect operational security. if you make a bunch of logs more accessible you run the risk of making it possible to identify Omegle users and subject them to the full range of social badness that humans can deliver
[21:32] that doesn't seem like a particularly productive thing to do
[21:32] *** JW_work has joined #archiveteam-bs
[21:32] but I've exceeded my text quota so I'm done here
[21:34] *** tomwsmf_ has quit IRC (Read error: Operation timed out)
[21:34] (sidenote: even if they had perfect opsec, I'd still feel like it was wrong; the eavesdropper is an adversary. that's probably not the position you want to be in)
[21:50] *** tomwsmf_ has joined #archiveteam-bs
[22:14] *** kyounko has joined #archiveteam-bs
[22:14] *** kyounko has left
[22:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:00] *** dashcloud has joined #archiveteam-bs
[23:17] *** Start has joined #archiveteam-bs
[23:22] *** GE has quit IRC (Quit: zzz)
[23:29] *** kristian_ has joined #archiveteam-bs
[23:31] hi all
[23:31] did you see the news on the North Korean "Internet" being leaked?
[23:31] yep, we're grabbing them
[23:33] cool!
[23:34] btw, I was thinking of something ... could a project perhaps be made where people DL their own Facebook?
[23:36] not just their posts, but the experience somehow
[23:36] if that makes sense ...
[23:44] *** zhongfu has quit IRC (Remote host closed the connection)
[23:45] *** zhongfu has joined #archiveteam-bs
[23:47] kristian_: which parts of the experience?
[23:47] that of the user, hook54321
[23:48] this is something that will be lost when FB is gone
[23:48] that could potentially be all of facebook
[23:48] yeah
[23:48] also, part of the experience is interacting with other people :P
[23:48] but if you could save something that would let people click around for half an hour or so
[23:49] I think he means being able to browse it like it was the website
[23:49] well ... we all know what happens to people, hook54321
[23:49] that's one way of putting it, Frogging
[23:50] archive.is kinda has better support for archiving pages on facebook
[23:50] Only public stuff though
[23:56] *** JesseW has joined #archiveteam-bs