[00:02] *** pikhq has quit IRC (Ping timeout: 244 seconds)
[00:08] *** BlueMax has joined #archiveteam-bs
[00:09] *** pikhq has joined #archiveteam-bs
[00:32] *** tomaspark has quit IRC (Read error: Operation timed out)
[00:35] *** tomaspark has joined #archiveteam-bs
[00:42] *** LordNigh2 has joined #archiveteam-bs
[00:45] *** Lord_Nigh has quit IRC (Ping timeout: 252 seconds)
[00:45] *** LordNigh2 is now known as Lord_Nigh
[00:49] *** kisspunch has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** mundus201 has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** hook54321 has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** kevinr has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** MrRadar2 has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** SketchCow has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** BnAboyZ has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** Tenebrae has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** nyaomi has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** Fusl has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** tsr has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** Sue has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** w0rp has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** Spydar007 has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** BnARobin has quit IRC (hub.efnet.us irc.efnet.nl)
[00:49] *** bsmith093 has quit IRC (hub.efnet.us irc.efnet.nl)
[00:50] *** slyphic has quit IRC (Read error: Operation timed out)
[00:51] *** slyphic has joined #archiveteam-bs
[00:52] *** kisspunch has joined #archiveteam-bs
[00:52] *** mundus201 has joined #archiveteam-bs
[00:52] *** hook54321 has joined #archiveteam-bs
[00:52] *** kevinr has joined #archiveteam-bs
[00:52] *** MrRadar2 has joined #archiveteam-bs
[00:52] *** SketchCow has joined #archiveteam-bs
[00:52] *** BnAboyZ has joined #archiveteam-bs
[00:52] *** Tenebrae has joined #archiveteam-bs
[00:52] *** nyaomi has joined #archiveteam-bs
[00:52] *** Fusl has joined #archiveteam-bs
[00:52] *** tsr has joined #archiveteam-bs
[00:52] *** Sue has joined #archiveteam-bs
[00:52] *** w0rp has joined #archiveteam-bs
[00:52] *** Spydar007 has joined #archiveteam-bs
[00:52] *** BnARobin has joined #archiveteam-bs
[00:52] *** bsmith093 has joined #archiveteam-bs
[00:52] *** irc.efnet.nl sets mode: +ooo hook54321 MrRadar2 SketchCow
[00:52] *** swebb sets mode: +o SketchCow
[00:52] *** midas4 sets mode: +o SketchCow
[00:52] *** midas1 sets mode: +o SketchCow
[01:07] *** DMackey has joined #archiveteam-bs
[01:17] *** balrog has quit IRC (Bye)
[01:21] *** balrog has joined #archiveteam-bs
[01:21] *** swebb sets mode: +o balrog
[01:23] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:42] *** Jusque has quit IRC (Read error: Operation timed out)
[03:55] *** qw3rty119 has joined #archiveteam-bs
[03:55] *** odemg has quit IRC (Read error: Operation timed out)
[04:00] *** odemg has joined #archiveteam-bs
[04:01] *** Jusque has joined #archiveteam-bs
[04:01] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[04:47] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[04:50] *** Lord_Nigh has joined #archiveteam-bs
[05:04] *** squires has quit IRC (Remote host closed the connection)
[05:21] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[05:24] *** Lord_Nigh has joined #archiveteam-bs
[05:30] *** jmtd is now known as Jon
[05:52] *** Mateon1 has quit IRC (Read error: Operation timed out)
[05:52] *** Mateon1 has joined #archiveteam-bs
[06:11] *** robogoat_ has quit IRC (Ping timeout: 252 seconds)
[06:11] *** robogoat has joined #archiveteam-bs
[06:23] *** LordNigh2 has joined #archiveteam-bs
[06:23] *** Lord_Nigh has quit IRC (Ping timeout: 252 seconds)
[06:23] *** LordNigh2 is now known as Lord_Nigh
[06:26] *** Stilett0- has joined #archiveteam-bs
[06:28] *** Stiletto has quit IRC (Ping timeout: 252 seconds)
[06:28] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[06:29] *** Lord_Nigh has joined #archiveteam-bs
[06:36] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[06:39] *** Lord_Nigh has joined #archiveteam-bs
[07:10] *** LordNigh2 has joined #archiveteam-bs
[07:10] *** Lord_Nigh has quit IRC (Ping timeout: 268 seconds)
[07:11] *** LordNigh2 is now known as Lord_Nigh
[07:13] *** schbirid has joined #archiveteam-bs
[09:16] *** Lord_Nigh has quit IRC (Ping timeout: 252 seconds)
[09:27] *** LordNigh2 has joined #archiveteam-bs
[09:28] *** LordNigh2 is now known as Lord_Nigh
[10:19] *** odemg has quit IRC (Read error: Operation timed out)
[10:32] *** odemg has joined #archiveteam-bs
[10:33] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:02] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[11:03] *** Mateon1 has quit IRC (Remote host closed the connection)
[11:03] *** Lord_Nigh has joined #archiveteam-bs
[11:03] *** Mateon1 has joined #archiveteam-bs
[11:09] *** ndiddy has quit IRC ()
[11:12] *** Lord_Nigh has quit IRC (Ping timeout: 252 seconds)
[11:16] *** Lord_Nigh has joined #archiveteam-bs
[11:17] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[11:17] *** Mateon1 has joined #archiveteam-bs
[11:21] *** Zexaron has joined #archiveteam-bs
[11:32] There is an important ongoing legal case involving what we do every day. The defendant's lawyer wants your opinions and insight into why we do what we do and the philosophy behind it all. Have your say here: https://redd.it/8gcji4
[11:40] https://www.reddit.com/r/DataHoarder/comments/8gcji4/the_philosophy_behind_data_hoarding_and_amateur/
[12:15] *** plue has quit IRC (Ping timeout: 260 seconds)
[12:23] *** plue has joined #archiveteam-bs
[13:56] *** RichardG has quit IRC (Read error: Connection reset by peer)
[13:57] *** RichardG has joined #archiveteam-bs
[15:07] *** TC01 has quit IRC (Remote host closed the connection)
[15:12] *** ebel has joined #archiveteam-bs
[15:13] JAA: I was a little worried that archivebot was only supposed to be for "Official(tm) Things"
[15:22] ebel: ArchiveBot is for whatever we throw at it. :-)
[15:23] ah, but who is we? :P
[15:23] I would be interested in your aforementioned scripts for Tw & FB.
[15:25] "We" = anyone with voice or ops in #archivebot. Now includes yourself.
[15:26] Oh :D Thanks.
[15:30] So, about the scraper: any name suggestions? That's the main thing that's stopping me from releasing it, honestly.
[15:30] It's supposed to be a generic social media scraper, so the name should reflect that. I suck at inventing names, so it's currently called "social-media-scraper", which is an awful name.
[15:31] It supports Twitter, Instagram, and (with limitations) Facebook so far.
[15:37] What's the opposite of "share", since it does the opposite of that, right? It saves, not shares :P
[15:38] so i got anyfesto to sort of work
[15:39] it's part of my rpi3 archivebox project
[15:39] i only got kiwix to work
[15:40] i have to test if vlc radio would work later
[15:44] ebel: It doesn't "save" anything on its own, actually. It only *collects* the posts. At the moment, that just means extracting the link to each post, though additional data (e.g. post contents, author, date) could be added easily to make it a proper scraper. That's actually something I intend to do.
[15:46] I've started using grab-site. It's amazing how some websites that seem small are actually pretty big. Hundreds/thousands of things in the queue!
[15:47] That's tiny. It gets fun when there are millions in the queue. ;-)
[15:47] also, holy moley, but why the hell does facebook have so much JS!
[15:48] "https://static.xx.fbcdn.net/rsrc.php/blah.js"
[15:48] Fun fact: ArchiveBot was initially created for small crawls with 100 or 1000 pages. Nowadays, we hardly have any crawls with fewer than 100k URLs.
[15:48] Yes, Facebook sucks.
[16:01] *** bwn has quit IRC (Read error: Operation timed out)
[16:01] *** phuzion has quit IRC (Remote host closed the connection)
[16:08] *** phuzion has joined #archiveteam-bs
[16:11] *** bwn has joined #archiveteam-bs
[16:34] *** wp494_ has joined #archiveteam-bs
[16:37] *** wp494 has quit IRC (Ping timeout: 252 seconds)
[16:40] *** Despatche has quit IRC (Ping timeout: 506 seconds)
[16:41] *** Despatche has joined #archiveteam-bs
[16:49] *** betamax has quit IRC (Ping timeout: 252 seconds)
[17:06] *** betamax has joined #archiveteam-bs
[17:38] *** Despatche has quit IRC (Quit: Read error: Connection reset by peer)
[17:39] *** Despatche has joined #archiveteam-bs
[18:52] I'm playing with wpull, and it's amazing. It (& phantomjs & webrecorder player) is exactly what I had been looking for for ages! :D
[18:53] Is it possible to get it not to download all the analytics & ads on a page?
[18:54] (I'm not really sure how to do that, I want --page-requisites, but not the advert stuff. Maybe a phantomjs with an adblocker installed? )
[18:57] PhantomJS is ugly, and its integration in wpull has some issues (namely massive duplication in the archives). Look into IA's brozzler maybe (headless Chromium + warcprox).
[18:57] For ignoring in wpull, --reject-regex. Note that you can use that option only once, so you have to include all ignore patterns in one option.
[18:59] another option could be a web proxy which just blocks those sorts of URLs. adblocking at a proxy level. Pretty sure I've seen them
[19:00] Also, if you want to use wpull, use version 1.2.3. Version 2.0.x is pretty unstable and has weird bugs.
[19:00] I have 1.2. It's what's in pip
[19:00] By the way, generally, we want to archive ads as well.
[19:01] Uh, pip install wpull should install 2.0.1.
[19:01] You might, but I don't. :D
[19:01] It's great that webrecorder player can work in my browser (which has an adblocker), so that might be a solution.
[19:05] I'm still impressed and blown away. I was looking for this sort of thing (on and off) for a little while. How did I miss this???
[19:05] :)
[19:06] wpull is an archiveteam-developed tool, it's good but kinda needs a rewrite and we don't really advertise it
[19:11] chfoo-developed* (except for a handful of commits). Credit where credit is due.
[19:13] aye, sorry
[19:13] i couldn't remember who
[19:16] *** Kaz has quit IRC (Ping timeout: 260 seconds)
[19:35] *** lindalap_ has joined #archiveteam-bs
[19:35] *** lindalap has quit IRC (Write error: Connection reset by peer)
[19:35] *** lindalap_ is now known as lindalap
[19:49] *** jschwart has joined #archiveteam-bs
[19:54] *** godane has quit IRC (Ping timeout: 257 seconds)
[19:56] *** godane has joined #archiveteam-bs
[19:56] *** svchfoo3 sets mode: +o godane
[20:01] *** w00dsman has joined #archiveteam-bs
[20:04] *** Kaz has joined #archiveteam-bs
[20:04] ..oops
[20:04] maybe shouldn't have deleted my znc host
[20:08] *** w00dsman has quit IRC (WeeChat 2.1)
[20:09] *** w00dsman has joined #archiveteam-bs
[20:14] *** schbirid has quit IRC (Quit: Leaving)
[20:26] *** BlueMax has joined #archiveteam-bs
[20:39] *** godane has quit IRC (Read error: Operation timed out)
[20:39] *** TC01 has joined #archiveteam-bs
[20:41] *** godane has joined #archiveteam-bs
[20:53] re: foodspotting
[20:54] profiles can be found via numbers e.g. http://www.foodspotting.com/462800
[20:54] reviews can be found via numbers e.g. http://www.foodspotting.com/reviews/6163336
[20:55] places can be found via numbers too e.g. http://www.foodspotting.com/places/981555
[20:55] :)
[21:02] SimpBrain: Can you try to figure out what the maximum number is for each of those please?
[21:03] will do in a few mins
[21:13] profiles last number: 5171135
[21:14] reviews final number: 6163338
[21:18] places final number: 1059662
[21:18] profiles first number: 1
[21:19] yeah all start with 1
[21:34] *** RichardG has quit IRC (Read error: Operation timed out)
[21:41] *** RichardG has joined #archiveteam-bs
[21:48] *** wp494 has joined #archiveteam-bs
[21:48] *** svchfoo3 sets mode: +o wp494
[21:53] *** wp494_ has quit IRC (Ping timeout: 492 seconds)
[21:53] https://www.reddit.com/r/opendirectories/comments/8gl6eq/extensive_amiga_archive_theeyeeu
[22:05] *** jschwart has quit IRC (Konversation terminated!)
[22:38] *** godane has quit IRC (Ping timeout: 252 seconds)
[22:54] *** godane has joined #archiveteam-bs
[22:54] *** svchfoo3 sets mode: +o godane
[22:59] *** godane has quit IRC (Ping timeout: 255 seconds)
[23:06] *** godane has joined #archiveteam-bs
[23:06] *** svchfoo3 sets mode: +o godane
[23:34] *** tuluu has quit IRC (Read error: Operation timed out)
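
Note on the Foodspotting numbers reported between 20:54 and 21:19: since every item type is addressed by a plain integer starting at 1, the full URL list could be generated rather than discovered by crawling. The following is only a rough sketch under that assumption. The URL patterns and upper bounds are taken from the messages above; the names `MAX_IDS`, `PREFIXES`, and `item_urls` are hypothetical and not part of any ArchiveTeam tool, and nothing in the log confirms that every ID in a range actually resolves.

```python
from itertools import islice
from typing import Iterator, Optional

BASE = "http://www.foodspotting.com"

# Upper bounds as reported in the channel; all three types start at 1.
MAX_IDS = {
    "profiles": 5171135,
    "reviews": 6163338,
    "places": 1059662,
}

# Profiles sit directly under the site root; the other two use a path prefix.
PREFIXES = {"profiles": "", "reviews": "reviews/", "places": "places/"}


def item_urls(kind: str, start: int = 1, stop: Optional[int] = None) -> Iterator[str]:
    """Yield one URL per numeric ID, from start to stop inclusive.

    If stop is omitted, the reported maximum for that item type is used.
    """
    last = MAX_IDS[kind] if stop is None else stop
    for n in range(start, last + 1):
        yield f"{BASE}/{PREFIXES[kind]}{n}"


if __name__ == "__main__":
    # Show the first two URLs of each type without generating the full lists.
    for kind in MAX_IDS:
        print(kind, list(islice(item_urls(kind), 2)))
```

The three ranges add up to roughly 12.4 million URLs, so a generator is used here instead of building the lists in memory. Gaps in the ID space (deleted or never-assigned items) would have to be handled by whatever tool consumes the list.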