[00:07] xmc: Kaz: at least until recently, Googlebot did *not* actually run JS, despite many reports otherwise
[00:07] it only does static analysis and knows about a very limited set of libraries and frameworks and how to extract meaning from their usage
[00:08] huh, interesting
[00:08] it's possible that this was changed recently
[00:08] xmc: I was using base64-encoded content on a page (which was decoded immediately on page load) to hide certain data from Google
[00:08] doxing prevention measures :P
[00:08] it was unable to get past that
[00:10] anyway, the statement Google put out about this was that Googlebot now "understands JS"
[00:10] they never actually said that they *ran* JS
[00:10] but that's how it was reported by the usual SEO-y outlets
[00:10] which is why a lot of people now believe that Googlebot runs JS :P
[00:10] figures
[00:11] that screenshot suggests that this might be changing, though
[00:12] alternatively, it could be their snippet crossreferencing thing fucking up and crossreferencing to a totally irrelevant 'related' page
[00:12] (the thing where it shows you a snippet of text that doesn't actually originate from the page, but that exists on a page that Google considers to be 'related' or 'similar')
[00:13] do they actually do that O.o
[00:14] that sounds counterintuitive because if you visit a page from google you would expect to see the contents of the snippet on that page
[00:14] in fact a good portion of the time I probably ctrl+f for the snippet immediately after it loads :p
[00:27] I wonder what it would take to get a backup of the cached sites that google stores, I know they get deleted after some time.
[00:36] for a handful, it's easy- use anything. Beyond that, you're faced with serious hard-core problems to scrape content- Google gets really pissed about that, and will captcha you & your netblock
[00:40] Ya
[00:40] the entire OVH IPv6 netblock is captcha'd
[01:02] *** Stilett0 has joined #archiveteam-bs
[01:17] Frogging: yes, Google does a bunch of weird shit with snippets
[01:17] Frogging: you also sometimes get results where the page doesn't contain your query
[01:17] and never did
[01:18] Aww chat.pixiv.net is closing soon
[01:18] the 15th
[01:18] dashcloud: I actually once spoke with the person responsible for that mechanism, on IRC... they indeed /really/ do not like scrapers of any kind :P
[01:18] (the one managing the scraper protection, that is)
[01:19] * jrwr remembers Google Code
[01:20] the guy showing up angry that it was 4AM getting alert SMS because of a suspected DDoS on GCode :0
[01:23] oh man Pixiv stores video data in a strange manner, it's raw AMF Commands
[01:23] pretty much Flash SVG+Animation commands
[01:24] *** j08nY has quit IRC (Quit: Leaving)
[01:25] *** ZexaronS has quit IRC (Leaving)
[01:45] interesting, there is an API to figure out the AMF Downloads
[01:45] I'm going to write up some code and start downloading http://chat.pixiv.net
[01:49] All I know is PHP, so it's going to be messy, but I'll have resume
[01:50] Yay, IDs are simple, they just increase!
[01:55] It's about 1136489 rooms
[01:55] about 3-4MB a room
[01:59] Does anyone know how to use this?
https://github.com/bibanon/webcache-scraper
[02:00] *** Stilett0 is now known as Stiletto
[02:43] Damn, they are making this hard
[02:44] NSFW crap will be behind a wall without logging in
[03:04] I will continue this when I get home
[03:05] here is my shitty code, it's about 20% completed https://github.com/JRWR/savepixiv/blob/master/download.php
[04:03] *** Stiletto has quit IRC ()
[04:18] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:20] *** ndiddy has quit IRC ()
[04:24] *** Yurume has quit IRC (Read error: Operation timed out)
[04:25] *** Sk1d has joined #archiveteam-bs
[04:27] *** Yurume has joined #archiveteam-bs
[04:30] well I have started
[04:30] but
[04:30] holy shit is this slow
[04:39] I've updated the github with my working code
[04:39] I'll need to convert it to a pipeline
[04:39] some help would be nice
[06:22] *** Ravenloft has quit IRC (Ping timeout: 250 seconds)
[07:22] *** bwn has quit IRC (Ping timeout: 268 seconds)
[07:22] *** logchfoo2 has quit IRC (Ping timeout: 268 seconds)
[07:23] *** logchfoo3 starts logging #archiveteam-bs at Thu Jun 01 07:23:48 2017
[07:23] *** logchfoo3 has joined #archiveteam-bs
[07:24] *** Hecatz has quit IRC (Ping timeout: 268 seconds)
[07:25] *** bwn has joined #archiveteam-bs
[07:27] *** kurt has quit IRC (Ping timeout: 268 seconds)
[07:27] *** kurt has joined #archiveteam-bs
[07:27] *** K4k has quit IRC (Read error: Operation timed out)
[07:27] *** Frogging has quit IRC (Read error: Operation timed out)
[07:27] *** K4k has joined #archiveteam-bs
[07:27] *** FluffyFox has joined #archiveteam-bs
[07:27] *** ranma_ has quit IRC (Read error: Operation timed out)
[07:27] *** timmc has quit IRC (Read error: Operation timed out)
[07:27] *** dboard has quit IRC (Read error: Operation timed out)
[07:27] *** antomati_ has joined #archiveteam-bs
[07:27] *** swebb sets mode: +o antomati_
[07:28] *** FluffyFox is now known as Frogging
[07:28] *** SadDM has quit IRC (Read error: Operation timed out)
[07:28] *** jspiros has quit IRC (Read error: Operation timed out)
[07:28] *** decay has quit IRC (Read error: Operation timed out)
[07:28] *** decay has joined #archiveteam-bs
[07:28] *** wabu has quit IRC (Read error: Operation timed out)
[07:28] *** wabu has joined #archiveteam-bs
[07:28] *** antomatic has quit IRC (Read error: Operation timed out)
[07:29] *** ploop has quit IRC (Read error: Operation timed out)
[07:29] *** ivan has quit IRC (Ping timeout: 246 seconds)
[07:29] *** trs80 has quit IRC (Ping timeout: 246 seconds)
[07:29] *** rocode has quit IRC (Ping timeout: 246 seconds)
[07:29] *** Hecatz has joined #archiveteam-bs
[07:29] *** Selavi has quit IRC (Read error: Operation timed out)
[07:29] *** Selavi has joined #archiveteam-bs
[07:30] *** ivan has joined #archiveteam-bs
[07:30] *** rocode has joined #archiveteam-bs
[07:32] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:32] *** dashcloud has joined #archiveteam-bs
[07:38] *** ranma_ has joined #archiveteam-bs
[07:44] *** dboard has joined #archiveteam-bs
[07:57] *** Jonison has joined #archiveteam-bs
[07:58] *** Jonison has quit IRC (Client Quit)
[08:18] *** greenie has quit IRC (Read error: Operation timed out)
[08:29] *** jspiros has joined #archiveteam-bs
[08:29] *** timmc has joined #archiveteam-bs
[08:33] *** SadDM has joined #archiveteam-bs
[08:33] *** swebb sets mode: +o SadDM
[08:48] *** RedType has quit IRC (Ping timeout: 250 seconds)
[08:48] *** RedType has joined #archiveteam-bs
[09:03] *** j08nY has joined #archiveteam-bs
[09:10] *** koon has quit IRC (Ping timeout: 250 seconds)
[09:10] *** koon has joined #archiveteam-bs
[09:24] CLICK for Photos 📷
[10:57] is that a spambot
[11:14] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[11:24] *** BartoCH has joined #archiveteam-bs
[11:29] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[11:47] *** BartoCH has joined #archiveteam-bs
[11:56] *** trs80 has joined #archiveteam-bs
[12:24] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[12:24] *** Honno has quit IRC (Quit: Leaving)
[12:31] *** BartoCH has joined #archiveteam-bs
[12:37] *** vitzli has joined #archiveteam-bs
[12:39] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[12:40] *** BartoCH has joined #archiveteam-bs
[12:51] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:08] *** BartoCH has joined #archiveteam-bs
[13:16] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:27] *** BartoCH has joined #archiveteam-bs
[13:31] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:32] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:37] *** BartoCH has joined #archiveteam-bs
[13:43] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:53] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[13:53] morning
[13:54] *** BartoCH has joined #archiveteam-bs
[14:14] I'm up to 1162 out of 1130932
[14:14] of the pixiv save
[14:26] *** vitzli has quit IRC (Quit: Leaving)
[14:26] *** DFJustin has quit IRC (Remote host closed the connection)
[14:26] *** DFJustin has joined #archiveteam-bs
[14:26] *** swebb sets mode: +o DFJustin
[14:35] *** DFJustin has quit IRC (Remote host closed the connection)
[14:35] *** DFJustin has joined #archiveteam-bs
[14:35] *** swebb sets mode: +o DFJustin
[14:40] *** Stilett0 has joined #archiveteam-bs
[14:44] *** Fletcher has joined #archiveteam-bs
[14:57] *** Aranje has joined #archiveteam-bs
[15:37] *** DopefishJ has joined #archiveteam-bs
[15:37] *** swebb sets mode: +o DopefishJ
[15:39] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[17:19] *** superkuh has quit IRC (Read error: Operation timed out)
[18:24] This is interesting https://webrecorder.io you can download it as a WARC file afterwards, seems like an effort to make it easy for people to make WARCs mainstream ... It doesn't seem perfect though, when I get to the download page test on https://marcan.st/talks/2014_pixiv_ugoku_player/ it says connection denied
[18:27] it says Blocked I mean, I just tested with internet archive and it fails that test there too
[18:28] *** antomatic has joined #archiveteam-bs
[18:28] *** swebb sets mode: +o antomatic
[18:30] *** antomati_ has quit IRC (Ping timeout: 250 seconds)
[18:31] *** greenie has joined #archiveteam-bs
[18:48] *** superkuh has joined #archiveteam-bs
[19:15] *** tuluu has joined #archiveteam-bs
[19:20] *** tuluu_ has joined #archiveteam-bs
[19:21] *** tuluu has quit IRC (Ping timeout: 268 seconds)
[19:37] anyone else having problems uploading to archive.org?
[19:37] I'm getting a problem: Warning: Transient problem: HTTP error Will retry in 5 seconds. 10 retries
[19:39] *** tuluu_ has quit IRC (Ping timeout: 268 seconds)
[19:44] *** SHODAN_UI has joined #archiveteam-bs
[19:47] *** tuluu has joined #archiveteam-bs
[19:50] *** ndiddy has joined #archiveteam-bs
[20:24] *** Ravenloft has joined #archiveteam-bs
[20:26] *** bmcginty has quit IRC (Read error: Operation timed out)
[20:27] *** Stiletto has joined #archiveteam-bs
[20:28] *** Stilett0 has quit IRC (Read error: Operation timed out)
[20:41] *** schbirid has joined #archiveteam-bs
[20:42] *** bmcginty has joined #archiveteam-bs
[20:42] I don't personally, but I guess that could explain why my ArchiveBot jobs don't show up on IA.
[20:43] some of my*
[21:07] looooooooooooool https://blog.pinboard.in/2017/06/pinboard_acquires_delicious/
[21:07] pinboard ftw
[21:12] I'm so proud of him.
[21:34] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[21:34] *** BartoCH has joined #archiveteam-bs
[21:51] *** schbirid has quit IRC (Quit: Leaving)
[22:05] *** icedice has joined #archiveteam-bs
[22:10] Can anyone here help with wget-lua? I'm having a hard time figuring out how to do this properly and make good WARCs
[22:11] since the site I'm trying to save is kind of complex but simple in its design
[22:13] and how the IA wants its data, because right now it's not really digestible into WBM
[22:14] Annnnnnd it's broken
[22:20] *** jmtd is now known as Jon
[22:25] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[22:34] hi jrwr
[22:34] pixiv right
[22:34] is your script somewhere online?
[22:34] I'll create a warrior project for the website
[22:34] but would like to see your scripts for that
[22:34] ah I see https://github.com/JRWR/savepixiv/blob/master/download.php
[22:34] jrwr: do we have a channel yet?
[22:34] project will be here https://github.com/ArchiveTeam/pixiv-grab
[22:34] Already made a project page for it last night
[22:34] #savepixiv
[22:34] awesome
[22:34] but ya
[22:34] so far the site has been responding well
[22:47] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[23:05] *** Stilett0 has joined #archiveteam-bs
[23:58] *** dashcloud has joined #archiveteam-bs
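
Note on the chat.pixiv.net grab discussed above: the log mentions that room IDs "just increase" and that the downloader should be able to resume. A minimal Python sketch of that idea follows. It is not the code from download.php or pixiv-grab; the function names, the append-only state file, and the `fetch` callback are all assumptions made for illustration, and no real pixiv endpoint is used.

```python
from pathlib import Path
from typing import Callable


def load_done(state_file: Path) -> set[int]:
    """Read the IDs already fetched (one per line) from a hypothetical state file."""
    if not state_file.exists():
        return set()
    return {int(line) for line in state_file.read_text().split()}


def crawl(first_id: int, last_id: int,
          fetch: Callable[[int], None], state_file: Path) -> int:
    """Walk a sequential ID range, skipping IDs recorded by an earlier run.

    `fetch` is a placeholder for whatever downloads one room; if it raises,
    the ID is not recorded, so a later run retries it.
    Returns the number of IDs fetched in this run.
    """
    done = load_done(state_file)
    fetched = 0
    with state_file.open("a") as log:
        for room_id in range(first_id, last_id + 1):
            if room_id in done:
                continue
            fetch(room_id)
            # Record progress immediately so an interrupted run can resume.
            log.write(f"{room_id}\n")
            log.flush()
            fetched += 1
    return fetched
```

Appending one ID per line keeps each progress write cheap, which matters at the roughly 1.1 million rooms quoted in the log; rewriting a full state file after every room would not scale as well.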