#archiveteam-bs 2017-06-01,Thu


Time Nickname Message
00:07 🔗 joepie91 xmc: Kaz: at least until recently, Googlebot did *not* actually run JS, despite many reports otherwise
00:07 🔗 joepie91 it only does static analysis and knows about a very limited set of libraries and frameworks and how to extract meaning from their usage
00:08 🔗 xmc huh, interesting
00:08 🔗 joepie91 it's possible that this was changed recently
00:08 🔗 joepie91 xmc: I was using base64-encoded content on a page (which was decoded immediately on page load) to hide certain data from Google
00:08 🔗 joepie91 doxing prevention measures :P
00:08 🔗 joepie91 it was unable to get past that
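The base64 trick described above can be sketched roughly as follows. This is not joepie91's actual page; the `hide` and `reveal` helpers, the `data-b64` attribute, and the sample text are all hypothetical. The idea is only that the served markup never contains the plaintext, so a crawler that reads the HTML without executing script has nothing to index.

```python
import base64

def hide(text: str) -> str:
    """Return an HTML fragment that carries `text` only in base64 form."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    # Assumption: a small script on the page decodes data-b64 at load time.
    return f'<span class="hidden-data" data-b64="{payload}"></span>'

def reveal(payload: str) -> str:
    """Stand-in for the decode step a browser would do on page load."""
    return base64.b64decode(payload).decode("utf-8")

secret = "Jane Example, 12 Hypothetical Street"  # made-up sample data
fragment = hide(secret)
payload = fragment.split('data-b64="')[1].split('"')[0]

print(secret in fragment)         # the plaintext is absent from the markup
print(reveal(payload) == secret)  # but it is recoverable by decoding
```

This only hides the text from clients that do not decode it; anything that runs the page's script, or simply base64-decodes the attribute, sees the content.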
00:10 🔗 joepie91 anyway, the statement Google put out about this was that Googlebot now "understands JS"
00:10 🔗 joepie91 they never actually said that they *ran* JS
00:10 🔗 joepie91 but that's how it was reported by the usual SEO-y outlets
00:10 🔗 joepie91 which is why a lot of people now believe that Googlebot runs JS :P
00:10 🔗 xmc figures
00:11 🔗 joepie91 that screenshot suggests that this might be changing, though
00:12 🔗 joepie91 alternatively, it could be their snippet crossreferencing thing fucking up and crossreferencing to a totally irrelevant 'related' page
00:12 🔗 joepie91 (the thing where it shows you a snippet of text that doesn't actually originate from the page, but that exists on a page that Google considers to be 'related' or 'similar')
00:13 🔗 Frogging do they actually do that O.o
00:14 🔗 Frogging that sounds counterintuitive because if you visit a page from google you would expect to see the contents of the snippet on that page
00:14 🔗 Frogging in fact a good portion of the time I probably ctrl+f for the snippet immediately after it loads :p
00:27 🔗 jrwr I wonder what it would take to get a backup of the cached sites that Google stores, I know they get deleted after some time.
00:36 🔗 dashcloud for a handful, it's easy- use anything. Beyond that, you're faced with serious hard-core problems to scrape content- Google gets really pissed about that, and will captcha you & your netblock
00:40 🔗 jrwr Ya
00:40 🔗 jrwr the entire OVH Ipv6 netblock is captcha'd
01:02 🔗 Stilett0 has joined #archiveteam-bs
01:17 🔗 joepie91 Frogging: yes, Google does a bunch of weird shit with snippets
01:17 🔗 joepie91 Frogging: you also sometimes get results where the page doesn't contain your query
01:17 🔗 joepie91 and never did
01:18 🔗 jrwr Aww chat.pixiv.net is closing soon
01:18 🔗 jrwr the 15th
01:18 🔗 joepie91 dashcloud: I actually once spoke with the person responsible for that mechanism, on IRC... they indeed /really/ do not like scrapers of any kind :P
01:18 🔗 joepie91 (the one managing the scraper protection, that is)
01:19 🔗 * jrwr remembers Google Code
01:20 🔗 jrwr the guy showing up angry that it was 4AM getting alert SMS because of a suspected DDoS on GCode :0
01:23 🔗 jrwr oh man Pixiv stores video data in a strange manner, it's raw AMF commands
01:23 🔗 jrwr pretty much Flash SVG+Animation commands
01:24 🔗 j08nY has quit IRC (Quit: Leaving)
01:25 🔗 ZexaronS has quit IRC (Leaving)
01:45 🔗 jrwr interesting, there is an API to figure out the AMF downloads
01:45 🔗 jrwr I'm going to write up some code and start downloading http://chat.pixiv.net
01:49 🔗 jrwr All I know is PHP, so it's going to be messy, but I'll have resume support
01:50 🔗 jrwr Yay, IDs are simple, they just increase!
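Increasing IDs plus resume support could look something like the sketch below. It is not the code in jrwr's repository; `crawl`, the `fetch` callback, and the `last_id.txt` state file are hypothetical names, and the real download step (URLs, API calls) is deliberately left out. The last finished ID is written to disk after each room, so a restarted run picks up where the previous one stopped.

```python
import os
import tempfile
from typing import Callable

def crawl(last_id: int, fetch: Callable[[int], None], state_path: str) -> int:
    """Fetch IDs in increasing order up to last_id, resuming from saved state."""
    start = 1
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = int(f.read().strip() or 0) + 1
    done = 0
    for room_id in range(start, last_id + 1):
        fetch(room_id)  # hypothetical download step for one room
        with open(state_path, "w") as f:
            f.write(str(room_id))  # checkpoint after every finished room
        done += 1
    return done

# Demo with a fake fetch that only records which IDs were requested.
seen = []
state = os.path.join(tempfile.mkdtemp(), "last_id.txt")
first = crawl(3, seen.append, state)   # first run covers rooms 1 to 3
second = crawl(5, seen.append, state)  # second run resumes at room 4
print(first, second, seen)
```

Checkpointing after every room keeps the sketch simple; a real run would also have to decide how to treat IDs that fail or do not exist.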
01:55 🔗 jrwr It's about 1136489 rooms
01:55 🔗 jrwr about 3-4MB a room
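For scale, the two figures above multiply out to a few terabytes. A quick back-of-the-envelope check, assuming decimal units (1 TB = 1,000,000 MB) and that the 3-4 MB per room estimate holds across all rooms:

```python
ROOMS = 1_136_489
MB_PER_ROOM = (3, 4)  # low and high estimates from the log

# Total size in decimal terabytes for each estimate.
low_tb, high_tb = (ROOMS * mb / 1_000_000 for mb in MB_PER_ROOM)
print(f"{low_tb:.1f} to {high_tb:.1f} TB")  # prints "3.4 to 4.5 TB"
```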
01:59 🔗 hook54321 Does anyone know how to use this? https://github.com/bibanon/webcache-scraper
02:00 🔗 Stilett0 is now known as Stiletto
02:43 🔗 jrwr Damn, they are making this hard
02:44 🔗 jrwr NSFW crap will be behind a wall without logging in
03:04 🔗 jrwr I will continue this when I get home
03:05 🔗 jrwr here is my shitty code, it's about 20% complete https://github.com/JRWR/savepixiv/blob/master/download.php
04:03 🔗 Stiletto has quit IRC ()
04:18 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:20 🔗 ndiddy has quit IRC ()
04:24 🔗 Yurume has quit IRC (Read error: Operation timed out)
04:25 🔗 Sk1d has joined #archiveteam-bs
04:27 🔗 Yurume has joined #archiveteam-bs
04:30 🔗 jrwr well I have started
04:30 🔗 jrwr but
04:30 🔗 jrwr holy shit is this slow
04:39 🔗 jrwr I've updated the github with my working code
04:39 🔗 jrwr I'll need to convert it to a pipeline
04:39 🔗 jrwr some help would be nice
06:22 🔗 Ravenloft has quit IRC (Ping timeout: 250 seconds)
07:22 🔗 bwn has quit IRC (Ping timeout: 268 seconds)
07:22 🔗 logchfoo2 has quit IRC (Ping timeout: 268 seconds)
07:23 🔗 logchfoo3 starts logging #archiveteam-bs at Thu Jun 01 07:23:48 2017
07:23 🔗 logchfoo3 has joined #archiveteam-bs
07:24 🔗 Hecatz has quit IRC (Ping timeout: 268 seconds)
07:25 🔗 bwn has joined #archiveteam-bs
07:27 🔗 kurt has quit IRC (Ping timeout: 268 seconds)
07:27 🔗 kurt has joined #archiveteam-bs
07:27 🔗 K4k has quit IRC (Read error: Operation timed out)
07:27 🔗 Frogging has quit IRC (Read error: Operation timed out)
07:27 🔗 K4k has joined #archiveteam-bs
07:27 🔗 FluffyFox has joined #archiveteam-bs
07:27 🔗 ranma_ has quit IRC (Read error: Operation timed out)
07:27 🔗 timmc has quit IRC (Read error: Operation timed out)
07:27 🔗 dboard has quit IRC (Read error: Operation timed out)
07:27 🔗 antomati_ has joined #archiveteam-bs
07:27 🔗 swebb sets mode: +o antomati_
07:28 🔗 FluffyFox is now known as Frogging
07:28 🔗 SadDM has quit IRC (Read error: Operation timed out)
07:28 🔗 jspiros has quit IRC (Read error: Operation timed out)
07:28 🔗 decay has quit IRC (Read error: Operation timed out)
07:28 🔗 decay has joined #archiveteam-bs
07:28 🔗 wabu has quit IRC (Read error: Operation timed out)
07:28 🔗 wabu has joined #archiveteam-bs
07:28 🔗 antomatic has quit IRC (Read error: Operation timed out)
07:29 🔗 ploop has quit IRC (Read error: Operation timed out)
07:29 🔗 ivan has quit IRC (Ping timeout: 246 seconds)
07:29 🔗 trs80 has quit IRC (Ping timeout: 246 seconds)
07:29 🔗 rocode has quit IRC (Ping timeout: 246 seconds)
07:29 🔗 Hecatz has joined #archiveteam-bs
07:29 🔗 Selavi has quit IRC (Read error: Operation timed out)
07:29 🔗 Selavi has joined #archiveteam-bs
07:30 🔗 ivan has joined #archiveteam-bs
07:30 🔗 rocode has joined #archiveteam-bs
07:32 🔗 dashcloud has quit IRC (Read error: Operation timed out)
07:32 🔗 dashcloud has joined #archiveteam-bs
07:38 🔗 ranma_ has joined #archiveteam-bs
07:44 🔗 dboard has joined #archiveteam-bs
07:57 🔗 Jonison has joined #archiveteam-bs
07:58 🔗 Jonison has quit IRC (Client Quit)
08:18 🔗 greenie has quit IRC (Read error: Operation timed out)
08:29 🔗 jspiros has joined #archiveteam-bs
08:29 🔗 timmc has joined #archiveteam-bs
08:33 🔗 SadDM has joined #archiveteam-bs
08:33 🔗 swebb sets mode: +o SadDM
08:48 🔗 RedType has quit IRC (Ping timeout: 250 seconds)
08:48 🔗 RedType has joined #archiveteam-bs
09:03 🔗 j08nY has joined #archiveteam-bs
09:10 🔗 koon has quit IRC (Ping timeout: 250 seconds)
09:10 🔗 koon has joined #archiveteam-bs
09:24 🔗 Sanqui CLICK for Photos 📷
10:57 🔗 Nazca is that a spambot
11:14 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
11:24 🔗 BartoCH has joined #archiveteam-bs
11:29 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
11:47 🔗 BartoCH has joined #archiveteam-bs
11:56 🔗 trs80 has joined #archiveteam-bs
12:24 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
12:24 🔗 Honno has quit IRC (Quit: Leaving)
12:31 🔗 BartoCH has joined #archiveteam-bs
12:37 🔗 vitzli has joined #archiveteam-bs
12:39 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
12:40 🔗 BartoCH has joined #archiveteam-bs
12:51 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:08 🔗 BartoCH has joined #archiveteam-bs
13:16 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:27 🔗 BartoCH has joined #archiveteam-bs
13:31 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:32 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:37 🔗 BartoCH has joined #archiveteam-bs
13:43 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:53 🔗 dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
13:53 🔗 jrwr morning
13:54 🔗 BartoCH has joined #archiveteam-bs
14:14 🔗 jrwr I'm up to 1162 out of 1130932
14:14 🔗 jrwr of the pixiv save
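A rough projection from that progress report shows why a single downloader is too slow. It rests on two assumptions that the log does not confirm: that the run started around the 04:30 "well I have started" message, and that it has kept a steady pace since.

```python
from datetime import timedelta

done, total = 1162, 1_130_932
# Assumption: started at about 04:30 and ran without pause until 14:14.
elapsed = timedelta(hours=14, minutes=14) - timedelta(hours=4, minutes=30)

rate_per_hour = done / (elapsed.total_seconds() / 3600)
remaining_days = (total - done) / rate_per_hour / 24
print(f"{rate_per_hour:.0f} rooms/hour, about {remaining_days:.0f} days left")
```

At roughly 120 rooms an hour the remaining rooms would take over a year, far past the closing date on the 15th, so the work would have to be spread across many workers.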
14:26 🔗 vitzli has quit IRC (Quit: Leaving)
14:26 🔗 DFJustin has quit IRC (Remote host closed the connection)
14:26 🔗 DFJustin has joined #archiveteam-bs
14:26 🔗 swebb sets mode: +o DFJustin
14:35 🔗 DFJustin has quit IRC (Remote host closed the connection)
14:35 🔗 DFJustin has joined #archiveteam-bs
14:35 🔗 swebb sets mode: +o DFJustin
14:40 🔗 Stilett0 has joined #archiveteam-bs
14:44 🔗 Fletcher has joined #archiveteam-bs
14:57 🔗 Aranje has joined #archiveteam-bs
15:37 🔗 DopefishJ has joined #archiveteam-bs
15:37 🔗 swebb sets mode: +o DopefishJ
15:39 🔗 DFJustin has quit IRC (Ping timeout: 260 seconds)
17:19 🔗 superkuh has quit IRC (Read error: Operation timed out)
18:24 🔗 kittymeow This is interesting https://webrecorder.io you can download it as a WARC file afterwards, seems like an effort to make it easy for people to make WARCs mainstream ... It doesn't seem perfect though, when I get to the download page test on https://marcan.st/talks/2014_pixiv_ugoku_player/ it says connection denied
18:27 🔗 kittymeow it says Blocked I mean, I just tested with internet archive and it fails that test there too
18:28 🔗 antomatic has joined #archiveteam-bs
18:28 🔗 swebb sets mode: +o antomatic
18:30 🔗 antomati_ has quit IRC (Ping timeout: 250 seconds)
18:31 🔗 greenie has joined #archiveteam-bs
18:48 🔗 superkuh has joined #archiveteam-bs
19:15 🔗 tuluu has joined #archiveteam-bs
19:20 🔗 tuluu_ has joined #archiveteam-bs
19:21 🔗 tuluu has quit IRC (Ping timeout: 268 seconds)
19:37 🔗 godane anyone else having problems uploading to archive.org?
19:37 🔗 godane i'm getting a problem : Warning: Transient problem: HTTP error Will retry in 5 seconds. 10 retries
19:39 🔗 tuluu_ has quit IRC (Ping timeout: 268 seconds)
19:44 🔗 SHODAN_UI has joined #archiveteam-bs
19:47 🔗 tuluu has joined #archiveteam-bs
19:50 🔗 ndiddy has joined #archiveteam-bs
20:24 🔗 Ravenloft has joined #archiveteam-bs
20:26 🔗 bmcginty has quit IRC (Read error: Operation timed out)
20:27 🔗 Stiletto has joined #archiveteam-bs
20:28 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
20:41 🔗 schbirid has joined #archiveteam-bs
20:42 🔗 bmcginty has joined #archiveteam-bs
20:42 🔗 JAA I don't personally, but I guess that could explain why my ArchiveBot jobs don't show up on IA.
20:43 🔗 JAA some of my*
21:07 🔗 schbirid looooooooooooool https://blog.pinboard.in/2017/06/pinboard_acquires_delicious/
21:07 🔗 schbirid pinboard ftw
21:12 🔗 timmc I'm so proud of him.
21:34 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
21:34 🔗 BartoCH has joined #archiveteam-bs
21:51 🔗 schbirid has quit IRC (Quit: Leaving)
22:05 🔗 icedice has joined #archiveteam-bs
22:10 🔗 jrwr Can anyone here help with wget-lua? I'm having a hard time figuring out how to do this properly and make good WARCs
22:11 🔗 jrwr since the site I'm trying to save is kind of complex but simple in its design
22:13 🔗 jrwr and how the IA wants its data, because right now it's not really digestible into WBM
22:14 🔗 jrwr Annnnnnd it's broken
22:20 🔗 jmtd is now known as Jon
22:25 🔗 Stiletto has quit IRC (Ping timeout: 246 seconds)
22:34 🔗 arkiver hi jrwr
22:34 🔗 arkiver pixiv right
22:34 🔗 arkiver is your script somewhere online?
22:34 🔗 arkiver I'll create a warrior project for the website
22:34 🔗 arkiver but would like to see your scripts for that
22:34 🔗 arkiver ah I see https://github.com/JRWR/savepixiv/blob/master/download.php
22:34 🔗 arkiver jrwr: do we have a channel yet?
22:34 🔗 arkiver project will be here https://github.com/ArchiveTeam/pixiv-grab
22:34 🔗 jrwr Already made a project page for it last night
22:34 🔗 jrwr #savepixiv
22:34 🔗 arkiver awesome
22:34 🔗 jrwr but ya
22:34 🔗 jrwr so far the site has been responding well
22:47 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
23:05 🔗 Stilett0 has joined #archiveteam-bs
23:58 🔗 dashcloud has joined #archiveteam-bs
