#archiveteam-bs 2017-06-01,Thu


Who What When
joepie91xmc: Kaz: at least until recently, Googlebot did *not* actually run JS, despite many reports otherwise
it only does static analysis and knows about a very limited set of libraries and frameworks and how to extract meaning from their usage
[00:07]
xmchuh, interesting [00:08]
joepie91it's possible that this was changed recently
xmc: I was using base64-encoded content on a page (which was decoded immediately on page load) to hide certain data from Google
doxing prevention measures :P
it was unable to get past that
anyway, the statement Google put out about this was that Googlebot now "understands JS"
they never actually said that they *ran* JS
but that's how it was reported by the usual SEO-y outlets
which is why a lot of people now believe that Googlebot runs JS :P
[00:08]
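A minimal sketch of the base64 trick joepie91 describes above, assuming a page that ships the sensitive text only as a base64 payload and decodes it in the browser on load. The function name `obfuscated_page`, the `data-payload` attribute and the example address are hypothetical; the log does not show his actual markup.

```python
import base64


def obfuscated_page(secret: str) -> str:
    """Return HTML that carries `secret` only as base64 (hypothetical markup)."""
    payload = base64.b64encode(secret.encode("ascii")).decode("ascii")
    return (
        "<!doctype html>\n"
        '<p id="hidden" data-payload="' + payload + '"></p>\n'
        "<script>\n"
        "  // Decoded in the browser on page load; a crawler that does not\n"
        "  // execute JS only ever sees the base64 string.\n"
        "  var el = document.getElementById('hidden');\n"
        "  el.textContent = atob(el.dataset.payload);\n"
        "</script>\n"
    )


page = obfuscated_page("contact: someone@example.org")
# The plain text is absent from the static HTML that a non-JS crawler fetches.
print("someone@example.org" in page)
```

The sketch sticks to ASCII because the browser's `atob` returns a binary string; non-ASCII text would need an extra decoding step on the JS side.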
xmcfigures [00:10]
joepie91that screenshot suggests that this might be changing, though
alternatively, it could be their snippet crossreferencing thing fucking up and crossreferencing to a totally irrelevant 'related' page
(the thing where it shows you a snippet of text that doesn't actually originate from the page, but that exists on a page that Google considers to be 'related' or 'similar')
[00:11]
Froggingdo they actually do that O.o
that sounds counterintuitive because if you visit a page from google you would expect to see the contents of the snippet on that page
in fact a good portion of the time I probably ctrl+f for the snippet immediately after it loads :p
[00:13]
jrwrI wonder what it would take to get a backup of the cached pages that Google stores; I know they get deleted after some time. [00:27]
dashcloudfor a handful, it's easy- use anything. Beyond that, you're faced with serious hard-core problems to scrape content- Google gets really pissed about that, and will captcha you & your netblock [00:36]
jrwrYa
the entire OVH IPv6 netblock is captcha'd
[00:40]
..... (idle for 22mn)
***Stilett0 has joined #archiveteam-bs [01:02]
.... (idle for 15mn)
joepie91Frogging: yes, Google does a bunch of weird shit with snippets
Frogging: you also sometimes get results where the page doesn't contain your query
and never did
[01:17]
jrwrAww chat.pixiv.net is closing soon
the 15th
[01:18]
joepie91dashcloud: I actually once spoke with the person responsible for that mechanism, on IRC... they indeed /really/ do not like scrapers of any kind :P
(the one managing the scraper protection, that is)
[01:18]
jrwrjrwr remembers Google Code
the guy showing up angry that it was 4AM getting alert SMS because of a suspected DDoS on GCode :0
oh man Pixiv stores video data in a strange manner, it's raw AMF commands
pretty much Flash SVG+Animation commands
[01:19]
***j08nY has quit IRC (Quit: Leaving)
ZexaronS has quit IRC (Leaving)
[01:24]
..... (idle for 20mn)
jrwrinteresting, there is an API to figure out the AMF downloads
I'm going to write up some code and start downloading http://chat.pixiv.net
All I know is PHP, so it's going to be messy, but I'll have resume support
Yay, IDs are simple, they just increase!
[01:45]
It's about 1136489 rooms
about 3-4MB a room
[01:55]
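A quick back-of-the-envelope check of what those two figures imply for total storage, using only the numbers jrwr gives (1136489 rooms, 3 to 4 MB each) and decimal units (1 TB = 1,000,000 MB). The variable and function names are just for illustration.

```python
ROOMS = 1_136_489                         # room count quoted in the log
MB_PER_ROOM_LOW, MB_PER_ROOM_HIGH = 3, 4  # per-room size range quoted in the log


def total_tb(rooms: int, mb_per_room: float) -> float:
    """Total size in decimal terabytes (1 TB = 1,000,000 MB)."""
    return rooms * mb_per_room / 1_000_000


low_tb = total_tb(ROOMS, MB_PER_ROOM_LOW)
high_tb = total_tb(ROOMS, MB_PER_ROOM_HIGH)
print(f"estimated total: {low_tb:.1f} to {high_tb:.1f} TB")
# prints "estimated total: 3.4 to 4.5 TB"
```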
hook54321Does anyone know how to use this? https://github.com/bibanon/webcache-scraper [01:59]
***Stilett0 is now known as Stiletto [02:00]
......... (idle for 43mn)
jrwrDamn, they are making this hard
NSFW crap will be behind a wall without logging in
[02:43]
..... (idle for 20mn)
I will continue this when I get home
here is my shitty code, it's about 20% completed https://github.com/JRWR/savepixiv/blob/master/download.php
[03:04]
............ (idle for 58mn)
***Stiletto has quit IRC () [04:03]
.... (idle for 15mn)
Sk1d has quit IRC (Ping timeout: 250 seconds)
ndiddy has quit IRC ()
Yurume has quit IRC (Read error: Operation timed out)
Sk1d has joined #archiveteam-bs
Yurume has joined #archiveteam-bs
[04:18]
jrwrwell I have started
but
holy shit is this slow
[04:30]
I've updated the github with my working code
I'll need to convert it to a pipeline
some help would be nice
[04:39]
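Since the room IDs simply increase and jrwr wants to be able to resume, a crawl loop with a checkpoint file is one way to get there. This is a sketch in Python rather than his PHP, and it is not taken from download.php: `crawl`, `load_checkpoint`, `save_checkpoint` and the `fetch` callback are hypothetical names, and the actual download step is left to the caller.

```python
import os


def load_checkpoint(path, default):
    """Return the next ID to fetch, or `default` if no usable checkpoint exists."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return default


def save_checkpoint(path, next_id):
    """Write the checkpoint atomically so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(next_id))
    os.replace(tmp, path)


def crawl(first_id, last_id, fetch, checkpoint_path):
    """Call fetch(room_id) for each ID in order, resuming from the checkpoint."""
    room_id = load_checkpoint(checkpoint_path, first_id)
    while room_id <= last_id:
        fetch(room_id)            # caller-supplied download step; may raise
        room_id += 1
        save_checkpoint(checkpoint_path, room_id)
```

The checkpoint is only advanced after `fetch` returns, so a room that fails mid-download is retried on the next run instead of being skipped.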
..................... (idle for 1h43mn)
***Ravenloft has quit IRC (Ping timeout: 250 seconds) [06:22]
............. (idle for 1h0mn)
bwn has quit IRC (Ping timeout: 268 seconds)
logchfoo2 has quit IRC (Ping timeout: 268 seconds)
logchfoo3 starts logging #archiveteam-bs at Thu Jun 01 07:23:48 2017
logchfoo3 has joined #archiveteam-bs
Hecatz has quit IRC (Ping timeout: 268 seconds)
bwn has joined #archiveteam-bs
kurt has quit IRC (Ping timeout: 268 seconds)
kurt has joined #archiveteam-bs
K4k has quit IRC (Read error: Operation timed out)
Frogging has quit IRC (Read error: Operation timed out)
K4k has joined #archiveteam-bs
FluffyFox has joined #archiveteam-bs
ranma_ has quit IRC (Read error: Operation timed out)
timmc has quit IRC (Read error: Operation timed out)
dboard has quit IRC (Read error: Operation timed out)
antomati_ has joined #archiveteam-bs
swebb sets mode: +o antomati_
FluffyFox is now known as Frogging
SadDM has quit IRC (Read error: Operation timed out)
jspiros has quit IRC (Read error: Operation timed out)
decay has quit IRC (Read error: Operation timed out)
decay has joined #archiveteam-bs
wabu has quit IRC (Read error: Operation timed out)
wabu has joined #archiveteam-bs
antomatic has quit IRC (Read error: Operation timed out)
ploop has quit IRC (Read error: Operation timed out)
ivan has quit IRC (Ping timeout: 246 seconds)
trs80 has quit IRC (Ping timeout: 246 seconds)
rocode has quit IRC (Ping timeout: 246 seconds)
Hecatz has joined #archiveteam-bs
Selavi has quit IRC (Read error: Operation timed out)
Selavi has joined #archiveteam-bs
ivan has joined #archiveteam-bs
rocode has joined #archiveteam-bs
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[07:22]
ranma_ has joined #archiveteam-bs [07:38]
dboard has joined #archiveteam-bs [07:44]
Jonison has joined #archiveteam-bs
Jonison has quit IRC (Client Quit)
[07:57]
..... (idle for 20mn)
greenie has quit IRC (Read error: Operation timed out) [08:18]
jspiros has joined #archiveteam-bs
timmc has joined #archiveteam-bs
SadDM has joined #archiveteam-bs
swebb sets mode: +o SadDM
[08:29]
.... (idle for 15mn)
RedType has quit IRC (Ping timeout: 250 seconds)
RedType has joined #archiveteam-bs
[08:48]
.... (idle for 15mn)
j08nY has joined #archiveteam-bs [09:03]
koon has quit IRC (Ping timeout: 250 seconds)
koon has joined #archiveteam-bs
[09:10]
SanquiCLICK for Photos 📷 [09:24]
................... (idle for 1h33mn)
Nazcais that a spambot [10:57]
.... (idle for 17mn)
***BartoCH has quit IRC (Ping timeout: 260 seconds) [11:14]
BartoCH has joined #archiveteam-bs [11:24]
BartoCH has quit IRC (Ping timeout: 260 seconds) [11:29]
.... (idle for 18mn)
BartoCH has joined #archiveteam-bs [11:47]
trs80 has joined #archiveteam-bs [11:56]
...... (idle for 28mn)
BartoCH has quit IRC (Ping timeout: 260 seconds)
Honno has quit IRC (Quit: Leaving)
[12:24]
BartoCH has joined #archiveteam-bs [12:31]
vitzli has joined #archiveteam-bs
BartoCH has quit IRC (Ping timeout: 260 seconds)
BartoCH has joined #archiveteam-bs
[12:37]
BartoCH has quit IRC (Ping timeout: 260 seconds) [12:51]
.... (idle for 17mn)
BartoCH has joined #archiveteam-bs [13:08]
BartoCH has quit IRC (Ping timeout: 260 seconds) [13:16]
BartoCH has joined #archiveteam-bs
BlueMaxim has quit IRC (Quit: Leaving)
BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:27]
BartoCH has joined #archiveteam-bs [13:37]
BartoCH has quit IRC (Ping timeout: 260 seconds) [13:43]
dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) [13:53]
jrwrmorning [13:53]
***BartoCH has joined #archiveteam-bs [13:54]
..... (idle for 20mn)
jrwrI'm up to 1162 out of 1130932
of the pixiv save
[14:14]
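To put that progress in perspective, here is a rough projection of how long a single downloader would need at this pace. The counts are from the log; the 04:30 start time is an assumption based on the "well I have started" message, so the result is only an order-of-magnitude figure.

```python
from datetime import datetime

DONE = 1162        # rooms saved so far, per the log
TOTAL = 1130932    # total rooms, per the log

# Assumption: the crawl started around the 04:30 "I have started" message.
started = datetime(2017, 6, 1, 4, 30)
now = datetime(2017, 6, 1, 14, 14)

elapsed_min = (now - started).total_seconds() / 60
rate = DONE / elapsed_min                       # rooms per minute
eta_days = (TOTAL - DONE) / rate / 60 / 24      # days left at this rate

print(f"{rate:.1f} rooms/min, about {eta_days:.0f} days remaining")
```

Under that assumption the projection comes out at roughly a year, far beyond the shutdown on the 15th mentioned earlier in the log, which is consistent with the later move toward a distributed warrior project.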
***vitzli has quit IRC (Quit: Leaving)
DFJustin has quit IRC (Remote host closed the connection)
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
[14:26]
DFJustin has quit IRC (Remote host closed the connection)
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
[14:35]
Stilett0 has joined #archiveteam-bs
Fletcher has joined #archiveteam-bs
[14:40]
Aranje has joined #archiveteam-bs [14:57]
......... (idle for 40mn)
DopefishJ has joined #archiveteam-bs
swebb sets mode: +o DopefishJ
DFJustin has quit IRC (Ping timeout: 260 seconds)
[15:37]
..................... (idle for 1h40mn)
superkuh has quit IRC (Read error: Operation timed out) [17:19]
.............. (idle for 1h5mn)
kittymeowThis is interesting https://webrecorder.io you can download it as a WARC file afterwards, seems like an effort to make it easy for people to make WARCs mainstream ... It doesn't seem perfect though, when I get to the download page test on https://marcan.st/talks/2014_pixiv_ugoku_player/ it says connection denied
it says Blocked I mean, I just tested with internet archive and it fails that test there too
[18:24]
***antomatic has joined #archiveteam-bs
swebb sets mode: +o antomatic
antomati_ has quit IRC (Ping timeout: 250 seconds)
greenie has joined #archiveteam-bs
[18:28]
.... (idle for 17mn)
superkuh has joined #archiveteam-bs [18:48]
...... (idle for 27mn)
tuluu has joined #archiveteam-bs [19:15]
tuluu_ has joined #archiveteam-bs
tuluu has quit IRC (Ping timeout: 268 seconds)
[19:20]
.... (idle for 16mn)
godaneanyone else having problems uploading to archive.org?
i'm getting a problem : Warning: Transient problem: HTTP error Will retry in 5 seconds. 10 retries
[19:37]
***tuluu_ has quit IRC (Ping timeout: 268 seconds) [19:39]
SHODAN_UI has joined #archiveteam-bs
tuluu has joined #archiveteam-bs
ndiddy has joined #archiveteam-bs
[19:44]
....... (idle for 34mn)
Ravenloft has joined #archiveteam-bs
bmcginty has quit IRC (Read error: Operation timed out)
Stiletto has joined #archiveteam-bs
Stilett0 has quit IRC (Read error: Operation timed out)
[20:24]
schbirid has joined #archiveteam-bs
bmcginty has joined #archiveteam-bs
[20:41]
JAAI don't personally, but I guess that could explain why my ArchiveBot jobs don't show up on IA.
some of my*
[20:42]
..... (idle for 24mn)
schbiridlooooooooooooool https://blog.pinboard.in/2017/06/pinboard_acquires_delicious/
pinboard ftw
[21:07]
timmcI'm so proud of him. [21:12]
..... (idle for 22mn)
***BartoCH has quit IRC (Ping timeout: 260 seconds)
BartoCH has joined #archiveteam-bs
[21:34]
.... (idle for 17mn)
schbirid has quit IRC (Quit: Leaving) [21:51]
icedice has joined #archiveteam-bs [22:05]
jrwrAnyone here able to help with wget-lua? I'm having a hard time figuring out how to do this properly and make good WARCs
since the site I'm trying to save is kind of complex but simple in its design
and how the IA wants its data, because right now it's not really digestible into WBM
Annnnnnd it's broken
[22:10]
***jmtd is now known as Jon [22:20]
Stiletto has quit IRC (Ping timeout: 246 seconds) [22:25]
arkiverhi jrwr
pixiv right
are your scripts somewhere online?
I'll create a warrior project for the website
but would like to see your scripts for that
ah I see https://github.com/JRWR/savepixiv/blob/master/download.php
jrwr: do we have a channel yet?
project will be here https://github.com/ArchiveTeam/pixiv-grab
[22:34]
jrwrAlready made a project page for it last night
#savepixiv
[22:34]
arkiverawesome [22:34]
jrwrbut ya
so far the site has been responding well
[22:34]
***SHODAN_UI has quit IRC (Remote host closed the connection) [22:47]
.... (idle for 18mn)
Stilett0 has joined #archiveteam-bs [23:05]
........... (idle for 53mn)
dashcloud has joined #archiveteam-bs [23:58]