Time |
Nickname |
Message |
00:07
🔗
|
joepie91 |
xmc: Kaz: at least until recently, Googlebot did *not* actually run JS, despite many reports otherwise |
00:07
🔗
|
joepie91 |
it only does static analysis and knows about a very limited set of libraries and frameworks and how to extract meaning from their usage |
00:08
🔗
|
xmc |
huh, interesting |
00:08
🔗
|
joepie91 |
it's possible that this was changed recently |
00:08
🔗
|
joepie91 |
xmc: I was using base64-encoded content on a page (which was decoded immediately on page load) to hide certain data from Google |
00:08
🔗
|
joepie91 |
doxing prevention measures :P |
00:08
🔗
|
joepie91 |
it was unable to get past that |
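Editor's sketch of the base64 trick joepie91 describes above. This is not his actual code: the function name `obfuscate`, the `data-b64` attribute and the element id are assumptions made for illustration, and it assumes ASCII-only text, because the browser's `atob` does not decode UTF-8.

```python
import base64


def obfuscate(text, element_id="hidden"):
    """Return an HTML fragment that only shows `text` once JavaScript has run.

    The plain text never appears in the markup. A crawler that does not
    execute scripts sees only the base64 blob.
    """
    encoded = base64.b64encode(text.encode("ascii")).decode("ascii")
    return (
        f'<span id="{element_id}" data-b64="{encoded}"></span>\n'
        "<script>\n"
        f'  var el = document.getElementById("{element_id}");\n'
        "  el.textContent = atob(el.dataset.b64);  // decoded on page load\n"
        "</script>"
    )


page = obfuscate("jane@example.org")
print(page)
```

Anything that actually runs the script recovers the text, so this only hides content from crawlers that read the static markup.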
00:10
🔗
|
joepie91 |
anyway, the statement Google put out about this was that Googlebot now "understands JS" |
00:10
🔗
|
joepie91 |
they never actually said that they *ran* JS |
00:10
🔗
|
joepie91 |
but that's how it was reported by the usual SEO-y outlets |
00:10
🔗
|
joepie91 |
which is why a lot of people now believe that Googlebot runs JS :P |
00:10
🔗
|
xmc |
figures |
00:11
🔗
|
joepie91 |
that screenshot suggests that this might be changing, though |
00:12
🔗
|
joepie91 |
alternatively, it could be their snippet crossreferencing thing fucking up and crossreferencing to a totally irrelevant 'related' page |
00:12
🔗
|
joepie91 |
(the thing where it shows you a snippet of text that doesn't actually originate from the page, but that exists on a page that Google considers to be 'related' or 'similar') |
00:13
🔗
|
Frogging |
do they actually do that O.o |
00:14
🔗
|
Frogging |
that sounds counterintuitive because if you visit a page from google you would expect to see the contents of the snippet on that page |
00:14
🔗
|
Frogging |
in fact a good portion of the time I probably ctrl+f for the snippet immediately after it loads :p |
00:27
🔗
|
jrwr |
I wonder what it would take to get a backup of the cached sites that Google stores. I know they get deleted after some time. |
00:36
🔗
|
dashcloud |
for a handful, it's easy: use anything. Beyond that, you face serious hard-core problems scraping content: Google gets really pissed about that, and will captcha you & your netblock |
00:40
🔗
|
jrwr |
Ya |
00:40
🔗
|
jrwr |
the entire OVH IPv6 netblock is captcha'd |
01:02
🔗
|
|
Stilett0 has joined #archiveteam-bs |
01:17
🔗
|
joepie91 |
Frogging: yes, Google does a bunch of weird shit with snippets |
01:17
🔗
|
joepie91 |
Frogging: you also sometimes get results where the page doesn't contain your query |
01:17
🔗
|
joepie91 |
and never did |
01:18
🔗
|
jrwr |
Aww chat.pixiv.net is closing soon |
01:18
🔗
|
jrwr |
the 15th |
01:18
🔗
|
joepie91 |
dashcloud: I actually once spoke with the person responsible for that mechanism, on IRC... they indeed /really/ do not like scrapers of any kind :P |
01:18
🔗
|
joepie91 |
(the one managing the scraper protection, that is) |
01:19
🔗
|
* |
jrwr remembers Google Code |
01:20
🔗
|
jrwr |
the guy showing up angry that it was 4AM getting alert SMS because of a suspected DDoS on GCode :0 |
01:23
🔗
|
jrwr |
oh man, Pixiv stores video data in a strange manner, it's raw AMF commands |
01:23
🔗
|
jrwr |
pretty much Flash SVG+Animation commands |
01:24
🔗
|
|
j08nY has quit IRC (Quit: Leaving) |
01:25
🔗
|
|
ZexaronS has quit IRC (Leaving) |
01:45
🔗
|
jrwr |
interesting, there is an API to figure out the AMF downloads |
01:45
🔗
|
jrwr |
I'm going to write up some code and start downloading http://chat.pixiv.net |
01:49
🔗
|
jrwr |
All I know is PHP, so it's going to be messy, but I'll have resume support |
01:50
🔗
|
jrwr |
Yay, IDs are simple, they just increase! |
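Editor's sketch of why increasing IDs make a resumable crawl simple. It is written in Python rather than the PHP jrwr mentions, and it is not taken from his download.php: the names `crawl`, `fetch` and the checkpoint file are hypothetical. It assumes every room can be addressed by one integer ID.

```python
import os


def load_checkpoint(path, default):
    """Return the next ID to fetch, or `default` if no usable checkpoint exists."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return default


def save_checkpoint(path, next_id):
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(next_id))
    os.replace(tmp, path)


def crawl(first_id, last_id, fetch, checkpoint_path):
    """Call fetch(room_id) for each ID in order, resuming after the last success."""
    start = max(first_id, load_checkpoint(checkpoint_path, first_id))
    for room_id in range(start, last_id + 1):
        fetch(room_id)  # hypothetical downloader supplied by the caller
        save_checkpoint(checkpoint_path, room_id + 1)
```

If `fetch` raises partway through, the checkpoint still points at the failed ID, so running `crawl` again retries that room and carries on from there.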
01:55
🔗
|
jrwr |
It's about 1136489 rooms |
01:55
🔗
|
jrwr |
about 3-4MB a room |
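Editor's back-of-the-envelope check of what those two figures imply for total size. It assumes decimal units (1 TB = 1,000,000 MB) and that the 3-4 MB average holds across all rooms, which the log does not confirm.

```python
ROOMS = 1_136_489      # room count quoted in the log
MB_PER_ROOM = (3, 4)   # low and high per-room estimates quoted in the log


def total_tb(rooms, mb_per_room):
    """Total size in decimal terabytes."""
    return rooms * mb_per_room / 1_000_000


low, high = (total_tb(ROOMS, mb) for mb in MB_PER_ROOM)
print(f"{low:.1f} to {high:.1f} TB")  # prints "3.4 to 4.5 TB"
```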
01:59
🔗
|
hook54321 |
Does anyone know how to use this? https://github.com/bibanon/webcache-scraper |
02:00
🔗
|
|
Stilett0 is now known as Stiletto |
02:43
🔗
|
jrwr |
Damn, they are making this hard |
02:44
🔗
|
jrwr |
NSFW crap will be behind a wall without logging in |
03:04
🔗
|
jrwr |
I will continue this when I get home |
03:05
🔗
|
jrwr |
here is my shitty code, its about 20% completed https://github.com/JRWR/savepixiv/blob/master/download.php |
04:03
🔗
|
|
Stiletto has quit IRC () |
04:18
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:20
🔗
|
|
ndiddy has quit IRC () |
04:24
🔗
|
|
Yurume has quit IRC (Read error: Operation timed out) |
04:25
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:27
🔗
|
|
Yurume has joined #archiveteam-bs |
04:30
🔗
|
jrwr |
well I have started |
04:30
🔗
|
jrwr |
but |
04:30
🔗
|
jrwr |
holy shit is this slow |
04:39
🔗
|
jrwr |
I've updated the github with my working code |
04:39
🔗
|
jrwr |
I'll need to convert it to a pipeline |
04:39
🔗
|
jrwr |
some help would be nice |
06:22
🔗
|
|
Ravenloft has quit IRC (Ping timeout: 250 seconds) |
07:22
🔗
|
|
bwn has quit IRC (Ping timeout: 268 seconds) |
07:22
🔗
|
|
logchfoo2 has quit IRC (Ping timeout: 268 seconds) |
07:23
🔗
|
|
logchfoo3 starts logging #archiveteam-bs at Thu Jun 01 07:23:48 2017 |
07:23
🔗
|
|
logchfoo3 has joined #archiveteam-bs |
07:24
🔗
|
|
Hecatz has quit IRC (Ping timeout: 268 seconds) |
07:25
🔗
|
|
bwn has joined #archiveteam-bs |
07:27
🔗
|
|
kurt has quit IRC (Ping timeout: 268 seconds) |
07:27
🔗
|
|
kurt has joined #archiveteam-bs |
07:27
🔗
|
|
K4k has quit IRC (Read error: Operation timed out) |
07:27
🔗
|
|
Frogging has quit IRC (Read error: Operation timed out) |
07:27
🔗
|
|
K4k has joined #archiveteam-bs |
07:27
🔗
|
|
FluffyFox has joined #archiveteam-bs |
07:27
🔗
|
|
ranma_ has quit IRC (Read error: Operation timed out) |
07:27
🔗
|
|
timmc has quit IRC (Read error: Operation timed out) |
07:27
🔗
|
|
dboard has quit IRC (Read error: Operation timed out) |
07:27
🔗
|
|
antomati_ has joined #archiveteam-bs |
07:27
🔗
|
|
swebb sets mode: +o antomati_ |
07:28
🔗
|
|
FluffyFox is now known as Frogging |
07:28
🔗
|
|
SadDM has quit IRC (Read error: Operation timed out) |
07:28
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
07:28
🔗
|
|
decay has quit IRC (Read error: Operation timed out) |
07:28
🔗
|
|
decay has joined #archiveteam-bs |
07:28
🔗
|
|
wabu has quit IRC (Read error: Operation timed out) |
07:28
🔗
|
|
wabu has joined #archiveteam-bs |
07:28
🔗
|
|
antomatic has quit IRC (Read error: Operation timed out) |
07:29
🔗
|
|
ploop has quit IRC (Read error: Operation timed out) |
07:29
🔗
|
|
ivan has quit IRC (Ping timeout: 246 seconds) |
07:29
🔗
|
|
trs80 has quit IRC (Ping timeout: 246 seconds) |
07:29
🔗
|
|
rocode has quit IRC (Ping timeout: 246 seconds) |
07:29
🔗
|
|
Hecatz has joined #archiveteam-bs |
07:29
🔗
|
|
Selavi has quit IRC (Read error: Operation timed out) |
07:29
🔗
|
|
Selavi has joined #archiveteam-bs |
07:30
🔗
|
|
ivan has joined #archiveteam-bs |
07:30
🔗
|
|
rocode has joined #archiveteam-bs |
07:32
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
07:32
🔗
|
|
dashcloud has joined #archiveteam-bs |
07:38
🔗
|
|
ranma_ has joined #archiveteam-bs |
07:44
🔗
|
|
dboard has joined #archiveteam-bs |
07:57
🔗
|
|
Jonison has joined #archiveteam-bs |
07:58
🔗
|
|
Jonison has quit IRC (Client Quit) |
08:18
🔗
|
|
greenie has quit IRC (Read error: Operation timed out) |
08:29
🔗
|
|
jspiros has joined #archiveteam-bs |
08:29
🔗
|
|
timmc has joined #archiveteam-bs |
08:33
🔗
|
|
SadDM has joined #archiveteam-bs |
08:33
🔗
|
|
swebb sets mode: +o SadDM |
08:48
🔗
|
|
RedType has quit IRC (Ping timeout: 250 seconds) |
08:48
🔗
|
|
RedType has joined #archiveteam-bs |
09:03
🔗
|
|
j08nY has joined #archiveteam-bs |
09:10
🔗
|
|
koon has quit IRC (Ping timeout: 250 seconds) |
09:10
🔗
|
|
koon has joined #archiveteam-bs |
09:24
🔗
|
Sanqui |
CLICK for Photos 📷 |
10:57
🔗
|
Nazca |
is that a spambot |
11:14
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
11:24
🔗
|
|
BartoCH has joined #archiveteam-bs |
11:29
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
11:47
🔗
|
|
BartoCH has joined #archiveteam-bs |
11:56
🔗
|
|
trs80 has joined #archiveteam-bs |
12:24
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
12:24
🔗
|
|
Honno has quit IRC (Quit: Leaving) |
12:31
🔗
|
|
BartoCH has joined #archiveteam-bs |
12:37
🔗
|
|
vitzli has joined #archiveteam-bs |
12:39
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
12:40
🔗
|
|
BartoCH has joined #archiveteam-bs |
12:51
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:08
🔗
|
|
BartoCH has joined #archiveteam-bs |
13:16
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:27
🔗
|
|
BartoCH has joined #archiveteam-bs |
13:31
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:32
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:37
🔗
|
|
BartoCH has joined #archiveteam-bs |
13:43
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:53
🔗
|
|
dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) |
13:53
🔗
|
jrwr |
morning |
13:54
🔗
|
|
BartoCH has joined #archiveteam-bs |
14:14
🔗
|
jrwr |
I'm up to 1162 out of 1130932 |
14:14
🔗
|
jrwr |
of the pixiv save |
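Editor's rough estimate of what that progress figure means for a single-machine crawl. It assumes the download began at the 04:30 "well I have started" message and ran without interruption until 14:14. Both are guesses from the timestamps in this log, not something jrwr states.

```python
from datetime import datetime


def eta_days(done, total, started, now):
    """Days left if the average rate so far holds for the remaining items."""
    elapsed_min = (now - started).total_seconds() / 60
    rate_per_min = done / elapsed_min
    return (total - done) / rate_per_min / 60 / 24


started = datetime(2017, 6, 1, 4, 30)  # assumed start of the crawl
now = datetime(2017, 6, 1, 14, 14)     # time of the progress report
days = eta_days(1162, 1130932, started, now)
print(f"roughly {days:.0f} days remaining at the current rate")
```

Under those assumptions the remaining time is on the order of a year, while the site is reported above to be closing on the 15th.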
14:26
🔗
|
|
vitzli has quit IRC (Quit: Leaving) |
14:26
🔗
|
|
DFJustin has quit IRC (Remote host closed the connection) |
14:26
🔗
|
|
DFJustin has joined #archiveteam-bs |
14:26
🔗
|
|
swebb sets mode: +o DFJustin |
14:35
🔗
|
|
DFJustin has quit IRC (Remote host closed the connection) |
14:35
🔗
|
|
DFJustin has joined #archiveteam-bs |
14:35
🔗
|
|
swebb sets mode: +o DFJustin |
14:40
🔗
|
|
Stilett0 has joined #archiveteam-bs |
14:44
🔗
|
|
Fletcher has joined #archiveteam-bs |
14:57
🔗
|
|
Aranje has joined #archiveteam-bs |
15:37
🔗
|
|
DopefishJ has joined #archiveteam-bs |
15:37
🔗
|
|
swebb sets mode: +o DopefishJ |
15:39
🔗
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
17:19
🔗
|
|
superkuh has quit IRC (Read error: Operation timed out) |
18:24
🔗
|
kittymeow |
This is interesting https://webrecorder.io you can download it as a WARC file afterwards, seems like an effort to make it easy for people to make WARCs mainstream ... It doesn't seem perfect though, when I get to the download page test on https://marcan.st/talks/2014_pixiv_ugoku_player/ it says connection denied |
18:27
🔗
|
kittymeow |
it says Blocked I mean, I just tested with internet archive and it fails that test there too |
18:28
🔗
|
|
antomatic has joined #archiveteam-bs |
18:28
🔗
|
|
swebb sets mode: +o antomatic |
18:30
🔗
|
|
antomati_ has quit IRC (Ping timeout: 250 seconds) |
18:31
🔗
|
|
greenie has joined #archiveteam-bs |
18:48
🔗
|
|
superkuh has joined #archiveteam-bs |
19:15
🔗
|
|
tuluu has joined #archiveteam-bs |
19:20
🔗
|
|
tuluu_ has joined #archiveteam-bs |
19:21
🔗
|
|
tuluu has quit IRC (Ping timeout: 268 seconds) |
19:37
🔗
|
godane |
anyone else having problems uploading to archive.org? |
19:37
🔗
|
godane |
i'm getting a problem : Warning: Transient problem: HTTP error Will retry in 5 seconds. 10 retries |
19:39
🔗
|
|
tuluu_ has quit IRC (Ping timeout: 268 seconds) |
19:44
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
19:47
🔗
|
|
tuluu has joined #archiveteam-bs |
19:50
🔗
|
|
ndiddy has joined #archiveteam-bs |
20:24
🔗
|
|
Ravenloft has joined #archiveteam-bs |
20:26
🔗
|
|
bmcginty has quit IRC (Read error: Operation timed out) |
20:27
🔗
|
|
Stiletto has joined #archiveteam-bs |
20:28
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
20:41
🔗
|
|
schbirid has joined #archiveteam-bs |
20:42
🔗
|
|
bmcginty has joined #archiveteam-bs |
20:42
🔗
|
JAA |
I don't personally, but I guess that could explain why my ArchiveBot jobs don't show up on IA. |
20:43
🔗
|
JAA |
some of my* |
21:07
🔗
|
schbirid |
looooooooooooool https://blog.pinboard.in/2017/06/pinboard_acquires_delicious/ |
21:07
🔗
|
schbirid |
pinboard ftw |
21:12
🔗
|
timmc |
I'm so proud of him. |
21:34
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
21:34
🔗
|
|
BartoCH has joined #archiveteam-bs |
21:51
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
22:05
🔗
|
|
icedice has joined #archiveteam-bs |
22:10
🔗
|
jrwr |
Can anyone here help with wget-lua? I'm having a hard time figuring out how to do this properly and make good WARCs |
22:11
🔗
|
jrwr |
since the site I'm trying to save is kind of complex but simple in its design |
22:13
🔗
|
jrwr |
and how the IA wants its data, because right now it's not really digestible into WBM |
22:14
🔗
|
jrwr |
Annnnnnd it's broken |
22:20
🔗
|
|
jmtd is now known as Jon |
22:25
🔗
|
|
Stiletto has quit IRC (Ping timeout: 246 seconds) |
22:34
🔗
|
arkiver |
hi jrwr |
22:34
🔗
|
arkiver |
pixiv right |
22:34
🔗
|
arkiver |
are your scripts somewhere online? |
22:34
🔗
|
arkiver |
I'll create a warrior project for the website |
22:34
🔗
|
arkiver |
but would like to see your scripts for that |
22:34
🔗
|
arkiver |
ah I see https://github.com/JRWR/savepixiv/blob/master/download.php |
22:34
🔗
|
arkiver |
jrwr: do we have a channel yet? |
22:34
🔗
|
arkiver |
project will be here https://github.com/ArchiveTeam/pixiv-grab |
22:34
🔗
|
jrwr |
Already made a project page for it last night |
22:34
🔗
|
jrwr |
#savepixiv |
22:34
🔗
|
arkiver |
awesome |
22:34
🔗
|
jrwr |
but ya |
22:34
🔗
|
jrwr |
so far the site has been responding well |
22:47
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
23:05
🔗
|
|
Stilett0 has joined #archiveteam-bs |
23:58
🔗
|
|
dashcloud has joined #archiveteam-bs |