Time |
Nickname |
Message |
00:00
🔗
|
|
yuitimoth has quit IRC (Ping timeout: 260 seconds) |
00:07
🔗
|
|
pfallenop has quit IRC (Ping timeout: 260 seconds) |
00:08
🔗
|
|
pfallenop has joined #archiveteam |
00:13
🔗
|
|
pfallenop has quit IRC (Ping timeout: 260 seconds) |
00:17
🔗
|
|
yuitimoth has joined #archiveteam |
00:19
🔗
|
|
pfallenop has joined #archiveteam |
00:25
🔗
|
|
dashcloud has joined #archiveteam |
00:25
🔗
|
|
Okgo has joined #archiveteam |
00:27
🔗
|
Okgo |
Ffx |
00:27
🔗
|
Okgo |
Ffx |
00:30
🔗
|
|
BlueMaxim has joined #archiveteam |
00:32
🔗
|
|
Okgo has quit IRC (Ping timeout: 268 seconds) |
00:42
🔗
|
|
zout has quit IRC (Ping timeout: 244 seconds) |
00:45
🔗
|
|
zout has joined #archiveteam |
01:03
🔗
|
|
zout has quit IRC (Ping timeout: 244 seconds) |
01:07
🔗
|
|
zout has joined #archiveteam |
01:11
🔗
|
|
zout has quit IRC (Ping timeout: 244 seconds) |
01:25
🔗
|
|
zout has joined #archiveteam |
01:42
🔗
|
|
zout has quit IRC (Ping timeout: 244 seconds) |
01:46
🔗
|
|
zout has joined #archiveteam |
03:49
🔗
|
|
maelstrom has quit IRC (Quit: Leaving) |
03:52
🔗
|
|
wp494 has quit IRC (Read error: Connection reset by peer) |
04:04
🔗
|
|
ndiddy has quit IRC (Ping timeout: 244 seconds) |
04:12
🔗
|
|
mutoso_ has quit IRC (Read error: Connection reset by peer) |
04:17
🔗
|
|
mutoso has joined #archiveteam |
04:43
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:49
🔗
|
|
Sk1d has joined #archiveteam |
06:32
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
06:50
🔗
|
|
wp494 has joined #archiveteam |
06:54
🔗
|
|
fie has joined #archiveteam |
07:12
🔗
|
|
atomotic has joined #archiveteam |
07:30
🔗
|
|
ravetcofx has quit IRC (Read error: Operation timed out) |
07:47
🔗
|
zout |
does anybody have any opinions on doing a capture of hackforums.net? |
07:47
🔗
|
zout |
scum-of-the-internet types, but they block almost all archival services like archive.org. |
07:48
🔗
|
zout |
they're frequently referenced by journalists like krebsonsecurity.com, but the complete lack of archival means all the content only lives on in screenshots. |
07:48
🔗
|
Sanqui |
one forum I go to has auto-bans set up for hackforums.net referrers, it's pretty funny |
07:49
🔗
|
Sanqui |
the very index requires a captcha though |
07:49
🔗
|
Sanqui |
so not archivebottable |
07:49
🔗
|
zout |
hackforums itself will ban any IP address that uses wget, or that is listed in MaxMind minFraud as being a VPN/VPS. |
07:49
🔗
|
zout |
index? you can crawl with a logged in account without hitting recaptcha, so long as you heavily spoof the headers and the UA. |
07:50
🔗
|
zout |
archivebot will almost certainly be IP address blacklisted. |
07:50
🔗
|
Sanqui |
nah, archivebot is run privately, they won't have the ips |
07:50
🔗
|
Sanqui |
(no relation to ia) |
07:51
🔗
|
zout |
is its range anything that could be a rented server? |
07:51
🔗
|
Sanqui |
depends on the pipeline |
07:51
🔗
|
zout |
oh hold up, I just had a fantastic idea. |
07:51
🔗
|
zout |
wait, no, it's already behind cloudflare. |
07:52
🔗
|
Sanqui |
yeah cloudflare is poopoo atm |
07:52
🔗
|
zout |
drat. often you can abuse cloudflare to reverse proxy websites for you for scrapes, I don't know if that's public knowledge. |
07:53
🔗
|
Sanqui |
you mean revealing the ip? |
07:53
🔗
|
zout |
ie, sign up for CF yourself and set the origin as the site you want to archive. |
07:53
🔗
|
Sanqui |
oh wow, that's hilarious |
07:53
🔗
|
zout |
CF has lots of IP addresses they use to make the actual request to the origin server. |
07:54
🔗
|
zout |
sadly, if they're already behind CF then that doesn't work. |
07:56
🔗
|
zout |
tempted to just attempt to pull hackforums.net myself on my residential connection (which isn't banned). there's an awful lot of content though. |
08:04
🔗
|
zout |
be better if I could find a VPS or proxy that isn't banned though. |
08:06
🔗
|
zout |
I think it's using MaxMind GeoIP2, so it might just be a matter of finding a very new IP range that's not in the DB yet. |
08:42
🔗
|
zout |
:\ |
08:42
🔗
|
zout |
found an IP address that isn't banned! pity it throws clownflare CAPTCHAs, I can't scrape millions of URLs like that :( |
08:44
🔗
|
zout |
7.2M total pages in the queue, actually. |
08:50
🔗
|
zout |
ok- IPv6 gateways aren't banned. excellent. |
09:41
🔗
|
|
kurt has joined #archiveteam |
11:10
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
11:44
🔗
|
|
Morbus has joined #archiveteam |
11:47
🔗
|
|
kyounko has quit IRC (Read error: Operation timed out) |
12:04
🔗
|
|
atomotic has joined #archiveteam |
12:08
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:51
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
14:34
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
14:34
🔗
|
|
Start has joined #archiveteam |
14:35
🔗
|
|
Start has quit IRC (Client Quit) |
14:45
🔗
|
|
achip has joined #archiveteam |
15:29
🔗
|
|
kurt has quit IRC (Remote host closed the connection) |
15:33
🔗
|
|
maelstrom has joined #archiveteam |
15:38
🔗
|
|
bRick5772 has joined #archiveteam |
16:03
🔗
|
|
ZeoNet has joined #archiveteam |
16:15
🔗
|
|
VADemon has joined #archiveteam |
16:20
🔗
|
|
ZeoNet_ has joined #archiveteam |
16:22
🔗
|
|
ZeoNet has quit IRC (Ping timeout: 370 seconds) |
16:31
🔗
|
|
ZeoNet has joined #archiveteam |
16:32
🔗
|
|
ZeoNet_ has quit IRC (Ping timeout: 370 seconds) |
16:35
🔗
|
|
AlexLehm has joined #archiveteam |
16:49
🔗
|
arkiver |
zout: you're archiving into WARCs right? |
17:04
🔗
|
|
ZeoNet_ has joined #archiveteam |
17:07
🔗
|
|
ZeoNet has quit IRC (Read error: Operation timed out) |
17:15
🔗
|
|
nwf has quit IRC (WeeChat 1.5) |
17:17
🔗
|
|
nwf has joined #archiveteam |
17:24
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |
17:39
🔗
|
|
ZeoNet_ has quit IRC (Read error: Operation timed out) |
18:03
🔗
|
|
VADemon has quit IRC (Read error: Operation timed out) |
18:04
🔗
|
|
vOYtEC has quit IRC (Read error: Connection reset by peer) |
18:04
🔗
|
|
vOYtEC has joined #archiveteam |
18:32
🔗
|
|
ravetcofx has joined #archiveteam |
19:11
🔗
|
|
bRick5772 has quit IRC (Quit: Leaving.) |
19:57
🔗
|
|
SketchCow has joined #archiveteam |
19:57
🔗
|
|
swebb sets mode: +o SketchCow |
20:01
🔗
|
zout |
arkiver: not yet, but when I am, yes. |
20:08
🔗
|
SketchCow |
Hiiiii |
20:14
🔗
|
arkiver |
hi |
20:16
🔗
|
|
jessew-el has joined #archiveteam |
20:20
🔗
|
jessew-el |
How well does Archivebot handle Storify? I didn't see any plans on the wiki yet. I don't have any urgent reason to think it's dying right now, but it has a lot of valuable material. |
20:21
🔗
|
xmc |
try it and see |
20:22
🔗
|
Sanqui |
you can try loading a storify page without javascript. |
20:25
🔗
|
|
jessew-el has quit IRC (Ping timeout: 268 seconds) |
20:32
🔗
|
|
Stiletto has quit IRC () |
20:35
🔗
|
zout |
answer is, it loads fine. |
20:35
🔗
|
zout |
the pagination is broken in a browser but should be fine with some hackery. |
20:38
🔗
|
Sanqui |
"some hackery" isn't really possible with archivebot |
20:38
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
20:40
🔗
|
|
acridAxid has quit IRC (Quit: marauder) |
20:41
🔗
|
|
acridAxid has joined #archiveteam |
20:42
🔗
|
wp494 |
Shomi (a Netflix competitor in Canada) announced they're going to take themselves out to the back of the shed on november 30th |
20:42
🔗
|
wp494 |
http://shomimedia.com/news-feed/press-releases/2016/09/cue-the-closing-credits-shomi-thanks-canadians-for-a-good-run |
20:42
🔗
|
wp494 |
given its tv streaming nature there probably wouldn't be a whole lot we can get our hands on, but would be handy to toss social accounts into archivebot |
20:43
🔗
|
wp494 |
and maybe the main site itself |
20:51
🔗
|
|
Stiletto has joined #archiveteam |
21:03
🔗
|
|
ndiddy has joined #archiveteam |
21:05
🔗
|
|
computerf has quit IRC (Read error: Operation timed out) |
21:08
🔗
|
|
computerf has joined #archiveteam |
21:08
🔗
|
|
RichardG has joined #archiveteam |
21:11
🔗
|
zout |
I've found that crawls really get lengthy if I don't bother making custom whitelists for things. |
21:11
🔗
|
zout |
the difference between something targeted and something that sits and gathers a thousand variations of the same page with wget is well worth it. |
21:24
🔗
|
|
computerf has quit IRC (Read error: Operation timed out) |
21:35
🔗
|
|
computerf has joined #archiveteam |
21:41
🔗
|
|
ndiddy has quit IRC (Quit: Leaving) |
21:51
🔗
|
|
yeoldetoa has quit IRC (Remote host closed the connection) |
21:52
🔗
|
|
kristian_ has joined #archiveteam |
22:15
🔗
|
|
pfallenop has quit IRC (Quit: Lost terminal) |
22:22
🔗
|
|
BlueMaxim has joined #archiveteam |
22:31
🔗
|
|
AlexLehm has quit IRC (Ping timeout: 244 seconds) |
22:32
🔗
|
|
pfallenop has joined #archiveteam |
22:33
🔗
|
|
RoanKatto has joined #archiveteam |
22:33
🔗
|
|
Start has joined #archiveteam |
22:34
🔗
|
|
Morbus has quit IRC (Read error: Operation timed out) |
22:35
🔗
|
|
achip has quit IRC (Read error: Operation timed out) |
23:49
🔗
|
|
vOYtEC has quit IRC (Ping timeout: 244 seconds) |
23:52
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |