Time |
Nickname |
Message |
00:21
π
|
|
Stilett0 has joined #archiveteam-ot |
00:25
π
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
00:39
π
|
|
mgrytbak has quit IRC (Ping timeout: 272 seconds) |
00:39
π
|
|
_niklas has quit IRC (Ping timeout: 272 seconds) |
00:39
π
|
|
NatarajBt has quit IRC (Ping timeout: 272 seconds) |
00:39
π
|
|
_niklas has joined #archiveteam-ot |
00:40
π
|
|
pie_[bnc] has joined #archiveteam-ot |
00:41
π
|
|
pie_ has quit IRC (Ping timeout: 272 seconds) |
00:41
π
|
|
Gfy has quit IRC (Ping timeout: 272 seconds) |
00:41
π
|
|
anarcat has quit IRC (Ping timeout: 272 seconds) |
00:41
π
|
|
anarcat has joined #archiveteam-ot |
00:41
π
|
|
Gfy has joined #archiveteam-ot |
00:42
π
|
|
antomati_ has joined #archiveteam-ot |
00:42
π
|
|
Meli-sama has joined #archiveteam-ot |
01:40
π
|
|
britmob has quit IRC (Read error: Operation timed out) |
01:40
π
|
|
Jake has quit IRC (Read error: Operation timed out) |
01:40
π
|
|
paul2520 has quit IRC (Write error: Broken pipe) |
01:41
π
|
|
paul2520 has joined #archiveteam-ot |
01:42
π
|
|
asdf01011 has quit IRC (Read error: Operation timed out) |
01:42
π
|
|
systwi_ has joined #archiveteam-ot |
01:42
π
|
|
Arcorann_ has joined #archiveteam-ot |
01:42
π
|
|
Jake has joined #archiveteam-ot |
01:42
π
|
|
sembiance has quit IRC (Read error: Operation timed out) |
01:42
π
|
|
sembiance has joined #archiveteam-ot |
01:42
π
|
|
asdf01011 has joined #archiveteam-ot |
01:42
π
|
|
britmob has joined #archiveteam-ot |
01:44
π
|
|
Meli has quit IRC (se.hub efnet.portlane.se) |
01:44
π
|
|
Laverne has quit IRC (se.hub efnet.portlane.se) |
01:44
π
|
|
antomatic has quit IRC (se.hub efnet.portlane.se) |
01:45
π
|
|
systwi has quit IRC (Read error: Operation timed out) |
01:45
π
|
|
systwi_ is now known as systwi |
01:46
π
|
|
Arcorann has quit IRC (Read error: Operation timed out) |
02:36
π
|
Frogging |
EFNet's NICKLEN is too low (9). Other networks it's much too long though (like 30) |
02:36
π
|
Frogging |
freenode has a nice sane value (16) |
02:38
π
|
Frogging |
I wonder if I can make weechat limit the size of the sidebars so the main area doesn't get squished when some asshole has a long-ass name |
02:56
π
|
Arcorann_ |
I didn't even notice that nicks here were limited to 9 characters |
02:58
π
|
Arcorann_ |
IMO 16 is still a bit short though |
03:04
π
|
|
Larsenv has quit IRC (Quit: ZNC 1.8.0 - https://znc.in) |
03:11
π
|
|
Larsenv has joined #archiveteam-ot |
03:39
π
|
Raccoon |
back in my day we were happy if we got 3 letters on the arcade machine |
03:54
π
|
|
qw3rty__ has joined #archiveteam-ot |
04:01
π
|
|
qw3rty_ has quit IRC (Read error: Operation timed out) |
05:49
π
|
Flashfire |
https://test.jimsasbestosnsw.com.au/ as it turns out Jim does sell drugs on top of everything else systwi Ryz JAA Larsenv |
05:50
π
|
Ryz |
But not video games s: |
05:50
π
|
Ryz |
...Or gambling |
05:50
π
|
Flashfire |
Not yet that I have found |
08:10
π
|
|
Laverne has joined #archiveteam-ot |
10:10
π
|
|
godane has quit IRC (Read error: Operation timed out) |
11:17
π
|
|
godane has joined #archiveteam-ot |
11:23
π
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
13:41
π
|
|
VerifiedJ has joined #archiveteam-ot |
14:02
π
|
|
Wingy4 has joined #archiveteam-ot |
14:05
π
|
|
Wingy has quit IRC (Read error: Operation timed out) |
14:06
π
|
|
Wingy4 has quit IRC (Read error: Operation timed out) |
14:10
π
|
|
Wingy has joined #archiveteam-ot |
14:13
π
|
|
Wingy has quit IRC (Client Quit) |
14:13
π
|
|
Wingy has joined #archiveteam-ot |
14:25
π
|
Wingy |
I tried to archive osu.ppy.sh and got permanently ip banned lol |
14:26
π
|
Wingy |
Had to change the mac on my router to get a new ip |
14:26
π
|
Frogging |
wow |
14:32
π
|
Kaz |
lmao, yeah I'm not surprised |
14:32
π
|
Kaz |
guy's an ass. |
14:32
π
|
Arcorann_ |
Didn't give a warning? |
14:33
π
|
JAA |
'Why would I give a thief a warning?' - Them, probably. |
16:02
π
|
|
Arcorann_ has quit IRC (Read error: Connection reset by peer) |
16:31
π
|
|
BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io) |
18:56
π
|
|
britmob_ has joined #archiveteam-ot |
19:05
π
|
|
merami has joined #archiveteam-ot |
19:06
π
|
merami |
hey I am currently archiving a booru site with grab-site, the crawler is archiving every permutation of tags possible, will this keep going forever? |
19:09
π
|
JAA |
merami: Potentially yes, or at least A Damn Long Time™. You'll want to add some ignore patterns to skip those. |
19:11
π
|
merami |
@JAA: how do I format DIR/igsets |
19:12
π
|
JAA |
merami: You will probably need DIR/ignores, and that's a text file containing one regular expression per line. If the regex matches a URL, the URL is ignored. |
19:12
π
|
JAA |
https://github.com/ArchiveTeam/grab-site#changing-ignores-during-the-crawl |
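A rough sketch of the rule JAA describes (one regex per line, and a URL is ignored if any regex matches). This is not grab-site's own code: `parse_ignores` and `is_ignored` are hypothetical helper names, the example.com pattern and URLs are made up for illustration, and skipping blank lines and matching with `re.search` are assumptions.

```python
import re

def parse_ignores(text):
    """Compile one regular expression per non-empty line (assumed format)."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]

def is_ignored(url, patterns):
    """Return True if any pattern matches somewhere in the URL."""
    return any(p.search(url) for p in patterns)

# Hypothetical contents of an ignores file: one regex per line.
ignores_text = "^https?://example\\.com/calendar/\nsessionid=\n"
patterns = parse_ignores(ignores_text)

print(is_ignored("https://example.com/calendar/2020/01", patterns))
print(is_ignored("https://example.com/about", patterns))
```

Because every line is a regex, characters such as `.` and `?` need a backslash when they are meant literally.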
19:16
π
|
merami |
Okay, I have no experience with regex, but if I want to get rid of urls like this https://ccrp.booru.org/index.php?page=post&s=list&tags=bow+tagme, can I just type in list&tags in DIR/ignores? |
19:18
π
|
JAA |
You could, to ignore all tag lists. Or ^https?://ccrp\.booru\.org/index\.php\?page=post&s=list&tags=[^&]+\+ to filter out pages with two or more tags. |
19:19
π
|
JAA |
Also, yeah, it would go on forever because the site's too dumb to disable the + link on tags that are already in the selection. So you get to URLs like https://ccrp.booru.org/index.php?page=post&s=list&tags=bow+bow+bow+bow etc. |
19:19
π
|
JAA |
Well, it would only stop when the URL gets too long, and that'll take quite some time. |
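To see the difference between the two suggestions, here is a small sketch that runs both patterns over a few URLs using Python's `re` module. It assumes the ignore patterns behave like ordinary Python regexes; the single-tag URL is constructed for illustration, and the other two come from the conversation.

```python
import re

BASE = "https://ccrp.booru.org/index.php?page=post&s=list&tags="

# merami's idea: ignore every tag list page.
broad = re.compile(r"list&tags")
# JAA's pattern: ignore only tag lists with two or more tags (a "+" after the first tag).
narrow = re.compile(
    r"^https?://ccrp\.booru\.org/index\.php\?page=post&s=list&tags=[^&]+\+"
)

urls = [
    BASE + "bow",              # one tag (constructed example)
    BASE + "bow+tagme",        # two tags
    BASE + "bow+bow+bow+bow",  # the repeating case
]

for url in urls:
    print(url, bool(broad.search(url)), bool(narrow.search(url)))
```

The broad pattern matches all three URLs, while the narrow one leaves the single-tag page crawlable and only drops the multi-tag combinations.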
19:20
π
|
merami |
yeah i just wanted to test out my machine on a small site and i get greeted with a 400k url queue. I'll try using your regex, thanks! |
19:57
π
|
|
martini has joined #archiveteam-ot |
21:14
π
|
britmob |
can confirm that you need lots of ignores on forums heh |
21:26
π
|
merami |
is concurrency limited by threads on my cpu? |
21:34
π
|
|
logchfoo4 has quit IRC (Remote host closed the connection) |
21:43
π
|
JAA |
merami: It's essentially single-threaded. There's a hard limit of 6 connections per host. You might also run into DB contention or HTML parsing bottlenecks. |
22:16
π
|
|
Wingy6 has joined #archiveteam-ot |
22:17
π
|
|
Wingy has quit IRC (Read error: Operation timed out) |
22:17
π
|
|
Wingy6 is now known as Wingy |
22:29
π
|
merami |
so once I finish a crawl, what should I do with all the failed links? |
22:29
π
|
merami |
is there some way to go back and attempt them again? |
22:42
π
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
22:46
π
|
JAA |
merami: I don't remember grab-site's default settings for wpull, but I believe it retries them twice when it's done with all the first attempts. |
22:46
π
|
merami |
thanks! |
22:47
π
|
JAA |
Yep, here: https://github.com/ArchiveTeam/grab-site/blob/be38f488afe1bade46a147ad07582e9e62337003/libgrabsite/main.py#L233 |
23:02
π
|
|
Arcorann_ has joined #archiveteam-ot |
23:11
π
|
|
Terbium has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) |
23:25
π
|
|
martini2 has joined #archiveteam-ot |
23:29
π
|
|
martini2_ has joined #archiveteam-ot |
23:30
π
|
|
martini has quit IRC (Read error: Operation timed out) |
23:31
π
|
|
martini2 has quit IRC (Read error: Operation timed out) |
23:37
π
|
|
BlueMax has joined #archiveteam-ot |
23:44
π
|
|
Wingy7 has joined #archiveteam-ot |
23:46
π
|
|
Wingy has quit IRC (Read error: Operation timed out) |
23:46
π
|
|
Wingy7 is now known as Wingy |