[00:21] *** Stilett0 has joined #archiveteam-ot
[00:25] *** Stiletto has quit IRC (Read error: Operation timed out)
[00:39] *** mgrytbak has quit IRC (Ping timeout: 272 seconds)
[00:39] *** _niklas has quit IRC (Ping timeout: 272 seconds)
[00:39] *** NatarajBt has quit IRC (Ping timeout: 272 seconds)
[00:39] *** _niklas has joined #archiveteam-ot
[00:40] *** pie_[bnc] has joined #archiveteam-ot
[00:41] *** pie_ has quit IRC (Ping timeout: 272 seconds)
[00:41] *** Gfy has quit IRC (Ping timeout: 272 seconds)
[00:41] *** anarcat has quit IRC (Ping timeout: 272 seconds)
[00:41] *** anarcat has joined #archiveteam-ot
[00:41] *** Gfy has joined #archiveteam-ot
[00:42] *** antomati_ has joined #archiveteam-ot
[00:42] *** Meli-sama has joined #archiveteam-ot
[01:40] *** britmob has quit IRC (Read error: Operation timed out)
[01:40] *** Jake has quit IRC (Read error: Operation timed out)
[01:40] *** paul2520 has quit IRC (Write error: Broken pipe)
[01:41] *** paul2520 has joined #archiveteam-ot
[01:42] *** asdf01011 has quit IRC (Read error: Operation timed out)
[01:42] *** systwi_ has joined #archiveteam-ot
[01:42] *** Arcorann_ has joined #archiveteam-ot
[01:42] *** Jake has joined #archiveteam-ot
[01:42] *** sembiance has quit IRC (Read error: Operation timed out)
[01:42] *** sembiance has joined #archiveteam-ot
[01:42] *** asdf01011 has joined #archiveteam-ot
[01:42] *** britmob has joined #archiveteam-ot
[01:44] *** Meli has quit IRC (se.hub efnet.portlane.se)
[01:44] *** Laverne has quit IRC (se.hub efnet.portlane.se)
[01:44] *** antomatic has quit IRC (se.hub efnet.portlane.se)
[01:45] *** systwi has quit IRC (Read error: Operation timed out)
[01:45] *** systwi_ is now known as systwi
[01:46] *** Arcorann has quit IRC (Read error: Operation timed out)
[02:36] EFNet's NICKLEN is too low (9). Other networks it's much too long though (like 30)
[02:36] freenode has a nice sane value (16)
[02:38] I wonder if I can make weechat limit the size of the sidebars so the main area doesn't get squished when some asshole has a long-ass name
[02:56] I didn't even notice that nicks here were limited to 9 characters
[02:58] IMO 16 is still a bit short though
[03:04] *** Larsenv has quit IRC (Quit: ZNC 1.8.0 - https://znc.in)
[03:11] *** Larsenv has joined #archiveteam-ot
[03:39] back in my day we were happy if we got 3 letters on the arcade machine
[03:54] *** qw3rty__ has joined #archiveteam-ot
[04:01] *** qw3rty_ has quit IRC (Read error: Operation timed out)
[05:49] https://test.jimsasbestosnsw.com.au/ as it turns out Jim does sell drugs on top of everything else systwi Ryz JAA Larsenv
[05:50] But not video games s:
[05:50] ...Or gambling
[05:50] Not yet that I have found
[08:10] *** Laverne has joined #archiveteam-ot
[10:10] *** godane has quit IRC (Read error: Operation timed out)
[11:17] *** godane has joined #archiveteam-ot
[11:23] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[13:41] *** VerifiedJ has joined #archiveteam-ot
[14:02] *** Wingy4 has joined #archiveteam-ot
[14:05] *** Wingy has quit IRC (Read error: Operation timed out)
[14:06] *** Wingy4 has quit IRC (Read error: Operation timed out)
[14:10] *** Wingy has joined #archiveteam-ot
[14:13] *** Wingy has quit IRC (Client Quit)
[14:13] *** Wingy has joined #archiveteam-ot
[14:25] I tried to archive osu.ppy.sh and got permanently ip banned lol
[14:26] Had to change the mac on my router to get a new ip
[14:26] wow
[14:32] lmao, yeah I'm not surprised
[14:32] guy's an ass.
[14:32] Didn't give a warning?
[14:33] 'Why would I give a thief a warning?' - Them, probably.
[16:02] *** Arcorann_ has quit IRC (Read error: Connection reset by peer)
[16:31] *** BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io)
[18:56] *** britmob_ has joined #archiveteam-ot
[19:05] *** merami has joined #archiveteam-ot
[19:06] hey I am currently archiving a booru site with grab-site, the crawler is archiving every permutation of tags possible, will this keep going forever?
[19:09] merami: Potentially yes, or at least A Damn Long Time™. You'll want to add some ignore patterns to skip those.
[19:11] @JAA: how do I format DIR/igsets
[19:12] merami: You will probably need DIR/ignores, and that's a text file containing one regular expression per line. If the regex matches a URL, the URL is ignored.
[19:12] https://github.com/ArchiveTeam/grab-site#changing-ignores-during-the-crawl
[19:16] Okay, I have no experience with regex, but if I want to get rid of urls like this https://ccrp.booru.org/index.php?page=post&s=list&tags=bow+tagme, can I just type in list&tags in DIR/ignores?
[19:18] You could, to ignore all tag lists. Or ^https?://ccrp\.booru\.org/index\.php\?page=post&s=list&tags=[^&]+\+ to filter out pages with two or more tags.
[19:19] Also, yeah, it would go on forever because the site's too dumb to disable the + link on tags that are already in the selection. So you get to URLs like https://ccrp.booru.org/index.php?page=post&s=list&tags=bow+bow+bow+bow etc.
[19:19] Well, it would only stop when the URL gets too long, and that'll take quite some time.
[19:20] yeah i just wanted to test out my machine on a small site and i get greeted with a 400k url queue. I'll try using your regex, thanks!
[19:57] *** martini has joined #archiveteam-ot
[21:14] can confirm that you need lots of ignores on forums heh
[21:26] is concurrency limited by threads on my cpu?
[21:34] *** logchfoo4 has quit IRC (Remote host closed the connection)
[21:43] merami: It's essentially single-threaded. There's a hard limit of 6 connections per host. You might also run into DB contention or HTML parsing bottlenecks.
[22:16] *** Wingy6 has joined #archiveteam-ot
[22:17] *** Wingy has quit IRC (Read error: Operation timed out)
[22:17] *** Wingy6 is now known as Wingy
[22:29] so once I finish a crawl, what should I do with all the failed links?
[22:29] is there some way go back and attempt them again?
[22:42] *** VerifiedJ has quit IRC (Quit: Leaving)
[22:46] merami: I don't remember grab-site's default settings for wpull, but I believe it retries them twice when it's done with all the first attempts.
[22:46] thanks!
[22:47] Yep, here: https://github.com/ArchiveTeam/grab-site/blob/be38f488afe1bade46a147ad07582e9e62337003/libgrabsite/main.py#L233
[23:02] *** Arcorann_ has joined #archiveteam-ot
[23:11] *** Terbium has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[23:25] *** martini2 has joined #archiveteam-ot
[23:29] *** martini2_ has joined #archiveteam-ot
[23:30] *** martini has quit IRC (Read error: Operation timed out)
[23:31] *** martini2 has quit IRC (Read error: Operation timed out)
[23:37] *** BlueMax has joined #archiveteam-ot
[23:44] *** Wingy7 has joined #archiveteam-ot
[23:46] *** Wingy has quit IRC (Read error: Operation timed out)
[23:46] *** Wingy7 is now known as Wingy
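
Editor's note on the 19:16 to 19:19 exchange: the two ignore patterns discussed there can be tried out offline before putting them in DIR/ignores. The sketch below is only an approximation. It assumes a pattern counts as a match when it is found anywhere in the URL (Python `re.search` semantics), which is not confirmed in the log. The helper name `is_ignored` and the `s=view&id=1` URL are hypothetical and exist only for this illustration; the other URLs and both patterns are taken from the conversation.

```python
import re

# Pattern quoted at 19:18: tag-list pages with two or more tags.
TWO_OR_MORE_TAGS = (
    r"^https?://ccrp\.booru\.org/index\.php"
    r"\?page=post&s=list&tags=[^&]+\+"
)
# Substring suggested at 19:16: every tag-list page.
ALL_TAG_LISTS = r"list&tags"


def is_ignored(url, patterns):
    """Hypothetical helper: True if any pattern is found in the URL.

    Assumes search-style matching; this is not grab-site's own code.
    """
    return any(re.search(pattern, url) for pattern in patterns)


BASE = "https://ccrp.booru.org/index.php?page=post&s=list&tags="
sample_urls = [
    BASE + "bow+tagme",        # two tags, from the log
    BASE + "bow",              # single tag
    BASE + "bow+bow+bow+bow",  # the runaway case from 19:19
    # Hypothetical non-list URL, made up for this example:
    "https://ccrp.booru.org/index.php?page=post&s=view&id=1",
]

if __name__ == "__main__":
    for url in sample_urls:
        narrow = is_ignored(url, [TWO_OR_MORE_TAGS])
        broad = is_ignored(url, [ALL_TAG_LISTS])
        print(f"two-or-more: {narrow!s:5}  all-lists: {broad!s:5}  {url}")
```

The narrower pattern keeps single-tag listings in the crawl while dropping every multi-tag combination, which is what stops the queue from growing without bound; the plain `list&tags` substring drops all tag listings, including the single-tag ones.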