#archiveteam-ot 2020-09-08,Tue


Time Nickname Message
00:21 Stilett0 has joined #archiveteam-ot
00:25 Stiletto has quit IRC (Read error: Operation timed out)
00:39 mgrytbak has quit IRC (Ping timeout: 272 seconds)
00:39 _niklas has quit IRC (Ping timeout: 272 seconds)
00:39 NatarajBt has quit IRC (Ping timeout: 272 seconds)
00:39 _niklas has joined #archiveteam-ot
00:40 pie_[bnc] has joined #archiveteam-ot
00:41 pie_ has quit IRC (Ping timeout: 272 seconds)
00:41 Gfy has quit IRC (Ping timeout: 272 seconds)
00:41 anarcat has quit IRC (Ping timeout: 272 seconds)
00:41 anarcat has joined #archiveteam-ot
00:41 Gfy has joined #archiveteam-ot
00:42 antomati_ has joined #archiveteam-ot
00:42 Meli-sama has joined #archiveteam-ot
01:40 britmob has quit IRC (Read error: Operation timed out)
01:40 Jake has quit IRC (Read error: Operation timed out)
01:40 paul2520 has quit IRC (Write error: Broken pipe)
01:41 paul2520 has joined #archiveteam-ot
01:42 asdf01011 has quit IRC (Read error: Operation timed out)
01:42 systwi_ has joined #archiveteam-ot
01:42 Arcorann_ has joined #archiveteam-ot
01:42 Jake has joined #archiveteam-ot
01:42 sembiance has quit IRC (Read error: Operation timed out)
01:42 sembiance has joined #archiveteam-ot
01:42 asdf01011 has joined #archiveteam-ot
01:42 britmob has joined #archiveteam-ot
01:44 Meli has quit IRC (se.hub efnet.portlane.se)
01:44 Laverne has quit IRC (se.hub efnet.portlane.se)
01:44 antomatic has quit IRC (se.hub efnet.portlane.se)
01:45 systwi has quit IRC (Read error: Operation timed out)
01:45 systwi_ is now known as systwi
01:46 Arcorann has quit IRC (Read error: Operation timed out)
02:36 Frogging EFNet's NICKLEN is too low (9). On other networks it's much too long though (like 30)
02:36 Frogging freenode has a nice sane value (16)
02:38 Frogging I wonder if I can make weechat limit the size of the sidebars so the main area doesn't get squished when some asshole has a long-ass name
02:56 Arcorann_ I didn't even notice that nicks here were limited to 9 characters
02:58 Arcorann_ IMO 16 is still a bit short though
03:04 Larsenv has quit IRC (Quit: ZNC 1.8.0 - https://znc.in)
03:11 Larsenv has joined #archiveteam-ot
03:39 Raccoon back in my day we were happy if we got 3 letters on the arcade machine
03:54 qw3rty__ has joined #archiveteam-ot
04:01 qw3rty_ has quit IRC (Read error: Operation timed out)
05:49 Flashfire https://test.jimsasbestosnsw.com.au/ as it turns out Jim does sell drugs on top of everything else systwi Ryz JAA Larsenv
05:50 Ryz But not video games s:
05:50 Ryz ...Or gambling
05:50 Flashfire Not yet that I have found
08:10 Laverne has joined #archiveteam-ot
10:10 godane has quit IRC (Read error: Operation timed out)
11:17 godane has joined #archiveteam-ot
11:23 BlueMax has quit IRC (Read error: Connection reset by peer)
13:41 VerifiedJ has joined #archiveteam-ot
14:02 Wingy4 has joined #archiveteam-ot
14:05 Wingy has quit IRC (Read error: Operation timed out)
14:06 Wingy4 has quit IRC (Read error: Operation timed out)
14:10 Wingy has joined #archiveteam-ot
14:13 Wingy has quit IRC (Client Quit)
14:13 Wingy has joined #archiveteam-ot
14:25 Wingy I tried to archive osu.ppy.sh and got permanently IP banned lol
14:26 Wingy Had to change the MAC on my router to get a new IP
14:26 Frogging wow
14:32 Kaz lmao, yeah I'm not surprised
14:32 Kaz guy's an ass.
14:32 Arcorann_ Didn't give a warning?
14:33 JAA 'Why would I give a thief a warning?' - Them, probably.
16:02 Arcorann_ has quit IRC (Read error: Connection reset by peer)
16:31 BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io)
18:56 britmob_ has joined #archiveteam-ot
19:05 merami has joined #archiveteam-ot
19:06 merami hey I am currently archiving a booru site with grab-site, the crawler is archiving every permutation of tags possible, will this keep going forever?
19:09 JAA merami: Potentially yes, or at least A Damn Long Time™. You'll want to add some ignore patterns to skip those.
19:11 merami @JAA: how do I format DIR/igsets
19:12 JAA merami: You will probably need DIR/ignores, and that's a text file containing one regular expression per line. If the regex matches a URL, the URL is ignored.
19:12 JAA https://github.com/ArchiveTeam/grab-site#changing-ignores-during-the-crawl
19:16 merami Okay, I have no experience with regex, but if I want to get rid of URLs like this https://ccrp.booru.org/index.php?page=post&s=list&tags=bow+tagme, can I just type in list&tags in DIR/ignores?
19:18 JAA You could, to ignore all tag lists. Or ^https?://ccrp\.booru\.org/index\.php\?page=post&s=list&tags=[^&]+\+ to filter out pages with two or more tags.
19:19 JAA Also, yeah, it would go on forever because the site's too dumb to disable the + link on tags that are already in the selection. So you get to URLs like https://ccrp.booru.org/index.php?page=post&s=list&tags=bow+bow+bow+bow etc.
19:19 JAA Well, it would only stop when the URL gets too long, and that'll take quite some time.
19:20 merami yeah I just wanted to test out my machine on a small site and I get greeted with a 400k URL queue. I'll try using your regex, thanks!
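
[Editor's note] The two ignore patterns discussed above can be tried out locally before putting them into DIR/ignores. The following is a minimal sketch in Python, not grab-site's own code: it assumes the patterns are applied with search semantics (a match anywhere in the URL counts), and the helper name `is_ignored` and the post-view URL are made up for illustration.

```python
import re

# Pattern quoted from JAA's message: tag-list URLs whose "tags" parameter
# contains at least one "+", i.e. two or more tags.
MULTI_TAG = re.compile(
    r"^https?://ccrp\.booru\.org/index\.php\?page=post&s=list&tags=[^&]+\+"
)

# The simpler pattern merami proposed: ignores every tag list.
ANY_TAG_LIST = re.compile(r"list&tags")


def is_ignored(url, patterns):
    """Hypothetical helper: True if any pattern matches somewhere in the URL.

    Search semantics are an assumption about how ignore patterns are applied.
    """
    return any(p.search(url) for p in patterns)


BASE = "https://ccrp.booru.org/index.php?page=post"
SINGLE = BASE + "&s=list&tags=bow"
DOUBLE = BASE + "&s=list&tags=bow+tagme"
REPEATED = BASE + "&s=list&tags=bow+bow+bow+bow"
POST = BASE + "&s=view&id=1"  # made-up example of a non-list page

for url in (SINGLE, DOUBLE, REPEATED, POST):
    print(
        url,
        "multi-tag:", is_ignored(url, [MULTI_TAG]),
        "any-list:", is_ignored(url, [ANY_TAG_LIST]),
    )
```

With the stricter pattern, single-tag lists are still crawled while every multi-tag combination (including the repeated-tag URLs) is skipped; the simpler pattern drops all tag lists.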
19:57 martini has joined #archiveteam-ot
21:14 britmob can confirm that you need lots of ignores on forums heh
21:26 merami is concurrency limited by threads on my CPU?
21:34 logchfoo4 has quit IRC (Remote host closed the connection)
21:43 JAA merami: It's essentially single-threaded. There's a hard limit of 6 connections per host. You might also run into DB contention or HTML parsing bottlenecks.
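
[Editor's note] To illustrate how a single-threaded crawler can still have several requests in flight while capping connections per host, here is a rough sketch using Python's asyncio. It is not grab-site's or wpull's implementation; only the limit of 6 comes from the conversation. `HostLimiter`, `fake_fetch`, `crawl`, and `run_demo` are hypothetical names, and the fetch is simulated with a short sleep instead of real network I/O.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PER_HOST = 6  # per-host limit mentioned in the conversation


class HostLimiter:
    """Hypothetical helper: caps concurrent fetches per host with semaphores."""

    def __init__(self, limit=MAX_PER_HOST):
        self._limit = limit
        self._sems = {}
        self.active = defaultdict(int)  # fetches currently running, per host
        self.peak = defaultdict(int)    # highest concurrency seen, per host

    def _sem(self, host):
        if host not in self._sems:
            self._sems[host] = asyncio.Semaphore(self._limit)
        return self._sems[host]

    async def fetch(self, url, fetcher):
        host = urlsplit(url).hostname
        async with self._sem(host):  # waits while `limit` fetches are running
            self.active[host] += 1
            self.peak[host] = max(self.peak[host], self.active[host])
            try:
                return await fetcher(url)
            finally:
                self.active[host] -= 1


async def fake_fetch(url):
    """Stand-in for a real HTTP request; just yields to the event loop."""
    await asyncio.sleep(0.01)
    return url


async def crawl(urls, limit=MAX_PER_HOST):
    limiter = HostLimiter(limit)
    await asyncio.gather(*(limiter.fetch(u, fake_fetch) for u in urls))
    return dict(limiter.peak)


def run_demo():
    urls = [f"https://example.org/page/{i}" for i in range(20)]
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    print(run_demo())
```

Everything runs in one thread, so adding CPU cores would not raise the number of simultaneous fetches in a design like this; the semaphore is what bounds it.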
22:16 Wingy6 has joined #archiveteam-ot
22:17 Wingy has quit IRC (Read error: Operation timed out)
22:17 Wingy6 is now known as Wingy
22:29 merami so once I finish a crawl, what should I do with all the failed links?
22:29 merami is there some way to go back and attempt them again?
22:42 VerifiedJ has quit IRC (Quit: Leaving)
22:46 JAA merami: I don't remember grab-site's default settings for wpull, but I believe it retries them twice when it's done with all the first attempts.
22:46 merami thanks!
22:47 JAA Yep, here: https://github.com/ArchiveTeam/grab-site/blob/be38f488afe1bade46a147ad07582e9e62337003/libgrabsite/main.py#L233
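
[Editor's note] The retry behaviour described above (failed URLs are tried again only after all first attempts are done) can be pictured as a queue in which failures are pushed to the back. This is a sketch of that idea only, not the linked grab-site or wpull code. It assumes "retries twice" means one initial attempt plus up to two retries; `crawl_with_retries` and `make_flaky_fetch` are hypothetical names.

```python
from collections import deque


def crawl_with_retries(urls, fetch, max_retries=2):
    """Process URLs in order; failed ones go to the back of the queue.

    Assumption: each URL gets one first attempt plus up to `max_retries`
    retries, and retries only happen after everything queued earlier.
    """
    queue = deque((url, 0) for url in urls)
    done, failed, order = [], [], []
    while queue:
        url, retries = queue.popleft()
        order.append(url)  # record the sequence of attempts
        if fetch(url):
            done.append(url)
        elif retries < max_retries:
            queue.append((url, retries + 1))  # retry later, after the rest
        else:
            failed.append(url)  # out of retries
    return done, failed, order


def make_flaky_fetch(failures_before_success):
    """Build a fake fetch that fails a set number of times for each URL."""
    attempts = {}

    def fetch(url):
        attempts[url] = attempts.get(url, 0) + 1
        return attempts[url] > failures_before_success.get(url, 0)

    return fetch


# "a" succeeds at once, "b" fails once, "c" never succeeds.
demo_fetch = make_flaky_fetch({"b": 1, "c": 99})
done, failed, order = crawl_with_retries(["a", "b", "c"], demo_fetch)
print(done, failed, order)
```

In this model, a URL that still fails after its last retry ends up in `failed`; retrying those again would need a separate crawl.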
23:02 Arcorann_ has joined #archiveteam-ot
23:11 Terbium has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
23:25 martini2 has joined #archiveteam-ot
23:29 martini2_ has joined #archiveteam-ot
23:30 martini has quit IRC (Read error: Operation timed out)
23:31 martini2 has quit IRC (Read error: Operation timed out)
23:37 BlueMax has joined #archiveteam-ot
23:44 Wingy7 has joined #archiveteam-ot
23:46 Wingy has quit IRC (Read error: Operation timed out)
23:46 Wingy7 is now known as Wingy
