[01:27] *** Stiletto has quit IRC (Ping timeout: 506 seconds)
[01:41] *** Stiletto has joined #archiveteam-ot
[01:41] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[01:53] *** Stiletto has quit IRC (Ping timeout: 745 seconds)
[02:00] betamax: I was not on IRC so I missed it.
[02:28] *** Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
[02:48] *** Stiletto has joined #archiveteam-ot
[02:52] *** Stiletto has quit IRC (Remote host closed the connection)
[03:03] *** Flashfloo has quit IRC (Read error: Connection reset by peer)
[03:03] *** kiska has quit IRC (Remote host closed the connection)
[03:03] *** Flashfire has quit IRC (Remote host closed the connection)
[03:04] *** Flashfloo has joined #archiveteam-ot
[03:04] *** kiska has joined #archiveteam-ot
[03:04] *** Flashfire has joined #archiveteam-ot
[03:04] *** Fusl sets mode: +o kiska
[03:29] *** Stiletto has joined #archiveteam-ot
[03:32] *** Stilettoo has joined #archiveteam-ot
[03:40] *** Stiletto has quit IRC (Ping timeout: 506 seconds)
[03:42] *** Stilettoo has quit IRC (Ping timeout: 604 seconds)
[03:56] any recommendations about suppressing index.html?C=N&O=A stuff?
[03:56] *** qw3rty111 has joined #archiveteam-ot
[03:58] basically I'm trying to exclude directory listings altogether and just download the content instead
[04:01] *** qw3rty119 has quit IRC (Ping timeout: 600 seconds)
[04:05] in wget?
[04:14] yes
[04:18] right now I'm using `wget -r -np -nc $url`
[04:20] directory listings are necessary for retrieval; the only way to not have them is to delete them afterwards
[04:23] the index.html files being saved is one problem, but the bigger one is the `?C=N&O=A` stuff which controls sorting of the directory listing; basically each index.html is downloaded 7 times, and that's wasting a lot of time in large trees
[04:49] ah
[05:00] I didn't think it downloaded the files each time
[05:06] *** Leslie has quit IRC (Quit: leaving)
[05:30] *** Mateon1 has quit IRC (Remote host closed the connection)
[05:32] *** Mateon1 has joined #archiveteam-ot
[07:44] *** m007a83_ has joined #archiveteam-ot
[07:48] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[08:14] *** m007a83_ has quit IRC (Ping timeout: 252 seconds)
[08:17] *** m007a83 has joined #archiveteam-ot
[08:36] *** deevious has joined #archiveteam-ot
[10:10] *** Shen has quit IRC (Ping timeout: 240 seconds)
[10:27] *** Shen has joined #archiveteam-ot
[10:34] kpcyrd: --reject-regex '/index\.html\?C=[NMSD]&O=[AD]$'
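
Putting the question and the answer together, a minimal sketch of the combined invocation (untested; assumes a GNU wget new enough to support --reject-regex, which arrived in 1.14, and that the listing links use & between the sort parameters as in the URLs quoted above):

    # Recursive, no-parent, no-clobber grab that rejects the sort-order
    # variants (?C=N&O=A and friends) of each Apache-style directory
    # listing, so every index.html is fetched only once.
    wget -r -np -nc \
         --reject-regex '/index\.html\?C=[NMSD]&O=[AD]$' \
         "$url"

The plain index.html files are still fetched and saved, since recursion needs them; only the redundant sorted copies are skipped.
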
[10:58] *** Dj-Wawa has joined #archiveteam-ot
[12:13] *** VerifiedJ has joined #archiveteam-ot
[12:29] *** BlueMax has quit IRC (Quit: Leaving)
[14:52] *** systwi_ has joined #archiveteam-ot
[14:58] *** systwi has quit IRC (Read error: Operation timed out)
[14:59] *** systwi_ is now known as systwi
[16:16] *** martini has joined #archiveteam-ot
[16:49] *** Stiletto has joined #archiveteam-ot
[17:00] *** DogsRNice has joined #archiveteam-ot
[17:55] *** yano has quit IRC (WeeChat, The Better IRC Client, https://weechat.org/)
[18:06] *** yano has joined #archiveteam-ot
[18:10] *** schbirid has joined #archiveteam-ot
[18:29] *** killsushi has joined #archiveteam-ot
[19:09] Can anyone here help me create an ignore for a website I'm trying to grab?
[19:12] How would I set an ignore pattern to prevent crawling to/grabbing other pages of a site? i.e. I want to block https://example.com/pages/2 and all other pages, including the URLs on those pages
[19:21] Via grab-site or something?
[19:22] Yep
[19:23] https://github.com/ArchiveTeam/grab-site#changing-ignores-during-the-crawl
[19:24] So what are you trying to ignore?
[19:24] Sorry BRB
[19:37] Igloo: I'm trying to ignore https://www.aimljobs.com/pages/2 and /3 and /4 and so on
[19:37] systwi: ^https?://www\.aimljobs\.com/pages/\d+$
[19:38] Awesome, thank you!
[19:39] What is the 'https:?' ? Does that mean either 'http:' or 'https:' ?
[19:39] Yep
[19:39] Ah ok :)
[19:39] s? = optionally match an "s"
[19:40] I'm not that good with regex. I tried following https://duckduckgo.com/?q=regex+help&ia=cheatsheet&iax=1 but could only understand some of it
[19:40] www.regex101.com
[19:40] Just keep trying stuff.
[19:40] :)
[19:40] I like https://www.regular-expressions.info/
[19:40] Thanks guys :)
[19:41] It has a good intro (as far as I can tell, never had to learn it from there), and also all the details when they're needed.
[19:41] Nice
[19:42] Hah yeah my only regex knowledge is shell assistance from stack exchange
[20:03] So for blocking out pages like .com/tags/java .com/tags/cplusplus .com/tags/perl can I do ^https?://www\.aimljobs\.com/tags/*$ ?
[20:09] ^https?://www\.aimljobs\.com/tags/.+$
[20:10] that'll ignore .com/tags/everything
[20:10] You don't need the $ either
[20:11] Nice, thanks. Do I still need the $ in the ignore JAA gave me?
[20:14] $ = end of the string
[20:15] But in the pattern Igloo gave, .+ anyway matches anything until the end of the string, so it doesn't matter.
[20:16] Oh ok, I was a bit confused
[20:16] Thanks
[21:16] *** martini has quit IRC (Quit: No Reasson)
[21:24] *** schbirid has quit IRC (Remote host closed the connection)
[21:44] *** dashcloud has quit IRC (Ping timeout: 252 seconds)
[21:49] *** dashcloud has joined #archiveteam-ot
[22:22] I'm grabbing some content from stackoverflow.com, but I'm getting a bunch of "429 Too Many Requests" errors. Do I need to slow my crawler down some more?
[22:22] That's what 429 normally means, yes.
[22:22] I'm already crawling at 1 connection and a 275-375 ms delay
[22:23] Bumped the delay to 500-2000 with no change
[22:26] Changed it to 1000-2500 and it seems to have done the trick. Now we wait
[22:26] That's the archival dance.
[22:27] Hah
[22:39] *** BlueMax has joined #archiveteam-ot
[23:17] *** Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
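
Per the "changing ignores during the crawl" README section linked in the discussion above, both aimljobs.com patterns can also be added to an already-running grab-site crawl by appending them to the ignores file in the crawl's directory; grab-site is described as picking up changes within a few seconds. A sketch, with $crawl_dir standing in for whatever directory grab-site created (the file name and reload behavior are recalled from the README, so verify against your grab-site version):

    # One regex per line. \d should be fine here, since grab-site's ignore
    # patterns are Python-style regexes rather than POSIX ones.
    printf '%s\n' \
        '^https?://www\.aimljobs\.com/pages/\d+$' \
        '^https?://www\.aimljobs\.com/tags/.+$' \
        >> "$crawl_dir/ignores"
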
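For reference, the settings that finally stopped the 429s map onto grab-site's launch options roughly as follows (a sketch only: the flag names are from memory, it is assumed the crawler in the exchange above was grab-site, and $start_url is a placeholder for the actual start URL):

    # One connection at a time, with a random 1000-2500 ms pause between
    # requests -- the combination that worked against stackoverflow.com above.
    grab-site --concurrency=1 --delay=1000-2500 "$start_url"

The trade-off is the usual one: a longer, more polite crawl in exchange for not being rate-limited.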