#archiveteam-ot 2019-07-22,Mon

Time Nickname Message
01:27 🔗 Stiletto has quit IRC (Ping timeout: 506 seconds)
01:41 🔗 Stiletto has joined #archiveteam-ot
01:41 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
01:53 🔗 Stiletto has quit IRC (Ping timeout: 745 seconds)
02:00 🔗 SketchCow betamax: I was not on IRC so I missed it.
02:28 🔗 Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
02:48 🔗 Stiletto has joined #archiveteam-ot
02:52 🔗 Stiletto has quit IRC (Remote host closed the connection)
03:03 🔗 Flashfloo has quit IRC (Read error: Connection reset by peer)
03:03 🔗 kiska has quit IRC (Remote host closed the connection)
03:03 🔗 Flashfire has quit IRC (Remote host closed the connection)
03:04 🔗 Flashfloo has joined #archiveteam-ot
03:04 🔗 kiska has joined #archiveteam-ot
03:04 🔗 Flashfire has joined #archiveteam-ot
03:04 🔗 Fusl sets mode: +o kiska
03:29 🔗 Stiletto has joined #archiveteam-ot
03:32 🔗 Stilettoo has joined #archiveteam-ot
03:40 🔗 Stiletto has quit IRC (Ping timeout: 506 seconds)
03:42 🔗 Stilettoo has quit IRC (Ping timeout: 604 seconds)
03:56 🔗 kpcyrd any recommendations about suppressing index.html?C=N&O=A stuff?
03:56 🔗 qw3rty111 has joined #archiveteam-ot
03:58 🔗 kpcyrd basically I'm trying to exclude directory listings altogether and just download the content instead
04:01 🔗 qw3rty119 has quit IRC (Ping timeout: 600 seconds)
04:05 🔗 astrid in wget?
04:14 🔗 kpcyrd yes
04:18 🔗 kpcyrd right now I'm using `wget -r -np -nc $url`
04:20 🔗 astrid directory listings are necessary for retrieval, the only way to not have them is to delete them afterwards
04:23 🔗 kpcyrd the index.html files being saved is one problem, but the bigger one is the `?C=N&O=A` stuff which controls sorting of the directory listing, basically each index.html is downloaded 7 times and that's wasting a lot of time in large trees
04:49 🔗 astrid ah
05:00 🔗 Flashfire I didn't think it downloaded the files each time
05:06 🔗 Leslie has quit IRC (Quit: leaving)
05:30 🔗 Mateon1 has quit IRC (Remote host closed the connection)
05:32 🔗 Mateon1 has joined #archiveteam-ot
07:44 🔗 m007a83_ has joined #archiveteam-ot
07:48 🔗 m007a83 has quit IRC (Ping timeout: 252 seconds)
08:14 🔗 m007a83_ has quit IRC (Ping timeout: 252 seconds)
08:17 🔗 m007a83 has joined #archiveteam-ot
08:36 🔗 deevious has joined #archiveteam-ot
10:10 🔗 Shen has quit IRC (Ping timeout: 240 seconds)
10:27 🔗 Shen has joined #archiveteam-ot
10:34 🔗 JAA kpcyrd: --reject-regex '/index\.html\?C=[NMSD]&O=[AD]$'
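
A minimal sketch of what JAA's pattern filters out, using Python's `re` module as a stand-in for wget's own matcher. The URLs are hypothetical, and the sketch assumes the pattern is searched for anywhere in the full URL; a real server may format its sort links differently, so it is worth checking a few actual URLs before relying on it.

```python
import re

# JAA's pattern, copied verbatim from the log.
SORT_LINK = re.compile(r'/index\.html\?C=[NMSD]&O=[AD]$')

# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/pub/index.html?C=N&O=A",  # sort by name, ascending
    "http://example.com/pub/index.html?C=M&O=D",  # sort by date, descending
    "http://example.com/pub/index.html",          # the plain listing
    "http://example.com/pub/file.tar.gz",         # actual content
]

def is_rejected(url):
    """Return True if the URL looks like a sorted-listing variant."""
    return SORT_LINK.search(url) is not None

# Only the plain listing and the content file survive the filter.
kept = [u for u in urls if not is_rejected(u)]
for u in kept:
    print(u)
```

The plain `index.html` is still fetched, which matches astrid's point that the listings themselves are needed for recursion.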
10:58 🔗 Dj-Wawa has joined #archiveteam-ot
12:13 🔗 VerifiedJ has joined #archiveteam-ot
12:29 🔗 BlueMax has quit IRC (Quit: Leaving)
14:52 🔗 systwi_ has joined #archiveteam-ot
14:58 🔗 systwi has quit IRC (Read error: Operation timed out)
14:59 🔗 systwi_ is now known as systwi
16:16 🔗 martini has joined #archiveteam-ot
16:49 🔗 Stiletto has joined #archiveteam-ot
17:00 🔗 DogsRNice has joined #archiveteam-ot
17:55 🔗 yano has quit IRC (WeeChat, The Better IRC Client, https://weechat.org/)
18:06 🔗 yano has joined #archiveteam-ot
18:10 🔗 schbirid has joined #archiveteam-ot
18:29 🔗 killsushi has joined #archiveteam-ot
19:09 🔗 systwi Can anyone here help me create an ignore for a website I'm trying to grab?
19:12 🔗 systwi How would I set an ignore pattern to prevent crawling to/grabbing other pages of a site? i.e. I want to block https://example.com/pages/2 and all other pages, including the URLs on those pages
19:21 🔗 Igloo Via grab-site or something?
19:22 🔗 systwi Yep
19:23 🔗 Igloo https://github.com/ArchiveTeam/grab-site#changing-ignores-during-the-crawl
19:24 🔗 Igloo So what are you trying to ignore?
19:24 🔗 systwi Sorry BRB
19:37 🔗 systwi Igloo: I'm trying to ignore https://www.aimljobs.com/pages/2 and /3 and /4 and so on
19:37 🔗 JAA systwi: ^https?://www\.aimljobs\.com/pages/\d+$
19:38 🔗 systwi Awesome, thank you!
19:39 🔗 systwi What is the 'https?:' ? Does that mean either 'http:' or 'https:' ?
19:39 🔗 JAA Yep
19:39 🔗 systwi Ah ok :)
19:39 🔗 JAA s? = optionally match an "s"
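
A quick way to check the pattern JAA gave, sketched with Python's `re` module. The assumption here is that the ignore is applied as a regular-expression search against the full URL; the sample URLs other than the `/pages/2` form are made up for illustration.

```python
import re

# JAA's ignore pattern, copied verbatim from the log.
PAGES = re.compile(r'^https?://www\.aimljobs\.com/pages/\d+$')

def is_ignored(url):
    """Return True if the URL is a numbered /pages/N listing page."""
    return PAGES.search(url) is not None

# "s?" makes the "s" optional, so both schemes match.
print(is_ignored("https://www.aimljobs.com/pages/2"))   # True
print(is_ignored("http://www.aimljobs.com/pages/37"))   # True

# "\d+" needs at least one digit, and "$" forbids anything after it.
print(is_ignored("https://www.aimljobs.com/pages/"))        # False
print(is_ignored("https://www.aimljobs.com/pages/2/jobs"))  # False
```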
19:40 🔗 systwi I'm not that good with regex. I tried following https://duckduckgo.com/?q=regex+help&ia=cheatsheet&iax=1 but could only understand some of it
19:40 🔗 Igloo www.regex101.com
19:40 🔗 Igloo Just keep trying stuff.
19:40 🔗 Igloo :)
19:40 🔗 JAA I like https://www.regular-expressions.info/
19:40 🔗 systwi Thanks guys :)
19:41 🔗 JAA It has a good intro (as far as I can tell, never had to learn it from there), and also all the details when they're needed.
19:41 🔗 systwi Nice
19:42 🔗 systwi Hah yeah my only regex knowledge is shell assistance from stack exchange
20:03 🔗 systwi So for blocking out pages like .com/tags/java .com/tags/cplusplus .com/tags/perl can I do ^https?://www\.aimljobs\.com/tags/*$ ?
20:09 🔗 Igloo ^https?://www\.aimljobs\.com/tags/.+$
20:10 🔗 Igloo that'll ignore .com/tags/everything
20:10 🔗 Igloo You don't need the $ either
20:11 🔗 systwi Nice thanks. Do I still need $ in the ignore JAA gave me?
20:14 🔗 JAA $ = end of the string
20:15 🔗 JAA But in the pattern Igloo gave, .+ anyway matches anything until the end of the string, so it doesn't matter.
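
To make the difference concrete, here is a small Python sketch comparing systwi's first attempt with Igloo's pattern, with and without the trailing `$`. It assumes search-style matching against single-line URLs; the test URLs are the ones systwi mentioned plus the bare `/tags/` path.

```python
import re

BASE = r'^https?://www\.aimljobs\.com/tags/'

# systwi's attempt: "/*" means "zero or more slashes", not "anything".
attempt = re.compile(BASE + r'*$')
# Igloo's pattern: ".+" means "one or more of any character".
with_dollar = re.compile(BASE + r'.+$')
without_dollar = re.compile(BASE + r'.+')

urls = [
    "https://www.aimljobs.com/tags/java",
    "https://www.aimljobs.com/tags/cplusplus",
    "https://www.aimljobs.com/tags/perl",
    "https://www.aimljobs.com/tags/",
]

for url in urls:
    print(
        url,
        bool(attempt.search(url)),
        bool(with_dollar.search(url)),
        bool(without_dollar.search(url)),
    )
```

The attempt only matches the bare `/tags/` URL, while both of Igloo's variants match the three tag pages and give identical results, which is JAA's point about the `$` being redundant after `.+`.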
20:16 🔗 systwi Oh ok I was a bit confused
20:16 🔗 systwi Thanks
21:16 🔗 martini has quit IRC (Quit: No Reasson)
21:24 🔗 schbirid has quit IRC (Remote host closed the connection)
21:44 🔗 dashcloud has quit IRC (Ping timeout: 252 seconds)
21:49 🔗 dashcloud has joined #archiveteam-ot
22:22 🔗 systwi I'm grabbing some content from stackoverflow.com, but I'm getting a bunch of "429 Too many requests." Do I need to slow my crawler down some more?
22:22 🔗 JAA That's what 429 normally means, yes.
22:22 🔗 systwi I'm already crawling at 1 connection and 275-375 delay
22:23 🔗 systwi Bumped the delay to 500-2000 with no change
22:26 🔗 systwi Changed it to 1000-2500 and it seems to have done the trick. Now we wait
22:26 🔗 JAA That's the archival dance.
22:27 🔗 systwi Hah
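
systwi widened the delay range by hand until the 429s stopped. The sketch below automates the same idea and is not something from the log: it picks a random delay inside a range and doubles the range whenever a 429 comes back. The function names, the doubling factor, the cap, and the assumption that the delays are in milliseconds are all hypothetical.

```python
import random

def pick_delay_ms(low, high, rng=random):
    """Pick a random delay within [low, high], assumed to be milliseconds."""
    return rng.uniform(low, high)

def next_range(low, high, status, factor=2.0, cap=60000):
    """Widen the delay range after a 429 response; otherwise keep it.

    factor and cap are illustrative defaults, not values from the log.
    """
    if status == 429:
        return min(low * factor, cap), min(high * factor, cap)
    return low, high

# Start from systwi's original range and simulate two 429s, then a 200.
low, high = 275, 375
for status in (429, 429, 200):
    low, high = next_range(low, high, status)
    print(status, (low, high), round(pick_delay_ms(low, high)))
```

Randomising inside a range rather than using a fixed delay is kept from systwi's setup; the automatic widening is the added assumption.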
22:39 🔗 BlueMax has joined #archiveteam-ot
23:17 🔗 Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
