#archiveteam-ot 2019-07-22,Mon

Time	Nickname	Message
01:27 ^🔗		Stiletto has quit IRC (Ping timeout: 506 seconds)
01:41 ^🔗		Stiletto has joined #archiveteam-ot
01:41 ^🔗		DogsRNice has quit IRC (Read error: Connection reset by peer)
01:53 ^🔗		Stiletto has quit IRC (Ping timeout: 745 seconds)
02:00 ^🔗	SketchCow	betamax: I was not on IRC so I missed it.
02:28 ^🔗		Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
02:48 ^🔗		Stiletto has joined #archiveteam-ot
02:52 ^🔗		Stiletto has quit IRC (Remote host closed the connection)
03:03 ^🔗		Flashfloo has quit IRC (Read error: Connection reset by peer)
03:03 ^🔗		kiska has quit IRC (Remote host closed the connection)
03:03 ^🔗		Flashfire has quit IRC (Remote host closed the connection)
03:04 ^🔗		Flashfloo has joined #archiveteam-ot
03:04 ^🔗		kiska has joined #archiveteam-ot
03:04 ^🔗		Flashfire has joined #archiveteam-ot
03:04 ^🔗		Fusl sets mode: +o kiska
03:29 ^🔗		Stiletto has joined #archiveteam-ot
03:32 ^🔗		Stilettoo has joined #archiveteam-ot
03:40 ^🔗		Stiletto has quit IRC (Ping timeout: 506 seconds)
03:42 ^🔗		Stilettoo has quit IRC (Ping timeout: 604 seconds)
03:56 ^🔗	kpcyrd	any recommendations about suppressing index.html?C=N&O=A stuff?
03:56 ^🔗		qw3rty111 has joined #archiveteam-ot
03:58 ^🔗	kpcyrd	basically I'm trying to exclude directory listings altogether and just download the content instead
04:01 ^🔗		qw3rty119 has quit IRC (Ping timeout: 600 seconds)
04:05 ^🔗	astrid	in wget?
04:14 ^🔗	kpcyrd	yes
04:18 ^🔗	kpcyrd	right now I'm using `wget -r -np -nc $url`
04:20 ^🔗	astrid	directory listings are necessary for retrieval, the only way to not have them is to delete them afterwards
04:23 ^🔗	kpcyrd	the index.html files being saved is one problem, but the bigger one is the `?C=N&O=A` stuff which controls sorting of the directory listing, basically each index.html is downloaded 7 times and that's wasting a lot of time in large trees
04:49 ^🔗	astrid	ah
05:00 ^🔗	Flashfire	I didn't think it downloaded the files each time
05:06 ^🔗		Leslie has quit IRC (Quit: leaving)
05:30 ^🔗		Mateon1 has quit IRC (Remote host closed the connection)
05:32 ^🔗		Mateon1 has joined #archiveteam-ot
07:44 ^🔗		m007a83_ has joined #archiveteam-ot
07:48 ^🔗		m007a83 has quit IRC (Ping timeout: 252 seconds)
08:14 ^🔗		m007a83_ has quit IRC (Ping timeout: 252 seconds)
08:17 ^🔗		m007a83 has joined #archiveteam-ot
08:36 ^🔗		deevious has joined #archiveteam-ot
10:10 ^🔗		Shen has quit IRC (Ping timeout: 240 seconds)
10:27 ^🔗		Shen has joined #archiveteam-ot
10:34 ^🔗	JAA	kpcyrd: --reject-regex '/index\.html\?C=[NMSD]&O=[AD]$'
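
[Editor's sketch] The pattern above can be tried out before starting a long crawl. The snippet below is only an approximation: it uses Python's `re` module rather than wget's own regex engine, and every example.com URL in it is a made-up illustration, not something from the log. The letters in the character classes are read here as the column and order codes of Apache-style listings (name, modified, size, description; ascending, descending), which is an assumption about the server being crawled.

```python
import re

# Pattern from JAA's message, as it would be given to wget's --reject-regex.
REJECT = re.compile(r"/index\.html\?C=[NMSD]&O=[AD]$")

# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/files/index.html?C=N&O=A",  # sort link: name, ascending
    "http://example.com/files/index.html?C=M&O=D",  # sort link: modified, descending
    "http://example.com/files/index.html",          # the plain listing
    "http://example.com/files/data.tar.gz",         # actual content
    "http://example.com/files/?C=N&O=A",            # sort link without "index.html"
]

# A URL is skipped when the pattern is found anywhere in it.
rejected = [u for u in urls if REJECT.search(u)]
kept = [u for u in urls if not REJECT.search(u)]

for u in rejected:
    print("reject:", u)
for u in kept:
    print("keep:  ", u)
```

In this sketch the plain listing and the content file are kept, so recursion still works. The last URL is also kept, because the pattern requires a literal `/index.html?`; if a server links its sort options as `/?C=N&O=A`, the pattern would need adjusting.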
10:58 ^🔗		Dj-Wawa has joined #archiveteam-ot
12:13 ^🔗		VerifiedJ has joined #archiveteam-ot
12:29 ^🔗		BlueMax has quit IRC (Quit: Leaving)
14:52 ^🔗		systwi_ has joined #archiveteam-ot
14:58 ^🔗		systwi has quit IRC (Read error: Operation timed out)
14:59 ^🔗		systwi_ is now known as systwi
16:16 ^🔗		martini has joined #archiveteam-ot
16:49 ^🔗		Stiletto has joined #archiveteam-ot
17:00 ^🔗		DogsRNice has joined #archiveteam-ot
17:55 ^🔗		yano has quit IRC (WeeChat, The Better IRC Client, https://weechat.org/)
18:06 ^🔗		yano has joined #archiveteam-ot
18:10 ^🔗		schbirid has joined #archiveteam-ot
18:29 ^🔗		killsushi has joined #archiveteam-ot
19:09 ^🔗	systwi	Can anyone here help me create an ignore for a website I'm trying to grab?
19:12 ^🔗	systwi	How would I set an ignore pattern to prevent crawling to/grabbing other pages of a site? i.e. I want to block https://example.com/pages/2 and all other pages, including the URLs on those pages
19:21 ^🔗	Igloo	Via grab-site or something?
19:22 ^🔗	systwi	Yep
19:23 ^🔗	Igloo	https://github.com/ArchiveTeam/grab-site#changing-ignores-during-the-crawl
19:24 ^🔗	Igloo	So what are you trying to ignore?
19:24 ^🔗	systwi	Sorry BRB
19:37 ^🔗	systwi	Igloo: I'm trying to ignore https://www.aimljobs.com/pages/2 and /3 and /4 and so on
19:37 ^🔗	JAA	systwi: ^https?://www\.aimljobs\.com/pages/\d+$
19:38 ^🔗	systwi	Awesome, thank you!
19:39 ^🔗	systwi	What is the 'https?:' ? Does that mean either 'http:' or 'https:' ?
19:39 ^🔗	JAA	Yep
19:39 ^🔗	systwi	Ah ok :)
19:39 ^🔗	JAA	s? = optionally match an "s"
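
[Editor's sketch] A quick way to see what JAA's ignore pattern does and does not catch, using Python's `re` module as a stand-in for the crawler's matcher (an assumption; the log does not say how grab-site applies ignores). The candidate URLs other than /pages/2 are hypothetical.

```python
import re

# Ignore pattern from JAA's message.
IGNORE = re.compile(r"^https?://www\.aimljobs\.com/pages/\d+$")

# Candidate URLs; only /pages/2 is mentioned in the log, the rest are made up.
candidates = [
    "https://www.aimljobs.com/pages/2",        # "s?" present
    "http://www.aimljobs.com/pages/17",        # "s?" absent, several digits
    "https://www.aimljobs.com/pages/2/extra",  # "$" forbids anything after the digits
    "https://www.aimljobs.com/pages/",         # "\d+" needs at least one digit
    "https://www.aimljobs.com/",               # unrelated page
]

ignored = [u for u in candidates if IGNORE.search(u)]

for u in candidates:
    print("ignore" if u in ignored else "crawl ", u)
```

Only the first two candidates are ignored: `s?` lets both schemes through, `\d+` requires one or more digits, and `^`/`$` pin the match to the whole URL.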
19:40 ^🔗	systwi	I'm not that good with regex. I tried following https://duckduckgo.com/?q=regex+help&ia=cheatsheet&iax=1 but could only understand some of it
19:40 ^🔗	Igloo	www.regex101.com
19:40 ^🔗	Igloo	Just keep trying stuff.
19:40 ^🔗	Igloo	:)
19:40 ^🔗	JAA	I like https://www.regular-expressions.info/
19:40 ^🔗	systwi	Thanks guys :)
19:41 ^🔗	JAA	It has a good intro (as far as I can tell, never had to learn it from there), and also all the details when they're needed.
19:41 ^🔗	systwi	Nice
19:42 ^🔗	systwi	Hah yeah my only regex knowledge is shell assistance from stack exchange
20:03 ^🔗	systwi	So for blocking out pages like .com/tags/java .com/tags/cplusplus .com/tags/perl can I do ^https?://www\.aimljobs\.com/tags/*$ ?
20:09 ^🔗	Igloo	^https?://www\.aimljobs\.com/tags/.+$
20:10 ^🔗	Igloo	that'll ignore .com/tags/everything
20:10 ^🔗	Igloo	You don't need the $ either
20:11 ^🔗	systwi	Nice thanks. Do I still need $ in the ignore JAA gave me?
20:14 ^🔗	JAA	$ = end of the string
20:15 ^🔗	JAA	But in the pattern Igloo gave, .+ anyway matches anything until the end of the string, so it doesn't matter.
20:16 ^🔗	systwi	Oh ok I was a bit confused
20:16 ^🔗	systwi	Thanks
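
[Editor's sketch] The difference between systwi's `/tags/*$` attempt and Igloo's `/tags/.+` can be checked directly. In a regex, `*` repeats the preceding item (here the `/`), so it is not a shell-style wildcard. The snippet uses Python's `re` module as a stand-in for the crawler; the variable names are illustrative only.

```python
import re

BASE = r"^https?://www\.aimljobs\.com/tags/"

star = re.compile(BASE + r"*$")           # systwi's attempt: "/*" = zero or more "/"
plus_anchored = re.compile(BASE + r".+$") # Igloo's pattern with the trailing "$"
plus_open = re.compile(BASE + r".+")      # Igloo's pattern without it

urls = [
    "https://www.aimljobs.com/tags/java",
    "https://www.aimljobs.com/tags/cplusplus",
    "https://www.aimljobs.com/tags/perl",
    "https://www.aimljobs.com/tags/",
    "https://www.aimljobs.com/tags",
]

for u in urls:
    print(
        u,
        "| /*$:", bool(star.search(u)),
        "| .+$:", bool(plus_anchored.search(u)),
        "| .+:", bool(plus_open.search(u)),
    )
```

The `/*$` version matches only the bare `/tags` and `/tags/` URLs, never the individual tag pages. Both `.+` versions agree on every URL, which is JAA's point: `.+` already runs to the end of the string, so the `$` adds nothing there.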
21:16 ^🔗		martini has quit IRC (Quit: No Reasson)
21:24 ^🔗		schbirid has quit IRC (Remote host closed the connection)
21:44 ^🔗		dashcloud has quit IRC (Ping timeout: 252 seconds)
21:49 ^🔗		dashcloud has joined #archiveteam-ot
22:22 ^🔗	systwi	I'm grabbing some content from stackoverflow.com, but I'm getting a bunch of "429 Too many requests." Do I need to slow my crawler down some more?
22:22 ^🔗	JAA	That's what 429 normally means, yes.
22:22 ^🔗	systwi	I'm already crawling at 1 connection and 275-375 delay
22:23 ^🔗	systwi	Bumped the delay to 500-2000 with no change
22:26 ^🔗	systwi	Changed it to 1000-2500 and it seems to have done the trick. Now we wait
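
[Editor's sketch] A delay given as a range such as 1000-2500 is usually read as "wait a random amount between those bounds before each request". The snippet below illustrates that idea only; it is not grab-site's implementation, the function names are hypothetical, and it assumes the numbers are milliseconds, which the log does not state.

```python
import random


def parse_delay_range(spec):
    """Parse "1000-2500" (or a single value like "500") into (low, high)."""
    low_text, _, high_text = spec.partition("-")
    low = int(low_text)
    high = int(high_text) if high_text else low
    if low > high:
        raise ValueError("lower bound exceeds upper bound: %r" % spec)
    return low, high


def pick_delay_seconds(spec, rng=random):
    """Pick a random delay within the range, converted from ms to seconds."""
    low, high = parse_delay_range(spec)
    return rng.uniform(low, high) / 1000.0


# The three settings systwi tried, in order.
for spec in ("275-375", "500-2000", "1000-2500"):
    low, high = parse_delay_range(spec)
    print(spec, "-> e.g. %.3f s between requests" % pick_delay_seconds(spec))
```

A crawler would sleep for the picked value (for example with `time.sleep`) before each request; widening the range lowers the average request rate, which is what eventually stopped the 429 responses in the log.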
22:26 ^🔗	JAA	That's the archival dance.
22:27 ^🔗	systwi	Hah
22:39 ^🔗		BlueMax has joined #archiveteam-ot
23:17 ^🔗		Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
