Time |
Nickname |
Message |
01:27
🔗
|
|
Stiletto has quit IRC (Ping timeout: 506 seconds) |
01:41
🔗
|
|
Stiletto has joined #archiveteam-ot |
01:41
🔗
|
|
DogsRNice has quit IRC (Read error: Connection reset by peer) |
01:53
🔗
|
|
Stiletto has quit IRC (Ping timeout: 745 seconds) |
02:00
🔗
|
SketchCow |
betamax: I was not on IRC so I missed it. |
02:28
🔗
|
|
Dj-Wawa has quit IRC (Quit: Connection closed for inactivity) |
02:48
🔗
|
|
Stiletto has joined #archiveteam-ot |
02:52
🔗
|
|
Stiletto has quit IRC (Remote host closed the connection) |
03:03
🔗
|
|
Flashfloo has quit IRC (Read error: Connection reset by peer) |
03:03
🔗
|
|
kiska has quit IRC (Remote host closed the connection) |
03:03
🔗
|
|
Flashfire has quit IRC (Remote host closed the connection) |
03:04
🔗
|
|
Flashfloo has joined #archiveteam-ot |
03:04
🔗
|
|
kiska has joined #archiveteam-ot |
03:04
🔗
|
|
Flashfire has joined #archiveteam-ot |
03:04
🔗
|
|
Fusl sets mode: +o kiska |
03:29
🔗
|
|
Stiletto has joined #archiveteam-ot |
03:32
🔗
|
|
Stilettoo has joined #archiveteam-ot |
03:40
🔗
|
|
Stiletto has quit IRC (Ping timeout: 506 seconds) |
03:42
🔗
|
|
Stilettoo has quit IRC (Ping timeout: 604 seconds) |
03:56
🔗
|
kpcyrd |
any recommendations about suppressing index.html?C=N&O=A stuff? |
03:56
🔗
|
|
qw3rty111 has joined #archiveteam-ot |
03:58
🔗
|
kpcyrd |
basically I'm trying to exclude directory listings altogether and just download the content instead |
04:01
🔗
|
|
qw3rty119 has quit IRC (Ping timeout: 600 seconds) |
04:05
🔗
|
astrid |
in wget? |
04:14
🔗
|
kpcyrd |
yes |
04:18
🔗
|
kpcyrd |
right now I'm using `wget -r -np -nc $url` |
04:20
🔗
|
astrid |
directory listings are necessary for retrieval, the only way to not have them is to delete them afterwards |
04:23
🔗
|
kpcyrd |
the index.html files being saved is one problem, but the bigger one is the `?C=N&O=A` stuff, which controls sorting of the directory listing. Basically each index.html is downloaded 7 times, and that wastes a lot of time in large trees |
04:49
🔗
|
astrid |
ah |
05:00
🔗
|
Flashfire |
I didn't think it downloaded the files each time |
05:06
🔗
|
|
Leslie has quit IRC (Quit: leaving) |
05:30
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
05:32
🔗
|
|
Mateon1 has joined #archiveteam-ot |
07:44
🔗
|
|
m007a83_ has joined #archiveteam-ot |
07:48
🔗
|
|
m007a83 has quit IRC (Ping timeout: 252 seconds) |
08:14
🔗
|
|
m007a83_ has quit IRC (Ping timeout: 252 seconds) |
08:17
🔗
|
|
m007a83 has joined #archiveteam-ot |
08:36
🔗
|
|
deevious has joined #archiveteam-ot |
10:10
🔗
|
|
Shen has quit IRC (Ping timeout: 240 seconds) |
10:27
🔗
|
|
Shen has joined #archiveteam-ot |
10:34
🔗
|
JAA |
kpcyrd: --reject-regex '/index\.html\?C=[NMSD]&O=[AD]$' |
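A quick way to sanity-check JAA's pattern before a long crawl is to run it over a few URLs. The sketch below is only an illustration: the URLs are made up, and Python's `re` stands in for wget's own regex engine, which may differ, although this pattern only uses constructs that the common regex flavours share.

```python
import re

# Pattern quoted from the chat message above.
REJECT = re.compile(r'/index\.html\?C=[NMSD]&O=[AD]$')

# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/pub/index.html",
    "http://example.com/pub/index.html?C=N&O=A",
    "http://example.com/pub/index.html?C=M&O=D",
    "http://example.com/pub/file.tar.gz",
    "http://example.com/pub/?C=N&O=A",
]


def should_skip(url: str) -> bool:
    """Return True if the URL matches the sorted-listing pattern."""
    return REJECT.search(url) is not None


# Keep only the URLs the pattern does not reject.
kept = [u for u in urls if not should_skip(u)]
for u in kept:
    print(u)
```

Note that the last sample URL is kept: the pattern requires a literal `index.html` before the query string, so a server that links its sort columns as `/pub/?C=N&O=A` would need an adjusted pattern.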
10:58
🔗
|
|
Dj-Wawa has joined #archiveteam-ot |
12:13
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
12:29
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
14:52
🔗
|
|
systwi_ has joined #archiveteam-ot |
14:58
🔗
|
|
systwi has quit IRC (Read error: Operation timed out) |
14:59
🔗
|
|
systwi_ is now known as systwi |
16:16
🔗
|
|
martini has joined #archiveteam-ot |
16:49
🔗
|
|
Stiletto has joined #archiveteam-ot |
17:00
🔗
|
|
DogsRNice has joined #archiveteam-ot |
17:55
🔗
|
|
yano has quit IRC (WeeChat, The Better IRC Client, https://weechat.org/) |
18:06
🔗
|
|
yano has joined #archiveteam-ot |
18:10
🔗
|
|
schbirid has joined #archiveteam-ot |
18:29
🔗
|
|
killsushi has joined #archiveteam-ot |
19:09
🔗
|
systwi |
Can anyone here help me create an ignore for a website I'm trying to grab? |
19:12
🔗
|
systwi |
How would I set an ignore pattern to prevent crawling to/grabbing other pages of a site? i.e. I want to block https://example.com/pages/2 and all other pages, including the URLs on those pages |
19:21
🔗
|
Igloo |
Via grab-site or something? |
19:22
🔗
|
systwi |
Yep |
19:23
🔗
|
Igloo |
https://github.com/ArchiveTeam/grab-site#changing-ignores-during-the-crawl |
19:24
🔗
|
Igloo |
So what are you trying to ignore? |
19:24
🔗
|
systwi |
Sorry BRB |
19:37
🔗
|
systwi |
Igloo: I'm trying to ignore https://www.aimljobs.com/pages/2 and /3 and /4 and so on |
19:37
🔗
|
JAA |
systwi: ^https?://www\.aimljobs\.com/pages/\d+$ |
19:38
🔗
|
systwi |
Awesome, thank you! |
19:39
🔗
|
systwi |
What is the 'https?:' ? Does that mean either 'http:' or 'https:' ?
19:39
🔗
|
JAA |
Yep |
19:39
🔗
|
systwi |
Ah ok :) |
19:39
🔗
|
JAA |
s? = optionally match an "s" |
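To see what `s?` and the trailing `\d+$` do in practice, the ignore pattern can be tried against a few URLs. This is a minimal sketch that assumes the pattern is applied with search semantics, as Python's `re.search` does. The `is_ignored` helper and the sample URLs other than `/pages/2` are made up for illustration.

```python
import re

# Ignore pattern quoted from the chat message above.
IGNORE = re.compile(r'^https?://www\.aimljobs\.com/pages/\d+$')


def is_ignored(url: str) -> bool:
    """Return True if the URL would be caught by the ignore pattern."""
    return IGNORE.search(url) is not None


print(is_ignored("https://www.aimljobs.com/pages/2"))        # True
print(is_ignored("http://www.aimljobs.com/pages/17"))        # True: the "s" is optional
print(is_ignored("https://www.aimljobs.com/pages/2/extra"))  # False: "$" requires the URL to end after the digits
print(is_ignored("https://www.aimljobs.com/jobs/2"))         # False: different path
```

The pattern also matches `/pages/1`, since `\d+` accepts any run of digits.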
19:40
🔗
|
systwi |
I'm not that good with regex. I tried following https://duckduckgo.com/?q=regex+help&ia=cheatsheet&iax=1 but could only understand some of it |
19:40
🔗
|
Igloo |
www.regex101.com |
19:40
🔗
|
Igloo |
Just keep trying stuff. |
19:40
🔗
|
Igloo |
:) |
19:40
🔗
|
JAA |
I like https://www.regular-expressions.info/ |
19:40
🔗
|
systwi |
Thanks guys :) |
19:41
🔗
|
JAA |
It has a good intro (as far as I can tell, never had to learn it from there), and also all the details when they're needed. |
19:41
🔗
|
systwi |
Nice |
19:42
🔗
|
systwi |
Hah yeah my only regex knowledge is shell assistance from stack exchange |
20:03
🔗
|
systwi |
So for blocking out pages like .com/tags/java .com/tags/cplusplus .com/tags/perl can I do ^https?://www\.aimljobs\.com/tags/*$ ? |
20:09
🔗
|
Igloo |
^https?://www\.aimljobs\.com/tags/.+$ |
20:10
🔗
|
Igloo |
that'll ignore .com/tags/everything |
20:10
🔗
|
Igloo |
You don't need the $ either |
20:11
🔗
|
systwi |
Nice thanks. Do I still need $ in the ignore JAA gave me? |
20:14
🔗
|
JAA |
$ = end of the string |
20:15
🔗
|
JAA |
But in the pattern Igloo gave, .+ anyway matches anything until the end of the string, so it doesn't matter. |
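The difference between the three variants discussed here can be checked directly. In a regex, `*` is not a shell-style wildcard: it repeats the preceding token, so `/tags/*` means "`/tags` followed by zero or more slashes". The sketch below uses Python's `re` and assumes search semantics. The variable names are only for illustration.

```python
import re

BASE = r'^https?://www\.aimljobs\.com/tags/'

star = re.compile(BASE + r'*$')       # systwi's attempt: "/*" repeats the slash
plus = re.compile(BASE + r'.+$')      # Igloo's pattern
plus_open = re.compile(BASE + r'.+')  # same pattern without the trailing "$"

samples = [
    "https://www.aimljobs.com/tags/java",
    "https://www.aimljobs.com/tags/cplusplus",
    "https://www.aimljobs.com/tags/",
    "https://www.aimljobs.com/tags",
]

for url in samples:
    results = [bool(p.search(url)) for p in (star, plus, plus_open)]
    print(url, results)
```

The `star` pattern only catches the bare `/tags` and `/tags/` URLs, while `plus` and `plus_open` agree on every sample, which is why the `$` is optional after `.+`.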
20:16
🔗
|
systwi |
Oh ok I was a bit confused |
20:16
🔗
|
systwi |
Thanks |
21:16
🔗
|
|
martini has quit IRC (Quit: No Reasson) |
21:24
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
21:44
🔗
|
|
dashcloud has quit IRC (Ping timeout: 252 seconds) |
21:49
🔗
|
|
dashcloud has joined #archiveteam-ot |
22:22
🔗
|
systwi |
I'm grabbing some content from stackoverflow.com, but I'm getting a bunch of "429 Too many requests." Do I need to slow my crawler down some more? |
22:22
🔗
|
JAA |
That's what 429 normally means, yes. |
22:22
🔗
|
systwi |
I'm already crawling at 1 connection and a 275-375 delay |
22:23
🔗
|
systwi |
Bumped the delay to 500-2000 with no change |
22:26
🔗
|
systwi |
Changed it to 1000-2500 and it seems to have done the trick. Now we wait |
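The manual back-off systwi describes could be sketched as a small "delay ladder" that moves to a slower range whenever a 429 comes back. This is a hypothetical illustration, not how grab-site itself behaves. The delay ranges are the ones from the chat, the millisecond unit is an assumption, and the status codes in the loop are simulated.

```python
import random

# Delay ranges (min, max) in the order tried in the chat.
# Assumption: the values are milliseconds.
DELAY_LADDER = [(275, 375), (500, 2000), (1000, 2500)]


def next_step(step: int, status: int) -> int:
    """Move one rung up the ladder after a 429; otherwise stay put."""
    if status == 429:
        return min(step + 1, len(DELAY_LADDER) - 1)
    return step


def pick_delay_seconds(step: int, rng: random.Random) -> float:
    """Pick a random delay inside the current range, converted to seconds."""
    low, high = DELAY_LADDER[step]
    return rng.uniform(low, high) / 1000.0


# Simulated response codes, made up for illustration.
step = 0
rng = random.Random(0)
for status in [200, 429, 429, 200, 429]:
    step = next_step(step, status)
    delay = pick_delay_seconds(step, rng)
    print(status, DELAY_LADDER[step], round(delay, 3))
```

A real crawler would sleep for `delay` seconds before the next request. The ladder never steps back down here, which mirrors the "now we wait" approach in the chat.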
22:26
🔗
|
JAA |
That's the archival dance. |
22:27
🔗
|
systwi |
Hah |
22:39
🔗
|
|
BlueMax has joined #archiveteam-ot |
23:17
🔗
|
|
Dj-Wawa has quit IRC (Quit: Connection closed for inactivity) |