Time |
Nickname |
Message |
00:10
🔗
|
|
BlueMax has joined #archiveteam-ot |
01:12
🔗
|
|
kiska has joined #archiveteam-ot |
01:14
🔗
|
|
Polylith has quit IRC (Read error: Operation timed out) |
01:15
🔗
|
|
Polylith has joined #archiveteam-ot |
01:29
🔗
|
|
ColdIce has quit IRC (Read error: Operation timed out) |
01:35
🔗
|
|
ColdIce has joined #archiveteam-ot |
01:38
🔗
|
|
ColdIce has quit IRC (Read error: Connection reset by peer) |
01:39
🔗
|
|
w0rmhole has joined #archiveteam-ot |
02:12
🔗
|
|
adinbied has quit IRC (Quit: Left Channel.) |
02:25
🔗
|
|
adinbied has joined #archiveteam-ot |
02:30
🔗
|
|
adinbied has quit IRC (Quit: Left Channel.) |
02:44
🔗
|
|
adinbied has joined #archiveteam-ot |
03:42
🔗
|
|
Odd0002 has quit IRC (Quit: ZNC - http://znc.in) |
03:44
🔗
|
|
ivan has quit IRC (Read error: Operation timed out) |
03:45
🔗
|
|
JAA has quit IRC (Read error: Operation timed out) |
03:45
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
03:45
🔗
|
|
ivan has joined #archiveteam-ot |
03:46
🔗
|
|
svchfoo1 sets mode: +o ivan |
03:48
🔗
|
|
wp494 has quit IRC (Ping timeout: 492 seconds) |
03:51
🔗
|
|
wp494 has joined #archiveteam-ot |
04:02
🔗
|
|
Odd0002 has joined #archiveteam-ot |
04:08
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 268 seconds) |
04:09
🔗
|
|
Mateon1 has joined #archiveteam-ot |
04:45
🔗
|
|
JAA has joined #archiveteam-ot |
04:45
🔗
|
|
svchfoo3 sets mode: +o JAA |
04:46
🔗
|
|
bakJAA sets mode: +o JAA |
04:50
🔗
|
|
jspiros has joined #archiveteam-ot |
04:57
🔗
|
|
dxrt- is now known as dxrt |
04:58
🔗
|
|
dxrt_ sets mode: +o dxrt |
06:15
🔗
|
w0rmhole |
ivan: you know the ins and outs of grab-site, right? |
06:15
🔗
|
Flashfire |
he wrote it |
06:15
🔗
|
Flashfire |
...... |
06:16
🔗
|
w0rmhole |
oh ok, i was going to ask him a question about it |
06:25
🔗
|
ivan |
w0rmhole: I'm here |
06:28
🔗
|
w0rmhole |
ivan: okay, so im using grab-site and i adjusted the delay while a crawl was running from 0ms to 250ms by editing the delay file. |
06:28
🔗
|
w0rmhole |
doing that froze up grab-site. it's not moving at all. i dont really want to break it. |
06:28
🔗
|
w0rmhole |
even setting the delay back to 0 didn't make a difference |
06:28
🔗
|
w0rmhole |
for the record, this is the command i ran: $ grab-site https://www.exxoshost.co.uk/forum?archiveteam --igsets forums |
06:28
🔗
|
ivan |
w0rmhole: you can look at the terminal to see which URLs it's currently grabbing, or use gs-dump-urls with in_progress |
06:29
🔗
|
ivan |
changing a delay to 250ms doesn't freeze crawls, probably a coincidence |
06:29
🔗
|
w0rmhole |
oh of course, right when you typed that it started working |
06:30
🔗
|
w0rmhole |
i think so |
06:31
🔗
|
w0rmhole |
said something about dns resolution errors when it continued, but i think that might just be an issue with the site and not grab-site |
06:33
🔗
|
w0rmhole |
one other question i have if you don't mind |
06:34
🔗
|
ivan |
I'm still here |
06:34
🔗
|
w0rmhole |
that forum keeps putting in that stupid phpsessid garbage in the url |
06:34
🔗
|
w0rmhole |
is there a way for grab-site to not capture those urls, and only the actual url? |
06:35
🔗
|
ivan |
wpull has a URLRewriter that should be handling that |
06:35
🔗
|
w0rmhole |
i.e. https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196 as opposed to https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196&sid=0befb2c2dc4ac8d45b88f1fe7cce2b71 |
06:35
🔗
|
w0rmhole |
is that enabled by default? |
06:35
🔗
|
w0rmhole |
in grab-site |
06:35
🔗
|
ivan |
re.compile("^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I), |
06:36
🔗
|
* |
ivan looks |
06:36
🔗
|
w0rmhole |
... |
06:36
🔗
|
w0rmhole |
i dont know what to do with that x_x |
06:37
🔗
|
ivan |
yes |
06:37
🔗
|
ivan |
libgrabsite/main.py |
06:37
🔗
|
ivan |
253: "--strip-session-id", |
06:37
🔗
|
w0rmhole |
ohh nvm i know what you mean |
06:38
🔗
|
ivan |
so it's enabled but I don't know the details of the implementation |
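The pattern ivan pasted above can be tried on its own. This is a minimal sketch, assuming the pattern is matched against the query string only; `strip_sid` is a hypothetical helper written for illustration and is not wpull's actual URLRewriter code.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Pattern quoted in the log; group 1 is the query before "sid=...",
# group 2 is whatever follows the session id after an "&".
SID_PATTERN = re.compile(r"^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I)


def strip_sid(url):
    """Return url with a 32-character sid parameter removed from its query.

    Hypothetical helper: how wpull applies the pattern internally is an
    assumption here, not something confirmed by the log.
    """
    parts = urlsplit(url)
    match = SID_PATTERN.match(parts.query)
    if not match:
        return url  # no session id found, leave the URL alone
    before, after = match.group(1), match.group(2)
    if after:
        query = before + after
    else:
        # sid was the last parameter, so drop the dangling "&"
        query = before.rstrip("&")
    return urlunsplit(parts._replace(query=query))


if __name__ == "__main__":
    print(strip_sid(
        "https://www.exxoshost.co.uk/forum/viewtopic.php"
        "?f=14&t=1196&sid=0befb2c2dc4ac8d45b88f1fe7cce2b71"
    ))
```

A rewrite like this only affects which URLs get queued; links inside the already-saved HTML still carry whatever `sid=` value the forum put there.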
06:38
🔗
|
ivan |
is grab-site grabbing URLs with the session id? |
06:38
🔗
|
w0rmhole |
in one situation yes |
06:38
🔗
|
w0rmhole |
i'll try to find the original command i used |
06:39
🔗
|
w0rmhole |
btw, do i need to use that ?archiveteam thing in the url like i did up there? |
06:40
🔗
|
w0rmhole |
to keep the session id out |
06:40
🔗
|
ivan |
probably not |
06:40
🔗
|
w0rmhole |
$ grab-site --1 --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20" |
06:40
🔗
|
w0rmhole |
ok so that's the command i ran |
06:41
🔗
|
w0rmhole |
and while browsing the warc with webrecorder player |
06:41
🔗
|
w0rmhole |
in the url field, i saw the session id appear on the 2nd forum page |
06:41
🔗
|
w0rmhole |
aka start=10 |
06:46
🔗
|
ivan |
please file a bug with details because my crawl of the site hangs on something pretty quickly |
06:46
🔗
|
ivan |
maybe it's some weird page requisite behavior, I don't know |
06:47
🔗
|
w0rmhole |
okay, will do |
06:48
🔗
|
w0rmhole |
i had to manually specify the user agent to allow it to grab those |
06:48
🔗
|
w0rmhole |
with ua specified=~30s |
06:48
🔗
|
w0rmhole |
w/o ua specified=~4min |
06:48
🔗
|
w0rmhole |
iirc |
06:59
🔗
|
w0rmhole |
ivan: https://github.com/ludios/grab-site/issues/132 |
06:59
🔗
|
w0rmhole |
hope my english skills aren't too shitty |
07:06
🔗
|
ivan |
replied there |
07:08
🔗
|
ivan |
we don't fabricate responses in WARCs, that would be bad |
07:11
🔗
|
w0rmhole |
sorry |
07:12
🔗
|
w0rmhole |
im still new to grab-site so i am still adjusting to how it works |
07:12
🔗
|
ivan |
w0rmhole: you can start the crawl with a sid, see the README for the cookie stuff |
07:12
🔗
|
ivan |
does it work? don't set your hopes too high |
07:12
🔗
|
w0rmhole |
i did add a cookie file later on |
07:13
🔗
|
w0rmhole |
which didnt make much of a difference |
07:14
🔗
|
ivan |
try setting the cookie expiration time to the distant future |
07:14
🔗
|
ivan |
2147483647 |
07:14
🔗
|
ivan |
paste me your cookie file with the sid |
07:14
🔗
|
w0rmhole |
1 minute pls |
07:15
🔗
|
w0rmhole |
.blogspot.com TRUE / FALSE 2147483647 NCR 1 |
07:15
🔗
|
w0rmhole |
.exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_k |
07:15
🔗
|
w0rmhole |
.exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_sid b2fbc6f704098f6e4a6711a8eb508b98 |
07:15
🔗
|
w0rmhole |
.exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_u 1 |
07:15
🔗
|
w0rmhole |
.reddit.com TRUE / FALSE 2147483647 over18 1 |
07:15
🔗
|
w0rmhole |
store.steampowered.com FALSE / FALSE 2147483647 birthtime 0 |
07:15
🔗
|
w0rmhole |
store.steampowered.com FALSE / FALSE 2147483647 lastagecheckage 1-January-1970 |
07:15
🔗
|
w0rmhole |
store.steampowered.com FALSE / FALSE 2147483647 mature_content 1 |
07:15
🔗
|
w0rmhole |
oh sorry bad formatting |
07:15
🔗
|
w0rmhole |
i will use link |
07:16
🔗
|
ivan |
I think your tabs got lost yeah |
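For reference, a Netscape-style cookie file has seven tab-separated fields per line: domain, include-subdomains flag, path, secure flag, expiration (Unix seconds), name, value. A minimal sketch of putting the tabs back into lines pasted with spaces, like the ones above; `retab_cookie_line` is a hypothetical helper, and it assumes none of the first six fields contains whitespace.

```python
def retab_cookie_line(line):
    """Rebuild a tab-separated cookie line from a whitespace-separated paste.

    Hypothetical helper for illustration. Assumes the first six fields
    contain no whitespace; a missing seventh field is treated as an
    empty cookie value.
    """
    # Split into at most 7 parts so a value containing spaces stays whole.
    parts = line.strip().split(None, 6)
    if len(parts) < 6:
        raise ValueError("expected at least 6 fields, got %d" % len(parts))
    if len(parts) == 6:
        parts.append("")  # cookie with an empty value
    return "\t".join(parts)


if __name__ == "__main__":
    pasted = ".exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_u 1"
    print(repr(retab_cookie_line(pasted)))
```

The `phpbbexxos_k` line in the paste has only six visible fields, which is what an empty value looks like once the tabs are gone.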
07:16
🔗
|
w0rmhole |
http://pasted.co/6b3de39e |
07:18
🔗
|
ivan |
yeah try changing the 1568875520 expiration to 2147483647 |
07:18
🔗
|
w0rmhole |
ok let me try |
07:18
🔗
|
ivan |
and maybe make sure the session is fresh enough for the server to still know about it? |
07:18
🔗
|
ivan |
if that doesn't work there might not be much you can do about the forum software giving you sid links |
07:18
🔗
|
w0rmhole |
sorry that last part confuses me |
07:19
🔗
|
ivan |
there's also a `secure` flag after the path set to TRUE but I assume you're grabbing https:// forum pages |
07:19
🔗
|
w0rmhole |
(english is my second language btw) |
07:19
🔗
|
w0rmhole |
yes |
07:19
🔗
|
ivan |
if the forum forgot about the session it might give you a ?sid= link with a new session, but I'm just guessing how it works |
07:19
🔗
|
w0rmhole |
so possible solution would be to get new sessid? |
07:20
🔗
|
ivan |
they probably expire in a reasonably short period |
07:20
🔗
|
ivan |
yes |
07:20
🔗
|
w0rmhole |
ok |
07:20
🔗
|
w0rmhole |
i should still specify user agent, correct? |
07:20
🔗
|
ivan |
I guess |
07:21
🔗
|
w0rmhole |
if i dont some images do not load |
07:21
🔗
|
ivan |
oh heh never mind 1568875520 is this date next year |
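The two expiration values discussed above are Unix timestamps in seconds. A quick sketch for checking what dates they stand for; `to_utc` is just an illustrative name.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Convert a Unix timestamp in seconds to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Expiration from the pasted cookie file
    print(to_utc(1568875520).isoformat())  # 2019-09-19T06:45:20+00:00
    # Largest signed 32-bit value, the "distant future" ivan suggested
    print(to_utc(2147483647).isoformat())  # 2038-01-19T03:14:07+00:00
```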
07:22
🔗
|
ivan |
Forum Software, man |
07:23
🔗
|
ivan |
does the WARC player fail to find the page when you click on a ?sid= link? |
07:25
🔗
|
ivan |
and which one are you using? |
07:28
🔗
|
w0rmhole |
no it finds it |
07:28
🔗
|
w0rmhole |
using the same player mentioned on github |
08:08
🔗
|
ivan |
ok, that sounds like a decent outcome despite the sid= crap |
08:22
🔗
|
w0rmhole |
ivan: one other thing, does grab-site support delays like: 250ms-350ms instead of just one number? |
08:24
🔗
|
ivan |
w0rmhole: yeah, just write 250-350 to the file |
08:24
🔗
|
ivan |
or give that to --delay= |
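To illustrate what a value like `250-350` means, here is a minimal sketch of reading such a spec and picking a per-request delay. The function names are hypothetical and the uniform random choice is an assumption; this is not grab-site's own code.

```python
import random


def parse_delay(spec):
    """Parse "250" or "250-350" into a (low_ms, high_ms) tuple."""
    low, sep, high = spec.strip().partition("-")
    if sep:
        return int(low), int(high)
    return int(low), int(low)  # single number: fixed delay


def pick_delay_ms(spec, rng=random):
    """Pick a delay in milliseconds within the range, endpoints included.

    Assumption: the delay is drawn uniformly between the two numbers.
    """
    low, high = parse_delay(spec)
    return rng.randint(low, high)


if __name__ == "__main__":
    print(parse_delay("250-350"))
    print(pick_delay_ms("250-350"))
```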
08:25
🔗
|
w0rmhole |
thanks! :) |
08:25
🔗
|
w0rmhole |
i really like grab-site, good work! |
08:27
🔗
|
ivan |
it's mostly chfoo's work in wpull but thanks |
08:34
🔗
|
w0rmhole |
both of you, i give my thanks to |
08:34
🔗
|
|
C4K3 has quit IRC (leaving) |
09:13
🔗
|
|
faolingfa has quit IRC (Leaving) |
10:22
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
10:42
🔗
|
|
kiska has quit IRC (Read error: Connection reset by peer) |
10:44
🔗
|
|
w0rmhole has quit IRC (Ping timeout: 252 seconds) |
10:44
🔗
|
|
Flashfire has quit IRC (Ping timeout: 252 seconds) |
10:52
🔗
|
|
kiska has joined #archiveteam-ot |
10:52
🔗
|
|
kiska has quit IRC (se.hub irc.underworld.no) |
12:37
🔗
|
JAA |
Underworld pls |
13:10
🔗
|
|
sknebel has quit IRC (Ping timeout: 268 seconds) |
13:11
🔗
|
|
kiska has joined #archiveteam-ot |
13:33
🔗
|
|
faolingfa has joined #archiveteam-ot |
13:34
🔗
|
|
wp494 has quit IRC (Read error: Operation timed out) |
13:35
🔗
|
|
sknebel has joined #archiveteam-ot |
13:35
🔗
|
|
wp494 has joined #archiveteam-ot |
15:12
🔗
|
|
w0rmhole has joined #archiveteam-ot |
15:19
🔗
|
w0rmhole |
ivan: is there a way to set the number of retries if grab-site fails to get something the first few times? |
16:50
🔗
|
|
schbirid has joined #archiveteam-ot |
17:30
🔗
|
ivan |
w0rmhole: --wpull-args=--tries=N |
18:47
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
18:48
🔗
|
|
schbirid has joined #archiveteam-ot |
18:59
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
21:49
🔗
|
|
Flashfire has joined #archiveteam-ot |
23:08
🔗
|
|
BlueMax has joined #archiveteam-ot |
23:10
🔗
|
|
Jens has quit IRC (Remote host closed the connection) |
23:11
🔗
|
|
Jens has joined #archiveteam-ot |