| Time |
Nickname |
Message |
|
00:10
🔗
|
|
BlueMax has joined #archiveteam-ot |
|
01:12
🔗
|
|
kiska has joined #archiveteam-ot |
|
01:14
🔗
|
|
Polylith has quit IRC (Read error: Operation timed out) |
|
01:15
🔗
|
|
Polylith has joined #archiveteam-ot |
|
01:29
🔗
|
|
ColdIce has quit IRC (Read error: Operation timed out) |
|
01:35
🔗
|
|
ColdIce has joined #archiveteam-ot |
|
01:38
🔗
|
|
ColdIce has quit IRC (Read error: Connection reset by peer) |
|
01:39
🔗
|
|
w0rmhole has joined #archiveteam-ot |
|
02:12
🔗
|
|
adinbied has quit IRC (Quit: Left Channel.) |
|
02:25
🔗
|
|
adinbied has joined #archiveteam-ot |
|
02:30
🔗
|
|
adinbied has quit IRC (Quit: Left Channel.) |
|
02:44
🔗
|
|
adinbied has joined #archiveteam-ot |
|
03:42
🔗
|
|
Odd0002 has quit IRC (Quit: ZNC - http://znc.in) |
|
03:44
🔗
|
|
ivan has quit IRC (Read error: Operation timed out) |
|
03:45
🔗
|
|
JAA has quit IRC (Read error: Operation timed out) |
|
03:45
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
|
03:45
🔗
|
|
ivan has joined #archiveteam-ot |
|
03:46
🔗
|
|
svchfoo1 sets mode: +o ivan |
|
03:48
🔗
|
|
wp494 has quit IRC (Ping timeout: 492 seconds) |
|
03:51
🔗
|
|
wp494 has joined #archiveteam-ot |
|
04:02
🔗
|
|
Odd0002 has joined #archiveteam-ot |
|
04:08
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 268 seconds) |
|
04:09
🔗
|
|
Mateon1 has joined #archiveteam-ot |
|
04:45
🔗
|
|
JAA has joined #archiveteam-ot |
|
04:45
🔗
|
|
svchfoo3 sets mode: +o JAA |
|
04:46
🔗
|
|
bakJAA sets mode: +o JAA |
|
04:50
🔗
|
|
jspiros has joined #archiveteam-ot |
|
04:57
🔗
|
|
dxrt- is now known as dxrt |
|
04:58
🔗
|
|
dxrt_ sets mode: +o dxrt |
|
06:15
🔗
|
w0rmhole |
ivan: you know the ins and outs of grab-site, right? |
|
06:15
🔗
|
Flashfire |
he wrote it |
|
06:15
🔗
|
Flashfire |
...... |
|
06:16
🔗
|
w0rmhole |
oh ok, i was going to ask him a question about it |
|
06:25
🔗
|
ivan |
w0rmhole: I'm here |
|
06:28
🔗
|
w0rmhole |
ivan: okay, so im using grab-site and i adjusted the delay while a crawl was running from 0ms to 250ms by editing the delay file. |
|
06:28
🔗
|
w0rmhole |
doing that froze up grab-site. it's not moving at all. i dont really want to break it. |
|
06:28
🔗
|
w0rmhole |
even setting the delay back to 0 didn't make a difference |
|
06:28
🔗
|
w0rmhole |
for the record, this is the command i ran: $ grab-site https://www.exxoshost.co.uk/forum?archiveteam --igsets forums |
|
06:28
🔗
|
ivan |
w0rmhole: you can look at the terminal to see which URLs it's currently grabbing, or use gs-dump-urls with in_progress |
|
06:29
🔗
|
ivan |
changing a delay to 250ms doesn't freeze crawls, probably a coincidence |
|
06:29
🔗
|
w0rmhole |
oh of course, right when you typed that it started working |
|
06:30
🔗
|
w0rmhole |
i think so |
|
06:31
🔗
|
w0rmhole |
said something about dns resolution errors when it continued , but i think that might just be an issue with the site and not grab-site |
|
06:33
🔗
|
w0rmhole |
one other question i have if you don't mind |
|
06:34
🔗
|
ivan |
I'm still here |
|
06:34
🔗
|
w0rmhole |
that forum keeps putting in that stupid phpsessid garbage in the url |
|
06:34
🔗
|
w0rmhole |
is there a way for grab-site to not capture those urls, and only the actual url? |
|
06:35
🔗
|
ivan |
wpull has a URLRewriter that should be handling that |
|
06:35
🔗
|
w0rmhole |
i.e. https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196 as opposed to https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196?sid=0befb2c2dc4ac8d45b88f1fe7cce2b71 |
|
06:35
🔗
|
w0rmhole |
is that enabled by default? |
|
06:35
🔗
|
w0rmhole |
in grab-site |
|
06:35
🔗
|
ivan |
re.compile("^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I), |
|
06:36
🔗
|
* |
ivan looks |
|
06:36
🔗
|
w0rmhole |
... |
|
06:36
🔗
|
w0rmhole |
i dont know what to do with that x_x |
|
06:37
🔗
|
ivan |
yes |
|
06:37
🔗
|
ivan |
libgrabsite/main.py |
|
06:37
🔗
|
ivan |
253: "--strip-session-id", |
|
06:37
🔗
|
w0rmhole |
ohh nvm i know what you mean |
|
06:38
🔗
|
ivan |
so it's enabled but I don't know the details of the implementation |
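A minimal sketch of how the pattern quoted above could be applied to a URL's query string. This is not wpull's actual implementation, which ivan says he does not know in detail; the helper name `strip_sid` and the choice to match only the query part are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# The pattern quoted in the log; here it is matched against the query string only.
SID_PATTERN = re.compile(r"^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I)


def strip_sid(url: str) -> str:
    """Return url with a 32-character sid= parameter removed from its query.

    Hypothetical helper: wpull's own rewriter may behave differently.
    """
    parts = urlsplit(url)
    match = SID_PATTERN.match(parts.query)
    if not match:
        return url
    before, after = match.group(1), match.group(2)
    if after:
        # sid= sat in front of other parameters: keep both sides.
        query = before + after
    else:
        # sid= was the last parameter: drop the separator left behind.
        query = before.rstrip("&?")
    return urlunsplit(parts._replace(query=query))


if __name__ == "__main__":
    print(strip_sid(
        "https://www.exxoshost.co.uk/forum/viewtopic.php"
        "?f=14&t=1196&sid=0befb2c2dc4ac8d45b88f1fe7cce2b71"
    ))
```

Even if the crawler normalizes URLs this way before queueing them, the links inside the captured HTML still carry whatever sid the server wrote into them.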
|
06:38
🔗
|
ivan |
is grab-site grabbing URLs with the session id? |
|
06:38
🔗
|
w0rmhole |
in one situation yes |
|
06:38
🔗
|
w0rmhole |
i'll try to find the original command i used |
|
06:39
🔗
|
w0rmhole |
btw, do i need to use that ?archiveteam thing in the url like i did up there? |
|
06:40
🔗
|
w0rmhole |
to keep the session id out |
|
06:40
🔗
|
ivan |
probably not |
|
06:40
🔗
|
w0rmhole |
$ grab-site --1 --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20" |
|
06:40
🔗
|
w0rmhole |
ok so that's the command i ran |
|
06:41
🔗
|
w0rmhole |
and while browsing the warc with webrecorder player |
|
06:41
🔗
|
w0rmhole |
in the url field, i saw the session id appear on the 2nd forum page |
|
06:41
🔗
|
w0rmhole |
aka start=10 |
|
06:46
🔗
|
ivan |
please file a bug with details because my crawl of the site hangs on something pretty quickly |
|
06:46
🔗
|
ivan |
maybe it's some weird page requisite behavior, I don't know |
|
06:47
🔗
|
w0rmhole |
okay, will do |
|
06:48
🔗
|
w0rmhole |
i had to manually specify the user agent to allow it to grab those |
|
06:48
🔗
|
w0rmhole |
with ua specified=~30s |
|
06:48
🔗
|
w0rmhole |
w/o ua specified=~4min |
|
06:48
🔗
|
w0rmhole |
iirc |
|
06:59
🔗
|
w0rmhole |
ivan: https://github.com/ludios/grab-site/issues/132 |
|
06:59
🔗
|
w0rmhole |
hope my english skills aren't too shitty |
|
07:06
🔗
|
ivan |
replied there |
|
07:08
🔗
|
ivan |
we don't fabricate responses in WARCs, that would be bad |
|
07:11
🔗
|
w0rmhole |
sorry |
|
07:12
🔗
|
w0rmhole |
im still new to grab-site so i am still adjusting to how it works |
|
07:12
🔗
|
ivan |
w0rmhole: you can start the crawl with a sid, see the README for the cookie stuff |
|
07:12
🔗
|
ivan |
does it work? don't set your hopes too high |
|
07:12
🔗
|
w0rmhole |
i did add a cookie file later on |
|
07:13
🔗
|
w0rmhole |
which didnt make much of a difference |
|
07:14
🔗
|
ivan |
try setting the cookie expiration time to the distant future |
|
07:14
🔗
|
ivan |
2147483647 |
|
07:14
🔗
|
ivan |
paste me your cookie file with the sid |
|
07:14
🔗
|
w0rmhole |
1 minute pls |
|
07:15
🔗
|
w0rmhole |
.blogspot.com TRUE / FALSE 2147483647 NCR 1 |
|
07:15
🔗
|
w0rmhole |
.exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_k |
|
07:15
🔗
|
w0rmhole |
.exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_sid b2fbc6f704098f6e4a6711a8eb508b98 |
|
07:15
🔗
|
w0rmhole |
.exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_u 1 |
|
07:15
🔗
|
w0rmhole |
.reddit.com TRUE / FALSE 2147483647 over18 1 |
|
07:15
🔗
|
w0rmhole |
store.steampowered.com FALSE / FALSE 2147483647 birthtime 0 |
|
07:15
🔗
|
w0rmhole |
store.steampowered.com FALSE / FALSE 2147483647 lastagecheckage 1-January-1970 |
|
07:15
🔗
|
w0rmhole |
store.steampowered.com FALSE / FALSE 2147483647 mature_content 1 |
|
07:15
🔗
|
w0rmhole |
oh sorry bad formatting |
|
07:15
🔗
|
w0rmhole |
i will use link |
|
07:16
🔗
|
ivan |
I think your tabs got lost yeah |
|
07:16
🔗
|
w0rmhole |
http://pasted.co/6b3de39e |
|
07:18
🔗
|
ivan |
yeah try changing the 1568875520 expiration to 2147483647 |
|
07:18
🔗
|
w0rmhole |
ok let me try |
|
07:18
🔗
|
ivan |
and maybe make sure the session is fresh enough for the server to still know about it? |
|
07:18
🔗
|
ivan |
if that doesn't work there might not be much you can do about the forum software giving you sid links |
|
07:18
🔗
|
w0rmhole |
sorry that last part confuses me |
|
07:19
🔗
|
ivan |
there's also a `secure` flag after the path set to TRUE but I assume you're grabbing https:// forum pages |
|
07:19
🔗
|
w0rmhole |
(english is my second language btw) |
|
07:19
🔗
|
w0rmhole |
yes |
|
07:19
🔗
|
ivan |
if the forum forgot about the session it might give you a ?sid= link with a new session, but I'm just guessing how it works |
|
07:19
🔗
|
w0rmhole |
so possible solution would be to get new sessid? |
|
07:20
🔗
|
ivan |
they probably expire in a reasonably short period |
|
07:20
🔗
|
ivan |
yes |
|
07:20
🔗
|
w0rmhole |
ok |
|
07:20
🔗
|
w0rmhole |
i should still specify user agent, correct? |
|
07:20
🔗
|
ivan |
I guess |
|
07:21
🔗
|
w0rmhole |
if i dont some images do not load |
|
07:21
🔗
|
ivan |
oh heh never mind 1568875520 is this date next year |
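The expiry check ivan did in his head can be sketched in a few lines. The cookie lines pasted above lost their tabs, so this assumes the usual Netscape cookies.txt layout of seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value); the function name `parse_cookie_line` is made up for this example.

```python
from datetime import datetime, timezone

# Field order of a Netscape-format cookies.txt line (tab-separated).
FIELDS = ("domain", "include_subdomains", "path", "secure",
          "expires", "name", "value")


def parse_cookie_line(line: str) -> dict:
    """Split one cookies.txt line and convert its expiry to a UTC datetime."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} tab-separated fields, got {len(values)}")
    cookie = dict(zip(FIELDS, values))
    # The expiry column is seconds since the Unix epoch.
    cookie["expires"] = datetime.fromtimestamp(int(cookie["expires"]), tz=timezone.utc)
    return cookie


if __name__ == "__main__":
    line = (".exxoshost.co.uk\tTRUE\t/\tTRUE\t1568875520\t"
            "phpbbexxos_sid\tb2fbc6f704098f6e4a6711a8eb508b98")
    cookie = parse_cookie_line(line)
    print(cookie["name"], cookie["expires"].isoformat())
    # 2147483647 is the largest signed 32-bit timestamp, in January 2038.
    print(datetime.fromtimestamp(2147483647, tz=timezone.utc).isoformat())
```

Run against the pasted sid cookie, this shows an expiry in September 2019, so the cookie was not stale and raising the value to 2147483647 would not have changed anything.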
|
07:22
🔗
|
ivan |
Forum Software, man |
|
07:23
🔗
|
ivan |
does the WARC player fail to find the page when you click on a ?sid= link? |
|
07:25
🔗
|
ivan |
and which one are you using? |
|
07:28
🔗
|
w0rmhole |
no it finds it |
|
07:28
🔗
|
w0rmhole |
using the same player mentioned on github |
|
08:08
🔗
|
ivan |
ok, that sounds like a decent outcome despite the sid= crap |
|
08:22
🔗
|
w0rmhole |
ivan: one other thing, does grab-site support delays like: 250ms-350ms instead of just one number? |
|
08:24
🔗
|
ivan |
w0rmhole: yeah, just write 250-350 to the file |
|
08:24
🔗
|
ivan |
or give that to --delay= |
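To illustrate what a spec such as 250-350 means, here is a small sketch that parses either a single number or a low-high range in milliseconds and picks a delay from it. It is not grab-site's own code; `parse_delay` and `pick_delay_ms` are hypothetical names, and drawing a uniform random integer from the range is an assumption about the behavior.

```python
import random
from typing import Tuple


def parse_delay(spec: str) -> Tuple[int, int]:
    """Parse "250" or "250-350" into a (low, high) pair of milliseconds."""
    low, sep, high = spec.strip().partition("-")
    if not sep:
        # A single number means a fixed delay.
        return int(low), int(low)
    return int(low), int(high)


def pick_delay_ms(spec: str, rng: random.Random = random.Random()) -> int:
    """Pick one delay (in ms) from the spec; assumed uniform over the range."""
    low, high = parse_delay(spec)
    if low > high:
        raise ValueError(f"invalid delay range: {spec!r}")
    return rng.randint(low, high)


if __name__ == "__main__":
    print(parse_delay("250-350"))
    print(pick_delay_ms("250-350"))
```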
|
08:25
🔗
|
w0rmhole |
thanks! :) |
|
08:25
🔗
|
w0rmhole |
i really like grab-site, good work! |
|
08:27
🔗
|
ivan |
it's mostly chfoo's work in wpull but thanks |
|
08:34
🔗
|
w0rmhole |
both of you, i give my thanks to |
|
08:34
🔗
|
|
C4K3 has quit IRC (leaving) |
|
09:13
🔗
|
|
faolingfa has quit IRC (Leaving) |
|
10:22
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
|
10:42
🔗
|
|
kiska has quit IRC (Read error: Connection reset by peer) |
|
10:44
🔗
|
|
w0rmhole has quit IRC (Ping timeout: 252 seconds) |
|
10:44
🔗
|
|
Flashfire has quit IRC (Ping timeout: 252 seconds) |
|
10:52
🔗
|
|
kiska has joined #archiveteam-ot |
|
10:52
🔗
|
|
kiska has quit IRC (se.hub irc.underworld.no) |
|
12:37
🔗
|
JAA |
Underworld pls |
|
13:10
🔗
|
|
sknebel has quit IRC (Ping timeout: 268 seconds) |
|
13:11
🔗
|
|
kiska has joined #archiveteam-ot |
|
13:33
🔗
|
|
faolingfa has joined #archiveteam-ot |
|
13:34
🔗
|
|
wp494 has quit IRC (Read error: Operation timed out) |
|
13:35
🔗
|
|
sknebel has joined #archiveteam-ot |
|
13:35
🔗
|
|
wp494 has joined #archiveteam-ot |
|
15:12
🔗
|
|
w0rmhole has joined #archiveteam-ot |
|
15:19
🔗
|
w0rmhole |
ivan: is there a way to set the number of retries if grab-site fails to get something the first few times? |
|
16:50
🔗
|
|
schbirid has joined #archiveteam-ot |
|
17:30
🔗
|
ivan |
w0rmhole: --wpull-args=--tries=N |
|
18:47
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
|
18:48
🔗
|
|
schbirid has joined #archiveteam-ot |
|
18:59
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
|
21:49
🔗
|
|
Flashfire has joined #archiveteam-ot |
|
23:08
🔗
|
|
BlueMax has joined #archiveteam-ot |
|
23:10
🔗
|
|
Jens has quit IRC (Remote host closed the connection) |
|
23:11
🔗
|
|
Jens has joined #archiveteam-ot |