#archiveteam-ot 2018-09-19,Wed


Time Nickname Message
00:10 🔗 BlueMax has joined #archiveteam-ot
01:12 🔗 kiska has joined #archiveteam-ot
01:14 🔗 Polylith has quit IRC (Read error: Operation timed out)
01:15 🔗 Polylith has joined #archiveteam-ot
01:29 🔗 ColdIce has quit IRC (Read error: Operation timed out)
01:35 🔗 ColdIce has joined #archiveteam-ot
01:38 🔗 ColdIce has quit IRC (Read error: Connection reset by peer)
01:39 🔗 w0rmhole has joined #archiveteam-ot
02:12 🔗 adinbied has quit IRC (Quit: Left Channel.)
02:25 🔗 adinbied has joined #archiveteam-ot
02:30 🔗 adinbied has quit IRC (Quit: Left Channel.)
02:44 🔗 adinbied has joined #archiveteam-ot
03:42 🔗 Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
03:44 🔗 ivan has quit IRC (Read error: Operation timed out)
03:45 🔗 JAA has quit IRC (Read error: Operation timed out)
03:45 🔗 jspiros has quit IRC (Read error: Operation timed out)
03:45 🔗 ivan has joined #archiveteam-ot
03:46 🔗 svchfoo1 sets mode: +o ivan
03:48 🔗 wp494 has quit IRC (Ping timeout: 492 seconds)
03:51 🔗 wp494 has joined #archiveteam-ot
04:02 🔗 Odd0002 has joined #archiveteam-ot
04:08 🔗 Mateon1 has quit IRC (Ping timeout: 268 seconds)
04:09 🔗 Mateon1 has joined #archiveteam-ot
04:45 🔗 JAA has joined #archiveteam-ot
04:45 🔗 svchfoo3 sets mode: +o JAA
04:46 🔗 bakJAA sets mode: +o JAA
04:50 🔗 jspiros has joined #archiveteam-ot
04:57 🔗 dxrt- is now known as dxrt
04:58 🔗 dxrt_ sets mode: +o dxrt
06:15 🔗 w0rmhole ivan: you know the ins and outs of grab-site, right?
06:15 🔗 Flashfire he wrote it
06:15 🔗 Flashfire ......
06:16 🔗 w0rmhole oh ok, i was going to ask him a question about it
06:25 🔗 ivan w0rmhole: I'm here
06:28 🔗 w0rmhole ivan: okay, so im using grab-site and i adjusted the delay while a crawl was running from 0ms to 250ms by editing the delay file.
06:28 🔗 w0rmhole doing that froze up grab-site. it's not moving at all. i dont really want to break it.
06:28 🔗 w0rmhole even setting the delay back to 0 didn't make a difference
06:28 🔗 w0rmhole for the record, this is the command i ran: $ grab-site https://www.exxoshost.co.uk/forum?archiveteam --igsets forums
06:28 🔗 ivan w0rmhole: you can look at the terminal to see which URLs it's currently grabbing, or using gs-dump-urls with in_progress
06:29 🔗 ivan changing a delay to 250ms doesn't freeze crawls, probably a coincidence
06:29 🔗 w0rmhole oh of course, right when you typed that it started working
06:30 🔗 w0rmhole i think so
06:31 🔗 w0rmhole said something about dns resolution errors when it continued , but i think that might just be an issue with the site and not grab-site
06:33 🔗 w0rmhole one other question i have if you don't mind
06:34 🔗 ivan I'm still here
06:34 🔗 w0rmhole that forum keeps putting in that stupid phpsessid garbage in the url
06:34 🔗 w0rmhole is there a way for grab-site to not capture those urls, and only the actual url?
06:35 🔗 ivan wpull has a URLRewriter that should be handling that
06:35 🔗 w0rmhole i.e. https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196 as opposed to https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196?sid=0befb2c2dc4ac8d45b88f1fe7cce2b71
06:35 🔗 w0rmhole is that enabled by default?
06:35 🔗 w0rmhole in grab-site
06:35 🔗 ivan re.compile("^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I),
06:36 🔗 * ivan looks
06:36 🔗 w0rmhole ...
06:36 🔗 w0rmhole i dont know what to do with that x_x
06:37 🔗 ivan yes
06:37 🔗 ivan libgrabsite/main.py
06:37 🔗 ivan 253: "--strip-session-id",
06:37 🔗 w0rmhole ohh nvm i know what you mean
06:38 🔗 ivan so it's enabled but I don't know the details of the implementation
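
The pattern ivan quotes at 06:35 can be tried in isolation on a query string. This is only a minimal sketch and not wpull's actual implementation, whose details ivan says he does not know: `strip_sid` is a hypothetical helper, and the way the two captured groups are joined back together is an assumption.

```python
import re

# Pattern quoted in the log at 06:35 (case-insensitive, 32-character sid).
SID_PATTERN = re.compile(r"^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I)


def strip_sid(query):
    """Hypothetical helper: drop a 32-character sid= parameter from a query string."""
    match = SID_PATTERN.match(query)
    if not match:
        return query  # no session id present, leave the query untouched
    before, after = match.group(1), match.group(2)
    if after:
        # sid was first or in the middle: glue the surrounding parts together
        return before + after
    # sid was the last parameter: remove the separator left dangling before it
    return before.rstrip("&?")


print(strip_sid("f=14&t=1196&sid=0befb2c2dc4ac8d45b88f1fe7cce2b71"))  # -> f=14&t=1196
print(strip_sid("sid=0befb2c2dc4ac8d45b88f1fe7cce2b71&f=14&t=1196"))  # -> f=14&t=1196
print(strip_sid("f=14&t=1196"))                                       # -> f=14&t=1196
```

The pattern only matches when `sid=` is followed by exactly 32 letters or digits before the next `&` or the end of the string, so other session parameter names or lengths would pass through unchanged.
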
06:38 🔗 ivan is grab-site grabbing URLs with the session id?
06:38 🔗 w0rmhole in one situation yes
06:38 🔗 w0rmhole i'll try to find the original command i used
06:39 🔗 w0rmhole btw, do i need to use that ?archiveteam thing in the url like i did up there?
06:40 🔗 w0rmhole to keep the session id out
06:40 🔗 ivan probably not
06:40 🔗 w0rmhole $ grab-site --1 --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20"
06:40 🔗 w0rmhole ok so that's the command i ran
06:41 🔗 w0rmhole and while browsing the warc with webrecorder player
06:41 🔗 w0rmhole in the url field, i saw the session id appear on the 2nd forum page
06:41 🔗 w0rmhole aka start=10
06:46 🔗 ivan please file a bug with details because my crawl of the site hangs on something pretty quickly
06:46 🔗 ivan maybe it's some weird page requisite behavior, I don't know
06:47 🔗 w0rmhole okay, will do
06:48 🔗 w0rmhole i had to manually specify the user agent to that to allow it to grab those
06:48 🔗 w0rmhole with ua specified=~30s
06:48 🔗 w0rmhole w/o ua specified=~4min
06:48 🔗 w0rmhole iirc
06:59 🔗 w0rmhole ivan: https://github.com/ludios/grab-site/issues/132
06:59 🔗 w0rmhole hope my english skills aren't too shitty
07:06 🔗 ivan replied there
07:08 🔗 ivan we don't fabricate responses in WARCs, that would be bad
07:11 🔗 w0rmhole sorry
07:12 🔗 w0rmhole im still new to grab-site so i am still adjusting to how it works
07:12 🔗 ivan w0rmhole: you can start the crawl with a sid, see the README for the cookie stuff
07:12 🔗 ivan does it work? don't set your hopes too high
07:12 🔗 w0rmhole i did add a cookie file later on
07:13 🔗 w0rmhole which didnt make much a difference
07:14 🔗 ivan try setting the cookie expiration time to the distant future
07:14 🔗 ivan 2147483647
07:14 🔗 ivan paste me your cookie file with the sid
07:14 🔗 w0rmhole 1 minute pls
07:15 🔗 w0rmhole .blogspot.com TRUE / FALSE 2147483647 NCR 1
07:15 🔗 w0rmhole .exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_k
07:15 🔗 w0rmhole .exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_sid b2fbc6f704098f6e4a6711a8eb508b98
07:15 🔗 w0rmhole .exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_u 1
07:15 🔗 w0rmhole .reddit.com TRUE / FALSE 2147483647 over18 1
07:15 🔗 w0rmhole store.steampowered.com FALSE / FALSE 2147483647 birthtime 0
07:15 🔗 w0rmhole store.steampowered.com FALSE / FALSE 2147483647 lastagecheckage 1-January-1970
07:15 🔗 w0rmhole store.steampowered.com FALSE / FALSE 2147483647 mature_content 1
07:15 🔗 w0rmhole oh sorry bad formatting
07:15 🔗 w0rmhole i will use link
07:16 🔗 ivan I think your tabs got lost yeah
07:16 🔗 w0rmhole http://pasted.co/6b3de39e
07:18 🔗 ivan yeah try changing the 1568875520 expiration to 2147483647
07:18 🔗 w0rmhole ok let me try
07:18 🔗 ivan and maybe make sure the session is fresh enough for the server to still know about it?
07:18 🔗 ivan if that doesn't work there might not be much you can do about the forum software giving you sid links
07:18 🔗 w0rmhole sorry that last part confuses me
07:19 🔗 ivan there's also a `secure` flag after the path set to TRUE but I assume you're grabbing https:// forum pages
07:19 🔗 w0rmhole (english is my second language btw)
07:19 🔗 w0rmhole yes
07:19 🔗 ivan if the forum forgot about the session it might give you a ?sid= link with a new session, but I'm just guessing how it works
07:19 🔗 w0rmhole so possible solution would be to get new sessid?
07:20 🔗 ivan they probably expire in a reasonably short period
07:20 🔗 ivan yes
07:20 🔗 w0rmhole ok
07:20 🔗 w0rmhole i should still specify user agent, correct?
07:20 🔗 ivan I guess
07:21 🔗 w0rmhole if i dont some images do not load
07:21 🔗 ivan oh heh never mind 1568875520 is this date next year
07:22 🔗 ivan Forum Software, man
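
The two expiry values discussed above can be checked by converting them from Unix seconds to dates, and the suggested edit to the cookie file can be sketched in a few lines. This is a rough illustration only: `expiry_as_utc` and `extend_expiry` are hypothetical helpers, and the code assumes the usual tab-separated cookies.txt layout (domain, subdomain flag, path, secure flag, expiry, name, value) with the tabs intact, unlike the lines pasted into the channel.

```python
from datetime import datetime, timezone

FAR_FUTURE = 2147483647  # 2**31 - 1, the value suggested at 07:14


def expiry_as_utc(epoch_seconds):
    """Render a cookie expiry (Unix seconds) as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def extend_expiry(lines, domain, expiry=FAR_FUTURE):
    """Hypothetical helper: set the expiry field of every cookie for `domain`.

    Lines that do not have exactly seven tab-separated fields (comments,
    blank lines) are passed through unchanged.
    """
    updated = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 7 and fields[0] == domain:
            fields[4] = str(expiry)
        updated.append("\t".join(fields))
    return updated


print(expiry_as_utc(1568875520))  # -> 2019-09-19T06:45:20+00:00
print(expiry_as_utc(FAR_FUTURE))  # -> 2038-01-19T03:14:07+00:00

cookie = "\t".join(
    [".exxoshost.co.uk", "TRUE", "/", "TRUE", "1568875520", "phpbbexxos_u", "1"]
)
print(extend_expiry([cookie], ".exxoshost.co.uk")[0].split("\t")[4])  # -> 2147483647
```

The first conversion agrees with ivan's remark that 1568875520 falls one year after the date of this log, so the cookies in the pasted file had not yet expired.
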
07:23 🔗 ivan does the WARC player fail to find the page when you click on a ?sid= link?
07:25 🔗 ivan and which one are you using?
07:28 🔗 w0rmhole no it finds it
07:28 🔗 w0rmhole using the same player mentioned on github
08:08 🔗 ivan ok, that sounds like a decent outcome despite the sid= crap
08:22 🔗 w0rmhole ivan: one other thing, does grab-site support delays like: 250ms-350ms instead of just one number?
08:24 🔗 ivan w0rmhole: yeah, just write 250-350 to the file
08:24 🔗 ivan or give that to --delay=
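
A delay value such as `250-350` can be read as a lower and an upper bound in milliseconds. The sketch below shows one plausible interpretation, a random wait chosen between the two bounds; that behaviour is an assumption here and grab-site's own parsing may differ. `parse_delay` and `pick_delay_ms` are hypothetical names.

```python
import random


def parse_delay(spec):
    """Parse '250' or '250-350' into a (low, high) pair of milliseconds."""
    low, _, high = spec.strip().partition("-")
    low_ms = int(low)
    # A single number means a fixed delay, so both bounds are the same.
    high_ms = int(high) if high else low_ms
    if high_ms < low_ms:
        raise ValueError(f"upper bound below lower bound in {spec!r}")
    return low_ms, high_ms


def pick_delay_ms(spec):
    """Pick a delay uniformly at random within the parsed bounds (inclusive)."""
    low_ms, high_ms = parse_delay(spec)
    return random.randint(low_ms, high_ms)


print(parse_delay("250-350"))  # -> (250, 350)
print(parse_delay("250"))      # -> (250, 250)
print(pick_delay_ms("250-350"))
```
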
08:25 🔗 w0rmhole thanks! :)
08:25 🔗 w0rmhole i really like grab-site, good work!
08:27 🔗 ivan it's mostly chfoo's work in wpull but thanks
08:34 🔗 w0rmhole both of you, i give my thanks to
08:34 🔗 C4K3 has quit IRC (leaving)
09:13 🔗 faolingfa has quit IRC (Leaving)
10:22 🔗 BlueMax has quit IRC (Quit: Leaving)
10:42 🔗 kiska has quit IRC (Read error: Connection reset by peer)
10:44 🔗 w0rmhole has quit IRC (Ping timeout: 252 seconds)
10:44 🔗 Flashfire has quit IRC (Ping timeout: 252 seconds)
10:52 🔗 kiska has joined #archiveteam-ot
10:52 🔗 kiska has quit IRC (se.hub irc.underworld.no)
12:37 🔗 JAA Underworld pls
13:10 🔗 sknebel has quit IRC (Ping timeout: 268 seconds)
13:11 🔗 kiska has joined #archiveteam-ot
13:33 🔗 faolingfa has joined #archiveteam-ot
13:34 🔗 wp494 has quit IRC (Read error: Operation timed out)
13:35 🔗 sknebel has joined #archiveteam-ot
13:35 🔗 wp494 has joined #archiveteam-ot
15:12 🔗 w0rmhole has joined #archiveteam-ot
15:19 🔗 w0rmhole ivan: is there a way to set the number of retries if grab-site fails to get something the first few times?
16:50 🔗 schbirid has joined #archiveteam-ot
17:30 🔗 ivan w0rmhole: --wpull-args=--tries=N
18:47 🔗 schbirid has quit IRC (Read error: Operation timed out)
18:48 🔗 schbirid has joined #archiveteam-ot
18:59 🔗 schbirid has quit IRC (Read error: Operation timed out)
21:49 🔗 Flashfire has joined #archiveteam-ot
23:08 🔗 BlueMax has joined #archiveteam-ot
23:10 🔗 Jens has quit IRC (Remote host closed the connection)
23:11 🔗 Jens has joined #archiveteam-ot
