[00:10] *** BlueMax has joined #archiveteam-ot
[01:12] *** kiska has joined #archiveteam-ot
[01:14] *** Polylith has quit IRC (Read error: Operation timed out)
[01:15] *** Polylith has joined #archiveteam-ot
[01:29] *** ColdIce has quit IRC (Read error: Operation timed out)
[01:35] *** ColdIce has joined #archiveteam-ot
[01:38] *** ColdIce has quit IRC (Read error: Connection reset by peer)
[01:39] *** w0rmhole has joined #archiveteam-ot
[02:12] *** adinbied has quit IRC (Quit: Left Channel.)
[02:25] *** adinbied has joined #archiveteam-ot
[02:30] *** adinbied has quit IRC (Quit: Left Channel.)
[02:44] *** adinbied has joined #archiveteam-ot
[03:42] *** Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
[03:44] *** ivan has quit IRC (Read error: Operation timed out)
[03:45] *** JAA has quit IRC (Read error: Operation timed out)
[03:45] *** jspiros has quit IRC (Read error: Operation timed out)
[03:45] *** ivan has joined #archiveteam-ot
[03:46] *** svchfoo1 sets mode: +o ivan
[03:48] *** wp494 has quit IRC (Ping timeout: 492 seconds)
[03:51] *** wp494 has joined #archiveteam-ot
[04:02] *** Odd0002 has joined #archiveteam-ot
[04:08] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[04:09] *** Mateon1 has joined #archiveteam-ot
[04:45] *** JAA has joined #archiveteam-ot
[04:45] *** svchfoo3 sets mode: +o JAA
[04:46] *** bakJAA sets mode: +o JAA
[04:50] *** jspiros has joined #archiveteam-ot
[04:57] *** dxrt- is now known as dxrt
[04:58] *** dxrt_ sets mode: +o dxrt
[06:15] ivan: you know the ins and outs of grab-site, right?
[06:15] he wrote it
[06:15] ......
[06:16] oh ok, i was going to ask him a question about it
[06:25] w0rmhole: I'm here
[06:28] ivan: okay, so im using grab-site and i adjusted the delay while a crawl was running from 0ms to 250ms by editing the delay file.
[06:28] doing that froze up grab-site. it's not moving at all. i dont really want to break it.
[06:28] even setting the delay back to 0 didn't make a difference
[06:28] for the record, this is the command i ran: $ grab-site https://www.exxoshost.co.uk/forum?archiveteam --igsets forums
[06:28] w0rmhole: you can look at the terminal to see which URLs it's currently grabbing, or using gs-dump-urls with in_progress
[06:29] changing a delay to 250ms doesn't freeze crawls, probably a coincidence
[06:29] oh of course, right when you typed that it started working
[06:30] i think so
[06:31] said something about dns resolution errors when it continued , but i think that might just be an issue with the site and not grab-site
[06:33] one other question i have if you don't mind
[06:34] I'm still here
[06:34] that forum keeps putting in that stupid phpsessid garbage in the url
[06:34] is there a way for grab-site to not capture those urls, and only the actual url?
[06:35] wpull has a URLRewriter that should be handling that
[06:35] i.e. https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196 as opposed to https://www.exxoshost.co.uk/forum/viewtopic.php?f=14&t=1196?sid=0befb2c2dc4ac8d45b88f1fe7cce2b71
[06:35] is that enabled by default?
[06:35] in grab-site
[06:35] re.compile("^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I),
[06:36] * ivan looks
[06:36] ...
[06:36] i dont know what to do with that x_x
[06:37] yes
[06:37] libgrabsite/main.py
[06:37] 253: "--strip-session-id",
[06:37] ohh nvm i know what you mean
[06:38] so it's enabled but I don't know the details of the implementation
[06:38] is grab-site grabbing URLs with the session id?
[06:38] in one situation yes
[06:38] i'll try to find the original command i used
[06:39] btw, do i need to use that ?archiveteam thing in the url like i did up there?
[06:40] to keep the session id out
[06:40] probably not
[06:40] $ grab-site --1 --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20"
[06:40] ok so that's the command i ran
[06:41] and while browsing the warc with webrecorder player
[06:41] in the url field, i saw the session id appear on the 2nd forum page
[06:41] aka start=10
[06:46] please file a bug with details because my crawl of the site hangs on something pretty quickly
[06:46] maybe it's some weird page requisite behavior, I don't know
[06:47] okay, will do
[06:48] i had to manually specify the user agent to that to allow it to grab those
[06:48] with ua specified=~30s
[06:48] w/o ua specified=~4min
[06:48] iirc
[06:59] ivan: https://github.com/ludios/grab-site/issues/132
[06:59] hope my english skills aren't too shitty
[07:06] replied there
[07:08] we don't fabricate responses in WARCs, that would be bad
[07:11] sorry
[07:12] im still new to grab-site so i am still adjusting to how it works
[07:12] w0rmhole: you can start the crawl with a sid, see the README for the cookie stuff
[07:12] does it work? don't set your hopes too high
[07:12] i did add a cookie file later on
[07:13] which didnt make much a difference
[07:14] try setting the cookie expiration time to the distant future
[07:14] 2147483647
[07:14] paste me your cookie file with the sid
[07:14] 1 minute pls
[07:15] .blogspot.com TRUE / FALSE 2147483647 NCR 1
[07:15] .exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_k
[07:15] .exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_sid b2fbc6f704098f6e4a6711a8eb508b98
[07:15] .exxoshost.co.uk TRUE / TRUE 1568875520 phpbbexxos_u 1
[07:15] .reddit.com TRUE / FALSE 2147483647 over18 1
[07:15] store.steampowered.com FALSE / FALSE 2147483647 birthtime 0
[07:15] store.steampowered.com FALSE / FALSE 2147483647 lastagecheckage 1-January-1970
[07:15] store.steampowered.com FALSE / FALSE 2147483647 mature_content 1
[07:15] oh sorry bad formatting
[07:15] i will use link
[07:16] I think your tabs got lost yeah
[07:16] http://pasted.co/6b3de39e
[07:18] yeah try changing the 1568875520 expiration to 2147483647
[07:18] ok let me try
[07:18] and maybe make sure the session is fresh enough for the server to still know about it?
[07:18] if that doesn't work there might not be much you can do about the forum software giving you sid links
[07:18] sorry that last part confuses me
[07:19] there's also a `secure` flag after the path set to TRUE but I assume you're grabbing https:// forum pages
[07:19] (english is my second language btw)
[07:19] yes
[07:19] if the forum forgot about the session it might give you a ?sid= link with a new session, but I'm just guessing how it works
[07:19] so possible solution would be to get new sessid?
[07:20] they probably expire in a reasonably short period
[07:20] yes
[07:20] ok
[07:20] i should still specify user agent, correct?
[07:20] I guess
[07:21] if i dont some images do not load
[07:21] oh heh never mind 1568875520 is this date next year
[07:22] Forum Software, man
[07:23] does the WARC player fail to find the page when you click on a ?sid= link?
[07:25] and which one are you using?
[07:28] no it finds it
[07:28] using the same player mentioned on github
[08:08] ok, that sounds like a decent outcome despite the sid= crap
[08:22] ivan: one other thing, does grab-site support delays like: 250ms-350ms instead of just one number?
[08:24] w0rmhole: yeah, just write 250-350 to the file
[08:24] or give that to --delay=
[08:25] thanks! :)
[08:25] i really like grab-site, good work!
[08:27] it's mostly chfoo's work in wpull but thanks
[08:34] both of you, i give my thanks to
[08:34] *** C4K3 has quit IRC (leaving)
[09:13] *** faolingfa has quit IRC (Leaving)
[10:22] *** BlueMax has quit IRC (Quit: Leaving)
[10:42] *** kiska has quit IRC (Read error: Connection reset by peer)
[10:44] *** w0rmhole has quit IRC (Ping timeout: 252 seconds)
[10:44] *** Flashfire has quit IRC (Ping timeout: 252 seconds)
[10:52] *** kiska has joined #archiveteam-ot
[10:52] *** kiska has quit IRC (se.hub irc.underworld.no)
[12:37] Underworld pls
[13:10] *** sknebel has quit IRC (Ping timeout: 268 seconds)
[13:11] *** kiska has joined #archiveteam-ot
[13:33] *** faolingfa has joined #archiveteam-ot
[13:34] *** wp494 has quit IRC (Read error: Operation timed out)
[13:35] *** sknebel has joined #archiveteam-ot
[13:35] *** wp494 has joined #archiveteam-ot
[15:12] *** w0rmhole has joined #archiveteam-ot
[15:19] ivan: is there a way to set the number of retries if grab-site fails to get something the first few times?
[16:50] *** schbirid has joined #archiveteam-ot
[17:30] w0rmhole: --wpull-args=--tries=N
[18:47] *** schbirid has quit IRC (Read error: Operation timed out)
[18:48] *** schbirid has joined #archiveteam-ot
[18:59] *** schbirid has quit IRC (Read error: Operation timed out)
[21:49] *** Flashfire has joined #archiveteam-ot
[23:08] *** BlueMax has joined #archiveteam-ot
[23:10] *** Jens has quit IRC (Remote host closed the connection)
[23:11] *** Jens has joined #archiveteam-ot
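
Note on the session-id pattern quoted at 06:35: the regex looks for a `sid=` parameter followed by exactly 32 alphanumeric characters and captures whatever comes before and after it. The sketch below shows what that pattern matches on the forum URL from the log. The helper name `strip_sid`, the way the two captured parts are glued back together, and the trimming of a leftover `?` or `&` are assumptions made for illustration; ivan says above that he does not know the details of the real implementation, and this is not a copy of wpull's code. The `example.com` URL is likewise made up.

```python
import re

# Pattern as quoted in the log at 06:35.
SID_PATTERN = re.compile(r"^(.*)(?:sid=[0-9a-zA-Z]{32})(?:&(.*))?$", re.I)


def strip_sid(url):
    """Hypothetical helper: drop a 32-character sid= parameter from a URL."""
    match = SID_PATTERN.match(url)
    if not match:
        return url  # no session id found, leave the URL unchanged
    before = match.group(1)        # everything up to "sid="
    after = match.group(2) or ""   # everything after "&", if present
    # Assumption: trim a dangling "?" or "&" left behind by the removal.
    return (before + after).rstrip("?&")


if __name__ == "__main__":
    # URL taken from the log (06:35).
    print(strip_sid(
        "https://www.exxoshost.co.uk/forum/viewtopic.php"
        "?f=14&t=1196?sid=0befb2c2dc4ac8d45b88f1fe7cce2b71"
    ))
    # Made-up URL with the session id in the middle of the query string.
    print(strip_sid(
        "https://example.com/viewtopic.php"
        "?sid=0befb2c2dc4ac8d45b88f1fe7cce2b71&t=5"
    ))
```

Because the pattern requires exactly 32 characters and then either `&` or the end of the string, a shorter or longer `sid` value is left untouched.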