[00:36] wait archive.org has winamp skins now? holy shittttt
[01:23] https://motherboard.vice.com/en_us/article/d3q45v/bittorrent-usage-increases-netflix-streaming-sites
[01:24] *** bithippo has joined #archiveteam-ot
[01:56] *** bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
[02:46] *** Despatche has quit IRC (Ping timeout: 633 seconds)
[02:55] *** Despatche has joined #archiveteam-ot
[03:02] *** Despatche has quit IRC (Read error: Operation timed out)
[03:49] *** dashcloud has quit IRC (Remote host closed the connection)
[03:53] *** icedice has joined #archiveteam-ot
[04:57] *** Despatche has joined #archiveteam-ot
[04:58] *** Sanqui has quit IRC (Read error: Operation timed out)
[05:00] *** Sanqui has joined #archiveteam-ot
[05:01] *** Despatche has quit IRC (Remote host closed the connection)
[05:02] *** Despatche has joined #archiveteam-ot
[05:33] sorry ivan i lost it in the tsunami of messages here, what did i need to use to stop `grab-site' from falling into link loops like this?: http://atarimusic.exxoshost.co.uk/forum/topic?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24 ................
[05:33] original command i ran: `grab-site --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://atarimusic.exxoshost.co.uk/"'
[05:34] *** Despatche has quit IRC (Remote host closed the connection)
[05:35] *** Despatche has joined #archiveteam-ot
[05:43] w0rmhole: try an ignore that includes \?.+\? because well-formed URLs have at most one '?'
[05:43] or even \?.*\?
[05:44] sorry im not sure how to do that
[05:44] do i literally type:
[05:44] \?.*\?
[05:44] yep
[05:44] in ignores
[05:44] into the ignores file
[05:44] on a new line, hit save
[05:44] ok
[05:44] would you recommend one over the other?
[05:45] \?.*\?
[05:45] okay, thank you for the help
[05:45] .* matches zero or more characters
[05:45] .+ matches one or more characters
[05:45] lets hope my comp. doesnt shit itself again
[05:46] i walked away for a day and came back to it choking on urls like that
[05:46] \? is a literal '?' because an unescaped '?' means "make the last thing optional to match"
[05:46] and kept expanding
[05:46] okay i think i get it
[05:47] i thought your original suggestion included brackets
[05:47] [
[05:47] ]
[05:47]
[05:47] w0rmhole: \?topic.+\?topic ignore?
[05:47]
[05:47] and/or [\?&]p=.+[\?&]p= ignore
[05:48] yeah yeah that
[05:48] maybe you were seeing a different loop then
[05:48] possibly
[05:48] let me checkl
[05:48] my terminal windows
[05:48] oh yep
[05:48] http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?
[05:48] .......
[05:49] multiple ? in a URL will almost always be bad but I don't know if I want to make it a default ignore
[05:50] so would i apply both:
[05:50] \?.*\?
[05:50] and
[05:50] \?topic.+\?topic ignore?
[05:50] in the ignores file? sorry i am still learning
[05:50] \?.*\? will ignore anything that \?topic.+\?topic would ignore
[05:51] ok so just using \?.*\? i'll be set for both?
[05:51] sure
[05:51] thanks
[05:52] hopefully that will work
[05:53] so with the whole phpsessid shit (since it's in the url) i dont need to worry about that?
[05:54] I don't know what you mean
[05:55] sorry, bad english, if i visit https://www.exxoshost.co.uk/forum/ and click on one of the links i get https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241&sid=baa4aee6927b67f3d18a63e8f3e7edf8 note the sid=xxxxxxxxxxx...... in the url
[05:56] grab-site will prevent this automatically?
[05:56] reload the page and you'll see it no longer has the ?sid= once the cookie is set
[05:56] you can start the crawl with a cookie or pick some irrelevant page as the first url
[05:57] grab-site takes multiple URLs if needed
[05:57] so something like `grab-site https://www.exxoshost.co.uk/forum?archiveteam --igset forums' ?
[05:57] sure
[05:58] or do i need to do it like this: `grab-site https://www.exxoshost.co.uk/forum/?archiveteam --igset forums' or is no different?
[05:58] sorry for asking
[05:59] you might need to quote the URL argument because shells expand ? to a matching character in a filename
[06:00] ok good idea
[06:13] *** Despatche has quit IRC (Quit: Error: Connection reset by peer)
[06:15] ivan: is `--wpull-args=--tries=1000' a bit extreme?
[06:16] yes
[06:16] 15 more reasonable?
[06:16] even 5 should be plenty
[06:17] just the website im trying keeps erroring out but after 700 or so attempts (time out) it finally gets what it was trying to get
[06:17] was the site down for a while?
[06:18] dont believe so
[06:18] wasn't for that site i said earlier, but i had it happen to another site and, well, that one too, but not THAT many retries
[06:18] if something takes 700 tries to succeed that's not something grab-site is designed to archive
[06:19] you can do whatever you feel you need to do but trying 1000 times might get you banned from some sites
[06:19] a .zip file
[06:19] good point
[06:19] 3.2mb zip file
[06:43] ivan: this look like an acceptable cookies.txt for this to you? https://share.dmca.gripe/dQYlzLvcwDF8eWUP.txt
[06:43] generated it by visiting https://www.exxoshost.co.uk/forum then refreshing; and http://atarimusic.exxoshost.co.uk/ and refreshing.
[06:45] you probably need to remove the #HttpOnly_
[06:45] the entire entry or just that
[06:46] just that string
[06:46] start with the . on the domain
[06:46] you can also try changing that 0 in the first cookie to 1570084624
[06:46] whats that do?
[06:46] idk what TRUE and FALSE mean either in this
[06:47] https://unix.stackexchange.com/questions/36531/format-of-cookies-when-using-wget
[06:47] so i do not touch TRUE/FALSE
[06:48] yeah those should be fine
[06:50] okay but i am still confused as to why i replace 0 with 157......
[06:50] i am sorry
[06:50] why is 0 bad
[06:50] it might expire when the crawl starts
[06:50] I haven't tested it
[06:50] oh ok
[06:51] and the number you suggested?
[06:51] is that a yr from now?
[06:53] yes
[06:54] ivan: look good? https://share.dmca.gripe/CJ1Jm0dcCyCgmnAw.txt
[06:54] yes
[06:55] thank you very much for the helkp
[06:55] *help
[06:55] one last thing, does grab-site auto copy cookies.txt to folder where warc is stored?
[06:57] it tells wpull to --load-cookies from a file and --save-cookies to DIR/cookies.txt
[06:58] a lot of stuff may happen inside wpull
[06:58] thanks
[06:59] https://github.com/ludios/grab-site/blob/29b9825dc5f49c25f01d93746cfb0638c724c22a/libgrabsite/main.py#L240-L259
[07:00] and up to line 216 above
[07:00] ty
[07:15] *** djsundog has quit IRC (Read error: Operation timed out)
[07:15] *** mal_ has quit IRC (Read error: Operation timed out)
[07:16] *** ivan has quit IRC (Read error: Operation timed out)
[07:17] *** ivan has joined #archiveteam-ot
[07:17] *** svchfoo1 sets mode: +o ivan
[07:17] *** Albardin has quit IRC (Read error: Operation timed out)
[07:18] *** vectr0n_ has joined #archiveteam-ot
[07:19] *** godane has quit IRC (Read error: Operation timed out)
[07:21] *** dxrt_ has quit IRC (Read error: Operation timed out)
[07:22] *** Mateon1 has quit IRC (Read error: Operation timed out)
[07:22] *** Mateon1 has joined #archiveteam-ot
[07:23] *** Albardin has joined #archiveteam-ot
[07:24] *** kiska1 has quit IRC (Read error: Connection reset by peer)
[07:24] *** kiska1 has joined #archiveteam-ot
[07:25] *** vectr0n has quit IRC (Ping timeout: 600 seconds)
[07:25] *** vectr0n_ is now known as vectr0n
[07:26] *** godane has joined #archiveteam-ot
[07:27] *** svchfoo1 sets mode: +o godane
[07:27] *** dxrt_ has joined #archiveteam-ot
[07:27] *** dxrt sets mode: +o dxrt_
[07:29] *** mal_ has joined #archiveteam-ot
[07:33] *** djsundog has joined #archiveteam-ot
[07:39] ivan: if i want to use both --tries and --load-cookies in grab-site do i do: `--wpull-args=--tries=5 --wpull-args=--load-cookies=/PATH\ TO/cookies.txt'
[07:39] ?
[07:41] nvm the github page tells me to wrap in quotes
[07:44] one question, does running something like `grab-site https://www.reddit.com/r/oculus/ --igsets=reddit' grab all of that subreddit or only as much as reddits api allows?
[07:46] Is everyone else having trouble with instagram
[07:51] *** m007a83_ has joined #archiveteam-ot
[07:54] *** m007a83 has quit IRC (Read error: Operation timed out)
[07:56] *** icedice has quit IRC (Ping timeout: 252 seconds)
[08:01] *** wp494 has quit IRC (west.us.hub irc.Prison.NET)
[08:07] *** wp494 has joined #archiveteam-ot
[08:48] does anybody here know who danooct1 is
[08:49] the malware guy?
[08:49] yeah
[08:49] I know of him why?
[08:50] i found a guilty-pleasure song he co-produced
[08:50] https://www.youtube.com/watch?v=UJRt41HNLJw
[08:50] Tubeup it
[08:50] its so bad it'
[08:50] is godlike
[08:50] just did
[08:53] I have programs designed for windows NT
[08:53] need link nao pls
[08:53] lol
[08:54] Looks like I will need to rip it later
[08:58] w0rmhole: reddit is terrible and doesn't link to everything
[08:58] there's an issue on snscrape for reddit support
[08:58] oh that sucks -_-
[09:02] Flashfire: https://puu.sh/BEWV1/0e44e80636.mp3
[09:49] *** VerifiedJ has joined #archiveteam-ot
[10:20] *** VerifiedJ has quit IRC (Quit: Leaving)
[12:18] *** arkiver has quit IRC (Read error: Operation timed out)
[12:18] *** kiska1 has quit IRC (Read error: Operation timed out)
[12:19] *** mal_ has quit IRC (Write error: Broken pipe)
[12:19] *** djsundog has quit IRC (Read error: Operation timed out)
[12:19] *** dxrt_ has quit IRC (Read error: Operation timed out)
[12:19] *** Albardin has quit IRC (Write error: Broken pipe)
[12:20] *** wp494 has quit IRC (Ping timeout: 255 seconds)
[12:20] *** wp494 has joined #archiveteam-ot
[12:21] *** arkiver has joined #archiveteam-ot
[12:31] *** Albardin has joined #archiveteam-ot
[12:32] *** kiska1 has joined #archiveteam-ot
[12:36] *** dxrt_ has joined #archiveteam-ot
[12:36] *** dxrt sets mode: +o dxrt_
[12:40] *** mal_ has joined #archiveteam-ot
[12:41] *** djsundog has joined #archiveteam-ot
[14:03] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[14:26] *** odemg has joined #archiveteam-ot
[14:57] *** bithippo has joined #archiveteam-ot
[15:13] *** bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
[15:36] *** VerifiedJ has joined #archiveteam-ot
[16:01] *** odemg has quit IRC (Ping timeout: 260 seconds)
[16:07] *** schbirid has joined #archiveteam-ot
[16:13] *** odemg has joined #archiveteam-ot
[16:14] twitter search is down right now in case anyone is running a bunch of snscrape and wonders why no results
[16:20] *** astrid has joined #archiveteam-ot
[18:07] *** odemg has quit IRC (Ping timeout: 260 seconds)
[18:18] *** odemg has joined #archiveteam-ot
[20:08] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[20:08] *** Mateon1 has joined #archiveteam-ot
[20:12] *** VerifiedJ has quit IRC (Quit: Leaving)
[21:01] *** schbirid has quit IRC (Read error: Connection reset by peer)
[21:42] *** odemg has quit IRC (Ping timeout: 260 seconds)
[21:54] *** odemg has joined #archiveteam-ot
[22:04] *** Jens has quit IRC (Remote host closed the connection)
[22:05] *** Jens has joined #archiveteam-ot
[22:39] *** Stiletto has joined #archiveteam-ot
[22:40] *** Stilett0 has quit IRC (Ping timeout: 252 seconds)
[23:06] *** m007a83_ is now known as m007a83
[23:19] *** godane has quit IRC (Read error: Operation timed out)
[23:23] *** dashcloud has joined #archiveteam-ot
[23:36] *** BlueMax has joined #archiveteam-ot
[23:46] *** odemg has quit IRC (Ping timeout: 260 seconds)
[23:48] *** Stilett0 has joined #archiveteam-ot
[23:48] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:57] *** odemg has joined #archiveteam-ot
[23:59] *** Stiletto has joined #archiveteam-ot
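
Note on the ignore patterns from the 05:43 to 05:51 exchange: a rough way to check them outside grab-site is to run them against the looping URLs with Python's `re` module. This is only a sketch. It assumes an ignore pattern may match anywhere in the URL (hence `re.search`), and the names `BROAD`, `NARROW` and `is_ignored` are made up for illustration and are not part of grab-site.

```python
import re

# Patterns quoted in the log; each would be one line in the ignores file.
BROAD = re.compile(r"\?.*\?")             # any URL with two or more '?'
NARROW = re.compile(r"\?topic.+\?topic")  # only the repeated "?topic" loop


def is_ignored(url, pattern=BROAD):
    # Assumption: a pattern counts as a hit if it matches anywhere in the URL.
    return pattern.search(url) is not None


# Shortened version of the looping URL pasted at 05:48.
looping = "http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108"
# A well-formed URL with a single '?'.
normal = "https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241"

print(is_ignored(looping))          # True
print(is_ignored(looping, NARROW))  # True
print(is_ignored(normal))           # False
```

Any URL that the narrower pattern matches contains at least two '?' characters, which is why the broader pattern alone covers both loops.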