#archiveteam-ot 2018-10-03,Wed


Time Nickname Message
00:36 🔗 Despatche wait archive.org has winamp skins now? holy shittttt
01:23 🔗 Flashfire https://motherboard.vice.com/en_us/article/d3q45v/bittorrent-usage-increases-netflix-streaming-sites
01:24 🔗 bithippo has joined #archiveteam-ot
01:56 🔗 bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
02:46 🔗 Despatche has quit IRC (Ping timeout: 633 seconds)
02:55 🔗 Despatche has joined #archiveteam-ot
03:02 🔗 Despatche has quit IRC (Read error: Operation timed out)
03:49 🔗 dashcloud has quit IRC (Remote host closed the connection)
03:53 🔗 icedice has joined #archiveteam-ot
04:57 🔗 Despatche has joined #archiveteam-ot
04:58 🔗 Sanqui has quit IRC (Read error: Operation timed out)
05:00 🔗 Sanqui has joined #archiveteam-ot
05:01 🔗 Despatche has quit IRC (Remote host closed the connection)
05:02 🔗 Despatche has joined #archiveteam-ot
05:33 🔗 w0rmhole sorry ivan i lost it in the tsunami of messages here, what did i need to use to stop `grab-site' from falling into link loops like this?: http://atarimusic.exxoshost.co.uk/forum/topic?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24 ................
05:33 🔗 w0rmhole original command i ran: `grab-site --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://atarimusic.exxoshost.co.uk/"'
05:34 🔗 Despatche has quit IRC (Remote host closed the connection)
05:35 🔗 Despatche has joined #archiveteam-ot
05:43 🔗 ivan w0rmhole: try an ignore that includes \?.+\? because well-formed URLs have at most one '?'
05:43 🔗 ivan or even \?.*\?
05:44 🔗 w0rmhole sorry im not sure how to do that
05:44 🔗 w0rmhole do i literally type:
05:44 🔗 w0rmhole \?.*\?
05:44 🔗 ivan yep
05:44 🔗 w0rmhole in ignores
05:44 🔗 ivan into the ignores file
05:44 🔗 ivan on a new line, hit save
05:44 🔗 w0rmhole ok
05:44 🔗 w0rmhole would you recommend one over the other?
05:45 🔗 ivan \?.*\?
05:45 🔗 w0rmhole okay, thank you for the help
05:45 🔗 ivan .* matches zero or more characters
05:45 🔗 ivan .+ matches one or more characters
05:45 🔗 w0rmhole lets hope my comp. doesnt shit itself again
05:46 🔗 w0rmhole i walked away for a day and came back to it choking on urls like that
05:46 🔗 ivan \? is a literal '?' because an unescaped '?' means "make the last thing optional to match"
05:46 🔗 w0rmhole and kept expanding
05:46 🔗 w0rmhole okay i think i get it
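A minimal Python sketch of the two patterns ivan compares. It assumes an ignore fires when the regex matches anywhere in the URL (an assumption about how grab-site applies ignores, not something confirmed in this log). The `example.invalid` URL is hypothetical and only illustrates where `.*` and `.+` differ.

```python
import re

# The two candidate ignore patterns from the conversation.
STAR = re.compile(r"\?.*\?")  # literal '?', zero or more characters, literal '?'
PLUS = re.compile(r"\?.+\?")  # literal '?', one or more characters, literal '?'


def is_ignored(pattern, url):
    # Assumption: the pattern is searched for anywhere in the URL (unanchored).
    return pattern.search(url) is not None


looping = "http://atarimusic.exxoshost.co.uk/forum/topic?f=24&t=202&start=0?topic&f=24"
normal = "http://atarimusic.exxoshost.co.uk/forum/topic?f=24&t=202&start=0"
adjacent = "http://example.invalid/page??x=1"  # hypothetical: two '?' in a row

for url in (looping, normal, adjacent):
    print(is_ignored(STAR, url), is_ignored(PLUS, url), url)
```

Both patterns catch the looping URL and leave the normal one alone. Only `\?.*\?` catches two adjacent question marks, which fits ivan's recommendation of the `.*` form.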
05:47 🔗 w0rmhole i thought your original suggestion included brackets
05:47 🔗 w0rmhole [
05:47 🔗 w0rmhole ]
05:47 🔗 ivan <ivan> w0rmhole: \?topic.+\?topic ignore?
05:47 🔗 ivan <ivan> and/or [\?&]p=.+[\?&]p= ignore
05:48 🔗 w0rmhole yeah yeah that
05:48 🔗 ivan maybe you were seeing a different loop then
05:48 🔗 w0rmhole possibly
05:48 🔗 w0rmhole let me check
05:48 🔗 w0rmhole my terminal windows
05:48 🔗 w0rmhole oh yep
05:48 🔗 w0rmhole http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?
05:48 🔗 w0rmhole .......
05:49 🔗 ivan multiple ? in a URL will almost always be bad but I don't know if I want to make it a default ignore
05:50 🔗 w0rmhole so would i apply both:
05:50 🔗 w0rmhole \?.*\?
05:50 🔗 w0rmhole and
05:50 🔗 w0rmhole \?topic.+\?topic ignore?
05:50 🔗 w0rmhole in the ignores file? sorry i am still learning
05:50 🔗 ivan \?.*\? will ignore anything that \?topic.+\?topic would ignore
05:51 🔗 w0rmhole ok so just using \?.*\? i'll be set for both?
05:51 🔗 ivan sure
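A rough check of the claim that `\?.*\?` covers whatever `\?topic.+\?topic` would ignore, run against the two looping URLs quoted earlier (shortened here). As above, it assumes patterns are applied as an unanchored search. The last URL is hypothetical and shows that the separate `[\?&]p=.+[\?&]p=` pattern can match a URL with only one `?`, so the broad pattern does not necessarily replace that one.

```python
import re

BROAD = re.compile(r"\?.*\?")
TOPIC = re.compile(r"\?topic.+\?topic")
P_PARAM = re.compile(r"[\?&]p=.+[\?&]p=")


def hit(pattern, url):
    # Assumption: unanchored search over the whole URL.
    return pattern.search(url) is not None


topic_loop = ("http://atarimusic.exxoshost.co.uk/forum/topic"
              "?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24")
p_loop = "http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108?"
single_q = "http://example.invalid/a?x=1&p=2&p=3"  # hypothetical, only one '?'

for url in (topic_loop, p_loop, single_q):
    print(hit(BROAD, url), hit(TOPIC, url), hit(P_PARAM, url), url)
```

Any URL matching `\?topic.+\?topic` contains at least two `?`, so `\?.*\?` matches it too. The reverse does not hold, which is why the broad pattern is the stricter ignore.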
05:51 🔗 w0rmhole thanks
05:52 🔗 w0rmhole hopefully that will work
05:53 🔗 w0rmhole so with the whole phpsessid shit (since it's in the url) i dont need to worry about that?
05:54 🔗 ivan I don't know what you mean
05:55 🔗 w0rmhole sorry, bad english, if i visit https://www.exxoshost.co.uk/forum/ and click on one of the links i get https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241&sid=baa4aee6927b67f3d18a63e8f3e7edf8 note the sid=xxxxxxxxxxx...... in the url
05:56 🔗 w0rmhole grab-site will prevent this automatically?
05:56 🔗 ivan reload the page and you'll see it no longer has the ?sid= once the cookie is set
05:56 🔗 ivan you can start the crawl with a cookie or pick some irrelevant page as the first url
05:57 🔗 ivan grab-site takes multiple URLs if needed
05:57 🔗 w0rmhole so something like `grab-site https://www.exxoshost.co.uk/forum?archiveteam --igset forums' ?
05:57 🔗 ivan sure
05:58 🔗 w0rmhole or do i need to do it like this: `grab-site https://www.exxoshost.co.uk/forum/?archiveteam --igset forums' or is no different?
05:58 🔗 w0rmhole sorry for asking
05:59 🔗 ivan you might need to quote the URL argument because shells expand ? to a matching character in a filename
06:00 🔗 w0rmhole ok good idea
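To illustrate the quoting advice: in shell glob syntax `?` stands for exactly one character, so an unquoted URL containing `?` is also a filename pattern. The sketch below uses Python's `fnmatch` to imitate that matching and `shlex.quote` to produce a safely quoted argument. The URL is the one from the conversation; the "filename" it is compared against is made up for the demonstration.

```python
import fnmatch
import shlex

url = "https://www.exxoshost.co.uk/forum/?archiveteam"

# Treated as a glob, the '?' matches any single character, so this
# made-up name (with 'X' where the '?' is) counts as a match.
made_up_name = "https://www.exxoshost.co.uk/forum/Xarchiveteam"
print(fnmatch.fnmatchcase(made_up_name, url))

# shlex.quote wraps the URL in single quotes, which stops the shell
# from interpreting '?' (or '&') when the command is pasted in.
quoted = shlex.quote(url)
print("grab-site", quoted, "--igsets=forums")
```

In practice a matching file rarely exists, and some shells pass the word through unchanged when nothing matches while others report an error, so quoting the URL avoids depending on that behavior.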
06:13 🔗 Despatche has quit IRC (Quit: Error: Connection reset by peer)
06:15 🔗 w0rmhole ivan: is `--wpull-args=--tries=1000' a bit extreme?
06:16 🔗 ivan yes
06:16 🔗 w0rmhole 15 more reasonable?
06:16 🔗 ivan even 5 should be plenty
06:17 🔗 w0rmhole just the website im trying keeps erroring out but after 700 or so attempts (time out) it finally gets what it was trying to get
06:17 🔗 ivan was the site down for a while?
06:18 🔗 w0rmhole dont believe so
06:18 🔗 w0rmhole wasn't for that site i said earlier, but i had it happen to another site and, well, that one too, but not THAT many retries
06:18 🔗 ivan if something takes 700 tries to succeed that's not something grab-site is designed to archive
06:19 🔗 ivan you can do whatever you feel you need to do but trying 1000 times might get you banned from some sites
06:19 🔗 w0rmhole a .zip file
06:19 🔗 w0rmhole good point
06:19 🔗 w0rmhole 3.2mb zip file
06:43 🔗 w0rmhole ivan: this look like an acceptable cookies.txt for this to you? https://share.dmca.gripe/dQYlzLvcwDF8eWUP.txt
06:43 🔗 w0rmhole generated it by visiting https://www.exxoshost.co.uk/forum then refreshing; and http://atarimusic.exxoshost.co.uk/ and refreshing.
06:45 🔗 ivan you probably need to remove the #HttpOnly_
06:45 🔗 w0rmhole the entire entry or just that
06:46 🔗 ivan just that string
06:46 🔗 ivan start with the . on the domain
06:46 🔗 ivan you can also try changing that 0 in the first cookie to 1570084624
06:46 🔗 w0rmhole whats that do?
06:46 🔗 w0rmhole idk what TRUE and FALSE mean either in this
06:47 🔗 ivan https://unix.stackexchange.com/questions/36531/format-of-cookies-when-using-wget
06:47 🔗 w0rmhole so i do not touch TRUE/FALSE
06:48 🔗 ivan yeah those should be fine
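For reference, a small sketch of the Netscape-style cookies.txt layout discussed here: each cookie is one line of seven tab-separated fields (domain, include-subdomains flag, path, secure flag, expiry as Unix time, name, value). The sample line is hypothetical, not taken from w0rmhole's file, and the helper name `parse_cookie_line` is invented for this example.

```python
HTTPONLY_PREFIX = "#HttpOnly_"
FIELDS = ("domain", "include_subdomains", "path", "secure",
          "expires", "name", "value")


def parse_cookie_line(line):
    """Return a dict for one cookies.txt line, or None for comments/blank lines."""
    line = line.rstrip("\n")
    if line.startswith(HTTPONLY_PREFIX):
        # Some exporters mark HttpOnly cookies with this prefix; stripping
        # just the prefix leaves a normal entry starting with the domain.
        line = line[len(HTTPONLY_PREFIX):]
    elif not line or line.startswith("#"):
        return None
    parts = line.split("\t")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


# Hypothetical entry: TRUE = applies to subdomains, FALSE = not HTTPS-only,
# 0 = no expiry recorded (a session cookie).
sample = "#HttpOnly_.example.invalid\tTRUE\t/\tFALSE\t0\tsession_id\tabc123"
print(parse_cookie_line(sample))
```

This only reads the format; it does not show how wpull itself parses the file.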
06:50 🔗 w0rmhole okay but i am still confused as to why i replace 0 with 157......
06:50 🔗 w0rmhole i am sorry
06:50 🔗 w0rmhole why is 0 bad
06:50 🔗 ivan it might expire when the crawl starts
06:50 🔗 ivan I haven't tested it
06:50 🔗 w0rmhole oh ok
06:51 🔗 w0rmhole and the number you suggested?
06:51 🔗 w0rmhole is that a yr from now?
06:53 🔗 w0rmhole yes
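A quick way to check what the suggested expiry value means, treating it as a Unix timestamp in seconds. The comparison date is the date of this log at midnight UTC, which is an assumption since the log's timezone is not stated.

```python
from datetime import datetime, timezone

suggested_expiry = 1570084624  # value ivan suggested in place of 0

expires = datetime.fromtimestamp(suggested_expiry, tz=timezone.utc)
print(expires.isoformat())  # 2019-10-03T06:37:04+00:00

# Assumption: compare against the log date, taken as midnight UTC.
log_date = datetime(2018, 10, 3, tzinfo=timezone.utc)
print((expires - log_date).days)  # 365
```

So the value is roughly one year after the conversation, which keeps the cookie from being treated as already expired during the crawl.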
06:54 🔗 w0rmhole ivan: look good? https://share.dmca.gripe/CJ1Jm0dcCyCgmnAw.txt
06:54 🔗 ivan yes
06:55 🔗 w0rmhole thank you very much for the helkp
06:55 🔗 w0rmhole *help
06:55 🔗 w0rmhole one last thing, does grab-site auto copy cookies.txt to folder where warc is stored?
06:57 🔗 ivan it tells wpull to --load-cookies from a file and --save-cookies to DIR/cookies.txt
06:58 🔗 ivan a lot of stuff may happen inside wpull
06:58 🔗 w0rmhole thanks
06:59 🔗 ivan https://github.com/ludios/grab-site/blob/29b9825dc5f49c25f01d93746cfb0638c724c22a/libgrabsite/main.py#L240-L259
07:00 🔗 ivan and up to line 216 above
07:00 🔗 w0rmhole ty
07:15 🔗 djsundog has quit IRC (Read error: Operation timed out)
07:15 🔗 mal_ has quit IRC (Read error: Operation timed out)
07:16 🔗 ivan has quit IRC (Read error: Operation timed out)
07:17 🔗 ivan has joined #archiveteam-ot
07:17 🔗 svchfoo1 sets mode: +o ivan
07:17 🔗 Albardin has quit IRC (Read error: Operation timed out)
07:18 🔗 vectr0n_ has joined #archiveteam-ot
07:19 🔗 godane has quit IRC (Read error: Operation timed out)
07:21 🔗 dxrt_ has quit IRC (Read error: Operation timed out)
07:22 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
07:22 🔗 Mateon1 has joined #archiveteam-ot
07:23 🔗 Albardin has joined #archiveteam-ot
07:24 🔗 kiska1 has quit IRC (Read error: Connection reset by peer)
07:24 🔗 kiska1 has joined #archiveteam-ot
07:25 🔗 vectr0n has quit IRC (Ping timeout: 600 seconds)
07:25 🔗 vectr0n_ is now known as vectr0n
07:26 🔗 godane has joined #archiveteam-ot
07:27 🔗 svchfoo1 sets mode: +o godane
07:27 🔗 dxrt_ has joined #archiveteam-ot
07:27 🔗 dxrt sets mode: +o dxrt_
07:29 🔗 mal_ has joined #archiveteam-ot
07:33 🔗 djsundog has joined #archiveteam-ot
07:39 🔗 w0rmhole ivan: if i want to use both --tries and --load-cookies in grab-site do i do: `--wpull-args=--tries=5 --wpull-args=--load-cookies=/PATH\ TO/cookies.txt'
07:39 🔗 w0rmhole ?
07:41 🔗 w0rmhole nvm the github page tells me to wrap in quotes
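A sketch of why the quoting matters when several wpull options are passed in one `--wpull-args` string. It assumes the string is split shell-style, as Python's `shlex.split` does (this is how I understand the grab-site README, so treat it as an assumption). The `/PATH TO/` placeholder is kept from the question above.

```python
import shlex

# One string holding both options; the inner quotes keep the path,
# which contains a space, together as a single argument.
wpull_args = '--tries=5 "--load-cookies=/PATH TO/cookies.txt"'
print(shlex.split(wpull_args))

# A backslash-escaped space gives the same result.
escaped = r"--tries=5 --load-cookies=/PATH\ TO/cookies.txt"
print(shlex.split(escaped))

# Without quoting or escaping, the path is split in two.
broken = "--tries=5 --load-cookies=/PATH TO/cookies.txt"
print(shlex.split(broken))
```

On the command line the whole string would itself be wrapped in quotes so the shell hands it to grab-site as one value.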
07:44 🔗 w0rmhole one question, does running something like `grab-site https://www.reddit.com/r/oculus/ --igsets=reddit' grab all of that subreddit or only as much as reddits api allows?
07:46 🔗 Flashfire Is everyone else having trouble with instagram
07:51 🔗 m007a83_ has joined #archiveteam-ot
07:54 🔗 m007a83 has quit IRC (Read error: Operation timed out)
07:56 🔗 icedice has quit IRC (Ping timeout: 252 seconds)
08:01 🔗 wp494 has quit IRC (west.us.hub irc.Prison.NET)
08:07 🔗 wp494 has joined #archiveteam-ot
08:48 🔗 w0rmhole does anybody here know who danooct1 is
08:49 🔗 Flashfire the malware guy?
08:49 🔗 w0rmhole yeah
08:49 🔗 Flashfire I know of him why?
08:50 🔗 w0rmhole i found a guilty-pleasure song he co-produced
08:50 🔗 w0rmhole https://www.youtube.com/watch?v=UJRt41HNLJw
08:50 🔗 Flashfire Tubeup it
08:50 🔗 w0rmhole its so bad it'
08:50 🔗 w0rmhole is godlike
08:50 🔗 w0rmhole just did
08:53 🔗 Flashfire I have programs designed for windows NT
08:53 🔗 w0rmhole need link nao pls
08:53 🔗 w0rmhole lol
08:54 🔗 Flashfire Looks like I will need to rip it later
08:58 🔗 ivan w0rmhole: reddit is terrible and doesn't link to everything
08:58 🔗 ivan there's an issue on snscrape for reddit support
08:58 🔗 w0rmhole oh that sucks -_-
09:02 🔗 eientei95 Flashfire: https://puu.sh/BEWV1/0e44e80636.mp3
09:49 🔗 VerifiedJ has joined #archiveteam-ot
10:20 🔗 VerifiedJ has quit IRC (Quit: Leaving)
12:18 🔗 arkiver has quit IRC (Read error: Operation timed out)
12:18 🔗 kiska1 has quit IRC (Read error: Operation timed out)
12:19 🔗 mal_ has quit IRC (Write error: Broken pipe)
12:19 🔗 djsundog has quit IRC (Read error: Operation timed out)
12:19 🔗 dxrt_ has quit IRC (Read error: Operation timed out)
12:19 🔗 Albardin has quit IRC (Write error: Broken pipe)
12:20 🔗 wp494 has quit IRC (Ping timeout: 255 seconds)
12:20 🔗 wp494 has joined #archiveteam-ot
12:21 🔗 arkiver has joined #archiveteam-ot
12:31 🔗 Albardin has joined #archiveteam-ot
12:32 🔗 kiska1 has joined #archiveteam-ot
12:36 🔗 dxrt_ has joined #archiveteam-ot
12:36 🔗 dxrt sets mode: +o dxrt_
12:40 🔗 mal_ has joined #archiveteam-ot
12:41 🔗 djsundog has joined #archiveteam-ot
14:03 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
14:26 🔗 odemg has joined #archiveteam-ot
14:57 🔗 bithippo has joined #archiveteam-ot
15:13 🔗 bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
15:36 🔗 VerifiedJ has joined #archiveteam-ot
16:01 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
16:07 🔗 schbirid has joined #archiveteam-ot
16:13 🔗 odemg has joined #archiveteam-ot
16:14 🔗 ivan twitter search is down right now in case anyone is running a bunch of snscrape and wonders why no results
16:20 🔗 astrid has joined #archiveteam-ot
18:07 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
18:18 🔗 odemg has joined #archiveteam-ot
20:08 🔗 Mateon1 has quit IRC (Ping timeout: 268 seconds)
20:08 🔗 Mateon1 has joined #archiveteam-ot
20:12 🔗 VerifiedJ has quit IRC (Quit: Leaving)
21:01 🔗 schbirid has quit IRC (Read error: Connection reset by peer)
21:42 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
21:54 🔗 odemg has joined #archiveteam-ot
22:04 🔗 Jens has quit IRC (Remote host closed the connection)
22:05 🔗 Jens has joined #archiveteam-ot
22:39 🔗 Stiletto has joined #archiveteam-ot
22:40 🔗 Stilett0 has quit IRC (Ping timeout: 252 seconds)
23:06 🔗 m007a83_ is now known as m007a83
23:19 🔗 godane has quit IRC (Read error: Operation timed out)
23:23 🔗 dashcloud has joined #archiveteam-ot
23:36 🔗 BlueMax has joined #archiveteam-ot
23:46 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
23:48 🔗 Stilett0 has joined #archiveteam-ot
23:48 🔗 Stiletto has quit IRC (Read error: Operation timed out)
23:57 🔗 odemg has joined #archiveteam-ot
23:59 🔗 Stiletto has joined #archiveteam-ot
