#archiveteam-ot 2018-10-03,Wed


Time	Nickname	Message
00:36 ^🔗	Despatche	wait archive.org has winamp skins now? holy shittttt
01:23 ^🔗	Flashfire	https://motherboard.vice.com/en_us/article/d3q45v/bittorrent-usage-increases-netflix-streaming-sites
01:24 ^🔗		bithippo has joined #archiveteam-ot
01:56 ^🔗		bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
02:46 ^🔗		Despatche has quit IRC (Ping timeout: 633 seconds)
02:55 ^🔗		Despatche has joined #archiveteam-ot
03:02 ^🔗		Despatche has quit IRC (Read error: Operation timed out)
03:49 ^🔗		dashcloud has quit IRC (Remote host closed the connection)
03:53 ^🔗		icedice has joined #archiveteam-ot
04:57 ^🔗		Despatche has joined #archiveteam-ot
04:58 ^🔗		Sanqui has quit IRC (Read error: Operation timed out)
05:00 ^🔗		Sanqui has joined #archiveteam-ot
05:01 ^🔗		Despatche has quit IRC (Remote host closed the connection)
05:02 ^🔗		Despatche has joined #archiveteam-ot
05:33 ^🔗	w0rmhole	sorry ivan i lost it in the tsunami of messages here, what did i need to use to stop `grab-site' from falling into link loops like this?: http://atarimusic.exxoshost.co.uk/forum/topic?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24 ................
05:33 ^🔗	w0rmhole	original command i ran: `grab-site --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://atarimusic.exxoshost.co.uk/"'
05:34 ^🔗		Despatche has quit IRC (Remote host closed the connection)
05:35 ^🔗		Despatche has joined #archiveteam-ot
05:43 ^🔗	ivan	w0rmhole: try an ignore that includes \?.+\? because well-formed URLs have at most one '?'
05:43 ^🔗	ivan	or even \?.*\?
05:44 ^🔗	w0rmhole	sorry im not sure how to do that
05:44 ^🔗	w0rmhole	do i literally type:
05:44 ^🔗	w0rmhole	\?.*\?
05:44 ^🔗	ivan	yep
05:44 ^🔗	w0rmhole	in ignores
05:44 ^🔗	ivan	into the ignores file
05:44 ^🔗	ivan	on a new line, hit save
05:44 ^🔗	w0rmhole	ok
05:44 ^🔗	w0rmhole	would you recommend one over the other?
05:45 ^🔗	ivan	\?.*\?
05:45 ^🔗	w0rmhole	okay, thank you for the help
05:45 ^🔗	ivan	.* matches zero or more characters
05:45 ^🔗	ivan	.+ matches one or more characters
05:45 ^🔗	w0rmhole	lets hope my comp. doesnt shit itself again
05:46 ^🔗	w0rmhole	i walked away for a day and came back to it choking on urls like that
05:46 ^🔗	ivan	\? is a literal '?' because an unescaped '?' means "make the last thing optional to match"
05:46 ^🔗	w0rmhole	and kept expanding
05:46 ^🔗	w0rmhole	okay i think i get it
05:47 ^🔗	w0rmhole	i thought your original suggestion included brackets
05:47 ^🔗	w0rmhole	[
05:47 ^🔗	w0rmhole	]
05:47 ^🔗	ivan	<ivan>
05:47 ^🔗	ivan	w0rmhole: \?topic.+\?topic ignore?
05:47 ^🔗	ivan	<ivan>
05:47 ^🔗	ivan	and/or [\?&]p=.+[\?&]p= ignore
05:48 ^🔗	w0rmhole	yeah yeah that
05:48 ^🔗	ivan	maybe you were seeing a different loop then
05:48 ^🔗	w0rmhole	possibly
05:48 ^🔗	w0rmhole	let me check
05:48 ^🔗	w0rmhole	my terminal windows
05:48 ^🔗	w0rmhole	oh yep
05:48 ^🔗	w0rmhole	http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?
05:48 ^🔗	w0rmhole	.......
05:49 ^🔗	ivan	multiple ? in a URL will almost always be bad but I don't know if I want to make it a default ignore
05:50 ^🔗	w0rmhole	so would i apply both:
05:50 ^🔗	w0rmhole	\?.*\?
05:50 ^🔗	w0rmhole	and
05:50 ^🔗	w0rmhole	\?topic.+\?topic ignore?
05:50 ^🔗	w0rmhole	in the ignores file? sorry i am still learning
05:50 ^🔗	ivan	\?.*\? will ignore anything that \?topic.+\?topic would ignore
05:51 ^🔗	w0rmhole	ok so just using \?.*\? i'll be set for both?
05:51 ^🔗	ivan	sure
05:51 ^🔗	w0rmhole	thanks
05:52 ^🔗	w0rmhole	hopefully that will work
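
The difference between the patterns ivan suggests can be checked outside grab-site. This is a minimal sketch using Python's `re` module. It assumes an ignore pattern is searched for anywhere in the URL, which is how ivan describes it ("an ignore that includes ..."). The helper name `is_ignored` and the shortened sample URLs are made up for illustration.

```python
import re

# The two ignore patterns discussed above.
TWO_QUERIES = re.compile(r"\?.*\?")          # any URL containing two literal '?'
TOPIC_LOOP = re.compile(r"\?topic.+\?topic") # narrower: only the "?topic" loop


def is_ignored(url, pattern=TWO_QUERIES):
    """Return True if the pattern occurs anywhere in the URL (hypothetical helper)."""
    return pattern.search(url) is not None


# Shortened version of the looping URL from the log, plus an ordinary forum URL.
looping = "http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108"
normal = "https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241"

for url in (looping, normal):
    print(is_ignored(url), url)

# .* also matches two adjacent '?', while .+ needs at least one character between them.
print(re.search(r"\?.*\?", "a??b") is not None)
print(re.search(r"\?.+\?", "a??b") is not None)
```

Anything the narrower `\?topic.+\?topic` pattern matches necessarily contains two `?`, so the broader `\?.*\?` covers it as well.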
05:53 ^🔗	w0rmhole	so with the whole phpsessid shit (since it's in the url) i dont need to worry about that?
05:54 ^🔗	ivan	I don't know what you mean
05:55 ^🔗	w0rmhole	sorry, bad english, if i visit https://www.exxoshost.co.uk/forum/ and click on one of the links i get https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241&sid=baa4aee6927b67f3d18a63e8f3e7edf8 note the sid=xxxxxxxxxxx...... in the url
05:56 ^🔗	w0rmhole	grab-site will prevent this automatically?
05:56 ^🔗	ivan	reload the page and you'll see it no longer has the ?sid= once the cookie is set
05:56 ^🔗	ivan	you can start the crawl with a cookie or pick some irrelevant page as the first url
05:57 ^🔗	ivan	grab-site takes multiple URLs if needed
05:57 ^🔗	w0rmhole	so something like `grab-site https://www.exxoshost.co.uk/forum?archiveteam --igset forums' ?
05:57 ^🔗	ivan	sure
05:58 ^🔗	w0rmhole	or do i need to do it like this: `grab-site https://www.exxoshost.co.uk/forum/?archiveteam --igset forums' or is no different?
05:58 ^🔗	w0rmhole	sorry for asking
05:59 ^🔗	ivan	you might need to quote the URL argument because shells expand ? to a matching character in a filename
06:00 ^🔗	w0rmhole	ok good idea
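
As a rough illustration of why quoting helps: in a shell glob, `?` stands for exactly one character, so an unquoted argument containing `?` can be rewritten if a matching filename exists (bash usually passes it through unchanged when nothing matches, zsh reports an error). The sketch below uses Python's `fnmatch` to mimic the one-character match and `shlex.quote` to produce a safely quoted argument. The filename `forumXarchiveteam` is invented for the example.

```python
import fnmatch
import shlex

url = "https://www.exxoshost.co.uk/forum/?archiveteam"

# '?' in a glob pattern matches any single character, so a file with this
# (made-up) name would match the unquoted word "forum?archiveteam".
print(fnmatch.fnmatchcase("forumXarchiveteam", "forum?archiveteam"))

# shlex.quote wraps the URL in single quotes, so the shell treats '?' literally.
quoted = shlex.quote(url)
print("grab-site", quoted, "--igsets=forums")
```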
06:13 ^🔗		Despatche has quit IRC (Quit: Error: Connection reset by peer)
06:15 ^🔗	w0rmhole	ivan: is `--wpull-args=--tries=1000' a bit extreme?
06:16 ^🔗	ivan	yes
06:16 ^🔗	w0rmhole	15 more reasonable?
06:16 ^🔗	ivan	even 5 should be plenty
06:17 ^🔗	w0rmhole	just the website im trying keeps erroring out but after 700 or so attempts (time out) it finally gets what it was trying to get
06:17 ^🔗	ivan	was the site down for a while?
06:18 ^🔗	w0rmhole	dont believe so
06:18 ^🔗	w0rmhole	wasn't for that site i said earlier, but i had it happen to another site and, well, that one too, but not THAT many retries
06:18 ^🔗	ivan	if something takes 700 tries to succeed that's not something grab-site is designed to archive
06:19 ^🔗	ivan	you can do whatever you feel you need to do but trying 1000 times might get you banned from some sites
06:19 ^🔗	w0rmhole	a .zip file
06:19 ^🔗	w0rmhole	good point
06:19 ^🔗	w0rmhole	3.2mb zip file
06:43 ^🔗	w0rmhole	ivan: this look like an acceptable cookies.txt for this to you? https://share.dmca.gripe/dQYlzLvcwDF8eWUP.txt
06:43 ^🔗	w0rmhole	generated it by visiting https://www.exxoshost.co.uk/forum then refreshing; and http://atarimusic.exxoshost.co.uk/ and refreshing.
06:45 ^🔗	ivan	you probably need to remove the #HttpOnly_
06:45 ^🔗	w0rmhole	the entire entry or just that
06:46 ^🔗	ivan	just that string
06:46 ^🔗	ivan	start with the . on the domain
06:46 ^🔗	ivan	you can also try changing that 0 in the first cookie to 1570084624
06:46 ^🔗	w0rmhole	whats that do?
06:46 ^🔗	w0rmhole	idk what TRUE and FALSE mean either in this
06:47 ^🔗	ivan	https://unix.stackexchange.com/questions/36531/format-of-cookies-when-using-wget
06:47 ^🔗	w0rmhole	so i do not touch TRUE/FALSE
06:48 ^🔗	ivan	yeah those should be fine
06:50 ^🔗	w0rmhole	okay but i am still confused as to why i replace 0 with 157......
06:50 ^🔗	w0rmhole	i am sorry
06:50 ^🔗	w0rmhole	why is 0 bad
06:50 ^🔗	ivan	it might expire when the crawl starts
06:50 ^🔗	ivan	I haven't tested it
06:50 ^🔗	w0rmhole	oh ok
06:51 ^🔗	w0rmhole	and the number you suggested?
06:51 ^🔗	w0rmhole	is that a yr from now?
06:53 ^🔗	w0rmhole	yes
06:54 ^🔗	w0rmhole	ivan: look good? https://share.dmca.gripe/CJ1Jm0dcCyCgmnAw.txt
06:54 ^🔗	ivan	yes
06:55 ^🔗	w0rmhole	thank you very much for the helkp
06:55 ^🔗	w0rmhole	*help
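
The two cookies.txt fixes ivan describes (drop the `#HttpOnly_` prefix, replace an expiry of 0 with a future timestamp) could be scripted roughly as follows. This is a sketch, not something taken from grab-site or wpull. It assumes the usual Netscape cookie file layout of seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value). The function name and the sample cookie name and value are hypothetical, and ivan notes he has not tested whether an expiry of 0 actually causes a problem.

```python
HTTPONLY_PREFIX = "#HttpOnly_"


def clean_cookie_line(line, fallback_expiry):
    """Apply both fixes from the log to one cookies.txt line (hypothetical helper)."""
    # Fix 1: remove only the "#HttpOnly_" string so the line starts with the domain.
    if line.startswith(HTTPONLY_PREFIX):
        line = line[len(HTTPONLY_PREFIX):]
    # Leave blank lines and real comments untouched.
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    # Fix 2: field 5 (index 4) is the expiry; swap a 0 for the given timestamp.
    if len(fields) == 7 and fields[4] == "0":
        fields[4] = str(fallback_expiry)
    return "\t".join(fields)


# Sample line with an invented cookie name and value.
sample = "#HttpOnly_.exxoshost.co.uk\tTRUE\t/\tFALSE\t0\texample_sid\tabc123"
print(clean_cookie_line(sample, 1570084624))
```

The TRUE/FALSE fields are passed through unchanged, matching ivan's advice to leave them alone.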
06:55 ^🔗	w0rmhole	one last thing, does grab-site auto copy cookies.txt to folder where warc is stored?
06:57 ^🔗	ivan	it tells wpull to --load-cookies from a file and --save-cookies to DIR/cookies.txt
06:58 ^🔗	ivan	a lot of stuff may happen inside wpull
06:58 ^🔗	w0rmhole	thanks
06:59 ^🔗	ivan	https://github.com/ludios/grab-site/blob/29b9825dc5f49c25f01d93746cfb0638c724c22a/libgrabsite/main.py#L240-L259
07:00 ^🔗	ivan	and up to line 216 above
07:00 ^🔗	w0rmhole	ty
07:15 ^🔗		djsundog has quit IRC (Read error: Operation timed out)
07:15 ^🔗		mal_ has quit IRC (Read error: Operation timed out)
07:16 ^🔗		ivan has quit IRC (Read error: Operation timed out)
07:17 ^🔗		ivan has joined #archiveteam-ot
07:17 ^🔗		svchfoo1 sets mode: +o ivan
07:17 ^🔗		Albardin has quit IRC (Read error: Operation timed out)
07:18 ^🔗		vectr0n_ has joined #archiveteam-ot
07:19 ^🔗		godane has quit IRC (Read error: Operation timed out)
07:21 ^🔗		dxrt_ has quit IRC (Read error: Operation timed out)
07:22 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
07:22 ^🔗		Mateon1 has joined #archiveteam-ot
07:23 ^🔗		Albardin has joined #archiveteam-ot
07:24 ^🔗		kiska1 has quit IRC (Read error: Connection reset by peer)
07:24 ^🔗		kiska1 has joined #archiveteam-ot
07:25 ^🔗		vectr0n has quit IRC (Ping timeout: 600 seconds)
07:25 ^🔗		vectr0n_ is now known as vectr0n
07:26 ^🔗		godane has joined #archiveteam-ot
07:27 ^🔗		svchfoo1 sets mode: +o godane
07:27 ^🔗		dxrt_ has joined #archiveteam-ot
07:27 ^🔗		dxrt sets mode: +o dxrt_
07:29 ^🔗		mal_ has joined #archiveteam-ot
07:33 ^🔗		djsundog has joined #archiveteam-ot
07:39 ^🔗	w0rmhole	ivan: if i want to use both --tries and --load-cookies in grab-site do i do: `--wpull-args=--tries=5 --wpull-args=--load-cookies=/PATH\ TO/cookies.txt'
07:39 ^🔗	w0rmhole	?
07:41 ^🔗	w0rmhole	nvm the github page tells me to wrap in quotes
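
One way to see what "wrap in quotes" does to a path containing a space is to split the argument string the way a POSIX shell would. The sketch below uses Python's `shlex.split`. Whether grab-site splits the `--wpull-args` string in exactly this way is an assumption here, not something confirmed in the log. The path is w0rmhole's placeholder, kept as-is.

```python
import shlex

# Both options in a single string, with the path quoted because of the space.
quoted_form = '--tries=5 "--load-cookies=/PATH TO/cookies.txt"'

# The backslash-escaped spelling from the question splits the same way.
escaped_form = r"--tries=5 --load-cookies=/PATH\ TO/cookies.txt"

print(shlex.split(quoted_form))
print(shlex.split(escaped_form))
```

Under that assumption, either spelling yields two arguments, and the space stays inside the cookie file path instead of splitting it.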
07:44 ^🔗	w0rmhole	one question, does running something like `grab-site https://www.reddit.com/r/oculus/ --igsets=reddit' grab all of that subreddit or only as much as reddits api allows?
07:46 ^🔗	Flashfire	Is everyone else having trouble with instagram
07:51 ^🔗		m007a83_ has joined #archiveteam-ot
07:54 ^🔗		m007a83 has quit IRC (Read error: Operation timed out)
07:56 ^🔗		icedice has quit IRC (Ping timeout: 252 seconds)
08:01 ^🔗		wp494 has quit IRC (west.us.hub irc.Prison.NET)
08:07 ^🔗		wp494 has joined #archiveteam-ot
08:48 ^🔗	w0rmhole	does anybody here know who danooct1 is
08:49 ^🔗	Flashfire	the malware guy?
08:49 ^🔗	w0rmhole	yeah
08:49 ^🔗	Flashfire	I know of him why?
08:50 ^🔗	w0rmhole	i found a guilty-pleasure song he co-produced
08:50 ^🔗	w0rmhole	https://www.youtube.com/watch?v=UJRt41HNLJw
08:50 ^🔗	Flashfire	Tubeup it
08:50 ^🔗	w0rmhole	its so bad it'
08:50 ^🔗	w0rmhole	is godlike
08:50 ^🔗	w0rmhole	just did
08:53 ^🔗	Flashfire	I have programs designed for windows NT
08:53 ^🔗	w0rmhole	need link nao pls
08:53 ^🔗	w0rmhole	lol
08:54 ^🔗	Flashfire	Looks like I will need to rip it later
08:58 ^🔗	ivan	w0rmhole: reddit is terrible and doesn't link to everything
08:58 ^🔗	ivan	there's an issue on snscrape for reddit support
08:58 ^🔗	w0rmhole	oh that sucks -_-
09:02 ^🔗	eientei95	Flashfire: https://puu.sh/BEWV1/0e44e80636.mp3
09:49 ^🔗		VerifiedJ has joined #archiveteam-ot
10:20 ^🔗		VerifiedJ has quit IRC (Quit: Leaving)
12:18 ^🔗		arkiver has quit IRC (Read error: Operation timed out)
12:18 ^🔗		kiska1 has quit IRC (Read error: Operation timed out)
12:19 ^🔗		mal_ has quit IRC (Write error: Broken pipe)
12:19 ^🔗		djsundog has quit IRC (Read error: Operation timed out)
12:19 ^🔗		dxrt_ has quit IRC (Read error: Operation timed out)
12:19 ^🔗		Albardin has quit IRC (Write error: Broken pipe)
12:20 ^🔗		wp494 has quit IRC (Ping timeout: 255 seconds)
12:20 ^🔗		wp494 has joined #archiveteam-ot
12:21 ^🔗		arkiver has joined #archiveteam-ot
12:31 ^🔗		Albardin has joined #archiveteam-ot
12:32 ^🔗		kiska1 has joined #archiveteam-ot
12:36 ^🔗		dxrt_ has joined #archiveteam-ot
12:36 ^🔗		dxrt sets mode: +o dxrt_
12:40 ^🔗		mal_ has joined #archiveteam-ot
12:41 ^🔗		djsundog has joined #archiveteam-ot
14:03 ^🔗		BlueMax has quit IRC (Read error: Connection reset by peer)
14:26 ^🔗		odemg has joined #archiveteam-ot
14:57 ^🔗		bithippo has joined #archiveteam-ot
15:13 ^🔗		bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
15:36 ^🔗		VerifiedJ has joined #archiveteam-ot
16:01 ^🔗		odemg has quit IRC (Ping timeout: 260 seconds)
16:07 ^🔗		schbirid has joined #archiveteam-ot
16:13 ^🔗		odemg has joined #archiveteam-ot
16:14 ^🔗	ivan	twitter search is down right now in case anyone is running a bunch of snscrape and wonders why no results
16:20 ^🔗		astrid has joined #archiveteam-ot
18:07 ^🔗		odemg has quit IRC (Ping timeout: 260 seconds)
18:18 ^🔗		odemg has joined #archiveteam-ot
20:08 ^🔗		Mateon1 has quit IRC (Ping timeout: 268 seconds)
20:08 ^🔗		Mateon1 has joined #archiveteam-ot
20:12 ^🔗		VerifiedJ has quit IRC (Quit: Leaving)
21:01 ^🔗		schbirid has quit IRC (Read error: Connection reset by peer)
21:42 ^🔗		odemg has quit IRC (Ping timeout: 260 seconds)
21:54 ^🔗		odemg has joined #archiveteam-ot
22:04 ^🔗		Jens has quit IRC (Remote host closed the connection)
22:05 ^🔗		Jens has joined #archiveteam-ot
22:39 ^🔗		Stiletto has joined #archiveteam-ot
22:40 ^🔗		Stilett0 has quit IRC (Ping timeout: 252 seconds)
23:06 ^🔗		m007a83_ is now known as m007a83
23:19 ^🔗		godane has quit IRC (Read error: Operation timed out)
23:23 ^🔗		dashcloud has joined #archiveteam-ot
23:36 ^🔗		BlueMax has joined #archiveteam-ot
23:46 ^🔗		odemg has quit IRC (Ping timeout: 260 seconds)
23:48 ^🔗		Stilett0 has joined #archiveteam-ot
23:48 ^🔗		Stiletto has quit IRC (Read error: Operation timed out)
23:57 ^🔗		odemg has joined #archiveteam-ot
23:59 ^🔗		Stiletto has joined #archiveteam-ot
