#archiveteam-bs 2017-11-03,Fri

Time	Nickname	Message
00:02 ^🔗		MadArchiv has quit IRC (Read error: Operation timed out)
00:04 ^🔗		YetAnothe has quit IRC (Ping timeout: 246 seconds)
00:15 ^🔗		Stilett0 has joined #archiveteam-bs
00:26 ^🔗		YetAnothe has joined #archiveteam-bs
00:37 ^🔗		YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
00:39 ^🔗		MadArchiv has joined #archiveteam-bs
00:52 ^🔗		Stilett0 has quit IRC ()
01:12 ^🔗		drumstick has quit IRC (Read error: Operation timed out)
01:18 ^🔗		drumstick has joined #archiveteam-bs
01:41 ^🔗		ola_norsk has joined #archiveteam-bs
01:42 ^🔗	ola_norsk	any suggestion on how to capture all "Latest" listings on a twitter hashtag?
01:43 ^🔗	ola_norsk	as in ALL..
01:44 ^🔗		MadArchiv has quit IRC (Read error: Operation timed out)
01:44 ^🔗	ola_norsk	twitter seems to be throwing a "Takes too long long to load. Please retry", when using wayback
01:45 ^🔗		kvieta has quit IRC (Read error: Connection reset by peer)
01:45 ^🔗	ola_norsk	im specifically looking to archive hashtags #istandwithcorey ..It seems to be current fancy
01:46 ^🔗	ola_norsk	!ia
01:46 ^🔗	ola_norsk	!help
01:47 ^🔗	ola_norsk	i've lost the magic :(
01:49 ^🔗	ola_norsk	!ia https://twitter.com/Corey_Feldman
01:51 ^🔗	ola_norsk	!ia https://twitter.com/hashtag/istandwithcorey
01:51 ^🔗	ola_norsk	oh well :)
01:52 ^🔗	ola_norsk	felt obliged to archive those..For reasons unknown to common men :D
01:53 ^🔗		ola_norsk has quit IRC (BEER!)
02:11 ^🔗		ld1 has quit IRC (Quit: ~)
02:14 ^🔗		ld1 has joined #archiveteam-bs
02:18 ^🔗		MadArchiv has joined #archiveteam-bs
02:25 ^🔗		tuluu has quit IRC (Read error: Operation timed out)
02:27 ^🔗		tuluu has joined #archiveteam-bs
02:28 ^🔗		Pixi has joined #archiveteam-bs
03:01 ^🔗		YetAnothe has joined #archiveteam-bs
03:01 ^🔗		MadArchiv has quit IRC (Read error: Connection reset by peer)
03:06 ^🔗		kvieta has joined #archiveteam-bs
03:58 ^🔗		YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
04:10 ^🔗	ranma	damn, pity we can't go back in time
04:11 ^🔗	ranma	https://techcrunch.com/2017/11/02/r-i-p-new-york-media/
04:11 ^🔗	ranma	archives are gone supposedly
04:14 ^🔗		qw3rty119 has joined #archiveteam-bs
04:17 ^🔗		qw3rty118 has quit IRC (Read error: Operation timed out)
04:25 ^🔗		Specular has joined #archiveteam-bs
04:28 ^🔗		Stilett0 has joined #archiveteam-bs
04:31 ^🔗	pikhq	Most of 'em were scraped and are in Wayback already.
04:32 ^🔗	pikhq	Would be nice if we could have a stronger guarantee than "we probably got it?" though.
04:35 ^🔗	SketchCow	They've mailed me, they're giving them to me.
04:35 ^🔗	SketchCow	Rest easy
06:05 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
06:48 ^🔗		BlueMaxim has joined #archiveteam-bs
06:54 ^🔗		Pixi has quit IRC (Quit: Pixi)
07:31 ^🔗		TheLovina has joined #archiveteam-bs
07:42 ^🔗		schbirid has joined #archiveteam-bs
07:50 ^🔗		jschwart has joined #archiveteam-bs
07:54 ^🔗		jschwart has quit IRC (Client Quit)
08:02 ^🔗		Specular has quit IRC (Read error: Connection reset by peer)
08:33 ^🔗		yuitimoth has quit IRC (Ping timeout: 506 seconds)
08:44 ^🔗		tuluu has quit IRC (Read error: Operation timed out)
08:44 ^🔗		tuluu has joined #archiveteam-bs
08:46 ^🔗		w0rp has quit IRC (Read error: Operation timed out)
08:48 ^🔗		w0rp has joined #archiveteam-bs
09:09 ^🔗	godane	SketchCow: i'm doing your GDC 1996 Psychic Detective tape
09:46 ^🔗		ld1 has quit IRC (Quit: ~)
10:14 ^🔗		MadArchiv has joined #archiveteam-bs
10:14 ^🔗	godane	so i'm now doing the Buffy Bits tape
10:14 ^🔗	godane	*Bits=Bites
10:15 ^🔗	godane	btw the GDC 1996 Psychic Detective tape has lot of tape that was not used
10:16 ^🔗	godane	there was no video for half of tape from what i can tell
10:16 ^🔗	godane	anyways all video on tape was captured
10:18 ^🔗		Specular has joined #archiveteam-bs
10:24 ^🔗	MadArchiv	Alright! So it's day two of the webcomic archiving, um, "project" and -- I've been wondering -- why don't we start it out by paying some attention to Hiveworks? It's a relatively small webcomic website that hosts popular, high profile webcomics due to its high barrier of entry and, as I previously mentioned, has a list (https://thehiveworks.com/completed) of completed (and/or cancelled) webcomics that we can just hit with wget, heritrix, or grab-s
10:24 ^🔗		MadArchiv has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
10:26 ^🔗		Specular has quit IRC (Quit: Leaving)
11:01 ^🔗		ld1 has joined #archiveteam-bs
11:02 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
11:07 ^🔗	godane	i'm now doing the xena warrior princess tape
11:25 ^🔗		pizzaiolo has joined #archiveteam-bs
11:47 ^🔗		drumstick has quit IRC (Read error: Operation timed out)
12:03 ^🔗		spacegirl has quit IRC (Read error: Operation timed out)
12:15 ^🔗		spacegirl has joined #archiveteam-bs
13:09 ^🔗		Stilett0 has quit IRC (Ping timeout: 248 seconds)
13:10 ^🔗		Stilett0 has joined #archiveteam-bs
13:17 ^🔗	godane	so i got a sci-fi airing of tales from the crypt
13:27 ^🔗		yuitimoth has joined #archiveteam-bs
13:42 ^🔗	godane	so the date of this recording is 2001-06-28 at 11PM
13:43 ^🔗	godane	and it fits cause Blade Runner The Director's Cut ran that Saturday
14:52 ^🔗		dashcloud has quit IRC (Remote host closed the connection)
14:52 ^🔗	godane	so i got The Police Behind The Music tape
14:53 ^🔗	godane	its a BBC America tape that was taped over
14:53 ^🔗		dashcloud has joined #archiveteam-bs
15:00 ^🔗		ScruffyB has quit IRC (Remote host closed the connection)
15:03 ^🔗		phillipsj has joined #archiveteam-bs
15:07 ^🔗		qw3rty119 has quit IRC (Read error: Operation timed out)
15:36 ^🔗		MadArchiv has joined #archiveteam-bs
15:58 ^🔗	MadArchiv	Guys, I'm planning on archiving all the comics from the completed webcomics list from Hiveworks with grab-site so we can maybe, just maybe, give that webcomic archiving project some traction to back it up. Any tips?
16:01 ^🔗	schbirid	<3
16:11 ^🔗		namespac- has joined #archiveteam-bs
16:14 ^🔗	MadArchiv	How do I use grab-site anyway?
16:20 ^🔗	DFJustin	https://github.com/ludios/grab-site
16:24 ^🔗	MadArchiv	Hmmm, alright. I'll see what I can do with it when I actually do manage to get my hands on my laptop.
16:31 ^🔗	MadArchiv	By the way, my laptop runs on Windows 10, and the page you linked me to says that the Windows 10 version of this tool is experimental, should I be worried about that?
16:44 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
16:47 ^🔗		Mateon1 has joined #archiveteam-bs
17:13 ^🔗		Stilett0 has quit IRC (Read error: Operation timed out)
17:14 ^🔗		pizzaiolo has quit IRC (pizzaiolo)
17:18 ^🔗		Pixi has joined #archiveteam-bs
17:28 ^🔗		Stilett0 has joined #archiveteam-bs
18:40 ^🔗		schbirid has quit IRC (Quit: Leaving)
18:44 ^🔗		schbirid has joined #archiveteam-bs
18:46 ^🔗		schbirid has quit IRC (Client Quit)
18:48 ^🔗		MadArchiv has quit IRC (Read error: Connection reset by peer)
18:51 ^🔗		Pixi has quit IRC (Quit: Pixi)
18:59 ^🔗		schbirid has joined #archiveteam-bs
19:13 ^🔗		icedice has joined #archiveteam-bs
19:21 ^🔗		jschwart has joined #archiveteam-bs
19:27 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
19:38 ^🔗		dashcloud has joined #archiveteam-bs
19:56 ^🔗		Mateon1 has quit IRC (Remote host closed the connection)
19:57 ^🔗		Mateon1 has joined #archiveteam-bs
20:35 ^🔗		icedice has quit IRC (Ping timeout: 260 seconds)
20:36 ^🔗		icedice has joined #archiveteam-bs
20:39 ^🔗		icedice has quit IRC (Client Quit)
21:00 ^🔗		Pixi has joined #archiveteam-bs
21:13 ^🔗		MadArchiv has joined #archiveteam-bs
21:14 ^🔗		icedice has joined #archiveteam-bs
21:32 ^🔗		MadArchiv has quit IRC (Read error: Connection reset by peer)
21:52 ^🔗	godane	so that tape has 6 hours of tv
21:52 ^🔗	godane	not just behind the music
21:53 ^🔗	godane	we also get reruns of 2 guys and a girl
21:53 ^🔗	godane	on We channel
21:56 ^🔗		user has joined #archiveteam-bs
21:57 ^🔗	godane	also i got 'The OC' episode that aired on 2003-08-12 on this tape
22:10 ^🔗		qw3rty3 has joined #archiveteam-bs
22:25 ^🔗		user is now known as Ceryn
22:25 ^🔗	godane	this tape is going to have some random stuff at the end
22:26 ^🔗	Ceryn	Do you guys handle Javascript at all when crawling websites for archival?
22:26 ^🔗	astrid	yes
22:27 ^🔗	Ceryn	How? Using what? I'm considering Selenium for Python to crawl from a headless browser. I'd like to hear alternatives though.
22:27 ^🔗	astrid	a headless browser, yeah
22:27 ^🔗	astrid	(that's for ad-hoc crawls)
22:27 ^🔗	Ceryn	Well, alright. Which library?
22:28 ^🔗	astrid	for planned things we will figure out what the javascript does and generate a list of url patterns by hand-ish that it depends on having
22:28 ^🔗	Ceryn	Oh, shit. That sounds like a lot of work.
22:29 ^🔗	astrid	it's usually pretty easy with modern dev tools
22:29 ^🔗	Ceryn	Huh. Interesting.
22:29 ^🔗	Ceryn	How come you don't use headless browsers too for planned things?
22:29 ^🔗	astrid	and (checks calendar) eight years of experience (!)
22:29 ^🔗	Ceryn	:D
22:29 ^🔗	Ceryn	I figure that helps a lot.
22:30 ^🔗	astrid	they're super resource intensive, and also they tend to fetch page prerequisite resources over again for every pageload
22:30 ^🔗	astrid	so the archive file gets massively bloated with millions of copies of static content
22:30 ^🔗	Ceryn	Hm. That won't work.
22:31 ^🔗	astrid	it's less than optimal
22:31 ^🔗	Ceryn	Haha.
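
The bloat astrid describes comes from storing the same static payload once per page load. A minimal sketch of one way to avoid that, assuming the crawler can see each response body before writing it: key bodies by digest and note repeats as references to the first copy, similar in spirit to WARC "revisit" records. The class name `PayloadIndex` and its return values are hypothetical, not taken from any Archive Team tool.

```python
import hashlib


class PayloadIndex:
    """Tracks response bodies by digest so repeats are not stored twice.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        # Maps SHA-1 hex digest -> first URL whose body had that digest.
        self._by_digest = {}

    def record(self, url, body):
        """Return ("store", digest) for new content,
        or ("revisit", first_url) if the same bytes were already seen."""
        digest = hashlib.sha1(body).hexdigest()
        first_url = self._by_digest.get(digest)
        if first_url is None:
            self._by_digest[digest] = url
            return ("store", digest)
        return ("revisit", first_url)


if __name__ == "__main__":
    index = PayloadIndex()
    css = b"body { margin: 0 }"
    # First fetch of the stylesheet: stored in full.
    print(index.record("https://forum.example/style.css", css))
    # Same bytes fetched again on another page load: only a reference.
    print(index.record("https://forum.example/style.css?v=2", css))
```

This only addresses the storage side; the browser still re-fetches the resource unless its cache is used.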
22:31 ^🔗	astrid	we also don't go for a full copy of every page the site can possibly render
22:32 ^🔗	astrid	rather, we go for every unique piece of data in the site
22:32 ^🔗	astrid	so when we get a forum we skip the "all posts by user XyX_cubefan69_XyX" pages
22:32 ^🔗	astrid	et
22:32 ^🔗	astrid	c
22:32 ^🔗	JAA	That depends on the project though. For example, for the Steam user forums, we also grabbed the individual post pages because those links still appear elsewhere on the web frequently.
22:33 ^🔗	astrid	yes
22:33 ^🔗	Ceryn	Do you make a manual list of these kinds of links too, then? The ones you don't want to follow?
22:33 ^🔗	JAA	Also, regarding the JavaScript part: PhantomJS in wpull is semi-broken and doesn't really work properly.
22:33 ^🔗	astrid	priorities: 1, all unique data. 2, preservation of inbound links. 3, browseability.
22:34 ^🔗	astrid	Ceryn: we write a thing which is "here is a userid get me all the pages related to it"
22:34 ^🔗	astrid	and then provide that with a list of userids
22:34 ^🔗	astrid	so it's a static set of pages added in, not excluded from a general crawl
22:34 ^🔗	astrid	so we can also get things that aren't necessarily linked, but are retrievable
22:34 ^🔗	Ceryn	Oh. So custom crawlers all the way.
22:34 ^🔗	astrid	yeah
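
The "here is a userid, get me all the pages related to it" idea could be sketched roughly as below: a fixed set of URL patterns, expanded for every ID in a list, giving a static URL list rather than a link-following crawl. The patterns, the host `forum.example`, and the function names are all made up for illustration; a real project would derive its patterns from the target site by hand.

```python
from urllib.parse import urljoin

# Hypothetical URL patterns for an imaginary forum. In practice these are
# worked out by inspecting the site's pages and scripts.
USER_PAGE_PATTERNS = [
    "/user/{uid}",
    "/user/{uid}/posts",
    "/user/{uid}/avatar.png",
]


def pages_for_user(base_url, uid):
    """Expand one user ID into every URL pattern known for a user."""
    return [urljoin(base_url, p.format(uid=uid)) for p in USER_PAGE_PATTERNS]


def build_url_list(base_url, user_ids):
    """Build an ordered, de-duplicated URL list for a set of user IDs."""
    seen = set()
    urls = []
    for uid in user_ids:
        for url in pages_for_user(base_url, uid):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


if __name__ == "__main__":
    for url in build_url_list("https://forum.example/", [1, 2, 3]):
        print(url)
```

Because the list is generated from IDs rather than discovered through links, it can include pages that are retrievable but not linked from anywhere.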
22:35 ^🔗	Ceryn	I was thinking it should be doable to make a general one and then just handle some trap exceptions.
22:35 ^🔗	astrid	we have tooling for that, when it's the better choice
22:35 ^🔗	astrid	(namely, archivebot)
22:36 ^🔗		Pixi has quit IRC (Ping timeout: 255 seconds)
22:37 ^🔗	Ceryn	The only thing you really want from Javascript is to get the links it tries to fetch, right?
22:37 ^🔗	yipdw	and any DOM changes
22:37 ^🔗	Ceryn	Won't you have them just by storing the actual JS though?
22:38 ^🔗		icedice2 has joined #archiveteam-bs
22:38 ^🔗	Ceryn	I mean, at archival time, JS is stored but is only really relevant for getting more links.
22:38 ^🔗	yipdw	doesn't always work. a friend and I were trying to pull up KCNA's reporting of Kim Jong-il's death via Wayback and it turns out that the Javascript did really strange things that precluded loading
22:39 ^🔗	yipdw	I'm sure the data is in there somewhere but we didn't really want to dig through 30,000-something matches
22:39 ^🔗	Ceryn	Shoot. Javascript really seems to be a pain.
22:39 ^🔗	yipdw	javascript is a goddamn scourge
22:40 ^🔗	astrid	any optimism is precluded by the terrible things that humans do
22:40 ^🔗		icedice2 has quit IRC (Client Quit)
22:40 ^🔗	yipdw	that said, Google seems to have some luck with headless Chrome
22:40 ^🔗		icedice has quit IRC (Ping timeout: 245 seconds)
22:40 ^🔗	Ceryn	Oh?
22:41 ^🔗	yipdw	no idea how they're doing it but I suppose a lot of things become easier when you have hundreds of thousands of computers
22:41 ^🔗	JAA	There's also headless Firefox now.
22:41 ^🔗	JAA	So it might be time to play around with these again.
22:41 ^🔗	JAA	s/ again//
22:41 ^🔗	yipdw	again is right, we looked into it once
22:42 ^🔗	JAA	Yeah, I mean, there were no proper headless browsers before.
22:42 ^🔗	JAA	Until a few months ago, that is.
22:42 ^🔗	JAA	There were various projects which gutted browsers to create something similar, but as far as I know, no actual browser had a headless mode.
22:45 ^🔗		BlueMaxim has joined #archiveteam-bs
22:45 ^🔗	Ceryn	With a headless browser you could also do a few sample screenshots. That'd be cool.
22:48 ^🔗	JAA	Yeah, I think headless browsers are really the way to go for any page that isn't just simple HTML without scripting. You basically archive exactly what a user would get when accessing the site. You trigger more realistic traffic patterns. You can make use of the cache to avoid redownloads of the page requisites. And so on.
22:52 ^🔗	JAA	However, I can also think of cases where it won't work that well. For example, when links are added on the fly by scripts. You'll want to click each link to grab the entire thing. But if the links disappear once you click one of them, it gets tricky to archive it all.
22:53 ^🔗	mundus	Can someone please add https://www.dnainfo.com to archive bot?
22:55 ^🔗	JAA	mundus: We can, but I'm not sure if we should. See #archiveteam.
22:55 ^🔗	JAA	On the other hand, I don't see immediately how that could be archived with the warrior.
22:55 ^🔗	JAA	It doesn't look like you can grab the articles by ID or anything like that.
22:56 ^🔗	JAA	SketchCow: Should we throw DNAinfo into ArchiveBot, or will we have a warrior project?
22:57 ^🔗	JAA	(Also, let's not forget about Gothamist.)
22:57 ^🔗	mundus	ah
23:03 ^🔗		tuluu_ has joined #archiveteam-bs
23:05 ^🔗		tuluu has quit IRC (Read error: Operation timed out)
23:05 ^🔗		drumstick has joined #archiveteam-bs
23:29 ^🔗	Ceryn	Anyone familiar with fanfiction.net? If you were to archive it, what would you save? Only all stories? All stories and author pages? Any user pages? Reviews? Everything?
23:29 ^🔗	godane	so i found 2 more publicly tapes that was taped over
23:29 ^🔗	Ceryn	godane: Where are you finding tapes? :P
23:29 ^🔗	godane	plus side is its felicity so i may be able to get episodes of that with original music
23:30 ^🔗	godane	SketchCow sent me 3 boxes of tapes
23:30 ^🔗	Ceryn	Oh, nice.
23:30 ^🔗	godane	there is also ebay that helps me sometimes
23:31 ^🔗	JAA	Ceryn: http://archiveteam.org/index.php?title=FanFiction.Net
23:36 ^🔗	DrasticAc	Can anyone edit the Miiverse wiki Archiving status to say it's being actively archived?
23:36 ^🔗	DrasticAc	I'm getting asked on Twitter about it, since they see "Not saved yet" and think we didn't do anything, when we've gotten/are getting a ton right this moment
23:38 ^🔗	Ceryn	JAA: Cool!
23:38 ^🔗	Ceryn	Isn't #fanfriction supposed to be alive?
23:40 ^🔗	Ceryn	You upload 800 GB of text uncompressed? :S
23:40 ^🔗	JAA	Ceryn: No, the wiki is just outdated. That's probably the channel that was used back in 2012 to coordinate the grab.
23:40 ^🔗	JAA	DrasticAc: On it.
23:40 ^🔗	DrasticAc	Sweet, thanks
23:40 ^🔗		tuluu_ has quit IRC (Quit: No Ping reply in 180 seconds.)
23:41 ^🔗	Ceryn	The Wiki was last updated legitimately in May 2016 it seems. It also references a 2015 dump.
23:41 ^🔗	JAA	Yeah, but the old information wasn't cleared out.
23:42 ^🔗	Ceryn	Okay. So currently the only copy from Archiveteam is the 2012 one?
23:43 ^🔗		tuluu has joined #archiveteam-bs
23:54 ^🔗	Ceryn	I think I'll make a Fanfiction dump this year, unless someone's already up to date.
23:54 ^🔗	Ceryn	I'd like to make it on-going though. So it stays in sync.
23:56 ^🔗	JAA	bsmith093: ^
