#archiveteam-bs 2017-11-03,Fri


Time Nickname Message
00:02 🔗 MadArchiv has quit IRC (Read error: Operation timed out)
00:04 🔗 YetAnothe has quit IRC (Ping timeout: 246 seconds)
00:15 🔗 Stilett0 has joined #archiveteam-bs
00:26 🔗 YetAnothe has joined #archiveteam-bs
00:37 🔗 YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
00:39 🔗 MadArchiv has joined #archiveteam-bs
00:52 🔗 Stilett0 has quit IRC ()
01:12 🔗 drumstick has quit IRC (Read error: Operation timed out)
01:18 🔗 drumstick has joined #archiveteam-bs
01:41 🔗 ola_norsk has joined #archiveteam-bs
01:42 🔗 ola_norsk any suggestion on how to capture all "Latest" listings on a twitter hashtag?
01:43 🔗 ola_norsk as in ALL..
01:44 🔗 MadArchiv has quit IRC (Read error: Operation timed out)
01:44 🔗 ola_norsk twitter seems to be throwing a "Takes too long to load. Please retry", when using wayback
01:45 🔗 kvieta has quit IRC (Read error: Connection reset by peer)
01:45 🔗 ola_norsk im specifically looking to archive hashtags #istandwithcorey ..It seems to be current fancy
01:46 🔗 ola_norsk !ia
01:46 🔗 ola_norsk !help
01:47 🔗 ola_norsk i've lost the magic :(
01:49 🔗 ola_norsk !ia https://twitter.com/Corey_Feldman
01:51 🔗 ola_norsk !ia https://twitter.com/hashtag/istandwithcorey
01:51 🔗 ola_norsk oh well :)
01:52 🔗 ola_norsk felt obliged to archive those..For reasons unknown to common men :D
01:53 🔗 ola_norsk has quit IRC (BEER!)
02:11 🔗 ld1 has quit IRC (Quit: ~)
02:14 🔗 ld1 has joined #archiveteam-bs
02:18 🔗 MadArchiv has joined #archiveteam-bs
02:25 🔗 tuluu has quit IRC (Read error: Operation timed out)
02:27 🔗 tuluu has joined #archiveteam-bs
02:28 🔗 Pixi has joined #archiveteam-bs
03:01 🔗 YetAnothe has joined #archiveteam-bs
03:01 🔗 MadArchiv has quit IRC (Read error: Connection reset by peer)
03:06 🔗 kvieta has joined #archiveteam-bs
03:58 🔗 YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
04:10 🔗 ranma damn, pity we can't go back in time
04:11 🔗 ranma https://techcrunch.com/2017/11/02/r-i-p-new-york-media/
04:11 🔗 ranma archives are gone supposedly
04:14 🔗 qw3rty119 has joined #archiveteam-bs
04:17 🔗 qw3rty118 has quit IRC (Read error: Operation timed out)
04:25 🔗 Specular has joined #archiveteam-bs
04:28 🔗 Stilett0 has joined #archiveteam-bs
04:31 🔗 pikhq Most of 'em were scraped and are in Wayback already.
04:32 🔗 pikhq Would be nice if we could have a stronger guarantee than "we probably got it?" though.
04:35 🔗 SketchCow They've mailed me, they're giving them to me.
04:35 🔗 SketchCow Rest easy
06:05 🔗 BlueMaxim has quit IRC (Quit: Leaving)
06:48 🔗 BlueMaxim has joined #archiveteam-bs
06:54 🔗 Pixi has quit IRC (Quit: Pixi)
07:31 🔗 TheLovina has joined #archiveteam-bs
07:42 🔗 schbirid has joined #archiveteam-bs
07:50 🔗 jschwart has joined #archiveteam-bs
07:54 🔗 jschwart has quit IRC (Client Quit)
08:02 🔗 Specular has quit IRC (Read error: Connection reset by peer)
08:33 🔗 yuitimoth has quit IRC (Ping timeout: 506 seconds)
08:44 🔗 tuluu has quit IRC (Read error: Operation timed out)
08:44 🔗 tuluu has joined #archiveteam-bs
08:46 🔗 w0rp has quit IRC (Read error: Operation timed out)
08:48 🔗 w0rp has joined #archiveteam-bs
09:09 🔗 godane SketchCow: i'm doing your GDC 1996 Psychic Detective tape
09:46 🔗 ld1 has quit IRC (Quit: ~)
10:14 🔗 MadArchiv has joined #archiveteam-bs
10:14 🔗 godane so i'm now doing the Buffy Bits tape
10:14 🔗 godane *Bits=Bites
10:15 🔗 godane btw the GDC 1996 Psychic Detective tape has a lot of tape that was not used
10:16 🔗 godane there was no video for half of tape from what i can tell
10:16 🔗 godane anyways all video on tape was captured
10:18 🔗 Specular has joined #archiveteam-bs
10:24 🔗 MadArchiv Alright! So it's day two of the webcomic archiving, um, "project" and -- I've been wondering -- why don't we start it out by paying some attention to Hiveworks? It's a relatively small webcomic website that hosts popular, high profile webcomics due to its high barrier of entry and, as I previously mentioned, has a list (https://thehiveworks.com/completed) of completed (and/or cancelled) webcomics that we can just hit with wget, heritrix, or grab-s
10:24 🔗 MadArchiv has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
10:26 🔗 Specular has quit IRC (Quit: Leaving)
11:01 🔗 ld1 has joined #archiveteam-bs
11:02 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:07 🔗 godane i'm now doing the xena warrior princess tape
11:25 🔗 pizzaiolo has joined #archiveteam-bs
11:47 🔗 drumstick has quit IRC (Read error: Operation timed out)
12:03 🔗 spacegirl has quit IRC (Read error: Operation timed out)
12:15 🔗 spacegirl has joined #archiveteam-bs
13:09 🔗 Stilett0 has quit IRC (Ping timeout: 248 seconds)
13:10 🔗 Stilett0 has joined #archiveteam-bs
13:17 🔗 godane so i got a sci-fi airing of tales from the crypt
13:27 🔗 yuitimoth has joined #archiveteam-bs
13:42 🔗 godane so the date of this recording is 2001-06-28 at 11PM
13:43 🔗 godane and it fits cause Blade Runner The Director's Cut ran that Saturday
14:52 🔗 dashcloud has quit IRC (Remote host closed the connection)
14:52 🔗 godane so i got The Police Behind The Music tape
14:53 🔗 godane its a BBC America tape that was taped over
14:53 🔗 dashcloud has joined #archiveteam-bs
15:00 🔗 ScruffyB has quit IRC (Remote host closed the connection)
15:03 🔗 phillipsj has joined #archiveteam-bs
15:07 🔗 qw3rty119 has quit IRC (Read error: Operation timed out)
15:36 🔗 MadArchiv has joined #archiveteam-bs
15:58 🔗 MadArchiv Guys, I'm planning on archiving all the comics from the completed webcomics list from Hiveworks with grab-site so we can maybe, just maybe, give that webcomic archiving project some traction to back it up. Any tips?
16:01 🔗 schbirid <3
16:11 🔗 namespac- has joined #archiveteam-bs
16:14 🔗 MadArchiv How do I use grab-site anyway?
16:20 🔗 DFJustin https://github.com/ludios/grab-site
16:24 🔗 MadArchiv Hmmm, alright. I'll see what I can do with it when I actually *do* manage to get my hands on my laptop.
16:31 🔗 MadArchiv By the way, my laptop runs on Windows 10, and the page you linked me to says that the Windows 10 version of this tool is experimental, should I be worried about that?
16:44 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
16:47 🔗 Mateon1 has joined #archiveteam-bs
17:13 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
17:14 🔗 pizzaiolo has quit IRC (pizzaiolo)
17:18 🔗 Pixi has joined #archiveteam-bs
17:28 🔗 Stilett0 has joined #archiveteam-bs
18:40 🔗 schbirid has quit IRC (Quit: Leaving)
18:44 🔗 schbirid has joined #archiveteam-bs
18:46 🔗 schbirid has quit IRC (Client Quit)
18:48 🔗 MadArchiv has quit IRC (Read error: Connection reset by peer)
18:51 🔗 Pixi has quit IRC (Quit: Pixi)
18:59 🔗 schbirid has joined #archiveteam-bs
19:13 🔗 icedice has joined #archiveteam-bs
19:21 🔗 jschwart has joined #archiveteam-bs
19:27 🔗 dashcloud has quit IRC (Read error: Operation timed out)
19:38 🔗 dashcloud has joined #archiveteam-bs
19:56 🔗 Mateon1 has quit IRC (Remote host closed the connection)
19:57 🔗 Mateon1 has joined #archiveteam-bs
20:35 🔗 icedice has quit IRC (Ping timeout: 260 seconds)
20:36 🔗 icedice has joined #archiveteam-bs
20:39 🔗 icedice has quit IRC (Client Quit)
21:00 🔗 Pixi has joined #archiveteam-bs
21:13 🔗 MadArchiv has joined #archiveteam-bs
21:14 🔗 icedice has joined #archiveteam-bs
21:32 🔗 MadArchiv has quit IRC (Read error: Connection reset by peer)
21:52 🔗 godane so that tape has 6 hours of tv
21:52 🔗 godane not just behind the music
21:53 🔗 godane we also get reruns of 2 guys and a girl
21:53 🔗 godane on We channel
21:56 🔗 user has joined #archiveteam-bs
21:57 🔗 godane also i got 'The OC' episode that aired on 2003-08-12 on this tape
22:10 🔗 qw3rty3 has joined #archiveteam-bs
22:25 🔗 user is now known as Ceryn
22:25 🔗 godane this tape is going to have some random stuff at the end
22:26 🔗 Ceryn Do you guys handle Javascript at all when crawling websites for archival?
22:26 🔗 astrid yes
22:27 🔗 Ceryn How? Using what? I'm considering Selenium for Python to crawl from a headless browser. I'd like to hear alternatives though.
22:27 🔗 astrid a headless browser, yeah
22:27 🔗 astrid (that's for ad-hoc crawls)
22:27 🔗 Ceryn Well, alright. Which library?
22:28 🔗 astrid for planned things we will figure out what the javascript does and generate a list of url patterns by hand-ish that it depends on having
22:28 🔗 Ceryn Oh, shit. That sounds like a lot of work.
22:29 🔗 astrid it's usually pretty easy with modern dev tools
22:29 🔗 Ceryn Huh. Interesting.
22:29 🔗 Ceryn How come you don't use headless browsers too for planned things?
22:29 🔗 astrid and (*checks calendar*) eight years of experience (!)
22:29 🔗 Ceryn :D
22:29 🔗 Ceryn I figure that helps a lot.
22:30 🔗 astrid they're super resource intensive, and also they tend to fetch page prerequisite resources over again for every pageload
22:30 🔗 astrid so the archive file gets massively bloated with millions of copies of static content
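[Editor's sketch] The bloat astrid describes comes from storing the same static payload once per pageload. A minimal illustration of avoiding that is to index response bodies by digest and keep only the first copy. This is not how wpull, ArchiveBot, or any Archive Team tool is confirmed to work; the class name `PayloadIndex` and its `record` method are hypothetical, invented for this example. (The WARC format does define "revisit" records for a similar purpose, but nothing below writes WARC.)

```python
import hashlib


class PayloadIndex:
    """Hypothetical helper: remembers which URL first delivered each payload."""

    def __init__(self):
        # Maps SHA-1 hex digest of a body to the first URL that served it.
        self._first_seen = {}

    def record(self, url: str, body: bytes):
        """Return ("store", url) for a new payload, or
        ("duplicate", original_url) if an identical body was already seen."""
        digest = hashlib.sha1(body).hexdigest()
        if digest in self._first_seen:
            return ("duplicate", self._first_seen[digest])
        self._first_seen[digest] = url
        return ("store", url)


if __name__ == "__main__":
    index = PayloadIndex()
    # The same stylesheet fetched during two different pageloads.
    print(index.record("https://forum.example/static/site.css", b"body{}"))
    print(index.record("https://forum.example/static/site.css?v=2", b"body{}"))
```

A crawler using something like this would write the full body only for "store" results and a short pointer for "duplicate" ones.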
22:30 🔗 Ceryn Hm. That won't work.
22:31 🔗 astrid it's less than optimal
22:31 🔗 Ceryn Haha.
22:31 🔗 astrid we also don't go for a full copy of every page the site can possibly render
22:32 🔗 astrid rather, we go for every unique piece of data in the site
22:32 🔗 astrid so when we get a forum we skip the "all posts by user XyX_cubefan69_XyX" pages
22:32 🔗 astrid etc
22:32 🔗 JAA That depends on the project though. For example, for the Steam user forums, we also grabbed the individual post pages because those links still appear elsewhere on the web frequently.
22:33 🔗 astrid yes
22:33 🔗 Ceryn Do you make a manual list of these kinds of links too, then? The ones you don't want to follow?
22:33 🔗 JAA Also, regarding the JavaScript part: PhantomJS in wpull is semi-broken and doesn't really work properly.
22:33 🔗 astrid priorities: 1, all unique data. 2, preservation of inbound links. 3, browseability.
22:34 🔗 astrid Ceryn: we write a thing which is "here is a userid get me all the pages related to it"
22:34 🔗 astrid and then provide that with a list of userids
22:34 🔗 astrid so it's a static set of pages added in, not excluded from a general crawl
22:34 🔗 astrid so we can also get things that aren't necessarily linked, but are retrievable
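[Editor's sketch] The approach astrid outlines ("here is a userid, get me all the pages related to it", fed a list of userids) can be pictured as expanding each ID through a set of URL templates. The following is a minimal illustration only: the `forum.example` host, the URL patterns, and the function names are all assumptions made up for this example, not anything from an actual Archive Team project.

```python
from typing import Iterable, Iterator, List

# Hypothetical URL templates for an imaginary forum. In practice these would
# be worked out by inspecting the target site (e.g. with browser dev tools).
USER_URL_PATTERNS = (
    "https://forum.example/user/{userid}",
    "https://forum.example/user/{userid}/posts",
    "https://forum.example/user/{userid}/avatar",
)


def pages_for_user(userid: int) -> List[str]:
    """Expand one user ID into every URL we want for that user."""
    return [pattern.format(userid=userid) for pattern in USER_URL_PATTERNS]


def build_url_list(userids: Iterable[int]) -> Iterator[str]:
    """Yield the static set of URLs for all given user IDs, in order."""
    for userid in userids:
        yield from pages_for_user(userid)


if __name__ == "__main__":
    # IDs need not be discovered by crawling; a plain range also reaches
    # pages that are retrievable but not linked from anywhere.
    for url in build_url_list(range(1, 4)):
        print(url)
```

Because the list is generated from IDs rather than discovered by following links, it is an additive set of pages, which matches the point that unlinked but retrievable pages can still be fetched.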
22:34 🔗 Ceryn Oh. So custom crawlers all the way.
22:34 🔗 astrid yeah
22:35 🔗 Ceryn I was thinking it should be doable to make a general one and then just handle some trap exceptions.
22:35 🔗 astrid we have tooling for that, when it's the better choice
22:35 🔗 astrid (namely, archivebot)
22:36 🔗 Pixi has quit IRC (Ping timeout: 255 seconds)
22:37 🔗 Ceryn The only thing you really want from Javascript is to get the links it tries to fetch, right?
22:37 🔗 yipdw and any DOM changes
22:37 🔗 Ceryn Won't you have them just by storing the actual JS though?
22:38 🔗 icedice2 has joined #archiveteam-bs
22:38 🔗 Ceryn I mean, at archival time, JS is stored but is only really relevant for getting more links.
22:38 🔗 yipdw doesn't always work. a friend and I were trying to pull up KCNA's reporting of Kim Jong-il's death via Wayback and it turns out that the Javascript did really strange things that precluded loading
22:39 🔗 yipdw I'm sure the data is in there *somewhere* but we didn't really want to dig through 30,000-something matches
22:39 🔗 Ceryn Shoot. Javascript really seems to be a pain.
22:39 🔗 yipdw javascript is a goddamn scourge
22:40 🔗 astrid any optimism is precluded by the terrible things that humans do
22:40 🔗 icedice2 has quit IRC (Client Quit)
22:40 🔗 yipdw that said, Google seems to have some luck with headless Chrome
22:40 🔗 icedice has quit IRC (Ping timeout: 245 seconds)
22:40 🔗 Ceryn Oh?
22:41 🔗 yipdw no idea how they're doing it but I suppose a lot of things become easier when you have hundreds of thousands of computers
22:41 🔗 JAA There's also headless Firefox now.
22:41 🔗 JAA So it might be time to play around with these again.
22:41 🔗 JAA s/ again//
22:41 🔗 yipdw again is right, we looked into it once
22:42 🔗 JAA Yeah, I mean, there were no proper headless browsers before.
22:42 🔗 JAA Until a few months ago, that is.
22:42 🔗 JAA There were various projects which gutted browsers to create something similar, but as far as I know, no actual browser had a headless mode.
22:45 🔗 BlueMaxim has joined #archiveteam-bs
22:45 🔗 Ceryn With a headless browser you could also do a few sample screenshots. That'd be cool.
22:48 🔗 JAA Yeah, I think headless browsers are really the way to go for any page that isn't just simple HTML without scripting. You basically archive exactly what a user would get when accessing the site. You trigger more realistic traffic patterns. You can make use of the cache to avoid redownloads of the page requisites. And so on.
22:52 🔗 JAA However, I can also think of cases where it won't work that well. For example, when links are added on the fly by scripts. You'll want to click each link to grab the entire thing. But if the links disappear once you click one of them, it gets tricky to archive it all.
22:53 🔗 mundus Can someone please add https://www.dnainfo.com to archive bot?
22:55 🔗 JAA mundus: We can, but I'm not sure if we should. See #archiveteam.
22:55 🔗 JAA On the other hand, I don't see immediately how that could be archived with the warrior.
22:55 🔗 JAA It doesn't look like you can grab the articles by ID or anything like that.
22:56 🔗 JAA SketchCow: Should we throw DNAinfo into ArchiveBot, or will we have a warrior project?
22:57 🔗 JAA (Also, let's not forget about Gothamist.)
22:57 🔗 mundus ah
23:03 🔗 tuluu_ has joined #archiveteam-bs
23:05 🔗 tuluu has quit IRC (Read error: Operation timed out)
23:05 🔗 drumstick has joined #archiveteam-bs
23:29 🔗 Ceryn Anyone familiar with fanfiction.net? If you were to archive it, what would you save? Only all stories? All stories and author pages? Any user pages? Reviews? Everything?
23:29 🔗 godane so i found 2 more publicly tapes that was taped over
23:29 🔗 Ceryn godane: Where are you finding tapes? :P
23:29 🔗 godane plus side is its felicity so i may be able to get episodes of that with original music
23:30 🔗 godane SketchCow sent me 3 boxes of tapes
23:30 🔗 Ceryn Oh, nice.
23:30 🔗 godane there is also ebay that helps me sometimes
23:31 🔗 JAA Ceryn: http://archiveteam.org/index.php?title=FanFiction.Net
23:36 🔗 DrasticAc Can anyone edit the Miiverse wiki Archiving status to say it's being actively archived?
23:36 🔗 DrasticAc I'm getting asked on Twitter about it, since they see "Not saved yet" and think we didn't do anything, when we've gotten/are getting a ton right this moment
23:38 🔗 Ceryn JAA: Cool!
23:38 🔗 Ceryn Isn't #fanfriction supposed to be alive?
23:40 🔗 Ceryn You upload 800 GB of text uncompressed? :S
23:40 🔗 JAA Ceryn: No, the wiki is just outdated. That's probably the channel that was used back in 2012 to coordinate the grab.
23:40 🔗 JAA DrasticAc: On it.
23:40 🔗 DrasticAc Sweet, thanks
23:40 🔗 tuluu_ has quit IRC (Quit: No Ping reply in 180 seconds.)
23:41 🔗 Ceryn The Wiki was last updated legitimately in May 2016 it seems. It also references a 2015 dump.
23:41 🔗 JAA Yeah, but the old information wasn't cleared out.
23:42 🔗 Ceryn Okay. So currently the only copy from Archiveteam is the 2012 one?
23:43 🔗 tuluu has joined #archiveteam-bs
23:54 🔗 Ceryn I think I'll make a Fanfiction dump this year, unless someone's already up to date.
23:54 🔗 Ceryn I'd like to make it on-going though. So it stays in sync.
23:56 🔗 JAA bsmith093: ^
