[00:02] *** MadArchiv has quit IRC (Read error: Operation timed out)
[00:04] *** YetAnothe has quit IRC (Ping timeout: 246 seconds)
[00:15] *** Stilett0 has joined #archiveteam-bs
[00:26] *** YetAnothe has joined #archiveteam-bs
[00:37] *** YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
[00:39] *** MadArchiv has joined #archiveteam-bs
[00:52] *** Stilett0 has quit IRC ()
[01:12] *** drumstick has quit IRC (Read error: Operation timed out)
[01:18] *** drumstick has joined #archiveteam-bs
[01:41] *** ola_norsk has joined #archiveteam-bs
[01:42] any suggestions on how to capture all "Latest" listings on a Twitter hashtag?
[01:43] as in ALL..
[01:44] *** MadArchiv has quit IRC (Read error: Operation timed out)
[01:44] Twitter seems to be throwing a "Takes too long to load. Please retry" when using Wayback
[01:45] *** kvieta has quit IRC (Read error: Connection reset by peer)
[01:45] i'm specifically looking to archive the hashtag #istandwithcorey .. it seems to be the current fancy
[01:46] !ia
[01:46] !help
[01:47] i've lost the magic :(
[01:49] !ia https://twitter.com/Corey_Feldman
[01:51] !ia https://twitter.com/hashtag/istandwithcorey
[01:51] oh well :)
[01:52] feel obliged to archive those.. for reasons unknown to common men :D
[01:53] *** ola_norsk has quit IRC (BEER!)
[02:11] *** ld1 has quit IRC (Quit: ~)
[02:14] *** ld1 has joined #archiveteam-bs
[02:18] *** MadArchiv has joined #archiveteam-bs
[02:25] *** tuluu has quit IRC (Read error: Operation timed out)
[02:27] *** tuluu has joined #archiveteam-bs
[02:28] *** Pixi has joined #archiveteam-bs
[03:01] *** YetAnothe has joined #archiveteam-bs
[03:01] *** MadArchiv has quit IRC (Read error: Connection reset by peer)
[03:06] *** kvieta has joined #archiveteam-bs
[03:58] *** YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
[04:10] damn, pity we can't go back in time
[04:11] https://techcrunch.com/2017/11/02/r-i-p-new-york-media/
[04:11] archives are gone, supposedly
[04:14] *** qw3rty119 has joined #archiveteam-bs
[04:17] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[04:25] *** Specular has joined #archiveteam-bs
[04:28] *** Stilett0 has joined #archiveteam-bs
[04:31] Most of 'em were scraped and are in Wayback already.
[04:32] Would be nice if we could have a stronger guarantee than "we probably got it?" though.
[04:35] They've mailed me, they're giving them to me.
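
(The !ia requests above ask a channel bot to submit URLs to the Internet Archive. For readers without bot access, here is a minimal sketch of the same idea against the Wayback Machine's public Save Page Now endpoint, assuming the third-party "requests" library. Note that a hashtag's "Latest" listing is built by JavaScript, so a single capture like this will not get every tweet.)

    # Sketch: submit a URL to the Wayback Machine's Save Page Now endpoint.
    # Assumes the third-party "requests" library; error handling kept minimal.
    import requests

    def save_to_wayback(url):
        # A GET to https://web.archive.org/save/<url> asks the Wayback
        # Machine to capture <url>; the Content-Location header (when
        # present) points at the resulting snapshot.
        resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
        resp.raise_for_status()
        return resp.headers.get("Content-Location")

    for target in ["https://twitter.com/Corey_Feldman",
                   "https://twitter.com/hashtag/istandwithcorey"]:
        print(target, "->", save_to_wayback(target))
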
[04:35] Rest easy
[06:05] *** BlueMaxim has quit IRC (Quit: Leaving)
[06:48] *** BlueMaxim has joined #archiveteam-bs
[06:54] *** Pixi has quit IRC (Quit: Pixi)
[07:31] *** TheLovina has joined #archiveteam-bs
[07:42] *** schbirid has joined #archiveteam-bs
[07:50] *** jschwart has joined #archiveteam-bs
[07:54] *** jschwart has quit IRC (Client Quit)
[08:02] *** Specular has quit IRC (Read error: Connection reset by peer)
[08:33] *** yuitimoth has quit IRC (Ping timeout: 506 seconds)
[08:44] *** tuluu has quit IRC (Read error: Operation timed out)
[08:44] *** tuluu has joined #archiveteam-bs
[08:46] *** w0rp has quit IRC (Read error: Operation timed out)
[08:48] *** w0rp has joined #archiveteam-bs
[09:09] SketchCow: i'm doing your GDC 1996 Psychic Detective tape
[09:46] *** ld1 has quit IRC (Quit: ~)
[10:14] *** MadArchiv has joined #archiveteam-bs
[10:14] so i'm now doing the Buffy Bits tape
[10:14] *Bits=Bites
[10:15] btw the GDC 1996 Psychic Detective tape has a lot of tape that was not used
[10:16] there was no video for half of the tape from what i can tell
[10:16] anyway, all video on the tape was captured
[10:18] *** Specular has joined #archiveteam-bs
[10:24] Alright! So it's day two of the webcomic archiving, um, "project" and -- I've been wondering -- why don't we start it out by paying some attention to Hiveworks? It's a relatively small webcomic website that hosts popular, high-profile webcomics due to its high barrier of entry and, as I previously mentioned, has a list (https://thehiveworks.com/completed) of completed (and/or cancelled) webcomics that we can just hit with wget, heritrix, or grab-site
[10:24] *** MadArchiv has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
[10:26] *** Specular has quit IRC (Quit: Leaving)
[11:01] *** ld1 has joined #archiveteam-bs
[11:02] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:07] i'm now doing the xena warrior princess tape
[11:25] *** pizzaiolo has joined #archiveteam-bs
[11:47] *** drumstick has quit IRC (Read error: Operation timed out)
[12:03] *** spacegirl has quit IRC (Read error: Operation timed out)
[12:15] *** spacegirl has joined #archiveteam-bs
[13:09] *** Stilett0 has quit IRC (Ping timeout: 248 seconds)
[13:10] *** Stilett0 has joined #archiveteam-bs
[13:17] so i got a sci-fi airing of tales from the crypt
[13:27] *** yuitimoth has joined #archiveteam-bs
[13:42] so the date of this recording is 2001-06-28 at 11PM
[13:43] and it fits cause Blade Runner The Director's Cut ran that Saturday
[14:52] *** dashcloud has quit IRC (Remote host closed the connection)
[14:52] so i got The Police Behind The Music tape
[14:53] it's a BBC America tape that was taped over
[14:53] *** dashcloud has joined #archiveteam-bs
[15:00] *** ScruffyB has quit IRC (Remote host closed the connection)
[15:03] *** phillipsj has joined #archiveteam-bs
[15:07] *** qw3rty119 has quit IRC (Read error: Operation timed out)
[15:36] *** MadArchiv has joined #archiveteam-bs
[15:58] Guys, I'm planning on archiving all the comics from the completed webcomics list from Hiveworks with grab-site so we can maybe, just maybe, give that webcomic archiving project some traction to back it up. Any tips?
[16:01] <3
[16:11] *** namespac- has joined #archiveteam-bs
[16:14] How do I use grab-site anyway?
[16:20] https://github.com/ludios/grab-site
[16:24] Hmmm, alright. I'll see what I can do with it when I actually *do* manage to get my hands on my laptop.
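
(To flesh out the "How do I use grab-site anyway?" question above: per the README at the linked repository, a crawl is a single command and a separate dashboard shows progress. A hedged sketch for the Hiveworks list follows; the option names reflect the README as of this log's date, so verify against grab-site --help before relying on them.)

    # Crawl the completed-comics index and everything it links to on the site.
    # --concurrency caps parallel connections so the host isn't hammered.
    grab-site 'https://thehiveworks.com/completed' --concurrency=2

    # In another terminal, serve the progress dashboard
    # (by default at http://127.0.0.1:29000/):
    gs-server
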
[16:31] By the way, my laptop runs on Windows 10, and the page you linked me to says that the Windows 10 version of this tool is experimental. Should I be worried about that?
[16:44] *** Mateon1 has quit IRC (Read error: Operation timed out)
[16:47] *** Mateon1 has joined #archiveteam-bs
[17:13] *** Stilett0 has quit IRC (Read error: Operation timed out)
[17:14] *** pizzaiolo has quit IRC (pizzaiolo)
[17:18] *** Pixi has joined #archiveteam-bs
[17:28] *** Stilett0 has joined #archiveteam-bs
[18:40] *** schbirid has quit IRC (Quit: Leaving)
[18:44] *** schbirid has joined #archiveteam-bs
[18:46] *** schbirid has quit IRC (Client Quit)
[18:48] *** MadArchiv has quit IRC (Read error: Connection reset by peer)
[18:51] *** Pixi has quit IRC (Quit: Pixi)
[18:59] *** schbirid has joined #archiveteam-bs
[19:13] *** icedice has joined #archiveteam-bs
[19:21] *** jschwart has joined #archiveteam-bs
[19:27] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:38] *** dashcloud has joined #archiveteam-bs
[19:56] *** Mateon1 has quit IRC (Remote host closed the connection)
[19:57] *** Mateon1 has joined #archiveteam-bs
[20:35] *** icedice has quit IRC (Ping timeout: 260 seconds)
[20:36] *** icedice has joined #archiveteam-bs
[20:39] *** icedice has quit IRC (Client Quit)
[21:00] *** Pixi has joined #archiveteam-bs
[21:13] *** MadArchiv has joined #archiveteam-bs
[21:14] *** icedice has joined #archiveteam-bs
[21:32] *** MadArchiv has quit IRC (Read error: Connection reset by peer)
[21:52] so that tape has 6 hours of tv
[21:52] not just behind the music
[21:53] we're also getting reruns of 2 guys and a girl
[21:53] on the We channel
[21:56] *** user has joined #archiveteam-bs
[21:57] also i got 'The OC' episode that aired on 2003-08-12 on this tape
[22:10] *** qw3rty3 has joined #archiveteam-bs
[22:25] *** user is now known as Ceryn
[22:25] this tape is going to have some random stuff at the end
[22:26] Do you guys handle Javascript at all when crawling websites for archival?
[22:26] yes
[22:27] How? Using what? I'm considering Selenium for Python to crawl from a headless browser. I'd like to hear alternatives though.
[22:27] a headless browser, yeah
[22:27] (that's for ad-hoc crawls)
[22:27] Well, alright. Which library?
[22:28] for planned things we will figure out what the javascript does and generate a list of url patterns by hand-ish that it depends on having
[22:28] Oh, shit. That sounds like a lot of work.
[22:29] it's usually pretty easy with modern dev tools
[22:29] Huh. Interesting.
[22:29] How come you don't use headless browsers too for planned things?
[22:29] and (*checks calendar*) eight years of experience (!)
[22:29] :D
[22:29] I figure that helps a lot.
[22:30] they're super resource intensive, and also they tend to fetch page prerequisite resources over again for every pageload
[22:30] so the archive file gets massively bloated with millions of copies of static content
[22:30] Hm. That won't work.
[22:31] it's less than optimal
[22:31] Haha.
[22:31] we also don't go for a full copy of every page the site can possibly render
[22:32] rather, we go for every unique piece of data in the site
[22:32] so when we get a forum we skip the "all posts by user XyX_cubefan69_XyX" pages
[22:32] etc.
[22:32] That depends on the project though. For example, for the Steam user forums, we also grabbed the individual post pages because those links still appear elsewhere on the web frequently.
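
(A toy illustration of the "list of url patterns by hand-ish" approach described above: for each piece of unique data, such as a forum user ID, derive every page that holds it, and feed the crawler the resulting static URL list. The domain and URL patterns here are invented for the example.)

    # Sketch of hand-derived URL patterns for a planned crawl.
    # The forum domain and paths are hypothetical placeholders.
    BASE = "https://forum.example.com"

    def urls_for_user(user_id):
        # Every page holding unique data about one user; redundant views
        # like "all posts by user X" are simply never generated.
        yield f"{BASE}/user/{user_id}"

    def urls_for_thread(thread_id, num_pages):
        # Thread content is paginated, so emit each page once.
        for page in range(1, num_pages + 1):
            yield f"{BASE}/thread/{thread_id}?page={page}"

    # Expand ID lists (scraped from a site index) into one flat fetch list.
    user_ids = [101, 102, 103]
    fetch_list = [u for uid in user_ids for u in urls_for_user(uid)]
    fetch_list += list(urls_for_thread(5001, num_pages=3))
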
[22:33] yes
[22:33] Do you make a manual list of these kinds of links too, then? The ones you don't want to follow?
[22:33] Also, regarding the JavaScript part: PhantomJS in wpull is semi-broken and doesn't really work properly.
[22:33] priorities: 1, all unique data. 2, preservation of inbound links. 3, browseability.
[22:34] Ceryn: we write a thing which is "here is a userid get me all the pages related to it"
[22:34] and then provide that with a list of userids
[22:34] so it's a static set of pages added in, not excluded from a general crawl
[22:34] so we can also get things that aren't necessarily linked, but are retrievable
[22:34] Oh. So custom crawlers all the way.
[22:34] yeah
[22:35] I was thinking it should be doable to make a general one and then just handle some trap exceptions.
[22:35] we have tooling for that, when it's the better choice
[22:35] (namely, archivebot)
[22:36] *** Pixi has quit IRC (Ping timeout: 255 seconds)
[22:37] The only thing you really want from Javascript is to get the links it tries to fetch, right?
[22:37] and any DOM changes
[22:37] Won't you have them just by storing the actual JS though?
[22:38] *** icedice2 has joined #archiveteam-bs
[22:38] I mean, at archival time, JS is stored but is only really relevant for getting more links.
[22:38] doesn't always work. a friend and I were trying to pull up KCNA's reporting of Kim Jong-il's death via Wayback and it turns out that the Javascript did really strange things that precluded loading
[22:39] I'm sure the data is in there *somewhere* but we didn't really want to dig through 30,000-something matches
[22:39] Shoot. Javascript really seems to be a pain.
[22:39] javascript is a goddamn scourge
[22:40] any optimism is precluded by the terrible things that humans do
[22:40] *** icedice2 has quit IRC (Client Quit)
[22:40] that said, Google seems to have some luck with headless Chrome
[22:40] *** icedice has quit IRC (Ping timeout: 245 seconds)
[22:40] Oh?
[22:41] no idea how they're doing it but I suppose a lot of things become easier when you have hundreds of thousands of computers
[22:41] There's also headless Firefox now.
[22:41] So it might be time to play around with these again.
[22:41] s/ again//
[22:41] again is right, we looked into it once
[22:42] Yeah, I mean, there were no proper headless browsers before.
[22:42] Until a few months ago, that is.
[22:42] There were various projects which gutted browsers to create something similar, but as far as I know, no actual browser had a headless mode.
[22:45] *** BlueMaxim has joined #archiveteam-bs
[22:45] With a headless browser you could also do a few sample screenshots. That'd be cool.
[22:48] Yeah, I think headless browsers are really the way to go for any page that isn't just simple HTML without scripting. You basically archive exactly what a user would get when accessing the site. You trigger more realistic traffic patterns. You can make use of the cache to avoid redownloads of the page requisites. And so on.
[22:52] However, I can also think of cases where it won't work that well. For example, when links are added on the fly by scripts. You'll want to click each link to grab the entire thing. But if the links disappear once you click one of them, it gets tricky to archive it all.
[22:53] Can someone please add https://www.dnainfo.com to ArchiveBot?
[22:55] mundus: We can, but I'm not sure if we should. See #archiveteam.
[22:55] On the other hand, I don't see immediately how that could be archived with the warrior.
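
(A minimal sketch of the Selenium-plus-headless-Firefox approach discussed above: render one page, collect the links the page's JavaScript actually produces, and take a sample screenshot. It assumes the selenium package and geckodriver are installed; example.com stands in for a real target.)

    # Sketch: extract rendered links and a screenshot via headless Firefox.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run without a display

    driver = webdriver.Firefox(options=options)
    try:
        driver.get("https://example.com/")
        # Links as they exist in the rendered DOM, after scripts ran --
        # exactly what a plain HTML fetch can miss.
        links = {a.get_attribute("href")
                 for a in driver.find_elements(By.TAG_NAME, "a")
                 if a.get_attribute("href")}
        driver.save_screenshot("sample.png")
    finally:
        driver.quit()

    print(len(links), "links found")
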
[22:55] It doesn't look like you can grab the articles by ID or anything like that.
[22:56] SketchCow: Should we throw DNAinfo into ArchiveBot, or will we have a warrior project?
[22:57] (Also, let's not forget about Gothamist.)
[22:57] ah
[23:03] *** tuluu_ has joined #archiveteam-bs
[23:05] *** tuluu has quit IRC (Read error: Operation timed out)
[23:05] *** drumstick has joined #archiveteam-bs
[23:29] Anyone familiar with fanfiction.net? If you were to archive it, what would you save? Only all stories? All stories and author pages? Any user pages? Reviews? Everything?
[23:29] so i found 2 more publicity tapes that were taped over
[23:29] godane: Where are you finding tapes? :P
[23:29] plus side is it's Felicity so i may be able to get episodes of that with original music
[23:30] SketchCow sent me 3 boxes of tapes
[23:30] Oh, nice.
[23:30] there is also ebay that helps me sometimes
[23:31] Ceryn: http://archiveteam.org/index.php?title=FanFiction.Net
[23:36] Can anyone edit the Miiverse wiki Archiving status to say it's being actively archived?
[23:36] I'm getting asked on Twitter about it, since they see "Not saved yet" and think we didn't do anything, when we've gotten/are getting a ton right this moment
[23:38] JAA: Cool!
[23:38] Isn't #fanfriction supposed to be alive?
[23:40] You upload 800 GB of text uncompressed? :S
[23:40] Ceryn: No, the wiki is just outdated. That's probably the channel that was used back in 2012 to coordinate the grab.
[23:40] DrasticAc: On it.
[23:40] Sweet, thanks
[23:40] *** tuluu_ has quit IRC (Quit: No Ping reply in 180 seconds.)
[23:41] The wiki was last updated legitimately in May 2016, it seems. It also references a 2015 dump.
[23:41] Yeah, but the old information wasn't cleared out.
[23:42] Okay. So currently the only copy from Archiveteam is the 2012 one?
[23:43] *** tuluu has joined #archiveteam-bs
[23:54] I think I'll make a Fanfiction dump this year, unless someone's already up to date.
[23:54] I'd like to make it ongoing though. So it stays in sync.
[23:56] bsmith093: ^
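
(On the "800 GB of text uncompressed" point above: prose-heavy dumps compress very well, so compressing before upload costs little and saves a lot. A minimal sketch using only Python's standard library; the filename is a placeholder.)

    # Sketch: stream-compress a large text dump without holding it in RAM.
    # "fanfiction-dump.txt" is a placeholder for a real dump file.
    import lzma
    import shutil

    with open("fanfiction-dump.txt", "rb") as src, \
         lzma.open("fanfiction-dump.txt.xz", "wb", preset=6) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # 1 MiB chunks
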