#archiveteam-bs 2017-11-03,Fri

***MadArchiv has quit IRC (Read error: Operation timed out)
YetAnothe has quit IRC (Ping timeout: 246 seconds)
[00:02]
Stilett0 has joined #archiveteam-bs [00:15]
YetAnothe has joined #archiveteam-bs [00:26]
YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
MadArchiv has joined #archiveteam-bs
[00:37]
Stilett0 has quit IRC () [00:52]
..... (idle for 20mn)
drumstick has quit IRC (Read error: Operation timed out) [01:12]
drumstick has joined #archiveteam-bs [01:18]
..... (idle for 23mn)
ola_norsk has joined #archiveteam-bs [01:41]
ola_norsk: any suggestion on how to capture all "Latest" listings on a twitter hashtag?
as in ALL..
[01:42]
***MadArchiv has quit IRC (Read error: Operation timed out) [01:44]
ola_norsk: twitter seems to be throwing a "Takes too long to load. Please retry" when using wayback [01:44]
***kvieta has quit IRC (Read error: Connection reset by peer) [01:45]
ola_norsk: I'm specifically looking to archive the hashtag #istandwithcorey.. It seems to be the current fancy
!ia
!help
i've lost the magic :(
!ia https://twitter.com/Corey_Feldman
!ia https://twitter.com/hashtag/istandwithcorey
oh well :)
feel obliged to archive those.. For reasons unknown to common men :D
[01:45]
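
For reference, the "!ia" lines above are bot commands; doing the same by hand means hitting the Wayback Machine's public Save Page Now endpoint. A minimal sketch in Python, assuming the `requests` library; the `?f=tweets` parameter (Twitter's "Latest" tab at the time) is an assumption, and a single snapshot cannot capture ALL tweets under a hashtag, since the listing loads more results via JavaScript as you scroll:

```python
# Sketch: submit ola_norsk's URLs to the Wayback Machine's Save Page Now
# endpoint (https://web.archive.org/save/<url>). Only captures what one
# pageload returns, not the full infinite-scroll "Latest" listing.
import time
import requests

urls = [
    "https://twitter.com/Corey_Feldman",
    # "?f=tweets" selected the "Latest" tab on 2017-era Twitter (assumption).
    "https://twitter.com/hashtag/istandwithcorey?f=tweets",
]

for url in urls:
    resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
    # When a capture succeeds, the snapshot path is in Content-Location.
    print(url, resp.status_code, resp.headers.get("Content-Location"))
    time.sleep(5)  # be gentle; Save Page Now rate-limits heavy use
```
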
***ola_norsk has quit IRC (BEER!) [01:53]
.... (idle for 18mn)
ld1 has quit IRC (Quit: ~)
ld1 has joined #archiveteam-bs
MadArchiv has joined #archiveteam-bs
[02:11]
tuluu has quit IRC (Read error: Operation timed out)
tuluu has joined #archiveteam-bs
Pixi has joined #archiveteam-bs
[02:25]
....... (idle for 33mn)
YetAnothe has joined #archiveteam-bs
MadArchiv has quit IRC (Read error: Connection reset by peer)
[03:01]
kvieta has joined #archiveteam-bs [03:06]
........... (idle for 52mn)
YetAnothe has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com )) [03:58]
ranma: damn, pity we can't go back in time
https://techcrunch.com/2017/11/02/r-i-p-new-york-media/
archives are gone supposedly
[04:10]
***qw3rty119 has joined #archiveteam-bs
qw3rty118 has quit IRC (Read error: Operation timed out)
[04:14]
Specular has joined #archiveteam-bs
Stilett0 has joined #archiveteam-bs
[04:25]
pikhq: Most of 'em were scraped and are in Wayback already.
Would be nice if we could have a stronger guarantee than "we probably got it?" though.
[04:31]
SketchCow: They've mailed me, they're giving them to me.
Rest easy
[04:35]
................... (idle for 1h30mn)
***BlueMaxim has quit IRC (Quit: Leaving) [06:05]
......... (idle for 43mn)
BlueMaxim has joined #archiveteam-bs [06:48]
Pixi has quit IRC (Quit: Pixi) [06:54]
........ (idle for 37mn)
TheLovina has joined #archiveteam-bs [07:31]
schbirid has joined #archiveteam-bs [07:42]
jschwart has joined #archiveteam-bs
jschwart has quit IRC (Client Quit)
[07:50]
Specular has quit IRC (Read error: Connection reset by peer) [08:02]
....... (idle for 31mn)
yuitimoth has quit IRC (Ping timeout: 506 seconds) [08:33]
tuluu has quit IRC (Read error: Operation timed out)
tuluu has joined #archiveteam-bs
w0rp has quit IRC (Read error: Operation timed out)
w0rp has joined #archiveteam-bs
[08:44]
..... (idle for 21mn)
godane: SketchCow: i'm doing your GDC 1996 Psychic Detective tape [09:09]
........ (idle for 37mn)
***ld1 has quit IRC (Quit: ~) [09:46]
...... (idle for 28mn)
MadArchiv has joined #archiveteam-bs [10:14]
godane: so i'm now doing the Buffy Bites tape
btw the GDC 1996 Psychic Detective tape has a lot of tape that was not used
there was no video for half of the tape from what i can tell
anyways all video on the tape was captured
[10:14]
***Specular has joined #archiveteam-bs [10:18]
MadArchiv: Alright! So it's day two of the webcomic archiving, um, "project" and -- I've been wondering -- why don't we start it out by paying some attention to Hiveworks? It's a relatively small webcomic website that hosts popular, high-profile webcomics due to its high barrier of entry and, as I previously mentioned, has a list (https://thehiveworks.com/completed) of completed (and/or cancelled) webcomics that we can just hit with wget, heritrix, or grab-site [10:24]
***MadArchiv has quit IRC (AndroIRC - Android IRC Client ( http://www.androirc.com ))
Specular has quit IRC (Quit: Leaving)
[10:24]
........ (idle for 35mn)
ld1 has joined #archiveteam-bs
BlueMaxim has quit IRC (Quit: Leaving)
[11:01]
godane: i'm now doing the Xena: Warrior Princess tape [11:07]
.... (idle for 18mn)
***pizzaiolo has joined #archiveteam-bs [11:25]
..... (idle for 22mn)
drumstick has quit IRC (Read error: Operation timed out) [11:47]
.... (idle for 16mn)
spacegirl has quit IRC (Read error: Operation timed out) [12:03]
spacegirl has joined #archiveteam-bs [12:15]
........... (idle for 54mn)
Stilett0 has quit IRC (Ping timeout: 248 seconds)
Stilett0 has joined #archiveteam-bs
[13:09]
godane: so i got a Sci-Fi airing of Tales from the Crypt [13:17]
***yuitimoth has joined #archiveteam-bs [13:27]
.... (idle for 15mn)
godane: so the date of this recording is 2001-06-28 at 11PM
and it fits cause Blade Runner: The Director's Cut ran that Saturday
[13:42]
.............. (idle for 1h9mn)
***dashcloud has quit IRC (Remote host closed the connection) [14:52]
godane: so i got The Police Behind The Music tape
it's a BBC America tape that was taped over
[14:52]
***dashcloud has joined #archiveteam-bs [14:53]
ScruffyB has quit IRC (Remote host closed the connection)
phillipsj has joined #archiveteam-bs
qw3rty119 has quit IRC (Read error: Operation timed out)
[15:00]
...... (idle for 29mn)
MadArchiv has joined #archiveteam-bs [15:36]
..... (idle for 22mn)
MadArchiv: Guys, I'm planning on archiving all the comics from the completed webcomics list from Hiveworks with grab-site so we can maybe, just maybe, give that webcomic archiving project some traction to back it up. Any tips? [15:58]
schbirid: <3 [16:01]
***namespac- has joined #archiveteam-bs [16:11]
MadArchiv: How do I use grab-site anyway? [16:14]
DFJustin: https://github.com/ludios/grab-site [16:20]
MadArchiv: Hmmm, alright. I'll see what I can do with it when I actually *do* manage to get my hands on my laptop. [16:24]
By the way, my laptop runs Windows 10, and the page you linked me to says that the Windows 10 version of this tool is experimental; should I be worried about that? [16:31]
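
Since the question comes up above, here is a sketch of what driving grab-site over the Hiveworks list could look like. grab-site is a command-line tool; the `--no-offsite-links` flag is documented in its README, but verify against `grab-site --help` for your installed version. The comic URLs below are placeholders, not the real completed list:

```python
# Sketch: one grab-site crawl per completed comic, each producing a WARC.
# Real URLs would first be scraped from https://thehiveworks.com/completed.
import subprocess

comics = [
    "https://comic-one.example.com/",  # placeholder
    "https://comic-two.example.com/",  # placeholder
]

for url in comics:
    subprocess.run(
        ["grab-site", url,
         "--no-offsite-links"],  # stay on the comic's own domain
        check=True,
    )
```
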
***Mateon1 has quit IRC (Read error: Operation timed out)
Mateon1 has joined #archiveteam-bs
[16:44]
...... (idle for 26mn)
Stilett0 has quit IRC (Read error: Operation timed out)
pizzaiolo has quit IRC (pizzaiolo)
Pixi has joined #archiveteam-bs
[17:13]
Stilett0 has joined #archiveteam-bs [17:28]
............... (idle for 1h12mn)
schbirid has quit IRC (Quit: Leaving)
schbirid has joined #archiveteam-bs
schbirid has quit IRC (Client Quit)
MadArchiv has quit IRC (Read error: Connection reset by peer)
Pixi has quit IRC (Quit: Pixi)
[18:40]
schbirid has joined #archiveteam-bs [18:59]
icedice has joined #archiveteam-bs [19:13]
jschwart has joined #archiveteam-bs [19:21]
dashcloud has quit IRC (Read error: Operation timed out) [19:27]
dashcloud has joined #archiveteam-bs [19:38]
.... (idle for 18mn)
Mateon1 has quit IRC (Remote host closed the connection)
Mateon1 has joined #archiveteam-bs
[19:56]
........ (idle for 38mn)
icedice has quit IRC (Ping timeout: 260 seconds)
icedice has joined #archiveteam-bs
icedice has quit IRC (Client Quit)
[20:35]
..... (idle for 21mn)
Pixi has joined #archiveteam-bs [21:00]
MadArchiv has joined #archiveteam-bs
icedice has joined #archiveteam-bs
[21:13]
.... (idle for 18mn)
MadArchiv has quit IRC (Read error: Connection reset by peer) [21:32]
..... (idle for 20mn)
godane: so that tape has 6 hours of tv
not just Behind the Music
we also get reruns of 2 Guys and a Girl
on the WE channel
[21:52]
***user has joined #archiveteam-bs [21:56]
godane: also i got 'The OC' episode that aired on 2003-08-12 on this tape [21:57]
***qw3rty3 has joined #archiveteam-bs [22:10]
.... (idle for 15mn)
user is now known as Ceryn [22:25]
godane: this tape is going to have some random stuff at the end [22:25]
Ceryn: Do you guys handle Javascript at all when crawling websites for archival? [22:26]
astrid: yes [22:26]
Ceryn: How? Using what? I'm considering Selenium for Python to crawl from a headless browser. I'd like to hear alternatives though. [22:27]
astrid: a headless browser, yeah
(that's for ad-hoc crawls)
[22:27]
Ceryn: Well, alright. Which library? [22:27]
astrid: for planned things we will figure out what the javascript does and generate, by hand-ish, a list of url patterns that it depends on having [22:28]
Ceryn: Oh, shit. That sounds like a lot of work. [22:28]
astrid: it's usually pretty easy with modern dev tools [22:29]
Ceryn: Huh. Interesting.
How come you don't use headless browsers too for planned things?
[22:29]
astrid: and (*checks calendar*) eight years of experience (!) [22:29]
Ceryn: :D
I figure that helps a lot.
[22:29]
astrid: they're super resource intensive, and also they tend to fetch page prerequisite resources over again for every pageload
so the archive file gets massively bloated with millions of copies of static content
[22:30]
Ceryn: Hm. That won't work. [22:30]
astrid: it's less than optimal [22:31]
Ceryn: Haha. [22:31]
astrid: we also don't go for a full copy of every page the site can possibly render
rather, we go for every unique piece of data in the site
so when we get a forum we skip the "all posts by user XyX_cubefan69_XyX" pages, etc.
[22:31]
JAA: That depends on the project though. For example, for the Steam user forums, we also grabbed the individual post pages because those links still appear elsewhere on the web frequently. [22:32]
astrid: yes [22:33]
Ceryn: Do you make a manual list of these kinds of links too, then? The ones you don't want to follow? [22:33]
JAA: Also, regarding the JavaScript part: PhantomJS in wpull is semi-broken and doesn't really work properly. [22:33]
astrid: priorities: 1, all unique data. 2, preservation of inbound links. 3, browseability.
Ceryn: we write a thing which is "here is a userid, get me all the pages related to it"
and then provide that with a list of userids
so it's a static set of pages added in, not excluded from a general crawl
so we can also get things that aren't necessarily linked, but are retrievable
[22:33]
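
To make astrid's description concrete: a sketch (not ArchiveTeam's actual tooling) of expanding a list of userids into the static set of pages a crawl should include. The URL patterns are hypothetical, standing in for whatever dev-tools inspection turns up:

```python
# Sketch: turn userids into an explicit fetch list for the crawler.
PATTERNS = [
    "https://forum.example.invalid/user/{uid}",            # hypothetical
    "https://forum.example.invalid/api/users/{uid}.json",  # hypothetical
]

def urls_for_user(uid):
    """Every page we want archived for one userid."""
    return [p.format(uid=uid) for p in PATTERNS]

with open("userids.txt") as ids, open("fetch_list.txt", "w") as out:
    for uid in (line.strip() for line in ids):
        if uid:
            out.write("\n".join(urls_for_user(uid)) + "\n")
```
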
Ceryn: Oh. So custom crawlers all the way. [22:34]
astrid: yeah [22:34]
Ceryn: I was thinking it should be doable to make a general one and then just handle some trap exceptions. [22:35]
astrid: we have tooling for that, when it's the better choice
(namely, archivebot)
[22:35]
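
The "general crawl plus trap exceptions" approach Ceryn describes boils down to exclusion regexes applied before URLs are queued, which is how ArchiveBot's ignore sets work in spirit. A sketch with illustrative patterns (not ArchiveBot's real igsets), echoing astrid's forum example:

```python
# Sketch: drop crawler-trap URLs before they are queued.
import re

IGNORE = [re.compile(p) for p in (
    r"/search\?",         # endless search-result permutations
    r"[?&]sort=",         # re-sorted views of data already captured
    r"/posts-by-user/",   # hypothetical "all posts by user X" pages
)]

def should_queue(url):
    """True if the URL matches no ignore pattern and should be fetched."""
    return not any(rx.search(url) for rx in IGNORE)
```
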
***Pixi has quit IRC (Ping timeout: 255 seconds) [22:36]
Ceryn: The only thing you really want from Javascript is to get the links it tries to fetch, right? [22:37]
yipdw: and any DOM changes [22:37]
Ceryn: Won't you have them just by storing the actual JS though? [22:37]
***icedice2 has joined #archiveteam-bs [22:38]
Ceryn: I mean, at archival time, JS is stored but is only really relevant for getting more links. [22:38]
yipdw: doesn't always work. a friend and I were trying to pull up KCNA's reporting of Kim Jong-il's death via Wayback and it turns out that the Javascript did really strange things that precluded loading
I'm sure the data is in there *somewhere* but we didn't really want to dig through 30,000-something matches
[22:38]
Ceryn: Shoot. Javascript really seems to be a pain. [22:39]
yipdw: javascript is a goddamn scourge [22:39]
astrid: any optimism is precluded by the terrible things that humans do [22:40]
***icedice2 has quit IRC (Client Quit) [22:40]
yipdw: that said, Google seems to have some luck with headless Chrome [22:40]
***icedice has quit IRC (Ping timeout: 245 seconds) [22:40]
Ceryn: Oh? [22:40]
yipdw: no idea how they're doing it but I suppose a lot of things become easier when you have hundreds of thousands of computers [22:41]
JAA: There's also headless Firefox now.
So it might be time to play around with these again.
s/ again//
[22:41]
yipdwagain is right, we looked into it once [22:41]
JAA: Yeah, I mean, there were no proper headless browsers before.
Until a few months ago, that is.
There were various projects which gutted browsers to create something similar, but as far as I know, no actual browser had a headless mode.
[22:42]
***BlueMaxim has joined #archiveteam-bs [22:45]
Ceryn: With a headless browser you could also do a few sample screenshots. That'd be cool. [22:45]
JAA: Yeah, I think headless browsers are really the way to go for any page that isn't just simple HTML without scripting. You basically archive exactly what a user would get when accessing the site. You trigger more realistic traffic patterns. You can make use of the cache to avoid redownloads of the page requisites. And so on.
However, I can also think of cases where it won't work that well. For example, when links are added on the fly by scripts. You'll want to click each link to grab the entire thing. But if the links disappear once you click one of them, it gets tricky to archive it all.
[22:48]
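
A minimal sketch of the headless-browser capture being discussed, using Selenium with headless Chrome (one concrete option for Ceryn's "sample screenshots" idea; assumes Selenium 4.x and a local chromedriver, and is not what ArchiveTeam runs in production):

```python
# Sketch: render a page with its javascript executed, harvest the links the
# rendered DOM exposes, and save a sample screenshot.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # plain "--headless" on older Chrome

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/")
    # Links present after script execution, including script-inserted ones.
    links = {a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")
             if a.get_attribute("href")}
    driver.save_screenshot("sample.png")
    print(len(links), "links found after script execution")
finally:
    driver.quit()
```
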
mundus: Can someone please add https://www.dnainfo.com to ArchiveBot? [22:53]
JAA: mundus: We can, but I'm not sure if we should. See #archiveteam.
On the other hand, I don't see immediately how that could be archived with the warrior.
It doesn't look like you can grab the articles by ID or anything like that.
SketchCow: Should we throw DNAinfo into ArchiveBot, or will we have a warrior project?
(Also, let's not forget about Gothamist.)
[22:55]
mundus: ah [22:57]
***tuluu_ has joined #archiveteam-bs
tuluu has quit IRC (Read error: Operation timed out)
drumstick has joined #archiveteam-bs
[23:03]
..... (idle for 24mn)
Ceryn: Anyone familiar with fanfiction.net? If you were to archive it, what would you save? Only all stories? All stories and author pages? Any user pages? Reviews? Everything? [23:29]
godane: so i found 2 more publicity tapes that were taped over [23:29]
Ceryn: godane: Where are you finding tapes? :P [23:29]
godane: plus side is it's Felicity, so i may be able to get episodes of that with original music
SketchCow sent me 3 boxes of tapes
[23:29]
Ceryn: Oh, nice. [23:30]
godane: there is also ebay that helps me sometimes [23:30]
JAA: Ceryn: http://archiveteam.org/index.php?title=FanFiction.Net [23:31]
DrasticAc: Can anyone edit the Miiverse wiki Archiving status to say it's being actively archived?
I'm getting asked on Twitter about it, since they see "Not saved yet" and think we didn't do anything, when we've gotten/are getting a ton right this moment
[23:36]
Ceryn: JAA: Cool!
Isn't #fanfriction supposed to be alive?
You upload 800 GB of text uncompressed? :S
[23:38]
JAA: Ceryn: No, the wiki is just outdated. That's probably the channel that was used back in 2012 to coordinate the grab.
DrasticAc: On it.
[23:40]
DrasticAc: Sweet, thanks [23:40]
***tuluu_ has quit IRC (Quit: No Ping reply in 180 seconds.) [23:40]
Ceryn: The wiki was last legitimately updated in May 2016, it seems. It also references a 2015 dump. [23:41]
JAA: Yeah, but the old information wasn't cleared out. [23:41]
Ceryn: Okay. So currently the only copy from ArchiveTeam is the 2012 one? [23:42]
***tuluu has joined #archiveteam-bs [23:43]
Ceryn: I think I'll make a FanFiction dump this year, unless someone's already up to date.
I'd like to make it ongoing though, so it stays in sync.
[23:54]
JAA: bsmith093: ^ [23:56]
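
For what Ceryn is proposing, FanFiction.Net's story URLs follow /s/<numeric id>/<chapter>/ (per the AT wiki page linked above), so an ongoing dump can walk the id space and resume from a saved high-water mark. A rough sketch; the "Story Not Found" marker and the delay are assumptions, and a real run would also need to handle rate limiting:

```python
# Sketch: enumerate story ids and save first-chapter pages.
import os
import time
import requests

BASE = "https://www.fanfiction.net/s/{sid}/1/"

def grab_range(start, stop, out_dir="ffnet-dump"):
    os.makedirs(out_dir, exist_ok=True)
    session = requests.Session()
    for sid in range(start, stop):
        resp = session.get(BASE.format(sid=sid), timeout=60)
        # "Story Not Found" is assumed to mark deleted/absent ids.
        if resp.ok and "Story Not Found" not in resp.text:
            with open(os.path.join(out_dir, f"{sid}.html"), "w",
                      encoding="utf-8") as f:
                f.write(resp.text)
        time.sleep(2)  # at ~800 GB of text, politeness beats speed

grab_range(1, 1000)
```
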
