[00:01] *** ndiddy has joined #archiveteam-bs
[00:04] http://digitalcommons.cnr.edu/gill-publications/58/ <- article about using IA's storage (and book-reader viewer) as the location-of-record for an institutional repository. This worries me somewhat. I hope that IA reaches out to the organizations doing this and encourages them to consider maintaining (and paying for) a local backup.
[00:05] location of record, or as a location for the circulating copy?
[00:07] *** ndizzle has quit IRC (Read error: Operation timed out)
[00:08] from the article, it sounds like location of record
[00:08] hm, ok
[00:08] maybe they just didn't mention their backup plans
[00:08] i read it a few days ago and i wasn't sure which they did
[00:09] aye
[00:09] hopefully
[00:09] yeah it sounded to me that they have another repository, but they use it for just the viewer thingy
[00:09] er
[00:09] they have a location of record, but they use IA for the viewer
[00:10] from the title, I thought they were using the viewer *software* (which would be really cool) and integrating it into their own repository software. But they aren't doing that, they're uploading a copy of each paper to IA (which is certainly welcome), and then linking to the book-reader instance derived from the item
[00:11] yeah
[00:24] *** kristian_ has joined #archiveteam-bs
[00:25] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[00:26] *** GLaDOS has joined #archiveteam-bs
[00:30] *** DoomTay has joined #archiveteam-bs
[00:53] *** JesseW has joined #archiveteam-bs
[00:57] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[01:02] *** Boppen has quit IRC (Ping timeout: 194 seconds)
[01:05] *** Boppen has joined #archiveteam-bs
[01:30] sigh http://wso.williams.edu/~rcarson/
[01:34] re: cnr I think it's reasonably clear from the document that they're also uploading things to their own repository
[01:35] *** closure has joined #archiveteam-bs
[01:38] *** kristian_ has quit IRC (Leaving)
[01:56] DFJustin: hm? It's clear they *have* their own software, but it isn't clear to me that it actually stores copies...
[01:56] (I certainly may have misunderstood it, though.)
[01:57] page 6 of the pdf
[01:58] screenshot of a bepress thing saying "upload the full pdf"
[02:02] Ah, yes, you are right. Great! I'm reassured (and really should have expected as much).
[02:04] Nice -- they've uploaded 160 items, it looks like: https://archive.org/details/@cnrir&tab=uploads
[02:07] * JesseW just sent a note to info@ suggesting that they be moved to a collection.
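For anyone who wants to script that kind of check, here is a minimal sketch (an illustration, not something from the channel) that lists an account's items via archive.org's public advancedsearch API. The query string is an assumption; the exact field that matches the @cnrir uploads may differ.

    import requests

    def list_identifiers(query, rows=500):
        # Query archive.org's public advancedsearch endpoint and return item identifiers.
        params = {
            "q": query,            # e.g. 'uploader:"cnrir"' -- the exact query is a guess
            "fl[]": "identifier",
            "rows": rows,
            "output": "json",
        }
        r = requests.get("https://archive.org/advancedsearch.php", params=params, timeout=60)
        r.raise_for_status()
        return [doc["identifier"] for doc in r.json()["response"]["docs"]]

    # usage: print(len(list_identifiers('uploader:"cnrir"')), "items")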
[02:10] *** tomwsmf has joined #archiveteam-bs
[02:12] *** DiscantX has quit IRC (Read error: Operation timed out)
[02:45] *** tomwsmf has quit IRC (Ping timeout: 258 seconds)
[02:50] *** ravetcofx has joined #archiveteam-bs
[03:02] *** closure has quit IRC (Ping timeout: 250 seconds)
[04:51] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:54] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[04:55] *** DiscantX has joined #archiveteam-bs
[04:57] *** Sk1d has joined #archiveteam-bs
[05:00] *** BlueMaxim has joined #archiveteam-bs
[05:09] *** fie has joined #archiveteam-bs
[05:19] *** DoomTay has quit IRC (Quit: Page closed)
[05:29] *** DiscantX has quit IRC (Read error: Operation timed out)
[05:49] *** brayden_ has quit IRC (Quit: Leaving)
[05:57] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:08] *** metalcamp has joined #archiveteam-bs
[06:09] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[06:29] *** RichardG has joined #archiveteam-bs
[06:30] i'm at 775k items now
[07:24] *** achip has quit IRC (Read error: Operation timed out)
[07:28] *** achip has joined #archiveteam-bs
[08:01] *** schbirid has joined #archiveteam-bs
[08:17] *** DiscantX has joined #archiveteam-bs
[08:52] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[08:54] *** BlueMaxim has joined #archiveteam-bs
[09:22] *** Fletcher has quit IRC (Remote host closed the connection)
[09:27] so, PSA
[09:27] byethost is now using a JS challenge similar to cloudflare
[09:27] on their free hosting stuff
[09:28] *** Fletcher has joined #archiveteam-bs
[09:28] *** Fletcher_ sets mode: +o Fletcher
[09:29] https://gist.github.com/joepie91/37b14b8b4a0cd28ecdaf36660117ebbf
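On the byethost PSA above: when grabbing those free-hosting sites it helps to recognize the challenge interstitial so it doesn't get saved as if it were the hosted page. A minimal sketch follows; the marker strings and size threshold are assumptions (joepie91's gist above has the actual challenge details).

    import requests

    # Assumed markers of a JS-challenge interstitial; adjust per the gist above.
    CHALLENGE_MARKERS = ("aes.js", "document.cookie", "requires Javascript")

    def looks_like_js_challenge(url):
        # A tiny HTML body whose main job is to set a cookie and reload itself is a
        # strong hint that we got the challenge page rather than the hosted site.
        body = requests.get(url, timeout=30).text
        return len(body) < 4096 and any(marker in body for marker in CHALLENGE_MARKERS)

    # usage: if looks_like_js_challenge(url): hand the URL to a JS-capable fetcher instead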
[10:20] *** tephra has quit IRC (Read error: Operation timed out)
[10:23] *** tephra has joined #archiveteam-bs
[10:36] *** DiscantX has quit IRC (Ping timeout: 633 seconds)
[10:51] *** Infreq has quit IRC (Ping timeout: 258 seconds)
[10:51] *** Infreq has joined #archiveteam-bs
[12:57] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[13:13] *** Jeroen52 has joined #archiveteam-bs
[13:27] https://twitter.com/wikileaks/status/755171322288861184
[13:27] :)
[13:35] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:36] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[13:37] *** Jeroen52 has joined #archiveteam-bs
[13:56] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[13:57] *** Jeroen52 has joined #archiveteam-bs
[14:02] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[14:04] *** Jeroen52 has joined #archiveteam-bs
[14:09] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[14:10] *** Jeroen52 has joined #archiveteam-bs
[14:13] *** brayden has joined #archiveteam-bs
[14:13] *** swebb sets mode: +o brayden
[14:29] don't you just love it when archive.org captures a 404 page
[14:29] https://web.archive.org/web/20151108021614/http://my.xfinity.com/~machrone/bjr/mistakes.htm
[14:36] *** brayden has quit IRC (Read error: Connection reset by peer)
[14:38] *** brayden has joined #archiveteam-bs
[14:38] *** swebb sets mode: +o brayden
[15:01] *** brayden has quit IRC (Quit: Leaving)
[15:12] *** brayden has joined #archiveteam-bs
[15:12] *** swebb sets mode: +o brayden
[15:16] *** brayden has quit IRC (Client Quit)
[15:48] *** brayden has joined #archiveteam-bs
[15:48] *** swebb sets mode: +o brayden
[16:05] *** ndiddy has joined #archiveteam-bs
[16:07] *** DoomTay has joined #archiveteam-bs
[16:17] *** Start_ has joined #archiveteam-bs
[16:17] *** Start has quit IRC (Read error: Connection reset by peer)
[16:19] D:
[16:20] so i found a spanish streaming podcast site
[16:21] maybe useful when grabbing mp3s from other websites
[16:21] *** RichardG has quit IRC (Read error: No route to host)
[16:21] example url: http://www.ivoox.com/listen_mn_7543804_1.mp3?internal=HTML5
[16:21] for mp3
[16:21] they redirect to original urls
[16:21] *** RichardG has joined #archiveteam-bs
[16:24] *** RichardG has quit IRC (Client Quit)
[16:26] those urls redirect to this before going to original file : http://files.ivoox.com/listen/7543799
[16:29] *** Start_ is now known as Start
[16:31] *** VADemon has joined #archiveteam-bs
[17:15] *** RichardG has joined #archiveteam-bs
[17:34] upload completed, took three days: https://archive.org/details/ftpsite_ftp.3gpp.org_2013_01_28
[17:34] i should do more but i'm at work atm so,
[17:37] tho if someone with permissions wants to move it into the ftpsites collection, ...
[18:06] Speaking of FTPs, when was the last time Microsoft's FTP was crawled?
[18:25] US folk and people who have dealt with the US before: do you have any recommended disk recycling/destruction services?
[18:25] * yipdw is feeling another downside of maintaining one's own storage array
[18:26] preferably a service that will deal with small batches, like 5 disks at a time
[18:33] yipdw: a very large hammer
[18:33] HCross2: I'm more interested in the recycling than the destruction
[18:33] I don't like discarding electronics, too much toxic and valuable shit in them
[18:39] also, gitlab is snazzy
[18:39] if you submit a merge request and its associated build pipeline fails, gitlab will open a TODO re: the failure and assign it to you
[18:40] nifty
[18:40] yeah, I often get notified of CI failures way faster that way
[18:41] there's a weird 5-10 minute lag on email for me and I have no idea where it's coming from
[18:42] greylisting?
[18:44] dunno
[18:45] could be the IMAP server, could be greylisting, could be the client -- I've had akonadi do really bizarre things
[18:46] sometimes after the laptop resumes from suspend it just won't check email at all for example
[18:46] is akonadi any good? i use mutt
[18:46] well
[18:47] I don't really use it directly; it's the kmail/kdepim backend
[18:47] when it works it's awesome
[18:47] ah
[18:47] when it doesn't it is a conflaguration of a dozen cooperating(?) processes and I have no idea which subset is conspiring to deny me my mail
[18:47] D:
[18:53] *** DoomTay has quit IRC (Ping timeout: 268 seconds)
[18:55] *** DoomTay has joined #archiveteam-bs
[19:01] So....arkiver, any word on that warrior development?
[19:02] He's working hard on newsbuddy at the mo
[19:04] Ah
[19:05] Sanqui: It's one thing when archive.org captures a 404 page. It's even worse when it captures a "servers are too busy" page
[19:20] I'm going to admit now, I use outlook
[19:40] *** DoomTay has quit IRC (Quit: Page closed)
[19:42] outlook at least consistently worked across suspend/resume
[19:53] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[19:54] Can I browse the contents of a WARC file with 7zip?
[19:56] *** GLaDOS has joined #archiveteam-bs
[19:58] *** DiscantX has joined #archiveteam-bs
[19:58] *** arkiver2 has joined #archiveteam-bs
[20:04] define browse
[20:04] you can extract it with it, yes
[20:04] but it is not meant for humans
[20:04] use webarchiveplayer
[20:05] is there a way to view the files kinda like you would in a file explorer with webarchiveplayer?
[20:06] it lists all the urls captured by default
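On the WARC-browsing question: besides webarchiveplayer, a quick way to see everything that was captured (non-HTML files included) is to walk the records with the warcio library. A minimal sketch, assuming warcio is installed; the filename is a placeholder.

    from warcio.archiveiterator import ArchiveIterator

    def list_warc(path):
        # Print the content type and URL of every response record in the WARC.
        with open(path, "rb") as fh:
            for record in ArchiveIterator(fh):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                ctype = record.http_headers.get_header("Content-Type") if record.http_headers else ""
                print(ctype, url)

    list_warc("example.warc.gz")  # placeholder filename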
[20:07] do i need to have admin rights to run it?
[20:09] no
[20:11] *** RichardG_ has joined #archiveteam-bs
[20:13] *** RichardG has quit IRC (Ping timeout: 260 seconds)
[20:13] Got it running, is there a way to have it show non-html pages?
[20:16] *** RichardG has joined #archiveteam-bs
[20:19] *** tomwsmf has joined #archiveteam-bs
[20:22] *** RichardG_ has quit IRC (Ping timeout: 633 seconds)
[20:23] *** DoomTay has joined #archiveteam-bs
[20:49] for tumblr here's an idea
[20:49] *** nightpool has joined #archiveteam-bs
[20:49] one pipeline to crawl the blogs
[20:49] one pipeline to get the images found in the blogs
[20:49] i'm not sure if tumblr images are stored globally or disaggregated
[20:49] uhm
[20:50] globally
[20:50] well, global-ish
[20:50] one script to bring them both and in the darkness bind them
[20:50] they have an assets.tumblr kinda scheme
[20:50] three pipelines to weed out the slashing trannies
[20:50] 9 to bind the
[20:50] right but if i repost something on my tumblr does it get a new url
[20:50] m
[20:50] seriously, was that needed
[20:50] ^
[20:50] *** schbirid was kicked by xmc (schbirid)
[20:50] xmc: assets don't
[20:51] text content does
[20:51] perfect
[20:51] that seems reasonable
[20:51] and then the blog crawler can report back all the new usernames it finds to the pipeline
[20:52] which adds them to the queue
[20:52] not sure how much new code that would require
[20:52] wrinkle: usernames are mutable, so a bunch of links are going to be dead, and some blogs can only be accessed by logged in users
[20:52] seesaw task to post a bunch of text to some collector endpoint
[20:52] nightpool: oh nice
[20:52] ah yeah that
[20:53] should crawl+warc the api, use that to yield post urls, then warc those urls
[20:55] i suspect that post ids are consistent across username changes
[20:55] so we could build some sort of index i guess
[20:55] yeah post ids are global
[20:55] i'll look into this .... uh, tonight late i guess maybe
[20:56] okay I have a lot of experience with tumblr's api and internal api so feel free to ping me if you have questions.
[20:56] oh nice
[20:56] are you a tumblrian?
[20:57] yeah
[20:57] * xmc nods
[20:57] <3
[20:57] and I help run New XKit so if we need community involvement then I can help mobilize stuff too
[20:57] oh awesome
[20:59] although i'd prefer to wait until we have some more confirmation on what's going on before pulling the big red alarm bar tbh
[20:59] * xmc nods
[20:59] we can always run projects speculatively
[20:59] yeah
[20:59] we've done that a number of times
[20:59] i'd like to get a tumblr thing going at any manageable speed
[21:00] because then we could open it up whenever needed
[21:00] yeah and I'm definitely down to help out with that.
[21:03] :)
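A minimal sketch of the crawl-the-API step proposed above: page through a blog's posts via the v2 API, yielding post URLs for the HTML grab and photo URLs for the image pipeline. It assumes a registered API key, and the field names are from memory of the v2 API, so they should be checked against Tumblr's documentation.

    import requests

    API = "https://api.tumblr.com/v2/blog/{blog}/posts"

    def crawl_blog(blog, api_key, limit=20):
        # Yield (post_url, [image urls]) for every post on the blog.
        offset = 0
        while True:
            r = requests.get(API.format(blog=blog),
                             params={"api_key": api_key, "offset": offset, "limit": limit},
                             timeout=60)
            r.raise_for_status()
            posts = r.json()["response"]["posts"]
            if not posts:
                break
            for post in posts:
                images = [p["original_size"]["url"] for p in post.get("photos", [])]
                yield post["post_url"], images
            offset += limit

    # usage: for post_url, image_urls in crawl_blog("staff.tumblr.com", "API_KEY"): ...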
[21:05] Anyone who uses storage.harrycross.me as an rsync target - switched rsync off for now while I upload the full 4tb disk to the IA
[21:11] *** schbirid has joined #archiveteam-bs
[21:11] sorry
[21:15] *** schbirid has quit IRC (Quit: Leaving)
[21:15] surely archiving of tumblr should concentrate on things of relative significance and the general character of what tumblr is like right now
[21:16] it seems impossible to get all of it
[21:16] yea
[21:16] every single reblog of every single racy image each emo kid posts ;)
[21:17] unrelatedly, ananiel seems pleasantly full again :)
[21:17] well just strict reblogs are very easy to dedupe right? because they're just html text.
[21:17] we try not to go after porn mostly because most of it is duplicative and there's plenty of archived representative samples of porn
[21:17] we wouldn't need to dedupe reblogs because they're so small
[21:17] right dedupe wasn't quite the right word
[21:18] however tumblr is a motherlode of porn mixed in with a motherlode of important original internet content
[21:18] I meant more like, there was a very small marginal cost to crawling them
[21:18] yeap
[21:18] I suppose it depends on how you descend into it
[21:18] also most throttling seems to work on a request per IP basis rather than a quantity of bandwidth per IP basis
[21:19] though I don't think it's throttling we're up against here
[21:19] archivebot is way slow at making requests a browser would make before you could blink
[21:19] I think porn/not porn isn't exactly the right distinction to be thinking about--there's a lot of community-important erotic content, and a lot that's completely useless. I would be more worried about ending up crawling the millions and millions of bot porn accounts
[21:19] warrior doesn't run archivebot
[21:20] no, it doesn't
[21:20] but does it have similar reporting?
[21:20] nightpool: yes. i am using 'porn' in a specific sense that didn't come across
[21:20] FalconK: no, it just reports "done" and some json when it uploads
[21:20] the sociology and history of porn is actually really important
[21:20] RIGHT I KNOW
[21:20] so source material is important transitively
[21:21] I know you know this too :)
[21:21] we have had people in here who archived, say, /r/gonewild obsessively
[21:21] that's not what i'm talking about
[21:21] in some sense perhaps the sheer redundancy of porn is probably interesting too
[21:21] yeah
[21:21] n e wai
[21:22] we don't go after things that are purely porn
[21:22] but tumblr is not that
[21:22] that's the point i'm making
[21:22] aah.
[21:23] well I think there is some subtlety there, in that there is a TON of bot accounts who just post reams and reams of porn every day that we should at least think about
[21:24] yes
[21:24] which is more dangerous than spam in general because these are high-res images
[21:24] however they probably have rebagels so they'll be hard to avoid
[21:24] in what way is that dangerous?
[21:24] esp if we split the images from the html
[21:24] *** tomwsmf has quit IRC (Ping timeout: 258 seconds)
[21:24] (image/html splitting for not grabbing 1000 copies of everything)
[21:25] yeah no definitely
[21:25] hmm
[21:25] just bandwidth dangerous
[21:25] oh. yeah.
[21:25] if we start from a seed of known-good users and crawl outwards we might not even catch a lot of them
[21:25] bandwidth and disk space
[21:25] *** tomwsmf has joined #archiveteam-bs
[21:25] dedup is really expensive and it's hard to do except post hoc
[21:25] especially if we crawl along the following graph
[21:26] so i'm talking re:dedup
[21:26] crawl users along the notes graph
[21:26] but for tumblr fortunately it tends to not give out different images at the same image URL
[21:26] fetch+warc the api, fetch+warc the urls
[21:26] then report back to the tracker the image data urls
[21:26] for another warrior to grab
[21:27] you do the dedup in the tracker before fetching
[21:27] a solid plan!
[21:28] :)
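A minimal sketch of the dedup-in-the-tracker idea just described: workers report the image URLs they find, and the tracker only queues a URL the first time it is seen, so no two warriors fetch the same image. Redis is used purely as an illustration; the real tracker's storage isn't specified here.

    import redis

    r = redis.StrictRedis()

    def report_image_urls(urls):
        # Queue only URLs that have never been reported before.
        queued = []
        for url in urls:
            # SADD returns 1 only if the member was not already in the set.
            if r.sadd("tumblr:image_urls_seen", url):
                r.rpush("tumblr:image_fetch_queue", url)
                queued.append(url)
        return queued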
[21:28] so, one use case: I'd like to be able to identify a tumblr username and fetch a copy of that tumblr, HTML and all
[21:28] would this require a separate tool?
[21:28] not a huge deal (tabblo etc had one), just wondering
[21:29] yipdw: there's tumblr-utils
[21:29] oh, I meant from this grab
[21:29] it'll go into IA as a set of WARCs, I'm just pondering how to get the data back
[21:29] the first warrior i proposed would yield one item per blog
[21:29] but no pictures
[21:29] anyway -> meeting
[21:29] oh ok, that works
[21:29] reverse-lookup on the image cdxes gets you the image data
[21:30] meeting?
[21:43] *** xfce has joined #archiveteam-bs
[22:01] please loop me in on the tumblr efforts — I've been doing intermittent saving via #archivebot
[22:12] nightpool: i have a day job
[22:22] so, would anyone know how to save this livestream? http://bdcdrift.com/
[22:26] *** robink has joined #archiveteam-bs
[22:37] xmc: that makes a lot more sense I was very confused sorry haha
[22:37] * xmc nod
[22:37] im really not sure how I didn't make that connection cause I literally left for a meeting immediately after I asked that question
[22:37] so
[22:51] yahoo is getting ready to murder again I see
[22:51] (or I assume, from all this talk in here)
[22:51] D:
[22:53] *** BlueMaxim has joined #archiveteam-bs
[22:58] we can only assume
[23:14] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[23:25] *** RichardG has quit IRC (Read error: Operation timed out)
[23:25] *** RichardG has joined #archiveteam-bs
[23:30] *** arkiver2 has quit IRC (Quit: AndroIRC - Android IRC Client ( http://www.androirc.com ))
[23:32] *** Baljem has joined #archiveteam-bs
[23:33] *** joepie91_ has joined #archiveteam-bs
[23:33] *** rduser` has joined #archiveteam-bs
[23:33] *** joepie91 has quit IRC (Read error: Operation timed out)
[23:34] *** yeoldeto1 has joined #archiveteam-bs
[23:34] *** rduser has quit IRC (Read error: Operation timed out)
[23:34] *** rduser` is now known as rduser
[23:34] *** jk[SVP] has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** midas has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** yeoldetoa has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** Baljem_ has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** goekesmi has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** Igloo has quit IRC (ircd.choopa.net irc.mzima.net)
[23:37] *** jk[SVP] has joined #archiveteam-bs
[23:37] Frogging: not ... imminently?
[23:37] but I would be surprised if tumblr lives to see 2017
[23:37] considering its size that would be pretty imminent :p
[23:38] Maybe give it a full year
[23:38] but yeah
[23:42] *** goekesmi_ has joined #archiveteam-bs
[23:44] *** Igloo_ has joined #archiveteam-bs
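Looping back to the 21:29 remark about reverse-lookups on the image CDXes: assuming the grabs end up in the Wayback Machine, the public CDX API can locate a capture of a given image URL and pull the original bytes back out (the per-item CDX files on archive.org could be walked in much the same way). A minimal sketch:

    import requests

    CDX = "https://web.archive.org/cdx/search/cdx"

    def fetch_capture(url):
        # Ask the CDX index for a capture of this URL.
        r = requests.get(CDX, params={"url": url, "output": "json", "limit": 1}, timeout=60)
        rows = r.json() if r.text.strip() else []
        if len(rows) < 2:        # rows[0] is the header row
            return None
        fields = dict(zip(rows[0], rows[1]))
        # The "id_" flag asks Wayback for the original bytes, without any rewriting.
        raw = "https://web.archive.org/web/{0}id_/{1}".format(fields["timestamp"], fields["original"])
        return requests.get(raw, timeout=60).content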