[00:01] *** ndiddy has joined #archiveteam-bs
[00:04] http://digitalcommons.cnr.edu/gill-publications/58/ <- article about using IA's storage (and book-reader viewer) as the location-of-record for an institutional repository. This worries me somewhat. I hope that IA reaches out to the organizations doing this and encourages them to consider maintaining (and paying for) a local backup.
[00:05] location of record, or as a location for the circulating copy?
[00:07] *** ndizzle has quit IRC (Read error: Operation timed out)
[00:08] from the article, it sounds like location of record
[00:08] hm, ok
[00:08] maybe they just didn't mention their backup plans
[00:08] i read it a few days ago and i wasn't sure which they did
[00:09] aye
[00:09] hopefully
[00:09] yeah it sounded to me that they have another repository, but they use it for just the viewer thingy
[00:09] er
[00:09] they have a location of record, but they use IA for the viewer
[00:10] from the title, I thought they were using the viewer *software* (which would be really cool) and integrating it into their own repository software. But they aren't doing that, they're uploading a copy of each paper to IA (which is certainly welcome), and then linking to the book-reader instance derived from the item
[00:11] yeah
[00:24] *** kristian_ has joined #archiveteam-bs
[00:25] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[00:26] *** GLaDOS has joined #archiveteam-bs
[00:30] *** DoomTay has joined #archiveteam-bs
[00:53] *** JesseW has joined #archiveteam-bs
[00:57] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[01:02] *** Boppen has quit IRC (Ping timeout: 194 seconds)
[01:05] *** Boppen has joined #archiveteam-bs
[01:30] sigh http://wso.williams.edu/~rcarson/
[01:34] re: cnr I think it's reasonably clear from the document that they're also uploading things to their own repository
[01:35] *** closure has joined #archiveteam-bs
[01:38] *** kristian_ has quit IRC (Leaving)
[01:56] DFJustin: hm? It's clear they *have* their own software, but it isn't clear to me that it actually stores copies...
[01:56] (I certainly may have misunderstood it, though.)
[01:57] page 6 of the pdf
[01:58] screenshot of a bepress thing saying "upload the full pdf"
[02:02] Ah, yes, you are right. Great! I'm reassured (and really should have expected as much).
[02:04] Nice -- they've uploaded 160 items, it looks like: https://archive.org/details/@cnrir&tab=uploads
[02:07] * JesseW just sent a note to info@ suggesting that they be moved to a collection.
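For anyone who wants to script that kind of check, here is a minimal sketch (an illustration, not something from the channel) that lists an account's items via archive.org's public advancedsearch API. The query string is an assumption; the exact field that matches the @cnrir uploads may differ.

    import requests

    def list_identifiers(query, rows=500):
        # Query archive.org's public advancedsearch endpoint and return item identifiers.
        params = {
            "q": query,            # e.g. 'uploader:"cnrir"' -- the exact query is a guess
            "fl[]": "identifier",
            "rows": rows,
            "output": "json",
        }
        r = requests.get("https://archive.org/advancedsearch.php", params=params, timeout=60)
        r.raise_for_status()
        return [doc["identifier"] for doc in r.json()["response"]["docs"]]

    # usage: print(len(list_identifiers('uploader:"cnrir"')), "items")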
[02:10] *** tomwsmf has joined #archiveteam-bs
[02:12] *** DiscantX has quit IRC (Read error: Operation timed out)
[02:45] *** tomwsmf has quit IRC (Ping timeout: 258 seconds)
[02:50] *** ravetcofx has joined #archiveteam-bs
[03:02] *** closure has quit IRC (Ping timeout: 250 seconds)
[04:51] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:54] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[04:55] *** DiscantX has joined #archiveteam-bs
[04:57] *** Sk1d has joined #archiveteam-bs
[05:00] *** BlueMaxim has joined #archiveteam-bs
[05:09] *** fie has joined #archiveteam-bs
[05:19] *** DoomTay has quit IRC (Quit: Page closed)
[05:29] *** DiscantX has quit IRC (Read error: Operation timed out)
[05:49] *** brayden_ has quit IRC (Quit: Leaving)
[05:57] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:08] *** metalcamp has joined #archiveteam-bs
[06:09] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[06:29] *** RichardG has joined #archiveteam-bs
[06:30] i'm at 775k items now
[07:24] *** achip has quit IRC (Read error: Operation timed out)
[07:28] *** achip has joined #archiveteam-bs
[08:01] *** schbirid has joined #archiveteam-bs
[08:17] *** DiscantX has joined #archiveteam-bs
[08:52] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[08:54] *** BlueMaxim has joined #archiveteam-bs
[09:22] *** Fletcher has quit IRC (Remote host closed the connection)
[09:27] so, PSA
[09:27] byethost is now using a JS challenge similar to cloudflare
[09:27] on their free hosting stuff
[09:28] *** Fletcher has joined #archiveteam-bs
[09:28] *** Fletcher_ sets mode: +o Fletcher
[09:29] https://gist.github.com/joepie91/37b14b8b4a0cd28ecdaf36660117ebbf
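On the byethost PSA above: when grabbing those free-hosting sites it helps to recognize the challenge interstitial so it doesn't get saved as if it were the hosted page. A minimal sketch follows; the marker strings and size threshold are assumptions (joepie91's gist above has the actual challenge details).

    import requests

    # Assumed markers of a JS-challenge interstitial; adjust per the gist above.
    CHALLENGE_MARKERS = ("aes.js", "document.cookie", "requires Javascript")

    def looks_like_js_challenge(url):
        # A tiny HTML body whose main job is to set a cookie and reload itself is a
        # strong hint that we got the challenge page rather than the hosted site.
        body = requests.get(url, timeout=30).text
        return len(body) < 4096 and any(marker in body for marker in CHALLENGE_MARKERS)

    # usage: if looks_like_js_challenge(url): hand the URL to a JS-capable fetcher instead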
[10:20] *** tephra has quit IRC (Read error: Operation timed out)
[10:23] *** tephra has joined #archiveteam-bs
[10:36] *** DiscantX has quit IRC (Ping timeout: 633 seconds)
[10:51] *** Infreq has quit IRC (Ping timeout: 258 seconds)
[10:51] *** Infreq has joined #archiveteam-bs
[12:57] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[13:13] *** Jeroen52 has joined #archiveteam-bs
[13:27] https://twitter.com/wikileaks/status/755171322288861184
[13:27] :)
[13:35] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:36] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[13:37] *** Jeroen52 has joined #archiveteam-bs
[13:56] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[13:57] *** Jeroen52 has joined #archiveteam-bs
[14:02] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[14:04] *** Jeroen52 has joined #archiveteam-bs
[14:09] *** Jeroen52 has quit IRC (Ping timeout: 260 seconds)
[14:10] *** Jeroen52 has joined #archiveteam-bs
[14:13] *** brayden has joined #archiveteam-bs
[14:13] *** swebb sets mode: +o brayden
[14:29] don't you just love it when archive.org captures a 404 page
[14:29] https://web.archive.org/web/20151108021614/http://my.xfinity.com/~machrone/bjr/mistakes.htm
[14:36] *** brayden has quit IRC (Read error: Connection reset by peer)
[14:38] *** brayden has joined #archiveteam-bs
[14:38] *** swebb sets mode: +o brayden
[15:01] *** brayden has quit IRC (Quit: Leaving)
[15:12] *** brayden has joined #archiveteam-bs
[15:12] *** swebb sets mode: +o brayden
[15:16] *** brayden has quit IRC (Client Quit)
[15:48] *** brayden has joined #archiveteam-bs
[15:48] *** swebb sets mode: +o brayden
[16:05] *** ndiddy has joined #archiveteam-bs
[16:07] *** DoomTay has joined #archiveteam-bs
[16:17] *** Start_ has joined #archiveteam-bs
[16:17] *** Start has quit IRC (Read error: Connection reset by peer)
[16:19] D:
[16:20] so i found a spanish streaming podcast site
[16:21] maybe useful when grabbing mp3s from other websites
[16:21] *** RichardG has quit IRC (Read error: No route to host)
[16:21] example url: http://www.ivoox.com/listen_mn_7543804_1.mp3?internal=HTML5
[16:21] for mp3
[16:21] they redirect to original urls
[16:21] *** RichardG has joined #archiveteam-bs
[16:24] *** RichardG has quit IRC (Client Quit)
[16:26] those urls redirect to this before going to original file : http://files.ivoox.com/listen/7543799
[16:29] *** Start_ is now known as Start
[16:31] *** VADemon has joined #archiveteam-bs
[17:15] *** RichardG has joined #archiveteam-bs
[17:34] upload completed, took three days: https://archive.org/details/ftpsite_ftp.3gpp.org_2013_01_28
[17:34] i should do more but i'm at work atm so,
[17:37] tho if someone with permissions wants to move it into the ftpsites collection, ...
[18:06] Speaking of FTPs, when was the last time Microsoft's FTP was crawled?
[18:25] US folk and people who have dealt with the US before: do you have any recommended disk recycling/destruction services?
[18:25] * yipdw is feeling another downside of maintaining one's own storage array
[18:26] preferably a service that will deal with small batches, like 5 disks at a time
[18:33] yipdw: a very large hammer
[18:33] HCross2: I'm more interested in the recycling than the destruction
[18:33] I don't like discarding electronics, too much toxic and valuable shit in them
[18:39] also, gitlab is snazzy
[18:39] if you submit a merge request and its associated build pipeline fails, gitlab will open a TODO re: the failure and assign it to you
[18:40] nifty
[18:40] yeah, I often get notified of CI failures way faster that way
[18:41] there's a weird 5-10 minute lag on email for me and I have no idea where it's coming from
[18:42] greylisting?
[18:44] dunno
[18:45] could be the IMAP server, could be greylisting, could be the client -- I've had akonadi do really bizarre things
[18:46] sometimes after the laptop resumes from suspend it just won't check email at all for example
[18:46] is akonadi any good? i use mutt
[18:46] well
[18:47] I don't really use it directly; it's the kmail/kdepim backend
[18:47] when it works it's awesome
[18:47] ah
[18:47] when it doesn't it is a conflaguration of a dozen cooperating(?) processes and I have no idea which subset is conspiring to deny me my mail
[18:47] D:
[18:53] *** DoomTay has quit IRC (Ping timeout: 268 seconds)
[18:55] *** DoomTay has joined #archiveteam-bs
[19:01] So....arkiver, any word on that warrior development?
[19:02] He's working hard on newsbuddy at the mo
[19:04] Ah
[19:05] Sanqui: It's one thing when archive.org captures a 404 page. It's even worse when it captures a "servers are too busy" page
[19:20] I'm going to admit now, I use outlook
[19:40] *** DoomTay has quit IRC (Quit: Page closed)
[19:42] outlook at least consistently worked across suspend/resume
[19:53] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[19:54] Can I browse the contents of a WARC file with 7zip?
[19:56] *** GLaDOS has joined #archiveteam-bs
[19:58] *** DiscantX has joined #archiveteam-bs
[19:58] *** arkiver2 has joined #archiveteam-bs
[20:04] define browse
[20:04] you can extract it with it, yes
[20:04] but it is not meant for humans
[20:04] use webarchiveplayer
[20:05] is there a way to view the files kinda like you would in a file explorer with webarchiveplayer?
[20:06] it lists all the urls captured by default
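On the WARC-browsing question: besides webarchiveplayer, a quick way to see everything that was captured (non-HTML files included) is to walk the records with the warcio library. A minimal sketch, assuming warcio is installed; the filename is a placeholder.

    from warcio.archiveiterator import ArchiveIterator

    def list_warc(path):
        # Print the content type and URL of every response record in the WARC.
        with open(path, "rb") as fh:
            for record in ArchiveIterator(fh):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                ctype = record.http_headers.get_header("Content-Type") if record.http_headers else ""
                print(ctype, url)

    list_warc("example.warc.gz")  # placeholder filename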
[20:07] do i need to have admin rights to run it?
[20:09] no
[20:11] *** RichardG_ has joined #archiveteam-bs
[20:13] *** RichardG has quit IRC (Ping timeout: 260 seconds)
[20:13] Got it running, is there a way to have it show non-html pages?
[20:16] *** RichardG has joined #archiveteam-bs
[20:19] *** tomwsmf has joined #archiveteam-bs
[20:22] *** RichardG_ has quit IRC (Ping timeout: 633 seconds)
[20:23] *** DoomTay has joined #archiveteam-bs
[20:49] for tumblr here's an idea
[20:49] *** nightpool has joined #archiveteam-bs
[20:49] one pipeline to crawl the blogs
[20:49] one pipeline to get the images found in the blogs
[20:49] i'm not sure if tumblr images are stored globally or disaggregated
[20:49] uhm
[20:50] globally
[20:50] well, global-ish
[20:50] one script to bring them both and in the darkness bind them
[20:50] they have an assets.tumblr kinda scheme
[20:50] three pipelines to weed out the slashing trannies
[20:50] 9 to bind the
[20:50] right but if i repost something on my tumblr does it get a new url
[20:50] m
[20:50] seriously, was that needed
[20:50] ^
[20:50] *** schbirid was kicked by xmc (schbirid)
[20:50] xmc: assets don't
[20:51] text content does
[20:51] perfect
[20:51] that seems reasonable
[20:51] and then the blog crawler can report back all the new usernames it finds to the pipeline
[20:52] which adds them to the queue
[20:52] not sure how much new code that would require
[20:52] wrinkle: usernames are mutable, so a bunch of links are going to be dead, and some blogs can only be accessed by logged in users
[20:52] seesaw task to post a bunch of text to some collector endpoint
[20:52] nightpool: oh nice
[20:52] ah yeah that
[20:53] should crawl+warc the api, use that to yield post urls, then warc those urls
[20:55] i suspect that post ids are consistent across username changes
[20:55] so we could build some sort of index i guess
[20:55] yeah post ids are global
[20:55] i'll look into this .... uh, tonight late i guess maybe
[20:56] okay I have a lot of experience with tumblr's api and internal api so feel free to ping me if you have questions.
[20:56] oh nice
[20:56] are you a tumblrian?
[20:57] yeah
[20:57] * xmc nods
[20:57] <3
[20:57] and I help run New XKit so if we need community involvement then I can help mobilize stuff too
[20:57] oh awesome
[20:59] although i'd prefer to wait until we have some more confirmation on what's going on before pulling the big red alarm bar tbh
[20:59] * xmc nods
[20:59] we can always run projects speculatively
[20:59] yeah
[20:59] we've done that a number of times
[20:59] i'd like to get a tumblr thing going at any manageable speed
[21:00] because then we could open it up whenever needed
[21:00] yeah and I'm definitely down to help out with that.
[21:03] :)
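A minimal sketch of the crawl-the-API step proposed above: page through a blog's posts via the v2 API, yielding post URLs for the HTML grab and photo URLs for the image pipeline. It assumes a registered API key, and the field names are from memory of the v2 API, so they should be checked against Tumblr's documentation.

    import requests

    API = "https://api.tumblr.com/v2/blog/{blog}/posts"

    def crawl_blog(blog, api_key, limit=20):
        # Yield (post_url, [image urls]) for every post on the blog.
        offset = 0
        while True:
            r = requests.get(API.format(blog=blog),
                             params={"api_key": api_key, "offset": offset, "limit": limit},
                             timeout=60)
            r.raise_for_status()
            posts = r.json()["response"]["posts"]
            if not posts:
                break
            for post in posts:
                images = [p["original_size"]["url"] for p in post.get("photos", [])]
                yield post["post_url"], images
            offset += limit

    # usage: for post_url, image_urls in crawl_blog("staff.tumblr.com", "API_KEY"): ...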
[21:05] Anyone who uses storage.harrycross.me as an rsync target - switched rsync off for now while I upload the full 4tb disk to the IA
[21:11] *** schbirid has joined #archiveteam-bs
[21:11] sorry
[21:15] *** schbirid has quit IRC (Quit: Leaving)
[21:15] surely archiving of tumblr should concentrate on things of relative significance and the general character of what tumblr is like right now
[21:16] it seems impossible to get all of it
[21:16] yea
[21:16] every single reblog of every single racy image each emo kid posts ;)
[21:17] unrelatedly, ananiel seems pleasantly full again :)
[21:17] well just strict reblogs are very easy to dedupe right? because they're just html text.
[21:17] we try not to go after porn mostly because most of it is duplicative and there's plenty of archived representative samples of porn
[21:17] we wouldn't need to dedupe reblogs because they're so small
[21:17] right dedupe wasn't quite the right word
[21:18] however tumblr is a motherlode of porn mixed in with a motherlode of important original internet content
[21:18] I meant more like, there was a very small marginal cost to crawling them
[21:18] yeap
[21:18] I suppose it depends on how you descend into it
[21:18] also most throttling seems to work on a request per IP basis rather than a quantity of bandwidth per IP basis
[21:19] though I don't think it's throttling we're up against here
[21:19] archivebot is way slow at making requests a browser would make before you could blink
[21:19] I think porn/not porn isn't exactly the right distinction to be thinking about--there's a lot of community-important erotic content, and a lot that's completely useless. I would be more worried about ending up crawling the millions and millions of bot porn accounts
[21:19] warrior doesn't run archivebot
[21:20] no, it doesn't
[21:20] but does it have similar reporting?
[21:20] nightpool: yes. i am using 'porn' in a specific sense that didn't come across
[21:20] FalconK: no, it just reports "done" and some json when it uploads
[21:20] the sociology and history of porn is actually really important
[21:20] RIGHT I KNOW
[21:20] so source material is important transitively
[21:21] I know you know this too :)
[21:21] we have had people in here who archived, say, /r/gonewild obsessively
[21:21] that's not what i'm talking about
[21:21] in some sense perhaps the sheer redundancy of porn is probably interesting too
[21:21] yeah
[21:21] n e wai
[21:22] we don't go after things that are purely porn
[21:22] but tumblr is not that
[21:22] that's the point i'm making
[21:22] aah.
[21:23] well I think there is some subtlety there, in that there is a TON of bot accounts who just post reams and reams of porn every day that we should at least think about
[21:24] yes
[21:24] which is more dangerous than spam in general because these are high-res images
[21:24] however they probably have rebagels so they'll be hard to avoid
[21:24] in what way is that dangerous?
[21:24] esp if we split the images from the html
[21:24] *** tomwsmf has quit IRC (Ping timeout: 258 seconds)
[21:24] (image/html splitting for not grabbing 1000 copies of everything)
[21:25] yeah no definitely
[21:25] hmm
[21:25] just bandwidth dangerous
[21:25] oh. yeah.
[21:25] if we start from a seed of known-good users and crawl outwards we might not even catch a lot of them
[21:25] bandwidth and disk space
[21:25] *** tomwsmf has joined #archiveteam-bs
[21:25] dedup is really expensive and it's hard to do except post hoc
[21:25] especially if we crawl along the following graph
[21:26] so i'm talking re:dedup
[21:26] crawl users along the notes graph
[21:26] but for tumblr fortunately it tends to not give out different images at the same image URL
[21:26] fetch+warc the api, fetch+warc the urls
[21:26] then report back to the tracker the image data urls
[21:26] for another warrior to grab
[21:27] you do the dedup in the tracker before fetching
[21:27] a solid plan!
[21:28] :)
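A minimal sketch of the dedup-in-the-tracker idea just described: workers report the image URLs they find, and the tracker only queues a URL the first time it is seen, so no two warriors fetch the same image. Redis is used purely as an illustration; the real tracker's storage isn't specified here.

    import redis

    r = redis.StrictRedis()

    def report_image_urls(urls):
        # Queue only URLs that have never been reported before.
        queued = []
        for url in urls:
            # SADD returns 1 only if the member was not already in the set.
            if r.sadd("tumblr:image_urls_seen", url):
                r.rpush("tumblr:image_fetch_queue", url)
                queued.append(url)
        return queued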
[21:28] so, one use case: I'd like to be able to identify a tumblr username and fetch a copy of that tumblr, HTML and all
[21:28] would this require a separate tool?
[21:28] not a huge deal (tabblo etc had one), just wondering
[21:29] yipdw: there's tumblr-utils
[21:29] oh, I meant from this grab
[21:29] it'll go into IA as a set of WARCs, I'm just pondering how to get the data back
[21:29] the first warrior i proposed would yield one item per blog
[21:29] but no pictures
[21:29] anyway -> meeting
[21:29] oh ok, that works
[21:29] reverse-lookup on the image cdxes gets you the image data
[21:30] meeting?
[21:43] *** xfce has joined #archiveteam-bs
[22:01] please loop me in on the tumblr efforts — I've been doing intermittent saving via #archivebot
[22:12] nightpool: i have a day job
[22:22] so, would anyone know how to save this livestream? http://bdcdrift.com/
[22:26] *** robink has joined #archiveteam-bs
[22:37] xmc: that makes a lot more sense I was very confused sorry haha
[22:37] * xmc nod
[22:37] im really not sure how I didn't make that connection cause I literally left for a meeting immediately after I asked that question
[22:37] so
[22:51] yahoo is getting ready to murder again I see
[22:51] (or I assume, from all this talk in here)
[22:51] D:
[22:53] *** BlueMaxim has joined #archiveteam-bs
[22:58] we can only assume
[23:14] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[23:25] *** RichardG has quit IRC (Read error: Operation timed out)
[23:25] *** RichardG has joined #archiveteam-bs
[23:30] *** arkiver2 has quit IRC (Quit: AndroIRC - Android IRC Client ( http://www.androirc.com ))
[23:32] *** Baljem has joined #archiveteam-bs
[23:33] *** joepie91_ has joined #archiveteam-bs
[23:33] *** rduser` has joined #archiveteam-bs
[23:33] *** joepie91 has quit IRC (Read error: Operation timed out)
[23:34] *** yeoldeto1 has joined #archiveteam-bs
[23:34] *** rduser has quit IRC (Read error: Operation timed out)
[23:34] *** rduser` is now known as rduser
[23:34] *** jk[SVP] has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** midas has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** yeoldetoa has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** Baljem_ has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** goekesmi has quit IRC (ircd.choopa.net irc.mzima.net)
[23:34] *** Igloo has quit IRC (ircd.choopa.net irc.mzima.net)
[23:37] *** jk[SVP] has joined #archiveteam-bs
[23:37] Frogging: not ... imminently?
[23:37] but I would be surprised if tumblr lives to see 2017
[23:37] considering its size that would be pretty imminent :p
[23:38] Maybe give it a full year
[23:38] but yeah
[23:42] *** goekesmi_ has joined #archiveteam-bs
[23:44] *** Igloo_ has joined #archiveteam-bs
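Looping back to the 21:29 remark about reverse-lookups on the image CDXes: assuming the grabs end up in the Wayback Machine, the public CDX API can locate a capture of a given image URL and pull the original bytes back out (the per-item CDX files on archive.org could be walked in much the same way). A minimal sketch:

    import requests

    CDX = "https://web.archive.org/cdx/search/cdx"

    def fetch_capture(url):
        # Ask the CDX index for a capture of this URL.
        r = requests.get(CDX, params={"url": url, "output": "json", "limit": 1}, timeout=60)
        rows = r.json() if r.text.strip() else []
        if len(rows) < 2:        # rows[0] is the header row
            return None
        fields = dict(zip(rows[0], rows[1]))
        # The "id_" flag asks Wayback for the original bytes, without any rewriting.
        raw = "https://web.archive.org/web/{0}id_/{1}".format(fields["timestamp"], fields["original"])
        return requests.get(raw, timeout=60).content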