#archiveteam-bs 2016-07-19,Tue


Time Nickname Message
00:01 🔗 ndiddy has joined #archiveteam-bs
00:04 🔗 JW_work http://digitalcommons.cnr.edu/gill-publications/58/ <- article about using IA's storage (and book-reader viewer) as the location-of-record for an institutional repository. This worries me somewhat. I hope that IA reaches out to the organizations doing this and encourages them to consider maintaining (and paying for) a local backup.
00:05 🔗 xmc location of record, or as a location for the circulating copy?
00:07 🔗 ndizzle has quit IRC (Read error: Operation timed out)
00:08 🔗 JW_work from the article, it sounds like location of record
00:08 🔗 xmc hm, ok
00:08 🔗 JW_work maybe they just didn't mention their backup plans
00:08 🔗 xmc i read it a few days ago and i wasn't sure which they did
00:09 🔗 xmc aye
00:09 🔗 JW_work hopefully
00:09 🔗 xmc yeah it sounded to me that they have another repository, but they use it for just the viewer thingy
00:09 🔗 xmc er
00:09 🔗 xmc they have a location of record, but they use IA for the viewer
00:10 🔗 JW_work from the title, I thought they were using the viewer *software* (which would be really cool) and integrating it into their own repository software. But they aren't doing that, they're uploading a copy of each paper to IA (which is certainly welcome), and then linking to the book-reader instance derived from the item
00:11 🔗 xmc yeah
00:24 🔗 kristian_ has joined #archiveteam-bs
00:25 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
00:26 🔗 GLaDOS has joined #archiveteam-bs
00:30 🔗 DoomTay has joined #archiveteam-bs
00:53 🔗 JesseW has joined #archiveteam-bs
00:57 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
01:02 🔗 Boppen has quit IRC (Ping timeout: 194 seconds)
01:05 🔗 Boppen has joined #archiveteam-bs
01:30 🔗 DFJustin sigh http://wso.williams.edu/~rcarson/
01:34 🔗 DFJustin re: cnr I think it's reasonably clear from the document that they're also uploading things to their own repository
01:35 🔗 closure has joined #archiveteam-bs
01:38 🔗 kristian_ has quit IRC (Leaving)
01:56 🔗 JesseW DFJustin: hm? It's clear they *have* their own software, but it isn't clear to me that it actually stores copies...
01:56 🔗 JesseW (I certainly may have misunderstood it, though.)
01:57 🔗 DFJustin page 6 of the pdf
01:58 🔗 DFJustin screenshot of a bepress thing saying "upload the full pdf"
02:02 🔗 JesseW Ah, yes, you are right. Great! I'm reassured (and really should have expected as much).
02:04 🔗 JesseW Nice -- they've uploaded 160 items, it looks like: https://archive.org/details/@cnrir&tab=uploads
02:07 🔗 * JesseW just sent a note to info@ suggesting that they be moved to a collection.
02:10 🔗 tomwsmf has joined #archiveteam-bs
02:12 🔗 DiscantX has quit IRC (Read error: Operation timed out)
02:45 🔗 tomwsmf has quit IRC (Ping timeout: 258 seconds)
02:50 🔗 ravetcofx has joined #archiveteam-bs
03:02 🔗 closure has quit IRC (Ping timeout: 250 seconds)
04:51 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:54 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
04:55 🔗 DiscantX has joined #archiveteam-bs
04:57 🔗 Sk1d has joined #archiveteam-bs
05:00 🔗 BlueMaxim has joined #archiveteam-bs
05:09 🔗 fie has joined #archiveteam-bs
05:19 🔗 DoomTay has quit IRC (Quit: Page closed)
05:29 🔗 DiscantX has quit IRC (Read error: Operation timed out)
05:49 🔗 brayden_ has quit IRC (Quit: Leaving)
05:57 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
06:08 🔗 metalcamp has joined #archiveteam-bs
06:09 🔗 RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
06:29 🔗 RichardG has joined #archiveteam-bs
06:30 🔗 godane i'm at 775k items now
07:24 🔗 achip has quit IRC (Read error: Operation timed out)
07:28 🔗 achip has joined #archiveteam-bs
08:01 🔗 schbirid has joined #archiveteam-bs
08:17 🔗 DiscantX has joined #archiveteam-bs
08:52 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
08:54 🔗 BlueMaxim has joined #archiveteam-bs
09:22 🔗 Fletcher has quit IRC (Remote host closed the connection)
09:27 🔗 joepie91 so, PSA
09:27 🔗 joepie91 byethost is now using a JS challenge similar to cloudflare
09:27 🔗 joepie91 on their free hosting stuff
09:28 🔗 Fletcher has joined #archiveteam-bs
09:28 🔗 Fletcher_ sets mode: +o Fletcher
09:29 🔗 joepie91 https://gist.github.com/joepie91/37b14b8b4a0cd28ecdaf36660117ebbf
10:20 🔗 tephra has quit IRC (Read error: Operation timed out)
10:23 🔗 tephra has joined #archiveteam-bs
10:36 🔗 DiscantX has quit IRC (Ping timeout: 633 seconds)
10:51 🔗 Infreq has quit IRC (Ping timeout: 258 seconds)
10:51 🔗 Infreq has joined #archiveteam-bs
12:57 🔗 Jeroen52 has quit IRC (Ping timeout: 260 seconds)
13:13 🔗 Jeroen52 has joined #archiveteam-bs
13:27 🔗 arkiver https://twitter.com/wikileaks/status/755171322288861184
13:27 🔗 arkiver :)
13:35 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:36 🔗 Jeroen52 has quit IRC (Ping timeout: 260 seconds)
13:37 🔗 Jeroen52 has joined #archiveteam-bs
13:56 🔗 Jeroen52 has quit IRC (Ping timeout: 260 seconds)
13:57 🔗 Jeroen52 has joined #archiveteam-bs
14:02 🔗 Jeroen52 has quit IRC (Ping timeout: 260 seconds)
14:04 🔗 Jeroen52 has joined #archiveteam-bs
14:09 🔗 Jeroen52 has quit IRC (Ping timeout: 260 seconds)
14:10 🔗 Jeroen52 has joined #archiveteam-bs
14:13 🔗 brayden has joined #archiveteam-bs
14:13 🔗 swebb sets mode: +o brayden
14:29 🔗 Sanqui don't you just love it when archive.org captures a 404 page
14:29 🔗 Sanqui https://web.archive.org/web/20151108021614/http://my.xfinity.com/~machrone/bjr/mistakes.htm
14:36 🔗 brayden has quit IRC (Read error: Connection reset by peer)
14:38 🔗 brayden has joined #archiveteam-bs
14:38 🔗 swebb sets mode: +o brayden
15:01 🔗 brayden has quit IRC (Quit: Leaving)
15:12 🔗 brayden has joined #archiveteam-bs
15:12 🔗 swebb sets mode: +o brayden
15:16 🔗 brayden has quit IRC (Client Quit)
15:48 🔗 brayden has joined #archiveteam-bs
15:48 🔗 swebb sets mode: +o brayden
16:05 🔗 ndiddy has joined #archiveteam-bs
16:07 🔗 DoomTay has joined #archiveteam-bs
16:17 🔗 Start_ has joined #archiveteam-bs
16:17 🔗 Start has quit IRC (Read error: Connection reset by peer)
16:19 🔗 xmc D:
16:20 🔗 godane so i found a spanish streaming podcast site
16:21 🔗 godane maybe useful when grabbing mp3s from other websites
16:21 🔗 RichardG has quit IRC (Read error: No route to host)
16:21 🔗 godane example url: http://www.ivoox.com/listen_mn_7543804_1.mp3?internal=HTML5
16:21 🔗 godane for mp3
16:21 🔗 godane they redirect to original urls
16:21 🔗 RichardG has joined #archiveteam-bs
16:24 🔗 RichardG has quit IRC (Client Quit)
16:26 🔗 godane those urls redirect to this before going to original file : http://files.ivoox.com/listen/7543799
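The redirect behaviour godane describes (a listen URL that redirects to an intermediate URL before reaching the original file) can be traced with a sketch like the one below. The `resolve_chain` helper and its `lookup` callable are hypothetical names, not part of any existing tool. In a real grab, `lookup` would issue a request without following redirects and return the `Location` header, or `None` when there is none.

```python
from typing import Callable, List, Optional


def resolve_chain(url: str,
                  lookup: Callable[[str], Optional[str]],
                  max_hops: int = 10) -> List[str]:
    """Return every URL visited, starting at `url`, until no redirect remains.

    Stops quietly after `max_hops` redirects and raises on a redirect loop.
    """
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        target = lookup(chain[-1])
        if target is None:
            break  # final URL reached
        if target in seen:
            raise ValueError("redirect loop at " + target)
        chain.append(target)
        seen.add(target)
    return chain
```

Keeping the whole chain, rather than only the final URL, means each intermediate hop can be archived as well.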
16:29 🔗 Start_ is now known as Start
16:31 🔗 VADemon has joined #archiveteam-bs
17:15 🔗 RichardG has joined #archiveteam-bs
17:34 🔗 xmc upload completed, took three days: https://archive.org/details/ftpsite_ftp.3gpp.org_2013_01_28
17:34 🔗 xmc i should do more but i'm at work atm so,
17:37 🔗 xmc tho if someone with permissions wants to move it into the ftpsites collection, ...
18:06 🔗 DoomTay Speaking of FTPs, when was the last time Microsoft's FTP was crawled?
18:25 🔗 yipdw US folk and people who have dealt with the US before: do you have any recommended disk recycling/destruction services?
18:25 🔗 * yipdw is feeling another downside of maintaining one's own storage array
18:26 🔗 yipdw preferably a service that will deal with small batches, like 5 disks at a time
18:33 🔗 HCross2 yipdw: a very large hammer
18:33 🔗 yipdw HCross2: I'm more interested in the recycling than the destruction
18:33 🔗 yipdw I don't like discarding electronics, too much toxic and valuable shit in them
18:39 🔗 yipdw also, gitlab is snazzy
18:39 🔗 yipdw if you submit a merge request and its associated build pipeline fails, gitlab will open a TODO re: the failure and assign it to you
18:40 🔗 xmc nifty
18:40 🔗 yipdw yeah, I often get notified of CI failures way faster that way
18:41 🔗 yipdw there's a weird 5-10 minute lag on email for me and I have no idea where it's coming from
18:42 🔗 xmc greylisting?
18:44 🔗 yipdw dunno
18:45 🔗 yipdw could be the IMAP server, could be greylisting, could be the client -- I've had akonadi do really bizarre things
18:46 🔗 yipdw sometimes after the laptop resumes from suspend it just won't check email at all for example
18:46 🔗 xmc is akonadi any good? i use mutt
18:46 🔗 yipdw well
18:47 🔗 yipdw I don't really use it directly; it's the kmail/kdepim backend
18:47 🔗 yipdw when it works it's awesome
18:47 🔗 xmc ah
18:47 🔗 yipdw when it doesn't it is a conflagration of a dozen cooperating(?) processes and I have no idea which subset is conspiring to deny me my mail
18:47 🔗 xmc D:
18:53 🔗 DoomTay has quit IRC (Ping timeout: 268 seconds)
18:55 🔗 DoomTay has joined #archiveteam-bs
19:01 🔗 DoomTay So....arkiver, any word on that warrior development?
19:02 🔗 Igloo He's working hard on newsbuddy at the mo
19:04 🔗 DoomTay Ah
19:05 🔗 DoomTay Sanqui: It's one thing when archive.org captures a 404 page. It's even worse when it captures a "servers are too busy" page
19:20 🔗 HCross2 I'm going to admit now, I use outlook
19:40 🔗 DoomTay has quit IRC (Quit: Page closed)
19:42 🔗 yipdw outlook at least consistently worked across suspend/resume
19:53 🔗 GLaDOS has quit IRC (Quit: Oh crap, I died.)
19:54 🔗 hook54321 Can I browse the contents of a WARC file with 7zip?
19:56 🔗 GLaDOS has joined #archiveteam-bs
19:58 🔗 DiscantX has joined #archiveteam-bs
19:58 🔗 arkiver2 has joined #archiveteam-bs
20:04 🔗 schbirid define browse
20:04 🔗 schbirid you can extract it with it, yes
20:04 🔗 schbirid but it is not meant for humans
20:04 🔗 schbirid use webarchiveplayer
20:05 🔗 hook54321 is there a way to view the files kinda like you would in a file explorer with webarchiveplayer?
20:06 🔗 schbirid it lists all the urls captured by default
20:07 🔗 hook54321 do i need to have admin rights to run it?
20:09 🔗 schbirid no
20:11 🔗 RichardG_ has joined #archiveteam-bs
20:13 🔗 RichardG has quit IRC (Ping timeout: 260 seconds)
20:13 🔗 hook54321 Got it running, is there a way to have it show non-html pages?
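For a plain listing of everything in a WARC, HTML or not, the record headers can be read directly. This is a minimal sketch using only the Python standard library, not how webarchiveplayer works. It assumes well-formed records with a `Content-Length` header, and the function names are made up for illustration.

```python
import gzip
from typing import BinaryIO, Iterator, Tuple


def iter_warc_targets(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (WARC-Type, WARC-Target-URI) for each record that has a target URI."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the record's header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the payload so its contents are never mistaken for headers.
        stream.read(int(headers.get("content-length", "0")))
        uri = headers.get("warc-target-uri", "").strip("<>")
        if uri:
            yield headers.get("warc-type", ""), uri


def open_warc(path: str) -> BinaryIO:
    """Open a .warc or .warc.gz file for binary reading."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")
```

Something like `for kind, uri in iter_warc_targets(open_warc("grab.warc.gz")): print(kind, uri)` would then print one line per captured URL; the file name here is only a placeholder.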
20:16 🔗 RichardG has joined #archiveteam-bs
20:19 🔗 tomwsmf has joined #archiveteam-bs
20:22 🔗 RichardG_ has quit IRC (Ping timeout: 633 seconds)
20:23 🔗 DoomTay has joined #archiveteam-bs
20:49 🔗 xmc for tumblr here's an idea
20:49 🔗 nightpool has joined #archiveteam-bs
20:49 🔗 xmc one pipeline to crawl the blogs
20:49 🔗 xmc one pipeline to get the images found in the blogs
20:49 🔗 xmc i'm not sure if tumblr images are stored globally or disaggregated
20:49 🔗 xmc uhm
20:50 🔗 nightpool globally
20:50 🔗 nightpool well, global-ish
20:50 🔗 yipdw one script to bring them both and in the darkness bind them
20:50 🔗 nightpool they have an assets.tumblr kinda scheme
20:50 🔗 schbirid three pipelines to weed out the slashing trannies
20:50 🔗 schbirid 9 to bind the
20:50 🔗 xmc right but if i repost something on my tumblr does it get a new url
20:50 🔗 schbirid m
20:50 🔗 yipdw seriously, was that needed
20:50 🔗 xmc ^
20:50 🔗 schbirid was kicked by xmc (schbirid)
20:50 🔗 nightpool xmc: assets don't
20:51 🔗 nightpool text content does
20:51 🔗 xmc perfect
20:51 🔗 yipdw that seems reasonable
20:51 🔗 xmc and then the blog crawler can report back all the new usernames it finds to the pipeline
20:52 🔗 xmc which adds them to the queue
20:52 🔗 xmc not sure how much new code that would require
20:52 🔗 nightpool wrinkle: usernames are mutable, so a bunch of links are going to be dead, and some blogs can only be accessed by logged in users
20:52 🔗 yipdw seesaw task to post a bunch of text to some collector endpoint
20:52 🔗 yipdw nightpool: oh nice
20:52 🔗 xmc ah yeah that
20:53 🔗 xmc should crawl+warc the api, use that to yield post urls, then warc those urls
20:55 🔗 xmc i suspect that post ids are consistent across username changes
20:55 🔗 xmc so we could build some sort of index i guess
20:55 🔗 nightpool yeah post ids are global
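The index xmc mentions could be sketched as below: since post ids are global, grouping post URLs by id shows which usernames a post has been seen under. The URL pattern (`<username>.tumblr.com/post/<id>`) is an assumption about how post URLs look, and `index_by_post_id` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Assumed URL shape: http(s)://<username>.tumblr.com/post/<numeric id>[/<slug>]
POST_URL = re.compile(r"^https?://([a-z0-9-]+)\.tumblr\.com/post/(\d+)", re.IGNORECASE)


def index_by_post_id(urls: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each post id to the set of usernames it was seen under."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url in urls:
        match = POST_URL.match(url)
        if match:
            username, post_id = match.group(1).lower(), match.group(2)
            index[post_id].add(username)
    return dict(index)
```

A post id that maps to more than one username is a hint that the blog was renamed, so a dead link under the old name might be resolved through the new one.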
20:55 🔗 xmc i'll look into this .... uh, tonight late i guess maybe
20:56 🔗 nightpool okay I have a lot of experience with tumblr's api and internal api so feel free to ping me if you have questions.
20:56 🔗 xmc oh nice
20:56 🔗 xmc are you a tumblrian?
20:57 🔗 nightpool yeah
20:57 🔗 * xmc nods
20:57 🔗 xmc <3
20:57 🔗 nightpool and I help run New XKit so if we need community involvement then I can help mobilize stuff too
20:57 🔗 xmc oh awesome
20:59 🔗 nightpool although i'd prefer to wait until we have some more confirmation on what's going on before pulling the big red alarm bar tbh
20:59 🔗 * xmc nods
20:59 🔗 xmc we can always run projects speculatively
20:59 🔗 nightpool yeah
20:59 🔗 xmc we've done that a number of times
20:59 🔗 xmc i'd like to get a tumblr thing going at any manageable speed
21:00 🔗 xmc because then we could open it up whenever needed
21:00 🔗 nightpool yeah and I'm definitely down to help out with that.
21:03 🔗 xmc :)
21:05 🔗 HCross2 Anyone who uses storage.harrycross.me as an rsync target - switched rsync off for now while I upload the full 4tb disk to the IA
21:11 🔗 schbirid has joined #archiveteam-bs
21:11 🔗 schbirid sorry
21:15 🔗 schbirid has quit IRC (Quit: Leaving)
21:15 🔗 FalconK surely archiving of tumblr should concentrate on things of relative significance and the general character of what tumblr is like right now
21:16 🔗 FalconK it seems impossible to get all of it
21:16 🔗 xmc yea
21:16 🔗 FalconK every single reblog of every single racy image each emo kid posts ;)
21:17 🔗 FalconK unrelatedly, ananiel seems pleasantly full again :)
21:17 🔗 nightpool well just strict reblogs are very easy to dedupe right? because they're just html text.
21:17 🔗 xmc we try not to go after porn mostly because most of it is duplicative and there's plenty of archived representative samples of porn
21:17 🔗 xmc we wouldn't need to dedupe reblogs because they're so small
21:17 🔗 nightpool right dedupe wasn't quite the right word
21:18 🔗 xmc however tumblr is a motherlode of porn mixed in with a motherlode of important original internet content
21:18 🔗 nightpool I meant more like, there was a very small marginal cost to crawling them
21:18 🔗 xmc yeap
21:18 🔗 FalconK I suppose it depends on how you descend into it
21:18 🔗 FalconK also most throttling seems to work on a request per IP basis rather than a quantity of bandwidth per IP basis
21:19 🔗 FalconK though I don't think it's throttling we're up against here
21:19 🔗 FalconK archivebot is way slow at making requests a browser would make before you could blink
21:19 🔗 nightpool I think porn/not porn isn't exactly the right distinction to be thinking about--there's a lot of community-important erotic content, and a lot that's completely useless. I would be more worried about ending up crawling the millions and millions of bot porn accounts
21:19 🔗 xmc warrior doesn't run archivebot
21:20 🔗 FalconK no, it doesn't
21:20 🔗 FalconK but does it have similar reporting?
21:20 🔗 xmc nightpool: yes. i am using 'porn' in a specific sense that didn't come across
21:20 🔗 xmc FalconK: no, it just reports "done" and some json when it uploads
21:20 🔗 FalconK the sociology and history of porn is actually really important
21:20 🔗 xmc RIGHT I KNOW
21:20 🔗 FalconK so source material is important transitively
21:21 🔗 FalconK I know you know this too :)
21:21 🔗 xmc we have had people in here who archived, say, /r/gonewild obsessively
21:21 🔗 xmc that's not what i'm talking about
21:21 🔗 FalconK in some sense perhaps the sheer redundancy of porn is probably interesting too
21:21 🔗 xmc yeah
21:21 🔗 xmc n e wai
21:22 🔗 xmc we don't go after things that are purely porn
21:22 🔗 xmc but tumblr is not that
21:22 🔗 xmc that's the point i'm making
21:22 🔗 FalconK aah.
21:23 🔗 nightpool well I think there is some subtlety there, in that there is a TON of bot accounts who just post reams and reams of porn every day that we should at least think about
21:24 🔗 xmc yes
21:24 🔗 nightpool which is more dangerous than spam in general because these are high-res images
21:24 🔗 xmc however they probably have rebagels so they'll be hard to avoid
21:24 🔗 FalconK in what way is that dangerous?
21:24 🔗 xmc esp if we split the images from the html
21:24 🔗 tomwsmf has quit IRC (Ping timeout: 258 seconds)
21:24 🔗 xmc (image/html splitting for not grabbing 1000 copies of everything)
21:25 🔗 nightpool yeah no definitely
21:25 🔗 nightpool hmm
21:25 🔗 xmc just bandwidth dangerous
21:25 🔗 FalconK oh. yeah.
21:25 🔗 nightpool if we start from a seed of known-good users and crawl outwards we might not even catch a lot of them
21:25 🔗 xmc bandwidth and disk space
21:25 🔗 tomwsmf has joined #archiveteam-bs
21:25 🔗 FalconK dedup is really expensive and it's hard to do except post hoc
21:25 🔗 nightpool especially if we crawl along the following graph
21:26 🔗 xmc so i'm talking re:dedup
21:26 🔗 xmc crawl users along the notes graph
21:26 🔗 FalconK but for tumblr fortunately it tends to not give out different images at the same image URL
21:26 🔗 xmc fetch+warc the api, fetch+warc the urls
21:26 🔗 xmc then report back to the tracker the image data urls
21:26 🔗 xmc for another warrior to grab
21:27 🔗 xmc you do the dedup in the tracker before fetching
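The "dedup in the tracker before fetching" step can be sketched as a queue that hands out each reported image URL only once. This is an illustration of the idea, not the actual tracker code; `ImageQueue` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Set


class ImageQueue:
    """Hypothetical tracker-side queue that hands out each image URL once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._pending: Deque[str] = deque()

    def report(self, urls: Iterable[str]) -> int:
        """Record URLs found by a blog crawler; return how many were new."""
        added = 0
        for url in urls:
            if url not in self._seen:
                self._seen.add(url)
                self._pending.append(url)
                added += 1
        return added

    def claim(self) -> Optional[str]:
        """Give the next unfetched URL to an image warrior, or None if empty."""
        return self._pending.popleft() if self._pending else None
```

Because the duplicate check happens when URLs are reported, an image reblogged a thousand times costs one set lookup per report and is only fetched once.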
21:27 🔗 FalconK a solid plan!
21:28 🔗 xmc :)
21:28 🔗 yipdw so, one use case: I'd like to be able to identify a tumblr username and fetch a copy of that tumblr, HTML and all
21:28 🔗 yipdw would this require a separate tool?
21:28 🔗 yipdw not a huge deal (tabblo etc had one), just wondering
21:29 🔗 nightpool yipdw: there's tumblr-utils
21:29 🔗 yipdw oh, I meant from this grab
21:29 🔗 yipdw it'll go into IA as a set of WARCs, I'm just pondering how to get the data back
21:29 🔗 xmc the first warrior i proposed would yield one item per blog
21:29 🔗 xmc but no pictures
21:29 🔗 xmc anyway -> meeting
21:29 🔗 yipdw oh ok, that works
21:29 🔗 yipdw reverse-lookup on the image cdxes gets you the image data
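The reverse lookup yipdw mentions could look roughly like this. It assumes the CDX files start with a header line such as ` CDX N b a m s k r M S V g` naming the columns, where `a` is the original URL, `g` the file name, `V` the compressed offset and `S` the compressed record size. `build_lookup` and `Capture` are made-up names for this sketch.

```python
from typing import Dict, Iterable, List, NamedTuple


class Capture(NamedTuple):
    warc: str    # file name (CDX field "g")
    offset: int  # compressed offset in that file (field "V")
    length: int  # compressed record size (field "S")


def build_lookup(cdx_lines: Iterable[str]) -> Dict[str, List[Capture]]:
    """Map each original URL (field "a") to the places it is stored."""
    lookup: Dict[str, List[Capture]] = {}
    fields: List[str] = []
    for line in cdx_lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "CDX":
            fields = parts[1:]  # the header names the columns that follow
            continue
        if len(parts) != len(fields):
            continue  # no header seen yet, or a malformed row
        row = dict(zip(fields, parts))
        # Rows without the needed columns or with "-" placeholders are skipped.
        if not all(key in row for key in ("a", "g", "V", "S")):
            continue
        if not (row["V"].isdigit() and row["S"].isdigit()):
            continue
        capture = Capture(row["g"], int(row["V"]), int(row["S"]))
        lookup.setdefault(row["a"], []).append(capture)
    return lookup
```

With the file name, offset and length in hand, a single image record can be read out of the right WARC without scanning the whole grab.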
21:30 🔗 nightpool meeting?
21:43 🔗 xfce has joined #archiveteam-bs
22:01 🔗 JW_work please loop me in on the tumblr efforts — I've been doing intermittent saving via #archivebot
22:12 🔗 xmc nightpool: i have a day job
22:22 🔗 winr4r so, would anyone know how to save this livestream? http://bdcdrift.com/
22:26 🔗 robink has joined #archiveteam-bs
22:37 🔗 nightpool xmc: that makes a lot more sense I was very confused sorry haha
22:37 🔗 * xmc nod
22:37 🔗 nightpool im really not sure how I didn't make that connection cause I literally left for a meeting immediately after I asked that question
22:37 🔗 nightpool so
22:51 🔗 Frogging yahoo is getting ready to murder again I see
22:51 🔗 Frogging (or I assume, from all this talk in here)
22:51 🔗 DoomTay D:
22:53 🔗 BlueMaxim has joined #archiveteam-bs
22:58 🔗 xmc we can only assume
23:14 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
23:25 🔗 RichardG has quit IRC (Read error: Operation timed out)
23:25 🔗 RichardG has joined #archiveteam-bs
23:30 🔗 arkiver2 has quit IRC (Quit: AndroIRC - Android IRC Client ( http://www.androirc.com ))
23:32 🔗 Baljem has joined #archiveteam-bs
23:33 🔗 joepie91_ has joined #archiveteam-bs
23:33 🔗 rduser` has joined #archiveteam-bs
23:33 🔗 joepie91 has quit IRC (Read error: Operation timed out)
23:34 🔗 yeoldeto1 has joined #archiveteam-bs
23:34 🔗 rduser has quit IRC (Read error: Operation timed out)
23:34 🔗 rduser` is now known as rduser
23:34 🔗 jk[SVP] has quit IRC (ircd.choopa.net irc.mzima.net)
23:34 🔗 midas has quit IRC (ircd.choopa.net irc.mzima.net)
23:34 🔗 yeoldetoa has quit IRC (ircd.choopa.net irc.mzima.net)
23:34 🔗 Baljem_ has quit IRC (ircd.choopa.net irc.mzima.net)
23:34 🔗 goekesmi has quit IRC (ircd.choopa.net irc.mzima.net)
23:34 🔗 Igloo has quit IRC (ircd.choopa.net irc.mzima.net)
23:37 🔗 jk[SVP] has joined #archiveteam-bs
23:37 🔗 nightpool Frogging: not ... imminently?
23:37 🔗 nightpool but I would be surprised if tumblr lives to see 2017
23:37 🔗 Frogging considering its size that would be pretty imminent :p
23:38 🔗 nightpool Maybe give it a full year
23:38 🔗 nightpool but yeah
23:42 🔗 goekesmi_ has joined #archiveteam-bs
23:44 🔗 Igloo_ has joined #archiveteam-bs
