#archiveteam 2012-01-24,Tue

Time Nickname Message
00:01 🔗 yipdw hmm
00:02 🔗 yipdw come to think of it, I may not have anything that would otherwise be blocked by robots exclusion lists
00:02 🔗 yipdw because my main index source was Google News, and Google obeys those
00:02 🔗 yipdw well, still worth a once-over
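
The once-over yipdw mentions could be done offline: take each site's robots.txt text and test the collected URLs against it. A minimal sketch using Python's standard `urllib.robotparser`; the rules, URLs, and the `blocked_urls` helper are hypothetical illustrations, not anything from yipdw's actual crawl.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical rules and URLs, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

EXAMPLE_URLS = [
    "http://example.com/news/story.html",
    "http://example.com/private/draft.html",
]

if __name__ == "__main__":
    for url in blocked_urls(EXAMPLE_ROBOTS, EXAMPLE_URLS):
        print("excluded by robots.txt:", url)
```

An empty result would support the guess that an index built from Google News contains nothing robots.txt would have excluded.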
00:17 🔗 bsmith093 is there anything that losslessly converts between cbr and cbz to pdf files
00:18 🔗 bsmith093 i can do it with gscan to pdf, but they always come out fuzzy, like they've been slightly oversized
00:22 🔗 yipdw if Wikipedia is correct and CBR/CBZs just wrap a collection of PNGs or JPEGs, then you don't need to PDF them
00:22 🔗 yipdw however, if you want to, what you need to do is ensure that the DPI of the PDF is the same as the DPI of the PNG or JPEG
00:24 🔗 bsmith093 i know i dont have to, but my windows friends like pdfs rather than cbr/z, and i want to share some hilarious webcomics.
00:32 🔗 yipdw holy hell, I started a gunzip job on an EBS volume attached to an EC2 micro six hours ago and it's still not done
00:32 🔗 yipdw what does Amazon run those things on
00:32 🔗 yipdw hamsters?
00:35 🔗 DFJustin kindles
00:36 🔗 yipdw on fire
00:42 🔗 bsmith093 is dpi something i have to set, because i cant find it in the image files
00:43 🔗 bsmith093 they're gifs, if it matters
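
On the DPI question: as far as I know GIF carries no resolution field, so the DPI is a value you pick. What matters for sharpness is that the PDF page is sized so the image lands at exactly that DPI and the converter embeds the pixels without resampling. A small sketch of the arithmetic, assuming the default PDF unit of 1/72 inch; `page_size_points` is a hypothetical helper, not part of any converter.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def page_size_points(width_px, height_px, dpi):
    """Page size in PDF points that shows the image at exactly `dpi`."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    width_pt = width_px / dpi * POINTS_PER_INCH
    height_pt = height_px / dpi * POINTS_PER_INCH
    return (width_pt, height_pt)


if __name__ == "__main__":
    # An 800x1200 pixel page at an assumed 100 DPI is 8 x 12 inches.
    print(page_size_points(800, 1200, 100))
```

If the tool is told one DPI but lays the image out on a page sized for another, it has to scale the pixels, which is one plausible source of the fuzziness described above.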
01:11 🔗 LordNlptp hamsters with tails on fire
01:12 🔗 LordNlptp in little wheels which runs a big mechanical computer
02:47 🔗 SketchCow Where's the MobileMe channel?
03:32 🔗 Paradoks #memac? I don't think we ever made one.
04:10 🔗 Coderjoe www.youtube.com/watch?v=7ezeYJUz-84
08:09 🔗 ersi http://yro.slashdot.org/story/12/01/23/1725231/carl-malamud-answers-goading-the-government-to-make-public-data-public
09:25 🔗 yipdw http://www.ninjawedding.org/sopa/stories.html
09:25 🔗 yipdw and that'll probably be the last I write about SOPA archiving unless someone else demands more
09:31 🔗 chronomex you sure it works? http://wayback.at.ninjawedding.org/*/http://www.bigshinyrobot.com/reviews/archives/35921
09:31 🔗 yipdw some probably aren't quite working
09:31 🔗 yipdw I didn't review the full import log
09:31 🔗 chronomex k
09:31 🔗 yipdw most I've tried do resolve, though
09:32 🔗 yipdw e.g. http://wayback.at.ninjawedding.org/20120118202651/http://blogs.telegraph.co.uk/technology/alexisdormandy/100007102/jimmy-wales-is-showing-a-lack-of-imagination-over-the-wikipedia-shutdown/
09:32 🔗 yipdw chronomex: if you find more, let me know what they are and I can see if wayback printed log messages pertinent to those WARCs
09:33 🔗 chronomex is it relatively straightforward to get a wayback machine running? I might do my own deployment...
09:33 🔗 yipdw well, the basics aren't bad -- it's much like any other Java webapp
09:33 🔗 chronomex euh :P
09:33 🔗 yipdw there is quite a bit you can do with Hadoop etc though to speed up indexing that I haven't yet enabled
09:34 🔗 ersi Heritrix has hadoop support?
09:34 🔗 yipdw Wayback does
09:34 🔗 yipdw Heritrix, I don't know
09:34 🔗 yipdw (maybe)
09:34 🔗 yipdw and if it doesn't you could write a Hadoop job to wrap it
09:34 🔗 ersi Hmm.
09:34 🔗 ersi Heritrix is just the grabber/crawler?
09:34 🔗 yipdw yeah
09:35 🔗 yipdw Wayback's the only thing I've found (so far) that will actually let me poke around in WARCs
09:35 🔗 yipdw I've been throwing around the idea of a desktop application into which you load a bunch of WARCs
09:35 🔗 yipdw it inflates each record and rewrites URLs for local viewing
09:35 🔗 yipdw but that's a way down the road :P
09:35 🔗 yipdw I think it'd be useful, though
09:36 🔗 ersi sounds awesome (as well as somewhat painful)
09:36 🔗 yipdw heh
09:36 🔗 yipdw I dunno
09:36 🔗 yipdw WARC handling isn't too bad if the WARC is compressed per-record
09:36 🔗 yipdw URL rewriting...yeah, that's bitchy
09:37 🔗 yipdw actually, impossible
09:37 🔗 yipdw unless you catch outbound network requests or some crazy shit you're never going to catch them all
09:37 🔗 yipdw but DTSTTCPW etc
09:38 🔗 yipdw actually, on that note, catching outbound network requests is probably imperative
09:39 🔗 yipdw if you're viewing archived data, you are working with data from God-knows-where, and as such you do need to sandbox it
09:39 🔗 yipdw maybe Webkit has a way to intercept such things
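
The URL-rewriting idea and its limits can be sketched briefly. The example below rewrites absolute `href` and `src` values into the timestamp-prefixed shape the Wayback links above use, and it shows why this can never catch everything: a URL assembled by script at runtime never appears literally in the markup. This is an assumption-laden illustration, not how Wayback itself rewrites pages; `rewrite_links` and the local prefix are hypothetical.

```python
import re

# Matches href="..." or src="..." holding an absolute http(s) URL.
# Regex-based, so it only sees URLs written literally in the markup.
_ATTR_URL = re.compile(
    r'''(\b(?:href|src)\s*=\s*)(["'])(https?://[^"']+)\2''',
    re.IGNORECASE,
)


def rewrite_links(html, local_prefix, timestamp):
    """Point absolute href/src URLs at a local archive viewer."""
    def repl(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        return f"{attr}{quote}{local_prefix}/{timestamp}/{url}{quote}"
    return _ATTR_URL.sub(repl, html)


if __name__ == "__main__":
    page = (
        '<a href="http://example.com/a">x</a>'
        '<script>var u = "http://" + host + "/b";</script>'
    )
    # The anchor is rewritten; the script-built URL is left untouched.
    print(rewrite_links(page, "http://localhost:8080", "20120118202651"))
```

Anything the rewriter misses becomes a live outbound request, which is why intercepting network access at the viewer level matters more than getting the rewriting perfect.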
09:40 🔗 yipdw crash time!
09:42 🔗 yipdw oh, also, wayback.at.ninjawedding.org is an EC2 micro instance, so don't be surprised if it chokes every now and then
13:42 🔗 emijrp http://yro.slashdot.org/story/12/01/23/1725231/carl-malamud-answers-goading-the-government-to-make-public-data-public
16:14 🔗 Coderjoe the WARCs written by wget are supposed to be compressed per-record.
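
Per-record compression, as I understand it, means each record is its own gzip member and the file is those members concatenated, so a reader can inflate one record at a time instead of the whole file. A minimal sketch of walking the members with the standard `zlib` module; it only splits gzip members and does not parse WARC headers, and `iter_gzip_members` is a hypothetical helper.

```python
import gzip
import zlib


def iter_gzip_members(data):
    """Yield the decompressed bytes of each gzip member in `data`, in order."""
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        inflater = zlib.decompressobj(zlib.MAX_WBITS | 16)
        member = inflater.decompress(data) + inflater.flush()
        yield member
        # Bytes after the end of this member belong to the next one.
        data = inflater.unused_data


if __name__ == "__main__":
    # Two stand-in "records", each compressed separately and concatenated.
    blob = gzip.compress(b"record one") + gzip.compress(b"record two")
    for record in iter_gzip_members(blob):
        print(record)
```

A file compressed as one single gzip stream would come back as one large member here, which is the case that makes WARC handling more painful.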
17:11 🔗 Coderjoe ...
17:11 🔗 Coderjoe http://www.youtube.com/watch?v=Uae58589aec
17:11 🔗 Coderjoe youtube has an "original" quality setting there
17:25 🔗 nitro2k01 Oh nice! For the times when I want to watch a video at a higher resolution than my display's, and have the video lag!
17:28 🔗 DFJustin will be handy for youtube-dl and future-proofing
17:52 🔗 yipdw jhttps://wwws.whitehouse.gov/petitions#!/petition/investigate-chris-dodd-and-mpaa-bribery-after-he-publicly-admited-bribing-politicans-pass/DffX0YQv
17:52 🔗 yipdw er
17:52 🔗 yipdw https://wwws.whitehouse.gov/petitions#!/petition/investigate-chris-dodd-and-mpaa-bribery-after-he-publicly-admited-bribing-politicans-pass/DffX0YQv
17:52 🔗 yipdw who wants to bet on Noncommittal Half-Response
18:02 🔗 Coderjoe "first name, last initial and city and state will be publicly displayed on the petition page."
18:02 🔗 SketchCow Hey, so I finished the Sound of Young America upload last night!
18:02 🔗 SketchCow 611 audio and video
18:03 🔗 Coderjoe which is fine if you have a common enough name that there are multiple in your city...
18:03 🔗 mach even if there were multiple people, how hard would it be to identify the sort of person who would sign the petition?
18:10 🔗 yipdw Coderjoe: meh, I don't care
18:10 🔗 yipdw the information displayed on that page is enough to identify me
18:10 🔗 yipdw for that matter, so is the information in a WHOIS query or a credit report gone astray
18:54 🔗 yipdw Rust looks like a neat language
18:55 🔗 yipdw its standard library strikes me as a bit weird
18:55 🔗 yipdw http://doc.rust-lang.org/doc/std/files/four-rs.html
18:55 🔗 yipdw what is that doing in there?
18:57 🔗 Coderjoe darn. nothing implemented yet on http://rosettacode.org/wiki/Rust
18:58 🔗 yipdw it feels a bit like Go
18:59 🔗 yipdw I don't like its typing system, though, because (from what I've read so far) it seems inconsistent
18:59 🔗 yipdw you can specify interfaces -- collections of methods -- but you can't say "this function accepts anything that responds to m"
18:59 🔗 Coderjoe hmm. that page was just added about 50 minutes ago
19:00 🔗 yipdw it looks like you have to say "this function accepts things that implement Somethingable"
19:00 🔗 yipdw and that sucks
19:01 🔗 yipdw hm, yeah, type-by-name is encoded into the language itself, too, in e.g. the requirement that the conditional on an if must receive a value of "type boolean"
19:01 🔗 yipdw oh well
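
The distinction yipdw is drawing is between structural typing ("anything that responds to m") and nominal typing ("things that implement Somethingable"). The sketch below illustrates that distinction in Python rather than Rust, and makes no claim about how Rust actually behaves; all class names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Protocol, runtime_checkable


@runtime_checkable
class RespondsToM(Protocol):
    """Structural: anything with a method named m qualifies."""

    def m(self) -> str: ...


class Somethingable(ABC):
    """Nominal: a class qualifies only if it explicitly inherits from this."""

    @abstractmethod
    def m(self) -> str: ...


class Declared(Somethingable):
    def m(self) -> str:
        return "declared"


class Undeclared:
    def m(self) -> str:
        return "undeclared"


if __name__ == "__main__":
    for obj in (Declared(), Undeclared()):
        print(
            type(obj).__name__,
            "structural:", isinstance(obj, RespondsToM),
            "nominal:", isinstance(obj, Somethingable),
        )
```

`Undeclared` has the right method but never names the interface, so it passes the structural check and fails the nominal one.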
20:08 🔗 SketchCow Guys, where can one get wget-warc binaries?
20:15 🔗 yipdw I build it using get-wget-warc.sh
20:16 🔗 yipdw from splinder-grab or one of the other grabber codebases that generate WARCs
20:58 🔗 SketchCow Greetings,
20:58 🔗 SketchCow I just wanted to give you a heads up that you might want to take one last pass at archiving any www-personal.umich.edu websites, they will be deleted this year.
20:58 🔗 SketchCow The University of Michigan recently announced that they are contracting out email and web services to Google, and all accounts will be transitioned later this year. As part of this all the old websites will be deleted by August. They aren't providing any support for moving the websites and won't support redirection after August, so i suspect most of the web pages will simply vanish. They also have not made any of this clear in their public anno
20:58 🔗 SketchCow I've moved my own web site and have redirection set up (at least until August) but you might want to get a copy of the rest of them while they still exist.
20:58 🔗 SketchCow Thanks!
20:58 🔗 SketchCow ...
20:58 🔗 SketchCow They aren't providing any support for moving the websites and won't support redirection after August, so i suspect most of the web pages will simply vanish. They also have not made any of this clear in their public announcements, so I suspect many of the web site authors will be caught by surprise
20:58 🔗 SketchCow I only found out by having several conversations with the IT folks. I get the feeling this aspect of the transition was overlooked.
20:58 🔗 balrog :/
20:59 🔗 SketchCow Let's do it.
20:59 🔗 Nemo_bis omg crazy
20:59 🔗 SketchCow #uwish
21:00 🔗 nitro2k01 This might possibly be possible to do from the inside with that guy's help
21:00 🔗 nitro2k01 If this is a standard UNIX system, it's likely that public_html is world-readable
21:01 🔗 nitro2k01 So you could read /home/xxx/public_html/*
21:01 🔗 balrog nitro2k01: if there are cgi scripts those may not be world readable
21:01 🔗 nitro2k01 Even though you can't read /home/xxx/* in general
21:01 🔗 nitro2k01 Right
21:01 🔗 nitro2k01 Still worth a shot if he wants to give it a go
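
Whether the inside approach would work depends on the permission bits, and balrog's caveat about CGI scripts is easy to check for. A minimal sketch, assuming a conventional UNIX layout, that walks a directory and lists files lacking the world-read bit; it looks only at mode bits (not ACLs or AFS permissions), and the function names are hypothetical.

```python
import os
import stat


def world_readable(path):
    """True if the 'other' read bit is set on path (mode bits only)."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def unreadable_entries(root):
    """Walk root and list the files whose mode lacks the world-read bit."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not world_readable(full):
                missing.append(full)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical usage: report what a plain local copy would miss.
    for path in unreadable_entries("."):
        print("not world-readable:", path)
```

Files this reports would have to be fetched another way, or skipped, before a tar of the tree could be called complete.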
21:03 🔗 emijrp Archive Team goes to University.
21:04 🔗 DoubleJ Any CMU or MIT kids in here? Unless things have changed, you should still be able to hit umich.edu over AFS/Athena.
21:04 🔗 DoubleJ Assuming of course, that umich won't mind you hammering their servers with a shell script...
21:04 🔗 DoubleJ And of course, WARCs will still be needed.
21:05 🔗 nitro2k01 Doesn't even need to be a shell script
21:05 🔗 nitro2k01 tar + wildcards ftw
21:05 🔗 balrog DoubleJ: you could throttle it :p
21:07 🔗 emijrp What do a bunch of archivists inside an University full of chicks?
21:07 🔗 ersi you accidentally words
21:09 🔗 nitro2k01 "an ooniversity"
22:55 🔗 bbot_ SketchCow: https://secure.flickr.com/photos/textfiles/6716867195/in/photostream/
22:55 🔗 bbot_ What's the jointed-arm-thing under the dust cover?
22:57 🔗 nitro2k01 Could be a microscope of some kind
23:30 🔗 SketchCow Magnifying glass.
23:56 🔗 BlueMax Microscope?
23:57 🔗 DFJustin http://news.thomasnet.com/fullstory/Bench-Magnifier-offers-accurate-view-across-surface-area-484982
