#archiveteam 2012-07-04,Wed


Time Nickname Message
00:08 ๐Ÿ”— SketchCow Fan Fiction archiving now happening - 830gb of data.
00:15 ๐Ÿ”— nitro2k01 Holy crap. A million monkeys on a million typewriters for a million years...
00:15 ๐Ÿ”— nitro2k01 (would create more data than that, assuming the poor things could figure out how to type)
00:16 ๐Ÿ”— chronomex correct
00:16 ๐Ÿ”— chronomex 150T
00:16 ๐Ÿ”— chronomex er, that's one monkey at 50ish wpm for a million years
00:17 ๐Ÿ”— chronomex whatever
00:18 ๐Ÿ”— SketchCow Fuck those little bastards
00:18 ๐Ÿ”— nitro2k01 My subtle hint was that the majority of the archived fanfiction would be written by the human equivalent of monkeys
00:27 ๐Ÿ”— Coderjoe I've run across fiction where the premise seemed interesting, but I couldn't finish it because I had to mentally rewrite every sentence in order to really make sense of it.
00:27 ๐Ÿ”— Coderjoe oh, and the number of times I've seen mixups beteen clothes/cloths and breathe/breath
00:27 ๐Ÿ”— Coderjoe argh. nothing like a grammar/spelling nitpicker messing up in their own complaint. :D
00:35 ๐Ÿ”— arrith1 nice, ffnet needs a good archiving
00:46 ๐Ÿ”— underscor I have a really hard time reading fanfiction with typos
00:53 ๐Ÿ”— SketchCow I'm just savin' it, I ain't judgin' it.
00:53 ๐Ÿ”— SketchCow I've got the two partitions down to 83% and 49% so crisis over
00:54 ๐Ÿ”— underscor (this is the part where I find another TB of data just laying around somewhere)
00:54 ๐Ÿ”— godane i remember some really good firefly fanfiction on there
00:55 ๐Ÿ”— SketchCow Definitely one of those times I wish I could assign a boring task to an underling.
00:55 ๐Ÿ”— SketchCow http://archive.org/details/hackercons-notacon-2007
00:55 ๐Ÿ”— SketchCow Hundreds of hacker con speeches. Just have to type in names of the presenters, and the talks.
00:59 ๐Ÿ”— godane SketchCow: Famicoman beat you to this: http://archive.org/details/notacon4video
01:00 ๐Ÿ”— SketchCow Yeah, Famicoman made a non-described, blown-up pile of derived video
01:01 ๐Ÿ”— SketchCow Compare the information you get there to, say, http://archive.org/details/hackercons-notacon-2007-brickipedia
01:02 ๐Ÿ”— godane of course he used ftp to upload them
01:03 ๐Ÿ”— godane i see what you mean
01:03 ๐Ÿ”— godane if i had faster upload i may have put it all in one item with lots of .txt files for descs
01:03 ๐Ÿ”— Coderjoe ugh
01:03 ๐Ÿ”— godane i did that with mostly twit podcasts
01:03 ๐Ÿ”— Coderjoe all in one item
01:04 ๐Ÿ”— godane stuff like diggnation shouldn't have been done that way
01:05 ๐Ÿ”— godane my rule is to keep items under 5-6gb
01:05 ๐Ÿ”— SketchCow Yeah
01:05 ๐Ÿ”— SketchCow See, I wouldn't do it that way at all.
01:05 ๐Ÿ”— SketchCow Anyway, I'm doing it my way by re-doing them, as you can see.
01:06 ๐Ÿ”— godane ok
01:06 ๐Ÿ”— godane then you remove the anarchivism ones
01:06 ๐Ÿ”— SketchCow I'm not really famous for letting others' half-done jobs dictate my not doing it.
01:06 ๐Ÿ”— SketchCow No, they're adorable.
01:06 ๐Ÿ”— SketchCow I'm actually not allowed to.
01:06 ๐Ÿ”— godane ok
01:06 ๐Ÿ”— Coderjoe I kinda like one item per video/episode/talk/whatever. though I see the PDA vids were tossed up in two items
01:06 ๐Ÿ”— SketchCow Yes, which I had nothing to do with.
01:06 ๐Ÿ”— SketchCow I do one episode an item
01:06 ๐Ÿ”— Coderjoe i know
01:07 ๐Ÿ”— SketchCow http://archive.org/details/securityjustice
01:07 ๐Ÿ”— SketchCow See? One episode an item
01:07 ๐Ÿ”— SketchCow When I have a chance, I'll go back and inject their descriptions in.
01:08 ๐Ÿ”— godane i may have something to upload
01:08 ๐Ÿ”— godane firefly fanfiction audio drama
01:08 ๐Ÿ”— SketchCow http://archive.org/details/securityjustice-25
01:08 ๐Ÿ”— SketchCow Then they'll look like that.
01:10 ๐Ÿ”— godane with data de-duplication i don't think it matters how much it's uploaded
01:10 ๐Ÿ”— godane or changed to be neat
01:10 ๐Ÿ”— Coderjoe ia doesn't do dedup
01:10 ๐Ÿ”— godane it doesn't
01:10 ๐Ÿ”— godane but i thought it did
01:10 ๐Ÿ”— SketchCow It does not.
01:11 ๐Ÿ”— godane now i see storage is going to be a problem
01:11 ๐Ÿ”— SketchCow he's so adorable? can we keep him?
01:12 ๐Ÿ”— godane i just don't like 240gb of diggnation being on there like 20 times or something
01:13 ๐Ÿ”— godane of course dedup there may be very hard since we are talking 1000s of hard drives
01:14 ๐Ÿ”— SketchCow So adorable
01:16 ๐Ÿ”— godane i may have to do a panic download of the signal now
01:18 ๐Ÿ”— godane lots of audio podcasts: http://signal.serenityfirefly.com/mmx/series/
01:19 ๐Ÿ”— DFJustin shiny
01:21 ๐Ÿ”— godane i figure that i need to call on you guys to backup those podcasts
01:21 ๐Ÿ”— godane too much for my hard drives right now
01:21 ๐Ÿ”— DFJustin are they going anywhere soon
01:21 ๐Ÿ”— godane don't know
01:22 ๐Ÿ”— SketchCow Especially if you continue this terrible habit of shoving dozens of discrete episodes and broadcasts into one big gloppy item
01:22 ๐Ÿ”— godane but it's been 8 seasons so far
01:25 ๐Ÿ”— Coderjoe some of it is not godane
01:26 ๐Ÿ”— SketchCow The Library of Congress, the Preserving Virtual Worlds Project, and a bunch of others have jumped into my project.
01:39 ๐Ÿ”— arrith1 SketchCow: is http://www.archiveteam.org/index.php?title=Just_Solve_the_Problem_2012 / JSP2012 getting its own name, site and wiki?
01:39 ๐Ÿ”— SketchCow yes
01:39 ๐Ÿ”— arrith1 SketchCow: why is there a focus one just one month?
01:39 ๐Ÿ”— SketchCow this is just a prelim scratchpad
01:39 ๐Ÿ”— SketchCow Ask that second question in english
01:40 ๐Ÿ”— godane looks like archive.org hates my firefly parody i have uploaded
01:41 ๐Ÿ”— arrith1 SketchCow: why is it "30 days dedicated to solving a problem" which might mean actually solving the problem within that time, instead of making an organization within that time
01:43 ๐Ÿ”— SketchCow You know what the world doesn't need?
01:43 ๐Ÿ”— SketchCow Another organization
01:43 ๐Ÿ”— arrith1 AT is kind of an org, but it works
01:44 ๐Ÿ”— SketchCow You say that
01:44 ๐Ÿ”— SketchCow But every time we have a vote, someone dies
01:44 ๐Ÿ”— SketchCow A child, usually
01:44 ๐Ÿ”— SketchCow Usually
02:00 ๐Ÿ”— Famicoman With you all as my witness, I am changing my ways
02:01 ๐Ÿ”— solo i regret to report that all videos removed by youtube users prior to july 1st 2012 have been irrevocably deleted
02:04 ๐Ÿ”— solo strangely, videos taken offline for copyright infringement have been preserved
02:08 ๐Ÿ”— shaqfu I don't have time to read all of textfiles.com right now
02:08 ๐Ÿ”— shaqfu But a friend, when I mentioned JSTP to him, said that there used to be a floating list of file formats on BBSes
02:08 ๐Ÿ”— shaqfu Is this still extant?
02:17 ๐Ÿ”— SketchCow Yes
02:18 ๐Ÿ”— SketchCow All this exists.
02:18 ๐Ÿ”— shaqfu Awesome
03:50 ๐Ÿ”— S[h]O[r]T am i the only one who doesn't understand this just solve the problem project
03:51 ๐Ÿ”— S[h]O[r]T what i seem to understand from it: trying to gather people to figure out what file formats a bunch of random crap is in? or make something useful out of all the stuff that has been archived, in some displayable format?
03:58 ๐Ÿ”— balrog S[h]O[r]T: the goal is to document as many formats as possible
03:59 ๐Ÿ”— balrog Figure out how to decode them and such
03:59 ๐Ÿ”— balrog How the actual data is stored in these files
04:01 ๐Ÿ”— arrith1 i think it also extends to physical media. maybe like "how to solder your own kit to get data off a disc_x type disc"
04:03 ๐Ÿ”— solo or how to dump the firmware of your television
04:03 ๐Ÿ”— balrog arrith1: That's something I'm trying to work on with the discferret project
04:04 ๐Ÿ”— balrog solo: That's ... annoying because the hardware needed to dump a lot of firmware is expensive and requires nasty proprietary software
04:05 ๐Ÿ”— arrith1 balrog: wow discferret is wild
04:05 ๐Ÿ”— arrith1 "The source code and CAD files for the DiscFerret design are completely open-sourced: the hardware and software are released under the GNU GPL (in the case of the board, microcode, and firmware) or the Apache Public Licence (in the case of the DiscFerret Hardware Access Library)"
04:06 ๐Ÿ”— solo did anyone archive revver or livevideo?
04:06 ๐Ÿ”— balrog arrith1: We're looking for help software side
04:07 ๐Ÿ”— balrog If anyone's good at software architecture and willing to help, and has time, stop by the IRC
04:07 ๐Ÿ”— balrog :)
04:08 ๐Ÿ”— arrith1 discferret plus a big archive of fileformat info could make quite the killer ArchiveTeam member disaster kit
04:10 ๐Ÿ”— balrog The software we're starting work on is intended to handle other data sources too :)
04:11 ๐Ÿ”— balrog Unfortunately we're just starting out and I'm not all that good at designing it yet
04:15 ๐Ÿ”— arrith1 balrog: the sofware, hardware, or both?
04:15 ๐Ÿ”— balrog Software.
04:15 ๐Ÿ”— arrith1 ah
04:16 ๐Ÿ”— balrog Hardware is pretty solid, if a bit slow. That's going to be fixed with a hardware revision, but if you're interested rev-1 is available now.
04:16 ๐Ÿ”— balrog Another thing we're working on fixing is a somewhat high price
04:16 ๐Ÿ”— balrog (thanks to a lot of components, a slightly overdesigned power supply, and hand assembly)
04:27 ๐Ÿ”— arrith1 balrog: yeah high prices would be good to fix. i'm totally hw ignorant but maybe there's some way to use more commodity components? there are lots of arduinos and raspberry pi competitors
04:28 ๐Ÿ”— balrog Well you have to record a stream of data at a high rate. Current design is based on an FPGA and a microcontroller and that's how it will be. But the current power stuff is somewhat overkill
04:28 ๐Ÿ”— balrog Most drives don't need 2A output :)
04:29 ๐Ÿ”— balrog I have to see if it's feasible to power a drive externally and how much that would reduce the cost
04:29 ๐Ÿ”— balrog It's nice though to have a single unit that can power both itself and the drive
04:32 ๐Ÿ”— arrith1 ah yeah, almost like an external hdd case all wrapped up
04:32 ๐Ÿ”— arrith1 dang fpgas are always expensive
04:38 ๐Ÿ”— balrog The fpga isn't the worst
04:39 ๐Ÿ”— balrog It's $12 or so
04:39 ๐Ÿ”— balrog The USB 2.0 microcontroller will be about $6-$7
04:39 ๐Ÿ”— balrog You get nickel and dimed to death on the smaller parts.
05:03 ๐Ÿ”— Coderjoe balrog: and the memory?
05:03 ๐Ÿ”— balrog We found a somewhat cheaper source.
05:04 ๐Ÿ”— balrog We're thinking of doing an sdram based design. Would mean much more memory at a lower price, at the cost of more microcode complexity
05:04 ๐Ÿ”— balrog (need an sdram controller)
05:06 ๐Ÿ”— arrith1 balrog: oh that's pretty good
05:09 ๐Ÿ”— balrog The current price is around $250 for a fully assembled board which I feel is a bit much
05:09 ๐Ÿ”— balrog I'd like to get it to $150 if not lower, hopefully toward $100
05:10 ๐Ÿ”— DFJustin kryoflux recommends you power the drive separately and they sell an adapter for that purpose
05:11 ๐Ÿ”— DFJustin I used a gutted 3.5" HDD enclosure for power
05:13 ๐Ÿ”— balrog DFJustin: The discferret power components can power 2-3 3.5" drives easily as it is now
05:13 ๐Ÿ”— balrog Which I feel is overkill
05:13 ๐Ÿ”— balrog It's overdesigned. It's extremely robust but I don't think that's necessary.
05:14 ๐Ÿ”— balrog Half the capacity would still power even a 5.25" drive
05:15 ๐Ÿ”— balrog The kryoflux just pulls power off the 5V USB
05:16 ๐Ÿ”— joepie91 this seems relevant here:
05:16 ๐Ÿ”— joepie91 Google Video stopped taking uploads in May 2009. Later this summer we'll be moving the remaining hosted content to YouTube. Google Video users have until August 20 to migrate, delete or download their content. We'll then move all remaining Google Video content to YouTube as private videos that users can access in the YouTube video manager. For more details, please see our post on the YouTube blog.
05:17 ๐Ÿ”— balrog joepie91: Link?
05:17 ๐Ÿ”— joepie91 tl;dr google video videos will become unavailable for public viewing unless the uploader specifically makes it public
05:17 ๐Ÿ”— joepie91 http://googleblog.blogspot.nl/2012/07/spring-cleaning-in-summer.html
05:17 ๐Ÿ”— balrog I see
05:17 ๐Ÿ”— balrog UGH why
05:18 ๐Ÿ”— joepie91 no idea :/
05:18 ๐Ÿ”— balrog If they were public before they should stay as such
05:18 ๐Ÿ”— joepie91 but that seems like a LOT of potential for huge data loss
05:18 ๐Ÿ”— balrog (I think)
05:18 ๐Ÿ”— balrog Yeah :(
05:18 ๐Ÿ”— joepie91 or rather, public data loss
05:18 ๐Ÿ”— joepie91 and I mean *huge*
05:18 ๐Ÿ”— balrog SketchCow: ^
05:19 ๐Ÿ”— balrog DFJustin: Anyway my point was that maybe we don't even need to have drive power support. Will have to check how much extra cost that adds.
05:19 ๐Ÿ”— arrith1 i linked stuff about that earlier :)
05:19 ๐Ÿ”— balrog Ah...
05:19 ๐Ÿ”— arrith1 he's going to check with archive.org people, since i guess archive.org has been working on youtube
05:19 ๐Ÿ”— arrith1 if archive.org doesn't get it all then i guess AT can spring into action
05:22 ๐Ÿ”— p4nd4 O hai
05:23 ๐Ÿ”— joepie91 ohai
05:24 ๐Ÿ”— joepie91 arrith1: alright
05:24 ๐Ÿ”— SketchCow -bs
05:25 ๐Ÿ”— SketchCow Wow, you filled 5 screens with discussion of hardware
05:25 ๐Ÿ”— arrith1 oops
05:25 ๐Ÿ”— arrith1 wait, well it's sorta related to the just solve it stuff, which is kind of #archiveteam related
05:26 ๐Ÿ”— arrith1 but yeah k -bs
05:28 ๐Ÿ”— SketchCow It's only sort of related
05:28 ๐Ÿ”— SketchCow -bs
05:29 ๐Ÿ”— arrith1 SketchCow: is archive.org doing for Google Video what they did for stage6?
05:29 ๐Ÿ”— p4nd4 Stage6 was awesome
05:33 ๐Ÿ”— SketchCow Archive.org didn't do stage6, we did
05:33 ๐Ÿ”— SketchCow One of us did.
05:33 ๐Ÿ”— Coderjoe i did
05:33 ๐Ÿ”— Coderjoe i wish I had gotten more of it
05:34 ๐Ÿ”— Coderjoe particularly more user-generated content, as opposed to all those music videos, tv shows, and movies :(
05:35 ๐Ÿ”— SketchCow I am torn on the google video
05:36 ๐Ÿ”— SketchCow I'll spend another day thinking about it.
05:45 ๐Ÿ”— joepie91 also, for those that missed it - meebo is shutting down on july 11, instructions for downloading your recorded chatlogs for your meebo account until that date are available at http://www.meebo.com/support/article/175/
05:45 ๐Ÿ”— joepie91 lots of big things shutting down lately :(
05:47 ๐Ÿ”— joepie91 on that note - SketchCow, does archiveteam keep some kind of RSS feed that provides a list of services that will be shut down soon?
05:47 ๐Ÿ”— joepie91 or similar
05:47 ๐Ÿ”— joepie91 (preferably including archival status, of course :)
05:48 ๐Ÿ”— arrith1 joepie91: there are pages for that on the wiki
05:48 ๐Ÿ”— arrith1 mainly deathwatch i think
05:48 ๐Ÿ”— arrith1 and the frontpage
05:48 ๐Ÿ”— joepie91 alright, but is there some kind of feed that can for example be automatically retrieved?
05:48 ๐Ÿ”— joepie91 I can think of some interesting things to do with that
05:49 ๐Ÿ”— arrith1 one could cobble together a script that looks for changes to specific portions of the site from the overall wiki changes rss feed
05:49 ๐Ÿ”— arrith1 i'm not aware of something that does that currently
05:49 ๐Ÿ”— joepie91 hrm.. that would be hacky, and probably break when the page layout changes :|
05:49 ๐Ÿ”— arrith1 yep
05:49 ๐Ÿ”— arrith1 wikis are tricky like that ;/
06:02 ๐Ÿ”— Nemo_bis what portions of the site? of course there are solutions
06:04 ๐Ÿ”— Nemo_bis do you just want something like this? http://archiveteam.org/index.php?title=Deathwatch&feed=atom&action=history
06:04 ๐Ÿ”— Nemo_bis otherwise there's plenty of IRC-RC based services
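A rough sketch of the cobbled-together watcher arrith1 and Nemo_bis describe, using the history feed URL above. This is only an assumption about how one might do it, and it surfaces raw page edits rather than structured closure announcements:

    # poll the Deathwatch history feed and print the newest entry titles;
    # diffing successive runs against a saved copy would surface new edits
    curl -s 'http://archiveteam.org/index.php?title=Deathwatch&feed=atom&action=history' \
      | grep -o '<title>[^<]*</title>' | head -n 5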
06:06 ๐Ÿ”— joepie91 Nemo_bis: no, that is literally just a feed of changes
06:06 ๐Ÿ”— joepie91 I mean a feed that announces new site closures
06:06 ๐Ÿ”— Nemo_bis a feed doesn't announce anything
06:06 ๐Ÿ”— joepie91 sigh
06:07 ๐Ÿ”— joepie91 ..
06:07 ๐Ÿ”— joepie91 a feed that has as its items newly announced site closures
06:07 ๐Ÿ”— Nemo_bis so this is not "portions of the wiki"
06:07 ๐Ÿ”— joepie91 no
06:07 ๐Ÿ”— Nemo_bis anyway you can construct it from the feed, or make the wiki page machine-readable
06:08 ๐Ÿ”— joepie91 I never said anything about the wiki, it was arrith1 coming up with that suggestion
06:08 ๐Ÿ”— joepie91 yes, which would break if the page layout changes
06:09 ๐Ÿ”— arrith1 i refer to the wiki since it's basically the only place info is, besides say live irc channels
06:12 ๐Ÿ”— Coderjoe or the AT twitter account
06:12 ๐Ÿ”— Coderjoe @archiveteam
06:13 ๐Ÿ”— Coderjoe but that's generally after details have been worked out and grunts are needed.
06:13 ๐Ÿ”— arrith1 ah yeah
06:15 ๐Ÿ”— arrith1 joepie91: what were you thinking of using the feed for?
06:18 ๐Ÿ”— SketchCow No feed
06:18 ๐Ÿ”— SketchCow Should be fixed? Yes.
06:23 ๐Ÿ”— joepie91 arrith1: I had a few ideas, actually
06:23 ๐Ÿ”— joepie91 mailing list, widget, possibly irc integration
06:24 ๐Ÿ”— joepie91 anything else I can think of
06:24 ๐Ÿ”— joepie91 just a way for people to easily keep track of services that are shutting down, one that looks less intimidating to the average user than a wiki page
06:24 ๐Ÿ”— arrith1 having a twitter account specifically for sites confirmed going down could work. then have a bot announce that in irc, etc
06:26 ๐Ÿ”— joepie91 this is actually interesting:
06:26 ๐Ÿ”— joepie91 Archived but not available
06:26 ๐Ÿ”— joepie91 Google Video
06:26 ๐Ÿ”— joepie91 http://www.archiveteam.org/index.php?title=Archives
06:27 ๐Ÿ”— joepie91 does that imply being fully archived?
06:27 ๐Ÿ”— joepie91 or only partially?
06:27 ๐Ÿ”— joepie91 arrith1: that would limit you to very short messages though
06:28 ๐Ÿ”— arrith1 joepie91: it would. but at the end of the short message maybe have a url to a special part of the wiki with specially formatted messages or something
06:29 ๐Ÿ”— aggro IIRC, it was partially archived, back when Google first said they were going to just delete all of the videos. Apparently the AT bandwidth hive was too much even for Google, and they caved :P Now it looks like they're just moving videos over. Making them private, but not deleted.
06:30 ๐Ÿ”— arrith1 joepie91: it's on some archive.org servers somewhere i think. but GV seemed to back down and said they were keeping the site up.
06:30 ๐Ÿ”— arrith1 joepie91: but now that recent announcement of GV coming down, that's probably going to be reevaluated
06:30 ๐Ÿ”— aggro (12:47:55 AM) SketchCow: I am torn on the google video
06:30 ๐Ÿ”— aggro (12:48:14 AM) SketchCow: I'll spend another day thinking about it.
06:33 ๐Ÿ”— joepie91 mmm
06:33 ๐Ÿ”— joepie91 arrith1: may be better to have a dedicated page without all the wiki overhead
06:34 ๐Ÿ”— arrith1 joepie91: making something community accessible without a wiki gets tricky. i mean you could do like hg/git but that's quite a barrier to entry vs a wiki in terms of novice users
06:35 ๐Ÿ”— joepie91 hm.
06:35 ๐Ÿ”— joepie91 I'll have a think about it.
06:39 ๐Ÿ”— joepie91 also, on an unrelated note, I've heard some people on various irc networks complain about certain stories getting removed from fanfiction for some reason
06:39 ๐Ÿ”— joepie91 does anyone know more about that?
06:40 ๐Ÿ”— arrith1 joepie91: people in #fanfriction might know
07:16 ๐Ÿ”— ersi Even though Google will move all the videos over to YouTube (they say at least) - I'm a bit in the mood to try to download it anyhow
07:17 ๐Ÿ”— ersi I mean, we ate MobileMe (even though I guess largely thanks to Kenneth/Heroku)
07:17 ๐Ÿ”— Coderjoe https://github.com/ArchiveTeam/googlegrape iirc
07:18 ๐Ÿ”— ersi yeah
07:18 ๐Ÿ”— C-Keen what's your preferred tool to archive an entire site? wget? which magic options do you use?
07:18 ๐Ÿ”— Coderjoe though that was pre-warc
07:18 ๐Ÿ”— ersi Coderjoe: wget with WARC support. The last part *is* important :)
07:19 ๐Ÿ”— Coderjoe C-Keen: wget. options depend on site, but we like warc
07:19 ๐Ÿ”— Coderjoe ersi: wrong target
07:19 ๐Ÿ”— * C-Keen looks up warc
07:19 ๐Ÿ”— ersi WARC is Web Archive format, it saves the HTTP Request + Response. It's a format used by the largest Archive places.
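For reference, a minimal sketch of what ersi describes, assuming a wget build with WARC support (1.14 or later; example.com is a placeholder):

    # fetch a page while also writing the raw HTTP requests and responses
    # to site.warc.gz alongside the normal download
    wget --warc-file=site "http://example.com/"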
07:19 ๐Ÿ”— ersi Coderjoe: wrong target?
07:19 ๐Ÿ”— C-Keen ersi: I see
07:20 ๐Ÿ”— Coderjoe ersi: i think you meant C-Keen not me
07:20 ๐Ÿ”— ersi Ah, I totally missed that I tab-completed to you instead of C-Keen :p
07:20 ๐Ÿ”— Coderjoe figured
07:21 ๐Ÿ”— C-Keen ok so I shall build a wget from trunk...no problem.
07:23 ๐Ÿ”— Coderjoe has the gnulib build stopper been fixed?
07:23 ๐Ÿ”— ersi you could take a short cut and use a get-wget-warc.sh script from.. I can't remember which project is the freshest.. I think it might be MobileMe/MeMac - that'll make a wget-warc version that works very easily (I tried compiling wget-trunk a month ago.. didn't end well :P)
07:24 ๐Ÿ”— Coderjoe i know misty mentioned a patch, but i don't know if it was accepted yet
07:25 ๐Ÿ”— Coderjoe yeah i think memac was the latest update
07:25 ๐Ÿ”— Coderjoe in order to get the regex support
07:27 ๐Ÿ”— C-Keen let's see
07:31 ๐Ÿ”— ersi C-Keen: Script's available at https://github.com/ArchiveTeam/mobileme-grab
07:31 ๐Ÿ”— ersi You want the "get-wget-warc.sh" one :-)
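A hedged sketch of that route (repo URL and script name as given above; exactly where the built binary lands may vary, so check the script itself):

    git clone https://github.com/ArchiveTeam/mobileme-grab
    cd mobileme-grab
    ./get-wget-warc.sh   # fetches and compiles a wget binary with WARC support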
07:36 ๐Ÿ”— C-Keen ersi: trunk built
07:37 ๐Ÿ”— ersi with the above script? or by itself? :)
07:38 ๐Ÿ”— C-Keen by itself
07:39 ๐Ÿ”— ersi neat!
07:39 ๐Ÿ”— C-Keen hm, I wonder whether I should tell wget to rewrite links so I can view the site locally
07:40 ๐Ÿ”— ersi no, don't do that
07:40 ๐Ÿ”— Coderjoe you can. wget will save the unmodified version to the warc
07:40 ๐Ÿ”— ersi oh, nice
07:40 ๐Ÿ”— Coderjoe (and it does the modification of the files at the end of the run anyway)
07:40 ๐Ÿ”— ersi otherwise I'd use https://github.com/alard/warc-proxy to proxy the content of the WARC :)
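Putting Coderjoe's point together, a sketch of a run that does both (all standard wget options; the target URL is a placeholder):

    # --convert-links rewrites the on-disk copies for local viewing at the
    # end of the run, while the WARC keeps the unmodified responses
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --warc-file=site --wait=1 "http://example.com/"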
07:43 ๐Ÿ”— C-Keen also the site I want to archive is using some kind of blog software so it contains links to page.html?p=1234. In previous attempts this turns out to be broken as the pages get downloaded as "page.html?p=1234" but of course the browser will always load the "page.html"
07:43 ๐Ÿ”— C-Keen how do you deal with this?
07:45 ๐Ÿ”— p4nd4 Those HTML endings are probably PHP or ASP rewritten by an apache module C-Keen
07:46 ๐Ÿ”— p4nd4 Oh actually, nevermind
07:46 ๐Ÿ”— p4nd4 Wrong channel
07:47 ๐Ÿ”— p4nd4 What blog software is it using? If you can find it you can check how URLs get rewritten and perhaps revert it because the original URLs should still work
07:58 ๐Ÿ”— C-Keen good question
07:58 ๐Ÿ”— C-Keen I will investigate
07:58 ๐Ÿ”— p4nd4 :)
07:58 ๐Ÿ”— p4nd4 Wappalyzer
07:59 ๐Ÿ”— brayden This is terribly interesting.
08:00 ๐Ÿ”— C-Keen Wappalyzer?
08:00 ๐Ÿ”— C-Keen ah cool
08:01 ๐Ÿ”— ersi C-Keen: Nothing wrong with getting "page.html?p=X" pages. As long as the content differs and has some other meaning than just page.html.. One can always do a rewrite serverside if you want to present it later
08:02 ๐Ÿ”— C-Keen ersi: ack. I just hoped for some already existing magic to do so
08:02 ๐Ÿ”— C-Keen p4nd4: heh wappalyzer is cool, it says some wordpress cms
08:02 ๐Ÿ”— ersi I mean, from the archiving Point of View, there's nothing wrong with saving them down as "page.html?p=1234"
08:03 ๐Ÿ”— ersi And there's solutions for dealing with that, if you want to present that material later as well :)
08:06 ๐Ÿ”— C-Keen heh archiving entire sites feels good ;)
08:06 ๐Ÿ”— ersi It sure does!
08:08 ๐Ÿ”— C-Keen now to something completely different. If I want to help on archiving huge sites, I can run the archive team warrior. but I am connected with asymmetric DSL which means I get 6Mbit/s down but only 200KB/s up, so while downloading gigabytes is fast getting these gigabytes off my machines will take (almost) forever
08:08 ๐Ÿ”— ersi indeed, unfortunately that's the case for many
08:14 ๐Ÿ”— p4nd4 In theory you could find an injection vulnerability
08:15 ๐Ÿ”— p4nd4 And clone their database
08:15 ๐Ÿ”— p4nd4 and set up your own wordpress site with a cloned database
08:15 ๐Ÿ”— p4nd4 That way you'd have an exact copy
08:15 ๐Ÿ”— brayden lol
08:15 ๐Ÿ”— brayden that's evil
08:15 ๐Ÿ”— p4nd4 It's evil if you cause harm
08:15 ๐Ÿ”— brayden and also difficult
08:15 ๐Ÿ”— p4nd4 not difficult
08:15 ๐Ÿ”— ersi No, you'd have an exact copy of the internal state. Not the external one
08:15 ๐Ÿ”— ersi You'd miss all static content and graphical representation
08:15 ๐Ÿ”— brayden You'd have the important part, the posts table.
08:16 ๐Ÿ”— brayden and you can crawl *.jpg,*.png etc. at a later stage
08:16 ๐Ÿ”— p4nd4 You could clone the whole database, including settings, posts, comments
08:16 ๐Ÿ”— p4nd4 Everything
08:16 ๐Ÿ”— * ersi sighs and rolls his eyes
08:16 ๐Ÿ”— p4nd4 Do you use some backup software to clone stuff btw?
08:16 ๐Ÿ”— p4nd4 Like a crawler to recreate pages and links?
08:17 ๐Ÿ”— brayden lol
08:17 ๐Ÿ”— brayden could use wget with spider
08:17 ๐Ÿ”— brayden but that only checks if links exist.
08:17 ๐Ÿ”— brayden Or download it, and parse it for links.
08:17 ๐Ÿ”— brayden wget can user-agent spoof, as can curl
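A minimal sketch of both ideas just mentioned (the UA string and URL are placeholders):

    # link-checking only, nothing saved, with a spoofed user agent;
    # curl's equivalent flag is -A / --user-agent
    wget --spider -r -l 2 --user-agent="Mozilla/5.0" "http://example.com/"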
08:18 ๐Ÿ”— p4nd4 Yes that's what I meant, you could crawl it, find all in-links, follow them and crawl them as well for in-links
08:18 ๐Ÿ”— p4nd4 And map them to each other
08:18 ๐Ÿ”— p4nd4 But it'd be static
08:18 ๐Ÿ”— brayden Depends on the content I guess.
08:18 ๐Ÿ”— brayden If it is a personal blog of some sort then maybe the post thumbnails etc. don't matter quite as much as the words.
08:19 ๐Ÿ”— p4nd4 I'm just wondering if you're using a software already
08:19 ๐Ÿ”— p4nd4 or if maybe that'd be a good project for me to start writing?
08:19 ๐Ÿ”— brayden AFAIK there's a lua script that is adapted for archiving.
08:19 ๐Ÿ”— p4nd4 Ah alright
08:19 ๐Ÿ”— C-Keen it's already a mess with all these links to some CDN, as you cannot distinguish data belonging to the site from other things anymore
08:19 ๐Ÿ”— brayden :(
08:19 ๐Ÿ”— C-Keen in the past people just hosted their stuff on their servers
08:19 ๐Ÿ”— p4nd4 :(
08:20 ๐Ÿ”— p4nd4 Yeah
08:20 ๐Ÿ”— brayden can't you just include only the CDN's links? or are they just IPs, not hostnames?
08:20 ๐Ÿ”— C-Keen brayden: but how do you know that some aws.amazon.com is essential for the content?
08:21 ๐Ÿ”— brayden I don't know but it is probably more essential than random hotlinks.
08:21 ๐Ÿ”— p4nd4 There are also many kinds of CDNs
08:21 ๐Ÿ”— p4nd4 as well as private CDNs
08:21 ๐Ÿ”— C-Keen true
08:21 ๐Ÿ”— p4nd4 Hard to distinguish
08:21 ๐Ÿ”— C-Keen then there is hacker news which is another bag of hate as they create dynamically expiring links on their pages *grrr*
08:22 ๐Ÿ”— Coderjoe wget has a page requisites option
08:22 ๐Ÿ”— C-Keen Coderjoe: true
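One way to act on that with stock wget, sketched under the assumption that you can whitelist the CDN hostnames by hand (both hostnames below are made up):

    # -p pulls page requisites; spanning hosts with an explicit domain
    # whitelist takes the CDN along without wandering across the web
    wget --mirror --page-requisites --span-hosts \
         --domains=example.com,cdn.example.net "http://example.com/"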
08:22 ๐Ÿ”— Coderjoe there is also the lua option, at least if using the picplz version of wget-warc-lua
08:23 ๐Ÿ”— C-Keen lua?
08:23 ๐Ÿ”— brayden lua is a scripting language
08:23 ๐Ÿ”— Coderjoe a scripting language. with the lua addition to wget, you can write a hook script for generating the list of links for wget to crawl
08:23 ๐Ÿ”— brayden Very simple to use.
08:23 ๐Ÿ”— C-Keen I know lua I don't see the connection
08:23 ๐Ÿ”— C-Keen ah
08:23 ๐Ÿ”— brayden oh
08:24 ๐Ÿ”— p4nd4 I've done that with a PHP script as well, fetch a site and crawl all in-links
08:24 ๐Ÿ”— Coderjoe the hook function would be passed the page that was just downloaded, and it can parse it and return a list of links
08:24 ๐Ÿ”— Coderjoe (for examples, you can see the picplz usage)
08:24 ๐Ÿ”— p4nd4 I was just thinking, instead of just stripping it of content and taking the links it could store the page as well, then fetch all sites it links to, and save them as well, and replace all links to link to the stored versions
08:25 ๐Ÿ”— p4nd4 And do that recursively
08:25 ๐Ÿ”— p4nd4 But the problem would be CDNs and remotely included scripts etc
08:25 ๐Ÿ”— C-Keen yep
08:25 ๐Ÿ”— Coderjoe p4nd4: but the wget-warc-lua solution allows you to add it all into one warc file during a single run
08:25 ๐Ÿ”— p4nd4 Oh
08:25 ๐Ÿ”— p4nd4 I'm not very familiar with archiving, I'm just brainstorming
08:25 ๐Ÿ”— p4nd4 :)
08:26 ๐Ÿ”— C-Keen why is archiving the headers important as well?
08:26 ๐Ÿ”— Coderjoe and with your recursive link following, at least without some sort of limitation, you will wind up trying to download all of the intarwebs
08:27 ๐Ÿ”— p4nd4 no
08:27 ๐Ÿ”— p4nd4 I said in-links
08:27 ๐Ÿ”— p4nd4 As in, links within the same domain
08:27 ๐Ÿ”— p4nd4 :c
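In wget terms the in-link restriction p4nd4 means is actually the default: recursion stays on the starting host unless you opt into spanning hosts. A sketch (URL is a placeholder):

    # recursive crawl that stays on the starting host; --no-parent also
    # keeps it below the starting directory, avoiding the
    # all-of-the-intarwebs failure mode
    wget -r --level=inf --no-parent "http://example.com/blog/"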
08:27 ๐Ÿ”— Coderjoe the reason for warc files is because that is what the wayback machine takes
08:27 ๐Ÿ”— p4nd4 Ahh
08:27 ๐Ÿ”— brayden lol
08:27 ๐Ÿ”— brayden Once I decided to use Xenu's link sleuth on Google
08:28 ๐Ÿ”— brayden Even with a fairly small depth I still ended up downloading the internetz
08:28 ๐Ÿ”— Coderjoe i saw no mention of "in-links" in your description
08:29 ๐Ÿ”— Coderjoe I had a friend that was looking to mirror one of the gamespy-hosted sites and wound up trying to download all of gamespy on my isdn connection
08:29 ๐Ÿ”— brayden :o
08:29 ๐Ÿ”— Coderjoe (he only had dialup at the time, so he used the ssh access into my server to do this)
08:30 ๐Ÿ”— Coderjoe simply with a forgotten wget option
08:31 ๐Ÿ”— Coderjoe I ended up creating a wget group, putting wget in that group, setting it 0750, and omitting him from that group
08:33 ๐Ÿ”— Coderjoe while this discussion has been generally on-topic, I would like to point out that there is an #archiveteam-bs channel for offtopic chatter
08:34 ๐Ÿ”— Coderjoe and with that, I am going to get some sleep
08:34 ๐Ÿ”— C-Keen sorry
08:34 ๐Ÿ”— C-Keen good night
08:41 ๐Ÿ”— ersi Well, this was borderline off-topic - I'd say it's mostly on topic
08:41 ๐Ÿ”— ersi no need for the sorry :)
08:43 ๐Ÿ”— p4nd4 Coderjoe: "(10:24:17 AM) p4nd4: I've done that with a PHP script as well, fetch a site and crawl all in-links" :p
08:43 ๐Ÿ”— ersi shrug
13:15 ๐Ÿ”— alard mistym: Maybe you've found it already, but a real solution to the Wget bootstrap problem is to remove the line $build_aux/missing from bootstrap.conf.
13:16 ๐Ÿ”— mistym alard: Yep, I noticed - thanks! The problem was patched upstream in gnulib, so doing a bootstrap-sync works too.
13:17 ๐Ÿ”— alard What is bootstrap-sync?
13:17 ๐Ÿ”— mistym It replaces the package's copy of the bootstrap script with gnulib's copy.
13:17 ๐Ÿ”— alard It's the bootstrap.conf file in the Wget repository that needs fixing, as far as I can see.
13:18 ๐Ÿ”— mistym Maybe I'm remembering wrong. Anyway - thanks!
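For anyone hitting the same build stopper, alard's fix as a one-liner (a sketch; run in the wget source tree, and note mistym's bootstrap-sync route works too):

    # drop the offending $build_aux/missing line from bootstrap.conf,
    # then re-run the bootstrap script
    sed -i '/\$build_aux\/missing/d' bootstrap.conf
    ./bootstrap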
13:32 ๐Ÿ”— SketchCow Ha ha, some archivists are coming out of the woodwork to criticize Just Solve the Problem.
13:35 ๐Ÿ”— SketchCow No, I am not a new registry in 'competition' for 'mindshare' on the issue. I am a chaos agent, like we've been with Archive Team (different than archive.org, by the way), turning the theoretical and the progressive into the real. When Archive Team started, people sniffed how we were using WGET instead of some properly standards compliant web archive format. Within a short time, WE CHANGED WGET TO SUPPORT WARC. And I can assure you, our ability to
13:40 ๐Ÿ”— X-Scale It's just a shame the old look and feel of Google Groups is going away. I wish there was an efficient way of preserving it (for possible use in future projects).
13:43 ๐Ÿ”— SketchCow http://blogs.loc.gov/digitalpreservation/2012/07/rescuing-the-tangible-from-the-intangible/ - that'sa lotsa cds
14:00 ๐Ÿ”— godane SketchCow: All 2006 episodes of dl.tv is up
14:00 ๐Ÿ”— godane all of 2005 but episode 6
14:01 ๐Ÿ”— godane the 2005 ones were at great risk of being deleted forever
14:02 ๐Ÿ”— godane everything past episode 30 i got from mevio
15:01 ๐Ÿ”— balrog SketchCow: Sorry! I was trying to keep the discussion to the formats and the software but arrith1 started asking hardware questions :) anyway I want to keep the focus on the software.
15:01 ๐Ÿ”— balrog Software for decoding stuff in general. Not limited to floppy disks
16:11 ๐Ÿ”— SketchCow If an Archive Team member wanted to attend http://www.digitalpreservation.gov/meetings/ndiipp12.html - I'd endorse you
16:52 ๐Ÿ”— underscor Oh, man. I'll be in Vegas
16:52 ๐Ÿ”— underscor Partying it up with SketchCow
17:06 ๐Ÿ”— qbc the big casinos with the odds-fixes are in the city of london and on wall street though...now those are places to par-TAY
18:47 ๐Ÿ”— arkhive More Google Spring Cleaning. Including Google Video. http://googleblog.blogspot.com/2012/07/spring-cleaning-in-summer.html
18:48 ๐Ÿ”— C-Keen yep
18:48 ๐Ÿ”— arkhive "Google Video users have until August 20 to migrate, delete or download their content. Weรขย€ย™ll then move all remaining Google Video content to YouTube as private videos that users can access in the YouTube video manager."
18:49 ๐Ÿ”— C-Keen so a lot of stuff will just stay inaccessible if no one shows up to republish it?
18:49 ๐Ÿ”— C-Keen am I reading this correctly?
18:49 ๐Ÿ”— arkhive Should we start downloading them again? It makes sense to since most videos (I assume) will be private.
18:49 ๐Ÿ”— arkhive ya
18:49 ๐Ÿ”— arkhive That's how I read it.
18:50 ๐Ÿ”— C-Keen well then...
18:51 ๐Ÿ”— arkhive you can make them public if you'd like.
18:51 ๐Ÿ”— arkhive Last link: http://youtube-global.blogspot.com/2012/07/google-video-content-moving-to-youtube.html
18:52 ๐Ÿ”— arkhive What does everyone else think about this?
18:53 ๐Ÿ”— SmileyG they are keeping your content. This is a *good* thing. They have given like a year's notice, this is a *good* thing.
18:53 ๐Ÿ”— SmileyG aren't most google video, videos private anyway?
18:54 ๐Ÿ”— omf_ Google has never released hard numbers on public vs private videos so it is really unknown
18:54 ๐Ÿ”— SmileyG wtf. Go to google.com/videohp ; click "I'm feeling lucky" without typing anything -> goes to the doodles page o_O
18:55 ๐Ÿ”— SmileyG Infact, you seem unable to search google video?
19:02 ๐Ÿ”— arkhive site:video.google.com bird
19:02 ๐Ÿ”— arkhive into the search field
19:03 ๐Ÿ”— arkhive replace bird with desired
19:03 ๐Ÿ”— SmileyG yah :<
19:15 ๐Ÿ”— arkhive So, Good.net, iWork.com, possibly Google Video, and the stuff listed on the 'deathwatch' wiki page.
19:15 ๐Ÿ”— qbc video.google's an issue to corporatists who wish more control over truths which detract from their image .. in that regard, a migration is no surprise--i've wondered why they weren't quicker in their fascism in fact
20:27 ๐Ÿ”— Coderjoe SketchCow: isn't criticism what archivists do best?
20:28 ๐Ÿ”— mistym Coderjoe: I think debating over how best to approach solving the problem, without ever solving the problem, is what archivists do best
20:28 ๐Ÿ”— mistym Archivists are also very good at criticizing archivists
20:28 ๐Ÿ”— BlueMax mistym also apparently archivists are politicians
20:28 ๐Ÿ”— BlueMax Jason Scott for President of Earth 3012
20:35 ๐Ÿ”— mistym Why wait 1000 years?
20:46 ๐Ÿ”— Nemo_bis to prove he's been archived well
20:46 ๐Ÿ”— Nemo_bis and digitally preserved in a good format
20:46 ๐Ÿ”— Nemo_bis thanks to his projects
21:09 ๐Ÿ”— nitro2k01 SmileyG: Typing anything brings up results with default settings, so "I'm feeling lucky" is effectively dead
21:15 ๐Ÿ”— _case Coderjoe & mistym: it's an academia thing
21:15 ๐Ÿ”— chronomex stupid academics
21:16 ๐Ÿ”— mistym _case: I am not in academia and I can say for sure I see it outside academic archives!
21:18 ๐Ÿ”— _case mistym: oh for sure. just saying archivists [speaking as one] are at their core - academics [for better or worse].
21:19 ๐Ÿ”— mistym This is true. (Also speaking as one.)
21:24 ๐Ÿ”— _case my god. they got you too.
21:27 ๐Ÿ”— Tephra as someone who got my foot in both academia and archiving i would agree somewhat
21:38 ๐Ÿ”— joepie91 <SmileyG>aren't most google video, videos private anyway?
21:38 ๐Ÿ”— joepie91 possibly - the problem here is that those that are public (sometimes for very good reason) also become private
23:18 ๐Ÿ”— arkhive My connection was lost..Is there any place I can look back at the chat log to see if anyone else is interested in those projects?
23:20 ๐Ÿ”— arkhive found it..nevermind
