[00:08] Fan Fiction archiving now happening - 830gb of data. [00:15] Holy crap. A million monkeys on a million typewriters for a million years... [00:15] (would create more data than that, assuming the poor things could figure out how to type) [00:16] correct [00:16] 150T [00:16] er, that's one monkey at 50ish wpm for a million years [00:17] whatever [00:18] Fuck those little bastards [00:18] My subtle hint was that the majority of the archived fanfiction would be written by the human equivalent of monkeys [00:27] I've run across fiction where the premise seemed interesting, but I couldn't finish it because I had to mentally rewrite every sentence in order to really make sense of it. [00:27] oh, and the number of times I've seen mixups beteen clothes/cloths and breathe/breath [00:27] argh. nothing like a grammar/spelling nitpick messing up in their complaint. :D [00:35] nice, ffnet needs a good archiving [00:46] I have a really hard time reading fanfiction with typos [00:53] I'm just savin' it, I ain't judgin' it. [00:53] I've got the two partitions down to 83% and 49% so crisis over [00:54] (this is the part where I find another TB of data just laying around somewhere [00:54] ) [00:54] i remember some really good firefly fanfiction on there [00:55] Definitely one of those times I wish I could assign a boring task to an underling. [00:55] http://archive.org/details/hackercons-notacon-2007 [00:55] Hundreds of hacker con speeches. Just have to type in names of the presenters, and the talks. [00:59] SketchCow: Famicoman beat you to this: http://archive.org/details/notacon4video [01:00] Yeah, Famicoman made a non-described, blown-up pile of derived video [01:01] Compare the information you get there to, say, http://archive.org/details/hackercons-notacon-2007-brickipedia [01:02] of course he used ftp to upload them [01:03] i see what you mean [01:03] if i had faster upload i may have put it all in one item with lots of .txt files for descs [01:03] ugh [01:03] i did that with mostly twit podcasts [01:03] all in one item [01:04] stuff like diggnation shouldn't have been done that way [01:05] my rule is to keep items under 5-6gb [01:05] Yeah [01:05] See, I wouldn't do it that way at all. [01:05] Anyway, I'm doing such by re-doing them as you can see. [01:06] ok [01:06] then you remove the anarchivism ones [01:06] I'm not really famous for letting others half-done jobs dictate my not doing it. [01:06] No, they're adorable. [01:06] I'm actually not allowed to. [01:06] ok [01:06] I kinda like one item per video/episode/talk/whatever. though I see the PDA vids were tossed up in two items [01:06] Yes, which I had nothing to do with. [01:06] I do one episode an item [01:06] i know [01:07] http://archive.org/details/securityjustice [01:07] See? One episode an item [01:07] When I have a chance, I'll go back and inject their descriptions in. [01:08] i may have something to upload [01:08] firefly fanfiction audio drama [01:08] http://archive.org/details/securityjustice-25 [01:08] Then they'll look like that. [01:10] with data de-dupliation i don't think it matters how much its uploaded [01:10] or change to be neat [01:10] ia doesn't do dedup [01:10] it doesn't [01:10] but i thought it did [01:10] It does not. [01:11] now i see storage is going to be a problem [01:11] he's so adorable? can we keep him? [01:12] i just don't like 240gb of diggnation being on there like 20 times or something [01:13] of course there dedup maybe very hard since we are talking 1000s have hard drives [01:13] *of [01:14] So adorable [01:16] i may have to do a panic download of the signal now [01:18] lots of audio podcasts: http://signal.serenityfirefly.com/mmx/series/ [01:19] shiny [01:21] i figure that i need to call on you guys to backup those podcasts [01:21] to much for my hard drives right now [01:21] are they going anywhere soon [01:21] don't know [01:22] Especially if you continue this terrible habit of shoving dozens of descrete episodes and broadcasts into one big gloppy item [01:22] but its been 8 seasons so fare [01:22] *far [01:25] some of it is not godane [01:26] The Library of Congress, the Preserving Virtual Worlds Project, and a bunch of others have jumped into my project. [01:39] SketchCow: is http://www.archiveteam.org/index.php?title=Just_Solve_the_Problem_2012 / JSP2012 getting its own name, site and wiki? [01:39] yes [01:39] SketchCow: why is there a focus one just one month? [01:39] this is just a prelim scratchpad [01:39] Ask that second question in english [01:40] looks like archive.org hates my firefly parody i have uploaded [01:41] SketchCow: why is it "30 days dedicated to solving a problem" which might mean actually solving the problem within that time, instead of making an organization within that time [01:43] You know what the world doesn't need? [01:43] Another organization [01:43] AT is kind of an org, but it works [01:44] You say that [01:44] But every time we have a vote, someone dies [01:44] A child, usually [01:44] Usually [02:00] With you all as my witness, I am changing my ways [02:01] i regret to report that all videos removed by youtube users prior to july 1st 2012 have been irrevocably deleted [02:04] strangely, videos taken offline for copyright infringement have been preserved [02:08] I don't have time to read all of textfiles.com right now [02:08] But a friend, when I mentioned JSTP to him, said that there used to be a floating list of file formats on BBSes [02:08] Is this still extant? [02:17] Yes [02:18] All this exists. [02:18] Awesome [03:50] am i the only one who doesnt understand this just solve the problem project [03:51] i seem to understand from it, trying to gather people to figure out what file formats a bunch of random crap is in? or make something useful out of all the stuff that has been archived in some displayable format? [03:58] S[h]O[r]T: the goal is to document as many formats as possible [03:59] Figure out how to decode them and such [03:59] How the actual data is stored in these files [04:01] i think it also extends to physical media. maybe like "how to solder your own kit to get data off a disc_x type disc" [04:03] or how to dump the firmware of your television [04:03] arrith1: That's something I'm trying to work with with the discferret project [04:04] solo: That's ... annoying because hardware needed to dump a lot of firmware is expensive and requires nasty proprietary software [04:05] balrog: wow discferret is wild [04:05] "The source code and CAD files for the DiscFerret design are completely open-sourced: the hardware and software are released under the GNU GPL (in the case of the board, microcode, and firmware) or the Apache Public Licence (in the case of the DiscFerret Hardware Access Library)" [04:06] did anyone archive revver or livevideo? [04:06] arrith1: We're looking for help software side [04:07] If anyone's good at software architecture and willing to help, and has time, stop by the IRC [04:07] :) [04:08] discferret plus a big archive of fileformat info could make quite the killer ArchiveTeam member disaster kit [04:10] The software we're starting work on is intended to handle other data sources too :) [04:11] Unfortunately we're just starting out and I'm not all that good at designing it yet [04:15] balrog: the sofware, hardware, or both? [04:15] Software. [04:15] ah [04:16] Hardware is pretty solid, if not a bit slow. That's going to be fixed with a hardware revision, but if you're interested rev-1 is available now. [04:16] Another thing we're working on fixing is a somewhat high price [04:16] (thanks to a lot of components, a slightly overdesigned power supply, and hand assembly) [04:27] balrog: yeah high prices would be good to fix. i'm totally hw ignorant but maybe there's some way to use more commodity components? there are lots of arduinos and raspberry pi competitors [04:28] Well you have to record a stream of data at a high rate. Current design is based on an FPGA and a microcontroller and thats how it will be. But the current power stuff is somewhat overkill [04:28] Most drives don't need 2A output :) [04:29] I have to see if it's feasible to power a drive externally and how much that would reduce the cost [04:29] It's nice though to have a single unit that can power both itself and the drive [04:32] ah yeah, almost like an external hdd case all wrapped up [04:32] dang fpgas are always expensive [04:38] The fpga isn't the worst [04:39] It's $12 or so [04:39] The USB 2.0 microcontroller will be about $6-$7 [04:39] You get nickel and dimed to death on the smaller parts. [05:03] balrog: and the memory? [05:03] We found a somewhat cheaper source. [05:04] We're thinking of doing an sdram based design. Would mean much more memory at a lower price at an increase in microcode complexity [05:04] (need an sdram controller) [05:06] balrog: oh that's pretty good [05:09] The current price is around $250 for a fully assembled board which I feel is a bit much [05:09] I'd like to get it toward $100, hopefully to $150 if not lower [05:10] kryoflux recommends you power the drive separately and they sell an adapter for that purpose [05:11] I used a gutted 3.5" HDD enclosure for power [05:13] DFJustin: The discferret power components can power 2-3 3.5" drives easily as it is now [05:13] Which I feel is overkill [05:13] It's overdesigned. It's extremely robust but I don't think that's necessary. [05:14] Half the capacity would still power even a 5.25" drive [05:15] The kryoflux just pulls power off the 5V USB [05:16] this seems relevant here: [05:16] Google Video stopped taking uploads in May 2009. Later this summer weÃ¢??ll be moving the remaining hosted content to YouTube. Google Video users have until August 20 to migrate, delete or download their content. WeÃ¢??ll then move all remaining Google Video content to YouTube as private videos that users can access in the YouTube video manager. For more details, please see our post on the YouTube blog. [05:17] joepie91: Link? [05:17] tl;dr google video videos will become unavailable for public viewing unless the uploader specifically makes it public [05:17] http://googleblog.blogspot.nl/2012/07/spring-cleaning-in-summer.html [05:17] I see [05:17] UGH why [05:18] no idea :/ [05:18] If they were public before they should stay as such [05:18] but that seems like a LOT of potential for huge data loss [05:18] (I think) [05:18] Yeah :( [05:18] or rather, public data loss [05:18] and I mean *huge* [05:18] SketchCow: ^ [05:19] DFJustin: Anyway my point was that maybe we don't even need to have drive power support. Will have to check how much extra cost that adds. [05:19] i linked stuff about that earlier :) [05:19] Ah... [05:19] he's going to check with archive.org people, since i guess archive.org has been working on youtube [05:19] if archive.org doesn't get it all then i guess AT can spring into action [05:22] O hai [05:23] ohai [05:24] arrith1: alright [05:24] -bs [05:25] Wow, you filled 5 screens with discussion of hardware [05:25] oops [05:25] wait, well it's sorta related to the just solve it stuff, which is kind of #archiveteam related [05:26] but yeah k -bs [05:28] It's only sort of related [05:28] -bs [05:29] SketchCow: is archive.org doing for Google Video what they did for stage6? [05:29] Stage6 was awesome [05:33] Archive.org didn't do stage6, we did [05:33] One of us did. [05:33] i did [05:33] i wish I had gotten more of it [05:34] particularly more user-generated content, as opposed to all those music videos, tv shows, and movies :( [05:35] I am torn on the google video [05:36] I'll spend another day thinking about it. [05:45] also, for those that missed it - meebo is shutting down on july 11, instructions for downloading your recorded chatlogs for your meebo account until that date are available at http://www.meebo.com/support/article/175/ [05:45] lots of big things shutting down lately :( [05:47] on that note - SketchCow, does archiveteam keep some kind of RSS feed that provides a list of services that will be shut down soon? [05:47] or similar [05:47] (preferably including archival status, of course :) [05:48] joepie91: there are pages for that on the wiki [05:48] mainly deathwatch i think [05:48] and the frontpage [05:48] alright, but is there some kind of feed that can for example be automatically retrieved? [05:48] I can think of some interesting things to do with that [05:49] one could cobble together a script that looks for changes to specific portions of the site from the overall wiki changes rss feed [05:49] i'm not aware of something that does that currently [05:49] hrm.. that would be hacky, and probably break when the page layout changes :| [05:49] yep [05:49] wikis are tricky like that ;/ [06:02] what portions of the site? of course there are solutions [06:04] do you just want something like this? http://archiveteam.org/index.php?title=Deathwatch&feed=atom&action=history [06:04] otherwise there's plenty of IRC-RC based services [06:06] Nemo_bis: no, that is literally just a feed of changes [06:06] I mean a feed that announces new site clousers [06:06] closures * [06:06] a feed doesn't announce anything [06:06] sigh [06:07] .. [06:07] a feed that has as its items newly announced site closures [06:07] so this is not "portions of the wiki" [06:07] no [06:07] anyway you can construct it from the feed, or make the wiki page machine-readable [06:08] I never said anything about the wiki, it was arrith1 coming up with that suggestion [06:08] yes, which would break if the page layout changes [06:09] i refer to the wiki since it's basically the only place info is, besides say live irc channels [06:12] or the AT twitter account [06:12] @archiveteam [06:13] but that's generally after details have been worked out and grunts are needed. [06:13] ah yeah [06:15] joepie91: what were you thinking of using the feed for? [06:18] No feed [06:18] Should be fixed? Yes. [06:23] arrith1: I had a few ideas, actually [06:23] mailing list, widget, possibly irc integration [06:24] anything else I can think of [06:24] just a way for people to easily keep track of services that are shutting down, that look less intimidating to the average user than a wiki page [06:24] having a twitter account specifically for sites confirmed going down could work. then have a bot announce that in irc, etc [06:26] this is actually interesting: [06:26] Archived but not available [06:26] Google Video [06:26] http://www.archiveteam.org/index.php?title=Archives [06:27] does that imply being fully archived? [06:27] or only partially? [06:27] arrith1: that would limit you to very short messages though [06:28] joepie91: it would. but at the end of the short message maybe have a url to a special part of the wiki with specially formatted messages or something [06:29] IIRC, it was partially archived, back when Google first said they were going to just delete all of the videos. Apparently the AT bandwidth hive was too much even for Google, and they caved :P Now it looks like they're just moving videos over. Making them private, but not deleted. [06:30] joepie91: it's on some archive.org servers somewhere i think. but GV seemed to back down and said they were keeping the site up. [06:30] joepie91: but now that recent announcement of GV coming down, that's probably going to be reevaluated [06:30] (12:47:55 AM) SketchCow: I am torn on the google video [06:30] (12:48:14 AM) SketchCow: I'll spend another day thinking about it. [06:33] mmm [06:33] arrith1: may be better to have a dedicated page without all the wiki overhead [06:34] joepie91: making something community accessible without a wiki gets tricky. i mean you could do like hg/git but that's quite a barrier to entry vs a wiki in terms of novice users [06:35] hm. [06:35] I'll have a think about it. [06:39] also, on an unrelated note, I've heard some people on various irc networks complain about certain stories getting removed from fanfiction for some reason [06:39] does anyone know more about that? [06:40] joepie91: people in #fanfriction might know [07:16] Even though Google will move all the videos over to YouTube (they say at least) - I'm a bit in the mood to try to download it anyhow [07:17] I mean, we ate MobileMe (even though I guess largely thanks to Kenneth/Heroku) [07:17] https://github.com/ArchiveTeam/googlegrape iirc [07:18] yeah [07:18] what's your preferred tool to archive an entire site? wget? which magic options do you use? [07:18] though that was pre-warc [07:18] Coderjoe: wget with WARC support. The last part *is* important :) [07:19] C-Keen: wget. options depend on site, but we like warc [07:19] ersi: wrong target [07:19] * C-Keen looks up warc [07:19] WARC is Web Archive format, it saves the HTTP Request + Response. It's a format used by the largest Archive places. [07:19] Coderjoe: wrong target? [07:19] ersi: I see [07:20] ersi: i think you meant C-Keen not me [07:20] Ah, I totally missed that I tab-completed to you instead of C-Keen :p [07:20] figured [07:21] ok so I shall build a wget from trunk...no problem. [07:23] has the gnulib build stopper been fixed? [07:23] you could take a short cut and use a get-wget-warc.sh script from.. I can't remember which project is the freshes.. I think it might be MobileMe/MeMac - that'll make a wget-warc version that works very easily (I tried compiling wget-trunk a month ago.. didn't end well :P) [07:24] i know misty mentioned a patch, but i don't know if it was accepted yet [07:25] yeah i think memac was the latest update [07:25] in order to get the regex support [07:27] let's see [07:31] C-Keen: Script's available at https://github.com/ArchiveTeam/mobileme-grab [07:31] You want the "get-wget-warc.sh" one :-) [07:36] ersi: trunk built [07:37] with the above script? or by itself? :) [07:38] by itself [07:39] neat! [07:39] hm, I wonder whether I should tell wget to rewrite links so I can view the site locally [07:40] no, don't do that [07:40] you can. wget will save the unmodified version to the warc [07:40] oh, nice [07:40] (and it does the modification of the files at the end of the run anyway) [07:40] otherwise I'd use https://github.com/alard/warc-proxy to proxy the content of the WARC :) [07:43] also the site I want to archive is using some kind of blog software so it contains links to page.html?p=1234. In previous attempts this turns out to be broken as the pages get downloaded as "page.html?p=1234" but of course the browser will always load the "page.html" [07:43] how do you deal with this? [07:45] Those HTML endings are probably PHP or ASP rewritten by an apache module C-Keen [07:46] Oh actually, nevermind [07:46] Wrong channel [07:47] What blog software is it using? If you can find it you can check how URL's get rewritten and perhaps revert it because the original URLs should still work [07:58] good question [07:58] I will investigate [07:58] :) [07:58] Wappalyzer [07:59] This is terribly interesting. [08:00] Wappalyzer? [08:00] ah cool [08:01] C-Keen: Nothing wrong with getting "page.html?p=X" pages. As long as the content differs and have some other meaning than just page.html.. One can always do a rewrite serverside if you want to present it later [08:02] ersi: ack. I just hoped for some already existing magic to do so [08:02] p4nd4: heh wappalyzer is cool, it says some wordpress cms [08:02] I mean, from the archiving Point of View, there's nothing wrong with saving them down as "page.html?p=1234" [08:03] And there's solutions for dealing with that, if you want to present that material later as well :) [08:06] heh archiving entire sites feels good ;) [08:06] It sure does! [08:08] now to something completely different. If I want to help on archiving huge sites, I can run the archive team warrior. but I am connected with asymmetric DSL which means I get 6Mbit/s down but only 200KB/s up, so while downloading gigabytes is fast getting these gigabytes off my machines will take (almost) forever [08:08] indeed, unfortunally that's the case for many [08:14] In theory you could find an injection vulnerability [08:15] And clone their database [08:15] and set up your own wordpress site with a cloned database [08:15] That way you'd have an exact copy [08:15] lol [08:15] that's evil [08:15] It's evil if you cause harm [08:15] and also difficult [08:15] not difficult [08:15] No, you'd have an exact copy of the internal state. Not the external one [08:15] You'd miss all static content and graphical representation [08:15] You'd have to the important part, the posts table. [08:16] and you can crawl *.jpg,*.png etc. at a later stage [08:16] You could clone the whole database, including settings, posts, comments [08:16] Everything [08:16] * ersi sighs and rolls his eyes [08:16] Do you use some backup software to clone stuff btw? [08:16] Like a crawler to recreate pages and links? [08:17] lol [08:17] could use wget with spider [08:17] but that only checks if links exist. [08:17] Or download it, and parse it for links. [08:17] wget can user-agent spoof, as can curl as well [08:18] Yes that's what I meant, you could crawl it, find all in-links, follow them and crawl them as well for in-links [08:18] And map them to each other [08:18] But it'd be static [08:18] Depends on the content I guess. [08:18] If it is a personal blog of some sort then maybe the post thumbnails etc. don't matter quite as much as the words. [08:19] I'm just wondering if you're using a software already [08:19] or if maybe that'd be a good project for me to start writing? [08:19] AFAIK there's a lua script that is adapted for archiving. [08:19] Ah alright [08:19] it's already a mess with all these links to some CDN, as you cannot distinguish data beloning to the site from other things anymore [08:19] :( [08:19] in the past people just hosted their stuff on their servers [08:19] :( [08:20] Yeah [08:20] can't you just include only the CDN's links? or are they just IPs, not hostnames? [08:20] brayden: but how do you now that some aws.amazon.com is essential for the content? [08:21] I don't know but it is probably more essential than random hotlinks. [08:21] There are also many kinds of CDNs [08:21] as well as private CDNs [08:21] true [08:21] Hard to distinguish [08:21] then there is hacker news which is another bag of hate as they create dynamically expiring links on their pages *grrr* [08:22] wget has a page requisies option [08:22] Coderjoe: true [08:22] there is also the lua option, at least if using the picplz version of wget-warc-lua [08:23] lua? [08:23] lua is a scripting language [08:23] a scripting language. with the lua addition to wget, you can write a hook script for generating the list of links for wget to crawl [08:23] Very simple to use. [08:23] I know lua I don't see the connection [08:23] ah [08:23] oh [08:24] I've done that with a PHP script as well, fetch a site and crawl all in-links [08:24] the hook function would be passed the page that was just downloaded, and it can parse it and return a list of links [08:24] (for examples, you can see the picplz usage) [08:24] I was just thinking, instead of just stripping it of content and taking the links it could store the page as well, then fetch all sites it links to, and save them as well, and replace all links to link to the stored versions [08:25] And do that recursively [08:25] But the problem would be CDNs and remotely included scripts etc [08:25] yep [08:25] p4nd4: but the wget-warc-lua solution allows you to add it all into one warc file during a single run [08:25] Oh [08:25] I'm not very familiar with archiving, I'm just brainstorming [08:25] :) [08:26] why is archiving the headers important as well? [08:26] and with your recursive link following, at least without some sort of limitation, you will wind up trying to download all of the intarwebs [08:27] no [08:27] I said in-links [08:27] As in, links within the same domain [08:27] :c [08:27] the reason for warc files is because that is what the wayback machine takes [08:27] Ahh [08:27] lol [08:27] Once I decided to use Xenu's link sleuth on Google [08:28] Even with a fairly small depth I still ended up downloading the internetz [08:28] i saw no mention of "in-links" in your description [08:29] I had a friend that was looking to mirror one of the gamespy-hosted sites and wound up trying to download all of gamespy on my isdn connection [08:29] :o [08:29] (he only had dialup at the time, so he used the ssh access into my server to do this) [08:30] simply with a forgotten wget option [08:31] I ended up creating a wget group, putting wget in that group, setting it 0750, and omitting him from that groupp [08:33] while this discussion has been generally on-topic, I would like to point out that there is an #archiveteam-bs channel for offtopic chatter [08:34] and with that, I am going to get some sleep [08:34] sorry [08:34] good night [08:41] Well, this was borderline off-topic - I'd say it's mostly on topic [08:41] no need for the sorry :) [08:43] Coderjoe: "(10:24:17 AM) p4nd4: I've done that with a PHP script as well, fetch a site and crawl all in-links" :p [08:43] shrug [13:15] mistym: Maybe you've found it already, but a real solution to the Wget bootstrap problem is to remove the line $build_aux/missing from bootstrap.conf. [13:16] alard: Yep, I noticed - thanks! The problem was patched upstream in gnulib, so doing a bootstrap-sync works too. [13:17]