[02:49] Hello! I was just looking about at Poetry.com, and it looks like the new management has put most of the old (14million) poems back up, in some sort of "old poems" thing. Is there any project underway to scrape them all?
[03:01] At the moment, no, there's no current project to scrape them all.
[03:01] Although, that might be a good idea to do..
[03:01] I agree
[03:02] we shall do some research ^_^
[18:03] hi there- new member here
[18:03] why hello there :P
[18:04] right, to continue on our conversation - in the future, consider using wget + warc for archiving a site... it also saves error responses, headers, etc.
[18:04] saved a bunch of websites with httrack, how can I share them?
[18:04] chivist: WHY HELLO THERE
[18:04] so it gives you a full snapshot of a site
[18:04] as for your question, my suggestion would be "upload them to IA" but perhaps winr4r has a different insight
[18:04] nope, that's what i was gonna say too
[18:05] right :P
[18:05] it was gonna be "chivist: upload them to "community texts" on archive.org and tell SketchCow to put them into the archive team collection"
[18:05] well, that's -somewhat- more specific...
[18:05] (than what I said)
[18:06] or just talk to SketchCow to get some FTP/rsync space and he'll take care of it
[18:06] what did you grab, chivist?
[18:06] tbd.com, a ny-based news website
[18:06] they put a redirect on the main page, so I archived individual sections
[18:06] oh, it closed?
[18:07] yea the journos there complained it was down, so the company put it -somewhat- back online for a couple of days
[18:07] aside from that, chivist, I'm not sure if you've used wget before, but in the newer versions there's support for directly saving to WARC - you'll want to use the --mirror switch to put it into mirror mode, and specify --warc-file=something.warc.gz to indicate the filename to save it to... you can then upload the warc.gz to the Internet Archive and it'll have all the important data, and will be importable into the Wayback Machine
[18:07] good goddamn
[18:07] (well, that took a while to type...)
[18:08] yea I'm relatively new to archiving
[18:08] they could have just dumped a static version of it and put it online for like $5/month
[18:08] chivist: we've all been there :)
[18:08] well, except for SketchCow maybe
[18:08] I think he was born an archivist
[18:09] anyway the archive team is pretty amazing
[18:09] SAVES COPY OF DNA ON THE WAY OUT
[18:09] chivist: yeah we are fucking awesome
[18:09] stick around
[18:09] I'm Canadian
[18:09] and we lost a couple great websites in the last year
[18:09] chivist: i forgive you!
[18:09] (ah!)
[18:09] aw :/
[18:10] I managed to grab a Quebec news website before it went down
[18:11] fucking Quebecor, shut down like 8 different papers in Canada
[18:13] fuck man
[18:13] although
[18:13] news sites, in general, don't seem to have an idea of their importance
[18:13] because for the most part
[18:14] once you get outside of the big ones, they fucking SUCK at switching CMSes
[18:14] ahah
[18:14] ever heard about the CBC?
[18:14] they're the national broadcasting corporation
[18:14] and their CMS dates from 2003
[18:14] it's like every six months, okay switching CMS and they don't fucking CARE that every link is broken
[18:15] hey, did you know SketchCow (our jason) was on CBC? :)
[18:15] OBVIOUSLY I SAVED A LOCAL COPY
[18:15] no! do you have a link?
[18:16] ONE SEC
[18:16] http://j5.video2.blip.tv/9030006914560/Dmisener-JasonScottInterview888.mp3?ir=13096&sr=131
[18:17] >video2.blip.tv
[18:17] >mp3
[18:17] ok.png
[18:17] joepie91: you're adorable
[18:17] D:
[18:17] this was the full uncut interview
[18:17] how much data have you saved so far>
[18:17] ?
[18:18] chivist: at a guess, over half a petabyte
[18:18] i remember jason saying that he, personally, had put about 200 terabytes into archive.org
[18:18] gimme a second, I'm picking up my jaw on the floor
[18:18] HALF A PETABYTE
[18:18] or some crazy-ass figure like that
[18:19] mobileme alone, was 272 terabytes
[18:20] and jason has just been pumping stuff almost 24/7
[18:20] ("almost" because he has to sleep once a month)
[18:20] lol
[18:21] ahah
[18:21] does he have a dedicated server grabbing websites?
[18:21] I can't believe how much our internet sucks over here
[18:21] chivist: probably
[18:23] chivist: so how did you find us? :)
[18:23] I was looking at the Wayback machine
[18:23] and found you on archive.org
[18:23] ah, gotcha
[18:24] what are your talents? :)
[18:24] well
[18:25] besides eating a lot of cheese
[18:25] I don't really have any talents
[18:25] it's okay, that was not some test to see if you are worthy
[18:26] we're self-described as "rogue archivists, programmers, writers and loudmouths", we need loudmouths too :)
[18:26] cheese!
[18:27] well I'm more like a quiet guy
[18:27] very sneaky
[18:28] chivist: are you a blogger, or a twitterer or anything?
[18:28] you can do a lot for us just by spreading the word
[18:29] yes a journalist but I don't want to mix both- the bullshit concept of "objectiveness" prevents me from being involved publicly in a lot of things
[18:30] fuck yeahhhh
[18:30] go write a story about us
[18:30] because we're awesome
[18:30] it's on my to-do list
[18:30] you can do a whole lot by just spreading the word
[18:30] PUT IT ON YOUR "DO NOW" LIST MOTHERFUCKER
[18:30] We don't have WIRED in canada
[18:31] so most technology stories are… lame
[18:31] yea yea
[18:31] * joepie91 takes winr4r, puts him down in the corner, and gives him a cookie
[18:31] :P
[18:31] * winr4r snuggles joepie91.
[18:31] well there are tons of things happening, nobody reporting on it
[18:31] chivist: jason is so hilarious that you *need* him in whatever publication you are doing
[18:32] SketchCow?
[18:32] yup, that's our jason!
[18:33] so I'm looking at Wget
[18:33] the benefit of WARC is that it can be used in the Waybackmachine?
[18:33] yes, or anyone else can use it
[18:33] chivist: yes
[18:33] that, and it holds more data
[18:33] theres various tools that will load them
[18:33] it has all the headers, for example, iirc
[18:33] it saves the headers and other metabollocks
[18:33] and can store error pages
[18:34] so even the error pages are archived!
[18:34] so do I run Wget first
[18:34] then WARC
[18:34] chivist: psst go listen to that interview, because jason can just TALK FOREVER
[18:34] or both
[18:34] chivist: you run a wget which has warc support and outputs a warc
[18:35] WARC is a format, not an application :)
[18:35] oook
[18:35] chivist: WARC is an output format, wget can save WARCs
[18:35] aside from that, chivist, I'm not sure if you've used wget before, but in the newer versions there's support for directly saving to WARC - you'll want to use the --mirror switch to put it into mirror mode, and specify --warc-file=something.warc.gz to indicate the filename to save it to... you can then upload the warc.gz to the Internet Archive and it'll have all the important data, and will be importable into the Wayback Machine
[18:35] also refering back to my earlier
[18:36] * winr4r pets joepie91
[18:36] I'm looking at the archive team page about it
[18:36] random question: where do you store everything?
[18:36] chivist: archive.org
[18:36] I mean, do you rent servers/use your own
[18:36] if you like giving to charity, archive.org is the best value-for-money that there is
[18:37] $1.5 million a year for storing petabytes of shit
[18:37] chivist: we have some members who have vps's which they allow us to use.
[18:37] SmileyG: we use that intermediately
[18:37] I assume you don't have an office right?
[18:37] beside IRC
[18:37] lol no
[18:38] with a giant poster "We're going to rescue your shit"
[18:38] we're a bunch of folks from all over the world
[18:38] #archiveteam IS our office, chivist
[18:38] We have Jason's home? :D
[18:39] I guess Jasons information cube is kind of an office
[18:39] except only the CEO works there
[18:39] :p
[18:39] :D
[18:39] yes
[18:39] GRAND HIGH POOHBAR.
[18:40] chivist: we're not an organisation in the normal sense of folks who get together in person and then do things
[18:40] we're more like a global lynch mob
[18:40] nice
[18:42] we are to the library of congress or any other archive what a court is to 122 guys in rigger boots
[18:44] anyway YOU ARE A JOURNALIST, go interview jason, he's fucking awesome
[18:47] ever had issues with copyright trolls?
[18:48] chivist: nope
[18:48] Yes, No, we don't give a fuck?
[18:48] I find this exchange funny
[18:48] chivist: is this an offical interview?
[18:48] chivist: nobody does that because they don't want to die
[18:49] if they want their stuff removed, we're happy to do it
[18:49] SmileyG: I don't think it is? at least, that's not how it started :P
[18:49] definitely not
[18:49] IA will blackout anything with a valid request to do so.
[18:49] chivist: then fine, we can continue to chat :D
[18:49] (and by "die", i mean "suicide by email")
[18:50] I'm pretty transparent about interviewing/sources/etc
[18:50] chivist: :)
[18:50] Ok
[18:50] just don't think you can quote me without asking :D
[18:50] you can quote me on anything because i am awesome
[18:50] * winr4r pets SmileyG
[18:51] also, most news outlets want names
[18:51] which can be REALLY annoying
[18:51] you can find my name quite easily.
[18:51] same here
[18:51] Lewis Collard if you want a name
[18:51] shush lewis
[18:51] SORRY SMILEY
[18:51] Oh, and we all have no offical sanction to speak on behalf of Archive Team either.
[18:52] Now i've said that, I think I can say wtf I want?
[18:52] * joepie91 points at his doxedness
[18:52] yeah, who gets to speak for archive team is ill-defined
[18:52] it's mostly jason
[18:52] https://si0.twimg.com/profile_images/1855468868/DSC_0192-square.JPG
[18:52] nice hat btw
[18:53] winr4r: benevolent dictator kind of thing I guess? :P
[18:53] i started wearing hats again because jason makes hats cool again http://i.imgur.com/2wre4pN.jpg
[18:53] joepie91: yes
[18:54] http://i.huffpost.com/gen/1058928/thumbs/r-JASON-SCOTT-ARCHIVE-TEAM-large570.jpg?9
[18:54] thats a HAT>
[18:54] chivist: hey you figured out my twitter <3
[18:55] SmileyG: jason can pull that off
[18:55] and as we veer wildly off topic, can we take it to #archiveteam-bs please ;)
[18:56] yes, we can
[19:30] You people never stop talkin' about me.
[19:30] I'm interviewing Apple II nerds!
[19:32] :D
[19:33] http://www.archiveteam.org/index.php?title=ArchiveBox -> Became Archive Warrior?
[19:34] No.
[19:34] But the philosophy was the same.
[19:34] Easier access to contribute downloading abilities that would produce the best data.
[19:35] was an 'ArchiveBox' for debian made?
[19:35] I like asking questions
[19:35] ( ∙_∙)>⌐■-■ (⌐■_■)
[19:38] I don't know, you'd have to chase it down
[19:45] SketchCow: you're misunderstanding. we're just always talking about you when you're not here :)
[21:59] hmm
[21:59] hey guys, can we remove this from the AT Github account? https://github.com/ArchiveTeam/heroku-buildpack-archiveteam
[21:59] I ask because we also have https://github.com/ArchiveTeam/heroku-buildpack-archiveteam-warrior
[21:59] which is much more recent and uses wget-lua vs. wget-warc
[22:00] or, if not remove, I guess I can throw in a note that says "don't use this, use that instead"
[22:02] * yipdw does so
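
Note on the wget + WARC advice given at 18:07 and repeated at 18:35: a rough sketch of what that invocation could look like, written as a small Python helper that only assembles the command and does not run it. The function name, the example.com URL and the "example-site" name are made up for illustration and do not come from the log. One assumption to check against your wget version: as far as I know, wget adds the ".warc.gz" extension itself, so --warc-file is usually given a bare name rather than "something.warc.gz"; the helper strips the suffix on that basis.

```python
import shlex


def build_wget_warc_command(url: str, warc_name: str) -> list[str]:
    """Assemble (but do not run) a wget command that mirrors `url` into a WARC.

    Hypothetical helper for illustration only.
    """
    # Assumption: wget appends ".warc.gz" to the --warc-file value on its own,
    # so a name that already carries the suffix is trimmed to avoid doubling it.
    suffix = ".warc.gz"
    if warc_name.endswith(suffix):
        warc_name = warc_name[: -len(suffix)]
    # --mirror and --warc-file are the two switches named in the log.
    return ["wget", "--mirror", f"--warc-file={warc_name}", url]


if __name__ == "__main__":
    # Placeholder URL and name; swap in the site you actually want to archive.
    cmd = build_wget_warc_command("http://example.com/", "example-site")
    # Print the command so it can be reviewed before being run by hand.
    print(shlex.join(cmd))
```

The printed command can be pasted into a shell on a machine whose wget was built with WARC support; the resulting .warc.gz is what would then be uploaded to the Internet Archive, as described in the log.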