[00:45] How does the IA take in site grabs that do not have warcs?
[00:47] they don't
[00:47] well, not into waybackmachine
[00:48] What if you have all the data that makes the warc
[00:48] like the transfer time, size, headers, etc...
[00:48] then I suppose you could make a warc?
[00:49] I guess I could write a conversion program.
[00:56] you would need like a wget log of the files being grabbed for this to work
[00:56] in theory
[01:06] that won't have headers tho
[01:07] that's why i said in theory
[01:07] was not sure
[01:22] what.
[01:22] so i went to bed thinking maybe the local warrior that i have running will stop
[01:22] nope
[01:22] it's on 7010 URLs and counting
[01:25] perfect!
[01:33] Is there a script that I could run instead of using the warrior VM?
[01:34] I have a few extra IPs I could run from, but won't have a virtualized environment to run under.
[01:34] marczak, the peeps in #warrior can answer that
[01:34] great - thanks
[01:36] the answer is "yes" but i don't have a link to it handy
[01:38] DrDeke: thanks - someone in #warrior is helping out.
[02:14] For all the new warriors out there we have long term projects after yahoo and posterous. #urlteam is constantly unfucking the url shorteners so we can find sites without twitter, bitly, etc...
[02:15] That is our proactive side to saving the web.
[02:31] mah, don't send people to #warrior when they're asking project specific questions
[02:32] marczak: You can run the scripts from: https://github.com/ArchiveTeam/yahoomessages-grab/
[02:32] that's the stand-alone ones. You'll need to compile wget though (script is checked in there ^) and install the seesaw python package.
[03:40] I think we just exploded the Yahoo
[03:40] well they had it coming
[03:46] i just saved low rider world 2006 clip of attack of the show
[03:46] it was one of the flvsm videos that i couldn't get
[03:55] We just destroyed the Yahoo! backlog
[03:57] and how
[03:57] The graph looks like a zombie death apocalypse
[04:00] 40G .
[04:00] root@teamarchive-1:/2/DISCOGS/www.discogs.com/data# du -sh .
[04:00] by the way
[04:11] SketchCow, was that a preventative grab?
[04:13] Yes
[04:13] I'm working with MusicBrainz to get their stuff on archive.
[04:13] And they said "You know, I don't know of any mirrors of discogs.org"
[04:14] might do vgmdb.net while you're at it
[04:20] Show me where you can download the DB and I will.
[04:20] DFJustin, I already got a grab of vgmdb.net
[04:21] it is about 8 months old though
[04:21] o/\o
[04:22] I want to merge some of their data into freebase
[06:33] why don't we have all warriors running urlteam in the background all the time?
[06:35] :)
[06:35] It would help
[07:12] We need to recruit someone who has google fiber, it could be real helpful
[07:13] just throwing that out there
[07:49] Man, it's going k-razy out there
[07:49] My Hard Drive full of goodness goes out Monday
[07:49] Working now to build up the maximum amount of data on it
[07:50] You ship hard drives as well as upload? Talk about no stone unturned :)
[08:01] Have to.
[08:01] I send in 400-500gb a hit
[08:03] whumph whumph
[08:05] Do you have shock proof cases for mailing? I always wanted to ask how those work out.
[08:06] if I were mailing hdds I'd probably reuse original hdd packing materials
[08:06] seems to work
[08:26] in case I get hit by a meteor in the next 3 months somebody better remember to scrape all of Reader's *.blogspot.com/atom.xml feeds in addition to the feed URLs they currently use
[08:26] e.g. xooglers.blogspot.com/atom.xml gets you completely different content
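On the idea raised at [00:48]-[00:49] of turning a non-WARC grab plus its logged metadata back into a WARC: a conversion program along those lines is feasible whenever the response headers were actually kept, which, as noted at [01:06], a plain wget log does not give you. A minimal sketch using the warcio Python library is below; the filenames, URL, headers, and timestamp are placeholders standing in for whatever the grab's logs really recorded, not anything from this conversation.

```python
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

# Placeholder data that would really come from the grab's logs:
# the URL, the fetch time, the response headers, and the body on disk.
url = "http://www.example.com/page.html"
fetch_time = "2013-03-25T00:48:00Z"
header_list = [("Content-Type", "text/html")]

with open("converted.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    http_headers = StatusAndHeaders("200 OK", header_list, protocol="HTTP/1.1")
    with open("page.html", "rb") as body:
        # Build a 'response' record from the saved body and logged headers.
        record = writer.create_warc_record(
            url, "response",
            payload=body,
            http_headers=http_headers,
            warc_headers_dict={"WARC-Date": fetch_time},
        )
        writer.write_record(record)
```

One record per logged fetch, looped over the whole grab, would yield a WARC the Wayback Machine could ingest; without the original headers the best you can do is synthesize minimal ones, which is the "in theory" caveat above.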
[11:00] chronomex: I do.
[11:52] ivan`: Different content than what?
[14:54] Our clown information is growing nicely. If you have any observations you would like to add http://www.archiveteam.org/index.php?title=Clown_hosting
[16:19] omf_: Are there any guidelines for including providers in that list?
[16:22] website url, price point, specs, and any insights into why the service works so well or problems with it
[16:23] the joyent and DO are good examples we have built out
[16:23] we have vps and cloud providers on there
[16:25] bandwidth and storage are right up there with price point as important data we need
[16:57] Ok, I added BuyVM
[16:59] chazchaz, you use them recently?
[16:59] Yeah, I have 2 servers with them.
[16:59] One for over a year
[17:03] What can you fit in 128mb ram
[17:04] I cannot think of too much you could run
[17:04] I could host my photos on there. Cheaper than flickr
[17:07] edis.at has a good 128MB miniVPS option.
[17:07] I run lower-traffic Tor relays and bridges on that kind of box.
[17:08] and it was quite happy to run the yahoomessages-grab script.
[17:09] omf_: They let you burst up to 2x as long as it's available, which seems to be almost all the time. I'm using 150 MB for 40 posterous processes and 2 yahoo-messages processes
[17:10] chazchaz, you should make a note on the wiki, that is valuable info
[17:13] done
[17:14] thanks
[17:36] i'm kind of offended that there is a wiki page called "Clown hosting" and my apartment closet isn't eligible to be listed in it ;)
[17:37] outage notifications? pshhh, yeah maybe i'll email you if i decide to take the server apart for some reason 5 minutes before i do it if you have a VM on it
[17:39] Just check it yourself. That's what ping is for right?
[17:41] exactly!
[17:41] i made a major jump in my level of customer service a couple months ago when i put everyone's email address that i could track down in a google spreadsheet
[17:41] sometimes it gets copy and pasted into a bcc
[17:41] sometimes... =)
[17:42] (nobody is paying, so, you know...)
[17:42] 'wall' ought to be acceptable notice for planned maintenance
[17:42] i actually got to do that on a couple servers at my real job last night
[17:43] "Oh, we forgot to mention that part in the email? Well, just shutdown +30 it, the users will be fine."
[17:43] (needless to say, that is not the way it normally works there)
[17:43] since the system these servers are for was going to be completely down anyway, we figured oh well
[19:31] Did someone already grab the ign forums?
[19:41] omf_: ask in #ispygames
[19:41] someone was doing work on a lot of that stuff there
[19:41] that is me
[19:42] I just checked the scroll back to the 22nd of last month and nothing
[19:45] D:
[19:45] sorry for being an idiot then ;)
[19:46] No worries. It is hard to follow so many projects going on.
[19:46] aye
[19:46] I know some forums for some sites were grabbed but nothing about the main ign
[19:48] The wiki is down
[19:49] Resource Limit Is Reached errors a few times
[19:49] seems fine again now
[20:29] It happens.
[20:30] SketchCow, Is it alright if I start uploading that 4data to you?
[20:31] It is 102gb
[20:31] and it will probably take over a week to upload, possibly longer
[20:34] What 4data?
[20:34] I mean, I'm sure we discussed it. What is it?
[20:35] The 4chandata dump
[20:35] from that archive site that is closed
[20:35] Oh, of course.
[20:35] Yeah, go ahead. Do you need credentials?
[20:35] I already got them
[20:36] I am still waiting on the database dump itself but I am not worried. This guy has come through on everything he said so far
[21:19] 4 Get your free Psybnc 100 user have it come http://www.multiupload.nl/B11JFCYQH6
[21:20] lol
[21:22] In case anyone is wondering: https://www.virustotal.com/en/file/f897432de88adce73b23741da1a133b6a79b8233d50571451dab4b992931d173/analysis/1364160122/
[21:23] errrr
[21:23] what's that from?
[21:23] That's the free Psybnc
[21:23] Hm, I wonder if xchat logs bans
[21:24] So many nicknames for this virus.
[21:33] is there a ratelimiter on formspring?
[21:37] howdy doo! I'm reporting back. Soultcer helped me yesterday with digging into the btinternet stuff (http://archive.org/details/archiveteam-btinternet)
[21:38] Did it work?
[21:39] yes indeedie! - i wrote some horrible awk scripts to parse the CDX files for stuff I was interested in, download via curl, unpack, and now I'm browsing thru some vintage .au and .wav files ... very cool
[21:39] Sweet
[21:41] very kind of you to help and encourage me to carry on, i was almost convinced that the megawarc files would have to be downloaded in entirety (or at least an entire megawarc) to get anything out of them
[21:42] i was right about to say "ehh.... it probably doesn't work like that", and give up, but you convinced me. and it's certainly very cool to browse thru this stuff!
[21:47] Neat :)
[21:50] chronomex: Yes.
[21:50] (I set a rate limit on the tracker, that is.)
[21:50] ah
[21:51] But that limit is not reached, at the moment. I set it to 20 to be safe, but we're currently at 2-4 per minute.
[21:52] I meant running multiple threads on my end
[21:53] I don't know how Formspring behaves.
[21:54] ok, I'll just run 1 for now
[22:11] would it be possible to get a message asking for assistance on the formspring project in the topic?
[22:12] sure, is there a channel for it?
[22:12] wp494: Are we sure that it works?
[22:13] alard: yep, I've been running 3 concurrent for an hour or two and haven't run into any issues
[22:13] chronomex: #firespring
[22:13] and others that pop up on the tracker appear to have no issues
[22:15] wp494: Yes, that's one thing. But does it get everything we want to get?
[22:16] It's a complicated script.
[22:16] hrm
[22:16] if you want to hold off on adding to the topic, feel free
[22:17] I'm inclined to wait for alard to sign off
[22:17] I've checked one or two warcs and they looked good (with the last version of the script, at least).
[22:18] We could go with full force, but there's a small risk that we need to do things again.
[22:18] I haven't been able to find out about the pagination on the photo albums, for example.
[22:18] hm
[22:18] (Because I haven't found a user with enough photos.)
[22:21] have you tried any triple digit/close to triple digit users?
[22:21] (in file size terms)
[22:24] Good idea. I just did that, but didn't see any user with more than 20 pictures. They're big because of something else.
[22:28] probably formspringaholics
[22:32] DFJustin, Did you want a copy of vgmdb?
[22:33] I think Formspring works well enough. Checked another warc with the warc-proxy, no missing pages.
[22:34] If there are people with too many pictures they'll at least be included via the Previous-Next buttons.
[22:35] There are a few pagination things that don't work (the 'who smiled at this'-thing, for example), but that's due to Formspring.
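As an aside on the btinternet workflow described at [21:39]-[21:41]: pulling one file out of a megawarc without downloading the whole thing comes down to reading the record's length, offset, and warc filename from the CDX index and issuing an HTTP Range request against the item on archive.org. A rough Python equivalent of those awk/curl scripts might look like the sketch below; the CDX line, the column positions (last three columns as compressed length, offset, filename), and the exact filenames inside the item are assumptions based on the usual megawarc layout, not something confirmed in the log.

```python
import gzip
import requests  # assumes the requests library is installed

# Hypothetical CDX line; in the common layout the last three columns are
# <compressed record length> <offset into the warc.gz> <warc.gz filename>.
cdx_line = ("uk,btinternet,www.example)/sounds/hello.au 20120901000000 "
            "http://www.example.btinternet.co.uk/sounds/hello.au audio/basic 200 "
            "ABC123PLACEHOLDERDIGEST - - 2048 1234567 btinternet-megawarc-001.warc.gz")
fields = cdx_line.split()
length, offset, warc_name = int(fields[-3]), int(fields[-2]), fields[-1]

# Fetch just that slice of the megawarc from the archive.org item.
url = "https://archive.org/download/archiveteam-btinternet/" + warc_name
resp = requests.get(url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
resp.raise_for_status()

# Each record in a .warc.gz is its own gzip member, so the slice decompresses
# on its own into the WARC headers, the HTTP headers, and the payload.
record = gzip.decompress(resp.content)
print(record[:200].decode("latin-1"))
```

Splitting the WARC and HTTP headers off the payload (or feeding the record to something like warc-proxy, mentioned later at [22:33]) then gets you back the original .au or .wav file.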
[22:42] namespace | I'm worried about google groups.
[22:42] chronomex | hmmmmmmm
[22:42] namespace | It's basically dead as far as I can tell, and to my knowledge is one of the largest usenet archives.
[22:42] chronomex | I'm with you there
[22:42] chronomex | it'd be good to turn it back into a news spool
[22:42] chronomex | the way usenet was meant to be
[22:42] yes, ggroups is a worthy opponent
[22:43] And because it's google, you know that the shutdown is a matter of when not if.
[22:43] do you think google wouldn't be willing to ship some hard drives to the internet archive if they ever shut ggroups down?
[22:43] True.
[22:43] I'd hope they would anyway.
[22:43] we'd need to find a crooked googler
[23:04] From my own research we can piece together sections of usenet history from what is already available
[23:04] which is better than nothing.
[23:04] omf_: I don't personally want a copy but having one on archive.org would be nice
[23:05] I am doing a refresh on it now
[23:05] thomasbk: Always assume the answer to that question is no, unless you're sure
[23:05] That's my rule of thumb
[23:06] Universities still have tapes full of usenet archives
[23:06] it is just finding the tapes and people there who can pull the data out
[23:07] Another angle would be to get the usenet data loaded into BigQuery
[23:07] tapes used to be really expensive
[23:07] from what I read google looked under a lot of rocks to get what they have, I'm not sure there's really a lot more out there
[23:12] anyone have any guesses wrt the legalities of rehosting stuff like the yahoo messages content?
[23:13] nope
[23:13] ersi: different from what you get from http://xooglers.blogspot.com/feeds/posts/default or http://xooglers.blogspot.com/
[23:14] ivan`: oh, huh
[23:14] thomasbk: most of us don't give two fucks about that
[23:14] I just checked up on my usenet sources
[23:14] I got partial archives going back over 10 years for some groups
[23:15] We could do it
[23:15] add that to what is already on the IA and we would have over 50% of everything as a starting point
[23:21] The longer we wait, the harder it will be to find older data - makes sense to get started on it
[23:22] I can start cutting it up to feed to the warrior
[23:22] We are going to have to hit dozens of different archives
[23:23] I have been tracking this for a few years and there are more archives online now than before
[23:23] People are starting to open things up
[23:23] i know google has a usenet archive but it's in their weird google format (missing original headers etc?) so not super useful?
[23:23] plus hosting is cheaper for larger data sets
[23:23] also missing all the attachments
[23:23] I thought that google usenet posts are retrievable in original form
[23:24] they are
[23:34] So I've been downloading on the yahoo task all day. It's taken about 12 hours to download nearly 10,000 urls on Item threads-b-1036-3. Can anyone check if someone else has submitted this by now? Or how many urls there will be?
[23:35] Seems pretty slow, but I guess that's due to the rate limit?
[23:49] Question: Why isn't there a standard URL shortener algorithm in browsers?
[23:50] gzip | base64 or something?
[23:50] Something like that.
[23:51] It's totally ridiculous that it's even a service. It's obviously something users want, and it could totally be done client side.
[23:51] I can't think of a single aspect that requires a server to be involved.
[23:52] namespace, do you know why people use url shorteners
[23:54] omf_: Because it's simple and long urls are ugly? (Unless it's for shock sites. But then why would you want to archive them?)
[23:54] That and for twitter.
[23:56] URL shortening services were invented as a way to add a step in the process which allows data to be collected on the user. This is then sold to ad companies
[23:56] that is the whole point of bitly etc
[23:56] It has no benefit to end users
[23:56] Interesting. Source?
[23:56] okay - while it is a problem, that's not true
[23:57] if you're trying to share a link on a character-constrained environment, you're going to run into the URL issue
[23:58] I don't disagree folks found it was a great way to get analytics on web traffic
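For what it's worth, here is a minimal sketch of the "gzip | base64" idea floated at [23:50], purely as an illustration of what a client-side scheme would look like (it uses zlib, the same deflate algorithm with a smaller wrapper). One practical wrinkle it exposes: for typical short URLs the compression header plus base64 expansion usually makes the result longer than the original, so a purely local transform does not actually shorten anything; the assumed URL below is just an example.

```python
import base64
import zlib

def shorten(url: str) -> str:
    """Compress and base64url-encode a URL entirely client-side (no server)."""
    compressed = zlib.compress(url.encode("utf-8"), 9)
    return base64.urlsafe_b64encode(compressed).decode("ascii").rstrip("=")

def expand(token: str) -> str:
    """Reverse of shorten(): re-pad, decode, and decompress."""
    padded = token + "=" * (-len(token) % 4)
    return zlib.decompress(base64.urlsafe_b64decode(padded)).decode("utf-8")

if __name__ == "__main__":
    url = "http://www.archiveteam.org/index.php?title=Clown_hosting"
    token = shorten(url)
    # The "short" form is often longer than the input for URL-sized strings.
    print(len(url), len(token))
    assert expand(token) == url
```

That length penalty, on top of the character-limit and analytics motives discussed above, is part of why the hosted services stuck around and why #urlteam has to keep unshortening them.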