[00:18] I was browsing the wiki and noted an effort to archive old Google Video files - as the Charlie Rose archive was previously available through Google Video I was wondering if any of his interviews are available. The website (charlierose.com) has his complete archive, but the links are dead as of ~6 months ago.
[00:31] Yeah, this is a group thing.
[00:32] * ivan` pokes SketchCow
[02:57] bridgers: @archiveteam Just wanted to give huge s/o for archiving Webshots. I missed deletion notices but you archived my old account! #sohappy [one minute ago]
[02:57] :333333
[02:57] That's awesome
[03:26] \o/
[09:49] http://forum.uschamber.com/library/2013/05/big-data-and-what-it-means
[12:49] Hi
[12:49] is there any possibility to raise the number of workers?
[12:50] atm I'm running 4 separate VMs... and I would prefer to combine them into one to save resources
[12:51] You can open the tty in one and manually change the max value
[12:52] Not easy to find the option and it varies with each job.
[12:53] you mean in /home/warrior/projects/config.json?
[12:53] tried this... but it got overwritten after a few minutes...
[12:53] was thinking about looking through the webpage for the "max 6" limitation...
[13:05] I tried that, but the validation is in the back-end, not within the webpage itself, so it will refuse a number greater than 6 even if you send it in directly.
[13:19] looks like I got it :3
[13:21] anyone have a copy of http://archive.org/details/2011-06-calufa-twitter-sql or some other set of twitter usernames?
[13:32] so i figured out how to grab the theblaze tv highlights
[13:33] i also made it faster to grab by changing hitsPerPage=150
[13:34] this way there are only 7 pages that need to be grabbed for a keyword
[13:49] * ivan` finds http://www.infochimps.com/tags/twitter
[14:17] the quora.com robots.txt uses whitelisting and ia_archiver is not whitelisted :(
[14:20] robots.txt lol
[14:37] robots.txt lol
[14:37] accurate summary of my thoughts on the topic
[14:38] i need help grabbing xml from this: http://web.gbtv.com/gen/multimedia/detail/8/8/5/25571885.xml
[14:38] if you look at the source it's all one line
[14:39] lynx -source 'http://web.gbtv.com/gen/multimedia/detail/8/8/5/25571885.xml' | sed 's/>\n thanks
[14:44] I can't add * http://pipes.yahoo.com/pipes/pipe.run* to a wiki page
[14:44] The following text is what triggered our spam filter: .ru
[14:45] okay another before the ru did it
[14:45] spam filter is pretty annoying when talking about URLs
[19:47] Hi
[19:48] is everything okay with steltek on Formspring?
[19:48] he is submitting a lot of uploads... but they are ALL 0 or 1 MB...
[19:49] Steltek 87GB 14092items
[19:50] if you compare
[19:50] short 1209GB 14076items
[19:58] :/
[20:02] I see the 'out' number is quite high, comparatively..
[20:03] Any chance they've just run up a ton of machines and the small and easy ones are coming back first of all?
[20:03] i think s.o. should check this :3
[20:03] can you check if it's always the same IP?
[20:03] or if the content is okay?
[20:04] warriorhq only shows 32 machines running formspring - not enough to account for that kind of activity
[20:05] could have modded the warrior script to accept loads of jobs but only return the small easy ones? (which I think I'd have to class as a nice hack, despite the disruption)
[20:05] hmm
[20:06] i modified my warrior to support more jobs
[20:06] but not thousands :D
[20:06] nice! :)
[20:06] I'm running 20... before I had 3 VMs...
[20:07] can s.o.
check if the content of Steltek is okay?
[20:08] and someone should change "ArchiveTeam's Choice" to Formspring...
[20:08] Steltek's average is about 6mb per unit - about a tenth of the average
[20:08] the choice clients are idling at the moment...
[20:09] Probably find they're quite innocently returning WARCs full of 'Your ISP does not allow you to access this page.' or something?
[20:09] antomatic: because of that someone should check...
[20:09] agreed.
[20:09] underscor: ping?
[20:10] Or 'Your monthly bandwidth allocation? Gone, so gone. Call us now if you want more internets. Have money. 1-800-PAY-MOAR" etc.
[20:11] ^^
[20:12] alard: ping?
[20:12] SilSte: Hi.
[20:12] I asked for SSH access :(
[20:12] alard: check warcs returned by steltek plz
[20:12] lots of 0/1Mb units compared to everyone else getting normal sizes
[20:12] Which project?
[20:12] Formspring
[20:12] formspring
[20:13] 2. Add me to ssh? XD
[20:13] and can you check why there are so many packets out?
[20:18] alard: and can you change the automatic clients to formspring? They are idling atm...
[20:22] Do I block Steltek?
[20:23] yah for now
[20:23] :/
[20:23] Until we can confirm those are valid warcs
[20:23] He might just be really lucky or something D:
[20:23] alard: did you check the warcs?
[20:24] I can't. They're uploaded to a server I don't have access to.
[20:24] is his IP address in a country that might be filtering a site like formspring?
[20:25] http://warriorhq.archiveteam.org/projects.json is still auto_project: posterous
[20:25] Boy, I would LOVE it that when people upload stuff to archive.org, that they put one PDF per item.
[20:25] * Nemo_bis hides
[20:26] Yes, what a galactic pain in my ass.
[20:26] I'm half considering listing what they are to you, deleting them, and having you do it right.
[20:26] Sometimes I failed to do so because I had no way to sort the PDFs by page...
[20:26] Oh, there's a few that are COMPLETELY unusable.
[20:26] It's only one magazine, I know which it is.
Though I may have deleted from disk.
[20:27] Yes, only a couple though IIRC:
[20:27] On the other hand they're still indexed by search engines etc.
[20:28] Better than a single PDF merged with mistakes. Do you have suggestions on how to deal with such masses of unsorted articles?
[20:31] http://archive.org/details/starwarsrpgswedish
[20:48] DEFAULT PROJECT: FORMSPRING.
[20:48] thx
[20:49] SketchCow: PM. When you have time, ty.
[20:49] [applausesauce]
[20:50] is there anything good on formspring.me?
[20:50] Smiley: did you check stephk?
[20:50] I can't yet SilSte we don't have access to the repo where the warc's go.
[20:50] but he's blocked for now.
[20:50] ivan`: It's the only available project atm...
[20:51] Smiley: okay. Thought you may have ^^
[20:51] SilSte: I'm just curious if there's anything interesting on it
[20:51] ivan`: it's like ask.fm
[20:52] what about an archive of piratenpad.de or the wiki of the german pirate party?
[20:52] don't think that a backup hurts ^^
[20:52] We should archive the iTunes store! Text, metadata, 60-second previews... mmm....
[20:52] but I'm not familiar with the tools...
[20:53] [not entirely serious]
[20:53] Imagine how interesting a catalogue of all available wax cylinders from 1825 would be.
[20:53] antomatic: My question was serious ;-). They are starting to delete old pads on piratenpad.de
[20:53] SilSte: see my wiki page for a "default warc grab"
[20:53] http://www.archiveteam.org/index.php?title=User:Djsmiley2k
[20:54] that'll generally give you a sensible grab
[20:54] Actually a correct phrasing of that sentence would be "Imagine how BLANK a catalogue of wax cylinders from 1825 would be." - Maybe 1895 then. :)
[20:54] Good point, Silste.
[20:55] Smiley: that doesn't help on piratenpad...
[20:56] It's possible to download the wiki as a file (without the media stuff)
[20:56] afaik
[21:10] https://twitter.com/kpepper/status/342345097154797568
[21:13] to be fair, uploading multiple pdfs wouldn't be half so unusable if archive.org spent five seconds to add a sort() call to the item page
[21:39] anyone have more than 25M twitter usernames/API IDs? http://www.infochimps.com/datasets/twitter-census-developer-tools-mapping-from-twitter-user-search-
[21:43] * ivan` also finds http://help.sentiment140.com/for-students
[21:44] and http://an.kaist.ac.kr/traces/WWW2010.html
[21:53] SketchCow: i see that you pushed my g4 forum dumps to their own collection in archiveteam
[21:53] thanks
[22:01] * Smiley ponders if IGN/Gamespy can have a collection yet.
[22:07] 80ish items
[22:22] Boy, know what I need?
[22:22] I mean, REALLY need?
[22:22] Sleep?
[22:22] I need someone, right now, giving me more "to-dos".
[22:22] I'm already cleaning up hundreds of objects dumped into opensource
[22:22] I have scripts, but it's killer.
[22:23] Meanwhile, my room is a disaster and I was trying to retrieve a CD-ROM and I'm afraid it's in the "harder to get to" part of the shipping container.
[22:24] Ha! You were trying to retrieve a single CD-ROM from the back of that huge container! :) (Needle in a haystack pun here)
[22:24] So how about I focus on the billions uploaded by Nemo_bis and godane, then we'll get over to the others.
[22:24] I have a collection of CD-ROMs in there, but I am afraid they're behind some items.
[22:24] It's not hard, it's just there's stuff that needs two-person lifts.
[22:25] http://archive.org/search.php?query=collection%3Ancompasslive&sort=-publicdate
[22:25] Also, adding those
[22:25] Also, Xanga.
[22:25] Also, someone just let me know about another forum dying
[22:25] http://forum.worldofplayers.de/forum/threads/1247321-WoG-com-is-closing
[22:25] anyone, take it
[22:28] So yeah, don't pile it on today
[22:28] Also, I went away for 15 days, just got home.
[22:28] Lost 3 pounds
[22:28] By this rate, I will be a sexy MF and people will give us stuff just because I look at them
[22:28] (Late: cake)
[22:32] Now I'm blasting https://www.youtube.com/watch?v=GxukqlSmhco at 150db and nobody can stop me
[22:38] http://i.imgur.com/Ltjasf0.gif
[22:39] wtf why are you in my head.
[22:41] http://forum.xentax.com/index.php may be worth snapshotting (note that a lot of tools are mediafire/etc though)