#archiveteam 2013-06-06,Thu

Time Nickname Message
00:18 🔗 ColTim I was browsing the wiki and noted an effort to archive old Google Video files - as the Charlie Rose archive was previously available through Google Video I was wondering if any of his interviews are available. The website (charlierose.com) has his complete archive, but the links are dead as of ~6 months ago.
00:31 🔗 SketchCow Yeah, this is a group thing.
00:32 🔗 * ivan` pokes SketchCow
02:57 🔗 underscor <swebb> bridgers: @archiveteam Just wanted to give huge s/o for archiving Webshots. I missed deletion notices but you archived my old account! #sohappy [one minute ago]
02:57 🔗 underscor :333333
02:57 🔗 underscor That's awesome
03:26 🔗 cmx \o/
09:49 🔗 Nemo_bis http://forum.uschamber.com/library/2013/05/big-data-and-what-it-means
12:49 🔗 mib_p0g4c Hi
12:49 🔗 mib_p0g4c is there any possibility to raise the number of workers?
12:50 🔗 mib_p0g4c atm I'm running 4 separate VMs... and I would prefer to combine them into one to save resources
12:51 🔗 tyn You can open the tty in one and manually change the max value
12:52 🔗 tyn Not easy to find the option and it varies with each job.
12:53 🔗 mib_p0g4c you mean in /home/warrior/projects/config.json?
12:53 🔗 mib_p0g4c tried this... but it got overwritten after a few minutes...
12:53 🔗 mib_p0g4c was thinking about looking through the webpage for the "max 6" limitation...
13:05 🔗 antomatic I tried that, but the validation is in the back-end, not within the webpage itself, so it will refuse a number greater than 6 even if you send it in directly.
13:19 🔗 mib_p0g4c looks like I got it :3
13:21 🔗 ivan` anyone have a copy of http://archive.org/details/2011-06-calufa-twitter-sql or some other set of twitter usernames?
13:32 🔗 godane so i figured out how to grab the theblaze tv highlights
13:33 🔗 godane i also made it faster to grab by changing hitsPerPage=150
13:34 🔗 godane this way there are only 7 pages that need to be grabbed for a keyword
13:49 🔗 * ivan` finds http://www.infochimps.com/tags/twitter
14:17 🔗 balrog the quora.com robots.txt uses whitelisting and ia_archiver is not whitelisted :(
14:20 🔗 omf_ robots.txt lol
14:37 🔗 joepie91 <omf_>robots.txt lol
14:37 🔗 joepie91 accurate summary of my thoughts on the topic
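A whitelisting robots.txt usually names the crawlers it accepts and disallows everyone else. The sketch below uses Python's standard urllib.robotparser with a made-up robots.txt (an assumption for illustration, not Quora's actual file) to show how such a file would turn away ia_archiver while letting a listed crawler through.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt; NOT the real quora.com file.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com is a placeholder URL.
    for agent in ("Googlebot", "ia_archiver"):
        print(agent, allowed(agent, "http://example.com/some-page"))
```

Any crawler not listed by name falls through to the catch-all `User-agent: *` entry, which is what makes the file a whitelist.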
14:38 🔗 godane i need help grabbing xml from this: http://web.gbtv.com/gen/multimedia/detail/8/8/5/25571885.xml
14:38 🔗 godane if you look at the source it's all one line
14:39 🔗 ivan` lynx -source 'http://web.gbtv.com/gen/multimedia/detail/8/8/5/25571885.xml' | sed 's/></>\n</g'
14:43 🔗 godane thanks
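For anyone without lynx or GNU sed at hand, the same idea (break the one-line XML wherever one tag ends and the next begins) can be sketched in Python. The function name `split_tags` is made up for this example, and this is plain text substitution, not a real XML pretty-printer.

```python
import re
import sys


def split_tags(xml_text: str) -> str:
    # Put a newline between every adjacent "><" pair,
    # mirroring the sed expression s/></>\n</g from the log.
    return re.sub(r"><", ">\n<", xml_text)


if __name__ == "__main__":
    # Usage sketch: pipe the downloaded XML into this script on stdin.
    sys.stdout.write(split_tags(sys.stdin.read()))
```

Like the sed one-liner, this also splits any `><` that happens to occur inside text or CDATA, so it is only meant for eyeballing the structure.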
14:44 🔗 ivan` I can't add * http://<font></font>pipes.yahoo.com/pipes/pipe.run* to a wiki page
14:44 🔗 ivan` The following text is what triggered our spam filter: .ru
14:45 🔗 ivan` okay another <font> before the ru did it
14:45 🔗 ivan` spam filter is pretty annoying when talking about URLs
19:47 🔗 SilSte Hi
19:48 🔗 SilSte is everything okay with steltek on Formspring?
19:48 🔗 SilSte he is submitting a lot of uploads... but they are ALL 0 or 1 MB...
19:49 🔗 SilSte Steltek 87GB 14092items
19:50 🔗 SilSte if you compare
19:50 🔗 SilSte short 1209GB 14076items
19:58 🔗 Smiley :/
20:02 🔗 antomatic I see the 'out' number is quite high, comparatively..
20:03 🔗 antomatic Any chance they've just run up a ton of machines and the small and easy ones are coming back first of all?
20:03 🔗 SilSte i think s.o. should check this :3
20:03 🔗 SilSte can you check if its always the same IP?
20:03 🔗 SilSte or if the content is okay?
20:04 🔗 antomatic warriorhq only shows 32 machines running formspring - not enough to account for that kind of activity
20:05 🔗 antomatic could have modded the warrior script to accept loads of jobs but only return the small easy ones? (which I think I'd have to class as a nice hack, despite the disruption)
20:05 🔗 antomatic hmm
20:06 🔗 SilSte i modified my warrior to support more jobs
20:06 🔗 SilSte but not thousands :D
20:06 🔗 antomatic nice! :)
20:06 🔗 SilSte I'm running 20... before I had 3 VMs...
20:07 🔗 SilSte can s.o. check if the content of Steltek is okay?
20:08 🔗 SilSte and someone should change "ArchiveTeams Choice" to Formspring...
20:08 🔗 antomatic Steltek's average is about 6mb per unit - about a tenth of the average
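The per-item figure can be checked from the tracker numbers quoted earlier (Steltek: 87GB over 14092 items; short: 1209GB over 14076 items). This is a rough sketch that assumes 1 GB = 1024 MB and compares against one other user only, since the project-wide average is not given in the log.

```python
def avg_mb_per_item(total_gb: float, items: int) -> float:
    """Average upload size per item in MB, assuming 1 GB = 1024 MB."""
    return total_gb * 1024 / items


# Figures as quoted in the channel.
steltek = avg_mb_per_item(87, 14092)
short = avg_mb_per_item(1209, 14076)

if __name__ == "__main__":
    print(f"Steltek: {steltek:.1f} MB/item")
    print(f"short:   {short:.1f} MB/item")
    print(f"ratio:   {steltek / short:.2f}")
```

With nearly identical item counts, the gap comes entirely from the total bytes uploaded, which is why the small per-item size stood out.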
20:08 🔗 SilSte the choice clients are idling at the moment...
20:09 🔗 antomatic Probably find they're quite innocently returning WARCs full of 'Your ISP does not allow you to access this page.' or something?
20:09 🔗 SilSte antomatic: because of that someone should check...
20:09 🔗 antomatic agreed.
20:09 🔗 SilSte underscor: ping?
20:10 🔗 antomatic Or 'Your monthly bandwidth allocation? Gone, so gone. Call us now if you want more internets. Have money. 1-800-PAY-MOAR" etc.
20:11 🔗 SilSte ^^
20:12 🔗 SilSte alard: ping?
20:12 🔗 alard SilSte: Hi.
20:12 🔗 Smiley I asked for SSH access :(
20:12 🔗 Smiley alard: check warcs returned by steltek plz
20:12 🔗 Smiley lots of 0/1Mb units compared to everyone else getting normal sizes
20:12 🔗 alard Which project?
20:12 🔗 antomatic Formspring
20:12 🔗 SilSte formspring
20:13 🔗 Smiley 2. Add me to ssh? XD
20:13 🔗 SilSte and can you check y there are so many packets out?
20:18 🔗 SilSte alard: and can you change the automatic clients to formspring? They are idling atm...
20:22 🔗 alard Do I block Steltek?
20:23 🔗 Smiley yah for now
20:23 🔗 Smiley :/
20:23 🔗 Smiley Until we can confirm those are valid warcs
20:23 🔗 Smiley He might just be really lucky or something D:
20:23 🔗 SilSte alard: did you check the warcs?
20:24 🔗 alard I can't. They're uploaded to a server I don't have access to.
20:24 🔗 antomatic is his IP address in a country that might be filtering a site like formspring?
20:25 🔗 ivan` http://warriorhq.archiveteam.org/projects.json is still auto_project: posterous
20:25 🔗 SketchCow Boy, I would LOVE it that when people upload stuff to archive.org, that they put one PDF per item.
20:25 🔗 * Nemo_bis hides
20:26 🔗 SketchCow Yes, what a galactic pain in my ass.
20:26 🔗 SketchCow I'm half considering listing what they are to you, deleting them, and having you do it right.
20:26 🔗 Nemo_bis Sometimes I failed to do so because I had no way to sort the PDFs by page...
20:26 🔗 SketchCow Oh, there's a few that are COMPLETELY unusable.
20:26 🔗 Nemo_bis It's only one magazine, I know which it is. Though I may have deleted from disk.
20:27 🔗 Nemo_bis Yes, only a couple though IIRC:
20:27 🔗 Nemo_bis On the other hand they're still indexed by search engines etc.
20:28 🔗 Nemo_bis Better than a single PDF merged with mistakes. Do you have suggestions on how to deal with such masses of unsorted articles?
20:31 🔗 SketchCow http://archive.org/details/starwarsrpgswedish
20:48 🔗 Smiley DEFAULT PROJECT: FORMSPRING.
20:48 🔗 SilSte thx
20:49 🔗 Smiley SketchCow: PM. When you have time, ty.
20:49 🔗 antomatic [applausesauce]
20:50 🔗 ivan` is there anything good on formspring.me?
20:50 🔗 SilSte Smiley: did you check stephk?
20:50 🔗 Smiley I can't yet SilSte we don't have access to the repo where the warc's go.
20:50 🔗 Smiley but he's blocked for now.
20:50 🔗 SilSte ivan`: It's the only available project atm...
20:51 🔗 SilSte Smiley: okay. Thought you may have ^^
20:51 🔗 ivan` SilSte: I'm just curious if there's anything interesting on it
20:51 🔗 SilSte ivan`: its like ask.fm
20:52 🔗 SilSte what about an archive of piratenpad.de or the wiki of the german pirate party?
20:52 🔗 SilSte don't think that a backup hurts ^^
20:52 🔗 antomatic We should archive the iTunes store! Text, metadata, 60-second previews... mmm....
20:52 🔗 SilSte but I'm not familiar with the tools...
20:53 🔗 antomatic [not entirely serious]
20:53 🔗 antomatic Imagine how interesting a catalogue of all available wax cylinders from 1825 would be.
20:53 🔗 SilSte antomatic: My question was serious ;-). They are starting to delete old pads on piratenpad.de
20:53 🔗 Smiley SilSte: see my wiki page for a "default warc grab"
20:53 🔗 Smiley http://www.archiveteam.org/index.php?title=User:Djsmiley2k
20:54 🔗 Smiley that'll generally give you a sensible grab
20:54 🔗 antomatic Actually a correct phrasing of that sentence would be "Imagine how BLANK a catalogue of wax cylinders from 1825 would be." - Maybe 1895 then. :)
20:54 🔗 antomatic Good point, Silste.
20:55 🔗 SilSte Smiley: that doesn't help on piratenpad...
20:56 🔗 SilSte It's possible to download the wiki as a file (without the media stuff)
20:56 🔗 SilSte afaik
21:10 🔗 SketchCow https://twitter.com/kpepper/status/342345097154797568
21:13 🔗 DFJustin to be fair, uploading multiple pdfs wouldn't be half so unusable if archive.org spent five seconds to add a sort() call to the item page
21:39 🔗 ivan` anyone have more than 25M twitter usernames/API IDs? http://www.infochimps.com/datasets/twitter-census-developer-tools-mapping-from-twitter-user-search-
21:43 🔗 * ivan` also finds http://help.sentiment140.com/for-students
21:44 🔗 ivan` and http://an.kaist.ac.kr/traces/WWW2010.html
21:53 🔗 godane SketchCow: i see that you push my g4 forum dumps to its own collection in archiveteam
21:53 🔗 godane thanks
22:01 🔗 * Smiley ponders if IGN/Gamespy can have a collection yet.
22:07 🔗 Smiley 80ish items
22:22 🔗 SketchCow Boy, know what I need?
22:22 🔗 SketchCow I mean, REALLY need?
22:22 🔗 swebb Sleep?
22:22 🔗 SketchCow I need someone, right now, giving me more "to-dos".
22:22 🔗 SketchCow I'm already cleaning up hundreds of objects dumped into opensource
22:22 🔗 SketchCow I have scripts, but it's killer.
22:23 🔗 SketchCow Meanwhile, my room is a disaster and I was trying to retrieve a CD-ROM and I'm afraid it's in the "harder to get to" part of the shipping container.
22:24 🔗 swebb Ha! You were trying to retrieve a single CD-ROM from the back of that huge container! :) (Needle in a haystack pun here)
22:24 🔗 SketchCow So how about I focus on the billions uploaded by Nemo_bis and godane, then we'll get over to the others.
22:24 🔗 SketchCow I have a collection of CD-ROMs in there, but I am afraid they're behind some items.
22:24 🔗 SketchCow It's not hard, it's just there's stuff that needs two-person lifts.
22:25 🔗 SketchCow http://archive.org/search.php?query=collection%3Ancompasslive&sort=-publicdate
22:25 🔗 SketchCow Also, adding those
22:25 🔗 SketchCow Also, Xanga.
22:25 🔗 SketchCow Also, someone just let me know about another forum dying
22:25 🔗 SketchCow http://forum.worldofplayers.de/forum/threads/1247321-WoG-com-is-closing
22:25 🔗 SketchCow anyone, take it
22:28 🔗 SketchCow So yeah, don't pile it on today
22:28 🔗 SketchCow Also, I went away for 15 days, just got home.
22:28 🔗 SketchCow Lost 3 pounds
22:28 🔗 SketchCow By this rate, I will be a sexy MF and people will give us stuff just because I look at them
22:28 🔗 SketchCow (Late: cake)
22:32 🔗 SketchCow Now I'm blasting https://www.youtube.com/watch?v=GxukqlSmhco at 150db and nobody can stop me
22:38 🔗 SketchCow http://i.imgur.com/Ltjasf0.gif
22:39 🔗 Smiley wtf why are you in my head.
22:41 🔗 balrog http://forum.xentax.com/index.php may be worth snapshotting (note that a lot of tools are mediafire/etc though)
