Time |
Nickname |
Message |
00:18
🔗
|
ColTim |
I was browsing the wiki and noted an effort to archive old Google Video files - as the Charlie Rose archive was previously available through Google Video I was wondering if any of his interviews are available. The website (charlierose.com) has his complete archive, but the links are dead as of ~6 months ago. |
00:31
🔗
|
SketchCow |
Yeah, this is a group thing. |
00:32
🔗
|
* |
ivan` pokes SketchCow |
02:57
🔗
|
underscor |
<swebb> bridgers: @archiveteam Just wanted to give huge s/o for archiving Webshots. I missed deletion notices but you archived my old account! #sohappy [one minute ago] |
02:57
🔗
|
underscor |
:333333 |
02:57
🔗
|
underscor |
That's awesome |
03:26
🔗
|
cmx |
\o/ |
09:49
🔗
|
Nemo_bis |
http://forum.uschamber.com/library/2013/05/big-data-and-what-it-means |
12:49
🔗
|
mib_p0g4c |
Hi |
12:49
🔗
|
mib_p0g4c |
is there any possibility to raise the number of workers? |
12:50
🔗
|
mib_p0g4c |
atm I'm running 4 separate VMs... and I would prefer to combine them into one to save resources |
12:51
🔗
|
tyn |
You can open the tty in one and manually change the max value |
12:52
🔗
|
tyn |
Not easy to find the option and it varies with each job. |
12:53
🔗
|
mib_p0g4c |
you mean in /home/warrior/projects/config.json? |
12:53
🔗
|
mib_p0g4c |
tried this... but it got overwritten after a few minutes... |
12:53
🔗
|
mib_p0g4c |
was thinking about looking through the webpage for the "max 6" limitation... |
13:05
🔗
|
antomatic |
I tried that, but the validation is in the back-end, not within the webpage itself, so it will refuse a number greater than 6 even if you send it in directly. |
13:19
🔗
|
mib_p0g4c |
looks like I got it :3 |
13:21
🔗
|
ivan` |
anyone have a copy of http://archive.org/details/2011-06-calufa-twitter-sql or some other set of twitter usernames? |
13:32
🔗
|
godane |
so i figured out how to grab the theblaze tv highlights |
13:33
🔗
|
godane |
i also made it faster to grab by changing hitsPerPage=150 |
13:34
🔗
|
godane |
this way there are only 7 pages that need to be grabbed for a keyword |
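The effect of a larger `hitsPerPage` on the number of requests can be sketched as below. This is only an illustration: the total of 1000 hits and the comparison page size of 10 are hypothetical numbers, not values taken from the site.

```python
import math


def pages_needed(total_hits: int, hits_per_page: int) -> int:
    """Number of result pages to request to cover every hit."""
    return math.ceil(total_hits / hits_per_page)


# Hypothetical total of 1000 hits for one keyword.
print(pages_needed(1000, 150))  # larger page size, fewer requests
print(pages_needed(1000, 10))   # smaller page size, many more requests
```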
13:49
🔗
|
* |
ivan` finds http://www.infochimps.com/tags/twitter |
14:17
🔗
|
balrog |
the quora.com robots.txt uses whitelisting and ia_archiver is not whitelisted :( |
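A whitelist-style robots.txt can be checked with Python's standard `urllib.robotparser`. The file contents below are a made-up example of the pattern (one named crawler allowed, everyone else disallowed), not the real quora.com file, and `ExampleBot` is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt; NOT the real quora.com file.
ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The whitelisted crawler gets through; ia_archiver falls under "User-agent: *".
print(allowed("ExampleBot", "http://example.com/some/page"))
print(allowed("ia_archiver", "http://example.com/some/page"))
```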
14:20
🔗
|
omf_ |
robots.txt lol |
14:37
🔗
|
joepie91 |
<omf_>robots.txt lol |
14:37
🔗
|
joepie91 |
accurate summary of my thoughts on the topic |
14:38
🔗
|
godane |
i need help grabbing xml from this: http://web.gbtv.com/gen/multimedia/detail/8/8/5/25571885.xml |
14:38
🔗
|
godane |
if you look at the source it's all one line |
14:39
🔗
|
ivan` |
lynx -source 'http://web.gbtv.com/gen/multimedia/detail/8/8/5/25571885.xml' | sed 's/></>\n</g' |
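The same line-splitting idea as the `sed 's/></>\n</g'` step can be sketched in Python. It operates on a string that has already been downloaded; the sample XML is invented for illustration and is not the content of the gbtv.com file.

```python
def split_tags(xml_text: str) -> str:
    """Put a newline between adjacent tags, like sed 's/></>\\n</g'."""
    return xml_text.replace("><", ">\n<")


# Invented one-line sample, standing in for the downloaded XML.
sample = "<media><title>clip</title><url>x</url></media>"
print(split_tags(sample))
```

This does not parse the XML; it only makes a one-line document readable enough to grep.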
14:43
🔗
|
godane |
thanks |
14:44
🔗
|
ivan` |
I can't add * http://<font></font>pipes.yahoo.com/pipes/pipe.run* to a wiki page |
14:44
🔗
|
ivan` |
The following text is what triggered our spam filter: .ru |
14:45
🔗
|
ivan` |
okay another <font> before the ru did it |
14:45
🔗
|
ivan` |
spam filter is pretty annoying when talking about URLs |
19:47
🔗
|
SilSte |
Hi |
19:48
🔗
|
SilSte |
is everything okay with steltek on Formspring? |
19:48
🔗
|
SilSte |
he is submitting a lot of uploads... but they are ALL 0 or 1 MB... |
19:49
🔗
|
SilSte |
Steltek 87GB 14092items |
19:50
🔗
|
SilSte |
if you compare |
19:50
🔗
|
SilSte |
short 1209GB 14076items |
19:58
🔗
|
Smiley |
:/ |
20:02
🔗
|
antomatic |
I see the 'out' number is quite high, comparatively.. |
20:03
🔗
|
antomatic |
Any chance they've just run up a ton of machines and the small and easy ones are coming back first of all? |
20:03
🔗
|
SilSte |
i think s.o. should check this :3 |
20:03
🔗
|
SilSte |
can you check if its always the same IP? |
20:03
🔗
|
SilSte |
or if the content is okay? |
20:04
🔗
|
antomatic |
warriorhq only shows 32 machines running formspring - not enough to account for that kind of activity |
20:05
🔗
|
antomatic |
could have modded the warrior script to accept loads of jobs but only return the small easy ones? (which I think I'd have to class as a nice hack, despite the disruption) |
20:05
🔗
|
antomatic |
hmm |
20:06
🔗
|
SilSte |
i modified my warrior to support more jobs |
20:06
🔗
|
SilSte |
but not thousands :D |
20:06
🔗
|
antomatic |
nice! :) |
20:06
🔗
|
SilSte |
I'm running 20... before I had 3 VMs... |
20:07
🔗
|
SilSte |
can s.o. check if the content of Steltek is okay? |
20:08
🔗
|
SilSte |
and someone should change "ArchiveTeams Choice" to Formspring... |
20:08
🔗
|
antomatic |
Steltek's average is about 6mb per unit - about a tenth of the average |
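The per-item averages can be worked out from the two leaderboard lines quoted earlier (Steltek 87GB / 14092 items, short 1209GB / 14076 items). A rough sketch, assuming the tracker counts 1 GB as 1000 MB, which the log does not state:

```python
def mb_per_item(total_gb: float, items: int, mb_per_gb: int = 1000) -> float:
    """Average upload size per item in MB (mb_per_gb is an assumption)."""
    return total_gb * mb_per_gb / items


steltek = mb_per_item(87, 14092)
short = mb_per_item(1209, 14076)
print(f"Steltek: {steltek:.1f} MB/item")
print(f"short:   {short:.1f} MB/item")
print(f"ratio:   {steltek / short:.2f}")
```

Only these two users are compared here; the overall project average that antomatic refers to is not given in the log.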
20:08
🔗
|
SilSte |
the choice clients are idling at the moment... |
20:09
🔗
|
antomatic |
Probably find they're quite innocently returning WARCs full of 'Your ISP does not allow you to access this page.' or something? |
20:09
🔗
|
SilSte |
antomatic: because of that someone should check... |
20:09
🔗
|
antomatic |
agreed. |
20:09
🔗
|
SilSte |
underscor: ping? |
20:10
🔗
|
antomatic |
Or 'Your monthly bandwidth allocation? Gone, so gone. Call us now if you want more internets. Have money. 1-800-PAY-MOAR' etc. |
20:11
🔗
|
SilSte |
^^ |
20:12
🔗
|
SilSte |
alard: ping? |
20:12
🔗
|
alard |
SilSte: Hi. |
20:12
🔗
|
Smiley |
I asked for SSH access :( |
20:12
🔗
|
Smiley |
alard: check warcs returned by steltek plz |
20:12
🔗
|
Smiley |
lots of 0/1Mb units compared to everyone else getting normal sizes |
20:12
🔗
|
alard |
Which project? |
20:12
🔗
|
antomatic |
Formspring |
20:12
🔗
|
SilSte |
formspring |
20:13
🔗
|
Smiley |
2. Add me to ssh? XD |
20:13
🔗
|
SilSte |
and can you check y there are so many packets out? |
20:18
🔗
|
SilSte |
alard: and can you change the automatic clients to formspring? They are idling atm... |
20:22
🔗
|
alard |
Do I block Steltek? |
20:23
🔗
|
Smiley |
yah for now |
20:23
🔗
|
Smiley |
:/ |
20:23
🔗
|
Smiley |
Until we can confirm those are valid warcs |
20:23
🔗
|
Smiley |
He might just be really lucky or something D: |
20:23
🔗
|
SilSte |
alard: did you check the warcs? |
20:24
🔗
|
alard |
I can't. They're uploaded to a server I don't have access to. |
20:24
🔗
|
antomatic |
is his IP address in a country that might be filtering a site like formspring? |
20:25
🔗
|
ivan` |
http://warriorhq.archiveteam.org/projects.json is still auto_project: posterous |
20:25
🔗
|
SketchCow |
Boy, I would LOVE it that when people upload stuff to archive.org, that they put one PDF per item. |
20:25
🔗
|
* |
Nemo_bis hides |
20:26
🔗
|
SketchCow |
Yes, what a galactic pain in my ass. |
20:26
🔗
|
SketchCow |
I'm half considering listing what they are to you, deleting them, and having you do it right. |
20:26
🔗
|
Nemo_bis |
Sometimes I failed to do so because I had no way to sort the PDFs by page... |
20:26
🔗
|
SketchCow |
Oh, there's a few that are COMPLETELY unusable. |
20:26
🔗
|
Nemo_bis |
It's only one magazine, I know which it is. Though I may have deleted from disk. |
20:27
🔗
|
Nemo_bis |
Yes, only a couple though IIRC: |
20:27
🔗
|
Nemo_bis |
On the other hand they're still indexed by search engines etc. |
20:28
🔗
|
Nemo_bis |
Better than a single PDF merged with mistakes. Do you have suggestions on how to deal with such masses of unsorted articles? |
20:31
🔗
|
SketchCow |
http://archive.org/details/starwarsrpgswedish |
20:48
🔗
|
Smiley |
DEFAULT PROJECT: FORMSPRING. |
20:48
🔗
|
SilSte |
thx |
20:49
🔗
|
Smiley |
SketchCow: PM. When you have time, ty. |
20:49
🔗
|
antomatic |
[applausesauce] |
20:50
🔗
|
ivan` |
is there anything good on formspring.me? |
20:50
🔗
|
SilSte |
Smiley: did you check stephk? |
20:50
🔗
|
Smiley |
I can't yet SilSte we don't have access to the repo where the warc's go. |
20:50
🔗
|
Smiley |
but he's blocked for now. |
20:50
🔗
|
SilSte |
ivan`: It's the only available project atm... |
20:51
🔗
|
SilSte |
Smiley: okay. Thought you may have ^^ |
20:51
🔗
|
ivan` |
SilSte: I'm just curious if there's anything interesting on it |
20:51
🔗
|
SilSte |
ivan`: its like ask.fm |
20:52
🔗
|
SilSte |
what about an archive of piratenpad.de or the wiki of the german pirate party? |
20:52
🔗
|
SilSte |
don't think that a backup hurts ^^ |
20:52
🔗
|
antomatic |
We should archive the iTunes store! Text, metadata, 60-second previews... mmm.... |
20:52
🔗
|
SilSte |
but I'm not familiar with the tools... |
20:53
🔗
|
antomatic |
[not entirely serious] |
20:53
🔗
|
antomatic |
Imagine how interesting a catalogue of all available wax cylinders from 1825 would be. |
20:53
🔗
|
SilSte |
antomatic: My question was serious ;-). They are starting to delete old pads on piratenpad.de |
20:53
🔗
|
Smiley |
SilSte: see my wiki page for a "default warc grab" |
20:53
🔗
|
Smiley |
http://www.archiveteam.org/index.php?title=User:Djsmiley2k |
20:54
🔗
|
Smiley |
that'll generally give you a sensible grab |
20:54
🔗
|
antomatic |
Actually a correct phrasing of that sentence would be "Imagine how BLANK a catalogue of wax cylinders from 1825 would be." - Maybe 1895 then. :) |
20:54
🔗
|
antomatic |
Good point, Silste. |
20:55
🔗
|
SilSte |
Smiley: that doesn't help on piratenpad... |
20:56
🔗
|
SilSte |
It's possible to download the wiki as a file (without the media stuff) |
20:56
🔗
|
SilSte |
afaik |
21:10
🔗
|
SketchCow |
https://twitter.com/kpepper/status/342345097154797568 |
21:13
🔗
|
DFJustin |
to be fair, uploading multiple pdfs wouldn't be half so unusable if archive.org spent five seconds to add a sort() call to the item page |
21:39
🔗
|
ivan` |
anyone have more than 25M twitter usernames/API IDs? http://www.infochimps.com/datasets/twitter-census-developer-tools-mapping-from-twitter-user-search- |
21:43
🔗
|
* |
ivan` also finds http://help.sentiment140.com/for-students |
21:44
🔗
|
ivan` |
and http://an.kaist.ac.kr/traces/WWW2010.html |
21:53
🔗
|
godane |
SketchCow: i see that you push my g4 forum dumps to its own collection in archiveteam |
21:53
🔗
|
godane |
thanks |
22:01
🔗
|
* |
Smiley ponders if IGN/Gamespy can have a collection yet. |
22:07
🔗
|
Smiley |
80ish items |
22:22
🔗
|
SketchCow |
Boy, know what I need? |
22:22
🔗
|
SketchCow |
I mean, REALLY need? |
22:22
🔗
|
swebb |
Sleep? |
22:22
🔗
|
SketchCow |
I need someone, right now, giving me more "to-dos". |
22:22
🔗
|
SketchCow |
I'm already cleaning up hundreds of objects dumped into opensource |
22:22
🔗
|
SketchCow |
I have scripts, but it's killer. |
22:23
🔗
|
SketchCow |
Meanwhile, my room is a disaster and I was trying to retrieve a CD-ROM and I'm afraid it's in the "harder to get to" part of the shipping container. |
22:24
🔗
|
swebb |
Ha! You were trying to retrieve a single CD-ROM from the back of that huge container! :) (Needle in a haystack pun here) |
22:24
🔗
|
SketchCow |
So how about I focus on the billions uploaded by Nemo_bis and godane, then we'll get over to the others. |
22:24
🔗
|
SketchCow |
I have a collection of CD-ROMs in there, but I am afraid they're behind some items. |
22:24
🔗
|
SketchCow |
It's not hard, it's just there's stuff that needs two-person lifts. |
22:25
🔗
|
SketchCow |
http://archive.org/search.php?query=collection%3Ancompasslive&sort=-publicdate |
22:25
🔗
|
SketchCow |
Also, adding those |
22:25
🔗
|
SketchCow |
Also, Xanga. |
22:25
🔗
|
SketchCow |
Also, someone just let me know about another forum dying |
22:25
🔗
|
SketchCow |
http://forum.worldofplayers.de/forum/threads/1247321-WoG-com-is-closing |
22:25
🔗
|
SketchCow |
anyone, take it |
22:28
🔗
|
SketchCow |
So yeah, don't pile it on today |
22:28
🔗
|
SketchCow |
Also, I went away for 15 days, just got home. |
22:28
🔗
|
SketchCow |
Lost 3 pounds |
22:28
🔗
|
SketchCow |
By this rate, I will be a sexy MF and people will give us stuff just because I look at them |
22:28
🔗
|
SketchCow |
(Late: cake) |
22:32
🔗
|
SketchCow |
Now I'm blasting https://www.youtube.com/watch?v=GxukqlSmhco at 150db and nobody can stop me |
22:38
🔗
|
SketchCow |
http://i.imgur.com/Ltjasf0.gif |
22:39
🔗
|
Smiley |
wtf why are you in my head. |
22:41
🔗
|
balrog |
http://forum.xentax.com/index.php may be worth snapshotting (note that a lot of tools are mediafire/etc though) |