Time |
Nickname |
Message |
00:45
|
SketchCow |
The spam attack on the wiki has begun. |
01:01
|
omf_ |
SketchCow, the oqotalk.com backup has completed, and I got a 1.4 GB warc.gz |
01:01
|
SketchCow |
Thanks. |
01:01
|
SketchCow |
Sounds good. |
01:01
|
SketchCow |
Can you upload it? |
01:02
|
omf_ |
Just wanted to double check. The classic web interface can handle a file that size? |
01:02
|
SketchCow |
Yes |
01:02
|
SketchCow |
Well, wait, classic? |
01:02
|
SketchCow |
The new one. |
01:04
|
omf_ |
The new one is the drag and drop thing? |
01:04
|
SketchCow |
Yes |
01:04
|
omf_ |
cause the flash based one does not function correctly on linux and I was just using the old web form for all the uploads I already did |
01:04
|
omf_ |
the drag and drop thing has yet to work for me either |
01:06
|
omf_ |
oh shit. http://www.oqotalk.com is down now |
01:07
|
SketchCow |
Sweet |
01:07
|
SketchCow |
Good timing, huh |
01:07
|
omf_ |
yep, I checked the wget log and when it finished earlier there were no errors |
01:08
|
omf_ |
I bet I ate up their bandwidth or something |
01:08
|
SketchCow |
Probably. |
01:18
|
omf_ |
It is uploading now. |
01:54
|
ryan_ |
is the yahoo video thing still ongoing at all? |
02:13
|
SketchCow |
No. |
02:13
|
SketchCow |
I'm uploading the video. |
02:20
|
Lord_Nigh |
SketchCow: use asirra or that colored text thing that bisqwit came up with for tasvideos forums to prevent spammer signups on wiki? |
02:20
|
Lord_Nigh |
also you need a blanket 'edit rejector' which will reject edits with spamlinks or spamtext |
02:21
|
Lord_Nigh |
since spammers sometimes hire human captcha-breakers |
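The blanket "edit rejector" idea can be sketched roughly as below. This is only an illustration: the blocklist patterns, the link threshold, and the function name `reject_reason` are all hypothetical, and nothing here reflects how the wiki was actually configured.

```python
import re

# Hypothetical blocklist; a real wiki would maintain its own patterns.
SPAM_PATTERNS = [
    re.compile(r"https?://\S*casino-bonus", re.IGNORECASE),
    re.compile(r"\bbuy\s+cheap\s+pills\b", re.IGNORECASE),
]

# Assumed threshold: reject edits that add more than this many new links.
MAX_NEW_LINKS = 5

LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def reject_reason(old_text, new_text):
    """Return a reason string if the edit should be rejected, else None."""
    # Only text introduced by this edit counts, so cleaning up an
    # already-spammed page is still allowed.
    for pattern in SPAM_PATTERNS:
        if pattern.search(new_text) and not pattern.search(old_text):
            return "matched blocklist pattern: " + pattern.pattern

    old_links = set(LINK_RE.findall(old_text))
    added = [url for url in LINK_RE.findall(new_text) if url not in old_links]
    if len(added) > MAX_NEW_LINKS:
        return "too many new external links (%d)" % len(added)
    return None
```

Because the check runs on the submitted text rather than on the signup, it still applies when a human solved the CAPTCHA.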
02:22
|
Lord_Nigh |
ooh here's a really nasty one: ask people to type, in english, what color a sequence of chinese characters each are. each chinese character translates in chinese to a DIFFERENT color than the one it is. |
03:22
|
dashcloud |
with it being or getting close to flea market/garage sale season on the East coast, remember that you can help preserve history by buying other people's crap |
03:29
|
DFJustin |
unfortunately it un-preserves the space in your house :( |
03:40
|
dashcloud |
so box it up, keep what you want, then send the rest to a new home (you could be the physical equivalent of godane !) |
05:14
|
ivan` |
is there a youtube-dl patch or other software that will let me resume downloading a channel without hitting 1000 pages for videos I've already downloaded? |
05:14
|
ivan` |
(that kind of behavior tends to get you CAPTCHAed for life) |
05:43
|
BlueMax |
Well I know of a software on Windows that's pretty good at it... |
05:46
|
ivan` |
cool, what is it? |
05:52
|
BlueMax |
http://www.dvdvideosoft.com/products/dvd/Free-YouTube-Download.htm |
05:52
|
BlueMax |
It's ad-supported and only works on Windows but it's still good software. |
05:54
|
BlueMax |
(well it might work via WINE but I can't promise that) |
05:57
|
ivan` |
thanks |
05:58
|
ivan` |
I'll probably hack youtube-dl now until I have something terrible working |
06:56
|
omf_ |
SketchCow, you have all of oqotalk now |
06:59
|
SketchCow |
Saw. |
06:59
|
SketchCow |
http://archive.org/details/archiveteam_oqotalkcom_2012_03_panic |
07:49
|
godane |
good news |
07:49
|
godane |
i can now scan through google search results |
07:53
|
omf_ |
I think that last statement needs a little more detail |
07:54
|
godane |
i'm making an index based on search results of files from cscope.us |
07:54
|
godane |
search="site:cscope.us+filetype:pdf" |
07:55
|
godane |
i think do a for i in $(seq 1 30); do |
07:55
|
godane |
echo "http://google.com/search?q=$search&start=${i}0" >> index.txt |
07:55
|
godane |
then something like this: |
07:55
|
godane |
wget -x -i index.txt --user-agent="Firefox/3.0.15" --warc-file=google-cscope.us --warc-cdx -w 5 |
07:55
|
godane |
then this: |
07:55
|
godane |
zcat *.warc.gz | grep -ohP 'href='[^'>]+' | grep 'q=' | grep 'www.cscope.us' | sed 's|.*url?q=||g' | sed 's|&.*||g' | grep -v 'webcache.google' | grep http |
07:56
|
godane |
you guys only have 55 pdfs in wayback so this should be something |
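The two steps godane describes (build one search URL per results page, then pull the real target URLs out of the result links) could look roughly like this in Python. It is a sketch under assumptions: the function names are made up for illustration, the result markup is assumed to use `/url?q=<target>&...` redirect links as the `sed` expressions above imply, and the input is plain HTML already read out of the WARC. Unlike the shell loop, which starts at `start=10`, this version also includes the first page (`start=0`).

```python
import html
import re
from urllib.parse import parse_qs, quote_plus, urlsplit

HREF_RE = re.compile(r'href="([^"]+)"')


def search_page_urls(query, pages=30):
    """Build one search URL per results page, assuming 10 results per page."""
    base = "http://google.com/search?q=" + quote_plus(query)
    return [base + "&start=%d" % (page * 10) for page in range(pages)]


def extract_targets(page_html, host="www.cscope.us"):
    """Return the distinct target URLs on `host` found in /url?q= links."""
    targets = []
    for href in HREF_RE.findall(page_html):
        parts = urlsplit(html.unescape(href))
        if parts.path != "/url":
            continue  # navigation or other non-result link
        values = parse_qs(parts.query).get("q", [])
        if not values:
            continue
        target = values[0]
        # Comparing the hostname also drops webcache links, which only
        # mention the site inside their own query string.
        if urlsplit(target).hostname == host and target not in targets:
            targets.append(target)
    return targets
```

The resulting list can be written to a file and handed to `wget -i`, as with `index.txt` above.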
08:20
|
godane |
i got the index |
08:42
|
godane |
i'm uploading my grab of cscope pdfs right now |
09:06
|
godane |
uploaded: https://archive.org/details/cscope.us-google-pdfs-grab-20130312 |
09:31
|
ivan` |
youtube-dl has a bug that is breaking /user/ URLs, you have to move YoutubeUserIE above YoutubePlaylistIE |
10:05
|
godane |
so looks like there are 71 inactive videos missing |
10:05
|
godane |
in the 48000ids |
11:20
|
ersi |
http://blog.archive.org/2013/03/12/riding-with-the-bit-savers/ |
15:09
|
godane |
good news |
15:09
|
godane |
based on warc-proxy my warcs of the forums last month work just fine |
16:10
|
Smiley |
alard: you've stopped showing the time left on the Available Projects page? |
16:10
|
Smiley |
Or was it not ever there? |
17:37
|
SketchCow |
Ha ha, I jumped the chain of command posting that. |
17:37
|
SketchCow |
Small error, turns out setting a checkbox on the blog software automatically promotes to front page |
17:38
|
DFJustin |
yeah man you're drowning out vital bitcoin news |
17:39
|
ersi |
SketchCow: ;D |
17:39
|
ersi |
SketchCow: I like the "Movie showing template" that was posted a few days ago as well (not yours though) |
17:56
|
godane |
this video doesn't exist, it looks like: http://archive.org/details/g4tv.com-video36800 |
18:08
|
godane |
ok guys |
18:09
|
godane |
the forum uploads from feb 2013 will have s= links |
18:09
|
godane |
it only worked in warc viewer cause it was using cached data of forums.g4tv.com |
19:43
|
SketchCow |
We're out of the "woods" with space on FOS. |
19:43
|
SketchCow |
5.7tb free, enough for the time being. |
19:47
|
ivan` |
https://github.com/ludios/youtube-dl/commits/prime my youtube-dl experience is much better now |
19:57
|
Smiley |
more ops plz |
20:08
|
balrog_ |
submit pull request, ivan` :) |
20:09
|
DFJustin |
yeah I could definitely use some of those changes |
20:12
|
ivan` |
balrog_: these are some pretty low-quality diffs |
20:12
|
ivan` |
the end result is good for me though |
20:14
|
ivan` |
I am assuming that the filenames on the filesystem are a certain format |
20:14
|
ivan` |
the blip.tv change doesn't fall back to non-Source when Source isn't available |
20:14
|
ivan` |
the sleeping is sleeping thrice in a row for reasons unknown to me |
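The resume idea (skip videos that are already on disk, relying on a known filename format) might be sketched like this. It is not ivan`'s patch. The helper names are hypothetical, and the filename layout `<title>-<11-character video id>.<ext>` is an assumption about how the files were saved.

```python
import os
import re

# Assumed filename layout: "<title>-<11-character video id>.<ext>".
ID_IN_NAME = re.compile(r"-([0-9A-Za-z_-]{11})\.[^.]+$")


def downloaded_ids(directory):
    """Collect the video IDs of files already present in `directory`."""
    ids = set()
    for name in os.listdir(directory):
        match = ID_IN_NAME.search(name)
        if match:
            ids.add(match.group(1))
    return ids


def pending_ids(channel_ids, directory):
    """Return the IDs from `channel_ids` that still need downloading."""
    have = downloaded_ids(directory)
    return [video_id for video_id in channel_ids if video_id not in have]
```

Only the IDs returned by `pending_ids` need a page fetch, which avoids requesting a page for every video that is already saved. A partial file such as `title-<id>.mp4.part` does not match the pattern, so an interrupted download is retried.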
21:18
|
brianhick |
hey, I've just downloaded the archiveteam-warrior after reading http://jacquesmattheij.com/come-help-save-posterous-from-oblivion# and it says to ask here before starting the posterous project - what do I need to know? |
21:19
|
ersi |
Come join us in #preposterus (Project specific channel for Posterous) |
21:21
|
ersi |
The warning/notice was put into the Warrior's project page because Posterous might ban your IP. That will make you unable to browse any Posterous blogs/spaces. If you've read that, select it and go on :-) |
21:22
|
brianhick |
I'm alright with that - all the ones I read have moved. Thanks for the heads up. |
21:23
|
ersi |
There's more activity in #preposterus btw, since that's project specific :) That's where all the project updates happen as well |
21:24
|
brianhick |
many thanks, I'll try there. |
21:34
|
ersi |
If you're here regarding Jacques Mattheij's HackerNews post about Posterous, the project specific channel (Where everything interesting happens) is at #preposterus (on this very same IRC Network) |
21:34
|
balrog_ |
someone put that in the topic |
21:35
|
ersi |
Hmm, I guess I'll save the current one and put it back later |
21:36
|
balrog_ |
no I meant add it to the topic |
21:37
|
ersi |
I know. I'm doing it right now. Chill pill. |
21:38
|
ersi |
stupid topic charlimit |
22:09
|
CoJaBo |
Does anyone know- Are there any long-lived archives other than Archive.org/Wayback that archive webpages in general? |
22:12
|
Andres_ |
uhm |
22:13
|
Andres_ |
well there's always google web cache IMO |
22:13
|
Andres_ |
although |
22:13
|
Andres_ |
it's not a permanent archive iirc |
22:13
|
Andres_ |
nor is it an archive at all |
22:14
|
Andres_ |
you have webcitation.org |
22:14
|
Andres_ |
too |
22:14
|
Andres_ |
it's for references more than archiving |
22:14
|
Andres_ |
but it's decent |
22:15
|
Andres_ |
ersi, |
22:15
|
ersi |
Heya |
22:15
|
Andres_ |
have you spoken to webcitation.org staff |
22:15
|
Andres_ |
they seem to be a little troubled |
22:15
|
ersi |
About? Posterous? |
22:15
|
Andres_ |
re: https://fundrazr.com/campaigns/aQMp7 |
22:15
|
ersi |
Oh |
22:15
|
Andres_ |
no |
22:15
|
Andres_ |
about themselves |
22:16
|
ersi |
No, I haven't. |
22:16
|
CoJaBo |
I'm mostly interested in ones that crawl automatically; there was an incident where an... unscrupulous company decided to upload large numbers of sensitive documents to a site; these need to be removed from as many places as possible. |
22:16
|
ersi |
Looks like they're dying :-/ |
22:16
|
ersi |
CoJaBo: There's Common Crawl. |
22:16
|
ersi |
It's not an archive per se, but it effectively is. |
22:16
|
Andres_ |
someone should contact 'em |
22:16
|
Andres_ |
and ask for a backup |
22:16
|
Andres_ |
.torrent or something |
22:17
|
ersi |
Calm down with that enter button ;) |
22:17
|
balrog_ |
btw yahoo message boards is shutting down in about half a month |
22:17
|
balrog_ |
anyone doing anything about that??? |
22:17
|
ersi |
There's #BurnTheMessenger and there's a project page on ArchiveTeam wiki |
22:17
|
Andres_ |
:p sorry, this bad habit is really old |
22:17
|
Andres_ |
started since I started IRCing at DALnet |
22:17
|
Andres_ |
bad habits never die |
22:18
|
ersi |
Please keep traffic in this channel to a minimum. #archiveteam-bs is for freefloat chat and then there's the project channels. |
22:19
|
CoJaBo |
ersi: huh.. is there any way to remove data from there? |
22:19
|
Andres_ |
CoJaBo, what files did they upload, just wondering |
22:19
|
CoJaBo |
Andres_: Everything they had access to |
22:20
|
CoJaBo |
It was one of those outsourcing web development companies |
22:20
|
ersi |
CoJaBo: From Common Crawl? |
22:20
|
CoJaBo |
ersi: Yeh, for starters.. |
22:25
|
ersi |
CoJaBo: I'd try contacting them and asking nicely. |
22:25
|
ersi |
I'd consider leaving it there though. In my opinion, if something's been public, let it be public (ish) |
22:29
|
CoJaBo |
ersi: Yeh, the customers prolly wouldn't appreciate showing up there tho lol.. |
22:31
|
ersi |
CoJaBo: Ah, aight. Well, a friendly nod should work. |
22:32
|
CoJaBo |
Is there a way to search their data to see if its even there? Or do you need amazon or whatever to do that.. |
22:32
|
CoJaBo |
It's def. on Archive.org tho >_> |
22:32
|
CoJaBo |
Hell, wtf..... Someone got c99shell on there too, niiiiiiiiccccccceeeeeeeeeeeeeee <_< |
22:33
|
ersi |
If it's in the Internet Archive Wayback machine, contact them on info@archive.org and they'll help you out. Please keep in mind that they're not many people and it can take a little while. |
22:34
|
CoJaBo |
I think I can just do the Robots.txt thing; tho I guess that way is permanent isnt it.. |
22:35
|
ersi |
Yes, you can exclude it by adding a robots.txt to the domains that are affected. |
22:35
|
CoJaBo |
Ah hell, that's right, they uploaded it to their own site too :/ |
22:35
|
ersi |
The wayback machine will poll the current domain/robots.txt before showing something from the wayback archives. If it excludes, it won't show it. |
22:35
|
ersi |
Yeah, just contact the archive and they'll probably help you out. No biggie. |
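The robots.txt check ersi describes can be tried locally before publishing the file. The sketch below uses Python's standard `urllib.robotparser`. The crawler name `ia_archiver` and the helper name `is_excluded` are assumptions for illustration, and the Wayback Machine's actual exclusion logic is not shown in this log.

```python
from urllib.robotparser import RobotFileParser


def is_excluded(robots_txt, url, agent="ia_archiver"):
    """Return True if `robots_txt` disallows `agent` from fetching `url`.

    `agent` defaults to an assumed crawler name; adjust as needed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Example robots.txt that blocks only the assumed archive crawler.
EXAMPLE_ROBOTS = "User-agent: ia_archiver\nDisallow: /\n"
```

Calling `is_excluded(EXAMPLE_ROBOTS, "http://example.com/docs/file.pdf")` reports whether that rule set would hide the page from the named crawler. Note that this only works for a domain you control, which is why the copy on the company's own site needs a direct request instead.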
22:36
|
CoJaBo |
Yeh..... Advice- NEVER hire outsourcing companies LOL........... |
22:37
|
ersi |
Yeah, heh. |
22:37
|
ersi |
== If you're here regarding Jacques Mattheij's HackerNews post about Posterous, the project specific channel (Where everything interesting happens) is at #preposterus (on this very same IRC Network) == |
22:38
|
ersi |
(Saw a bunch of new clients join up) |