00:45 <omf_> How does the IA take in site grabs that do not have warcs?
00:47 <chronomex> they don't
00:47 <chronomex> well, not into waybackmachine
00:48 <omf_> What if you have all the data that makes the warc
00:48 <omf_> like the transfer time, size, headers, etc...
00:48 <chronomex> then I suppose you could make a warc?
00:49 <omf_> I guess I could write a conversion program.
00:56 <godane> you would need like a wget log of the files being grabbed for this to work
00:56 <godane> in theory
01:06 <chronomex> that won't have headers tho
01:07 <godane> that's why i said in theory
01:07 <godane> was not sure
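The conversion program omf_ describes is feasible whenever the headers and bodies were actually kept: a WARC "response" record is just a short block of WARC headers in front of the captured HTTP message. A minimal stdlib-only sketch of the record layout, per the WARC 1.0 spec (illustrative only; real tools also emit warcinfo and request records plus payload digests, and the URL and payload here are made up):

```python
import uuid
from datetime import datetime, timezone

def make_warc_response(url, http_headers, body):
    """Build one WARC 1.0 'response' record from already-captured data.

    http_headers: raw HTTP status line + headers, CRLF-separated, without
    the trailing blank line; body: the response payload bytes.
    """
    # the record block is the full HTTP message: headers, blank line, body
    payload = http_headers + b"\r\n\r\n" + body
    record_headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + url.encode("ascii"),
        b"WARC-Date: "
        + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ").encode("ascii"),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode("ascii") + b">",
        b"Content-Type: application/http;msgtype=response",
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ]
    # a record ends with two CRLFs before the next one starts
    return b"\r\n".join(record_headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"

rec = make_warc_response(
    "http://example.com/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html",
    b"<html></html>",
)
```

As godane and chronomex note, the catch is exactly the missing pieces: without a log of the original HTTP headers and fetch times, these fields would have to be reconstructed or faked.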
01:22 <ianweller> what.
01:22 <ianweller> so i went to bed thinking maybe the local warrior that i have running will stop
01:22 <ianweller> nope
01:22 <ianweller> it's on 7010 URLs and counting
01:25 <chronomex> perfect!
01:33 <marczak> Is there a script that I could run instead of using the warrior VM?
01:34 <marczak> I have a few extra IPs I could run from, but won't have a virtualized environment to run under.
01:34 <omf_> marczak, the peeps in #warrior can answer that
01:34 <marczak> great - thanks
01:36 <DrDeke> the answer is "yes" but i don't have a link to it handy
01:38 <marczak> DrDeke: thanks - someone in #warrior is helping out.
02:14 <omf_> For all the new warriors out there we have long-term projects after yahoo and posterous. #urlteam is constantly unfucking the url shorteners so we can find sites without twitter, bitly, etc...
02:15 <omf_> That is our proactive side of saving the web.
02:31 <ersi> mah, don't send people to #warrior when they're asking project-specific questions
02:32 <ersi> marczak: You can run the scripts from: https://github.com/ArchiveTeam/yahoomessages-grab/
02:32 <ersi> those are the stand-alone ones. You'll need to compile wget though (script is checked in there ^) and install the seesaw python package.
03:40 <SketchCow> I think we just exploded the Yahoo
03:40 <pilgrim> well they had it coming
03:46 <godane> i just saved low rider world 2006 clip of attack of the show
03:46 <godane> it was one of the flvsm videos that i couldn't get
03:55 <SketchCow> We just destroyed the Yahoo! backlog
03:57 <DFJustin> and how
03:57 <SketchCow> The graph looks like a zombie death apocalypse
04:00 <SketchCow> root@teamarchive-1:/2/DISCOGS/www.discogs.com/data# du -sh .
04:00 <SketchCow> 40G	.
04:00 <SketchCow> by the way
04:11 <omf_> SketchCow, was that a preventative grab?
04:13 <SketchCow> Yes
04:13 <SketchCow> I'm working with MusicBrainz to get their stuff on archive.
04:13 <SketchCow> And they said "You know, I don't know of any mirrors of discogs.org"
04:14 <DFJustin> might do vgmdb.net while you're at it
04:20 <SketchCow> Show me where you can download the DB and I will.
04:20 <omf_> DFJustin, I already got a grab of vgmdb.net
04:21 <omf_> it is about 8 months old though
04:21 <DFJustin> o/\o
04:22 <omf_> I want to merge some of their data into freebase
06:33 <chronomex> why don't we have all warriors running urlteam in the background all the time?
06:35 <chronomex> :)
06:35 <omf_> It would help
07:12 <omf_> We need to recruit someone who has google fiber, it could be real helpful
07:13 <omf_> just throwing that out there
07:49 <SketchCow> Man, it's going k-razy out there
07:49 <SketchCow> My Hard Drive full of goodness goes out Monday
07:49 <SketchCow> Working now to build up the maximum amount of data on it
07:50 <omf_> You ship hard drives as well as upload? Talk about no stone unturned :)
08:01 <SketchCow> Have to.
08:01 <SketchCow> I send in 400-500gb a hit
08:03 <chronomex> whumph whumph
08:05 <omf_> Do you have shock proof cases for mailing? I always wanted to ask how those work out.
08:06 <chronomex> if I were mailing hdds I'd probably reuse original hdd packing materials
08:06 <chronomex> seems to work
08:26 <ivan`> in case I get hit by a meteor in the next 3 months somebody better remember to scrape all of Reader's *.blogspot.com/atom.xml feeds in addition to the feed URLs they currently use
08:26 <ivan`> e.g. xooglers.blogspot.com/atom.xml gets you completely different content
11:00 <SketchCow> chronomex: I do.
11:52 <ersi> ivan`: Different content than what?
14:54 <omf_> Our clown information is growing nicely. If you have any observations you would like to add: http://www.archiveteam.org/index.php?title=Clown_hosting
16:19 <chazchaz> omf_: Are there any guidelines for including providers in that list?
16:22 <omf_> website url, price point, specs, and any insights into why the service works so well or problems with it
16:23 <omf_> the joyent and DO ones are good examples we have built out
16:23 <omf_> we have vps and cloud providers on there
16:25 <omf_> bandwidth and storage are right up there with price point as important data we need
16:57 <chazchaz> Ok, I added BuyVM
16:59 <omf_> chazchaz, you use them recently?
16:59 <chazchaz> Yeah, I have 2 servers with them.
16:59 <chazchaz> One for over a year
17:03 <omf_> What can you fit in 128mb ram
17:04 <omf_> I cannot think of too much you could run
17:04 <omf_> I could host my photos on there. Cheaper than flickr
17:07 <neurophyr> edis.at has a good 128MB miniVPS option.
17:07 <neurophyr> I run lower-traffic Tor relays and bridges on that kind of box.
17:08 <neurophyr> and it was quite happy to run the yahoomessages-grab script.
17:09 <chazchaz> omf_: They let you burst up to 2x as long as it's available, which seems to be almost all the time. I'm using 150 MB for 40 posterous processes and 2 yahoo-messages processes
17:10 <omf_> chazchaz, you should make a note on the wiki, that is valuable info
17:13 <chazchaz> done
17:14 <omf_> thanks
17:36 <DrDeke> i'm kind of offended that there is a wiki page called "Clown hosting" and my apartment closet isn't eligible to be listed in it ;)
17:37 <DrDeke> outage notifications? pshhh, yeah if you have a VM on it, maybe i'll email you 5 minutes before i decide to take the server apart for some reason
17:39 <chazchaz> Just check it yourself. That's what ping is for, right?
17:41 <DrDeke> exactly!
17:41 <DrDeke> i made a major jump in my level of customer service a couple months ago when i put everyone's email address that i could track down in a google spreadsheet
17:41 <DrDeke> sometimes it gets copy and pasted into a bcc
17:41 <DrDeke> sometimes... =)
17:42 <DrDeke> (nobody is paying, so, you know...)
17:42 <chronomex> 'wall' ought to be acceptable notice for planned maintenance
17:42 <DrDeke> i actually got to do that on a couple servers at my real job last night
17:43 <DrDeke> "Oh, we forgot to mention that part in the email? Well, just shutdown +30 it, the users will be fine."
17:43 <DrDeke> (needless to say, that is not the way it normally works there)
17:43 <DrDeke> since the system these servers are for was going to be completely down anyway, we figured oh well
19:31 <omf_> Did someone already grab the ign forums?
19:41 <Smiley> omf_: ask in #ispygames
19:41 <Smiley> someone was doing work on a lot of that stuff there
19:41 <omf_> that is me
19:42 <omf_> I just checked the scrollback to the 22nd of last month and nothing
19:45 <Smiley> D:
19:45 <Smiley> sorry for being an idiot then ;)
19:46 <omf_> No worries. It is hard to follow so many projects going on.
19:46 <Smiley> aye
19:46 <omf_> I know some forums for some sites were grabbed but nothing about the main ign
19:48 <omf_> The wiki is down
19:49 <omf_> Resource Limit Is Reached errors a few times
19:49 <omf_> seems fine again now
20:29 <SketchCow> It happens.
20:30 <omf_> SketchCow, Is it alright if I start uploading that 4data to you?
20:31 <omf_> It is 102gb
20:31 <omf_> and it will probably take over a week to upload, possibly longer
20:34 <SketchCow> What 4data?
20:34 <SketchCow> I mean, I'm sure we discussed it. What is it?
20:35 <omf_> The 4chandata dump
20:35 <omf_> from that archive site that is closed
20:35 <SketchCow> Oh, of course.
20:35 <SketchCow> Yeah, go ahead. Do you need credentials?
20:35 <omf_> I already got them
20:36 <omf_> I am still waiting on the database dump itself but I am not worried. This guy has come through on everything he said so far
21:19 <Nimbulan> 4 Get your free Psybnc 100 user have it come http://www.multiupload.nl/B11JFCYQH6
21:20 <Marcelo> lol
21:22 <soultcer> In case anyone is wondering: https://www.virustotal.com/en/file/f897432de88adce73b23741da1a133b6a79b8233d50571451dab4b992931d173/analysis/1364160122/
21:23 <chronomex> errrr
21:23 <chronomex> what's that from?
21:23 <soultcer> That's the free Psybnc
21:23 <soultcer> Hm, I wonder if xchat logs bans
21:24 <Marcelo> So many nicknames for this virus.
21:33 <chronomex> is there a ratelimiter on formspring?
21:37 <zenpho> howdy doo! I'm reporting back. Soultcer helped me yesterday with digging into the btinternet stuff (http://archive.org/details/archiveteam-btinternet)
21:38 <soultcer> Did it work?
21:39 <zenpho> yes indeedie! - i wrote some horrible awk scripts to parse the CDX files for stuff I was interested in, download via curl, unpack, and now I'm browsing thru some vintage .au and .wav files ... very cool
21:39 <soultcer> Sweet
21:41 <zenpho> very kind of you to help and encourage me to carry on, i was almost convinced that the megawarc files would have to be downloaded in their entirety (or at least an entire megawarc) to get anything out of them
21:42 <zenpho> i was right about to say "ehh.... it probably doesn't work like that", and give up, but you convinced me. and it's certainly very cool to browse thru this stuff!
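zenpho's workflow works because a CDX index records where each response lives inside the megawarc, so a single record can be pulled out with an HTTP Range request and gunzipped on its own (each record in a .warc.gz is a separate gzip member). A Python sketch of the same idea; the sample CDX line below is invented, and the field order assumes the common 11-column "CDX N b a m s k r M S V g" layout, where S is the compressed record size and V its offset:

```python
import gzip
import urllib.request

# assumed 11-field order: urlkey, timestamp, original-url, mime, status,
# digest, redirect, meta, compressed-length, compressed-offset, warc-filename
def parse_cdx_line(line):
    f = line.split()
    return {"url": f[2], "length": int(f[8]), "offset": int(f[9]), "file": f[10]}

def fetch_record(warc_url, offset, length):
    """Range-request one gzip member out of a megawarc, so the multi-GB
    file never has to be downloaded whole."""
    req = urllib.request.Request(
        warc_url,
        headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())

# hypothetical CDX entry, just to show the parsing
entry = parse_cdx_line(
    "uk,co,btinternet)/~someone/sound.wav 20120101120000 "
    "http://www.btinternet.com/~someone/sound.wav audio/x-wav 200 "
    "AAAA2222BBBB - - 5120 1048576 bt-000.warc.gz"
)
# record = fetch_record("https://archive.org/download/.../bt-000.warc.gz",
#                       entry["offset"], entry["length"])
```

The awk-plus-curl pipeline zenpho describes is the same thing expressed with `curl -r offset-end` against the archive.org download URL.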
21:47 <ersi> Neat :)
21:50 <alard> chronomex: Yes.
21:50 <alard> (I set a rate limit on the tracker, that is.)
21:50 <chronomex> ah
21:51 <alard> But that limit is not reached at the moment. I set it to 20 to be safe, but we're currently at 2-4 per minute.
21:52 <chronomex> I meant running multiple threads on my end
21:53 <alard> I don't know how Formspring behaves.
21:54 <chronomex> ok, I'll just run 1 for now
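The tracker-side limit alard describes (20 item hand-outs per minute) can be pictured as a token bucket. This is only an illustration of the concept, not the tracker's actual code:

```python
import time

class RateLimiter:
    """Minimal token bucket: allow at most `rate` grants per `per` seconds."""

    def __init__(self, rate, per):
        self.rate = rate
        self.per = per
        self.tokens = float(rate)   # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the bucket size
        self.tokens = min(
            self.rate, self.tokens + (now - self.last) * self.rate / self.per
        )
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

limiter = RateLimiter(rate=20, per=60)  # the 20-per-minute setting mentioned above
grants = sum(limiter.allow() for _ in range(25))
```

With traffic at 2-4 requests per minute, the bucket stays full and the limit never bites, which matches what alard observes.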
22:11 <wp494> would it be possible to get a message asking for assistance on the formspring project in the topic?
22:12 <chronomex> sure, is there a channel for it?
22:12 <alard> wp494: Are we sure that it works?
22:13 <wp494> alard: yep, I've been running 3 concurrent for an hour or two and haven't run into any issues
22:13 <wp494> chronomex: #firespring
22:13 <wp494> and others that pop up on the tracker appear to have no issues
22:15 <alard> wp494: Yes, that's one thing. But does it get everything we want to get?
22:16 <alard> It's a complicated script.
22:16 <wp494> hrm
22:16 <wp494> if you want to hold off on adding to the topic, feel free
22:17 <chronomex> I'm inclined to wait for alard to sign off
22:17 <alard> I've checked one or two warcs and they looked good (with the last version of the script, at least).
22:18 <alard> We could go with full force, but there's a small risk that we need to do things again.
22:18 <alard> I haven't been able to find out about the pagination on the photo albums, for example.
22:18 <chronomex> hm
22:18 <alard> (Because I haven't found a user with enough photos.)
22:21 <wp494> have you tried any triple-digit/close to triple-digit users?
22:21 <wp494> (in file size terms)
22:24 <alard> Good idea. I just did that, but didn't see any user with more than 20 pictures. They're big because of something else.
22:28 <wp494> probably formspringaholics
22:32 <omf_> DFJustin, Did you want a copy of vgmdb?
22:33 <alard> I think Formspring works well enough. Checked another warc with the warc-proxy, no missing pages.
22:34 <alard> If there are people with too many pictures they'll at least be included via the Previous-Next buttons.
22:35 <alard> There are a few pagination things that don't work (the 'who smiled at this' thing, for example), but that's due to Formspring.
22:42 <chronomex> namespace | I'm worried about google groups.
22:42 <chronomex> chronomex | hmmmmmmm
22:42 <chronomex> namespace | It's basically dead as far as I can tell, and to my knowledge is one of the largest usenet archives.
22:42 <chronomex> chronomex | I'm with you there
22:42 <chronomex> chronomex | it'd be good to turn it back into a news spool
22:42 <chronomex> chronomex | the way usenet was meant to be
22:42 <chronomex> yes, ggroups is a worthy opponent
22:43 <namespace> And because it's google, you know that the shutdown is a matter of when, not if.
22:43 <thomasbk> do you think google wouldn't be willing to ship some hard drives to the internet archive if they ever shut ggroups down?
22:43 <namespace> True.
22:43 <namespace> I'd hope they would anyway.
22:43 <chronomex> we'd need to find a crooked googler
23:04 <omf_> From my own research we can piece together sections of usenet history with what is already available
23:04 <omf_> which is better than nothing.
23:05 <DFJustin> omf_: I don't personally want a copy but having one on archive.org would be nice
23:05 <omf_> I am doing a refresh on it now
23:05 <ersi> thomasbk: Always assume the answer to that question is no, unless you're sure
23:06 <ersi> That's my rule of thumb
23:06 <omf_> Universities still have tapes full of usenet archives
23:06 <omf_> it is just finding the tapes and people there who can pull the data out
23:07 <omf_> Another angle would be to get the usenet data loaded into BigQuery
23:07 <chronomex> tapes used to be really expensive
23:07 <DFJustin> from what I read google looked under a lot of rocks to get what they have, I'm not sure there's really a lot more out there
23:12 <thomasbk> anyone have any guesses wrt the legalities of rehosting stuff like the yahoo messages content?
23:13 <chronomex> nope
23:13 <ivan`> ersi: different from what you get from http://xooglers.blogspot.com/feeds/posts/default or http://xooglers.blogspot.com/
23:14 <ersi> ivan`: oh, huh
23:14 <ersi> thomasbk: most of us don't give two fucks about that
23:14 <omf_> I just checked up on my usenet sources
23:14 <omf_> I got partial archives going back over 10 years for some groups
23:15 <omf_> We could do it
23:15 <omf_> add that to what is already on the IA and we would have over 50% of everything as a starting point
23:21 <adamc[a]> The longer we wait, the harder it will be to find older data - makes sense to get started on it
23:22 <omf_> I can start cutting it up to feed to the warrior
23:22 <omf_> We are going to have to hit dozens of different archives
23:23 <omf_> I have been tracking this for a few years and there are more archives online now than before
23:23 <omf_> People are starting to open things up
23:23 <Lord_Nigh> i know google has a usenet archive but it's in their weird google-format (missing original headers etc?) so not super useful?
23:23 <omf_> plus hosting is cheaper for larger data sets
23:23 <Lord_Nigh> also missing all the attachments
23:23 <chronomex> I thought that google usenet posts are retrievable in original form
23:24 <omf_> they are
23:34 <zerovox> So I've been downloading on the yahoo task all day. It's taken about 12 hours to download nearly 10,000 urls on Item threads-b-1036-3. Can anyone check if someone else has submitted this by now? Or how many urls there will be?
23:35 <zerovox> Seems pretty slow, but I guess that's due to the rate limit?
23:49 <namespace> Question: Why isn't there a standard URL shortener algorithm in browsers?
23:50 <chronomex> gzip | base64 or something?
23:50 <namespace> Something like that.
23:51 <namespace> It's totally ridiculous that it's even a service. It's obviously something users want, and it could totally be done client side.
23:51 <namespace> I can't think of a single aspect that requires a server to be involved.
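chronomex's "gzip | base64" idea is easy to prototype, and prototyping it also hints at why it never shipped: for typical URLs the compression framing plus the base64 blow-up tends to leave the "short" form no shorter than the input. A sketch, with zlib's DEFLATE standing in for gzip:

```python
import base64
import zlib

def shorten(url):
    """Deflate the URL and emit a URL-safe base64 token, entirely client-side."""
    packed = zlib.compress(url.encode("utf-8"), 9)
    return base64.urlsafe_b64encode(packed).decode("ascii").rstrip("=")

def expand(token):
    # restore the base64 padding stripped by shorten()
    pad = "=" * (-len(token) % 4)
    return zlib.decompress(base64.urlsafe_b64decode(token + pad)).decode("utf-8")

url = "http://www.archiveteam.org/index.php?title=Clown_hosting"
token = shorten(url)
assert expand(token) == url
# no server involved -- but zlib headers plus the 4/3 base64 expansion
# usually make the token about as long as the original for short URLs
```

A server-side shortener sidesteps this by storing the long URL in a table and handing out a tiny key, which is exactly the lookup step the rest of this conversation is about.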
23:52 <omf_> namespace, do you know why people use url shorteners
23:54 <namespace> omf_: Because it's simple and long urls are ugly?
23:54 <namespace> (Unless it's for shock sites. But then why would you want to archive them?)
23:54 <namespace> That and for twitter.
23:56 <omf_> URL shortening services were invented as a way to add a step in the process which allows data to be collected on the user. This is then sold to ad companies
23:56 <omf_> that is the whole point of bitly etc
23:56 <omf_> It has no benefit to end users
23:56 <namespace> Interesting. Source?
23:56 <dashcloud> okay - while it is a problem, that's not true
23:57 <dashcloud> if you're trying to share a link in a character-constrained environment, you're going to run into the URL issue
23:58 <dashcloud> I don't disagree that folks found it was a great way to get analytics on web traffic