#archiveteam 2013-03-24,Sun


Time Nickname Message
00:45 🔗 omf_ How does the IA take in site grabs that do not have warcs?
00:47 🔗 chronomex they don't
00:47 🔗 chronomex well, not into waybackmachine
00:48 🔗 omf_ What if you have all the data that makes the warc
00:48 🔗 omf_ like the transfer time, size, headers, etc...
00:48 🔗 chronomex then I suppose you could make a warc?
00:49 🔗 omf_ I guess I could write a conversion program.
00:56 🔗 godane you would need like a wget log of the files being grabbed for this to work
00:56 🔗 godane in theory
01:06 🔗 chronomex that won't have headers tho
01:07 🔗 godane thats why i said in theory
01:07 🔗 godane was not sure
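
A minimal sketch of the conversion program omf_ is talking about, assuming the grab actually kept the response headers and bodies (as chronomex notes, a plain wget log would not have them). It uses the warcio library, which is just one convenient WARC writer, not anything the channel itself used; the URL, headers, and date below are placeholders.

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open('converted.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)

        # Placeholder data for one saved fetch: URL, body, headers, fetch time
        url = 'http://example.com/page.html'
        body = b'<html>...</html>'
        headers = StatusAndHeaders(
            '200 OK',
            [('Content-Type', 'text/html'), ('Content-Length', str(len(body)))],
            protocol='HTTP/1.1')

        # Wrap the saved response as a WARC 'response' record
        record = writer.create_warc_record(
            url, 'response',
            payload=BytesIO(body),
            http_headers=headers,
            warc_headers_dict={'WARC-Date': '2013-03-24T00:48:00Z'})
        writer.write_record(record)
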
01:22 🔗 ianweller what.
01:22 🔗 ianweller so i went to bed thinking maybe the local warrior that i have running will stop
01:22 🔗 ianweller nope
01:22 🔗 ianweller it's on 7010 URLs and counting
01:25 🔗 chronomex perfect!
01:33 🔗 marczak Is there a script that I could run instead of using the warrior VM?
01:34 🔗 marczak I have a few extra IPs I could run from, but won't have a virtualized environment to run under.
01:34 🔗 omf_ marczak, the peeps in #warrior can answer that
01:34 🔗 marczak great - thanks
01:36 🔗 DrDeke the answer is "yes" but i don't have a link to it handy
01:38 🔗 marczak DrDeke: thanks - someone in #warrior is helping out.
02:14 🔗 omf_ For all the new warriors out there we have long term projects after yahoo and posterous. #urlteam is constantly unfucking the url shorteners so we can find sites without twitter, bitly, etc...
02:15 🔗 omf_ That is our proactive side to saving the web.
02:31 🔗 ersi mah, don't send people to #warrior when they're asking project specific questions
02:32 🔗 ersi marczak: You can run the scripts from: https://github.com/ArchiveTeam/yahoomessages-grab/
02:32 🔗 ersi those are the stand-alone ones. You'll need to compile wget though (a script for that is checked in there ^) and install the seesaw python package.
03:40 🔗 SketchCow I think we just exploded the Yahoo
03:40 🔗 pilgrim well they had it coming
03:46 🔗 godane i just saved low rider world 2006 clip of attack of the show
03:46 🔗 godane it was one of the flvsm videos that i couldn't get
03:55 🔗 SketchCow We just destroyed the Yahoo! backlog
03:57 🔗 DFJustin and how
03:57 🔗 SketchCow The graph looks like a zombie death apocalypse
04:00 🔗 SketchCow 40G .
04:00 🔗 SketchCow root@teamarchive-1:/2/DISCOGS/www.discogs.com/data# du -sh .
04:00 🔗 SketchCow by the way
04:11 🔗 omf_ SketchCow, was that a preventative grab?
04:13 🔗 SketchCow Yes
04:13 🔗 SketchCow I'm working with MusicBrainz to get their stuff on archive.
04:13 🔗 SketchCow And they said "You know, I don't know of any mirrors of discogs.org"
04:14 🔗 DFJustin might do vgmdb.net while you're at it
04:20 🔗 SketchCow Show me where you can download the DB and I will.
04:20 🔗 omf_ DFJustin, I already got a grab of vgmdb.net
04:21 🔗 omf_ it is about 8 months old though
04:21 🔗 DFJustin o/\o
04:22 🔗 omf_ I want to merge some of their data into freebase
06:33 🔗 chronomex why don't we have all warriors running urlteam in the background all the time?
06:35 🔗 chronomex :)
06:35 🔗 omf_ It would help
07:12 🔗 omf_ We need to recruit someone who has google fiber, it could be real helpful
07:13 🔗 omf_ just throwing that out there
07:49 🔗 SketchCow Man, it's going k-razy out there
07:49 🔗 SketchCow My Hard Drive full of goodness goes out Monday
07:49 🔗 SketchCow Working now to build up the maximum amount of data on it
07:50 🔗 omf_ You ship hard drives as well as upload? Talk about no stone unturned :)
08:01 🔗 SketchCow Have to.
08:01 🔗 SketchCow I send in 400-500gb a hit
08:03 🔗 chronomex whumph whumph
08:05 🔗 omf_ Do you have shock proof cases for mailing? I always wanted to ask how those work out.
08:06 🔗 chronomex if I were mailing hdds I'd probably reuse original hdd packing materials
08:06 🔗 chronomex seems to work
08:26 🔗 ivan` in case I get hit by a meteor in the next 3 months somebody better remember to scrape all of Reader's *.blogspot.com/atom.xml feeds in addition to the feed URLs they currently use
08:26 🔗 ivan` e.g. xooglers.blogspot.com/atom.xml gets you completely different content
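
A minimal sketch of the extra scrape ivan` is asking for, assuming you already have a list of *.blogspot.com hostnames pulled from Reader's subscribed feed URLs; the hostname list and output filenames are just placeholders.

    import urllib.request

    # Placeholder list of blogspot hosts taken from Reader's feed URLs
    hosts = ['xooglers.blogspot.com']

    for host in hosts:
        # /atom.xml can return different content than /feeds/posts/default, so fetch both
        for path in ('/atom.xml', '/feeds/posts/default'):
            url = 'http://%s%s' % (host, path)
            try:
                data = urllib.request.urlopen(url, timeout=30).read()
            except Exception as exc:
                print('failed %s: %s' % (url, exc))
                continue
            fname = host + path.replace('/', '_')
            with open(fname, 'wb') as f:
                f.write(data)
            print('saved %s (%d bytes)' % (fname, len(data)))
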
11:00 🔗 SketchCow chronomex: I do.
11:52 🔗 ersi ivan`: Different content than what?
14:54 🔗 omf_ Our clown information is growing nicely. If you have any observations you would like to add http://www.archiveteam.org/index.php?title=Clown_hosting
16:19 🔗 chazchaz omf_: Are there any guidelines for including providers in that list?
16:22 🔗 omf_ website url, price point, specs, and any insights into why the service works so well or problems with it
16:23 🔗 omf_ the joyent and DO entries are good examples we have built out
16:23 🔗 omf_ we have vps and cloud providers on there
16:25 🔗 omf_ bandwidth and storage are right up there with price point as important data we need
16:57 🔗 chazchaz Ok, I added BuyVM
16:59 🔗 omf_ chazchaz, you use them recently?
16:59 🔗 chazchaz Yeah, I have 2 servers with them.
16:59 🔗 chazchaz One for over a year
17:03 🔗 omf_ What can you fit in 128mb ram
17:04 🔗 omf_ I cannot think of too much you could run
17:04 🔗 omf_ I could host my photos on there. Cheaper than flickr
17:07 🔗 neurophyr edis.at has a good 128MB miniVPS option.
17:07 🔗 neurophyr I run lower-traffic Tor relays and bridges on that kind of box.
17:08 🔗 neurophyr and it was quite happy to run the yahoomessages-grab script.
17:09 🔗 chazchaz omf_: They let you burst up to 2x as long as it's available, which seems to be almost all the time. I'm using 150 MB for 40 posterous processes and 2 yahoo-messages processes
17:10 🔗 omf_ chazchaz, you should make a note on the wiki, that is valuable info
17:13 🔗 chazchaz done
17:14 🔗 omf_ thanks
17:36 🔗 DrDeke i'm kind of offended that there is a wiki page called "Clown hosting" and my apartment closet isn't eligible to be listed in it ;)
17:37 🔗 DrDeke outage notifications? pshhh, yeah maybe i'll email you if i decide to take the server apart for some reason 5 minutes before i do it if you have a VM on it
17:39 🔗 chazchaz Just check it yourself. That's what ping is for right?
17:41 🔗 DrDeke exactly!
17:41 🔗 DrDeke i made a major jump in my level of customer service a couple months ago when i put everyone's email address that i could track down in a google spreadsheet
17:41 🔗 DrDeke sometimes it gets copy and pasted into a bcc
17:41 🔗 DrDeke sometimes... =)
17:42 🔗 DrDeke (nobody is paying, so, you know...)
17:42 🔗 chronomex 'wall' ought to be acceptable notice for planned maintenance
17:42 🔗 DrDeke i actually got to do that on a couple servers at my real job last night
17:43 🔗 DrDeke "Oh, we forgot to mention that part in the email? Well, just shutdown +30 it, the users will be fine."
17:43 🔗 DrDeke (needless to say, that is not the way it normally works there)
17:43 🔗 DrDeke since the system these servers are for was going to be completely down anyway, we figured oh well
19:31 🔗 omf_ Did someone already grab the ign forums?
19:41 🔗 Smiley omf_: ask in #ispygames
19:41 🔗 Smiley someone was doing work on a lot of that stuff there
19:41 🔗 omf_ that is me
19:42 🔗 omf_ I just checked the scroll back to the 22nd of last month and nothing
19:45 🔗 Smiley D:
19:45 🔗 Smiley sorry for being an idiot then ;)
19:46 🔗 omf_ No worries. It is hard to follow so many projects going on.
19:46 🔗 Smiley aye
19:46 🔗 omf_ I know some forums for some sites were grabbed but nothing about the main ign
19:48 🔗 omf_ The wiki is down
19:49 🔗 omf_ Resource Limit Is Reached errors a few times
19:49 🔗 omf_ seems fine again now
20:29 🔗 SketchCow It happens.
20:30 🔗 omf_ SketchCow, Is it alright if I start uploading that 4data to you?
20:31 🔗 omf_ It is 102gb
20:31 🔗 omf_ and it will probably take over a week to upload, possibly longer
20:34 🔗 SketchCow What 4data?
20:34 🔗 SketchCow I mean, I'm sure we discussed it. What is it?
20:35 🔗 omf_ The 4chandata dump
20:35 🔗 omf_ from that archive site that is closed
20:35 🔗 SketchCow Oh, of course.
20:35 🔗 SketchCow Yeah, go ahead. Do you need credentials?
20:35 🔗 omf_ I already got them
20:36 🔗 omf_ I am still waiting on the database dump itself but I am not worried. This guy has come through on everything he said so far
21:19 🔗 Nimbulan 4 Get your free Psybnc 100 user have it come http://www.multiupload.nl/B11JFCYQH6
21:20 🔗 Marcelo lol
21:22 🔗 soultcer In case anyone is wondering: https://www.virustotal.com/en/file/f897432de88adce73b23741da1a133b6a79b8233d50571451dab4b992931d173/analysis/1364160122/
21:23 🔗 chronomex errrr
21:23 🔗 chronomex what's that from?
21:23 🔗 soultcer That's the free Psybnc
21:23 🔗 soultcer Hm, I wonder if xchat logs bans
21:24 🔗 Marcelo So many nicknames for this virus.
21:33 🔗 chronomex is there a ratelimiter on formspring?
21:37 🔗 zenpho howdy doo! I'm reporting back. Soultcer helped me yesterday with digging into the btinternet stuff (http://archive.org/details/archiveteam-btinternet)
21:38 🔗 soultcer Did it work?
21:39 🔗 zenpho yes indeedie! - i wrote some horrible awk scripts to parse the CDX files for stuff I was interested in, download via curl, unpack, and now I'm browsing thru some vintage .au and .wav files ... very cool
21:39 🔗 soultcer Sweet
21:41 🔗 zenpho very kind of you to help and encourage me to carry on, i was almost convinced that the megawarc files would have to be downloaded in their entirety (or at least an entire megawarc) to get anything out of them
21:42 🔗 zenpho i was right about to say "ehh.... it probably doesn't work like that", and give up, but you convinced me. and it's certainly very cool to browse thru this stuff!
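
A rough Python equivalent of the awk-and-curl workflow zenpho describes: read the CDX index, pick out interesting URLs, and range-request just those records out of the megawarcs on archive.org instead of downloading them whole. It assumes the common 11-field CDX layout with compressed length, offset, and warc filename as the last three columns, and that the warc files sit under the archiveteam-btinternet item; both are assumptions to check against the actual index.

    import gzip
    import io
    import urllib.request

    # Assumed download prefix; the warc filename itself comes from the CDX line
    ITEM = 'https://archive.org/download/archiveteam-btinternet/'

    def fetch_record(warc_name, offset, length):
        """Range-request one gzipped record out of a megawarc and decompress it."""
        req = urllib.request.Request(
            ITEM + warc_name,
            headers={'Range': 'bytes=%d-%d' % (offset, offset + length - 1)})
        raw = urllib.request.urlopen(req).read()
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()

    with open('index.cdx') as cdx:
        for line in cdx:
            fields = line.split()
            if len(fields) < 11 or fields[0] == 'CDX':
                continue  # skip the header line and anything malformed
            url = fields[2]
            length, offset, warc_name = int(fields[8]), int(fields[9]), fields[10]
            # Only the vintage audio files zenpho mentions
            if not (url.endswith('.au') or url.endswith('.wav')):
                continue
            record = fetch_record(warc_name, offset, length)
            # 'record' still includes the WARC and HTTP headers; split those off
            # before writing the payload out as a usable .au/.wav file
            print('got %d bytes for %s' % (len(record), url))
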
21:47 🔗 ersi Neat :)
21:50 🔗 alard chronomex: Yes.
21:50 🔗 alard (I set a rate limit on the tracker, that is.)
21:50 🔗 chronomex ah
21:51 🔗 alard But that limit is not reached, at the moment. I set it to 20 to be safe, but we're currently at 2-4 per minute.
21:52 🔗 chronomex I meant running multiple threads on my end
21:53 🔗 alard I don't know how Formspring behaves.
21:54 🔗 chronomex ok, I'll just run 1 for now
22:11 🔗 wp494 would it be possible to get a message asking for assistance on the formspring project in the topic?
22:12 🔗 chronomex sure, is there a channel for it?
22:12 🔗 alard wp494: Are we sure that it works?
22:13 🔗 wp494 alard: yep, I've been running 3 concurrent for an hour or two and haven't run into any issues
22:13 🔗 wp494 chronomex: #firespring
22:13 🔗 wp494 and others that pop up on the tracker appear to have no issues
22:15 🔗 alard wp494: Yes, that's one thing. But does it get everything we want to get?
22:16 🔗 alard It's a complicated script.
22:16 🔗 wp494 hrm
22:16 🔗 wp494 if you want to hold off on adding to the topic, feel free
22:17 🔗 chronomex I'm inclined to wait for alard to sign off
22:17 🔗 alard I've checked one or two warcs and they looked good (with the last version of the script, at least).
22:18 🔗 alard We could go with full force, but there's a small risk that we need to do things again.
22:18 🔗 alard I haven't been able to find out about the pagination on the photo albums, for example.
22:18 🔗 chronomex hm
22:18 🔗 alard (Because I haven't found a user with enough photos.)
22:21 🔗 wp494 have you tried any triple digit/close to triple digit users?
22:21 🔗 wp494 (in file size terms)
22:24 🔗 alard Good idea. I just did that, but didn't see any user with more than 20 pictures. They're big because of something else.
22:28 🔗 wp494 probably formspringaholics
22:32 🔗 omf_ DFJustin, Did you want a copy of vgmdb?
22:33 🔗 alard I think Formspring works well enough. Checked another warc with the warc-proxy, no missing pages.
22:34 🔗 alard If there are people with too many pictures they'll at least be included via the Previous-Next buttons.
22:35 🔗 alard There are a few pagination things that don't work (the 'who smiled at this'-thing, for example), but that's due to Formspring.
22:42 🔗 chronomex namespace | I'm worried about google groups.
22:42 🔗 chronomex chronomex | hmmmmmmm
22:42 🔗 chronomex namespace | It's basically dead as far as I can tell, and to my knowledge is one of the largest usenet archives.
22:42 🔗 chronomex chronomex | I'm with you there
22:42 🔗 chronomex chronomex | it'd be good to turn it back into a news spool
22:42 🔗 chronomex chronomex | the way usenet was meant to be
22:42 🔗 chronomex yes, ggroups is a worthy opponent
22:43 🔗 namespace And because it's google, you know that the shutdown is a matter of when not if.
22:43 🔗 thomasbk do you think google wouldn't be willing to ship some hard drives to the internet archive if they ever shut ggroups down?
22:43 🔗 namespace True.
22:43 🔗 namespace I'd hope they would anyway.
22:43 🔗 chronomex we'd need to find a crooked googler
23:04 🔗 omf_ From my own research we can piece together sections of usenet history with what is already available
23:04 🔗 omf_ which is better than nothing.
23:04 🔗 DFJustin omf_: I don't personally want a copy but having one on archive.org would be nice
23:05 🔗 omf_ I am doing a refresh on it now
23:05 🔗 ersi thomasbk: Always assume the answer to that question is no, unless you're sure
23:05 🔗 ersi That's my rule of thumb
23:06 🔗 omf_ Universities still have tapes full of usenet archives
23:06 🔗 omf_ it is just finding the tapes and people there who can pull the data out
23:07 🔗 omf_ Another angle would be to get the usenet data loaded into BigQuery
23:07 🔗 chronomex tapes used to be really expensive
23:07 🔗 DFJustin from what I read google looked under a lot of rocks to get what they have, I'm not sure there's really a lot more out there
23:12 🔗 thomasbk anyone have any guesses wrt the legalities of rehosting stuff like the yahoo messages content?
23:13 🔗 chronomex nope
23:13 🔗 ivan` ersi: different from what you get from http://xooglers.blogspot.com/feeds/posts/default or http://xooglers.blogspot.com/
23:14 🔗 ersi ivan`: oh, huh
23:14 🔗 ersi thomasbk: most of us don't give two fucks about that
23:14 🔗 omf_ I just checked up on my usenet sources
23:14 🔗 omf_ I got partial archives going back over 10 years for some groups
23:15 🔗 omf_ We could do it
23:15 🔗 omf_ add that to what is already on the IA and we would have over 50% of everything as a starting point
23:21 🔗 adamc[a] The longer we wait, the harder it will be to find older data - makes sense to get started on it
23:22 🔗 omf_ I can start cutting it up to feed to the warrior
23:22 🔗 omf_ We are going to have to hit dozens of different archives
23:23 🔗 omf_ I have been tracking this for a few years and there are more archives online now than before
23:23 🔗 omf_ People are starting to open things up
23:23 🔗 Lord_Nigh i know google has a usenet archive but it's in their weird google format (missing original headers etc?) so not super useful?
23:23 🔗 omf_ plus hosting is cheaper for larger data sets
23:23 🔗 Lord_Nigh also missing all the attachments
23:23 🔗 chronomex I thought that google usenet posts are retrievable in original form
23:24 🔗 omf_ they are
23:34 🔗 zerovox So I've been downloading on the yahoo task all day. It's taken about 12 hours to download nearly 10,000 urls on item threads-b-1036-3. Can anyone check if someone else has submitted this by now? Or how many urls there will be?
23:35 🔗 zerovox Seems pretty slow, but I guess that's due to the rate limit?
23:49 🔗 namespace Question: Why isn't there a standard URL shortener algorithm in browsers?
23:50 🔗 chronomex gzip | base64 or something?
23:50 🔗 namespace Something like that.
23:51 🔗 namespace It's totally ridiculous that it's even a service. It's obviously something users want, and it could totally be done client side.
23:51 🔗 namespace I can't think of a single aspect that requires a server to be involved.
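
A minimal sketch of the gzip|base64 idea chronomex floats above, purely to show the client-side round trip; it says nothing about how such a "short" URL would be dereferenced, and as the comment notes, compression only pays off for long, repetitive URLs.

    import base64
    import zlib

    def encode(url):
        # Compress the URL and make it copy-paste safe with URL-safe base64
        return base64.urlsafe_b64encode(zlib.compress(url.encode('utf-8'))).decode('ascii')

    def decode(token):
        return zlib.decompress(base64.urlsafe_b64decode(token)).decode('utf-8')

    long_url = 'http://example.com/some/very/long/path?with=lots&of=query&parameters=here'
    token = encode(long_url)
    assert decode(token) == long_url
    # Compression overhead means the token can be longer than a short input URL
    print(len(long_url), len(token))
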
23:52 🔗 omf_ namespace, do you know why people use url shorteners
23:54 🔗 namespace omf_: Because it's simple and long urls are ugly?
23:54 🔗 namespace (Unless it's for shock sites. But then why would you want to archive them?)
23:54 🔗 namespace That and for twitter.
23:56 🔗 omf_ URL shortening services were invented as a way to add a step in the process which allows data to be collected on the user. This is then sold to ad companies
23:56 🔗 omf_ that is the whole point of bitly etc
23:56 🔗 omf_ It has no benefit to end users
23:56 🔗 namespace Interesting. Source?
23:56 🔗 dashcloud okay- while it is a problem, that's not true
23:57 🔗 dashcloud if you're trying to share a link on a character-constrained environment, you're going to run into the URL issue
23:58 🔗 dashcloud I don't disagree folks found it was a great way to get analytics on web traffic
