[00:30] I've downloaded all of the Insurgency Wiki that is on the porfusion website. Is there something I should do with the data?
[00:32] Might be widely known.. I just found out though.. Bebo's old Bebo pages are now under the .archive.bebo.com/
[00:33] and http://archive.bebo.com/Profile.jsp?MemberId=4134490647
[00:33] example
[00:33] http://www.bebo.com/#faq Your old photos and blog posts are safe. They will be available for download in a couple of months. Other things (skins, quizzes, wall posts, games etc…) unfortunately will all be retired.
[00:34] We’re just as sentimental as you are (ok, probably more), and have left all public profiles visible for now. Private profiles are also saved, but not visible at the moment.
[00:35] It might be time to put in the AT Warrior and start a tracker and save/grab what we can
[00:44] Lol. but i am not that experienced to do so, yet. If I can help any way along with my bandwidth I'd love to. I had a Bebo years ago. Heh.
[03:47] i made the wretch wiki page: http://archiveteam.org/index.php?title=Wretch
[04:15] i made the bebo wiki page: http://archiveteam.org/index.php?title=Bebo
[04:39] Wheeee
[04:39] Did zapd die?
[04:43] Nope, still there.
[04:43] hasn't been zapd out of existence yet
[04:44] So an old music Blog / Aggregator is shutting down
[04:44] http://pitchfork.com/news/52578-music-blog-aggregator-elbows-shuts-down/
[04:44] End of November. Do you think it's worth archiving?
[04:50] omf_: at some point the rsync died and now i get this: "sending incremental file list". it hung on this for **at least** 3 hours, after i tried to re run the rsync
[05:09] link343: since no one else has answered yet, i'll say yes. i'll keep an eye on it.
[05:10] ok
[06:39] SketchCow: could you move all wikimediacommons* items in wikimedia-other collection to wikimediacommons collection?
[06:39] (the collection was somehow broken but seems to work now)
[06:42] are these on archive.org?
http://www.bl.uk/bibliographic/download.html
[07:25] http://archive.org/details/BritishLibraryRdf earlier one
[07:37] I'm going to make a new one.
[07:37] yay
[07:38] :)
[07:38] and the new book deriver got rid of some of my redrows, sweet
[08:14] God damn it, Yahoo! - what the fuck is your problem.
[08:14] I might have some friends who can read Traditional Chinese (used in, for example, Taiwan/Republic of China) - haven't talked to them in ages though.
[08:25] ersi: should be easy to find some if it's a quick task
[08:29] It's for wretch.cc. We'll need it for finding important structure and for content verification
[08:38] hmm, not so quick, I have only 1 option then
[08:39] guy: I have Amiga files not in TOSEC. Would like to add them.
[08:39] me: cool! here's where to go to TOSEC to contribute
[08:39] guy: Oh, that's requiring to make a login, I'm not gonna do that
[08:40] So I think he wants me to put the spoon in AND move his chin so he chews it
[08:59] SketchCow: Move these to a tekzilla-daily collection when you can: http://archive.org/search.php?query=collection%3Atekzilla%20AND%20subject%3A%22tekzilla%20daily%20tip%22&sort=-date
[08:59] this is more to keep the tekzilla collection just for full episodes of tekzilla
[09:02] since there are 1500+ tekzilla daily episodes
[09:09] tekzilla-daily now created and you own it
[09:10] thanks
[09:18] http://archive.org/details/BritishLibraryRdf-2013-09
[09:22] SketchCow: https://twitter.com/BLMetadata/status/387145272951701504 if you want to announce it to them
[09:23] looks like something is wrong with this item: http://archive.org/details/Tekzilla_Daily_36
[09:30] It's set dark.
[09:30] I'll find out why tomorrow.
[09:36] it's patrick talking about pricewatch.com
[09:36] so unless they think that one was spam i don't see why it would go dark
[09:40] also looks like episode 41 and 42 of oneoff episodes may be in revision3 bestof collection
[09:41] there were 3 episodes of e3 2009 live streaming
[09:43] SketchCow: thought i'd point out that i have 33 more episodes of geekbeat.tv to be added to the collection: http://archive.org/search.php?query=subject%3A%22GeekBeat.TV%22%20AND%20collection%3Aopensource_movies&sort=-date
[10:30] grab it while you can http://www-users.cs.umn.edu/~sarwat/foursquaredata/
[10:44] And anarchive has a copy.
[12:23] hi
[12:24] hello3
[12:26] and the dataset is gone :D
[12:26] for the better
[12:27] oh
[12:28] I feel 24hour like 150~200hour
[12:34] Schbirid: lol
[12:34] * joepie91 has a copy
[12:34] it looked like quite the gross privacy violation so it's for the better to be gone
[12:34] Schbirid: privacy violation? this is data that users have put on foursquare themselves publicly, no?
[12:35] Hi yall, I'm working on downloading elbo.ws by the way
[12:35] there is a difference between putting data on fq and mass aggregating it
[12:35] Schbirid: hardly
[12:35] Take that discussion somewhere else
[12:35] In this channel: Grabbing is GO and OK for whatever reason.
[12:35] but lets not get into that discussion again, last time people showed to have a different understanding of privacy than me
[12:35] aye
[12:36] Feel free to talk about privacy/downloading morals in #archiveteam-bs though
[12:39] okay, sry both of you. Really sorry.
[12:39] Keni: No worries, I'm saying it because everyone needs a reminder occasionally. And we're many people, so if we drift off-topic in this channel, something important might get lost.
[12:40] ersi: Use simple English... He is Japanese.
[12:40] ersi: He is having hard time understanding...
[12:40] thx but I'ts allright
[12:41] This is so GAP than learn to school.
[12:42] So don't mind that thx.
[12:42] sure
[12:42] :)
[15:02] I want that dataset.
[15:13] I wish I was awake sooner.
[15:14] The SECOND datasets like that appear, grab.
[15:24] I am not sure if Archivebot is still working on Silk Road Forums, but the site is still squirming
[15:24] was just able to load it a few minutes ago.
[15:26] SketchCow, GLaDOS grabbed a copy of that foursquare data
[16:10] SketchCow: i failed you on that one
[16:11] but i found another smaller dataset
[16:39] SketchCow: check your PM
[17:06] it works well enough at this point
[17:06] !status
[17:06] yipdw: Job status: 5039 completed, 14 aborted, 3 in progress, 0 pending
[17:06] yep
[19:27] https://www.mediawiki.org/w/index.php?title=Language_portal&diff=prev&oldid=797446
[19:40] anyone got some good wget --reject-regex for blogspot sites to reduce duplicates and search result shite?
[19:46] Schbirid, let us know if you find anything, that sounds really useful
[19:56] Schbirid: and remember to update http://archiveteam.org/index.php?title=Blogger
[19:57] nice, thanks
[19:58] Do you have url lists from a few sites?
[20:00] nope
[20:04] any idea what the "*\\?*,*@*" is supposed to reject?
[21:15] He went away, but I guess any URL with parameters?
[21:29] Funny http://oami.europa.eu/robots.txt
[22:02] CAT SIGNAL ACTIVATED: blip.tv is deleting years of vloggers videos, can you help?
[22:03] i'm checking that archive.org is willing to ingest it all
[22:05] diffalot, start by giving us a link to the announcement page
[22:06] no page, this is something blip is quietly doing, see tweets from https://twitter.com/schlomo , quirk, and trine
[22:07] here's a news story: http://www.zennie62blog.com/2013/10/08/blip-tv-er-blip-networks-sacks-ceo-kelly-day-shortest-exec-career-since-john-paul-i-24113/
[22:08] archive.org says, "hell yes" https://twitter.com/tracey_pooh/status/387700340176351233
[22:12] Well this is an interesting problem.
How to find the vloggers they are going to erase
[22:15] and I already want to shit on the heads of the developers of blip.tv
[22:16] good we already have a page http://archiveteam.org/index.php?title=Blip.tv
[22:19] It's 30 days.
[22:19] We have 30 days.
[22:20] perhaps blip would provide a list? or we create an opt-in form? i'm not seeing any mediaRSS feeds on the user profile pages in question
[22:22] i'm ok with phantomJS and jsdom, so i'll see what i can do
[22:24] I guess that the last ditch effort would be
[22:25] "just archive all of blip and we'll figure out what's gone later"
[22:28] i'm looking for an example of a past scraper the team has used, any recommendations?
[22:29] iirc, y'all have some sophisticated turnkey solutions ;)
[22:33] While I am adding info I find to the wiki about blip.tv I would like to remind everyone we had serious server problems during backing up zapd and frankly blip.tv is going to require bigger metal to suck down that much data
[22:33] what sort of server problems.
[22:34] we went down, we ran out of space, the usual
[22:34] Well, that's because the same people aren't using the tracker - we'll have the use of FOS for a dump.
[22:34] the hosting company randomly turns off or reboots the server
[22:39] That's because people take over the project and use their own central servers instead of internet archive.
[23:04] I am searching the commoncrawl index for urls
[23:07] ah ha: http://blip.tv/schlomo/rss/
[23:07] (must be turned on by the producer?)
[23:15] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[23:17] THY SECRET WORD is "yahoosucks" GO FORTH AND IMPART THY KNOWLEDGE
[23:17] * diffalot kneels and accepts the mantle
[23:17] SketchCow, we need a cool name to make an irc channel
[23:19] bloop
[23:25] Has anyone here looked into working with the Majestic-12 project to find usernames for websites and such?
It seems like it could be a really valuable source of data (they have ~2.7 trillion URLs in their databases)…
[23:27] kyan, url?
[23:27] omf_: this is their "real" website: http://www.majestic12.co.uk/ This is their commercial website: https://www.majesticseo.com/
[23:39] contacting the public relations team at blip (http://annieisms.com/about/), good idea or bad idea?
[23:40] kyan: I do a lot of MJ12 crawling, I've considered asking them in the past, but due to the commercial nature of what they do I doubt they'd work with us.
[23:41] the question would be: can we get a list of the shows that are being deleted?
[23:41] Cameron_D, that would be understandable
[23:41] and yet they use the public to do the bulk of the work
[23:43] AT wouldn't be using the data for profit… might be worth asking
[23:43] yeah, maybe
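
Re the [20:04] question about "*\\?*,*@*": read as a comma-separated list of shell-style globs, it would match any name containing a literal "?" or an "@", which fits the [21:15] guess about URLs with parameters. A minimal Python sketch of just the glob semantics follows. The helper name `is_rejected` and the sample names are made up for illustration, and how a given wget version applies its reject list to query strings is not checked here.

```python
from fnmatch import fnmatchcase

# The two globs from the quoted pattern, split on the comma.
# In the wget pattern "\?" stands for a literal question mark; Python's
# fnmatch has no backslash escape, so the literal "?" is written "[?]".
REJECT_PATTERNS = ["*[?]*", "*@*"]


def is_rejected(name: str) -> bool:
    """Return True if the name matches any of the reject globs (hypothetical helper)."""
    return any(fnmatchcase(name, pattern) for pattern in REJECT_PATTERNS)


if __name__ == "__main__":
    # Made-up sample names, only to show which ones the globs catch.
    for sample in ("post.html", "post.html?showComment=1", "someone@example"):
        print(sample, is_rejected(sample))
```

Names with a "?" or "@" anywhere in them match; a plain "post.html" does not.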