#archiveteam 2014-10-03,Fri


Time Nickname Message
00:42 🔗 tfgbd Is there any kind of archive of product packages out there?
00:43 🔗 tfgbd Lately (past year), I've been photographing every food/other package I could find
00:47 🔗 joepie91 tfgbd: mm
00:47 🔗 joepie91 this may be useful for a planned future project of mine
00:47 🔗 joepie91 :)
01:02 🔗 DFJustin wikimedia commons would probably take some of them if they're good quality
01:08 🔗 tfgbd All angles?
01:08 🔗 tfgbd I have all sides of the box
01:08 🔗 tfgbd and the date can be gathered from the image metadata
01:11 🔗 tfgbd Do you guys or archive.org ever work with sites like: http://www.oldversion.com/ or http://www.oldapps.com/
01:13 🔗 DFJustin work with, no
01:13 🔗 DFJustin oldapps was partially crawled with archivebot
01:14 🔗 tfgbd might be easier if they just gave you access to backups
01:14 🔗 tfgbd why only partially?
01:14 🔗 DFJustin job crashed
01:14 🔗 Diesel_ Great idea, you're in charge
01:14 🔗 DFJustin actually archive.org has gotten sets of data from some software sites in the past
01:14 🔗 DFJustin tucows circa 2004 https://archive.org/details/tucows
01:15 🔗 tfgbd yeah, I knew about tucows
01:15 🔗 DFJustin old browser versions from evolt.org https://archive.org/details/evolt_browser_archive
01:15 🔗 tfgbd wait, is tucows gone?
01:15 🔗 DFJustin no it's still around last I checked
01:15 🔗 DFJustin but the archive wasn't kept up to date
01:15 🔗 tfgbd They just removed old stuff
01:16 🔗 tfgbd I know there used to be lots of tucows mirrors around years back
01:16 🔗 tfgbd they're like one of the few that had tons of mirrors
01:16 🔗 DFJustin there's an archive team project to crawl all public ftps https://archive.org/details/ftpsites
01:17 🔗 tfgbd do those use WARC too?
01:17 🔗 DFJustin no
01:17 🔗 DFJustin just tar or zip
01:17 🔗 tfgbd where do you put them then?
01:18 🔗 DFJustin archive.org file items
01:18 🔗 tfgbd that sucks
01:18 🔗 DFJustin wayback machine doesn't do ftp
01:18 🔗 tfgbd you have to download the whole FTP?
01:18 🔗 tfgbd well, at least they're there
01:18 🔗 DFJustin depends on the site, some of them are too big and have to be split into subdirectories
01:19 🔗 tfgbd maybe they will be able to be mirrored somewhere if they ever start a project for ftp
01:19 🔗 DFJustin also archive.org lets you browse inside archive files
01:19 🔗 tfgbd ahh
01:20 🔗 DFJustin look for the "[contents]" link or add a / to the download link
02:03 🔗 espes__ tfgbd: joepie91: DFJustin: if people are seriously interested in jdget then I'm interested in help getting it into a stage where it's maintainable and useful
02:20 🔗 tfgbd cool
02:20 🔗 tfgbd how does it deal with captchas, though
02:31 🔗 espes__ it doesn't
02:32 🔗 espes__ apart from jdownloader's captcha solver
02:32 🔗 espes__ which I might even have disabled
04:02 🔗 APerti http://www.ephotobay.com/image/shadowrun-snes-300.jpg
04:35 🔗 SketchCow Boop
04:35 🔗 xmc beep
04:56 🔗 APerti Working on a Lemmings SNES scan for Psygnosis.org, Jason.
05:19 🔗 SketchCow -bs, please. :)
13:01 🔗 joepie91 espes__: does it actually run as native code?
13:01 🔗 joepie91 or does it still use the JRE
13:57 🔗 espes__ joepie91: it's all compiled to native code with GCJ
13:58 🔗 espes__ you get one fat 80MB binary
14:56 🔗 Jonimus You know what we need, a Waterboy parody video that replaces "Water Sucks" with "Yahoo Sucks" and "Gatorade is better" with "ArchiveTeam is better"
14:56 🔗 * Jonimus debates working on that tonight.
15:09 🔗 joepie91 Jonimus: we getting a theme song now? :D
15:15 🔗 qwerty0 hey, does anyone know what happened to the google video archive?
15:16 🔗 qwerty0 all I can find on archive.org is one item that looks like a captured one by them, not us: https://archive.org/details/GVID-20110417095014-crawl340
15:21 🔗 Jonimus joepie91: well my GF is a pretty good singer, if I can get something together we just might.
15:22 🔗 joepie91 :D
15:22 🔗 joepie91 SketchCow: see above question from qwerty0
15:22 🔗 SketchCow Good question.
15:48 🔗 Arkiver2 SketchCow: I saw this contributor uploading old magazines: https://archive.org/search.php?query=uploader%3A%22paulo%40paulogarcia.com%22
15:48 🔗 Arkiver2 Maybe something for the magazine archive?
15:51 🔗 SketchCow Sadly, they were already in there.
15:51 🔗 SketchCow All of them, just checked. The exact files.
15:57 🔗 Arkiver2 Hoped there was something new in there :/
15:58 🔗 SketchCow Nope, just someone blowing through the same collection I did, 2 years ago.
15:58 🔗 SketchCow An hero
15:59 🔗 antomatic Heh. I've got hundreds of old magazines nobody has scanned, anywhere.
15:59 🔗 antomatic AAAAND all the coverdiscs and cover CDs too...
16:00 🔗 antomatic Going back about 20 years
16:00 🔗 antomatic Oh god, they take up so much room..
16:00 🔗 antomatic I have a problem. :(
16:00 🔗 qwerty0 SketchCow: about google video, do you think you could find out where it went?
16:21 🔗 SketchCow We grabbed metadata.
16:22 🔗 qwerty0 Oh, I thought I remembered we handed over all the data for them to host.
16:22 🔗 qwerty0 or, store, at least
16:23 🔗 SketchCow I show just metadata.
16:24 🔗 qwerty0 Damn. So where'd it go? I hope it wasn't discarded when they announced the Youtube migration feature.
16:24 🔗 SketchCow google did a proper shutdown after we screamed
16:25 🔗 SketchCow https://archive.org/details/google-video-metadata-dumpage
16:27 🔗 qwerty0 It'd be a shame if it was lost, since a lot of videos never migrated and gv is toast now.
16:27 🔗 SketchCow Well, good question what happened to the 18gb
16:27 🔗 qwerty0 *TB?
16:27 🔗 SketchCow Or tb.
16:28 🔗 qwerty0 haha, good
16:28 🔗 SketchCow I am not quite in the proper mood for this investigation.
16:28 🔗 SketchCow First, please, do not talk as if this was the fall of rome.
16:28 🔗 SketchCow We got Google to do a proper migration.
16:29 🔗 SketchCow Second a lot of videos that didn't migrate basically failed the content filter.
16:29 🔗 SketchCow Finally, there's no case where I or archive.org deleted data
16:30 🔗 SketchCow Potentially, it got lost, maybe, but I doubt that.
16:30 🔗 SketchCow But not if it went on archive.org.
16:30 🔗 SketchCow 18tb back then would definitely have been a major deal to put on archive.org.
16:30 🔗 SketchCow We are not perfect.
16:30 🔗 SketchCow We're better than we were and worse than we will be.
16:31 🔗 SketchCow archivebot solved a lot.
16:32 🔗 SketchCow Boy, I better get on top of this energy issue
16:32 🔗 SketchCow I don't like losing hours
16:33 🔗 SketchCow I can't even find a record of what we did with the google video.
16:35 🔗 SketchCow https://archive.org/details/googlevideo2011
16:35 🔗 SketchCow Looks like IA crawled it off archive team metadata
16:36 🔗 SketchCow Justice. Justice was served.
16:37 🔗 DFJustin I still have 20gb worth from my googlegarge folder, I thought it was rsynced to you at some point
16:37 🔗 DFJustin *googlegargle
16:38 🔗 SketchCow There is a chance
16:39 🔗 SketchCow A slight one, this is after all 3 years ago
16:39 🔗 SketchCow That what I did was work with Kenji to have him crawl the videos out with IA and then I'd delete our copies
16:40 🔗 SketchCow Video continues to be our weak point
16:40 🔗 SketchCow One little maniac with an HD camcorder can film himself eating a bowl of captain crunch for 20 minutes and there's 5gb
16:41 🔗 qwerty0 oops, connection problem
16:41 🔗 joepie91 qwerty0: http://sebsauvage.net/paste/?17aacded4c8c1d77#wzItqU02Q/D1JJ5jL6Wjhva/6z0D4EdwDnfjhVPpRKM=
16:42 🔗 DFJustin might it be in a noindex collection somewhere? I remember there was some copyright concerns with e.g. stage6
16:43 🔗 SketchCow Possibly
16:43 🔗 SketchCow But I am fairly sure, as that was my first year with IA, that Alexis would have told me to make it wayback friendly
16:43 🔗 qwerty0 joepie91: awesome, thanks
16:43 🔗 SketchCow And possibly, that meant the swapover via Kenji
16:43 🔗 SketchCow Since this precedes MegaWARC and archivebot
16:45 🔗 qwerty0 SketchCow: okay, yeah, to be clear: I'm not looking for blame or anything, just trying to do some follow-up
16:45 🔗 SketchCow I wouldn't let files get deleted.
16:45 🔗 SketchCow But they're likely in wayback as links.
16:46 🔗 qwerty0 SketchCow: cool, yeah, that's the last thing I'd assume you'd do.
16:49 🔗 SketchCow I'd say "it's somewhere"
16:50 🔗 SketchCow But you need to know that archive team has a class of materials that are of dubious accessibility
16:50 🔗 SketchCow One of our intentions was to make it so there were much less of those going forward
16:50 🔗 SketchCow And I think we did well.
16:50 🔗 qwerty0 Yeah, I figured it was just a matter of surfacing it to the public.
16:50 🔗 SketchCow We were working on an audit but people got bored/lost
16:50 🔗 SketchCow It's tedious work and not as sexy for Our Fine Men
16:50 🔗 qwerty0 Right, exactly.
16:51 🔗 qwerty0 I know IA is just a group of people trying to do the best with a whole bunch of efforts.
16:52 🔗 qwerty0 So, easy to believe it could fall through the cracks.
16:52 🔗 DFJustin #archiveteam.EFNet.20120807.log:[11:04:34] <SketchCow> [2:03:22 PM] Kenji Nagahashi (Internet Archive): for stat lovers: % of video migrated from Google Video to YouTube: 11%
16:53 🔗 qwerty0 Is that the final stat?
16:53 🔗 DFJustin probably, that's about when they were closing
16:53 🔗 SketchCow Sounds right
16:53 🔗 DFJustin dunno what % IA grabbed but I would assume a lot if he's making statements like that
16:54 🔗 SketchCow SO MUCH of Google Video was DVD MPEGs shoved into the system
16:54 🔗 qwerty0 We estimated we got 40%
16:54 🔗 qwerty0 http://archiveteam.org/index.php?title=Google_Video#A_Brief_History
16:55 🔗 DFJustin the googlevideo2011 collection weighs in at 72.19 TB
16:55 🔗 DFJustin much more than we got
16:55 🔗 qwerty0 huh..
16:56 🔗 SketchCow So I'd say, for the moment, relax.
16:56 🔗 SketchCow I'm not being a bummer or a burnout.
16:56 🔗 qwerty0 Yeah, that's awesome, if they grabbed a ton of video.
16:57 🔗 qwerty0 I don't even remember hearing they had a parallel effort.
16:59 🔗 ersi They're sneaky, you know.
17:03 🔗 qwerty0 In that case, maybe they determined our stuff was just duplicating what they had, and discarded it.
17:22 🔗 VonCloud_ anyone here ever heard of Boeing Calc
18:18 🔗 db48x hey guys
18:19 🔗 db48x does anyone know if it's possible to get a warc for a single site out of the Wayback Machine?
18:21 🔗 DFJustin not possible
18:23 🔗 db48x I was afraid of that
18:25 🔗 db48x I guess I can spider the archive...
18:27 🔗 db48x is it possible to download any of the warcs it serves from?
18:29 🔗 DFJustin only the archive team ones
18:39 🔗 godane so i just found something interesting
18:39 🔗 godane looks like even when robots.txt is blocking a url path images still can be viewed
18:40 🔗 godane example: https://web.archive.org/web/20000407082332im_/http://www.msnbc.com/modules/tvnews/today/today_left.jpg
18:46 🔗 db48x SketchCow: ouch, the audit didn't really get very far
18:54 🔗 db48x SketchCow: how were we checking that our WARCs have been integrated into the Wayback Machine?
19:12 🔗 db48x oops: https://archive.org/details/archiveteam_yahooblog
19:26 🔗 xmc yes?
19:33 🔗 db48x xmc: compare to https://archive.org/details/archiveteam_yahooblogs
19:33 🔗 xmc oh
19:39 🔗 SketchCow We just need to get shit right
19:39 🔗 SketchCow If you want to help
19:39 🔗 SketchCow Come to #auditteam
19:43 🔗 VonGuard hey SketchCow
19:43 🔗 VonGuard ever heard of Boeing Calc?
20:10 🔗 SketchCow No
20:48 🔗 brewskie1 Hey so, the current default project is in the all-claimed stage, and it seems like most others are too, what project is a good idea to work on?
20:50 🔗 aaaaaaaaa qwiki discovery
20:50 🔗 brewskie1 Alright, because that'll get left open for a day or so
20:51 🔗 brewskie1 Just wanted to make sure it'd get its hour's worth
