#archiveteam 2016-08-29,Mon

Time Nickname Message
00:01 🔗 dserodio has joined #archiveteam
00:16 🔗 Zialus has quit IRC (Read error: Operation timed out)
00:20 🔗 Zialus has joined #archiveteam
00:25 🔗 zout Frogging: joepie91: there's also a dump of all reddit posts from 2007 to mid-2015 on IA; I believe a majority of all imgur URIs would be contained in reddit posts.
00:25 🔗 dserodio has quit IRC (Read error: Operation timed out)
00:26 🔗 Zialus has quit IRC (Read error: Operation timed out)
00:29 🔗 zout they uploaded the scrape in a slightly unfortunate format; it takes a terrible amount of time to unpack 1.7B JSON objects.
00:29 🔗 dserodio has joined #archiveteam
00:30 🔗 Zialus has joined #archiveteam
00:35 🔗 joepie91 has quit IRC (Read error: Operation timed out)
00:44 🔗 joepie91 has joined #archiveteam
00:44 🔗 swebb sets mode: +o joepie91
00:46 🔗 dxrt- sets mode: +o dxrt
00:46 🔗 kevin has quit IRC (Connection closed)
00:46 🔗 VonGuard_ has quit IRC (Connection closed)
00:46 🔗 _desu___ has quit IRC (Connection closed)
00:47 🔗 kevin has joined #archiveteam
00:47 🔗 VonGuard_ has joined #archiveteam
00:50 🔗 AlexLehm has quit IRC (Ping timeout: 260 seconds)
01:02 🔗 RichardG_ is now known as RichardG
01:09 🔗 antomati_ has joined #archiveteam
01:09 🔗 swebb sets mode: +o antomati_
01:09 🔗 sep332_ has quit IRC (Read error: Operation timed out)
01:09 🔗 midas1 has quit IRC (Read error: Operation timed out)
01:10 🔗 antomatic has quit IRC (Read error: Operation timed out)
01:10 🔗 fie__ has joined #archiveteam
01:10 🔗 midas1 has joined #archiveteam
01:10 🔗 swebb sets mode: +o midas1
01:10 🔗 VonGuard_ has quit IRC (Ping timeout: 260 seconds)
01:10 🔗 kevin has quit IRC (Ping timeout: 260 seconds)
01:10 🔗 fie_ has quit IRC (Read error: Operation timed out)
01:11 🔗 SketchCo2 has joined #archiveteam
01:11 🔗 swebb sets mode: +o SketchCo2
01:11 🔗 SketchCow has quit IRC (Read error: Operation timed out)
01:11 🔗 yuitimoth has quit IRC (Read error: Operation timed out)
01:14 🔗 sep332_ has joined #archiveteam
01:14 🔗 signius has quit IRC (Read error: Operation timed out)
01:16 🔗 yuitimoth has joined #archiveteam
01:20 🔗 Jogie has quit IRC (Read error: Operation timed out)
01:22 🔗 kevin has joined #archiveteam
01:22 🔗 VonGuard_ has joined #archiveteam
01:23 🔗 signius has joined #archiveteam
01:27 🔗 GLaDOS has quit IRC (Quit: Oh crap, I died.)
01:48 🔗 SketchCo2 is now known as SketchCow
01:49 🔗 kristian_ has quit IRC (Leaving)
01:57 🔗 BlueMaxim has joined #archiveteam
02:06 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
02:23 🔗 dashcloud has joined #archiveteam
02:53 🔗 thefinn93 has joined #archiveteam
02:56 🔗 tomwsmf has quit IRC (Read error: Operation timed out)
03:43 🔗 dashcloud has quit IRC (Read error: Operation timed out)
04:11 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:15 🔗 ravetcofx has joined #archiveteam
04:16 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
04:17 🔗 Sk1d has joined #archiveteam
04:32 🔗 murfjr has joined #archiveteam
04:32 🔗 murfjr is now known as polysynde
04:32 🔗 polysynde is now known as maelstrom
04:33 🔗 maelstrom has quit IRC (Client Quit)
04:40 🔗 maelstrom has joined #archiveteam
04:49 🔗 robink has quit IRC (Ping timeout: 1208 seconds)
05:00 🔗 robink has joined #archiveteam
05:02 🔗 c_rippa has joined #archiveteam
05:04 🔗 c_rippa has left Textual IRC Client: www.textualapp.com
05:18 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
05:23 🔗 maelstrom has quit IRC (Ping timeout: 260 seconds)
05:25 🔗 Famicoman has joined #archiveteam
05:54 🔗 yipdw_ has joined #archiveteam
05:54 🔗 Frogging sets mode: +o yipdw_
05:56 🔗 yipdw has quit IRC (Read error: Operation timed out)
06:00 🔗 Jogie has joined #archiveteam
06:19 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
06:20 🔗 BlueMaxim has joined #archiveteam
06:21 🔗 brayden has quit IRC (Quit: Leaving)
07:00 🔗 joepie91 zout: how so? assuming it's NDJSON, it's super easy to stream it
07:03 🔗 zout joepie91: trivial to work with, just 1.7B of anything takes a long time.
07:04 🔗 joepie91 zout: sure, but that seems unrelated to the format it's in, more related to the fact that it's 8 years of content :P
07:04 🔗 joepie91 zout: you may find `jq` a useful tool though
07:05 🔗 zout and yes, newline delimited json.
07:05 🔗 zout I'll look into it next time I'm using that data set, previously I've just used a python one liner.
07:09 🔗 PurpleSym Compress the files with lz4 (less disk I/O), limit the number of items per file (jobs can run in parallel). Worked fine with 2 billion Flickr items.
07:33 🔗 Honno has joined #archiveteam
07:48 🔗 ravetcofx has quit IRC (Ping timeout: 370 seconds)
07:59 🔗 zout jq seems to work well.
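The streaming approach discussed above (newline-delimited JSON read one record at a time, so memory use stays flat regardless of dump size) can be sketched roughly as follows. This is only an illustration, not the one-liner zout used: the field names `url`, `selftext`, and `body` are assumptions about how the reddit dump is laid out, and the regular expression is a deliberately loose guess at what an imgur link looks like.

```python
import io
import json
import re

# Loose pattern for imgur links (an assumption, not an official URL grammar).
IMGUR_URL = re.compile(r"https?://(?:[a-z]+\.)?imgur\.com/[A-Za-z0-9/._-]+")

# Hypothetical field names; adjust to whatever the dump actually uses.
TEXT_FIELDS = ("url", "selftext", "body")


def iter_imgur_urls(lines):
    """Yield imgur URLs from an iterable of NDJSON lines, one record at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting a long run
        if not isinstance(record, dict):
            continue
        for field in TEXT_FIELDS:
            value = record.get(field)
            if isinstance(value, str):
                yield from IMGUR_URL.findall(value)


# Tiny in-memory stand-in for a dump file; a real run would pass an open file.
sample = io.StringIO(
    '{"url": "http://i.imgur.com/abc123.jpg", "selftext": ""}\n'
    '{"body": "see https://imgur.com/a/XyZ12 and http://example.com/"}\n'
    'not json\n'
)
urls = list(iter_imgur_urls(sample))
print(urls)
```

Because the function only needs an iterable of lines, the same code works on a file handle or on the output of a streaming decompressor, which fits the suggestion to split the data into many compressed files and process them in parallel.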
08:12 🔗 RichardG_ has joined #archiveteam
08:12 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
08:21 🔗 RichardG_ has quit IRC (Ping timeout: 370 seconds)
08:35 🔗 tuankiet has quit IRC (Remote host closed the connection)
08:36 🔗 brayden has joined #archiveteam
08:36 🔗 swebb sets mode: +o brayden
08:40 🔗 WinterFox has joined #archiveteam
08:49 🔗 AlexLehm has joined #archiveteam
09:40 🔗 BartoCH_ has joined #archiveteam
09:41 🔗 BartoCH has quit IRC (Read error: Connection reset by peer)
09:44 🔗 Spartan has joined #archiveteam
09:46 🔗 Spartan has quit IRC (Client Quit)
10:07 🔗 BartoCH_ is now known as BartoCH
10:11 🔗 RichardG has joined #archiveteam
10:20 🔗 RichardG_ has joined #archiveteam
10:21 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
10:32 🔗 tuankiet has joined #archiveteam
10:52 🔗 morbus_ has quit IRC (Read error: Operation timed out)
11:13 🔗 tuankiet has quit IRC (Ping timeout: 244 seconds)
11:26 🔗 tuankiet has joined #archiveteam
12:19 🔗 dashcloud has joined #archiveteam
12:59 🔗 Morbus has joined #archiveteam
13:05 🔗 PurpleSym The imgur sitemaps referred to 44 million unique URLs (pictures and galleries).
13:05 🔗 PurpleSym See https://6xq.net/paste/imgur-sitemap.txt.xz
13:10 🔗 zout PurpleSym: wow. that sounds like everything they have.
13:10 🔗 PurpleSym 44m? I’m not sure about that.
13:10 🔗 zout surprised they went to all the effort of actually having sitemaps.
13:11 🔗 zout imgur deletes images which aren't viewed for 180 days. 44M sounds plausible to me if all the cruft is being dropped.
13:12 🔗 PurpleSym Oh, I did not know that.
13:19 🔗 zout imgur has a maximum upload size of 20MB for static images and 200MB for GIFs, placing the upper bound on archiving all their content at 8.8PB.
13:21 🔗 zout oh, that's not true. their sitemap might contain albums not just images.
13:22 🔗 Frogging imgur does not delete old images anymore
13:24 🔗 Frogging at least, I thought they didn't. they changed their help site around and I can't find a source for that anymore
13:25 🔗 zout yeah, I'm not able to find a source for 180 days either.
13:25 🔗 Frogging but at some point they dropped the Pro accounts, and they stopped deleting old images
13:25 🔗 WinterFox has quit IRC (Read error: Operation timed out)
13:25 🔗 zout I looked into that when people made tools for turning the flickr 1TB of free image storage into a file system.
13:28 🔗 PurpleSym 30 million without galleries, < 5.6 PiB.
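The two upper bounds quoted above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: every item is taken to be at the 200 MB GIF limit, the 8.8PB figure is assumed to use decimal units (MB, PB), and the 5.6 PiB figure is assumed to use binary units (MiB, PiB), since those readings are the ones that match the numbers in the log.

```python
# Decimal and binary unit sizes in bytes.
MB, PB = 10**6, 10**15
MiB, PiB = 2**20, 2**50


def upper_bound_bytes(item_count, max_item_bytes):
    """Worst case: every item is as large as the upload limit allows."""
    return item_count * max_item_bytes


# 44 million sitemap URLs at 200 MB each (decimal units).
all_urls = upper_bound_bytes(44_000_000, 200 * MB)

# 30 million images, galleries excluded, at 200 MiB each (binary units).
images_only = upper_bound_bytes(30_000_000, 200 * MiB)

print(all_urls / PB)                # 8.8
print(round(images_only / PiB, 2))  # 5.59
```

Both numbers are ceilings rather than estimates; most images are far below the size limit, so the real total would be much smaller.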
13:37 🔗 tuankiet has quit IRC (Ping timeout: 244 seconds)
13:44 🔗 superkuh has quit IRC (Remote host closed the connection)
13:45 🔗 superkuh has joined #archiveteam
13:50 🔗 kristian_ has joined #archiveteam
13:52 🔗 tuankiet has joined #archiveteam
13:58 🔗 tomwsmf has joined #archiveteam
14:16 🔗 luckcolor https://archive.org/web/petabox.php To archive all of imgur we will need a couple of those. :P
14:16 🔗 Frogging kickstarter go
14:16 🔗 Frogging :p
14:17 🔗 luckcolor yeah totally :P
14:24 🔗 zout are the backblaze boxes higher density?
14:24 🔗 zout yes, much much higher.
14:24 🔗 zout https://www.backblaze.com/blog/open-source-data-storage-server/
14:26 🔗 zout 4RU, 240-480TB. down to 4.3c/GB.
14:30 🔗 metalcamp has joined #archiveteam
14:31 🔗 _desu___ has joined #archiveteam
14:49 🔗 vOYtEC has quit IRC (Ping timeout: 255 seconds)
14:53 🔗 BlueMaxim has quit IRC (Quit: Leaving)
14:53 🔗 atrocity well, i bet a lot of the image data on there isn't unique
14:54 🔗 atrocity people reup all the time, so i bet you could cut 20% off if nothing else
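One simple way to measure how much of a collection is duplicated is to hash each file's bytes and keep only the first copy of each digest. The sketch below is illustrative only: it detects byte-identical re-uploads and would miss copies that were re-encoded or resized, and the 20% figure above is a guess that this code does not confirm. The function name `unique_by_content` and the sample data are made up for the example.

```python
import hashlib


def unique_by_content(blobs):
    """Return (unique_blobs, bytes_saved), keeping the first blob per SHA-256 digest."""
    seen = set()
    unique = []
    bytes_saved = 0
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen:
            bytes_saved += len(blob)  # exact duplicate, no need to store it again
        else:
            seen.add(digest)
            unique.append(blob)
    return unique, bytes_saved


# Stand-in data; a real run would read image files from disk.
images = [b"cat picture", b"dog picture", b"cat picture"]
unique, saved = unique_by_content(images)
print(len(unique), saved)
```

Catching near-duplicates would need a perceptual hash instead of a cryptographic one, which is a different technique from what is shown here.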
14:56 🔗 Kksmkrn has joined #archiveteam
14:58 🔗 kristian_ has quit IRC (Read error: Operation timed out)
14:58 🔗 tomaspar1 has quit IRC (Ping timeout: 370 seconds)
15:00 🔗 joepie91 atrocity: https://xkcd.com/1683/ :)
15:08 🔗 superkuh has quit IRC (Remote host closed the connection)
15:08 🔗 tuankiet has quit IRC (Ping timeout: 244 seconds)
15:12 🔗 superkuh has joined #archiveteam
15:21 🔗 tuankiet has joined #archiveteam
15:31 🔗 tuankiet has quit IRC (Ping timeout: 244 seconds)
15:36 🔗 Kksmkrn has quit IRC (Remote host closed the connection)
15:38 🔗 vOYtEC has joined #archiveteam
15:42 🔗 vOYtEC has quit IRC (Read error: Connection reset by peer)
15:42 🔗 vOYtEC has joined #archiveteam
15:44 🔗 tuankiet has joined #archiveteam
16:02 🔗 vOYtEC has quit IRC (Ping timeout: 633 seconds)
16:23 🔗 JW_work has quit IRC (Read error: Connection reset by peer)
16:24 🔗 JW_work has joined #archiveteam
16:25 🔗 vOYtEC has joined #archiveteam
16:25 🔗 Kksmkrn has joined #archiveteam
16:26 🔗 dcmorton has quit IRC (Quit: ZNC - http://znc.in)
16:32 🔗 dcmorton has joined #archiveteam
16:32 🔗 swebb sets mode: +o dcmorton
16:35 🔗 AlexLehm has quit IRC (Ping timeout: 260 seconds)
17:14 🔗 schbirid has joined #archiveteam
17:20 🔗 vOYtEC has quit IRC (Ping timeout: 633 seconds)
17:59 🔗 VADemon has joined #archiveteam
18:14 🔗 antomati_ is now known as antomatic
18:16 🔗 vOYtEC has joined #archiveteam
18:16 🔗 vOYtEC has quit IRC (Client Quit)
18:18 🔗 vOYtEC has joined #archiveteam
18:20 🔗 vOYtEC has quit IRC (Read error: Connection reset by peer)
18:26 🔗 schbirid http://blog.dshr.org/2016/08/evanescent-web-archives.html
18:26 🔗 vOYtEC has joined #archiveteam
18:30 🔗 schbirid2 has joined #archiveteam
18:32 🔗 schbirid has quit IRC (Read error: Operation timed out)
18:33 🔗 SketchCow If people can help me with a slight thing
18:33 🔗 SketchCow I'd like to suck ALL data out of a flickr account and put it into another flickr account
18:33 🔗 SketchCow There MUST be some utility out there
18:38 🔗 SketchCow I found Bulkr, but that's for uploads
18:38 🔗 SketchCow I mean one-time download backups
18:39 🔗 luckcolor SketchCow: just getting the photo? or you want to backup the meta
18:39 🔗 SketchCow All
18:39 🔗 schbirid2 http://sunkencity.org/flickredit ?
18:39 🔗 SketchCow I want an album in a page to be in another account
18:47 🔗 bRick5772 has joined #archiveteam
18:51 🔗 Jogie has quit IRC (Quit: ZNC - http://znc.in)
19:11 🔗 robink has quit IRC (Ping timeout: 501 seconds)
19:20 🔗 username1 has joined #archiveteam
19:22 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
19:29 🔗 vOYtEC has quit IRC (Ping timeout: 633 seconds)
19:29 🔗 RichardG_ is now known as RichardG
19:30 🔗 username1 has quit IRC (Read error: Operation timed out)
19:38 🔗 username1 has joined #archiveteam
19:40 🔗 schbirid2 has joined #archiveteam
19:40 🔗 robink has joined #archiveteam
19:43 🔗 username1 has quit IRC (Read error: Operation timed out)
19:45 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
19:50 🔗 schbirid has joined #archiveteam
19:50 🔗 WubTheCap has joined #archiveteam
19:51 🔗 WubTheCap I'm archiving the only known Finnish tulpa forum, which has historically mass-deleted posts to get a clean start. It's gone mostly dormant today.
19:52 🔗 WubTheCap Should I do anything special like tag it with ArchiveTeam or get the media type right on archive.org?
19:56 🔗 schbirid2 has joined #archiveteam
19:59 🔗 luckcolor https://github.com/blog/2239-updates-to-our-privacy-statement Beep word "archivists" used
20:01 🔗 Sanqui WubTheCap: How are you archiving it? What's the size?
20:01 🔗 Sanqui (very roughly, like, number of posts?)
20:02 🔗 Sanqui luckcolor: what's the tl;dr?
20:03 🔗 WubTheCap Sanqui: 15M tulpa.palstani.com_phpBB3_20160829.warc.gz
20:03 🔗 username1 has joined #archiveteam
20:03 🔗 WubTheCap Downloaded: 1376 files, 52M in 2m 6s (422 KB/s)
20:03 🔗 Sanqui Is that all?
20:03 🔗 luckcolor Sanqui: "For example, the new Privacy Statement describes how people such as researchers or archivists can use your public information on GitHub.com,"
20:04 🔗 WubTheCap Sanqui: It's not a large forum.
20:04 🔗 Sanqui WubTheCap: That's small enough to just throw into ArchiveBot, which is what we generally use for sites like these
20:04 🔗 Sanqui check #archivebot. I'll put it in
20:04 🔗 WubTheCap Generally yes, but I took the effort to do it myself as friends on a different channel were away
20:05 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
20:05 🔗 WubTheCap There are no images on the forum to begin with, so it's small
20:05 🔗 schbirid2 has joined #archiveteam
20:05 🔗 WubTheCap #archiveteam-bs
20:05 🔗 schbirid has quit IRC (Read error: Operation timed out)
20:06 🔗 Sanqui WubTheCap: anyway, I do not know what the process is for getting your own dumps into the Wayback Machine, so I'm not the right person here. SketchCow?
20:07 🔗 WubTheCap Jason has moved one, but it has to be tagged ArchiveTeam
20:07 🔗 WubTheCap info@archive.org said they don't generally ingest WARCs to Wayback Machine
20:07 🔗 WubTheCap But that was months ago
20:11 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
20:12 🔗 username1 has quit IRC (Read error: Operation timed out)
20:12 🔗 schbirid has joined #archiveteam
20:16 🔗 schbirid2 has joined #archiveteam
20:19 🔗 xmc chain-of-custody is important
20:19 🔗 xmc don't want people submitting falsified warcs
20:22 🔗 schbirid has quit IRC (Read error: Operation timed out)
20:25 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
20:28 🔗 ErkDog Jason / SketchCow I'm not sure how you think two wrongs make some sort of right, but considering you literally just cursed me out, called me names, and banned me from a channel, I think I'll just devote my time and energy somewhere else altogether. I'm not interested in participating and contributing time and resources to a group that's led by someone who conducts themselves this way. I
20:28 🔗 ErkDog did something borderline at best, and you want to cuss me out, call me names, and ban me from channels. Grow up.
20:28 🔗 ErkDog has left
20:29 🔗 filippo__ Is there a Gawker dump running?
20:29 🔗 HCross filippo__, i think godane1 has that in hand
20:29 🔗 Sanqui Gawker is covered
20:30 🔗 SketchCow https://media.giphy.com/media/112W3UgYjCxzLq/giphy.gif
20:30 🔗 SketchCow Yes, the Gawker project is progressing nicely, and has expanded to many other related projects that could be affected by the buyout.
20:31 🔗 SketchCow It is not clear when/if Gawker will be yanked down, as it's currently read-only, but it'll likely be random, from a month to years from now.
20:31 🔗 SketchCow godane's been on it hardcore, as have others.
20:47 🔗 SketchCow REGARDING FOREIGN WARCS
20:48 🔗 SketchCow The archivebot infrastructure is the main alternative pipeline outside of Internet Archive's own crawlers that brings in WARCs that are in use by the Wayback machine.
20:48 🔗 SketchCow It does so because it's using the full faith and credit of an employee (me) to bring them in and verify that they're in a chain of custody that is clean.
20:48 🔗 WubTheCap Does that extend to third-party run ArchiveBots such as bot.yui.cat?
20:48 🔗 SketchCow No.
20:49 🔗 SketchCow People are uploading outside WARCs all the time, actually. Not just you.
20:49 🔗 SketchCow Now, I'm happy to aim you at mark@archive.org, who is Mark Graham, director of the Wayback Machine.
20:49 🔗 SketchCow You can ask him about what you want. He'll go to me, "is this one of yours", and I will go "no"
20:49 🔗 SketchCow And then he'll make a decision or he won't.
20:50 🔗 VADemon While you're orating here: What is the correct way to upload WARCs into the Archiveteam collection (I remember you weren't pleased the way we used to just upload WARCs on our own, as such I still have maxfile.ro outstanding)
20:50 🔗 SketchCow So, that's what I'd say you'd have to do, if you're concerned something you saved isn't in the wayback.
20:51 🔗 SketchCow The correct way is archivebot
20:51 🔗 SketchCow Otherwise, I'm having people I know upload items to me, and I have to go through and validate them
20:56 🔗 jdude104 has joined #archiveteam
21:09 🔗 AlexLehm has joined #archiveteam
21:20 🔗 SketchCow Channel #chaingang created
21:20 🔗 SketchCow go join it, make it something formalized
21:21 🔗 ndiddy has joined #archiveteam
21:21 🔗 * luckcolor will start a wiki page
21:21 🔗 luckcolor Title?
21:22 🔗 luckcolor cryptocurrencies blockchain backup initiative?
21:22 🔗 luckcolor lel
21:22 🔗 SketchCow ArchiveTeam Chain Gang
21:23 🔗 luckcolor k
21:23 🔗 SketchCow You will all be delighted to know, I'm sure, that I've successfully ripped 1,800 Apple II floppies and I THINK I'm down to the last 500 or so unless there's a hidden box in this room, which is not a 100% unlikely scenario
21:24 🔗 SketchCow In related news, fuck Apple II floppies, I'm sick to death of them
21:26 🔗 luckcolor SketchCow: can you make an empty repo on github for eventual source code?
21:26 🔗 SketchCow https://archive.org/details/softwarelibrary_apple_workbench?&sort=-downloads&page=2
21:26 🔗 SketchCow luckcolor: No, others do that better than I.
21:26 🔗 SketchCow Ask arkiver or HCross or yuitimoth
21:26 🔗 SketchCow yipdw I mean
21:26 🔗 luckcolor Ok
21:26 🔗 SketchCow (archive.org link was if you want to boot my most recently ripped floppies)
21:27 🔗 SketchCow Let's make sure we get it right before we start flooding
21:27 🔗 SketchCow But I definitely think a once a month grab is best
21:27 🔗 SketchCow We're just bringing up the rear, as blockchains slow down in favor
21:38 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
21:39 🔗 tomaspark has joined #archiveteam
21:42 🔗 WubTheCap Lots of the floppies are displaying "This is a data disk, not a startup disk".
21:42 🔗 WubTheCap And nothing happens after that on MAME
21:45 🔗 luckcolor SketchCow: Missed something? Any feedback? http://archiveteam.org/index.php?title=ArchiveTeam_Chain_Gang
21:54 🔗 VADemon has quit IRC (left4dead)
21:56 🔗 WubTheCap has quit IRC (Quit: Leaving)
21:57 🔗 bRick5772 has quit IRC (Quit: Leaving.)
22:07 🔗 Sanqui WubTheCap: http://tulpa.palstani.com/ was ArchiveBot'd.
22:26 🔗 Stiletto has joined #archiveteam
22:42 🔗 Honno has quit IRC (Read error: Operation timed out)
23:05 🔗 JonimusP is now known as Jonimus
23:13 🔗 SketchCow Looking
23:19 🔗 SketchCow Made a couple additions
23:20 🔗 tomaspark has quit IRC (Read error: Connection reset by peer)
23:35 🔗 Jordan_ has joined #archiveteam
23:47 🔗 dx has quit IRC (Ping timeout: 633 seconds)
