#archiveteam-bs 2019-03-22,Fri

Time Nickname Message
00:02 🔗 VerifiedJ has quit IRC (Read error: Connection reset by peer)
00:03 🔗 VerifiedJ has joined #archiveteam-bs
00:16 🔗 godane dashcloud: now i'm getting something from those twilight zone tapes
00:16 🔗 godane turns out there was on pbs i think
00:16 🔗 godane and spock is helping with the fund drive that was going on
00:19 🔗 dashcloud cool- I'm honestly surprised that every tape isn't 4-6 hours
00:35 🔗 godane dashcloud: most are
00:36 🔗 godane the commercial ones are not
00:36 🔗 godane i had like 9 of those tapes
00:52 🔗 i0npulse has joined #archiveteam-bs
00:54 🔗 Joseph__ has joined #archiveteam-bs
00:57 🔗 VerifiedJ has quit IRC (Ping timeout: 252 seconds)
01:22 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
01:35 🔗 bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
01:48 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
02:05 🔗 vitzli has joined #archiveteam-bs
02:10 🔗 Despatche has quit IRC (Read error: Operation timed out)
02:15 🔗 Despatche has joined #archiveteam-bs
02:25 🔗 vitzli has quit IRC (Quit: Leaving)
02:45 🔗 BlueMax has joined #archiveteam-bs
03:12 🔗 jut has quit IRC (Ping timeout: 252 seconds)
03:15 🔗 jut has joined #archiveteam-bs
03:17 🔗 ndiddy has quit IRC (Read error: Operation timed out)
03:21 🔗 Despatche has quit IRC (Read error: Operation timed out)
03:39 🔗 Despatche has joined #archiveteam-bs
03:51 🔗 BlueMax has quit IRC (Quit: Leaving)
04:28 🔗 qw3rty116 has joined #archiveteam-bs
04:33 🔗 qw3rty115 has quit IRC (Read error: Operation timed out)
04:45 🔗 odemgi_ has joined #archiveteam-bs
04:48 🔗 odemgi has quit IRC (Ping timeout: 252 seconds)
04:50 🔗 BlueMax has joined #archiveteam-bs
04:54 🔗 odemg has quit IRC (Ping timeout: 615 seconds)
05:00 🔗 odemg has joined #archiveteam-bs
05:17 🔗 kbtoo has quit IRC (Read error: Connection reset by peer)
05:20 🔗 kbtoo has joined #archiveteam-bs
05:21 🔗 godane has quit IRC (Read error: Operation timed out)
05:25 🔗 icedice has quit IRC (Quit: Leaving)
05:26 🔗 kbtoo_ has joined #archiveteam-bs
05:31 🔗 SimpBrain has joined #archiveteam-bs
05:31 🔗 S1mpbrain has quit IRC (Read error: Connection reset by peer)
05:32 🔗 kbtoo has quit IRC (Read error: Operation timed out)
05:35 🔗 godane has joined #archiveteam-bs
06:05 🔗 dhyan_nat has joined #archiveteam-bs
06:11 🔗 godane SketchCow: i'm uploading a big tape to FOS so don't upload any of my captures for a few days
06:35 🔗 SimpBrain has quit IRC (Remote host closed the connection)
06:35 🔗 SimpBrain has joined #archiveteam-bs
06:45 🔗 wp494 has quit IRC (Ping timeout: 615 seconds)
06:45 🔗 wp494 has joined #archiveteam-bs
06:46 🔗 Exairnous has quit IRC (Ping timeout: 615 seconds)
06:53 🔗 S1mpbrain has joined #archiveteam-bs
06:54 🔗 SimpBrain has quit IRC (Read error: Connection reset by peer)
07:04 🔗 fuzzy8021 has quit IRC (Read error: Operation timed out)
07:05 🔗 fuzzy8021 has joined #archiveteam-bs
07:28 🔗 SimpBrain has joined #archiveteam-bs
07:29 🔗 S1mpbrain has quit IRC (Read error: Connection reset by peer)
07:38 🔗 SimpBrain has quit IRC (Read error: Connection reset by peer)
07:39 🔗 SimpBrain has joined #archiveteam-bs
08:15 🔗 schbirid has quit IRC (Read error: Connection reset by peer)
08:26 🔗 VADemon has joined #archiveteam-bs
08:33 🔗 VADemon arkiver: did we have a project for a xenforo forum before? 15kk posts, 40 per page: https://www.hardocp.com/article/2019/03/19/goodbye_hardocp_hello_intel/
08:34 🔗 VADemon It's not shutting down, but changing hands in a way; would be nice to archive just in case anything major changes(?)
10:07 🔗 JAA VADemon: I'm not sure whether we had one before, but JamiiForums uses XenForo and the project for it has been pretty much ready for a while. Might be best to wait with new (non-urgent) projects until the Google+ madness is over though.
10:08 🔗 JAA I've been meaning to start JamiiForums a long time ago really but kind of forgot about it.
10:25 🔗 Despatche has quit IRC (Read error: Operation timed out)
10:31 🔗 bitBaron has joined #archiveteam-bs
11:00 🔗 SimpBrain has quit IRC (Remote host closed the connection)
11:01 🔗 SimpBrain has joined #archiveteam-bs
11:16 🔗 bitBaron_ has joined #archiveteam-bs
11:24 🔗 bitBaron has quit IRC (Ping timeout: 615 seconds)
11:25 🔗 bitBaron_ has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
11:54 🔗 SimpBrain has quit IRC (Remote host closed the connection)
11:54 🔗 SimpBrain has joined #archiveteam-bs
12:05 🔗 jut has quit IRC (Ping timeout: 252 seconds)
12:07 🔗 jut has joined #archiveteam-bs
12:42 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
13:13 🔗 ndiddy has joined #archiveteam-bs
13:13 🔗 ndiddy has quit IRC (Client Quit)
13:23 🔗 SimpBrain has quit IRC (Read error: Operation timed out)
13:25 🔗 SimpBrain has joined #archiveteam-bs
13:33 🔗 omarroth has joined #archiveteam-bs
13:34 🔗 omarroth has quit IRC (Client Quit)
13:41 🔗 bitBaron has joined #archiveteam-bs
13:51 🔗 bitBaron has quit IRC (Quit: Bye.)
13:53 🔗 overflowe has joined #archiveteam-bs
13:55 🔗 overflowe Just joined the googleplus project, but I am getting a lot of rsync errors (max connections reached). From time to time one job gets through, but most of the time my jobs are waiting
14:00 🔗 marked overflowe: we are in #googleminus
14:00 🔗 overflowe marked: thx
14:04 🔗 halt has joined #archiveteam-bs
14:26 🔗 SimpBrain has quit IRC (Remote host closed the connection)
14:27 🔗 SimpBrain has joined #archiveteam-bs
15:42 🔗 schbirid has joined #archiveteam-bs
15:42 🔗 wp494 has quit IRC (Ping timeout: 364 seconds)
15:43 🔗 wp494 has joined #archiveteam-bs
16:11 🔗 JAA VoynichCr: https://archiveteam.org/index.php?title=ArchiveBot/Futurology The entry for ai.google also lists two jobs for ai.googleblog.com, i.e. a different domain. I can't imagine that's intentional, right?
16:26 🔗 Odd0002_ has joined #archiveteam-bs
16:28 🔗 marked I'd like to start working on a way for the crawl to continue when the crawl speed is faster than megawarcing or IA acceptance
16:32 🔗 marked I think viewing it as inadequate rsync/megawarc capacity is poor optics. it could just be a good sign that we had a surge of volunteers. When IA is unreachable it's not our fault, and we're not IA anyway, so we shouldn't need to depend on them.
16:34 🔗 Odd0002 has quit IRC (Ping timeout: 615 seconds)
16:34 🔗 Odd0002_ is now known as Odd0002
16:36 🔗 marked One way to do this is for workers to rsync not to megawarc factories, which are optimized for upstream bandwidth and disk io, but to machines optimized for quick, efficient and cheap storage.
16:36 🔗 marked then when the crawl deadline is done, there would be more time to get the raw data out to megawarc to IA
16:40 🔗 marked this doesn't mean the current pipeline needs to change, megawarc and uploading to IA mean we don't bear storage concerns, but when they reach capacity there are finished job files looking for safe storage
16:40 🔗 marked it's more of an overflow/spillover keep data flowing mechanism
16:41 🔗 JAA That's exactly right, and we've done it before (e.g. Flickr). The problem is, our crawls are getting larger and larger, so this gets expensive quickly.
16:43 🔗 marked one option is the home NAS, so let's sort through the problems with those
16:43 🔗 marked another option might be leaving more data on worker boxes
16:44 🔗 marked the problem I heard most is that people might disappear with data
16:44 🔗 d5f4a3622 ^ yeah that's my concern
16:44 🔗 d5f4a3622 people who don't reliably run the warrior
16:44 🔗 d5f4a3622 would lose data
16:44 🔗 d5f4a3622 it would have to be an optional setting
16:45 🔗 marked my feeling is the warrior is ephemeral, but the scripts users have more persistent storage
16:45 🔗 marked and then my feeling on the NAS, is keep copies across 2 users
16:46 🔗 JAA Yep, and also disk usage is probably a concern for many of the more casual users.
16:46 🔗 marked and for both of them, do we have feelings of how long these people have been around
16:46 🔗 JAA Something like IA.BAK would work.
16:46 🔗 JAA Using git-annex to keep track of how many copies there are and where etc.
16:46 🔗 JAA But no idea whether that scales to hundreds of TB.
16:50 🔗 marked which part has the scalability question, IA.BAK or git-annex? I guess I don't know how IA.BAK works
16:50 🔗 JAA IA.BAK uses git-annex, so... both.
16:51 🔗 JAA It already had 100 TB of data last I checked, so it might work.
16:52 🔗 icedice has joined #archiveteam-bs
16:54 🔗 marked would the ia.bak method take rsyncs from warriors?
16:55 🔗 JAA Not sure. I never looked into how IA.BAK works exactly internally. In any case, having a layer of rsync targets wouldn't be an issue as long as they can offload the data to some other storage quickly enough.
16:56 🔗 marked Ok, sounds good. focusing on what we do know, the concept of recording a job as done in the tracker might be more nuanced
16:57 🔗 marked if it is still rsynced, the change would just be where it was rsynced
16:58 🔗 marked if it's kept on scripts users' PCs, then there's an in-between state of done with crawl, but not rsynced anywhere.
16:59 🔗 marked relatedly, there would/could be a separate uploader process that finds finished warcs and uploads. this would also be good for pipelines that were restarted. that is, the job's state needs to be on disk instead of in the process memory
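
A minimal sketch of the separate uploader idea described above, assuming finished files land in a spool directory as `*.warc.gz` and that a sidecar marker file records a completed upload. The `.uploaded` suffix, the function names, and the `upload` callable are all hypothetical and not part of any existing Archive Team pipeline. Because the state lives on disk, a restarted process picks up where it left off.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

MARKER_SUFFIX = ".uploaded"  # hypothetical marker-file convention


def marker_for(warc: Path) -> Path:
    """Path of the sidecar file that records a successful upload."""
    return warc.with_name(warc.name + MARKER_SUFFIX)


def pending_warcs(spool: Path) -> List[Path]:
    """Finished WARCs in the spool directory that have no marker yet."""
    return sorted(p for p in spool.glob("*.warc.gz") if not marker_for(p).exists())


def upload_pending(spool: Path, upload: Callable[[Path], None]) -> List[Path]:
    """Upload every pending WARC; write the marker only after success.

    If `upload` raises, no marker is written, so the file is retried
    the next time the process runs.
    """
    uploaded = []
    for warc in pending_warcs(spool):
        upload(warc)
        marker_for(warc).touch()
        uploaded.append(warc)
    return uploaded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        spool = Path(tmp)
        (spool / "job-1.warc.gz").write_bytes(b"")
        (spool / "job-2.warc.gz").write_bytes(b"")
        marker_for(spool / "job-2.warc.gz").touch()  # already uploaded earlier
        sent = upload_pending(spool, upload=lambda p: None)  # no-op stand-in
        print([p.name for p in sent])
```

In practice `upload` would wrap whatever transfer is in use (for example an rsync invocation); it is left as a parameter here so the sketch makes no claim about the real transfer step.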
17:04 🔗 marked what I think would work is when there are no available megawarcing rsync servers, the worker instead gets assigned a rsync target of a home NAS. hopefully there's enough of these with the AT community already to make a dent.
17:04 🔗 marked and then if there aren't enough among people we know, then datahoarders, but then have two replicate each other with bittorrent
17:05 🔗 marked since the data is mostly incoming the mirror bandwidth wouldn't be much impact
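
The spillover assignment proposed above could be sketched roughly as follows. This is an illustration only: the `Target` record, its fields, and the "most free slots wins" rule are assumptions made for the example and do not describe how the actual tracker assigns rsync targets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Target:
    name: str        # hypothetical identifier for an rsync target
    kind: str        # "megawarc" (preferred) or "nas" (spillover)
    free_slots: int  # connections the target can still accept


def pick_target(targets: List[Target]) -> Optional[Target]:
    """Prefer megawarc servers; fall back to NAS targets only when
    no megawarc server has a free slot. Returns None if all are full."""
    for kind in ("megawarc", "nas"):
        candidates = [t for t in targets if t.kind == kind and t.free_slots > 0]
        if candidates:
            # Assumed tie-break: the target with the most free slots.
            return max(candidates, key=lambda t: t.free_slots)
    return None


if __name__ == "__main__":
    pool = [
        Target("factory-1", "megawarc", 0),
        Target("home-nas-a", "nas", 2),
        Target("home-nas-b", "nas", 5),
    ]
    chosen = pick_target(pool)
    print(chosen.name if chosen else "no target available, retry later")
```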
17:06 🔗 marked one advantage to a storage view, is we're finding the warcs are highly compressible, especially when not initially gzip compressed
17:07 🔗 JAA Yes, but if you compress them differently, you lose random access, which is required for the Wayback Machine.
17:07 🔗 marked yes, there's lots of complications with the WBM, but this doesn't apply to the caching store before megawarcing
17:08 🔗 marked ~20% of size when recompressing a .warc.gz and then 2% of original size when recompressing a .warc
17:10 🔗 marked when comparing the disks of megawarc factories versus cold storage the compression could easily be applied to the cold storage but for the megawarc factories there are more complications
17:10 🔗 marked and then the WBM has even more different concerns for archival stability
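
The trade-off discussed above (better compression versus random access) can be illustrated with a small sketch. It assumes the common `.warc.gz` convention of one gzip member per record, which lets a reader seek to a known byte offset and decompress just that record; compressing everything as a single stream is usually smaller but loses that property. The records below are synthetic stand-ins, not real WARC data, and the helper names are made up for the example.

```python
import gzip
import zlib
from typing import List, Tuple


def write_members(records: List[bytes]) -> Tuple[bytes, List[int]]:
    """Concatenate one gzip member per record and note each start offset."""
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))
        blob += gzip.compress(record)
    return bytes(blob), offsets


def read_member(blob: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at `offset`.

    MAX_WBITS | 16 tells zlib to expect a gzip header and trailer;
    decompression stops at the end of that single member.
    """
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    return decoder.decompress(blob[offset:])


if __name__ == "__main__":
    # Synthetic, highly repetitive records standing in for crawl data.
    records = [b"header: value\r\n" * 50 + str(i).encode() for i in range(10)]
    per_record, offsets = write_members(records)
    single_stream = gzip.compress(b"".join(records))
    print("per-record members:", len(per_record), "bytes")
    print("single stream:     ", len(single_stream), "bytes")
    print("record 7 recovered:", read_member(per_record, offsets[7]) == records[7])
```

The single stream can exploit redundancy across records, which is why recompression helps in a cold-storage cache, while the per-record layout is what keeps offset-based lookups possible.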
17:11 🔗 bithippo has joined #archiveteam-bs
17:12 🔗 bithippo has quit IRC (Client Quit)
17:15 🔗 marked the only other ideas I got are something like filecoin, which I know nothing about
17:17 🔗 marked or finding a cloud storage provider that has cheap ingest, and expensive out bandwidth. then having a fundraiser to get the data out. I think after a service shuts down, there's more coverage and outrage and would be a good moment of sympathy, but usually there is nothing to be done after the shut down.
17:20 🔗 ndiddy has joined #archiveteam-bs
17:28 🔗 marked Gathering data of available NAS : https://docs.google.com/spreadsheets/d/1ZTR8SdHGmyBQO1U3DP9H2UInjdIszMD8KfU75QvQVHw/edit?usp=sharing
17:28 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
17:33 🔗 icedice has quit IRC (Quit: Leaving)
17:35 🔗 adarsh has joined #archiveteam-bs
17:35 🔗 overflowe Wouldn't it already help quite a lot if there were an option to just create and upload the megawarc locally in the warrior?
17:40 🔗 Fusl marked: whats this about?
17:40 🔗 Fusl tldr backlog
17:41 🔗 marked proposing we rsync to NAS, even home NAS, to keep crawl threads at peak
17:41 🔗 dhyan_nat has joined #archiveteam-bs
17:41 🔗 marked then megawarc and IA ingest as time permits
17:43 🔗 marked it's like post-production, if this were a movie
17:53 🔗 marked this might be a more efficient use of homes with gigabit links since they have a hard time keeping their pipe full anyway
18:40 🔗 Despatche has joined #archiveteam-bs
18:40 🔗 Despatche has quit IRC (Read error: Connection reset by peer)
18:41 🔗 d5f4a3622 has quit IRC (Quit: WeeChat 2.4)
18:43 🔗 Exairnous has joined #archiveteam-bs
19:15 🔗 adarsh has quit IRC (Quit: adarsh)
19:16 🔗 adarsh has joined #archiveteam-bs
19:22 🔗 Joseph__ has quit IRC (Read error: Connection reset by peer)
19:22 🔗 Joseph__ has joined #archiveteam-bs
19:28 🔗 killsushi has joined #archiveteam-bs
19:32 🔗 Despatche has joined #archiveteam-bs
19:42 🔗 VoynichCr JAA: that's another ArchiveBot Viewer feature https://archive.fart.website/archivebot/viewer/?q=https://ai.google/
19:43 🔗 JAA Ah :-/
19:43 🔗 JAA Yeah right, the query searches the domain name, so that makes sense.
19:43 🔗 JAA Kind of
19:48 🔗 Gallifrey has joined #archiveteam-bs
20:02 🔗 Gallifrey has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
20:07 🔗 Joseph__ is now known as VerifiedJ
20:17 🔗 marked Tracker returned status code 500. The tracker has probably malfunctioned. Retrying after 90 seconds...
20:24 🔗 chfoo i'm looking at the tracker now
20:39 🔗 icedice has joined #archiveteam-bs
20:39 🔗 icedice has quit IRC (Connection closed)
20:51 🔗 gogondwan has joined #archiveteam-bs
20:57 🔗 dhyan_nat has quit IRC (Ping timeout: 268 seconds)
20:57 🔗 Dj-Wawa has joined #archiveteam-bs
21:00 🔗 adarsh has quit IRC (Ping timeout: 252 seconds)
21:09 🔗 PurpleSym sets mode: +o chfoo
21:14 🔗 wyatt8740 has joined #archiveteam-bs
21:47 🔗 JAA So Mixtape... The files are on various subdomains, including track##.mixtape.moe and my.mixtape.moe. Files have a random [a-z]{6} name it seems, but you also need to know the file extension. 26^6 might be feasible, but with the different servers and the file extensions, brute force is pretty much impossible.
21:47 🔗 JAA There's also the pastebin-like service at spit.mixtape.moe, which uses [0-9a-f]{8} keys -- also too many to bruteforce in this amount of time.
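
To put numbers on the two key spaces mentioned above, here is a small back-of-the-envelope calculation. The key formats come from the messages; the request rate is purely an assumed figure for illustration, and the file-name estimate would still need to be multiplied by the number of subdomains and candidate file extensions, which are not known here.

```python
SECONDS_PER_DAY = 86_400


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct keys of a fixed length over an alphabet."""
    return alphabet_size ** length


def days_to_enumerate(keys: int, requests_per_second: float) -> float:
    """Days needed to try every key once at a constant request rate."""
    return keys / requests_per_second / SECONDS_PER_DAY


file_names = keyspace(26, 6)  # [a-z]{6} file names
paste_keys = keyspace(16, 8)  # [0-9a-f]{8} pastebin keys

ASSUMED_RATE = 100.0  # requests per second; an assumption, not a measured value

if __name__ == "__main__":
    for label, keys in (("file names", file_names), ("paste keys", paste_keys)):
        days = days_to_enumerate(keys, ASSUMED_RATE)
        print(f"{label}: {keys:,} keys, about {days:,.0f} days at {ASSUMED_RATE:g} req/s")
```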
21:52 🔗 bithippo has joined #archiveteam-bs
22:11 🔗 Exairnous has quit IRC (Ping timeout: 265 seconds)
22:26 🔗 BlueMax has joined #archiveteam-bs
22:51 🔗 bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
23:13 🔗 JAA FileTrip archival is complete. Might still take a while for the final WARCs to make it into the WBM, but everything should be there now. If anyone feels like browsing around to check whether anything was missed, be my guest.
23:14 🔗 JAA SketchCow: ^ In case you want to tell that to people asking you about it.
23:17 🔗 VADemon Mixtape: Last time there was a talk about file sharing of random files, SC turned down the ex.ua archival attempt as being useless/wasteful
23:19 🔗 simon816 >We will retain data, and allow users to request their files, as long as we can afford to, until the end of 2019
23:19 🔗 simon816 More time to scrape mixtape
23:19 🔗 simon816 hmm actually maybe that means you have to specifically ask them
23:21 🔗 JAA That's how I understand it at least. Won't be publicly available anymore, but if you need your files back, you can still get them from them somehow.
23:27 🔗 Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
23:30 🔗 icedice has joined #archiveteam-bs
