[00:07] *** girst has quit IRC (Remote host closed the connection)
[01:07] *** bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
[01:41] *** synm0nger has quit IRC (Quit: Wait, what?)
[01:57] *** SynMonger has joined #archiveteam-ot
[02:11] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[02:12] *** cerca has joined #archiveteam-ot
[03:19] *** icedice has joined #archiveteam-ot
[03:21] *** jamiew has joined #archiveteam-ot
[03:30] *** SoraUta has joined #archiveteam-ot
[03:37] *** jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
[03:42] *** cerca has quit IRC (Remote host closed the connection)
[04:01] *** icedice has quit IRC (Read error: Operation timed out)
[04:51] *** qw3rty has joined #archiveteam-ot
[05:00] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds)
[05:14] *** markedL has joined #archiveteam-ot
[05:15] do people burn in new hard drives before putting data on them?
[05:17] I don't, now I'm wondering if I should
[05:27] I run badblocks
[05:28] It writes to every byte on the device and then reads the whole device
[05:29] *** godane has quit IRC (Ping timeout: 745 seconds)
[05:31] ("writing to every byte" is an oversimplification because of sectors and all that, but you get the idea)
[05:32] I don't know if there's any point to doing so, but I don't see any reason not to. If the drive fails after a few passes of that, you've saved yourself some trouble later on.
[05:33] it'll also find any bad sectors; that's actually what badblocks is designed to do
[05:34] I'm sure they do that at the factory already, but still, why not do it again? Consider it a pre-formatting scrub
[05:40] *** godane has joined #archiveteam-ot
[05:51] *** tuluu has quit IRC (Remote host closed the connection)
[05:52] *** tuluu has joined #archiveteam-ot
[06:26] *** Flashfire has quit IRC (Read error: Connection reset by peer)
[06:27] *** Flashfire has joined #archiveteam-ot
[07:08] *** deevious has joined #archiveteam-ot
[07:12] *** dhyan_nat has joined #archiveteam-ot
[08:11] *** deevious has quit IRC (Quit: deevious)
[08:15] *** deevious has joined #archiveteam-ot
[08:34] *** ShellyRol has quit IRC (Read error: Connection reset by peer)
[08:37] *** bluefoo has quit IRC (Read error: Operation timed out)
[08:51] *** ShellyRol has joined #archiveteam-ot
[09:04] *** SoraUta has quit IRC (Ping timeout: 610 seconds)
[09:39] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[09:54] *** Laverne has quit IRC (Ping timeout: 258 seconds)
[09:54] *** mls has quit IRC (Ping timeout: 258 seconds)
[09:54] *** VoynichCr has quit IRC (Ping timeout: 258 seconds)
[09:54] *** sHATNER has quit IRC (Ping timeout: 258 seconds)
[09:55] *** eythian has quit IRC (Ping timeout: 258 seconds)
[09:55] *** luckcolor has quit IRC (Ping timeout: 258 seconds)
[09:55] *** luckcolor has joined #archiveteam-ot
[09:57] *** eythian has joined #archiveteam-ot
[09:59] *** mls has joined #archiveteam-ot
[10:00] *** sHATNER has joined #archiveteam-ot
[10:26] *** deevious has quit IRC (Quit: deevious)
[10:31] *** BlueMaxim has joined #archiveteam-ot
[10:43] *** BlueMax has quit IRC (Ping timeout: 745 seconds)
[10:49] *** deevious has joined #archiveteam-ot
[10:59] *** VoynichCr has joined #archiveteam-ot
[11:00] *** Laverne has joined #archiveteam-ot
[12:23] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[12:47] *** bluefoo has joined #archiveteam-ot
[13:20] *** jamiew has joined #archiveteam-ot
[13:21] *** SoraUta has joined #archiveteam-ot
[13:24] *** jamiew has quit IRC (Client Quit)
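For reference, a minimal sketch of the badblocks burn-in described above (05:27-05:34). /dev/sdX is a placeholder for the new drive, and the -w write-mode test overwrites everything on it, so it is only for a disk that holds no data yet:

    # destructive four-pattern pass: write every block, then read it back
    badblocks -wsv -b 4096 /dev/sdX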
[13:43] Yeah, I also do essentially that (with SMART long test if available), and I also use fio to stress-test the mechanics for a few hours.
[13:55] *** tuluu has quit IRC (Read error: Connection refused)
[13:55] *** tuluu has joined #archiveteam-ot
[14:45] *** bluefoo has quit IRC (Ping timeout: 744 seconds)
[14:50] *** bluefoo has joined #archiveteam-ot
[15:07] *** girst has joined #archiveteam-ot
[15:15] *** deevious has quit IRC (Quit: deevious)
[15:38] *** mc2 has quit IRC (Read error: Operation timed out)
[15:58] *** dhyan_nat has joined #archiveteam-ot
[16:41] *** jamiew has joined #archiveteam-ot
[17:01] *** jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
[17:34] *** jamiew has joined #archiveteam-ot
[17:56] markedL: I do a write-read test using my http://github.com/ludios/drive-checker which has never actually caught a problem on a new drive for me (they tend to be tested before they get shipped out)
[17:56] but it does catch memory problems on the computer I use it on :-)
[17:58] I guess I'm equally concerned about damage from shipping or improper storage. they likely worked when it left the factory.
[18:00] that will show up as 'very DOA'
[18:01] noises, fail to spin up
[18:09] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[18:13] *** qw3rty has joined #archiveteam-ot
[18:20] *** VerifiedJ has joined #archiveteam-ot
[18:29] *** X-Scale` has joined #archiveteam-ot
[18:29] *** LowLevelM has quit IRC (Read error: Operation timed out)
[18:30] *** LowLevelM has joined #archiveteam-ot
[18:34] *** X-Scale has quit IRC (Ping timeout: 610 seconds)
[18:34] *** X-Scale` is now known as X-Scale
[20:19] *** jamiew_ has joined #archiveteam-ot
[20:20] *** MilkGames has joined #archiveteam-ot
[20:20] *** jamiew_ has quit IRC (Client Quit)
[20:21] Hey there, how would I go about getting a web archive moved to the ArchiveTeam collection?
[20:33] Sorry, just realised this is the wrong channel to ask that in. I'll ask in another.
[20:33] *** MilkGames has left
[20:43] *** DogsRNice has joined #archiveteam-ot
[21:06] *** jamiew_ has joined #archiveteam-ot
[21:10] *** oxguy3 has joined #archiveteam-ot
[21:16] uh hey, so i've got 204GB of gzipped WARC files from an FTP site... is there anything i should know before i attempt to upload this to archive.org with the ia command line tool? i've never uploaded anything remotely this big
[21:25] you might want to split it into a handful of distinct ia items, depending. up to you though.
[21:25] how big is each warc?
[21:27] i set wget to target 1GB file size, but most are a bit bigger, and some are huge -- got a 12gb and an 11gb
[21:27] oh that's reasonable
[21:28] :)
[21:29] yeah it's not too bad, and i figure it'd be best to keep them together so it's a singular package (i archived the entire FTP server lol)
[21:29] i guess i'll get to uploading... this is gonna take a while lol
[21:31] split to about 50gb/item ideally, I think is the common recommendation
[21:33] i thought it was 50 gb/file and 1000 files/item? https://help.archive.org/hc/en-us/articles/360016475032-Uploading-Tips
[21:37] oh wait, looks like items aren't supposed to be bigger than 100gb, shoot https://archive.org/services/docs/api/items.html#item-limitations
[21:37] alright if i'm uploading this into multiple items, does it matter which item i put the meta warc file in?
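Picking up the burn-in thread from 13:43: a sketch of a SMART extended self-test plus an fio stress run. The device name and job parameters are illustrative assumptions, and the random-write job is destructive, so it belongs before the drive holds data:

    # start the drive's extended self-test, then read the results back later
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX
    # ~4 hours of mixed random I/O to exercise the mechanics (destructive)
    fio --name=burnin --filename=/dev/sdX --direct=1 --rw=randrw --bs=4k \
        --ioengine=libaio --iodepth=16 --runtime=14400 --time_based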
[21:40] yeah, don't make a 50tb item :)
[21:40] I'm pretty sure it wouldn't be possible anyway, as I think all files for an item live together on a disk
[21:54] i'm assuming i should set mediatype:web since this is WARC files, even though it's ftp rather than http, right?
[22:03] *** jamiew__ has joined #archiveteam-ot
[22:08] Also, unless you're an approved archive team project, it won't go into wayback
[22:09] yeah i figured, but will it still be browseable on archive.org?
[22:09] The item? Sure
[22:09] *** jamiew_ has quit IRC (Read error: Operation timed out)
[22:09] like as in, would you be able to browse the full contents of the server in some easy way, instead of having to dig through warc.gz files?
[22:09] Nope
[22:10] ah, hmm. would it be better if i just uploaded the actual raw files instead of the WARCs? (i didn't include --delete-after in my wget command so i have them raw as well)
[22:10] I don't know how more or not more better it is.
[22:11] What FTP site is it
[22:11] Dare you to say Intel
[22:11] vikings.flashspot.tv -- the Minnesota Vikings used it to share video and photos with the press for many years
[22:11] Is it still up
[22:12] yes, but hasn't been updated since 2017
[22:12] Just pass this to archivebot to do it
[22:12] Then it's all handled and it goes in wayback
[22:13] it requires a login, i wasn't sure if that would be an issue
[22:14] Upload the raw files.
[22:14] How many files is it.
[22:14] okay cool
[22:14] uhhh, a lot... let me see
[22:14] Either way, it's going to be a nightmare
[22:14] 33462
[22:15] Yeah, raw files, have a ball
[22:15] Easiest if you upload it as a large set of .ZIP files
[22:15] ah yeah, that sounds better than making 34+ items lol
[22:16] Let's put it this way, it's going to be awful no matter what.
[22:16] A few largish .zip files will do
[22:16] a .zip per top-level folder maybe
[22:17] Yeah, not too many, and not too large
[22:17] It's an art
[22:17] there are many right answers and many wrong answers
[22:17] problem with that is there are two top-level folders which surpass 50GB lol
[22:17] i'll figure something out
[22:18] the main constraint on archive.org items is that each item has to live on a hard drive with all of its files together
[22:18] so if you exceed commercially shipping hard drives it won't be able to fit anywhere
[22:18] and if you get close to it then it makes their end of things ... more complicated
[22:19] yeah, i think i'm gonna make one item for a 72GB folder, one item for a 65GB folder, and one item for everything else (which totals 73GB)
[22:19] oh those can live in one item together i'd say
[22:20] three largish zip files in a single item is a solid choice here
[22:20] hmm, i thought the rule was no files over 50gb?
[22:21] i was planning on splitting the two mega folders into zips for each sub item
[22:21] imo better to keep them together so they don't get lost
[22:22] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[22:22] i'll keep them in the same item, but divide them into multiple zips
[22:24] *** jamiew_ has joined #archiveteam-ot
[22:27] *** jamiew_ has quit IRC (Client Quit)
[22:28] *** jamiew__ has quit IRC (Read error: Operation timed out)
[22:35] alright they're slowly getting zipped -- my home server has a wimpy CPU so it's gonna be a while. aye carumba, what a messy project
[22:35] the ftp site has same credentials as web site?
[22:36] yep!
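A rough sketch of the zip-per-top-level-folder plan discussed above, uploaded with the ia tool mentioned at 21:16. The item identifier and metadata values are invented for illustration, not a confirmed choice:

    # one zip per top-level folder; zip keeps each file's modification time
    for d in */ ; do zip -r "${d%/}.zip" "$d" ; done
    # everything into a single item, per the advice at 22:19-22:21
    ia upload vikings-flashspot-ftp-mirror *.zip \
        --metadata="mediatype:data" \
        --metadata="title:vikings.flashspot.tv FTP mirror"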
[22:36] the website seems to just be an FTP client
[22:45] it pretends to accept anonymous but dunno what email addresses it will approve
[22:48] i'll dm you the login (i'm not too concerned about sharing it now that i have a complete mirror)
[23:05] what tools does ftp into warc?
[23:08] wpull can do that.
[23:08] I don't think there's any standard on how to save FTP to WARC though.
[23:09] i did it with wget
[23:10] wget --user="vPR-Read" --password="removed" ftp://vikings.flashspot.tv/ --mirror --warc-file=vikings --warc-max-size=1G --warc-header="ftp-user: vPR-Read"
[23:12] since the credentials are on the open web, I feel ethics are different but I'm not going to use them since it's done already
[23:12] maybe someone else would prefer a web in warc copy
[23:13] i have the full warc mirror fyi
[23:13] Is there really much point to capturing FTP to WARC, though?
[23:14] The files are all independent of each other and the headers are irrelevant
[23:14] ¯\_(ツ)_/¯ that's how the archiveteam ftp project does it so i just copied them
[23:15] I've been asking myself that as well. What it's nice for is keeping the retrieval commands tightly coupled to the data.
[23:15] Just make sure to preserve the timestamps. zip or tar will do that for you
[23:16] you get hashes
[23:16] (assuming you downloaded with something that preserves timestamps)
[23:16] FTP has hashes?
[23:16] WARC does.
[23:16] ah
[23:16] find . -type f -exec md5sum {} + > md5sum.txt
[23:16] wget created .listing files in every directory which included timestamps, so i figure that's probably good enough
[23:16] FTP per standard doesn't, but there are extensions.
[23:18] wget should have applied the timestamps to the downloaded files
[23:18] which will be preserved if you zip/tar them
[23:19] there needs to be more .warc support, it's a little repetitive defending archival properties for something that's not friendly to use
[23:19] yeah i believe it did
[23:57] *** martini has joined #archiveteam-ot
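To make the hashes and timestamps points above concrete: wget's WARC writer records a digest header per payload, and the mirrored tree keeps the FTP mtimes that zip/tar will preserve. The WARC filename and directory below are guesses based on the command at 23:10; a quick sanity check might look like:

    # peek at the per-record digests wget wrote into one of the WARCs
    zcat vikings-00000.warc.gz | grep -a "^WARC-Payload-Digest" | head
    # spot-check that the mirrored files kept their modification times
    ls -l --time-style=long-iso vikings.flashspot.tv/ | head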