[01:03] *** JAA sets mode: +bb Ya_ALLAH!*@* *!*@185.143.4*
[01:45] *** CoolCanuk has joined #archiveteam-bs
[02:01] *** DopefishJ is now known as DFJustin
[02:10] *** odemg has quit IRC (Ping timeout: 250 seconds)
[02:13] *** odemg has joined #archiveteam-bs
[02:21] *** pizzaiolo has quit IRC (pizzaiolo)
[03:05] *** ZexaronS has quit IRC (Read error: Operation timed out)
[03:07] *** Stilett0 has quit IRC (Read error: Operation timed out)
[03:16] *** ndiddy has joined #archiveteam-bs
[03:21] *** Stilett0 has joined #archiveteam-bs
[03:21] *** Stilett0 is now known as Stiletto
[03:38] Somebody2: Ok, can you explain something, regarding darking and crawl history?
[03:38] It is possible for someone to own domain A, have content on domain A, be fine with IA archiving it,
[03:39] and then sell/let the domain lapse, and the subsequent owner/squatter puts up a prohibitive robots.txt
[03:39] It's my understanding that the material is then non-accessible,
[03:39] is it "darked"?
[04:05] *** qw3rty111 has joined #archiveteam-bs
[04:09] *** qw3rty119 has quit IRC (Read error: Operation timed out)
[04:14] *** CoolCanuk has quit IRC (Quit: Connection closed for inactivity)
[04:21] Been out of the loop for a bit, but I think this mostly still stands: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
[05:28] *** bithippo has quit IRC (Read error: Connection reset by peer)
[05:44] *** wacky has quit IRC (Read error: Operation timed out)
[05:44] *** wacky_ has joined #archiveteam-bs
[06:50] robogoat: First of all, note that I'm not employed by IA, and in fact have only visited once. This is just an outside curious onlooker's view.
[06:51] With that out of the way -- dark'ing applies to *items* (a jargon term) on archive.org, not web pages in the Wayback Machine.
[06:52] The Wayback Machine is an *interface* to a whole bunch of WARC files, stored in various items on IA.
[06:53] The WARC files are what actually contain the HTML (and URLs, and dates) that are displayed through the Wayback Machine.
[06:55] If an item is darked, it can't be included in the Wayback Machine (or any other interface, like the TV News viewer, or the Emularity).
[06:56] *** Stiletto has quit IRC (Ping timeout: 250 seconds)
[06:56] So none of the items containing WARCs that contain web pages visible through the Wayback Machine are darked (unless I'm missing something).
[06:57] However -- that doesn't mean you can download the WARC files yourself, directly (with some exceptions).
[06:58] Most of the items containing WARCs used in the Wayback Machine are "private", which means while you can see the file names, and sizes, and hashes, ...
[06:59] ... you can't actually download the files without special permission (which the software that runs the Wayback Machine has).
[07:00] The WARCs produced by ArchiveTeam generally are *NOT* private -- although we prefer not to talk about this too loudly, to avoid people complaining.
[07:01] A recently added feature of the Wayback Machine provides links from a particular web page to the item containing the WARC it came from.
[07:04] Actually, it looks like it only links to the *collection* containing the item containing the WARC, sorry.
[07:06] So, now, to get back to robots.txt -- the Wayback Machine does (currently) include a feature to disable access to URLs whose most recent robots.txt file Disallows them.
[07:07] The details of exactly how this operates (i.e. which Agent names it recognizes, how it parses different Allow and Disallow lines, ...
[07:07] ... what it does if there is no robots.txt file) are subtle, changing, and undocumented.
[07:08] And robots.txt files do *NOT* apply to themselves, so you can always see the contents of all the robots.txt files IA has captured for a domain.
[07:09] (unless there was a specific complaint sent to IA asking for the domain to be excluded, which they also honor)
[07:10] But the robots.txt logic doesn't apply at ALL to the underlying items -- so *if* you can download them, you can still access the data that way.
[07:10] Hopefully that answers the question.
[07:10] (and sorry everyone else for the literal wall of text)
[07:11] There is also a robots.txt check included in the Save Page Now feature, but that's a separate thing.
[07:22] *** bwn has quit IRC (Read error: Connection reset by peer)
[07:37] *** DFJustin has quit IRC (Remote host closed the connection)
[07:44] *** DFJustin has joined #archiveteam-bs
[07:44] *** swebb sets mode: +o DFJustin
[07:44] *** bwn has joined #archiveteam-bs
[08:01] *** ZexaronS has joined #archiveteam-bs
[08:03] *** mr_archiv has quit IRC (Quit: WeeChat 1.6)
[08:03] *** mr_archiv has joined #archiveteam-bs
[08:03] *** mr_archiv has quit IRC (Client Quit)
[08:05] *** mr_archiv has joined #archiveteam-bs
[08:54] *** Mateon1 has quit IRC (Read error: Connection reset by peer)
[08:55] *** Mateon1 has joined #archiveteam-bs
[09:11] *** MrDignity has quit IRC (Remote host closed the connection)
[09:11] *** MrDignity has joined #archiveteam-bs
[10:12] *** schbirid has joined #archiveteam-bs
[10:17] *** nyany has quit IRC (Leaving)
[10:40] *** JAA sets mode: +bb BestPrize!*@* *!pointspri@*
[10:53] *** MrDignity has quit IRC (Remote host closed the connection)
[10:53] *** MrDignity has joined #archiveteam-bs
[12:19] *** kimmer12 has joined #archiveteam-bs
[12:22] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:25] *** schbirid has quit IRC (Quit: Leaving)
[12:26] *** kimmer1 has quit IRC (Ping timeout: 633 seconds)
[12:54] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[12:54] *** dashcloud has joined #archiveteam-bs
[13:00] *** Stilett0 has joined #archiveteam-bs
[13:46] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:48] *** dashcloud has joined #archiveteam-bs
[14:33] *** Specular has joined #archiveteam-bs
[14:34] is there any known way of converting Web Archive files saved from Safari to the MHT format?
[14:52] *** godane has quit IRC (Quit: Leaving.)
[15:09] *** kimmer1 has joined #archiveteam-bs
[15:12] *** kimmer12 has quit IRC (Ping timeout: 633 seconds)
[15:35] *** kimmer12 has joined #archiveteam-bs
[15:36] *** kimmer12 has quit IRC (Remote host closed the connection)
[15:42] *** kimmer1 has quit IRC (Ping timeout: 633 seconds)
[16:06] somehow my search queries were too specific before, and I just found this. Mac only, but I'll test later. https://langui.net/webarchive-to-mht/
[16:07] oh it's commercial. Typical Mac apps, ahaha.
[16:58] *** Specular has quit IRC (Quit: Leaving)
[17:09] *** pizzaiolo has joined #archiveteam-bs
[17:52] *** ola_norsk has joined #archiveteam-bs
[17:54] how might one go about 'archiving a person' on the internet archive? I'm thinking of the youtuber Charles Green, a.k.a. 'Angry Grandpa'
[18:04] *** kimmer1 has joined #archiveteam-bs
[18:12] ola_norsk: You can't archive a person. But you could archive the work they have posted online. I'd use ArchiveBot and youtube-dl
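For illustration, here is a minimal sketch of the youtube-dl half of that suggestion, using youtube-dl's Python API; the channel URL, output template, and archive file are placeholders rather than anything agreed on in the channel.

    # Sketch: grab a channel's uploads with youtube-dl's Python API.
    # The channel URL below is a placeholder; adjust formats/paths to taste.
    import youtube_dl  # pip install youtube-dl

    options = {
        'outtmpl': '%(uploader)s/%(upload_date)s - %(title)s - %(id)s.%(ext)s',
        'download_archive': 'downloaded.txt',  # skip already-fetched videos on re-runs
        'writeinfojson': True,                 # keep per-video metadata next to the media
        'writethumbnail': True,
        'ignoreerrors': True,                  # keep going if individual videos fail
    }

    with youtube_dl.YoutubeDL(options) as ydl:
        ydl.download(['https://www.youtube.com/user/EXAMPLE_CHANNEL/videos'])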
[18:14] Somebody2: could those various items, from e.g. the fandom wikia and twitter to youtube videos etc., later be made into e.g. a 'Collection'?
[18:14] Somebody2: without having to be one item, i mean
[18:15] Yes, once you upload them, send an email to info@archive suggesting they be made into a collection, and someone will likely do it eventually.
[18:15] ty
[18:16] *** godane has joined #archiveteam-bs
[18:17] btw, would the items need a certain meta-tag?
[18:18] other than topics, i mean
[18:20] For WARCs, you need to set the mediatype to web. Anything else is optional and can be changed post-upload (mediatype can only be set at item creation). But the more metadata, the better! :-)
[18:20] okidoki
[18:21] (If you forget to set the mediatype correctly on upload, send them an email instead of trying to work around it by creating a new item or whatever; they can change it, I believe.)
[18:22] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[18:22] speaking of which, i messed that up on this item https://archive.org/details/vidme_AfterPrisonJoe :/
[18:24] and it seems to have messed up the media format detection. I did re-download the videos locally though.
[18:26] *** pizzaiolo has joined #archiveteam-bs
[18:27] JAA: would it be a good idea to simply tar.gz the vidme videos, add that to the item, and then send an email asking for all the item's content to be replaced by the content of the tar.gz?
[18:27] I doubt it.
[18:28] Why don't you just use the 'ia' tool (Python package internetarchive)?
[18:28] i do use that
[18:29] but, there seems to be a bug that prevents changing metadata.
[18:30] Hm?
[18:30] JAA: https://github.com/jjjake/internetarchive/issues/228
[18:31] but, maybe it's resolved already. I've not tested yet.
[18:41] ouch, might i have accidentally closed the issue? Or do they time out on github after some time :/
[18:46] JAA: what version of 'ia' are you on?
[18:46] * ola_norsk is on 1.7.4
[19:18] how do i submit 'angrygrandpa.wikia.com' to get grabbed by archivebot?
[19:19] nevermind
[19:38] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[19:38] *** dashcloud has joined #archiveteam-bs
[19:56] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[20:00] *** dashcloud has joined #archiveteam-bs
[20:20] *** BlueMaxim has joined #archiveteam-bs
[20:35] ola_norsk: Sorry, had to leave. I've been using 1.7.1 and 1.7.3.
[20:38] ola_norsk, Somebody2: If you archive a site through ArchiveBot, you can't easily add just that site's archives to a collection afterwards though, because the archives are generally spread over multiple items and each of those items also contains loads of other jobs' archives.
[20:39] It might be better to use grab-site/wpull and upload the WARCs yourself. Then you can create clean items for each site you archive or whatever, and these can easily be added to a collection (or multiple) afterwards.
[20:42] That is a good point, thank you.
[20:47] What's the #archivebot IRC logs password?
[20:56] Query
[20:57] And username, JAA?
[20:59] JAA: ah. i will see if i have the space for that wikia. (though, it's already submitted to archivebot as a job) :/ I will use !abort if my hard drive runs out
[21:00] JAA: so there's no real way to recall a specific archivebot job/task?
[21:00] ola_norsk: You could upload the WARCs already while you're still grabbing it. That's what ArchiveBot pipelines do as well.
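A minimal sketch of what such an upload could look like with the internetarchive Python package ('ia') mentioned above; the item identifier, filename, and metadata values are placeholders. As noted earlier in the log, mediatype has to be set to 'web' at item creation and can't be fixed by the uploader afterwards.

    # Sketch: upload a WARC into a new item, setting mediatype 'web' at creation.
    # The identifier, file name, and metadata below are placeholders.
    from internetarchive import upload

    metadata = {
        'mediatype': 'web',                      # must be correct on first upload
        'title': 'angrygrandpa.wikia.com grab',  # placeholder title
        'subject': ['Angry Grandpa', 'wikia'],   # "topics" end up as subject keywords
        'description': 'WARC of angrygrandpa.wikia.com, grabbed with grab-site.',
    }

    responses = upload(
        'angrygrandpa-wikia-warc',               # placeholder item identifier
        files=['angrygrandpa.wikia.com-00000.warc.gz'],
        metadata=metadata,
        retries=5,
    )
    print([r.status_code for r in responses])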
[21:00] What do you mean by "recall"?
[21:01] JAA: to make that specific archivebot task into an item on ia
[21:02] JAA: a warc item, i mean, with topics etc.
[21:02] In theory, you could download the files and reupload them to a new item. Or use 'ia copy', which does a server-side copy I believe. Whether that's a good idea though is another question entirely...
[21:03] im n00b at using archivebot i'm afraid :/
[21:03] I guess someone from IA could move the files to a separate item. But again I'm not sure whether they do that.
[21:03] But yeah, that's all manual.
[21:04] Some pipelines upload files directly, and there you sort-of have one item per job (though it doesn't contain all relevant files and may sometimes contain other jobs as well).
[21:04] But other than that...
[21:05] a warc item that's manually uploaded as an item at a later time though, could that use data already grabbed through archivebot? Or would it be causing duplicate hell?
[21:06] I don't know how IA handles duplicates.
[21:06] aye, me neither
[21:07] That's why I wonder if it's a good idea.
[21:07] If they deduplicate the files, then it would probably be fine.
[21:07] Maybe someone else knows more about this.
[21:08] "somewhere, in the deep cellars of internet archive; There's a single gnome set to the task of checksum'ing all files and writing symlinks" lol
[21:08] :D
[21:12] *** Mateon1 has quit IRC (Read error: Operation timed out)
[21:13] *** Mateon1 has joined #archiveteam-bs
[21:15] IA does not (*YET*) deduplicate.
[21:15] (AFAIK)
[21:20] *** jschwart has joined #archiveteam-bs
[21:49] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:50] *** dashcloud has joined #archiveteam-bs
[21:51] i have no idea. if there's no pressing need it's ok i guess. And i'm thinking they would if need be, at least on items that haven't been altered for quite a while.
[21:52] deduplication i can see being done on video and pdf items
[21:53] i don't think it would work with warc files
[21:53] aye
[21:53] not with derived/recompressed files either i think. not unless the original was checked beforehand
[21:57] godane: i guess with warc it would need checking the content, and patching the stuff and/or link list in the warcs
[21:58] sort of my thought
[21:58] aye
[21:59] unless ia unpacks all warcs on their side
[21:59] i was thinking it would check important web urls that have the same checksum and make a derived warc archive to only store it once
[22:00] warc is generally a rather unfortunate thing to do for bulk file formats
[22:00] (not sure about the wisdom of reinventing zip files, either)
[22:00] so it would be a derived warc either way
[22:01] something like this for warc makes more sense if we are doing my librarybox project idea
[22:01] got link?
[22:01] cause then people can host full archives of cbsnews.com for example without it taking 100gb
[22:02] the thing is that on a mass scale, dupes dont happen that often in general
[22:02] so its often not worth the time bothering with it, especially for small items
[22:04] but e.g. for twitter, users might upload memes that are just copied from other sites. i don't know if twitter alters all images; but that could cause duplicates i think
[22:04] if twitter gives each uploaded image its own file and filename, i mean
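To put a rough number on the WARC question godane and ez were circling above: duplicate payloads inside a WARC can be spotted by their WARC-Payload-Digest headers, which is what revisit-record style dedup keys on. A sketch using the warcio package; the filename is a placeholder, and this only measures potential savings, it doesn't rewrite anything.

    # Sketch: count how many response payloads in a WARC share a digest,
    # i.e. how much a revisit-style dedup could save. Filename is a placeholder.
    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    digests = Counter()
    with open('angrygrandpa.wikia.com-00000.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            digest = record.rec_headers.get_header('WARC-Payload-Digest')
            if digest:
                digests[digest] += 1

    dupes = {d: n for d, n in digests.items() if n > 1}
    print(f'{sum(digests.values())} responses, {len(dupes)} digests seen more than once')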
[22:05] somewhat
[22:06] theres been a study on this for 4chan, which, granted, isnt a representative sample of twitter
[22:06] but as far as actual md5s stored _live_, only 10-15% were dupes
[22:06] however over time, there were indeed >50% dupes in certain time periods
[22:06] exactly as you say, some image got really popular and got reposted over and over
[22:08] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[22:08] in any case, it's something that's fixable in the future though i would guess. E.g. picking through all the shit and finding e.g. the most recurring or highest-quality image; either by md5 or image regocnitions
[22:09] image recognition*
[22:09] *** pizzaiolo has joined #archiveteam-bs
[22:10] well, this isnt the exact study i've seen, but it mentions the distribution of dupes per post https://i.imgur.com/S9pxJqV.png
[22:11] ola_norsk: it could be done of course. note that if you were to do what you say, you'd also build an index for reverse image search
[22:11] which would be a really handy thing for IA to have
[22:11] needless to say, it depends if IA wants to diversify as a search engine
[22:12] well, the meta.xml is there, containing the md5's i think :D
[22:12] md5s are useless for search
[22:12] aye, but to find duplicates i mean
[22:12] on the md5 level, the dupes dont happen often enough given a random kitchen sink of files
[22:13] it definitely makes sense for certain datasets
[22:13] *** pizzaiolo has quit IRC (Client Quit)
[22:13] like those ad-laden pirate rapidshare upload servers, they sure do md5. they have this huge sample of highly specific content, and the files are often the same copies.
[22:13] megaupload actually got nailed for this legally
[22:13] *** pizzaiolo has joined #archiveteam-bs
[22:13] they DMCA'd the link, but not the md5
[22:14] deduping 10k 1MB image files and deduping 10k 1GB files is what makes the difference
[22:15] as you get a random mix of both, its obvious which part of the set to focus on
[22:15] it would be slow work i guess
[22:15] depends on how the system works, really
[22:16] most data stores working on a per-file basis compute the hash on streaming upload, and make a symlink when the hash is found at that time, too
[22:16] but more general setups often dont have the luxury of having a hook fire for each new file
[22:16] what if there was a distributed tool for the Warrior, that picked through the IA items' xml and looked for duplicate md5's?
[22:17] (of only original files that is)
[22:18] its possible to fetch the hashes from ia directly via the api
[22:18] joepie91: FYI, I'm porting your anti-CloudFlare code to Python.
[22:19] not everything is available tho. as long as it has an xml sha1, you can fetch it via the api
[22:19] build an offline database too, etc
[22:20] ez: it's quite a number of items on ia though. But if it was a delegated slow and steady task
[22:21] ez: like a background task or something
[22:21] no i mean you can have the hashes offline
[22:21] as a comparably small structure
[22:21] unfortunately its really awkward to get it at this moment
[22:22] ez: i mean making a full list of duplicate files, where e.g. the 'parent item' is by first date
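A rough sketch of that "full list" idea for a handful of items, using the internetarchive Python package (the per-file md5s it returns come from each item's files metadata). One identifier is taken from earlier in the log, the other is a placeholder; a real pass over all of IA would obviously need the offline/bloom-filter style approach discussed next.

    # Sketch: group "original" files from a few items by md5 to spot duplicates.
    # 'some-other-item' is a placeholder; scaling this to all of IA is the hard part.
    from collections import defaultdict
    from internetarchive import get_item

    identifiers = ['vidme_AfterPrisonJoe', 'some-other-item']
    by_md5 = defaultdict(list)

    for identifier in identifiers:
        item = get_item(identifier)
        for f in item.files:                     # file dicts from the item's metadata
            md5 = f.get('md5')
            if f.get('source') == 'original' and md5:   # skip IA-generated derivatives
                by_md5[md5].append((identifier, f['name']))

    for md5, locations in by_md5.items():
        if len(locations) > 1:
            print(md5, locations)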
[22:22] ola_norsk: a bloom filter with a reasonable collision rate is like 10 bits per item
[22:23] regardless of the number of items
[22:24] not sure if ia supports search by hash
[22:25] last time i checked the api (some years ago) it didnt
[22:25] the md5 hashes are in the xml of (each?) item
[22:25] if it still doesnt, you'd need to store the hash, as well as the xml id, to locate its context as you say
[22:25] which would bloat the database a great deal
[22:26] ola_norsk: yea
[22:26] the idea is that i run a scan over my filesystem and compare every file to a filter i scraped from the ia api
[22:26] it's not going anywhere though. So it could basically be done slow as shit don't you think?
[22:26] and upload only files which dont match. this is because querying IA with 500M+ files is not realistic
[22:27] so a bloom filter would work fine for uploads of random crap and making more or less sure its not a dupe
[22:28] but it wont tell you *where* your files are on ia, unless you abuse the api with fulltext search and what not
[22:28] i have no idea man :)
[22:29] do the logs say?
[22:29] the history of items?
[22:30] * ola_norsk 's brain is broken and beered :/
[22:31] i'm guessing IA would put some gnome to work the day their hard drive is full :D
[22:32] (which i'm guessing is not tomorrow lol) :D
[22:32] just restrict some classes of uploads when space starts running short
[22:32] but yea, space can be done on the cheap if you have the scale
[22:33] doh, restriction is bad :/
[22:33] might they as well check if that file already exists?
[22:34] as i said, at those scales, the content is so diverse it happens rather infrequently, especially if your files are comparably small (ie a lot of small items of diverse content)
[22:34] ...maybe they already do.. :0
[22:34] *** odemg has quit IRC (Read error: Connection reset by peer)
[22:35] its easy to do for single files, but not quite sure about warc
[22:35] aye, it would need unpacking and stuff
[22:36] and e.g. youtube videos that are mkv combined would need to be split into audio and video i guess, then compared
[22:37] the thing is i've seen deduping rather infrequently in large setups like this - the restrictions on flexibility of what you can do (you now need some sort of fast hash index to check against, you need some symlink mechanism now)
[22:37] youtube doesnt dedupe im pretty sure
[22:37] not by the source video anyway
[22:38] since 99% of content they get is original uploads. most dupes they'd otherwise get usually get struck by contentid
[22:38] 1% is the long tail of short meme videos reposted over and over and what not, but its just a tiny part of the long tail
[22:39] i sometimes upload by getting videos with youtube-dl, and it seems that often combines 2 files, audio and video, into a single file.. would that make a different md5 sum?
[22:40] (without repacking/recoding, i mean)
[22:40] (its easy to test - each reupload yields a new reencode on yt, and the encode is even slightly different as it contains a timestamp in the mp4 header)
[22:41] ola_norsk: yes, the highest quality is available only via HLS
[22:42] curiously, its an artificial restriction, as other parts of google infra which use yt infra serve mp4 1080/4k just fine
[22:42] the restriction is specific to yt, i suppose in a bid to frustrate trivial ripping attempts via browser extensions
[22:44] but, i think what i mean is: if i run 'youtube-dl' on the same youtube video twice.. then e.g. the audio and video (often webm and mp4), before they are merged into an MKV file, would be the very same files each time? or?
[22:44] ola_norsk: on and off its possible to abuse youtube apis to get the actual original mp4, but it changes 2 times now (works for google+, only for your own videos when logged in)
[22:44] *has changed
[22:44] so definitely not something to rely on
[22:45] but if i were crazy enough to archive yt, i'd definitely try to rip the original files, not the re-encodes
[22:45] *** odemg has joined #archiveteam-bs
[22:45] no hehe :D i'm just giving an example of how to detect duplicate videos :D
[22:45] or duplicate uploads in general
[22:46] ola_norsk: depends what you command ytdl to do
[22:46] generally if you ask it for the same file, same format, you overwhelmingly get the exact same file
[22:46] but google re-encodes those from time to time
[22:48] aye, most often it just combines audio and video, 2 different files, into a KVM file. And, i'm thinking if the kvm file were split again, into those two a/v files, the md5 would be the same in two instances where youtube-dl was used to download the same video.
[22:49] and, by that, duplicate video uploads could be detected
[22:50] the generated merged file (kvm etc) would be different, but the two files inside would be the same, since there's no re-encoding occurring
[22:51] hum, that sounds elaborate?
[22:51] aye, i have a headache :D
[22:51] why not just tell ytdl to rip everything to mkv from the get-go?
[22:54] i just meant in relation to detecting duplicates in IA items (where e.g. the same youtube video is uploaded twice)
[22:55] where the md5sum of the two kvm files is not an option
[22:55] since each download would cause two different kvm files to be made locally by the downloaders
[22:57] if, however, it's possible to split a kvm into the audio and video files contained.. i'm thinking those two would yield identical md5s
[23:02] * ola_norsk ran out of duplication-detection smartness :/
[23:05] i have no idea what a kvm file is
[23:10] *** nyany has joined #archiveteam-bs
[23:16] ez: usually when i use youtube-dl it downloads audio and video separately, then combines the two into a kvm file
[23:16] whats a kvm file?
[23:16] oh
[23:16] mkv
[23:17] ah, sorry, yes
[23:17] mkv
[23:18] yea its a bit annoying, mostly because the ffmpeg transcoder is non-deterministic
[23:18] i think it puts a timestamp in the mkv header or something silly like that
[23:18] s/transcoder/muxer/
[23:19] does it alter the two audio and video files though? or can they be split back out of the mkv?
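One way to test exactly that without trusting the container is to hash each stream via ffmpeg's stream copy and md5 output, which ignores the mkv/mp4 wrapper. A sketch assuming an ffmpeg binary on PATH and placeholder filenames; whether two youtube-dl runs really give bit-identical streams still depends on YouTube not having re-encoded in between.

    # Sketch: md5 the raw video and audio streams of a container file via ffmpeg
    # stream copy ('-c copy'), so differing mkv/mp4 headers don't change the hash.
    # Assumes ffmpeg is on PATH; filenames are placeholders.
    import subprocess

    def stream_md5(path, stream):             # stream: '0:v:0' or '0:a:0'
        out = subprocess.run(
            ['ffmpeg', '-v', 'error', '-i', path,
             '-map', stream, '-c', 'copy', '-f', 'md5', '-'],
            capture_output=True, text=True, check=True,
        ).stdout.strip()                      # e.g. 'MD5=9f86d081884c7d65...'
        return out.split('=', 1)[1]

    for f in ['first-download.mkv', 'second-download.mkv']:
        print(f, stream_md5(f, '0:v:0'), stream_md5(f, '0:a:0'))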
[23:21] either way, if that is possible, then detecting duplicate mkv files uploaded to ia is possible
[23:21] even if the md5 sum differs between two mkv files containing the same content
[23:22] the raw bitstream is kept as-is
[23:23] ok
[23:23] meaning the mux is as "original" as sent in the google HLS track
[23:23] but the container metadata are often unstable on account of different versions of software muxing slightly differently (think seek metadata and such)
[23:27] if the framecount doesn't differ though, and neither do the frames, that could be a further step?
[23:28] ez: the extent of my knowledge is rather spent when it comes to codecs :D
[23:28] tl;dr is that you cant rely on what ytdl gives you as muxed output
[23:29] perhaps if you just ask for the 720p mp4, as that one isnt remuxed (yet?)
[23:30] i'm thinking some kind of image/frame similarity detection then i guess
[23:30] rather*
[23:32] *** jschwart has quit IRC (Quit: Konversation terminated!)
[23:32] the md5 is in the item xml though, so i guess that is where one would have to start to find duplicates on ia
[23:39] ez: it is, as you say, elaborate. So i'm glad i don't have to do it :D
[23:40] ez: (and so should everyone else be, of me not doing it lol) ;)
[23:41] *** BlueMaxim has quit IRC (Quit: Leaving)
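On the frame-count / frame-comparison idea from 23:27 above: ffmpeg's framemd5 muxer prints a hash per decoded frame, so two files whose frame counts and per-frame hashes match contain the same video regardless of what the container says. A sketch, again assuming ffmpeg on PATH and placeholder filenames.

    # Sketch: compare two files frame by frame using ffmpeg's framemd5 output.
    # Assumes ffmpeg is on PATH; filenames are placeholders.
    import subprocess

    def frame_md5s(path):
        out = subprocess.run(
            ['ffmpeg', '-v', 'error', '-i', path,
             '-map', '0:v:0', '-f', 'framemd5', '-'],
            capture_output=True, text=True, check=True,
        ).stdout
        # data lines look like "0, 0, 0, 1, 460800, <md5>"; '#' lines are comments
        return [line.rsplit(',', 1)[1].strip()
                for line in out.splitlines()
                if line and not line.startswith('#')]

    a, b = frame_md5s('upload-one.mkv'), frame_md5s('upload-two.mkv')
    print('same frame count:', len(a) == len(b))
    print('identical frames:', a == b)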