[00:56] *** VerfiedJ has quit IRC (Quit: Leaving)
[01:04] *** benjinsmi has quit IRC (Leaving)
[01:22] *** Hani has quit IRC (Read error: Connection reset by peer)
[01:22] *** Hani has joined #archiveteam-bs
[01:50] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:53] *** Sk1d has joined #archiveteam-bs
[02:17] *** icedice has quit IRC (Quit: Leaving)
[02:19] *** Sk1d has quit IRC (Read error: Operation timed out)
[02:21] *** Sk1d has joined #archiveteam-bs
[02:38] *** ubahn_ has joined #archiveteam-bs
[02:45] *** ubahn has quit IRC (Ping timeout: 615 seconds)
[02:52] *** Wizzito has joined #archiveteam-bs
[02:52] https://archive.org/details/WiiShopChannelBackup Y'all might appreciate this
[02:52] Saw it and thought of archiveteam
[03:02] *** Sk1d has quit IRC (Read error: Operation timed out)
[03:04] *** ubahn_ has quit IRC (Quit: ubahn_)
[03:06] *** Sk1d has joined #archiveteam-bs
[03:51] *** newbie81 has joined #archiveteam-bs
[03:52] Hey all, can anyone recommend a good way of deduplicating WARC files? I'm scraping a Twitter feed that tends to delete lots of stuff, but my hourly cronjob is producing a lot of duplicate data
[03:52] *** newbie81 is now known as jianaran
[04:04] *** Wizzito has quit IRC (Quit: Leaving)
[04:15] *** benjins has joined #archiveteam-bs
[04:20] *** Sk1d has quit IRC (Read error: Operation timed out)
[04:23] *** Sk1d has joined #archiveteam-bs
[04:33] *** qw3rty117 has joined #archiveteam-bs
[04:39] *** odemgi_ has joined #archiveteam-bs
[04:39] *** qw3rty116 has quit IRC (Read error: Operation timed out)
[04:42] *** odemgi has quit IRC (Ping timeout: 252 seconds)
[04:42] *** odemg has quit IRC (Ping timeout: 265 seconds)
[04:55] *** odemg has joined #archiveteam-bs
[06:11] *** newbie45 has joined #archiveteam-bs
[06:14] *** jianaran has quit IRC (Ping timeout: 268 seconds)
[06:15] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:19] *** Sk1d has joined #archiveteam-bs
[06:39] *** chimyatta has joined #archiveteam-bs
[06:39] *** newbie45 has quit IRC (Ping timeout: 268 seconds)
[06:40] *** astrid has quit IRC (Read error: Operation timed out)
[06:47] *** astrid has joined #archiveteam-bs
[06:48] *** svchfoo3 sets mode: +o astrid
[07:07] *** wyatt8740 has joined #archiveteam-bs
[07:17] *** wyatt8740 has quit IRC (Ping timeout: 255 seconds)
[07:18] *** wyatt8740 has joined #archiveteam-bs
[07:23] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[07:33] *** Exairnous has quit IRC (Ping timeout: 246 seconds)
[07:43] *** VADemon has quit IRC (Read error: Connection reset by peer)
[08:05] jianaran: Someone correct me if I'm wrong, but I believe there's usually a digest in the WARC header that can be used to dedupe.
[08:06] WARC-Payload-Digest
[08:07] In some setups it's used to write WARC revisit records
[08:16] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:19] *** Sk1d has joined #archiveteam-bs
[08:36] *** wp494_ has joined #archiveteam-bs
[08:40] *** wp494 has quit IRC (Read error: Operation timed out)
[08:55] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[08:56] *** BlueMax has joined #archiveteam-bs
[09:00] *** Coderjo_ has joined #archiveteam-bs
[09:05] *** Coderjo has quit IRC (Ping timeout: 615 seconds)
[09:51] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:55] *** Sk1d has joined #archiveteam-bs
[10:14] *** BlueMax has quit IRC (Quit: Leaving)
[10:29] *** atomicthu has quit IRC (No Ping reply in 180 seconds.)
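[Editor's note: a minimal sketch of the digest-based deduplication suggested above, using the warcio library; this is not a tool mentioned in the log, and the input/output file names are hypothetical. It simply drops response records whose WARC-Payload-Digest has already been seen; a more faithful approach writes revisit records instead, sketched further below.]

    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    seen = set()
    with open('deduped.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        # Hypothetical hourly crawl outputs, oldest first.
        for name in ('hourly-0100.warc.gz', 'hourly-0200.warc.gz'):
            with open(name, 'rb') as stream:
                for record in ArchiveIterator(stream):
                    digest = record.rec_headers.get_header('WARC-Payload-Digest')
                    if record.rec_type == 'response' and digest in seen:
                        continue  # identical payload already written
                    if digest:
                        seen.add(digest)
                    writer.write_record(record)

(Note this sketch leaves the request records of skipped responses in place; handling those pairs properly is one reason not to implement dedup casually.)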
[10:35] *** atomicthu has joined #archiveteam-bs
[11:10] *** m007a83_ has joined #archiveteam-bs
[11:11] *** m007a83 has quit IRC (Read error: Operation timed out)
[11:12] *** Hani111 has joined #archiveteam-bs
[11:14] *** Hani has quit IRC (Read error: Operation timed out)
[11:14] *** Hani111 is now known as Hani
[11:46] *** Hani111 has joined #archiveteam-bs
[11:48] *** Hani has quit IRC (Read error: Operation timed out)
[11:57] *** Hani111 has quit IRC (Ping timeout: 615 seconds)
[13:04] *** Mateon1 has quit IRC (Read error: Operation timed out)
[13:04] *** Mateon1 has joined #archiveteam-bs
[13:13] *** VerfiedJ has joined #archiveteam-bs
[13:21] Yeah, you'll want to dedupe based on the digest, but I wouldn't recommend implementing that yourself unless you're very familiar with the WARC spec. There should be code for this in some of our projects, but I'm not sure how reusable it is.
[13:22] *** VADemon has joined #archiveteam-bs
[13:23] crocoite has a tool (now with tests :)) which does that: https://github.com/PromyLOPh/crocoite/blob/master/crocoite/tools.py#L36
[13:36] *** ubahn has joined #archiveteam-bs
[14:32] oOoOo
[14:38] what is this crocoite?
[14:38] I wouldn't use it for deduplication!
[14:39] PurpleSym: astrid: ^
[14:39] Can you elaborate?
[14:39] crocoite is chromebot.
[14:41] hmm
[14:43] actually, it might be good enough, just looked at https://github.com/webrecorder/warcio/blob/master/warcio/warcwriter.py#L169-L184
[14:43] initially saw a lot of WARC headers missing
[14:43] Do we know what is happening with the WARC-Block-Digest?
[14:43] or was that never saved in the records initially
[14:44] I think warcio takes care of that.
[14:44] hmm
[14:44] WARC records generated by crocoite therefore are an abstract view on the resource they represent and not necessarily the data sent over the wire. A URL fetched with HTTP/2 for example will still result in a HTTP/1.1 request/response pair in the WARC file. This may be undesirable from an archivist’s point of view (“save the data exactly like we received it”). But this level of abstraction is inevitable when dealing with more than one protocol.
[14:44] I think this came up previously
[14:45] does this stuff go into the Wayback Machine?
[14:45] I think so, https://archive.org/details/archiveteam_chromebot
[14:48] Do we know if brozzler has the same problem?
[14:48] Brozzler uses Chrome too I think, but captures content differently
[14:48] Afaik brozzler uses a proxy.
[14:49] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:49] So, no, it does not have this “problem”.
[14:51] Also regarding televisiontunes, I asked, will know soon
[14:51] Thanks.
[14:51] What is the reason we're currently using crocoite and not brozzler?
[14:52] *** Sk1d has joined #archiveteam-bs
[14:53] I didn't know about brozzler when I started this project.
[14:54] And since nobody's spent time setting up a brozzler IRC bot, well, …
[14:54] I see
[14:54] Also I believe the DOM snapshot/page screenshot crocoite creates are important.
[14:55] (I haven't seen any other software doing that yet.)
[15:03] right, ok
[15:03] thanks
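[Editor's note: the revisit-record approach mentioned at 08:07 can be sketched with warcio's create_revisit_record(); the URL, digest, and timestamps below are hypothetical placeholders (the digest shown is just the SHA-1 of an empty payload), and the precise header semantics come from the WARC specification, not from this log.]

    from warcio.warcwriter import WARCWriter

    with open('revisits.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        # Point the new capture at the earlier one instead of storing
        # the duplicate payload a second time.
        record = writer.create_revisit_record(
            'http://example.com/feed',                  # URL of the new fetch
            'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',    # payload digest matching the earlier capture (placeholder)
            'http://example.com/feed',                  # WARC-Refers-To-Target-URI
            '2019-02-02T08:00:00Z')                     # WARC-Refers-To-Date of the earlier capture
        writer.write_record(record)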
[15:07] arkiver: Do you have any thoughts on how to push IC datasheets to IA? t3's been throwing vendor websites into archivebot, but ultimately we need them as items on IA I believe.
[15:08] yes, definitely want them on IA
[15:08] don't we already have a datasheets collection on IA?
[15:09] https://archive.org/details/ic_datasheets
[15:10] we probably want to add them to that collection
[15:10] That's what I was thinking. What about metadata? The description field is not very useful imo.
[15:11] Like: Vendor, markings, revision, …
[15:14] the description field on IA?
[15:14] I'd try to add as much as possible as metadata fields
[15:15] Are they indexed/searchable?
[15:17] like contributor, creator, language, source, publisher, licenseurl, etc.
[15:17] basically just as much as possible from https://archive.org/services/docs/api/metadata-schema/index.html
[15:17] PurpleSym: I think so, but if not we can always add stuff to the description later
[15:17] source would be the direct URL to the PDF
[15:18] and make sure it's in the Wayback Machine, but most has already gone through archivebot as you said so that should be good
[15:19] Thanks for the link, will look into that and see how we can map available metadata onto that schema.
[15:19] awesome
[15:19] and if anything doesn't fit, please let me know
[15:19] to create custom fields
[15:24] *** icedice has joined #archiveteam-bs
[15:27] *** Sk1d has quit IRC (Read error: Operation timed out)
[15:30] *** Sk1d has joined #archiveteam-bs
[15:32] *** ubahn has quit IRC (Quit: ubahn)
[15:37] *** Sk1d has quit IRC (Read error: Operation timed out)
[15:40] *** Sk1d has joined #archiveteam-bs
[15:53] *** Oddly has joined #archiveteam-bs
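[Editor's note: a hedged sketch of mapping datasheet metadata onto the IA schema discussed above, using the internetarchive Python library; the item identifier, file name, and every field value are made-up examples, and adding items to the ic_datasheets collection presumably requires privileges on that collection.]

    from internetarchive import upload

    upload('example-vendor-lm317-datasheet',   # hypothetical item identifier
           files=['lm317.pdf'],
           metadata={
               'collection': 'ic_datasheets',
               'mediatype': 'texts',
               'creator': 'Example Semiconductor, Inc.',
               'publisher': 'Example Semiconductor, Inc.',
               'language': 'eng',
               # direct URL to the PDF, used as "source" per the 15:17 suggestion
               'source': 'https://example.com/ds/lm317.pdf',
           })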
[15:57] Hey.
[15:58] arkiver: I made an account about a month ago to upload WARCs of the datasheets, but I think I am being throttled because I get "503 Slow Down" messages.
[15:58] I'm not exactly sure what to do.
[16:00] Currently, the easiest solution is to make WARCs of the websites that have the datasheets and then upload them to IA.
[16:01] t3: you can use archivebot?
[16:01] arkiver: I will take a look at your link.
[16:02] arkiver: Well, the issue is that ArchiveBot is usually busy, and I don't want to look bad by over-using it.
[16:03] Quite honestly, it would be nice to be an op on that channel. I'm trying my best not to hoard resources.
[16:03] trc: please ping me if you read this
[16:03] t3: you can do everything now too, right?
[16:04] arkiver: What is trc?
[16:04] so we don't usually add WARCs to the Wayback Machine that are uploaded by any users
[16:04] that way it would be too easy for others to get bad data into the Wayback Machine
[16:04] trc is/was someone here
[16:05] arkiver: Yeah, but I'm basically doing bulk uploads of WARCs using grab-site.
[16:06] where
[16:06] I'm hoping to be whitelisted so that the WARCs can be ingested into the Wayback Machine.
[16:06] But I have over 300 GB of pending uploads, because I'm being throttled.
[16:07] what are the items, what is your account?
[16:07] arkiver: https://archive.org/details/@warchiver
[16:07] I can't promise anything about getting into the Wayback Machine, because of the reason I previously mentioned
[16:07] Yes, understood.
[16:09] I also have WARCs for websites that are not covered on the Wayback Machine.
[16:11] *** astrid has left ][
[16:18] The items? Oh, well I have WARCs, basically. There are a lot of WARCs. More than 10 semiconductor websites, with more to come from the wiki page I've been working on (https://www.archiveteam.org/index.php?title=IC_datasheets). I also have many WARCs of corporate websites.
[16:20] I also have WARCs of under-covered blogs and other sites that are still not on the Wayback Machine.
[16:22] *** chimyatta has quit IRC (Quit: quitting)
[16:24] I currently have 205 WARCs queued for uploading.
[16:29] *** Sk1d has quit IRC (Read error: Operation timed out)
[16:31] *** Sk1d has joined #archiveteam-bs
[16:36] *** Hani111 has joined #archiveteam-bs
[16:36] *** Hani111 is now known as Hani
[16:50] *** Oddly has quit IRC (Ping timeout: 255 seconds)
[16:54] *** Sk1d has quit IRC (Read error: Operation timed out)
[16:57] *** Sk1d has joined #archiveteam-bs
[17:10] *** chimyatta has joined #archiveteam-bs
[17:34] *** wp494 has joined #archiveteam-bs
[17:37] *** wp494_ has quit IRC (Ping timeout: 268 seconds)
[17:40] *** Despatche has joined #archiveteam-bs
[18:01] *** ubahn has joined #archiveteam-bs
[18:10] *** m007a83_ is now known as m007a83
[18:11] *** m007a83 has quit IRC (Quit: Fuck you Comcast)
[18:11] *** m007a83 has joined #archiveteam-bs
[18:24] is anyone here able and willing to dump the WARC from here into the WBM? https://archive.org/details/www.opennicproject.org-2019-01-29-43f55a63
[18:26] *** schbirid has joined #archiveteam-bs
[18:34] *** Hani111 has joined #archiveteam-bs
[18:34] *** Hani has quit IRC (Read error: Connection reset by peer)
[18:35] *** Hani111 is now known as Hani
[18:39] Fusl: I'm just curious. How did you set up your upload script?
[18:43] *** ubahn has quit IRC (Quit: ubahn)
[18:48] Fusl: you don't have to upload a CDX file, btw
[18:48] i don't?
[18:48] no
[18:48] IA derives the WARC files and creates its own CDX file
[18:48] uploading only the WARC file is enough
[18:48] what about -meta.warc.gz?
[18:49] yes, that one contains valuable information
[18:49] upload all WARCs, no CDXs; the CDX can be directly derived from the WARC and holds no special information
[18:52] arkiver: Curious, what does IA do when you do upload a CDX? Is it moved to a different filename or just deleted entirely?
[18:52] not sure about overwriting
[18:53] in this case a .cdx file was uploaded and IA will just ignore it
[18:54] t3:
[18:54] id="${1}"
[18:54] test -d "/data/${id}" || { echo "grab-site folder with ID ${id} doesn't exist"; exit 1; }
[18:54] docker run -ti --rm --log-driver=none -v /data/:/data/:ro ateam/ia:latest upload "${id}" "/data/${id}/${id}-"*".warc.gz" --metadata=mediatype:web
[18:55] with #!/bin/bash in the first line, just didn't copy that in here
[18:56] PurpleSym: green light for televisiontunes. please make sure you get the metadata right :)
[18:56] grab-site also downloads everything into /data so there's that
[18:56] Fusl: Thanks. I haven't used Docker to run grab-site before.
[18:57] here's my grab-site Dockerfile: http://xor.meo.ws/puJtxPYKJZxVxLL1PLoNVTF68k2wVLq7.txt
[18:57] arkiver: Unfortunately there's not much metadata available. All I have right now is a title and the original file URL(s).
[18:58] ok
[18:58] run the grab-site web dashboard server with `docker run -p 29000:29000 --entrypoint gs-server ateam/grab-site:latest` and the actual grab-site with `docker logs -f -t $(docker run --rm -d -e GRAB_SITE_HOST=172.17.0.1 --name "grab-site_$(cat /proc/sys/kernel/random/uuid)" -v /data:/data:rw ateam/grab-site:latest --igon --import-ignores /data/ignores --no-offsite-links "${@}" | tee /dev/stderr)` and you're good to go
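[Editor's note: since t3 isn't running Docker, here is a rough Python equivalent of Fusl's upload step above, using the internetarchive library directly; the /data layout mirrors Fusl's script, and the paths and error message are assumptions, not part of the original script.]

    import glob
    import sys

    from internetarchive import upload

    # Same contract as Fusl's bash script: the item identifier is the
    # grab-site output directory name under /data.
    item_id = sys.argv[1]
    warcs = sorted(glob.glob('/data/{0}/{0}-*.warc.gz'.format(item_id)))
    if not warcs:
        sys.exit("grab-site folder with ID {0} doesn't exist or has no WARCs".format(item_id))
    upload(item_id, files=warcs, metadata={'mediatype': 'web'})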
[18:59] please set the direct MP3 URL as source and add tags if available; for example http://televisiontunes.com/Olympics_-_1964_-_Tokyo_Melody.html has Sports and Olympics
[19:00] PurpleSym: ^
[19:00] Oh, I must've missed the tags while extracting metadata. I'll fix this.
[19:00] yeah, looks like some don't have tags
[19:00] Can you create a collection before I start uploading?
[19:00] Fusl: Thanks! I have been doing everything without Docker.
[19:00] PurpleSym: what is your email?
[19:00] from your account
[19:01] Fusl: To prevent cluttering this channel, can I private message you?
[19:01] sure
[19:01] arkiver: lars+archive@6xq.net
[19:02] thanks
[19:20] *** ubahn has joined #archiveteam-bs
[19:46] *** Oddly has joined #archiveteam-bs
[19:47] PurpleSym: https://archive.org/details/tvtunes you have access
[20:06] *** Iridium has quit IRC (no.money.no.love)
[20:09] *** Iridium has joined #archiveteam-bs
[20:37] *** Exairnous has joined #archiveteam-bs
[21:12] *** BlueMax has joined #archiveteam-bs
[21:18] *** schbirid has quit IRC (Remote host closed the connection)
[21:24] *** Sk1d has quit IRC (Read error: Operation timed out)
[21:26] *** Sk1d has joined #archiveteam-bs
[22:18] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:21] *** Sk1d has joined #archiveteam-bs
[22:34] *** gandalf has joined #archiveteam-bs
[23:03] *** MR9K has quit IRC (tilde lounge - https://irc.tilde.team)
[23:03] *** MR9K has joined #archiveteam-bs
[23:06] *** ubahn has quit IRC (Quit: ubahn)
[23:14] *** ndiddy has joined #archiveteam-bs
[23:16] *** tomaspark has quit IRC (Read error: Connection reset by peer)
[23:55] *** Oddly has quit IRC (Ping timeout: 255 seconds)